To create a PCE metamodel in Persalys from a numerical DOE, using a data model, no information regarding the marginal distributions of the inputs is required. This raises the following questions:
What input distribution and/or polynomial bases are chosen for the expansion?
Is it possible for the user to specify the marginal distributions, and will it change the expansion?
Are the Sobol indices derived from the PCE still valid when the input distributions are not specified?
This is a long story (I have been working on this topic on and off for at least 4 years!) but I will try to make it as short as possible.
PERSALYS uses BuildDistribution from OpenTURNS. The goal of this function is to create the multivariate distribution from a given Sample.
In OT1.15, this is done here:
There are two steps:
In the first step, we estimate the marginals which best fit the data, using the KS test as a measure of fit. The candidate distribution with the highest p-value is assigned to the marginal.
In the second step, we check for dependence between the inputs using Spearman’s correlation coefficient. If a dependency is detected, a Normal copula is estimated on the sample.
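To make the two steps concrete, here is a minimal pure-Python sketch. It does not call OpenTURNS and is not the BuildDistribution implementation: the helper names (`ks_distance`, `spearman`, etc.) are made up for illustration, only two candidate distributions are tried, and the KS distance is used as a stand-in for the p-value (for a fixed sample size and known parameters, a smaller distance means a larger p-value).

```python
import math
import random


def ks_distance(sample, cdf):
    """Kolmogorov-Smirnov distance between the empirical CDF and a candidate CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def normal_cdf(mu, sigma):
    return lambda x: 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def uniform_cdf(a, b):
    return lambda x: min(1.0, max(0.0, (x - a) / (b - a)))


def spearman(x, y):
    """Spearman rank correlation, assuming no ties (continuous data)."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, i in enumerate(order):
            r[i] = rank
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


random.seed(0)
n = 500
x1 = [random.gauss(0.0, 1.0) for _ in range(n)]
x2 = [random.gauss(0.0, 1.0) for _ in range(n)]        # independent of x1
x3 = [a + 0.1 * random.gauss(0.0, 1.0) for a in x1]    # strongly dependent on x1

# Step 1: fit each candidate on x1 and compare the KS distances.
mu = sum(x1) / n
sigma = math.sqrt(sum((v - mu) ** 2 for v in x1) / (n - 1))
d_normal = ks_distance(x1, normal_cdf(mu, sigma))
d_uniform = ks_distance(x1, uniform_cdf(min(x1), max(x1)))

# Step 2: look for dependence with Spearman's rank correlation.
rho_indep = spearman(x1, x2)
rho_dep = spearman(x1, x3)
```

On this synthetic sample the Normal candidate is closer to the data than the Uniform one, and only the pair (x1, x3) shows a rank correlation far from zero, which is the situation where a copula would be estimated.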
This raises a number of issues, which are linked to the problem of performing the KS test when the parameters are estimated from the data. In this case the p-value is estimated by Monte-Carlo sampling, which is costly. Because of this, the default sample size of the Monte-Carlo sampling was set to a low value, equal to 10, in OT 1.15, so that the unit tests could pass within the time limit of the continuous integration. This limitation is linked to Emmanuel’s bug report at https://github.com/openturns/openturns/issues/1585.
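The cost mentioned above can be seen in a small sketch of a Monte-Carlo p-value for the KS test with estimated parameters. This is a generic illustration written with the standard library only, restricted to a Normal candidate; the function names are hypothetical and the estimator (fraction of replicates exceeding the observed distance) is one common choice, not necessarily the one used in OpenTURNS.

```python
import math
import random


def fit_normal(sample):
    n = len(sample)
    mu = sum(sample) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in sample) / (n - 1))
    return mu, sigma


def ks_distance_normal(sample, mu, sigma):
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def mc_ks_pvalue(sample, n_replicates, rng):
    """Monte-Carlo p-value: every replicate is re-sampled AND re-fitted."""
    n = len(sample)
    mu, sigma = fit_normal(sample)
    d_obs = ks_distance_normal(sample, mu, sigma)
    exceed = 0
    for _ in range(n_replicates):
        sim = [rng.gauss(mu, sigma) for _ in range(n)]
        m, s = fit_normal(sim)  # parameters are re-estimated on each replicate
        if ks_distance_normal(sim, m, s) >= d_obs:
            exceed += 1
    return exceed / n_replicates


rng = random.Random(1)
data = [rng.gauss(3.0, 2.0) for _ in range(100)]
p_coarse = mc_ks_pvalue(data, 10, rng)    # only multiples of 0.1 are possible
p_fine = mc_ks_pvalue(data, 1000, rng)    # 100 times more fits and KS evaluations
```

With 10 replicates the estimated p-value can only take the values 0, 0.1, ..., 1, so ranking several candidate distributions on it is very coarse, while a finer estimate multiplies the number of fits accordingly.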
The previous OT technical committee discussion led to an improvement of the BuildDistribution method in the current development master, which will be released in OT v1.16:
This improves many implementation details. It uses one criterion from a list of available criteria (e.g. K.S. p-value or BIC) to select the marginal distribution which best fits the data. Because the Histogram distribution always fits better, it is excluded from the list of candidate distributions. The K.S. criterion, however, uses the approximation that the parameters are known, which is much faster. The default criterion is the BIC (https://github.com/openturns/openturns/issues/1415), but this can be customized with the “MetaModelAlgorithm-ModelSelectionCriterion” ResourceMap key. The BIC is a good choice, because it penalizes the distributions which have more parameters. Emmanuel’s bug report should soon be fixed as an important by-product.
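As an illustration of BIC-based selection, here is a stand-alone sketch using only the standard library. It uses the textbook definition BIC = -2 log L + k log n (OpenTURNS may normalize it differently, which would not change the ranking), three hand-written candidates fitted by maximum likelihood, and helper names that are assumptions of this example.

```python
import math
import random


def bic(log_likelihood, n_params, n):
    """Textbook BIC: lower is better, extra parameters are penalized."""
    return -2.0 * log_likelihood + n_params * math.log(n)


def loglik_normal(sample):
    n = len(sample)
    mu = sum(sample) / n
    var = sum((v - mu) ** 2 for v in sample) / n  # maximum-likelihood variance
    return -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)


def loglik_exponential(sample):
    n = len(sample)
    rate = n / sum(sample)  # maximum-likelihood rate, needs positive data
    return n * math.log(rate) - rate * sum(sample)


def loglik_uniform(sample):
    return -len(sample) * math.log(max(sample) - min(sample))


random.seed(2)
data = [random.expovariate(1.0) for _ in range(500)]
n = len(data)

scores = {
    "Normal": bic(loglik_normal(data), 2, n),
    "Exponential": bic(loglik_exponential(data), 1, n),
    "Uniform": bic(loglik_uniform(data), 2, n),
}
best = min(scores, key=scores.get)
```

No Monte-Carlo loop is needed here: each candidate is fitted once and scored once, which is why this kind of criterion is cheap compared to a KS p-value with estimated parameters.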
I guess that the next release of PERSALYS (v9) may use the updated algorithm from OpenTURNS v1.16, but this is not decided yet, because each project has its own roadmap and planning.
Now to your questions.
“What input distribution and/or polynomial bases are chosen for the expansion?”
The distribution selected by BuildDistribution as the best fit to the data is used.
The orthogonal basis is computed from this distribution (with the Stieltjes algorithm, if the distribution does not have a built-in orthogonal family).
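For readers unfamiliar with the Stieltjes procedure, here is a minimal sketch of its discretized form in pure Python. It computes the three-term recurrence coefficients of the monic orthogonal polynomials, pi_{k+1}(x) = (x - alpha_k) pi_k(x) - beta_k pi_{k-1}(x), for a measure given by nodes and weights. The function name and the midpoint discretization of the uniform distribution on [-1, 1] are choices made for this example, not the OpenTURNS implementation.

```python
def stieltjes(nodes, weights, degree):
    """Recurrence coefficients (alpha, beta) for the discrete measure
    sum_j weights[j] * delta(nodes[j]); beta[0] is the total mass."""
    n = len(nodes)
    p_prev = [0.0] * n  # pi_{-1}
    p_curr = [1.0] * n  # pi_0
    norm_prev = 1.0
    alpha, beta = [], []
    for k in range(degree):
        norm_curr = sum(w * p * p for w, p in zip(weights, p_curr))
        a_k = sum(w * x * p * p for w, x, p in zip(weights, nodes, p_curr)) / norm_curr
        b_k = norm_curr if k == 0 else norm_curr / norm_prev
        alpha.append(a_k)
        beta.append(b_k)
        p_next = [(x - a_k) * pc - b_k * pp
                  for x, pc, pp in zip(nodes, p_curr, p_prev)]
        p_prev, p_curr, norm_prev = p_curr, p_next, norm_curr
    return alpha, beta


# Uniform distribution on [-1, 1], discretized with the midpoint rule.
N = 2000
h = 2.0 / N
nodes = [-1.0 + (j + 0.5) * h for j in range(N)]
weights = [1.0 / N] * N

alpha, beta = stieltjes(nodes, weights, 3)
```

For this measure the result should approach the Legendre values alpha_k = 0, beta_1 = 1/3 and beta_2 = 4/15, which is a convenient sanity check; a different input distribution gives different coefficients and therefore a different basis.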
“Is it possible for the user to specify the marginal distributions, and will it change the expansion?”
This is the crucial point. If the information provided is a Sample, then BuildDistribution does its best to find the distribution. It does change the expansion, because it changes the orthogonal basis.
“Are the Sobol indices derived from the PCE still valid when the input distributions are not specified?”
The Sobol’ indices are still valid, because the distribution is specified as the output of the BuildDistribution. If a dependency is detected, the Sobol’ indices won’t be estimated and an error will be generated as a red message in the dialog box.
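As a reminder of why independence matters, here is a small sketch of how Sobol’ indices are read off a PCE when the basis is orthonormal and the inputs are independent: the variance is the sum of the squared coefficients of the non-constant terms, and each index is a partial sum of them. The coefficients and the function name below are invented for the example.

```python
def sobol_from_pce(coeffs, dim):
    """First-order and total Sobol' indices from PCE coefficients.

    coeffs maps a multi-index (tuple of degrees, one per input) to the
    coefficient on an orthonormal basis; inputs are assumed independent.
    """
    variance = sum(c * c for alpha, c in coeffs.items() if any(alpha))
    first, total = [], []
    for i in range(dim):
        # Terms that involve input i only.
        s_first = sum(
            c * c for alpha, c in coeffs.items()
            if alpha[i] > 0 and all(a == 0 for j, a in enumerate(alpha) if j != i)
        )
        # Terms that involve input i, alone or in interaction.
        s_total = sum(c * c for alpha, c in coeffs.items() if alpha[i] > 0)
        first.append(s_first / variance)
        total.append(s_total / variance)
    return first, total


# Hypothetical 2-input expansion: constant, two linear terms,
# one interaction term and one quadratic term in the first input.
coeffs = {(0, 0): 1.0, (1, 0): 2.0, (0, 1): 1.0, (1, 1): 0.5, (2, 0): 0.5}
first, total = sobol_from_pce(coeffs, 2)
```

This decomposition relies on the product structure of the basis, which is only available for independent inputs, consistent with the indices not being computed when a dependency is detected.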
Point 2 is the most important one.
If the user knows the distribution of the input, this information can be used in PERSALYS directly. The DOE can be generated from the distribution, and then be used by PERSALYS to fit the polynomial chaos by regression (this is the default in PERSALYS, and cannot be changed).
If the user only has a sample from the input, PERSALYS will do the best it can to recover the distribution from the sample.
A typical error here would be to use OpenTURNS (or any other tool) to generate a sample from, say, a Gaussian quadrature rule, and to feed that DOE into PERSALYS to build the polynomial chaos.
Error 1: the result would be much better if the distribution was directly used by PERSALYS.
Error 2: PERSALYS will estimate the chaos by regression, so the quadrature structure of the DOE won’t be exploited, which may reduce the accuracy. Indeed, a lot less information will be known in the center of the domain, where it matters.
Let me know if this answers your question, and do not hesitate to push it further if required, because this is a complicated topic that has already generated lots and lots of discussions!
Happy to inaugurate this forum, and thanks Michaël for a very thorough answer indeed! The workings of PCE in Persalys are now very clear… if possible, I would suggest adding some of this info to the online documentation.