This is a long story (I have been working on this topic on and off for at least four years), but I will try to make it as short as possible.
PERSALYS uses BuildDistribution from OpenTURNS. The goal of this function is to create the multivariate distribution from a given Sample.
In OT1.15, this is done here:
There are two steps:
In the first step, we estimate the marginals which best fit the data, using the KS test as a measure of fit. For each marginal, the candidate distribution with the highest p-value is selected.
In the second step, we use Spearman’s correlation coefficient to test whether there is a dependency between the marginals. If such a dependency is detected, a Normal copula is estimated from the sample.
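To make the second step concrete, here is a minimal sketch of Spearman's rank correlation in plain Python. It is not the OpenTURNS implementation: the helper names (`average_ranks`, `pearson`, `spearman`) and the toy data are assumptions chosen for illustration. The coefficient is simply Pearson's correlation computed on the ranks, so it detects any monotone dependency, not only a linear one.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2.0 + 1.0
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Linear correlation coefficient of two equally sized lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(spread_x * spread_y)


def spearman(x, y):
    """Spearman's coefficient: Pearson's correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


x = [0.1, 0.5, 1.2, 2.0, 3.3]
rho_monotone = spearman(x, [math.exp(v) for v in x])  # increasing, nonlinear
rho_reversed = spearman(x, [-(v ** 3) for v in x])    # decreasing, nonlinear
```

Both toy relations are strongly nonlinear, yet the rank-based coefficient still reaches its extreme values, which is why it is a reasonable trigger for fitting a copula.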
I created an example of this at:
Notice that the notebook was generated with OT v1.15, so it often selects the Histogram distribution, because a histogram generally fits the data better than the parametric candidates. This does not limit performance, because there is a very efficient orthogonal basis in OT for that particular distribution (thanks to Régis Lebrun). It does, however, limit accuracy, because a histogram restricts the range of values that can be generated from it to the interval between the minimum and the maximum of the sample.
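The range limitation can be illustrated with a small sketch in plain Python (not OpenTURNS). The functions `fit_histogram` and `draw_from_histogram`, the bin count and the sample sizes are all assumptions made for the example. A histogram is fitted to a Gaussian sample, and new values are then generated from the histogram.

```python
import random


def fit_histogram(sample, bin_count):
    """Equal-width histogram spanning [min(sample), max(sample)]."""
    low, high = min(sample), max(sample)
    width = (high - low) / bin_count
    counts = [0] * bin_count
    for value in sample:
        # The maximum of the sample falls into the last bin.
        index = min(int((value - low) / width), bin_count - 1)
        counts[index] += 1
    return low, width, counts


def draw_from_histogram(histogram, size, rng):
    """Pick a bin with probability proportional to its count, then a uniform point in it."""
    low, width, counts = histogram
    bins = rng.choices(range(len(counts)), weights=counts, k=size)
    return [low + (b + rng.random()) * width for b in bins]


rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(200)]
histogram = fit_histogram(data, bin_count=10)
generated = draw_from_histogram(histogram, 10000, rng)

data_range = (min(data), max(data))
generated_range = (min(generated), max(generated))
```

However many points are generated, `generated_range` stays inside `data_range`, whereas the underlying Gaussian distribution has unbounded tails.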
This is what PERSALYS v8 uses.
This raises a number of issues, which are linked to the problem of performing the KS test when the parameters are estimated from the data. Estimating the p-value in this case is done by Monte-Carlo sampling, which is costly. Because of this, the default sample size of the Monte-Carlo sampling was set to a low value equal to 10 in OT 1.15, so that the unit tests could pass within the time limit of the continuous integration. This limitation is linked to Emmanuel’s bug report at https://github.com/openturns/openturns/issues/1585.
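The cost issue can be seen in the following sketch of a Monte-Carlo p-value for the KS test with estimated parameters. It is written in plain Python and restricted to the Normal case. The function names, the sample sizes and the replication counts are assumptions for illustration, and this is not the code used in OT. The key point is that each replication draws a new sample and re-estimates the parameters before computing the statistic.

```python
import random
from statistics import NormalDist, fmean, stdev


def ks_distance_to_fitted_normal(sample):
    """KS distance between a sample and the Normal fitted to that same sample."""
    fitted = NormalDist(fmean(sample), stdev(sample))
    ordered = sorted(sample)
    n = len(ordered)
    distance = 0.0
    for i, value in enumerate(ordered):
        cdf = fitted.cdf(value)
        # Compare the fitted CDF with both sides of the empirical CDF jump.
        distance = max(distance, (i + 1) / n - cdf, cdf - i / n)
    return distance, fitted


def monte_carlo_p_value(sample, replications, rng):
    """Fraction of simulated statistics at least as large as the observed one."""
    observed, fitted = ks_distance_to_fitted_normal(sample)
    n = len(sample)
    exceed = 0
    for _ in range(replications):
        simulated = [rng.gauss(fitted.mean, fitted.stdev) for _ in range(n)]
        # Parameters are re-estimated on every simulated sample: this is the cost.
        statistic, _ = ks_distance_to_fitted_normal(simulated)
        if statistic >= observed:
            exceed += 1
    return exceed / replications


rng = random.Random(1)
skewed = [rng.expovariate(1.0) for _ in range(200)]  # clearly not Normal
p_coarse = monte_carlo_p_value(skewed, 10, rng)
p_finer = monte_carlo_p_value(skewed, 200, rng)
```

With only 10 replications the estimated p-value can only be a multiple of 0.1, so two candidate distributions with different qualities of fit can easily receive the same score.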
The previous OT technical committee discussion led to an improvement of the BuildDistribution method in the current development master, which will be released in OT v1.16:
This improves many implementation details. It uses one criterion, chosen from a list of available criteria (e.g. K.S. p-value or BIC), to select the marginal distribution which fits best. Because the Histogram distribution always fits better, it is excluded from the list of candidate distributions. The K.S. criterion now uses the approximation that the parameters are known, which is much faster than the Monte-Carlo estimate. The default criterion is the BIC (https://github.com/openturns/openturns/issues/1415), but this can be customized with the “MetaModelAlgorithm-ModelSelectionCriterion” ResourceMap key. The BIC is a good choice, because it penalizes the distributions which have more parameters. Emmanuel’s bug report should soon be fixed as an important by-product.
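As an illustration of BIC-based selection, here is a minimal sketch in plain Python using the textbook definition BIC = k log(n) - 2 log(L), where lower is better. OpenTURNS may normalize the criterion differently. The two candidate families, the helper names and the sample are assumptions chosen to keep the example short.

```python
import math
import random


def bic(log_likelihood, parameter_count, sample_size):
    """Textbook BIC: fit term plus a penalty growing with the number of parameters."""
    return parameter_count * math.log(sample_size) - 2.0 * log_likelihood


def normal_log_likelihood(sample):
    """Log-likelihood of the Normal at its maximum likelihood estimate."""
    n = len(sample)
    mean = sum(sample) / n
    variance = sum((v - mean) ** 2 for v in sample) / n
    return -0.5 * n * (math.log(2.0 * math.pi * variance) + 1.0)


def exponential_log_likelihood(sample):
    """Log-likelihood of the Exponential at its maximum likelihood estimate."""
    n = len(sample)
    rate = n / sum(sample)
    return n * (math.log(rate) - 1.0)


rng = random.Random(0)
sample = [rng.expovariate(1.0) for _ in range(500)]
scores = {
    "Normal": bic(normal_log_likelihood(sample), 2, len(sample)),
    "Exponential": bic(exponential_log_likelihood(sample), 1, len(sample)),
}
best = min(scores, key=scores.get)
```

No p-value is involved, so there is no Monte-Carlo loop, and a family with more parameters has to earn its extra flexibility through a higher likelihood.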
I guess that the next release of PERSALYS (v9) may use the updated algorithm from OpenTURNS v1.16, but this is not decided yet, because each project has its own roadmap and planning.
Now to your questions.
“What input distribution and/or polynomial bases are chosen for the expansion?”
Any distribution which fits the data may be used, depending on the output of BuildDistribution.
The orthogonal basis is computed from the distribution (with the Stieltjes algorithm, if the distribution does not have a built-in orthogonal family).
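For readers unfamiliar with the Stieltjes procedure, here is a minimal sketch in plain Python for a discrete measure. It computes the three-term recurrence coefficients of the monic orthogonal polynomials, p_{k+1}(x) = (x - alpha_k) p_k(x) - beta_k p_{k-1}(x). The function name, the midpoint discretization of the uniform distribution on [-1, 1] and the convention beta_0 = total mass are assumptions of this example. The OpenTURNS implementation is more elaborate.

```python
def stieltjes(nodes, weights, degree):
    """Recurrence coefficients (alpha_k, beta_k), k = 0..degree-1, of the monic
    orthogonal polynomials of the discrete measure sum_i weights[i] * delta(nodes[i])."""
    previous = [0.0] * len(nodes)  # p_{-1}, identically zero
    current = [1.0] * len(nodes)   # p_0, identically one
    norm_previous = 1.0
    alphas, betas = [], []
    for k in range(degree):
        norm_current = sum(w * p * p for w, p in zip(weights, current))
        alpha = sum(w * x * p * p for w, x, p in zip(weights, nodes, current)) / norm_current
        # Convention: beta_0 is the total mass of the measure.
        beta = norm_current if k == 0 else norm_current / norm_previous
        alphas.append(alpha)
        betas.append(beta)
        following = [(x - alpha) * p - beta * q
                     for x, p, q in zip(nodes, current, previous)]
        previous, current = current, following
        norm_previous = norm_current
    return alphas, betas


# Midpoint discretization of the uniform distribution on [-1, 1].
point_count = 2000
nodes = [-1.0 + (2 * i + 1) / point_count for i in range(point_count)]
weights = [1.0 / point_count] * point_count
alphas, betas = stieltjes(nodes, weights, degree=3)
```

For this symmetric measure the `alphas` vanish and the `betas` approach the monic Legendre values 1/3 and 4/15, which gives a simple sanity check of the recurrence.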
“Is it possible for the user to specify the marginal distributions, and will it change the expansion ?”
This is the crucial point. If the information provided is a Sample, then BuildDistribution does its best to find the distribution. It does change the expansion, because it changes the orthogonal basis.
“Are the Sobol indices derived from the PCE still valid when the input distributions are not specified?”
The Sobol’ indices are still valid, because the distribution is specified as the output of the BuildDistribution. If a dependency is detected, the Sobol’ indices won’t be estimated and an error will be generated as a red message in the dialog box.
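To show why independence matters here, the sketch below computes Sobol' indices directly from the coefficients of an expansion on an orthonormal basis, in plain Python. The function name and the toy coefficients are assumptions for illustration. The formulas rely on the basis being a tensor product of univariate orthonormal polynomials, which requires independent inputs.

```python
def sobol_indices_from_pce(coefficients, dimension):
    """First-order and total Sobol' indices from orthonormal PCE coefficients.

    `coefficients` maps a multi-index (tuple of degrees, one per input) to the
    coefficient of the corresponding basis function.
    """
    # The variance is the sum of squared coefficients, constant term excluded.
    variance = sum(c * c for index, c in coefficients.items() if any(index))
    first_order, total = [], []
    for i in range(dimension):
        only_i = sum(
            c * c for index, c in coefficients.items()
            if index[i] > 0 and all(d == 0 for j, d in enumerate(index) if j != i)
        )
        involving_i = sum(c * c for index, c in coefficients.items() if index[i] > 0)
        first_order.append(only_i / variance)
        total.append(involving_i / variance)
    return first_order, total


# Toy expansion with two inputs: constant, two main effects, one interaction.
toy_coefficients = {(0, 0): 1.0, (1, 0): 2.0, (0, 1): 1.0, (1, 1): 1.0}
first_order, total = sobol_indices_from_pce(toy_coefficients, dimension=2)
```

The gap between a total index and the corresponding first-order index is the share of variance due to interactions involving that input.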
Point 2) (whether the user can specify the marginal distributions) is the most important one.
If the user knows the distribution of the input, this information can be used in PERSALYS directly. The DOE can be generated from the distribution, and then be used by PERSALYS to fit the polynomial chaos by regression (this is the default in PERSALYS, and cannot be changed).
If the user only has a sample from the input, PERSALYS will do the best it can to recover the distribution from the sample.
A typical error here would be to use OpenTURNS (or any other tool) to generate a sample from, say, a Gaussian quadrature rule, and to feed that DOE into PERSALYS to build the polynomial chaos.
Error 1: the result would be much better if the distribution was directly used by PERSALYS.
Error 2: PERSALYS will estimate the chaos by regression, so the quadrature rule behind the DOE won’t be used, which may reduce the accuracy. Indeed, a lot less information will be known in the center of the domain, where it matters.
Let me know if this answers your question. Do not hesitate to push it further if required, because this is a complicated topic that has already generated lots and lots of discussions!
Happy to inaugurate this forum, and thanks Michaël for a very thorough answer indeed! The workings of PCE in Persalys are now very clear… if possible, I would suggest adding some of this info to the online documentation.