Polynomial chaos expansion in Persalys

MerlinKeller · October 2, 2020, 3:42pm

Hi,

To create a PCE metamodel in Persalys from a numerical DOE, using a data model, no information regarding the marginal distributions of the inputs is required. This raises the following questions:

1. What input distribution and/or polynomial bases are chosen for the expansion?
2. Is it possible for the user to specify the marginal distributions, and will it change the expansion?
3. Are the Sobol indices derived from the PCE still valid when the input distributions are not specified?

Many thanks in advance!
Merlin

mbaudin47 · October 3, 2020, 6:10pm

Hi Merlin !

This is a long story - I have been working on this topic on and off for at least 4 years! - but I will try to make it as short as possible.

PERSALYS uses BuildDistribution from OpenTURNS. The goal of this function is to create the multivariate distribution from a given Sample.

In OT1.15, this is done here:

https://github.com/openturns/openturns/blob/1.15/lib/src/Uncertainty/Algorithm/MetaModel/FunctionalChaos/FunctionalChaosAlgorithm.cxx#L156


    const AdaptiveStrategy & adaptiveStrategy)
  : MetaModelAlgorithm(distribution, DatabaseFunction(inputSample, outputSample))
  , adaptiveStrategy_(adaptiveStrategy)
  , projectionStrategy_(LeastSquaresStrategy(inputSample, outputSample))
  , maximumResidual_(ResourceMap::GetAsScalar( "FunctionalChaosAlgorithm-DefaultMaximumResidual" ))
{
  // Check sample size
  if (inputSample.getSize() != outputSample.getSize()) throw InvalidArgumentException(HERE) << "Error: the input sample and the output sample must have the same size.";
}

Distribution FunctionalChaosAlgorithm::BuildDistribution(const Sample & inputSample)
{
  // Recover the distribution, taking into account that we look for performance
  // so we avoid to rebuild expensive distributions as much as possible
  const UnsignedInteger inputDimension = inputSample.getDimension();
  // For the dependence structure, we use the Spearman independence test to decide between an independent and a Normal copula.
  Bool isIndependent = true;
  for (UnsignedInteger j = 0; j < inputDimension && isIndependent; ++ j)
  {
    const Sample marginalJ = inputSample.getMarginal(j);
    for (UnsignedInteger i = j + 1; i < inputDimension && isIndependent; ++ i)

There are two steps:

In the first step, we estimate the marginals which best fit the data, using the KS test as the measure of fit. For each marginal, the candidate distribution with the highest p-value is selected.
In the second step, we test for dependence using Spearman’s correlation coefficient. If dependence is detected, a Normal copula is estimated from the sample.
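To illustrate the second step, here is a minimal pure-Python sketch of a Spearman independence check. It is not the OpenTURNS implementation: the function names (`ranks`, `spearman_rho`, `looks_independent`), the 5% level and the normal approximation of the test statistic are assumptions made for this example.

```python
import math
import random


def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = 0.5 * (i + j) + 1.0
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman_rho(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)


def looks_independent(x, y, level=0.05):
    """Two-sided test: under independence, rho * sqrt(n - 1) is roughly N(0, 1)."""
    z = abs(spearman_rho(x, y)) * math.sqrt(len(x) - 1)
    p_value = math.erfc(z / math.sqrt(2.0))
    return p_value > level


random.seed(0)
u = [random.gauss(0.0, 1.0) for _ in range(200)]
w = [a + 0.5 * random.gauss(0.0, 1.0) for a in u]  # built from u, so dependent

# Independence is rejected here, which is the case where a copula would be fitted.
dependent_pair_accepted = looks_independent(u, w)
```

In a multivariate sample, this kind of check would be run on every pair of input columns, and a single rejection is enough to switch from the independent copula to an estimated one.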

I created an example of this at:

Note that the notebook was generated with OT v1.15, which often selects the Histogram distribution because it generally fits the data better. This does not limit performance, because there is a very efficient orthogonal basis in OT for that particular distribution (thanks to Régis Lebrun). It does, however, limit accuracy, because a histogram can only generate values between the minimum and the maximum of the sample.

This is what PERSALYS v8 uses.

This raises a number of issues, which are linked to the problem of performing the KS test when the parameters are estimated from the data. Estimating the p-value in this case is done by Monte-Carlo sampling, which is costly. Because of this, the default sample size of the Monte-Carlo sampling was set to a low value equal to 10 in OT 1.15, so that the unit tests could pass within the time limit of the continuous integration. This limitation is linked to Emmanuel’s bug report at https://github.com/openturns/openturns/issues/1585.

The previous OT technical committee discussion led to an improvement of the BuildDistribution method in the current development master, which will be released in OT v1.16:

This improves many implementation details. It selects the best-fitting marginal distribution according to one criterion chosen from a list (e.g. the K.S. p-value or the BIC). Because the Histogram distribution always fits better, it is excluded from the list of candidate distributions. The K.S. criterion, however, uses the approximation that the parameters are known, which is much faster. The default criterion is the BIC (https://github.com/openturns/openturns/issues/1415), but this can be customized with the “MetaModelAlgorithm-ModelSelectionCriterion” ResourceMap key. The BIC is a good choice, because it penalizes distributions which have more parameters. Emmanuel’s bug report should soon be fixed as an important by-product.
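To show why a BIC-based selection penalizes extra parameters, here is a small pure-Python sketch. It is only an illustration under stated assumptions: the three candidate families, the helper names (`bic`, `fit_normal`, `fit_uniform`, `fit_exponential`, `select_by_bic`) and the textbook formula BIC = k log(n) - 2 log(L) are chosen for this example, and OpenTURNS may normalize the criterion differently.

```python
import math
import random


def bic(log_likelihood, n_params, n):
    """Textbook BIC: lower is better; each parameter costs log(n)."""
    return n_params * math.log(n) - 2.0 * log_likelihood


def fit_normal(data):
    n = len(data)
    mu = sum(data) / n
    var = sum((x - mu) ** 2 for x in data) / n  # maximum-likelihood variance
    log_l = -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)
    return log_l, 2


def fit_uniform(data):
    a, b = min(data), max(data)  # maximum-likelihood bounds
    log_l = -len(data) * math.log(b - a)
    return log_l, 2


def fit_exponential(data):
    if min(data) < 0.0:
        return -math.inf, 1  # data outside the support: zero likelihood
    rate = len(data) / sum(data)  # maximum-likelihood rate
    log_l = len(data) * (math.log(rate) - 1.0)
    return log_l, 1


def select_by_bic(data):
    """Fit every candidate and return the name with the lowest BIC."""
    candidates = {
        "Normal": fit_normal,
        "Uniform": fit_uniform,
        "Exponential": fit_exponential,
    }
    scores = {}
    for name, fit in candidates.items():
        log_l, n_params = fit(data)
        scores[name] = bic(log_l, n_params, len(data))
    return min(scores, key=scores.get), scores


random.seed(1)
sample = [random.expovariate(2.0) for _ in range(500)]
best, scores = select_by_bic(sample)
```

On this exponential sample the exponential family gets the lowest score: it has both the highest likelihood and the fewest parameters.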

I guess that the next release of PERSALYS (v9) may use the updated algorithm from OpenTURNS v1.16, but this is not decided yet, because each project has its own roadmap and planning.

Now to your questions.

1. “What input distribution and/or polynomial bases are chosen for the expansion?”
Any distribution which fits the data may be used, based on the output of BuildDistribution.
The orthogonal basis is computed from the distribution (with the Stieltjes algorithm, if the distribution does not have a built-in orthogonal family).
2. “Is it possible for the user to specify the marginal distributions, and will it change the expansion?”
This is the crucial point. If the information provided is a Sample, then BuildDistribution does its best to find the distribution. It does change the expansion, because it changes the orthogonal basis.
3. “Are the Sobol indices derived from the PCE still valid when the input distributions are not specified?”
The Sobol’ indices are still valid, because the distribution is specified as the output of BuildDistribution. If a dependency is detected, the Sobol’ indices won’t be estimated and an error will be displayed as a red message in the dialog box.
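The Stieltjes procedure mentioned in the first answer can be sketched in a few lines for a discrete measure. This is a simplified illustration, not the OpenTURNS code: the function name `stieltjes` and the midpoint discretization of the uniform distribution on [-1, 1] are assumptions made for the example. It computes the coefficients of the three-term recurrence pi_{k+1}(x) = (x - alpha_k) pi_k(x) - beta_k pi_{k-1}(x) of the monic orthogonal polynomials.

```python
def stieltjes(nodes, weights, degree):
    """Recurrence coefficients (alpha_k, beta_k), k = 0..degree-1, of the monic
    orthogonal polynomials for the measure sum_i weights[i] * delta(nodes[i])."""
    n = len(nodes)
    p_prev = [0.0] * n  # pi_{-1} at the nodes
    p_curr = [1.0] * n  # pi_0 at the nodes
    norm_prev = 1.0
    alphas, betas = [], []
    for k in range(degree):
        norm_curr = sum(w * p * p for w, p in zip(weights, p_curr))
        alpha = sum(w * x * p * p
                    for w, x, p in zip(weights, nodes, p_curr)) / norm_curr
        # By convention beta_0 is the total mass of the measure.
        beta = norm_curr / norm_prev if k > 0 else norm_curr
        alphas.append(alpha)
        betas.append(beta)
        # p_prev is zero at k = 0, so beta_0 does not enter the recurrence.
        p_next = [(x - alpha) * pc - beta * pp
                  for x, pc, pp in zip(nodes, p_curr, p_prev)]
        p_prev, p_curr, norm_prev = p_curr, p_next, norm_curr
    return alphas, betas


# Midpoint discretization of the uniform distribution on [-1, 1].
N = 2000
nodes = [-1.0 + (i + 0.5) * 2.0 / N for i in range(N)]
weights = [1.0 / N] * N
alphas, betas = stieltjes(nodes, weights, 3)
```

For this measure the coefficients approach those of the monic Legendre polynomials (alpha_k = 0, beta_1 = 1/3, beta_2 = 4/15). Changing the nodes and weights changes the coefficients, and hence the basis, which is why the fitted distribution changes the expansion.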

Point 2) is the most important one.

If the user knows the distribution of the input, this information can be used in PERSALYS directly. The DOE can be generated from the distribution, and then be used by PERSALYS to fit the polynomial chaos by regression (this is the default in PERSALYS, and cannot be changed).
If the user only has a sample from the input, PERSALYS will do the best it can to recover the distribution from the sample.

A typical error here would be to use OpenTURNS (or any other tool) to generate a DOE from, say, a Gaussian quadrature rule, and to feed that DOE into PERSALYS to build the polynomial chaos.

Error 1: the result would be much better if the distribution was directly used by PERSALYS.
Error 2: PERSALYS will estimate the chaos by regression, so the Gaussian quadrature structure of the DOE won’t be exploited, which may reduce the accuracy. Indeed, a lot less information will be known in the center of the domain, where it matters.
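To make the regression step concrete, here is a one-dimensional pure-Python sketch of a least-squares fit on a Legendre basis. It is a toy example under stated assumptions, not what PERSALYS runs internally: the helper names (`legendre`, `solve`, `fit_by_regression`), the test function and the random DOE are all made up for the illustration. Only the (x, y) pairs enter the fit; no quadrature weights appear anywhere.

```python
import random


def legendre(x, degree):
    """Legendre polynomials P_0..P_degree at x (Bonnet recurrence)."""
    values = [1.0, x]
    for k in range(1, degree):
        values.append(((2 * k + 1) * x * values[k] - k * values[k - 1]) / (k + 1))
    return values[: degree + 1]


def solve(a, b):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_by_regression(xs, ys, degree):
    """Least-squares coefficients from the normal equations (A^T A) c = A^T y."""
    rows = [legendre(x, degree) for x in xs]
    size = degree + 1
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(size)]
            for i in range(size)]
    rhs = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(size)]
    return solve(gram, rhs)


random.seed(2)
xs = [random.uniform(-1.0, 1.0) for _ in range(50)]  # random DOE on [-1, 1]
ys = [1.0 + 2.0 * x + 3.0 * x * x for x in xs]       # toy model output
coefficients = fit_by_regression(xs, ys, 2)
```

Since 1 + 2x + 3x^2 = 2 P_0(x) + 2 P_1(x) + 2 P_2(x), the three fitted coefficients are all equal to 2 up to rounding error.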

Let me know if this corresponds to your question: do not hesitate to push it further if required, because this is a complicated topic that already has generated lots and lots of discussions!

Best regards,

Michaël

dumas · October 5, 2020, 12:39pm

Hi Merlin,

Thanks for being the first user to post a question. And thanks Michaël for this accurate answer!

Antoine

MerlinKeller · October 5, 2020, 1:05pm

Happy to inaugurate this forum, and thanks Michaël for a very thorough answer indeed! The workings of PCE in Persalys are now very clear… if possible, I would suggest adding some of this info to the online documentation.

Best
Merlin
