Hi all,
I am a (quite) new Persalys user (v12.0.1) and so far I mainly use it as a "data visualization and inference tool": I import .csv data files and mostly build data models. First of all, I would like to thank all the members of the development team for such a great job! The tool is pretty awesome!
The .csv files I upload are typically around 2,000 rows by 30 columns (some have more than 100 columns). My main goals are analyzing the data, building metamodels, and estimating sensitivity indices.
Thus, the main goal of this post is to ask a few questions, make some remarks, and submit some suggestions to enhance the software (if they are relevant enough). Some are minor questions; others go deeper into the analysis.
- In the data analysis part:
- Why is the "PDF/CDF" button not split into two buttons (one "PDF" and one "CDF")? Having to switch between them in the bottom-left panel is a little bit awkward;
- Moreover, why is the dimension "d" of the problem (meaning, the number of inputs) not explicitly displayed?
- Would it be possible to add a feature that automatically detects that two columns are identical? When the number of columns is large, it is almost impossible to spot this visually. However, it could also be detected as an output of the analysis, as soon as the Spearman rho equals 1. Such a "perfect" correlation clearly indicates multicollinearity, so a warning could tell the user that some columns look like "spurious redundant columns".
- Would it be possible to imagine "Principal Component Analysis" (PCA) as a new feature for this part?
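To make the duplicate-column suggestion above more concrete, here is a rough Python sketch of what I have in mind (the function name and the 0.999 threshold are mine, purely for illustration — the idea is just to scan the Spearman correlation matrix for near-perfect pairs):

```python
import numpy as np
import pandas as pd

def find_redundant_columns(df, threshold=0.999):
    """Return pairs of columns whose |Spearman rho| reaches the threshold,
    i.e. candidates for "spurious redundant columns"."""
    rho = df.corr(method="spearman").to_numpy()
    cols = df.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if abs(rho[i, j]) >= threshold:
                pairs.append((cols[i], cols[j]))
    return pairs

# Toy example: X2 duplicates X1, X3 is independent noise.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
df = pd.DataFrame({"X1": x1, "X2": x1.copy(), "X3": rng.normal(size=100)})
print(find_redundant_columns(df))  # [('X1', 'X2')]
```

A threshold slightly below 1 makes the check robust to floating-point noise while still catching exact duplicates.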
- In the marginals’ inference part:
- Would it be possible to offer another test, e.g., Shapiro-Wilk, to test normality better than Kolmogorov-Smirnov?
- What happens to an input if none of the candidate distributions is accepted according to the KS statistic, especially if one then builds a metamodel (e.g., a PCE)?
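To illustrate the Shapiro-Wilk suggestion, this is the kind of test I mean (a toy example using scipy, not Persalys code — the sample sizes and seed are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
gaussian = rng.normal(loc=0.0, scale=1.0, size=200)
uniform = rng.uniform(-1.0, 1.0, size=200)

# Shapiro-Wilk: the null hypothesis is that the sample is Gaussian.
w_g, p_g = stats.shapiro(gaussian)  # p typically large: normality not rejected
w_u, p_u = stats.shapiro(uniform)   # p typically tiny: normality rejected
print(f"gaussian sample: W = {w_g:.3f}, p = {p_g:.3g}")
print(f"uniform  sample: W = {w_u:.3f}, p = {p_u:.3g}")
```

The appeal, as I understand it, is that Shapiro-Wilk is specifically designed for normality, whereas KS is a general-purpose goodness-of-fit test.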
- In the metamodel part:
- Why is it not possible, at this stage, to make another selection of the inputs X and the sample size n? It would be useful to select only a subset of the inputs without necessarily building another data model.
- When one uses PCE for the first time, it is a little bit confusing to be asked for a polynomial degree without any recommendation (the documentation only says "default: 2"). Do you have a rule of thumb for choosing the degree depending on the dimensionality and the volume of data? This question seems crucial to me, as a large dimension (>10) makes the metamodel-building phase very time-consuming (and the app can freeze if you try to stop the build by clicking the "Stop" button).
- Would it be possible to imagine, as a future feature, a standard linear regression algorithm (together with its importance measures, i.e., the SRC^2) as a first, primary metamodel? Such a model could be used first and foremost, before deploying more advanced tools (such as PCE and GP regression).
- When one observes the Sobol' indices obtained through the PCE metamodel, would it be possible to plot a grid on the graph in order to read the values more easily (especially when the number of inputs is large)?
- What does the "Interactions" number represent? This value is somewhat confusing.
- Last question: why do you provide only a single value for the "Residuals" rather than a full histogram? And why do you report the relative error (1-R^2) as a metric, but not R^2 itself, which most people find easier to read since they are used to it (much like the Q^2 predictivity coefficient)?
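Regarding my question on the PCE degree, the only rule of thumb I know is combinatorial: a total-degree-p basis in dimension d has C(d+p, p) terms, and for a least-squares fit one usually wants at least about 2-3 times that many samples (the oversampling factor is a common heuristic, not an official recommendation). A quick sketch with the sizes of my own use case:

```python
from math import comb

def pce_basis_size(d, p):
    """Number of terms in a total-degree-p PCE basis in dimension d: C(d+p, p)."""
    return comb(d + p, p)

# Candidate degrees for d = 30 inputs and n = 2000 samples,
# keeping roughly n >= 2 * (basis size) for least-squares fitting.
d, n = 30, 2000
for p in range(1, 5):
    P = pce_basis_size(d, p)
    print(f"degree {p}: {P} terms, feasible with n={n}: {n >= 2 * P}")
```

With d = 30, degree 2 already gives 496 terms, and degree 3 jumps to 5456 — which matches my experience that the build becomes very slow beyond degree 2 in large dimension.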
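And to be concrete about the SRC^2 suggestion, this is roughly the computation I have in mind (a plain OLS sketch, not Persalys code; the function name is mine):

```python
import numpy as np

def src_squared(X, y):
    """Squared standardized regression coefficients from an OLS fit:
    SRC_i^2 = (beta_i * std(X_i) / std(y))^2.
    For a (near-)linear model with independent inputs, they sum to about R^2."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), X])        # intercept + inputs
    beta = np.linalg.lstsq(A, y, rcond=None)[0][1:]  # drop the intercept
    return (beta * X.std(axis=0) / y.std()) ** 2

# Toy linear model: y = 2*X1 + 1*X2 + small noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + 0.1 * rng.normal(size=500)
print(src_squared(X, y))  # roughly [0.8, 0.2]
```

Since the SRC^2 are so cheap to compute, they would make a nice "first look" at input importance before any PCE or GP is fitted.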
Again, none of these questions/remarks/suggestions are criticisms, and they should not be misinterpreted as such. Thanks a lot for the tremendous work achieved.
All the best,
Vincent
PS: if you need more information about the applications I am working on, I can easily provide more insights and details. Please don't hesitate to ask.