EnginSoft





modeFRONTIER as a statistical tool

It is well known that statistical software can improve processes and can drive and speed up the development of new products. The aim of this article is to show that modeFRONTIER can be regarded as a complete and comprehensive data analysis package, able to perform a statistical analysis of a database.
The Designs Space in modeFRONTIER version 4 can be regarded as a stand-alone environment, where the user can perform extensive and complete statistical analyses of data from different contexts. The statistical environment in modeFRONTIER 4 is a compelling tool for the decision-making process.

In its new version, modeFRONTIER offers a brand new environment which provides a wide variety of tools, reports and charts that can be used to explore data and perform complex statistical or engineering analyses.

For example, the user can easily:

  • Visualize data using several charts such as the scatter plot or bubble charts. The bubble chart is a variation of the scatter chart, where the data points are replaced by bubbles. The bubbles provide a way for displaying a third variable in the two dimensional chart. The diameter of each bubble is proportional to the value it represents: the larger the bubble, the greater the value.
  • Monitor trends and time series by means of history plots with their moving average and Bollinger bounds;
  • Visualize data distributions by using histograms, probability and cumulative distribution plots;
  • Find out important linear relations between variables using tools that summarize all these effects in a single chart, such as the correlation chart and the scatter matrix;
  • Find series outliers by means of useful charts such as the Box-Whiskers or the Quantile-Quantile (Q-Q) plot;
  • Verify whether or not a series of data corresponds to a given distribution using the distribution fitting, the histogram plot and the Q-Q plot;
  • Check the effects of the parameters on the outcomes using useful statistical tools such as the DOE main effect or the interaction effects;
  • Perform several statistical tests (e.g. t-Student analysis, ANOVA).
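
As an illustration of the moving average and Bollinger bounds mentioned in the list above, the following is a minimal sketch in plain Python. It assumes the conventional definition (a trailing moving average plus and minus k rolling standard deviations); the window length, the factor k and the sample series are illustrative assumptions, not the settings used by modeFRONTIER.

```python
from statistics import mean, pstdev


def bollinger(series, window=3, k=2.0):
    """Return (moving_average, lower, upper) lists for a trailing window."""
    avg, lower, upper = [], [], []
    for i in range(window - 1, len(series)):
        chunk = series[i - window + 1 : i + 1]
        m = mean(chunk)
        s = pstdev(chunk)  # population standard deviation of the window (assumption)
        avg.append(m)
        lower.append(m - k * s)
        upper.append(m + k * s)
    return avg, lower, upper


values = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up series, for demonstration only
avg, lower, upper = bollinger(values)
```

Each output list is shorter than the input by `window - 1` entries, because the first complete window ends at index `window - 1`.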

In the post-processing panel, a complete new environment named Multivariate Analysis (MVA) includes tools that provide users with the possibility to:

  • Organize designs into groups according to a given rule and look for clusters of data (hierarchical and partitive clustering);
  • Build Self Organizing Maps (SOMs) in order to have an easy-to-read bi-dimensional representation of complex multi-dimensional data.

Moreover, as in the previous version, the user can also take advantage of the Response Surface Methodology (RSM), which allows the construction of meta-models of the data and, where needed, the execution of virtual optimizations. This capability has been significantly improved in the upcoming release as the result of painstaking research.

Figure 1: One step of the Data Wizard, the tool for importing data into modeFRONTIER.

It is very difficult, nearly impossible, to effectively analyze and summarize a huge amount of data without the help of a good statistical analysis tool. The following example demonstrates that modeFRONTIER can be regarded as a complete tool for the statistical analysis of complex multidimensional data.

Example – Evaluate the risks of LDL cholesterol

In order to show how the statistical instruments included in modeFRONTIER can be used to extract the most relevant information from a database, we present the following simple example.

Let us suppose that we have collected the sex, the age, the height, the weight, the number of cigarettes smoked per day and the systolic blood pressure of a certain group of people and, finally, their level of LDL cholesterol. These data could be the result of a medical investigation which, obviously, could also consider many other aspects of patient health, including for example the consumption of alcohol, the triglycerides level and so on. All the selected quantities aim at monitoring some of the main risk factors of cardiovascular diseases. The data contained in the database have been generated artificially and for demonstration purposes only, taking into account the information contained in [1] and [2].

All these data can be easily collected in a table whose size can obviously be very large, depending on the number of monitored people.

Figure 2: This histogram chart shows that the population weight has a distribution with two distinct peaks. This is due to the overlapping of two distinct Gaussian distributions for men and women.
Figure 3: modeFRONTIER chart showing the ECDF of the population BMI.

 

Load data in modeFRONTIER and manage work tables

We can suppose that this kind of data is usually well organized in a file, where the columns collect the age (in [years]), the sex (0 for men, 1 for women), the height (in [cm]), the weight (in [kg]), the blood pressure (in [mmHg]), the LDL cholesterol level (in [mg/dl]) and finally the number of cigarettes smoked per day. By means of the Data Import Wizard [Figure 1], it is very easy to load the data into modeFRONTIER. During the import phase, the user can remove rows and columns containing useless data, specify the role of each column, insert objectives and constraints, if any, and set up the display format for numbers. In this example, the variable “cholesterol” is set as output while all the others are set as inputs. Moreover, thanks to the work table capabilities, it is also possible to insert additional columns containing derived data. In this example, we can introduce the ratio between the weight (in [kg]) and the square of the height (in [m]). This ratio is commonly called the body mass index (BMI) and is often used to determine whether or not a person is of normal weight.
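
The derived BMI column can be illustrated outside modeFRONTIER with a short Python sketch. Since the height is collected in centimeters, it has to be converted to meters first; the function name, the dictionary layout and the sample rows are hypothetical and serve only to show the arithmetic.

```python
def body_mass_index(weight_kg, height_cm):
    """BMI = weight [kg] / (height [m])^2, with the height supplied in cm."""
    height_m = height_cm / 100.0
    return weight_kg / height_m ** 2


# Made-up rows standing in for two lines of the work table.
rows = [
    {"weight": 70.0, "height": 175.0},
    {"weight": 95.0, "height": 170.0},
]
for row in rows:
    row["bmi"] = body_mass_index(row["weight"], row["height"])
```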

With the Find tool, it is easy to select designs which satisfy certain conditions (e.g. age less than a given value and weight greater than a specified limit) and thus to subdivide designs into categories or create new tables of data. For example, the BMI values usually considered normal lie between 18.5 and 25.0, hence one can easily determine whether a patient’s BMI exceeds or falls below this range.
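
The kind of selection performed with the Find tool can be sketched in plain Python as follows. This is only an analogy, not the modeFRONTIER interface: the field names, the age and weight limits and the sample designs are assumptions, while the 18.5 to 25.0 range is the one quoted above.

```python
def bmi_status(bmi, low=18.5, high=25.0):
    """Classify a BMI value against the range usually considered normal."""
    if bmi < low:
        return "below range"
    if bmi > high:
        return "above range"
    return "normal"


# Made-up designs, for demonstration only.
designs = [
    {"id": 1, "age": 34, "weight": 52.0, "bmi": 17.9},
    {"id": 2, "age": 51, "weight": 70.0, "bmi": 22.9},
    {"id": 3, "age": 38, "weight": 95.0, "bmi": 32.9},
]

# Condition of the form "age less than a value and weight greater than a limit".
selected = [d["id"] for d in designs if d["age"] < 40 and d["weight"] > 90.0]

# Subdivide the designs into categories according to their BMI.
by_status = {}
for d in designs:
    by_status.setdefault(bmi_status(d["bmi"]), []).append(d["id"])
```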

Figure 4: modeFRONTIER showing a box-whiskers chart of the blood pressure.
Figure 5: Box-whiskers showing four different categories of risk.

Histograms

Once the data have been loaded, it is straightforward to build histogram charts, changing the number of classes to fit the user’s needs. The probability density functions which best fit the data are highlighted, and the user can visualize them superimposed on the histogram. As can be seen in the example [Figure 2], the population weight has a distribution with two distinct peaks. If one marks the designs corresponding to the men (sex == 0), these designs are highlighted in the histogram. In this way, it is easy to understand that, with reference to the variable weight, the population is substantially divided into two groups, men and women. A similar observation can be made for the variable height.
The table below the graph collects the most important statistics of the data (mean, standard deviation, etc.).
Obviously, the same operation can be performed by dividing the data according to other criteria, thus constructing subgroups of data which can be analyzed separately.
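
To make the role of the number of classes concrete, here is a minimal histogram sketch in plain Python. It assumes equal-width classes spanning the range of the data, which is a common convention but not necessarily the binning rule used by modeFRONTIER; the weights are made up so that two groups are visible.

```python
def histogram(sample, n_classes):
    """Count the observations falling in each of n_classes equal-width classes."""
    lo, hi = min(sample), max(sample)
    width = (hi - lo) / n_classes
    counts = [0] * n_classes
    for v in sample:
        # The maximum value is assigned to the last class.
        idx = min(int((v - lo) / width), n_classes - 1)
        counts[idx] += 1
    return counts


# Made-up weights in kg: a lighter and a heavier group.
weights = [55.0, 57.0, 58.0, 60.0, 62.0, 78.0, 80.0, 82.0, 83.0, 85.0]
counts = histogram(weights, 3)
```

With three classes the empty middle class separates the two peaks; with a single class the same data would look unimodal, which is why the number of classes is worth adjusting.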

Figure 6: A pie plot describing the percentages for each category. It can be useful to have a global view of how the population is organized. In this case, a large majority of the subjects do not present any risk (according to our own subdivision), while roughly 5% have a medium/high risk.
Figure 7: A scatter matrix summarizes a lot of information in a single chart. It helps in finding linear relations between variables.

Cumulative Distribution

The cumulative distribution plot indicates the probability that a given event occurs in the population. It reports the empirical cumulative distribution function (ECDF for short) together with the most probable theoretical CDF, if any.
In this example [Figure 3], there is roughly a 50% probability of finding a man who is overweight, since the CDF value corresponding to a BMI of 25 is 0.5. Only a small portion (less than 5%) of the male population is obese, having a BMI greater than 30.
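
The way such probabilities are read from an ECDF can be sketched as follows. The ECDF at x is simply the fraction of observations less than or equal to x; the BMI values below are made up and do not reproduce the data behind Figure 3.

```python
def ecdf(sample, x):
    """Empirical CDF: fraction of observations less than or equal to x."""
    return sum(1 for v in sample if v <= x) / len(sample)


# Made-up BMI values for a small group of men.
bmi_men = [21.0, 23.5, 24.8, 25.0, 26.1, 27.4, 29.0, 31.2]

share_up_to_25 = ecdf(bmi_men, 25.0)       # fraction with BMI <= 25
share_obese = 1.0 - ecdf(bmi_men, 30.0)    # fraction with BMI > 30
```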

Box-Whiskers

The Box-Whiskers plot can be used to visualize the distribution of data in an effective way, summarizing key information such as the mean and its confidence interval, the quartiles and any outliers. The confidence interval is an estimate of the mean with lower and upper limits. It indicates how much uncertainty there is in our estimate of the true mean: the narrower the interval, the more precise the estimate. Confidence limits are expressed in terms of a confidence coefficient. Although the choice of confidence coefficient is somewhat arbitrary, in practice 90%, 95%, and 99% intervals are often used, with 95% being the most common.
The outliers are the designs that fall outside an interval centered on the mean with a half-width of 1.5 standard deviations. In the example [Figure 4], it is interesting to note that, for the blood pressure, there are four outliers which can be easily selected and, if needed, categorized.

The population can be categorized according to many criteria. In the following example, the most important risk factors for cardiovascular diseases have been considered. The patients have been organized into four groups, according to the risk level they belong to. The table summarizes the criteria adopted for the selection of the patients: obviously, this is only for demonstration purposes and has neither scientific nor medical relevance.

Now, it is possible to build a Category Box-Whiskers chart, which plots the data series taking into account the subdivision of the designs into categories. In Figure 5, the population age is considered. It can be seen that, statistically, the age at highest risk for cardiovascular diseases ranges between 29.5 and 46 years (the first and the third quartiles of the medium and the high risk distributions, respectively, have been used to define this range). However, if we also examine the low risk series, the age range should be extended up to 58. Moreover, it can be seen that the densest half of this series is located in the upper part. This means that the risk, even if low, is statistically higher for increasing ages; this statement is corroborated by the fact that the densest half of the no risk distribution is located in the lower part.

Figure 8: DOE Main Effects of the cholesterol level. This chart reveals that the consumption of tobacco and the blood pressure have a direct effect on the cholesterol level.
Figure 9: t-Student analyses on the output variables identifying the most important factors.

Find out linear relations between variables

The Correlation matrix and the Scatter matrix are useful tools to check if there is any linear relation between variables. The correlation coefficient is a measure of the closeness of the linear relationship between two variables. The correlation coefficient is a pure number without units or dimensions which can range from -1.00 to +1.00. The value of -1.00 represents a perfect negative correlation while a value of +1.00 represents a perfect positive correlation. Positive values of the correlation indicate a tendency of the two variables to increase together. When the coefficient is negative, large values of the first variable are associated with small values of the second one.
The correlation ranking reports the most relevant connections. It is important to note that only relatively high absolute values of the correlation coefficient should be considered reliable. In the Scatter matrix, the scatter plots together with the regression lines help to visualize the designs and understand how they are distributed in the space. All the graphs can be enlarged and explored by simply clicking on their upper-left corner.
In this example [Figure 7], it can be seen that the most important negative correlations involve the sex and the weight (-0.799) and the sex and the height (-0.728). The two most important positive correlations involve the height and the weight (0.566) and the cholesterol and the number of cigarettes (0.516).
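
For reference, the correlation coefficient described above is the standard Pearson coefficient, which can be computed with a few lines of plain Python. The sample vectors are made up and only show the two limiting cases of +1 and -1.

```python
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


r_direct = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])    # perfectly increasing
r_inverse = pearson([1.0, 2.0, 3.0], [6.0, 4.0, 2.0])   # perfectly decreasing
```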

DOE Main Effects and Student Analysis

To construct a DOE main effect graph, it is first necessary to identify the effect and the factors. Each factor domain is split into two equal intervals containing the lowest and the highest values, identified with a – and a + respectively. In this way, two separate distributions are created and then plotted with respect to the chosen effect, using a layout very similar to the Box-Whiskers one. The resulting graph makes it possible to identify the factors that most influence the effect, and whether the relation between each factor and the effect is direct or inverse. In this case, for example [Figure 8], the consumption of tobacco and the blood pressure have a direct effect on the cholesterol level, while the age and the height have an inverse effect.
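
The splitting procedure just described can be sketched in Python as follows. The sketch assumes that the factor domain is cut at the midpoint of its range and that the main effect is summarized by the difference between the mean responses of the + and – groups; the function name and the data are hypothetical.

```python
from statistics import mean


def main_effect(factor, effect):
    """Mean of the effect in the '+' half of the factor domain minus the '-' half."""
    midpoint = (min(factor) + max(factor)) / 2.0
    minus = [e for f, e in zip(factor, effect) if f <= midpoint]
    plus = [e for f, e in zip(factor, effect) if f > midpoint]
    return mean(plus) - mean(minus)


# Made-up data: cigarettes per day and LDL cholesterol in mg/dl.
cigarettes = [0.0, 2.0, 5.0, 10.0, 15.0, 20.0]
cholesterol = [100.0, 105.0, 110.0, 130.0, 140.0, 150.0]

effect_size = main_effect(cigarettes, cholesterol)
```

A positive value corresponds to a direct relation between the factor and the effect, a negative value to an inverse one.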

By performing a t-Student analysis of the data on the output variable, it is possible to understand which factors are the most important for the cholesterol level. In the case of [Figure 9], it is quite clear that high consumption of tobacco and high blood pressure are the two main factors which contribute to high levels of cholesterol.
The t-Student test is a very useful tool for identifying the most important causes of an undesired effect. It may hence make it possible to reach a correct diagnosis for a new patient by looking at just a few parameters instead of the whole set.
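
One plausible way to quantify such a comparison is a two-sample t statistic between the output values observed in the upper and lower halves of a factor domain. The sketch below uses Welch's unequal-variance form; this is an assumption, since the article does not specify which variant modeFRONTIER applies, and the two groups are made up.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's two-sample t statistic (unequal variances assumed)."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Made-up cholesterol levels [mg/dl] for heavy smokers and for non-smokers.
high_group = [140.0, 150.0, 160.0]
low_group = [100.0, 110.0, 120.0]

t_value = welch_t(high_group, low_group)
```

Ranking the factors by the absolute value of this statistic gives a rough ordering of their importance for the output.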

Finally, it is interesting to point out that it is possible to generate a statistical report for every variable; this report represents a descriptive statistics tool that contains all the most important univariate statistics and graphs that completely characterize the data series.
This report can even be saved in different formats, to allow the user to collect results and reuse them in a subsequent context.

Figure 10: SOM components plot. This tool allows a global view of the database.
Figure 11: D-matrix of the SOM expressing the average distance between neurons
Figure 12: SOM response. This chart reports the best matching unit for each point. In this example, it allows the estimation of the cholesterol level of new patients.

The Multivariate Analysis (MVA) tool

In this example, the number of variables does not allow a compact visualization of the designs. In fact, if the dimensionality is higher than 4 or 5, it becomes prohibitive, and somewhat useless, to plot all the information contained in the database using classical 2-dimensional charts. Obviously, this represents an important limit to the user’s understanding of the data: it is often extremely difficult to find groups of similar designs, identify outliers, and understand how the space is filled with designs. One possible strategy to solve this problem is to use a Self Organizing Map (SOM), a neural network trained by unsupervised, competitive learning. A SOM is able to map the designs, belonging to a multi-dimensional space, onto a lower dimensional space, preserving the original data topology and density.

SOMs are important new tools included in modeFRONTIER, as part of the new multivariate analysis environment. A user-friendly wizard guides the user step by step, so creating a SOM is really easy. In this case, for example, all variables can be used to build the SOM except for the cholesterol and the BMI, the latter being a derived quantity. The cholesterol and the BMI values can be seen as additional properties of the designs which do not contribute to the creation of the map.

When a self organizing map is created, a new table is added to the modeFRONTIER project, collecting the neurons of the map, the corresponding prototype vectors and the designs which have been captured by each neuron. Several charts are available to visualize the results: the first one is the SOM components plot [Figure 10], where all the database components are displayed. This tool allows a global view of the database and helps the user detect whether there is any relation between variables.

In this case, for example, it is important to note that the cholesterol variable has a relatively smooth colored map. Since this variable has been neglected during the map creation, this indicates that its behavior is related in some sense to the other components and is not just the result of a random process. Indeed, it can be seen that the maximum values of the cholesterol are located in the lower-left corner of the map, as are those of the cigarettes, the weight, the blood pressure and the height components.
The age and the sex have similar maps, simply rotated, with well separated red and blue zones: this can be seen as a demonstration that the examined population includes females and males of all ages.

Other charts are available along with the SOM components. For example, the D-matrix expresses the average distance between a neuron and its neighbors. With this representation, it is possible to detect whether there are clusters of data and to judge whether or not they are well separated. In the case of [Figure 11], there seems to be no significant clustering of the data: the designs are uniformly spread over the map (the dimension of each square is proportional to the number of designs pertaining to a given neuron), especially in the brightest zones of the map, where the distances between designs are minimal.
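
The idea behind the D-matrix can be sketched in a few lines of Python. The sketch assumes a rectangular map with at least two neurons, a 4-neuron neighborhood and Euclidean distances between prototype vectors; the actual lattice and neighborhood used by modeFRONTIER may differ, and the prototype vectors are made up.

```python
from math import dist


def d_matrix(grid):
    """Average distance between each neuron's prototype and its grid neighbors.

    grid[r][c] is the prototype vector of the neuron in row r, column c.
    """
    rows, cols = len(grid), len(grid[0])
    result = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            neighbors = [
                (r + dr, c + dc)
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                if 0 <= r + dr < rows and 0 <= c + dc < cols
            ]
            total = sum(dist(grid[r][c], grid[i][j]) for i, j in neighbors)
            result[r][c] = total / len(neighbors)
    return result


# Made-up 2 x 2 map with two-dimensional prototype vectors.
som_grid = [
    [(0.0, 0.0), (1.0, 0.0)],
    [(0.0, 1.0), (5.0, 5.0)],
]
distances = d_matrix(som_grid)
```

Neurons with a small average distance lie inside a dense region, while large values mark the borders between possible clusters.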

Now, let us suppose that we have carried out another medical investigation on a second population, relatively homogeneous with the first one, collecting the same data as in the first investigation, except for the cholesterol level. This time, the goal could be to understand how this second population is distributed and, therefore, to predict the cholesterol level for all patients belonging to it. In this way, it is possible to identify dangerous situations, such as high values of cholesterol, without having any experimental evidence of them. To this aim, the user can load a new table of data in modeFRONTIER and plot a SOM response. This plot reports, for each new patient in the new table, the Best Matching Unit (BMU) of the SOM and the affinity (which can be read as an accuracy value) between the patient’s values and the corresponding BMU. The BMU represents a kind of reference situation for the design under consideration, and therefore the unknown cholesterol for the design can be taken from the corresponding BMU neuron.
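
The prediction step can be illustrated with a minimal best matching unit lookup. The sketch assumes that the map has already been trained, that the BMU is the neuron whose prototype vector is closest to the new design in the Euclidean sense, and that each neuron carries a cholesterol value; the prototypes, the values and the variable names are made up, and in practice the variables would normally be scaled before computing distances.

```python
from math import dist


def best_matching_unit(prototypes, design):
    """Index of the prototype vector closest to the given design."""
    return min(range(len(prototypes)), key=lambda i: dist(prototypes[i], design))


# Made-up trained map with two neurons; vectors are (age, weight in kg).
prototypes = [(30.0, 70.0), (60.0, 90.0)]
cholesterol_by_neuron = [110.0, 160.0]  # hypothetical values attached to the neurons

new_patient = (58.0, 85.0)
bmu = best_matching_unit(prototypes, new_patient)
predicted_cholesterol = cholesterol_by_neuron[bmu]
```

The distance between the design and its BMU plays a role similar to the affinity mentioned above: the smaller it is, the more trustworthy the estimate.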

Conclusions

In this article, the statistical and the multivariate analysis tools available in modeFRONTIER have been briefly presented. The aim was to demonstrate how all these tools can be used to capture the most important information contained in a database and to discover hidden or not immediately evident relations between variables. The importance and usefulness of these tools have been shown by means of an example. In this example, the Self Organizing Map has been used for two different purposes: first as an effective representation of multidimensional data and, second, as a prediction tool.

References

[1] “Highlights on health in Italy 2004”, by the WHO Regional Office for Europe, available at www.euro.who.int.
[2] “Health for All” data on the Italian population provided by ISTAT, the Italian Institute of Statistics, available at http://www.istat.it/sanita/Health

The websites, www.esteco.com and www.network.modefrontier.eu,
provide several examples of how modeFRONTIER can be used as a statistical tool.


For any questions on this article or to request further examples or information, please email the authors:

Silvia Poles - Optimization Consulting
EnginSoft
info@enginsoft.it

Massimiliano Margonari
info@enginsoft.it
EnginSoft S.p.A.
