Author: Baran Kaplan
Principal Component Analysis using DATAtab Statistics Calculator
In this article I will talk about principal component analysis and the effects of the so-called ‘curse of dimensionality’. In data science, the number of dimensions is equal to the number of independent variables (features) in your dataset.
Imagine you are making pizza in a bakery. The dough is spread evenly, so every slice gets the same amount of it, but some slices will end up with less salami, sausage, or mozzarella than others. The more ingredients (let’s call them inputs or features) you add to the pizza, the larger the feature space becomes and the more likely it is that your dataset is a very sparse and possibly non-representative sample of that space. This situation is called the curse of dimensionality.
Principal Component Analysis can be used to reduce the number of dimensions and thereby counteract the curse of dimensionality. To give a real-life example, PCA has been used to detect and diagnose anomalies in internet traffic.
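To make the sparsity argument concrete, here is a minimal sketch. It assumes the data is spread uniformly over a unit hypercube (an idealization, not a property of the pizza data) and asks how long the edge of a sub-cube must be to contain 10% of the points. The function name `edge_length_for_fraction` is made up for this illustration.

```python
def edge_length_for_fraction(fraction: float, n_dims: int) -> float:
    """Edge length of a sub-cube holding `fraction` of uniformly spread data
    inside a unit hypercube with `n_dims` dimensions."""
    # Volume of a cube with edge e in n dimensions is e ** n,
    # so e = fraction ** (1 / n).
    return fraction ** (1.0 / n_dims)


for n_dims in (1, 2, 5, 10, 50):
    edge = edge_length_for_fraction(0.10, n_dims)
    print(f"{n_dims:>2} features: sub-cube edge = {edge:.3f} of each axis range")
```

As the number of features grows, a "local" neighbourhood holding 10% of the data has to span almost the full range of every feature, which is another way of saying that the samples become sparse.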
The purpose of principal component analysis
The purpose of principal component analysis is to summarize the correlations between a set of observed variables into a smaller set of linear combinations. Principal components are linear combinations of the input variables (for numerical data only!). They are constructed to be uncorrelated with one another, which removes redundancy. This way, a limited number of components will typically explain most of the variance in the original variables.
So the goal here is to create a correlation or covariance matrix for these variables and base everything else on that. When we say we have a solution with two factors, we are actually saying that the first two components capture enough of the variance across the entire set of variables to be useful.
This is the need that feature engineering grew out of. But how can we find a projection onto a lower-dimensional space, the pizza slice, that still retains the most important features of the original data? How can we solve the curse of dimensionality with dimension reduction?
I will try to explain this in a few steps with the DATAtab application. Just go to the PCA calculator, download the pizza.csv file from https://data.world/sdhilip/pizza-datasets, click on import and drop it over the area I marked.
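The idea of basing everything on the correlation matrix can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the data is synthetic (six made-up variables driven by two hidden factors), the helper name `pca_from_correlation` is hypothetical, and nothing here claims to reproduce DATAtab's internal implementation.

```python
import numpy as np


def pca_from_correlation(X):
    """Eigenvalues (largest first) and matching eigenvectors (as columns)
    of the correlation matrix of X, where rows are samples."""
    R = np.corrcoef(X, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(R)  # eigh returns ascending order
    order = np.argsort(eigenvalues)[::-1]
    return eigenvalues[order], eigenvectors[:, order]


# Synthetic data: 6 observed variables generated from 2 hidden factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.3 * rng.normal(size=(200, 6))

eigenvalues, eigenvectors = pca_from_correlation(X)

# Component scores: standardized data projected onto the eigenvectors.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
scores = Z @ eigenvectors

print("eigenvalues:", np.round(eigenvalues, 3))
print("share of variance per component:", np.round(eigenvalues / eigenvalues.sum(), 3))
```

With a correlation matrix the eigenvalues add up to the number of variables, and the component scores are uncorrelated with each other, which is the redundancy reduction described above.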
Let’s look at what our variables are:
- brand — Pizza brand
- id — the number of the analyzed sample
- mois — the amount of water per 100 g in the sample
- prot — the amount of protein per 100 g in the sample
- fat — the amount of fat per 100 g in the sample
- ash — the amount of ash per 100 grams in the sample. Ash is what remains after the ‘burning’ of food during digestion. The body can burn essential nutrients (carbohydrates, protein and fat) but not minerals (e.g. calcium, potassium, sodium, magnesium) or trace elements (e.g. zinc and iron)
- sodium — the amount of sodium per 100 grams in the sample
- carb — the amount of carbohydrates per 100 grams in the sample
- cal — the amount of calories per 100 grams in the sample
QUESTION: Can I build the model with only as many independent variables as are needed to represent 90% of the variance in the data, without high correlation between those variables (eliminating the multicollinearity problem)?
Let’s choose our independent and metric variables. Even though the id is expressed here as a number, an identification number is a nominal attribute, like a postal code, so it is not treated as a metric variable.
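Outside of DATAtab, the same selection could be sketched with pandas. The column names below follow the variable list above, but the two rows are placeholder values made up for this example and are not taken from pizza.csv; `select_metric_columns` is a hypothetical helper.

```python
import io

import pandas as pd

# Placeholder rows with the same column names as pizza.csv (values are invented).
csv_text = """brand,id,mois,prot,fat,ash,sodium,carb,cal
X,1001,30.0,20.0,40.0,5.0,1.5,3.5,4.5
Y,1002,50.0,10.0,15.0,2.0,0.5,22.5,2.7
"""


def select_metric_columns(df, exclude=("id",)):
    """Keep numeric columns, then drop numeric-looking identifiers such as id."""
    numeric = df.select_dtypes(include="number")
    return numeric.drop(columns=[c for c in exclude if c in numeric.columns])


df = pd.read_csv(io.StringIO(csv_text))
metric = select_metric_columns(df)
print(list(metric.columns))
```

The brand column is dropped because it is text, while id has to be excluded by name because its numeric type hides the fact that it is nominal.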
PCA and Communality
It is useful to briefly discuss some concepts. Communality is the proportion of a variable’s variance that is explained by the retained components, which reflects how strongly that variable is related to the others. The higher the communality value, the better (it ranges from 0 to 1). If the communality for a particular variable is low (between 0.0 and 0.4), that variable may have difficulty loading significantly on any factor.
The above image shows the communality values when 1 component is used. An eigenvalue represents the amount of variance that is explained by a particular principal component. For a correlation or covariance matrix the eigenvalues cannot be negative in theory; each one equals the sum of the squared loadings of all variables on a single component.
A clearly positive eigenvalue is a good sign. Since variance cannot be negative, a negative eigenvalue indicates that the model is ill-conditioned. A near-zero eigenvalue indicates multicollinearity among the variables, since almost all of the variance is already captured by the earlier components.
The elements of the Component Matrix are the correlations of each variable with each component. Squaring such a correlation gives the share of that variable’s variance explained by the component (an R²), which is what links the Component Matrix to the explained variance. Variables with high coefficients on a component carry more weight in defining the variance of that component.
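The link between a near-zero eigenvalue and multicollinearity can be checked on a small synthetic example. In this sketch the third variable is deliberately built as the exact sum of the first two; the variable names and data are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=300)
b = rng.normal(size=300)
c = a + b  # exact linear combination of a and b: perfect multicollinearity

X = np.column_stack([a, b, c])
R = np.corrcoef(X, rowvar=False)

# eigvalsh returns eigenvalues in ascending order; reverse to list the largest first.
eigenvalues = np.linalg.eigvalsh(R)[::-1]
print("eigenvalues:", eigenvalues)
```

The last eigenvalue comes out at essentially zero. It may print as a tiny negative number, which in this case is floating-point rounding rather than a sign of a badly specified model.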
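The relationships between the Component Matrix, the eigenvalues and the communalities can be verified numerically. The sketch below uses synthetic data (five made-up variables) and the common convention that loadings are eigenvectors scaled by the square root of their eigenvalue; whether DATAtab uses exactly this convention is an assumption here.

```python
import numpy as np

rng = np.random.default_rng(2)
latent = rng.normal(size=(250, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.5 * rng.normal(size=(250, 5))

# Standardize, then decompose the correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = np.corrcoef(Z, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Component matrix (loadings): eigenvector columns scaled by sqrt(eigenvalue).
component_matrix = eigenvectors * np.sqrt(eigenvalues)

# Direct check: correlation of every variable with every component score.
scores = Z @ eigenvectors
direct = np.array([[np.corrcoef(Z[:, i], scores[:, j])[0, 1] for j in range(5)]
                   for i in range(5)])

# Column sums of squared loadings give the eigenvalues;
# row sums over the retained components give the communalities.
n_keep = 2
communalities = (component_matrix[:, :n_keep] ** 2).sum(axis=1)

print("loadings match correlations:", np.allclose(component_matrix, direct))
print("communalities with 2 components:", np.round(communalities, 3))
```

Summing the squared loadings down a column recovers that component's eigenvalue, and summing them across all five components gives a communality of 1 for every variable, since nothing is discarded.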
Interpreting the results of principal component analysis
Let’s interpret the tables in the PCA tab.
While 1 component represents 66% of the total variance, selecting 2 components raises this share to 91%. Our model, which would otherwise work with 6 components, reaches 91% with 2 components; this share is quite sufficient and it also addresses the curse of dimensionality. After deciding how many factors we will work with, let’s continue by revising our table accordingly.
With 2 components, 91% of the variance is represented.
The eigenvalues of the two selected components add up to 5.484 out of a total of 6, and this ratio is 91%, just like the variance. The communality values, which lie between 0 and 1, are satisfactorily high. Two points should be noted here.
If one variable has a low communality even though the share of variance represented is high, consider the alternatives when determining the number of components. Although the 1st component represents 61% of the variance, the communality of mois, that is, the amount of water per 100 grams in the sample, was quite low. So if we proceeded with a single component, we would escape the curse of dimensionality and the dimension would be reduced, but the model would be inconsistent.
When we selected 2 components, the communality values were uniformly high and 91% of the variance was represented. The curse of dimensionality was eliminated. Beyond that point the eigenvalues level off, so adding one more dimension will not be effective. Finally, the table shows the correlations of the variables with the 2 components we selected and their weights / effects on the model.
With the following component selections (1, 2, 3 and 6 components), you can see the relationship between variance representation, the communalities and the Component Matrix more easily.
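Choosing the number of components from the eigenvalues can be expressed as a small helper. The eigenvalues below are illustrative numbers chosen to add up to 6; they are not the values reported by DATAtab, and `components_needed` is a hypothetical function name.

```python
from itertools import accumulate


def components_needed(eigenvalues, threshold):
    """Smallest number of leading components whose cumulative share of the
    total variance reaches `threshold` (eigenvalues sorted largest first)."""
    total = sum(eigenvalues)
    for k, cumulative in enumerate(accumulate(eigenvalues), start=1):
        if cumulative / total >= threshold:
            return k
    return len(eigenvalues)


# Illustrative eigenvalues for 6 standardized variables (assumed, not measured).
illustrative_eigenvalues = [4.0, 1.5, 0.3, 0.1, 0.07, 0.03]

running = 0.0
for k, value in enumerate(illustrative_eigenvalues, start=1):
    running += value
    print(f"{k} component(s): cumulative share = {running / sum(illustrative_eigenvalues):.1%}")

print("components needed for 90%:", components_needed(illustrative_eigenvalues, 0.90))
```

Once the remaining eigenvalues are this small, each extra component adds only a few percentage points, which is the levelling-off behaviour described above.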