In a regression analysis, multicollinearity occurs when two or more predictor variables (independent variables) show a high correlation. This can lead to the regression coefficients being unstable and no longer being interpretable.
Why is Multicollinearity a Problem?
Multicollinearity is a problem because it distorts the statistical significance of the independent variables.
A main goal of regression is to determine the relationship between each independent variable and the dependent variable. However, when variables are highly correlated, it may no longer be possible to determine exactly which influence comes from which variable. As a result, the p-values of the regression coefficients can no longer be interpreted.
With multicollinearity, the regression coefficients can vary greatly when the data change very slightly or new variables are added.
Is Multicollinearity always a Problem?
Multicollinearity only affects the independent variables that are highly correlated. If you are interested in other variables that do not exhibit multicollinearity, then you can interpret them normally.
If you are using the regression model to make a prediction, then multicollinearity does not affect the outcome of the prediction. The multicollinearity only affects the individual coefficients and the p-value.
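Both points above can be illustrated with a small simulation. The following is a minimal numpy sketch on made-up data (variable names and noise levels are illustrative assumptions): two almost perfectly correlated predictors are fitted twice, once on the original response and once after a tiny perturbation. The individual coefficients can shift noticeably between the two fits, while the fitted values, and thus the predictions, stay essentially the same.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100

# Two almost perfectly correlated predictors: x2 is x1 plus tiny noise.
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.001, size=n)
y = 3 * x1 + 2 * x2 + rng.normal(scale=0.5, size=n)

def fit(target):
    """Ordinary least squares with an intercept."""
    X = np.column_stack([np.ones(n), x1, x2])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return X, coef

X, coef_a = fit(y)
# Refit after a tiny perturbation of the response.
_, coef_b = fit(y + rng.normal(scale=0.1, size=n))

print("coefficients (original): ", coef_a.round(2))
print("coefficients (perturbed):", coef_b.round(2))
# The predictions, in contrast, barely move:
print("max change in fitted values:",
      np.abs(X @ coef_a - X @ coef_b).max().round(3))
```

Note that the sum of the two slopes stays close to 5 (the combined true effect), even when the individual coefficients wander: only the joint effect of the collinear pair is well determined.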
How to Avoid Multicollinearity?
To avoid multicollinearity, there must be no linear dependence between the predictors; such a dependence exists, for example, when one variable is a multiple of another. In that case the two variables are perfectly correlated: one variable explains 100% of the other, so there is no added value in including both in a regression model. Conversely, if there is no correlation between the independent variables, there is no multicollinearity.
In real data, a perfect linear correlation hardly ever occurs, which is why we speak of multicollinearity when individual variables are highly correlated with each other. In this case, the effects of the individual variables cannot be clearly separated from one another.
It should be noted that the regression coefficients can then no longer be interpreted in a meaningful way, but prediction with the regression model is still possible.
Since there is always some multicollinearity in a given set of data, measures were introduced to quantify it. To test for multicollinearity, a new regression model is created for each independent variable. In these auxiliary models, the original dependent variable is left out, and one of the independent variables serves as the dependent variable in each case.
Each auxiliary model thus tests how well one independent variable can be represented by the other independent variables. If it can be represented very well, this is a sign of multicollinearity.
For example, if x1 can be completely composed of the other variables, then the regression model cannot know what b1 or the other coefficients must be. Mathematically, the coefficients are then not uniquely determined: the system of equations is underdetermined.
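This can be seen directly in the design matrix. Below is a small sketch on simulated data (the variables and the exact combination x1 = 2·x2 + 3·x3 are illustrative assumptions): with a perfectly collinear predictor, the design matrix has fewer independent columns than coefficients, which is exactly why b1 cannot be pinned down.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
x1 = 2 * x2 + 3 * x3              # x1 is exactly a combination of x2 and x3
y = x1 + x2 + rng.normal(scale=0.1, size=n)

# Design matrix with intercept: 4 columns, but only rank 3,
# so infinitely many coefficient vectors fit the data equally well.
X = np.column_stack([np.ones(n), x1, x2, x3])
print("columns:", X.shape[1], " rank:", np.linalg.matrix_rank(X))
```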
In order to find out whether multicollinearity is present, the tolerance of the individual predictors is considered on the one hand. The tolerance Ti for the i-th predictor is calculated with

Ti = 1 − Ri²
To calculate Ri², a new regression model is created, as discussed above. This model contains all predictors, whereby the i-th predictor is used as the new criterion (dependent variable). This makes it possible to determine how well the i-th predictor can be represented by the other predictors.
A tolerance value (T) below 0.1 is considered critical: multicollinearity is present, because more than 90% of that predictor's variance can be explained by the other predictors.
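The auxiliary-regression procedure can be sketched in a few lines of numpy. The data here are simulated (x1 is deliberately constructed as nearly a sum of x2 and x3, an illustrative assumption): x1 is regressed on the other predictors, Ri² is computed from the residuals, and the tolerance follows as 1 − Ri².

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
x1 = x2 + x3 + rng.normal(scale=0.2, size=n)  # x1 nearly a combination of x2, x3

# Auxiliary regression: the predictor x1 becomes the dependent variable.
Z = np.column_stack([np.ones(n), x2, x3])
coef, *_ = np.linalg.lstsq(Z, x1, rcond=None)
resid = x1 - Z @ coef

r2 = 1 - resid.var() / x1.var()   # how well x1 is explained by x2 and x3
tolerance = 1 - r2
print(f"R^2 = {r2:.3f}, tolerance = {tolerance:.3f}")
```

With this construction the tolerance lands well below 0.1, flagging x1 as affected by multicollinearity.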
Another measure used to test for multicollinearity is the VIF (Variance Inflation Factor). The VIF for the i-th predictor is calculated by

VIFi = 1 / (1 − Ri²) = 1 / Ti
The VIF increases with increasing multicollinearity: the higher the VIF value, the more likely multicollinearity is present. In the VIF test, values above 10 are considered critical.
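Rather than running one auxiliary regression per predictor, all VIFs can be read off at once as the diagonal of the inverse correlation matrix of the predictors, which is mathematically equivalent to 1 / (1 − Ri²). A minimal sketch on the same kind of simulated data (the construction of x1 is again an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
x1 = x2 + x3 + rng.normal(scale=0.2, size=n)

# VIF_i is the i-th diagonal element of the inverse correlation
# matrix of the predictors, equivalent to 1 / (1 - Ri^2).
X = np.column_stack([x1, x2, x3])
R = np.corrcoef(X, rowvar=False)
vif = np.diag(np.linalg.inv(R))
print("VIF per predictor:", vif.round(1))
```

Here the VIF of x1 far exceeds the critical threshold of 10, matching the tolerance criterion (T < 0.1 corresponds exactly to VIF > 10).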