Linear Regression
What is a linear regression analysis?
Linear regression analysis is used to create a model that describes the relationship between a dependent variable and one or more independent variables. Depending on whether there is one independent variable or several, a distinction is made between simple and multiple linear regression analysis.
In the case of a simple linear regression, the aim is to examine the influence of an independent variable on one dependent variable. In the second case, a multiple linear regression, the influence of several independent variables on one dependent variable is analyzed.
In linear regression, an important prerequisite is that the dependent variable is measured on a metric scale and that the errors (residuals) are normally distributed. If the dependent variable is categorical, a logistic regression is used. You can easily perform a regression analysis in the linear regression calculator here on DATAtab.
Example: Simple Linear Regression
Does the height have an influence on the weight of a person?
Example: Multiple Linear Regression
Do the height and gender have an influence on the weight of a person?
- Dependent variable: weight
- Independent variables: height (simple regression); height and gender (multiple regression)
Simple Linear Regression
The goal of a simple linear regression is to predict the value of a dependent variable based on an independent variable. The stronger the linear relationship between the independent variable and the dependent variable, the more accurate the prediction. Equivalently, the larger the proportion of the dependent variable's variance that the independent variable explains, the more accurate the prediction. Visually, the relationship between the variables can be shown in a scatter plot: the stronger the linear relationship, the more closely the data points lie along a straight line.
The task of simple linear regression is to exactly determine the straight line which best describes the linear relationship between the dependent and independent variable. In linear regression analysis, a straight line is drawn in the scatter plot. To determine this straight line, linear regression uses the method of least squares.
The regression line can be described by the following equation:
ŷ = b · x + a
Definition of "Regression coefficients":
- a : point of intersection with the y-axis
- b : gradient of the straight line
ŷ is the respective estimate of the y-value. This means that for each x-value the corresponding y-value is estimated. In our example, this means that the height of people is used to estimate their weight.
If all points (measured values) were exactly on one straight line, the estimate would be perfect. However, this is almost never the case, so in most cases a straight line must be found that is as close as possible to the individual data points. The aim is to keep the error in the estimation as small as possible, so that the distance between the estimated value and the true value is as small as possible. This distance or error is called the "residual". It is abbreviated as "e" (error) and can also be represented by the Greek letter epsilon (ϵ).
When calculating the regression line, the regression coefficients (a and b) are determined so that the sum of the squared residuals is minimal. This method is called OLS ("Ordinary Least Squares").
The regression coefficient b can have different signs, which can be interpreted as follows:
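The least-squares idea can be sketched in a few lines of Python. The snippet below is only an illustration, not DATAtab's implementation. It uses the height and weight columns of the example data set from the end of this tutorial, and the function name `fit_simple_regression` is made up for this sketch. It relies on the standard closed-form OLS solution for one independent variable, b = S_xy / S_xx and a = ȳ − b · x̄.

```python
from statistics import mean

def fit_simple_regression(x, y):
    """Return (a, b) of the least-squares line y_hat = b * x + a."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of cross-products and sum of squares around the means
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx            # gradient of the straight line
    a = y_bar - b * x_bar      # intersection with the y-axis
    return a, b

# Height (m) and weight (kg) from the example data set of this tutorial
height = [1.80, 1.68, 1.82, 1.70, 1.87, 1.55, 1.50, 1.78, 1.67, 1.64]
weight = [79, 69, 73, 95, 82, 55, 69, 71, 64, 69]

a, b = fit_simple_regression(height, weight)
residuals = [yi - (b * xi + a) for xi, yi in zip(height, weight)]

print(f"a = {a:.2f}, b = {b:.2f}")
print(f"sum of squared residuals = {sum(e * e for e in residuals):.2f}")
```

Any other choice of a and b gives a larger sum of squared residuals for these data, which is what "least squares" means.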
- b > 0: there is a positive correlation between x and y (the greater x, the greater y)
- b < 0: there is a negative correlation between x and y (the greater x, the smaller y)
- b = 0: there is no correlation between x and y
Standardized regression coefficients are usually designated by the letter "beta". These are values that are comparable with each other. Here the unit of measurement of the variable is no longer important. The standardized regression coefficient (beta) is automatically output by DATAtab.
Multiple Linear Regression
Unlike simple linear regression, multiple linear regression allows two or more independent variables to be considered. The goal is to estimate a variable based on several other variables. The variable to be estimated is called the dependent variable (criterion). The variables that are used for the prediction are called independent variables (predictors).
Multiple linear regression is frequently used in empirical social research as well as in market research. In both areas it is of interest to find out what influence different factors have on a variable. For example, what determinants influence a person's health or purchasing behavior?
Marketing example:
For a video streaming service, you are asked to predict how many times a month a person streams videos. For this, you receive a data set with user data (age, income, gender, ...).
Medical example:
You want to find out which factors have an influence on the cholesterol level of patients. For this purpose, you analyze a patient data set with cholesterol level, age, hours of sport per week and so on.
With k independent variables, the equation for a multiple regression is:
ŷ = b_{1} · x_{1} + b_{2} · x_{2} + ... + b_{k} · x_{k} + a
The coefficients can be interpreted similarly to those of the simple linear regression equation. If all independent variables are 0, the resulting value is a. If an independent variable changes by one unit while all other independent variables are held constant, the associated coefficient indicates by how much the dependent variable changes. So if the independent variable x_{i} increases by one unit, the estimated dependent variable ŷ changes by b_{i}.
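To make the interpretation of the coefficients concrete, here is a small Python sketch that evaluates the regression equation with the coefficients reported in the worked example at the end of this tutorial. The function name `predict_weight` and the two example persons are assumptions chosen for illustration.

```python
# Coefficients as reported in the worked example of this tutorial
A = -24.41  # constant a
B = {"height": 47.379, "age": 0.297, "is_male": 8.922}

def predict_weight(height, age, is_male):
    """Evaluate y_hat = b_1 * x_1 + b_2 * x_2 + b_3 * x_3 + a."""
    return B["height"] * height + B["age"] * age + B["is_male"] * is_male + A

# Two hypothetical persons who differ only in age (by one year)
base = predict_weight(height=1.80, age=35, is_male=1)
older = predict_weight(height=1.80, age=36, is_male=1)

print(f"predicted weight at age 35: {base:.2f} kg")
print(f"change for one more year of age: {older - base:.3f} kg")
```

Because only the age differs between the two calls, the difference between the two predictions equals the coefficient of age.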
Multiple Regression vs. Multivariate Regression
Multiple regression should not be confused with multivariate regression. In the former case, the influence of several independent variables on a dependent variable is examined. In the second case, several regression models are calculated to allow conclusions to be drawn about several dependent variables. Consequently, in a multiple regression, one dependent variable is taken into account, whereas in a multivariate regression, several dependent variables are analyzed.
Coefficient of determination
Two main measures are used to find out how well the regression model can predict or explain the dependent variable: the coefficient of determination R^{2} and the standard estimation error. The coefficient of determination R^{2}, also known as the variance explanation, indicates how large the portion of the variance is that can be explained by the independent variables. The more variance can be explained, the better the regression model is. To calculate R^{2}, the variance of the estimated values is related to the variance of the observed values:
R^{2} = Σ (ŷ_{i} − ȳ)^{2} / Σ (y_{i} − ȳ)^{2}
Adjusted R^{2}
The coefficient of determination R^{2} is influenced by the number of independent variables used. The more independent variables are included in the regression model, the greater the variance explanation R^{2}, even if the added variables contribute little. To take this into account, the adjusted R^{2} is used.
Standard estimation error
The standard estimation error is the standard deviation of the estimation error. This gives an impression of how much the prediction differs from the correct value. Graphically interpreted, the standard estimation error is the dispersion of the observed values around the regression line.
The coefficient of determination and the standard estimation error are used for simple and multiple linear regression.
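The three measures can be computed from the observed and the predicted values, as the following Python sketch shows. The helper name `fit_quality` is hypothetical. The formulas for the adjusted R^{2}, 1 − (1 − R^{2}) · (n − 1) / (n − k − 1), and for the standard estimation error, the square root of the residual sum of squares divided by n − k − 1, are the usual textbook definitions and are not spelled out in this tutorial. The predictions use the rounded coefficients of the worked example further below, so the results match the reported values only approximately.

```python
from math import sqrt
from statistics import mean

def fit_quality(y, y_hat, k):
    """Return (R^2, adjusted R^2, standard estimation error).

    y      observed values of the dependent variable
    y_hat  values predicted by the regression model
    k      number of independent variables in the model
    """
    n = len(y)
    y_bar = mean(y)
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    ss_explained = sum((fi - y_bar) ** 2 for fi in y_hat)
    ss_residual = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    r2 = ss_explained / ss_total
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    std_error = sqrt(ss_residual / (n - k - 1))
    return r2, adj_r2, std_error

# Example data set of this tutorial (is_male: 1 = male, 0 = female)
height = [1.80, 1.68, 1.82, 1.70, 1.87, 1.55, 1.50, 1.78, 1.67, 1.64]
age = [35, 39, 25, 60, 27, 18, 89, 42, 16, 52]
is_male = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
weight = [79, 69, 73, 95, 82, 55, 69, 71, 64, 69]

# Predictions from the (rounded) regression equation of the worked example
y_hat = [47.379 * h + 0.297 * a + 8.922 * m - 24.41
         for h, a, m in zip(height, age, is_male)]

r2, adj_r2, std_error = fit_quality(weight, y_hat, k=3)
print(f"R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}, "
      f"standard estimation error = {std_error:.3f}")
```

The adjusted R^{2} is always smaller than R^{2} here, because it penalizes each additional independent variable.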
Standardized and unstandardized regression coefficient
A distinction is made between standardized and unstandardized regression coefficients. The unstandardized regression coefficients are the coefficients that appear in the regression equation and are abbreviated b.
The standardized regression coefficients are obtained by multiplying the regression coefficient b_{i} by the standard deviation of the respective independent variable S_{xi} and dividing by the standard deviation of the dependent variable S_{y}.
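As a rough illustration, the following Python sketch standardizes the coefficients of the worked example at the end of this tutorial with beta_i = b_i · S_xi / S_y, where S_xi is the standard deviation of the independent variable and S_y that of the dependent variable. The function name `standardized_coefficient` is an assumption, and sample standard deviations (dividing by n − 1) are used.

```python
from statistics import stdev

def standardized_coefficient(b_i, x_i, y):
    """beta_i = b_i * S_xi / S_y, using sample standard deviations."""
    return b_i * stdev(x_i) / stdev(y)

# Example data set of this tutorial (is_male: 1 = male, 0 = female)
height = [1.80, 1.68, 1.82, 1.70, 1.87, 1.55, 1.50, 1.78, 1.67, 1.64]
age = [35, 39, 25, 60, 27, 18, 89, 42, 16, 52]
is_male = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
weight = [79, 69, 73, 95, 82, 55, 69, 71, 64, 69]

# Unstandardized coefficients b as reported in the worked example
b = {"height": 47.379, "age": 0.297, "is_male": 8.922}
x = {"height": height, "age": age, "is_male": is_male}

betas = {name: standardized_coefficient(b[name], x[name], weight) for name in b}
for name, beta in betas.items():
    print(f"beta({name}) = {beta:.3f}")
```

Because the unit of measurement cancels out, the three beta values can be compared directly, which is not possible with the unstandardized coefficients (47.379 per metre versus 0.297 per year).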
Assumptions of Linear Regression
In order to interpret the results of the regression analysis meaningfully, certain conditions must be met.
- Linearity: There must be a linear relationship between the dependent and independent variables.
- Homoscedasticity: The residuals must have a constant variance.
- Normality: The error must be normally distributed.
- No multicollinearity: There must be no high correlation between the independent variables.
- No autocorrelation: The error component should have no autocorrelation.
Linearity
In linear regression, a straight line is drawn through the data. This straight line should represent all points as well as possible. If the points are distributed in a non-linear way, the straight line cannot fulfill this task.
In the upper left graph, there is a linear relationship between the dependent and the independent variable, so a regression line can be fitted meaningfully. In the right graph, there is a clearly non-linear relationship between the dependent and the independent variable, so a regression line cannot be fitted through the points in a meaningful way. In that case, the coefficients of the regression model cannot be interpreted meaningfully, and the errors in the prediction may be larger than expected.
Therefore it is important to check beforehand whether a linear relationship exists between the dependent variable and each of the independent variables. This is usually checked graphically.
Homoscedasticity
Since in practice the regression model never exactly predicts the dependent variable, there is always an error. This very error must have a constant variance over the predicted range.
To check homoscedasticity, i.e. the constant variance of the residuals, the predicted values of the dependent variable are plotted on the x-axis and the error on the y-axis. The error should scatter evenly over the entire range. If this is the case, homoscedasticity is present. If this is not the case, heteroscedasticity is present. In the case of heteroscedasticity, the error has different variances, depending on the value range of the dependent variable.
Normal distribution of the error
The next requirement of linear regression is that the error epsilon must be normally distributed. There are two ways to check this: analytically and graphically. For the analytical check, you can use either the Kolmogorov-Smirnov test or the Shapiro-Wilk test. If the p-value is greater than 0.05, no significant deviation from the normal distribution is found, and the data are assumed to be normally distributed.
However, these analytical tests are used less and less because they tend to attest normal distribution for small samples and become significant very quickly for large samples, thus rejecting the null hypothesis that the data are normally distributed. Therefore, the graphical variant is increasingly used.
In the graphical variant, either the histogram is examined or, even better, the so-called QQ-plot (quantile-quantile plot). The closer the points lie to the line, the better the data match a normal distribution.
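The points of a QQ-plot can be computed without a plotting library. The sketch below is one possible way to do it, using only Python's standard library: each sorted residual is paired with the quantile that a normal distribution with the same mean and standard deviation would have at that position. The function name `qq_points` and the plotting positions (i + 0.5) / n are assumptions; other conventions exist. The residuals come from the rounded regression equation of the worked example at the end of this tutorial.

```python
from statistics import NormalDist, mean, stdev

def qq_points(residuals):
    """Pair each sorted residual with the matching normal quantile."""
    n = len(residuals)
    observed = sorted(residuals)
    reference = NormalDist(mean(residuals), stdev(residuals))
    # (i + 0.5) / n is one common choice of plotting position
    theoretical = [reference.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, observed))

# Example data set of this tutorial (is_male: 1 = male, 0 = female)
height = [1.80, 1.68, 1.82, 1.70, 1.87, 1.55, 1.50, 1.78, 1.67, 1.64]
age = [35, 39, 25, 60, 27, 18, 89, 42, 16, 52]
is_male = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
weight = [79, 69, 73, 95, 82, 55, 69, 71, 64, 69]

# Residuals of the (rounded) regression equation from the worked example
residuals = [w - (47.379 * h + 0.297 * a + 8.922 * m - 24.41)
             for w, h, a, m in zip(weight, height, age, is_male)]

points = qq_points(residuals)
for theoretical, observed in points:
    print(f"theoretical {theoretical:7.2f}   observed {observed:7.2f}")
```

If the residuals were perfectly normally distributed, the two columns would be almost identical, which corresponds to all points lying on the diagonal line of the QQ-plot.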
Multicollinearity
Multicollinearity means that two or more independent variables are strongly correlated with one another. The problem with multicollinearity is that the effects of each independent variable cannot be clearly separated from one another.
If, for example, there is a high correlation between x_{1} and x_{2}, then it is difficult to determine b_{1} and b_{2}. If the two variables are, for example, identical, the regression model cannot determine how large b_{1} and b_{2} should be and becomes unstable.
This is less of a problem if the regression model is only used for prediction, because then only the predicted value matters and not how large the influence of each variable is. However, if the regression model is used to measure the influence of the independent variables on the dependent variable, and if multicollinearity exists, the coefficients cannot be interpreted meaningfully.
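A simple first check for multicollinearity is to look at the pairwise correlations between the independent variables. The following Python sketch does this for the example data set from the end of this tutorial. The function name `pearson_r` is chosen for illustration, and pairwise correlations are only a rough screening: they can miss cases where one variable is a combination of several others.

```python
from itertools import combinations
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient between two lists of numbers."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / sqrt(s_xx * s_yy)

# Independent variables of the example data set (is_male: 1 = male)
predictors = {
    "height": [1.80, 1.68, 1.82, 1.70, 1.87, 1.55, 1.50, 1.78, 1.67, 1.64],
    "age": [35, 39, 25, 60, 27, 18, 89, 42, 16, 52],
    "is_male": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
}

correlations = {
    (name_1, name_2): pearson_r(predictors[name_1], predictors[name_2])
    for name_1, name_2 in combinations(predictors, 2)
}
for (name_1, name_2), r in correlations.items():
    print(f"r({name_1}, {name_2}) = {r:+.3f}")
```

Values close to +1 or -1 would indicate that two independent variables carry nearly the same information.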
More detailed information about multicollinearity can be found here.
Significance test and Regression
Regression analysis is often carried out in order to make statements about the population based on a sample. The regression coefficients are therefore calculated using the data from the sample. To rule out the possibility that the regression coefficients are merely random and would have completely different values in another sample, the results are statistically tested with a significance test. This test takes place at two levels.
- Significance test for the whole regression model
- Significance test for the regression coefficients
It should be noted, however, that the assumptions in the previous section must be met.
Significance test for the regression model
Here it is checked whether the coefficient of determination R^{2} in the population differs from zero. The null hypothesis is therefore that the coefficient of determination R^{2} in the population is zero. To decide whether to reject the null hypothesis, the following F-test statistic is calculated, where n is the number of observations and k is the number of independent variables:
F = (R^{2} / k) / ((1 − R^{2}) / (n − k − 1))
The calculated F-value must now be compared with the critical F-value. If the calculated F-value is greater than the critical F-value, the null hypothesis is rejected and it is concluded that R^{2} deviates from zero in the population. The critical F-value can be read from the F-distribution table. The numerator degrees of freedom are k and the denominator degrees of freedom are n-k-1.
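As a worked illustration, the F-value can be computed for the example at the end of this tutorial, where R^{2} = 0.754, n = 10 observations and k = 3 independent variables. The Python sketch below assumes the usual form of the statistic, F = (R^{2} / k) / ((1 − R^{2}) / (n − k − 1)). The function name `f_statistic` is made up, and the critical value 4.76 is taken from a standard F-distribution table for a significance level of 0.05 with 3 numerator and 6 denominator degrees of freedom.

```python
def f_statistic(r2, n, k):
    """F = (R^2 / k) / ((1 - R^2) / (n - k - 1))."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))

# Values from the worked example of this tutorial
f_value = f_statistic(r2=0.754, n=10, k=3)

# Critical value from an F table: alpha = 0.05, df = (k, n - k - 1) = (3, 6)
F_CRITICAL = 4.76

print(f"F = {f_value:.2f}")
if f_value > F_CRITICAL:
    print("Reject the null hypothesis: R^2 differs from zero in the population.")
else:
    print("Do not reject the null hypothesis.")
```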
Significance test for the regression coefficients
The next step is to check which variables contribute significantly to the prediction of the dependent variable. This is done by checking whether the slopes (regression coefficients) also differ from zero in the population. For each coefficient, the following test statistic is calculated:
t = b_{j} / s_{b_j}
where b_{j} is the j^{th} regression coefficient and s_{b_j} is the standard error of b_{j}. This test statistic is t-distributed with the degrees of freedom n-k-1. The critical t-value can be read from the t-distribution table.
Calculate with DATAtab
You can recalculate the example directly with DATAtab for free.
As an example of linear regression, a model is set up that predicts the body weight of a person. The dependent variable is thus the body weight, while the height, age and gender are chosen as independent variables. The following example data set is available:
Weight (kg) | Height (m) | Age (years) | Gender |
---|---|---|---|
79 | 1.80 | 35 | Male |
69 | 1.68 | 39 | Male |
73 | 1.82 | 25 | Male |
95 | 1.70 | 60 | Male |
82 | 1.87 | 27 | Male |
55 | 1.55 | 18 | Female |
69 | 1.50 | 89 | Female |
71 | 1.78 | 42 | Female |
64 | 1.67 | 16 | Female |
69 | 1.64 | 52 | Female |
After you have copied your data into the statistics calculator, you must select the variables that are relevant for you. Then you receive the results in table form.
Interpretation of the results
The model summary shows that 75.4% of the variation in weight can be explained by height, age and gender. When predicting a person's weight, the model is off by about 6.587 kg on average, which is the standard estimation error.
Weight = 47.379 · Height + 0.297 · Age + 8.922 · is_male − 24.41
The equation shows, for example, that if the age increases by one year, the weight increases by 0.297 kg according to the model. In the case of the dichotomous variable gender, the slope is to be interpreted as the difference: according to the model, a man weighs 8.922 kg more than a woman of the same height and age. If all independent variables are zero, the result is a weight of -24.41 kg. This value has no practical meaning, because a height and age of zero lie far outside the observed data.
The standardized coefficients beta are reported separately and usually lie between -1 and +1. The greater the absolute value of beta, the greater the contribution of the respective independent variable to explaining the dependent variable. In this regression analysis, the variable age has the greatest influence on the variable weight.
The calculated coefficients refer to the sample used for the regression analysis, so it is of interest whether the b-values deviate from zero only by chance or whether they are also different from zero in the population. For this purpose, the null hypothesis is formulated that the respective b-value is equal to zero in the population. If this is the case, it means that the respective independent variable has no significant influence on the dependent variable.
The p-value indicates whether a variable has a significant influence. p-values smaller than 0.05 are considered as significant. In this example, only age can be considered as a significant predictor of the weight of a person.
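The worked example can be recalculated with a short, dependency-free Python sketch. It is not DATAtab's implementation: it solves the least-squares problem via the normal equations with a small hand-written matrix inversion, and the function names `invert` and `ols` are assumptions. Instead of p-values, which would need the t-distribution, the t-values are compared with the critical value 2.447 from a standard t table (two-sided, significance level 0.05, n − k − 1 = 6 degrees of freedom).

```python
from math import sqrt

def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity matrix: [M | I]
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def ols(X, y):
    """Return coefficients, standard errors and t-values of an OLS fit."""
    n, p = len(X), len(X[0])  # p = k + 1 because of the constant
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)]
           for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    xtx_inv = invert(xtx)
    coef = [sum(xtx_inv[i][j] * xty[j] for j in range(p)) for i in range(p)]
    resid = [yi - sum(c * v for c, v in zip(coef, row))
             for row, yi in zip(X, y)]
    s2 = sum(e * e for e in resid) / (n - p)  # residual variance, df = n-k-1
    std_err = [sqrt(s2 * xtx_inv[j][j]) for j in range(p)]
    t_values = [c / s for c, s in zip(coef, std_err)]
    return coef, std_err, t_values

# Example data set of this tutorial (is_male: 1 = male, 0 = female)
height = [1.80, 1.68, 1.82, 1.70, 1.87, 1.55, 1.50, 1.78, 1.67, 1.64]
age = [35, 39, 25, 60, 27, 18, 89, 42, 16, 52]
is_male = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
weight = [79, 69, 73, 95, 82, 55, 69, 71, 64, 69]

# Design matrix: a leading 1 in each row models the constant a
X = [[1.0, h, a, m] for h, a, m in zip(height, age, is_male)]
names = ["constant", "height", "age", "is_male"]
coef, std_err, t_values = ols(X, weight)

T_CRITICAL = 2.447  # t table: two-sided, alpha = 0.05, df = 6
for name, b, se, t in zip(names, coef, std_err, t_values):
    print(f"{name:9s} b = {b:8.3f}  s_b = {se:7.3f}  t = {t:6.3f}")
significant = [name for name, t in zip(names[1:], t_values[1:])
               if abs(t) > T_CRITICAL]
print("significant predictors:", significant)
```

For larger or less well-conditioned data sets, a numerical library is a better choice than inverting the matrix by hand. The sketch is only meant to make the calculation steps visible.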
Presenting the results of the regression
When presenting your results, you should include the estimated effect, that is, the regression coefficient, the standard error of the estimate, and the p-value. Of course, it is also useful to interpret the regression results so that everyone knows what the regression coefficients mean.
For example: a significant relationship (p = .041) was found between a person's weight and a person's age.
If a simple linear regression was calculated, the result can also be displayed using a scatter plot.