Regression analysis
Regression is a statistical method that allows modeling relationships between a dependent variable and one or more independent variables.
A regression analysis makes it possible to infer or predict the value of one variable from one or more other variables.
For example, you might be interested in what influences a person's salary. To find out, you could record each person's highest level of education, weekly working hours, and age.
You could then investigate whether these three variables have an influence on a person's salary. If so, you can use the highest level of education, the weekly working hours, and the age of a person to predict that person's salary.
What are dependent and independent variables?
The variable to be inferred is called the dependent variable (criterion). The variables used for prediction are called independent variables (predictors).
Thus, in the example above, salary is the dependent variable and highest educational attainment, weekly hours worked, and age are the independent variables.
When do we use a regression analysis?
A regression analysis can pursue two goals. On the one hand, it can measure the influence of one or more variables on another variable; on the other hand, it can be used to predict a variable from one or more other variables. For example:
1) Measurement of the influence of one or more variables on another variable
- What influences children's ability to concentrate?
- Do the educational level of the parents and the place of residence affect the future educational attainments of children?
2) Prediction of a variable by one or more other variables
- How long does a patient stay in the hospital?
- What product is a person most likely to buy from an online store?
The regression analysis thus provides information about how the value of the dependent variable changes if one of the independent variables is changed.
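To make this concrete, here is a minimal sketch of a simple linear regression fitted by ordinary least squares in plain Python. The data are invented for illustration, and the names `fit_simple_regression`, `hours`, and `salary` are assumptions that do not come from the tutorial itself.

```python
def fit_simple_regression(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = sum of cross-deviations / sum of squared deviations of x
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Invented example data: weekly working hours and monthly salary
hours = [20, 25, 30, 35, 40]
salary = [1500, 1800, 2100, 2400, 2700]

intercept, slope = fit_simple_regression(hours, salary)

# The slope says how much the predicted salary changes per additional hour
predicted_salary_45h = intercept + slope * 45
print(f"intercept={intercept:.1f}, slope={slope:.1f}")
print(f"predicted salary at 45 hours: {predicted_salary_45h:.1f}")
```

In this made-up data set each additional weekly hour goes along with 60 more units of salary, which is exactly what the slope reports.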
Types of regression analysis
Regression analyses are divided into simple linear regression, multiple linear regression and logistic regression. The type of regression analysis to use depends on the number of independent variables and the scale of measurement of the dependent variable.
Type of regression | Number of independent variables | Scale of measurement of the dependent variable | Scale of measurement of the independent variables |
Simple linear regression | one | metric | metric, ordinal, nominal |
Multiple linear regression | multiple | metric | metric, ordinal, nominal |
Logistic regression | multiple | ordinal, nominal | metric, ordinal, nominal |
If you want to use only one variable for prediction, a simple regression is used. If you use more than one variable, you need to perform a multiple regression. If the dependent variable is metrically scaled, a linear regression is used; if it is nominally scaled, a logistic regression must be calculated. Whether a linear or a non-linear regression is appropriate depends on the relationship itself: a linear regression requires a linear relationship between the independent variables and the dependent variable.
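These rules of thumb can be written down as a small decision helper. This is only a sketch of the rules stated here: the function name `choose_regression` and the scale labels are assumptions made for illustration, and the helper does not check whether the relationship is actually linear.

```python
def choose_regression(n_independent, dependent_scale):
    """Suggest a regression type from the number of independent variables
    and the scale of measurement of the dependent variable."""
    if n_independent < 1:
        raise ValueError("at least one independent variable is required")
    if dependent_scale == "metric":
        if n_independent == 1:
            return "simple linear regression"
        return "multiple linear regression"
    if dependent_scale in ("nominal", "ordinal"):
        return "logistic regression"
    raise ValueError(f"unknown scale of measurement: {dependent_scale!r}")


print(choose_regression(1, "metric"))   # one predictor, metric outcome
print(choose_regression(3, "metric"))   # several predictors, metric outcome
print(choose_regression(2, "nominal"))  # categorical outcome
```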
Independent variable of the regression
No matter which regression is calculated, the scale level of the independent variables can take any form (metric, ordinal and nominal). However, if there is an ordinal or nominal variable with more than two values, so-called dummy variables must be formed.
Dummy variables and Reference category
When an independent variable is categorical, it is encoded as a set of binary dummy variables before being included in the regression model.
When dummy variables are created, a variable with several categories is made into several variables with only 2 categories each.
One of the categories is set as the reference category and a new variable is created for each of the remaining categories.
Let's take an example to illustrate this. Suppose you are studying the effect of education level (a categorical variable with three levels: high school, college, and graduate) on salary. In order to include this categorical variable in a regression model, it needs to be encoded as dummy variables.
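Let's say we use high school as the reference category and create two dummy variables: is_college and is_graduate. The variable is_college, for example, takes the value 1 if the individual has a college degree and 0 otherwise; is_graduate works the same way for a graduate degree. A person whose highest level is high school has the value 0 on both dummy variables.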
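The encoding described here can be sketched in a few lines of plain Python. The helper name `dummy_code` and the sample values are assumptions chosen for illustration; statistics packages usually offer their own, more complete routines for this.

```python
def dummy_code(values, reference):
    """Encode a categorical variable as 0/1 dummy variables.

    The reference category gets no column of its own; it is the case
    in which all dummy variables are 0.
    """
    categories = sorted(set(values) - {reference})
    return {
        f"is_{category}": [1 if value == category else 0 for value in values]
        for category in categories
    }


# Invented sample: highest education level of five people
education = ["high school", "college", "graduate", "college", "high school"]
dummies = dummy_code(education, reference="high school")

for name, column in dummies.items():
    print(name, column)
```

Three categories produce two dummy variables, because the reference category is already identified by both of them being 0.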
Control Variable (covariate)
In regression analysis, a control variable (also known as a "covariate") is an additional independent variable that is included in the regression model to account for potential confounding factors. The primary purpose of including control variables is to isolate the relationship of interest between the main independent variable(s) and the dependent variable, ensuring that the observed relationship is not being driven by some other unobserved factors.
Inclusion of control variables can help in several ways:
- Reducing omitted variable bias: If there's a variable that affects both the dependent variable and one of the independent variables and it's not included in the model, the coefficient on the independent variable could be biased. Including the control variable helps to reduce or eliminate this bias.
- Increasing precision: Controlling for additional sources of variability can reduce the residual variance, leading to more precise estimates.
- Accounting for confounding: In many cases, the relationship between two variables might be spurious because of a third variable that influences both. Including this third variable as a control can help reveal the true relationship.
Example
Let's say you're studying the effect of exercise on weight loss. Age might also influence weight loss (metabolism changes as we age) and might be related to how much someone exercises (maybe younger people exercise more). If you ignore age, you might mistakenly attribute the entire effect on weight loss to exercise, when age also plays a role. By including age as a control variable in your regression, you can better isolate the specific impact of exercise on weight loss.
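The exercise example can be sketched with invented numbers. In the data below, weight loss was constructed as 0.5 * exercise - 0.1 * age + 8, so the "true" exercise effect is known to be 0.5. The helper `centered_cross_sum` and all values are assumptions for illustration; the two-predictor coefficients use the standard closed-form least-squares solution rather than any particular library.

```python
def centered_cross_sum(a, b):
    """Sum of products of deviations from the means of a and b."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))


# Invented data: younger people exercise more, and age lowers weight loss
age = [20, 30, 40, 50, 60]
exercise = [6, 5, 3, 2, 1]                 # hours per week
weight_loss = [9.0, 7.5, 5.5, 4.0, 2.5]    # 0.5*exercise - 0.1*age + 8

s_ee = centered_cross_sum(exercise, exercise)
s_aa = centered_cross_sum(age, age)
s_ea = centered_cross_sum(exercise, age)
s_ey = centered_cross_sum(exercise, weight_loss)
s_ay = centered_cross_sum(age, weight_loss)

# Model 1: exercise only (age is omitted)
slope_without_control = s_ey / s_ee

# Model 2: exercise and age (closed-form least squares for two predictors)
det = s_ee * s_aa - s_ea ** 2
slope_exercise = (s_aa * s_ey - s_ea * s_ay) / det
slope_age = (s_ee * s_ay - s_ea * s_ey) / det

print(f"exercise effect without age: {slope_without_control:.2f}")
print(f"exercise effect with age:    {slope_exercise:.2f}")
print(f"age effect:                  {slope_age:.2f}")
```

Without the control variable, the exercise coefficient comes out at about 1.26, because it also absorbs the effect of age. With age included, the estimate returns to the 0.5 that was built into the data.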
Considerations
It is crucial to be thoughtful about which control variables to include in a model. Including irrelevant control variables can unnecessarily complicate the model and reduce the power of the analysis. On the other hand, omitting important controls can lead to biased estimates. Proper theoretical reasoning and empirical diagnostic tests can guide the choice of control variables.
Correlation and causality in regression analysis
In the case of linear regression, the independent variable can be used to predict the dependent variable if there is a correlation between the two variables. However, it is important to note that a correlation between two variables does not necessarily mean causality. So what does this mean? If high values of one variable are accompanied by high values of the other variable, it does not follow that one variable increases because the other one increases.
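One way to see why correlation alone says nothing about the direction of an effect is that the correlation coefficient is symmetric: swapping the two variables gives the same value. The following sketch computes the Pearson correlation in plain Python; the function name `pearson_r` and the data are invented for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


# Invented data for two correlated variables
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r_xy = pearson_r(x, y)
r_yx = pearson_r(y, x)
print(f"r(x, y) = {r_xy:.3f}, r(y, x) = {r_yx:.3f}")
```

Both calls return the same value, so the number itself cannot tell you whether x influences y, y influences x, or a third variable drives both.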
Examples of a regression
Simple linear regression
Does the weekly working time have an influence on the hourly wage of employees?
Multiple linear regression
Do the weekly working time and the age of employees have an influence on their hourly wage?
Logistic regression
Do the weekly working time and the age of employees have an influence on the probability that they are at risk of burnout?
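For the logistic case, the model does not predict the outcome directly but the probability of the outcome. A minimal sketch of how such a probability is obtained from the independent variables is shown below. The coefficients `b0`, `b_hours`, and `b_age` are purely hypothetical placeholders and not estimates from any real data; in practice they would be estimated from a data set.

```python
from math import exp


def burnout_probability(hours, age, b0=-8.0, b_hours=0.15, b_age=0.02):
    """Logistic model: P = 1 / (1 + exp(-(b0 + b_hours*hours + b_age*age))).

    The default coefficients are hypothetical and for illustration only.
    """
    z = b0 + b_hours * hours + b_age * age
    return 1.0 / (1.0 + exp(-z))


# Same age, different weekly working time
p_40 = burnout_probability(hours=40, age=35)
p_60 = burnout_probability(hours=60, age=35)
print(f"P(burnout risk | 40 h/week) = {p_40:.2f}")
print(f"P(burnout risk | 60 h/week) = {p_60:.2f}")
```

Because the logistic function maps any value to the range between 0 and 1, the result can always be read as a probability, and with a positive coefficient for working time the probability rises as the weekly hours increase.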
Calculate regression
Only three simple steps are necessary and the regression calculator will give you all important key figures:
1. Copy your data into the table of the statistics calculator
2. Click on Regression
3. Select a dependent variable and one or more independent variables
If one of the independent variables has a categorical level of measurement (ordinal or nominal), dummy variables are automatically generated and a reference category is defined. As soon as a series contains only numbers, the statistics calculator automatically defines it as a metric variable.