Correlation analysis
What is a correlation analysis?
Correlation analysis is a statistical method used to evaluate the relationship between two variables, such as the association between body height and shoe size.
The strength of this relationship is measured by the correlation coefficient, which ranges from -1 to +1. A coefficient close to +1 indicates a strong positive correlation, while a value near -1 signifies a strong negative correlation. Values around zero suggest little to no relationship. Correlation analyses can thus be used to make a statement about the strength and direction of the correlation.
Example
You want to find out whether there is a connection between the age at which a child speaks its first sentences and its later success at school.
Correlation and causality
If correlation analysis reveals a relationship between two variables, it is possible to further investigate whether one variable can be used to predict the other. For instance, if a correlation is found, one could examine whether the age at which a child first speaks sentences can be used to predict their future academic success through linear regression analysis.
However, caution is necessary! Correlations do not imply causation. Any identified correlations should be examined in greater detail and not immediately interpreted as causal relationships, even if a connection seems obvious.
Correlation and causality example:
If the correlation between sales figures and price is analysed and a strong correlation is identified, it would be logical to assume that sales figures are influenced by the price (and not vice versa). This assumption can, however, by no means be proven on the basis of a correlation analysis.
Furthermore, the correlation between variables x and y may be produced by a third variable z; see Partial Correlation for more information.
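To make the role of a third variable more concrete, here is a minimal sketch of the standard first-order partial correlation formula, which removes the linear influence of z from the correlation between x and y. The function name and the three input coefficients are hypothetical and chosen only for illustration.

```python
import math

def partial_correlation(r_xy: float, r_xz: float, r_yz: float) -> float:
    """First-order partial correlation of x and y, controlling for z."""
    numerator = r_xy - r_xz * r_yz
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    return numerator / denominator

# Hypothetical coefficients: x and y each correlate strongly with z,
# and their own correlation is exactly what z alone would produce.
r_xy, r_xz, r_yz = 0.56, 0.8, 0.7
adjusted = partial_correlation(r_xy, r_xz, r_yz)
print(round(adjusted, 3))  # practically zero once z is controlled for
```

In this made-up case the apparent correlation of 0.56 between x and y disappears after controlling for z, which is the situation the paragraph above warns about.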
However, in some cases, the nature of the variables allows a causal direction to be assumed from the outset. For example, if a correlation is found between age and salary, it is evident that age influences salary rather than the reverse. Otherwise, it would imply that reducing one's salary could somehow make a person younger, which is clearly nonsensical.
Interpret correlation
With the help of correlation analysis two statements can be made:
- one about the direction
- and one about the strength
of the linear relationship between two metric or ordinally scaled variables. The direction indicates whether the correlation is positive or negative, while the strength indicates whether the correlation between the variables is strong or weak.
Positive correlation
A positive correlation exists if larger values of the variable x are accompanied by larger values of the variable y, and the other way around. Height and shoe size, for example, correlate positively and the correlation coefficient lies between 0 and 1, i.e. a positive value.
Negative correlation
A negative correlation exists if larger values of the variable x are accompanied by smaller values of the variable y, and the other way around. The product price and the sales quantity usually have a negative correlation; the more expensive a product is, the smaller the sales quantity. In this case, the correlation coefficient is between -1 and 0, so it assumes a negative value.
Strength of correlation
With regard to the strength of the correlation coefficient r, the following table can be used as a guide:
Amount of r (absolute value) | Strength of correlation |
---|---|
0.0 to below 0.1 | no correlation |
0.1 to below 0.3 | little correlation |
0.3 to below 0.5 | medium correlation |
0.5 to below 0.7 | high correlation |
0.7 to 1 | very high correlation |
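The guide table can be turned into a small helper, sketched below. The function name is an assumption made for this example; the thresholds are the ones from the table, applied to the absolute value of r so that the sign only indicates the direction.

```python
def correlation_strength(r: float) -> str:
    """Map a correlation coefficient to the verbal labels of the guide table."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and +1")
    size = abs(r)  # the strength ignores the sign of r
    if size < 0.1:
        return "no correlation"
    if size < 0.3:
        return "little correlation"
    if size < 0.5:
        return "medium correlation"
    if size < 0.7:
        return "high correlation"
    return "very high correlation"

print(correlation_strength(0.86))   # very high correlation
print(correlation_strength(-0.25))  # little correlation
```

Such labels are only a rule of thumb; what counts as a strong correlation can differ between fields.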
Scatter plot and correlation
Looking at the two variables in a scatter plot is just as important as looking at the correlation coefficient.
The scatter plot gives you a rough estimate of whether there is a correlation, whether it is linear or nonlinear, and whether there are outliers.
Test correlation for significance
If there is a correlation in the sample, it is still necessary to test whether there is enough evidence that the correlation also exists in the population. Thus, the question arises when a correlation coefficient can be considered statistically significant.
The significance of correlation coefficients can be tested using a t-test. As a rule, it is tested whether the correlation coefficient is significantly different from zero, i.e. linear independence is tested. In this case, the null hypothesis is that there is no correlation between the variables under consideration. In contrast, the alternative hypothesis assumes that there is a correlation.
As with any other hypothesis test, the significance level is set first, usually at 5%. If the calculated p-value is below 5%, the null hypothesis is rejected in favor of the alternative hypothesis, and it is assumed that there is a relationship between the variables in the population.
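The t-value for testing the hypothesis is given by

$$t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}}$$

where n is the sample size and r is the correlation determined in the sample. Under the null hypothesis, this t-value follows a t-distribution with n - 2 degrees of freedom. The corresponding p-value can be easily calculated in the correlation calculator on DATAtab.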
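As a rough illustration of this test, the sketch below computes the t-value as r·sqrt(n − 2)/sqrt(1 − r²) and approximates the two-sided p-value by numerically integrating the density of the t-distribution with n − 2 degrees of freedom. The function names are hypothetical, and the Simpson integration is just one simple way to get the p-value without a statistics library; it is intended for small samples and is not a replacement for a dedicated tool.

```python
import math

def t_statistic(r: float, n: int) -> float:
    """t-value for testing whether a correlation differs from zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

def t_density(x: float, df: int) -> float:
    """Probability density of Student's t-distribution."""
    norm = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return norm * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p_value(t: float, df: int, steps: int = 2000) -> float:
    """Approximate P(|T| >= |t|) with Simpson's rule (steps must be even)."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_density(0.0, df) + t_density(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    area = total * h / 3          # probability mass between 0 and |t|
    return max(0.0, 1 - 2 * area)  # both tails together

# Assumed inputs for illustration: r = 0.86 observed in a sample of n = 10.
t = t_statistic(0.86, 10)
p = two_sided_p_value(t, df=10 - 2)
print(f"t = {t:.2f}, p = {p:.4f}")
```

With these inputs the p-value falls well below 5%, so the null hypothesis of no correlation would be rejected.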
Directional and non-directional hypotheses
With correlation analysis you can test directional and non-directional correlation hypotheses.
Non-directional correlation hypothesis:
You are only interested in whether there is a relationship or correlation between two variables, for example, whether there is a correlation between age and salary, but you are not interested in the direction of this correlation.
Directional correlation hypothesis:
You are also interested in the direction of the correlation, i.e. whether there is a positive or negative correlation between the variables.
Your alternative hypothesis is then, for example, that age and salary are positively correlated. What you need to pay attention to with a directional hypothesis is covered at the end of the example below.
Pearson correlation analysis
With the Pearson correlation analysis you get a statement about the linear correlation between metric scaled variables. The respective covariance is used for the calculation. The covariance gives a positive value if there is a positive correlation between the variables and a negative value if there is a negative correlation. The sample covariance is calculated as:

$$\operatorname{cov}(x, y) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

However, the covariance is not standardized and can assume values between plus and minus infinity. This makes it difficult to compare the strength of relationships between different variables. For this reason, the correlation coefficient, also called the product-moment correlation coefficient, is calculated. The correlation coefficient is obtained by normalizing the covariance. For this normalization, the standard deviations of the two variables involved (the square roots of their variances) are used, and the correlation coefficient is calculated as

$$r = \frac{\operatorname{cov}(x, y)}{s_x \, s_y}$$

The Pearson correlation coefficient can now take values between -1 and +1 and can be interpreted as follows:
- The value +1 means that there is an entirely positive linear relationship (the more, the more).
- The value -1 indicates that an entirely negative linear relationship exists (the more, the less).
- With a value of 0 there is no linear relationship, i.e. the variables do not correlate with each other.
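The calculation described above, the covariance divided by the product of the two standard deviations, can be sketched in a few lines. The function names and the small data set (hours studied and test scores) are hypothetical and serve only to show why the covariance depends on the unit of measurement while r does not.

```python
import math

def covariance(x, y):
    """Sample covariance with the n - 1 denominator."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def pearson_r(x, y):
    """Covariance normalized by the two standard deviations."""
    s_x = math.sqrt(covariance(x, x))  # cov(x, x) is the variance of x
    s_y = math.sqrt(covariance(y, y))
    return covariance(x, y) / (s_x * s_y)

# Hypothetical data: study time in hours and the resulting test score.
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 70]
minutes = [h * 60 for h in hours]  # same information, different unit

print(covariance(hours, score), covariance(minutes, score))  # differ by a factor of 60
print(pearson_r(hours, score), pearson_r(minutes, score))    # practically identical
```

Changing the unit from hours to minutes multiplies the covariance by 60 but leaves the correlation coefficient unchanged, which is why r is the quantity used to compare relationships.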
Now finally the strength of the relationship can be interpreted. This can be illustrated by the following table:
Amount of r (absolute value) | Strength of correlation |
---|---|
0.0 to below 0.1 | no correlation |
0.1 to below 0.3 | little correlation |
0.3 to below 0.5 | medium correlation |
0.5 to below 0.7 | high correlation |
0.7 to 1 | very high correlation |
To check in advance whether a linear relationship exists, scatter plots should be considered. This way, the respective relationship between the variables can also be checked visually. The Pearson correlation is only useful and purposeful if linear relationships are present.
Pearson Correlation assumptions
For the Pearson correlation to be used, there must be a linear relationship between the variables, and for the significance test the variables should also be normally distributed. The normal distribution can be tested either analytically or graphically with a Q-Q plot. Whether the variables have a linear correlation is best checked with a scatter plot. If these conditions are not met, the Spearman correlation is used instead.
Spearman rank correlation
Spearman correlation analysis is used to calculate the relationship between two variables that have at least an ordinal level of measurement. Spearman rank correlation is the non-parametric equivalent of Pearson correlation analysis. This procedure is therefore used when the prerequisites for the Pearson correlation (a parametric procedure) are not met, i.e. when the data are not metric or not normally distributed. The terms "Spearman correlation" and "Spearman's rho" are often used to refer to the Spearman rank correlation.
The questions that can be answered with the Spearman rank correlation are similar to those of the Pearson correlation coefficient, i.e. "Is there a correlation between two variables or characteristics?" For example: "Is there a correlation between age and religiousness in the French population?"
The calculation of the rank correlation is based on the ranking system of the data series. This means that the measured values are not used for the calculation, but are transformed into ranks. The test is then performed using these ranks.
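The rank transformation can be sketched as follows: the values are replaced by their ranks, tied values share the average of the ranks they occupy, and the Pearson formula is then applied to the ranks. The function names and the example data are assumptions made for this illustration.

```python
import math

def average_ranks(values):
    """Rank from 1 (smallest); tied values get the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over all entries tied with the one at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

print(average_ranks([10, 20, 20, 30]))  # [1.0, 2.5, 2.5, 4.0]

# Hypothetical data with a monotonic but clearly nonlinear relationship.
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]
print(spearman_rho(x, y), pearson_r(x, y))
```

Because only the order of the values matters, the rank correlation of the second example is 1 even though the relationship is not linear, while the Pearson coefficient of the raw values stays below 1.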
For the rank correlation coefficient ρ, values between -1 and 1 are possible. If the value is less than zero (ρ < 0), there is a negative monotonic relationship. If the value is greater than zero (ρ > 0), there is a positive monotonic relationship. If the value is zero (ρ = 0), there is no monotonic relationship between the variables. As with the Pearson correlation coefficient, the strength of the correlation can be classified as follows:
Amount of ρ (absolute value) | Strength of correlation |
---|---|
0.0 to below 0.1 | no correlation |
0.1 to below 0.3 | little correlation |
0.3 to below 0.5 | medium correlation |
0.5 to below 0.7 | high correlation |
0.7 to 1 | very high correlation |
Point biserial correlation
The point biserial correlation is used when one of the variables is dichotomous, e.g. studied and not studied, and the other has metric scale level, e.g. salary.
The calculation of a point biserial correlation is the same as the calculation of the Pearson correlation. To calculate it, one of the two categories of the dichotomous variable is coded as 0 and the other as 1.
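A minimal sketch of this coding step is shown below. The data are invented for illustration, and the variable and function names are assumptions; the point is only that the dichotomous variable is recoded to 0 and 1 and then passed to the ordinary Pearson formula.

```python
import math

def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical sample: whether a person studied, and their monthly salary.
studied = ["yes", "no", "yes", "no", "yes", "no"]
salary = [4200, 3100, 3900, 3300, 4500, 3000]

# Code one category as 1 and the other as 0.
coded = [1 if answer == "yes" else 0 for answer in studied]
r_pb = pearson_r(coded, salary)

# Swapping the coding only flips the sign, not the strength.
r_pb_swapped = pearson_r([1 - c for c in coded], salary)
print(round(r_pb, 2), round(r_pb_swapped, 2))
```

Which category receives the 1 is arbitrary, so the sign of the coefficient has to be read together with the coding that was chosen.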
Calculate correlation analysis with DATAtab
A student wants to know if there is a correlation between the height and weight of the participants in the statistics course. For this purpose, the student drew a sample, which is described in the table below.
Height | Weight |
---|---|
1.62 | 53 |
1.72 | 71 |
1.85 | 85 |
1.82 | 86 |
1.72 | 76 |
1.55 | 62 |
1.65 | 68 |
1.77 | 77 |
1.83 | 97 |
1.53 | 65 |
To analyze the linear relationships by means of a correlation analysis, you can calculate a correlation with DATAtab. First copy the table above into the statistics calculator. Then click on "Correlation" and select the two variables from the example. Finally you will get the following results.
First, you will get the null and the alternative hypothesis. The null hypothesis is: "There is no correlation between height and weight". Then you get the correlation coefficient and the p value. If you click on Summary in words, you will get the following interpretation:
A Pearson correlation analysis was performed to test whether there is a relationship between height and weight. The result of the Pearson correlation analysis showed that there was a significant relationship between height and weight, r(8) = 0.86, p = 0.001.
There is a very high, positive correlation between the variables of height and weight, r= 0.86. Thus, there is a very high, positive correlation in this sample between height and weight.
Directional (one-sided) correlation hypothesis
Of course, in DATAtab you can also choose to test a directional hypothesis.
In this case, you must first check whether the correlation is at all in the direction of the alternative hypothesis, i.e. that height and weight are positively correlated. If this is the case, the calculated p-value must be divided by two, since only one side of the distribution is considered. However, DATAtab takes care of these two steps for you. The summary in words then looks like this:
A Pearson correlation analysis was performed to test whether there is a positive relationship between height and weight. The result of the Pearson correlation analysis showed that there was a significant positive relationship between height and weight, r(8) = 0.86, p < 0.001.
There is a very high positive correlation between the variables of height and weight, r= 0.86. Thus, there is a very high, positive correlation in this sample between height and weight.
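The two steps for a directional hypothesis, checking the direction and then halving the p-value, can be sketched as below. The function name and the input values are hypothetical. The handling of the case where the observed direction contradicts the hypothesis (returning 1 - p/2) is the usual convention for a one-sided test and is not taken from DATAtab itself.

```python
def one_sided_p_value(two_sided_p: float, r: float, expected_direction: str) -> float:
    """Convert a two-sided p-value into a one-sided one for a directional hypothesis."""
    if expected_direction not in ("positive", "negative"):
        raise ValueError("expected_direction must be 'positive' or 'negative'")
    matches = r > 0 if expected_direction == "positive" else r < 0
    if matches:
        # Observed direction agrees with the hypothesis: only one tail counts.
        return two_sided_p / 2
    # Observed direction contradicts the hypothesis: no evidence in its favor.
    return 1 - two_sided_p / 2

# Hypothetical result: r = 0.45 with a two-sided p-value of 0.04.
print(one_sided_p_value(0.04, 0.45, "positive"))  # 0.02
print(one_sided_p_value(0.04, 0.45, "negative"))  # 0.98
```

The direction of the hypothesis should be fixed before looking at the data; choosing it afterwards to halve the p-value would overstate the evidence.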