# Correlation analysis

## What is a correlation analysis?

Correlation analysis is a statistical technique that gives you information about the relationship between variables.

**Correlation analysis** investigates the **relationship** between
variables. The strength of the **correlation** is expressed by the
**correlation coefficient**, which ranges from -1 to +1. A correlation analysis
thus lets you make a statement about both the strength and the direction of the relationship.

### Example

You want to find out whether there is a connection between the age at which a child speaks its first sentences and its later success at school.

## Correlation and causality

If the correlation analysis shows that two characteristics are related to each other, it can subsequently be checked whether one characteristic can be used to predict the other characteristic. If the correlation mentioned in the example is confirmed, for example, it can be checked whether school success can be predicted by the age at which a child speaks its first sentences by means of a linear regression.

But beware! Correlations need not be **causal relationships**. Any correlation you
discover should therefore be investigated more closely and never be interpreted
causally straight away, even if a causal reading seems obvious.

### Correlation and causality example:

If the correlation between sales figures and price is analysed and a strong correlation is identified, it would be logical to assume that sales figures are influenced by the price (and not vice versa). This assumption can, however, by no means be proven on the basis of a correlation analysis.

Furthermore, the correlation between the variables *x* and *y* may be
produced by a third variable *z*; see Partial Correlation for more information.

However, depending on which variables you use, you may be able to rule out one causal direction right from the start. For example, if there is a correlation between age and salary, it is clear that salary cannot influence age; otherwise everyone would want to earn as little as possible :)

## Interpret correlation

With the help of correlation analysis two statements can be made:

- one about the direction
- and one about the strength

of the linear relationship between two metric or ordinally scaled variables. The direction indicates whether the correlation is positive or negative, while the strength indicates whether the correlation between the variables is strong or weak.

### Positive correlation

A positive correlation exists if larger values of the variable *x* are
accompanied by larger values of the variable *y*. Height and shoe size, for
example, correlate positively and the correlation coefficient lies between 0 and 1,
i.e. a positive value.

### Negative correlation

A negative correlation exists if larger values of the variable *x* are
accompanied by smaller values of the variable *y*. The product price and the
sales quantity usually have a negative correlation; the more expensive a product is,
the smaller the sales quantity. In this case, the correlation coefficient is between
-1 and 0, so it assumes a negative value.

### Strength of correlation

With regard to the strength of the correlation coefficient *r*, the following
table can be used as a guide:

| \|r\| | Strength of correlation |
|---|---|
| 0.0 to < 0.1 | no correlation |
| 0.1 to < 0.3 | little correlation |
| 0.3 to < 0.5 | medium correlation |
| 0.5 to < 0.7 | high correlation |
| 0.7 to 1 | very high correlation |
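As a rough illustration, the guide table can be written as a small lookup. This is only a sketch: the function name `correlation_strength` is made up for this example, and it assumes that the thresholds apply to the absolute value of *r*, so that negative coefficients are graded by their size.

```python
def correlation_strength(r: float) -> str:
    """Map a correlation coefficient to the verbal labels of the guide table."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    size = abs(r)  # assumption: the sign only encodes the direction
    if size < 0.1:
        return "no correlation"
    if size < 0.3:
        return "little correlation"
    if size < 0.5:
        return "medium correlation"
    if size < 0.7:
        return "high correlation"
    return "very high correlation"


print(correlation_strength(0.86))   # very high correlation
print(correlation_strength(-0.25))  # little correlation
```

Such labels are a convention, not a law; what counts as a "high" correlation can differ between fields.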

**Tip:** On DATAtab you can
calculate the correlation coefficient
directly online.

## Scatter plot and correlation

Looking at the relationship between two variables graphically in a scatter plot is just as important as looking at the correlation coefficient.

The scatter plot gives you a rough estimate of whether there is a correlation, whether it is linear or nonlinear, and whether there are outliers.
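The following sketch shows why the plot matters. It uses made-up data in which *y* is completely determined by *x* through a quadratic relationship, yet the Pearson coefficient comes out as zero because the relationship is not linear. The helper `pearson_r` is written out by hand for this example and is not a DATAtab function.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: a symmetric parabola around x = 0.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]

r = pearson_r(x, y)
print(f"r = {r:.2f}")  # no linear relationship, although y depends entirely on x
```

A scatter plot of these points would show the U-shape immediately, while the coefficient alone would suggest that the variables are unrelated.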

## Test correlation for significance

If there is a correlation in the sample, it is still necessary to test whether there is enough evidence that the correlation also exists in the population. Thus, the question arises when a correlation coefficient can be considered statistically significant.

The significance of a correlation coefficient can be tested using a t-test. As a rule, it is tested whether the correlation coefficient is significantly different from zero, i.e. whether there is any linear relationship at all. In this case, the null hypothesis is that there is no correlation between the variables under consideration. In contrast, the alternative hypothesis assumes that there is a correlation.

As with any other hypothesis test, the significance level is set first, usually at 5%. If the calculated p-value is below 5%, the null hypothesis is rejected in favor of the alternative hypothesis, and it is assumed that there is a relationship between the variables in the population.

The t-value for testing the hypothesis is given by

$$t = \frac{r \cdot \sqrt{n - 2}}{\sqrt{1 - r^2}}$$

where n is the sample size and r is the correlation determined in the sample. The test statistic has n - 2 degrees of freedom. The corresponding p-value can be easily calculated in the correlation calculator on DATAtab.
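For readers who want to follow the calculation by hand, here is a minimal sketch using only the Python standard library. It computes t = r · √(n − 2) / √(1 − r²) and then obtains the two-sided p-value by numerically integrating the density of Student's t distribution with n − 2 degrees of freedom. The function names are made up for this example, and the numerical integration (Simpson's rule) is a choice made here for self-containment, not necessarily how DATAtab computes the value.

```python
import math


def t_statistic(r: float, n: int) -> float:
    """t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom."""
    if abs(r) >= 1.0:
        raise ValueError("the t statistic is undefined for |r| = 1")
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


def t_pdf(t: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1.0 + t * t / df) ** (-(df + 1) / 2)


def two_sided_p(t: float, df: int, steps: int = 10_000) -> float:
    """P(|T| >= |t|), integrating the density over [0, |t|] with Simpson's rule."""
    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3.0  # P(0 <= T <= |t|)
    return 1.0 - 2.0 * central_area


r, n = 0.86, 10  # values reported in the height and weight example of this article
t = t_statistic(r, n)
p = two_sided_p(t, df=n - 2)
print(f"t({n - 2}) = {t:.2f}, p = {p:.4f}")
```

With a statistics library available, the same p-value would normally be read from the library's t distribution instead of being integrated by hand.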

#### Directional and non-directional hypotheses

With correlation analysis you can test directional and non-directional correlation hypotheses.

##### Non-directional correlation hypothesis:

You are only interested in whether there is a relationship or correlation between two variables, for example, whether there is a correlation between age and salary, but you are not interested in the direction of this correlation.

##### Directional correlation hypothesis:

You are also interested in the direction of the correlation, i.e. whether there is a positive or negative correlation between the variables.

Your alternative hypothesis is then, for example, that there is a positive correlation between age and salary. What you need to pay attention to with a directional hypothesis is covered in the example at the end of this article.

## Pearson correlation analysis

With the Pearson correlation analysis you get a statement about the linear
correlation between metric scaled variables. The respective **covariance** is used
for the calculation. The covariance gives a positive value if there is a positive
correlation between the variables and a negative value if there is a negative
correlation. The covariance is calculated as:

$$\operatorname{cov}(x, y) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

However, the covariance is not standardized and can assume values between
**plus** and **minus infinity**. This makes it difficult to compare the strength
of relationships between different variables. For this reason, the
**correlation coefficient**, also called
**product-moment correlation coefficient**, is calculated. The correlation
coefficient is obtained by normalizing the covariance. For this normalization, the
standard deviations of the two variables involved (the square roots of their
variances) are used and the correlation coefficient is calculated as

$$r = \frac{\operatorname{cov}(x, y)}{s_x \cdot s_y}$$

where $s_x$ and $s_y$ are the standard deviations of *x* and *y*.

The **Pearson correlation coefficient** can now take values between -1 and +1 and can
be interpreted as follows:

- The value +1 means that there is a perfect positive linear relationship (the more, the more).
- The value -1 means that there is a perfect negative linear relationship (the more, the less).
- With a value of 0 there is no linear relationship, i.e. the variables do not correlate with each other.
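The calculation can be sketched in a few lines of Python. The data are the height and weight sample from the DATAtab example later in this article; the function names are chosen for this sketch, and the sample covariance with the n − 1 denominator is assumed. The denominator cancels in *r*, so the coefficient is the same whichever convention is used.

```python
import math

# Height and weight sample from the DATAtab example later in this article.
height = [1.62, 1.72, 1.85, 1.82, 1.72, 1.55, 1.65, 1.77, 1.83, 1.53]
weight = [53, 71, 85, 86, 76, 62, 68, 77, 97, 65]


def mean(values):
    return sum(values) / len(values)


def covariance(x, y):
    """Sample covariance (assumption: n - 1 in the denominator)."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def pearson_r(x, y):
    """Covariance divided by the product of the two standard deviations."""
    sx = math.sqrt(covariance(x, x))  # covariance of x with itself is its variance
    sy = math.sqrt(covariance(y, y))
    return covariance(x, y) / (sx * sy)


cov_hw = covariance(height, weight)
r_hw = pearson_r(height, weight)
print(f"cov = {cov_hw:.3f}")  # depends on the units of measurement
print(f"r   = {r_hw:.2f}")    # unit-free, always between -1 and +1
```

Rescaling the heights from metres to centimetres would multiply the covariance by 100 but leave *r* unchanged, which is exactly why the normalized coefficient is used for comparisons.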

Now finally the strength of the relationship can be interpreted. This can be illustrated by the following table:

| \|r\| | Strength of correlation |
|---|---|
| 0.0 to < 0.1 | no correlation |
| 0.1 to < 0.3 | little correlation |
| 0.3 to < 0.5 | medium correlation |
| 0.5 to < 0.7 | high correlation |
| 0.7 to 1 | very high correlation |

To check in advance whether a linear relationship exists,
**scatter plots** should be considered. This way, the relationship between
the variables can also be checked visually. The Pearson correlation is only
meaningful if the relationship is linear.

### Pearson Correlation assumptions

For the Pearson correlation to be used, there must be a linear relationship between the variables, and for the significance test the variables should also be normally distributed. The normal distribution can be tested either analytically or graphically with the Q-Q plot. Whether the variables have a linear relationship is best checked with a scatter plot. If these conditions are not met, the Spearman correlation is used instead.

## Spearman rank correlation

Spearman correlation analysis is used to calculate the relationship between two variables that have at least an ordinal level of measurement. Spearman rank correlation is the non-parametric equivalent of Pearson correlation analysis. This procedure is therefore used when the prerequisites for the Pearson correlation (a parametric procedure) are not met, i.e. when the data are not metric or not normally distributed. The terms "Spearman correlation" and "Spearman's rho" are often used to refer to the Spearman rank correlation.

The questions that can be answered with the Spearman rank correlation are similar to those of the Pearson correlation coefficient, i.e. "Is there a correlation between two variables or characteristics?" For example: "Is there a correlation between age and religiousness in the French population?"

The calculation of the **rank correlation** is based on the ranking system of the
data series. This means that the measured values are not used for the calculation, but
are transformed into ranks. The test is then performed using these ranks.
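The rank transformation can be sketched as follows. The example assumes the common convention that tied values share the mean of their ranks, and it computes Spearman's coefficient as the Pearson correlation of the two rank series. All function names and the data are made up for this illustration.

```python
import math


def average_ranks(values):
    """Rank from 1 (smallest) upward; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(x, y):
    """Pearson correlation computed on the ranks instead of the measured values."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical data: y always grows with x, but far from linearly.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 100, 1000]
print(average_ranks([10, 20, 20, 30]))  # [1.0, 2.5, 2.5, 4.0]
print(f"rho = {spearman_rho(x, y):.2f}, r = {pearson_r(x, y):.2f}")
```

Because only the order of the values enters the calculation, the rank correlation is perfect here even though the Pearson coefficient of the raw values is clearly below 1.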

For the **rank correlation coefficient ρ**, values between -1 and 1 are possible. If
the value is less than zero (ρ < 0), there is a negative monotonic relationship. If
the value is greater than zero (ρ > 0), there is a positive monotonic relationship. If the
value is zero (ρ = 0), there is no monotonic relationship between the variables. As with the
Pearson correlation coefficient, the strength of the correlation can be classified as
follows:

| \|ρ\| | Strength of correlation |
|---|---|
| 0.0 to < 0.1 | no correlation |
| 0.1 to < 0.3 | little correlation |
| 0.3 to < 0.5 | medium correlation |
| 0.5 to < 0.7 | high correlation |
| 0.7 to 1 | very high correlation |

## Point biserial correlation

The point biserial correlation is used when one of the variables is dichotomous, e.g. studied and not studied, and the other has metric scale level, e.g. salary.

The point biserial correlation is calculated in the same way as the Pearson correlation. To calculate it, one of the two categories of the dichotomous variable is coded as 0 and the other as 1.
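A minimal sketch of this coding step, with an entirely made-up sample of six people (whether they studied, and a salary figure), could look like this. The helper `pearson_r` is written out by hand for the example.

```python
import math


def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data, invented for this illustration only.
studied = ["yes", "no", "yes", "no", "yes", "no"]
salary = [3200, 2100, 2900, 2500, 3500, 2300]

# Code the dichotomous variable: "no" becomes 0, "yes" becomes 1.
coded = [1 if s == "yes" else 0 for s in studied]
r_pb = pearson_r(coded, salary)

# Swapping the coding only flips the sign, not the size, of the coefficient.
r_swapped = pearson_r([1 - c for c in coded], salary)
print(f"r_pb = {r_pb:.2f}, with swapped coding = {r_swapped:.2f}")
```

Which category receives the 1 is therefore a matter of interpretation: a positive coefficient means that the group coded as 1 has the higher mean.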

## Calculate correlation analysis with DATAtab

Calculate the example directly with DATAtab for free:

A student wants to know if there is a correlation between the height and weight of the participants in the statistics course. For this purpose, the student drew a sample, which is described in the table below.

| Height | Weight |
|---|---|
| 1.62 | 53 |
| 1.72 | 71 |
| 1.85 | 85 |
| 1.82 | 86 |
| 1.72 | 76 |
| 1.55 | 62 |
| 1.65 | 68 |
| 1.77 | 77 |
| 1.83 | 97 |
| 1.53 | 65 |

To analyze the linear relationships by means of a correlation analysis, you can calculate a correlation with DATAtab. First copy the table above into the statistics calculator.

Then click on "Correlation" and select the two variables from the example. Finally you will get the following results.

First, you will get the null and the alternative hypothesis. The null hypothesis is: "There is no correlation between height and weight". Then you get the correlation coefficient and the p-value. If you click on *Summary in words*, you will get the following interpretation:

A Pearson correlation analysis was performed to test whether there is a relationship between height and weight. The result of the Pearson correlation analysis showed that there was a significant relationship between height and weight, r(8) = 0.86, p = 0.001.

There is a very high, positive correlation between the variables of height and weight, r= 0.86. Thus, there is a very high, positive correlation in this sample between height and weight.

## Directional (one-sided) correlation hypothesis

Of course, in DATAtab you can also choose to test a directional hypothesis.

In this case, you must first check whether the correlation is at all in the direction of the alternative hypothesis, i.e. that height and weight are positively correlated. If this is the case, the calculated p-value must be divided by two, since only one side of the distribution is considered. However, DATAtab takes care of these two steps for you. The summary in words then looks like this:

A Pearson correlation analysis was performed to test whether there is a positive relationship between height and weight. The result of the Pearson correlation analysis showed that there was a significant positive relationship between height and weight, r(8) = 0.86, p < 0.001.

There is a very high positive correlation between the variables of height and weight, r= 0.86. Thus, there is a very high, positive correlation in this sample between height and weight.
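The two steps described above (check the direction, then halve the p-value) can be sketched as a small helper. The function name and its parameters are made up for this example. The branch for a correlation that points against the hypothesis, where the one-sided p-value becomes 1 − p/2, is the usual rule for a symmetric test distribution and is an addition here, not something taken from DATAtab.

```python
def one_sided_p(r: float, p_two_sided: float, expect_positive: bool = True) -> float:
    """Convert a two-sided p-value into a one-sided one for a directional hypothesis."""
    in_expected_direction = (r > 0) if expect_positive else (r < 0)
    if in_expected_direction:
        # Only one tail of the distribution is considered.
        return p_two_sided / 2.0
    # The sample points the other way, so the evidence cannot support the hypothesis.
    return 1.0 - p_two_sided / 2.0


# Values from the two-sided result reported above: r = 0.86, p = 0.001.
p_pos = one_sided_p(0.86, 0.001, expect_positive=True)
p_neg = one_sided_p(0.86, 0.001, expect_positive=False)
print(p_pos, p_neg)
```

For the hypothesis of a positive correlation the halved value falls below 0.001, which matches the summary; for the hypothesis of a negative correlation the same sample would give a p-value close to 1.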
