 # Pearson Correlation

Pearson correlation analysis examines the relationship between two variables. For example, is there a correlation between a person's age and salary? More specifically, we can use the Pearson correlation coefficient to measure the strength and direction of the linear relationship between two variables.

## Strength and direction of correlation

With a correlation analysis we can determine:

• How strong the correlation is
• and in which direction the correlation goes.

We can read the strength and direction of the correlation from the Pearson correlation coefficient r, whose value ranges from -1 to 1.

### Strength of the correlation

The strength of the correlation can be read from a table, using the amount (absolute value) of r. An amount of r between 0 and 0.1 indicates no correlation. An amount of r between 0.7 and 1 indicates a very high correlation.

| Amount of r | Strength of correlation |
|---|---|
| 0.0 to < 0.1 | no correlation |
| 0.1 to < 0.3 | low correlation |
| 0.3 to < 0.5 | medium correlation |
| 0.5 to < 0.7 | high correlation |
| 0.7 to 1 | very high correlation |

From Kuckartz et al.: Statistik, Eine verständliche Einführung, 2013, p. 213
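The table lookup can be sketched as a small Python function. This is only an illustration: the function name `correlation_strength` is hypothetical, and treating an amount of exactly 1 as "very high correlation" is an assumption about how the last row is meant.

```python
def correlation_strength(r: float) -> str:
    """Map a correlation coefficient to the verbal label from the table.

    Only the amount (absolute value) of r matters for the strength;
    the sign carries the direction and is ignored here.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    amount = abs(r)
    if amount < 0.1:
        return "no correlation"
    if amount < 0.3:
        return "low correlation"
    if amount < 0.5:
        return "medium correlation"
    if amount < 0.7:
        return "high correlation"
    # Assumption: the upper bound 1 is included in the last row.
    return "very high correlation"


print(correlation_strength(-0.62))  # prints "high correlation"
```

Because the label depends only on the amount of r, `-0.62` and `0.62` receive the same strength.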

### Direction of the correlation

A positive relationship or correlation exists when large values of one variable are associated with large values of the other variable, or when small values of one variable are associated with small values of the other variable. A positive correlation results, for example, for height and shoe size. This results in a positive correlation coefficient.

A negative correlation is when large values of one variable are associated with small values of the other variable and vice versa. A negative correlation is usually found between product price and sales volume. This results in a negative correlation coefficient.

## Calculate Pearson correlation

The Pearson correlation coefficient is calculated using the following equation:

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \cdot \sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

Here $r$ is the Pearson correlation coefficient, $x_i$ are the individual values of one variable, e.g. age, $y_i$ are the individual values of the other variable, e.g. salary, and $\bar{x}$ and $\bar{y}$ are the mean values of the two variables respectively. In the equation, we can see that the respective mean value is first subtracted from each variable.

So in our example, we calculate the mean values of age and salary. We then subtract the respective mean value from each person's age and salary. For each person, we multiply the two resulting differences, and we then sum up the individual results of the multiplication. The expression in the denominator ensures that the correlation coefficient is scaled between -1 and 1.
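The steps just described can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function name `pearson_r` is hypothetical, and the age and salary values are made up for the example.

```python
from math import sqrt


def pearson_r(x, y):
    """Compute the Pearson correlation coefficient step by step."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y need the same length and at least 2 values")

    # Step 1: mean value of each variable.
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)

    # Step 2: subtract the mean from every individual value.
    dx = [xi - mean_x for xi in x]
    dy = [yi - mean_y for yi in y]

    # Step 3: multiply the paired differences and sum them up (numerator).
    numerator = sum(a * b for a, b in zip(dx, dy))

    # Step 4: the denominator scales the result to the range -1..1.
    denominator = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denominator == 0:
        raise ValueError("r is undefined when one variable is constant")
    return numerator / denominator


# Made-up example data (assumption, not from a real survey).
age = [25, 32, 41, 48, 56]
salary = [2400, 3100, 3600, 4200, 4500]
print(round(pearson_r(age, salary), 3))
```

The check for a zero denominator is a design choice of this sketch: if one variable never varies, there is nothing to correlate and the coefficient is not defined.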

If we now multiply two positive values we get a positive value. If we multiply two negative values we also get a positive value (minus times minus is plus). So all values that lie in these ranges have a positive influence on the correlation coefficient.

If we multiply a positive value and a negative value we get a negative value (minus times plus is minus). So all values that are in these ranges have a negative influence on the correlation coefficient.

Therefore, if our values lie predominantly in the two green areas of the figures (both values above their means, or both below), we get a positive correlation coefficient and therefore a positive correlation.

If our values lie predominantly in the two red areas of the figures (one value above its mean, the other below), we get a negative correlation coefficient and thus a negative correlation.

If the points are distributed over all four areas, the positive terms and the negative terms cancel each other out and we might end up with a very small or no correlation.
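To make the sign argument concrete, the following sketch counts how many products of differences are positive and how many are negative, and sums them. The helper name `product_signs` and all data values are assumptions chosen for illustration. They mimic the height/shoe size and price/sales volume examples from the text.

```python
def product_signs(x, y):
    """Return (number of positive products, number of negative products, sum).

    Each product is (x_i - mean of x) * (y_i - mean of y), i.e. one term
    of the numerator of the Pearson correlation coefficient.
    """
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    products = [(xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)]
    positive = sum(1 for p in products if p > 0)
    negative = sum(1 for p in products if p < 0)
    return positive, negative, sum(products)


# Made-up data: taller people tend to have larger shoe sizes.
height = [160, 168, 175, 182, 190]
shoe_size = [37, 39, 42, 44, 46]

# Made-up data: a higher price tends to go with a lower sales volume.
price = [5, 6, 7, 8, 9]
units_sold = [90, 80, 65, 50, 40]

print(product_signs(height, shoe_size))  # mostly positive products
print(product_signs(price, units_sold))  # mostly negative products
```

A point that sits exactly on one of the two means contributes a product of zero and is counted in neither group.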

## Testing correlation coefficients for significance

In general, the correlation coefficient is calculated using data from a sample. In most cases, however, we want to test a hypothesis about the population. In the case of correlation analysis, we then want to know if there is a correlation in the population.

For this, we test whether the correlation coefficient in the sample is statistically significantly different from zero.

### Hypotheses in the Pearson Correlation

The null hypothesis and the alternative hypothesis in Pearson correlation are thus:

• Null hypothesis: The correlation coefficient is not significantly different from zero (there is no linear relationship).
• Alternative hypothesis: The correlation coefficient deviates significantly from zero (there is a linear relationship).

Note: The test only ever decides whether the null hypothesis is rejected or not rejected.

In our example with the salary and the age of a person, we could thus ask: Is there a correlation between age and salary in the German population (the population of interest)?

To find out, we draw a sample and test whether the correlation coefficient is significantly different from zero in this sample.

• The null hypothesis is then: There is no correlation between salary and age in the German population.
• and the alternative hypothesis: There is a correlation between salary and age in the German population.

## Significance and the t-test

Whether the Pearson correlation coefficient is significantly different from zero based on the sample surveyed can be checked using a t-test:

$$
t = \frac{r \cdot \sqrt{n - 2}}{\sqrt{1 - r^2}}
$$

Here, $r$ is the correlation coefficient and $n$ is the sample size. A p-value can then be calculated from the test statistic $t$, which under the null hypothesis follows a t-distribution with $n - 2$ degrees of freedom. If the p-value is smaller than the specified significance level, which is usually 5%, then the null hypothesis is rejected, otherwise it is not.
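As a rough sketch of this test using only the Python standard library, the code below computes the usual test statistic t = r·√(n − 2) / √(1 − r²) and a two-sided p-value from a t-distribution with n − 2 degrees of freedom. The function names are hypothetical, and integrating the density numerically with Simpson's rule is a simplification chosen for this example. If SciPy is available, `scipy.stats.pearsonr` reports the coefficient together with a p-value and is the more practical choice.

```python
from math import exp, lgamma, pi, sqrt


def t_statistic(r, n):
    """Test statistic for H0: no linear relationship in the population."""
    if n < 3:
        raise ValueError("at least 3 observations are needed")
    if abs(r) >= 1:
        raise ValueError("t is undefined for r = -1 or r = 1")
    return r * sqrt(n - 2) / sqrt(1 - r * r)


def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    c = exp(lgamma((df + 1) / 2) - lgamma(df / 2)) / sqrt(df * pi)
    return c * (1 + t * t / df) ** (-(df + 1) / 2)


def two_sided_p_value(t, df, steps=10_000):
    """Approximate P(|T| >= |t|) with Simpson's rule on [0, |t|].

    `steps` must be even; 10_000 is an arbitrary choice for this sketch.
    """
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # probability mass between 0 and |t|
    return max(0.0, 1 - 2 * area)


# Hypothetical sample: r = 0.5 observed in n = 27 people.
r, n = 0.5, 27
t = t_statistic(r, n)
p = two_sided_p_value(t, df=n - 2)
print(f"t = {t:.3f}, p = {p:.4f}, reject H0 at 5%: {p < 0.05}")
```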

## Assumptions of the Pearson correlation

But what about the assumptions for a Pearson correlation? Here we have to distinguish whether we just want to calculate the Pearson correlation coefficient, or whether we want to test a hypothesis.

To calculate the Pearson correlation coefficient, the only requirement is that both variables are metric. Metric variables are, for example, a person's weight, a person's salary or electricity consumption.

The Pearson correlation coefficient then tells us how strong the linear relationship is. If there is a non-linear correlation, we cannot read it from the Pearson correlation coefficient.

However, if we want to test whether the Pearson correlation coefficient is significantly different from zero, i.e. we want to test a hypothesis, the two variables must also be normally distributed! If this is not given, the calculated test statistic t or the p-value cannot be interpreted reliably. If the assumptions are not met, Spearman's rank correlation can be used.
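Both points can be illustrated with a short sketch. It first shows a perfectly regular but non-linear relationship (y = x²) for which the Pearson coefficient is zero, and then computes Spearman's rank correlation as the Pearson coefficient of the ranks. The function names and data are assumptions for this example, and the simple `ranks` helper assumes there are no tied values.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [xi - mean_x for xi in x]
    dy = [yi - mean_y for yi in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    return numerator / sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


def ranks(values):
    """Assign ranks 1..n by size (assumption: no ties in the data)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson applied to the ranks."""
    return pearson_r(ranks(x), ranks(y))


# Non-linear, symmetric relationship: Pearson does not detect it.
x = [-3, -2, -1, 0, 1, 2, 3]
y_parabola = [v * v for v in x]
print(pearson_r(x, y_parabola))

# Monotonic but non-linear relationship: Spearman reaches 1, Pearson does not.
x2 = [1, 2, 3, 4, 5]
y_exp = [2 ** v for v in x2]
print(round(pearson_r(x2, y_exp), 3), spearman_rho(x2, y_exp))
```

Note that Spearman's coefficient captures monotonic relationships only, so it would also be zero for the parabola example.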

## Calculate Pearson correlation online with DATAtab

If you like, you can of course calculate a correlation analysis online with DATAtab. To do this, simply copy your data into the table in the statistics calculator and click on either the Hypothesis tests or Correlation tab.

If you now look at two metric variables, a Pearson correlation will be calculated automatically. If you don't know exactly how to interpret the results, you can also just click on Summary in words!

Cite DATAtab: DATAtab Team (2023). DATAtab: Online Statistics Calculator. DATAtab e.U. Graz, Austria. URL https://datatab.net