# Correlation analysis

## What is a correlation analysis?

Correlation analysis is a statistical technique that gives you information about the relationship between variables.

A **correlation analysis** is calculated to investigate the **relationship**
between variables. The strength of the **correlation** is determined by the
**correlation coefficient**, which ranges from -1 to +1. A correlation
analysis can thus be used to make a statement about both the strength and the
direction of the relationship.

### Example

You want to find out whether there is a connection between the age at which a child speaks its first sentences and its later success at school.

## Correlation and causality

If the correlation analysis shows that two characteristics are related, it can subsequently be checked whether one characteristic can be used to predict the other. If the correlation from the example above is confirmed, a linear regression could then be used to test whether school success can be predicted from the age at which a child speaks its first sentences.

But beware! Correlations need not be **causal relationships**. Any
correlation that is discovered should therefore be investigated more closely,
but never immediately interpreted causally, even if such an interpretation
seems obvious.

### Example: correlation and causality

If the correlation between sales figures and price is analysed and a strong correlation is found, it would seem logical to assume that sales figures are influenced by the price (and not vice versa). However, this assumption can by no means be proven by a correlation analysis alone.

Furthermore, it can happen that the correlation between the variables x and y is actually produced by a third variable z; see Partial Correlation for more information.

However, depending on which variables you use, you may be able to assume a causal direction from the outset. For example, if there is a correlation between age and salary, it is clear that age influences salary and not the other way around; otherwise everyone would want to earn as little salary as possible : )

## Interpret correlation

With the help of correlation analysis two statements can be made, one about

- the direction and
- the strength

of the linear relationship between two metric or ordinally scaled variables. The direction indicates whether the correlation is positive or negative.

### Positive correlation

A positive correlation exists if larger values of variable A are accompanied by larger values of variable B. Height and shoe size, for example, correlate positively, resulting in a correlation coefficient between 0 and 1, i.e. a positive value.

### Negative correlation

A negative correlation exists if larger values of variable A are accompanied by smaller values of variable B. The product price and the sales quantity usually have a negative correlation; the more expensive a product is, the smaller the sales quantity. In this case, the correlation coefficient is between -1 and 0, so it assumes a negative value.

### Strength of correlation

With regard to the strength of the correlation, the following table can be taken as a guide:

| Amount of r | Strength of correlation |
|---|---|
| 0.0 ≤ r < 0.1 | no correlation |
| 0.1 ≤ r < 0.3 | little correlation |
| 0.3 ≤ r < 0.5 | medium correlation |
| 0.5 ≤ r < 0.7 | high correlation |
| 0.7 ≤ r ≤ 1 | very high correlation |

**Tip:** On DATAtab you can calculate the correlation coefficient directly online.

## Scatter plot and correlation

Just as important as the correlation coefficient itself is the graphical examination of the relationship between two variables in a scatter plot.

The scatter plot gives you a rough estimate of whether there is a correlation, whether it is linear or nonlinear, and whether there are outliers.

## Test correlation for significance

If there is a correlation in the sample, it is still necessary to test whether there is enough evidence that the correlation also exists in the population. Thus, the question arises when a correlation coefficient can be considered statistically significant.

The significance of correlation coefficients can be tested with a t-test. As a rule, it is tested whether the correlation coefficient differs significantly from zero, i.e. whether any linear relationship exists at all. The null hypothesis is then that there is no correlation between the variables under consideration, while the alternative hypothesis assumes that there is a correlation.

As with any other hypothesis test, the significance level is set first, usually at 5%. If the calculated p-value is below 5%, the null hypothesis is rejected and the alternative hypothesis is accepted. Thus, if the p-value is below 5%, it is assumed that a relationship between the variables exists in the population.

The t-value for testing the hypothesis is given by

t = r · √(n − 2) / √(1 − r²)

where n is the sample size and r is the correlation determined in the sample. The corresponding p-value can easily be calculated with the correlation calculator on DATAtab.
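This test statistic can be computed in a few lines of Python. The following is a minimal sketch (the function name is ours, not DATAtab's):

```python
import math

def correlation_t_value(r: float, n: int) -> float:
    """t-statistic for testing H0: rho = 0; it has n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Example: a correlation of r = 0.5 observed in a sample of n = 30
t = correlation_t_value(0.5, 30)
print(round(t, 3))  # → 3.055, to be compared with the t-distribution, df = 28
```

The p-value is then obtained from the t-distribution with n − 2 degrees of freedom, e.g. with a statistics library or the calculator on DATAtab.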

#### Directed and undirected hypotheses

With correlation analysis you can test directed and undirected correlation hypotheses.

##### Undirected correlation hypothesis:

You are only interested in whether there is a relationship or correlation between two variables, for example, whether there is a correlation between age and salary, but you are not interested in the direction of this correlation.

##### Directed correlation hypothesis:

You are also interested in the direction of the correlation, i.e. whether there is a positive or negative correlation between the variables.

Your alternative hypothesis is then, for example, that age has a positive influence on salary. What you have to pay attention to in the case of a directed hypothesis is covered in the example at the bottom of this page.

## Pearson correlation analysis

With the **Pearson correlation analysis** you get a statement about the
linear correlation between metric scaled variables. The respective
**covariance** is used for the calculation. The covariance gives a positive
value if there is a positive correlation between the variables and a negative
value if there is a negative correlation. The covariance is calculated using

cov(X, Y) = (1 / (n − 1)) · Σ (xᵢ − x̄)(yᵢ − ȳ)

However, the covariance is not normalized and can assume values between
**plus** and **minus infinity**. This makes it difficult to compare the
strength of relationships between different variables. For this reason, the
**correlation coefficient**, also called **product-moment correlation**,
is calculated. The correlation coefficient is obtained by normalizing the
covariance. For this normalization, the variances of the two variables
involved are used and the correlation coefficient is calculated as

r = cov(X, Y) / (s_X · s_Y)

where s_X and s_Y are the standard deviations (the square roots of the variances) of the two variables.

The **Pearson correlation coefficient** can now take values between -1 and
+1 and can be interpreted as follows

- The value +1 means that there is an entirely positive linear relationship (the more, the more).
- The value -1 indicates that an entirely negative linear relationship exists (the more, the less).
- With a value of 0 there is no linear relationship, i.e. the variables do not correlate with each other.

Finally, the strength of the relationship can be interpreted, with the following table serving as a guide:

| Amount of r | Strength of correlation |
|---|---|
| 0.0 ≤ r < 0.1 | no correlation |
| 0.1 ≤ r < 0.3 | little correlation |
| 0.3 ≤ r < 0.5 | medium correlation |
| 0.5 ≤ r < 0.7 | high correlation |
| 0.7 ≤ r ≤ 1 | very high correlation |

To check in advance whether a linear relationship exists, **scatter
plots** should be examined. In this way, the relationship between the
variables can also be checked visually before the calculation. The Pearson
correlation is only meaningful if a linear relationship is present.

The calculated correlation coefficient can now also be tested for
**significance**. This serves to find out whether the correlation found
also applies to the population. If a two-tailed hypothesis is to be tested,
the null hypothesis is that there is no correlation between the two
characteristics in the population. In the directional case, it is tested
whether a positive (or negative) correlation exists in the population.

An example of this would be: There is a positive correlation in the population and therefore the following alternative hypothesis is formulated: "The greater a person's climate awareness, the greater his/her sustainability awareness".

In order to calculate the probability of observing the sample correlation when no correlation exists in the population, a test statistic is required. Mathematically speaking, this test statistic follows a t-distribution with n − 2 degrees of freedom (df).

With the help of the test value it can finally be decided whether the null hypothesis is maintained or rejected, i.e. whether or not there is a positive correlation between climate awareness and sustainability awareness in the population.

### Pearson Correlation assumptions

For the Pearson correlation to be used, the variables must be normally distributed and there must be a linear relationship between them. Normality can be tested either analytically or graphically with a Q-Q plot. Whether the variables are linearly related is best checked with a scatter plot.

If these conditions are not met, then the Spearman correlation is used.

## Spearman rank correlation

Spearman correlation analysis is used to calculate the relationship between two variables that are at least ordinally scaled. Spearman rank correlation is the non-parametric equivalent of the Pearson correlation analysis. It is therefore used when the prerequisites of the parametric procedure are not met, i.e. when the data are not metric or not normally distributed. In this context, it is often referred to simply as "Spearman correlation" or "Spearman's rho".

The questions that can be addressed with Spearman rank correlation are similar to those of the Pearson correlation coefficient, for example: "Is there a correlation between age and religiousness in the French population?"

The calculation of the **rank correlation** is based on the ranking system
of the data series. This means that the measured values are not used for the
calculation, but are transformed into ranks. The test is then performed using
these ranks.
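A minimal sketch of this procedure in Python (function names are ours): the raw values are converted to ranks, with tied values receiving the average of the ranks they occupy, and the Pearson formula is then applied to the ranks:

```python
import math

def ranks(values):
    """Rank the values from 1 upward; tied values share the average of their ranks."""
    ordered = sorted(values)
    rank_of = {}
    for v in set(values):
        first = ordered.index(v) + 1      # lowest rank position the value occupies
        count = ordered.count(v)
        rank_of[v] = first + (count - 1) / 2
    return [rank_of[v] for v in values]

def pearson_r(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_rho(x, y):
    """Spearman's rho = Pearson correlation of the rank-transformed data."""
    return pearson_r(ranks(x), ranks(y))

# Illustrative ordinal data, e.g. two raters' scores for four items
print(round(spearman_rho([10, 20, 30, 40], [1, 3, 2, 4]), 2))  # → 0.8
```

Because only ranks enter the calculation, the result is unaffected by any monotonic transformation of the raw values.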

For the **rank correlation coefficient ρ**, values between -1 and 1 are
possible. A value less than zero (ρ < 0) indicates a negative monotonic
relationship, a value greater than zero (ρ > 0) a positive monotonic
relationship, and a value of zero (ρ = 0) no relationship between the
variables. The strength of the correlation can be classified in the same way
as for the Pearson correlation coefficient:

| Amount of r | Strength of correlation |
|---|---|
| 0.0 ≤ r < 0.1 | no correlation |
| 0.1 ≤ r < 0.3 | little correlation |
| 0.3 ≤ r < 0.5 | medium correlation |
| 0.5 ≤ r < 0.7 | high correlation |
| 0.7 ≤ r ≤ 1 | very high correlation |

## Point biserial correlation

The point biserial correlation is used when one of the variables is dichotomous, e.g. studied and not studied, and the other has metric scale level, e.g. salary.

The calculation of a point biserial correlation is the same as the calculation of the Pearson correlation. To calculate it, one of the two values of the dichotomous variable is coded as 0 and the other as 1.

## Calculate correlation analysis with DATAtab

Calculate the example directly with DATAtab for free:

A student wants to know whether there is a correlation between the height and weight of the participants in the statistics course. For this purpose, the student drew a sample, which is described in the table below.

| Body height (m) | Weight (kg) |
|---|---|
| 1.62 | 53 |
| 1.72 | 71 |
| 1.85 | 85 |
| 1.82 | 86 |
| 1.72 | 76 |
| 1.55 | 62 |
| 1.65 | 68 |
| 1.77 | 77 |
| 1.83 | 97 |
| 1.53 | 65 |

To analyze the linear relationships by means of a correlation analysis, you can calculate a correlation with DATAtab. First copy the table above into the statistics calculator.

Then click on "Correlation" and select the two variables from the example. Finally you will get the following results.

First, you will get the null and the alternative hypothesis. The null hypothesis is: "There is no correlation between height and weight." Then you get the correlation coefficient and the p-value. If you click on Summary in words, you will get the following interpretation:

A Pearson correlation analysis was performed to test whether there is a relationship between height and weight. The result of the Pearson correlation analysis showed that there was a significant relationship between height and weight, r(8) = 0.86, p = 0.001.

There is a very high positive correlation between height and weight in this sample, r = 0.86.
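The reported result can be reproduced in a few lines of Python (a sketch, not DATAtab's implementation), using the sample from the table above:

```python
import math

height = [1.62, 1.72, 1.85, 1.82, 1.72, 1.55, 1.65, 1.77, 1.83, 1.53]
weight = [53, 71, 85, 86, 76, 62, 68, 77, 97, 65]

n = len(height)
mh, mw = sum(height) / n, sum(weight) / n
sxy = sum((h - mh) * (w - mw) for h, w in zip(height, weight))
sxx = sum((h - mh) ** 2 for h in height)
syy = sum((w - mw) ** 2 for w in weight)

r = sxy / math.sqrt(sxx * syy)                    # Pearson correlation coefficient
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)  # test statistic with df = n - 2
print(round(r, 2), round(t, 2))  # → 0.86 4.77
```

The t-value of about 4.77 with 8 degrees of freedom corresponds to the reported p = 0.001 in the two-tailed case.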

## Directed (one-sided) correlation hypothesis

Of course, in DATAtab you can also choose to calculate a directed hypothesis.

In this case, you must first check whether the correlation points in the direction of the alternative hypothesis at all, i.e. whether height and weight are positively correlated. If so, the calculated p-value must be divided by two, since only one side of the distribution is considered. DATAtab takes care of both steps for you. The summary in words then looks like this:

A Pearson correlation analysis was performed to test whether there is a positive relationship between height and weight. The result of the Pearson correlation analysis showed that there was a significant positive relationship between height and weight, r(8) = 0.86, p < .001.

There is a very high positive correlation between height and weight in this sample, r = 0.86.
