 # Correlation analysis

## What is a correlation analysis?

Correlation analysis is a statistical technique that gives you information about the relationship between variables.

A correlation analysis examines how two variables are related. The strength of the relationship is expressed by the correlation coefficient, which ranges from -1 to +1. A correlation analysis can therefore be used to make a statement about both the strength and the direction of the relationship.

### Example

You want to find out whether there is a connection between the age at which a child speaks its first sentences and its later success at school.

## Correlation and causality

If the correlation analysis shows that two characteristics are related to each other, it can subsequently be checked whether one characteristic can be used to predict the other characteristic. If the correlation mentioned in the example is confirmed, for example, it can be checked whether school success can be predicted by the age at which a child speaks its first sentences by means of a linear regression.

But beware! A correlation is not necessarily a causal relationship. Any correlation that is discovered should therefore be investigated more closely and not immediately interpreted as cause and effect, even if that interpretation seems obvious.

### Correlation and causality example:

If the correlation between sales figures and price is analysed and a strong correlation is identified, it would be logical to assume that sales figures are influenced by the price (and not vice versa). This assumption can, however, by no means be proven on the basis of a correlation analysis.

Furthermore, it can happen that the correlation between variable x and y is generated by the variable z, see Partial Correlation for more information.

However, depending on which variables you use, you may be able to speak of a causal relationship right from the start. For example, if there is a correlation between age and salary, it is clear that age influences salary and not the other way around, otherwise everyone would want to earn as little salary as possible : )

## Interpret correlation

With the help of correlation analysis, two statements can be made:

• one about the direction
• and one about the strength

of the linear relationship between two metric or ordinally scaled variables. The direction indicates whether the correlation is positive or negative, while the strength indicates whether the correlation between the variables is strong or weak.

### Positive correlation

A positive correlation exists if larger values of the variable x are accompanied by larger values of the variable y. Height and shoe size, for example, correlate positively and the correlation coefficient lies between 0 and 1, i.e. a positive value.

### Negative correlation

A negative correlation exists if larger values of the variable x are accompanied by smaller values of the variable y. The product price and the sales quantity usually have a negative correlation; the more expensive a product is, the smaller the sales quantity. In this case, the correlation coefficient is between -1 and 0, so it assumes a negative value.

### Strength of correlation

With regard to the strength of the correlation coefficient r, the following table can be used as a guide:

| \|r\| | Strength of correlation |
|---|---|
| 0.0 to < 0.1 | no correlation |
| 0.1 to < 0.3 | little correlation |
| 0.3 to < 0.5 | medium correlation |
| 0.5 to < 0.7 | high correlation |
| 0.7 to 1 | very high correlation |
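The guide table can be turned into a small helper. The following is only a sketch: the function name `describe_correlation` is made up for this illustration, and the thresholds are simply the ones from the table, applied to the absolute value of r.

```python
def describe_correlation(r: float) -> str:
    """Return the direction and strength of a correlation coefficient r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")

    size = abs(r)  # the guide table is applied to the absolute value of r
    if size < 0.1:
        return "no correlation"
    if size < 0.3:
        strength = "little"
    elif size < 0.5:
        strength = "medium"
    elif size < 0.7:
        strength = "high"
    else:
        strength = "very high"

    # The sign of r gives the direction of the relationship.
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} correlation"


print(describe_correlation(0.86))   # very high positive correlation
print(describe_correlation(-0.42))  # medium negative correlation
print(describe_correlation(0.05))   # no correlation
```

Such cut-offs are rules of thumb; what counts as a "high" correlation can differ between fields.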

Tip: On DATAtab you can calculate the correlation coefficient directly online.

## Scatter plot and correlation

Just as important as the consideration of the correlation coefficient is the graphical consideration of the correlation of two variables in a scatter diagram. The scatter plot gives you a rough estimate of whether there is a correlation, whether it is linear or nonlinear, and whether there are outliers.
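Why the plot matters can be shown with a few numbers. The sketch below uses made-up data (not from this article) and a hand-written `pearson_r` helper: a perfectly U-shaped relationship yields a coefficient of zero, and a single outlier can change the sign of an otherwise perfect linear relationship.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Illustrative values only.
x = [-3, -2, -1, 0, 1, 2, 3]
y_curve = [v ** 2 for v in x]      # y depends on x, but in a U shape
y_line = [2 * v + 1 for v in x]    # perfectly linear
y_outlier = y_line[:-1] + [-20]    # same line, last point replaced by an outlier

print(pearson_r(x, y_curve))    # 0.0
print(pearson_r(x, y_line))     # 1.0
print(pearson_r(x, y_outlier))  # negative, although six of seven points lie on a rising line
```

None of these patterns is visible from the coefficient alone, but each is immediately apparent in a scatter plot.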

## Test correlation for significance

If there is a correlation in the sample, it is still necessary to test whether there is enough evidence that the correlation also exists in the population. Thus, the question arises when a correlation coefficient can be considered statistically significant.

The significance of correlation coefficients can be tested using a t-test. As a rule, it is tested whether the correlation coefficient is significantly different from zero, i.e. linear independence is tested. In this case, the null hypothesis is that there is no correlation between the variables under consideration. In contrast, the alternative hypothesis assumes that there is a correlation.

As with any other hypothesis test, the significance level is first set, usually at 5%. If the calculated p-value is below 5%, the null hypothesis is rejected in favour of the alternative hypothesis. In that case, it is assumed that there is a relationship between the variables in the population.

The t-value for testing the hypothesis is given by

$$t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}}$$

where n is the sample size and r is the correlation determined in the sample. The test statistic has n - 2 degrees of freedom. The corresponding p-value can be easily calculated in the correlation calculator on DATAtab.
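As a rough sketch of this test in plain Python, the t-value can be computed from r and n using the standard statistic t = r·√(n − 2)/√(1 − r²), and the two-sided p-value obtained by numerically integrating the density of the t-distribution with n − 2 degrees of freedom. The function names and the choice of Simpson's rule are assumptions made for this illustration; a statistics package would normally do this for you.

```python
from math import gamma, pi, sqrt


def t_statistic(r: float, n: int) -> float:
    """t-value for testing H0: no correlation (undefined for r = +1 or -1)."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)


def t_pdf(t: float, df: int) -> float:
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def two_sided_p(t: float, df: int, steps: int = 10_000) -> float:
    """Two-sided p-value, using Simpson's rule (steps must be even)."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3  # P(0 < T < |t|)
    return max(0.0, 2 * (0.5 - central_area))


# Values from the height and weight example in this article: r = 0.86, n = 10.
r, n = 0.86, 10
t = t_statistic(r, n)
p = two_sided_p(t, df=n - 2)
print(f"t({n - 2}) = {t:.2f}, two-sided p = {p:.4f}")
```

With a significance level of 5%, the null hypothesis is rejected whenever the printed p-value is below 0.05.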

#### Directional and non-directional hypotheses

With correlation analysis you can test directional and non-directional correlation hypotheses.

##### Non-directional correlation hypothesis:

You are only interested in whether there is a relationship or correlation between two variables, for example, whether there is a correlation between age and salary, but you are not interested in the direction of this correlation.

##### Directional correlation hypothesis:

You are also interested in the direction of the correlation, i.e. whether there is a positive or negative correlation between the variables.

Your alternative hypothesis is then, for example, that age and salary are positively correlated. What you have to pay attention to in the case of a directional hypothesis is covered in the directional example at the end of this article.

## Pearson correlation analysis

With the Pearson correlation analysis you get a statement about the linear correlation between metric scaled variables. The respective covariance is used for the calculation. The covariance gives a positive value if there is a positive correlation between the variables and a negative value if there is a negative correlation. The sample covariance is calculated as:

$$\operatorname{cov}(x, y) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

where $\bar{x}$ and $\bar{y}$ are the sample means and n is the sample size.

However, the covariance is not standardized and can assume values between plus and minus infinity. This makes it difficult to compare the strength of relationships between different variables. For this reason, the correlation coefficient, also called product-moment correlation coefficient, is calculated. The correlation coefficient is obtained by normalizing the covariance. For this normalization, the standard deviations $s_x$ and $s_y$ (the square roots of the variances) of the two variables involved are used, and the correlation coefficient is calculated as

$$r = \frac{\operatorname{cov}(x, y)}{s_x \, s_y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

The Pearson correlation coefficient can now take values between -1 and +1 and can be interpreted as follows:

• The value +1 means that there is an entirely positive linear relationship (the more, the more).
• The value -1 indicates that an entirely negative linear relationship exists (the more, the less).
• With a value of 0 there is no linear relationship, i.e. the variables do not correlate with each other.

Now finally the strength of the relationship can be interpreted. This can be illustrated by the following table:

| \|r\| | Strength of correlation |
|---|---|
| 0.0 to < 0.1 | no correlation |
| 0.1 to < 0.3 | little correlation |
| 0.3 to < 0.5 | medium correlation |
| 0.5 to < 0.7 | high correlation |
| 0.7 to 1 | very high correlation |
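To see why the covariance is normalized, it can help to compute both quantities for the same data in two different units. The sketch below uses the height and weight sample from the DATAtab example in this article. It assumes that the heights are given in metres and that the covariance uses the n - 1 denominator; the helper names are chosen for this illustration only.

```python
from math import sqrt

# Sample from the height/weight example in this article.
height_m = [1.62, 1.72, 1.85, 1.82, 1.72, 1.55, 1.65, 1.77, 1.83, 1.53]
weight = [53, 71, 85, 86, 76, 62, 68, 77, 97, 65]


def covariance(x, y):
    """Sample covariance (n - 1 denominator, an assumption of this sketch)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def pearson_r(x, y):
    """Covariance divided by the product of both standard deviations."""
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))


# Same heights, expressed in centimetres (assuming the originals are metres).
height_cm = [100 * h for h in height_m]

# The covariance changes with the unit of measurement ...
print(covariance(height_m, weight), covariance(height_cm, weight))
# ... while the correlation coefficient stays the same.
print(pearson_r(height_m, weight), pearson_r(height_cm, weight))
```

Whether the covariance divides by n or by n - 1 has no effect on r, because the factor cancels in the normalization.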

To check in advance whether a linear relationship exists, scatter plots should be considered. This way, the respective relationship between the variables can also be checked visually. The Pearson correlation is only useful and purposeful if linear relationships are present.

### Pearson Correlation assumptions

For Pearson correlation to be used, the variables must be normally distributed and there must be a linear relationship between the variables. The normal distribution can be tested either analytically or graphically with the Q-Q plot. Whether the variables have a linear correlation is best checked with a scatter plot. If these conditions are not met, then the Spearman correlation is used.

## Spearman rank correlation

Spearman correlation analysis is used to calculate the relationship between two variables that have at least an ordinal level of measurement. Spearman rank correlation is the non-parametric equivalent of Pearson correlation analysis. This procedure is therefore used when the prerequisites for the Pearson correlation (a parametric procedure) are not met, i.e. when the data are not metric or not normally distributed. Spearman rank correlation is often also referred to as "Spearman correlation" or "Spearman's rho".

The questions that can be treated by Spearman rank correlation are similar to those of the Pearson correlation coefficient, i.e. "Is there a correlation between two variables or characteristics?" For example: "Is there a correlation between age and religiousness in the French population?"

The calculation of the rank correlation is based on the ranking system of the data series. This means that the measured values are not used for the calculation, but are transformed into ranks. The test is then performed using these ranks.

For the rank correlation coefficient ρ, values between -1 and 1 are possible. If the value is less than zero (ρ < 0), there is a negative monotonic relationship. If the value is greater than zero (ρ > 0), there is a positive monotonic relationship. If the value is zero (ρ = 0), there is no monotonic relationship between the variables. As with the Pearson correlation coefficient, the strength of the correlation can be classified as follows:

| \|ρ\| | Strength of correlation |
|---|---|
| 0.0 to < 0.1 | no correlation |
| 0.1 to < 0.3 | little correlation |
| 0.3 to < 0.5 | medium correlation |
| 0.5 to < 0.7 | high correlation |
| 0.7 to 1 | very high correlation |
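The rank-based calculation described above can be sketched as follows: the values are first converted to ranks, and the Pearson coefficient is then computed on those ranks. Giving tied values the average of their ranks is a common convention and an assumption of this sketch, as are the helper names and the small illustrative data set.

```python
from math import sqrt


def average_ranks(values):
    """Rank from 1 (smallest); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def pearson_r(x, y):
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman's rho: the Pearson coefficient of the ranked data."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Illustrative data: y always increases with x, but not along a straight line.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]

print(spearman_rho(x, y))  # 1.0, the ordering of x and y is identical
print(pearson_r(x, y))     # below 1, the relationship is not linear
```

Because only the ordering of the values enters the calculation, any strictly increasing transformation of x or y leaves ρ unchanged.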

## Point biserial correlation

The point biserial correlation is used when one of the variables is dichotomous, e.g. studied and not studied, and the other has metric scale level, e.g. salary.

The calculation of a point biserial correlation is the same as the calculation of the Pearson correlation. To calculate it, one of the two categories of the dichotomous variable is coded as 0 and the other as 1.
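A minimal sketch of this coding step follows. The data are hypothetical values invented for this illustration, and `point_biserial` is a made-up helper name. Which category is coded as 1 only affects the sign of the result.

```python
from math import sqrt


def pearson_r(x, y):
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def point_biserial(groups, values, coded_as_one):
    """Code one category as 1, the other as 0, then apply Pearson's formula."""
    codes = [1 if g == coded_as_one else 0 for g in groups]
    return pearson_r(codes, values)


# Hypothetical data, for illustration only.
studied = ["no", "no", "no", "yes", "yes", "yes"]
salary = [2100, 2500, 2300, 3200, 2900, 3500]

r_pb = point_biserial(studied, salary, coded_as_one="yes")
print(round(r_pb, 2))

# Swapping the coding gives the same magnitude with the opposite sign.
print(round(point_biserial(studied, salary, coded_as_one="no"), 2))
```

A positive value here means that the category coded as 1 tends to go along with larger values of the metric variable.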

## Calculate correlation analysis with DATAtab

A student wants to know if there is a correlation between the height and weight of the participants in the statistics course. For this purpose, the student drew a sample, which is described in the table below.

| Height | Weight |
|---|---|
| 1.62 | 53 |
| 1.72 | 71 |
| 1.85 | 85 |
| 1.82 | 86 |
| 1.72 | 76 |
| 1.55 | 62 |
| 1.65 | 68 |
| 1.77 | 77 |
| 1.83 | 97 |
| 1.53 | 65 |

To analyze the linear relationships by means of a correlation analysis, you can calculate a correlation with DATAtab. First copy the table above into the statistics calculator.

Then click on "Correlation" and select the two variables from the example. Finally you will get the following results. First, you will get the null and the alternative hypothesis. The null hypothesis is: "There is no correlation between height and weight". Then you get the correlation coefficient and the p value. If you click on Summary in words, you will get the following interpretation:

A Pearson correlation analysis was performed to test whether there is a relationship between height and weight. The result of the Pearson correlation analysis showed that there was a significant relationship between height and weight, r(8) = 0.86, p = 0.001.

There is a very high, positive correlation between the variables of height and weight, r= 0.86. Thus, there is a very high, positive correlation in this sample between height and weight.

## Directional (one-sided) correlation hypothesis

Of course, in DATAtab you can also choose to test a directional hypothesis. In this case, you must first check whether the correlation is at all in the direction of the alternative hypothesis, i.e. that height and weight are positively correlated. If this is the case, the calculated p-value must be divided by two, since only one side of the distribution is considered. However, DATAtab takes care of these two steps for you. The summary in words then looks like this:

A Pearson correlation analysis was performed to test whether there is a positive relationship between height and weight. The result of Pearson correlation analysis showed that there was a significant positive relationship between height and weight, r(8) = 0.86, p = <0.001.

There is a very high positive correlation between the variables of height and weight, r= 0.86. Thus, there is a very high, positive correlation in this sample between height and weight.
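The two steps behind the directional test (check the direction, then halve the p-value) can be sketched as below. The function name `one_sided_p` is hypothetical, and the input p = 0.001 is the rounded two-sided value reported in the non-directional summary earlier in this example. If the observed correlation points in the opposite direction, the one-sided p-value is 1 - p/2, which is an addition of this sketch and not part of the DATAtab description.

```python
def one_sided_p(r: float, p_two_sided: float, expected: str = "positive") -> float:
    """Convert a two-sided p-value into a one-sided one for a directional hypothesis."""
    if expected not in ("positive", "negative"):
        raise ValueError("expected must be 'positive' or 'negative'")

    # Step 1: does the sample correlation point in the hypothesised direction?
    matches = (r > 0) if expected == "positive" else (r < 0)

    # Step 2: halve the p-value if it does; otherwise the evidence speaks
    # against the alternative hypothesis and the p-value is large.
    return p_two_sided / 2 if matches else 1 - p_two_sided / 2


print(one_sided_p(0.86, 0.001))              # 0.0005
print(one_sided_p(0.86, 0.001, "negative"))  # close to 1
```

The halved value is below 0.001, which matches the "p = <0.001" shown in the directional summary.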

Cite DATAtab: DATAtab Team (2023). DATAtab: Online Statistics Calculator. DATAtab e.U. Graz, Austria. URL https://datatab.net