Cohen's Kappa

Cohen's Kappa is a measure of agreement between two dependent categorical samples, and you use it whenever you want to know if two raters' measurements are in agreement.

In the case of Cohen's Kappa, the variable to be measured by the two raters is a nominal variable.

So if you have a nominal variable and you want to know how much agreement there is between two raters, you would use Cohen's Kappa. If you have an ordinal variable and two raters, you would use Kendall's tau or the weighted Cohen's Kappa, and if you have a metric variable, you would use Pearson's correlation. If you have more than two nominal dependent samples, the Fleiss Kappa is used.

Cohen's Kappa Example

Let's say you have developed a measurement tool, for example a questionnaire, that doctors can use to determine whether a person is depressed or not. Now you give this tool to a doctor and ask her to assess 50 people with it.

For example, your method shows that the first person is depressed, the second person is depressed and the third person is not depressed. The big question now is: Will a second doctor come to the same conclusion?

So, with a second doctor, the result could now look like this: For the first person, both doctors come to the same result, but for the second person, the result differs. You're interested in how strong the agreement between the doctors is, and this is where Cohen's Kappa comes in.

Inter-rater reliability

If the assessments of the two doctors agree very well, the inter-rater reliability is high. And it is this inter-rater reliability that is measured by Cohen's Kappa.

Definition:

Cohen's Kappa (κ) is a statistical measure used to quantify the level of agreement between two raters (or judges, observers, etc.) who each classify items into categories. It's especially useful in situations where decisions are subjective and the categories are nominal (i.e., they do not have a natural order).

Cohen's Kappa is therefore a measure of how reliably two raters measure the same thing.

Use cases for Cohen's Kappa

So far we have considered the case where two people measure the same thing. However, Cohen's Kappa can also be used when the same rater makes the measurement at two different times.

In this case, the Cohen's Kappa score indicates how well the two measurements from the same person agree.

Measuring the agreement

Cohen's Kappa measures the agreement between two dependent categorical samples.

Cohen's Kappa reliability and validity

It is important to note that the Cohen's Kappa coefficient can only tell you how reliably both raters are measuring the same thing. It does not tell you whether what the two raters are measuring is the right thing!

In the first case we speak of reliability (whether both are measuring the same thing) and in the second case we speak of validity (whether both are measuring the right thing). Cohen's Kappa can only be used to measure reliability.

Calculate Cohen's Kappa

Now the question arises, how is Cohen's Kappa calculated? This is not difficult! We create a table with the frequencies of the corresponding answers.

For this we take our two raters, each of whom has rated whether a person is depressed or not. Now we count how often both have measured the same and how often not.

So we make a table with Rater 1 with "not depressed" and "depressed" and Rater 2 with "not depressed" and "depressed". Now we simply keep a tally sheet and count how often each combination occurs.

Let's say our final result is as follows: Both raters rated 17 people as "not depressed." For 19 people, both chose the rating "depressed."

So if both raters gave the same rating, that person is on the diagonal of the table; if they gave different ratings, that person is in one of the off-diagonal cells. Now we want to know how often both raters agree and how often they don't.

Rater 1 and Rater 2 agree that 17 patients are not depressed and 19 are depressed. So both raters agree in 36 cases. In total, 50 people were assessed.

With these numbers, we can now calculate the probability that both raters give the same rating to a person. We do this by dividing 36 by 50. This gives us the following result: In 72% of the cases, both raters give the same rating; in 28% of the cases they rate differently.
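The tally and the observed agreement can be sketched in a few lines of Python. This is only an illustration: the names `pairs` and `table` are made up for the sketch, and the off-diagonal counts (8 and 6) are not stated directly in the example but are inferred from the row and column totals quoted for the raters (25 and 23 "not depressed" ratings).

```python
from collections import Counter

ND, D = "not depressed", "depressed"

# Paired ratings (Rater 1, Rater 2) rebuilt from the example counts.
# Assumption: the off-diagonal counts 8 and 6 are inferred from the
# marginal totals of the two raters, not given directly in the text.
pairs = [(ND, ND)] * 17 + [(ND, D)] * 8 + [(D, ND)] * 6 + [(D, D)] * 19

# Tally sheet: how often does each combination of ratings occur?
table = Counter(pairs)

n = sum(table.values())  # total number of persons assessed
agree = sum(count for (r1, r2), count in table.items() if r1 == r2)  # diagonal
p_o = agree / n  # observed proportion of agreement

print(n, agree, p_o)
```

The diagonal cells are simply the combinations where both raters chose the same label, so the same code would work for more than two categories.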

This gives us the first part we need to calculate Cohen's Kappa. Cohen's Kappa is given by this formula:

κ = (p_o - p_e) / (1 - p_e)

So we have just calculated p_o. But what is p_e?

If both doctors were to answer the question of whether a person is depressed or not by simply tossing a coin, they would still come to the same conclusion in some cases, purely by chance.

And that is exactly what p_e indicates: The hypothetical probability of a random match. But how do you calculate p_e?

To calculate p_e, we first need the sums of the rows and columns. Then we can calculate p_e.

In the first step, we calculate the probability that both raters would randomly arrive at the rating "not depressed."

Rater 1 rated 25 out of 50 people as "not depressed", i.e. 50%.
Rater 2 rated 23 out of 50 people as "not depressed", i.e. 46%.

The overall probability that both raters would say "not depressed" by chance is: 0.5 * 0.46 = 0.23

In the second step, we calculate the probability that the raters would both say "depressed" by chance.

Rater 1 says "depressed" in 25 out of 50 persons, i.e. 50%.
Rater 2 says "depressed" in 27 out of 50 people, i.e. 54%.

The total probability that both raters say "depressed" by chance is: 0.5 * 0.54 = 0.27. Now we can calculate p_e.

If both values are now added, we get the probability that the two raters coincidentally agree. p_e is therefore 0.23 + 0.27 which is equal to 0.50. Therefore, if the doctors had no guidance and simply rolled the dice, the probability of such a match is 50%.
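The two steps above can be reproduced with a short Python sketch. It assumes the crosstab is laid out with Rater 1 in the rows and Rater 2 in the columns; the off-diagonal cells (8 and 6) follow from the totals quoted above, and the variable names are only illustrative.

```python
# Crosstab from the example: rows = Rater 1, columns = Rater 2,
# category order = ["not depressed", "depressed"].
# Assumption: the off-diagonal cells 8 and 6 are derived from the totals.
table = [[17, 8],
         [6, 19]]

n = sum(sum(row) for row in table)
row_totals = [sum(row) for row in table]        # Rater 1: [25, 25]
col_totals = [sum(col) for col in zip(*table)]  # Rater 2: [23, 27]

# Chance agreement per category: product of both raters' marginal shares.
per_category = [(r / n) * (c / n) for r, c in zip(row_totals, col_totals)]

# p_e is the sum over all categories.
p_e = sum(per_category)

print(per_category, p_e)
```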

Now we can calculate the Cohen's Kappa coefficient. We simply substitute p_o = 0.72 and p_e = 0.50, which gives (0.72 - 0.50) / (1 - 0.50) = 0.22 / 0.50, so we get a Kappa value of 0.44 in our example.
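Put together, the whole calculation fits into one small function. The following is a minimal sketch, not a reference implementation: `cohen_kappa` is a hypothetical helper name, and it assumes a square crosstab with Rater 1 in the rows and Rater 2 in the columns.

```python
def cohen_kappa(table):
    """Cohen's Kappa for a square crosstab (rows = rater 1, columns = rater 2)."""
    k = len(table)
    n = sum(sum(row) for row in table)

    # Observed agreement: share of cases on the diagonal.
    p_o = sum(table[i][i] for i in range(k)) / n

    # Expected chance agreement from the row and column totals.
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    p_e = sum(r * c for r, c in zip(row_totals, col_totals)) / n**2

    # Note: undefined (division by zero) if p_e equals 1.
    return (p_o - p_e) / (1 - p_e)


# Example table; the off-diagonal cells 8 and 6 follow from the totals.
kappa = cohen_kappa([[17, 8], [6, 19]])
print(round(kappa, 2))
```

Because the function only relies on the diagonal and the marginal totals, it should also work for nominal variables with more than two categories.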

By the way, in p_o the o stands for "observed". And in p_e, the e stands for "expected". Therefore, p_o is what we actually observed and p_e is what we would expect if it were purely random.

Cohen's Kappa interpretation

Now, of course, we would like to interpret the calculated Cohen's Kappa coefficient. The table of Landis & Koch (1977) can be used as a guide.

Kappa	Level of agreement
>0.8	Almost Perfect
>0.6	Substantial
>0.4	Moderate
>0.2	Fair
0-0.2	Slight
<0	Poor

Therefore, the calculated Cohen's Kappa coefficient of 0.44 indicates moderate reliability or agreement.
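If you want to apply this guide programmatically, a lookup like the following could be used. It is a sketch under the assumption that the thresholds are read exactly as in the table above (strictly greater than each bound); `interpret_kappa` is a hypothetical helper name.

```python
def interpret_kappa(kappa):
    """Map a Kappa value to the Landis & Koch (1977) label from the table."""
    if kappa < 0:
        return "Poor"
    # Thresholds are checked from the highest to the lowest.
    thresholds = [
        (0.8, "Almost Perfect"),
        (0.6, "Substantial"),
        (0.4, "Moderate"),
        (0.2, "Fair"),
    ]
    for lower_bound, label in thresholds:
        if kappa > lower_bound:
            return label
    return "Slight"  # values from 0 up to 0.2


print(interpret_kappa(0.44))
```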

Cohen's Kappa Standard Error (SE)

The Standard Error (SE) of a statistic, like Cohen's Kappa, is a measure of the precision of the estimated value. It indicates the extent to which the calculated value would vary if the study were repeated multiple times on different samples from the same population. Therefore it is a measure of the variability or uncertainty around the Kappa statistic estimate.

Calculating Standard Error of Cohen's Kappa:

The calculation of the SE for Cohen's Kappa involves somewhat complex formulas that account for the overall proportions of each category being rated and the distribution of ratings between the raters. The general formula for the SE of Cohen's Kappa is:

Where n is the total number of items being rated.
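As a rough illustration, one commonly quoted large-sample approximation is SE ≈ sqrt(p_o (1 - p_o) / (n (1 - p_e)^2)). Whether this is exactly the formula intended above is an assumption, and more exact variance formulas exist. The sketch below applies the approximation to the depression example; the function name is hypothetical, and the 95% interval additionally assumes an approximately normal sampling distribution.

```python
import math


def kappa_standard_error(p_o, p_e, n):
    """Approximate large-sample standard error of Cohen's Kappa.

    Assumption: uses the simple approximation
    sqrt(p_o * (1 - p_o) / (n * (1 - p_e)**2)), not an exact variance formula.
    """
    return math.sqrt(p_o * (1 - p_o) / (n * (1 - p_e) ** 2))


# Values from the worked example: p_o = 0.72, p_e = 0.50, n = 50.
se = kappa_standard_error(0.72, 0.50, 50)

# Rough 95% interval around kappa = 0.44 (normal approximation, an assumption).
kappa = 0.44
interval = (kappa - 1.96 * se, kappa + 1.96 * se)

print(round(se, 3), interval)
```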

Interpreting Standard Error

Small Standard Error: A small SE suggests that the sample estimate is likely to be close to the true population value. The smaller the SE, the more precise the estimate is considered to be.

Large Standard Error: A large SE indicates that there is more variability in the estimates from sample to sample and, therefore, less precision. It suggests that if the study were repeated, the resulting estimates could vary widely.

Weighted Cohen's Kappa

The unweighted Cohen's Kappa only registers whether the two raters chose the same category or not; it ignores how far apart two differing ratings are. In the case of an ordinal variable, i.e. a variable with a ranking, such as school grades, it is of course desirable that the gradations are also considered. A difference between "very good" and "satisfactory" is greater than between "very good" and "good".

To take this into account, the weighted Kappa can be calculated. Here, the deviation is included in the calculation. The differences can be taken into account linearly or quadratically.
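A minimal sketch of the weighted Kappa could look as follows. It assumes the common formulation with disagreement weights, where linear weights are |i - j| / (k - 1) and quadratic weights are the square of that, and computes 1 minus the ratio of weighted observed to weighted expected disagreement. The function name and the grade table are purely hypothetical illustration data.

```python
def weighted_kappa(table, weights="linear"):
    """Weighted Cohen's Kappa for an ordinal crosstab (categories in rank order)."""
    if weights not in ("linear", "quadratic"):
        raise ValueError("weights must be 'linear' or 'quadratic'")
    power = 1 if weights == "linear" else 2

    k = len(table)
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]

    observed = 0.0  # weighted observed disagreement
    expected = 0.0  # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            w = (abs(i - j) / (k - 1)) ** power  # 0 on the diagonal
            observed += w * table[i][j] / n
            expected += w * row_totals[i] * col_totals[j] / n**2
    return 1 - observed / expected


# Hypothetical grades: "very good", "good", "satisfactory" (made-up counts).
grades = [[10, 4, 1],
          [3, 12, 5],
          [1, 4, 10]]

print(weighted_kappa(grades, "linear"), weighted_kappa(grades, "quadratic"))
```

With only two categories every disagreement gets the weight 1, so the weighted Kappa reduces to the unweighted one.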

Calculate Cohen's Kappa with DATAtab

Now we will discuss how you can easily calculate Cohen's Kappa for your data online using DATAtab.

Simply go to the Cohen's Kappa calculator and copy your own data into the table. Now click on the tab "Reliability".

All you have to do is click on the variables you want to analyse and Cohen's Kappa will be displayed automatically. First you will see the crosstab and then you can read the calculated Cohen's Kappa coefficient. If you don't know how to interpret the result, just click on interpretations in words.

An inter-rater reliability analysis was performed between the dependent samples Rater1 and Rater2. For this, Cohen's Kappa was calculated, which is a measure of the agreement between two related categorical samples. The Cohen's Kappa showed that there was fair agreement between the samples Rater1 and Rater2 with κ = 0.23.
