Cohen's Kappa is a measure of agreement between two dependent categorical samples, and you use it whenever you want to know if two raters' measurements are in agreement.
In the case of Cohen's Kappa, the variable to be measured by the two raters is a nominal variable.
So if you have a nominal variable and you want to know how much agreement there is between two raters, you would use Cohen's Kappa. If you have an ordinal variable and two raters, you would use Kendall's tau or the weighted Cohen's Kappa, and if you have a metric variable, you would use Pearson's correlation. If you have more than two nominal dependent samples, Fleiss' Kappa is used.
Cohen's Kappa Example
Let's say you have developed a measurement tool, for example a questionnaire, that doctors can use to determine whether a person is depressed or not. Now you give this tool to a doctor and ask her to assess 50 people with it.
For example, your method shows that the first person is depressed, the second person is depressed and the third person is not depressed. The big question now is: Will a second doctor come to the same conclusion?
So, with a second doctor, the result could now look like this: For the first person, both doctors come to the same result, but for the second person, the result differs. You're interested in how strongly the two doctors agree, and this is where Cohen's Kappa comes in.
If the assessments of the two doctors agree very well, the inter-rater reliability is high. And it is this inter-rater reliability that is measured by Cohen's Kappa.
Cohen's Kappa is therefore a measure of inter-rater reliability, that is, of how reliably two raters measure the same thing.
Use cases for Cohen's Kappa
So far we have considered the case where two people measure the same thing. However, Cohen's Kappa can also be used when the same rater makes the measurement at two different times.
In this case, the Cohen's Kappa score indicates how well the two measurements from the same person agree.
Measuring the agreement
Cohen's Kappa measures the agreement between two dependent categorical samples.
Cohen's Kappa reliability and validity
It is important to note that the Cohen's Kappa coefficient can only tell you how reliably both raters are measuring the same thing. It does not tell you whether what the two raters are measuring is the right thing!
In the first case we speak of reliability (whether both are measuring the same thing) and in the second case we speak of validity (whether both are measuring the right thing). Cohen's Kappa can only be used to measure reliability.
Calculate Cohen's Kappa
Now the question arises, how is Cohen's Kappa calculated? This is not difficult! We create a table with the frequencies of the corresponding answers.
For this we take our two raters, each of whom has rated whether a person is depressed or not. Now we count how often both have measured the same and how often not.
So we make a table with Rater 1 with "not depressed" and "depressed" and Rater 2 with "not depressed" and "depressed". Now we simply keep a tally sheet and count how often each combination occurs.
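The tally sheet can be sketched in a few lines of Python. The two rating lists below are hypothetical raw data, not taken from a real study; they are constructed only so that the counts match the numbers used in the example that follows.

```python
from collections import Counter

# Hypothetical raw ratings, built to match the worked example:
# Rater 1 says "not depressed" for the first 25 people, "depressed" for the rest.
rater1 = ["not depressed"] * 25 + ["depressed"] * 25
rater2 = (["not depressed"] * 17 + ["depressed"] * 8      # people 1-25
          + ["not depressed"] * 6 + ["depressed"] * 19)   # people 26-50

# Tally sheet: count how often each (Rater 1, Rater 2) combination occurs
crosstab = Counter(zip(rater1, rater2))

categories = ["not depressed", "depressed"]
for r1 in categories:
    row = [crosstab[(r1, r2)] for r2 in categories]
    print(f"Rater 1 = {r1:<13} -> {row}")
```

Each printed row holds the counts for one rating of Rater 1, with the columns in the order "not depressed", "depressed" for Rater 2.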
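Let's say our final result is as follows: Both raters rated 17 people as "not depressed", and for 19 people both chose the rating "depressed". In the remaining 14 cases the raters disagreed: 8 people were rated "not depressed" by Rater 1 but "depressed" by Rater 2, and 6 people the other way round.
So if both raters measured the same thing, that person is on the diagonal of the table; if they measured something different, that person is off the diagonal. Now we want to know how often both raters agree and how often they don't.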
Rater 1 and Rater 2 agree that 17 patients are not depressed and 19 are depressed. So both raters agree in 36 cases. In total, 50 people were assessed.
With these numbers, we can now calculate the proportion of cases in which both raters make the same assessment. We do this by dividing 36 by 50. This gives us the following result: In 72% of the cases, both raters make the same assessment; in 28% of the cases, their ratings differ.
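As a minimal sketch, the observed agreement can be computed directly from the frequency table. The variable names are illustrative, and the two off-diagonal counts (8 and 6) are derived from the row and column sums of the example.

```python
# Frequency table from the example (rows: Rater 1, columns: Rater 2),
# category order: "not depressed", "depressed".
# The off-diagonal counts 8 and 6 follow from the row and column sums.
table = [[17, 8],
         [6, 19]]

n = sum(sum(row) for row in table)                         # 50 people in total
agreements = sum(table[i][i] for i in range(len(table)))   # 17 + 19 = 36
p_o = agreements / n                                       # observed agreement

print(p_o)  # 0.72
```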
This gives us the first part we need to calculate Cohen's Kappa. Cohen's Kappa is given by this formula:
κ = (po - pe) / (1 - pe)
We have just calculated po, which is 0.72. But what is pe?
If both doctors were to answer the question of whether a person is depressed or not purely by chance, by simply tossing a coin, they would probably come to the same conclusion in some cases, purely by chance.
And that is exactly what pe indicates: The hypothetical probability of a random match. But how do you calculate pe?
To calculate pe, we first need the sums of the rows and columns. Then we can calculate pe.
In the first step, we calculate the probability that both raters would randomly arrive at the rating "not depressed."
- Rater 1 rated 25 out of 50 people as "not depressed", i.e. 50%.
- Rater 2 rated 23 out of 50 people as "not depressed", i.e. 46%.
The overall probability that both raters would say "not depressed" by chance is: 0.5 * 0.46 = 0.23
In the second step, we calculate the probability that the raters would both say "depressed" by chance.
- Rater 1 says "depressed" in 25 out of 50 persons, i.e. 50%.
- Rater 2 says "depressed" in 27 out of 50 people, i.e. 54%.
The total probability that both raters say "depressed" by chance is: 0.5 * 0.54 = 0.27. Now we can calculate pe.
If both values are now added, we get the probability that the two raters coincidentally agree. pe is therefore 0.23 + 0.27 which is equal to 0.50. Therefore, if the doctors had no guidance and simply rolled the dice, the probability of such a match is 50%.
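The two steps above can be sketched in Python as follows. This is only an illustration with made-up variable names; the table is the one from the example, with the off-diagonal counts derived from the row and column sums.

```python
# Frequency table from the example (rows: Rater 1, columns: Rater 2)
table = [[17, 8],
         [6, 19]]

n = sum(map(sum, table))                          # 50
row_totals = [sum(row) for row in table]          # Rater 1: [25, 25]
col_totals = [sum(col) for col in zip(*table)]    # Rater 2: [23, 27]

# For each category: probability that both raters pick it by chance,
# then add these probabilities up over all categories.
p_e = sum((r / n) * (c / n) for r, c in zip(row_totals, col_totals))

print(round(p_e, 2))  # 0.5
```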
Now we can calculate the Cohen's Kappa coefficient. We simply substitute po = 0.72 and pe = 0.50 into the formula: κ = (0.72 - 0.50) / (1 - 0.50) = 0.22 / 0.50 = 0.44. So we get a Kappa value of 0.44 in our example.
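Putting both parts together, the whole calculation fits into one small function. This is a sketch in plain Python; the function name `cohens_kappa` is chosen here for illustration, and the function assumes a square frequency table in which pe is smaller than 1 (otherwise the division fails).

```python
def cohens_kappa(table):
    """Cohen's Kappa for a square frequency table (rows: rater 1, columns: rater 2)."""
    n = sum(map(sum, table))
    # Observed agreement: share of cases on the diagonal
    p_o = sum(table[i][i] for i in range(len(table))) / n
    # Expected agreement: sum over categories of (row share * column share)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    p_e = sum(r * c for r, c in zip(row_totals, col_totals)) / n**2
    return (p_o - p_e) / (1 - p_e)

kappa = cohens_kappa([[17, 8], [6, 19]])
print(round(kappa, 2))  # 0.44
```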
By the way, in po the o stands for "observed". And in pe, the e stands for "expected". Therefore, po is what we actually observed and pe is what we would expect if it were purely random.
Cohen's Kappa interpretation
Now, of course, we would like to interpret the calculated Cohen's Kappa coefficient. The table of Landis & Koch (1977) can be used as a guide.
Therefore, the calculated Cohen's Kappa coefficient of 0.44 indicates moderate reliability or agreement.
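As a small sketch, the lookup can be written as a function. The bands used here are the ones commonly cited from Landis & Koch (1977): below 0 poor, up to 0.20 slight, up to 0.40 fair, up to 0.60 moderate, up to 0.80 substantial, up to 1.00 almost perfect. The exact wording of the labels may differ from table to table, and the function name is illustrative.

```python
def interpret_kappa(kappa):
    """Verbal label for a Kappa value, following the bands of Landis & Koch (1977)."""
    if kappa < 0:
        return "poor"
    bands = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
             (0.80, "substantial"), (1.00, "almost perfect")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot be greater than 1")

print(interpret_kappa(0.44))  # moderate
```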
Weighted Cohen's Kappa
Cohen's Kappa takes into account the agreement between two raters, but it only considers whether both raters give the same rating or not. In the case of an ordinal variable, i.e. a variable with a ranking, such as school grades, it is of course desirable that the gradations are also considered. A difference between "very good" and "satisfactory" is greater than between "very good" and "good".
To take this into account, the weighted kappa can be calculated. Here, the deviation is included in the calculation. The differences can be taken into account linearly or quadratically.
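A minimal sketch of the weighted kappa, assuming the common definition with disagreement weights: each cell is weighted by the distance between the two ratings, |i - j| for linear and (i - j)² for quadratic weighting, and kappa is 1 minus the ratio of weighted observed to weighted expected disagreement. The function name and the 3x3 grade table are hypothetical and only serve as an illustration.

```python
def weighted_kappa(table, weights="linear"):
    """Weighted Cohen's Kappa for a square table with ordered categories."""
    power = {"linear": 1, "quadratic": 2}[weights]
    k = len(table)
    n = sum(map(sum, table))
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]

    observed = expected = 0.0
    for i in range(k):
        for j in range(k):
            w = abs(i - j) ** power            # 0 on the diagonal, grows with distance
            observed += w * table[i][j]        # weighted observed disagreement
            expected += w * row_totals[i] * col_totals[j] / n  # expected by chance
    return 1 - observed / expected

# Hypothetical grades from two raters: "very good", "good", "satisfactory"
grades = [[10, 3, 1],
          [2, 12, 3],
          [0, 4, 15]]
print(weighted_kappa(grades, "linear"))
print(weighted_kappa(grades, "quadratic"))
```

With only two categories, every disagreement has distance 1, so both weightings give the same value as the unweighted Cohen's Kappa.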
Calculate Cohen's Kappa with DATAtab
Now we will discuss how you can easily calculate Cohen's Kappa for your data online using DATAtab.
Simply go to the Cohen's Kappa calculator and copy your own data into the table. Now click on the tab "Reliability".
All you have to do is click on the variables you want to analyse and Cohen's Kappa will be displayed automatically. First you will see the crosstab and then you can read the calculated Cohen's Kappa coefficient. If you don't know how to interpret the result, just click on "interpretations in words".
An inter-rater reliability analysis was performed between the dependent samples Rater1 and Rater2. For this, Cohen's Kappa was calculated, which is a measure of the agreement between two related categorical samples. The Cohen's Kappa showed that there was fair agreement between the samples Rater1 and Rater2 with κ = 0.23.