# Fleiss Kappa

You use the Fleiss Kappa whenever you want to know whether the measurements of more than two people agree. The people who measure something are called raters.

In the case of the Fleiss Kappa, the variable to be measured by the three or more raters is a nominal variable. Therefore, if you have a nominal variable and more than two raters, you use the Fleiss Kappa.

If you had an ordinal variable and more than two raters, you would use Kendall's W, and if you had a metric variable, you would use the intraclass correlation. If you had only two raters and a nominal variable, you would use Cohen's Kappa.

But that's enough theory for now, let's take a look at an example.

# Fleiss Kappa Example

Let's say you have developed a measuring instrument, for example a questionnaire, with which doctors can determine whether a person is depressed or not.

Now you give the measuring instrument to doctors and let them evaluate 50 people with it. The big question now is: How well do the doctors' measurements agree?

If the ratings of the raters agree very well, one speaks of a high inter-rater reliability.

And it is precisely this inter-rater reliability that is measured by the Fleiss Kappa. The Fleiss Kappa is a measure of inter-rater reliability.

### Definition:

The Fleiss Kappa is a measure of how reliably three or more raters measure the same thing.

## Fleiss Kappa with repeated measurement

So far, we have considered the case where three or more people measure the same thing. However, the Fleiss Kappa can also be used when the same rater takes the measurement at three or more different points in time.

In that case, the Fleiss Kappa indicates how well the measurements of the same person agree.

In this example, the variable under study has two categories, depressed and not depressed; of course, the variable under study may also consist of more than two categories.

### Measure of the agreement:

The Fleiss Kappa is a measure of the agreement of more than two dependent categorical samples.

## Fleiss Kappa Reliability and validity

It is important to note that with the Fleiss Kappa you can only make a statement about how reliably the raters measure the same thing. But you cannot make a statement about whether what the raters measure is the right thing!

So if all raters gave the same ratings, you would get a very high Fleiss Kappa. However, the Fleiss Kappa does not tell you whether these ratings match reality, i.e. whether the correct value was measured!

In the first case one speaks of the reliability, in the second case one speaks of the validity.

## Calculate Fleiss Kappa

We can calculate the Fleiss Kappa with this equation:

κ = (*p _{o}* − *p _{e}*) / (1 − *p _{e}*)

In this formula, *p _{o}* is the observed agreement of the raters, and *p _{e}* is the expected agreement of the raters. The expected agreement occurs when raters make completely random judgments, i.e. simply flip a coin for each patient as to whether they are depressed or not.

So how do we calculate *p _{o}* and *p _{e}*? Let's start with *p _{e}*.

Let's say we have 7 patients and three raters. Each patient has been assessed by each rater.

In the first step, we simply count how many times a patient was judged depressed and how many times they were judged not depressed.

For the first patient, 0 raters said that this person is not depressed and 3 raters said that this person is depressed. For the second person, one rater said that the person is not depressed and two said that the person is depressed.

Now we do that for all the other patients and calculate the totals for each category. In total, we have 8 ratings of "not depressed" and 13 ratings of "depressed", i.e. 21 ratings overall.

With this we can calculate how likely it is that a rating is "not depressed" or "depressed". To do this, we divide the number of ratings in each category by the total number of 21.

So 8 divided by 21 tells us that 38% of the ratings were "not depressed", and 13 divided by 21 tells us that 62% of the ratings were "depressed".

To calculate *p _{e}* we now square both values and sum them up.

So 0.38^{2} plus 0.62^{2} equals 0.53.
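The calculation of *p _{e}* can be sketched in a few lines of Python. Note that this sketch only needs the category totals (8 and 13) from the table, not the individual patient counts:

```python
# Expected agreement p_e for the Fleiss Kappa: the sum of the squared
# proportions of each category over all ratings.
# Category totals from the example: 8 "not depressed", 13 "depressed".
category_totals = [8, 13]
total_ratings = sum(category_totals)  # 21 ratings in total

proportions = [c / total_ratings for c in category_totals]
p_e = sum(p ** 2 for p in proportions)

print([round(p, 2) for p in proportions])  # [0.38, 0.62]
print(round(p_e, 2))                       # 0.53
```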

Now we have *p _{e}*, and next we need *p _{o}*.

We can calculate *p _{o}* with this formula; don't worry, it looks more complicated than it is:

*p _{o}* = 1 / (N · n · (n − 1)) · (Σ *n _{ij}*^{2} − N · n)

Here, *n _{ij}* is the number of raters who assigned patient *i* to category *j*.

Let's start with the first part. Capital N is the number of patients, i.e. 7, and small n is the number of raters, i.e. 3. This gives us 0.024 for the first part.

In the second part of the formula, we simply square each value in this table and sum them up: 0^{2} plus 3^{2} and so on, up to 1^{2} plus 2^{2}. That gives us 47.

And the third part comes out to 7 times 3, which equals 21. If we substitute everything, we get 0.024 × (47 − 21), i.e. 26/42 ≈ 0.62.
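This step, too, can be sketched in Python. The full count table is not reproduced above, so this sketch only uses the three quantities named in the text (N, n, and the sum of squares 47); note that multiplying with the rounded 0.024 instead of the exact 1/42 would give a slightly different intermediate value:

```python
# Observed agreement p_o for the Fleiss Kappa:
# p_o = 1 / (N * n * (n - 1)) * (sum of squared cell counts - N * n)
N = 7                # number of patients
n = 3                # number of raters per patient
sum_of_squares = 47  # sum of the squared counts from the table

p_o = (sum_of_squares - N * n) / (N * n * (n - 1))
print(round(p_o, 3))  # 0.619
```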

So now we have *p _{o}* and *p _{e}*. Putting them into the formula for kappa, we get κ = (0.62 − 0.53) / (1 − 0.53) ≈ 0.19.
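The whole calculation can be put together in one small self-contained Python sketch. Only the first two rows of the count table are given in the text above; the remaining five rows below are a hypothetical completion, chosen so that the stated totals still hold (column sums 8 and 13, sum of squares 47):

```python
def fleiss_kappa(counts):
    """Fleiss Kappa for a table of category counts per subject.

    counts[i][j] = number of raters who assigned subject i to category j.
    Every row must sum to the same number of raters n.
    """
    N = len(counts)     # number of subjects
    n = sum(counts[0])  # number of raters per subject
    total = N * n       # total number of ratings

    # Expected agreement p_e: squared category proportions, summed.
    col_sums = [sum(col) for col in zip(*counts)]
    p_e = sum((c / total) ** 2 for c in col_sums)

    # Observed agreement p_o.
    sum_sq = sum(c ** 2 for row in counts for c in row)
    p_o = (sum_sq - total) / (total * (n - 1))

    return (p_o - p_e) / (1 - p_e)

# First two rows as in the text; rows 3-7 are a hypothetical
# completion consistent with the totals 8 / 13 and sum of squares 47.
table = [
    [0, 3],  # patient 1: 0 x "not depressed", 3 x "depressed"
    [1, 2],  # patient 2: 1 x "not depressed", 2 x "depressed"
    [3, 0],
    [0, 3],
    [2, 1],
    [1, 2],
    [1, 2],
]
print(round(fleiss_kappa(table), 2))  # 0.19
```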

## Fleiss Kappa interpretation

Now, of course, the Fleiss Kappa coefficient must be interpreted. For this we can use the table of Landis and Koch (1977):

| Fleiss Kappa | Interpretation |
| --- | --- |
| < 0.00 | Poor agreement |
| 0.00 – 0.20 | Slight agreement |
| 0.21 – 0.40 | Fair agreement |
| 0.41 – 0.60 | Moderate agreement |
| 0.61 – 0.80 | Substantial agreement |
| 0.81 – 1.00 | Almost perfect agreement |

For a Fleiss Kappa value of 0.19, we get only slight agreement.

## Calculate Fleiss Kappa with DATAtab

With DATAtab you can easily calculate the Fleiss Kappa online. To do this, simply go to datatab.de, copy your own data into the table at the Fleiss Kappa calculator, and click on the Reliability tab. Under Reliability you can calculate different reliability statistics; depending on how many variables you select and which scale level they have, you will get a suitable suggestion.

The Fleiss Kappa is calculated for nominal variables. If your data is recognized as metric, please change the scale level under Data View to nominal.

If you select Rater 1 and Rater 2, Cohen's Kappa will be calculated; if you additionally select Rater 3, the Fleiss Kappa will be calculated.

Below you can read the calculated Fleiss Kappa.

If you don't know how to interpret the result, just click on Interpretations in Words.

An inter-rater reliability analysis was performed between the dependent samples of Rater 1, Rater 2 and Rater 3. For this purpose, the Fleiss Kappa was calculated, which is a measure of the agreement between more than two dependent categorical samples.

The Fleiss Kappa showed that there was a slight agreement between samples Rater 1, Rater 2 and Rater 3 with κ= 0.16.
