ROC Curve
A ROC curve is a graphical representation of the performance of a binary classification model across all classification thresholds. ROC stands for Receiver Operating Characteristic.
Example of a ROC curve
We would like to classify, based on a screening, whether a person has cancer or not.
![Binary classification model Binary classification model](assets/tutorial/roc/Binary_classification_model.png)
This classification is based on a certain blood value, where high values indicate cancer. The question now is which value we choose as the classification threshold. In other words, from which value onward do we predict the disease?
![Classification threshold Classification threshold](assets/tutorial/roc/Classification_threshold.png)
For this, we collect data from 10 people: each person's blood value and whether or not the disease is present.
We could now choose a classification threshold of 45, for example. In this case, of the 5 people with the disease, we would correctly classify 4 as "diseased" and incorrectly classify 1 as "healthy". So 4 out of 5 diseased people are classified correctly.
![Classification threshold ROC curve Classification threshold ROC curve](assets/tutorial/roc/Classification_threshold_ROC_curve.png)
This value is called the True Positive Rate (TPR) and is equal to the sensitivity.
On the other hand, of the 5 healthy individuals, 2 are misclassified as "diseased" and 3 are correctly classified as "healthy". So we misclassified 2 out of 5 as "diseased". This value is called the False Positive Rate (FPR).
![Classification threshold False Positive Rate Classification threshold False Positive Rate](assets/tutorial/roc/Classification_threshold_False_Positive_Rate.png)
So for a threshold of 45 we get a True Positive Rate of 4/5, i.e. 0.8, and a False Positive Rate of 2/5, i.e. 0.4.
True Positive Rate and False Positive Rate
The True Positive Rate (TPR) is calculated with this equation:
![True Positive Rate True Positive Rate](assets/tutorial/roc/True_Positve_Rate.png)
The True Positive Rate is equal to the true positives divided by the true positives plus the false negatives. The true positives are those correctly classified as "diseased" and the false negatives are those incorrectly classified as "healthy".
The False Positive Rate (FPR) is obtained using this equation:
![False Positive Rate False Positive Rate](assets/tutorial/roc/False_Positve_Rate.png)
The False Positive Rate is equal to the false positives divided by the false positives plus the true negatives. The false positives are the healthy individuals misclassified as "diseased" and the true negatives are the individuals correctly classified as "healthy".
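As a minimal sketch, the two formulas can be written as a small Python function. The function name `rates` is chosen here for illustration only. The counts passed in are the ones from the threshold-45 example: 4 true positives, 1 false negative, 2 false positives, and 3 true negatives.

```python
def rates(tp, fn, fp, tn):
    """Return (TPR, FPR) computed from the four confusion-matrix counts."""
    tpr = tp / (tp + fn)  # share of diseased people classified as "diseased"
    fpr = fp / (fp + tn)  # share of healthy people classified as "diseased"
    return tpr, fpr


# Counts taken from the threshold-45 example in the text.
tpr, fpr = rates(tp=4, fn=1, fp=2, tn=3)
print(tpr, fpr)  # 0.8 0.4
```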
Plot the ROC Curve
We can now calculate for each threshold what the True Positive Rate and the False Positive Rate are. These two values are plotted on the ROC curve. The True Positive Rate is plotted on the y-axis and the False Positive Rate on the x-axis.
Now let's plot the complete ROC curve for our example!
![ROC Curve ROC Curve](assets/tutorial/roc/ROC_Curve.png)
If we choose a very small threshold, i.e. push it all the way to the left, we correctly classify all 5 diseased individuals. Our True Positive Rate is thus 5 out of 5, i.e. 1.
![threshold ROC Curve threshold ROC Curve](assets/tutorial/roc/threshold_ROC_Curve.png)
In the same way, however, we also misclassify all 5 healthy persons as "diseased". Our False Positive Rate is therefore 5 out of 5, i.e. 1.
![threshold False Positive Rate threshold False Positive Rate](assets/tutorial/roc/threshold_False_Positve_Rate.png)
This gives us the first point:
![ROC curve first point ROC curve first point](assets/tutorial/roc/ROC_curve_first_point.png)
Now we can increase the threshold. Here we still classify all 5 diseased people correctly as "diseased", so the True Positive Rate remains 5/5, i.e. 1. However, of the 5 healthy individuals, we now misclassify only 4 as "diseased". The False Positive Rate is therefore 4 out of 5, i.e. 0.8.
![True Positive Rate and False Positive Rate True Positive Rate and False Positive Rate](assets/tutorial/roc/True_Positive_Rate_and_False_Positive_Rate.png)
At the next threshold, we still have a True Positive Rate of 1. All 5 diseased are correctly classified. The False Positive Rate takes the value of 3/5, so 0.6.
![Threshold value 3 Threshold value 3](assets/tutorial/roc/Threshold_value_3.png)
At the next threshold, for the first time, a diseased person is misclassified as "healthy". We therefore obtain a True Positive Rate of 4/5, i.e. 0.8, and a False Positive Rate of 3/5, i.e. 0.6.
![Threshold value 4 Threshold value 4](assets/tutorial/roc/Threshold_value_4.png)
We can do this for all other thresholds, finishing the ROC curve. At the marked point below, for example, 80% of the diseased people were correctly classified as "diseased" and 20% of the healthy people were incorrectly classified as "diseased".
![Finished ROC curve Finished ROC curve](assets/tutorial/roc/Finished_ROC_curve.png)
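The threshold sweep described above can be sketched in a few lines of Python. The ten blood values below are hypothetical: the original measurements appear only in the figures, so these numbers were merely chosen so that a threshold of 45 yields 4 true positives and 2 false positives, as in the example. The function name `roc_points` and the rule "predict diseased if the value is at or above the threshold" are likewise assumptions made for illustration.

```python
# Hypothetical blood values (NOT the tutorial's original data) and
# disease status (1 = diseased, 0 = healthy) for 10 people.
blood = [20, 28, 35, 40, 48, 55, 60, 68, 75, 82]
diseased = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]


def roc_points(values, labels):
    """Return one (threshold, FPR, TPR) tuple per candidate threshold.

    A person is predicted "diseased" when value >= threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Every observed value is a candidate threshold; infinity adds the
    # point where nobody is predicted "diseased".
    thresholds = sorted(set(values)) + [float("inf")]
    points = []
    for t in thresholds:
        tp = sum(1 for v, y in zip(values, labels) if v >= t and y == 1)
        fp = sum(1 for v, y in zip(values, labels) if v >= t and y == 0)
        points.append((t, fp / n_neg, tp / n_pos))
    return points


for threshold, fpr, tpr in roc_points(blood, diseased):
    print(f"threshold={threshold}: FPR={fpr:.1f}, TPR={tpr:.1f}")
```

Plotting the FPR on the x-axis against the TPR on the y-axis for these points gives the staircase shape typical of a ROC curve built from a small sample.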
AUC value
Using the ROC curve, we can also compare different classification methods. The higher a model's curve lies, i.e. the closer it runs to the top-left corner, the better the model. Therefore, the larger the area under the curve, the better the classifier. This area is exactly what the AUC value (Area Under the Curve) measures.
![AUC value AUC value](assets/tutorial/roc/AUC_value.png)
The AUC value ranges from 0 to 1. The larger the value, the better the classifier; a value of 0.5 corresponds to random guessing.
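One common way to obtain the area is the trapezoidal rule applied to the ROC points. The following is a minimal sketch under stated assumptions: the blood values are hypothetical (the original data are shown only in the figures), the function name `auc_trapezoid` is made up for this example, and a person is predicted "diseased" when the value is at or above the threshold.

```python
# Hypothetical data (NOT the tutorial's original measurements).
blood = [20, 28, 35, 40, 48, 55, 60, 68, 75, 82]
diseased = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]  # 1 = diseased, 0 = healthy


def auc_trapezoid(values, labels):
    """Compute the area under the ROC curve with the trapezoidal rule."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    thresholds = sorted(set(values)) + [float("inf")]
    points = []
    for t in thresholds:
        tp = sum(1 for v, y in zip(values, labels) if v >= t and y == 1)
        fp = sum(1 for v, y in zip(values, labels) if v >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))  # (FPR, TPR)
    points.sort()  # order by FPR, then TPR, so the curve runs left to right
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


auc = auc_trapezoid(blood, diseased)
print(round(auc, 2))  # 0.84 for these hypothetical values
```

The same number can be read as the share of (diseased, healthy) pairs in which the diseased person has the higher blood value: here 21 of 25 pairs.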
ROC curve and logistic regression
But how do the ROC curve and logistic regression fit together? We could, for example, create a new classifier using logistic regression. In addition to the blood value, it could also use the age and gender of the person.
A logistic regression then estimates how likely it is that a particular person has the disease.
![Logistic regression classification threshold Logistic regression classification threshold](assets/tutorial/roc/Logistic_regression_classification_threshold.png)
Very often, 50% is simply taken as the threshold to classify whether a person is "diseased" or not. But of course this does not have to be the case! Any threshold can be used.
![Logistic Regression Threshold Logistic Regression Threshold](assets/tutorial/roc/Logistic_Regression_Threshold.png)
Therefore, we can also create a ROC curve for the different threshold values in the logistic regression.
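To illustrate how the choice of threshold changes the result, here is a small sketch. The predicted probabilities are made-up numbers standing in for the output of a fitted logistic regression; they do not come from the tutorial's data. The helper `rates_at` is a hypothetical name used only in this example.

```python
# Hypothetical predicted probabilities from a logistic regression
# (made-up numbers for illustration) and the true disease status.
probabilities = [0.05, 0.12, 0.35, 0.28, 0.45, 0.62, 0.55, 0.78, 0.88, 0.95]
diseased = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]  # 1 = diseased, 0 = healthy


def rates_at(threshold, probs, labels):
    """Return (TPR, FPR) when probs >= threshold is classified as "diseased"."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return tp / n_pos, fp / n_neg


for threshold in (0.5, 0.3):
    tpr, fpr = rates_at(threshold, probabilities, diseased)
    print(f"threshold={threshold}: TPR={tpr:.1f}, FPR={fpr:.1f}")
```

With these numbers, lowering the threshold from 50% to 30% finds every diseased person but also flags one more healthy person, which is exactly the trade-off each point on the ROC curve represents.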
Create ROC curve with DATAtab
Of course, we can easily create a ROC curve online with DATAtab. To do this, we simply copy our data into the table and click on ROC Calculator. Alternatively, you can also create a ROC curve in the Regression Calculator under Logistic Regression.
![Create ROC curve online Create ROC curve online](assets/tutorial/roc/Create_ROC_curve_online.png)
We simply select the two variables Diseased and Blood Value and specify what we consider a positive event, in our case the answer yes. Now we get the ROC curve. In the table below the ROC curve we find the respective threshold value for each point of the curve.