Hierarchical cluster analysis
A hierarchical cluster analysis is a clustering method that creates a hierarchical tree, or dendrogram, of the objects to be clustered.
The tree represents the relationships between objects and shows how objects are clustered at different levels.
Example Hierarchical Cluster Analysis
Example: We asked people how many hours a week they spend on social media platforms and at the gym.
We now want to know whether there are clusters in this dataset, so we perform a hierarchical cluster analysis.
How is a Hierarchical Cluster Analysis calculated?
First, we plot the points in a scatter plot.
With this we can start to create the clusters. In the first step we assign each point to its own cluster, so we have as many clusters as we have persons.
The goal now is to merge clusters step by step until all points are in one cluster.
In each step, the two clusters that are closest together are merged. What does "closest together" mean?
For this we need to determine two things:
- How the distance between two points is measured.
- How the distance between two clusters is derived from the points they contain (the linking method).
Distance between two points
Let's start with the question of how to calculate the distance between two points. Here are the best-known distance measures:
- Euclidean Distance
- Manhattan Distance
- Maximum Distance
Let's take the distance between Max and Caro. The difference on the y-axis is 1 and the difference on the x-axis is 4.
Euclidean Distance
The Euclidean distance is the square root of the sum of the squared differences. For Max and Caro this is the square root of 4² + 1² = 17, which is about 4.12.
Manhattan Distance
The Manhattan distance uses the sum of the absolute differences. So we simply calculate 4 plus 1 and get a distance of 5.
Maximum Distance
The maximum distance is simply the maximum value of the absolute differences. In this case it is 4.
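The three distance measures above can be sketched in a few lines of Python. The coordinates for Max and Caro are hypothetical; they are chosen only so that the differences are 4 on the x-axis and 1 on the y-axis, as in the example.

```python
import math

def euclidean(p, q):
    # Square root of the sum of the squared differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # Sum of the absolute differences.
    return sum(abs(a - b) for a, b in zip(p, q))

def maximum(p, q):
    # Largest absolute difference on any axis.
    return max(abs(a - b) for a, b in zip(p, q))

# Hypothetical (x, y) coordinates: differences of 4 (x) and 1 (y).
max_point = (9, 4)
caro_point = (5, 5)

print(round(euclidean(max_point, caro_point), 2))  # 4.12
print(manhattan(max_point, caro_point))            # 5
print(maximum(max_point, caro_point))              # 4
```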
Linking method
Now that we know how to calculate the distance between two points, we need to determine how to measure the distance between clusters that contain more than one point.
Let's say we have a cluster with the points Joe and Lisa and a cluster with Max and Caro. Now how do we determine the distance between these two clusters? Here are the most popular methods:
- Single-linkage
- Complete-linkage
- Average-linkage
Single-linkage
Single-linkage uses the distance between the closest elements of the two clusters. This is the distance between Caro and Joe.
Complete-linkage
Complete-linkage uses the distance between the farthest elements of the two clusters. This is the distance between Max and Joe.
Average-linkage
Average-linkage uses the average of all pairwise distances. The distance is calculated for every combination of one point from each cluster, and these distances are then averaged.
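As a rough sketch, the three linking methods differ only in how they reduce the list of pairwise distances to one number. The coordinates below are made up for illustration (hours on social media, hours at the gym), so the values, and which pair turns out to be closest or farthest, need not match the figure in the example.

```python
import math
from itertools import product

# Hypothetical coordinates: (social media hours, gym hours).
points = {"Joe": (3, 3), "Lisa": (2, 2), "Max": (9, 4), "Caro": (5, 5)}

def pairwise_distances(cluster_a, cluster_b):
    # Euclidean distance for every pair with one point from each cluster.
    return [math.dist(points[a], points[b])
            for a, b in product(cluster_a, cluster_b)]

def single_linkage(cluster_a, cluster_b):
    return min(pairwise_distances(cluster_a, cluster_b))

def complete_linkage(cluster_a, cluster_b):
    return max(pairwise_distances(cluster_a, cluster_b))

def average_linkage(cluster_a, cluster_b):
    distances = pairwise_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)

left, right = ["Joe", "Lisa"], ["Max", "Caro"]
print(round(single_linkage(left, right), 2))    # 2.83
print(round(complete_linkage(left, right), 2))  # 7.28
print(round(average_linkage(left, right), 2))   # 5.11
```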
Example Hierarchical Cluster Analysis
For our example we use the Euclidean distance and the single-linkage method. So now we need the distance from each cluster to the other clusters.
For this we first need to calculate the distance matrix. In the distance matrix we enter the clusters as both rows and columns and then calculate the distance from each cluster to every other cluster.
The distance between Alan and Lisa, for example, is the Euclidean distance calculated from their differences on the two axes.
We can now do this for all other combinations until we have calculated the complete distance matrix. Now we can merge the first clusters. For this we look for the two clusters with the smallest distance between them. This is the case for Joe and Lisa.
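A minimal sketch of this step in Python, assuming made-up coordinates for the five persons (the original survey values are not given here). It builds the Euclidean distance matrix and looks up the pair with the smallest distance.

```python
import math
from itertools import combinations

# Hypothetical coordinates: (social media hours, gym hours).
points = {
    "Alan": (10, 6),
    "Lisa": (2, 2),
    "Joe": (3, 3),
    "Max": (9, 4),
    "Caro": (5, 5),
}

# Distance for every unordered pair of persons.
distances = {
    (a, b): math.dist(points[a], points[b])
    for a, b in combinations(points, 2)
}

# Print the full (symmetric) distance matrix row by row.
names = list(points)
for row in names:
    cells = [f"{math.dist(points[row], points[col]):5.2f}" for col in names]
    print(f"{row:>5}", *cells)

# The pair with the smallest distance is merged first.
closest_pair = min(distances, key=distances.get)
print(closest_pair, round(distances[closest_pair], 2))
```

With these assumed coordinates the closest pair is Lisa and Joe, which mirrors the first merge in the example.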
With this, we now combine Joe and Lisa into one cluster. In our tree diagram or dendrogram we can draw the first connection.
Now we need to update our distance matrix. We decided to use the single-linkage method, so the distance between two clusters is given by the elements that are closest to each other. For each of the clusters Alan, Max and Caro, Joe is the closest member of the cluster containing Lisa and Joe.
So we calculate the distance from Alan to Joe, the distance from Max to Joe, and the distance from Caro to Joe.
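The update under single-linkage can be sketched as follows: for every remaining cluster, keep the smaller of its distances to Lisa and to Joe. The coordinates are hypothetical and were picked so that Joe is the closer member in each case, as described above.

```python
import math

# Hypothetical coordinates: (social media hours, gym hours).
points = {
    "Alan": (10, 6),
    "Lisa": (2, 2),
    "Joe": (3, 3),
    "Max": (9, 4),
    "Caro": (5, 5),
}

merged = ("Lisa", "Joe")          # the cluster formed in the first step
others = ["Alan", "Max", "Caro"]  # clusters that are still single points

# Single-linkage: distance to the merged cluster is the minimum
# over the members of that cluster.
updated = {
    other: min(math.dist(points[other], points[m]) for m in merged)
    for other in others
}

# Which member of the merged cluster is closest to each other point?
closest_member = {
    other: min(merged, key=lambda m: math.dist(points[other], points[m]))
    for other in others
}

for other in others:
    print(other, closest_member[other], round(updated[other], 2))
```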
Now we again merge the clusters that are closest. These are Max and Alan.
In our tree diagram or dendrogram, we can draw in the second connection.
Now we update the distance matrix again. We calculate the distance between Alan and Joe, Caro and Joe and between Caro and Alan. We get the smallest distance between the Caro cluster and the Lisa and Joe cluster.
So we connect these two clusters and draw the third connection in the tree diagram.
Now there are only two clusters left, and we merge them in the last step. This gives us the finished dendrogram.
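The whole procedure can be sketched as a small, naive Python function. This is only an illustration under assumptions: the coordinates are made up, the function name `single_linkage_clustering` is hypothetical, and a real analysis would normally rely on a statistics package rather than this simple loop.

```python
import math
from itertools import combinations

# Hypothetical coordinates: (social media hours, gym hours).
points = {
    "Alan": (10, 6),
    "Lisa": (2, 2),
    "Joe": (3, 3),
    "Max": (9, 4),
    "Caro": (5, 5),
}

def single_linkage_clustering(points):
    """Merge the two closest clusters until one cluster is left.

    Returns one (cluster_a, cluster_b, distance) entry per merge,
    which is the information a dendrogram displays.
    """
    # Start with one cluster per point.
    clusters = [frozenset([name]) for name in points]
    merges = []

    def cluster_distance(pair):
        # Single-linkage: Euclidean distance between the closest members.
        a, b = pair
        return min(math.dist(points[p], points[q]) for p in a for q in b)

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=cluster_distance)
        merges.append((sorted(a), sorted(b), cluster_distance((a, b))))
        # Replace the two merged clusters with their union.
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges

merges = single_linkage_clustering(points)
for cluster_a, cluster_b, distance in merges:
    print(cluster_a, "+", cluster_b, "at distance", round(distance, 2))
```

With the assumed coordinates, the merge order is the same as in the walkthrough: Joe and Lisa first, then Max and Alan, then Caro joins Lisa and Joe, and finally the two remaining clusters are merged.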
Calculate hierarchical cluster analysis with DATAtab
To calculate a hierarchical cluster analysis online, just visit the statistics calculator and copy your own data into the table or use the link to load the dataset. Now we click on cluster and select hierarchical cluster.
If we now click on Social Media and Gym a hierarchical cluster analysis will be calculated for us. Additionally we can specify the label, in our case the names of the persons.
Now we can specify which linking method should be used and how the distance should be calculated. We simply take single-linkage and the Euclidean distance again.
The results are then displayed. We see the tree plot, a scatter plot and the elbow plot. The elbow plot helps us decide how many clusters to take. We can see a kink in the plot, so we take 4 as the cluster count. We can then select this number of clusters, and the tree plot highlights the 4 clusters in different colors: the first, the second, the third and the fourth cluster.