Overview of Hierarchical Clustering Analysis
Hierarchical Clustering analysis is an algorithm that is used to group the data points having the similar properties, these groups are termed as clusters, and as a result of hierarchical clustering we get a set of clusters where these clusters are different from each other. Clustering of this data into clusters is classified as Agglomerative Clustering( involving decomposition of cluster using bottom-up strategy ) and Divisive Clustering ( involving decomposition of cluster using top-down strategy )
There are various types of clustering analysis, one such type is Hierarchical clustering.
Hierarchical clustering will help in creating clusters in a proper order/hierarchy. Example: the most common everyday example we see is how we order our files and folders in our computer by proper hierarchy.
Types of Hierarchical Clustering
Hierarchical clustering is further classified into two types i.e Agglomerative clustering and Divisive Clustering (DIANA)
In this case of clustering, the hierarchical decomposition is done with the help of bottom-up strategy where it starts by creating atomic (small) clusters by adding one data object at a time and then merges them together to form a big cluster at the end, where this cluster meets all the termination conditions. This procedure is iterative until all the data points are brought under one single big cluster.
AGNES (AGglomerative NESting) is a type of agglomerative clustering that combines the data objects into a cluster based on similarity. The result of this algorithm is a tree-based structured called Dendrogram. Here it uses the distance metrics to decide which data points should be combined with which cluster. Basically, it constructs a distance matrix and checks for the pair of clusters with the smallest distance and combines them.
The above figure shows Agglomerative vs. Divisive clustering
Based on how the distance between each cluster is measured we can have 3 different methods
- Single linkage: Where the shortest distance between the two points in each cluster is defined as the distance between the clusters.
- Complete linkage: In this case, we will consider the longest distance between the points in each cluster as the distance between the clusters.
- Average linkage: Here we will take the average between each point in one cluster to every other point in the other cluster.
Now let’s discuss the strengths & weakness in AGNES; this algorithm has a time complexity of at least O(n2) hence it doesn’t do well in scaling, and one other major drawback is that whatever has been done can never be undone i.e. If we incorrectly group any cluster in an earlier stage of the algorithm then we will not be able to change the outcome/modify it. But this algorithm has a bright side as well since there are many smaller clusters are formed it can be helpful in the process of discovery and it produces an ordering of objects which is very helpful in visualization.
Divisive Clustering (DIANA)
Diana basically stands for Divisive Analysis; this is another type of hierarchical clustering where basically it works on the principle of top-down approach (inverse of AGNES) where the algorithm begins by forming a big cluster and it recursively divides the most dissimilar cluster into two and it goes on until we’re all the similar data points belong in their respective clusters. These divisive algorithms result in highly accurate hierarchies than the agglomerative approach but they are computationally expensive.
The above figure shows Divisive clustering step by step process
Multiphase Hierarchical Clustering
To improve the quality of clusters generated by the above mentioned hierarchical clustering techniques, we integrate our hierarchical clustering techniques with other clustering techniques; this is called as multiphase clustering. The different types of multiphase clustering are as follows:
- BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
- ROCK (RObust Clustering using links)
1. Balanced Iterative Reducing and Clustering using Hierarchies
This method is mainly used for clustering a huge quantity of numeric data by integrating our hierarchical/micro clustering at the initial phase and macro clustering/iterative partitioning at the later phase. This method helps in overcoming the scalability problem we faced in AGNES and the inability to undo what was done in the before step. BIRCH uses two important concepts in its algorithm
a. Clustering feature (Helps in summarizing the cluster)
CF is defined as <n, LS, SS> (n- number of data points in the cluster, the linear sum of n points, the square sum of n points). Storing the feature of a cluster in this way helps in avoiding storing detailed information about it and CF is additive in nature for different clusters.
CF1 + CF2 = <n1+n2, LS1+LS2, SS1+SS2>
b. Clustering feature tree (helps in representing a cluster as a hierarchy)
CF tree is a balanced tree with branching factor B (maximum number of children) and threshold T (max number of sub-clusters that can be stored in leaf nodes).
The algorithm basically works in 2 phases; in phase 1 it scans the database and builds an in-memory CF tree and in phase 2 it uses the clustering algorithm which helps in clustering the leaf nodes by removing the outliers (sparse clusters) and groups the cluster with maximum density. The one and only drawback of this algorithm are that it handles only the numeric data type.
2. Robust clustering using link’s
Link is defined as the number of common neighbours between two objects. ROCK algorithm is a type of clustering algorithm which uses this concept of link with the categorical dataset. As we know that the distance measured clustering algorithms does not provide high-quality clusters for the categorical dataset, but in the case of ROCK, it considers the neighbourhoods of the data points as well i.e. if two data points have the same neighbourhood then they are most likely to belong in the same cluster. The algorithm will construct a sparse graph in the first step taking into account the similarity matrix with the concept of neighbourhood and similarity threshold. The second step, it uses the agglomerative hierarchical clustering technique on the sparse graph.
This type of hierarchical clustering algorithm uses the concept of dynamic modelling. Wondering why is it called dynamic? It’s called dynamic because it has the ability to adapt to the internal cluster characteristics automatically by evaluating the cluster similarity, i.e. how well connected the data points within a cluster and at the proximity of clusters. One of the drawbacks of chameleon is that the processing cost is too high (O(n2) for n objects is the worst-case time complexity).
Image source – Google
In this article, we have learnt what a cluster is and what is cluster analysis, different types of hierarchical clustering techniques their advantages and disadvantages. Each of the techniques we discussed has its own plus and minus, hence before going ahead with an algorithm we need to understand our data with proper exploratory data analysis and choose the algorithm with caution.
This is a guide to Hierarchical Clustering Analysis. Here we discuss the overview, agglomerative clustering, divisive Clustering (DIANA) and multiphase hierarchical clustering. You may also look at the following articles to learn more