## Introduction to Hierarchical Clustering

- Recently one of our clients asked our team to bring out a list of segments with an order of importance within their customers to target them to franchise one of their newly launched products. Clearly just segmenting the customers using partial clustering (k-means, c-fuzzy) will not bring out the order of importance that’s where hierarchical clustering comes into the picture.
- Hierarchical clustering is separating the data into different groups based on some similarity measures known as clusters, which essentially targets at building the hierarchy among clusters. It is basically unsupervised learning and choosing the attributes to measure similarity is application-specific.

### The Cluster of Data Hierarchy

- Agglomerative Clustering
- Divisive Clustering

Let us take an example of data, marks obtained by 5 students to group them for an upcoming competition.

Student | Marks |

A | 10 |

B | 7 |

C | 28 |

D | 20 |

E | 35s |

#### 1. Agglomerative Clustering

- To start with we consider each individual point/element here weight as clusters and keep on merging the similar points/elements to form a new cluster at the new level until we are left with the single cluster is a bottom-up approach.
- Single linkage and complete linkage are two popular examples of agglomerative clustering. Other than that Average linkage and Centroid linkage. In single linkage, we merge in each step the two clusters, whose two closest members have the smallest distance. In complete linkage, we merge in the members of the smallest distance which provide the smallest maximum pairwise distance.
- Proximity matrix, It’s the core for performing hierarchical clustering, which gives the distance between each of the points.
- Let us make proximity matrix for our data given in the table, since we are calculating the distance between each of the points with other points it will be an asymmetric matrix of shape n × n, in our case 5 × 5 matrices.

A popular method for distance calculations are:

- Euclidian distance(Squared)

`dist((x, y), (a, b)) = √(x - a)² + (y - b)²`

- Manhattan distance

`dist((x, y), (a, b)) =|x−c|+|y−d|`

Euclidian distance is most commonly used, we will use the same here, and we will go with complex linkage.

Student(Clusters) | A | B | C | D | E |

A | 0 | 3 | 18 | 10 | 25 |

B | 3 | 0 | 21 | 13 | 28 |

C | 18 | 21 | 0 | 8 | 7 |

D | 10 | 13 | 8 | 0 | 15 |

E | 25 | 28 | 7 | 15 | 0 |

Diagonal elements of proximity matrix will be always 0, as the distance between the point with the same point will be always 0, hence diagonal elements are exempted from consideration for grouping.

Here, in iteration 1, the smallest distance is 3 hence we merge A and B to form a cluster, again form new proximity matrix with cluster (A, B) by taking (A, B) cluster point as 10, i.e maximum of (7,10) so newly formed proximity matrix would be

Clusters | (A, B) | C | D | E |

(A, B) | 0 | 18 | 10 | 25 |

C | 18 | 0 | 8 | 7 |

D | 10 | 8 | 0 | 15 |

E | 25 | 7 | 15 | 0 |

In iteration 2, 7 is the minimum distance hence we merge C and E forming a new cluster (C, E), we repeat the process followed in iteration 1 until we end up with the single cluster, here we stop at iteration 4.

4.5 (2,711 ratings)

View Course

The whole process is depicted in the below figure:

(A,B,D) and (D, E) are the 2 clusters formed at iteration 3, at the last iteration we can see we are left with a single cluster.

#### 2. Divisive Clustering

To start with we consider all points as a single cluster and separate them by farthest distance until we end off with individual points as individual clusters (not necessarily we can stop in the middle, depends on the minimum number of elements we want in each cluster) at each step. It’s just the opposite of agglomerative clustering and it is a top-down approach. Divisive clustering is a way repetitive k means clustering.

Choosing between Agglomerative and Divisive Clustering is again application dependent, yet few points to be considered are:

- Divisive is more complex than agglomerative clustering.
- Divisive clustering is more efficient if we do not generate a complete hierarchy down to individual data points.
- Agglomerative clustering makes a decision by considering the local patters, without taking into account global patterns initially which cannot be reversed.

### Visualization of Hierarchical Clustering

A Super helpful method to visualize hierarchical clustering which helps in business is Dendogram. Dendograms are tree-like structures that record the sequence of merges and splits in which vertical line represents the distance between the clusters, distance between vertical lines and distance between the clusters is directly proportional i.e more the distance more the clusters are likely to be dissimilar.

We can use the dendogram to decide the number of clusters, just draw a line that intersects with a longest vertical line on the dendogram, a number of vertical lines intersected will be the number of clusters to be considered.

Below is the example Dendogram.

There are pretty simple and direct python packages and it’s functions to perform hierarchical clustering and plot dendograms.

- The hierarchy from scipy.
- Cluster.hierarchy.dendogram for visualization.

#### Common Scenarios in which Hierarchical Clustering is used

- Customer Segmentation to product or service marketing.
- City planning to identify the places to build structures/services/building.
- Social Network Analysis, For example, identify all MS Dhoni fans to advertise his biopic.

### Advantages of Hierarchical Clustering

The advantages are been given below:

- In the case of partial clustering like k-means, the number of clusters should be known prior to clustering, which is not possible in practical applications, whereas in hierarchical clustering no prior knowledge of the number of clusters is required.
- Hierarchical clustering outputs a hierarchy, i.e a structure more informative than the unstructured set of the flat clusters returned by partial clustering.
- Hierarchical clustering is easy to implement.
- Brings out results in most of the scenarios.

### Conclusion

Type of clustering makes the big difference when data is being presented, Hierarchical clustering being more informative and easy to analyze is more preferred than partial clustering. And it is often associated with heat maps. Not to forget attributes chosen to calculate similarity or dissimilarity predominantly influences both clusters and hierarchy.

### Recommended Articles

This is a guide to Hierarchical Clustering. Here we discuss the introduction, advantages of Hierarchical Clustering and Common Scenarios in which Hierarchical Clustering is used. You can also go through our other suggested articles to learn more–