Introduction to K Means Clustering Algorithm
K Means clustering is an unsupervised learning algorithm. It is used when the data is unlabeled, i.e. not already divided into groups or categories. The aim of the algorithm is to find groups in the data, where the variable K represents the number of groups.
Understanding K Means Clustering Algorithm
K Means is an iterative algorithm that partitions the dataset, according to its features, into K predefined, non-overlapping clusters or subgroups. It makes the data points within a cluster (intra-cluster) as similar as possible while keeping the clusters as far apart as possible. It assigns data points to clusters so that the sum of the squared distances between each cluster’s centroid and its data points is at a minimum, where a cluster’s centroid is the arithmetic mean of the data points in that cluster. Less variation within a cluster means more similar (homogeneous) data points within it.
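The quantity described above can be written compactly. The notation here is introduced only for illustration and is not from the original text: C_k denotes the set of points assigned to cluster k, and \mu_k denotes its centroid.

```latex
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
```

A smaller J means the points lie closer to their own centroids, which is the sense in which the clusters are homogeneous.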
How the K Means Clustering Algorithm Works
K Means Clustering Algorithm needs the following inputs:
 K = number of subgroups or clusters
 Sample or Training Set = {x_{1}, x_{2}, x_{3}, ..., x_{n}}
Assume we have an unlabeled data set that we need to divide into clusters.
First we need to choose the number of clusters. This can be done by two methods:
 Elbow Method
 Purpose-Based Method
Let us discuss them in brief:
Elbow Method
In this method, a curve is plotted of the within-cluster sum of squares (WSS) against the number of clusters. The curve resembles a human arm, and the method takes its name from the “elbow” of the curve, the point that gives the optimum number of clusters. After the elbow point, the value of WSS changes very slowly, so the number of clusters at the elbow is taken as the final value of K.
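A minimal sketch of how the WSS values behind such a curve might be computed, using only the Python standard library. The function names (`wss`, `k_means`, `elbow_curve`), the toy data, and the choice to rerun each K several times and keep the lowest WSS are assumptions made for illustration, not part of the method as described above.

```python
import math
import random

def wss(points, centroids, labels):
    """Within-cluster sum of squares: squared distance of each point to its centroid."""
    return sum(math.dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))

def k_means(points, k, rng, max_iter=100):
    """Compact K Means; returns (centroids, labels)."""
    centroids = rng.sample(points, k)  # start from k distinct data points
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(centroids[j])  # assumption: empty cluster keeps its centroid
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels

def elbow_curve(points, k_values, restarts=10, seed=0):
    """Map each K to the lowest WSS found over several random restarts."""
    rng = random.Random(seed)
    return {k: min(wss(points, *k_means(points, k, rng)) for _ in range(restarts))
            for k in k_values}

# Toy data: three small, well-separated groups of four points each.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0),
          (8.0, 1.0), (8.0, 2.0), (9.0, 1.0), (9.0, 2.0),
          (4.5, 8.0), (4.5, 9.0), (5.5, 8.0), (5.5, 9.0)]

curve = elbow_curve(points, range(1, 6))
for k, value in curve.items():
    print(k, round(value, 2))
```

Plotting these values against K should show WSS falling steeply at first and then flattening; for data like this, one would expect the bend near K = 3.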
Purpose-Based Method
In this method, the data is divided according to different metrics, and each division is then judged on how well it serves the purpose at hand. For example, the shirts in the men’s clothing department of a mall may be arranged by size, but they could also be arranged by price or by brand. The most suitable arrangement is chosen to give the optimal number of clusters, i.e. the value of K.
Now let us get back to the data set above. We can choose the number of clusters, i.e. the value of K, using either of the above methods.
How to Use the Above Methods?
Now let us see the execution process:
Step 1: Initialisation
First, initialize K random points, called the centroids of the clusters. While initializing, take care that the number of centroids is less than the number of training data points. Because the algorithm is iterative, the next two steps are performed repeatedly.
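One common way to do this, sketched below, is to pick K of the data points themselves as the starting centroids. That choice, the function name `init_centroids`, and the toy data are assumptions for illustration; the text above only asks for random points.

```python
import random

def init_centroids(points, k, seed=None):
    """Pick k data points (at distinct positions in the list) as starting centroids."""
    if not 0 < k < len(points):
        raise ValueError("k must be positive and smaller than the number of points")
    rng = random.Random(seed)
    # rng.sample draws without replacement, so the same list entry is never picked twice.
    return [tuple(p) for p in rng.sample(points, k)]

points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]
centroids = init_centroids(points, k=2, seed=0)
print(centroids)
```

Passing a seed makes the random choice repeatable, which helps when comparing runs.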
Step 2: Cluster Assignment
After initialization, all data points are traversed and the distance between each data point and every centroid is calculated. Each data point is then assigned to the cluster of its nearest centroid. In this example, the data is divided into two clusters.
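The assignment step can be sketched as follows, assuming Euclidean distance. The function name `assign_clusters`, the toy data, and the two starting centroids are illustrative only.

```python
import math

def assign_clusters(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    labels = []
    for p in points:
        distances = [math.dist(p, c) for c in centroids]
        labels.append(distances.index(min(distances)))  # ties go to the lower index
    return labels

points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]
centroids = [(1.0, 1.0), (9.0, 8.5)]
labels = assign_clusters(points, centroids)
print(labels)  # [0, 0, 0, 1, 1, 1]
```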
Step 3: Moving Centroid
The clusters formed in the above step are not yet optimized, so we need to move the centroids iteratively to new locations. Take the data points of one cluster, compute their average, and move the centroid of that cluster to this new location. Repeat the same step for all other clusters.
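A minimal sketch of the centroid move. The function name `update_centroids` and the handling of an empty cluster (it keeps its previous centroid) are assumptions; the text above does not say what to do when a cluster has no points.

```python
def update_centroids(points, labels, centroids):
    """Move each centroid to the mean of the points currently assigned to it."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(old)
            continue
        # zip(*members) groups the coordinates by dimension.
        new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
    return new_centroids

points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]
labels = [0, 0, 0, 1, 1, 1]
centroids = [(1.0, 1.0), (9.0, 8.5)]
moved = update_centroids(points, labels, centroids)
print(moved)  # [(1.5, 1.5), (8.5, 8.5)]
```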
Step 4: Optimization
The above two steps are repeated until the centroids stop moving, i.e. they no longer change their positions. Once this happens, the K Means algorithm is said to have converged.
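Putting the steps together, the loop might look like the sketch below. The `max_iter` cap is an added safety limit that the text does not mention, and the function name `k_means` and the toy data are illustrative.

```python
import math
import random

def k_means(points, k, max_iter=100, seed=None):
    """Repeat assignment and centroid update until the centroids stop moving."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k of the data points
    for _ in range(max_iter):
        # Cluster assignment: nearest centroid by Euclidean distance.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]
        # Moving centroid: mean of each cluster's members.
        new_centroids = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # empty cluster keeps its centroid
        if new_centroids == centroids:  # centroids are static: converged
            break
        centroids = new_centroids
    return centroids, labels

points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]
centroids, labels = k_means(points, k=2, seed=0)
print(centroids)
print(labels)
```

On this well-separated toy data the two groups are recovered; which group receives label 0 depends on the random start.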
Step 5: Convergence
Now the algorithm has converged and distinct clusters have formed. The algorithm can give different results depending on how the centroids were initialized in the first step.
Applications of K Means Clustering Algorithm
 Market segmentation
 Document clustering
 Image segmentation
 Image compression
 Vector quantization
 Cluster analysis
 Feature learning or dictionary learning
 Identifying crime-prone areas
 Insurance fraud detection
 Public transport data analysis
 Clustering of IT assets
 Customer segmentation
 Identifying Cancerous data
 Used in search engines
 Drug Activity Prediction
Advantages of K Means Clustering Algorithm
 It is fast
 Robust
 Easy to understand
 Comparatively efficient
 Gives the best results when the data sets are distinct and well separated
 Produces tighter clusters
 Cluster assignments are updated whenever the centroids are recomputed
 Flexible
 Easy to interpret
 Better computational cost
 Enhances Accuracy
 Works better with spherical clusters
Disadvantages of K Means Clustering Algorithm
 Needs prior specification for the number of cluster centers
 If two sets of data overlap heavily, it cannot distinguish them or tell that there are two clusters
 Different representations of the data give different results
 Euclidean distance can unequally weight the factors
 It may converge to a local optimum of the squared error function
 Randomly chosen initial centroids sometimes give poor results
 Can be used only if the mean is defined
 Cannot handle outliers and noisy data
 Does not work well for non-linear data sets
 Lacks consistency
 Sensitive to scale
 Very large data sets may exhaust the computer's memory or processing capacity
 Prediction issues
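To illustrate the scale sensitivity and unequal weighting listed above, the sketch below compares Euclidean distances before and after rescaling each feature. The customer data is hypothetical, and z-score standardization is one common remedy chosen here for illustration; it is not prescribed by this guide.

```python
import math
import statistics

def standardize(rows):
    """Rescale each column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]  # assumes no constant column
    return [tuple((v - m) / s for v, m, s in zip(row, means, stdevs)) for row in rows]

# Hypothetical customers: (annual income, age in years).
customers = [(30000.0, 25.0), (30500.0, 60.0), (60000.0, 26.0), (61000.0, 58.0)]

# Raw distances: income dominates because its numbers are much larger.
raw_age_gap = math.dist(customers[0], customers[1])     # mostly an age difference
raw_income_gap = math.dist(customers[0], customers[2])  # mostly an income difference

scaled = standardize(customers)
std_age_gap = math.dist(scaled[0], scaled[1])
std_income_gap = math.dist(scaled[0], scaled[2])

print(round(raw_age_gap, 1), round(raw_income_gap, 1))
print(round(std_age_gap, 2), round(std_income_gap, 2))
```

In raw units the 35-year age gap barely registers next to the income gap, while after rescaling the two gaps are of similar size, so both features can influence the clustering.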
This has been a guide to the K Means clustering algorithm. Here we discussed the working, applications, advantages, and disadvantages of the K Means clustering algorithm.