What is Cluster Analysis?
Cluster analysis groups objects based on the characteristics that make them similar. It is also called segmentation analysis or taxonomy analysis. Cluster analysis does not distinguish between dependent and independent variables. It is used in a wide variety of fields such as psychology, biology, statistics, data mining, pattern recognition and the social sciences.
Objective of Cluster Analysis
The main objective of cluster analysis is to address the heterogeneity in a set of data. Its other objectives are
 Taxonomy description – Identifying groups within the data
 Data simplification – Analyzing groups of similar observations instead of every individual observation
 Hypothesis generation or testing – Developing hypotheses based on the nature of the data, or testing previously stated hypotheses
 Relationship identification – Using the simplified structure from cluster analysis to reveal relationships among the observations
There are two main purposes of cluster analysis – Understanding and Utility.
For Understanding, cluster analysis groups objects that share common characteristics.
For Utility, cluster analysis abstracts from individual data objects to the clusters to which they belong, so that each object can be described by the characteristics of its cluster.
Cluster analysis is often used alongside factor analysis and discriminant analysis.
You should ask yourself a few questions before starting a cluster analysis
 Which variables are relevant?
 Is the sample size large enough?
 Can outliers be detected, and should they be removed?
 How should object similarity be measured?
 Should the data be standardized?
Types of Clusters
There are three major types of clustering
 Hierarchical clustering – Includes the Agglomerative and Divisive methods
 Partitional clustering – Includes K-Means, Fuzzy K-Means and ISODATA
 Density-based clustering – Includes Denclust, CLUPOT, Mean Shift, SVC and Parzen-Watershed
Assumptions in Cluster Analysis
There are two main assumptions in cluster analysis
 It is assumed that the sample is representative of the population
 It is assumed that the variables are not correlated. If variables are correlated, either remove the correlated variables or use a distance measure that compensates for the correlation, such as the Mahalanobis distance.
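As an illustration of a distance measure that compensates for correlation, here is a minimal sketch of the squared Mahalanobis distance using NumPy. The covariance matrix and the three points are hypothetical values chosen for the example, and the helper name mahalanobis_sq is ours, not part of any package.

```python
import numpy as np

def mahalanobis_sq(x, y, cov):
    """Squared Mahalanobis distance between x and y for a given covariance matrix."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    # Solve cov @ z = diff rather than forming the explicit inverse.
    return float(diff @ np.linalg.solve(cov, diff))

# Hypothetical covariance matrix for two strongly correlated variables.
cov = np.array([[1.0, 0.9],
                [0.9, 1.0]])

origin = [0.0, 0.0]
along = [1.0, 1.0]      # differs from the origin in the direction of the correlation
against = [1.0, -1.0]   # differs from the origin against the correlation

d_along = mahalanobis_sq(origin, along, cov)
d_against = mahalanobis_sq(origin, against, cov)
# With an identity covariance the measure reduces to squared Euclidean distance.
d_identity = mahalanobis_sq(origin, along, np.eye(2))

print(d_along, d_against, d_identity)
```

Both points are equally far from the origin in Euclidean terms, but the point that moves against the correlation is treated as much more distant once the correlation is taken into account.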
Steps in Cluster Analysis

 Step 1 : Define the Problem
 Step 2 : Decide the appropriate similarity measure
 Step 3 : Decide on how to group the objects
 Step 4 : Decide the number of clusters
 Step 5 : Interpret, describe and validate the clusters
Cluster Analysis in SPSS
In SPSS you can find the cluster analysis options under Analyze/Classify. SPSS offers three methods for cluster analysis – K-Means Cluster, Hierarchical Cluster and Two Step Cluster.
The K-Means cluster method classifies a given set of data into a fixed number of clusters. This method is easy to understand and gives the best output when the groups in the data are well separated from each other.
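To make the idea of a fixed number of clusters concrete, here is a minimal sketch of the standard K-Means iteration (assign each case to the nearest center, then recompute the centers) in NumPy. It is not the SPSS implementation. The data, the random initialization and the function name kmeans are assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-Means: returns a cluster label per case and the final centers."""
    rng = np.random.default_rng(seed)
    # Start from k distinct cases picked at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: squared Euclidean distance of every case to every center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its cases
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Hypothetical data: two well-separated groups of three cases each.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

labels, centers = kmeans(X, k=2)
print(labels)
```

The number of clusters k has to be supplied up front, which is why the method works best when you already have an idea of how many well-separated groups to expect.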
Two Step cluster analysis is a tool designed to handle large data sets. It can create clusters from both categorical and continuous variables.
Hierarchical cluster is one of the most commonly used methods of cluster analysis. It combines cases into homogeneous clusters by merging them through a series of sequential steps.
Hierarchical cluster analysis consists of three steps
 Calculate the distances
 Link the clusters
 Choose a solution by selecting the right number of clusters
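The three steps above can be sketched in a few lines of NumPy. This is a deliberately naive illustration, not what SPSS runs: it assumes squared Euclidean distance, single linkage, and a simple "largest jump in merge height" rule for choosing the number of clusters. The function name agglomerate and the data are hypothetical.

```python
import numpy as np

def agglomerate(X, n_clusters):
    """Naive single-linkage agglomerative clustering, for illustration only."""
    # Step 1: pairwise squared Euclidean distances between all cases.
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    clusters = [[i] for i in range(len(X))]
    merge_heights = []
    # Step 2: repeatedly link the two closest clusters.
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = D[np.ix_(clusters[a], clusters[b])].min()
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merge_heights.append(float(d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merge_heights

# Hypothetical one-variable data with two obvious groups.
X = np.array([[0.0], [0.5], [1.0], [10.0], [10.5], [11.0]])

# Step 3: merge down to one cluster, then stop just before the largest jump in height.
_, heights = agglomerate(X, n_clusters=1)
jumps = np.diff(heights)
n_chosen = len(X) - (int(jumps.argmax()) + 1)
clusters, _ = agglomerate(X, n_clusters=n_chosen)
print(n_chosen, clusters)
```

The list of merge heights carries the same information that a dendrogram displays graphically: a large jump between two consecutive merges suggests that two dissimilar clusters were forced together.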
Given below are the steps for performing Hierarchical Cluster analysis in SPSS.
 The first step is to select the variables to be clustered in the main dialog box.
 Click the Statistics option in that dialog box to open the dialog box where you specify the output.
 In the Plots dialog box, add the Dendrogram. The dendrogram is the graphical representation of the hierarchical cluster analysis. It shows how the clusters are combined at every step until they form a single cluster.
 The Method dialog box is crucial. Here you specify the distance measure and the clustering method. SPSS offers measures for three types of data: interval, counts and binary.
 For interval data, the Squared Euclidean Distance is the sum of the squared differences, without taking the square root.
 For counts you can choose between the Chi-square and Phi-square measures.
 The Binary section offers many options. Squared Euclidean distance is a common starting point.
 The next step is to choose the cluster method. A common recommendation is to start with Single Linkage (Nearest Neighbour), as it helps to identify outliers. After the outliers are identified you can use Ward’s Method.
 The last step is standardization, so that variables measured on different scales contribute comparably to the distance.
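To see why standardization matters for the distance measure, the following sketch computes the squared Euclidean distance before and after z-score standardization. The cases (income and age) are made-up values, and z-scores are only one of several possible standardizations.

```python
import numpy as np

def zscore(X):
    """Standardize each column to mean 0 and sample standard deviation 1."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def squared_euclidean(a, b):
    """Sum of squared differences, without taking the square root."""
    return float(((a - b) ** 2).sum())

# Hypothetical cases: income in currency units, age in years.
X = np.array([[30000.0, 25.0],
              [32000.0, 60.0],
              [60000.0, 27.0]])

# Raw distances are dominated by income because of its much larger scale.
raw_01 = squared_euclidean(X[0], X[1])
raw_02 = squared_euclidean(X[0], X[2])

# After standardization both variables contribute on a comparable scale.
Z = zscore(X)
std_01 = squared_euclidean(Z[0], Z[1])
std_02 = squared_euclidean(Z[0], Z[2])

print(raw_01, raw_02)
print(std_01, std_02)
```

On the raw data the 35-year age gap between the first two cases barely registers next to the income differences. After standardization it counts about as much as the large income gap between the first and third case.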
Criticisms of Cluster Analysis
The most common criticisms are listed below
 It is descriptive, atheoretical and non-inferential.
 It will produce clusters regardless of whether any structure actually exists in the data.
 Its solutions cannot be generalized widely, as they depend entirely on the variables used as the basis for the similarity measure.
What is Factor Analysis?
Factor analysis is an exploratory analysis which helps to group similar variables into dimensions. It can be used to simplify the data by reducing the dimensions of the observations. Factor analysis has several different rotation methods.
Factor analysis is used mostly for data reduction purposes.
There are two types of factor analysis – Exploratory and Confirmatory
 The Exploratory method is used when you do not have a predefined idea about the structures or dimensions in a set of variables.
 The Confirmatory method is used when you want to test specific hypotheses about the structures or dimensions in a set of variables.
Objectives of Factor Analysis
There are two main objectives of factor analysis, which are mentioned below
 Identification of the underlying factors – This includes clustering variables into homogeneous sets, creating new variables and helping to gain knowledge about the categories
 Screening of variables – This is helpful in regression, and it identifies groupings that allow you to select one variable to represent many.
Assumptions of Factor analysis
There are four main assumptions of Factor analysis which are mentioned below
 Models are usually based on linear relationships
 It assumes that the data collected are interval scaled
 Multicollinearity in the data is desirable as the objective is to find out the interrelated set of variables
 The data should be suitable for factor analysis. If a variable is correlated only with itself and with no other variable, it cannot be grouped into a factor, and factor analysis cannot be done on such data.
Types of Factoring
 Principal component factoring – The most commonly used method, in which factor weights are computed to extract the maximum possible variance, with factoring continuing until no meaningful variance is left.
 Canonical factor analysis – Finds factors which have the highest canonical correlation with the observed variables
 Common factor analysis – Seeks the least number of factors which can account for the common variance of a set of variables
 Image factoring – Based on the correlation matrix where each variable is predicted from the others using multiple regression
 Alpha Factoring – Maximizes the reliability of factors
 Factor regression model – Combination of factor model and regression model whose factors are partially known
Criteria of Factor analysis

Eigenvalue criteria
 An eigenvalue represents the amount of variance in the original variables that is associated with a factor
 The sum of the squared factor loadings of all variables on a factor equals that factor's eigenvalue
 Factors with eigenvalues greater than 1.0 are retained
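The eigenvalue criterion can be checked numerically. The sketch below uses a hypothetical correlation matrix for four variables and principal-component style loadings (each eigenvector scaled by the square root of its eigenvalue). It then confirms that the sum of squared loadings per factor equals the eigenvalue, and counts the factors with eigenvalues above 1.0.

```python
import numpy as np

# Hypothetical correlation matrix: variables 1-2 and variables 3-4 form two groups.
R = np.array([[1.0, 0.8, 0.1, 0.1],
              [0.8, 1.0, 0.1, 0.1],
              [0.1, 0.1, 1.0, 0.7],
              [0.1, 0.1, 0.7, 1.0]])

# eigh is meant for symmetric matrices; sort eigenvalues from largest to smallest.
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Loadings: each eigenvector (column) scaled by the square root of its eigenvalue.
loadings = eigvecs * np.sqrt(eigvals)

# Sum of squared loadings per factor reproduces the eigenvalue.
sum_sq_loadings = (loadings ** 2).sum(axis=0)

# Eigenvalue criterion: keep factors whose eigenvalue exceeds 1.0.
n_keep = int((eigvals > 1.0).sum())

print(np.round(eigvals, 3), n_keep)
```

Because the eigenvalues of a correlation matrix add up to the number of variables, an eigenvalue above 1.0 means the factor accounts for more variance than a single original variable.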

Scree Plot Criteria
 A scree plot is a plot of the eigenvalues against the number of factors, in order of extraction.
 The shape of the plot is used to determine the number of factors, typically the point at which the curve levels off

Percentage of Variance Criteria
 The number of factors is chosen so that the cumulative percentage of variance extracted by the factors reaches a satisfactory level.
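A minimal sketch of the percentage of variance criterion follows. The eigenvalues and the 75% target are hypothetical, since the source does not name a specific satisfactory level, and the function name n_factors_for_variance is ours.

```python
import numpy as np

def n_factors_for_variance(eigenvalues, target):
    """Smallest number of factors whose cumulative share of variance reaches target."""
    ev = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    cumulative = np.cumsum(ev) / ev.sum()
    # Index of the first cumulative share that is at least the target, plus one.
    n_keep = int(np.searchsorted(cumulative, target) + 1)
    return n_keep, cumulative

# Hypothetical eigenvalues from an extraction with five variables.
eigenvalues = [2.5, 1.5, 0.5, 0.25, 0.25]

# Assumed target: the factors should explain at least 75% of the variance.
n_keep, cumulative = n_factors_for_variance(eigenvalues, target=0.75)
print(n_keep, cumulative)
```

The right target depends on the field and the purpose of the analysis, so the threshold is best treated as a parameter rather than a fixed rule.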

Significance Test Criteria
 The statistical significance of the separate eigenvalues is determined, and only those factors that are statistically significant are retained
Factor analysis is used in various fields like Psychology, Sociology, Political Science, Education and Mental health.
Factor Analysis in SPSS
In SPSS the factor analysis option can be found under Analyze > Dimension Reduction > Factor
 Start by adding the variables to the list of variables section
 Click the Descriptives button and add a few statistics with which the assumptions of factor analysis can be verified.
 Click the Extraction option, which lets you choose the extraction method and the cut-off value for extraction
 Principal Components (PCA) is the default extraction method. It forms uncorrelated linear combinations of the variables, and it can be used when the correlation matrix is singular. The first component has maximum variance, and the following components explain progressively smaller portions of the variance.
 The second most common method is Principal Axis Factoring. It identifies the latent constructs behind the observations.
 The next step is to select a rotation method. The most frequently used method is Varimax, which simplifies the interpretation of the factors.
 The second method is Quartimax. It minimizes the number of factors needed to explain each variable, which simplifies the interpretation of the observed variables.
 The next method is Equamax, which is a combination of the above two methods.
 Click Options in the dialog box to manage missing values
 Before saving the results to the data set, first run the factor analysis, check the assumptions and confirm that the results are meaningful and useful.
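To show what a rotation does to the loadings, here is a sketch of the widely used SVD-based iteration for Varimax in NumPy. It is an illustration under our own assumptions, not the SPSS routine: the loading matrix is hypothetical, and no Kaiser normalization is applied.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonal varimax rotation of a (variables x factors) loading matrix."""
    L = np.asarray(loadings, dtype=float)
    p, k = L.shape
    R = np.eye(k)          # rotation matrix, starts as "no rotation"
    criterion = 0.0
    for _ in range(max_iter):
        rotated = L @ R
        # Gradient of the varimax criterion with respect to the rotation.
        grad = L.T @ (rotated ** 3 - rotated * (rotated ** 2).sum(axis=0) / p)
        # Project the gradient back onto the set of orthogonal matrices.
        u, s, vt = np.linalg.svd(grad)
        R = u @ vt
        new_criterion = s.sum()
        if new_criterion < criterion * (1 + tol):
            break
        criterion = new_criterion
    return L @ R, R

# Hypothetical unrotated loadings: every variable loads on both factors.
loadings = np.array([[0.7, 0.5],
                     [0.8, 0.4],
                     [0.6, -0.5],
                     [0.7, -0.6]])

rotated, R = varimax(loadings)
print(np.round(rotated, 2))
```

Because the rotation is orthogonal, each variable's communality (the row sum of its squared loadings) is unchanged. Only the way that variance is distributed across the factors changes, which is what makes the rotated factors easier to interpret.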
Cluster Analysis vs Factor Analysis
Both cluster analysis and factor analysis are unsupervised learning methods used for segmentation of data. Many researchers who are new to this field feel that the two are similar. They may seem similar, but they differ in many ways. The differences between cluster analysis and factor analysis are listed below

Objective
The objectives of cluster analysis and factor analysis are different. The objective of cluster analysis is to divide the observations into homogeneous and distinct groups. Factor analysis, on the other hand, explains the homogeneity of the variables that results from the similarity of their values.

Complexity
Complexity is another point on which cluster and factor analysis differ. Data size affects the two analyses differently. If the data set is too large, cluster analysis can become computationally intractable, since hierarchical methods, for example, compare every pair of cases.

Solution
The solution to a problem is more or less similar in factor and cluster analysis, but factor analysis generally gives the researcher a better solution. Cluster analysis does not always yield the best result, as many of its algorithms are computationally inefficient.

Applications
Factor analysis and cluster analysis are applied differently to real data. Factor analysis is suitable for simplifying complex models. It reduces a large set of variables to a much smaller set of factors. The researcher can develop a set of hypotheses and run factor analysis to confirm or reject them.
Cluster analysis is suitable for classifying objects based on certain criteria. The researcher can measure certain aspects of a group and divide its members into specific categories using cluster analysis.
There are also a number of other differences, which are mentioned below
 Cluster analysis attempts to group cases whereas factor analysis attempts to group features.
 Cluster analysis is used to find smaller groups of cases that are representative of the data as a whole. Factor analysis is used to find a smaller group of features that are representative of the data set's original features.
 The most important part of cluster analysis is finding the number of clusters. Broadly, clustering methods are divided into two – the Agglomerative method and the Partitioning method. The Agglomerative method starts with each case in its own cluster and stops when a criterion is reached. The Partitioning method starts with all cases in one cluster.
 Factor analysis is used to find an underlying structure in a set of data.
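The cases versus features distinction shows up directly in the matrices the two analyses start from. In this small sketch the data are random, hypothetical values. Cluster analysis works on distances between rows (cases), while factor analysis works on correlations between columns (features).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # hypothetical data: 8 cases (rows), 3 features (columns)

# Cluster analysis compares cases: an 8 x 8 matrix of squared Euclidean distances.
case_distances = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)

# Factor analysis compares features: a 3 x 3 correlation matrix.
feature_correlations = np.corrcoef(X, rowvar=False)

print(case_distances.shape, feature_correlations.shape)
```

The case-by-case matrix grows with the square of the number of cases, whereas the feature-by-feature matrix depends only on the number of variables. This is one reason data size affects the two analyses so differently.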
Conclusion
We hope this article has helped you understand the basics of cluster analysis and factor analysis, and the differences between the two.