Updated February 27, 2023

What is Cluster Analysis?

Cluster analysis groups objects according to the characteristics that make them similar. It is also called segmentation analysis or taxonomy analysis. Cluster analysis does not distinguish between dependent and independent variables. It is used in a wide variety of fields, such as psychology, biology, statistics, data mining, pattern recognition, and the social sciences.

Objective

The main objective of cluster analysis is to address the heterogeneity in a data set. Its other objectives are:

Taxonomy description – Identifying groups within the data
Data simplification – Analyzing groups of similar observations instead of every individual observation
Hypothesis generation or testing – Developing hypotheses based on the nature of the data, or testing previously stated hypotheses
Relationship identification – Using the simplified structure from cluster analysis to describe the relationships among observations

Cluster analysis serves two main purposes: understanding and utility.

For understanding, it groups objects that share common characteristics.

For utility, it abstracts from individual data objects to the clusters to which they belong, so that each cluster can be summarized by the characteristics of its members.

Cluster analysis is often used together with factor analysis and discriminant analysis.

Ask yourself a few questions before starting a cluster analysis:

What variables are relevant?
Is the sample size large enough?
Can outliers be detected, and should they be removed?
How should object similarity be measured?
Should data be standardized?
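
The standardization question matters because distance-based clustering is sensitive to the scale of each variable. The sketch below shows one common option, z-score standardization, using made-up income and age columns (the data and the function name are illustrative assumptions, not part of any SPSS procedure).

```python
from statistics import mean, pstdev


def standardize(column):
    """Return z-scores: (value - mean) / population standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


# Hypothetical variables measured on very different scales.
income = [20000, 30000, 40000]
age = [20, 30, 40]

z_income = standardize(income)
z_age = standardize(age)
print(z_income)
print(z_age)
```

After standardization both columns are on the same scale, so neither dominates a distance calculation simply because of its units.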

Types of Clusters

There are three major types of clustering:

Hierarchical clustering – Includes the agglomerative and divisive methods
Partitional clustering – Includes K-Means, Fuzzy K-Means, and ISODATA
Density-based clustering – Includes DENCLUE, CLUPOT, Mean Shift, SVC, and Parzen-Watershed

Assumptions

Cluster analysis rests on two assumptions:

The sample is representative of the population.
The variables are not correlated. If variables are correlated, either remove the redundant ones or use a distance measure that compensates for correlation, such as the Mahalanobis distance.
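
One way to check the second assumption is to compute pairwise correlations before clustering. The following is a minimal sketch using the Pearson correlation; the variable names, the data, and the 0.9 cut-off are assumptions chosen for illustration only.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def highly_correlated_pairs(data, threshold=0.9):
    """List variable pairs whose absolute correlation meets the threshold."""
    names = list(data)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            r = pearson(data[names[i]], data[names[j]])
            if abs(r) >= threshold:
                pairs.append((names[i], names[j], round(r, 3)))
    return pairs


# Hypothetical data: height is recorded twice in different units.
data = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "score": [3, 9, 4, 8, 5],
}
flagged = highly_correlated_pairs(data)
print(flagged)
```

Here only the two height variables are flagged, so one of them could be dropped before the analysis.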

Steps

The main steps of a cluster analysis are:

- Step 1: Define the problem
- Step 2: Decide on an appropriate similarity measure
- Step 3: Decide how to group the objects
- Step 4: Decide on the number of clusters
- Step 5: Interpret, describe, and validate the clusters

Cluster Analysis in SPSS

In SPSS, the cluster analysis options are found under Analyze/Classify. SPSS offers three methods for cluster analysis: K-Means Cluster, Hierarchical Cluster, and Two Step Cluster.

The K-Means cluster method partitions a data set into a fixed number of clusters that you specify in advance. It is easy to understand and works best when the groups in the data are well separated from each other.
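
To make the idea concrete, here is a minimal sketch of the basic K-Means loop in plain Python. It is not the SPSS implementation: the starting centroids are passed in by hand, and the sample points are invented so that the two groups are well separated.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(centroids[k])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical data with two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = k_means(points, centroids=[points[0], points[3]])
print(labels)
print(centroids)
```

The number of clusters is fixed by how many starting centroids are supplied, which mirrors the requirement to choose the number of clusters before running K-Means.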

Two Step cluster analysis is a tool designed to handle large data sets. It creates clusters on both categorical and continuous variables.

Hierarchical clustering is one of the most commonly used methods of cluster analysis. It combines cases into homogeneous clusters by merging them in a series of sequential steps.

Hierarchical cluster analysis involves three steps:

Calculate the distances
Link the clusters
Choose a solution by selecting the right number of clusters
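
The three steps above can be sketched as a small agglomerative routine. This is an illustrative version using single linkage and Euclidean distance on made-up points; the function name and the `n_clusters` stopping rule are assumptions, and real software uses far more efficient algorithms.

```python
from math import dist


def single_linkage(points, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # start: one case per cluster
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Step 1: distance between clusters = closest pair of members.
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # Step 2: link the closest pair and record the merge height.
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges


# Hypothetical cases: two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5), (10.0, 0.0)]

# Step 3: choose the solution by fixing the number of clusters.
clusters, merges = single_linkage(points, n_clusters=3)
print(clusters)
print(merges)
```

The recorded merge distances are the information a dendrogram displays: small heights for early merges and larger heights as dissimilar clusters are joined.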

Given below are the steps for performing Hierarchical Cluster analysis in SPSS.

The first step is to select the variables to be clustered in the main dialog box.
Click the Statistics option in that dialog box to open the dialog box where you specify the output.
In the Plots dialog box, add the dendrogram. The dendrogram is the graphical representation of the hierarchical cluster analysis. It shows how the clusters are combined at every step until they form a single cluster.
The Method dialog box is crucial. Here you specify the distance measure and the clustering method. SPSS groups its measures by data type: interval, counts, and binary.
For interval data, the Squared Euclidean Distance is the sum of the squared differences, without taking the square root.
For counts, you can choose between the Chi-square and Phi-square measures.
For binary data, there are many options to choose from. Squared Euclidean distance is a common choice.
The next step is to choose the cluster method. A common recommendation is to start with Single Linkage (also called Nearest Neighbour), as it helps to identify outliers. After the outliers are identified, you can use Ward’s Method.
The last step is to decide whether to standardize the variables.
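
As a small worked example of the interval measure mentioned in these steps, the snippet below computes the squared Euclidean distance between two hypothetical cases and compares it with the ordinary Euclidean distance. The case values are invented for illustration.

```python
from math import sqrt


def squared_euclidean(p, q):
    """Sum of squared differences, with no square root taken."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


# Two hypothetical cases measured on three interval variables.
case_a = (2.0, 4.0, 6.0)
case_b = (5.0, 8.0, 6.0)

d_squared = squared_euclidean(case_a, case_b)  # 3^2 + 4^2 + 0^2
d = sqrt(d_squared)                            # ordinary Euclidean distance
print(d_squared, d)
```

Because squaring preserves order for non-negative numbers, both measures rank pairs of cases in the same order of closeness; the squared version simply gives more weight to large differences.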

Criticisms

The most common criticisms of cluster analysis are listed below:

It is descriptive, atheoretical, and non-inferential.
It will produce clusters regardless of whether any structure actually exists in the data.
Its solutions are not generalizable, because they depend entirely on the variables used as the basis for the similarity measure.

What is Factor Analysis?

Factor analysis is an exploratory technique that groups similar variables into dimensions. It can be used to simplify data by reducing the number of dimensions of the observations. Several different rotation methods are available in factor analysis.

Factor analysis is used mostly for data reduction.

There are two types of factor analysis: exploratory and confirmatory.

The exploratory method is used when you do not have a predefined idea about the structures or dimensions in a set of variables.
The confirmatory method is used when you want to test a specific hypothesis about the structures or dimensions in a set of variables.

Objectives

Factor analysis has two main objectives:

Identification of the underlying factors – This includes clustering variables into homogeneous sets, creating new variables, and gaining knowledge about the categories
Screening of variables – This is helpful in regression, as it identifies groupings that allow you to select one variable to represent many

Assumptions

Factor analysis rests on four main assumptions:

Models are usually based on linear relationships.
The data are assumed to be interval scaled.
Multicollinearity in the data is desirable, as the objective is to find the interrelated sets of variables.
The data should be suitable for factor analysis. If each variable is correlated only with itself and with no other variable, factor analysis cannot be performed on the data.

Types of Factoring

Some of the types of factoring are described below.

Principal component factoring – The most commonly used method, in which factor weights are computed to extract the maximum possible variance, continuing until no meaningful variance is left
Canonical factor analysis – Finds factors that have the highest canonical correlation with the observed variables
Common factor analysis – Seeks the smallest number of factors that can account for the common variance of a set of variables
Image factoring – Based on the correlation matrix where each variable is predicted from the others using multiple regression
Alpha factoring – Maximizes the reliability of the factors
Factor regression model – A combination of the factor model and the regression model, in which the factors are partially known

Criteria

The criteria for deciding how many factors to retain are described below.

Eigenvalue criteria

An eigenvalue represents the amount of variance in the original variables that is associated with a factor.
The eigenvalue of a factor is the sum of the squared factor loadings of all variables on that factor.
Factors with eigenvalues greater than 1.0 are retained.
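
The eigenvalue rule can be illustrated with a short calculation. The loadings below are hypothetical values for four items on two factors, chosen only to show the arithmetic; they do not come from a real analysis.

```python
# Hypothetical unrotated loadings: item -> (loading on factor 1, factor 2).
loadings = {
    "item1": (0.8, 0.1),
    "item2": (0.7, 0.2),
    "item3": (0.6, 0.3),
    "item4": (0.1, 0.4),
}
n_factors = 2

# Eigenvalue of a factor = sum of squared loadings of all variables on it.
eigenvalues = [
    sum(row[f] ** 2 for row in loadings.values())
    for f in range(n_factors)
]

# Keep only the factors whose eigenvalue exceeds 1.0.
retained = [f for f, ev in enumerate(eigenvalues) if ev > 1.0]
print(eigenvalues)
print(retained)
```

In this example the first factor has an eigenvalue of about 1.5 and is kept, while the second, at about 0.3, explains less variance than a single standardized variable and is dropped.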

Scree Plot Criteria

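A scree plot shows the eigenvalues plotted against the number of factors, in order of extraction.
The shape of the plot determines the number of factors: factors are typically retained up to the point where the curve levels off (the "elbow").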

Percentage of Variance Criteria

Factors are extracted until the cumulative percentage of variance they explain reaches a satisfactory level.
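
A minimal sketch of this criterion follows. The eigenvalues and the 75% target are assumptions picked for illustration; what counts as a satisfactory level depends on the field and the study.

```python
from itertools import accumulate


def factors_for_variance(eigenvalues, target_pct):
    """Return the smallest number of factors reaching the target, plus the
    cumulative percentage of variance after each factor."""
    total = sum(eigenvalues)
    cumulative = [100 * c / total for c in accumulate(eigenvalues)]
    for k, pct in enumerate(cumulative, start=1):
        if pct >= target_pct:
            return k, cumulative
    return len(eigenvalues), cumulative


# Hypothetical eigenvalues for five standardized variables (they sum to 5).
eigenvalues = [2.5, 1.5, 0.5, 0.3, 0.2]
k, cumulative = factors_for_variance(eigenvalues, target_pct=75.0)
print(k)
print(cumulative)
```

With these values the first factor explains about 50% of the variance and the first two about 80%, so two factors are enough to pass the 75% target.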

Significance Test Criteria

The statistical significance of the separate eigenvalues is determined, and only those factors that are statistically significant are retained.

Factor analysis is used in various fields like Psychology, Sociology, Political Science, Education and Mental health.

Factor Analysis in SPSS

In SPSS, factor analysis is found under Analyze → Dimension Reduction → Factor.

Start by adding the variables to the Variables list.
Click Descriptives and add a few statistics that are used to verify the assumptions of factor analysis.
Click Extraction to choose the extraction method and the cut-off value for extraction.
Principal Components (PCA) is the default extraction method. It forms uncorrelated linear combinations of the variables and can be used even when the correlation matrix is singular. As in Canonical Correlation Analysis, the solutions are ordered: the first factor has maximum variance and each following factor explains a smaller portion of the variance.
The second most common method is Principal Axis Factoring. It identifies the latent constructs behind the observations.
The next step is to select a rotation method. The most frequently used method is Varimax, which simplifies the interpretation of the factors.
The second method is Quartimax. It rotates the factors to minimize the number of factors needed to explain each variable, which simplifies the interpretation of the observed variables.
The next method is Equamax, which is a combination of the above two methods.
In the dialog box, click Options to manage the missing values.
Before saving the results to the data set, first run the factor analysis, check the assumptions, and confirm that the results are meaningful and useful.

Cluster Analysis vs Factor Analysis

Both cluster analysis and factor analysis are unsupervised learning methods used for segmentation of data. Many researchers who are new to the field feel that cluster analysis and factor analysis are similar. They may seem similar, but they differ in many ways. The differences are listed below.

Objective

The objectives of cluster analysis and factor analysis are different. The objective of cluster analysis is to divide the observations into homogeneous and distinct groups. Factor analysis, on the other hand, explains the homogeneity of the variables resulting from the similarity of their values.

Complexity

Complexity is another point on which cluster analysis and factor analysis differ. The size of the data affects the two analyses differently. If the data set is very large, cluster analysis can become computationally intractable.

Solution

The solutions to a problem are more or less similar in factor analysis and cluster analysis, but factor analysis generally gives the researcher a better solution. Cluster analysis may not yield the best result, because many clustering algorithms are computationally inefficient.

Applications

Factor analysis and cluster analysis are applied differently to real data. Factor analysis is suitable for simplifying complex models. It reduces a large set of variables to a much smaller set of factors. The researcher can develop a set of hypotheses and run a factor analysis to confirm or reject them.

Cluster analysis is suitable for classifying objects based on certain criteria. The researcher can measure certain aspects of a group and divide its members into specific categories using cluster analysis.

There are also several other differences, which are listed below.

Cluster analysis attempts to group cases, whereas factor analysis attempts to group features.
Cluster analysis is used to find smaller groups of cases that are representative of the data as a whole. Factor analysis is used to find a smaller group of features that are representative of the data set's original features.
The most important part of cluster analysis is finding the number of clusters. Clustering methods are broadly divided into two kinds: agglomerative methods and partitioning methods. An agglomerative method starts with each case in its own cluster and stops when a criterion is reached. A partitioning method starts with all cases in one cluster.
Factor analysis is used to find the underlying structure in a set of data.

Conclusion

We hope this article has helped you understand the basics of cluster analysis and factor analysis and the differences between the two.
