What is Cluster Analysis
Cluster analysis groups objects based on the characteristics that make them similar. It is also called segmentation analysis or taxonomy analysis. It does not distinguish between dependent and independent variables. Cluster analysis is used in a wide variety of fields, such as psychology, biology, statistics, data mining, pattern recognition and the social sciences.
Objective of Cluster Analysis
The main objective of cluster analysis is to address the heterogeneity in a set of data. Its other objectives are:
- Taxonomy description – Identifying groups within the data
- Data simplification – Analyzing groups of similar observations instead of every individual observation
- Hypothesis generation or testing – Developing hypotheses based on the nature of the data, or testing previously stated hypotheses
- Relationship identification – Using the simplified structure from cluster analysis to describe relationships among the observations
There are two main purposes of cluster analysis – Understanding and Utility.
For Understanding, cluster analysis groups objects that share common characteristics.
For Utility, cluster analysis abstracts from the individual data objects to the clusters to which they belong, so that each cluster can be described by its characteristics.
Cluster analysis goes hand in hand with factor analysis and discriminant analysis.
Ask yourself a few questions before starting a cluster analysis:
- Which variables are relevant?
- Is the sample size large enough?
- Can outliers be detected, and should they be removed?
- How should object similarity be measured?
- Should the data be standardized?
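The standardization question above can be made concrete with a small sketch. The example below converts each variable to z-scores using only the Python standard library. The `standardize` function and the income/age figures are hypothetical, chosen only to show how variables on very different scales become comparable.

```python
from statistics import mean, pstdev


def standardize(columns):
    """Convert each variable (column) to z-scores: (value - mean) / std."""
    result = {}
    for name, values in columns.items():
        mu = mean(values)
        sigma = pstdev(values)  # population standard deviation
        # Assumption: no variable is constant, so sigma is never zero.
        result[name] = [(v - mu) / sigma for v in values]
    return result


# Hypothetical data: income is in currency units, age in years.
data = {
    "income": [30000, 45000, 60000, 75000],
    "age": [25, 35, 45, 55],
}
z = standardize(data)
for name, values in z.items():
    print(name, [round(v, 3) for v in values])
```

Before standardization, income differences would dominate any distance between two cases simply because the numbers are larger. After standardization, both variables contribute on the same scale.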
Types of Clusters
There are three major types of clustering:
- Hierarchical clustering – Includes the agglomerative and divisive methods
- Partitional clustering – Includes K-Means, Fuzzy K-Means and ISODATA
- Density-based clustering – Includes DENCLUE, CLUPOT, Mean Shift, SVC and Parzen-Watershed
Assumptions in Cluster Analysis
Cluster analysis rests on two main assumptions:
- The sample is representative of the population
- The variables are not correlated. If variables are correlated, either remove the correlated variables or use a distance measure that compensates for the correlation, such as the Mahalanobis distance.
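One way to check the second assumption is to compute pairwise correlations before clustering. The sketch below is a minimal illustration: the `pearson` and `correlated_pairs` helpers, the 0.9 threshold and the sample data are all assumptions made for this example, not a prescribed procedure.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlated_pairs(columns, threshold=0.9):
    """Return variable pairs whose absolute correlation meets the threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, round(r, 3)))
    return pairs


# Hypothetical data: the two height variables measure the same thing.
data = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "score": [3, 9, 4, 8, 5],
}
flagged = correlated_pairs(data)
print(flagged)
```

A flagged pair is a candidate for dropping one of the two variables, since keeping both would count the same information twice in the distance measure.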
Steps in Cluster Analysis
- Step 1 : Define the Problem
- Step 2 : Decide the appropriate similarity measure
- Step 3 : Decide on how to group the objects
- Step 4 : Decide the number of clusters
- Step 5 : Interpret, describe and validate the cluster
Cluster Analysis in SPSS
In SPSS, the cluster analysis options are under Analyze > Classify. SPSS offers three methods for cluster analysis: K-Means Cluster, Hierarchical Cluster and Two Step Cluster.
The K-Means cluster method partitions a given set of data into a fixed number of clusters that you specify in advance. It is easy to understand and works best when the groups in the data are well separated from each other.
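To show what a fixed number of clusters means in practice, here is a minimal sketch of the standard K-Means iteration (Lloyd's algorithm) in plain Python. It is not SPSS's implementation. The `kmeans` function, the choice of the first k points as starting centroids and the sample points are assumptions made for illustration.

```python
def kmeans(points, k, iterations=100):
    """Lloyd's algorithm: assign each point to its nearest centroid, then update."""
    # Assumption: the first k points serve as initial centroids (deterministic).
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = []
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            new_labels.append(dists.index(min(dists)))
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_labels == labels and new_centroids == centroids:
            break  # nothing changed, so the solution has converged
        labels, centroids = new_labels, new_centroids
    return labels, centroids


# Hypothetical, well-separated data: two groups near (1, 1) and (9, 9).
points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.1), (0.9, 1.1), (9.2, 8.9)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

With well-separated groups like these, the algorithm settles after a couple of iterations. With overlapping groups the result depends much more on the starting centroids.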
Two Step cluster analysis is designed to handle large data sets. It can form clusters from both categorical and continuous variables.
Hierarchical clustering is the most commonly used method of cluster analysis. It combines cases into homogeneous clusters by merging them through a series of sequential steps.
Hierarchical cluster analysis involves three steps:
- Calculate the distances
- Link the clusters
- Choose a solution by selecting the right number of clusters
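The three steps above can be sketched in a few lines of Python. This is an illustrative agglomerative procedure with single linkage, not the SPSS routine. The `single_linkage` function and the sample points are hypothetical, and the number of clusters is passed in directly rather than read off a dendrogram.

```python
from math import dist  # Euclidean distance, available from Python 3.8


def single_linkage(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # each case starts alone
    merges = []  # record of (cluster_a, cluster_b, distance) at every step
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Step 1: distance between clusters = closest pair of members.
                d = min(
                    dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # Step 2: link the two closest clusters.
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    # Step 3: stopping at n_clusters is the chosen solution.
    return clusters, merges


# Hypothetical data: two tight pairs and one distant case.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (20.0, 20.0)]
clusters, merges = single_linkage(points, n_clusters=3)
print(clusters)
for step in merges:
    print(step)
```

The recorded merge distances are the information a dendrogram displays. A large jump between consecutive merge distances suggests stopping before that merge.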
The steps for performing hierarchical cluster analysis in SPSS are given below.
- The first step is to select the variables to be clustered in the main dialog box.
- Click Statistics in that dialog box to open the dialog box where you specify the output.
- In the Plots dialog box, add the dendrogram. A dendrogram is the graphical representation of the hierarchical cluster analysis. It shows how the clusters are combined at every step until they form a single cluster.
- The Method dialog box is crucial. Here you specify the distance measure and the clustering method. SPSS offers separate sets of measures for interval, count and binary data.
- For interval data, the Squared Euclidean Distance is the sum of the squared differences, without taking the square root.
- For count data, you can choose between the Chi-square and Phi-square measures.
- The Binary section offers many options. Squared Euclidean distance is a common choice.
- The next step is to choose the cluster method. A common recommendation is to start with Single Linkage (also called Nearest Neighbour), as it helps to identify outliers. After the outliers are identified, you can use Ward’s Method.
- The last step is standardization. If the variables are measured on different scales, standardize them (for example, to z-scores) so that no single variable dominates the distance measure.
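The Squared Euclidean Distance mentioned in the steps above is simple enough to compute by hand. The sketch below uses a hypothetical `squared_euclidean` helper and two made-up cases to show how it relates to the ordinary Euclidean distance.

```python
def squared_euclidean(x, y):
    """Sum of squared differences between two cases, with no square root."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


# Two hypothetical cases measured on three interval variables.
case_a = (1.0, 2.0, 3.0)
case_b = (4.0, 6.0, 3.0)

d2 = squared_euclidean(case_a, case_b)  # 3**2 + 4**2 + 0**2
d = d2 ** 0.5                           # ordinary Euclidean distance
print(d2, d)
```

Because the differences are squared and never rooted, large differences on a single variable weigh more heavily than they would under the ordinary Euclidean distance.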
Criticisms of Cluster Analysis
The most common criticisms are listed below
- It is descriptive, atheoretical and non-inferential
- It will produce clusters regardless of whether any structure actually exists in the data
- The solution cannot be generalized widely, as it depends entirely on the variables used as the basis for the similarity measure
What is Factor Analysis?
Factor analysis is an exploratory technique that groups similar variables into dimensions. It can be used to simplify the data by reducing the number of dimensions of the observations. Factor analysis offers several different rotation methods.
Factor analysis is used mostly for data reduction purposes.
There are two types of factor analysis – Exploratory and Confirmatory
- The exploratory method is used when you do not have a predefined idea about the structures or dimensions in a set of variables.
- The confirmatory method is used when you want to test specific hypotheses about the structures or dimensions in a set of variables.
Objectives of Factor Analysis
Factor analysis has two main objectives:
- Identification of the underlying factors – This includes clustering variables into homogeneous sets, creating new variables and gaining knowledge about the categories
- Screening of variables – This is helpful in regression: it identifies groupings so that you can select one variable to represent many
Assumptions of Factor analysis
There are four main assumptions of Factor analysis which are mentioned below
- Models are usually based on linear relationships
- It assumes that the data collected are interval scaled
- Some multicollinearity in the data is desirable, as the objective is to find sets of interrelated variables
- The data should be suitable for factor analysis. If each variable is correlated only with itself and not with any other variable, factor analysis cannot be performed on the data.
Types of Factoring
- Principal component factoring – The most commonly used method. Factor weights are computed to extract the maximum possible variance, and extraction continues until no meaningful variance is left.
- Canonical factor analysis – Finds factors which have the highest canonical correlation with the observed variables
- Common factor analysis – Seeks the least number of factors which can account for the common variance of a set of variables
- Image factoring – Based on the correlation matrix where each variable is predicted from the others using multiple regression
- Alpha Factoring – Maximizes the reliability of factors
- Factor regression model – Combination of factor model and regression model whose factors are partially known
Criteria of Factor analysis
Several criteria help to decide how many factors to retain.
Eigenvalue Criteria
- An eigenvalue represents the amount of variance in the original variables that is associated with a factor
- The sum of the squared factor loadings of all variables on a factor is that factor's eigenvalue
- Factors with eigenvalues greater than 1.0 are retained
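The eigenvalue rule above can be illustrated with a small worked example. The loading matrix below is invented for illustration, and `eigenvalues_from_loadings` is a hypothetical helper. The sketch assumes unrotated loadings, where the column sums of squared loadings correspond to the eigenvalues.

```python
# Hypothetical unrotated loadings: rows are variables, columns are factors.
loadings = [
    [0.8, 0.1],
    [0.7, 0.2],
    [0.9, 0.0],
    [0.1, 0.6],
    [0.2, 0.5],
]


def eigenvalues_from_loadings(loadings):
    """Eigenvalue of a factor = sum of squared loadings in its column."""
    n_factors = len(loadings[0])
    return [sum(row[f] ** 2 for row in loadings) for f in range(n_factors)]


eigenvalues = eigenvalues_from_loadings(loadings)

# Keep only the factors whose eigenvalue exceeds 1.0.
retained = [f for f, ev in enumerate(eigenvalues) if ev > 1.0]
print([round(ev, 2) for ev in eigenvalues], retained)
```

In this made-up example the first factor has an eigenvalue of about 1.99 and is kept, while the second, at about 0.66, explains less variance than a single standardized variable and is dropped.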
Scree Plot Criteria
- A plot of the eigenvalues against the number of factors, in order of extraction.
- The shape of the plot is used to determine the number of factors: factors are typically retained up to the point where the curve levels off (the "elbow")
Percentage of Variance Criteria
- Factors are extracted until the cumulative percentage of variance they explain reaches a satisfactory level.
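As a rough sketch of this criterion, the function below counts how many factors are needed to reach a target share of the total variance. The `factors_for_variance` name, the eigenvalues and the 70% target are assumptions for illustration. What counts as a satisfactory level depends on the field and the study.

```python
def factors_for_variance(eigenvalues, target=0.7):
    """Return how many factors are needed to reach the target share of variance."""
    total = sum(eigenvalues)
    cumulative = 0.0
    ordered = sorted(eigenvalues, reverse=True)  # largest eigenvalue first
    for count, ev in enumerate(ordered, start=1):
        cumulative += ev / total
        if cumulative >= target:
            return count, cumulative
    return len(ordered), cumulative


# Hypothetical eigenvalues for five standardized variables (they sum to 5).
eigenvalues = [2.5, 1.2, 0.6, 0.4, 0.3]
n_factors, explained = factors_for_variance(eigenvalues, target=0.7)
print(n_factors, round(explained, 2))
```

Here the first factor explains 50% of the variance and the second adds 24%, so two factors are enough to pass the 70% target.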
Significance Test Criteria
- The statistical significance of each eigenvalue is determined, and only those factors that are statistically significant are retained
Factor analysis is used in various fields like Psychology, Sociology, Political Science, Education and Mental health.
Factor Analysis in SPSS
In SPSS, the factor analysis option can be found under Analyze > Dimension Reduction > Factor
- Start by adding the variables to the list of variables section
- Click Descriptives and select a few statistics with which the assumptions of factor analysis can be verified.
- Click Extraction, which lets you choose the extraction method and the cut-off value for extraction.
- Principal Components (PCA) is the default extraction method. It forms uncorrelated linear combinations of the variables and can be used even when the correlation matrix is singular. It is loosely similar to Canonical Correlation Analysis in that factors are extracted in sequence: the first factor has maximum variance and each following factor explains a smaller portion of the variance.
- The second most common extraction method is Principal Axis Factoring. It identifies the latent constructs behind the observations.
- Next step is to select a rotation method. The most frequently used method is Varimax. This method simplifies the interpretation of the factors.
- The second method is Quartimax. This method rotates the factors to minimize the number of factors needed to explain each variable. It simplifies the interpretation of the observed variables.
- The next method is Equamax, which is a combination of the Varimax and Quartimax methods.
- Click Options in the dialog box to manage how missing values are handled.
- Before saving the results to the data set, first run the factor analysis, check the assumptions and confirm that the results are meaningful and useful.
Cluster Analysis vs Factor Analysis
Both cluster analysis and factor analysis are unsupervised learning methods used for the segmentation of data. Many researchers who are new to this field feel that cluster analysis and factor analysis are similar. They might seem similar, but they differ in many ways. The differences between cluster analysis and factor analysis are listed below.
The objectives of cluster analysis and factor analysis are different. The objective of cluster analysis is to divide the observations into homogeneous and distinct groups. Factor analysis, on the other hand, explains the homogeneity of the variables that results from the similarity of their values.
Complexity is another point on which cluster analysis and factor analysis differ, because data size affects the two analyses differently. If the data set is too big, cluster analysis can become computationally intractable.
The solution to a problem can be broadly similar in both factor analysis and cluster analysis, but factor analysis often serves the researcher better. Cluster analysis does not always yield the best result, as many clustering algorithms are computationally expensive.
Factor analysis and cluster analysis are applied differently to real data. Factor analysis is suitable for simplifying complex models. It reduces a large set of variables to a much smaller set of factors. The researcher can develop a set of hypotheses and run factor analysis to confirm or reject these hypotheses.
Cluster analysis is suitable for classifying objects based on certain criteria. The researcher can measure certain aspects of a group and divide them into specific categories using cluster analysis.
There are also a number of other differences, which are listed below.
- Cluster analysis attempts to group cases whereas factor analysis attempts to group features.
- Cluster analysis is used to find smaller groups of cases that are representative of the data as a whole. Factor analysis is used to find a smaller group of features that is representative of the data set's original features.
- The most important part of cluster analysis is finding the number of clusters. Broadly, clustering methods are divided into two kinds: agglomerative methods and partitioning methods. Agglomerative methods start with each case in its own cluster and merge clusters until a stopping criterion is reached. Partitioning methods start with all cases in one cluster and split it.
- Factor analysis is used to find out an underlying structure in a set of data.
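The cases-versus-features distinction in the list above can be seen directly on a small data matrix. In the sketch below the data and the `pearson` helper are made up for illustration. Cluster analysis starts from distances between rows (cases), while factor analysis starts from correlations between columns (features).

```python
from math import dist, sqrt

# Hypothetical data matrix: 4 cases (rows) measured on 3 features (columns).
data = [
    [1.0, 2.0, 10.0],
    [2.0, 4.0, 9.0],
    [3.0, 6.0, 12.0],
    [4.0, 8.0, 11.0],
]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Cluster analysis view: a 4 x 4 matrix of distances between cases.
case_distances = [[dist(r1, r2) for r2 in data] for r1 in data]

# Factor analysis view: a 3 x 3 matrix of correlations between features.
columns = list(zip(*data))  # transpose rows into columns
feature_correlations = [[pearson(c1, c2) for c2 in columns] for c1 in columns]

print(len(case_distances), "x", len(case_distances[0]))
print(len(feature_correlations), "x", len(feature_correlations[0]))
```

The same table yields a case-by-case matrix for clustering and a feature-by-feature matrix for factoring. In this example the first two features are perfectly correlated, so a factor analysis would treat them as one dimension.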
We hope this article has helped you understand the basics of cluster analysis and factor analysis and the differences between the two.