Cluster Analysis - Part 1: Introduction

What is Cluster Analysis?

Cluster analysis is a collective term for various algorithms that find group structures in data. The groups are called clusters and are usually not known a priori. In contrast, classification procedures assign observations to groups that are already known (e.g., buyers and non-buyers). A classification is often performed afterwards using the groups identified in a cluster analysis.

In cluster analysis, the group membership of the individual observations is determined such that the groups differ from each other as much as possible, while the observations within a group are as homogeneous as possible. Whether a resulting cluster solution makes sense depends in practice on the interpretability of the identified clusters. There are many different algorithms, which lead to different numbers and sizes of clusters. In the following, we focus on the most common procedures.

The following scatter plot shows a simple example. It visualizes the weight in kilograms on the x-axis and the height in centimeters on the y-axis for 20 people. The cluster analysis identifies two groups that have been marked in red and blue. The interpretation turns out to be simple in this example: the clusters just correspond to gender (red: women, blue: men).
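The plot itself is not reproduced here, but a minimal sketch in R illustrates the idea. The weight/height values below are simulated (they are not the original data), and kmeans() is used to recover the two groups:

set.seed(1)
# Simulated weights (kg) and heights (cm) for 20 people -- illustrative values only
women  <- data.frame(weight = rnorm(10, 62, 5), height = rnorm(10, 166, 6))
men    <- data.frame(weight = rnorm(10, 82, 6), height = rnorm(10, 180, 7))
people <- rbind(women, men)
km <- kmeans(scale(people), centers = 2)            # two clusters, variables standardized
plot(people$weight, people$height, pch = 19,
     col = ifelse(km$cluster == 1, "red", "blue"),  # cluster labels are arbitrary, so the
     xlab = "Weight (kg)", ylab = "Height (cm)")    # colour-to-group mapping may flip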

Distance vs. Similarity Measures

Clustering can be performed on the basis of distance or similarity measures, and the choice of measure should be made before the analysis. The following graph shows observations of two groups, “red” and “blue”. On the one hand, the absolute distance between the observations of the two groups could be of interest; in that case a distance measure is suitable. On the other hand, the similarity of the profiles, i.e., the pattern formed by the observations, may be more interesting. In this case, a similarity measure is a good choice.

To clarify the difference between distance and similarity measures, we will now calculate the distance and similarity matrix for the example with the groups “red” and “blue”.

matrix <- cbind(red = c(3, 5, 1, 2), blue = c(7, 9, 5, 6))
matrix
##      red blue
## [1,]   3    7
## [2,]   5    9
## [3,]   1    5
## [4,]   2    6

The function dist() computes the Euclidean distance between the observations as a distance measure, and simil() from the proxy package returns the Pearson correlation coefficient as a similarity measure.

dist(matrix)
##          1        2        3
## 2 2.828427                  
## 3 2.828427 5.656854         
## 4 1.414214 4.242641 1.414214
library(proxy)  # simil() is provided by the proxy package, not base R
simil(matrix)
##    1  2  3
## 2  1      
## 3  1  1   
## 4  1  1  1

Note: In this blog article, the algorithms for finding cluster structures are based on distance measures only.

Methods of Cluster Analysis

Clustering methods can be divided into two large groups: partitioning and hierarchical clustering methods. Partitioning methods are characterized by the fact that the number of resulting clusters must be specified in advance. As already mentioned, however, the number of clusters is usually not known, which is why this requirement is often viewed as a disadvantage. Depending on the method used, the algorithm iteratively seeks the optimum of an objective function by moving observations back and forth between the given clusters. The well-known k-means algorithm belongs to the partitioning methods: first, cluster centers are chosen randomly and each observation is assigned to its nearest center, so that the sum of the squared distances of the observations to their nearest cluster center is minimized. The cluster centers are then re-determined as the means of their clusters and the observations are reassigned to the nearest centers. This is repeated until the assignment of observations no longer changes.
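As a minimal sketch of how k-means is typically run in R (on the built-in iris data rather than the blog's example, so the snippet is self-contained):

set.seed(42)
# k-means with 3 clusters on two numeric variables; nstart repeats the random
# initialization several times and keeps the best solution
km <- kmeans(iris[, c("Petal.Length", "Petal.Width")], centers = 3, nstart = 25)
table(km$cluster, iris$Species)  # compare the found clusters with the known species
km$tot.withinss                  # total within-cluster sum of squares (the minimized criterion)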

Hierarchical clustering methods are subdivided into agglomerative and divisive methods. Agglomerative means nothing more than initially treating each observation as a separate cluster. Next, the two clusters that are closest to each other are merged and the distances between all clusters are recalculated. This is repeated until all observations are finally grouped into a single cluster. In the divisive approach, it is exactly the other way round: the starting point is a single cluster containing all observations, which is split into more and more clusters during the subsequent steps.

Hierarchical clustering methods can be represented graphically by means of a so-called dendrogram. A dendrogram shows at which distance observations are merged (agglomerative) or separated (divisive).
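A small sketch of agglomerative clustering in base R, again on the iris data for illustration; a divisive counterpart would be, for example, diana() from the cluster package:

d  <- dist(iris[, c("Petal.Length", "Petal.Width")])  # Euclidean distance matrix
hc <- hclust(d, method = "complete")                  # merge the closest clusters step by step
plot(hc, labels = FALSE, main = "Dendrogram")         # dendrogram of the merge history
clusters <- cutree(hc, k = 3)                         # cut the tree to obtain 3 clusters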

How Many Clusters?

There is no single “right” number of clusters. In order to get a feeling for how many clusters make sense, a scree plot is recommended. A scree plot shows the number of clusters on the x-axis and the variance within the clusters on the y-axis (which should be reasonably low). The number of clusters at which the curve shows a kink (the so-called elbow) is usually chosen, because additional clusters hardly contribute to a further reduction of the variance. To judge the result of the group assignment, a silhouette plot is suitable. A silhouette plot shows the silhouette width s(i) for each observation i, which is defined as the normalized difference between the smallest mean distance to the observations outside the own group, b(i), and the mean distance to the observations within the own group, a(i): s(i) = (b(i) - a(i)) / max(a(i), b(i)). The silhouette width can take any value in the interval [-1, 1] and is interpreted as follows:

  • s(i) close to 1: The observation is assigned to the “right” cluster.
  • s(i) close to 0: The observation could just as well have been assigned to another group.
  • s(i) close to -1: The observation is poorly assigned.

In addition, the average silhouette width across all observations can be calculated to assess group formation as a whole. The average silhouette width is interpreted analogously.
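A sketch of both diagnostics in R, using the iris petal measurements for illustration; silhouette() comes from the cluster package:

library(cluster)  # provides silhouette()
x <- iris[, c("Petal.Length", "Petal.Width")]
set.seed(42)
# Scree/elbow plot: within-cluster sum of squares for 1 to 8 clusters
wss <- sapply(1:8, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)
plot(1:8, wss, type = "b",
     xlab = "Number of clusters", ylab = "Within-cluster sum of squares")
# Silhouette plot for the 3-cluster solution
km  <- kmeans(x, centers = 3, nstart = 25)
sil <- silhouette(km$cluster, dist(x))
plot(sil)
mean(sil[, "sil_width"])  # average silhouette width across all observations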

The second part of the series will illustrate the presented concepts of cluster analysis by means of an example.