Blog

Cluster Analysis - Part 2: Hands On

library(broom)
library(cluster)
library(dplyr)
library(ggplot2)
library(ggdendro)

In the first part of this blog series, we examined the theoretical foundations of cluster analysis. In the following article we put the theory into practice using R. For the analysis in R, we will use the variables mpg (fuel consumption in miles per gallon) and disp (cubic capacity) for 32 cars from the mtcars data set. A detailed description of the dataset can be found here.

First we create a new data frame called car, which contains the mentioned variables mpg and disp for the 32 cars. The observations are shown in a scatter plot.

car <- select(mtcars, mpg, disp)

ggplot(car, aes(mpg, disp)) + 
  geom_point()

Now we perform a hierarchical clustering method and look at the dendrogram. The y-axis of the dendrogram shows the distance between the clusters (Euclidean distance).

h.cluster <- car %>% dist(., method = "euclidean") %>% hclust(., method = "ward.D")

ggdendrogram(h.cluster, theme_dendro = FALSE)

Obviously, two clusters make sense in this case. Next we create a screeplot, where k-means method is performed for each . The screeplot shows the number of clusters on the x-axis and the variance within the clusters on the y-axis.

multi.clust <- data.frame(k = 1:6) %>% 
  group_by(k) %>% 
  do(clust = kmeans(car, .$k))

sumsq.clust <- multi.clust %>% 
  group_by(k) %>% 
  do(glance(.$clust[[1]]))

ggplot(sumsq.clust, aes(k, tot.withinss)) + 
  geom_line() + 
  geom_point()

The kink is at , so we stay with two clusters. Now let’s have a look at the group structure obtained by the k-means algorithm for .

p.cluster <- car %>% kmeans(., 2)
p.cluster$cluster <- as.factor(p.cluster$cluster)

ggplot(car, aes(mpg, disp, label = rownames(car))) + 
  scale_fill_discrete(name = "Cluster") + 
  coord_cartesian(xlim = c(9, 35)) +
  geom_label(aes(fill = p.cluster$cluster), colour = "white", fontface = "bold", size = 2)

At this point it would be interesting to compare the group structure for , , and so on.

multi.clust <- data.frame(k = 1:6) %>% 
  group_by(k) %>% 
  do(clust = kmeans(car, .$k))

multi.k <- multi.clust %>% 
  group_by(k) %>% 
  do(augment(.$clust[[1]], car))

ggplot(multi.k, aes(mpg, disp)) + 
  geom_point(aes(color = .cluster)) + 
  facet_wrap(~k)

As mentioned in the first part of the series about Cluster Analysis, the quality of the group assignment can be easily assessed with a silhouette plot.

sil <- car %>% dist(.) %>% silhouette(as.numeric(p.cluster$cluster), .)
sil.data <- data.frame(cluster = factor(sil[, 1]), sil_width = sil[, 3])

ggplot(sil.data, aes(x = row.names(sil.data), y = sil_width, fill = cluster, col = cluster)) +
  geom_bar(stat = "identity", width = 0.5) + coord_flip() + labs(x = "") +
  scale_x_discrete(limits = row.names(sil.data[order(sil.data$cluster, 
  sil.data$sil_width), ])) 

mean(sil.data$sil_width)
## [1] 0.687796

None of the silhouette widths has a value . In addition, the average width of the silhouettes is 0.69, so we can evaluate the assignment of the observations to the two clusters as good.

Interpretation of the Cluster Solution

Now that we have validated the cluster solution, we come to the most difficult part of the analysis: interpretation. For the analysis we used the fuel consumption mpg and the displacement disp. Short note for those who have no idea about cars: the bigger the displacement of a vehicle, the more powerful the engine.

For assistance, we use the remaining information from the mtcars data set for interpretation. To do this, we add the cluster mapping column to the original dataset.

clustcars <- cbind(mtcars, cluster = p.cluster$cluster)

Now we can examine the remaining variables within each cluster.

clustcars %>% 
  group_by(cluster) %>% 
  summarise(avg_cyl = mean(cyl), 
            avg_horsepower = mean(hp), 
            avg_drat = mean(drat),
            avg_weight = mean(wt),
            avg_qsec = mean(qsec),
            share_vmotor = mean(vs),
            share_manualtransmission = mean(am),
            avg_gears = mean(gear),
            avg_carb = mean(carb)) %>% 
  mutate_if(is.numeric, round, digits = 1)
## # A tibble: 2 x 10
##   cluster avg_cyl avg_horsepower avg_drat avg_weight avg_qsec share_vmotor
##   <fct>     <dbl>          <dbl>    <dbl>      <dbl>    <dbl>        <dbl>
## 1 1           7.9          203.       3.2        3.9     16.9          0.1
## 2 2           4.7           97.4      3.9        2.6     18.6          0.8
## # ... with 3 more variables: share_manualtransmission <dbl>,
## #   avg_gears <dbl>, avg_carb <dbl>

We could see already in the above cluster solution that the cars assigned to cluster 1 generally have higher power and consumption than the cars from cluster 2.

In addition, the table shows that on average, cluster 1 cars have about 8 cylinders, while cluster 2 typically contains cars with 4 to 6 cylinders. Usually the power increases with the number of cylinders (the variable horsepower is on average twice as high for the cars in cluster 1 compared to cluster 2), but the price and the fuel consumption as well.

With the collected information, we interpret the two clusters as two price ranges. Cluster 1 contains the more luxurious cars, with more power, more cylinders and higher fuel consumption. The cluster 2 therefore contains less powerful cars, which are cheaper in price and have lower consumption. To confirm this interpretation, we have also computed the recommended retail price (attention: the models are from the 1970s) for the cars. The table below shows the average values for each of the two clusters, with the median in brackets.

Avg. price cluster 1Avg. price cluster 2
9974 Dollar (7474)6283 Dollar (4295)