A few months ago we started our new company, Kluster, with the name being a tip of the hat to cluster analysis. The immediate question is often:
“why do you spell kluster with a k?”
My novelty answer:
“because k-means clustering!”
In honour of this, I thought I would write a short answer to the question “what is clustering?” for the uninitiated, with a focus on k-means clustering, assisted by R.
Cluster analysis is a form of unsupervised learning which aims to group observations into clusters, so that observations in the same cluster are more similar to each other than to observations in other clusters.
The central part of k-means clustering is a rather simple algorithm:

1. Choose k initial cluster centroids.
2. Assign each observation to its nearest centroid.
3. Recompute each centroid as the mean of the observations assigned to it.
4. Repeat steps 2 and 3 until the assignments stop changing.
As always with R, there are packages which can do this for you really well. In this case, the kmeans() function in the stats package is one of many to choose from.
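For comparison, the built-in version is essentially a one-liner. A minimal sketch on the four numeric columns of iris (here `nstart` re-runs the algorithm from several random starts and keeps the best result):

```r
# k-means from the stats package, which is loaded by default
fit <- kmeans(iris[, 1:4], centers = 3, nstart = 25)
fit$size  # number of observations in each cluster
```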
However, in order to understand packages better, it can be fun to quickly hack one from scratch, which is what I will do here!
There are many ways to choose the best k, which we won’t discuss here.
I am using the classic iris data set, so I will cheat and set k=3 for this example (because there are three species of irises in the data set).
Step 1 is to initialise the clusters. There are many ways of doing this - I will split the data set into k random samples, and set the k cluster centroids to the mean locations of those samples. Another option is to randomly select k points from the data as starting centroids.
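A sketch of that initialisation (the variable names here are my own, not necessarily those from the original code):

```r
set.seed(42)  # any seed will do; fixed only for reproducibility
k <- 3
dat <- iris[, 1:4]  # the four numeric measurements

# Randomly split the rows into k roughly equal groups, then use each
# group's mean position as a starting centroid
assignment <- sample(rep(1:k, length.out = nrow(dat)))
centroids <- aggregate(dat, by = list(cluster = assignment), FUN = mean)
```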
I have written a custom function to calculate the Euclidean distance between two points:
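The original function isn't shown here, but a minimal version might look like this:

```r
# Euclidean distance between two numeric vectors of equal length
euclidean_dist <- function(a, b) {
  sqrt(sum((a - b)^2))
}

euclidean_dist(c(0, 0), c(3, 4))  # the classic 3-4-5 triangle: returns 5
```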
… and also a second function to assign each observation to a cluster …
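A sketch of such an assignment helper, with the distance function restated so the snippet runs on its own (the names are mine):

```r
euclidean_dist <- function(a, b) sqrt(sum((a - b)^2))

# Return the id of the centroid nearest to one observation.
# `centroids` is a data frame whose first column is the cluster id
# and whose remaining columns are the feature means.
assign_cluster <- function(obs, centroids) {
  obs <- as.numeric(unlist(obs))  # accept a one-row data frame or a vector
  dists <- apply(centroids[, -1], 1, function(centre)
    euclidean_dist(obs, centre))
  centroids$cluster[which.min(dists)]
}

# Toy example: two centroids, one at the origin and one at (10, 10)
toy <- data.frame(cluster = 1:2, x = c(0, 10), y = c(0, 10))
assign_cluster(c(1, 2), toy)  # nearest to the origin, so cluster 1
```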
Note that looping in R is inefficient, but for the purpose of understanding the underlying logic of a method it can be useful.
I now use these custom functions to assign each observation to a cluster:
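Putting those pieces together, the initial assignment might look like this (the setup is restated so the snippet is self-contained):

```r
euclidean_dist <- function(a, b) sqrt(sum((a - b)^2))
assign_cluster <- function(obs, centroids) {
  obs <- as.numeric(unlist(obs))
  dists <- apply(centroids[, -1], 1, function(centre)
    euclidean_dist(obs, centre))
  centroids$cluster[which.min(dists)]
}

set.seed(42)
k <- 3
dat <- iris[, 1:4]
assignment <- sample(rep(1:k, length.out = nrow(dat)))
centroids <- aggregate(dat, by = list(cluster = assignment), FUN = mean)

# The (inefficient but readable) loop: visit every observation and
# record which centroid it is closest to
for (i in seq_len(nrow(dat))) {
  assignment[i] <- assign_cluster(dat[i, ], centroids)
}
table(assignment)
```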
We now have each observation assigned to a randomly selected cluster.
Next, the exciting part. We’ll loop through iterations of updating the centroids and re-assigning the observations until we’re happy with the result. I’ve written a custom function to update the centroids (note it uses dplyr):
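A sketch of such an update function (again, the names are my own; this assumes dplyr >= 1.0 for `across()`):

```r
library(dplyr)

# Recompute each centroid as the mean of its currently-assigned points.
# `dat` holds the numeric features, `assignment` the current cluster ids.
update_centroids <- function(dat, assignment) {
  dat %>%
    mutate(cluster = assignment) %>%
    group_by(cluster) %>%
    summarise(across(everything(), mean)) %>%
    as.data.frame()
}

# Toy example: two clusters of two points each
update_centroids(data.frame(x = c(0, 2, 10, 12)), c(1, 1, 2, 2))
```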
I know that this will converge quickly, so I’ll cheat and cap the iterations at 8 for the purpose of the nice graph you’re about to see. In reality you would set a high maximum and exit early once the observations stop changing clusters.
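The main loop could then look like the following sketch, which also keeps each iteration's assignments so they can be plotted afterwards. The helpers are restated so it runs on its own, and for brevity I use base R's `aggregate()` for the centroid update here instead of the dplyr version:

```r
euclidean_dist <- function(a, b) sqrt(sum((a - b)^2))
assign_cluster <- function(obs, centroids) {
  obs <- as.numeric(unlist(obs))
  dists <- apply(centroids[, -1], 1, function(centre)
    euclidean_dist(obs, centre))
  centroids$cluster[which.min(dists)]
}

set.seed(42)
k <- 3
dat <- iris[, 1:4]
assignment <- sample(rep(1:k, length.out = nrow(dat)))

max_iter <- 8        # low cap, as discussed, purely for the plot
history <- list()    # assignments at each iteration, for plotting

for (iter in seq_len(max_iter)) {
  centroids <- aggregate(dat, by = list(cluster = assignment), FUN = mean)
  new_assignment <- vapply(seq_len(nrow(dat)), function(i)
    assign_cluster(dat[i, ], centroids), integer(1))
  history[[iter]] <- new_assignment
  if (identical(new_assignment, assignment)) break  # nothing moved: converged
  assignment <- new_assignment
}
```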
Now, let’s have a look at how the clusters converge over each iteration:
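One way to draw that convergence picture is to facet a scatterplot of two of the measurements by iteration with ggplot2. This sketch re-runs the loop described above in compact form so it stands alone:

```r
library(ggplot2)

# Re-run the clustering, keeping each iteration's assignments
euclidean_dist <- function(a, b) sqrt(sum((a - b)^2))
assign_cluster <- function(obs, centroids) {
  obs <- as.numeric(unlist(obs))
  dists <- apply(centroids[, -1], 1, function(centre)
    euclidean_dist(obs, centre))
  centroids$cluster[which.min(dists)]
}
set.seed(42)
dat <- iris[, 1:4]
assignment <- sample(rep(1:3, length.out = nrow(dat)))
history <- list()
for (iter in 1:8) {
  centroids <- aggregate(dat, by = list(cluster = assignment), FUN = mean)
  assignment <- vapply(seq_len(nrow(dat)), function(i)
    assign_cluster(dat[i, ], centroids), integer(1))
  history[[iter]] <- assignment
}

# Stack the per-iteration assignments into one long data frame
plot_dat <- do.call(rbind, lapply(seq_along(history), function(iter)
  data.frame(iris[, c("Petal.Length", "Petal.Width")],
             cluster   = factor(history[[iter]]),
             iteration = iter)))

ggplot(plot_dat, aes(Petal.Length, Petal.Width, colour = cluster)) +
  geom_point(alpha = 0.7) +
  facet_wrap(~ iteration, labeller = label_both)
```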
We can see that the clusters stabilise by about the fifth iteration, and it appears that our custom clustering algorithm (roughly) works!
To check, let’s compare against the actual species from the iris data set and get a sense of our accuracy:
Awesome. As you can see, the setosas were clustered correctly, and there is only some mixing between virginica and versicolor - on the whole, pretty cool!
Let’s quickly check the accuracy with a confusion matrix (using R’s table() function) before we finish:
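A sketch of that check, with the fitted assignments recomputed compactly (using the loop described above) so the snippet runs on its own. Because k-means cluster ids are arbitrary labels, this version matches each species to its best-fitting cluster before computing accuracy:

```r
euclidean_dist <- function(a, b) sqrt(sum((a - b)^2))
assign_cluster <- function(obs, centroids) {
  obs <- as.numeric(unlist(obs))
  dists <- apply(centroids[, -1], 1, function(centre)
    euclidean_dist(obs, centre))
  centroids$cluster[which.min(dists)]
}
set.seed(42)
dat <- iris[, 1:4]
assignment <- sample(rep(1:3, length.out = nrow(dat)))
for (iter in 1:8) {
  centroids <- aggregate(dat, by = list(cluster = assignment), FUN = mean)
  assignment <- vapply(seq_len(nrow(dat)), function(i)
    assign_cluster(dat[i, ], centroids), integer(1))
}

# Confusion matrix: true species vs fitted cluster
conf <- table(species = iris$Species, cluster = assignment)
conf

# Match each species to its best-fitting cluster, then compute accuracy
accuracy <- sum(apply(conf, 1, max)) / sum(conf)
accuracy
```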
Not bad - all the setosas were classified accurately, and only six of the remaining observations were misclassified.
The accuracy, (sum of the diagonal) / (sum of the total), comes out at 96%, which I’m happy with!
There is plenty to improve on, and of course you shouldn’t use this for actual clustering, but I find that exercises like this are a great way to properly understand the logic underlying the methods we can leverage through R’s excellent packages.