June 17, 2026
Determining if one cluster solution is better than another can be challenging. A common approach is to calculate some kind of ratio between intra-cluster spread to inter-cluster spread. Ideally, things in the same cluster are all very close to each other (low intra-cluster spread) and things in different clusters are all very far from each other (high inter-cluster spread).
The bad thing (IMO) about this type of ratio-based approach is that it isn’t directly tied to the usefulness of a cluster solution. It’s possible for a solution splitting athletes by height to have a better ratio than one splitting them by play style, even though the latter might be much more useful for Moneyball-style decision making. The good thing about this type of approach is that it’s easy to calculate. Here are a few popular examples.
The Davies-Bouldin index is one example of this approach. Let’s say we’re given the following properties about a cluster solution:
For any cluster , we know how spaced out it is based on the value . For any two clusters and , we know how far apart they are based on the value .
We want to end up with a single ratio. How’s this:
is the sum of the two clusters’ spreads divided by their distance. Reducing can be achieved by placing the two clusters farther apart or making either of the clusters denser.
It looks like a pretty decent measure so far. So… can we just average all the s from all the pairs of clusters and call it a day? Yes! We can do anything! There aren’t any rules out here. But, we can do better.
The problem with this metric as is is that it accounts for a lot of unnecessary junk. Consider the green cluster below.
Davies and Bouldin figured that there’s no use in celebrating the separation between the green and red clusters when the green and purple clusters are so much closer together. It is the green cluster’s value with the purple cluster (the closest one) that should define how neat of a cluster it is.
So they created a bonus intermediate value, , defined as the worst (biggest) value for each cluster. Taking the average of all , we get , the Davies-Bouldin index:
Conceptually, that’s all there is to it. To practically implement this thing, we still need formulae for and . We’ll start with the presented in the original Dave & Buster’s paper [1]. Let’s let:
Then we could calculate (the spread) of cluster as:
I.e., the average Euclidean distance between every point in the cluster to its centroid.
I often have to calculate quality measures using nothing more than a pre-computed distance matrix of observations, so I have to do things a little differently. One option is to take the average of all inter-observation distances within a cluster:
Don’t worry for now if you’re not sure where the 2 and the other stuff scaling the summation came from.
Another option is to take the maximum of all inter-observation distances within a cluster:
We next need to pick an formula. The recommendation in the paper is the distance between the centroids of the two clusters. If you, like me, have nothing but a distance matrix to work from, you might want to consider some of the options below.
Let be the th observation in cluster and be the th observation in cluster .
Where, and are the sizes of clusters and respectively.
Horrible.
The good news is, we’re done! You now have everything you need to work out the Davies-Bouldin index calculation for yourself.
One implementation note to watch out for: don’t forget to take your averages properly. The clv package’s calculation of the Davies-Bouldin index (whose tragic removal from CRAN has catalyzed this adventure) made a small slip up on this. When calculating mean intra-cluster distance from a data matrix, the package uses the C code:
average_intracluster[cluster_i] +=
2*(dist/(cluster_size[cluster_i]*(cluster_size[cluster_i]-1)));But when calculating the same thing from a distance matrix, the package uses:
average_intracluster[cluster_i] +=
(dist/(cluster_size[cluster_i]*(cluster_size[cluster_i]-1)));self-distance matrices are tricky in that their diagonals are useless and half of their off-diagonals are duplicates of the other half.
If you subtract off the diagonals and do the appropriate halving, the number of terms to divide over for the average is:
For both of the snippets presented above, dist is the sum of the upper triangle of a cluster_size[cluster_i] cluster_size[cluster_i] distance matrix. If you’re bored, you can try to figure out which of the two snippets above is correct.
Once you understand the Davies-Bouldin index, you get to understand the Dunn index for free. It’s the smallest inter-cluster distance divided by the largest intra-cluster distance [2]:
I dare say I find this index a little less impressive than the former. While the Davies-Bouldin index captures information about every cluster involved, the Dunn index exclusively focuses on the worst clusters. If you had a 10 cluster solution where 8 clusters were spectacularly compact and well-separated but two were fuzzy clouds right beside each other, it would appear awful by the Dunn index but a rightfully quite good by the Davies-Bouldin index.
The most powerful and complicated of the tailed beasts is the lowercased silhouette value. The silhouette value (or “score”) was created by Peter Rousseeuw, a statistics professor currently at Katholieke Universiteit Leuven in Belgium. He seems to have constructed all sorts of zany methods apart from this one. But who could be surprised after knowing that his PhD advisor’s advisor’s advisor’s advisor’s advisor’s advisor’s advisor’s advisor’s two advisors were Laplace and Lagrange?
Rather than calculating a spread value for each cluster, the silhouette score starts from a per-observation average distance to all other observations in its own cluster. If a cluster has observations, you’ll end up with of these averages.
This time, let’s call the th observation in cluster and the th observation in cluster . Sorry for the change-up on the indexing; it’ll make things easier.
The average distance of to all the other observations in its own cluster is denoted by :
Repeat for all to get all your values.
The average distance of to all the observations in a separate cluster is:
This handsome, nameless set of values tells us how far apart is from , much like the values from before. If we minimize it across all clusters where , we’ll have found which cluster is specifically closest to our observation of cluster . This is the value:
Repeat for all to get all your values.
To recap, for each of our data points, we have calculated a single value measuring its average distance to every other point in its cluster and a single value measuring its shortest distance to its closest cluster.
The hard part is done now. Each data point gets a silhouette score calculated as:
Unless an entire cluster has only a single data point, in which case that data point gets a silhouette score of 0.
This is a pretty neat measure and the most information-rich of the three discussed here.
Today we learned:
Happy implementing!