Clustering is one of the most common exploratory data analysis techniques, used to get an intuition about the structure of a dataset. k-means clustering is one of the simplest and most popular unsupervised machine learning algorithms.
k-means clustering is an algorithm that, given a dataset, identifies which of the k clusters each data point belongs to. It finds subgroups in the data such that data points in the same subgroup (cluster) are very similar, while data points in different clusters are very different.
k-means clustering is a good place to start exploring an unlabeled dataset. The k in k-means denotes the number of clusters. The algorithm is guaranteed to converge after a finite number of iterations, although only to a local optimum rather than necessarily the best possible clustering.
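For concreteness, here is a minimal sketch of running k-means with scikit-learn's KMeans. The blob dataset, k = 3, and the n_init and random_state values are assumptions chosen for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# A synthetic unlabeled dataset with three natural groups.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# k (n_clusters) must be chosen up front.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)    # cluster index (0..k-1) for each point

print(labels[:10])                # assignments for the first 10 points
print(kmeans.cluster_centers_)    # the k learned centroids
```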
The algorithm has four basic steps, sketched from scratch in code after the list:
- Initialize Cluster Centroids: First, the model picks k data points (say, k = 3) at random from the dataset. These points become the initial cluster centroids.
- Assign Data Points to Clusters: The model calculates the distance between each data point and all k centroids, and assigns the point to the cluster with the nearest centroid.
- Update Cluster Centroids: Because the initial centroids were chosen arbitrarily, they need to be revised. Once every data point has been assigned to a cluster, recompute each cluster's centroid by taking the mean of each feature (dimension) over the points assigned to it. With two-dimensional data, the updated centroid of a cluster is simply (mean of its X values, mean of its Y values).
- Repeat steps 2–3 until a stopping condition is met: Steps 2 and 3 are performed iteratively; without a stopping criterion they could continue indefinitely. Typical criteria are that the centroids stop moving, the cluster assignments stop changing, or a maximum number of iterations is reached. A stopping criterion doesn't guarantee finding the absolute best clusters, but it does ensure the algorithm converges and returns reasonably good ones.
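Here is one way those four steps could look from scratch in NumPy. This is an illustrative sketch only (random initialization from the data, Euclidean distance, no handling of empty clusters), not scikit-learn's implementation:

```python
import numpy as np

def kmeans(X, k=3, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)

    # Step 1: initialize centroids by picking k random data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(max_iters):
        # Step 2: assign each point to its nearest centroid
        # (Euclidean distance to all k centroids, take the argmin).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        # Step 3: update each centroid to the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

        # Step 4: stop once the centroids no longer move (convergence).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return labels, centroids
```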
The objective of clustering is not just to create clusters but to generate meaningful, high-quality clusters. High-quality clustering means that data points within a cluster are close to each other and significantly distant from points in other clusters.
There are three commonly used ways to measure the quality of clusters:
- Silhouette Score: The silhouette score compares how close each data point is to the other points in its own cluster with how far it is from the points in the nearest neighboring cluster. It falls within the range of -1 to 1, where a score closer to 1 indicates that data points are well matched to their own clusters and well separated from others, while a score closer to -1 suggests overlapping or poorly distinguished clusters (see the sketch after this list).
- Elbow Method: Another valuable technique for evaluating cluster quality is the elbow method, which helps determine the optimal number of clusters for a dataset. By plotting the inertia (the within-cluster sum of squares, defined below) against the number of clusters, you can observe an "elbow point" on the graph: the point beyond which adding more clusters no longer significantly reduces the inertia. It helps find a balance between too few clusters (underfitting) and too many (overfitting); a sketch appears at the end of this section.
- Inertia: Inertia quantifies how well-defined and compact clusters are: it is the sum of squared distances between each data point and the centroid of its cluster. Lower inertia values indicate cohesive clusters whose points sit close together, which makes it a useful metric for assessing clustering quality and for selecting the number of clusters via the elbow method. Inertia is never negative; it is zero only when every point coincides with its centroid, and it grows as clusters become more spread out.
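A minimal sketch of computing the silhouette score with scikit-learn, again on assumed synthetic data with k = 3:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Ranges from -1 to 1; closer to 1 means compact, well-separated clusters.
print(silhouette_score(X, labels))
```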
The goal when evaluating cluster quality is to achieve low inertia and a silhouette score close to 1, as these indicate that the clustering produces distinct, well-separated clusters.
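And a sketch of the elbow method, fitting k-means over a range of k values and plotting the inertia that scikit-learn exposes as inertia_ (the data and the candidate range of k are assumptions for illustration):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

ks = range(1, 9)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("inertia (within-cluster sum of squares)")
plt.show()  # the bend ("elbow") suggests a good k; here it is around 3
```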