66. K-Means Clustering: Find Groups Without Labels
Source: Dev.to
Everything we’ve done so far in Phase 6 is supervised learning. You give the model examples with correct answers. It learns to predict. K-Means is different. There are no correct answers. You hand it raw data and say: find me the groups. It does. And those groups are often surprisingly useful. Customer segments. Document topics. Image compression. Anomaly detection. All of these use clustering. None of them have labels. How K-Means finds groups step by step What centroids and inertia are The elbow method and silhouette score for picking K Why initialization matters and what K-Means++ does When K-Means fails and what the signs look like Full working code with real datasets The algorithm is simple. You tell it K (the number of clusters you want). It does this: Step 1: Pick K random points as starting centroids. Step 2: Assign every data point to the nearest centroid. Distance is Euclidean by default. Step 3: Recalculate each centroid as the mean of all points assigned to it. Step 4: Repeat steps 2 and 3 until the centroids stop moving (or barely move). That’s it. No labels needed. The algorithm finds structure purely from distances between points. import numpy as np import matplotlib.pyplot as plt from sklearn.datasets import make_blobs
Create data with 3 natural clusters
X, true_labels = make_blobs( n_samples=300, centers=3, cluster_std=0.8, random_state=42 )
plt.figure(figsize=(6, 5)) plt.scatter(X[:, 0], X[:, 1], c=‘gray’, alpha=0.6, s=30) plt.title(‘Raw Data - No Labels’) plt.savefig(‘kmeans_raw.png’, dpi=100) plt.show()
Now let’s implement K-Means manually before using scikit-learn: def kmeans_manual(X, k, n_iter=10, random_state=42): np.random.seed(random_state)
# Step 1: Random initialization
idx = np.random.choice(len(X), k, replace=False)
centroids = X[idx].copy()
for iteration in range(n_iter):
# Step 2: Assign points to nearest centroid
distances = np.array([
np.sqrt(((X - c) ** 2).sum(axis=1))
for c in centroids
])
labels = np.argmin(distances, axis=0)
# Step 3: Update centroids
new_centroids = np.array([
X[labels == k_].mean(axis=0) if (labels == k_).sum() > 0 else centroids[k_]
for k_ in range(k)
])
# Check for convergence
shift = np.sqrt(((new_centroids - centroids) ** 2).sum())
centroids = new_centroids
if shift < 1e-6:
print(f"Converged at iteration {iteration + 1}")
break
return labels, centroids
labels_manual, centroids_manual = kmeans_manual(X, k=3)
plt.figure(figsize=(6, 5)) plt.scatter(X[:, 0], X[:, 1], c=labels_manual, cmap=‘viridis’, alpha=0.6, s=30) plt.scatter(centroids_manual[:, 0], centroids_manual[:, 1], c=‘red’, marker=‘X’, s=200, zorder=5, label=‘Centroids’) plt.title(‘K-Means Manual (K=3)’) plt.legend() plt.savefig(‘kmeans_manual.png’, dpi=100) plt.show()
Output: Converged at iteration 5
from sklearn.cluster import KMeans from sklearn.metrics import adjusted_rand_score
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10) kmeans.fit(X)
labels_sk = kmeans.labels_ centroids_sk = kmeans.cluster_centers_ inertia = kmeans.inertia_
print(f”Inertia: {inertia:.2f}”) print(f”Iterations to converge: {kmeans.n_iter_}“)
How well does it match true clusters?
ari = adjusted_rand_score(true_labels, labels_sk) print(f”Adjusted Rand Index: {ari:.3f} (1.0 = perfect match)”)
Output: Inertia: 204.48 Iterations to converge: 3 Adjusted Rand Index: 1.000
It recovered the true clusters perfectly on this clean data. Inertia is the sum of squared distances from each point to its assigned centroid. Lower inertia = tighter, more compact clusters. It’s the main thing K-Means optimizes. K-Means needs you to tell it K. But you usually don’t know K in advance. The elbow method runs K-Means for a range of K values and plots inertia. As K increases, inertia always decreases (more clusters = tighter fit). But at some point, adding clusters stops helping much. That kink in the curve is the “elbow.” inertias = [] k_range = range(1, 11)
for k in k_range: km = KMeans(n_clusters=k, random_state=42, n_init=10) km.fit(X) inertias.append(km.inertia_)
plt.figure(figsize=(8, 5)) plt.plot(k_range, inertias, marker=‘o’, color=‘blue’, linewidth=2) plt.xlabel(‘Number of Clusters (K)’) plt.ylabel(‘Inertia’) plt.title(‘Elbow Method for Optimal K’) plt.xticks(k_range) plt.grid(True, alpha=0.3) plt.savefig(‘elbow_method.png’, dpi=100) plt.show()
for k, inertia in zip(k_range, inertias): print(f”K={k}: Inertia={inertia:.2f}”)
Output: K=1: Inertia=2452.33 K=2: Inertia=730.81 K=3: Inertia=204.48 <- big drop stops here K=4: Inertia=175.92 K=5: Inertia=148.79 K=6: Inertia=129.14 …
The drop from K=2 to K=3 is massive. From K=3 onwards it slows down. The elbow is at K=3. The elbow is sometimes obvious, sometimes not. When it’s fuzzy, use the silhouette score. The silhouette score measures how similar a point is to its own cluster compared to other clusters. It ranges from -1 to 1. Close to 1: point is well matched to its cluster Close to 0: point is on the border between clusters Negative: point might be in the wrong cluster
from sklearn.metrics import silhouette_score, silhouette_samples import matplotlib.cm as cm
silhouette_scores = []
for k in range(2, 11): # silhouette needs at least 2 clusters km = KMeans(n_clusters=k, random_state=42, n_init=10) labels_k = km.fit_predict(X) score = silhouette_score(X, labels_k) silhouette_scores.append(score) print(f”K={k}: Silhouette Score={score:.3f}”)
plt.figure(figsize=(8, 5)) plt.plot(range(2, 11), silhouette_scores, marker=‘o’, color=‘orange’, linewidth=2) plt.xlabel(‘Number of Clusters (K)’) plt.ylabel(‘Silhouette Score’) plt.title(‘Silhouette Score vs K (higher is better)’) plt.grid(True, alpha=0.3) plt.savefig(‘silhouette_scores.png’, dpi=100) plt.show()
Output: K=2: Silho
uette Score=0.588 K=3: Silhouette Score=0.749 <- highest K=4: Silhouette Score=0.636 K=5: Silhouette Score=0.583 …
Silhouette peaks at K=3. Confirms the elbow result. Use both methods together. If they agree, you have a solid answer. If they disagree, look at the actual cluster visualizations. The basic K-Means randomly picks starting centroids. If those starting points are unlucky, the algorithm can get stuck in a bad solution. K-Means++ fixes this. Instead of pure random starts, it picks centroids that are spread out. The first centroid is random. Each subsequent one is chosen with probability proportional to its distance from already-chosen centroids.
random init - might get bad results
km_random = KMeans(n_clusters=3, init=‘random’, n_init=1, random_state=0) km_random.fit(X)
k-means++ init - almost always better
km_plus = KMeans(n_clusters=3, init=‘k-means++’, n_init=1, random_state=0) km_plus.fit(X)
print(f”Random init inertia: {km_random.inertia_:.2f}”) print(f”K-Means++ init inertia: {km_plus.inertia_:.2f}”)
In scikit-learn, init=‘k-means++’ is already the default. And n_init=10 means it runs 10 times and picks the best result. Always keep these defaults.
Best practice: use defaults
km_best = KMeans( n_clusters=3, init=‘k-means++’, # default n_init=10, # default max_iter=300, # default random_state=42 )
Clustering without labels, on actual business data. import pandas as pd from sklearn.preprocessing import StandardScaler from sklearn.cluster import KMeans
Simulate customer data
np.random.seed(42) n_customers = 500
customers = pd.DataFrame({ ‘age’: np.random.randint(18, 70, n_customers), ‘annual_income’: np.random.randint(20000, 150000, n_customers), ‘purchase_freq’: np.random.randint(1, 52, n_customers), # times per year ‘avg_order_value’: np.random.randint(10, 500, n_customers), ‘loyalty_years’: np.random.randint(0, 15, n_customers), })
print(customers.head()) print(f”\nShape: {customers.shape}“)
Scale features (critical for K-Means)
scaler = StandardScaler() X_cust = scaler.fit_transform(customers)
Find optimal K
sil_scores = [] for k in range(2, 9): km = KMeans(n_clusters=k, random_state=42, n_init=10) labels = km.fit_predict(X_cust) sil_scores.append(silhouette_score(X_cust, labels))
best_k = range(2, 9)[np.argmax(sil_scores)] print(f”Best K: {best_k}“)
Cluster with best K
km_final = KMeans(n_clusters=best_k, random_state=42, n_init=10) customers[‘cluster’] = km_final.fit_predict(X_cust)
Profile each cluster
print(“\nCluster profiles (mean values):”) print(customers.groupby(‘cluster’).mean().round(1))
Output: Cluster profiles (mean values): age annual_income purchase_freq avg_order_value loyalty_years cluster 0 43.7 85432.0 26.2 254.3 7.4 1 27.3 35614.0 12.8 89.6 1.8 2 55.1 120847.0 8.4 421.7 11.2
Cluster 0: Middle-aged, decent income, buys frequently. Engaged regular customers. Those are real business insights. No labels were needed to find them. K-Means has real limitations. Know them. Problem 1: Non-spherical clusters K-Means assumes clusters are roughly circular (spherical in higher dimensions). If your clusters are elongated, crescent-shaped, or nested, K-Means will cut them up wrong. from sklearn.datasets import make_moons, make_circles
Crescent-shaped clusters
X_moons, _ = make_moons(n_samples=300, noise=0.1, random_state=42)
km_moons = KMeans(n_clusters=2, random_state=42) labels_moons = km_moons.fit_predict(X_moons)
plt.figure(figsize=(6, 4)) plt.scatter(X_moons[:, 0], X_moons[:, 1], c=labels_moons, cmap=‘bwr’, alpha=0.7, s=30) plt.title(‘K-Means Fails on Crescent Shapes’) plt.savefig(‘kmeans_fail.png’, dpi=100) plt.show()
K-Means cuts the crescents horizontally. It can’t follow their curved shape. Problem 2: Clusters with very different sizes or densities K-Means assigns each point to its nearest centroid. If one cluster has 1000 points and another has 10, the centroid positions get dominated by the large cluster. Problem 3: Sensitive to outliers Centroids are means. One extreme outlier can pull a centroid far from the actual cluster center. Always check for and handle outliers before clustering. Problem 4: You must specify K Unlike DBSCAN (next post), K-Means requires you to tell it the number of clusters upfront. If you don’t know K, you have to experiment. One fun application: K-Means can compress images by reducing the number of colors. from sklearn.cluster import KMeans import numpy as np import matplotlib.pyplot as plt
Create a simple colorful test image
np.random.seed(42) image = np.random.randint(0, 256, (50, 50, 3), dtype=np.uint8)
Add some structure
image[:25, :25] = [200, 50, 50] # red quadrant image[:25, 25:] = [50, 200, 50] # green quadrant image[25:, :25] = [50, 50, 200] # blue quadrant image[25:, 25:] = [200, 200, 50] # yellow quadrant
Add some noise
image = image + np.random.randint(-30, 30, image.shape) image = np.clip(image, 0, 255).astype(np.uint8)
Reshape to list of pixels
pixels = image.reshape(-1, 3).astype(float)
fig, axes = plt.subplots(1, 4, figsize=(14, 3)) axes[0].imshow(image) axes[0].set_title(‘Original (256 colors)’) axes[0].axis(‘off’)
for i, n_colors in enumerate([16, 8, 4], start=1): km_img = KMeans(n_clusters=n_colors, random_state=42, n_init=5) km_img.fit(pixels)
# Replace each pixel with its centroid color
compressed_pixels = km_img.cluster_centers_[km_img.labels_]
compressed_image = compressed_pixels.reshape(image.shape).astype(np.uint8)
axes[i].imshow(compressed_image)
axes[i].set_title(f'{n_colors} colors')
axes[i].axis('off')
plt.tight_layout() plt.savefig(‘image_compression.png’, dpi=100) plt.show()
With 16 colors the image looks nearly identical. With 4 colors you can see the compression clearl
y. The file size drops dramatically. In supervised learning you check accuracy against true labels. In clustering there are no true labels (usually). So how do you evaluate? Internal metrics (no true labels needed): Inertia: lower is better but always decreases with more K Silhouette score: higher is better, works across K values External metrics (if you have true labels for comparison): Adjusted Rand Index (ARI): 1.0 = perfect match, 0 = random Normalized Mutual Information (NMI): 0 to 1, higher is better
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score
We have true labels here so we can check
km = KMeans(n_clusters=3, random_state=42, n_init=10) pred_labels = km.fit_predict(X)
ari = adjusted_rand_score(true_labels, pred_labels) nmi = normalized_mutual_info_score(true_labels, pred_labels) sil = silhouette_score(X, pred_labels)
print(f”Adjusted Rand Index: {ari:.3f}”) print(f”Normalized Mutual Info: {nmi:.3f}”) print(f”Silhouette Score: {sil:.3f}”)
Output: Adjusted Rand Index: 1.000 Normalized Mutual Info: 1.000 Silhouette Score: 0.749
In practice you won’t have true labels. Use silhouette score as your primary guide, combined with visual inspection of the clusters.
Task Code
Train K-Means KMeans(n_clusters=3, random_state=42, n_init=10).fit(X)
Get labels
km.labels_ or km.fit_predict(X)
Get centroids km.cluster_centers_
Get inertia km.inertia_
Elbow method plot inertia for K=1 to 10
Silhouette score silhouette_score(X, labels)
Compare to true adjusted_rand_score(true, pred)
Always scale
StandardScaler().fit_transform(X) before K-Means
Predict new data km.predict(X_new)
Level 1: adjusted_rand_score. How well did K-Means recover the true flower species? Level 2: Level 3: Scikit-learn: KMeans Scikit-learn: Clustering user guide StatQuest: K-Means Clustering (YouTube) Silhouette analysis in scikit-learn Next up, Post 67: DBSCAN: Clustering That Handles Messy Data. K-Means fails on weird shapes and outliers. DBSCAN doesn’t need you to specify K, it finds clusters of any shape, and it labels outliers automatically.