66. K-Means Clustering: Find Groups Without Labels

Published: (May 10, 2026 at 12:09 PM EDT)
9 min read
Source: Dev.to

Source: Dev.to

Everything we’ve done so far in Phase 6 is supervised learning. You give the model examples with correct answers. It learns to predict. K-Means is different. There are no correct answers. You hand it raw data and say: find me the groups. It does. And those groups are often surprisingly useful. Customer segments. Document topics. Image compression. Anomaly detection. All of these use clustering. None of them have labels. How K-Means finds groups step by step What centroids and inertia are The elbow method and silhouette score for picking K Why initialization matters and what K-Means++ does When K-Means fails and what the signs look like Full working code with real datasets The algorithm is simple. You tell it K (the number of clusters you want). It does this: Step 1: Pick K random points as starting centroids. Step 2: Assign every data point to the nearest centroid. Distance is Euclidean by default. Step 3: Recalculate each centroid as the mean of all points assigned to it. Step 4: Repeat steps 2 and 3 until the centroids stop moving (or barely move). That’s it. No labels needed. The algorithm finds structure purely from distances between points. import numpy as np import matplotlib.pyplot as plt from sklearn.datasets import make_blobs

Create data with 3 natural clusters

X, true_labels = make_blobs( n_samples=300, centers=3, cluster_std=0.8, random_state=42 )

plt.figure(figsize=(6, 5)) plt.scatter(X[:, 0], X[:, 1], c=‘gray’, alpha=0.6, s=30) plt.title(‘Raw Data - No Labels’) plt.savefig(‘kmeans_raw.png’, dpi=100) plt.show()

Now let’s implement K-Means manually before using scikit-learn: def kmeans_manual(X, k, n_iter=10, random_state=42): np.random.seed(random_state)

# Step 1: Random initialization
idx = np.random.choice(len(X), k, replace=False)
centroids = X[idx].copy()

for iteration in range(n_iter):
    # Step 2: Assign points to nearest centroid
    distances = np.array([
        np.sqrt(((X - c) ** 2).sum(axis=1))
        for c in centroids
    ])
    labels = np.argmin(distances, axis=0)

    # Step 3: Update centroids
    new_centroids = np.array([
        X[labels == k_].mean(axis=0) if (labels == k_).sum() > 0 else centroids[k_]
        for k_ in range(k)
    ])

    # Check for convergence
    shift = np.sqrt(((new_centroids - centroids) ** 2).sum())
    centroids = new_centroids

    if shift < 1e-6:
        print(f"Converged at iteration {iteration + 1}")
        break

return labels, centroids

labels_manual, centroids_manual = kmeans_manual(X, k=3)

plt.figure(figsize=(6, 5)) plt.scatter(X[:, 0], X[:, 1], c=labels_manual, cmap=‘viridis’, alpha=0.6, s=30) plt.scatter(centroids_manual[:, 0], centroids_manual[:, 1], c=‘red’, marker=‘X’, s=200, zorder=5, label=‘Centroids’) plt.title(‘K-Means Manual (K=3)’) plt.legend() plt.savefig(‘kmeans_manual.png’, dpi=100) plt.show()

Output: Converged at iteration 5

from sklearn.cluster import KMeans from sklearn.metrics import adjusted_rand_score

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10) kmeans.fit(X)

labels_sk = kmeans.labels_ centroids_sk = kmeans.cluster_centers_ inertia = kmeans.inertia_

print(f”Inertia: {inertia:.2f}”) print(f”Iterations to converge: {kmeans.n_iter_}“)

How well does it match true clusters?

ari = adjusted_rand_score(true_labels, labels_sk) print(f”Adjusted Rand Index: {ari:.3f} (1.0 = perfect match)”)

Output: Inertia: 204.48 Iterations to converge: 3 Adjusted Rand Index: 1.000

It recovered the true clusters perfectly on this clean data. Inertia is the sum of squared distances from each point to its assigned centroid. Lower inertia = tighter, more compact clusters. It’s the main thing K-Means optimizes. K-Means needs you to tell it K. But you usually don’t know K in advance. The elbow method runs K-Means for a range of K values and plots inertia. As K increases, inertia always decreases (more clusters = tighter fit). But at some point, adding clusters stops helping much. That kink in the curve is the “elbow.” inertias = [] k_range = range(1, 11)

for k in k_range: km = KMeans(n_clusters=k, random_state=42, n_init=10) km.fit(X) inertias.append(km.inertia_)

plt.figure(figsize=(8, 5)) plt.plot(k_range, inertias, marker=‘o’, color=‘blue’, linewidth=2) plt.xlabel(‘Number of Clusters (K)’) plt.ylabel(‘Inertia’) plt.title(‘Elbow Method for Optimal K’) plt.xticks(k_range) plt.grid(True, alpha=0.3) plt.savefig(‘elbow_method.png’, dpi=100) plt.show()

for k, inertia in zip(k_range, inertias): print(f”K={k}: Inertia={inertia:.2f}”)

Output: K=1: Inertia=2452.33 K=2: Inertia=730.81 K=3: Inertia=204.48 <- big drop stops here K=4: Inertia=175.92 K=5: Inertia=148.79 K=6: Inertia=129.14 …

The drop from K=2 to K=3 is massive. From K=3 onwards it slows down. The elbow is at K=3. The elbow is sometimes obvious, sometimes not. When it’s fuzzy, use the silhouette score. The silhouette score measures how similar a point is to its own cluster compared to other clusters. It ranges from -1 to 1. Close to 1: point is well matched to its cluster Close to 0: point is on the border between clusters Negative: point might be in the wrong cluster

from sklearn.metrics import silhouette_score, silhouette_samples import matplotlib.cm as cm

silhouette_scores = []

for k in range(2, 11): # silhouette needs at least 2 clusters km = KMeans(n_clusters=k, random_state=42, n_init=10) labels_k = km.fit_predict(X) score = silhouette_score(X, labels_k) silhouette_scores.append(score) print(f”K={k}: Silhouette Score={score:.3f}”)

plt.figure(figsize=(8, 5)) plt.plot(range(2, 11), silhouette_scores, marker=‘o’, color=‘orange’, linewidth=2) plt.xlabel(‘Number of Clusters (K)’) plt.ylabel(‘Silhouette Score’) plt.title(‘Silhouette Score vs K (higher is better)’) plt.grid(True, alpha=0.3) plt.savefig(‘silhouette_scores.png’, dpi=100) plt.show()

Output: K=2: Silho

uette Score=0.588 K=3: Silhouette Score=0.749 <- highest K=4: Silhouette Score=0.636 K=5: Silhouette Score=0.583 …

Silhouette peaks at K=3. Confirms the elbow result. Use both methods together. If they agree, you have a solid answer. If they disagree, look at the actual cluster visualizations. The basic K-Means randomly picks starting centroids. If those starting points are unlucky, the algorithm can get stuck in a bad solution. K-Means++ fixes this. Instead of pure random starts, it picks centroids that are spread out. The first centroid is random. Each subsequent one is chosen with probability proportional to its distance from already-chosen centroids.

random init - might get bad results

km_random = KMeans(n_clusters=3, init=‘random’, n_init=1, random_state=0) km_random.fit(X)

k-means++ init - almost always better

km_plus = KMeans(n_clusters=3, init=‘k-means++’, n_init=1, random_state=0) km_plus.fit(X)

print(f”Random init inertia: {km_random.inertia_:.2f}”) print(f”K-Means++ init inertia: {km_plus.inertia_:.2f}”)

In scikit-learn, init=‘k-means++’ is already the default. And n_init=10 means it runs 10 times and picks the best result. Always keep these defaults.

Best practice: use defaults

km_best = KMeans( n_clusters=3, init=‘k-means++’, # default n_init=10, # default max_iter=300, # default random_state=42 )

Clustering without labels, on actual business data. import pandas as pd from sklearn.preprocessing import StandardScaler from sklearn.cluster import KMeans

Simulate customer data

np.random.seed(42) n_customers = 500

customers = pd.DataFrame({ ‘age’: np.random.randint(18, 70, n_customers), ‘annual_income’: np.random.randint(20000, 150000, n_customers), ‘purchase_freq’: np.random.randint(1, 52, n_customers), # times per year ‘avg_order_value’: np.random.randint(10, 500, n_customers), ‘loyalty_years’: np.random.randint(0, 15, n_customers), })

print(customers.head()) print(f”\nShape: {customers.shape}“)

Scale features (critical for K-Means)

scaler = StandardScaler() X_cust = scaler.fit_transform(customers)

Find optimal K

sil_scores = [] for k in range(2, 9): km = KMeans(n_clusters=k, random_state=42, n_init=10) labels = km.fit_predict(X_cust) sil_scores.append(silhouette_score(X_cust, labels))

best_k = range(2, 9)[np.argmax(sil_scores)] print(f”Best K: {best_k}“)

Cluster with best K

km_final = KMeans(n_clusters=best_k, random_state=42, n_init=10) customers[‘cluster’] = km_final.fit_predict(X_cust)

Profile each cluster

print(“\nCluster profiles (mean values):”) print(customers.groupby(‘cluster’).mean().round(1))

Output: Cluster profiles (mean values): age annual_income purchase_freq avg_order_value loyalty_years cluster 0 43.7 85432.0 26.2 254.3 7.4 1 27.3 35614.0 12.8 89.6 1.8 2 55.1 120847.0 8.4 421.7 11.2

Cluster 0: Middle-aged, decent income, buys frequently. Engaged regular customers. Those are real business insights. No labels were needed to find them. K-Means has real limitations. Know them. Problem 1: Non-spherical clusters K-Means assumes clusters are roughly circular (spherical in higher dimensions). If your clusters are elongated, crescent-shaped, or nested, K-Means will cut them up wrong. from sklearn.datasets import make_moons, make_circles

Crescent-shaped clusters

X_moons, _ = make_moons(n_samples=300, noise=0.1, random_state=42)

km_moons = KMeans(n_clusters=2, random_state=42) labels_moons = km_moons.fit_predict(X_moons)

plt.figure(figsize=(6, 4)) plt.scatter(X_moons[:, 0], X_moons[:, 1], c=labels_moons, cmap=‘bwr’, alpha=0.7, s=30) plt.title(‘K-Means Fails on Crescent Shapes’) plt.savefig(‘kmeans_fail.png’, dpi=100) plt.show()

K-Means cuts the crescents horizontally. It can’t follow their curved shape. Problem 2: Clusters with very different sizes or densities K-Means assigns each point to its nearest centroid. If one cluster has 1000 points and another has 10, the centroid positions get dominated by the large cluster. Problem 3: Sensitive to outliers Centroids are means. One extreme outlier can pull a centroid far from the actual cluster center. Always check for and handle outliers before clustering. Problem 4: You must specify K Unlike DBSCAN (next post), K-Means requires you to tell it the number of clusters upfront. If you don’t know K, you have to experiment. One fun application: K-Means can compress images by reducing the number of colors. from sklearn.cluster import KMeans import numpy as np import matplotlib.pyplot as plt

Create a simple colorful test image

np.random.seed(42) image = np.random.randint(0, 256, (50, 50, 3), dtype=np.uint8)

Add some structure

image[:25, :25] = [200, 50, 50] # red quadrant image[:25, 25:] = [50, 200, 50] # green quadrant image[25:, :25] = [50, 50, 200] # blue quadrant image[25:, 25:] = [200, 200, 50] # yellow quadrant

Add some noise

image = image + np.random.randint(-30, 30, image.shape) image = np.clip(image, 0, 255).astype(np.uint8)

Reshape to list of pixels

pixels = image.reshape(-1, 3).astype(float)

fig, axes = plt.subplots(1, 4, figsize=(14, 3)) axes[0].imshow(image) axes[0].set_title(‘Original (256 colors)’) axes[0].axis(‘off’)

for i, n_colors in enumerate([16, 8, 4], start=1): km_img = KMeans(n_clusters=n_colors, random_state=42, n_init=5) km_img.fit(pixels)

# Replace each pixel with its centroid color
compressed_pixels = km_img.cluster_centers_[km_img.labels_]
compressed_image  = compressed_pixels.reshape(image.shape).astype(np.uint8)

axes[i].imshow(compressed_image)
axes[i].set_title(f'{n_colors} colors')
axes[i].axis('off')

plt.tight_layout() plt.savefig(‘image_compression.png’, dpi=100) plt.show()

With 16 colors the image looks nearly identical. With 4 colors you can see the compression clearl

y. The file size drops dramatically. In supervised learning you check accuracy against true labels. In clustering there are no true labels (usually). So how do you evaluate? Internal metrics (no true labels needed): Inertia: lower is better but always decreases with more K Silhouette score: higher is better, works across K values External metrics (if you have true labels for comparison): Adjusted Rand Index (ARI): 1.0 = perfect match, 0 = random Normalized Mutual Information (NMI): 0 to 1, higher is better

from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

We have true labels here so we can check

km = KMeans(n_clusters=3, random_state=42, n_init=10) pred_labels = km.fit_predict(X)

ari = adjusted_rand_score(true_labels, pred_labels) nmi = normalized_mutual_info_score(true_labels, pred_labels) sil = silhouette_score(X, pred_labels)

print(f”Adjusted Rand Index: {ari:.3f}”) print(f”Normalized Mutual Info: {nmi:.3f}”) print(f”Silhouette Score: {sil:.3f}”)

Output: Adjusted Rand Index: 1.000 Normalized Mutual Info: 1.000 Silhouette Score: 0.749

In practice you won’t have true labels. Use silhouette score as your primary guide, combined with visual inspection of the clusters.

Task Code

Train K-Means KMeans(n_clusters=3, random_state=42, n_init=10).fit(X)

Get labels

km.labels_ or km.fit_predict(X)

Get centroids km.cluster_centers_

Get inertia km.inertia_

Elbow method plot inertia for K=1 to 10

Silhouette score silhouette_score(X, labels)

Compare to true adjusted_rand_score(true, pred)

Always scale

StandardScaler().fit_transform(X) before K-Means

Predict new data km.predict(X_new)

Level 1: adjusted_rand_score. How well did K-Means recover the true flower species? Level 2: Level 3: Scikit-learn: KMeans Scikit-learn: Clustering user guide StatQuest: K-Means Clustering (YouTube) Silhouette analysis in scikit-learn Next up, Post 67: DBSCAN: Clustering That Handles Messy Data. K-Means fails on weird shapes and outliers. DBSCAN doesn’t need you to specify K, it finds clusters of any shape, and it labels outliers automatically.

0 views
Back to Blog

Related posts

Read more »