K-Means Clustering: Find Patterns in Data Without Labels
A streaming service groups users into "action fans," "comedy lovers," and "documentary watchers" — without anyone filling out a survey. A retailer segments customers into "big spenders," "bargain hunters," and "window shoppers" based on purchase behavior. Nobody labeled these groups. The algorithm discovered them.
This is unsupervised learning. Unlike in classification and regression, there are no labels to learn from. The algorithm looks at the data and finds natural groupings on its own. Clustering is the most common type of unsupervised learning.
In this tutorial, you'll master K-Means clustering — the most popular clustering algorithm. You'll learn how it works, how to choose the right number of clusters, how to evaluate results without labels, and when to use DBSCAN as an alternative.
How Does K-Means Find Clusters?
K-Means works by placing K center points (called centroids) in the data space, then repeating two steps until nothing changes: (1) assign each point to its nearest centroid, and (2) move each centroid to the mean of the points assigned to it.
Think of it like organizing a messy room into K piles. You start with K random "pile centers." Each item goes to the nearest pile. Then you recalculate where the center of each pile is and reassign items. After a few rounds, the piles stabilize.
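The assign/update loop above can be sketched in a few lines of NumPy. This is a minimal illustration of the idea, not scikit-learn's implementation (which adds smarter initialization and other refinements); the function name and the two-blob toy data are made up for the example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=42):
    """Minimal K-Means loop: illustrative only, not production code."""
    rng = np.random.default_rng(seed)
    # Start with k random data points as the initial "pile centers".
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 1: assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: move each centroid to the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # piles stabilized
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated blobs: the sketch should recover them.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
labels, centroids = kmeans_sketch(X, k=2)
```

In practice you would use scikit-learn's KMeans rather than rolling your own, but seeing the loop makes the "assign, then recenter" rhythm concrete.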
How Do You Use K-Means in Practice?
K-Means has the same .fit() and .predict() API as supervised models. Call .fit(X) to learn clusters from your data. Then use .predict(X_new) to assign new points to the nearest cluster. The .labels_ attribute gives you the cluster assignment for every training point.
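A minimal sketch of that API, using make_blobs to generate toy data in place of a real dataset:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data with 3 groups; substitute your own feature matrix X.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans.fit(X)                         # learn the clusters from the data

print(kmeans.labels_[:5])             # cluster assignment for training points
print(kmeans.predict(X[:2]))          # assign (new) points to nearest cluster
print(kmeans.cluster_centers_.shape)  # one centroid per cluster
```

Note that .predict() on new data only snaps points to the existing centroids; it does not re-learn the clusters.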
How Many Clusters Should You Use?
The hardest part of K-Means is choosing K. Too few clusters and you merge distinct groups. Too many and you split natural groups into meaningless fragments.
The elbow method runs K-Means with different values of K and plots the inertia (the sum of squared distances from each point to its nearest centroid). As K increases, inertia drops. The "elbow" — where the curve bends sharply — suggests the best K. After the elbow, adding more clusters gives diminishing returns.
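A sketch of the elbow loop, again on make_blobs toy data (here with 4 true groups, so the bend should appear around K=4):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data with 4 true groups.
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, random_state=42, n_init=10).fit(X)
    inertias.append(km.inertia_)  # sum of squared distances to nearest centroid

for k, inertia in zip(range(1, 9), inertias):
    print(f"K={k}: inertia={inertia:.1f}")
```

Plot K against inertia (for example with matplotlib) and look for the sharp bend; the printed numbers alone already show where the drop flattens out.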
How Do You Evaluate Clustering Without Labels?
In supervised learning, you compare predictions to true labels. In clustering, there are no true labels. So how do you know if the clusters are good?
The silhouette score measures how similar each point is to its own cluster compared to the nearest other cluster. Scores range from -1 to +1. A score near +1 means points are well-matched to their cluster. A score near 0 means the clusters overlap. A negative score suggests points have been assigned to the wrong cluster.
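The silhouette score comes from sklearn.metrics and can double as a K-selection tool: compute it for several values of K and pick the highest. A sketch on make_blobs toy data with 3 true groups:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data with 3 true groups.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"K={k}: silhouette={scores[k]:.4f}")

best_k = max(scores, key=scores.get)
print(f"Best K: {best_k}")
```

Unlike inertia, the silhouette score is not defined for K=1, which is why the loop starts at 2.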
When Should You Use DBSCAN Instead?
K-Means assumes clusters are round and roughly equal-sized. But what if your data has irregular shapes, or some points don't belong to any cluster? DBSCAN (Density-Based Spatial Clustering of Applications with Noise) handles both cases.
DBSCAN groups together points that are tightly packed and marks isolated points as noise (label = -1). You don't need to specify the number of clusters — DBSCAN finds them automatically. You only set eps (the neighborhood radius) and min_samples (the minimum points to form a cluster).
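A sketch on the classic two-moons dataset, a shape K-Means handles poorly; the eps and min_samples values here are reasonable for this toy data after scaling, not universal defaults:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Two interleaving half-moons: irregular shapes that break K-Means.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

db = DBSCAN(eps=0.3, min_samples=5).fit(X_scaled)
labels = db.labels_

# Noise points get label -1, so exclude it when counting clusters.
n_clusters = len(set(labels.tolist())) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"clusters: {n_clusters}, noise points: {n_noise}")
```

Because DBSCAN works on raw distances, scaling the features first (as above) usually matters as much as the eps value itself.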
What Does a Real Clustering Workflow Look Like?
In practice, clustering is rarely the final step. It is usually part of a larger analysis. For example, a retailer might cluster customers, then analyze each cluster to create targeted marketing campaigns.
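A sketch of that retailer workflow: scale, cluster, then profile each cluster to give it a human-readable meaning. The customer data and its column names are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical customer data: one low-spend group, one high-spend group.
rng = np.random.default_rng(42)
customers = pd.DataFrame({
    "annual_spend": np.concatenate(
        [rng.normal(200, 50, 100), rng.normal(1500, 200, 100)]),
    "visits_per_month": np.concatenate(
        [rng.normal(2, 0.5, 100), rng.normal(10, 2, 100)]),
})

# 1. Scale, so no single feature dominates the distance calculation.
X_scaled = StandardScaler().fit_transform(customers)

# 2. Cluster on the scaled features.
customers["cluster"] = KMeans(
    n_clusters=2, random_state=42, n_init=10).fit_predict(X_scaled)

# 3. Profile each cluster: the per-cluster means suggest labels
#    like "big spenders" vs. "occasional shoppers".
profile = customers.groupby("cluster").mean()
print(profile)
```

The clustering itself is step 2 of 3; the business value comes from step 3, where a human interprets the profiles and decides what to do with each segment.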
Practice Exercises
Fit a KMeans model with n_clusters=3, random_state=42, and n_init=10 on the data. Print the number of points in each cluster (0, 1, 2) in the format "Cluster X: N points" on separate lines.
This code compares inertia for K=1 vs K=3. Will K=3 have higher or lower inertia than K=1? What will the comparison print?
Compute the silhouette score for K=2 through K=6. Print each as "K=X: Y.YYYY". Then print the K with the highest silhouette score on a new line as "Best K: X".
Fit a DBSCAN(eps=0.5, min_samples=5) on the scaled data. Print the number of clusters found (excluding noise), then the number of noise points on a new line. Noise points have label -1.
Build a full pipeline: scale the customer data with StandardScaler, find the best K (from 2 to 5) using silhouette score, then print the best K and its silhouette score rounded to 4 decimal places. Format: "Best K: X" then "Silhouette: X.XXXX" on separate lines.