
K-Means Clustering: Find Patterns in Data Without Labels

Advanced · 25 min · 5 exercises

A streaming service groups users into "action fans," "comedy lovers," and "documentary watchers" — without anyone filling out a survey. A retailer segments customers into "big spenders," "bargain hunters," and "window shoppers" based on purchase behavior. Nobody labeled these groups. The algorithm discovered them.

This is unsupervised learning. Unlike classification and regression, there are no labels to learn from. The algorithm looks at the data and finds natural groupings on its own. Clustering is the most common type of unsupervised learning.

In this tutorial, you'll master K-Means clustering — the most popular clustering algorithm. You'll learn how it works, how to choose the right number of clusters, how to evaluate results without labels, and when to use DBSCAN as an alternative.

How Does K-Means Find Clusters?

K-Means works by placing K center points (called centroids) in the data space, then repeating two steps until nothing changes:

  • Assign each data point to the nearest centroid.
  • Update each centroid to be the average of all points assigned to it.

Think of it like organizing a messy room into K piles. You start with K random "pile centers." Each item goes to the nearest pile. Then you recalculate where the center of each pile is and reassign items. After a few rounds, the piles stabilize.

K-Means on synthetic data
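The loop above can be sketched as follows. This is a minimal illustration, assuming synthetic blob data from `make_blobs`; the variable names are not the lesson's exact setup.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# 300 two-dimensional points drawn around 3 true centers
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# K=3 centroids; n_init=10 restarts guard against a bad random start
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(X)

print(kmeans.cluster_centers_.shape)  # one 2-D centroid per cluster
print(len(set(kmeans.labels_)))       # number of distinct cluster labels
```

Internally, `fit` runs the assign/update loop until the centroids stop moving (or a maximum number of iterations is reached).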

How Do You Use K-Means in Practice?

K-Means has the same .fit() and .predict() API as supervised models. Call .fit(X) to learn clusters from your data. Then use .predict(X_new) to assign new points to the nearest cluster. The .labels_ attribute gives you the cluster assignment for every training point.

Using fit, predict, and labels_
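A minimal sketch of that API, again on assumed `make_blobs` data:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(X)  # learn the centroids from X

# .labels_ holds the cluster assignment for each training point
print(kmeans.labels_[:5])

# .predict() assigns brand-new points to the nearest learned centroid
X_new = np.array([[0.0, 0.0], [10.0, 10.0]])
print(kmeans.predict(X_new))
```

`fit_predict(X)` is a common shortcut that fits the model and returns the training labels in one call.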

How Many Clusters Should You Use?

The hardest part of K-Means is choosing K. Too few clusters and you merge distinct groups. Too many and you split natural groups into meaningless fragments.

The elbow method runs K-Means with different values of K and plots the inertia (the sum of squared distances from each point to its assigned centroid). As K increases, inertia drops. The "elbow" — where the curve bends sharply — suggests the best K. After the elbow, adding more clusters gives diminishing returns.

Elbow method for choosing K
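The elbow computation can be sketched like this (the data is an assumed 4-blob synthetic set, so the bend should appear near K=4):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Fit K-Means for a range of K and record the inertia of each fit
inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

for k, inertia in zip(range(1, 9), inertias):
    print(f"K={k}: inertia={inertia:.1f}")
# Inertia keeps shrinking as K grows; look for the K where the drop flattens out
```

In a notebook you would typically plot `inertias` against K with matplotlib and read the elbow off the curve.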

How Do You Evaluate Clustering Without Labels?

In supervised learning, you compare predictions to true labels. In clustering, there are no true labels. So how do you know if the clusters are good?

The silhouette score measures how similar each point is to its own cluster compared to the nearest other cluster. Scores range from -1 to +1. A score near +1 means points are well-matched to their cluster. A score near 0 means the clusters overlap. A negative score means points are assigned to the wrong cluster.

Silhouette score for different K values
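One way to sketch this, assuming synthetic 3-blob data; `scores` and `best_k` are illustrative names:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Score each candidate K by the mean silhouette over all points
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"K={k}: {scores[k]:.4f}")

# Unlike inertia, silhouette does not always improve with more clusters,
# so the maximum is a usable selection criterion
best_k = max(scores, key=scores.get)
print("Best K:", best_k)
```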

When Should You Use DBSCAN Instead?

K-Means assumes clusters are round and roughly equal-sized. But what if your data has irregular shapes, or some points don't belong to any cluster? DBSCAN (Density-Based Spatial Clustering of Applications with Noise) handles both cases.

DBSCAN groups together points that are tightly packed and marks isolated points as noise (label = -1). You don't need to specify the number of clusters — DBSCAN finds them automatically. You only set eps (the neighborhood radius) and min_samples (the minimum number of neighboring points required to form a dense region).

DBSCAN vs K-Means with noisy data
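A sketch of this comparison, assuming two-moons data (a standard non-round shape that K-Means handles poorly); the eps and min_samples values here are illustrative:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN, KMeans

# Two interleaved crescents with a little jitter
X, _ = make_moons(n_samples=200, noise=0.05, random_state=42)

# DBSCAN follows the dense crescents and flags stragglers as noise (-1)
labels = DBSCAN(eps=0.3, min_samples=5).fit(X).labels_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print("DBSCAN clusters:", n_clusters)
print("Noise points:", n_noise)

# K-Means also returns 2 groups, but cuts straight across the crescents
km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
print("K-Means labels:", sorted(set(km_labels)))
```

Note the bookkeeping: because noise points get label -1, you subtract one from the label count when -1 is present.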

What Does a Real Clustering Workflow Look Like?

In practice, clustering is rarely the final step. It is usually part of a larger analysis. For example, a retailer might cluster customers, then analyze each cluster to create targeted marketing campaigns.

Customer segmentation workflow
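A minimal end-to-end sketch: the customer data below is fabricated for illustration (two obvious spending groups), and the column names are assumptions, not the lesson's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical customers: 50 low spenders and 50 high spenders
rng = np.random.default_rng(42)
customers = pd.DataFrame({
    "annual_spend": np.concatenate([rng.normal(200, 30, 50),
                                    rng.normal(1500, 200, 50)]),
    "visits_per_month": np.concatenate([rng.normal(2.0, 0.5, 50),
                                        rng.normal(9.0, 1.5, 50)]),
})

# 1. Scale features so large spend values don't dominate the distance
X = StandardScaler().fit_transform(customers)

# 2. Cluster the scaled data
customers["cluster"] = KMeans(n_clusters=2, n_init=10,
                              random_state=42).fit_predict(X)

# 3. Profile each cluster in the original units to drive decisions
print(customers.groupby("cluster").mean().round(1))
```

Scaling before K-Means matters because the algorithm is distance-based; the profiling step (group means per cluster) is what turns anonymous cluster IDs into actionable segments like "big spenders."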

Practice Exercises

Run Your First K-Means
Write Code

Fit a KMeans model with n_clusters=3, random_state=42, and n_init=10 on the data. Print the number of points in each cluster (0, 1, 2) in the format "Cluster X: N points" on separate lines.

Predict the Inertia Pattern
Predict Output

This code compares inertia for K=1 vs K=3. Will K=3 have higher or lower inertia than K=1? What will the comparison print?

Find the Best K with Silhouette
Write Code

Compute the silhouette score for K=2 through K=6. Print each as "K=X: Y.YYYY". Then print the K with the highest silhouette score on a new line as "Best K: X".

Use DBSCAN to Handle Noise
Write Code

Fit a DBSCAN(eps=0.5, min_samples=5) on the scaled data. Print the number of clusters found (excluding noise), then the number of noise points on a new line. Noise points have label -1.

Complete Clustering Pipeline
Write Code

Build a full pipeline: scale the customer data with StandardScaler, find the best K (from 2 to 5) using silhouette score, then print the best K and its silhouette score rounded to 4 decimal places. Format: "Best K: X" then "Silhouette: X.XXXX" on separate lines.