
Python Classification: Build a Classifier with scikit-learn

Intermediate · 30 min · 6 exercises · 90 XP

Your email inbox sorts messages into "spam" or "not spam." A bank flags transactions as "fraudulent" or "legitimate." A doctor classifies a tumor as "benign" or "malignant." These are all classification problems — the model predicts which category something belongs to.

Classification is different from regression. Regression predicts a number (like a price). Classification predicts a label (like "spam" or "not spam"). But the workflow is almost identical: prepare data, split, train, predict, evaluate.

In this tutorial, you'll build classifiers with three popular algorithms — Logistic Regression, Decision Trees, and K-Nearest Neighbors. You'll also learn how to measure classification performance beyond simple accuracy.

How Is Classification Different from Regression?

The key difference is in the target variable. If the target is a continuous number (price, temperature, salary), it's regression. If the target is a discrete category (spam/not-spam, cat/dog, healthy/sick), it's classification.

A binary classification dataset
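A quick way to produce a two-class dataset to experiment with is scikit-learn's `make_classification`. The specific parameters below (200 samples, 4 features, an 80/20 split) are illustrative assumptions, not the course's exact dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate a synthetic binary classification dataset
# (illustrative stand-in for the course's data)
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=2,
    n_redundant=0, random_state=42,
)

# Hold out 20% of the data for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (160, 4) (40, 4)
```

The target `y` contains only the discrete labels 0 and 1 — that is what makes this a classification problem rather than regression.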

What Is Logistic Regression?

Despite the name, logistic regression is a classification algorithm. It works by finding a line (or hyperplane) that separates the classes, then applying a sigmoid function to squash the output between 0 and 1. If the output is above 0.5, the model predicts class 1; otherwise, class 0.

Logistic Regression classifier
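Training and scoring a logistic regression follows the standard scikit-learn fit/score pattern. This sketch uses a synthetic dataset as a stand-in for the course's data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumed, not the course's dataset)
X, y = make_classification(n_samples=200, n_features=4, n_informative=2,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the classifier and report test-set accuracy
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"Accuracy: {accuracy:.4f}")
```

`score()` on a classifier returns accuracy: the fraction of test samples whose predicted label matches the true label.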

How Confident Is the Model?

Sometimes you don't just want the label — you want to know how confident the model is. The predict_proba() method returns the probability for each class. A prediction with 99% probability is more trustworthy than one with 51%.

Prediction probabilities
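On a fitted classifier, `predict_proba()` returns one row per sample and one column per class. A minimal sketch, again on assumed synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumed, not the course's dataset)
X, y = make_classification(n_samples=200, n_features=4, n_informative=2,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)

# One row per test sample, one column per class
proba = model.predict_proba(X_test)
print(proba.shape)  # (40, 2)
print(proba[0])     # the two probabilities sum to 1
```

The column order matches `model.classes_`, so for labels 0 and 1 the second column is the probability of class 1 — the value the 0.5 threshold is applied to.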

How Does a Decision Tree Classify?

A decision tree works like a game of 20 Questions. It asks a series of yes/no questions about the features ("Is feature 2 greater than 0.5?") and follows different branches based on the answers until it reaches a final prediction.

Decision Tree with controlled depth
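Limiting `max_depth` caps how many questions the tree may ask before it must answer, which keeps it from memorizing the training data. A sketch on assumed synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumed, not the course's dataset)
X, y = make_classification(n_samples=200, n_features=4, n_informative=2,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# A depth-3 tree asks at most 3 questions per prediction
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)
print(f"Accuracy: {tree.score(X_test, y_test):.4f}")
print(f"Leaves: {tree.get_n_leaves()}")
```

A tree of depth 3 can have at most 2³ = 8 leaves; it may grow fewer if some branches stop early.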

What Is K-Nearest Neighbors (KNN)?

KNN is the simplest classifier to understand. When a new data point arrives, KNN looks at the K closest points in the training data and takes a majority vote. If 4 out of 5 neighbors are "spam," the new point is classified as "spam."

K-Nearest Neighbors classifier
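Because KNN votes among the *closest* points, the notion of distance matters: features on wildly different scales will dominate the vote, so scaling the data first is standard practice. A sketch on assumed synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumed, not the course's dataset)
X, y = make_classification(n_samples=200, n_features=4, n_informative=2,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# KNN is distance-based, so put features on a comparable scale;
# fit the scaler on training data only to avoid leakage
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Each prediction is a majority vote among the 5 nearest neighbors
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_s, y_train)
print(f"Accuracy: {knn.score(X_test_s, y_test):.4f}")
```

Note that `fit` for KNN does no real training — it just stores the data; the work happens at prediction time, when distances to every stored point are computed.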

What Does a Confusion Matrix Show?

Accuracy tells you the overall percentage of correct predictions, but it hides important details. A confusion matrix breaks predictions into four categories: true positives, true negatives, false positives, and false negatives.

Think of a medical test. A false negative means telling a sick person they're healthy (dangerous!). A false positive means telling a healthy person they're sick (stressful but less dangerous). The confusion matrix lets you see both types of errors.

Confusion matrix and classification report
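scikit-learn's `confusion_matrix` puts true labels on the rows and predicted labels on the columns, so for binary labels 0/1 the layout is `[[TN, FP], [FN, TP]]`. A sketch on assumed synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumed, not the course's dataset)
X, y = make_classification(n_samples=200, n_features=4, n_informative=2,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true labels, columns are predictions: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Precision, recall, and F1 per class
print(classification_report(y_test, y_pred))
```

The classification report summarizes the same counts as per-class precision (how many predicted positives were real) and recall (how many real positives were found) — recall is the number to watch when false negatives are the dangerous error.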

How Do You Choose the Best Classifier?

There's no single "best" algorithm. The right choice depends on your data and your problem. A quick way to compare is to train several models on the same data and check their test accuracy.

Head-to-head model comparison
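One pattern for a head-to-head comparison is a dict of models and a single loop, so every model sees exactly the same split. The models and data below are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumed, not the course's dataset)
X, y = make_classification(n_samples=200, n_features=4, n_informative=2,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale once so the distance-based KNN gets a fair comparison
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

models = {
    "LR": LogisticRegression(random_state=42),
    "DT": DecisionTreeClassifier(max_depth=4, random_state=42),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

# Train each model on the same data and compare test accuracy
for name, clf in models.items():
    clf.fit(X_train_s, y_train)
    print(f"{name}: {clf.score(X_test_s, y_test):.4f}")
```

Keep in mind that a single accuracy number on one split is a rough comparison; for a real decision you would also look at the confusion matrix and, ideally, cross-validation.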

Practice Exercises

Train a Logistic Regression Classifier
Write Code

Create a LogisticRegression classifier with random_state=42. Train it on the provided training data and print the accuracy on the test set, rounded to 4 decimal places.

Predict the Output
Predict Output

What shape will predict_proba() return for 40 test samples in a binary classification problem?

Build a Decision Tree Classifier
Write Code

Train a DecisionTreeClassifier with max_depth=3 and random_state=42. Print the accuracy on the test set rounded to 4 decimal places, then on the next line print the number of leaves in the tree.

Fix the KNN Bug
Fix the Bug

This KNN classifier is getting poor results because it is missing a critical step. Find and fix the bug. Print the accuracy rounded to 4 decimal places.

Read a Confusion Matrix
Write Code

Train a LogisticRegression(random_state=42) on the data. Print the confusion matrix, then on the next line print the number of false positives (predicted 1 but actually 0). The confusion matrix layout is [[TN, FP], [FN, TP]].

Compare Three Classifiers
Write Code

Train LogisticRegression, DecisionTreeClassifier(max_depth=4), and KNeighborsClassifier(n_neighbors=5) on the scaled data. Print each model's accuracy on the test set in the format "ModelName: X.XXXX" on separate lines. Use names: "LR", "DT", "KNN". All models use random_state=42 where applicable.
