Python Classification: Build a Classifier with scikit-learn
Your email inbox sorts messages into "spam" or "not spam." A bank flags transactions as "fraudulent" or "legitimate." A doctor classifies a tumor as "benign" or "malignant." These are all classification problems — the model predicts which category something belongs to.
Classification is different from regression. Regression predicts a number (like a price). Classification predicts a label (like "spam" or "not spam"). But the workflow is almost identical: prepare data, split, train, predict, evaluate.
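To make that workflow concrete, here is a minimal sketch on synthetic data (the dataset, sizes, and random_state values below are placeholder choices for illustration, not part of any exercise):

```python
# Minimal classification workflow: prepare, split, train, predict, evaluate.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Prepare: 200 synthetic samples with 4 features and 2 classes
X, y = make_classification(n_samples=200, n_features=4, random_state=42)

# Split: hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred))
```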
In this tutorial, you'll build classifiers with three popular algorithms — Logistic Regression, Decision Trees, and K-Nearest Neighbors. You'll also learn how to measure classification performance beyond simple accuracy.
How Is Classification Different from Regression?
The key difference is in the target variable. If the target is a continuous number (price, temperature, salary), it's regression. If the target is a discrete category (spam/not-spam, cat/dog, healthy/sick), it's classification.
What Is Logistic Regression?
Despite the name, logistic regression is a classification algorithm. It works by finding a line (or hyperplane) that separates the classes, then applying a sigmoid function to squash the output into a value between 0 and 1. If that value is above 0.5, the model predicts class 1; otherwise, class 0.
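You can see the sigmoid and the 0.5 threshold directly in code. This sketch reuses the split from the workflow example above and recomputes the predicted labels by hand; for a binary LogisticRegression, the probabilities from predict_proba() correspond to exactly this sigmoid of the raw score:

```python
# Sketch: raw linear scores -> sigmoid -> 0.5 threshold -> class labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(random_state=42).fit(X_train, y_train)

scores = model.decision_function(X_test[:3])  # raw linear scores (any real number)
probs = 1 / (1 + np.exp(-scores))             # sigmoid squashes them into (0, 1)

print((probs > 0.5).astype(int))  # thresholding at 0.5...
print(model.predict(X_test[:3]))  # ...reproduces the model's own predictions
```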
How Confident Is the Model?
Sometimes you don't just want the label — you want to know how confident the model is. The predict_proba() method returns the probability for each class. A prediction with 99% probability is more trustworthy than one with 51%.
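Continuing with the model trained above, a short sketch of reading those probabilities:

```python
# Each row of predict_proba() sums to 1: one probability per class.
probs = model.predict_proba(X_test[:5])  # shape (5, 2) for a binary problem

for p in probs:
    # p.argmax() is the predicted class; p.max() is the model's confidence in it
    print(f"class {p.argmax()} with probability {p.max():.2f}")
```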
How Does a Decision Tree Classify?
A decision tree works like a game of 20 Questions. It asks a series of yes/no questions about the features ("Is feature 2 greater than 0.5?") and follows different branches based on the answers until it reaches a final prediction.
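To see those questions for a real tree, scikit-learn's export_text prints the rules a fitted tree has learned. A sketch, again reusing the synthetic split from above:

```python
# Sketch: a shallow tree whose yes/no questions we can print and read.
from sklearn.tree import DecisionTreeClassifier, export_text

tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(X_train, y_train)

# Prints the branching rules, with thresholds like "feature_2 <= 0.5"
print(export_text(tree))
```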
What Is K-Nearest Neighbors (KNN)?
KNN is the simplest classifier to understand. When a new data point arrives, KNN looks at the K closest points in the training data and takes a majority vote. If 4 out of 5 neighbors are "spam," the new point is classified as "spam."
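Here is a sketch of that vote on the same synthetic split; kneighbors() exposes which training points are doing the voting:

```python
# Sketch: KNN with K=5, plus a peek at the neighbors behind one prediction.
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

distances, indices = knn.kneighbors(X_test[:1])  # the 5 closest training points
print(y_train[indices[0]])                       # their labels: the voters
print(knn.predict(X_test[:1]))                   # the majority wins
```

Because KNN votes by distance, features on large scales dominate the vote, so scaling the features first (for example with StandardScaler) usually matters a lot; one of the practice exercises below hinges on exactly that.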
What Does a Confusion Matrix Show?
Accuracy tells you the overall percentage of correct predictions, but it hides important details. For a binary problem, a confusion matrix breaks predictions into four categories: true positives, true negatives, false positives, and false negatives.
Think of a medical test. A false negative means telling a sick person they're healthy (dangerous!). A false positive means telling a healthy person they're sick (stressful but less dangerous). The confusion matrix lets you see both types of errors.
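In scikit-learn, confusion_matrix lays those four counts out as a 2x2 grid. A sketch using the logistic regression model trained above:

```python
# For binary labels 0/1 the layout is [[TN, FP], [FN, TP]]:
# rows are the true labels, columns are the predicted labels.
from sklearn.metrics import confusion_matrix

y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)

tn, fp, fn, tp = cm.ravel()
print(f"false positives: {fp}, false negatives: {fn}")
```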
How Do You Choose the Best Classifier?
There's no single "best" algorithm. The right choice depends on your data and your problem. A quick way to compare is to train several models on the same data and check their test accuracy.
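One way to run that comparison, sketched on the synthetic split from earlier (the models and hyperparameters here are arbitrary illustrative choices):

```python
# Train several classifiers on the same split and compare test accuracy.
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Scale once so the distance-based KNN isn't disadvantaged
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

models = {
    "logreg": LogisticRegression(random_state=42),
    "tree": DecisionTreeClassifier(max_depth=4, random_state=42),
    "knn": KNeighborsClassifier(n_neighbors=5),
}
for name, clf in models.items():
    clf.fit(X_train_s, y_train)
    print(f"{name}: {clf.score(X_test_s, y_test):.4f}")
```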
Practice Exercises
Create a LogisticRegression classifier with random_state=42. Train it on the provided training data and print the accuracy on the test set, rounded to 4 decimal places.
What shape will predict_proba() return for 40 test samples in a binary classification problem?
Train a DecisionTreeClassifier with max_depth=3 and random_state=42. Print the accuracy on the test set rounded to 4 decimal places, then on the next line print the number of leaves in the tree.
This KNN classifier is getting poor results because it is missing a critical step. Find and fix the bug. Print the accuracy rounded to 4 decimal places.
Train a LogisticRegression(random_state=42) on the data. Print the confusion matrix, then on the next line print the number of false positives (predicted 1 but actually 0). The confusion matrix layout is [[TN, FP], [FN, TP]].
Train LogisticRegression, DecisionTreeClassifier(max_depth=4), and KNeighborsClassifier(n_neighbors=5) on the scaled data. Print each model's accuracy on the test set in the format "ModelName: X.XXXX" on separate lines. Use names: "LR", "DT", "KNN". All models use random_state=42 where applicable.