Evaluating ML Models: Cross-Validation, Metrics, and Overfitting
You've trained a model and it gets 95% accuracy. Should you celebrate? Not yet. That number might be lying to you. Maybe your test set was too easy. Maybe the model memorized the training data. Maybe accuracy is the wrong metric entirely.
Model evaluation is the most important skill in machine learning. A model that looks great on paper but fails in production is worse than useless — it gives you false confidence. In this tutorial, you'll learn how to honestly assess model performance.
We'll cover cross-validation (for reliable estimates), precision/recall/F1 (for classification nuance), ROC-AUC (for threshold-independent evaluation), and GridSearchCV (for finding the best hyperparameters). These are the tools that separate hobbyists from professionals.
Why Is a Single Train-Test Split Risky?
When you split data once with train_test_split, your score depends on which samples ended up in the test set. You might get lucky (easy test set) or unlucky (hard test set). A different random seed could give a very different score.
Cross-validation solves this by splitting the data multiple times. In 5-fold cross-validation, the data is divided into 5 chunks. The model trains on 4 chunks and tests on the remaining 1, rotating through all 5 combinations. You get 5 scores instead of 1, giving a much more reliable estimate.
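Here is a minimal sketch of 5-fold cross-validation with scikit-learn's cross_val_score. The breast cancer dataset and the logistic regression settings (max_iter raised so the solver converges) are just illustrative stand-ins; swap in your own data and model.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder data and model for illustration.
X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# 5-fold CV: train on 4 folds, score on the held-out fold, rotate 5 times.
scores = cross_val_score(model, X, y, cv=5)
print("Fold scores:", scores.round(4))
print(f"Mean: {scores.mean():.4f}  Std: {scores.std():.4f}")
```

The standard deviation tells you how much the score moves around depending on which samples land in the held-out fold.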
What Are Overfitting and Underfitting?
Imagine studying for a history exam. Underfitting is like skimming the textbook — you barely learn the material and fail the exam. Overfitting is like memorizing every word verbatim — you ace the practice exam but can't answer questions phrased differently on the real test.
The sweet spot is understanding the core concepts well enough to answer new questions. In ML terms, you want a model complex enough to capture patterns but not so complex that it memorizes noise.
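A quick way to see this in code is to compare training accuracy against cross-validated accuracy at different tree depths. The dataset and the depth values below are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Compare an underfit stump, a moderate tree, and an unconstrained tree.
for depth in [1, 4, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    train_acc = tree.fit(X, y).score(X, y)          # accuracy on data it has seen
    cv_acc = cross_val_score(tree, X, y, cv=5).mean()  # accuracy on held-out folds
    print(f"max_depth={depth}: train={train_acc:.4f}  cv={cv_acc:.4f}")
```

Typically the unconstrained tree scores near 1.0 on the data it memorized but noticeably lower under cross-validation, while the depth-1 stump scores low on both.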
Why Do You Need Three Sets?
If you use the test set to choose hyperparameters, you've contaminated it. The test set should only be used once — at the very end. During model selection, you need a separate validation set.
The standard strategy is: fit candidate models on the training set, use the validation set to compare them and tune hyperparameters, and touch the test set only once at the very end for the final, unbiased estimate.
Cross-validation is an alternative to a fixed validation set. It's more data-efficient because every sample gets used for both training and validation across different folds.
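If you do want a fixed validation set, one way to carve out the three sets is two calls to train_test_split. The 60/20/20 proportions below are just an example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# First split off the test set (20%), then a validation set from the rest.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)  # 0.25 * 0.8 = 0.2

print(len(X_train), len(X_val), len(X_test))  # roughly 60% / 20% / 20%
```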
When Is Accuracy Not Enough?
Imagine a dataset where 99% of transactions are legitimate and 1% are fraud. A model that always predicts "legitimate" gets 99% accuracy — but catches zero fraud. Accuracy is misleading when classes are imbalanced.
Precision tells you: of all items predicted as positive, how many actually are? Recall tells you: of all actual positives, how many did the model catch? F1-score is the harmonic mean of precision and recall — a single number that balances both.
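The sketch below builds a synthetic imbalanced dataset (roughly 95% negative) with make_classification and prints all four metrics side by side. The class weights and model settings are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: ~95% negative, ~5% positive.
X, y = make_classification(n_samples=5000, weights=[0.95], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = model.predict(X_test)

print(f"Accuracy:  {accuracy_score(y_test, pred):.4f}")   # looks impressive on its own
print(f"Precision: {precision_score(y_test, pred):.4f}")  # of predicted positives, how many are real
print(f"Recall:    {recall_score(y_test, pred):.4f}")     # of real positives, how many were caught
print(f"F1:        {f1_score(y_test, pred):.4f}")         # harmonic mean of precision and recall
```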
What Does ROC-AUC Measure?
The ROC curve plots the true positive rate against the false positive rate at every possible threshold. The Area Under the Curve (AUC) summarizes this into a single number. An AUC of 1.0 is a perfect classifier; an AUC of 0.5 is no better than random guessing.
AUC is useful because it evaluates the model's ability to rank predictions correctly, regardless of which threshold you pick. A model with high AUC has good separation between classes.
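A minimal ROC-AUC sketch, again on synthetic data for illustration. Note that roc_auc_score expects predicted probabilities (or decision scores), not hard class labels.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.9], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Use the probability of the positive class, not model.predict().
proba = model.predict_proba(X_test)[:, 1]
print(f"ROC-AUC: {roc_auc_score(y_test, proba):.4f}")
```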
How Do You Find the Best Hyperparameters?
Hyperparameters are settings you choose before training — like max_depth for a decision tree or n_neighbors for KNN. GridSearchCV tries every combination you specify and uses cross-validation to find the best one.
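Here is what that looks like for a decision tree's max_depth; the grid values and dataset are placeholder choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Try every max_depth in the grid, scoring each candidate with 5-fold CV.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [2, 3, 4, 5, 6, 7]},
    cv=5,
    scoring="accuracy",
)
grid.fit(X, y)

print("Best max_depth:", grid.best_params_["max_depth"])
print(f"Best CV accuracy: {grid.best_score_:.4f}")
```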
How Do You Choose the Best Model?
Here is the professional workflow for model selection:
1. Split off a test set and set it aside.
2. Use cross-validation (for example, via GridSearchCV) on the training data to compare models and tune hyperparameters.
3. Retrain the winning model on the full training set.
4. Evaluate it exactly once on the test set and report that score.
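The sketch below walks through these steps end to end; the dataset, model, and grid are placeholders for whatever you are actually comparing.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# 1. Hold out a test set that stays untouched until the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 2. Tune hyperparameters with 5-fold CV on the training data only.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [2, 3, 4, 5, 6, 7]},
    cv=5,
    scoring="accuracy",
)
grid.fit(X_train, y_train)  # 3. refit=True (the default) retrains the best model on all of X_train

# 4. One final, honest evaluation on the untouched test set.
print("Best params:", grid.best_params_)
print(f"CV accuracy:   {grid.best_score_:.4f}")
print(f"Test accuracy: {grid.score(X_test, y_test):.4f}")
```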
Practice Exercises
Use cross_val_score with 5-fold CV to evaluate a LogisticRegression(random_state=42). Print the mean accuracy rounded to 4 decimal places, then the standard deviation rounded to 4 decimal places on a new line.
Train two decision trees: one with max_depth=2 and one with max_depth=None (unlimited). For each, print the training accuracy and the mean 5-fold CV accuracy (both rounded to 4 decimal places). Format: "Depth-2 train: X.XXXX cv: X.XXXX" and "No-limit train: X.XXXX cv: X.XXXX".
Train a LogisticRegression(random_state=42) on the imbalanced dataset. Print three lines: "Precision: X.XXXX", "Recall: X.XXXX", "F1: X.XXXX" — each rounded to 4 decimal places.
Use GridSearchCV to find the best max_depth for a DecisionTreeClassifier from the values [2, 3, 4, 5, 6, 7]. Use 5-fold CV and scoring='accuracy'. Print the best max_depth value, then the best CV score rounded to 4 decimal places on a new line.
Split the data 80/20, then use cross_val_score (5-fold, scoring='f1') on the training set to evaluate a LogisticRegression(random_state=42). Print the mean CV F1 score rounded to 4 decimal places. Then train on the full training set, predict on the test set, and print the test ROC-AUC rounded to 4 decimal places on a new line.