Build Your First ML Model: Linear Regression with scikit-learn
Imagine you're a real estate agent. A client asks, "How much is my house worth?" You look at the square footage, the number of bedrooms, and the neighborhood — then you give an estimate. You just did regression in your head.
Linear regression is the same idea, but done mathematically. You feed a model examples of houses with known prices, and it learns a formula to predict the price of new houses. It is one of the most widely used techniques in machine learning, and scikit-learn makes it surprisingly easy.
In this tutorial, you'll build a complete regression pipeline: split data, train a model, make predictions, and evaluate how well it performs. By the end, you'll understand the core workflow that every ML project follows.
What Does the ML Workflow Look Like?
Every supervised machine learning project follows the same five steps:
1. Prepare your data in the shape scikit-learn expects.
2. Split it into a training set and a test set.
3. Train the model by calling .fit() to let it learn patterns.
4. Make predictions by calling .predict() on new data.
5. Evaluate how closely those predictions match the actual values.

Let's walk through each step with a concrete example.
How Do You Prepare Data for scikit-learn?
scikit-learn expects your data in a specific shape. Features (the inputs) go in a 2D array where each row is a sample and each column is a feature. The target (the value you want to predict) goes in a 1D array.
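Here is a minimal sketch of that layout, using invented square-footage and price numbers:

```python
import numpy as np

# Hypothetical records for five houses: square footage and bedroom count.
X = np.array([
    [1400, 3],
    [1600, 3],
    [1700, 4],
    [1875, 4],
    [2350, 5],
])  # shape (5, 2): five samples (rows), two features (columns)

# Target: one known sale price per house.
y = np.array([245000, 312000, 329000, 360000, 450000])  # shape (5,)

print(X.shape, y.shape)  # (5, 2) (5,)
```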
Why Split Data into Train and Test?
Imagine studying for a test by memorizing every answer on the practice exam. You'd score 100% on that practice exam — but bomb the real one. That's what happens when you evaluate a model on the same data it trained on.
By splitting your data, you train on one portion and test on another. The test set acts as "unseen" data, giving you an honest measure of how well the model generalizes.
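A minimal sketch of the split, with make_regression standing in for real housing data (the sizes and random_state here are arbitrary choices):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic data stands in for real records in this sketch.
X, y = make_regression(n_samples=100, n_features=2, noise=10, random_state=42)

# Hold out 20% of the rows; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (80, 2) (20, 2)
```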
How Do You Train a Linear Regression Model?
Training is the step where the model learns. Linear regression finds the best line (or hyperplane) through your data by minimizing the sum of squared errors between predictions and actual values.
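Continuing the synthetic-data sketch above, training and predicting are two method calls:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=100, n_features=2, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# fit() finds the coefficients that minimize squared error on the training set.
model = LinearRegression()
model.fit(X_train, y_train)

# predict() applies the learned formula to rows the model has never seen.
predictions = model.predict(X_test)
print(predictions[:3])
```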
What Do the Coefficients Tell You?
After training, the model stores two things: the coefficients (one slope per feature) and the intercept (the model's prediction when every feature is zero). Each coefficient tells you how much the target changes when that feature increases by one unit, with the other features held constant.
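A contrived example makes this concrete. The data below is built so the true relationship is exactly y = 200x + 50:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Contrived data constructed so that y = 200 * x + 50 exactly.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([250.0, 450.0, 650.0, 850.0])

model = LinearRegression().fit(X, y)
print(model.coef_.round(2))        # [200.] (one slope per feature)
print(round(model.intercept_, 2))  # 50.0 (predicted y when x is 0)
```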
How Do You Measure Regression Performance?
The R-squared score (R²) tells you what fraction of the variation in the target your model explains. An R² of 1.0 means perfect predictions. An R² of 0.0 means the model is no better than always guessing the average. On test data, R² can even be negative, meaning the model is worse than that average-guessing baseline.
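A quick check with hand-picked numbers (for a fitted estimator, model.score(X_test, y_test) computes the same metric):

```python
from sklearn.metrics import r2_score

# Illustrative values: the predictions hug the true values closely.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.8, 5.1, 7.2, 8.9]
print(r2_score(y_true, y_pred))  # about 0.995: ~99.5% of the variance explained
```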
When Should You Scale Your Features?
If one feature is measured in thousands (like salary) and another in single digits (like years of experience), the model might give unfair weight to the larger-scale feature. StandardScaler transforms each feature to have a mean of 0 and a standard deviation of 1.
Plain linear regression does not strictly require scaling; the fitted coefficients simply absorb each feature's scale. But scaling matters for regularized models (Ridge, Lasso), and it makes coefficients directly comparable across features.
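Here is a sketch of the leak-free pattern: fit the scaler on the training split only, then use it to transform both splits (synthetic data again stands in for real records):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=3, noise=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the scaler on the training data only, then reuse those train-set
# statistics to transform the test set. Fitting on all the data first
# would leak test-set information into training.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```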
What If the Relationship Isn't a Straight Line?
Not every relationship is linear. If house prices grow faster as square footage increases (a curve, not a line), you need polynomial features. These add squared terms, and higher powers if you raise the degree, so a linear model can capture curved patterns.
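A minimal sketch with invented quadratic data, using PolynomialFeatures to add the squared column:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Invented quadratic data: y is roughly x squared, plus some noise.
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(0, 0.5, 50)

# degree=2 adds an x^2 column, so the linear model can fit the curve.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

model = LinearRegression().fit(X_poly, y)
print(round(model.score(X_poly, y), 4))  # close to 1.0 on this curved data
```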
Practice Exercises
Exercise 1: Create a LinearRegression model, fit it on the given X_train and y_train, then print the R-squared score on X_test and y_test rounded to 4 decimal places. Print only the number. The data is already created for you.
Exercise 2: What will this code print? Think about what coef_ and intercept_ represent.
Exercise 3: This code has a common data leakage bug: the scaler is fit on the full dataset before splitting. Fix it so the scaler is fit only on the training data, then used to transform both the train and test sets. Print the R² score rounded to 4 decimal places.
Exercise 4: Build a regression model with 5 features. Generate data with make_regression(n_samples=200, n_features=5, noise=15, random_state=42). Split 80/20, train a LinearRegression, and print all 5 coefficients rounded to 2 decimal places as a list. Use model.coef_.round(2).tolist().
Exercise 5: The data below is curved (quadratic). Fit a plain LinearRegression and a polynomial regression (degree 2). Print the two training-set R² scores, each rounded to 4 decimal places, on separate lines. Label them "Linear R2: ..." and "Poly R2: ...".
Exercise 6: Build a complete pipeline: generate data with make_regression(n_samples=150, n_features=4, noise=20, random_state=0), split 75/25 with random_state=0, scale with StandardScaler (fit on the training set only), train a LinearRegression, and print the RMSE on the test set rounded to 2 decimal places. Use np.sqrt(mean_squared_error(...)).