
Build Your First ML Model: Linear Regression with scikit-learn

Intermediate · 30 min · 6 exercises · 90 XP

Imagine you're a real estate agent. A client asks, "How much is my house worth?" You look at the square footage, the number of bedrooms, and the neighborhood — then you give an estimate. You just did regression in your head.

Linear regression is the same idea, but done mathematically. You feed a model examples of houses with known prices, and it learns a formula to predict the price of new houses. It is one of the most widely used techniques in machine learning, and scikit-learn makes it surprisingly easy.

In this tutorial, you'll build a complete regression pipeline: split data, train a model, make predictions, and evaluate how well it performs. By the end, you'll understand the core workflow that every ML project follows.

What Does the ML Workflow Look Like?

Every supervised machine learning project follows the same five steps:

  • Prepare your data — organize features (inputs) and target (output).
  • Split into train and test — hold some data back for honest evaluation.
  • Train the model — call .fit() to let it learn patterns.
  • Make predictions — call .predict() on new data.
  • Evaluate — measure how close the predictions are to reality.

Let's walk through each step with a concrete example.

How Do You Prepare Data for scikit-learn?

scikit-learn expects your data in a specific shape. Features (the inputs) go in a 2D array where each row is a sample and each column is a feature. The target (the value you want to predict) goes in a 1D array.

Generating sample data
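A minimal sketch of this layout, using made-up house sizes and prices purely for illustration:

```python
import numpy as np

# Hypothetical toy dataset: square footage and bedroom count for 5 houses.
# Features go in a 2D array: one row per sample, one column per feature.
X = np.array([
    [1400, 3],
    [1600, 3],
    [1700, 4],
    [1875, 3],
    [2350, 4],
])

# The target goes in a 1D array: one known price (in thousands) per sample.
y = np.array([245, 312, 279, 308, 405])

print(X.shape)  # (5, 2) -> 5 samples, 2 features
print(y.shape)  # (5,)   -> one target value per sample
```

If your features live in a pandas DataFrame, the same rule applies: pass a 2D table of features and a 1D column of targets.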

Why Split Data into Train and Test?

Imagine studying for a test by memorizing every answer on the practice exam. You'd score 100% on that practice exam — but bomb the real one. That's what happens when you evaluate a model on the same data it trained on.

By splitting your data, you train on one portion and test on another. The test set acts as "unseen" data, giving you an honest measure of how well the model generalizes.

Splitting data with train_test_split
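One way the split might look, sketched on a small synthetic dataset (the array contents are arbitrary placeholders):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Arbitrary synthetic data: 10 samples, 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold back 20% of the data for testing; random_state makes the
# shuffle reproducible so you get the same split every run.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

The rows are shuffled before splitting, so the test set is a random sample rather than just the last few rows.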

How Do You Train a Linear Regression Model?

Training is the step where the model learns. Linear regression finds the best line (or hyperplane) through your data by minimizing the sum of squared errors between predictions and actual values.

Fit, predict — the core workflow
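A minimal sketch of the fit/predict cycle, on noise-free data generated from y = 3x + 7 so the learned line is easy to verify:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Noise-free synthetic data following y = 3x + 7 exactly.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = 3 * X.ravel() + 7

model = LinearRegression()
model.fit(X, y)  # learn the best-fit line from the examples

# Predict for an input the model has never seen.
preds = model.predict(np.array([[6.0]]))
print(round(preds[0], 2))  # 25.0, since 3*6 + 7 = 25
```

Because the data here has no noise, the model recovers the generating line exactly; on real data, predictions will be close but not perfect.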

What Do the Coefficients Tell You?

After training, the model stores two things: coefficients (one slope per feature, in coef_) and the intercept (in intercept_). Each coefficient tells you how much the predicted target changes when that feature increases by one unit, holding the other features fixed.

Reading model coefficients
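For example, on made-up data generated from y = 2*x1 + 5*x2 + 10 with no noise, the fitted model should recover those exact numbers:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a known formula and no noise.
rng = np.random.default_rng(0)
X = rng.random((50, 2))
y = 2 * X[:, 0] + 5 * X[:, 1] + 10

model = LinearRegression().fit(X, y)

print(model.coef_.round(2))        # [2. 5.] -> one slope per feature
print(round(model.intercept_, 2))  # 10.0   -> predicted value when all features are 0
```

Reading it back: increasing the first feature by one unit raises the prediction by 2, and the second by 5, matching the formula the data came from.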

How Do You Measure Regression Performance?

The R-squared score (R^2) tells you what fraction of the variation in the target your model explains. An R^2 of 1.0 means perfect predictions; an R^2 of 0.0 means the model does no better than always predicting the mean. RMSE (root mean squared error) complements it by reporting the typical prediction error in the target's own units.

Evaluating with R-squared and RMSE
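Both metrics can be sketched on synthetic, slightly noisy linear data (the formula and noise level here are arbitrary choices):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: y = 4x + 3 plus a little Gaussian noise.
rng = np.random.default_rng(42)
X = rng.random((100, 1)) * 10
y = 4 * X.ravel() + 3 + rng.normal(0, 1, 100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
preds = model.predict(X_test)

r2 = r2_score(y_test, preds)                       # fraction of variance explained
rmse = np.sqrt(mean_squared_error(y_test, preds))  # error in the target's own units
print(round(r2, 3), round(rmse, 3))
```

With noise of standard deviation 1 on a target spanning roughly 3 to 43, expect R^2 near 1 and an RMSE around 1.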

When Should You Scale Your Features?

If one feature is measured in thousands (like salary) and another in single digits (like years of experience), the model might give unfair weight to the larger-scale feature. StandardScaler transforms each feature to have a mean of 0 and a standard deviation of 1.

Plain linear regression does not strictly require scaling, because ordinary least squares can absorb scale differences into its coefficients. But scaling matters for regularized models (Ridge, Lasso) and makes coefficients directly comparable.

StandardScaler in action
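A quick sketch with two made-up features on very different scales:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up features: salary (thousands-scale) vs. years of experience.
X = np.array([
    [45000.0, 2.0],
    [60000.0, 5.0],
    [80000.0, 9.0],
    [52000.0, 3.0],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # per column: subtract the mean, divide by the std

print(X_scaled.mean(axis=0))  # both columns now centered at ~0
print(X_scaled.std(axis=0))   # both columns now have std ~1
```

In a real pipeline, fit the scaler on the training set only and reuse it to transform the test set, so no test-set statistics leak into training.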

What If the Relationship Isn't a Straight Line?

Not every relationship is linear. If house prices grow faster as square footage increases (a curve, not a line), you can add polynomial features: squared terms (and, at higher degrees, cubed terms and beyond) that let a linear model capture curved patterns.

Polynomial features for curved data
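As a sketch, compare a plain line against a degree-2 pipeline on made-up quadratic data (y = x^2):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Made-up curved data: y = x^2, which no straight line fits well.
X = np.linspace(-3, 3, 30).reshape(-1, 1)
y = X.ravel() ** 2

# degree=2 adds an x^2 column, so the "linear" model can fit the curve.
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X, y)

linear_model = LinearRegression().fit(X, y)

print(round(poly_model.score(X, y), 4))    # near 1.0 on this quadratic data
print(round(linear_model.score(X, y), 4))  # far lower: a line can't capture the curve
```

Wrapping the two steps in make_pipeline keeps the feature expansion and the regression together, so a single fit call handles both.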

Practice Exercises

Train Your First Model
Write Code

Create a LinearRegression model, fit it on the given X_train and y_train, then print the R-squared score on X_test, y_test rounded to 4 decimal places. Print only the number.

The data is already created for you.
Read the Coefficients
Predict Output

What will this code print? Think about what coef_ and intercept_ represent.

Fix the Scaling Bug
Fix the Bug

This code has a common data leakage bug: the scaler is fit on the full dataset before splitting. Fix it so the scaler is fit only on training data, then used to transform both train and test. Print the R2 score rounded to 4 decimal places.
Multi-Feature Regression
Write Code

Build a regression model with 5 features. Generate data with make_regression(n_samples=200, n_features=5, noise=15, random_state=42). Split 80/20, train a LinearRegression, and print all 5 coefficients rounded to 2 decimal places as a list. Use model.coef_.round(2).tolist().

Polynomial vs Linear
Write Code

The data below is curved (quadratic). Fit a plain LinearRegression and a polynomial regression (degree 2). Print the two R2 scores (on training data), each rounded to 4 decimal places, on separate lines. Label them "Linear R2: ..." and "Poly R2: ...".

Complete Regression Pipeline
Write Code

Build a complete pipeline: generate data with make_regression(n_samples=150, n_features=4, noise=20, random_state=0), split 75/25 with random_state=0, scale with StandardScaler (fit on train only), train LinearRegression, and print the RMSE on the test set rounded to 2 decimal places. Use np.sqrt(mean_squared_error(...)).