
Predict Customer Churn: End-to-End Data Analysis (with sklearn)

Advanced · 45 min · 7 exercises · 140 XP

Losing customers is expensive. A widely cited rule of thumb is that acquiring a new customer costs five to seven times more than keeping an existing one. That's why companies invest heavily in churn prediction — figuring out which customers are about to leave so they can intervene before it's too late.

In this project, you'll build a complete churn prediction pipeline from scratch. You'll create a realistic customer dataset, explore it for patterns, engineer new features, train a machine learning model, and identify at-risk customers. By the end, you'll have a working system that a real business could use.

Step 1: Create the Customer Dataset

Every data project starts with data. In a real company, you'd pull this from a database. Here, we'll build a realistic dataset as a pandas DataFrame. Our fictional telecom company tracks each customer's tenure, monthly charges, contract type, and whether they churned.

The dataset needs enough rows to train a model and enough variety in its features to make predictions meaningful. We'll include both numerical features (like monthly charges) and categorical ones (like contract type).

Building a customer dataset
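
Here is a minimal sketch of how such a DataFrame might be built. The value ranges and random seed are illustrative choices, not the course's exact data.

```python
import numpy as np
import pandas as pd

# Illustrative only: 20 synthetic customers with mixed feature types.
rng = np.random.default_rng(42)
n = 20

customers = pd.DataFrame({
    "customer_id": range(1, n + 1),
    "tenure_months": rng.integers(1, 61, size=n),
    "monthly_charges": rng.uniform(20, 110, size=n).round(2),
    "contract": rng.choice(["Month", "Year", "TwoYear"], size=n),
    "churned": rng.integers(0, 2, size=n),  # roughly balanced 0/1 labels
})

print(customers.shape)
print(f"Churn rate: {customers['churned'].mean():.2f}")
```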

Notice the churn rate is around 50%. In real datasets it's usually lower (5–20%), but a balanced dataset makes our small example easier to learn from.

Exercise 1: Build the Customer DataFrame
Write Code

Create a pandas DataFrame called customers with at least 20 rows and these columns: customer_id (sequential integers starting at 1), tenure_months (integers), monthly_charges (floats or ints), contract (strings: 'Month', 'Year', or 'TwoYear'), and churned (0 or 1). Print the shape of the DataFrame and the churn rate (mean of the churned column), formatted as: Churn rate: X.XX.


Step 2: Explore and Understand the Data

Before building any model, you need to understand your data. What does the average customer look like? Are there differences between customers who stayed and those who left? Exploratory Data Analysis (EDA) answers these questions.

The describe() method gives you statistics for every numerical column at once. Grouping by the churn label reveals how churned customers differ from loyal ones.

Exploring churn patterns
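
A minimal sketch of this exploration, assuming the DataFrame from Step 1 is available as df:

```python
# Assumes `df` is the customer DataFrame built in Step 1.
print("Summary Statistics:")
print(df.describe())

print("Contract Counts:")
print(df["contract"].value_counts())

# Compare churned (1) vs. loyal (0) customers on the numeric features.
print("Churn Group Means:")
print(df.groupby("churned")[["tenure_months", "monthly_charges"]].mean())
```
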
Exercise 2: Exploratory Data Analysis
Write Code

Given the DataFrame df (already created for you), perform exploratory analysis. Print: (1) the output of df.describe(), (2) the value counts of the contract column, and (3) the mean tenure_months and monthly_charges grouped by churned. Print a label before each output: "Summary Statistics:", "Contract Counts:", and "Churn Group Means:".


Step 3: Feature Engineering

Raw data rarely gives the best predictions. Feature engineering is the art of creating new columns that capture patterns more clearly. For example, instead of raw tenure in months, we can create buckets like "new" (0–6 months), "growing" (7–24), and "loyal" (25+). This helps the model spot thresholds.

Another useful feature is the average monthly charge rate. If total charges divided by tenure doesn't match monthly charges, that might indicate plan changes — another churn signal.
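
As a rough sketch (assuming df also carries a total_charges column, which the exercise below relies on), the two features might be built like this:

```python
import numpy as np
import pandas as pd

# Bucket tenure: <= 6 months is 'new', 7-24 is 'growing', 25+ is 'loyal'.
df["tenure_bucket"] = pd.cut(
    df["tenure_months"],
    bins=[-np.inf, 6, 24, np.inf],
    labels=["new", "growing", "loyal"],
)

# Average charge per month of tenure; fall back to monthly_charges
# when a zero tenure produces inf or NaN.
ratio = df["total_charges"] / df["tenure_months"]
df["charge_per_month"] = ratio.replace([np.inf, -np.inf], np.nan).fillna(df["monthly_charges"])
```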

Exercise 3: Engineer New Features
Write Code

Starting from the provided DataFrame df, create two new columns:

1. tenure_bucket — 'new' if tenure_months <= 6, 'growing' if 7–24, 'loyal' if 25+. Use pd.cut() or conditional logic.

2. charge_per_month: total_charges / tenure_months (handle division by zero by replacing inf/NaN with monthly_charges).

Print the value counts of tenure_bucket, then print the first 5 rows showing customer_id, tenure_bucket, and charge_per_month.


Step 4: Prepare Features for the Model

Machine learning models work with numbers, not strings. We need to convert categorical columns like "contract" into numerical form. One-hot encoding creates a separate binary column for each category. We also need to separate our features (X) from our target (y).
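
A minimal sketch of this preparation, assuming df is the engineered DataFrame from Step 3:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# One-hot encode contract type, dropping the first level to avoid a redundant column.
encoded = pd.get_dummies(df, columns=["contract"], drop_first=True)

# Assemble the feature matrix and the target.
feature_cols = ["tenure_months", "monthly_charges"] + [
    c for c in encoded.columns if c.startswith("contract_")
]
X = encoded[feature_cols]
y = encoded["churned"]

# Standardize features to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

print(X.shape)
print(list(X.columns))
```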

Exercise 4: Encode and Scale Features
Write Code

Starting from the provided DataFrame df:

1. One-hot encode the contract column using pd.get_dummies() with drop_first=True.

2. Create feature matrix X with columns: tenure_months, monthly_charges, and the encoded contract columns. Create target y from the churned column.

3. Scale X using StandardScaler from sklearn.

4. Print the shape of X and the column names.


Step 5: Train a Classification Model

Now for the exciting part — training the model. Logistic Regression is a great starting point for binary classification. It's fast, interpretable, and works well when features have a roughly linear relationship with the outcome.

With only 20 samples, we can't do a proper train/test split without starving the model of data. In a real project with thousands of rows, you'd split 80/20. Here, we'll train on all data and focus on understanding the model's coefficients.
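
A sketch of the training step, assuming X_scaled, y, and feature_cols from Step 4:

```python
from sklearn.linear_model import LogisticRegression

# Fit on all rows; with only 20 samples we skip the train/test split.
model = LogisticRegression()
model.fit(X_scaled, y)
print("Model trained successfully")

# Pair each coefficient with its feature name to see what drives churn.
for name, coef in zip(feature_cols, model.coef_[0]):
    print(f"{name}: {coef:.3f}")

print(f"Training accuracy: {model.score(X_scaled, y):.2f}")
```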

Exercise 5: Train a Logistic Regression Model
Write Code

Using the prepared features, train a LogisticRegression model. Then print:

1. "Model trained successfully"

2. The model coefficients with their feature names

3. The training accuracy using model.score(X_scaled, y)

Format accuracy as: Training accuracy: X.XX


Step 6: Evaluate with a Confusion Matrix

Accuracy alone can be misleading. If 90% of customers don't churn, a model that always predicts "no churn" gets 90% accuracy but is useless. A confusion matrix breaks down predictions into four categories: true positives, true negatives, false positives, and false negatives.
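
A sketch of the evaluation, assuming the fitted model and scaled features from the previous step:

```python
from sklearn.metrics import classification_report, confusion_matrix

predictions = model.predict(X_scaled)

# Rows are actual classes, columns are predicted classes (0 = stayed, 1 = churned).
print("Confusion Matrix:")
print(confusion_matrix(y, predictions))

print("Classification Report:")
print(classification_report(y, predictions))
```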

Exercise 6: Build a Confusion Matrix
Write Code

Using the trained model from the previous step, generate predictions and build a confusion matrix. Print:

1. The confusion matrix using confusion_matrix(y, predictions)

2. The classification report using classification_report(y, predictions)

Label the output: "Confusion Matrix:" and "Classification Report:"


Step 7: Identify At-Risk Customers

The real business value isn't just knowing that churn happens — it's knowing who is most likely to churn and why. We can use the model's predicted probabilities to rank customers by risk and generate actionable insights for the retention team.
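
A sketch of the risk report, assuming df, model, and X_scaled from the earlier steps:

```python
# Probability of the positive class (churned = 1) for each customer.
df["churn_risk"] = model.predict_proba(X_scaled)[:, 1]

at_risk = df.sort_values("churn_risk", ascending=False)

print("=== CHURN RISK REPORT ===")
report_cols = ["customer_id", "tenure_months", "monthly_charges", "contract", "churn_risk"]
# Show the five highest-risk customers, with floats printed to 2 decimal places.
print(at_risk[report_cols].head(5).to_string(index=False, float_format="{:.2f}".format))

high_risk = (df["churn_risk"] > 0.70).sum()
print(f"High risk customers (>70%): {high_risk}")
```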

Exercise 7: Generate a Churn Risk Report
Write Code

Using model.predict_proba(), get the churn probability for each customer. Add it as a churn_risk column to the original DataFrame. Then:

1. Sort by churn_risk descending.

2. Print "=== CHURN RISK REPORT ===" as a header.

3. Print the top 5 at-risk customers showing customer_id, tenure_months, monthly_charges, contract, and churn_risk (formatted to 2 decimal places).

4. Print a summary: "High risk customers (>70%): N" with the count.


Project Complete!

You've built a complete customer churn prediction system. Let's recap what you accomplished:

  • Created a realistic customer dataset with mixed feature types
  • Explored the data to find patterns between churners and loyal customers
  • Engineered new features like tenure buckets and charge ratios
  • Prepared features with one-hot encoding and standardization
  • Trained a Logistic Regression classifier
  • Evaluated performance with a confusion matrix and classification report
  • Generated an actionable churn risk report for the business