Predict Customer Churn: End-to-End Data Analysis (with sklearn)
Losing customers is expensive. Acquiring a new customer is commonly estimated to cost five to seven times more than retaining an existing one. That's why companies invest heavily in churn prediction: figuring out which customers are about to leave so they can intervene before it's too late.
In this project, you'll build a complete churn prediction pipeline from scratch. You'll create a realistic customer dataset, explore it for patterns, engineer new features, train a machine learning model, and identify at-risk customers. By the end, you'll have a working system that a real business could use.
Step 1: Create the Customer Dataset
Every data project starts with data. In a real company, you'd pull this from a database. Here, we'll build a realistic dataset as a pandas DataFrame. Our fictional telecom company tracks each customer's tenure, monthly charges, contract type, and whether they churned.
The dataset needs enough rows to train a model and enough variety in its features to make predictions meaningful. We'll include both numerical features (like monthly charges) and categorical ones (like contract type).
Aim for a churn rate of around 50%. In real datasets it's usually lower (5–20%), but a balanced dataset makes this small example easier to learn from.
Create a pandas DataFrame called customers with at least 20 rows and these columns: customer_id (sequential integers starting at 1), tenure_months (integers), monthly_charges (floats or ints), contract (strings: 'Month', 'Year', or 'TwoYear'), and churned (0 or 1). Print the shape of the DataFrame and the churn rate (the mean of the churned column), formatted as: Churn rate: X.XX.
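Here's one minimal sketch of what that could look like. The values below are illustrative placeholders, not real customer data; any rows matching the schema will work.

```python
import pandas as pd

# Illustrative values only: short-tenure, high-charge customers tend to churn.
customers = pd.DataFrame({
    "customer_id": range(1, 21),
    "tenure_months": [1, 3, 5, 8, 12, 2, 30, 45, 6, 18,
                      24, 60, 4, 9, 36, 2, 15, 48, 7, 11],
    "monthly_charges": [70.0, 85.5, 90.0, 55.0, 60.0, 95.0, 40.0, 35.0, 80.0, 65.0,
                        50.0, 30.0, 88.0, 72.0, 45.0, 92.0, 58.0, 38.0, 77.0, 62.0],
    "contract": ["Month", "Month", "Month", "Year", "Year", "Month", "TwoYear", "TwoYear",
                 "Month", "Year", "Year", "TwoYear", "Month", "Month", "TwoYear",
                 "Month", "Year", "TwoYear", "Month", "Year"],
    "churned": [1, 1, 1, 0, 0, 1, 0, 0, 1, 0,
                0, 0, 1, 1, 0, 1, 0, 0, 1, 1],
})

print(customers.shape)                                   # (20, 5)
print(f"Churn rate: {customers['churned'].mean():.2f}")  # Churn rate: 0.50
```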
Step 2: Explore and Understand the Data
Before building any model, you need to understand your data. What does the average customer look like? Are there differences between customers who stayed and those who left? Exploratory Data Analysis (EDA) answers these questions.
The describe() method gives you statistics for every numerical column at once. Grouping by the churn label reveals how churned customers differ from loyal ones.
Given the DataFrame df (already created for you), perform exploratory analysis. Print: (1) the output of df.describe(), (2) the value counts of the contract column, and (3) the mean tenure_months and monthly_charges grouped by churned. Print a label before each output: "Summary Statistics:", "Contract Counts:", and "Churn Group Means:".
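A possible sketch, assuming df is the DataFrame built in Step 1:

```python
# Summary statistics for every numerical column
print("Summary Statistics:")
print(df.describe())

# How many customers hold each contract type
print("Contract Counts:")
print(df["contract"].value_counts())

# Average tenure and charges, split by churn status
print("Churn Group Means:")
print(df.groupby("churned")[["tenure_months", "monthly_charges"]].mean())
```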
Step 3: Feature Engineering
Raw data rarely gives the best predictions. Feature engineering is the art of creating new columns that capture patterns more clearly. For example, instead of raw tenure in months, we can create buckets like "new" (0–6 months), "growing" (7–24), and "loyal" (25+). This helps the model spot thresholds.
Another useful feature is the average charge per month of tenure. If total charges divided by tenure doesn't match monthly charges, that can indicate plan changes, which is another churn signal. (The DataFrame provided for this step includes a total_charges column.)
Starting from the provided DataFrame df, create two new columns:
1. tenure_bucket — 'new' if tenure_months <= 6, 'growing' if 7–24, 'loyal' if 25+. Use pd.cut() or conditional logic.
2. charge_per_month — total_charges / tenure_months (handle division by zero by replacing inf/NaN with monthly_charges).
Print the value counts of tenure_bucket, then print the first 5 rows showing customer_id, tenure_bucket, and charge_per_month.
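One way to approach it, assuming (as the task implies) that the provided df includes a total_charges column; the pd.cut bins mirror the bucket definitions above:

```python
import numpy as np
import pandas as pd

# Bucket tenure into [0-6], (6-24], and (24, inf) months
df["tenure_bucket"] = pd.cut(
    df["tenure_months"],
    bins=[0, 6, 24, np.inf],
    labels=["new", "growing", "loyal"],
    include_lowest=True,
)

# Average charge per month of tenure; fall back to monthly_charges
# when tenure is zero (division produces inf or NaN)
df["charge_per_month"] = df["total_charges"] / df["tenure_months"]
df["charge_per_month"] = (
    df["charge_per_month"]
    .replace([np.inf, -np.inf], np.nan)
    .fillna(df["monthly_charges"])
)

print(df["tenure_bucket"].value_counts())
print(df[["customer_id", "tenure_bucket", "charge_per_month"]].head())
```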
Step 4: Prepare Features for the Model
Machine learning models work with numbers, not strings. We need to convert categorical columns like "contract" into numerical form. One-hot encoding creates a separate binary column for each category. We also need to separate our features (X) from our target (y).
Starting from the provided DataFrame df:
1. One-hot encode the contract column using pd.get_dummies() with drop_first=True.
2. Create feature matrix X with columns: tenure_months, monthly_charges, and the encoded contract columns. Create target y from the churned column.
3. Scale X using StandardScaler from sklearn.
4. Print the shape of X and the column names.
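A sketch of these steps; the intermediate names (df_encoded, feature_cols) are just illustrative:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# One-hot encode the contract column; drop_first avoids a redundant column
df_encoded = pd.get_dummies(df, columns=["contract"], drop_first=True)

# Separate features (X) from the target (y)
feature_cols = ["tenure_months", "monthly_charges"] + [
    c for c in df_encoded.columns if c.startswith("contract_")
]
X = df_encoded[feature_cols]
y = df_encoded["churned"]

# Standardize features to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X.shape)
print(list(X.columns))
```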
Step 5: Train a Classification Model
Now for the exciting part: training the model. Logistic Regression is a great starting point for binary classification. It's fast, interpretable, and works well when the features have a roughly linear relationship with the log-odds of churning.
With only 20 samples, we can't do a proper train/test split without starving the model of data. In a real project with thousands of rows, you'd split 80/20. Here, we'll train on all data and focus on understanding the model's coefficients.
Using the prepared features, train a LogisticRegression model. Then print:
1. "Model trained successfully"
2. The model coefficients with their feature names
3. The training accuracy using model.score(X_scaled, y)
Format accuracy as: Training accuracy: X.XX
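A minimal sketch, assuming X_scaled, y, and feature_cols from the previous step:

```python
from sklearn.linear_model import LogisticRegression

# Fit on the full (scaled) dataset; no train/test split with so few rows
model = LogisticRegression()
model.fit(X_scaled, y)

print("Model trained successfully")

# Pair each coefficient with the feature it belongs to
for name, coef in zip(feature_cols, model.coef_[0]):
    print(f"{name}: {coef:.3f}")

print(f"Training accuracy: {model.score(X_scaled, y):.2f}")
```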
Step 6: Evaluate with a Confusion Matrix
Accuracy alone can be misleading. If 90% of customers don't churn, a model that always predicts "no churn" gets 90% accuracy but is useless. A confusion matrix breaks down predictions into four categories: true positives, true negatives, false positives, and false negatives.
Using the trained model from the previous step, generate predictions and build a confusion matrix. Print:
1. The confusion matrix using confusion_matrix(y, predictions)
2. The classification report using classification_report(y, predictions)
Label the output: "Confusion Matrix:" and "Classification Report:"
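One way to produce both outputs, reusing model and X_scaled from Step 5:

```python
from sklearn.metrics import confusion_matrix, classification_report

# Predict on the training data (all we have in this small example)
predictions = model.predict(X_scaled)

print("Confusion Matrix:")
print(confusion_matrix(y, predictions))

print("Classification Report:")
print(classification_report(y, predictions))
```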
Step 7: Identify At-Risk Customers
The real business value isn't just knowing that churn happens — it's knowing who is most likely to churn and why. We can use the model's predicted probabilities to rank customers by risk and generate actionable insights for the retention team.
Using model.predict_proba(), get the churn probability for each customer. Add it as a churn_risk column to the original DataFrame. Then:
1. Sort by churn_risk descending.
2. Print "=== CHURN RISK REPORT ===" as a header.
3. Print the top 5 at-risk customers showing customer_id, tenure_months, monthly_charges, contract, and churn_risk (formatted to 2 decimal places).
4. Print a summary: "High risk customers (>70%): N" with the count.
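A sketch of the report, assuming model, X_scaled, and the original df from the earlier steps; predict_proba's second column holds the probability of the positive class (churned = 1):

```python
# Probability of churn for each customer, in the same row order as df
df["churn_risk"] = model.predict_proba(X_scaled)[:, 1]

# Rank customers from highest to lowest churn risk
report = df.sort_values("churn_risk", ascending=False)

print("=== CHURN RISK REPORT ===")
for _, row in report.head(5).iterrows():
    print(f"Customer {row['customer_id']}: tenure={row['tenure_months']}, "
          f"charges={row['monthly_charges']}, contract={row['contract']}, "
          f"risk={row['churn_risk']:.2f}")

high_risk = (df["churn_risk"] > 0.70).sum()
print(f"High risk customers (>70%): {high_risk}")
```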
Project Complete!
You've built a complete customer churn prediction system. Let's recap what you accomplished: