SciPy Statistics: Distributions, Hypothesis Tests, and Correlations

Advanced · 30 min · 6 exercises · 120 XP

A pharmaceutical company develops a new headache medicine. They give it to 100 patients and a placebo to another 100. The treatment group reports slightly fewer headaches. But is that difference real, or just random chance? This is the question statistics answers.

SciPy's stats module gives you the tools to answer these questions with mathematical rigor. You'll learn to describe data with summary statistics, model it with probability distributions, test hypotheses with t-tests and chi-square tests, and measure relationships with correlations.

How Do You Summarize Data with Descriptive Statistics?

Before running any fancy tests, you need to understand your data. Descriptive statistics condense thousands of data points into a handful of numbers that tell you what's typical, how spread out the data is, and whether it's skewed.

Descriptive statistics with SciPy
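
A minimal sketch of these summary statistics, assuming a small made-up `data` array (the values are hypothetical and include one deliberate outlier so the mean and median disagree):

```python
import numpy as np
from scipy import stats

# Hypothetical sample: nine typical values plus one large outlier.
data = np.array([42, 45, 47, 48, 50, 51, 53, 55, 58, 120], dtype=float)

mean = np.mean(data)
median = np.median(data)
std = np.std(data, ddof=1)     # ddof=1 gives the sample standard deviation
iqr = stats.iqr(data)          # 75th percentile minus 25th percentile
skewness = stats.skew(data)    # > 0 means a longer tail on the right
kurt = stats.kurtosis(data)    # excess kurtosis: a normal distribution scores 0

print(f"Mean:     {mean:.1f}")
print(f"Median:   {median:.1f}")
print(f"Std dev:  {std:.1f}")
print(f"IQR:      {iqr:.2f}")
print(f"Skewness: {skewness:.2f}")
print(f"Kurtosis: {kurt:.2f}")
```

The single outlier pulls the mean well above the median, while the median and IQR barely notice it.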

Let's break down what each measurement tells you:

Mean: the average. Sensitive to outliers.

Median: the middle value. Robust to outliers.

Std deviation: how far values typically are from the mean.

IQR (interquartile range): the range of the middle 50% of data.

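Skewness: 0 means symmetric, positive means a longer tail on the right, negative means a longer tail on the left.

Kurtosis: how heavy the tails are. SciPy reports excess kurtosis by default, so a normal distribution scores 0 and a positive value means heavier tails (more extreme values) than a normal distribution.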

What Is the Normal Distribution and Why Does It Matter?

The normal distribution (the "bell curve") is the most important distribution in statistics. Heights, test scores, measurement errors, and many other natural phenomena follow it approximately. It's defined by just two parameters: the mean (center) and the standard deviation (width).

SciPy lets you work with normal distributions programmatically. You can calculate probabilities, find percentiles, and generate random samples.

Working with the normal distribution
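
A minimal sketch of the three operations mentioned above (probabilities, percentiles, random samples) using `scipy.stats.norm`. The exam-score scenario, with mean 70 and standard deviation 10, is a hypothetical example:

```python
from scipy import stats

# Hypothetical example: exam scores with mean 70 and standard deviation 10.
scores = stats.norm(loc=70, scale=10)

p_below_85 = scores.cdf(85)                    # P(X <= 85)
p_above_85 = scores.sf(85)                     # P(X > 85), i.e. 1 - cdf
p_60_to_80 = scores.cdf(80) - scores.cdf(60)   # within one std dev of the mean
pct_90 = scores.ppf(0.90)                      # score at the 90th percentile
samples = scores.rvs(size=5, random_state=0)   # reproducible random draws

print(f"P(score <= 85):      {p_below_85:.3f}")
print(f"P(score > 85):       {p_above_85:.3f}")
print(f"P(60 <= score <= 80): {p_60_to_80:.3f}")
print(f"90th percentile:     {pct_90:.2f}")
print(f"Random samples:      {samples.round(1)}")
```

`cdf` and `ppf` are inverses of each other: `cdf` maps a value to a probability, and `ppf` maps a probability back to a value.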

How Does Hypothesis Testing Work?

Hypothesis testing is like a courtroom trial. You start by assuming the defendant is innocent, which in statistics means assuming there is no effect. Then you look at the evidence (data). If the evidence is strong enough, you reject the assumption of innocence.

In statistics, the "innocent until proven guilty" assumption is called the null hypothesis (H0). The alternative hypothesis (H1) is the claim you are looking for evidence of. The p-value is the probability of seeing data at least as extreme as yours if the null hypothesis were true. A small p-value (conventionally below 0.05) means the data would be surprising under H0, so you reject it.
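
To make the p-value definition concrete, here is a rough simulation sketch. It is a label-shuffling (permutation) approach rather than one of the SciPy tests covered in this tutorial, and the headache counts are made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical headache counts per patient (made-up numbers).
treatment = np.array([3, 2, 4, 1, 2, 3, 2, 1, 3, 2])
placebo = np.array([4, 3, 5, 2, 4, 3, 4, 2, 5, 3])
observed = placebo.mean() - treatment.mean()

# Under H0 the group labels carry no information, so shuffle them
# and count how often chance alone gives a gap at least this large.
pooled = np.concatenate([treatment, placebo])
n_shuffles = 10_000
count = 0
for _ in range(n_shuffles):
    shuffled = rng.permutation(pooled)
    diff = shuffled[10:].mean() - shuffled[:10].mean()
    if abs(diff) >= abs(observed) - 1e-12:   # small tolerance for float ties
        count += 1

p_value = count / n_shuffles
print(f"Observed difference in means: {observed:.2f}")
print(f"Simulated two-sided p-value:  {p_value:.4f}")
```

The fraction of shuffles that match or beat the observed gap is an estimate of the p-value: how often "something at least this extreme" shows up when the null hypothesis is true by construction.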

The Independent Samples T-Test

The t-test compares the means of two groups. "Is there a real difference between group A and group B, or is the difference just random noise?" This is the workhorse of A/B testing.

A/B testing with the t-test
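
A minimal sketch of an independent samples t-test with `stats.ttest_ind()`. The two groups and their values are hypothetical (say, minutes spent on two versions of a page):

```python
import numpy as np
from scipy import stats

# Hypothetical A/B data (made-up values): minutes on page per visitor.
group_a = np.array([12.1, 11.8, 13.0, 12.5, 11.9, 12.7, 12.3, 12.0])
group_b = np.array([12.9, 13.4, 12.8, 13.6, 13.1, 12.7, 13.5, 13.0])

# H0: both groups have the same population mean.
t_stat, p_value = stats.ttest_ind(group_a, group_b)

alpha = 0.05
verdict = "Significant" if p_value < alpha else "Not significant"

print(f"Mean A: {group_a.mean():.3f}, Mean B: {group_b.mean():.3f}")
print(f"t-statistic: {t_stat:.2f}")
print(f"p-value:     {p_value:.4f}")
print(verdict)
```

By default `ttest_ind()` assumes both groups have equal variance. If that assumption is doubtful, passing `equal_var=False` runs Welch's t-test instead.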

When Do You Use the Chi-Square Test?

The t-test works for numerical data (amounts, scores, times). But what about categorical data? "Did more people choose option A or option B?" "Is product preference independent of age group?" That's where the chi-square test comes in.

Chi-square test for categorical data
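
A minimal sketch of a goodness-of-fit test with `scipy.stats.chisquare()`, assuming made-up counts of 150 customers who each picked one of three package designs:

```python
from scipy import stats

# Hypothetical counts: 150 customers each chose one of three designs.
observed = [62, 48, 40]
expected = [50, 50, 50]   # what "no preference" would look like

# Statistic = sum((observed - expected)^2 / expected)
#           = (144 + 4 + 100) / 50 = 4.96
chi2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)

alpha = 0.05
verdict = "Significant" if p_value < alpha else "Not significant"

print(f"Chi-square statistic: {chi2:.2f}")
print(f"p-value:              {p_value:.4f}")
print(verdict)
```

The observed and expected counts must add up to the same total. If `f_exp` is omitted, `chisquare()` assumes all categories are equally likely.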

You can also use chi-square to test whether two categorical variables are independent. For example: "Is there a relationship between gender and product preference?"

Chi-square test of independence
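
A minimal sketch of an independence test with `scipy.stats.chi2_contingency()`. The 2x3 table of counts is hypothetical (two groups of 100 respondents choosing among three products):

```python
import numpy as np
from scipy import stats

# Hypothetical contingency table (made-up counts).
# Rows: two respondent groups. Columns: products A, B, C.
table = np.array([
    [30, 45, 25],
    [35, 30, 35],
])

# H0: the row variable and the column variable are independent.
chi2, p_value, dof, expected = stats.chi2_contingency(table)

print(f"Chi-square statistic: {chi2:.3f}")
print(f"p-value:              {p_value:.4f}")
print(f"Degrees of freedom:   {dof}")
print("Expected counts under independence:")
print(expected)
```

The `expected` array shows the counts you would see if the two variables were unrelated; the test measures how far the observed table departs from it.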

How Do You Measure the Relationship Between Two Variables?

Correlation measures how strongly two variables move together. Do taller people tend to weigh more? Do sales rise when advertising spend rises? A correlation coefficient is a number between -1 and +1 that quantifies the strength and direction of that relationship. It does not by itself show that one variable causes the other.

+1 = perfect positive correlation (as one goes up, the other always goes up)

0 = no linear relationship

-1 = perfect negative correlation (as one goes up, the other always goes down)

SciPy offers several correlation measures. The two most common are Pearson (for linear relationships) and Spearman (for any monotonic relationship, even a non-linear one).

Pearson and Spearman correlations
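
A minimal sketch contrasting `stats.pearsonr()` and `stats.spearmanr()` on synthetic data, where the relationships are constructed by hand so the difference is easy to see:

```python
import numpy as np
from scipy import stats

# Synthetic data built for illustration.
x = np.arange(1, 11, dtype=float)
y_linear = 2 * x + 1    # perfectly linear in x
y_curved = x ** 3       # always increasing, but not linear

r_lin, p_lin = stats.pearsonr(x, y_linear)
r_cur, _ = stats.pearsonr(x, y_curved)
rho_cur, _ = stats.spearmanr(x, y_curved)

print(f"Pearson,  linear data: {r_lin:.3f}")
print(f"Pearson,  curved data: {r_cur:.3f}")
print(f"Spearman, curved data: {rho_cur:.3f}")
```

Spearman works on ranks, so any relationship that always increases scores a perfect 1, while Pearson drops below 1 as soon as the points stop lying on a straight line.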

What Are Confidence Intervals?

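A single number (like a sample mean) is a point estimate. But how precise is it? A confidence interval gives you a range: "We're 95% confident the true mean is between X and Y." More precisely, if you repeated the sampling many times, about 95% of the intervals built this way would contain the true mean. Wider intervals mean a less precise estimate.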

Confidence intervals
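
A minimal sketch of a 95% confidence interval for a mean using `stats.t.interval()`. The `sample` values are hypothetical (made-up page load times in milliseconds):

```python
import numpy as np
from scipy import stats

# Hypothetical sample: page load times in milliseconds (made-up values).
sample = np.array([212, 198, 225, 240, 205, 218, 230, 201, 215, 222], dtype=float)

mean = np.mean(sample)
sem = stats.sem(sample)   # standard error of the mean (uses ddof=1)

# The confidence level is passed positionally because its keyword name
# differs between SciPy versions.
ci_low, ci_high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)

print(f"Sample mean: {mean:.2f}")
print(f"95% CI:      ({ci_low:.2f}, {ci_high:.2f})")
```

The t-distribution is used instead of the normal distribution because the population standard deviation is estimated from a small sample. The interval is centered on the sample mean and narrows as the sample grows.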

Practice Exercises

Compute Descriptive Statistics

Write Code

Given the salaries array, compute and print the mean, median, and standard deviation (with ddof=1), each rounded to 1 decimal place. Also compute and print the IQR (interquartile range) using scipy.stats.iqr(), rounded to 1 decimal place. Print each on a separate line.

Work with the Normal Distribution

Write Code

Battery life for a phone model follows a normal distribution with mean 10 hours and standard deviation 1.5 hours. Using scipy.stats.norm, calculate: (1) the probability that a random phone lasts more than 12 hours (print as percentage rounded to 1 decimal), (2) the battery life at the 90th percentile (rounded to 2 decimal places). Print each on a separate line.

Run an A/B Test with a T-Test

Write Code

You ran an A/B test on email subject lines. Group A (old subject) and Group B (new subject) have different open rates. Run an independent samples t-test using stats.ttest_ind(). Print the t-statistic rounded to 2 decimal places, the p-value rounded to 4 decimal places, and whether the result is 'Significant' or 'Not significant' at alpha=0.05.

Test Survey Results with Chi-Square

Write Code

A survey asked 200 people to choose their favorite season. The results were: Spring=60, Summer=70, Fall=45, Winter=25. If there were no preference, each season would get 50 votes. Use scipy.stats.chisquare() to test whether the distribution is uniform. Print the chi-square statistic (rounded to 1 decimal) and whether preferences are 'Significant' or 'Not significant' at alpha=0.05.

Calculate and Interpret Correlation

Write Code

Calculate the Pearson correlation between ad_spend and revenue. Print the correlation coefficient rounded to 3 decimal places. Then print the interpretation: 'Strong positive' if r > 0.7, 'Moderate positive' if r > 0.4, 'Weak positive' if r > 0, 'Negative' if r <= 0.

Compute a Confidence Interval

Write Code

Given the response_times array, compute the 95% confidence interval for the mean using stats.t.interval(). Print the lower bound and upper bound, each rounded to 2 decimal places, on separate lines.
