SciPy Statistics: Distributions, Hypothesis Tests, and Correlations
A pharmaceutical company develops a new headache medicine. They give it to 100 patients and a placebo to another 100. The treatment group reports slightly fewer headaches. But is that difference real, or just random chance? This is the question statistics answers.
SciPy's stats module gives you the tools to answer these questions with mathematical rigor. You'll learn to describe data with summary statistics, model it with probability distributions, test hypotheses with t-tests and chi-square tests, and measure relationships with correlations.
How Do You Summarize Data with Descriptive Statistics?
Before running any fancy tests, you need to understand your data. Descriptive statistics condense thousands of data points into a handful of numbers that tell you what's typical, how spread out the data is, and whether it's skewed.
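Let's break down what each measurement tells you: the mean and median describe what's typical, the standard deviation and interquartile range (IQR) describe how spread out the data is, and skewness tells you whether one tail of the data is longer than the other.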
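Here is a minimal sketch of those summaries in code. The data array is made up for illustration (ten response times with one outlier), and the variable names are our own.

```python
import numpy as np
from scipy import stats

# Illustrative data (made up for this sketch): ten response times in ms,
# with one large outlier at the end.
data = np.array([12.0, 15.0, 14.0, 10.0, 18.0, 20.0, 13.0, 16.0, 15.0, 47.0])

mean = np.mean(data)            # arithmetic average; pulled toward the outlier
median = np.median(data)        # middle value; barely affected by the outlier
std = np.std(data, ddof=1)      # sample standard deviation (n - 1 denominator)
iqr = stats.iqr(data)           # width of the middle 50% of the data (Q3 - Q1)
skewness = stats.skew(data)     # positive when the right tail is longer

# describe() bundles several summaries into one result object
summary = stats.describe(data)

print(f"mean={mean:.1f}, median={median:.1f}, std={std:.1f}")
print(f"IQR={iqr:.2f}, skewness={skewness:.2f}")
print(f"n={summary.nobs}, min/max={summary.minmax}")
```

Notice that the mean sits above the median here. A gap like that is a quick hint that the data is skewed, which is why it helps to look at both.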
What Is the Normal Distribution and Why Does It Matter?
The normal distribution (the "bell curve") is the most important distribution in statistics. Heights, test scores, measurement errors, and many other natural phenomena follow it at least approximately. It's defined by just two parameters: the mean (center) and standard deviation (width).
SciPy lets you work with normal distributions programmatically. You can calculate probabilities, find percentiles, and generate random samples.
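The three operations mentioned above could look roughly like this with scipy.stats.norm. The height figures (mean 170 cm, standard deviation 10 cm) are assumed values chosen for illustration, not real measurements.

```python
from scipy import stats

# Hypothetical model: adult heights ~ Normal(mean=170 cm, sd=10 cm)
heights = stats.norm(loc=170, scale=10)

# Probabilities
p_below_180 = heights.cdf(180)   # P(height <= 180)
p_above_190 = heights.sf(190)    # P(height > 190); sf is 1 - cdf

# Percentiles: ppf is the inverse of cdf
p95 = heights.ppf(0.95)          # height at the 95th percentile

# Random samples (random_state makes the draw reproducible)
sample = heights.rvs(size=1000, random_state=0)

print(f"P(height <= 180) = {p_below_180:.4f}")
print(f"P(height > 190)  = {p_above_190:.4f}")
print(f"95th percentile  = {p95:.2f} cm")
print(f"sample mean      = {sample.mean():.2f} cm")
```

Calling stats.norm(loc=..., scale=...) creates a "frozen" distribution, so you set the parameters once and reuse the same object for every calculation.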
How Does Hypothesis Testing Work?
Hypothesis testing is like a courtroom trial. You start by presuming the defendant is innocent, which in statistics means assuming there is no effect. Then you look at the evidence (data). If the evidence is strong enough, you reject the presumption of innocence.
In statistics, the "innocent until proven guilty" assumption is called the null hypothesis (H0). The alternative hypothesis (H1) is the claim you're looking for evidence of. The p-value is the probability of seeing your data (or something more extreme) if the null hypothesis were true. When the p-value falls below a threshold chosen in advance (commonly alpha = 0.05), you reject the null hypothesis.
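To make the p-value concrete, here is a small sketch using a coin-flip experiment. The numbers (60 heads in 100 flips) are hypothetical, and computing the p-value by hand from the binomial distribution is just one way to illustrate the idea.

```python
from scipy import stats

# Hypothetical experiment: 60 heads in 100 coin flips.
n_flips, n_heads = 100, 60

# H0: the coin is fair (probability of heads = 0.5).
# P(X >= 60 | H0): sf(k) gives P(X > k), so pass n_heads - 1.
p_upper = stats.binom.sf(n_heads - 1, n_flips, 0.5)

# Two-sided p-value: results at least as extreme in either direction.
# Doubling is valid here because Binomial(n, 0.5) is symmetric.
p_value = 2 * p_upper

alpha = 0.05
reject_null = p_value < alpha

print(f"p-value = {p_value:.4f}")
print("Reject H0" if reject_null else "Fail to reject H0")
```

Failing to reject H0 does not prove the coin is fair. It only means this evidence was not strong enough at the chosen alpha.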
The Independent Samples T-Test
The t-test compares the means of two groups. "Is there a real difference between group A and group B, or is the difference just random noise?" This is the workhorse of A/B testing.
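A minimal sketch with stats.ttest_ind() might look like the following. The two small arrays are invented for illustration (monthly headache counts for a treatment and a placebo group), and passing equal_var=False is our choice here, not a requirement.

```python
import numpy as np
from scipy import stats

# Made-up data for illustration: headaches per month for each patient
treatment = np.array([3, 4, 2, 5, 3, 4, 2, 3])
placebo = np.array([5, 6, 4, 7, 5, 6, 5, 6])

# equal_var=False runs Welch's t-test, which does not assume the two
# groups share the same variance. The default (equal_var=True) is the
# classic Student's t-test.
t_stat, p_value = stats.ttest_ind(treatment, placebo, equal_var=False)

alpha = 0.05
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
print("Significant" if p_value < alpha else "Not significant")
```

The sign of the t-statistic follows the order of the arguments: it is negative here because the first group (treatment) has the lower mean.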
When Do You Use the Chi-Square Test?
The t-test works for numerical data (amounts, scores, times). But what about categorical data? "Did more people choose option A or option B?" "Is product preference independent of age group?" That's where the chi-square test comes in.
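For the "option A or option B" kind of question, a goodness-of-fit test with stats.chisquare() is one possibility. The counts below are hypothetical: 90 people choosing among three options, compared against an even split.

```python
from scipy import stats

# Hypothetical counts: how many of 90 people chose options A, B, and C
observed = [40, 30, 20]

# Under "no preference", each option would get 90 / 3 = 30 votes.
# If f_exp is omitted, chisquare() assumes equal expected counts anyway.
expected = [30, 30, 30]

chi2_stat, p_value = stats.chisquare(f_obs=observed, f_exp=expected)

print(f"chi-square = {chi2_stat:.2f}, p = {p_value:.4f}")
```

The observed and expected counts need to add up to the same total, so it's worth checking that before running the test.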
You can also use chi-square to test whether two categorical variables are independent. For example: "Is there a relationship between gender and product preference?"
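A test of independence can be sketched with stats.chi2_contingency(), which takes a table of counts. The table here is invented for illustration: two age groups (rows) and three product preferences (columns).

```python
import numpy as np
from scipy import stats

# Hypothetical contingency table of counts
#                  Product A  Product B  Product C
table = np.array([[40,        30,        30],    # under 30
                  [20,        30,        50]])   # 30 and over

# Returns the statistic, the p-value, the degrees of freedom, and the
# counts you would expect if the two variables were independent.
chi2_stat, p_value, dof, expected = stats.chi2_contingency(table)

print(f"chi-square = {chi2_stat:.2f}, p = {p_value:.4f}, dof = {dof}")
print("Expected counts under independence:")
print(expected)
```

A small p-value suggests that preference and age group are related in this sample. For 2x2 tables, chi2_contingency() applies Yates' continuity correction by default, which you can turn off with correction=False.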
How Do You Measure the Relationship Between Two Variables?
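Correlation measures how strongly two variables move together. Do taller people tend to weigh more? Do higher advertising budgets go along with higher sales? Correlation gives you a number between -1 and +1 that quantifies the strength and direction of the relationship. On its own, it does not tell you whether one variable causes the other.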
Two of the most commonly used types in SciPy are Pearson (for linear relationships) and Spearman (for any monotonic relationship, even a non-linear one). SciPy also provides Kendall's tau as another rank-based option.
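The difference between Pearson and Spearman shows up clearly on data that is monotonic but not linear. This sketch uses a synthetic cubic relationship, chosen only to make the contrast visible.

```python
import numpy as np
from scipy import stats

# Synthetic data: y always increases with x, but along a curve
x = np.arange(1, 11)
y = x ** 3

# Pearson measures how well a straight line describes the relationship
r_pearson, p_pearson = stats.pearsonr(x, y)

# Spearman works on ranks, so it only asks whether y rises when x rises
r_spearman, p_spearman = stats.spearmanr(x, y)

print(f"Pearson r    = {r_pearson:.3f}")
print(f"Spearman rho = {r_spearman:.3f}")
```

Spearman reports a perfect relationship because the ranks of x and y match exactly, while Pearson comes out lower because the points do not fall on a straight line.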
What Are Confidence Intervals?
A single number (like a sample mean) is a point estimate. But how precise is it? A confidence interval gives you a range: "We're 95% confident the true mean is between X and Y." Wider intervals mean a less precise estimate, and they typically result from small samples or highly variable data.
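One way to compute such an interval for a mean is with stats.t.interval(), sketched below. The eight measurements are made up for illustration. The confidence level is passed positionally because its keyword name has differed between SciPy versions.

```python
import numpy as np
from scipy import stats

# Made-up sample for illustration: eight measurements
data = np.array([4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0])

n = len(data)
mean = np.mean(data)
sem = stats.sem(data)   # standard error of the mean (uses ddof=1)

# 95% interval from the t-distribution with n - 1 degrees of freedom,
# centered on the sample mean and scaled by the standard error
lower, upper = stats.t.interval(0.95, df=n - 1, loc=mean, scale=sem)

print(f"mean = {mean:.2f}")
print(f"95% CI = ({lower:.2f}, {upper:.2f})")
```

The t-distribution is used instead of the normal distribution because the standard deviation is estimated from the sample itself, which matters most when the sample is small.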
Practice Exercises
Given the salaries array, compute and print the mean, median, and standard deviation (with ddof=1), each rounded to 1 decimal place. Also compute and print the IQR (interquartile range) using scipy.stats.iqr(), rounded to 1 decimal place. Print each on a separate line.
Battery life for a phone model follows a normal distribution with mean 10 hours and standard deviation 1.5 hours. Using scipy.stats.norm, calculate: (1) the probability that a random phone lasts more than 12 hours (print as percentage rounded to 1 decimal), (2) the battery life at the 90th percentile (rounded to 2 decimal places). Print each on a separate line.
You ran an A/B test on email subject lines. Group A (old subject) and Group B (new subject) have different open rates. Run an independent samples t-test using stats.ttest_ind(). Print the t-statistic rounded to 2 decimal places, the p-value rounded to 4 decimal places, and whether the result is 'Significant' or 'Not significant' at alpha=0.05.
A survey asked 200 people to choose their favorite season. The results were: Spring=60, Summer=70, Fall=45, Winter=25. If there were no preference, each season would get 50 votes. Use scipy.stats.chisquare() to test whether the distribution is uniform. Print the chi-square statistic (rounded to 1 decimal) and whether preferences are 'Significant' or 'Not significant' at alpha=0.05.
Calculate the Pearson correlation between ad_spend and revenue. Print the correlation coefficient rounded to 3 decimal places. Then print the interpretation: 'Strong positive' if r > 0.7, 'Moderate positive' if r > 0.4, 'Weak positive' if r > 0, 'Negative' if r <= 0.
Given the response_times array, compute the 95% confidence interval for the mean using stats.t.interval(). Print the lower bound and upper bound, each rounded to 2 decimal places, on separate lines.