In the previous article, we talked about hypothesis testing using Welch's t-test on two independent samples of data. So what happens if we want to know the statistical significance for $k$ groups of data?
This is where the analysis of variance technique, or ANOVA, is useful.
ANOVA Assumptions
We'll be looking at SAT scores for five different districts in New York City. Specifically, we'll be using "scores.csv" from Kaggle. First let's get the assumptions out of the way:
- The dependent variable (SAT scores) should be continuous.
- The independent variables (districts) should be two or more categorical groups.
- There must be different participants in each group with no participant being in more than one group. In our case, each school cannot be in more than one district.
- The dependent variable should be approximately normally distributed for each category.
- Variances of each group are approximately equal.
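The last two assumptions can be checked numerically before running ANOVA. A minimal sketch using scipy's Shapiro–Wilk test (normality within each group) and Levene's test (equality of variances) — the group arrays below are synthetic stand-ins, not the SAT data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-ins for per-group score samples (illustrative only)
groups = [rng.normal(loc=1200, scale=150, size=50) for _ in range(5)]

# Shapiro-Wilk: null hypothesis is that the sample is normally distributed
for i, g in enumerate(groups):
    stat, pval = stats.shapiro(g)
    print('group {}: shapiro p = {:.3f}'.format(i, pval))

# Levene: null hypothesis is that all groups have equal variances
stat, p = stats.levene(*groups)
print('levene p = {:.3f}'.format(p))
```

A small p-value in either test is evidence that the corresponding assumption is violated for that group or for the set of groups.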
Data Exploration
Let's begin by taking a look at what our data looks like.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
%matplotlib inline
data = pd.read_csv("scores.csv")
data.head()
data['Borough'].value_counts()
Creating New Columns
There is no total score column, so we'll have to create it. In addition, we'll have to find the mean score of each district across all schools.
data['total_score'] = data['Average Score (SAT Reading)'] + \
data['Average Score (SAT Math)'] + \
data['Average Score (SAT Writing)']
data = data[['Borough', 'total_score']].dropna()
x = ['Brooklyn', 'Bronx', 'Manhattan', 'Queens', 'Staten Island']
district_dict = {}
#Assigns each test score series to a dictionary key
for district in x:
    district_dict[district] = data[data['Borough'] == district]['total_score']
y = []
yerror = []
#Assigns the mean score and 95% confidence limit to each district
for district in x:
    y.append(district_dict[district].mean())
    yerror.append(1.96*district_dict[district].std()/np.sqrt(district_dict[district].shape[0]))
    print(district + '_std : {}'.format(district_dict[district].std()))
sns.set(font_scale=1.8)
fig = plt.figure(figsize=(10,5))
ax = sns.barplot(x=x, y=y, yerr=yerror)
ax.set_ylabel('Average Total SAT Score')
plt.show()
From our data exploration, we can see that the average SAT scores are quite different for each district. We are interested in knowing whether this is caused by random variation in the data, or whether there is an underlying cause. Since we have five different groups, we cannot use the t-test. Also note that the standard deviations of the groups are very different, so we've violated one of our assumptions. However, we are going to use the 1-way ANOVA test anyway, just to understand the concepts.
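The unequal-variance observation can be made formal with Levene's test, whose null hypothesis is that all group variances are equal. A hedged sketch on synthetic groups with deliberately different spreads (illustrative only, not the actual SAT data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic groups with deliberately unequal spread (illustrative,
# not the per-district SAT scores)
a = rng.normal(1300, 50, 100)
b = rng.normal(1300, 150, 100)
c = rng.normal(1300, 250, 100)

stat, p = stats.levene(a, b, c)
print('Levene W = {:.2f}, p = {:.4f}'.format(stat, p))
# A small p-value suggests the equal-variance assumption is violated
```

Run on `district_dict`'s series instead of the synthetic arrays, the same call would quantify how badly the assumption is violated for our data.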
The Null and Alternative Hypothesis
There are no significant differences between the groups' mean SAT scores.
$ H_0 : \mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5 $
There is a significant difference between the groups' mean SAT scores.
$ H_a : \mu_i \ne \mu_j $
Where $\mu_i$ and $\mu_j$ can be the means of any two groups. If at least one group differs significantly from another group, the null hypothesis will be rejected.
1-way ANOVA
Similar to the t-test, we can calculate a score for the ANOVA. Then we can look up the score in the F-distribution and obtain a p-value.
The F-statistic is defined as follows:
$$ F = \frac{MS_{b}} {MS_w} $$
$$ MS_{b} = \frac{SS_{b}} {K-1}$$
$$ MS_{w} = \frac{SS_{w}} {N-K}$$
$$ SS_{b} = \sum{n_k(\bar{x_{k}}-\bar{x_{G}})^2} $$
$$ SS_{w} = \sum(x_{i}-\bar{x_{k}})^2 $$
Where $MS_{b}$ is the estimated variance between groups and $MS_{w}$ is the estimated variance within groups, $\bar{x_{k}}$ is the mean within each group, $n_k$ is the sample size for each group, ${x_i}$ is an individual data point, and $\bar{x_{G}}$ is the total mean. The sum in $SS_{b}$ runs over the $K$ groups, while the sum in $SS_{w}$ runs over all $N$ data points.
This is quite a lot of math; fortunately, scipy has a function that plugs in all the values for us. The documentation for calculating 1-way ANOVA using scipy is here.
stats.f_oneway(
district_dict['Brooklyn'], district_dict['Bronx'], \
district_dict['Manhattan'], district_dict['Queens'], \
district_dict['Staten Island']
)
The resulting p-value was less than 0.05. We can reject the null hypothesis and conclude that there is a significant difference between the SAT scores of the districts. Even though we've obtained a very low p-value, we cannot make any assumptions about the magnitude of the effect. Also, scipy does not report $SS_b$ and $SS_w$, so it is probably better to write our own code.
districts = ['Brooklyn', 'Bronx', 'Manhattan', 'Queens', 'Staten Island']
ss_b = 0
for d in districts:
    ss_b += district_dict[d].shape[0] * \
        np.sum((district_dict[d].mean() - data['total_score'].mean())**2)
ss_w = 0
for d in districts:
    ss_w += np.sum((district_dict[d] - district_dict[d].mean())**2)
msb = ss_b/4              # K - 1 = 4 degrees of freedom (K = 5 groups)
msw = ss_w/(len(data)-5)  # N - K degrees of freedom
f = msb/msw
print('F_statistic: {}'.format(f))
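With the F statistic in hand, the p-value follows from the survival function of the F distribution with $K-1$ and $N-K$ degrees of freedom. A sketch with illustrative numbers (the F value and sample size here are made up, not the ones computed above):

```python
from scipy import stats

K = 5           # number of groups
N = 375         # total sample size (illustrative)
f_stat = 10.0   # illustrative F value, not the one computed above

# P(F >= f_stat) under the null hypothesis
p_value = stats.f.sf(f_stat, K - 1, N - K)
print('p-value: {:.2e}'.format(p_value))
```

Applying `stats.f.sf` to our own `f`, `K = 5`, and `N = len(data)` should reproduce the p-value reported by `stats.f_oneway`.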
The Effect Size
We can calculate the magnitude of the effect to determine how large the difference is. One of the measures we can use is Eta-squared.
$$ \eta^2 = \frac{SS_{b}} {SS_{total}}$$
$$ SS_{b} = \sum{n_k(\bar{x_{k}}-\bar{x_{G}})^2} $$
$$ SS_{total} = \sum(x_{i}-\bar{x_{G}})^2 $$
ss_t = np.sum((data['total_score']-data['total_score'].mean())**2)
eta_squared = ss_b/ss_t
print('eta_squared: {}'.format(eta_squared))
The general rules of thumb given by Cohen and Miles & Shevlin (2001) for analyzing eta-squared, $\eta^2$:
- Small effect: $ 0.01 $
- Medium effect: $ 0.06 $
- Large effect: $ 0.14 $
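These cutoffs can be wrapped in a small convenience helper (a sketch; the thresholds are just the Cohen / Miles & Shevlin rules of thumb listed above):

```python
def eta_squared_label(eta_sq):
    """Classify an eta-squared value using common rules of thumb."""
    if eta_sq >= 0.14:
        return 'large'
    if eta_sq >= 0.06:
        return 'medium'
    if eta_sq >= 0.01:
        return 'small'
    return 'negligible'

print(eta_squared_label(0.08))  # prints 'medium'
```

Passing our computed `eta_squared` to this helper gives the qualitative label directly.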
From our calculations, the effect size for this ANOVA test would be "Medium". For a full write up on effect sizes click here.
The files used for this article can be found in my GitHub repository.