So far, we've been comparing data with at least one one numerical(continuous) column and one categorical(nominal) column. So what happens if we want to determine the statistical significance of two independent categorical groups of data?
This is where the Chi-squared test for independence is useful.
Chi-Squared Test Assumptions¶
We'll be looking at data from the census in 1994. Specifically, we are interested in the relationship between 'sex' and 'hours-per-week' worked. Click here for the documentation and citation of the data. First let's get the assumptions out of the way:
- There must be different participants in each group with no participant being in more than one group. In our case, each individual can only have one 'sex' and can not be in multiple workhour categories.
- Random samples from the population. In our case, the census is a good representation of the population.
Data Exploration¶
For the sake of this example, we'll convert the numerical column 'hours-per-week' into a categorical column using pandas. Then we'll assign 'sex' and 'hours_per_week_categories' to a new dataframe.
import matplotlib.pyplot as plt
import numpy as np
import math
import seaborn as sns
import pandas as pd
%matplotlib inline
cols = ['age', 'workclass', 'fnlwg', 'education', 'education-num',
'marital-status','occupation','relationship', 'race','sex',
'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']
data = pd.read_csv('census.csv', names=cols)
#Create a column for work hour categories.
def process_hours(df):
cut_points = [0,9,19,29,39,49,1000]
label_names = ["0-9","10-19","20-29","30-39","40-49","50+"]
df["hours_per_week_categories"] = pd.cut(df["hours-per-week"],
cut_points,labels=label_names)
return df
data = process_hours(data)
workhour_by_sex = data[['sex', 'hours_per_week_categories']]
workhour_by_sex.head()
workhour_by_sex['sex'].value_counts()
workhour_by_sex['hours_per_week_categories'].value_counts()
The Null and Alternate Hypotheses¶
Recall that we are interested in knowing if there is a relationship between 'sex' and 'hours_per_week_categories'. In order to do so, we would have to use the Chi-squared test. But first, let's state our null hypothesis and the alternative hypothesis.
$ H_0 : \text{There is no statistically significant relationship between sex and the # of hours per week worked.} $
$ H_a : \text{There is a statistically significant relationship between sex and the # of hours per week worked.} $
Constructing the Contingency Table¶
The next step is to format the data into a frequency count table. This is called a Contingency Table, we can accomplish this by using the pd.crosstab() function in pandas.
contingency_table = pd.crosstab(
workhour_by_sex['sex'],
workhour_by_sex['hours_per_week_categories'],
margins = True
)
contingency_table
Each cell in this table represents a frequency count. For example, the intersection of the 'Male' row and the '10-19' column of the table would represent the number of males who works 10-19 hours per week from our sample data set. The intersection of the 'All' row and the '50+' column would represent the total number of people who works 50+ hours a week.
Visualizing the Contingency Table with a Stacked Bar Chart¶
#Assigns the frequency values
malecount = contingency_table.iloc[0][0:6].values
femalecount = contingency_table.iloc[1][0:6].values
#Plots the bar chart
fig = plt.figure(figsize=(10, 5))
sns.set(font_scale=1.8)
categories = ["0-9","10-19","20-29","30-39","40-49","50+"]
p1 = plt.bar(categories, malecount, 0.55, color='#d62728')
p2 = plt.bar(categories, femalecount, 0.55, bottom=malecount)
plt.legend((p2[0], p1[0]), ('Male', 'Female'))
plt.xlabel('Hours per Week Worked')
plt.ylabel('Count')
plt.show()
The chart above visualizes our sample data from the census. If there is truly no relationship between sex and the number of hours per week worked. Then the data would show an even ratio split between 'Male' and 'Female' for each time category. For example, if 5% of the females worked 50+ hours, we would expect the same percentage for males who worked 50+ hours.
The Chi-Squared Test for Independence - Calculation with Numpy¶
In order to determine whether we accept or reject the null hypothesis. We have to compute p-value similar to the welch's t-test and ANOVA. For testing with two categorical variables, we will use the Chi-squared test.
$$ X^2 = \frac{(observed - expected)^2} {(expected)}$$
Where $X^2$ is the test statistic, $observecd$ are values we have in the contingency table, $expected$ are values we would expect assuming the null hypothesis is true. Theoretically speaking, if all the expected values are equal to the observed values, then the $X^2$ statistic will be 0. As a result, the null hypothesis will be retained.
First, let's put the observed values into a one dimensional array, reading the contingency table from left to right then top to bottom.
f_obs = np.append(contingency_table.iloc[0][0:6].values, contingency_table.iloc[1][0:6].values)
f_obs
Next, we need to calculate the expected values. The expected values assume that null hypothesis is true. We would need to calculate values if there is an equal percentage of males and females for each category. For example, this is how we would calculate the expected value for the top left cell:
$$ \text{Expected # of Females in the '0-9' Category} = \frac{\text{Total # of Females} * \text{Number of People in the '0-9' Category}} {\text{Total # of People}}$$
row_sums = contingency_table.iloc[0:2,6].values
row_sums
col_sums = contingency_table.iloc[2,0:6].values
col_sums
total = contingency_table.loc['All', 'All']
f_expected = []
for j in range(2):
for i in col_sums:
f_expected.append(i*row_sums[j]/total)
f_expected
Now that we have all our observed and expected values, we can just plug everything into the Chi-squared test formula.
chi_squared_statistic = ((f_obs - f_expected)**2/f_expected).sum()
print('Chi-squared Statistic: {}'.format(chi_squared_statistic))
Degrees of Freedom¶
Similar to the Welch's t-test, we would have to calculate the degrees of freedom before we can determine the p-value.
$$ DoF = {(\text{Number of rows} - 1)}*{(\text{Number of columns} - 1)}$$
dof = (len(row_sums)-1)*(len(col_sums)-1)
print("Degrees of Freedom: {}".format(dof))
Now we are ready to look into the Chi-squared distribution table. The cut off for a p-value of 0.05 was 11.070. Our $X^2$ statistic was so large that the p-value is approximately zero. So we have evidence against the null hypothesis.
The Chi-Squared Test for Independence - Using Scipy¶
Now that we've gone through all the calculations, it is time to look for shortcuts. Scipy has a function that plugs in all the values for us. Click here for the documentation.
All we need to do is format the observed values into a two-dimensional array and plug it into the function.
f_obs = np.array([contingency_table.iloc[0][0:6].values,
contingency_table.iloc[1][0:6].values])
f_obs
from scipy import stats
stats.chi2_contingency(f_obs)[0:3]
The results were exactly the same as our calculations with Numpy. The $X^2$ = ~2287, p-value = ~0 and degrees of freedom = 5.
Conclusions¶
With a p-value < 0.05 , we can reject the null hypothesis. There is definitely some sort of relationship between 'sex' and the 'hours-per-week' column. We don't know what this relationship is, but we do know that these two variables are not independent of each other.
The files used for this article can be found in my GitHub repository.
Comments
comments powered by Disqus