In this project, we'll look at 20,000 rows of the Jeopardy dataset in "jeopardy.csv". We want to see whether there are patterns in the questions asked that could give us a small edge in winning.
First, we'll have to tidy up the data.
import pandas as pd
import matplotlib.pyplot as plt
jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head(5)
print(jeopardy.columns)
Looks like there is stray whitespace in the column names. We can fix this pretty easily by assigning clean names to the .columns attribute.
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value',
                    'Question', 'Answer']
jeopardy.columns
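If we didn't want to hard-code the names, the whitespace could also be stripped programmatically. A minimal sketch (the padded names below are made up to mimic the raw header): `.str.strip()` on the column index removes leading and trailing spaces from every name at once.

```python
import pandas as pd

# Toy frame with padded column names, mimicking the raw CSV header.
df = pd.DataFrame(columns=['Show Number', ' Air Date', ' Round '])

# .str.strip() works elementwise on the column Index.
df.columns = df.columns.str.strip()
print(list(df.columns))  # ['Show Number', 'Air Date', 'Round']
```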
Next, let's make all the strings in the question and answer columns lowercase. We can write a function and then use the .apply() method.
We also want to remove all punctuation; the goal is to reduce the "Question" and "Answer" columns down to just words.
import re
def lowercase_no_punct(string):
    lower = string.lower()
    punremoved = re.sub(r'[^A-Za-z0-9\s]', '', lower)
    return punremoved
jeopardy['clean_question'] = jeopardy['Question'].apply(lowercase_no_punct)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(lowercase_no_punct)
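A quick sanity check on a made-up clue (not from the dataset) shows what the cleaning function produces; the definition is repeated here so the snippet stands alone.

```python
import re

def lowercase_no_punct(string):
    lower = string.lower()
    # keep only letters, digits, and whitespace
    return re.sub(r'[^A-Za-z0-9\s]', '', lower)

# hypothetical clue, not from the dataset
print(lowercase_no_punct("What's the Capital of France?"))
# whats the capital of france
```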
The "Value" column is usually a dollar sign followed by a number, but it is currently stored as a string. We should convert it to an integer and remove the dollar sign.
def punremovandtoint(string):
    punremoved = re.sub(r'[^A-Za-z0-9\s]', '', string)
    try:
        integer = int(punremoved)
    except Exception:
        integer = 0
    return integer
jeopardy['clean_values'] = jeopardy['Value'].apply(punremovandtoint)
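For example, on made-up inputs (the definition is repeated so the snippet stands alone): the regex strips the dollar sign and comma, and any string that still isn't numeric falls through the try/except to 0.

```python
import re

def punremovandtoint(string):
    punremoved = re.sub(r'[^A-Za-z0-9\s]', '', string)
    try:
        integer = int(punremoved)
    except Exception:
        integer = 0
    return integer

print(punremovandtoint('$2,000'))  # 2000
print(punremovandtoint('None'))    # 0 (non-numeric strings fall back to 0)
```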
We'll also have to convert the values in the "Air Date" column into datetime objects.
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])
Let's see what our table currently looks like.
jeopardy.head()
Now that the data is cleaned, we can start analyzing it.
Suppose we are interested in how many words in the answer also occur in the question. We'll create a function and use the .apply() method to create a new column. This column will hold the ratio of matching question words to total answer words.
def cleaner(series):
    split_answer = series['clean_answer'].split(' ')
    split_question = series['clean_question'].split(' ')
    match_count = 0
    if "the" in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)
jeopardy['answer_in_question'] = jeopardy.apply(cleaner, axis=1)
jeopardy['answer_in_question'].mean()
It looks like the answer only appears in the question about 6% of the time, so deducing the answer from the question's wording is not a reliable strategy.
Next, we'll look at the words used in the question column. We can write a loop to see how often longer words repeat across questions.
question_overlap = []
# a Python set is an unordered collection of unique items
terms_used = set()
for idx, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(" ")
    match_count = 0
    newlist = []
    for word in split_question:
        if len(word) >= 6:
            newlist.append(word)
    for word in newlist:
        if word in terms_used:
            match_count += 1
    for word in newlist:
        terms_used.add(word)
    if len(newlist) > 0:
        match_count = match_count / len(newlist)
    question_overlap.append(match_count)
jeopardy['question_overlap'] = question_overlap
jeopardy['question_overlap'].mean()
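To make the mechanics concrete, here is the same overlap logic run on two invented questions. Only words of six or more characters are compared, and the set makes each membership check fast; terms are added to the set only after the current question has been scored, just as in the loop above.

```python
terms_used = set()
overlaps = []
questions = ["president lincoln delivered gettysburg address",
             "lincoln delivered famous speech gettysburg"]
for q in questions:
    # keep only longer words, as in the loop above
    words = [w for w in q.split(" ") if len(w) >= 6]
    matches = sum(1 for w in words if w in terms_used)
    terms_used.update(words)
    overlaps.append(matches / len(words) if words else 0)
print(overlaps)  # [0.0, 0.6] -- 3 of the second question's 5 long words repeat
```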
There is a 69% overlap of words between new questions and old ones. However, the same words can be combined into different phrases with very different meanings, so this large overlap is not especially significant on its own.
Let's take a look at the number of questions worth more than $800. Maybe it is a good idea to only study high-value questions.
def highvalue(row):
    value = 0
    if row['clean_values'] > 800:
        value = 1
    return value
jeopardy['high_value'] = jeopardy.apply(highvalue, axis=1)
high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]
print(high_value_count)
print(low_value_count)
It doesn't look like there are that many high-value questions in the dataset.
We can create a function that takes in a word and returns the number of high- and low-value questions that word appears in. Maybe this will help us study.
def highlowcounts(word):
    low_count = 0
    high_count = 0
    for idx, row in jeopardy.iterrows():
        if word in row['clean_question'].split(' '):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count
observed_expected = []
comparison_terms = list(terms_used)[:5]
comparison_terms
for term in comparison_terms:
    observed_expected.append(highlowcounts(term))
observed_expected
We can use the chi-squared test to see whether the high/low counts for the terms in "comparison_terms" differ from what we'd expect by chance, i.e. whether they are statistically significant.
from scipy.stats import chisquare
import numpy as np

chi_squared = []
for lists in observed_expected:
    total = sum(lists)
    total_prop = total / jeopardy.shape[0]
    expected_high = total_prop * high_value_count
    expected_low = total_prop * low_value_count
    observed = np.array([lists[0], lists[1]])
    expected = np.array([expected_high, expected_low])
    chi_squared.append(chisquare(observed, expected))
chi_squared
None of the p-values are below 0.05, so none of these terms show a statistically significant difference between high- and low-value questions.
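As a worked illustration of the expected-count logic (all counts below are invented, not taken from the dataset): the term's total appearances are spread across the high and low groups in proportion to each group's size, and chisquare compares the observed split against that expectation.

```python
import numpy as np
from scipy.stats import chisquare

# hypothetical term: appears in 3 high-value and 7 low-value questions
high_count, low_count = 3, 7
# assumed totals of high- and low-value questions in a 20,000-row table
total_high, total_low = 5000, 15000

total = high_count + low_count
prop = total / (total_high + total_low)   # 0.0005
expected = np.array([prop * total_high,   # 2.5
                     prop * total_low])   # 7.5
stat, p = chisquare(np.array([high_count, low_count]), expected)
print(stat, p)  # a small statistic; p well above 0.05, so no significance
```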
Learning Summary
Python concepts explored: pandas, matplotlib, data cleaning, string manipulation, chi-squared test, regex, try/except
Python functions and methods used: .columns, .lower(), .sub(), .apply(), sum(), .array(), .split(), .shape, .mean(), .iterrows(), .remove(), .add(), .append()
The files used for this project can be found in my GitHub repository.