In this project, we'll look at 20,000 rows of the Jeopardy dataset in "jeopardy.csv". We want to see whether there are patterns in the questions that could give us a small edge on the show.
First, we'll have to tidy up the data.
import pandas as pd
import matplotlib.pyplot as plt

jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head(5)
| ||Show Number||Air Date||Round||Category||Value||Question||Answer|
|0||4680||2004-12-31||Jeopardy!||HISTORY||$200||For the last 8 years of his life, Galileo was ...||Copernicus|
|1||4680||2004-12-31||Jeopardy!||ESPN's TOP 10 ALL-TIME ATHLETES||$200||No. 2: 1912 Olympian; football star at Carlisl...||Jim Thorpe|
|2||4680||2004-12-31||Jeopardy!||EVERYBODY TALKS ABOUT IT...||$200||The city of Yuma in this state has a record av...||Arizona|
|3||4680||2004-12-31||Jeopardy!||THE COMPANY LINE||$200||In 1963, live on "The Art Linkletter Show", th...||McDonald's|
|4||4680||2004-12-31||Jeopardy!||EPITAPHS & TRIBUTES||$200||Signer of the Dec. of Indep., framer of the Co...||John Adams|
Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value', ' Question', ' Answer'], dtype='object')
Most of the column names have a leading space. We can fix this easily by assigning clean names to the .columns attribute.
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question', 'Answer']
jeopardy.columns
Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question', 'Answer'], dtype='object')
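As an alternative to retyping every name, the whitespace could be stripped programmatically. The sketch below works on a plain list copied from the output above, so it runs without the dataset; the variable names are illustrative only.

```python
# Column names as reported before the fix (copied from the output above).
raw_columns = ['Show Number', ' Air Date', ' Round', ' Category',
               ' Value', ' Question', ' Answer']

# str.strip() drops leading and trailing whitespace from each name.
clean_columns = [name.strip() for name in raw_columns]
print(clean_columns)

# With the DataFrame loaded, the same idea would be:
# jeopardy.columns = [name.strip() for name in jeopardy.columns]
```

This avoids typos when a dataset has many columns.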
Next, let's make all the strings in the "Question" and "Answer" columns lowercase. We can write a function and then use the .apply() method.
We also want to remove all punctuation; the goal is to reduce the "Question" and "Answer" columns to just words.
import re

def lowercase_no_punct(string):
    # Lowercase the text, then drop everything except letters, digits, and whitespace.
    lower = string.lower()
    punremoved = re.sub(r'[^A-Za-z0-9\s]', '', lower)
    return punremoved
jeopardy['clean_question'] = jeopardy['Question'].apply(lowercase_no_punct)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(lowercase_no_punct)
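To see what this normalization does, here is a small stand-alone sketch applied to two strings taken from the table above. The function name normalize_text is a stand-in used only for this example.

```python
import re

def normalize_text(text):
    """Lowercase text and keep only letters, digits, and whitespace."""
    return re.sub(r'[^A-Za-z0-9\s]', '', text.lower())

samples = ["McDonald's", "No. 2: 1912 Olympian; football star"]
cleaned = [normalize_text(s) for s in samples]
print(cleaned)
```

Note that apostrophes are removed without inserting a space, so "McDonald's" becomes a single word.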
The "Value" column is usually a dollar sign followed by a number, but it is currently stored as a string. We should remove the dollar sign and convert the value to an integer.
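def punremovandtoint(string):
    # Strip the dollar sign and any other punctuation before converting.
    punremoved = re.sub(r'[^A-Za-z0-9\s]', '', string)
    try:
        integer = int(punremoved)
    except Exception:
        # Values that are not numeric fall back to 0.
        integer = 0
    return integer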
jeopardy['clean_values'] = jeopardy['Value'].apply(punremovandtoint)
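The conversion can be checked on a few sample strings. This is a simplified variant that keeps only digits instead of using try/except. The function name parse_dollar_value and the inputs '$2,000' and 'None' are assumptions chosen for illustration, not values confirmed from the dataset.

```python
import re

def parse_dollar_value(text):
    """Turn a string such as '$2,000' into an int, or 0 if it has no digits."""
    digits_only = re.sub(r'[^0-9]', '', text)
    return int(digits_only) if digits_only else 0

# Assumed sample inputs: a plain value, one with a comma, and a non-numeric entry.
samples = ['$200', '$2,000', 'None']
parsed = [parse_dollar_value(s) for s in samples]
print(parsed)
```

Mapping non-numeric entries to 0 means they will later be counted as low-value questions, which is worth keeping in mind when reading the counts.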
We'll also have to convert the values in the "Air Date" column into datetime objects.
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])
Let's see what our table currently looks like.
| ||Show Number||Air Date||Round||Category||Value||Question||Answer||clean_question||clean_answer||clean_values|
|0||4680||2004-12-31||Jeopardy!||HISTORY||$200||For the last 8 years of his life, Galileo was ...||Copernicus||for the last 8 years of his life galileo was u...||copernicus||200|
|1||4680||2004-12-31||Jeopardy!||ESPN's TOP 10 ALL-TIME ATHLETES||$200||No. 2: 1912 Olympian; football star at Carlisl...||Jim Thorpe||no 2 1912 olympian football star at carlisle i...||jim thorpe||200|
|2||4680||2004-12-31||Jeopardy!||EVERYBODY TALKS ABOUT IT...||$200||The city of Yuma in this state has a record av...||Arizona||the city of yuma in this state has a record av...||arizona||200|
|3||4680||2004-12-31||Jeopardy!||THE COMPANY LINE||$200||In 1963, live on "The Art Linkletter Show", th...||McDonald's||in 1963 live on the art linkletter show this c...||mcdonalds||200|
|4||4680||2004-12-31||Jeopardy!||EPITAPHS & TRIBUTES||$200||Signer of the Dec. of Indep., framer of the Co...||John Adams||signer of the dec of indep framer of the const...||john adams||200|
Now that the data is cleaned, we can start analyzing it.
Suppose we are interested in how many words in the answer also occur in the question. We'll create a function and use the .apply() method to create a new column. This column will hold the ratio of answer words that appear in the question to the total number of answer words.
def cleaner(series):
    split_answer = series['clean_answer'].split(' ')
    split_question = series['clean_question'].split(' ')
    match_count = 0
    # Ignore "the", which is too common to be informative.
    if "the" in split_answer:
        split_answer.remove('the')
    # Avoid dividing by zero when nothing is left of the answer.
    if len(split_answer) == 0:
        return 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)
jeopardy['answer_in_question'] = jeopardy.apply(cleaner, axis=1)
jeopardy['answer_in_question'].mean()
On average, only about 6% of the words in an answer also appear in its question, so looking for the answer in the question is not a reliable strategy.
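To make the ratio concrete, here is a small sketch on an invented question and answer pair (not taken from the dataset). The function answer_overlap is a hypothetical stand-in that works on two plain strings, and unlike the version above it drops every occurrence of "the" rather than only the first.

```python
def answer_overlap(clean_question, clean_answer):
    """Share of answer words (ignoring 'the') that also occur in the question."""
    question_words = set(clean_question.split())
    answer_words = [w for w in clean_answer.split() if w != 'the']
    if not answer_words:
        return 0
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)

# Invented example: "planet" is in the question, "mars" is not.
ratio = answer_overlap("this red planet is named for the roman god of war",
                       "the planet mars")
print(ratio)
```

One of the two remaining answer words matches, so the ratio for this pair is 0.5.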
Next, we'll look at the words used in the "Question" column. We can write some code to see how often they repeat.
question_overlap = []
# A Python set is an unordered collection of unique items.
terms_used = set()
for idx, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(" ")
    match_count = 0
    newlist = []
    # Keep only words of six or more characters to skip common filler words.
    for word in split_question:
        if len(word) >= 6:
            newlist.append(word)
    # Count the words that already appeared in an earlier question.
    for word in newlist:
        if word in terms_used:
            match_count += 1
    for word in newlist:
        terms_used.add(word)
    if len(newlist) > 0:
        match_count = match_count / len(newlist)
    question_overlap.append(match_count)
jeopardy['question_overlap'] = question_overlap
There is about a 69% overlap between the words in new questions and the words in earlier ones. However, the same words can be combined into different phrases with very different meanings, so this large overlap is not especially significant.
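The overlap calculation is easier to follow on a tiny example. The three questions below are invented for illustration; the logic mirrors the loop above, keeping words of six or more characters and comparing each question against the words seen so far.

```python
# Invented, already-cleaned questions in the order they would be processed.
questions = [
    "this florentine astronomer defended copernicus",
    "this polish astronomer proposed heliocentrism",
    "copernicus published this theory in 1543",
]

terms_seen = set()
overlaps = []
for question in questions:
    long_words = [w for w in question.split() if len(w) >= 6]
    if long_words:
        repeated = sum(1 for w in long_words if w in terms_seen)
        overlaps.append(repeated / len(long_words))
    else:
        overlaps.append(0)
    # Only after scoring do this question's words join the seen set.
    terms_seen.update(long_words)

print(overlaps)
```

The first question always scores 0 because nothing has been seen yet, and later questions score higher as the set grows. This means the result depends on row order and on how many questions have already been processed.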
Let's take a look at the number of questions worth more than 800 dollars. Maybe it is a good idea to study only high-value questions.
def highvalue(row):
    # Flag questions worth more than 800 dollars with 1, all others with 0.
    value = 0
    if row['clean_values'] > 800:
        value = 1
    return value

jeopardy['high_value'] = jeopardy.apply(highvalue, axis=1)
high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]
It doesn't look like there are that many high-value questions in the dataset.
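The threshold is a strict inequality, which the following sketch illustrates with a handful of made-up values (not taken from the dataset).

```python
# Made-up cleaned values; 0 stands for an entry that could not be parsed.
clean_values = [200, 400, 800, 1000, 2000, 0]

# A question counts as high value only when it is worth MORE than 800.
high_flags = [1 if value > 800 else 0 for value in clean_values]
high_count = sum(high_flags)
low_count = len(high_flags) - high_count

print(high_flags, high_count, low_count)
```

A question worth exactly 800 dollars lands in the low-value group, as does any entry whose value was set to 0 during cleaning.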
We can create a function that takes in a word and returns the number of high-value and low-value questions that word shows up in. Maybe this will help us study.
def highlowcounts(word):
    low_count = 0
    high_count = 0
    for idx, row in jeopardy.iterrows():
        if word in row['clean_question'].split(' '):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count
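Calling this function once per word scans the whole table each time. If counts were needed for many words, one possible alternative is to tally every word in a single pass with collections.Counter. The rows below are invented (question, high_value) pairs, and this is only a sketch of the idea, not the approach used in this project.

```python
from collections import Counter

# Invented (clean_question, high_value) pairs for illustration.
rows = [
    ("this composer wrote pictures at an exhibition", 1),
    ("this composer wrote the ring cycle", 0),
    ("pictures of this planet show its rings", 0),
]

high_counts = Counter()
low_counts = Counter()
for question, is_high in rows:
    # set() makes each question count at most once per word.
    for word in set(question.split()):
        if is_high == 1:
            high_counts[word] += 1
        else:
            low_counts[word] += 1

print(high_counts["composer"], low_counts["composer"])
```

Looking up a word afterwards is a dictionary access, and a word that never appeared simply returns 0 from either counter.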
observed_expected = []
comparison_terms = list(terms_used)[:5]
comparison_terms
['emigrated', 'ruffles', 'waterworld', 'mussorgsky', 'appendages']
for term in comparison_terms:
    observed_expected.append(highlowcounts(term))
observed_expected
[(1, 0), (0, 2), (1, 0), (1, 1), (1, 2)]
We can use the chi-squared test to see whether the high-value and low-value counts for the terms in "comparison_terms" are statistically significant.
from scipy.stats import chisquare
import numpy as np

chi_squared = []
for lists in observed_expected:
    total = sum(lists)
    # Share of all questions that contain this term.
    total_prop = total / jeopardy.shape[0]
    # Counts we would expect if the term were spread evenly across values.
    expected_high = total_prop * high_value_count
    expected_low = total_prop * low_value_count
    observed = np.array([lists[0], lists[1]])
    expected = np.array([expected_high, expected_low])
    chi_squared.append(chisquare(observed, expected))
[Power_divergenceResult(statistic=2.4877921171956752, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.80392569225376798, pvalue=0.36992223780795708),
 Power_divergenceResult(statistic=2.4877921171956752, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.44487748166127949, pvalue=0.50477764875459963),
 Power_divergenceResult(statistic=0.031881167234403623, pvalue=0.85828871632352932)]
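None of the p-values is less than 0.05, so none of these differences is statistically significant. Each of these terms also appears in only a handful of questions, and the chi-squared test is not reliable with such small counts.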
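For readers who want to see what the test computes, the sketch below works through the statistic by hand for a single term, using only the standard library. The dataset totals (20,000 rows, 6,000 high-value, 14,000 low-value) are hypothetical round numbers chosen for illustration, not the real counts, and the function name is a stand-in.

```python
import math

def chi_squared_two_groups(observed, expected):
    """Chi-squared statistic and p-value for two categories (1 degree of freedom)."""
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With one degree of freedom the upper tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical totals, for illustration only.
total_rows = 20000
high_total = 6000
low_total = 14000

observed = (1, 2)  # term seen in 1 high-value and 2 low-value questions
share = sum(observed) / total_rows
expected = (share * high_total, share * low_total)

statistic, p_value = chi_squared_two_groups(observed, expected)
print(statistic, p_value)
```

Under these assumed totals the expected counts are about 0.9 and 2.1, very close to what was observed, so the statistic is small and the p-value is far above 0.05.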
Python concepts explored: pandas, matplotlib, data cleaning, string manipulation, chi squared test, regex, try/except
Python functions and methods used: .columns, .lower(), .sub(), .apply(), sum(), .array(), .split(), .shape, .mean(), .iterrows(), .remove(), .add(), .append()
The files used for this project can be found in my GitHub repository.