In this project, we'll look at 20,000 rows of the Jeopardy dataset in "jeopardy.csv". We want to see whether there are patterns in the questions asked that could give us a small edge in winning.
First, we'll have to tidy up the data.
import pandas as pd
import matplotlib.pyplot as plt
jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head(5)
print(jeopardy.columns)
Looks like there is stray whitespace in the column names. We can fix this pretty easily by assigning clean names to the .columns attribute.
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value',
                    'Question', 'Answer']
jeopardy.columns
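If we didn't want to hard-code the names, the whitespace could also be stripped programmatically. A minimal sketch (the padded names below are made up to mimic the raw header): `.str.strip()` on the column index removes leading and trailing spaces from every name at once.

```python
import pandas as pd

# Toy frame with padded column names, mimicking the raw CSV header.
df = pd.DataFrame(columns=['Show Number', ' Air Date', ' Round '])

# .str.strip() works elementwise on the column Index.
df.columns = df.columns.str.strip()
print(list(df.columns))  # ['Show Number', 'Air Date', 'Round']
```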
Next, let's make all the strings in the question and answer columns lowercase. We can write a function and then use the .apply() method.
We also want to remove all punctuation; the goal is to reduce the "Question" and "Answer" columns down to just words.
import re
def lowercase_no_punct(string):
    lower = string.lower()
    punremoved = re.sub(r'[^A-Za-z0-9\s]', '', lower)
    return punremoved
jeopardy['clean_question'] = jeopardy['Question'].apply(lowercase_no_punct)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(lowercase_no_punct)
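A quick sanity check on a made-up clue (not from the dataset) shows what the cleaning function produces; the definition is repeated here so the snippet stands alone.

```python
import re

def lowercase_no_punct(string):
    lower = string.lower()
    # keep only letters, digits, and whitespace
    return re.sub(r'[^A-Za-z0-9\s]', '', lower)

# hypothetical clue, not from the dataset
print(lowercase_no_punct("What's the Capital of France?"))
# whats the capital of france
```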
The "Value" column is usually a dollar sign followed by a number, but it is currently stored as a string. We should convert it to an integer and remove the dollar sign.
def punremovandtoint(string):
    punremoved = re.sub(r'[^A-Za-z0-9\s]', '', string)
    try:
        integer = int(punremoved)
    except Exception:
        integer = 0
    return integer
jeopardy['clean_values'] = jeopardy['Value'].apply(punremovandtoint)
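For example, on made-up inputs (the definition is repeated so the snippet stands alone): the regex strips the dollar sign and comma, and any string that still isn't numeric falls through the try/except to 0.

```python
import re

def punremovandtoint(string):
    punremoved = re.sub(r'[^A-Za-z0-9\s]', '', string)
    try:
        integer = int(punremoved)
    except Exception:
        integer = 0
    return integer

print(punremovandtoint('$2,000'))  # 2000
print(punremovandtoint('None'))    # 0 (non-numeric strings fall back to 0)
```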
We'll also have to convert the values in the "Air Date" column into datetime objects.
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])
Let's see what our table currently looks like.
jeopardy.head()
Now that the data is cleaned, we can start analyzing it.
Suppose we are interested in how many words in the answer also occur in the question. We'll create a function and use the .apply() method to create a new column. This column will hold the ratio of matching question words to total answer words.
def cleaner(series):
    split_answer = series['clean_answer'].split(' ')
    split_question = series['clean_question'].split(' ')
    match_count = 0
    if "the" in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)
jeopardy['answer_in_question'] = jeopardy.apply(cleaner, axis=1)
jeopardy['answer_in_question'].mean()
It looks like the answer only appears in the question about 6% of the time, so deducing the answer from the question's wording is not a reliable strategy.
Next, we'll look at the words used in the question column. We can write a loop to see how often longer words repeat across questions.
question_overlap = []
# a Python set is an unordered collection of unique items
terms_used = set()
for idx, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(" ")
    match_count = 0
    newlist = []
    for word in split_question:
        if len(word) >= 6:
            newlist.append(word)
    for word in newlist:
        if word in terms_used:
            match_count += 1
    for word in newlist:
        terms_used.add(word)
    if len(newlist) > 0:
        match_count = match_count / len(newlist)
    question_overlap.append(match_count)
jeopardy['question_overlap'] = question_overlap
jeopardy['question_overlap'].mean()
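To make the mechanics concrete, here is the same overlap logic run on two invented questions. Only words of six or more characters are compared, and the set makes each membership check fast; terms are added to the set only after the current question has been scored, just as in the loop above.

```python
terms_used = set()
overlaps = []
questions = ["president lincoln delivered gettysburg address",
             "lincoln delivered famous speech gettysburg"]
for q in questions:
    # keep only longer words, as in the loop above
    words = [w for w in q.split(" ") if len(w) >= 6]
    matches = sum(1 for w in words if w in terms_used)
    terms_used.update(words)
    overlaps.append(matches / len(words) if words else 0)
print(overlaps)  # [0.0, 0.6] -- 3 of the second question's 5 long words repeat
```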
There is a 69% overlap of words between new questions and old ones. However, the same words can be combined into different phrases with very different meanings, so this large overlap is not especially significant on its own.
Let's take a look at the number of questions worth more than $800. Maybe it is a good idea to only study high-value questions.
def highvalue(row):
    value = 0
    if row['clean_values'] > 800:
        value = 1
    return value
jeopardy['high_value'] = jeopardy.apply(highvalue, axis=1)
high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]
print(high_value_count)
print(low_value_count)
It doesn't look like there are that many high-value questions in the dataset.
We can create a function that takes in a word and returns the number of high- and low-value questions that word appears in. Maybe this will help us study.
def highlowcounts(word):
    low_count = 0
    high_count = 0
    for idx, row in jeopardy.iterrows():
        if word in row['clean_question'].split(' '):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count
observed_expected = []
comparison_terms = list(terms_used)[:5]
comparison_terms
for term in comparison_terms:
    observed_expected.append(highlowcounts(term))
observed_expected
We can use the chi-squared test to see whether the high/low counts for the terms in "comparison_terms" differ from what we'd expect by chance, i.e. whether they are statistically significant.
from scipy.stats import chisquare
import numpy as np

chi_squared = []
for lists in observed_expected:
    total = sum(lists)
    total_prop = total / jeopardy.shape[0]
    expected_high = total_prop * high_value_count
    expected_low = total_prop * low_value_count
    observed = np.array([lists[0], lists[1]])
    expected = np.array([expected_high, expected_low])
    chi_squared.append(chisquare(observed, expected))
chi_squared
None of the p-values are below 0.05, so none of these terms show a statistically significant difference between high- and low-value questions.
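As a worked illustration of the expected-count logic (all counts below are invented, not taken from the dataset): the term's total appearances are spread across the high and low groups in proportion to each group's size, and chisquare compares the observed split against that expectation.

```python
import numpy as np
from scipy.stats import chisquare

# hypothetical term: appears in 3 high-value and 7 low-value questions
high_count, low_count = 3, 7
# assumed totals of high- and low-value questions in a 20,000-row table
total_high, total_low = 5000, 15000

total = high_count + low_count
prop = total / (total_high + total_low)   # 0.0005
expected = np.array([prop * total_high,   # 2.5
                     prop * total_low])   # 7.5
stat, p = chisquare(np.array([high_count, low_count]), expected)
print(stat, p)  # a small statistic; p well above 0.05, so no significance
```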
Learning Summary
Python concepts explored: pandas, matplotlib, data cleaning, string manipulation, chi-squared test, regex, try/except
Python functions and methods used: .columns, .lower(), .sub(), .apply(), sum(), .array(), .split(), .shape, .mean(), .iterrows(), .remove(), .add(), .append()
The files used for this project can be found in my GitHub repository.