In this project, we will prepare data in preparation for SQL. We'll have to first clean the data.

In [1]:
import pandas as pd
df = pd.read_csv('academy_awards.csv', encoding='ISO-8859-1')

#Turns off warnings for potentially confusing assignments
pd.options.mode.chained_assignment = None  # default='warn'
df.head()
Out[1]:
Year Category Nominee Additional Info Won? Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10
0 2010 (83rd) Actor -- Leading Role Javier Bardem Biutiful {'Uxbal'} NO NaN NaN NaN NaN NaN NaN
1 2010 (83rd) Actor -- Leading Role Jeff Bridges True Grit {'Rooster Cogburn'} NO NaN NaN NaN NaN NaN NaN
2 2010 (83rd) Actor -- Leading Role Jesse Eisenberg The Social Network {'Mark Zuckerberg'} NO NaN NaN NaN NaN NaN NaN
3 2010 (83rd) Actor -- Leading Role Colin Firth The King's Speech {'King George VI'} YES NaN NaN NaN NaN NaN NaN
4 2010 (83rd) Actor -- Leading Role James Franco 127 Hours {'Aron Ralston'} NO NaN NaN NaN NaN NaN NaN

This is not considered tidy data. Here are the issues:

  1. We have string values in the year column, we should change these into integers.

  2. Additional Info has the movie name and the role name, this can be split into two columns.

  3. The "Won?" column change should be changed to 0's and 1's in.

  4. There are a lot of NaN values under the unnammed columns, we should consider dropping them.

First let's take a look at a unnammed columns, we can use .value_counts() to see if there are any significant values in these columns.

In [2]:
print(df['Unnamed: 10'].value_counts())
print(df['Unnamed: 9'].value_counts())
print(df['Unnamed: 8'].value_counts())
print(df['Unnamed: 7'].value_counts())
print(df['Unnamed: 6'].value_counts())
print(df['Unnamed: 5'].value_counts())
*    1
Name: Unnamed: 10, dtype: int64
*    1
Name: Unnamed: 9, dtype: int64
*                                                 1
 understanding comedy genius - Mack Sennett.""    1
Name: Unnamed: 8, dtype: int64
*                                                     1
 while requiring no dangerous solvents. [Systems]"    1
 kindly                                               1
Name: Unnamed: 7, dtype: int64
*                                                                   9
 direct radiator bass style cinema loudspeaker systems. [Sound]"    1
 flexibility and water resistance                                   1
 sympathetic                                                        1
Name: Unnamed: 6, dtype: int64
*                                                                                                               7
 discoverer of stars                                                                                            1
 D.B. "Don" Keele and Mark E. Engebretson has resulted in the over 20-year dominance of constant-directivity    1
 error-prone measurements on sets. [Digital Imaging Technology]"                                                1
 resilience                                                                                                     1
Name: Unnamed: 5, dtype: int64

It doesn't look like there any significant information under these columns. We can probably drop all the unnamed columns, but first let's convert the year column to integers.

In [3]:
df['Year'] = df['Year'].str[0:4]
df['Year'] = df['Year'].astype(int)
df.head()
Out[3]:
Year Category Nominee Additional Info Won? Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10
0 2010 Actor -- Leading Role Javier Bardem Biutiful {'Uxbal'} NO NaN NaN NaN NaN NaN NaN
1 2010 Actor -- Leading Role Jeff Bridges True Grit {'Rooster Cogburn'} NO NaN NaN NaN NaN NaN NaN
2 2010 Actor -- Leading Role Jesse Eisenberg The Social Network {'Mark Zuckerberg'} NO NaN NaN NaN NaN NaN NaN
3 2010 Actor -- Leading Role Colin Firth The King's Speech {'King George VI'} YES NaN NaN NaN NaN NaN NaN
4 2010 Actor -- Leading Role James Franco 127 Hours {'Aron Ralston'} NO NaN NaN NaN NaN NaN NaN
In [4]:
df.describe(include = 'all')
Out[4]:
Year Category Nominee Additional Info Won? Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10
count 10137.000000 10137 10137 9011 10137 11 12 3 2 1 1
unique NaN 40 6001 6424 16 5 4 3 2 1 1
top NaN Writing Meryl Streep Metro-Goldwyn-Mayer NO * * * * * *
freq NaN 888 16 60 7168 7 9 1 1 1 1
mean 1970.330768 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
std 23.332917 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
min 1927.000000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
25% 1950.000000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
50% 1970.000000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
75% 1991.000000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
max 2010.000000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

We are only interested in data after year 2000, and actors in award winnng categories. We can use boolean filtering and the .isin() method to filter the data out and then create a new dataframe named "nominations".

In [5]:
later_than_2000 = df[df['Year'] > 2000]

award_categories = [
    "Actor -- Leading Role",
    "Actor -- Supporting Role",
    "Actress -- Leading Role",
    "Actress -- Supporting Role"
]


nominations = later_than_2000[later_than_2000['Category'].isin(award_categories)]
nominations.head()
Out[5]:
Year Category Nominee Additional Info Won? Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10
0 2010 Actor -- Leading Role Javier Bardem Biutiful {'Uxbal'} NO NaN NaN NaN NaN NaN NaN
1 2010 Actor -- Leading Role Jeff Bridges True Grit {'Rooster Cogburn'} NO NaN NaN NaN NaN NaN NaN
2 2010 Actor -- Leading Role Jesse Eisenberg The Social Network {'Mark Zuckerberg'} NO NaN NaN NaN NaN NaN NaN
3 2010 Actor -- Leading Role Colin Firth The King's Speech {'King George VI'} YES NaN NaN NaN NaN NaN NaN
4 2010 Actor -- Leading Role James Franco 127 Hours {'Aron Ralston'} NO NaN NaN NaN NaN NaN NaN

We can also use use the .map() method to convert all the "No" values to 0 and all the "Yes" values to 1.

In [6]:
yes_no = {
    "NO":0,
    "YES":1
}

nominations['Won'] = nominations['Won?'].map(yes_no)
nominations.head()
Out[6]:
Year Category Nominee Additional Info Won? Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10 Won
0 2010 Actor -- Leading Role Javier Bardem Biutiful {'Uxbal'} NO NaN NaN NaN NaN NaN NaN 0
1 2010 Actor -- Leading Role Jeff Bridges True Grit {'Rooster Cogburn'} NO NaN NaN NaN NaN NaN NaN 0
2 2010 Actor -- Leading Role Jesse Eisenberg The Social Network {'Mark Zuckerberg'} NO NaN NaN NaN NaN NaN NaN 0
3 2010 Actor -- Leading Role Colin Firth The King's Speech {'King George VI'} YES NaN NaN NaN NaN NaN NaN 1
4 2010 Actor -- Leading Role James Franco 127 Hours {'Aron Ralston'} NO NaN NaN NaN NaN NaN NaN 0

Next, we'll drop the unnamed columns

In [7]:
columns_drop = [
    "Won?",
    "Unnamed: 5",
    "Unnamed: 6",
    "Unnamed: 7",
    "Unnamed: 8",
    "Unnamed: 9",
    "Unnamed: 10"
]
final_nominations = nominations.drop(columns_drop, axis = 1)
In [8]:
final_nominations.head()
Out[8]:
Year Category Nominee Additional Info Won
0 2010 Actor -- Leading Role Javier Bardem Biutiful {'Uxbal'} 0
1 2010 Actor -- Leading Role Jeff Bridges True Grit {'Rooster Cogburn'} 0
2 2010 Actor -- Leading Role Jesse Eisenberg The Social Network {'Mark Zuckerberg'} 0
3 2010 Actor -- Leading Role Colin Firth The King's Speech {'King George VI'} 1
4 2010 Actor -- Leading Role James Franco 127 Hours {'Aron Ralston'} 0

The last thing we'll have to do is separate the "Additional Info" column into two new columns "Movie" and "Character". We'll have to manipulate the strings and the split the two values in this one column in two new strings.

First we'll use .str.rstrip() to remove the end quotation mark and the bracket from the column series, then we'll split the the string again into a series of lists.

In [9]:
additional_info_one = final_nominations['Additional Info'].str.rstrip(to_strip = "'}")
additional_info_one.head()
Out[9]:
0                        Biutiful {'Uxbal
1             True Grit {'Rooster Cogburn
2    The Social Network {'Mark Zuckerberg
3      The King's Speech {'King George VI
4                127 Hours {'Aron Ralston
Name: Additional Info, dtype: object
In [10]:
#Split into a series of lists
additional_info_two = additional_info_one.str.split(' {\'')
additional_info_two.head()
Out[10]:
0                        [Biutiful, Uxbal]
1             [True Grit, Rooster Cogburn]
2    [The Social Network, Mark Zuckerberg]
3      [The King's Speech, King George VI]
4                [127 Hours, Aron Ralston]
Name: Additional Info, dtype: object
In [11]:
#Set a series with the first element to movie names
movie_names = additional_info_two.str[0]


#Set a series with the second element to characters
characters = additional_info_two.str[1]
In [12]:
final_nominations["Movie"] = movie_names
final_nominations["Character"] = characters
final_nominations = final_nominations.drop(["Additional Info"], axis=1)
final_nominations.head()
Out[12]:
Year Category Nominee Won Movie Character
0 2010 Actor -- Leading Role Javier Bardem 0 Biutiful Uxbal
1 2010 Actor -- Leading Role Jeff Bridges 0 True Grit Rooster Cogburn
2 2010 Actor -- Leading Role Jesse Eisenberg 0 The Social Network Mark Zuckerberg
3 2010 Actor -- Leading Role Colin Firth 1 The King's Speech King George VI
4 2010 Actor -- Leading Role James Franco 0 127 Hours Aron Ralston

Now that we are done cleaning up the data, we can do some simply analysis using sqlite3

In [13]:
import sqlite3

conn = sqlite3.connect("nominations.db")
cursor = conn.cursor()

#Creates the table "nominations"
final_nominations.to_sql("nominations", conn, index=False)
In [14]:
q1 = '''
Pragma table_info(nominations)
'''
result = cursor.execute(q1).fetchall()
result
Out[14]:
[(0, 'Year', 'INTEGER', 0, None, 0),
 (1, 'Category', 'TEXT', 0, None, 0),
 (2, 'Nominee', 'TEXT', 0, None, 0),
 (3, 'Won', 'INTEGER', 0, None, 0),
 (4, 'Movie', 'TEXT', 0, None, 0),
 (5, 'Character', 'TEXT', 0, None, 0)]
In [15]:
q2 = '''
SELECT * FROM nominations LIMIT 10
'''

result = cursor.execute(q2).fetchall()
result
Out[15]:
[(2010, 'Actor -- Leading Role', 'Javier Bardem', 0, 'Biutiful', 'Uxbal'),
 (2010,
  'Actor -- Leading Role',
  'Jeff Bridges',
  0,
  'True Grit',
  'Rooster Cogburn'),
 (2010,
  'Actor -- Leading Role',
  'Jesse Eisenberg',
  0,
  'The Social Network',
  'Mark Zuckerberg'),
 (2010,
  'Actor -- Leading Role',
  'Colin Firth',
  1,
  "The King's Speech",
  'King George VI'),
 (2010,
  'Actor -- Leading Role',
  'James Franco',
  0,
  '127 Hours',
  'Aron Ralston'),
 (2010,
  'Actor -- Supporting Role',
  'Christian Bale',
  1,
  'The Fighter',
  'Dicky Eklund'),
 (2010,
  'Actor -- Supporting Role',
  'John Hawkes',
  0,
  "Winter's Bone",
  'Teardrop'),
 (2010,
  'Actor -- Supporting Role',
  'Jeremy Renner',
  0,
  'The Town',
  'James Coughlin'),
 (2010,
  'Actor -- Supporting Role',
  'Mark Ruffalo',
  0,
  'The Kids Are All Right',
  'Paul'),
 (2010,
  'Actor -- Supporting Role',
  'Geoffrey Rush',
  0,
  "The King's Speech",
  'Lionel Logue')]
In [16]:
conn.close()

Learning Summary

Python/SQL concepts explored: python+sqlite3, pandas, data cleaning, columns manipulation

Python functions and methods used: .str.rstrip(), .str.split(), .connect(), .cursor(), .drop(), .str[], .map(), .value_counts()

SQL statements used: SELECT, FROM, PRAGMA

The files used for this project can be found in my GitHub repository.



Comments

comments powered by Disqus