In this project, we will prepare data in preparation for SQL. We'll have to first clean the data.
import pandas as pd
df = pd.read_csv('academy_awards.csv', encoding='ISO-8859-1')
#Turns off warnings for potentially confusing assignments
pd.options.mode.chained_assignment = None # default='warn'
df.head()
This is not considered tidy data. Here are the issues:
We have string values in the year column, we should change these into integers.
Additional Info has the movie name and the role name, this can be split into two columns.
The "Won?" column change should be changed to 0's and 1's in.
There are a lot of NaN values under the unnammed columns, we should consider dropping them.
First let's take a look at a unnammed columns, we can use .value_counts() to see if there are any significant values in these columns.
print(df['Unnamed: 10'].value_counts())
print(df['Unnamed: 9'].value_counts())
print(df['Unnamed: 8'].value_counts())
print(df['Unnamed: 7'].value_counts())
print(df['Unnamed: 6'].value_counts())
print(df['Unnamed: 5'].value_counts())
It doesn't look like there any significant information under these columns. We can probably drop all the unnamed columns, but first let's convert the year column to integers.
df['Year'] = df['Year'].str[0:4]
df['Year'] = df['Year'].astype(int)
df.head()
df.describe(include = 'all')
We are only interested in data after year 2000, and actors in award winnng categories. We can use boolean filtering and the .isin() method to filter the data out and then create a new dataframe named "nominations".
later_than_2000 = df[df['Year'] > 2000]
award_categories = [
"Actor -- Leading Role",
"Actor -- Supporting Role",
"Actress -- Leading Role",
"Actress -- Supporting Role"
]
nominations = later_than_2000[later_than_2000['Category'].isin(award_categories)]
nominations.head()
We can also use use the .map() method to convert all the "No" values to 0 and all the "Yes" values to 1.
yes_no = {
"NO":0,
"YES":1
}
nominations['Won'] = nominations['Won?'].map(yes_no)
nominations.head()
Next, we'll drop the unnamed columns
columns_drop = [
"Won?",
"Unnamed: 5",
"Unnamed: 6",
"Unnamed: 7",
"Unnamed: 8",
"Unnamed: 9",
"Unnamed: 10"
]
final_nominations = nominations.drop(columns_drop, axis = 1)
final_nominations.head()
The last thing we'll have to do is separate the "Additional Info" column into two new columns "Movie" and "Character". We'll have to manipulate the strings and the split the two values in this one column in two new strings.
First we'll use .str.rstrip() to remove the end quotation mark and the bracket from the column series, then we'll split the the string again into a series of lists.
additional_info_one = final_nominations['Additional Info'].str.rstrip(to_strip = "'}")
additional_info_one.head()
#Split into a series of lists
additional_info_two = additional_info_one.str.split(' {\'')
additional_info_two.head()
#Set a series with the first element to movie names
movie_names = additional_info_two.str[0]
#Set a series with the second element to characters
characters = additional_info_two.str[1]
final_nominations["Movie"] = movie_names
final_nominations["Character"] = characters
final_nominations = final_nominations.drop(["Additional Info"], axis=1)
final_nominations.head()
Now that we are done cleaning up the data, we can do some simply analysis using sqlite3
import sqlite3
conn = sqlite3.connect("nominations.db")
cursor = conn.cursor()
#Creates the table "nominations"
final_nominations.to_sql("nominations", conn, index=False)
q1 = '''
Pragma table_info(nominations)
'''
result = cursor.execute(q1).fetchall()
result
q2 = '''
SELECT * FROM nominations LIMIT 10
'''
result = cursor.execute(q2).fetchall()
result
conn.close()
Learning Summary¶
Python/SQL concepts explored: python+sqlite3, pandas, data cleaning, columns manipulation
Python functions and methods used: .str.rstrip(), .str.split(), .connect(), .cursor(), .drop(), .str[], .map(), .value_counts()
SQL statements used: SELECT, FROM, PRAGMA
The files used for this project can be found in my GitHub repository.
Comments
comments powered by Disqus