In this project we look at a Star Wars survey stored in 'star_wars.csv'. The focus is data cleaning, so that we end up with a data set ready for analysis. Let's begin by reading the CSV file and displaying the first few rows.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
star_wars = pd.read_csv('star_wars.csv', encoding='ISO-8859-1')
star_wars.head(3)
print(star_wars.shape)
star_wars.columns
It looks like most of these columns are unnamed, but first we'll remove any row without a RespondentID.
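# Keep only rows with a RespondentID; .copy() avoids a SettingWithCopyWarning on later column assignments
star_wars = star_wars[star_wars['RespondentID'].notnull()].copy()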
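To see what this filter does in isolation, here is a minimal sketch on a toy DataFrame. The frame `toy` and its values are made up for illustration and are not taken from the survey file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey (hypothetical values, not real survey rows)
toy = pd.DataFrame({
    "RespondentID": [np.nan, 101.0, 102.0],
    "Gender": ["Response", "Male", "Female"],
})

# notnull() returns a Boolean mask; indexing with it keeps only the True rows
mask = toy["RespondentID"].notnull()
toy_clean = toy[mask].copy()

print(toy_clean.shape)
```

The row whose RespondentID is missing is dropped, and the remaining rows keep their original index labels.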
Next we want to convert the "Yes" and "No" strings into Booleans. We can use the .map() method with a dictionary to replace "Yes" with True and "No" with False. The dictionary also maps missing values (NaN) to False.
yes_no = {"Yes": True, "No": False, np.nan: False}
for col in [
    "Have you seen any of the 6 films in the Star Wars franchise?",
    "Do you consider yourself to be a fan of the Star Wars film franchise?",
]:
    star_wars[col] = star_wars[col].map(yes_no)
star_wars.head()
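One detail of .map() with a dictionary is worth seeing on a small example: any value that is not a key in the dictionary becomes NaN. The sketch below uses a made-up Series (`answers`, including a hypothetical "Maybe" answer that is not claimed to occur in the survey) and also shows a comparison-based alternative that always yields True or False.

```python
import numpy as np
import pandas as pd

# Hypothetical answers; "Maybe" is included only to show an unmapped value
answers = pd.Series(["Yes", "No", np.nan, "Maybe"])

# A dictionary without a NaN key: unmapped and missing values both become NaN
mapped = answers.map({"Yes": True, "No": False})
print(mapped)

# Alternative: a direct comparison treats everything except "Yes" as False
as_bool = answers == "Yes"
print(as_bool)
```

The comparison is shorter, but it silently folds unexpected answers into False, whereas the dictionary leaves them as NaN so they can be inspected.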
Columns 4 to 9 contain the title of each movie the respondent saw, and a missing value otherwise. As with columns 2 and 3, we convert these into Booleans with the .map() method. We also rename the columns so that each name reflects the true/false question it answers.
true_false = {
"Star Wars: Episode I The Phantom Menace": True,
"Star Wars: Episode II Attack of the Clones": True,
"Star Wars: Episode III Revenge of the Sith": True,
"Star Wars: Episode IV A New Hope": True,
"Star Wars: Episode V The Empire Strikes Back": True,
"Star Wars: Episode VI Return of the Jedi": True,
np.nan: False,
}
for col in star_wars.columns[3:9]:
    star_wars[col] = star_wars[col].map(true_false)
star_wars.head()
# Change the column names with the .rename() method
star_wars = star_wars.rename(columns={
'Which of the following Star Wars films have you seen? Please select all that apply.': "seen_1",
"Unnamed: 4": "seen_2",
"Unnamed: 5": "seen_3",
"Unnamed: 6": "seen_4",
"Unnamed: 7": "seen_5",
"Unnamed: 8": "seen_6",
})
star_wars.columns
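Because each of these columns holds either a film title or a missing value, the same result can be reached without listing every title. The following is a sketch of that alternative on a toy frame; the first column name is shortened and the helper names (`seen`, `new_names`) are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy version of two "seen" columns (shortened, made-up header for the first one)
toy = pd.DataFrame({
    "Which films have you seen?": ["Star Wars: Episode I The Phantom Menace", np.nan],
    "Unnamed: 4": [np.nan, "Star Wars: Episode II Attack of the Clones"],
})

# A non-missing cell means the respondent selected that film
seen = toy.notnull()

# Build the rename mapping from column positions instead of typing each name
new_names = {old: f"seen_{i}" for i, old in enumerate(seen.columns, start=1)}
seen = seen.rename(columns=new_names)

print(seen)
```

This relies on the assumption that a non-missing cell always means "seen"; the explicit dictionary is stricter because unexpected strings would show up as NaN.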
We've successfully cleaned up columns 1-9. Let's check the data types of the rest of the dataframe.
star_wars.dtypes
Columns 10 to 15 hold the movie rankings. Again, we can change the column names using the .rename() method. These columns are currently stored with the 'object' data type, so we also convert their values to float.
star_wars[star_wars.columns[9:15]] = star_wars[star_wars.columns[9:15]].astype(float)
star_wars = star_wars.rename(columns={
'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.': "ranking_1",
"Unnamed: 10": "ranking_2",
"Unnamed: 11": "ranking_3",
"Unnamed: 12": "ranking_4",
"Unnamed: 13": "ranking_5",
"Unnamed: 14": "ranking_6",
})
star_wars.columns
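As a small illustration of the type conversion, the sketch below converts string rankings to float on a toy frame with made-up values. It also shows pd.to_numeric with errors="coerce", an alternative that is not used in this project but can help when a column contains strings that cannot be parsed as numbers.

```python
import numpy as np
import pandas as pd

# Rankings read from a CSV often arrive as strings (hypothetical values)
toy = pd.DataFrame({
    "ranking_1": ["3", "1", np.nan],
    "ranking_2": ["2", "6", "4"],
})

# astype(float) parses the numeric strings and keeps missing values as NaN
converted = toy.astype(float)
print(converted.dtypes)

# Alternative for messy data: unparseable strings become NaN instead of raising
messy = pd.Series(["5", "n/a", "2"])
coerced = pd.to_numeric(messy, errors="coerce")
print(coerced)
```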
means = star_wars[star_wars.columns[9:15]].mean()
%matplotlib inline
plt.bar(range(1,7), means)
plt.xlabel("Movie #")
plt.ylabel('Average Ranking')
Column 10 originally carried the following header: 'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.'
So movies with lower ranking values are the ones the survey respondents preferred. From the chart, it looks like the older movies (#4-6) are ranked more favorably (lower average values) than the newer Star Wars movies (#1-3).
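Since a lower mean is better here, sorting the column means in ascending order puts the most preferred movie first. A minimal sketch with made-up rankings for three movies:

```python
import pandas as pd

# Hypothetical rankings from three respondents for three movies
rankings = pd.DataFrame({
    "ranking_1": [3.0, 2.0, 3.0],
    "ranking_2": [2.0, 3.0, 2.0],
    "ranking_3": [1.0, 1.0, 1.0],
})

# mean() averages each column; sort_values() orders from best (lowest) to worst
means_sorted = rankings.mean().sort_values()
best = means_sorted.index[0]

print(means_sorted)
print(best)
```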
sums = star_wars[star_wars.columns[3:9]].sum()
plt.bar(range(1,7), sums)
plt.xlabel("Movie #")
plt.ylabel('Total Respondents')
The pattern is the same here: more respondents saw the original movies (#4-6), and those movies were ranked higher. Keep in mind that a lower average ranking value means respondents liked the movie more.
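The viewer counts work because summing a Boolean column counts its True values. The sketch below shows this on made-up data and also computes the share of respondents with .mean(), which is an addition for illustration and not part of the original analysis.

```python
import pandas as pd

# Hypothetical "seen" flags for three respondents
seen = pd.DataFrame({
    "seen_1": [True, False, True],
    "seen_4": [True, True, True],
})

# True counts as 1 and False as 0, so sum() gives the number of viewers
counts = seen.sum()

# mean() of a Boolean column gives the fraction of True values
shares = seen.mean()

print(counts)
print(shares)
```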
Let's do the same analysis again, but separate the plots by gender.
star_wars_males = star_wars[star_wars["Gender"] == "Male"]
star_wars_females = star_wars[star_wars["Gender"] == "Female"]
means_males = star_wars_males[star_wars_males.columns[9:15]].mean()
plt.bar(range(1,7), means_males)
plt.xlabel("Movie #")
plt.ylabel('Average Ranking')
plt.show()
means_females = star_wars_females[star_wars_females.columns[9:15]].mean()
plt.bar(range(1,7), means_females)
plt.xlabel("Movie #")
plt.ylabel('Average Ranking')
plt.show()
sums_males = star_wars_males[star_wars_males.columns[3:9]].sum()
plt.bar(range(1, 7), sums_males)
plt.xlabel("Movie #")
plt.ylabel('Total Respondents')
plt.show()
sums_females = star_wars_females[star_wars_females.columns[3:9]].sum()
plt.bar(range(1, 7), sums_females)
plt.xlabel("Movie #")
plt.ylabel('Total Respondents')
plt.show()
More males than females saw the prequel movies (#1-3), and males ranked them more favorably than females did. Both groups ranked the original movies (#4-6) favorably.
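The per-gender numbers can also be computed in one step with groupby instead of building a separate DataFrame for each group. This is a sketch of that alternative on made-up data, not the approach used above.

```python
import pandas as pd

# Hypothetical respondents with one "seen" flag and one ranking column
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "seen_1": [True, False, True, True],
    "ranking_1": [2.0, 4.0, 4.0, 6.0],
})

# Count viewers and average the ranking within each gender
by_gender = toy.groupby("Gender").agg({"seen_1": "sum", "ranking_1": "mean"})

print(by_gender)
```

Each row of `by_gender` is one gender, so the two groups can be compared side by side or plotted from a single table.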
Learning Summary
Python concepts explored: pandas, matplotlib.pyplot, data cleaning, string manipulation, bar plots
Python functions and methods used: .read_csv(), .columns, .notnull(), .map(), .dtypes, .rename(), .astype(), .mean(), .sum(), .xlabel(), .ylabel()
The files used for this project can be found in my GitHub repository.