In this project we will look at a Star Wars survey, 'star_wars.csv'. The project focuses on data cleaning, so that we end up with a data set ready for analysis. Let's begin by reading the CSV file and looking at the first couple of rows.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
star_wars = pd.read_csv('star_wars.csv', encoding='ISO-8859-1')
star_wars.head()
It looks like most of these columns are unnamed, but first we'll remove any row without a RespondentID.
star_wars = star_wars[star_wars['RespondentID'].notnull()]
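To illustrate the row filter above, here is a minimal sketch on a made-up frame. The name `toy` and all of its values are assumptions for illustration, not survey data.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the survey: the first row has no RespondentID.
toy = pd.DataFrame({
    "RespondentID": [np.nan, 1001.0, 1002.0],
    "Gender": ["Response", "Male", "Female"],
})

# notnull() returns a boolean Series; indexing with it keeps only the True rows.
mask = toy["RespondentID"].notnull()
toy_clean = toy[mask]

print(toy_clean)
```

The boolean mask has one entry per row, so the same pattern works for any column-based condition.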
Next we want to convert the "Yes" and "No" strings into booleans. We can use the .map() method along with a dictionary to replace the "Yes" string with True and the "No" string with False. Missing answers are treated as False.
yes_no = {"Yes": True, "No": False, np.nan:False}
for col in [
    "Have you seen any of the 6 films in the Star Wars franchise?",
    "Do you consider yourself to be a fan of the Star Wars film franchise?"
]:
    star_wars[col] = star_wars[col].map(yes_no)
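To see what .map() does with a dictionary, here is a minimal sketch on a made-up Series. The names `answers` and `yes_no_only` are illustrative assumptions. It shows that values without a matching key come back as NaN, which is why missing answers need explicit handling. The lambda with `dict.get` is one alternative to listing np.nan as a key.

```python
import numpy as np
import pandas as pd

# Made-up answers, including one missing value.
answers = pd.Series(["Yes", "No", np.nan, "Yes"])

# Dictionary without an entry for missing values.
yes_no_only = {"Yes": True, "No": False}

# Values that are not keys of the dictionary are mapped to NaN.
mapped = answers.map(yes_no_only)

# Alternative: fall back to False for anything that is not a key.
as_bool = answers.map(lambda v: yes_no_only.get(v, False))

print(mapped)
print(as_bool)
```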
Columns 4 to 9 hold string values naming the movie the respondent saw. Similar to how we cleaned columns 2-3, we want to convert these into booleans with the .map() method. In addition, we want to rename the columns so that they reference the true or false question.
true_false = {
    "Star Wars: Episode I The Phantom Menace": True,
    "Star Wars: Episode II Attack of the Clones": True,
    "Star Wars: Episode III Revenge of the Sith": True,
    "Star Wars: Episode IV A New Hope": True,
    "Star Wars: Episode V The Empire Strikes Back": True,
    "Star Wars: Episode VI Return of the Jedi": True,
    np.nan: False,
}
for col in star_wars.columns[3:9]:
    star_wars[col] = star_wars[col].map(true_false)
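As a hedged alternative to spelling out every film title, the sketch below assumes that any non-missing cell means the film was seen, and uses notnull() instead of .map(). It also builds the seen_N names with a dictionary comprehension. The frame `toy` and its shortened column header are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up two-column stand-in for the "films seen" block of the survey.
toy = pd.DataFrame({
    "Which films have you seen?": ["Star Wars: Episode I The Phantom Menace", np.nan],
    "Unnamed: 4": [np.nan, "Star Wars: Episode II Attack of the Clones"],
})

# Assumption: a non-missing cell means the respondent selected that film.
for col in toy.columns:
    toy[col] = toy[col].notnull()

# Build {"old name": "seen_1", ...} from the column order.
new_names = {old: f"seen_{i}" for i, old in enumerate(toy.columns, start=1)}
toy = toy.rename(columns=new_names)

print(toy)
```

This swaps in a different technique from the dictionary approach above. It is shorter, but it would also mark an unexpected string as True, so the explicit dictionary is the stricter check.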
# Change the column names with the .rename() method
star_wars = star_wars.rename(columns={
    'Which of the following Star Wars films have you seen? Please select all that apply.': "seen_1",
    "Unnamed: 4": "seen_2",
    "Unnamed: 5": "seen_3",
    "Unnamed: 6": "seen_4",
    "Unnamed: 7": "seen_5",
    "Unnamed: 8": "seen_6",
})
We've successfully cleaned up columns 1-9. Let's check the data types of the rest of the dataframe.
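Checking data types can be done with the .dtypes attribute. The sketch below uses a made-up frame (`toy` and its values are assumptions) to show how numbers stored as text stand out from real numeric columns. Such text columns are typically reported as object.

```python
import pandas as pd

# Made-up frame mixing booleans, numeric text, and floats.
toy = pd.DataFrame({
    "seen_1": [True, False],
    "ranking": ["3", "1"],            # numbers stored as text
    "RespondentID": [1001.0, 1002.0],
})

# One dtype per column; text columns are not reported as float64.
dtypes = toy.dtypes
print(dtypes)
```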
Columns 10 to 15 are movie ranking values. Again, we can change the column names using the .rename() method. In addition, columns 10-15 are currently listed as 'object', so we want to convert the values in these columns to float.
star_wars[star_wars.columns[9:15]] = star_wars[star_wars.columns[9:15]].astype(float)
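The conversion above can be tried on a small made-up frame first. The names `toy` and `cols` and the values are assumptions for illustration. The sketch shows that numeric strings become floats and that a missing value stays NaN.

```python
import numpy as np
import pandas as pd

# Made-up rankings stored as text, with one missing value.
toy = pd.DataFrame({
    "ranking_1": ["3", "1", np.nan],
    "ranking_2": ["2", "6", "4"],
})

cols = ["ranking_1", "ranking_2"]

# Convert several columns at once; NaN survives the conversion.
toy[cols] = toy[cols].astype(float)

print(toy.dtypes)
```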
star_wars = star_wars.rename(columns={
    'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.': "ranking_1",
    "Unnamed: 10": "ranking_2",
    "Unnamed: 11": "ranking_3",
    "Unnamed: 12": "ranking_4",
    "Unnamed: 13": "ranking_5",
    "Unnamed: 14": "ranking_6",
})
means = star_wars[star_wars.columns[9:15]].mean()
%matplotlib inline
plt.bar(range(1,7), means)
plt.xlabel("Movie #")
plt.ylabel('Average Ranking')
plt.show()
Column 10 contains the following string: 'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.'
So lower ranking values mean the survey respondents liked a movie more. From the chart, it looks like the older movies (#4-6) are ranked more favorably (lower average values) than the newer Star Wars movies (#1-3).
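Because a lower average means a better-liked film, the favorite is the column with the smallest mean. A minimal sketch with made-up rankings (the frame `rankings` and its numbers are assumptions, not survey results):

```python
import pandas as pd

# Made-up rankings from three respondents for three films.
rankings = pd.DataFrame({
    "ranking_1": [4.0, 5.0, 6.0],
    "ranking_4": [1.0, 2.0, 1.0],
    "ranking_5": [2.0, 1.0, 2.0],
})

means = rankings.mean()

# Lowest average rank = most liked film.
best = means.idxmin()

# Sort ascending so the best-liked film comes first.
ordered = means.sort_values()

print(best)
print(ordered)
```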
sums = star_wars[star_wars.columns[3:9]].sum()
plt.bar(range(1,7), sums)
plt.xlabel("Movie #")
plt.ylabel('Total Respondents')
plt.show()
The same pattern shows up here: more respondents saw the original movies (#4-6), and those movies were also ranked more favorably. Keep in mind that a lower average ranking means respondents liked the movie more.
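Summing works as a count here because True is treated as 1 and False as 0. A minimal sketch with made-up data (the frame `seen` is an assumption for illustration) that also shows the mean as the share of respondents:

```python
import pandas as pd

# Made-up "seen" flags for three respondents.
seen = pd.DataFrame({
    "seen_1": [True, False, True],
    "seen_5": [True, True, True],
})

# Sum of booleans = number of True values per column.
sums = seen.sum()

# Mean of booleans = fraction of respondents who saw the film.
share = seen.mean()

print(sums)
print(share)
```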
Let's do the same analysis again, but separate the plots by gender.
star_wars_males = star_wars[star_wars["Gender"] == "Male"]
star_wars_females = star_wars[star_wars["Gender"] == "Female"]
means_males = star_wars_males[star_wars_males.columns[9:15]].mean()
plt.bar(range(1,7), means_males)
plt.xlabel("Movie #")
plt.ylabel('Average Ranking')
plt.show()
means_females = star_wars_females[star_wars_females.columns[9:15]].mean()
plt.bar(range(1,7), means_females)
plt.xlabel("Movie #")
plt.ylabel('Average Ranking')
plt.show()
sums_males = star_wars_males[star_wars_males.columns[3:9]].sum()
plt.bar(range(1,7), sums_males)
plt.xlabel("Movie #")
plt.ylabel('Total Respondents')
plt.show()
sums_females = star_wars_females[star_wars_females.columns[3:9]].sum()
plt.bar(range(1,7), sums_females)
plt.xlabel("Movie #")
plt.ylabel('Total Respondents')
plt.show()
More males than females saw the prequel movies (#1-3), and males ranked them more favorably than females did. Both groups ranked the original movies (#4-6) favorably.
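The same split can also be sketched with groupby, which computes the per-gender figures in one step instead of filtering twice. This is an alternative to the approach used above, shown on a made-up frame (`toy` and its values are assumptions, not survey results).

```python
import pandas as pd

# Made-up respondents with one ranking column and one "seen" flag.
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female"],
    "ranking_1": [3.0, 5.0, 6.0, 4.0],
    "seen_1": [True, True, False, True],
})

# Average ranking per gender (lower = liked more).
mean_by_gender = toy.groupby("Gender")["ranking_1"].mean()

# Number of respondents per gender who saw the film.
seen_by_gender = toy.groupby("Gender")["seen_1"].sum()

print(mean_by_gender)
print(seen_by_gender)
```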
Learning Summary
Python concepts explored: pandas, matplotlib.pyplot, data cleaning, string manipulation, bar plots
Python functions and methods used: .read_csv(), .columns, .notnull(), .map(), .dtypes, .rename(), .astype(), .mean(), .sum(), .xlabel(), .ylabel()
The files used for this project can be found in my GitHub repository.