In this project, we will analyze several movie review websites using "fandango_score_comparison.csv". We will use descriptive statistics to compare Fandango with other review websites. We'll also use linear regression to predict Fandango review scores from other review scores.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
movies = pd.read_csv('fandango_score_comparison.csv')
movies.head()
First, we'll use a histogram to see the distribution of ratings for "Fandango_Stars" and "Metacritic_norm_round".
mc = movies['Metacritic_norm_round']
fd = movies['Fandango_Stars']
plt.hist(mc, 5)
plt.show()
plt.hist(fd, 5)
plt.show()
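One caveat with the two plots above: `plt.hist(x, 5)` picks five bins over each column's own range, so the two x-axes are not directly comparable. A minimal sketch of counting both sets of ratings on shared bin edges, using small made-up rating lists (not the real data) and assuming a 0 to 5 star scale:

```python
import numpy as np

# Hypothetical ratings for illustration only; not taken from the CSV.
fandango_like = [3.5, 4.0, 4.5, 4.5, 5.0, 4.0]
metacritic_like = [1.5, 2.5, 3.0, 3.5, 4.5, 2.0]

# Shared bin edges: one bin per whole star on an assumed 0-5 scale.
edges = np.arange(0, 6)  # 0, 1, 2, 3, 4, 5

# np.histogram returns (counts, edges); the last bin includes its right edge.
fd_counts, _ = np.histogram(fandango_like, bins=edges)
mc_counts, _ = np.histogram(metacritic_like, bins=edges)

print(fd_counts)
print(mc_counts)
```

Passing the same `edges` as the `bins` argument to `plt.hist` would draw both histograms on one comparable axis.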
It looks like Fandango has higher overall ratings than Metacritic, but looking at histograms alone isn't enough to show that. We can calculate the mean, median, and standard deviation for the two websites using pandas Series methods.
mean_fd = fd.mean()
mean_mc = mc.mean()
median_fd = fd.median()
median_mc = mc.median()
std_fd = fd.std()
std_mc = mc.std()
print("means", mean_fd, mean_mc)
print("medians",median_fd, median_mc)
print("std_devs",std_fd, std_mc)
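A detail worth knowing: the `.std()` method of a pandas Series defaults to the sample standard deviation (`ddof=1`), while `np.std` defaults to the population version (`ddof=0`), so the two can disagree on the same data. A small sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

# Made-up ratings for illustration only; not taken from the CSV.
ratings = pd.Series([3.0, 3.5, 4.0, 4.5, 5.0])

sample_std = ratings.std()                        # pandas default: ddof=1
population_std = np.std(ratings.to_numpy())       # numpy default: ddof=0
matched_std = np.std(ratings.to_numpy(), ddof=1)  # numpy, matching pandas

print(sample_std, population_std, matched_std)
```

The gap between the two shrinks as the number of rows grows, but it helps to state which one is being reported.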
A couple of things to note here:
Fandango's rating methods are not public, whereas Metacritic takes a weighted average of all the published critic scores.
The mean and the median for Fandango are much higher. My guess is that the scores are influenced by studios and inflated to get people on the website to watch the movies, although this data alone can't confirm that.
The standard deviation for Fandango is also lower because most of its ratings are clustered at the high end.
Metacritic, on the other hand, has a median of 3.0 and a mean of about 3. A mean and median that nearly coincide in the middle of the scale are what you would expect from a roughly symmetric distribution.
Let's make a scatter plot of Fandango against Metacritic to see whether the two sets of scores are related.
plt.scatter(fd, mc)
plt.show()
movies['fm_diff'] = fd - mc
movies['fm_diff'] = np.absolute(movies['fm_diff'])
dif_sort = movies['fm_diff'].sort_values(ascending=False)
movies.sort_values(by='fm_diff', ascending = False).head(5)
It looks like the difference can get as high as 4.0 or 3.0. We should try to calculate the correlation between the two websites. We can do this with the pearsonr() function from scipy.stats.
import scipy.stats as sci
# pearsonr returns the correlation coefficient and its p-value
r, p_value = sci.pearsonr(mc, fd)
print(r)
print(p_value)
If both movie review sites use similar methods for rating their movies, we should see a strong correlation. A low correlation suggests that the two websites rate movies very differently.
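To make the coefficient less of a black box, here is a minimal sketch of Pearson's r computed from its definition (the sum of products of deviations, divided by the square root of the product of the sums of squared deviations). The numbers are made up for illustration, and the result is checked against `np.corrcoef`:

```python
import numpy as np

# Made-up paired scores for illustration only; not taken from the CSV.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.5, 4.0, 4.0, 4.5, 4.5])

# Deviations of each value from its own mean.
dx = x - x.mean()
dy = y - y.mean()

# Pearson's r from the definition.
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Cross-check with numpy's correlation matrix (off-diagonal entry).
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)
```

A value near 1 or -1 indicates a strong linear relationship, and a value near 0 indicates a weak one.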
Doing a linear regression wouldn't be very accurate with a low correlation. However, let's do it for the sake of practice anyway.
m, b, r, p, stderr = sci.linregress(mc, fd)
# Predict with the fitted line y = m*x + b at x = 3.
pred_3 = m*3 + b
print(pred_3)
pred_1 = m*1 + b
print(pred_1)
pred_5 = m*5 + b
print(pred_5)
We can predict the Fandango score from the Metacritic score using linear regression. However, it is important to keep in mind that if the correlation is low, the model might not be very accurate.
x_pred = [1.0, 5.0]
y_pred = [pred_1, pred_5]
# Metacritic on the x-axis and Fandango on the y-axis, matching linregress(mc, fd)
plt.scatter(mc, fd)
plt.plot(x_pred, y_pred)
plt.show()
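For intuition about what a least-squares fit computes, the slope and intercept have closed forms: m = cov(x, y) / var(x) and b = mean(y) - m * mean(x). The sketch below applies them to made-up numbers (not the real data) and checks the result against `np.polyfit`. It also computes r squared, the share of the variance in y that the line explains, which is one way to quantify how accurate the model is:

```python
import numpy as np

# Made-up paired scores for illustration only; not taken from the CSV.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.5, 4.0, 4.0, 4.5, 4.5])

dx = x - x.mean()
dy = y - y.mean()

# Closed-form least-squares slope and intercept.
m = (dx * dy).sum() / (dx ** 2).sum()
b = y.mean() - m * x.mean()

# Cross-check with a degree-1 polynomial fit (returns slope, then intercept).
m_np, b_np = np.polyfit(x, y, 1)

# Squared correlation: fraction of variance in y explained by the line.
r_squared = np.corrcoef(x, y)[0, 1] ** 2

print(m, b, r_squared)
```

With a low r squared, predictions from the line stay close to the mean of y regardless of x, which is why a weakly correlated pair of scores gives a nearly flat regression line.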
Learning Summary
Concepts explored: pandas, descriptive statistics, numpy, matplotlib, scipy, correlations
Functions and methods used: .sort_values(), sci.pearsonr(), sci.linregress(), plt.hist(), plt.scatter(), np.absolute(), .mean(), .median(), .std()
The files used for this project can be found in my GitHub repository.