In this project, we'll look at 'bike_rental_hour.csv', a dataset containing the hourly count of rental bikes in the Capital Bikeshare system for the years 2011 and 2012. Using this dataset, we'll apply several machine learning algorithms to build a model that predicts the number of bike rentals.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
%matplotlib inline
bike_rentals = pd.read_csv("bike_rental_hour.csv")
bike_rentals.head()
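It's also worth a quick look at the dataset's size and column types before going further:

bike_rentals.info()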
Data Analysis and Feature Engineering
plt.hist(bike_rentals['cnt'])
bike_rentals['cnt'].describe()
Using a histogram, we've quickly plotted the distribution of the 'cnt' column, the total number of bike rentals for a particular hour of a day. We can see that the distribution is right-skewed. The 50th percentile, or median, is 142.
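We can back this up numerically: a positive skewness value confirms the right-skewed shape we see in the histogram.

bike_rentals['cnt'].skew()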
We can add a feature to the dataset. By splitting the day into four time brackets, we can create a new column, 'time_label'.
def assign_label(hour):
    # Time brackets: 1 = morning (7-12), 2 = afternoon (13-18),
    # 3 = evening (19-24), 4 = night (0-6)
    if hour > 6 and hour <= 12:
        return 1
    elif hour > 12 and hour <= 18:
        return 2
    elif hour > 18 and hour <= 24:
        return 3
    else:
        return 4
bike_rentals['time_label'] = bike_rentals['hr'].apply(assign_label)
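As a quick sanity check on the new feature, we can count how many rows fall into each time bracket:

bike_rentals['time_label'].value_counts()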
Next, let's take a look at the correlations in this dataset.
bike_rentals.corr()
correlations = bike_rentals.corr()
correlations['cnt']
Apart from 'casual' and 'registered', we aren't seeing very strong correlations with 'cnt'. The 'casual' and 'registered' columns are simply subcategories of the 'cnt' column; they leak information about the target, so we'll have to drop them. The 'dteday' column is just the date and can't be used directly in this machine learning exercise.
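We can verify the leakage directly: per the dataset's documentation, 'cnt' is the sum of 'casual' and 'registered', so this check should return True.

# 'cnt' should equal 'casual' + 'registered' for every row.
(bike_rentals['casual'] + bike_rentals['registered'] == bike_rentals['cnt']).all()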
columns = bike_rentals.columns.drop(['cnt', 'casual', 'dteday', 'registered'])
columns
Applying Machine Learning
In order to prepare for machine learning, we'll need to split the data into a training set and a testing set. We can use pandas' sample method to randomly select 80% of the rows and assign them to the training set, with math.floor computing the sample size. The remaining 20% then becomes the testing set.
We will use mean squared error (MSE) to evaluate these machine learning models. MSE works well since the target column 'cnt' is continuous.
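For reference, MSE is the mean of the squared differences between actual and predicted values. Here's a minimal sketch with made-up numbers:

# MSE = mean((actual - predicted)^2)
actual = np.array([10, 20, 30])
predicted = np.array([12, 18, 33])
((actual - predicted) ** 2).mean()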
import math
# Randomly sample 80% of the data and assign it to the training set.
eighty_percent_values = math.floor(bike_rentals.shape[0] * 0.8)
train = bike_rentals.sample(n=eighty_percent_values, random_state=1)
# The remaining 20% becomes the test set.
test = bike_rentals.drop(train.index)
# Verify the split covers the whole dataset.
train.shape[0] + test.shape[0] == bike_rentals.shape[0]
Let's start by trying a simple linear regression model and checking the error on both the testing set and the training set.
lr = LinearRegression()
lr.fit(train[columns], train['cnt'])
predictions_test = lr.predict(test[columns])
mse_test = mean_squared_error(test['cnt'], predictions_test)
mse_test
predictions_train = lr.predict(train[columns])
mse_train = mean_squared_error(train['cnt'], predictions_train)
mse_train
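Raw MSE values are hard to interpret on their own; taking the square root (RMSE) puts the error back into the same units as 'cnt':

np.sqrt(mse_test), np.sqrt(mse_train)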
Both the training set and the test set showed high error. The linear regression model is probably not the best fit for this dataset: linear regression works well when there are lots of continuous features, but many of the columns in this dataset are categorical rather than continuous. We can try a decision tree to see if we can improve our predictions.
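One common remedy for categorical columns, sketched here but not pursued further in this project, is to one-hot encode them with pandas (assuming 'season', 'weathersit', and 'time_label' are treated as categorical codes):

# Hypothetical preprocessing step: expand categorical columns into dummy variables.
dummy_features = pd.get_dummies(bike_rentals[columns],
                                columns=['season', 'weathersit', 'time_label'])
dummy_features.head()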
Let's start by using a single tree model.
tree = DecisionTreeRegressor(min_samples_leaf=5)
tree.fit(train[columns], train['cnt'])
predictions = tree.predict(test[columns])
mse = mean_squared_error(test['cnt'], predictions)
mse
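To gauge how much the tree overfits, we can also check its error on the training set, mirroring what we did for linear regression:

predictions_train = tree.predict(train[columns])
mean_squared_error(train['cnt'], predictions_train)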
As we can see, the decision tree model reduced our error significantly. We can further improve our results by using a forest of decision trees to reduce overfitting.
tree = RandomForestRegressor(min_samples_leaf=2, n_estimators=250)
tree.fit(train[columns], train['cnt'])
predictions = tree.predict(test[columns])
mse = mean_squared_error(test['cnt'], predictions)
mse
We specified values for the 'min_samples_leaf' and 'n_estimators' hyperparameters above; we can tune these values using a for loop.
mse_leaf = []
for i in range(1, 10):
    tree = RandomForestRegressor(min_samples_leaf=i, n_estimators=250)
    tree.fit(train[columns], train['cnt'])
    predictions = tree.predict(test[columns])
    mse = mean_squared_error(test['cnt'], predictions)
    mse_leaf.append(mse)
mse_leaf
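A quick plot makes the trend across leaf sizes easier to read:

plt.plot(range(1, 10), mse_leaf)
plt.xlabel('min_samples_leaf')
plt.ylabel('MSE')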
n_trees = [250, 500, 750]
mse_trees = []
for i in n_trees:
    tree = RandomForestRegressor(min_samples_leaf=1, n_estimators=i)
    tree.fit(train[columns], train['cnt'])
    predictions = tree.predict(test[columns])
    mse = mean_squared_error(test['cnt'], predictions)
    mse_trees.append(mse)
mse_trees
Using 750 trees and min_samples_leaf=1, we managed to lower the MSE slightly, down to 1812. The random forest regressor is a powerful tool; however, training a large number of trees inside a for loop takes a very long time.
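One way to cut the training time, sketched below, is scikit-learn's n_jobs parameter, which trains the trees in parallel across all available CPU cores:

# n_jobs=-1 uses every available core to fit the forest.
forest = RandomForestRegressor(min_samples_leaf=1, n_estimators=750, n_jobs=-1)
forest.fit(train[columns], train['cnt'])
mean_squared_error(test['cnt'], forest.predict(test[columns]))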
Learning Summary
Concepts explored: pandas, matplotlib, feature engineering, linear regression, decision trees, random forests, MSE
Functions, methods, and properties used: .hist(), .apply(), .corr(), .columns, .drop(), .sample(), .index, .floor(), .fit(), .predict(), mean_squared_error(), .append()
The files used for this project can be found in my GitHub repository.