In this project, we are going to apply machine learning to predict the price of a house using the 'AmesHousing.tsv' dataset. To do so, we'll transform the data and apply various feature engineering techniques.
We'll focus on the linear regression model and use RMSE as the error metric. First, let's explore the data.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
%matplotlib inline
pd.set_option('display.max_columns', 500)
data = pd.read_csv("AmesHousing.tsv", delimiter='\t')
print(data.shape)
print(len(str(data.shape))*'-')
print(data.dtypes.value_counts())
data.head()
Data Cleaning and Feature Engineering¶
This dataset has a total of 82 columns and 2930 rows. Since we'll be using a linear regression model, we can only use numerical values as features. One of the most important aspects of machine learning is understanding the features. Here are a few things we can do to clean up the data:
The 'Order' and 'PID' columns are not useful for machine learning as they are simply identification numbers.
It doesn't make much sense to use 'Year Built' and 'Year Remod/Add' directly in our model. Instead, we should generate a new column that captures how many years passed between the last remodel and the sale.
We want to drop columns with too many missing values; let's start with a 5% threshold for now.
We don't want to leak sale information to our model, since sale information will not be available when we actually use the model to estimate the price of a house.
#Create a new feature, 'years_to_sell'.
data['years_to_sell'] = data['Yr Sold'] - data['Year Remod/Add']
data = data[data['years_to_sell'] >= 0]
#Remove features that are not useful for machine learning.
data = data.drop(['Order', 'PID'], axis=1)
#Remove features that leak sales data.
data = data.drop(['Mo Sold', 'Yr Sold', 'Sale Type', 'Sale Condition'], axis=1)
#Drop columns with more than 5% missing values
is_null_counts = data.isnull().sum()
features_col = is_null_counts[is_null_counts < 2930*0.05].index
data = data[features_col]
data.head()
Since we are dealing with a dataset with a large number of columns, it's a good idea to split the data into two dataframes. We'll first work with the 'float' and 'int' columns, then move the 'object' columns to a separate dataframe. Once both dataframes contain only numerical values, we can combine them again and use the result as features for our linear regression model.
There are quite a few NA values in the numerical columns, so we'll fill them with the mode. Some of these columns are really categorical codes, so it wouldn't make sense to use the median or mean here.
numerical_cols = data.dtypes[data.dtypes != 'object'].index
numerical_data = data[numerical_cols]
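#data.mode() returns a dataframe whose first row holds each column's mode; .iloc[0] selects that row so fillna fills every column with its own mode.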
numerical_data = numerical_data.fillna(data.mode().iloc[0])
numerical_data.isnull().sum().sort_values(ascending = False)
Next, let's check the correlation of each numerical column with 'SalePrice'.
num_corr = numerical_data.corr()['SalePrice'].abs().sort_values(ascending = False)
num_corr
We can drop columns with a correlation below 0.4 for now. Later, we'll make this threshold an adjustable parameter in a function.
num_corr = num_corr[num_corr > 0.4]
high_corr_cols = num_corr.index
hi_corr_numerical_data = numerical_data[high_corr_cols]
For the 'object' (text) columns, we'll drop any column that has missing values.
text_cols = data.dtypes[data.dtypes == 'object'].index
text_data = data[text_cols]
text_null_counts = text_data.isnull().sum()
text_not_null_cols = text_null_counts[text_null_counts < 1].index
text_data = text_data[text_not_null_cols]
From the documentation, we want to convert any columns that are nominal into categories. 'MS SubClass' is a numerical column, but it should be treated as categorical.
For the text columns, we'll take the list of nominal columns from the documentation and use a for loop to search for matches.
nominal_cols = ['MS Zoning', 'Street', 'Alley', 'Land Contour', 'Lot Config', 'Neighborhood', 'Condition 1', 'Condition 2', 'Bldg Type', 'House Style', 'Overall Qual', 'Roof Style', 'Roof Matl', 'Exterior 1st', 'Exterior 2nd', 'Mas Vnr Type', 'Foundation', 'Heating', 'Central Air']
nominal_num_col = ['MS SubClass']
#Finds nominal columns in text_data
nominal_text_col = []
for col in nominal_cols:
    if col in text_data.columns:
        nominal_text_col.append(col)
nominal_text_col
We can then use this list to keep only the relevant columns in our text dataframe.
text_data = text_data[nominal_text_col]
for col in nominal_text_col:
    print(col)
    print(text_data[col].value_counts())
    print("-"*10)
Columns with too many categories can lead to overfitting once they are dummy coded. We'll remove any column with more than 10 unique categories, and later make this an adjustable parameter in our feature selection function.
nominal_text_col_unique = []
for col in nominal_text_col:
    if len(text_data[col].value_counts()) <= 10:
        nominal_text_col_unique.append(col)
text_data = text_data[nominal_text_col_unique]
Finally, we can use the pd.get_dummies function to create dummy columns for all the categorical columns.
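As a quick illustration of what dummy coding does (the values below are made-up examples, not rows from the dataset), each unique category becomes its own 0/1 indicator column:
#Hypothetical toy example: pd.get_dummies turns one categorical column into one indicator column per category.
toy = pd.Series(['Gable', 'Hip', 'Gable'], name='Roof Style')
pd.get_dummies(toy)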
#Create dummy columns for nominal text columns, then create a dataframe.
for col in text_data.columns:
    text_data[col] = text_data[col].astype('category')
categorical_text_data = pd.get_dummies(text_data)
categorical_text_data.head()
#Create dummy columns for nominal numerical columns, then create a dataframe.
for col in numerical_data.columns:
    if col in nominal_num_col:
        numerical_data[col] = numerical_data[col].astype('category')
categorical_numerical_data = pd.get_dummies(numerical_data.select_dtypes(include=['category']))
Using the pd.concat() function, we can combine the two dataframes of dummy columns.
categorical_data = pd.concat([categorical_text_data, categorical_numerical_data], axis=1)
We end up with one numerical dataframe, and one categorical dataframe. We can then combine them into one dataframe for machine learning.
hi_corr_numerical_data.head()
categorical_data.head()
final_data = pd.concat([hi_corr_numerical_data, categorical_data], axis=1)
Creating Functions with Adjustable Parameters¶
When we did our data cleaning, we decided to remove columns that had more than 5% missing values. We can incorporate this into a function as an adjustable parameter. In addition, this function will perform all the data cleaning operations explained above.
def transform_features(data, percent_missing=0.05):
    #Add relevant features: years between the last remodel and the sale, and years between construction and the sale.
    data['years_since_remod'] = data['Yr Sold'] - data['Year Remod/Add']
    data['years_to_sell'] = data['Yr Sold'] - data['Year Built']
    data = data[data['years_since_remod'] >= 0]
    data = data[data['years_to_sell'] >= 0]
    #Remove columns not useful for machine learning
    data = data.drop(['Order', 'PID', 'Year Built', 'Year Remod/Add'], axis=1)
    #Remove columns that leak sale data
    data = data.drop(['Mo Sold', 'Yr Sold', 'Sale Type', 'Sale Condition'], axis=1)
    #Drop columns with more missing values than the percent_missing threshold
    is_null_counts = data.isnull().sum()
    low_NaN_cols = is_null_counts[is_null_counts < len(data)*percent_missing].index
    transformed_data = data[low_NaN_cols]
    return transformed_data
For the feature engineering and selection step, we chose columns that had more than 0.4 correlation with 'SalePrice' and removed any columns with more than 10 categories.
Once again, I've combined all the work we've done previously into a function with adjustable parameters.
def select_features(data, corr_threshold=0.4, unique_threshold=10):
    #Fill missing values in the numerical columns with the mode.
    numerical_cols = data.dtypes[data.dtypes != 'object'].index
    numerical_data = data[numerical_cols]
    numerical_data = numerical_data.fillna(data.mode().iloc[0])
    #Drop text columns with any missing values.
    text_cols = data.dtypes[data.dtypes == 'object'].index
    text_data = data[text_cols]
    text_null_counts = text_data.isnull().sum()
    text_not_null_cols = text_null_counts[text_null_counts < 1].index
    text_data = text_data[text_not_null_cols]
    #Apply the correlation threshold parameter
    num_corr = numerical_data.corr()['SalePrice'].abs().sort_values(ascending = False)
    num_corr = num_corr[num_corr > corr_threshold]
    high_corr_cols = num_corr.index
    hi_corr_numerical_data = numerical_data[high_corr_cols]
    #Nominal columns from the documentation
    nominal_cols = ['MS Zoning', 'Street', 'Alley', 'Land Contour', 'Lot Config', 'Neighborhood', 'Condition 1', 'Condition 2', 'Bldg Type', 'House Style', 'Overall Qual', 'Roof Style', 'Roof Matl', 'Exterior 1st', 'Exterior 2nd', 'Mas Vnr Type', 'Foundation', 'Heating', 'Central Air']
    nominal_num_col = ['MS SubClass']
    #Find nominal columns present in text_data
    nominal_text_col = []
    for col in nominal_cols:
        if col in text_data.columns:
            nominal_text_col.append(col)
    text_data = text_data[nominal_text_col]
    #Apply the unique-category threshold parameter
    nominal_text_col_unique = []
    for col in nominal_text_col:
        if len(text_data[col].value_counts()) <= unique_threshold:
            nominal_text_col_unique.append(col)
    text_data = text_data[nominal_text_col_unique]
    #Set the nominal text columns to categorical, then create dummy columns
    for col in text_data.columns:
        text_data[col] = text_data[col].astype('category')
    categorical_text_data = pd.get_dummies(text_data)
    #Change any nominal numerical columns to categorical, then create dummy columns
    for col in numerical_data.columns:
        if col in nominal_num_col:
            numerical_data[col] = numerical_data[col].astype('category')
    categorical_numerical_data = pd.get_dummies(numerical_data.select_dtypes(include=['category']))
    #Combine the numerical and categorical features into the final dataframe
    final_data = pd.concat([hi_corr_numerical_data, categorical_text_data, categorical_numerical_data], axis=1)
    return final_data
Applying Machine Learning¶
Now we are ready to apply machine learning. We'll use the linear regression model from scikit-learn; linear regression should work well here since our target column, 'SalePrice', is a continuous value. We'll evaluate the model with RMSE as the error metric.
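As a reminder, RMSE is just the square root of the mean squared error, RMSE = sqrt((1/n) * Σ(actual − predicted)²), which keeps the error in the same units as 'SalePrice'.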
def train_and_test(data):
    train = data[0:1460]
    test = data[1460:]
    features = data.columns.drop(['SalePrice'])
    #train
    lr = LinearRegression()
    lr.fit(train[features], train['SalePrice'])
    #predict
    predictions = lr.predict(test[features])
    rmse = mean_squared_error(test['SalePrice'], predictions)**0.5
    return rmse
data = pd.read_csv("AmesHousing.tsv", delimiter='\t')
transformed_data = transform_features(data, percent_missing=0.05)
final_data = select_features(transformed_data, 0.4, 10)
result = train_and_test(final_data)
result
We've selected the first 1460 rows as the training set and the remaining rows as the test set. This is not really a good way to evaluate a model's performance, because the error changes as soon as we shuffle the data.
We can use KFold cross validation to split the data into K folds. Using the KFold class from scikit-learn, we can get the indices of the training and test rows for each fold, as in the quick sketch below.
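Here's a minimal sketch (assuming the final_data dataframe built above) of what KFold hands back: each split is a pair of train/test index arrays.
from sklearn.model_selection import KFold
#Minimal sketch: each split yields train/test index arrays that together cover the whole dataset.
kf_demo = KFold(n_splits=5, shuffle=True, random_state=2)
for train_index, test_index in kf_demo.split(final_data):
    print(len(train_index), len(test_index))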
from sklearn.model_selection import KFold
def train_and_test2(data, k=2):
    rf = LinearRegression()
    if k == 0:
        train = data[0:1460]
        test = data[1460:]
        features = data.columns.drop(['SalePrice'])
        #train
        rf.fit(train[features], train['SalePrice'])
        #predict
        predictions = rf.predict(test[features])
        rmse = mean_squared_error(test['SalePrice'], predictions)**0.5
        return rmse
    elif k == 1:
        train = data[:1460]
        test = data[1460:]
        features = data.columns.drop(['SalePrice'])
        rf.fit(train[features], train["SalePrice"])
        predictions_one = rf.predict(test[features])
        mse_one = mean_squared_error(test["SalePrice"], predictions_one)
        rmse_one = np.sqrt(mse_one)
        rf.fit(test[features], test["SalePrice"])
        predictions_two = rf.predict(train[features])
        mse_two = mean_squared_error(train["SalePrice"], predictions_two)
        rmse_two = np.sqrt(mse_two)
        return np.mean([rmse_one, rmse_two])
    else:
        kf = KFold(n_splits=k, shuffle=True, random_state = 2)
        rmse_list = []
        for train_index, test_index in kf.split(data):
            train = data.iloc[train_index]
            test = data.iloc[test_index]
            features = data.columns.drop(['SalePrice'])
            #train
            rf.fit(train[features], train['SalePrice'])
            #predict
            predictions = rf.predict(test[features])
            rmse = mean_squared_error(test['SalePrice'], predictions)**0.5
            rmse_list.append(rmse)
        return np.mean(rmse_list)
data = pd.read_csv("AmesHousing.tsv", delimiter='\t')
transformed_data = transform_features(data, percent_missing=0.05)
final_data = select_features(transformed_data, 0.4, 10)
results = []
for i in range(100):
    result = train_and_test2(final_data, k=i)
    results.append(result)
x = [i for i in range(100)]
y = results
plt.plot(x, y)
plt.xlabel('Kfolds')
plt.ylabel('RMSE')
print(results[99])
Our error is actually the lowest when k = 0. This is not very useful, though, because it only tells us how the model performs on the specific rows we picked out. Without validation, there is no way to be sure the model works well on any other set of data.
This is where cross validation is useful for evaluating model performance. We can see that the average RMSE goes down as we increase the number of folds. This makes sense, as the RMSE shown in the graph above is an average over the cross validation tests. A larger K means less bias towards overestimating the model's true error; as a trade-off, it requires a lot more computation time.
Learning Summary¶
Concepts explored: pandas, data cleaning, feature engineering, linear regression, hyperparameter tuning, RMSE, KFold cross validation
Functions and methods used: .dtypes, .value_counts(), .drop(), .isnull(), .sum(), .fillna(), .sort_values(), .corr(), .index, .append(), .get_dummies(), .astype(), .predict(), .fit(), KFold(), mean_squared_error()
The files used for this project can be found in my GitHub repository.