In this project we are going to look at 'imports-85.data'. This file contains the specifications of vehicles from 1985. For more information on the data set, click here.
We are going to explore the fundamentals of machine learning using the k-nearest neighbors algorithm from scikit-learn. First, we'll import the libraries we'll need.
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
%matplotlib inline
cars = pd.read_csv("imports-85.data")
cars.head()
It looks like this dataset does not include the column names. We'll have to add in the column names manually using the documentation here.
colnames = ['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration', 'num-of-doors', 'body-style',
'drive-wheels', 'engine-location', 'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type',
'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke', 'compression-rate', 'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price']
cars = pd.read_csv("imports-85.data", names=colnames)
cars.head()
Data Cleaning and Preparing the Features
Looks like we managed to fix the dataframe. The k-nearest neighbors algorithm uses a distance formula to determine the nearest neighbors, which means we can only use numerical columns for this machine learning algorithm. So we'll have to do a little bit of data cleaning.
Here are some of the issues with this dataframe:
- Missing values are represented by the string '?'.
- There are many non-numerical columns.
First, we'll replace the string value '?' with NaN. That way, we can use the .isnull() method to determine which columns have missing values.
Using the documentation, we can determine which columns are not numerical. Then we can drop those columns from the dataframe.
cars = cars.replace("?", np.nan)
to_drop = ["symboling", "make", "fuel-type", "aspiration", "num-of-doors", "body-style", "drive-wheels", "engine-location", "engine-type", "num-of-cylinders", "fuel-system", "engine-size"]
cars_num = cars.drop(to_drop, axis=1)
cars_num.head()
cars_num = cars_num.astype("float")
cars_num.isnull().sum()
We are going to use the machine learning algorithm to predict the price of a car. It doesn't make sense to keep rows with missing values in the 'price' column, so we'll just drop them entirely.
For the 'bore' and 'stroke' columns, we'll use the mean to fill in the missing values.
cars_num = cars_num.dropna(subset=["price"])
cars_num.isnull().sum()
cars_num = cars_num.fillna(cars_num.mean())
cars_num.isnull().sum()
cars_num.head()
The k-nearest neighbors algorithm uses the Euclidean distance to determine the closest neighbors.
$$ \text{Distance} = \sqrt{(q_1-p_1)^2+(q_2-p_2)^2+\cdots+(q_n-p_n)^2} $$
Here, q and p represent two rows and the subscripts represent the columns. However, each column has a different scale. For example, if we take row 2 and row 3, the peak RPM differs by 500 while the width differs by only 0.7. Without scaling, the algorithm will give extra weight to the difference in peak RPM.
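To make that concrete, here is a quick sketch (assuming the cleaned cars_num dataframe from above; the two rows are just for illustration) showing how much larger the raw peak-rpm gap is than the width gap:
# Sketch: compare the raw differences between two rows of cars_num.
# The exact values depend on which rows are compared; this only illustrates the scale gap.
row_a = cars_num.iloc[1]
row_b = cars_num.iloc[2]
print(abs(row_a['peak-rpm'] - row_b['peak-rpm']))  # difference on the order of hundreds
print(abs(row_a['width'] - row_b['width']))        # difference well under one
# Once squared inside the distance formula, the peak-rpm term would dominate.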
That is why it is important to normalize the features so they are on a comparable scale. After this normalization, every value falls between -1 and 1. For more information on feature scaling click here.
$$ x' = \frac{x - \text{mean}(x)}{\max(x) - \min(x)} $$
In pandas this would be:
$$ df' = \frac{df - df.mean()}{df.max() - df.min()}$$
Where df is any dataframe.
normalized_cars = (cars_num-cars_num.mean())/(cars_num.max()-cars_num.min())
normalized_cars['price'] = cars_num['price']
normalized_cars.head()
Applying Machine Learning
Suppose we have a dataframe named 'train' and a row named 'test'. The idea behind k-nearest neighbors is to find the k rows in 'train' with the lowest distance to 'test'. Then we average the target column of 'train' over those k rows and use that average as the prediction for 'test'.
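As a rough sketch of that idea (not the implementation we'll actually use), assuming 'train' is a dataframe, 'test' is a single row, and 'feature' and 'target' are hypothetical column names, it might look like this:
k = 5
# distance from every training row to the test row, using a single feature
distances = (train['feature'] - test['feature']).abs()
# index labels of the k training rows closest to the test row
nearest = distances.sort_values().index[:k]
# average the target column of those k rows to get the prediction
predicted_value = train.loc[nearest, 'target'].mean()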
We are going to write a function that uses the KNeighborsRegressor class from scikit-learn. This works a little differently: the class fits a model to the training dataset and performs regression based on the k nearest neighbors. More information on this can be found in the documentation.
#Returns the root mean squared error using KNN
def knn_train_test(features, target_col, df):
    #randomize sets
    np.random.seed(1)
    randomed_index = np.random.permutation(df.index)
    randomed_df = df.reindex(randomed_index)
    half_point = int(len(randomed_df)/2)
    #assign test and training sets
    train_df = randomed_df.iloc[0:half_point]
    test_df = randomed_df.iloc[half_point:]
    #training
    knn = KNeighborsRegressor()
    knn.fit(train_df[[features]], train_df[[target_col]])
    #test
    predictions = knn.predict(test_df[[features]])
    mse = mean_squared_error(test_df[[target_col]], predictions)
    rmse = mse**0.5
    return rmse
We can write a for loop and use the function for each column. That way, we can see the RMSE of each column.
features = normalized_cars.columns.drop('price')
rmse = {}
for item in features:
    rmse[item] = knn_train_test(item, 'price', normalized_cars)
results = pd.Series(rmse)
results.sort_values()
It looks like the 'horsepower' column has the lowest error. We should keep this ranking in mind when using the function with multiple features.
But first, let's modify the function to accept the k value (the number of neighbors) as a parameter. Then we can loop through a list of k values and features to determine which combination is optimal for our machine learning model.
def knn_train_test2(features, target_col, df, k_values):
    #randomize sets
    np.random.seed(1)
    randomed_index = np.random.permutation(df.index)
    randomed_df = df.reindex(randomed_index)
    half_point = int(len(randomed_df)/2)
    #assign test and training sets
    train_df = randomed_df.iloc[0:half_point]
    test_df = randomed_df.iloc[half_point:]
    k_rmse = {}
    #training
    for k in k_values:
        knn = KNeighborsRegressor(n_neighbors=k)
        knn.fit(train_df[[features]], train_df[[target_col]])
        #test
        predictions = knn.predict(test_df[[features]])
        mse = mean_squared_error(test_df[[target_col]], predictions)
        rmse = mse**0.5
        k_rmse[k] = rmse
    return k_rmse
#input k parameter as a list, use function to return a dictionary of dictionaries
k = [1, 3, 5, 7, 9]
features = normalized_cars.columns.drop('price')
feature_k_rmse = {}
for item in features:
    feature_k_rmse[item] = knn_train_test2(item, 'price', normalized_cars, k)
feature_k_rmse
best_features = {}
plt.figure(figsize=(10, 12))
for key, value in feature_k_rmse.items():
    x = list(value.keys())
    y = list(value.values())
    order = np.argsort(x)
    x_ordered = np.array(x)[order]
    y_ordered = np.array(y)[order]
    print(key)
    print('average_rmse: '+str(np.mean(y)))
    best_features[key] = np.mean(y)
    plt.plot(x_ordered, y_ordered, label=key)
plt.xlabel("K_value")
plt.ylabel("RMSE")
plt.legend()
plt.show()
This figure is a bit confusing to look at. A better approach is to sort best_features, the dictionary that maps each feature to its average RMSE.
sorted_features_list = sorted(best_features, key=best_features.get)
sorted_features_list
Now that we know which features have the lowest error, we can begin applying the function to multiple features at once.
def knn_train_test3(features, target_col, df):
    #randomize sets
    np.random.seed(0)
    randomed_index = np.random.permutation(df.index)
    randomed_df = df.reindex(randomed_index)
    half_point = int(len(randomed_df)/2)
    #assign test and training sets
    train_df = randomed_df.iloc[0:half_point]
    test_df = randomed_df.iloc[half_point:]
    #training
    knn = KNeighborsRegressor(n_neighbors=5)
    knn.fit(train_df[features], train_df[[target_col]])
    #test
    predictions = knn.predict(test_df[features])
    mse = mean_squared_error(test_df[[target_col]], predictions)
    rmse = mse**0.5
    return rmse
k_rmse_features = {}
best_two_features = sorted_features_list[0:2]
best_three_features = sorted_features_list[0:3]
best_four_features = sorted_features_list[0:4]
best_five_features = sorted_features_list[0:5]
k_rmse_features["best_two_rmse"] = knn_train_test3(best_two_features, 'price', normalized_cars)
k_rmse_features["best_three_rmse"] = knn_train_test3(best_three_features, 'price', normalized_cars)
k_rmse_features["best_four_rmse"] = knn_train_test3(best_four_features, 'price', normalized_cars)
k_rmse_features["best_five_rmse"] = knn_train_test3(best_five_features, 'price', normalized_cars)
k_rmse_features
It looks like using the best three features gave us the lowest RMSE.
Now, let's try varying the K values. We can further tune our machine learning model by finding the optimal K value to use.
def knn_train_test4(features, target_col, df, k_values):
    #randomize sets
    np.random.seed(0)
    randomed_index = np.random.permutation(df.index)
    randomed_df = df.reindex(randomed_index)
    half_point = int(len(randomed_df)/2)
    #assign test and training sets
    train_df = randomed_df.iloc[0:half_point]
    test_df = randomed_df.iloc[half_point:]
    k_rmse = {}
    #training
    for k in k_values:
        knn = KNeighborsRegressor(n_neighbors=k)
        knn.fit(train_df[features], train_df[[target_col]])
        #test
        predictions = knn.predict(test_df[features])
        mse = mean_squared_error(test_df[[target_col]], predictions)
        rmse = mse**0.5
        k_rmse[k] = rmse
    return k_rmse
#input k parameter as a list, use function to return a dictionary of dictionaries
k = list(range(1,25))
features = [best_three_features, best_four_features, best_five_features]
feature_k_rmse2 = {}
feature_k_rmse2["best_three_features"] = knn_train_test4(best_three_features, 'price', normalized_cars, k)
feature_k_rmse2["best_four_features"] = knn_train_test4(best_four_features, 'price', normalized_cars, k)
feature_k_rmse2["best_five_features"] = knn_train_test4(best_five_features, 'price', normalized_cars, k)
feature_k_rmse2
plt.figure(figsize=(6, 6))
for key, value in feature_k_rmse2.items():
    x = list(value.keys())
    y = list(value.values())
    plt.plot(x, y, label=key)
plt.xlabel("k_value")
plt.ylabel("RMSE")
plt.legend()
plt.show()
From the chart above, we can see that choosing the best three features with a k value of 2 gives us an RMSE of 2824. That is it for now, though; the goal of this project was to explore the fundamentals of machine learning.
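If we wanted to reuse that configuration, a minimal sketch would be to fit one final model on the best three features with n_neighbors=2 (note that this fits on all of normalized_cars, so the half-and-half split RMSE quoted above doesn't apply directly):
# Sketch: a final model using the configuration found above.
final_knn = KNeighborsRegressor(n_neighbors=2)
final_knn.fit(normalized_cars[best_three_features], normalized_cars['price'])
# Sanity check: predict the price of the first few cars.
final_knn.predict(normalized_cars[best_three_features].head())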
Learning Summary
Concepts explored: pandas, data cleaning, feature engineering, k-nearest neighbors, hyperparameter tuning, RMSE
Functions and methods used: .read_csv(), .replace(), .drop(), .astype(), isnull().sum(), .min(), .max(), .mean(), .permutation(), .reindex(), .iloc[], .fit(), .predict(), mean_squared_error(), .Series(), .sort_values(), .plot(), .legend()
The files used for this project can be found in my GitHub repository.