In this project we are going to look at 'imports-85.data', a file containing specifications of vehicles from 1985. For more information on the data set, click here.

We are going to explore the fundamentals of machine learning using the k-nearest neighbors algorithm from scikit-learn. First, we'll import the libraries we'll need.

In [1]:
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
cars = pd.read_csv("imports-85.data")

cars.head()
Out[2]:
3 ? alfa-romero gas std two convertible rwd front 88.60 ... 130 mpfi 3.47 2.68 9.00 111 5000 21 27 13495
0 3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500
1 1 ? alfa-romero gas std two hatchback rwd front 94.5 ... 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500
2 2 164 audi gas std four sedan fwd front 99.8 ... 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950
3 2 164 audi gas std four sedan 4wd front 99.4 ... 136 mpfi 3.19 3.40 8.0 115 5500 18 22 17450
4 2 ? audi gas std two sedan fwd front 99.8 ... 136 mpfi 3.19 3.40 8.5 110 5500 19 25 15250

5 rows × 26 columns

It looks like this dataset does not include the column names. We'll have to add in the column names manually using the documentation here.

In [3]:
colnames = ['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration', 'num-of-doors', 'body-style', 
        'drive-wheels', 'engine-location', 'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type', 
        'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke', 'compression-rate', 'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price']


cars = pd.read_csv("imports-85.data", names=colnames)
In [4]:
cars.head()
Out[4]:
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base ... engine-size fuel-system bore stroke compression-rate horsepower peak-rpm city-mpg highway-mpg price
0 3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 13495
1 3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500
2 1 ? alfa-romero gas std two hatchback rwd front 94.5 ... 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500
3 2 164 audi gas std four sedan fwd front 99.8 ... 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950
4 2 164 audi gas std four sedan 4wd front 99.4 ... 136 mpfi 3.19 3.40 8.0 115 5500 18 22 17450

5 rows × 26 columns

Data Cleaning and Preparing the Features


Looks like we managed to fix the dataframe. The k-nearest neighbors algorithm uses a distance metric to determine the nearest neighbors, which means we can only use numerical columns for this machine learning algorithm. So we'll have to do a little bit of data cleaning.

Here are some of the issues with this dataframe:

  • Missing values are encoded with the string '?'.
  • There are many non-numerical columns.

First, we'll replace the string value '?' with NaN. That way, we can use the .isnull() method to determine which columns have missing values.

Using the documentation, we can determine which columns are non-numerical. Then we can drop them from the dataframe (along with a couple of numeric columns, 'symboling' and 'engine-size', that we won't use here).
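As a quick optional sanity check (an illustrative aside, not a cell from the original notebook), we could also inspect the dtypes: any column pandas read in as object is either genuinely categorical or a numeric column polluted by the '?' placeholder.

# Optional check: object-dtype columns are either categorical or contain '?'.
print(cars.dtypes.value_counts())
print(cars.select_dtypes(include="object").columns.tolist())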

In [5]:
cars = cars.replace("?", np.nan)
In [6]:
to_drop = ["symboling", "make", "fuel-type", "aspiration", "num-of-doors", "body-style", "drive-wheels", "engine-location", "engine-type", "num-of-cylinders", "fuel-system", "engine-size"]

cars_num = cars.drop(to_drop, axis=1)
In [7]:
cars_num.head()
Out[7]:
normalized-losses wheel-base length width height curb-weight bore stroke compression-rate horsepower peak-rpm city-mpg highway-mpg price
0 NaN 88.6 168.8 64.1 48.8 2548 3.47 2.68 9.0 111 5000 21 27 13495
1 NaN 88.6 168.8 64.1 48.8 2548 3.47 2.68 9.0 111 5000 21 27 16500
2 NaN 94.5 171.2 65.5 52.4 2823 2.68 3.47 9.0 154 5000 19 26 16500
3 164 99.8 176.6 66.2 54.3 2337 3.19 3.40 10.0 102 5500 24 30 13950
4 164 99.4 176.6 66.4 54.3 2824 3.19 3.40 8.0 115 5500 18 22 17450
In [8]:
cars_num = cars_num.astype("float")
cars_num.isnull().sum()
Out[8]:
normalized-losses    41
wheel-base            0
length                0
width                 0
height                0
curb-weight           0
bore                  4
stroke                4
compression-rate      0
horsepower            2
peak-rpm              2
city-mpg              0
highway-mpg           0
price                 4
dtype: int64

We are going to use the machine learning algorithm to predict the price of a car, so it doesn't make sense to keep rows with missing values in the 'price' column. We'll drop those rows entirely.

For the remaining columns with missing values ('normalized-losses', 'bore', 'stroke', 'horsepower', and 'peak-rpm'), we'll fill in the missing values with each column's mean.

In [9]:
cars_num = cars_num.dropna(subset=["price"])
cars_num.isnull().sum()
Out[9]:
normalized-losses    37
wheel-base            0
length                0
width                 0
height                0
curb-weight           0
bore                  4
stroke                4
compression-rate      0
horsepower            2
peak-rpm              2
city-mpg              0
highway-mpg           0
price                 0
dtype: int64
In [10]:
cars_num = cars_num.fillna(cars_num.mean())
cars_num.isnull().sum()
Out[10]:
normalized-losses    0
wheel-base           0
length               0
width                0
height               0
curb-weight          0
bore                 0
stroke               0
compression-rate     0
horsepower           0
peak-rpm             0
city-mpg             0
highway-mpg          0
price                0
dtype: int64
In [11]:
cars_num.head()
Out[11]:
normalized-losses wheel-base length width height curb-weight bore stroke compression-rate horsepower peak-rpm city-mpg highway-mpg price
0 122.0 88.6 168.8 64.1 48.8 2548.0 3.47 2.68 9.0 111.0 5000.0 21.0 27.0 13495.0
1 122.0 88.6 168.8 64.1 48.8 2548.0 3.47 2.68 9.0 111.0 5000.0 21.0 27.0 16500.0
2 122.0 94.5 171.2 65.5 52.4 2823.0 2.68 3.47 9.0 154.0 5000.0 19.0 26.0 16500.0
3 164.0 99.8 176.6 66.2 54.3 2337.0 3.19 3.40 10.0 102.0 5500.0 24.0 30.0 13950.0
4 164.0 99.4 176.6 66.4 54.3 2824.0 3.19 3.40 8.0 115.0 5500.0 18.0 22.0 17450.0

The k-nearest neighbors algorithm uses the Euclidean distance to determine the closest neighbors.

$$ Distance = \sqrt{(q_1-p_1)^2+(q_2-p_2)^2+\dots+(q_n-p_n)^2} $$

Here q and p represent two rows, and the subscripts represent columns. However, each column has a different scale. For example, if we take rows 2 and 3, the peak RPM differs by 500 while the width differs by only 0.7, so the algorithm will give extra weight to the difference in peak RPM.
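To make this concrete, here is a quick illustrative sketch (not a cell from the notebook above) that measures how much each column contributes to the squared distance between rows 2 and 3 before any scaling; large-scale columns such as peak-rpm and curb-weight dominate the result.

# How much does each column contribute to the raw squared distance
# between rows 2 and 3, before any feature scaling?
row_a = cars_num.iloc[2].drop("price")
row_b = cars_num.iloc[3].drop("price")
squared_diff = (row_a - row_b) ** 2
print(np.sqrt(squared_diff.sum()))                                       # overall distance
print((squared_diff / squared_diff.sum()).sort_values(ascending=False))  # share per column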

That is why it is important to rescale the features. Here we'll use mean normalization, which leaves every column with values between -1 and 1. For more information on feature scaling, click here.

$$ x' = \frac{x - \text{mean}(x)}{\max(x) - \min(x)} $$

In pandas this would be:

$$ df' = \frac{df - df.mean()}{df.max() - df.min()}$$

Where df is any dataframe.

In [12]:
normalized_cars = (cars_num-cars_num.mean())/(cars_num.max()-cars_num.min())
normalized_cars['price'] = cars_num['price']
In [13]:
normalized_cars.head()
Out[13]:
normalized-losses wheel-base length width height curb-weight bore stroke compression-rate horsepower peak-rpm city-mpg highway-mpg price
0 0.000000 -0.297289 -0.080612 -0.152911 -0.413889 -0.002974 0.099492 -0.274716 -0.072767 0.035528 -0.047995 -0.116086 -0.097015 13495.0
1 0.000000 -0.297289 -0.080612 -0.152911 -0.413889 -0.002974 0.099492 -0.274716 -0.072767 0.035528 -0.047995 -0.116086 -0.097015 16500.0
2 0.000000 -0.125277 -0.044791 -0.033253 -0.113889 0.103698 -0.464793 0.101474 -0.072767 0.236463 -0.047995 -0.171642 -0.123331 16500.0
3 0.219895 0.029242 0.035806 0.026577 0.044444 -0.084820 -0.100508 0.068141 -0.010267 -0.006528 0.156087 -0.032753 -0.018068 13950.0
4 0.219895 0.017580 0.035806 0.043671 0.044444 0.104086 -0.100508 0.068141 -0.135267 0.054220 0.156087 -0.199420 -0.228594 17450.0

Applying Machine Learning


Suppose we have a dataframe named 'train' and a row named 'test'. The idea behind k-nearest neighbors is to find the k rows from 'train' with the lowest distance to 'test'. Then we average the target column of 'train' over those k rows and use that average as the prediction for 'test'.
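Before reaching for scikit-learn, here is a minimal hand-rolled sketch of that idea; predict_one is just an illustrative helper (not something we'll reuse below), assuming 'train' is a dataframe and 'test_row' is a single row.

# Hand-rolled k-nearest neighbors: measure the distance from one test row to
# every training row, keep the k closest, and average their target values.
def predict_one(train, test_row, feature_cols, target_col, k=5):
    squared_diffs = (train[feature_cols] - test_row[feature_cols]) ** 2
    distances = np.sqrt(squared_diffs.sum(axis=1))
    nearest = distances.nsmallest(k).index
    return train.loc[nearest, target_col].mean()

# Example usage with the normalized data:
# predict_one(normalized_cars.iloc[:100], normalized_cars.iloc[150],
#             ['horsepower', 'curb-weight'], 'price')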

We are going to write a function that uses the KNeighborsRegressor class from scikit-learn. This works a little differently: the class builds a model fitted to the training dataset and performs regression using the k nearest neighbors. More information on this can be found in the documentation.

In [14]:
#Returns the root mean squared error using KNN
def knn_train_test(features, target_col, df):
    #randomize sets
    np.random.seed(1)
    randomed_index = np.random.permutation(df.index)
    randomed_df = df.reindex(randomed_index)
    
    half_point = int(len(randomed_df)/2)
    
    #assign test and training sets
    train_df = randomed_df.iloc[0:half_point]
    test_df = randomed_df.iloc[half_point:]
    
    #training
    knn = KNeighborsRegressor()
    knn.fit(train_df[[features]], train_df[[target_col]])
    
    #test
    predictions = knn.predict(test_df[[features]])
    mse = mean_squared_error(test_df[[target_col]], predictions)
    rmse = mse**0.5
    return rmse

We can write a for loop that calls the function on each column. That way, we can see the RMSE for each feature.

In [15]:
features = normalized_cars.columns.drop('price')
rmse = {}
for item in features:
    rmse[item] = knn_train_test(item, 'price', normalized_cars)

results = pd.Series(rmse)
results.sort_values()
Out[15]:
horsepower           4010.414152
curb-weight          4401.118255
highway-mpg          4652.697833
width                4908.609914
city-mpg             4973.940485
length               5429.900973
wheel-base           5460.787788
compression-rate     6610.812153
bore                 6806.695830
normalized-losses    7304.373172
peak-rpm             7678.470979
height               7842.199226
stroke               8005.611387
dtype: float64

It looks like the 'horsepower' column has the lowest error. We should keep this ranking in mind when using the function with multiple features.

But first, let's modify the function to take the k value (the number of neighbors) as a parameter. Then we can loop through a list of k values and features to determine which combination works best for our machine learning model.

In [16]:
def knn_train_test2(features, target_col, df, k_values):
    #randomize sets
    np.random.seed(1)
    randomed_index = np.random.permutation(df.index)
    randomed_df = df.reindex(randomed_index)
    
    half_point = int(len(randomed_df)/2)
    
    #assign test and training sets
    train_df = randomed_df.iloc[0:half_point]
    test_df = randomed_df.iloc[half_point:]
    
    k_rmse = {}
    #training
    for k in k_values:
        knn = KNeighborsRegressor(n_neighbors=k)
        knn.fit(train_df[[features]], train_df[[target_col]])
        
        #test
        predictions = knn.predict(test_df[[features]])
        mse = mean_squared_error(test_df[[target_col]], predictions)
        rmse = mse**0.5
        k_rmse[k] = rmse
    return k_rmse
In [17]:
#input k parameter as a list, use function to return a dictionary of dictionaries
k = [1, 3, 5, 7, 9]
features = normalized_cars.columns.drop('price')
feature_k_rmse = {}

for item in features:
    feature_k_rmse[item] = knn_train_test2(item, 'price', normalized_cars, k)
    
feature_k_rmse
Out[17]:
{'bore': {1: 8602.58848450066,
  3: 6984.239489480916,
  5: 6806.695830075582,
  7: 6939.105845651802,
  9: 6915.297375013411},
 'city-mpg': {1: 5863.190943471308,
  3: 4672.77285307275,
  5: 4973.94048466108,
  7: 5413.390882677539,
  9: 5277.1766643494775},
 'compression-rate': {1: 8087.205346523092,
  3: 7375.063685578359,
  5: 6610.812153159129,
  7: 6732.801282941515,
  9: 7024.485525463435},
 'curb-weight': {1: 5288.0195725810245,
  3: 5022.318011757233,
  5: 4401.118254793124,
  7: 4330.6701276238755,
  9: 4633.425879994758},
 'height': {1: 8942.012951995952,
  3: 8378.23385277286,
  5: 7842.199225717336,
  7: 7709.0699416548505,
  9: 7777.1734491607085},
 'highway-mpg': {1: 6022.866724754784,
  3: 4671.390389789466,
  5: 4652.697832525993,
  7: 4817.230104360727,
  9: 5261.877043557105},
 'horsepower': {1: 4170.054848037801,
  3: 3985.1389178696736,
  5: 4010.4141521891734,
  7: 4351.268271181572,
  9: 4514.504641478055},
 'length': {1: 4611.990241761035,
  3: 5129.672039752984,
  5: 5429.900972639673,
  7: 5311.883616635263,
  9: 5383.054514833446},
 'normalized-losses': {1: 7829.153502413683,
  3: 7515.021862294153,
  5: 7304.373172258108,
  7: 7634.134919298568,
  9: 7682.244506601594},
 'peak-rpm': {1: 9511.480067750124,
  3: 8537.550899973421,
  5: 7678.470978516542,
  7: 7520.322843484608,
  9: 7364.560980451443},
 'stroke': {1: 9116.495955406906,
  3: 7338.68466990294,
  5: 8005.611386699424,
  7: 7788.91860301835,
  9: 7702.038219213702},
 'wheel-base': {1: 4493.734068810494,
  3: 5208.39331165465,
  5: 5460.78778823338,
  7: 5448.173408324034,
  9: 5738.621574471594},
 'width': {1: 4559.257297950061,
  3: 4648.149766156945,
  5: 4908.609914413773,
  7: 4781.944236558163,
  9: 4719.070452207012}}
In [18]:
best_features = {}
plt.figure(figsize=(10, 12))

for key, value in feature_k_rmse.items():
    x = list(value.keys())
    y = list(value.values())
    
    order = np.argsort(x)
    x_ordered = np.array(x)[order]
    y_ordered = np.array(y)[order]
    print(key)
    print('average_rmse: '+str(np.mean(y)))
    best_features[key] = np.mean(y)

    plt.plot(x_ordered, y_ordered, label=key)
    plt.xlabel("K_value")
    plt.ylabel("RMSE")
plt.legend()
plt.show()
horsepower
average_rmse: 4206.276166151255
wheel-base
average_rmse: 5269.94203029883
width
average_rmse: 4723.406333457191
peak-rpm
average_rmse: 8122.477154035228
city-mpg
average_rmse: 5240.0943656464315
stroke
average_rmse: 7990.349766848265
length
average_rmse: 5173.30027712448
bore
average_rmse: 7249.585404944475
height
average_rmse: 8129.737884260341
curb-weight
average_rmse: 4735.110369350003
highway-mpg
average_rmse: 5085.212418997615
compression-rate
average_rmse: 7166.0735987331045
normalized-losses
average_rmse: 7592.985592573221

This figure is a bit confusing to look at. A better way is to sort best_features, the dictionary that maps each feature to its average RMSE.

In [19]:
sorted_features_list = sorted(best_features, key=best_features.get)
sorted_features_list
Out[19]:
['horsepower',
 'width',
 'curb-weight',
 'highway-mpg',
 'length',
 'city-mpg',
 'wheel-base',
 'compression-rate',
 'bore',
 'normalized-losses',
 'stroke',
 'peak-rpm',
 'height']

Now that we know which features have the lowest error, we can begin applying the function to multiple features at once.

In [20]:
def knn_train_test3(features, target_col, df):
    #randomize sets
    np.random.seed(0)
    randomed_index = np.random.permutation(df.index)
    randomed_df = df.reindex(randomed_index)
    
    half_point = int(len(randomed_df)/2)
    
    #assign test and training sets
    train_df = randomed_df.iloc[0:half_point]
    test_df = randomed_df.iloc[half_point:]
    
    #training
    knn = KNeighborsRegressor(n_neighbors=5)
    knn.fit(train_df[features], train_df[[target_col]])
    #test
    predictions = knn.predict(test_df[features])
    mse = mean_squared_error(test_df[[target_col]], predictions)
    rmse = mse**0.5
    return rmse
In [21]:
k_rmse_features ={}

best_two_features = sorted_features_list[0:2]
best_three_features = sorted_features_list[0:3]
best_four_features = sorted_features_list[0:4]
best_five_features = sorted_features_list[0:5]


k_rmse_features["best_two_rmse"]  = knn_train_test3(best_two_features, 'price', normalized_cars)
k_rmse_features["best_three_rmse"] = knn_train_test3(best_three_features, 'price', normalized_cars)
k_rmse_features["best_four_rmse"] = knn_train_test3(best_four_features, 'price', normalized_cars)
k_rmse_features["best_five_rmse"] = knn_train_test3(best_five_features, 'price', normalized_cars)
In [22]:
k_rmse_features
Out[22]:
{'best_five_rmse': 3533.7489988020734,
 'best_four_rmse': 3404.6909417321376,
 'best_three_rmse': 3214.9121121904577,
 'best_two_rmse': 3635.0424706141075}

It looks like using the best three features gave us the lowest RMSE.

Now, let's try varying the k value. We can further tune our machine learning model by finding the optimal k to use.

In [23]:
def knn_train_test4(features, target_col, df, k_values):
    #randomize sets
    np.random.seed(0)
    randomed_index = np.random.permutation(df.index)
    randomed_df = df.reindex(randomed_index)
    
    half_point = int(len(randomed_df)/2)
    
    #assign test and training sets
    train_df = randomed_df.iloc[0:half_point]
    test_df = randomed_df.iloc[half_point:]
    
    k_rmse = {}
    #training
    for k in k_values:
        knn = KNeighborsRegressor(n_neighbors=k)
        knn.fit(train_df[features], train_df[[target_col]])
        #test
        predictions = knn.predict(test_df[features])
        mse = mean_squared_error(test_df[[target_col]], predictions)
        rmse = mse**0.5
        k_rmse[k] = rmse
    return k_rmse
In [24]:
#input k parameter as a list, use function to return a dictionary of dictionaries
k = list(range(1,25))
features = [best_three_features, best_four_features, best_five_features]
feature_k_rmse2 = {}
feature_k_rmse2["best_three_features"] = knn_train_test4(best_three_features, 'price', normalized_cars, k)
feature_k_rmse2["best_four_features"] = knn_train_test4(best_four_features, 'price', normalized_cars, k)
feature_k_rmse2["best_five_features"] = knn_train_test4(best_five_features, 'price', normalized_cars, k)
In [25]:
feature_k_rmse2
Out[25]:
{'best_five_features': {1: 2925.271682398335,
  2: 3052.8623032449823,
  3: 3142.306197305374,
  4: 3461.561181031581,
  5: 3533.7489988020734,
  6: 3374.173588712601,
  7: 3375.877721539395,
  8: 3324.876540782833,
  9: 3312.502627933449,
  10: 3366.851324397034,
  11: 3447.2822267103743,
  12: 3496.4998131087186,
  13: 3547.2160454903924,
  14: 3615.6147456955036,
  15: 3579.8430331574878,
  16: 3678.503216850985,
  17: 3750.96940429519,
  18: 3815.3901236791603,
  19: 3851.386198466123,
  20: 3939.9382237297746,
  21: 3975.0410078767027,
  22: 4005.3349453198225,
  23: 4054.671493676685,
  24: 4106.550851309336},
 'best_four_features': {1: 2870.800286876242,
  2: 2924.256000834373,
  3: 3217.2983830519292,
  4: 3392.3729838615423,
  5: 3404.6909417321376,
  6: 3532.1939129716366,
  7: 3523.454893817346,
  8: 3405.9189129672363,
  9: 3400.2247346580157,
  10: 3549.4612599577126,
  11: 3539.83054253496,
  12: 3562.1876094096256,
  13: 3675.4563993723814,
  14: 3770.166928307887,
  15: 3818.056456588688,
  16: 3804.305744235327,
  17: 3859.410326582298,
  18: 3863.6635808098727,
  19: 3905.7831606401637,
  20: 3927.4475182570377,
  21: 3956.7797179695103,
  22: 3966.0139478992855,
  23: 4016.972042009795,
  24: 4025.3723801365027},
 'best_three_features': {1: 2879.8872399542806,
  2: 2824.3660223522415,
  3: 2869.933838736334,
  4: 3141.137015378538,
  5: 3214.9121121904577,
  6: 3280.710541238062,
  7: 3421.1652350874347,
  8: 3379.778228190607,
  9: 3464.5878589864906,
  10: 3575.3981609613693,
  11: 3617.9433332156923,
  12: 3630.7516137575535,
  13: 3632.7823295462845,
  14: 3754.4036203105343,
  15: 3837.677054164043,
  16: 3841.760205882436,
  17: 3858.9773285906977,
  18: 3844.3626311824146,
  19: 3875.9053428113702,
  20: 3869.5372272718314,
  21: 3901.5634407576663,
  22: 3926.0570825223363,
  23: 4019.34604419612,
  24: 4048.5637998085253}}
In [26]:
plt.figure(figsize=(6, 6))

for key, value in feature_k_rmse2.items():
    
    x = list(value.keys())
    y = list(value.values())
    plt.plot(x, y, label=key)
    plt.xlabel("k_value")
    plt.ylabel("RMSE")
    
plt.legend()
plt.show()

From the chart above, we can see that choosing the best three features with a k value of 2 gives us an RMSE of about 2824. That's it for now, though; the goal of this project is to explore the fundamentals of machine learning.
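For completeness, the winning configuration can be reproduced with a single call to the helper defined earlier (an illustrative sketch, not a cell from the original notebook).

# Reproduce the best configuration found above: best three features, k=2.
final_rmse = knn_train_test4(best_three_features, 'price', normalized_cars, [2])
print(final_rmse)  # should match the value of roughly 2824 reported above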


Learning Summary

Concepts explored: pandas, data cleaning, feature engineering, k-nearest neighbors, hyperparameter tuning, RMSE

Functions and methods used: .read_csv(), .replace(), .drop(), .astype(), isnull().sum(), .min(), .max(), .mean(), .permutation(), .reindex(), .iloc[], .fit(), .predict(), mean_squared_error(), .Series(), .sort_values(), .plot(), .legend()

The files used for this project can be found in my GitHub repository.


