In this project we are going to explore the machine learning workflow. Specifically, we'll be looking at the famous Titanic dataset. This project is an extended version of a guided project from Dataquest; you can check them out here.
The goal of this project is to accurately predict whether a passenger survived the sinking of the Titanic. We must predict a 0 or 1 value for the 'Survived' column in the test dataset, then submit the file of predictions to Kaggle.
For more information on this competition, click here.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
%matplotlib inline
train = pd.read_csv('train.csv')
holdout = pd.read_csv('test.csv')
Data Exploration¶
Before we modify the dataframe and generate new columns, the first step is to plan out what needs to be done. We can explore the data with the basic pandas functions. In addition, we'll refer to the data information given by kaggle.
holdout.head()
holdout.dtypes
Here are some of my initial findings:
- PassengerId is a numerical column containing an identification number, which is not useful for machine learning.
- Pclass is a numerical column with three values: 1, 2, and 3.
- Name is a text column that includes the titles of the passengers. This is going to be useful in determining the social status of the passengers.
- Sex is a text column with two values: male, and female.
- Age is a numerical column, but can be converted into age categories.
- SibSp is a numerical column showing the number of siblings/spouses aboard the Titanic.
- Parch is a numerical column showing the number of parents/children aboard the Titanic.
- Ticket is a text column, but it is just a ticket serial number, which is not useful for machine learning.
- Fare is a numerical column, but can be converted into price categories.
- Cabin is a text column with cabin numbers.
- Embarked is a text column with information of the boarding location: C = Cherbourg, Q = Queenstown, S = Southampton.
Additional exploration will be required for most of these columns, but we can safely drop 'PassengerId' and 'Ticket'.
Pclass¶
This is a categorical column with 3 values. We'll want to create 3 dummy columns with binary values (0 and 1) for this variable. I'll demonstrate this later in the feature engineering section.
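As a quick preview (a minimal sketch; the full implementation comes later in the feature engineering section), this is what the dummy columns for 'Pclass' look like in pandas:
#Each class value becomes its own binary column.
pd.get_dummies(train['Pclass'], prefix='Pclass').head()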
Now, let's take a look at this column relative to the 'Survived' column.
sns.countplot(train['Pclass'], hue=train['Survived'])
plt.show()
print('correlation = ' + str(train.corr()['Survived']['Pclass']))
train['Survived'].groupby(train['Pclass']).mean()
We are seeing a negative correlation between 'Pclass' and 'Survived'. Passengers with Pclass = 3 have a much lower survival rate than passengers with Pclass = 1.
Name¶
This is a text column containing information on the passenger's title. We'll want to split the title out of the string and create categorical columns.
train['Name'].head(5)
The structure for each string is { LAST NAME }, { TITLE } { FIRST NAME } { MIDDLE NAME }.
We can split each string at the comma and then select the title.
titles = train['Name'].apply(lambda x: x.split(', ')[1]).apply(lambda x: x.split(' ')[0])
train['Survived'].groupby(titles).mean().sort_values()
This is interesting: the survival rate for passengers with the title 'Mr.' is 0.157, while the survival rate for passengers with the title 'Mrs.' is 0.792. There are too many categories for this column, so we'll have to condense these titles down into a few categories later.
Sex¶
This is another categorical column with two possible values. My plan is to transform it into a binary numerical column: 0 for male and 1 for female.
print(train['Survived'].groupby(train['Sex']).mean())
sns.countplot(train['Sex'], hue=train['Survived'])
plt.show()
This is very closely tied to the 'Mr.' and 'Mrs.' titles I talked about earlier. Women had a much higher chance of survival on the Titanic.
Age¶
For this column, it is better to split the data into age categories.
Ages = train["Age"].fillna(-0.5)
cut_points = [-1,0,5,12,18,35,60,100]
label_names = ["Missing","Infant","Child","Teenager","Young Adult","Adult","Senior"]
Ages_category = pd.cut(Ages,cut_points,labels=label_names)
fig = plt.figure(figsize=(8, 8))
sns.countplot(Ages_category, hue=train['Survived'])
plt.xlabel('')
plt.show()
print(train['Survived'].groupby(Ages_category).mean().sort_values())
It looks like infants (age 0-5) have the highest rate of survival.
SibSp and Parch¶
These two columns are quite interesting because they have to do with the number of family members each passenger has.
train['SibSp'].hist()
plt.show()
train['Parch'].hist()
plt.show()
pd.pivot_table(train, values='Survived', index=['SibSp'], aggfunc=np.mean)
pd.pivot_table(train, values='Survived', index=['Parch'], aggfunc=np.mean)
The majority of the values in the 'Parch' and 'SibSp' columns are 0 or 1, so we'll focus on those. Both columns show a positive relationship with survival.
We can see that passengers with 0 'SibSp' or 0 'Parch' have a much lower rate of survival. It might be a good idea to combine these two columns.
Fare¶
Passengers with a higher social standing could afford to pay for a higher passenger class, so this column relates back to the 'Name' and 'Pclass' columns.
cut_points = [-1,12,50,100,1000]
label_names = ["0-12","12-50","50-100","100+"]
Fare_categories = pd.cut(train["Fare"],cut_points,labels=label_names)
sns.countplot(Fare_categories, hue=train['Survived'])
plt.xlabel('')
plt.show()
print(train['Survived'].groupby(Fare_categories).mean().sort_values())
As expected, passengers with more expensive fares have a higher rate of survival. The holdout set also has a missing value in this column, which we'll fill with the mean fare.
Cabin¶
The cabin column contains a letter followed by some numbers. The plan is to split this column into categories containing only the first letter.
cabin_letters = train['Cabin'].str[0]
cabin_letters = cabin_letters.fillna('Unknown')
cabin_letters.value_counts()
print(train['Survived'].groupby(cabin_letters).mean().sort_values())
We can see that passengers with a recorded cabin have a higher chance of survival. However, for the majority of passengers the cabin is unknown.
Embarked¶
This column has three categories. Each category is a city name where the passenger boarded.
train['Survived'].groupby(train['Embarked']).mean().sort_values()
We see that passengers from Cherbourg had a higher rate of survival.
train[train['Embarked'] == 'C']['Pclass'].value_counts()
This column is closely related to 'Pclass' as well: the majority of passengers from Cherbourg were in first class. As for the missing values, we'll fill them in with 'U' for unknown.
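Before settling on that, a quick check of how many 'Embarked' values are actually missing in the training set (a minimal sketch):
#Count the missing 'Embarked' values.
train['Embarked'].isnull().sum()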
Feature Engineering¶
Now that we've looked at the data, let's create categorical columns for the columns we've explored. Following the plan laid out in the data exploration section, I've written the set of functions shown below.
#Fill out missing values in the dataset.
def process_missing(df):
df["Fare"] = df["Fare"].fillna(train["Fare"].mean())
df["Embarked"] = df["Embarked"].fillna("U")
return df
#Create a column for sex, 1 for female, 0 for male.
def process_sex(df):
df['is_female'] = df['Sex'].apply(lambda x: 1 if x == 'female' else 0)
return df
#Create a column for age categories.
def process_age(df):
df["Age"] = df["Age"].fillna(-0.5)
cut_points = [-1,0,5,12,18,35,60,100]
label_names = ["Missing","Infant","Child","Teenager","Young Adult","Adult","Senior"]
df["Age_categories"] = pd.cut(df["Age"],cut_points,labels=label_names)
return df
#Create a column for fare categories.
def process_fare(df):
cut_points = [-1,12,50,100,1000]
label_names = ["0-12","12-50","50-100","100+"]
df["Fare_categories"] = pd.cut(df["Fare"],cut_points,labels=label_names)
return df
#Create a column for cabin letters.
def process_cabin(df):
df["Cabin_type"] = df["Cabin"].str[0]
df["Cabin_type"] = df["Cabin_type"].fillna("Unknown")
df = df.drop('Cabin',axis=1)
return df
#Create a column for passenger titles.
def process_titles(df):
titles = {
"Mr" : "Mr",
"Mme": "Mrs",
"Ms": "Mrs",
"Mrs" : "Mrs",
"Master" : "Master",
"Mlle": "Miss",
"Miss" : "Miss",
"Capt": "Officer",
"Col": "Officer",
"Major": "Officer",
"Dr": "Officer",
"Rev": "Officer",
"Jonkheer": "Royalty",
"Don": "Royalty",
"Sir" : "Royalty",
"Countess": "Royalty",
"Dona": "Royalty",
"Lady" : "Royalty"
}
    extracted_titles = df["Name"].str.extract(r' ([A-Za-z]+)\.', expand=False)
df["Title"] = extracted_titles.map(titles)
return df
#Create dummy columns for categorical data.
def create_dummies(df,column_name):
dummies = pd.get_dummies(df[column_name],prefix=column_name)
df = pd.concat([df,dummies],axis=1)
return df
def process_dataframe(df):
df = process_missing(df)
df = process_sex(df)
df = process_age(df)
df = process_fare(df)
df = process_titles(df)
df = process_cabin(df)
cols = ["Pclass", "Title", "Age_categories", "Fare_categories", "Cabin_type", "Embarked"]
for col in cols:
df = create_dummies(df, col)
return df
train = pd.read_csv('train.csv')
holdout = pd.read_csv('test.csv')
train = process_dataframe(train)
holdout = process_dataframe(holdout)
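As a quick sanity check (a minimal sketch, not part of the original pipeline), we can confirm the dummy columns were generated and spot any columns that exist in the training dataframe but not in the holdout dataframe:
#Dummy columns generated from the 'Title' column.
print(train.filter(like='Title_').columns.tolist())
#Columns present in train but not in the holdout set (besides 'Survived').
print(set(train.columns) - set(holdout.columns))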
Revisiting SibSp and Parch¶
The 'SibSp' and 'Parch' columns are so closely related that I've decided to combine them into one column.
train['sum_SibSp_Parch'] = train['SibSp'] + train['Parch']
holdout['sum_SibSp_Parch'] = holdout['SibSp'] + holdout['Parch']
pivot_series = pd.pivot_table(train, values='Survived', index=['sum_SibSp_Parch'], aggfunc=np.mean)
pivot_series.plot.bar()
plt.show()
I took the sum of these two columns to see whether it has a positive relationship with the 'Survived' column, and this appears to be true. This makes sense because the sum of the two columns is just the family size.
It is also very likely that passengers who had parents on board have a higher survival rate because of their younger age. It is important to note that the data past x=3 is unreliable due to the low sample size.
We can create a new column for passengers without a family.
def nofamily(df):
if df['sum_SibSp_Parch'] == 0:
return 1
else:
return 0
train['isalone'] = train.apply(nofamily, axis=1)
holdout['isalone'] = holdout.apply(nofamily, axis=1)
train.columns
Feature Selection¶
For feature selection, let's try using the RFECV class from scikit-learn. We'll optimize the features with respect to the random forest classifier model.
RFECV stands for Recursive Feature Elimination with Cross-Validation. This is an automated approach that helps us select features that are relevant to the model. It is especially useful when we are dealing with a large number of columns, which is not really the case for the Titanic dataset. So it is important to keep in mind that we can still drop columns manually if our model is overfit.
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier
#Selects the optimal features for machine learning with RFECV
def select_features(df):
#return a df with only numeric values
numeric_df = df.select_dtypes(['float64', 'int64', 'uint8'])
#drop columns with NA values and remove survived and passengerid
selected_col = numeric_df.dropna(axis=1, how='any').columns.drop(['Survived', 'PassengerId'])
all_X = df[selected_col]
all_y = df['Survived']
random_forest = RandomForestClassifier(random_state=1)
selector = RFECV(random_forest, cv=10, step = 0.5)
selector.fit(all_X, all_y)
print(selected_col[selector.support_])
return selected_col[selector.support_]
bf = select_features(train)
Even after applying RFECV, we can still manually remove or add features using our domain knowledge.
The 'Cabin_type_T' and 'Embarked_U' columns were not generated in the holdout dataset, so we should remove them.
bf = bf.drop(['Cabin_type_T', 'Embarked_U'])
bf
Model Selection and Hyperparameter Tuning¶
#Tunes the model to reduce error
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
def select_model(df, features):
all_X = df[features]
all_y = df['Survived']
LogisticRegression_dict = {
'model_name': 'LogisticRegression',
'estimator': LogisticRegression(),
'hyperparams': {
"solver": ["newton-cg", "lbfgs", "liblinear"]
}
}
KNeighborsClassifier_dict = {
'model_name': 'KNeighborsClassifier',
'estimator': KNeighborsClassifier(),
'hyperparams': {
"n_neighbors": range(1,20,2),
"weights": ["distance", "uniform"],
"algorithm": ["ball_tree", "kd_tree", "brute"],
"p": [1,2]
}
}
RandomForestClassifier_dict = {
'model_name': 'RandomForestClassifier',
'estimator': RandomForestClassifier(),
'hyperparams': {
"n_estimators": [4, 6, 9],
"criterion": ["entropy", "gini"],
"max_depth": [2, 5, 10],
"max_features": ["log2", "sqrt"],
"min_samples_leaf": [1, 5, 8],
"min_samples_split": [2, 3, 5]
}
}
hyper_params_list = [LogisticRegression_dict, KNeighborsClassifier_dict, RandomForestClassifier_dict]
scores = {}
for model in hyper_params_list:
print('-'*len(model['model_name']))
print(model['model_name'])
print('-'*len(model['model_name']))
estimator = model['estimator']
grid = GridSearchCV(estimator, model['hyperparams'], cv=10)
grid.fit(all_X, all_y)
model['best_params'] = grid.best_params_
model['best_score'] = grid.best_score_
model["best_model"] = grid.best_estimator_
scores[model['model_name']] = grid.best_score_
print('Best Score: ' + str(model['best_score']))
print('Best Parameters: ' + str(model['best_params']))
best_model = max(scores, key=scores.get)
print('-'*len('Best Model: ' + str(best_model)))
print('Best Model: ' + str(best_model))
print('-'*len('Best Model: ' + str(best_model)))
for model in hyper_params_list:
if model['model_name'] == best_model:
print('Best Model Score: ' + str(model['best_score']))
print('Best Model Parameters:' + str(model['best_params']))
return hyper_params_list
params = select_model(train, bf)
It looks like our best model is the RandomForestClassifier. However, the score is based purely on the training dataset, so it is possible that we are overfitting the data.
I've only included a maximum of 9 trees in the grid search. If we include more trees in our forest, we should be able to reduce some overfitting.
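As a rough, optional check of that idea (a minimal sketch, not part of the original workflow), we could look at the 10-fold cross-validation score of a larger forest, using the same parameters as the submission model below:
from sklearn.model_selection import cross_val_score
#Hypothetical check: a larger forest with the same parameters as the submission model.
big_rf = RandomForestClassifier(
    n_estimators=700, criterion='entropy', max_depth=10,
    max_features='sqrt', min_samples_leaf=1, min_samples_split=3,
    random_state=1)
print(cross_val_score(big_rf, train[bf], train['Survived'], cv=10).mean())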
Applying the Machine Learning Model¶
def save_submission_file(col_list, filename="submission.csv"):
#Training
rf = RandomForestClassifier(
min_samples_leaf=1, n_estimators=700, criterion='entropy',
max_depth= 10, max_features= 'sqrt', min_samples_split= 3)
    rf.fit(train[col_list], train['Survived'])
#Predict
all_X = holdout[col_list]
predictions = rf.predict(all_X)
submission = pd.DataFrame({'PassengerId': holdout['PassengerId'], 'Survived': predictions})
submission.to_csv(filename, index=False)
save_submission_file(bf)
That is it for now. This model had an accuracy score of 0.77990 on the holdout set, while the accuracy score on the training set was 0.83501, which means the model is still somewhat overfit.
We managed to hit the 50th percentile in the competition, which is a great start.
Learning Summary¶
Concepts explored: feature engineering, feature selection, model selection, model tuning, binary classification problem
The files used for this project can be found in my GitHub repository.