1.0 - Introduction¶
In this notebook, I will focus on data analysis and preprocessing for the gender wage gap. Specifically, I am going to focus on public jobs in the city of San Francisco and Newport Beach. This data set is publically available on Kaggle and Transparent California.
I also created a web application based on this dataset. You can play arround with it here. For a complete list of requirements and files used for my web app, check out my GitHub repository here.
In this notebook following questions will be explored:
- Is there an overall gender wage gap for public jobs in San Francisco?
- Is the gender gap really 78 cents on the dollar?
- Is there a gender wage gap for full time employees?
- Is there a gender wage gap for part time employees?
- Is there a gender wage gap if the employees were grouped by job categories?
- Is there a gender wage gap if the employees were grouped by exact job title?
- If the gender wage gap exists, is the data statistically significant?
- If the gender wage gap exists, how does the gender wage gap in San Francisco compare with more conservative cities in California?
Lastly, I want to mention that I am not affiliated with any political group, everything I write in this project is based on my perspective of the data alone.
1.1 - Library imports and loading the data from SQL to pandas¶
The SQL database is about 18 megabytes, which is small enough for my computer to handle. So I've decided to just load the entire database into memory using pandas. However, I created a function that takes in a SQL query and returns the result as a pandas dataframe just in case I need to use SQL queries.
import pandas as pd
import numpy as np
import sqlite3
import matplotlib.pyplot as plt
import seaborn as sns
import gender_guesser.detector as gender
import time
import collections
%matplotlib inline
sns.set(font_scale=1.5)
def run_query(query):
with sqlite3.connect('database.sqlite') as conn:
return pd.read_sql(query, conn)
#Read the data from SQL->Pandas
q1 = '''
SELECT * FROM Salaries
'''
data = run_query(q1)
data.head()
2.0 - Data Cleaning¶
Fortunately, this data set is already very clean. However, we should still look into every column. Specifically, we are interested in the data types of each column, and check for null values within the rows.
2.1 - Pre-cleaning, investigating data types¶
Before we do anything to the dataframe, we are going to simply explore the data a little bit.
data.dtypes
data['JobTitle'].nunique()
There is no gender column, so we'll have to create one. In addition, we'll need to reduce the number of unique values in the 'JobTitle'
column. 'BasePay'
, 'OvertimePay'
, 'OtherPay'
, and 'Benefits'
are all object columns. We'll need to find a way to covert these into numeric values.
Let's take a look at the rest of the columns using the .value_counts()
method.
data['Year'].value_counts()
data['Notes'].value_counts()
data['Agency'].value_counts()
data['Status'].value_counts()
It looks like the data is split into 4 years. The 'Notes'
column is empty for 148654 rows, so we should just remove it. The 'Agency'
column is also not useful, because we already know the data is for San Francisco.
The 'Status'
column shows a separation for full time employees and part time employees. We should leave that alone for now.
2.2 - Dealing with non-numerical values¶
Let's tackle the object columns first, we are going to convert everything into integers using the pandas.to_numeric()
function. If we run into any errors, the returned value will be NaN.
def process_pay(df):
cols = ['BasePay','OvertimePay', 'OtherPay', 'Benefits']
print('Checking for nulls:')
for col in cols:
df[col] = pd.to_numeric(df[col], errors ='coerce')
print(len(col)*'-')
print(col)
print(len(col)*'-')
print(df[col].isnull().value_counts())
return df
data = process_pay(data.copy())
Looking at our results above, we found 609 null values in BasePay
and 36163 null values in Benefits
. We are going to drop the rows with null values in BasePay
. Not everyone will recieve benefits for their job, so it makes more sense to fill in the null values for Benefits
with zeroes.
def process_pay2(df):
df['Benefits'] = df['Benefits'].fillna(0)
df = df.dropna()
print(df['BasePay'].isnull().value_counts())
return df
data = process_pay2(data)
Lastly, let's drop the Agency
and Notes
columns as they do not provide any information.
data = data.drop(columns=['Agency', 'Notes'])
3.0 - Creating New Features¶
Unfortunately, this data set does not include demographic information. Since this project is focused on investigating the gender wage gap, we need a way to classify a person's gender. Furthermore, the JobTitle
column has 2159 unique values. We'll need to simplify this column.
3.1 - Creating the 'gender' column¶
Due to the limitations of this data set. We'll have to assume the gender of the employee by using their first name. The gender_guesser
library is very useful for this.
#Create the 'Gender' column based on employee's first name.
d = gender.Detector(case_sensitive=False)
data['FirstName'] = data['EmployeeName'].str.split().apply(lambda x: x[0])
data['Gender'] = data['FirstName'].apply(lambda x: d.get_gender(x))
data['Gender'].value_counts()
We are just going to remove employees with ambiguous or gender neutral first names from our analysis.
#Retain data with 'male' and 'female' names.
male_female_only = data[(data['Gender'] == 'male') | (data['Gender'] == 'female')].copy()
male_female_only['Gender'].value_counts()
3.2 - Categorizing job titles¶
Next, we'll have to simplify the JobTitles
column. To do this, we'll use the brute force method. I created an ordered dictionary with keywords and their associated job category. The generic titles are at the bottom of the dictionary, and the more specific titles are at the top of the dictionary. Then we are going to use a for loop in conjunction with the .map()
method on the column.
I used the same labels as this kernel on Kaggle, but I heavily modified the code for readability.
def find_job_title2(row):
#Prioritize specific titles on top
titles = collections.OrderedDict([
('Police',['police', 'sherif', 'probation', 'sergeant', 'officer', 'lieutenant']),
('Fire', ['fire']),
('Transit',['mta', 'transit']),
('Medical',['anesth', 'medical', 'nurs', 'health', 'physician', 'orthopedic', 'pharm', 'care']),
('Architect', ['architect']),
('Court',['court', 'legal']),
('Mayor Office', ['mayoral']),
('Library', ['librar']),
('Public Works', ['public']),
('Attorney', ['attorney']),
('Custodian', ['custodian']),
('Gardener', ['garden']),
('Recreation Leader', ['recreation']),
('Automotive',['automotive', 'mechanic', 'truck']),
('Engineer',['engineer', 'engr', 'eng', 'program']),
('General Laborer',['general laborer', 'painter', 'inspector', 'carpenter', 'electrician', 'plumber', 'maintenance']),
('Food Services', ['food serv']),
('Clerk', ['clerk']),
('Porter', ['porter']),
('Airport Staff', ['airport']),
('Social Worker',['worker']),
('Guard', ['guard']),
('Assistant',['aide', 'assistant', 'secretary', 'attendant']),
('Analyst', ['analy']),
('Manager', ['manager'])
])
#Loops through the dictionaries
for group, keywords in titles.items():
for keyword in keywords:
if keyword in row.lower():
return group
return 'Other'
start_time = time.time()
male_female_only["Job_Group"] = male_female_only["JobTitle"].map(find_job_title2)
print("--- Run Time: %s seconds ---" % (time.time() - start_time))
male_female_only['Job_Group'].value_counts()
4.0 - Data Analysis and Visualizations¶
In this section, we are going to use the data to answer the questions stated in the introduction section.
4.1 - Overview of the gender gap¶
Let's begin by splitting the data set in half, one for females and one for males. Then we'll plot the overall income distribution using kernel density estimation based on the gausian function.
fig = plt.figure(figsize=(10, 5))
male_only = male_female_only[male_female_only['Gender'] == 'male']
female_only = male_female_only[male_female_only['Gender'] == 'female']
ax = sns.kdeplot(male_only['TotalPayBenefits'], color ='Blue', label='Male', shade=True)
ax = sns.kdeplot(female_only['TotalPayBenefits'], color='Red', label='Female', shade=True)
plt.yticks([])
plt.title('Overall Income Distribution')
plt.ylabel('Density of Employees')
plt.xlabel('Total Pay + Benefits ($)')
plt.xlim(0, 350000)
plt.show()
The income distribution plot is bimodal. In addition, we see a gender wage gap in favor of males in between the ~110000 and the ~275000 region. But, this plot doesn't capture the whole story. We need to break down the data some more. But first, let's explore the percentage of employees based on gender.
fig = plt.figure(figsize=(5, 5))
colors = ['#AFAFF5', '#EFAFB5']
labels = ['Male', 'Female']
sizes = [len(male_only), len(female_only)]
explode = (0.05, 0)
sns.set(font_scale=1.5)
ax = plt.pie(sizes, labels=labels, explode=explode, colors=colors, shadow=True, startangle=90, autopct='%1.f%%')
plt.title('Estimated Percentages of Employees: Overall')
plt.show()
Another key factor we have to consider is the number of employees. How do we know if there are simply more men working at higher paying jobs? How can we determine if social injustice has occured?
The chart above only tells us the total percentage of employees across all job categories, but it does give us an overview of the data.
4.2 - Exploring the year column¶
The data set contain information on employees between 2011-2014. Let's take a look at an overview of the income based on the Year
column regardless of gender.
data_2011 = male_female_only[male_female_only['Year'] == 2011]
data_2012 = male_female_only[male_female_only['Year'] == 2012]
data_2013 = male_female_only[male_female_only['Year'] == 2013]
data_2014 = male_female_only[male_female_only['Year'] == 2014]
plt.figure(figsize=(10,7.5))
ax = plt.boxplot([data_2011['TotalPayBenefits'].values, data_2012['TotalPayBenefits'].values, \
data_2013['TotalPayBenefits'].values, data_2014['TotalPayBenefits'].values])
plt.ylim(0, 350000)
plt.xticks([1, 2, 3, 4], ['2011', '2012', '2013', '2014'])
plt.xlabel('Year')
plt.ylabel('Total Pay + Benefits ($)')
plt.tight_layout()
From the boxplots, we see that the total pay is increasing for every year. We'll have to consider inflation in our analysis. In addition, it is very possible for an employee to stay at their job for multiple years. We don't want to double sample on these employees.
To simplify the data for the purpose of investigating the gender gap. It makes more sense to only choose only one year for our analysis. From our data exploration, we noticed that the majority of the status
column was blank. Let's break the data down by year using the .value_counts()
method.
years = ['2011', '2012', '2013', '2014']
all_data = [data_2011, data_2012, data_2013, data_2014]
for i in range(4):
print(len(years[i])*'-')
print(years[i])
print(len(years[i])*'-')
print(all_data[i]['Status'].value_counts())
The status of the employee is critical to our analysis, only year 2014 has this information. So it makes sense to focus on analysis on 2014.
data_2014_FT = data_2014[data_2014['Status'] == 'FT']
data_2014_PT = data_2014[data_2014['Status'] == 'PT']
4.3 - Full time vs. part time employees¶
Let's take a look at the kernal density estimation plot for part time and full time employees.
fig = plt.figure(figsize=(10, 5))
ax = sns.kdeplot(data_2014_PT['TotalPayBenefits'], color = 'Orange', label='Part Time Workers', shade=True)
ax = sns.kdeplot(data_2014_FT['TotalPayBenefits'], color = 'Green', label='Full Time Workers', shade=True)
plt.yticks([])
plt.title('Part Time Workers vs. Full Time Workers')
plt.ylabel('Density of Employees')
plt.xlabel('Total Pay + Benefits ($)')
plt.xlim(0, 350000)
plt.show()
If we split the data by employment status, we can see that the kernal distribution plot is no longer bimodal. Next, let's see how these two plots look if we seperate the data by gender.
fig = plt.figure(figsize=(10, 10))
fig.subplots_adjust(hspace=.5)
#Generate the top plot
male_only = data_2014_FT[data_2014_FT['Gender'] == 'male']
female_only = data_2014_FT[data_2014_FT['Gender'] == 'female']
ax = fig.add_subplot(2, 1, 1)
ax = sns.kdeplot(male_only['TotalPayBenefits'], color ='Blue', label='Male', shade=True)
ax = sns.kdeplot(female_only['TotalPayBenefits'], color='Red', label='Female', shade=True)
plt.title('Full Time Workers')
plt.ylabel('Density of Employees')
plt.xlabel('Total Pay & Benefits ($)')
plt.xlim(0, 350000)
plt.yticks([])
#Generate the bottom plot
male_only = data_2014_PT[data_2014_PT['Gender'] == 'male']
female_only = data_2014_PT[data_2014_PT['Gender'] == 'female']
ax2 = fig.add_subplot(2, 1, 2)
ax2 = sns.kdeplot(male_only['TotalPayBenefits'], color ='Blue', label='Male', shade=True)
ax2 = sns.kdeplot(female_only['TotalPayBenefits'], color='Red', label='Female', shade=True)
plt.title('Part Time Workers')
plt.ylabel('Density of Employees')
plt.xlabel('Total Pay & Benefits ($)')
plt.xlim(0, 350000)
plt.yticks([])
plt.show()
For part time workers, the KDE plot is nearly identical for both males and females.
For full time workers, we still see a gender gap. We'll need to break down the data some more.
4.4 - Breaking down the total pay¶
We used total pay including benefits for the x-axis for the KDE plot in the previous section. Is this a fair way to analyze the data? What if men work more overtime hours than women? Can we break down the data some more?
male_only = data_2014_FT[data_2014_FT['Gender'] == 'male']
female_only = data_2014_FT[data_2014_FT['Gender'] == 'female']
fig = plt.figure(figsize=(10, 15))
fig.subplots_adjust(hspace=.5)
#Generate the top plot
ax = fig.add_subplot(3, 1, 1)
ax = sns.kdeplot(male_only['OvertimePay'], color ='Blue', label='Male', shade=True)
ax = sns.kdeplot(female_only['OvertimePay'], color='Red', label='Female', shade=True)
plt.title('Full Time Workers')
plt.ylabel('Density of Employees')
plt.xlabel('Overtime Pay ($)')
plt.xlim(0, 60000)
plt.yticks([])
#Generate the middle plot
ax2 = fig.add_subplot(3, 1, 2)
ax2 = sns.kdeplot(male_only['Benefits'], color ='Blue', label='Male', shade=True)
ax2 = sns.kdeplot(female_only['Benefits'], color='Red', label='Female', shade=True)
plt.ylabel('Density of Employees')
plt.xlabel('Benefits Only ($)')
plt.xlim(0, 75000)
plt.yticks([])
#Generate the bottom plot
ax3 = fig.add_subplot(3, 1, 3)
ax3 = sns.kdeplot(male_only['BasePay'], color ='Blue', label='Male', shade=True)
ax3 = sns.kdeplot(female_only['BasePay'], color='Red', label='Female', shade=True)
plt.ylabel('Density of Employees')
plt.xlabel('Base Pay Only ($)')
plt.xlim(0, 300000)
plt.yticks([])
plt.show()
We see a gender gap for all three plots above. Looks like we'll have to dig even deeper and analyze the data by job cateogries.
But first, let's take a look at the overall correlation for the data set.
data_2014_FT.corr()
The correlation table above uses Pearson's R to determine the values. The BasePay
and Benefits
column are very closely related. We can visualize this relationship using a scatter plot.
fig = plt.figure(figsize=(10, 5))
ax = plt.scatter(data_2014_FT['BasePay'], data_2014_FT['Benefits'])
plt.ylabel('Benefits ($)')
plt.xlabel('Base Pay ($)')
plt.show()
This makes a lot of sense because an employee's benefits is based on a percentage of their base pay. The San Francisco Human Resources department includes this information on their website here.
As we move further into our analysis of the data, it makes the most sense to focus on the BasePay
column. Both Benefits
and OvertimePay
are dependent of the BasePay
.
4.5 - Breaking down the base pay by job category¶
Next we'll analyze the base pay of full time workers by job category.
pal = sns.diverging_palette(0, 255, n=2)
ax = sns.factorplot(x='BasePay', y='Job_Group', hue='Gender', data=data_2014_FT,
size=10, kind="bar", palette=pal, ci=None)
plt.title('Full Time Workers')
plt.xlabel('Base Pay ($)')
plt.ylabel('Job Group')
plt.show()
At a glance, we can't really draw any conclusive statements about the gender wage gap. Some job categories favor females, some favor males. It really depends on what job group the employee is actually in. Maybe it makes more sense to calculate the the difference between these two bars.
salaries_by_group = pd.pivot_table(data = data_2014_FT,
values = 'BasePay',
columns = 'Job_Group', index='Gender',
aggfunc = np.mean)
count_by_group = pd.pivot_table(data = data_2014_FT,
values = 'Id',
columns = 'Job_Group', index='Gender',
aggfunc = len)
salaries_by_group
fig = plt.figure(figsize=(10, 15))
sns.set(font_scale=1.5)
differences = (salaries_by_group.loc['female'] - salaries_by_group.loc['male'])*100/salaries_by_group.loc['male']
labels = differences.sort_values().index
x = differences.sort_values()
y = [i for i in range(len(differences))]
palette = sns.diverging_palette(240, 10, n=28, center ='dark')
ax = sns.barplot(x, y, orient = 'h', palette = palette)
#Draws the two arrows
bbox_props = dict(boxstyle="rarrow,pad=0.3", fc="white", ec="black", lw=1)
t = plt.text(5.5, 12, "Higher pay for females", ha="center", va="center", rotation=0,
size=15,
bbox=bbox_props)
bbox_props2 = dict(boxstyle="larrow,pad=0.3", fc="white", ec="black", lw=1)
t = plt.text(-5.5, 12, "Higher pay for males", ha="center", va="center", rotation=0,
size=15,
bbox=bbox_props2)
#Labels each bar with the percentage of females
percent_labels = count_by_group[labels].iloc[0]*100 \
/(count_by_group[labels].iloc[0] + count_by_group[labels].iloc[1])
for i in range(len(ax.patches)):
p = ax.patches[i]
width = p.get_width()*1+1
ax.text(15,
p.get_y()+p.get_height()/2+0.3,
'{:1.0f}'.format(percent_labels[i])+' %',
ha="center")
ax.text(15, -1+0.3, 'Female Representation',
ha="center", fontname='Arial', rotation = 0)
plt.yticks(range(len(differences)), labels)
plt.title('Full Time Workers (Base Pay)')
plt.xlabel('Mean Percent Difference in Pay (Females - Males)')
plt.xlim(-11, 11)
plt.show()
I believe this is a better way to represent the gender wage gap. I calculated the mean difference between female and male pay based on job categories. Then I converted the values into a percentage by using this formula:
$$ \text{Mean Percent Difference} = \frac{\text{(Female Mean Pay - Male Mean Pay)*100}} {\text{Male Mean Pay}} $$
The theory stating that women makes 78 cents for every dollar men makes implies a 22% pay difference. None of these percentages were more than 10%, and not all of these percentage values showed favoritism towards males. However, we should keep in mind that this data set only applies to San Francisco public jobs. We should also keep in mind that we do not have access to job experience data which would directly correlate with base pay.
In addition, I included a short table of female representation for each job group on the right side of the graph. We'll dig further into this on the next section.
4.6 - Gender representation by job category¶
contingency_table = pd.crosstab(
data_2014_FT['Gender'],
data_2014_FT['Job_Group'],
margins = True
)
contingency_table
#Assigns the frequency values
femalecount = contingency_table.iloc[0][0:-1].values
malecount = contingency_table.iloc[1][0:-1].values
totals = contingency_table.iloc[2][0:-1]
femalepercentages = femalecount*100/totals
malepercentages = malecount*100/totals
malepercentages=malepercentages.sort_values(ascending=True)
femalepercentages=femalepercentages.sort_values(ascending=False)
length = range(len(femalepercentages))
#Plots the bar chart
fig = plt.figure(figsize=(10, 12))
sns.set(font_scale=1.5)
p1 = plt.barh(length, malepercentages.values, 0.55, label='Male', color='#AFAFF5')
p2 = plt.barh(length, femalepercentages, 0.55, left=malepercentages, color='#EFAFB5', label='Female')
labels = malepercentages.index
plt.yticks(range(len(malepercentages)), labels)
plt.xticks([0, 25, 50, 75, 100], ['0 %', '25 %', '50 %', '75 %', '100 %'])
plt.xlabel('Percentage of Males')
plt.title('Gender Representation by Job Group')
plt.legend(bbox_to_anchor=(0, 1, 1, 0), loc=3,
ncol=2, mode="expand", borderaxespad=0)
plt.show()
The chart above does not include any information based on pay. I wanted to show an overview of gender representation based on job category. It is safe to say, women don't like working with automotives with <1% female representation. Where as female representation is highest for medical jobs at 73%.
4.7 - Significance testing by exact job title¶
So what if breaking down the wage gap by job category is not good enough? Should we break down the gender gap by exact job title? Afterall, the argument is for equal pay for equal work. We can assume equal work if the job titles are exactly the same.
We can use hypothesis testing using the Welch's t-test to determine if there is a statistically significant result between male and female wages. The Welch's t-test is very robust as it doesn't assume equal variance and equal sample size. It does however, assume a normal distrbution which is well represented by the KDE plots. I talk about this in detail in my blog post here.
Let's state our null and alternative hypothesis:
$ H_0 : \text{There is no statistically significant relationship between gender and pay.} $
$ H_a : \text{There is a statistically significant relationship between gender and pay.} $
We are going to use only job titles with more than 100 employees, and job titles with more than 30 females and 30 males for this t-test. Using a for loop, we'll perform the Welch's t-test on every job title tat matches our criteria.
from scipy import stats
#Significance testing by job title
job_titles = data_2014['JobTitle'].value_counts(dropna=True)
job_titles_over_100 = job_titles[job_titles > 100 ]
t_scores = {}
for title,count in job_titles_over_100.iteritems():
male_pay = pd.to_numeric(male_only[male_only['JobTitle'] == title]['BasePay'])
female_pay = pd.to_numeric(female_only[female_only['JobTitle'] == title]['BasePay'])
if female_pay.shape[0] < 30:
continue
if male_pay.shape[0] < 30:
continue
t_scores[title] = stats.ttest_ind_from_stats(
mean1=male_pay.mean(), std1=(male_pay.std()), nobs1= male_pay.shape[0], \
mean2=female_pay.mean(), std2=(female_pay.std()), nobs2=female_pay.shape[0], \
equal_var=False)
for key, value in t_scores.items():
if value[1] < 0.05:
print(len(key)*'-')
print(key)
print(len(key)*'-')
print(t_scores[key])
print(' ')
print('Male: {}'.format((male_only[male_only['JobTitle'] == key]['BasePay']).mean()))
print('sample size: {}'.format(male_only[male_only['JobTitle'] == key].shape[0]))
print(' ')
print('Female: {}'.format((female_only[female_only['JobTitle'] == key]['BasePay']).mean()))
print('sample size: {}'.format(female_only[female_only['JobTitle'] == key].shape[0]))
len(t_scores)
Out of the 25 jobs that were tested using the Welch's t-test, 5 jobs resulted in a p-value of less than 0.05. However, not all jobs showed favoritism towards males. 'Registered Nurse' and 'Senior Clerk' both showed an average pay in favor of females. However, we should take the Welch's t-test results with a grain of salt. We do not have data on the work experience of the employees. Maybe female nurses have more work experience over males. Maybe male transit operators have more work experience over females. We don't actually know. Since BasePay
is a function of work experience, without this critical piece of information, we can not make any conclusions based on the t-test alone. All we know is that a statistically significant difference exists.
5.0 - San Francisco vs. Newport Beach¶
Let's take a look at more a more conservative city such as Newport Beach. This data can be downloaded at Transparent California here.
We can process the data similar to the San Francisco data set. The following code performs the following:
- Read the data using pandas
- Create the
Job_Group
column - Create the
Gender
column - Create two new dataframes: one for part time workers and one for full time workers
#Reads in the data
nb_data = pd.read_csv('newport-beach-2016.csv')
#Creates job groups
def find_job_title_nb(row):
titles = collections.OrderedDict([
('Police',['police', 'sherif', 'probation', 'sergeant', 'officer', 'lieutenant']),
('Fire', ['fire']),
('Transit',['mta', 'transit']),
('Medical',['anesth', 'medical', 'nurs', 'health', 'physician', 'orthopedic', 'pharm', 'care']),
('Architect', ['architect']),
('Court',['court', 'legal']),
('Mayor Office', ['mayoral']),
('Library', ['librar']),
('Public Works', ['public']),
('Attorney', ['attorney']),
('Custodian', ['custodian']),
('Gardener', ['garden']),
('Recreation Leader', ['recreation']),
('Automotive',['automotive', 'mechanic', 'truck']),
('Engineer',['engineer', 'engr', 'eng', 'program']),
('General Laborer',['general laborer', 'painter', 'inspector', 'carpenter', 'electrician', 'plumber', 'maintenance']),
('Food Services', ['food serv']),
('Clerk', ['clerk']),
('Porter', ['porter']),
('Airport Staff', ['airport']),
('Social Worker',['worker']),
('Guard', ['guard']),
('Assistant',['aide', 'assistant', 'secretary', 'attendant']),
('Analyst', ['analy']),
('Manager', ['manager'])
])
#Loops through the dictionaries
for group, keywords in titles.items():
for keyword in keywords:
if keyword in row.lower():
return group
return 'Other'
start_time = time.time()
nb_data["Job_Group"]=data["JobTitle"].map(find_job_title_nb)
#Create the 'Gender' column based on employee's first name.
d = gender.Detector(case_sensitive=False)
nb_data['FirstName'] = nb_data['Employee Name'].str.split().apply(lambda x: x[0])
nb_data['Gender'] = nb_data['FirstName'].apply(lambda x: d.get_gender(x))
nb_data['Gender'].value_counts()
#Retain data with 'male' and 'female' names.
nb_male_female_only = nb_data[(nb_data['Gender'] == 'male') | (nb_data['Gender'] == 'female')]
nb_male_female_only['Gender'].value_counts()
#Seperates full time/part time data
nb_data_FT = nb_male_female_only[nb_male_female_only['Status'] == 'FT']
nb_data_PT = nb_male_female_only[nb_male_female_only['Status'] == 'PT']
nb_data_FT.head()
5.1 - Part time vs. full time workers¶
fig = plt.figure(figsize=(10, 5))
nb_male_only = nb_data_PT[nb_data_PT['Gender'] == 'male']
nb_female_only = nb_data_PT[nb_data_PT['Gender'] == 'female']
ax = fig.add_subplot(1, 1, 1)
ax = sns.kdeplot(nb_male_only['Total Pay & Benefits'], color ='Blue', label='Male', shade=True)
ax = sns.kdeplot(nb_female_only['Total Pay & Benefits'], color='Red', label='Female', shade=True)
plt.title('Newport Beach: Part Time Workers')
plt.ylabel('Density of Employees')
plt.xlabel('Total Pay + Benefits ($)')
plt.xlim(0, 400000)
plt.yticks([])
plt.show()
Similar to the KDE plot for San Francisco, the KDE plot is nearly identical for both males and females for part time workers.
Let's take a look at the full time workers.
fig = plt.figure(figsize=(10, 10))
fig.subplots_adjust(hspace=.5)
#Generate the top chart
nb_male_only = nb_data_FT[nb_data_FT['Gender'] == 'male']
nb_female_only = nb_data_FT[nb_data_FT['Gender'] == 'female']
ax = fig.add_subplot(2, 1, 1)
ax = sns.kdeplot(nb_male_only['Total Pay & Benefits'], color ='Blue', label='Male', shade=True)
ax = sns.kdeplot(nb_female_only['Total Pay & Benefits'], color='Red', label='Female', shade=True)
plt.title('Newport Beach: Full Time Workers')
plt.ylabel('Density of Employees')
plt.xlabel('Total Pay + Benefits ($)')
plt.xlim(0, 400000)
plt.yticks([])
#Generate the bottom chart
male_only = data_2014_FT[data_2014_FT['Gender'] == 'male']
female_only = data_2014_FT[data_2014_FT['Gender'] == 'female']
ax2 = fig.add_subplot(2, 1, 2)
ax2 = sns.kdeplot(male_only['TotalPayBenefits'], color ='Blue', label='Male', shade=True)
ax2 = sns.kdeplot(female_only['TotalPayBenefits'], color='Red', label='Female', shade=True)
plt.title('San Francisco: Full Time Workers')
plt.ylabel('Density of Employees')
plt.xlabel('Total Pay + Benefits ($)')
plt.xlim(0, 400000)
plt.yticks([])
plt.show()
The kurtosis of the KDE plot for Newport Beach full time workers is lower than KDE plot for San Francisco full time workers. We can see a higher gender wage gap for Newport beach workers than San Francisco workers. However, these two plots do not tell us the full story. We need to break down the data by job category.
5.2 - Comparisons by job cateogry¶
nb_salaries_by_group = pd.pivot_table(data = nb_data_FT,
values = 'Base Pay',
columns = 'Job_Group', index='Gender',
aggfunc = np.mean,)
nb_salaries_by_group
fig = plt.figure(figsize=(10, 7.5))
sns.set(font_scale=1.5)
differences = (nb_salaries_by_group.loc['female'] - nb_salaries_by_group.loc['male'])*100/nb_salaries_by_group.loc['male']
nb_labels = differences.sort_values().index
x = differences.sort_values()
y = [i for i in range(len(differences))]
nb_palette = sns.diverging_palette(240, 10, n=9, center ='dark')
ax = sns.barplot(x, y, orient = 'h', palette = nb_palette)
plt.yticks(range(len(differences)), nb_labels)
plt.title('Newport Beach: Full Time Workers (Base Pay)')
plt.xlabel('Mean Percent Difference in Pay (Females - Males)')
plt.xlim(-25, 25)
plt.show()
Most of these jobs shows a higher average pay for males. The only job category where females were paid higher on average was 'Manager'. Some of these job categories do not even have a single female within the category, so the difference cannot be calculated. We should create a contingency table to check the sample size of our data.
5.3 - Gender representation by job category¶
nb_contingency_table = pd.crosstab(
nb_data_FT['Gender'],
nb_data_FT['Job_Group'],
margins = True
)
nb_contingency_table
The number of public jobs is much lower in Newport Beach compared to San Francisco. With only 3 female managers working full time in Newport Beach, we can't really say female managers make more money on average than male managers.
#Assigns the frequency values
nb_femalecount = nb_contingency_table.iloc[0][0:-1].values
nb_malecount = nb_contingency_table.iloc[1][0:-1].values
nb_totals = nb_contingency_table.iloc[2][0:-1]
nb_femalepercentages = nb_femalecount*100/nb_totals
nb_malepercentages = nb_malecount*100/nb_totals
nb_malepercentages=nb_malepercentages.sort_values(ascending=True)
nb_femalepercentages=nb_femalepercentages.sort_values(ascending=False)
nb_length = range(len(nb_malepercentages))
#Plots the bar chart
fig = plt.figure(figsize=(10, 10))
sns.set(font_scale=1.5)
p1 = plt.barh(nb_length, nb_malepercentages.values, 0.55, label='Male', color='#AFAFF5')
p2 = plt.barh(nb_length, nb_femalepercentages, 0.55, left=nb_malepercentages, color='#EFAFB5', label='Female')
labels = nb_malepercentages.index
plt.yticks(range(len(nb_malepercentages)), labels)
plt.xticks([0, 25, 50, 75, 100], ['0 %', '25 %', '50 %', '75 %', '100 %'])
plt.xlabel('Percentage of Males')
plt.title('Gender Representation by Job Group')
plt.legend(bbox_to_anchor=(0, 1, 1, 0), loc=3,
ncol=2, mode="expand", borderaxespad=0)
plt.show()