### 1.0 - Introduction¶

In this notebook, I will focus on data analysis and preprocessing for the gender wage gap. Specifically, I am going to focus on public jobs in the city of San Francisco and Newport Beach. This data set is publically available on Kaggle and Transparent California.

I also created a web application based on this dataset. You can play arround with it here. For a complete list of requirements and files used for my web app, check out my GitHub repository here.

In this notebook following questions will be explored:

• Is there an overall gender wage gap for public jobs in San Francisco?
• Is the gender gap really 78 cents on the dollar?
• Is there a gender wage gap for full time employees?
• Is there a gender wage gap for part time employees?
• Is there a gender wage gap if the employees were grouped by job categories?
• Is there a gender wage gap if the employees were grouped by exact job title?
• If the gender wage gap exists, is the data statistically significant?
• If the gender wage gap exists, how does the gender wage gap in San Francisco compare with more conservative cities in California?

Lastly, I want to mention that I am not affiliated with any political group, everything I write in this project is based on my perspective of the data alone.

#### 1.1 - Library imports and loading the data from SQL to pandas¶

The SQL database is about 18 megabytes, which is small enough for my computer to handle. So I've decided to just load the entire database into memory using pandas. However, I created a function that takes in a SQL query and returns the result as a pandas dataframe just in case I need to use SQL queries.

In [1]:
import pandas as pd
import numpy as np
import sqlite3
import matplotlib.pyplot as plt
import seaborn as sns
import gender_guesser.detector as gender
import time
import collections
%matplotlib inline
sns.set(font_scale=1.5)

def run_query(query):
with sqlite3.connect('database.sqlite') as conn:

q1 = '''
SELECT * FROM Salaries
'''

data = run_query(q1)
Out[1]:
Id EmployeeName JobTitle BasePay OvertimePay OtherPay Benefits TotalPay TotalPayBenefits Year Notes Agency Status
0 1 NATHANIEL FORD GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY 167411 0 400184 567595.43 567595.43 2011 San Francisco
1 2 GARY JIMENEZ CAPTAIN III (POLICE DEPARTMENT) 155966 245132 137811 538909.28 538909.28 2011 San Francisco
2 3 ALBERT PARDINI CAPTAIN III (POLICE DEPARTMENT) 212739 106088 16452.6 335279.91 335279.91 2011 San Francisco
3 4 CHRISTOPHER CHONG WIRE ROPE CABLE MAINTENANCE MECHANIC 77916 56120.7 198307 332343.61 332343.61 2011 San Francisco
4 5 PATRICK GARDNER DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT) 134402 9737 182235 326373.19 326373.19 2011 San Francisco

### 2.0 - Data Cleaning¶

Fortunately, this data set is already very clean. However, we should still look into every column. Specifically, we are interested in the data types of each column, and check for null values within the rows.

#### 2.1 - Pre-cleaning, investigating data types¶

Before we do anything to the dataframe, we are going to simply explore the data a little bit.

In [2]:
data.dtypes
Out[2]:
Id                    int64
EmployeeName         object
JobTitle             object
BasePay              object
OvertimePay          object
OtherPay             object
Benefits             object
TotalPay            float64
TotalPayBenefits    float64
Year                  int64
Notes                object
Agency               object
Status               object
dtype: object
In [3]:
data['JobTitle'].nunique()
Out[3]:
2159

There is no gender column, so we'll have to create one. In addition, we'll need to reduce the number of unique values in the 'JobTitle' column. 'BasePay', 'OvertimePay', 'OtherPay', and 'Benefits' are all object columns. We'll need to find a way to covert these into numeric values.

Let's take a look at the rest of the columns using the .value_counts() method.

In [4]:
data['Year'].value_counts()
Out[4]:
2014    38123
2013    37606
2012    36766
2011    36159
Name: Year, dtype: int64
In [5]:
data['Notes'].value_counts()
Out[5]:
148654
Name: Notes, dtype: int64
In [6]:
data['Agency'].value_counts()
Out[6]:
San Francisco    148654
Name: Agency, dtype: int64
In [7]:
data['Status'].value_counts()
Out[7]:
110535
FT     22334
PT     15785
Name: Status, dtype: int64

It looks like the data is split into 4 years. The 'Notes' column is empty for 148654 rows, so we should just remove it. The 'Agency' column is also not useful, because we already know the data is for San Francisco.

The 'Status' column shows a separation for full time employees and part time employees. We should leave that alone for now.

#### 2.2 - Dealing with non-numerical values¶

Let's tackle the object columns first, we are going to convert everything into integers using the pandas.to_numeric() function. If we run into any errors, the returned value will be NaN.

In [8]:
def process_pay(df):
cols = ['BasePay','OvertimePay', 'OtherPay', 'Benefits']

print('Checking for nulls:')
for col in cols:
df[col] = pd.to_numeric(df[col], errors ='coerce')
print(len(col)*'-')
print(col)
print(len(col)*'-')
print(df[col].isnull().value_counts())

return df

data = process_pay(data.copy())
Checking for nulls:
-------
BasePay
-------
False    148045
True        609
Name: BasePay, dtype: int64
-----------
OvertimePay
-----------
False    148650
True          4
Name: OvertimePay, dtype: int64
--------
OtherPay
--------
False    148650
True          4
Name: OtherPay, dtype: int64
--------
Benefits
--------
False    112491
True      36163
Name: Benefits, dtype: int64

Looking at our results above, we found 609 null values in BasePay and 36163 null values in Benefits. We are going to drop the rows with null values in BasePay. Not everyone will recieve benefits for their job, so it makes more sense to fill in the null values for Benefits with zeroes.

In [9]:
def process_pay2(df):
df['Benefits'] = df['Benefits'].fillna(0)

df = df.dropna()
print(df['BasePay'].isnull().value_counts())
return df

data = process_pay2(data)
False    148045
Name: BasePay, dtype: int64

Lastly, let's drop the Agency and Notes columns as they do not provide any information.

In [10]:
data = data.drop(columns=['Agency', 'Notes'])

### 3.0 - Creating New Features¶

Unfortunately, this data set does not include demographic information. Since this project is focused on investigating the gender wage gap, we need a way to classify a person's gender. Furthermore, the JobTitle column has 2159 unique values. We'll need to simplify this column.

#### 3.1 - Creating the 'gender' column¶

Due to the limitations of this data set. We'll have to assume the gender of the employee by using their first name. The gender_guesser library is very useful for this.

In [11]:
#Create the 'Gender' column based on employee's first name.
d = gender.Detector(case_sensitive=False)
data['FirstName'] = data['EmployeeName'].str.split().apply(lambda x: x[0])
data['Gender'] = data['FirstName'].apply(lambda x: d.get_gender(x))
data['Gender'].value_counts()
Out[11]:
male             69521
female           50182
unknown          15027
andy              5403
mostly_female     4559
mostly_male       3353
Name: Gender, dtype: int64

We are just going to remove employees with ambiguous or gender neutral first names from our analysis.

In [12]:
#Retain data with 'male' and 'female' names.
male_female_only = data[(data['Gender'] == 'male') | (data['Gender'] == 'female')].copy()
male_female_only['Gender'].value_counts()
Out[12]:
male      69521
female    50182
Name: Gender, dtype: int64

#### 3.2 - Categorizing job titles¶

Next, we'll have to simplify the JobTitles column. To do this, we'll use the brute force method. I created an ordered dictionary with keywords and their associated job category. The generic titles are at the bottom of the dictionary, and the more specific titles are at the top of the dictionary. Then we are going to use a for loop in conjunction with the .map() method on the column.

I used the same labels as this kernel on Kaggle, but I heavily modified the code for readability.

In [13]:
def find_job_title2(row):

#Prioritize specific titles on top
titles = collections.OrderedDict([
('Police',['police', 'sherif', 'probation', 'sergeant', 'officer', 'lieutenant']),
('Fire', ['fire']),
('Transit',['mta', 'transit']),
('Medical',['anesth', 'medical', 'nurs', 'health', 'physician', 'orthopedic', 'pharm', 'care']),
('Architect', ['architect']),
('Court',['court', 'legal']),
('Mayor Office', ['mayoral']),
('Library', ['librar']),
('Public Works', ['public']),
('Attorney', ['attorney']),
('Custodian', ['custodian']),
('Gardener', ['garden']),
('Automotive',['automotive', 'mechanic', 'truck']),
('Engineer',['engineer', 'engr', 'eng', 'program']),
('General Laborer',['general laborer', 'painter', 'inspector', 'carpenter', 'electrician', 'plumber', 'maintenance']),
('Food Services', ['food serv']),
('Clerk', ['clerk']),
('Porter', ['porter']),
('Airport Staff', ['airport']),
('Social Worker',['worker']),
('Guard', ['guard']),
('Assistant',['aide', 'assistant', 'secretary', 'attendant']),
('Analyst', ['analy']),
('Manager', ['manager'])
])

#Loops through the dictionaries
for group, keywords in titles.items():
for keyword in keywords:
if keyword in row.lower():
return group
return 'Other'

start_time = time.time()
male_female_only["Job_Group"] = male_female_only["JobTitle"].map(find_job_title2)
print("--- Run Time: %s seconds ---" % (time.time() - start_time))

male_female_only['Job_Group'].value_counts()
--- Run Time: 2.2815001010894775 seconds ---
Out[13]:
Medical              17869
Other                16460
Police               15486
Transit               9742
Engineer              6481
General Laborer       6015
Clerk                 5994
Social Worker         4969
Fire                  4618
Public Works          4458
Assistant             3848
Analyst               3500
Manager               2829
Library               2328
Automotive            2052
Court                 1905
Custodian             1862
Guard                 1590
Attorney              1542
Gardener              1135
Porter                1003
Food Services          685
Architect              424
Airport Staff          273
Mayor Office           165
Name: Job_Group, dtype: int64

### 4.0 - Data Analysis and Visualizations¶

In this section, we are going to use the data to answer the questions stated in the introduction section.

#### 4.1 - Overview of the gender gap¶

Let's begin by splitting the data set in half, one for females and one for males. Then we'll plot the overall income distribution using kernel density estimation based on the gausian function.

In [14]:
fig = plt.figure(figsize=(10, 5))
male_only = male_female_only[male_female_only['Gender'] == 'male']
female_only = male_female_only[male_female_only['Gender'] == 'female']

ax = sns.kdeplot(male_only['TotalPayBenefits'], color ='Blue', label='Male', shade=True)
ax = sns.kdeplot(female_only['TotalPayBenefits'], color='Red', label='Female', shade=True)

plt.yticks([])
plt.title('Overall Income Distribution')
plt.ylabel('Density of Employees')
plt.xlabel('Total Pay + Benefits ($)') plt.xlim(0, 350000) plt.show() The income distribution plot is bimodal. In addition, we see a gender wage gap in favor of males in between the ~110000 and the ~275000 region. But, this plot doesn't capture the whole story. We need to break down the data some more. But first, let's explore the percentage of employees based on gender. In [15]: fig = plt.figure(figsize=(5, 5)) colors = ['#AFAFF5', '#EFAFB5'] labels = ['Male', 'Female'] sizes = [len(male_only), len(female_only)] explode = (0.05, 0) sns.set(font_scale=1.5) ax = plt.pie(sizes, labels=labels, explode=explode, colors=colors, shadow=True, startangle=90, autopct='%1.f%%') plt.title('Estimated Percentages of Employees: Overall') plt.show() Another key factor we have to consider is the number of employees. How do we know if there are simply more men working at higher paying jobs? How can we determine if social injustice has occured? The chart above only tells us the total percentage of employees across all job categories, but it does give us an overview of the data. #### 4.2 - Exploring the year column¶ The data set contain information on employees between 2011-2014. Let's take a look at an overview of the income based on the Year column regardless of gender. In [16]: data_2011 = male_female_only[male_female_only['Year'] == 2011] data_2012 = male_female_only[male_female_only['Year'] == 2012] data_2013 = male_female_only[male_female_only['Year'] == 2013] data_2014 = male_female_only[male_female_only['Year'] == 2014] plt.figure(figsize=(10,7.5)) ax = plt.boxplot([data_2011['TotalPayBenefits'].values, data_2012['TotalPayBenefits'].values, \ data_2013['TotalPayBenefits'].values, data_2014['TotalPayBenefits'].values]) plt.ylim(0, 350000) plt.xticks([1, 2, 3, 4], ['2011', '2012', '2013', '2014']) plt.xlabel('Year') plt.ylabel('Total Pay + Benefits ($)')
plt.tight_layout()

From the boxplots, we see that the total pay is increasing for every year. We'll have to consider inflation in our analysis. In addition, it is very possible for an employee to stay at their job for multiple years. We don't want to double sample on these employees.

To simplify the data for the purpose of investigating the gender gap. It makes more sense to only choose only one year for our analysis. From our data exploration, we noticed that the majority of the status column was blank. Let's break the data down by year using the .value_counts() method.

In [17]:
years = ['2011', '2012', '2013', '2014']
all_data = [data_2011, data_2012, data_2013, data_2014]

for i in range(4):
print(len(years[i])*'-')
print(years[i])
print(len(years[i])*'-')
print(all_data[i]['Status'].value_counts())
----
2011
----
29473
Name: Status, dtype: int64
----
2012
----
29798
Name: Status, dtype: int64
----
2013
----
29838
Name: Status, dtype: int64
----
2014
----
FT    18295
PT    12299
Name: Status, dtype: int64

The status of the employee is critical to our analysis, only year 2014 has this information. So it makes sense to focus on analysis on 2014.

In [18]:
data_2014_FT = data_2014[data_2014['Status'] == 'FT']
data_2014_PT = data_2014[data_2014['Status'] == 'PT']

#### 4.3 - Full time vs. part time employees¶

Let's take a look at the kernal density estimation plot for part time and full time employees.

In [19]:
fig = plt.figure(figsize=(10, 5))
ax = sns.kdeplot(data_2014_PT['TotalPayBenefits'], color = 'Orange', label='Part Time Workers', shade=True)
ax = sns.kdeplot(data_2014_FT['TotalPayBenefits'], color = 'Green', label='Full Time Workers', shade=True)
plt.yticks([])

plt.title('Part Time Workers vs. Full Time Workers')
plt.ylabel('Density of Employees')
plt.xlabel('Total Pay + Benefits ($)') plt.xlim(0, 350000) plt.show() If we split the data by employment status, we can see that the kernal distribution plot is no longer bimodal. Next, let's see how these two plots look if we seperate the data by gender. In [20]: fig = plt.figure(figsize=(10, 10)) fig.subplots_adjust(hspace=.5) #Generate the top plot male_only = data_2014_FT[data_2014_FT['Gender'] == 'male'] female_only = data_2014_FT[data_2014_FT['Gender'] == 'female'] ax = fig.add_subplot(2, 1, 1) ax = sns.kdeplot(male_only['TotalPayBenefits'], color ='Blue', label='Male', shade=True) ax = sns.kdeplot(female_only['TotalPayBenefits'], color='Red', label='Female', shade=True) plt.title('Full Time Workers') plt.ylabel('Density of Employees') plt.xlabel('Total Pay & Benefits ($)')
plt.xlim(0, 350000)
plt.yticks([])

#Generate the bottom plot
male_only = data_2014_PT[data_2014_PT['Gender'] == 'male']
female_only = data_2014_PT[data_2014_PT['Gender'] == 'female']
ax2 = sns.kdeplot(male_only['TotalPayBenefits'], color ='Blue', label='Male', shade=True)
ax2 = sns.kdeplot(female_only['TotalPayBenefits'], color='Red', label='Female', shade=True)
plt.title('Part Time Workers')
plt.ylabel('Density of Employees')
plt.xlabel('Total Pay & Benefits ($)') plt.xlim(0, 350000) plt.yticks([]) plt.show() For part time workers, the KDE plot is nearly identical for both males and females. For full time workers, we still see a gender gap. We'll need to break down the data some more. #### 4.4 - Breaking down the total pay¶ We used total pay including benefits for the x-axis for the KDE plot in the previous section. Is this a fair way to analyze the data? What if men work more overtime hours than women? Can we break down the data some more? In [21]: male_only = data_2014_FT[data_2014_FT['Gender'] == 'male'] female_only = data_2014_FT[data_2014_FT['Gender'] == 'female'] fig = plt.figure(figsize=(10, 15)) fig.subplots_adjust(hspace=.5) #Generate the top plot ax = fig.add_subplot(3, 1, 1) ax = sns.kdeplot(male_only['OvertimePay'], color ='Blue', label='Male', shade=True) ax = sns.kdeplot(female_only['OvertimePay'], color='Red', label='Female', shade=True) plt.title('Full Time Workers') plt.ylabel('Density of Employees') plt.xlabel('Overtime Pay ($)')
plt.xlim(0, 60000)
plt.yticks([])

#Generate the middle plot
ax2 = sns.kdeplot(male_only['Benefits'], color ='Blue', label='Male', shade=True)
ax2 = sns.kdeplot(female_only['Benefits'], color='Red', label='Female', shade=True)
plt.ylabel('Density of Employees')
plt.xlabel('Benefits Only ($)') plt.xlim(0, 75000) plt.yticks([]) #Generate the bottom plot ax3 = fig.add_subplot(3, 1, 3) ax3 = sns.kdeplot(male_only['BasePay'], color ='Blue', label='Male', shade=True) ax3 = sns.kdeplot(female_only['BasePay'], color='Red', label='Female', shade=True) plt.ylabel('Density of Employees') plt.xlabel('Base Pay Only ($)')
plt.xlim(0, 300000)
plt.yticks([])

plt.show()

We see a gender gap for all three plots above. Looks like we'll have to dig even deeper and analyze the data by job cateogries.

But first, let's take a look at the overall correlation for the data set.

In [22]:
data_2014_FT.corr()
Out[22]:
Id BasePay OvertimePay OtherPay Benefits TotalPay TotalPayBenefits Year
Id 1.000000 -0.877397 -0.411169 -0.509704 -0.855137 -0.937600 -0.941416 NaN
BasePay -0.877397 1.000000 0.093583 0.341613 0.931050 0.899263 0.921638 NaN
OvertimePay -0.411169 0.093583 1.000000 0.461372 0.146877 0.500797 0.450531 NaN
OtherPay -0.509704 0.341613 0.461372 1.000000 0.373886 0.593976 0.568061 NaN
Benefits -0.855137 0.931050 0.146877 0.373886 1.000000 0.866937 0.905795 NaN
TotalPay -0.937600 0.899263 0.500797 0.593976 0.866937 1.000000 0.996455 NaN
TotalPayBenefits -0.941416 0.921638 0.450531 0.568061 0.905795 0.996455 1.000000 NaN
Year NaN NaN NaN NaN NaN NaN NaN NaN

The correlation table above uses Pearson's R to determine the values. The BasePay and Benefits column are very closely related. We can visualize this relationship using a scatter plot.

In [23]:
fig = plt.figure(figsize=(10, 5))

ax = plt.scatter(data_2014_FT['BasePay'], data_2014_FT['Benefits'])

plt.ylabel('Benefits ($)') plt.xlabel('Base Pay ($)')

plt.show()

This makes a lot of sense because an employee's benefits is based on a percentage of their base pay. The San Francisco Human Resources department includes this information on their website here.

As we move further into our analysis of the data, it makes the most sense to focus on the BasePay column. Both Benefits and OvertimePay are dependent of the BasePay.

#### 4.5 - Breaking down the base pay by job category¶

Next we'll analyze the base pay of full time workers by job category.

In [24]:
pal = sns.diverging_palette(0, 255, n=2)
ax = sns.factorplot(x='BasePay', y='Job_Group', hue='Gender', data=data_2014_FT,
size=10, kind="bar", palette=pal, ci=None)

plt.title('Full Time Workers')
plt.xlabel('Base Pay ($)') plt.ylabel('Job Group') plt.show() At a glance, we can't really draw any conclusive statements about the gender wage gap. Some job categories favor females, some favor males. It really depends on what job group the employee is actually in. Maybe it makes more sense to calculate the the difference between these two bars. In [25]: salaries_by_group = pd.pivot_table(data = data_2014_FT, values = 'BasePay', columns = 'Job_Group', index='Gender', aggfunc = np.mean) count_by_group = pd.pivot_table(data = data_2014_FT, values = 'Id', columns = 'Job_Group', index='Gender', aggfunc = len) salaries_by_group Out[25]: Job_Group Airport Staff Analyst Architect Assistant Attorney Automotive Clerk Court Custodian Engineer ... Manager Mayor Office Medical Other Police Porter Public Works Recreation Leader Social Worker Transit Gender female 93049.884167 95567.748640 114343.524687 71048.813761 161300.608015 74622.464000 60660.043242 71426.702368 53371.864054 99386.705777 ... 135784.930749 101367.370667 97588.758869 94161.462824 96904.757377 54082.077759 81163.647556 62936.435455 75711.145885 75066.864596 male 92446.252414 96833.190149 120101.229531 67914.279570 163313.664820 77187.707896 58414.105333 76148.568065 53336.443035 103364.369535 ... 141848.676974 110323.298000 98908.448211 95949.373554 107490.302458 53561.220392 80571.132034 64989.786250 75083.555331 71368.674835 2 rows × 26 columns In [26]: fig = plt.figure(figsize=(10, 15)) sns.set(font_scale=1.5) differences = (salaries_by_group.loc['female'] - salaries_by_group.loc['male'])*100/salaries_by_group.loc['male'] labels = differences.sort_values().index x = differences.sort_values() y = [i for i in range(len(differences))] palette = sns.diverging_palette(240, 10, n=28, center ='dark') ax = sns.barplot(x, y, orient = 'h', palette = palette) #Draws the two arrows bbox_props = dict(boxstyle="rarrow,pad=0.3", fc="white", ec="black", lw=1) t = plt.text(5.5, 12, "Higher pay for females", ha="center", va="center", rotation=0, size=15, bbox=bbox_props) bbox_props2 = dict(boxstyle="larrow,pad=0.3", fc="white", ec="black", lw=1) t = plt.text(-5.5, 12, "Higher pay for males", ha="center", va="center", rotation=0, size=15, bbox=bbox_props2) #Labels each bar with the percentage of females percent_labels = count_by_group[labels].iloc[0]*100 \ /(count_by_group[labels].iloc[0] + count_by_group[labels].iloc[1]) for i in range(len(ax.patches)): p = ax.patches[i] width = p.get_width()*1+1 ax.text(15, p.get_y()+p.get_height()/2+0.3, '{:1.0f}'.format(percent_labels[i])+' %', ha="center") ax.text(15, -1+0.3, 'Female Representation', ha="center", fontname='Arial', rotation = 0) plt.yticks(range(len(differences)), labels) plt.title('Full Time Workers (Base Pay)') plt.xlabel('Mean Percent Difference in Pay (Females - Males)') plt.xlim(-11, 11) plt.show() I believe this is a better way to represent the gender wage gap. I calculated the mean difference between female and male pay based on job categories. Then I converted the values into a percentage by using this formula: $$\text{Mean Percent Difference} = \frac{\text{(Female Mean Pay - Male Mean Pay)*100}} {\text{Male Mean Pay}}$$ The theory stating that women makes 78 cents for every dollar men makes implies a 22% pay difference. None of these percentages were more than 10%, and not all of these percentage values showed favoritism towards males. However, we should keep in mind that this data set only applies to San Francisco public jobs. We should also keep in mind that we do not have access to job experience data which would directly correlate with base pay. In addition, I included a short table of female representation for each job group on the right side of the graph. We'll dig further into this on the next section. #### 4.6 - Gender representation by job category¶ In [27]: contingency_table = pd.crosstab( data_2014_FT['Gender'], data_2014_FT['Job_Group'], margins = True ) contingency_table Out[27]: Job_Group Airport Staff Analyst Architect Assistant Attorney Automotive Clerk Court Custodian Engineer ... Mayor Office Medical Other Police Porter Public Works Recreation Leader Social Worker Transit All Gender female 24 375 32 327 136 5 694 76 111 206 ... 15 1088 1094 648 58 90 22 520 285 6493 male 29 335 64 186 139 366 270 31 201 968 ... 15 408 1770 2539 102 59 32 332 1390 11802 All 53 710 96 513 275 371 964 107 312 1174 ... 30 1496 2864 3187 160 149 54 852 1675 18295 3 rows × 27 columns In [28]: #Assigns the frequency values femalecount = contingency_table.iloc[0][0:-1].values malecount = contingency_table.iloc[1][0:-1].values totals = contingency_table.iloc[2][0:-1] femalepercentages = femalecount*100/totals malepercentages = malecount*100/totals malepercentages=malepercentages.sort_values(ascending=True) femalepercentages=femalepercentages.sort_values(ascending=False) length = range(len(femalepercentages)) #Plots the bar chart fig = plt.figure(figsize=(10, 12)) sns.set(font_scale=1.5) p1 = plt.barh(length, malepercentages.values, 0.55, label='Male', color='#AFAFF5') p2 = plt.barh(length, femalepercentages, 0.55, left=malepercentages, color='#EFAFB5', label='Female') labels = malepercentages.index plt.yticks(range(len(malepercentages)), labels) plt.xticks([0, 25, 50, 75, 100], ['0 %', '25 %', '50 %', '75 %', '100 %']) plt.xlabel('Percentage of Males') plt.title('Gender Representation by Job Group') plt.legend(bbox_to_anchor=(0, 1, 1, 0), loc=3, ncol=2, mode="expand", borderaxespad=0) plt.show() The chart above does not include any information based on pay. I wanted to show an overview of gender representation based on job category. It is safe to say, women don't like working with automotives with <1% female representation. Where as female representation is highest for medical jobs at 73%. #### 4.7 - Significance testing by exact job title¶ So what if breaking down the wage gap by job category is not good enough? Should we break down the gender gap by exact job title? Afterall, the argument is for equal pay for equal work. We can assume equal work if the job titles are exactly the same. We can use hypothesis testing using the Welch's t-test to determine if there is a statistically significant result between male and female wages. The Welch's t-test is very robust as it doesn't assume equal variance and equal sample size. It does however, assume a normal distrbution which is well represented by the KDE plots. I talk about this in detail in my blog post here. Let's state our null and alternative hypothesis:$ H_0 : \text{There is no statistically significant relationship between gender and pay.}  H_a : \text{There is a statistically significant relationship between gender and pay.} $We are going to use only job titles with more than 100 employees, and job titles with more than 30 females and 30 males for this t-test. Using a for loop, we'll perform the Welch's t-test on every job title tat matches our criteria. In [29]: from scipy import stats #Significance testing by job title job_titles = data_2014['JobTitle'].value_counts(dropna=True) job_titles_over_100 = job_titles[job_titles > 100 ] t_scores = {} for title,count in job_titles_over_100.iteritems(): male_pay = pd.to_numeric(male_only[male_only['JobTitle'] == title]['BasePay']) female_pay = pd.to_numeric(female_only[female_only['JobTitle'] == title]['BasePay']) if female_pay.shape[0] < 30: continue if male_pay.shape[0] < 30: continue t_scores[title] = stats.ttest_ind_from_stats( mean1=male_pay.mean(), std1=(male_pay.std()), nobs1= male_pay.shape[0], \ mean2=female_pay.mean(), std2=(female_pay.std()), nobs2=female_pay.shape[0], \ equal_var=False) for key, value in t_scores.items(): if value[1] < 0.05: print(len(key)*'-') print(key) print(len(key)*'-') print(t_scores[key]) print(' ') print('Male: {}'.format((male_only[male_only['JobTitle'] == key]['BasePay']).mean())) print('sample size: {}'.format(male_only[male_only['JobTitle'] == key].shape[0])) print(' ') print('Female: {}'.format((female_only[female_only['JobTitle'] == key]['BasePay']).mean())) print('sample size: {}'.format(female_only[female_only['JobTitle'] == key].shape[0])) ------------------------------ Community Police Services Aide ------------------------------ Ttest_indResult(statistic=2.1634922455155117, pvalue=0.03368488556993355) Male: 64833.20256410256 sample size: 78 Female: 63633.422115384616 sample size: 52 ---------- Manager II ---------- Ttest_indResult(statistic=2.4258239418558945, pvalue=0.017341194165396163) Male: 121976.36624999999 sample size: 56 Female: 117550.0123529412 sample size: 51 ---------------- Transit Operator ---------------- Ttest_indResult(statistic=5.20560164116152, pvalue=3.9803552807992223e-07) Male: 64068.532049433576 sample size: 971 Female: 62491.15360215054 sample size: 186 ------------ Senior Clerk ------------ Ttest_indResult(statistic=-2.5391065634795362, pvalue=0.013575306848745547) Male: 51291.90378378379 sample size: 37 Female: 53830.58425 sample size: 120 ---------------- Registered Nurse ---------------- Ttest_indResult(statistic=-3.7087643141584667, pvalue=0.00033539561477398915) Male: 128033.80955882354 sample size: 68 Female: 132992.0894059406 sample size: 202 In [30]: len(t_scores) Out[30]: 25 Out of the 25 jobs that were tested using the Welch's t-test, 5 jobs resulted in a p-value of less than 0.05. However, not all jobs showed favoritism towards males. 'Registered Nurse' and 'Senior Clerk' both showed an average pay in favor of females. However, we should take the Welch's t-test results with a grain of salt. We do not have data on the work experience of the employees. Maybe female nurses have more work experience over males. Maybe male transit operators have more work experience over females. We don't actually know. Since BasePay is a function of work experience, without this critical piece of information, we can not make any conclusions based on the t-test alone. All we know is that a statistically significant difference exists. ### 5.0 - San Francisco vs. Newport Beach¶ Let's take a look at more a more conservative city such as Newport Beach. This data can be downloaded at Transparent California here. We can process the data similar to the San Francisco data set. The following code performs the following: • Read the data using pandas • Create the Job_Group column • Create the Gender column • Create two new dataframes: one for part time workers and one for full time workers In [31]: #Reads in the data nb_data = pd.read_csv('newport-beach-2016.csv') #Creates job groups def find_job_title_nb(row): titles = collections.OrderedDict([ ('Police',['police', 'sherif', 'probation', 'sergeant', 'officer', 'lieutenant']), ('Fire', ['fire']), ('Transit',['mta', 'transit']), ('Medical',['anesth', 'medical', 'nurs', 'health', 'physician', 'orthopedic', 'pharm', 'care']), ('Architect', ['architect']), ('Court',['court', 'legal']), ('Mayor Office', ['mayoral']), ('Library', ['librar']), ('Public Works', ['public']), ('Attorney', ['attorney']), ('Custodian', ['custodian']), ('Gardener', ['garden']), ('Recreation Leader', ['recreation']), ('Automotive',['automotive', 'mechanic', 'truck']), ('Engineer',['engineer', 'engr', 'eng', 'program']), ('General Laborer',['general laborer', 'painter', 'inspector', 'carpenter', 'electrician', 'plumber', 'maintenance']), ('Food Services', ['food serv']), ('Clerk', ['clerk']), ('Porter', ['porter']), ('Airport Staff', ['airport']), ('Social Worker',['worker']), ('Guard', ['guard']), ('Assistant',['aide', 'assistant', 'secretary', 'attendant']), ('Analyst', ['analy']), ('Manager', ['manager']) ]) #Loops through the dictionaries for group, keywords in titles.items(): for keyword in keywords: if keyword in row.lower(): return group return 'Other' start_time = time.time() nb_data["Job_Group"]=data["JobTitle"].map(find_job_title_nb) #Create the 'Gender' column based on employee's first name. d = gender.Detector(case_sensitive=False) nb_data['FirstName'] = nb_data['Employee Name'].str.split().apply(lambda x: x[0]) nb_data['Gender'] = nb_data['FirstName'].apply(lambda x: d.get_gender(x)) nb_data['Gender'].value_counts() #Retain data with 'male' and 'female' names. nb_male_female_only = nb_data[(nb_data['Gender'] == 'male') | (nb_data['Gender'] == 'female')] nb_male_female_only['Gender'].value_counts() #Seperates full time/part time data nb_data_FT = nb_male_female_only[nb_male_female_only['Status'] == 'FT'] nb_data_PT = nb_male_female_only[nb_male_female_only['Status'] == 'PT'] nb_data_FT.head() Out[31]: Employee Name Job Title Base Pay Overtime Pay Other Pay Benefits Total Pay Total Pay & Benefits Year Notes Agency Status Job_Group FirstName Gender 0 Scott Poster FIRE CHIEF 228067.56 0.00 57289.33 145114.67 285356.89 430471.56 2016 NaN Newport Beach FT Transit Scott male 1 Charles Duncan FIRE ASSISTANT CHIEF 197131.52 0.00 44747.91 144994.50 241879.43 386873.93 2016 NaN Newport Beach FT Police Charles male 2 Ty Lunde FIRE CAPTAIN + 7.5% 118806.31 130011.25 30486.04 106032.23 279303.60 385335.83 2016 NaN Newport Beach FT Police Ty male 3 Jon Lewis POLICE CHIEF 208498.34 0.00 29679.40 118476.62 238177.74 356654.36 2016 NaN Newport Beach FT Automotive Jon male 4 David Kiff CITY MANAGER 262652.38 0.00 13067.92 77488.07 275720.30 353208.37 2016 NaN Newport Beach FT Fire David male #### 5.1 - Part time vs. full time workers¶ In [32]: fig = plt.figure(figsize=(10, 5)) nb_male_only = nb_data_PT[nb_data_PT['Gender'] == 'male'] nb_female_only = nb_data_PT[nb_data_PT['Gender'] == 'female'] ax = fig.add_subplot(1, 1, 1) ax = sns.kdeplot(nb_male_only['Total Pay & Benefits'], color ='Blue', label='Male', shade=True) ax = sns.kdeplot(nb_female_only['Total Pay & Benefits'], color='Red', label='Female', shade=True) plt.title('Newport Beach: Part Time Workers') plt.ylabel('Density of Employees') plt.xlabel('Total Pay + Benefits ($)')
plt.xlim(0, 400000)
plt.yticks([])

plt.show()

Similar to the KDE plot for San Francisco, the KDE plot is nearly identical for both males and females for part time workers.

Let's take a look at the full time workers.

In [33]:
fig = plt.figure(figsize=(10, 10))

#Generate the top chart
nb_male_only = nb_data_FT[nb_data_FT['Gender'] == 'male']
nb_female_only = nb_data_FT[nb_data_FT['Gender'] == 'female']
ax = sns.kdeplot(nb_male_only['Total Pay & Benefits'], color ='Blue', label='Male', shade=True)
ax = sns.kdeplot(nb_female_only['Total Pay & Benefits'], color='Red', label='Female', shade=True)
plt.title('Newport Beach: Full Time Workers')
plt.ylabel('Density of Employees')
plt.xlabel('Total Pay + Benefits ($)') plt.xlim(0, 400000) plt.yticks([]) #Generate the bottom chart male_only = data_2014_FT[data_2014_FT['Gender'] == 'male'] female_only = data_2014_FT[data_2014_FT['Gender'] == 'female'] ax2 = fig.add_subplot(2, 1, 2) ax2 = sns.kdeplot(male_only['TotalPayBenefits'], color ='Blue', label='Male', shade=True) ax2 = sns.kdeplot(female_only['TotalPayBenefits'], color='Red', label='Female', shade=True) plt.title('San Francisco: Full Time Workers') plt.ylabel('Density of Employees') plt.xlabel('Total Pay + Benefits ($)')
plt.xlim(0, 400000)
plt.yticks([])

plt.show()

The kurtosis of the KDE plot for Newport Beach full time workers is lower than KDE plot for San Francisco full time workers. We can see a higher gender wage gap for Newport beach workers than San Francisco workers. However, these two plots do not tell us the full story. We need to break down the data by job category.

#### 5.2 - Comparisons by job cateogry¶

In [34]:
nb_salaries_by_group = pd.pivot_table(data = nb_data_FT,
values = 'Base Pay',
columns = 'Job_Group', index='Gender',
aggfunc = np.mean,)

nb_salaries_by_group
Out[34]:
Job_Group Assistant Attorney Automotive Engineer Fire General Laborer Manager Medical Other Police Public Works Transit
Gender
female NaN 70139.559524 NaN NaN 76448.505882 NaN 114664.376667 85279.764286 82460.784211 78270.252857 NaN 82686.25500
male 123162.972 75404.094000 114410.966667 63998.7 99360.982205 86154.174286 94423.185000 102067.444857 105185.938529 96273.211111 135033.84 99279.03913
In [35]:
fig = plt.figure(figsize=(10, 7.5))
sns.set(font_scale=1.5)

differences = (nb_salaries_by_group.loc['female'] - nb_salaries_by_group.loc['male'])*100/nb_salaries_by_group.loc['male']
nb_labels  = differences.sort_values().index
x = differences.sort_values()
y = [i for i in range(len(differences))]
nb_palette = sns.diverging_palette(240, 10, n=9, center ='dark')
ax = sns.barplot(x, y, orient = 'h', palette = nb_palette)

plt.yticks(range(len(differences)), nb_labels)
plt.title('Newport Beach: Full Time Workers (Base Pay)')
plt.xlabel('Mean Percent Difference in Pay (Females - Males)')
plt.xlim(-25, 25)
plt.show()

Most of these jobs shows a higher average pay for males. The only job category where females were paid higher on average was 'Manager'. Some of these job categories do not even have a single female within the category, so the difference cannot be calculated. We should create a contingency table to check the sample size of our data.

#### 5.3 - Gender representation by job category¶

In [36]:
nb_contingency_table = pd.crosstab(
nb_data_FT['Gender'],
nb_data_FT['Job_Group'],
margins = True
)
nb_contingency_table
Out[36]:
Job_Group Assistant Attorney Automotive Engineer Fire General Laborer Manager Medical Other Police Public Works Transit All
Gender
female 0 21 0 0 34 0 3 28 19 49 0 4 158
male 5 20 3 1 127 7 16 70 34 126 1 23 433
All 5 41 3 1 161 7 19 98 53 175 1 27 591

The number of public jobs is much lower in Newport Beach compared to San Francisco. With only 3 female managers working full time in Newport Beach, we can't really say female managers make more money on average than male managers.

In [37]:
#Assigns the frequency values
nb_femalecount = nb_contingency_table.iloc[0][0:-1].values
nb_malecount = nb_contingency_table.iloc[1][0:-1].values

nb_totals = nb_contingency_table.iloc[2][0:-1]
nb_femalepercentages = nb_femalecount*100/nb_totals
nb_malepercentages = nb_malecount*100/nb_totals

nb_malepercentages=nb_malepercentages.sort_values(ascending=True)
nb_femalepercentages=nb_femalepercentages.sort_values(ascending=False)
nb_length = range(len(nb_malepercentages))

#Plots the bar chart
fig = plt.figure(figsize=(10, 10))
sns.set(font_scale=1.5)
p1 = plt.barh(nb_length, nb_malepercentages.values, 0.55, label='Male', color='#AFAFF5')
p2 = plt.barh(nb_length, nb_femalepercentages, 0.55, left=nb_malepercentages, color='#EFAFB5', label='Female')
labels = nb_malepercentages.index
plt.yticks(range(len(nb_malepercentages)), labels)
plt.xticks([0, 25, 50, 75, 100], ['0 %', '25 %', '50 %', '75 %', '100 %'])
plt.xlabel('Percentage of Males')
plt.title('Gender Representation by Job Group')
plt.legend(bbox_to_anchor=(0, 1, 1, 0), loc=3,