In this project we will look at the earnings of recent college graduates, broken down by major, using the data in 'recent-grads.csv'. We'll visualize the data using histograms, bar charts, and scatter plots and see whether any interesting patterns emerge. The main purpose of the project, however, is to practice some data visualization tools.
import pandas as pd
import matplotlib.pyplot as plt
#jupyter magic so the plots are displayed inline
%matplotlib inline
recent_grads = pd.read_csv('recent-grads.csv')
recent_grads.iloc[0]
recent_grads.head(1)
recent_grads.tail(1)
recent_grads.describe()
First, let's clean up the data a bit and drop the rows that contain missing (NaN) values.
recent_grads = recent_grads.dropna()
recent_grads
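Before moving on, it can help to check how many rows dropna() actually removes. The following is a minimal sketch on a small made-up DataFrame (called toy here, with invented values), not the real file, so treat the numbers as illustration only:

```python
import numpy as np
import pandas as pd

# Made-up stand-in for recent_grads; the values are illustrative assumptions.
toy = pd.DataFrame({
    'Major': ['A', 'B', 'C', 'D'],
    'Median': [60000, 45000, np.nan, 38000],
    'ShareWomen': [0.20, np.nan, 0.55, 0.70],
})

rows_before = len(toy)
missing_per_column = toy.isna().sum()  # number of NaN values in each column
cleaned = toy.dropna()                 # drop every row that contains a NaN
rows_after = len(cleaned)

print(missing_per_column)
print(f'dropped {rows_before - rows_after} of {rows_before} rows')
```

Running the same three steps on recent_grads would show which columns contain the missing values and how many majors are lost by dropping them.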
Let's begin exploring the data using scatter plots and see if we can draw any interesting correlations.
recent_grads.plot(x='Sample_size', y='Median', kind = 'scatter')
recent_grads.plot(x='Sample_size', y='Unemployment_rate', kind = 'scatter')
recent_grads.plot(x='Full_time', y='Median', kind = 'scatter')
recent_grads.plot(x='ShareWomen', y='Unemployment_rate', kind = 'scatter')
recent_grads.plot(x='Men', y='Median', kind = 'scatter')
recent_grads.plot(x='Women', y='Median', kind = 'scatter')
From the 'Unemployment_rate' vs. 'ShareWomen' plot, it looks like there is no correlation between the unemployment rate and the share of women in a major.
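A visual impression like this can be backed up with a number. As a rough sketch, the Pearson correlation coefficient between two columns can be computed with Series.corr(); the data below is made up for illustration (the toy name and its values are assumptions), and a coefficient near 0 would match a scatter plot with no visible trend:

```python
import pandas as pd

# Made-up values chosen to show no clear trend; not taken from the real file.
toy = pd.DataFrame({
    'ShareWomen': [0.10, 0.25, 0.40, 0.55, 0.70, 0.85],
    'Unemployment_rate': [0.06, 0.09, 0.05, 0.08, 0.06, 0.07],
})

# Pearson correlation: +1 is a perfect positive linear relationship,
# -1 a perfect negative one, and values near 0 mean no linear relationship.
r = toy['ShareWomen'].corr(toy['Unemployment_rate'])
print(round(r, 3))
```

The same call on recent_grads['ShareWomen'] and recent_grads['Unemployment_rate'] would give the coefficient for the real data.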
The other scatter plots don't reveal much useful information either, so let's explore the data a bit further using histograms instead.
In each histogram, the y axis shows how many majors fall into each bin, and the x axis shows the values of the column named in the code.
recent_grads['Median'].hist(bins=25)
recent_grads['Employed'].hist(bins=25)
recent_grads['Full_time'].hist(bins=25)
recent_grads['ShareWomen'].hist(bins=25)
recent_grads['Unemployment_rate'].hist(bins=25)
recent_grads['Men'].hist(bins=25)
recent_grads['Women'].hist(bins=25)
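To make the meaning of the bars concrete, the counts behind a histogram can be computed directly. This is a sketch using numpy's histogram function on a made-up series of median salaries (the values are assumptions, not from the file); as far as I know, the plotting call uses the same equal-width binning by default:

```python
import numpy as np
import pandas as pd

# Made-up median salaries, one per hypothetical major.
medians = pd.Series([30000, 32000, 35000, 36000, 40000,
                     41000, 45000, 60000, 75000, 110000])

# Split the range of the data into 4 equal-width bins and count the values in each.
counts, edges = np.histogram(medians, bins=4)

for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f'{left:>9.0f} to {right:>9.0f}: {count} majors')
```

Each printed count corresponds to the height of one bar, and the bin edges correspond to positions along the x axis.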
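Histograms show the distribution of one column at a time, so they cannot reveal correlations between columns. They do show that the unemployment rate varies from major to major. If the unemployment rate were unrelated to major, we would expect all majors to have roughly the same rate, and the histogram would show a single narrow peak rather than a spread of values.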
Next we'll use the scatter matrix from pandas to see if we can draw more insight. A scatter matrix plots every pair of the selected columns against each other, with a histogram of each column on the diagonal, which lets us quickly see whether any of those variables are correlated.
from pandas.plotting import scatter_matrix
scatter_matrix(recent_grads[['Sample_size', 'Median']], figsize=(10,10))
scatter_matrix(recent_grads[['Men', 'ShareWomen', 'Median']], figsize=(10,10))
We are not seeing much correlation in these plots, but there is a weak negative correlation between 'ShareWomen' and 'Median': majors with fewer women tend to have higher earnings. This could be because high-paying majors like engineering tend to have fewer women.
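A correlation matrix is the numeric counterpart of a scatter matrix and gives one coefficient per pair of columns. The sketch below uses DataFrame.corr() on made-up data (the toy frame and its values are assumptions for illustration), so only the sign and rough size of the coefficients are meaningful here:

```python
import pandas as pd

# Made-up values that mimic the pattern described above; not the real data.
toy = pd.DataFrame({
    'Men': [9000, 7000, 6500, 3000, 2500, 1500],
    'ShareWomen': [0.12, 0.20, 0.35, 0.55, 0.70, 0.80],
    'Median': [70000, 62000, 50000, 52000, 38000, 35000],
})

# Pairwise Pearson correlations between all numeric columns.
corr = toy.corr()
print(corr.round(2))
```

Applied to recent_grads[['Men', 'ShareWomen', 'Median']], the entry in the 'ShareWomen' row and 'Median' column would quantify how weak or strong the negative relationship really is.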
The first ten rows in the data are mostly engineering majors, and the last ten rows are non-engineering majors. We can generate a bar chart of 'ShareWomen' for each 'Major' in these two groups to see if our hypothesis is correct.
recent_grads[:10].plot(kind='bar', x='Major', y='ShareWomen', colormap='winter')
recent_grads[-10:].plot(kind='bar', x='Major', y='ShareWomen', colormap='winter')
Let's plot the 'Median' income for the same majors to see whether the engineering majors earn more.
recent_grads[:10].plot(kind='bar', x='Major', y='Median', colormap='winter')
recent_grads[-10:].plot(kind='bar', x='Major', y='Median', colormap='winter')
Our hypothesis appears to be correct, at least for the majors we selected: majors with fewer women, such as engineering, tend to earn higher salaries. Twenty majors are a small sample, though, so this is an observation rather than a proof.
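The comparison read off the bar charts can also be summarized as averages for the first and last rows. This is only a sketch on a made-up table (the toy frame, its major names, and the group size n are assumptions), showing one way to put the two groups side by side:

```python
import pandas as pd

# Made-up rows standing in for the first and last majors in the file.
toy = pd.DataFrame({
    'Major': ['Eng A', 'Eng B', 'Eng C', 'Other A', 'Other B', 'Other C'],
    'ShareWomen': [0.12, 0.15, 0.30, 0.60, 0.75, 0.80],
    'Median': [110000, 75000, 70000, 30000, 28000, 22000],
})

n = 3  # rows per group; the project itself uses 10
columns = ['ShareWomen', 'Median']

# Mean of each column for the first n rows and for the last n rows.
summary = pd.DataFrame({
    'first_rows': toy.head(n)[columns].mean(),
    'last_rows': toy.tail(n)[columns].mean(),
})
print(summary)
```

With n = 10 on recent_grads, the two columns of the summary would give the average share of women and the average median salary for the two groups plotted above.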
Learning Summary
Python concepts explored: pandas, matplotlib, histograms, bar charts, scatterplots, scatter matrices
Python functions and methods used: .plot(), scatter_matrix(), hist(), iloc[], .head(), .tail(), .describe()
The files used for this project can be found in my GitHub repository.