In this project we will look at the earnings of recent college graduates by major, using the data in 'recent-grads.csv'. We'll visualize the data with histograms, bar charts, and scatter plots and see whether we can draw any interesting insights from it. The main purpose of the project, however, is to practice with some data visualization tools.

In [1]:

import pandas as pd
import matplotlib.pyplot as plt

#jupyter magic so the plots are displayed inline
%matplotlib inline

In [2]:

recent_grads = pd.read_csv('recent-grads.csv')
recent_grads.iloc[0]

Out[2]:

Rank                                        1
Major_code                               2419
Major                   PETROLEUM ENGINEERING
Total                                    2339
Men                                      2057
Women                                     282
Major_category                    Engineering
ShareWomen                           0.120564
Sample_size                                36
Employed                                 1976
Full_time                                1849
Part_time                                 270
Full_time_year_round                     1207
Unemployed                                 37
Unemployment_rate                   0.0183805
Median                                 110000
P25th                                   95000
P75th                                  125000
College_jobs                             1534
Non_college_jobs                          364
Low_wage_jobs                             193
Name: 0, dtype: object

In [3]:

recent_grads.head(1)

Out[3]:

	Rank	Major_code	Major	Total	Men	Women	Major_category	ShareWomen	Sample_size	Employed	...	Part_time	Full_time_year_round	Unemployed	Unemployment_rate	Median	P25th	P75th	College_jobs	Non_college_jobs	Low_wage_jobs
0	1	2419	PETROLEUM ENGINEERING	2339.0	2057.0	282.0	Engineering	0.120564	36	1976	...	270	1207	37	0.018381	110000	95000	125000	1534	364	193

1 rows × 21 columns

In [4]:

recent_grads.tail(1)

Out[4]:

	Rank	Major_code	Major	Total	Men	Women	Major_category	ShareWomen	Sample_size	Employed	...	Part_time	Full_time_year_round	Unemployed	Unemployment_rate	Median	P25th	P75th	College_jobs	Non_college_jobs	Low_wage_jobs
172	173	3501	LIBRARY SCIENCE	1098.0	134.0	964.0	Education	0.87796	2	742	...	237	410	87	0.104946	22000	20000	22000	288	338	192

1 rows × 21 columns

In [5]:

recent_grads.describe()

Out[5]:

	Rank	Major_code	Total	Men	Women	ShareWomen	Sample_size	Employed	Full_time	Part_time	Full_time_year_round	Unemployed	Unemployment_rate	Median	P25th	P75th	College_jobs	Non_college_jobs	Low_wage_jobs
count	173.000000	173.000000	172.000000	172.000000	172.000000	172.000000	173.000000	173.000000	173.000000	173.000000	173.000000	173.000000	173.000000	173.000000	173.000000	173.000000	173.000000	173.000000	173.000000
mean	87.000000	3879.815029	39370.081395	16723.406977	22646.674419	0.522223	356.080925	31192.763006	26029.306358	8832.398844	19694.427746	2416.329480	0.068191	40151.445087	29501.445087	51494.219653	12322.635838	13284.497110	3859.017341
std	50.084928	1687.753140	63483.491009	28122.433474	41057.330740	0.231205	618.361022	50675.002241	42869.655092	14648.179473	33160.941514	4112.803148	0.030331	11470.181802	9166.005235	14906.279740	21299.868863	23789.655363	6944.998579
min	1.000000	1100.000000	124.000000	119.000000	0.000000	0.000000	2.000000	0.000000	111.000000	0.000000	111.000000	0.000000	0.000000	22000.000000	18500.000000	22000.000000	0.000000	0.000000	0.000000
25%	44.000000	2403.000000	4549.750000	2177.500000	1778.250000	0.336026	39.000000	3608.000000	3154.000000	1030.000000	2453.000000	304.000000	0.050306	33000.000000	24000.000000	42000.000000	1675.000000	1591.000000	340.000000
50%	87.000000	3608.000000	15104.000000	5434.000000	8386.500000	0.534024	130.000000	11797.000000	10048.000000	3299.000000	7413.000000	893.000000	0.067961	36000.000000	27000.000000	47000.000000	4390.000000	4595.000000	1231.000000
75%	130.000000	5503.000000	38909.750000	14631.000000	22553.750000	0.703299	338.000000	31433.000000	25147.000000	9948.000000	16891.000000	2393.000000	0.087557	45000.000000	33000.000000	60000.000000	14444.000000	11783.000000	3466.000000
max	173.000000	6403.000000	393735.000000	173809.000000	307087.000000	0.968954	4212.000000	307933.000000	251540.000000	115172.000000	199897.000000	28169.000000	0.177226	110000.000000	95000.000000	125000.000000	151643.000000	148395.000000	48207.000000

First, let's clean up the data a bit and drop the rows that have NaN as values.
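Before dropping anything, it can help to see where the missing values are. The count row of describe() above already hints at this: 'Total', 'Men', 'Women', and 'ShareWomen' have 172 values while the other columns have 173. The sketch below shows the idea on a made-up three-row frame (the frame `toy` and its values are placeholders, not the real dataset); the same calls should work on recent_grads.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for recent_grads, with one missing 'Total' value
toy = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Total": [2339.0, np.nan, 856.0],
    "Median": [110000, 53000, 73000],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# The rows that dropna() is about to remove
rows_with_missing = toy[toy.isna().any(axis=1)]

# Drop every row that contains at least one NaN
cleaned = toy.dropna()

print(missing_per_column)
print(rows_with_missing)
print(len(toy) - len(cleaned), "row(s) dropped")
```

Checking the rows first makes it clear how much data is being discarded and why.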

In [6]:

recent_grads = recent_grads.dropna()
recent_grads

Out[6]:

	Rank	Major_code	Major	Total	Men	Women	Major_category	ShareWomen	Sample_size	Employed	...	Part_time	Full_time_year_round	Unemployed	Unemployment_rate	Median	P25th	P75th	College_jobs	Non_college_jobs	Low_wage_jobs
0	1	2419	PETROLEUM ENGINEERING	2339.0	2057.0	282.0	Engineering	0.120564	36	1976	...	270	1207	37	0.018381	110000	95000	125000	1534	364	193
1	2	2416	MINING AND MINERAL ENGINEERING	756.0	679.0	77.0	Engineering	0.101852	7	640	...	170	388	85	0.117241	75000	55000	90000	350	257	50
2	3	2415	METALLURGICAL ENGINEERING	856.0	725.0	131.0	Engineering	0.153037	3	648	...	133	340	16	0.024096	73000	50000	105000	456	176	0
3	4	2417	NAVAL ARCHITECTURE AND MARINE ENGINEERING	1258.0	1123.0	135.0	Engineering	0.107313	16	758	...	150	692	40	0.050125	70000	43000	80000	529	102	0
4	5	2405	CHEMICAL ENGINEERING	32260.0	21239.0	11021.0	Engineering	0.341631	289	25694	...	5180	16697	1672	0.061098	65000	50000	75000	18314	4440	972
5	6	2418	NUCLEAR ENGINEERING	2573.0	2200.0	373.0	Engineering	0.144967	17	1857	...	264	1449	400	0.177226	65000	50000	102000	1142	657	244
6	7	6202	ACTUARIAL SCIENCE	3777.0	2110.0	1667.0	Business	0.441356	51	2912	...	296	2482	308	0.095652	62000	53000	72000	1768	314	259
7	8	5001	ASTRONOMY AND ASTROPHYSICS	1792.0	832.0	960.0	Physical Sciences	0.535714	10	1526	...	553	827	33	0.021167	62000	31500	109000	972	500	220
8	9	2414	MECHANICAL ENGINEERING	91227.0	80320.0	10907.0	Engineering	0.119559	1029	76442	...	13101	54639	4650	0.057342	60000	48000	70000	52844	16384	3253
9	10	2408	ELECTRICAL ENGINEERING	81527.0	65511.0	16016.0	Engineering	0.196450	631	61928	...	12695	41413	3895	0.059174	60000	45000	72000	45829	10874	3170
10	11	2407	COMPUTER ENGINEERING	41542.0	33258.0	8284.0	Engineering	0.199413	399	32506	...	5146	23621	2275	0.065409	60000	45000	75000	23694	5721	980
11	12	2401	AEROSPACE ENGINEERING	15058.0	12953.0	2105.0	Engineering	0.139793	147	11391	...	2724	8790	794	0.065162	60000	42000	70000	8184	2425	372
12	13	2404	BIOMEDICAL ENGINEERING	14955.0	8407.0	6548.0	Engineering	0.437847	79	10047	...	2694	5986	1019	0.092084	60000	36000	70000	6439	2471	789
13	14	5008	MATERIALS SCIENCE	4279.0	2949.0	1330.0	Engineering	0.310820	22	3307	...	878	1967	78	0.023043	60000	39000	65000	2626	391	81
14	15	2409	ENGINEERING MECHANICS PHYSICS AND SCIENCE	4321.0	3526.0	795.0	Engineering	0.183985	30	3608	...	811	2004	23	0.006334	58000	25000	74000	2439	947	263
15	16	2402	BIOLOGICAL ENGINEERING	8925.0	6062.0	2863.0	Engineering	0.320784	55	6170	...	1983	3413	589	0.087143	57100	40000	76000	3603	1595	524
16	17	2412	INDUSTRIAL AND MANUFACTURING ENGINEERING	18968.0	12453.0	6515.0	Engineering	0.343473	183	15604	...	2243	11326	699	0.042876	57000	37900	67000	8306	3235	640
17	18	2400	GENERAL ENGINEERING	61152.0	45683.0	15469.0	Engineering	0.252960	425	44931	...	7199	33540	2859	0.059824	56000	36000	69000	26898	11734	3192
18	19	2403	ARCHITECTURAL ENGINEERING	2825.0	1835.0	990.0	Engineering	0.350442	26	2575	...	343	1848	170	0.061931	54000	38000	65000	1665	649	137
19	20	3201	COURT REPORTING	1148.0	877.0	271.0	Law & Public Policy	0.236063	14	930	...	223	808	11	0.011690	54000	50000	54000	402	528	144
20	21	2102	COMPUTER SCIENCE	128319.0	99743.0	28576.0	Computers & Mathematics	0.222695	1196	102087	...	18726	70932	6884	0.063173	53000	39000	70000	68622	25667	5144
22	23	2502	ELECTRICAL ENGINEERING TECHNOLOGY	11565.0	8181.0	3384.0	Engineering	0.292607	97	8587	...	1873	5681	824	0.087557	52000	35000	60000	5126	2686	696
23	24	2413	MATERIALS ENGINEERING AND MATERIALS SCIENCE	2993.0	2020.0	973.0	Engineering	0.325092	22	2449	...	1040	1151	70	0.027789	52000	35000	62000	1911	305	70
24	25	6212	MANAGEMENT INFORMATION SYSTEMS AND STATISTICS	18713.0	13496.0	5217.0	Business	0.278790	278	16413	...	2420	13017	1015	0.058240	51000	38000	60000	6342	5741	708
25	26	2406	CIVIL ENGINEERING	53153.0	41081.0	12072.0	Engineering	0.227118	565	43041	...	10080	29196	3270	0.070610	50000	40000	60000	28526	9356	2899
26	27	5601	CONSTRUCTION SERVICES	18498.0	16820.0	1678.0	Industrial Arts & Consumer Services	0.090713	295	16318	...	1751	12313	1042	0.060023	50000	36000	60000	3275	5351	703
27	28	6204	OPERATIONS LOGISTICS AND E-COMMERCE	11732.0	7921.0	3811.0	Business	0.324838	156	10027	...	1183	7724	504	0.047859	50000	40000	60000	1466	3629	285
28	29	2499	MISCELLANEOUS ENGINEERING	9133.0	7398.0	1735.0	Engineering	0.189970	118	7428	...	1662	5476	597	0.074393	50000	39000	65000	3445	2426	365
29	30	5402	PUBLIC POLICY	5978.0	2639.0	3339.0	Law & Public Policy	0.558548	55	4547	...	1306	2776	670	0.128426	50000	35000	70000	1550	1871	340
30	31	2410	ENVIRONMENTAL ENGINEERING	4047.0	2662.0	1385.0	Engineering	0.342229	26	2983	...	930	1951	308	0.093589	50000	42000	56000	2028	830	260
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
143	144	1105	PLANT SCIENCE AND AGRONOMY	7416.0	4897.0	2519.0	Agriculture & Natural Resources	0.339671	110	6594	...	1246	4522	314	0.045455	32000	22900	40000	2089	3545	1231
144	145	2308	SCIENCE AND COMPUTER TEACHER EDUCATION	6483.0	2049.0	4434.0	Education	0.683943	59	5362	...	1227	3247	266	0.047264	32000	28000	39000	4214	1106	591
145	146	5200	PSYCHOLOGY	393735.0	86648.0	307087.0	Psychology & Social Work	0.779933	2584	307933	...	115172	174438	28169	0.083811	31500	24000	41000	125148	141860	48207
146	147	6002	MUSIC	60633.0	29909.0	30724.0	Arts	0.506721	419	47662	...	24943	21425	3918	0.075960	31000	22300	42000	13752	28786	9286
147	148	2306	PHYSICAL AND HEALTH EDUCATION TEACHING	28213.0	15670.0	12543.0	Education	0.444582	259	23794	...	7230	13651	1920	0.074667	31000	24000	40000	12777	9328	2042
148	149	6006	ART HISTORY AND CRITICISM	21030.0	3240.0	17790.0	Humanities & Liberal Arts	0.845934	204	17579	...	6140	9965	1128	0.060298	31000	23000	40000	5139	9738	3426
149	150	6000	FINE ARTS	74440.0	24786.0	49654.0	Arts	0.667034	623	59679	...	23656	31877	5486	0.084186	30500	21000	41000	20792	32725	11880
150	151	2901	FAMILY AND CONSUMER SCIENCES	58001.0	5166.0	52835.0	Industrial Arts & Consumer Services	0.910933	518	46624	...	15872	26906	3355	0.067128	30000	22900	40000	20985	20133	5248
151	152	5404	SOCIAL WORK	53552.0	5137.0	48415.0	Psychology & Social Work	0.904075	374	45038	...	13481	27588	3329	0.068828	30000	25000	35000	27449	14416	4344
152	153	1103	ANIMAL SCIENCES	21573.0	5347.0	16226.0	Agriculture & Natural Resources	0.752144	255	17112	...	5353	10824	917	0.050862	30000	22000	40000	5443	9571	2125
153	154	6003	VISUAL AND PERFORMING ARTS	16250.0	4133.0	12117.0	Arts	0.745662	132	12870	...	6253	6322	1465	0.102197	30000	22000	40000	3849	7635	2840
154	155	2312	TEACHER EDUCATION: MULTIPLE LEVELS	14443.0	2734.0	11709.0	Education	0.810704	142	13076	...	2214	8457	496	0.036546	30000	24000	37000	10766	1949	722
155	156	5299	MISCELLANEOUS PSYCHOLOGY	9628.0	1936.0	7692.0	Psychology & Social Work	0.798920	60	7653	...	3221	3838	419	0.051908	30000	20800	40000	2960	3948	1650
156	157	5403	HUMAN SERVICES AND COMMUNITY ORGANIZATION	9374.0	885.0	8489.0	Psychology & Social Work	0.905590	89	8294	...	2405	5061	326	0.037819	30000	24000	35000	2878	4595	724
157	158	3402	HUMANITIES	6652.0	2013.0	4639.0	Humanities & Liberal Arts	0.697384	49	5052	...	2225	2661	372	0.068584	30000	20000	49000	1168	3354	1141
158	159	4901	THEOLOGY AND RELIGIOUS VOCATIONS	30207.0	18616.0	11591.0	Humanities & Liberal Arts	0.383719	310	24202	...	8767	13944	1617	0.062628	29000	22000	38000	9927	12037	3304
159	160	6007	STUDIO ARTS	16977.0	4754.0	12223.0	Arts	0.719974	182	13908	...	5673	7413	1368	0.089552	29000	19200	38300	3948	8707	3586
160	161	2201	COSMETOLOGY SERVICES AND CULINARY ARTS	10510.0	4364.0	6146.0	Industrial Arts & Consumer Services	0.584776	117	8650	...	2064	5949	510	0.055677	29000	20000	36000	563	7384	3163
161	162	1199	MISCELLANEOUS AGRICULTURE	1488.0	404.0	1084.0	Agriculture & Natural Resources	0.728495	24	1290	...	335	936	82	0.059767	29000	23000	42100	483	626	31
162	163	5502	ANTHROPOLOGY AND ARCHEOLOGY	38844.0	11376.0	27468.0	Humanities & Liberal Arts	0.707136	247	29633	...	14515	13232	3395	0.102792	28000	20000	38000	9805	16693	6866
163	164	6102	COMMUNICATION DISORDERS SCIENCES AND SERVICES	38279.0	1225.0	37054.0	Health	0.967998	95	29763	...	13862	14460	1487	0.047584	28000	20000	40000	19957	9404	5125
164	165	2307	EARLY CHILDHOOD EDUCATION	37589.0	1167.0	36422.0	Education	0.968954	342	32551	...	7001	20748	1360	0.040105	28000	21000	35000	23515	7705	2868
165	166	2603	OTHER FOREIGN LANGUAGES	11204.0	3472.0	7732.0	Humanities & Liberal Arts	0.690111	56	7052	...	3685	3214	846	0.107116	27500	22900	38000	2326	3703	1115
166	167	6001	DRAMA AND THEATER ARTS	43249.0	14440.0	28809.0	Arts	0.666119	357	36165	...	15994	16891	3040	0.077541	27000	19200	35000	6994	25313	11068
167	168	3302	COMPOSITION AND RHETORIC	18953.0	7022.0	11931.0	Humanities & Liberal Arts	0.629505	151	15053	...	6612	7832	1340	0.081742	27000	20000	35000	4855	8100	3466
168	169	3609	ZOOLOGY	8409.0	3050.0	5359.0	Biology & Life Science	0.637293	47	6259	...	2190	3602	304	0.046320	26000	20000	39000	2771	2947	743
169	170	5201	EDUCATIONAL PSYCHOLOGY	2854.0	522.0	2332.0	Psychology & Social Work	0.817099	7	2125	...	572	1211	148	0.065112	25000	24000	34000	1488	615	82
170	171	5202	CLINICAL PSYCHOLOGY	2838.0	568.0	2270.0	Psychology & Social Work	0.799859	13	2101	...	648	1293	368	0.149048	25000	25000	40000	986	870	622
171	172	5203	COUNSELING PSYCHOLOGY	4626.0	931.0	3695.0	Psychology & Social Work	0.798746	21	3777	...	965	2738	214	0.053621	23400	19200	26000	2403	1245	308
172	173	3501	LIBRARY SCIENCE	1098.0	134.0	964.0	Education	0.877960	2	742	...	237	410	87	0.104946	22000	20000	22000	288	338	192

172 rows × 21 columns

Let's begin exploring the data using scatter plots and see if we can draw any interesting correlations.

In [7]:

recent_grads.plot(x='Sample_size', y='Median', kind = 'scatter')
recent_grads.plot(x='Sample_size', y='Unemployment_rate', kind = 'scatter')
recent_grads.plot(x='Full_time', y='Median', kind = 'scatter')
recent_grads.plot(x='ShareWomen', y='Unemployment_rate', kind = 'scatter')
recent_grads.plot(x='Men', y='Median', kind = 'scatter')
recent_grads.plot(x='Women', y='Median', kind = 'scatter')

Out[7]:

From the 'ShareWomen' vs. 'Unemployment_rate' plot, there appears to be no clear correlation between a major's unemployment rate and its share of women.

The other scatter plots don't reveal much either, so let's explore the data a bit further using histograms instead.

In each histogram, the y axis shows the frequency (the number of majors that fall in each bin) and the x axis shows the values of the column specified in the code.
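To make the binning concrete, here is a small sketch using numpy.histogram, which performs the same kind of counting a histogram plot does before drawing the bars. The values are the 'Median' figures of the first ten rows shown above; bins=5 is chosen only to keep the output short, and the variable names are illustrative.

```python
import numpy as np

# 'Median' values of the first ten rows in the table above
medians = [110000, 75000, 73000, 70000, 65000,
           65000, 62000, 62000, 60000, 60000]

# Split the range min..max into 5 equal-width bins and count values per bin
counts, edges = np.histogram(medians, bins=5)

# Each bar of a histogram corresponds to one (left edge, right edge, count) triple
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:>9.0f} to {right:>9.0f}: {count}")
```

The counts are the bar heights (y axis) and the edges are the bin boundaries (x axis).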

In [8]:

recent_grads['Median'].hist(bins=25)

Out[8]:

In [9]:

recent_grads['Employed'].hist(bins=25)

Out[9]:

In [10]:

recent_grads['Full_time'].hist(bins=25)

Out[10]:

In [11]:

recent_grads['ShareWomen'].hist(bins=25)

Out[11]:

In [12]:

recent_grads['Unemployment_rate'].hist(bins=25)

Out[12]:

In [13]:

recent_grads['Men'].hist(bins=25)

Out[13]:

In [14]:

recent_grads['Women'].hist(bins=25)

Out[14]:

Histograms show how a single column is distributed, so they cannot reveal correlations between columns. They do show that unemployment rates vary across majors: the rates are spread over a range rather than piled up at a single value, which is roughly what we would expect to see if unemployment did not depend on major.

Next we'll use the scatter matrix from pandas to see if we can draw more insight. A scatter matrix plots every pair of the selected columns against each other, with a histogram of each column on the diagonal, which lets us quickly see whether any of those variables are correlated.

In [15]:

from pandas.plotting import scatter_matrix

In [16]:

scatter_matrix(recent_grads[['Sample_size', 'Median']], figsize=(10,10))

Out[16]:


In [17]:

scatter_matrix(recent_grads[['Men', 'ShareWomen', 'Median']], figsize=(10,10))

Out[17]:


We are not seeing strong correlations in these plots, but there is a weak negative correlation between 'ShareWomen' and 'Median': majors with fewer women tend to have higher earnings. This could be because high-paying majors such as engineering tend to have fewer women.

The first ten rows in the data are mostly engineering majors, and the last ten rows are non-engineering majors. We can generate bar charts of 'ShareWomen' by 'Major' for these rows to see if our hypothesis is correct.
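A correlation read off a plot can also be checked numerically. The sketch below computes the Pearson correlation with Series.corr() on eight rows copied from the table above. The frame `sample` is only an illustration: these rows come from the top and bottom of the ranking, so the coefficient will look stronger than it would on the full dataset, where the equivalent call would be recent_grads['ShareWomen'].corr(recent_grads['Median']).

```python
import pandas as pd

# Eight rows copied from the table above (not a representative sample)
sample = pd.DataFrame({
    "Major": ["PETROLEUM ENGINEERING", "CHEMICAL ENGINEERING",
              "ACTUARIAL SCIENCE", "COMPUTER SCIENCE", "PSYCHOLOGY",
              "MUSIC", "EARLY CHILDHOOD EDUCATION", "LIBRARY SCIENCE"],
    "ShareWomen": [0.120564, 0.341631, 0.441356, 0.222695,
                   0.779933, 0.506721, 0.968954, 0.877960],
    "Median": [110000, 65000, 62000, 53000, 31500, 31000, 28000, 22000],
})

# Pearson correlation coefficient: -1 is perfectly negative, +1 perfectly positive
r = sample["ShareWomen"].corr(sample["Median"])
print(f"Pearson r on the sample: {r:.2f}")
```

A negative value supports the reading of the scatter plot, but it says nothing about cause.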
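One detail is worth checking when slicing rows after dropna(): a plain slice such as df[3:] selects by position, not by index label, and dropna() leaves a gap in the labels. The made-up frame below (names and values are placeholders) shows how the two can disagree, and why tail() or a negative slice is a safer way to ask for "the last n rows".

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value at index label 2
df = pd.DataFrame({"value": [10.0, 11.0, np.nan, 13.0, 14.0, 15.0]})

# After dropping the NaN row, the index labels are 0, 1, 3, 4, 5
cleaned = df.dropna()

by_position = cleaned[3:]      # positions 3 and 4 only
by_label = cleaned.loc[3:]     # labels 3, 4 and 5
last_three = cleaned.tail(3)   # always the last three rows

print(list(by_position.index))
print(list(by_label.index))
print(list(last_three.index))
```

In this dataset one row is dropped, so 172 rows remain and a positional slice starting at 163 returns nine rows rather than ten.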

In [18]:

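recent_grads[:10].plot(kind='bar', x='Major', y='ShareWomen', colormap='winter')
#only 172 rows remain after dropna(), so [163:] would return nine rows; [-10:] returns the last ten
recent_grads[-10:].plot(kind='bar', x='Major', y='ShareWomen', colormap='winter')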

Out[18]:

Let's plot 'Median' income for the same majors to see if the engineering majors earn more.

In [19]:

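recent_grads[:10].plot(kind='bar', x='Major', y='Median', colormap='winter')
#only 172 rows remain after dropna(), so [163:] would return nine rows; [-10:] returns the last ten
recent_grads[-10:].plot(kind='bar', x='Major', y='Median', colormap='winter')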

Out[19]:

Our hypothesis appears to be correct, at least for the majors we selected: majors with fewer women, such as engineering, tend to have higher median salaries.

Learning Summary
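A less hand-picked check than comparing the first and last ten rows would be to average by 'Major_category'. The sketch below does this with groupby() on eight rows copied from the table above; the frame `sample` is illustrative only, and on the real data the same groupby call could be applied to recent_grads directly.

```python
import pandas as pd

# Eight rows copied from the table above (illustration, not the full dataset)
sample = pd.DataFrame({
    "Major_category": ["Engineering", "Engineering", "Business",
                       "Computers & Mathematics", "Psychology & Social Work",
                       "Arts", "Education", "Education"],
    "ShareWomen": [0.120564, 0.341631, 0.441356, 0.222695,
                   0.779933, 0.506721, 0.968954, 0.877960],
    "Median": [110000, 65000, 62000, 53000, 31500, 31000, 28000, 22000],
})

# Mean share of women and mean median salary per category, highest pay first
summary = (
    sample.groupby("Major_category")[["ShareWomen", "Median"]]
    .mean()
    .sort_values("Median", ascending=False)
)
print(summary)
```

Passing `summary` to .plot(kind='bar') would give one bar per category instead of one per major.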

Python concepts explored: pandas, matplotlib, histograms, bar charts, scatterplots, scatter matrices

Python functions and methods used: .plot(), scatter_matrix(), hist(), iloc[], .head(), .tail(), .describe()

The files used for this project can be found in my GitHub repository.

Coding Disciple

Visualizing Earnings Based On College Majors

Part 4 of the Dataquest series
