We will look at various CSV files in the 'school' folder. Each one of these csv files contain information on NYC high schools. We will clean the datasets and condense it into one dataframe. The data cleaning process was done in the missions, so we'll quickly repeat the process in this notebook and focus on the data analysis.
import pandas
import numpy
import re
import matplotlib
data_files = [
"ap_2010.csv",
"class_size.csv",
"demographics.csv",
"graduation.csv",
"hs_directory.csv",
"sat_results.csv"
]
data = {}
for f in data_files:
d = pandas.read_csv("schools/{0}".format(f))
data[f.replace(".csv", "")] = d
Read in the surveys¶
all_survey = pandas.read_csv("schools/survey_all.txt", delimiter="\t", encoding='windows-1252')
d75_survey = pandas.read_csv("schools/survey_d75.txt", delimiter="\t", encoding='windows-1252')
survey = pandas.concat([all_survey, d75_survey], axis=0)
survey["DBN"] = survey["dbn"]
survey_fields = [
"DBN",
"rr_s",
"rr_t",
"rr_p",
"N_s",
"N_t",
"N_p",
"saf_p_11",
"com_p_11",
"eng_p_11",
"aca_p_11",
"saf_t_11",
"com_t_11",
"eng_t_10",
"aca_t_11",
"saf_s_11",
"com_s_11",
"eng_s_11",
"aca_s_11",
"saf_tot_11",
"com_tot_11",
"eng_tot_11",
"aca_tot_11",
]
survey = survey.loc[:,survey_fields]
data["survey"] = survey
Add DBN columns¶
data["hs_directory"]["DBN"] = data["hs_directory"]["dbn"]
def pad_csd(num):
string_representation = str(num)
if len(string_representation) > 1:
return string_representation
else:
return "0" + string_representation
data["class_size"]["padded_csd"] = data["class_size"]["CSD"].apply(pad_csd)
data["class_size"]["DBN"] = data["class_size"]["padded_csd"] + data["class_size"]["SCHOOL CODE"]
Convert columns to numeric¶
cols = ['SAT Math Avg. Score', 'SAT Critical Reading Avg. Score', 'SAT Writing Avg. Score']
for c in cols:
data["sat_results"][c] = pandas.to_numeric(data["sat_results"][c], errors="coerce")
data['sat_results']['sat_score'] = data['sat_results'][cols[0]] + data['sat_results'][cols[1]] + data['sat_results'][cols[2]]
def find_lat(loc):
coords = re.findall("\(.+, .+\)", loc)
lat = coords[0].split(",")[0].replace("(", "")
return lat
def find_lon(loc):
coords = re.findall("\(.+, .+\)", loc)
lon = coords[0].split(",")[1].replace(")", "").strip()
return lon
data["hs_directory"]["lat"] = data["hs_directory"]["Location 1"].apply(find_lat)
data["hs_directory"]["lon"] = data["hs_directory"]["Location 1"].apply(find_lon)
data["hs_directory"]["lat"] = pandas.to_numeric(data["hs_directory"]["lat"], errors="coerce")
data["hs_directory"]["lon"] = pandas.to_numeric(data["hs_directory"]["lon"], errors="coerce")
Condense datasets¶
class_size = data["class_size"]
class_size = class_size[class_size["GRADE "] == "09-12"]
class_size = class_size[class_size["PROGRAM TYPE"] == "GEN ED"]
class_size = class_size.groupby("DBN").agg(numpy.mean)
class_size.reset_index(inplace=True)
data["class_size"] = class_size
data["demographics"] = data["demographics"][data["demographics"]["schoolyear"] == 20112012]
data["graduation"] = data["graduation"][data["graduation"]["Cohort"] == "2006"]
data["graduation"] = data["graduation"][data["graduation"]["Demographic"] == "Total Cohort"]
Convert AP scores to numeric¶
cols = ['AP Test Takers ', 'Total Exams Taken', 'Number of Exams with scores 3 4 or 5']
for col in cols:
data["ap_2010"][col] = pandas.to_numeric(data["ap_2010"][col], errors="coerce")
Combine the datasets¶
combined = data["sat_results"]
combined = combined.merge(data["ap_2010"], on="DBN", how="left")
combined = combined.merge(data["graduation"], on="DBN", how="left")
to_merge = ["class_size", "demographics", "survey", "hs_directory"]
for m in to_merge:
combined = combined.merge(data[m], on="DBN", how="inner")
combined = combined.fillna(combined.mean())
combined = combined.fillna(0)
Add a school district column for mapping¶
def get_first_two_chars(dbn):
return dbn[0:2]
combined["school_dist"] = combined["DBN"].apply(get_first_two_chars)
Find correlations¶
correlations = combined.corr()
correlations = correlations["sat_score"]
print(correlations)
We are interested the correlations with the columns specified in survey_fields.
%matplotlib inline
correlations[survey_fields].plot.bar()
The saf_s_11 column measures how students perceive safety at the school, let's see how students do in the SATs if the schools have higher safety scores.
combined.plot.scatter(['sat_score'], ["saf_s_11"])
Looks like there is a weak positive correlation between safety and SAT scores. Maybe it is easier for students to study when they are not worried about safety.
We can generate a NYC basemap and see which schools have lower safety scores, and color label the schools with lower safety scores with a different color
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
districts = combined.groupby("school_dist").agg(numpy.mean)
districts.reset_index(inplace=True)
m = Basemap(
projection='merc',
llcrnrlat=40.496044,
urcrnrlat=40.915256,
llcrnrlon=-74.255735,
urcrnrlon=-73.700272,
resolution='i'
)
m.drawmapboundary(fill_color='#85A6D9')
m.drawcoastlines(color='#6D5F47', linewidth=.4)
m.drawrivers(color='#6D5F47', linewidth=.4)
#converts to list
longitudes = districts['lon'].tolist()
latitudes = districts['lat'].tolist()
#generates a scatterplot with the 2 lists as arguments, s=20 for size of points, zorder=2 for dots on top of the map, latlong=True to indicate we are putting in the latitude/longitude
m.scatter(longitudes, latitudes, s=50, zorder=2, latlon=True, c=districts['saf_s_11'], cmap='summer')
plt.show()
combined.info()
Next we are interested to see if there is a correlation between race and SAT scores.
races = [
'white_per',
'asian_per',
'black_per',
'hispanic_per',
]
correlations[races].plot.bar()
We are seeing positive correlation for whites/asians in SAT scores, negative correlations for blacks/hispanic in SAT scores. However, I wouldn't say any of these correlations are very strong. Let's use a scatter plot for hispanic to see if we can draw any insights.
combined.plot.scatter(['sat_score'], ["hispanic_per"])
Schools with high amount of hispanics do poorly in the SATs. Schools with low amount of hispanics can do relatively well in the SATs. Let's filter the data down to scools with >95% Hispanics
schools_with_hispanics = combined[combined['hispanic_per'] > 95]
schools_with_hispanics
Let's filter the data to see the schools with low hispanics AND high SAT scores.
schools_with_lh = combined[(combined['hispanic_per'] < 10) &
(combined['sat_score'] > 1800)]
schools_with_lh
Using google searches, we found out that these schools have students who recently immigrated, so most of them are probably still ESL students. It makes sense they do not perform as well in SATs.
Next we'll look correlations between gender and SAT scores
genders =['male_per', 'female_per']
correlations[genders].plot.bar()
Seems like theres very little correlation overall for both sides small positive correlation for females and small negative correlation for males.
combined.plot.scatter(["female_per"], ['sat_score'])
Scatter plot doesn't show much, data seems very scattered and we are not seeing much correlation between schools with more females and SAT scores.
Let's look at schools with < 60% females and higher than 1700 SAT scores
schools_with_hf = combined[(combined['female_per'] < 60) &
(combined['sat_score'] > 1700)]
schools_with_hf
A google search shows that these schools are heavily STEM focused and have higher entrance requirements, so it makes sense that they have higher SAT scores
Let's take a look at AP test takers in these schools. It is possible that AP test takers score higher in SATS.
combined['ap_per'] = combined['AP Test Takers ']/combined['total_enrollment']
combined.plot.scatter(['ap_per'],['sat_score'])
Schools with low AP test takers tend to score low on SAT as well. Wheras schools with high AP test takers generally score higher. This makes sense because students who enroll in AP classes are generally honor students so they study hard for the SATs as well.
Learning Summary¶
Python concepts explored: pandas, matplotlib.pyplot, correlations, regex, basemap, data analysis, string manipulation
Python functions and methods used: .scatter(), info(), .tolist(), .groupby(), .agg(), .concat(), .apply(), .strip, .merge(), .fillna(), .corr()
The files used for this project can be found in my GitHub repository.
Comments
comments powered by Disqus