In this project, we will scrape data from Reddit using its API. The objective is to load Reddit data into a pandas dataframe. To achieve this, we'll first import the following libraries.
The documentation for the API can be found on Reddit's developer site.
import pandas as pd
import urllib.request as ur
import json
import time
We can access the raw JSON data of any subreddit by appending '.json' to its URL; for example, https://www.reddit.com/r/datascience/ becomes https://www.reddit.com/r/datascience/.json. Using the urllib.request library, we can fetch that data and read it in Python.
From the documentation, we need to fill out the User-Agent header using the suggested format.
Example: User-Agent: android:com.example.myredditapp:v1.2.3 (by /u/kemitche)
#Header to be submitted to Reddit.
hdr = {'User-Agent': 'codingdisciple:playingwithredditAPI:v1.0 (by /u/ivis_reine)'}
#Link to the subreddit of interest.
url = "https://www.reddit.com/r/datascience/.json?sort=top&t=all"
#Make a request object and receive a response.
req = ur.Request(url, headers=hdr)
response = ur.urlopen(req).read()
#Load the JSON data into Python.
json_data = json.loads(response.decode('utf-8'))
The data returned is just nested lists and dictionaries. We need to work our way down through the dictionaries until we reach a list; each item in that list is a post made on this subreddit.
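If you want to inspect the structure yourself, here is a quick sketch (the key names reflect Reddit's standard listing format, so treat the inline outputs as illustrative):
#Peek at the nesting: a listing is a dictionary wrapping a list of posts.
print(json_data.keys())                       # dict_keys(['kind', 'data'])
print(json_data['data'].keys())               # includes 'children', 'after', 'before'
print(type(json_data['data']['children']))    # <class 'list'>
print(json_data['data']['children'][0]['data']['title'])  # title of the first post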
#The actual post data starts here.
data = json_data['data']['children']
Each request can only return up to 100 posts, so we can write a for loop that sends 10 requests at 2-second intervals, passing the name of the last post we already have as the 'after' parameter and appending each new batch to our list of posts.
for i in range(10):
    #reddit only accepts one request every 2 seconds; add a delay between each loop
    time.sleep(2)
    #Use the name (fullname) of the last post as the pagination cursor.
    last = data[-1]['data']['name']
    url = 'https://www.reddit.com/r/datascience/.json?sort=top&t=all&limit=100&after=%s' % last
    req = ur.Request(url, headers=hdr)
    text_data = ur.urlopen(req).read()
    datatemp = json.loads(text_data.decode('utf-8'))
    data += datatemp['data']['children']
    print(str(len(data)) + " posts loaded")
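As a quick sanity check (my addition, not part of the original post), we can confirm that pagination didn't return duplicates by comparing the total number of post names against the number of unique ones:
#Each post's 'name' is unique, so these two counts should match.
names = [post['data']['name'] for post in data]
print(len(names), len(set(names)))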
We've assigned all the posts to a list named 'data'. In order to begin constructing our pandas dataframe, we need a list of column names. Since each post is a dictionary, we can simply loop through one post's dictionary and extract its keys as column names.
#Create a list of column name strings to be used to create our pandas dataframe
data_names = [key for key in data[0]['data']]
print(data_names)
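Note that this assumes every post carries the same fields as the first one. If fields vary from post to post, a safer variant (my own sketch, not from the original) takes the union of keys across all posts:
#Union of keys across all posts, in case some posts have extra fields.
all_names = set()
for post in data:
    all_names.update(post['data'].keys())
data_names = sorted(all_names)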
In order to build a dataframe using the pd.DataFrame() function, we will need a list of dictionaries.
We can loop through each element in 'data', using each column name as a key to the dictionary, and access the corresponding value with that key. If we come across a post that is missing a field, we'll fill the entry with the string 'None' instead.
#Create a list of dictionaries to be loaded into a pandas dataframe
df_loadinglist = []
for i in range(0, len(data)):
    dictionary = {}
    for name in data_names:
        try:
            dictionary[name] = data[i]['data'][name]
        except KeyError:
            #Fill missing fields with the string 'None'.
            dictionary[name] = 'None'
    df_loadinglist.append(dictionary)
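As an aside (not from the original post), the same construction can be written more compactly with dict.get, which returns a default value when a key is missing:
#Equivalent construction: dict.get supplies 'None' for missing keys.
df_loadinglist = [{name: post['data'].get(name, 'None') for name in data_names}
                  for post in data]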
df = pd.DataFrame(df_loadinglist)
df.shape
df.tail()
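The frame has dozens of columns, so it helps to peek at a few of the standard post fields first (a hedged example; these column names come from Reddit's post objects):
#A few commonly useful columns from Reddit's post objects.
df[['title', 'score', 'num_comments', 'created_utc']].head()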
Now that we have a pandas dataframe, we can do some simple analysis on the Reddit posts. For example, we can write a function to find the most common words used in the last 925 posts.
#Counts each word and returns a pandas series sorted by frequency
def word_count(df, column):
    dic = {}
    for idx, row in df.iterrows():
        split = row[column].lower().split(" ")
        for word in split:
            if word in dic:
                dic[word] += 1
            else:
                dic[word] = 1
    counts = pd.Series(dic)
    counts = counts.sort_values(ascending=False)
    return counts
top_counts = word_count(df, "selftext")
top_counts[0:5]
The results are not too surprising: common English words showed up the most. That's it for now! We've achieved our goal of turning JSON data into a pandas dataframe.
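For more informative counts, one quick tweak (my own sketch, not part of the original analysis) is to drop a small hand-picked list of common words before looking at the top results:
#A small, hand-picked stop-word list (illustrative, not exhaustive).
stop_words = {'the', 'a', 'an', 'and', 'or', 'to', 'of', 'in', 'is', 'it', ''}
filtered = top_counts[~top_counts.index.isin(stop_words)]
filtered[0:5]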
Learning Summary
Concepts explored: lists, dictionaries, API, data structures, JSON
The files for this project can be found in my GitHub repository.