In this project, we'll scrape data from reddit using its API. The objective is to load reddit data into a pandas dataframe. To achieve this, we'll first import the following libraries.

The documentation for the reddit API can be found at https://www.reddit.com/dev/api/.

In [1]:
import pandas as pd
import urllib.request as ur
import json
import time

We can access the raw JSON data of any subreddit by adding '.json' to the end of its URL. For example, https://www.reddit.com/r/datascience/.json returns the front page of r/datascience as JSON. Using the urllib.request library, we can fetch that data and read it in Python.

From the documentation, we need to fill out the User-Agent header using the suggested format.

Example: User-Agent: android:com.example.myredditapp:v1.2.3 (by /u/kemitche)

In [2]:
#Header to be submitted to reddit.
hdr = {'User-Agent': 'codingdisciple:playingwithredditAPI:v1.0 (by /u/ivis_reine)'}

#Link to the subreddit of interest.
url = "https://www.reddit.com/r/datascience/.json?sort=top&t;=all"

#Make a request object and receive a response.
req = ur.Request(url, headers=hdr)
response = ur.urlopen(req).read()

#Loads the json data into python.
json_data = json.loads(response.decode('utf-8'))

I took a snapshot of the data structure below. The response is just nested lists and dictionaries. We want to walk down the dictionary until we reach a list; each item in that list is a post made on this subreddit.

[Screenshot: structure of the reddit JSON response]
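Since a screenshot isn't interactive, here is a quick way to poke at the structure yourself, assuming the json_data object from the cell above:

#The top level is a dict with 'kind' and 'data' keys.
print(json_data.keys())

#'data' holds listing metadata plus 'children', the list of posts.
print(json_data['data'].keys())

#Each child wraps the post's fields in its own 'data' dict.
print(json_data['data']['children'][0]['data']['title'])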

In [3]:
#The actual list of posts starts here.
data = json_data['data']['children']

Each request can get us at most 100 posts (the first request above used reddit's default of 25). We can write a for loop that sends 10 more requests at 2-second intervals, passing the 'name' of the last post we have as the 'after' parameter so each request picks up where the previous one left off, and appending the new posts to our list.
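If reddit throttles one of our requests, urllib will raise an HTTPError and the loop below will simply crash. A more defensive fetch could look like this sketch; fetch_json is a hypothetical helper and isn't used in the cells that follow:

import urllib.error

#Sketch: retry a request a few times with a growing delay (hypothetical helper).
def fetch_json(url, headers, retries=3, delay=2):
    for attempt in range(retries):
        try:
            req = ur.Request(url, headers=headers)
            return json.loads(ur.urlopen(req).read().decode('utf-8'))
        except urllib.error.HTTPError:
            time.sleep(delay * (attempt + 1))
    raise RuntimeError('request kept failing: ' + url)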

In [4]:
for i in range(10):
    #reddit only accepts one request every 2 seconds, so add a delay between requests.
    time.sleep(2)
    #Fullname of the last post loaded so far; used as the 'after' cursor.
    last = data[-1]['data']['name']
    url = 'https://www.reddit.com/r/datascience/.json?sort=top&t=all&limit=100&after=%s' % last
    req = ur.Request(url, headers=hdr)
    text_data = ur.urlopen(req).read()
    datatemp = json.loads(text_data.decode('utf-8'))
    data += datatemp['data']['children']
    print(str(len(data)) + " posts loaded")
    
125 posts loaded
225 posts loaded
325 posts loaded
425 posts loaded
525 posts loaded
625 posts loaded
725 posts loaded
825 posts loaded
898 posts loaded
898 posts loaded

We've assigned all the posts to a list named 'data'. (The count stalls at 898 because we reached the end of the listing, so the final request returned nothing new.) To begin constructing our pandas dataframe, we need a list of column names. Each post is a dictionary, so we can simply loop through one post's keys and use them as column names.

In [5]:
#Create a list of column name strings to be used to create our pandas dataframe
data_names = list(data[0]['data'].keys())
print(data_names)
['contest_mode', 'num_reports', 'subreddit_type', 'report_reasons', 'spoiler', 'quarantine', 'author', 'media_embed', 'banned_at_utc', 'is_self', 'selftext_html', 'stickied', 'banned_by', 'author_flair_css_class', 'approved_by', 'mod_reason_title', 'author_flair_text', 'selftext', 'suggested_sort', 'hide_score', 'brand_safe', 'subreddit', 'hidden', 'link_flair_text', 'mod_note', 'num_comments', 'name', 'pinned', 'user_reports', 'thumbnail', 'subreddit_name_prefixed', 'score', 'can_mod_post', 'can_gild', 'mod_reason_by', 'edited', 'downs', 'domain', 'is_crosspostable', 'clicked', 'approved_at_utc', 'title', 'removal_reason', 'likes', 'is_reddit_media_domain', 'locked', 'view_count', 'id', 'over_18', 'distinguished', 'visited', 'media', 'gilded', 'link_flair_css_class', 'secure_media', 'is_video', 'mod_reports', 'permalink', 'secure_media_embed', 'parent_whitelist_status', 'saved', 'num_crossposts', 'url', 'archived', 'created', 'created_utc', 'subreddit_id', 'whitelist_status', 'ups']

In order to build a dataframe using the pd.DataFrame() function, we will need a list of dictionaries.

We can loop through each element in 'data', using each column name as a key to the dictionary and grabbing the corresponding value. If we come across a post that is missing one of the keys, we'll store a placeholder value for that column instead of letting the loop fail.

In [6]:
#Create a list of dictionaries to be loaded into a pandas dataframe
df_loadinglist = []
for post in data:
    dictionary = {}
    for name in data_names:
        try:
            dictionary[name] = post['data'][name]
        except KeyError:
            #Posts missing this field get a placeholder value instead.
            dictionary[name] = 'None'
    df_loadinglist.append(dictionary)
df = pd.DataFrame(df_loadinglist)
In [7]:
df.shape
Out[7]:
(898, 69)
In [8]:
df.tail()
Out[8]:
approved_at_utc approved_by archived author author_flair_css_class author_flair_text banned_at_utc banned_by brand_safe can_gild ... subreddit_type suggested_sort thumbnail title ups url user_reports view_count visited whitelist_status
893 None None False amoun1365 None None None None True False ... public None Is there any good statistics specialization (M... 10 https://www.reddit.com/r/datascience/comments/... [] None False all_ads
894 None None False sananth12 None None None None True False ... public None Carla – Open source simulator for autonomous d... 4 http://www.carla.org/ [] None False all_ads
895 None None False blacksite_ testflair BS (Economics) | Data Scientist | IT Operations None None True False ... public None What statistical methods/tools do you use most? 44 https://www.reddit.com/r/datascience/comments/... [] None False all_ads
896 None None False thatneedle None None None None True False ... public None Free, Fast Entity Extraction - requesting feed... 0 http://entity.thatneedle.com [] None False all_ads
897 None None False basketballwonk None None None None True False ... public None I'm at a crossroads and looking for advice (st... 2 https://www.reddit.com/r/datascience/comments/... [] None False all_ads

5 rows × 69 columns

Now that we have a pandas dataframe, we can do some simple analysis on the reddit posts.
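For a quick example, we can rank the posts by score; this is just a sketch using the 'title', 'score', and 'num_comments' columns from the column list printed earlier:

#Top five posts by score, with their comment counts.
print(df[['title', 'score', 'num_comments']].sort_values('score', ascending=False).head())

We can also write a function to find the most common words used across the 898 posts we collected.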

In [9]:
#Count each word and return a pandas Series

def word_count(df, column):
    dic = {}
    for idx, row in df.iterrows():
        #Lowercase the text and split it on spaces.
        split = row[column].lower().split(" ")
        for word in split:
            if word in dic:
                dic[word] += 1
            else:
                dic[word] = 1
    dictionary = pd.Series(dic)
    dictionary = dictionary.sort_values(ascending=False)
    return dictionary

top_counts = word_count(df, "selftext")
top_counts[0:5]
Out[9]:
to     2820
the    2670
i      2575
a      2554
and    2189
dtype: int64
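Unsurprisingly, the raw counts are dominated by common English words. One way to push past them is to filter against a stop word list; this is a sketch with a small hand-picked set, not part of the original analysis:

#Drop very common English words to surface more topical terms.
stop_words = {'to', 'the', 'i', 'a', 'and', 'of', 'in', 'is', 'for',
              'that', 'it', 'my', 'on', 'with', 'this', 'are', 'you'}
print(top_counts[~top_counts.index.isin(stop_words)][0:5])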

That is it for now! We've achieved our goal of turning JSON data into a pandas dataframe.


Learning Summary

Concepts explored: lists, dictionaries, API, data structures, JSON

The files for this project can be found in my GitHub repository.


