In this project, we are going to explore the database, 'chinook.db'. We are going to come up with some business questions that might be interesting. Then we'll try to answer them with SQL. This project is an extended version of a guided project from dataquest, you can check them out here.

First, we want to create two helper functions:

1) A function that can execute a SQL query and return the result in pandas.

2) A function that can execute a SQL command.

In [1]:
import pandas as pd
import numpy as np
import sqlite3
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

def run_query(query):
    with sqlite3.connect('chinook.db') as conn:
        return pd.read_sql(query, conn)


def run_command(command):
    with sqlite3.connect('chinook.db') as conn:
        conn.isolation_level = None
        conn.execute(c)

Below, I've attached an image of the database scehma.

database schema
In [2]:
def show_tables():
    q = '''
        SELECT
            name,
            type
        FROM sqlite_master
        WHERE type IN ("table","view");
        '''
    return run_query(q)

show_tables()
Out[2]:
name type
0 album table
1 artist table
2 customer table
3 employee table
4 genre table
5 invoice table
6 invoice_line table
7 media_type table
8 playlist table
9 playlist_track table
10 track table

Determining the Best Genre

Suppose that the company wants to know which genre sells the best in the United States. They need this information to figure what albums to add to the store. We can extract this information from our SQL database using multiple joins. Then load the result into a pandas dataframe and perform analysis using matplotlib/seaborn.

Here is the list of artists and their respective genres that the company is considering:

1) Regal, hip-hop

2) Red Tone, punk

3) Slim Jim Bites, blues

4) Meteor and the Girls, pop

In [3]:
q1 = '''
WITH usa_sales AS
    (
    SELECT
        g.name Genre,
        SUM(i.quantity) Tracks_sold
    FROM customer c
    INNER JOIN invoice inv ON inv.customer_id = c.customer_id
    INNER JOIN invoice_line i on i.invoice_id = inv.invoice_id
    INNER JOIN track t ON t.track_id = i.track_id
    INNER JOIN genre g ON t.genre_id = g.genre_id
    WHERE country = 'USA'
    GROUP BY Genre
    )

SELECT
    Genre,
    Tracks_sold,
    CAST(Tracks_sold AS FLOAT)/(SELECT SUM(Tracks_sold) FROM usa_sales) AS Percentage
FROM usa_sales
GROUP BY 2
ORDER BY 2 DESC;
'''


genre_and_sales = run_query(q1)

genre_and_sales
Out[3]:
Genre Tracks_sold Percentage
0 Rock 561 0.533777
1 Alternative & Punk 130 0.123692
2 Metal 124 0.117983
3 R&B;/Soul 53 0.050428
4 Blues 36 0.034253
5 Alternative 35 0.033302
6 Pop 22 0.020932
7 Hip Hop/Rap 20 0.019029
8 Jazz 14 0.013321
9 Easy Listening 13 0.012369
10 Reggae 6 0.005709
11 Electronica/Dance 5 0.004757
12 Classical 4 0.003806
13 Heavy Metal 3 0.002854
14 Soundtrack 2 0.001903
15 TV Shows 1 0.000951
In [4]:
sns.set_style("whitegrid")
sns.set(font_scale=1.8)

fig = plt.figure(figsize=(15,7))
ax = sns.barplot(x="Genre", y="Tracks_sold", data=genre_and_sales, palette="Blues_d")
ax.set_xticklabels(genre_and_sales['Genre'], rotation=90)
percent_labels = genre_and_sales['Percentage']
ax.set_xlabel('')
ax.set_ylabel('Tracks Sold')


sns.set(font_scale=1.3)

#Labels each bar with the percentage of sales.
for i in range(len(ax.patches)):
    p = ax.patches[i]
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}'.format(percent_labels.iloc[i]*100)+'%',
            ha="center") 
plt.show()

Let's go over our list:

1) Punk: 12.37% of albums sold in the US.

2) Blues: 3.43% of albums sold in the US.

3) Pop: 2.09% of albums sold in the US.

4) Hip-hop/Rap: 1.9% of albums sold in the US.

If we had to pick one artist out of the four given to us. Based on our data, I would recommend Red Tone, Punk.

Evaluating Employee Performance

Each customer for the Chinook store gets assigned to a sales support agent within the company when they first make a purchase. Suppose that our bosses asked us to analyze the sales of each employee to see if any sales support agent is underperforming.

We can follow the same process. First extract this information from our SQL database using multiple joins. Then load the result into a pandas dataframe and perform analysis using matplotlib/seaborn. We can also customize our plots making it easier for others to understand.

In [5]:
q2 = '''
WITH sales_employee AS
    (
    SELECT
        e.employee_id,
        e.first_name,
        e.last_name,
        SUM(i.total) total_sales
    FROM employee e
    INNER JOIN customer c ON c.support_rep_id = e.employee_id
    INNER JOIN invoice i ON i.customer_id = c.customer_id
    GROUP BY e.employee_id 
    )

SELECT
    s.first_name || ' ' || s.last_name Employee_name,
    e.hire_date Hire_date,
    s.total_sales Total_sales
FROM sales_employee s
INNER JOIN employee e ON e.employee_id = s.employee_id
ORDER BY 3 DESC;
'''


sales_per_agent = run_query(q2)

sales_per_agent
Out[5]:
Employee_name Hire_date Total_sales
0 Jane Peacock 2017-04-01 00:00:00 1731.51
1 Margaret Park 2017-05-03 00:00:00 1584.00
2 Steve Johnson 2017-10-17 00:00:00 1393.92
In [6]:
#Generates the bar plot.
sns.set(font_scale=1.8)
fig = plt.figure(figsize=(8,5))
ax = sns.barplot(x="Employee_name", y="Total_sales", data=sales_per_agent, palette="Blues_d")
ax.set_xticklabels(sales_per_agent['Employee_name'])
ax.set_xlabel('')
ax.set_ylabel('Total Sales')

#Labels each bar with the percentage of sales.
sns.set(font_scale=1.3)
percent_labels = sales_per_agent['Hire_date'].apply(lambda x: x.split(' ')[0])
for i in range(len(ax.patches)):
    p = ax.patches[i]
    height = p.get_height()/2
    ax.text(p.get_x()+p.get_width()/2.,
            height + 15,
            'Hired on: ',
            ha="center") 
    ax.text(p.get_x()+p.get_width()/2.,
            height - 75,
            (percent_labels.iloc[i]),
            ha="center") 
plt.show()

Jane had the most sales, but she is also the most senior employee in the sales group. Steve had the least sales but he is also the most junior employee.

I don't believe any of these employees are underperforming relative to each other.

Analyzing Sales by Country

Suppose that the company wants to advertise their services to other countries. They want to know which country is the best to focus advertisements on. In order words, our bosses wants the most bang for their buck for each customer acquistion.

To tackle this problem, we would have to use SQL to figure out the average value of sales per customer in each country. Then we can follow the same process and create data visualizations.

In [7]:
q3 = '''
WITH sales_country AS
    (
    SELECT
        CASE
           WHEN (
                 SELECT count(*)
                 FROM customer 
                 where country = c.country
                ) = 1 THEN "Other"
           ELSE c.country
        END AS country,
        c.customer_id,
        SUM(i.total) customer_sales
    FROM customer c
    INNER JOIN invoice i ON c.customer_id = i.customer_id
    GROUP BY 2
    ),
    
    orders_country AS
    (
    SELECT
        c.country,
        AVG(i.total) average_country_order
    FROM customer c
    INNER JOIN invoice i ON c.customer_id = i.customer_id
    GROUP BY 1
    )
    

SELECT
    s.country,
    SUM(s.customer_sales) country_sales,
    COUNT(*) total_customers,
    SUM(s.customer_sales)/COUNT(*) average_value_of_sales_per_customer,
    average_country_order
FROM sales_country s
LEFT JOIN orders_country o ON o.country = s.country
GROUP BY 1 
ORDER BY CASE WHEN s.country = 'Other' THEN 2 ELSE 1 END, country_sales DESC

'''

sales_country = run_query(q3)

sales_country
Out[7]:
country country_sales total_customers average_value_of_sales_per_customer average_country_order
0 USA 1040.49 13 80.037692 7.942672
1 Canada 535.59 8 66.948750 7.047237
2 Brazil 427.68 5 85.536000 7.011148
3 France 389.07 5 77.814000 7.781400
4 Germany 334.62 4 83.655000 8.161463
5 Czech Republic 273.24 2 136.620000 9.108000
6 United Kingdom 245.52 3 81.840000 8.768571
7 Portugal 185.13 2 92.565000 6.383793
8 India 183.15 2 91.575000 8.721429
9 Other 1094.94 15 72.996000 NaN
In [8]:
#Generates the sales bar plot.
sales_include_other = sales_country.loc[0:9]
sns.set(font_scale=1.8)
fig = plt.figure(figsize=(15,15))
ax = fig.add_subplot(2, 2, 1)
ax = sns.barplot(x="country", y="country_sales", data=sales_include_other, palette="Blues_d")
ax.set_xticklabels(sales_include_other['country'], rotation=90)
ax.set_xlabel('')
ax.set_ylabel('Total Sales')

#Generate sales percentage labels.
sns.set(font_scale=1.0)
percent_labels = sales_include_other['country_sales']/sales_include_other['country_sales'].sum()
for i in range(len(ax.patches)):
    p = ax.patches[i]
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2,
            height + 5,
            '{:1.2f}'.format(percent_labels.iloc[i]*100)+'%',
            ha="center")
    
    
#Generate the top pie chart
sns.set(font_scale=1.2)
fig.subplots_adjust(hspace=.5)    
ax2 = fig.add_subplot(2, 2, 2)
ax2 = plt.pie(percent_labels, labels=sales_include_other["country"], autopct='%1.1f%%', shadow=True, pctdistance = 0.85)
plt.title('Percentage of Sales by Country')

#Generate the customer value bar plot
sales_exclude_other = sales_country.loc[0:8]
sns.set(font_scale=1.8)
ax3 = fig.add_subplot(2, 2, 3)
ax3 = sns.barplot(x="country", y="average_value_of_sales_per_customer", data=sales_exclude_other, palette='Blues_d')
ax3.set_xticklabels(sales_exclude_other['country'], rotation=90)
ax3.set_xlabel('')
ax3.set_ylabel('Average Sales Per Customer')


#Generate the bottom pie chart
sns.set(font_scale=1.2)
percent_labels = sales_exclude_other['total_customers']/sales_exclude_other['total_customers'].sum()
ax4 = fig.add_subplot(2, 2, 4)
ax4 = plt.pie(percent_labels, labels=sales_exclude_other["country"], autopct='%1.1f%%', shadow=True, pctdistance = 0.85)
plt.title('Percentage of Customers by Country')

plt.show()

Based on our data, the Czech Republic currently makes up 5.8% of our total sales. They also only make up 4.5% of our customer base. However, they have the highest spending per customer.

I recommend focusing our marketing efforts on the Czech Republic based on our data.

Note that the sample size we have in our database is extremely low. Even though the people from the Czech Republic have the highest spending per customer, our sample size was 2. Realistically speaking, our conclusions were unreliable.

Albums vs Tracks

Suppose that the company wants to know if the customers purchase an entire album, or just a single track.

If the majority of the customers only purchase select track, we could be losing a significant amount of sales. So it could be a good idea to increase the price of selling track seperately. Alternatively, we could sell the entire album at a discount as a promotion.

If the album purchases are too low, the company could drop them entirely from the shop. In that case, we should focus our efforts on purchasing rights to select tracks.

We'll need to write a query that categorizes each invoice as either an album purchase or not, and calculates the following summary statistics:

  • Number of invoices
  • Percentage of invoices
  • Sales
  • Percentage of Sales

Note: For the invoice table, the customer cannot buy an album AND individual tracks in one purchase.

In [9]:
q6 = '''
WITH invoice_detailed AS
    (SELECT
        t.track_id,
        i.invoice_id,
        (i.total) total
    FROM
        track t
    INNER JOIN invoice_line il ON il.track_id = t.track_id
    INNER JOIN invoice i ON i.invoice_id = il.invoice_id
    GROUP BY i.invoice_id
    )
    
SELECT
    album_purchase,
    COUNT(*) number_of_invoices,
    CAST (COUNT(*) AS FLOAT)/(SELECT COUNT(invoice_id) FROM invoice_detailed) AS invoice_percentage,
    SUM(total) sales,
    CAST (SUM(total) AS FLOAT)/(SELECT SUM(total) FROM invoice_detailed) AS sales_percentage  
FROM 
    (
    SELECT
        ivd.*,
        CASE
            WHEN
                (
                SELECT t.track_id FROM track t 
                WHERE t.album_id = (
                                    SELECT t2.album_id FROM track t2 
                                    WHERE t2.track_id = ivd.track_id
                                    )
                EXCEPT 

                SELECT il2.track_id FROM invoice_line il2
                WHERE il2.invoice_id = ivd.invoice_id
                ) IS NULL
             AND
                (
                SELECT il2.track_id FROM invoice_line il2
                WHERE il2.invoice_id = ivd.invoice_id

                EXCEPT 

                SELECT t.track_id FROM track t
                WHERE t.album_id = (
                                    SELECT t2.album_id FROM track t2
                                    WHERE t2.track_id = ivd.track_id
                                    ) 
                ) IS NULL                                  
            THEN 'Yes'
            ELSE 'No'
        END AS 'album_purchase'
    FROM invoice_detailed ivd
    )
GROUP BY album_purchase
'''

album_purchase = run_query(q6)
album_purchase
Out[9]:
album_purchase number_of_invoices invoice_percentage sales sales_percentage
0 No 500 0.814332 3248.19 0.68972
1 Yes 114 0.185668 1461.24 0.31028
In [10]:
#Figure settings
sns.set(font_scale=1.5)
fig = plt.figure(figsize=(15,7.5))

#Generate the left pie chart.
ax = fig.add_subplot(1, 2, 1)
percent_labels = album_purchase['invoice_percentage']
ax = plt.pie(percent_labels, labels=['Track', 'Album'], autopct='%1.1f%%', shadow=True)
plt.title('Percentage of Invoices')
    
    
#Generate the right pie chart.
ax2 = fig.add_subplot(1, 2, 2)
percent_labels = album_purchase['sales_percentage']
ax2 = plt.pie(percent_labels, labels=['Track', 'Album'], autopct='%1.1f%%', shadow=True)
plt.title('Percentage of Sales')

plt.show()

Based on the data, the album purchases are significant. Album purchases make up 18.57% of the invoices, and 31.02% of the sales.

I recommend not dropping albums from the shop. However, I would suggest doing promotional discounts on unpopular albums. Currently, we are charging customers 99 cents per track. If the customer buys an album with 3 tracks, they are charged 2.97. As a result, many customers are simply selecting tracks they like rather than buying whole albums.

Determining Best Selling Album

As I mentioned in the previous section, albums make up a significant amount of sales. Suppose that the company is interested in advertising the best performing albums.

We can use SQL/pandas to determine the revenue generated per song. This is going to be the key metric because we want to get the maximum value for each song we purchase the rights for.

In [11]:
q7 = '''
WITH invoice_detailed AS
    (SELECT
        a.album_id,
        a.title,
        t.track_id,
        i.invoice_id,
        (i.total) total
    FROM
        track t
    INNER JOIN invoice_line il ON il.track_id = t.track_id
    INNER JOIN invoice i ON i.invoice_id = il.invoice_id
    INNER JOIN album a ON a.album_id = t.album_id
    GROUP BY i.invoice_id
    ),
    
    songs AS
    (
    SELECT 
        album_id album_key,
        count(*) number_of_songs
    FROM 
        track
    GROUP BY album_key
    )
    
SELECT
    album_id,
    title,
    SUM(total),
    count(*) copies_sold,
    number_of_songs,
    CAST (SUM(total) AS FLOAT)/(number_of_songs) Revenue_per_song      
FROM 
    (
    SELECT
        ivd.*,
        CASE
            WHEN
                (
                SELECT t.track_id FROM track t 
                WHERE t.album_id = (
                                    SELECT t2.album_id FROM track t2 
                                    WHERE t2.track_id = ivd.track_id
                                    )
                EXCEPT 

                SELECT il2.track_id FROM invoice_line il2
                WHERE il2.invoice_id = ivd.invoice_id
                ) IS NULL
             AND
                (
                SELECT il2.track_id FROM invoice_line il2
                WHERE il2.invoice_id = ivd.invoice_id

                EXCEPT 

                SELECT t.track_id FROM track t
                WHERE t.album_id = (
                                    SELECT t2.album_id FROM track t2
                                    WHERE t2.track_id = ivd.track_id
                                    ) 
                ) IS NULL                                  
            THEN 'Yes'
            ELSE 'No'
        END AS 'album_purchase'
    FROM invoice_detailed ivd
    ) q
INNER JOIN songs s ON s.album_key = q.album_id 
WHERE album_purchase = 'Yes' 
GROUP BY album_id
ORDER BY copies_sold DESC
LIMIT 5
'''

album_sales = run_query(q7)
album_sales
Out[11]:
album_id title SUM(total) copies_sold number_of_songs Revenue_per_song
0 120 Are You Experienced? 100.98 6 17 5.94
1 119 Get Born 51.48 4 13 3.96
2 186 News Of The World 43.56 4 11 3.96
3 157 Miles Ahead 41.58 3 14 2.97
4 188 Green 32.67 3 11 2.97
In [12]:
#Figure settings
sns.set(font_scale=1.5)
fig = plt.figure(figsize=(10,3))

#Generate the bar plot
ax = fig.add_subplot(1, 1, 1)
ax = sns.barplot(y="title", x="Revenue_per_song", data=album_sales, palette="deep")
ax.set_ylabel('')
ax.set_xlabel('')
ax.set_xticklabels('')
plt.title('Revenue Per Song') 

#Generate Revenue_per_song labels.
percent_labels = album_sales['Revenue_per_song']
for i in range(len(ax.patches)):
    p = ax.patches[i]
    width = p.get_width()
    ax.text(width/2,
            p.get_y()+p.get_height()-0.2,
            '$'+'{:1.2f}'.format(percent_labels.iloc[i]),
            ha="center")
plt.show()

Based on our data, the best performing album is 'Are You Experienced?' generating $5.94 per song.

Note that our sample size was 6. So it is probably a better idea to gather more data before making an expensive business decision.

Determining the Best Artists

Suppose the company is interested in knowing which artist is the most valuable to the company. Some artists could have more songs than others. The key metric is the average revenue per song. This metric will be categorized by the artist's name.

Similar to the data analysis process above, we'll use SQL to get the data we need out of the database. Then use pandas/seaborn/matplotlib for data visualization.

In [13]:
q8 = '''
WITH invoice_line_detailed AS
    (
    SELECT
        ar.artist_id,
        ar.name,
        il.quantity,
        il.unit_price
    FROM
        track t
    INNER JOIN invoice_line il ON il.track_id = t.track_id
    INNER JOIN album a ON a.album_id = t.album_id
    INNER JOIN artist ar ON ar.artist_id = a.artist_id
    ),

    tracks_artist AS
    (
    SELECT
        a.artist_id artist_key,
        count(*) songs_in_shop
    FROM
        track t
    INNER JOIN album a ON a.album_id = t.album_id
    INNER JOIN artist ar ON ar.artist_id = a.artist_id
    GROUP BY artist_key
    )

SELECT
    artist_id,
    name Artist,
    SUM(quantity) songs_sold,
    SUM(unit_price) revengue_generated,
    songs_in_shop,
    SUM(unit_price)/CAST (songs_in_shop AS FLOAT) revenue_per_song
FROM invoice_line_detailed ivd
INNER JOIN tracks_artist ta on ta.artist_key = ivd.artist_id 
GROUP BY name
ORDER BY revenue_per_song DESC
'''


artist_sales = run_query(q8)
top_five_artists = artist_sales.iloc[0:5]
top_five_artists
Out[13]:
artist_id Artist songs_sold revengue_generated songs_in_shop revenue_per_song
0 196 Cake 31 30.69 1 30.69
1 94 Jimi Hendrix 187 185.13 17 10.89
2 253 Calexico 10 9.90 1 9.90
3 135 System Of A Down 93 92.07 11 8.37
4 240 Gustav Mahler 8 7.92 1 7.92
In [14]:
#Figure settings
sns.set(font_scale=1.5)
fig = plt.figure(figsize=(13,8))


#Generate the top bar plot
ax = fig.add_subplot(2, 1, 1)
ax = sns.barplot(y="Artist", x="revenue_per_song", data=top_five_artists, palette="deep")
ax.set_ylabel('')
ax.set_xlabel('')
ax.set_xticklabels('')
plt.title('Revenue Per Song') 

#Generate revenue_per_song labels.
labels = top_five_artists['revenue_per_song']
for i in range(len(ax.patches)):
    p = ax.patches[i]
    width = p.get_width()
    ax.text(width/2,
            p.get_y()+p.get_height()-0.2,
            '$'+'{:1.2f}'.format(labels.iloc[i]),
            ha="center")
    
#Generate the bottom bar plot
ax = fig.add_subplot(2, 1, 2)
ax = sns.barplot(y="Artist", x="songs_in_shop", data=top_five_artists, palette="deep")
ax.set_ylabel('')
ax.set_xlabel('')
ax.set_xticklabels('')
plt.title('Songs in Shop') 

#Generate 'Songs in Shop' labels.
percent_labels = top_five_artists['songs_in_shop']
for i in range(len(ax.patches)):
    p = ax.patches[i]
    width = p.get_width()
    ax.text(width/2,
            p.get_y()+p.get_height()-0.2,
            '{:1.0f}'.format(percent_labels.iloc[i]),
            ha="center")
plt.show()

Immediately we see the artist "Cake" has only one song in shop. However, this one song generated the most revenue($30.69/song) in the entire shop. The artists 'Calexico' and 'Gustav Mahler' also only have one song in our shop.

My recommendation for the company is to perform more research on these artists with only one song in our shop. We'll need to see if they have more songs that are similar in popularity. It is possible that these artists are actually quite popular, but customers can only find one of their songs in our shop. As a result, we are missing out on potential sales.


Learning Summary

Concepts explored: data visualization, SQL, business oriented data analysis, seaborn plots

The files used for this project can be found in my GitHub repository.



Comments

comments powered by Disqus