YStrano
GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_18/code/solution-code/SQL-lab-solution.ipynb
Kernel: Python 3

SQL Lab

In this lab we will learn how to execute SQL from the IPython notebook and practice some queries on the Northwind sample database that we used in Lesson 3.1.

First of all let's install psycopg2.

Either:

conda install psycopg2

Or:

pip install psycopg2

import psycopg2
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
/anaconda3/lib/python3.6/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>. """)
conn_str = "host='dsi.c20gkj5cvu3l.us-east-1.rds.amazonaws.com' dbname='northwind' user='dsi_student' password='gastudents'"
conn = psycopg2.connect(conn_str)

Nice!!! We can now go ahead with the lab!

1: Inspect the Database

If we were connected via console, it would be easy to list all tables using \dt. We can, however, access table information by performing a query on the information_schema.tables table.

1.a: List Tables

  1. Write a SELECT statement that lists all the tables in the public schema of the northwind database, sorted alphabetically.

SQL_STRING = '''
SELECT table_schema, table_name
FROM information_schema.tables
WHERE table_schema = 'public'
ORDER BY table_name;
'''
northwind_tables = pd.read_sql(SQL_STRING, con=conn)
northwind_tables.head()

1.b: Print Schemas

The table INFORMATION_SCHEMA.COLUMNS contains schema information on each table.

Query it to display the schemas of all the public tables. In particular we are interested in the column names and data types. Make sure you only include the public schema to avoid cluttering your results with internal Postgres tables.

SQL_STRING = '''
select table_name, column_name, data_type
from INFORMATION_SCHEMA.COLUMNS
where table_catalog = 'northwind'
and table_schema = 'public';
'''
northwind_schemas = pd.read_sql(SQL_STRING, con=conn)
northwind_schemas.head()

1.c Table peek

Another way of quickly looking at table information is to query the first few rows. Do this for a couple of tables, for example: orders, products, usstates. Display only the first 3 rows.

SQL_STRING = 'select * from orders limit 3'
pd.read_sql(SQL_STRING, con=conn)

SQL_STRING = 'select * from products limit 3'
pd.read_sql(SQL_STRING, con=conn)

SQL_STRING = 'select * from usstates limit 3'
pd.read_sql(SQL_STRING, con=conn)

As you can see, some tables (like usstates or region) contain information that is probably less prone to change than other tables (like orders or order_details). This database is well organized to avoid unnecessary duplication. Let's start digging deeper in the data.

2: Products

What products is this company selling? The products and categories tables contain information to answer this question.

Use a combination of SQL queries and Pandas merge to answer the following questions:

  • What categories of products is the company selling?

  • How many products per category does the catalog contain?

  • Focusing only on products that have not been discontinued: how many products per category?

  • What are the most expensive 5 products (not discontinued)?

  • How many units of each of these 5 products are there in stock?

  • Draw a pie chart of the categories, with slices that have the size of the number of products in that category (use non discontinued products).
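The SQL-then-merge pattern used throughout this section can be sketched on toy data first (a hypothetical two-category catalog, not the Northwind tables):

```python
import pandas as pd

# Hypothetical stand-ins for the categories and products tables.
categories = pd.DataFrame({
    'CategoryID': [1, 2],
    'CategoryName': ['Beverages', 'Condiments'],
})
products = pd.DataFrame({
    'ProductID': [10, 11, 12],
    'CategoryID': [1, 1, 2],
    'Discontinued': [0, 1, 0],
})

# Count non-discontinued products per category, then attach the names via merge.
counts = (products[products['Discontinued'] == 0]
          .groupby('CategoryID').size().reset_index(name='count'))
per_category = pd.merge(counts, categories, on='CategoryID')
print(per_category)
```

In the lab below the counting happens in SQL and only the merge happens in pandas, but the result has the same shape.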

2.a: What categories of products is the company selling?

Remember that PostgreSQL folds unquoted identifiers to lowercase, so camel-case column names like "CategoryID" must be wrapped in double quotes.

categories = pd.read_sql('select "CategoryID", "CategoryName", "Description" from categories;', con=conn)
categories

2.b: How many products per category does the catalog contain?

Keep in mind that if you use the ipython-sql %sql magic, you can convert a query result to a pandas DataFrame with its .DataFrame() method; here we use pd.read_sql directly, which already returns a DataFrame.

SQL_STRING = '''
select "CategoryID", count(*) AS count
from products
group by "CategoryID"
order by count desc;
'''
category_counts = pd.read_sql(SQL_STRING, con=conn)
category_counts
pd.merge(category_counts, categories, on="CategoryID")

2.c: How many not discontinued products per category?

SQL_STRING = '''
select "CategoryID", count(*) AS count
from products
where "Discontinued" = 0
group by "CategoryID"
order by count desc;
'''
category_counts_not_disc = pd.read_sql(SQL_STRING, con=conn)
category_counts_not_disc

products_per_category = pd.merge(category_counts_not_disc, categories, on='CategoryID')
products_per_category

2.d: What are the most expensive 5 products (not discontinued)?

SQL_STRING = '''
select "ProductID", "ProductName", "UnitPrice"
from products
where "Discontinued" = 0
order by "UnitPrice" desc
limit 5;
'''
pd.read_sql(SQL_STRING, con=conn)

2.e: How many units of each of these 5 products are there in stock?

SQL_STRING = '''
select "ProductID", "ProductName", "UnitsInStock"
from products
where "Discontinued" = 0
order by "UnitPrice" desc
limit 5;
'''
pd.read_sql(SQL_STRING, con=conn)

2.f: Pie Chart

Use pandas to make a pie chart plot.

products_per_category[['CategoryName', 'count']] \
    .set_index('CategoryName') \
    .plot(kind='pie', y='count', figsize=(10, 10), colormap='Spectral')
plt.legend(loc=2)
[Pie chart: number of non-discontinued products per category]

3: Orders

Now that we have a better understanding of products, let's start digging into orders.

  • How many orders in total?

  • How many orders per year

  • How many orders per quarter

  • Which country is receiving the most orders

  • Which country is receiving the least

  • What's the average shipping time (ShippedDate - OrderDate)

  • What customer is submitting the highest number of orders?

  • What customer is generating the highest revenue (need to pd.merge with order_details)

  • What fraction of the revenue is generated by the top 5 customers?

pd.read_sql('select * from orders limit 3;', con=conn)

3.a: How many orders in total?

pd.read_sql('select count(*) from orders;', con=conn)

3.b: How many orders per year?

order_dates = pd.read_sql('select "OrderID", "OrderDate" from orders;', con=conn)
order_dates.index = pd.to_datetime(order_dates['OrderDate'])
order_dates.head()

orders_per_year = order_dates.groupby(pd.Grouper(freq='Y'))['OrderID'].count()
orders_per_year
OrderDate
1996-12-31    152
1997-12-31    408
1998-12-31    270
Freq: A-DEC, Name: OrderID, dtype: int64
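pd.Grouper(freq='Y') groups a DatetimeIndex by calendar year; the same per-year counting can also be done by grouping on the index's .year attribute. A self-contained sketch with made-up order dates (not the Northwind data):

```python
import pandas as pd

# Hypothetical order dates spanning two years.
order_dates = pd.DataFrame(
    {'OrderID': [1, 2, 3]},
    index=pd.to_datetime(['1996-07-01', '1996-12-15', '1997-03-02']))

# Grouping on the DatetimeIndex's .year gives the same counts as pd.Grouper(freq='Y').
per_year = order_dates.groupby(order_dates.index.year)['OrderID'].count()
print(per_year)  # 1996 -> 2 orders, 1997 -> 1 order
```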

3.c: How many orders per quarter?

Make a line plot for these.

orders_per_quarter = order_dates.groupby(pd.Grouper(freq='Q'))['OrderID'].count()
orders_per_quarter.plot()
[Line plot: number of orders per quarter]

3.d: Which country is receiving the most orders?

SQL_STRING = '''
select "ShipCountry", count(*) as count
from orders
group by "ShipCountry"
order by count desc
limit 3;
'''
pd.read_sql(SQL_STRING, con=conn)

3.e: Which country is receiving the least?

SQL_STRING = '''
select "ShipCountry", count(*) as count
from orders
group by "ShipCountry"
order by count
limit 3;
'''
pd.read_sql(SQL_STRING, con=conn)

3.f: What's the average shipping time (ShippedDate - OrderDate)?

# In 2 steps
SQL_STRING = '''
select "ShippedDate", "OrderDate",
       "ShippedDate" - "OrderDate" as daysdiff
from orders;
'''
days_diff = pd.read_sql(SQL_STRING, con=conn)
days_diff['daysdiff'].mean()
8.491965389369591
# In 1 query:
pd.read_sql('select avg("ShippedDate" - "OrderDate") from orders;', con=conn)
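In PostgreSQL, subtracting two date columns yields an integer number of days. The same arithmetic works in pure pandas via the .dt.days accessor; a sketch on made-up dates:

```python
import pandas as pd

# Hypothetical order and ship dates.
df = pd.DataFrame({
    'OrderDate':   pd.to_datetime(['1996-07-01', '1996-07-05']),
    'ShippedDate': pd.to_datetime(['1996-07-09', '1996-07-11']),
})

# Subtracting two datetime columns gives a Timedelta; .dt.days extracts whole days.
df['daysdiff'] = (df['ShippedDate'] - df['OrderDate']).dt.days
print(df['daysdiff'].mean())  # (8 + 6) / 2 = 7.0
```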

3.g: What customer is submitting the highest number of orders?

SQL_STRING = '''
select "CustomerID", count(*) as count
from orders
group by "CustomerID"
order by count desc
limit 3;
'''
pd.read_sql(SQL_STRING, con=conn)

3.h: What customer is generating the highest revenue (need to pd.merge with order_details)?

customer_orders = pd.read_sql('select "CustomerID", "OrderID" from orders;', con=conn)
order_details = pd.read_sql('select * from order_details', con=conn)
customer_order_details = pd.merge(customer_orders, order_details)
customer_order_details.head()

customer_order_details['OrderRevenue'] = customer_order_details['UnitPrice'] * \
    customer_order_details['Quantity'] * \
    (1.0 - customer_order_details['Discount'])

customer_revenue = customer_order_details.groupby('CustomerID')['OrderRevenue'].sum().sort_values(ascending=False)
customer_revenue.head()
CustomerID
QUICK    110277.3050
ERNSH    104874.9785
SAVEA    104361.9500
RATTC     51097.8005
HUNGO     49979.9050
Name: OrderRevenue, dtype: float64
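A quick sanity check of the per-line-item revenue formula, UnitPrice * Quantity * (1 - Discount), on one invented line item (numbers not from order_details):

```python
# Hypothetical line item: 3 units at 10.0 each with a 10% discount.
unit_price, quantity, discount = 10.0, 3, 0.10
revenue = unit_price * quantity * (1.0 - discount)
print(revenue)  # 10.0 * 3 * 0.9 = 27.0
```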

3.i: What fraction of the revenue is generated by the top 5 customers?

Compare that with the fraction represented by 5 customers over the total number of customers.

customer_revenue.head().sum() / customer_revenue.sum()
0.33227543988244534
5.0 / customer_revenue.count()
0.056179775280898875