Path: blob/master/lessons/lesson_18/code/solution-code/SQL-lab-solution.ipynb
1904 views
SQL Lab
In this lab we will learn how to use execute SQL from the ipython notebook and practice some queries on the Northwind sample database that we used in Lesson 3.1.
First of all let's install psycopg2.
Either:
conda install psycopg2
Or:
pip install psycopg2
Nice!!! We can now go ahead with the lab!
1: Inspect the Database
If we were connected via console, it would be easy to list all tables using \dt
. We can however access table information performing a query on the information_schema.tables
table.
1.a: List Tables
Write a
SELECT
statement that lists all the tables in the public schema of thenorthwind
database, sorted alphabetically.
1.b: Print Schemas
The table INFORMATION_SCHEMA.COLUMNS
contains schema information on each.
Query it to display schemas of all the public tables. In particular we are interested in the column names and data types. Make sure you only include public schemas to avoid cluttering your results with a bunch of postgres related stuff.
1.c Table peek
Another way of quickly looking at table information is to query the first few rows. Do this for a couple of tables, for example: orders
, products
, usstates
. Display only the first 3 rows.
As you can see, some tables (like usstates
or region
) contain information that is probably less prone to change than other tables (like orders
or order_details
). This database is well organized to avoid unnecessary duplication. Let's start digging deeper in the data.
2: Products
What products is this company selling? The products
and categories
tables contain information to answer this question.
Use a combination of SQL queries and Pandas merge to answer the following questions:
What categories of products is the company selling?
How many products per category does the catalog contain?
Let's focus only on products that have not been discontinued => how many products per category?
What are the most expensive 5 products (not discontinued)?
How many units of each of these 5 products are there in stock?
Draw a pie chart of the categories, with slices that have the size of the number of products in that category (use non discontinued products).
2.a: What categories of products is the company selling?
Remember that PostgreSQL is case sensitive.
2.b: How many products per category does the catalog contain?
Keep in mind that you can cast a %sql result to a pandas dataframe using the .DataFrame()
method.
2.c: How many not discontinued products per category?
2.d: What are the most expensive 5 products (not discontinued)?
2.e: How many units of each of these 5 products are there in stock?
2.f: Pie Chart
Use pandas to make a pie chart plot.
3: Orders
Now that we have a better understanding of products, let's start digging into orders.
How many orders in total?
How many orders per year
How many orders per quarter
Which country is receiving the most orders
Which country is receiving the least
What's the average shipping time (ShippedDate - OrderDate)
What customer is submitting the highest number of orders?
What customer is generating the highest revenue (need to pd.merge with order_details)
What fraction of the revenue is generated by the top 5 customers?
3.a: How many orders in total?
3.b: How many orders per year?
3.c: How many orders per quarter?
Make a line plot for these.
3.d: Which country is receiving the most orders?
3.e: Which country is receiving the least?
3.f: What's the average shipping time (ShippedDate - OrderDate)?
3.g: What customer is submitting the highest number of orders?
3.h: What customer is generating the highest revenue (need to pd.merge with order_details)?
3.i: What fraction of the revenue is generated by the top 5 customers?
Compare that with the fraction represented by 5 customers over the total number of customers.