GitHub Repository: ine-rmotr-curriculum/FreeCodeCamp-Pandas-Real-Life-Example
Path: blob/master/Lecture_2.ipynb
¹⁴⁵ views

Kernel: Python 3

rmotr

The Sakila Database

One of the best example databases out there is the Sakila Database, which was originally created by MySQL and has been open sourced under the terms of the BSD License.

The Sakila database is a nicely normalised schema modelling a DVD rental store, featuring things like films, actors, film-actor relationships, and a central inventory table that connects films, stores, and rentals.

purple-divider

Hands on!

In [1]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sqlite3

%matplotlib inline

green-divider

Loading our data:

In [2]:

conn = sqlite3.connect('data/sakila.db')

df = pd.read_sql('''
    SELECT
        rental.rental_id, rental.rental_date, rental.return_date,
        customer.last_name AS customer_lastname,
        store.store_id,
        city.city AS rental_store_city,
        film.title AS film_title, film.rental_duration AS film_rental_duration,
        film.rental_rate AS film_rental_rate, film.replacement_cost AS film_replacement_cost,
        film.rating AS film_rating
    FROM rental
    INNER JOIN customer ON rental.customer_id == customer.customer_id
    INNER JOIN inventory ON rental.inventory_id == inventory.inventory_id
    INNER JOIN store ON inventory.store_id == store.store_id
    INNER JOIN address ON store.address_id == address.address_id
    INNER JOIN city ON address.city_id == city.city_id
    INNER JOIN film ON inventory.film_id == film.film_id
    ;
''', conn, index_col='rental_id', parse_dates=['rental_date', 'return_date'])

green-divider

The data at a glance:

In [3]:

df.head()

Out[3]:

In [4]:

df.shape

Out[4]:

(16044, 10)

In [5]:

df.info()

Out[5]:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16044 entries, 1 to 16049
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   rental_date            16044 non-null  datetime64[ns]
 1   return_date            15861 non-null  datetime64[ns]
 2   customer_lastname      16044 non-null  object        
 3   store_id               16044 non-null  int64         
 4   rental_store_city      16044 non-null  object        
 5   film_title             16044 non-null  object        
 6   film_rental_duration   16044 non-null  int64         
 7   film_rental_rate       16044 non-null  float64       
 8   film_replacement_cost  16044 non-null  float64       
 9   film_rating            16044 non-null  object        
dtypes: datetime64[ns](2), float64(2), int64(2), object(4)
memory usage: 1.3+ MB

In [6]:

df.describe()

Out[6]:

green-divider

Numerical analysis and visualization

We'll analyze the film_rental_rate column:

In [7]:

df['film_rental_rate'].describe()

Out[7]:

count    16044.000000
mean         2.942630
std          1.649678
min          0.990000
25%          0.990000
50%          2.990000
75%          4.990000
max          4.990000
Name: film_rental_rate, dtype: float64

In [8]:

df['film_rental_rate'].mean()

Out[8]:

2.9426302667663933

In [9]:

df['film_rental_rate'].median()

Out[9]:

2.99

In [10]:

df['film_rental_rate'].plot(kind='box', vert=False, figsize=(14,6))

Out[10]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f25cb837df0>

In [11]:

df['film_rental_rate'].plot(kind='density', figsize=(14,6)) # kde

Out[11]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f25cb858e80>

In [12]:

ax = df['film_rental_rate'].value_counts().plot(kind='bar', figsize=(14,6))
ax.set_ylabel('Number of Rentals')

Out[12]:

Text(0, 0.5, 'Number of Rentals')

green-divider

Categorical analysis and visualization

We'll analyze the rental_store_city column:

In [13]:

df['rental_store_city'].value_counts()

Out[13]:

Woodridge     8121
Lethbridge    7923
Name: rental_store_city, dtype: int64

In [14]:

df['rental_store_city'].value_counts().plot(kind='pie', figsize=(6,6))

Out[14]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f25bf853910>

In [15]:

ax = df['rental_store_city'].value_counts().plot(kind='bar', figsize=(14,6))
ax.set_ylabel('Number of Rentals')

Out[15]:

Text(0, 0.5, 'Number of Rentals')

green-divider

Column wrangling

We can also create new columns or modify existing ones.

Add and calculate a new `rental_rate_return` column

We want to know the rental rate of return of each film. To do that we'll use this formula:

rental\_gain\_return = \frac{film\_rental\_rate}{film\_replacement\_cost} * 100

In [16]:

df['rental_gain_return'] = df['film_rental_rate'] / df['film_replacement_cost'] * 100

df['rental_gain_return'].head()

Out[16]:

rental_id
  13.597090
  17.598587
  19.946631
   4.502046
   9.969990
Name: rental_gain_return, dtype: float64

In [17]:

df['rental_gain_return'].plot(kind='density', figsize=(14,6))

Out[17]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f25bf7c7e20>

In [18]:

df['rental_gain_return'].mean().round(2)

Out[18]:

16.34

In [19]:

df['rental_gain_return'].median().round(2)

Out[19]:

13.6

In [20]:

ax = df['rental_gain_return'].plot(kind='density', figsize=(14,6)) # kde
ax.axvline(df['rental_gain_return'].mean(), color='red')
ax.axvline(df['rental_gain_return'].median(), color='green')

Out[20]:

<matplotlib.lines.Line2D at 0x7f25bf7c20d0>

Each rental represents 13.6% of film cost.

So 7.35 rentals are needed to recover film market price (film_replacement_cost)

In [21]:

100 / 13.6

Out[21]:

7.352941176470589

While in average each film is rented 16.74 times.

In [22]:

df['film_title'].value_counts().mean()

Out[22]:

16.747390396659707

green-divider

Selection & Indexing:

Get the rental records of the customer with lastname `HANSEN`

In [23]:

df.loc[df['customer_lastname'] == 'HANSEN']

Out[23]:

Create a list of all the films with the highest replacement cost

In [24]:

df['film_replacement_cost'].max()

Out[24]:

29.99

In [25]:

df.loc[df['film_replacement_cost'] == df['film_replacement_cost'].max(), 'film_title'].unique()

Out[25]:

array(['IDOLS SNATCHERS', 'LAWLESS VISION', 'SONG HEDWIG',
       'LOATHING LEGALLY', 'PATIENT SISTER', 'RESERVOIR ADAPTATION',
       'JEEPERS WEDDING', 'GOLDFINGER SENSIBILITY', 'CHARIOTS CONSPIRACY',
       'HONEY TIES', 'GRAFFITI LOVE', 'SLACKER LIAISONS', 'DIRTY ACE',
       'BLINDNESS GUN', 'WYOMING STORM', 'FEUD FROGMEN', 'SALUTE APOLLO',
       'JINGLE SAGEBRUSH', 'HILLS NEIGHBORS', 'UNCUT SUICIDES',
       'EVERYONE CRAFT', 'FLATLINERS KILLER', 'BALLROOM MOCKINGBIRD',
       'RIVER OUTLAW', 'ARABIA DOGMA', 'VIRGIN DAISY', 'JERICHO MULAN',
       'SASSY PACKER', 'TRACY CIDER', 'LOVER TRUMAN', 'DOCTOR GRAIL',
       'GILMORE BOILED', 'PRINCESS GIANT', 'CRUELTY UNFORGIVEN',
       'REIGN GENTLEMEN', 'WEST LION', 'BONNIE HOLOCAUST', 'EARTH VISION',
       'RANDOM GO', 'CLOCKWORK PARADISE', 'FANTASIA PARK', 'RIGHT CRANES',
       'CUPBOARD SINNERS', 'OSCAR GOLD', 'SMILE EARRING',
       'HOLLYWOOD ANONYMOUS', 'POSEIDON FOREVER',
       'EXTRAORDINARY CONQUERER', 'QUEST MUSSOLINI', 'JAPANESE RUN',
       'CLYDE THEORY', 'DESPERATE TRAINSPOTTING'], dtype=object)

How many `PG` or `PG-13` rating films were rented?

In [26]:

df.loc[(df['film_rating'] == 'PG') | (df['film_rating'] == 'PG-13')].shape[0]

Out[26]:

6797

purple-divider

The Sakila Database

Hands on!

Loading our data:

The data at a glance:

Numerical analysis and visualization

Categorical analysis and visualization

Column wrangling

Add and calculate a new `rental_rate_return` column

Selection & Indexing:

Get the rental records of the customer with lastname `HANSEN`

Create a list of all the films with the highest replacement cost

How many `PG` or `PG-13` rating films were rented?

Product

Resources

Company

The Sakila Database

Hands on!

Loading our data:

The data at a glance:

Numerical analysis and visualization

Categorical analysis and visualization

Column wrangling

Add and calculate a new rental_rate_return column

Selection & Indexing:

Get the rental records of the customer with lastname HANSEN

Create a list of all the films with the highest replacement cost

How many PG or PG-13 rating films were rented?

Add and calculate a new `rental_rate_return` column

Get the rental records of the customer with lastname `HANSEN`

How many `PG` or `PG-13` rating films were rented?