Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
ine-rmotr-curriculum
GitHub Repository: ine-rmotr-curriculum/FreeCodeCamp-Pandas-Real-Life-Example
Path: blob/master/Lecture_2.ipynb
145 views
Kernel: Python 3

rmotr


The Sakila Database

One of the best example databases out there is the Sakila Database, which was originally created by MySQL and has been open sourced under the terms of the BSD License.

The Sakila database is a nicely normalised schema modelling a DVD rental store, featuring things like films, actors, film-actor relationships, and a central inventory table that connects films, stores, and rentals.

purple-divider

Hands on!

import numpy as np import pandas as pd import matplotlib.pyplot as plt import sqlite3 %matplotlib inline

green-divider

Loading our data:

conn = sqlite3.connect('data/sakila.db') df = pd.read_sql(''' SELECT rental.rental_id, rental.rental_date, rental.return_date, customer.last_name AS customer_lastname, store.store_id, city.city AS rental_store_city, film.title AS film_title, film.rental_duration AS film_rental_duration, film.rental_rate AS film_rental_rate, film.replacement_cost AS film_replacement_cost, film.rating AS film_rating FROM rental INNER JOIN customer ON rental.customer_id == customer.customer_id INNER JOIN inventory ON rental.inventory_id == inventory.inventory_id INNER JOIN store ON inventory.store_id == store.store_id INNER JOIN address ON store.address_id == address.address_id INNER JOIN city ON address.city_id == city.city_id INNER JOIN film ON inventory.film_id == film.film_id ; ''', conn, index_col='rental_id', parse_dates=['rental_date', 'return_date'])

green-divider

The data at a glance:

df.head()
df.shape
(16044, 10)
df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 16044 entries, 1 to 16049 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 rental_date 16044 non-null datetime64[ns] 1 return_date 15861 non-null datetime64[ns] 2 customer_lastname 16044 non-null object 3 store_id 16044 non-null int64 4 rental_store_city 16044 non-null object 5 film_title 16044 non-null object 6 film_rental_duration 16044 non-null int64 7 film_rental_rate 16044 non-null float64 8 film_replacement_cost 16044 non-null float64 9 film_rating 16044 non-null object dtypes: datetime64[ns](2), float64(2), int64(2), object(4) memory usage: 1.3+ MB
df.describe()

green-divider

Numerical analysis and visualization

We'll analyze the film_rental_rate column:

df['film_rental_rate'].describe()
count 16044.000000 mean 2.942630 std 1.649678 min 0.990000 25% 0.990000 50% 2.990000 75% 4.990000 max 4.990000 Name: film_rental_rate, dtype: float64
df['film_rental_rate'].mean()
2.9426302667663933
df['film_rental_rate'].median()
2.99
df['film_rental_rate'].plot(kind='box', vert=False, figsize=(14,6))
<matplotlib.axes._subplots.AxesSubplot at 0x7f25cb837df0>
Image in a Jupyter notebook
df['film_rental_rate'].plot(kind='density', figsize=(14,6)) # kde
<matplotlib.axes._subplots.AxesSubplot at 0x7f25cb858e80>
Image in a Jupyter notebook
ax = df['film_rental_rate'].value_counts().plot(kind='bar', figsize=(14,6)) ax.set_ylabel('Number of Rentals')
Text(0, 0.5, 'Number of Rentals')
Image in a Jupyter notebook

green-divider

Categorical analysis and visualization

We'll analyze the rental_store_city column:

df['rental_store_city'].value_counts()
Woodridge 8121 Lethbridge 7923 Name: rental_store_city, dtype: int64
df['rental_store_city'].value_counts().plot(kind='pie', figsize=(6,6))
<matplotlib.axes._subplots.AxesSubplot at 0x7f25bf853910>
Image in a Jupyter notebook
ax = df['rental_store_city'].value_counts().plot(kind='bar', figsize=(14,6)) ax.set_ylabel('Number of Rentals')
Text(0, 0.5, 'Number of Rentals')
Image in a Jupyter notebook

green-divider

Column wrangling

We can also create new columns or modify existing ones.

Add and calculate a new rental_rate_return column

We want to know the rental rate of return of each film. To do that we'll use this formula:

rental_gain_return=film_rental_ratefilm_replacement_cost∗100rental\_gain\_return = \frac{film\_rental\_rate}{film\_replacement\_cost} * 100
df['rental_gain_return'] = df['film_rental_rate'] / df['film_replacement_cost'] * 100 df['rental_gain_return'].head()
rental_id 1 13.597090 2 17.598587 3 19.946631 4 4.502046 5 9.969990 Name: rental_gain_return, dtype: float64
df['rental_gain_return'].plot(kind='density', figsize=(14,6))
<matplotlib.axes._subplots.AxesSubplot at 0x7f25bf7c7e20>
Image in a Jupyter notebook
df['rental_gain_return'].mean().round(2)
16.34
df['rental_gain_return'].median().round(2)
13.6
ax = df['rental_gain_return'].plot(kind='density', figsize=(14,6)) # kde ax.axvline(df['rental_gain_return'].mean(), color='red') ax.axvline(df['rental_gain_return'].median(), color='green')
<matplotlib.lines.Line2D at 0x7f25bf7c20d0>
Image in a Jupyter notebook

Each rental represents 13.6% of film cost.

So 7.35 rentals are needed to recover film market price (film_replacement_cost)

100 / 13.6
7.352941176470589

While in average each film is rented 16.74 times.

df['film_title'].value_counts().mean()
16.747390396659707

green-divider

Selection & Indexing:

Get the rental records of the customer with lastname HANSEN

df.loc[df['customer_lastname'] == 'HANSEN']

Create a list of all the films with the highest replacement cost

df['film_replacement_cost'].max()
29.99
df.loc[df['film_replacement_cost'] == df['film_replacement_cost'].max(), 'film_title'].unique()
array(['IDOLS SNATCHERS', 'LAWLESS VISION', 'SONG HEDWIG', 'LOATHING LEGALLY', 'PATIENT SISTER', 'RESERVOIR ADAPTATION', 'JEEPERS WEDDING', 'GOLDFINGER SENSIBILITY', 'CHARIOTS CONSPIRACY', 'HONEY TIES', 'GRAFFITI LOVE', 'SLACKER LIAISONS', 'DIRTY ACE', 'BLINDNESS GUN', 'WYOMING STORM', 'FEUD FROGMEN', 'SALUTE APOLLO', 'JINGLE SAGEBRUSH', 'HILLS NEIGHBORS', 'UNCUT SUICIDES', 'EVERYONE CRAFT', 'FLATLINERS KILLER', 'BALLROOM MOCKINGBIRD', 'RIVER OUTLAW', 'ARABIA DOGMA', 'VIRGIN DAISY', 'JERICHO MULAN', 'SASSY PACKER', 'TRACY CIDER', 'LOVER TRUMAN', 'DOCTOR GRAIL', 'GILMORE BOILED', 'PRINCESS GIANT', 'CRUELTY UNFORGIVEN', 'REIGN GENTLEMEN', 'WEST LION', 'BONNIE HOLOCAUST', 'EARTH VISION', 'RANDOM GO', 'CLOCKWORK PARADISE', 'FANTASIA PARK', 'RIGHT CRANES', 'CUPBOARD SINNERS', 'OSCAR GOLD', 'SMILE EARRING', 'HOLLYWOOD ANONYMOUS', 'POSEIDON FOREVER', 'EXTRAORDINARY CONQUERER', 'QUEST MUSSOLINI', 'JAPANESE RUN', 'CLYDE THEORY', 'DESPERATE TRAINSPOTTING'], dtype=object)

How many PG or PG-13 rating films were rented?

df.loc[(df['film_rating'] == 'PG') | (df['film_rating'] == 'PG-13')].shape[0]
6797

purple-divider