Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
ine-rmotr-curriculum
GitHub Repository: ine-rmotr-curriculum/FreeCodeCamp-Pandas-Real-Life-Example
Path: blob/master/Exercises_2.ipynb
145 views
Kernel: Python 3

rmotr


Exercises

The Sakila Database

One of the best example databases out there is the Sakila Database, which was originally created by MySQL and has been open sourced under the terms of the BSD License.

The Sakila database is a nicely normalised schema modelling a DVD rental store, featuring things like films, actors, film-actor relationships, and a central inventory table that connects films, stores, and rentals.

purple-divider

Hands on!

import numpy as np import pandas as pd import matplotlib.pyplot as plt import sqlite3 %matplotlib inline
conn = sqlite3.connect('data/sakila.db') df = pd.read_sql(''' SELECT rental.rental_id, rental.rental_date, rental.return_date, customer.last_name AS customer_lastname, store.store_id, city.city AS rental_store_city, film.title AS film_title, film.rental_duration AS film_rental_duration, film.rental_rate AS film_rental_rate, film.replacement_cost AS film_replacement_cost, film.rating AS film_rating FROM rental INNER JOIN customer ON rental.customer_id == customer.customer_id INNER JOIN inventory ON rental.inventory_id == inventory.inventory_id INNER JOIN store ON inventory.store_id == store.store_id INNER JOIN address ON store.address_id == address.address_id INNER JOIN city ON address.city_id == city.city_id INNER JOIN film ON inventory.film_id == film.film_id ; ''', conn, index_col='rental_id', parse_dates=['rental_date', 'return_date'])
df.head()

green-divider

What's the mean of film_rental_duration?

# your code goes here
df['film_rental_duration'].mean()

green-divider

What's the most common rental duration?

Show a bar plot with all the durations.

# your code goes here
df['film_rental_duration'].value_counts().plot(kind='bar', figsize=(14,6))

green-divider

What is the most common rental rate?

  • Show a pie plot with all possible rental rates.

  • Show a bar plot with all possible rental rates.

  • Which plot you think fits the best in this case? Why?

# your code goes here
df['film_rental_rate'].value_counts().plot(kind='pie', figsize=(6,6))
# your code goes here
df['film_rental_rate'].value_counts().plot(kind='bar', figsize=(14,6))

green-divider

How is the replacement cost distributed?

  • Show a box plot of the replacement costs.

  • Show a density plot of the replacement costs.

  • Add a red line on the mean.

  • Add a green line on the median median.

# your code goes here
df['film_replacement_cost'].plot(kind='box', vert=False, figsize=(14,6))
# your code goes here
ax = df['film_replacement_cost'].plot(kind='density', figsize=(14,6)) ax.axvline(df['film_replacement_cost'].mean(), color='red') ax.axvline(df['film_replacement_cost'].median(), color='green')

green-divider

How many films of each rating do we have?

  • Show the raw count of each film rating.

  • Show a bar plot with all possible film ratings.

# your code goes here
df['film_rating'].value_counts()
# your code goes here
df['film_rating'].value_counts().plot(kind='bar', figsize=(14,6))

green-divider

### Does the film replacement cost vary depending on film rating?

In the United States, film classification is a voluntary process with the ratings issued by the Motion Picture Association of America (MPAA) via the Classification and Rating Administration (CARA).

  • G (General Audiences): All Ages are Admitted.

  • PG (Parental Guidance Suggested): Some Material May Not Be Suitable for Children.

  • PG-13 (Parents Strongly Cautioned): Some Material May Be Inappropriate for Children Under 13.

  • R (Restricted): Under 17 Requires Accompanying Parent or Adult Guardian.

  • NC-17 (Adults Only): No One 17 and Under Admitted.

Show a grouped box plot per film rating with the film replacement costs.

# your code goes here
df[['film_replacement_cost', 'film_rating']].boxplot(by='film_rating', figsize=(14,6))

green-divider

Add and calculate a new rental_days column

This numeric column should have the count of days between rental_date and return_date.

# your code goes here
df['rental_days'] = df[['rental_date', 'return_date']].apply(lambda x: (x[1] - x[0]).days, axis=1) df['rental_days'].head()

green-divider

Analyze the distribution of rental_days

  • Calculate the mean of rental_days.

  • Show a density (KDE) of rental_days.

# your code goes here
df['rental_days'].mean()
# your code goes here
ax = df['rental_days'].plot(kind='density', figsize=(14,6)) ax.axvline(df['rental_days'].mean(), color='red')

green-divider

### Add and calculate a new film_daily_rental_rate column

This value should be the division of film_rental_rate by film_rental_duration.

# your code goes here
df['film_daily_rental_rate'] = df['film_rental_rate'] / df['film_rental_duration'] df['film_daily_rental_rate'].head()

green-divider

Analyze the distribution of film_daily_rental_rate

  • Calculate the mean of film_daily_rental_rate.

  • Show a density (KDE) of film_daily_rental_rate.

# your code goes here
df['film_daily_rental_rate'].mean()
# your code goes here
ax = df['film_daily_rental_rate'].plot(kind='density', figsize=(14,6)) ax.axvline(df['film_daily_rental_rate'].mean(), color='red')

green-divider

List 10 films with the lowest daily rental rate

# your code goes here
df.loc[df['film_daily_rental_rate'] == df['film_daily_rental_rate'].min()].head(10)

green-divider

List 10 films with the highest daily rental rate

# your code goes here
df.loc[df['film_daily_rental_rate'] == df['film_daily_rental_rate'].max()].head(10)

green-divider

How many rentals were made in Lethbridge city?

# your code goes here
df.loc[df['rental_store_city'] == 'Lethbridge'].shape[0]

green-divider

How many rentals of each film rating were made in Lethbridge city?

Show a bar plot with each film rating count.

# your code goes here
df.loc[df['rental_store_city'] == 'Lethbridge', 'film_rating'].value_counts()
df.loc[df['rental_store_city'] == 'Lethbridge', 'film_rating'].value_counts().plot(kind='bar', figsize=(14,6))

green-divider

How many rentals were made in Woodridge city with rental duration higher than 5 days?

# your code goes here
df.loc[(df['rental_store_city'] == 'Woodridge') & (df['film_rental_duration'] > 5)].shape[0]

green-divider

How many rentals were made at the store with id 2 or with replacement cost lower than 10.99 USD?

# your code goes here
df.loc[(df['store_id'] == 2) | (df['film_replacement_cost'] < 10.99)].shape[0]

purple-divider