Kernel: Python 3 (system-wide)

Module 2: Data Engineering: Pandas

In this notebook, the basics of using pandas in Python notebooks (in Jupyter) are introduced. To follow along, you need an environment that can run the notebooks. Several options are as follows:

  • Jupyter : download and install Python and Jupyter on your workstation. This does not require an internet connection to run.

  • Google Colab : upload the notebook to Google Colab, along with the data needed. This requires a continuous internet connection. GDrive needs to be set up as mounted storage where you can upload or download data and results, which means a Google Account must be logged in.

  • CoCalc : upload the notebook to CoCalc, along with the data needed. This requires a continuous internet connection. The Files tab allows you to easily upload or download data during the running session. No account is required, just a persistent internet connection to remain in the current working session.

Instructions on how to install are linked above for reference.

Check if the uploaded file exists

import os

## The file path is where the data you have uploaded is located within your work area
file_path = 'movie_metadata.csv'

## Note: this should be True before you can proceed to the next step
os.path.exists(file_path)
True
file_path
'movie_metadata.csv'

Enter Pandas

'''
pandas library: a fast, powerful, flexible and easy-to-use open-source data analysis and manipulation tool
library: a collection of prewritten code that users can use to optimize tasks
'''
import pandas as pd
data = pd.read_csv(file_path)
data
data.shape
(5044, 28)
data.columns
Index(['movie_title', 'color', 'director_name', 'num_critic_for_reviews', 'duration', 'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name', 'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name', 'num_voted_users', 'cast_total_facebook_likes', 'actor_3_name', 'facenumber_in_poster', 'plot_keywords', 'movie_imdb_link', 'num_user_for_reviews', 'language', 'country', 'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes', 'imdb_score', 'aspect_ratio', 'movie_facebook_likes'], dtype='object')

Slicing data frames

data[5:10]

Indexing Columns

data.director_name[0:5]
0        James Cameron
1       Gore Verbinski
2           Sam Mendes
3    Christopher Nolan
4          Doug Walker
Name: director_name, dtype: object
data.director_name[19]
'Barry Sonnenfeld'
data.actor_1_name[49]
'Eddie Marsan'
data.columns
Index(['movie_title', 'color', 'director_name', 'num_critic_for_reviews', 'duration', 'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name', 'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name', 'num_voted_users', 'cast_total_facebook_likes', 'actor_3_name', 'facenumber_in_poster', 'plot_keywords', 'movie_imdb_link', 'num_user_for_reviews', 'language', 'country', 'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes', 'imdb_score', 'aspect_ratio', 'movie_facebook_likes'], dtype='object')
cols = ["movie_title", "director_name", "imdb_score", "title_year"]
data[cols]

Exercise: isolate only movie title, director name, imdb rating, year

cols = ["movie_title", "director_name", "imdb_score", "title_year"]
data[cols][:5]

Indexing Rows

data[10:12]

Find Movies by James Cameron

data[data.director_name == 'James Cameron']
data.director_name == 'James Cameron'
0        True
1       False
2       False
3       False
4       False
        ...
5039    False
5040    False
5041    False
5042    False
5043    False
Name: director_name, Length: 5044, dtype: bool

Exercise: Find movies by Zack Snyder

data[data.director_name == 'Zack Snyder']

Sort films by gross earnings

sorted_data = data.sort_values(by="gross", ascending=False)
sorted_data[:5]

## Sort films by rating
sorted_data = data.sort_values(by="imdb_score", ascending=False)
sorted_data[:5]

Exercise: Get the top 5 films of Michael Bay

michael_bay = data[data.director_name == 'Michael Bay']
michael_bay.sort_values(by="gross", ascending=False)[0:5]

Multiple Conditions: Find films from Canada that have Hugh Jackman as the actor_1_name

data[(data['country'] == 'Canada') & (data['actor_1_name'] == 'Hugh Jackman')]

Exercise: Find the actor who is the actor_1_name for the movie that grossed exactly 67344392. Then, find films whose actor_3_name is Piolo Pascual and actor_1_name is the person from Armageddon.

## Find the actor who is the actor_1_name for the movie that grossed exactly 67344392
data[data.gross == 67344392]['actor_1_name']
347    Bruce Willis
Name: actor_1_name, dtype: object
data[data.movie_title == 'Armageddon\xa0'].actor_1_name
151    Bruce Willis
Name: actor_1_name, dtype: object
data[151:152]
## returns no rows: the stored title ends with a non-breaking space ('Armageddon\xa0')
data[data.movie_title == 'Armageddon']
## Find films whose actor_3_name is Piolo Pascual
data[data.actor_3_name == 'Piolo Pascual' ].head()
## Find a movie with the title Armageddon
## Find a movie with the 'term' Armageddon
## Check or search for the function: str.contains()
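As the hint suggests, str.contains matches a substring anywhere in the title, which sidesteps exact-match problems such as the trailing non-breaking space above. A minimal sketch on a hypothetical mini-frame (the column names follow the dataset; the rows are made up):

```python
import pandas as pd

# Hypothetical mini-frame standing in for `data` (rows are made up)
df = pd.DataFrame({
    "movie_title": ["Armageddon", "Armageddon 2: The Sequel", "Avatar"],
    "actor_1_name": ["Bruce Willis", "Someone Else", "CCH Pounder"],
})

# str.contains matches the term anywhere in the string,
# so it also catches variants like "Armageddon 2"
matches = df[df.movie_title.str.contains("Armageddon")]
```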

Preprocessing

data.dtypes

Convert actor_1_facebook_likes to integers

import numpy as np

# this raises an error because the column still contains NaN values
#data.actor_1_facebook_likes.astype(np.int64)

data['actor_1_facebook_likes'] = data['actor_1_facebook_likes'].fillna(0)
data['actor_1_facebook_likes'] = data['actor_1_facebook_likes'].astype(np.int64)
data.dtypes

Apply function for each cell

One can think of a lambda as a nameless function.

data.actor_1_facebook_likes.apply(lambda x: np.sqrt(x))[:5]
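To make the "nameless function" point concrete, here is a sketch showing that a lambda passed to apply is interchangeable with an ordinary named function (the sample Series is made up):

```python
import numpy as np
import pandas as pd

# Made-up stand-in for a likes column
likes = pd.Series([0.0, 1.0, 4.0, 9.0])

# The lambda form used above...
via_lambda = likes.apply(lambda x: np.sqrt(x))

# ...is equivalent to defining and passing a named function
def sqrt_likes(x):
    return np.sqrt(x)

via_named = likes.apply(sqrt_likes)
```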

Clean titles

data.movie_title.tolist()[:5]
data.movie_title[:10].apply(
    lambda x: x.encode().decode('unicode_escape').encode('ascii', 'ignore')).tolist()
# a weakness though is that it removes the enye (ñ)
print(data[data.movie_title.str.contains("ñ")]["movie_title"])
print("The following removed the enye:")
print(data[data.movie_title.str.contains("ñ")]["movie_title"].apply(
    lambda x: x.encode().decode('unicode_escape').encode('ascii', 'ignore')).tolist())
## decode() at the end converts the cleaned bytes back into a regular string
data["movie_title"] = data["movie_title"].apply(
    lambda x: x.encode().decode('unicode_escape').encode('ascii', 'ignore').decode())

Clean all

Replace missing values with 0. Of course, imputation could be done here, but other libraries are needed for it.

fillna returns a new DataFrame rather than modifying the original in place. Depending on the pandas version, unchanged data may share memory with the original, so only the changed cells necessarily get new storage.

cleaned_data = data.fillna(0)

Data Summaries

cleaned_data.describe()

Data Correlations

## Note: on pandas >= 2.0, text columns must be excluded, e.g. corr(numeric_only=True)
cleaned_data.corr()

Outliers : Clipping to the 99th percentile

This is just one of the many rules-of-thumb used in practice. It doesn't always work, especially if one has too many outliers.

cleaned_data.duration.quantile(0.99)
## clip at the 99th-percentile duration computed above (~189 minutes)
cleaned_data['duration'] = np.clip(cleaned_data['duration'], 0, cleaned_data['duration'].quantile(0.99))
cleaned_data.describe()

Output to CSV file

## pass index=False if you don't want the row index written out as an extra column
cleaned_data.to_csv('movie_metadata_cleaned.csv')

Aggregations

In this example, the groupby statement does two things:

  1. groups the rows of the dataframe by title_year

  2. size() then counts the rows in each group, returning a Series with title_year as its index

cleaned_data["title_year"] = cleaned_data["title_year"].astype(np.int64)
movies_per_year = cleaned_data.groupby("title_year").size()
movies_per_year[-5:]

like_per_year = cleaned_data.groupby("title_year")["movie_facebook_likes"].mean()
like_per_year[-5:]

Data Visualization

Matplotlib: Line Plot : Average Facebook Likes per Year

plot takes the x-axis values as its first parameter and the y-axis values as its second.

plt.figure(figsize=(15,8)) just sets the size of the plot.

# we're preempting seaborn for a better look-and-feel of the plots
import seaborn as sns
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(15,8))
years = like_per_year.index.tolist()[1:]
likes = like_per_year[1:]
plt.plot(years, likes, 'r-')
## save before show(): show() clears the figure, so saving afterwards writes a blank image
plt.savefig('lineplot.png')
plt.show()

Matplotlib: Scatterplot : Gross VS Budget

fig = plt.figure(figsize=(15,8))
plt.scatter(cleaned_data["gross"], cleaned_data["budget"])
plt.show()

Matplotlib: Histogram of IMDB scores

fig = plt.figure(figsize=(15,8))
plt.hist(cleaned_data["imdb_score"], bins=20)
plt.show()

Pandas : Barplot of the gross earnings of the 10 movies with highest budget superimposed with their budget as a line graph

top_budget = cleaned_data.sort_values(by="budget", ascending=False)["movie_title"][:10].tolist()
top_budget_data = cleaned_data[cleaned_data["movie_title"].isin(top_budget)]
ax1 = top_budget_data.plot(x="movie_title", y="gross", kind="bar", figsize=(15,8))
ax2 = ax1.twinx()
top_budget_data.plot(x="movie_title", y="budget", kind="line", color='red', figsize=(15,8), ax=ax2)
ax1.legend(loc='upper left')
ax2.legend(loc='upper right')
plt.show()

Plot the histograms of imdb_scores according to different content rating types

We're finding out which content rating tends to have the higher imdb_score.

import seaborn as sns

## Note: distplot is deprecated in newer seaborn; use histplot or kdeplot instead
ax = sns.distplot(cleaned_data[cleaned_data['content_rating'] == 'PG-13']["imdb_score"], color='red')
sns.distplot(cleaned_data[cleaned_data['content_rating'] == 'R']["imdb_score"], color='teal', ax=ax)
sns.distplot(cleaned_data[cleaned_data['content_rating'] == 'GP']["imdb_score"], color='blue', ax=ax)
from pandas.plotting import scatter_matrix

cols = ["num_critic_for_reviews", "imdb_score", "movie_facebook_likes"]
scatter_matrix(cleaned_data[cols], alpha=0.2, figsize=(20, 20), diagonal='kde', marker='o')
plt.show()
from pandas.plotting import scatter_matrix

cols = ["num_critic_for_reviews", "duration", "facenumber_in_poster", "num_user_for_reviews",
        "budget", "imdb_score", "movie_facebook_likes", "gross"]
scatter_matrix(cleaned_data[cols], alpha=0.2, figsize=(20, 20), diagonal='kde', marker='o')
plt.show()

Seaborn : Conditional Formatting

For more, check out: https://pandas.pydata.org/pandas-docs/stable/style.html

import seaborn as sns

cm = sns.light_palette("green", as_cmap=True)
cols = ["movie_title", "imdb_score", "gross"]
color_me = cleaned_data[cols][:10]
s = color_me.style.background_gradient(cmap=cm)
s

Advanced Graphs: Plotting more than 2 variables

Let's take this slowly.

plt.figure(figsize=(15,8))
## min-max scale gross so it can be used for the marker sizes
gross = cleaned_data["gross"]
sizes = 1000 * (gross - gross.min()) / (gross.max() - gross.min())
colors = np.where(cleaned_data.genres.str.contains("Fantasy"), 'red', 'green')
plt.scatter(x=cleaned_data["movie_facebook_likes"], y=cleaned_data["imdb_score"], c=colors, s=sizes)
plt.xlabel("Facebook Likes")
plt.ylabel("IMDB Score")
plt.xlim((0, 100000))
plt.show()

Seaborn : Line Plot with Regression : Voted users and reviews

## Note: on newer seaborn versions, use height=8 instead of size=8
sns.lmplot(x="num_voted_users", y="num_user_for_reviews", data=cleaned_data, size=8)
sns.jointplot(x="num_voted_users", y="num_user_for_reviews", data=cleaned_data, size=8, kind="hex", color="#4CB391")

Additional Exercises and Challenges

Challenge! Fix the above plot by clipping with Tukey's fences (k=1):

Use the np.clip function to bound the values to the following interval: ${\big [}Q_{1}-k(Q_{3}-Q_{1}),\;Q_{3}+k(Q_{3}-Q_{1}){\big ]}$
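A sketch of the Tukey fences with k=1 on a made-up Series, following the same np.clip approach as the percentile clipping earlier:

```python
import numpy as np
import pandas as pd

# Made-up stand-in for a column such as cleaned_data["gross"]
s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

# Tukey fences: [Q1 - k*(Q3 - Q1), Q3 + k*(Q3 - Q1)]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
k = 1
lower, upper = q1 - k * (q3 - q1), q3 + k * (q3 - q1)

# the outlier 100.0 gets pulled down to the upper fence
clipped = np.clip(s, lower, upper)
```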

Seaborn : Barplot of the gross earnings of the last 10 years

cleaned_data["title_year"] = cleaned_data["title_year"].astype(np.int64)
latest_10_years = np.sort(cleaned_data["title_year"].unique())[-10:]
latest_movies_data = cleaned_data[cleaned_data["title_year"].isin(latest_10_years)]
plt.figure(figsize=(15,4))
sns.barplot(x="title_year", y="gross", data=latest_movies_data)
plt.xticks(rotation=45)
plt.show()

Challenge : Get the top 10 directors with most movies directed and use a boxplot for their gross earnings
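One possible shape for this challenge, sketched on a hypothetical mini-frame (only the data-preparation steps; the resulting subset would then go to sns.boxplot):

```python
import pandas as pd

# Hypothetical mini-frame standing in for `cleaned_data` (rows are made up)
df = pd.DataFrame({
    "director_name": ["A", "A", "A", "B", "B", "C"],
    "gross": [10.0, 20.0, 30.0, 5.0, 15.0, 7.0],
})

# Step 1: directors ranked by number of movies directed (here, the top 2)
top_directors = df["director_name"].value_counts()[:2].index.tolist()

# Step 2: keep only their films; this subset is what would feed the boxplot,
# e.g. sns.boxplot(x="director_name", y="gross", data=top_films)
top_films = df[df["director_name"].isin(top_directors)]
```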

Additional Activities

Plot the following variables in one graph:

  • num_critic_for_reviews

  • IMDB score

  • gross

  • Steven Spielberg against others

Compute Sales (Gross - Budget), add it as another column

Which directors garnered the most total sales?

Which actors garnered the most total sales?
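The sales column and the per-director totals can be sketched on a made-up mini-frame with the dataset's column names (the actor version is analogous, grouping by actor_1_name):

```python
import pandas as pd

# Hypothetical mini-frame standing in for `cleaned_data` (rows are made up)
df = pd.DataFrame({
    "director_name": ["A", "A", "B"],
    "gross": [100.0, 50.0, 80.0],
    "budget": [40.0, 30.0, 90.0],
})

# Sales = Gross - Budget, stored as a new column
df["sales"] = df["gross"] - df["budget"]

# Total sales per director, highest first
sales_by_director = df.groupby("director_name")["sales"].sum().sort_values(ascending=False)
```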

Plot sales and average likes as a scatterplot. Fit it with a line.

Which of these genres are the most profitable? Plot their sales using different histograms, superimposed in the same axis.

  • Romance

  • Comedy

  • Action

  • Fantasy

Standardization

Standardize sales using the following formula then save it to a new column.

$Z = {X - \operatorname{E}[X] \over \sigma(X)}$

The first values should be: [2.612646, 0.026695, -0.246587, 0.975996, -0.020609]
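The standardization can be sketched on made-up sales values; note that pandas' .std() uses the sample (ddof=1) standard deviation, which is one common choice for $\sigma(X)$:

```python
import pandas as pd

# Made-up stand-in for the sales column (gross - budget)
sales = pd.Series([10.0, 20.0, 30.0, 40.0])

# Z = (X - E[X]) / sigma(X)
z = (sales - sales.mean()) / sales.std()
```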

For each movie, compute the average likes of the three actors and store it as a new variable, then standardize it. Read up on the mean function.

Store it as a new column, average_actor_likes.
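One way to sketch the averaging step, assuming the three actor-likes columns from the dataset (mean(axis=1) averages across the columns, i.e. per movie):

```python
import pandas as pd

# Hypothetical mini-frame with the three actor-likes columns (rows are made up)
df = pd.DataFrame({
    "actor_1_facebook_likes": [1000.0, 300.0],
    "actor_2_facebook_likes": [500.0, 200.0],
    "actor_3_facebook_likes": [300.0, 100.0],
})

cols = ["actor_1_facebook_likes", "actor_2_facebook_likes", "actor_3_facebook_likes"]
# axis=1 takes the mean across the three columns for each row (movie)
df["average_actor_likes"] = df[cols].mean(axis=1)
```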

Create a linear hypothesis function

Create a function that takes (1) a scalar, (2) theta and (3) a bias variable to output a value as close as possible to gross.

$score = b + \sum_j{(\theta_j \cdot x_j)}$

$score = \theta_1 \cdot average\_actor\_likes + bias$
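A minimal sketch of such a hypothesis function for the single-feature case (the names hypothesis, theta and bias are illustrative):

```python
def hypothesis(x, theta, bias):
    # score = theta * x + bias, for a single standardized feature x
    return theta * x + bias

# e.g. a movie with standardized average_actor_likes of 2.0
score = hypothesis(2.0, 0.5, 1.0)
```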

Create an RMSE function

Create a function that compares two vectors and outputs the root mean squared error / deviation.

$\operatorname{RMSD}(\hat{\theta}) = \sqrt{\operatorname{MSE}(\hat{\theta})} = \sqrt{\operatorname{E}\big((\hat{\theta}-\theta)^2\big)}$
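A sketch of such an RMSE function (the name rmse is illustrative):

```python
import numpy as np

def rmse(predicted, actual):
    # root of the mean of the squared element-wise differences
    predicted, actual = np.asarray(predicted), np.asarray(actual)
    return np.sqrt(np.mean((predicted - actual) ** 2))

err = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```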

Create the best possible thetas by brute-forcing against the RMSE function.

Create predictions for your entire dataset. Compare your predictions against the score. Achieve the smallest RMSE you can.
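The brute-force search can be sketched as a grid search over candidate parameters, here on made-up data where the best fit is known to be theta=2, bias=1:

```python
import numpy as np

# Made-up data generated by y = 2*x + 1 (stand-in for feature vs score)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

def rmse(pred, actual):
    return np.sqrt(np.mean((pred - actual) ** 2))

# Brute force: try every (theta, bias) pair on a grid, keep the lowest RMSE
best = (None, None, np.inf)
for theta in np.arange(0.0, 4.0, 0.5):
    for bias in np.arange(0.0, 2.5, 0.5):
        err = rmse(theta * x + bias, y)
        if err < best[2]:
            best = (theta, bias, err)
```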

Plot your best theta, bias variable against the imdb score for each movie

For a cleaner plot:

(1) compile your average_actor_likes, imdb_scores and predicted to a new dataframe

(2) limit the bounds of your predicted ratings

Convert your hypothesis function to use more variables:

Don't forget to standardize your new variables.

$score = \theta_1 \cdot average\_actor\_likes + \theta_2 \cdot movie\_facebook\_likes + \theta_3 \cdot sales + bias$

Compile your theta values to a new pandas dataframe which consists of the following columns:

| $\theta_1$ | $\theta_2$ | $\theta_3$ | RMSE |
| --- | --- | --- | --- |
| 0.1 | 0.1 | 0.1 | 10000 |
| 0.2 | 0.2 | 0.2 | 2000 |

Plot how each theta parameter influences the RMSE. Which one seems to be the most influential?

Advanced Activities

Using Linear Regression (Ridge)

Find the best coefficients using Ridge regression
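The usual tool here is sklearn.linear_model.Ridge. As a sketch of what it computes, here is the closed-form ridge solution $\theta = (X^T X + \alpha I)^{-1} X^T y$ in plain numpy on made-up data (no intercept term, unlike sklearn's default):

```python
import numpy as np

# Made-up regression data: y = X @ [2, -1] exactly
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_theta = np.array([2.0, -1.0])
y = X @ true_theta

def ridge(X, y, alpha):
    # closed-form ridge: solve (X^T X + alpha*I) theta = X^T y
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# a small alpha shrinks the coefficients only slightly toward zero
theta = ridge(X, y, alpha=0.1)
```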