Kernel: Python 3 (system-wide)

Module 2: Data Engineering: Pandas

In this notebook, the basics of using pandas in Python notebooks (in Jupyter) are introduced. To follow along, you need an environment that can run the notebooks. Several options are as follows:

  • Jupyter : download and install Python and Jupyter on your workstation. This does not require an internet connection to run.

  • Google Colab : upload the notebook to Google Colab, along with the data needed. This requires a continuous internet connection. GDrive needs to be set up as mounted storage where you can upload or download data and results, which means a Google Account must be logged in.

  • CoCalc : upload the notebook to CoCalc, along with the data needed. This requires a continuous internet connection. The Files tab allows you to easily upload or download data during the running session. No account is required, just a persistent internet connection to remain in the current working session.

Instructions on how to install are linked above for reference.

Check if the uploaded file exists

import os

## The file path is where the data you have uploaded is located within your work area
file_path = 'movie_metadata.csv'

## Note: this should be True before you can proceed to the next step
os.path.exists(file_path)
True
file_path
'movie_metadata.csv'

Enter Pandas

'''
pandas library: a fast, powerful, flexible and easy-to-use open-source data analysis and manipulation tool
library: a collection of prewritten code that users can use to optimize tasks
'''
import pandas as pd
data = pd.read_csv(file_path)
data
data.shape
(5044, 28)
data.columns
Index(['movie_title', 'color', 'director_name', 'num_critic_for_reviews', 'duration', 'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name', 'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name', 'num_voted_users', 'cast_total_facebook_likes', 'actor_3_name', 'facenumber_in_poster', 'plot_keywords', 'movie_imdb_link', 'num_user_for_reviews', 'language', 'country', 'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes', 'imdb_score', 'aspect_ratio', 'movie_facebook_likes'], dtype='object')

Slicing data frames

data[5:10]

Indexing Columns

data.director_name[0:5]
0        James Cameron
1       Gore Verbinski
2           Sam Mendes
3    Christopher Nolan
4          Doug Walker
Name: director_name, dtype: object
data.director_name[19]
'Barry Sonnenfeld'
data.actor_1_name[49]
'Eddie Marsan'
data.columns
Index(['movie_title', 'color', 'director_name', 'num_critic_for_reviews', 'duration', 'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name', 'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name', 'num_voted_users', 'cast_total_facebook_likes', 'actor_3_name', 'facenumber_in_poster', 'plot_keywords', 'movie_imdb_link', 'num_user_for_reviews', 'language', 'country', 'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes', 'imdb_score', 'aspect_ratio', 'movie_facebook_likes'], dtype='object')
cols = ["movie_title", "director_name", "imdb_score", "title_year"]
data[cols]

Exercise: isolate only movie title, director name, imdb rating, year

cols = ["movie_title", "director_name", "imdb_score", "title_year"]
data[cols][:5]

Indexing Rows

data[10:12]

Find Movies by James Cameron

data[data.director_name == 'James Cameron']
data.director_name == 'James Cameron'
0        True
1       False
2       False
3       False
4       False
        ...
5039    False
5040    False
5041    False
5042    False
5043    False
Name: director_name, Length: 5044, dtype: bool

Exercise: Find movies by Zack Snyder

data[data.director_name == 'Zack Snyder']

Sort films by gross earnings

sorted_data = data.sort_values(by="gross", ascending=False)
sorted_data[:5]

## Sort films by rating
sorted_data = data.sort_values(by="imdb_score", ascending=False)
sorted_data[:5]

Exercise: Get the top 5 films of Michael Bay

michael_bay = data[data.director_name == 'Michael Bay']
michael_bay.sort_values(by="gross", ascending=False)[0:5]

Multiple Conditions: Find films from Canada that have Hugh Jackman as the actor_1_name

data[(data['country'] == 'Canada') & (data['actor_1_name'] == 'Hugh Jackman')]

Exercise: Find the actor who is the actor_1_name for the movie that grossed exactly 67344392. Then, find films whose actor_3_name is Piolo Pascual and actor_1_name is the person from Armageddon.

## Find the actor who is the actor_1_name for the movie that grossed exactly 67344392
data[data.gross == 67344392]['actor_1_name']
347    Bruce Willis
Name: actor_1_name, dtype: object
data[data.movie_title == 'Armageddon\xa0'].actor_1_name
151    Bruce Willis
Name: actor_1_name, dtype: object
data[151:152]
## returns no rows: the stored title ends with a non-breaking space ('Armageddon\xa0')
data[data.movie_title == 'Armageddon']
## Find films whose actor_3_name is Piolo Pascual
data[data.actor_3_name == 'Piolo Pascual' ].head()
## Find a movie with the title Armageddon
## Find a movie with the 'term' Armageddon
## Check or search for the function: str.contains()
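As the hint suggests, str.contains matches a substring anywhere in the title, which sidesteps exact-match problems such as the trailing non-breaking space above. A minimal sketch on a hypothetical mini-frame (the column names follow the dataset; the rows are made up):

```python
import pandas as pd

# Hypothetical mini-frame standing in for `data` (rows are made up)
df = pd.DataFrame({
    "movie_title": ["Armageddon", "Armageddon 2: The Sequel", "Avatar"],
    "actor_1_name": ["Bruce Willis", "Someone Else", "CCH Pounder"],
})

# str.contains matches the term anywhere in the string,
# so it also catches variants like "Armageddon 2"
matches = df[df.movie_title.str.contains("Armageddon")]
```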

Preprocessing

data.dtypes

Convert actor_1_facebook_likes to integers

import numpy as np

# this raises an error because the column still contains NaN values
#data.actor_1_facebook_likes.astype(np.int64)

data['actor_1_facebook_likes'] = data['actor_1_facebook_likes'].fillna(0)
data['actor_1_facebook_likes'] = data['actor_1_facebook_likes'].astype(np.int64)
data.dtypes

Apply function for each cell

One can think of a lambda as a nameless function.

data.actor_1_facebook_likes.apply(lambda x: np.sqrt(x))[:5]
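To make the "nameless function" point concrete, here is a sketch showing that a lambda passed to apply is interchangeable with an ordinary named function (the sample Series is made up):

```python
import numpy as np
import pandas as pd

# Made-up stand-in for a likes column
likes = pd.Series([0.0, 1.0, 4.0, 9.0])

# The lambda form used above...
via_lambda = likes.apply(lambda x: np.sqrt(x))

# ...is equivalent to defining and passing a named function
def sqrt_likes(x):
    return np.sqrt(x)

via_named = likes.apply(sqrt_likes)
```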

Clean titles

data.movie_title.tolist()[:5]
data.movie_title[:10].apply(
    lambda x: x.encode().decode('unicode_escape').encode('ascii', 'ignore')).tolist()
# a weakness though is that it removes the enye (ñ)
print(data[data.movie_title.str.contains("ñ")]["movie_title"])
print("The following removed the enye:")
print(data[data.movie_title.str.contains("ñ")]["movie_title"].apply(
    lambda x: x.encode().decode('unicode_escape').encode('ascii', 'ignore')).tolist())
## decode() at the end converts the cleaned bytes back into a regular string
data["movie_title"] = data["movie_title"].apply(
    lambda x: x.encode().decode('unicode_escape').encode('ascii', 'ignore').decode())

Clean all

Replace missing values with 0. Of course, imputation could be done here, but other libraries are needed for it.

fillna returns a new DataFrame rather than modifying the original in place. Depending on the pandas version, unchanged data may share memory with the original, so only the changed cells necessarily get new storage.

cleaned_data = data.fillna(0)

Data Summaries

cleaned_data.describe()

Data Correlations

## Note: on pandas >= 2.0, text columns must be excluded, e.g. corr(numeric_only=True)
cleaned_data.corr()

Outliers : Clipping to the 99th percentile

This is just one of the many rules-of-thumb used in practice. It doesn't always work, especially if one has too many outliers.

cleaned_data.duration.quantile(0.99)
## clip at the 99th-percentile duration computed above (~189 minutes)
cleaned_data['duration'] = np.clip(cleaned_data['duration'], 0, cleaned_data['duration'].quantile(0.99))
cleaned_data.describe()

Output to CSV file

## pass index=False if you don't want the row index written out as an extra column
cleaned_data.to_csv('movie_metadata_cleaned.csv')

Aggregations

In this example, the groupby statement does two things:

  1. groups the rows of the dataframe by title_year

  2. size() then counts the rows in each group, returning a Series with title_year as its index

cleaned_data["title_year"] = cleaned_data["title_year"].astype(np.int64)
movies_per_year = cleaned_data.groupby("title_year").size()
movies_per_year[-5:]

like_per_year = cleaned_data.groupby("title_year")["movie_facebook_likes"].mean()
like_per_year[-5:]

Data Visualization

Matplotlib: Line Plot : Average Facebook Likes per Year

plot takes the x-axis values as its first parameter and the y-axis values as its second.

plt.figure(figsize=(15,8)) just sets the size of the plot.

# we're preempting seaborn for a better look-and-feel of the plots
import seaborn as sns
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(15,8))
years = like_per_year.index.tolist()[1:]
likes = like_per_year[1:]
plt.plot(years, likes, 'r-')
## save before show(): show() clears the figure, so saving afterwards writes a blank image
plt.savefig('lineplot.png')
plt.show()

Matplotlib: Scatterplot : Gross VS Budget

fig = plt.figure(figsize=(15,8))
plt.scatter(cleaned_data["gross"], cleaned_data["budget"])
plt.show()

Matplotlib: Histogram of IMDB scores

fig = plt.figure(figsize=(15,8))
plt.hist(cleaned_data["imdb_score"], bins=20)
plt.show()

Pandas : Barplot of the gross earnings of the 10 movies with highest budget superimposed with their budget as a line graph

top_budget = cleaned_data.sort_values(by="budget", ascending=False)["movie_title"][:10].tolist()
top_budget_data = cleaned_data[cleaned_data["movie_title"].isin(top_budget)]
ax1 = top_budget_data.plot(x="movie_title", y="gross", kind="bar", figsize=(15,8))
ax2 = ax1.twinx()
top_budget_data.plot(x="movie_title", y="budget", kind="line", color='red', figsize=(15,8), ax=ax2)
ax1.legend(loc='upper left')
ax2.legend(loc='upper right')
plt.show()

Plot the histograms of imdb_scores according to different content rating types

We're finding out which content rating tends to have the higher imdb_score.

import seaborn as sns

## Note: distplot is deprecated in newer seaborn; use histplot or kdeplot instead
ax = sns.distplot(cleaned_data[cleaned_data['content_rating'] == 'PG-13']["imdb_score"], color='red')
sns.distplot(cleaned_data[cleaned_data['content_rating'] == 'R']["imdb_score"], color='teal', ax=ax)
sns.distplot(cleaned_data[cleaned_data['content_rating'] == 'GP']["imdb_score"], color='blue', ax=ax)
from pandas.plotting import scatter_matrix

cols = ["num_critic_for_reviews", "imdb_score", "movie_facebook_likes"]
scatter_matrix(cleaned_data[cols], alpha=0.2, figsize=(20, 20), diagonal='kde', marker='o')
plt.show()
from pandas.plotting import scatter_matrix

cols = ["num_critic_for_reviews", "duration", "facenumber_in_poster", "num_user_for_reviews",
        "budget", "imdb_score", "movie_facebook_likes", "gross"]
scatter_matrix(cleaned_data[cols], alpha=0.2, figsize=(20, 20), diagonal='kde', marker='o')
plt.show()

Seaborn : Conditional Formatting

For more, check out: https://pandas.pydata.org/pandas-docs/stable/style.html

import seaborn as sns

cm = sns.light_palette("green", as_cmap=True)
cols = ["movie_title", "imdb_score", "gross"]
color_me = cleaned_data[cols][:10]
s = color_me.style.background_gradient(cmap=cm)
s

Advanced Graphs: Plotting more than 2 variables

Let's take this slowly.

plt.figure(figsize=(15,8))
## min-max scale gross so it can be used for the marker sizes
gross = cleaned_data["gross"]
sizes = 1000 * (gross - gross.min()) / (gross.max() - gross.min())
colors = np.where(cleaned_data.genres.str.contains("Fantasy"), 'red', 'green')
plt.scatter(x=cleaned_data["movie_facebook_likes"], y=cleaned_data["imdb_score"], c=colors, s=sizes)
plt.xlabel("Facebook Likes")
plt.ylabel("IMDB Score")
plt.xlim((0, 100000))
plt.show()

Seaborn : Line Plot with Regression : Voted users and reviews

## Note: on newer seaborn versions, use height=8 instead of size=8
sns.lmplot(x="num_voted_users", y="num_user_for_reviews", data=cleaned_data, size=8)
sns.jointplot(x="num_voted_users", y="num_user_for_reviews", data=cleaned_data, size=8, kind="hex", color="#4CB391")

Additional Exercises and Challenges

Challenge! Fix the above plot by clipping with Tukey's fences (k=1):

Use the np.clip function to bound the values to the following interval: ${\big [}Q_{1}-k(Q_{3}-Q_{1}),\;Q_{3}+k(Q_{3}-Q_{1}){\big ]}$
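A sketch of the Tukey fences with k=1 on a made-up Series, following the same np.clip approach as the percentile clipping earlier:

```python
import numpy as np
import pandas as pd

# Made-up stand-in for a column such as cleaned_data["gross"]
s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

# Tukey fences: [Q1 - k*(Q3 - Q1), Q3 + k*(Q3 - Q1)]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
k = 1
lower, upper = q1 - k * (q3 - q1), q3 + k * (q3 - q1)

# the outlier 100.0 gets pulled down to the upper fence
clipped = np.clip(s, lower, upper)
```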

Seaborn : Barplot of the gross earnings of the last 10 years

cleaned_data["title_year"] = cleaned_data["title_year"].astype(np.int64)
latest_10_years = np.sort(cleaned_data["title_year"].unique())[-10:]
latest_movies_data = cleaned_data[cleaned_data["title_year"].isin(latest_10_years)]
plt.figure(figsize=(15,4))
sns.barplot(x="title_year", y="gross", data=latest_movies_data)
plt.xticks(rotation=45)
plt.show()

Challenge : Get the top 10 directors with most movies directed and use a boxplot for their gross earnings
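One possible shape for this challenge, sketched on a hypothetical mini-frame (only the data-preparation steps; the resulting subset would then go to sns.boxplot):

```python
import pandas as pd

# Hypothetical mini-frame standing in for `cleaned_data` (rows are made up)
df = pd.DataFrame({
    "director_name": ["A", "A", "A", "B", "B", "C"],
    "gross": [10.0, 20.0, 30.0, 5.0, 15.0, 7.0],
})

# Step 1: directors ranked by number of movies directed (here, the top 2)
top_directors = df["director_name"].value_counts()[:2].index.tolist()

# Step 2: keep only their films; this subset is what would feed the boxplot,
# e.g. sns.boxplot(x="director_name", y="gross", data=top_films)
top_films = df[df["director_name"].isin(top_directors)]
```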

Additional Activities

Plot the following variables in one graph:

  • num_critic_for_reviews

  • IMDB score

  • gross

  • Steven Spielberg against others

Compute Sales (Gross - Budget), add it as another column

Which directors garnered the most total sales?

Which actors garnered the most total sales?
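The sales column and the per-director totals can be sketched on a made-up mini-frame with the dataset's column names (the actor version is analogous, grouping by actor_1_name):

```python
import pandas as pd

# Hypothetical mini-frame standing in for `cleaned_data` (rows are made up)
df = pd.DataFrame({
    "director_name": ["A", "A", "B"],
    "gross": [100.0, 50.0, 80.0],
    "budget": [40.0, 30.0, 90.0],
})

# Sales = Gross - Budget, stored as a new column
df["sales"] = df["gross"] - df["budget"]

# Total sales per director, highest first
sales_by_director = df.groupby("director_name")["sales"].sum().sort_values(ascending=False)
```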

Plot sales and average likes as a scatterplot. Fit it with a line.

Which of these genres are the most profitable? Plot their sales using different histograms, superimposed in the same axis.

  • Romance

  • Comedy

  • Action

  • Fantasy

Standardization

Standardize sales using the following formula then save it to a new column.

$Z = {X - \operatorname{E}[X] \over \sigma(X)}$

The first values should be: [2.612646, 0.026695, -0.246587, 0.975996, -0.020609]
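The standardization can be sketched on made-up sales values; note that pandas' .std() uses the sample (ddof=1) standard deviation, which is one common choice for $\sigma(X)$:

```python
import pandas as pd

# Made-up stand-in for the sales column (gross - budget)
sales = pd.Series([10.0, 20.0, 30.0, 40.0])

# Z = (X - E[X]) / sigma(X)
z = (sales - sales.mean()) / sales.std()
```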

For each movie, compute the average likes of the three actors and store it as a new variable, then standardize it. Read up on the mean function.

Store it as a new column, average_actor_likes.
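One way to sketch the averaging step, assuming the three actor-likes columns from the dataset (mean(axis=1) averages across the columns, i.e. per movie):

```python
import pandas as pd

# Hypothetical mini-frame with the three actor-likes columns (rows are made up)
df = pd.DataFrame({
    "actor_1_facebook_likes": [1000.0, 300.0],
    "actor_2_facebook_likes": [500.0, 200.0],
    "actor_3_facebook_likes": [300.0, 100.0],
})

cols = ["actor_1_facebook_likes", "actor_2_facebook_likes", "actor_3_facebook_likes"]
# axis=1 takes the mean across the three columns for each row (movie)
df["average_actor_likes"] = df[cols].mean(axis=1)
```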

Create a linear hypothesis function

Create a function that takes (1) a scalar, (2) theta and (3) a bias variable to output a value as close as possible to gross.

$score = b + \sum_j{(\theta_j \cdot x_j)}$

$score = \theta_1 \cdot average\_actor\_likes + bias$
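A minimal sketch of such a hypothesis function for the single-feature case (the names hypothesis, theta and bias are illustrative):

```python
def hypothesis(x, theta, bias):
    # score = theta * x + bias, for a single standardized feature x
    return theta * x + bias

# e.g. a movie with standardized average_actor_likes of 2.0
score = hypothesis(2.0, 0.5, 1.0)
```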

Create an RMSE function

Create a function that compares two vectors and outputs the root mean squared error / deviation.

$\operatorname{RMSD}(\hat{\theta}) = \sqrt{\operatorname{MSE}(\hat{\theta})} = \sqrt{\operatorname{E}\big((\hat{\theta}-\theta)^2\big)}$
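A sketch of such an RMSE function (the name rmse is illustrative):

```python
import numpy as np

def rmse(predicted, actual):
    # root of the mean of the squared element-wise differences
    predicted, actual = np.asarray(predicted), np.asarray(actual)
    return np.sqrt(np.mean((predicted - actual) ** 2))

err = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```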

Create the best possible thetas by brute-forcing against the RMSE function.

Create predictions for your entire dataset. Compare your predictions against the score. Achieve the smallest RMSE you can.
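The brute-force search can be sketched as a grid search over candidate parameters, here on made-up data where the best fit is known to be theta=2, bias=1:

```python
import numpy as np

# Made-up data generated by y = 2*x + 1 (stand-in for feature vs score)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

def rmse(pred, actual):
    return np.sqrt(np.mean((pred - actual) ** 2))

# Brute force: try every (theta, bias) pair on a grid, keep the lowest RMSE
best = (None, None, np.inf)
for theta in np.arange(0.0, 4.0, 0.5):
    for bias in np.arange(0.0, 2.5, 0.5):
        err = rmse(theta * x + bias, y)
        if err < best[2]:
            best = (theta, bias, err)
```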

Plot your best theta, bias variable against the imdb score for each movie

For a cleaner plot:

(1) compile your average_actor_likes, imdb_scores and predicted to a new dataframe

(2) limit the bounds of your predicted ratings

Convert your hypothesis function to use more variables:

Don't forget to standardize your new variables.

$score = \theta_1 \cdot average\_actor\_likes + \theta_2 \cdot movie\_facebook\_likes + \theta_3 \cdot sales + bias$

Compile your theta values to a new pandas dataframe which consists of the following columns:

| $\theta_1$ | $\theta_2$ | $\theta_3$ | RMSE |
| --- | --- | --- | --- |
| 0.1 | 0.1 | 0.1 | 10000 |
| 0.2 | 0.2 | 0.2 | 2000 |

Plot how each theta parameter influences the RMSE. Which one seems to be the most influential?

Advanced Activities

Using Linear Regression (Ridge)

Find the best coefficients using Ridge regression
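The usual tool here is sklearn.linear_model.Ridge. As a sketch of what it computes, here is the closed-form ridge solution $\theta = (X^T X + \alpha I)^{-1} X^T y$ in plain numpy on made-up data (no intercept term, unlike sklearn's default):

```python
import numpy as np

# Made-up regression data: y = X @ [2, -1] exactly
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_theta = np.array([2.0, -1.0])
y = X @ true_theta

def ridge(X, y, alpha):
    # closed-form ridge: solve (X^T X + alpha*I) theta = X^T y
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# a small alpha shrinks the coefficients only slightly toward zero
theta = ridge(X, y, alpha=0.1)
```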