Module 2: Data Engineering: Pandas
In this notebook, basics of using Pandas in Python Notebooks (in Jupyter) will be introduced. For you to be able to use the notebooks, you need to be able to have the environment to run the notebooks. Several options are as follows:
Jupyter : download and install Python and Jupyter in your workstation. This does not require internet to run
Google Colab : upload the notebook to Google Colab, along with the data needed. This requires internet connection to be continuously used. GDrive needs to be set-up as a mounted memory to where you can upload or download data and results. This means it requires a Google Account to be logged-in.
CoCalc : upload the notebook to Google Colab, along with the data needed. This requires internet connection to be continuously used. The files tab allows you to easily upload or download data during the running session. No pre-requirement for any account, just a persistent internet connection to remain in the currest workings session.
Instructions on how to install are linked above for reference.
Check if file uploaded exists
Enter Pandas
Slicing data frames
Indexing Columns
Exercise: isolate only movie title, director name, imdb rating, year
Indexing Rows
Find Movies by James Cameron
Exercise: Find movies by Zack Snyder
Sort films by gross earnings
Exercise: Get the top 5 films of Michael Bay
Multiple Conditions: Find films from the Canada that have Hugh Jackman as the actor_1_name
Exercise: Find the actor who is the actor_1_name for the movie that grossed exactly 67344392. Then, find films whose actor_3_name is Piolo Pascual and actor_1_name is the person from Armageddon.
Preprocessing
Convert actor_1_facebook_likes to integers
Apply function for each cell
Actually, one can think of a lambda as a nameless function.
Clean titles
Clean all
Replace values with 0. Of course, imputation could be done here, but other libraries are need for it.
We are adding a new reference to the original data. Only the new cells are allocated memory. The unchanged cells are referenced to the original.
Data Summaries
Data Correlations
Outliers : Clipping to the 99th percentile
This is just one of the many rules-of-thumb used in practice. It doesn't always work, especially if one has too many outliers.
Output to CSV file
Aggregations
In this example, the group by statement does two things:
groups together the dataframe by title_year
the size() function has the title_year as the index
Data Visualization
Matplotlib: Line Plot : Average Facebook Likes per Year
Plot takes in as the first parameter the x axis and the second, the y axis values.
plt.figure(figsize=(15,8)) is just the size of the plot.
Matplotlib: Scatterplot : Gross VS Budget
Matplotlib: Histogram of IMDB scores
Pandas : Barplot of the gross earnings of the 10 movies with highest budget superimposed with their budget as a line graph
Plot the histograms of imdb_scores according to different content rating Types
We're finding out which content type tends to have the higher imdb_score.
Seaborn : Conditional Formatting
For more, check out: https://pandas.pydata.org/pandas-docs/stable/style.html
Advanced Graphs: Plotting more than 2 variables
Let's take this slowly.
Seaborn : Line Plot with Regression : Voted users and reviews
Additional Excercises and Challenges
Challenge! Fix the above plot by clipping by the Tukey's Test (k=1):
Use the np.clip function to bound the results as the following interval:
Seaborn : Barplot of the gross earnings of the last 10 years
Challenge : Get the top 10 directors with most movies directed and use a boxplot for their gross earnings
Additional Activities
Plot the following variables in one graph:
num_critic_for_reviews
IMDB score
gross
Steven Spielberg against others
Compute Sales (Gross - Budget), add it as another column
Which directors garnered the most total sales?
Which actors garnered the most total sales?
Plot sales and average likes as a scatterplot. Fit it with a line.
Which of these genres are the most profitable? Plot their sales using different histograms, superimposed in the same axis.
Romance
Comedy
Action
Fantasy
Standardization
Standardize sales using the following formula then save it to a new column.
The first values should be: [2.612646, 0.026695, -0.246587, 0.975996, -0.020609]
For each of movie, compute average likes of the three actors and store it as a new variable. Standardize. Read up on the mean function.
Store it as a new column, average_actor_likes.
Create a linear hypothesis function
Create a function that takes (1) a scalar, (2) theta and (3) a bias variable to output a value as close as possible to gross.
Create an RMSE function
Create a function that compares two vectors and outputs the root mean squared error / deviation.
Create the best possible thetas by brute-forcing against the RMSE function.
Create predictions for your entire dataset. Compare your predictions against the score. Achieve the smallest RMSE you can.
Plot your best theta, bias variable against the imdb score for each movie
For a cleaner plot:
(1) compile your average_actor_likes, imdb_scores and predicted to a new dataframe
(2) limit the bounds of your predicted ratings
Convert your hypothesis function to use more variables:
Don't forget to standardize your new variables.
Compile your theta values to a new pandas dataframe which consists of the following columns:
Plot how each theta parameter influence the RMSE. Which one seems to be most influential?
Advanced Activities
Using Linear Regression (Ridge)
Find the best coefficients using Ridge regression