GitHub Repository: DataScienceUWL/DS775
Path: blob/main/Lessons/Lesson 13 - RecSys 1/extras/DataCleaning.ipynb
⁸⁷¹ views

Kernel: Python 3 (system-wide)

In [1]:

import pandas as pd
import numpy as np
from ast import literal_eval

Cleaning the Movies_Metadata csv

If you follow along with the comments while we cleaned data, you will note that we made some minor changes, primarily in chaining some of the functions that Banik uses. While it's not necessary to chain functions, it does simplify our code, so we included this method as another option. It's one of the nifty features of Pandas.

In [3]:

#read in the data
df = pd.read_csv('../data/movies_metadata.csv')

#print the shape of the dataframe
print(f"The shape is {df.shape}")

#get the column info
df.info()

#####################
# Helper Functions
#####################
#converts ints & string representations of numbers to floats
def to_float(x):
    try:
        x = float(x)
    except:
        x = np.nan
    return x

#Helper function to convert NaT to 0 and all other years to integers.
def convert_int(x):
    try:
        return int(x)
    except:
        return 0

#we can run both apply and astype in one line by chaining them
df['budget'] = df['budget'].apply(to_float).astype('float')

#Convert release_date into pandas datetime format
df['release_date'] = pd.to_datetime(df['release_date'],errors='coerce')

#Extract year from the datetime and convert to integer. (Again, we're chaining functions)
df['year'] = df['release_date'].apply(lambda x: str(x).split('-')[0] if x != np.nan else np.nan).apply(convert_int)

#convert vote_count to integer
df['vote_count'] = df['vote_count'].apply(lambda x: str(x).split('-')[0] if x != np.nan else np.nan).apply(convert_int)

#Convert all NaN into stringified empty lists and apply literal eval and convert to list by chaining functions
df['genres'] = df['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

#filter to just the relevant columns
df = df[['id','title','budget', 'genres', 'overview', 'revenue', 'runtime', 'vote_average', 'vote_count', 'year']]
df.head()
df.to_csv('movies_metadata_clean.csv', index=False)

Out[3]:

The shape is (5000, 24)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 adult                  5000 non-null   bool   
 belongs_to_collection  825 non-null    object 
 budget                 5000 non-null   int64  
 genres                 5000 non-null   object 
 homepage               311 non-null    object 
 id                     5000 non-null   int64  
 imdb_id                5000 non-null   object 
 original_language      5000 non-null   object 
 original_title         5000 non-null   object 
 overview               4979 non-null   object 
popularity             5000 non-null   float64
poster_path            4979 non-null   object 
production_companies   5000 non-null   object 
production_countries   5000 non-null   object 
release_date           4996 non-null   object 
revenue                5000 non-null   int64  
runtime                4994 non-null   float64
spoken_languages       5000 non-null   object 
status                 4994 non-null   object 
tagline                3749 non-null   object 
title                  5000 non-null   object 
video                  5000 non-null   bool   
vote_average           5000 non-null   float64
vote_count             5000 non-null   int64  
dtypes: bool(2), float64(3), int64(4), object(15)
memory usage: 869.3+ KB

Cleaning the Ted Talks

This is straight out of the book. Apply is a handy function available in pandas that lets you run a function for each row or column of your data. You're seeing examples here of using a lambda (inline) function as well as using a separately created function (convert_int).

The lambda function is just grabbing the year from the published date. It's doing that by splitting the string on the '-' character. This creates an array. We grab the first item in the array, which, if we had a valid date, should be the year. If we didn't have a valid date, then we drop in the np.nan.

In [5]:

import pandas as pd
import numpy as np

ted = pd.read_csv('../data/ted-talks/ted_main.csv')
#Convert release_date into pandas datetime format
ted['published_date'] = pd.to_datetime(ted['published_date'],
                                       errors='coerce', unit='s')

#see what the new date looks like
print("This is what the datetime string looks like:")
display(ted['published_date'].head())


#Extract year from the datetime
ted['published_year'] = ted['published_date'].apply(lambda x: str(x).split('-')[0] if x != np.nan else np.nan)

#Helper function to convert NaT to 0 and all other years to integers.
def convert_int(x):
    try:
        return int(x)
    except:
        return 0

#Apply convert_int to the year feature
ted['published_year'] = ted['published_year'].apply(convert_int)


ted.head()

Out[5]:

This is what the datetime string looks like:

 2006-06-27 00:11:00
 2006-06-27 00:11:00
 2006-06-27 00:11:00
 2006-06-27 00:11:00
 2006-06-27 20:38:00
Name: published_date, dtype: datetime64[ns]

This is also straight from the book. When we use the literal_eval function on the ratings column, we get a dictionary that we can manipulate. The "name" key holds the part of the ratings that we care about. We want to convert these words to lower case and create a list of the words. We create an empty list if there were no ratings.

In [6]:

#Import the literal_eval function from ast
from ast import literal_eval

#Convert all NaN into stringified empty lists
ted['ratings'] = ted['ratings'].fillna('[]')

#Apply literal_eval to convert stringified empty lists to the list object
ted['ratings'] = ted['ratings'].apply(literal_eval)


#Convert list of dictionaries to a list of strings
ted['ratings'] = ted['ratings'].apply(lambda x: [i['name'].lower() for i in x] if isinstance(x, list) else [])

ted.to_csv('ted_clean.csv', index=False)
#See how 'ratings' has changed?
ted.head()

Out[6]:

Cleaning the Movies_Metadata csv

Cleaning the Ted Talks

Product

Resources

Company