Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
DataScienceUWL
GitHub Repository: DataScienceUWL/DS775
Path: blob/main/Lessons/Lesson 13 - RecSys 1/extras/DataCleaning.ipynb
871 views
Kernel: Python 3 (system-wide)
import pandas as pd import numpy as np from ast import literal_eval

Cleaning the Movies_Metadata csv

If you follow along with the comments while we cleaned data, you will note that we made some minor changes, primarily in chaining some of the functions that Banik uses. While it's not necessary to chain functions, it does simplify our code, so we included this method as another option. It's one of the nifty features of Pandas.

#read in the data df = pd.read_csv('../data/movies_metadata.csv') #print the shape of the dataframe print(f"The shape is {df.shape}") #get the column info df.info() ##################### # Helper Functions ##################### #converts ints & string representations of numbers to floats def to_float(x): try: x = float(x) except: x = np.nan return x #Helper function to convert NaT to 0 and all other years to integers. def convert_int(x): try: return int(x) except: return 0 #we can run both apply and astype in one line by chaining them df['budget'] = df['budget'].apply(to_float).astype('float') #Convert release_date into pandas datetime format df['release_date'] = pd.to_datetime(df['release_date'],errors='coerce') #Extract year from the datetime and convert to integer. (Again, we're chaining functions) df['year'] = df['release_date'].apply(lambda x: str(x).split('-')[0] if x != np.nan else np.nan).apply(convert_int) #convert vote_count to integer df['vote_count'] = df['vote_count'].apply(lambda x: str(x).split('-')[0] if x != np.nan else np.nan).apply(convert_int) #Convert all NaN into stringified empty lists and apply literal eval and convert to list by chaining functions df['genres'] = df['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else []) #filter to just the relevant columns df = df[['id','title','budget', 'genres', 'overview', 'revenue', 'runtime', 'vote_average', 'vote_count', 'year']] df.head() df.to_csv('movies_metadata_clean.csv', index=False)
The shape is (5000, 24) <class 'pandas.core.frame.DataFrame'> RangeIndex: 5000 entries, 0 to 4999 Data columns (total 24 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 adult 5000 non-null bool 1 belongs_to_collection 825 non-null object 2 budget 5000 non-null int64 3 genres 5000 non-null object 4 homepage 311 non-null object 5 id 5000 non-null int64 6 imdb_id 5000 non-null object 7 original_language 5000 non-null object 8 original_title 5000 non-null object 9 overview 4979 non-null object 10 popularity 5000 non-null float64 11 poster_path 4979 non-null object 12 production_companies 5000 non-null object 13 production_countries 5000 non-null object 14 release_date 4996 non-null object 15 revenue 5000 non-null int64 16 runtime 4994 non-null float64 17 spoken_languages 5000 non-null object 18 status 4994 non-null object 19 tagline 3749 non-null object 20 title 5000 non-null object 21 video 5000 non-null bool 22 vote_average 5000 non-null float64 23 vote_count 5000 non-null int64 dtypes: bool(2), float64(3), int64(4), object(15) memory usage: 869.3+ KB

Cleaning the Ted Talks

This is straight out of the book. Apply is a handy function available in pandas that lets you run a function for each row or column of your data. You're seeing examples here of using a lambda (inline) function as well as using a separately created function (convert_int).

The lambda function is just grabbing the year from the published date. It's doing that by splitting the string on the '-' character. This creates an array. We grab the first item in the array, which, if we had a valid date, should be the year. If we didn't have a valid date, then we drop in the np.nan.

import pandas as pd import numpy as np ted = pd.read_csv('../data/ted-talks/ted_main.csv') #Convert release_date into pandas datetime format ted['published_date'] = pd.to_datetime(ted['published_date'], errors='coerce', unit='s') #see what the new date looks like print("This is what the datetime string looks like:") display(ted['published_date'].head()) #Extract year from the datetime ted['published_year'] = ted['published_date'].apply(lambda x: str(x).split('-')[0] if x != np.nan else np.nan) #Helper function to convert NaT to 0 and all other years to integers. def convert_int(x): try: return int(x) except: return 0 #Apply convert_int to the year feature ted['published_year'] = ted['published_year'].apply(convert_int) ted.head()
This is what the datetime string looks like:
0 2006-06-27 00:11:00 1 2006-06-27 00:11:00 2 2006-06-27 00:11:00 3 2006-06-27 00:11:00 4 2006-06-27 20:38:00 Name: published_date, dtype: datetime64[ns]

This is also straight from the book. When we use the literal_eval function on the ratings column, we get a dictionary that we can manipulate. The "name" key holds the part of the ratings that we care about. We want to convert these words to lower case and create a list of the words. We create an empty list if there were no ratings.

#Import the literal_eval function from ast from ast import literal_eval #Convert all NaN into stringified empty lists ted['ratings'] = ted['ratings'].fillna('[]') #Apply literal_eval to convert stringified empty lists to the list object ted['ratings'] = ted['ratings'].apply(literal_eval) #Convert list of dictionaries to a list of strings ted['ratings'] = ted['ratings'].apply(lambda x: [i['name'].lower() for i in x] if isinstance(x, list) else []) ted.to_csv('ted_clean.csv', index=False) #See how 'ratings' has changed? ted.head()