Cleaning the Movies_Metadata csv
If you follow the comments as we clean the data, you will notice that we made some minor changes, primarily by chaining some of the functions that Banik calls one at a time. Chaining is not required, but it makes the code more compact, so we include it as another option. It's one of the nifty features of pandas.
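To make the idea of chaining concrete, here is a minimal sketch that performs the same cleanup twice, once step by step and once as a single chain. The tiny DataFrame and its column names are hypothetical stand-ins, not the actual columns or steps used in the notebook.

```python
import pandas as pd

# Hypothetical sample standing in for a few columns of the movies metadata.
raw = pd.DataFrame({
    "title": ["Toy Story", "Jumanji", None],
    "budget": ["30000000", "65000000", "0"],
    "adult": ["False", "False", "False"],
})

# Step-by-step version: each call is assigned back to a variable.
step = raw.drop(columns=["adult"])
step = step.dropna(subset=["title"])
step = step.reset_index(drop=True)

# Chained version: each method returns a new DataFrame,
# so the next method can be called directly on the result.
chained = (
    raw
    .drop(columns=["adult"])
    .dropna(subset=["title"])
    .reset_index(drop=True)
)

print(chained.equals(step))
```

Both versions produce the same DataFrame; the chained form simply avoids the repeated reassignment.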
Cleaning the Ted Talks
This is straight out of the book. apply is a handy pandas method that runs a function on each row or column of a DataFrame, or on each value of a single column. You're seeing examples here of using a lambda (inline) function as well as a separately defined function (convert_int).
The lambda function just grabs the year from the published date. It does that by splitting the string on the '-' character, which produces a list. We grab the first item in the list, which, if we had a valid date, should be the year. If we didn't have a valid date, we drop in np.nan instead.
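The two uses of apply described above can be sketched as follows. This is an illustration under assumptions, not the book's exact code: the column name published_date, the sample values, and the choice to have convert_int return 0 for values it cannot convert are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the real column name and values may differ.
df = pd.DataFrame({"published_date": ["1995-10-30", "2006-06-27", np.nan]})


def convert_int(x):
    """Return x as an int, or 0 when it cannot be converted (assumed fallback)."""
    try:
        return int(x)
    except (ValueError, TypeError):
        return 0


# Lambda: split the date string on '-' and keep the first piece (the year).
# Anything that is not a string (for example a missing value) becomes np.nan.
df["year"] = df["published_date"].apply(
    lambda d: d.split("-")[0] if isinstance(d, str) else np.nan
)

# Named function: turn the year strings into integers.
df["year"] = df["year"].apply(convert_int)

print(df["year"].tolist())  # [1995, 2006, 0]
```

A lambda suits a one-line transformation like the split, while a named function is easier to read once the logic needs a try/except.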
This is also straight from the book. The ratings column is stored as text, so we use the literal_eval function to turn each entry into a Python object, a list of dictionaries, that we can manipulate. The "name" key of each dictionary holds the part of the rating that we care about. We convert these words to lower case and collect them in a list. If there were no ratings, we create an empty list.
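A minimal sketch of that ratings step is shown below. It assumes each ratings entry is a string holding a list of dictionaries with a "name" key; the sample rows, the helper name rating_names, and the output column rating_words are hypothetical.

```python
from ast import literal_eval

import pandas as pd

# Hypothetical sample rows; the real ratings strings are longer.
talks = pd.DataFrame({
    "ratings": [
        "[{'id': 7, 'name': 'Funny', 'count': 10}, "
        "{'id': 1, 'name': 'Beautiful', 'count': 3}]",
        "[]",
        None,
    ]
})


def rating_names(raw):
    """Parse one ratings entry and return its rating names in lower case."""
    try:
        parsed = literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        # Missing or malformed entry: fall back to an empty list.
        return []
    if not isinstance(parsed, list):
        return []
    return [
        item["name"].lower()
        for item in parsed
        if isinstance(item, dict) and "name" in item
    ]


talks["rating_words"] = talks["ratings"].apply(rating_names)

print(talks["rating_words"].tolist())
# [['funny', 'beautiful'], [], []]
```

literal_eval only evaluates Python literals (strings, numbers, lists, dictionaries and similar), which makes it a safer choice than eval for parsing text that came from a file.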