CoCalc -- python-movies-lab-solutions.ipynb

GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_02/code/python-foundations/solution-code/python-movies-lab-solutions.ipynb

Kernel: Python 2

Python Review With Movie Data

Authors: Kiefer Katovich and Dave Yerrington (San Francisco)

In this lab, you'll be using the IMDb movies list below as your data set.

This lab is designed to help you practice iteration and functions in particular. The normal questions are more gentle, and the challenge questions are suitable for advanced/expert Python students or those with programming experience.

All of the questions require writing functions and using iteration to solve. You should print out a test of each function you write.

1) Load the provided list of `movies` dictionaries.

In [1]:

# List of movies dictionaries:

movies = [
{
"name": "Usual Suspects", 
"imdb": 7.0,
"category": "Thriller"
},
{
"name": "Hitman",
"imdb": 6.3,
"category": "Action"
},
{
"name": "Dark Knight",
"imdb": 9.0,
"category": "Adventure"
},
{
"name": "The Help",
"imdb": 8.0,
"category": "Drama"
},
{
"name": "The Choice",
"imdb": 6.2,
"category": "Romance"
},
{
"name": "Colonia",
"imdb": 7.4,
"category": "Romance"
},
{
"name": "Love",
"imdb": 6.0,
"category": "Romance"
},
{
"name": "Bride Wars",
"imdb": 5.4,
"category": "Romance"
},
{
"name": "AlphaJet",
"imdb": 3.2,
"category": "War"
},
{
"name": "Ringing Crime",
"imdb": 4.0,
"category": "Crime"
},
{
"name": "Joking muck",
"imdb": 7.2,
"category": "Comedy"
},
{
"name": "What is the name",
"imdb": 9.2,
"category": "Suspense"
},
{
"name": "Detective",
"imdb": 7.0,
"category": "Suspense"
},
{
"name": "Exam",
"imdb": 4.2,
"category": "Thriller"
},
{
"name": "We Two",
"imdb": 7.2,
"category": "Romance"
}
]

2) Filtering data by IMDb score.

2.1)

Write a function that:

Accepts a single movie dictionary from the movies list as an argument.
Returns True if the IMDb score is greater than 5.5.

2.2 [Challenge])

Write a function that:

Accepts the movies list and a specified category.
Returns True if the average score of the category is higher than the average score of all movies.

In [2]:

# 2.1:

def imdb_score_over_bad(movie):
    # The comparison itself evaluates to True or False:
    return movie['imdb'] > 5.5

print(movies[0])
print(imdb_score_over_bad(movies[0]))

Out[2]:

{'name': 'Usual Suspects', 'imdb': 7.0, 'category': 'Thriller'}
True
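
A predicate like this one is usually applied to the whole list rather than to a single entry. The sketch below is one way to do that with the built-in `filter()`; it assumes a three-movie excerpt of the data set, and the names `sample_movies`, `passing`, and `passing_names` are illustrative rather than part of the lab.

```python
# Three entries copied from the movies list, so the block runs on its own.
sample_movies = [
    {"name": "Usual Suspects", "imdb": 7.0, "category": "Thriller"},
    {"name": "Bride Wars", "imdb": 5.4, "category": "Romance"},
    {"name": "AlphaJet", "imdb": 3.2, "category": "War"},
]


def imdb_score_over_bad(movie):
    # True when the IMDb score is greater than 5.5.
    return movie["imdb"] > 5.5


# filter() keeps only the entries for which the predicate returns True.
passing = list(filter(imdb_score_over_bad, sample_movies))
passing_names = [movie["name"] for movie in passing]
print(passing_names)
```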

In [3]:

# 2.2:

def movies_category_over_avg(movies, category):
    overall_average = []
    category_average = []
    
    
    for movie in movies:
        # Creates a list of all IMDb scores:
        overall_average.append(movie['imdb'])
        # Creates a list of all IMDb scores that match the category argument:
        if movie['category'] == category:
            category_average.append(movie['imdb'])
            
    # Uses IMDb scores list to manually calculate the data set's mean:
    overall_average = sum(overall_average)/len(overall_average)
    # Catch to identify and respond to invalid categories:
    if len(category_average) == 0:
        print('no movies in specified category:', category)
        return False
    # Else valid category, calculate mean:
    else:
        category_average = sum(category_average)/len(category_average)
        # Compare category and overall means:
        if category_average > overall_average:
            return True
        else:
            return False

print(movies_category_over_avg(movies, 'Thriller'))
print(movies_category_over_avg(movies, 'Suspense'))

Out[3]:

False
True
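
The manual `sum(...)/len(...)` calculation can also be written with `statistics.mean` from the standard library. This is a minimal sketch, not the lab's official solution: the helper name `category_beats_overall` and the four-movie `sample_movies` excerpt are assumptions made for illustration.

```python
from statistics import mean

# Four entries copied from the movies list, so the block runs on its own.
sample_movies = [
    {"name": "Usual Suspects", "imdb": 7.0, "category": "Thriller"},
    {"name": "What is the name", "imdb": 9.2, "category": "Suspense"},
    {"name": "Detective", "imdb": 7.0, "category": "Suspense"},
    {"name": "Exam", "imdb": 4.2, "category": "Thriller"},
]


def category_beats_overall(movies, category):
    # Collect the scores that belong to the requested category.
    category_scores = [m["imdb"] for m in movies if m["category"] == category]
    # An unknown category has no scores, so there is nothing to compare.
    if not category_scores:
        return False
    overall = mean(m["imdb"] for m in movies)
    return mean(category_scores) > overall


print(category_beats_overall(sample_movies, "Thriller"))
print(category_beats_overall(sample_movies, "Suspense"))
```

On this excerpt the overall mean is 6.85, so "Thriller" (mean 5.6) falls below it and "Suspense" (mean 8.1) rises above it.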

3) Creating subsets by numeric condition.

3.1)

Write a function that:

Accepts the list of movies and a specified IMDb score.
Returns the sublist of movies that have scores greater than the one specified.

3.2 [Expert])

Write a function that:

Accepts the movies list as an argument.
Returns the movies list sorted first by category average score and then, within each category, by individual IMDb score (both from highest to lowest).

In [4]:

# 3.1:

def score_greater_subset(movies, score):
    subset = []
    for movie in movies:
        if movie['imdb'] > score:
            subset.append(movie)
    return subset

print(score_greater_subset(movies, 8.5))

Out[4]:

[{'name': 'Dark Knight', 'imdb': 9.0, 'category': 'Adventure'}, {'name': 'What is the name', 'imdb': 9.2, 'category': 'Suspense'}]
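
The append-inside-a-loop pattern in 3.1 maps directly onto a list comprehension. The following sketch shows the equivalent form on a three-movie excerpt; `sample_movies` and `top` are illustrative names, not part of the lab.

```python
# Three entries copied from the movies list, so the block runs on its own.
sample_movies = [
    {"name": "Dark Knight", "imdb": 9.0, "category": "Adventure"},
    {"name": "The Help", "imdb": 8.0, "category": "Drama"},
    {"name": "What is the name", "imdb": 9.2, "category": "Suspense"},
]


def score_greater_subset(movies, score):
    # Same logic as the loop version: keep movies scoring above the threshold.
    return [movie for movie in movies if movie["imdb"] > score]


top = score_greater_subset(sample_movies, 8.5)
print([movie["name"] for movie in top])
```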

In [5]:

# 3.2:
# See these Stack Overflow questions and answers for another example and explanation of sorting with a lambda key:
# http://stackoverflow.com/questions/3766633/how-to-sort-with-lambda-in-python
# http://stackoverflow.com/questions/14299448/sorting-by-multiple-conditions-in-python

def category_score_sorted(movies):
    category_scores = {}
    for movie in movies:
        # If the category key does not exist in the category_scores dic:
        if movie['category'] not in category_scores:
            # Add the category key with its first value being the IMDb score:
            category_scores[movie['category']] = [movie['imdb']]
        else:
            # Otherwise, append the score to the existing category values list:
            category_scores[movie['category']].append(movie['imdb'])
    
    # Uses the category key-and-values list to create a new dic in which the values are the means:
    category_averages = {}
    for cat, vals in list(category_scores.items()):
        category_averages[cat] = sum(vals)/len(vals)
    
    
    movies_sorted = sorted(movies, key=lambda x: (category_averages[x['category']],
                                                  x['imdb']), reverse=True)
    # The "key" argument of sorted() is a function that returns the value to sort by.
    # "x" refers to each individual entry in the movies list.
    # Lambda functions are like single-use, one-line functions.
    # Here we sort by category average first and then by IMDb score.
    # reverse=True sorts from high to low instead of low to high.
    
    return movies_sorted

category_score_sorted(movies)

Out[5]:

[{'category': 'Adventure', 'imdb': 9.0, 'name': 'Dark Knight'},
 {'category': 'Suspense', 'imdb': 9.2, 'name': 'What is the name'},
 {'category': 'Suspense', 'imdb': 7.0, 'name': 'Detective'},
 {'category': 'Drama', 'imdb': 8.0, 'name': 'The Help'},
 {'category': 'Comedy', 'imdb': 7.2, 'name': 'Joking muck'},
 {'category': 'Romance', 'imdb': 7.4, 'name': 'Colonia'},
 {'category': 'Romance', 'imdb': 7.2, 'name': 'We Two'},
 {'category': 'Romance', 'imdb': 6.2, 'name': 'The Choice'},
 {'category': 'Romance', 'imdb': 6.0, 'name': 'Love'},
 {'category': 'Romance', 'imdb': 5.4, 'name': 'Bride Wars'},
 {'category': 'Action', 'imdb': 6.3, 'name': 'Hitman'},
 {'category': 'Thriller', 'imdb': 7.0, 'name': 'Usual Suspects'},
 {'category': 'Thriller', 'imdb': 4.2, 'name': 'Exam'},
 {'category': 'Crime', 'imdb': 4.0, 'name': 'Ringing Crime'},
 {'category': 'War', 'imdb': 3.2, 'name': 'AlphaJet'}]
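
The "create the key if it is missing, otherwise append" step in 3.2 is a common grouping pattern that `collections.defaultdict` handles automatically. Below is a sketch of the same sort under that approach; it runs on a four-movie excerpt, and the names `sample_movies`, `scores_by_category`, and `ordered` are illustrative.

```python
from collections import defaultdict

# Four entries copied from the movies list, so the block runs on its own.
sample_movies = [
    {"name": "Hitman", "imdb": 6.3, "category": "Action"},
    {"name": "Exam", "imdb": 4.2, "category": "Thriller"},
    {"name": "Usual Suspects", "imdb": 7.0, "category": "Thriller"},
    {"name": "Dark Knight", "imdb": 9.0, "category": "Adventure"},
]


def category_score_sorted(movies):
    # defaultdict(list) creates an empty list the first time a key is seen.
    scores_by_category = defaultdict(list)
    for movie in movies:
        scores_by_category[movie["category"]].append(movie["imdb"])

    # Mean score per category.
    averages = {cat: sum(vals) / len(vals)
                for cat, vals in scores_by_category.items()}

    # Tuples compare element by element: category average first, then score.
    return sorted(movies,
                  key=lambda m: (averages[m["category"]], m["imdb"]),
                  reverse=True)


ordered = category_score_sorted(sample_movies)
print([movie["name"] for movie in ordered])
```

On this excerpt the category averages are 9.0 (Adventure), 6.3 (Action), and 5.6 (Thriller), so "Hitman" sorts ahead of both Thriller titles even though "Usual Suspects" has the higher individual score.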

4) Creating subsets by string condition.

4.1)

Write a function that:

Accepts the movies list and a category name.
Returns the movie names within that category (case-insensitive!).
If the category is not in the data, prints a message that says it does not exist and returns None.

Recall that, to convert a string to lowercase, you can use:

mystring = 'Dumb and Dumber'
lowercase_mystring = mystring.lower()
print(lowercase_mystring)
# Output: dumb and dumber
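
To make a comparison case-insensitive, lowercase both sides before comparing them. A small sketch with made-up variable names (`stored_category`, `user_input`):

```python
stored_category = 'Suspense'
user_input = 'SUSPENSE'

# A direct comparison is case-sensitive, so these two strings differ:
exact_match = stored_category == user_input

# Lowercasing both sides first makes the comparison case-insensitive:
loose_match = stored_category.lower() == user_input.lower()

print(exact_match, loose_match)
```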

4.2 [Challenge])

Write a function that:

Accepts the movies list and a "search string."
Returns a dictionary with the keys 'category' and 'title' whose values are lists of categories that contain the search string and titles that contain the search string, respectively (case-insensitive!).

In [6]:

# 4.1:

def category_subset(movies, category):
    category = category.lower()
    movies_subset = []
    
    for movie in movies:
        movie_category = movie['category'].lower()
        if movie_category == category:
            movies_subset.append(movie['name'])
            
    if len(movies_subset) == 0:
        print('No movies in category:', category)
        return None
    else:
        return movies_subset
    
print(category_subset(movies, 'suspense'))
print(category_subset(movies, 'sci-fi'))

Out[6]:

['What is the name', 'Detective']
No movies in category: sci-fi
None

In [7]:

# 4.2:

def category_title_search(movies, search_string):
    search_string = search_string.lower()
    
    results = {'category':[], 'title':[]}
    for movie in movies:
        movie_category = movie['category'].lower()
        movie_title = movie['name'].lower()
        
        if search_string in movie_category:
            if movie_category not in results['category']:
                results['category'].append(movie_category)
            
        if search_string in movie_title:
            results['title'].append(movie_title)
            
    return results

print(category_title_search(movies, 'SUS'))

Out[7]:

{'category': ['suspense'], 'title': ['usual suspects']}

5) Multiple conditions.

5.1)

Write a function that:

Accepts the movies list and a "search criteria" variable.
If the criteria variable is numeric, return a list of movie titles with a score greater than or equal to the criteria.
If the criteria variable is a string, return a list of movie titles that match that category (case-insensitive!). If there is no match, return an empty list and print an informative message.

5.2 [Expert])

Write a function that:

Accepts the movies list and a string search criteria variable.
The search criteria variable can contain within it:

Boolean operations: 'AND', 'OR', and 'NOT' (can have/be lowercase as well, we just capitalized for clarity).
Search criteria specified with the syntax score=..., category=..., and/or title=..., where the ... indicates what to look for.
- If score is present, it indicates scores greater than or equal to the value.
- For category and title, the string indicates that the category or title must contain the search string (case-insensitive).

Return the matches for the search criteria specified.
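
One way to think about the Boolean operations is as set operations on the indices of the matching movies. The sketch below uses hand-picked index sets for a hypothetical five-movie list; `all_indices`, `high_score`, and `suspense` are assumed values for illustration, not results computed from the lab data.

```python
# Indices of every movie in a hypothetical five-movie list.
all_indices = {0, 1, 2, 3, 4}
# Assumed matches for a score criterion and a category criterion.
high_score = {0, 2, 3}
suspense = {3, 4}

and_result = high_score & suspense                     # AND: in both sets
or_result = high_score | suspense                      # OR: in either set
not_result = high_score - suspense                     # NOT: drop the matches
or_not_result = high_score | (all_indices - suspense)  # OR NOT: add the non-matches

print(and_result, or_result, not_result, or_not_result)
```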

In [8]:

# 5.1:

def general_search(movies, criterion):
    titles_matches = []
    
    # First, check the criterion type:
    if type(criterion) in [int, float]:
        search_for = 'score'
    elif type(criterion) == str:
        search_for = 'titles'
        criterion = criterion.lower()
    else:
        print('criterion neither string nor numeric')
        return titles_matches
    
    for movie in movies:
        if search_for == 'score':
            if movie['imdb'] >= criterion:
                titles_matches.append(movie['name'])
                
        else:
            if movie['category'].lower() == criterion:
                titles_matches.append(movie['name'])
                
    if len(titles_matches) == 0:
        print('no matches found')
    
    return titles_matches

print(general_search(movies, 6.9))
print(general_search(movies, 'suspense'))
print(general_search(movies, 'horror'))
print(general_search(movies, {'name':'the godfather'}))

Out[8]:

['Usual Suspects', 'Dark Knight', 'The Help', 'Colonia', 'Joking muck', 'What is the name', 'Detective', 'We Two']
['What is the name', 'Detective']
no matches found
[]
criterion neither string nor numeric
[]
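
The type check in 5.1 can also be written with `isinstance`, which accepts a tuple of types. One caveat is that `bool` is a subclass of `int` in Python, so `True` would count as numeric unless it is excluded first. The helper name `criterion_kind` below is hypothetical and only sketches the dispatch step.

```python
def criterion_kind(criterion):
    # bool is a subclass of int, so rule it out before the numeric check.
    if isinstance(criterion, bool):
        return "unsupported"
    if isinstance(criterion, (int, float)):
        return "score"
    if isinstance(criterion, str):
        return "category"
    return "unsupported"


print(criterion_kind(6.9))
print(criterion_kind("suspense"))
print(criterion_kind({"name": "the godfather"}))
```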

In [9]:

# 5.2:

# This function is used later in the boolean_search function and may not make sense initially:
def movie_matches_subparser(movies, movie_key, value):
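    # If we are assessing a title criterion:
    if movie_key == 'title':
        movie_key = 'name'
    # Accept score=... (the syntax given in the question) as an alias for imdb=...:
    elif movie_key == 'score':
        movie_key = 'imdb'
    # If not a title, category, or IMDb, print an error message:
    elif movie_key not in ['category','imdb']:
        print('movie lookup key', movie_key, 'incorrect')
        return []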
    # We are assessing a score criterion:
    if movie_key == 'imdb':
        try:
            value = float(value)
        # If score is invalid, print an error message:
        except ValueError:
            print('imdb', value, 'cannot become float')
            return []
        
    subset = []
    # Assigns index values to movies and appends indexes of movies in the specified criteria:
    for movie_ind, movie in enumerate(movies):
        # Looks at scores:
        if type(value) == float:
            if movie[movie_key] >= value:
                subset.append(movie_ind)
        # Looks for strings:
        else:
            if value in movie[movie_key].lower():
                subset.append(movie_ind)
    
    return subset


# This function is used later in the boolean_search function and may not make sense initially:
def meets_boolean_criteria(movies, criteria_info):
    # Build the list of every movie index (0 through len(movies) - 1):
    movie_inds = list(range(len(movies)))
    
    full_set = set(movie_inds)
    return_set = set(movie_inds)
    
    # Take a look at our movies' indices and their Booleans:
    for boolean, movie_subset in criteria_info:
        
        # Removes duplicate movies as the for loop iterates through:
        movie_subset = set(movie_subset)
        
        # Uses bools to add or drop movie index lists from the return set:
        if boolean == 'and':
            return_set = return_set & movie_subset
        elif boolean == 'or':
            return_set = return_set | movie_subset
        elif boolean == 'not':
            return_set = return_set - movie_subset
        elif boolean == 'ornot':
            return_set = return_set | (full_set - movie_subset)
            
    return_list = []
    # Uses those index values, in their original order, to extract the rest of the movie information:
    for ind in sorted(return_set):
        return_list.append(movies[ind])
        
    return return_list  
            
                

def boolean_search(movies, search):
    # Convert string to lower:
    search = search.lower()
    # Split criteria into parts on any run of white space:
    search = search.split()
    # Criteria written without white space between them will still fail to parse:
    criteria_info = []
    current_boolean = 'and'
    
    # Utilize a while statement to individually assess and extract separate criteria:
    while len(search) > 0:
        # Pop off that first criterion:
        item = search.pop(0)
        # This if statement may seem tricky, but it's trying to figure out whether the
        # current item is a Boolean operator or a specified criterion.
        
        if item in ['and','or','not']:
            if (current_boolean == 'or') and (item == 'not'):
                current_boolean = 'ornot'
            else:
                current_boolean = item
            continue
        else:
            if '=' in item:
                item = item.split('=')
            else:
                print(item, 'syntax incorrect')
                return []
            # Pass the specified criterion through the movie_matches_subparser:            
            movie_match_inds = movie_matches_subparser(movies, item[0], item[1])
            # Now we will append the index results from the movie_match_inds with their desired bool:  
            criteria_info.append([current_boolean, movie_match_inds])

    # Finally, compare the list of movies to the identified index values and bools:      
    matches = meets_boolean_criteria(movies, criteria_info)
    return matches
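
To see how `boolean_search` reads a query, it can help to isolate the parsing step. The sketch below mirrors the operator-tracking logic of the while loop but only returns the parsed pieces; the function name `tokenize` is hypothetical, and the syntax check for items without `=` is left out for brevity.

```python
def tokenize(search):
    # Returns (boolean, key, value) triples in the order they appear.
    triples = []
    current_boolean = "and"
    for item in search.lower().split():
        if item in ("and", "or", "not"):
            # "or" followed by "not" is tracked as the combined operator "ornot".
            if current_boolean == "or" and item == "not":
                current_boolean = "ornot"
            else:
                current_boolean = item
            continue
        # Split "key=value" at the first "=" sign.
        key, _, value = item.partition("=")
        triples.append((current_boolean, key, value))
    return triples


print(tokenize("imdb=8.9 AND NOT category=suspense"))
print(tokenize("imdb=7.0 NOT category=suspense OR NOT title=love"))
```

Note that "AND NOT" collapses to "not", because each new operator simply replaces the previous one unless the pair is "or" followed by "not".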

In [10]:

boolean_search(movies, 'imdb=7.0 NOT category=suspense OR NOT title=love')

Out[10]:

[{'category': 'Thriller', 'imdb': 7.0, 'name': 'Usual Suspects'},
 {'category': 'Action', 'imdb': 6.3, 'name': 'Hitman'},
 {'category': 'Adventure', 'imdb': 9.0, 'name': 'Dark Knight'},
 {'category': 'Drama', 'imdb': 8.0, 'name': 'The Help'},
 {'category': 'Romance', 'imdb': 6.2, 'name': 'The Choice'},
 {'category': 'Romance', 'imdb': 7.4, 'name': 'Colonia'},
 {'category': 'Romance', 'imdb': 5.4, 'name': 'Bride Wars'},
 {'category': 'War', 'imdb': 3.2, 'name': 'AlphaJet'},
 {'category': 'Crime', 'imdb': 4.0, 'name': 'Ringing Crime'},
 {'category': 'Comedy', 'imdb': 7.2, 'name': 'Joking muck'},
 {'category': 'Suspense', 'imdb': 9.2, 'name': 'What is the name'},
 {'category': 'Suspense', 'imdb': 7.0, 'name': 'Detective'},
 {'category': 'Thriller', 'imdb': 4.2, 'name': 'Exam'},
 {'category': 'Romance', 'imdb': 7.2, 'name': 'We Two'}]

In [11]:

boolean_search(movies, 'imdb=8.9')

Out[11]:

[{'category': 'Adventure', 'imdb': 9.0, 'name': 'Dark Knight'},
 {'category': 'Suspense', 'imdb': 9.2, 'name': 'What is the name'}]

In [12]:

boolean_search(movies, 'imdb=8.9 AND NOT category=suspense')

Out[12]:

[{'category': 'Adventure', 'imdb': 9.0, 'name': 'Dark Knight'}]

In [13]:

boolean_search(movies, 'imdb=notafloat')

Out[13]:

imdb notafloat cannot become float

[]

In [14]:

boolean_search(movies, 'category=1')

Out[14]:

[]

In [16]:

boolean_search(movies, 'category=suspense WHEN imdb=5.5')

Out[16]:

when syntax incorrect

[]

In [17]:

boolean_search(movies, 'review_count=100')

Out[17]:

movie lookup key review_count incorrect

[]
