GitHub Repository: dsc-courses/dsc10-2022-fa
Path: blob/main/lectures/lec09/lec09.ipynb

Kernel: Python 3 (ipykernel)

In [ ]:

# Set up packages for lecture. Don't worry about understanding this code, but
# make sure to run it if you're following along.
import numpy as np
import babypandas as bpd
import pandas as pd
from matplotlib_inline.backend_inline import set_matplotlib_formats
import matplotlib.pyplot as plt
%reload_ext pandas_tutor
%set_pandas_tutor_options {'projectorMode': True}
set_matplotlib_formats("svg")
plt.style.use('ggplot')

np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option("display.max_rows", 7)
pd.set_option("display.max_columns", 8)
pd.set_option("display.precision", 2)

from IPython.display import display, IFrame
def show_def():
    src = "https://docs.google.com/presentation/d/e/2PACX-1vRKMMwGtrQOeLefj31fCtmbNOaJuKY32eBz1VwHi_5ui0AGYV3MoCjPUtQ_4SB1f9x4Iu6gbH0vFvmB/embed?start=false&loop=false&delayms=60000"
    width = 960 
    height = 569
    display(IFrame(src, width, height))

Lecture 9 – Functions and Apply

DSC 10, Fall 2022

Announcements

Lab 3 is due on Saturday 10/15 at 11:59PM.
Homework 3 is due on Tuesday 10/18 at 11:59PM.
The Midterm Project will be released one week from today – start thinking about who you may want to partner up with!
- You don't have to work with a partner.
- If you do, your partner doesn't have to be from your lecture section.

Agenda

Functions.
Applying functions to DataFrames.
Example: Student names.

Reminder: Use the DSC 10 Reference Sheet. You can also use it on exams!

Functions

Defining functions

We've learned how to do quite a bit in Python:
- Manipulate arrays, Series, and DataFrames.
- Perform operations on strings.
- Create visualizations.
But so far, we've been restricted to using existing functions (e.g. max, np.sqrt, len) and methods (e.g. .groupby, .assign, .plot).

Motivation

Suppose you drive to a restaurant 🥘 in LA, located exactly 100 miles away.

For the first 50 miles, you drive at 80 miles per hour.
For the last 50 miles, you drive at 60 miles per hour.

Question: What is your average speed throughout the journey?

🚨 The answer is not 70 miles per hour! Remember, from Homework 1, you need to use the fact that $\text{speed} = \frac{\text{distance}}{\text{time}}$ .

$$\text{average speed} = \frac{\text{distance}}{\text{time}} = \frac{50 + 50}{\text{time}_1 + \text{time}_2} \text{ miles per hour}$$

In segment 1, when you drove 50 miles at 80 miles per hour, you drove for $\frac{50}{80}$ hours:

$$\text{speed}_1 = \frac{\text{distance}_1}{\text{time}_1}$$

$$80 \text{ miles per hour} = \frac{50 \text{ miles}}{\text{time}_1} \implies \text{time}_1 = \frac{50}{80} \text{ hours}$$

Similarly, in segment 2, when you drove 50 miles at 60 miles per hour, you drove for $\text{time}_2 = \frac{50}{60} \text{ hours}$ .

Then,

$$\text{average speed} = \frac{50 + 50}{\frac{50}{80} + \frac{50}{60}} \text{ miles per hour}$$

\begin{align*}\text{average speed} &= \frac{50}{50} \cdot \frac{1 + 1}{\frac{1}{80} + \frac{1}{60}} \text{ miles per hour} \\ &= \frac{2}{\frac{1}{80} + \frac{1}{60}} \text{ miles per hour} \end{align*}
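As a quick numeric check of the derivation above, the sketch below computes the time spent in each segment and divides the total distance by the total time. The variable names are illustrative and are not part of the lecture.

```python
# Each segment is 50 miles long; speeds are in miles per hour.
distance_1, speed_1 = 50, 80
distance_2, speed_2 = 50, 60

# time = distance / speed, measured in hours.
time_1 = distance_1 / speed_1
time_2 = distance_2 / speed_2

# Average speed is total distance divided by total time.
average_speed = (distance_1 + distance_2) / (time_1 + time_2)
print(round(average_speed, 2))  # 68.57, not 70
```

The slower segment takes longer, so it pulls the average below the midpoint of the two speeds.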

Example: Harmonic mean

The harmonic mean ( $\text{HM}$ ) of two positive numbers, $a$ and $b$ , is defined as

$$\text{HM} = \frac{2}{\frac{1}{a} + \frac{1}{b}}$$

It is often used to find the average of multiple rates.

Finding the harmonic mean of 80 and 60 is not hard:

In [ ]:

2 / (1 / 80 + 1 / 60)

But what if we want to find the harmonic mean of 80 and 70? 80 and 90? 20 and 40? This would require a lot of copy-pasting, which is prone to error.

It turns out that we can define our own "harmonic mean" function just once, and re-use it multiple times.

In [ ]:

def harmonic_mean(a, b):
    return 2 / (1 / a + 1 / b)

In [ ]:

harmonic_mean(80, 60)

In [ ]:

harmonic_mean(20, 40)

Note that we only had to specify how to calculate the harmonic mean once!
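To make the reuse concrete, here is a small sketch that calls the same definition on the other pairs mentioned earlier. The comparison against the ordinary average in the last two lines is an added illustration, not something from the lecture.

```python
def harmonic_mean(a, b):
    return 2 / (1 / a + 1 / b)

# One definition, many calls with different arguments.
hm_80_70 = harmonic_mean(80, 70)
hm_80_90 = harmonic_mean(80, 90)
hm_20_40 = harmonic_mean(20, 40)

print(round(hm_80_70, 2))  # 74.67
print(round(hm_80_90, 2))  # 84.71
print(round(hm_20_40, 2))  # 26.67

# For the road trip, the harmonic mean is below the ordinary average of 70.
ordinary_average = (80 + 60) / 2
print(harmonic_mean(80, 60) < ordinary_average)  # True
```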

Functions

Functions are a way to divide our code into small subparts to prevent us from writing repetitive code. Each time we define our own function in Python, we will use the following pattern.

In [ ]:

show_def()

Functions are "recipes"

Functions take in inputs, known as arguments, do something, and produce some outputs.
The beauty of functions is that you don't need to know how they are implemented in order to use them!
- This is the premise of the idea of abstraction in computer science – you'll hear a lot about this in DSC 20.

In [ ]:

harmonic_mean(20, 40)

In [ ]:

harmonic_mean(79, 894)

In [ ]:

harmonic_mean(-2, 4)

Parameters and arguments

triple has one parameter, x.

In [ ]:

def triple(x):
    return x * 3

When we call triple with the argument 5, you can pretend that there's an invisible first line in the body of triple that says x = 5.

In [ ]:

triple(5)

Note that arguments can be of any type!

In [ ]:

triple('triton')
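The sketch below calls triple with a few more argument types. What `x * 3` does depends on the type of the argument: numbers are multiplied, while strings and lists are repeated. The float and list calls are extra examples that do not appear in the lecture.

```python
def triple(x):
    return x * 3

print(triple(5))         # 15
print(triple(2.5))       # 7.5
print(triple('triton'))  # tritontritontriton
print(triple([1, 2]))    # [1, 2, 1, 2, 1, 2]
```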

Functions can take 0 or more arguments

Functions can have any number of arguments. So far, we've created a function that takes two arguments – harmonic_mean – and a function that takes one argument – triple.

greeting takes no arguments!

In [ ]:

def greeting():
    return 'Hi! 👋'

In [ ]:

greeting()

Functions don't run until you call them!

The body of a function is not run until you use (call) the function.

Here, we can define where_is_the_error without seeing an error message.

In [ ]:

def where_is_the_error(something):
    '''You can describe your function within triple quotes. For example, this function 
    illustrates that errors don't occur until functions are executed (called).'''
    return (1 / 0) + something

It is only when we call where_is_the_error that Python gives us an error message.

In [ ]:

where_is_the_error(5)
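One way to see exactly when the error happens is to wrap only the call in try/except, which is a Python feature this lecture does not cover. This is a minimal sketch; the variable outcome is introduced just for illustration.

```python
def where_is_the_error(something):
    return (1 / 0) + something

# Reaching this line means the definition above ran without any error.
try:
    where_is_the_error(5)  # the body runs here, so 1 / 0 is evaluated now
    outcome = 'no error'
except ZeroDivisionError as err:
    outcome = 'ZeroDivisionError: ' + str(err)

print(outcome)  # ZeroDivisionError: division by zero
```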

Example: `first_name`

Let's create a function called first_name that takes in someone's full name and returns their first name. Example behavior is shown below.

>>> first_name('Pradeep Khosla')
'Pradeep'

Hint: Use the string method .split.

General strategy for writing functions:

First, try and get the behavior to work on a single example.
Then, encapsulate that behavior inside a function.

In [ ]:

'Pradeep Khosla'.split(' ')[0]

In [ ]:

def first_name(full_name):
    '''Returns the first name given a full name.'''
    return full_name.split(' ')[0]

In [ ]:

first_name('Pradeep Khosla')

In [ ]:

# What if there are three names?
first_name('Chancellor Pradeep Khosla')
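The sketch below shows why the three-word input behaves the way it does: first_name returns whatever comes before the first space, whether or not that is actually a first name.

```python
def first_name(full_name):
    '''Returns the first name given a full name.'''
    return full_name.split(' ')[0]

# .split(' ') breaks the string into a list of pieces at each space.
print('Chancellor Pradeep Khosla'.split(' '))   # ['Chancellor', 'Pradeep', 'Khosla']

# [0] picks the first piece, so a title is returned if one is present.
print(first_name('Pradeep Khosla'))             # Pradeep
print(first_name('Chancellor Pradeep Khosla'))  # Chancellor
```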

Returning

The return keyword specifies what the output of your function should be, i.e. what a call to your function will evaluate to.
Most functions we write will use return, but using return is not required.
Be careful: print and return work differently!

In [ ]:

def pythagorean(a, b):
    '''Computes the hypotenuse length of a right triangle with legs a and b.'''
    c = (a ** 2 + b ** 2) ** 0.5
    print(c)

In [ ]:

x = pythagorean(3, 4)

In [ ]:

# No output – why?
x

In [ ]:

# Errors – why?
x + 10

In [ ]:

def better_pythagorean(a, b):
    '''Computes the hypotenuse length of a right triangle with legs a and b, and actually returns the result.'''
    c = (a ** 2 + b ** 2) ** 0.5
    return c

In [ ]:

x = better_pythagorean(3, 4)
x

In [ ]:

x + 10
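The two versions can be compared side by side. A function that never reaches a return statement evaluates to None, which is why adding 10 to the result of pythagorean fails. The names printed and returned below are illustrative.

```python
def pythagorean(a, b):
    c = (a ** 2 + b ** 2) ** 0.5
    print(c)  # displays c, but nothing is returned

def better_pythagorean(a, b):
    c = (a ** 2 + b ** 2) ** 0.5
    return c

printed = pythagorean(3, 4)          # prints 5.0
returned = better_pythagorean(3, 4)  # prints nothing

print(printed is None)  # True
print(returned + 10)    # 15.0
```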

Returning

Once a function executes a return statement, it stops running.

In [ ]:

def motivational(quote):
    return 0
    print("Here's a motivational quote:", quote)

In [ ]:

motivational('Fall seven times and stand up eight.')
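To confirm that the print line never runs, the sketch below captures anything the function tries to print. It relies on io.StringIO and contextlib.redirect_stdout from the standard library, which are not covered in this lecture.

```python
import io
from contextlib import redirect_stdout

def motivational(quote):
    return 0
    print("Here's a motivational quote:", quote)  # never reached

# Collect anything printed during the call into buffer.
buffer = io.StringIO()
with redirect_stdout(buffer):
    result = motivational('Fall seven times and stand up eight.')

print(result)                   # 0
print(repr(buffer.getvalue()))  # '' because nothing was printed
```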

Scope 🩺

The names you choose for a function’s parameters are only known to that function (known as local scope). The rest of your notebook is unaffected by parameter names.

In [ ]:

def what_is_awesome(s):
    return s + ' is awesome!'

In [ ]:

what_is_awesome('data science')

In [ ]:

s = 'DSC 10'

In [ ]:

what_is_awesome('data science')
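Here is a compact sketch of both directions of local scope. The function shout and its parameter message are hypothetical names added for illustration, and try/except is used only to show the error without stopping the notebook.

```python
def what_is_awesome(s):
    return s + ' is awesome!'

# A variable named s outside the function is separate from the parameter s.
s = 'DSC 10'
print(what_is_awesome('data science'))  # data science is awesome!
print(s)                                # DSC 10

def shout(message):
    return message.upper() + '!'

shout('hello')

# The parameter name message does not exist outside of shout.
try:
    message
    outcome = 'message is defined'
except NameError:
    outcome = 'message is not defined'
print(outcome)  # message is not defined
```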

Applying functions to DataFrames

DSC 10 student data

The DataFrame roster contains the names and lecture sections of all students enrolled in DSC 10 this quarter. The first names are real, while the last names have been anonymized for privacy.

In [ ]:

roster = bpd.read_csv('data/roster-anon.csv')
roster

Example: Common first names

What is the most common first name among DSC 10 students? (Any guesses?)

In [ ]:

roster

Problem: We can't answer that right now, since we don't have a column with first names. If we did, we could group by it.

Solution: Use our function that extracts first names on every element of the 'name' column.

Using our `first_name` function

Somehow, we need to call first_name on every student's 'name'.

In [ ]:

roster

In [ ]:

roster.get('name').iloc[0]

In [ ]:

first_name(roster.get('name').iloc[0])

In [ ]:

first_name(roster.get('name').iloc[1])

Ideally, there's a better solution than doing this 411 times...

`.apply`

To apply a function to every element of column column_name in DataFrame df, use

df.get(column_name).apply(function_name)

The .apply method is a Series method.
- Important: We use .apply on Series, not DataFrames.
- The output of .apply is also a Series.

Pass just the name of the function – don't call it!
- Good ✅: .apply(first_name).
- Bad ❌: .apply(first_name()).

In [ ]:

roster.get('name').apply(first_name)

In [ ]:

%%pt

roster.get('name').apply(first_name)
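The sketch below tries both the good and the bad form on a tiny Series. It uses pandas directly, on the assumption that `.apply` behaves the same way on a babypandas Series, and the names are made up rather than taken from the roster.

```python
import pandas as pd

def first_name(full_name):
    return full_name.split(' ')[0]

# Hypothetical names for illustration.
names = pd.Series(['Ada Lovelace', 'Alan Turing', 'Grace Hopper'])

# Good: pass the function itself. .apply calls it once per element.
firsts = names.apply(first_name)
print(firsts.tolist())  # ['Ada', 'Alan', 'Grace']

# Bad: first_name() is a call with no argument, so Python raises an error
# before .apply even starts.
try:
    names.apply(first_name())
    outcome = 'no error'
except TypeError:
    outcome = 'TypeError'
print(outcome)  # TypeError
```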

Example: Common first names

In [ ]:

with_first = roster.assign(
    first=roster.get('name').apply(first_name)
)
with_first

In [ ]:

first_counts = with_first.groupby('first').count().sort_values('name', ascending=False).get(['name'])
first_counts
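The same pipeline can be traced on a DataFrame small enough to check by hand. The sketch uses pandas directly and a made-up six-row roster, so the names and sections are assumptions.

```python
import pandas as pd

def first_name(full_name):
    return full_name.split(' ')[0]

# Hypothetical roster for illustration.
toy = pd.DataFrame({
    'name': ['Ryan Aaa', 'Ryan Bbb', 'Dory Ccc', 'Ryan Ddd', 'Dory Eee', 'Kim Fff'],
    'section': ['10AM', '1PM', '10AM', '10AM', '11AM', '1PM'],
})

with_first = toy.assign(first=toy.get('name').apply(first_name))

# After .count(), the 'name' column holds the number of rows per first name.
first_counts = (
    with_first.groupby('first').count()
    .sort_values('name', ascending=False)
    .get(['name'])
)
print(first_counts.index.tolist())        # ['Ryan', 'Dory', 'Kim']
print(first_counts.get('name').tolist())  # [3, 2, 1]
```

Note that the column is still called 'name' even though it now contains counts, because `.count()` keeps the original column labels.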

Activity

Below:

Create a bar chart showing the number of students with each first name, but only include first names shared by at least two students.
Determine the proportion of students in DSC 10 who have a first name that is shared by at least two students.

In [ ]:

Note: `.apply` works with built-in functions, too!

For instance, to find the length of each name, we might use the len function:

In [ ]:

with_first

In [ ]:

with_first.get('first').apply(len)

Aside: what if names are in the index?

We were able to apply first_name to the 'name' column because it's a Series. The .apply method doesn't work on the index, because the index is not a Series.

In [ ]:

indexed_by_name = roster.set_index('name')
indexed_by_name

In [ ]:

indexed_by_name.index.apply(first_name)

Solution: `.reset_index()`

Use .reset_index() to turn the index of a DataFrame into a column, and to reset the index back to the default of 0, 1, 2, 3, and so on.

In [ ]:

indexed_by_name.reset_index()

In [ ]:

indexed_by_name.reset_index().get('name').apply(first_name)
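A small sketch of the problem and the fix together, using pandas directly and a made-up two-row roster. The try/except is only there to show the error without stopping the notebook.

```python
import pandas as pd

def first_name(full_name):
    return full_name.split(' ')[0]

# Hypothetical roster for illustration.
toy = pd.DataFrame({
    'name': ['Ryan Aaa', 'Dory Bbb'],
    'section': ['10AM', '1PM'],
})
indexed_by_name = toy.set_index('name')

# The index has no .apply method, so this raises an AttributeError.
try:
    indexed_by_name.index.apply(first_name)
    outcome = 'no error'
except AttributeError:
    outcome = 'AttributeError'
print(outcome)  # AttributeError

# After .reset_index(), 'name' is an ordinary column again, which is a Series.
firsts = indexed_by_name.reset_index().get('name').apply(first_name)
print(firsts.tolist())  # ['Ryan', 'Dory']
```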

Example: Shared first names and sections

Suppose you're one of the $\approx$20% of students in DSC 10 who has a first name that is shared with at least one other student.
Let's try and determine whether someone in your lecture section shares the same first name as you.

In [ ]:

with_first

For example, maybe 'Ryan Ufhwdl' wants to see if there's another 'Ryan' in their section.

Strategy:

What section is 'Ryan Ufhwdl' in?
How many people are in that section and named 'Ryan'?

In [ ]:

what_section = with_first[with_first.get('name') == 'Ryan Ufhwdl'].get('section').iloc[0]
what_section

In [ ]:

how_many = with_first[(with_first.get('section') == what_section) & (with_first.get('first') == 'Ryan')].shape[0]
how_many

Another function: `shared_first_and_section`

Let's create a function named shared_first_and_section. It will take in the full name of a student and return the number of students in their section who have the same first name (including them).

Note: This is the first function we're writing that involves using a DataFrame within the function – this is fine!

In [ ]:

def shared_first_and_section(name):
    # First, find the row corresponding to that full name in with_first.
    # We're assuming that full names are unique.
    row = with_first[with_first.get('name') == name]
    
    # Then, get that student's first name and section.
    first = row.get('first').iloc[0]
    section = row.get('section').iloc[0]
    
    # Now, find all the students with the same first name and section.
    shared_info = with_first[(with_first.get('first') == first) & (with_first.get('section') == section)]
    
    # Return the number of such students.
    return shared_info.shape[0]

In [ ]:

shared_first_and_section('Ryan Ufhwdl')

In [ ]:

shared_first_and_section('Dory Xaghsk')
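To check the logic by hand, here is the same function run against a four-row DataFrame. The DataFrame toy and its names are made up, and the sketch uses pandas directly rather than babypandas.

```python
import pandas as pd

# Hypothetical data: two Ryans in 10AM, one Ryan in 1PM, one Dory in 10AM.
toy = pd.DataFrame({
    'name': ['Ryan Aaa', 'Ryan Bbb', 'Ryan Ccc', 'Dory Ddd'],
    'section': ['10AM', '10AM', '1PM', '10AM'],
    'first': ['Ryan', 'Ryan', 'Ryan', 'Dory'],
})

def shared_first_and_section(name):
    # Find the row for this full name (assumed unique).
    row = toy[toy.get('name') == name]
    first = row.get('first').iloc[0]
    section = row.get('section').iloc[0]
    # Keep rows matching both the first name and the section.
    shared_info = toy[(toy.get('first') == first) & (toy.get('section') == section)]
    return shared_info.shape[0]

print(shared_first_and_section('Ryan Aaa'))  # 2
print(shared_first_and_section('Ryan Ccc'))  # 1
print(shared_first_and_section('Dory Ddd'))  # 1
```

The count is never 0, because each student always matches their own row.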

Now, let's add a column to with_first that contains the values returned by shared_first_and_section.

In [ ]:

with_first = with_first.assign(shared=with_first.get('name').apply(shared_first_and_section))
with_first

Let's look at all the students who are in a section with someone that has the same first name as them.

In [ ]:

with_first[(with_first.get('shared') > 1)].sort_values('shared', ascending=False)

We can narrow this down to a particular lecture section if we'd like.

In [ ]:

one_section_only = with_first[(with_first.get('shared') > 1) & 
                              (with_first.get('section') == '10AM')].sort_values('shared', ascending=False)
one_section_only

In [ ]:

one_section_only.get('first').unique()
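The query with two conditions can be broken into named pieces to see what each part contributes. This sketch uses pandas directly and made-up data; the names shared_mask and section_mask are illustrative.

```python
import pandas as pd

# Hypothetical data with a precomputed 'shared' column.
toy = pd.DataFrame({
    'first': ['Ryan', 'Ryan', 'Ryan', 'Dory', 'Kim', 'Kim'],
    'section': ['10AM', '10AM', '1PM', '10AM', '1PM', '1PM'],
    'shared': [2, 2, 1, 1, 2, 2],
})

# Each condition is a Series of True/False values, one per row.
shared_mask = toy.get('shared') > 1
section_mask = toy.get('section') == '10AM'

# & keeps only the rows where both conditions are True.
one_section_only = toy[shared_mask & section_mask]
print(one_section_only.shape[0])                     # 2
print(one_section_only.get('first').unique().tolist())  # ['Ryan']
```

When both conditions are written on one line, each needs its own parentheses, because `&` is evaluated before comparisons like `>` and `==`.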

Sneak peek

While the DataFrames on the previous slide contain the info we were looking for, they're not organized very conveniently. For instance, there are three rows containing the fact that there are 3 'Andrew's in the 10AM lecture section.

Wouldn't it be great if we could create a DataFrame like the one below? We'll see how on Friday!

	section	first	count
0	10AM	Andrew	3
1	1PM	Ethan	3
2	1PM	Samuel	3
3	10AM	Kevin	2
4	11AM	Connor	2

Activity

Find the longest first name in the class that is shared by at least two students in the same section.

Hint: You'll have to use both assign and apply.

In [ ]:

...

Summary, next time

Summary

Functions are a way to divide our code into small subparts to prevent us from writing repetitive code.
The .apply method allows us to call a function on every single element of a Series, which usually comes from .getting a column of a DataFrame.

Next time

More advanced DataFrame manipulations!
