GitHub Repository: dsc-courses/dsc10-2022-fa
Path: blob/main/lectures/lec05/lec05.ipynb
³⁰⁵⁸ views

Kernel: Python 3 (ipykernel)

Lecture 5 – DataFrames: Accessing, Sorting, and Querying

DSC 10, Fall 2022

Announcements

Homework 1 is due tomorrow at 11:59PM.
Lab 2 is released and is due Saturday at 11:59PM.
Discussion section is tonight. Attendance policy updates:
- Attendance will be taken at a "random" time during the 50-minute discussion section.
- To earn credit, you must be there at that moment and have a photo ID with you. No exceptions.
This video walks through the lottery example from the end of Lecture 4.

Agenda

Today, we'll use a real dataset and lots of motivating questions to illustrate key DataFrame manipulation techniques.

Note:

Remember to check the Resources tab of the course website for programming resources.
Some key links moving forward:

DataFrames

`pandas`

DataFrames (tables) are provided by a package called pandas.
pandas is the tool for doing data science in Python.

But `pandas` is not so cute...

Enter `babypandas`!

We at UCSD have created a smaller, nicer version of pandas called babypandas.
It keeps the important stuff and has much better error messages.
It's easier to learn, but is still valid pandas code.

DataFrames in `babypandas` 🐼

Tables in babypandas (and pandas) are called "DataFrames."
To use DataFrames, we'll need to import babypandas. (We'll need numpy as well.)

In [ ]:

import babypandas as bpd
import numpy as np

About the Data: Get It Done 👷

We'll usually work with data stored in the CSV format. CSV stands for "comma-separated values."
The file data/get-it-done-oct-1.csv contains service requests made on October 1, 2022 through the Get It Done program.
Get It Done allows the general public to report non-emergency problems to the City of San Diego through a mobile app, website, or phone call.

Reading data from a file 📖

We can read in a CSV using bpd.read_csv(...). Give it the path to a file relative to your notebook (if the file is in the same folder as your notebook, this is just the name of the file).

In [ ]:

oct_1 = bpd.read_csv('data/get-it-done-oct-1.csv')
oct_1

Structure of a DataFrame

DataFrames have columns and rows.
- Think of each column as an array. Columns contain data of the same type.
Each column has a label, e.g. 'neighborhood' and 'status'.
- A column's label is its name.
- Column labels are stored as strings.
Each row has a label too.
- Together, the row labels are called the index. The index is not a column!

In [ ]:

# This DataFrame has 753 rows and 7 columns
oct_1

Setting a new index

We can set a better index using .set_index(column_name).
Row labels should be unique identifiers.
- Row labels are row names; ideally, each row has a different, descriptive name.
⚠️ Like most DataFrame methods, .set_index returns a new DataFrame; it does not modify the original DataFrame.

In [ ]:

oct_1.set_index('service_request_id')

In [ ]:

oct_1

In [ ]:

oct_1 = oct_1.set_index('service_request_id')
oct_1

Shape of a DataFrame

.shape returns the number of rows and columns in a given DataFrame.
Access each with []:
- .shape[0] for rows.
- .shape[1] for columns.

In [ ]:

# There were 7 columns before, but one of them became the index, and the index is not a column!
oct_1.shape

In [ ]:

# Number of rows
oct_1.shape[0]

In [ ]:

# Number of columns
oct_1.shape[1]

Annual summary of Get It Done requests

The file data/get-it-done-requests.csv contains a summary of all Get It Done requests submitted this calendar year (2022) so far.
This whole year's worth of data shows the types of problems being reported in each neighborhood and how many service requests are resolved ('closed') versus unresolved ('open').

In [ ]:

requests = bpd.read_csv('data/get-it-done-requests.csv')
requests

Example 1: Total requests

Key concepts: Accessing columns, performing operations with them, and adding new columns.

Finding total requests

Question: How many service requests of each type in each neighborhood have been made this year?
We have, separately, the number of closed service requests and open service requests of each type in each neighborhood.
Workflow:
- Get the column of closed requests.
- Get the column of open requests.
- Add these columns element-wise.
- Add a new column to the DataFrame with these totals.

Step 1 – Getting a column

We can get a column from a DataFrame using .get(column_name).
⚠️ Column names are case sensitive!
Column names are strings, so we need to use quotes.
The result looks like a 1-column DataFrame, but is actually a Series.

In [ ]:

requests

In [ ]:

requests.get('closed')

Digression: Series

A Series is like an array, but with an index.
In particular, Series' support arithmetic.

In [ ]:

requests.get('closed')

Step 2 – Getting another column

In [ ]:

requests.get('open')

Step 3 – Calculating the total

Just like with arrays, we can perform arithmetic operations with two Series, as long as they have the same length and same index.
Operations happen element-wise.
The result is also a Series.

In [ ]:

requests.get('closed') + requests.get('open')

Step 4 – Adding the totals to the DataFrame as a new column

Use .assign(name_of_column=data_in_series) to assign a Series (or array, or list) to a DataFrame.
⚠️ Don't put quotes around name_of_column.
Creates a new DataFrame! Must save to variable.

In [ ]:

requests.assign(
    total=requests.get('closed') + requests.get('open')
)

In [ ]:

requests

In [ ]:

requests = requests.assign(
    total=requests.get('closed') + requests.get('open')
)
requests

Example 2: Analyzing requests

Key concept: Computing statistics of columns using Series methods.

Questions

What is the largest number of service requests for any one service in any one neighborhood?
What is a typical number of service requests for any one service in any one neighborhood?

Series, like arrays, have helpful methods, including .min(), .max(), and .mean().

In [ ]:

requests.get('total').max()

What is it that people are reporting so frequently, and where? We'll see how to find out shortly!

Other statistics:

In [ ]:

requests.get('total').mean()

In [ ]:

requests.get('total').median()

In [ ]:

requests.get('open').mean()

In [ ]:

requests.get('open').median()

Example 3: What and where is the most frequently requested service?

Key concepts: Sorting. Accessing using integer positions.

Step 1 – Sorting the DataFrame

Use the .sort_values(by=column_name) method to sort.
- The by= is not necessary.
Like most DataFrame methods, this returns a new DataFrame.

In [ ]:

requests.sort_values(by='total')

This sorts, but in ascending order (small to large). We want the opposite!

Step 1 – Sorting the DataFrame in descending order

Use .sort_values(by=column_name, ascending=False) to sort in descending order.
ascending is an optional argument. If omitted, it will be set to True by default.

In [ ]:

ordered_requests = requests.sort_values(by='total', ascending=False)
ordered_requests

Step 2 – Extracting the neighborhood and service

We saw that the most reported issue is 'Encampment' in 'Downtown', but how do we extract that information using code?
First, grab an entire column as a Series.
Navigate to a particular entry of the Series using .iloc[integer_position].

In [ ]:

ordered_requests

In [ ]:

ordered_requests.get('neighborhood')

In [ ]:

ordered_requests.get('neighborhood').iloc[0]

In [ ]:

ordered_requests.get('service').iloc[0]

Example 4: Status of a request

Key concept: Accessing using row labels.

Status of a request

On October 1, you submitted service request 3940652. Has the issue been resolved?
This cannot be answered from the annual summary data, but must be answered from the detailed data about October 1.

In [ ]:

oct_1

Your service request is buried in the middle of the DataFrame. Only the first few rows and last few rows are shown, so you can't tell just by looking at the DataFrame.

Accessing using the row label

To pull out one particular entry of a DataFrame corresponding to a row and column with certain labels:

Use .get(column_name) to extract the entire column as a Series.
Use .loc[] to access the element of a Series with a particular row label.

In this class, we'll always get a column, then a row (but row, then column is also possible).

In [ ]:

oct_1.get('status')

In [ ]:

oct_1.get('status').loc[3940652]

Activity 🚚

Oh no, your service request 3940652 has still not been resolved! What was the problem again?

Write one line of code that evaluates to the full description of the problem, as you described it in your service request.

In [ ]:

...

Summary of accessing a Series

There are two ways to get an element of a Series:
- .loc[] uses the row label.
- .iloc[] uses the integer position.
Usually .loc[] is more convenient, but each is best for different scenarios.

Note

Sometimes the integer position and row label are the same.
This happens by default with bpd.read_csv.

In [ ]:

bpd.read_csv('data/get-it-done-oct-1.csv')

In [ ]:

bpd.read_csv('data/get-it-done-oct-1.csv').get('public_description').loc[31]

In [ ]:

bpd.read_csv('data/get-it-done-oct-1.csv').get('public_description').iloc[31]

Reflection

Questions we can answer right now...

What is the largest number of open requests of one type in one neighborhood?
- requests.get('open').max().

How many requests were made on October 1?
- oct_1.shape[0].

What is the description of the latest request made on October 1?
- oct_1.sort_values(by='date_requested', ascending=False).get('public_description').iloc[0].

Moving forward, let's just focus on the requests DataFrame. As a reminder, here's what it looks like:

In [ ]:

requests

Questions we can't yet answer...

Which neighborhood has the most 'Weed Cleanup' requests?
What is the most commonly requested service in the 'University' neighborhood (near UCSD)?
In the 'Downtown' neighborhood, how many open service requests are there?

The common thread between these questions is that they all involve only a subset of the rows in our DataFrame.

Example 6: Which neighborhood has the most `'Weed Cleanup'` requests?

Key concept: Selecting rows (via Boolean indexing).

Selecting rows

We could determine the neighborhood with the most 'Weed Cleanup' requests if we had a DataFrame consisting of only these type of requests.
- We would sort by the 'total' column in descending order, then extract the neighborhood name in the first row.
How do we get that DataFrame?

The solution

In [ ]:

requests[requests.get('service') == 'Weed Cleanup']

Use == to check equality. Not =, as that's for assignment of a value to a variable.

In [ ]:

'Weed Cleanup' == 'Weed Clean-Up'

In [ ]:

'Weed Cleanup' == 'Weed Cleanup'

We can broadcast the equality check to each element of a Series. The comparison happens element-wise.

In [ ]:

requests.get('service') == 'Weed Cleanup'

Boolean indexing

To select only some rows of requests:

Make a sequence (list/array/Series) of Trues (keep) and Falses (toss).
- The values True and False are of the Boolean data type.
Then pass it into requests[sequence_goes_here].

Rather than making the sequence by hand, we usually generate it by making a comparison.

Another example of element-wise comparison

Comparisons can check inequality as well as equality.

In [ ]:

requests

In [ ]:

requests.get('open') > 1

In [ ]:

requests[requests.get('open') > 1]

Original Question: Which neighborhood has the most `'Weed Cleanup'` requests?

Strategy:

Extract a DataFrame of just the 'Weed Cleanup' requests.
Sort by 'total' in descending order.
Return the first element in the 'neighborhood' column.

In [ ]:

weed_cleanup_only = requests[requests.get('service') == 'Weed Cleanup']
weed_cleanup_sorted = weed_cleanup_only.sort_values(by='total', ascending=False)
weed_cleanup_sorted

In [ ]:

weed_cleanup_sorted.get('neighborhood').iloc[0]

What if the condition isn't satisfied?

In [ ]:

requests[requests.get('service') == 'Lime Cleanup']

Concept Check ✅ – Answer at cc.dsc10.com

Which expression below evaluates to the total number of service requests in the 'Downtown' neighborhood?

A. requests[requests.get('neighborhood') == 'Downtown'].get('total').sum()

B. requests.get('total').sum()[requests.get('neighborhood') == 'Downtown']

C. requests['Downtown'].get('total').sum()

D. More than one of the above.

In [ ]:

...

Activity 🚘

Question: What is the most commonly requested service in the 'University' neighborhood (near UCSD)?

Write one line of code that evaluates to the answer.

In [ ]:

...

Summary

We learned many DataFrame methods and techniques.
Don't feel the need to memorize them all right away.
Instead, refer to this lecture, the DSC 10 reference sheet, the babypandas notes, and the babypandas documentation when working on assignments.
Over time, these techniques will become more and more familiar.
Practice! Frame your own questions using this dataset and try and answer them.

Next time

We'll answer more complicated questions, which will lead us to a new core DataFrame method, .groupby.

Lecture 5 – DataFrames: Accessing, Sorting, and Querying

DSC 10, Fall 2022

Announcements

Agenda

Note:

DataFrames

pandas

But pandas is not so cute...

Enter babypandas!

DataFrames in babypandas 🐼

About the Data: Get It Done 👷

Reading data from a file 📖

Structure of a DataFrame

Setting a new index

Shape of a DataFrame

Annual summary of Get It Done requests

Example 1: Total requests

Finding total requests

Step 1 – Getting a column

Digression: Series

Step 2 – Getting another column

Step 3 – Calculating the total

Step 4 – Adding the totals to the DataFrame as a new column

Example 2: Analyzing requests

Questions

Example 3: What and where is the most frequently requested service?

Step 1 – Sorting the DataFrame

Step 1 – Sorting the DataFrame in descending order

Step 2 – Extracting the neighborhood and service

Example 4: Status of a request

Status of a request

Accessing using the row label

Activity 🚚

Summary of accessing a Series

Note

Reflection

Questions we can answer right now...

Questions we can't yet answer...

Example 6: Which neighborhood has the most 'Weed Cleanup' requests?

Selecting rows

The solution

Boolean indexing

Another example of element-wise comparison

Original Question: Which neighborhood has the most 'Weed Cleanup' requests?

What if the condition isn't satisfied?

Concept Check ✅ – Answer at cc.dsc10.com

Activity 🚘

Summary

Summary

Next time

`pandas`

But `pandas` is not so cute...

Enter `babypandas`!

DataFrames in `babypandas` 🐼

Example 6: Which neighborhood has the most `'Weed Cleanup'` requests?

Original Question: Which neighborhood has the most `'Weed Cleanup'` requests?