dsc-courses
GitHub Repository: dsc-courses/dsc10-2022-fa
Path: blob/main/lectures/lec28/lec28-solutions.ipynb
Kernel: Python 3 (ipykernel)
# Set up packages for lecture. Don't worry about understanding this code, but
# make sure to run it if you're following along.
import numpy as np
import babypandas as bpd
import pandas as pd
from matplotlib_inline.backend_inline import set_matplotlib_formats
import matplotlib.pyplot as plt
from scipy import stats
import otter

set_matplotlib_formats("svg")
plt.style.use('ggplot')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option("display.max_rows", 7)
pd.set_option("display.max_columns", 8)
pd.set_option("display.precision", 2)

# Setup to start where we left off last time
keep_cols = ['business_name', 'inspection_date', 'inspection_score',
             'risk_category', 'Neighborhoods', 'Zip Codes']
restaurants_full = bpd.read_csv('data/restaurants.csv').get(keep_cols)
bakeries = restaurants_full[restaurants_full.get('business_name').str.lower().str.contains('bake')]
bakeries = bakeries[bakeries.get('inspection_score') >= 0]  # Keeping only the rows where we know the inspection score

Lecture 28 – Review, Conclusion

DSC 10, Fall 2022

Announcements

  • The Final Exam is tomorrow from 11:30am to 2:30pm in Center Hall.

  • Come to WLH 2205 from 5-9pm tonight for the last study session/group office hours. See the calendar for today's other office hours.

  • If at least 80% of the class fills out both CAPEs and the End of Quarter Survey, then we will add 0.5% of extra credit to everyone's overall grade.

    • The deadline is tomorrow at 8am!

    • As of earlier today, we're at 67% on CAPEs and 51% on the End of Quarter Survey.

Agenda

  • More review.

  • Working on personal projects.

  • Demo: Gapminder 🌎.

  • Some parting thoughts.

Bakeries 🧁

Let's pick up where we left off last lecture.

bakeries
np.random.seed(23)  # Ignore this
sample_of_bakeries = bakeries.sample(200)  # SOLUTION
sample_of_bakeries

Concept Check ✅ – Answer at cc.dsc10.com

Using a single sample of 200 bakeries, how can we estimate the median inspection score of all bakeries in San Francisco with an inspection score? What technique should we use?

A. Standard hypothesis testing

B. Permutation testing

C. Bootstrapping

D. The Central Limit Theorem

Click for the answer after you've entered your guess above. Don't scroll any further.

Bootstrapping. The CLT only applies to sample means (and sums), not to any other statistics.

There is no CLT for sample medians, so instead we'll have to resort to bootstrapping to estimate the distribution of the sample median.
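To see the contrast, here's a minimal sketch of the CLT-based 95% confidence interval we could build if our statistic were the mean. The sample here is made up (a hypothetical batch of 200 inspection-score-like values), since the point is just the formula: mean ± 2 standard errors.

```python
import numpy as np

rng = np.random.default_rng(23)
# Hypothetical stand-in for a sample of 200 inspection scores.
sample = rng.normal(86, 3, size=200)

# CLT-based 95% CI for the sample MEAN: mean plus/minus 2 standard errors.
mean = sample.mean()
se = np.std(sample) / np.sqrt(200)
[mean - 2 * se, mean + 2 * se]
```

No such shortcut formula exists for the median, which is why we bootstrap instead.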

Recall, bootstrapping is the act of sampling from the original sample, with replacement. This is also called resampling.
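On a tiny made-up array, resampling with replacement looks like this (the values and seed are hypothetical; note that some values can repeat while others are left out entirely):

```python
import numpy as np

rng = np.random.default_rng(23)
original = np.array([85, 88, 90, 92, 96])

# A bootstrap resample: the SAME size as the original sample,
# drawn WITH replacement.
resample = rng.choice(original, size=5, replace=True)
resample
```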

# The median of our original sample – this is just one number
sample_of_bakeries.get('inspection_score').median()  # SOLUTION
# The median of a single bootstrap resample – this is just one number
sample_of_bakeries.sample(200, replace=True).get('inspection_score').median()  # SOLUTION

Let's resample repeatedly.

np.random.seed(23)  # Ignore this
boot_medians = np.array([])
# BEGIN SOLUTION
for i in np.arange(5000):
    boot_median = sample_of_bakeries.sample(200, replace=True).get('inspection_score').median()
    boot_medians = np.append(boot_medians, boot_median)
# END SOLUTION
boot_medians
bpd.DataFrame().assign(boot_medians=boot_medians).plot(kind='hist', density=True, ec='w', bins=10, figsize=(10, 5));

Note that this distribution is not at all normal.

To compute a 95% confidence interval, we take the middle 95% of the bootstrapped medians.

# BEGIN SOLUTION
left = np.percentile(boot_medians, 2.5)
right = np.percentile(boot_medians, 97.5)
[left, right]
# END SOLUTION

Discussion Question

Which of the following interpretations of this confidence interval are valid?

  1. 95% of SF bakeries have an inspection score between 85 and 88.

  2. 95% of the resamples have a median inspection score between 85 and 88.

  3. There is a 95% chance that our sample has a median inspection score between 85 and 88.

  4. There is a 95% chance that the median inspection score of all SF bakeries is between 85 and 88.

  5. If we had taken 100 samples from the same population, about 95 of these samples would have a median inspection score between 85 and 88.

  6. If we had taken 100 samples from the same population, about 95 of the confidence intervals created would contain the median inspection score of all SF bakeries.

Click for the answer after you've entered your guess above. Don't scroll any further.

The correct answers are Option 2 and Option 6.

Physicians 🩺

The setup

You work as a family physician. You collect data and you find that in 6354 patients, 3115 were children and 3239 were adults.

You want to test the following hypotheses:

  • Null Hypothesis: Family physicians see an equal number of children and adults.

  • Alternative Hypothesis: Family physicians see more adults than they see children.

Concept Check ✅ – Answer at cc.dsc10.com

Which test statistic(s) could be used for this hypothesis test? Which values of the test statistic point towards the alternative?

A. Proportion of children seen

B. Number of children seen

C. Number of children minus number of adults seen

D. Absolute value of number of children minus number of adults seen

There may be multiple correct answers; choose one.

Click for the answer after you've entered your guess above. Don't scroll any further.

All of these but the last one would work for this alternative. Small values of these statistics would favor the alternative.

If the alternative were instead "Family physicians see a different number of children and adults", the last option would work, while the first three wouldn't.
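For that two-sided alternative, one simulated value of option D's statistic under the null might be computed like this (the seed is hypothetical; the counts and probabilities come from the setup above):

```python
import numpy as np

np.random.seed(23)  # hypothetical seed
# Under the null, each of the 6354 patients is equally likely
# to be a child or an adult.
counts = np.random.multinomial(6354, [0.5, 0.5])

# Option D: LARGE values of this statistic favor the two-sided alternative.
abs(counts[0] - counts[1])
```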

Let's use option B, the number of children seen, as a test statistic. Small values of this statistic favor the alternative hypothesis.

How do we generate a single value of the test statistic?

np.random.multinomial(6354, [0.5, 0.5])[0] # SOLUTION

As usual, let's simulate the test statistic many, many times.

test_stats = np.array([])
# BEGIN SOLUTION
for i in np.arange(10000):
    stat = np.random.multinomial(6354, [0.5, 0.5])[0]
    test_stats = np.append(test_stats, stat)
# END SOLUTION
test_stats
bpd.DataFrame().assign(test_stats=test_stats) \
   .plot(kind='hist', density=True, ec='w', figsize=(10, 5), bins=20);
plt.axvline(3115, lw=3, color='black', label='observed statistic')
plt.legend();

Recall that you collected data and found that in 6354 patients, 3115 were children and 3239 were adults.

Concept Check ✅ – Answer at cc.dsc10.com

What goes in blank (a)?

p_value = np.count_nonzero(test_stats __(a)__ 3115) / 10000

A. >=

B. >

C. <=

D. <

Click for the answer after you've entered your guess above. Don't scroll any further.

C. <=. Since small values of the test statistic (the number of children) favor the alternative, the p-value is the proportion of simulated statistics that are as small as or smaller than the observed value.
# Calculate the p-value
np.count_nonzero(test_stats <= 3115) / 10000  # SOLUTION

Concept Check ✅ – Answer at cc.dsc10.com

What do we do, assuming that we're using a 5% p-value cutoff?

A. Reject the null

B. Fail to reject the null

C. It depends

Click for the answer after you've entered your guess above. Don't scroll any further.

Fail to reject the null, since the p-value is above 0.05.

Note that while we used np.random.multinomial to simulate the test statistic, we could have used np.random.choice, too:

choices = np.random.choice(['adult', 'child'], p=[0.5, 0.5], size=6354, replace=True)  # SOLUTION
choices
np.count_nonzero(choices == 'child') # SOLUTION

Concept Check ✅ – Answer at cc.dsc10.com

Is this an example of bootstrapping?

A. Yes, because we are sampling with replacement.

B. No, this is not bootstrapping.

Click for the answer after you've entered your guess above. Don't scroll any further.

No, this is not bootstrapping. Bootstrapping is when we resample from a single sample; here we're simulating data under the assumptions of a model.

Personal projects

Using Jupyter Notebooks after DSC 10

  • You may be interested in working on data science projects of your own.

  • In this video, we show you how to make blank notebooks and upload datasets of your own to DataHub.

  • Depending on the classes you're in, you may not have access to DataHub. Eventually, you'll want to install Jupyter Notebooks on your computer.

    • Anaconda is a great way to do that, as it also installs many commonly used packages.

    • You may want to download your work from DataHub so you can refer to it after the course ends.

    • Remember, all babypandas code is regular pandas code, too!
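For instance, here's a minimal sketch with a made-up DataFrame, run against plain pandas rather than babypandas — the familiar .assign and .get pattern works unchanged:

```python
import pandas as pd

# babypandas-style code is just pandas code: .assign and .get
# are regular pandas methods, so this runs without babypandas installed.
scores = pd.DataFrame().assign(inspection_score=[85, 88, 90])
scores.get('inspection_score').median()  # 88.0
```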

Finding data

These sites allow you to search for datasets (in CSV format) from a variety of different domains. Some may require you to sign up for an account; these are generally reputable sources.

Note that all of these links are also available at rampure.org/find-datasets.

Domain-specific sources of data

Tip: if a site only allows you to download a file as an Excel file, not a CSV file, you can download it, open it in a spreadsheet viewer (Excel, Numbers, Google Sheets), and export it to a CSV.

Join a DS3 Project Group 🤝

The Data Science Student Society’s Projects Committee has opened its applications for the Winter 2023 cohort!

  • Students have the opportunity to join a team to pursue a unique data science project that will last two quarters.

  • At the end of the project, teams will have developed a polished, complete personal project which they will showcase to their peers, faculty, and companies in the data science industry.

Whether you're looking for your first data science project to get you started, or you want to build on your existing portfolio with a new experience, the Projects Committee is a great platform.

Apply here by Sunday at 11:59pm (ignore the deadline on the form). Contact [email protected] with questions.

Demo: Gapminder 🌎

plotly

  • All of the visualizations (scatter plots, histograms, etc.) in this course were created using a library called matplotlib.

This library was called under the hood every time we wrote df.plot.

  • plotly is a different visualization library that allows us to create interactive visualizations.

  • You may learn about it in a future course, but we'll briefly show you some cool visualizations you can make with it.

import plotly.express as px

Gapminder dataset

Gapminder Foundation is a non-profit venture registered in Stockholm, Sweden, that promotes sustainable global development and achievement of the United Nations Millennium Development Goals by increased use and understanding of statistics and other information about social, economic and environmental development at local, national and global levels. - Gapminder Wikipedia

gapminder = px.data.gapminder()
gapminder

The dataset contains information for each country for several different years.

gapminder.get('year').unique()

Let's start by just looking at 2007 data (the most recent year in the dataset).

gapminder_2007 = gapminder[gapminder.get('year') == 2007]
gapminder_2007

Scatter plot

We can plot life expectancy vs. GDP per capita. If you hover over a point, you will see the name of the country.

px.scatter(gapminder_2007, x='gdpPercap', y='lifeExp', hover_name='country')

In future courses, you'll learn about transformations. Here, we'll apply a log transformation to the x-axis to make the plot look a little more linear.

px.scatter(gapminder_2007, x='gdpPercap', y='lifeExp', log_x=True, hover_name='country')
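To get a rough sense of what the log axis does, here's a small sketch with made-up GDP-per-capita values spanning several orders of magnitude (plotly's log_x uses a base-10 log axis):

```python
import numpy as np

gdp = np.array([500, 5_000, 50_000])  # hypothetical GDPs per capita

# On a log axis these land at roughly 2.7, 3.7, and 4.7 -- evenly spaced,
# even though the raw values differ by factors of 10.
np.log10(gdp)
```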

Animated scatter plot

We can take things one step further.

px.scatter(gapminder,
           x='gdpPercap',
           y='lifeExp',
           hover_name='country',
           color='continent',
           size='pop',
           size_max=60,
           log_x=True,
           range_y=[30, 90],
           animation_frame='year',
           title='Life Expectancy, GDP Per Capita, and Population over Time')

Watch this video if you want to see an even-more-animated version of this plot.

Animated histogram

px.histogram(gapminder,
             x='lifeExp',
             animation_frame='year',
             range_x=[20, 90],
             range_y=[0, 50],
             title='Distribution of Life Expectancy over Time')

Choropleth

px.choropleth(gapminder,
              locations='iso_alpha',
              color='lifeExp',
              hover_name='country',
              hover_data={'iso_alpha': False},
              title='Life Expectancy Per Country',
              color_continuous_scale=px.colors.sequential.tempo)

Parting thoughts

From Lecture 1: What is "data science"?

Data science is about drawing useful conclusions from data using computation. Throughout the quarter, we touched on several aspects of data science:

  • In the first 4 weeks, we used Python to explore data.

  • Lots of visualization 📈📊 and "data manipulation", using industry-standard tools.

  • In the next 4 weeks, we used data to draw inferences about a population, given just a sample.

  • We relied heavily on simulation, rather than formulas.

  • In the last 2 weeks, we used data from the past to predict what may happen in the future.

  • A taste of machine learning 🤖.

  • In future courses – including DSC 20 and 40A, which you may be taking next quarter – you'll revisit all three of these aspects of data science.

Note on grades

Suraj's freshman year transcript.

Don't let your grades define you; they don't tell the full story.

Procrastination

Adjusting to life in college can be challenging, particularly because it can be hard to manage your time wisely.

If you're interested, register to attend this workshop on overcoming procrastination taught by another data science professor next quarter – participants each receive a $50 Amazon gift card!

Thank you!

This course would not have been possible without...

  • Our graduate TA: Dasha Veraksa.

  • Our 20 undergraduate tutors: Gabriel Cha, Eric Chen, John Driscoll, Daphne Fabella, Charisse Hao, Dylan Lee, Daniel Li, Anthony Li, Anna Liu, Anastasiya Markova, Yash Potdar, Harshita Saha, Selim Shaalan, Yutian (Skylar) Shi, Tony Ta, Vineet Tallavajhala, Andrew Tan, Jiaxin Ye, Tiffany Yu, and Diego Zavalza.

  • Learn more about tutoring – it's fun, and you can be a tutor as early as your 3rd quarter at UCSD!

  • Keep in touch! dsc10.com/staff

    • After grades are released, we'll make a post on EdStem where you can ask course staff for advice on courses and UCSD more generally.

Good luck on your finals! 🎉

and see you tomorrow at 11:30am. ⏰