Lecture 7 – Data Visualization
DSC 10, Fall 2022
Announcements
Lab 2 is due on Saturday 10/8 at 11:59pm.
Homework 2 is due on Tuesday 10/11 at 11:59pm.
Take a look at the solutions notebook and supplemental video for Lecture 6.
Agenda
Why visualize?
Terminology.
Scatter plots.
Line plots.
Bar charts.
Don't forget about the DSC 10 Reference Sheet and the Resources tab of the course website!
Aside: keyboard shortcuts
There are several keyboard shortcuts built into Jupyter Notebooks designed to help you save time. To see them, either click the keyboard button in the toolbar above or hit the H key on your keyboard (as long as you're not actively editing a cell).
Particularly useful shortcuts:
| Action | Keyboard shortcut |
|---|---|
| Run cell + jump to next cell | SHIFT + ENTER |
| Save the notebook | CTRL/CMD + S |
| Create new cell above/below | A/B |
| Delete cell | DD |
| Convert cell to Markdown | M |
| Convert cell to code | Y |
Note: the last four only work if you're not actively editing a cell (to exit "edit mode", click somewhere outside of a cell).
Why visualize?
Run these cells to load the Little Women data from Lecture 1.
Little Women
In Lecture 1, we were able to answer questions about the plot of Little Women without having to read the novel. Some of those questions included:
Who is the main character?
Which pair of characters gets married in Chapter 35?
Napoleon's March

John Snow

Why visualize?
Computers are better than humans at crunching numbers, but humans are better at identifying visual patterns.
Visualizations allow us to understand lots of data quickly; they make it easier to spot trends and communicate our results to others.
There are many types of visualizations; in this class, we'll look at scatter plots, line plots, bar charts, and histograms, but there are many others.
The right choice depends on the type of data.
Terminology
Individuals and variables

Individual (row): Person/place/thing for which data is recorded. Also called an observation.
Variable (column): Something that is recorded for each individual. Also called a feature.
Types of variables
There are two main types of variables:
Numerical: It makes sense to do arithmetic with the values.
Categorical: Values fall into categories, which may or may not have some order to them.
Examples of numerical variables
Salaries of NBA players.
Individual: an NBA player.
Variable: their salary.
Movie gross earnings.
Individual: a movie.
Variable: its gross earnings.
Booster doses administered per day.
Individual: a date.
Variable: number of booster doses administered on that date.
Examples of categorical variables
Movie genres.
Individual: a movie.
Variable: its genre.
Zip codes.
Individual: a US resident.
Variable: their zip code.
Even though they look like numbers, zip codes are categorical (arithmetic doesn't make sense).
Level of prior programming experience for students in DSC 10.
Individual: student in DSC 10.
Variable: their level of prior programming experience, e.g. none, low, medium, or high.
There is an order to these categories!
Concept Check – Answer at cc.dsc10.com
Which of these is not a numerical variable?
A. Fuel economy in miles per gallon.
B. Number of quarters at UCSD.
C. College at UCSD (Sixth, Seventh, etc).
D. Bank account number.
E. More than one of these are not numerical variables.
Types of visualizations
The type of visualization we create depends on the kinds of variables we're visualizing.
Scatter plot: numerical vs. numerical.
Line plot: sequential numerical (time) vs. numerical.
Bar chart: categorical vs. numerical.
Histogram: numerical.
Will cover next time.
Note: We may interchange the words "plot", "chart", and "graph"; they all mean the same thing.
Scatter plots
Dataset of 50 top-grossing actors
| Column | Contents |
|---|---|
| 'Actor' | Name of actor |
| 'Total Gross' | Total gross domestic box office receipt, in millions of dollars, of all of the actor's movies |
| 'Number of Movies' | The number of movies the actor has been in |
| 'Average per Movie' | Total gross divided by number of movies |
| '#1 Movie' | The highest grossing movie the actor has been in |
| 'Gross' | Gross domestic box office receipt, in millions of dollars, of the actor's #1 Movie |
Scatter plots
What is the relationship between 'Number of Movies' and 'Total Gross'?
Scatter plots
Scatter plots visualize the relationship between two numerical variables.
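To create one from a DataFrame df, use df.plot with kind='scatter' and with x and y set to the names of the two numerical columns.
The resulting scatter plot has one point per row of df.
If you put a semicolon after a call to .plot, it will hide the text output that would otherwise display above the plot.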
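As a minimal sketch of the call described above: the values below are made up for illustration (they are not the real actors dataset), and plain pandas is assumed in place of whatever DataFrame library the course notebook imports.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs outside a notebook
import pandas as pd

# Made-up values for illustration only; not the real actors dataset.
toy_actors = pd.DataFrame({
    "Actor": ["Actor A", "Actor B", "Actor C", "Actor D"],
    "Number of Movies": [10, 25, 40, 60],
    "Total Gross": [900.0, 1800.0, 2500.0, 3100.0],
})

# x goes on the horizontal axis, y on the vertical axis.
ax = toy_actors.plot(kind="scatter", x="Number of Movies", y="Total Gross")

# The plot holds one point per row of the DataFrame.
num_points = len(ax.collections[0].get_offsets())
```

In a notebook, the plot appears directly under the cell; num_points is only computed here to confirm the one-point-per-row claim.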
Scatter plots
What is the relationship between 'Number of Movies' and 'Average per Movie'?
Note that in the above plot, there's a negative association and an outlier.
Who was in 60 or more movies?
Who is the outlier?
Whoever they are, they made very few movies, but those movies were high-grossing.
Anthony Daniels

Line plots
Dataset aggregating movies by year
| Column | Contents |
|---|---|
| 'Year' | Year |
| 'Total Gross in Billions' | Total domestic box office gross, in billions of dollars, of all movies released |
| 'Number of Movies' | Number of movies released |
| '#1 Movie' | Highest grossing movie |
Line plots
How has the number of movies changed over time?
Line plots
Line plots show trends in numerical variables over time.
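To create one from a DataFrame df, use df.plot with kind='line', x set to the name of the time column, and y set to the name of the numerical column.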
Plotting tip
Tip: if you want the x-axis to be the index, omit the x= argument!
This doesn't work for scatter plots, but it works for most other plot types.
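The tip can be checked with a small sketch. The counts are made up, and plain pandas is assumed in place of the course's DataFrame library. Both calls below should draw the same line.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs outside a notebook
import pandas as pd

# Made-up counts for illustration only.
toy_by_year = pd.DataFrame({
    "Year": [2017, 2018, 2019, 2020],
    "Number of Movies": [100, 120, 110, 50],
})

# Option 1: 'Year' is a regular column, so pass it as x=.
ax1 = toy_by_year.plot(kind="line", x="Year", y="Number of Movies")

# Option 2: 'Year' is the index, so x= can be omitted.
ax2 = toy_by_year.set_index("Year").plot(kind="line", y="Number of Movies")

# Both lines use the years as their horizontal positions.
x_explicit = list(ax1.get_lines()[0].get_xdata())
x_from_index = list(ax2.get_lines()[0].get_xdata())
```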
Since the year 2000
We can create a line plot of just 2000 onwards by querying movies_by_year before calling .plot.
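A sketch of querying before plotting, under two assumptions: the toy table below (with made-up counts) stands in for movies_by_year, and 'Year' is taken to be its index. If 'Year' were a regular column instead, the condition would compare that column rather than the index.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs outside a notebook
import pandas as pd

# Made-up counts; this toy table stands in for movies_by_year.
toy_movies_by_year = pd.DataFrame({
    "Year": [1998, 1999, 2000, 2001, 2002],
    "Number of Movies": [10, 12, 11, 15, 14],
}).set_index("Year")

# Keep only the rows for 2000 onwards, then plot what is left.
since_2000 = toy_movies_by_year[toy_movies_by_year.index >= 2000]
ax = since_2000.plot(kind="line", y="Number of Movies")

years_plotted = list(ax.get_lines()[0].get_xdata())
```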
What do you think explains the declines around 2008 and 2020?
How did this affect total gross?
What was the top grossing movie of 2016?
Bar charts
Dataset of the global top 200 songs on Spotify as of Tuesday (10/4/22)
Bar charts
How many streams do the top 10 songs have?
Bar charts
Bar charts visualize the relationship between a categorical variable and a numerical variable.
In a bar chart...
The thickness and spacing of bars are arbitrary.
The order of the categorical labels doesn't matter.
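To create one from a DataFrame df, use df.plot with kind='barh', x set to the name of the categorical column, and y set to the name of the numerical column.
The "h" in 'barh' stands for "horizontal".
It's easier to read labels this way.
In the previous chart, we set y='Streams' even though streams are measured by x-axis length.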
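A small sketch of a horizontal bar chart. The song names and stream counts are made up, 'track_name' is a hypothetical column name, and plain pandas is assumed in place of the course's DataFrame library. It also shows that the y= column ends up controlling bar length along the horizontal axis.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs outside a notebook
import pandas as pd

# Made-up songs and stream counts; 'track_name' is a hypothetical column name.
toy_songs = pd.DataFrame({
    "track_name": ["Song A", "Song B", "Song C"],
    "Streams": [300, 200, 100],
})

# x= names the categorical column (bar labels), y= the numerical column (bar lengths).
ax = toy_songs.plot(kind="barh", x="track_name", y="Streams")

# In a horizontal bar chart, each bar's width is the value from the y= column.
bar_lengths = [patch.get_width() for patch in ax.patches]
```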
How many songs do the top 15 artists have in the top 200?
First, let's create a DataFrame with a single column that describes the number of songs in the top 200 per artist. This involves using .groupby with .count(). Since we want one row per artist, we will group by 'artist_names'.
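The grouping step might look like the sketch below. Only 'artist_names' comes from the lecture; the other column names and all of the rows are made up, and plain pandas is assumed in place of the course's DataFrame library.

```python
import pandas as pd

# Made-up chart rows; 'track_name' and 'Streams' are hypothetical column names.
toy_chart = pd.DataFrame({
    "artist_names": ["Artist A", "Artist B", "Artist A", "Artist C", "Artist A", "Artist B"],
    "track_name": ["Song 1", "Song 2", "Song 3", "Song 4", "Song 5", "Song 6"],
    "Streams": [500, 400, 300, 200, 100, 50],
})

# One row per artist; every remaining column holds the number of rows in that group.
songs_per_artist = toy_chart.groupby("artist_names").count()

counts_by_track = list(songs_per_artist.get("track_name"))
counts_by_streams = list(songs_per_artist.get("Streams"))
```

After .count(), 'artist_names' becomes the index, and every other column carries the same per-artist count.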
Using .sort_values and .take, we'll keep just the top 15 artists. Note that all columns in songs_per_artist contain the same information (this is a consequence of using .count()).
Using .assign and .drop, we'll create a column named 'count' that contains the same information that the other 3 columns contain, and then .get only that column (or equivalently, drop the other 3 columns).
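A sketch of the sort, take, assign, and drop steps on a toy table of counts. The table has only two redundant columns rather than three, keeps the top 2 artists rather than the top 15, and uses hypothetical column names; plain pandas is assumed in place of the course's DataFrame library.

```python
import numpy as np
import pandas as pd

# Toy result of a groupby + count: every column holds the same per-artist count.
toy_songs_per_artist = pd.DataFrame(
    {"track_name": [3, 1, 2, 5], "Streams": [3, 1, 2, 5]},
    index=pd.Index(["Artist A", "Artist B", "Artist C", "Artist D"], name="artist_names"),
)

# Sort in decreasing order of count and keep the first 2 rows.
top_2 = (
    toy_songs_per_artist
    .sort_values(by="Streams", ascending=False)
    .take(np.arange(2))
)

# Copy the counts into a clearly named column, then drop the redundant ones.
top_2 = top_2.assign(count=top_2.get("Streams")).drop(columns=["track_name", "Streams"])
```

Any of the original columns could be used as the source of 'count', since they all contain the same numbers.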
Before calling .plot(kind='barh', y='count'), we'll sort top_15_artists by 'count' in increasing order. This is because horizontal bar charts draw the first row at the bottom, so the rows appear in reverse order when read from top to bottom.
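To see why the increasing sort puts the largest bar on top, here is a sketch with made-up counts (plain pandas assumed). It reads the bars back from the plot in bottom-to-top order.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs outside a notebook
import pandas as pd

# Made-up counts, currently in decreasing order.
toy_top_artists = pd.DataFrame(
    {"count": [5, 3, 2]},
    index=["Artist D", "Artist A", "Artist C"],
)

# Sort in increasing order, then draw the horizontal bar chart.
ax = toy_top_artists.sort_values(by="count").plot(kind="barh", y="count")

# Read the bar lengths from the lowest bar to the highest bar.
bars_bottom_to_top = [
    patch.get_width() for patch in sorted(ax.patches, key=lambda p: p.get_y())
]
```

The smallest count lands at the bottom and the largest at the top, which is the order a reader usually expects.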
Vertical bar charts
To create a vertical bar chart, use kind='bar' instead of kind='barh'. These are typically harder to read, though.
Aside: How many streams did Justin Bieber's songs on the chart receive?
It seems like we're missing a popular song...
How do we include featured songs, as well?
Answer: Using .str.contains.
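A sketch of the difference between an exact match and .str.contains. The rows and stream counts are made up, 'Artist X' and 'Artist Y' are placeholders, and plain pandas is assumed in place of the course's DataFrame library.

```python
import pandas as pd

# Made-up rows; a featured song lists several artists in 'artist_names'.
toy_chart = pd.DataFrame({
    "artist_names": ["Justin Bieber", "Artist X, Justin Bieber", "Artist Y"],
    "Streams": [100, 250, 400],
})

# Exact match: misses the song where he is one of several listed artists.
solo_only = toy_chart[toy_chart.get("artist_names") == "Justin Bieber"]

# Substring match: keeps every row whose artist list mentions him.
with_features = toy_chart[toy_chart.get("artist_names").str.contains("Justin Bieber")]

solo_total = solo_only.get("Streams").sum()
total_with_features = with_features.get("Streams").sum()
```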
Fun demo
Let's find the URI of a song we care about.
Watch what happens!
Try it out yourself!
Summary
Visualizations make it easy to extract patterns from datasets.
There are two main types of variables: categorical and numerical.
The types of the variables we're visualizing inform our choice of which type of visualization to use.
Today, we looked at scatter plots, line plots, and bar charts.
Next time: Histograms and overlaid plots.
Let's discuss!
As mentioned earlier, visualizations allow us to easily spot trends and communicate our results with others.
Some visualizations make it harder to see the trend in the data by:
Adding "chart junk."
Using misleading axes and sizes.

In this thread on EdStem, post some examples of particularly misleading or interesting visualizations! We'll share our favorites in class on Monday.