Path: blob/master/2021-spring/slides/03_wrangling.ipynb
2051 views
DSCI 100 - Introduction to Data Science
Lecture 3 - Wrangling to get tidy data
Credit to Jenny Brian's slides and Garrett Grolemund's tidying example
While you wait... enjoy this plot of the mtcars
data set from the new {ggbernie} R package 😂:
Housekeeping
Quiz 1 will be February 11th during the tutorial time for your registered section (i.e., in class)
45 mins
covers all the material from weeks 1 - 4 (but none of the optional web scraping is included)
open book (but not collaborative)
will be served on Canvas.
Make sure you have either Chrome or Firefox. Safari has been known to have trouble loading images and math.
we will work on projects after
practice questions will be posted -- make sure you can access the quiz on Canvas
Invigilate using Zoom (see upcoming Canvas post "Information about the quizzes" for details) -- make sure you can use Zoom with camera on your personal device
You should see a grade for worksheet 1 and 2 on Canvas (unless you registered late)
We are working in our project groups today (different from the breakout room groups)
Group contracts due Jan 30
Data Wrangling!
image: Medium
In the real world, when you get data, it's usually very messy
inconsistent format (commas, tabs, semicolons, missing data, extra empty lines)
split into multiple files (e.g. yearly recorded data over many years)
corrupted files, custom formats
when you read it successfully into R, it will often still be very messy
you need to make your data "tidy"
What is Tidy Data?
Illustrations from the Openscapes blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst"
Like families...
Comparison of work benches:
Friends with similar tools:
Easier for automation & iteration!
And it makes all other tidy datasets seem more welcoming!
So make friends with tidy data!
Examples of Tidy(?) Data
...here is the same data represented in a few different ways. Let's vote on which are tidy!
Tuberculosis data
This data is tidy. True or false?
country | year | rate |
---|---|---|
Afghanistan | 1999 | 745/19987071 |
Afghanistan | 2000 | 2666/20595360 |
Brazil | 1999 | 37737/172006362 |
Brazil | 2000 | 80488/174504898 |
China | 1999 | 212258/1272915272 |
China | 2000 | 213766/1280428583 |
False. Multiple variables are stored in one column.
the rate column is a combination of the number of cases and population each cell (row, column pair) does not correspond to a single value
Tuberculosis data
This data is tidy. True or false?
country | cases (year=1999) | cases (year=2000) |
---|---|---|
Afghanistan | 745 | 2666 |
Brazil | 37737 | 80488 |
China | 212258 | 213766 |
country | population (year=1999) | population (year=2000) |
---|---|---|
Afghanistan | 19987071 | 20595360 |
Brazil | 172006362 | 174504898 |
China | 1272915272 | 1280428583 |
False. A single observational unit is stored in multiple tables.
Here we have data values about a single type of observational unit spread out over multiple tables. These tables are split up by the number of cases of tuberculosis and the population of each country, so that each represents a single each row is not a single observation
Tuberculosis data
This data is tidy. True or false?
| country | 1999 | 2000 | |---------|-------|-------| | Afghanistan | 745 | 2666 | |Brazil |37737 | 80488 | | China | 212258 | 213766 |
False. Column headers are values, not variable names.
Here we see the columns headers are values (the year) not variable names. This common type of messy dataset is tabular data designed for presentation, where variables form both the rows and columns, and column headers are values, not variable names.
each column does not corresponds to a single variable since year is across the top
Tuberculosis data
This data is tidy. True or false?
country | year | key | value |
---|---|---|---|
Afghanistan | 1999 | cases | 745 |
Afghanistan | 1999 | population | 19987071 |
Afghanistan | 2000 | cases | 2666 |
Afghanistan | 2000 | population | 20595360 |
Brazil | 1999 | cases | 37737 |
Brazil | 1999 | population | 172006362 |
Brazil | 2000 | cases | 80488 |
Brazil | 2000 | population | 174504898 |
China | 1999 | cases | 212258 |
China | 1999 | population | 1272915272 |
China | 2000 | cases | 213766 |
China | 2000 | population | 1280428583 |
False. Variables are stored in both rows and columns. It has variables in individual columns (country, year), and across rows (population, cases) in the "key" column
each row does not corresponds to a single observation, each column does not corresponds to a single variable, and each cell (row, column pair) does not correspond to a single value
Tuberculosis data
This data is tidy. True or false?
country | year | cases | population |
---|---|---|---|
Afghanistan | 1999 | 745 | 19987071 |
Afghanistan | 2000 | 2666 | 20595360 |
Brazil | 1999 | 37737 | 172006362 |
Brazil | 2000 | 80488 | 174504898 |
China | 1999 | 212258 | 1272915272 |
China | 2000 | 213766 | 1280428583 |
each row corresponds to a single observation,
each column corresponds to a single variable, and
each cell (row, column pair) correspond to a single value
Tools for tidying:
tidyverse
package functions from:dplyr
package (select
,filter
,mutate
,group_by
,summarize
)tidyr
package (pivot_longer
,pivot_wider
)purrr
package (map_df
)
Demo Time!
── Attaching packages ───────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
✔ ggplot2 3.3.2 ✔ purrr 0.3.4
✔ tibble 3.0.4 ✔ dplyr 1.0.2
✔ tidyr 1.1.2 ✔ stringr 1.4.0
✔ readr 1.4.0 ✔ forcats 0.5.0
── Conflicts ──────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
Data for 344 penguins. There are 3 different species of penguins in this dataset, collected from 3 islands in the Antarctica.
Select
The select
function is used to choose a subset of columns (variables) in a dataframe.
e.g. select only numeric variables
Note to students: you don't always have to select columns when you tidy data
Filter
The filter
function is used to choose a subset of rows (observations) in a dataframe.
e.g. filter for only penguins with long flippers
Mutate
The mutate
function transforms old columns to add new columns.
e.g. convert body mass from grams to pounds
Pipe Operator %>%
When you need to do a long sequence of operations on data, you could:
1. Save intermediate objects
You could save intermediate objects (hard to read):
Disadvantages:
The reader may be tricked into thinking the named penguins_1 and penguins_2 objects are important for some reason, while they are just temporary intermediate computations.
Further, the reader has to look through and find where penguins_1 and penguins_2 are used in each subsequent line.
2. Composing functions
You could compose functions:
Disadvantage:
Difficult to read
3. Piping
You can also pipe with the %>%
symbol: passes the output of a function to the 1st argument of another.
note: if R sees a %>%
at the end of a line, it keeps reading the next line of code before evaluating!
Advantage:
Pipes make code more readable when you need to do a long sequence of operations on data.
Another big concept this week: iteration
iteration is when you need to do something repeatedly (e.g., ringing in and bagging groceries at the till)
Tidyverse tools for iteration
group_by
+summarize
: iterate over groups of rowsmap_df
: iterate over columns
group_by
+ summarize
e.g. calculate the average body mass (body_mass_g
) for each penguin species from the penguins
dataset
To do this we use group_by
+ summarize
to iterate over species, calculating average body mass.
What's going on with those NA's?
map_df
e.g. calculate the average rate of each type of crime from the USAarrests
data
Be careful: There are a lot of map_...
functions that you could use (map
, map_lgl
, map_chr
, etc). Usually in this course we'll only use map_df
, but you should investigate the others yourself!
map()
is a function for applying a function to each element of a list (similar to lapply() )map_df()
returns a data framemap()
returns a list,map_lgl()
a logical vector,map_int()
an integer vector,map_dbl()
a double vector,and map_chr()
a character vector.map(YOUR_LIST, YOUR_FUNCTION)
This data set contains statistics, in arrests per 100,000 residents for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percent of the population living in urban areas.
Vector, Array, List and Data Frame are 4 basic data types defined in R. Knowing the differences between them will help you use R more efficiently.
Vector: All elements must be of the same type.
Matrix/array: A matrix is a vector with two additional attributes: the number of rows and the number of columns.
List can contain elements of different types.
Dataframe: A data frame is used for storing data tables. It is a list of vectors of equal length.
Vector, Array, List and Data Frame are 4 basic data types defined in R. Knowing the differences between them will help you use R more efficiently.
Vector: All elements must be of the same type.
Matrix/array: A matrix is a vector with two additional attributes: the number of rows and the number of columns.
List can contain elements of different types.
Dataframe: A data frame is used for storing data tables. It is a list of vectors of equal length.
Go forth and wrangle!
Go to your project groups (different from your breakout rooms)
Work on the group contract (due Jan 30): https://canvas.ubc.ca/courses/59079/assignments/784941
Discuss platforms for working on the group project together (e.g. via Canvas, exchange emails/phone, Slack)
Go ahead and get started on the worksheet! We'll be here to help if you need it.
If any members of your group aren't here today -- reach out to them via Canvas!