Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
UBC-DSCI
GitHub Repository: UBC-DSCI/dsci-100-assets
Path: blob/master/2021-spring/slides/03_wrangling.ipynb
2051 views
Kernel: R

DSCI 100 - Introduction to Data Science

Lecture 3 - Wrangling to get tidy data

Credit to Jenny Brian's slides and Garrett Grolemund's tidying example

While you wait... enjoy this plot of the mtcars data set from the new {ggbernie} R package 😂:

Housekeeping

  • Quiz 1 will be February 11th during the tutorial time for your registered section (i.e., in class)

    • 45 mins

    • covers all the material from weeks 1 - 4 (but none of the optional web scraping is included)

    • open book (but not collaborative)

    • will be served on Canvas.

      • Make sure you have either Chrome or Firefox. Safari has been known to have trouble loading images and math.

    • we will work on projects after

    • practice questions will be posted -- make sure you can access the quiz on Canvas

    • Invigilate using Zoom (see upcoming Canvas post "Information about the quizzes" for details) -- make sure you can use Zoom with camera on your personal device

  • You should see a grade for worksheet 1 and 2 on Canvas (unless you registered late)

  • We are working in our project groups today (different from the breakout room groups)

    • Group contracts due Jan 30

Reminder

Where are we? Where are we going?

image source: R for Data Science by Grolemund & Wickham

Data Wrangling!

image: Medium

  • In the real world, when you get data, it's usually very messy

    • inconsistent format (commas, tabs, semicolons, missing data, extra empty lines)

    • split into multiple files (e.g. yearly recorded data over many years)

    • corrupted files, custom formats

  • when you read it successfully into R, it will often still be very messy

  • you need to make your data "tidy"

What is Tidy Data?

Illustrations from the Openscapes blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst"

Like families...

Comparison of work benches:

Friends with similar tools:

Easier for automation & iteration!

And it makes all other tidy datasets seem more welcoming!

So make friends with tidy data!

Examples of Tidy(?) Data

...here is the same data represented in a few different ways. Let's vote on which are tidy!

Tuberculosis data

This data is tidy. True or false?

countryyearrate
Afghanistan1999745/19987071
Afghanistan20002666/20595360
Brazil199937737/172006362
Brazil200080488/174504898
China1999212258/1272915272
China2000213766/1280428583

False. Multiple variables are stored in one column.

the rate column is a combination of the number of cases and population each cell (row, column pair) does not correspond to a single value

Tuberculosis data

This data is tidy. True or false?

countrycases (year=1999)cases (year=2000)
Afghanistan7452666
Brazil3773780488
China212258213766
countrypopulation (year=1999)population (year=2000)
Afghanistan1998707120595360
Brazil172006362174504898
China12729152721280428583

False. A single observational unit is stored in multiple tables.

Here we have data values about a single type of observational unit spread out over multiple tables. These tables are split up by the number of cases of tuberculosis and the population of each country, so that each represents a single each row is not a single observation

Tuberculosis data

This data is tidy. True or false?

| country | 1999 | 2000 | |---------|-------|-------| | Afghanistan | 745 | 2666 | |Brazil |37737 | 80488 | | China | 212258 | 213766 |

False. Column headers are values, not variable names.

Here we see the columns headers are values (the year) not variable names. This common type of messy dataset is tabular data designed for presentation, where variables form both the rows and columns, and column headers are values, not variable names.

each column does not corresponds to a single variable since year is across the top

Tuberculosis data

This data is tidy. True or false?

countryyearkeyvalue
Afghanistan1999cases745
Afghanistan1999population19987071
Afghanistan2000cases2666
Afghanistan2000population20595360
Brazil1999cases37737
Brazil1999population172006362
Brazil2000cases80488
Brazil2000population174504898
China1999cases212258
China1999population1272915272
China2000cases213766
China2000population1280428583

False. Variables are stored in both rows and columns. It has variables in individual columns (country, year), and across rows (population, cases) in the "key" column

each row does not corresponds to a single observation, each column does not corresponds to a single variable, and each cell (row, column pair) does not correspond to a single value

Tuberculosis data

This data is tidy. True or false?

countryyearcasespopulation
Afghanistan199974519987071
Afghanistan2000266620595360
Brazil199937737172006362
Brazil200080488174504898
China19992122581272915272
China20002137661280428583
  • each row corresponds to a single observation,

  • each column corresponds to a single variable, and

  • each cell (row, column pair) correspond to a single value

Tools for tidying:

  • tidyverse package functions from:

    • dplyr package (select, filter, mutate, group_by, summarize)

    • tidyr package (pivot_longer, pivot_wider)

    • purrr package (map_df)

Demo Time!

library(tidyverse) library(palmerpenguins) options(repr.matrix.max.rows = 6)
── Attaching packages ───────────────────────────────────────────────────────────────── tidyverse 1.3.0 ── ✔ ggplot2 3.3.2 ✔ purrr 0.3.4 ✔ tibble 3.0.4 ✔ dplyr 1.0.2 ✔ tidyr 1.1.2 ✔ stringr 1.4.0 ✔ readr 1.4.0 ✔ forcats 0.5.0 ── Conflicts ──────────────────────────────────────────────────────────────────── tidyverse_conflicts() ── ✖ dplyr::filter() masks stats::filter() ✖ dplyr::lag() masks stats::lag()
penguins

Data for 344 penguins. There are 3 different species of penguins in this dataset, collected from 3 islands in the Antarctica.

Select

The select function is used to choose a subset of columns (variables) in a dataframe.

e.g. select only numeric variables

# select penguins example
penguins penguins_select <- select(penguins, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) penguins_select penguins_no_island <- select(penguins, -island) penguins_no_island

Note to students: you don't always have to select columns when you tidy data

Filter

The filter function is used to choose a subset of rows (observations) in a dataframe.

e.g. filter for only penguins with long flippers

#penguins filter example penguins
penguins penguins_long <- filter(penguins, flipper_length_mm > 190) penguins_long

Mutate

The mutate function transforms old columns to add new columns.

e.g. convert body mass from grams to pounds

#penguin mutate example
penguin_pounds <- mutate(penguins, body_mass_lb = body_mass_g / 454) penguin_pounds

Pipe Operator %>%

When you need to do a long sequence of operations on data, you could:

1. Save intermediate objects

You could save intermediate objects (hard to read):

penguins_1 <- mutate(penguins, body_mass_lb = body_mass_g/454) penguins_2 <- filter(penguins_1, flipper_length_mm > 190) penguins_3 <- select(penguins_2, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_lb) penguins3

Disadvantages:

  • The reader may be tricked into thinking the named penguins_1 and penguins_2 objects are important for some reason, while they are just temporary intermediate computations.

  • Further, the reader has to look through and find where penguins_1 and penguins_2 are used in each subsequent line.

2. Composing functions

You could compose functions:

penguins_composed <- select(filter(mutate(penguins, body_mass_lb = body_mass_g/454), flipper_length_mm > 190), bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_lb) penguins_composed

Disadvantage:

  • Difficult to read

3. Piping

You can also pipe with the %>% symbol: passes the output of a function to the 1st argument of another.

penguins_piped <- penguins %>% mutate(body_mass_lb = body_mass_g/454) %>% filter(flipper_length_mm > 190) %>% select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_lb) penguins_piped

note: if R sees a %>% at the end of a line, it keeps reading the next line of code before evaluating!

Advantage:

Pipes make code more readable when you need to do a long sequence of operations on data.

Another big concept this week: iteration

  • iteration is when you need to do something repeatedly (e.g., ringing in and bagging groceries at the till)

Tidyverse tools for iteration

  1. group_by + summarize: iterate over groups of rows

  2. map_df: iterate over columns

group_by + summarize

e.g. calculate the average body mass (body_mass_g) for each penguin species from the penguins dataset

#preview data frame penguins

To do this we use group_by + summarize to iterate over species, calculating average body mass.

# calculate average body mass for each species
`summarise()` ungrouping output (override with `.groups` argument)
avg_body_mass <- penguins %>% group_by(species) %>% summarise(avg_body_mass = mean(body_mass_g)) avg_body_mass

What's going on with those NA's?

# calculate average body mass for each species handling NA's
avg_body_mass <- penguins %>% group_by(species) %>% summarise(avg_body_mass = mean(body_mass_g, na.rm = TRUE)) avg_body_mass

map_df

e.g. calculate the average rate of each type of crime from the USAarrests data

Be careful: There are a lot of map_... functions that you could use (map, map_lgl, map_chr, etc). Usually in this course we'll only use map_df, but you should investigate the others yourself!

  • map() is a function for applying a function to each element of a list (similar to lapply() )

  • map_df() returns a data frame

  • map() returns a list, map_lgl() a logical vector, map_int() an integer vector, map_dbl() a double vector, and map_chr() a character vector.

  • map(YOUR_LIST, YOUR_FUNCTION)

  • This data set contains statistics, in arrests per 100,000 residents for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percent of the population living in urban areas.

  • Vector, Array, List and Data Frame are 4 basic data types defined in R. Knowing the differences between them will help you use R more efficiently.

  • Vector: All elements must be of the same type.

  • Matrix/array: A matrix is a vector with two additional attributes: the number of rows and the number of columns.

  • List can contain elements of different types.

  • Dataframe: A data frame is used for storing data tables. It is a list of vectors of equal length.

head(USArrests)
#calculate average per capita crime rate for each type of crime
USavg <- map_df(USArrests, mean) # returns a dataframe USavg map(USArrests, mean) # returns a list class(USavg)
  • Vector, Array, List and Data Frame are 4 basic data types defined in R. Knowing the differences between them will help you use R more efficiently.

  • Vector: All elements must be of the same type.

  • Matrix/array: A matrix is a vector with two additional attributes: the number of rows and the number of columns.

  • List can contain elements of different types.

  • Dataframe: A data frame is used for storing data tables. It is a list of vectors of equal length.

Go forth and wrangle!

  1. Go to your project groups (different from your breakout rooms)

  2. Work on the group contract (due Jan 30): https://canvas.ubc.ca/courses/59079/assignments/784941

  3. Discuss platforms for working on the group project together (e.g. via Canvas, exchange emails/phone, Slack)

  4. Go ahead and get started on the worksheet! We'll be here to help if you need it.

  5. If any members of your group aren't here today -- reach out to them via Canvas!

What did we learn?