Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
UBC-DSCI
GitHub Repository: UBC-DSCI/dsci-100-assets
Path: blob/master/2019-fall/slides/03_wrangling.ipynb
2051 views
Kernel: R

DSCI 100 - Introduction to Data Science

Lecture 3 - Wrangling to get tidy data

2019-09-19

Reminder

Where are we? Where are we going?

image source: R for Data Science by Grolemund & Wickham

How should you wrangle your data?

We make it "tidy"

What is tidy data?

A tidy data is one that is satified by these three criteria:

  • each row is a single observation,

  • each variable is a single column, and

  • each value is a single cell (i.e., its row, column position in the data frame is not shared with another value)

Source: R for Data Science by Garrett Grolemund & Hadley Wickham

A tale of 4 data tables...

...here is the same data represented in 4 different ways, let's vote on which are tidy

Statistical question: What variables are associated with the number of TB cases?

This data is tidy, true or false?

countryyearrate
Afghanistan1999745/19987071
Afghanistan20002666/20595360
Brazil199937737/172006362
Brazil200080488/174504898
China1999212258/1272915272
China2000213766/1280428583
runnerregioncountryspeed
Olusoji FasubaAfricaNigeria9.85/100
Murielle AhouréAfricaIvory Coast10.78/100
Femi OgunodeAsiaQatar9.91/100
Su BingtianAsiaChina9.91/100
Li XuemeiAsiaChina10.79
countrycases (year=1999)cases (year=2000)
Afghanistan7452666
Brazil3773780488
China212258213766
countrypopulation (year=1999)population (year=2000)
Afghanistan1998707120595360
Brazil172006362174504898
China12729152721280428583

Statistical question: What variables are associated with the number of TB cases?

This data is tidy, true or false?

countryyearcasespopulation
Afghanistan199974519987071
Afghanistan2000266620595360
Brazil199937737172006362
Brazil200080488174504898
China19992122581272915272
China20002137661280428583

Statistical question: What variables are associated with the number of TB cases?

This data is tidy, true or false?

countryyearkeyvalue
Afghanistan1999cases745
Afghanistan1999population19987071
Afghanistan2000cases2666
Afghanistan2000population20595360
Brazil1999cases37737
Brazil1999population172006362
Brazil2000cases80488
Brazil2000population174504898
China1999cases212258
China1999population1272915272
China2000cases213766
China2000population1280428583

Tools for getting it there:

  • tidyverse package functions from:

    • dplyr package (select, filter, mutate, group_by, summarize)

    • tidyr package (gather)

    • purrr package (*map*)

Another big concept this week: iteration

  • iteration is when you need to do something repeatedly (e.g., ringing in and bagging groceries at the till)

Tidyverse tools for iteration

  1. group_by + summarize

  2. *map*

group_by + summarize

  • useful when you want to do something repeatedly to a group of rows

  • an example, we want to calculate the average life expectancy (lifeExp) for each continent from the gapminder data set

library(tidyverse) gapminder <- read_csv("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/master/data/gapminder_data.csv")
── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ── ggplot2 3.2.0 purrr 0.3.2 tibble 2.1.3 dplyr 0.8.3 tidyr 0.8.3 stringr 1.4.0 readr 1.3.1 forcats 0.4.0 ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ── dplyr::filter() masks stats::filter() dplyr::lag() masks stats::lag() Parsed with column specification: cols( country = col_character(), year = col_double(), pop = col_double(), continent = col_character(), lifeExp = col_double(), gdpPercap = col_double() )
#preview data frame

First, let's filter for only 1 year, 2007

Now let's use group_by + summarize to iterate

Goal: calculate average life expectancy in 2007 for each continent

*map*

  • useful when you want to do something repeatedly to almost anything (we'll give the example of columns in a data frame)

  • an example, we want to calculate the average value for each column from the USAarrests data to get the average across all US states

head(USArrests)

use *map* to iterate

Calculate the average/mean of each column:

But why isn't our output a data frame?

*map* functions output depends on which function you use...

map functionOutput
map()list
map_lgl()logical vector
map_int()integer vector
map_dbl()double vector
map_chr()character vector
map_df()data frame

use map_df instead:

Go forth and wrangle!

we'll be here to help if you need it!

Class activity 1

Calculate the mean petal length for each of the Iris (flower) species in the iris dataset. Post your answer on Piazza when you are done.

Class activity 2

Use map_df to caclulate the mean of each of the numerical columns in the iris dataset. Post your answer on Piazza when you are done.

What did we learn?