GitHub Repository: UBC-DSCI/dsci-100-assets
Path: blob/master/2019-fall/slides/03_wrangling.ipynb
²⁰⁵¹ views

Kernel: R

DSCI 100 - Introduction to Data Science

Lecture 3 - Wrangling to get tidy data

2019-09-19

Reminder

Where are we? Where are we going?

image source: R for Data Science by Grolemund & Wickham

Shameless borrowing of slides from Jenny Bryan

https://www.slideshare.net/Plotly/plotcon-nyc-behind-every-great-plot-theres-a-great-deal-of-wrangling

How should you wrangle your data?

We make it "tidy"

What is tidy data?

A tidy data is one that is satified by these three criteria:

each row is a single observation,
each variable is a single column, and
each value is a single cell (i.e., its row, column position in the data frame is not shared with another value)

Source: R for Data Science by Garrett Grolemund & Hadley Wickham

A tale of 4 data tables...

...here is the same data represented in 4 different ways, let's vote on which are tidy

Example source: https://garrettgman.github.io/tidying/

Statistical question: What variables are associated with the number of TB cases?

This data is tidy, true or false?

country	year	rate
Afghanistan	1999	745/19987071
Afghanistan	2000	2666/20595360
Brazil	1999	37737/172006362
Brazil	2000	80488/174504898
China	1999	212258/1272915272
China	2000	213766/1280428583

runner	region	country	speed
Olusoji Fasuba	Africa	Nigeria	9.85/100
Murielle Ahouré	Africa	Ivory Coast	10.78/100
Femi Ogunode	Asia	Qatar	9.91/100
Su Bingtian	Asia	China	9.91/100
Li Xuemei	Asia	China	10.79

country	cases (year=1999)	cases (year=2000)
Afghanistan	745	2666
Brazil	37737	80488
China	212258	213766

country	population (year=1999)	population (year=2000)
Afghanistan	19987071	20595360
Brazil	172006362	174504898
China	1272915272	1280428583

Statistical question: What variables are associated with the number of TB cases?

This data is tidy, true or false?

country	year	cases	population
Afghanistan	1999	745	19987071
Afghanistan	2000	2666	20595360
Brazil	1999	37737	172006362
Brazil	2000	80488	174504898
China	1999	212258	1272915272
China	2000	213766	1280428583

Statistical question: What variables are associated with the number of TB cases?

This data is tidy, true or false?

country	year	key	value
Afghanistan	1999	cases	745
Afghanistan	1999	population	19987071
Afghanistan	2000	cases	2666
Afghanistan	2000	population	20595360
Brazil	1999	cases	37737
Brazil	1999	population	172006362
Brazil	2000	cases	80488
Brazil	2000	population	174504898
China	1999	cases	212258
China	1999	population	1272915272
China	2000	cases	213766
China	2000	population	1280428583

Tools for getting it there:

tidyverse package functions from:
- dplyr package (select, filter, mutate, group_by, summarize)
- tidyr package (gather)
- purrr package (*map*)

Another big concept this week: iteration

iteration is when you need to do something repeatedly (e.g., ringing in and bagging groceries at the till)

Tidyverse tools for iteration

group_by + summarize
*map*

`group_by` + `summarize`

useful when you want to do something repeatedly to a group of rows
an example, we want to calculate the average life expectancy (lifeExp) for each continent from the gapminder data set

In [1]:

library(tidyverse)
gapminder <- read_csv("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/master/data/gapminder_data.csv")

Out[1]:

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.2.0     ✔ purrr   0.3.2
✔ tibble  2.1.3     ✔ dplyr   0.8.3
✔ tidyr   0.8.3     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.4.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Parsed with column specification:
cols(
  country = col_character(),
  year = col_double(),
  pop = col_double(),
  continent = col_character(),
  lifeExp = col_double(),
  gdpPercap = col_double()
)

In [2]:

#preview data frame

First, let's filter for only 1 year, 2007

In [ ]:

Now let's use `group_by` + `summarize` to iterate

Goal: calculate average life expectancy in 2007 for each continent

In [ ]:

`map`

useful when you want to do something repeatedly to almost anything (we'll give the example of columns in a data frame)
an example, we want to calculate the average value for each column from the USAarrests data to get the average across all US states

In [3]:

head(USArrests)

Out[3]:

use `map` to iterate

Calculate the average/mean of each column:

In [ ]:

But why isn't our output a data frame?

`map` functions output depends on which function you use...

`map` function	Output
`map()`	list
`map_lgl()`	logical vector
`map_int()`	integer vector
`map_dbl()`	double vector
`map_chr()`	character vector
`map_df()`	data frame

use `map_df` instead:

In [ ]:

Go forth and wrangle!

we'll be here to help if you need it!

image source: https://media.giphy.com/media/Qgm6tIYrSQqC4/giphy-downsized-large.gif

Class activity 1

Calculate the mean petal length for each of the Iris (flower) species in the iris dataset. Post your answer on Piazza when you are done.

In [ ]:

Class activity 2

Use map_df to caclulate the mean of each of the numerical columns in the iris dataset. Post your answer on Piazza when you are done.

In [ ]:

DSCI 100 - Introduction to Data Science

Lecture 3 - Wrangling to get tidy data

2019-09-19

Reminder

Shameless borrowing of slides from Jenny Bryan

How should you wrangle your data?

We make it "tidy"

What is tidy data?

A tale of 4 data tables...

Tools for getting it there:

Another big concept this week: iteration

Tidyverse tools for iteration

`group_by` + `summarize`

First, let's filter for only 1 year, 2007

Now let's use `group_by` + `summarize` to iterate

`map`

use `map` to iterate

`map` functions output depends on which function you use...

use `map_df` instead:

Go forth and wrangle!

Class activity 1

Class activity 2

What did we learn?

Product

Resources

Company

DSCI 100 - Introduction to Data Science

Lecture 3 - Wrangling to get tidy data

2019-09-19

Reminder

Shameless borrowing of slides from Jenny Bryan

How should you wrangle your data?

We make it "tidy"

What is tidy data?

A tale of 4 data tables...

Tools for getting it there:

Another big concept this week: iteration

Tidyverse tools for iteration

group_by + summarize

First, let's filter for only 1 year, 2007

Now let's use group_by + summarize to iterate

*map*

use *map* to iterate

*map* functions output depends on which function you use...

use map_df instead:

Go forth and wrangle!

Class activity 1

Class activity 2

What did we learn?

`group_by` + `summarize`

Now let's use `group_by` + `summarize` to iterate

`map`

use `map` to iterate

`map` functions output depends on which function you use...

use `map_df` instead: