GitHub Repository: UBC-DSCI/dsci-100-assets
Path: blob/master/2019-spring/slides/03_wrangling.ipynb
Kernel: R

DSCI 100 - Introduction to Data Science

Lecture 3 - Wrangling to get tidy data

2019-01-17

Reminder

Where are we? Where are we going?

image source: R for Data Science by Grolemund & Wickham

Shameless borrowing of slides from Jenny Bryan

https://www.slideshare.net/Plotly/plotcon-nyc-behind-every-great-plot-theres-a-great-deal-of-wrangling

How should you wrangle your data?

We make it "tidy"

Lord of the Rings example

Here is a concrete example of untidy data: word counts by race and gender for each film in the Lord of the Rings trilogy.

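The Fellowship Of The Ring

| Race | Female | Male |
|---|---|---|
| Elf | 1229 | 971 |
| Hobbit | | 3644 |
| Man | | 1995 |

The Two Towers

| Race | Female | Male |
|---|---|---|
| Elf | 331 | 513 |
| Hobbit | | 2463 |
| Man | 401 | 3589 |

The Return Of The King

| Race | Female | Male |
|---|---|---|
| Elf | 183 | 510 |
| Hobbit | | 2673 |
| Man | 268 | 2459 |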

Here’s how the same data looks in tidy form:

| Film | Gender | Race | Words |
|---|---|---|---|
| The Fellowship Of The Ring | Female | Elf | 1229 |
| The Fellowship Of The Ring | Male | Elf | 971 |
| The Fellowship Of The Ring | Female | Hobbit | |
| The Fellowship Of The Ring | Male | Hobbit | 3644 |
| The Fellowship Of The Ring | Female | Man | |
| The Fellowship Of The Ring | Male | Man | 1995 |
| The Two Towers | Female | Elf | 331 |
| The Two Towers | Male | Elf | 513 |
| The Two Towers | Female | Hobbit | |
| The Two Towers | Male | Hobbit | 2463 |
| The Two Towers | Female | Man | 401 |
| The Two Towers | Male | Man | 3589 |
| The Return Of The King | Female | Elf | 183 |
| The Return Of The King | Male | Elf | 510 |
| The Return Of The King | Female | Hobbit | |
| The Return Of The King | Male | Hobbit | 2673 |
| The Return Of The King | Female | Man | 268 |
| The Return Of The King | Male | Man | 2459 |

What is tidy data?

A tidy data set is one that satisfies these three criteria:

each row is a single observation,
each variable is a single column, and
each value is a single cell (i.e., its row, column position in the data frame is not shared with another value)

*image source: [R for Data Science](https://r4ds.had.co.nz/) by Garrett Grolemund & Hadley Wickham*

Tools for getting it there:

tidyverse package functions from:
- dplyr package (select, filter, mutate, group_by, summarize)
- tidyr package (gather)
- purrr package (*map*)
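
As a minimal sketch of what `gather` does, the block below reshapes two rows of The Two Towers table (Elf and Man) from the wide layout into the tidy layout. The object names `two_towers_wide` and `two_towers_tidy` are made up for illustration. Note that in tidyr 1.0.0 and later `gather` still works but `pivot_longer` is the recommended replacement.

```r
library(tidyr)

# Wide (untidy) layout: one column of word counts per gender.
# Values are the Elf and Man rows of The Two Towers table.
two_towers_wide <- data.frame(
    Film = "The Two Towers",
    Race = c("Elf", "Man"),
    Female = c(331, 401),
    Male = c(513, 3589)
)

# gather() stacks the Female and Male columns into key/value pairs:
# the old column names go into Gender, the cell values into Words.
two_towers_tidy <- gather(two_towers_wide, key = "Gender", value = "Words", Female, Male)
two_towers_tidy
```

The result has one row per Film, Race and Gender combination, which matches the three tidy criteria: each row is one observation and each variable has its own column.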

Another big concept this week: iteration

iteration is when you need to do something repeatedly (e.g., ringing in and bagging groceries at the till)
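
To make the idea concrete, here is a rough sketch of iteration written out by hand with a base R `for` loop. It uses the built-in `USArrests` data frame; the name `col_means` is just a placeholder chosen for this example.

```r
# Iteration written out by hand: visit each column of the built-in
# USArrests data frame and compute its mean, one column at a time.
col_means <- numeric(ncol(USArrests))
names(col_means) <- names(USArrests)

for (col in names(USArrests)) {
    col_means[col] <- mean(USArrests[[col]])
}
col_means
```

The tidyverse tools in this lecture express the same "do this for every piece" pattern without writing the loop yourself.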

Tidyverse tools for iteration

group_by + summarize
*map*

`group_by` + `summarize`

useful when you want to do something repeatedly to a group of rows
for example, to calculate the average life expectancy (lifeExp) for each continent in the gapminder data set

In [2]:

library(gapminder)
head(gapminder)

Out[2]:


First, let's filter for only 1 year, 2007

In [28]:

library(tidyverse)
gap_2007 <- gapminder %>% 
    filter(year == 2007)
head(gap_2007)

Out[28]:

Now let's use `group_by` + `summarize` to iterate

Goal: calculate average life expectancy for each continent

In [10]:

avg_lifeExp_2007 <- gap_2007 %>% 
    group_by(continent) %>% 
    summarize(avg_lifeExp = mean(lifeExp))
avg_lifeExp_2007

Out[10]:
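
One way to see why the grouped summary above counts as iteration is to compare it with splitting the rows by hand and averaging each piece. The sketch below assumes the built-in `mtcars` data set, which is not part of this lecture and is used only so the example runs without extra packages; `manual_avg` and `grouped_avg` are illustrative names.

```r
library(dplyr)

# Manual iteration: split the rows by group, then average each piece.
# mtcars is a built-in data set used here only for illustration.
manual_avg <- sapply(split(mtcars$mpg, mtcars$cyl), mean)

# The same computation with group_by + summarize.
grouped_avg <- mtcars %>%
    group_by(cyl) %>%
    summarize(avg_mpg = mean(mpg))

manual_avg
grouped_avg
```

Both give one average per group; `group_by` + `summarize` returns them as a data frame with the grouping column kept alongside the result.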

`map`

useful when you want to do something repeatedly to almost anything (we'll give the example of columns in a data frame)
for example, to calculate the average value of each column of the USArrests data, giving the average across all US states

In [13]:

head(USArrests)

Out[13]:

use `map` to iterate

In [31]:

USavg <- map(USArrests, mean)
USavg

Out[31]:

But why isn't our output a data frame?

The output type of a `map` function depends on which variant you use:

| `map` function | Output |
|---|---|
| `map()` | list |
| `map_lgl()` | logical vector |
| `map_int()` | integer vector |
| `map_dbl()` | double vector |
| `map_chr()` | character vector |
| `map_df()` | data frame |
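
The sketch below tries several of these variants on `USArrests` to show that only the container of the result changes. The object names (`as_list`, `as_dbl`, and so on) are made up for this example. `map_df()` is left out here because it also needs dplyr installed, and in purrr 1.0.0 and later it is marked as superseded, although it still works.

```r
library(purrr)

# Each variant applies the function to every column of USArrests;
# only the container of the result changes.
as_list <- map(USArrests, mean)            # list of length-one doubles
as_dbl  <- map_dbl(USArrests, mean)        # named double vector
as_chr  <- map_chr(USArrests, class)       # named character vector
as_lgl  <- map_lgl(USArrests, is.numeric)  # named logical vector
as_int  <- map_int(USArrests, length)      # named integer vector

str(as_list)
as_dbl
```

The typed variants raise an error if the function returns something of the wrong type or length, which makes them a useful check on what the function is actually producing.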

use `map_df` instead:

In [32]:

USavg <- map_df(USArrests, mean)
USavg

Out[32]:

Go forth and wrangle!

we'll be here to help if you need it!

image source: https://media.giphy.com/media/Qgm6tIYrSQqC4/giphy-downsized-large.gif

Class activity 1

Calculate the mean petal length for each Iris (flower) species in the iris dataset:

In [24]:

library(tidyverse)
head(iris)
petal_length <- iris %>% 
    group_by(Species) %>% 
    summarise(mean_length = mean(Petal.Length))
petal_length

Out[24]:

Class activity 2

Use `map_df` to calculate the mean of each of the numerical columns in the iris dataset.

In [26]:

avg_all <- iris %>% 
    select(-Species) %>% 
    map_df(mean)
avg_all

Out[26]:
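
As a possible variation on the solution above, the numeric columns can be picked by type instead of dropping `Species` by name. This is a sketch that assumes dplyr 1.0.0 or later, where `where()` can be used inside `select()`; `by_name` and `by_type` are illustrative names.

```r
library(dplyr)
library(purrr)

# Drop the one non-numeric column by name (the approach used above).
by_name <- iris %>%
    select(-Species) %>%
    map_df(mean)

# Keep every numeric column by type; assumes dplyr >= 1.0.0 for where().
by_type <- iris %>%
    select(where(is.numeric)) %>%
    map_df(mean)

by_type
```

For `iris` the two give the same result. Selecting by type is handy when a data frame has several non-numeric columns.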

What did we learn?

anti-select is -COLUMN_NAME
iteration (map & group_by + summarize)
gather (wide to long)
