UBC-DSCI
GitHub Repository: UBC-DSCI/dsci-100-assets
Path: blob/master/2019-fall/slides/02_getting_data_into_r.ipynb
Kernel: R

DSCI 100 - Introduction to Data Science

Lecture 2 - Getting data into R

2019-09-12

Housekeeping

  • Pin on Piazza about viewing fresh notebooks & how to upload to JupyterHub

  • Please fill out our pre-course survey

    • password: dsci100

    • your responses are anonymous to the instructors

    • will not affect your course grade

    • the survey closes next Thursday, September 19

  • Most/all folks who were registered before the due date have been graded for worksheet_01

  • Late registrant assignments are still being graded

  • You will get feedback forms for each assignment telling you which questions were correct/incorrect. They are coming soon!

Recap of last week

  • Introduction to

    • R programming and Jupyter notebooks

    • a sprinkle of data analysis

    • UBC's sketchy wifi

Today

Taking our first step in data analysis: loading data into R!

Where are we going?

In the data science workflow (source: Grolemund & Wickham, R for Data Science)

Loading/importing data

  • 4 most common ways to do this in Data Science

    1. read in a text file with data in a spreadsheet format

    2. read from a database (e.g., SQLite, PostgreSQL)

    3. scrape data from the web (optional bonus material)

    4. use a web API to read data from a website (not covered in DSCI100)
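As a rough illustration of option 2, the sketch below reads a table from a database with `tbl`. It assumes the DBI, RSQLite, dplyr and dbplyr packages are installed; the in-memory database, the table name `eruptions` and its values are made up for this example.

```r
library(DBI)
library(dplyr)

# Hypothetical in-memory SQLite database holding one small, made-up table
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "eruptions",
             data.frame(eruption_time = c(3.6, 1.8), wait_time = c(79, 54)))

# tbl() creates a reference to the table inside the database;
# collect() actually pulls the rows into R as a data frame
eruptions_db <- tbl(con, "eruptions")
eruptions <- collect(eruptions_db)

# Close the connection once the data is in R
dbDisconnect(con)

head(eruptions)
```

A real SQLite file would be opened the same way, with the file's path in place of `":memory:"`.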

Different ways to locate a file / dataset

Local (on your computer)

  • An absolute path locates a file with respect to the "root" folder on a computer

    • starts with /, e.g. /home/trevor/documents/timesheet.xlsx

  • A relative path locates a file relative to your working directory

    • doesn't start with /, e.g. documents/timesheet.xlsx
      (working directory is /home/trevor/)
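A small sketch of the difference, using a temporary folder so it runs anywhere. The folder layout and the file `timesheet.csv` are stand-ins invented for this example, not files from the course.

```r
# Hypothetical layout: <temp folder>/documents/timesheet.csv
base_dir <- normalizePath(tempdir(), winslash = "/")
dir.create(file.path(base_dir, "documents"), showWarnings = FALSE)

# Absolute path: spelled out all the way from the root folder
abs_path <- file.path(base_dir, "documents", "timesheet.csv")
writeLines(c("day,hours", "Mon,7.5"), abs_path)

# Make the temp folder our working directory (remember the old one)
old_wd <- setwd(base_dir)

# Relative path: interpreted starting from the working directory
rel_path <- file.path("documents", "timesheet.csv")

# Both paths resolve to the same file
same_file <- normalizePath(rel_path, winslash = "/") ==
  normalizePath(abs_path, winslash = "/")

# Restore the original working directory
setwd(old_wd)

same_file
```

`getwd()` shows the current working directory, which is handy when a relative path is not being found.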

Remote (on the web)

via a "URL" that starts with http:// or https://, e.g.

http://traffic.libsyn.com/mbmbam/MyBrotherMyBrotherandMe367.mp3
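One way to tell the two kinds of location apart in code is to check the prefix. The helper names `is_remote` and `fetch_to_temp` below are hypothetical, written only for this sketch; `download.file` is the base R function for copying a remote file onto your computer.

```r
# TRUE when a location is a web URL rather than a local path
# (hypothetical helper, not part of any package)
is_remote <- function(path) {
  grepl("^https?://", path)
}

# Download a remote file into the temporary folder and return its local path
# (hypothetical helper; it is defined here but not called, so nothing is downloaded)
fetch_to_temp <- function(url) {
  stopifnot(is_remote(url))
  dest <- file.path(tempdir(), basename(url))
  download.file(url, destfile = dest, mode = "wb")
  dest
}

is_remote("http://traffic.libsyn.com/mbmbam/MyBrotherMyBrotherandMe367.mp3")
is_remote("/home/trevor/documents/timesheet.xlsx")
is_remote("documents/timesheet.xlsx")
```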

Demo: Loading data from your computer

Workflow:

  • make the dataset accessible to the computer

    • might need to load a package, download a file, connect to a database

  • inspect the data using Jupyter to see what it looks like

  • load the data into R

    • using read_csv, read_delim, tbl, etc

  • inspect the result to make sure it worked

    • the head function is useful here

Let's load the Old Faithful geyser dataset from Larry Wasserman's book All of Statistics

# Step 0: load the tidyverse library -- gives us read_* functions
library(tidyverse)

# Step 1: download the file to get the data from the URL onto our computer
data_url <- "http://www.stat.cmu.edu/~larry/all-of-statistics/=data/faithful.dat"

# Step 2: take a look at the data using Jupyter

# Step 3: load the data into R
# need to:
#   1. skip 26 lines of metadata
#   2. manually add the column names index, eruption_time, wait_time
#   3. set the entry delimiter to spaces

# Step 4: check the result
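One way steps 2 to 4 might be filled in is sketched below. To keep it runnable without a network connection, it works on a stand-in file written to a temporary folder: 26 placeholder metadata lines followed by a few illustrative single-space-separated rows. The real file's layout should be checked in step 2 before choosing `skip` and `delim`; if its columns turn out to be separated by several spaces, a whitespace-aware reader may be a better fit than `read_delim`. The name `geyser` is just a label chosen for this example.

```r
library(readr)  # the tidyverse package that provides the read_* functions

# Stand-in for the downloaded file: 26 placeholder metadata lines,
# then space-separated rows (illustrative values only)
local_file <- tempfile(fileext = ".dat")
metadata <- paste("metadata line", 1:26)
rows <- c("1 3.600 79", "2 1.800 54", "3 3.333 74")
writeLines(c(metadata, rows), local_file)

# Step 2: look at the raw text around where the data starts
readLines(local_file)[25:28]

# Step 3: skip the metadata, name the columns, split entries on spaces
geyser <- read_delim(local_file,
                     delim = " ",
                     skip = 26,
                     col_names = c("index", "eruption_time", "wait_time"))

# Step 4: check the result
head(geyser)
```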

Note about loading data

  • It's important to do it carefully + check results after!

    • will help reduce bugs and speed up your analyses down the road

  • Think of it as tying your shoes before you run; not exciting, but if done wrong it will trip you up later!

Questions?

Go for it!

Class activity:

What did we learn?

What did we learn, Winter 2019

  • read_table2 allows us to read files whose columns are separated by one or more whitespace characters

  • read the terms of service before scraping

  • when to use the different forms of read_delim (or read_*)
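The sketch below lines up a few of the read_* forms against the kind of file each suits. All four files are tiny made-up examples written to a temporary folder. It uses `read_table` for the whitespace case; to the best of my knowledge, recent readr versions fold the behaviour of `read_table2` into `read_table`, so treat that choice as an assumption about your installed version.

```r
library(readr)

# Hypothetical sample files, one per delimiter style
csv_file   <- tempfile(fileext = ".csv")
tsv_file   <- tempfile(fileext = ".tsv")
semi_file  <- tempfile(fileext = ".txt")
space_file <- tempfile(fileext = ".txt")

writeLines(c("name,score", "a,1", "b,2"), csv_file)             # commas
writeLines(c("name\tscore", "a\t1", "b\t2"), tsv_file)          # tabs
writeLines(c("name;score", "a;1", "b;2"), semi_file)            # semicolons
writeLines(c("name   score", "a      1", "b      2"), space_file) # runs of spaces

from_csv   <- read_csv(csv_file)                 # comma-separated
from_tsv   <- read_tsv(tsv_file)                 # tab-separated
from_semi  <- read_delim(semi_file, delim = ";") # any single delimiter you name
from_space <- read_table(space_file)             # columns separated by whitespace

head(from_space)
```

All four calls should produce the same two-column table, which is a quick way to see that the choice of function only depends on how the entries are separated in the file.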

Note on web scraping

  • More and more websites don't want you scraping

  • Instead, they are providing "easier" ways for you to access the data, which let them regulate access and know who you are

  • So, TL;DR read the Terms of Service for ANY webpage you are planning on scraping

    • they're long to read, so search for "scraping", "auto", "bot", etc to find the relevant section