GitHub Repository: UBC-DSCI/dsci-100-assets
Path: blob/master/2019-fall/slides/05_tutorial_class_activity.ipynb
Kernel: R

DSCI 100: Introduction to Data Science

Tutorial 5: Collaboration and Group Project

2019-10-08

First,

  • Congratulations on completing your first quiz!!

  • The TAs are grading your quizzes now; we expect to be done by next Tuesday

Today,

We will start on your group project and introduce you to modelling (the second part of the course)

Learning objectives

By the end of this class, you will be able to:

  • Set up a collaborative Jupyter notebook on CoCalc and use its basic functions

  • Understand the requirements for your group project

  • Explore some of the datasets assigned for your group project using the functions you learnt in weeks 1-5

  • Have a basic understanding of classification (in preparation for week 6)

Activity #1: Create your own CoCalc Project (Demo)

Activity #2: Explore Datasets - Preliminary

  • Group project assignment details: https://canvas.ubc.ca/courses/40616/assignments/365095

  • Each member of your team selects one of the eight datasets

  • Using one markdown cell and one code cell each, read the dataset, take a look at it, and write a short description of the dataset. Some questions you should try to answer:

    • What is the dataset about?

    • How many variables are there?

    • How many observations are there?
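
As a rough sketch of this first look, the snippet below reads a small table and reports its variables and its number of observations. It is written in plain Python for illustration only (the course notebooks use the R kernel), and the inline data, the column names, and the `describe` helper are all hypothetical stand-ins for whichever dataset you picked.

```python
import csv
import io

# Hypothetical stand-in for one of the assigned datasets (made-up values).
RAW = """radius,texture,diagnosis
14.2,19.1,benign
20.6,25.3,malignant
12.9,17.4,benign
18.8,22.0,malignant
13.5,18.2,benign
"""

def describe(text):
    """Return (variable names, number of observations) for CSV text."""
    rows = list(csv.DictReader(io.StringIO(text)))
    variables = list(rows[0].keys()) if rows else []
    return variables, len(rows)

variables, n_obs = describe(RAW)
print(f"{len(variables)} variables: {variables}")
print(f"{n_obs} observations")
```

Each column is one variable and each row after the header is one observation, so the two counts answer the second and third questions directly.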

Activity #3: Explore Datasets Part 2 - Outcome Variable

  • Now, switch datasets with another group member

  • Working in the cells that your group member started earlier, answer these questions about your new dataset:

    • Which variable is the main outcome (categorical/label) variable in the dataset?

    • How many distinct values (groups) does this variable take?

    • How many observations are there in each value/group?

  • Feel free to add more code and markdown cells to keep your notebook neat
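
Counting the groups in an outcome variable, and the observations in each group, amounts to tallying one column. A minimal sketch in plain Python (for illustration only; the course notebooks use the R kernel), where the `labels` list is a hypothetical, made-up label column:

```python
from collections import Counter

# Hypothetical label column; in practice this comes from your dataset.
labels = ["benign", "malignant", "benign", "malignant", "benign"]

counts = Counter(labels)   # observations per group
n_groups = len(counts)     # number of distinct values

print(f"{n_groups} groups")
for group, n in counts.most_common():
    print(f"{group}: {n}")
```

If one group has far fewer observations than the others, that is worth noting in your markdown cell.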

Activity #4: Explore Datasets Part 3 - Visualisations!

  • Now, switch datasets with another group member again (make sure this is a dataset you have not worked on yet)

  • Make some visualisations of the outcome variable:

    • What does the distribution of the variable look like?

    • What relationship does it have with some of the other variables?

  • Try using a range of box plots, scatterplots, bar charts, line graphs, etc.
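
Before plotting, it can help to compute the numbers that a box plot of a numeric variable, split by the outcome variable, would summarise. The sketch below does this in plain Python (for illustration only; the course notebooks use the R kernel). The `observations` values and the variable names are made up.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (radius, diagnosis) pairs with made-up values.
observations = [
    (14.2, "benign"), (20.6, "malignant"), (12.9, "benign"),
    (18.8, "malignant"), (13.5, "benign"),
]

# Group the numeric variable by the outcome variable.
by_class = defaultdict(list)
for radius, diagnosis in observations:
    by_class[diagnosis].append(radius)

# Minimum, median, and maximum per class.
summary = {
    label: (min(values), median(values), max(values))
    for label, values in by_class.items()
}
for label, (low, mid, high) in summary.items():
    print(f"{label}: min={low}, median={mid}, max={high}")
```

Clearly separated ranges between classes suggest that the variable may be related to the outcome, which a box plot or scatterplot can then show visually.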

Back to Classification...

  • What is classification?

    • Recall: Predict a class/category for a new observation/measurement (e.g., cancerous or benign tumour)

    • Uses data about past observations to predict the class of new observations

Introducing... the k-nn Classification Algorithm

  • k-nn stands for k-nearest neighbours

  • Classifies a new data point based on the classes of its nearest neighbours

  • How many neighbours? k
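
The idea above can be sketched in a few lines. This is a minimal illustration in plain Python, not the implementation used in the course (the course notebooks use the R kernel). It assumes Euclidean distance and a simple majority vote, neither of which the slides specify, and the `knn_predict` helper and the training data are hypothetical.

```python
from collections import Counter
from math import dist

def knn_predict(train, new_point, k):
    """Predict a class by majority vote among the k nearest training points.

    train: list of (features, label) pairs; features are tuples of numbers.
    Assumes Euclidean distance. Ties in the vote go to the class whose
    member is nearest to new_point.
    """
    # Sort training points by distance to the new point and keep the k closest.
    nearest = sorted(train, key=lambda pair: dist(pair[0], new_point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical (radius, texture) measurements with known classes.
train = [
    ((14.2, 19.1), "benign"),
    ((12.9, 17.4), "benign"),
    ((13.5, 18.2), "benign"),
    ((20.6, 25.3), "malignant"),
    ((18.8, 22.0), "malignant"),
]

print(knn_predict(train, (13.0, 18.0), k=3))
print(knn_predict(train, (19.5, 24.0), k=3))
```

With k = 3, the second point has two malignant neighbours and one benign neighbour among its three closest, so the vote goes to malignant. Changing k can change the prediction.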

k-nn Demo!

Lingering questions:

  1. How do we determine what k to use?

  2. How do we evaluate the accuracy of our model?

Find out on Thursday when we delve deeper into classification!