DSCI 100: Introduction to Data Science
Tutorial 5: Collaboration and Group Project
2019-10-08
First,
Congratulations on completing your first quiz!!
The TAs are grading your quizzes now; we expect to be done by next Tuesday
Today,
We will start on your group project and introduce you to modelling (the second part of the course)
Learning objectives
By the end of this class, you will be able to:
Set up a collaborative Jupyter notebook on CoCalc and use its basic functions
Understand the requirements for your group project
Explore some of the datasets assigned for your group project using the functions you learnt in weeks 1-5
Explain, at a basic level, what classification is (in preparation for week 6)
Activity #1: Create your own CoCalc Project (Demo)
Activity #2: Explore Datasets - Preliminary
Group project assignment details: https://canvas.ubc.ca/courses/40616/assignments/365095
Each member of your team selects one of the eight datasets
Using one markdown cell and one code cell each, read in the dataset, take a look at it, and write a short description of it. Some questions you should try to answer:
What is the dataset about?
How many variables are there?
How many observations are there?
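The three questions above can be answered with a few lines of code. The following is a minimal sketch in Python using only the standard library; the inline data, the column names (`radius`, `texture`, `diagnosis`) and the variable names are hypothetical stand-ins, not one of the eight assigned datasets.

```python
import csv
import io

# Hypothetical stand-in for a project dataset. With a real file you would
# use open("your_dataset.csv", newline="") instead of io.StringIO(...).
raw_csv = io.StringIO(
    "radius,texture,diagnosis\n"
    "14.2,19.1,benign\n"
    "20.6,25.3,malignant\n"
    "12.9,17.4,benign\n"
    "18.3,22.8,malignant\n"
    "13.5,16.0,benign\n"
)

reader = csv.DictReader(raw_csv)
rows = list(reader)            # one dict per observation (row)
variables = reader.fieldnames  # column names taken from the header row

print("Variables:", variables)
print("Number of variables:", len(variables))
print("Number of observations:", len(rows))
print("First observation:", rows[0])
```

Looking at the first observation alongside the variable names is usually enough to start writing the short description for the markdown cell.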
Activity #3: Explore Datasets Part 2 - Outcome Variable
Now, switch datasets with another group member
Working in the cells that your group member started earlier, answer these questions about your new dataset:
Identify the main outcome/categorical/label variable in the dataset.
How many values/groups are in this variable?
How many observations are there in each value/group?
Feel free to add more code and markdown cells to keep your notebook neat
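Counting the groups in the outcome variable, and the observations in each group, might look like the sketch below. The `diagnosis` list is a hypothetical outcome column used only for illustration.

```python
from collections import Counter

# Hypothetical outcome column, one label per observation.
diagnosis = ["benign", "malignant", "benign", "malignant", "benign", "benign"]

counts = Counter(diagnosis)  # maps each distinct label to its frequency

print("Number of groups:", len(counts))
for label, n in counts.most_common():
    print(f"{label}: {n} observations")
```

A large imbalance between groups is worth noting in your description, since it affects how a classifier's accuracy should be interpreted later.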
Activity #4: Explore Datasets Part 3 - Visualisations!
Now, switch datasets with another group member again (make sure this is a dataset you have not worked on yet)
Make some visualisations of the outcome variable:
What does the distribution of the variable look like?
What relationship does it have with some of the other variables?
Try using a range of box plots, scatterplots, bar charts, line graphs, etc.
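Before plotting, it can help to compute the numbers a plot would show. The sketch below is one possible numerical check, not a replacement for the plots: it prints a rough text bar chart of the outcome variable and compares the mean of one other variable across classes. The data and the names `radius` and `diagnosis` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (radius, diagnosis) pairs standing in for two dataset columns.
observations = [
    (14.2, "benign"), (20.6, "malignant"), (12.9, "benign"),
    (18.3, "malignant"), (13.5, "benign"),
]

# Group the numeric variable by the outcome class.
radius_by_class = defaultdict(list)
for radius, diagnosis in observations:
    radius_by_class[diagnosis].append(radius)

# Distribution of the outcome variable as a rough text bar chart.
for diagnosis, values in radius_by_class.items():
    print(f"{diagnosis:<10} {'#' * len(values)} ({len(values)})")

# Relationship with another variable: mean radius within each class.
mean_radius = {d: mean(v) for d, v in radius_by_class.items()}
for diagnosis, value in mean_radius.items():
    print(f"mean radius for {diagnosis}: {value:.2f}")
```

A clear gap between the class means suggests the variable is worth plotting against the outcome, for example as a box plot per class.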
Back to Classification...
What is classification?
Recall: Predict a class/category for a new observation/measurement (e.g., cancerous or benign tumour)
Using data about past observations to make predictions about the class of new observations
Introducing... the k-nn Classification Algorithm
k-nn stands for k-nearest neighbours
Classifies a new data point based on the classes of its nearest neighbours (typically by majority vote)
How many neighbours? k
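The idea can be sketched in a few lines of Python. This is a minimal illustration, assuming Euclidean distance and a simple majority vote; the function name `knn_predict` and the toy data are hypothetical, and in practice you would normally use a library implementation.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, new_point, k):
    """Predict a class for new_point by majority vote among its k nearest neighbours."""
    # Euclidean distance from the new point to every training point.
    distances = [
        (math.dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k training points closest to the new point.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote among their labels (ties go to the label seen first).
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two measurements per tumour, plus a known class.
train_points = [(1.0, 1.2), (1.5, 0.8), (1.1, 0.9),
                (4.0, 4.2), (4.5, 3.9), (3.8, 4.4)]
train_labels = ["benign", "benign", "benign",
                "malignant", "malignant", "malignant"]

prediction = knn_predict(train_points, train_labels, (1.3, 1.0), k=3)
print(prediction)
```

Here the new point sits inside the cluster of benign examples, so its three nearest neighbours are all benign and the vote is unanimous.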
k-nn Demo!
Lingering questions:
How do we determine what k to use?
How do we evaluate the accuracy of our model?
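One common way to approach both questions, shown here only as a sketch and not necessarily the method used later in the course, is to hold back some labelled observations as a test set, measure the fraction classified correctly, and compare that accuracy across several values of k. The data, the helper names `knn_predict` and `accuracy`, and the candidate values of k are all hypothetical.

```python
import math
from collections import Counter

def knn_predict(train, new_point, k):
    """Majority vote among the k training items closest to new_point."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], new_point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def accuracy(train, test, k):
    """Fraction of test observations whose predicted class matches the true class."""
    correct = sum(knn_predict(train, point, k) == label for point, label in test)
    return correct / len(test)

# Hypothetical training data; the last point is a noisy "benign" label
# sitting in the middle of the malignant cluster.
train = [
    ((1.0, 1.2), "benign"), ((1.5, 0.8), "benign"), ((1.1, 0.9), "benign"),
    ((4.0, 4.2), "malignant"), ((4.5, 3.9), "malignant"), ((3.8, 4.4), "malignant"),
    ((4.2, 4.1), "benign"),
]
# Hypothetical held-out observations, not used when making predictions.
test = [
    ((1.2, 1.1), "benign"), ((4.1, 4.0), "malignant"),
    ((1.4, 1.0), "benign"), ((4.4, 4.3), "malignant"),
]

results = {k: accuracy(train, test, k) for k in (1, 3, 5, 7)}
for k, acc in results.items():
    print(f"k = {k}: accuracy = {acc:.2f}")

best_k = max(results, key=results.get)  # first k with the highest accuracy
print("Best k on this test set:", best_k)
```

On this toy data, k = 1 is misled by the single noisy point, while k = 7 uses every training point and always predicts the majority class, so both very small and very large values of k can hurt accuracy.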