DBSCAN Practice
Authors: Joseph Nelson (DC)
You're now familiar with how DBSCAN works. Let's practice it in sklearn.
We will start out working with the NHL data. We're going to investigate clustering teams based on their counting stats.
Check out this glossary of hockey terms for a reference of what the columns indicate.
1. Load our data and perform any basic cleaning and/or EDA.
2. Set up an X matrix to perform clustering with DBSCAN.
Let's cluster on all features EXCEPT team and rank.
Make rank our y vector, which we can use for cluster validation.
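As a rough sketch of this step, the split might look like the following. The tiny DataFrame and its column names (Team, Rank, PTS, GF, GA) are hypothetical stand-ins, not the actual NHL file, so adjust them to match the columns you loaded.

```python
import pandas as pd

# Hypothetical stand-in for the NHL table; the real column names may differ.
nhl = pd.DataFrame({
    "Team": ["A", "B", "C", "D"],
    "Rank": [1, 2, 3, 3],
    "PTS": [110, 98, 85, 70],
    "GF": [260, 240, 215, 190],
    "GA": [200, 220, 235, 260],
})

# Features: every column except the team name and the rank.
X = nhl.drop(columns=["Team", "Rank"])

# Validation target: the rank, kept aside and never shown to DBSCAN.
y = nhl["Rank"]

print(X.shape, y.shape)
```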
3. Scatter plot EDA
Make two scatter plots. At least one axis in one of the plots should show points or a related counting stat such as goals for (GF) or goals against (GA). Do the scatter plots give a general idea of how many clusters we should expect a clustering algorithm to extract?
4. Scale our data
Standardize the data and compare at least one scatter plot of the scaled data with its unscaled counterpart above.
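One way to standardize is sklearn's StandardScaler, which centers each column to mean 0 and rescales it to unit variance. The small array below is made-up illustration data, not the NHL stats.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix (rows = teams, columns = counting stats).
X = np.array([
    [110.0, 260.0, 200.0],
    [98.0, 240.0, 220.0],
    [85.0, 215.0, 235.0],
    [70.0, 190.0, 260.0],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# After scaling, every column has mean ~0 and standard deviation ~1.
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```

Scaling matters for DBSCAN because eps is a single distance threshold: without it, the column with the largest numeric range dominates every distance.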
5. Fit a DBSCAN clusterer
Remember to pass an eps and min_samples of your choice.
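A minimal sketch of the fit, run on synthetic blobs rather than the NHL data. The values eps=0.3 and min_samples=5 are arbitrary starting points chosen for this toy data, not recommendations for the lab.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: three well-separated blobs.
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)
X_scaled = StandardScaler().fit_transform(X)

# eps: neighborhood radius; min_samples: neighbors (including the point
# itself) needed for a point to count as a core point.
dbscan = DBSCAN(eps=0.3, min_samples=5)
dbscan.fit(X_scaled)
```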
6. Check out the assigned cluster labels
Use the .labels_ attribute of the fitted DBSCAN object.
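To illustrate what the labels look like, here is a sketch on a tiny hand-made dataset (two tight groups and one stray point). DBSCAN marks points that belong to no cluster with the label -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups of three points each, plus one isolated point.
X = np.array([
    [0.0, 0.0], [0.0, 0.1], [0.1, 0.0],
    [5.0, 5.0], [5.0, 5.1], [5.1, 5.0],
    [10.0, 0.0],
])

dbscan = DBSCAN(eps=0.5, min_samples=3).fit(X)
labels = dbscan.labels_

# -1 means "noise", so exclude it when counting clusters.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))

print(labels)
print("clusters:", n_clusters, "noise points:", n_noise)
```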
7. Evaluate the DBSCAN clusters
7.1 Check the silhouette score.
How are the clusters?
If you're feeling adventurous, see how you can adjust eps and min_samples to improve this.
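A sketch of the silhouette check on synthetic blobs. Two caveats are handled explicitly here as design choices of this example: silhouette_score needs at least two distinct labels, and it would otherwise treat the noise label -1 as a cluster of its own, so noise points are masked out first.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: three well-separated blobs.
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)
X_scaled = StandardScaler().fit_transform(X)
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_scaled)

# Drop noise points so -1 is not scored as a cluster.
mask = labels != -1
n_clusters = len(set(labels[mask]))

# The score is only defined when there are at least two clusters.
if n_clusters >= 2:
    score = silhouette_score(X_scaled[mask], labels[mask])
else:
    score = None

print("clusters:", n_clusters, "silhouette:", score)
```

Scores near 1 indicate compact, well-separated clusters; scores near 0 indicate overlapping ones. Rerunning this with different eps and min_samples values is a simple way to explore their effect.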
7.2 Check the homogeneity, completeness, and V-measure against the stored rank y
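These three metrics compare cluster labels with known classes. A sketch using sklearn's homogeneity_completeness_v_measure, with the true blob membership of synthetic data standing in for the stored rank y:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import homogeneity_completeness_v_measure
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; y_true plays the role of the rank vector.
X, y_true = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)
X_scaled = StandardScaler().fit_transform(X)
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_scaled)

# Homogeneity: each cluster contains members of a single class.
# Completeness: all members of a class land in the same cluster.
# V-measure: harmonic mean of the two.
homogeneity, completeness, v_measure = homogeneity_completeness_v_measure(
    y_true, labels
)

print(homogeneity, completeness, v_measure)
```

All three scores lie between 0 and 1 and do not depend on how the cluster IDs are numbered.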
8. Plot the clusters
You can choose any two variables for the axes.
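One possible plot, sketched with matplotlib on synthetic blobs: color each point by its cluster label so noise (-1) stands out. The axis names here are placeholders; with the NHL data you would pick two real columns.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: three well-separated blobs.
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)
X_scaled = StandardScaler().fit_transform(X)
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_scaled)

# Scatter the two chosen features, colored by assigned cluster.
fig, ax = plt.subplots(figsize=(6, 5))
points = ax.scatter(X_scaled[:, 0], X_scaled[:, 1], c=labels, cmap="viridis", s=30)
ax.set_xlabel("feature 0 (scaled)")
ax.set_ylabel("feature 1 (scaled)")
ax.set_title("DBSCAN cluster assignments")
fig.colorbar(points, ax=ax, label="cluster label (-1 = noise)")
```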
9. Fit DBSCAN on an easier dataset
Import the make_circles function from sklearn.datasets. You can use this to create some fake clusters that will perform well with DBSCAN.
Create some X and y using the function. Here is some sample code:
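A minimal sketch of such a call. The parameter values (n_samples, noise, factor, random_state) are assumptions chosen for illustration:

```python
from sklearn.datasets import make_circles

# Two concentric rings: factor is the ratio of the inner ring's radius to the
# outer ring's, and noise is the standard deviation of Gaussian jitter.
X, y = make_circles(n_samples=1000, noise=0.05, factor=0.3, random_state=0)

print(X.shape)  # one row per point, two coordinates each
print(set(y))   # 0 = outer ring, 1 = inner ring
```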
9.1 Plot the fake circles data.
9.2 Scale the data and fit DBSCAN on it.
9.3 Evaluate DBSCAN visually, with silhouette, and with the metrics against the true y.
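Steps 9.2 and 9.3 can be sketched together as follows. The make_circles settings and the DBSCAN parameters (eps=0.3, min_samples=5) are assumptions that happen to work for this toy data.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_circles
from sklearn.metrics import homogeneity_completeness_v_measure, silhouette_score
from sklearn.preprocessing import StandardScaler

# Fake data: two concentric rings with a little noise.
X, y = make_circles(n_samples=1000, noise=0.05, factor=0.3, random_state=0)

# 9.2: scale, then fit DBSCAN.
X_scaled = StandardScaler().fit_transform(X)
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_scaled)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

# 9.3: metrics against the true ring membership.
homogeneity, completeness, v_measure = homogeneity_completeness_v_measure(y, labels)

# Silhouette on the non-noise points only.
mask = labels != -1
silhouette = silhouette_score(X_scaled[mask], labels[mask])

print("clusters:", n_clusters)
print("homogeneity, completeness, V-measure:", homogeneity, completeness, v_measure)
print("silhouette:", silhouette)
```

Silhouette rewards compact, blob-shaped clusters, so it can look mediocre on rings even when the label-based metrics show that DBSCAN recovered them well. That contrast is worth checking against the plot.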