GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_09/code/practice_dbscan-lab.ipynb

Kernel: Python [default]

DBSCAN Practice

Authors: Joseph Nelson (DC)

You're now familiar with how DBSCAN works. Let's practice it in sklearn.

We will start out working with the NHL data. We're going to investigate clustering teams based on their counting stats.

Check out this glossary of hockey terms for a reference of what the columns indicate.

In [1]:

import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

1. Load our data and perform any basic cleaning and/or EDA.

In [2]:

nhl = pd.read_csv('../data/nhl.csv')

In [3]:

# A:
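
One possible first pass at the cleaning/EDA step is sketched below. So that it runs without `../data/nhl.csv`, it builds a small stand-in frame; the column names (`Team`, `Rank`, `PTS`, `GF`, `GA`), the values, and the helper `quick_eda` are all assumptions for illustration, not the real schema. With the real data you would pass `nhl` to the helper instead.

```python
import numpy as np
import pandas as pd

# Stand-in frame so the sketch runs without ../data/nhl.csv.
# Column names and values are hypothetical, not the real NHL schema.
rng = np.random.RandomState(0)
strong = rng.normal([105, 260, 200], 4, size=(15, 3))  # PTS, GF, GA
weak = rng.normal([75, 200, 260], 4, size=(15, 3))
nhl_demo = pd.DataFrame(np.vstack([strong, weak]), columns=["PTS", "GF", "GA"])
nhl_demo.insert(0, "Team", [f"team_{i}" for i in range(30)])
nhl_demo.insert(1, "Rank", [1] * 15 + [2] * 15)


def quick_eda(df):
    """Collect shape, dtypes, missing counts, and duplicate-row count."""
    return {
        "shape": df.shape,
        "dtypes": df.dtypes,
        "missing": df.isnull().sum(),
        "n_duplicates": int(df.duplicated().sum()),
    }


summary = quick_eda(nhl_demo)
print(summary["shape"])
print(summary["dtypes"])
print(summary["missing"])
print(nhl_demo.describe())  # ranges hint at whether scaling will matter
```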

2. Set up an `X` matrix to perform clustering with DBSCAN.

Let's cluster on all features EXCEPT team and rank.

Use rank as our `y` vector, which we can later use for cluster validation.

In [4]:

# A:
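
A minimal sketch of the split, assuming the identifier and target columns are named `Team` and `Rank` (check `nhl.columns` for the real names). The stand-in frame and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the NHL frame; real column names may differ.
rng = np.random.RandomState(0)
strong = rng.normal([105, 260, 200], 4, size=(15, 3))  # PTS, GF, GA
weak = rng.normal([75, 200, 260], 4, size=(15, 3))
nhl_demo = pd.DataFrame(np.vstack([strong, weak]), columns=["PTS", "GF", "GA"])
nhl_demo.insert(0, "Team", [f"team_{i}" for i in range(30)])
nhl_demo.insert(1, "Rank", [1] * 15 + [2] * 15)

# Features: everything except the team label and the rank.
X = nhl_demo.drop(columns=["Team", "Rank"])
# Validation target: rank is held out and never shown to DBSCAN.
y = nhl_demo["Rank"]

print(X.shape, y.shape)
```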

3. Scatter plot EDA

Make two scatter plots. At least one axis in one of the plots should represent points (goals for, GA). Do the scatter plots give a general idea of how many clusters we should expect a clustering algorithm to extract?

In [5]:

# A:
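
The two plots could be drawn side by side, as in this sketch. The columns `PTS`, `GF`, and `GA` and the stand-in values are assumptions; substitute the real columns from `nhl`.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in features; real column names may differ.
rng = np.random.RandomState(0)
strong = rng.normal([105, 260, 200], 4, size=(15, 3))  # PTS, GF, GA
weak = rng.normal([75, 200, 260], 4, size=(15, 3))
X = pd.DataFrame(np.vstack([strong, weak]), columns=["PTS", "GF", "GA"])

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Points against goals for.
axes[0].scatter(X["GF"], X["PTS"])
axes[0].set_xlabel("GF")
axes[0].set_ylabel("PTS")

# Points against goals against.
axes[1].scatter(X["GA"], X["PTS"])
axes[1].set_xlabel("GA")
axes[1].set_ylabel("PTS")

fig.tight_layout()
```

Visually separated groups of points in either panel suggest how many dense regions DBSCAN might find.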

4. Scale our data

Standardize the data, then compare at least one scatter plot of the scaled data with its unscaled counterpart above.

In [6]:

# A:
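
Scaling matters here because DBSCAN's `eps` is a single distance threshold applied across all features. A sketch using `StandardScaler`, again on hypothetical stand-in columns:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in features; real column names may differ.
rng = np.random.RandomState(0)
strong = rng.normal([105, 260, 200], 4, size=(15, 3))  # PTS, GF, GA
weak = rng.normal([75, 200, 260], 4, size=(15, 3))
X = pd.DataFrame(np.vstack([strong, weak]), columns=["PTS", "GF", "GA"])

# Each column is shifted to mean 0 and rescaled to unit variance.
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0).round(3))
print(X_scaled.std(axis=0).round(3))

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(X["GF"], X["PTS"])
axes[0].set_title("Unscaled")
axes[1].scatter(X_scaled[:, 1], X_scaled[:, 0])  # column 1 = GF, column 0 = PTS
axes[1].set_title("Standardized")
fig.tight_layout()
```

The shape of the point cloud is unchanged; only the axis ranges differ, which is what makes one `eps` meaningful for every feature.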

5. Fit a DBSCAN clusterer

Remember to pass an eps and min_samples of your choice.

In [7]:

# A:
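
A minimal fitting sketch follows. The values `eps=0.8` and `min_samples=3` are arbitrary starting points chosen for the hypothetical stand-in data, not recommended settings for the real NHL features.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in features; real column names may differ.
rng = np.random.RandomState(0)
strong = rng.normal([105, 260, 200], 4, size=(15, 3))  # PTS, GF, GA
weak = rng.normal([75, 200, 260], 4, size=(15, 3))
X = pd.DataFrame(np.vstack([strong, weak]), columns=["PTS", "GF", "GA"])
X_scaled = StandardScaler().fit_transform(X)

# eps: neighborhood radius in scaled units.
# min_samples: neighbors (including the point itself) needed for a core point.
dbscan = DBSCAN(eps=0.8, min_samples=3)
dbscan.fit(X_scaled)

print(len(dbscan.core_sample_indices_), "core samples")
```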

6. Check out the assigned cluster labels

Use the `.labels_` attribute of the fitted DBSCAN object.

In [8]:

# A:
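
The labels are integers, one per row, and scikit-learn marks noise points with `-1`. A sketch of counting clusters and noise, using hypothetical stand-in data and untuned parameters:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in features; real column names may differ.
rng = np.random.RandomState(0)
strong = rng.normal([105, 260, 200], 4, size=(15, 3))  # PTS, GF, GA
weak = rng.normal([75, 200, 260], 4, size=(15, 3))
X = pd.DataFrame(np.vstack([strong, weak]), columns=["PTS", "GF", "GA"])
X_scaled = StandardScaler().fit_transform(X)

labels = DBSCAN(eps=0.8, min_samples=3).fit(X_scaled).labels_

# -1 is the noise label, so it does not count as a cluster.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())

print(pd.Series(labels).value_counts())
print("clusters:", n_clusters, "noise points:", n_noise)
```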

7. Evaluate the DBSCAN clusters

7.1 Check the silhouette score.

How are the clusters?

If you're feeling adventurous, see how adjusting `eps` and `min_samples` changes the score.

In [9]:

# A:
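
One way to explore this is a small sweep over `eps`, sketched below. `silhouette_score` raises an error unless there are at least two distinct labels, so the hypothetical helper `silhouette_or_none` guards that case. Note that the score treats the noise label `-1` as just another cluster. The data and the `eps` grid are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in features; real column names may differ.
rng = np.random.RandomState(0)
strong = rng.normal([105, 260, 200], 4, size=(15, 3))  # PTS, GF, GA
weak = rng.normal([75, 200, 260], 4, size=(15, 3))
X = pd.DataFrame(np.vstack([strong, weak]), columns=["PTS", "GF", "GA"])
X_scaled = StandardScaler().fit_transform(X)


def silhouette_or_none(data, labels):
    """Return the silhouette score, or None when it is undefined."""
    n_labels = len(set(labels))
    if n_labels < 2 or n_labels > len(labels) - 1:
        return None
    return metrics.silhouette_score(data, labels)


results = {}
for eps in (0.3, 0.8, 1.5):
    labels = DBSCAN(eps=eps, min_samples=3).fit_predict(X_scaled)
    results[eps] = silhouette_or_none(X_scaled, labels)
    print(eps, results[eps])
```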

7.2 Check the homogeneity, completeness, and V-measure against the stored rank y

In [10]:

# A:
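
These three scores compare the cluster labels with a known grouping, so they need the held-out `y`. A sketch using `metrics.homogeneity_completeness_v_measure`, where the stand-in `y` holds two hypothetical rank tiers:

```python
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in features and rank tiers; the real data may differ.
rng = np.random.RandomState(0)
strong = rng.normal([105, 260, 200], 4, size=(15, 3))  # PTS, GF, GA
weak = rng.normal([75, 200, 260], 4, size=(15, 3))
X = pd.DataFrame(np.vstack([strong, weak]), columns=["PTS", "GF", "GA"])
y = np.array([1] * 15 + [2] * 15)
X_scaled = StandardScaler().fit_transform(X)

labels = DBSCAN(eps=0.8, min_samples=3).fit_predict(X_scaled)

# Homogeneity: each cluster contains members of a single rank.
# Completeness: all members of a rank land in the same cluster.
# V-measure: harmonic mean of the two.
h, c, v = metrics.homogeneity_completeness_v_measure(y, labels)
print("homogeneity:", h)
print("completeness:", c)
print("v-measure:", v)
```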

8. Plot the clusters

You can choose any two variables for the axes.

In [11]:

# A:
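
A sketch of coloring a scatter plot by cluster label. The axes (`GF` against `PTS`), the stand-in data, and the DBSCAN parameters are assumptions; noise points share the color mapped to `-1`.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in features; real column names may differ.
rng = np.random.RandomState(0)
strong = rng.normal([105, 260, 200], 4, size=(15, 3))  # PTS, GF, GA
weak = rng.normal([75, 200, 260], 4, size=(15, 3))
X = pd.DataFrame(np.vstack([strong, weak]), columns=["PTS", "GF", "GA"])
X_scaled = StandardScaler().fit_transform(X)

labels = DBSCAN(eps=0.8, min_samples=3).fit_predict(X_scaled)

fig, ax = plt.subplots(figsize=(6, 5))
# Plot in the original units for readability; color encodes the cluster label.
points = ax.scatter(X["GF"], X["PTS"], c=labels, cmap="viridis")
ax.set_xlabel("GF")
ax.set_ylabel("PTS")
fig.colorbar(points, ax=ax, label="cluster label (-1 = noise)")
```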

9. Fit DBSCAN on an easier dataset

Import the make_circles function from sklearn.datasets. You can use it to create fake clusters that DBSCAN should handle well.

Create some X and y using the function. Here is some sample code:

from sklearn.datasets import make_circles
circles_X, circles_y = make_circles(n_samples=1000, random_state=123, noise=0.1, factor=0.2)

9.1 Plot the fake circles data.

In [12]:

# A:
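
A sketch of the plot, using the sample `make_circles` call above and coloring each point by its true ring label:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles

circles_X, circles_y = make_circles(
    n_samples=1000, random_state=123, noise=0.1, factor=0.2
)

fig, ax = plt.subplots(figsize=(6, 6))
# circles_y is 0 for one ring and 1 for the other.
ax.scatter(circles_X[:, 0], circles_X[:, 1], c=circles_y, s=10)
ax.set_aspect("equal")
ax.set_title("make_circles data, colored by true label")
```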

9.2 Scale the data and fit DBSCAN on it.

In [13]:

# A:
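
A sketch of scaling and fitting on the circles. The values `eps=0.3` and `min_samples=10` are assumed starting points, not tuned settings; adjust them and watch how the cluster and noise counts change.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_circles
from sklearn.preprocessing import StandardScaler

circles_X, circles_y = make_circles(
    n_samples=1000, random_state=123, noise=0.1, factor=0.2
)

circles_scaled = StandardScaler().fit_transform(circles_X)

# Assumed starting parameters; tune eps and min_samples as needed.
circles_db = DBSCAN(eps=0.3, min_samples=10).fit(circles_scaled)
circles_labels = circles_db.labels_

n_clusters = len(set(circles_labels)) - (1 if -1 in circles_labels else 0)
n_noise = int((circles_labels == -1).sum())
print("clusters:", n_clusters, "noise points:", n_noise)
```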

9.3 Evaluate DBSCAN visually, with silhouette, and with the metrics against the true y.

In [14]:

# A:
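
The three checks can be combined as in this sketch, with the same assumed parameters (`eps=0.3`, `min_samples=10`). Keep in mind that silhouette rewards compact, convex clusters, so it can be low for ring-shaped clusters even when the labels agree with the true `y`.

```python
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_circles
from sklearn.preprocessing import StandardScaler

circles_X, circles_y = make_circles(
    n_samples=1000, random_state=123, noise=0.1, factor=0.2
)
circles_scaled = StandardScaler().fit_transform(circles_X)
circles_labels = DBSCAN(eps=0.3, min_samples=10).fit_predict(circles_scaled)

# Visual check: color by the predicted cluster label.
fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(circles_scaled[:, 0], circles_scaled[:, 1], c=circles_labels, s=10)
ax.set_aspect("equal")

# Silhouette is only defined when there are at least two distinct labels.
sil = None
if len(set(circles_labels)) >= 2:
    sil = metrics.silhouette_score(circles_scaled, circles_labels)
print("silhouette:", sil)

# Agreement with the true ring labels.
h, c, v = metrics.homogeneity_completeness_v_measure(circles_y, circles_labels)
print("homogeneity:", h, "completeness:", c, "v-measure:", v)
```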
