GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_03/code/distributions and dummy variables (done).ipynb
Kernel: Python 3

Lesson 3: Demos

# General imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Libraries are large, so be selective and specific about what you import.
# Importing under an alias (np, pd, plt) also makes it easy to reach a specific
# sub-module of the library, and it avoids overwriting built-in functions such as sum.
%matplotlib inline
mtcars = pd.read_csv("mtcars.csv")
mtcars.head()

Although the mean and median both give us some sense of the center of a distribution, they aren't always the same. The median gives us a value that splits the data into two halves while the mean is a numeric average, so extreme values can have a significant impact on the mean.

In a symmetric distribution, the mean and median will be the same. Let's investigate with a density plot:

norm_data = pd.DataFrame(np.random.normal(size=100000))

norm_data.plot(kind="density", figsize=(10,10))

plt.vlines(norm_data.mean(),     # plot a black line at the mean (the x value)
           ymin=0, ymax=0.4,
           linewidth=5.0)

plt.vlines(norm_data.median(),   # plot a red line at the median (the x value)
           ymin=0, ymax=0.4,
           linewidth=2.0,
           color="red")
<matplotlib.collections.LineCollection at 0x11a2987b8>
Image in a Jupyter notebook

In the plot above, the mean and median are both so close to zero that the red median line lies on top of the thicker black line drawn at the mean.

In skewed distributions, the mean tends to get pulled in the direction of the skew, while the median tends to resist the effects of skew:

skewed_data = pd.DataFrame(np.random.exponential(size=100000))

skewed_data.plot(kind="density", figsize=(10,10), xlim=(-1,5))

plt.vlines(skewed_data.mean(),    # plot a black line at the mean
           ymin=0, ymax=0.8,
           linewidth=5.0)

plt.vlines(skewed_data.median(),  # plot a red line at the median
           ymin=0, ymax=0.8,
           linewidth=2.0,
           color="red")
<matplotlib.collections.LineCollection at 0x11a719c18>
Image in a Jupyter notebook

Notice that the mean is also influenced heavily by outliers, while the median resists the influence of outliers:

norm_data = np.random.normal(size=50)
outliers = np.random.normal(15, size=3)
combined_data = pd.DataFrame(np.concatenate((norm_data, outliers), axis=0))

combined_data.plot(kind="density", figsize=(10,10), xlim=(-5,20))

plt.vlines(combined_data.mean(),    # plot a black line at the mean
           ymin=0, ymax=0.2,
           linewidth=5.0)

plt.vlines(combined_data.median(),  # plot a red line at the median
           ymin=0, ymax=0.2,
           linewidth=2.0,
           color="red")
<matplotlib.collections.LineCollection at 0x1a1fd07668>
Image in a Jupyter notebook
combined_data.skew()
0    3.406609
dtype: float64
combined_data.kurt()
0    11.618951
dtype: float64

Since the median tends to resist the effects of skewness and outliers, it is known as a "robust" statistic.

The median generally gives a better sense of the typical value in a distribution with significant skew or outliers.
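As a quick numeric illustration (a minimal sketch, not part of the original notebook), a single extreme value pulls the mean sharply upward while barely moving the median:

import numpy as np

values = np.array([1, 2, 3, 4, 5])
with_outlier = np.append(values, 100)   # add one extreme value

print(values.mean(), np.median(values))              # 3.0 3.0
print(with_outlier.mean(), np.median(with_outlier))  # ~19.17 3.5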

Unclear what this cell is for (the two Series below are created but not used):

comp1 = np.random.normal(0, 1, size=200)   # N(0, 1): mean 0, standard deviation 1
comp2 = np.random.normal(10, 2, size=200)  # N(10, 2^2): mean 10, standard deviation 2

df1 = pd.Series(comp1)
df2 = pd.Series(comp2)

Skewness and Kurtosis

Skewness measures the asymmetry of a distribution, while kurtosis measures its "peakedness" (how sharply peaked and heavy-tailed it is).

We won't go into the exact calculations behind these, but they are essentially just statistics that take the idea of variance a step further: while variance involves squaring deviations from the mean, skewness involves cubing deviations from the mean, and kurtosis involves raising deviations from the mean to the 4th power.
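As a rough sketch of that idea (not part of the original notebook), the raw moment-based versions can be computed by hand; note that pandas' .skew() and .kurt() apply small-sample bias corrections and report excess kurtosis, so their values differ slightly from these raw moments:

import numpy as np

x = np.random.normal(size=100000)
z = (x - x.mean()) / x.std()                 # standardized deviations from the mean

raw_skewness = (z ** 3).mean()               # third standardized moment
raw_excess_kurtosis = (z ** 4).mean() - 3    # fourth standardized moment, minus 3

print(raw_skewness, raw_excess_kurtosis)     # both near 0 for normal data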

Pandas has built-in functions for checking skewness and kurtosis, df.skew() and df.kurt() respectively:

mtcars.head()
mtcars.mpg.mean()
20.090624999999996
mtcars.mpg.median()
19.2
mtcars[['mpg']].plot(kind="density", figsize=(10,10))
<matplotlib.axes._subplots.AxesSubplot at 0x1a1fd35f28>
Image in a Jupyter notebook
mtcars["mpg"].skew() #this is only slightly skewed
0.6723771376290805
mtcars["mpg"].kurt() #the kurtosis value is low which implies a high kurtosis or peak
-0.0220062914240855

To explore these two measures further, let's create some dummy data and inspect it:

norm_data = np.random.normal(size=100000)

skewed_data = np.concatenate((np.random.normal(size=35000) + 2,  # +2 shifts this component two units right (its mean becomes 2)
                              np.random.exponential(size=65000)),
                             axis=0)

uniform_data = np.random.uniform(0, 2, size=100000)

peaked_data = np.concatenate((np.random.exponential(size=50000),       # mirror exponential draws about zero
                              np.random.exponential(size=50000) * (-1)), # to get a sharp peak at 0
                             axis=0)

data_df = pd.DataFrame({"norm": norm_data,
                        "skewed": skewed_data,
                        "uniform": uniform_data,
                        "peaked": peaked_data})
data_df.head()
data_df.plot(kind='box')
<matplotlib.axes._subplots.AxesSubplot at 0x1a208ca2b0>
Image in a Jupyter notebook

Types of distributions

data_df["norm"].plot(kind="density", xlim=(-5,5))
<matplotlib.axes._subplots.AxesSubplot at 0x1a2079c908>
Image in a Jupyter notebook
data_df["peaked"].plot(kind="density", xlim=(-5,5))
<matplotlib.axes._subplots.AxesSubplot at 0x1a203ac208>
Image in a Jupyter notebook
data_df["skewed"].plot(kind="density", xlim=(-5,5))
<matplotlib.axes._subplots.AxesSubplot at 0x1a20609c18>
Image in a Jupyter notebook
data_df["uniform"].plot(kind="density", xlim=(-5,5))
<matplotlib.axes._subplots.AxesSubplot at 0x1a2068e3c8>
Image in a Jupyter notebook

All together

data_df.plot(kind="density", xlim=(-5,5))
<matplotlib.axes._subplots.AxesSubplot at 0x1a2068e0f0>
Image in a Jupyter notebook

Skewness

Now let's check the skewness of each of these distributions.

Since skewness measures asymmetry, we'd expect to see low skewness for all of the distributions except the skewed one, because all the others are roughly symmetric:

data_df.skew()
norm      -0.000019
peaked     0.045590
skewed     0.992044
uniform   -0.001869
dtype: float64

Kurtosis

Now let's check kurtosis. Since kurtosis measures peakedness, we'd expect the flat (uniform) distribution to have low kurtosis while the distributions with sharper peaks should have higher kurtosis.

data_df.kurt()
norm      -0.003705
peaked     3.307038
skewed     1.191849
uniform   -1.198439
dtype: float64

As we can see from the output, the normally distributed data has a kurtosis near zero, the flat distribution has negative kurtosis, and the two pointier distributions have positive kurtosis. (Pandas' kurt() reports excess kurtosis, i.e. kurtosis relative to a normal distribution, which is why normal data comes out near zero rather than 3.)

Class Variable Demo

Class/Dummy Variables

We want to represent categorical variables numerically, but we can't simply code a feature like Area (created below) as 0=rural, 1=suburban, 2=urban, because that would imply an ordered relationship between the categories (suggesting that urban is somehow "twice" suburban, which doesn't make sense).

Why do we only need two dummy variables, not three? Because two dummies capture all of the information about the Area feature and implicitly define rural as the reference level (the category encoded as 0 in both dummy columns).

In general, if you have a categorical feature with k levels, you create k-1 dummy variables.
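Here is a minimal sketch of that encoding on a made-up three-row Series (not the notebook's data), using pandas' get_dummies with drop_first=True so that rural becomes the implied reference level:

import pandas as pd

area = pd.Series(["rural", "suburban", "urban"])
pd.get_dummies(area, prefix="Area", drop_first=True)

#    Area_suburban  Area_urban
# 0              0           0   <- rural: both dummies are 0 (the reference level)
# 1              1           0
# 2              0           1
# (newer pandas versions display True/False instead of 1/0)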

# read data into a DataFrame
data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)
data.head()

Handling Categorical Predictors with Two Categories

Up to now, all of our predictors have been numeric. What if one of our predictors was categorical?

Let's create a new feature called "Size," and randomly assign observations to be small or large:

# set a seed for reproducibility
np.random.seed(12345)

# create a Series of booleans in which roughly half are True
nums = np.random.rand(len(data))
mask_large = nums > 0.5

# initially set Size to small, then change roughly half to be large
data['Size'] = 'small'
data.loc[mask_large, 'Size'] = 'large'
data.head()

For scikit-learn, we need to represent all data numerically.

If the feature only has two categories, we can simply create a dummy variable that represents the categories as a binary value.

# create a new Series called IsLarge
data['IsLarge'] = data['Size'].map({'small': 0, 'large': 1})
data.head()

Handling Categorical Predictors with More than Two Categories

Let's create a new feature called Area, and randomly assign observations to be rural, suburban, or urban:

# set a seed for reproducibility
np.random.seed(123456)

# assign roughly one third of observations to each group
nums = np.random.rand(len(data))
mask_suburban = (nums > 0.33) & (nums < 0.66)
mask_urban = nums > 0.66

data['Area'] = 'rural'
data.loc[mask_suburban, 'Area'] = 'suburban'
data.loc[mask_urban, 'Area'] = 'urban'
data.head()

As noted above, we have to represent Area numerically, but coding it as 0=rural, 1=suburban, 2=urban would imply an ordering that doesn't exist.

Instead, we create dummy variables:

Create three dummy variables using get_dummies, then exclude the first dummy column (dropping the first column with .iloc[:, 1:] is equivalent to passing drop_first=True). The generic pattern is:

my_categorical_var_dummies = pd.get_dummies(my_categorical_var, prefix='Area').iloc[:, 1:]

# create dummy variables using get_dummies, dropping the first dummy column
area_dummies = pd.get_dummies(data['Area'], prefix='Area', drop_first=True)

# concatenate the dummy columns onto the original DataFrame (axis=1 means concatenate side by side as columns)
data = pd.concat([data, area_dummies], axis=1)
data.head()