
Time Series: Rolling Statistics

Learning Objectives

After this lesson, you will be able to:

  • Define the concepts of trend and seasonality and be able to identify them visually.

  • Use boxplots to compare distributions.

  • Plot time series data over time to identify large-scale trends in data.

  • Investigate trends by computing simple aggregates with Pandas using the .resample() function.

  • Compute rolling statistics with Pandas to compare a date's value against a small window of surrounding time.

  • Utilize exponentially weighted windows to average out noise.

  • Use differences to remove trends in time series data.

  • Use Pandas' .shift() function to create lagged features.


Question: What constitutes a trend in data? Is linearity required for a trend?

  • A trend is any long-term change in the value we're measuring. Trends may “change direction,” going from an increasing trend to a decreasing trend.

  • Trends can only be measured within the scope of the data collected; there may be trends that are unmeasurable if the data are not complete.

An example of an upward trend:
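The original plot is not reproduced here; as a stand-in, below is a minimal sketch that generates a synthetic upward trend (the data are invented, purely for illustration):

import numpy as np
import matplotlib.pyplot as plt

t = np.arange(100)
y = 0.5 * t + np.random.normal(scale=5, size=100)  # linear trend plus noise

plt.plot(t, y)
plt.title("Synthetic series with an upward trend")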

  • When patterns repeat over known, fixed periods of time within a data set, we call this seasonality.

  • A seasonal pattern exists when a series is influenced by factors tied to the cyclic nature of time — i.e., the time of month, quarter, or year. Seasonality always has a fixed, known period; a repeating pattern without one is not truly seasonality and must either be attributed to another factor or treated as a set of anomalous events in the data.

Can you think of some seasonal patterns from your own experience?

import pandas as pd
import numpy as np
from datetime import timedelta
import matplotlib.pyplot as plt
%matplotlib inline

# Import the data.
df = pd.read_csv('data/mapquest_google_trends.csv')

# Clean/organize the data.
df.columns = ['WeekOf', 'Hits']
print(df.head())

plt.rcParams["figure.figsize"] = [16, 9]
ax = df.plot(title="Interest in MapQuest Over Time")
ax.set_xlabel("Week")
ax.set_ylabel("Hits")
       WeekOf  Hits
0  2004-01-04    53
1  2004-01-11    53
2  2004-01-18    54
3  2004-01-25    53
4  2004-02-01    52
Text(0,0.5,'Hits')
Image in a Jupyter notebook

Next, we need to compute a slope and intercept for our trendline. NumPy's polyfit() function can do this.

Then, we define our polynomial function using those coefficients. We can do this in a single dimension using NumPy's poly1d() function.

## Same as from sklearn.linear_model import LinearRegression!
line_coef = np.polyfit(df.index, df['Hits'], 1)
print(line_coef)
polynomial = np.poly1d(line_coef)

# The slope is ~-0.11 and the intercept is ~86.59
# (np.polyfit returns coefficients from the highest degree down).

# Let's take a look at the trendline values at specific points:
print(polynomial(0))
print(polynomial(1))
[-0.10841388 86.58979622]
86.58979621684865
86.48138233980062

Now, plot our trendline over the data.

plt.rcParams["figure.figsize"] = [16, 9]

# Plot the time series.
plt.plot(df.index, df['Hits'])

# Plot the least squares minimizing line.
plt.plot(df.index, polynomial(df.index))
[<matplotlib.lines.Line2D at 0x11d175b38>]
Image in a Jupyter notebook

Looks like a second-order polynomial might fit our data even better. Let's try that out.

line_coef = np.polyfit(df.index, df['Hits'], 2)
print(line_coef)
second_polynomial = np.poly1d(line_coef)
[-1.84964781e-04 9.77861827e-03 7.40219942e+01]
plt.rcParams["figure.figsize"] = [16, 9]

# Plot the time series.
plt.plot(df.index, df['Hits'])

# Plot the second-order least squares fit.
plt.plot(df.index, second_polynomial(df.index))
[<matplotlib.lines.Line2D at 0x1105662b0>]
Image in a Jupyter notebook

Question: Can you think of any other underlying patterns that might cause trends in time series data? What might cause seasonality in a time series?

Adding in Dates to the Plot

df.head()
df['count'] = range(df.shape[0])
df['WeekOf'] = pd.to_datetime(df['WeekOf'])
df.head()
plt.rcParams["figure.figsize"] = [16, 9]

# Plot the time series.
plt.plot(df.set_index('WeekOf')['Hits'])

# Add a vertical line marking the Google Maps release.
plt.axvline(x="2005-02-05", color='r')
<matplotlib.lines.Line2D at 0x11d12d240>
Image in a Jupyter notebook

Guided Practice

Let's look for trends and seasonality in data made available by a German drugstore, Rossmann.

These data contain the daily sales made at the drugstore, as well as whether a promotion or holiday was in effect.

Because we are most interested in the Date column (which contains the date of sales for each store), we will make sure to process that as a datetime type and make it the index of our DataFrame, as we did with our Apple stock data.

Let's recall the steps for preprocessing time series data with Pandas:

  • Convert time data to a datetime object.

  • Set datetime to index the DataFrame.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

plt.rcParams['figure.figsize'] = (16.0, 8.0)

data = pd.read_csv('data/rossmann.csv', skipinitialspace=True, low_memory=False)

# Convert the Date column to datetime and make it the index.
data['Date'] = pd.to_datetime(data['Date'])
data = data.set_index('Date')
data.head()

This allows us to easily filter by date. Let's add a column for Year and Month based on the datetime index.

data['Year'] = data.index.year
data['Month'] = data.index.month

# The datetime index lets us filter rows with partial date strings.
data['2015-05'].head()

There are more than a million sales data points in this data set, so, for some simple exploratory data analysis (EDA), we'll focus on just one store.

store1_data = data[data['Store'] == 1]
store1_data.head()

Plot the sales data.

Let's investigate whether or not promotions affect sales. For this, we'll use boxplots.

On state holidays, the store is closed (which means there are zero sales), so we need to cut those days out. (Contextual knowledge like this is always necessary to truly explain time series phenomena.)

Check for Understanding: Can you think of any other special considerations we should make when tracking sales?

Now, check whether sales differ on promotion days.

sns.factorplot(
    x='Promo',
    y='Sales',
    data=store1_data[store1_data['Open'] == 1],
    kind='box'
)
<seaborn.axisgrid.FacetGrid at 0x11d367518>
Image in a Jupyter notebook

We can see that there is a difference in sales on promotion days.

Why is it important to separate out days on which the store is closed? Because there aren't any promotions on those days either, so including them will bias our sales data on days without promotions! Remember to think about the business logic in addition to analyzing the raw data.

We may also want to compare sales across days of the week:

sns.factorplot(
    x='DayOfWeek',
    y='Sales',
    data=store1_data,
    kind='box',
)
<seaborn.axisgrid.FacetGrid at 0x10fbf3048>
Image in a Jupyter notebook

Lastly, we want to identify larger-scale trends in our data. How did sales change from 2014 to 2015? Were there any particularly interesting outliers in terms of sales or customer visits?

To plot the sales and customer visits over time:

# Filter to days Store 1 was open.
store1_open_data = store1_data[store1_data['Open'] == 1]
store1_open_data[['Sales']].plot()
store1_open_data[['Customers']].plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1a1f6ee6d8>
Image in a Jupyter notebook
Image in a Jupyter notebook

We can see that there are large spikes of sales and customers toward the end of 2013 and 2014, leading into the first quarter of 2014 and 2015.

Let's use index filtering to filter down to just 2015. This should make it easier to identify the holiday sales bump.

store1_data_2015 = store1_data['2015']
store1_data_2015[store1_data_2015.Open == 1][['Sales']].plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1a1f790908>
Image in a Jupyter notebook

If we want to investigate trends over time in sales, as always, we'll start by computing simple aggregates. We want to know: What were the mean and median sales in each month and year?

We can use data.resample on the whole data set and provide:

  • The level to roll up to: 'D' for day, 'W' for week, 'M' for month, 'A' for year.

  • The aggregation method to perform: mean(), median(), sum(), etc.

data[['Sales']].resample('A').mean()

It looks like average sales were highest in 2015. Now, let's look at the median annual sales.

data[['Sales']].resample('A').median()
data[['Sales']].resample('M').mean()
data[['Sales']].resample('M').median()

For more information, see Pandas' .resample documentation.

Question: How do aggregation techniques help us sort out important insights from the noise of time series data?

With time series, we can "roll" statistics across time. For example, the rolling mean is the mean of a moving window across time periods. Pandas offers a variety of functionalities for creating rolling statistics, which we'll only scratch the surface of here.

E.g., to understand holiday sales, we don't want to compare sales data in late December with the entire month, but instead with the few days immediately surrounding it. We can do this using rolling averages.

The syntax for these can be a little tricky at first. We'll be using a rolling() function with a statistical function chained to it. Let's dive into more detail.

Parameters for rolling() Functions

rolling().mean() (as well as rolling().median()) involves the following parameters:

  • The time series to aggregate is the Series or DataFrame that .rolling() is called on.

  • window indicates the number of periods to include in each window.

  • center indicates whether the window should be centered on the date or should use only data prior to that date. (See the sketch just below.)
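A minimal sketch of the difference between a trailing and a centered window, using a toy Series invented for illustration:

import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Trailing window (default): each value averages itself and the two prior rows.
print(s.rolling(window=3).mean())               # NaN, NaN, 2.0, 3.0, 4.0

# Centered window: each value averages itself and one neighbor on each side.
print(s.rolling(window=3, center=True).mean())  # NaN, 2.0, 3.0, 4.0, NaN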

Calculate the rolling daily sum over all stores.

Use the .resample() function to calculate the daily total over all of the stores.

daily_store_sales = data[['Sales']].resample('D').sum()

Use the .rolling() function to calculate the rolling average over a three-day period.

daily_store_sales.rolling(window=3, center=True).mean().head()

We can use our index filtering to just look at 2015.

daily_store_sales.rolling(window=7, center=True).mean()['2015'].head()

Instead of plotting the full time series, we can plot the rolling mean instead, which smooths out random fluctuations in sales and dampens the effect of outliers, helping us identify larger trends.

daily_store_sales.rolling(window=30, center=True).mean().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1a2036c908>
Image in a Jupyter notebook

The Expanding Mean

The expanding mean simply uses all of the data points up to the current time to calculate the mean, as opposed to a fixed-width moving window; it is equivalent to a cumulative mean, as the sketch below shows.
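A minimal sketch of that equivalence, using a toy Series invented for illustration:

import numpy as np
import pandas as pd

s = pd.Series([4.0, 2.0, 6.0, 8.0])

# The expanding mean at each step averages all points seen so far...
expanding = s.expanding().mean()

# ...which is the same as the cumulative sum divided by the count so far.
manual = s.cumsum() / np.arange(1, len(s) + 1)

print(np.allclose(expanding, manual))  # True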

Calculate and plot the expanding mean below. Resample by quarter.

rolling_mean = data.Sales.resample('Q').sum().rolling(window=1, center=False).mean()
expanding_mean = data.Sales.resample('Q').sum().expanding().mean()
plt.rcParams["figure.figsize"] = [16, 9]
fig, ax = plt.subplots()
rolling_mean.plot(legend=True)
expanding_mean.plot(legend=True)
ax.legend(['Rolling Mean', 'Expanding Mean'])
<matplotlib.legend.Legend at 0x10f2551d0>
Image in a Jupyter notebook

Exponentially Weighted Windows

Exponentially weighted windows are one of the most common and effective ways of averaging out noise in time series data. The averaging is done with an "exponential decay" on the contribution of prior means, decreasing the contribution of time points that are further in the past.

The (adjusted) exponentially weighted mean for time $t$ is defined as:

$$y_t = \frac{x_t + (1 - \alpha)x_{t-1} + (1 - \alpha)^2 x_{t-2} + \cdots + (1 - \alpha)^{t}x_0}{1 + (1 - \alpha) + (1 - \alpha)^2 + \cdots + (1 - \alpha)^{t}}$$

Note: Review Pandas' documentation for more information.
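To make the formula concrete, here is a minimal sketch (with an invented toy Series and alpha = 0.5) that computes the adjusted weighted mean by hand and checks it against Pandas' .ewm(). Note that the span parameter used later relates to alpha by alpha = 2 / (span + 1).

import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])
alpha = 0.5

def adjusted_ewm(x, alpha):
    # Weights 1, (1 - alpha), (1 - alpha)^2, ... for increasingly old points.
    w = (1 - alpha) ** np.arange(len(x))
    out = []
    for t in range(len(x)):
        weights = w[: t + 1]
        values = x[t::-1]  # x_t, x_{t-1}, ..., x_0
        out.append(np.dot(weights, values) / weights.sum())
    return pd.Series(out)

manual = adjusted_ewm(s.to_numpy(), alpha)
pandas_ewm = s.ewm(alpha=alpha, adjust=True).mean()
print(np.allclose(manual, pandas_ewm))  # True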

Calculate and plot the exponentially weighted sum along with the rolling sum. What's the difference?

For example: .resample('Q').sum().ewm(span=10).mean().

rolling_mean = data.Sales.resample('Q').sum().rolling(window=2, center=True).mean()
exp_weighted_mean = data.Sales.resample('Q').sum().ewm(span=10).mean()
plt.rcParams["figure.figsize"] = [16, 9]
fig, ax = plt.subplots()
rolling_mean.plot(legend=True)
exp_weighted_mean.plot(legend=True)
ax.legend(['Rolling Mean', 'Exponentially Weighted Mean'])
<matplotlib.legend.Legend at 0x1a1fc9c400>
Image in a Jupyter notebook

Note that .rolling() operates on rows, not calendar periods, so it doesn't know if you are missing periods. For example, if you roll over three days and your data are missing weekends, a window can span Thursday, Friday, and Monday. If you care about that, resample to a regular frequency first, as in the sketch below.
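A minimal sketch of the pitfall, using an invented business-day series:

import pandas as pd

# Ten business days; Saturdays and Sundays are absent from the index.
idx = pd.bdate_range('2015-01-05', periods=10)
sales = pd.Series(range(10), index=idx, dtype=float)

# Rolling over raw rows silently spans the weekend gap (Thu, Fri, Mon).
row_based = sales.rolling(window=3).mean()

# Resampling to daily frequency first inserts NaNs for the missing days,
# so a three-row window now covers three consecutive calendar days.
calendar_based = sales.resample('D').asfreq().rolling(window=3).mean()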

Question: How does the signal that's captured by rolling statistics differ from the signal captured by resampling time series data?

If a time series is stationary, the mean, variance, and autocorrelation (covered in the next section) will be constant over time. Forecasting methods typically assume that the time series you are forecasting is stationary, or at least approximately so.

The most common way to make a time series stationary is through "differencing." This procedure converts a time series into the difference between values.

$$\Delta y_t = y_t - y_{t-1}$$

This removes trends in the time series and helps stabilize the mean across time. In most cases, a single difference is enough, although in some a second (or third, etc.) difference will be necessary to remove trends, as sketched below.
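A minimal sketch of repeated differencing, reusing the notebook's store1_data:

# First difference: the day-over-day change.
first_diff = store1_data['Sales'].diff()

# Second difference (a difference of the differences): useful when one
# difference still leaves a trend, e.g., with quadratic growth.
second_diff = store1_data['Sales'].diff().diff()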

# A difference with periods=7 compares each day to the same weekday one
# week earlier, removing the weekly seasonal pattern as well as the trend.
diff = store1_data['Sales'].diff(periods=7)
plt.rcParams["figure.figsize"] = [16, 9]
fig, ax = plt.subplots()
store1_data['Sales'].plot(legend=True)
diff.plot(legend=True)
ax.legend(['Sales', '7-Day Difference'])
<matplotlib.legend.Legend at 0x1a1fe0b400>
Image in a Jupyter notebook

Check: How does differencing help with problems of non-stationarity and autocorrelation in time series data?

Another common operation on time series data is shifting or lagging values backward and forward in time. This can help us calculate the percentage of change from sample to sample. Pandas has a .shift() method for shifting the data in a DataFrame.

Let's take a look at the Rossmann data when we apply lagged features.

Always make sure the shift goes in the intended direction. A positive shift moves values forward in time (each row receives the value from an earlier row), which can be confusing at first.

# A clean way to use .shift(): store the lagged values in a new column.
store1_data['sales_t1'] = store1_data['Sales'].shift(1)
/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
store1_data.head()

Let's shift the sales data by one day.

shifted_forward = store1_data.shift(1)
shifted_forward.head()

Notice that the first row now contains NaN values because there wasn't a previous day's data to shift to that day.

Next, let's shift the sales data by five days.

shifted_forward5 = store1_data.shift(5)
shifted_forward5.head(10)

We can also use negative numbers to shift the sales values in the reverse direction.

shifted_backward = store1_data.shift(-1)
shifted_backward.head()

Lags can be used to calculate the changes in the values you are tracking with your time series data. In this case, we can use Pandas' .shift() method to look at the changes in sales.

Let's create a new column in our Rossmann DataFrame that contains the previous day's sales.

Note that we add .copy() to the end of the chained assignment to ask Pandas for a copy rather than a view. As the warning below shows, this does not fully silence the SettingWithCopyWarning here, because store1_data is itself a slice of data; creating it with data[data['Store'] == 1].copy() would avoid the warning entirely. Here is a useful video that helps explain how to avoid SettingWithCopyWarning errors in Pandas.

store1_data['Prev Day Sales'] = store1_data['Sales'].shift(1).copy()
store1_data.head()
/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.

Using our new column, it's simple to calculate the one-day change in sales at Store 1. Let's create a new column for this value in our DataFrame as well.

store1_data['Sales Change'] = store1_data['Sales'] - store1_data['Prev Day Sales'].copy()
store1_data.head()
/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
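The lesson mentioned percentage change earlier; a minimal sketch of computing it from the lagged column (the 'Sales Pct Change' column name is invented here), along with Pandas' built-in shortcut:

# Day-over-day percentage change from the lagged column we just built...
store1_data['Sales Pct Change'] = (
    store1_data['Sales Change'] / store1_data['Prev Day Sales']
)

# ...which Pandas can also compute directly from the raw series:
store1_data['Sales'].pct_change().head()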

Question: What are some other real-world applications you can think of for shifting data in a time series?

Recap

  • Trends are long-term changes in data. We have to sort through the noise of a time series to identify trends.

  • We can resample the data to look at simple aggregates and identify patterns.

  • Rolling statistics give us a local average in time, smoothing out random fluctuations and dampening the effect of outliers.

Instructor Note: These are optional and can be assigned as student practice questions outside of class.

1) Load the Unemployment data set. Perform any necessary cleaning and preprocess the data by creating a datetime index.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
%matplotlib inline
unemp = pd.read_csv('./data/unemployment.csv')
unemp.head()
list(unemp.columns)
['Quarter', 'Seasonally adjusted quarterly U.S. unemployment rates from 1948 to 1993']
unemp['unemp_rate'] = unemp['Seasonally adjusted quarterly U.S. unemployment rates from 1948 to 1993'].str.replace("%", '')
unemp['unemp_rate'] = unemp['unemp_rate'].astype('float')
unemp.dtypes
Quarter                                                                      object
Seasonally adjusted quarterly U.S. unemployment rates from 1948 to 1993      object
unemp_rate                                                                  float64
dtype: object
unemp.set_index('Quarter').plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1a22db16d8>
Image in a Jupyter notebook
unemp['year'] = unemp['Quarter'].apply(lambda x: x.split('Q')[0])
unemp.head()

2) Plot the unemployment rate.

3) Calculate the rolling mean of the unemployment rate with window=3, without centering, and plot both the unemployment rates and the rolling mean data.

4) Calculate the rolling median with window=5 and window=15. Plot both together with the original data.

5) Calculate and plot the expanding mean. Resample by quarter. Plot the rolling mean and the expanding mean together.

6) Calculate and plot the exponentially weighted sum along with the rolling sum.

7) Difference the unemployment rate and plot.