
DSE200x Week 4 Notes

Kernel: Python 3 (Ubuntu Linux)


Pandas


pandas is a Python library for data analysis. It offers a number of data exploration, cleaning and transformation operations that are critical in working with data in Python.

pandas builds upon numpy and scipy, providing easy-to-use data structures and data manipulation functions with integrated indexing.

The main data structures pandas provides are Series and DataFrames. After a brief introduction to these two data structures and data ingestion, the key features of pandas this notebook covers are:

  • Generating descriptive statistics on data

  • Data cleaning using built in pandas functions

  • Frequent data operations for subsetting, filtering, insertion, deletion and aggregation of data

  • Merging multiple datasets using dataframes

  • Working with timestamps and time-series data

Additional Recommended Resources:

Let's get started with our first pandas notebook!


Import Libraries

import pandas as pd

Introduction to pandas Data Structures


*pandas* has two main data structures it uses, namely, *Series* and *DataFrames*.

pandas Series

A pandas Series is a one-dimensional labeled array.

ser = pd.Series([100, 'foo', 300, 'bar', 500], ['tom', 'bob', 'nancy', 'dan', 'eric'])
ser
tom      100
bob      foo
nancy    300
dan      bar
eric     500
dtype: object
ser.index
Index(['tom', 'bob', 'nancy', 'dan', 'eric'], dtype='object')
ser.loc[['nancy','bob']]
nancy    300
bob      foo
dtype: object
ser[[4, 3, 1]]
eric    500
dan     bar
bob     foo
dtype: object
ser.iloc[2]
300
'bob' in ser
True
ser
tom      100
bob      foo
nancy    300
dan      bar
eric     500
dtype: object
ser * 2
tom       200
bob    foofoo
nancy     600
dan    barbar
eric     1000
dtype: object
ser[['nancy', 'eric']] ** 2
nancy     90000
eric     250000
dtype: object
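Note that ser[[4, 3, 1]] above selects by position because the integer keys are not in the string index; recent pandas versions deprecate this implicit positional fallback. As a brief aside (a sketch, not part of the original notes), here are the same lookups written explicitly with .loc and .iloc:

# Label-based selection (same result as ser.loc[['nancy','bob']] above)
ser.loc[['nancy', 'bob']]
# Position-based selection, equivalent to ser[[4, 3, 1]] but explicit
ser.iloc[[4, 3, 1]]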

pandas DataFrame

pandas DataFrame is a 2-dimensional labeled data structure.

Create DataFrame from dictionary of Python Series

d = {'one' : pd.Series([100., 200., 300.], index=['apple', 'ball', 'clock']), 'two' : pd.Series([111., 222., 333., 4444.], index=['apple', 'ball', 'cerill', 'dancy'])}
df = pd.DataFrame(d)
print(df)
          one     two
apple   100.0   111.0
ball    200.0   222.0
cerill    NaN   333.0
clock   300.0     NaN
dancy     NaN  4444.0
df.index
Index(['apple', 'ball', 'cerill', 'clock', 'dancy'], dtype='object')
df.columns
Index(['one', 'two'], dtype='object')
pd.DataFrame(d, index=['dancy', 'ball', 'apple'])
pd.DataFrame(d, index=['dancy', 'ball', 'apple'], columns=['two', 'five'])

Create DataFrame from list of Python dictionaries

data = [{'alex': 1, 'joe': 2}, {'ema': 5, 'dora': 10, 'alice': 20}]
pd.DataFrame(data)
pd.DataFrame(data, index=['orange', 'red'])
pd.DataFrame(data, columns=['joe', 'dora','alice'])

Basic DataFrame operations

df
df['one']
apple     100.0
ball      200.0
cerill      NaN
clock     300.0
dancy       NaN
Name: one, dtype: float64
df['three'] = df['one'] * df['two']
df
df['flag'] = df['one'] > 250
df
three = df.pop('three')
three
apple     11100.0
ball      44400.0
cerill        NaN
clock         NaN
dancy         NaN
Name: three, dtype: float64
df
del df['two']
df
df.insert(2, 'copy_of_one', df['one'])
df
df['one_upper_half'] = df['one'][:2]
df

Case Study: Movie Data Analysis


This notebook uses a dataset from the MovieLens website. We will describe the dataset further as we explore it using *pandas*.

Download the Dataset

Please note that you will need to download the dataset yourself. Although the video for this notebook says that the data is in your folder, the folder turned out to be too large to host on the edX platform.

Here are the links to the data source and location:

Once the download completes, please make sure the data files are in a directory called movielens in your Week-3-pandas folder.

Let us look at the files in this dataset using the UNIX command ls.

# Note: Adjust the name of the folder to match your local directory
!ls ./movielens
README.txt genome-tags.csv links.csv movies.csv ratings.csv tags.csv
!cat ./movielens/movies.csv | wc -l
9743
!head -5 ./movielens/ratings.csv
userId,movieId,rating,timestamp
1,1,4.0,964982703
1,3,4.0,964981247
1,6,4.0,964982224
1,47,5.0,964983815

Use Pandas to Read the Dataset


In this notebook, we will be using three CSV files:

  • **ratings.csv:** *userId*, *movieId*, *rating*, *timestamp*

  • **tags.csv:** *userId*, *movieId*, *tag*, *timestamp*

  • **movies.csv:** *movieId*, *title*, *genres*

Using the read_csv function in pandas, we will ingest these three files.

movies = pd.read_csv('./movielens/movies.csv', sep=',')
print(type(movies))
movies.head(15)
<class 'pandas.core.frame.DataFrame'>
# Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970
tags = pd.read_csv('./movielens/tags.csv', sep=',')
tags.head()
ratings = pd.read_csv('./movielens/ratings.csv', sep=',', parse_dates=['timestamp'])
ratings.head()
# For the current analysis, we will remove timestamp (we will come back to it!)
del ratings['timestamp']
del tags['timestamp']

Data Structures

Series

# Extract the 0th row: notice that it is in fact a Series
row_0 = tags.iloc[0]
type(row_0)
pandas.core.series.Series
print(row_0)
userId         2
movieId    60756
tag        funny
Name: 0, dtype: object
row_0.index
Index(['userId', 'movieId', 'tag'], dtype='object')
row_0['userId']
2
'rating' in row_0
False
row_0.name
0
row_0 = row_0.rename('first_row')
row_0.name
'first_row'

DataFrames

tags.head()
tags.index
RangeIndex(start=0, stop=3683, step=1)
tags.columns
Index(['userId', 'movieId', 'tag', 'timestamp', 'parsed_time'], dtype='object')
# Extract row 0, 11, 2000 from DataFrame tags.iloc[ [0,11,2000] ]

Descriptive Statistics

Let's look at how the ratings are distributed!

ratings['rating'].describe()
count    100836.000000
mean          3.501557
std           1.042529
min           0.500000
25%           3.000000
50%           3.500000
75%           4.000000
max           5.000000
Name: rating, dtype: float64
ratings.describe()
ratings['rating'].mean()
3.501556983616962
ratings.mean()
userId       326.127564
movieId    19435.295718
rating         3.501557
dtype: float64
ratings['rating'].min()
0.5
ratings['rating'].max()
5.0
ratings['rating'].std()
1.0425292390605359
ratings['rating'].mode()
0    4.0
dtype: float64
ratings.corr()
filter_1 = ratings['rating'] > 5
print(filter_1)
filter_1.any()
0         False
1         False
2         False
3         False
4         False
          ...
100831    False
100832    False
100833    False
100834    False
100835    False
Name: rating, Length: 100836, dtype: bool
False
filter_2 = ratings['rating'] > 0
filter_2.all()
True
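As a small aside (a sketch, not from the original notes), a boolean mask can also be aggregated directly, since True counts as 1; the five_star name below is illustrative:

# Count and fraction of 5-star ratings using the boolean mask itself
five_star = ratings['rating'] == 5.0
print(five_star.sum())    # number of ratings equal to 5.0
print(five_star.mean())   # fraction of all ratings that are 5.0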

Data Cleaning: Handling Missing Data

movies.shape
(9742, 3)
# Is any row NULL?
movies.isnull().any()
movieId    False
title      False
genres     False
dtype: bool

That's nice! No NULL values!

ratings.shape
(100836, 3)
# Is any row NULL?
ratings.isnull().any()
userId     False
movieId    False
rating     False
dtype: bool

That's nice! No NULL values!

tags.shape
(3683, 3)
# Is any row NULL?
tags.isnull().any()
userId     False
movieId    False
tag        False
dtype: bool

No NULL tags in this smaller dataset either, but let's run dropna() anyway to see how it works.

tags = tags.dropna()
# Check again: is any row NULL?
tags.isnull().any()
userId     False
movieId    False
tag        False
dtype: bool
tags.shape
(3683, 3)

That's nice! No NULL values! Since there were no NULL tags to begin with, the number of rows is unchanged; if there had been any, dropna() would have reduced it.
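Dropping rows is not the only option. As a hedged sketch (not part of the original notes), missing values can instead be filled with a placeholder using fillna; the tags_filled name below is illustrative:

# Replace any missing tag with a placeholder instead of dropping the row
tags_filled = tags.fillna({'tag': 'unknown'})
tags_filled.isnull().any()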

Data Visualization

%matplotlib inline
ratings.hist(column='rating', figsize=(15,10))
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f12e4acc438>]], dtype=object)
Image in a Jupyter notebook
ratings.boxplot(column='rating', figsize=(15,20))
<matplotlib.axes._subplots.AxesSubplot at 0x7f12e4a63860>
Image in a Jupyter notebook

Slicing Out Columns

tags['tag'].head()
0              funny
1    Highly quotable
2       will ferrell
3       Boxing story
4                MMA
Name: tag, dtype: object
movies[['title','genres']].head()
ratings[-10:]
tag_counts = tags['tag'].value_counts()
tag_counts[-10:]
tedious                    1
royalty                    1
narnia                     1
Notable Nudity             1
stephen king               1
amazing                    1
short films                1
Lonesome Polecat           1
Disney animated feature    1
Andy Garcia                1
Name: tag, dtype: int64
tag_counts[:10].plot(kind='bar', figsize=(15,10))
<matplotlib.axes._subplots.AxesSubplot at 0x7f12e49b2ef0>
Image in a Jupyter notebook

Filters for Selecting Rows

is_highly_rated = ratings['rating'] >= 4.0
ratings[is_highly_rated][30:50]
is_animation = movies['genres'].str.contains('Animation')
movies[is_animation][5:15]
movies[is_animation].head(15)

Group By and Aggregate

ratings_count = ratings[['movieId','rating']].groupby('rating').count()
ratings_count
average_rating = ratings[['movieId','rating']].groupby('movieId').mean()
average_rating.head()
movie_count = ratings[['movieId','rating']].groupby('movieId').count()
movie_count.head()
movie_count = ratings[['movieId','rating']].groupby('movieId').count()
movie_count.tail()
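As a short aside (a sketch, not from the original notebook), .agg() can compute several statistics per group in one call, combining the mean and count produced separately above; rating_summary is an illustrative name:

# Mean and count of ratings per movie in a single groupby
rating_summary = ratings.groupby('movieId')['rating'].agg(['mean', 'count'])
rating_summary.head()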

Merge Dataframes

tags.head()
movies.head()
t = movies.merge(tags, on='movieId', how='inner')
t.head()
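how='inner' keeps only the movieIds that appear in both frames. Here is a brief sketch (not part of the original notes) of the other join types; the left_join and outer_join names are illustrative:

# Left join keeps every movie, with NaN where a movie has no tags;
# outer join keeps keys from both sides
left_join = movies.merge(tags, on='movieId', how='left')
outer_join = movies.merge(tags, on='movieId', how='outer')
print(t.shape, left_join.shape, outer_join.shape)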


Combine aggregation, merging, and filters to get useful analytics

avg_ratings = ratings.groupby('movieId', as_index=False).mean()
del avg_ratings['userId']
avg_ratings.head()
box_office = movies.merge(avg_ratings, on='movieId', how='inner')
box_office.tail()
is_highly_rated = box_office['rating'] >= 4.0
box_office[is_highly_rated][-5:]
is_comedy = box_office['genres'].str.contains('Comedy')
box_office[is_comedy][:5]
box_office[is_comedy & is_highly_rated][-5:]

Vectorized String Operations

movies.head()


Split 'genres' into multiple columns


movie_genres = movies['genres'].str.split('|', expand=True)
movie_genres[:10]


Add a new column for comedy genre flag


movie_genres['isComedy'] = movies['genres'].str.contains('Comedy')
movie_genres[:10]


Extract year from title e.g. (1995)


movies['year'] = movies['title'].str.extract(r'.*\((.*)\).*', expand=True)
movies.tail()
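Note that the extracted year is stored as a string. As a hedged sketch (not part of the original notes), converting it to a number makes it usable in numeric operations such as the correlations computed later; year_as_number is an illustrative name, and errors='coerce' turns anything that is not a valid year into NaN:

# Convert the extracted year strings to numbers without modifying the movies frame
year_as_number = pd.to_numeric(movies['year'], errors='coerce')
year_as_number.describe()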

Parsing Timestamps

Timestamps are common in sensor data or other time series datasets. Let us revisit the tags.csv dataset and read the timestamps!

tags = pd.read_csv('./movielens/tags.csv', sep=',')
tags.dtypes
userId        int64
movieId       int64
tag          object
timestamp     int64
dtype: object

Unix time / POSIX time / epoch time records time in seconds
since midnight Coordinated Universal Time (UTC) of January 1, 1970

tags.head(5)
tags['parsed_time'] = pd.to_datetime(tags['timestamp'], unit='s')

The data type datetime64[ns] maps to either <M8[ns] or >M8[ns], depending on the byte order (endianness) of the hardware

tags['parsed_time'].dtype
dtype('<M8[ns]')
tags.head(2)

Selecting rows based on timestamps

greater_than_t = tags['parsed_time'] > '2015-02-01'
selected_rows = tags[greater_than_t]
tags.shape, selected_rows.shape
((3683, 5), (1710, 5))

Sorting the table using the timestamps

tags.sort_values(by='parsed_time', ascending=True)[:10]

Average Movie Ratings over Time

## Are Movie ratings related to the year of launch?
average_rating = ratings[['movieId','rating']].groupby('movieId', as_index=False).mean()
average_rating.tail()
joined = movies.merge(average_rating, on='movieId', how='inner')
joined.head()
joined.corr()
yearly_average = joined[['year','rating']].groupby('year', as_index=False).mean()
yearly_average[:10]
yearly_average[-20:].plot(x='year', y='rating', figsize=(15,10), grid=True)
<matplotlib.axes._subplots.AxesSubplot at 0x7f12e47a2128>
Image in a Jupyter notebook

Do some years look better for box-office movies than others?

Does any data point seem like an outlier in some sense?
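One possible way to probe these questions (a sketch, not from the original notes) is to sort the yearly averages and count how many rated movies fall in each year, since years with very few movies can produce extreme-looking averages; movies_per_year is an illustrative name:

# Years with the lowest and highest average ratings, and years with the fewest rated movies
movies_per_year = joined.groupby('year')['movieId'].count()
print(yearly_average.sort_values(by='rating').head())
print(yearly_average.sort_values(by='rating').tail())
print(movies_per_year.sort_values().head())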