
DSE200x Week 4 Notes

Kernel: Python 3 (Ubuntu Linux)


Pandas


pandas is a Python library for data analysis. It offers a number of data exploration, cleaning and transformation operations that are critical in working with data in Python.

pandas builds upon numpy and scipy, providing easy-to-use data structures and data manipulation functions with integrated indexing.

The main data structures pandas provides are Series and DataFrames. After a brief introduction to these two data structures and data ingestion, the key features of pandas this notebook covers are:

  • Generating descriptive statistics on data

  • Data cleaning using built in pandas functions

  • Frequent data operations for subsetting, filtering, insertion, deletion and aggregation of data

  • Merging multiple datasets using dataframes

  • Working with timestamps and time-series data

Additional Recommended Resources:

Let's get started with our first pandas notebook!


Import Libraries

import pandas as pd

Introduction to pandas Data Structures


*pandas* has two main data structures it uses, namely, *Series* and *DataFrames*.

pandas Series

A pandas Series is a one-dimensional labeled array.

ser = pd.Series([100, 'foo', 300, 'bar', 500], ['tom', 'bob', 'nancy', 'dan', 'eric'])
ser
tom      100
bob      foo
nancy    300
dan      bar
eric     500
dtype: object
ser.index
Index(['tom', 'bob', 'nancy', 'dan', 'eric'], dtype='object')
ser.loc[['nancy','bob']]
nancy    300
bob      foo
dtype: object
ser[[4, 3, 1]]
eric    500
dan     bar
bob     foo
dtype: object
ser.iloc[2]
300
'bob' in ser
True
ser
tom      100
bob      foo
nancy    300
dan      bar
eric     500
dtype: object
ser * 2
tom       200
bob    foofoo
nancy     600
dan    barbar
eric     1000
dtype: object
ser[['nancy', 'eric']] ** 2
nancy     90000
eric     250000
dtype: object
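Note that ser[[4, 3, 1]] above selects by position because the integer keys are not in the string index; recent pandas versions deprecate this implicit positional fallback. As a brief aside (a sketch, not part of the original notes), here are the same lookups written explicitly with .loc and .iloc:

# Label-based selection (same result as ser.loc[['nancy','bob']] above)
ser.loc[['nancy', 'bob']]
# Position-based selection, equivalent to ser[[4, 3, 1]] but explicit
ser.iloc[[4, 3, 1]]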

pandas DataFrame

pandas DataFrame is a 2-dimensional labeled data structure.

Create DataFrame from dictionary of Python Series

d = {'one' : pd.Series([100., 200., 300.], index=['apple', 'ball', 'clock']), 'two' : pd.Series([111., 222., 333., 4444.], index=['apple', 'ball', 'cerill', 'dancy'])}
df = pd.DataFrame(d)
print(df)
          one     two
apple   100.0   111.0
ball    200.0   222.0
cerill    NaN   333.0
clock   300.0     NaN
dancy     NaN  4444.0
df.index
Index(['apple', 'ball', 'cerill', 'clock', 'dancy'], dtype='object')
df.columns
Index(['one', 'two'], dtype='object')
pd.DataFrame(d, index=['dancy', 'ball', 'apple'])
pd.DataFrame(d, index=['dancy', 'ball', 'apple'], columns=['two', 'five'])

Create DataFrame from list of Python dictionaries

data = [{'alex': 1, 'joe': 2}, {'ema': 5, 'dora': 10, 'alice': 20}]
pd.DataFrame(data)
pd.DataFrame(data, index=['orange', 'red'])
pd.DataFrame(data, columns=['joe', 'dora','alice'])

Basic DataFrame operations

df
df['one']
apple     100.0
ball      200.0
cerill      NaN
clock     300.0
dancy       NaN
Name: one, dtype: float64
df['three'] = df['one'] * df['two']
df
df['flag'] = df['one'] > 250
df
three = df.pop('three')
three
apple     11100.0
ball      44400.0
cerill        NaN
clock         NaN
dancy         NaN
Name: three, dtype: float64
df
del df['two']
df
df.insert(2, 'copy_of_one', df['one'])
df
df['one_upper_half'] = df['one'][:2]
df

Case Study: Movie Data Analysis


This notebook uses a dataset from the MovieLens website. We will describe the dataset further as we explore it using *pandas*.

Download the Dataset

Please note that you will need to download the dataset yourself. Although the video for this notebook says that the data is in your folder, the folder turned out to be too large to host on the edX platform.

Here are the links to the data source and location:

Once the download completes, please make sure the data files are in a directory called movielens in your Week-3-pandas folder.

Let us look at the files in this dataset using the UNIX command ls.

# Note: Adjust the name of the folder to match your local directory
!ls ./movielens
README.txt genome-tags.csv links.csv movies.csv ratings.csv tags.csv
!cat ./movielens/movies.csv | wc -l
9743
!head -5 ./movielens/ratings.csv
userId,movieId,rating,timestamp
1,1,4.0,964982703
1,3,4.0,964981247
1,6,4.0,964982224
1,47,5.0,964983815

Use Pandas to Read the Dataset


In this notebook, we will be using three CSV files:

  • **ratings.csv:** *userId*, *movieId*, *rating*, *timestamp*

  • **tags.csv:** *userId*, *movieId*, *tag*, *timestamp*

  • **movies.csv:** *movieId*, *title*, *genres*

Using the read_csv function in pandas, we will ingest these three files.

movies = pd.read_csv('./movielens/movies.csv', sep=',')
print(type(movies))
movies.head(15)
<class 'pandas.core.frame.DataFrame'>
# Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970
tags = pd.read_csv('./movielens/tags.csv', sep=',')
tags.head()
ratings = pd.read_csv('./movielens/ratings.csv', sep=',', parse_dates=['timestamp'])
ratings.head()
# For the current analysis, we will remove timestamp (we will come back to it!)
del ratings['timestamp']
del tags['timestamp']

Data Structures

Series

# Extract the 0th row: notice that it is in fact a Series
row_0 = tags.iloc[0]
type(row_0)
pandas.core.series.Series
print(row_0)
userId         2
movieId    60756
tag        funny
Name: 0, dtype: object
row_0.index
Index(['userId', 'movieId', 'tag'], dtype='object')
row_0['userId']
2
'rating' in row_0
False
row_0.name
0
row_0 = row_0.rename('first_row')
row_0.name
'first_row'

DataFrames

tags.head()
tags.index
RangeIndex(start=0, stop=3683, step=1)
tags.columns
Index(['userId', 'movieId', 'tag', 'timestamp', 'parsed_time'], dtype='object')
# Extract row 0, 11, 2000 from DataFrame tags.iloc[ [0,11,2000] ]

Descriptive Statistics

Let's look at how the ratings are distributed!

ratings['rating'].describe()
count    100836.000000
mean          3.501557
std           1.042529
min           0.500000
25%           3.000000
50%           3.500000
75%           4.000000
max           5.000000
Name: rating, dtype: float64
ratings.describe()
ratings['rating'].mean()
3.501556983616962
ratings.mean()
userId       326.127564
movieId    19435.295718
rating         3.501557
dtype: float64
ratings['rating'].min()
0.5
ratings['rating'].max()
5.0
ratings['rating'].std()
1.0425292390605359
ratings['rating'].mode()
0    4.0
dtype: float64
ratings.corr()
filter_1 = ratings['rating'] > 5
print(filter_1)
filter_1.any()
0         False
1         False
2         False
3         False
4         False
          ...
100831    False
100832    False
100833    False
100834    False
100835    False
Name: rating, Length: 100836, dtype: bool
False
filter_2 = ratings['rating'] > 0
filter_2.all()
True
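As a small aside (a sketch, not from the original notes), a boolean mask can also be aggregated directly, since True counts as 1; the five_star name below is illustrative:

# Count and fraction of 5-star ratings using the boolean mask itself
five_star = ratings['rating'] == 5.0
print(five_star.sum())    # number of ratings equal to 5.0
print(five_star.mean())   # fraction of all ratings that are 5.0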

Data Cleaning: Handling Missing Data

movies.shape
(9742, 3)
# Is any row NULL?
movies.isnull().any()
movieId    False
title      False
genres     False
dtype: bool

That's nice! No NULL values!

ratings.shape
(100836, 3)
# Is any row NULL?
ratings.isnull().any()
userId     False
movieId    False
rating     False
dtype: bool

That's nice! No NULL values!

tags.shape
(3683, 3)
# Is any row NULL?
tags.isnull().any()
userId     False
movieId    False
tag        False
dtype: bool

No NULL tags in this smaller dataset either, but let's run dropna() anyway to see how it works.

tags = tags.dropna()
# Check again: is any row NULL?
tags.isnull().any()
userId     False
movieId    False
tag        False
dtype: bool
tags.shape
(3683, 3)

That's nice! No NULL values! Since there were no NULL tags to begin with, the number of rows is unchanged; if there had been any, dropna() would have reduced it.
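Dropping rows is not the only option. As a hedged sketch (not part of the original notes), missing values can instead be filled with a placeholder using fillna; the tags_filled name below is illustrative:

# Replace any missing tag with a placeholder instead of dropping the row
tags_filled = tags.fillna({'tag': 'unknown'})
tags_filled.isnull().any()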

Data Visualization

%matplotlib inline
ratings.hist(column='rating', figsize=(15,10))
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f12e4acc438>]], dtype=object)
Image in a Jupyter notebook
ratings.boxplot(column='rating', figsize=(15,20))
<matplotlib.axes._subplots.AxesSubplot at 0x7f12e4a63860>
Image in a Jupyter notebook

Slicing Out Columns

tags['tag'].head()
0              funny
1    Highly quotable
2       will ferrell
3       Boxing story
4                MMA
Name: tag, dtype: object
movies[['title','genres']].head()
ratings[-10:]
tag_counts = tags['tag'].value_counts()
tag_counts[-10:]
tedious                    1
royalty                    1
narnia                     1
Notable Nudity             1
stephen king               1
amazing                    1
short films                1
Lonesome Polecat           1
Disney animated feature    1
Andy Garcia                1
Name: tag, dtype: int64
tag_counts[:10].plot(kind='bar', figsize=(15,10))
<matplotlib.axes._subplots.AxesSubplot at 0x7f12e49b2ef0>
Image in a Jupyter notebook

Filters for Selecting Rows

is_highly_rated = ratings['rating'] >= 4.0
ratings[is_highly_rated][30:50]
is_animation = movies['genres'].str.contains('Animation')
movies[is_animation][5:15]
movies[is_animation].head(15)

Group By and Aggregate

ratings_count = ratings[['movieId','rating']].groupby('rating').count()
ratings_count
average_rating = ratings[['movieId','rating']].groupby('movieId').mean()
average_rating.head()
movie_count = ratings[['movieId','rating']].groupby('movieId').count()
movie_count.head()
movie_count = ratings[['movieId','rating']].groupby('movieId').count()
movie_count.tail()
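As a short aside (a sketch, not from the original notebook), .agg() can compute several statistics per group in one call, combining the mean and count produced separately above; rating_summary is an illustrative name:

# Mean and count of ratings per movie in a single groupby
rating_summary = ratings.groupby('movieId')['rating'].agg(['mean', 'count'])
rating_summary.head()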

Merge Dataframes

tags.head()
movies.head()
t = movies.merge(tags, on='movieId', how='inner')
t.head()
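how='inner' keeps only the movieIds that appear in both frames. Here is a brief sketch (not part of the original notes) of the other join types; the left_join and outer_join names are illustrative:

# Left join keeps every movie, with NaN where a movie has no tags;
# outer join keeps keys from both sides
left_join = movies.merge(tags, on='movieId', how='left')
outer_join = movies.merge(tags, on='movieId', how='outer')
print(t.shape, left_join.shape, outer_join.shape)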


Combine aggregation, merging, and filters to get useful analytics

avg_ratings = ratings.groupby('movieId', as_index=False).mean()
del avg_ratings['userId']
avg_ratings.head()
box_office = movies.merge(avg_ratings, on='movieId', how='inner')
box_office.tail()
is_highly_rated = box_office['rating'] >= 4.0
box_office[is_highly_rated][-5:]
is_comedy = box_office['genres'].str.contains('Comedy')
box_office[is_comedy][:5]
box_office[is_comedy & is_highly_rated][-5:]

Vectorized String Operations

movies.head()


Split 'genres' into multiple columns


movie_genres = movies['genres'].str.split('|', expand=True)
movie_genres[:10]


Add a new column for comedy genre flag


movie_genres['isComedy'] = movies['genres'].str.contains('Comedy')
movie_genres[:10]


Extract year from title e.g. (1995)


movies['year'] = movies['title'].str.extract(r'.*\((.*)\).*', expand=True)
movies.tail()
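Note that the extracted year is stored as a string. As a hedged sketch (not part of the original notes), converting it to a number makes it usable in numeric operations such as the correlations computed later; year_as_number is an illustrative name, and errors='coerce' turns anything that is not a valid year into NaN:

# Convert the extracted year strings to numbers without modifying the movies frame
year_as_number = pd.to_numeric(movies['year'], errors='coerce')
year_as_number.describe()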

Parsing Timestamps

Timestamps are common in sensor data or other time series datasets. Let us revisit the tags.csv dataset and read the timestamps!

tags = pd.read_csv('./movielens/tags.csv', sep=',')
tags.dtypes
userId        int64
movieId       int64
tag          object
timestamp     int64
dtype: object

Unix time / POSIX time / epoch time records time in seconds
since midnight Coordinated Universal Time (UTC) of January 1, 1970

tags.head(5)
tags['parsed_time'] = pd.to_datetime(tags['timestamp'], unit='s')

The data type datetime64[ns] maps to either <M8[ns] or >M8[ns], depending on the byte order (endianness) of the hardware

tags['parsed_time'].dtype
dtype('<M8[ns]')
tags.head(2)

Selecting rows based on timestamps

greater_than_t = tags['parsed_time'] > '2015-02-01'
selected_rows = tags[greater_than_t]
tags.shape, selected_rows.shape
((3683, 5), (1710, 5))

Sorting the table using the timestamps

tags.sort_values(by='parsed_time', ascending=True)[:10]

Average Movie Ratings over Time

## Are Movie ratings related to the year of launch?
average_rating = ratings[['movieId','rating']].groupby('movieId', as_index=False).mean()
average_rating.tail()
joined = movies.merge(average_rating, on='movieId', how='inner')
joined.head()
joined.corr()
yearly_average = joined[['year','rating']].groupby('year', as_index=False).mean()
yearly_average[:10]
yearly_average[-20:].plot(x='year', y='rating', figsize=(15,10), grid=True)
<matplotlib.axes._subplots.AxesSubplot at 0x7f12e47a2128>
Image in a Jupyter notebook

Do some years look better for box-office movies than others?

Does any data point seem like an outlier in some sense?
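One possible way to probe these questions (a sketch, not from the original notes) is to sort the yearly averages and count how many rated movies fall in each year, since years with very few movies can produce extreme-looking averages; movies_per_year is an illustrative name:

# Years with the lowest and highest average ratings, and years with the fewest rated movies
movies_per_year = joined.groupby('year')['movieId'].count()
print(yearly_average.sort_values(by='rating').head())
print(yearly_average.sort_values(by='rating').tail())
print(movies_per_year.sort_values().head())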