Pandas
pandas is a Python library for data analysis. It offers a number of data exploration, cleaning and transformation operations that are critical in working with data in Python.
pandas builds upon NumPy and SciPy, providing easy-to-use data structures and data manipulation functions with integrated indexing.
The main data structures pandas provides are Series and DataFrames. After a brief introduction to these two data structures and data ingestion, the key features of pandas this notebook covers are:
Generating descriptive statistics on data
Data cleaning using built-in pandas functions
Frequent data operations for subsetting, filtering, insertion, deletion and aggregation of data
Merging multiple datasets using dataframes
Working with timestamps and time-series data
Additional Recommended Resources:
pandas Documentation: http://pandas.pydata.org/pandas-docs/stable/
Python for Data Analysis by Wes McKinney
Python Data Science Handbook by Jake VanderPlas
Let's get started with our first pandas notebook!
Import Libraries
Introduction to pandas Data Structures
*pandas* has two main data structures: *Series* and *DataFrames*.
pandas Series
A pandas Series is a one-dimensional labeled array.
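As a minimal sketch of what "labeled" means here, the cell below builds a small Series and looks values up by label and by position. The values and index labels are made up for illustration.

```python
import pandas as pd

# Illustrative values and labels (not taken from any dataset).
ser = pd.Series([100, 'foo', 300, 'bar', 500],
                index=['tom', 'bob', 'nancy', 'dan', 'eric'])

print(ser.index)          # the labels attached to each value
print(ser.loc['nancy'])   # label-based lookup
print(ser.iloc[2])        # position-based lookup of the same element
print('bob' in ser)       # membership is tested against the index labels
```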
pandas DataFrame
A pandas DataFrame is a two-dimensional labeled data structure.
Create DataFrame from dictionary of Python Series
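One way this could look, assuming two small Series with partly different labels (all names and values are illustrative): each dictionary key becomes a column, and the row index is the union of the Series indexes, with NaN where a Series has no value for a label.

```python
import pandas as pd

# Two illustrative Series whose indexes only partly overlap.
d = {
    'one': pd.Series([100., 200., 300.], index=['apple', 'ball', 'clock']),
    'two': pd.Series([111., 222., 333., 4444.],
                     index=['apple', 'ball', 'cerill', 'dancy']),
}

df = pd.DataFrame(d)   # keys -> columns, union of labels -> rows
print(df)
print(df.index)
print(df.columns)
```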
Create DataFrame from list of Python dictionaries
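A small sketch with made-up dictionaries: each dictionary becomes one row, and keys missing from a dictionary show up as NaN. The optional `index` and `columns` arguments label the rows or restrict the columns.

```python
import pandas as pd

# Illustrative records; the keys differ between the two dictionaries.
data = [{'alex': 1, 'joe': 2}, {'ema': 5, 'dora': 10, 'alice': 20}]

df_from_dicts = pd.DataFrame(data)                               # default integer row labels
df_labeled = pd.DataFrame(data, index=['orange', 'red'])         # custom row labels
df_subset = pd.DataFrame(data, columns=['joe', 'dora', 'alice']) # keep only some keys

print(df_from_dicts)
print(df_labeled)
print(df_subset)
```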
Basic DataFrame operations
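The following sketch walks through a few common column operations on a small, made-up DataFrame: adding a derived column, adding a boolean column, and removing or inserting columns. The column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'one': [100., 200., 300.], 'two': [111., 222., 333.]},
                  index=['apple', 'ball', 'clock'])

df['three'] = df['one'] * df['two']       # add a column derived from others
df['flag'] = df['one'] > 250              # add a boolean column
three = df.pop('three')                   # remove a column and get it back as a Series
del df['two']                             # delete a column in place
df.insert(1, 'copy_of_one', df['one'])    # insert a column at a given position

print(df)
print(three)
```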
Case Study: Movie Data Analysis
This notebook uses a dataset from the MovieLens website. We will describe the dataset further as we explore it using *pandas*.
Download the Dataset
Please note that you will need to download the dataset yourself. Although the video for this notebook says that the data is in your folder, the folder turned out to be too large to fit on the edX platform.
Here are the links to the data source and location:
Data Source: MovieLens web site (filename: ml-20m.zip)
Once the download completes, please make sure the data files are in a directory called movielens in your Week-3-pandas folder.
Let us look at the files in this dataset using the UNIX command ls.
README.txt genome-tags.csv links.csv movies.csv ratings.csv tags.csv
9743
userId,movieId,rating,timestamp
1,1,4.0,964982703
1,3,4.0,964981247
1,6,4.0,964982224
1,47,5.0,964983815
Use Pandas to Read the Dataset
In this notebook, we will be using three CSV files:
* **ratings.csv :** *userId*, *movieId*, *rating*, *timestamp*
* **tags.csv :** *userId*, *movieId*, *tag*, *timestamp*
* **movies.csv :** *movieId*, *title*, *genres*
Using the read_csv function in pandas, we will ingest these three files.
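A minimal sketch of the ingestion step. So that the cell runs even without the download, it reads the first few ratings rows shown above from an in-memory string; with the real files you would pass a path instead (for example `./movielens/ratings.csv`, which is an assumption based on the directory name mentioned earlier).

```python
import io
import pandas as pd

# The first rows of ratings.csv as shown above, held in memory for illustration.
sample_csv = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,1,4.0,964982703\n"
    "1,3,4.0,964981247\n"
    "1,6,4.0,964982224\n"
    "1,47,5.0,964983815\n"
)

# read_csv accepts a file path or any file-like object.
ratings = pd.read_csv(sample_csv, sep=',')

print(type(ratings))
print(ratings.head())
print(ratings.dtypes)
```

The same call, repeated with the paths of tags.csv and movies.csv, would produce the other two DataFrames.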
Data Structures
Series
DataFrames
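To connect the two structures, here is a small sketch on a hypothetical stand-in for the tags table (the rows are made up): a single row pulled out with `iloc` is a Series whose index is the DataFrame's column names, while selecting several rows returns a DataFrame.

```python
import pandas as pd

# Hypothetical rows with the same columns as tags.csv.
tags = pd.DataFrame({
    'userId': [18, 65, 65, 130],
    'movieId': [4141, 208, 353, 1592],
    'tag': ['Mark Waters', 'dark hero', 'dark hero', 'noir thriller'],
    'timestamp': [1240597180, 1368150078, 1368150079, 1368150080],
})

row_0 = tags.iloc[0]            # one row -> Series
print(type(row_0))
print(row_0.index)              # the column names become the Series labels
print(row_0['tag'])

subset = tags.iloc[[0, 2, 3]]   # a list of positions -> DataFrame
print(subset)
print(tags.index, tags.columns)
```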
Descriptive Statistics
Let's look at how the ratings are distributed!
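A minimal sketch of the summary calls, run on a handful of made-up ratings rather than the full file: `describe()` reports count, mean, standard deviation, and quartiles, and the individual statistics are also available as separate methods.

```python
import pandas as pd

# Made-up ratings with the same column names as ratings.csv.
ratings = pd.DataFrame({
    'userId': [1, 1, 2, 2, 3],
    'movieId': [1, 3, 1, 6, 3],
    'rating': [4.0, 4.0, 5.0, 3.0, 2.5],
})

summary = ratings['rating'].describe()   # count, mean, std, min, quartiles, max
print(summary)

print(ratings['rating'].mean())
print(ratings['rating'].min(), ratings['rating'].max())
print(ratings['rating'].mode())          # most frequent value(s)
print(ratings['rating'].value_counts())  # how often each rating occurs
```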
Data Cleaning: Handling Missing Data
That's nice! No NULL values!
That's nice! No NULL values!
We have some tags which are NULL.
That's nice! No NULL values! Notice that the number of rows has decreased.
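The checks described above might look like the sketch below, shown on a small made-up tags table with two missing tags. `isnull().any().any()` answers "is there any NULL anywhere?", and `dropna()` removes the rows that contain one.

```python
import numpy as np
import pandas as pd

# Made-up rows; two of the tags are missing.
tags = pd.DataFrame({
    'userId': [1, 2, 3, 4],
    'movieId': [10, 20, 30, 40],
    'tag': ['funny', np.nan, 'classic', None],
})

has_null = tags.isnull().any().any()   # True if any cell is NULL
print(has_null)
print(tags.isnull().sum())             # NULL count per column

rows_before = tags.shape[0]
tags_clean = tags.dropna()             # drop rows with at least one NULL
print(rows_before, tags_clean.shape[0])
print(tags_clean.isnull().any().any())
```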
Data Visualization
Slicing Out Columns
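A short sketch of column selection on a few illustrative movie rows: a single column name returns a Series, a list of names returns a DataFrame, and `iloc` slices rows by position.

```python
import pandas as pd

# A few illustrative rows in the shape of movies.csv.
movies = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
    'genres': ['Adventure|Animation|Children|Comedy|Fantasy',
               'Adventure|Children|Fantasy',
               'Comedy|Romance'],
})

titles = movies['title']                      # one column -> Series
title_genres = movies[['title', 'genres']]    # list of columns -> DataFrame
last_two = movies.iloc[-2:]                   # last two rows by position

print(titles.head())
print(title_genres.head())
print(last_two)
```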
Filters for Selecting Rows
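Row filters are boolean Series used as a mask. The sketch below, on made-up ratings, builds one condition and then combines two with `&` (each condition needs its own parentheses).

```python
import pandas as pd

# Made-up ratings with the same column names as ratings.csv.
ratings = pd.DataFrame({
    'userId': [1, 1, 2, 2, 3],
    'movieId': [1, 3, 1, 6, 3],
    'rating': [4.0, 4.0, 5.0, 3.0, 2.5],
})

is_highly_rated = ratings['rating'] >= 4.0     # boolean mask, one entry per row
highly_rated = ratings[is_highly_rated]
print(highly_rated)

# Combine conditions element-wise with & (and), | (or), ~ (not).
highly_rated_movie_1 = ratings[(ratings['rating'] >= 4.0) & (ratings['movieId'] == 1)]
print(highly_rated_movie_1)
```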
Group By and Aggregate
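As a sketch of the split-apply-combine pattern, the cell below groups made-up ratings by movie and computes the count and the mean rating per group.

```python
import pandas as pd

# Made-up ratings with the same column names as ratings.csv.
ratings = pd.DataFrame({
    'userId': [1, 1, 2, 2, 3],
    'movieId': [1, 3, 1, 6, 3],
    'rating': [4.0, 4.0, 5.0, 3.0, 2.5],
})

# Number of ratings per movie.
rating_counts = ratings.groupby('movieId')['rating'].count()
print(rating_counts)

# Average rating per movie.
average_rating = ratings.groupby('movieId')['rating'].mean()
print(average_rating)
```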
Merge Dataframes
More examples: http://pandas.pydata.org/pandas-docs/stable/merging.html
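A minimal merge sketch on made-up movies and tags tables that share a `movieId` column. An inner merge keeps only movies that have at least one tag; a left merge keeps every movie and fills missing tags with NaN.

```python
import pandas as pd

# Made-up rows; movie 2 has no tags.
movies = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
})
tags = pd.DataFrame({
    'movieId': [1, 1, 3],
    'tag': ['pixar', 'fun', 'old'],
})

inner = movies.merge(tags, on='movieId', how='inner')   # only matching keys
left = movies.merge(tags, on='movieId', how='left')     # all movies, NaN where untagged

print(inner)
print(left)
```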
Combine aggregation, merging, and filters to get useful analytics
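One possible combination, sketched on made-up data: average the ratings per movie, merge the result onto the movies table, then filter for well-rated comedies. The variable names and the 4.0 threshold are illustrative choices.

```python
import pandas as pd

movies = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
    'genres': ['Adventure|Animation|Children|Comedy|Fantasy',
               'Adventure|Children|Fantasy',
               'Comedy|Romance'],
})
# Made-up ratings for those three movies.
ratings = pd.DataFrame({
    'movieId': [1, 1, 2, 3, 3],
    'rating': [4.0, 5.0, 3.0, 4.0, 2.5],
})

# 1. Aggregate: average rating per movie (movieId stays a regular column).
avg_ratings = ratings.groupby('movieId', as_index=False)['rating'].mean()

# 2. Merge: attach the average rating to each movie.
box_office = movies.merge(avg_ratings, on='movieId', how='inner')

# 3. Filter: comedies with an average rating of at least 4.0.
is_highly_rated = box_office['rating'] >= 4.0
is_comedy = box_office['genres'].str.contains('Comedy')
well_rated_comedies = box_office[is_highly_rated & is_comedy]

print(well_rated_comedies)
```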
Vectorized String Operations
Split 'genres' into multiple columns
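A small sketch of the split, using illustrative rows: `str.split` with `expand=True` spreads the pipe-separated genres over as many columns as the longest list needs, and pads shorter rows with missing values.

```python
import pandas as pd

movies = pd.DataFrame({
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
    'genres': ['Adventure|Animation|Children|Comedy|Fantasy',
               'Adventure|Children|Fantasy',
               'Comedy|Romance'],
})

# One column per genre position; rows with fewer genres get missing values.
movie_genres = movies['genres'].str.split('|', expand=True)
print(movie_genres)
```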
Add a new column for comedy genre flag
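A sketch of the flag column on illustrative rows. The column name `isComedy` is a hypothetical choice; `str.contains` returns one boolean per row.

```python
import pandas as pd

movies = pd.DataFrame({
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
    'genres': ['Adventure|Animation|Children|Comedy|Fantasy',
               'Adventure|Children|Fantasy',
               'Comedy|Romance'],
})

# True where the genres string mentions Comedy.
movies['isComedy'] = movies['genres'].str.contains('Comedy')
print(movies[['title', 'isComedy']])
```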
Extract year from title e.g. (1995)
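One way to pull out the year is `str.extract` with a capture group. The pattern below (four digits inside parentheses) is an assumption about how titles are formatted, and the third title is made up to show what happens when no year is present.

```python
import pandas as pd

movies = pd.DataFrame({
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Title Without Year'],
})

# Capture four digits wrapped in parentheses; no match gives a missing value.
movies['year'] = movies['title'].str.extract(r'\((\d{4})\)', expand=False)
print(movies)
```

Note that the extracted year is a string; convert it with `pd.to_numeric` if you need to do arithmetic on it.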
Parsing Timestamps
Timestamps are common in sensor data or other time series datasets. Let us revisit the tags.csv dataset and read the timestamps!
Unix time / POSIX time / epoch time records time in seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
The data type datetime64[ns] maps to either <M8[ns] or >M8[ns], depending on the byte order (endianness) of the hardware.
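A minimal parsing sketch: `pd.to_datetime` with `unit='s'` interprets integers as seconds since the epoch. The timestamps are the ones from the ratings sample shown earlier; the tag strings and the `parsed_time` column name are made up for illustration.

```python
import pandas as pd

tags = pd.DataFrame({
    'tag': ['funny', 'classic', 'dark'],                 # illustrative tags
    'timestamp': [964982703, 964981247, 964982224],      # seconds since the epoch
})

print(tags['timestamp'].dtype)   # plain integers before parsing

# Interpret the integers as seconds since 1970-01-01 00:00:00 UTC.
tags['parsed_time'] = pd.to_datetime(tags['timestamp'], unit='s')

print(tags['parsed_time'].dtype)
print(tags.head())
```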
Selecting rows based on timestamps
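Once a column holds datetimes, it can be compared against a date to build a row filter. The sketch below uses made-up tags and dates, and the cutoff date is an arbitrary choice.

```python
import pandas as pd

# Made-up tags with already-parsed timestamps.
tags = pd.DataFrame({
    'tag': ['a', 'b', 'c'],
    'parsed_time': pd.to_datetime(['2000-07-30', '2009-04-24', '2015-05-30']),
})

# Boolean mask: True for rows later than the cutoff.
greater_than_t = tags['parsed_time'] > pd.Timestamp('2015-02-01')
selected_rows = tags[greater_than_t]

print(tags.shape, selected_rows.shape)
print(selected_rows)
```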
Sorting the table using the timestamps
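Sorting by a datetime column works like sorting by any other column. A short sketch on made-up rows that start out of order:

```python
import pandas as pd

# Made-up tags, deliberately not in chronological order.
tags = pd.DataFrame({
    'tag': ['c', 'a', 'b'],
    'parsed_time': pd.to_datetime(['2015-05-30', '2000-07-30', '2009-04-24']),
})

# Oldest first; pass ascending=False for newest first.
tags_sorted = tags.sort_values(by='parsed_time', ascending=True)
print(tags_sorted)
```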
Average Movie Ratings over Time
Are movie ratings related to the year of launch? Do some years look better for box-office movies than others?
Does any data point seem like an outlier in some sense?
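To explore these questions, one possible approach is to average the ratings per movie, attach each movie's release year, and then average again per year. The sketch below does this on made-up data and assumes a `year` column has already been extracted from the titles.

```python
import pandas as pd

# Made-up movies with a year column assumed to be extracted already.
movies = pd.DataFrame({
    'movieId': [1, 2, 3, 4],
    'year': ['1995', '1995', '1997', '1997'],
})
# Made-up ratings for those movies.
ratings = pd.DataFrame({
    'movieId': [1, 1, 2, 3, 4],
    'rating': [4.0, 5.0, 3.5, 2.0, 3.0],
})

# Average rating per movie, then joined with the release year.
average_rating = ratings.groupby('movieId', as_index=False)['rating'].mean()
joined = movies.merge(average_rating, on='movieId', how='inner')

# Average of the per-movie averages for each year.
yearly_average = joined.groupby('year', as_index=False)['rating'].mean()
print(yearly_average)
```

Plotting `yearly_average` with the year on the x-axis is a natural next step for spotting trends and unusual years.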