YStrano
GitHub Repository: YStrano/DataScience_GA
Path: blob/master/april_18/lessons/lesson-02/code/numpy-and-pandas.ipynb
Kernel: Python 3

Agenda

  • Numpy

  • Pandas

  • Lab

Introduction

Create a new notebook for your code-along:

From our submission directory, type:

jupyter notebook

From the IPython Dashboard, open a new notebook. Change the title to: "Numpy and Pandas"

Introduction to Numpy

  • Overview

  • ndarray

  • Indexing and Slicing

More info: http://wiki.scipy.org/Tentative_NumPy_Tutorial

Numpy Overview

  • Why Python for Data? Numpy brings decades of C math into Python!

  • Numpy provides a wrapper for extensive C/C++/Fortran codebases, used for data analysis functionality

  • ndarray allows easy vectorized math and broadcasting (i.e. arithmetic between arrays of different but compatible shapes)
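To make the vectorized-math and broadcasting bullet concrete, here is a minimal sketch. The array names (col, row, table, squares) are illustrative and not part of the lesson.

```python
import numpy as np

# Broadcasting: a (3, 1) column and a (4,) row combine into a (3, 4) result.
col = np.array([[0], [10], [20]])   # shape (3, 1)
row = np.array([1, 2, 3, 4])        # shape (4,)
table = col + row                   # shape (3, 4)

# Vectorized math: one expression acts on every element, no Python loop.
squares = np.arange(5) ** 2

print(table)
print(squares)
```

Each row of table is the row vector shifted by the matching entry of col, which is what a nested loop would otherwise compute.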

import numpy as np

Creating ndarrays

An array object represents a multidimensional, homogeneous array of fixed-size items.
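As a small illustration of "homogeneous" and "fixed-size", the sketch below builds an array from mixed Python numbers and inspects its attributes. The variable names are made up for this example.

```python
import numpy as np

# Mixed ints and floats are stored with one common dtype.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)        # float64

# Every item occupies the same number of bytes.
print(mixed.itemsize)     # 8

# shape, ndim and size describe the layout of a multidimensional array.
grid = np.zeros((2, 3))
print(grid.shape, grid.ndim, grid.size)
```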

# Creating arrays
a = np.zeros((3))
b = np.ones((2,3))
c = np.random.randint(1,10,(2,3,4))
d = np.arange(0,11,1)

What are these functions?

arange?
# Note the way each array is printed:
a,b,c,d
## Arithmetic in arrays is element-wise
a = np.array( [20,30,40,50] )
b = np.arange( 4 )
b
c = a-b
c
b**2

Indexing, Slicing and Iterating

# one-dimensional arrays work like lists:
a = np.arange(10)**2
a
a[2:5]
# Multidimensional arrays use tuples with commas for indexing
# with (row,column) conventions beginning, as always in Python, from 0
b = np.random.randint(1,100,(4,4))
b
# Guess the output
print(b[2,3])
print(b[0,0])
b[0:3,1],b[:,1]
b[1:3,:]
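Because the random array above changes on every run, a deterministic array can make the (row, column) convention easier to check by hand. This is a sketch using a hypothetical array m built with reshape.

```python
import numpy as np

# A 4x4 array holding 0..15, filled row by row.
m = np.arange(16).reshape(4, 4)

single = m[2, 3]        # row 2, column 3
col_part = m[0:3, 1]    # column 1, rows 0 through 2
col_all = m[:, 1]       # column 1, every row
rows = m[1:3, :]        # rows 1 and 2, every column

print(single)
print(col_part)
print(col_all)
print(rows)
```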

Introduction to Pandas

  • Object Creation

  • Viewing data

  • Selection

  • Missing data

  • Grouping

  • Reshaping

  • Time series

  • Plotting

  • i/o

pandas.pydata.org

Pandas Overview

Source: pandas.pydata.org

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
dates = pd.date_range('20140101',periods=6)
dates
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
z = pd.DataFrame(index = df.index, columns = df.columns)
df.columns
# Index, columns, underlying numpy data
df.T
df
df2 = pd.DataFrame({ 'A' : 1.,
                     'B' : pd.Timestamp('20130102'),
                     'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                     'D' : np.array([3] * 4,dtype='int32'),
                     'E' : 'foo' })
df2
# With specific dtypes
df2.dtypes

Viewing Data

df.head()
df.tail()
df.index
df.describe()
# sort_values returns a sorted copy; df itself is unchanged
df.sort_values(by='B')
df

Selection

df[['A','B']]
df[0:3]
# By label
df.loc[dates[0]]
# multi-axis by label
df.loc[:,['A','B']]
# Date Range
df.loc['20140102':'20140104',['B']]
# Fast access to scalar
df.at[dates[1],'B']
# iloc provides integer locations similar to np style
df.iloc[3:]

Boolean Indexing

df[df.A < 0] # Basically a 'where' operation
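The same "where" idea on a small fixed table, so the selected rows can be verified by eye. The frame and its values are invented for illustration.

```python
import pandas as pd

frame = pd.DataFrame({'A': [-1.0, 2.0, -3.0, 4.0],
                      'B': [10, 20, 30, 40]})

# A comparison yields a boolean Series; indexing with it keeps the True rows.
mask = frame['A'] < 0
negatives = frame[mask]

# Combine conditions with & and |, wrapping each one in parentheses.
both = frame[(frame['A'] < 0) & (frame['B'] > 10)]

print(negatives)
print(both)
```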

Setting

df_posA = df.copy() # Without copy(), the assignment below would modify df itself
df_posA[df_posA.A < 0] = -1*df_posA
df_posA
# Setting new column aligns data by index
s1 = pd.Series([1,2,3,4,5,6],index=pd.date_range('20140102',periods=6))
s1
df['F'] = s1
df
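Alignment by index means values are matched on labels, not on position. A minimal sketch with a three-row frame (names are hypothetical) shows the effect: the Series starts one day later, so the first row gets NaN and the Series' last value is dropped.

```python
import pandas as pd

frame = pd.DataFrame({'A': [1.0, 2.0, 3.0]},
                     index=pd.date_range('20140101', periods=3))

# This Series covers Jan 2 through Jan 4, one day later than the frame.
s = pd.Series([10, 20, 30], index=pd.date_range('20140102', periods=3))

# Jan 1 has no match and becomes NaN; Jan 4 is not in the frame and is ignored.
frame['F'] = s
print(frame)
```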

Missing Data

# Add a column with missing data
df1 = df.reindex(index=dates[0:4],columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1],'E'] = 1
df1
# find where values are null
pd.isnull(df1)
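Once the nulls are located, the two usual next steps are dropping or filling them. The lesson does not show these calls, so treat this as an optional sketch on a made-up frame.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'A': [1.0, 2.0, 3.0],
                      'E': [1.0, np.nan, np.nan]})

# Count missing values per column.
null_counts = frame.isnull().sum()

# Drop every row that contains at least one missing value.
dropped = frame.dropna()

# Or replace missing values with a constant.
filled = frame.fillna(0)

print(null_counts)
print(dropped)
print(filled)
```

Both dropna and fillna return new objects and leave frame unchanged.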

Operations

df.describe()
df.mean(),df.mean(1) # Operation on two different axes
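To see what the two axes mean, here is the same operation on a tiny frame with known values (a sketch; the frame is not from the lesson). axis=0 collapses the rows and gives one value per column, while axis=1 collapses the columns and gives one value per row.

```python
import pandas as pd

frame = pd.DataFrame({'A': [1.0, 2.0, 3.0],
                      'B': [4.0, 5.0, 6.0]})

col_means = frame.mean(axis=0)   # one mean per column
row_means = frame.mean(axis=1)   # one mean per row

print(col_means)
print(row_means)
```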

Applying functions

df
df.apply(np.cumsum)
df.apply(lambda x: x.max() - x.min())
# Built in string methods
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()
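The apply calls above run on random data, so here is a sketch on fixed numbers (an invented frame) where the results can be checked. By default apply passes each column to the function; with axis=1 it passes each row.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'A': [1, 2, 3],
                      'B': [10, 20, 40]})

# Running total down each column.
running = frame.apply(np.cumsum)

# Range (max minus min) of each column.
spread = frame.apply(lambda col: col.max() - col.min())

# The same function applied across each row instead.
row_spread = frame.apply(lambda row: row.max() - row.min(), axis=1)

print(running)
print(spread)
print(row_spread)
```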

Merge

np.random.randn(10,4)
# Concatenating pandas objects together
df = pd.DataFrame(np.random.randn(10,4))
df
# Break it into pieces
pieces = [df[:3], df[3:7],df[7:]]
pieces
pd.concat(pieces)
# Also can "Join"; DataFrame.append was removed in pandas 2.0, so use pd.concat instead
df
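The section title mentions merging, but the cells only show concatenation. As a hedged sketch of the difference, the example below stacks rows with pd.concat and joins on a shared column with pd.merge. The frames and column names (key, lval, rval) are invented for illustration.

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key': ['a', 'b', 'd'], 'rval': [4, 5, 6]})

# concat stacks objects along an axis; ignore_index renumbers the rows.
stacked = pd.concat([left, left], ignore_index=True)

# merge performs a database-style join; the default is an inner join,
# so only keys present in both frames survive.
inner = pd.merge(left, right, on='key')

print(stacked)
print(inner)
```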

Grouping

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
df
df.groupby(['A','B']).sum()
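Grouping follows a split, apply, combine pattern: rows are split by key, a function is applied to each group, and the results are combined. With fixed integers instead of random values (a made-up frame), the sums are easy to verify.

```python
import pandas as pd

frame = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                      'B': ['one', 'one', 'two', 'one'],
                      'C': [1, 2, 3, 4]})

# Group on one key and sum a single column.
by_a = frame.groupby('A')['C'].sum()

# Group on two keys; the result is indexed by (A, B) pairs.
totals = frame.groupby(['A', 'B']).sum()

print(by_a)
print(totals)
```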

Reshaping

# You can also stack or unstack levels
a = df.groupby(['A','B']).sum()
# Pivot Tables
pd.pivot_table(df,values=['C','D'],index=['A'],columns=['B'])
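The comment about unstacking has no code attached, so here is a minimal sketch of it next to the equivalent pivot table. The frame is invented; unstack moves an index level into the columns, and pivot_table reaches the same layout directly, using the mean as its default aggregation.

```python
import pandas as pd

frame = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar'],
                      'B': ['one', 'two', 'one', 'two'],
                      'C': [1, 2, 3, 4]})

# Grouped sums are indexed by (A, B) pairs.
grouped = frame.groupby(['A', 'B'])['C'].sum()

# unstack turns the B level of the index into columns.
wide = grouped.unstack('B')

# pivot_table builds the same shape in one call.
pivot = pd.pivot_table(frame, values='C', index='A', columns='B')

print(wide)
print(pivot)
```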

Time Series

import pandas as pd
import numpy as np
# 100 Seconds starting on January 1st
rng = pd.date_range('1/1/2014', periods=100, freq='S')
# Give each second a random value
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts
# Built in resampling
ts.resample('1Min').mean() # Resample from per-second to per-minute
# Many additional time series features
ts. # use tab
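Resampling groups timestamps into fixed-width bins and aggregates each bin. The sketch below uses a counting series instead of random values so the per-minute means can be checked. It spells the frequencies in lowercase ('s', '1min'), which is an assumption about what recent pandas releases prefer; the cells above use the older uppercase spellings.

```python
import numpy as np
import pandas as pd

# 120 seconds of data: the value at each second is just its position.
rng = pd.date_range('2014-01-01', periods=120, freq='s')
ts = pd.Series(np.arange(120), index=rng)

# One bin per minute; each bin holds the mean of its 60 values.
per_minute = ts.resample('1min').mean()
print(per_minute)
```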

Plotting

ts.plot()
def randwalk(startdate,points):
    ts = pd.Series(np.random.randn(points), index=pd.date_range(startdate, periods=points))
    ts=ts.cumsum()
    ts.plot()
    return(ts)
# Using pandas to make a simple random walker by repeatedly running:
a=randwalk('1/1/2012',1000)
# Pandas plot function will print with labels as default
df = pd.DataFrame(np.random.randn(100, 4), index=ts.index,columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
plt.figure()
df.plot()
plt.legend(loc='best')

I/O

I/O is straightforward with, for example, pd.read_csv or df.to_csv
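A round trip through CSV can be sketched without touching the disk by writing to an in-memory buffer. The frame and its column names are made up; with a real file you would pass a path to to_csv and read_csv instead of the buffer.

```python
import io
import pandas as pd

frame = pd.DataFrame({'name': ['a', 'b'], 'value': [1, 2]})

# Write CSV text into a buffer; index=False omits the row labels.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)

# Rewind and read the same text back into a new DataFrame.
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```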

The benefits of open source:

Let's look under x's in plt modules

Next Steps

Recommended Resources

Name | Description
--- | ---
Official Pandas Tutorials | Wes & Company's selection of tutorials and lectures
Julia Evans Pandas Cookbook | Great resource with examples from weather, bikes and 311 calls
Learn Pandas Tutorials | A great series of Pandas tutorials from Dave Rojas
Research Computing Python Data PYNBs | A super awesome set of python notebooks from a meetup-based course exclusively devoted to pandas