
Pandas

Pandas is the most popular Python library for data analysis and data structures.


  • Python with Pandas is used in a wide range of academic and commercial domains, including finance, economics, statistics, and analytics.

Key Features of Pandas

  • Fast and efficient DataFrame object with default and customized indexing.

  • Tools for loading data into in-memory data objects from different file formats.

  • Data alignment and integrated handling of missing data.

  • Reshaping and pivoting of data sets.

  • Label-based slicing, indexing and subsetting of large data sets.

  • Columns from a data structure can be deleted or inserted.

  • Group-by functionality for aggregation and transformation.

  • High performance merging and joining of data.

  • Time Series functionality.
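A minimal sketch touching a few of these features (the data and column names here are illustrative, not from the course material):

import pandas as pd
import numpy as np

# Toy frame: customized index, a missing value, label-based slicing, group-by
df = pd.DataFrame({"city": ["Delhi", "Pune", "Delhi", "Pune"],
                   "sales": [10, np.nan, 30, 40]},
                  index=["a", "b", "c", "d"])
df["sales"] = df["sales"].fillna(0)        # integrated missing-data handling
print(df.loc["a":"c", ["city", "sales"]])  # label-based slicing and subsetting
print(df.groupby("city")["sales"].sum())   # group by for aggregation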

Data Analysis

  • Raw data → information → prepare → feature selection → model data

  • Import data (data acquisition) → data preparation (data cleaning, feature engineering) → EDA → model data

Installation

The standard Python distribution doesn't come bundled with the Pandas module. A lightweight way to install it is with the popular Python package installer, pip.

  • pip install pandas

  • If you install the Anaconda Python distribution (from https://www.continuum.io), Pandas is installed by default; Anaconda is a free Python distribution for the SciPy stack.
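Either way, a quick sanity check (a minimal sketch) confirms the installation:

import pandas as pd
print(pd.__version__)  # prints the installed Pandas version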

Pandas deals with the following three data structures:

  • Series: a one-dimensional array-like structure with homogeneous data.

  • DataFrame: a two-dimensional array with heterogeneous data.

  • Panel: a three-dimensional data structure with heterogeneous data. A panel is hard to show graphically, but it can be illustrated as a container of DataFrames. These data structures are built on top of NumPy arrays. (Note: Panel has been removed from recent Pandas versions; a MultiIndex DataFrame is the usual replacement.)

  • Data acquisition → data preparation (data cleaning, data manipulation, data engineering); raw data → information → insights → actions

  • pip install pandas

  • import pandas

  • Let's work with Series

  1. Series (1D): rows

  2. DataFrame (2D): rows and columns

  3. Panel (3D)

import pandas
A = (1, 2, 3, 4, 5, 6)
x = pandas.Series(A)
x
0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64
import pandas as pd
a = (1, 2, 3, 4, 5, 6)
b = ["one", "two", "three", "four", "five", "six"]
x = pd.Series(a, dtype=int, index=b)
x
one      1
two      2
three    3
four     4
five     5
six      6
dtype: int32
# Series from a list
s = pd.Series([1, 2, "a", 4], index=("I", "II", "III", "IV"))
s

Creating a Series from a dictionary

emp = {"A": 8, "B": 9, "C": 10}
details = pd.Series(emp)
details
A     8
B     9
C    10
dtype: int64
  • Note: the dictionary's values become the Series elements and its keys become the index. Because a dictionary is a mapping type, we cannot manipulate the index positionally as we do with lists and tuples.

Quick Practice

  • Create a Series using (1) a list, (2) a tuple, and (3) a dictionary (a sample solution sketch follows this list).

  • Length should be 5.

  • For the list and the tuple, name the index.
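One possible solution sketch (the names and values here are only examples):

import pandas as pd

s_list = pd.Series([10, 20, 30, 40, 50], index=["a", "b", "c", "d", "e"])  # from a list, named index
s_tuple = pd.Series((1, 2, 3, 4, 5), index=["p", "q", "r", "s", "t"])      # from a tuple, named index
s_dict = pd.Series({"A": 1, "B": 2, "C": 3, "D": 4, "E": 5})               # from a dict; keys become the index
print(s_list, s_tuple, s_dict, sep="\n")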

# Changing the order of the index
age = {'ram': 28, 'bob': 19, 'cam': 22}
s = pd.Series(age, index=['bob', 'ram', 'cam', "Lina"])  # dtype=int would fail here: "Lina" has no value, so the NaN forces float
s
## type(s)  -> pandas.core.series.Series
bob     19.0
ram     28.0
cam     22.0
Lina     NaN
dtype: float64
  • Note: the missing value ('Lina') is filled with NaN, and values are matched to the index through the dictionary keys.

  • Let's work with DataFrames

import pandas as pd
# DataFrame from a 1D list
l = ["ashi", "rom", "sid"]
df = pd.DataFrame(l)
df
# 2D list
data = [['Nokia', 10000], ['Asus', 12000], ['Samsung', 13000], ['Apple', 33000]]
d = pd.DataFrame(data, columns=['Mobile', 'Price'], index=[1, 2, 3, 4])
d

Quick Practice: create a DataFrame containing the marks of six students in Chemistry, Physics, and Maths.

Note: if no index or columns are passed, then by default the index will be range(n), where n is the array length.

import pandas as pd
data = [["Ashi", 34, 23, 56], ["Bob", 92, 88, 76], ["Coy", 76, 87, 25],
        ["Don", 45, 78, 99], ["Muh", 95, 68, 30], ["May", 76, 78, 99]]
stu = pd.DataFrame(data, columns=["Name", 'Maths', 'Phy', 'Chem'])
stu

Creating a DataFrame from Series

import pandas as pd
# Each column is a Series with its own index; Pandas aligns them by label
d = {'Chem': pd.Series([30, 70, 35, 25, 26, 67, 77, 67, 89, 98],
                       index=["Raj", 'Ram', 'Asa', 'Pi', 'Chi', 'Ru', "Sita", "Ria", "Gita", "Sia"], dtype=int),
     'Math': pd.Series([18, 26, 35, 40, 55, 89, 79, 100],
                       index=["Raj", 'Ram', 'Pi', 'Chi', 'Ru', "Sita", "Ria", "Gita"], dtype=int),
     'Phy': pd.Series([31, 42, 83, 34, 80, 78, 67, 98, 67],
                      index=["Asa", 'Ram', 'Pi', 'Ru', "Sita", "Gita", "Chi", "Sia", "Raj"], dtype=int)}
exam = pd.DataFrame(d)
exam1 = exam.copy()  # make copies of the data for later sections
exam2 = exam.copy()
exam

Data Preparation

  • Removing or replacing null values

  • Data description

  • Adding new data fields for analysis

  • Feature selection: for ML, predictions, and decisions

exam1
# Adding columns
print("Adding a new column using the existing columns in DataFrame:")
exam1['Total'] = exam1['Chem'] + exam1['Math'] + exam1['Phy']
exam1
Adding a new column using the existing columns in DataFrame:

Dealing with missing data

  • Check for Missing Values

  • To make detecting missing values easier (and across different array dtypes), Pandas provides the isnull() and notnull() functions, which are also methods on Series and DataFrame objects; a small example follows.
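For example, a quick sketch on a toy Series:

import pandas as pd
import numpy as np

s = pd.Series([1, np.nan, 3])
print(s.isnull())   # True where a value is missing
print(s.notnull())  # the boolean complement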

exam2
exam2.isnull().sum()
Chem    0
Math    2
Phy     1
dtype: int64

Calculations with Missing Data

  • When summing data, NA is treated as zero. If the data are all NA, the result will be NA.

Cleaning / Filling Missing Data

  • Pandas provides various methods for cleaning missing values. The fillna function can “fill in” NA values with non-null data in a couple of ways, which the following sections illustrate.

  • Replace NaN with a scalar value: the following program shows how to replace "NaN" with "0".

exam
exam = exam.fillna(0)
exam
exam["Total"]=exam["Math"]+exam["Phy"]+exam["Chem"] exam
exam1
exam1 = exam1.dropna()
exam1
exam2
b = exam2["Phy"].median()  # the median is not affected by outliers
c = exam2["Math"].mean()   # the mean is affected by outliers
## the mode is another option, e.g. for categorical data
exam2["Phy"]=exam2["Phy"].fillna(b) exam2["Math"]=exam2["Math"].fillna(c) exam2

Replacing erroneous values with generic values using the replace function

df = pd.DataFrame({'one': [10, 20, 30, 40, 50, "ABC"],
                   'AGE': [-19, 1, 30, 40, 50, 60]})
df = df.replace({"ABC": 60, -19: 19})
exam1.info()
<class 'pandas.core.frame.DataFrame'>
Index: 7 entries, Chi to Sita
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Chem    7 non-null      int32
 1   Math    7 non-null      float64
 2   Phy     7 non-null      float64
 3   Total   7 non-null      float64
dtypes: float64(3), int32(1)
memory usage: 252.0+ bytes
exam1['Total'] = exam1['Total'].astype(int)
exam1
C:\Users\suyashi144893\AppData\Local\Temp\1\ipykernel_4140\60639536.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  exam1['Total']=exam1['Total'].astype(int)
exam1.describe()

exam1.describe(include="object")  # note: exam1 has only numeric columns here, so this raises a ValueError

exam1['Phy'].sum()
exam1.mean()
exam1.std()
Chem     27.002645
Math     31.482421
Phy      19.208133
Total    59.887196
dtype: float64

Stats: Data Description

import pandas as pd
details = {'Brand': pd.Series(['Nokia', 'Asus', "Nokia", "Nokia", 'Samsung', "ABC", 'Micromax', 'Apple', 'MI', 'Zen', "Apple"]),
           'Price': pd.Series([10000, 8000, 12500, 7000, 40000, 12000, 12999, 13999, 59999]),
           'Rating(10)': pd.Series([7, 6.5, 8.5, 9, 8, 9.5, 7, 9])}
d = pd.DataFrame(details)
d

Quick Practice

  • Describe the data and check for missing values

  • Replace missing values

  • Describe the data again

  • Compute the mean, mode, and median column-wise

d.isnull().sum()
Brand         0
Price         2
Rating(10)    3
dtype: int64
d["Price"].mean()
19610.777777777777
# d["Rating(10)"].describe()
d.describe()

The describe() function computes a summary of statistics pertaining to the DataFrame columns.

d.describe(include="object")
# d[d['Brand'] == 'Apple']
# d.head(2)  # displays rows from the top
# d.tail(1)  # displays rows from the bottom
# include="object" reports count, unique, top (mode), and freq for the categorical columns

Let's Practice

  • Check null values for the given data

  • Replace null values with the mean/median/mode (a solution sketch follows this list)

  • Describe the data
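One possible solution sketch for the replacement step, using the d DataFrame defined above (mean/median for the numeric columns and mode for the categorical one is one choice among several):

d["Price"] = d["Price"].fillna(d["Price"].mean())                   # numeric: mean
d["Rating(10)"] = d["Rating(10)"].fillna(d["Rating(10)"].median())  # numeric: median
d["Brand"] = d["Brand"].fillna(d["Brand"].mode()[0])                # categorical: mode
d.describe(include="all")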

# rename takes a {old_name: new_name} mapping
d = d.rename(columns={'Price': 'Avg.Price'})
d

Sorting

import pandas as pd
details = {'Brand': pd.Series(['Nokia', 'Asus', "Nokia", "Nokia", 'Samsung', "Acer", 'Micromax', 'Apple', 'Micromax', 'Zen', "Oppo"]),
           'Avg.Price': pd.Series([10000, 8000, 12500, 7000, 40000, 12000, 12999, 13999, 59999, 15609, 34051]),
           'Rating(10)': pd.Series([7, 6.5, 8.5, 9, 8, 9.5, 7, 9, 7, 8.5, 9])}
d = pd.DataFrame(details)
d
df2 = d.sort_values(by=['Avg.Price'], ascending=False)  # sort by a single column, descending
df2
df2 = d.sort_values(by=['Brand', 'Avg.Price'], ascending=[True, False])  # ascending on the first column, descending on the second
df2

get_dummies()

  • Converts categorical variables into dummy/indicator columns (when called on a DataFrame, a prefix list with length equal to the number of encoded columns can be passed).

  • Returns the DataFrame with one-hot encoded values.

Example: predicting CAR PRICE from MODEL and MILEAGE; MODEL is categorical, so it must be encoded first.

Label encoding maps each category to a single integer column, e.g. Type {A: 0, B: 1, C: 2} or Gender {M: 1, F: 0}. The dummy (one-hot) method instead creates one binary column per category:

Type   Type.A  Type.B  Type.C
A      1       0       0
C      0       0       1
B      0       1       0

Gender  Gender.M  Gender.F
M       1         0
F       0         1
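A quick sketch of both encodings on toy data:

import pandas as pd

g = pd.DataFrame({"Gender": ["M", "F", "M", "F"]})
g["Gender_label"] = g["Gender"].map({"M": 1, "F": 0})   # label encoding: one integer column
dummies = pd.get_dummies(g["Gender"], prefix="Gender")  # dummy method: one binary column per category
print(g.join(dummies))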
import pandas as pd
s = pd.Series(['Orange', 'Pink', 'Blue', 'Orange', 'Pink', 'Pink'])
s1 = s.str.get_dummies()
s1
pd.get_dummies(d.Brand)
## label encoding would instead map e.g. {"A": 0, "B": 1}

Indexing & Selecting Data

  • loc() : label-based indexing. Takes two single labels/lists/slices separated by ','; the first selects rows and the second selects columns.

  • iloc() : integer-position-based indexing.

loc()

import pandas as pd
details = {'Brand': pd.Series(['Nokia', 'Asus', "Nokia", "Nokia", 'Samsung', "ABC", 'Micromax', 'Apple', 'MI', 'Zen', "Apple"]),
           'Avg.Price': pd.Series([10000, 8000, 12500, 7000, 40000, 12000, 12999, 13999, 59999, 7899, 78999]),
           'Rating(10)': pd.Series([7, 6.5, 8.5, 9, 8, 9.5, 7, 9, 10, 7, 6.9])}
d = pd.DataFrame(details)
d
d['Avg.Price']
d.Brand
0        Nokia
1         Asus
2        Nokia
3        Nokia
4      Samsung
5          ABC
6     Micromax
7        Apple
8           MI
9          Zen
10       Apple
Name: Brand, dtype: object
d.loc[:,'Rating(10)']
0      7.0
1      6.5
2      8.5
3      9.0
4      8.0
5      9.5
6      7.0
7      9.0
8     10.0
9      7.0
10     6.9
Name: Rating(10), dtype: float64
d.iloc[:,1]
0     10000
1      8000
2     12500
3      7000
4     40000
5     12000
6     12999
7     13999
8     59999
9      7899
10    78999
Name: Avg.Price, dtype: int64
d.loc[3, ["Brand", "Rating(10)"]]
## d.loc[:, "Brand"]
## d.loc["Nokia"]

About iloc():

# Select all rows for a specific column
d.iloc[:, 2]
d.iloc[2:5]
# Integer slicing
d.iloc[2:5, 1:3]
# Slicing through lists of positions
d.iloc[[7], [0]]
### Using attribute or bracket access
d.Brand
d["Brand"]

Creating joins in Pandas

  • Full Outer Join

combines the results of both the left and the right outer joins. The joined DataFrame contains all records from both DataFrames, with NaN filled in for missing matches on either side. You can perform a full outer join by passing how='outer' to the merge() function.

  • Inner Join

returns only the rows that match in both DataFrames.

  • There should be relevance: a common field to join on, driven by the analysis requirement.

import pandas as pd
d = {'id': ['1', '2', '3', '4', '5', '6'],
     'Color': ['RED', 'GREEN', 'YELLOW', 'BLUE', 'PINK', 'BLACK'],
     'Fruit': ['APPLE', 'BANANA', 'MANGO', 'BERRY', 'MELON', 'GRAPES']}
d1 = pd.DataFrame(d)
d1
z = {'id': ['1', '2', '3', '4', '5', '7'],
     'rating': ['A', 'B', 'C', 'A', 'B', 'A'],
     'COST': [200, 230, 400, 400, 100, 450],
     'Fruit': ['APPLE', 'BANANA', 'MANGO', 'BERRY', 'MELON', 'KIWI'],
     'BUY': ['Y', 'N', 'Y', 'N', 'Y', 'N']}
d2 = pd.DataFrame(z)
d2
join = pd.merge(d1, d2, on=["Fruit", "id"], how='inner')  # similarly we can use 'outer', 'right', or 'left'
join
# Selecting a few fields from one of the frames
join = pd.merge(d2, d1[["Color", "Fruit"]], on="Fruit", how='outer')  # similarly 'inner', 'right', or 'left'
join
# Selecting one common field from the frames
join = pd.merge(d1, d2, on=["id"], how='outer')  # similarly 'inner', 'right', or 'left'
join

Let's read data and combine it

s1=pd.read_csv(r"C:\Users\suyashi144893\Documents\data Sets\Exam_Data3.csv")
s1.head(2)
s1.shape
(320, 3)
s2=pd.read_excel(r"C:\Users\suyashi144893\Documents\data Sets\Exam_data4.xlsx")
s2.head(3)
s2.shape
s2.describe(include="object")
join_s3 = pd.merge(s1, s2[["ROLL NO", "Hindi"]], on=["ROLL NO"], how='outer')
join_s3
join_s3.isnull().sum()
Economics    0
ROLL NO      0
Class        0
Hindi        0
dtype: int64

How does the simple merge function differ from join?

A plain merge defaults to an inner join.

df_merge = pd.merge(d1, d2, on='Fruit')
df_merge

Concatenation

a = {"id": [12, 14],
     'Color': ['Brown', 'yellow'],
     'Fruit': ['Pear', 'muskmelon']}
d3 = pd.DataFrame(a)
d3
d1
df_col = pd.concat([d1, d3], axis=0, ignore_index=True)  # ignore_index=True already resets the index
df_col
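concat can also place frames side by side; a small sketch (rows are aligned on the index, and unmatched rows get NaN):

df_side = pd.concat([d1, d3], axis=1)  # axis=1 adds columns instead of rows
df_side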

Frequency table : Crosstab

d2
my_tab = pd.crosstab(index=d2["rating"],  # make a crosstab
                     columns="count")     # name the count column
my_tab
c = d2.rating.value_counts().to_frame().reset_index()  # call value_counts() on the Series; pd.value_counts is deprecated
c

datetime.now() gives you the current date and time.

%%time
from datetime import datetime  # pd.datetime was removed from Pandas; use the standard datetime module
datetime.now()
%%time
from time import sleep
print("Hi")
sleep(6)  # delay function
3 + 4
Hi
CPU times: total: 0 ns
Wall time: 6.01 s
7

Input Output

Reading and writing files

Pandas can write a DataFrame to many formats; each to_* writer below has a matching read_* reader (a round-trip sketch follows the list):

  • df.to_csv(...) # csv file

  • df.to_hdf(...) # HDF5 file

  • df.to_pickle(...) # serialized object

  • df.to_sql(...) # to SQL database

  • df.to_excel(...) # to Excel sheet

  • df.to_json(...) # to JSON string

  • df.to_html(...) # render as HTML table

  • df.to_feather(...) # binary feather-format

  • df.to_latex(...) # tabular environment table

  • df.to_stata(...) # Stata binary data files

  • df.to_msgpack(...) # msgpack (serialize) object (removed in recent Pandas versions)

  • df.to_gbq(...) # to a Google BigQuery table.

  • df.to_string(...) # console-friendly tabular output.

  • df.to_clipboard(...) # clipboard that can be pasted into Excel
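A minimal round trip (the file name here is just an example):

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df.to_csv("roundtrip.csv", index=False)  # write to disk
back = pd.read_csv("roundtrip.csv")      # read it back
print(back)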

import os
print(os.getcwd())  # relative paths are resolved against the current working directory
# General pattern: pd.read_<type>(r"path\to\name.ext")
import pandas as pd
crop = pd.read_csv("Crop_recommendation.csv")
crop.head(2)
import pandas as pd
viz_data = pd.ExcelFile(r"C:\Users\suyashi144893\Documents\data Sets\Exam_data.xlsx").parse("StudentsPerformance")
viz_data.head()
viz_data2 = pd.ExcelFile(r"C:\Users\suyashi144893\Documents\data Sets\Exam_data.xlsx").parse("Result-2")
viz_data2.head(2)

Read a file from GitHub

n = pd.read_csv(r"https://raw.githubusercontent.com/suyashi29/python-su/master/Python%20for%20Data%20Science/titanic.csv")
# n = pd.read_csv(r"https://raw.githubusercontent.com/suyashi29/python-su/master/cars.csv")
n.head(2)
cu = pd.read_csv(r"https://raw.githubusercontent.com/suyashi29/python-su/master/data/Chicago.csv")
cu.head()
n.isnull().sum()
n
n.drop(columns=["Ticket"],inplace=True)
n
n.describe(include = "object")
n.drop_duplicates()  # returns a new DataFrame; assign it (or pass inplace=True) to keep the result
a = n[["Name", "Age"]]
type(a)

Creating a CSV file

empid = [100, 200, 300, 400, 500]
emprole = ["lead", "Trainer", "Consultant", "Sales", "VP"]
details = list(zip(empid, emprole))
details
[(100, 'lead'), (200, 'Trainer'), (300, 'Consultant'), (400, 'Sales'), (500, 'VP')]
import pandas as pd
df = pd.DataFrame(data=details, index=["ONE", "TWO", "THREE", "FOUR", "FIVE"])
df
df.to_csv("emp.csv",header=False,index=True)

Parsing JSON Dataset

import pandas as pd  # requests and pandas.io.json.json_normalize are not needed here; read_json handles the URL
url = "https://api.exchangerate-api.com/v4/latest/USD"
df = pd.read_json(url)
df
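For nested JSON, pd.json_normalize flattens records into columns; a sketch on an inline payload (toy data):

import pandas as pd

records = [{"name": "A", "info": {"city": "Delhi", "zip": "110001"}},
           {"name": "B", "info": {"city": "Pune", "zip": "411001"}}]
flat = pd.json_normalize(records)  # nested keys become info.city, info.zip
print(flat)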

Pandas SQL Operations

We can use the pandas read_sql_query function to read the results of a SQL query directly into a pandas DataFrame. The code below builds a small SQLite database and then queries it, returning a DataFrame. This approach has several advantages over using a raw cursor:

  • It doesn’t require us to create a Cursor object or call fetchall at the end.

  • It automatically reads in the names of the headers from the table.

  • It creates a DataFrame, so we can quickly explore the data.

sqlite3 ships with the Python standard library, so no separate installation is normally required.

  1. Create a connection

  2. Create a cursor for queries

  3. Close the connection
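A minimal sketch of the three steps against an in-memory database:

import sqlite3

conn = sqlite3.connect(":memory:")  # 1. create a connection
cur = conn.cursor()                 # 2. create a cursor for queries
cur.execute("SELECT 1")
print(cur.fetchone())               # -> (1,)
conn.close()                        # 3. close the connection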

# Program to connect to an existing database.
# If the database does not exist, it will be created and a connection object returned.
import sqlite3 as sq
conn = sq.connect('D3.db')
print("My first Connection")
My first Connection
conn = sq.connect('D1.db')  # note: the module was imported as sq above
print("Opened database successfully")
conn.execute('''CREATE TABLE COM
         (ID INT PRIMARY KEY NOT NULL,
         NAME TEXT NOT NULL,
         AGE INT NOT NULL,
         ADDRESS CHAR(50),
         SALARY REAL);''')
print("Table created......")
conn.close()
# Program to create a table in the previously created database
conn = sq.connect('D3.db')
print("Opened database successfully")
conn.execute('''CREATE TABLE DSClass2
         (ID INT PRIMARY KEY NOT NULL,
         NAME TEXT NOT NULL,
         AGE INT NOT NULL,
         Job_Role CHAR(20));''')
print("Table created......")
conn.close()
Opened database successfully
Table created......
## Insert records into the DSClass2 table created above
conn = sq.connect('D3.db')
print("Opened database successfully")
conn.execute("INSERT INTO DSClass2 (ID,NAME,AGE,Job_Role) VALUES (8, 'Pie', 48, 'VP')")
conn.execute("INSERT INTO DSClass2 (ID,NAME,AGE,Job_Role) VALUES (9, 'Ria', 40, 'VP')")
conn.commit()
print("Records updated successfully")
conn.close()
Opened database successfully
Records updated successfully
## Insert more records into the DSClass2 table
conn = sq.connect('D3.db')
print("Opened database successfully")
conn.execute("INSERT INTO DSClass2 (ID,NAME,AGE,Job_Role) VALUES (3, 'Toy', 28, 'Lead')")
conn.execute("INSERT INTO DSClass2 (ID,NAME,AGE,Job_Role) VALUES (4, 'June', 49, 'Sr. Manager')")
conn.commit()
print("Records updated successfully")
conn.close()
Opened database successfully
Records updated successfully
import pandas as pd
import sqlite3
conn = sqlite3.connect("D3.db")
Q = pd.read_sql_query('''SELECT id, name, Job_Role from DSClass2''', conn)
d = pd.DataFrame(Q)  # Q is already a DataFrame; this line just copies it
d
import pandas as pd  # pandas reads a zipped CSV directly, so the zipfile module isn't needed
df = pd.read_csv("car_price.zip")
df