
Pandas

Pandas is the most popular Python library for data analysis and data structures.


  • Python with Pandas is used in a wide range of academic and commercial domains, including finance, economics, statistics, and analytics.

Key Features of Pandas

  • Fast and efficient DataFrame object with default and customized indexing.

  • Tools for loading data into in-memory data objects from different file formats.

  • Data alignment and integrated handling of missing data.

  • Reshaping and pivoting of data sets.

  • Label-based slicing, indexing and subsetting of large data sets.

  • Columns from a data structure can be deleted or inserted.

  • Group-by functionality for aggregation and transformation.

  • High performance merging and joining of data.

  • Time Series functionality.
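A minimal sketch touching a few of these features (the data and column names here are illustrative, not from the course material):

import pandas as pd
import numpy as np

# Toy frame: customized index, a missing value, label-based slicing, group-by
df = pd.DataFrame({"city": ["Delhi", "Pune", "Delhi", "Pune"],
                   "sales": [10, np.nan, 30, 40]},
                  index=["a", "b", "c", "d"])
df["sales"] = df["sales"].fillna(0)        # integrated missing-data handling
print(df.loc["a":"c", ["city", "sales"]])  # label-based slicing and subsetting
print(df.groupby("city")["sales"].sum())   # group by for aggregation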

Data Analysis

  • Raw data → information → prepare → feature selection → model data

  • Import data (data acquisition) → data preparation (data cleaning, feature engineering) → EDA → model data

Installation

The standard Python distribution doesn't come bundled with the Pandas module. A lightweight way to install it is with the popular Python package installer, pip.

  • pip install pandas

  • If you install the Anaconda Python distribution (from https://www.continuum.io), Pandas is installed by default; Anaconda is a free Python distribution for the SciPy stack.
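Either way, a quick sanity check (a minimal sketch) confirms the installation:

import pandas as pd
print(pd.__version__)  # prints the installed Pandas version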

Pandas deals with the following three data structures:

  • Series: a one-dimensional array-like structure with homogeneous data.

  • DataFrame: a two-dimensional array with heterogeneous data.

  • Panel: a three-dimensional data structure with heterogeneous data. A panel is hard to show graphically, but it can be illustrated as a container of DataFrames. These data structures are built on top of NumPy arrays. (Note: Panel has been removed from recent Pandas versions; a MultiIndex DataFrame is the usual replacement.)

  • Data acquisition → data preparation (data cleaning, data manipulation, data engineering); raw data → information → insights → actions

  • pip install pandas

  • import pandas

  • Let's work with Series

  1. Series (1D): rows

  2. DataFrame (2D): rows and columns

  3. Panel (3D)

import pandas
A = (1, 2, 3, 4, 5, 6)
x = pandas.Series(A)
x
0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64
import pandas as pd
a = (1, 2, 3, 4, 5, 6)
b = ["one", "two", "three", "four", "five", "six"]
x = pd.Series(a, dtype=int, index=b)
x
one      1
two      2
three    3
four     4
five     5
six      6
dtype: int32
# Series from a list
s = pd.Series([1, 2, "a", 4], index=("I", "II", "III", "IV"))
s

Creating a Series from a dictionary

emp = {"A": 8, "B": 9, "C": 10}
details = pd.Series(emp)
details
A     8
B     9
C    10
dtype: int64
  • Note: the dictionary's values become the Series elements and its keys become the index. Because a dictionary is a mapping type, we cannot manipulate the index positionally as we do with lists and tuples.

Quick Practice

  • Create a Series using (1) a list, (2) a tuple, and (3) a dictionary (a sample solution sketch follows this list).

  • Length should be 5.

  • For the list and the tuple, name the index.
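One possible solution sketch (the names and values here are only examples):

import pandas as pd

s_list = pd.Series([10, 20, 30, 40, 50], index=["a", "b", "c", "d", "e"])  # from a list, named index
s_tuple = pd.Series((1, 2, 3, 4, 5), index=["p", "q", "r", "s", "t"])      # from a tuple, named index
s_dict = pd.Series({"A": 1, "B": 2, "C": 3, "D": 4, "E": 5})               # from a dict; keys become the index
print(s_list, s_tuple, s_dict, sep="\n")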

# Changing the order of the index
age = {'ram': 28, 'bob': 19, 'cam': 22}
s = pd.Series(age, index=['bob', 'ram', 'cam', "Lina"])  # dtype=int would fail here: "Lina" has no value, so the NaN forces float
s
## type(s)  -> pandas.core.series.Series
bob     19.0
ram     28.0
cam     22.0
Lina     NaN
dtype: float64
  • Note: the missing value ('Lina') is filled with NaN, and values are matched to the index through the dictionary keys.

  • Let's work with DataFrames

import pandas as pd
# DataFrame from a 1D list
l = ["ashi", "rom", "sid"]
df = pd.DataFrame(l)
df
# 2D list
data = [['Nokia', 10000], ['Asus', 12000], ['Samsung', 13000], ['Apple', 33000]]
d = pd.DataFrame(data, columns=['Mobile', 'Price'], index=[1, 2, 3, 4])
d

Quick Practice: create a DataFrame containing the marks of six students in Chemistry, Physics, and Maths.

Note: if no index or columns are passed, then by default the index will be range(n), where n is the array length.

import pandas as pd
data = [["Ashi", 34, 23, 56], ["Bob", 92, 88, 76], ["Coy", 76, 87, 25],
        ["Don", 45, 78, 99], ["Muh", 95, 68, 30], ["May", 76, 78, 99]]
stu = pd.DataFrame(data, columns=["Name", 'Maths', 'Phy', 'Chem'])
stu

Creating a DataFrame from Series

import pandas as pd
# Each column is a Series with its own index; Pandas aligns them by label
d = {'Chem': pd.Series([30, 70, 35, 25, 26, 67, 77, 67, 89, 98],
                       index=["Raj", 'Ram', 'Asa', 'Pi', 'Chi', 'Ru', "Sita", "Ria", "Gita", "Sia"], dtype=int),
     'Math': pd.Series([18, 26, 35, 40, 55, 89, 79, 100],
                       index=["Raj", 'Ram', 'Pi', 'Chi', 'Ru', "Sita", "Ria", "Gita"], dtype=int),
     'Phy': pd.Series([31, 42, 83, 34, 80, 78, 67, 98, 67],
                      index=["Asa", 'Ram', 'Pi', 'Ru', "Sita", "Gita", "Chi", "Sia", "Raj"], dtype=int)}
exam = pd.DataFrame(d)
exam1 = exam.copy()  # make copies of the data for later sections
exam2 = exam.copy()
exam

Data Preparation

  • Removing or replacing null values

  • Data description

  • Adding new data fields for analysis

  • Feature selection: for ML, predictions, and decisions

exam1
# Adding columns
print("Adding a new column using the existing columns in DataFrame:")
exam1['Total'] = exam1['Chem'] + exam1['Math'] + exam1['Phy']
exam1
Adding a new column using the existing columns in DataFrame:

Dealing with missing data

  • Check for Missing Values

  • To make detecting missing values easier (and across different array dtypes), Pandas provides the isnull() and notnull() functions, which are also methods on Series and DataFrame objects; a small example follows.
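For example, a quick sketch on a toy Series:

import pandas as pd
import numpy as np

s = pd.Series([1, np.nan, 3])
print(s.isnull())   # True where a value is missing
print(s.notnull())  # the boolean complement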

exam2
exam2.isnull().sum()
Chem    0
Math    2
Phy     1
dtype: int64

Calculations with Missing Data

  • When summing data, NA is treated as zero. If the data are all NA, the result will be NA.

Cleaning / Filling Missing Data

  • Pandas provides various methods for cleaning missing values. The fillna function can “fill in” NA values with non-null data in a couple of ways, which the following sections illustrate.

  • Replace NaN with a scalar value: the following program shows how to replace "NaN" with "0".

exam
exam = exam.fillna(0)
exam
exam["Total"]=exam["Math"]+exam["Phy"]+exam["Chem"] exam
exam1
exam1 = exam1.dropna()
exam1
exam2
b = exam2["Phy"].median()  # the median is not affected by outliers
c = exam2["Math"].mean()   # the mean is affected by outliers
## the mode is another option, e.g. for categorical data
exam2["Phy"]=exam2["Phy"].fillna(b) exam2["Math"]=exam2["Math"].fillna(c) exam2

Replacing erroneous values with generic values using the replace function

df = pd.DataFrame({'one': [10, 20, 30, 40, 50, "ABC"],
                   'AGE': [-19, 1, 30, 40, 50, 60]})
df = df.replace({"ABC": 60, -19: 19})
exam1.info()
<class 'pandas.core.frame.DataFrame'>
Index: 7 entries, Chi to Sita
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Chem    7 non-null      int32
 1   Math    7 non-null      float64
 2   Phy     7 non-null      float64
 3   Total   7 non-null      float64
dtypes: float64(3), int32(1)
memory usage: 252.0+ bytes
exam1['Total'] = exam1['Total'].astype(int)
exam1
C:\Users\suyashi144893\AppData\Local\Temp\1\ipykernel_4140\60639536.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  exam1['Total']=exam1['Total'].astype(int)
exam1.describe()

exam1.describe(include="object")  # note: exam1 has only numeric columns here, so this raises a ValueError

exam1['Phy'].sum()
exam1.mean()
exam1.std()
Chem     27.002645
Math     31.482421
Phy      19.208133
Total    59.887196
dtype: float64

Stats: Data Description

import pandas as pd
details = {'Brand': pd.Series(['Nokia', 'Asus', "Nokia", "Nokia", 'Samsung', "ABC", 'Micromax', 'Apple', 'MI', 'Zen', "Apple"]),
           'Price': pd.Series([10000, 8000, 12500, 7000, 40000, 12000, 12999, 13999, 59999]),
           'Rating(10)': pd.Series([7, 6.5, 8.5, 9, 8, 9.5, 7, 9])}
d = pd.DataFrame(details)
d

Quick Practice

  • Describe the data and check for missing values

  • Replace missing values

  • Describe the data again

  • Compute the mean, mode, and median column-wise

d.isnull().sum()
Brand         0
Price         2
Rating(10)    3
dtype: int64
d["Price"].mean()
19610.777777777777
# d["Rating(10)"].describe()
d.describe()

The describe() function computes a summary of statistics pertaining to the DataFrame columns.

d.describe(include="object")
# d[d['Brand'] == 'Apple']
# d.head(2)  # displays rows from the top
# d.tail(1)  # displays rows from the bottom
# include="object" reports count, unique, top (mode), and freq for the categorical columns

Let's Practice

  • Check null values for the given data

  • Replace null values with the mean/median/mode (a solution sketch follows this list)

  • Describe the data
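One possible solution sketch for the replacement step, using the d DataFrame defined above (mean/median for the numeric columns and mode for the categorical one is one choice among several):

d["Price"] = d["Price"].fillna(d["Price"].mean())                   # numeric: mean
d["Rating(10)"] = d["Rating(10)"].fillna(d["Rating(10)"].median())  # numeric: median
d["Brand"] = d["Brand"].fillna(d["Brand"].mode()[0])                # categorical: mode
d.describe(include="all")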

# rename takes a {old_name: new_name} mapping
d = d.rename(columns={'Price': 'Avg.Price'})
d

Sorting

import pandas as pd
details = {'Brand': pd.Series(['Nokia', 'Asus', "Nokia", "Nokia", 'Samsung', "Acer", 'Micromax', 'Apple', 'Micromax', 'Zen', "Oppo"]),
           'Avg.Price': pd.Series([10000, 8000, 12500, 7000, 40000, 12000, 12999, 13999, 59999, 15609, 34051]),
           'Rating(10)': pd.Series([7, 6.5, 8.5, 9, 8, 9.5, 7, 9, 7, 8.5, 9])}
d = pd.DataFrame(details)
d
df2 = d.sort_values(by=['Avg.Price'], ascending=False)  # sort by a single column, descending
df2
df2 = d.sort_values(by=['Brand', 'Avg.Price'], ascending=[True, False])  # ascending on the first column, descending on the second
df2

get_dummies()

  • Converts categorical variables into dummy/indicator columns (when called on a DataFrame, a prefix list with length equal to the number of encoded columns can be passed).

  • Returns the DataFrame with one-hot encoded values.

Example: predicting CAR PRICE from MODEL and MILEAGE; MODEL is categorical, so it must be encoded first.

Label encoding maps each category to a single integer column, e.g. Type {A: 0, B: 1, C: 2} or Gender {M: 1, F: 0}. The dummy (one-hot) method instead creates one binary column per category:

Type   Type.A  Type.B  Type.C
A      1       0       0
C      0       0       1
B      0       1       0

Gender  Gender.M  Gender.F
M       1         0
F       0         1
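A quick sketch of both encodings on toy data:

import pandas as pd

g = pd.DataFrame({"Gender": ["M", "F", "M", "F"]})
g["Gender_label"] = g["Gender"].map({"M": 1, "F": 0})   # label encoding: one integer column
dummies = pd.get_dummies(g["Gender"], prefix="Gender")  # dummy method: one binary column per category
print(g.join(dummies))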
import pandas as pd
s = pd.Series(['Orange', 'Pink', 'Blue', 'Orange', 'Pink', 'Pink'])
s1 = s.str.get_dummies()
s1
pd.get_dummies(d.Brand)
## label encoding would instead map e.g. {"A": 0, "B": 1}

Indexing & Selecting Data

  • loc() : label-based indexing. Takes two single labels/lists/slices separated by ','; the first selects rows and the second selects columns.

  • iloc() : integer-position-based indexing.

loc()

import pandas as pd
details = {'Brand': pd.Series(['Nokia', 'Asus', "Nokia", "Nokia", 'Samsung', "ABC", 'Micromax', 'Apple', 'MI', 'Zen', "Apple"]),
           'Avg.Price': pd.Series([10000, 8000, 12500, 7000, 40000, 12000, 12999, 13999, 59999, 7899, 78999]),
           'Rating(10)': pd.Series([7, 6.5, 8.5, 9, 8, 9.5, 7, 9, 10, 7, 6.9])}
d = pd.DataFrame(details)
d
d['Avg.Price']
d.Brand
0        Nokia
1         Asus
2        Nokia
3        Nokia
4      Samsung
5          ABC
6     Micromax
7        Apple
8           MI
9          Zen
10       Apple
Name: Brand, dtype: object
d.loc[:,'Rating(10)']
0      7.0
1      6.5
2      8.5
3      9.0
4      8.0
5      9.5
6      7.0
7      9.0
8     10.0
9      7.0
10     6.9
Name: Rating(10), dtype: float64
d.iloc[:,1]
0     10000
1      8000
2     12500
3      7000
4     40000
5     12000
6     12999
7     13999
8     59999
9      7899
10    78999
Name: Avg.Price, dtype: int64
d.loc[3, ["Brand", "Rating(10)"]]
## d.loc[:, "Brand"]
## d.loc["Nokia"]

About iloc():

# Select all rows for a specific column
d.iloc[:, 2]
d.iloc[2:5]
# Integer slicing
d.iloc[2:5, 1:3]
# Slicing through lists of positions
d.iloc[[7], [0]]
### Using attribute or bracket access
d.Brand
d["Brand"]

Creating joins in Pandas

  • Full Outer Join

combines the results of both the left and the right outer joins. The joined DataFrame contains all records from both DataFrames, with NaN filled in for missing matches on either side. You can perform a full outer join by passing how='outer' to the merge() function.

  • Inner Join

returns only the rows that match in both DataFrames.

  • There should be relevance: a common field to join on, driven by the analysis requirement.

import pandas as pd
d = {'id': ['1', '2', '3', '4', '5', '6'],
     'Color': ['RED', 'GREEN', 'YELLOW', 'BLUE', 'PINK', 'BLACK'],
     'Fruit': ['APPLE', 'BANANA', 'MANGO', 'BERRY', 'MELON', 'GRAPES']}
d1 = pd.DataFrame(d)
d1
z = {'id': ['1', '2', '3', '4', '5', '7'],
     'rating': ['A', 'B', 'C', 'A', 'B', 'A'],
     'COST': [200, 230, 400, 400, 100, 450],
     'Fruit': ['APPLE', 'BANANA', 'MANGO', 'BERRY', 'MELON', 'KIWI'],
     'BUY': ['Y', 'N', 'Y', 'N', 'Y', 'N']}
d2 = pd.DataFrame(z)
d2
join = pd.merge(d1, d2, on=["Fruit", "id"], how='inner')  # similarly we can use 'outer', 'right', or 'left'
join
# Selecting a few fields from one of the frames
join = pd.merge(d2, d1[["Color", "Fruit"]], on="Fruit", how='outer')  # similarly 'inner', 'right', or 'left'
join
# Selecting one common field from the frames
join = pd.merge(d1, d2, on=["id"], how='outer')  # similarly 'inner', 'right', or 'left'
join

Let's read data and combine it

s1=pd.read_csv(r"C:\Users\suyashi144893\Documents\data Sets\Exam_Data3.csv")
s1.head(2)
s1.shape
(320, 3)
s2=pd.read_excel(r"C:\Users\suyashi144893\Documents\data Sets\Exam_data4.xlsx")
s2.head(3)
s2.shape
s2.describe(include="object")
join_s3 = pd.merge(s1, s2[["ROLL NO", "Hindi"]], on=["ROLL NO"], how='outer')
join_s3
join_s3.isnull().sum()
Economics    0
ROLL NO      0
Class        0
Hindi        0
dtype: int64

How does the simple merge function differ from join?

A plain merge defaults to an inner join.

df_merge = pd.merge(d1, d2, on='Fruit')
df_merge

Concatenation

a = {"id": [12, 14],
     'Color': ['Brown', 'yellow'],
     'Fruit': ['Pear', 'muskmelon']}
d3 = pd.DataFrame(a)
d3
d1
df_col = pd.concat([d1, d3], axis=0, ignore_index=True)  # ignore_index=True already resets the index
df_col
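concat can also place frames side by side; a small sketch (rows are aligned on the index, and unmatched rows get NaN):

df_side = pd.concat([d1, d3], axis=1)  # axis=1 adds columns instead of rows
df_side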

Frequency table : Crosstab

d2
my_tab = pd.crosstab(index=d2["rating"],  # make a crosstab
                     columns="count")     # name the count column
my_tab
c = d2.rating.value_counts().to_frame().reset_index()  # call value_counts() on the Series; pd.value_counts is deprecated
c

datetime.now() gives you the current date and time.

%%time
from datetime import datetime  # pd.datetime was removed from Pandas; use the standard datetime module
datetime.now()
%%time
from time import sleep
print("Hi")
sleep(6)  # delay function
3 + 4
Hi
CPU times: total: 0 ns
Wall time: 6.01 s
7

Input Output

Reading and writing files

Pandas can write a DataFrame to many formats; each to_* writer below has a matching read_* reader (a round-trip sketch follows the list):

  • df.to_csv(...) # csv file

  • df.to_hdf(...) # HDF5 file

  • df.to_pickle(...) # serialized object

  • df.to_sql(...) # to SQL database

  • df.to_excel(...) # to Excel sheet

  • df.to_json(...) # to JSON string

  • df.to_html(...) # render as HTML table

  • df.to_feather(...) # binary feather-format

  • df.to_latex(...) # tabular environment table

  • df.to_stata(...) # Stata binary data files

  • df.to_msgpack(...) # msgpack (serialize) object (removed in recent Pandas versions)

  • df.to_gbq(...) # to a Google BigQuery table.

  • df.to_string(...) # console-friendly tabular output.

  • df.to_clipboard(...) # clipboard that can be pasted into Excel
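A minimal round trip (the file name here is just an example):

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df.to_csv("roundtrip.csv", index=False)  # write to disk
back = pd.read_csv("roundtrip.csv")      # read it back
print(back)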

import os
print(os.getcwd())  # relative paths are resolved against the current working directory
# General pattern: pd.read_<type>(r"path\to\name.ext")
import pandas as pd
crop = pd.read_csv("Crop_recommendation.csv")
crop.head(2)
import pandas as pd
viz_data = pd.ExcelFile(r"C:\Users\suyashi144893\Documents\data Sets\Exam_data.xlsx").parse("StudentsPerformance")
viz_data.head()
viz_data2 = pd.ExcelFile(r"C:\Users\suyashi144893\Documents\data Sets\Exam_data.xlsx").parse("Result-2")
viz_data2.head(2)

Read a file from GitHub

n = pd.read_csv(r"https://raw.githubusercontent.com/suyashi29/python-su/master/Python%20for%20Data%20Science/titanic.csv")
# n = pd.read_csv(r"https://raw.githubusercontent.com/suyashi29/python-su/master/cars.csv")
n.head(2)
cu = pd.read_csv(r"https://raw.githubusercontent.com/suyashi29/python-su/master/data/Chicago.csv")
cu.head()
n.isnull().sum()
n
n.drop(columns=["Ticket"],inplace=True)
n
n.describe(include = "object")
n.drop_duplicates()  # returns a new DataFrame; assign it (or pass inplace=True) to keep the result
a = n[["Name", "Age"]]
type(a)

Creating a CSV file

empid = [100, 200, 300, 400, 500]
emprole = ["lead", "Trainer", "Consultant", "Sales", "VP"]
details = list(zip(empid, emprole))
details
[(100, 'lead'), (200, 'Trainer'), (300, 'Consultant'), (400, 'Sales'), (500, 'VP')]
import pandas as pd
df = pd.DataFrame(data=details, index=["ONE", "TWO", "THREE", "FOUR", "FIVE"])
df
df.to_csv("emp.csv",header=False,index=True)

Parsing JSON Dataset

import pandas as pd  # requests and pandas.io.json.json_normalize are not needed here; read_json handles the URL
url = "https://api.exchangerate-api.com/v4/latest/USD"
df = pd.read_json(url)
df
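For nested JSON, pd.json_normalize flattens records into columns; a sketch on an inline payload (toy data):

import pandas as pd

records = [{"name": "A", "info": {"city": "Delhi", "zip": "110001"}},
           {"name": "B", "info": {"city": "Pune", "zip": "411001"}}]
flat = pd.json_normalize(records)  # nested keys become info.city, info.zip
print(flat)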

Pandas SQL Operations

We can use the pandas read_sql_query function to read the results of a SQL query directly into a pandas DataFrame. The code below builds a small SQLite database and then queries it, returning a DataFrame. This approach has several advantages over using a raw cursor:

  • It doesn’t require us to create a Cursor object or call fetchall at the end.

  • It automatically reads in the names of the headers from the table.

  • It creates a DataFrame, so we can quickly explore the data.

sqlite3 ships with the Python standard library, so no separate installation is normally required.

  1. Create a connection

  2. Create a cursor for queries

  3. Close the connection
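A minimal sketch of the three steps against an in-memory database:

import sqlite3

conn = sqlite3.connect(":memory:")  # 1. create a connection
cur = conn.cursor()                 # 2. create a cursor for queries
cur.execute("SELECT 1")
print(cur.fetchone())               # -> (1,)
conn.close()                        # 3. close the connection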

# Program to connect to an existing database.
# If the database does not exist, it will be created and a connection object returned.
import sqlite3 as sq
conn = sq.connect('D3.db')
print("My first Connection")
My first Connection
conn = sq.connect('D1.db')  # note: the module was imported as sq above
print("Opened database successfully")
conn.execute('''CREATE TABLE COM
         (ID INT PRIMARY KEY NOT NULL,
         NAME TEXT NOT NULL,
         AGE INT NOT NULL,
         ADDRESS CHAR(50),
         SALARY REAL);''')
print("Table created......")
conn.close()
# Program to create a table in the previously created database
conn = sq.connect('D3.db')
print("Opened database successfully")
conn.execute('''CREATE TABLE DSClass2
         (ID INT PRIMARY KEY NOT NULL,
         NAME TEXT NOT NULL,
         AGE INT NOT NULL,
         Job_Role CHAR(20));''')
print("Table created......")
conn.close()
Opened database successfully
Table created......
## Insert records into the DSClass2 table created above
conn = sq.connect('D3.db')
print("Opened database successfully")
conn.execute("INSERT INTO DSClass2 (ID,NAME,AGE,Job_Role) VALUES (8, 'Pie', 48, 'VP')")
conn.execute("INSERT INTO DSClass2 (ID,NAME,AGE,Job_Role) VALUES (9, 'Ria', 40, 'VP')")
conn.commit()
print("Records updated successfully")
conn.close()
Opened database successfully
Records updated successfully
## Insert more records into the DSClass2 table
conn = sq.connect('D3.db')
print("Opened database successfully")
conn.execute("INSERT INTO DSClass2 (ID,NAME,AGE,Job_Role) VALUES (3, 'Toy', 28, 'Lead')")
conn.execute("INSERT INTO DSClass2 (ID,NAME,AGE,Job_Role) VALUES (4, 'June', 49, 'Sr. Manager')")
conn.commit()
print("Records updated successfully")
conn.close()
Opened database successfully
Records updated successfully
import pandas as pd
import sqlite3
conn = sqlite3.connect("D3.db")
Q = pd.read_sql_query('''SELECT id, name, Job_Role from DSClass2''', conn)
d = pd.DataFrame(Q)  # Q is already a DataFrame; this line just copies it
d
import pandas as pd  # pandas reads a zipped CSV directly, so the zipfile module isn't needed
df = pd.read_csv("car_price.zip")
df