GitHub Repository: suyashi29/python-su
Path: blob/master/Data Analytics Using Python/2 Data Handling Using Pandas.ipynb
Kernel: Python 3 (ipykernel)

Pandas

Pandas is the most popular Python data analysis and data structure library.


  • Python with Pandas is used in a wide range of academic and commercial domains, including finance, economics, statistics, and analytics.

Key Features of Pandas

  • Fast and efficient DataFrame object with default and customized indexing.

  • Tools for loading data into in-memory data objects from different file formats.

  • Data alignment and integrated handling of missing data.

  • Reshaping and pivoting of data sets.

  • Label-based slicing, indexing and subsetting of large data sets.

  • Columns from a data structure can be deleted or inserted.

  • Group by data for aggregation and transformations.

  • High performance merging and joining of data.

  • Time Series functionality.
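
As a small taste of the last feature, a minimal time-series sketch with made-up daily data (the dates and values are illustrative only):

import pandas as pd
import numpy as np

idx = pd.date_range("2024-01-01", periods=10, freq="D")   # ten consecutive days
ts = pd.Series(np.arange(10), index=idx)
ts.resample("W").sum()   # aggregate the daily values into weekly totals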

Data Analysis

  • Raw data - information - prepare - feature selection - model the data

  • Import data (data acquisition) - data preparation (cleaning data, data engineering) - EDA - model the data

Installation

The standard Python distribution doesn't come bundled with the Pandas module. A lightweight way to install it is with the popular Python package installer, pip.

  • pip install pandas

  • If you install the Anaconda Python distribution, Pandas will be installed by default. Anaconda (from https://www.continuum.io) is a free Python distribution for the SciPy stack.

Pandas deals with the following three data structures −

  • Series: Series is a one-dimensional array like structure with homogeneous data.

  • DataFrame: DataFrame is a two-dimensional, tabular structure with heterogeneous data.

  • Panel: Panel is a three-dimensional data structure with heterogeneous data. It is hard to represent a panel graphically, but it can be illustrated as a container of DataFrames. These data structures are built on top of NumPy arrays.

  • Data acquisition - data preparation (data cleaning, data manipulation, data engineering). Raw data - information - insights - actions.

  • Install with pip install pandas, then import pandas in your code.

  • Let's work with Series

  1. Series (1D): rows

  2. DataFrame (2D): rows and columns

  3. Panel (3D): removed from current pandas; see the MultiIndex sketch below
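
Note that Panel has been removed from current pandas releases; a DataFrame with a MultiIndex is the usual way to hold 3D-shaped data today. A minimal sketch with made-up values:

import pandas as pd

# Two "items" (A and B), each holding a 2-row block, stacked into one MultiIndex DataFrame
idx = pd.MultiIndex.from_product([["A", "B"], [0, 1]], names=["item", "row"])
panel_like = pd.DataFrame({"x": [1, 2, 3, 4], "y": [5, 6, 7, 8]}, index=idx)
panel_like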

a=(1,23,23) a
(1, 23, 23)
import pandas as pd   # conda install pandas
a = (1, 23, 23)
y = pd.Series(a)
y
type(y)
pandas.core.series.Series
import pandas as pd
a = [1, 2, 3, 4, 5, 6]
b = ["one", "two", "three", "four", "five", "six"]
# dtype=int
x = pd.Series(a, dtype=float, index=b)
x
one      1.0
two      2.0
three    3.0
four     4.0
five     5.0
six      6.0
dtype: float64
# Series from a list
s = pd.Series([1, 2, 3, 4], index=("I", "II", "III", "IV"), dtype=int)
s
I      1
II     2
III    3
IV     4
dtype: int32

Creating series from dictionary

emp = {"A": 8, "B": 9, "C": 10}
details = pd.Series(emp)
details
A     8
B     9
C    10
dtype: int64
  • Note: The dictionary values are used as the Series elements and the keys become the index. A dictionary is a mapping data type, so we cannot manipulate its index the way we do with lists and tuples.

# Changing the order of the index
age = {'ram': 28, 'bob': 19, 'cam': 22}
s = pd.Series(age, index=['bob', 'ram', 'cam', "Lina"], dtype=int)
s
bob     19.0
ram     28.0
cam     22.0
Lina     NaN
dtype: float64
  • Note: The index labels are taken from the keys, and any label without a matching key is filled with NaN.

  • Let's work with DataFrames

import pandas as pd
# DataFrame from a 1D list
l = ["ashi", "rom", "sid"]
df = pd.DataFrame(l)
df
# 2D list
data = [['Nokia', 10000], ['Asus', 12000], ['Samsung', 13000], ['Apple', 33000]]
d = pd.DataFrame(data, columns=['Mobile', 'Price'], index=[1, 2, 3, 4])
d

Exercise: create a DataFrame containing the details of six students with Chem, Phy and Maths marks, using the roll number or name as the index (a sample solution sketch follows the note below).

Note: If no index or columns are passed, then by default the index will be range(n), where n is the array length.
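
A minimal sketch of a possible solution to the exercise above, assuming six hypothetical student names and made-up marks:

import pandas as pd

marks = {
    "Chem":  [56, 78, 91, 45, 66, 83],
    "Phy":   [61, 72, 88, 52, 70, 79],
    "Maths": [74, 69, 95, 58, 81, 90],
}
students = pd.DataFrame(marks, index=["Asha", "Ben", "Chen", "Dev", "Esha", "Farid"])  # names as index
students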

Creating a DataFrame from Series

import pandas as pd
# Selecting columns
d = {'Chem': pd.Series([30, 70, 85, 65, 26, 67, 77, 67, 89, 98],
                       index=["Raj", 'Ram', 'Asa', 'Pi', 'Chi', 'Ru', "Sita", "Ria", "Gita", "Sia"], dtype=int),
     'Math': pd.Series([18, 26, 35, 40, 55, 89, 79, 100],
                       index=["Raj", 'Ram', 'Pi', 'Chi', 'Ru', "Sita", "Ria", "Gita"], dtype=int),
     'Phy': pd.Series([31, 42, 83, 34, 80, 78, 67, 98],
                      index=["Asa", 'Ram', 'Pi', 'Ru', "Sita", "Gita", "Chi", "Sia"], dtype=int)}
exam = pd.DataFrame(d)
exam1 = exam.copy()   ## make copies of the data
exam2 = exam.copy()
exam

Data Preparation

  • Removing null values/replace

  • Data description

  • Adding new data fields for analysis

  • Feature selection: for ML models, predictions and decisions

# Adding columns
print("Adding a new column using the existing columns in DataFrame:")
exam1['Total'] = exam1['Chem'] + exam1['Math'] + exam1['Phy']
exam1
Adding a new column using the existing columns in DataFrame:

Dealing with missing data

  • Check for Missing Values

  • To make detecting missing values easier (and across different array dtypes), Pandas provides the isnull() and notnull() functions, which are also methods on Series and DataFrame objects.

exam
exam.isnull().sum()
Chem    0
Math    2
Phy     2
dtype: int64

mean: the average (affected by outliers); mode: the most frequent value; median: the middle value of the sorted data, e.g. of (1, 2, 3, 4, 5, 6), which is robust to outliers.
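
A quick sketch of how an outlier pulls the mean but barely moves the median (values made up for illustration):

import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6])
print(s.mean(), s.median())           # 3.5 3.5

s_out = pd.Series([1, 2, 3, 4, 5, 600])   # one large outlier
print(s_out.mean(), s_out.median())   # 102.5 3.5 -- the mean shifts, the median does not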

Calculations with Missing Data

  • When summing data, NA is treated as zero. If the data are all NA, the result is NA.

Cleaning / Filling Missing Data

  • Pandas provides various methods for cleaning the missing values. The fillna function can “fill in” NA values with non-null data in a couple of ways, which we have illustrated in the following sections.

  • Replace NaN with a scalar value: the following program shows how you can replace NaN with 0.

exam1
exam1=exam1.dropna() exam1
exam2
exam2 = exam2.fillna(0)
exam2["T"] = exam2["Chem"] + exam2["Math"] + exam2["Phy"]
exam2
a=exam["Math"].median() a
a = exam["Math"].median()
exam["Math"] = exam["Math"].fillna(a)
b = exam["Phy"].median()
exam["Phy"] = exam["Phy"].fillna(b)
exam
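
The cells above fill NaNs with a scalar or with the column median; pandas can also propagate neighbouring values. A self-contained sketch with made-up data:

import numpy as np
import pandas as pd

demo = pd.DataFrame({"score": [10, np.nan, np.nan, 40]})
print(demo.ffill())   # forward-fill: carry the last valid value downwards
print(demo.bfill())   # backward-fill: pull the next valid value upwards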

Dropping missing values using "dropna()"

  • Rule of thumb: out of 10,000 records, a column A with 999 NAs may be dropped, a column B with 79 NAs has its NAs replaced, and a column C with 0 NAs needs no treatment.

Replacing error values with generic values using the replace() function

df = pd.DataFrame({'one': [10, 20, 30, 40, 50, "ABC"],
                   'AGE': [-19, 1, 30, 40, 50, 60]})
df = df.replace({"ABC": 60, -19: 19})
df

Stats: Data Description

  • dropna()

  • fillna(0)

  • fillna() with the median, mode or mean

Quick Exercise

import pandas as pd
details = {'Brand': pd.Series(['Nokia', 'Asus', "Nokia", "Nokia", 'Samsung', "ABC", 'Micromax', 'Apple', 'MI', 'Zen', "Apple"]),
           'Price': pd.Series([10000, 8000, 12500, 7000, 40000, 12000, 12999, 13999, 59999]),
           'Rating(10)': pd.Series([7, 6.5, 8.5, 9, 8, 9.5, 7, 9])}
d = pd.DataFrame(details)
d
d.describe()
d.describe(include="object")
  • Check for missing values in the given data

  • Replace the missing values

  • Describe the data

  • Compute the mean, mode and median column-wise (a sketch follows the describe() calls below)

d.isnull().sum()
Brand         0
Price         2
Rating(10)    3
dtype: int64
d["Price"].mean()
19610.777777777777
# d["Rating(10)"].describe()
d.describe()

The describe() function computes a summary of statistics pertaining to the DataFrame columns.

d.describe(include="object")
# d[d['Brand'] == 'Apple']
# d.head(2)   # displays rows from the start
# d.tail(1)   # displays rows from the bottom
# describe(include="object") reports count, unique values, the top value and its frequency for the categorical columns
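
One way to get the column-wise statistics asked for in the exercise (a sketch; NaN values are skipped by default):

print(d.mean(numeric_only=True))      # column-wise mean of the numeric columns
print(d.median(numeric_only=True))    # column-wise median
print(d.mode().head(1))               # first modal value of every column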

Let's Practice

  • Check the null values for the given data

  • Replace the null values with the mean/median/mode

  • Describe the data

Renaming

  • The rename() method allows you to relabel an axis based on some mapping (a dict or Series) or an arbitrary function.

d = d.rename(columns={'Brand': 'Type', 'Avg.Price': 'Price'},
             index={0: 'S0', 1: 'S1', 2: 'S2'})
d

Sorting

import pandas as pd
details = {'Brand': pd.Series(['Nokia', 'Asus', "Nokia", "Nokia", 'Samsung', "ABC", 'Micromax', 'Apple', 'MI', 'Zen', "Apple"]),
           'Avg.Price': pd.Series([10000, 8000, 12500, 7000, 40000, 12000, 12999, 13999, 59999, 8000, 76000]),
           'Rating(10)': pd.Series([7, 6.5, 8.5, 9, 8, 9.5, 7, 9, 8, 5, 5])}
d = pd.DataFrame(details)
d
d.sort_values(by=['Avg.Price'], ascending=False)   # descending order
d.sort_values(by=['Brand', 'Avg.Price'], ascending=[True, False])   # Brand ascending, price descending
# Count the number of entries under each Brand
brand_counts = d['Brand'].value_counts().reset_index()
brand_counts.columns = ['Brand', 'Count']
brand_counts
## Add an automatic index starting from 1:
brand_counts.index = brand_counts.index + 1
brand_counts.index.name = 'Index'
brand_counts

get_dummies()

  • Converts a categorical column into indicator columns; for the prefix argument you can pass a list with length equal to the number of columns.

  • Returns the DataFrame with one-hot encoded values.

Example: to model CAR PRICE against "MODEL" and MILEAGE, the categorical MODEL column must be encoded first.

Label encoding maps each category to an integer:

Type  Encode
A     0
B     1
C     2
D     3

One-hot encoding turns a column such as type = [A, A, B, C] into indicator columns:

type  type.A  type.B  type.C
A     1       0       0
A     1       0       0
B     0       1       0
C     0       0       1
import pandas as pd
s = pd.Series(['Orange', 'Pink', 'Blue'])
s1 = s.str.get_dummies()
s1
pd.get_dummies(d.Brand)   ## compare with a label mapping such as d['Brand'].map({"A": 0, "B": 1})

Factorize or Encode into numeric

d['Brand'] = pd.factorize(d['Brand'])[0]
d
# d = d.apply(lambda x: pd.factorize(x)[0])   ## for all columns
obj_cols = d.select_dtypes(include='object')   # the object-typed columns (added so the snippet is self-contained)
conv_cols = obj_cols.apply(pd.to_numeric, errors='coerce')   ## errors='coerce' converts bad non-numeric values to NaN
conv_cols

Indexing & Selecting Data

  • loc: label-based indexing. It takes two single labels/lists/ranges separated by ','; the first selects rows and the second selects columns.

  • iloc: integer position-based indexing.

loc()

import pandas as pd
details = {'Brand': pd.Series(['Nokia', 'Asus', "Nokia", "Nokia", 'Samsung', "ABC", 'Micromax', 'Apple', 'MI', 'Zen', "Apple"]),
           'Avg.Price': pd.Series([10000, 8000, 22500, 27000, 40000, 12000, 12999, 13999, 59999, 7899, 8999]),
           'Rating(10)': pd.Series([7, 6.5, 8.5, 9, 8, 9.5, 7, 9, 10, 7, 6.9])}
d = pd.DataFrame(details)
d
d["Rating(10)"]
0      7.0
1      6.5
2      8.5
3      9.0
4      8.0
5      9.5
6      7.0
7      9.0
8     10.0
9      7.0
10     6.9
Name: Rating(10), dtype: float64
d.loc[:,"Rating(10)"]
0      7.0
1      6.5
2      8.5
3      9.0
4      8.0
5      9.5
6      7.0
7      9.0
8     10.0
9      7.0
10     6.9
Name: Rating(10), dtype: float64
d.iloc[:,2] ##d.loc[:,"Brand"] ## d.loc["Nokia"]
0      7.0
1      6.5
2      8.5
3      9.0
4      8.0
5      9.5
6      7.0
7      9.0
8     10.0
9      7.0
10     6.9
Name: Rating(10), dtype: float64

About iloc() :

# Select rows 2 to 5 (all columns)
d.iloc[2:6]
d.iloc[2:5]
# Integer slicing
d.iloc[2:5, 1:3]
# Slicing with lists of positions
d.iloc[[7], [0]]
### Selecting a single column by position
d.iloc[:, 1]

Correlation

Correlation shows the linear relationship between any two arrays of values (Series). There are multiple methods to compute it, such as pearson (the default), spearman and kendall; a short sketch of the three methods is given below.


  • Price increases with the years: positive correlation (close to +1)

  • For company ABC, sales decrease over time: negative correlation (close to -1)

  • Over the last 5 years the tourism industry has grown at a constant 0.6 per cent: the growth rate shows no correlation with time (close to 0)

  • Example: 30 marks after 12 hrs of study versus 12 marks after 5 hrs: marks and study hours move together (positive correlation)

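A minimal sketch of the three correlation methods mentioned above, applied to the numeric columns of the d DataFrame defined earlier:

print(d.corr(numeric_only=True))                      # Pearson (default)
print(d.corr(method='spearman', numeric_only=True))   # Spearman rank correlation
print(d.corr(method='kendall', numeric_only=True))    # Kendall's tau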

d.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Brand       11 non-null     object
 1   Avg.Price   11 non-null     int64
 2   Rating(10)  11 non-null     float64
dtypes: float64(1), int64(1), object(1)
memory usage: 396.0+ bytes
d
import numpy as np               # implements multi-dimensional arrays and matrices
import pandas as pd              # for data manipulation and analysis
import matplotlib.pyplot as plt  # plotting library for Python and its numerical extension NumPy
# seaborn (imported later) provides a high-level interface for drawing attractive and informative statistical graphics
%matplotlib inline
from subprocess import check_output
# d.corr()
d.plot(x="Rating(10)", y="Avg.Price", style="*")
plt.show()
Image in a Jupyter notebook
import seaborn as sns
plt.figure(figsize=(8, 6))
sns.heatmap(d.corr(numeric_only=True), annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()
Image in a Jupyter notebook

Creating joins in Pandas

  • Full Outer Join

combines the results of both the left and the right outer joins. The joined DataFrame will contain all records from both the DataFrames and fill in NaNs for missing matches on either side. You can perform a full outer join by specifying the how argument as outer in the merge() function:

  • Inner Join

combines only the rows common to both DataFrames

  • There should be relevance (a common field and an analysis requirement)

import pandas as pd
d = {'id': ['1', '2', '3', '4', '5', '6'],
     'Color': ['RED', 'GREEN', 'YELLOW', 'BLUE', 'PINK', 'BLACK'],
     'Fruit': ['APPLE', 'BANANA', 'MANGO', 'BERRY', 'MELON', 'GRAPES']}
d1 = pd.DataFrame(d)
d1
z = {'id': ['1', '2', '3', '4', '5', '7'],
     'rating': ['A', 'B', 'C', 'A', 'B', 'A'],
     'COST': [200, 230, 400, 400, 100, 450],
     'Fruit': ['APPLE', 'BANANA', 'MANGO', 'BERRY', 'MELON', 'KIWI'],
     'BUY': ['Y', 'N', 'Y', 'N', 'Y', 'N']}
d2 = pd.DataFrame(z)
d2
join = pd.merge(d1, d2, on=["Fruit", "id"], how='outer')   ## similarly we can use inner/right/left joins
join
# Selecting a few fields from the frames
join = pd.merge(d1[["Color", "Fruit"]], d2, on="Fruit", how='left')   ## similarly we can use inner/outer/right joins
join
# Joining on a single common field
join = pd.merge(d1, d2, on=["id"], how='outer')   ## similarly we can use inner/right/left joins
join

How does a simple merge differ from a join?

A plain merge (with no how argument) works as an inner join; an index-based join() sketch follows the example below.

df_merge = pd.merge(d1,d2, on='Fruit') df_merge
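
By contrast, DataFrame.join() aligns on the index rather than on column values. A minimal sketch using the same d1 and d2 frames, with 'id' moved into the index:

joined = d1.set_index('id').join(d2.set_index('id'), lsuffix='_d1', rsuffix='_d2')
joined   # left join on the index by default; the suffixes resolve the overlapping 'Fruit' column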

To join two pandas DataFrames where the common column has different names but contains the same values, you can use the merge() function and specify the left_on and right_on parameters.

# Sample DataFrames
df1 = pd.DataFrame({'employee_id': [1, 2, 3, 4],
                    'name': ['Ashi', 'Surya', 'Shiv', 'Ram']})
df1
df2 = pd.DataFrame({'id': [1, 2, 3, 4],
                    'department': ['HR', 'Sales', 'CD', 'Tech']})
df2
# Merge on columns with different names but matching values
merged_df = pd.merge(df1, df2, left_on='employee_id', right_on='id', how='inner')
merged_df

Concatenation

d1
a = {"id": [12, 14, 15],
     'Color': ['brown', 'yellow', "green"],
     'Fruit': ['Pear', 'muskmellon', "Apricot"]}
d3 = pd.DataFrame(a)
d3
df_col = pd.concat([d1, d3], axis=0)
df_col.reset_index(drop=True)
# df_col

Frequency table : Crosstab

d2
my_tab = pd.crosstab(index=d2["rating"],   # make a crosstab
                     columns="c1")         # name the count column
my_tab
c = pd.value_counts(d2.rating).to_frame().reset_index() c

Split Data into Groups

import pandas as pd
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'Kings', 'Kings', 'Kings',
                     'Riders', 'Royals', "MI", 'Royals', 'Riders', "MI", "MI", "GL", "MI"],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 3, 4, 1, 2, 1, 1, 1, 5],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2016, 2014, 2015, 2017, 2018, 2019, 2022, 2022],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 607, 804, 690, 890, 900, 950, 600]}
ipl = pd.DataFrame(ipl_data)
ipl
c=ipl.groupby('Team').groups c
a=ipl.groupby(['Team','Year']).groups a
team_data = ipl.groupby(["Team"]) team_data.get_group(("MI"))
team_data = ipl.groupby(["Team", 'Year'])   # similarly try Year, Rank & Points
team_data.get_group(("Devils", 2014))
## team_data.first()
c=ipl.groupby(["Team",'Points']) team_data.get_group(("Devils",2014))
ipl[ipl["Team"]=="Devils"]
my_tab = pd.crosstab(index=ipl["Team"],   # make a crosstab
                     columns="count")     # name the count column
my_tab
## Aggregation on groups
import numpy as np
grouped = ipl.groupby('Year')
grouped['Points'].agg(np.mean)
# Attribute access in pandas
import numpy as np
grouped = ipl.groupby('Team')
print(grouped.agg(np.size))
grouped = ipl.groupby('Team') print (grouped['Points'].agg([np.sum, np.mean, np.std]))

Filtration

  • Filtration filters the data on a defined criterion and returns the subset of the data. The filter() function is used to filter the data; a group-wise sketch follows.
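
For example, group-wise filtration keeps only the groups that satisfy a condition. A minimal sketch using the ipl DataFrame defined above:

ipl.groupby('Team').filter(lambda x: len(x) >= 3)   # keep only teams that appear at least three times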

Lambda function

  • The lambda keyword is used to create anonymous functions

  • This function can have any number of arguments but only one expression, which is evaluated and returned.

  • One is free to use lambda functions wherever function objects are required.

  • Keep in mind that lambda functions are syntactically restricted to a single expression.

a=lambda x,y:x**3+6*y+9 a(3,2)
z=lambda x,y:(x*x-2*y+3) print(z(2,3))
x=lambda a,b,c : (a*b/c) print(x(5,3,2))
%%time
a = [1, 3, 4, 6, 23, 44, 56, 78, 90, 54, 60]
list_e = list(filter(lambda x: (x % 2 == 0), a))
print(list_e)
# Python code to illustrate filter() with lambda
final_list = list(filter(lambda x: (x > 800), ipl["Points"]))
print(final_list)
ipl[ipl["Rank"].apply(lambda s: s == 1)]   ## by directly accessing the column
import pandas as pd
top = lambda x: x == 1
ipl[ipl["Rank"].apply(top)]   ## by column name
cap = lambda x: x.upper() ipl['Team'].apply(cap)
%%time
from time import sleep
print("hello")
sleep(5)   ## delay function
3 + 4

Input Output

Reading and writing files

Pandas provides a DataFrame.to_*() writer (and a matching read_*() reader) for each of the following formats:

  • df.to_csv(...) # csv file

  • df.to_hdf(...) # HDF5 file

  • df.to_pickle(...) # serialized object

  • df.to_sql(...) # to SQL database

  • df.to_excel(...) # to Excel sheet

  • df.to_json(...) # to JSON string

  • df.to_html(...) # render as HTML table

  • df.to_feather(...) # binary feather-format

  • df.to_latex(...) # tabular environment table

  • df.to_stata(...) # Stata binary data files

  • df.to_msgpack(...) # msgpack (serialize) object

  • df.to_gbq(...) # to a Google BigQuery table.

  • df.to_string(...) # console-friendly tabular output.

  • df.to_clipboard(...) # clipboard that can be pasted into Excel
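
Each to_*() writer has a matching read_*() reader. A minimal round trip with a hypothetical file name:

import pandas as pd

df = pd.DataFrame({"Brand": ["Nokia", "Asus"], "Price": [10000, 8000]})
df.to_csv("phones.csv", index=False)   # write the frame out as CSV
df_back = pd.read_csv("phones.csv")    # read it back with the matching reader
df_back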

import os
print(os.getcwd())                 # check the current working directory
# pd.read_<type>("path/to/file")   # generic pattern for the read_* functions
s = pd.read_csv("cloth_store_dataset.csv")
s.head(2)
s.tail(3)
s.shape
(5000, 12)
s.info() s.describe(include="object")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   ProductID        5000 non-null   object
 1   ProductName      5000 non-null   object
 2   Category         5000 non-null   object
 3   Brand            5000 non-null   object
 4   Size             5000 non-null   object
 5   Color            5000 non-null   object
 6   Price            5000 non-null   float64
 7   Discount         5000 non-null   float64
 8   QuantityInStock  5000 non-null   int64
 9   Material         5000 non-null   object
 10  Season           5000 non-null   object
 11  Manufacturer     5000 non-null   object
dtypes: float64(2), int64(1), object(9)
memory usage: 468.9+ KB
d=pd.read_excel("[email protected]") d
viz_data = pd.ExcelFile(r"C:\Documents\data Sets\Exam_data.xlsx").parse("StudentsPerformance")
viz_data
viz_data.head(2)
viz_data2 = pd.ExcelFile(r"C:\Documents\data Sets\Exam_data.xlsx").parse("Result-2")
viz_data2
viz_data2.head(2)

Quick Practice: join viz_data and viz_data2 and name the combined data final_exam_data

  • Check the shape of the data

  • Check the info

  • Check the statistical description

  • Check and deal with the null values

  • Sort the data according to total marks

  • Create a heatmap for the physics and chemistry marks

  • Create columns for Total Marks and Percentage

c=pd.read_csv(r"https://raw.githubusercontent.com/suyashi29/python-su/refs/heads/master/Data%20Analytics%20Using%20Python/cloth_store_dataset.csv")
n.drop(columns=["director"], inplace=True)   # to drop columns
n.drop_duplicates() n.shape

This will fetch data from the bottom.

## Movie Trend 2006-2016
# 1. Describe the data
# 2. Drop columns
# 3. Check null values
# 4. Replace null values
import pymysql
query = open('my_data.sql', 'r')
# con must be a live database connection, not a file path; placeholder credentials shown here
con = pymysql.connect(host='localhost', user='user', password='pwd', database='my_db')
DF = pd.read_sql_query(query.read(), con)
DF.describe()
  • SQL DATA FROM SERVER

import pandas as pd
from sqlalchemy import create_engine

def getData():
    # Parameters
    ServerName = "my_server"
    Database = "my_db"
    UserPwd = "user:pwd"
    Driver = "driver=SQL Server Native Client 11.0"

    # Create the connection
    engine = create_engine('mssql+pyodbc://' + UserPwd + '@' + ServerName + '/' + Database + "?" + Driver)

    sql = "select * from mytable"
    df = pd.read_sql(sql, engine)
    return df

df2 = getData()
print(df2)

Creating a CSV file

empid = [100, 200, 300, 400]
emprole = ["lead", "Trainer", "Consultant", "Sales"]
details = list(zip(empid, emprole))
details
import pandas as pd
df = pd.DataFrame(data=details, index=["ONE", "TWO", "THREE", "FOUR"])
df
df.to_csv("e.csv",header=False,index=False)

Pandas SQL Operations

We can use the pandas read_sql_query function to read the results of a SQL query directly into a pandas DataFrame. The below code will execute the same query that we just did, but it will return a DataFrame. It has several advantages over the query we did above:

  • It doesn’t require us to create a Cursor object or call fetchall at the end.

  • It automatically reads in the names of the headers from the table.

  • It creates a DataFrame, so we can quickly explore the data.

## sqlite3 ships with the Python standard library, so no separate install is needed
# Program to connect to an existing database.
# If the database does not exist it will be created, and a connection object is returned.
import sqlite3
conn = sqlite3.connect('D1.db')
print("My first Connection")
# Program to create a table in the previously created database
conn = sqlite3.connect('D1.db')
print("Opened database successfully")
conn.execute('''CREATE TABLE COMPANY3
               (ID INT PRIMARY KEY NOT NULL,
                NAME TEXT NOT NULL,
                AGE INT NOT NULL,
                ADDRESS CHAR(50),
                SALARY REAL);''')
print("Table created......")
conn.close()
## Insert records into the COMPANY3 table created in the above example
conn = sqlite3.connect('D1.db')
print("Opened database successfully")
conn.execute("INSERT INTO COMPANY3 (ID,NAME,AGE,ADDRESS,SALARY) VALUES (7, 'Poy', 30, 'UP', 30000.00)")
conn.execute("INSERT INTO COMPANY3 (ID,NAME,AGE,ADDRESS,SALARY) VALUES (8, 'Ram', 33, 'FARIDABAD', 18000.00)")
conn.execute("INSERT INTO COMPANY3 (ID,NAME,AGE,ADDRESS,SALARY) VALUES (9, 'edd', 42, 'NEW DELHI', 22000.00)")
conn.commit()
print("Records updated successfully")
conn.close()
## Insert more records into the COMPANY3 table
conn = sqlite3.connect('D1.db')
print("Opened database successfully")
conn.execute("INSERT INTO COMPANY3 (ID,NAME,AGE,ADDRESS,SALARY) VALUES (1, 'Joy', 43, 'UP', 60000.00)")
conn.execute("INSERT INTO COMPANY3 (ID,NAME,AGE,ADDRESS,SALARY) VALUES (2, 'Tam', 23, 'UP', 18000.00)")
conn.execute("INSERT INTO COMPANY3 (ID,NAME,AGE,ADDRESS,SALARY) VALUES (3, 'Sidd', 22, 'NEW DELHI', 22000.00)")
conn.commit()
print("Records updated successfully")
conn.close()
import pandas as pd
import sqlite3
conn = sqlite3.connect("D1.db")
Q = pd.read_sql_query('''SELECT id, name, address, salary FROM COMPANY3''', conn)
d = pd.DataFrame(Q)   # , columns=['ID','NAME','ADDRESS','SALARY']
Q