Pandas
Pandas is the most popular Python data analysis and data structure library.
Python with Pandas is used in a wide range of academic and commercial domains, including finance, economics, statistics, and analytics.
Key Features of Pandas
Fast and efficient DataFrame object with default and customized indexing.
Tools for loading data into in-memory data objects from different file formats.
Data alignment and integrated handling of missing data.
Reshaping and pivoting of data sets.
Label-based slicing, indexing and subsetting of large data sets.
Columns from a data structure can be deleted or inserted.
Group by data for aggregation and transformations.
High performance merging and joining of data.
Time Series functionality.
Data Analysis
Raw data - Information - Prepare - Feature selection - Model data
Import data (data acquisition) - Data preparation (cleaning data, feature engineering) - EDA (exploratory data analysis) - Model data
Installation
The standard Python distribution doesn't come bundled with the Pandas module. A lightweight way to install it is with pip, the popular Python package installer.
pip install pandas
If you install the Anaconda Python distribution, Pandas is installed by default. Anaconda (from https://www.continuum.io) is a free Python distribution for the SciPy stack.
Pandas deals with the following three data structures −
Series: Series is a one-dimensional array like structure with homogeneous data.
DataFrame: DataFrame is a two-dimensional, table-like structure with heterogeneous data.
Panel: Panel is a three-dimensional data structure with heterogeneous data. A panel is hard to represent graphically, but it can be illustrated as a container of DataFrames. These data structures are built on top of NumPy arrays.
Data acquisition - Data preparation (data cleaning, data manipulation, feature engineering); the goal: Raw data - Information - Insights - Actions
pip install pandas
import pandas
Let's work with Series
Series (1D): rows only
DataFrame (2D): rows and columns
Panel (3D): a container of DataFrames
Creating series from dictionary
Note: the dictionary's values are used as the Series elements and its keys as the index. A dictionary is a mapping type, so we cannot manipulate the index positionally as we do with lists and tuples.
Note: missing values are filled with NaN, and the index is taken from the keys.
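A minimal sketch of both notes, using hypothetical subject marks:

```python
import pandas as pd

# Keys become the index, values become the elements
marks = pd.Series({"chem": 67, "phy": 54, "maths": 89})
print(marks)

# Passing an explicit index reorders the data; a label missing
# from the dictionary is filled with NaN
marks2 = pd.Series({"chem": 67, "phy": 54}, index=["chem", "phy", "maths"])
print(marks2)  # "maths" is NaN because the dict has no such key
```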
Let's work with DataFrames
Create a DataFrame containing the details of six students in chem, phy and maths, with roll/name as the index
Note: if no index or columns are passed, then by default the index will be range(n), where n is the array length.
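A sketch of the six-student DataFrame described above, with hypothetical names and marks:

```python
import pandas as pd

# Hypothetical marks for six students; names serve as the index
data = {
    "chem":  [55, 62, 78, 45, 90, 66],
    "phy":   [60, 58, 82, 49, 88, 70],
    "maths": [72, 65, 91, 52, 95, 74],
}
students = pd.DataFrame(data, index=["Asha", "Ravi", "Meena", "John", "Sara", "Amit"])
print(students)

# Without an explicit index, pandas falls back to range(n)
default_idx = pd.DataFrame(data)
print(default_idx.index)  # RangeIndex(start=0, stop=6, step=1)
```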
Creating Data frame from Series
Data Preparation
Removing or replacing null values
Data description
Adding new data fields for analysis
Feature selection: used in ML for predictions and decisions
Dealing with missing data
Check for Missing Values
To make detecting missing values easier (and across different array dtypes), Pandas provides the isnull() and notnull() functions, which are also methods on Series and DataFrame objects.
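A small sketch of isnull() and notnull() on a hypothetical Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
print(s.isnull())        # False, True, False
print(s.notnull())       # True, False, True
print(s.isnull().sum())  # summing booleans counts the missing values
```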
Calculations with Missing Data
When summing data, NA is treated as zero. If the data are all NA, then the result will be NA.
Cleaning / Filling Missing Data
Pandas provides various methods for cleaning the missing values. The fillna function can “fill in” NA values with non-null data in a couple of ways, which we have illustrated in the following sections.
Replace NaN with a Scalar Value
The following program shows how you can replace NaN with 0.
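A minimal sketch of fillna() with a scalar, on hypothetical data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, 6.0]})
filled = df.fillna(0)  # every NaN becomes 0
print(filled)
```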
Dropping missing values using "dropna()"
Example rule of thumb for 10,000 records: column A has 999 missing values, so drop it; column B has only 79 missing values, so replace them; column C has 0 missing values and needs no action.
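A minimal sketch of dropna(), which removes rows (or, with axis=1, columns) containing NaN:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan], "b": [3.0, 4.0]})
print(df.dropna())         # drops any row containing a NaN
print(df.dropna(axis=1))   # drops any column containing a NaN
```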
Replacing missing values by generic values by replace function
Stats: Data Description
The describe() function computes a summary of statistics pertaining to the DataFrame columns.
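A quick sketch of describe() on hypothetical marks data:

```python
import pandas as pd

df = pd.DataFrame({"marks": [40, 50, 60, 70, 80]})
summary = df.describe()  # count, mean, std, min, quartiles, max
print(summary)
```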
Let's practice
Check the given data for null values
Replace the null values with the mean/median/mode
Describe the data
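One way the practice task can be sketched, using hypothetical data and mean imputation:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"marks": [40.0, np.nan, 60.0, np.nan, 80.0]})
print(df.isnull().sum())  # check for null values first

# Replace NaN with the column mean (mean of 40, 60, 80 is 60)
df["marks"] = df["marks"].fillna(df["marks"].mean())
print(df["marks"].tolist())
```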
Renaming
The rename() method allows you to relabel an axis based on some mapping (a dict or Series) or an arbitrary function.
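A minimal sketch of rename() with a dict mapping and with an arbitrary function:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Relabel via dict mappings for columns and index
renamed = df.rename(columns={"a": "alpha"}, index={0: "first"})
print(renamed)

# An arbitrary function also works as the mapper
upper = df.rename(columns=str.upper)
print(upper.columns.tolist())
```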
Sorting
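A short sketch of the two common sorts, by column values and by index labels:

```python
import pandas as pd

df = pd.DataFrame({"name": ["b", "a", "c"], "marks": [50, 70, 60]})
print(df.sort_values("marks", ascending=False))  # sort rows by a column
print(df.sort_index())                           # sort rows by the index labels
```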
get_dummies()
For the prefix argument, pass a string, or a list with length equal to the number of columns being encoded.
Returns a DataFrame with one-hot encoded values.
Example: car price vs. "model" and mileage (the categorical model column must be encoded before modelling).
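A sketch of the car example with hypothetical data, one-hot encoding the categorical "model" column:

```python
import pandas as pd

# Hypothetical car data: "model" is categorical, so one-hot encode it
cars = pd.DataFrame({
    "model":   ["A", "B", "A", "C"],
    "mileage": [12.5, 9.8, 13.1, 11.0],
})
encoded = pd.get_dummies(cars, columns=["model"], prefix="model")
print(encoded.columns.tolist())  # ['mileage', 'model_A', 'model_B', 'model_C']
```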
Indexing & Selecting Data
loc[] (label-based indexing): takes two single labels/lists/ranges separated by ','. The first selects rows and the second selects columns.
iloc[]: integer position-based indexing.
loc[]
About iloc[]:
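A side-by-side sketch of loc[] and iloc[] on hypothetical data; note that label slices include both endpoints while integer slices exclude the stop position:

```python
import pandas as pd

df = pd.DataFrame({"x": [10, 20, 30], "y": [1, 2, 3]}, index=["a", "b", "c"])

print(df.loc["a", "x"])        # label-based: row "a", column "x"
print(df.loc["a":"b", ["x"]])  # label slices are inclusive of both ends
print(df.iloc[0, 0])           # position-based: first row, first column
print(df.iloc[0:2, 0:1])       # integer slices exclude the stop position
```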
Correlation
Correlation shows the linear relationship between any two arrays of values (Series). There are multiple methods to compute correlation, such as pearson (the default), spearman and kendall.
Price increases with years (positive correlation)
For company ABC, sales decrease over time (negative correlation)
In the last 5 years the tourism industry has shown constant growth of 0.6 per cent
Marks vs. study hours: 30 marks for 12 hrs, 12 marks for 5 hrs (positive correlation)
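The marks-vs-hours idea above can be sketched with hypothetical numbers and checked with corr():

```python
import pandas as pd

# Hypothetical study hours and marks: more hours, more marks
df = pd.DataFrame({"hours": [5, 7, 9, 12], "marks": [12, 18, 22, 30]})

print(df["hours"].corr(df["marks"]))   # Pearson by default; close to +1 here
print(df.corr(method="spearman"))      # rank-based alternative
```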
Creating joins in Pandas
Full Outer Join
Combines the results of both the left and the right outer joins. The joined DataFrame will contain all records from both DataFrames, with NaNs filled in for missing matches on either side. You can perform a full outer join by specifying how='outer' in the merge() function:
Inner Join
Combines only the rows common to both DataFrames.
There should be a relevant common field between the DataFrames, chosen according to the analysis requirement.
How does a plain merge differ from a join?
A plain merge() behaves as an inner join by default.
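A sketch of both joins on hypothetical DataFrames sharing an "id" column:

```python
import pandas as pd

left  = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
right = pd.DataFrame({"id": [2, 3, 4], "score": [90, 80, 70]})

inner = pd.merge(left, right, on="id")               # default how="inner": common ids only
outer = pd.merge(left, right, on="id", how="outer")  # all ids, NaN where no match
print(inner)
print(outer)
```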
Concatenation
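A minimal sketch of concat(), which stacks DataFrames on top of each other:

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"x": [3, 4]})

# ignore_index=True rebuilds a fresh 0..n-1 index for the result
stacked = pd.concat([a, b], ignore_index=True)
print(stacked["x"].tolist())  # [1, 2, 3, 4]
```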
Frequency table : Crosstab
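A small sketch of crosstab() building a frequency table from two hypothetical columns:

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["M", "F", "M", "F", "M"],
    "passed": [True, True, False, True, True],
})
# Rows: gender, columns: passed, cells: counts
freq = pd.crosstab(df["gender"], df["passed"])
print(freq)
```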
Split Data into Groups
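A minimal group-by-and-aggregate sketch on hypothetical salary data:

```python
import pandas as pd

df = pd.DataFrame({"dept": ["HR", "IT", "HR", "IT"], "salary": [30, 50, 40, 60]})

# Split rows into groups by "dept", then aggregate each group's salary
by_dept = df.groupby("dept")["salary"].mean()
print(by_dept)  # HR -> 35.0, IT -> 55.0
```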
Filtration
Filtration filters the data on a defined criteria and returns the subset of data. The filter() function is used to filter the data.
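A sketch of groupby filter(), which keeps only the groups that satisfy a criterion:

```python
import pandas as pd

df = pd.DataFrame({"team": ["A", "A", "B"], "pts": [10, 12, 7]})

# Keep only the groups that have more than one row
big = df.groupby("team").filter(lambda g: len(g) > 1)
print(big["team"].tolist())  # only team A survives
```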
Lambda function
The lambda keyword is used to create anonymous functions
This function can have any number of arguments but only one expression, which is evaluated and returned.
One is free to use lambda functions wherever function objects are required.
You need to keep in your knowledge that lambda functions are syntactically restricted to a single expression.
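A minimal sketch: a lambda on its own, and a lambda passed to a pandas method that expects a function object:

```python
import pandas as pd

# A lambda is a single-expression anonymous function
square = lambda x: x * x
print(square(4))  # 16

# Lambdas are handy wherever a function object is expected, e.g. Series.apply
s = pd.Series([1, 2, 3])
print(s.apply(lambda x: x + 10).tolist())  # [11, 12, 13]
```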
datetime.now() gives you the current date and time.
Input Output
Reading files
Pandas can work with the following file formats; each to_* writer method below has a matching read_* reader (e.g. read_csv for to_csv):
df.to_csv(...) # csv file
df.to_hdf(...) # HDF5 file
df.to_pickle(...) # serialized object
df.to_sql(...) # to SQL database
df.to_excel(...) # to Excel sheet
df.to_json(...) # to JSON string
df.to_html(...) # render as HTML table
df.to_feather(...) # binary feather-format
df.to_latex(...) # tabular environment table
df.to_stata(...) # Stata binary data files
df.to_msgpack(...) # msgpack (serialize) object
df.to_gbq(...) # to a Google BigQuery table.
df.to_string(...) # console-friendly tabular output.
df.to_clipboard(...) # clipboard that can be pasted into Excel
This will fetch data from the bottom of the DataFrame (e.g. df.tail()).
SQL DATA FROM SERVER
Creating a CSV file
Pandas SQL Operations
We can use the pandas read_sql_query function to read the results of a SQL query directly into a pandas DataFrame. The below code will execute the same query that we just did, but it will return a DataFrame. It has several advantages over the query we did above:
It doesn’t require us to create a Cursor object or call fetchall at the end.
It automatically reads in the names of the headers from the table.
It creates a DataFrame, so we can quickly explore the data.
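A self-contained sketch of read_sql_query; an in-memory SQLite database with a hypothetical "students" table stands in for the server used above:

```python
import sqlite3
import pandas as pd

# In-memory SQLite database as a stand-in for a real server
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, marks INTEGER)")
conn.executemany("INSERT INTO students VALUES (?, ?)",
                 [("Asha", 90), ("Ravi", 75)])

# No Cursor, no fetchall: the query result lands directly in a DataFrame,
# with column names taken from the table
df = pd.read_sql_query("SELECT * FROM students WHERE marks > 80", conn)
print(df)
conn.close()
```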