
Pandas Functions

  • groupby(): Used to group data by one or more columns and apply aggregation functions to each group.

import pandas as pd

df = pd.DataFrame({'Category': ['A', 'B', 'A', 'B', 'A', 'C', 'C'],
                   'Value': [10, 20, 30, 40, 50, 60, 100]})
grouped = df.groupby('Category').count()
print(grouped)
          Value
Category
A             3
B             2
C             2
grouped = df.groupby('Category').agg({'Value': ['mean', 'max', 'sum', 'min']})
print(grouped)
         Value
          mean  max  sum  min
Category
A         30.0   50   90   10
B         30.0   40   60   20
C         80.0  100  160   60
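The dict-of-lists form above produces a MultiIndex on the columns. If flat column names are easier to work with, named aggregation is an alternative; a minimal sketch on the same df (the result column names mean_value, max_value, total_value are illustrative):

# Named aggregation gives flat, readable column names (pandas >= 0.25)
summary = df.groupby('Category').agg(
    mean_value=('Value', 'mean'),
    max_value=('Value', 'max'),
    total_value=('Value', 'sum'),
)
print(summary)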

Lambda functions:

  • Lambda functions are best suited for short, throwaway functions.

  • For complex operations, regular functions are preferred (see the short comparison after this list).

  • Lambda functions are anonymous functions defined using the 'lambda' keyword.

  • Syntax: lambda arguments: expression
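To illustrate the first two points, here is a small sketch of the same logic written both as a lambda and as a regular def function (the names to_celsius and to_celsius_checked are made up for this example); once the logic needs validation, documentation, or multiple steps, the def version is easier to read and debug:

# One-expression logic fits a lambda...
to_celsius = lambda f: (f - 32) * 5 / 9

# ...but a named function is clearer once you need docs or extra steps
def to_celsius_checked(f):
    """Convert Fahrenheit to Celsius, rejecting impossible values."""
    if f < -459.67:
        raise ValueError("below absolute zero")
    return (f - 32) * 5 / 9

print(to_celsius(98.6), to_celsius_checked(98.6))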

# Using map() with a lambda to square a list of numbers
nums = [1, 2, 3, 4, 5]
squared = list(map(lambda x: x**2, nums))
print("Squared List:", squared)
Squared List: [1, 4, 9, 16, 25]
# Example 1: Basic usage
square = lambda x: x**2
print("Square of 5:", square(5))

# Example 2: Adding three numbers
add = lambda a, b, c: a + b + c
print("Sum of 4, 7 and 3:", add(4, 7, 3))

# Example 3: Sorting a list of tuples by the second item
pairs = [(1, 'one'), (3, 'three'), (2, 'two'), (4, 'four')]
sorted_pairs = sorted(pairs, key=lambda x: x[1])
print("Sorted by value:", sorted_pairs)

# Example 4: Using with map() to square a list of numbers
nums = [1, 2, 3, 4, 5]
squared = list(map(lambda x: x**2, nums))
print("Squared List:", squared)

# Example 5: Using with filter() to keep even numbers
even_nums = list(filter(lambda x: x % 2 == 0, nums))
print("Even Numbers:", even_nums)

# Example 6: Using with reduce() to compute the product of elements
from functools import reduce
product = reduce(lambda x, y: x * y, nums)
print("Product of List:", product)

# Example 7: Conditional logic inside a lambda
check_even = lambda x: 'Even' if x % 2 == 0 else 'Odd'
print("Check 7:", check_even(7))

# Example 8: Lambda returned from a function (a closure)
def make_incrementor(n):
    return lambda x: x + n

increment_by_3 = make_incrementor(3)
print("Increment 10 by 3:", increment_by_3(10))
## apply(): Applies a function along an axis of the DataFrame.
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df['A'] = df['A'].apply(lambda x: x**2)
df
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df['Sum'] = df.apply(lambda row: row['A'] + row['B'], axis=1)
df
  • rolling(): Provides rolling window calculations.

df = pd.DataFrame({'Value': [1, 2, 3, 4, 5]})
rolling_mean = df['Value'].rolling(window=3).mean()
print(rolling_mean)
0    NaN
1    NaN
2    2.0
3    3.0
4    4.0
Name: Value, dtype: float64
df = pd.DataFrame({'Value': [1, 2, 3, 4, 5]})
rolling_sum = df['Value'].rolling(window=2).sum()
print(rolling_sum)
0    NaN
1    3.0
2    5.0
3    7.0
4    9.0
Name: Value, dtype: float64
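The leading NaN rows come from windows that are not yet full. If that is undesirable, rolling() accepts a min_periods argument; a minimal sketch on the same Value column:

df = pd.DataFrame({'Value': [1, 2, 3, 4, 5]})
# min_periods=1 lets partial windows produce a value instead of NaN
partial_mean = df['Value'].rolling(window=3, min_periods=1).mean()
print(partial_mean)  # 1.0, 1.5, 2.0, 3.0, 4.0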
## shift(): Shifts the values of a column by a given number of periods.
df = pd.DataFrame({'A': [1, 2, 3, 4]})
shifted_df = df['A'].shift(1)
shifted_df
0    NaN
1    1.0
2    2.0
3    3.0
Name: A, dtype: float64
df = pd.DataFrame({'A': [10, 20, 30, 40]})
shifted_df = df['A'].shift(-1)  # negative periods shift values upward
shifted_df
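A common use of shift() is computing period-over-period change. A minimal sketch (the Sales numbers below are made up for illustration):

sales = pd.DataFrame({'Sales': [100, 120, 90, 150]})
sales['Prev'] = sales['Sales'].shift(1)            # previous period's value
sales['Change'] = sales['Sales'] - sales['Prev']   # absolute change vs previous period
sales['Pct_Change'] = sales['Sales'].pct_change()  # same idea, built in
print(sales)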
## transform(): Applies a function to each group in the DataFrame and returns the transformed data.
df = pd.DataFrame({'Category': ['A', 'A', 'B', 'B'], 'Value': [1, 2, 3, 4]})
df['Transformed'] = df.groupby('Category')['Value'].transform(lambda x: x / x.sum())
df
df = pd.DataFrame({'A': [1, 2, 3, 4]})
df['Transformed'] = df['A'].transform(lambda x: 2 * x)
df
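The practical difference between transform() and agg() on a groupby is the shape of the result: agg() returns one row per group, while transform() returns a value for every original row, which is why it can be assigned straight back into the DataFrame. A small sketch reusing the Category/Value example:

df = pd.DataFrame({'Category': ['A', 'A', 'B', 'B'], 'Value': [1, 2, 3, 4]})
print(df.groupby('Category')['Value'].agg('sum'))        # 2 rows: one per group
print(df.groupby('Category')['Value'].transform('sum'))  # 4 rows: group sum broadcast to each row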
## nlargest(): Returns the first n largest values.
## nsmallest(): Returns the first n smallest values.
df = pd.DataFrame({'A': [10, 20, 30, 40]})
largest = df['A'].nlargest(1)
print(largest)
3    40
Name: A, dtype: int64
df = pd.DataFrame({'A': [10, 20, 30, 40]})
smallest = df['A'].nsmallest(2)
print(smallest)
0    10
1    20
Name: A, dtype: int64
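The examples above call nlargest()/nsmallest() on a single column (a Series). When called on a whole DataFrame they need a column to rank by and return complete rows; a minimal sketch:

df = pd.DataFrame({'A': [10, 20, 30, 40], 'B': ['w', 'x', 'y', 'z']})
print(df.nlargest(2, 'A'))    # the two rows with the largest A
print(df.nsmallest(1, 'A'))   # the row with the smallest A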
## query(): Filters data based on a query string.
df = pd.DataFrame({'A': [1, 2, 3, 4, -1, 8, 9], 'B': [5, 6, 7, 8, 7, 3, 3]})
filtered_df = df.query('A > 2')
filtered_df
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
filtered_df = df.query('A == 2 and B == 5')
filtered_df
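query() can also reference Python variables from the surrounding scope using the @ prefix, which keeps hard-coded values out of the query string; a minimal sketch:

df = pd.DataFrame({'A': [1, 2, 3, 4, -1, 8, 9], 'B': [5, 6, 7, 8, 7, 3, 3]})
threshold = 2
filtered_df = df.query('A > @threshold and B < 8')  # @threshold refers to the local variable
filtered_df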
## explode(): Explodes a list-like column into separate rows.
df = pd.DataFrame({'A': [[1, 2, 4], [3, 4], [5, 6]]})
df
df = pd.DataFrame({'A': [[1], [3, 4], [5, 6, 7]]})
exploded_df = df.explode('A')
exploded_df
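By default explode() repeats the original index label for each emitted row; passing ignore_index=True renumbers the result instead. A small sketch on the same data:

df = pd.DataFrame({'A': [[1], [3, 4], [5, 6, 7]]})
exploded_df = df.explode('A', ignore_index=True)  # fresh 0..n-1 index instead of repeated labels
exploded_df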

Case Study A: Sales Performance and Data Transformation in E-Commerce

Key Plots:

  • Sales and Revenue by Category: Bar plot of average sales and average revenue for each category.

  • Average Revenue by Region and Product: Bar plot for average revenue based on region and product.

  • Original vs Doubled Revenue: A comparison of the original and doubled revenue.

  • Transformed Sales, Revenue, and Discount: A line plot showing the numeric transformations.

  • Sales and Rolling Mean Sales: A line plot showing sales alongside a rolling mean.

  • Sales and Shifted Sales: Line plot comparing original and shifted sales.

  • Sales by Category After Merge: Bar plot showing the merged sales data grouped by category.

  • Exploded Product Distribution: Bar plot showing the distribution of exploded product lists.

  • Original vs Updated Age: Bar plot comparing the original and updated age columns.

  • Product Count by Region: A stacked bar plot showing counts of each product per region.

  • Daily Sales and Revenue: Line plot showing daily aggregated sales and revenue.

  • Discount after FillNA: Line plot showing how missing discount values were filled.

  • Sales by Category with Reset Index: Bar plot showing sales by category after resetting the index.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("sales_data.csv")
# Quick look at the data; in a notebook only df.info()'s printout and the
# last expression's output (the missing-value counts) are shown below.
df.head(3)
df.shape
df.info()
df.isnull().sum()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Category  3000 non-null   object
 1   Sales     3000 non-null   int64
 2   Revenue   3000 non-null   float64
 3   Discount  3000 non-null   float64
 4   Age       3000 non-null   int64
 5   Region    3000 non-null   object
 6   Product   3000 non-null   object
 7   Date      3000 non-null   object
dtypes: float64(2), int64(2), object(4)
memory usage: 187.6+ KB
Category    0
Sales       0
Revenue     0
Discount    0
Age         0
Region      0
Product     0
Date        0
dtype: int64
df
## Mean of Sales and mean of Revenue, category-wise
grouped = df.groupby('Category').agg({'Sales': 'mean', 'Revenue': 'mean'})
grouped
# Plotting the grouped data
grouped.plot(kind='bar', figsize=(15, 8), title='Sales and Revenue by Category')
plt.ylabel('Value')
plt.show()
[Bar plot: Sales and Revenue by Category]
# 2. pivot_table() - Pivot by Region and Product
pivot_table = df.pivot_table(values='Revenue', index='Region', columns='Product', aggfunc='mean')
pivot_table
# Plotting the pivot table
pivot_table.plot(kind='bar', figsize=(11, 6), title='Average Revenue by Region and Product')
plt.ylabel('Average Revenue')
plt.show()
[Bar plot: Average Revenue by Region and Product]
# 3. apply() - Apply a function to double Revenue
df['Revenue_Doubled'] = df['Revenue'].apply(lambda x: x * 2)
print(df[['Revenue', 'Revenue_Doubled']].head())
       Revenue  Revenue_Doubled
0  1998.012319      3996.024637
1  3189.885959      6379.771919
2  2097.538421      4195.076842
3  4606.332021      9212.664042
4  1860.817919      3721.635838
# Plotting the doubled revenue
df[['Revenue', 'Revenue_Doubled']].head(10).plot(kind='bar', figsize=(10, 6), title='Original and Doubled Revenue')
plt.ylabel('Revenue')
plt.show()
[Bar plot: Original and Doubled Revenue]
# 4. applymap() - Apply a function to every element of the selected numeric columns
# (note: pandas 2.1+ prefers DataFrame.map for this element-wise operation)
df_numeric = df[['Sales', 'Revenue', 'Discount']].applymap(lambda x: x * 2)
print(df_numeric.head())
   Sales      Revenue  Discount
0   7316  3996.024637  0.812216
1   9000  6379.771919  0.485009
2   2748  4195.076842  0.419267
3   6240  9212.664042  0.840931
4   6042  3721.635838  0.501743
# Plotting the transformed numeric data
df_numeric.head(10).plot(kind='line', figsize=(10, 6), title='Transformed Sales, Revenue, and Discount')
plt.ylabel('Transformed Value')
plt.show()
[Line plot: Transformed Sales, Revenue, and Discount]
# 5. rolling() - Calculate the rolling mean of Sales
df['Rolling_Mean_Sales'] = df['Sales'].rolling(window=10).mean()
print(df[['Sales', 'Rolling_Mean_Sales']].head())

# Plotting the rolling mean of sales
df[['Sales', 'Rolling_Mean_Sales']].head(100).plot(figsize=(10, 6), title='Sales and Rolling Mean Sales')
plt.ylabel('Sales')
plt.show()

# 6. shift() - Shift the Sales column
df['Sales_Shifted'] = df['Sales'].shift(1)
print(df[['Sales', 'Sales_Shifted']].head())

# Plotting the sales and shifted sales
df[['Sales', 'Sales_Shifted']].head(100).plot(figsize=(10, 6), title='Sales and Shifted Sales')
plt.ylabel('Sales')
plt.show()

# 7. merge() - Merge with a small lookup dataframe
small_df = pd.DataFrame({
    'Category': ['A', 'B', 'C'],
    'Region': ['East', 'North', 'South'],
    'Product': ['P1', 'P2', 'P3']
})
merged_df = pd.merge(df, small_df, on='Category', how='left')
print(merged_df.head())

# Plotting the merged data
merged_df[['Category', 'Sales']].groupby('Category').sum().plot(kind='bar', figsize=(10, 6), title='Merged Data Sales by Category')
plt.ylabel('Sales')
plt.show()

# 8. explode() - Explode the list of products into separate rows
df['Product_List'] = df['Product'].apply(lambda x: [x, f"{x}2", f"{x}3"])
exploded_df = df.explode('Product_List')
print(exploded_df[['Product', 'Product_List']].head())

# Plotting exploded product distribution
exploded_df['Product_List'].value_counts().plot(kind='bar', figsize=(10, 6), title='Product List Exploded Distribution')
plt.ylabel('Count')
plt.show()

# 9. pipe() - Apply a transformation function to Age
def add_years(df, years=5):
    df['Age_Updated'] = df['Age'] + years
    return df

df = df.pipe(add_years, years=5)
print(df[['Age', 'Age_Updated']].head())

# Plotting updated ages
df[['Age', 'Age_Updated']].head(10).plot(kind='bar', figsize=(10, 6), title='Original and Updated Age')
plt.ylabel('Age')
plt.show()

# 10. crosstab() - Cross-tabulation of Region and Product
crosstab = pd.crosstab(df['Region'], df['Product'])
print(crosstab)

# Plotting crosstab data
crosstab.plot(kind='bar', stacked=True, figsize=(10, 6), title='Product Count by Region')
plt.ylabel('Count')
plt.show()

# 11. resample() - Resample data by day and sum sales and revenue
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)
resampled_df = df.resample('D')[['Sales', 'Revenue']].sum()  # restrict to numeric columns before summing
print(resampled_df[['Sales', 'Revenue']].head())

# Plotting resampled data (daily sales and revenue)
resampled_df[['Sales', 'Revenue']].head(30).plot(figsize=(10, 6), title='Daily Sales and Revenue')
plt.ylabel('Value')
plt.show()

# 12. fillna() - Fill missing values in the Discount column
df.iloc[::10, df.columns.get_loc('Discount')] = np.nan  # introduce NaNs for demonstration (avoids chained assignment)
df['Discount'] = df['Discount'].fillna(df['Discount'].mean())
print(df[['Discount']].head())

# Plotting filled discount values
df['Discount'].head(50).plot(kind='line', figsize=(10, 6), title='Discount after FillNA')
plt.ylabel('Discount')
plt.show()

# 13. reset_index() - Reset the index of the DataFrame
df_reset = df.reset_index()
print(df_reset.head())

# Plotting data with reset index
df_reset[['Category', 'Sales']].groupby('Category').sum().plot(kind='bar', figsize=(10, 6), title='Sales by Category with Reset Index')
plt.ylabel('Sales')
plt.show()

Insights:

GroupBy produces aggregate metrics for each group.

Pivot tables help in aggregating data based on multiple variables.

Apply and ApplyMap allow for custom transformations of entire columns or the whole DataFrame.

Rolling and Shift are great for calculating trends and previous values.

Merge, Explode, and Pipe allow for complex data transformations.

Cross-tab helps in comparing two categorical features.

Resampling and FillNA handle time-series aggregation and missing values.