
Pandas Functions

  • groupby(): Used to group data by one or more columns and apply aggregation functions to each group.

import pandas as pd

df = pd.DataFrame({'Category': ['A', 'B', 'A', 'B', 'A', 'C', 'C'],
                   'Value': [10, 20, 30, 40, 50, 60, 100]})
grouped = df.groupby('Category').count()
print(grouped)
          Value
Category
A             3
B             2
C             2
grouped = df.groupby('Category').agg({'Value': ['mean', 'max', 'sum', 'min']})
print(grouped)
         Value
          mean  max  sum  min
Category
A         30.0   50   90   10
B         30.0   40   60   20
C         80.0  100  160   60
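The dict-of-lists form above produces a MultiIndex on the columns. If flat column names are easier to work with, named aggregation is an alternative; a minimal sketch on the same df (the result column names mean_value, max_value, total_value are illustrative):

# Named aggregation gives flat, readable column names (pandas >= 0.25)
summary = df.groupby('Category').agg(
    mean_value=('Value', 'mean'),
    max_value=('Value', 'max'),
    total_value=('Value', 'sum'),
)
print(summary)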

Lambda functions:

  • Lambda functions are best suited for short, throwaway functions.

  • For complex operations, regular functions are preferred (see the short comparison after this list).

  • Lambda functions are anonymous functions defined using the 'lambda' keyword.

  • Syntax: lambda arguments: expression
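To illustrate the first two points, here is a small sketch of the same logic written both as a lambda and as a regular def function (the names to_celsius and to_celsius_checked are made up for this example); once the logic needs validation, documentation, or multiple steps, the def version is easier to read and debug:

# One-expression logic fits a lambda...
to_celsius = lambda f: (f - 32) * 5 / 9

# ...but a named function is clearer once you need docs or extra steps
def to_celsius_checked(f):
    """Convert Fahrenheit to Celsius, rejecting impossible values."""
    if f < -459.67:
        raise ValueError("below absolute zero")
    return (f - 32) * 5 / 9

print(to_celsius(98.6), to_celsius_checked(98.6))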

# Using map() with a lambda to square a list of numbers
nums = [1, 2, 3, 4, 5]
squared = list(map(lambda x: x**2, nums))
print("Squared List:", squared)
Squared List: [1, 4, 9, 16, 25]
# Example 1: Basic usage
square = lambda x: x**2
print("Square of 5:", square(5))

# Example 2: Adding three numbers
add = lambda a, b, c: a + b + c
print("Sum of 4, 7 and 3:", add(4, 7, 3))

# Example 3: Sorting a list of tuples by the second item
pairs = [(1, 'one'), (3, 'three'), (2, 'two'), (4, 'four')]
sorted_pairs = sorted(pairs, key=lambda x: x[1])
print("Sorted by value:", sorted_pairs)

# Example 4: Using with map() to square a list of numbers
nums = [1, 2, 3, 4, 5]
squared = list(map(lambda x: x**2, nums))
print("Squared List:", squared)

# Example 5: Using with filter() to keep even numbers
even_nums = list(filter(lambda x: x % 2 == 0, nums))
print("Even Numbers:", even_nums)

# Example 6: Using with reduce() to compute the product of elements
from functools import reduce
product = reduce(lambda x, y: x * y, nums)
print("Product of List:", product)

# Example 7: Conditional logic inside a lambda
check_even = lambda x: 'Even' if x % 2 == 0 else 'Odd'
print("Check 7:", check_even(7))

# Example 8: Lambda returned from a function (a closure)
def make_incrementor(n):
    return lambda x: x + n

increment_by_3 = make_incrementor(3)
print("Increment 10 by 3:", increment_by_3(10))
## apply(): Applies a function along an axis of the DataFrame.
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df['A'] = df['A'].apply(lambda x: x**2)
df
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df['Sum'] = df.apply(lambda row: row['A'] + row['B'], axis=1)
df
  • rolling(): Provides rolling window calculations.

df = pd.DataFrame({'Value': [1, 2, 3, 4, 5]})
rolling_mean = df['Value'].rolling(window=3).mean()
print(rolling_mean)
0    NaN
1    NaN
2    2.0
3    3.0
4    4.0
Name: Value, dtype: float64
df = pd.DataFrame({'Value': [1, 2, 3, 4, 5]})
rolling_sum = df['Value'].rolling(window=2).sum()
print(rolling_sum)
0    NaN
1    3.0
2    5.0
3    7.0
4    9.0
Name: Value, dtype: float64
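The leading NaN rows come from windows that are not yet full. If that is undesirable, rolling() accepts a min_periods argument; a minimal sketch on the same Value column:

df = pd.DataFrame({'Value': [1, 2, 3, 4, 5]})
# min_periods=1 lets partial windows produce a value instead of NaN
partial_mean = df['Value'].rolling(window=3, min_periods=1).mean()
print(partial_mean)  # 1.0, 1.5, 2.0, 3.0, 4.0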
## shift(): Shifts the values of a column by a given number of periods.
df = pd.DataFrame({'A': [1, 2, 3, 4]})
shifted_df = df['A'].shift(1)
shifted_df
0    NaN
1    1.0
2    2.0
3    3.0
Name: A, dtype: float64
df = pd.DataFrame({'A': [10, 20, 30, 40]})
shifted_df = df['A'].shift(-1)  # negative periods shift values upward
shifted_df
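A common use of shift() is computing period-over-period change. A minimal sketch (the Sales numbers below are made up for illustration):

sales = pd.DataFrame({'Sales': [100, 120, 90, 150]})
sales['Prev'] = sales['Sales'].shift(1)            # previous period's value
sales['Change'] = sales['Sales'] - sales['Prev']   # absolute change vs previous period
sales['Pct_Change'] = sales['Sales'].pct_change()  # same idea, built in
print(sales)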
## transform(): Applies a function to each group in the DataFrame and returns the transformed data.
df = pd.DataFrame({'Category': ['A', 'A', 'B', 'B'], 'Value': [1, 2, 3, 4]})
df['Transformed'] = df.groupby('Category')['Value'].transform(lambda x: x / x.sum())
df
df = pd.DataFrame({'A': [1, 2, 3, 4]})
df['Transformed'] = df['A'].transform(lambda x: 2 * x)
df
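The practical difference between transform() and agg() on a groupby is the shape of the result: agg() returns one row per group, while transform() returns a value for every original row, which is why it can be assigned straight back into the DataFrame. A small sketch reusing the Category/Value example:

df = pd.DataFrame({'Category': ['A', 'A', 'B', 'B'], 'Value': [1, 2, 3, 4]})
print(df.groupby('Category')['Value'].agg('sum'))        # 2 rows: one per group
print(df.groupby('Category')['Value'].transform('sum'))  # 4 rows: group sum broadcast to each row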
## nlargest(): Returns the first n largest values.
## nsmallest(): Returns the first n smallest values.
df = pd.DataFrame({'A': [10, 20, 30, 40]})
largest = df['A'].nlargest(1)
print(largest)
3    40
Name: A, dtype: int64
df = pd.DataFrame({'A': [10, 20, 30, 40]})
smallest = df['A'].nsmallest(2)
print(smallest)
0    10
1    20
Name: A, dtype: int64
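The examples above call nlargest()/nsmallest() on a single column (a Series). When called on a whole DataFrame they need a column to rank by and return complete rows; a minimal sketch:

df = pd.DataFrame({'A': [10, 20, 30, 40], 'B': ['w', 'x', 'y', 'z']})
print(df.nlargest(2, 'A'))    # the two rows with the largest A
print(df.nsmallest(1, 'A'))   # the row with the smallest A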
## query(): Filters data based on a query string.
df = pd.DataFrame({'A': [1, 2, 3, 4, -1, 8, 9], 'B': [5, 6, 7, 8, 7, 3, 3]})
filtered_df = df.query('A > 2')
filtered_df
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
filtered_df = df.query('A == 2 and B == 5')
filtered_df
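query() can also reference Python variables from the surrounding scope using the @ prefix, which keeps hard-coded values out of the query string; a minimal sketch:

df = pd.DataFrame({'A': [1, 2, 3, 4, -1, 8, 9], 'B': [5, 6, 7, 8, 7, 3, 3]})
threshold = 2
filtered_df = df.query('A > @threshold and B < 8')  # @threshold refers to the local variable
filtered_df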
## explode(): Explodes a list-like column into separate rows.
df = pd.DataFrame({'A': [[1, 2, 4], [3, 4], [5, 6]]})
df
df = pd.DataFrame({'A': [[1], [3, 4], [5, 6, 7]]})
exploded_df = df.explode('A')
exploded_df
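By default explode() repeats the original index label for each emitted row; passing ignore_index=True renumbers the result instead. A small sketch on the same data:

df = pd.DataFrame({'A': [[1], [3, 4], [5, 6, 7]]})
exploded_df = df.explode('A', ignore_index=True)  # fresh 0..n-1 index instead of repeated labels
exploded_df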

Case Study A: Sales Performance and Data Transformation in E-Commerce

Key Plots:

  • Sales and Revenue by Category: Bar plot of average sales and average revenue for each category.

  • Average Revenue by Region and Product: Bar plot for average revenue based on region and product.

  • Original vs Doubled Revenue: A comparison of the original and doubled revenue.

  • Transformed Sales, Revenue, and Discount: A line plot showing the numeric transformations.

  • Sales and Rolling Mean Sales: A line plot showing sales alongside a rolling mean.

  • Sales and Shifted Sales: Line plot comparing original and shifted sales.

  • Sales by Category After Merge: Bar plot showing the merged sales data grouped by category.

  • Exploded Product Distribution: Bar plot showing the distribution of exploded product lists.

  • Original vs Updated Age: Bar plot comparing the original and updated age columns.

  • Product Count by Region: A stacked bar plot showing counts of each product per region.

  • Daily Sales and Revenue: Line plot showing daily aggregated sales and revenue.

  • Discount after FillNA: Line plot showing how missing discount values were filled.

  • Sales by Category with Reset Index: Bar plot showing sales by category after resetting the index.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("sales_data.csv")
# Quick look at the data; in a notebook only df.info()'s printout and the
# last expression's output (the missing-value counts) are shown below.
df.head(3)
df.shape
df.info()
df.isnull().sum()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Category  3000 non-null   object
 1   Sales     3000 non-null   int64
 2   Revenue   3000 non-null   float64
 3   Discount  3000 non-null   float64
 4   Age       3000 non-null   int64
 5   Region    3000 non-null   object
 6   Product   3000 non-null   object
 7   Date      3000 non-null   object
dtypes: float64(2), int64(2), object(4)
memory usage: 187.6+ KB
Category    0
Sales       0
Revenue     0
Discount    0
Age         0
Region      0
Product     0
Date        0
dtype: int64
df
## Mean of Sales and mean of Revenue, category-wise
grouped = df.groupby('Category').agg({'Sales': 'mean', 'Revenue': 'mean'})
grouped
# Plotting the grouped data
grouped.plot(kind='bar', figsize=(15, 8), title='Sales and Revenue by Category')
plt.ylabel('Value')
plt.show()
[Bar plot: Sales and Revenue by Category]
# 2. pivot_table() - Pivot by Region and Product
pivot_table = df.pivot_table(values='Revenue', index='Region', columns='Product', aggfunc='mean')
pivot_table
# Plotting the pivot table
pivot_table.plot(kind='bar', figsize=(11, 6), title='Average Revenue by Region and Product')
plt.ylabel('Average Revenue')
plt.show()
[Bar plot: Average Revenue by Region and Product]
# 3. apply() - Apply a function to double Revenue
df['Revenue_Doubled'] = df['Revenue'].apply(lambda x: x * 2)
print(df[['Revenue', 'Revenue_Doubled']].head())
       Revenue  Revenue_Doubled
0  1998.012319      3996.024637
1  3189.885959      6379.771919
2  2097.538421      4195.076842
3  4606.332021      9212.664042
4  1860.817919      3721.635838
# Plotting the doubled revenue
df[['Revenue', 'Revenue_Doubled']].head(10).plot(kind='bar', figsize=(10, 6), title='Original and Doubled Revenue')
plt.ylabel('Revenue')
plt.show()
[Bar plot: Original and Doubled Revenue]
# 4. applymap() - Apply a function to every element of the selected numeric columns
# (note: pandas 2.1+ prefers DataFrame.map for this element-wise operation)
df_numeric = df[['Sales', 'Revenue', 'Discount']].applymap(lambda x: x * 2)
print(df_numeric.head())
   Sales      Revenue  Discount
0   7316  3996.024637  0.812216
1   9000  6379.771919  0.485009
2   2748  4195.076842  0.419267
3   6240  9212.664042  0.840931
4   6042  3721.635838  0.501743
# Plotting the transformed numeric data
df_numeric.head(10).plot(kind='line', figsize=(10, 6), title='Transformed Sales, Revenue, and Discount')
plt.ylabel('Transformed Value')
plt.show()
[Line plot: Transformed Sales, Revenue, and Discount]
# 5. rolling() - Calculate the rolling mean of Sales
df['Rolling_Mean_Sales'] = df['Sales'].rolling(window=10).mean()
print(df[['Sales', 'Rolling_Mean_Sales']].head())

# Plotting the rolling mean of sales
df[['Sales', 'Rolling_Mean_Sales']].head(100).plot(figsize=(10, 6), title='Sales and Rolling Mean Sales')
plt.ylabel('Sales')
plt.show()

# 6. shift() - Shift the Sales column
df['Sales_Shifted'] = df['Sales'].shift(1)
print(df[['Sales', 'Sales_Shifted']].head())

# Plotting the sales and shifted sales
df[['Sales', 'Sales_Shifted']].head(100).plot(figsize=(10, 6), title='Sales and Shifted Sales')
plt.ylabel('Sales')
plt.show()

# 7. merge() - Merge with a small lookup dataframe
small_df = pd.DataFrame({
    'Category': ['A', 'B', 'C'],
    'Region': ['East', 'North', 'South'],
    'Product': ['P1', 'P2', 'P3']
})
merged_df = pd.merge(df, small_df, on='Category', how='left')
print(merged_df.head())

# Plotting the merged data
merged_df[['Category', 'Sales']].groupby('Category').sum().plot(kind='bar', figsize=(10, 6), title='Merged Data Sales by Category')
plt.ylabel('Sales')
plt.show()

# 8. explode() - Explode the list of products into separate rows
df['Product_List'] = df['Product'].apply(lambda x: [x, f"{x}2", f"{x}3"])
exploded_df = df.explode('Product_List')
print(exploded_df[['Product', 'Product_List']].head())

# Plotting exploded product distribution
exploded_df['Product_List'].value_counts().plot(kind='bar', figsize=(10, 6), title='Product List Exploded Distribution')
plt.ylabel('Count')
plt.show()

# 9. pipe() - Apply a transformation function to Age
def add_years(df, years=5):
    df['Age_Updated'] = df['Age'] + years
    return df

df = df.pipe(add_years, years=5)
print(df[['Age', 'Age_Updated']].head())

# Plotting updated ages
df[['Age', 'Age_Updated']].head(10).plot(kind='bar', figsize=(10, 6), title='Original and Updated Age')
plt.ylabel('Age')
plt.show()

# 10. crosstab() - Cross-tabulation of Region and Product
crosstab = pd.crosstab(df['Region'], df['Product'])
print(crosstab)

# Plotting crosstab data
crosstab.plot(kind='bar', stacked=True, figsize=(10, 6), title='Product Count by Region')
plt.ylabel('Count')
plt.show()

# 11. resample() - Resample data by day and sum sales and revenue
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)
resampled_df = df.resample('D')[['Sales', 'Revenue']].sum()  # restrict to numeric columns before summing
print(resampled_df[['Sales', 'Revenue']].head())

# Plotting resampled data (daily sales and revenue)
resampled_df[['Sales', 'Revenue']].head(30).plot(figsize=(10, 6), title='Daily Sales and Revenue')
plt.ylabel('Value')
plt.show()

# 12. fillna() - Fill missing values in the Discount column
df.iloc[::10, df.columns.get_loc('Discount')] = np.nan  # introduce NaNs for demonstration (avoids chained assignment)
df['Discount'] = df['Discount'].fillna(df['Discount'].mean())
print(df[['Discount']].head())

# Plotting filled discount values
df['Discount'].head(50).plot(kind='line', figsize=(10, 6), title='Discount after FillNA')
plt.ylabel('Discount')
plt.show()

# 13. reset_index() - Reset the index of the DataFrame
df_reset = df.reset_index()
print(df_reset.head())

# Plotting data with reset index
df_reset[['Category', 'Sales']].groupby('Category').sum().plot(kind='bar', figsize=(10, 6), title='Sales by Category with Reset Index')
plt.ylabel('Sales')
plt.show()

Insights:

GroupBy produces aggregate metrics for each group.

Pivot tables help in aggregating data based on multiple variables.

Apply and ApplyMap allow for custom transformations of entire columns or the whole DataFrame.

Rolling and Shift are great for calculating trends and previous values.

Merge, Explode, and Pipe allow for complex data transformations.

Cross-tab helps in comparing two categorical features.

Resampling and FillNA handle time-series aggregation and missing values.