Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
ine-rmotr-curriculum
GitHub Repository: ine-rmotr-curriculum/FreeCodeCamp-Pandas-Real-Life-Example
Path: blob/master/Lecture_1.ipynb
145 views
Kernel: Python 3

rmotr


Bike store sales

In this class we'll be analyzing sales made on bike stores.

Follow this data in a Google Spreadsheet

purple-divider

Hands on!

import numpy as np import pandas as pd import matplotlib.pyplot as plt %matplotlib inline

green-divider

Loading our data:

!head data/sales_data.csv
Date,Day,Month,Year,Customer_Age,Age_Group,Customer_Gender,Country,State,Product_Category,Sub_Category,Product,Order_Quantity,Unit_Cost,Unit_Price,Profit,Cost,Revenue 2013-11-26,26,November,2013,19,Youth (<25),M,Canada,British Columbia,Accessories,Bike Racks,Hitch Rack - 4-Bike,8,45,120,590,360,950 2015-11-26,26,November,2015,19,Youth (<25),M,Canada,British Columbia,Accessories,Bike Racks,Hitch Rack - 4-Bike,8,45,120,590,360,950 2014-03-23,23,March,2014,49,Adults (35-64),M,Australia,New South Wales,Accessories,Bike Racks,Hitch Rack - 4-Bike,23,45,120,1366,1035,2401 2016-03-23,23,March,2016,49,Adults (35-64),M,Australia,New South Wales,Accessories,Bike Racks,Hitch Rack - 4-Bike,20,45,120,1188,900,2088 2014-05-15,15,May,2014,47,Adults (35-64),F,Australia,New South Wales,Accessories,Bike Racks,Hitch Rack - 4-Bike,4,45,120,238,180,418 2016-05-15,15,May,2016,47,Adults (35-64),F,Australia,New South Wales,Accessories,Bike Racks,Hitch Rack - 4-Bike,5,45,120,297,225,522 2014-05-22,22,May,2014,47,Adults (35-64),F,Australia,Victoria,Accessories,Bike Racks,Hitch Rack - 4-Bike,4,45,120,199,180,379 2016-05-22,22,May,2016,47,Adults (35-64),F,Australia,Victoria,Accessories,Bike Racks,Hitch Rack - 4-Bike,2,45,120,100,90,190 2014-02-22,22,February,2014,35,Adults (35-64),M,Australia,Victoria,Accessories,Bike Racks,Hitch Rack - 4-Bike,22,45,120,1096,990,2086
sales = pd.read_csv( 'data/sales_data.csv', parse_dates=['Date'])

green-divider

The data at a glance:

sales.head()
sales.shape
(113036, 18)
sales.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 113036 entries, 0 to 113035 Data columns (total 18 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Date 113036 non-null datetime64[ns] 1 Day 113036 non-null int64 2 Month 113036 non-null object 3 Year 113036 non-null int64 4 Customer_Age 113036 non-null int64 5 Age_Group 113036 non-null object 6 Customer_Gender 113036 non-null object 7 Country 113036 non-null object 8 State 113036 non-null object 9 Product_Category 113036 non-null object 10 Sub_Category 113036 non-null object 11 Product 113036 non-null object 12 Order_Quantity 113036 non-null int64 13 Unit_Cost 113036 non-null int64 14 Unit_Price 113036 non-null int64 15 Profit 113036 non-null int64 16 Cost 113036 non-null int64 17 Revenue 113036 non-null int64 dtypes: datetime64[ns](1), int64(9), object(8) memory usage: 15.5+ MB
sales.describe()

green-divider

Numerical analysis and visualization

We'll analyze the Unit_Cost column:

sales['Unit_Cost'].describe()
count 113036.000000 mean 267.296366 std 549.835483 min 1.000000 25% 2.000000 50% 9.000000 75% 42.000000 max 2171.000000 Name: Unit_Cost, dtype: float64
sales['Unit_Cost'].mean()
267.296365759581
sales['Unit_Cost'].median()
9.0
sales['Unit_Cost'].plot(kind='box', vert=False, figsize=(14,6))
<matplotlib.axes._subplots.AxesSubplot at 0x7f8af096b700>
Image in a Jupyter notebook
sales['Unit_Cost'].plot(kind='density', figsize=(14,6)) # kde
<matplotlib.axes._subplots.AxesSubplot at 0x7f8af0bf0f70>
Image in a Jupyter notebook
ax = sales['Unit_Cost'].plot(kind='density', figsize=(14,6)) # kde ax.axvline(sales['Unit_Cost'].mean(), color='red') ax.axvline(sales['Unit_Cost'].median(), color='green')
<matplotlib.lines.Line2D at 0x7f8b17499250>
Image in a Jupyter notebook
ax = sales['Unit_Cost'].plot(kind='hist', figsize=(14,6)) ax.set_ylabel('Number of Sales') ax.set_xlabel('dollars')
Text(0.5, 0, 'dollars')
Image in a Jupyter notebook

green-divider

Categorical analysis and visualization

We'll analyze the Age_Group column:

sales.head()
sales['Age_Group'].value_counts()
Adults (35-64) 55824 Young Adults (25-34) 38654 Youth (<25) 17828 Seniors (64+) 730 Name: Age_Group, dtype: int64
sales['Age_Group'].value_counts().plot(kind='pie', figsize=(6,6))
<matplotlib.axes._subplots.AxesSubplot at 0x7f8ade8bc2e0>
Image in a Jupyter notebook
ax = sales['Age_Group'].value_counts().plot(kind='bar', figsize=(14,6)) ax.set_ylabel('Number of Sales')
Text(0, 0.5, 'Number of Sales')
Image in a Jupyter notebook

green-divider

Relationship between the columns?

Can we find any significant relationship?

corr = sales.corr() corr
fig = plt.figure(figsize=(8,8)) plt.matshow(corr, cmap='RdBu', fignum=fig.number) plt.xticks(range(len(corr.columns)), corr.columns, rotation='vertical'); plt.yticks(range(len(corr.columns)), corr.columns);
Image in a Jupyter notebook
sales.plot(kind='scatter', x='Customer_Age', y='Revenue', figsize=(6,6))
<matplotlib.axes._subplots.AxesSubplot at 0x7f8ade7c34f0>
Image in a Jupyter notebook
sales.plot(kind='scatter', x='Revenue', y='Profit', figsize=(6,6))
<matplotlib.axes._subplots.AxesSubplot at 0x7f8ade7a70a0>
Image in a Jupyter notebook
ax = sales[['Profit', 'Age_Group']].boxplot(by='Age_Group', figsize=(10,6)) ax.set_ylabel('Profit')
Text(0, 0.5, 'Profit')
Image in a Jupyter notebook
boxplot_cols = ['Year', 'Customer_Age', 'Order_Quantity', 'Unit_Cost', 'Unit_Price', 'Profit'] sales[boxplot_cols].plot(kind='box', subplots=True, layout=(2,3), figsize=(14,8))
Year AxesSubplot(0.125,0.536818;0.227941x0.343182) Customer_Age AxesSubplot(0.398529,0.536818;0.227941x0.343182) Order_Quantity AxesSubplot(0.672059,0.536818;0.227941x0.343182) Unit_Cost AxesSubplot(0.125,0.125;0.227941x0.343182) Unit_Price AxesSubplot(0.398529,0.125;0.227941x0.343182) Profit AxesSubplot(0.672059,0.125;0.227941x0.343182) dtype: object
Image in a Jupyter notebook

green-divider

Column wrangling

We can also create new columns or modify existing ones.

Add and calculate a new Revenue_per_Age column

sales['Revenue_per_Age'] = sales['Revenue'] / sales['Customer_Age'] sales['Revenue_per_Age'].head()
0 50.000000 1 50.000000 2 49.000000 3 42.612245 4 8.893617 Name: Revenue_per_Age, dtype: float64
sales['Revenue_per_Age'].plot(kind='density', figsize=(14,6))
<matplotlib.axes._subplots.AxesSubplot at 0x7f8ade499d90>
Image in a Jupyter notebook
sales['Revenue_per_Age'].plot(kind='hist', figsize=(14,6))
<matplotlib.axes._subplots.AxesSubplot at 0x7f8ade3f9100>
Image in a Jupyter notebook

Add and calculate a new Calculated_Cost column

Use this formula

Calculated_Cost=Order_QuantityUnit_CostCalculated\_Cost = Order\_Quantity * Unit\_Cost
sales['Calculated_Cost'] = sales['Order_Quantity'] * sales['Unit_Cost'] sales['Calculated_Cost'].head()
0 360 1 360 2 1035 3 900 4 180 Name: Calculated_Cost, dtype: int64
(sales['Calculated_Cost'] != sales['Cost']).sum()
0

We can see the relationship between Cost and Profit using a scatter plot:

sales.plot(kind='scatter', x='Calculated_Cost', y='Profit', figsize=(6,6))
<matplotlib.axes._subplots.AxesSubplot at 0x7f8ade3df790>
Image in a Jupyter notebook

Add and calculate a new Calculated_Revenue column

Use this formula

Calculated_Revenue=Cost+ProfitCalculated\_Revenue = Cost + Profit
sales['Calculated_Revenue'] = sales['Cost'] + sales['Profit'] sales['Calculated_Revenue'].head()
0 950 1 950 2 2401 3 2088 4 418 Name: Calculated_Revenue, dtype: int64
(sales['Calculated_Revenue'] != sales['Revenue']).sum()
0
sales.head()
sales['Revenue'].plot(kind='hist', bins=100, figsize=(14,6))
<matplotlib.axes._subplots.AxesSubplot at 0x7f8ade39afa0>
Image in a Jupyter notebook

Modify all Unit_Price values adding 3% tax to them

sales['Unit_Price'].head()
0 120 1 120 2 120 3 120 4 120 Name: Unit_Price, dtype: int64
#sales['Unit_Price'] = sales['Unit_Price'] * 1.03 sales['Unit_Price'] *= 1.03
sales['Unit_Price'].head()
0 123.6 1 123.6 2 123.6 3 123.6 4 123.6 Name: Unit_Price, dtype: float64

green-divider

Selection & Indexing:

### Get all the sales made in the state of Kentucky

sales.loc[sales['State'] == 'Kentucky']

Get the mean revenue of the Adults (35-64) sales group

sales.loc[sales['Age_Group'] == 'Adults (35-64)', 'Revenue'].mean()
762.8287654055604

How many records belong to Age Group Youth (<25) or Adults (35-64)?

sales.loc[(sales['Age_Group'] == 'Youth (<25)') | (sales['Age_Group'] == 'Adults (35-64)')].shape[0]
73652

Get the mean revenue of the sales group Adults (35-64) in United States

sales.loc[(sales['Age_Group'] == 'Adults (35-64)') & (sales['Country'] == 'United States'), 'Revenue'].mean()
726.7260473588342

### Increase the revenue by 10% to every sale made in France

sales.loc[sales['Country'] == 'France', 'Revenue'].head()
50 787 51 787 52 2957 53 2851 60 626 Name: Revenue, dtype: int64
#sales.loc[sales['Country'] == 'France', 'Revenue'] = sales.loc[sales['Country'] == 'France', 'Revenue'] * 1.1 sales.loc[sales['Country'] == 'France', 'Revenue'] *= 1.1
sales.loc[sales['Country'] == 'France', 'Revenue'].head()
50 865.7 51 865.7 52 3252.7 53 3136.1 60 688.6 Name: Revenue, dtype: float64

purple-divider