Path: blob/master/Lecture_1.ipynb
145 views
Kernel: Python 3
Bike store sales
In this class we'll be analyzing sales made on bike stores.
Hands on!
In [1]:
Loading our data:
In [2]:
Out[2]:
Date,Day,Month,Year,Customer_Age,Age_Group,Customer_Gender,Country,State,Product_Category,Sub_Category,Product,Order_Quantity,Unit_Cost,Unit_Price,Profit,Cost,Revenue
2013-11-26,26,November,2013,19,Youth (<25),M,Canada,British Columbia,Accessories,Bike Racks,Hitch Rack - 4-Bike,8,45,120,590,360,950
2015-11-26,26,November,2015,19,Youth (<25),M,Canada,British Columbia,Accessories,Bike Racks,Hitch Rack - 4-Bike,8,45,120,590,360,950
2014-03-23,23,March,2014,49,Adults (35-64),M,Australia,New South Wales,Accessories,Bike Racks,Hitch Rack - 4-Bike,23,45,120,1366,1035,2401
2016-03-23,23,March,2016,49,Adults (35-64),M,Australia,New South Wales,Accessories,Bike Racks,Hitch Rack - 4-Bike,20,45,120,1188,900,2088
2014-05-15,15,May,2014,47,Adults (35-64),F,Australia,New South Wales,Accessories,Bike Racks,Hitch Rack - 4-Bike,4,45,120,238,180,418
2016-05-15,15,May,2016,47,Adults (35-64),F,Australia,New South Wales,Accessories,Bike Racks,Hitch Rack - 4-Bike,5,45,120,297,225,522
2014-05-22,22,May,2014,47,Adults (35-64),F,Australia,Victoria,Accessories,Bike Racks,Hitch Rack - 4-Bike,4,45,120,199,180,379
2016-05-22,22,May,2016,47,Adults (35-64),F,Australia,Victoria,Accessories,Bike Racks,Hitch Rack - 4-Bike,2,45,120,100,90,190
2014-02-22,22,February,2014,35,Adults (35-64),M,Australia,Victoria,Accessories,Bike Racks,Hitch Rack - 4-Bike,22,45,120,1096,990,2086
In [3]:
The data at a glance:
In [4]:
Out[4]:
In [5]:
Out[5]:
(113036, 18)
In [6]:
Out[6]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113036 entries, 0 to 113035
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 113036 non-null datetime64[ns]
1 Day 113036 non-null int64
2 Month 113036 non-null object
3 Year 113036 non-null int64
4 Customer_Age 113036 non-null int64
5 Age_Group 113036 non-null object
6 Customer_Gender 113036 non-null object
7 Country 113036 non-null object
8 State 113036 non-null object
9 Product_Category 113036 non-null object
10 Sub_Category 113036 non-null object
11 Product 113036 non-null object
12 Order_Quantity 113036 non-null int64
13 Unit_Cost 113036 non-null int64
14 Unit_Price 113036 non-null int64
15 Profit 113036 non-null int64
16 Cost 113036 non-null int64
17 Revenue 113036 non-null int64
dtypes: datetime64[ns](1), int64(9), object(8)
memory usage: 15.5+ MB
In [7]:
Out[7]:
Numerical analysis and visualization
We'll analyze the Unit_Cost
column:
In [8]:
Out[8]:
count 113036.000000
mean 267.296366
std 549.835483
min 1.000000
25% 2.000000
50% 9.000000
75% 42.000000
max 2171.000000
Name: Unit_Cost, dtype: float64
In [9]:
Out[9]:
267.296365759581
In [10]:
Out[10]:
9.0
In [11]:
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8af096b700>
In [12]:
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8af0bf0f70>
In [13]:
Out[13]:
<matplotlib.lines.Line2D at 0x7f8b17499250>
In [14]:
Out[14]:
Text(0.5, 0, 'dollars')
Categorical analysis and visualization
We'll analyze the Age_Group
column:
In [16]:
Out[16]:
In [15]:
Out[15]:
Adults (35-64) 55824
Young Adults (25-34) 38654
Youth (<25) 17828
Seniors (64+) 730
Name: Age_Group, dtype: int64
In [17]:
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8ade8bc2e0>
In [18]:
Out[18]:
Text(0, 0.5, 'Number of Sales')
Relationship between the columns?
Can we find any significant relationship?
In [19]:
Out[19]:
In [20]:
Out[20]:
In [21]:
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8ade7c34f0>
In [22]:
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8ade7a70a0>
In [23]:
Out[23]:
Text(0, 0.5, 'Profit')
In [24]:
Out[24]:
Year AxesSubplot(0.125,0.536818;0.227941x0.343182)
Customer_Age AxesSubplot(0.398529,0.536818;0.227941x0.343182)
Order_Quantity AxesSubplot(0.672059,0.536818;0.227941x0.343182)
Unit_Cost AxesSubplot(0.125,0.125;0.227941x0.343182)
Unit_Price AxesSubplot(0.398529,0.125;0.227941x0.343182)
Profit AxesSubplot(0.672059,0.125;0.227941x0.343182)
dtype: object
Column wrangling
We can also create new columns or modify existing ones.
Add and calculate a new Revenue_per_Age
column
In [25]:
Out[25]:
0 50.000000
1 50.000000
2 49.000000
3 42.612245
4 8.893617
Name: Revenue_per_Age, dtype: float64
In [26]:
Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8ade499d90>
In [27]:
Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8ade3f9100>
Add and calculate a new Calculated_Cost
column
Use this formula
In [28]:
Out[28]:
0 360
1 360
2 1035
3 900
4 180
Name: Calculated_Cost, dtype: int64
In [29]:
Out[29]:
0
We can see the relationship between Cost
and Profit
using a scatter plot:
In [30]:
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8ade3df790>
Add and calculate a new Calculated_Revenue
column
Use this formula
In [31]:
Out[31]:
0 950
1 950
2 2401
3 2088
4 418
Name: Calculated_Revenue, dtype: int64
In [32]:
Out[32]:
0
In [33]:
Out[33]:
In [34]:
Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8ade39afa0>
Modify all Unit_Price
values adding 3% tax to them
In [35]:
Out[35]:
0 120
1 120
2 120
3 120
4 120
Name: Unit_Price, dtype: int64
In [36]:
In [37]:
Out[37]:
0 123.6
1 123.6
2 123.6
3 123.6
4 123.6
Name: Unit_Price, dtype: float64
Selection & Indexing:
### Get all the sales made in the state of Kentucky
In [38]:
Out[38]:
Get the mean revenue of the Adults (35-64)
sales group
In [39]:
Out[39]:
762.8287654055604
How many records belong to Age Group Youth (<25)
or Adults (35-64)
?
In [43]:
Out[43]:
73652
Get the mean revenue of the sales group Adults (35-64)
in United States
In [44]:
Out[44]:
726.7260473588342
### Increase the revenue by 10% to every sale made in France
In [45]:
Out[45]:
50 787
51 787
52 2957
53 2851
60 626
Name: Revenue, dtype: int64
In [46]:
In [47]:
Out[47]:
50 865.7
51 865.7
52 3252.7
53 3136.1
60 688.6
Name: Revenue, dtype: float64