CoCalc -- day04.ipynb

GitHub Repository: jackfrued/Python-100-Days
Path: blob/master/Day66-80/code/day04.ipynb
²⁹²² views

Kernel: Python 3

深入浅出pandas-1

pandas是一个支持数据分析全流程的Python开源库，它的作者Wes McKinney于2008年开始开发这个库，其主要目标是提供一个大数据分析和处理的工具。pandas封装了从数据加载、数据重塑、数据清洗到数据透视、数据呈现等一系列操作，提供了三种核心的数据类型：

Series：数据系列，表示一维的数据。跟一维数组的区别在于每条数据都有对应的索引，处理数据的方法比ndarray更为丰富。
DataFrame：数据框、数据窗、数据表，表示二维的数据。跟二维数组相比，DataFrame有行索引和列索引，而且提供了100+方法来处理数据。
Index：为Series和DataFrame提供索引服务。

In [1]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

plt.rcParams['font.sans-serif'].insert(0, 'SimHei')
plt.rcParams['axes.unicode_minus'] = False
get_ipython().run_line_magic('config', "InlineBackend.figure_format = 'svg'")

创建DataFrame对象

In [2]:

np.random.seed(20)

In [3]:

stu_names = ['狄仁杰', '白起', '李元芳', '苏妲己', '孙尚香']
cou_names = ['语文', '数学', '英语']
scores_arr = np.random.randint(60, 101, (5, 3))
scores_arr

Out[3]:

array([[ 95,  86,  75],
       [ 91,  88,  86],
       [ 69,  80,  71],
       [ 82,  67,  94],
       [ 92, 100,  81]])

In [4]:

# 方法一：通过二维数组构造DataFrame对象
df1 = pd.DataFrame(data=scores_arr, columns=cou_names, index=stu_names)
df1

Out[4]:

In [5]:

# 行索引
df1.index

Out[5]:

Index(['狄仁杰', '白起', '李元芳', '苏妲己', '孙尚香'], dtype='object')

In [6]:

# 列索引
df1.columns

Out[6]:

Index(['语文', '数学', '英语'], dtype='object')

In [7]:

# 值 - 二维数组
df1.values

Out[7]:

array([[ 95,  86,  75],
       [ 91,  88,  86],
       [ 69,  80,  71],
       [ 82,  67,  94],
       [ 92, 100,  81]])

In [8]:

scores_dict = {
    '语文': [95, 91, 69, 82, 92],
    '数学': [86, 88, 80, 67, 100],
    '英语': [75, 86, 71, 94, 81]
}

In [9]:

# 方法二：通过数据字典构造DataFrame对象
df2 = pd.DataFrame(data=scores_dict, index=stu_names)
df2

Out[9]:

In [10]:

# 查看DataFrame信息
df2.info(memory_usage='deep')

Out[10]:

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, 狄仁杰 to 孙尚香
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   语文      5 non-null      int64
 1   数学      5 non-null      int64
 2   英语      5 non-null      int64
dtypes: int64(3)
memory usage: 558.0 bytes

In [11]:

# 方法三：从CSV文件加载数据创建DataFrame对象
df3 = pd.read_csv(
    'res/2023年北京积分落户数据.csv',
    # encoding='utf-8',  # 指定字符编码
    # sep='',  # 指定字段的分隔符（默认逗号）
    # delimiter='#',
    # header=0,  # 表头所在的行
    # quotechar='"',  # 包裹字符串的字符（默认双引号）
    # index_col='公示编号',  # 索引列
    # usecols=['公示编号', '姓名', '积分分值'],  # 指定加载的列
    # nrows=10,  # 加载的行数
    # skiprows=np.arange(1, 101),  # 跳过哪些行
    # true_values=['是', 'Yes', 'YES'],  # 哪些值会被视为布尔值True
    # false_values=['否', 'No', 'NO'],  # 哪些值会被视为布尔值False
    # na_values=['---', 'N/A'],  # 哪些值会被视为空值
    # iterator=True,  # 开启迭代器模式
    # chunksize=1000,  # 每次加载的数据体量
)
df3

Out[11]:

In [12]:

# %pip install openpyxl

In [13]:

# 方法四：从Excel文件加载数据创建DataFrame对象
df6 = pd.read_excel(
    'res/2020年销售数据.xlsx',
    sheet_name='data',
)
df6

Out[13]:

In [14]:

# %pip install -U pymysql cryptography sqlalchemy

In [15]:

# 方法五：从数据服务器加载数据创建DataFrame对象
from sqlalchemy import create_engine

# URL 
engine = create_engine('mysql+pymysql://guest:[email protected]:3306/hrs')
engine

Out[15]:

Engine(mysql+pymysql://guest:***@47.109.26.237:3306/hrs)

In [16]:

dept_df = pd.read_sql('tb_dept', engine, index_col='dno')
dept_df

Out[16]:

In [17]:

emp_df1 = pd.read_sql('tb_emp', engine, index_col='eno')
emp_df1

Out[17]:

In [18]:

emp_df2 = pd.read_sql('tb_emp2', engine, index_col='eno')
emp_df2

Out[18]:

In [19]:

# 关闭连接释放资源
engine.connect().close()

查看信息

In [20]:

df6.info()

Out[20]:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1945 entries, 0 to 1944
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   销售日期    1945 non-null   datetime64[ns]
 1   销售区域    1945 non-null   object        
 2   销售渠道    1945 non-null   object        
 3   销售订单    1945 non-null   object        
 4   品牌      1945 non-null   object        
 5   售价      1945 non-null   int64         
 6   销售数量    1945 non-null   int64         
 7   直接成本    1945 non-null   int64         
dtypes: datetime64[ns](1), int64(3), object(4)
memory usage: 121.7+ KB

In [21]:

# 获取前N行
df6.head(3)

Out[21]:

In [22]:

# 获取后N行
df6.tail(5)

Out[22]:

操作行、列、单元格

In [23]:

# 访问列
df6['销售日期']

Out[23]:

    2020-01-01
    2020-01-01
    2020-01-01
    2020-01-01
    2020-01-01
          ...    
 2020-12-30
 2020-12-30
 2020-12-31
 2020-12-31
 2020-12-31
Name: 销售日期, Length: 1945, dtype: datetime64[ns]

In [24]:

df6.销售渠道

Out[24]:

     拼多多
      抖音
      天猫
      天猫
      天猫
       ... 
   京东
   实体
   实体
   抖音
   天猫
Name: 销售渠道, Length: 1945, dtype: object

In [25]:

df6['销售渠道']

Out[25]:

     拼多多
      抖音
      天猫
      天猫
      天猫
       ... 
   京东
   实体
   实体
   抖音
   天猫
Name: 销售渠道, Length: 1945, dtype: object

In [26]:

type(df6['销售日期'])

Out[26]:

pandas.core.series.Series

In [27]:

df6[['销售渠道']]

Out[27]:

In [28]:

type(df6[['销售渠道']])

Out[28]:

pandas.core.frame.DataFrame

In [29]:

# 访问多个列 - 花式索引
df6[['销售日期', '销售区域', '直接成本']]

Out[29]:

In [30]:

df6.columns[3:7]

Out[30]:

Index(['销售订单', '品牌', '售价', '销售数量'], dtype='object')

In [31]:

df6[df6.columns[3:7]]

Out[31]:

In [32]:

# 访问行 - loc属性
df6.loc[1944]

Out[32]:

销售日期    2020-12-31 00:00:00
销售区域                     福建
销售渠道                     天猫
销售订单             211807-050
品牌                      八匹马
售价                       99
销售数量                     27
直接成本                    435
Name: 1944, dtype: object

In [33]:

df6.iloc[-1]

Out[33]:

销售日期    2020-12-31 00:00:00
销售区域                     福建
销售渠道                     天猫
销售订单             211807-050
品牌                      八匹马
售价                       99
销售数量                     27
直接成本                    435
Name: 1944, dtype: object

In [34]:

# 访问多行 - 花式索引
df6.loc[[0, 100, 58, 1000, 1000, 1000, 1099]]

Out[34]:

In [35]:

# 访问多行 - 切片索引
df6.loc[101:200]

Out[35]:

In [36]:

# df6[101:200]
df6.iloc[101:200]

Out[36]:

In [37]:

df6.iloc[-1:-101:-1]

Out[37]:

In [38]:

# 访问单元格
df6.at[2, '售价']

Out[38]:

169

In [39]:

df6.at[2, '售价'] = 999
df6

Out[39]:

In [40]:

df6.iat[2, -3] = 888
df6

Out[40]:

In [41]:

# 添加列
df6['销售额'] = df6['售价'] * df6['销售数量']
df6['季度'] = df6['销售日期'].dt.quarter
df6['月份'] = df6['销售日期'].dt.month
df6

Out[41]:

In [42]:

# 添加行 - 实际工作中基本没有意义

In [43]:

# 删除列
# inplace=False - 默认设定 - 不修改原对象返回修改后的新对象
# inplace=True - 直接修改DataFrame对象不返回新对象 - 方法没有返回值
df6.drop(columns=['季度'], inplace=True)
df6

Out[43]:

In [44]:

# 删除行
# df6.drop(index=[0, 1, 2, 100, 1944, 1943])
df6.drop(index=[0, 1, 2, 100, 1944, 1943], inplace=True)
df6

Out[44]:

In [45]:

df6.drop(index=df6.index[100:200], inplace=True)
df6

Out[45]:

In [46]:

# 重命名
df6.rename(columns={'销售区域': '区域', '销售渠道': '渠道', '销售订单': '订单号'}, inplace=True)
df6

Out[46]:

In [47]:

# 重置索引
# drop=False - 默认值 - 原来的索引变成一个普通列
# drop=True - 原来的索引直接丢弃
df6.reset_index(drop=True, inplace=True)
df6

Out[47]:

In [48]:

# 设置索引
df6.set_index('订单号', inplace=True)
df6

Out[48]:

In [49]:

# 筛选数据 - 布尔索引
df6[df6['销售额'] > 100000]

Out[49]:

In [50]:

df6[(df6['销售额'] > 100000) & (df6['月份'] == 6)]

Out[50]:

In [51]:

df6[(df6['销售额'] > 100000) | (df6['月份'] == 6)]

Out[51]:

In [52]:

df6.query('销售额 > 100000')

Out[52]:

In [53]:

df6.query('月份 == 6 and 渠道 == "实体"')

Out[53]:

In [54]:

df6.query('销售额 > 100000 and 月份 == 6')

Out[54]:

In [55]:

# 随机抽样
df6.sample(n=100)

Out[55]:

In [56]:

df6.sample(frac=0.05)

Out[56]:

In [57]:

# replace=False - 无放回抽样
ignore_rows = np.random.choice(np.arange(1, 1946), size=int(1945 * 0.9), replace=False)
pd.read_excel(
    'res/2020年销售数据.xlsx',
    sheet_name='data',
    skiprows=ignore_rows
)

Out[57]:

数据重塑

拼接（合并结构一致的数据）
合并（事实表连接维度表）

In [58]:

# 拼接两个DataFrame - union
all_emp_df = pd.concat([emp_df1, emp_df2])
all_emp_df.shape

Out[58]:

(19, 6)

In [59]:

# 连表 - 连接事实表和维度表 - 用维度把数据分组然后再做聚合
# 连接两个DataFrame（内连接、左外连接、右外连接、全外连接）- join
# how - 连表方式 - inner、left、right、outer
# on - 基于哪个字段连表 - left_on、right_on
all_emp_df = pd.merge(all_emp_df, dept_df, how='inner', on='dno')
all_emp_df

Out[59]:

In [60]:

# 作业：在jobs目录下有若干个CVS文件，它们的数据结构是一样的，现在需要把所有CSV文件的数据拼接到一个DataFrame中
import os

dfs = [pd.read_csv(os.path.join('res/jobs', filename))
       for filename in os.listdir('res/jobs') 
       if filename.endswith('.csv')]
pd.concat(dfs, ignore_index=True).to_csv('res/all_jobs.csv', index=False)

数据清洗

缺失值
重复值
异常值
预处理

In [61]:

# 甄别缺失值
all_emp_df.isna()

Out[61]:

In [62]:

# all_emp_df['comm'].isna()
all_emp_df['comm'].isnull()

Out[62]:

   False
   False
   False
    True
    True
    True
    True
   False
    True
    True
   True
   True
  False
  False
  False
  False
  False
  False
  False
Name: comm, dtype: bool

In [63]:

# all_emp_df['comm'].notna()
all_emp_df['comm'].notnull()

Out[63]:

    True
    True
    True
   False
   False
   False
   False
    True
   False
   False
  False
  False
   True
   True
   True
   True
   True
   True
   True
Name: comm, dtype: bool

In [64]:

all_emp_df['comm'].notna().value_counts()

Out[64]:

comm
True     11
False     8
Name: count, dtype: int64

In [65]:

# 删除空值 - 删除带有空值的行
all_emp_df.dropna()

Out[65]:

In [66]:

all_emp_df.dropna(axis=1)

Out[66]:

In [67]:

all_emp_df.mgr.dropna()

Out[67]:

   3344.0
   7800.0
   2056.0
   2056.0
   2056.0
   3088.0
   2056.0
   7800.0
   5566.0
   5566.0
  3344.0
  5566.0
  7800.0
  9800.0
  7800.0
  9800.0
Name: mgr, dtype: float64

In [68]:

# 填充空值
all_emp_df.fillna(0)

Out[68]:

In [69]:

all_emp_df.comm.fillna(0).astype('i8')

Out[69]:

    200
   1500
    800
      0
      0
      0
      0
    800
      0
      0
     0
     0
  1000
  1200
  8000
   600
  6000
  5000
  1200
Name: comm, dtype: int64

In [70]:

# 将空值下方的非空值向上填充 - backward fill
all_emp_df.comm.bfill()

Out[70]:

    200.0
   1500.0
    800.0
    800.0
    800.0
    800.0
    800.0
    800.0
   1000.0
   1000.0
  1000.0
  1000.0
  1000.0
  1200.0
  8000.0
   600.0
  6000.0
  5000.0
  1200.0
Name: comm, dtype: float64

In [71]:

# 将空值上方的非空值向下填充 - forward fill
all_emp_df.comm.ffill()

Out[71]:

    200.0
   1500.0
    800.0
    800.0
    800.0
    800.0
    800.0
    800.0
    800.0
    800.0
   800.0
   800.0
  1000.0
  1200.0
  8000.0
   600.0
  6000.0
  5000.0
  1200.0
Name: comm, dtype: float64

In [72]:

# 通过插值算法填充空值 - interpolate
all_emp_df['comm'] = all_emp_df.comm.interpolate(method='linear')

In [73]:

all_emp_df['comm'] = all_emp_df.comm.astype('i8')
all_emp_df

Out[73]:

In [74]:

all_emp_df['mgr'] = all_emp_df.mgr.fillna(-1).astype('i8')
all_emp_df

Out[74]:

In [75]:

# 甄别重复值
all_emp_df.ename.duplicated()

Out[75]:

   False
   False
   False
   False
   False
   False
   False
   False
   False
   False
  False
  False
  False
  False
   True
  False
   True
  False
  False
Name: ename, dtype: bool

In [76]:

all_emp_df.duplicated(['ename', 'job'])

Out[76]:

   False
   False
   False
   False
   False
   False
   False
   False
   False
   False
  False
  False
  False
  False
   True
  False
   True
  False
  False
dtype: bool

In [77]:

# 统计每个元素出现的频次
all_emp_df.ename.value_counts()

Out[77]:

ename
张三丰    3
胡一刀    1
朱九真    1
骆昊     1
王大锤    1
宋远桥    1
郭靖     1
苗人凤    1
杨过     1
乔峰     1
黄蓉     1
张翠山    1
欧阳锋    1
丘处机    1
张无忌    1
李莫愁    1
陈小刀    1
Name: count, dtype: int64

In [78]:

all_emp_df.job.value_counts()

Out[78]:

job
程序员     5
总裁      3
销售员     2
分析师     2
会计      2
设计师     1
销售主管    1
出纳      1
会计师     1
架构师     1
Name: count, dtype: int64

In [79]:

# 统计不重复的元素的个数
all_emp_df.ename.nunique()

Out[79]:

17

In [80]:

# 删除重复值
# keep='first' - 默认值，重复元素保留第一项 - 'last' / False
all_emp_df.drop_duplicates(['ename', 'job'], keep='last', inplace=True)
all_emp_df

Out[80]:

In [81]:

# 异常值的甄别
# 数值判定法（data < Q1 - 1.5 * IQR 或者 data > Q3 + 1.5 * IQR）


def find_outliers_by_iqr(data, whis=1.5):
    q1, q3 = np.quantile(data, [0.25, 0.75])
    iqr = q3 - q1
    return data[(data < q1 - whis * iqr) | (data > q3 + whis * iqr)]

In [82]:

temp = np.random.normal(80, 8, 50).round(0)
temp = np.append(temp, [120, 160, 200, 40, 20, -50])
temp

Out[82]:

array([ 83.,  81.,  89.,  89.,  76.,  79.,  78.,  76.,  79.,  74.,  89.,
        61.,  90.,  74.,  68.,  81.,  81.,  93.,  69.,  81.,  76.,  87.,
        80.,  90.,  72.,  89.,  72.,  71.,  93.,  75.,  75.,  73.,  85.,
        91.,  96.,  82.,  74.,  80.,  72.,  83.,  72.,  64.,  83.,  79.,
        78.,  68.,  68.,  70.,  68.,  84., 120., 160., 200.,  40.,  20.,
       -50.])

In [83]:

find_outliers_by_iqr(temp)

Out[83]:

array([120., 160., 200.,  40.,  20., -50.])

In [84]:

find_outliers_by_iqr(temp, whis=3)

Out[84]:

array([160., 200.,  20., -50.])

In [85]:

# zscore判定法（三西格玛法则 ---> 68-95-99.7法则）


def find_outliers_by_zscore(data, mul=3):
    mu, sigma = np.mean(data), np.std(data)
    zscore = (data - mu) / sigma
    return data[np.abs(zscore) > mul]

In [86]:

find_outliers_by_zscore(temp)

Out[86]:

array([200., -50.])

In [87]:

find_outliers_by_zscore(temp, mul=2)

Out[87]:

array([160., 200.,  20., -50.])

In [88]:

find_outliers_by_zscore(df6.直接成本)

Out[88]:

订单号
G69924           23688
G70509           31935
G72204           26758
G70509           31594
G72186           30583
G70509           52302
G69631           32125
543369-010       29843
543367-077       31889
G69627           31028
G69645           23947
G72201           40327
G69631           26534
543367-077       25674
G71332           35120
588705-010       25502
543367-077       31375
G68188           29819
577714-010       40884
FT001-18-1763    25835
G72186           24770
G71330           29795
577714-010       27244
FT007-18-1763    25454
G69627           24634
G69627           23537
G72204           31613
543367-077       45442
G85411           22861
G68188           34290
AYMH063-1        29307
G69645           37782
Name: 直接成本, dtype: int64

In [89]:

# 根据离群点的行索引删除行
df6.drop(index=find_outliers_by_zscore(df6.直接成本).index)

Out[89]:

In [90]:

med_value = np.median(temp)
med_value

Out[90]:

79.0

In [91]:

find_outliers_by_zscore(temp, mul=2)

Out[91]:

array([160., 200.,  20., -50.])

In [92]:

np.in1d(temp, find_outliers_by_zscore(temp, mul=2))

Out[92]:

array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False,  True,  True, False,
        True,  True])

In [93]:

# 替换离群点
np.place(temp, np.in1d(temp, find_outliers_by_zscore(temp, mul=2)), med_value)

In [94]:

temp

Out[94]:

array([ 83.,  81.,  89.,  89.,  76.,  79.,  78.,  76.,  79.,  74.,  89.,
        61.,  90.,  74.,  68.,  81.,  81.,  93.,  69.,  81.,  76.,  87.,
        80.,  90.,  72.,  89.,  72.,  71.,  93.,  75.,  75.,  73.,  85.,
        91.,  96.,  82.,  74.,  80.,  72.,  83.,  72.,  64.,  83.,  79.,
        78.,  68.,  68.,  70.,  68.,  84., 120.,  79.,  79.,  40.,  79.,
        79.])

案例1：招聘数据清洗和预处理

数据加载
去重
数据抽取
拆分列
替换值
数据筛选

In [95]:

jobs_df = pd.read_csv('res/all_jobs.csv')
jobs_df.head(10)

Out[95]:

In [96]:

# 根据URI列去重
jobs_df.drop_duplicates('uri', inplace=True)
jobs_df.shape

Out[96]:

(9777, 9)

In [97]:

# 通过正则表达式从列中提取信息
jobs_df[['salary_lower', 'salary_upper']] = jobs_df.salary.str.extract(r'(\d+)-(\d+)').astype('i8')
jobs_df['salary'] = (jobs_df.salary_lower + jobs_df.salary_upper) / 2
jobs_df

Out[97]:

In [98]:

jobs_df.drop(columns=['uri', 'city'], inplace=True)
jobs_df

Out[98]:

In [99]:

# 拆分列
jobs_df['city'] = jobs_df.site.str.split(expand=True)[0]
jobs_df.drop(columns='site', inplace=True)
jobs_df

Out[99]:

In [100]:

# 字符串正则表达式替换
jobs_df['year'] = jobs_df.year.replace(r'5-10年|10年以上', '5年以上', regex=True)

In [101]:

jobs_df.year.unique()

Out[101]:

array(['1-3年', '3-5年', '1年以内', '经验不限', '5年以上', '应届生'], dtype=object)

In [102]:

jobs_df['edu'] = jobs_df.edu.replace(r'中专|高中', '学历不限', regex=True)
jobs_df['edu'] = jobs_df.edu.replace(r'硕士|博士', '研究生', regex=True)

In [103]:

jobs_df.edu.unique()

Out[103]:

array(['本科', '大专', '学历不限', '研究生'], dtype=object)

In [104]:

jobs_df['job_name'] = jobs_df.job_name.str.lower()
jobs_df = jobs_df[jobs_df.job_name.str.contains('python|数据|产品|运营|data', regex=True)]
jobs_df.shape

Out[104]:

(6487, 9)

In [105]:

jobs_df.to_csv('res/cleand_jobs.csv', index=False)

案例2：北京积分落户数据预处理

加载数据
日期时间处理
年龄段分箱
落户积分归一化

In [106]:

settle_df = pd.read_csv('res/2023年北京积分落户数据.csv', index_col='公示编号')
settle_df.head(5)

Out[106]:

In [107]:

settle_df.info()

Out[107]:

<class 'pandas.core.frame.DataFrame'>
Index: 6003 entries, 202300001 to 202306003
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   姓名      6003 non-null   object 
 1   出生年月    6003 non-null   object 
 2   单位名称    6003 non-null   object 
 3   积分分值    6003 non-null   float64
dtypes: float64(1), object(3)
memory usage: 234.5+ KB

In [108]:

# 将字符串处理成日期
settle_df['出生年月'] = pd.to_datetime(settle_df['出生年月'])
settle_df.info()

Out[108]:

<class 'pandas.core.frame.DataFrame'>
Index: 6003 entries, 202300001 to 202306003
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   姓名      6003 non-null   object        
 1   出生年月    6003 non-null   datetime64[ns]
 2   单位名称    6003 non-null   object        
 3   积分分值    6003 non-null   float64       
dtypes: datetime64[ns](1), float64(1), object(2)
memory usage: 234.5+ KB

In [109]:

# 将生日换算成年龄
settle_df['年龄'] = (pd.to_datetime('2023-01-01') - settle_df.出生年月).dt.days // 365
settle_df.head(5)

Out[109]:

In [110]:

# 将年龄划分到年龄段 - 分箱 - 数据桶
settle_df['年龄段'] = pd.cut(
    settle_df.年龄,
    bins=np.arange(35, 61, 5),
    labels=['35~39岁', '40~44岁', '45~49岁', '50~54岁', '55~59岁'],
    right=False
)
settle_df.head(5)

Out[110]:

In [111]:

# 统计每个元素出现的频次
temp = settle_df.年龄段.value_counts()
temp

Out[111]:

年龄段
40~44岁    4215
45~49岁    1053
35~39岁     681
50~54岁      34
55~59岁      20
Name: count, dtype: int64

In [112]:

plt.cm.Greens(np.linspace(0.9, 0.1, 5))

Out[112]:

array([[0.        , 0.39277201, 0.15816993, 1.        ],
       [0.18246828, 0.59332564, 0.30675894, 1.        ],
       [0.45176471, 0.76708958, 0.46120723, 1.        ],
       [0.72312188, 0.88961169, 0.69717801, 1.        ],
       [0.91326413, 0.96670511, 0.89619377, 1.        ]])

In [113]:

# 绘制柱状图
temp.plot(
    kind='bar',      # 图表类型
    figsize=(8, 4),  # 图表尺寸
    xlabel='',       # 横轴标签
    ylabel='Count',  # 纵轴标签
    width=0.5,       # 柱子宽度
    hatch='//',      # 柱子条纹
    color=plt.cm.Greens(np.linspace(0.9, 0.3, temp.size))  # 颜色值
)

for i in range(temp.size):
    # plt.text(横坐标, 纵坐标, 标签内容)
    plt.text(i, temp.iloc[i] + 30, temp.iloc[i], ha='center')

# 定制横轴的刻度
plt.xticks(rotation=0)
plt.show()

Out[113]:

In [114]:

# 绘制饼图
temp.plot(
    kind='pie',
    ylabel='',
    autopct='%.1f%%',           # 自动计算并显示百分比
    wedgeprops={'width': 0.3},  # 环状结构部分的宽度
    pctdistance=0.85,           # 百分比到圆心的距离
    labeldistance=1.1,          # 标签到圆心的距离
    # shadow=True,              # 阴影效果
    # startangle=0,             # 起始角度
    counterclock=True,          # 是否反时针方向绘制
)
plt.show()

Out[114]:

In [115]:

# agg - aggregate - 聚合
settle_df.积分分值.agg(['mean', 'max', 'min', 'std', 'skew', 'kurt'])

Out[115]:

mean    111.849640
max     140.050000
min     109.920000
std       2.481941
skew      3.485351
kurt     17.390027
Name: 积分分值, dtype: float64

线性归一化： $x^{\prime} = \frac{x - x_{min}}{x_{max} - x_{min}}$

In [116]:

# 将积分分值处理成0~1范围的值
max_score, min_score = settle_df.积分分值.agg(['max', 'min'])
max_score, min_score

Out[116]:

(140.05, 109.92)

In [117]:

# map - 映射 - 将指定的函数作用到数据系列的每个元素上
# apply - 应用 - 将指定的函数应用到数据系列的每个元素上
settle_df['线性归一化积分'] = settle_df.积分分值.map(lambda x: (x - min_score) / (max_score - min_score)).round(2)
settle_df

Out[117]:

zscore标准化： $x^{\prime} = \frac{x - \mu}{\sigma}$

In [118]:

mu, sigma = settle_df.积分分值.agg(['mean', 'std'])
settle_df['zscore评分'] = settle_df.积分分值.apply(lambda x: (x - mu) / sigma)
settle_df

Out[118]: