Path: blob/master/notebooks/book1/12/poisson_regression_insurance.ipynb
Kernel: Python 3
Poisson regression for predicting insurance claim rates
In [5]:
In [1]:
Out[1]:
0.24.2
Data
In [3]:
Out[3]:
In [6]:
Out[6]:
Average Frequency = 0.10070308464041304
Fraction of exposure with zero claims = 93.9%
In [34]:
Out[34]:
0 10.000000
1 1.298701
2 1.333333
3 11.111111
4 1.190476
...
677988 0.000000
677989 0.000000
677990 0.000000
677991 0.000000
677992 0.000000
Name: Frequency, Length: 677993, dtype: float64
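The `Frequency` target shown above is the number of claims divided by the policy's exposure (duration in years). A minimal sketch, assuming the freMTPL2freq column names `ClaimNb` and `Exposure`; the toy rows are hypothetical, with exposures chosen so the ratios match the first entries of the series above:

```python
import pandas as pd

# Hypothetical mini-sample with the assumed freMTPL2freq schema:
# ClaimNb = number of claims filed, Exposure = policy duration in years.
df = pd.DataFrame({
    "ClaimNb":  [1, 1, 1, 1, 1],
    "Exposure": [0.10, 0.77, 0.75, 0.09, 0.84],
})

# Claim frequency = claims per year of exposure.
df["Frequency"] = df["ClaimNb"] / df["Exposure"]
print(df["Frequency"])
```

Note that short policies with a claim produce very large frequencies (1 claim over 0.10 years is a rate of 10 claims/year), which is why the exposure-weighted average frequency reported above (total claims divided by total exposure) is much smaller than the unweighted mean.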
In [7]:
In [29]:
Out[29]:
Feature engineering
In [8]:
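The fitted preprocessing pipeline is echoed verbatim in `Out[17]` further down; reconstructed here as runnable code (the variable name `preprocessor` is an assumption):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import (FunctionTransformer, KBinsDiscretizer,
                                   OneHotEncoder, StandardScaler)

# Reconstructed from the pipeline repr in Out[17]:
# - BonusMalus passes through unchanged,
# - VehAge / DrivAge are binned into 10 intervals,
# - Density is log-transformed then standardized,
# - the categorical columns are one-hot encoded.
preprocessor = ColumnTransformer(
    [
        ("passthrough_numeric", "passthrough", ["BonusMalus"]),
        ("binned_numeric", KBinsDiscretizer(n_bins=10), ["VehAge", "DrivAge"]),
        ("log_scaled_numeric",
         make_pipeline(FunctionTransformer(np.log), StandardScaler()),
         ["Density"]),
        ("onehot_categorical", OneHotEncoder(),
         ["VehBrand", "VehPower", "VehGas", "Region", "Area"]),
    ]
)
```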
Evaluation metrics
In [9]:
In [30]:
Dummy model
This model just predicts the overall mean frequency.
In [35]:
In [36]:
Out[36]:
[0.26520328 0.26520328 0.26520328 0.26520328 0.26520328]
We need to weight the examples by exposure, for reasons explained here: https://github.com/scikit-learn/scikit-learn/issues/18059
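A sketch of the fix on a hypothetical two-policy portfolio: passing `sample_weight=Exposure` to `fit` makes the dummy predict the exposure-weighted mean, i.e. total claims divided by total exposure, instead of the plain mean of the per-policy frequencies:

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Hypothetical toy portfolio: two policies, each with one claim.
exposure = np.array([0.1, 1.0])   # years
claims = np.array([1.0, 1.0])
freq = claims / exposure          # [10., 1.]
X = np.zeros((2, 1))              # DummyRegressor ignores the features

plain = DummyRegressor(strategy="mean").fit(X, freq)
weighted = DummyRegressor(strategy="mean").fit(X, freq, sample_weight=exposure)

print(plain.constant_[0, 0])      # unweighted mean of [10, 1] -> 5.5
print(weighted.constant_[0, 0])   # total claims / total exposure = 2/1.1
```

The unweighted mean is dominated by short policies with a claim, which is why the first dummy above predicts 0.265 while the weighted one recovers the true portfolio rate of about 0.101.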
In [37]:
In [38]:
Out[38]:
[0.10069261 0.10069261 0.10069261 0.10069261 0.10069261]
In [13]:
Out[13]:
Constant mean frequency evaluation:
MSE: 0.564
MAE: 0.189
mean Poisson deviance: 0.625
Linear regression
Linear regression is an okay baseline, but not the most appropriate model for count data. We set the L2 regularization to a small value, since the training set is large.
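Because a linear model's predictions are unconstrained, it can return negative frequencies, which is the reason for the warning below about non-positive predictions being ignored. A self-contained illustration on synthetic data (all names and values hypothetical):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_poisson_deviance

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 1))
# Non-negative, count-like target with an exponential mean function.
y = rng.poisson(np.exp(X[:, 0] - 2)).astype(float)

ridge = Ridge(alpha=1e-6).fit(X, y)
pred = ridge.predict(X)

# A linear fit to a non-negative target can dip below zero; mask those
# samples before computing the Poisson deviance, which requires pred > 0.
mask = pred > 0
dev = mean_poisson_deviance(y[mask], pred[mask])
print(f"{(~mask).sum()} non-positive predictions out of {n}")
print(f"mean Poisson deviance (masked): {dev:.3f}")
```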
In [15]:
In [16]:
Out[16]:
Ridge evaluation:
MSE: 0.560
MAE: 0.177
WARNING: Estimator yields invalid, non-positive predictions for 1315 samples out of 223745. These predictions are ignored when computing the Poisson deviance.
mean Poisson deviance: 0.601
Poisson linear regression
We set the L2 regularizer to the same value as in ridge regression, but divided by the number of training samples, since Poisson regression penalizes the average (rather than summed) log-likelihood.
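The scaling can be checked with a little arithmetic. `Ridge` minimizes a penalized *sum* of squared errors, while `PoissonRegressor` minimizes a penalized *mean* deviance, so a comparable penalty strength is the ridge alpha divided by the training-set size. With 677,993 rows in total and 223,745 test samples, the training split has 454,248 rows:

```python
# Ridge minimizes ||y - Xw||^2 + alpha * ||w||^2 (a sum over samples),
# while PoissonRegressor minimizes mean deviance + alpha * ||w||^2 / 2,
# so dividing the ridge alpha by n_train gives a comparable strength.
alpha_ridge = 1e-6            # the ridge alpha echoed in the calibration section
n_train = 677_993 - 223_745   # total rows minus test samples = 454,248
alpha_poisson = alpha_ridge / n_train
print(f"{alpha_poisson:.1e}")  # same order of magnitude as the 1e-12 used here
```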
In [17]:
Out[17]:
Pipeline(steps=[('preprocessor',
ColumnTransformer(transformers=[('passthrough_numeric',
'passthrough',
['BonusMalus']),
('binned_numeric',
KBinsDiscretizer(n_bins=10),
['VehAge', 'DrivAge']),
('log_scaled_numeric',
Pipeline(steps=[('functiontransformer',
FunctionTransformer(func=<ufunc 'log'>)),
('standardscaler',
StandardScaler())]),
['Density']),
('onehot_categorical',
OneHotEncoder(),
['VehBrand', 'VehPower',
'VehGas', 'Region',
'Area'])])),
('regressor', PoissonRegressor(alpha=1e-12, max_iter=300))])
In [18]:
Out[18]:
PoissonRegressor evaluation:
MSE: 0.560
MAE: 0.186
mean Poisson deviance: 0.594
Comparison
In [31]:
Out[31]:
In [44]:
Out[44]:
In [42]:
Out[42]:
DummyRegressor
MSE: 0.564
MAE: 0.189
mean Poisson deviance: 0.625
Ridge
MSE: 0.560
MAE: 0.177
WARNING: Estimator yields invalid, non-positive predictions for 1315 samples out of 223745. These predictions are ignored when computing the Poisson deviance.
mean Poisson deviance: 0.601
PoissonRegressor
MSE: 0.560
MAE: 0.186
mean Poisson deviance: 0.594
Calibration plot
In [23]:
In [43]:
Out[43]:
Actual number of claims: 11935.0
Predicted number of claims by DummyRegressor(): 11931.2
Predicted number of claims by Ridge(alpha=1e-06): 10693.1
Predicted number of claims by PoissonRegressor(alpha=1e-12, max_iter=300): 11930.8
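The totals above come from converting predicted claim *frequencies* back into claim *counts*: multiply each policy's predicted frequency by its exposure, then sum. A minimal sketch with hypothetical numbers:

```python
import numpy as np

# Hypothetical test portfolio: per-policy exposures and claim counts.
exposure = np.array([0.5, 1.0, 0.25, 0.8])
claim_nb = np.array([0.0, 1.0, 0.0, 1.0])

# A dummy-style model predicts one constant frequency for every policy:
# the exposure-weighted mean, i.e. total claims / total exposure.
pred_freq = np.full_like(exposure, claim_nb.sum() / exposure.sum())

print("Actual number of claims:", claim_nb.sum())
print("Predicted number of claims:", (pred_freq * exposure).sum())
```

By construction, any model that predicts the exposure-weighted mean frequency matches the actual total exactly, which is consistent with the dummy regressor being nearly perfectly calibrated in aggregate while the ridge model underpredicts.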
In [ ]: