GitHub Repository: CloudPak-Outcomes/Outcomes-Projects
Path: blob/main/L4assets/DSandMLOpsAssets/HandsOn/Notebooks/DS modeling records.ipynb
¹⁹²⁸ views

Kernel: Python 3.10

Create the modeling records

CPDaaS: Make sure to first insert a "project token"

Click on the three vertical dots icon in the uper right of the screen, then click on Insert project token

Once inserted, execute the cell.

A project token is only available if you followed the prerequesite instructions to create on in your project.

Modeling record tasks

Create a set of records for distances from every household to each of the five cluster center
Extract modelingdata from the Db2 tables
Add the distances to the modeling records
Save the result as a CSV file on the project

In [ ]:

import warnings
import pandas as pd
import numpy as np
import math
import time
import os
from ibm_watson_studio_lib import access_project_or_space

import ibm_db
import ibm_db_dbi

# Get access to the prohject API for CPD on-premises
if "USER_ID" in os.environ :
    wslib = access_project_or_space()

Connect to Db2

In [ ]:

# Db2 instance information
username = "bluadmin" 
password = "3lxF4Yp8S@VrEoQvSRd5r91VLaAhk" 
database = "bludb" 
hostname = "db2w-ovqfeqq.us-south.db2w.cloud.ibm.com"
port     = 50001

In [ ]:

## Connect to the database
credentials = {
    'username': username,
    'password': password,
    'database': 'BLUDB',
    'host': hostname,
    'port': port
}
dsn = (
    "DRIVER={{IBM DB2 ODBC DRIVER}};"
    "DATABASE={0};"
    "HOSTNAME={1};"
    "PORT={2};"
    "PROTOCOL=TCPIP;"
    "SECURITY=ssl;"
    "UID={3};"
    "PWD={4};").format(credentials['database'], credentials['host'],
                       credentials['port'], credentials['username'],
                       credentials['password'])

conn = ibm_db.connect(dsn, "", "")
pconn = ibm_db_dbi.Connection(conn)

# Ignore warnings about the driver
warnings.filterwarnings("ignore") # one of "error", "ignore", "always", "default", "module", or "once"

Get the cluster data

The columns are renamed to match the names form the comparison lab. They are also easier to remember.

In [ ]:

body = wslib.load_data("SPSSClustersAll.csv")
spssall_df = pd.read_csv(body)

# Order the clusters by count, descending
spssall_df.sort_values("Record_Count", ascending=False, ignore_index=True, inplace=True)
spssall_df['$XC-autocluster'] = spssall_df.index
spssall_df = spssall_df.rename(columns={"latitude_Mean": "latitude", "longitude_Mean": "longitude",
                         "$XC-autocluster": "cluster", "Record_Count": "cnt"})
spssall_df.head()

Calculate distances from households to each cluster center

In the Chicago area, the value 0.00015 represents roughly:

Horizontal (longitudinal) distance: 40 feet Vertical (latitudinal) distance: 54 feet Diagonal distance: 68 feet

Here, we use geodesic from geopy.distance to calculate the distance between two points (latitude, longitude)

In [ ]:

# Install geopy to calculate distances
!pip install geopy 2>&1 >geopy.out
from geopy.distance import geodesic

Get the household information from Db2

In [ ]:

stmt = """
SELECT HOUSEHOLD_ID, LATITUDE, LONGITUDE
FROM   INSURANCE.INSURANCE_HOUSEHOLD 
"""
hh_pd = pd.read_sql(stmt, pconn)
hh_pd.head()

Loop to create the distance information

Time how long it takes. Should be between four and eight minutes.

In [ ]:

# calculate the distance from household to cluster centers
t0 = time.time()
hotspots = ['HOTSPOT1','HOTSPOT2','HOTSPOT3','HOTSPOT4','HOTSPOT5']
for hs in range(spssall_df.shape[0]) :
    hh_pd[hotspots[hs]] = -1
    for ix in range(hh_pd.shape[0]) :
        hh_pd.at[ix, hotspots[hs]] = math.floor( .5 + geodesic((hh_pd.iloc[ix]['LATITUDE'].item(),
                                      hh_pd.iloc[ix]['LONGITUDE'].item()),
                                     (spssall_df.iloc[hs]['latitude'].item(),
                                      spssall_df.iloc[hs]['longitude'].item())).miles)
t1 = time.time()

total = t1-t0
print("Timing: {} minutes".format(total/60))

Extract the data from the Db2 tables

The data needed comes from multiple tables and sometime aggreagates on some tables. If you are not careful, you may lose records through the join. Some tables may not have matches so for this reason you have to judiciously use the outer join construct. For example, there are few claims on the policies. If you join the claims to the policies, you'll get only records for policies that have claims, discarding most of the policies. The effect is then felt throughout the execution and results in 859 rows instand of over 10,000.

The fields required for modeling are:

field name	description
household_id	household unique id
policy_id	policy unique id
last_name	primary driver last name
zipcode	household zipcode
hotspot1 to 5	distances from each cluster center in miles rounded to the nearest integer
nb_drivers	number of drivers on the policy
under25	number of drivers under the age of 25
over60	number of drivers over the age of 60
male25	number of male drivers under the age of 25
prim_driver_age	primary driver age
prim_driver_gender	primary driver gender
make_model	car make and model
model_year	car year
initial_odometer	initial odometer reading
low_mileage_use	indicator that the policy is for low mileage drivers
nb_claims	number of claims filed against the policy
claim_total	claims amount total
nb_prim_claims	number of claims filed by the primary driver
prim_claim_total	claims amount filed by the primary driver
risk	assigne risk factor

Extraction strategy: divide and conquer

Data needs to be processed and retrieved from multiple tables. It is better to divide the work in parts that retrieve specific fields and join them together to get the final result.

This is the stategy used in the SQL below. It takes it one table at a time, taking advantage of the SELECT.. FROM (SELECT...) virtual table construct.

The left outer join's are also resolved in the virtual table constructs.

SQL performance gotcha

Beware of using correlated subqueries:

WRONG:
-- Get the latest policy but also primary driver info
(SELECT P1.*, D1.birthdate prim_driver_birthdate, D1.gender prim_driver_gender, risk
FROM INSURANCE.INSURANCE_POLICY P1,
INSURANCE.INSURANCE_DRIVER D1
WHERE P1.household_id = MD.household_id
AND P1.START_DATE = (SELECT MAX(START_DATE) maxdate
FROM INSURANCE.INSURANCE_POLICY P2
WHERE P2.HOUSEHOLD_ID = P1.HOUSEHOLD_ID
GROUP BY HOUSEHOLD_ID)
AND P1.primary_driver_id = D1.driver_id
) AS POLICY
RIGHT:

-- Get the latest policy but also primary driver info
(SELECT P1.*, D1.birthdate prim_driver_birthdate, D1.gender prim_driver_gender, risk
FROM INSURANCE.INSURANCE_POLICY P1,
INSURANCE.INSURANCE_DRIVER D1,
(SELECT household_id, MAX(START_DATE) maxdate
FROM INSURANCE.INSURANCE_POLICY
GROUP BY HOUSEHOLD_ID) as MD
WHERE P1.household_id = MD.household_id
AND P1.START_DATE = maxdate
AND P1.primary_driver_id = D1.driver_id
) AS POLICY

Using a correlated subquery forces the SQL engine to execute the statement for each row. Using a join instead greatly speeds up the processing. It goes from minutes to a few seconds.

In [ ]:

# Create one record per household using the latest policy
# This returns 100346 rows but there are only 100169 households (diff 77)
# This means that there are some households that have multiple policies starting on the max(start_date)
#
stmt = """
SELECT DRIVER.household_id, POLICY.policy_id, DRIVER.last_name, HOUSEHOLD.zipcode, 
       0 as hotspot1, 0 as hotspot2, 0 as hotspot3, 0 as hotspot4, 0 as hotspot5,
       DR_AGGR.nb_drivers, DR_AGGR.under25, DR_AGGR.over60, DR_AGGR.male25,
       -- primary driver age and gender
       YEAR(CURRENT DATE - POLICY.prim_driver_birthdate) prim_driver_age, POLICY.prim_driver_gender,
       CONCAT(CONCAT(POLICY.make,' '), POLICY.model) as make_model, POLICY.model_year, POLICY.initial_odometer,
       POLICY.low_mileage_use, CL_AGGR.nb_claims, CL_AGGR.claim_total,
       -- claim by primary driver, amount by primary driver
       PRIM_CL_AGGR.nb_prim_claims, PRIM_CL_AGGR.prim_claim_total, POLICY.risk
FROM   INSURANCE.INSURANCE_HOUSEHOLD AS HOUSEHOLD,
       INSURANCE.INSURANCE_DRIVER AS DRIVER,
       -- Get the latest policy but also primary driver info
       (SELECT P1.*, D1.birthdate prim_driver_birthdate, D1.gender prim_driver_gender
        FROM INSURANCE.INSURANCE_POLICY P1,
             INSURANCE.INSURANCE_DRIVER D1,
             (SELECT household_id, MAX(START_DATE) maxdate
              FROM   INSURANCE.INSURANCE_POLICY
              GROUP BY HOUSEHOLD_ID) as MD
        WHERE P1.household_id = MD.household_id
        AND   P1.START_DATE = maxdate
        AND   P1.primary_driver_id = D1.driver_id   
       ) AS POLICY,
       (SELECT household_id, COUNT(DRIVER_ID) as nb_drivers,
               SUM(case when YEAR(CURRENT DATE - BIRTHDATE) < 25 then 1 else 0 end) under25,
               SUM(case when YEAR(CURRENT DATE - BIRTHDATE) > 59 then 1 else 0 end) over60,
               SUM(case when YEAR(CURRENT DATE - BIRTHDATE) < 25 and gender = 'M' then 1 else 0 end) male25
        FROM INSURANCE.INSURANCE_DRIVER
        GROUP BY household_id) AS DR_AGGR,

       -- Needs an outer join to make sure we have something for all households
       (SELECT HH.household_id, COUNT(claim_id) as nb_claims, SUM(claim_amount) claim_total
         FROM INSURANCE.INSURANCE_HOUSEHOLD HH LEFT OUTER JOIN INSURANCE.INSURANCE_CLAIM CL
         ON HH.household_id = CL.household_id
         GROUP BY HH.household_id) as CL_AGGR,

       -- Needs an outer join
       (SELECT PP.household_id, COUNT(*) as nb_prim_claims, SUM(claim_amount) prim_claim_total
        FROM INSURANCE.INSURANCE_POLICY PP LEFT OUTER JOIN INSURANCE.INSURANCE_CLAIM CL
        ON   PP.policy_id = CL.policy_id
        AND  PP.primary_driver_id = CL.DRIVER_ID
        GROUP BY PP.household_id) as PRIM_CL_AGGR

WHERE HOUSEHOLD.household_id = DRIVER.household_id
AND   DRIVER.household_id = POLICY.household_id
AND   POLICY.primary_driver_id = DRIVER.driver_id
AND   DR_AGGR.household_id = POLICY.household_id
AND   CL_AGGR.household_id = POLICY.household_id
AND   PRIM_CL_AGGR.household_id = POLICY.household_id
;
"""
t0 = time.time()
data_pd = pd.read_sql(stmt, pconn)
t1 = time.time()

total = t1-t0
print("Timing: {:.2f} seconds".format(total))

Merge the distances with the insurance records

In [ ]:

xx_pd = pd.merge(data_pd, hh_pd, on=['HOUSEHOLD_ID'])
xx2_pd = xx_pd[['HOUSEHOLD_ID', 'POLICY_ID', 'LAST_NAME', 'ZIPCODE', 'HOTSPOT1_y',
       'HOTSPOT2_y', 'HOTSPOT3_y', 'HOTSPOT4_y', 'HOTSPOT5_y', 'NB_DRIVERS',
       'UNDER25', 'OVER60', 'MALE25', 'PRIM_DRIVER_AGE', 'PRIM_DRIVER_GENDER',
       'MAKE_MODEL', 'MODEL_YEAR', 'INITIAL_ODOMETER', 'LOW_MILEAGE_USE',
       'NB_CLAIMS', 'CLAIM_TOTAL', 'NB_PRIM_CLAIMS', 'PRIM_CLAIM_TOTAL', 'RISK']]
xx3_pd = xx2_pd.rename(columns={'HOTSPOT1_y': 'HOTSPOT1', 'HOTSPOT2_y': 'HOTSPOT2',
                       'HOTSPOT3_y': 'HOTSPOT3', 'HOTSPOT4_y': 'HOTSPOT4', 'HOTSPOT5_y': 'HOTSPOT5'})
xx3_pd.head()

Write the result into a CSV file into the project

for local file: wslib.upload_file()
for stream:

sdata = wslib.load_data(filename)
newfile = wslib.save_data("mynewfile.csv", sdata.read())

In [ ]:

xx3_pd.to_csv("ModelingRecords.csv", index=False)
res = wslib.upload_file('ModelingRecords.csv')
print("File {} uploaded".format(res['name']))

Author

Jacques Roy is a member of the IBM Enablement for Data and AI