Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
CloudPak-Outcomes
GitHub Repository: CloudPak-Outcomes/Outcomes-Projects
Path: blob/main/L4assets/DSandMLOpsAssets/HandsOn/Notebooks/DS modeling records.ipynb
1928 views
Kernel: Python 3.10

Create the modeling records

CPDaaS: Make sure to first insert a "project token"

Click on the three vertical dots icon in the uper right of the screen, then click on Insert project token

Once inserted, execute the cell.

A project token is only available if you followed the prerequesite instructions to create on in your project.

Modeling record tasks

  • Create a set of records for distances from every household to each of the five cluster center

  • Extract modelingdata from the Db2 tables

  • Add the distances to the modeling records

  • Save the result as a CSV file on the project

import warnings import pandas as pd import numpy as np import math import time import os from ibm_watson_studio_lib import access_project_or_space import ibm_db import ibm_db_dbi # Get access to the prohject API for CPD on-premises if "USER_ID" in os.environ : wslib = access_project_or_space()

Connect to Db2

# Db2 instance information username = "bluadmin" password = "3lxF4Yp8S@VrEoQvSRd5r91VLaAhk" database = "bludb" hostname = "db2w-ovqfeqq.us-south.db2w.cloud.ibm.com" port = 50001
## Connect to the database credentials = { 'username': username, 'password': password, 'database': 'BLUDB', 'host': hostname, 'port': port } dsn = ( "DRIVER={{IBM DB2 ODBC DRIVER}};" "DATABASE={0};" "HOSTNAME={1};" "PORT={2};" "PROTOCOL=TCPIP;" "SECURITY=ssl;" "UID={3};" "PWD={4};").format(credentials['database'], credentials['host'], credentials['port'], credentials['username'], credentials['password']) conn = ibm_db.connect(dsn, "", "") pconn = ibm_db_dbi.Connection(conn) # Ignore warnings about the driver warnings.filterwarnings("ignore") # one of "error", "ignore", "always", "default", "module", or "once"

Get the cluster data

The columns are renamed to match the names form the comparison lab. They are also easier to remember.

body = wslib.load_data("SPSSClustersAll.csv") spssall_df = pd.read_csv(body) # Order the clusters by count, descending spssall_df.sort_values("Record_Count", ascending=False, ignore_index=True, inplace=True) spssall_df['$XC-autocluster'] = spssall_df.index spssall_df = spssall_df.rename(columns={"latitude_Mean": "latitude", "longitude_Mean": "longitude", "$XC-autocluster": "cluster", "Record_Count": "cnt"}) spssall_df.head()

Calculate distances from households to each cluster center

In the Chicago area, the value 0.00015 represents roughly:

Horizontal (longitudinal) distance: 40 feet Vertical (latitudinal) distance: 54 feet Diagonal distance: 68 feet

Here, we use geodesic from geopy.distance to calculate the distance between two points (latitude, longitude)

# Install geopy to calculate distances !pip install geopy 2>&1 >geopy.out from geopy.distance import geodesic

Get the household information from Db2

stmt = """ SELECT HOUSEHOLD_ID, LATITUDE, LONGITUDE FROM INSURANCE.INSURANCE_HOUSEHOLD """ hh_pd = pd.read_sql(stmt, pconn) hh_pd.head()

Loop to create the distance information

Time how long it takes. Should be between four and eight minutes.

# calculate the distance from household to cluster centers t0 = time.time() hotspots = ['HOTSPOT1','HOTSPOT2','HOTSPOT3','HOTSPOT4','HOTSPOT5'] for hs in range(spssall_df.shape[0]) : hh_pd[hotspots[hs]] = -1 for ix in range(hh_pd.shape[0]) : hh_pd.at[ix, hotspots[hs]] = math.floor( .5 + geodesic((hh_pd.iloc[ix]['LATITUDE'].item(), hh_pd.iloc[ix]['LONGITUDE'].item()), (spssall_df.iloc[hs]['latitude'].item(), spssall_df.iloc[hs]['longitude'].item())).miles) t1 = time.time() total = t1-t0 print("Timing: {} minutes".format(total/60))

Extract the data from the Db2 tables

The data needed comes from multiple tables and sometime aggreagates on some tables. If you are not careful, you may lose records through the join. Some tables may not have matches so for this reason you have to judiciously use the outer join construct. For example, there are few claims on the policies. If you join the claims to the policies, you'll get only records for policies that have claims, discarding most of the policies. The effect is then felt throughout the execution and results in 859 rows instand of over 10,000.

The fields required for modeling are:

field namedescription
household_idhousehold unique id
policy_idpolicy unique id
last_nameprimary driver last name
zipcodehousehold zipcode
hotspot1 to 5distances from each cluster center in miles rounded to the nearest integer
nb_driversnumber of drivers on the policy
under25number of drivers under the age of 25
over60number of drivers over the age of 60
male25number of male drivers under the age of 25
prim_driver_ageprimary driver age
prim_driver_genderprimary driver gender
make_modelcar make and model
model_yearcar year
initial_odometerinitial odometer reading
low_mileage_useindicator that the policy is for low mileage drivers
nb_claimsnumber of claims filed against the policy
claim_totalclaims amount total
nb_prim_claimsnumber of claims filed by the primary driver
prim_claim_totalclaims amount filed by the primary driver
riskassigne risk factor

Extraction strategy: divide and conquer

Data needs to be processed and retrieved from multiple tables. It is better to divide the work in parts that retrieve specific fields and join them together to get the final result.

This is the stategy used in the SQL below. It takes it one table at a time, taking advantage of the SELECT.. FROM (SELECT...) virtual table construct.

The left outer join's are also resolved in the virtual table constructs.

SQL performance gotcha

Beware of using correlated subqueries:

  • WRONG:
    -- Get the latest policy but also primary driver info
    (SELECT P1.*, D1.birthdate prim_driver_birthdate, D1.gender prim_driver_gender, risk
    FROM INSURANCE.INSURANCE_POLICY P1,
    INSURANCE.INSURANCE_DRIVER D1
    WHERE P1.household_id = MD.household_id
    AND P1.START_DATE = (SELECT MAX(START_DATE) maxdate
    FROM INSURANCE.INSURANCE_POLICY P2
    WHERE P2.HOUSEHOLD_ID = P1.HOUSEHOLD_ID
    GROUP BY HOUSEHOLD_ID)
    AND P1.primary_driver_id = D1.driver_id
    ) AS POLICY

  • RIGHT:

-- Get the latest policy but also primary driver info
(SELECT P1.*, D1.birthdate prim_driver_birthdate, D1.gender prim_driver_gender, risk
FROM INSURANCE.INSURANCE_POLICY P1,
INSURANCE.INSURANCE_DRIVER D1,
(SELECT household_id, MAX(START_DATE) maxdate
FROM INSURANCE.INSURANCE_POLICY
GROUP BY HOUSEHOLD_ID) as MD
WHERE P1.household_id = MD.household_id
AND P1.START_DATE = maxdate
AND P1.primary_driver_id = D1.driver_id
) AS POLICY

Using a correlated subquery forces the SQL engine to execute the statement for each row. Using a join instead greatly speeds up the processing. It goes from minutes to a few seconds.

# Create one record per household using the latest policy # This returns 100346 rows but there are only 100169 households (diff 77) # This means that there are some households that have multiple policies starting on the max(start_date) # stmt = """ SELECT DRIVER.household_id, POLICY.policy_id, DRIVER.last_name, HOUSEHOLD.zipcode, 0 as hotspot1, 0 as hotspot2, 0 as hotspot3, 0 as hotspot4, 0 as hotspot5, DR_AGGR.nb_drivers, DR_AGGR.under25, DR_AGGR.over60, DR_AGGR.male25, -- primary driver age and gender YEAR(CURRENT DATE - POLICY.prim_driver_birthdate) prim_driver_age, POLICY.prim_driver_gender, CONCAT(CONCAT(POLICY.make,' '), POLICY.model) as make_model, POLICY.model_year, POLICY.initial_odometer, POLICY.low_mileage_use, CL_AGGR.nb_claims, CL_AGGR.claim_total, -- claim by primary driver, amount by primary driver PRIM_CL_AGGR.nb_prim_claims, PRIM_CL_AGGR.prim_claim_total, POLICY.risk FROM INSURANCE.INSURANCE_HOUSEHOLD AS HOUSEHOLD, INSURANCE.INSURANCE_DRIVER AS DRIVER, -- Get the latest policy but also primary driver info (SELECT P1.*, D1.birthdate prim_driver_birthdate, D1.gender prim_driver_gender FROM INSURANCE.INSURANCE_POLICY P1, INSURANCE.INSURANCE_DRIVER D1, (SELECT household_id, MAX(START_DATE) maxdate FROM INSURANCE.INSURANCE_POLICY GROUP BY HOUSEHOLD_ID) as MD WHERE P1.household_id = MD.household_id AND P1.START_DATE = maxdate AND P1.primary_driver_id = D1.driver_id ) AS POLICY, (SELECT household_id, COUNT(DRIVER_ID) as nb_drivers, SUM(case when YEAR(CURRENT DATE - BIRTHDATE) < 25 then 1 else 0 end) under25, SUM(case when YEAR(CURRENT DATE - BIRTHDATE) > 59 then 1 else 0 end) over60, SUM(case when YEAR(CURRENT DATE - BIRTHDATE) < 25 and gender = 'M' then 1 else 0 end) male25 FROM INSURANCE.INSURANCE_DRIVER GROUP BY household_id) AS DR_AGGR, -- Needs an outer join to make sure we have something for all households (SELECT HH.household_id, COUNT(claim_id) as nb_claims, SUM(claim_amount) claim_total FROM INSURANCE.INSURANCE_HOUSEHOLD HH LEFT OUTER JOIN INSURANCE.INSURANCE_CLAIM CL ON HH.household_id = CL.household_id GROUP BY HH.household_id) as CL_AGGR, -- Needs an outer join (SELECT PP.household_id, COUNT(*) as nb_prim_claims, SUM(claim_amount) prim_claim_total FROM INSURANCE.INSURANCE_POLICY PP LEFT OUTER JOIN INSURANCE.INSURANCE_CLAIM CL ON PP.policy_id = CL.policy_id AND PP.primary_driver_id = CL.DRIVER_ID GROUP BY PP.household_id) as PRIM_CL_AGGR WHERE HOUSEHOLD.household_id = DRIVER.household_id AND DRIVER.household_id = POLICY.household_id AND POLICY.primary_driver_id = DRIVER.driver_id AND DR_AGGR.household_id = POLICY.household_id AND CL_AGGR.household_id = POLICY.household_id AND PRIM_CL_AGGR.household_id = POLICY.household_id ; """ t0 = time.time() data_pd = pd.read_sql(stmt, pconn) t1 = time.time() total = t1-t0 print("Timing: {:.2f} seconds".format(total))

Merge the distances with the insurance records

xx_pd = pd.merge(data_pd, hh_pd, on=['HOUSEHOLD_ID']) xx2_pd = xx_pd[['HOUSEHOLD_ID', 'POLICY_ID', 'LAST_NAME', 'ZIPCODE', 'HOTSPOT1_y', 'HOTSPOT2_y', 'HOTSPOT3_y', 'HOTSPOT4_y', 'HOTSPOT5_y', 'NB_DRIVERS', 'UNDER25', 'OVER60', 'MALE25', 'PRIM_DRIVER_AGE', 'PRIM_DRIVER_GENDER', 'MAKE_MODEL', 'MODEL_YEAR', 'INITIAL_ODOMETER', 'LOW_MILEAGE_USE', 'NB_CLAIMS', 'CLAIM_TOTAL', 'NB_PRIM_CLAIMS', 'PRIM_CLAIM_TOTAL', 'RISK']] xx3_pd = xx2_pd.rename(columns={'HOTSPOT1_y': 'HOTSPOT1', 'HOTSPOT2_y': 'HOTSPOT2', 'HOTSPOT3_y': 'HOTSPOT3', 'HOTSPOT4_y': 'HOTSPOT4', 'HOTSPOT5_y': 'HOTSPOT5'}) xx3_pd.head()

Write the result into a CSV file into the project

  • for local file: wslib.upload_file()

  • for stream:

sdata = wslib.load_data(filename) newfile = wslib.save_data("mynewfile.csv", sdata.read())
xx3_pd.to_csv("ModelingRecords.csv", index=False) res = wslib.upload_file('ModelingRecords.csv') print("File {} uploaded".format(res['name']))

Author

Jacques Roy is a member of the IBM Enablement for Data and AI

Copyright © 2023. This notebook and its source code are released under the terms of the MIT License.