Path: blob/main/L4assets/DSandMLOpsAssets/HandsOn/Notebooks/DS modeling records.ipynb
1928 views
Create the modeling records
CPDaaS: Make sure to first insert a "project token"
Click on the three vertical dots icon in the uper right of the screen, then click on Insert project token
Once inserted, execute the cell.
A project token is only available if you followed the prerequesite instructions to create on in your project.
Modeling record tasks
Create a set of records for distances from every household to each of the five cluster center
Extract modelingdata from the Db2 tables
Add the distances to the modeling records
Save the result as a CSV file on the project
Connect to Db2
Get the cluster data
The columns are renamed to match the names form the comparison lab. They are also easier to remember.
Calculate distances from households to each cluster center
In the Chicago area, the value 0.00015 represents roughly:
Horizontal (longitudinal) distance: 40 feet Vertical (latitudinal) distance: 54 feet Diagonal distance: 68 feet
Here, we use geodesic from geopy.distance to calculate the distance between two points (latitude, longitude)
Get the household information from Db2
Loop to create the distance information
Time how long it takes. Should be between four and eight minutes.
Extract the data from the Db2 tables
The data needed comes from multiple tables and sometime aggreagates on some tables. If you are not careful, you may lose records through the join. Some tables may not have matches so for this reason you have to judiciously use the outer join construct. For example, there are few claims on the policies. If you join the claims to the policies, you'll get only records for policies that have claims, discarding most of the policies. The effect is then felt throughout the execution and results in 859 rows instand of over 10,000.
The fields required for modeling are:
| field name | description |
|---|---|
| household_id | household unique id |
| policy_id | policy unique id |
| last_name | primary driver last name |
| zipcode | household zipcode |
| hotspot1 to 5 | distances from each cluster center in miles rounded to the nearest integer |
| nb_drivers | number of drivers on the policy |
| under25 | number of drivers under the age of 25 |
| over60 | number of drivers over the age of 60 |
| male25 | number of male drivers under the age of 25 |
| prim_driver_age | primary driver age |
| prim_driver_gender | primary driver gender |
| make_model | car make and model |
| model_year | car year |
| initial_odometer | initial odometer reading |
| low_mileage_use | indicator that the policy is for low mileage drivers |
| nb_claims | number of claims filed against the policy |
| claim_total | claims amount total |
| nb_prim_claims | number of claims filed by the primary driver |
| prim_claim_total | claims amount filed by the primary driver |
| risk | assigne risk factor |
Extraction strategy: divide and conquer
Data needs to be processed and retrieved from multiple tables. It is better to divide the work in parts that retrieve specific fields and join them together to get the final result.
This is the stategy used in the SQL below. It takes it one table at a time, taking advantage of the SELECT.. FROM (SELECT...) virtual table construct.
The left outer join's are also resolved in the virtual table constructs.
SQL performance gotcha
Beware of using correlated subqueries:
WRONG:
-- Get the latest policy but also primary driver info
(SELECT P1.*, D1.birthdate prim_driver_birthdate, D1.gender prim_driver_gender, risk
FROM INSURANCE.INSURANCE_POLICY P1,
INSURANCE.INSURANCE_DRIVER D1
WHERE P1.household_id = MD.household_id
ANDP1.START_DATE = (SELECT MAX(START_DATE) maxdate
FROM INSURANCE.INSURANCE_POLICY P2
WHERE P2.HOUSEHOLD_ID = P1.HOUSEHOLD_ID
GROUP BY HOUSEHOLD_ID)
AND P1.primary_driver_id = D1.driver_id
) AS POLICYRIGHT:
-- Get the latest policy but also primary driver info
(SELECT P1.*, D1.birthdate prim_driver_birthdate, D1.gender prim_driver_gender, risk
FROM INSURANCE.INSURANCE_POLICY P1,
INSURANCE.INSURANCE_DRIVER D1,
(SELECT household_id, MAX(START_DATE) maxdate
FROM INSURANCE.INSURANCE_POLICY
GROUP BY HOUSEHOLD_ID) as MD
WHERE P1.household_id = MD.household_id
AND P1.START_DATE = maxdate
AND P1.primary_driver_id = D1.driver_id
) AS POLICY
Using a correlated subquery forces the SQL engine to execute the statement for each row. Using a join instead greatly speeds up the processing. It goes from minutes to a few seconds.
Merge the distances with the insurance records
Write the result into a CSV file into the project
for local file:
wslib.upload_file()for stream:
Author
Jacques Roy is a member of the IBM Enablement for Data and AI
Copyright © 2023. This notebook and its source code are released under the terms of the MIT License.