Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
CloudPak-Outcomes
GitHub Repository: CloudPak-Outcomes/Outcomes-Projects
Path: blob/main/L4assets/DSandMLOpsAssets/HandsOn/Notebooks/DS Db2 data exploration.ipynb
1928 views
Kernel: Python 3.10

Db2 data exploration

This notebook is used to explore the tables use in the hands-on lab to better understand their data and relationships.

# Needed imports import warnings import ibm_db import ibm_db_dbi import pandas as pd

Connect to Db2

The connection information is provided for you

# Db2 instance information username = "cpdemo" password = "C!oudP@k4DataDem0s" database = "bludb" hostname = "db2w-ovqfeqq.us-south.db2w.cloud.ibm.com" port = 50001
## Connect to the database credentials = { 'username': username, 'password': password, 'database': 'BLUDB', 'host': hostname, 'port': port } dsn = ( "DRIVER={{IBM DB2 ODBC DRIVER}};" "DATABASE={0};" "HOSTNAME={1};" "PORT={2};" "PROTOCOL=TCPIP;" "SECURITY=ssl;" "UID={3};" "PWD={4};").format(credentials['database'], credentials['host'], credentials['port'], credentials['username'], credentials['password']) conn = ibm_db.connect(dsn, "", "") pconn = ibm_db_dbi.Connection(conn) # Ignore warnings about the driver warnings.filterwarnings("ignore") # one of "error", "ignore", "always", "default", "module", or "once"

Rows per table

You can find the names of the tables by looking at the schema file.

The next cell is provided as an example on how to access the Db2 database.
Note that all thedata is in the schema INSURANCE

# Rows per table tables = ["INSURANCE_HOUSEHOLD", "INSURANCE_DRIVER", "INSURANCE_POLICY", "INSURANCE_CLAIM", "INSURANCE_CLAIM_DETAIL", "INSURANCE_CLAIM_STATUS"] stmt = """ SELECT COUNT(*) as nb_rows FROM INSURANCE.{} """ for table in tables : data_pd = pd.read_sql(stmt.format(table), pconn) print("{}: {}".format(table, data_pd['NB_ROWS'][0].tolist() ))

Answer the following questions:

  • Is there at least one driver per household?

  • Is there more than one policy per household?

  • Any household without a policy?

  • Are there more than one policy per household that has the highest start_date?

  • Are there more than one policy per household that has the highest expiry_date?

  • Are there policies with more than one claim?

  • Are there multiple policies per driver?

  • Is there always a claim_details attached to a claim?

  • Are there columns with many null values? (Figure out which columns are important)

Is there at least one driver per household?

This is more complex than you may thin at first.

Hints:

  • left outer join

  • having

stmt = """ """ data_pd = pd.read_sql(stmt, pconn) data_pd.head()

Is there more than one policy per household?

hint:

  • SELECT... FROM (SELECT...)...

# household policies: number of policies, number of households having that number

Any household without a policy?

Hints:

  • SUM(case when...)

  • LEFT OUTER JOIN

## Remove answer before publishing

Are there more than one policy per household that have the highest start_date?

Are there more than one policy per household that has the highest expiry_date?

Are there policies with more than one claim?

Hint:

  • Select FROM (SELECT)

  • Similar to number of policies per household

# claims by policies

Are there multiple policies per driver?

Hint:

  • Similar to the previous query

  • Requires INSURANCE_POLICY and INSURANCE_DRIVER

# Policies per driver

Is there always a claim_details attached to a claim?

Are there columns with many null values? (Figure out which columns are important)

Hint: the easiest way is to use Pandas dataframes

Author

Jacques Roy is a member of the IBM Enablement for Data and AI

Copyright © 2023. This notebook and its source code are released under the terms of the MIT License.