GitHub Repository: CloudPak-Outcomes/Outcomes-Projects
Path: blob/main/DataVirtualization-L3-Tech-Lab/wq_lab_notebook.ipynb
¹⁹²⁸ views

Kernel: Python 3.10

Watson Query Data Virtualization Lab Notebook

This notebook is intended to be used with the IBM Watson Query Level 3 for Technical Sales lab (link).

Perform the work in each section as instructed in the main lab guide.

Section #1: Connect to Watson Query

Run the following cell to install the ibm_db driver package. This provides a Python interface for connecting to IBM Db2 (for Linux, UNIX, and Windows), IBM Db2 for z/OS, IBM Informix, and IBM Db2-based offerings and services, like Watson Query.

In [ ]:

# Install the ibm_db driver package. This only needs to be executed once in your runtime environment and may already be included
# by default. In any case, it is not harmful to attempt to install it more than once, as subsequent attempts will simply state
# that the requirement has already been satisfied.
!pip install ibm_db

print("ibm_db library has been installed.")

Run the following cell to import the libraries used by this notebook.

In [ ]:

# Import libraries that are used by this notebook.
import ibm_db
import pandas as pd
import time
import datetime

print("Libraries have been imported.")

Run the following two cells to download and load Db2 Jupyter Notebook extensions (Db2 magic commands).

Note: For more information on these Db2 Jupyter Notebook extensions, please see the following resources:
https://github.com/IBM/db2-jupyter
https://ibm.github.io/db2-jupyter/

In [ ]:

# Download the Db2 Jupyter Notebook extensions to local storage
!wget https://raw.githubusercontent.com/IBM/db2-jupyter/master/db2.ipynb -O db2.ipynb

In [ ]:

# Load the Db2 Jupyter Notebook extensions
%run db2.ipynb

Update the environment settings (hostname, portnum, and apikey) in the following cell based on your own Watson Query environment and then run the cell. Instructions on where to get these values are provided in the main lab guide.

In [ ]:

# User-specific connection settings. UPDATE WITH VALUES FROM YOUR OWN ENVIRONMENT, AS PER THE LAB INSTRUCTIONS.
dbname   = "BLUDB"
hostname = "replace-with-your-hostname.databases.appdomain.cloud"  # <-- UPDATE HOSTNAME HERE
portnum  = "50001"                                                 # <-- UPDATE PORT NUMBER HERE
apikey   = "replace-with-business-user-api-key"                    # <-- UPDATE API KEY HERE

print("Variables have been set.")

Update the following cell with your Watson Query user ID (specify it in all UPPERCASE letters). This user ID is also used as the schema name for any virtual objects you create in Watson Query and that is how it is being used here. Instructions on how to find your user ID (schema name) are provided in the main lab guide. After updating the value, run the cell.

Note: This user ID isn't needed for the connection, but it's being set here just to keep all of your settings localized in this one area of the notebook.

In [ ]:

# By default, objects are created in a schema that matches the user ID of the user.
# Set it here so that it can be used in any queries that follow.
dv_user_id = "XXXXXXXXXX"  # <-- UPDATE DV_USER_ID HERE (USE UPPERCASE LETTERS)

print("Variable set.")

After modifying the variables in the previous two cells, persist the changes to the notebook by clicking File > Save from the menu bar at the top of the page.
Run the following cell to connect to Watson Query.

In [ ]:

# Connect to the Data Virtualization service.
#
# If the connection fails then it's likely that one or more of the settings you provided are incorrect. The error
# message may provide the reason for the failure. These are the likely causes of some of the more common errors:
#
#  - SQL30061N: Invalid database name
#  - SQL1336N: Invalid hostname
#  - SQL30081N: Invalid port number (or some other communication error)

print("Connecting to Watson Query using API key...")
%sql connect to {dbname} host {hostname} port {portnum} apikey {apikey}

Run the following cell to create a function that will be used to execute SQL query statements against Watson Query.

In [ ]:

# Run a SQL query statement a given number of times. By default, the row count of the result set and the
# elapsed time per statement is displayed. The result set is returned as a data frame (of the last iteration,
# if multiple iterations requested).
def run_query(sql_stmt, num_iterations=1, quiet=False):
    iter = 1
    result = None
    if not quiet:
        print("Current time: " + str(datetime.datetime.now()) + " UTC")
        print("SQL statement: " + sql_stmt + "\n")
    while iter <= num_iterations:
        if not quiet and num_iterations != 1:
            print("Iteration " + str(iter) + " of " + str(num_iterations) + "... ", end = '')
        iter += 1
        start = time.time()
        df = %sql -q {sql_stmt}
        if sqlcode != 0:
            print("\nERROR: Unable to run SQL statement, SQLCODE = " + str(sqlcode) + " (" + sqlerror + ")")
        else:
            end = time.time()
            if not quiet and num_iterations != 1:
                print("success; row count=" + str(len(df)) + ", elapsed time=" + format(end - start, ".2f") + "s")
            if not quiet and iter > num_iterations:
                print("Query execution complete.\n")
            result = df
    return result

print("Query function has been created.")

Run the following cell to perform a simple test query to validate that the connection was successful.

In [ ]:

# Run a sample query to validate the connection (if the connection was not successful then
# this query will fail). This is querying metadata within the database. The value returned
# isn't important, just that it provides a non-0 row count without error.
sql_stmt = "SELECT COUNT(*) FROM SYSCAT.TABLES"
df = run_query(sql_stmt, 2, True)
print("SYSCAT.TABLES row count: " + str(int(df.loc[0][0])))

Section #2: Query Virtual Objects

This section assumes that you successfully connected to Watson Query in Section #1 above.

Run the following cell to query the CUSTOMER virtual table (whose data resides in a Db2 Warehouse database). Because you are querying this data as the business user, you can see that the credit card-related data has been masked.

In [ ]:

# Query the CUSTOMER table.
sql_stmt = "SELECT * FROM " + dv_user_id + ".CUSTOMER ORDER BY LOYALTY_NBR FETCH FIRST 10 ROWS ONLY"
df = run_query(sql_stmt, 1, True)
pd.set_option("display.max_columns", None) # Show all columns in output.
display(df)

Run the following cell to query the CUSTOMER_LOYALTY table (whose data resides in a PostgreSQL database).

In [ ]:

# Query the CUSTOMER_LOYALTY table.
sql_stmt = "SELECT * FROM " + dv_user_id + ".CUSTOMER_LOYALTY ORDER BY LOYALTY_NBR FETCH FIRST 10 ROWS ONLY"
df = run_query(sql_stmt, 1, True)
display(df)

Run the following cell to query the CUSTOMER_LOYALTY_HISTORY table (whose data resides on AWS S3 object storage).

In [ ]:

sql_stmt = "SELECT * FROM " + dv_user_id + ".CUSTOMER_LOYALTY_HISTORY ORDER BY LOYALTY_NBR FETCH FIRST 10 ROWS ONLY"
df = run_query(sql_stmt, 1, True)
display(df)

Run the following cell to query the CUSTOMER_SUMMARY_V virtual view (the view created through the graphical interface). This is retrieving and joining data from the two base data sources. Once again, see how the credit card data is masked! The business user isn't seeing data that they don't have the authority to see.

In [ ]:

# Query the CUSTOMER_SUMMARY_V view.
sql_stmt = "SELECT * FROM " + dv_user_id + ".CUSTOMER_SUMMARY_V ORDER BY LOYALTY_NBR FETCH FIRST 10 ROWS ONLY"
df = run_query(sql_stmt, 1, True)
display(df)

Run the following cell to query the CUSTOMER_SUMMARY_V2 virtual view (the view created by executing a SQL statement).

In [ ]:

# Query the CUSTOMER_SUMMARY_V2 view.
sql_stmt = "SELECT * FROM " + dv_user_id + ".CUSTOMER_SUMMARY_V2 ORDER BY LOYALTY_NBR FETCH FIRST 10 ROWS ONLY"
df = run_query(sql_stmt, 1, True)
display(df)

Section #3: Data Caches - Running Workload to Generate Query History

This section assumes that you successfully connected to Watson Query in Section #1 above.

Run the following two cells to issue queries against the virtual objects. Query execution will take approximately 10-15 minutes.

In [ ]:

# Queries that will be run to generate a query history for Cache Management in Watson Query.

sql_stmt_1 = ("/*Q1*/"
               " SELECT C.CUSTOMER_ID, C.LOYALTY_NBR, C.FIRST_NAME, C.LAST_NAME, C.CITY, C.STATE_NAME, C.COUNTRY,"
                      " SUM(CL.REVENUE) AS REV_SUM, SUM(CL.PLANNED_REVENUE) AS PLANNED_REV_SUM"
                 " FROM " + dv_user_id + ".CUSTOMER C, " + dv_user_id + ".CUSTOMER_LOYALTY CL"
                 " WHERE C.LOYALTY_NBR = CL.LOYALTY_NBR"
                 " GROUP BY C.CUSTOMER_ID, C.LOYALTY_NBR, C.FIRST_NAME, C.LAST_NAME, C.COUNTRY, C.STATE_NAME, C.CITY"
                 " ORDER BY C.CUSTOMER_ID")

sql_stmt_2 = ("/*Q2*/"
               " SELECT LAST_NAME, FIRST_NAME, CUSTOMER_ID, LOYALTY_NBR, COUNTRY, STATE_NAME, CITY, ORDER_YEAR, QUARTER,"
                      " SUM(REVENUE) AS REV_SUM, SUM(PLANNED_REVENUE) AS PLANNED_REV_SUM"
                 " FROM " + dv_user_id + ".CUSTOMER_SUMMARY_V"
                 " GROUP BY CUSTOMER_ID, LOYALTY_NBR, FIRST_NAME, LAST_NAME, COUNTRY, STATE_NAME, CITY, ORDER_YEAR, QUARTER"
                 " HAVING ORDER_YEAR > 2000"
                 " ORDER BY CUSTOMER_ID, ORDER_YEAR, QUARTER")

sql_stmt_3 = ("/*Q3*/"
               " WITH QUARTERLY_BREAKDOWN AS"
                 " (SELECT C.CUSTOMER_ID, C.LOYALTY_NBR, C.FIRST_NAME, C.LAST_NAME,"
                         " C.COUNTRY, C.STATE_NAME, C.CITY, CL.ORDER_YEAR, CL.QUARTER,"
                         " SUM(CL.REVENUE) AS REV_SUM, SUM(CL.PLANNED_REVENUE) AS PLANNED_REV_SUM"
                    " FROM " + dv_user_id + ".CUSTOMER C, " + dv_user_id + ".CUSTOMER_LOYALTY CL"
                    " WHERE C.LOYALTY_NBR = CL.LOYALTY_NBR"
                    " GROUP BY C.CUSTOMER_ID, C.LOYALTY_NBR, C.FIRST_NAME, C.LAST_NAME, C.COUNTRY, C.STATE_NAME, C.CITY, CL.ORDER_YEAR, CL.QUARTER"
                    " HAVING CL.ORDER_YEAR > 2010)"
               " SELECT CUSTOMER_ID, SUM(REV_SUM) AS TOTAL_REV, AVG(REV_SUM) AS AVG_REV,"
                      " SUM(PLANNED_REV_SUM) AS PLANNED_TOTAL_REV, AVG(PLANNED_REV_SUM) AS PLANNED_AVG_REV"
                 " FROM QUARTERLY_BREAKDOWN"
                 " GROUP BY CUSTOMER_ID "
                 " ORDER BY CUSTOMER_ID")

sql_stmt_4 = ("/*Q4*/"
               "WITH"
                " CUSTOMER_LOYALTY_ALL AS"
                  " (SELECT LOYALTY_NBR, ORDER_YEAR, QUARTER, REVENUE, PLANNED_REVENUE FROM " + dv_user_id + ".CUSTOMER_LOYALTY"
                     " UNION ALL"
                  " SELECT LOYALTY_NBR, ORDER_YEAR, QUARTER, REVENUE, PLANNED_REVENUE FROM " + dv_user_id + ".CUSTOMER_LOYALTY_HISTORY),"
                " QUARTERLY_BREAKDOWN AS"
                  " (SELECT C.CUSTOMER_ID, C.LOYALTY_NBR, C.FIRST_NAME, C.LAST_NAME,"
                          " C.COUNTRY, C.STATE_NAME, C.CITY, CL.ORDER_YEAR, CL.QUARTER,"
                          " SUM(CL.REVENUE) AS REV_SUM, SUM(CL.PLANNED_REVENUE) AS PLANNED_REV_SUM"
                     " FROM " + dv_user_id + ".CUSTOMER C, CUSTOMER_LOYALTY_ALL CL"
                     " WHERE C.LOYALTY_NBR = CL.LOYALTY_NBR"
                     " GROUP BY C.CUSTOMER_ID, C.LOYALTY_NBR, C.FIRST_NAME, C.LAST_NAME, C.COUNTRY, C.STATE_NAME, C.CITY, CL.ORDER_YEAR, CL.QUARTER"
                     " HAVING CL.ORDER_YEAR > 2010)"
              " SELECT CUSTOMER_ID, DEC(SUM(REV_SUM),8,2) AS TOTAL_REV, DEC(AVG(REV_SUM),8,2) AS AVG_REV,"
                     " DEC(SUM(PLANNED_REV_SUM),8,2) AS PLANNED_TOTAL_REV, DEC(AVG(PLANNED_REV_SUM),8,2) AS PLANNED_AVG_REV"
                " FROM QUARTERLY_BREAKDOWN"
                " GROUP BY CUSTOMER_ID"
                " ORDER BY CUSTOMER_ID")

print("SQL statements have been generated.")

In [ ]:

# Run each query 5 times.
df1 = run_query(sql_stmt_1, 5, False)
df2 = run_query(sql_stmt_2, 5, False)
df3 = run_query(sql_stmt_3, 5, False)
df4 = run_query(sql_stmt_4, 5, False)

print("Query workload has finished. You can close the notebook now.")

This concludes the notebook portion of the lab.

Watson Query Data Virtualization Lab Notebook

Section #1: Connect to Watson Query

Section #2: Query Virtual Objects

Section #3: Data Caches - Running Workload to Generate Query History

Product

Resources

Company