1. Insert project token, API key, and region

Click the three vertical dots icon above and select Insert project token to provide this notebook API access to your project.
Paste the API key you created earlier in the lab into the cell below as the value for API_KEY.
The LOCATION value below will depend on where you provisioned your services. According to the WML Client documentation, valid values for LOCATION are:
London: https://eu-gb.ml.cloud.ibm.com
Frankfurt: https://eu-de.ml.cloud.ibm.com
Run the cell above, and continue running cells individually until you reach step 2.
The first model you will create in this notebook uses the scikit-learn framework. The sklearn package is available by default in Watson Studio Python environments, and does not need to be installed.
The next cell uses the API key and location variables defined above to authenticate with your Watson Machine Learning service. An error in this cell likely means that you do not have access to a WML service, or that the API key or location provided above is incorrect.
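As a rough sketch of what the authentication cell does, assuming the ibm_watson_machine_learning package and placeholder values for API_KEY and LOCATION. The helper names build_wml_credentials and connect_to_wml are illustrative and are not part of the notebook:

```python
# Hypothetical placeholders; substitute the values you set in step 1.
API_KEY = "PASTE-YOUR-API-KEY-HERE"
LOCATION = "https://eu-gb.ml.cloud.ibm.com"


def build_wml_credentials(api_key, location):
    """Return the credentials dictionary expected by the WML client."""
    if not api_key or not location.startswith("https://"):
        raise ValueError("API_KEY must be set and LOCATION must be an https URL")
    return {"url": location, "apikey": api_key}


def connect_to_wml(credentials):
    """Create a WML client; this fails if the key or location is wrong."""
    # Imported lazily so the sketch can be read without the package installed.
    from ibm_watson_machine_learning import APIClient
    return APIClient(credentials)


wml_credentials = build_wml_credentials(API_KEY, LOCATION)
# wml_client = connect_to_wml(wml_credentials)  # needs a real API key
```

Checking the values before connecting makes it easier to tell a typo in LOCATION apart from a missing WML service.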
LIKELY ACTION REQUIRED: restart the kernel on error messages
The cell below will install the IBM Factsheets client library using the pip utility, then authenticate with the IBM Factsheets service using the credentials you have already supplied, and initialize Factsheets monitoring for this model.
If running the cell produces an error message, restart the kernel and run all previous cells again. Because the Python and Spark environments provide different versions of some libraries, you may receive an error when importing ibm_aigov_facts_client from the Factsheets library. Restarting the kernel typically fixes this issue, though in rare cases you may have to do it more than once. Click the Kernel menu item above and select Restart. Once the kernel has restarted, click the Cell menu item and select Run All Above. Once those cells have finished executing, run the cell below.
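One way to make the failure easier to recognize is to wrap the import in a small check. This is only a sketch: try_import is a hypothetical helper, not something the notebook defines, and it uses nothing beyond the standard library.

```python
import importlib


def try_import(module_name):
    """Return the imported module, or None if the import fails."""
    try:
        return importlib.import_module(module_name)
    except Exception as err:  # version conflicts are not always ImportError
        print(f"Could not import {module_name}: {err}")
        print("Restart the kernel, run all cells above, then run this cell again.")
        return None


facts_module = try_import("ibm_aigov_facts_client")
```

If facts_module is None, follow the restart steps described above before continuing.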
Note that Python notebooks in Watson Studio have full support for pip install, which allows you to add whatever libraries you need to the notebook environment. For example, if you wanted to use Python to parse command line arguments, you could run !pip install argparse.
2. !!--STOP--!! Insert data to code below
Place your cursor in the empty code cell below. Then click the Code snippets icon in the upper right corner of the screen -- it looks like an HTML tag.

Click the Read data tile beneath the Data Ingestion header, then click the Select data from project button. Click Data asset from the Categories list, then select modeling_records_2022.csv from the asset list, then click Select.

Use the Load as dropdown beneath to select pandas DataFrame, then click the Insert code to cell button. A code block is automatically inserted into the empty cell that will import your data into a dataframe. Like the sklearn package, pandas is automatically provided in Watson Studio Python environments.
IMPORTANT: replace all instances of df_data_x with df_data_1 in the code
The automated dataframe will likely use the df_data_3 variable to hold the data. Update the last two lines of code to import data into the df_data_1 variable for the rest of the notebook to work correctly. The last lines of your cell should look like this:
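The generated code is not reproduced here and varies with your project. As a minimal sketch, assuming the generated snippet ends by reading a CSV stream into a DataFrame, the final lines would look roughly like the following. The body variable is a stand-in built with io.StringIO, and the column names and values are invented for illustration:

```python
import io

import pandas as pd

# Stand-in for the file stream that the generated snippet obtains from
# project storage; the columns and values here are invented.
body = io.StringIO("EMPLOYEE_ID,TENURE,ATTRITION\n1,4,0\n2,1,1\n")

# The last two lines, with the generated variable renamed to df_data_1.
df_data_1 = pd.read_csv(body)
df_data_1.head()
```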

Run the inserted code cell below. If you have correctly imported the data, you will see a table populated with employee data. Continue running cells individually until you reach step 3.
The next cell splits the data into the feature columns and the label columns, and then further splits it into a training data set and a testing data set. If this cell generates an error, it is likely because you have not imported the data into the df_data_1 variable as described above. You will need to alter the previous cell to use df_data_1 and then rerun it.
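The two-stage split can be sketched as follows, assuming pandas and scikit-learn. The DataFrame contents, the LABEL column name, and the 25% test size are invented for illustration and may differ from what the notebook uses:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented stand-in for df_data_1; the real columns come from the CSV file.
df_data_1 = pd.DataFrame({
    "TENURE": [1, 4, 2, 7, 3, 9, 5, 6],
    "SALARY_BAND": [1, 2, 1, 3, 2, 3, 2, 3],
    "LABEL": [1, 0, 1, 0, 1, 0, 0, 0],
})

label_column = "LABEL"  # hypothetical label name
X = df_data_1.drop(columns=[label_column])  # feature columns
y = df_data_1[label_column]                 # label column

# Hold out 25% of the rows for testing; the fixed seed makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
```

A NameError on df_data_1 at this point is the symptom of the import problem described above.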
Now you will tell Watson Machine Learning to use the current project to store the model.
The cell below tells the Watson Machine Learning client to save the models in the current project. If you receive an error here, it is likely because you did not correctly set your project ID at the beginning of the notebook.
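In the ibm_watson_machine_learning client this is done with set.default_project. The sketch below wraps that call in a hypothetical helper, use_current_project, and assumes the project ID is available either as an argument or in a PROJECT_ID environment variable:

```python
import os


def use_current_project(client, project_id=None):
    """Point the WML client at a project so saved models land there."""
    # Assumption: the environment exposes the project ID as PROJECT_ID.
    project_id = project_id or os.environ.get("PROJECT_ID")
    if not project_id:
        raise ValueError("No project ID found; check the setup in step 1.")
    client.set.default_project(project_id)
    return project_id
```

Raising early with a clear message is a design choice for the sketch; the notebook cell itself may simply surface the client's own error.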
The following cell provides connection information to the model training data, which will be stored with the model and in FactSheets. You could use the Cloud Object Storage information for this particular project by changing the credentials to match those from above where you inserted the file to code, but for simplicity's sake, you will use a pre-existing file.
The next three cells construct metadata for your model and connect to the Factsheets client. This metadata will be saved with the model itself and will appear on its Factsheet. If you get errors trying to save the model, they will most likely come from the metadata in the model properties, specifically TYPE and SOFTWARE_SPEC_UID. These values change frequently as Watson Studio adds support for new versions of Python and removes support for outdated ones. You can get a list of currently supported specifications by running wml_client.software_specifications.list().
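A minimal sketch of how such a metadata dictionary might be assembled with the ibm_watson_machine_learning client. The helper build_model_props is hypothetical, and the model type and software specification names in the usage comment are examples that may no longer be supported:

```python
def build_model_props(client, model_name, model_type, software_spec_name):
    """Assemble the metadata dictionary saved alongside the model."""
    # Look up the specification ID by name; the name must appear in the
    # output of client.software_specifications.list().
    spec_uid = client.software_specifications.get_id_by_name(software_spec_name)
    meta = client.repository.ModelMetaNames
    return {
        meta.NAME: model_name,
        meta.TYPE: model_type,
        meta.SOFTWARE_SPEC_UID: spec_uid,
    }


# Example call with hypothetical values; check which ones are supported first:
# model_props = build_model_props(
#     wml_client, "attrition-sklearn", "scikit-learn_1.0", "runtime-22.1-py3.9"
# )
```

Looking the ID up by name, rather than hard-coding it, keeps the only version-dependent values in one place.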
The next three cells fit the model to the data using a Random Forest classifier, run predictions on the test data, and then print the model's accuracy on the test data. Finally, the notebook calculates and displays feature importance. For more information on Random Forest classifiers, see here.
The next three cells export data from the model you just created to the FactSheet. The first lists experiments tracked by FactSheets. The second writes the URL and other info on this notebook as custom data to the FactSheet. Note that any data that might be helpful for model validators can be written to the FactSheet.
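The fit, predict, score, and feature-importance sequence can be sketched with scikit-learn as follows. The data is synthetic, and the hyperparameters are arbitrary choices for illustration rather than the notebook's settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the employee data.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy on test data: {accuracy:.3f}")

# One importance score per feature; higher means the feature mattered more.
for index, importance in enumerate(model.feature_importances_):
    print(f"feature_{index}: {importance:.3f}")
```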
Finally, the model is stored to the project with all of the metadata defined above.
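A sketch of the store step, assuming the ibm_watson_machine_learning client's repository.store_model call. The wrapper store_in_project is hypothetical, and the arguments the notebook actually passes may differ:

```python
def store_in_project(client, model, model_props, X_train, y_train):
    """Save a trained model with its metadata and return the new model's ID."""
    details = client.repository.store_model(
        model=model,
        meta_props=model_props,
        training_data=X_train,
        training_target=y_train,
    )
    return client.repository.get_model_id(details)
```

The client must already point at the current project for the model to be saved there.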
Next, the notebook uses Apache Spark to create a second model. Because you specified a Spark environment when you created this notebook, the pyspark runtime will be available without needing to be installed via pip.
3. !!--STOP--!! Insert data to code below
Place your cursor in the empty code cell below. Then click the Find and add data icon in the upper right corner of the screen like you did in step 2. Locate the modeling_records_2022.csv file, click its associated Insert to code dropdown, and select SparkSession DataFrame.
IMPORTANT: replace all instances of df_data_x with df_data_2 in the code
The automated dataframe will likely use the df_data_3 variable to hold the data. Update the last two lines of code to import data into the df_data_2 variable for the rest of the notebook to work correctly. The last lines of your cell should look like this:

Run the inserted code cell below. If you have correctly imported the data, you will see a table populated with employee data. The remainder of the notebook is very similar to the training of the sklearn model. It will enable FactSheets for the second model, train a Spark Gradient Boosted Tree classifier, and then save that model to the project. You may run the rest of the notebook to its conclusion.
Similar to the sklearn model, you need to specify metadata for the Spark model.
For the second model, you will create a Gradient Boosted Tree classifier. For more information on Gradient Boosting, see here.
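As a rough sketch of a Gradient Boosted Tree classifier in Spark ML, assuming numeric feature columns and a binary numeric label. The helper build_gbt_pipeline, the column names in the usage comment, and the maxIter value are illustrative and are not taken from the notebook:

```python
def build_gbt_pipeline(feature_columns, label_column):
    """Return an unfitted Spark ML pipeline ending in a GBT classifier."""
    # Imported lazily; these classes need an active Spark session to construct.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import GBTClassifier
    from pyspark.ml.feature import VectorAssembler

    # Spark ML expects all features packed into a single vector column.
    assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
    classifier = GBTClassifier(
        labelCol=label_column, featuresCol="features", maxIter=20
    )
    return Pipeline(stages=[assembler, classifier])


# Usage inside the Spark notebook, with hypothetical column names:
# train_df, test_df = df_data_2.randomSplit([0.8, 0.2], seed=42)
# model = build_gbt_pipeline(["TENURE", "SALARY_BAND"], "LABEL").fit(train_df)
# predictions = model.transform(test_df)
```

Categorical string columns would need to be indexed or encoded before the assembler stage.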
Congratulations!
You have completed this notebook. You can now return to the Data and AI Live Demos lab page and continue with the lab.