GitHub Repository: CloudPak-Outcomes/Outcomes-Projects
Path: blob/main/TrustedAI-L3-Tech-Lab/02-Open source lab.ipynb

Kernel: Python 3.10

1. Insert project token, API key, and region

Click the three vertical dots icon above and select Insert project token to provide this notebook API access to your project.

The API key you created earlier in the lab should be pasted into the cell below as the value for API_KEY.

The LOCATION value below will depend on where you provisioned your services. According to the WML Client documentation, valid values for LOCATION are:

Dallas: https://us-south.ml.cloud.ibm.com
London: https://eu-gb.ml.cloud.ibm.com
Frankfurt: https://eu-de.ml.cloud.ibm.com
Tokyo: https://jp-tok.ml.cloud.ibm.com

Run the cell above, and continue running cells individually until you reach step 2.

In [ ]:

import os

API_KEY = 'xxxxxxxxxxxxxxxxxxx'
PROJECT_ID = os.environ['PROJECT_ID']
LOCATION = 'https://us-south.ml.cloud.ibm.com'
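
To reduce the chance of a typo when pasting an endpoint, the region list above could be kept in a small lookup table. This is only a sketch: `WML_ENDPOINTS` and `location_for` are hypothetical names that are not part of the WML client, and the URLs are simply the four listed above.

```python
# Hypothetical helper: map a region name to the WML endpoint listed in this notebook.
WML_ENDPOINTS = {
    "dallas": "https://us-south.ml.cloud.ibm.com",
    "london": "https://eu-gb.ml.cloud.ibm.com",
    "frankfurt": "https://eu-de.ml.cloud.ibm.com",
    "tokyo": "https://jp-tok.ml.cloud.ibm.com",
}


def location_for(region: str) -> str:
    """Return the endpoint URL for a region name, ignoring case and surrounding spaces."""
    key = region.strip().lower()
    if key not in WML_ENDPOINTS:
        raise ValueError(f"Unknown region {region!r}; expected one of {sorted(WML_ENDPOINTS)}")
    return WML_ENDPOINTS[key]


LOCATION = location_for("Dallas")
print(LOCATION)  # https://us-south.ml.cloud.ibm.com
```

A misspelled region then fails immediately with a readable error instead of surfacing later as an authentication failure.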

In [ ]:

if "p-" in PROJECT_ID:
    # Adjacent string literals are joined, so the message carries no stray indentation
    raise Exception(
        "You have not correctly set the value for your PROJECT_ID. The value beginning with 'p-' is your "
        "project access token. Please copy the value of the project_id into the previous cell and re-run it."
    )
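
The check above only catches a pasted access token. A slightly stricter variant could verify that the value has the expected shape. The sketch below assumes that project IDs are UUID-formatted strings, which this notebook does not state, and `looks_like_project_id` is a hypothetical helper name; both sample values are made up for illustration.

```python
import uuid


def looks_like_project_id(value: str) -> bool:
    """Return True if value parses as a UUID (the assumed project ID format)."""
    try:
        uuid.UUID(value)
    except ValueError:
        return False
    return True


# Made-up sample values: a UUID-shaped ID and a token-like string starting with 'p-'
print(looks_like_project_id("12345678-1234-5678-1234-567812345678"))  # True
print(looks_like_project_id("p-2+exampleTokenValue=="))  # False
```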

The first model you will create in this notebook uses the scikit-learn framework. The sklearn package is available by default in Watson Studio Python environments, and does not need to be installed.

In [ ]:

import sklearn
sklearn.__version__

The next cell uses the API key and location variables defined above to authenticate with your Watson Machine Learning service. An error in this cell likely means that you do not have access to a WML service, or that the API key or location provided above is incorrect.

In [ ]:

from ibm_watson_machine_learning import APIClient

wml_credentials = {
    "apikey": API_KEY,
    "url": LOCATION
}

wml_client = APIClient(wml_credentials)

LIKELY ACTION REQUIRED: restart the kernel on error messages

The cell below uses pip to install the IBM Factsheets client library and then imports it. Later cells authenticate with the Factsheets service using the credentials you have already supplied and initialize Factsheet monitoring for this model.

If you receive an error message from running the cell, you will need to restart the kernel and run all previous cells again. Because the Python and Spark environments provide different library levels, importing ibm_aigov_facts_client from the Factsheets library may fail with an error. Restarting the kernel typically fixes this issue, though in rare cases you may have to do it more than once. Click the Kernel menu item above and select Restart. Once the kernel has restarted, click the Cell menu item and select Run All Above. Once those cells have finished executing, run the cell below.

Note that Python notebooks in Watson Studio have full support for pip install, which allows you to add whatever libraries you need to the notebook environment. For example, to add a package that is not preinstalled, you could run !pip install followed by the package name.

In [ ]:

!pip uninstall -y ibm-aigov-facts-client
!pip install --upgrade ibm-aigov-facts-client  --no-cache | tail -n 1

from ibm_aigov_facts_client import AIGovFactsClient

2. !!--STOP--!! Insert data to code below

Place your cursor in the empty code cell below. Then click the Code snippets icon in the upper right corner of the screen -- it looks like an HTML tag.

Click the Read data tile beneath the Data Ingestion header, then click the Select data from project button. Click Data asset from the Categories list, then select modeling_records_2022.csv from the asset list, then click Select.

Use the Load as dropdown beneath to select pandas DataFrame, then click the Insert code to cell button. A code block is automatically inserted into the empty cell that will import your data into a dataframe. Like the sklearn package, pandas is automatically provided in Watson Studio Python environments.

IMPORTANT: replace all instances of `df_data_x` with `df_data_1` in the code

The generated code will likely use a different variable, such as df_data_3, to hold the data. Update the last two lines of the code so that the data is loaded into the df_data_1 variable; the rest of the notebook depends on that name.

Run the inserted code cell below. If you have correctly imported the data, you will see a table populated with employee data. Continue running cells individually until you reach step 3.

In [ ]:

The next cell splits the data into the feature columns and the label column, and then splits it further into a training data set and a testing data set. If this cell generates an error, it is likely because you have not imported the data into the df_data_1 variable as described above. You will need to alter the previous cell to use df_data_1 and then rerun it.

In [ ]:

from sklearn.model_selection import train_test_split

X = df_data_1.drop(['ATTRITION'], axis=1)  # Features
y = df_data_1['ATTRITION']  # Labels

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15) # 85% training and 15% test
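
Because the split is random, the rows that land in each set, and therefore the accuracy reported later, change from run to run. The sketch below is a pure-Python stand-in, not scikit-learn's implementation, showing what a seeded 85/15 split does; `seeded_split` and the integer rows are hypothetical.

```python
import random


def seeded_split(rows, test_fraction=0.15, seed=42):
    """Shuffle a copy of rows with a fixed seed, then cut off the test portion."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(100))  # stand-in for 100 employee records
train_rows, test_rows = seeded_split(rows)
print(len(train_rows), len(test_rows))  # 85 15

# The same seed reproduces exactly the same split
print(seeded_split(rows) == (train_rows, test_rows))  # True
```

With scikit-learn itself, the equivalent is to pass a fixed `random_state` argument to `train_test_split`.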

The next cell lists the names of the feature columns.

In [ ]:

X.columns.tolist()

The cell below tells the Watson Machine Learning client to save the models in the current project. If you receive an error here, it is likely because you did not correctly set your project ID at the beginning of the notebook.

In [ ]:

wml_client.set.default_project(PROJECT_ID)

The following cell provides connection information to the model training data, which will be stored with the model and in FactSheets. You could use the Cloud Object Storage information for this particular project by changing the credentials to match those from above where you inserted the file to code, but for simplicity's sake, you will use a pre-existing file.

In [ ]:

training_data_references = [
                {
                    "id": "attrition",
                    "type": "container",
                    "connection": {},
                    "location": {
                        "path": "modeling_records_2022.csv"
                    },

                    #"type": "s3",
                    #"connection": {
                    #    "access_key_id": "yqcPbWZ0AQPHleHVerrR4Wx5e9pymBdMgydbEra5zCif",
                    #    "endpoint_url": "https://s3.us.cloud-object-storage.appdomain.cloud",
                    #    "resource_instance_id": "crn:v1:bluemix:public:cloud-object-storage:global:a/7d8b3c34272c0980d973d3e40be9e9d2:2883ef10-23f1-4592-8582-2f2ef4973639::"
                    #},
                    #"location": {
                    #    "bucket": "faststartlab-donotdelete-pr-nhfd4jnhlxgpc7",
                    #    "path": "modeling_records_2022.csv"
                    #},
                    "schema": {
                        "id": "training_schema",
                        "fields": [
                            {"name": "POSITION_CODE", "nullable": True, "metadata": {}, "type": "double"},
                            {"name": "DEPARTMENT_CODE", "nullable": True, "metadata": {}, "type": "double"},
                            {"name": "DAYS_WITH_COMPANY", "nullable": True, "metadata": {}, "type": "double"},
                            {"name": "COMMUTE_TIME", "nullable": True, "metadata": {}, "type": "double"},
                            {"name": "AGE_BEGIN_PERIOD", "nullable": True, "metadata": {}, "type": "double"},
                            {"name": "GENDER_CODE", "nullable": True, "metadata": {}, "type": "double"},
                            {"name": "PERIOD_TOTAL_DAYS", "nullable": True, "metadata": {}, "type": "double"},
                            {"name": "STARTING_SALARY", "nullable": True, "metadata": {}, "type": "double"},
                            {"name": "ENDING_SALARY", "nullable": True, "metadata": {}, "type": "double"},
                            {"name": "NB_INCREASES", "nullable": True, "metadata": {}, "type": "double"},
                            {"name": "BONUS", "nullable": True, "metadata": {}, "type": "double"},
                            {"name": "NB_BONUS", "nullable": True, "metadata": {}, "type": "double"},
                            {"name": "VACATION_DAYS_TAKEN", "nullable": True, "metadata": {}, "type": "double"},
                            {"name": "SICK_DAYS_TAKEN", "nullable": True, "metadata": {}, "type": "double"},
                            {"name": "PROMOTIONS", "nullable": True, "metadata": {}, "type": "double"},
                            {"name": "NB_MANAGERS", "nullable": True, "metadata": {}, "type": "double"},
                            {"name": "DAYS_IN_POSITION", "nullable": True, "metadata": {}, "type": "double"},
                            {"name": "DAYS_SINCE_LAST_RAISE", "nullable": True, "metadata": {}, "type": "double"},
                            {"name": "RANKING_CODE", "nullable": True, "metadata": {}, "type": "double"},
                            {"name": "OVERTIME", "nullable": True, "metadata": {}, "type": "double"},
                            {"name": "DBLOVERTIME", "nullable": True, "metadata": {}, "type": "double"},
                            {"name": "TRAVEL", "nullable": True, "metadata": {}, "type": "double"}
                        ]
                    }
                }
            ]
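
Every entry in the schema above has the same shape, so the `fields` list could also be generated from the column names rather than typed by hand. This is a sketch only: `build_schema_fields` is a hypothetical helper, and it assumes every column is a nullable double, as in the list above.

```python
def build_schema_fields(column_names, field_type="double"):
    """Build one schema entry per column, in the shape used by training_data_references."""
    return [
        {"name": name, "nullable": True, "metadata": {}, "type": field_type}
        for name in column_names
    ]


# Three of the column names from the schema above, used as sample input
sample_columns = ["POSITION_CODE", "DEPARTMENT_CODE", "DAYS_WITH_COMPANY"]
schema_fields = build_schema_fields(sample_columns)
print(schema_fields[0])
```

In the notebook, calling the helper with `X.columns.tolist()` should yield the same list as the hand-written one, provided the dataframe columns match the schema.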

The next three cells construct metadata for your model and connect to the Factsheets client. This metadata will be saved with the model itself and will appear on its Factsheet. If you get errors trying to save the model, they will most likely come from the metadata contained in the model properties, specifically the TYPE and SOFTWARE_SPEC_UID values, which change frequently as Watson Studio adds support for new versions of Python and removes support for outdated versions. You can get a list of currently supported specifications by running wml_client.software_specifications.list().

In [ ]:

fields=X_train.columns.tolist()
metadata_dict = {'target_col' : 'ATTRITION', 'fields':fields}

In [ ]:

PROJECT_UID = os.environ['PROJECT_ID']
CPD_URL=os.environ['RUNTIME_ENV_APSX_URL'][len('https://api.'):]
CONTAINER_ID=PROJECT_UID
CONTAINER_TYPE='project'
EXPERIMENT_NAME='predictive_attrition'

PROJECT_ACCESS_TOKEN=project.project_context.accessToken.replace('Bearer ','')

facts_client = AIGovFactsClient(api_key=API_KEY,experiment_name=EXPERIMENT_NAME,container_type=CONTAINER_TYPE,container_id=CONTAINER_ID,set_as_current_experiment=True)
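
The CPD_URL line above slices off a fixed number of characters, which silently returns the wrong host if the environment variable does not start with `https://api.`. A more defensive version might look like the sketch below; `host_without_api_prefix` is a hypothetical helper and the example URL is a made-up value, not one taken from a real environment.

```python
def host_without_api_prefix(url: str, prefix: str = "https://api.") -> str:
    """Strip the expected prefix from url, failing loudly if it is absent."""
    if not url.startswith(prefix):
        raise ValueError(f"Expected URL to start with {prefix!r}, got {url!r}")
    return url[len(prefix):]


example_url = "https://api.example-platform.test"  # made-up value for illustration
print(host_without_api_prefix(example_url))  # example-platform.test
```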

In [ ]:

software_spec_uid = wml_client.software_specifications.get_id_by_name("runtime-22.2-py3.10")
print("Software Specification ID: {}".format(software_spec_uid))
model_props = {
    wml_client._models.ConfigurationMetaNames.NAME:"{}".format("attrition challenger - sklearn"),
    wml_client._models.ConfigurationMetaNames.TYPE: "scikit-learn_1.1",
    wml_client._models.ConfigurationMetaNames.SOFTWARE_SPEC_UID: software_spec_uid,
    wml_client._models.ConfigurationMetaNames.TRAINING_DATA_REFERENCES: training_data_references,
    wml_client._models.ConfigurationMetaNames.LABEL_FIELD: "ATTRITION",
    wml_client._models.ConfigurationMetaNames.CUSTOM: metadata_dict
}

facts_client.export_facts.prepare_model_meta(wml_client=wml_client,meta_props=model_props)

The next three cells fit a Random Forest classifier to the training data, run predictions on the test data, and then print the model's accuracy on the test data. Finally, the notebook calculates and displays feature importance. For more information on Random Forest classifiers, see here.

In [ ]:

from sklearn.ensemble import RandomForestClassifier

# Create a random forest classifier with 100 trees
clf = RandomForestClassifier(n_estimators=100)

# Train the model using the training set
clf.fit(X_train,y_train)

y_pred = clf.predict(X_test)

In [ ]:

from sklearn import metrics
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
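
The accuracy printed above is the fraction of test rows whose predicted label equals the true label. A minimal pure-Python equivalent, with made-up labels and a hypothetical `accuracy` helper, is sketched below.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one label is required")
    matches = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return matches / len(y_true)


# Made-up labels: four of the five predictions are correct
print(accuracy([0, 1, 1, 0, 1], [0, 1, 0, 0, 1]))  # 0.8

# A model that always predicts the majority class still scores the majority share
print(accuracy([0] * 9 + [1], [0] * 10))  # 0.9
```

The second call is a reminder that accuracy is best read alongside the class balance of the label column.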

In [ ]:

import pandas as pd  # normally already imported by the inserted data cell

feature_imp = pd.Series(clf.feature_importances_,index=X.columns).sort_values(ascending=False)
feature_imp

The next three cells export data from the model you just created to the FactSheet. The first lists the runs tracked by FactSheets for the experiment. The second writes the URL and other information about this notebook to the FactSheet as custom data. The third exports the facts for the current run. Note that any data that might be helpful for model validators can be written to the FactSheet.

In [ ]:

facts_client.runs.list_runs_by_experiment('1')

In [ ]:

nb_name = "attrition model creation and deployment"
nb_asset_id = "tbd"
nb_asset_url = "https://" + CPD_URL + "/analytics/notebooks/v2/" + nb_asset_id + "?projectid=" + PROJECT_UID + "&context=cpdaas"

latestRunId = facts_client.runs.list_runs_by_experiment('1').sort_values('start_time').iloc[-1]['run_id']
facts_client.runs.set_tags(latestRunId, {"Notebook name": nb_name, "Notebook id": nb_asset_id, "Notebook URL" : nb_asset_url})
facts_client.export_facts.export_payload(latestRunId)
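
The notebook URL above is assembled by string concatenation. If any of the values could contain characters that need escaping, the query string can instead be built with the standard library, as in the sketch below. `notebook_url` is a hypothetical helper, and the host and project ID are placeholder values.

```python
from urllib.parse import urlencode


def notebook_url(host, asset_id, project_id, context="cpdaas"):
    """Build the notebook URL, letting urlencode escape the query parameters."""
    query = urlencode({"projectid": project_id, "context": context})
    return f"https://{host}/analytics/notebooks/v2/{asset_id}?{query}"


# Placeholder host and project ID, for illustration only
url = notebook_url("example.cloud.test", "tbd", "12345678-1234-5678-1234-567812345678")
print(url)
```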

In [ ]:

RUN_ID=facts_client.runs.get_current_run_id()
facts_client.export_facts.export_payload(RUN_ID)

Finally, the model is stored to the project with all of the metadata defined above.

In [ ]:

print("Storing model...")
published_model_details = wml_client.repository.store_model(
    model=clf, 
    meta_props=model_props,
    training_target=['ATTRITION'],
    training_data=X)
model_uid = wml_client.repository.get_model_id(published_model_details)

print("Done")
print("Model ID: {}".format(model_uid))

Next, the notebook uses Apache Spark to create a second model. Because you specified a Spark environment when you created this notebook, the pyspark runtime will be available without needing to be installed via pip.

In [ ]:

try:
    from pyspark.sql import SparkSession
except ImportError:
    print(
        "Error: Spark runtime is missing. If you are using Watson Studio, change the notebook runtime to Spark by "
        "clicking the View notebook info button above (the lowercase i in a circle). Click on the Environment tab "
        "and use the Environment definition dropdown to select an environment with Spark and Python."
    )
    raise
# `spark` is assumed to be the session object that the Spark environment predefines
spark.version
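
A quick way to see which optional libraries the current environment provides, without triggering an import error, is to look them up first. The sketch below uses only the standard library; `module_available` is a hypothetical helper, and the two module names are the ones this notebook imports.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


for name in ("pyspark", "ibm_aigov_facts_client"):
    status = "available" if module_available(name) else "missing"
    print(f"{name}: {status}")
```

This only reports whether a module can be found. It does not detect version conflicts, so a kernel restart may still be needed after installing packages.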

3. !!--STOP--!! Insert data to code below

Place your cursor in the empty code cell below. Then click the Find and add data icon in the upper right corner of the screen like you did in step 2. Locate the modeling_records_2022.csv file, click its associated Insert to code dropdown, and select SparkSession DataFrame.

IMPORTANT: replace all instances of `df_data_x` with `df_data_2` in the code

The generated code will likely use a different variable, such as df_data_3, to hold the data. Update the last two lines of the code so that the data is loaded into the df_data_2 variable; the rest of the notebook depends on that name.

Run the inserted code cell below. If you have correctly imported the data, you will see a table populated with employee data. The remainder of the notebook is very similar to the training of the sklearn model. It will enable FactSheets for the second model, train a Spark Gradient-Boosted Tree classifier, and then save that model to the project. You may run the rest of the notebook to its conclusion.

In [ ]:

Similar to the sklearn model, you need to specify metadata for the Spark model.

In [ ]:

software_spec_uid = wml_client.software_specifications.get_id_by_name("spark-mllib_3.3")
print("Software Specification ID: {}".format(software_spec_uid))
model_props = {
    wml_client._models.ConfigurationMetaNames.NAME:"{}".format("attrition challenger - spark"),
    wml_client._models.ConfigurationMetaNames.TYPE: "mllib_3.3",
    wml_client._models.ConfigurationMetaNames.SOFTWARE_SPEC_UID: software_spec_uid,
    wml_client._models.ConfigurationMetaNames.TRAINING_DATA_REFERENCES: training_data_references,
    wml_client._models.ConfigurationMetaNames.LABEL_FIELD: "ATTRITION"
}

facts_client.export_facts.prepare_model_meta(wml_client=wml_client,meta_props=model_props)

In [ ]:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml import Pipeline, Model

For the second model, you will create a Gradient Boosted Tree classifier. For more information on Gradient Boosting, see here.

In [ ]:

# Cast every feature column to float and the label column to int
for field in fields:
    df_data_2 = df_data_2.withColumn(field, df_data_2[field].cast("float"))
df_data_2 = df_data_2.withColumn('ATTRITION', df_data_2['ATTRITION'].cast("int"))
df_data_2.take(5)

In [ ]:

va = VectorAssembler(inputCols = fields, outputCol='features')
va_df = va.transform(df_data_2)
va_df = va_df.select(['features', 'ATTRITION'])
va_df.show(3)

In [ ]:

gbtc = GBTClassifier(labelCol="ATTRITION", maxIter=20)

pipeline = Pipeline(stages=[va, gbtc])

In [ ]:

split_data = df_data_2.randomSplit([0.8, 0.2], 24)
train_data = split_data[0]
test_data = split_data[1]

print("Number of training records: " + str(train_data.count()))
print("Number of testing records : " + str(test_data.count()))
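
The record counts printed above are usually close to an 80/20 ratio but not exactly on it, because a weighted random split assigns each row independently rather than cutting at a fixed position. The pure-Python sketch below illustrates that behaviour. It is not Spark's implementation, and `random_split` and the integer rows are hypothetical.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one part at random, in proportion to the weights."""
    total = sum(weights)
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total  # weights are normalised to sum to 1
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point rounding in the last bound
    rng = random.Random(seed)
    parts = [[] for _ in weights]
    for row in rows:
        draw = rng.random()  # uniform in [0, 1)
        for index, upper in enumerate(bounds):
            if draw < upper:
                parts[index].append(row)
                break
    return parts


train_part, test_part = random_split(range(1000), [0.8, 0.2], seed=24)
print(len(train_part), len(test_part))  # close to 800 and 200, rarely exact
```

Fixing the seed, as the notebook does with 24, keeps the split repeatable across runs on the same data.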

In [ ]:

spark_model = pipeline.fit(train_data)

pred = spark_model.transform(test_data)
pred.show(3)

In [ ]:

evaluator = BinaryClassificationEvaluator()
evaluator.setLabelCol("ATTRITION")
print("Test Area Under ROC: " + str(evaluator.evaluate(pred, {evaluator.metricName: "areaUnderROC"})))
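
Area under the ROC curve can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way on made-up scores. It is an illustration of the metric rather than Spark's own computation, which may differ slightly, and `area_under_roc` is a hypothetical helper.

```python
def area_under_roc(labels, scores):
    """Return the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    positives = [score for label, score in zip(labels, scores) if label == 1]
    negatives = [score for label, score in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels and scores: three of the four positive/negative pairs are ordered correctly
print(area_under_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of positives above negatives.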

In [ ]:

print("Storing spark model...")
published_model_details = wml_client.repository.store_model(
    model=spark_model, 
    meta_props=model_props,
    training_target=['ATTRITION'],
    training_data=train_data,
    pipeline=pipeline
)
model_uid = wml_client.repository.get_model_id(published_model_details)

print("Done")
print("Model ID: {}".format(model_uid))

Congratulations!

You have completed this notebook. You can now return to the Data and AI Live Demos lab page and continue with the lab.
