Use XGBoost to classify tumors with ibm-watson-machine-learning
This notebook contains steps and code to get data from the IBM Watson Studio Community, create a predictive model, and start scoring new data. It introduces commands for getting data, basic data cleaning and exploration, model training, model persistence to the Watson Machine Learning repository, model deployment, and scoring.
Some familiarity with Python is helpful. This notebook uses Python 3.10, XGBoost, and scikit-learn.
You will use a publicly available data set, the Breast Cancer Wisconsin (Diagnostic) Data Set, to train an XGBoost model to classify breast cancer tumors (as benign or malignant) from 569 diagnostic images based on measurements such as radius, texture, perimeter, and area. XGBoost is short for “Extreme Gradient Boosting”.
The XGBoost classifier makes its predictions by combining the outputs of an ensemble of classification trees. It combines weak learners into a single strong learner through a sequential training process in which each new learner focuses on the examples misclassified by the previous ones.
Learning goals
You will learn how to:
Load a CSV file into a numpy array
Explore data
Prepare data for training and evaluation
Create an XGBoost machine learning model
Train and evaluate a model
Use cross-validation to optimize the model's hyperparameters
Persist a model in Watson Machine Learning repository
Deploy a model for online scoring
Score sample data
Contents
This notebook contains the following parts:
Connection to WML
Authenticate the Watson Machine Learning service on IBM Cloud Pak for Data. You need to provide the platform url, your username, and your api_key. Alternatively, you can use your username and password to authenticate WML services.
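For illustration, a minimal credentials sketch; every placeholder value is an assumption you must replace with your own details:

```python
username = "PASTE YOUR USERNAME HERE"
api_key = "PASTE YOUR API_KEY HERE"
url = "PASTE THE PLATFORM URL HERE"

wml_credentials = {
    "username": username,
    "apikey": api_key,
    "url": url,
    "instance_id": "openshift",
    "version": "4.8",  # CP4D version; adjust to match your cluster
}
```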
Install and import the ibm-watson-machine-learning package.
Note: ibm-watson-machine-learning documentation can be found here.
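For example:

```python
# Install the client, then create an authenticated API client
!pip install -U ibm-watson-machine-learning

from ibm_watson_machine_learning import APIClient

client = APIClient(wml_credentials)
```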
Working with spaces
First of all, you need to create a space that will be used for your work. If you do not have a space already created, you can use {PLATFORM_URL}/ml-runtime/spaces?context=icp4data to create one:
Click New Deployment Space
Create an empty space
Go to the space Settings tab
Copy the space_id and paste it below
Tip: You can also use the SDK to prepare the space for your work. More information can be found here.
Action: Assign space ID below
You can use the list method to print all existing spaces.
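For example (assuming client is the authenticated APIClient created above):

```python
# Print all deployment spaces visible to this user
client.spaces.list(limit=10)
```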
To be able to interact with all resources available in Watson Machine Learning, you need to set the space that you will be using.
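For example:

```python
space_id = "PASTE YOUR SPACE ID HERE"
client.set.default_space(space_id)
```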
2. Load and explore data
In this section you will load the data as a numpy array and perform a basic exploration.
To load the data as a numpy array, use wget to download the data, then use the genfromtxt method to read it.
Example: First, you need to install the required packages by running the code below. Run it only once.
The CSV file BreastCancerWisconsinDataSet.csv is downloaded. Run the code in the next cells to load the file into a numpy array.
Run the code in the next cell to view the feature names and data storage types.
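A combined sketch of the install, download, and load steps; the dataset URL is a placeholder you must fill in:

```python
!pip install wget

import os
import wget
import numpy as np

filename = "BreastCancerWisconsinDataSet.csv"
data_url = "PASTE THE DATASET URL HERE"  # hypothetical placeholder

# Download the file only if it is not already present
if not os.path.isfile(filename):
    wget.download(data_url, out=filename)

# names=True reads the header row; dtype=None infers a storage type per column
data = np.genfromtxt(filename, delimiter=",", names=True, dtype=None, encoding="utf-8")

print(data.dtype.names)  # feature names and storage types
print(data.shape)        # number of records
```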
You can see that the data set has 569 records and 32 features.
3. Create an XGBoost model
In this section you will learn how to train and test an XGBoost model.
Note: Update xgboost to ensure you have version 1.6.2.
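For example:

```python
!pip install xgboost==1.6.2
```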
Now, you can prepare your data for model building. You will use the diagnosis column as your target variable, so you must remove it from the set of predictors. You must also remove the id variable.
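A minimal sketch, assuming data is the structured array loaded above and that the id and diagnosis columns are named as in the standard version of this data set:

```python
import numpy as np

# Keep every column except the target (diagnosis) and the identifier (id)
feature_names = [n for n in data.dtype.names if n not in ("id", "diagnosis")]
X = np.column_stack([data[n] for n in feature_names]).astype(float)

# Encode the target: malignant (M) -> 1, benign (B) -> 0
y = (data["diagnosis"] == "M").astype(int)
```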
Split the data set into:
Train data set
Test data set
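For example (the 80/20 split ratio is an assumption):

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the records for model evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```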
The data has been successfully split into two data sets:
The train data set, which is the largest group, will be used for training
The test data set will be used for model evaluation and to test the model's assumptions
Start by importing the necessary libraries.
3.2.1. Create an XGBoost classifier
In this section you create an XGBoost classifier with default hyperparameter values and you will call it xgb_model.
Note: The next sections show you how to improve this base model.
Note: Using the default value or n_jobs=-1 for the n_jobs parameter of the XGBoost classifier is not recommended, because the underlying process often cannot correctly discover the number of CPUs/threads allowed. Another way to control the number of cores used is through the environment variables OMP_NUM_THREADS and MKL_NUM_THREADS, which should be set by default if this notebook is executed inside Watson Studio.
Display the default parameters for xgb_model.
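A minimal sketch; n_jobs=1 follows the note above, and setting eval_metric here is an assumption so that classification error is recorded during training:

```python
from xgboost import XGBClassifier

xgb_model = XGBClassifier(n_jobs=1, eval_metric="error")

# Display the (otherwise default) parameters
print(xgb_model.get_params())
```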
Now that your XGBoost classifier, xgb_model, is set up, you can train it by invoking the fit method. You will also evaluate xgb_model on the train and test data during training.
Note: You can also use a pandas DataFrame instead of the numpy array.
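For example:

```python
# Evaluate on both the train and test sets at every boosting iteration
xgb_model.fit(
    X_train, y_train,
    eval_set=[(X_train, y_train), (X_test, y_test)],
    verbose=False,
)
```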
Plot the model performance evaluated during the training process to assess model overfitting.
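A sketch of the plotting step, assuming matplotlib is available in the runtime:

```python
import matplotlib.pyplot as plt

# validation_0 is the train set, validation_1 is the test set (fit order above)
results = xgb_model.evals_result()
epochs = range(len(results["validation_0"]["error"]))

plt.plot(epochs, results["validation_0"]["error"], label="Train")
plt.plot(epochs, results["validation_1"]["error"], label="Test")
plt.xlabel("Iteration")
plt.ylabel("Classification error")
plt.legend()
plt.show()
```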
You can see that the model overfits: accuracy on the test data starts to decrease after about 60 iterations.
Select the trained model obtained after 30 iterations.
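For example, using the iteration_range argument to score with only the first 30 boosting rounds:

```python
from sklearn.metrics import accuracy_score

# Predict using only the first 30 trees of the trained model
y_pred = xgb_model.predict(X_test, iteration_range=(0, 30))
print("Accuracy: %.3f" % accuracy_score(y_test, y_pred))
```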
Note: You will use the accuracy value obtained on the test data to compare the accuracy of the model with default parameters to the accuracy of the model with tuned parameters.
3.2.2. Use grid search and cross-validation to tune the model
You can use grid search and cross-validation to tune your model to achieve better accuracy.
XGBoost has an extensive catalog of hyperparameters, which provides great flexibility to shape an algorithm's desired behavior. Here you will tune reg_alpha, which adds an L1 penalty to the model.
Use a 5-fold cross-validation because your training data set is small.
In the cell below, create the XGBoost pipeline and set up the parameter grid for the search.
Use GridSearchCV to search for the best parameters over the parameter values specified in the previous cell.
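A sketch of the search; the candidate reg_alpha values are assumptions:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([("xgb", XGBClassifier(n_jobs=1, eval_metric="error"))])

# Hypothetical grid of L1 penalties to search over
param_grid = {"xgb__reg_alpha": [0.0, 0.1, 1.0, 10.0]}

# 5-fold cross-validation, as discussed above
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
grid_search.fit(X_train, y_train)
```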
From the grid scores, you can see the performance result of all parameter combinations including the best parameter combination based on model performance.
Display the accuracy estimated using cross-validation and the hyperparameter values for the best model.
Display the accuracy of best parameter combination on the test set.
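For example:

```python
# Mean cross-validated accuracy for every parameter combination
for params, score in zip(
    grid_search.cv_results_["params"],
    grid_search.cv_results_["mean_test_score"],
):
    print(params, "%.3f" % score)

print("Best CV accuracy: %.3f" % grid_search.best_score_)
print("Best parameters:", grid_search.best_params_)

# Accuracy of the best parameter combination on the held-out test set
print("Test accuracy: %.3f" % grid_search.score(X_test, y_test))
```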
The accuracy on the test set is about the same for the tuned model as for the model trained with default hyperparameter values, even though the selected hyperparameters differ from the defaults.
3.2.3. Model with pipeline data preprocessing
Here you learn how to use the XGBoost model within a scikit-learn pipeline.
Let's start by importing the required objects.
Now you are ready to evaluate the accuracy of the model trained on the reduced set of features.
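A sketch of one possible preprocessing pipeline; reducing the feature set with PCA to 10 components is an assumption:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score

# Scale the features, reduce them to 10 principal components, then classify
model = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=10)),
    ("xgb", XGBClassifier(n_jobs=1)),
])

model.fit(X_train, y_train)
print("Test accuracy: %.3f" % accuracy_score(y_test, model.predict(X_test)))
```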
You can see that this model has a similar accuracy to the model trained using default hyperparameter values.
Let's see how you can save your XGBoost pipeline using the WML service instance and deploy it for online scoring.
In this section you learn how to use the Python client libraries to store your XGBoost model in the WML repository.
Save the XGBoost model to the WML Repository
Save the model artifact as XGBoost model for breast cancer to your WML instance.
Get the software specification for XGBoost.
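A sketch of the persistence step; the software specification name and model TYPE string are assumptions that must match your CP4D installation (you can check them with client.software_specifications.list()):

```python
software_spec_uid = client.software_specifications.get_uid_by_name("runtime-22.2-py3.10")

metadata = {
    client.repository.ModelMetaNames.NAME: "XGBoost model for breast cancer",
    client.repository.ModelMetaNames.TYPE: "scikit-learn_1.1",  # assumed type string
    client.repository.ModelMetaNames.SOFTWARE_SPEC_UID: software_spec_uid,
}

# Store the scikit-learn pipeline built above in the WML repository
published_model = client.repository.store_model(model=model, meta_props=metadata)
```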
Get the saved model metadata from WML.
You need the model_uid to create the deployment. You can extract the model_uid from the saved model details and use it in the next section to create the deployment.
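For example:

```python
# Older client versions expose this as get_model_uid
model_uid = client.repository.get_model_id(published_model)

model_details = client.repository.get_details(model_uid)
print(model_details)
```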
Now you can create a deployment, Predict breast cancer.
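For example:

```python
deployment = client.deployments.create(
    artifact_uid=model_uid,
    meta_props={
        client.deployments.ConfigurationMetaNames.NAME: "Predict breast cancer",
        client.deployments.ConfigurationMetaNames.ONLINE: {},
    },
)
```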
Get a list of all deployments.
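For example:

```python
client.deployments.list()
```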
The Predict breast cancer model has been successfully deployed.
5.2 Get deployment details
To show the deployment details, you need to get the deployment_uid.
Now, extract the endpoint URL, scoring_url, which will be used to send scoring requests.
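For example:

```python
# Newer client versions expose this as get_id
deployment_uid = client.deployments.get_uid(deployment)
deployment_details = client.deployments.get_details(deployment_uid)

scoring_url = client.deployments.get_scoring_href(deployment_details)
print(scoring_url)
```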
Prepare the scoring payload with the values to score.
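A sketch of the scoring step; reusing the first test record as the payload is an assumption:

```python
# One record of feature values to classify
scoring_payload = {"input_data": [{"values": [X_test[0].tolist()]}]}

predictions = client.deployments.score(deployment_uid, scoring_payload)
print(predictions)
```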
Result: The patient record is classified as a benign tumor.
If you want to clean up all created assets:
experiments
trainings
pipelines
model definitions
models
functions
deployments
please follow this sample notebook.
8. Summary and next steps
You successfully completed this notebook! You learned how to use the XGBoost machine learning library as well as Watson Machine Learning for model creation and deployment.
Check out our Online Documentation for more samples, tutorials, documentation, how-tos, and blog posts.
Authors
Wojciech Jargielo, Software Engineer
Copyright © 2020-2025 IBM. This notebook and its source code are released under the terms of the MIT License.