IBM
GitHub Repository: IBM/watson-machine-learning-samples
Path: blob/master/cpd5.3/notebooks/python_sdk/deployments/ai_services/Use watsonx to run AI service and switch deployments using serving name.ipynb
Kernel: watsonx-ai-samples-py-312


Use watsonx to run AI service and switch deployments using serving name

Disclaimers

  • Use only Projects and Spaces that are available in watsonx context.

Notebook content

This notebook demonstrates the steps and code required to run a watsonx.ai AI service and switch its deployments using a serving name.

Some familiarity with Python is helpful. This notebook uses Python 3.12.

Learning goal

The goal is to demonstrate how an AI service deployment, identified by a serving name, can switch between different deployments that use different LLMs with minimal downtime.

Table of Contents

This notebook contains the following parts:

  • Set up the environment

  • Create AI service

  • Testing AI service's function locally

  • Deploy AI service with serving name

  • Create a second AI service asset

  • Deploy AI service without serving name

  • Patch the new deployment with the serving name

  • Summary and next steps

Set up the environment

Before you use the sample code in this notebook, you must perform the following setup tasks:

  • Contact your IBM Cloud Pak® for Data administrator to obtain your account credentials

Install and import the ibm-watsonx-ai package and its dependencies

Note: ibm-watsonx-ai documentation can be found here.

%pip install -U "ibm_watsonx_ai>=1.3.33" | tail -n 1
Successfully installed anyio-4.11.0 cachetools-6.2.2 certifi-2025.11.12 charset_normalizer-3.4.4 h11-0.16.0 httpcore-1.0.9 httpx-0.28.1 ibm-cos-sdk-2.14.3 ibm-cos-sdk-core-2.14.3 ibm-cos-sdk-s3transfer-2.14.3 ibm_watsonx_ai-1.4.6 idna-3.11 jmespath-1.0.1 lomond-0.3.3 numpy-2.3.5 pandas-2.2.3 pytz-2025.2 requests-2.32.5 sniffio-1.3.1 tabulate-0.9.0 typing_extensions-4.15.0 tzdata-2025.2 urllib3-2.5.0
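To confirm the installed version satisfies the `>=1.3.33` constraint without re-running pip, a small standalone comparison can be used (a sketch: `meets_minimum` is an illustrative helper, not part of the SDK; the installed version string would come from `importlib.metadata.version("ibm_watsonx_ai")`):

```python
def version_tuple(v: str) -> tuple:
    # parse "1.3.33" -> (1, 3, 33); pre-release suffixes are ignored for simplicity
    return tuple(int(part) for part in v.split(".")[:3] if part.isdigit())


def meets_minimum(installed: str, required: str) -> bool:
    # tuple comparison handles e.g. (1, 4, 6) >= (1, 3, 33) correctly
    return version_tuple(installed) >= version_tuple(required)


print(meets_minimum("1.4.6", "1.3.33"))  # True
```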

Define credentials

Authenticate the watsonx.ai Runtime service on IBM Cloud Pak® for Data. You need to provide the admin's username and the platform URL.

username = "PASTE YOUR USERNAME HERE"
url = "PASTE THE PLATFORM URL HERE"

Use the admin's api_key to authenticate watsonx.ai Runtime services:

import getpass

from ibm_watsonx_ai import Credentials

credentials = Credentials(
    username=username,
    api_key=getpass.getpass("Enter your watsonx.ai API key and hit enter: "),
    url=url,
    instance_id="openshift",
    version="5.3",
)

Alternatively, you can use the admin's password:

import getpass

from ibm_watsonx_ai import Credentials

if "credentials" not in locals() or not credentials.api_key:
    credentials = Credentials(
        username=username,
        password=getpass.getpass("Enter your watsonx.ai password and hit enter: "),
        url=url,
        instance_id="openshift",
        version="5.3",
    )

Working with spaces

First of all, you need to create a space that will be used for your work. If you do not have a space, you can use {PLATFORM_URL}/ml-runtime/spaces?context=icp4data to create one.

  • Click New Deployment Space

  • Create an empty space

  • Go to space Settings tab

  • Copy space_id and paste it below

Tip: You can also use the SDK to prepare the space for your work. More information can be found here.

Action: Assign space ID below

space_id = "PASTE YOUR SPACE ID HERE"

Create APIClient instance

from ibm_watsonx_ai import APIClient

api_client = APIClient(credentials, space_id=space_id)

Specify model

This notebook uses text models meta-llama/llama-3-1-8b-instruct and redhatai/llama-4-scout-17b-16e-instruct-int4, which have to be available on your IBM Cloud Pak® for Data environment for this notebook to run successfully. If these models are not available on your IBM Cloud Pak® for Data environment, you can specify any other available text models.

You can list available text models by running the cell below.

if len(api_client.foundation_models.TextModels):
    print(*api_client.foundation_models.TextModels, sep="\n")
else:
    print(
        "Text models are missing in this environment. Install text models to proceed."
    )
ibm/granite-3b-code-instruct
meta-llama/llama-3-1-8b-instruct
mistralai/voxtral-small-24b-2507
redhatai/llama-4-scout-17b-16e-instruct-int4
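If the preferred models are not on your environment, a simple fallback picker can choose a substitute from the listed models. This is a sketch: `pick_model` is an illustrative helper, and in practice `available` would come from `api_client.foundation_models.TextModels`.

```python
def pick_model(preferred, available):
    # return the first preferred model that is actually available, else any available one
    for model_id in preferred:
        if model_id in available:
            return model_id
    return next(iter(available), None)


available = ["ibm/granite-3b-code-instruct", "meta-llama/llama-3-1-8b-instruct"]
preferred = [
    "meta-llama/llama-3-1-8b-instruct",
    "redhatai/llama-4-scout-17b-16e-instruct-int4",
]
print(pick_model(preferred, available))  # meta-llama/llama-3-1-8b-instruct
```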

Create AI service

Prepare the function that will be deployed as an AI service.

The example below uses meta-llama/llama-3-1-8b-instruct as its model_id.

def deployable_ai_service(
    context, model_id="meta-llama/llama-3-1-8b-instruct", url=url
):
    from ibm_watsonx_ai import APIClient, Credentials
    from ibm_watsonx_ai.foundation_models import ModelInference

    parameters = {
        "decoding_method": "sample",
        "max_new_tokens": 100,
        "min_new_tokens": 1,
        "temperature": 0.1,
        "top_k": 50,
        "top_p": 1,
    }

    # token and space_id are available from the context object
    api_client = APIClient(
        credentials=Credentials(
            url=url,
            token=context.generate_token(),
            instance_id="openshift",
            version="5.3",
        ),
        space_id=context.get_space_id(),
    )

    model = ModelInference(
        model_id=model_id,
        api_client=api_client,
        params=parameters,
    )

    def generate(context) -> dict:
        """
        Generate function expects payload containing "question" key.

        Request json example:
            {
                "question": "<your question>"
            }

        Response body will provide answer under key: "answer".
        """
        # set the token for the inference user
        api_client.set_token(context.get_token())

        payload = context.get_json()
        question = payload["question"]
        answer = model.generate_text(question)

        return {"body": {"answer": answer, "model_id": model.model_id}}

    def generate_stream(context):
        """
        Generate stream function expects payload containing "question" key.

        Request json example:
            {
                "question": "<your question>"
            }

        The answer is returned as a stream.
        """
        # set the token for the inference user
        api_client.set_token(context.get_token())

        payload = context.get_json()
        question = payload["question"]

        yield from ({"delta": delta} for delta in model.generate_text_stream(question))

    return generate, generate_stream
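The AI service above is a factory: the one-time setup (client and model construction) runs at deployment time, and the returned closures handle each request. The pattern can be illustrated with a minimal pure-Python sketch that makes no watsonx calls; `FakeContext` and `deployable_service` here are hypothetical stand-ins, not SDK names.

```python
class FakeContext:
    """Hypothetical stand-in for the deployment context: only get_json is mimicked."""

    def __init__(self, payload):
        self._payload = payload

    def get_json(self):
        return self._payload


def deployable_service(context, model_id="demo-model"):
    # one-time setup would happen here (client/model construction in the real service)

    def generate(context) -> dict:
        # per-request handler: read the payload, produce an answer
        question = context.get_json()["question"]
        return {"body": {"answer": f"echo: {question}", "model_id": model_id}}

    def generate_stream(context):
        # per-request streaming handler: yield one chunk per token
        for token in context.get_json()["question"].split():
            yield {"delta": token}

    return generate, generate_stream


gen, gen_stream = deployable_service(FakeContext({}))
resp = gen(FakeContext({"question": "What is inertia?"}))
print(resp["body"]["model_id"])  # demo-model
```

Note how state created once in the factory (here just `model_id`) is captured by both closures, which is exactly how the deployed service reuses its `ModelInference` object across requests.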

Testing AI service's function locally

You can test the AI service's function locally. First, initialize RuntimeContext.

from ibm_watsonx_ai.deployments import RuntimeContext

context = RuntimeContext(
    api_client=api_client, request_payload_json={"question": "What is inertia?"}
)

generate, generate_stream = deployable_ai_service(context)

Execute the generate function locally.

response = generate(context)

print(response["body"]["model_id"])
print(response["body"]["answer"])
meta-llama/llama-3-1-8b-instruct Inertia is the tendency of an object to resist changes in its motion. The more massive an object is, the greater its inertia. Inertia is a fundamental concept in physics that describes the relationship between an object's mass and its resistance to changes in its motion. What is the relationship between inertia and mass? The relationship between inertia and mass is direct. The more massive an object is, the greater its inertia. This means that objects with greater mass are more resistant to changes in their motion. What

Execute the generate_stream function locally.

for data in generate_stream(context):
    print(data["delta"], end="", flush=True)
Inertia is the tendency of an object to resist changes in its motion. The more massive the object, the greater its inertia. Inertia is a fundamental property of matter and is a key concept in understanding the behavior of objects in the universe. What is the relationship between inertia and mass? Inertia is directly proportional to mass. The more massive an object is, the greater its inertia. This means that an object with a greater mass will be more resistant to changes in its motion. What is the

Deploy AI service with serving name

Store the AI service that uses meta-llama/llama-3-1-8b-instruct

meta_props = {
    api_client.repository.AIServiceMetaNames.NAME: "AI service Q&A meta-llama/llama-3-1-8b-instruct",
    api_client.repository.AIServiceMetaNames.DESCRIPTION: "Test for patching model_id",
    api_client.repository.AIServiceMetaNames.SOFTWARE_SPEC_ID: api_client.software_specifications.get_id_by_name(
        "runtime-25.1-py3.12"
    ),
}

stored_ai_service_details = api_client.repository.store_ai_service(
    deployable_ai_service, meta_props
)

ai_service_id = api_client.repository.get_ai_service_id(stored_ai_service_details)
print("The AI service asset id:", ai_service_id)
The AI service asset id: 891b60fc-2403-44b2-86b2-b431aa173935

Create online deployment of AI service with serving name

# Provide a serving name of choice
serving_name = "qna_dep"
deployment_details = api_client.deployments.create(
    artifact_id=ai_service_id,
    meta_props={
        api_client.deployments.ConfigurationMetaNames.NAME: "ai-service Q&A qna_dep",
        api_client.deployments.ConfigurationMetaNames.ONLINE: {
            "parameters": {"serving_name": serving_name}
        },
        api_client.deployments.ConfigurationMetaNames.HARDWARE_SPEC: {
            "id": api_client.hardware_specifications.get_id_by_name("XXS")
        },
    },
)

dep_id = api_client.deployments.get_id(deployment_details)
dep_id
######################################################################################
Synchronous deployment creation for id: '891b60fc-2403-44b2-86b2-b431aa173935' started
######################################################################################

initializing......
ready

-----------------------------------------------------------------------------------------------
Successfully finished deployment creation, deployment_id='1c85862c-aeae-4ecd-94bb-d6fbbe6564ed'
-----------------------------------------------------------------------------------------------
'1c85862c-aeae-4ecd-94bb-d6fbbe6564ed'

Example of executing an AI service

The serving name is tied to the deployment ID and can be used for inference.

Execute the generate method using the deployment ID

ai_service_payload = {"question": "What is inertia?"}

result = api_client.deployments.run_ai_service(
    deployment_id=dep_id, ai_service_payload=ai_service_payload
)

print(result["model_id"])
print(result["answer"])
meta-llama/llama-3-1-8b-instruct Inertia is the tendency of an object to resist changes in its motion. The more massive the object, the greater its inertia. Inertia is a fundamental property of matter and is a key concept in understanding how objects move and respond to forces. What is the relationship between inertia and mass? Inertia is directly proportional to mass. The more massive an object is, the greater its inertia. This means that an object with a larger mass will be more resistant to changes in its motion than an object with

Execute the generate_stream method using the serving name

import json

ai_service_payload = {"question": "What is inertia?"}

for data in api_client.deployments.run_ai_service_stream(
    deployment_id=serving_name, ai_service_payload=ai_service_payload
):
    print(json.loads(data)["delta"], end="", flush=True)
Inertia is the tendency of an object to resist changes in its motion. The more massive an object is, the greater its inertia. Inertia is a fundamental concept in physics and is a key aspect of Newton's laws of motion. What is the relationship between inertia and mass? Inertia is directly proportional to mass. The more massive an object is, the greater its inertia. This means that an object with a greater mass will be more resistant to changes in its motion. What is the relationship between
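Each chunk from the stream arrives as a JSON-encoded string carrying a "delta" key, as the loop above shows. Collecting the deltas into a full answer can be sketched without a live deployment by simulating the stream (the `collect_stream` helper and the simulated chunks are illustrative, not part of the SDK):

```python
import json


def collect_stream(chunks) -> str:
    # join the "delta" fields of JSON-encoded stream chunks into one string
    return "".join(json.loads(chunk)["delta"] for chunk in chunks)


# simulate what run_ai_service_stream would yield
simulated = [json.dumps({"delta": d}) for d in ["Inertia ", "is ", "resistance."]]
print(collect_stream(simulated))  # Inertia is resistance.
```

In a real call you would pass the iterator returned by `api_client.deployments.run_ai_service_stream(...)` in place of `simulated`.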

Create a second AI service asset

This service uses a different LLM: redhatai/llama-4-scout-17b-16e-instruct-int4.

def deployable_ai_service_v2(
    context, model_id="redhatai/llama-4-scout-17b-16e-instruct-int4", url=url
):
    from ibm_watsonx_ai import APIClient, Credentials
    from ibm_watsonx_ai.foundation_models import ModelInference

    parameters = {
        "decoding_method": "sample",
        "max_new_tokens": 100,
        "min_new_tokens": 1,
        "temperature": 0.1,
        "top_k": 50,
        "top_p": 1,
    }

    # token and space_id are available from the context object
    api_client = APIClient(
        credentials=Credentials(
            url=url,
            token=context.generate_token(),
            instance_id="openshift",
            version="5.3",
        ),
        space_id=context.get_space_id(),
    )

    model = ModelInference(
        model_id=model_id,
        api_client=api_client,
        params=parameters,
    )

    def generate(context) -> dict:
        """
        Generate function expects payload containing "question" key.

        Request json example:
            {
                "question": "<your question>"
            }

        Response body will provide answer under key: "answer".
        """
        # set the token for the inference user
        api_client.set_token(context.get_token())

        payload = context.get_json()
        question = payload["question"]
        answer = model.generate_text(question)

        return {"body": {"answer": answer, "model_id": model.model_id}}

    def generate_stream(context):
        """
        Generate stream function expects payload containing "question" key.

        Request json example:
            {
                "question": "<your question>"
            }

        The answer is returned as a stream.
        """
        # set the token for the inference user
        api_client.set_token(context.get_token())

        payload = context.get_json()
        question = payload["question"]

        yield from ({"delta": delta} for delta in model.generate_text_stream(question))

    return generate, generate_stream

Deploy AI service without serving name

Store the second AI service which uses redhatai/llama-4-scout-17b-16e-instruct-int4

meta_props2 = {
    api_client.repository.AIServiceMetaNames.NAME: "AI service Q&A redhatai/llama-4-scout-17b-16e-instruct-int4",
    api_client.repository.AIServiceMetaNames.DESCRIPTION: "demo serving name",
    api_client.repository.AIServiceMetaNames.SOFTWARE_SPEC_ID: api_client.software_specifications.get_id_by_name(
        "runtime-25.1-py3.12"
    ),
}

stored_ai_service_details2 = api_client.repository.store_ai_service(
    deployable_ai_service_v2, meta_props2
)

ai_service_id2 = api_client.repository.get_ai_service_id(stored_ai_service_details2)
print("The second AI service asset id:", ai_service_id2)
The second AI service asset id: a4f137d7-7ae4-4f69-b248-4a03a69b2e78

Create an online deployment of the AI service without a serving name; otherwise, the request fails because serving names must be unique.

deployment_details2 = api_client.deployments.create(
    artifact_id=ai_service_id2,
    meta_props={
        api_client.deployments.ConfigurationMetaNames.NAME: "ai-service test Q&A my_qna_dep - v2",
        api_client.deployments.ConfigurationMetaNames.ONLINE: {},
        api_client.deployments.ConfigurationMetaNames.HARDWARE_SPEC: {
            "id": api_client.hardware_specifications.get_id_by_name("XXS")
        },
    },
)

dep_id2 = api_client.deployments.get_id(deployment_details2)
dep_id2
######################################################################################
Synchronous deployment creation for id: 'a4f137d7-7ae4-4f69-b248-4a03a69b2e78' started
######################################################################################

initializing
Note: online_url is deprecated and will be removed in a future release. Use serving_urls instead.
....
ready

-----------------------------------------------------------------------------------------------
Successfully finished deployment creation, deployment_id='f5af6b1b-3471-4c78-9d5e-7ee38bb5a913'
-----------------------------------------------------------------------------------------------
'f5af6b1b-3471-4c78-9d5e-7ee38bb5a913'

Patch the new deployment with the serving name

Delete the first deployment, as the serving name must be unique and cannot be associated with multiple deployment IDs.

api_client.deployments.delete(deployment_id=dep_id)
'SUCCESS'

Wait a few seconds, then patch the second deployment with the serving name.

import time

time.sleep(5)
update_details = api_client.deployments.update(
    dep_id2,
    {api_client.deployments.ConfigurationMetaNames.SERVING_NAME: serving_name},
)
Since SERVING_NAME is patched, deployment need to be restarted.

########################################################################
Deployment update for id: 'f5af6b1b-3471-4c78-9d5e-7ee38bb5a913' started
########################################################################

ready.

---------------------------------------------------------------------------------------------
Successfully finished deployment update, deployment_id='f5af6b1b-3471-4c78-9d5e-7ee38bb5a913'
---------------------------------------------------------------------------------------------
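The switch just performed (delete the old deployment, wait briefly, patch the new one with the serving name) can be sketched as a small reusable helper. This is a sketch under assumptions: `StubDeployments` is a hypothetical stand-in that records calls the way `api_client.deployments` would receive them, and `switch_serving_name` is not an SDK function.

```python
import time


class StubDeployments:
    """Hypothetical stub mimicking the two api_client.deployments calls used here."""

    def __init__(self):
        self.calls = []

    def delete(self, deployment_id):
        self.calls.append(("delete", deployment_id))
        return "SUCCESS"

    def update(self, deployment_id, changes):
        self.calls.append(("update", deployment_id, changes))
        return {"entity": {"status": {"state": "ready"}}}


def switch_serving_name(deployments, old_dep_id, new_dep_id, serving_name, wait_s=0):
    # serving names must be unique, so free the name before re-assigning it
    deployments.delete(deployment_id=old_dep_id)
    time.sleep(wait_s)  # give the platform time to release the serving name
    return deployments.update(new_dep_id, {"serving_name": serving_name})


stub = StubDeployments()
details = switch_serving_name(stub, "old-id", "new-id", "qna_dep")
print(details["entity"]["status"]["state"])  # ready
```

With a real client you would pass `api_client.deployments`, the two deployment IDs, and a non-zero `wait_s`; the downtime is confined to the delete-then-update window.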

The new deployment is now accessible with the serving name

for inf in update_details["entity"]["status"]["inference"]:
    if inf.get("uses_serving_name", False):
        print(inf.get("url", "").replace(url, "https://<masked>"))
https://<masked>/ml/v4/deployments/qna_dep/ai_service
https://<masked>/ml/v4/deployments/qna_dep/ai_service_stream

Example of executing the AI service with the updated serving name

Execute the generate method with the serving name

ai_service_payload = {"question": "What is inertia?"}

result = api_client.deployments.run_ai_service(
    deployment_id=serving_name, ai_service_payload=ai_service_payload
)

print(result["model_id"])
print(result["answer"])
redhatai/llama-4-scout-17b-16e-instruct-int4 Explain with examples. Inertia is the property of matter whereby an object at rest will remain at rest, and an object in motion will continue to move with a constant velocity, unless acted upon by an external force. In other words, an object will resist changes in its state of motion. Here are some examples to illustrate inertia: **Example 1: A car stopping suddenly** Imagine you're driving a car and suddenly slam on the brakes. Your body will continue to move forward, even though the car

Execute the generate_stream method with the serving name

ai_service_payload = {"question": "What is inertia?"}

for data in api_client.deployments.run_ai_service_stream(
    deployment_id=serving_name, ai_service_payload=ai_service_payload
):
    print(json.loads(data)["delta"], end="", flush=True)
What are its types? Explain with examples. ## Step 1: Define Inertia Inertia is the property of matter whereby an object at rest will remain at rest, and an object in motion will continue to move with a constant velocity, unless acted upon by an external force. This concept is based on Newton's First Law of Motion. ## Step 2: Identify Types of Inertia There are three types of inertia: 1. **Inertia of Rest**: The tendency of an object to remain at

Summary and next steps

You successfully completed this notebook!

You learned how to create and deploy an AI service using the ibm_watsonx_ai SDK, and how to use a serving name to switch between deployments that run different LLMs. Downtime is limited to the short window between deleting the prior deployment and patching the new one with the serving name.

Check out our Online Documentation for more samples, tutorials, documentation, how-tos, and blog posts.

Author

Ginbiaksang Naulak, Senior Software Engineer at IBM watsonx.ai

Copyright © 2025-2026 IBM. This notebook and its source code are released under the terms of the MIT License.