GitHub Repository: IBM/watson-machine-learning-samples
Path: blob/master/cloud/notebooks/python_sdk/deployments/ai_services/Use watsonx, and Model Gateway to run as AI service with load balancing.ipynb
Kernel: .venv_watsonx_ai_samples_py_312


Use watsonx, and Model Gateway to run as AI service with load balancing

Disclaimers

  • Use only Projects and Spaces that are available in watsonx context.

Notebook content

This notebook provides a detailed demonstration of the steps and code required to showcase support for watsonx.ai Model Gateway.

Some familiarity with Python is helpful. This notebook uses Python 3.12.

Learning goal

The learning goal for this notebook is to leverage Model Gateway to create AI services using a model served by an OpenAI-compatible provider. You will also learn how to achieve model load balancing inside an AI service.

Table of Contents

This notebook contains the following parts:

  • Set up the environment

  • Initialize and configure Model Gateway

  • Create model and deploy it as AI service

  • Create models and deploy them as an AI service with load balancing

  • Summary and next steps

Set up the environment

Before you use the sample code in this notebook, you must perform the following setup tasks:

Note: The example of model load balancing presented in this sample notebook may raise Status Code 429 (Too Many Requests) errors on the free plan, due to the lower maximum number of requests allowed per second.
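If you do hit 429 errors, a common mitigation is to wrap calls in a retry loop with exponential backoff. The helper below is a generic illustration, not part of the ibm_watsonx_ai SDK; the exception type you catch should be whatever rate-limit error the SDK actually raises (here a plain RuntimeError stands in for it, exercised by a stub function):

```python
import time
import random

def with_retries(call, max_attempts=5, base_delay=1.0):
    """Retry `call` with exponential backoff when it signals a 429."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RuntimeError as exc:  # substitute the SDK's rate-limit exception
            if "429" not in str(exc) or attempt == max_attempts - 1:
                raise
            # Exponential backoff with jitter: base, 2*base, 4*base, ... plus noise
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))

# Demo with a stub that fails twice with a simulated 429 before succeeding
attempts = {"n": 0}

def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("Status Code 429 (Too Many Requests)")
    return "ok"

print(with_retries(flaky, base_delay=0.01))  # ok
```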

Install dependencies

Note: ibm-watsonx-ai documentation can be found here.

%pip install -U "ibm_watsonx_ai>=1.3.40" | tail -n 1
Successfully installed anyio-4.12.1 cachetools-6.2.4 certifi-2026.1.4 charset_normalizer-3.4.4 h11-0.16.0 httpcore-1.0.9 httpx-0.28.1 ibm-cos-sdk-2.14.3 ibm-cos-sdk-core-2.14.3 ibm-cos-sdk-s3transfer-2.14.3 ibm_watsonx_ai-1.5.0 idna-3.11 jmespath-1.0.1 lomond-0.3.3 numpy-2.4.1 pandas-2.2.3 pytz-2025.2 requests-2.32.5 tabulate-0.9.0 typing_extensions-4.15.0 tzdata-2025.3 urllib3-2.6.3

Define the watsonx.ai credentials

Use the code cell below to define the watsonx.ai credentials that are required to work with watsonx Foundation Model inferencing.

Action: Provide the IBM Cloud user API key. For details, see Managing user API keys.

import getpass

from ibm_watsonx_ai import Credentials

credentials = Credentials(
    url="https://ca-tor.ml.cloud.ibm.com",
    api_key=getpass.getpass("Enter your watsonx.ai api key and hit enter: "),
)

Working with spaces

You need to create a space that will be used for your work. If you do not have a space, you can use Deployment Spaces Dashboard to create one.

  • Click New Deployment Space

  • Create an empty space

  • Select Cloud Object Storage

  • Select watsonx.ai Runtime instance and press Create

  • Go to Manage tab

  • Copy Space GUID and paste it below

Tip: You can also use the SDK to prepare the space for your work. More information can be found here.

Action: assign space ID below

import os

try:
    space_id = os.environ["SPACE_ID"]
except KeyError:
    space_id = input("Please enter your space_id (hit enter): ")

Create APIClient instance

from ibm_watsonx_ai import APIClient

client = APIClient(credentials=credentials, space_id=space_id)

Initialize and configure Model Gateway

In this section we will initialize the Model Gateway and configure its providers.

Initialize the Model Gateway

Create Gateway instance

from ibm_watsonx_ai.gateway import Gateway

gateway = Gateway(api_client=client)

List available providers

gateway.providers.list()

Create secret instance in IBM Cloud Secrets Manager

When creating a model provider, you need to supply your credentials. This is achieved by creating a key-value secret in IBM Cloud Secrets Manager and providing its CRN in the provider creation request payload.

The exact specification of the secret content depends on the provider type. For more information, please see the documentation. For watsonx.ai provider, the content should contain the following key-value pairs:

{ "apikey": "<YOUR_API_KEY>", "auth_url": "https://iam.cloud.ibm.com/identity/token", "base_url": "https://ca-tor.ml.cloud.ibm.com", // You can use a different location "space_id": "<YOUR_SPACE_ID>", // Required if `project_id` is not provided "project_id": "<YOUR_PROJECT_ID>", // Required if `space_id` is not provided }
secret_crn_id = "PASTE_YOUR_SECRET_CRN_HERE"
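Before uploading the secret to Secrets Manager, it can be worth checking the payload against the required keys listed above. The helper below is purely illustrative (it is not part of the SDK or the Secrets Manager API); it encodes the rule that apikey, auth_url, and base_url are mandatory and that at least one of space_id or project_id must be present:

```python
def validate_watsonxai_secret(payload: dict) -> list[str]:
    """Return a list of problems found in a watsonx.ai provider secret payload."""
    problems = []
    for key in ("apikey", "auth_url", "base_url"):
        if not payload.get(key):
            problems.append(f"missing required key: {key}")
    # At least one of space_id / project_id must be supplied
    if not (payload.get("space_id") or payload.get("project_id")):
        problems.append("provide space_id or project_id")
    return problems

# Example: a draft payload that is missing its API key
draft = {
    "auth_url": "https://iam.cloud.ibm.com/identity/token",
    "base_url": "https://ca-tor.ml.cloud.ibm.com",
    "space_id": "<YOUR_SPACE_ID>",
}
print(validate_watsonxai_secret(draft))  # ['missing required key: apikey']
```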

Work with watsonx.ai provider

Create provider

watsonx_ai_provider_details = gateway.providers.create(
    provider="watsonxai", name="watsonx-ai-provider", secret_crn_id=secret_crn_id
)
watsonx_ai_provider_id = gateway.providers.get_id(watsonx_ai_provider_details)
watsonx_ai_provider_id
'e7dcd646-5bc5-44ce-8d3c-7aea165d1374'

Get provider details

gateway.providers.get_details(watsonx_ai_provider_id)

List available models for created provider

gateway.providers.list_available_models(watsonx_ai_provider_id)

Create model and deploy it as AI service

In this section we will create a model using Model Gateway and deploy it as an AI service.

Create model using Model Gateway

In this sample we will use the ibm/granite-3-8b-instruct model.

model = "ibm/granite-3-8b-instruct" model_details = gateway.models.create( provider_id=watsonx_ai_provider_id, model=model, ) model_id = gateway.models.get_id(model_details)
gateway.models.list()

Create AI service

Prepare the function that will be deployed as an AI service.

def deployable_ai_service(context, url=credentials.url, model_id=model):
    from ibm_watsonx_ai import APIClient, Credentials
    from ibm_watsonx_ai.gateway import Gateway

    api_client = APIClient(
        credentials=Credentials(url=url, token=context.generate_token()),
        space_id=context.get_space_id(),
    )
    gateway = Gateway(api_client=api_client)

    def generate(context) -> dict:
        api_client.set_token(context.get_token())

        payload = context.get_json()
        prompt = payload["prompt"]

        messages = [
            {
                "role": "user",
                "content": prompt,
            }
        ]

        response = gateway.chat.completions.create(model=model_id, messages=messages)

        return {"body": response}

    return generate

Testing AI service's function locally

Create AI service function

from ibm_watsonx_ai.deployments import RuntimeContext

context = RuntimeContext(api_client=client)

local_function = deployable_ai_service(context=context)

Prepare request payload

context.request_payload_json = {"prompt": "What is a tram?"}

Execute the function locally

import json

response = local_function(context)
print(json.dumps(response, indent=2))
{ "body": { "id": "chatcmpl-49145e45eedf3f5a5bb14f1dbcbe13d0---18ed289b-98d9-4174-8830-2ab3d11c1ca4", "object": "chat.completion", "created": 1768467790, "model": "ibm/granite-3-8b-instruct", "choices": [ { "index": 0, "message": { "role": "assistant", "content": "A tram, also known as streetcar or light rail, is a type of rail vehicle that operates on a network of tracks running through city streets. Trams are typically smaller and lighter than traditional railway trains, allowing them to navigate urban environments and stop more frequently. They are often used for public transportation within cities, maximizing accessibility and providing a connection for pedestrians. Tram systems have been in operation since the 19th century and continue to be a popular means of urban transit due to their capacity to alleviate traffic congestion and reduce carbon emissions. Modern trams often run on electricity, which makes them a more environmentally friendly transportation option." }, "finish_reason": "stop", "logprobs": null } ], "usage": { "prompt_tokens": 65, "completion_tokens": 153, "total_tokens": 218 }, "service_tier": null, "system_fingerprint": null, "cached": false } }

Deploy AI service

Store AI service with previously created custom software specification

software_specification_id = client.software_specifications.get_id_by_name(
    "genai-A25-py3.12"
)

meta_props = {
    client.repository.AIServiceMetaNames.NAME: "Model Gateway AI service with SDK",
    client.repository.AIServiceMetaNames.SOFTWARE_SPEC_ID: software_specification_id,
}
stored_ai_service_details = client.repository.store_ai_service(
    deployable_ai_service, meta_props
)
ai_service_id = client.repository.get_ai_service_id(stored_ai_service_details)
ai_service_id
'2f78e9cc-8413-487c-a444-d60b4996cd13'

Create online deployment of AI service.

meta_props = {
    client.deployments.ConfigurationMetaNames.NAME: "AI service with SDK",
    client.deployments.ConfigurationMetaNames.ONLINE: {},
}

deployment_details = client.deployments.create(ai_service_id, meta_props)
######################################################################################
Synchronous deployment creation for id: '2f78e9cc-8413-487c-a444-d60b4996cd13' started
######################################################################################

initializing
Note: online_url and serving_urls are deprecated and will be removed in a future release. Use inference instead.
..
ready

-----------------------------------------------------------------------------------------------
Successfully finished deployment creation, deployment_id='e10f1241-59ab-4e7e-bff6-d2c14e39ebc4'
-----------------------------------------------------------------------------------------------

Obtain the deployment_id of the previously created deployment.

deployment_id = client.deployments.get_id(deployment_details)

Execute the AI service

question = "Summarize core values of IBM" deployments_results = client.deployments.run_ai_service( deployment_id, {"prompt": question} )
import json

print(json.dumps(deployments_results, indent=2))
{ "cached": false, "choices": [ { "finish_reason": "stop", "index": 0, "logprobs": null, "message": { "content": "IBM, an abbreviation for International Business Machines Corporation, has a set of core values that guide its operations and culture. These values, often referred to as \"IBM Values,\" are integral to their global brand identity. The key values include:\n\n1. **Innovation**: IBM consistently seeks to invent new technologies and revolutionize existing ones to drive growth and meet the needs of their clients.\n\n2. **Client Service**: IBM places a strong focus on understanding client needs and delivering top-tier service. They aim to be the trusted advisor to businesses, governments, and other organizations across the globe.\n\n3. **Integrity**: Honesty, transparency, and respect from employees to clients and colleagues are essential in all of IBM's operations and interactions.\n\n4. **Diversity and Inclusion**: IBM celebrates the variety of ideas, perspectives, and experiences that their workforce brings. The company promotes an inclusive environment where everyone feels valued and empowered to succeed.\n\n5. **Respect for the Individual**: IBM believes in nurturing employees' talents, investing in their development, and fostering an open environment where everyone's contributions are recognized.\n\n6. 
**Driving Towards Greater Sustainability**: IBM commits to operating in an environmentally responsible and sustainable manner, while contributing to the development of innovative solutions for its clients in this area.\n\nThese core values reflect IBM's mission to \"Let's put smart to work\" and to leverage its expertise, resources, and technology to address complex challenges faced by organizations around the world.", "role": "assistant" } } ], "created": 1768467830, "id": "chatcmpl-481fed6a13df479ccc9128fc4e11d669---f856ad28-6e41-4343-8d7a-7230d89574ec", "model": "ibm/granite-3-8b-instruct", "object": "chat.completion", "service_tier": null, "system_fingerprint": null, "usage": { "completion_tokens": 353, "prompt_tokens": 66, "total_tokens": 419 } }

Create models and deploy them as an AI service with load balancing

In this section we will create models with the same alias using Model Gateway and deploy them as an AI service in order to perform load balancing between them.

Note: This sample notebook creates three providers using watsonx.ai. It's worth pointing out that Model Gateway can also load balance between other providers, such as AWS Bedrock or NVIDIA NIM, as well as between different datacenters.
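The gateway's actual routing policy is internal to the service, but the core idea can be sketched in a few lines: an alias maps to several concrete models, and each request is resolved to one of them. The registry and `resolve` function below are purely conceptual illustrations, not part of the SDK:

```python
import random

# Hypothetical registry: one alias fronting several concrete models
ALIAS_REGISTRY = {
    "load-balancing-models": [
        "ibm/granite-3-8b-instruct",
        "meta-llama/llama-3-2-11b-vision-instruct",
        "meta-llama/llama-3-3-70b-instruct",
    ]
}

def resolve(model_or_alias: str) -> str:
    """Pick a concrete model per request: aliases fan out, plain names pass through."""
    candidates = ALIAS_REGISTRY.get(model_or_alias, [model_or_alias])
    return random.choice(candidates)

print(resolve("load-balancing-models"))  # one of the three models above
```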

Create models using Model Gateway with the same alias on different providers

In this sample we will use the ibm/granite-3-8b-instruct, meta-llama/llama-3-2-11b-vision-instruct, and meta-llama/llama-3-3-70b-instruct models in the same datacenter.

Tip: It is also possible to perform load balancing across datacenters in different regions. To achieve this, use credentials for separate datacenters when creating your providers. See the example below:

watsonx_ai_provider_ca_tor_details = gateway.providers.create(
    provider="watsonxai",
    name="watsonx-ai-provider-ca-tor",
    secret_crn_id="<secret-crn-id-for-ca-tor>",
)
watsonx_ai_provider_au_syd_details = gateway.providers.create(
    provider="watsonxai",
    name="watsonx-ai-provider-au-syd",
    secret_crn_id="<secret-crn-id-for-au-syd>",
)
model_alias = "load-balancing-models"

Create provider for ibm/granite-3-8b-instruct model

granite_3_model = "ibm/granite-3-8b-instruct"

watsonx_ai_provider_1_details = gateway.providers.create(
    provider="watsonxai", name="watsonx-ai-provider-1", secret_crn_id=secret_crn_id
)
watsonx_ai_provider_1_id = gateway.providers.get_id(watsonx_ai_provider_1_details)

granite_3_model_details = gateway.models.create(
    provider_id=watsonx_ai_provider_1_id, model=granite_3_model, alias=model_alias
)
granite_3_model_id = gateway.models.get_id(granite_3_model_details)

Create provider for meta-llama/llama-3-2-11b-vision-instruct model

llama_3_2_model = "meta-llama/llama-3-2-11b-vision-instruct"

watsonx_ai_provider_2_details = gateway.providers.create(
    provider="watsonxai", name="watsonx-ai-provider-2", secret_crn_id=secret_crn_id
)
watsonx_ai_provider_2_id = gateway.providers.get_id(watsonx_ai_provider_2_details)

llama_3_2_model_details = gateway.models.create(
    provider_id=watsonx_ai_provider_2_id, model=llama_3_2_model, alias=model_alias
)
llama_3_2_model_id = gateway.models.get_id(llama_3_2_model_details)

Create provider for meta-llama/llama-3-3-70b-instruct model

llama_3_3_model = "meta-llama/llama-3-3-70b-instruct"

watsonx_ai_provider_3_details = gateway.providers.create(
    provider="watsonxai", name="watsonx-ai-provider-3", secret_crn_id=secret_crn_id
)
watsonx_ai_provider_3_id = gateway.providers.get_id(watsonx_ai_provider_3_details)

llama_3_3_model_details = gateway.models.create(
    provider_id=watsonx_ai_provider_3_id, model=llama_3_3_model, alias=model_alias
)
llama_3_3_model_id = gateway.models.get_id(llama_3_3_model_details)

List available providers

gateway.providers.list()

List available models

gateway.models.list()

Create AI service

Prepare the function that will be deployed as an AI service. Specify the default parameters that will be passed to the function.

def deployable_load_balancing_ai_service(
    context, url=credentials.url, model_alias=model_alias
):
    from ibm_watsonx_ai import APIClient, Credentials
    from ibm_watsonx_ai.gateway import Gateway

    api_client = APIClient(
        credentials=Credentials(url=url, token=context.generate_token()),
        space_id=context.get_space_id(),
    )
    gateway = Gateway(api_client=api_client)

    def generate(context) -> dict:
        api_client.set_token(context.get_token())

        payload = context.get_json()
        prompt = payload["prompt"]

        messages = [
            {
                "role": "user",
                "content": prompt,
            }
        ]

        response = gateway.chat.completions.create(model=model_alias, messages=messages)

        return {"body": response}

    return generate

Testing AI service's function locally

Create AI service function

from ibm_watsonx_ai.deployments import RuntimeContext

context = RuntimeContext(api_client=client)

local_load_balancing_function = deployable_load_balancing_ai_service(context=context)

Prepare request payload

context.request_payload_json = {"prompt": "Explain what IBM is"}

Execute the function locally

import asyncio
from collections import Counter
from typing import Coroutine


async def send_requests(function, context):
    tasks: list[Coroutine] = []
    for _ in range(25):
        task = asyncio.to_thread(function, context)
        tasks.append(task)
        await asyncio.sleep(0.2)
    return await asyncio.gather(*tasks)


loop = asyncio.get_event_loop()
responses = await loop.create_task(
    send_requests(function=local_load_balancing_function, context=context)
)

Counter(map(lambda x: x["body"]["model"], responses))
Counter({'ibm/granite-3-8b-instruct': 10, 'meta-llama/llama-3-3-70b-instruct': 8, 'meta-llama/llama-3-2-11b-vision-instruct': 7})

As demonstrated, out of 25 requests sent to Model Gateway:

  • 10 of them were handled by ibm/granite-3-8b-instruct,

  • 8 of them were handled by meta-llama/llama-3-3-70b-instruct,

  • 7 of them were handled by meta-llama/llama-3-2-11b-vision-instruct.
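With three models behind one alias, each should handle roughly a third of the traffic. A quick sanity check on the Counter output can verify this (the `shares` helper and the 15-percentage-point tolerance are illustrative choices, sized for a small 25-request sample):

```python
from collections import Counter

def shares(counts: Counter) -> dict[str, float]:
    """Fraction of requests handled by each model."""
    total = sum(counts.values())
    return {model: n / total for model, n in counts.items()}

# The distribution observed in the local test above
observed = Counter({
    "ibm/granite-3-8b-instruct": 10,
    "meta-llama/llama-3-3-70b-instruct": 8,
    "meta-llama/llama-3-2-11b-vision-instruct": 7,
})

# Each share should be near 1/3; allow generous slack for a 25-request sample
assert all(abs(s - 1 / 3) < 0.15 for s in shares(observed).values())
print(shares(observed))
```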

Deploy AI service

Store AI service with previously created custom software specification

meta_props = {
    client.repository.AIServiceMetaNames.NAME: "Model Gateway load balancing AI service with SDK",
    client.repository.AIServiceMetaNames.SOFTWARE_SPEC_ID: software_specification_id,
}
stored_ai_service_details = client.repository.store_ai_service(
    deployable_load_balancing_ai_service, meta_props
)
ai_service_id = client.repository.get_ai_service_id(stored_ai_service_details)
ai_service_id
'5fc89f75-6a80-4dc5-95ac-8ef8cfa50949'

Create online deployment of AI service.

meta_props = {
    client.deployments.ConfigurationMetaNames.NAME: "Load balancing AI service with SDK",
    client.deployments.ConfigurationMetaNames.ONLINE: {},
}

deployment_details = client.deployments.create(ai_service_id, meta_props)
######################################################################################
Synchronous deployment creation for id: '5fc89f75-6a80-4dc5-95ac-8ef8cfa50949' started
######################################################################################

initializing
Note: online_url and serving_urls are deprecated and will be removed in a future release. Use inference instead.
..
ready

-----------------------------------------------------------------------------------------------
Successfully finished deployment creation, deployment_id='ff47c05e-b555-4f8e-887b-cb64b5154fa3'
-----------------------------------------------------------------------------------------------

Obtain the deployment_id of the previously created deployment.

deployment_id = client.deployments.get_id(deployment_details)

Execute the AI service

In the following cell, 25 requests are sent to the AI service asynchronously, with a 0.2 second delay between requests to avoid 429 Too Many Requests errors.

async def send_requests(question):
    tasks: list[Coroutine] = []
    for _ in range(25):
        tasks.append(
            client.deployments.arun_ai_service(deployment_id, {"prompt": question})
        )
        await asyncio.sleep(0.2)
    return await asyncio.gather(*tasks)


loop = asyncio.get_event_loop()
responses = await loop.create_task(
    send_requests(question="Explain to me what is a dog in cat language")
)

Counter(map(lambda x: x["model"], responses))
Counter({'ibm/granite-3-8b-instruct': 10, 'meta-llama/llama-3-2-11b-vision-instruct': 9, 'meta-llama/llama-3-3-70b-instruct': 6})

As demonstrated, out of 25 requests sent to AI Service:

  • 10 of them were handled by ibm/granite-3-8b-instruct,

  • 9 of them were handled by meta-llama/llama-3-2-11b-vision-instruct,

  • 6 of them were handled by meta-llama/llama-3-3-70b-instruct.

Summary and next steps

You successfully completed this notebook!

You learned how to create and deploy a load-balancing AI service with Model Gateway using the ibm_watsonx_ai SDK.

Check out our Online Documentation for more samples, tutorials, documentation, how-tos, and blog posts.

Author

Rafał Chrzanowski, Software Engineer at watsonx.ai.

Copyright © 2025-2026 IBM. This notebook and its source code are released under the terms of the MIT License.