

AutoAI RAG experiment with custom foundation model.

Disclaimers

  • Use only Projects and Spaces that are available in the watsonx context.

Notebook content

This notebook demonstrates how to deploy a custom foundation model and use it in an AutoAI RAG experiment. The data used in this notebook comes from the Granite Code Models paper.

Some familiarity with Python is helpful. This notebook uses Python 3.11.

Learning goal

The learning goals of this notebook are:

  • How to deploy your own foundation model using model files from the Hugging Face Hub

  • How to create an AutoAI RAG job that finds the best RAG pattern based on the custom foundation model used during the experiment

Contents

This notebook contains the following parts:

  • Set up the environment

  • Deploy the model

  • Prepare the data for the AutoAI RAG experiment

  • Run the AutoAI RAG experiment

  • Query generated pattern locally

Set up the environment

%pip install -U wget | tail -n 1
%pip install -U 'ibm-watsonx-ai[rag]>=1.3.26' | tail -n 1
Requirement already satisfied: wget in /Users/michalsteczko/anaconda3/envs/autoai_rag/lib/python3.11/site-packages (3.2)
Note: you may need to restart the kernel to use updated packages.
Requirement already satisfied: pyasn1<0.7.0,>=0.4.6 in /Users/michalsteczko/anaconda3/envs/autoai_rag/lib/python3.11/site-packages (from pyasn1-modules>=0.2.1->google-auth>=1.0.1->kubernetes>=28.1.0->chromadb!=0.5.4,!=0.5.5,<0.6.0,>=0.4.0->langchain-chroma==0.1.4->ibm-watsonx-ai[rag]>=1.3.12) (0.6.1)
Note: you may need to restart the kernel to use updated packages.
Requirement already satisfied: certifi>=2017.4.17 in /Users/michalsteczko/anaconda3/envs/autoai_rag/lib/python3.11/site-packages (from requests->huggingface-hub==0.30.2) (2024.8.30)
Note: you may need to restart the kernel to use updated packages.

Prerequisites

Please fill in the values below before moving forward:

  • API_KEY - your IBM Cloud API key; more information about API keys can be found here.

  • WML_ENDPOINT - the endpoint URL associated with your API key; to see the list of available endpoints, please refer to this documentation.

  • PROJECT_ID - the ID of the project associated with your API key and endpoint; to find your project ID, please refer to this documentation.

  • DATASOURCE_CONNECTION_ASSET_ID - the connection asset ID of the data source that stores your custom foundation model files; please refer to this documentation to learn how to create this kind of asset. In the example below you will be using a connection to S3 Cloud Object Storage.

  • BUCKET_NAME - the bucket containing your custom foundation model files in .safetensors format (see the upload sketch after the configuration cell below).

  • BUCKET_MODEL_DIR_NAME - the directory inside the bucket that stores your custom model files.

  • BUCKET_BENCHMARK_JSON_FILE_PATH - the path inside the bucket where your benchmark.json file is stored.

API_KEY = "PUT YOUR API KEY HERE" # API key to your IBM cloud or Cloud Pack for Data instance WML_ENDPOINT = "https://us-south.ml.cloud.ibm.com" # endpoint associated with your API key PROJECT_ID = "PUT YOUR PROJECT ID HERE" # project ID associated with your API key and endpoint DATASOURCE_CONNECTION_ASSET_ID = "PUT YOUR DATASURCE CONNECTION ASSET ID" # datasource connection inside your project BUCKET_NAME = "PUT BUCKET NAME WHICH STORES YOUR CUSTOM MODEL FILES HERE" # bucket name in your Cloud Object Storage BUCKET_MODEL_DIR_NAME = "PUT PATH TO YOUR CUSTOM MODEL FILES IN YOUR BUCKET" # dir name inside the bucket which will store your custom model files BUCKET_BENCHMARK_JSON_FILE_PATH = "benchmark.json" # path inside bucket where your benchmark.json file is stored

Create an API Client instance.

This client will allow us to connect to the IBM services.

from ibm_watsonx_ai import APIClient, Credentials

credentials = Credentials(
    api_key=API_KEY,
    url=WML_ENDPOINT
)

client = APIClient(credentials=credentials, project_id=PROJECT_ID)

Deploy the model

To avoid problems during model deployment, check the documentation here.
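As an optional check before creating the model asset, you can list the software specifications available in your environment and confirm that watsonx-cfm-caikit-1.1 (used below) is among them:

# Optional: list available software specifications to confirm that
# 'watsonx-cfm-caikit-1.1' is present in your environment.
client.software_specifications.list()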

Create custom model repository

software_spec = client.software_specifications.get_id_by_name('watsonx-cfm-caikit-1.1')
metadata = {
    client.repository.ModelMetaNames.NAME: "My deployment",
    client.repository.ModelMetaNames.SOFTWARE_SPEC_ID: software_spec,
    client.repository.ModelMetaNames.TYPE: client.repository.ModelAssetTypes.CUSTOM_FOUNDATION_MODEL_1_0,
    client.repository.ModelMetaNames.MODEL_LOCATION: {
        "file_path": BUCKET_MODEL_DIR_NAME,
        "bucket": BUCKET_NAME,
        "connection_id": DATASOURCE_CONNECTION_ASSET_ID,
    },
}
stored_model_details = client.repository.store_model(model=BUCKET_MODEL_DIR_NAME, meta_props=metadata)
stored_model_asset_id = client.repository.get_model_id(stored_model_details)
client.repository.list(framework_filter='custom_foundation_model_1.0')[0:1]

Store client task credentials

try:
    client.task_credentials.store()
except Exception:
    print("Client task credentials have already been stored.")
Task Credentials have already been stored. Use old or delete them.
Client task credentials have already been stored.
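Optionally, and assuming your SDK version exposes this call, you can inspect the task credentials stored for your account:

# Optional: inspect the task credentials stored for this account.
client.task_credentials.get_details()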

Perform custom model deployment

MAX_SEQUENCE_LENGTH = 32_000
MAX_NEW_TOKENS = 1000
MIN_NEW_TOKENS = 1
MAX_BATCH_SIZE = 1024

meta_props = {
    client.deployments.ConfigurationMetaNames.NAME: "My custom foundation model deployment",
    client.deployments.ConfigurationMetaNames.DESCRIPTION: "My custom foundation model deployment",
    client.deployments.ConfigurationMetaNames.HARDWARE_REQUEST: {
        'size': client.deployments.HardwareRequestSizes.Small,
        'num_nodes': 1
    },
    # optionally overwrite model parameters here
    client.deployments.ConfigurationMetaNames.FOUNDATION_MODEL: {
        "max_sequence_length": MAX_SEQUENCE_LENGTH,
        "max_new_tokens": MAX_NEW_TOKENS,
        "max_batch_size": MAX_BATCH_SIZE,
    },
    client.deployments.ConfigurationMetaNames.SERVING_NAME: "custom_foundation_model"  # must be unique
}

deployment_details = client.deployments.create(stored_model_asset_id, meta_props)
deployment_id = client.deployments.get_id(deployment_details=deployment_details)
######################################################################################

Synchronous deployment creation for id: '3a9346ec-bd38-4965-b3d9-b19fb545dc92' started

######################################################################################

initializing
Note: This model is missing a chat template. To use chat-related functionality, enable a chat template in the tokenizer_config.json file and try again.
.......................................................................................................
ready

-----------------------------------------------------------------------------------------------
Successfully finished deployment creation, deployment_id='085784e7-75d5-452d-ae2d-873aa6e20075'
-----------------------------------------------------------------------------------------------
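Before running the experiment, it can be useful to smoke-test the new deployment. The snippet below is a minimal sketch that sends an arbitrary prompt to the deployed custom model using the SDK's ModelInference class; the prompt text is only an example.

from ibm_watsonx_ai.foundation_models import ModelInference

# Minimal smoke test of the deployed custom foundation model.
deployed_model = ModelInference(
    deployment_id=deployment_id,
    api_client=client,
)

print(deployed_model.generate_text(prompt="What is retrieval-augmented generation?"))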

Prepare the data for the AutoAI RAG experiment

Download granite_code_models.pdf document

import wget

data_url = "https://arxiv.org/pdf/2405.04324"
byom_input_filename = "granite_code_models.pdf"

wget.download(data_url, byom_input_filename)
'granite_code_models.pdf'

Create a data asset with your training data

document_asset_details = client.data_assets.create(name=byom_input_filename, file_path=byom_input_filename)
document_asset_id = client.data_assets.get_id(document_asset_details)
document_asset_id
Creating data asset... SUCCESS
'f99e0fc9-9170-44b2-bf36-1f32e8384df1'
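As an optional sanity check, one way to verify the upload is to download the stored asset back to disk; the target filename below is arbitrary:

# Optional: download the data asset back to disk to verify the upload.
client.data_assets.download(document_asset_id, filename="granite_code_models_check.pdf")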
from ibm_watsonx_ai.helpers import DataConnection

input_data_references = [DataConnection(data_asset_id=document_asset_id)]
import json

local_benchmark_json_filename = "benchmark.json"

benchmarking_data = [
    {
        "question": "What are the two main variants of Granite Code models?",
        "correct_answer": "The two main variants are Granite Code Base and Granite Code Instruct.",
        "correct_answer_document_ids": [byom_input_filename]
    },
    {
        "question": "What is the purpose of Granite Code Instruct models?",
        "correct_answer": "Granite Code Instruct models are finetuned for instruction-following tasks using datasets like CommitPack, OASST, HelpSteer, and synthetic code instruction datasets, aiming to improve reasoning and instruction-following capabilities.",
        "correct_answer_document_ids": [byom_input_filename]
    },
    {
        "question": "What is the licensing model for Granite Code models?",
        "correct_answer": "Granite Code models are released under the Apache 2.0 license, ensuring permissive and enterprise-friendly usage.",
        "correct_answer_document_ids": [byom_input_filename]
    },
]

with open(local_benchmark_json_filename, mode="w", encoding="utf-8") as fp:
    json.dump(benchmarking_data, fp, indent=4)

Create a data asset with the benchmark.json file

test_asset_details = client.data_assets.create(name=local_benchmark_json_filename, file_path=local_benchmark_json_filename)
test_asset_id = client.data_assets.get_id(test_asset_details)
test_asset_id
Creating data asset... SUCCESS
'ab864a7e-7b75-4e19-a833-936e1d24ed3c'
test_data_references = [DataConnection(data_asset_id=test_asset_id)]

Run the AutoAI RAG experiment

Provide the input information for the AutoAI RAG optimizer:

  • custom_prompt_template_text - custom prompt template text which will be used to query your own foundation model

  • custom_context_template_text - custom context template text which will be used to query your own foundation model

  • name - experiment name

  • description - experiment description

  • max_number_of_rag_patterns - maximum number of RAG patterns to create

  • optimization_metrics - target optimization metrics

from ibm_watsonx_ai.experiment import AutoAI
from ibm_watsonx_ai.helpers.connections import DataConnection, ContainerLocation
from ibm_watsonx_ai.foundation_models.schema import (
    AutoAIRAGCustomModelConfig,
    AutoAIRAGModelParams
)

experiment = AutoAI(credentials, project_id=PROJECT_ID)

custom_prompt_template_text = "Answer my question {question} related to these documents {reference_documents}."
custom_context_template_text = "My document {document}"

parameters = AutoAIRAGModelParams(max_sequence_length=32_000)

custom_foundation_model_config = AutoAIRAGCustomModelConfig(
    deployment_id=deployment_id,
    project_id=PROJECT_ID,
    prompt_template_text=custom_prompt_template_text,
    context_template_text=custom_context_template_text,
    parameters=parameters
)

rag_optimizer = experiment.rag_optimizer(
    name='AutoAI RAG - Custom foundation model experiment',
    description="AutoAI RAG experiment using custom foundation model.",
    max_number_of_rag_patterns=4,
    optimization_metrics=['faithfulness'],
    foundation_models=[custom_foundation_model_config]
)

container_data_location = DataConnection(
    type="container",
    location=ContainerLocation(
        path="autorag/results"
    ),
)
container_data_location.set_client(api_client=client)

rag_optimizer.run(
    test_data_references=test_data_references,
    input_data_references=input_data_references,
    results_reference=container_data_location,
    background_mode=False
)
##############################################

Running 'cc4a3c81-a058-4376-9d94-e5e14d58e36c'

##############################################

pending....
running.......................................................................
completed
Training of 'cc4a3c81-a058-4376-9d94-e5e14d58e36c' finished successfully.
{'entity': {'hardware_spec': {'id': 'a6c4923b-b8e4-444c-9f43-8a7ec3020110', 'name': 'L'}, 'input_data_references': [{'connection': {'id': '37589eb2-ed80-4174-a33b-adf7d9dcf727'}, 'location': {'bucket': 'autorag-byom', 'file_name': 'granite_code_models.pdf'}, 'type': 'connection_asset'}], 'parameters': {'constraints': {'generation': {'foundation_models': [{'context_template_text': 'My document {document}', 'deployment_id': '085784e7-75d5-452d-ae2d-873aa6e20075', 'parameters': {'max_sequence_length': 32000}, 'project_id': '74cee487-8422-49ef-b61f-db92d8ce7b12', 'prompt_template_text': 'Answer my question {question} related to these documents {reference_documents}.'}]}, 'max_number_of_rag_patterns': 4}, 'optimization': {'metrics': ['faithfulness']}, 'output_logs': True}, 'results': [{'context': {'iteration': 0, 'max_combinations': 80, 'rag_pattern': {'composition_steps': ['model_selection', 'chunking', 'embeddings', 'retrieval', 'generation'], 'duration_seconds': 20, 'location': {'evaluation_results': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern1/evaluation_results.json', 'indexing_notebook': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern1/indexing_inference_notebook.ipynb', 'inference_notebook': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern1/indexing_inference_notebook.ipynb', 'inference_service_code': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern1/inference_ai_service.gz', 'inference_service_metadata': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern1/inference_service_metadata.json'}, 'name': 'Pattern1', 'settings': {'chunking': {'chunk_overlap': 128, 'chunk_size': 512, 'method': 'recursive'}, 'embeddings': {'model_id': 'intfloat/multilingual-e5-large', 'truncate_input_tokens': 512, 'truncate_strategy': 'left'}, 'generation': {'context_template_text': 'My document {document}', 'deployment_id': '085784e7-75d5-452d-ae2d-873aa6e20075', 'parameters': {'decoding_method': 'greedy', 'max_new_tokens': 1000, 'max_sequence_length': 32000, 'min_new_tokens': 1}, 'prompt_template_text': 'Answer my question {question} related to these documents {reference_documents}.', 'word_to_token_ratio': 2.0}, 'retrieval': {'method': 'window', 'number_of_chunks': 3, 'window_size': 2}, 'vector_store': {'datasource_type': 'chroma', 'distance_metric': 'cosine', 'index_name': 'autoai_rag_cc4a3c81_20250626125037', 'operation': 'upsert', 'schema': {'fields': [{'description': 'text field', 'name': 'text', 'role': 'text', 'type': 'string'}, {'description': 'document name field', 'name': 'document_id', 'role': 'document_name', 'type': 'string'}, {'description': 'chunk starting token position in the source document', 'name': 'start_index', 'role': 'start_index', 'type': 'number'}, {'description': 'chunk number per document', 'name': 'sequence_number', 'role': 'sequence_number', 'type': 'number'}, {'description': 'vector embeddings', 'name': 'vector', 'role': 'vector_embeddings', 'type': 'array'}], 'id': 'autoai_rag_1.0', 'name': 'Document schema using open-source loaders', 'type': 'struct'}}}, 'settings_importance': {'chunking': [{'importance': 0.125, 'parameter': 'chunk_size'}, {'importance': 0.125, 'parameter': 'chunk_overlap'}], 'embeddings': [{'importance': 0.125, 'parameter': 'embedding_model'}], 'generation': [{'importance': 0.125, 'parameter': 'foundation_model'}], 'retrieval': [{'importance': 0.125, 'parameter': 'retrieval_method'}, {'importance': 0.125, 'parameter': 'window_size'}, {'importance': 0.125, 'parameter': 'number_of_chunks'}]}}, 
'software_spec': {'name': 'autoai-rag_rt24.1-py3.11'}}, 'metrics': {'test_data': [{'ci_high': 1.0, 'ci_low': 0.4286, 'mean': 0.649, 'metric_name': 'answer_correctness'}, {'ci_high': 0.4106, 'ci_low': 0.2534, 'mean': 0.3519, 'metric_name': 'faithfulness'}, {'mean': 1.0, 'metric_name': 'context_correctness'}]}}, {'context': {'iteration': 1, 'max_combinations': 80, 'rag_pattern': {'composition_steps': ['model_selection', 'chunking', 'embeddings', 'retrieval', 'generation'], 'duration_seconds': 19, 'location': {'evaluation_results': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern2/evaluation_results.json', 'indexing_notebook': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern2/indexing_inference_notebook.ipynb', 'inference_notebook': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern2/indexing_inference_notebook.ipynb', 'inference_service_code': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern2/inference_ai_service.gz', 'inference_service_metadata': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern2/inference_service_metadata.json'}, 'name': 'Pattern2', 'settings': {'chunking': {'chunk_overlap': 128, 'chunk_size': 512, 'method': 'recursive'}, 'embeddings': {'model_id': 'intfloat/multilingual-e5-large', 'truncate_input_tokens': 512, 'truncate_strategy': 'left'}, 'generation': {'context_template_text': 'My document {document}', 'deployment_id': '085784e7-75d5-452d-ae2d-873aa6e20075', 'parameters': {'decoding_method': 'greedy', 'max_new_tokens': 1000, 'max_sequence_length': 32000, 'min_new_tokens': 1}, 'prompt_template_text': 'Answer my question {question} related to these documents {reference_documents}.', 'word_to_token_ratio': 2.0}, 'retrieval': {'method': 'simple', 'number_of_chunks': 5}, 'vector_store': {'datasource_type': 'chroma', 'distance_metric': 'cosine', 'index_name': 'autoai_rag_cc4a3c81_20250626125037', 'operation': 'upsert', 'schema': {'fields': [{'description': 'text field', 'name': 'text', 'role': 'text', 'type': 'string'}, {'description': 'document name field', 'name': 'document_id', 'role': 'document_name', 'type': 'string'}, {'description': 'chunk starting token position in the source document', 'name': 'start_index', 'role': 'start_index', 'type': 'number'}, {'description': 'chunk number per document', 'name': 'sequence_number', 'role': 'sequence_number', 'type': 'number'}, {'description': 'vector embeddings', 'name': 'vector', 'role': 'vector_embeddings', 'type': 'array'}], 'id': 'autoai_rag_1.0', 'name': 'Document schema using open-source loaders', 'type': 'struct'}}}, 'settings_importance': {'chunking': [{'importance': 0.0, 'parameter': 'chunk_size'}, {'importance': 0.0, 'parameter': 'chunk_overlap'}], 'embeddings': [{'importance': 0.0, 'parameter': 'embedding_model'}], 'generation': [{'importance': 0.0, 'parameter': 'foundation_model'}], 'retrieval': [{'importance': 0.48, 'parameter': 'retrieval_method'}, {'importance': 0.1, 'parameter': 'window_size'}, {'importance': 0.42, 'parameter': 'number_of_chunks'}]}}, 'software_spec': {'name': 'autoai-rag_rt24.1-py3.11'}}, 'metrics': {'test_data': [{'ci_high': 0.746, 'ci_low': 0.0, 'mean': 0.4841, 'metric_name': 'answer_correctness'}, {'ci_high': 0.2167, 'ci_low': 0.0223, 'mean': 0.0945, 'metric_name': 'faithfulness'}, {'mean': 1.0, 'metric_name': 'context_correctness'}]}}, {'context': {'iteration': 2, 'max_combinations': 80, 'rag_pattern': {'composition_steps': ['model_selection', 'chunking', 'embeddings', 'retrieval', 'generation'], 'duration_seconds': 18, 
'location': {'evaluation_results': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern3/evaluation_results.json', 'indexing_notebook': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern3/indexing_inference_notebook.ipynb', 'inference_notebook': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern3/indexing_inference_notebook.ipynb', 'inference_service_code': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern3/inference_ai_service.gz', 'inference_service_metadata': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern3/inference_service_metadata.json'}, 'name': 'Pattern3', 'settings': {'chunking': {'chunk_overlap': 256, 'chunk_size': 1024, 'method': 'recursive'}, 'embeddings': {'model_id': 'intfloat/multilingual-e5-large', 'truncate_input_tokens': 512, 'truncate_strategy': 'left'}, 'generation': {'context_template_text': 'My document {document}', 'deployment_id': '085784e7-75d5-452d-ae2d-873aa6e20075', 'parameters': {'decoding_method': 'greedy', 'max_new_tokens': 1000, 'max_sequence_length': 32000, 'min_new_tokens': 1}, 'prompt_template_text': 'Answer my question {question} related to these documents {reference_documents}.', 'word_to_token_ratio': 2.0}, 'retrieval': {'method': 'simple', 'number_of_chunks': 3}, 'vector_store': {'datasource_type': 'chroma', 'distance_metric': 'cosine', 'index_name': 'autoai_rag_cc4a3c81_20250626125133', 'operation': 'upsert', 'schema': {'fields': [{'description': 'text field', 'name': 'text', 'role': 'text', 'type': 'string'}, {'description': 'document name field', 'name': 'document_id', 'role': 'document_name', 'type': 'string'}, {'description': 'chunk starting token position in the source document', 'name': 'start_index', 'role': 'start_index', 'type': 'number'}, {'description': 'chunk number per document', 'name': 'sequence_number', 'role': 'sequence_number', 'type': 'number'}, {'description': 'vector embeddings', 'name': 'vector', 'role': 'vector_embeddings', 'type': 'array'}], 'id': 'autoai_rag_1.0', 'name': 'Document schema using open-source loaders', 'type': 'struct'}}}, 'settings_importance': {'chunking': [{'importance': 0.1538463, 'parameter': 'chunk_size'}, {'importance': 0.1538462, 'parameter': 'chunk_overlap'}], 'embeddings': [{'importance': 0.0, 'parameter': 'embedding_model'}], 'generation': [{'importance': 0.0, 'parameter': 'foundation_model'}], 'retrieval': [{'importance': 0.3736262, 'parameter': 'retrieval_method'}, {'importance': 0.18681306, 'parameter': 'window_size'}, {'importance': 0.13186823, 'parameter': 'number_of_chunks'}]}}, 'software_spec': {'name': 'autoai-rag_rt24.1-py3.11'}}, 'metrics': {'test_data': [{'ci_high': 0.7407, 'ci_low': 0.0, 'mean': 0.2469, 'metric_name': 'answer_correctness'}, {'ci_high': 0.2827, 'ci_low': 0.0, 'mean': 0.0942, 'metric_name': 'faithfulness'}, {'mean': 1.0, 'metric_name': 'context_correctness'}]}}, {'context': {'iteration': 3, 'max_combinations': 80, 'rag_pattern': {'composition_steps': ['model_selection', 'chunking', 'embeddings', 'retrieval', 'generation'], 'duration_seconds': 15, 'location': {'evaluation_results': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern4/evaluation_results.json', 'indexing_notebook': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern4/indexing_inference_notebook.ipynb', 'inference_notebook': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern4/indexing_inference_notebook.ipynb', 'inference_service_code': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern4/inference_ai_service.gz', 
'inference_service_metadata': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern4/inference_service_metadata.json'}, 'name': 'Pattern4', 'settings': {'chunking': {'chunk_overlap': 256, 'chunk_size': 1024, 'method': 'recursive'}, 'embeddings': {'model_id': 'intfloat/multilingual-e5-large', 'truncate_input_tokens': 512, 'truncate_strategy': 'left'}, 'generation': {'context_template_text': 'My document {document}', 'deployment_id': '085784e7-75d5-452d-ae2d-873aa6e20075', 'parameters': {'decoding_method': 'greedy', 'max_new_tokens': 1000, 'max_sequence_length': 32000, 'min_new_tokens': 1}, 'prompt_template_text': 'Answer my question {question} related to these documents {reference_documents}.', 'word_to_token_ratio': 2.0}, 'retrieval': {'method': 'window', 'number_of_chunks': 3, 'window_size': 1}, 'vector_store': {'datasource_type': 'chroma', 'distance_metric': 'cosine', 'index_name': 'autoai_rag_cc4a3c81_20250626125133', 'operation': 'upsert', 'schema': {'fields': [{'description': 'text field', 'name': 'text', 'role': 'text', 'type': 'string'}, {'description': 'document name field', 'name': 'document_id', 'role': 'document_name', 'type': 'string'}, {'description': 'chunk starting token position in the source document', 'name': 'start_index', 'role': 'start_index', 'type': 'number'}, {'description': 'chunk number per document', 'name': 'sequence_number', 'role': 'sequence_number', 'type': 'number'}, {'description': 'vector embeddings', 'name': 'vector', 'role': 'vector_embeddings', 'type': 'array'}], 'id': 'autoai_rag_1.0', 'name': 'Document schema using open-source loaders', 'type': 'struct'}}}, 'settings_importance': {'chunking': [{'importance': 0.036102694, 'parameter': 'chunk_size'}, {'importance': 0.12929015, 'parameter': 'chunk_overlap'}], 'embeddings': [{'importance': 0.0, 'parameter': 'embedding_model'}], 'generation': [{'importance': 0.0, 'parameter': 'foundation_model'}], 'retrieval': [{'importance': 0.46589968, 'parameter': 'retrieval_method'}, {'importance': 0.25456277, 'parameter': 'window_size'}, {'importance': 0.11414474, 'parameter': 'number_of_chunks'}]}}, 'software_spec': {'name': 'autoai-rag_rt24.1-py3.11'}}, 'metrics': {'test_data': [{'ci_high': 1.0, 'ci_low': 0.8677, 'mean': 0.9153, 'metric_name': 'answer_correctness'}, {'ci_high': 0.5257, 'ci_low': 0.375, 'mean': 0.4715, 'metric_name': 'faithfulness'}, {'mean': 1.0, 'metric_name': 'context_correctness'}]}}], 'results_reference': {'location': {'path': 'autorag/results', 'training': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c', 'training_status': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/training-status.json', 'training_log': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/output.log', 'assets_path': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/assets'}, 'type': 'container'}, 'status': {'completed_at': '2025-06-26T12:55:56.690Z', 'message': {'level': 'info', 'text': 'AAR019I: AutoAI execution completed.'}, 'running_at': '2025-06-26T12:49:19.000Z', 'state': 'completed', 'step': 'generation'}, 'test_data_references': [{'connection': {'id': '37589eb2-ed80-4174-a33b-adf7d9dcf727'}, 'location': {'bucket': 'autorag-byom', 'file_name': 'benchmark.json'}, 'type': 'connection_asset'}], 'timestamp': '2025-06-26T12:55:59.428Z'}, 'metadata': {'created_at': '2025-06-26T12:48:58.519Z', 'description': 'AutoAI RAG experiment using custom foundation model.', 'id': 'cc4a3c81-a058-4376-9d94-e5e14d58e36c', 'modified_at': '2025-06-26T12:55:56.725Z', 'name': 'AutoAI RAG - Custom foundation model 
experiment', 'project_id': '74cee487-8422-49ef-b61f-db92d8ce7b12', 'tags': ['autorag.2ba2f4e0-8e16-4cbb-b856-9da22d1bc376']}}
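The experiment above runs synchronously because background_mode=False. If you start it with background_mode=True instead, the call returns immediately and you can poll the run afterwards, for example with get_run_status (assumed here to behave as it does for other AutoAI optimizers in the SDK):

# Poll the state of a run started in background mode.
rag_optimizer.get_run_status()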
rag_optimizer.get_run_details()
{'entity': {'hardware_spec': {'id': 'a6c4923b-b8e4-444c-9f43-8a7ec3020110', 'name': 'L'}, 'input_data_references': [{'connection': {'id': '37589eb2-ed80-4174-a33b-adf7d9dcf727'}, 'location': {'bucket': 'autorag-byom', 'file_name': 'granite_code_models.pdf'}, 'type': 'connection_asset'}], 'parameters': {'constraints': {'generation': {'foundation_models': [{'context_template_text': 'My document {document}', 'deployment_id': '085784e7-75d5-452d-ae2d-873aa6e20075', 'parameters': {'max_sequence_length': 32000}, 'project_id': '74cee487-8422-49ef-b61f-db92d8ce7b12', 'prompt_template_text': 'Answer my question {question} related to these documents {reference_documents}.'}]}, 'max_number_of_rag_patterns': 4}, 'optimization': {'metrics': ['faithfulness']}, 'output_logs': True}, 'results': [{'context': {'iteration': 0, 'max_combinations': 80, 'rag_pattern': {'composition_steps': ['model_selection', 'chunking', 'embeddings', 'retrieval', 'generation'], 'duration_seconds': 20, 'location': {'evaluation_results': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern1/evaluation_results.json', 'indexing_notebook': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern1/indexing_inference_notebook.ipynb', 'inference_notebook': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern1/indexing_inference_notebook.ipynb', 'inference_service_code': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern1/inference_ai_service.gz', 'inference_service_metadata': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern1/inference_service_metadata.json'}, 'name': 'Pattern1', 'settings': {'chunking': {'chunk_overlap': 128, 'chunk_size': 512, 'method': 'recursive'}, 'embeddings': {'model_id': 'intfloat/multilingual-e5-large', 'truncate_input_tokens': 512, 'truncate_strategy': 'left'}, 'generation': {'context_template_text': 'My document {document}', 'deployment_id': '085784e7-75d5-452d-ae2d-873aa6e20075', 'parameters': {'decoding_method': 'greedy', 'max_new_tokens': 1000, 'max_sequence_length': 32000, 'min_new_tokens': 1}, 'prompt_template_text': 'Answer my question {question} related to these documents {reference_documents}.', 'word_to_token_ratio': 2.0}, 'retrieval': {'method': 'window', 'number_of_chunks': 3, 'window_size': 2}, 'vector_store': {'datasource_type': 'chroma', 'distance_metric': 'cosine', 'index_name': 'autoai_rag_cc4a3c81_20250626125037', 'operation': 'upsert', 'schema': {'fields': [{'description': 'text field', 'name': 'text', 'role': 'text', 'type': 'string'}, {'description': 'document name field', 'name': 'document_id', 'role': 'document_name', 'type': 'string'}, {'description': 'chunk starting token position in the source document', 'name': 'start_index', 'role': 'start_index', 'type': 'number'}, {'description': 'chunk number per document', 'name': 'sequence_number', 'role': 'sequence_number', 'type': 'number'}, {'description': 'vector embeddings', 'name': 'vector', 'role': 'vector_embeddings', 'type': 'array'}], 'id': 'autoai_rag_1.0', 'name': 'Document schema using open-source loaders', 'type': 'struct'}}}, 'settings_importance': {'chunking': [{'importance': 0.125, 'parameter': 'chunk_size'}, {'importance': 0.125, 'parameter': 'chunk_overlap'}], 'embeddings': [{'importance': 0.125, 'parameter': 'embedding_model'}], 'generation': [{'importance': 0.125, 'parameter': 'foundation_model'}], 'retrieval': [{'importance': 0.125, 'parameter': 'retrieval_method'}, {'importance': 0.125, 'parameter': 'window_size'}, {'importance': 0.125, 'parameter': 'number_of_chunks'}]}}, 
'software_spec': {'name': 'autoai-rag_rt24.1-py3.11'}}, 'metrics': {'test_data': [{'ci_high': 1.0, 'ci_low': 0.4286, 'mean': 0.649, 'metric_name': 'answer_correctness'}, {'ci_high': 0.4106, 'ci_low': 0.2534, 'mean': 0.3519, 'metric_name': 'faithfulness'}, {'mean': 1.0, 'metric_name': 'context_correctness'}]}}, {'context': {'iteration': 1, 'max_combinations': 80, 'rag_pattern': {'composition_steps': ['model_selection', 'chunking', 'embeddings', 'retrieval', 'generation'], 'duration_seconds': 19, 'location': {'evaluation_results': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern2/evaluation_results.json', 'indexing_notebook': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern2/indexing_inference_notebook.ipynb', 'inference_notebook': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern2/indexing_inference_notebook.ipynb', 'inference_service_code': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern2/inference_ai_service.gz', 'inference_service_metadata': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern2/inference_service_metadata.json'}, 'name': 'Pattern2', 'settings': {'chunking': {'chunk_overlap': 128, 'chunk_size': 512, 'method': 'recursive'}, 'embeddings': {'model_id': 'intfloat/multilingual-e5-large', 'truncate_input_tokens': 512, 'truncate_strategy': 'left'}, 'generation': {'context_template_text': 'My document {document}', 'deployment_id': '085784e7-75d5-452d-ae2d-873aa6e20075', 'parameters': {'decoding_method': 'greedy', 'max_new_tokens': 1000, 'max_sequence_length': 32000, 'min_new_tokens': 1}, 'prompt_template_text': 'Answer my question {question} related to these documents {reference_documents}.', 'word_to_token_ratio': 2.0}, 'retrieval': {'method': 'simple', 'number_of_chunks': 5}, 'vector_store': {'datasource_type': 'chroma', 'distance_metric': 'cosine', 'index_name': 'autoai_rag_cc4a3c81_20250626125037', 'operation': 'upsert', 'schema': {'fields': [{'description': 'text field', 'name': 'text', 'role': 'text', 'type': 'string'}, {'description': 'document name field', 'name': 'document_id', 'role': 'document_name', 'type': 'string'}, {'description': 'chunk starting token position in the source document', 'name': 'start_index', 'role': 'start_index', 'type': 'number'}, {'description': 'chunk number per document', 'name': 'sequence_number', 'role': 'sequence_number', 'type': 'number'}, {'description': 'vector embeddings', 'name': 'vector', 'role': 'vector_embeddings', 'type': 'array'}], 'id': 'autoai_rag_1.0', 'name': 'Document schema using open-source loaders', 'type': 'struct'}}}, 'settings_importance': {'chunking': [{'importance': 0.0, 'parameter': 'chunk_size'}, {'importance': 0.0, 'parameter': 'chunk_overlap'}], 'embeddings': [{'importance': 0.0, 'parameter': 'embedding_model'}], 'generation': [{'importance': 0.0, 'parameter': 'foundation_model'}], 'retrieval': [{'importance': 0.48, 'parameter': 'retrieval_method'}, {'importance': 0.1, 'parameter': 'window_size'}, {'importance': 0.42, 'parameter': 'number_of_chunks'}]}}, 'software_spec': {'name': 'autoai-rag_rt24.1-py3.11'}}, 'metrics': {'test_data': [{'ci_high': 0.746, 'ci_low': 0.0, 'mean': 0.4841, 'metric_name': 'answer_correctness'}, {'ci_high': 0.2167, 'ci_low': 0.0223, 'mean': 0.0945, 'metric_name': 'faithfulness'}, {'mean': 1.0, 'metric_name': 'context_correctness'}]}}, {'context': {'iteration': 2, 'max_combinations': 80, 'rag_pattern': {'composition_steps': ['model_selection', 'chunking', 'embeddings', 'retrieval', 'generation'], 'duration_seconds': 18, 
'location': {'evaluation_results': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern3/evaluation_results.json', 'indexing_notebook': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern3/indexing_inference_notebook.ipynb', 'inference_notebook': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern3/indexing_inference_notebook.ipynb', 'inference_service_code': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern3/inference_ai_service.gz', 'inference_service_metadata': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern3/inference_service_metadata.json'}, 'name': 'Pattern3', 'settings': {'chunking': {'chunk_overlap': 256, 'chunk_size': 1024, 'method': 'recursive'}, 'embeddings': {'model_id': 'intfloat/multilingual-e5-large', 'truncate_input_tokens': 512, 'truncate_strategy': 'left'}, 'generation': {'context_template_text': 'My document {document}', 'deployment_id': '085784e7-75d5-452d-ae2d-873aa6e20075', 'parameters': {'decoding_method': 'greedy', 'max_new_tokens': 1000, 'max_sequence_length': 32000, 'min_new_tokens': 1}, 'prompt_template_text': 'Answer my question {question} related to these documents {reference_documents}.', 'word_to_token_ratio': 2.0}, 'retrieval': {'method': 'simple', 'number_of_chunks': 3}, 'vector_store': {'datasource_type': 'chroma', 'distance_metric': 'cosine', 'index_name': 'autoai_rag_cc4a3c81_20250626125133', 'operation': 'upsert', 'schema': {'fields': [{'description': 'text field', 'name': 'text', 'role': 'text', 'type': 'string'}, {'description': 'document name field', 'name': 'document_id', 'role': 'document_name', 'type': 'string'}, {'description': 'chunk starting token position in the source document', 'name': 'start_index', 'role': 'start_index', 'type': 'number'}, {'description': 'chunk number per document', 'name': 'sequence_number', 'role': 'sequence_number', 'type': 'number'}, {'description': 'vector embeddings', 'name': 'vector', 'role': 'vector_embeddings', 'type': 'array'}], 'id': 'autoai_rag_1.0', 'name': 'Document schema using open-source loaders', 'type': 'struct'}}}, 'settings_importance': {'chunking': [{'importance': 0.1538463, 'parameter': 'chunk_size'}, {'importance': 0.1538462, 'parameter': 'chunk_overlap'}], 'embeddings': [{'importance': 0.0, 'parameter': 'embedding_model'}], 'generation': [{'importance': 0.0, 'parameter': 'foundation_model'}], 'retrieval': [{'importance': 0.3736262, 'parameter': 'retrieval_method'}, {'importance': 0.18681306, 'parameter': 'window_size'}, {'importance': 0.13186823, 'parameter': 'number_of_chunks'}]}}, 'software_spec': {'name': 'autoai-rag_rt24.1-py3.11'}}, 'metrics': {'test_data': [{'ci_high': 0.7407, 'ci_low': 0.0, 'mean': 0.2469, 'metric_name': 'answer_correctness'}, {'ci_high': 0.2827, 'ci_low': 0.0, 'mean': 0.0942, 'metric_name': 'faithfulness'}, {'mean': 1.0, 'metric_name': 'context_correctness'}]}}, {'context': {'iteration': 3, 'max_combinations': 80, 'rag_pattern': {'composition_steps': ['model_selection', 'chunking', 'embeddings', 'retrieval', 'generation'], 'duration_seconds': 15, 'location': {'evaluation_results': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern4/evaluation_results.json', 'indexing_notebook': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern4/indexing_inference_notebook.ipynb', 'inference_notebook': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern4/indexing_inference_notebook.ipynb', 'inference_service_code': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern4/inference_ai_service.gz', 
'inference_service_metadata': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern4/inference_service_metadata.json'}, 'name': 'Pattern4', 'settings': {'chunking': {'chunk_overlap': 256, 'chunk_size': 1024, 'method': 'recursive'}, 'embeddings': {'model_id': 'intfloat/multilingual-e5-large', 'truncate_input_tokens': 512, 'truncate_strategy': 'left'}, 'generation': {'context_template_text': 'My document {document}', 'deployment_id': '085784e7-75d5-452d-ae2d-873aa6e20075', 'parameters': {'decoding_method': 'greedy', 'max_new_tokens': 1000, 'max_sequence_length': 32000, 'min_new_tokens': 1}, 'prompt_template_text': 'Answer my question {question} related to these documents {reference_documents}.', 'word_to_token_ratio': 2.0}, 'retrieval': {'method': 'window', 'number_of_chunks': 3, 'window_size': 1}, 'vector_store': {'datasource_type': 'chroma', 'distance_metric': 'cosine', 'index_name': 'autoai_rag_cc4a3c81_20250626125133', 'operation': 'upsert', 'schema': {'fields': [{'description': 'text field', 'name': 'text', 'role': 'text', 'type': 'string'}, {'description': 'document name field', 'name': 'document_id', 'role': 'document_name', 'type': 'string'}, {'description': 'chunk starting token position in the source document', 'name': 'start_index', 'role': 'start_index', 'type': 'number'}, {'description': 'chunk number per document', 'name': 'sequence_number', 'role': 'sequence_number', 'type': 'number'}, {'description': 'vector embeddings', 'name': 'vector', 'role': 'vector_embeddings', 'type': 'array'}], 'id': 'autoai_rag_1.0', 'name': 'Document schema using open-source loaders', 'type': 'struct'}}}, 'settings_importance': {'chunking': [{'importance': 0.036102694, 'parameter': 'chunk_size'}, {'importance': 0.12929015, 'parameter': 'chunk_overlap'}], 'embeddings': [{'importance': 0.0, 'parameter': 'embedding_model'}], 'generation': [{'importance': 0.0, 'parameter': 'foundation_model'}], 'retrieval': [{'importance': 0.46589968, 'parameter': 'retrieval_method'}, {'importance': 0.25456277, 'parameter': 'window_size'}, {'importance': 0.11414474, 'parameter': 'number_of_chunks'}]}}, 'software_spec': {'name': 'autoai-rag_rt24.1-py3.11'}}, 'metrics': {'test_data': [{'ci_high': 1.0, 'ci_low': 0.8677, 'mean': 0.9153, 'metric_name': 'answer_correctness'}, {'ci_high': 0.5257, 'ci_low': 0.375, 'mean': 0.4715, 'metric_name': 'faithfulness'}, {'mean': 1.0, 'metric_name': 'context_correctness'}]}}], 'results_reference': {'location': {'path': 'autorag/results', 'training': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c', 'training_status': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/training-status.json', 'training_log': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/output.log', 'assets_path': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/assets'}, 'type': 'container'}, 'status': {'completed_at': '2025-06-26T12:55:56.690Z', 'message': {'level': 'info', 'text': 'AAR019I: AutoAI execution completed.'}, 'running_at': '2025-06-26T12:49:19.000Z', 'state': 'completed', 'step': 'generation'}, 'test_data_references': [{'connection': {'id': '37589eb2-ed80-4174-a33b-adf7d9dcf727'}, 'location': {'bucket': 'autorag-byom', 'file_name': 'benchmark.json'}, 'type': 'connection_asset'}], 'timestamp': '2025-06-26T12:55:59.428Z'}, 'metadata': {'created_at': '2025-06-26T12:48:58.519Z', 'description': 'AutoAI RAG experiment using custom foundation model.', 'id': 'cc4a3c81-a058-4376-9d94-e5e14d58e36c', 'modified_at': '2025-06-26T12:55:56.725Z', 'name': 'AutoAI RAG - Custom foundation model 
experiment', 'project_id': '74cee487-8422-49ef-b61f-db92d8ce7b12', 'tags': ['autorag.2ba2f4e0-8e16-4cbb-b856-9da22d1bc376']}}
summary = rag_optimizer.summary()
summary
best_pattern_name = summary.index.values[0]
print('Best pattern is:', best_pattern_name)

best_pattern = rag_optimizer.get_pattern()
Best pattern is: Pattern4
rag_optimizer.get_pattern_details(pattern_name=best_pattern_name)
{'composition_steps': ['model_selection', 'chunking', 'embeddings', 'retrieval', 'generation'], 'duration_seconds': 15, 'location': {'evaluation_results': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern4/evaluation_results.json', 'indexing_notebook': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern4/indexing_inference_notebook.ipynb', 'inference_notebook': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern4/indexing_inference_notebook.ipynb', 'inference_service_code': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern4/inference_ai_service.gz', 'inference_service_metadata': 'autorag/results/cc4a3c81-a058-4376-9d94-e5e14d58e36c/Pattern4/inference_service_metadata.json'}, 'name': 'Pattern4', 'settings': {'chunking': {'chunk_overlap': 256, 'chunk_size': 1024, 'method': 'recursive'}, 'embeddings': {'model_id': 'intfloat/multilingual-e5-large', 'truncate_input_tokens': 512, 'truncate_strategy': 'left'}, 'generation': {'context_template_text': 'My document {document}', 'deployment_id': '085784e7-75d5-452d-ae2d-873aa6e20075', 'parameters': {'decoding_method': 'greedy', 'max_new_tokens': 1000, 'min_new_tokens': 1}, 'prompt_template_text': 'Answer my question {question} related to these documents {reference_documents}.', 'word_to_token_ratio': 2.0}, 'retrieval': {'method': 'window', 'number_of_chunks': 3, 'window_size': 1}, 'vector_store': {'datasource_type': 'chroma', 'distance_metric': 'cosine', 'index_name': 'autoai_rag_cc4a3c81_20250626125133', 'operation': 'upsert', 'schema': {'fields': [{'description': 'text field', 'name': 'text', 'role': 'text', 'type': 'string'}, {'description': 'document name field', 'name': 'document_id', 'role': 'document_name', 'type': 'string'}, {'description': 'chunk starting token position in the source document', 'name': 'start_index', 'role': 'start_index', 'type': 'number'}, {'description': 'chunk number per document', 'name': 'sequence_number', 'role': 'sequence_number', 'type': 'number'}, {'description': 'vector embeddings', 'name': 'vector', 'role': 'vector_embeddings', 'type': 'array'}], 'id': 'autoai_rag_1.0', 'name': 'Document schema using open-source loaders', 'type': 'struct'}}}, 'settings_importance': {'chunking': [{'importance': 0.036102694, 'parameter': 'chunk_size'}, {'importance': 0.12929015, 'parameter': 'chunk_overlap'}], 'embeddings': [{'importance': 0.0, 'parameter': 'embedding_model'}], 'generation': [{'importance': 0.0, 'parameter': 'foundation_model'}], 'retrieval': [{'importance': 0.46589968, 'parameter': 'retrieval_method'}, {'importance': 0.25456277, 'parameter': 'window_size'}, {'importance': 0.11414474, 'parameter': 'number_of_chunks'}]}}

Query generated pattern locally

from ibm_watsonx_ai.deployments import RuntimeContext

runtime_context = RuntimeContext(api_client=client)

inference_service_function = best_pattern.inference_service(runtime_context)[0]
/Users/michalsteczko/anaconda3/envs/autoai_rag/lib/python3.11/site-packages/pypdf/_crypt_providers/_cryptography.py:32: CryptographyDeprecationWarning: ARC4 has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.ARC4 and will be removed from cryptography.hazmat.primitives.ciphers.algorithms in 48.0.0. from cryptography.hazmat.primitives.ciphers.algorithms import AES, ARC4
question = "What training objectives are used for the granite models?"

context = RuntimeContext(
    api_client=client,
    request_payload_json={"messages": [{"role": "user", "content": question}]},
)

resp = inference_service_function(context)
resp
{'body': {'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': ' The\nclusters are equipped with 100Gbps and 200Gbps HDR InfiniBand links, respectively.\nWe utilize NVIDIA’s Megatron-LM (Shoeybi et al., 2019; Narayanan et al., 2021) for\ndistributed training, which is optimized for large language models. We use the same Megatron\nLM framework for all our models, ensuring consistency in training infrastructure.\n4.5 Model Architecture\nThe architecture of the Granite Code models is based on the original transformer architecture\n(Douglas & Smith, 2019) with modifications for code modeling. The base model has 16\nlayers, 8 attention heads, and a token embedding dimension of 512. For the 3B model, we\nuse the standard transformer architecture with a multi-head attention mechanism. The 8B model\nincorporates Grouped-Query Attention (GQA) (Ainslie et al., 2023) to improve inference\nefficiency. The 20B model uses learned absolute position embeddings and Multi-Query\nAttention (Shazeer, 2019). The 34B model is built upon the 20B model with depth\nupscaling (Kim et al., 2024) to double the model depth, resulting in 88 layers.\nThe models are trained with different context lengths depending on their size: 2048, 4096,\n8192, and 8192 tokens respectively for 3B, 8B, 20B, and 34B models.\nFor the MLP block, we use GELU activation function (Hendrycks & Gimpel, 2023) for the 20B\nand 34B models, while using GLU (Shazeer, 2020) for the 3B and 8B models. For\nnormalization, we use LayerNorm (Ba et al., 2016) for all models except the 8B model,\nwhich uses RMSNorm (Zhang & Sennrich, 2019) for computational efficiency.\n5.1 Evaluation Protocol\nWe evaluate the Granite Code models on a comprehensive set of benchmarks including\nHumanEvalPack (Muennighoff et al., 2023), MBPP(+) (Austin et al., 2021; Liu et al.,\n2023a), RepoBench (Liu et al., 2023b), ReCode (Wang et al., 2022), and more. This\nset of benchmarks encompasses many different kinds of coding tasks beyond just code\nsynthesis in Python, e.g., code fixing, code explanation, code editing, code translation,\netc., across most major programming languages (Python, JavaScript, Java, Go, C++, Rust,\netc.). Our evaluation protocol includes both automated metrics and human evaluations.\nFor automated metrics, we use standard evaluation scripts that measure code synthesis,\nfixing, and explanation. For human evaluations, we conduct a series of studies where\nparticipants are asked to compare the output of Granite Code models with other open-source\ncode models on various coding tasks. We also perform ablation studies to understand the\nimpact of different architectural choices on model performance.\n5.2 Results\nOur findings reveal that among open-source models, the Granite Code models overall show\nvery strong performance across all model sizes and benchmarks (often outperforming other\nopen-source code models that are twice large compared to Granite). As an illustration,\nfigure 1 (top) shows a comparison of Granite-8B-Code-Base with other open-source base code\nLLMs, including recent high-performing general purpose base LLMs like Mistral (Jiang et al.,\n2023b) and LLama-3 (AI@Meta, 2024) on HumanEvalPack (Muennighoff et al., 2023). While\nCodeGemma and StarCoder2 perform reasonably well in generating code, they perform\nsignificantly worse on the code fixing and explanation variants of HumanEvalPack. On av-\nerage, Granite-8B-Code-Base achieves 73% accuracy on code fixing, compared to 60% for\nCodeGemma and 55% for StarCoder2. 
Similarly, on code explanation, Granite-8B-\nCode-Base achieves 70% accuracy, outperforming CodeGemma (58%) and StarCoder2\n(50%). These results demonstrate that Granite Code models are not only capable of\ngenerating high-quality code but also excelling in code-related tasks that require reasoning\nand understanding, such as code fixing and explanation.\n\n6.3 Conclusion\nIn conclusion, we present Granite Code models, a series of highly capable code LLMs designed\nto support enterprise software development across a wide range of coding tasks. Our\nresults show that Granite Code models achieve state-of-the-art performance across a variety\nof'}, 'reference_documents': [{'page_content': 'code, while others focus primarily on coding-related tasks (e.g. StarCoder (Li et al., 2023a;\nLozhkov et al., 2024), CodeGen (Nijkamp et al., 2023), CodeLlama (Rozi `ere et al., 2023), and\nCodeGemma (CodeGemma Team et al., 2024)).\nHowever, there remain important gaps in the current field of LLMs for code, especially in\nthe context of enterprise software development. First, while very large, generalist LLMs can\nachieve excellent coding performance, their size makes them expensive to deploy. Smaller\ncode-focused models (Li et al., 2023a; Lozhkov et al., 2024; Nijkamp et al., 2023; Rozi `ere et al.,\n2023; CodeGemma Team et al., 2024) can achieve excellent code generation performance in\na smaller and more flexible package, but performance in coding tasks beyond generation\n(e.g. fixing and explanation) can lag behind code generation performance.\nIn many enterprise contexts, code LLM adoption can be further complicated by factors\nbeyond the performance of the models. For instance, even open models are sometimes\nplagued by a lack of transparency about the data sources and data processing methods\nthat went into model, which can erode trust in models in mission critical and regulated\ncontexts. Furthermore, license terms in today’s open LLMs can encumber and complicate\nan enterprise’s ability to use a model.\nHere, we present Granite Code models, a series of highly capable code LLMs, designed to\nsupport enterprise software development across a wide range of coding tasks. Granite Code\nmodels has two main variants that we release in four different sizes (3B, 8B, 20B, and 34B):\n2\nIBM Granite Code Models\n•Granite Code Base: base foundation models for code-related tasks;\n•Granite Code Instruct: instruction following models finetuned using a combination\nof Git commits paired with human instructions and open-source synthetically\ngenerated code instruction datasets.\nThe base models in the series have been trained from scratch with a two-phase training\nstrategy. In phase 1, our model is trained on 3 to 4 trillion tokens sourced from 116 pro-\ngramming languages, ensuring a comprehensive understanding of programming languages\nand syntax. In phase 2, our model is further trained on 500 billion tokens with a carefully\ndesigned mixture of high-quality data from code and natural language domains to improve\nthe model’s ability to reason. We use the unsupervised language modeling objective to\ntrain the base models in both the phases of training. 
The instruct models are derived by\nfurther finetuning the above trained base models on a combination of a filtered variant of\nCommitPack (Muennighoff et al., 2023), natural language instruction following datasets\n(OASST (K ¨opf et al., 2023), HelpSteer (Wang et al., 2023)) and open-source math datasets\n(MathInstruct (Yue et al., 2023) and MetaMathQA (Yu et al., 2023)), including synthetically\ngenerated code datasets for improving instruction following and reasoning capabilities.\nWe conduct extensive evaluations of our code LLMs on a comprehensive set of benchmarks,\nincluding HumanEvalPack (Muennighoff et al., 2023), MBPP(+) (Austin et al., 2021; Liu\net al., 2023a), RepoBench (Liu et al., 2023b), ReCode (Wang et al., 2022), and more. This set of\nbenchmarks encompasses many different kinds of coding tasks beyond just code synthesis\nin Python, e.g., code fixing, code explanation, code editing, code translation, etc., across\nmost major programming languages (Python, JavaScript, Java, Go, C++, Rust, etc.).\nOur findings reveal that among open-source models, the Granite Code models overall show\nvery strong performance across all model sizes and benchmarks (often outperforming other\nopen-source code models that are twice large compared to Granite). As an illustration, fig-\nure 1 (top) shows a comparison of Granite-8B-Code-Base with other open-source base code\nLLMs, including recent high-performing general purpose base LLMs like Mistral (Jiang et al.,\n2023b) and LLama-3 (AI@Meta, 2024) on HumanEvalPack (Muennighoff et al., 2023). While\nCodeGemma and StarCoder2 perform reasonably well in generating code, they perform\nsignificantly worse on the code fixing and explanation variants of HumanEvalPack. On av-', 'metadata': {'sequence_number': [6, 7, 8, 9, 10], 'document_id': 'granite_code_models.pdf'}}, {'page_content': 'activation function (Ramachandran et al., 2017) with GLU (Shazeer, 2020) for the MLP , also\ncommonly referred to as swiglu. For normalization, we use RMSNorm (Zhang & Sennrich,\n2019) since it’s computationally more efficient than LayerNorm (Ba et al., 2016). The 3B\nmodel is trained with a context length of 2048 tokens.\n8B: The 8B model has a similar architecture as the 3B model with the exception of using\nGrouped-Query Attention (GQA) (Ainslie et al., 2023). Using GQA offers a better tradeoff\nbetween model performance and inference efficiency at this scale. We train the 8B model\nwith a context length of 4096 tokens.\n20B: The 20B code model is trained with learned absolute position embeddings. We use\nMulti-Query Attention (Shazeer, 2019) during training for efficient downstream inference.\nFor the MLP block, we use the GELU activation function (Hendrycks & Gimpel, 2023). For\nnormalizing the activations, we use LayerNorm (Ba et al., 2016). This model is trained with\na context length of 8192 tokens.\n34B: To train the 34B model, we follow the approach by Kim et al. for depth upscaling of\nthe 20B model. Specifically, we first duplicate the 20B code model with 52 layers and then\n5https://www.clamav.net/\n5\nIBM Granite Code Models\nFigure 2: An overview of depth upscaling (Kim et al., 2024) for efficient training of Granite-\n34B-Code. We utilize the 20B model after 1.6T tokens to start training of 34B model with the\nsame code pretraining data without any changes to the training and inference framework.\nremove final 8 layers from the original model and initial 8 layers from its duplicate to form\ntwo models. 
Finally, we concatenate both models to form Granite-34B-Code model with\n88 layers (see Figure 2 for an illustration). After the depth upscaling, we observe that the\ndrop in performance compared to 20B model is pretty small contrary to what is observed by\nKim et al.. This performance is recovered pretty quickly after we continue pretraining of the\nupscaled 34B model. Similar, to 20B, we use a 8192 token context during pretraining.\n4 Pretraining\nIn this section, we provide details on two phase training (Sec. 4.1), training objectives\n(Sec. 4.2), optimization (Sec. 4.3) and infrastructure (Sec. 4.4) used in pretraining the models.\n4.1 Two Phase Training\nGranite Code models are trained on 3.5T to 4.5T tokens of code data and natural language\ndatasets related to code. Data is tokenized via byte pair encoding (BPE, (Sennrich et al.,\n2015)), employing the same tokenizer as StarCoder (Li et al., 2023a). Following (Shen et al.,\n2024; Hu et al., 2024), we utilize high-quality data with two phases of training as follows.\n•Phase 1 (code only training) : During phase 1, both 3B and 8B models are trained for\n4 trillion tokens of code data comprising 116 languages. The 20B parameter model\nis trained on 3 trillion tokens of code. The 34B model is trained on 1.4T tokens after\nthe depth upscaling which is done on the 1.6T checkpoint of 20B model.\n•Phase 2 (code + language training) : In phase 2, we include additional high-quality\npublicly available data from various domains, including technical, mathematics,\nand web documents, to further improve the model’s performance in reasoning and\nproblem solving skills, which are essential for code generation. We train all our\nmodels for 500B tokens (80% code and 20% language data) in phase 2 training.\n4.2 Training Objective\nFor training of all our models, we use the causal language modeling objective and Fill-In-\nthe-Middle (FIM) (Bavarian et al., 2022) objective. The FIM objective is tasked to predict\ninserted tokens with the given context and subsequent text. We train our models to work\nwith both PSM (Prefix-Suffix-Middle) and SPM (Suffix-Prefix-Middle) modes, with relevant\nformatting control tokens, same as StarCoder (Li et al., 2023a).\nThe overall loss is computed as a weighted combination of the 2 objectives:\nL=αLCLM + (1−α)LFIM (1)\nWe emperically set α=0.5 during training and find that this works well in practice leading\nto SOTA performance on both code completion and code infilling tasks. It should be\n6\nIBM Granite Code Models', 'metadata': {'sequence_number': [21, 22, 23, 24, 25], 'document_id': 'granite_code_models.pdf'}}, {'page_content': 'datasets related to code. Data is tokenized via byte pair encoding (BPE, (Sennrich et al.,\n2015)), employing the same tokenizer as StarCoder (Li et al., 2023a). Following (Shen et al.,\n2024; Hu et al., 2024), we utilize high-quality data with two phases of training as follows.\n•Phase 1 (code only training) : During phase 1, both 3B and 8B models are trained for\n4 trillion tokens of code data comprising 116 languages. The 20B parameter model\nis trained on 3 trillion tokens of code. 
The 34B model is trained on 1.4T tokens after\nthe depth upscaling which is done on the 1.6T checkpoint of 20B model.\n•Phase 2 (code + language training) : In phase 2, we include additional high-quality\npublicly available data from various domains, including technical, mathematics,\nand web documents, to further improve the model’s performance in reasoning and\nproblem solving skills, which are essential for code generation. We train all our\nmodels for 500B tokens (80% code and 20% language data) in phase 2 training.\n4.2 Training Objective\nFor training of all our models, we use the causal language modeling objective and Fill-In-\nthe-Middle (FIM) (Bavarian et al., 2022) objective. The FIM objective is tasked to predict\ninserted tokens with the given context and subsequent text. We train our models to work\nwith both PSM (Prefix-Suffix-Middle) and SPM (Suffix-Prefix-Middle) modes, with relevant\nformatting control tokens, same as StarCoder (Li et al., 2023a).\nThe overall loss is computed as a weighted combination of the 2 objectives:\nL=αLCLM + (1−α)LFIM (1)\nWe emperically set α=0.5 during training and find that this works well in practice leading\nto SOTA performance on both code completion and code infilling tasks. It should be\n6\nIBM Granite Code Models\nnoted that the FIM objective is only used during pretraining, however we drop it during\ninstruction finetuning i.e we set α=1.\n4.3 Optimization\nWe use AdamW optimizer (Kingma & Ba, 2017) with β1=0.9,β2=0.95 and weight decay\nof 0.1 for training all our Granite code models. For the phase-1 pretraining, the learning\nrate follows a cosine schedule starting from 3 ×10−4which decays to 3 ×10−5with an\ninitial linear warmup step of 2k iterations. For phase-2 pretraining, we start from 3 ×10−4\n(1.5×10−4for 20B and 34B models) and adopt an exponential decay schedule to anneal it\nto 10% of the initial learning rate. We use a batch size of 4M-5M tokens depending on the\nmodel size during both phases of pretraining.\nTo accelerate training, we use FlashAttention 2 (Dao et al., 2022; Dao, 2023), the persistent\nlayernorm kernel, Fused RMSNorm kernel (depending on the model) and the Fused Adam\nkernel available in NVIDIA’s Apex library. We use a custom fork of NVIDIA’s Megatron-\nLM (Shoeybi et al., 2019; Narayanan et al., 2021) for distributed training of all our models.\nWe train with a mix of 3D parallelism: tensor parallel, pipeline parallel and data parallel.\nWe also use sequence parallelism (Korthikanti et al., 2023) for reducing the activation\nmemory consumption of large context length during training. We use Megatron’s distributed\noptimizer with mixed precision training (Micikevicius et al., 2018) in BF16 (Kalamkar et al.,\n2019) with gradient all-reduce and gradient accumulation in FP32 for training stability.\n4.4 Infrastructure\nWe train the Granite Code models using IBM’s two supercomputing clusters, namely Vela\nand Blue Vela, outfitted with NVIDIA A100 and H100 GPUs, respectively. In the Vela\nA100 GPU cluster, each node has 2 ×Intel Xeon Scalable Processors with 8 ×80GB A100\nGPUs connected to each other by NVLink and NVSwitch. The Vela cluster adopts RoCE\n(RDMA over Converged Ethernet) and GDR (GPU-direct RDMA) for high-performance\nnetworking. Similarly, each node in Blue Vela cluster consists of dual 48-core Intel processors\nwith 8 ×80GB H100 GPUs. 
Blue Vela employs 3.2Tbps InfiniBand interconnect to facilitate\nseamless communication between nodes, known for their high throughput and low latency.', 'metadata': {'sequence_number': [24, 25, 26, 27, 28], 'document_id': 'granite_code_models.pdf'}}]}]}}
# Print only the generated answer from the AI service response
print(inference_service_function(context)["body"]["choices"][0]["message"]["content"])
The clusters are equipped with 100Gbps and 200Gbps HDR InfiniBand links, respectively. We utilize NVIDIA’s Megatron-LM (Shoeybi et al., 2019; Narayanan et al., 2021) for distributed training, which is optimized for large language models. We use the same Megatron LM framework for all our models, ensuring consistency in training infrastructure. 4.5 Model Architecture The architecture of the Granite Code models is based on the original transformer architecture (Douglas & Smith, 2019) with modifications for code modeling. The base model has 16 layers, 8 attention heads, and a token embedding dimension of 512. For the 3B model, we use the standard transformer architecture with a multi-head attention mechanism. The 8B model incorporates Grouped-Query Attention (GQA) (Ainslie et al., 2023) to improve inference efficiency. The 20B model uses learned absolute position embeddings and Multi-Query Attention (Shazeer, 2019). The 34B model is built upon the 20B model with depth upscaling (Kim et al., 2024) to double the model depth, resulting in 88 layers. The models are trained with different context lengths depending on their size: 2048, 4096, 8192, and 8192 tokens respectively for 3B, 8B, 20B, and 34B models. For the MLP block, we use GELU activation function (Hendrycks & Gimpel, 2023) for the 20B and 34B models, while using GLU (Shazeer, 2020) for the 3B and 8B models. For normalization, we use LayerNorm (Ba et al., 2016) for all models except the 8B model, which uses RMSNorm (Zhang & Sennrich, 2019) for computational efficiency. 5.1 Evaluation Protocol We evaluate the Granite Code models on a comprehensive set of benchmarks including HumanEvalPack (Muennighoff et al., 2023), MBPP(+) (Austin et al., 2021; Liu et al., 2023a), RepoBench (Liu et al., 2023b), ReCode (Wang et al., 2022), and more. This set of benchmarks encompasses many different kinds of coding tasks beyond just code synthesis in Python, e.g., code fixing, code explanation, code editing, code translation, etc., across most major programming languages (Python, JavaScript, Java, Go, C++, Rust, etc.). Our evaluation protocol includes both automated metrics and human evaluations. For automated metrics, we use standard evaluation scripts that measure code synthesis, fixing, and explanation. For human evaluations, we conduct a series of studies where participants are asked to compare the output of Granite Code models with other open-source code models on various coding tasks. We also perform ablation studies to understand the impact of different architectural choices on model performance. 5.2 Results Our findings reveal that among open-source models, the Granite Code models overall show very strong performance across all model sizes and benchmarks (often outperforming other open-source code models that are twice large compared to Granite). As an illustration, figure 1 (top) shows a comparison of Granite-8B-Code-Base with other open-source base code LLMs, including recent high-performing general purpose base LLMs like Mistral (Jiang et al., 2023b) and LLama-3 (AI@Meta, 2024) on HumanEvalPack (Muennighoff et al., 2023). While CodeGemma and StarCoder2 perform reasonably well in generating code, they perform significantly worse on the code fixing and explanation variants of HumanEvalPack. On av- erage, Granite-8B-Code-Base achieves 73% accuracy on code fixing, compared to 60% for CodeGemma and 55% for StarCoder2. Similarly, on code explanation, Granite-8B- Code-Base achieves 70% accuracy, outperforming CodeGemma (58%) and StarCoder2 (50%). 
These results demonstrate that Granite Code models are not only capable of generating high-quality code but also excelling in code-related tasks that require reasoning and understanding, such as code fixing and explanation. 6.3 Conclusion In conclusion, we present Granite Code models, a series of highly capable code LLMs designed to support enterprise software development across a wide range of coding tasks. Our results show that Granite Code models achieve state-of-the-art performance across a variety of
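Besides the generated answer, it can be useful during development to inspect which document chunks the RAG pattern retrieved to ground the response. The sketch below is a minimal example based on the response shape visible in the output above (a chat-completions style body with retrieved chunks attached to each choice); the reference_documents key name is an assumption and may need to be adjusted to match the exact payload returned by your deployed AI service.

# Minimal sketch, assuming the response structure shown above.
# "reference_documents" is an assumed key name; adjust it if your
# deployed AI service returns the retrieved chunks under a different key.
response_body = inference_service_function(context)["body"]
choice = response_body["choices"][0]

# Generated answer
print(choice["message"]["content"])

# Metadata of the retrieved chunks that grounded the answer
for doc in choice.get("reference_documents", []):
    metadata = doc.get("metadata", {})
    print(metadata.get("document_id"), metadata.get("sequence_number"))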

Summary

You successfully completed this notebook!

You learned how to deploy your own foundation model and use it as the generation model in an AutoAI RAG experiment.

Check out our Online Documentation for more samples, tutorials, documentation, how-tos, and blog posts.

Author:

Michał Steczko, Software Engineer at watsonx.ai.

Copyright © 2025 IBM. This notebook and its source code are released under the terms of the MIT License.