GitHub Repository: ibm/watson-machine-learning-samples
Path: blob/master/cloud/notebooks/python_sdk/deployments/foundation_models/Use watsonx Text Extraction V2 service to extract text from file.ipynb
Kernel: watsonx-ai-samples-py-311


Use watsonx.ai Text Extraction V2 service to extract text from file

Disclaimers

  • Use only Projects and Spaces that are available in the watsonx context.

Notebook content

This notebook contains the steps and code to run a Text Extraction job using the Python SDK and then retrieve the results as JSON, Markdown, HTML and image files.

Some familiarity with Python is helpful. This notebook uses Python 3.11.

Learning goal

The purpose of this notebook is to demonstrate how to use the Text Extraction V2 service and the ibm-watsonx-ai Python SDK to retrieve text from a file that is located in IBM Cloud Object Storage.

Contents

This notebook contains the following parts:

Set up the environment

Before you use the sample code in this notebook, you must perform the following setup tasks:

Install required packages

%pip install wget | tail -n 1
%pip install "ibm-watsonx-ai>=1.3.13" | tail -n 1
Successfully installed wget-3.2
Successfully installed anyio-4.9.0 cachetools-6.1.0 certifi-2025.7.14 charset_normalizer-3.4.2 h11-0.16.0 httpcore-1.0.9 httpx-0.28.1 ibm-cos-sdk-2.14.2 ibm-cos-sdk-core-2.14.2 ibm-cos-sdk-s3transfer-2.14.2 ibm-watsonx-ai-1.3.30 idna-3.10 jmespath-1.0.1 lomond-0.3.3 numpy-2.3.1 pandas-2.2.3 pytz-2025.2 requests-2.32.4 sniffio-1.3.1 tabulate-0.9.0 tzdata-2025.2 urllib3-2.5.0

Defining the watsonx.ai credentials

This cell defines the watsonx.ai credentials required to work with watsonx Foundation Model inferencing.

Action: Provide the IBM Cloud user API key. For details, see documentation.

import getpass

location = "us-south"
api_key = getpass.getpass("Please enter your watsonx.ai api key (hit enter): ")

from ibm_watsonx_ai import Credentials

credentials = Credentials(
    url=f"https://{location}.ml.cloud.ibm.com",
    api_key=api_key,
)
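As a small variation on the credentials cell, the sketch below builds the regional endpoint URL from the location and looks for the API key in an environment variable before falling back to an interactive prompt. The variable name `WATSONX_APIKEY` and both helper names are assumptions made for this illustration; they are not part of the SDK.

```python
import getpass
import os


def build_watsonx_url(location: str) -> str:
    """Return the regional endpoint URL in the form used by the credentials cell."""
    location = location.strip().lower()
    if not location:
        raise ValueError("location must not be empty")
    return f"https://{location}.ml.cloud.ibm.com"


def read_api_key(env_var: str = "WATSONX_APIKEY") -> str:
    """Prefer an environment variable (hypothetical name); otherwise prompt."""
    api_key = os.environ.get(env_var)
    if api_key:
        return api_key
    return getpass.getpass("Please enter your watsonx.ai api key (hit enter): ")


print(build_watsonx_url("us-south"))
```

The returned values can be passed to `Credentials(url=..., api_key=...)` exactly as in the cell above.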

Defining the project ID

The Text Extraction service requires a project ID that provides the context for the call. We try to obtain the ID from the project in which this notebook runs. If it is not available, please provide the project ID.

import os

try:
    project_id = os.environ["PROJECT_ID"]
except KeyError:
    project_id = input("Please enter your project_id (hit enter): ")

API Client initialization

from ibm_watsonx_ai import APIClient

client = APIClient(credentials=credentials, project_id=project_id)

Create data connections with source document and results reference

The document from which we are going to extract text is located in IBM Cloud Object Storage (COS). In the following example we use the Granite Code Models paper as the source document. The final results file, which will contain the extracted text and the necessary metadata, will also be placed in COS. Therefore, we use the ibm_watsonx_ai.helpers.DataConnection and ibm_watsonx_ai.helpers.S3Location classes to create Python objects that represent references to the processed files. Please note that you have to create a connection asset with your COS details (for a detailed explanation of how to do this, see IBM Cloud Object Storage connection or check the cells below).

Download source document

import wget

source_filename = "granite_code_models_paper.pdf"
wget.download("https://arxiv.org/pdf/2405.04324", source_filename)
'granite_code_models_paper.pdf'

Create connection to COS

Tip: You need to create your connection only once.

bucket_name = "PASTE_YOUR_BUCKET_NAME_HERE"

connection_asset_id = input(
    "Provide connection asset ID in your project. If you wish to provide your credentials by hand, hit enter: "
)

if not connection_asset_id:
    datasource_name = "bluemixcloudobjectstorage"
    cos_endpoint_url = input("Please enter your COS endpoint URL and hit enter: ")
    cos_access_key = input("Please enter your COS access key and hit enter: ")
    cos_secret_key = getpass.getpass("Please enter your COS secret key and hit enter: ")

    connection_meta_props = {
        client.connections.ConfigurationMetaNames.NAME: f"Connection to {bucket_name} bucket",
        client.connections.ConfigurationMetaNames.DATASOURCE_TYPE: client.connections.get_datasource_type_id_by_name(
            datasource_name
        ),
        client.connections.ConfigurationMetaNames.DESCRIPTION: f"Connection to external bucket: {bucket_name}",
        client.connections.ConfigurationMetaNames.PROPERTIES: {
            "bucket": bucket_name,
            "access_key": cos_access_key,
            "secret_key": cos_secret_key,
            "url": cos_endpoint_url,
        },
    }

    connection_details = client.connections.create(meta_props=connection_meta_props)
    connection_asset_id = client.connections.get_id(connection_details)

connection_asset_id
'd87f2870-5e14-4fb6-be73-bc9307097b39'

Create text extraction document reference and result references

from ibm_watsonx_ai.helpers import DataConnection, S3Location

document_reference = DataConnection(
    connection_asset_id=connection_asset_id,
    location=S3Location(bucket=bucket_name, path="granite_code_models_paper.pdf"),
)

results_reference = DataConnection(
    connection_asset_id=connection_asset_id,
    location=S3Location(
        bucket=bucket_name, path="text_extraction_result/"
    ),  # Must end with /
)
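Because the results path must end with `/`, it can help to normalize the value before building the `S3Location`. The helper below is a minimal sketch of that check; `normalize_results_path` is a hypothetical name introduced here, not an SDK function.

```python
def normalize_results_path(path: str) -> str:
    """Return a folder-style object path that ends with exactly one '/'."""
    cleaned = path.strip().rstrip("/")
    if not cleaned:
        raise ValueError("results path must name a folder, e.g. 'text_extraction_result/'")
    return cleaned + "/"


print(normalize_results_path("text_extraction_result"))
```

The normalized string can then be passed as the `path` argument of the results `S3Location`.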

Upload source file to COS

document_reference.set_client(client)
document_reference.write(source_filename)

Text Extraction request preparation

Since the data connections for the source and results files are ready, we can proceed to running the text extraction job. To initialize the Text Extraction manager, we use the TextExtractionsV2 class.

from ibm_watsonx_ai.foundation_models.extractions import TextExtractionsV2

extraction = TextExtractionsV2(api_client=client, project_id=project_id)

Define Text Extraction parameters

When running a job, the parameters for the text extraction pipeline can be specified. For more details about available parameters see documentation. The list of parameters available in SDK can be found below.

from ibm_watsonx_ai.metanames import TextExtractionsV2ParametersMetaNames

TextExtractionsV2ParametersMetaNames().show()
------------------------  ----  --------
META_PROP NAME            TYPE  REQUIRED
MODE                      str   N
OCR_MODE                  str   N
LANGUAGES                 list  N
AUTO_ROTATION_CORRECTION  bool  N
CREATE_EMBEDDED_IMAGES    str   N
OUTPUT_DPI                int   N
KVP_MODE                  str   N
OUTPUT_TOKENS_AND_BBOX    bool  N
------------------------  ----  --------

In our example we are going to use the following parameters:

parameters = {
    TextExtractionsV2ParametersMetaNames.MODE: "high_quality",
    TextExtractionsV2ParametersMetaNames.OCR_MODE: "enabled",
    TextExtractionsV2ParametersMetaNames.LANGUAGES: ["en", "fr"],
    TextExtractionsV2ParametersMetaNames.AUTO_ROTATION_CORRECTION: True,
    TextExtractionsV2ParametersMetaNames.CREATE_EMBEDDED_IMAGES: "enabled_placeholder",
    TextExtractionsV2ParametersMetaNames.OUTPUT_DPI: 72,
    TextExtractionsV2ParametersMetaNames.KVP_MODE: "generic_with_semantic",
    TextExtractionsV2ParametersMetaNames.SEMANTIC_CONFIG: {
        "target_image_width": 500,
        "enable_text_hints": True,
        "enable_generic_kvp": True,
        "schemas": [
            {
                "document_type": "Scientific paper",
                "document_description": "Article published on arXiv regarding large language models for code generation",
                "target_image_width": 800,
                "enable_text_hints": True,
                "enable_generic_kvp": False,
                "fields": {
                    "title": {
                        "default": "",
                        "example": "A Cognitive Ideation Support Framework using IBM Watson Services",
                    },
                    "authors": {
                        "default": "",
                        "example": "Samaa Elnagar, Kweku-Muata Osei-Bryson",
                    },
                    "abstract": {
                        "default": "",
                        "example": (
                            "Ideas generation is a core activity for innovation in organizations. The creativity of the generated "
                            "ideas depends not only on the knowledge retrieved from the organizations' knowledge bases, but also on "
                            "the external knowledge retrieved from other resources. Unfortunately, organizations often cannot "
                            "efficiently utilize the knowledge in the knowledge bases due to the limited abilities of the search "
                            "and retrieval mechanisms especially when dealing with unstructured data. In this paper, we present "
                            "a new cognitive support framework for ideation that uses the IBM Watson DeepQA services. IBM Watson "
                            "is a Question Answering system which mimics human cognitive abilities to retrieve and rank information. "
                            "The proposed framework is based on the Search for Ideas in the Associative Memory (SIAM) model to help "
                            "organizations develop creative ideas through discovering new relationships between retrieved data. To "
                            "evaluate the effectiveness of the proposed system, the generated ideas generated are selected and "
                            "assessed using a set of established creativity criteria."
                        ),
                    },
                },
            }
        ],
    },
}
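Each entry under `fields` in the semantic schema above has the same shape: a `default` value and an `example` value. The sketch below builds that mapping from a plain `{field_name: example}` dictionary. The helper name `make_schema_fields` is hypothetical, and the structure is inferred only from the parameters shown in this notebook, so treat it as an illustration rather than a description of the service's full schema.

```python
def make_schema_fields(
    examples: dict[str, str], default: str = ""
) -> dict[str, dict[str, str]]:
    """Build a `fields` mapping from {field_name: example_value}."""
    fields: dict[str, dict[str, str]] = {}
    for name, example in examples.items():
        if not name:
            raise ValueError("field names must be non-empty")
        # Same two keys as the entries in the parameters cell above.
        fields[name] = {"default": default, "example": example}
    return fields


fields = make_schema_fields(
    {
        "title": "A Cognitive Ideation Support Framework using IBM Watson Services",
        "authors": "Samaa Elnagar, Kweku-Muata Osei-Bryson",
    }
)
print(fields["authors"])
```

The resulting dictionary could be placed under the `"fields"` key of a schema entry in place of the hand-written mapping.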

Run extraction job for single return format

In order to run an extraction job, where only a single output format is requested, the result_formats parameter must be specified using the TextExtractionsV2ResultFormats enum. In our example we will use the TextExtractionsV2ResultFormats.MARKDOWN format.

from ibm_watsonx_ai.foundation_models.extractions import TextExtractionsV2ResultFormats

single_format_job_details = extraction.run_job(
    document_reference=document_reference,
    results_reference=results_reference,
    parameters=parameters,
    result_formats=TextExtractionsV2ResultFormats.MARKDOWN,
)
single_format_job_details

single_format_extraction_job_id = extraction.get_job_id(
    extraction_details=single_format_job_details
)

We can list text extraction jobs using the list_jobs method.

extraction.list_jobs()

Moreover, to get details of a particular text extraction request, run the following:

extraction.get_job_details(extraction_job_id=single_format_extraction_job_id)

To wait until the text extraction job completes, run the following cell:

import time

while True:
    time.sleep(5)
    job_details = extraction.get_job_details(
        extraction_job_id=single_format_extraction_job_id
    )
    job_status = job_details["entity"]["results"]["status"]
    if job_status != "running":
        print("\n", job_status, sep="")
        break
    print(".", sep="", end="", flush=True)
........................... completed
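The polling loop above runs until the status changes, however long that takes. A possible variation, sketched below, adds a timeout and wraps the loop in a reusable function. The function name `wait_for_completion` and the fake status source are assumptions for illustration; only the `["entity"]["results"]["status"]` path and the `"running"` status come from the cell above.

```python
import time
from typing import Callable


def wait_for_completion(
    get_details: Callable[[], dict],
    poll_interval: float = 5.0,
    timeout: float = 600.0,
) -> str:
    """Poll get_details() until the status is no longer "running" or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_details()["entity"]["results"]["status"]
        if status != "running":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still running after {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for extraction.get_job_details(...): "running" twice, then "completed".
_fake_statuses = iter(["running", "running", "completed"])


def fake_get_details() -> dict:
    return {"entity": {"results": {"status": next(_fake_statuses)}}}


final_status = wait_for_completion(fake_get_details, poll_interval=0.01, timeout=5.0)
print(final_status)
```

With the real client, the callable would be something like `lambda: extraction.get_job_details(extraction_job_id=single_format_extraction_job_id)`.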

Furthermore, to delete a text extraction job, use the delete_job() method.

Results examination

Once the extraction job is completed, we can use the get_results_reference method to create the results data connection.

single_results_reference = extraction.get_results_reference(
    single_format_extraction_job_id
)
single_results_reference
{'type': 'connection_asset', 'connection': {'id': 'd87f2870-5e14-4fb6-be73-bc9307097b39'}, 'location': {'bucket': 'rc-byom-bucket', 'file_name': 'text_extraction_result/'}}

Download the results folder to a local directory.

single_result_dir = "single_text_extraction_results"
single_results_reference.download_folder(local_dir=single_result_dir)

After a successful download, it's possible to read the file content.

with open(f"{single_result_dir}/assembly.md") as file:
    extracted_text = file.read(1000)

print(extracted_text)
## Granite Code Models: A Family of Open Foundation Models for Code Intelligence Mayank Mishra⋆ Matt Stallone⋆ Gaoyuan Zhang⋆ Yikang Shen Aditya Prasad Adriana Meza Soria Michele Merler Parameswaran Selvam Saptha Surendran Shivdeep Singh Manish Sethi Xuan-Hong Dang Pengyuan Li Kun-Lung Wu Syed Zawad Andrew Coleman Matthew White Mark Lewis Raju Pavuluri Yan Koyfman Boris Lublinsky Maximilien de Bayser Ibrahim Abdelaziz Kinjal Basu Mayank Agarwal Yi Zhou Chris Johnson Aanchal Goyal Hima Patel Yousaf Shah Petros Zerfos Heiko Ludwig Asim Munawar Maxwell Crouse Pavan Kapanipathi Shweta Salaria Bob Calio Sophia Wen Seetharami Seelam Brian Belgodere Carlos Fonseca Amith Singhee Nirmit Desai David D. Cox Ruchir Puri† Rameswar Panda† IBM Research ⋆Equal Contribution †Corresponding Authors [email protected], [email protected] ## Abstract Large Language Models (LLMs) trained on code are revolutionizing the software development process. Increasingly, code LLMs are being inte grated into software
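The Markdown output above marks the document title and the abstract with `## ` headings. One way to work with such a file is to split it into sections keyed by heading, as in the sketch below. The function name and the shortened sample string are made up for this illustration, and the approach assumes level-2 headings like those in the output shown.

```python
def split_markdown_sections(markdown: str) -> dict[str, str]:
    """Map each '## ' heading to the text that follows it, up to the next heading."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in markdown.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            sections.setdefault(current, [])
        elif current is not None:
            sections[current].append(line)
    return {title: "\n".join(body).strip() for title, body in sections.items()}


# Hand-written sample in the same shape as assembly.md (not real service output).
sample = (
    "## Granite Code Models\n\n"
    "Mayank Mishra\n\n"
    "## Abstract\n\n"
    "Large Language Models (LLMs) trained on code.\n"
)
sections = split_markdown_sections(sample)
print(sections["Abstract"])
```

Applied to the downloaded file, the same function could be called on the full contents of `assembly.md` instead of the sample string.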

Run extraction job for multiple file formats

To run an extraction job in which multiple output formats are requested, the result_formats parameter must be specified using either a list of TextExtractionsV2ResultFormats enums (recommended) or a list of str instances. In our example we will use a list of the following file formats:

  • TextExtractionsV2ResultFormats.ASSEMBLY_JSON

  • TextExtractionsV2ResultFormats.HTML

  • TextExtractionsV2ResultFormats.PAGE_IMAGES

multiple_formats_job_details = extraction.run_job(
    document_reference=document_reference,
    results_reference=results_reference,
    parameters=parameters,
    result_formats=[
        TextExtractionsV2ResultFormats.ASSEMBLY_JSON,
        TextExtractionsV2ResultFormats.HTML,
        TextExtractionsV2ResultFormats.PAGE_IMAGES,
    ],
)
multiple_formats_job_details

multiple_formats_extraction_job_id = extraction.get_job_id(
    extraction_details=multiple_formats_job_details
)

We can list text extraction jobs using the list_jobs method.

extraction.list_jobs()

Moreover, to get details of a particular text extraction request, run the following:

extraction.get_job_details(extraction_job_id=multiple_formats_extraction_job_id)

To wait until the text extraction job completes, run the following cell:

import time

while True:
    time.sleep(5)
    job_details = extraction.get_job_details(
        extraction_job_id=multiple_formats_extraction_job_id
    )
    job_status = job_details["entity"]["results"]["status"]
    if job_status != "running":
        print("\n", job_status, sep="")
        break
    print(".", sep="", end="", flush=True)
......................... completed

Furthermore, to delete a text extraction job, use the delete_job() method.

Results examination

Once the extraction job is completed, we can use the get_results_reference method to create the results data connection.

multiple_results_reference = extraction.get_results_reference(
    multiple_formats_extraction_job_id
)
multiple_results_reference
{'type': 'connection_asset', 'connection': {'id': 'd87f2870-5e14-4fb6-be73-bc9307097b39'}, 'location': {'bucket': 'rc-byom-bucket', 'file_name': 'text_extraction_result/'}}
multiple_results_dir = "multiple_text_extraction_results"
multiple_results_reference.download_folder(local_dir=multiple_results_dir)

After a successful download, it's possible to read the file contents.

import json

from IPython.core.display import display_markdown


def get_raw_text_for_key(key: str, page_number: int) -> str | None:
    iterator = (
        item["value"]["raw_text"]
        for item in extracted_assembly_json["kvps"]
        if item["key"]["semantic_label"] == key
        and item["value"]["bbox"]["page_number"] == page_number
    )
    return next(iterator, None)


with open(f"{multiple_results_dir}/assembly.json") as file:
    extracted_assembly_json = json.load(file)

# Available thanks to semantic configuration
title = get_raw_text_for_key("title", 1)
authors = get_raw_text_for_key("authors", 1)
abstract = get_raw_text_for_key("abstract", 1)

tokens = extracted_assembly_json["all_structures"]["tokens"]
complete_text = " ".join(item["text"] for item in tokens)

display_markdown(
    "\n\n".join(
        [
            f"### Title\n{title}",
            f"### Authors\n{authors}",
            f"### Abstract\n{abstract}",
            f"### Complete text (first 1000 characters)\n{complete_text[:1000]}",
        ]
    ),
    raw=True,
)

Title

Granite Code Models: A Family of Open Foundation Models for Code Intelligence

Authors

Mayank Mishra⋆ Matt Stallone⋆ Gaoyuan Zhang⋆ Yikang Shen Aditya Prasad Adriana Meza Soria Michele Merler Parameswaran Selvam Saptha Surendran Shivdeep Singh Manish Sethi Xuan-Hong Dang Pengyuan Li Kun-Lung Wu Syed Zawad Andrew Coleman Matthew White Mark Lewis Raju Pavuluri Yan Koyfman Boris Lublinsky Maximilien de Bayser Ibrahim Abdelaziz Kinjal Basu Mayank Agarwal Yi Zhou Chris Johnson Aanchal Goyal Hima Patel Yousaf Shah Petros Zerfos Heiko Ludwig Asim Munawar Maxwell Crouse Pavan Kapanipathi Shweta Belgodere Carlos Fonseca Amith Singhee Nirmit Desai David D. Cox Ruchir Puri† Rameswar Panda†

Abstract

Large Language Models (LLMs) trained on code are revolutionizing the software development process. Increasingly, code LLMs are being into software development environments to improve the productivity of human programmers, and LLM-based agents are beginning to show promise for handling complex tasks autonomously. Realizing the full potential of code LLMs requires a wide range of capabilities, including code generation, fixing bugs, explaining and documenting code, maintaining repositories, and more. In this work, we introduce the Granite series of decoder-only code models for code generative tasks, trained with code written in 116 programming languages. The Granite Code models family can be used in a wide range of applications, from complex application modernization tasks to on-device memory-constrained use cases. Evaluation on a comprehensive set of tasks demonstrates that Granite Code models consistently reaches state-of-the-art performance among available open-source code LLMs. The Granite Code model family was optimized for enterprise software development workflows and performs well across a range of coding tasks (e.g., code generation, fixing and explanation), making it a versatile “all around” code model. We release all our Granite Code models under an Apache 2.0 license for both research and commercial use.

Complete text (first 1000 characters)

Granite Code Models: A Family of Open Foundation Models for Code Intelligence Mayank Mishra⋆ Matt Stallone⋆ Gaoyuan Zhang⋆ Yikang Shen Aditya Prasad Adriana Meza Soria Michele Merler Parameswaran Selvam Saptha Surendran Shivdeep Singh Manish Sethi Xuan-Hong Dang Pengyuan Li Kun-Lung Wu Syed Zawad Andrew Coleman Matthew White Mark Lewis Raju Pavuluri Yan Koyfman Boris Lublinsky Maximilien de Bayser Ibrahim Abdelaziz Kinjal Basu Mayank Agarwal Yi Zhou Chris Johnson Aanchal Goyal Hima Patel Yousaf Shah Petros Zerfos Heiko Ludwig Asim Munawar Maxwell Crouse Pavan Kapanipathi Shweta Salaria Bob Calio Sophia Wen Seetharami Seelam Brian Belgodere Carlos Fonseca Amith Singhee Nirmit Desai David D. Cox Ruchir Puri† Rameswar Panda† IBM Research ⋆Equal Contribution †Corresponding Authors [email protected], [email protected] Abstract Large Language Models (LLMs) trained on code are revolutionizing the software development process. Increasingly, code LLMs are being inte grated into software developmen

from IPython.core.display import display_html

with open(f"{multiple_results_dir}/assembly.html") as file:
    extracted_html = file.read()

display_html(extracted_html[:1000], raw=True)

from IPython.core.display import display_png

with open(f"{multiple_results_dir}/page_images/1.png", "rb") as file:
    extracted_png = file.read()

display_png(extracted_png, raw=True)
Image in a Jupyter notebook
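Before displaying a downloaded page image, a quick sanity check can confirm that the bytes really are a PNG. The sketch below compares the start of the data with the standard 8-byte PNG signature; the helper name `looks_like_png` is hypothetical and the check says nothing about whether the rest of the file is valid.

```python
# Every PNG file begins with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def looks_like_png(data: bytes) -> bool:
    """Return True if the data starts with the PNG file signature."""
    return data.startswith(PNG_SIGNATURE)


print(looks_like_png(PNG_SIGNATURE + b"...rest of file..."))
print(looks_like_png(b"%PDF-1.7"))
```

In the notebook, the check could be applied to `extracted_png` right after reading the file.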

Summary and next steps

You successfully completed this notebook!

You learned how to use the TextExtractionsV2 manager to run text extraction requests, check the status of a submitted job and download the results files.

Check out our Online Documentation for more samples, tutorials, documentation, how-tos, and blog posts.

Authors:

Mateusz Świtała, Software Engineer at watsonx.ai.

Rafał Chrzanowski, Software Engineer Intern at watsonx.ai.

Copyright © 2025 IBM. This notebook and its source code are released under the terms of the MIT License.