GitHub Repository: ibm/watson-machine-learning-samples
Path: blob/master/cpd5.2/notebooks/python_sdk/deployments/foundation_models/Use watsonx Text Extraction V2 service to extract text from file.ipynb
Kernel: watsonx-ai-samples-py-312


Use watsonx.ai Text Extraction V2 service to extract text from file

Disclaimers

  • Use only Projects and Spaces that are available in watsonx context.

Notebook content

This notebook contains the steps and code demonstrating how to run a Text Extraction job using the Python SDK and then retrieve the results as JSON, Markdown, HTML, and image files.

Some familiarity with Python is helpful. This notebook uses Python 3.12.

Learning goal

The purpose of this notebook is to demonstrate how to use the Text Extraction service and the ibm-watsonx-ai Python SDK to extract text from a file located in IBM Cloud Object Storage.

Contents

This notebook contains the following parts:

  • Set up the environment

  • Create data connections with source document and results reference

  • Text Extraction request preparation

  • Run extraction job for single return format

  • Run extraction job for multiple file formats

  • Summary and next steps

Set up the environment

Before you use the sample code in this notebook, you must perform the following setup tasks:

  • Contact your Cloud Pak for Data administrator and ask them for your account credentials

Install dependencies

Note: ibm-watsonx-ai documentation can be found here.

%pip install wget | tail -n 1
%pip install "ibm-watsonx-ai>=1.3.13" | tail -n 1
Successfully installed wget-3.2
Successfully installed anyio-4.9.0 certifi-2025.4.26 charset-normalizer-3.4.1 h11-0.16.0 httpcore-1.0.9 httpx-0.28.1 ibm-cos-sdk-2.14.0 ibm-cos-sdk-core-2.14.0 ibm-cos-sdk-s3transfer-2.14.0 ibm-watsonx-ai-1.3.13 idna-3.10 jmespath-1.0.1 lomond-0.3.3 numpy-2.2.5 pandas-2.2.3 pytz-2025.2 requests-2.32.2 sniffio-1.3.1 tabulate-0.9.0 tzdata-2025.2 urllib3-2.4.0

Define credentials

Authenticate with the watsonx.ai Runtime service on IBM Cloud Pak for Data. You need to provide the admin's username and the platform URL.

username = "PASTE YOUR USERNAME HERE"
url = "PASTE THE PLATFORM URL HERE"

Use the admin's api_key to authenticate watsonx.ai Runtime services:

import getpass

from ibm_watsonx_ai import Credentials

credentials = Credentials(
    username=username,
    api_key=getpass.getpass("Enter your watsonx.ai API key and hit enter: "),
    url=url,
    instance_id="openshift",
    version="5.2",
)

Alternatively, you can use the admin's password:

import getpass

from ibm_watsonx_ai import Credentials

if "credentials" not in locals() or not credentials.api_key:
    credentials = Credentials(
        username=username,
        password=getpass.getpass("Enter your watsonx.ai password and hit enter: "),
        url=url,
        instance_id="openshift",
        version="5.2",
    )

Working with projects

First of all, you need a project to work in. If you do not have a project created already, follow the steps below:

  • Open IBM Cloud Pak main page

  • Click All projects

  • Create an empty project

  • Copy project_id from url and paste it below

Action: Assign project ID below

import os

try:
    project_id = os.environ["PROJECT_ID"]
except KeyError:
    project_id = input("Please enter your project_id (hit enter): ")

Create APIClient instance

from ibm_watsonx_ai import APIClient

client = APIClient(credentials=credentials, project_id=project_id)

Create data connections with source document and results reference

The document from which we are going to extract text is located in IBM Cloud Object Storage (COS). In the following example we use the Granite Code Models paper as the source document. The final results files, which contain the extracted text and the necessary metadata, will also be placed in COS. Therefore, we use the ibm_watsonx_ai.helpers.DataConnection and ibm_watsonx_ai.helpers.S3Location classes to create Python objects that represent references to the processed files. Please note that you have to create a connection asset with your COS details (for a detailed explanation of how to do this, see IBM Cloud Object Storage connection, or check the cells below).

Download source document

import wget

source_filename = "granite_code_models_paper.pdf"
wget.download("https://arxiv.org/pdf/2405.04324", source_filename)
'granite_code_models_paper.pdf'

Create connection to COS

You can skip this section if you already have a connection asset with IBM Cloud Object Storage.

datasource_name = "bluemixcloudobjectstorage"
bucket_name = "PASTE YOUR BUCKET NAME HERE"
cos_credentials = {
    "endpoint_url": "PASTE YOUR ENDPOINT URL HERE",
    "apikey": "PASTE YOUR API KEY HERE",
    "access_key_id": "PASTE YOUR BUCKET ACCESS KEY HERE",
    "secret_access_key": "PASTE YOUR BUCKET SECRET ACCESS KEY HERE",
}

connection_meta_props = {
    client.connections.ConfigurationMetaNames.NAME: f"Connection to {bucket_name} bucket",
    client.connections.ConfigurationMetaNames.DATASOURCE_TYPE: client.connections.get_datasource_type_id_by_name(
        datasource_name
    ),
    client.connections.ConfigurationMetaNames.DESCRIPTION: f"Connection to external bucket: {bucket_name}",
    client.connections.ConfigurationMetaNames.PROPERTIES: {
        "bucket": bucket_name,
        "access_key": cos_credentials["access_key_id"],
        "secret_key": cos_credentials["secret_access_key"],
        "url": cos_credentials["endpoint_url"],
    },
}

connection_details = client.connections.create(meta_props=connection_meta_props)
connection_asset_id = client.connections.get_id(connection_details)
Creating connections... SUCCESS

Create text extraction document reference and result references

from ibm_watsonx_ai.helpers import DataConnection, S3Location

document_reference = DataConnection(
    connection_asset_id=connection_asset_id,
    location=S3Location(bucket=bucket_name, path="granite_code_models_paper.pdf"),
)
results_reference = DataConnection(
    connection_asset_id=connection_asset_id,
    location=S3Location(bucket=bucket_name, path="text_extraction_result/"),  # Must end with /
)

Upload source file to COS

document_reference.set_client(client)
document_reference.write(source_filename)

Text Extraction request preparation

Since the data connections for the source and results files are ready, we can proceed to running the text extraction job. To initialize the Text Extraction manager, we use the TextExtractionsV2 class.

from ibm_watsonx_ai.foundation_models.extractions import TextExtractionsV2

extraction = TextExtractionsV2(api_client=client, project_id=project_id)

Define Text Extraction parameters

When running a job, you can specify the parameters for the text extraction pipeline. For more details about the available parameters, see the documentation. The list of parameters available in the SDK is shown below.

from ibm_watsonx_ai.metanames import TextExtractionsV2ParametersMetaNames

TextExtractionsV2ParametersMetaNames().show()
------------------------  ----  --------
META_PROP NAME            TYPE  REQUIRED
MODE                      str   N
OCR_MODE                  str   N
LANGUAGES                 list  N
AUTO_ROTATION_CORRECTION  bool  N
CREATE_EMBEDDED_IMAGES    str   N
OUTPUT_DPI                int   N
KVP_MODE                  str   N
OUTPUT_TOKENS_AND_BBOX    bool  N
------------------------  ----  --------

In our example, we are going to use the following parameters:

parameters = {
    TextExtractionsV2ParametersMetaNames.MODE: "high_quality",
    TextExtractionsV2ParametersMetaNames.OCR_MODE: "enabled",
    TextExtractionsV2ParametersMetaNames.LANGUAGES: ["en", "fr"],
    TextExtractionsV2ParametersMetaNames.AUTO_ROTATION_CORRECTION: True,
    TextExtractionsV2ParametersMetaNames.CREATE_EMBEDDED_IMAGES: "enabled_placeholder",
    TextExtractionsV2ParametersMetaNames.OUTPUT_DPI: 72,
    TextExtractionsV2ParametersMetaNames.KVP_MODE: "invoice",
}

Run extraction job for single return format

To run an extraction job where only a single output format is requested, the result_formats parameter must be specified using the TextExtractionsV2ResultFormats enum. In our example we use the TextExtractionsV2ResultFormats.MARKDOWN format.

from ibm_watsonx_ai.foundation_models.extractions import TextExtractionsV2ResultFormats

single_format_job_details = extraction.run_job(
    document_reference=document_reference,
    results_reference=results_reference,
    parameters=parameters,
    result_formats=TextExtractionsV2ResultFormats.MARKDOWN,
)
single_format_job_details
{'metadata': {'id': '2b0bcb35-c5eb-427b-acb4-abef610b90be', 'created_at': '2025-04-30T14:19:19.203Z', 'project_id': '20dfc787-5698-4cf0-a5fd-7ab5bea49be3'}, 'entity': {'document_reference': {'type': 'connection_asset', 'connection': {'id': '631b67d7-ec99-4526-8af9-2aa902a1b25b'}, 'location': {'file_name': 'granite_code_models_paper.pdf', 'bucket': 'text-extraction-test'}}, 'results_reference': {'type': 'connection_asset', 'connection': {'id': '631b67d7-ec99-4526-8af9-2aa902a1b25b'}, 'location': {'bucket': 'text-extraction-test', 'file_name': 'text_extraction_result/'}}, 'parameters': {'requested_outputs': ['md'], 'mode': 'high_quality', 'ocr_mode': 'enabled', 'languages': ['en', 'fr'], 'auto_rotation_correction': True, 'create_embedded_images': 'enabled_placeholder', 'output_dpi': 72, 'output_tokens_and_bbox': True, 'kvp_mode': 'invoice'}, 'results': {'status': 'submitted', 'number_pages_processed': 0}}}
single_format_extraction_job_id = extraction.get_job_id(
    extraction_details=single_format_job_details
)

We can list text extraction jobs using the list_jobs method.

extraction.list_jobs()

Moreover, to get details of a particular text extraction request, run the following:

extraction.get_job_details(extraction_job_id=single_format_extraction_job_id)
{'entity': {'document_reference': {'connection': {'id': '631b67d7-ec99-4526-8af9-2aa902a1b25b'}, 'location': {'bucket': 'text-extraction-test', 'file_name': 'granite_code_models_paper.pdf'}, 'type': 'connection_asset'}, 'parameters': {'auto_rotation_correction': True, 'create_embedded_images': 'enabled_placeholder', 'kvp_mode': 'invoice', 'languages': ['en', 'fr'], 'mode': 'high_quality', 'ocr_mode': 'enabled', 'output_dpi': 72, 'output_tokens_and_bbox': True, 'requested_outputs': ['md']}, 'results': {'number_pages_processed': 10, 'running_at': '2025-04-30T14:19:22.441Z', 'status': 'running'}, 'results_reference': {'connection': {'id': '631b67d7-ec99-4526-8af9-2aa902a1b25b'}, 'location': {'bucket': 'text-extraction-test', 'file_name': 'text_extraction_result/'}, 'type': 'connection_asset'}}, 'metadata': {'created_at': '2025-04-30T14:19:19.203Z', 'id': '2b0bcb35-c5eb-427b-acb4-abef610b90be', 'modified_at': '2025-04-30T14:19:43.576Z', 'project_id': '20dfc787-5698-4cf0-a5fd-7ab5bea49be3'}}

To wait until the text extraction job completes, run the following cell:

import time

while True:
    time.sleep(5)
    job_details = extraction.get_job_details(
        extraction_job_id=single_format_extraction_job_id
    )
    job_status = job_details["entity"]["results"]["status"]
    if job_status != "running":
        print("\n", job_status, sep="")
        break
    print(".", sep="", end="", flush=True)
........... completed
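The polling pattern above can also be wrapped in a small reusable helper. The sketch below is illustrative only: the wait_for_job name and the injectable get_details callable are not part of the SDK, so in the notebook you would pass something like lambda: extraction.get_job_details(extraction_job_id=...).

```python
import time


def wait_for_job(get_details, poll_interval=5.0, timeout=600.0):
    """Poll a job-details callable until the status leaves 'submitted'/'running'."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        details = get_details()
        status = details["entity"]["results"]["status"]
        if status not in ("submitted", "running"):
            return status
        time.sleep(poll_interval)
    raise TimeoutError("text extraction job did not finish in time")


# Demonstration with a stub that completes on the third poll:
statuses = iter(["submitted", "running", "completed"])
stub = lambda: {"entity": {"results": {"status": next(statuses)}}}
print(wait_for_job(stub, poll_interval=0.0))  # completed
```

A timeout avoids the unbounded `while True` loop if a job ever stalls.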

Furthermore, to delete a text extraction job run, use the delete_job() method.

Results examination

Once the extraction job is completed, we can use the get_results_reference method to create the results data connection.

single_results_reference = extraction.get_results_reference(
    single_format_extraction_job_id
)
single_results_reference
{'type': 'connection_asset', 'connection': {'id': '631b67d7-ec99-4526-8af9-2aa902a1b25b'}, 'location': {'bucket': 'text-extraction-test', 'file_name': 'text_extraction_result/'}}

Download the result files to a local directory.

single_result_dir = "single_text_extraction_results"
single_results_reference.download_folder(local_dir=single_result_dir)

After a successful download, it's possible to read the file content.

with open(f"{single_result_dir}/assembly.md") as file:
    extracted_text = file.read(1000)

print(extracted_text)
## Granite Code Models: A Family of Open Foundation Models for Code Intelligence Mayank Mishra⋆ Matt Stallone⋆ Gaoyuan Zhang⋆ Yikang Shen Aditya Prasad Adriana Meza Soria Michele Merler Parameswaran Selvam Saptha Surendran Shivdeep Singh Manish Sethi Xuan-Hong Dang Pengyuan Li Kun-Lung Wu Syed Zawad Andrew Coleman Matthew White Mark Lewis Raju Pavuluri Yan Koyfman Boris Lublinsky Maximilien de Bayser Ibrahim Abdelaziz Kinjal Basu Mayank Agarwal Yi Zhou Chris Johnson Aanchal Goyal Hima Patel Yousaf Shah Petros Zerfos Heiko Ludwig Asim Munawar Maxwell Crouse Pavan Kapanipathi Shweta Salaria Bob Calio Sophia Wen Seetharami Seelam Brian Belgodere Carlos Fonseca Amith Singhee Nirmit Desai David D. Cox Ruchir Puri† Rameswar Panda† IBM Research ⋆Equal Contribution †Corresponding Authors [email protected], [email protected] ## Abstract Large Language Models (LLMs) trained on code are revolutionizing the software development process. Increasingly, code LLMs are being inte grated into software
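The extracted Markdown uses `## ` headings, as in the excerpt above, so it can be split into titled sections with a few lines of plain Python. This is a sketch on a short inline sample rather than the full assembly.md:

```python
# Inline sample standing in for the extracted Markdown (assembly.md).
sample = (
    "## Granite Code Models\n"
    "Authors and affiliations...\n"
    "## Abstract\n"
    "Large Language Models (LLMs) trained on code...\n"
)

# Group body lines under the most recent "## " heading.
sections = {}
current = None
for line in sample.splitlines():
    if line.startswith("## "):
        current = line[3:].strip()
        sections[current] = []
    elif current is not None:
        sections[current].append(line)

print(list(sections))  # ['Granite Code Models', 'Abstract']
```

The same loop applied to `extracted_text` (or the whole file) gives per-section access to the paper's content.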

Run extraction job for multiple file formats

To run an extraction job where multiple output formats are requested, the result_formats parameter must be specified using either a list of TextExtractionsV2ResultFormats enums (recommended) or a list of str instances. In our example we will use a list of the following file formats:

  • TextExtractionsV2ResultFormats.ASSEMBLY_JSON

  • TextExtractionsV2ResultFormats.HTML

  • TextExtractionsV2ResultFormats.PAGE_IMAGES

multiple_formats_job_details = extraction.run_job(
    document_reference=document_reference,
    results_reference=results_reference,
    parameters=parameters,
    result_formats=[
        TextExtractionsV2ResultFormats.ASSEMBLY_JSON,
        TextExtractionsV2ResultFormats.HTML,
        TextExtractionsV2ResultFormats.PAGE_IMAGES,
    ],
)
multiple_formats_job_details
{'metadata': {'id': '9a1cc04a-2929-44d9-a192-5dbd386771e6', 'created_at': '2025-04-30T14:25:22.841Z', 'project_id': '20dfc787-5698-4cf0-a5fd-7ab5bea49be3'}, 'entity': {'document_reference': {'type': 'connection_asset', 'connection': {'id': '631b67d7-ec99-4526-8af9-2aa902a1b25b'}, 'location': {'bucket': 'text-extraction-test', 'file_name': 'granite_code_models_paper.pdf'}}, 'results_reference': {'type': 'connection_asset', 'connection': {'id': '631b67d7-ec99-4526-8af9-2aa902a1b25b'}, 'location': {'bucket': 'text-extraction-test', 'file_name': 'text_extraction_result/'}}, 'parameters': {'requested_outputs': ['assembly', 'html', 'page_images'], 'mode': 'high_quality', 'ocr_mode': 'enabled', 'languages': ['en', 'fr'], 'auto_rotation_correction': True, 'create_embedded_images': 'enabled_placeholder', 'output_dpi': 72, 'output_tokens_and_bbox': True, 'kvp_mode': 'invoice'}, 'results': {'status': 'submitted', 'number_pages_processed': 0}}}
multiple_formats_extraction_job_id = extraction.get_job_id(
    extraction_details=multiple_formats_job_details
)

We can list text extraction jobs using the list_jobs method.

extraction.list_jobs()

Moreover, to get details of a particular text extraction request, run the following:

extraction.get_job_details(extraction_job_id=multiple_formats_extraction_job_id)
{'entity': {'document_reference': {'connection': {'id': '631b67d7-ec99-4526-8af9-2aa902a1b25b'}, 'location': {'bucket': 'text-extraction-test', 'file_name': 'granite_code_models_paper.pdf'}, 'type': 'connection_asset'}, 'parameters': {'auto_rotation_correction': True, 'create_embedded_images': 'enabled_placeholder', 'kvp_mode': 'invoice', 'languages': ['en', 'fr'], 'mode': 'high_quality', 'ocr_mode': 'enabled', 'output_dpi': 72, 'output_tokens_and_bbox': True, 'requested_outputs': ['assembly', 'html', 'page_images']}, 'results': {'number_pages_processed': 0, 'running_at': '2025-04-30T14:25:24.607Z', 'status': 'running'}, 'results_reference': {'connection': {'id': '631b67d7-ec99-4526-8af9-2aa902a1b25b'}, 'location': {'bucket': 'text-extraction-test', 'file_name': 'text_extraction_result/'}, 'type': 'connection_asset'}}, 'metadata': {'created_at': '2025-04-30T14:25:22.841Z', 'id': '9a1cc04a-2929-44d9-a192-5dbd386771e6', 'modified_at': '2025-04-30T14:25:30.259Z', 'project_id': '20dfc787-5698-4cf0-a5fd-7ab5bea49be3'}}

To wait until the text extraction job completes, run the following cell:

import time

while True:
    time.sleep(5)
    job_details = extraction.get_job_details(
        extraction_job_id=multiple_formats_extraction_job_id
    )
    job_status = job_details["entity"]["results"]["status"]
    if job_status != "running":
        print("\n", job_status, sep="")
        break
    print(".", sep="", end="", flush=True)
.............. completed

Furthermore, to delete a text extraction job run, use the delete_job() method.

Results examination

Once the extraction job is completed, we can use the get_results_reference method to create the results data connection.

multiple_results_reference = extraction.get_results_reference(
    multiple_formats_extraction_job_id
)
multiple_results_reference
{'type': 'connection_asset', 'connection': {'id': '631b67d7-ec99-4526-8af9-2aa902a1b25b'}, 'location': {'bucket': 'text-extraction-test', 'file_name': 'text_extraction_result/'}}
multiple_results_dir = "multiple_text_extraction_results"
multiple_results_reference.download_folder(local_dir=multiple_results_dir)

After a successful download, it's possible to read the file contents.

import json

with open(f"{multiple_results_dir}/assembly.json") as file:
    extracted_assembly_json = json.load(file)

tokens = extracted_assembly_json["all_structures"]["tokens"]
result_text = " ".join(item["text"] for item in tokens)

print(result_text[:1000])
Granite Code Models: A Family of Open Foundation Models for Code Intelligence Mayank Mishra⋆ Matt Stallone⋆ Gaoyuan Zhang⋆ Yikang Shen Aditya Prasad Adriana Meza Soria Michele Merler Parameswaran Selvam Saptha Surendran Shivdeep Singh Manish Sethi Xuan-Hong Dang Pengyuan Li Kun-Lung Wu Syed Zawad Andrew Coleman Matthew White Mark Lewis Raju Pavuluri Yan Koyfman Boris Lublinsky Maximilien de Bayser Ibrahim Abdelaziz Kinjal Basu Mayank Agarwal Yi Zhou Chris Johnson Aanchal Goyal Hima Patel Yousaf Shah Petros Zerfos Heiko Ludwig Asim Munawar Maxwell Crouse Pavan Kapanipathi Shweta Salaria Bob Calio Sophia Wen Seetharami Seelam Brian Belgodere Carlos Fonseca Amith Singhee Nirmit Desai David D. Cox Ruchir Puri† Rameswar Panda† IBM Research ⋆Equal Contribution †Corresponding Authors [email protected], [email protected] Abstract Large Language Models (LLMs) trained on code are revolutionizing the software development process. Increasingly, code LLMs are being inte grated into software developmen
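Because the assembly JSON exposes the document token by token, simple corpus statistics are easy to compute. The sketch below assumes only that each token dict carries a "text" field, as used in the cell above, and runs on stub data rather than the real extraction result:

```python
from collections import Counter

# Stub standing in for extracted_assembly_json["all_structures"]["tokens"];
# only the "text" field used in the notebook is assumed here.
tokens = [{"text": "Granite"}, {"text": "Code"}, {"text": "Models"}, {"text": "Code"}]

# Count how often each token string occurs.
counts = Counter(tok["text"] for tok in tokens)
print(counts.most_common(1))  # [('Code', 2)]
```

Applied to the real tokens list, this gives a quick sanity check on what the OCR pipeline actually extracted.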
from IPython.core.display import display_html

with open(f"{multiple_results_dir}/assembly.html") as file:
    extracted_html = file.read()

display_html(extracted_html[:1000], raw=True)
from IPython.core.display import display_png

with open(f"{multiple_results_dir}/page_images/1.png", "rb") as file:
    extracted_png = file.read()

display_png(extracted_png, raw=True)
Image in a Jupyter notebook

Summary and next steps

You successfully completed this notebook!

You learned how to use the TextExtractionsV2 manager to run text extraction requests, check the status of a submitted job, and download the result files.

Check out our Online Documentation for more samples, tutorials, documentation, how-tos, and blog posts.

Authors:

Mateusz Świtała, Software Engineer at watsonx.ai.

Rafał Chrzanowski, Software Engineer Intern at watsonx.ai.

Copyright © 2025 IBM. This notebook and its source code are released under the terms of the MIT License.