Use watsonx.ai Text Extraction V2 service to extract text from file
Disclaimers
Use only Projects and Spaces that are available in the watsonx context.
Notebook content
This notebook contains the steps and code that demonstrate how to run a Text Extraction job using the Python SDK and then retrieve the results as JSON, Markdown, HTML, and image files.
Some familiarity with Python is helpful. This notebook uses Python 3.12.
Learning goal
The purpose of this notebook is to demonstrate how to use the Text Extraction service and the ibm-watsonx-ai Python SDK to retrieve text from a file located in IBM Cloud Object Storage.
Contents
This notebook contains the following parts:
Install dependencies
Note: ibm-watsonx-ai documentation can be found here.
Successfully installed wget-3.2
Successfully installed anyio-4.9.0 certifi-2025.4.26 charset-normalizer-3.4.1 h11-0.16.0 httpcore-1.0.9 httpx-0.28.1 ibm-cos-sdk-2.14.0 ibm-cos-sdk-core-2.14.0 ibm-cos-sdk-s3transfer-2.14.0 ibm-watsonx-ai-1.3.13 idna-3.10 jmespath-1.0.1 lomond-0.3.3 numpy-2.2.5 pandas-2.2.3 pytz-2025.2 requests-2.32.2 sniffio-1.3.1 tabulate-0.9.0 tzdata-2025.2 urllib3-2.4.0
Define credentials
Authenticate the watsonx.ai Runtime service on IBM Cloud Pak for Data. You need to provide the admin's username and the platform url.
Use the admin's api_key to authenticate watsonx.ai Runtime services:
Alternatively, you can use the admin's password:
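The two authentication options can be sketched as follows. This is a minimal sketch, not the notebook's original cell: the helper names `credentials_kwargs` and `build_credentials` are hypothetical, and the `instance_id="openshift"` and `version="5.2"` values are assumptions for a Cloud Pak for Data 5.2 deployment.

```python
def credentials_kwargs(url, username, *, api_key=None, password=None, version="5.2"):
    """Collect keyword arguments for ibm_watsonx_ai.Credentials (hypothetical helper)."""
    # Exactly one secret must be supplied: an API key or a password.
    if (api_key is None) == (password is None):
        raise ValueError("Provide exactly one of api_key or password")
    kwargs = {
        "url": url,
        "username": username,
        "instance_id": "openshift",  # assumption: value used for Cloud Pak for Data
        "version": version,          # assumption: matches the installed release
    }
    if api_key is not None:
        kwargs["api_key"] = api_key
    else:
        kwargs["password"] = password
    return kwargs


def build_credentials(**kwargs):
    # Imported lazily so the helper above can be used without the SDK installed.
    from ibm_watsonx_ai import Credentials

    return Credentials(**kwargs)


# Usage (not executed here; read the secret with getpass rather than hard-coding it):
# credentials = build_credentials(
#     **credentials_kwargs("https://<platform-url>", "<username>", api_key="<api_key>")
# )
```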
Working with projects
First of all, you need to create a project that will be used for your work. If you do not have a project created already, follow the steps below:
Open IBM Cloud Pak main page
Click all projects
Create an empty project
Copy project_id from the URL and paste it below
Action: Assign project ID below
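If you prefer not to copy the ID by hand, it can be pulled out of the project URL. The sketch below assumes that the ID is a UUID and that it follows a `/projects/` path segment; both are assumptions about the URL layout, and `project_id_from_url` is a hypothetical helper.

```python
import re
from urllib.parse import urlparse

# Assumption: project IDs are UUIDs in the usual 8-4-4-4-12 hexadecimal layout.
_UUID = r"[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}"


def project_id_from_url(url):
    """Return the UUID that follows '/projects/' in the URL path."""
    path = urlparse(url).path
    match = re.search(rf"/projects/({_UUID})(?:/|$)", path)
    if match is None:
        raise ValueError(f"No project ID found in URL: {url!r}")
    return match.group(1)
```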
Create APIClient instance
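A minimal sketch of this step, assuming `credentials` and `project_id` were defined in the previous cells. The wrapper `create_client` is hypothetical; it only adds a guard against an unset project ID before calling the SDK.

```python
def create_client(credentials, project_id):
    """Create an APIClient bound to one project (hypothetical wrapper)."""
    if not project_id or not str(project_id).strip():
        raise ValueError("project_id must be set before creating the client")
    # Imported lazily so the guard above can be exercised without the SDK installed.
    from ibm_watsonx_ai import APIClient

    return APIClient(credentials=credentials, project_id=project_id)


# Usage (not executed here):
# client = create_client(credentials, project_id)
```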
The document from which we are going to extract text is located in IBM Cloud Object Storage (COS). In the following example we use the Granite Code Models paper as the source document. The final results file, which will contain the extracted text and the necessary metadata, will also be placed in COS. Therefore, we use the ibm_watsonx_ai.helpers.DataConnection and ibm_watsonx_ai.helpers.S3Location classes to create Python objects that represent references to the processed files. Note that you have to create a connection asset with your COS details (for a detailed explanation of how to do this, see IBM Cloud Object Storage connection or the cells below).
Download source document
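The download can be sketched with the standard library. The pip output earlier shows the `wget` package being installed, so this sketch swaps in `urllib.request.urlretrieve` instead; `download_if_missing` is a hypothetical helper and the document URL is left as a placeholder for you to fill in.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_if_missing(url, destination):
    """Download url to destination unless the file is already present."""
    destination = Path(destination)
    if not destination.exists():
        destination.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(url, str(destination))
    return destination


# Usage (not executed here; the file name is an example):
# local_source = download_if_missing("<document-url>", "granite_code_models.pdf")
```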
Create connection to COS
You can skip this section if you already have a connection asset for IBM Cloud Object Storage.
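A connection asset can be created through the client's `connections` interface. This is a sketch under assumptions: `create_cos_connection` is a hypothetical wrapper, the datasource name `bluemixcloudobjectstorage` is assumed to identify COS in your environment, and the content of `properties` depends on your COS credentials.

```python
def create_cos_connection(client, name, properties,
                          datasource_name="bluemixcloudobjectstorage"):
    """Create a COS connection asset and return its ID (hypothetical wrapper)."""
    names = client.connections.ConfigurationMetaNames
    meta_props = {
        names.NAME: name,
        # Assumption: this datasource name resolves to the COS connection type.
        names.DATASOURCE_TYPE: client.connections.get_datasource_type_id_by_name(
            datasource_name
        ),
        names.DESCRIPTION: "Connection to IBM Cloud Object Storage",
        names.PROPERTIES: dict(properties),
    }
    details = client.connections.create(meta_props=meta_props)
    return client.connections.get_id(details)


# Usage (not executed here):
# connection_asset_id = create_cos_connection(client, "COS connection", cos_properties)
```

The `cos_properties` dictionary usually carries the bucket name, HMAC keys, and endpoint URL; check the connection documentation for the exact keys your deployment expects.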
Create text extraction document reference and result references
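The two references can be sketched as below, using the `DataConnection` and `S3Location` helpers named earlier. The helper names `reference_specs` and `to_data_connection` are hypothetical, and the bucket and path values are placeholders.

```python
def reference_specs(connection_asset_id, bucket, source_path, results_path):
    """Describe the source and results locations as plain dictionaries."""
    if source_path == results_path:
        raise ValueError("source_path and results_path must differ")
    common = {"connection_asset_id": connection_asset_id, "bucket": bucket}
    return {
        "document": {**common, "path": source_path},
        "results": {**common, "path": results_path},
    }


def to_data_connection(spec):
    # Imported lazily so reference_specs can be used without the SDK installed.
    from ibm_watsonx_ai.helpers import DataConnection, S3Location

    return DataConnection(
        connection_asset_id=spec["connection_asset_id"],
        location=S3Location(bucket=spec["bucket"], path=spec["path"]),
    )


# Usage (not executed here):
# specs = reference_specs(connection_asset_id, "<bucket>", "<source-path>", "<results-path>")
# document_reference = to_data_connection(specs["document"])
# results_reference = to_data_connection(specs["results"])
```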
Upload source file to COS
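The upload can be sketched through the document reference itself. `upload_source` is a hypothetical wrapper; it assumes the reference was built as a `DataConnection` and that the local file already exists.

```python
from pathlib import Path


def upload_source(document_reference, client, local_path, remote_name):
    """Attach the client to the reference and write the local file to COS."""
    local_path = Path(local_path)
    if not local_path.is_file():
        raise FileNotFoundError(local_path)
    document_reference.set_client(client)
    document_reference.write(str(local_path), remote_name=remote_name)


# Usage (not executed here):
# upload_source(document_reference, client, "granite_code_models.pdf", "<source-path>")
```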
Since the data connections for the source and results files are ready, we can proceed to running the text extraction job. To initialize the Text Extraction manager, we use the TextExtractionsV2 class.
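A minimal sketch of the initialization, assuming the V2 manager is importable from `ibm_watsonx_ai.foundation_models.extractions`. The wrapper `create_extraction_manager` is hypothetical and only checks that exactly one scope (project or space) is given.

```python
def create_extraction_manager(client, *, project_id=None, space_id=None):
    """Create the Text Extraction manager for one project or one space."""
    if (project_id is None) == (space_id is None):
        raise ValueError("Provide exactly one of project_id or space_id")
    # Imported lazily so the check above can be exercised without the SDK installed.
    from ibm_watsonx_ai.foundation_models.extractions import TextExtractionsV2

    return TextExtractionsV2(
        api_client=client, project_id=project_id, space_id=space_id
    )


# Usage (not executed here):
# extraction = create_extraction_manager(client, project_id=project_id)
```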
Define Text Extraction parameters
When running a job, the parameters for the text extraction pipeline can be specified. For more details about the available parameters, see the documentation. The list of parameters available in the SDK can be found below.
In our example we are going to use the following parameters:
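As an illustration only, a parameters dictionary might be assembled as below. The key names and values (`mode`, `ocr_mode`, `languages`, `output_dpi`) are assumptions and are not taken from this notebook; check the parameter list in the SDK before relying on them. `extraction_parameters` is a hypothetical helper.

```python
def extraction_parameters(**options):
    """Keep only the options that were explicitly set (value is not None)."""
    return {key: value for key, value in options.items() if value is not None}


# Assumed keys and values, shown for illustration.
parameters = extraction_parameters(
    mode="high_quality",
    ocr_mode="enabled",
    languages=["en"],
    output_dpi=None,  # left unset, so it is omitted from the result
)
```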
Run extraction job for single return format
In order to run an extraction job where only a single output format is requested, the result_formats parameter must be specified using the TextExtractionsV2ResultFormats enum. In our example we will use the TextExtractionsV2ResultFormats.MARKDOWN format.
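Submitting the job can be sketched as follows. `submit_job` is a hypothetical wrapper around `run_job`; the keyword names passed to `run_job` and the import path of the enum in the usage comment are assumptions about the SDK.

```python
def submit_job(extraction, document_reference, results_reference,
               result_formats, parameters=None):
    """Start a text extraction job and return (job_id, details)."""
    details = extraction.run_job(
        document_reference=document_reference,
        results_reference=results_reference,
        parameters=parameters,
        result_formats=result_formats,
    )
    return extraction.get_job_id(details), details


# Usage (not executed here):
# from ibm_watsonx_ai.foundation_models.schema import TextExtractionsV2ResultFormats
# extraction_job_id, details = submit_job(
#     extraction, document_reference, results_reference,
#     TextExtractionsV2ResultFormats.MARKDOWN, parameters=parameters,
# )
```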
We can list text extraction jobs using the list method.
Moreover, to get details of a particular text extraction request, run the following:
To wait until the text extraction job completes, run the following cell:
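Waiting for completion amounts to polling the job details until a terminal status appears. The sketch below assumes the status is found at `details["entity"]["results"]["status"]` and that `completed` and `failed` are the terminal values; `wait_for_job` is a hypothetical helper that takes the details getter as a callable.

```python
import time

# Assumption: these are the statuses after which the job no longer changes.
TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_job(get_details, job_id, poll_interval=10.0, timeout=600.0,
                 sleep=time.sleep):
    """Poll get_details(job_id) until the job finishes or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        details = get_details(job_id)
        # Assumption: the status lives under entity -> results -> status.
        status = details["entity"]["results"]["status"]
        if status in TERMINAL_STATUSES:
            return status, details
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job {job_id} still '{status}' after {timeout} s")
        sleep(poll_interval)


# Usage (not executed here):
# status, details = wait_for_job(extraction.get_job_details, extraction_job_id)
```

Passing the getter as a callable keeps the loop independent of the SDK, so the same function works for both the single-format and the multi-format job.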
Furthermore, to delete a text extraction job, use the delete_job() method.
Results examination
Once the extraction job is completed, we can use the get_results_reference method to create the results data connection.
Download the file to the result path.
After a successful download, it's possible to read the file content.
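Downloading and reading the result can be sketched in one step. `download_and_read` is a hypothetical helper; it assumes the results reference exposes a `download(filename=...)` method and that the result is a UTF-8 text file such as Markdown.

```python
from pathlib import Path


def download_and_read(results_reference, filename, encoding="utf-8"):
    """Download the results file to filename and return its text content."""
    results_reference.download(filename=str(filename))
    return Path(filename).read_text(encoding=encoding)


# Usage (not executed here; the file name is an example):
# results_reference = extraction.get_results_reference(extraction_job_id)
# text = download_and_read(results_reference, "text_extraction_result.md")
# print(text[:500])
```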
Run extraction job for multiple file formats
In order to run an extraction job where multiple output formats are requested, the result_formats parameter must be specified using either a list of TextExtractionsV2ResultFormats enums (recommended) or a list of str instances. In our example we will use a list of the following file formats:
TextExtractionsV2ResultFormats.ASSEMBLY_JSON
TextExtractionsV2ResultFormats.HTML
TextExtractionsV2ResultFormats.PAGE_IMAGES
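Because the parameter accepts enums or strings, it can help to normalize the input before submitting the job. The sketch below uses a hypothetical helper, `as_format_list`, which wraps a single item in a list, removes duplicates while keeping order, and rejects an empty request.

```python
from enum import Enum


def as_format_list(formats):
    """Return the requested formats as a de-duplicated, non-empty list."""
    # A lone string or enum member counts as one format, not as an iterable.
    if isinstance(formats, (str, Enum)):
        formats = [formats]
    unique = []
    for item in formats:
        if item not in unique:
            unique.append(item)
    if not unique:
        raise ValueError("At least one result format is required")
    return unique


# Usage (not executed here; the import path is an assumption about the SDK):
# from ibm_watsonx_ai.foundation_models.schema import TextExtractionsV2ResultFormats
# result_formats = as_format_list([
#     TextExtractionsV2ResultFormats.ASSEMBLY_JSON,
#     TextExtractionsV2ResultFormats.HTML,
#     TextExtractionsV2ResultFormats.PAGE_IMAGES,
# ])
```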
We can list text extraction jobs using the list method.
Moreover, to get details of a particular text extraction request, run the following:
To wait until the text extraction job completes, run the following cell:
Furthermore, to delete a text extraction job, use the delete_job() method.
Results examination
Once the extraction job is completed, we can use the get_results_reference method to create the results data connection.
After a successful download, it's possible to read the file contents.
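With several formats requested, the download yields more than one file. Assuming the results have already been saved to a local directory, a small standard-library sketch can group them by file extension; `group_by_suffix` is a hypothetical helper and the directory name in the usage comment is a placeholder.

```python
from collections import defaultdict
from pathlib import Path


def group_by_suffix(results_dir):
    """Map each file extension to the sorted list of files that carry it."""
    groups = defaultdict(list)
    for path in sorted(Path(results_dir).rglob("*")):
        if path.is_file():
            groups[path.suffix.lower()].append(path)
    return dict(groups)


# Usage (not executed here):
# for suffix, files in group_by_suffix("<local-results-dir>").items():
#     print(suffix, len(files))
```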
Summary and next steps
You successfully completed this notebook!
You learned how to use the TextExtractionsV2 manager to run text extraction requests, check the status of a submitted job, and download a results file.
Check out our Online Documentation for more samples, tutorials, documentation, how-tos, and blog posts.
Authors:
Mateusz Świtała, Software Engineer at watsonx.ai.
Rafał Chrzanowski, Software Engineer Intern at watsonx.ai.
Copyright © 2025 IBM. This notebook and its source code are released under the terms of the MIT License.