Use watsonx.ai Text Extraction service to extract text from a file
Disclaimers
Use only Projects and Spaces that are available in the watsonx context.
Notebook content
This notebook contains the steps and code demonstrating how to run a Text Extraction job using the Python SDK and then retrieve the results in the form of a Markdown file.
Some familiarity with Python is helpful. This notebook uses Python 3.11.
Learning goal
The purpose of this notebook is to demonstrate how to use the Text Extraction service and the ibm-watsonx-ai Python SDK to extract text from a file located in IBM Cloud Object Storage.
Contents
This notebook contains the following parts:
Install required packages
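A minimal sketch of this step, assuming a notebook environment where shell commands can be run with the ! prefix:

```python
# Install (or upgrade) the ibm-watsonx-ai SDK used throughout this notebook.
!pip install -U ibm-watsonx-ai | tail -n 1
```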
Connection to WML
Authenticate the Watson Machine Learning service on IBM Cloud Pak for Data. You need to provide the platform url, your username, and api_key. Alternatively, you can use username and password to authenticate WML services.
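A minimal sketch of the credentials setup, assuming an IBM Cloud Pak for Data 5.1 cluster; the url, username, and api_key values are placeholders, and instance_id="openshift" is the value typically used for on-premises deployments:

```python
from ibm_watsonx_ai import Credentials

credentials = Credentials(
    url="https://<your-cpd-cluster-url>",  # placeholder: platform url
    username="<your-username>",            # placeholder
    api_key="<your-api-key>",              # placeholder
    instance_id="openshift",               # on-premises CPD instance
    version="5.1",                         # CPD version
)

# Alternatively, authenticate with username and password:
# credentials = Credentials(
#     url="https://<your-cpd-cluster-url>",
#     username="<your-username>",
#     password="<your-password>",
#     instance_id="openshift",
#     version="5.1",
# )
```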
Working with projects
First of all, you need to create a project that will be used for your work. If you do not have a project already created, follow the steps below.
Open the IBM Cloud Pak main page
Click All projects
Create an empty project
Copy the project_id from the project URL and paste it below
Action: Assign project ID below
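For example (the value below is a placeholder):

```python
# Paste the project_id copied from the project URL.
project_id = "PASTE YOUR PROJECT_ID HERE"
```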
Create an instance of APIClient with authentication details.
To be able to interact with all resources available in Watson Machine Learning, you need to set the project_id that you will be using.
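A minimal sketch, reusing the credentials and project_id defined above:

```python
from ibm_watsonx_ai import APIClient

# Create the API client scoped to the selected project.
client = APIClient(credentials=credentials, project_id=project_id)
```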
The document from which we are going to extract text is located in IBM Cloud Object Storage (COS). In the following example we are going to use the Granite Code Models paper as the source text document. The final results file, which will contain the extracted text and the necessary metadata, will also be placed in COS. Therefore, we use the ibm_watsonx_ai.helpers.DataConnection and ibm_watsonx_ai.helpers.S3Location classes to create Python objects that represent references to the processed files. Please note that you have to create a connection asset with your COS details (for a detailed explanation of how to do this, see the IBM Cloud Object Storage connection documentation or check the cells below).
Create connection to COS
You can skip this section if you already have a connection asset for IBM Cloud Object Storage.
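The sketch below shows one way to create such a connection asset with the SDK; the data source type name and the COS property keys are assumptions that may need adjusting for your environment, and all credential values are placeholders:

```python
# Look up the data source type id for IBM Cloud Object Storage.
# The type name is an assumption; inspect client.connections.list_datasource_types() if it differs.
cos_datasource_type = client.connections.get_datasource_type_id_by_name("bluemixcloudobjectstorage")

connection_details = client.connections.create(
    {
        client.connections.ConfigurationMetaNames.NAME: "COS connection for Text Extraction",
        client.connections.ConfigurationMetaNames.DATASOURCE_TYPE: cos_datasource_type,
        client.connections.ConfigurationMetaNames.PROPERTIES: {
            "bucket": "<your-bucket-name>",          # placeholder
            "access_key": "<your-hmac-access-key>",  # placeholder HMAC credential
            "secret_key": "<your-hmac-secret-key>",  # placeholder HMAC credential
            "url": "<your-cos-endpoint-url>",        # placeholder COS endpoint
        },
    }
)

connection_asset_id = client.connections.get_id(connection_details)
```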
Upload file and create document and results reference
Finally, we can create the DataConnection objects that represent the document and results references.
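A sketch of this step; it assumes the source PDF is already available locally as granite_code_models_paper.pdf and uses ibm_boto3 (from the ibm-cos-sdk package) with HMAC credentials as one possible way to upload it, the bucket and credential values being placeholders:

```python
import ibm_boto3
from ibm_watsonx_ai.helpers import DataConnection, S3Location

bucket_name = "<your-bucket-name>"                 # placeholder
source_filename = "granite_code_models_paper.pdf"  # local copy of the source document
results_filename = "text_extraction_results.md"    # where the service will write the results

# Upload the source document to COS (any other upload method works just as well).
cos_client = ibm_boto3.client(
    "s3",
    aws_access_key_id="<your-hmac-access-key>",      # placeholder
    aws_secret_access_key="<your-hmac-secret-key>",  # placeholder
    endpoint_url="<your-cos-endpoint-url>",          # placeholder
)
cos_client.upload_file(source_filename, bucket_name, source_filename)

# Reference to the source document in COS.
document_reference = DataConnection(
    connection_asset_id=connection_asset_id,
    location=S3Location(bucket=bucket_name, path=source_filename),
)

# Reference to the results file that the Text Extraction job will create in COS.
results_reference = DataConnection(
    connection_asset_id=connection_asset_id,
    location=S3Location(bucket=bucket_name, path=results_filename),
)
```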
Since the data connections for the source and results files are ready, we can proceed to the text extraction job run step. To initialize the Text Extraction manager, we use the TextExtractions class.
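A minimal sketch of the initialization, reusing the API client created earlier:

```python
from ibm_watsonx_ai.foundation_models.extractions import TextExtractions

# Initialize the Text Extraction manager within the project scope.
extraction = TextExtractions(api_client=client, project_id=project_id)
```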
When running a job, the steps for the text extraction pipeline can be specified. For more details about the available steps, see the documentation. The list of steps available in the SDK can be found below.
To view sample parameter values for the text extraction steps, run get_example_values().
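A sketch of that call, assuming the step metanames are exposed as TextExtractionsMetaNames in ibm_watsonx_ai.metanames:

```python
from ibm_watsonx_ai.metanames import TextExtractionsMetaNames

# Sample parameter values for the supported text extraction steps.
TextExtractionsMetaNames().get_example_values()
```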
In our example, we are going to use the following steps:
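A sketch of the steps configuration; the keys and options below mirror typical OCR and table-processing settings and should be cross-checked against the get_example_values() output:

```python
# Steps for the text extraction pipeline (keys follow the example values above).
steps = {
    "ocr": {
        "languages_list": ["en"],  # run OCR assuming English text
    },
    "tables_processing": {
        "enabled": True,           # also extract tables from the document
    },
}
```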
Now, we can run the Text Extraction job.
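A minimal sketch of submitting the job with the references and steps defined above; keeping the job id via get_id for the follow-up calls is an assumption based on the manager's helper methods:

```python
# Submit the text extraction job.
extraction_details = extraction.run_job(
    document_reference=document_reference,
    results_reference=results_reference,
    steps=steps,
)

# Keep the job identifier for the follow-up calls.
extraction_id = extraction.get_id(extraction_details)
```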
We can list text extraction jobs using the manager's list method.
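For example (assuming the manager exposes list_jobs, as in recent SDK releases):

```python
# List the submitted text extraction jobs and their states.
extraction.list_jobs()
```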
Moreover, to get the details of a particular text extraction request, run the following:
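A minimal sketch, using the job id stored above:

```python
# Fetch the details (including the current status) of the submitted job.
extraction.get_job_details(extraction_id=extraction_id)
```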
Furthermore, to delete a text extraction run job, use the delete_job() method.
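For example (only do this once you no longer need the job or its results):

```python
# Cancel/delete the text extraction run job.
extraction.delete_job(extraction_id=extraction_id)
```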
Once the text extraction job is completed, we can download the results file and process it further.
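A sketch of the download step; that get_results_reference returns a DataConnection which can be downloaded to a local file is an assumption based on the manager's helpers, and the preview at the end is optional:

```python
# Obtain a DataConnection pointing at the results file and download it locally.
results_conn = extraction.get_results_reference(extraction_id=extraction_id)
results_conn.download(filename="text_extraction_results.md")

# Optionally preview the beginning of the extracted Markdown.
with open("text_extraction_results.md", "r", encoding="utf-8") as f:
    print(f.read()[:1000])
```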
Summary and next steps
You successfully completed this notebook!
You learned how to use the TextExtractions manager to run text extraction requests, check the status of a submitted job, and download the results file.
Check out our Online Documentation for more samples, tutorials, documentation, how-tos, and blog posts.
Authors:
Mateusz Świtała, Software Engineer at Watson Machine Learning.
Copyright © 2024-2025 IBM. This notebook and its source code are released under the terms of the MIT License.