Use watsonx.ai Text Extraction service to extract text from a file
Disclaimers
Use only Projects and Spaces that are available in the watsonx context.
Notebook content
This notebook contains the steps and code demonstrating how to run a Text Extraction job using the Python SDK and then retrieve the results in the form of a Markdown file.
Some familiarity with Python is helpful. This notebook uses Python 3.11.
Learning goal
The purpose of this notebook is to demonstrate how to use the Text Extraction service and the ibm-watsonx-ai Python SDK to extract text from a file located in IBM Cloud Object Storage.
Contents
This notebook contains the following parts:
Install required packages
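A minimal sketch of this step, assuming a notebook environment where shell commands can be run with the ! prefix:

```python
# Install (or upgrade) the ibm-watsonx-ai SDK used throughout this notebook.
!pip install -U ibm-watsonx-ai | tail -n 1
```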
Connection to WML
Authenticate the Watson Machine Learning service on IBM Cloud Pak for Data. You need to provide the platform url, your username, and api_key. Alternatively, you can use username and password to authenticate WML services.
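A minimal sketch of the credentials setup, assuming an IBM Cloud Pak for Data 5.1 cluster; the url, username, and api_key values are placeholders, and instance_id="openshift" is the value typically used for on-premises deployments:

```python
from ibm_watsonx_ai import Credentials

credentials = Credentials(
    url="https://<your-cpd-cluster-url>",  # placeholder: platform url
    username="<your-username>",            # placeholder
    api_key="<your-api-key>",              # placeholder
    instance_id="openshift",               # on-premises CPD instance
    version="5.1",                         # CPD version
)

# Alternatively, authenticate with username and password:
# credentials = Credentials(
#     url="https://<your-cpd-cluster-url>",
#     username="<your-username>",
#     password="<your-password>",
#     instance_id="openshift",
#     version="5.1",
# )
```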
Working with projects
First of all, you need to create a project that will be used for your work. If you do not have a project already created, follow the steps below.
Open the IBM Cloud Pak main page
Click All projects
Create an empty project
Copy the project_id from the project URL and paste it below
Action: Assign project ID below
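For example (the value below is a placeholder):

```python
# Paste the project_id copied from the project URL.
project_id = "PASTE YOUR PROJECT_ID HERE"
```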
Create an instance of APIClient with authentication details.
To be able to interact with all resources available in Watson Machine Learning, you need to set the project_id that you will be using.
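A minimal sketch, reusing the credentials and project_id defined above:

```python
from ibm_watsonx_ai import APIClient

# Create the API client scoped to the selected project.
client = APIClient(credentials=credentials, project_id=project_id)
```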
The document from which we are going to extract text is located in IBM Cloud Object Storage (COS). In the following example we are going to use the Granite Code Models paper as the source text document. The final results file, which will contain the extracted text and the necessary metadata, will also be placed in COS. Therefore, we use the ibm_watsonx_ai.helpers.DataConnection and ibm_watsonx_ai.helpers.S3Location classes to create Python objects that represent references to the processed files. Please note that you have to create a connection asset with your COS details (for a detailed explanation of how to do this, see the IBM Cloud Object Storage connection documentation or check the cells below).
Create connection to COS
You can skip this section if you already have a connection asset for IBM Cloud Object Storage.
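The sketch below shows one way to create such a connection asset with the SDK; the data source type name and the COS property keys are assumptions that may need adjusting for your environment, and all credential values are placeholders:

```python
# Look up the data source type id for IBM Cloud Object Storage.
# The type name is an assumption; inspect client.connections.list_datasource_types() if it differs.
cos_datasource_type = client.connections.get_datasource_type_id_by_name("bluemixcloudobjectstorage")

connection_details = client.connections.create(
    {
        client.connections.ConfigurationMetaNames.NAME: "COS connection for Text Extraction",
        client.connections.ConfigurationMetaNames.DATASOURCE_TYPE: cos_datasource_type,
        client.connections.ConfigurationMetaNames.PROPERTIES: {
            "bucket": "<your-bucket-name>",          # placeholder
            "access_key": "<your-hmac-access-key>",  # placeholder HMAC credential
            "secret_key": "<your-hmac-secret-key>",  # placeholder HMAC credential
            "url": "<your-cos-endpoint-url>",        # placeholder COS endpoint
        },
    }
)

connection_asset_id = client.connections.get_id(connection_details)
```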
Upload file and create document and results reference
Finally, we can create the DataConnection objects that represent the document and results references.
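A sketch of this step; it assumes the source PDF is already available locally as granite_code_models_paper.pdf and uses ibm_boto3 (from the ibm-cos-sdk package) with HMAC credentials as one possible way to upload it, the bucket and credential values being placeholders:

```python
import ibm_boto3
from ibm_watsonx_ai.helpers import DataConnection, S3Location

bucket_name = "<your-bucket-name>"                 # placeholder
source_filename = "granite_code_models_paper.pdf"  # local copy of the source document
results_filename = "text_extraction_results.md"    # where the service will write the results

# Upload the source document to COS (any other upload method works just as well).
cos_client = ibm_boto3.client(
    "s3",
    aws_access_key_id="<your-hmac-access-key>",      # placeholder
    aws_secret_access_key="<your-hmac-secret-key>",  # placeholder
    endpoint_url="<your-cos-endpoint-url>",          # placeholder
)
cos_client.upload_file(source_filename, bucket_name, source_filename)

# Reference to the source document in COS.
document_reference = DataConnection(
    connection_asset_id=connection_asset_id,
    location=S3Location(bucket=bucket_name, path=source_filename),
)

# Reference to the results file that the Text Extraction job will create in COS.
results_reference = DataConnection(
    connection_asset_id=connection_asset_id,
    location=S3Location(bucket=bucket_name, path=results_filename),
)
```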
Since the data connections for the source and results files are ready, we can proceed to the text extraction job run step. To initialize the Text Extraction manager, we use the TextExtractions class.
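A minimal sketch of the initialization, reusing the API client created earlier:

```python
from ibm_watsonx_ai.foundation_models.extractions import TextExtractions

# Initialize the Text Extraction manager within the project scope.
extraction = TextExtractions(api_client=client, project_id=project_id)
```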
When running a job, the steps for the text extraction pipeline can be specified. For more details about the available steps, see the documentation. The list of steps available in the SDK can be found below.
To view sample parameter values for the text extraction steps, run get_example_values().
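A sketch of that call, assuming the step metanames are exposed as TextExtractionsMetaNames in ibm_watsonx_ai.metanames:

```python
from ibm_watsonx_ai.metanames import TextExtractionsMetaNames

# Sample parameter values for the supported text extraction steps.
TextExtractionsMetaNames().get_example_values()
```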
In our example, we are going to use the following steps:
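A sketch of the steps configuration; the keys and options below mirror typical OCR and table-processing settings and should be cross-checked against the get_example_values() output:

```python
# Steps for the text extraction pipeline (keys follow the example values above).
steps = {
    "ocr": {
        "languages_list": ["en"],  # run OCR assuming English text
    },
    "tables_processing": {
        "enabled": True,           # also extract tables from the document
    },
}
```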
Now, we can run the Text Extraction job.
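A minimal sketch of submitting the job with the references and steps defined above; keeping the job id via get_id for the follow-up calls is an assumption based on the manager's helper methods:

```python
# Submit the text extraction job.
extraction_details = extraction.run_job(
    document_reference=document_reference,
    results_reference=results_reference,
    steps=steps,
)

# Keep the job identifier for the follow-up calls.
extraction_id = extraction.get_id(extraction_details)
```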
We can list text extraction jobs using the manager's list method.
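For example (assuming the manager exposes list_jobs, as in recent SDK releases):

```python
# List the submitted text extraction jobs and their states.
extraction.list_jobs()
```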
Moreover, to get the details of a particular text extraction request, run the following:
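A minimal sketch, using the job id stored above:

```python
# Fetch the details (including the current status) of the submitted job.
extraction.get_job_details(extraction_id=extraction_id)
```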
Furthermore, to delete a text extraction run job, use the delete_job() method.
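For example (only do this once you no longer need the job or its results):

```python
# Cancel/delete the text extraction run job.
extraction.delete_job(extraction_id=extraction_id)
```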
Once the text extraction job is completed, we can download the results file and process it further.
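A sketch of the download step; that get_results_reference returns a DataConnection which can be downloaded to a local file is an assumption based on the manager's helpers, and the preview at the end is optional:

```python
# Obtain a DataConnection pointing at the results file and download it locally.
results_conn = extraction.get_results_reference(extraction_id=extraction_id)
results_conn.download(filename="text_extraction_results.md")

# Optionally preview the beginning of the extracted Markdown.
with open("text_extraction_results.md", "r", encoding="utf-8") as f:
    print(f.read()[:1000])
```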
Summary and next steps
You successfully completed this notebook!
You learned how to use the TextExtractions manager to run text extraction requests, check the status of a submitted job, and download the results file.
Check out our Online Documentation for more samples, tutorials, documentation, how-tos, and blog posts.
Authors:
Mateusz Świtała, Software Engineer at Watson Machine Learning.
Copyright © 2024-2025 IBM. This notebook and its source code are released under the terms of the MIT License.