GitHub Repository: ibm/watson-machine-learning-samples
Path: blob/master/cloud/notebooks/rest_api/deployments/foundation_models/Use watsonx, and Google `flan-ul2` to summarize Cybersecurity documents.ipynb
Kernel: Python 3 (ipykernel)


Use watsonx, and Google flan-ul2 to summarize Cybersecurity documents

Disclaimers

  • Use only Projects and Spaces that are available in watsonx context.

Notebook content

This notebook contains the steps and code to demonstrate support of text summarization in watsonx. It introduces commands for data retrieval and model testing.

Some familiarity with Python is helpful. This notebook uses Python 3.11.

Learning goal

The goal of this notebook is to demonstrate how to use the flan-ul2 model to summarize the SPEC5G cybersecurity dataset (5G cellular network protocol specifications).

Use case & dataset

5G is the 5th generation cellular network protocol. It is the state-of-the-art global wireless standard that enables an advanced kind of network designed to connect virtually everyone and everything with increased speed and reduced latency. Therefore, its development, analysis, and security are critical. However, the current approaches to 5G protocol development and security analysis, e.g., property extraction, protocol summarization, and semantic analysis of the protocol specifications and implementations, are completely manual. To reduce this manual effort, foundation models are used to summarize the paragraphs automatically. The dataset used in this notebook has two columns: Paragraph (the original text) and Simplification (the reference summary).
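For illustration, a single row has the following shape (the values below are made up for illustration and are not taken from the dataset):

from pandas import DataFrame

# Illustrative row only -- the real dataset is downloaded later in the notebook.
sample = DataFrame({
    "Paragraph": ["The UE sends an Attach Request to the MME, which authenticates the subscriber before establishing the default bearer ..."],
    "Simplification": ["The UE attaches to the network after the MME authenticates it."]
})
sample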

Contents

This notebook contains the following parts:

  • Set up the environment
  • Data loading
  • Foundation Models on watsonx
  • Generate document summary
  • Score the model
  • Summary and next steps

Set up the environment

Before you use the sample code in this notebook, you must perform the following setup tasks:

Install and import the datasets and dependencies

!pip install datasets | tail -n 1
!pip install requests | tail -n 1
!pip install wget | tail -n 1
!pip install ibm-cloud-sdk-core | tail -n 1
!pip install "scikit-learn==1.3.2" | tail -n 1
!pip install rouge | tail -n 1
import os, getpass, wget
import requests
from ibm_cloud_sdk_core import IAMTokenManager
from pandas import read_csv, DataFrame
from rouge import Rouge
from sklearn.model_selection import train_test_split

Inferencing class

This cell defines a class that makes a REST API call to the watsonx Foundation Model inferencing API, which we will use to generate output from the provided input. The class takes the access token created in the watsonx API connection step below and uses it to make a REST API call with the input, model id, and model parameters. The response from the API call is returned as the cell output.

Action: Provide the watsonx.ai Runtime URL to work with watsonx.ai.

endpoint_url = input("Please enter your watsonx.ai Runtime endpoint url (hit enter): ")
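For example, a value such as the following (an assumption; use the endpoint that matches your region) would be entered at the prompt:

# Example value only (assumed Dallas region); use the endpoint for your watsonx.ai region:
# endpoint_url = "https://us-south.ml.cloud.ibm.com"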

Define a Prompt class for prompt generation.

class Prompt:
    def __init__(self, access_token, project_id):
        self.access_token = access_token
        self.project_id = project_id

    def generate(self, input, model_id, parameters):
        wml_url = f"{endpoint_url}/ml/v1/text/generation?version=2024-03-19"
        headers = {
            "Authorization": "Bearer " + self.access_token,
            "Content-Type": "application/json",
            "Accept": "application/json"
        }
        data = {
            "model_id": model_id,
            "input": input,
            "parameters": parameters,
            "project_id": self.project_id
        }
        response = requests.post(wml_url, json=data, headers=headers)
        if response.status_code == 200:
            return response.json()["results"][0]
        else:
            return response.text

watsonx API connection

This cell defines the credentials required to work with the watsonx API for Foundation Model inferencing.

Action: Provide the IBM Cloud user API key. For details, see documentation.

access_token = IAMTokenManager(
    apikey = getpass.getpass("Please enter your watsonx.ai api key (hit enter): "),
    url = "https://iam.cloud.ibm.com/identity/token"
).get_token()
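For reference, IAMTokenManager performs the standard IBM Cloud IAM token exchange; a raw REST equivalent is sketched below (not needed when using the SDK):

# Raw REST equivalent of the token exchange above (sketch; IAMTokenManager does this for you):
# response = requests.post(
#     "https://iam.cloud.ibm.com/identity/token",
#     data={"grant_type": "urn:ibm:params:oauth:grant-type:apikey", "apikey": "<your api key>"},
#     headers={"Content-Type": "application/x-www-form-urlencoded", "Accept": "application/json"}
# )
# access_token = response.json()["access_token"]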

Defining the project id

The API requires the project id, which provides the context for the call. We will obtain the id from the project in which this notebook runs:

try:
    project_id = os.environ["PROJECT_ID"]
except KeyError:
    project_id = input("Please enter your project_id (hit enter): ")

Data loading

Download the cybersecurity: SPEC5G Cellular Network Protocol dataset.

filename = 'Data_Cyber.csv'
url = 'https://raw.githubusercontent.com/IBM/watson-machine-learning-samples/master/cloud/data/spec5g/spec5g.csv'
if not os.path.isfile(filename):
    wget.download(url, out=filename)

Read the data.

data = read_csv("Data_Cyber.csv", index_col=0)
data.head()

Inspect data sample.

Check the sample text and summary length.

The original text length statistics.

data.Paragraph.apply(lambda x: len(x.split())).describe()
count    713.000000
mean     101.632539
std       34.300754
min       35.000000
25%       78.000000
50%       98.000000
75%      121.000000
max      266.000000
Name: Paragraph, dtype: float64

The reference summary length statistics.

data.Simplification.apply(lambda x: len(x.split())).describe()
count    713.000000
mean      43.927069
std       24.889311
min        8.000000
25%       28.000000
50%       38.000000
75%       53.000000
max      249.000000
Name: Simplification, dtype: float64
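These statistics also give a quick sanity check on the generation limits used later: the longest reference summary is under 250 words, so the max_new_tokens value of 300 set further below leaves headroom (a rough word-to-token comparison, not an exact conversion):

# Rough check (in words, not tokens) against the max_new_tokens=300 budget used later.
print("Longest paragraph (words):        ", data.Paragraph.apply(lambda x: len(x.split())).max())
print("Longest reference summary (words):", data.Simplification.apply(lambda x: len(x.split())).max())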

Split data to train and test

data_train, data_test, y_train, y_test = train_test_split(
    data['Paragraph'],
    data['Simplification'],
    test_size=0.3,
    random_state=33,
)
data_train = DataFrame(data_train)
data_test = DataFrame(data_test)
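A quick check of the resulting split sizes (with 713 rows and test_size=0.3, this should give roughly 499 training and 214 test paragraphs):

# Verify the 70/30 split.
print("train:", len(data_train), "test:", len(data_test))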

Foundation Models on watsonx

List available models

models_json = requests.get(
    endpoint_url + '/ml/v1/foundation_model_specs?version=2024-03-19&limit=50',
    headers={
        'Authorization': f'Bearer {access_token}',
        'Content-Type': 'application/json',
        'Accept': 'application/json'
    }
).json()
models_ids = [m['model_id'] for m in models_json['resources']]
models_ids
['bigcode/starcoder', 'bigscience/mt0-xxl', 'codellama/codellama-34b-instruct-hf', 'eleutherai/gpt-neox-20b', 'google/flan-t5-xl', 'google/flan-t5-xxl', 'google/flan-ul2', 'ibm-mistralai/mixtral-8x7b-instruct-v01-q', 'ibm/granite-13b-chat-v1', 'ibm/granite-13b-chat-v2', 'ibm/granite-13b-instruct-v1', 'ibm/granite-13b-instruct-v2', 'ibm/granite-20b-multilingual', 'ibm/mpt-7b-instruct2', 'meta-llama/llama-2-13b-chat', 'meta-llama/llama-2-70b-chat']
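If you want more detail than the id alone, you can inspect the full specification entry returned for the model used in this notebook (the exact fields depend on the API version):

# Print the full spec entry for google/flan-ul2; the available fields vary by API version.
flan_ul2_spec = next(m for m in models_json['resources'] if m['model_id'] == "google/flan-ul2")
flan_ul2_spec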

You need to specify the model_id that will be used for inferencing:

model_id = "google/flan-ul2"

Generate document summary

Define instructions for the model.

instruction = """ Extract the key outline of the "Original text" similar to the Simplification according to the examples."""

Prepare model inputs for the zero-shot example, using zero_shot_inputs below.

zero_shot_inputs = [{"input": text} for text in data_test['Paragraph']]
for i in range(2):
    print(f"The sentence example {i+1} is:\n {zero_shot_inputs[i]['input']}\n")
The sentence example 1 is:
 UE A can then prompt the user to initiate a voice call to UE B 6a(Successful case). The RAB Assignment Request message is sent from MSC B to the RNC B, requesting the establishment of a RAB for a Video Call. The radio bearer is established between the RNC B and UE B. RNC B responds to MSC B with a RAB Assignment Response message. Following the allocation of the radio resources, UE B sends an Alerting message to 6b (Failure case). The video call fails because of lack of radio resources on the B side.

The sentence example 2 is:
 As a network option, the operator may refuse to provide the requested information. When gsmSCF processing is complete the call control is returned to the GMSC server . The GMSC server interrogates the HLR in order to determine his current location. The HLR shall create an HLR interrogation record. The GMSC server routes the call to the VPLMN in which subscriber "B" is currently located. The GMSC server shall create an outgoing gateway record for accounting purposes. The GMSC server shall also create a roaming record.

Prepare model inputs for the few-shot examples, using few_shot_inputs below.

data_train_and_labels = data_train.copy()
data_train_and_labels['Simplification'] = y_train
train_samples = data_train_and_labels.sample(2)
examples = []
for s in range(len(train_samples)):
    examples.append(f"\tsentence:\t{train_samples['Paragraph'].iloc[s]}\n\tSimplification: {train_samples['Simplification'].iloc[s]}\n")
few_shot_examples = [''.join(examples)]
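To see exactly what the model will receive, you can preview the assembled few-shot prefix (instruction plus the two sampled examples) that is prepended to each test paragraph in the generation step:

# Preview the few-shot prompt prefix used during generation below.
print(instruction + few_shot_examples[0])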
few_shot_inputs_ = [{"input": text} for text in data_test['Paragraph'].values]
for i in range(2):
    print(f"The sentence example {i+1} is:\n {few_shot_inputs_[i]['input']}\n")
The sentence example 1 is:
 UE A can then prompt the user to initiate a voice call to UE B 6a(Successful case). The RAB Assignment Request message is sent from MSC B to the RNC B, requesting the establishment of a RAB for a Video Call. The radio bearer is established between the RNC B and UE B. RNC B responds to MSC B with a RAB Assignment Response message. Following the allocation of the radio resources, UE B sends an Alerting message to 6b (Failure case). The video call fails because of lack of radio resources on the B side.

The sentence example 2 is:
 As a network option, the operator may refuse to provide the requested information. When gsmSCF processing is complete the call control is returned to the GMSC server . The GMSC server interrogates the HLR in order to determine his current location. The HLR shall create an HLR interrogation record. The GMSC server routes the call to the VPLMN in which subscriber "B" is currently located. The GMSC server shall create an outgoing gateway record for accounting purposes. The GMSC server shall also create a roaming record.

Defining the model parameters

We need to provide a set of model parameters that will influence the result. The parameters that are available depend on the decoding strategy we use with the models.

There are two decoding strategies: greedy and sampling.

We usually use greedy for complaint classification, extraction and Q&A.

We usually use sampling for content generation and summarization.

parameters = {
    "decoding_method": "greedy",
    "random_seed": 33,
    "repetition_penalty": 1,
    "min_new_tokens": 50,
    "max_new_tokens": 300
}
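For comparison, a sampling configuration might look like the following (illustrative values only; check the documentation for the parameters each model supports):

# Illustrative sampling configuration (not used in this notebook).
sampling_parameters = {
    "decoding_method": "sample",
    "temperature": 0.7,
    "top_k": 50,
    "top_p": 1,
    "random_seed": 33,
    "min_new_tokens": 50,
    "max_new_tokens": 300
}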

Generate summaries of the cybersecurity SPEC5G Cellular Network Protocol paragraphs using the flan-ul2 model.

Note: You might need to adjust model parameters for different models or tasks; to do so, please refer to the documentation.

Initialize the Prompt class.

Hint: Your authentication token might expire; if so, please regenerate the access_token and reinitialize the Prompt class.

prompt = Prompt(access_token, project_id)

Get the document summaries.

results = []
for inp in few_shot_inputs_[:2]:
    results.append(prompt.generate(" ".join([instruction + few_shot_examples[0], inp['input']]), model_id, parameters))
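The zero_shot_inputs prepared earlier can be used in the same way by passing the instruction without the sampled examples (a sketch, not executed in this notebook):

# Zero-shot variant: drop the few-shot examples and keep only the instruction.
# zero_shot_results = [
#     prompt.generate(" ".join([instruction, inp['input']]), model_id, parameters)
#     for inp in zero_shot_inputs[:2]
# ]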

Explore model output.

results
[{'generated_text': 'Simplification: UE A can then prompt the user to initiate a voice call to UE B 6a(Successful case). The RAB Assignment Request message is sent from MSC B to the RNC B, requesting the establishment of a RAB for a Video Call. The radio bearer is established between the RNC B and UE B. RNC B responds to MSC B with a RAB Assignment Response message. Following the allocation of the radio resources, UE B sends an Alerting message to', 'generated_token_count': 118, 'input_token_count': 471, 'stop_reason': 'eos_token'}, {'generated_text': 'Simplification: The GMSC server interrogates the HLR in order to determine his current location. The GMSC server routes the call to the VPLMN in which subscriber "B" is currently located. The GMSC server shall create an outgoing gateway record for accounting purposes. The GMSC server shall also create a roaming record.', 'generated_token_count': 78, 'input_token_count': 459, 'stop_reason': 'eos_token'}]

Score the model

Cosine Similarity

Note: To run the Score section for model scoring on the cybersecurity dataset, please transform the following markdown cells to code cells. Keep in mind that scoring the model on the whole test set can take a significant amount of time.

In this sample notebook, the spaCy implementation of cosine similarity with the en_core_web_md corpus is used for the cosine similarity calculation.

Tip: You might consider using a bigger language corpus, different word embeddings, and different distance metrics for scoring the output summary against the reference summary.

Get the true labels.

y_true = y_test.values[:2]
print(y_true)

Get the prediction labels.

y_pred = [result['generated_text'] for result in results]
y_pred

Use spaCy and the en_core_web_md corpus to calculate the cosine similarity of the generated and reference summaries.

!pip install -U spacy | tail -1
!python -m spacy download en_core_web_md | tail -1
import spacy
import en_core_web_md

nlp = en_core_web_md.load()
for truth, pred in zip(y_true, y_pred):
    t = nlp(truth)
    p = nlp(pred)
    print("Reference summary similarity with the predicted summary", t.similarity(p))

Rouge Metric

Note: Rouge (Recall-Oriented Understudy for Gisting Evaluation) is a set of evaluation measures used in natural language processing (NLP), specifically in text summarization tasks. Please refer to the following link for more information: torchmetrics

rouge = Rouge()
scores = []
for i in range(len(y_true)):
    # get_scores expects the hypothesis (generated summary) first, then the reference.
    score = rouge.get_scores(y_pred[i], y_true[i])
    scores.append(score)
scores
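Each entry in scores contains ROUGE-1, ROUGE-2, and ROUGE-L with recall, precision, and F1; as an optional post-processing step (not part of the original notebook), the ROUGE-L F1 can be averaged over the scored pairs:

# Average ROUGE-L F1 across the scored summaries (optional post-processing).
rouge_l_f1 = [s[0]['rouge-l']['f'] for s in scores]
print("Mean ROUGE-L F1:", sum(rouge_l_f1) / len(rouge_l_f1))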

Summary and next steps

You successfully completed this notebook!

You learned how to generate document summaries with Google's flan-ul2 on watsonx.

Check out our Online Documentation for more samples, tutorials, documentation, how-tos, and blog posts.

Author: Kahila Mokhtari

Copyright © 2023-2025 IBM. This notebook and its source code are released under the terms of the MIT License.