GitHub Repository: suyashi29/python-su
Path: blob/master/Natural Language Processing using Python/Transformers for Natural Language Processing Task.ipynb
Kernel: Python 3 (ipykernel)

Transformers for Natural Language Processing Tasks

Pre-trained NLP Models in Transformers

  • Pre-trained models in Hugging Face's transformers library are deep learning models already trained on large text corpora. They can be fine-tuned for a wide range of NLP tasks instead of being trained from scratch, which saves computational resources and time.

Key Features of Pre-trained Transformer Models

  • Trained on Large Datasets – Pre-trained models are trained on massive text corpora like Wikipedia, Common Crawl, and BooksCorpus.

  • Fine-tuning Capability – They can be further trained (fine-tuned) for specific tasks like sentiment analysis, named entity recognition (NER), and text generation.

  • Contextual Understanding – Unlike traditional representations such as TF-IDF or Word2Vec, transformers produce context-dependent representations, so the same word can receive different vectors in different sentences (see the sketch below).
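To make the contextual-understanding point concrete, here is a minimal sketch (not part of the original notebook) that compares BERT's embedding of the word "bank" in two different sentences; a static embedding such as Word2Vec would assign the same vector in both cases.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embedding_for(sentence, word):
    # Return the hidden state of `word`'s token in `sentence`.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

river_bank = embedding_for("he sat on the bank of the river.", "bank")
money_bank = embedding_for("she deposited cash at the bank.", "bank")
# Cosine similarity is well below 1.0: the two "bank"s get different vectors.
print(torch.cosine_similarity(river_bank, money_bank, dim=0))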

Example Usage in transformers Library

Using a pre-trained model for sentiment analysis via the pipeline API (when no model is specified, the pipeline falls back to a DistilBERT checkpoint fine-tuned on SST-2, as the notice below shows):

from transformers import pipeline

# 1. Load a sentiment-analysis pipeline
classifier = pipeline("sentiment-analysis")

# Example texts
texts = [
    "I love using transformers for NLP!",
    "This is a terrible experience.",
    "The product is okay, but could be improved."
]

# Perform sentiment analysis
results = classifier(texts)

# Display results
for text, result in zip(texts, results):
    print(f"Text: {text}\nSentiment: {result['label']}, Confidence: {result['score']:.4f}\n")
No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english). Using a pipeline without specifying a model name and revision in production is not recommended.
Text: I love using transformers for NLP!
Sentiment: POSITIVE, Confidence: 0.9926

Text: This is a terrible experience.
Sentiment: NEGATIVE, Confidence: 0.9993

Text: The product is okay, but could be improved.
Sentiment: POSITIVE, Confidence: 0.9015
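The notice above appears because the pipeline fell back to a default checkpoint. A minimal sketch of the recommended fix, pinning the same model and revision the notice names so results are reproducible:

from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    revision="af0f99b",  # model and revision taken from the notice above
)
print(classifier("I love using transformers for NLP!"))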
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM
import torch

# 2. Named Entity Recognition (NER)
ner_pipeline = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english")
ner_result = ner_pipeline("Hugging Face is based in New York and was founded by Julien Chaumond.")
print("Named Entity Recognition:", ner_result)
# 4. Text Generation
generator = pipeline("text-generation", model="gpt2")
generated_text = generator("Once upon a time in a faraway land,", max_length=70)
print("Text Generation:", generated_text)
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Text Generation: [{'generated_text': 'Once upon a time in a faraway land, they began to take the lives of others. Many went to seek to save others.\n\nAs the day of judgment neared, a flood washed their homeland, and the prophets of the Lord came down upon the place.\n\nJesus was the first disciple of John, who had been taken by'}]
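A hedged sketch of common generation parameters the pipeline forwards to model.generate (the values here are illustrative); passing pad_token_id explicitly also silences the "Setting pad_token_id" notice above:

generated = generator(
    "Once upon a time in a faraway land,",
    max_length=70,
    do_sample=True,      # sample instead of greedy decoding
    temperature=0.8,     # lower values make sampling more conservative
    top_k=50,            # sample only from the 50 most likely next tokens
    pad_token_id=50256,  # GPT-2's eos_token_id, as the notice suggests
)
print(generated[0]["generated_text"])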
# 3. Question Answering
qa_pipeline = pipeline("question-answering")
qa_result = qa_pipeline(
    question="Where is Hugging Face based?",
    context="Hugging Face is a company based in New York that focuses on AI and NLP.",
)
print("Question Answering:", qa_result)

# 4. Text Generation
generator = pipeline("text-generation", model="gpt2")
generated_text = generator("Once upon a time in a faraway land,", max_length=50)
print("Text Generation:", generated_text)

# 5. Text Summarization
summarizer = pipeline("summarization")
summary = summarizer(
    """Hugging Face is a company that has been developing state-of-the-art NLP models. They provide easy-to-use APIs for various NLP tasks and enable researchers and developers to build powerful applications.""",
    max_length=30, min_length=10, do_sample=False,
)
print("Summarization:", summary)

# 6. Translation
translator = pipeline("translation_en_to_fr")
translation = translator("Hugging Face is an AI company specializing in NLP.")
print("Translation:", translation)

# 7. Zero-Shot Classification
zero_shot_classifier = pipeline("zero-shot-classification")
zero_shot_result = zero_shot_classifier(
    "This is a great opportunity for AI research.",
    candidate_labels=["business", "science", "technology"],
)
print("Zero-Shot Classification:", zero_shot_result)

# 8. Tokenization using AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer("Hugging Face is revolutionizing AI research.", return_tensors="pt")
print("Tokenization:", tokens)
No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english). Using a pipeline without specifying a model name and revision in production is not recommended.
Sentiment Analysis: [{'label': 'POSITIVE', 'score': 0.9982925057411194}]
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']. This is expected when initializing the model from a checkpoint trained on another task or architecture.
No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad). Using a pipeline without specifying a model name and revision in production is not recommended.
Named Entity Recognition: [{'entity': 'I-ORG', 'score': 0.9906591, 'index': 1, 'word': 'Hu', 'start': 0, 'end': 2}, {'entity': 'I-ORG', 'score': 0.8921493, 'index': 2, 'word': '##gging', 'start': 2, 'end': 7}, {'entity': 'I-ORG', 'score': 0.9757078, 'index': 3, 'word': 'Face', 'start': 8, 'end': 12}, {'entity': 'I-LOC', 'score': 0.9990557, 'index': 7, 'word': 'New', 'start': 25, 'end': 28}, {'entity': 'I-LOC', 'score': 0.9986376, 'index': 8, 'word': 'York', 'start': 29, 'end': 33}, {'entity': 'I-PER', 'score': 0.9989183, 'index': 13, 'word': 'Julien', 'start': 53, 'end': 59}, {'entity': 'I-PER', 'score': 0.9990766, 'index': 14, 'word': 'Cha', 'start': 60, 'end': 63}, {'entity': 'I-PER', 'score': 0.95468247, 'index': 15, 'word': '##um', 'start': 63, 'end': 65}, {'entity': 'I-PER', 'score': 0.9530121, 'index': 16, 'word': '##ond', 'start': 65, 'end': 68}]
Question Answering: {'score': 0.9974803924560547, 'start': 35, 'end': 43, 'answer': 'New York'}
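The raw NER output above returns WordPiece fragments ('Hu', '##gging', 'Face'). A sketch using the pipeline's aggregation_strategy option to merge fragments into whole entity spans (the commented output shape is approximate):

ner_grouped = pipeline(
    "ner",
    model="dbmdz/bert-large-cased-finetuned-conll03-english",
    aggregation_strategy="simple",  # merge subword pieces into entity spans
)
print(ner_grouped("Hugging Face is based in New York and was founded by Julien Chaumond."))
# e.g. [{'entity_group': 'ORG', 'word': 'Hugging Face', ...}, {'entity_group': 'LOC', 'word': 'New York', ...}, ...]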
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6). Using a pipeline without specifying a model name and revision in production is not recommended.
Text Generation: [{'generated_text': 'Once upon a time in a faraway land, the hero could see themselves in a room with six statues of goddesses who had given birth to their respective children. The heroes themselves, in a few hours, would make peace. Their hero would call'}]
No model was supplied, defaulted to t5-base and revision 686f1db (https://huggingface.co/t5-base). Using a pipeline without specifying a model name and revision in production is not recommended.
Summarization: [{'summary_text': ' Hugging Face is a company that has been developing state-of-the-art NLP models . They provide easy-to-'}]
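The summary above stops mid-phrase ("easy-to-") because max_length=30 is a token budget, not a word count. A sketch with a larger budget (the exact lengths are illustrative):

text = ("Hugging Face is a company that has been developing state-of-the-art NLP models. "
        "They provide easy-to-use APIs for various NLP tasks and enable researchers "
        "and developers to build powerful applications.")
summary = summarizer(text, max_length=60, min_length=20, do_sample=False)
print("Summarization:", summary)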
No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli). Using a pipeline without specifying a model name and revision in production is not recommended.
Translation: [{'translation_text': "Hugging Face est une entreprise d'IA spécialisée dans la LNP."}]
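The translation above comes from the t5-base default (see the earlier notice) and renders "NLP" as "LNP". A sketch pinning a dedicated English-to-French model instead; Helsinki-NLP/opus-mt-en-fr is a MarianMT checkpoint on the Hub:

translator_fr = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
print("Translation:", translator_fr("Hugging Face is an AI company specializing in NLP."))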
Zero-Shot Classification: {'sequence': 'This is a great opportunity for AI research.', 'labels': ['technology', 'science', 'business'], 'scores': [0.9234941005706787, 0.07130993902683258, 0.005195995327085257]}
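By default the zero-shot scores above are softmaxed across the candidate labels, so they sum to 1. In recent transformers versions you can pass multi_label=True to score each label independently, which is useful when several labels can apply at once:

zero_shot_multi = zero_shot_classifier(
    "This is a great opportunity for AI research.",
    candidate_labels=["business", "science", "technology"],
    multi_label=True,  # score each label on its own sigmoid scale
)
print("Zero-Shot (multi-label):", zero_shot_multi)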
Tokenization: {'input_ids': tensor([[ 101, 17662, 2227, 2003, 4329, 6026, 9932, 2470, 1012, 102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
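To read the tensor above, a short sketch that maps the input_ids back to tokens; note the [CLS] and [SEP] special tokens BERT adds, and the WordPiece split of "revolutionizing":

ids = tokens["input_ids"][0].tolist()
print(tokenizer.convert_ids_to_tokens(ids))
# roughly: ['[CLS]', 'hugging', 'face', 'is', 'revolution', '##izing', 'ai', 'research', '.', '[SEP]']
print(tokenizer.decode(ids))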