GitHub Repository: suyashi29/python-su
Path: blob/master/Generative NLP Models using Python/6 Hugging face libraries.ipynb
³⁰⁷⁴ views

Kernel: Python 3 (ipykernel)

pip install --upgrade numpy
pip install transformer
conda create -n huggingface_env python=3.10 -y
conda activate huggingface_env
pip install huggingface_hub[hf_xet]
pip install numpy==1.26.2 tensorflow==2.18.0 transformers
pip install tensorflow-intel==2.18.0 numpy==1.26.2
pip install gensim

In [6]:

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

import tensorflow as tf
tf.get_logger().setLevel('ERROR')

Hugging Face Transformers for NLP Tasks

What is Hugging Face?

Hugging Face provides:

transformers: State-of-the-art pre-trained models for NLP (and beyond).
datasets: Ready-to-use NLP datasets.
tokenizers: Fast and customizable tokenization.

Installation

pip install transformers pip install torch # or tensorflow, depending on backend

Common NLP Tasks with Hugging Face

Task	Description	Model Example
Text Classification	Classify text into categories	BERT, DistilBERT
Named Entity Recognition (NER)	Identify entities in text	BERT, RoBERTa
Question Answering	Extract answer from a given context	BERT, ALBERT
Summarization	Summarize long text into key points	BART, T5
Translation	Translate text between languages	MarianMT, mBART
Text Generation	Autocomplete or continue text generation	GPT-2, GPT-3, LLaMA
Sentiment Analysis	Detect sentiment (positive/negative/etc.)	DistilBERT, BERT

The pipeline() function is the simplest way to use Hugging Face models

In [1]:

from transformers import pipeline

# Explicitly specify the model name
model_name = "distilbert-base-uncased-finetuned-sst-2-english"

# Load the sentiment analysis pipeline with the specified model
sentiment_pipeline = pipeline("sentiment-analysis", model=model_name, tokenizer=model_name)

# Sentiment analysis
classifier = pipeline("sentiment-analysis")
print(classifier("Suyashi hates coding"))

Out[1]:

WARNING:tensorflow:From C:\Users\Suyashi144893\AppData\Local\anaconda3\Lib\site-packages\tf_keras\src\losses.py:2976: The name tf.losses.sparse_softmax_cross_entropy is deprecated. Please use tf.compat.v1.losses.sparse_softmax_cross_entropy instead.

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
C:\Users\Suyashi144893\AppData\Local\anaconda3\Lib\site-packages\huggingface_hub\file_download.py:943: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(

[{'label': 'NEGATIVE', 'score': 0.9959076642990112}]

In [2]:

# Named Entity Recognition
ner = pipeline("ner", model="dslim/bert-base-NER")
ner = pipeline("ner", grouped_entities=True)
print(ner("My name is Ashi and I live in India."))

Out[2]:

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
C:\Users\Suyashi144893\AppData\Local\anaconda3\Lib\site-packages\transformers\pipelines\token_classification.py:168: UserWarning: `grouped_entities` is deprecated and will be removed in version v5.0.0, defaulted to `aggregation_strategy="AggregationStrategy.SIMPLE"` instead.
  warnings.warn(

[{'entity_group': 'PER', 'score': 0.9978213, 'word': 'Ashi', 'start': 11, 'end': 15}, {'entity_group': 'LOC', 'score': 0.9997433, 'word': 'India', 'start': 30, 'end': 35}]

In [3]:

# Question Answering
qa = pipeline(
    "question-answering",
    model="distilbert/distilbert-base-cased-distilled-squad",
    revision="564e9b5"  # Optional: use this to lock the exact version
)
qa = pipeline("question-answering")
result = qa({
    'question': 'Where do I live?',
    'context': 'My name is Ashi and I live in India.'
})
print(result)

Out[3]:

config.json:   0%|          | 0.00/473 [00:00<?, ?B/s]

C:\Users\Suyashi144893\AppData\Local\anaconda3\Lib\site-packages\huggingface_hub\file_download.py:143: UserWarning: `huggingface_hub` cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them in C:\Users\Suyashi144893\.cache\huggingface\hub\models--distilbert--distilbert-base-cased-distilled-squad. Caching files will still work but in a degraded version that might require more space on your disk. This warning can be disabled by setting the `HF_HUB_DISABLE_SYMLINKS_WARNING` environment variable. For more details, see https://huggingface.co/docs/huggingface_hub/how-to-cache#limitations.
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
  warnings.warn(message)

model.safetensors:   0%|          | 0.00/261M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.

{'score': 0.9746261239051819, 'start': 30, 'end': 35, 'answer': 'India'}

# Summarization
summarizer = pipeline("summarization")
text = """Life is a journey, a path unknown,
A road we walk, yet not alone.
With dreams as maps and hope as light,
We trek through day and rest by night.

Some steps are smooth, some steep and rough,
At times the way feels long and tough.
But every turn and twist we face,
Reveals new strength, unveils new grace.

We meet some souls who walk awhile,
They teach us love, they make us smile.
And others pass like fleeting air,
Yet leave a mark that lingers there.

So walk with courage, heart held high,
Beneath the storm or open sky.
For life’s a journey — not the end, but why.."""
print(summarizer(text))

In [4]:


from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "t5-base"

translator = pipeline(
    "translation_en_to_fr",
    model=AutoModelForSeq2SeqLM.from_pretrained(model_name),
    tokenizer=AutoTokenizer.from_pretrained(model_name),
    revision="a9723ea",  # Optional, to pin model version
    device=-1  # Force CPU; use device=0 for GPU
)

# Test
text = "Work is workship"
result = translator(text)
print(result[0]['translation_text'])

Out[4]:

Le travail est l'emploi

Under the Hood: Model & Tokenizer Loading

You can manually load the model/tokenizer if you want finer control:

In [5]:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

inputs = tokenizer("I love this product!", return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits
prediction = torch.argmax(logits)
print(prediction)

Out[5]:

tensor(1)

Advantages of Hugging Face Pipelines

Quick setup

Pre-trained models ready for use

Handles tokenization, model inference, and decoding internally

Easy to switch models by changing model name

Where to Find Models

Browse thousands of models at https://huggingface.co/models

Summary

Hugging Face makes using SOTA NLP models easy with pipeline().
Supports a wide variety of tasks: classification, NER, QA, summarization, etc.
You can dig deeper by using tokenizers and model classes directly.
Easily switch between models with AutoModel and AutoTokenizer.

In [ ]: