Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
suyashi29
GitHub Repository: suyashi29/python-su
Path: blob/master/Generative NLP Models using Python/6 Hugging face libraries.ipynb
3074 views
Kernel: Python 3 (ipykernel)
pip install --upgrade numpy pip install transformer conda create -n huggingface_env python=3.10 -y conda activate huggingface_env pip install huggingface_hub[hf_xet] pip install numpy==1.26.2 tensorflow==2.18.0 transformers pip install tensorflow-intel==2.18.0 numpy==1.26.2 pip install gensim
import os os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' import tensorflow as tf tf.get_logger().setLevel('ERROR')

Hugging Face Transformers for NLP Tasks


What is Hugging Face?

Hugging Face provides:

  • transformers: State-of-the-art pre-trained models for NLP (and beyond).

  • datasets: Ready-to-use NLP datasets.

  • tokenizers: Fast and customizable tokenization.


Installation

pip install transformers pip install torch # or tensorflow, depending on backend

Common NLP Tasks with Hugging Face

TaskDescriptionModel Example
Text ClassificationClassify text into categoriesBERT, DistilBERT
Named Entity Recognition (NER)Identify entities in textBERT, RoBERTa
Question AnsweringExtract answer from a given contextBERT, ALBERT
SummarizationSummarize long text into key pointsBART, T5
TranslationTranslate text between languagesMarianMT, mBART
Text GenerationAutocomplete or continue text generationGPT-2, GPT-3, LLaMA
Sentiment AnalysisDetect sentiment (positive/negative/etc.)DistilBERT, BERT

The pipeline() function is the simplest way to use Hugging Face models

from transformers import pipeline # Explicitly specify the model name model_name = "distilbert-base-uncased-finetuned-sst-2-english" # Load the sentiment analysis pipeline with the specified model sentiment_pipeline = pipeline("sentiment-analysis", model=model_name, tokenizer=model_name) # Sentiment analysis classifier = pipeline("sentiment-analysis") print(classifier("Suyashi hates coding"))
WARNING:tensorflow:From C:\Users\Suyashi144893\AppData\Local\anaconda3\Lib\site-packages\tf_keras\src\losses.py:2976: The name tf.losses.sparse_softmax_cross_entropy is deprecated. Please use tf.compat.v1.losses.sparse_softmax_cross_entropy instead.
No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english). Using a pipeline without specifying a model name and revision in production is not recommended. C:\Users\Suyashi144893\AppData\Local\anaconda3\Lib\site-packages\huggingface_hub\file_download.py:943: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`. warnings.warn(
[{'label': 'NEGATIVE', 'score': 0.9959076642990112}]
# Named Entity Recognition ner = pipeline("ner", model="dslim/bert-base-NER") ner = pipeline("ner", grouped_entities=True) print(ner("My name is Ashi and I live in India."))
Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight'] - This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). - This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model). No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english). Using a pipeline without specifying a model name and revision in production is not recommended. Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight'] - This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). - This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model). C:\Users\Suyashi144893\AppData\Local\anaconda3\Lib\site-packages\transformers\pipelines\token_classification.py:168: UserWarning: `grouped_entities` is deprecated and will be removed in version v5.0.0, defaulted to `aggregation_strategy="AggregationStrategy.SIMPLE"` instead. warnings.warn(
[{'entity_group': 'PER', 'score': 0.9978213, 'word': 'Ashi', 'start': 11, 'end': 15}, {'entity_group': 'LOC', 'score': 0.9997433, 'word': 'India', 'start': 30, 'end': 35}]
# Question Answering qa = pipeline( "question-answering", model="distilbert/distilbert-base-cased-distilled-squad", revision="564e9b5" # Optional: use this to lock the exact version ) qa = pipeline("question-answering") result = qa({ 'question': 'Where do I live?', 'context': 'My name is Ashi and I live in India.' }) print(result)
config.json: 0%| | 0.00/473 [00:00<?, ?B/s]
C:\Users\Suyashi144893\AppData\Local\anaconda3\Lib\site-packages\huggingface_hub\file_download.py:143: UserWarning: `huggingface_hub` cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them in C:\Users\Suyashi144893\.cache\huggingface\hub\models--distilbert--distilbert-base-cased-distilled-squad. Caching files will still work but in a degraded version that might require more space on your disk. This warning can be disabled by setting the `HF_HUB_DISABLE_SYMLINKS_WARNING` environment variable. For more details, see https://huggingface.co/docs/huggingface_hub/how-to-cache#limitations. To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development warnings.warn(message)
model.safetensors: 0%| | 0.00/261M [00:00<?, ?B/s]
tokenizer_config.json: 0%| | 0.00/49.0 [00:00<?, ?B/s]
vocab.txt: 0.00B [00:00, ?B/s]
tokenizer.json: 0.00B [00:00, ?B/s]
No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad). Using a pipeline without specifying a model name and revision in production is not recommended.
{'score': 0.9746261239051819, 'start': 30, 'end': 35, 'answer': 'India'}
# Summarization summarizer = pipeline("summarization") text = """Life is a journey, a path unknown, A road we walk, yet not alone. With dreams as maps and hope as light, We trek through day and rest by night. Some steps are smooth, some steep and rough, At times the way feels long and tough. But every turn and twist we face, Reveals new strength, unveils new grace. We meet some souls who walk awhile, They teach us love, they make us smile. And others pass like fleeting air, Yet leave a mark that lingers there. So walk with courage, heart held high, Beneath the storm or open sky. For life’s a journey — not the end, but why..""" print(summarizer(text))
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM model_name = "t5-base" translator = pipeline( "translation_en_to_fr", model=AutoModelForSeq2SeqLM.from_pretrained(model_name), tokenizer=AutoTokenizer.from_pretrained(model_name), revision="a9723ea", # Optional, to pin model version device=-1 # Force CPU; use device=0 for GPU ) # Test text = "Work is workship" result = translator(text) print(result[0]['translation_text'])
Le travail est l'emploi

Under the Hood: Model & Tokenizer Loading

You can manually load the model/tokenizer if you want finer control:

from transformers import AutoTokenizer, AutoModelForSequenceClassification import torch tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english") model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english") inputs = tokenizer("I love this product!", return_tensors="pt") outputs = model(**inputs) logits = outputs.logits prediction = torch.argmax(logits) print(prediction)
tensor(1)

Advantages of Hugging Face Pipelines

Quick setup

Pre-trained models ready for use

Handles tokenization, model inference, and decoding internally

Easy to switch models by changing model name

Where to Find Models

Summary

  • Hugging Face makes using SOTA NLP models easy with pipeline().

  • Supports a wide variety of tasks: classification, NER, QA, summarization, etc.

  • You can dig deeper by using tokenizers and model classes directly.

  • Easily switch between models with AutoModel and AutoTokenizer.