GitHub Repository: huggingface/notebooks
Path: blob/main/course/videos/batch_inputs_tf.ipynb

This notebook contains the code samples from the video below, which is part of the Hugging Face course.

#@title
from IPython.display import HTML

HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/ROxrFOEbsQE?rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')

Install the Transformers and Datasets libraries to run this notebook.

! pip install datasets transformers[sentencepiece]
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sentences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this.",
]

tokens = [tokenizer.tokenize(sentence) for sentence in sentences]
ids = [tokenizer.convert_tokens_to_ids(token) for token in tokens]
print(ids[0])
print(ids[1])
[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]
[1045, 5223, 2023, 1012]
import tensorflow as tf

ids = [
    [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012],
    [1045, 5223, 2023, 1012],
]

input_ids = tf.constant(ids)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-5c1e8b893878> in <module>
      4     [1045, 5223, 2023, 1012]]
      5 
----> 6 input_ids = tf.constant(ids)

~/.pyenv/versions/3.7.9/envs/base/lib/python3.7/site-packages/tensorflow/python/framework/constant_op.py in constant(value, dtype, shape, name)
    263   """
    264   return _constant_impl(value, dtype, shape, name, verify_shape=False,
--> 265                         allow_broadcast=True)
    266 
    267 

~/.pyenv/versions/3.7.9/envs/base/lib/python3.7/site-packages/tensorflow/python/framework/constant_op.py in _constant_impl(value, dtype, shape, name, verify_shape, allow_broadcast)
    274     with trace.Trace("tf.constant"):
    275       return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
--> 276   return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
    277 
    278   g = ops.get_default_graph()

~/.pyenv/versions/3.7.9/envs/base/lib/python3.7/site-packages/tensorflow/python/framework/constant_op.py in _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
    299 def _constant_eager_impl(ctx, value, dtype, shape, verify_shape):
    300   """Implementation of eager constant."""
--> 301   t = convert_to_eager_tensor(value, ctx, dtype)
    302   if shape is None:
    303     return t

~/.pyenv/versions/3.7.9/envs/base/lib/python3.7/site-packages/tensorflow/python/framework/constant_op.py in convert_to_eager_tensor(value, ctx, dtype)
     96     dtype = dtypes.as_dtype(dtype).as_datatype_enum
     97   ctx.ensure_initialized()
---> 98   return ops.EagerTensor(value, ctx.device_name, dtype)
     99 
    100 

ValueError: Can't convert non-rectangular Python sequence to Tensor.
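The error comes from tf.constant requiring a rectangular array: the two id lists have 14 and 4 elements respectively, so they cannot be stacked into a single tensor without padding, which is what the next cell does by hand. A quick check of the lengths (a small sketch, not part of the video):

# The two sequences have different lengths, so they cannot form a rectangular tensor as-is
print([len(seq) for seq in ids])  # [14, 4]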
import tensorflow as tf

# The shorter sequence is padded with 0 (the tokenizer's pad token id) to the length of the longer one
ids = [
    [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012],
    [1045, 5223, 2023, 1012, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]

input_ids = tf.constant(ids)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer.pad_token_id
0
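Rather than hard-coding the zeros, the padding value can be taken from the tokenizer. Here is a minimal sketch (not part of the video; raw_ids, max_len and padded are illustrative names) that pads every sequence to the length of the longest one with tokenizer.pad_token_id:

import tensorflow as tf

# Unpadded id lists from the earlier cell, repeated here so the sketch is self-contained
raw_ids = [
    [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012],
    [1045, 5223, 2023, 1012],
]

# Pad each sequence to the length of the longest one using the tokenizer's pad token id
max_len = max(len(seq) for seq in raw_ids)
padded = [seq + [tokenizer.pad_token_id] * (max_len - len(seq)) for seq in raw_ids]
input_ids = tf.constant(padded)
print(input_ids.shape)  # (2, 14)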
from transformers import TFAutoModelForSequenceClassification

# Each sentence on its own, then the two together in a naively padded batch
ids1 = tf.constant(
    [[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]]
)
ids2 = tf.constant([[1045, 5223, 2023, 1012]])
all_ids = tf.constant(
    [[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012],
     [1045, 5223, 2023, 1012, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
)

model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
print(model(ids1).logits)
print(model(ids2).logits)
print(model(all_ids).logits)
All model checkpoint layers were used when initializing TFDistilBertForSequenceClassification.

All the layers of TFDistilBertForSequenceClassification were initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.
tf.Tensor([[-2.7276204  2.8789372]], shape=(1, 2), dtype=float32)
tf.Tensor([[ 3.9497483 -3.1357408]], shape=(1, 2), dtype=float32)
tf.Tensor(
[[-2.7276206  2.878937 ]
 [ 1.5444432 -1.3998369]], shape=(2, 2), dtype=float32)
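Notice that the logits for the second sentence are not the same in the batched run as in the single-sequence run: without an attention mask, the model attends to the padding tokens, which changes its output. Converting the logits to probabilities makes the mismatch easy to see (a small check, not part of the video):

import tensorflow as tf

# The second row of the batched run disagrees with the single-sequence result
print(tf.math.softmax(model(ids2).logits, axis=-1))
print(tf.math.softmax(model(all_ids).logits, axis=-1))

The attention mask built in the next cell tells the model which positions are padding and should be ignored.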
all_ids = tf.constant(
    [[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012],
     [1045, 5223, 2023, 1012, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
)
# 1 marks real tokens, 0 marks padding positions the model should ignore
attention_mask = tf.constant(
    [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
     [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
)
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)

output1 = model(ids1)
output2 = model(ids2)
print(output1.logits)
print(output2.logits)
Some layers from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english and are newly initialized: ['dropout_39']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
tf.Tensor([[-2.7276204  2.8789372]], shape=(1, 2), dtype=float32)
tf.Tensor([[ 3.9497483 -3.1357408]], shape=(1, 2), dtype=float32)
output = model(all_ids, attention_mask=attention_mask)
print(output.logits)
tf.Tensor(
[[-2.7276206  2.878937 ]
 [ 3.9497476 -3.1357408]], shape=(2, 2), dtype=float32)
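With the attention mask, the padded positions are ignored and the logits for the second sentence now match the single-sequence run up to floating-point rounding.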
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sentences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this.",
]

print(tokenizer(sentences, padding=True))
{'input_ids': [[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102], [101, 1045, 5223, 2023, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}
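In practice the tokenizer can build the tensors directly. A short sketch (not in the video), reusing the model loaded above; note that the tokenizer also adds the special [CLS] and [SEP] tokens (ids 101 and 102), so the logits will not be exactly the same as for the hand-built batch:

# return_tensors="tf" gives TensorFlow tensors that can be passed straight to the model
batch = tokenizer(sentences, padding=True, return_tensors="tf")
output = model(batch)
print(output.logits)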