
GitHub Repository: huggingface/notebooks
Path: blob/main/sagemaker/14_train_and_push_to_hub/sagemaker-notebook.ipynb
Kernel: conda_pytorch_p39

HuggingFace Hub meets Amazon SageMaker

Fine-tune a multi-class text classification model with the Trainer and the emotion dataset, and push it to the Hugging Face Hub

Introduction

Welcome to our end-to-end multi-class text-classification example. In this demo, we will use the Hugging Face transformers and datasets libraries together with a custom Amazon sagemaker-sdk extension to fine-tune a pre-trained transformer for multi-class text classification. In particular, the pre-trained model will be fine-tuned on the emotion dataset. To get started, we need to set up the environment with a few prerequisite steps for permissions, configurations, and so on.

[Image: emotion-widget.png]

NOTE: You can run this demo in SageMaker Studio, on your local machine, or in SageMaker Notebook Instances.

Development Environment and Permissions

Installation

Note: we only install the required libraries from Hugging Face and AWS. You also need PyTorch or TensorFlow, if you don't have it installed already.

!pip install "sagemaker>=2.140.0" "transformers==4.26.1" "datasets[s3]==2.10.1" --upgrade
import sagemaker.huggingface

Permissions

If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can find more about it here.

import sagemaker
import boto3

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

Preprocessing

We are using the datasets library to download and preprocess the emotion dataset. After preprocessing, the dataset will be uploaded to our sagemaker_session_bucket to be used within our training job. The emotion dataset consists of 16,000 training examples, 2,000 validation examples, and 2,000 test examples, each labeled with one of six emotions: sadness, joy, love, anger, fear, or surprise.

Tokenization

from datasets import load_dataset
from transformers import AutoTokenizer

# tokenizer used in preprocessing
tokenizer_name = 'distilbert-base-uncased'

# dataset used
dataset_name = 'emotion'

# s3 key prefix for the data
s3_prefix = 'samples/datasets/emotion'
# download tokenizer
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

# tokenizer helper function
def tokenize(batch):
    return tokenizer(batch['text'], padding='max_length', truncation=True)

# load dataset
train_dataset, test_dataset = load_dataset(dataset_name, split=['train', 'test'])

# tokenize dataset
train_dataset = train_dataset.map(tokenize, batched=True)
test_dataset = test_dataset.map(tokenize, batched=True)

# set format for pytorch
train_dataset = train_dataset.rename_column("label", "labels")
train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
test_dataset = test_dataset.rename_column("label", "labels")
test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])

Uploading data to sagemaker_session_bucket

After processing the datasets, we use the FileSystem integration to upload them to S3.

# save train_dataset to s3
training_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/train'
train_dataset.save_to_disk(training_input_path)

# save test_dataset to s3
test_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/test'
test_dataset.save_to_disk(test_input_path)
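For context, inside the training container SageMaker downloads each input channel to a local directory and exposes it via an SM_CHANNEL_* environment variable. A minimal sketch of how train.py might reload the data, assuming channels named train and test as below:

import os
from datasets import load_from_disk

# SageMaker sets SM_CHANNEL_<NAME> for each input channel of the training job
train_dataset = load_from_disk(os.environ['SM_CHANNEL_TRAIN'])
test_dataset = load_from_disk(os.environ['SM_CHANNEL_TEST'])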

Creating an Estimator and starting a training job

List of supported models: https://huggingface.co/models?library=pytorch,transformers&sort=downloads

Setting up push_to_hub for our model

The train.py script implements push_to_hub using the Trainer and TrainingArguments. To push our model to the Hub, we need to set the push_to_hub hyperparameter to True and provide our Hugging Face token. Additionally, we can configure the repository name and saving strategy using the hub_model_id and hub_strategy hyperparameters.

You can find documentation for those parameters here.
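The full train.py lives in the scripts/ directory and isn't reproduced here, but as a rough sketch (argument parsing shown only for the Hub-related hyperparameters; model setup and the rest of the training loop omitted), these options map onto TrainingArguments like this:

import argparse
from transformers import TrainingArguments

# sketch only: SageMaker passes hyperparameters to train.py as CLI arguments
parser = argparse.ArgumentParser()
parser.add_argument('--epochs', type=int, default=1)
# parse the string 'True'/'False' robustly (plain type=bool would treat 'False' as truthy)
parser.add_argument('--push_to_hub', type=lambda v: str(v).lower() == 'true', default=True)
parser.add_argument('--hub_model_id', type=str)
parser.add_argument('--hub_strategy', type=str, default='every_save')
parser.add_argument('--hub_token', type=str)
args, _ = parser.parse_known_args()

training_args = TrainingArguments(
    output_dir='/opt/ml/model',        # SageMaker uploads this directory to S3 after training
    num_train_epochs=args.epochs,
    push_to_hub=args.push_to_hub,      # enable pushing checkpoints to the Hub
    hub_model_id=args.hub_model_id,    # repository name on the Hub
    hub_strategy=args.hub_strategy,    # when to push, e.g. 'every_save'
    hub_token=args.hub_token,          # token with write permission
)

The script then builds the Trainer with these arguments, calls trainer.train(), and finally trainer.push_to_hub() to upload the final model.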

We are going to provide our HF token securely, without exposing it to the public, using notebook_login from the huggingface_hub SDK. But be careful: your token will still be visible inside the logs of the training job. If you run huggingface_estimator.fit(..., wait=True), you will see the token in the logs. A better way of providing your HF_TOKEN to your training jobs would be using AWS Secrets Manager.
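For reference, a minimal sketch of fetching the token from AWS Secrets Manager instead (the secret name hf_token is hypothetical; you would need to store your token there first):

import boto3

# fetch the token from AWS Secrets Manager instead of passing it around in plain text
# ('hf_token' is a hypothetical secret name -- create it in Secrets Manager beforehand)
secrets_client = boto3.client('secretsmanager')
hf_token = secrets_client.get_secret_value(SecretId='hf_token')['SecretString']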

You can also directly find your token at https://hf.co/settings/token.

from huggingface_hub import notebook_login

notebook_login()
from sagemaker.huggingface import HuggingFace
from huggingface_hub import HfFolder
import time

# hyperparameters, which are passed into the training job
hyperparameters = {
    'epochs': 1,                                    # number of training epochs
    'train_batch_size': 32,                         # batch size for training
    'eval_batch_size': 64,                          # batch size for evaluation
    'learning_rate': 3e-5,                          # learning rate used during training
    'model_id': 'distilbert-base-uncased',          # pre-trained model
    'fp16': True,                                   # whether to use 16-bit (mixed) precision training
    'push_to_hub': True,                            # defines if we want to push the model to the hub
    'hub_model_id': 'sagemaker-distilbert-emotion', # the model id of the model to push to the hub
    'hub_strategy': 'every_save',                   # the strategy to use when pushing the model to the hub
    'hub_token': HfFolder.get_token()               # Hugging Face token to have permission to push
}
# define training job name
job_name = f'push-to-hub-sample-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point          = 'train.py',       # fine-tuning script used in the training job
    source_dir           = './scripts',      # directory where the fine-tuning script is stored
    instance_type        = 'ml.p3.2xlarge',  # instance type used for the training job
    instance_count       = 1,                # the number of instances used for training
    base_job_name        = job_name,         # the name of the training job
    role                 = role,             # IAM role used in the training job to access AWS resources, e.g. S3
    transformers_version = '4.26',           # the transformers version used in the training job
    pytorch_version      = '1.13',           # the PyTorch version used in the training job
    py_version           = 'py39',           # the Python version used in the training job
    hyperparameters      = hyperparameters,  # the hyperparameters used for running the training job
)
# define a data input dictionary with our uploaded s3 uris
data = {
    'train': training_input_path,
    'test': test_input_path
}

# starting the training job with our uploaded datasets as input
# setting wait to False to not expose the HF token
huggingface_estimator.fit(data, wait=False)
# adding a waiter to see when training is done
waiter = huggingface_estimator.sagemaker_session.sagemaker_client.get_waiter('training_job_completed_or_stopped')
waiter.wait(TrainingJobName=huggingface_estimator.latest_training_job.name)
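Once the waiter returns, the job may have either completed or stopped, so it can be worth confirming its final status explicitly. A small check using the same SageMaker client:

# check the final status of the training job (e.g. Completed, Stopped, Failed)
description = huggingface_estimator.sagemaker_session.sagemaker_client.describe_training_job(
    TrainingJobName=huggingface_estimator.latest_training_job.name
)
print(description['TrainingJobStatus'])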

Accessing the model on hf.co/models

We can access the model on hf.co/models using the hub_model_id and our username.

from huggingface_hub import HfApi

whoami = HfApi().whoami()
username = whoami['name']

print(f"https://huggingface.co/{username}/{hyperparameters['hub_model_id']}")
https://huggingface.co/philschmid/sagemaker-distilbert-emotion

Deploying the model from Hugging Face to a SageMaker Endpoint

To deploy our model to Amazon SageMaker, we can create a HuggingFaceModel and provide the Hub configuration (HF_MODEL_ID & HF_TASK) to deploy it. Alternatively, we can use the huggingface_estimator to deploy our model from S3 with huggingface_estimator.deploy().

from sagemaker.huggingface import HuggingFaceModel
import sagemaker

role = sagemaker.get_execution_role()

# Hub model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID': f"{username}/{hyperparameters['hub_model_id']}",
    'HF_TASK': 'text-classification'
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    transformers_version='4.26',
    pytorch_version='1.13',
    py_version='py39',
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,    # number of instances
    instance_type='ml.m5.xlarge' # ec2 instance type
)
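Alternatively, as mentioned above, the estimator can deploy the model straight from its S3 model artifacts without going through the Hub; a minimal sketch:

# deploy directly from the training job's S3 model artifacts instead of the Hub
predictor = huggingface_estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.xlarge'
)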

Then, we use the returned predictor object to call the endpoint.

sentiment_input = {"inputs": "Winter is coming and it will be dark soon."}

predictor.predict(sentiment_input)
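For text-classification, the endpoint returns a list of dicts with label and score fields, so the top prediction can be unpacked like this (the concrete label names come from the fine-tuned model's config):

result = predictor.predict(sentiment_input)
# expected shape: [{'label': '<emotion>', 'score': <float>}]
print(result[0]['label'], result[0]['score'])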

Finally, we delete the inference endpoint.

predictor.delete_model()
predictor.delete_endpoint()