CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutSign UpSign In
huggingface

Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.

GitHub Repository: huggingface/notebooks
Path: blob/main/sagemaker/01_getting_started_pytorch/sagemaker-notebook.ipynb
Views: 2542
Kernel: conda_pytorch_p39

Huggingface Sagemaker-sdk - Getting Started Demo

Binary Classification with Trainer and imdb dataset

Introduction

Welcome to our end-to-end binary Text-Classification example. In this demo, we will use the Hugging Faces transformers and datasets library together with a custom Amazon sagemaker-sdk extension to fine-tune a pre-trained transformer on binary text classification. In particular, the pre-trained model will be fine-tuned using the imdb dataset. To get started, we need to set up the environment with a few prerequisite steps, for permissions, configurations, and so on.

image.png

NOTE: You can run this demo in Sagemaker Studio, your local machine or Sagemaker Notebook Instances

Development Environment and Permissions

Installation

Note: we only install the required libraries from Hugging Face and AWS. You also need PyTorch or Tensorflow, if you haven´t it installed

!pip install "sagemaker>=2.140.0" "transformers==4.26.1" "datasets[s3]==2.10.1" --upgrade

Development environment

import sagemaker.huggingface

Permissions

If you are going to use Sagemaker in a local environment. You need access to an IAM Role with the required permissions for Sagemaker. You can find here more about it.

import sagemaker import boto3 sess = sagemaker.Session() # sagemaker session bucket -> used for uploading data, models and logs # sagemaker will automatically create this bucket if it not exists sagemaker_session_bucket=None if sagemaker_session_bucket is None and sess is not None: # set to default bucket if a bucket name is not given sagemaker_session_bucket = sess.default_bucket() try: role = sagemaker.get_execution_role() except ValueError: iam = boto3.client('iam') role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn'] sess = sagemaker.Session(default_bucket=sagemaker_session_bucket) print(f"sagemaker role arn: {role}") print(f"sagemaker bucket: {sess.default_bucket()}") print(f"sagemaker session region: {sess.boto_region_name}")

Preprocessing

We are using the datasets library to download and preprocess the imdb dataset. After preprocessing, the dataset will be uploaded to our sagemaker_session_bucket to be used within our training job. The imdb dataset consists of 25000 training and 25000 testing highly polar movie reviews.

Tokenization

from datasets import load_dataset from transformers import AutoTokenizer # tokenizer used in preprocessing tokenizer_name = 'distilbert-base-uncased' # dataset used dataset_name = 'imdb' # s3 key prefix for the data s3_prefix = 'samples/datasets/imdb'
# load dataset dataset = load_dataset(dataset_name) # download tokenizer tokenizer = AutoTokenizer.from_pretrained(tokenizer_name) # tokenizer helper function def tokenize(batch): return tokenizer(batch['text'], padding='max_length', truncation=True) # load dataset train_dataset, test_dataset = load_dataset('imdb', split=['train', 'test']) test_dataset = test_dataset.shuffle().select(range(10000)) # smaller the size for test dataset to 10k # tokenize dataset train_dataset = train_dataset.map(tokenize, batched=True) test_dataset = test_dataset.map(tokenize, batched=True) # set format for pytorch train_dataset = train_dataset.rename_column("label", "labels") train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels']) test_dataset = test_dataset.rename_column("label", "labels") test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])

Uploading data to sagemaker_session_bucket

After we processed the datasets we are going to use the new FileSystem integration to upload our dataset to S3.

# save train_dataset to s3 training_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/train' train_dataset.save_to_disk(training_input_path) # save test_dataset to s3 test_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/test' test_dataset.save_to_disk(test_input_path)

Fine-tuning & starting Sagemaker Training Job

In order to create a sagemaker training job we need an HuggingFace Estimator. The Estimator handles end-to-end Amazon SageMaker training and deployment tasks. In a Estimator we define, which fine-tuning script should be used as entry_point, which instance_type should be used, which hyperparameters are passed in .....

huggingface_estimator = HuggingFace(entry_point='train.py', source_dir='./scripts', base_job_name='huggingface-sdk-extension', instance_type='ml.p3.2xlarge', instance_count=1, transformers_version='4.4', pytorch_version='1.6', py_version='py36', role=role, hyperparameters = {'epochs': 1, 'train_batch_size': 32, 'model_name':'distilbert-base-uncased' })

When we create a SageMaker training job, SageMaker takes care of starting and managing all the required ec2 instances for us with the huggingface container, uploads the provided fine-tuning script train.py and downloads the data from our sagemaker_session_bucket into the container at /opt/ml/input/data. Then, it starts the training job by running.

/opt/conda/bin/python train.py --epochs 1 --model_name distilbert-base-uncased --train_batch_size 32

The hyperparameters you define in the HuggingFace estimator are passed in as named arguments.

Sagemaker is providing useful properties about the training environment through various environment variables, including the following:

  • SM_MODEL_DIR: A string that represents the path where the training job writes the model artifacts to. After training, artifacts in this directory are uploaded to S3 for model hosting.

  • SM_NUM_GPUS: An integer representing the number of GPUs available to the host.

  • SM_CHANNEL_XXXX: A string that represents the path to the directory that contains the input data for the specified channel. For example, if you specify two input channels in the HuggingFace estimator’s fit call, named train and test, the environment variables SM_CHANNEL_TRAIN and SM_CHANNEL_TEST are set.

To run your training job locally you can define instance_type='local' or instance_type='local_gpu' for gpu usage. Note: this does not working within SageMaker Studio

!pygmentize ./scripts/train.py

Creating an Estimator and start a training job

from sagemaker.huggingface import HuggingFace # hyperparameters, which are passed into the training job hyperparameters={'epochs': 1, 'train_batch_size': 32, 'model_name':'distilbert-base-uncased' }
huggingface_estimator = HuggingFace(entry_point='train.py', source_dir='./scripts', instance_type='ml.p3.2xlarge', instance_count=1, role=role, transformers_version='4.26', pytorch_version='1.13', py_version='py39', hyperparameters = hyperparameters)
# starting the train job with our uploaded datasets as input huggingface_estimator.fit({'train': training_input_path, 'test': test_input_path})

Deploying the endpoint

To deploy our endpoint, we call deploy() on our HuggingFace estimator object, passing in our desired number of instances and instance type.

predictor = huggingface_estimator.deploy(1, "ml.g4dn.xlarge")

Then, we use the returned predictor object to call the endpoint.

sentiment_input= {"inputs":"I love using the new Inference DLC."} predictor.predict(sentiment_input)

Finally, we delete the endpoint again.

predictor.delete_model() predictor.delete_endpoint()

Extras

Estimator Parameters

# container image used for training job print(f"container image used for training job: \n{huggingface_estimator.image_uri}\n") # s3 uri where the trained model is located print(f"s3 uri where the trained model is located: \n{huggingface_estimator.model_data}\n") # latest training job name for this estimator print(f"latest training job name for this estimator: \n{huggingface_estimator.latest_training_job.name}\n")
# access the logs of the training job huggingface_estimator.sagemaker_session.logs_for_job(huggingface_estimator.latest_training_job.name)

Attach to old training job to an estimator

In Sagemaker you can attach an old training job to an estimator to continue training, get results etc..

from sagemaker.estimator import Estimator # job which is going to be attached to the estimator old_training_job_name=''
# attach old training job huggingface_estimator_loaded = Estimator.attach(old_training_job_name) # get model output s3 from training job huggingface_estimator_loaded.model_data