Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.

GitHub Repository: huggingface/notebooks
Path: blob/main/sagemaker/08_distributed_summarization_bart_t5/sagemaker-notebook.ipynb
Views: ²⁵⁴²

Kernel: conda_pytorch_p39

Huggingface Sagemaker-sdk - Distributed Training Demo

Distributed Summarization with `transformers` scripts + `Trainer` and `samsum` dataset

Tutorial

We will use the new Hugging Face DLCs and Amazon SageMaker extension to train a distributed Seq2Seq-transformer model on summarization using the transformers and datasets libraries and upload it afterwards to huggingface.co and test it.

As distributed training strategy we are going to use SageMaker Data Parallelism, which has been built into the Trainer API. To use data-parallelism we only have to define the distribution parameter in our HuggingFace estimator.

# configuration for running training on smdistributed Data Parallel
distribution = {'smdistributed':{'dataparallel':{ 'enabled': True }}}

In this tutorial, we will use an Amazon SageMaker Notebook Instance for running our training job. You can learn here how to set up a Notebook Instance.

What are we going to do:

Set up a development environment and install sagemaker
Chose 🤗 Transformers examples/ script
Configure distributed training and hyperparameters
Create a HuggingFace estimator and start training
Upload the fine-tuned model to huggingface.co
Test inference

Model and Dataset

We are going to fine-tune facebook/bart-base on the samsum dataset. "BART is sequence-to-sequence model trained with denoising as pretraining objective." [REF]

The samsum dataset contains about 16k messenger-like conversations with summaries.

{'id': '13818513',
 'summary': 'Amanda baked cookies and will bring Jerry some tomorrow.',
 'dialogue': "Amanda: I baked cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)"}

NOTE: You can run this demo in Sagemaker Studio, your local machine or Sagemaker Notebook Instances

Set up a development environment and install sagemaker

Installation

Note: The use of Jupyter is optional: We could also launch SageMaker Training jobs from anywhere we have an SDK installed, connectivity to the cloud and appropriate permissions, such as a Laptop, another IDE or a task scheduler like Airflow or AWS Step Functions.

In [ ]:

!pip install "sagemaker>=2.48.0"  --upgrade
#!apt install git-lfs

In [ ]:

!curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.rpm.sh | sudo bash
!sudo yum install git-lfs -y
!git lfs install

Development environment

In [1]:

import sagemaker.huggingface

Permissions

If you are going to use Sagemaker in a local environment. You need access to an IAM Role with the required permissions for Sagemaker. You can find here more about it.

In [ ]:

import sagemaker
import boto3
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it not exists
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

Choose 🤗 Transformers `examples/` script

The 🤗 Transformers repository contains several examples/scripts for fine-tuning models on tasks from language-modeling to token-classification. In our case, we are using the run_summarization.py from the seq2seq/ examples.

Note: you can use this tutorial identical to train your model on a different examples script.

Since the HuggingFace Estimator has git support built-in, we can specify a training script that is stored in a GitHub repository as entry_point and source_dir.

We are going to use the transformers 4.4.2 DLC which means we need to configure the v4.4.2 as the branch to pull the compatible example scripts.

In [1]:

git_config = {'repo': 'https://github.com/huggingface/transformers.git','branch': 'v4.26.0'} # v4.6.1 is referring to the `transformers_version` you use in the estimator.

Configure distributed training and hyperparameters

Next, we will define our hyperparameters and configure our distributed training strategy. As hyperparameter, we can define any Seq2SeqTrainingArguments and the ones defined in run_summarization.py.

In [2]:

# hyperparameters, which are passed into the training job
hyperparameters={'per_device_train_batch_size': 4,
                 'per_device_eval_batch_size': 4,
                 'model_name_or_path': 'facebook/bart-large-cnn',
                 'dataset_name': 'samsum',
                 'do_train': True,
                 'do_eval': True,
                 'do_predict': True,
                 'predict_with_generate': True,
                 'output_dir': '/opt/ml/model',
                 'num_train_epochs': 3,
                 'learning_rate': 5e-5,
                 'seed': 7,
                 'fp16': True,
                 }

# configuration for running training on smdistributed Data Parallel
distribution = {'smdistributed':{'dataparallel':{ 'enabled': True }}}

Create a `HuggingFace` estimator and start training

In [ ]:

from sagemaker.huggingface import HuggingFace

# create the Estimator
huggingface_estimator = HuggingFace(
      entry_point='run_summarization.py', # script
      source_dir='./examples/pytorch/summarization', # relative path to example
      git_config=git_config,
      instance_type='ml.p3dn.24xlarge',
      instance_count=2,
      transformers_version='4.26.0',
      pytorch_version='1.13.1',
      py_version='py39',
      role=role,
      hyperparameters = hyperparameters,
      distribution = distribution
)

In [ ]:

# starting the train job
huggingface_estimator.fit()

Deploying the endpoint

To deploy our endpoint, we call deploy() on our HuggingFace estimator object, passing in our desired number of instances and instance type.

In [ ]:

predictor = huggingface_estimator.deploy(1,"ml.g4dn.xlarge")

Then, we use the returned predictor object to call the endpoint.

In [ ]:

conversation = '''Jeff: Can I train a 🤗 Transformers model on Amazon SageMaker? 
    Philipp: Sure you can use the new Hugging Face Deep Learning Container. 
    Jeff: ok.
    Jeff: and how can I get started? 
    Jeff: where can I find documentation? 
    Philipp: ok, ok you can find everything here. https://huggingface.co/blog/the-partnership-amazon-sagemaker-and-hugging-face                                           
    '''

data= {"inputs":conversation}

predictor.predict(data)

Finally, we delete the endpoint again.

In [12]:

predictor.delete_endpoint()

Upload the fine-tuned model to huggingface.co

We can download our model from Amazon S3 and unzip it using the following snippet.

In [ ]:

import os
import tarfile
from sagemaker.s3 import S3Downloader

local_path = 'my_bart_model'

os.makedirs(local_path, exist_ok = True)

# download model from S3
S3Downloader.download(
    s3_uri=huggingface_estimator.model_data, # s3 uri where the trained model is located
    local_path=local_path, # local path where *.targ.gz is saved
    sagemaker_session=sess # sagemaker session used for training the model
)

# unzip model
tar = tarfile.open(f"{local_path}/model.tar.gz", "r:gz")
tar.extractall(path=local_path)
tar.close()
os.remove(f"{local_path}/model.tar.gz")

Before we are going to upload our model to huggingface.co we need to create a model_card. The model_card describes the model includes hyperparameters, results and which dataset was used for training. To create a model_card we create a README.md in our local_path

In [ ]:

# read eval and test results 
with open(f"{local_path}/eval_results.json") as f:
    eval_results_raw = json.load(f)
    eval_results={}
    eval_results["eval_rouge1"] = eval_results_raw["eval_rouge1"]
    eval_results["eval_rouge2"] = eval_results_raw["eval_rouge2"]
    eval_results["eval_rougeL"] = eval_results_raw["eval_rougeL"]
    eval_results["eval_rougeLsum"] = eval_results_raw["eval_rougeLsum"]

with open(f"{local_path}/test_results.json") as f:
    test_results_raw = json.load(f)
    test_results={}
    test_results["test_rouge1"] = test_results_raw["test_rouge1"]
    test_results["test_rouge2"] = test_results_raw["test_rouge2"]
    test_results["test_rougeL"] = test_results_raw["test_rougeL"]
    test_results["test_rougeLsum"] = test_results_raw["test_rougeLsum"]

After we extract all the metrics we want to include we are going to create our README.md. Additionally to the automated generation of the results table we add the metrics manually to the metadata of our model card under model-index

In [ ]:

print(eval_results)
print(test_results)

In [ ]:

import json


MODEL_CARD_TEMPLATE = """
---
language: en
tags:
- sagemaker
- bart
- summarization
license: apache-2.0
datasets:
- samsum
model-index:
- name: {model_name}
  results:
  - task: 
      name: Abstractive Text Summarization
      type: abstractive-text-summarization
    dataset:
      name: "SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization" 
      type: samsum
    metrics:
       - name: Validation ROGUE-1
         type: rogue-1
         value: 42.621
       - name: Validation ROGUE-2
         type: rogue-2
         value: 21.9825
       - name: Validation ROGUE-L
         type: rogue-l
         value: 33.034
       - name: Test ROGUE-1
         type: rogue-1
         value: 41.3174
       - name: Test ROGUE-2
         type: rogue-2
         value: 20.8716
       - name: Test ROGUE-L
         type: rogue-l
         value: 32.1337
widget:
- text: | 
    Jeff: Can I train a 🤗 Transformers model on Amazon SageMaker? 
    Philipp: Sure you can use the new Hugging Face Deep Learning Container. 
    Jeff: ok.
    Jeff: and how can I get started? 
    Jeff: where can I find documentation? 
    Philipp: ok, ok you can find everything here. https://huggingface.co/blog/the-partnership-amazon-sagemaker-and-hugging-face 
---
## `{model_name}`
This model was trained using Amazon SageMaker and the new Hugging Face Deep Learning container.
For more information look at:
- [🤗 Transformers Documentation: Amazon SageMaker](https://huggingface.co/transformers/sagemaker.html)
- [Example Notebooks](https://github.com/huggingface/notebooks/tree/master/sagemaker)
- [Amazon SageMaker documentation for Hugging Face](https://docs.aws.amazon.com/sagemaker/latest/dg/hugging-face.html)
- [Python SDK SageMaker documentation for Hugging Face](https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/index.html)
- [Deep Learning Container](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#huggingface-training-containers)
## Hyperparameters
    {hyperparameters}
## Usage
    from transformers import pipeline
    summarizer = pipeline("summarization", model="philschmid/{model_name}")
    conversation = '''Jeff: Can I train a 🤗 Transformers model on Amazon SageMaker? 
    Philipp: Sure you can use the new Hugging Face Deep Learning Container. 
    Jeff: ok.
    Jeff: and how can I get started? 
    Jeff: where can I find documentation? 
    Philipp: ok, ok you can find everything here. https://huggingface.co/blog/the-partnership-amazon-sagemaker-and-hugging-face                                           
    '''
    summarizer(conversation)
## Results
| key | value |
| --- | ----- |
{eval_table}
{test_table}
"""

# Generate model card (todo: add more data from Trainer)
model_card = MODEL_CARD_TEMPLATE.format(
    model_name=f"{hyperparameters['model_name_or_path'].split('/')[1]}-{hyperparameters['dataset_name']}",
    hyperparameters=json.dumps(hyperparameters, indent=4, sort_keys=True),
    eval_table="\n".join(f"| {k} | {v} |" for k, v in eval_results.items()),
    test_table="\n".join(f"| {k} | {v} |" for k, v in test_results.items()),
)
with open(f"{local_path}/README.md", "w") as f:
    f.write(model_card)

After we have our unzipped model and model card located in my_bart_model we can use the either huggingface_hub SDK to create a repository and upload it to huggingface.co or go to https://huggingface.co/new an create a new repository and upload it.

In [ ]:

from getpass import getpass
from huggingface_hub import HfApi, Repository

hf_username = "philschmid" # your username on huggingface.co
hf_email = "[email protected]" # email used for commit
repository_name = f"{hyperparameters['model_name_or_path'].split('/')[1]}-{hyperparameters['dataset_name']}" # repository name on huggingface.co
password = getpass("Enter your password:") # creates a prompt for entering password

# get hf token
token = HfApi().login(username=hf_username, password=password)

# create repository
repo_url = HfApi().create_repo(token=token, name=repository_name, exist_ok=True)

# create a Repository instance
model_repo = Repository(use_auth_token=token,
                        clone_from=repo_url,
                        local_dir=local_path,
                        git_user=hf_username,
                        git_email=hf_email)

# push model to the hub
model_repo.push_to_hub()

print(f"https://huggingface.co/{hf_username}/{repository_name}")

Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.

Huggingface Sagemaker-sdk - Distributed Training Demo

Distributed Summarization with `transformers` scripts + `Trainer` and `samsum` dataset

Tutorial

Model and Dataset

Set up a development environment and install sagemaker

Installation

Development environment

Permissions

Choose 🤗 Transformers `examples/` script

Configure distributed training and hyperparameters

Create a `HuggingFace` estimator and start training

Deploying the endpoint

Upload the fine-tuned model to huggingface.co

Product

Resources

Company

Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more, all in one place. Commercial Alternative to JupyterHub.

Huggingface Sagemaker-sdk - Distributed Training Demo

Distributed Summarization with transformers scripts + Trainer and samsum dataset

Tutorial

Model and Dataset

Set up a development environment and install sagemaker

Installation

Development environment

Permissions

Choose 🤗 Transformers examples/ script

Configure distributed training and hyperparameters

Create a HuggingFace estimator and start training

Deploying the endpoint

Upload the fine-tuned model to huggingface.co

Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.

Distributed Summarization with `transformers` scripts + `Trainer` and `samsum` dataset

Choose 🤗 Transformers `examples/` script

Create a `HuggingFace` estimator and start training