Huggingface Sagemaker-sdk - training with custom metrics
Binary Classification with Trainer and the imdb dataset
In this demo, we extend the basic classification demo by adding metric definitions to capture and visualize training metrics.
The documentation for the SageMaker metrics capture feature is available at https://docs.aws.amazon.com/sagemaker/latest/dg/training-metrics.html
We additionally use SageMaker Checkpointing to send intermediary checkpoint data to S3, uncompressed, in parallel with the training: https://docs.aws.amazon.com/sagemaker/latest/dg/model-checkpoints.html
SageMaker Checkpointing is supported by the Hugging Face Trainer as of Transformers 4.4.0.
Import libraries and set environment
Note: we only install the required libraries from Hugging Face and AWS. You also need PyTorch or TensorFlow if you don't have it installed already.
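A minimal install cell might look like the following; the exact version pins are assumptions and should be adjusted to your environment:

```python
# install the Hugging Face and AWS libraries used in this notebook
# (version pins are indicative, not authoritative)
!pip install "sagemaker>=2.48.0" "transformers==4.6.1" "datasets[s3]==1.6.2" --upgrade
```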
Development environment
Permissions
If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can find more about it here.
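As a sketch, setting up the session and role inside a SageMaker notebook could look like this; in a local environment you would instead pass the ARN of a suitable IAM role explicitly instead of calling get_execution_role():

```python
import sagemaker

sess = sagemaker.Session()
# The SageMaker session bucket is used for uploading data, models and logs;
# SageMaker will automatically create this bucket if it does not exist.
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # fall back to the default bucket if no bucket name is given
    sagemaker_session_bucket = sess.default_bucket()

role = sagemaker.get_execution_role()
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")
```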
Preprocessing
We are using the datasets library to download and preprocess the imdb dataset. After preprocessing, the dataset will be uploaded to our sagemaker_session_bucket to be used within our training job. The imdb dataset consists of 25,000 training and 25,000 testing samples of highly polar movie reviews.
Tokenization
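A tokenization sketch along these lines, assuming a distilbert-base-uncased checkpoint (any other checkpoint works the same way):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# tokenizer used in preprocessing (assumed checkpoint)
tokenizer_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

# download the imdb dataset
train_dataset, test_dataset = load_dataset("imdb", split=["train", "test"])

# tokenizer helper function
def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True)

# tokenize train and test datasets
train_dataset = train_dataset.map(tokenize, batched=True)
test_dataset = test_dataset.map(tokenize, batched=True)

# set dataset format for PyTorch
train_dataset = train_dataset.rename_column("label", "labels")
train_dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
test_dataset = test_dataset.rename_column("label", "labels")
test_dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
```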
Uploading data to Amazon S3
After we have processed the datasets, we are going to use the new FileSystem integration to upload our dataset to S3.
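A sketch of the upload, using the S3FileSystem class that shipped with datasets[s3] 1.x (newer datasets versions pass storage_options instead of an fs object); the S3 prefix is an arbitrary choice:

```python
from datasets.filesystems import S3FileSystem

s3 = S3FileSystem()

# hypothetical prefix inside the session bucket
s3_prefix = "samples/datasets/imdb"

# save train_dataset directly to S3
training_input_path = f"s3://{sess.default_bucket()}/{s3_prefix}/train"
train_dataset.save_to_disk(training_input_path, fs=s3)

# save test_dataset directly to S3
test_input_path = f"s3://{sess.default_bucket()}/{s3_prefix}/test"
test_dataset.save_to_disk(test_input_path, fs=s3)
```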
Launching a Training Job with custom metrics
We create a metric_definitions list containing regex-based definitions that will be used to parse the job logs and extract metrics.
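A sketch of the definitions and the estimator launch follows; the regex patterns assume the default Hugging Face Trainer log format, and the entry point, hyperparameters, instance type, and framework versions are illustrative placeholders:

```python
from sagemaker.huggingface import HuggingFace

# regex-based metric definitions, parsed out of the CloudWatch job logs;
# the patterns assume the default Hugging Face Trainer log format
metric_definitions = [
    {"Name": "loss", "Regex": r"'loss': ([0-9\.]+)"},
    {"Name": "learning_rate", "Regex": r"'learning_rate': ([0-9\.e\-]+)"},
    {"Name": "eval_loss", "Regex": r"'eval_loss': ([0-9\.]+)"},
    {"Name": "eval_accuracy", "Regex": r"'eval_accuracy': ([0-9\.]+)"},
    {"Name": "eval_f1", "Regex": r"'eval_f1': ([0-9\.]+)"},
    {"Name": "epoch", "Regex": r"'epoch': ([0-9\.]+)"},
]

# hyperparameters passed to the training script (names are illustrative)
hyperparameters = {
    "epochs": 3,
    "train_batch_size": 32,
    "model_name": "distilbert-base-uncased",
}

huggingface_estimator = HuggingFace(
    entry_point="train.py",     # assumed training script
    source_dir="./scripts",     # assumed script location
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    role=role,
    transformers_version="4.6",
    pytorch_version="1.7",
    py_version="py36",
    hyperparameters=hyperparameters,
    metric_definitions=metric_definitions,
    # SageMaker Checkpointing: checkpoints written to /opt/ml/checkpoints
    # inside the container are streamed uncompressed to this S3 location
    checkpoint_s3_uri=f"s3://{sess.default_bucket()}/checkpoints",
)

# start the training job with the uploaded datasets as input
huggingface_estimator.fit({"train": training_input_path, "test": test_input_path})
```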
Accessing Training Metrics
The training job doesn't emit metrics immediately: it first needs to provision a training instance, download the training image, and download the data. Additionally, in this demo the first evaluation logs only appear after 500 steps (the default in the Hugging Face Trainer, see https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments).
Hence, run the section below 15 to 20 minutes after launching the training; otherwise it may not have metrics available yet and will return an error.
Note that you can also copy this code and run it from a different place (as long as it is connected to the cloud and authorized to use the API), by specifying the exact training job name in the TrainingJobAnalytics API call.
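For example, along these lines:

```python
from sagemaker import TrainingJobAnalytics

# read the job name from the estimator; when running this elsewhere,
# pass the exact training job name as a string instead
training_job_name = huggingface_estimator.latest_training_job.name

# captured metrics as a pandas dataframe (columns: timestamp, metric_name, value)
df = TrainingJobAnalytics(training_job_name=training_job_name).dataframe()
df.head(10)
```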
We can also plot some of the collected metrics.
Note: the plots below were generated at the end of the training job, with metrics available for the full training duration.
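A plotting sketch, assuming the dataframe returned by TrainingJobAnalytics above:

```python
import matplotlib.pyplot as plt

# split the collected metrics by name
evals = df[df.metric_name.isin(["eval_loss", "eval_accuracy"])]
losses = df[df.metric_name == "loss"]

# plot training loss and evaluation metrics over the job timeline
fig, ax = plt.subplots(figsize=(12, 6))
for name, group in evals.groupby("metric_name"):
    ax.plot(group["timestamp"], group["value"], marker="o", label=name)
ax.plot(losses["timestamp"], losses["value"], label="loss")
ax.set_xlabel("timestamp (seconds since job start)")
ax.set_ylabel("metric value")
ax.legend()
plt.show()
```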
Deploying the endpoint
To deploy our endpoint, we call deploy() on our HuggingFace estimator object, passing in our desired number of instances and instance type.
Then, we use the returned predictor object to call the endpoint.
Finally, we delete the endpoint again.
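Putting these three steps together, a sketch could look like this (the instance type and sample input are illustrative):

```python
# deploy the fine-tuned model to a real-time endpoint
predictor = huggingface_estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",  # assumed instance type
)

# call the endpoint with a sample review
sentiment_input = {"inputs": "I love using the new Inference DLC."}
print(predictor.predict(sentiment_input))

# clean up: delete the endpoint again
predictor.delete_endpoint()
```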