Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.
How to scale LLM workloads to 20B+ with multi-node clusters on Amazon SageMaker using Hugging Face and PyTorch FSDP
In this tutorial, we will fine-tune the new GPT-NeoXT-Chat-Base-20B on the ELI5 dataset to improve the explanation and question-answering skills of the agent. The ELI5 dataset is an English-language dataset of questions and answers gathered from three subreddits where users ask factual questions requiring paragraph-length or longer answers.
GPT-NeoXT-Chat-Base is a 20B open-source LLM, which makes it hard to fine-tune on a single GPU or even a single node with multiple GPUs. We are going to use the Amazon SageMaker managed training platform as our infrastructure backbone to create a multi-node cluster and run our distributed training. As instances, we will use 2x p4d.24xlarge instances, each of which comes with 8x NVIDIA A100 40GB GPUs.
Note: You might have to increase and request a quota for those instances.
As the distributed training framework, we will use PyTorch FSDP + Hugging Face Transformers Trainer, which makes it easy to distribute our model and data in a fully sharded way across all our nodes and GPUs.
What is PyTorch Fully Sharded Data Parallel (FSDP)?
PyTorch FSDP (Fully Sharded Data Parallel) is an extension of data parallelism that enables efficient large-scale training of LLMs. With FSDP, each GPU stores only a shard of the model parameters and the associated optimizer states and gradients, and can optionally offload the sharded model parameters to CPUs. Sharding reduces the memory footprint on each GPU, while FSDP overlaps network communication with model computation to keep training efficient.
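To make the memory argument concrete, here is a rough back-of-the-envelope sketch. It assumes roughly 16 bytes of model state per parameter (bf16 weights and gradients plus fp32 optimizer state for an Adam-style optimizer). That figure is an assumption for illustration, not a measurement from this training run, and it ignores activations and temporary buffers.

```python
def per_gpu_state_gib(num_params: float, num_gpus: int, bytes_per_param: int = 16) -> float:
    """Approximate GiB of model state (weights, grads, optimizer) held per GPU.

    bytes_per_param=16 is an assumed figure for mixed-precision training
    with an Adam-style optimizer; activations are not included.
    """
    total_bytes = num_params * bytes_per_param
    return total_bytes / num_gpus / 1024**3


# 20B parameters, unsharded vs. fully sharded over 2 nodes x 8 GPUs
unsharded = per_gpu_state_gib(20e9, num_gpus=1)
sharded = per_gpu_state_gib(20e9, num_gpus=16)
print(f"unsharded: {unsharded:.0f} GiB, sharded over 16 GPUs: {sharded:.1f} GiB per GPU")
```

Under these assumptions the unsharded state is far larger than a single 40GB A100, while the fully sharded state fits on each GPU with room left for activations.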
FSDP optimizations include:
Transformer Wrapping Policy
Mixed Precision (bf16)
Activation Checkpointing (Gradient Checkpointing)
Full Sharding Strategy
PyTorch FSDP is natively integrated into the Hugging Face Trainer, making it easy to adapt and use. You can learn more about PyTorch FSDP in the Efficient Large-Scale Training with Pytorch FSDP and AWS or Introducing PyTorch Fully Sharded Data Parallel (FSDP) API blog posts.
If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can find out more about it here.
2. Load and prepare the dataset
As the base dataset, we will use the ELI5 dataset, but before fine-tuning the model, we need to preprocess the data. We will create a "chat" version of the dataset by adding <user> and <bot> tokens and an end-of-sequence <|endoftext|> token to help the model learn to distinguish consecutive examples. Additionally, we create chunks of 2048 tokens (model max length) to avoid unnecessary padding and computation.
The first step is to load our dataset from Hugging Face. The dataset contains 272634 samples for eli5. We will downsample the dataset to 25,000 samples to make it more realistic for real-world use cases.
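The downsampling step can be sketched with the standard library alone. This is a stand-in, not the notebook's code: with the Hugging Face datasets library the same idea is usually written as `dataset.shuffle(seed=...).select(range(25000))`. The function name and the seed below are illustrative.

```python
import random


def downsample_indices(num_rows: int, num_samples: int, seed: int = 42) -> list:
    """Pick `num_samples` distinct row indices reproducibly."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    return rng.sample(range(num_rows), num_samples)


# 272634 rows in the eli5 split, reduced to 25,000
indices = downsample_indices(272_634, 25_000)
print(len(indices))
```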
An ELI5 sample can include multiple answers to a “question”. We will select the answer with the highest user score for our explanation.
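A minimal sketch of that selection, assuming each sample stores its answers as parallel lists under `answers["text"]` and `answers["score"]` and the question under `title`. These field names are an assumption about the dataset layout, and the example sample is made up.

```python
def select_best_answer(sample: dict) -> dict:
    """Keep the question and the answer with the highest user score."""
    answers = sample["answers"]
    scores = answers["score"]
    best_idx = max(range(len(scores)), key=lambda i: scores[i])
    return {"question": sample["title"], "answer": answers["text"][best_idx]}


# Made-up sample in the assumed layout
example = {
    "title": "Why is the sky blue?",
    "answers": {
        "text": ["Because of the ocean.", "Air scatters blue light more than red light."],
        "score": [3, 57],
    },
}
print(select_best_answer(example))
```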
Note: This dataset is a good candidate for reinforcement learning, where a transformer learns to generate answers that receive higher scores. Let me know if you are interested in an example of that.
The next step is to convert our dataset into a chat version. Here we will follow the instructions on the Model card and add the EOS token.
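The conversion could look roughly like the sketch below. The role tags are taken from the text above, but the exact template (colon, newline, spacing) is an assumption; the model card is the authoritative source for the prompt format.

```python
EOS_TOKEN = "<|endoftext|>"


def convert_to_chat(sample: dict, user_tag: str = "<user>", bot_tag: str = "<bot>",
                    eos_token: str = EOS_TOKEN) -> str:
    """Render one question/answer pair as a single chat-style training string.

    The "tag: text" layout is assumed for illustration.
    """
    return f"{user_tag}: {sample['question']}\n{bot_tag}: {sample['answer']}{eos_token}"


chat_text = convert_to_chat({"question": "Why is the sky blue?",
                             "answer": "Air scatters blue light more than red light."})
print(chat_text)
```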
The last step of the data preparation is to tokenize and chunk our dataset. Tokenizing converts our inputs (text) into token IDs that the model can understand. Additionally, we concatenate our dataset samples into chunks of 2048 tokens to avoid unnecessary padding.
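The chunking idea can be sketched independently of the tokenizer: concatenate all token ID lists and cut the result into fixed-size blocks. The function below is illustrative and returns the leftover tokens instead of padding them; how the notebook handles the remainder is not shown in the text.

```python
from itertools import chain


def chunk_token_ids(tokenized_samples, chunk_length=2048):
    """Concatenate token ID lists and split them into full chunks.

    Returns (chunks, remainder); the remainder holds the tokens that do
    not fill a complete chunk.
    """
    all_ids = list(chain.from_iterable(tokenized_samples))
    usable = (len(all_ids) // chunk_length) * chunk_length
    chunks = [all_ids[i:i + chunk_length] for i in range(0, usable, chunk_length)]
    return chunks, all_ids[usable:]


# Tiny example with chunk_length=4 instead of 2048
chunks, remainder = chunk_token_ids([[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]], chunk_length=4)
print(chunks, remainder)
```

Because every chunk has exactly `chunk_length` tokens, no padding tokens are needed within a batch.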
After processing the dataset, we use the new FileSystem integration to upload it to S3. We are using sess.default_bucket(); adjust this if you want to store the dataset in a different S3 bucket. We will use the S3 path later in our training script.
3. Fine-tune the GPT model using FSDP on Amazon SageMaker
As mentioned in the beginning, we will use Amazon SageMaker and PyTorch FSDP to train our model. Amazon SageMaker makes it easy to create a multi-node cluster to train our model in a distributed manner. The sagemaker Python SDK recently added support for running training jobs with torchrun to distribute the script across multiple nodes and GPUs.
To use torchrun to execute our scripts, we only have to define the distribution parameter in our Estimator and set it to "torch_distributed": {"enabled": True}. This tells SageMaker to launch our training job with torchrun.
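As a rough illustration, the sketch below shows the distribution setting next to the kind of torchrun command that would be run on each node. The helper function, the host name `algo-1`, and the port are assumptions for illustration. The exact command SageMaker builds is not shown here.

```python
# Setting passed to the Estimator (from the text above)
distribution = {"torch_distributed": {"enabled": True}}


def torchrun_command(script, num_nodes, gpus_per_node, node_rank,
                     master_addr, master_port=7777, script_args=()):
    """Build an illustrative torchrun invocation for one node."""
    return [
        "torchrun",
        f"--nnodes={num_nodes}",
        f"--nproc_per_node={gpus_per_node}",
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        script,
        *script_args,
    ]


# 2 nodes with 8 GPUs each; "algo-1" is an assumed host name for the first node
cmd = torchrun_command("run_clm.py", 2, 8, node_rank=0, master_addr="algo-1",
                       script_args=["--bf16", "True"])
print(" ".join(cmd))
```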
To use FSDP with the Hugging Face Trainer, we need to provide our fsdp strategy as well as the transformer layer policy.
In our example, we will use full_shard auto_wrap as the fsdp strategy and GPTNeoXLayer as the transformer layer policy. If you run this example and change the model id, make sure to also adjust the transformer layer policy.
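A minimal sketch of how these settings might be passed as hyperparameters. `fsdp`, `fsdp_transformer_layer_cls_to_wrap`, `bf16`, and `gradient_checkpointing` are Hugging Face `TrainingArguments` fields. The keys `model_id` and `dataset_path`, the Hub model id, and the `training` channel path are assumptions about what the training script accepts.

```python
hyperparameters = {
    # Assumed script arguments (hypothetical names)
    "model_id": "togethercomputer/GPT-NeoXT-Chat-Base-20B",
    "dataset_path": "/opt/ml/input/data/training",
    # Hugging Face TrainingArguments fields
    "bf16": True,                      # mixed precision
    "gradient_checkpointing": True,    # activation checkpointing
    "fsdp": "full_shard auto_wrap",    # full sharding + auto wrapping
    "fsdp_transformer_layer_cls_to_wrap": "GPTNeoXLayer",  # layer class to wrap
}

print(hyperparameters["fsdp"].split())
```

Since the fsdp value contains a space, it may need extra quoting when it is forwarded to the script as a command-line argument.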
We prepared a run_clm.py, which implements causal language modeling and accepts our fsdp and other hyperparameters.
To create a SageMaker training job, we create a HuggingFace Estimator and provide all our information. SageMaker takes care of starting and managing all the required EC2 instances for us, provides the correct Hugging Face container, uploads the provided scripts, and downloads the data from our S3 bucket into the container at /opt/ml/input/data. Then, it starts the training job by running the training script with torchrun.
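The Estimator setup could be sketched as follows. The instance type and count come from the text. The `source_dir`, the framework versions, and the placeholder role ARN are assumptions; check which container versions are currently available. The SDK import is kept inside the function so that the argument dictionary can be inspected without the sagemaker package installed.

```python
def estimator_kwargs(role, hyperparameters, instance_count=2):
    """Collect the arguments for the HuggingFace Estimator."""
    return dict(
        entry_point="run_clm.py",
        source_dir="./scripts",            # assumed location of the script
        instance_type="ml.p4d.24xlarge",
        instance_count=instance_count,
        role=role,
        transformers_version="4.26.0",     # assumed container versions
        pytorch_version="1.13.1",
        py_version="py39",
        hyperparameters=hyperparameters,
        distribution={"torch_distributed": {"enabled": True}},
    )


def build_estimator(role, hyperparameters):
    """Create the Estimator; requires the sagemaker SDK and AWS credentials."""
    from sagemaker.huggingface import HuggingFace
    return HuggingFace(**estimator_kwargs(role, hyperparameters))


# Placeholder role ARN for illustration only
kwargs = estimator_kwargs("arn:aws:iam::111122223333:role/example-role", {"bf16": True})
print(kwargs["instance_type"], kwargs["instance_count"])
```

With valid credentials, something like `build_estimator(role, hyperparameters).fit({"training": training_input_path})` would then start the job, where `training_input_path` is the S3 path of the processed dataset.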
We can now start our training job with the .fit() method, passing our S3 path to the training script.
The training took 9407 seconds, which is about 2.6 hours. The ml.p4d.24xlarge instance we used costs $37.688 per hour. So the total cost for training GPT-NeoXT-Chat-Base-20B is (2.6h * $37.688) * 2 instances, which comes to roughly $197. We could reduce the cost by using spot instances or Parameter Efficient Fine Tuning.
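The cost estimate can be reproduced from the numbers above using the unrounded training time:

```python
def training_cost(seconds: float, price_per_hour: float, instance_count: int) -> float:
    """On-demand cost in USD for a job billed per instance-hour."""
    return seconds / 3600 * price_per_hour * instance_count


hours = 9407 / 3600
cost = training_cost(9407, 37.688, 2)
print(f"{hours:.2f} h, ${cost:.0f}")  # 2.61 h, $197
```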
Note: Uploading the model can take a while, since SageMaker first creates an archive of the artifacts, which is pretty slow. To speed this up, you can save the artifacts to the Hugging Face Hub instead.