Efficient Large Language Model training with LoRA and Hugging Face

In this SageMaker example, we are going to learn how to apply Low-Rank Adaptation of Large Language Models (LoRA) to fine-tune BLOOMZ (a 7-billion-parameter, instruction-tuned version of BLOOM) on a single GPU. We are going to leverage Hugging Face Transformers, Accelerate, and PEFT.

You will learn how to:

  1. Setup Development Environment

  2. Load and prepare the dataset

  3. Fine-Tune BLOOM with LoRA and bnb int-8 on Amazon SageMaker

  4. Deploy the model to Amazon SageMaker Endpoint

Quick intro: PEFT or Parameter Efficient Fine-tuning

PEFT, or Parameter Efficient Fine-tuning, is a new open-source library from Hugging Face that enables efficient adaptation of pre-trained language models (PLMs) to various downstream applications without fine-tuning all of the model's parameters. PEFT currently includes techniques for:

  - Low-Rank Adaptation of Large Language Models (LoRA)

  - Prefix Tuning

  - P-Tuning

  - Prompt Tuning
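
To make the idea concrete, here is a minimal, illustrative sketch of how LoRA is applied with PEFT. This is not this notebook's training code; the checkpoint and hyperparameter values are assumptions chosen for illustration, and it assumes the peft library is installed.

# Illustrative sketch only: wrap a causal LM with LoRA adapters via PEFT.
# The checkpoint and hyperparameters below are example assumptions, not this notebook's settings.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-560m")  # small model to keep the sketch light

lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor for the LoRA updates
    target_modules=["query_key_value"],   # BLOOM attention projection to adapt
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # only a small fraction of the parameters is trainable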

1. Setup Development Environment

!pip install "transformers==4.26.0" "datasets[s3]==2.9.0" sagemaker py7zr --upgrade --quiet

If you are going to use SageMaker in a local environment, you need access to an IAM Role with the required permissions for SageMaker. You can find out more about it here.

import sagemaker
import boto3

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it doesn't exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

2. Load and prepare the dataset

We will use the samsum dataset, a collection of about 16k messenger-like conversations with summaries. Conversations were created and written down by linguists fluent in English.

{ "id": "13818513", "summary": "Amanda baked cookies and will bring Jerry some tomorrow.", "dialogue": "Amanda: I baked cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)" }

To load the samsum dataset, we use the load_dataset() method from the 🤗 Datasets library.

from datasets import load_dataset

# Load dataset from the hub
dataset = load_dataset("samsum", split="train")

print(f"Train dataset size: {len(dataset)}")
# Train dataset size: 14732

To train our model, we need to convert our inputs (text) to token IDs. This is done by a 🤗 Transformers Tokenizer. If you are not sure what this means, check out chapter 6 of the Hugging Face Course.

from transformers import AutoTokenizer

model_id = "bigscience/bloomz-7b1"

# Load tokenizer of BLOOMZ
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.model_max_length = 2048  # overwrite wrong value

Before we can start training, we need to preprocess our data. Abstractive summarization is a text-generation task: our model takes a text as input and generates a summary as output. To batch our data efficiently, we want to understand how long our inputs and outputs will be.
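
One quick way to get a feel for these lengths is to tokenize the raw fields and look at the extremes. The snippet below is a small sketch, not part of the original notebook, and assumes the dataset and tokenizer loaded above:

# Sketch: inspect tokenized lengths of dialogues and summaries (uses `dataset` and `tokenizer` from above)
dialogue_lengths = [len(tokenizer(text)["input_ids"]) for text in dataset["dialogue"]]
summary_lengths = [len(tokenizer(text)["input_ids"]) for text in dataset["summary"]]

print(f"longest dialogue: {max(dialogue_lengths)} tokens")
print(f"longest summary: {max(summary_lengths)} tokens")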

We define a prompt_template which we will use to construct an instruction prompt for better model performance. Our prompt_template has a "fixed" start and end, and our document is in the middle. This means we need to ensure that the "fixed" template parts plus the document do not exceed the max length of the model. We preprocess our dataset before training and save it to disk, then upload it to S3. You could run this step on your local machine or a CPU and upload the result to the Hugging Face Hub.

from random import randint
from itertools import chain
from functools import partial

# custom instruct prompt start
prompt_template = f"Summarize the chat dialogue:\n{{dialogue}}\n---\nSummary:\n{{summary}}{{eos_token}}"

# template dataset to add prompt to each sample
def template_dataset(sample):
    sample["text"] = prompt_template.format(dialogue=sample["dialogue"],
                                            summary=sample["summary"],
                                            eos_token=tokenizer.eos_token)
    return sample

# apply prompt template per sample
dataset = dataset.map(template_dataset, remove_columns=list(dataset.features))

print(dataset[randint(0, len(dataset))]["text"])

# empty list to save remainder from batches to use in next batch
remainder = {"input_ids": [], "attention_mask": []}

def chunk(sample, chunk_length=2048):
    # define global remainder variable to save remainder from batches to use in next batch
    global remainder
    # Concatenate all texts and add remainder from previous batch
    concatenated_examples = {k: list(chain(*sample[k])) for k in sample.keys()}
    concatenated_examples = {k: remainder[k] + concatenated_examples[k] for k in concatenated_examples.keys()}
    # get total number of tokens for batch
    batch_total_length = len(concatenated_examples[list(sample.keys())[0]])

    # get max number of chunks for batch
    if batch_total_length >= chunk_length:
        batch_chunk_length = (batch_total_length // chunk_length) * chunk_length

    # Split by chunks of max_len.
    result = {
        k: [t[i : i + chunk_length] for i in range(0, batch_chunk_length, chunk_length)]
        for k, t in concatenated_examples.items()
    }
    # add remainder to global variable for next batch
    remainder = {k: concatenated_examples[k][batch_chunk_length:] for k in concatenated_examples.keys()}
    # prepare labels
    result["labels"] = result["input_ids"].copy()
    return result

# tokenize and chunk dataset
lm_dataset = dataset.map(
    lambda sample: tokenizer(sample["text"]), batched=True, remove_columns=list(dataset.features)
).map(
    partial(chunk, chunk_length=1536),
    batched=True,
)

# Print total number of samples
print(f"Total number of samples: {len(lm_dataset)}")

After we have processed the dataset, we are going to use the new FileSystem integration to upload it to S3. We are using sess.default_bucket(); adjust this if you want to store the dataset in a different S3 bucket. We will use the S3 path later in our training script.

# save train_dataset to s3
training_input_path = f's3://{sess.default_bucket()}/processed/samsum-sagemaker/train'
lm_dataset.save_to_disk(training_input_path)

print("uploaded data to:")
print(f"training dataset to: {training_input_path}")

3. Fine-Tune BLOOM with LoRA and bnb int-8 on Amazon SageMaker

In addition to the LoRA technique, we will use bitsandbytes LLM.int8() to quantize our frozen LLM to int8. This allows us to reduce the memory needed for BLOOMZ by ~4x.
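
For intuition, loading a model in 8-bit with transformers and bitsandbytes and preparing it for training with PEFT typically looks like the sketch below. This is not the notebook's training script; it assumes bitsandbytes and peft are installed and a GPU is available.

# Conceptual sketch only: load the frozen base model in int8 and prepare it for LoRA training.
from transformers import AutoModelForCausalLM
from peft import prepare_model_for_int8_training

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloomz-7b1",
    load_in_8bit=True,   # quantize the frozen weights to int8 via bitsandbytes
    device_map="auto",   # place layers automatically on the available GPU(s)
)
model = prepare_model_for_int8_training(model)  # casts norms/output head for stable int8 training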

We prepared a run_clm.py, which uses PEFT to train our model. If you are interested in how this works, check out the Efficient Large Language Model training with LoRA and Hugging Face blog post, where we explain the training script in detail.

In order to create a SageMaker training job, we need a HuggingFace Estimator. The Estimator handles end-to-end Amazon SageMaker training and deployment tasks and manages the infrastructure use. SageMaker takes care of starting and managing all the required EC2 instances for us, provides the correct Hugging Face container, uploads the provided scripts, and downloads the data from our S3 bucket into the container at /opt/ml/input/data. Then, it starts the training job by running our entry-point script with the provided hyperparameters.

import time

# define Training Job Name
job_name = f'huggingface-peft-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'

from sagemaker.huggingface import HuggingFace

# hyperparameters, which are passed into the training job
hyperparameters = {
    'model_id': model_id,                            # pre-trained model
    'dataset_path': '/opt/ml/input/data/training',   # path where sagemaker will save training dataset
    'epochs': 3,                                     # number of training epochs
    'per_device_train_batch_size': 1,                # batch size for training
    'lr': 2e-4,                                      # learning rate used during training
}

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point          = 'run_clm.py',     # train script
    source_dir           = 'scripts',        # directory which includes all the files needed for training
    instance_type        = 'ml.g5.2xlarge',  # instance type used for the training job
    instance_count       = 1,                # the number of instances used for training
    base_job_name        = job_name,         # the name of the training job
    role                 = role,             # IAM role used in training job to access AWS resources, e.g. S3
    volume_size          = 300,              # the size of the EBS volume in GB
    transformers_version = '4.26',           # the transformers version used in the training job
    pytorch_version      = '1.13',           # the pytorch version used in the training job
    py_version           = 'py39',           # the python version used in the training job
    hyperparameters      = hyperparameters
)

We can now start our training job with the .fit() method, passing our S3 path to the training script.

# define a data input dictionary with our uploaded s3 uris
data = {'training': training_input_path}

# starting the train job with our uploaded datasets as input
huggingface_estimator.fit(data, wait=True)

In our example, the SageMaker training job took 20632 seconds, which is about 5.7 hours. The ml.g5.2xlarge instance we used costs $1.515 per hour for on-demand usage. As a result, the total cost for training our fine-tuned BLOOMZ-7B model was only $8.63.

We could further reduce the training costs by using spot instances. However, there is a possibility this would result in the total training time increasing due to spot instance interruptions. See the SageMaker pricing page for instance pricing details.
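
As a reference for what that would look like, managed spot training only requires a few extra arguments on the Estimator; the timeout values below are illustrative assumptions, not settings used in this example.

# Illustrative only: the same Estimator with managed spot training enabled.
from sagemaker.huggingface import HuggingFace

huggingface_estimator = HuggingFace(
    entry_point          = 'run_clm.py',
    source_dir           = 'scripts',
    instance_type        = 'ml.g5.2xlarge',
    instance_count       = 1,
    role                 = role,
    transformers_version = '4.26',
    pytorch_version      = '1.13',
    py_version           = 'py39',
    hyperparameters      = hyperparameters,
    use_spot_instances   = True,    # request spot capacity instead of on-demand
    max_run              = 36000,   # maximum training runtime in seconds
    max_wait             = 72000,   # total time to wait for spot capacity + training (must be >= max_run)
)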

4. Deploy the model to Amazon SageMaker Endpoint

When using PEFT for training, you normally end up with adapter weights. We added the merge_and_unload() method to merge the base model with the adapter, which makes it easier to deploy the model, since we can then use the pipeline feature of the transformers library.
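
For context, merging LoRA adapter weights back into the base model with PEFT looks roughly like this. The paths are assumed placeholders and this is not the notebook's exact code; on older peft versions the method lives on peft_model.base_model.

# Sketch only: fold trained LoRA adapter weights into the base model for plain inference.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-7b1")
peft_model = PeftModel.from_pretrained(base_model, "path/to/lora-adapter")  # assumed adapter path
merged_model = peft_model.merge_and_unload()  # folds the low-rank updates into the base weights
merged_model.save_pretrained("path/to/merged-model")  # assumed output path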

SageMaker starts the deployment process by creating a SageMaker Endpoint Configuration and a SageMaker Endpoint. The Endpoint Configuration defines the model and the instance type.

from sagemaker.huggingface import HuggingFaceModel

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    model_data=huggingface_estimator.model_data,
    # model_data="s3://hf-sagemaker-inference/model.tar.gz",  # Change to your model path
    role=role,
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
    model_server_workers=1
)

We can now deploy our model by calling deploy() on our HuggingFaceModel object, passing in our desired number of instances and instance type.

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.4xlarge"
)

Note: it may take 5-10 min for the SageMaker endpoint to bring your instance online and download your model in order to be ready to accept inference requests.

Let's test it by using an example from the test split.

from random import randint
from datasets import load_dataset

# Load dataset from the hub
test_dataset = load_dataset("samsum", split="test")

# select a random test sample
sample = test_dataset[randint(0, len(test_dataset))]

# format sample
prompt_template = f"Summarize the chat dialogue:\n{{dialogue}}\n---\nSummary:\n"

formatted_sample = {
    "inputs": prompt_template.format(dialogue=sample["dialogue"]),
    "parameters": {
        "do_sample": True,      # sample from the predicted probabilities
        "top_p": 0.9,           # nucleus sampling, Fan et al. (2018)
        "temperature": 0.1,     # increase the likelihood of high-probability words and decrease the likelihood of low-probability words
        "max_new_tokens": 100,  # the maximum number of tokens to generate, ignoring the number of tokens in the prompt
    }
}

# predict
res = predictor.predict(formatted_sample)

print(res[0]["generated_text"].split("Summary:")[-1])
# Sample model output: Kirsten and Alex are going bowling this Friday at 7 pm. They will meet up and then go together.

Now let's compare the model's summary of the dialogue to the test sample's reference summary.

print(sample["summary"]) # Test sample summary: Kirsten reminds Alex that the youth group meets this Friday at 7 pm to go bowling.

Finally, we clean up by deleting the model and the endpoint.

predictor.delete_model()
predictor.delete_endpoint()