GitHub Repository: huggingface/notebooks
Path: blob/main/sagemaker/28_train_llms_with_qlora/sagemaker-notebook.ipynb
Kernel: pytorch

Train LLMs using QLoRA on Amazon SageMaker

In this SageMaker example, we are going to learn how to apply QLoRA: Efficient Finetuning of Quantized LLMs to fine-tune Falcon 40B. QLoRA is an efficient finetuning technique that quantizes a pretrained language model to 4 bits and attaches small "Low-Rank Adapters" which are then fine-tuned. This enables fine-tuning of models with up to 65 billion parameters on a single GPU; despite its efficiency, QLoRA matches the performance of full-precision fine-tuning and achieves state-of-the-art results on language tasks.

In our example, we are going to leverage Hugging Face Transformers, Accelerate, and PEFT.

In detail, you will learn how to:

  1. Setup Development Environment

  2. Load and prepare the dataset

  3. Fine-Tune Falcon 40B with QLoRA on Amazon SageMaker

Quick intro: PEFT or Parameter Efficient Fine-tuning

PEFT, or Parameter Efficient Fine-tuning, is a new open-source library from Hugging Face that enables efficient adaptation of pre-trained language models (PLMs) to various downstream applications without fine-tuning all of the model's parameters. PEFT currently includes techniques such as LoRA, Prefix Tuning, P-Tuning, and Prompt Tuning.
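
As a quick illustration of the idea, here is a hedged sketch of attaching LoRA adapters with PEFT. It is not the configuration used later in this example; the gpt2 model and the "c_attn" target module are placeholders chosen only so the snippet stays small and runnable.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# load a small pretrained model as a placeholder (this example later uses Falcon 40B)
model = AutoModelForCausalLM.from_pretrained("gpt2")

# LoRA configuration: only small low-rank adapter matrices are trained
lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=32,              # scaling factor for the adapter output
    target_modules=["c_attn"],  # GPT-2's fused attention projection; model dependent
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of parameters is trainable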

1. Setup Development Environment

!pip install "transformers==4.30.2" "datasets[s3]==2.13.0" sagemaker --upgrade --quiet

If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can find more about it here.

import sagemaker
import boto3

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

2. Load and prepare the dataset

We will use databricks-dolly-15k, an open-source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.

{ "instruction": "What is world of warcraft", "context": "", "response": "World of warcraft is a massive online multi player role playing game. It was released in 2004 by bizarre entertainment" }

To load the databricks-dolly-15k dataset, we use the load_dataset() method from the 🤗 Datasets library.

from datasets import load_dataset
from random import randrange

# Load dataset from the hub
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

print(f"dataset size: {len(dataset)}")
print(dataset[randrange(len(dataset))])
# dataset size: 15011

To instruction-tune our model, we need to convert our structured examples into a collection of tasks described via instructions. We define a formatting function that takes a sample and returns a string in our instruction format.

def format_dolly(sample):
    instruction = f"### Instruction\n{sample['instruction']}"
    context = f"### Context\n{sample['context']}" if len(sample["context"]) > 0 else None
    response = f"### Answer\n{sample['response']}"
    # join all the parts together
    prompt = "\n\n".join([i for i in [instruction, context, response] if i is not None])
    return prompt

Let's test our formatting function on a random example.

from random import randrange

print(format_dolly(dataset[randrange(len(dataset))]))
### Instruction
Who is the most decorated olympian of all time?

### Answer
Michael Phelps is the most decorated olympian winning a total of 28 medals.

In addition to formatting our samples, we also want to pack multiple samples into one sequence for more efficient training.

from transformers import AutoTokenizer

model_id = "tiiuae/falcon-40b"  # sharded weights
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

We define some helper functions to tokenize our samples and pack them into sequences of a given length.

from random import randint
from itertools import chain
from functools import partial

# template dataset to add prompt to each sample
def template_dataset(sample):
    sample["text"] = f"{format_dolly(sample)}{tokenizer.eos_token}"
    return sample

# apply prompt template per sample
dataset = dataset.map(template_dataset, remove_columns=list(dataset.features))
# print random sample
print(dataset[randint(0, len(dataset) - 1)]["text"])

# empty lists to save remainder from batches to use in next batch
remainder = {"input_ids": [], "attention_mask": [], "token_type_ids": []}

def chunk(sample, chunk_length=2048):
    # define global remainder variable to save remainder from batches to use in next batch
    global remainder
    # Concatenate all texts and add remainder from previous batch
    concatenated_examples = {k: list(chain(*sample[k])) for k in sample.keys()}
    concatenated_examples = {k: remainder[k] + concatenated_examples[k] for k in concatenated_examples.keys()}
    # get total number of tokens for batch
    batch_total_length = len(concatenated_examples[list(sample.keys())[0]])

    # get max number of chunks for batch
    if batch_total_length >= chunk_length:
        batch_chunk_length = (batch_total_length // chunk_length) * chunk_length

    # Split by chunks of max_len
    result = {
        k: [t[i : i + chunk_length] for i in range(0, batch_chunk_length, chunk_length)]
        for k, t in concatenated_examples.items()
    }
    # add remainder to global variable for next batch
    remainder = {k: concatenated_examples[k][batch_chunk_length:] for k in concatenated_examples.keys()}
    # prepare labels
    result["labels"] = result["input_ids"].copy()
    return result

# tokenize and chunk dataset
lm_dataset = dataset.map(
    lambda sample: tokenizer(sample["text"]), batched=True, remove_columns=list(dataset.features)
).map(
    partial(chunk, chunk_length=2048),
    batched=True,
)

# Print total number of samples
print(f"Total number of samples: {len(lm_dataset)}")

After we have processed the dataset, we are going to use the new FileSystem integration to upload it to S3. We are using sess.default_bucket(); adjust this if you want to store the dataset in a different S3 bucket. We will use the S3 path later in our training script.

# save train_dataset to s3
training_input_path = f's3://{sess.default_bucket()}/processed/dolly/train'

lm_dataset.save_to_disk(training_input_path)

print("uploaded data to:")
print(f"training dataset to: {training_input_path}")
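
Inside the container, SageMaker downloads this S3 channel to a local directory before training starts. A minimal sketch of how a training script could read the prepared dataset back in (this is an assumed pattern, not the verbatim contents of run_clm.py):

from datasets import load_from_disk

# SageMaker copies the "training" channel from S3 to /opt/ml/input/data/training,
# which is the path we pass as the dataset_path hyperparameter below
lm_dataset = load_from_disk("/opt/ml/input/data/training")
print(f"loaded {len(lm_dataset)} packed training samples")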

3. Fine-Tune Falcon 40B with QLoRA on Amazon SageMaker

We are going to use the recently introduced method from the paper "QLoRA: Efficient Finetuning of Quantized LLMs" by Tim Dettmers et al. QLoRA is a new technique to reduce the memory footprint of large language models during finetuning without sacrificing performance. The TL;DR of how QLoRA works (a minimal sketch follows the list):

  • Quantize the pretrained model to 4 bits and freeze it.

  • Attach small, trainable adapter layers (LoRA).

  • Finetune only the adapter layers, while using the frozen quantized model for context.

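Conceptually, these steps look roughly like the sketch below. It is only an illustration under the assumption of recent transformers, bitsandbytes, and peft versions; the parameter values and target modules are placeholders and may differ from the actual training script.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the frozen base weights to 4 bit
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 during forward/backward
)

# load the base model quantized to 4 bits; its weights stay frozen
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-40b",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
model = prepare_model_for_kbit_training(model)

# attach small trainable LoRA adapters; only these weights are updated
peft_config = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.1,
    target_modules=["query_key_value"],  # Falcon's fused attention projection (assumed)
    bias="none", task_type="CAUSAL_LM",
)
model = get_peft_model(model, peft_config)
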
We prepared a run_clm.py, which implements QLoRA using PEFT to train our model. The script also merges the LoRA weights into the model weights after training, so you can use the model as a normal model without any additional code.
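
For illustration, merging the adapters back into the base weights is a short operation in PEFT. The following is a hedged sketch of what that merge step could look like, with illustrative paths, not the verbatim code from run_clm.py:

import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# load the base model and attach the trained LoRA adapters (paths are illustrative)
base_model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-40b", torch_dtype=torch.bfloat16, trust_remote_code=True
)
model = PeftModel.from_pretrained(base_model, "/opt/ml/model")  # directory with adapter weights

# fold the adapter weights into the base weights and save a plain Transformers model
merged_model = model.merge_and_unload()
merged_model.save_pretrained("/opt/ml/model")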

In order to create a SageMaker training job, we need a HuggingFace Estimator. The Estimator handles end-to-end Amazon SageMaker training and deployment tasks and manages the infrastructure: SageMaker takes care of starting and managing all the required EC2 instances for us, provides the correct Hugging Face container, uploads the provided scripts, and downloads the data from our S3 bucket into the container at /opt/ml/input/data. Then it starts the training job by running the entry-point script.

import time
from sagemaker.huggingface import HuggingFace

# define Training Job Name
job_name = f'huggingface-qlora-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'

# hyperparameters, which are passed into the training job
hyperparameters = {
    'model_id': model_id,                           # pre-trained model
    'dataset_path': '/opt/ml/input/data/training',  # path where sagemaker will save training dataset
    'epochs': 3,                                    # number of training epochs
    'per_device_train_batch_size': 4,               # batch size for training
    'lr': 2e-4,                                     # learning rate used during training
}

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point          = 'run_clm.py',        # train script
    source_dir           = 'scripts',           # directory which includes all the files needed for training
    instance_type        = 'ml.g5.12xlarge',    # instance type used for the training job
    instance_count       = 1,                   # the number of instances used for training
    base_job_name        = job_name,            # the name of the training job
    role                 = role,                # IAM role used in the training job to access AWS resources, e.g. S3
    volume_size          = 300,                 # the size of the EBS volume in GB
    transformers_version = '4.28',              # the transformers version used in the training job
    pytorch_version      = '2.0',               # the pytorch version used in the training job
    py_version           = 'py310',             # the python version used in the training job
    hyperparameters      = hyperparameters,
    environment          = {"HUGGINGFACE_HUB_CACHE": "/tmp/.cache"},  # set env variable to cache models in /tmp
)

We can now start our training job by calling the .fit() method and passing our S3 path to the training script.

# define a data input dictionary with our uploaded s3 uris
data = {'training': training_input_path}

# starting the train job with our uploaded datasets as input
huggingface_estimator.fit(data, wait=True)

In our example, the SageMaker training job took 53,405 seconds, which is about 14.8 hours. The ml.g5.12xlarge instance we used costs $7.09 per hour for on-demand usage, so the total cost for training our fine-tuned Falcon-40B model was roughly 14.8 h × $7.09/h ≈ $105.

Next Steps

You can deploy your fine-tuned model to a SageMaker endpoint and use it for inference. Check out the blog posts Deploy Falcon 7B & 40B on Amazon SageMaker and Securely deploy LLMs inside VPCs with Hugging Face and Amazon SageMaker for more details.
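
As a rough, hedged sketch of what such a deployment could look like with the Hugging Face LLM inference container (the container version, environment variables, and instance type are assumptions; see the linked posts for the full walkthrough):

from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# create a model object pointing at the model.tar.gz produced by the training job
llm_model = HuggingFaceModel(
    role=role,
    model_data=huggingface_estimator.model_data,                       # trained model artifact in S3
    image_uri=get_huggingface_llm_image_uri("huggingface", version="0.8.2"),  # TGI container (assumed version)
    env={"HF_MODEL_ID": "/opt/ml/model", "SM_NUM_GPUS": "4"},          # load the local artifact across 4 GPUs
)

# deploy to a real-time endpoint
predictor = llm_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
)

# run inference with our instruction format
print(predictor.predict({"inputs": "### Instruction\nWhat is Amazon SageMaker?\n\n### Answer\n"}))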