Copyright Information
Laboratory 3: Large Language Model (LLM) Fine-tuning
In this lab, you will fine-tune a multi-billion parameter large language model (LLM). We will go through several fundamental concepts of LLMs, including tokenization, templates, and fine-tuning. This lab provides a complete pipeline for fine-tuning a language model to generate responses in a specific style, and you will explore not only language model fine-tuning, but also ways to evaluate the performance of a language model.
You will use Google's Gemma 2B model as the base language model to fine-tune; Liquid AI's LFM-40B as an evaluation "judge" model; and Comet ML's Opik as a framework for streamlined LLM evaluation.
First, let's download the MIT deep learning package, install dependencies, and import the relevant packages we'll need for this lab.
Part 1: Fine-tuning an LLM for style
In the first part of this lab, we will fine-tune an LLM as a chatbot that can generate responses in a specific style. We will use the Gemma 2B model as the base language model to fine-tune.
1.1: Templating and tokenization
1.1.1: Templating
Language models that function as chatbots are able to generate responses to user queries -- but how do they do this? We need to provide them with a way to understand the conversation and generate responses in a coherent manner -- some structure defining what the inputs and outputs are.
Templating is a way to format inputs and outputs in a consistent structure that a language model can understand. It involves adding special tokens or markers to indicate different parts of the conversation, like who is speaking and where turns begin and end. This structure helps the model learn the proper format for generating responses and maintain a coherent conversation flow. Without templates, the model may not know how to properly format its outputs or distinguish between different speakers in a conversation.
Let's start by defining some basic templates for the chatbot, for turns where the user asks a question and the model responds with an answer.
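As a rough sketch of what these templates might look like, here is one possibility following Gemma's turn-marker format (treat the exact strings as illustrative; the lab's code cell defines the templates actually used):

```python
# Illustrative chat templates in the Gemma turn-marker style.
# The exact marker strings are an assumption -- the lab's code defines the real ones.
USER_TEMPLATE = "<start_of_turn>user\n{question}<end_of_turn>\n"
MODEL_TEMPLATE = "<start_of_turn>model\n{answer}<end_of_turn>\n"

def format_turn(question, answer):
    """Format one user question and model answer into a single templated string."""
    return USER_TEMPLATE.format(question=question) + MODEL_TEMPLATE.format(answer=answer)

print(format_turn("What is the capital of Ireland?", "Dublin."))
```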
1.1.2: Tokenization
To operate on language, we need to prepare the text for the model. Fundamentally we can think of language as a sequence of "chunks" of text. We can split the text into individual chunks, and then map these chunks to numerical tokens -- collectively this is the process of tokenization. Numerical tokens can then be fed into a language model.
There are several common approaches to tokenizing natural language text:
Word-based tokenization: splits text into individual words. While simple, this can lead to large vocabularies and does not handle unknown words well.
Character-based tokenization: splits text into individual characters. While this involves a very small vocabulary, it produces long sequences and loses word-level meaning.
Subword tokenization: breaks words into smaller units (subwords) based on their frequency. The most popular and commonly used approach is byte-pair encoding (BPE), which iteratively merges the most frequent character pairs. Modern language models typically use subword tokenization as it balances vocabulary size and sequence length while handling unknown words effectively by breaking them into known subword units.
In this lab we will use the tokenizer from the Gemma 2B model, which uses BPE. Let's load it and inspect it.
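A minimal sketch of loading a tokenizer with Hugging Face Transformers (the checkpoint name google/gemma-2b-it is an assumption here, and gated checkpoints require authenticating with a Hugging Face token):

```python
from transformers import AutoTokenizer

# Load the Gemma tokenizer from the Hugging Face Hub (checkpoint name assumed).
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")

print(tokenizer.vocab_size)          # size of the BPE vocabulary
print(tokenizer.special_tokens_map)  # special tokens such as <bos>, <eos>, <pad>
```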
We not only need to be able to tokenize the text into tokens (encode), but also de-tokenize the tokens back into text (decode). Our tokenizer will have:
an encode function to tokenize the text into tokens, and
a decode function to de-tokenize back to text so that we can read out the model's outputs.
Let's test out both steps and inspect to get a better understanding of how this works.
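For example, a round trip through the tokenizer might look like this (the exact token IDs you see will depend on the tokenizer):

```python
text = "Top o' the mornin' to ya!"

# Encode: text -> token IDs
token_ids = tokenizer.encode(text)
print(token_ids)

# Decode: token IDs -> text (skip special tokens such as <bos>)
print(tokenizer.decode(token_ids, skip_special_tokens=True))
```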
This is really cool. Now we have a way to move in and out of the token space.
To "chat" with our LLM chatbot, we need to use the tokenizer and the chat template together, in order for the model to respond to the user's question. We can use the templates defined earlier to construct a prompt for the model, without the answer.
If we were to feed this to the model, it would see that it is now the start of the model's turn, and it would generate the answer to this question.
1.2: Getting started with the LLM
Now that we have a way to prepare our data, we're ready to work with our LLM!
LLMs like Gemma 2B are trained on a large corpus of text, on the task of predicting the next token in a sequence, given the previous tokens. We call this training task "next token prediction"; you may also see it called "causal language modeling" or "autoregressive language modeling". We can leverage models trained in this way to generate new text by sampling from the predicted probability distribution over the next token.
Let's load the Gemma 2B model and start working with it. We will construct a prompt in chat template form and tokenize it. Then, we will feed it to the model to predict next token probabilities. Finally, we will get the next token (which is still numerical) and decode it to text.
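Here is a sketch of that pipeline, assuming the same google/gemma-2b-it checkpoint as above and a simple greedy choice of the most likely next token:

```python
import torch
from transformers import AutoModelForCausalLM

# Load the base model (checkpoint name assumed; the notebook specifies the exact one).
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it", torch_dtype=torch.bfloat16)

# Prompt in chat-template form, ending at the start of the model's turn.
prompt = "<start_of_turn>user\nWhat is the capital of Ireland?<end_of_turn>\n<start_of_turn>model\n"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits        # shape: (batch, seq_len, vocab_size)

next_token_logits = logits[0, -1]          # distribution over the very next token
next_token_id = int(torch.argmax(next_token_logits))
print(tokenizer.decode([next_token_id]))   # the single most likely next token, as text
```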
Note that the model does not predict the full answer to the question; it only predicts the next token in the sequence! For more complex questions, we can't just generate one token; rather, we need to generate a sequence of tokens.
This can be done by doing the process above iteratively, step by step -- after each step we feed the generated token back into the model and predict the next token again.
Instead of doing this manually ourselves, we can use the model's built-in model.generate() functionality (supported by HuggingFace's Transformers library) to generate max_new_tokens number of tokens, and decode the output back to text.
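For instance (the parameter values here are illustrative):

```python
# Generate up to 64 new tokens autoregressively; do_sample=False gives greedy decoding.
output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)

# Decode only the newly generated tokens (everything after the prompt).
new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```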
Now we have the basic pipeline for generating text with an LLM!
1.3: Fine-tuning
Fine-tuning is a technique that allows us to adapt a pre-trained neural network to better suit a downstream task, domain, or style by training it further on new data. By continuing training on a carefully curated dataset, we can modify the model's behavior, style, or capabilities. Fine-tuning is used in a variety of applications, not just language modeling. In language modeling, fine-tuning can be used to:
Adapt the model's writing style
Improve performance on specific tasks or domains
Teach the model new capabilities or knowledge
Reduce unwanted behaviors or biases
In this lab, you will fine-tune the Gemma LLM to adapt the model's writing style. Recall that in Lab 1 you built out an RNN-based sequence model to generate Irish folk songs. Continuing with our Irish theme, we will first fine-tune the LLM to chat in the style of a leprechaun.
We have prepared a question-answer dataset where the questions are in standard English style (i.e. "base" style) and the answers are in "leprechaun" style (written by another LLM). Let's load the dataset and inspect it.
1.3.1: Chat function
Before we start fine-tuning, we will build a function to easily chat with the model, both to monitor its progress over the course of fine-tuning and to generate responses to questions.
Recall our core steps from before:
Construct the question prompt using the template
Tokenize the text
Feed the tokens through the model to predict the next token probabilities
Decode the predicted tokens back to text
Use these steps to build out the chat function below.
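A sketch of such a chat function, under the same template and tokenizer assumptions as the earlier snippets:

```python
def chat(question, max_new_tokens=128):
    """Ask the model a question and return its decoded answer."""
    # 1) Construct the question prompt, ending at the start of the model's turn.
    prompt = USER_TEMPLATE.format(question=question) + "<start_of_turn>model\n"

    # 2) Tokenize the text.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # 3) Generate tokens autoregressively.
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # 4) Decode only the newly generated tokens back to text.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```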
Let's try chatting with the model now to test if it works! We have a sample question here (continuing with the Irish theme); feel free to try out other questions!
1.3.2: Parameter-efficient fine-tuning
In fine-tuning, the weights of the model are updated to better fit the fine-tuning dataset and/or task. Updating all the weights in a language model like Gemma 2B -- which has ~2 billion parameters -- is computationally expensive. There are many techniques to make fine-tuning more efficient.
We will use a technique called LoRA -- low-rank adaptation -- to make the fine-tuning process more efficient. LoRA is a way to fine-tune LLMs very efficiently by only updating a small subset of the model's parameters, and it works by adding trainable low-rank matrices to the model. While we will not go into the details of LoRA here, you can read more about it in the LoRA paper. We will use the peft library to apply LoRA to the Gemma model.
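A minimal sketch of wrapping the model with LoRA adapters via peft (the rank, scaling factor, and target modules below are illustrative choices, not necessarily the lab's exact configuration):

```python
from peft import LoraConfig, get_peft_model

# Illustrative LoRA configuration: rank-8 adapters on the attention projections.
lora_config = LoraConfig(
    r=8,                    # rank of the low-rank update matrices
    lora_alpha=16,          # scaling factor applied to the LoRA update
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of the weights are trainable
```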
1.3.3: Forward pass and loss computation
Now let's define a function to perform a forward pass through the LLM and compute the loss. The forward pass gives us the logits, which reflect the predicted probability distribution over the next token at each position. We can compute the loss by comparing the predicted logits to the true next token -- our target label. Note that this is effectively a classification problem! So, our loss can be captured by the cross-entropy loss, and we can use PyTorch's nn.functional.cross_entropy function to compute it.
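A sketch of such a loss function, assuming a 0/1 mask marking which token positions belong to the answer (see the next subsection for where that mask comes from):

```python
import torch.nn.functional as F

def forward_and_loss(model, input_ids, answer_mask):
    """Next-token cross-entropy loss, counted only over answer tokens.

    input_ids:   (batch, seq_len) token IDs of the templated question + answer
    answer_mask: (batch, seq_len) 1 where the token is part of the answer, else 0
    """
    logits = model(input_ids).logits          # (batch, seq_len, vocab_size)

    # Position t predicts token t+1: shift logits and targets by one.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    shift_mask = answer_mask[:, 1:].float()

    # Per-token cross entropy, then average over answer tokens only.
    per_token_loss = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        reduction="none",
    )
    return (per_token_loss * shift_mask.reshape(-1)).sum() / shift_mask.sum()
```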
1.3.4: Training loop for fine-tuning
With this function to compute the loss, we can now define a training loop to fine-tune the model using LoRA. This training loop has the same core components as we've seen before in other labs:
Grab a batch of data from the dataset (using the DataLoader)
Feed the data through the model to complete a forward pass and compute the loss
Backward pass to update the model weights
The data in our DataLoader is initially text, and is not structured in our question-answer template. So in step (1) we will need to format the data into our question-answer template previously defined, and then tokenize the text.
We care about the model's answer to the question; the "answer" tokens are the part of the text we want to predict and compute the loss for. So, after tokenizing the text we need to denote to the model which tokens are part of the "answer" and which are part of the "question". We can do this by computing a mask for the answer tokens, and then using this mask to compute the loss.
Finally, we will complete the backward pass to update the model weights.
Let's put this all together in the training loop below.
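As a rough outline (the batch field names, the tokenize_batch helper, and the hyperparameters are assumptions standing in for the lab's own code):

```python
from torch.utils.data import DataLoader

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loader = DataLoader(train_dataset, batch_size=4, shuffle=True)  # train_dataset assumed

model.train()
for step, batch in enumerate(loader):
    # (1) Format the raw question/answer text with the chat template and tokenize it;
    #     tokenize_batch is a hypothetical placeholder for the lab's formatting + masking code.
    input_ids, answer_mask = tokenize_batch(batch["question"], batch["answer"])

    # (2) Forward pass and masked loss over the answer tokens.
    loss = forward_and_loss(model, input_ids, answer_mask)

    # (3) Backward pass and weight update (only the LoRA parameters are trainable).
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```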
Let's try chatting with the model again to see how it has changed!
Part 2: Evaluating a style-tuned LLM
How do we know if the model is doing well? How closely does the model's style match the style of a leprechaun? As you can see from the example above, determining whether a generated response is good or not can seem qualitative, and it can be hard to measure how well the model is doing.
While benchmarks have been developed to evaluate the performance of language models on a variety of tasks, these benchmarks are not always representative of the real-world performance of the model. For example, a model may perform well on a benchmark but poorly on a more realistic task. Benchmarks are also limited in the scope of tasks they can cover and capabilities they can reflect, and there can be concerns about whether the data in the benchmark was used to train the model. Synthetic data generation and synthetic tasks are a way to address these limitations, and this is an active area of research.
We can also turn a qualitative evaluation of a generated response quantitative by deploying someone or something to "judge" the outputs. In this lab, we will use a technique called LLM as a judge to do exactly this. This involves using a larger LLM to score the outputs of a smaller LLM. The larger LLM is used as a judge, and it is given a system prompt that describes the task we want the smaller LLM to perform and the judging criteria. A "system prompt" is a way to set the general context and guide an LLM's behavior. Contextualized with this system prompt, the judge LLM can score the outputs of the smaller LLM, and we can use this score to evaluate how well the smaller LLM is doing.
2.1: Fine-tune well, you must!
Our leprechaun-tuned model is already pretty good at generating responses in the leprechaun style. It must be the luck of the Irish.
Let's make things more interesting by considering a different style, one that has some clear patterns but also a lot of variability and room for creativity. We will use the style of Yoda from Star Wars.
Your goal is to try to fine-tune your model to generate responses in the Yoda style, use the LLM judge to evaluate how well the outputs of your chat model follow Yoda speak, and then use that information to improve the model.
Start by defining a system prompt for the judge LLM, setting the context that it will evaluate how well the outputs of your chat model follow Yoda speak. Experiment with different system prompts to see how they affect the judge LLM's evaluation! Keep in mind that a better judge LLM will give you a better evaluation of how well your Yoda model is doing, and that a better evaluation will help you improve your Yoda model.
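One possible starting point for such a system prompt (purely illustrative; refining it is part of the exercise):

```python
# An illustrative judge system prompt -- experiment with your own wording and criteria.
JUDGE_SYSTEM_PROMPT = """You are an expert judge of Yoda-speak from Star Wars.
Given a passage of text, rate on a scale from 1 (standard English, no Yoda
characteristics) to 10 (unmistakably Yoda: inverted word order, characteristic
phrasing, wise and terse tone) how well the passage matches Yoda's speaking
style. Respond with only the integer score."""
```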
2.2: Setting up the judge LLM
In LLM as a judge, we need to use a model that is larger (and therefore more capable) than our "performer" model, in our case the style fine-tuned Gemma 2B. Since it is infeasible to load larger models locally into notebooks, you will gain experience interfacing with these larger LLMs through an API served on OpenRouter.
You will need to sign up for an OpenRouter account and then generate an API key. Running powerful LLMs of this scale costs money -- for students in the in-person course, we can provide a credit to your OpenRouter account to allow you to run this lab. Come to office hours to receive your credit.
Through the OpenRouter interface, you will be able to experiment with different judge LLMs -- here we have suggested two possible larger LLMs to get you started: Liquid AI's LFM-40B and Google's Gemma 9B. Note there are also free models available on OpenRouter (e.g., gemma-2-9b-it:free), but these will run into rate limitations if you query them too heavily.
We have defined a simple class, LLMClient, to interact with the OpenRouter API. This class has a method ask that takes a user prompt and returns the model's response. Keep in mind that the judge LLM's response will be conditioned on the system prompt you provide -- the system prompt is critical to set the criteria for the evaluation!
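As a rough sketch of what such a client can look like under the hood, here is a minimal version that calls OpenRouter's OpenAI-compatible chat completions endpoint directly with requests (the lab's actual LLMClient may differ in its details, and the model ID is an assumption):

```python
import os
import requests

class SimpleLLMClient:
    """Minimal OpenRouter client -- an illustrative stand-in for the lab's LLMClient."""

    def __init__(self, model, system_prompt):
        self.model = model
        self.system_prompt = system_prompt
        self.api_key = os.environ["OPENROUTER_API_KEY"]  # set this from your OpenRouter account

    def ask(self, user_prompt):
        """Send the system + user prompt to the judge model and return its reply text."""
        response = requests.post(
            "https://openrouter.ai/api/v1/chat/completions",
            headers={"Authorization": f"Bearer {self.api_key}"},
            json={
                "model": self.model,
                "messages": [
                    {"role": "system", "content": self.system_prompt},
                    {"role": "user", "content": user_prompt},
                ],
            },
        )
        return response.json()["choices"][0]["message"]["content"]

# Model ID assumed -- check OpenRouter for the exact identifier.
judge = SimpleLLMClient("liquid/lfm-40b", JUDGE_SYSTEM_PROMPT)
```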
2.3: Defining the evaluation metric
Great! We have set up our judge LLM, but we still need to make this quantitative. We can do this by defining a metric that uses the judge LLM to score the outputs of the model. Doing this is streamlined with Comet ML's Opik library, a platform for LLM evaluation and benchmarking.
In prior labs, we used Comet for experiment tracking, so you should already have an account and API key; if not, you can sign up for a Comet account here. Now we will use the Comet Opik library to define a metric that uses the judge LLM to score the outputs of the model.
Opik has a base class for defining metrics, base_metric.BaseMetric. You will use this to define a custom metric that uses the judge LLM to evaluate text for how well it adheres to Yoda speak. Note that the judge LLM and the metric can be applied to any text, not just the outputs of the model. This is important to keep in mind, since we need both a negative control -- text in the "base" standard English style -- and a positive control -- training-set text in Yoda-speak style -- against which to compare the model's generations. Set the judging criteria in the system prompt, and define the score function to evaluate text by querying the judge LLM.
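A sketch of what this custom metric might look like (the class name LLMJudgeEvaluator comes from the lab, but the constructor arguments and score-parsing logic here are approximations):

```python
from opik.evaluation.metrics import base_metric, score_result

class LLMJudgeEvaluator(base_metric.BaseMetric):
    """Scores text for Yoda-speak style by querying a judge LLM."""

    def __init__(self, judge, name="yoda_style_score"):
        super().__init__(name=name)
        self.judge = judge  # an LLM client whose behavior is conditioned on the system prompt

    def score(self, text, **kwargs):
        # Ask the judge for a rating and parse the leading number from its reply.
        reply = self.judge.ask(f"Rate this text:\n\n{text}")
        value = float(reply.strip().split()[0])
        return score_result.ScoreResult(name=self.name, value=value)
```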
Instantiate your Comet Opik judge using the LLMJudgeEvaluator class and system prompt.
2.4: Evaluating the model by scoring with your judge LLM
Now we can use the judge LLM to score the outputs of the model. We will use the scoring_function to score text using the judge LLM.
Feed in a few probe sentences to get a vibe check on the judge LLM.
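For instance, a quick sanity check might look like this (the probe sentences are arbitrary):

```python
probes = [
    "The weather is lovely today.",                         # plain English -- should score low
    "Lovely today, the weather is. Enjoy it, you should.",  # Yoda-like -- should score high
]
for text in probes:
    print(scoring_function(text), "<-", text)
```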
We will evaluate how well our fine-tuned model is doing by scoring the outputs of the model, as well as our base-style text (negative control) and the training-set text in Yoda-speak style (positive control).
Generate text from your model by asking it new questions.
Let's also collect some base-style text (base_samples) and the training-set text in Yoda-speak style (style_samples). For these, we won't need to generate text, since we already have the text in the dataset.
Now that we have our samples, we can score them using the judge LLM. We will use a multiprocessed scoring function to score the samples in parallel, because each sample is independent and we can submit them all as simultaneous requests to the judge LLM.
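A sketch of parallel scoring with a thread pool (the judge requests are I/O-bound, so threads are a natural fit; the variable name generated_samples is assumed, and the lab's own helper may differ):

```python
from concurrent.futures import ThreadPoolExecutor

def score_samples(samples, max_workers=8):
    """Score a list of texts in parallel by issuing concurrent judge requests."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(scoring_function, samples))

generated_scores = score_samples(generated_samples)  # model outputs (variable name assumed)
base_scores = score_samples(base_samples)            # negative control
style_scores = score_samples(style_samples)          # positive control
```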
Look at the average scores for each of the three types of text -- what do you observe?
We can also plot the distribution of scores for each of the three types of text.
Use these observations to improve your model. Remember that the judge LLM is not perfect, and you can try to improve the judge LLM to better evaluate the model's outputs. A better judge LLM will give you a better evaluation of how well your Yoda model is doing, and that better evaluation will help you improve your Yoda model.
2.5: Conclusion
Experiment with both your chat model and your judge LLM to try to improve the quality of the Yoda-speak. The competition for this lab will be based on the following criteria:
Likelihood of true Yoda-speak under your chat model: the better your chat model understands Yoda-speak, the lower the cross-entropy loss it will assign to language that is true Yoda-speak. At the end of this lab, you will evaluate the likelihood of a held-out test sample of true Yoda-speak under your chat model. Include this likelihood in your report. This gives us a quantitative measure to compare different chat models (which may have interacted with different judge LLMs).
Experiments and changes you tried to improve your chat model: include a description of changes you made and the results you observed.
IMPORTANT: RUN THE FOLLOWING CELL BELOW TO PRINT THE RESULT BUT DO NOT MODIFY ITS CONTENTS.
Submission information
To enter the competition, please upload the following to the lab submission site for the Large Language Models Lab:
a Jupyter notebook with the code you used to generate your results;
a copy of the bar plot showing the judge LLM's scores for text in base style, generated text, and text in true Yoda-speak style;
a written description of the modifications you made and the experiments you tried;
a written discussion of why and how these modifications changed performance;
the numerical result of the last cell in this notebook.
Submissions without the result of the last cell will be automatically disqualified.
Name your file in the following format: [FirstName]_[LastName]_LLM, followed by the file format (.zip, .ipynb, .pdf, etc.). ZIP files are preferred over individual files. If you submit individual files, you must name the individual files according to the above nomenclature (e.g., [FirstName]_[LastName]_LLM_Report.pdf, etc.).