Multilingual Sentence Embedding with LLM and PEFT LoRA
In this article, we'll be taking a look at training a multilingual sentence embedding model with a Large Language Model (LLM) and a parameter-efficient fine-tuning technique: LoRA (Low Rank Adaptation).
LLM For Retrieval
Large Language Models (LLMs) with billions of parameters, fine-tuned to follow instructions, have showcased remarkable capabilities on many NLP tasks. Consequently, there's a growing interest in harnessing these models as retrieval systems, e.g. LLaMA-2 [7], GPT [10], and Mistral [11].
RepLLaMA/RankLLaMA [7] leverages LLaMA-2-7B as the backbone model for training retrieval and re-ranker models. Previous work on dense retriever models often uses bi-directional encoder models like BERT, taking the representation of the prepended special [CLS] token or average pooling as the sentence embedding. Given LLaMA is a uni-directional, decoder-only model, an end-of-sequence token </s> is appended instead, and its final hidden state serves as the sentence embedding.
To address the high GPU memory cost associated with fine-tuning large models with contrastive learning, they leverage memory-efficiency solutions such as LoRA, flash attention, and gradient checkpointing. The model is trained on 16 x 32GB V100 GPUs with a batch size of 128, using hard negatives drawn from a blend of BM25 and CoCondenser results so that the hard negatives are derived from both sparse and dense retrieval.
Apart from potent performance when evaluated on the in-domain MS MARCO dataset and zero-shot evaluation on the BEIR benchmark suite, this approach also offers the advantage that modern LLMs are often pre-trained with longer context windows.
LoRA
In the modern pre-trained transformer era, many applications rely on fine-tuning one large pre-trained model for multiple downstream tasks. Given the high cost associated with full fine-tuning, many have sought to adapt only a subset of the parameters, i.e. freezing the base layers. LoRA (Low Rank Adaptation) [9] presents an alternative approach by representing the weight update with two low rank matrices.
Quoting the LoRA paper: given a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, we constrain its update by representing it with a low-rank decomposition $W_0 + \Delta W = W_0 + BA$, where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and the rank $r \ll \min(d, k)$. During training $W_0$ is frozen, while $A$ and $B$ contain the trainable parameters. Both sets of matrices receive the same input during the forward pass: $h = W_0 x + \frac{\alpha}{r} B A x$, where $\alpha$ is a scaling constant. At the beginning, $A$ is initialized with a random Gaussian and $B$ with zeros, so the update $\Delta W = BA$ starts out at zero.
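To make this update concrete, below is a minimal PyTorch sketch of a LoRA-wrapped linear layer. The class name and hyperparameter defaults are ours for illustration; in the rest of this article we rely on the peft library rather than hand-rolling this.

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch: wraps a frozen linear layer W0 with a trainable low rank update B @ A."""

    def __init__(self, base_linear: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base_linear
        for param in self.base.parameters():
            param.requires_grad_(False)  # W0 (and its bias) stay frozen
        # A: (r, in_features) random Gaussian, B: (out_features, r) zeros,
        # so B @ A starts as a zero update and training begins exactly at W0.
        self.lora_A = nn.Parameter(torch.randn(r, base_linear.in_features) / math.sqrt(r))
        self.lora_B = nn.Parameter(torch.zeros(base_linear.out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W0 x + (alpha / r) * B A x
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)
```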

Its advantages:
A pre-trained model can be shared and used to build many small LoRA modules for different tasks.
Compared to full fine-tuning, training becomes more efficient as it drastically reduces the number of trainable parameters. This lowers the hardware barrier as well as accelerates the training cycle, especially when it comes to billion-parameter pre-trained models.
Its linear design allows us to merge LoRA's trainable matrices with the original frozen weights, effectively introducing zero additional inference latency compared to the original model.
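As a small illustration of that last point, peft's merge_and_unload folds the low rank update back into the frozen weights, so the merged model has exactly the same architecture and latency as the original. A sketch, assuming a peft-wrapped model like the one we build later on:

```python
# `peft_model` is assumed to be a peft.PeftModel produced by get_peft_model(...)
merged_model = peft_model.merge_and_unload()  # folds (alpha / r) * B A into the base weights
merged_model.save_pretrained("bloomz_sentence_embedding_merged")  # hypothetical output path
```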
Data
We'll be utilizing the bloomz model family as our tokenizer/model. While we have the flexibility to substitute it with any other Large Language Model (LLM), we've opted for the bloomz family for its multilingual capabilities.
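As a quick sketch, loading one of the bloomz checkpoints could look like the following; the specific 1b7 checkpoint is our assumption to match the 1.7B model mentioned later, and any other causal LM could be swapped in:

```python
from transformers import AutoModel, AutoTokenizer

model_name = "bigscience/bloomz-1b7"  # assumed checkpoint size; other bloomz sizes or LLMs also work
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModel.from_pretrained(model_name)  # bare backbone without the LM head
```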
ESCI
For our dataset, we take inspiration from one of the examples in the peft library's documentation [2]. Specifically, we'll be using a small subset of the ESCI e-commerce search query dataset that's conveniently available on the Hugging Face hub. The ESCI dataset [3] [8], available in multiple languages including English, Japanese, and Spanish, consists of challenging search queries (such as those involving negations: "energy bar without nuts" or "gluten-free biscuits") paired with up to 40 search results, along with their ESCI (Exact, Substitute, Complement, Irrelevant) judgments. Our task at hand will be to train a model for retrieving similar products for a given query.
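For reference, the subset used in the peft documentation example can be loaded straight from the Hugging Face hub; the dataset id below is taken from that example and should be treated as an assumption rather than the only option:

```python
from datasets import load_dataset

# dataset id from the peft "LoRA for semantic similarity tasks" example (assumed)
dataset = load_dataset("smangrul/amazon_esci")
print(dataset["train"][0])  # a query / product pair along with its relevance label
```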
Model
The next code chunk defines a huggingface-compatible SentenceEmbeddingModel for training a retrieval model using contrastive learning. For the actual LoRA experimentation, we'll directly leverage the peft library.
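The full implementation lives in the accompanying notebook; the snippet below is only a minimal sketch of what such a wrapper might look like, using last-token pooling and an in-batch-negative contrastive loss, with class and argument names of our own choosing.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceEmbeddingModel(nn.Module):
    """Sketch: wraps a decoder-only backbone and trains it with in-batch negatives."""

    def __init__(self, backbone: nn.Module, temperature: float = 0.05):
        super().__init__()
        self.backbone = backbone
        self.temperature = temperature

    def encode(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        # use the hidden state of the last non-padded token as the sentence embedding (right padding assumed)
        last_token_idx = attention_mask.sum(dim=1) - 1
        batch_idx = torch.arange(hidden.size(0), device=hidden.device)
        return F.normalize(hidden[batch_idx, last_token_idx], dim=-1)

    def forward(self, query_inputs, product_inputs):
        query_embed = self.encode(**query_inputs)
        product_embed = self.encode(**product_inputs)
        # in-batch negatives: the i-th product is the positive for the i-th query
        scores = query_embed @ product_embed.T / self.temperature
        labels = torch.arange(scores.size(0), device=scores.device)
        return F.cross_entropy(scores, labels)
```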
As part of our LoraConfig, we need to specify target_modules, which checks whether the specified substring appears in a module's full name. LoRA can be applied to any module in our model, though the most common practice for transformer-style models is to apply it to the attention layer's key, value, and query matrices as well as its immediate feed forward layers.
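For bloom-style models, the attention projections live in a single fused query_key_value module, so a LoraConfig might look like the snippet below; the module names are specific to the bloom architecture and would differ for other backbones.

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,              # rank of the low rank update
    lora_alpha=16,    # the scaling constant alpha
    lora_dropout=0.05,
    # substrings matched against each module's full name: bloom's fused attention
    # projection plus its two feed forward layers
    target_modules=["query_key_value", "dense_h_to_4h", "dense_4h_to_h"],
    bias="none",
)
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # typically well under 1% of the full model
```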
With our LoRA setup along with gradient checkpointing, we are able to train a 1.7B model using a single V100 GPU with a micro batch size of 64.
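Gradient checkpointing can be enabled directly on the backbone; when combining it with peft, the inputs also need to require gradients so the checkpointed activations stay connected to the adapter parameters. A brief sketch under those assumptions:

```python
# trade compute for memory by re-computing activations during the backward pass
base_model.gradient_checkpointing_enable()
# required when the first layers are frozen and only the LoRA adapters are trainable
base_model.enable_input_require_grads()
```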
Evaluation
The evaluation process involves:
Generating embeddings for both distinct queries and products (corpus).
Retrieving the top-k products using FAISS's flat index, i.e. exact cosine similarity.
Computing evaluation metrics, in this case recall@k (a minimal sketch follows this list).
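A sketch of this evaluation loop with faiss is shown below, assuming query_embeddings and product_embeddings are L2-normalized numpy arrays and label_products maps each query index to its set of relevant product ids (the function and variable names are ours):

```python
import faiss
import numpy as np

def recall_at_k(query_embeddings, product_embeddings, label_products, k=10):
    # a flat inner product index; on L2-normalized vectors this is exact cosine similarity
    index = faiss.IndexFlatIP(product_embeddings.shape[1])
    index.add(product_embeddings.astype(np.float32))
    _, topk_ids = index.search(query_embeddings.astype(np.float32), k)

    recalls = []
    for query_idx, retrieved in enumerate(topk_ids):
        relevant = label_products[query_idx]
        recalls.append(len(relevant & set(retrieved.tolist())) / len(relevant))
    return float(np.mean(recalls))
```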
We conclude this article by offering some guidance on training with LoRA as well as on decoder-based retrieval models.
LoRA:
The most critical LoRA hyperparameter is how many LoRA adapters are used in total: applying LoRA to all linear transformer block layers is required to match full fine-tuning's performance. Other parameters, such as the projection dimension, don't affect performance as much. In other words, it's preferable to adapt more weight matrices than to adapt a single type of weights with a larger rank.
When training with LoRA, a lower learning rate as well as more steps might be required to match full fine-tuning's performance.
The effectiveness of LoRA might be task-dependent. Compared to full fine-tuning, LoRA might stumble when encountering more challenging tasks such as mathematical reasoning [5].
Personally, LoRA feels very much akin to the matrix factorization / factorization machines family of methods, with a twist.
Decoder retrieval models:
Exploring LLMs' usage for embeddings has garnered quite some interest, with good reason: e.g. Improving Text Embeddings with Large Language Models [11] showed that using an LLM (Mistral 7B) as the backbone, trained on synthetic data along with a moderate amount of labeled text pairs, is sufficient to obtain high-quality embeddings, foregoing the need for large amounts of labeled text pairs.
Keep in mind that apart from performance, there's also the cost of operating these large LLMs for the embedding use case, both from an inference speed perspective and a storage perspective (billion-parameter-scale LLMs typically generate embeddings with a larger hidden dimension, 2048+) [6].
Reference
[1] PEFT Documentation: LoRA
[2] PEFT Documentation: LoRA for semantic similarity tasks
[3] Shopping Queries Dataset: A Large-Scale ESCI Benchmark for Improving Product Search
[4] LoRA From Scratch – Implement Low-Rank Adaptation for LLMs in PyTorch
[5] Fine-Tuning LLMs: LoRA or Full-Parameter? An in-depth Analysis with Llama 2
[6] OpenAI GPT-3 Text Embeddings - Really a new state-of-the-art in dense text embeddings?
[7] Xueguang Ma, Liang Wang, Nan Yang, Furu Wei, Jimmy Lin - Fine-Tuning LLaMA for Multi-Stage Text Retrieval (2023)
[8] Chandan K. Reddy, Lluís Màrquez, Fran Valero, Nikhil Rao, Hugo Zaragoza, Sambaran Bandyopadhyay, Arnab Biswas, Anlu Xing, Karthik Subbian - Shopping Queries Dataset: A Large-Scale ESCI Benchmark for Improving Product Search (2022)
[9] Edward J. Hu, Yelong Shen, et al. - LoRA: Low-Rank Adaptation of Large Language Models (2021)
[10] Arvind Neelakantan, Tao Xu, et al. - Text and Code Embeddings by Contrastive Pre-Training (2022)
[11] Liang Wang, Nan Yang, Furu Wei, et al. - Improving Text Embeddings with Large Language Models (2024)