LLM Pairwise Judge
In this article, we'll be implementing an LLM pairwise judge, where an LLM is presented with a question and two answers, and is tasked with determining which answer is better or declaring a tie. Using LLMs as judges for evaluation offers several benefits:
Scalability: Compared to obtaining ground truth labels from human evaluators, LLM inference is generally faster and more cost-effective.
Explainability: Unlike metrics such as BLEU or ROUGE, which primarily measure variants of text overlap, or re-ranker based relevance models, LLMs can also generate reasoning or explanations alongside their scores, providing more interpretable evaluations.
Versatility: LLMs can be fine-tuned or adapted to judge outputs across various domains, tasks, and languages, offering a versatile evaluation framework. This makes LLMs more suitable for evaluating diverse instruction-following and conversational abilities.
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena [5] provides a thorough examination of using LLMs as judges. The researchers curated two distinct benchmark suites for this purpose:
MT-Bench: A benchmark consisting of 80 high-quality multi-turn questions.
Chatbot Arena: An innovative crowdsourcing benchmark platform featuring anonymous battles. On this platform, users can interact with two anonymous chatbot models simultaneously by posing the same question to both. They then vote for the model that provides their preferred response, with the models' identities revealed only after the voting process. Unlike traditional benchmarks that rely on predefined questions, Chatbot Arena enables users to ask any question they desire, effectively capturing a wide range of evaluations "in the wild".
They verify that a state-of-the-art LLM, GPT-4, used as a judge, is capable of matching human evaluation at an agreement rate exceeding 80%.
LLM Generation
We'll first implement a generation module for generating responses from an LLM. We use the Qwen 2.5 Collection [2] in this article; feel free to pick your favorite LLM. While doing so, be sure to set the correct padding token and padding side, as well as configure max_new_tokens [1].
Huggingface's generate function returns up to 20 tokens by default if max_new_tokens is not explicitly specified in GenerationConfig. LLMs (decoder only models) also return the input prompt as part of the output by default, so we'll need some post-processing to crop those input prompts out if that is not the desired behaviour.
Similar to other tasks, when operating on a batch of inputs whose prompts have varying lengths, the prompts need to be padded to a consistent length. Since LLMs often don't have a default pad token and are not trained to continue from pad tokens, be sure to assign a pad token (e.g. the eos token) and left pad the inputs. A minimal sketch of these settings follows.
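The sketch below ties these points together for batch generation with Hugging Face transformers. The model name, example prompts, and max_new_tokens value are illustrative assumptions rather than the article's exact configuration.

```python
# A minimal sketch of batch generation with left padding and prompt cropping.
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # assumption: any Qwen 2.5 chat model works here
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# LLMs often ship without a pad token; re-use the eos token and left pad
# so that generation continues from real tokens rather than padding.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

prompts = [
    "What is the capital of France?",
    "Explain beam search in one sentence.",
]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

generation_config = GenerationConfig(
    max_new_tokens=256,  # the default is only 20 if left unspecified
    pad_token_id=tokenizer.pad_token_id,
)
outputs = model.generate(**inputs, generation_config=generation_config)

# decoder only models echo the input prompt; crop it out before decoding
responses = tokenizer.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(responses)
```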
LLM Pairwise Judge
The pairwise judge's implementation (the prompt used) is inspired by huggingface's HfPairwiseJudge. At the time of writing, its backend relies on their own inference client, which imposes some restrictions on the model sizes free-tier users are allowed to use.
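To make the setup concrete, here is a sketch of a judge prompt and verdict parser in the same spirit. The prompt wording and the parse_verdict helper are assumptions for illustration, not huggingface's exact HfPairwiseJudge prompt.

```python
# A sketch of a pairwise judge prompt and output parser. The wording is an
# assumption in the spirit of HfPairwiseJudge, not its exact prompt.
JUDGE_PROMPT_TEMPLATE = """\
You are a fair judge. Given a question and two candidate answers, decide which
answer is better in terms of helpfulness, relevance, and accuracy.

Question: {question}

Answer 0: {answer_0}

Answer 1: {answer_1}

Respond with only "0" if Answer 0 is better, or "1" if Answer 1 is better.
"""


def parse_verdict(judge_output: str) -> int:
    """Map the judge's raw text output to 0, 1, or -1 when it can't be parsed."""
    text = judge_output.strip()
    if text.startswith("0"):
        return 0
    if text.startswith("1"):
        return 1
    return -1
```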
Our judge will also attempt to handle position bias: the tendency of an LLM to favor certain answer positions over others, regardless of the actual content or quality of the answers. A conservative approach for addressing this issue is to call the judge twice, swapping the order of the two answers, and only declare a win when an answer is preferred in both orders; if the results are inconsistent after swapping, a tie is declared. A more aggressive approach is to assign positions randomly, which averages out the bias at a large scale. In the following experiments, we use the conservative approach.
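A sketch of the conservative approach is shown below. The judge_fn callable is hypothetical: it would take a question and two ordered answers, query the judge LLM (e.g. by formatting the prompt template above), and return 0, 1, or -1 as in the parser above.

```python
# Conservative handling of position bias: call the judge twice with the
# answers' order swapped and only declare a winner when both orders agree.
# judge_fn is a hypothetical callable(question, answer_0, answer_1) -> 0 | 1 | -1.
def judge_with_position_swap(judge_fn, question, answer_a, answer_b):
    first = judge_fn(question, answer_a, answer_b)   # answer_a sits in position 0
    second = judge_fn(question, answer_b, answer_a)  # answer_a sits in position 1

    if first == 0 and second == 1:
        return "a"   # answer_a preferred in both orders
    if first == 1 and second == 0:
        return "b"   # answer_b preferred in both orders
    return "tie"     # inconsistent or unparseable verdicts
```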
References
[1] Huggingface Documentation: Generation with LLMs
[2] Huggingface Space Qwen 2.5 Collection
[3] Using LLM-as-a-judge for an automated and versatile evaluation
[4] Best Practices for LLM Evaluation of RAG Applications
[5] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng et al. - Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (2023)