Fine-Tuning a Pre-trained Encoder on a Question Answering Task
In this document, we'll be going over how to train an extractive question answering model using a pre-trained language encoder model via huggingface's transformers library. Throughout this process, we'll also:
Introduce some advanced tokenizer functionalities that are needed for token level tasks such as question answering.
Explain some of the pre-processing and post-processing steps that are needed for the question answering task.
There are many different forms of question answering, but the one we will be discussing today is termed open book extractive question answering. Open book means our model retrieves relevant information from some context, similar to open book exams where students can refer to their books for relevant information during an exam; in this setup, our model can look up information from external sources. Extractive means our model will extract the most relevant span of text, or snippet, from these contexts to answer an incoming question. Although span based answers are more constrained than free form answers, they come with the benefit of being easier to evaluate.
Similar to many modern recommendation systems out there, these systems have three main components: a vector database for storing our data encoded as vector representations, a retrieval model for efficiently retrieving the top-N contexts, and lastly a reader model that identifies the answer span of text from the retrieved contexts. In this document, we'll be focusing on the reader model.
To piggyback on today's pre-trained language models for reader model fine-tuning, we need two inputs, a question and a context, as well as two labels identifying the answer's start and end positions within that context. The following diagram depicts this notion very nicely [4].

Slightly more formally, after feeding our input sequence through the encoder and obtaining an embedding vector $T_i \in \mathbb{R}^H$ for every token $i$, we learn two additional weight vectors: one for the start position, $S \in \mathbb{R}^H$, and the other for the end position, $E \in \mathbb{R}^H$. These two weights are used to define, for each token, the probability of being the start and end position of the answer:

$$P_{\text{start}}(i) = \frac{e^{S \cdot T_i}}{\sum_j e^{S \cdot T_j}}, \qquad P_{\text{end}}(i) = \frac{e^{E \cdot T_i}}{\sum_j e^{E \cdot T_j}}$$
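As an illustration of this formulation, below is a minimal PyTorch-style sketch of such a question answering head (the class name, `hidden_dim`, and tensor shapes are assumptions for illustration, not the notebook's actual model code):

```python
import torch
import torch.nn as nn


class QAHead(nn.Module):
    """Minimal sketch: maps per-token encoder embeddings to start/end probabilities."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        # one weight vector for the start position and one for the end position,
        # packed together as a single linear layer with 2 outputs
        self.qa_outputs = nn.Linear(hidden_dim, 2)

    def forward(self, hidden_states: torch.Tensor):
        # hidden_states: (batch, seq_len, hidden_dim) from the pre-trained encoder
        logits = self.qa_outputs(hidden_states)              # (batch, seq_len, 2)
        start_logits, end_logits = logits.split(1, dim=-1)   # each (batch, seq_len, 1)
        start_logits = start_logits.squeeze(-1)
        end_logits = end_logits.squeeze(-1)
        # softmax over the sequence dimension: probability of each token being
        # the answer's start / end position
        return start_logits.softmax(dim=-1), end_logits.softmax(dim=-1)
```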
The dataset we'll be using is SQuAD (Stanford Question Answering Dataset). This dataset contains a question, a context, and potentially an answer, where the answer to every question is a segment of text, a.k.a. a span, from the corresponding context. We can decide whether to experiment with SQuAD or SQuAD 2.0. SQuAD 2.0 is a superset of the original dataset that adds unanswerable questions. This makes it more challenging to do well on version 2.0, as the model not only needs to identify correct answers, but also needs to determine when no answer is supported by a given context and abstain from spitting out unreliable guesses.
Printing out a sample, the field names are hopefully all quite self explanatory. The one thing worth clarifying is that the answer_start field contains the starting character index of each answer inside the corresponding context.
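To make the format concrete, here is a short sketch (assuming the datasets library and the squad_v2 dataset on the hub) of loading the data and printing one sample:

```python
from datasets import load_dataset

# use "squad" for version 1.1, "squad_v2" for the version with unanswerable questions
squad = load_dataset("squad_v2")
print(squad["train"][0])
# {'id': ..., 'title': ..., 'context': '...', 'question': '...',
#  'answers': {'text': ['...'], 'answer_start': [...]}}
```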
Tokenizer
After passing raw text through a tokenizer, a single word can be split into multiple tokens. e.g. in the example below, @huggingface is split into multiple tokens: @, hugging, and ##face. This can cause some issues for our token level labels, as our original label was mapped to the single word @huggingface. To resolve this, we'll need to use the offsets mapping returned by the tokenizer, which gives us, for each token, a tuple indicating its start and end character positions within the original text. For special tokens, the offset mapping's start and end positions will both be set to 0.
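As a quick illustration (the checkpoint name and example sentence are placeholders), a fast tokenizer returns these offsets when asked:

```python
from transformers import AutoTokenizer

model_checkpoint = "bert-base-uncased"  # assumed checkpoint for illustration
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

encoded = tokenizer("We love @huggingface", return_offsets_mapping=True)
print(encoded.tokens())
# e.g. ['[CLS]', 'we', 'love', '@', 'hugging', '##face', '[SEP]']
print(encoded["offset_mapping"])
# e.g. [(0, 0), (0, 2), (3, 7), (8, 9), (9, 16), (16, 20), (0, 0)]  <- (0, 0) for special tokens
```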
Another preprocessing detail specific to the question answering task is how to appropriately deal with long documents. In many other tasks, we typically truncate documents that are longer than our model's maximum sequence/sentence length, but here, removing some parts of the context might mean losing the section of the document that contains our answer. To deal with this, we will allow one (long) example in our dataset to give several input features by turning on return_overflowing_tokens. Commonly referred to as chunks, each chunk's length will be shorter than the model's maximum length (a configurable hyper-parameter). Also, just in case a particular answer lies at the point where we split a long context, we will allow some overlap between chunks/features, controlled by a hyper-parameter doc_stride, commonly known as a sliding window.
Our two input sentences/examples have been split into 8 tokenized features. From the overflow_to_sample_mapping field, we can see which original example each of these 8 features maps to.
The last thing we'll mention is the sequence_ids attribute. When feeding pairs of inputs to a tokenizer, we can use it to distinguish the first and second portion of a given sentence. In question answering, this will be helpful for identifying whether the predicted answer's start and end positions fall inside the context portion of a given document, instead of the question portion. If we look at a sample output, we'll notice that special tokens are mapped to None, whereas our context, which is passed as the second part of our paired input, receives a value of 1.
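Putting these pieces together, a hedged sketch of calling the tokenizer with overflowing chunks and inspecting sequence_ids (the questions, contexts, and hyper-parameter values below are arbitrary illustrations):

```python
# tokenizer as defined in the earlier cell
questions = ["Who created SQuAD?", "What does extractive mean?"]
contexts = ["<a long context paragraph>", "<another long context paragraph>"]

encoded = tokenizer(
    questions,
    contexts,
    truncation="only_second",        # never truncate the question, only the context
    max_length=384,
    stride=128,                      # doc_stride: overlap between consecutive chunks
    return_overflowing_tokens=True,  # one long example -> several chunks/features
    return_offsets_mapping=True,
    padding="max_length",
)
print(encoded["overflow_to_sample_mapping"])  # which original example each chunk came from
print(encoded.sequence_ids(0))  # None for special tokens, 0 for the question, 1 for the context
```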
Upon introducing these advanced tokenizer usages, the next few code cells showcase how to put them to use and create a function for preprocessing our question answer dataset into a format that's suited for downstream modeling (a rough sketch of such a function also follows these notes). Note:
When performing truncation, we should only truncate the context, never the question. This is configured via truncation="only_second".
Given that we split a single document into several chunks, it can happen that a given chunk doesn't contain a valid answer. In this case, we will set the question answer task's labels, start_position and end_position, to index 0 (the special token [CLS]'s index).
We'll be padding every feature to maximum length; as most of the contexts will be reaching that threshold, there's no real benefit to performing dynamic padding.
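The sketch below loosely follows the standard huggingface question answering preprocessing recipe; the exact hyper-parameter values are assumptions:

```python
# tokenizer as defined in the earlier cells
max_length = 384   # assumed maximum sequence length
doc_stride = 128   # assumed overlap between chunks


def prepare_train_features(examples):
    tokenized = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )
    sample_mapping = tokenized.pop("overflow_to_sample_mapping")
    offset_mapping = tokenized.pop("offset_mapping")

    start_positions, end_positions = [], []
    for i, offsets in enumerate(offset_mapping):
        input_ids = tokenized["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)
        sequence_ids = tokenized.sequence_ids(i)

        # one example can give several chunks, map each chunk back to its original answer
        answers = examples["answers"][sample_mapping[i]]
        if len(answers["answer_start"]) == 0:
            # unanswerable question (SQuAD 2.0), point both labels at [CLS]
            start_positions.append(cls_index)
            end_positions.append(cls_index)
            continue

        start_char = answers["answer_start"][0]
        end_char = start_char + len(answers["text"][0])

        # locate the first and last token of the context portion of this chunk
        token_start = 0
        while sequence_ids[token_start] != 1:
            token_start += 1
        token_end = len(input_ids) - 1
        while sequence_ids[token_end] != 1:
            token_end -= 1

        if not (offsets[token_start][0] <= start_char and offsets[token_end][1] >= end_char):
            # the answer is not fully inside this chunk, label it with [CLS]
            start_positions.append(cls_index)
            end_positions.append(cls_index)
        else:
            # otherwise move token_start / token_end onto the answer's boundaries
            while token_start < len(offsets) and offsets[token_start][0] <= start_char:
                token_start += 1
            start_positions.append(token_start - 1)
            while offsets[token_end][1] >= end_char:
                token_end -= 1
            end_positions.append(token_end + 1)

    tokenized["start_positions"] = start_positions
    tokenized["end_positions"] = end_positions
    return tokenized


tokenized_train = squad["train"].map(
    prepare_train_features, batched=True, remove_columns=squad["train"].column_names
)
```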
We test our preprocessing function on a sample text to ensure this somewhat complicated function works as expected, i.e. the start and end positions of the tokenized answer match the original un-tokenized version.
Model
Upon preparing our dataset, fine-tuning a question answering model on top of a pre-trained language model is similar to other tasks: we initialize an AutoModelForQuestionAnswering model and follow the standard fine-tuning process.
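A minimal sketch of this fine-tuning loop, assuming the model checkpoint and training hyper-parameters below (the output directory name is hypothetical):

```python
from transformers import (
    AutoModelForQuestionAnswering,
    Trainer,
    TrainingArguments,
    default_data_collator,
)

# model_checkpoint, tokenizer and tokenized_train as defined in earlier cells
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

args = TrainingArguments(
    output_dir="qa_finetuned",          # hypothetical output directory
    learning_rate=3e-5,
    num_train_epochs=2,
    per_device_train_batch_size=16,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_train,
    data_collator=default_data_collator,
    tokenizer=tokenizer,
)
trainer.train()
```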
Evaluation
Evaluating our model also requires a bit more work on the postprocessing front, hence we'll first use transformers' pipeline object to confirm that the model we just trained is indeed learning, by checking whether its predicted answer resembles the ground truth answer.
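For example, a quick sanity check might look like the sketch below (the question/context pair here is made up purely for illustration):

```python
from transformers import pipeline

qa_pipeline = pipeline("question-answering", model=trainer.model, tokenizer=tokenizer)

qa_pipeline(
    question="Which library are we using for fine-tuning?",
    context="This notebook fine-tunes a reader model with huggingface's transformers library.",
)
# -> {'score': ..., 'start': ..., 'end': ..., 'answer': ...}
```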
For evaluation, we'll preprocess our dataset in a slightly different manner:
First, we technically don't need to generate labels.
Second, the "fun" part is to map our model's predictions back to spans in the original context. As a reminder, some of our features are overflowed inputs for the same given example. We'll be using the example id for creating this mapping.
Last, but not least, we'll set the offset mapping of the question part of our input sequence to None; this lets us efficiently detect whether a predicted answer span falls within the context portion of the input sentence as opposed to the question portion. A sketch of this validation preprocessing follows this list.
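The sketch below mirrors the training preprocessing, but keeps an example_id per feature and masks out non-context offsets (hyper-parameters as assumed earlier):

```python
# tokenizer, max_length, doc_stride as defined in earlier cells
def prepare_validation_features(examples):
    tokenized = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )
    sample_mapping = tokenized.pop("overflow_to_sample_mapping")

    tokenized["example_id"] = []
    for i in range(len(tokenized["input_ids"])):
        sequence_ids = tokenized.sequence_ids(i)
        # remember which original example this chunk came from
        tokenized["example_id"].append(examples["id"][sample_mapping[i]])
        # keep offsets only for context tokens, set question/special token offsets to None
        tokenized["offset_mapping"][i] = [
            offset if sequence_ids[k] == 1 else None
            for k, offset in enumerate(tokenized["offset_mapping"][i])
        ]
    return tokenized


validation_features = squad["validation"].map(
    prepare_validation_features, batched=True, remove_columns=squad["validation"].column_names
)
```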
With our features, we can generate predictions, which are pairs of start and end logits.
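For instance, one way to obtain these logits with the trainer from earlier (dropping the extra bookkeeping columns the model doesn't expect) might be:

```python
raw_predictions = trainer.predict(
    validation_features.remove_columns(["example_id", "offset_mapping"])
)
start_logits, end_logits = raw_predictions.predictions
```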
Having our original examples, preprocessed features, and generated predictions, we'll perform a final post-processing step to generate a predicted answer for each example. This process mainly involves:
Creating a map between examples and features.
Looping through all the examples, and for each example, looping through all its features to pick the best start and end logit combination from the n_best start and end logits.
During this process of picking out the best answer span, we'll automatically eliminate answers that are not inside the context, answers with a negative length (start position greater than end position), as well as answers that are too long (configurable with a max_answer_length parameter).
The "proper" way of computing a score for each answer is to convert the start and end logits into probabilities using a softmax operation, then taking the product of these two probabilities. Here, we'll skip the softmax and score candidates by summing the start and end logits instead (log(ab) = log(a) + log(b)). A sketch of this post-processing step is shown below.
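The sketch below implements the steps above under the naming assumptions from earlier cells (examples are the raw validation examples, features are the preprocessed validation features, and n_best / max_answer_length are arbitrary defaults):

```python
import collections

import numpy as np

n_best = 20             # number of top start/end indices to consider
max_answer_length = 30  # maximum answer length in tokens


def postprocess_predictions(examples, features, start_logits, end_logits):
    # map each example id to the indices of its (possibly multiple) features/chunks
    features_per_example = collections.defaultdict(list)
    for idx, feature in enumerate(features):
        features_per_example[feature["example_id"]].append(idx)

    predictions = {}
    for example in examples:
        context = example["context"]
        candidates = []
        for feature_index in features_per_example[example["id"]]:
            start_logit = start_logits[feature_index]
            end_logit = end_logits[feature_index]
            offsets = features[feature_index]["offset_mapping"]

            start_indexes = np.argsort(start_logit)[-n_best:][::-1]
            end_indexes = np.argsort(end_logit)[-n_best:][::-1]
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # skip answers outside the context (offset is None for question/special tokens)
                    if offsets[start_index] is None or offsets[end_index] is None:
                        continue
                    # skip negative-length or overly long answers
                    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                        continue
                    candidates.append(
                        {
                            # sum of logits stands in for the product of probabilities
                            "score": start_logit[start_index] + end_logit[end_index],
                            "text": context[offsets[start_index][0] : offsets[end_index][1]],
                        }
                    )

        best = max(candidates, key=lambda c: c["score"]) if candidates else {"text": ""}
        predictions[example["id"]] = best["text"]
    return predictions


predictions = postprocess_predictions(
    squad["validation"], validation_features, start_logits, end_logits
)
```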
SQuAD primarily uses two metrics for model evaluation (a sketch of computing them follows this list).
Exact Match: Measures the percentage of predictions that perfectly match any one of the ground truth answers.
Macro F1: Measures the average overlap between the prediction and the ground truth answers.
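One way to compute these metrics is via the evaluate library's squad_v2 metric (the no_answer_probability of 0.0 below is a simplifying assumption):

```python
import evaluate

squad_metric = evaluate.load("squad_v2")  # use "squad" for version 1.1

formatted_predictions = [
    # squad_v2 additionally expects a no-answer probability for each prediction
    {"id": example_id, "prediction_text": text, "no_answer_probability": 0.0}
    for example_id, text in predictions.items()
]
references = [
    {"id": example["id"], "answers": example["answers"]}
    for example in squad["validation"]
]
print(squad_metric.compute(predictions=formatted_predictions, references=references))
# -> {'exact': ..., 'f1': ..., ...}
```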
For context, the screenshot below shows the performance reported by the original SQuAD 2.0 paper [5].

That's a wrap for this document. We went through the nitty gritty details of how to pre-process our inputs and post-process our outputs for fine-tuning a cross attention model on top of a pre-trained language model.
Reference
[1] Notebook: Fine-tuning a model on a question-answering task
[2] Github: Huggingface Question Answering Examples
[3] Huggingface Course: Chapter 7 Main NLP tasks - Question answering
[4] Blog: Reader Models for Open Domain Question-Answering
[5] Paper: Pranav Rajpurkar, Robin Jia, et al. - Know What You Don't Know: Unanswerable Questions for SQuAD - 2018