Path: blob/main/ch16/ch16-part1-self-attention.ipynb
Machine Learning with PyTorch and Scikit-Learn
-- Code Examples
Package version checks
Add folder to path in order to load from the check_packages.py script:
Check recommended package versions:
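A minimal sketch of this setup, assuming the helper script (referred to above as check_packages.py) lives one directory above this notebook and exposes a `check_packages` function; the package names and version pins below are placeholders, not the book's actual requirements:

```python
import sys

# Make the parent folder (assumed to contain check_packages.py) importable
sys.path.insert(0, '..')

from check_packages import check_packages  # hypothetical import, see note above

# Placeholder version pins for illustration only
d = {'numpy': '1.21.2', 'torch': '1.9.0'}
check_packages(d)
```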
Chapter 16: Transformers – Improving Natural Language Processing with Attention Mechanisms (Part 1/3)
Adding an attention mechanism to RNNs
Attention helps RNNs with accessing information
The original attention mechanism for RNNs
Processing the inputs using a bidirectional RNN
Generating outputs from context vectors
Computing the attention weights
Introducing the self-attention mechanism
Starting with a basic form of self-attention
Assume we have an input sentence that we encoded via a dictionary, which maps the words to integers as discussed in the RNN chapter:
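For illustration, a sketch along these lines (the concrete sentence and the integer mapping are placeholders):

```python
import torch

# An 8-word toy sentence, already mapped to integer indices by a
# (hypothetical) vocabulary dictionary; the concrete indices are illustrative.
sentence = torch.tensor([0, 7, 1, 2, 5, 6, 4, 3])
print(sentence)
```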
Next, assume we have an embedding of the words, i.e., the words are represented as real-valued vectors.
Since we have 8 words, there will be 8 vectors. Each vector is 16-dimensional:
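A sketch of this step, building on the `sentence` tensor above; the vocabulary size of 10 and the random initialization are assumptions made for illustration:

```python
torch.manual_seed(123)

# Map each of the 8 word indices to a 16-dimensional embedding vector
embed = torch.nn.Embedding(10, 16)  # vocabulary size 10 is an assumption
embedded_sentence = embed(sentence).detach()
print(embedded_sentence.shape)  # torch.Size([8, 16])
```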
The goal is to compute the context vectors $\boldsymbol{z}^{(i)}=\sum_{j=1}^{T} \alpha_{ij} \boldsymbol{x}^{(j)}$, which involve attention weights $\alpha_{ij}$.
In turn, the attention weights $\alpha_{ij}$ involve the $\omega_{ij}$ values.
Let's start with the $\omega_{ij}$'s first, which are computed as pairwise dot-products: $\omega_{ij} = \boldsymbol{x}^{(i)\top} \boldsymbol{x}^{(j)}$
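A sketch of this computation with explicit (slow) nested loops, using the `embedded_sentence` tensor from above:

```python
# Unnormalized attention weights: pairwise dot products of the input vectors
omega = torch.empty(8, 8)
for i, x_i in enumerate(embedded_sentence):
    for j, x_j in enumerate(embedded_sentence):
        omega[i, j] = torch.dot(x_i, x_j)
```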
Actually, let's compute this more efficiently by replacing the nested for-loops with a matrix multiplication:
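For example, the same result via a single matrix multiplication, $\boldsymbol{\Omega} = \boldsymbol{X}\boldsymbol{X}^\top$:

```python
# Same result via one matrix multiplication: omega = X X^T
omega_mat = embedded_sentence.matmul(embedded_sentence.T)

# Sanity check: both computations agree up to floating-point error
print(torch.allclose(omega, omega_mat))  # True
```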
Next, let's compute the attention weights by normalizing the $\omega_{ij}$ values via the softmax function so that they sum to 1: $\alpha_{ij} = \frac{\exp(\omega_{ij})}{\sum_{j=1}^{T} \exp(\omega_{ij})}$
We can confirm that the attention weights for each input element indeed sum up to 1:
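A sketch of both steps, assuming the softmax is taken over the second dimension of the `omega` matrix computed above:

```python
import torch.nn.functional as F

# Normalize the unnormalized weights with softmax so that the attention
# weights for each input element sum to 1
attention_weights = F.softmax(omega, dim=1)
print(attention_weights.shape)       # torch.Size([8, 8])

# Each slice along the normalized dimension sums to 1
print(attention_weights.sum(dim=1))  # tensor([1., 1., ..., 1.])
```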
Now that we have the attention weights, we can compute the context vectors $\boldsymbol{z}^{(i)}=\sum_{j=1}^{T} \alpha_{ij} \boldsymbol{x}^{(j)}$:
For instance, to compute the context vector of the 2nd input element (the element at index 1), we can perform the following computation:
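A loop-based sketch, using the `attention_weights` and `embedded_sentence` tensors from above:

```python
# Weighted sum over all input vectors, using the attention weights
# of the input element at index 1
x_2 = embedded_sentence[1, :]
context_vec_2 = torch.zeros(x_2.shape)
for j in range(8):
    context_vec_2 += attention_weights[1, j] * embedded_sentence[j, :]
print(context_vec_2.shape)  # torch.Size([16])
```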
Or, more efficiently, using linear algebra and matrix multiplication:
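A sketch of the vectorized version, computing all context vectors at once as $\boldsymbol{Z} = \boldsymbol{A}\boldsymbol{X}$:

```python
# All context vectors at once: rows of A are attention weights,
# rows of X are the embedded inputs
context_vectors = torch.matmul(attention_weights, embedded_sentence)

# The row at index 1 matches the loop-based result above
print(torch.allclose(context_vec_2, context_vectors[1], atol=1e-6))  # True
```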
Parameterizing the self-attention mechanism: scaled dot-product attention
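As a rough sketch of what this section covers (not the book's exact implementation): the inputs are projected into queries, keys, and values with learnable matrices, here named `U_query`, `U_key`, and `U_value` and randomly initialized for illustration, and the dot-product scores are scaled by $\sqrt{d}$ before the softmax:

```python
import torch.nn.functional as F

torch.manual_seed(123)
d = embedded_sentence.shape[1]  # embedding size (16); also used as query/key size here

# Learnable projection matrices (randomly initialized for this sketch)
U_query = torch.rand(d, d)
U_key = torch.rand(d, d)
U_value = torch.rand(d, d)

# Project the inputs into queries, keys, and values
queries = embedded_sentence.matmul(U_query.T)
keys = embedded_sentence.matmul(U_key.T)
values = embedded_sentence.matmul(U_value.T)

# Scaled dot-product attention for all positions at once
scores = queries.matmul(keys.T)           # unnormalized scores
attn = F.softmax(scores / d**0.5, dim=1)  # scale by sqrt(d), then softmax
context = attn.matmul(values)
print(context.shape)  # torch.Size([8, 16])
```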
Attention is all we need: introducing the original transformer architecture
Encoding context embeddings via multi-head attention
Learning a language model: decoder and masked multi-head attention
Implementation details: positional encodings and layer normalization
Readers may ignore the next cell.