Attention Mechanism: Theory and Implementation
Introduction
The attention mechanism is a fundamental component in modern deep learning architectures, particularly in natural language processing and computer vision. It allows models to dynamically focus on relevant parts of the input when producing an output, mimicking the human cognitive process of selective attention.
Mathematical Foundation
Scaled Dot-Product Attention
The core attention mechanism computes a weighted sum of values ($V$) based on the compatibility between queries ($Q$) and keys ($K$). The scaled dot-product attention is defined as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

where:
$Q$ is the query matrix
$K$ is the key matrix
$V$ is the value matrix
$d_k$ is the dimension of queries and keys
$d_v$ is the dimension of values
Attention Weights
The attention weights are computed as:

$$\alpha_{ij} = \frac{\exp\!\left(q_i \cdot k_j / \sqrt{d_k}\right)}{\sum_{l} \exp\!\left(q_i \cdot k_l / \sqrt{d_k}\right)}$$

This softmax normalization ensures that $\sum_{j} \alpha_{ij} = 1$, making the weights interpretable as a probability distribution over the keys.
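As a quick sanity check, here is a tiny numerical example (the scores below are made up) showing that the weights produced by the softmax sum to 1:

```python
import numpy as np

# Hypothetical compatibility scores q_i . k_j / sqrt(d_k) for a single query
scores = np.array([2.0, 1.0, 0.1])

# Softmax normalization: exponentiate and divide by the sum
weights = np.exp(scores) / np.exp(scores).sum()

print(weights)        # approximately [0.659 0.242 0.099]
print(weights.sum())  # 1.0 -- a valid probability distribution over the keys
```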
Scaling Factor
The scaling factor $\frac{1}{\sqrt{d_k}}$ is crucial for preventing the dot products from growing too large in magnitude. When $d_k$ is large, the dot products can have large magnitudes, pushing the softmax into regions with extremely small gradients. The scaling counteracts this effect: if the components of $q$ and $k$ are independent with mean $0$ and variance $1$, then

$$q \cdot k = \sum_{i=1}^{d_k} q_i k_i$$

has mean $0$ and variance $d_k$. Dividing by $\sqrt{d_k}$ normalizes the variance to approximately 1.
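A short simulation makes this concrete (a sketch assuming NumPy; the dimensions and sample count are arbitrary choices): the variance of raw dot products grows linearly with $d_k$, while the scaled version stays near 1.

```python
import numpy as np

rng = np.random.default_rng(0)

for d_k in (16, 64, 256):
    # Components drawn independently with mean 0 and variance 1
    q = rng.standard_normal((10_000, d_k))
    k = rng.standard_normal((10_000, d_k))
    dots = (q * k).sum(axis=1)      # raw dot products: variance ~ d_k
    scaled = dots / np.sqrt(d_k)    # scaled dot products: variance ~ 1
    print(f"d_k={d_k:4d}  var(raw)={dots.var():7.1f}  var(scaled)={scaled.var():.2f}")
```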
Multi-Head Attention
Multi-head attention extends the basic mechanism by allowing the model to jointly attend to information from different representation subspaces:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O$$

where each head is computed as:

$$\mathrm{head}_i = \mathrm{Attention}(QW_i^Q,\; KW_i^K,\; VW_i^V)$$

with learned projection matrices $W_i^Q$, $W_i^K$, $W_i^V$, and $W^O$.
Implementation
Scaled Dot-Product Attention
We implement the core attention mechanism following the mathematical formulation above.
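One possible NumPy rendering of the formulation above (a minimal sketch; the function name, the returned weights, and the optional mask argument are illustrative choices):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Compute softmax(Q K^T / sqrt(d_k)) V and the attention weights.

    Q: (..., n_queries, d_k), K: (..., n_keys, d_k), V: (..., n_keys, d_v).
    mask: optional boolean array broadcastable to (..., n_queries, n_keys);
          positions where mask is False are excluded from attention.
    """
    d_k = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)   # (..., n_queries, n_keys)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)            # large negative => ~0 weight
    # Numerically stable softmax over the key dimension
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights
```

Masked positions are filled with a large negative score before the softmax so their weights collapse to effectively zero, and the per-row maximum is subtracted for numerical stability; both are common implementation choices rather than part of the mathematical definition.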
Multi-Head Attention
We extend the basic attention to multiple heads, allowing the model to capture different types of relationships.
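A compact sketch building on the scaled_dot_product_attention function above (assuming NumPy; the class name, the random weight initialization, and the omission of biases are illustrative simplifications):

```python
import numpy as np

class MultiHeadAttention:
    """Multi-head attention with learned projections W_i^Q, W_i^K, W_i^V and W^O."""

    def __init__(self, d_model, num_heads, seed=0):
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(d_model)
        # One full-width projection per role; split into heads below
        self.W_q = rng.standard_normal((d_model, d_model)) * scale
        self.W_k = rng.standard_normal((d_model, d_model)) * scale
        self.W_v = rng.standard_normal((d_model, d_model)) * scale
        self.W_o = rng.standard_normal((d_model, d_model)) * scale

    def _split_heads(self, x):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        seq_len = x.shape[0]
        return x.reshape(seq_len, self.num_heads, self.d_head).transpose(1, 0, 2)

    def __call__(self, Q, K, V, mask=None):
        q = self._split_heads(Q @ self.W_q)
        k = self._split_heads(K @ self.W_k)
        v = self._split_heads(V @ self.W_v)
        out, weights = scaled_dot_product_attention(q, k, v, mask)  # per-head attention
        # (num_heads, seq_len, d_head) -> (seq_len, d_model), then final projection
        seq_len = out.shape[1]
        concat = out.transpose(1, 0, 2).reshape(seq_len, -1)
        return concat @ self.W_o, weights
```

In a real model these projection matrices would be trained parameters; fixed random matrices are used here only so the shapes and data flow can be exercised end to end.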
Demonstration: Self-Attention on a Sequence
Let's demonstrate the attention mechanism on a simple sequence, simulating how a model might attend to different parts of an input.
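One way to set up the demonstration, reusing the sketches above (the sequence length, model width, head count, and random toy embeddings are arbitrary choices):

```python
import numpy as np

seq_len, d_model, num_heads = 8, 32, 4

# Toy "token embeddings" standing in for a real input sequence
rng = np.random.default_rng(42)
x = rng.standard_normal((seq_len, d_model))

# Single-head self-attention: queries, keys, and values all come from x
single_out, single_weights = scaled_dot_product_attention(x, x, x)
print("single-head output:", single_out.shape)       # (8, 32)
print("single-head weights:", single_weights.shape)  # (8, 8)

# Multi-head self-attention using the class sketched above
mha = MultiHeadAttention(d_model, num_heads)
multi_out, multi_weights = mha(x, x, x)
print("multi-head output:", multi_out.shape)         # (8, 32)
print("multi-head weights:", multi_weights.shape)    # (4, 8, 8) -- one map per head
```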
Visualization
Attention Weight Heatmaps
Visualizing attention weights helps us understand what the model is "looking at" when processing each position.
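A possible matplotlib sketch of the heatmaps, assuming the single_weights and multi_weights arrays and the num_heads variable from the demonstration above:

```python
import matplotlib.pyplot as plt

# One panel for the single-head map plus one panel per head of multi-head attention
fig, axes = plt.subplots(1, num_heads + 1, figsize=(4 * (num_heads + 1), 3.5))

axes[0].imshow(single_weights, cmap="viridis", vmin=0.0)
axes[0].set_title("Single-head attention")
axes[0].set_xlabel("Key position")
axes[0].set_ylabel("Query position")

for h in range(num_heads):
    axes[h + 1].imshow(multi_weights[h], cmap="viridis", vmin=0.0)
    axes[h + 1].set_title(f"Head {h}")
    axes[h + 1].set_xlabel("Key position")

plt.tight_layout()
plt.show()
```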
Analysis of Results
Interpretation of Attention Patterns
Single-head attention shows the overall attention pattern when using a single attention mechanism.
Multi-head attention reveals that different heads learn to attend to different aspects:
Some heads may focus on local patterns (attending to nearby positions)
Other heads may capture global dependencies (attending to distant positions)
This diversity allows the model to capture multiple types of relationships simultaneously
Entropy analysis measures how "spread out" the attention is:
Low entropy: Focused attention on specific positions
High entropy: Distributed attention across many positions
Maximum entropy for 8 positions: $\ln 8 \approx 2.08$ nats
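One way to compute these entropies from the attention maps above (a sketch; it assumes the single_weights, multi_weights, and seq_len variables from the demonstration, and uses the natural log, hence nats):

```python
import numpy as np

def attention_entropy(weights, eps=1e-12):
    """Shannon entropy (in nats) of each attention distribution (each row)."""
    return -(weights * np.log(weights + eps)).sum(axis=-1)

print("max possible entropy:", np.log(seq_len))                  # ln(8) ~= 2.079 nats
print("single-head entropy per query:", attention_entropy(single_weights).round(3))
print("mean entropy per head:", attention_entropy(multi_weights).mean(axis=-1).round(3))
```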
Causal (Masked) Attention
In autoregressive models (like GPT), we use causal masking to prevent attending to future positions:
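A sketch of the mask and its effect, reusing the scaled_dot_product_attention function and the toy sequence x from above (the lower-triangular construction expresses that position i may only attend to positions j <= i):

```python
import numpy as np

# True where attention is allowed: position i may attend to positions j <= i
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

causal_out, causal_weights = scaled_dot_product_attention(x, x, x, mask=causal_mask)

# The weight matrix is lower triangular: each row is a distribution over past positions
print(causal_weights.round(2))
print("upper triangle is zero:", np.allclose(np.triu(causal_weights, k=1), 0.0))
```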
Conclusion
The attention mechanism is a powerful tool that enables neural networks to:
Dynamically weight different parts of the input based on relevance
Capture long-range dependencies without the limitations of recurrent architectures
Provide interpretability through attention weight visualization
Key takeaways:
The $1/\sqrt{d_k}$ scaling factor is crucial for stable gradients
Multi-head attention allows capturing diverse relationship types
Causal masking enables autoregressive generation
Attention weights can be interpreted as a soft retrieval mechanism
This mechanism forms the backbone of Transformer architectures and has revolutionized natural language processing, computer vision, and other domains.