1. Why Transformers Need Positional Encoding
Unlike RNNs or CNNs, Transformers do not process tokens sequentially. They process all tokens in parallel using attention.
Therefore, word order information must be injected explicitly.
Let:
Input sentence length = (N)
Embedding dimension = (d_{model})
Each token is mapped to an embedding, giving the input matrix: [ X \in \mathbb{R}^{N \times d_{\text{model}}} ]
Where:
(X) = the input embedding matrix fed to the Transformer (or self-attention layer)
(N) = sequence length (number of tokens in the input text, e.g., words or subwords)
(d_{\text{model}}) = model (embedding) dimension (the size of the vector representing each token)

Without positional encoding:
"Dog bites man"
"Man bites dog"
would be indistinguishable.
2. Sinusoidal Positional Encoding
The original Transformer uses fixed sinusoidal positional encodings.
For position (pos) and dimension index (i):
[ PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \quad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) ]
Why sinusoids?
Allow extrapolation to longer sequences
Relative positions can be inferred via linear operations
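A minimal NumPy sketch of the sinusoidal formulas above (the helper name and the choice of NumPy are illustrative assumptions, not taken from the original notebook):

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions: int, d_model: int) -> np.ndarray:
    """Return an (n_positions, d_model) matrix of fixed sinusoidal positional encodings."""
    positions = np.arange(n_positions)[:, np.newaxis]          # shape (n_positions, 1)
    two_i = np.arange(0, d_model, 2)[np.newaxis, :]            # even dimension indices 2i
    angles = positions / np.power(10000.0, two_i / d_model)    # pos / 10000^(2i / d_model)

    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions: cosine
    return pe

# Each row is the encoding for one position; nearby positions receive similar vectors.
print(sinusoidal_positional_encoding(n_positions=4, d_model=8).round(3))
```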
3. Input Representation to the Transformer
Final input embedding: [ Z = X + PE ]
Where:
(X) = token embeddings
(PE) = positional encodings
This sum injects order information without increasing dimensionality.
4. Example: "suyashi is happy"
Let's consider a simple example where we encode a small sequence using positional encoding. Assume the sentence "suyashi is happy" has three positions (0, 1, 2), and we use a small dimension d = 4 for simplicity.

Resulting Positional Encodings
Combining the sine and cosine values for each dimension gives a positional encoding vector for each of the three positions (computed explicitly in the sketch below).
If the word embeddings for "suyashi", "is", and "happy" are vectors, the positional encoding vectors are added to these embeddings. This addition makes the model aware of the position of each word in the sentence. In practice, these operations are carried out over higher-dimensional spaces and with many more positions, but the fundamental idea remains the same: positional encodings help the Transformer model understand the order and position of words within a sequence.
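A small sketch of this toy example, reusing the sinusoidal_positional_encoding helper from Section 2 (the token embeddings below are random placeholders standing in for learned vectors):

```python
import numpy as np

tokens = ["suyashi", "is", "happy"]                   # positions 0, 1, 2
d_model = 4

PE = sinusoidal_positional_encoding(len(tokens), d_model)
# PE ≈ [[0.000,  1.000, 0.000, 1.000],
#       [0.841,  0.540, 0.010, 1.000],
#       [0.909, -0.416, 0.020, 1.000]]

rng = np.random.default_rng(0)
X = rng.normal(size=(len(tokens), d_model))           # placeholder token embeddings
Z = X + PE                                            # Z = X + PE: order-aware input
print(Z.round(3))
```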
5. Self-Attention: Core Idea
Each token attends to every other token, including itself.
For each token we compute:
Query (Q)
Key (K)
Value (V)
Using learned projection matrices: [ Q = XW_Q, \quad K = XW_K, \quad V = XW_V ]
Where:
(W_Q, W_K, W_V \in \mathbb{R}^{d_{model} \times d_k})
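A minimal NumPy sketch of these projections (the random matrices stand in for learned weights, and the dimensions are illustrative choices):

```python
import numpy as np

N, d_model, d_k = 3, 8, 4                      # sequence length, model dim, key/query dim
rng = np.random.default_rng(42)

X = rng.normal(size=(N, d_model))              # token embeddings (with positional encoding added)
W_Q = rng.normal(size=(d_model, d_k))          # "learned" projection matrices (random here)
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V            # each has shape (N, d_k)
print(Q.shape, K.shape, V.shape)
```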
6. Scaled Dot-Product Attention (Mathematics)
Attention is computed as:
[ \text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V ]
Why scaling by (\sqrt{d_k})?
Prevents dot products from growing too large
Stabilizes gradients during training
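A sketch of the scaled dot-product formula, reusing Q, K, V from the previous cell (the softmax is written out by hand with the usual max-subtraction for numerical stability):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (N, N) scaled similarity scores
    scores -= scores.max(axis=-1, keepdims=True)      # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax: attention weights
    return weights @ V, weights                       # (N, d_k) outputs, (N, N) attention map

output, attention_weights = scaled_dot_product_attention(Q, K, V)
print(attention_weights.round(2))                     # each row sums to 1
```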
7. Multi-Head Attention
A single self-attention head computes one attention pattern over the entire sequence. This is limiting because language has multiple simultaneous relationships, for example:
syntactic dependency
semantic similarity
long-range references
positional relationships
Multi-head attention allows the model to attend to different representation subspaces in parallel.
Instead of performing one attention operation on the full embedding space, the model performs h independent attention operations on smaller subspaces and then combines (concatenates and projects) the results.
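A compact sketch of multi-head attention over h subspaces of size d_model / h (random weights stand in for learned parameters; the output projection W_O is included for completeness):

```python
import numpy as np

def multi_head_attention(X, h=2, seed=0):
    """Toy multi-head self-attention: h heads, each over a d_model/h subspace."""
    N, d_model = X.shape
    d_k = d_model // h
    rng = np.random.default_rng(seed)

    heads = []
    for _ in range(h):
        W_Q, W_K, W_V = [rng.normal(size=(d_model, d_k)) for _ in range(3)]
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        scores = Q @ K.T / np.sqrt(d_k)
        scores -= scores.max(axis=-1, keepdims=True)
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)   # each head has its own attention pattern
        heads.append(weights @ V)

    W_O = rng.normal(size=(d_model, d_model))            # output projection
    return np.concatenate(heads, axis=-1) @ W_O          # back to shape (N, d_model)

X = np.random.default_rng(1).normal(size=(3, 8))
print(multi_head_attention(X, h=2).shape)                # (3, 8)
```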

8. Transformer for Text Summarization (Conceptual Flow)
For a 1000-word paragraph:
Tokenize text
Convert tokens to embeddings
Add positional encoding
Pass through Transformer Encoder layers
Use a decoder to generate the summary (encoder-only models can instead be used for extractive summarization)
In practice, we use pretrained Transformer models.
9. Practical Example: Summarization Using a Transformer (Hugging Face)
We will use a pretrained encoder-decoder Transformer:
facebook/bart-large-cnn, designed specifically for summarization
How Summarization Works in facebook/bart-large-cnn
facebook/bart-large-cnn is a sequence-to-sequence (encoder–decoder) Transformer model fine-tuned specifically for abstractive summarization on the CNN/DailyMail dataset. Its operation can be understood in four stages:
1. Architecture: Encoder–Decoder Transformer
Encoder: Reads the full input text and converts it into contextualized hidden representations using self-attention.
Decoder: Generates the summary token-by-token, attending both to previously generated tokens and to the encoder’s output.
This enables the model to paraphrase, compress, and reorganize information rather than simply extract sentences.
2. Pretraining via Denoising Autoencoding
BART is pretrained by:
Corrupting input text (masking, deleting, shuffling sentences)
Training the model to reconstruct the original text
This teaches the model:
Language structure
Long-range dependencies
Content reconstruction
3. Fine-Tuning for Summarization
During fine-tuning on CNN/DailyMail:
Input: Full news article
Output: Human-written highlights (summaries)
The model learns:
What information is salient
How to compress content to ~3–4 sentences
A neutral, news-style summarization tone
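With the model described above, summarization reduces to a few lines with the Hugging Face pipeline API. A minimal sketch (the article text is a placeholder and the generation lengths are illustrative choices, not values from the original notebook):

```python
# pip install transformers torch
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = """<paste the paragraph or news article to summarize here>"""

result = summarizer(
    article,
    max_length=130,   # upper bound on generated summary tokens
    min_length=30,    # lower bound on generated summary tokens
    do_sample=False,  # deterministic (beam search) decoding
)
print(result[0]["summary_text"])
```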
10. Summarizing a Long (~1000 word) Paragraph
Important considerations:
Transformers have maximum token limits
BART supports ~1024 tokens
For longer inputs, chunking or sliding windows are required
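A hedged sketch of the chunking approach, assuming the facebook/bart-large-cnn tokenizer and pipeline from the previous section (the chunk size and generation lengths are illustrative):

```python
from transformers import AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
summarizer = pipeline("summarization", model="facebook/bart-large-cnn", tokenizer=tokenizer)

def summarize_long_text(text: str, max_chunk_tokens: int = 900) -> str:
    """Split the text into token-bounded chunks, summarize each, and join the partial summaries."""
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = [
        tokenizer.decode(token_ids[i : i + max_chunk_tokens], skip_special_tokens=True)
        for i in range(0, len(token_ids), max_chunk_tokens)
    ]
    partial_summaries = [
        summarizer(chunk, max_length=120, min_length=30, do_sample=False)[0]["summary_text"]
        for chunk in chunks
    ]
    return " ".join(partial_summaries)  # optionally re-summarize this joined text
```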
11. How Self-Attention Enables Summarization
During encoding:
Each token attends to all other tokens
Important sentences receive higher attention weights
During decoding:
The model learns to generate compressed semantic representations
Redundant or low-attention content is dropped
Mathematically: [ \text{Summary} = f(\text{Attention}(Z)) ]
12. Key Takeaways
Positional encoding injects order into parallel processing
Self-attention computes contextual relevance via dot products
Transformers scale efficiently for long documents
Pretrained models operationalize these ideas for summarization