
Understanding Transformer Architecture – End-to-End

Trainer & Learner Friendly Jupyter Notebook

1. Why Transformers?

Limitations of RNNs / LSTMs

  • Sequential computation → slow training

  • Long-term dependencies are hard to learn

  • Vanishing / exploding gradients

  • Limited parallelism

Transformers solve these problems using Attention.

2. High-Level Transformer Architecture

A Transformer consists of:

  • Encoder Stack

  • Decoder Stack

  • Attention Mechanism

  • Positional Encoding

Each block is built using Attention + Feed Forward Networks.
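
As a rough illustration (a sketch, not part of the original notebook), PyTorch's nn.Transformer wires the encoder stack, decoder stack, and attention/FFN sub-layers together; token embeddings and positional encoding still have to be supplied by the user. The sizes below are the familiar base-model defaults, used purely for illustration.

import torch
import torch.nn as nn

# Sketch: the high-level architecture as a single PyTorch module.
# Embeddings and positional encodings are NOT included and must be added separately.
model = nn.Transformer(
    d_model=512,           # embedding size
    nhead=8,               # attention heads per layer
    num_encoder_layers=6,  # encoder stack depth
    num_decoder_layers=6,  # decoder stack depth
    dim_feedforward=2048,  # hidden size of each FFN
    batch_first=True,
)

src = torch.rand(2, 10, 512)  # (batch, source length, d_model)
tgt = torch.rand(2, 7, 512)   # (batch, target length, d_model)
model(src, tgt).shape         # torch.Size([2, 7, 512])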

3. Tokenization & Embeddings

Before entering the Transformer:

  1. Text → Tokens (word/subword)

  2. Tokens → Token IDs

  3. Token IDs → Embedding vectors

Embedding captures semantic meaning.

import torch
import torch.nn as nn

vocab_size = 10000
embedding_dim = 512

# Lookup table that maps each token ID to a dense 512-dimensional vector
embedding = nn.Embedding(vocab_size, embedding_dim)

sample_tokens = torch.tensor([10, 25, 300])
embedding(sample_tokens).shape

4. Positional Encoding

Transformers do NOT have sequence awareness.

Positional Encoding adds order information.

Formula:

PE(pos, 2i)   = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

import math
import torch

def positional_encoding(seq_len, d_model):
    # Sinusoidal positional encodings: even dimensions use sin, odd dimensions use cos
    pe = torch.zeros(seq_len, d_model)
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            pe[pos, i] = math.sin(pos / (10000 ** (i / d_model)))
            pe[pos, i + 1] = math.cos(pos / (10000 ** (i / d_model)))
    return pe

positional_encoding(5, 8)
tensor([[ 0.0000e+00,  1.0000e+00,  0.0000e+00,  1.0000e+00,  0.0000e+00,  1.0000e+00,  0.0000e+00,  1.0000e+00],
        [ 8.4147e-01,  5.4030e-01,  9.9833e-02,  9.9500e-01,  9.9998e-03,  9.9995e-01,  1.0000e-03,  1.0000e+00],
        [ 9.0930e-01, -4.1615e-01,  1.9867e-01,  9.8007e-01,  1.9999e-02,  9.9980e-01,  2.0000e-03,  1.0000e+00],
        [ 1.4112e-01, -9.8999e-01,  2.9552e-01,  9.5534e-01,  2.9996e-02,  9.9955e-01,  3.0000e-03,  1.0000e+00],
        [-7.5680e-01, -6.5364e-01,  3.8942e-01,  9.2106e-01,  3.9989e-02,  9.9920e-01,  4.0000e-03,  9.9999e-01]])

5. Self-Attention – Core Idea

Each word attends to every other word.

We compute:

  • Query (Q)

  • Key (K)

  • Value (V)

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.size(-1)
    # Similarity scores, scaled by sqrt(d_k) to keep the softmax gradients stable
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, V), weights

Q = torch.rand(1, 4, 8)
K = torch.rand(1, 4, 8)
V = torch.rand(1, 4, 8)

output, attn_weights = scaled_dot_product_attention(Q, K, V)
output.shape, attn_weights.shape
(torch.Size([1, 4, 8]), torch.Size([1, 4, 4]))

6. Multi-Head Attention

Instead of a single attention operation:

  • Split embeddings into multiple heads

  • Each head learns different relationships

  • Concatenate results

This improves representational power.
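
Below is a minimal sketch (not from the original notebook) of the split-heads idea. It reuses the scaled_dot_product_attention function defined in section 5, and the head count and shapes are illustrative.

import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def split_heads(self, x):
        # (batch, seq, d_model) -> (batch, heads, seq, d_head)
        b, s, _ = x.shape
        return x.view(b, s, self.num_heads, self.d_head).transpose(1, 2)

    def forward(self, x):
        Q = self.split_heads(self.q_proj(x))
        K = self.split_heads(self.k_proj(x))
        V = self.split_heads(self.v_proj(x))
        # Each head attends independently (reuses scaled_dot_product_attention from section 5)
        out, _ = scaled_dot_product_attention(Q, K, V)
        # Concatenate heads back into one d_model-sized vector per token
        b, h, s, d = out.shape
        out = out.transpose(1, 2).reshape(b, s, h * d)
        return self.out_proj(out)

mha = MultiHeadAttention()
mha(torch.rand(2, 5, 512)).shape  # torch.Size([2, 5, 512])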

7. Feed Forward Network (FFN)

Applied independently to each token.

Structure: Linear → ReLU → Linear

Adds non-linearity and depth.

import torch
import torch.nn as nn

# Position-wise FFN: expand from 512 to 2048 dimensions, apply ReLU, project back to 512
ffn = nn.Sequential(
    nn.Linear(512, 2048),
    nn.ReLU(),
    nn.Linear(2048, 512)
)

ffn(torch.rand(2, 5, 512)).shape
torch.Size([2, 5, 512])

nn.Linear operates on the last dimension of the input tensor. Here, each of the 512-dimensional vectors in the (2, 5) sequence is passed through the feed-forward network independently, preserving the batch and sequence dimensions.

8. Residual Connections & Layer Normalization

Why?

  • Stabilize training

  • Faster convergence

  • Better gradient flow

Each sub-layer: Output = LayerNorm(x + Sublayer(x))
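
A tiny sketch of this post-norm pattern (illustrative only), using an arbitrary linear layer as a stand-in for the attention or FFN sub-layer:

import torch
import torch.nn as nn

d_model = 512
layer_norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)  # stand-in for self-attention or the FFN

x = torch.rand(2, 5, d_model)
out = layer_norm(x + sublayer(x))  # residual ("Add") followed by LayerNorm ("Norm")
out.shape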

9. Encoder Block Summary

Each Encoder layer contains:

  1. Multi-Head Self-Attention

  2. Add & Norm

  3. Feed Forward Network

  4. Add & Norm

Repeated N times.
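
For reference, PyTorch packages exactly this block as nn.TransformerEncoderLayer. The sketch below (illustrative sizes, not from the original notebook) stacks it N = 6 times with nn.TransformerEncoder.

import torch
import torch.nn as nn

# One encoder layer = multi-head self-attention + Add & Norm + FFN + Add & Norm
encoder_layer = nn.TransformerEncoderLayer(
    d_model=512, nhead=8, dim_feedforward=2048, batch_first=True
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)  # repeated N = 6 times

x = torch.rand(2, 10, 512)  # (batch, sequence length, d_model)
encoder(x).shape            # torch.Size([2, 10, 512])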

10. Decoder Block Summary

Decoder adds:

  • Masked Self-Attention

  • Encoder–Decoder Attention

Used for text generation tasks.
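
A hedged sketch of one decoder layer using PyTorch's nn.TransformerDecoderLayer; the tensors and sizes are made up for illustration, and the causal mask used here is explained in the next section.

import torch
import torch.nn as nn

# Sketch: masked self-attention + encoder-decoder (cross) attention + FFN,
# each wrapped in Add & Norm. Sizes are illustrative.
decoder_layer = nn.TransformerDecoderLayer(
    d_model=512, nhead=8, dim_feedforward=2048, batch_first=True
)

memory = torch.rand(2, 10, 512)  # encoder output (contextual representations)
tgt = torch.rand(2, 7, 512)      # decoder-side embeddings (shifted targets)

# Causal mask: -inf above the diagonal hides future positions (see next section)
tgt_mask = torch.triu(torch.full((7, 7), float("-inf")), diagonal=1)
decoder_layer(tgt, memory, tgt_mask=tgt_mask).shape  # torch.Size([2, 7, 512])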

11. Masked Attention (Why?)

Prevents the model from seeing future tokens during training.

Essential for autoregressive generation.
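
A small sketch of the causal (look-ahead) mask: positions above the diagonal get -inf added to their scores, so their softmax weights become zero. The shapes are illustrative and the score computation mirrors the function from section 5.

import torch

seq_len = 4
# Causal mask: -inf above the diagonal blocks attention to future positions
mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

# Same scaled dot-product attention as section 5, with the mask added to the scores
Q = K = V = torch.rand(1, seq_len, 8)
scores = torch.matmul(Q, K.transpose(-2, -1)) / (8 ** 0.5)
weights = torch.softmax(scores + mask, dim=-1)
weights  # entries above the diagonal are exactly 0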

12. Transformer for Text Generation

Workflow:

  1. Input tokens

  2. Encoder builds contextual representations

  3. Decoder predicts next token

  4. Sampling (temperature, top-k, top-p)
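
As a rough sketch (made-up logits, not from the original notebook), temperature and top-k sampling can be written in a few lines; top-p (nucleus) sampling works the same way but keeps the smallest set of tokens whose cumulative probability exceeds p.

import torch

torch.manual_seed(0)
logits = torch.randn(10)   # hypothetical next-token logits over a 10-token vocabulary
temperature = 0.8
k = 5

# Temperature: values < 1 sharpen the distribution, values > 1 flatten it
probs = torch.softmax(logits / temperature, dim=-1)

# Top-k: keep the k most likely tokens, renormalize, then sample one of them
top_probs, top_ids = torch.topk(probs, k)
top_probs = top_probs / top_probs.sum()
next_token = top_ids[torch.multinomial(top_probs, num_samples=1)]
next_token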

13. Transformer vs LSTM (Intuition)

Aspect          LSTM      Transformer
Parallelism     Limited   High
Long Context    Limited   Strong
Training Speed  Slow      Fast
Attention       Optional  Core

14. Connection to Large Language Models (LLMs)

LLMs are:

  • Decoder-only Transformers

  • Trained on massive corpora

  • Predict next token

Examples: GPT, LLaMA, PaLM, Claude

15. Key Takeaways

  • Attention replaces recurrence

  • Positional encoding adds order

  • Multi-head attention captures rich context

  • Transformers scale efficiently

  • Foundation of modern GenAI