
Understanding Transformer Architecture – End-to-End

Trainer & Learner Friendly Jupyter Notebook

1. Why Transformers?

Limitations of RNNs / LSTMs

  • Sequential computation → slow training

  • Long-term dependencies are hard to learn

  • Vanishing / exploding gradients

  • Limited parallelism

Transformers solve these problems using Attention.

2. High-Level Transformer Architecture

A Transformer consists of:

  • Encoder Stack

  • Decoder Stack

  • Attention Mechanism

  • Positional Encoding

Each block is built using Attention + Feed Forward Networks.
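
As a rough illustration (a sketch, not part of the original notebook), PyTorch's nn.Transformer wires the encoder stack, decoder stack, and attention/FFN sub-layers together; token embeddings and positional encoding still have to be supplied by the user. The sizes below are the familiar base-model defaults, used purely for illustration.

import torch
import torch.nn as nn

# Sketch: the high-level architecture as a single PyTorch module.
# Embeddings and positional encodings are NOT included and must be added separately.
model = nn.Transformer(
    d_model=512,           # embedding size
    nhead=8,               # attention heads per layer
    num_encoder_layers=6,  # encoder stack depth
    num_decoder_layers=6,  # decoder stack depth
    dim_feedforward=2048,  # hidden size of each FFN
    batch_first=True,
)

src = torch.rand(2, 10, 512)  # (batch, source length, d_model)
tgt = torch.rand(2, 7, 512)   # (batch, target length, d_model)
model(src, tgt).shape         # torch.Size([2, 7, 512])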

3. Tokenization & Embeddings

Before entering the Transformer:

  1. Text → Tokens (word/subword)

  2. Tokens → Token IDs

  3. Token IDs → Embedding vectors

Embedding captures semantic meaning.

import torch
import torch.nn as nn

vocab_size = 10000
embedding_dim = 512

# Lookup table that maps each token ID to a dense 512-dimensional vector
embedding = nn.Embedding(vocab_size, embedding_dim)

sample_tokens = torch.tensor([10, 25, 300])
embedding(sample_tokens).shape

4. Positional Encoding

Transformers do NOT have sequence awareness.

Positional Encoding adds order information.

Formula:

PE(pos, 2i)   = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

import math
import torch

def positional_encoding(seq_len, d_model):
    # Sinusoidal positional encodings: even dimensions use sin, odd dimensions use cos
    pe = torch.zeros(seq_len, d_model)
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            pe[pos, i] = math.sin(pos / (10000 ** (i / d_model)))
            pe[pos, i + 1] = math.cos(pos / (10000 ** (i / d_model)))
    return pe

positional_encoding(5, 8)
tensor([[ 0.0000e+00,  1.0000e+00,  0.0000e+00,  1.0000e+00,  0.0000e+00,  1.0000e+00,  0.0000e+00,  1.0000e+00],
        [ 8.4147e-01,  5.4030e-01,  9.9833e-02,  9.9500e-01,  9.9998e-03,  9.9995e-01,  1.0000e-03,  1.0000e+00],
        [ 9.0930e-01, -4.1615e-01,  1.9867e-01,  9.8007e-01,  1.9999e-02,  9.9980e-01,  2.0000e-03,  1.0000e+00],
        [ 1.4112e-01, -9.8999e-01,  2.9552e-01,  9.5534e-01,  2.9996e-02,  9.9955e-01,  3.0000e-03,  1.0000e+00],
        [-7.5680e-01, -6.5364e-01,  3.8942e-01,  9.2106e-01,  3.9989e-02,  9.9920e-01,  4.0000e-03,  9.9999e-01]])

5. Self-Attention – Core Idea

Each word attends to every other word.

We compute:

  • Query (Q)

  • Key (K)

  • Value (V)

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.size(-1)
    # Similarity scores, scaled by sqrt(d_k) to keep the softmax gradients stable
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, V), weights

Q = torch.rand(1, 4, 8)
K = torch.rand(1, 4, 8)
V = torch.rand(1, 4, 8)

output, attn_weights = scaled_dot_product_attention(Q, K, V)
output.shape, attn_weights.shape
(torch.Size([1, 4, 8]), torch.Size([1, 4, 4]))

6. Multi-Head Attention

Instead of a single attention operation:

  • Split embeddings into multiple heads

  • Each head learns different relationships

  • Concatenate results

This improves representational power.
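
Below is a minimal sketch (not from the original notebook) of the split-heads idea. It reuses the scaled_dot_product_attention function defined in section 5, and the head count and shapes are illustrative.

import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def split_heads(self, x):
        # (batch, seq, d_model) -> (batch, heads, seq, d_head)
        b, s, _ = x.shape
        return x.view(b, s, self.num_heads, self.d_head).transpose(1, 2)

    def forward(self, x):
        Q = self.split_heads(self.q_proj(x))
        K = self.split_heads(self.k_proj(x))
        V = self.split_heads(self.v_proj(x))
        # Each head attends independently (reuses scaled_dot_product_attention from section 5)
        out, _ = scaled_dot_product_attention(Q, K, V)
        # Concatenate heads back into one d_model-sized vector per token
        b, h, s, d = out.shape
        out = out.transpose(1, 2).reshape(b, s, h * d)
        return self.out_proj(out)

mha = MultiHeadAttention()
mha(torch.rand(2, 5, 512)).shape  # torch.Size([2, 5, 512])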

7. Feed Forward Network (FFN)

Applied independently to each token.

Structure: Linear → ReLU → Linear

Adds non-linearity and depth.

import torch
import torch.nn as nn

# Position-wise FFN: expand from 512 to 2048 dimensions, apply ReLU, project back to 512
ffn = nn.Sequential(
    nn.Linear(512, 2048),
    nn.ReLU(),
    nn.Linear(2048, 512)
)

ffn(torch.rand(2, 5, 512)).shape
torch.Size([2, 5, 512])

nn.Linear operates on the last dimension of the input tensor. Here, each of the 512-dimensional vectors in the (2, 5) sequence is passed through the feed-forward network independently, preserving the batch and sequence dimensions.

8. Residual Connections & Layer Normalization

Why?

  • Stabilize training

  • Faster convergence

  • Better gradient flow

Each sub-layer: Output = LayerNorm(x + Sublayer(x))
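
A tiny sketch of this post-norm pattern (illustrative only), using an arbitrary linear layer as a stand-in for the attention or FFN sub-layer:

import torch
import torch.nn as nn

d_model = 512
layer_norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)  # stand-in for self-attention or the FFN

x = torch.rand(2, 5, d_model)
out = layer_norm(x + sublayer(x))  # residual ("Add") followed by LayerNorm ("Norm")
out.shape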

9. Encoder Block Summary

Each Encoder layer contains:

  1. Multi-Head Self-Attention

  2. Add & Norm

  3. Feed Forward Network

  4. Add & Norm

Repeated N times.
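
For reference, PyTorch packages exactly this block as nn.TransformerEncoderLayer. The sketch below (illustrative sizes, not from the original notebook) stacks it N = 6 times with nn.TransformerEncoder.

import torch
import torch.nn as nn

# One encoder layer = multi-head self-attention + Add & Norm + FFN + Add & Norm
encoder_layer = nn.TransformerEncoderLayer(
    d_model=512, nhead=8, dim_feedforward=2048, batch_first=True
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)  # repeated N = 6 times

x = torch.rand(2, 10, 512)  # (batch, sequence length, d_model)
encoder(x).shape            # torch.Size([2, 10, 512])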

10. Decoder Block Summary

Decoder adds:

  • Masked Self-Attention

  • Encoder–Decoder Attention

Used for text generation tasks.
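
A hedged sketch of one decoder layer using PyTorch's nn.TransformerDecoderLayer; the tensors and sizes are made up for illustration, and the causal mask used here is explained in the next section.

import torch
import torch.nn as nn

# Sketch: masked self-attention + encoder-decoder (cross) attention + FFN,
# each wrapped in Add & Norm. Sizes are illustrative.
decoder_layer = nn.TransformerDecoderLayer(
    d_model=512, nhead=8, dim_feedforward=2048, batch_first=True
)

memory = torch.rand(2, 10, 512)  # encoder output (contextual representations)
tgt = torch.rand(2, 7, 512)      # decoder-side embeddings (shifted targets)

# Causal mask: -inf above the diagonal hides future positions (see next section)
tgt_mask = torch.triu(torch.full((7, 7), float("-inf")), diagonal=1)
decoder_layer(tgt, memory, tgt_mask=tgt_mask).shape  # torch.Size([2, 7, 512])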

11. Masked Attention (Why?)

Prevents the model from seeing future tokens during training.

Essential for autoregressive generation.
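
A small sketch of the causal (look-ahead) mask: positions above the diagonal get -inf added to their scores, so their softmax weights become zero. The shapes are illustrative and the score computation mirrors the function from section 5.

import torch

seq_len = 4
# Causal mask: -inf above the diagonal blocks attention to future positions
mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

# Same scaled dot-product attention as section 5, with the mask added to the scores
Q = K = V = torch.rand(1, seq_len, 8)
scores = torch.matmul(Q, K.transpose(-2, -1)) / (8 ** 0.5)
weights = torch.softmax(scores + mask, dim=-1)
weights  # entries above the diagonal are exactly 0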

12. Transformer for Text Generation

Workflow:

  1. Input tokens

  2. Encoder builds contextual representations

  3. Decoder predicts next token

  4. Sampling (temperature, top-k, top-p)
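
As a rough sketch (made-up logits, not from the original notebook), temperature and top-k sampling can be written in a few lines; top-p (nucleus) sampling works the same way but keeps the smallest set of tokens whose cumulative probability exceeds p.

import torch

torch.manual_seed(0)
logits = torch.randn(10)   # hypothetical next-token logits over a 10-token vocabulary
temperature = 0.8
k = 5

# Temperature: values < 1 sharpen the distribution, values > 1 flatten it
probs = torch.softmax(logits / temperature, dim=-1)

# Top-k: keep the k most likely tokens, renormalize, then sample one of them
top_probs, top_ids = torch.topk(probs, k)
top_probs = top_probs / top_probs.sum()
next_token = top_ids[torch.multinomial(top_probs, num_samples=1)]
next_token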

13. Transformer vs LSTM (Intuition)

Aspect          LSTM      Transformer
Parallelism     Limited   High
Long Context    Limited   Strong
Training Speed  Slow      Fast
Attention       Optional  Core

14. Connection to Large Language Models (LLMs)

LLMs are:

  • Decoder-only Transformers

  • Trained on massive corpora

  • Predict next token

Examples: GPT, LLaMA, PaLM, Claude

15. Key Takeaways

  • Attention replaces recurrence

  • Positional encoding adds order

  • Multi-head attention captures rich context

  • Transformers scale efficiently

  • Foundation of modern GenAI