Important Terms Relevant to Neural Networks and NLP
1. N-grams
Definition: An n-gram is a contiguous sequence of n tokens (characters or words) extracted from text.
Why it matters:
Historically used in statistical language models.
Conceptually helps understand how models learn local context.
Still useful in preprocessing, evaluation, and feature engineering.
Examples (word-level):
Unigram (n=1): "I", "love", "AI"
Bigram (n=2): "I love", "love AI"
Trigram (n=3): "I love AI"
In neural models:
LSTM/RNN: Implicitly learns variable-length n-gram–like dependencies over time.
Transformers: Attention mechanisms learn relationships beyond fixed n-grams, including long-range dependencies.
2. Temperature (in Text Generation)
Definition: Temperature controls the randomness of token selection during text generation by scaling the logits before applying softmax.
Formula intuition: the logits zᵢ are divided by the temperature T before the softmax, so pᵢ = exp(zᵢ / T) / Σⱼ exp(zⱼ / T). As T → 0 the distribution sharpens toward the top token; as T grows it flattens.
Effect of temperature:
Low temperature (< 1.0): More deterministic, safer outputs
High temperature (> 1.0): More diverse, creative, riskier outputs
Example: Suppose the model assigns the highest probability to "AI", with "ML" and "Data" as lower-probability alternatives:
Temperature = 0.5 → almost always selects "AI"
Temperature = 1.5 → "ML" or "Data" may appear more often
Usage:
Common in LSTM text generators and Transformer decoders
Often combined with top-k or top-p (nucleus) sampling
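A small NumPy sketch of temperature-scaled sampling (the logits are made up for illustration):

```python
import numpy as np

def sample_with_temperature(logits, temperature=1.0):
    """Scale logits by 1/temperature, apply softmax, then sample one token index."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(scaled - np.max(scaled))   # subtract max for numerical stability
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)

logits = [2.0, 1.0, 0.5]   # e.g. scores for "AI", "ML", "Data"
print(sample_with_temperature(logits, temperature=0.5))  # usually index 0 ("AI")
print(sample_with_temperature(logits, temperature=1.5))  # indices 1 and 2 appear more often
```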
3. BATCH_SIZE
Definition: The number of training samples processed together in one forward/backward pass.
Why it matters:
Affects training speed, memory usage, and gradient stability.
Example:
Dataset size: 10,000 sentences
Batch size = 32 → ~313 batches per epoch
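A quick check of that arithmetic (assuming the final, smaller batch is kept):

```python
import math

dataset_size = 10_000
batch_size = 32
print(math.ceil(dataset_size / batch_size))  # 313 batches per epoch
```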
Model perspective:
LSTM: Smaller batches often preferred for stability.
Transformers: Larger batch sizes are common (with GPUs/TPUs).
4. BUFFER_SIZE
Definition: The number of samples held in memory while shuffling the dataset (common in TensorFlow pipelines).
Why it matters:
Larger buffer → better randomization
Smaller buffer → faster, but less shuffled data
Example:
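A minimal tf.data sketch (the BUFFER_SIZE and BATCH_SIZE values and the random data are illustrative):

```python
import tensorflow as tf

BUFFER_SIZE = 10_000   # samples held in memory for shuffling
BATCH_SIZE = 64

# Illustrative integer-encoded sequences; in practice these are tokenized sentences.
sequences = tf.random.uniform((50_000, 20), maxval=5_000, dtype=tf.int32)

dataset = (tf.data.Dataset.from_tensor_slices(sequences)
           .shuffle(BUFFER_SIZE)          # larger buffer -> better randomization
           .batch(BATCH_SIZE, drop_remainder=True))
```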
Impact:
Improves generalization by preventing sequence bias.
Important when training language models on ordered text.
5. Shuffle
Definition: Randomly rearranging training samples before batching.
Why it matters in NLP:
Prevents the model from learning artificial ordering.
Reduces overfitting to sequential patterns in the dataset.
Note:
Shuffling happens at the sequence level, not within a sentence.
In Transformers, shuffling does not affect positional encodings inside sequences.
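A sketch of sequence-level shuffling with NumPy (the sentences are illustrative; note each sentence stays intact):

```python
import numpy as np

sentences = ["I love AI", "Transformers use attention", "LSTMs process tokens in order"]

rng = np.random.default_rng(seed=42)
shuffled = [sentences[i] for i in rng.permutation(len(sentences))]
# The order of training examples changes, but the word order inside each sentence does not.
```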
6. Embedding (Embedding Layer)
Definition: A trainable mapping from discrete tokens (words/subwords) to dense numerical vectors.
Why embeddings are critical:
Neural networks cannot operate on raw text.
Embeddings capture semantic and syntactic relationships.
Example:
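A minimal Keras embedding sketch (vocabulary size, embedding dimension, and token ids are illustrative):

```python
import numpy as np
import tensorflow as tf

embedding = tf.keras.layers.Embedding(input_dim=5_000,   # vocabulary size
                                      output_dim=128)    # embedding dimension

token_ids = np.array([[4, 17, 250]])   # one sentence as token indices
vectors = embedding(token_ids)
print(vectors.shape)                   # (1, 3, 128): one dense vector per token
```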
Model usage:
LSTM: Embedding → LSTM → Dense
Transformers: Token embedding + Positional embedding → Attention blocks
7. Parameters
Definition: Trainable values learned by the model during training.
Examples:
Embedding matrices
LSTM weights and biases
Attention projection matrices in Transformers
Scale comparison:
LSTM language model: Thousands to millions of parameters
Transformer models: Millions to billions of parameters
Why it matters:
More parameters → higher capacity
Also higher risk of overfitting and higher compute cost
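A rough sketch of counting parameters in a tiny Keras language model (the layer sizes are illustrative):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=5_000, output_dim=128),  # 5,000 * 128 = 640,000 params
    tf.keras.layers.LSTM(256),                                   # 4 * ((128 + 256) * 256 + 256) = 394,240 params
    tf.keras.layers.Dense(5_000, activation="softmax"),          # 256 * 5,000 + 5,000 = 1,285,000 params
])

model(tf.zeros((1, 20), dtype=tf.int32))   # build the weights with a dummy batch
print(f"{model.count_params():,}")         # roughly 2.3 million trainable values
```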
8. Epoch
Definition: One complete pass through the entire training dataset.
Example:
Dataset: 10,000 samples
Batch size: 100
1 epoch = 100 gradient updates
In practice:
Text models typically train for multiple epochs.
Transformers may converge in fewer epochs due to scale.
9. Vocabulary (Vocab)
Definition: The set of unique tokens the model can understand and generate.
Types of tokens:
Character-level
Word-level
Subword-level (BPE, WordPiece, SentencePiece)
Example:
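Building a tiny word-level vocabulary in plain Python (the corpus is made up):

```python
corpus = ["I love AI", "I love data", "AI loves data"]

# Collect unique word-level tokens, then assign each an integer id.
vocab = sorted({word for sentence in corpus for word in sentence.split()})
word_to_id = {word: i for i, word in enumerate(vocab)}

print(vocab)        # ['AI', 'I', 'data', 'love', 'loves']
print(word_to_id)   # {'AI': 0, 'I': 1, 'data': 2, 'love': 3, 'loves': 4}
```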
Impact:
Larger vocab → richer language, more parameters
Smaller vocab → more token splitting, longer sequences
Transformers: Almost always use subword vocabularies.
10. Early Stopping
Definition: A training strategy that stops training when validation performance stops improving.
Why it matters:
Prevents overfitting
Saves compute time
Example:
Monitor validation loss
Stop training if no improvement for 3 consecutive epochs
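In Keras this is a single callback (a sketch; the model and datasets are assumed to exist already):

```python
import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",           # watch validation loss
    patience=3,                   # stop after 3 epochs with no improvement
    restore_best_weights=True,    # roll back to the best weights seen
)

# model.fit(train_ds, validation_data=val_ds, epochs=50, callbacks=[early_stop])
```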
Typical usage:
Common in LSTM-based models
Less common in very large Transformers, but still used in fine-tuning
Summary Mapping (Quick Reference)
| Term | Primary Purpose |
|---|---|
| N-grams | Local context modeling intuition |
| Temperature | Controls randomness in generation |
| Batch Size | Training efficiency and stability |
| Buffer Size | Quality of data shuffling |
| Shuffle | Prevents order bias |
| Embedding | Converts tokens to vectors |
| Parameters | Model capacity |
| Epoch | Training progress unit |
| Vocabulary | Language coverage |
| Early Stopping | Prevents overfitting |