

1. N-grams

Definition: An n-gram is a contiguous sequence of n tokens (characters or words) extracted from text.

Why it matters:

  • Historically used in statistical language models.

  • Conceptually helps understand how models learn local context.

  • Still useful in preprocessing, evaluation, and feature engineering.

Examples (word-level):

  • Unigram (n=1): "I", "love", "AI"

  • Bigram (n=2): "I love", "love AI"

  • Trigram (n=3): "I love AI"

In neural models:

  • LSTM/RNN: Implicitly learns variable-length n-gram–like dependencies over time.

  • Transformers: Attention mechanisms learn relationships beyond fixed n-grams, including long-range dependencies.
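
As a concrete illustration, here is a minimal word-level n-gram extractor in plain Python (the helper name and sentence are just for illustration, not part of the notebook):

```python
def word_ngrams(text, n):
    """Return all contiguous word-level n-grams in a sentence."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "I love AI"
print(word_ngrams(sentence, 1))  # ['I', 'love', 'AI']
print(word_ngrams(sentence, 2))  # ['I love', 'love AI']
print(word_ngrams(sentence, 3))  # ['I love AI']
```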


2. Temperature (in Text Generation)

Definition: Temperature controls the randomness of token selection during text generation by scaling the logits before applying softmax.

Formula intuition:

softmax(logits / temperature)

Effect of temperature:

  • Low temperature (< 1.0): More deterministic, safer outputs

  • High temperature (> 1.0): More diverse, creative, riskier outputs

Example: If predicted probabilities for next word are:

{"AI": 0.6, "ML": 0.25, "Data": 0.15}
  • Temperature = 0.5 → almost always selects "AI"

  • Temperature = 1.5 → "ML" or "Data" appear more often

Usage:

  • Common in LSTM text generators and Transformer decoders

  • Often combined with top-k or top-p (nucleus) sampling
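
A small NumPy sketch of temperature scaling, starting from the example probabilities above (the probabilities are converted back to logits, then softmax is re-applied at different temperatures; the exact numbers are illustrative):

```python
import numpy as np

words = ["AI", "ML", "Data"]
probs = np.array([0.60, 0.25, 0.15])
logits = np.log(probs)  # treat log-probabilities as logits for this illustration

def softmax_with_temperature(logits, temperature):
    scaled = logits / temperature
    exps = np.exp(scaled - scaled.max())  # subtract the max for numerical stability
    return exps / exps.sum()

for t in (0.5, 1.0, 1.5):
    dist = softmax_with_temperature(logits, t).round(3)
    print(f"T={t}:", dict(zip(words, dist)))
# T=0.5 sharpens the distribution toward "AI";
# T=1.5 flattens it, so "ML" and "Data" get sampled more often.
```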


3. BATCH_SIZE

Definition: The number of training samples processed together in one forward/backward pass.

Why it matters:

  • Affects training speed, memory usage, and gradient stability.

Example:

  • Dataset size: 10,000 sentences

  • Batch size = 32 → ~313 batches per epoch

Model perspective:

  • LSTM: Smaller batches are often preferred for stability.

  • Transformers: Larger batch sizes are common (with GPUs/TPUs).
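
A quick check of the batch count above (the last batch is smaller when the dataset size is not an exact multiple of the batch size):

```python
import math

DATASET_SIZE = 10_000
BATCH_SIZE = 32

batches_per_epoch = math.ceil(DATASET_SIZE / BATCH_SIZE)
print(batches_per_epoch)  # 313 -> 312 full batches plus one partial batch of 16 samples
```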


4. BUFFER_SIZE

Definition: The number of samples held in memory while shuffling the dataset (common in TensorFlow pipelines).

Why it matters:

  • Larger buffer → better randomization

  • Smaller buffer → faster, but less shuffled data

Example:

dataset.shuffle(buffer_size=10000)

Impact:

  • Improves generalization by preventing sequence bias.

  • Important when training language models on ordered text.
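
A minimal tf.data pipeline sketch showing where BUFFER_SIZE fits; the stand-in data (random token IDs) and the sizes are assumptions for illustration:

```python
import tensorflow as tf

BUFFER_SIZE = 10_000
BATCH_SIZE = 32

# Stand-in data: 10,000 "sentences" already encoded as integer token IDs of length 20
encoded_sentences = tf.random.uniform((10_000, 20), maxval=5_000, dtype=tf.int32)

dataset = (
    tf.data.Dataset.from_tensor_slices(encoded_sentences)
    .shuffle(BUFFER_SIZE)        # keep up to BUFFER_SIZE samples in memory, draw from them at random
    .batch(BATCH_SIZE)           # group shuffled samples into training batches
    .prefetch(tf.data.AUTOTUNE)  # overlap data preparation with training
)
```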


5. Shuffle

Definition: Randomly rearranging training samples before batching.

Why it matters in NLP:

  • Prevents the model from learning artificial ordering.

  • Reduces overfitting to sequential patterns in the dataset.

Note:

  • Shuffling happens at the sequence level, not within a sentence.

  • In Transformers, shuffling does not affect positional encodings inside sequences.
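
A tiny NumPy illustration of sequence-level shuffling: the order of the sentences changes, but each sentence stays intact (the sentences are made up):

```python
import numpy as np

sentences = ["I love AI", "Transformers use attention", "LSTMs read tokens in order"]
rng = np.random.default_rng(seed=0)

shuffled = [sentences[i] for i in rng.permutation(len(sentences))]
print(shuffled)  # same sentences in a new order; words inside each sentence are untouched
```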


6. Embedding (Embedding Layer)

Definition: A trainable mapping from discrete tokens (words/subwords) to dense numerical vectors.

Why embeddings are critical:

  • Neural networks cannot operate on raw text.

  • Embeddings capture semantic and syntactic relationships.

Example:

"king" [0.21, -0.45, 0.88, ...] "queen" [0.19, -0.42, 0.91, ...]

Model usage:

  • LSTM: Embedding → LSTM → Dense

  • Transformers: Token embedding + Positional embedding → Attention blocks
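
A minimal Keras sketch of the Embedding → LSTM → Dense stack mentioned above; the vocabulary size, embedding dimension, and layer widths are illustrative choices, not values from the notebook:

```python
import tensorflow as tf

VOCAB_SIZE = 30_000  # number of distinct tokens
EMBED_DIM = 128      # length of each token vector

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None,)),                            # a batch of token-ID sequences
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),         # token IDs -> dense vectors
    tf.keras.layers.LSTM(256),                                # sequence modelling
    tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax"),  # next-token probability distribution
])
model.summary()
```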


7. Parameters

Definition: Trainable values learned by the model during training.

Examples:

  • Embedding matrices

  • LSTM weights and biases

  • Attention projection matrices in Transformers

Scale comparison:

  • LSTM language model: Thousands to millions of parameters

  • Transformer models: Millions to billions of parameters

Why it matters:

  • More parameters → higher capacity

  • Also higher risk of overfitting and higher compute cost
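
A back-of-the-envelope count showing where parameters come from, using the embedding matrix as an example (sizes are illustrative); for a built Keras model, model.count_params() gives the total:

```python
VOCAB_SIZE = 30_000
EMBED_DIM = 128

embedding_params = VOCAB_SIZE * EMBED_DIM  # one trainable vector of length EMBED_DIM per token
print(f"{embedding_params:,}")             # 3,840,000 parameters in the embedding matrix alone
```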


8. Epoch

Definition: One complete pass through the entire training dataset.

Example:

  • Dataset: 10,000 samples

  • Batch size: 100

  • 1 epoch = 100 gradient updates

In practice:

  • Text models typically train for multiple epochs.

  • Transformers may converge in fewer epochs due to scale.
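
A bare-bones loop showing how epochs relate to gradient updates, using the numbers above (the actual forward/backward pass is omitted):

```python
DATASET_SIZE = 10_000
BATCH_SIZE = 100
EPOCHS = 3

# Split the sample indices into batches of 100
batches = [range(i, i + BATCH_SIZE) for i in range(0, DATASET_SIZE, BATCH_SIZE)]

for epoch in range(EPOCHS):   # one epoch = one full pass over all batches
    for batch in batches:     # each batch would trigger one gradient update
        pass                  # forward pass, loss, and backward pass go here
    print(f"epoch {epoch + 1}: {len(batches)} gradient updates")  # 100 per epoch
```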


9. Vocabulary (Vocab)

Definition: The set of unique tokens the model can understand and generate.

Types of tokens:

  • Character-level

  • Word-level

  • Subword-level (BPE, WordPiece, SentencePiece)

Example:

Vocab size = 30,000 tokens

Impact:

  • Larger vocab → richer language, more parameters

  • Smaller vocab → more token splitting, longer sequences

Transformers: Almost always use subword vocabularies.
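
A toy word-level vocabulary built with plain Python; real models usually rely on subword tokenizers (BPE, WordPiece, SentencePiece) instead, and the reserved IDs here are an assumption:

```python
corpus = ["I love AI", "I love Data Science"]

words = sorted({word for sentence in corpus for word in sentence.split()})
vocab = {"<pad>": 0, "<unk>": 1}  # reserved tokens for padding and unknown words
vocab.update({word: i + 2 for i, word in enumerate(words)})

print(len(vocab), vocab)
# 7 {'<pad>': 0, '<unk>': 1, 'AI': 2, 'Data': 3, 'I': 4, 'Science': 5, 'love': 6}
```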


10. Early Stopping

Definition: A training strategy that stops training when validation performance stops improving.

Why it matters:

  • Prevents overfitting

  • Saves compute time

Example:

  • Monitor validation loss

  • Stop training if no improvement for 3 consecutive epochs

Typical usage:

  • Common in LSTM-based models

  • Less common in very large Transformers, but still used in fine-tuning
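
A typical Keras callback configuration matching the example above (watch validation loss, stop after 3 epochs without improvement); the fit call is shown only as a commented-out sketch, since the model and data are not defined here:

```python
import tensorflow as tf

early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",         # watch validation loss
    patience=3,                 # stop after 3 epochs with no improvement
    restore_best_weights=True,  # roll back to the best weights seen so far
)

# model.fit(train_ds, validation_data=val_ds, epochs=50, callbacks=[early_stopping])
```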


Summary Mapping (Quick Reference)

| Term | Primary Purpose |
| --- | --- |
| N-grams | Local context modeling intuition |
| Temperature | Controls randomness in generation |
| Batch Size | Training efficiency and stability |
| Buffer Size | Quality of data shuffling |
| Shuffle | Prevents order bias |
| Embedding | Converts tokens to vectors |
| Parameters | Model capacity |
| Epoch | Training progress unit |
| Vocabulary | Language coverage |
| Early Stopping | Prevents overfitting |