
The N-gram model is a probabilistic language model that predicts the next item in a sequence (usually a word or character) from the previous N-1 items.
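In equation form (the standard Markov assumption, stated here for reference rather than quoted from the notebook), an N-gram model approximates a sequence probability as a product of fixed-length conditional probabilities:

P(w_1, \dots, w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-N+1}, \dots, w_{i-1})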

Types

📘 1. Unigram Model (N = 1)

Only considers the probability of each word independently.

Corpus: "I love ice cream"

Unigram probabilities (based on frequency):

  • P(I) = 1/4

  • P(love) = 1/4

  • P(ice) = 1/4

  • P(cream) = 1/4

Usage: To generate text, choose each word based on its unigram probability, ignoring previous words.
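A minimal unigram sketch in plain Python (using the four-word corpus above; random.choices does the frequency-weighted sampling):

from collections import Counter
import random

tokens = "I love ice cream".split()
counts = Counter(tokens)
total = len(tokens)

# MLE unigram probabilities: P(w) = count(w) / total number of tokens
unigram_probs = {w: c / total for w, c in counts.items()}
print(unigram_probs)  # every word gets 0.25 in this corpus

# Generate text by sampling each word independently of any previous word
generated = random.choices(list(unigram_probs), weights=list(unigram_probs.values()), k=5)
print(" ".join(generated))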


📘 2. Bigram Model (N = 2)

Considers the probability of a word based on the previous word.

Corpus: "I love ice cream"

Bigrams and probabilities:

  • P(love | I) = 1

  • P(ice | love) = 1

  • P(cream | ice) = 1

To calculate a sentence probability:

  • P(I love ice cream) = P(I) × P(love | I) × P(ice | love) × P(cream | ice)

  • Let’s assume P(I) = 0.25 (from unigram), then: P(sentence) = 0.25 × 1 × 1 × 1 = 0.25
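A quick sketch of that chain-rule calculation in plain Python (same corpus; P(I) comes from the unigram model as assumed above):

from collections import Counter

tokens = "I love ice cream".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def p_bigram(prev, word):
    # MLE estimate: Count(prev, word) / Count(prev)
    return bigram_counts[(prev, word)] / unigram_counts[prev]

p_sentence = (unigram_counts["I"] / len(tokens)) \
    * p_bigram("I", "love") * p_bigram("love", "ice") * p_bigram("ice", "cream")
print(p_sentence)  # 0.25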


📘 3. Trigram Model (N = 3)

Considers two previous words to predict the next.

Corpus: "I love ice cream"

Trigrams and probabilities:

  • P(ice | I love) = 1

  • P(cream | love ice) = 1

To calculate a sentence probability (approximate):

  • Assume we also have P(I) and P(love | I):

  • P(sentence) = P(I) × P(love | I) × P(ice | I love) × P(cream | love ice)
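A minimal trigram sketch in plain Python (this tiny corpus only yields the two trigrams listed above):

from collections import Counter

tokens = "I love ice cream".split()

# Triples of consecutive words, and the two-word histories they condition on
trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
bigram_counts = Counter(zip(tokens, tokens[1:]))

# MLE estimate: P(w3 | w1 w2) = Count(w1, w2, w3) / Count(w1, w2)
print(trigram_counts[("love", "ice", "cream")] / bigram_counts[("love", "ice")])  # 1.0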


Summary:

Model   | Memory  | Example Prediction
Unigram | 0 words | Predict the next word from overall frequency
Bigram  | 1 word  | Predict "cream" from "ice"
Trigram | 2 words | Predict "cream" from "love ice"

Real Use Case Example:

Suppose you're using a bigram model and want to predict the next word after "machine". From a corpus, you might get:

  • P(learning | machine) = 0.7

  • P(gun | machine) = 0.2

  • P(shop | machine) = 0.1

So, the model would most likely predict "learning".
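Picking the prediction is just an argmax over that conditional distribution (a sketch; the probabilities are the illustrative numbers above, not estimated from a real corpus):

# Hypothetical bigram probabilities for the history "machine"
next_word_probs = {"learning": 0.7, "gun": 0.2, "shop": 0.1}

# The model predicts the continuation with the highest conditional probability
print(max(next_word_probs, key=next_word_probs.get))  # learning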

Bigram Model step by step, using the text:

"Ashi have ice cream jar. Ice cream is her fav."


Step 1: Preprocess the Sentence

Let’s split the text into lowercase words and add a sentence boundary marker <s> at the beginning of each sentence.

Preprocessed tokens:

<s> ashi have ice cream jar . <s> ice cream is her fav .

Tokenized list:

['<s>', 'ashi', 'have', 'ice', 'cream', 'jar', '.', '<s>', 'ice', 'cream', 'is', 'her', 'fav', '.']

Step 2: Extract Bigrams

From the token list, we form pairs of consecutive words:

('<s>', 'ashi') ('ashi', 'have') ('have', 'ice') ('ice', 'cream') ('cream', 'jar') ('jar', '.') ('.', '<s>') ('<s>', 'ice') ('ice', 'cream') ('cream', 'is') ('is', 'her') ('her', 'fav') ('fav', '.')
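In Python, these consecutive pairs fall out of zipping the token list against itself shifted by one (a small sketch over the token list from Step 1):

tokens = ['<s>', 'ashi', 'have', 'ice', 'cream', 'jar', '.',
          '<s>', 'ice', 'cream', 'is', 'her', 'fav', '.']

# Pair each token with the token that follows it
bigrams = list(zip(tokens, tokens[1:]))
print(bigrams[:3])  # [('<s>', 'ashi'), ('ashi', 'have'), ('have', 'ice')]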

Step 3: Count Bigrams and Unigrams

Bigram Counts:

Bigram            | Count
('<s>', 'ashi')   | 1
('ashi', 'have')  | 1
('have', 'ice')   | 1
('ice', 'cream')  | 2
('cream', 'jar')  | 1
('jar', '.')      | 1
('.', '<s>')      | 1
('<s>', 'ice')    | 1
('cream', 'is')   | 1
('is', 'her')     | 1
('her', 'fav')    | 1
('fav', '.')      | 1

Unigram Counts (first word in each bigram):

Word  | Count
<s>   | 2
ashi  | 1
have  | 1
ice   | 2
cream | 2
jar   | 1
.     | 1
is    | 1
her   | 1
fav   | 1

Step 4: Calculate Bigram Probabilities

Using Maximum Likelihood Estimation (MLE):

P(w_i \mid w_{i-1}) = \frac{\text{Count}(w_{i-1}, w_i)}{\text{Count}(w_{i-1})}

Examples:

  • P(ashi | <s>) = 1/2 = 0.5

  • P(ice | have) = 1/1 = 1.0

  • P(cream | ice) = 2/2 = 1.0

  • P(jar | cream) = 1/2 = 0.5

  • P(is | cream) = 1/2 = 0.5


Step 5: Bigram Sentence Probability

Let’s compute the probability of the first sentence: "Ashi have ice cream jar ." With tokens: <s> ashi have ice cream jar .

P(<s>, ashi, have, ice, cream, jar, .) = P(ashi | <s>) × P(have | ashi) × P(ice | have) × P(cream | ice) × P(jar | cream) × P(. | jar)
= (1/2) × (1/1) × (1/1) × (2/2) × (1/2) × (1/1) = 0.5 × 1 × 1 × 1 × 0.5 × 1 = 0.25

Summary:

The bigram model helps assign probabilities to word sequences by looking at pairs of words. In this example:

  • It recognizes that "ice cream" is a strong pair: "ice" is always followed by "cream" (seen twice), so P(cream | ice) = 1.0.

  • It assigns lower conditional probabilities where the history is ambiguous: "cream" is followed by both "jar" and "is", so each continuation gets P = 0.5.

import nltk
from nltk.util import ngrams
from nltk.tokenize import word_tokenize
from collections import Counter

# nltk.download('punkt')  # run once if the tokenizer data is not installed yet

# Sample text
text = "Ashi has a cat Doma Doma is very naughty Doma is not like other cats Ashi still loves cat Doma."

# Tokenize text
tokens = word_tokenize(text)
# Generate bigrams
bigrams = list(ngrams(tokens, 2))

# Count frequency of bigrams
bigram_freq = Counter(bigrams)

print("Bigram Frequencies:")
for bg, freq in bigram_freq.items():
    print(f"{bg}: {freq}")
Bigram Frequencies:
('Ashi', 'has'): 1
('has', 'a'): 1
('a', 'cat'): 1
('cat', 'Doma'): 2
('Doma', 'Doma'): 1
('Doma', 'is'): 2
('is', 'very'): 1
('very', 'naughty'): 1
('naughty', 'Doma'): 1
('is', 'not'): 1
('not', 'like'): 1
('like', 'other'): 1
('other', 'cats'): 1
('cats', 'Ashi'): 1
('Ashi', 'still'): 1
('still', 'loves'): 1
('loves', 'cat'): 1
('Doma', '.'): 1
# Estimate P(word2 | word1) where word1 is 'cat'
word1 = 'cat'
following_words = {bg[1]: freq for bg, freq in bigram_freq.items() if bg[0] == word1}
total_count = sum(following_words.values())
probabilities = {word: count / total_count for word, count in following_words.items()}

print(f"\nProbabilities of words following '{word1}':")
for word, prob in probabilities.items():
    print(f"P({word} | {word1}) = {prob:.2f}")
Probabilities of words following 'cat':
P(Doma | cat) = 1.00
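As a usage sketch, the same bigram_freq counts can drive a tiny greedy text generator: start from a seed word and repeatedly append the most frequent continuation (a sketch, reusing bigram_freq from the cell above):

def next_word(word, bigram_freq):
    # Counts of the words that follow `word` in the corpus
    candidates = {bg[1]: freq for bg, freq in bigram_freq.items() if bg[0] == word}
    if not candidates:
        return None
    # Greedy choice: take the most frequent continuation
    return max(candidates, key=candidates.get)

generated = ['Ashi']
for _ in range(5):
    nxt = next_word(generated[-1], bigram_freq)
    if nxt is None:
        break
    generated.append(nxt)

print(' '.join(generated))  # e.g. 'Ashi has a cat Doma is'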

Simple Neural Network Language Model with TensorFlow

Example of the input encoding used below: a sentence such as "life is love" is mapped to integer ids (say [1, 2, 3]), and pre-padding with zeros to a fixed length then gives sequences like [0, 1, 2, 3] (length 4) or [0, 0, 1, 2, 3] (length 5).

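A minimal sketch of that encoding with the Keras utilities used below (the exact ids depend on what the tokenizer is fitted on, so treat the numbers as illustrative):

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer()
tokenizer.fit_on_texts(["life is love"])

# Each word gets an integer id (assigned by frequency, then order of appearance)
ids = tokenizer.texts_to_sequences(["life is love"])[0]
print(ids)  # [1, 2, 3]

# Pre-padding with zeros brings every sequence to a fixed length
print(pad_sequences([ids], maxlen=4, padding='pre'))  # [[0 1 2 3]]
print(pad_sequences([ids], maxlen=5, padding='pre'))  # [[0 0 1 2 3]]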
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

# Sample corpus
corpus = [
    "Ashi has a cat Doma",
    "Doma is very naughty",
    "All cats are not like Doma",
    "Ashi still loves her"
]

# Tokenization
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1

# Create input sequences
input_sequences = []
for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_seq = token_list[:i + 1]
        input_sequences.append(n_gram_seq)

# Padding
max_seq_len = max(len(seq) for seq in input_sequences)
input_sequences = pad_sequences(input_sequences, maxlen=max_seq_len, padding='pre')

# Split X and y
X = input_sequences[:, :-1]
y = input_sequences[:, -1]
y = tf.keras.utils.to_categorical(y, num_classes=total_words)

# Build model
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(total_words, 10, input_length=max_seq_len - 1),
    tf.keras.layers.SimpleRNN(50),
    tf.keras.layers.Dense(total_words, activation='softmax')
])
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train
model.fit(X, y, epochs=200, verbose=0)

# Predict next word
seed_text = "Ashi has a"
token_list = tokenizer.texts_to_sequences([seed_text])[0]
token_list = pad_sequences([token_list], maxlen=max_seq_len - 1, padding='pre')
predicted_probs = model.predict(token_list, verbose=0)
predicted_index = np.argmax(predicted_probs, axis=1)[0]
predicted_word = tokenizer.index_word[predicted_index]

print(f"Given seed text: '{seed_text}'")
print(f"Predicted next word: '{predicted_word}'")
C:\Users\Suyashi144893\AppData\Local\anaconda3\Lib\site-packages\keras\src\layers\core\embedding.py:90: UserWarning: Argument `input_length` is deprecated. Just remove it. warnings.warn(
Given seed text: 'Ashi has a'
Predicted next word: 'cat'
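To generate more than one word, the prediction step can be repeated, feeding each predicted word back into the seed text. A sketch that reuses model, tokenizer, and max_seq_len from the cell above (what it produces depends on how training went):

def generate_text(seed_text, n_words):
    # Greedy generation: append the most probable next word n_words times
    for _ in range(n_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_seq_len - 1, padding='pre')
        predicted_index = int(np.argmax(model.predict(token_list, verbose=0), axis=1)[0])
        if predicted_index == 0:  # index 0 is the padding id and has no word
            break
        seed_text += " " + tokenizer.index_word[predicted_index]
    return seed_text

print(generate_text("Ashi has a", 3))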

Full Bigram Model Code with Sentence Probability

from collections import defaultdict
import re

# Step 1: Preprocess the Text
text = "Ashi have ice cream jar. Ice cream is her fav."
text = text.lower()
sentences = re.split(r'[.!?]', text)

tokens = []
for sentence in sentences:
    words = sentence.strip().split()
    if words:
        tokens += ['<s>'] + words + ['</s>']

# Step 2: Count Unigrams and Bigrams
bigram_counts = defaultdict(int)
unigram_counts = defaultdict(int)

for i in range(len(tokens) - 1):
    unigram_counts[tokens[i]] += 1
    bigram_counts[(tokens[i], tokens[i + 1])] += 1
unigram_counts[tokens[-1]] += 1  # last word

# Step 3: Calculate Bigram Probabilities
bigram_prob = {}
for (w1, w2), count in bigram_counts.items():
    prob = count / unigram_counts[w1]
    bigram_prob[(w1, w2)] = round(prob, 3)

# Step 4: Show Bigram Probabilities
print("Bigram Probabilities:\n")
for (w1, w2), prob in bigram_prob.items():
    print(f"P({w2} | {w1}) = {prob}")

# Step 5: Calculate Sentence Probability
def sentence_probability(sentence, bigram_prob_dict):
    words = ['<s>'] + sentence.lower().split() + ['</s>']
    prob = 1.0
    for i in range(len(words) - 1):
        pair = (words[i], words[i + 1])
        prob *= bigram_prob_dict.get(pair, 0)
    return prob

# Test Sentence
test_sentence = "ashi have ice cream jar"
prob = sentence_probability(test_sentence, bigram_prob)
print(f"\nSentence: '{test_sentence}'")
print(f"Bigram Model Probability: {round(prob, 5)}")
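A usage note on the code above: sentence_probability multiplies in 0 for any bigram that never occurred in the training text, so a sentence containing a single unseen pair gets probability 0 (smoothing would be needed to avoid that). For example, with the probabilities the script prints:

# Built entirely from seen bigrams: non-zero probability
print(sentence_probability("ice cream is her fav", bigram_prob))  # 0.25

# Ends early, so ('cream', '</s>') was never seen: probability collapses to 0
print(sentence_probability("ashi have ice cream", bigram_prob))   # 0.0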