Path: blob/master/Generative NLP Models using Python/4 N-gram models and Neural Network models .ipynb
The N-Gram model is a probabilistic language model used to predict the next item in a sequence, usually a word or character, based on the previous N-1 items.
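In general, an N-gram model approximates the probability of a whole sequence with a Markov assumption: each word depends only on the N−1 words before it.

$$P(w_1, \dots, w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-N+1}, \dots, w_{i-1})$$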
Types
📘 1. Unigram Model (N = 1)
Only considers the probability of each word independently.
Corpus: "I love ice cream"
Unigram probabilities (based on frequency):
P(I) = 1/4
P(love) = 1/4
P(ice) = 1/4
P(cream) = 1/4
Usage: To generate text, choose each word based on its unigram probability, ignoring previous words.
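A minimal sketch of these unigram estimates in plain Python (the variable names are just illustrative):

```python
from collections import Counter

corpus = "I love ice cream"
tokens = corpus.lower().split()   # ['i', 'love', 'ice', 'cream']

# Unigram probability = word count / total number of tokens
total = len(tokens)
unigram_probs = {word: count / total for word, count in Counter(tokens).items()}

print(unigram_probs)   # every word has probability 0.25
```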
📘 2. Bigram Model (N = 2)
Considers the probability of a word based on the previous word.
Corpus: "I love ice cream"
Bigrams and probabilities:
P(love | I) = 1
P(ice | love) = 1
P(cream | ice) = 1
To calculate a sentence probability:
P(I love ice cream) = P(I) × P(love | I) × P(ice | love) × P(cream | ice)
Let’s assume P(I) = 0.25 (from unigram), then: P(sentence) = 0.25 × 1 × 1 × 1 = 0.25
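The same chain-rule product, written out in Python with the probabilities from this toy corpus hard-coded for illustration:

```python
# Values taken from the example above: P(I) from the unigram model,
# the rest from the bigram probabilities.
p_first = 0.25
bigram_probs = {("i", "love"): 1.0, ("love", "ice"): 1.0, ("ice", "cream"): 1.0}

words = "i love ice cream".split()
prob = p_first
for prev, curr in zip(words, words[1:]):
    prob *= bigram_probs[(prev, curr)]

print(prob)   # 0.25
```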
📘 3. Trigram Model (N = 3)
Considers two previous words to predict the next.
Corpus: "I love ice cream"
Trigrams and probabilities:
P(ice | I love) = 1
P(cream | love ice) = 1
To calculate a sentence probability (approximate):
Assume we also have P(I) and P(love | I):
P(sentence) = P(I) × P(love | I) × P(ice | I love) × P(cream | love ice)
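Extracting the trigrams from a token list is a one-liner with zip; a quick sketch:

```python
tokens = "i love ice cream".lower().split()

# Consecutive triples: the third word is predicted from the previous two.
trigrams = list(zip(tokens, tokens[1:], tokens[2:]))
print(trigrams)   # [('i', 'love', 'ice'), ('love', 'ice', 'cream')]
```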
Summary:
| Model | Memory | Example Prediction |
|---|---|---|
| Unigram | 0 previous words | Predict the next word from overall frequency |
| Bigram | 1 previous word | Predict "cream" from "ice" |
| Trigram | 2 previous words | Predict "cream" from "love ice" |
Real Use Case Example:
Suppose you're using a bigram model and want to predict the next word after "machine". From a corpus, you might get:
P(learning | machine) = 0.7
P(gun | machine) = 0.2
P(shop | machine) = 0.1
So, the model would most likely predict "learning".
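Choosing the prediction is just an argmax over that table; the numbers below are the hypothetical ones from the example:

```python
# Hypothetical bigram probabilities for words following "machine"
next_word_probs = {"learning": 0.7, "gun": 0.2, "shop": 0.1}

prediction = max(next_word_probs, key=next_word_probs.get)
print(prediction)   # learning
```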
Bigram Model step-by-step using the sentence:
"Ashi have ice cream jar. Ice cream is her fav."
Step 1: Preprocess the Sentence
Let’s split the text into lowercase words and add a sentence boundary marker <s> at the beginning of each sentence.
Tokenized list:
<s> ashi have ice cream jar . <s> ice cream is her fav .
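One way to produce that token list (a sketch; splitting on ". " is only reliable for a tidy example like this one):

```python
text = "Ashi have ice cream jar. Ice cream is her fav."

# Lowercase, split into sentences, prepend the <s> boundary marker, keep the final '.'
tokens = []
for sentence in text.lower().split(". "):
    tokens += ["<s>"] + sentence.rstrip(".").split() + ["."]

print(tokens)
# ['<s>', 'ashi', 'have', 'ice', 'cream', 'jar', '.', '<s>', 'ice', 'cream', 'is', 'her', 'fav', '.']
```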
Step 2: Extract Bigrams
From the token list, we form pairs of consecutive words:
(<s>, ashi), (ashi, have), (have, ice), (ice, cream), (cream, jar), (jar, .), (., <s>), (<s>, ice), (ice, cream), (cream, is), (is, her), (her, fav), (fav, .)
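Continuing the sketch from Step 1, the pairs fall out of zip:

```python
# Consecutive pairs from the token list built in Step 1
bigrams = list(zip(tokens, tokens[1:]))
print(bigrams[:3])   # [('<s>', 'ashi'), ('ashi', 'have'), ('have', 'ice')]
```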
Step 3: Count Bigrams and Unigrams
Bigram Counts:
| Bigram | Count |
|---|---|
| ('<s>', 'ashi') | 1 |
| ('ashi', 'have') | 1 |
| ('have', 'ice') | 1 |
| ('ice', 'cream') | 2 |
| ('cream', 'jar') | 1 |
| ('jar', '.') | 1 |
| ('.', '<s>') | 1 |
| ('<s>', 'ice') | 1 |
| ('cream', 'is') | 1 |
| ('is', 'her') | 1 |
| ('her', 'fav') | 1 |
| ('fav', '.') | 1 |
Unigram Counts (first word in each bigram):
| Word | Count |
|---|---|
| <s> | 2 |
| ashi | 1 |
| have | 1 |
| ice | 2 |
| cream | 2 |
| jar | 1 |
| . | 1 |
| is | 1 |
| her | 1 |
| fav | 1 |
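Both count tables can be produced with collections.Counter over the token list from Step 1 (same sketch continued):

```python
from collections import Counter

bigram_counts = Counter(zip(tokens, tokens[1:]))
unigram_counts = Counter(tokens[:-1])   # first word of each bigram

print(bigram_counts[("ice", "cream")])  # 2
print(unigram_counts["<s>"])            # 2
```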
Step 4: Calculate Bigram Probabilities
Using Maximum Likelihood Estimation (MLE):
P(w_n | w_(n-1)) = Count(w_(n-1), w_n) / Count(w_(n-1))
Examples:
P(ashi | <s>) = Count(<s>, ashi) / Count(<s>) = 1/2 = 0.5
P(cream | ice) = Count(ice, cream) / Count(ice) = 2/2 = 1
P(jar | cream) = Count(cream, jar) / Count(cream) = 1/2 = 0.5
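The MLE probabilities follow directly from those counts (continuing the sketch):

```python
# P(w2 | w1) = Count(w1, w2) / Count(w1)
bigram_probs = {
    (w1, w2): count / unigram_counts[w1]
    for (w1, w2), count in bigram_counts.items()
}

print(bigram_probs[("ice", "cream")])   # 1.0
print(bigram_probs[("cream", "jar")])   # 0.5
```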
Step 5: Bigram Sentence Probability
Let’s compute the probability of the first sentence: "Ashi have ice cream jar ." with tokens: <s> ashi have ice cream jar .
P(sentence) = P(ashi | <s>) × P(have | ashi) × P(ice | have) × P(cream | ice) × P(jar | cream) × P(. | jar)
= 0.5 × 1 × 1 × 1 × 0.5 × 1 = 0.25
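Chaining those probabilities over the first sentence reproduces the 0.25 above (same sketch):

```python
sentence_tokens = ["<s>", "ashi", "have", "ice", "cream", "jar", "."]

prob = 1.0
for w1, w2 in zip(sentence_tokens, sentence_tokens[1:]):
    prob *= bigram_probs[(w1, w2)]

print(prob)   # 0.25
```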
Summary:
The bigram model helps assign probabilities to word sequences by looking at pairs of words. In this example:
It recognizes that "ice cream" is a strong pattern: "ice" is followed by "cream" both times it appears, so P(cream | ice) = 1.
It counts bigrams like "cream jar" and "ashi have" only once each, and "cream jar" also gets a lower conditional probability, P(jar | cream) = 0.5, because "cream" is followed by both "jar" and "is".
Simple Neural Network Language Model with TensorFlow
"life is love" [1,2,3] [length=4] [0,1,2,3] [0,0,1,2,3]