CoCalc -- 6 Transformers for NLP.ipynb

GitHub Repository: suyashi29/python-su
Path: blob/master/Generative NLP Models using Python/6 Transformers for NLP.ipynb
³⁰⁷⁴ views

Kernel: Python 3 (ipykernel)

Transformers" in the context of artificial intelligence are a type of computer model designed to understand and generate human language. They're really good at tasks like translating languages, answering questions, and generating text.

Transformers rely on a mechanism called "self-attention" to weigh the importance of different words in a sentence when processing language data. This mechanism allows them to capture long-range dependencies and relationships between words more effectively than previous models. As a result, Transformers have achieved state-of-the-art performance in many NLP tasks, including language translation, text summarization, question answering, and sentiment analysis.
Self-attention is a technique used in NLP that helps models to understand relationships between words or entities in a sentence, no matter where they appear. It is a important part of transformers model which is used in tasks like translation and text generation.

Understanding Attention in NLP

The goal of self attention mechanism is to improve performance of traditional models such as encoder-decoder models used in RNNs (Recurrent Neural Networks).
In traditional encoder-decoder models input sequence is compressed into a single fixed-length vector which is then used to generate the output.
This works well for short sequences but struggles with long ones because important information can be lost when compressed into a single vector.

Self-Attention Mechanism Explained

The self-attention mechanism is a key innovation behind models like Transformers (e.g., BERT, GPT). It allows models to weigh the importance of different words in a sequence relative to each other.

1. What is Self-Attention?

Self-Attention lets a model attend to all positions of a sequence to compute a representation of that sequence.
It’s used to model dependencies between tokens, even if they are far apart.
It allows a model to dynamically focus on relevant parts of the input for each token.

2. Core Idea

Given an input sequence of token embeddings:

[ X = [x_1, x_2, ..., x_n] ]

We transform these into Query (Q), Key (K), and Value (V) vectors:

[ Q = XW^Q,\quad K = XW^K,\quad V = XW^V ]

Where (W^Q, W^K, W^V) are learnable weight matrices.

3. Scaled Dot-Product Attention

For each token:

Compute attention scores using dot product between its Query and all Keys: [ \text{AttentionScore}(i,j) = Q_i \cdot K_j^T ]
Scale the scores: [ \text{ScaledScore} = \frac{QK^T}{\sqrt{d_k}} ]
Apply softmax to get attention weights: [ \text{AttentionWeights} = \text{softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right) ]
Multiply by Values: [ \text{Output} = \text{AttentionWeights} \cdot V ]

4. Multi-Head Attention

Instead of performing a single attention function, we run multiple in parallel (called "heads"):

[ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O ]

Each head learns different representations, helping the model attend to information from different subspaces.

5. Role in Transformers

Self-Attention is used in Encoder and Decoder blocks.
Captures contextual information for each token efficiently.
Enables parallelization (vs. RNNs).

Key Advantages

Long-range dependency modeling
Parallel computation
Dynamic contextualization of words
Scales better than recurrent models

. Summary

Component	Role
Query (Q)	Current word's focus
Key (K)	Candidate words to attend to
Value (V)	Contains information to combine
Softmax	Normalizes attention scores
Multi-head	Allows model to attend to different perspectives

The key components of Transformer models include:

1. Self-Attention Mechanism: This is the core component of Transformers. Self-attention allows the model to weigh the importance of different words in a sentence when processing language data. It enables capturing contextual relationships between words in a sequence, facilitating better understanding of the input.
1. Multi-Head Attention: In Transformers, self-attention is typically used in multiple "heads" or parallel attention mechanisms. Each head allows the model to focus on different parts of the input, enabling it to capture different types of relationships simultaneously.
1. Positional Encoding: Since Transformer models do not inherently understand the sequential order of input tokens like recurrent neural networks (RNNs), positional encoding is added to the input embeddings to provide information about the position of each token in the sequence.
1. Feedforward Neural Networks: Transformers include feedforward neural networks as part of their architecture. These networks are applied independently to each token's representation after self-attention and positional encoding, allowing the model to capture non-linear relationships between features.
1. Encoder and Decoder Layers: Transformer architectures often consist of encoder and decoder layers. The encoder processes the input sequence, while the decoder generates the output sequence in tasks like sequence-to-sequence translation. Each layer in the encoder and decoder typically includes self-attention and feedforward neural network sub-layers.
1. Residual Connections and Layer Normalization: To facilitate training deep networks, Transformers use residual connections around each sub-layer followed by layer normalization. These techniques help alleviate the vanishing gradient problem and improve the flow of information through the network.
1. Masking: In tasks like language translation, where the entire input sequence is available during training, masking is applied to prevent the model from attending to future tokens when predicting the output sequence.

These components work together to enable Transformers to achieve state-of-the-art performance in various natural language processing tasks.

In [ ]:

"Life is good" 

["Life","is","Good"] - Token of words

[1 0  0 
0  1  0
0  0  1] * Weight Random matrix

Word Embedding

Word embedding is a technique used to represent words as vectors (arrays of numbers). These vectors capture semantic relationships between words.

In [1]:

## Word Emdedding using Python 
import numpy as np

# Define the vocabulary
vocab = {"Suyashi": 0, "is": 1, "Happy": 2}


# One-hot encode each word
def one_hot_encoding(word_index, vocab_size):
    one_hot = np.zeros(vocab_size)
    one_hot[word_index] = 1
    return one_hot

# Define the weight matrix (random initialization for simplicity)
weight_matrix = np.random.rand(len(vocab), 3)

# Multiply one-hot encoded words with the weight matrix
word_embeddings = {}
for word, index in vocab.items():
    one_hot_encoded = one_hot_encoding(index, len(vocab))
    embedding = np.dot(one_hot_encoded, weight_matrix)
    word_embeddings[word] = embedding

# Display word embeddings
for word, embedding in word_embeddings.items():
    print(f"{word}: {embedding}")

Out[1]:

Suyashi: [0.67536105 0.42282022 0.58439591]
is: [0.61368818 0.43329308 0.06924587]
Happy: [0.97324454 0.60396219 0.82889336]

let's implement a basic position encoding scheme to add positional information to the word embeddings:

Positional encoding is a crucial concept in the Transformer architecture, enabling the model to capture the order of words in a sequence since Transformers lack inherent sequential information. Let's break down the concept using a simple example with the sentence "suyashi is happy".

Understanding Positional Encoding In a Transformer model, each word is embedded into a high-dimensional space using an embedding matrix. However, the position of each word in the sequence is not captured by these embeddings. Positional encoding addresses this by adding a unique positional information to each word embedding.
Positional Encoding Formula The most common method for positional encoding in Transformers involves using sine and cosine functions of different frequencies. For a position 𝑝 𝑜 𝑠 pos and a dimension 𝑖 i of the encoding, the positional encoding is defined as:

3. Example: "suyashi is happy"

Let's consider a simple example where we encode a small part of the sequence using positional encoding. Assume the sentence "suyashi is happy" has four positions (0, 1, 2, 3), and we use a small dimension d=4 for simplicity.

Resulting Positional Encodings

Combining these, we get the positional encodings for each position

If the word embeddings for "suyashi", "is", "happy" are vectors, the positional encoding vectors would be added to these embeddings. This addition ensures that the model is aware of the position of each word in the sentence.

In practice, these operations are done over higher-dimensional spaces and with many more positions, but the fundamental idea remains the same. The positional encodings help the Transformer model understand the order and position of words within a sequence

In [2]:

import numpy as np

# Position Encoding
def position_encoding(sentence_length, embedding_dim):
    position_encodings = np.zeros((sentence_length, embedding_dim))
    for pos in range(sentence_length):
        for i in range(embedding_dim):
            if i % 2 == 0:
                position_encodings[pos, i] = np.sin(pos / (10000 ** (i / embedding_dim)))  # scale th position encoding 
            else:
                position_encodings[pos, i] = np.cos(pos / (10000 ** ((i - 1) / embedding_dim)))
    return position_encodings

# Assuming we have word embeddings for "Suyashi", "is", "Happy" as follows
word_embeddings = {
    "Suyashi": np.array([0.1, 0.2, 0.3]),
    "is": np.array([0.2, 0.3, 0.1]),
    "Happy": np.array([0.3, 0.1, 0.2])
}

# Get the position encodings
sentence_length = 3
embedding_dim = 3
pos_encodings = position_encoding(sentence_length, embedding_dim)

# Add position encodings to word embeddings
for i, word in enumerate(word_embeddings):
    word_embeddings[word] += pos_encodings[i % sentence_length]

# Display word embeddings with position encodings
print("Word embeddings with position encodings:")
for word, embedding in word_embeddings.items():
    print(f"{word}: {embedding}")

Out[2]:

Word embeddings with position encodings:
Suyashi: [0.1 1.2 0.3]
is: [1.04147098 0.84030231 0.10215443]
Happy: [ 1.20929743 -0.31614684  0.20430886]

In [ ]:

import numpy as np
A = np.array([ [4, 10, 11], [21, 22, 23], [31, 32, 33] ])
B = np.ones((3,3))

print("\nMultiplying two arrays: ")
print(A * B)

print("Matrix Multiplication")
print(np.dot (A,B))

In [ ]:

B = np.ones((3,3))
B

We define the position encoding function and word embeddings as we did previously.
We concatenate word embeddings with position encodings to create input vectors.
The self_attention function calculates attention scores using dot product and scales them by the square root of the dimensionality of the embeddings.
Softmax is applied to obtain attention weights.
The attention weights are then used to compute the attended inputs by applying them to the input vectors.
Finally, we display the attended inputs and attention weights.

In [3]:

import numpy as np

# Position Encoding
def position_encoding(sentence_length, embedding_dim):
    position_encodings = np.zeros((sentence_length, embedding_dim))
    for pos in range(sentence_length):
        for i in range(embedding_dim):
            if i % 2 == 0:
                position_encodings[pos, i] = np.sin(pos / (10000 ** (i / embedding_dim)))
            else:
                position_encodings[pos, i] = np.cos(pos / (10000 ** ((i - 1) / embedding_dim)))
    return position_encodings

# Word embeddings for "Suyashi", "is", "Happy"
word_embeddings = {
    "Suyashi": np.array([0.1, 0.2, 0.3]),
    "is": np.array([0.2, 0.3, 0.1]),
    "Happy": np.array([0.3, 0.1, 0.2])
}

# Get the position encodings
sentence_length = 3
embedding_dim = 3
pos_encodings = position_encoding(sentence_length, embedding_dim)

# Concatenate word embeddings with position encodings
inputs = np.array([word_embeddings[word] + pos_encodings[i] for i, word in enumerate(word_embeddings)])

# Self-attention mechanism
def self_attention(inputs):
    # Calculate attention scores
    attention_scores = np.dot(inputs, inputs.T) / np.sqrt(inputs.shape[-1])
    
    # Apply softmax to obtain attention weights
    attention_weights = np.exp(attention_scores) / np.sum(np.exp(attention_scores), axis=-1, keepdims=True)
    
    # Apply attention weights to inputs
    attended_inputs = np.dot(attention_weights, inputs)
    
    return attended_inputs, attention_weights

# Apply self-attention
attended_inputs, attention_weights = self_attention(inputs)

# Display results
print("Attended Inputs:")
print(attended_inputs)
print("\nAttention Weights:")
print(attention_weights)

Out[3]:

Attended Inputs:
[[0.63448685 0.81047639 0.21099442]
 [0.80976342 0.62970605 0.18847823]
 [0.96158921 0.34185482 0.18548879]]

Attention Weights:
[[0.46252785 0.36781824 0.16965391]
 [0.29492673 0.4312344  0.27383886]
 [0.17117121 0.34457284 0.48425595]]

Simplified implementation using NumPy to demonstrate self-attention

steps:

X represents the input token embeddings.
W_q, W_k, and W_v represent the query, key, and value projection matrices respectively.
We compute the dot products of query and key matrices to get the attention scores.
Softmax is applied to the attention scores to get the attention weights.
Finally, we compute the weighted sum of value vectors to get the output.

In [4]:

import numpy as np

# Define input tokens (embedding vectors)
X = np.array([
    [0.1, 0.2, 0.3],  # Embedding vector for "Ashi"
    [0.4, 0.5, 0.6],  # Embedding vector for "is"
    [0.7, 0.8, 0.9],  # Embedding vector for "beautiful"
])

In [5]:


# Define query, key, and value matrices (linear projections)
W_q = np.random.rand(3, 3)  # Random query matrix
W_k = np.random.rand(3, 3)  # Random key matrix
W_v = np.random.rand(3, 3)  # Random value matrix

W_k

Out[5]:

array([[0.3987028 , 0.84523365, 0.03935686],
       [0.14043704, 0.23418582, 0.80679358],
       [0.46115938, 0.95457538, 0.80559394]])

In [6]:

# Compute dot products of query and key matrices
Q = np.dot(X, W_q.T)  # Query vectors
K = np.dot(X, W_k.T)  # Key vectors
Q
K

Out[6]:

array([[0.22072407, 0.30291894, 0.4787092 ],
       [0.60571206, 0.65734388, 1.14510781],
       [0.99070005, 1.01176881, 1.81150642]])

In [7]:

# Compute attention scores
attention_scores = np.dot(Q, K.T) / np.sqrt(3)  # Divided by sqrt(d_k)

# Apply softmax to get attention weights
attention_weights = np.exp(attention_scores) / np.sum(np.exp(attention_scores), axis=1, keepdims=True)

# Compute weighted sum of value vectors to get output
V = np.dot(X, W_v.T)  # Value vectors
output = np.dot(attention_weights, V)

print("Attention Weights:")
print(attention_weights)
print("\nOutput:")
print(output)

Out[7]:

Attention Weights:
[[0.29183364 0.33153316 0.3766332 ]
 [0.22786659 0.32072082 0.45141258]
 [0.17286912 0.30145243 0.52567845]]

Output:
[[0.84330987 1.10142135 0.5226462 ]
 [0.91037107 1.18946118 0.56718602]
 [0.97284874 1.27148364 0.60868162]]

Simplified Python code demonstrating how self-attention might be applied in a neural machine translation scenario using the transformers library

This code uses the BERT model from the transformers library to tokenize the input text, compute its hidden states, and extract the self-attention weights. These weights indicate how much each token attends to every other token in each layer of the model. However, note that BERT is not specifically trained for machine translation, so this is just an illustration of self-attention in a language model context.

Steps

Tokenization: The input text "Ashi is beautiful." is tokenized into its constituent tokens using the BERT tokenizer. Each token is represented by an integer ID. Let's denote the tokenized input as 𝑋 X.
Model Computation: The tokenized input 𝑋
X is fed into the BERT model, which consists of multiple layers of self-attention and feedforward neural networks. The BERT model processes the input tokens and produces hidden states for each token. Let's denote the hidden states as 𝐻
Self-Attention: During each layer of the BERT model, self-attention is applied to the input tokens. The self-attention mechanism computes attention scores between each token and every other token in the sequence. These attention scores are calculated using the formula:
Self-Attention Weights: The self-attention weights represent the importance of each token attending to every other token in the sequence. These weights are computed for each layer of the model. In the code, the mean of the attention weights across the sequence dimension is calculated for each layer and printed out.

pip install torch

In [6]:

import torch
from transformers import BertModel, BertTokenizer, BertConfig

# Load pre-trained BERT model and tokenizer
machine_T = 'bert-base-multilingual-cased'
tokenizer = BertTokenizer.from_pretrained(machine_T )
model = BertModel.from_pretrained(machine_T )

# Input text
input_text = input("enter your text : ")

# Tokenize input text
input_ids = tokenizer.encode(input_text, add_special_tokens=True, return_tensors="pt")

# Get BERT model's output
outputs = model(input_ids)

Out[6]:

enter your text : Data is good

In [7]:

outputs

Out[7]:

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 0.1283, -0.2642,  0.4334,  ...,  0.4047,  0.2219,  0.0389],
         [ 0.2017, -0.7483,  1.1598,  ...,  0.3947,  0.2724, -0.1828],
         [ 0.0408, -0.7210,  0.6518,  ...,  1.1248,  0.1620,  0.2890],
         [ 0.1386, -0.9688,  0.8534,  ...,  0.4954,  0.2549, -0.0027],
         [ 0.2003, -0.5410,  0.7501,  ...,  0.2699,  0.3223,  0.1790]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[ 0.2756, -0.0726,  0.3052, -0.1321, -0.1877,  0.4000,  0.2618,  0.3498,
         -0.4464,  0.3896, -0.1449, -0.3249, -0.2578, -0.2245,  0.3276, -0.1474,
          0.6852,  0.2105,  0.2241, -0.4080, -1.0000, -0.3713, -0.4331, -0.2059,
         -0.4801,  0.2068, -0.1695,  0.2358,  0.3884, -0.2888,  0.1087, -1.0000,
          0.7191,  0.6414,  0.2634, -0.2632,  0.2798,  0.2552,  0.3660, -0.3003,
         -0.0791,  0.1356, -0.1100,  0.2280, -0.1202, -0.3200, -0.1897,  0.2633,
         -0.4936,  0.2228, -0.1130,  0.1875,  0.5714,  0.2662,  0.4146,  0.0800,
          0.2996,  0.2474,  0.3427, -0.3151,  0.1168,  0.2721,  0.1107, -0.3081,
         -0.3023, -0.2847,  0.1383, -0.0892,  0.5495, -0.3093, -0.3938, -0.5029,
         -0.3182,  0.2197,  0.0984, -0.4443,  0.3332,  0.4242,  0.1478, -0.2095,
         -0.4192, -0.5514, -0.3749,  0.2264, -0.1740,  0.3898,  0.2528, -0.4711,
          0.2025,  0.0638,  0.1598,  0.6279, -0.3366,  0.3933, -0.1614, -0.2917,
         -0.8572, -0.2179, -0.2299, -0.4760, -0.2940,  0.1961, -0.2564, -0.0466,
         -0.3929, -0.3631,  0.2288,  0.3551, -0.3071,  0.3176,  0.1272, -0.3880,
         -0.2033,  0.1228, -0.2719,  0.9858, -0.4880,  0.2815, -0.0567, -0.2540,
         -0.5931,  1.0000,  0.3260, -0.2063,  0.1794,  0.2233, -0.5139,  0.3098,
          0.2962,  0.3489,  0.1131, -0.1362, -0.2202, -0.3770, -0.8286, -0.2619,
         -0.3495,  0.3336, -0.2957, -0.2588,  0.1419,  0.5723,  0.0968, -0.0412,
         -0.1040, -0.0675,  0.1494, -0.2210,  1.0000,  0.6742, -0.2350, -0.3404,
          0.5601, -0.7040, -0.1786, -0.2644, -0.3754, -0.6575,  0.1896,  0.2716,
          0.1056, -0.2250, -0.2710, -0.2951,  0.3810, -0.6085, -0.2457,  0.2596,
          0.4503,  0.3913, -0.2588,  0.3556,  0.2355, -0.3165, -0.1064,  0.4674,
          0.3056, -0.1598, -0.2090, -0.0739,  0.3202, -0.2783, -0.5003,  0.2143,
         -0.2690, -0.4567,  0.0593, -0.0186, -0.1308,  0.1441, -0.2952,  0.2202,
         -0.3788,  0.2324,  0.5164,  0.1099, -0.3471,  0.2639,  0.3702,  0.2838,
          0.3122,  0.1881,  0.2679,  0.2565, -0.2479, -0.6877,  0.2398,  0.2692,
          0.4246, -0.3379, -0.6117, -0.5196,  0.6179,  0.4261, -0.2643,  0.1968,
          0.2303, -0.2340, -0.1983,  0.2088, -0.1890, -0.2780, -0.3291, -0.3461,
         -0.1877,  0.2770,  0.2995,  0.3024,  0.2332, -0.1863, -0.1521, -0.1987,
          0.1932,  0.3148, -0.3318,  0.8591, -0.2177,  0.1995, -0.5111, -0.1500,
          0.2518, -0.1162,  0.2976,  0.9721,  0.2867, -0.3901,  0.3332,  0.2202,
          0.1678, -0.2391,  0.1373, -0.6102,  0.6667,  0.2668,  0.2815, -1.0000,
          0.2516,  0.1920,  0.3822,  0.2561,  0.3950,  0.3904,  0.1694,  0.9322,
         -0.3960, -0.4518, -0.3582, -0.2139, -0.5751, -0.1051, -0.2545, -0.2992,
         -0.2860, -0.1713, -0.1629,  0.3207,  0.3659, -0.9968,  0.8993,  0.2111,
         -0.2166,  0.0169,  0.4490, -1.0000,  0.2889, -0.1928, -0.3841,  0.3370,
         -0.4723, -0.2793,  0.2351,  0.3278,  0.3278,  0.2978,  0.3204,  0.5741,
         -0.1660,  0.0545,  0.2936, -0.2403,  0.5859, -0.0502,  0.1825,  0.3942,
         -0.2030,  0.2433, -0.2700,  0.4339,  0.4301,  0.1643,  0.1844, -0.2005,
          0.3882, -0.8017,  0.2280, -0.2982, -0.2717, -0.1204,  0.1777, -0.2360,
         -0.3228,  0.2040, -0.4173,  1.0000,  0.2126, -0.1994, -0.3915,  0.4691,
          0.5545, -0.2644, -0.6512, -0.1415,  0.6259,  0.3727,  0.3858,  0.1376,
         -0.0074,  0.2535, -0.1703, -0.0595,  0.0946, -0.4061,  0.3289, -0.1135,
         -0.5026,  0.1035, -0.1570, -0.1421, -0.8375,  0.2806,  0.3039,  0.2480,
          0.3227,  0.0535, -0.2602,  0.5413,  0.3724, -0.1879, -0.3446, -0.3135,
         -0.1742,  0.1302, -0.3075, -0.4306,  0.1371, -0.6877,  0.2093, -0.0523,
         -0.2487, -0.3479,  0.3349, -1.0000, -0.3351,  0.1898, -0.3443,  0.3410,
         -0.4124, -0.2528,  0.4096,  0.2705, -0.0386,  0.1311, -0.3805,  0.2042,
         -0.0977,  0.0780,  0.7849,  0.6419,  0.2217, -0.1847,  0.2341, -0.5030,
         -0.2349,  0.2697,  0.2274, -0.1983,  0.3597,  0.2252,  0.2069, -0.1527,
          0.4054, -0.2896, -0.2953,  0.3446, -0.0771, -0.1625, -0.3604,  0.2768,
         -0.4752,  0.2771,  0.1520,  0.3784,  0.1697,  0.4519, -0.3654, -0.1705,
         -0.0799, -0.1947, -0.4442, -0.2672, -0.2113,  1.0000,  0.4461,  0.3947,
         -0.4235,  0.2636,  0.3372, -0.2990,  0.2652,  0.3187,  0.3367, -0.2594,
          0.2033,  0.1467,  0.3499,  0.3343,  0.1912,  0.6187, -0.3661,  0.7218,
         -0.2973, -0.3441, -0.9984,  0.2601,  0.4657, -0.4260, -0.6909,  0.2208,
         -0.2139,  0.2456, -0.2086,  0.1137,  0.2828, -0.2971,  0.4231, -0.3950,
          1.0000, -0.0267,  0.1905,  0.2916,  0.2196, -0.2369, -0.2173, -0.2105,
          0.3763, -0.1783,  0.2367, -0.9629,  0.2873,  0.1231,  0.4633, -0.1597,
          0.2651, -0.5830,  0.3795, -0.1421, -0.2525, -0.3440,  0.3359, -0.4909,
          0.4453, -0.3307,  0.2947, -0.4178,  0.3201, -0.1011,  0.2058, -0.1459,
          0.1326, -0.2453, -0.2297, -0.3837,  0.2525, -0.3726,  1.0000, -0.1465,
          0.2929, -0.1541,  0.3234, -0.2734,  0.3879,  0.6860, -0.2738,  0.2206,
          0.3545, -0.7474,  0.4272, -0.1270, -0.7822, -0.2666,  0.9702,  0.1676,
          0.3803,  0.5982,  0.3703,  0.2027, -0.2611,  0.3427,  0.8775,  0.2049,
          0.2941,  0.2801,  0.1047, -0.4053, -0.3845,  1.0000,  1.0000,  0.1556,
          0.3238, -0.4180, -0.4400, -0.2850,  0.3532,  0.2147,  0.3675, -0.0832,
          0.1184, -0.4685, -0.3664, -0.0952, -0.1199, -0.3031,  0.1592, -0.3230,
          0.4972,  0.4605,  0.1271,  0.5749,  0.3137,  0.3284, -0.1396, -0.3616,
          0.3583, -0.2111, -0.2039, -0.4121,  0.1167, -1.0000, -0.2796, -0.3079,
         -0.2916,  0.5100,  0.2364,  0.2168, -0.3573, -0.1511, -0.3069,  0.3189,
          0.2299,  0.2352, -0.1985, -0.5207,  0.4403, -0.3442,  0.1399, -0.2536,
         -0.3422, -0.7139, -0.2942, -0.3488,  0.3883, -0.4219, -0.3380,  0.2919,
          0.3824,  0.3810, -0.3751,  0.3730, -0.4379,  0.1041,  0.2375,  0.2791,
          0.2681, -0.2421, -0.3757, -0.3087, -0.3009, -0.1858,  0.2844, -0.2017,
          0.3235, -0.1800,  0.3779, -0.3259,  0.2229,  0.2578,  0.4221, -0.4171,
          0.5190,  0.4710, -0.0820,  0.4621,  0.1913, -0.2379, -0.3289,  1.0000,
          0.4125,  0.2892,  0.2688, -0.0828,  0.3745,  0.1737,  0.5865, -0.1589,
          0.8535, -0.2277,  0.3158,  0.1782,  0.3405,  0.1969,  0.2350,  0.4225,
          0.8014,  0.3248,  0.3765,  0.3254,  0.4182,  0.2563,  0.3656,  0.2674,
          0.4192,  0.2476, -0.2566,  0.3997, -0.3743, -0.1656, -0.2609, -0.0687,
         -0.3294,  0.1814, -0.2386, -0.3419, -0.2824,  0.3230, -0.2209,  0.1805,
         -0.2313, -0.4340,  0.5987, -0.5079,  0.3053, -0.1960,  0.1617, -0.8513,
          0.1824, -0.1358, -0.5853, -0.3094, -0.5629,  0.2166,  0.3064, -0.0848,
          0.2502, -0.3308,  0.3191, -0.1830, -0.1367,  0.1126, -1.0000,  0.0992,
          0.1957, -0.3672,  0.0530,  0.1252,  0.2593,  0.3714, -0.3647, -0.2733,
         -0.2402,  0.2592, -0.1878,  0.0910,  0.2290, -0.3566, -0.2407,  0.0465,
         -0.2186,  0.1297,  0.2595, -0.3620,  0.3850, -0.2395,  0.2187, -0.1024,
          0.3580, -0.2425, -0.3519,  0.2390, -0.3936, -0.3819, -0.1567,  0.1617,
         -0.1467,  0.1646,  0.2244, -0.2822,  0.4172, -0.2355,  0.3921, -0.3271,
          0.2244, -0.8235, -0.4976, -0.5142, -0.3765,  0.4159,  0.4084,  0.0874,
          0.1785, -0.1814,  0.2414, -0.1461,  0.2215,  0.2297, -0.2160,  0.1035,
         -0.2379,  0.3216, -0.3146,  0.0570, -0.9956, -0.2562,  0.1724,  0.2988,
          0.4266, -0.3838, -0.3340, -0.4874, -0.2788,  0.3877,  0.3341,  0.4247,
          0.2389,  0.1909, -0.2818, -0.1113,  0.7498, -0.2954,  0.0580,  0.4510,
          0.1855,  0.8676,  0.3703,  0.3746,  0.2662, -0.4935,  0.4277,  0.3768]],
       grad_fn=<TanhBackward0>), hidden_states=None, past_key_values=None, attentions=None, cross_attentions=None)

In [9]:


# Check if the model supports attention weights
config = BertConfig.from_pretrained(machine_T )
if config.output_attentions:
    # Extract hidden states
    hidden_states = outputs.last_hidden_state

    # Self-attention
    self_attention_weights = outputs.attentions

    # Print self-attention weights
    print("Self-attention weights:")
    for layer, attn_weights in enumerate(self_attention_weights):
        print(f"Layer {layer+1}: {attn_weights.mean(dim=1)}")

# Decoding input text
decoded_output = tokenizer.decode(input_ids[0], skip_special_tokens=True)
print("Decoded output from positional encoder:", decoded_output)

# Get the model's final output
final_output = outputs[0]

# Print the final output
print("Final output:", final_output)

print("Shape of the final Output",final_output.shape)

Out[9]:

Decoded output from positional encoder: Data is good
Final output: tensor([[[ 0.1283, -0.2642,  0.4334,  ...,  0.4047,  0.2219,  0.0389],
         [ 0.2017, -0.7483,  1.1598,  ...,  0.3947,  0.2724, -0.1828],
         [ 0.0408, -0.7210,  0.6518,  ...,  1.1248,  0.1620,  0.2890],
         [ 0.1386, -0.9688,  0.8534,  ...,  0.4954,  0.2549, -0.0027],
         [ 0.2003, -0.5410,  0.7501,  ...,  0.2699,  0.3223,  0.1790]]],
       grad_fn=<NativeLayerNormBackward0>)
Shape of the final Output torch.Size([1, 5, 768])

Example Language Translation

In [14]:

pip install torch

Out[14]:

Requirement already satisfied: torch in c:\users\suyashi144893\appdata\local\anaconda3\lib\site-packages (2.4.1)Note: you may need to restart the kernel to use updated packages.

Requirement already satisfied: filelock in c:\users\suyashi144893\appdata\local\anaconda3\lib\site-packages (from torch) (3.9.0)
Requirement already satisfied: typing-extensions>=4.8.0 in c:\users\suyashi144893\appdata\local\anaconda3\lib\site-packages (from torch) (4.12.2)
Requirement already satisfied: sympy in c:\users\suyashi144893\appdata\local\anaconda3\lib\site-packages (from torch) (1.11.1)
Requirement already satisfied: networkx in c:\users\suyashi144893\appdata\local\anaconda3\lib\site-packages (from torch) (3.1)
Requirement already satisfied: jinja2 in c:\users\suyashi144893\appdata\local\anaconda3\lib\site-packages (from torch) (3.1.2)
Requirement already satisfied: fsspec in c:\users\suyashi144893\appdata\local\anaconda3\lib\site-packages (from torch) (2024.9.0)
Requirement already satisfied: MarkupSafe>=2.0 in c:\users\suyashi144893\appdata\local\anaconda3\lib\site-packages (from jinja2->torch) (2.1.1)
Requirement already satisfied: mpmath>=0.19 in c:\users\suyashi144893\appdata\local\anaconda3\lib\site-packages (from sympy->torch) (1.3.0)

In [15]:

import torch
from transformers import MarianMTModel, MarianTokenizer

# Load pre-trained MarianMT model and tokenizer for English to Hindi translation
model_name = 'Helsinki-NLP/opus-mt-en-hi'
tokenizer = MarianTokenizer.from_pretrained(model_name)
hindi_tran = MarianMTModel.from_pretrained(model_name)

# Input text
input_text = input("Enter text for transalation = ")
print("Input=",input_text)

# Tokenize input text
input_ids = tokenizer(input_text, return_tensors="pt")

# Perform translation
translated_output = hindi_tran.generate(**input_ids)

# Decode the translated output
translated_text = tokenizer.decode(translated_output[0], skip_special_tokens=True)

# Print the translated output
print("Translated Output (Hindi):")
print(translated_text)

Out[15]:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[15], line 7
model_name = 'Helsinki-NLP/opus-mt-en-hi'
tokenizer = MarianTokenizer.from_pretrained(model_name)
----> 7 hindi_tran = MarianMTModel.from_pretrained(model_name)
# Input text
input_text = input("Enter text for transalation = ")
File ~\AppData\Local\anaconda3\Lib\site-packages\transformers\modeling_utils.py:309, in restore_default_torch_dtype.<locals>._wrapper(*args, **kwargs)
old_dtype = torch.get_default_dtype()
try:
--> 309     return func(*args, **kwargs)
finally:
   torch.set_default_dtype(old_dtype)
File ~\AppData\Local\anaconda3\Lib\site-packages\transformers\modeling_utils.py:4573, in PreTrainedModel.from_pretrained(cls, pretrained_model_name_or_path, config, cache_dir, ignore_mismatched_sizes, force_download, local_files_only, token, revision, use_safetensors, weights_only, *model_args, **kwargs)
   if dtype_orig is not None:
       torch.set_default_dtype(dtype_orig)
   (
       model,
       missing_keys,
       unexpected_keys,
       mismatched_keys,
       offload_index,
       error_msgs,
-> 4573     ) = cls._load_pretrained_model(
       model,
       state_dict,
       checkpoint_files,
       pretrained_model_name_or_path,
       ignore_mismatched_sizes=ignore_mismatched_sizes,
       sharded_metadata=sharded_metadata,
       device_map=device_map,
       disk_offload_folder=offload_folder,
       offload_state_dict=offload_state_dict,
       dtype=torch_dtype,
       hf_quantizer=hf_quantizer,
       keep_in_fp32_regex=keep_in_fp32_regex,
       device_mesh=device_mesh,
       key_mapping=key_mapping,
       weights_only=weights_only,
   )
# record tp degree the model sharded to
model._tp_size = tp_size
File ~\AppData\Local\anaconda3\Lib\site-packages\transformers\modeling_utils.py:4832, in PreTrainedModel._load_pretrained_model(cls, model, state_dict, checkpoint_files, pretrained_model_name_or_path, ignore_mismatched_sizes, sharded_metadata, device_map, disk_offload_folder, offload_state_dict, dtype, hf_quantizer, keep_in_fp32_regex, device_mesh, key_mapping, weights_only)
   original_checkpoint_keys = list(state_dict.keys())
else:
   original_checkpoint_keys = list(
-> 4832         load_state_dict(checkpoint_files[0], map_location="meta", weights_only=weights_only).keys()
   )
# Check if we are in a special state, i.e. loading from a state dict coming from a different architecture
prefix = model.base_model_prefix
File ~\AppData\Local\anaconda3\Lib\site-packages\transformers\modeling_utils.py:553, in load_state_dict(checkpoint_file, is_quantized, map_location, weights_only)
# Fallback to torch.load (if weights_only was explicitly False, do not check safety as this is known to be unsafe)
if weights_only:
--> 553     check_torch_load_is_safe()
try:
   if map_location is None:
File ~\AppData\Local\anaconda3\Lib\site-packages\transformers\utils\import_utils.py:1417, in check_torch_load_is_safe()
def check_torch_load_is_safe():
   if not is_torch_greater_or_equal("2.6"):
-> 1417         raise ValueError(
           "Due to a serious vulnerability issue in `torch.load`, even with `weights_only=True`, we now require users "
           "to upgrade torch to at least v2.6 in order to use the function. This version restriction does not apply "
           "when loading files with safetensors."
           "\nSee the vulnerability report here https://nvd.nist.gov/vuln/detail/CVE-2025-32434"
       )
ValueError: Due to a serious vulnerability issue in `torch.load`, even with `weights_only=True`, we now require users to upgrade torch to at least v2.6 in order to use the function. This version restriction does not apply when loading files with safetensors.
See the vulnerability report here https://nvd.nist.gov/vuln/detail/CVE-2025-32434

In [16]:

from transformers import MarianMTModel, MarianTokenizer

# Load the pre-trained translation model and tokenizer for English to German
model_name = "Helsinki-NLP/opus-mt-en-de"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Input text in English
input_text = input("Enter the Text: ")

# Tokenize the input text
inputs = tokenizer(input_text, return_tensors="pt")

# Perform translation
outputs = model.generate(**inputs)

# Decode the translated text
decoded_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Print the decoded text
print("Decoded text in German:", decoded_text)

Out[16]:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[16], line 6
model_name = "Helsinki-NLP/opus-mt-en-de"
tokenizer = MarianTokenizer.from_pretrained(model_name)
----> 6 model = MarianMTModel.from_pretrained(model_name)
# Input text in English
input_text = input("Enter the Text: ")
File ~\AppData\Local\anaconda3\Lib\site-packages\transformers\modeling_utils.py:309, in restore_default_torch_dtype.<locals>._wrapper(*args, **kwargs)
old_dtype = torch.get_default_dtype()
try:
--> 309     return func(*args, **kwargs)
finally:
   torch.set_default_dtype(old_dtype)
File ~\AppData\Local\anaconda3\Lib\site-packages\transformers\modeling_utils.py:4573, in PreTrainedModel.from_pretrained(cls, pretrained_model_name_or_path, config, cache_dir, ignore_mismatched_sizes, force_download, local_files_only, token, revision, use_safetensors, weights_only, *model_args, **kwargs)
   if dtype_orig is not None:
       torch.set_default_dtype(dtype_orig)
   (
       model,
       missing_keys,
       unexpected_keys,
       mismatched_keys,
       offload_index,
       error_msgs,
-> 4573     ) = cls._load_pretrained_model(
       model,
       state_dict,
       checkpoint_files,
       pretrained_model_name_or_path,
       ignore_mismatched_sizes=ignore_mismatched_sizes,
       sharded_metadata=sharded_metadata,
       device_map=device_map,
       disk_offload_folder=offload_folder,
       offload_state_dict=offload_state_dict,
       dtype=torch_dtype,
       hf_quantizer=hf_quantizer,
       keep_in_fp32_regex=keep_in_fp32_regex,
       device_mesh=device_mesh,
       key_mapping=key_mapping,
       weights_only=weights_only,
   )
# record tp degree the model sharded to
model._tp_size = tp_size
File ~\AppData\Local\anaconda3\Lib\site-packages\transformers\modeling_utils.py:4832, in PreTrainedModel._load_pretrained_model(cls, model, state_dict, checkpoint_files, pretrained_model_name_or_path, ignore_mismatched_sizes, sharded_metadata, device_map, disk_offload_folder, offload_state_dict, dtype, hf_quantizer, keep_in_fp32_regex, device_mesh, key_mapping, weights_only)
   original_checkpoint_keys = list(state_dict.keys())
else:
   original_checkpoint_keys = list(
-> 4832         load_state_dict(checkpoint_files[0], map_location="meta", weights_only=weights_only).keys()
   )
# Check if we are in a special state, i.e. loading from a state dict coming from a different architecture
prefix = model.base_model_prefix
File ~\AppData\Local\anaconda3\Lib\site-packages\transformers\modeling_utils.py:553, in load_state_dict(checkpoint_file, is_quantized, map_location, weights_only)
# Fallback to torch.load (if weights_only was explicitly False, do not check safety as this is known to be unsafe)
if weights_only:
--> 553     check_torch_load_is_safe()
try:
   if map_location is None:
File ~\AppData\Local\anaconda3\Lib\site-packages\transformers\utils\import_utils.py:1417, in check_torch_load_is_safe()
def check_torch_load_is_safe():
   if not is_torch_greater_or_equal("2.6"):
-> 1417         raise ValueError(
           "Due to a serious vulnerability issue in `torch.load`, even with `weights_only=True`, we now require users "
           "to upgrade torch to at least v2.6 in order to use the function. This version restriction does not apply "
           "when loading files with safetensors."
           "\nSee the vulnerability report here https://nvd.nist.gov/vuln/detail/CVE-2025-32434"
       )
ValueError: Due to a serious vulnerability issue in `torch.load`, even with `weights_only=True`, we now require users to upgrade torch to at least v2.6 in order to use the function. This version restriction does not apply when loading files with safetensors.
See the vulnerability report here https://nvd.nist.gov/vuln/detail/CVE-2025-32434

pip install wordcloud --trusted-host pypi.org --trusted-host files.pythonhosted.org pip install pipeline

Hindi to Tamil

In [17]:

from transformers import MarianMTModel, MarianTokenizer
import torch

# Define the model and tokenizer for English to Hindi
model_name_en_to_hi = 'Helsinki-NLP/opus-mt-en-hi'
model_en_to_hi = MarianMTModel.from_pretrained(model_name_en_to_hi)
tokenizer_en_to_hi = MarianTokenizer.from_pretrained(model_name_en_to_hi)

def translate_en_to_hi(text):
    inputs = tokenizer_en_to_hi(text, return_tensors='pt', padding=True, truncation=True)
    with torch.no_grad():
        translated_tokens = model_en_to_hi.generate(**inputs)
    return tokenizer_en_to_hi.batch_decode(translated_tokens, skip_special_tokens=True)[0]

english_text = "Hello, how are you?"
hindi_text = translate_en_to_hi(english_text)
print(f'Hindi Translation: {hindi_text}')

Out[17]:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[17], line 6
# Define the model and tokenizer for English to Hindi
model_name_en_to_hi = 'Helsinki-NLP/opus-mt-en-hi'
----> 6 model_en_to_hi = MarianMTModel.from_pretrained(model_name_en_to_hi)
tokenizer_en_to_hi = MarianTokenizer.from_pretrained(model_name_en_to_hi)
def translate_en_to_hi(text):
File ~\AppData\Local\anaconda3\Lib\site-packages\transformers\modeling_utils.py:309, in restore_default_torch_dtype.<locals>._wrapper(*args, **kwargs)
old_dtype = torch.get_default_dtype()
try:
--> 309     return func(*args, **kwargs)
finally:
   torch.set_default_dtype(old_dtype)
File ~\AppData\Local\anaconda3\Lib\site-packages\transformers\modeling_utils.py:4573, in PreTrainedModel.from_pretrained(cls, pretrained_model_name_or_path, config, cache_dir, ignore_mismatched_sizes, force_download, local_files_only, token, revision, use_safetensors, weights_only, *model_args, **kwargs)
   if dtype_orig is not None:
       torch.set_default_dtype(dtype_orig)
   (
       model,
       missing_keys,
       unexpected_keys,
       mismatched_keys,
       offload_index,
       error_msgs,
-> 4573     ) = cls._load_pretrained_model(
       model,
       state_dict,
       checkpoint_files,
       pretrained_model_name_or_path,
       ignore_mismatched_sizes=ignore_mismatched_sizes,
       sharded_metadata=sharded_metadata,
       device_map=device_map,
       disk_offload_folder=offload_folder,
       offload_state_dict=offload_state_dict,
       dtype=torch_dtype,
       hf_quantizer=hf_quantizer,
       keep_in_fp32_regex=keep_in_fp32_regex,
       device_mesh=device_mesh,
       key_mapping=key_mapping,
       weights_only=weights_only,
   )
# record tp degree the model sharded to
model._tp_size = tp_size
File ~\AppData\Local\anaconda3\Lib\site-packages\transformers\modeling_utils.py:4832, in PreTrainedModel._load_pretrained_model(cls, model, state_dict, checkpoint_files, pretrained_model_name_or_path, ignore_mismatched_sizes, sharded_metadata, device_map, disk_offload_folder, offload_state_dict, dtype, hf_quantizer, keep_in_fp32_regex, device_mesh, key_mapping, weights_only)
   original_checkpoint_keys = list(state_dict.keys())
else:
   original_checkpoint_keys = list(
-> 4832         load_state_dict(checkpoint_files[0], map_location="meta", weights_only=weights_only).keys()
   )
# Check if we are in a special state, i.e. loading from a state dict coming from a different architecture
prefix = model.base_model_prefix
File ~\AppData\Local\anaconda3\Lib\site-packages\transformers\modeling_utils.py:553, in load_state_dict(checkpoint_file, is_quantized, map_location, weights_only)
# Fallback to torch.load (if weights_only was explicitly False, do not check safety as this is known to be unsafe)
if weights_only:
--> 553     check_torch_load_is_safe()
try:
   if map_location is None:
File ~\AppData\Local\anaconda3\Lib\site-packages\transformers\utils\import_utils.py:1417, in check_torch_load_is_safe()
def check_torch_load_is_safe():
   if not is_torch_greater_or_equal("2.6"):
-> 1417         raise ValueError(
           "Due to a serious vulnerability issue in `torch.load`, even with `weights_only=True`, we now require users "
           "to upgrade torch to at least v2.6 in order to use the function. This version restriction does not apply "
           "when loading files with safetensors."
           "\nSee the vulnerability report here https://nvd.nist.gov/vuln/detail/CVE-2025-32434"
       )
ValueError: Due to a serious vulnerability issue in `torch.load`, even with `weights_only=True`, we now require users to upgrade torch to at least v2.6 in order to use the function. This version restriction does not apply when loading files with safetensors.
See the vulnerability report here https://nvd.nist.gov/vuln/detail/CVE-2025-32434

In [ ]:

#pip install --upgrade numpy
#pip install transformer
#conda create -n huggingface_env python=3.10 -y
#conda activate huggingface_env
#pip install huggingface_hub[hf_xet]
#pip install numpy==1.26.2 tensorflow==2.18.0 transformers
pip install tensorflow-intel==2.18.0 numpy==1.26.2
pip install gensim

In [ ]:

pip install wordcloud --trusted-host pypi.org --trusted-host files.pythonhosted.org pip install googletrans==4.0.0-rc1

import sys print("Python version:", sys.version)

Sentiment Analysis

We are using distilbert-base-uncased-finetuned-sst-2-english model, which is fine-tuned for sentiment analysis on the Stanford Sentiment Treebank (SST-2) dataset.

In [21]:

from transformers import pipeline

# Explicitly specify the model name
model_name = "distilbert-base-uncased-finetuned-sst-2-english"

# Load the sentiment analysis pipeline with the specified model
sentiment_pipeline = pipeline("sentiment-analysis", model=model_name, tokenizer=model_name)

# Function to perform sentiment analysis
def analyze_sentiment(text):
    results = sentiment_pipeline(text)
    return results

# Example usage
if __name__ == "__main__":
    text = input("Enter the text for sentiment analysis: ")
    results = analyze_sentiment(text)
    for result in results:
        print(f"Label: {result['label']}, Score: {result['score']:.4f}")

Out[21]:

Device set to use cpu

Enter the text for sentiment analysis: i am happy
Label: POSITIVE, Score: 0.9999

In [23]:

import torch
from transformers import BertTokenizer, BertForSequenceClassification
import numpy as np

# Load pre-trained BERT model and tokenizer
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)
model.eval()

# Define labels
labels = ['Negative', 'Positive']

# Function to perform text classification
def classify_text(text):
    # Tokenize input text
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    # Classify text using BERT model
    with torch.no_grad():
        outputs = model(**inputs)
    # Get predicted probabilities
    probabilities = torch.softmax(outputs.logits, dim=1).squeeze().numpy()
    # Get predicted label index
    predicted_label_index = np.argmax(probabilities)
    # Get predicted label
    predicted_label = labels[predicted_label_index]
    return predicted_label, probabilities

# User input for text
user_text = input("Enter the text you want to classify: ")

# Perform text classification
predicted_label, probabilities = classify_text(user_text)

# Print results
print("Predicted label:", predicted_label)
print("Probabilities:", probabilities)

Out[23]:

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Enter the text you want to classify: Riya is happy with her scores
Predicted label: Positive
Probabilities: [0.4749452 0.5250549]

In [ ]: