GitHub Repository: suyashi29/python-su
Path: blob/master/Generative NLP Models using Python/Updated Language Translation model.ipynb
Kernel: Python 3 (ipykernel)
from transformers import pipeline

# Step 1: Define supported target languages and their pipeline tasks
# (the Hindi entry is restored here to match the menu in the run output below).
# Note: the default pipeline only ships models for a few pairs (e.g. en->de,
# en->fr); tasks without a default model raise an error unless a model is
# passed explicitly.
supported_languages = {
    "1": ("Hindi", "translation_en_to_hi"),
    "2": ("German", "translation_en_to_de"),
    "3": ("French", "translation_en_to_fr"),
    "4": ("Spanish", "translation_en_to_es"),
}

# Step 2: Display the language menu
print("Select the language to translate English into:")
for key, (language, _) in supported_languages.items():
    print(f"{key}. {language}")

# Step 3: Read the language choice and the text to translate
lang_choice = input("Enter choice (1-4): ").strip()
if lang_choice not in supported_languages:
    print("Invalid choice. Exiting.")
    exit()

target_lang, pipeline_task = supported_languages[lang_choice]
text_to_translate = input(f"Enter English text to translate into {target_lang}: ")

# Step 4: Load the translation pipeline for the chosen task
translator = pipeline(pipeline_task)

# Step 5: Translate and display the result
result = translator(text_to_translate)
print(f"\nTranslated text in {target_lang}: {result[0]['translation_text']}")
Select the language to translate English into:
1. Hindi
2. German
3. French
4. Spanish
Enter choice (1-4): 2
Enter English text to translate into German: living is good
No model was supplied, defaulted to google-t5/t5-base and revision 686f1db (https://huggingface.co/google-t5/t5-base). Using a pipeline without specifying a model name and revision in production is not recommended.
Translated text in German: Das Leben ist gut
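
As the warning above notes, an unpinned pipeline silently falls back to google-t5/t5-base, which is not recommended in production. A minimal sketch of pinning the checkpoint and revision explicitly, using the model and revision reported in the warning, so runs stay reproducible:

from transformers import pipeline

# Pin the checkpoint and revision instead of relying on the task default;
# "686f1db" is the revision reported in the warning above.
translator = pipeline(
    "translation_en_to_de",
    model="google-t5/t5-base",
    revision="686f1db",
)
print(translator("living is good")[0]["translation_text"])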

Second Option: load a MarianMT model and tokenizer directly

from transformers import MarianTokenizer, MarianMTModel

# Map menu choices to MarianMT checkpoints
language_models = {
    "1": ("Hindi", "Helsinki-NLP/opus-mt-en-hi"),
    "2": ("German", "Helsinki-NLP/opus-mt-en-de"),
    "3": ("French", "Helsinki-NLP/opus-mt-en-fr"),
    "4": ("Spanish", "Helsinki-NLP/opus-mt-en-es"),
}

# Show menu
print("Select target language:")
for key, (lang, _) in language_models.items():
    print(f"{key}. {lang}")

# Get input
choice = input("Enter your choice (1-4): ").strip()
if choice not in language_models:
    print("Invalid selection.")
    exit()

lang_name, model_name = language_models[choice]
text = input(f"Enter English text to translate to {lang_name}: ")

# Load tokenizer and model
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Translate: calling the tokenizer directly replaces the deprecated
# prepare_seq2seq_batch helper
tokens = tokenizer([text], return_tensors="pt", padding=True)
translated = model.generate(**tokens)
output = tokenizer.decode(translated[0], skip_special_tokens=True)

# Result
print(f"\nTranslated to {lang_name}: {output}")
Select target language:
1. Hindi
2. German
3. French
4. Spanish
Enter your choice (1-4): 2
Enter English text to translate to German: hello hey
Translated to German: Guten Tag.
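
MarianMT also generates on batches, which is usually faster than looping over sentences one at a time. A short sketch under the same Helsinki-NLP/opus-mt-en-de checkpoint (the example sentences are illustrative; padding is needed so inputs of different lengths fit in one tensor):

from transformers import MarianTokenizer, MarianMTModel

model_name = "Helsinki-NLP/opus-mt-en-de"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

sentences = ["Living is good.", "Hello there!", "How are you today?"]
# Pad to the longest sentence so the batch shares a single tensor
batch = tokenizer(sentences, return_tensors="pt", padding=True)
generated = model.generate(**batch)
for src, ids in zip(sentences, generated):
    print(src, "->", tokenizer.decode(ids, skip_special_tokens=True))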