GitHub Repository: suyashi29/python-su
Path: blob/master/Generative NLP Models using Python/1 Natural Language Processing.ipynb
Kernel: Python 3 (ipykernel)

Natural language processing (NLP)

NLP is a branch of artificial intelligence that helps computers understand, interpret, and manipulate human language. NLP draws from many disciplines, including computer science and computational linguistics, in its pursuit to bridge the gap between human communication and computer understanding.

NLTK

  • The NLTK module is a massive toolkit aimed at helping you with the entire Natural Language Processing (NLP) workflow.

  • NLTK helps with everything from splitting paragraphs into sentences and sentences into words, to recognizing the part of speech of those words and highlighting the main subjects, and ultimately with helping your machine understand what the text is about.

Installation

  • conda install -c anaconda nltk
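After installing, a quick sanity check (a minimal sketch; it assumes the punkt resource, which the tokenizer examples below rely on):

import nltk
nltk.download('punkt')     # tokenizer models used by word_tokenize / sent_tokenize
print(nltk.__version__)    # confirm the installed version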

Components of NLP

The five main components of Natural Language Processing are:

Morphological and Lexical Analysis

  • Lexical analysis deals with the vocabulary of a language: its words and expressions.

  • It involves analyzing, identifying, and describing the structure of words.

  • It includes dividing a text into paragraphs, sentences, and words.

  • Individual words are analyzed into their components, and non-word tokens such as punctuation marks are separated from the words.

Semantic Analysis:

  • Semantic analysis assigns meanings to the structures created by the syntactic analyzer.

  • This component maps linear sequences of words into structures.

  • It shows how the words are associated with each other.

Pragmatic Analysis:

  • Pragmatic analysis deals with the overall communicative and social context and its effect on interpretation.

  • It means abstracting or deriving the meaningful use of language in real situations.

  • In this analysis, the main focus is on reinterpreting what was said in terms of what was actually meant.

Syntax Analysis:

  • Words are commonly accepted as the smallest units of syntax.

  • Syntax refers to the principles and rules that govern sentence structure in an individual language.

Discourse Integration:

  • It means making sense of the context.

  • The meaning of any single sentence depends on the sentences that precede it, and it also influences the meaning of the sentences that follow.

NLP and writing systems

The kind of writing system used for a language is one of the deciding factors in determining the best approach for text pre-processing. Writing systems can be:

  • Logographic: a large number of individual symbols represent words. Examples: Japanese, Mandarin

  • Syllabic: individual symbols represent syllables

  • Alphabetic: individual symbols represent sounds

Challenges

  • Extracting meaning (semantics) from a text is a challenge

  • NLP is dependent on the quality of the corpus; if the domain is vast, it is difficult to understand context

  • There is a dependence on the character set and the language

Your feedback was excellent. Your feedback was bad.

["your","feedback","was","excellent"]

import nltk
nltk.download()

  • First step: conda install -c anaconda nltk (or pip install nltk)

  • Second step: import nltk, then run nltk.download()

NLP libraries in Python

  • NLTK

  • Gensim (topic modelling, document summarization)

  • CoreNLP (linguistic analysis)

  • spaCy

  • TextBlob

  • Pattern (web mining)

Tokenizing Words & Sentences

Text can be split into sentences and words using the methods sent_tokenize() and word_tokenize() respectively.

import nltk
from nltk.tokenize import word_tokenize

T1 = "Hello Hello, - i am Suyashi end."
a = word_tokenize(T1)
#type(a)
a
['Hello', 'Hello', ',', '-', 'i', 'am', 'Suyashi', 'end', '.']
from nltk.tokenize import sent_tokenize

S2_TEXT = "Positive thinking! You know. is all imp A matter of habits? If you are not quite a positive thinker Change Yourself?"
print(sent_tokenize(S2_TEXT))   ## sentences are split on !, ? and .
['Positive thinking!', 'You know.', 'is all imp A matter of habits?', 'If you are not quite a positive thinker Change Yourself?']

Quick Practice:

  • Do you know Customer and target audience’s reviews can be analyzed? You can use this! to create a roadmap of features and products.

Convert the above paragraph into word tokens and sentence tokens; one possible solution is sketched below.
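A minimal sketch of one possible solution, reusing the tokenizers shown above (the paragraph string is copied from the practice prompt):

from nltk.tokenize import sent_tokenize, word_tokenize

para = ("Do you know Customer and target audience’s reviews can be analyzed? "
        "You can use this! to create a roadmap of features and products.")

print(sent_tokenize(para))   # sentence tokens
print(word_tokenize(para))   # word tokens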

Stop Words

  • To do this, we need a way to convert words to values, numbers, or signal patterns. The process of converting data to something a computer can understand is referred to as "pre-processing." One of the major forms of pre-processing is filtering out useless data. In natural language processing, such unimportant words are referred to as stop words.

  • For now, we'll treat stop words as words that carry no meaning of their own, and we want to remove them.

  • We can do this easily by storing a list of words that we consider stop words. NLTK starts you off with a list of such words, which you can access via the NLTK corpus.

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

a = "I think i that Learning DATA Science will bring a big leap in your Carrier Profile. Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from noisy, structured and unstructured data, and apply knowledge from data across a broad range of application domains"

word_tokens = word_tokenize(a)
print(word_tokens)
print("Length of words =", len(word_tokens))
['I', 'think', 'i', 'that', 'Learning', 'DATA', 'Science', 'will', 'bring', 'a', 'big', 'leap', 'in', 'your', 'Carrier', 'Profile', '.', 'Data', 'science', 'is', 'an', 'interdisciplinary', 'field', 'that', 'uses', 'scientific', 'methods', ',', 'processes', ',', 'algorithms', 'and', 'systems', 'to', 'extract', 'knowledge', 'and', 'insights', 'from', 'noisy', ',', 'structured', 'and', 'unstructured', 'data', ',', 'and', 'apply', 'knowledge', 'from', 'data', 'across', 'a', 'broad', 'range', 'of', 'application', 'domains']
Length of words = 58
type(word_tokens)
list
stop_words1 = set(stopwords.words('english'))   # the built-in English stop word list
filtered_sentence = [w for w in word_tokens if w not in stop_words1]
print(filtered_sentence)
print("The number of words stopped :", len(word_tokens) - len(filtered_sentence))
print("Length of words =", len(filtered_sentence))
['I', 'think', 'Learning', 'DATA', 'Science', 'bring', 'big', 'leap', 'Carrier', 'Profile', '.', 'Data', 'science', 'interdisciplinary', 'field', 'uses', 'scientific', 'methods', ',', 'processes', ',', 'algorithms', 'systems', 'extract', 'knowledge', 'insights', 'noisy', ',', 'structured', 'unstructured', 'data', ',', 'apply', 'knowledge', 'data', 'across', 'broad', 'range', 'application', 'domains']
The number of words stopped : 18
Length of words = 40
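Note that 'I' survived the filter because NLTK's stop word list is all lowercase; comparing case-insensitively removes it too (a minimal sketch):

filtered_ci = [w for w in word_tokens if w.lower() not in stop_words1]
print("Case-insensitive filtering stopped:", len(word_tokens) - len(filtered_ci), "words")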
b=["I",".",",","?",":"] #Creating your own Stop word list stop_words1=list(stop_words1) stop_words2 = b #downloads the file with english stop words stop_words=stop_words1+stop_words2 word_tokens = word_tokenize(a) filtered_sentence = [w for w in word_tokens if not w in stop_words] print(filtered_sentence) #print(word_tokens) #print(filtered_sentence) print("The number of words stopped :",(len(word_tokens)-len(filtered_sentence))) print ("Lenghth of words filtered sentence = ",len(filtered_sentence))
['think', 'Learning', 'DATA', 'Science', 'bring', 'big', 'leap', 'Carrier', 'Profile', 'Data', 'science', 'interdisciplinary', 'field', 'uses', 'scientific', 'methods', 'processes', 'algorithms', 'systems', 'extract', 'knowledge', 'insights', 'noisy', 'structured', 'unstructured', 'data', 'apply', 'knowledge', 'data', 'across', 'broad', 'range', 'application', 'domains']
The number of words stopped : 24
Length of words filtered sentence = 34

Write a Python script that takes a paragraph of text and performs word tokenization using NLTK. Print the list of tokens and the filtered words after applying stop word removal.

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download necessary NLTK data files
nltk.download('punkt')
nltk.download('stopwords')

# Ask the user to enter a paragraph
paragraph = input("Please enter a paragraph of text: ")

# Tokenize the paragraph
tokens = word_tokenize(paragraph)

# Print the list of tokens
print("\nTokens:")
print(tokens)

# Get the list of stop words
stop_words = set(stopwords.words('english'))

# Filter out stop words
filtered_words = [word for word in tokens if word.lower() not in stop_words]

# Print the list of filtered words
print("\nFiltered Words:")
print(filtered_words)

STEMMING

A word stem is the base part of a word. Stemming is a kind of normalization, but a linguistic one: related word forms are reduced to a common stem.

  • For example, the stem of the word Using is use.

from nltk.stem import PorterStemmer

ps = PorterStemmer()   ## defining the stemmer
s_words = ["AIIMS", "Aims", "Aimed", "Aimmer", "Aiming", "Aim", "go", "went", "dance", "dances", "dancing"]
for i in s_words:
    print(ps.stem(i))
aiim
aim
aim
aimmer
aim
aim
go
went
danc
danc
danc
from nltk.stem import PorterStemmer

ps = PorterStemmer()   ## defining the stemmer
s_words = ["Calls", "dance", "Calling", "Call", "Called"]
for i in s_words:
    print(ps.stem(i))
call
danc
call
call
call
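For comparison, NLTK also ships the Snowball ("Porter2") stemmer, an alternative stemming algorithm; a minimal sketch (lowercasing the input is an assumption here, since Snowball expects lowercase words):

from nltk.stem import SnowballStemmer

sb = SnowballStemmer("english")
for w in ["Calls", "dance", "Calling", "Call", "Called"]:
    print(sb.stem(w.lower()))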

help(nltk)

Part of Speech tagging

This means labeling words in a sentence as nouns, adjectives, verbs, and so on.


import nltk
from nltk.tokenize import PunktSentenceTokenizer

document = 'Whether you\'re new to DataScience or an paracetamol, it\'s easy to learn and use Python.Are you Good enough in Prgramming? I Am based in Delhi location'
sentences = nltk.sent_tokenize(document)
for sent in sentences:
    print(nltk.pos_tag(nltk.word_tokenize(sent)))
[('Whether', 'IN'), ('you', 'PRP'), ("'re", 'VBP'), ('new', 'JJ'), ('to', 'TO'), ('DataScience', 'NNP'), ('or', 'CC'), ('an', 'DT'), ('paracetamol', 'NN'), (',', ','), ('it', 'PRP'), ("'s", 'VBZ'), ('easy', 'JJ'), ('to', 'TO'), ('learn', 'VB'), ('and', 'CC'), ('use', 'VB'), ('Python.Are', 'NNP'), ('you', 'PRP'), ('Good', 'NNP'), ('enough', 'RB'), ('in', 'IN'), ('Prgramming', 'NNP'), ('?', '.')]
[('I', 'PRP'), ('Am', 'VBP'), ('based', 'VBN'), ('in', 'IN'), ('Delhi', 'NNP'), ('location', 'NN')]
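To look up what a tag such as VB or IN means, NLTK bundles the Penn Treebank tag documentation; a minimal sketch (assumes the 'tagsets' resource has been downloaded):

import nltk
#nltk.download('tagsets')
nltk.help.upenn_tagset('VB')   # prints the definition and examples for the VB tag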
sentences
##from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

document = 'Whether you\'re new to DataScience or an experienced, it\'s easy to learn and use Python.'
sentences = nltk.sent_tokenize(document)
data = []
for sent in sentences:
    data = data + nltk.pos_tag(nltk.word_tokenize(sent))
for word in data:
    if 'CC' in word[1]:
        print(word)

Get synonyms/antonyms using WordNet

  • WordNet’s structure makes it a useful tool for computational linguistics and natural language processing

  • WordNet superficially resembles a thesaurus, in that it groups words together based on their meanings

# First, you're going to need to import wordnet:
from nltk.corpus import wordnet

# Then, we're going to use the term "Happy" to find synsets like so:
syns = wordnet.synsets("Happy")

# An example of a synset:
print(syns[0].name())
# Just the word:
print(syns[0].lemmas()[0].name())
# Definition of that first synset:
print(syns[0].definition())
# Examples of the word in use in sentences:
print(syns[0].examples())
happy.a.01
happy
enjoying or showing or marked by joy or pleasure
['a happy smile', 'spent many happy days on the beach', 'a happy marriage']
syns = wordnet.synsets("Generative") # An example of a synset: print(syns[0].name()) # Just the word: print(syns[0].lemmas()[0].name()) # Definition of that first synset: print(syns[0].definition()) # Examples of the word in use in sentences: print(syns[0].examples())
generative.a.01
generative
having the ability to produce or originate
['generative power', 'generative forces']
import nltk
from nltk.corpus import wordnet

synonyms = []
antonyms = []
for syn in wordnet.synsets("Sound"):
    for l in syn.lemmas():
        synonyms.append(l.name())
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())
print("Similar words =", set(synonyms))
print("Antonyms for sound =", set(antonyms))
Similar words = {'phone', 'vocalise', 'auditory_sensation', 'profound', 'levelheaded', 'well-grounded', 'good', 'voice', 'reasoned', 'effectual', 'vocalize', 'strait', 'go', 'audio', 'wakeless', 'heavy', 'sound', 'fathom', 'legal', 'level-headed', 'healthy', 'intelligent', 'speech_sound'}
Antonyms for sound = {'devoice', 'unsound', 'silence'}
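WordNet can also quantify how related two concepts are; a minimal sketch using the Wu-Palmer similarity score (the synset identifiers 'dog.n.01' and 'cat.n.01' are standard WordNet names, chosen here as an illustration):

from nltk.corpus import wordnet

dog = wordnet.synset('dog.n.01')
cat = wordnet.synset('cat.n.01')
print(dog.wup_similarity(cat))   # score in (0, 1]; dog and cat are closely related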

Filtering Duplicate Words: we can use sets

## Sets
s = {1, 2, 33, 33, 44, 0, -5}
s
{-5, 0, 1, 2, 33, 44}
import nltk

word_data = "The python is a a python data analytics language"
# First, word tokenization
nltk_tokens = nltk.word_tokenize(word_data)
# Applying a set removes duplicates (order is not preserved)
no_order = list(set(nltk_tokens))
print(no_order)
['language', 'python', 'data', 'a', 'The', 'analytics', 'is']
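Sets do not preserve word order; if the original order matters, dict.fromkeys gives an order-preserving alternative (a sketch, not part of the original notebook):

import nltk

word_data = "The python is a a python data analytics language"
nltk_tokens = nltk.word_tokenize(word_data)
print(list(dict.fromkeys(nltk_tokens)))   # duplicates removed, original order kept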

What is Lemmatization?

  • Lemmatization is the process of converting a word to its base or dictionary form (called a lemma), considering the context and part of speech (POS).

  • Unlike stemming (which simply chops off word endings), lemmatization ensures that the result is a valid word, not just a root form.

Uses:

  • Smarter text normalization: dancing, danced, dances → dance

  • Uses grammar rules and vocabulary (via POS tagging)

  • Important in NLP tasks like text classification, search engines, and question answering

e.g "Soni loves dancing and danced well at the dance competition

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

lem = WordNetLemmatizer()
a = "Soni loves dancing and danced well at the dances competition"
words = word_tokenize(a)
words
['Soni', 'loves', 'dancing', 'and', 'danced', 'well', 'at', 'the', 'dances', 'competition']
lemmatized_words = [lem.lemmatize(w) for w in words]
lemmatized_words
['Soni', 'love', 'dancing', 'and', 'danced', 'well', 'at', 'the', 'dance', 'competition']
With the correct POS tag supplied, the lemmatizer maps each word as follows (the cell above used the default noun POS, which is why "dancing" and "danced" were left unchanged):

| Word | POS Tag | Lemma | Notes |
|---|---|---|---|
| Soni | NNP | Soni | Proper noun (unchanged) |
| loves | VBZ | love | Verb converted to base form |
| dancing | VBG | dance | Verb gerund → base |
| danced | VBD | dance | Past verb → base |
| dance | NN | dance | Already a noun |
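Because WordNetLemmatizer defaults to the noun POS, passing pos="v" is what actually produces the verb lemmas shown in the table; a minimal sketch:

from nltk.stem import WordNetLemmatizer

lem = WordNetLemmatizer()
print(lem.lemmatize("dancing"))             # 'dancing' — treated as a noun by default
print(lem.lemmatize("dancing", pos="v"))    # 'dance'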

Example 2

sentences =

  • "Soni was dancing in the rain while Kron watched.",

  • "Kron loved the rain and danced in it.",

  • "They are loving the way Soni dances.",

  • "It rained heavily but Soni kept dancing.",

  • "Kron is a warrior who loved to fight and dance."

import nltk
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet

# Download required resources
#nltk.download('punkt')
#nltk.download('averaged_perceptron_tagger')
#nltk.download('wordnet')

# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

# Function to convert POS tag to WordNet POS
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

# Lemmatization function
def lemmatize_sentence(sentence):
    tokens = word_tokenize(sentence)
    pos_tags = pos_tag(tokens)
    lemmatized = [lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for word, tag in pos_tags]
    return ' '.join(lemmatized)

# Sample sentences
sentences = [
    "Soni was dancing in the rain while Kron watched.",
    "Kron loved the rain and danced in it.",
    "They are loving the way Soni dances.",
    "It rained heavily but Soni kept dancing.",
    "Kron is a warrior who loved to fight and dance."
]

# Apply lemmatization to all sentences
print("Lemmatized Sentences:\n")
for s in sentences:
    print(" Original :", s)
    print("Lemmatized:", lemmatize_sentence(s))
    print("---")
Lemmatized Sentences:

 Original : Soni was dancing in the rain while Kron watched.
Lemmatized: Soni be dance in the rain while Kron watch .
---
 Original : Kron loved the rain and danced in it.
Lemmatized: Kron love the rain and dance in it .
---
 Original : They are loving the way Soni dances.
Lemmatized: They be love the way Soni dance .
---
 Original : It rained heavily but Soni kept dancing.
Lemmatized: It rain heavily but Soni keep dance .
---
 Original : Kron is a warrior who loved to fight and dance.
Lemmatized: Kron be a warrior who love to fight and dance .
---

Named Entity Recognition (NER)

Named Entity Recognition (NER) is the process of detecting named entities (like persons, organizations, locations, dates) in text. Example:

  • “Soni visited Paris in May to meet Kron from Microsoft.”

Entities here:

Soni → Person

Paris → Location

May → Date

Kron → Person

Microsoft → Organization

import nltk
from nltk import word_tokenize, pos_tag, ne_chunk

# Download required resources
#nltk.download('punkt')
#nltk.download('maxent_ne_chunker')
#nltk.download('words')
#nltk.download('averaged_perceptron_tagger')

# Sample sentence
sentence = "Soni visited Paris in May to meet Kron from Microsoft."

# Tokenize into words
tokens = word_tokenize(sentence)

# Part-of-speech tagging
pos_tags = pos_tag(tokens)

# Named Entity Recognition
ner_tree = ne_chunk(pos_tags)

# Print the named entities
print(" Named Entities:")
print(ner_tree)
Named Entities:
(S
  (PERSON Soni/NNP)
  visited/VBD
  (GPE Paris/NNP)
  in/IN
  May/NNP
  to/TO
  meet/VB
  (PERSON Kron/NNP)
  from/IN
  (ORGANIZATION Microsoft/NNP)
  ./.)
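To pull the entities out of the tree as flat (text, label) pairs, we can walk its subtrees (a sketch; ne_chunk returns an nltk.Tree, whose chunk subtrees carry the entity labels):

from nltk import Tree

entities = [(" ".join(tok for tok, pos in subtree.leaves()), subtree.label())
            for subtree in ner_tree if isinstance(subtree, Tree)]
print(entities)   # [('Soni', 'PERSON'), ('Paris', 'GPE'), ('Kron', 'PERSON'), ('Microsoft', 'ORGANIZATION')]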
type(pos_tags)
pos_tags
import nltk
from nltk import word_tokenize, pos_tag, ne_chunk

# Download NLTK data files if not already done
#nltk.download('punkt')
#nltk.download('averaged_perceptron_tagger')
#nltk.download('maxent_ne_chunker')
#nltk.download('words')

# Sample sentences
sentences = [
    "Shivi plays football in New York every summer.",
    "Delhi hosted an international football tournament.",
    "Shivi traveled from Delhi to New York to play football.",
    "Football is Shivi's favorite sport in Delhi.",
    "In New York, Shivi met players from different countries."
]

# Function to run NER and print results
def perform_ner(text):
    tokens = word_tokenize(text)
    pos_tags = pos_tag(tokens)
    ner_tree = ne_chunk(pos_tags)
    print(f"\nSentence: {text}")
    print("Named Entities:")
    print(ner_tree)

# Run NER on all sample sentences
for sent in sentences:
    perform_ner(sent)
| Sentence | Named Entities |
|---|---|
| "Shivi plays football in New York every summer." | Shivi (PERSON), New York (GPE) |
| "Delhi hosted an international football tournament." | Delhi (GPE) |
| "Shivi traveled from Delhi to New York to play football." | Shivi (PERSON), Delhi (GPE), New York (GPE) |
| "Football is Shivi's favorite sport in Delhi." | Shivi (PERSON), Delhi (GPE) |
| "In New York, Shivi met players from different countries." | New York (GPE), Shivi (PERSON) |