GitHub Repository: suyashi29/python-su
Path: blob/master/GenAI Transformers Basics/1.2 Natural Language Processing.ipynb
Kernel: Python 3 (ipykernel)

Natural language processing (NLP)

NLP is a branch of artificial intelligence that helps computers understand, interpret, and manipulate human language. NLP draws from many disciplines, including computer science and computational linguistics, in its pursuit to bridge the gap between human communication and computer understanding.

NLTK

  • The NLTK module is a comprehensive toolkit aimed at supporting the entire Natural Language Processing (NLP) workflow.

  • NLTK helps with everything from splitting paragraphs into sentences, splitting sentences into words, and recognizing the part of speech of those words, to highlighting the main subjects and ultimately helping your program understand what the text is about.

Installation

  • conda install -c anaconda nltk

Components of NLP

The five main components of Natural Language Processing are described below (a short NLTK sketch follows at the end of this section):

Morphological and Lexical Analysis:

  • The lexicon of a language is its vocabulary: the words and expressions it contains.

  • Lexical analysis involves analyzing, identifying, and describing the structure of words.

  • It includes dividing a text into paragraphs, sentences, and words.

  • Individual words are analyzed into their components, and non-word tokens such as punctuation are separated from the words.

Semantic Analysis:

  • Semantic analysis assigns meanings to the structures created by the syntactic analyzer.

  • This component maps linear sequences of words into structures that capture meaning.

  • It shows how the words are associated with each other.

Pragmatic Analysis:

  • Pragmatic analysis deals with the overall communicative and social context and its effect on interpretation.

  • It is concerned with deriving the purposeful use of language in real situations.

  • In this analysis, the main focus is on reinterpreting what was said in terms of what was actually meant.

Syntax Analysis:

  • Words are commonly accepted as the smallest units of syntax.

  • Syntax refers to the principles and rules that govern sentence structure in any individual language.

Discourse Integration:

  • It refers to making sense of the surrounding context.

  • The meaning of any single sentence depends on the sentences that precede it, and it can also influence the meaning of the sentences that follow.
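
As a rough, illustrative sketch of the lexical and syntactic stages described above (not part of the original notebook), the following NLTK snippet tokenizes a short text and tags each token's part of speech; it assumes the punkt and averaged_perceptron_tagger resources have already been downloaded.

# Illustrative sketch only: lexical analysis (tokenization) followed by
# syntactic analysis (part-of-speech tagging) with NLTK.
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

text = "NLP helps computers understand human language. It has several components."
for sent in sent_tokenize(text):     # lexical analysis: text -> sentences
    tokens = word_tokenize(sent)     # lexical analysis: sentence -> word tokens
    print(nltk.pos_tag(tokens))      # syntactic analysis: tag each token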

NLP and writing systems

The kind of writing system used for a language is one of the deciding factors in determining the best approach for text pre-processing. Writing systems can be

  • Logographic: a large number of individual symbols represent words. Examples: Japanese, Mandarin.

  • Syllabic: Individual symbols represent syllables

  • Alphabetic: Individual symbols represent sound

Challenges

  • Extracting meaning (semantics) from text is a challenge.

  • NLP is dependent on the quality of the corpus; if the domain is vast, it is difficult to understand context.

  • There is a dependence on the character set and the language.

import nltk
nltk.download()
showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
True
import nltk
  • First step: conda install -c anaconda nltk (or pip install nltk)

  • Second step: import nltk and run nltk.download()

NLP libraries in Python

  • NLTK

  • Gensim (Topic Modelling, Document summarization)

  • CoreNLP (linguistic analysis)

  • spaCy

  • TextBlob (see the short sketch after this list)

  • Pattern (web mining)
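
As an illustration of one of these libraries, the short sketch below uses TextBlob for tokenization and a simple sentiment score; it is an assumption of this example that TextBlob is installed (pip install textblob) and that the NLTK corpora it relies on are available.

# Small illustrative TextBlob example (assumes: pip install textblob).
from textblob import TextBlob

blob = TextBlob("NLTK is great. TextBlob makes simple NLP tasks very easy!")
print(blob.sentences)   # sentence tokenization
print(blob.words)       # word tokenization
print(blob.sentiment)   # (polarity, subjectivity) sentiment estimate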

Tokenizing Words & Sentences

Text can be split into sentences and words using the methods sent_tokenize() and word_tokenize(), respectively.

import nltk
from nltk.tokenize import word_tokenize

T1 = "Hello -Hello , - i am; Suyashi Raiwani."
a = word_tokenize(T1)
# type(a)
a
['Hello', '-Hello', ',', '-', 'i', 'am', ';', 'Suyashi', 'Raiwani', '.']
from nltk.tokenize import sent_tokenize

S2_TEXT = "Positive thinking! You know. is all imp A matter of habits? If you are not quite a positive thinker Change Yourself?"
print(sent_tokenize(S2_TEXT))  ## type(sent_tokenize(E_TEXT))  ## !, ? and . end sentences
['Positive thinking!', 'You know.', 'is all imp A matter of habits?', 'If you are not quite a positive thinker Change Yourself?']

Quick Practice:

  • Do you know Customer and target audience’s reviews can be analyzed? You can use this! to create a roadmap of features and products.

Convert the paragraph above into word tokens and sentence tokens.
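
One possible solution sketch for this practice, assuming the punkt resource is already downloaded:

# One possible solution to the quick practice above.
from nltk.tokenize import sent_tokenize, word_tokenize

para = ("Do you know Customer and target audience's reviews can be analyzed? "
        "You can use this! to create a roadmap of features and products.")
print(word_tokenize(para))   # word tokens
print(sent_tokenize(para))   # sentence tokens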

a = [1, 2, 3]
b = [2, 3, 4]
a + b
[1, 2, 3, 2, 3, 4]
import numpy as np

a = [1, 2, 3]
b = np.array(a)
b + b
array([2, 4, 6])
## Store the word tokens and type-cast them into an array:
from nltk.tokenize import sent_tokenize, word_tokenize
import numpy as np

data = "All work and no play makes jack dull boy. All work and no play makes jack a dull boy."
words = word_tokenize(data)
words
['All', 'work', 'and', 'no', 'play', 'makes', 'jack', 'dull', 'boy', '.', 'All', 'work', 'and', 'no', 'play', 'makes', 'jack', 'a', 'dull', 'boy', '.']
new_array = np.array(words)
new_array
# print(type(new_array))
array(['All', 'work', 'and', 'no', 'play', 'makes', 'jack', 'dull', 'boy', '.', 'All', 'work', 'and', 'no', 'play', 'makes', 'jack', 'a', 'dull', 'boy', '.'], dtype='<U5')

Stop Words

  • To process text, we need a way to convert words into values: numbers or signal patterns. The process of converting data into something a computer can understand is referred to as "pre-processing." One of the major forms of pre-processing is filtering out data that is not useful. In natural language processing, these unimportant words are referred to as stop words.

  • For now, we'll treat stop words as words that carry little meaning on their own, and we want to remove them.

  • We can do this easily by storing a list of words that we consider to be stop words. NLTK provides a default list of stop words, which you can access via the NLTK corpus.

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

a = "I think i that Learning DATA Science will bring a big leap in your Carrier Profile. Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from noisy, structured and unstructured data, and apply knowledge from data across a broad range of application domains"
word_tokens = word_tokenize(a)
print(word_tokens)
print("Length of words = ", len(word_tokens))
['I', 'think', 'i', 'that', 'Learning', 'DATA', 'Science', 'will', 'bring', 'a', 'big', 'leap', 'in', 'your', 'Carrier', 'Profile', '.', 'Data', 'science', 'is', 'an', 'interdisciplinary', 'field', 'that', 'uses', 'scientific', 'methods', ',', 'processes', ',', 'algorithms', 'and', 'systems', 'to', 'extract', 'knowledge', 'and', 'insights', 'from', 'noisy', ',', 'structured', 'and', 'unstructured', 'data', ',', 'and', 'apply', 'knowledge', 'from', 'data', 'across', 'a', 'broad', 'range', 'of', 'application', 'domains'] Length of words =  58
stop_words1 = set(stopwords.words('english'))  # load NLTK's English stop word list
filtered_sentence = [w for w in word_tokens if w not in stop_words1]
print(filtered_sentence)
# print(word_tokens)
print("The number of words stopped :", (len(word_tokens) - len(filtered_sentence)))
print("Length of words = ", len(filtered_sentence))
['I', 'think', 'Learning', 'DATA', 'Science', 'bring', 'big', 'leap', 'Carrier', 'Profile', '.', 'Data', 'science', 'interdisciplinary', 'field', 'uses', 'scientific', 'methods', ',', 'processes', ',', 'algorithms', 'systems', 'extract', 'knowledge', 'insights', 'noisy', ',', 'structured', 'unstructured', 'data', ',', 'apply', 'knowledge', 'data', 'across', 'broad', 'range', 'application', 'domains'] The number of words stopped : 18 Length of words =  40
b = ["I", ".", ",", "?", ":"]  # creating your own stop word list
stop_words1 = list(stop_words1)
stop_words2 = b
stop_words = stop_words1 + stop_words2  # combine NLTK's stop words with the custom list
word_tokens = word_tokenize(a)
filtered_sentence = [w for w in word_tokens if w not in stop_words]
print(filtered_sentence)
print("The number of words stopped :", (len(word_tokens) - len(filtered_sentence)))
print("Length of words filtered sentence = ", len(filtered_sentence))
['think', 'Learning', 'DATA', 'Science', 'bring', 'big', 'leap', 'Carrier', 'Profile', 'Data', 'science', 'interdisciplinary', 'field', 'uses', 'scientific', 'methods', 'processes', 'algorithms', 'systems', 'extract', 'knowledge', 'insights', 'noisy', 'structured', 'unstructured', 'data', 'apply', 'knowledge', 'data', 'across', 'broad', 'range', 'application', 'domains'] The number of words stopped : 24 Length of words filtered sentence =  34
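
Instead of listing punctuation marks by hand as in the cell above, one option is to extend the stop list with Python's built-in string.punctuation; a minimal sketch (it reuses the paragraph a defined earlier in this notebook):

# Alternative sketch: build the custom stop list from string.punctuation
# instead of typing the punctuation marks manually.
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english')) | set(string.punctuation) | {"I"}
word_tokens = word_tokenize(a)   # 'a' is the paragraph defined earlier
filtered_sentence = [w for w in word_tokens if w not in stop_words]
print(filtered_sentence)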

Write a Python script that takes a paragraph of text and performs word tokenization using NLTK. Print the list of tokens and the filtered words after applying stop word removal.

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download necessary NLTK data files
nltk.download('punkt')
nltk.download('stopwords')

# Ask the user to enter a paragraph
paragraph = input("Please enter a paragraph of text: ")

# Tokenize the paragraph
tokens = word_tokenize(paragraph)

# Print the list of tokens
print("\nTokens:")
print(tokens)

# Get the list of stop words
stop_words = set(stopwords.words('english'))

# Filter out stop words
filtered_words = [word for word in tokens if word.lower() not in stop_words]

# Print the list of filtered words
print("\nFiltered Words:")
print(filtered_words)
[nltk_data] Downloading package punkt to [nltk_data] C:\Users\suyashi144893\AppData\Roaming\nltk_data... [nltk_data] Package punkt is already up-to-date! [nltk_data] Downloading package stopwords to [nltk_data] C:\Users\suyashi144893\AppData\Roaming\nltk_data... [nltk_data] Package stopwords is already up-to-date!
Please enter a paragraph of text: hello i am happy to hear that all of you are important to me. but none of you are important Tokens: ['hello', 'i', 'am', 'happy', 'to', 'hear', 'that', 'all', 'of', 'you', 'are', 'important', 'to', 'me', '.', 'but', 'none', 'of', 'you', 'are', 'important'] Filtered Words: ['hello', 'happy', 'hear', 'important', '.', 'none', 'important']

STEMMING

A word stem is the base part of a word. Stemming is a kind of normalization, but a linguistic one.

  • For example, the stem of the word Using is use.

from nltk.stem import PorterStemmer

ps = PorterStemmer()  ## defining the stemmer
s_words = ["AIIMS", "Aims", "Aimed", "Aimmer", "Aiming", "Aim"]
for i in s_words:
    print(ps.stem(i))
aiim aim aim aimmer aim aim
from nltk.stem import PorterStemmer

ps = PorterStemmer()  ## defining the stemmer
s_words = ["sing?"]
for i in s_words:
    print(ps.stem(i))
sing?
from nltk.stem import PorterStemmer

ps = PorterStemmer()  ## defining the stemmer
s_words = ["Calls", "Caller", "Calling", "Call", "Called"]
for i in s_words:
    print(ps.stem(i))
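
NLTK also ships other stemmers besides PorterStemmer; as a point of comparison (not used elsewhere in this notebook), the Snowball stemmer can be applied to the same words:

# Comparison sketch: the Snowball ("Porter2") stemmer on the same words.
from nltk.stem import SnowballStemmer

snow = SnowballStemmer("english")
for w in ["Calls", "Caller", "Calling", "Call", "Called"]:
    print(snow.stem(w))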

help(nltk)

Part of Speech tagging

This means labeling the words in a sentence as nouns, adjectives, verbs, and so on.


import nltk
from nltk.tokenize import PunktSentenceTokenizer

document = 'Whether you\'re new to DataScience or an paracetamol , it\'s easy to learn and use Python.Are you Good enough in Prgramming? I Am based in Delhi location'
sentences = nltk.sent_tokenize(document)
for sent in sentences:
    print(nltk.pos_tag(nltk.word_tokenize(sent)))
[('Whether', 'IN'), ('you', 'PRP'), ("'re", 'VBP'), ('new', 'JJ'), ('to', 'TO'), ('DataScience', 'NNP'), ('or', 'CC'), ('an', 'DT'), ('paracetamol', 'NN'), (',', ','), ('it', 'PRP'), ("'s", 'VBZ'), ('easy', 'JJ'), ('to', 'TO'), ('learn', 'VB'), ('and', 'CC'), ('use', 'VB'), ('Python.Are', 'NNP'), ('you', 'PRP'), ('Good', 'NNP'), ('enough', 'RB'), ('in', 'IN'), ('Prgramming', 'NNP'), ('?', '.')] [('I', 'PRP'), ('Am', 'VBP'), ('based', 'VBN'), ('in', 'IN'), ('Delhi', 'NNP'), ('location', 'NN')]
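
If any of the tag abbreviations in the output above are unfamiliar, NLTK can describe them; the lookup below assumes the 'tagsets' resource has been downloaded.

# Look up what a Penn Treebank tag means.
# Assumes nltk.download('tagsets') has been run.
import nltk

nltk.help.upenn_tagset('PRP')   # personal pronoun
nltk.help.upenn_tagset('VBP')   # verb, present tense, not 3rd person singular
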
sentences
## from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

document = 'Whether you\'re new to DataScience or an experienced, it\'s easy to learn and use Python.'
sentences = nltk.sent_tokenize(document)
data = []
for sent in sentences:
    data = data + nltk.pos_tag(nltk.word_tokenize(sent))
# Print only the coordinating conjunctions (tag 'CC')
for word in data:
    if 'CC' in word[1]:
        print(word)

Get synonyms/antonyms using WordNet

  • WordNet’s structure makes it a useful tool for computational linguistics and natural language processing

  • WordNet superficially resembles a thesaurus, in that it groups words together based on their meanings

# First, you're going to need to import wordnet:
from nltk.corpus import wordnet

# Then, we're going to use the term "Happy" to find synsets like so:
syns = wordnet.synsets("Happy")

# An example of a synset:
print(syns[0].name())

# Just the word:
print(syns[0].lemmas()[0].name())

# Definition of that first synset:
print(syns[0].definition())

# Examples of the word in use in sentences:
print(syns[0].examples())
happy.a.01 happy enjoying or showing or marked by joy or pleasure ['a happy smile', 'spent many happy days on the beach', 'a happy marriage']
syns = wordnet.synsets("Generative")

# An example of a synset:
print(syns[0].name())

# Just the word:
print(syns[0].lemmas()[0].name())

# Definition of that first synset:
print(syns[0].definition())

# Examples of the word in use in sentences:
print(syns[0].examples())
import nltk
from nltk.corpus import wordnet

synonyms = []
antonyms = []
for syn in wordnet.synsets("Sound"):
    for l in syn.lemmas():
        synonyms.append(l.name())
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())

print("Similar words =", set(synonyms))
print("Antonyms for sound =", set(antonyms))
Similar words = {'heavy', 'phone', 'fathom', 'voice', 'good', 'levelheaded', 'speech_sound', 'wakeless', 'profound', 'audio', 'healthy', 'sound', 'effectual', 'go', 'vocalise', 'strait', 'well-grounded', 'auditory_sensation', 'intelligent', 'vocalize', 'level-headed', 'legal', 'reasoned'} Antonyms for sound = {'devoice', 'unsound', 'silence'}

Filtering Duplicate Words: we can use sets

## Sets
s = {1, 2, 33, 33, 44, 0, -5}
s
import nltk

word_data = "The python is a a python data analytics language"
# First: word tokenization
nltk_tokens = nltk.word_tokenize(word_data)
# Applying a set to drop duplicates
no_order = list(set(nltk_tokens))
print(no_order)
['language', 'data', 'python', 'analytics', 'The', 'is', 'a']

Lemmatization

Lemmatizing reduces words to their core meaning, but unlike stemming it gives you a complete English word that makes sense on its own, instead of just a fragment of a word like 'danc'.

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

lem = WordNetLemmatizer()
a = "I have yellow and Black scarves. I love to wear scarf"
words = word_tokenize(a)
words
['I', 'have', 'yellow', 'and', 'Black', 'scarves', '.', 'I', 'love', 'to', 'wear', 'scarf']
from nltk.tokenize import word_tokenize

lemmatized_words = [lem.lemmatize(word) for word in words]
lemmatized_words
['I', 'have', 'yellow', 'and', 'Black', 'scarf', '.', 'I', 'love', 'to', 'wear', 'scarf']
from nltk.stem import PorterStemmer

ps = PorterStemmer()  ## defining the stemmer
s_words = ["Dances", "dances", "Dancing", "dancer", "dances", "danced", "ddd", "Sang", "sings", "singings", "that"]
s_words1 = ["dancess", "dances", "dancing", "dancer", "dances", "danced", "ddd"]
for i in s_words1:
    print(ps.stem(i))
dancess danc danc dancer danc danc ddd
s = "dancess dances dancing dancer dances danced ddd"
words = word_tokenize(s)
lm = [lem.lemmatize(word) for word in words]
lm
['dance', 'dance', 'dancing', 'dancer', 'dance', 'danced', 'ddd']

The lemmatizer treats words as nouns by default; for other parts of speech, pass the POS tag explicitly (see the sketch at the end of this section).

from nltk.stem import WordNetLemmatizer

lem = WordNetLemmatizer()  # re-create the lemmatizer so this cell runs on its own
a = "dancer"
# lem.lemmatize(a, pos="n")
lem.lemmatize("worst", pos="a")  # "a" = adjective
s = "these places have many worst wolves. My friends love nicer to visit Zoo.All of Us goods were wearing beautiful dressess"
words = word_tokenize(s)
lemmatized_words = [lem.lemmatize(word) for word in words]
lemmatized_words
from nltk.tokenize import word_tokenize

d = "Good nice , India is place for young people, Suyashi are you from Delhi? I love this place"
quote = word_tokenize(d)
quote
# Next step is to tag those words by part of speech:
import nltk
# nltk.download("averaged_perceptron_tagger")
pos_tags = nltk.pos_tag(quote)
pos_tags
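
Tying the POS tags back to lemmatization: since the WordNet lemmatizer defaults to nouns, a common pattern is to map each Penn Treebank tag to a WordNet POS before lemmatizing. A minimal sketch follows; the get_wordnet_pos helper is illustrative and not part of the original notebook.

# Sketch: lemmatize each word using its POS tag (helper name is illustrative).
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def get_wordnet_pos(treebank_tag):
    # Map a Penn Treebank tag to a WordNet POS constant; default to noun.
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

lem = WordNetLemmatizer()
print([lem.lemmatize(word, get_wordnet_pos(tag)) for word, tag in pos_tags])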