Path: blob/master/Generative NLP Models using Python/1 Natural Language Processing.ipynb
Natural language processing (NLP)
NLP is a branch of artificial intelligence that helps computers understand, interpret, and manipulate human language. It draws on many disciplines, including computer science and computational linguistics, in its pursuit to bridge the gap between human communication and computer understanding.
NLTK
NLTK is a popular Python framework for working with human language data. It includes a suite of text-processing libraries for classification and semantic reasoning, as well as wrappers for industrial-strength NLP libraries and an active discussion forum.
The NLTK module is a massive toolkit aimed at supporting the entire Natural Language Processing (NLP) workflow.
NLTK will aid you with everything from splitting paragraphs into sentences and sentences into words, to recognizing the part of speech of those words and highlighting the main subjects, and even with helping your machine understand what the text is all about.
Installation
conda install -c anaconda nltk
Components of NLP
The five main components of Natural Language Processing are:
Morphological and Lexical Analysis
Lexical analysis deals with a language's vocabulary, that is, its words and expressions.
It involves analyzing, identifying, and describing the structure of words.
It includes dividing a text into paragraphs, sentences, and words.
Individual words are analyzed into their components, and non-word tokens such as punctuation marks are separated from the words.
Semantic Analysis:
Semantic analysis assigns meanings to the structures created by the syntactic analyzer.
This component maps linear sequences of words into structures.
It shows how the words are associated with each other.
Pragmatic Analysis:
Pragmatic analysis deals with the overall communicative and social context and its effect on interpretation.
It means abstracting or deriving the purposeful use of language in situations.
In this analysis, the main focus is always on what was said being reinterpreted as what was actually meant.
Syntax Analysis:
Words are commonly accepted as the smallest units of syntax.
Syntax refers to the principles and rules that govern the sentence structure of any individual language.
Discourse Integration:
It means making sense of the context.
The meaning of any single sentence depends on the sentences that come before it; it can also influence the meaning of the sentence that follows.
NLP and writing systems
The kind of writing system used for a language is one of the deciding factors in determining the best approach for text pre-processing. Writing systems can be:
Logographic: a large number of individual symbols represent words. Examples: Japanese, Mandarin
Syllabic: individual symbols represent syllables
Alphabetic: individual symbols represent sounds
Challenges
Extracting meaning (semantics) from a text is a challenge.
NLP depends on the quality of the corpus: if the domain is vast, it is difficult to understand the context.
There is a dependence on the character set and the language.
Your feedback was excellent. Your feedback was bad.
["Your", "feedback", "was", "excellent"]
First step: conda install -c anaconda nltk (or pip install nltk)
Second step: import nltk then nltk.download()
NLP libraries in Python
NLTK
Gensim (topic modelling, document summarization)
CoreNLP (linguistic analysis)
spaCy
TextBlob
Pattern (web mining)
pip install --proxy http://noidasezproxy.corp.exlservice.com:8000 package
Tokenizing Words & Sentences
Text can be split into sentences and words using the methods sent_tokenize() and word_tokenize() respectively.
Quick Practice:
Do you know Customer and target audience’s reviews can be analyzed? You can use this! to create a roadmap of features and products.
Convert above para into word token and sentence token
Stopping Words
To do this, we need a way to convert words to values, numbers, or signal patterns. The process of converting data to something a computer can understand is referred to as "pre-processing." One of the major forms of pre-processing is filtering out less useful data. In natural language processing, these unimportant words are referred to as stop words.
For now, we'll consider stop words to be words that carry no meaning on their own, and we want to remove them.
We can do this easily by storing a list of words that you consider to be stop words. NLTK starts you off with a set of words that it considers stop words; you can access it via the NLTK corpus.
Write a Python script that takes a paragraph of text and performs word tokenization using NLTK. Print the list of tokens and the filtered words after applying stop-word removal.
STEMMING
A word stem is part of a word; stemming is a kind of normalization idea, but a linguistic one.
For example, stemming reduces using, used, and uses toward a common root such as use.
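A quick illustration with NLTK's PorterStemmer; note that a stem is a root form and is not guaranteed to be a dictionary word (dancing, for instance, stems to danc):

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Print each word next to its Porter stem.
for word in ["running", "runs", "dancing", "danced", "cats"]:
    print(word, "->", ps.stem(word))
```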
help(nltk)
Part of Speech tagging
This means labeling words in a sentence as nouns, adjectives, verbs, and so on.
Get synonyms/antonyms using WordNet
WordNet's structure makes it a useful tool for computational linguistics and natural language processing.
WordNet superficially resembles a thesaurus in that it groups words together based on their meanings.
Filtering Duplicate Words: we can use sets.
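A quick sketch of de-duplicating a token list with a set:

```python
words = ["your", "feedback", "was", "excellent", "your", "feedback", "was", "bad"]

# set() keeps one copy of each word; sorted() restores a predictable order.
unique_words = sorted(set(words))
print(unique_words)  # ['bad', 'excellent', 'feedback', 'was', 'your']
```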
What is Lemmatization?
Lemmatization is the process of converting a word to its base or dictionary form (called a lemma), considering the context and part of speech (POS).
Unlike stemming (which simply chops off word endings), lemmatization ensures that the result is a valid word, not just a root form.
Uses:
Smarter text normalization: dancing, danced, dances → dance
Uses grammar rules and vocabulary (via POS tagging)
Important in NLP tasks like text classification, search engines, and question answering
e.g "Soni loves dancing and danced well at the dance competition
| Word | POS Tag | Lemma | Notes |
|---|---|---|---|
| Soni | NNP | Soni | Proper noun (unchanged) |
| loves | VBZ | love | Verb converted to base form |
| dancing | VBG | dance | Verb gerund → base |
| danced | VBD | dance | Past-tense verb → base |
| dance | NN | dance | Already a noun |
Example 2
sentences = [
    "Soni was dancing in the rain while Kron watched.",
    "Kron loved the rain and danced in it.",
    "They are loving the way Soni dances.",
    "It rained heavily but Soni kept dancing.",
    "Kron is a warrior who loved to fight and dance."
]
Named Entity Recognition (NER)
Named Entity Recognition (NER) is the process of detecting named entities (like persons, organizations, locations, dates) in text. Example:
“Soni visited Paris in May to meet Kron from Microsoft.”
Entities here:
Soni → Person
Paris → Location
May → Date
Kron → Person
Microsoft → Organization
| Sentence | Named Entities |
|---|---|
| "Shivi plays football in New York every summer." | Shivi (PERSON), New York (GPE) |
| "Delhi hosted an international football tournament." | Delhi (GPE) |
| "Shivi traveled from Delhi to New York to play football." | Shivi (PERSON), Delhi (GPE), New York (GPE) |
| "Football is Shivi's favorite sport in Delhi." | Shivi (PERSON), Delhi (GPE) |
| "In New York, Shivi met players from different countries." | New York (GPE), Shivi (PERSON) |