Path: blob/master/Natural Language Processing using Python/Vectorization (TF-IDF) .ipynb
Count Vectorizer:
Count Vectorizer converts a collection of text documents into a matrix of token counts. Each document is represented by a vector whose elements are the counts of particular words (tokens) in that document.
TF-IDF (Term Frequency-Inverse Document Frequency):
TF-IDF reflects the importance of a term in a document relative to its frequency in the entire corpus. It combines two components: Term Frequency (TF) and Inverse Document Frequency (IDF).
Term Frequency (TF): Term frequency is a measure of how often a term appears in a document. It is calculated as the ratio of the number of times a term appears in a document to the total number of terms in that document.
Inverse Document Frequency (IDF):
This measures the rarity of a term across all documents in the corpus. It is calculated as the logarithm of the ratio of the total number of documents to the number of documents containing the term; implementations usually add a smoothing term so the ratio stays defined even for a term that appears in no document.
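Written as formulas, the two components look like this (the IDF shown is the smoothed variant used by scikit-learn, one common choice among several):

```latex
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log\frac{1 + N}{1 + \mathrm{df}(t)} + 1
```

Here \(f_{t,d}\) is the count of term \(t\) in document \(d\), \(N\) is the total number of documents, and \(\mathrm{df}(t)\) is the number of documents containing \(t\).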
This gives higher weight to terms that are frequent within a document but rare across all documents, effectively highlighting their importance in representing the content of the document.
Quick Practice
Data for CountVectorizer:
"Ashi is happy to work in Gurgoan. Work brings happiness to Ashi."
"Happy people work better in Gurgoan. Ashi enjoys the work culture here."
"The work environment in her company makes Ashi very happy."
"Happiness comes from working hard and work life balance"
Lab Questions:
Apply CountVectorizer to the provided text data and answer the following:
Display the Count Vectorizer Matrix.
Display the Vocabulary.
Identify the term with the highest frequency.
Which term(s) occur in all sentences?
Explain how the frequency of the word 'happy' changes across the sentences.
What would happen if we set stop_words='english' in the CountVectorizer?