NLP Course 2 Week 1 Lesson : Building The Model - Lecture Exercise 01
Estimated Time: 10 minutes
Vocabulary Creation
Create a tiny vocabulary from a tiny corpus
It's time to start small!
Imports and Data
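The notebook's code cells are not reproduced here, so what follows is only a minimal sketch of what this step might look like. The corpus string is an assumption, chosen so that its word counts agree with the ones quoted in the ungraded exercise (blue 4, pink 3, and red, yellow and orange once each).

```python
import re                        # regular expressions, used later for tokenizing
from collections import Counter  # dict subclass that tallies hashable items

# Assumed tiny corpus: a handful of color names in mixed case
text = 'red pink pink blue blue yellow ORANGE BLUE BLUE PINK'
print(text)
print('string length :', len(text))
```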
Preprocessing
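A plausible preprocessing step for a corpus this small is to lowercase the text and split it into word tokens. The sketch below assumes the same illustrative corpus string and uses `re.findall` with the pattern `\w+`; the original cell may well differ.

```python
import re

# Assumed tiny corpus (not taken from the original notebook)
text = 'red pink pink blue blue yellow ORANGE BLUE BLUE PINK'

# Lowercase first so that 'BLUE' and 'blue' are treated as the same word
text_lowercase = text.lower()

# Tokenize: \w+ matches runs of letters, digits and underscores
words = re.findall(r'\w+', text_lowercase)
print(words)
print('count :', len(words))
```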
Create Vocabulary
Option 1 : A set of distinct words from the text
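As a rough illustration of this option, a Python `set` built from the token list keeps exactly one copy of each word. The token list below is an assumed, already preprocessed stand-in for the notebook's data.

```python
# Assumed preprocessed tokens (lowercased and split into words)
words = ['red', 'pink', 'pink', 'blue', 'blue',
         'yellow', 'orange', 'blue', 'blue', 'pink']

# A set discards duplicates, leaving only the distinct words
vocab = set(words)
print(vocab)
print('count :', len(vocab))
```

A set carries no ordering and no frequency information, which is the gap the word-count options address.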
Add Information with Word Counts
Option 2 : Two alternatives for including the word count as well
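The two alternatives could look roughly like the sketch below. The names `counts_a` and `counts_b` come from the text, but the bodies are an assumption: one plain dictionary filled in a loop, and one `collections.Counter`. The token list is likewise an illustrative stand-in.

```python
from collections import Counter

# Assumed preprocessed tokens (lowercased and split into words)
words = ['red', 'pink', 'pink', 'blue', 'blue',
         'yellow', 'orange', 'blue', 'blue', 'pink']

# Alternative a: plain dict, with dict.get supplying 0 for unseen words
counts_a = dict()
for w in words:
    counts_a[w] = counts_a.get(w, 0) + 1

# Alternative b: collections.Counter performs the same tally in one call
counts_b = Counter(words)

print(counts_a)
print(counts_b)
print('count :', len(counts_b))
```

Both hold the same word-to-count mapping. They differ in how they are built and displayed: the plain dict prints in insertion order, while the `Counter` prints from most to least frequent.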
Ungraded Exercise
Note that counts_b, above, returned by collections.Counter, is displayed in descending order of word count (words with equal counts keep the order in which they first appeared)
Can you modify the tiny corpus of text so that a new color appears between pink and red in counts_b?
Do you need to run all the cells again, or just specific ones?
Expected Outcome:
counts_b : Counter({'blue': 4, 'pink': 3, 'your_new_color_here': 2, 'red': 1, 'yellow': 1, 'orange': 1})
count : 6
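One possible way to reach this outcome, sketched under the assumption that the corpus is a plain string of color names, is to add a new color exactly twice. Its count of 2 then falls between pink (3) and red (1). The color `green` and the corpus string are hypothetical choices for illustration.

```python
import re
from collections import Counter

# Assumed corpus with a hypothetical new color, 'green', appended twice
text = 'red pink pink blue blue yellow ORANGE BLUE BLUE PINK green green'

# Same assumed preprocessing: lowercase, then split into word tokens
words = re.findall(r'\w+', text.lower())

counts_b = Counter(words)
print('counts_b :', counts_b)
print('count :', len(counts_b))
```

Under these assumptions, only the cells from the corpus definition onward depend on the text, so the imports would not need to be run again.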
Summary
This is a tiny example but the methodology scales very well.
In the assignment you will create a large vocabulary of thousands of words, from a corpus
of tens of thousands of words! But the mechanics are exactly the same.
The only extra things to pay attention to will be run time, memory management, and the vocab data structure.
So the choice of approach used in code blocks counts_a vs counts_b, above, will be important.