GitHub Repository: kavgan/nlp-in-practice
Path: blob/master/hashingvectorizer/HashingVectorizer.ipynb
Kernel: Python 3

HashingVectorizer

from sklearn.feature_extraction.text import HashingVectorizer

# dataset
cat_in_the_hat_docs = [
    "One Cent, Two Cents, Old Cent, New Cent: All About Money (Cat in the Hat's Learning Library)",
    "Inside Your Outside: All About the Human Body (Cat in the Hat's Learning Library)",
    "Oh, The Things You Can Do That Are Good for You: All About Staying Healthy (Cat in the Hat's Learning Library)",
    "On Beyond Bugs: All About Insects (Cat in the Hat's Learning Library)",
    "There's No Place Like Space: All About Our Solar System (Cat in the Hat's Learning Library)",
]

# Compute raw counts using the hashing vectorizer.
# A small n_features value can cause hash collisions.
hvectorizer = HashingVectorizer(n_features=10000, norm=None, alternate_sign=False)

# compute counts without any term frequency normalization
X = hvectorizer.fit_transform(cat_in_the_hat_docs)
# 5 docs, 10000 columns
X.shape
(5, 10000)
# print populated columns of first document
# format: (doc id, pos_in_matrix) raw_count
print(X[0])
(0, 93)    3.0
(0, 689)   1.0
(0, 717)   1.0
(0, 1664)  1.0
(0, 2759)  1.0
(0, 3124)  1.0
(0, 4212)  1.0
(0, 4380)  1.0
(0, 5044)  1.0
(0, 7353)  1.0
(0, 8903)  1.0
(0, 8958)  1.0
(0, 9376)  1.0
(0, 9402)  1.0
(0, 9851)  1.0
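
The note about small n_features values and hash collisions can be seen directly. The following is a minimal sketch, not part of the original notebook: the shortened `doc` string, the `hashed_row` helper, and the choice of 8 buckets are assumptions made for illustration. The text has 9 distinct tokens, so with only 8 columns at least two tokens must share a column, while the total count stays the same.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Shortened version of the first title (illustrative): 11 tokens, 9 distinct.
doc = ["One Cent, Two Cents, Old Cent, New Cent: All About Money"]


def hashed_row(n_features):
    """Hash the document into n_features columns using raw counts."""
    hv = HashingVectorizer(n_features=n_features, norm=None, alternate_sign=False)
    # HashingVectorizer is stateless, so transform() works without fit().
    return hv.transform(doc)


small = hashed_row(8)       # fewer columns than distinct tokens: collisions are forced
large = hashed_row(10000)   # same setting as the notebook cell above

print("n_features=8:    ", small.nnz, "populated columns, total count", small.sum())
print("n_features=10000:", large.nnz, "populated columns, total count", large.sum())
```

With `alternate_sign=False`, colliding tokens simply add their counts into the same column, so the row total is preserved but the per-token counts can no longer be told apart.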

Achieving The Same with CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer

cvectorizer = CountVectorizer()

# compute counts without any term frequency normalization
X = cvectorizer.fit_transform(cat_in_the_hat_docs)
print(X.shape)
(5, 43)
print(X[0])
(0, 28)  1
(0, 8)   3
(0, 40)  1
(0, 9)   1
(0, 26)  1
(0, 23)  1
(0, 1)   1
(0, 0)   1
(0, 22)  1
(0, 7)   1
(0, 16)  1
(0, 37)  1
(0, 13)  1
(0, 19)  1
(0, 20)  1
cvectorizer.vocabulary_
{'one': 28, 'cent': 8, 'two': 40, 'cents': 9, 'old': 26, 'new': 23, 'all': 1, 'about': 0, 'money': 22, 'cat': 7, 'in': 16, 'the': 37, 'hat': 13, 'learning': 19, 'library': 20, 'inside': 18, 'your': 42, 'outside': 30, 'human': 15, 'body': 4, 'oh': 25, 'things': 39, 'you': 41, 'can': 6, 'do': 10, 'that': 36, 'are': 2, 'good': 12, 'for': 11, 'staying': 34, 'healthy': 14, 'on': 27, 'beyond': 3, 'bugs': 5, 'insects': 17, 'there': 38, 'no': 24, 'place': 31, 'like': 21, 'space': 33, 'our': 29, 'solar': 32, 'system': 35}
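
To check that the two vectorizers produce the same counts for the first document, one option is to compare the sorted non-zero values of each row, since the column positions differ. This is a rough sketch rather than part of the original notebook: the `nonzero_counts` helper is a hypothetical name, and the comparison only holds when no two tokens collide in the hashed representation. It also shows that HashingVectorizer keeps no `vocabulary_`, so its column indices cannot be mapped back to tokens.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

# First title from the dataset above.
doc = ["One Cent, Two Cents, Old Cent, New Cent: All About Money "
       "(Cat in the Hat's Learning Library)"]


def nonzero_counts(matrix):
    """Return the sorted non-zero counts of the first row (hypothetical helper)."""
    row = matrix.toarray()[0]
    return np.sort(row[row > 0])


cv = CountVectorizer()
hv = HashingVectorizer(n_features=10000, norm=None, alternate_sign=False)

cv_counts = nonzero_counts(cv.fit_transform(doc))
hv_counts = nonzero_counts(hv.transform(doc))

print("CountVectorizer counts:  ", cv_counts)
print("HashingVectorizer counts:", hv_counts)
print("Same total count:", cv_counts.sum() == hv_counts.sum())

# CountVectorizer stores a token -> column mapping; HashingVectorizer does not.
print("cv has vocabulary_:", hasattr(cv, "vocabulary_"))
print("hv has vocabulary_:", hasattr(hv, "vocabulary_"))
```

The trade-off is that the hashed matrix needs no stored vocabulary, at the cost of not being able to recover which token a column stands for.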