CoCalc -- HashingVectorizer.ipynb

GitHub Repository: kavgan/nlp-in-practice
Path: blob/master/hashingvectorizer/HashingVectorizer.ipynb
Kernel: Python 3

HashingVectorizer

In [74]:

from sklearn.feature_extraction.text import HashingVectorizer

# dataset
cat_in_the_hat_docs=[
      "One Cent, Two Cents, Old Cent, New Cent: All About Money (Cat in the Hat's Learning Library)",
      "Inside Your Outside: All About the Human Body (Cat in the Hat's Learning Library)",
      "Oh, The Things You Can Do That Are Good for You: All About Staying Healthy (Cat in the Hat's Learning Library)",
      "On Beyond Bugs: All About Insects (Cat in the Hat's Learning Library)",
      "There's No Place Like Space: All About Our Solar System (Cat in the Hat's Learning Library)" 
]

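# Compute raw counts using the hashing vectorizer.
# A small n_features raises the chance of hash collisions
# (distinct tokens mapped to the same column).
# norm=None keeps raw counts; alternate_sign=False keeps them non-negative.
hvectorizer = HashingVectorizer(n_features=10000, norm=None, alternate_sign=False)

# HashingVectorizer is stateless, so fit_transform learns nothing and simply transforms
X = hvectorizer.fit_transform(cat_in_the_hat_docs)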

In [75]:

# 5 docs, 10000 columns
X.shape

Out[75]:

(5, 10000)

In [60]:

# print the populated (non-zero) columns of the first document
# format: (row index, column index)  raw count
print(X[0])

Out[60]:

  (0, 93)	3.0
  (0, 689)	1.0
  (0, 717)	1.0
  (0, 1664)	1.0
  (0, 2759)	1.0
  (0, 3124)	1.0
  (0, 4212)	1.0
  (0, 4380)	1.0
  (0, 5044)	1.0
  (0, 7353)	1.0
  (0, 8903)	1.0
  (0, 8958)	1.0
  (0, 9376)	1.0
  (0, 9402)	1.0
  (0, 9851)	1.0
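The column indices above come from hashing each token and reducing the hash modulo n_features. The following is a minimal sketch of that idea in plain Python, not scikit-learn's implementation: it assumes `zlib.crc32` as a stand-in for the MurmurHash3 function scikit-learn uses, so the column numbers will differ from Out[60]. The helper name `hashed_counts` is hypothetical. Running it with a very small n_features shows how collisions merge distinct tokens into one column.

```python
import re
import zlib
from collections import Counter

# Default scikit-learn token pattern: words of two or more characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def hashed_counts(doc, n_features):
    """Count tokens per column, where column = hash(token) % n_features.

    zlib.crc32 is an assumption used for illustration; scikit-learn
    hashes tokens with MurmurHash3 instead.
    """
    counts = Counter()
    for token in TOKEN_PATTERN.findall(doc.lower()):
        column = zlib.crc32(token.encode("utf-8")) % n_features
        counts[column] += 1
    return dict(counts)

doc = ("One Cent, Two Cents, Old Cent, New Cent: All About Money "
       "(Cat in the Hat's Learning Library)")

wide = hashed_counts(doc, n_features=10000)   # collisions unlikely
narrow = hashed_counts(doc, n_features=8)     # collisions guaranteed: 15 distinct tokens, 8 columns

print(sorted(wide.items()))
print(sorted(narrow.items()))
```

The total count is the same in both cases because every token still lands somewhere; what is lost with the narrow table is the ability to tell colliding tokens apart.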

Achieving The Same with CountVectorizer

In [72]:

from sklearn.feature_extraction.text import CountVectorizer
cvectorizer = CountVectorizer()

# learn the vocabulary and compute raw counts (this rebinds X to the CountVectorizer matrix)
X = cvectorizer.fit_transform(cat_in_the_hat_docs)

In [69]:

print(X.shape)

Out[69]:

(5, 43)

In [70]:

print(X[0])

Out[70]:

  (0, 28)	1
  (0, 8)	3
  (0, 40)	1
  (0, 9)	1
  (0, 26)	1
  (0, 23)	1
  (0, 1)	1
  (0, 0)	1
  (0, 22)	1
  (0, 7)	1
  (0, 16)	1
  (0, 37)	1
  (0, 13)	1
  (0, 19)	1
  (0, 20)	1

In [71]:

cvectorizer.vocabulary_

Out[71]:

{'one': 28,
 'cent': 8,
 'two': 40,
 'cents': 9,
 'old': 26,
 'new': 23,
 'all': 1,
 'about': 0,
 'money': 22,
 'cat': 7,
 'in': 16,
 'the': 37,
 'hat': 13,
 'learning': 19,
 'library': 20,
 'inside': 18,
 'your': 42,
 'outside': 30,
 'human': 15,
 'body': 4,
 'oh': 25,
 'things': 39,
 'you': 41,
 'can': 6,
 'do': 10,
 'that': 36,
 'are': 2,
 'good': 12,
 'for': 11,
 'staying': 34,
 'healthy': 14,
 'on': 27,
 'beyond': 3,
 'bugs': 5,
 'insects': 17,
 'there': 38,
 'no': 24,
 'place': 31,
 'like': 21,
 'space': 33,
 'our': 29,
 'solar': 32,
 'system': 35}
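Because CountVectorizer stores this vocabulary, a column index can be mapped back to its term, which HashingVectorizer cannot do since it keeps no vocabulary. As a small sketch, the snippet below inverts the mapping and decodes the first row. The dictionary and the (column, count) pairs are copied by hand from Out[71] and Out[70]; the names `index_to_term` and `decoded` are illustrative only.

```python
# Entries of cvectorizer.vocabulary_ that occur in the first document (from Out[71])
vocabulary = {
    "one": 28, "cent": 8, "two": 40, "cents": 9, "old": 26,
    "new": 23, "all": 1, "about": 0, "money": 22, "cat": 7,
    "in": 16, "the": 37, "hat": 13, "learning": 19, "library": 20,
}

# (column index, count) pairs of the first row (from Out[70])
first_row = [
    (28, 1), (8, 3), (40, 1), (9, 1), (26, 1),
    (23, 1), (1, 1), (0, 1), (22, 1), (7, 1),
    (16, 1), (37, 1), (13, 1), (19, 1), (20, 1),
]

# Invert term -> index into index -> term
index_to_term = {index: term for term, index in vocabulary.items()}

# Translate each populated column back to its term
decoded = {index_to_term[column]: count for column, count in first_row}
print(decoded)
```

The decoded row shows that column 8, the only one with a count of 3, is the term 'cent'.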
