Path: blob/master/clustering/tfidf/__pycache__/feature_extraction.cpython-35.pyc
import numpy as np
from collections import defaultdict
from scipy.sparse import spdiags, csr_matrix
from sklearn.preprocessing import normalize
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

BaseEstimator�TransformerMixin)�ENGLISH_STOP_WORDSc @ s� e Z d Z d Z d d d d d d d � Z d d d
� Z d d d � Z d
d � Z d d � Z d d � Z d d � Z
d d � Z d d � Z d S)�CountVectorizera�
    Convert a collection of text documents to a matrix of token counts.
    This implementation produces a sparse representation of the counts
    using scipy.sparse.csr_matrix.

    The number of features will be equal to the vocabulary size found by
    analyzing all input documents after removing stop words.

    Parameters
    ----------
    analyzer : str {'word'} or callable
        Whether the features should be made of words; if an n-gram range
        is specified, the words are concatenated with a space.
        If a callable is passed, it is used to extract the sequence of
        features out of the raw, unprocessed input.

    ngram_range : tuple (min_n, max_n)
        The lower and upper boundary of the range of n-values for different
        n-grams to be extracted. All values of n such that
        min_n <= n <= max_n will be used.

    token_pattern : str
        Regular expression denoting what constitutes a "token", only used
        if ``analyzer == 'word'``. The default regexp selects tokens of 2
        or more alphanumeric characters (punctuation is completely ignored
        and always treated as a token separator).

    stop_words : str {'english'}, collection, or None, default None
        - If 'english', a built-in stop word list for English is used.
        - If a collection, that list or set is assumed to contain stop
          words, all of which will be removed from the resulting tokens.
          Only applies if ``analyzer == 'word'``.
        - If None, no stop words will be used.

    lowercase : bool, default True
        Convert all characters to lowercase before tokenizing.

    Attributes
    ----------
    vocabulary_ : dict
        A mapping of terms to feature indices.
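
    Examples
    --------
    A minimal usage sketch; the two toy documents (and the ``dtm`` name)
    are illustrative only, not part of the original module::

        vect = CountVectorizer(stop_words='english')
        dtm = vect.fit_transform(['the quick brown fox', 'the lazy brown dog'])
        # dtm is a scipy.sparse.csr_matrix of shape [n_documents, n_terms];
        # vect.vocabulary_ maps each remaining term to its column index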
    """

    def __init__(self, analyzer='word', ngram_range=(1, 1),
                 token_pattern=r'\b\w\w+\b', stop_words=None, lowercase=True):
        self.analyzer = analyzer
        self.lowercase = lowercase
        self.stop_words = stop_words
        self.ngram_range = ngram_range
        self.token_pattern = token_pattern

    def fit(self, raw_documents, y=None):
        """
        Learn the vocabulary dictionary of all tokens in the raw documents.

        Parameters
        ----------
        raw_documents : iterable
            An iterable which yields str.

        Returns
        -------
        self
        """
        self.fit_transform(raw_documents)
        return self

    def fit_transform(self, raw_documents, y=None):
        """
        Learn the vocabulary dictionary and return the document-term matrix.
        This is equivalent to calling fit followed by transform, but more
        efficiently implemented.

        Parameters
        ----------
        raw_documents : iterable
            An iterable which yields str.

        Returns
        -------
        X : scipy sparse matrix, shape [n_samples, n_features]
            Count document-term matrix.
        """
        if isinstance(raw_documents, str):
            raise ValueError('Iterable over raw text documents expected, '
                             'string object received')

        X, vocabulary = self._count_vocab(raw_documents, fixed_vocab=False)
        self.vocabulary_ = vocabulary
        return X

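    # Hedged sketch of the "fit followed by transform" equivalence described
    # above. It assumes a ``transform`` method exists elsewhere in this class
    # (implied by the fixed_vocab branch of _count_vocab below); ``docs`` is
    # any iterable of strings:
    #
    #     X1 = CountVectorizer().fit_transform(docs)        # single pass
    #     X2 = CountVectorizer().fit(docs).transform(docs)  # two passes
    #
    # Both produce the same document-term matrix; fit_transform simply avoids
    # analyzing every document twice.
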
    def _count_vocab(self, raw_documents, fixed_vocab):
        """Create sparse feature matrix, and vocabulary if fixed_vocab=False"""
        if fixed_vocab:
            vocabulary = self.vocabulary_
        else:
            # assign a new index to every vocabulary item the first time it's seen
            vocabulary = defaultdict()
            vocabulary.default_factory = vocabulary.__len__

        analyze = self._build_analyzer()
        data = []
        indices = []
        indptr = []
        indptr.append(0)
        for doc in raw_documents:
            feature_counter = {}
            for feature in analyze(doc):
                try:
                    feature_idx = vocabulary[feature]
                    if feature_idx not in feature_counter:
                        feature_counter[feature_idx] = 1
                    else:
                        feature_counter[feature_idx] += 1
                except KeyError:
                    # ignore out-of-vocabulary items when fixed_vocab is True
                    continue

            indices.extend(feature_counter.keys())
            data.extend(feature_counter.values())
            indptr.append(len(indices))

        if not fixed_vocab:
            # disable default_factory so unseen terms raise KeyError from now on
            vocabulary = dict(vocabulary)

        data = np.asarray(data, dtype=np.intc)
        indices = np.asarray(indices, dtype=np.intc)
        indptr = np.asarray(indptr, dtype=np.intc)
        shape = (len(indptr) - 1, len(vocabulary))
        X = csr_matrix((data, indices, indptr), shape=shape, dtype=np.intc)
        return X, vocabulary
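
    # Illustration (not part of the original class) of how the
    # (data, indices, indptr) arrays built above define the CSR matrix.
    # The numbers are a made-up two-document example:
    #
    #     data    = [1, 2, 1]    # non-zero counts, row by row
    #     indices = [0, 1, 1]    # column (term) index of each count
    #     indptr  = [0, 2, 3]    # doc i owns entries indptr[i]:indptr[i + 1]
    #
    #     csr_matrix((data, indices, indptr), shape=(2, 2)).toarray()
    #     # -> array([[1, 2],
    #     #           [0, 1]])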