GitHub Repository: ethen8181/machine-learning
Path: blob/master/clustering/tfidf/__pycache__/feature_extraction.cpython-35.pyc


import re
import numpy as np
from collections import defaultdict
from scipy.sparse import spdiags, csr_matrix
from sklearn.preprocessing import normalize
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS


class CountVectorizer(BaseEstimator):
    """
    Convert a collection of text documents to a matrix of token counts.
    This implementation produces a sparse representation of the counts using
    scipy.sparse.csr_matrix.

    The number of features will be equal to the vocabulary size found by
    analyzing all input documents and removing stop words.

    Parameters
    ----------
    analyzer : str {'word'} or callable
        Whether the features should be made of words. If an n-gram range is
        specified, the words within each n-gram are concatenated with a space.
        If a callable is passed it is used to extract the sequence of features
        out of the raw, unprocessed input.

    ngram_range : tuple (min_n, max_n)
        The lower and upper boundary of the range of n-values for different
        n-grams to be extracted. All values of n such that min_n <= n <= max_n
        will be used.

    token_pattern : str
        Regular expression denoting what constitutes a "token", only used
        if ``analyzer == 'word'``. The default regexp selects tokens of 2
        or more alphanumeric characters (punctuation is completely ignored
        and always treated as a token separator).

    stop_words : string {'english'}, collection, or None, default None
        - If 'english', a built-in stop word list for English is used.
        - If a collection, that list or set is assumed to contain stop words, all of which
        will be removed from the resulting tokens. Only applies if ``analyzer == 'word'``.
        - If None, no stop words will be used.

    lowercase : bool, default True
        Convert all characters to lowercase before tokenizing.

    Attributes
    ----------
    vocabulary_ : dict
        A mapping of terms to feature indices.
    """

    def __init__(self, analyzer='word', ngram_range=(1, 1),
                 token_pattern=r'\b\w\w+\b', stop_words=None, lowercase=True):
        self.analyzer = analyzer
        self.lowercase = lowercase
        self.stop_words = stop_words
        self.ngram_range = ngram_range
        self.token_pattern = token_pattern

    def fit(self, raw_documents, y=None):
        """
        Learn the vocabulary dictionary of all tokens in the raw documents.

        Parameters
        ----------
        raw_documents : iterable
            An iterable which yields str

        Returns
        -------
        self
        """
        self.fit_transform(raw_documents)
        return self

    def fit_transform(self, raw_documents, y=None):
        """
        Learn the vocabulary dictionary and return document-term matrix.
        This is equivalent to calling fit followed by transform, but more
        efficiently implemented.

        Parameters
        ----------
        raw_documents : iterable
            An iterable which yields str.

        Returns
        -------
        X : scipy sparse matrix, shape [n_samples, n_features]
            Count document-term matrix.
        """
        if isinstance(raw_documents, str):
            raise ValueError('Iterable over raw text documents expected, '
                             'string object received')

        X, vocabulary = self._count_vocab(raw_documents, fixed_vocab=False)
        self.vocabulary_ = vocabulary
        return X

    def _count_vocab(self, raw_documents, fixed_vocab):
        """Create sparse feature matrix and vocabulary if fixed_vocab = False"""
        if fixed_vocab:
            vocabulary = self.vocabulary_
        else:
            # assign a new column index whenever a new term is encountered
            vocabulary = defaultdict()
            vocabulary.default_factory = vocabulary.__len__

        analyze = self._build_analyzer()
        values = []
        indices = []
        indptr = []
        indptr.append(0)
        for doc in raw_documents:
            feature_counter = {}
            for feature in analyze(doc):
                try:
                    feature_idx = vocabulary[feature]
                    if feature_idx not in feature_counter:
                        feature_counter[feature_idx] = 1
                    else:
                        feature_counter[feature_idx] += 1
                except KeyError:
                    # ignore out-of-vocabulary terms when the vocabulary is fixed
                    continue

            indices.extend(feature_counter.keys())
            values.extend(feature_counter.values())
            indptr.append(len(indices))

        if not fixed_vocab:
            # disable defaultdict behaviour
            vocabulary = dict(vocabulary)

        values = np.asarray(values, dtype=np.intc)
        indices = np.asarray(indices, dtype=np.intc)
        indptr = np.asarray(indptr, dtype=np.intc)
        shape = (len(indptr) - 1, len(vocabulary))
        X = csr_matrix((values, indices, indptr), shape=shape, dtype=np.intc)
        return X, vocabulary
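
    # Illustration (not in the original file): for the two documents
    # ['cat dog cat', 'dog bird'] with vocabulary {'cat': 0, 'dog': 1, 'bird': 2},
    # the triplets built above are roughly
    #     values  = [2, 1, 1, 1]    count for each (document, term) pair
    #     indices = [0, 1, 1, 2]    column (term) index of each count
    #     indptr  = [0, 2, 4]       document i owns the slice indptr[i]:indptr[i + 1]
    # so csr_matrix((values, indices, indptr), shape=(2, 3)) is the count matrix.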

    def _build_analyzer(self):
        """Return a callable that handles preprocessing and tokenization"""
        if callable(self.analyzer):
            return self.analyzer
        elif self.analyzer == 'word':
            stop_words = self._get_stop_words()
            tokenize = self._build_tokenizer()
            return lambda doc: self._word_ngrams(tokenize(doc), stop_words)
        else:
            raise ValueError(
                '{} is not a valid tokenization scheme/analyzer'.format(self.analyzer))

    def _build_tokenizer(self):
        """Returns a function that splits a string into a sequence of tokens"""
        token_pattern = re.compile(self.token_pattern)
        if self.lowercase:
            return lambda doc: token_pattern.findall(doc.lower())
        else:
            return lambda doc: token_pattern.findall(doc)

    def _get_stop_words(self):
        """Build or fetch the effective stop words frozenset"""
        stop = self.stop_words
        if stop == 'english':
            return ENGLISH_STOP_WORDS
        elif stop is None:
            return None
        elif isinstance(stop, str):
            raise ValueError('Stop words not a collection')
        else:
            return frozenset(stop)

    def _word_ngrams(self, tokens, stop_words=None):
        """Tokenize document into a sequence of n-grams after stop words filtering"""
        if stop_words is not None:
            tokens = [w for w in tokens if w not in stop_words]

        min_n, max_n = self.ngram_range
        if max_n == 1:
            return tokens

        original_tokens = list(tokens)
        n_original_tokens = len(original_tokens)
        if min_n == 1:
            # unigrams are already in place, start generating from bigrams
            min_n += 1
        else:
            tokens = []

        tokens_append = tokens.append
        space_join = ' '.join
        for n in range(min_n, min(max_n + 1, n_original_tokens + 1)):
            for i in range(n_original_tokens - n + 1):
                tokens_append(space_join(original_tokens[i: i + n]))
        return tokens
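
    # For example (hypothetical values, not taken from the original file):
    # with ngram_range=(1, 2) the tokens ['quick', 'brown', 'fox'] become
    # ['quick', 'brown', 'fox', 'quick brown', 'brown fox'].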

    def transform(self, raw_documents):
        """
        Transform documents to document-term matrix.
        Extract token counts out of raw text documents using the vocabulary
        fitted with fit or fit_transform.

        Parameters
        ----------
        raw_documents : iterable
            An iterable which yields str.

        Returns
        -------
        X : scipy sparse matrix, [n_samples, n_features]
            Document-term matrix.
        """
        X, _ = self._count_vocab(raw_documents, fixed_vocab=True)
        return X
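

# Example usage of CountVectorizer (an illustrative sketch; the documents below
# are made up and are not part of the original file):
#
#     docs = ['The quick brown fox', 'The fox jumped over the lazy dog']
#     vect = CountVectorizer(ngram_range=(1, 2), stop_words='english')
#     counts = vect.fit_transform(docs)   # scipy.sparse.csr_matrix of token counts
#     vect.vocabulary_                    # mapping such as {'quick': ..., 'quick brown': ...}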
s).r	c@sIeZdZdZdddddd�Zddd	�Zd
d�ZdS)�TfidfTransformera�
    Transform a count matrix to a tf-idf representation.

    Parameters
    ----------
    norm : 'l1', 'l2' or None, default 'l2'
        Norm used to normalize term vectors. None for no normalization.

    smooth_idf : bool, default True
        Smooth idf weights by adding one to document frequencies, as if an
        extra document was seen containing every term in the collection
        exactly once. Prevents zero divisions.

    sublinear_tf : bool, default False
        Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).

    copy : bool, default True
        Whether to copy input data and operate on the copy or perform in-place operations.
    """

    def __init__(self, norm='l2', smooth_idf=True, sublinear_tf=False, copy=True):
        self.norm = norm
        self.copy = copy
        self.smooth_idf = smooth_idf
        self.sublinear_tf = sublinear_tf

    def fit(self, X, y=None):
        """
        Learn the idf vector.

        Parameters
        ----------
        X : scipy sparse matrix, shape [n_samples, n_features]
            Count document-term matrix.
        """
        n_samples, n_features = X.shape
        doc_freq = np.bincount(X.indices, minlength=X.shape[1])

        # smooth idf: act as if an extra document containing every term was seen
        doc_freq += int(self.smooth_idf)
        n_samples += int(self.smooth_idf)
        idf = np.log(float(n_samples) / doc_freq) + 1.0

        # store the idf weights as a sparse diagonal matrix
        self._idf_diag = spdiags(idf, diags=0, m=n_features, n=n_features, format='csr')
        return self

    def transform(self, X):
        """
        Transform a count matrix to tf-idf representation.

        Parameters
        ----------
        X : scipy sparse matrix, [n_samples, n_features]
            Count document-term matrix.

        Returns
        -------
        X : scipy sparse matrix, [n_samples, n_features]
            Tf-idf weighted document-term matrix.
        """
        if self.copy:
            X = X.copy()

        if self.sublinear_tf:
            X.data = np.log(X.data)
            X.data += 1

        X = X * self._idf_diag
        if self.norm is not None:
            X = normalize(X, norm=self.norm, copy=False)

        return X
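

# Worked example of the idf weighting above (illustrative numbers): with
# smooth_idf=True a term appearing in df documents out of n receives
#
#     idf = ln((1 + n) / (1 + df)) + 1
#
# so for n = 4 documents and df = 2, idf = ln(5 / 3) + 1 ~= 1.51, and every raw
# count of that term is scaled by ~1.51 before the optional l2 normalization.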


class TfidfVectorizer(CountVectorizer):
    """
    Convert a collection of raw documents to a matrix of TF-IDF features.
    This is equivalent to CountVectorizer followed by TfidfTransformer.

    Parameters
    ----------
    analyzer : str {'word'} or callable
        Whether the features should be made of words. If an n-gram range is
        specified, the words within each n-gram are concatenated with a space.
        If a callable is passed it is used to extract the sequence of features
        out of the raw, unprocessed input.

    ngram_range : tuple (min_n, max_n)
        The lower and upper boundary of the range of n-values for different
        n-grams to be extracted. All values of n such that min_n <= n <= max_n
        will be used.

    token_pattern : str
        Regular expression denoting what constitutes a "token", only used
        if ``analyzer == 'word'``. The default regexp selects tokens of 2
        or more alphanumeric characters (punctuation is completely ignored
        and always treated as a token separator).

    stop_words : string {'english'}, collection, or None, default None
        - If 'english', a built-in stop word list for English is used.
        - If a collection, that list or set is assumed to contain stop words, all of which
        will be removed from the resulting tokens. Only applies if ``analyzer == 'word'``.
        - If None, no stop words will be used.

    lowercase : bool, default True
        Convert all characters to lowercase before tokenizing.

    norm : 'l1', 'l2' or None, default 'l2'
        Norm used to normalize term vectors. None for no normalization.

    smooth_idf : bool, default True
        Smooth idf weights by adding one to document frequencies, as if an
        extra document was seen containing every term in the collection
        exactly once. Prevents zero divisions.

    sublinear_tf : bool, default False
        Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).

    copy : bool, default True
        Whether to copy input data and operate on the copy or perform in-place operations.

    Attributes
    ----------
    vocabulary_ : dict
        A mapping of terms to feature indices.
    """

    def __init__(self, analyzer='word', ngram_range=(1, 1),
                 token_pattern=r'\b\w\w+\b', stop_words=None, lowercase=True,
                 norm='l2', smooth_idf=True, sublinear_tf=False, copy=True):
        super().__init__(
            analyzer=analyzer, ngram_range=ngram_range,
            token_pattern=token_pattern, stop_words=stop_words,
            lowercase=lowercase)
        self._tfidf = TfidfTransformer(
            norm=norm, smooth_idf=smooth_idf,
            sublinear_tf=sublinear_tf, copy=copy)

    def fit(self, raw_documents, y=None):
        """
        Learn vocabulary and idf from training set.

        Parameters
        ----------
        raw_documents : iterable
            An iterable which yields str.

        Returns
        -------
        self
        """
        X = super().fit_transform(raw_documents)
        self._tfidf.fit(X)
        return self

    def fit_transform(self, raw_documents, y=None):
        """
        Learn vocabulary and idf, return term-document matrix.
        This is equivalent to calling fit followed by transform, but more
        efficiently implemented.

        Parameters
        ----------
        raw_documents : iterable
            An iterable which yields str.

        Returns
        -------
        X : scipy sparse matrix, shape [n_samples, n_features]
            Tf-idf weighted document-term matrix.
        """
        X = super().fit_transform(raw_documents)
        return self._tfidf.fit_transform(X)

    def transform(self, raw_documents, copy=True):
        """
        Transform documents to document-term matrix.

        Uses the vocabulary and document frequencies learned by fit or
        fit_transform.

        Parameters
        ----------
        raw_documents : iterable
            An iterable which yields str.

        copy : boolean, default True
            Whether to copy X and operate on the copy or perform in-place
            operations.

        Returns
        -------
        X : scipy sparse matrix, shape [n_samples, n_features]
            Tf-idf weighted document-term matrix.
        """
        X = super().transform(raw_documents)
        return self._tfidf.transform(X)

    @property
    def norm(self):
        return self._tfidf.norm

    @norm.setter
    def norm(self, value):
        self._tfidf.norm = value

    @property
    def use_idf(self):
        # kept for scikit-learn API parity; note that the TfidfTransformer
        # defined above does not itself expose a use_idf attribute
        return self._tfidf.use_idf

    @use_idf.setter
    def use_idf(self, value):
        self._tfidf.use_idf = value

    @property
    def smooth_idf(self):
        return self._tfidf.smooth_idf

    @smooth_idf.setter
    def smooth_idf(self, value):
        self._tfidf.smooth_idf = value

    @property
    def sublinear_tf(self):
        return self._tfidf.sublinear_tf

    @sublinear_tf.setter
    def sublinear_tf(self, value):
        self._tfidf.sublinear_tf = value
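

# A minimal end-to-end sketch (not part of the original module); the sample
# documents below are made up purely for illustration.
if __name__ == '__main__':
    sample_docs = [
        'The sky is blue and beautiful',
        'Love this blue and beautiful sky',
        'The quick brown fox jumps over the lazy dog'
    ]
    tfidf_vect = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))
    tfidf = tfidf_vect.fit_transform(sample_docs)
    print(tfidf_vect.vocabulary_)   # term / bigram -> column index
    print(tfidf.toarray())          # l2-normalized tf-idf weights per document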