GitHub Repository: ethen8181/machine-learning
Path: blob/master/model_selection/partial_dependence/__pycache__/partial_dependence.cpython-36.pyc

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from math import ceil
from joblib import Parallel, delayed
from matplotlib.gridspec import GridSpec

__all__ = ['PartialDependenceExplainer']


class PartialDependenceExplainer:
    """
    Partial Dependence explanation [1]_.

    - Supports scikit-learn like classification and regression estimators.
    - Works for both numerical and categorical columns.

    Parameters
    ----------
    estimator : sklearn-like classifier
        Model that was fitted on the data.

    n_grid_points : int, default 50
        Number of grid points used in replacement
        for the original numeric data. Only used
        if the targeted column is numeric. For a categorical
        column, the number of grid points will always be
        the distinct number of categories in that column.
        A smaller number of grid points serves as an
        approximation for the total number of unique
        points and will result in faster computation.

    batch_size : int, default 'auto'
        Compute the partial dependence prediction batch by batch to
        save memory usage. The default batch size will be
        ceil(number of rows in the data / the number of grid points used).

    n_jobs : int, default 1
        Number of jobs to run in parallel. If the model already fits
        extremely fast on the data, then specify 1 so that there's no
        overhead of spawning different processes to do the computation.

    verbose : int, default 1
        The verbosity level: if nonzero, progress messages are printed.
        Above 50, the output is sent to stdout. The frequency of the messages
        increases with the verbosity level. If it is more than 10, all
        iterations are reported.

    pre_dispatch : int or str, default '2*n_jobs'
        Controls the number of jobs that get dispatched during parallel
        execution. Reducing this number can be useful to avoid an
        explosion of memory consumption when more jobs get dispatched
        than CPUs can process. Possible inputs:
            - None, in which case all the jobs are immediately
              created and spawned. Use this for lightweight and
              fast-running jobs, to avoid delays due to on-demand
              spawning of the jobs.
            - An int, giving the exact number of total jobs that are
              spawned.
            - A string, giving an expression as a function of n_jobs,
              as in '2*n_jobs'.

    Attributes
    ----------
    feature_name_ : str
        The feature_name passed to .fit, unmodified; it will
        be used in subsequent methods.

    feature_type_ : str
        The feature_type passed to .fit, unmodified; it will
        be used in subsequent methods.

    feature_grid_ : 1d ndarray
        Unique grid points that were used to generate the
        partial dependence result.

    results_ : list of DataFrame
        Partial dependence result. If it's a classification
        estimator then each index of the list is the result
        for each class. On the other hand, if it's a regression
        estimator, it will be a list with 1 element.

    References
    ----------
    .. [1] `Python partial dependence plot toolbox
            <https://github.com/SauceCat/PDPbox>`_
    """

    def __init__(self, estimator, n_grid_points=50, batch_size='auto',
                 n_jobs=1, verbose=1, pre_dispatch='2*n_jobs'):
        self.n_jobs = n_jobs
        self.verbose = verbose
        self.estimator = estimator
        self.pre_dispatch = pre_dispatch
        self.n_grid_points = n_grid_points

    def fit(self, data, feature_name, feature_type):
        """
        Obtain the partial dependence result.

        Parameters
        ----------
        data : DataFrame, shape [n_samples, n_features]
            Input data to the estimator/model.

        feature_name : str
            Name of the feature in the data that we wish to explain.

        feature_type : str, {'num', 'cat'}
            Specify whether feature_name is a numerical or
            categorical column.

        Returns
        -------
        self
        """
        n_grid_points = self.n_grid_points
        try:
            # a classifier exposes classes_ and predict_proba
            n_classes = self.estimator.classes_.size
            is_classifier = True
            predict = self.estimator.predict_proba
        except AttributeError:
            # fall back to a regression estimator, which is
            # treated as having a single "class"
            n_classes = 1
            is_classifier = False
            predict = self.estimator.predict

        target = data[feature_name]
        unique_target = np.unique(target)
        n_unique = unique_target.size
        if feature_type == 'num':
            if n_grid_points >= n_unique:
                feature_grid = unique_target
            else:
                # use percentiles so the grid follows the
                # distribution of the feature
                percentile = np.percentile(target, np.linspace(0, 100, n_grid_points))
                feature_grid = np.unique(percentile)

            feature_cols = feature_grid
        else:
            feature_grid = unique_target
            feature_cols = np.asarray(['{}_{}'.format(feature_name, category)
                                       for category in unique_target])

        # the batch size is derived from the number of rows and the
        # number of grid points that were actually used
        n_rows = data.shape[0]
        batch_size = ceil(n_rows / feature_grid.size)
        parallel = Parallel(n_jobs=self.n_jobs, verbose=self.verbose,
                            pre_dispatch=self.pre_dispatch)
        outputs = parallel(delayed(_predict_batch)(data_batch, feature_grid, feature_name,
                                                   is_classifier, n_classes, predict)
                           for data_batch in _data_iter(data, batch_size))

        # each output holds one DataFrame per class; regroup them so that
        # every element of results is the full result for a single class
        results = []
        for output in zip(*outputs):
            result = pd.concat(output, ignore_index=True)
            result.columns = feature_cols
            results.append(result)

        self.results_ = results
        self.feature_name_ = feature_name
        self.feature_grid_ = feature_grid
        self.feature_type_ = feature_type
        return self

    def plot(self, centered=True, target_class=0):
        """
        Use the partial dependence result to generate
        a partial dependence plot (using matplotlib).

        Parameters
        ----------
        centered : bool, default True
            Center the partial dependence plot by subtracting every partial
            dependence result table's column value with the value of the first
            column, i.e. the first column's value will serve as the baseline
            (centered at 0) for all other values.

        target_class : int, default 0
            The target class to show for the partial dependence result.
            For a regression task, we can leave the default number unmodified,
            but for a classification task, we should specify the target class
            parameter to meet our needs.

        Returns
        -------
        figure
        """
        figure = plt.figure()
        gs = GridSpec(5, 1)
        ax1 = plt.subplot(gs[0, :])
        self._plot_title(ax1)
        ax2 = plt.subplot(gs[1:, :])
        self._plot_content(ax2, centered, target_class)
        return figure

    def _plot_title(self, ax):
        font_family = 'Arial'
        title = 'Partial Dependence Plot for {}'.format(self.feature_name_)
        subtitle = 'Number of unique grid points: {}'.format(self.feature_grid_.size)
        title_fontsize = 15
        subtitle_fontsize = 12

        ax.set_facecolor('white')
        ax.text(0, 0.7, title, fontsize=title_fontsize, fontname=font_family)
        ax.text(0, 0.4, subtitle, color='grey',
                fontsize=subtitle_fontsize, fontname=font_family)
        ax.axis('off')

    def _plot_content(self, ax, centered, target_class):
        # styling for the partial dependence (pd) line, the
        # baseline and the standard deviation band
        pd_linewidth = 2
        pd_markersize = 5
        pd_color = '#1A4E5D'
        fill_alpha = 0.2
        fill_color = '#66C2D7'
        zero_linewidth = 1.5
        zero_color = '#E75438'
        xlabel_fontsize = 10

        results = self.results_[target_class]
        x = self.feature_grid_
        if self.feature_type_ == 'cat':
            # for a categorical feature, plot on evenly spaced tick
            # positions and label the ticks with the categories
            x = range(len(x))
            ax.set_xticks(x)
            ax.set_xticklabels(self.feature_grid_)

        pd = results.values.mean(axis=0)
        if centered:
            pd -= pd[0]

        pd_std = results.values.std(axis=0)
        upper = pd + pd_std
        lower = pd - pd_std

        ax.plot(x, pd, color=pd_color, linewidth=pd_linewidth,
                marker='o', markersize=pd_markersize)
        ax.plot(x, [pd[0]] * len(pd), color=zero_color,
                linestyle='--', linewidth=zero_linewidth)
        ax.fill_between(x, upper, lower, alpha=fill_alpha, color=fill_color)
        ax.set_xlabel(self.feature_name_, fontsize=xlabel_fontsize)
        self._modify_axis(ax)

    def _modify_axis(self, ax):
        tick_labelsize = 8
        tick_colors = '#9E9E9E'
        tick_labelcolor = '#424242'
        ax.tick_params(axis='both', which='major', colors=tick_colors,
                       labelsize=tick_labelsize, labelcolor=tick_labelcolor)
        ax.set_facecolor('white')

        # keep tick marks only on the left/bottom and hide all spines
        ax.get_yaxis().tick_left()
        ax.get_xaxis().tick_bottom()
        for direction in ('top', 'left', 'right', 'bottom'):
            ax.spines[direction].set_visible(False)

        # overlay a light dashed grid on both axes
        for axis in ('x', 'y'):
            ax.grid(True, axis=axis, ls='--', lw=0.45, c='k', alpha=0.3)


def _data_iter(data, batch_size):
    """Used by PartialDependenceExplainer to loop through the data by batch"""
    n_rows = data.shape[0]
    for i in range(0, n_rows, batch_size):
        yield data[i:i + batch_size].reset_index(drop=True)


def _predict_batch(data_batch, feature_grid, feature_name,
                   is_classifier, n_classes, predict):
    """Used by PartialDependenceExplainer to generate prediction by batch"""
    # repeat every row once per grid point, then overwrite the target
    # feature with the tiled grid (the ICE data)
    index_batch = np.repeat(data_batch.index.values, repeats=feature_grid.size)
    ice_data = data_batch.iloc[index_batch].copy()
    ice_data[feature_name] = np.tile(feature_grid, data_batch.shape[0])

    results = []
    prediction = predict(ice_data)
    for n_class in range(n_classes):
        if is_classifier:
            result = prediction[:, n_class]
        else:
            result = prediction

        # reshape so each row holds one observation's predictions
        # across the whole feature grid
        reshaped = result.reshape(data_batch.shape[0], feature_grid.size)
        result = pd.DataFrame(reshaped)
        results.append(result)

    return results
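
A minimal, self-contained sketch of the core trick used by `_predict_batch`: repeat every row once per grid value, overwrite the target column with the tiled grid, then reshape the predictions so each row holds one observation's predictions across the grid; averaging over rows gives the partial dependence. The function name, the toy model and the data below are made up for illustration and assume a default RangeIndex.

```python
import numpy as np
import pandas as pd


def partial_dependence_sketch(predict, data, feature_name, feature_grid):
    # repeat every row once per grid point (assumes a default RangeIndex,
    # so positional iloc lines up with the index values)
    index = np.repeat(data.index.values, repeats=feature_grid.size)
    ice_data = data.iloc[index].copy()
    # overwrite the target column with the tiled grid
    ice_data[feature_name] = np.tile(feature_grid, data.shape[0])

    prediction = predict(ice_data)
    # rows = original observations, columns = grid points
    ice_table = prediction.reshape(data.shape[0], feature_grid.size)
    # averaging the per-observation curves yields the partial dependence
    return ice_table.mean(axis=0)


# toy regression "model": depends linearly on feature 'a'
toy_predict = lambda df: 2.0 * df['a'].values + df['b'].values

data = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 10.0, 10.0]})
grid = np.array([0.0, 1.0, 2.0])
pd_values = partial_dependence_sketch(toy_predict, data, 'a', grid)
print(pd_values)  # -> [10. 12. 14.] for this toy model
```

For a classifier, `prediction` would instead be a `(n_rows * n_grid, n_classes)` probability matrix and the reshape would be done once per class column, which is exactly the loop `_predict_batch` runs.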