GitHub Repository: ethen8181/machine-learning
Path: blob/master/trees/__pycache__/tree.cpython-36.pyc

import numpy as np


class Tree:
    """
    Classification tree using information gain with entropy as impurity

    Parameters
    ----------
    max_features : int or None, default None
        The number of features to consider when looking for the best split,
        None uses all features

    min_samples_split : int, default 10
        The minimum number of samples required to split an internal node

    max_depth : int, default 3
        Maximum depth of the tree

    minimum_gain : float, default 1e-7
        Minimum information gain required for splitting
    """

    def __init__(self, max_features=None, min_samples_split=10,
                 max_depth=3, minimum_gain=1e-7):
        self.max_depth = max_depth
        self.max_features = max_features
        self.minimum_gain = minimum_gain
        self.min_samples_split = min_samples_split

    def fit(self, X, y):
        """pass in the 2d-array dataset and the response column"""
        self.n_class = np.unique(y).shape[0]
        if self.max_features is None or self.max_features > X.shape[1]:
            self.max_features = X.shape[1]

        self.feature_importance = np.zeros(X.shape[1])
        self.tree = _create_decision_tree(X, y, self.max_depth, self.minimum_gain,
                                          self.max_features, self.min_samples_split,
                                          self.n_class, self.feature_importance,
                                          X.shape[0])
        self.feature_importance /= np.sum(self.feature_importance)
        return self

    def predict(self, X):
        proba = self.predict_proba(X)
        pred = np.argmax(proba, axis=1)
        return pred

    def predict_proba(self, X):
        proba = np.empty((X.shape[0], self.n_class))
        for i in range(X.shape[0]):
            proba[i] = self._predict_row(X[i, :], self.tree)

        return proba

    def _predict_row(self, row, node):
        """Predict single row"""
        if node['is_leaf']:
            return node['prob']

        if row[node['split_col']] <= node['threshold']:
            return self._predict_row(row, node['left'])
        else:
            return self._predict_row(row, node['right'])


def _create_decision_tree(X, y, max_depth, minimum_gain, max_features,
                          min_samples_split, n_class, feature_importance, n_row):
    """recursively grow the decision tree until it reaches the stopping criteria"""
    try:
        assert max_depth > 0
        assert X.shape[0] > min_samples_split
        column, value, gain = _find_best_split(X, y, max_features)
        assert gain > minimum_gain
        feature_importance[column] += (X.shape[0] / n_row) * gain

        left_X, right_X, left_y, right_y = _split(X, y, column, value)
        left_child = _create_decision_tree(left_X, left_y, max_depth - 1, minimum_gain,
                                           max_features, min_samples_split,
                                           n_class, feature_importance, n_row)
        right_child = _create_decision_tree(right_X, right_y, max_depth - 1, minimum_gain,
                                            max_features, min_samples_split,
                                            n_class, feature_importance, n_row)
    except AssertionError:
        # any failed stopping criterion turns the current node into a leaf,
        # whose class probabilities are the label frequencies at this node
        counts = np.bincount(y, minlength=n_class)
        prob = counts / y.shape[0]
        leaf = {'is_leaf': True, 'prob': prob}
        return leaf

    node = {'is_leaf': False,
            'left': left_child,
            'right': right_child,
            'split_col': column,
            'threshold': value}
    return node


def _find_best_split(X, y, max_features):
    """Greedy algorithm to find the best feature and value for a split"""
    subset = np.random.choice(X.shape[1], max_features, replace=False)
    max_col, max_val, max_gain = None, None, None
    parent_entropy = _compute_entropy(y)

    for column in subset:
        split_values = _find_splits(X, column)
        for value in split_values:
            splits = _split(X, y, column, value, return_X=False)
            gain = parent_entropy - _compute_splits_entropy(y, splits)
            if max_gain is None or gain > max_gain:
                max_col, max_val, max_gain = column, value, gain

    return max_col, max_val, max_gain


def _compute_entropy(split):
    """entropy score using a fixed log base 2"""
    _, counts = np.unique(split, return_counts=True)
    p = counts / split.shape[0]
    entropy = -np.sum(p * np.log2(p))
    return entropy


def _find_splits(X, column):
    """
    find all possible split values (threshold),
    by getting unique values in a sorted order
    and finding cutoff point (average) between every two values
    """
    X_unique = np.unique(X[:, column])
    split_values = np.empty(X_unique.shape[0] - 1)
    for i in range(1, X_unique.shape[0]):
        average = (X_unique[i - 1] + X_unique[i]) / 2
        split_values[i - 1] = average

    return split_values


def _compute_splits_entropy(y, splits):
    """compute the entropy for the splits (the two child nodes)"""
    splits_entropy = 0
    for split in splits:
        splits_entropy += (split.shape[0] / y.shape[0]) * _compute_entropy(split)

    return splits_entropy


def _split(X, y, column, value, return_X=True):
    """split the response column using the cutoff threshold"""
    left_mask = X[:, column] <= value
    right_mask = X[:, column] > value
    left_y, right_y = y[left_mask], y[right_mask]
    if not return_X:
        return left_y, right_y
    else:
        left_X, right_X = X[left_mask], X[right_mask]
        return left_X, right_X, left_y, right_y


__all__ = ['Tree']