ethen8181
GitHub Repository: ethen8181/machine-learning
Path: blob/master/trees/__pycache__/tree.cpython-35.pyc


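# Source recovered from the bytecode dump of /Users/ethen/machine-learning/trees/tree.py.
# Names, constants and docstrings are taken from the dump; comparison operators
# are not legible there and are assumptions wherever marked "assumed".
import numpy as np

__all__ = ['Tree']


class Tree:
    """
    Classification tree using information gain with entropy as impurity

    Parameters
    ----------
    max_features : int or None, default None
        The number of features to consider when looking for the best split,
        None uses all features

    min_samples_split : int, default 10
        The minimum number of samples required to split an internal node

    max_depth : int, default 3
        Maximum depth of the tree

    minimum_gain : float, default 1e-7
        Minimum information gain required for splitting
    """

    def __init__(self, max_depth=3, max_features=None,
                 minimum_gain=1e-7, min_samples_split=10):
        self.max_depth = max_depth
        self.max_features = max_features
        self.minimum_gain = minimum_gain
        self.min_samples_split = min_samples_split

    def fit(self, X, y):
        """pass in the 2d-array dataset and the response column"""
        self.n_class = np.unique(y).shape[0]

        # fall back to all features when max_features is unset or too large (">" assumed)
        if self.max_features is None or self.max_features > X.shape[1]:
            self.max_features = X.shape[1]

        self.feature_importance = np.zeros(X.shape[1])
        self.tree = _create_decision_tree(
            X, y, self.max_depth, self.minimum_gain, self.max_features,
            self.min_samples_split, self.n_class, self.feature_importance, X.shape[0])
        # normalize the accumulated gains so the importances sum to 1
        self.feature_importance /= np.sum(self.feature_importance)
        return self

    def predict(self, X):
        proba = self.predict_proba(X)
        pred = np.argmax(proba, axis=1)
        return pred

    def predict_proba(self, X):
        proba = np.empty((X.shape[0], self.n_class))
        for i in range(X.shape[0]):
            proba[i] = self._predict_row(X[i, :], self.tree)
        return proba

    def _predict_row(self, row, tree):
        """Predict single row"""
        if tree['is_leaf']:
            return tree['prob']
        if row[tree['split_col']] <= tree['threshold']:  # "<=" assumed, matching _split
            return self._predict_row(row, tree['left'])
        else:
            return self._predict_row(row, tree['right'])


def _create_decision_tree(X, y, max_depth, minimum_gain, max_features,
                          min_samples_split, n_class, feature_importance, n_row):
    """recursively grow the decision tree until it reaches the stopping criteria"""
    try:
        # each failed assertion is a stopping criterion (">" assumed in all three)
        assert max_depth > 0
        assert X.shape[0] > min_samples_split
        column, value, gain = _find_best_split(X, y, max_features)
        assert gain > minimum_gain
        # weight the gain by the fraction of training rows that reach this node
        feature_importance[column] += (X.shape[0] / n_row) * gain

        left_X, right_X, left_y, right_y = _split(X, y, column, value)
        left_child = _create_decision_tree(
            left_X, left_y, max_depth - 1, minimum_gain, max_features,
            min_samples_split, n_class, feature_importance, n_row)
        right_child = _create_decision_tree(
            right_X, right_y, max_depth - 1, minimum_gain, max_features,
            min_samples_split, n_class, feature_importance, n_row)
    except AssertionError:
        # leaf node: store the class distribution of the rows that reached it
        counts = np.bincount(y, minlength=n_class)
        prob = counts / y.shape[0]
        leaf = {'is_leaf': True, 'prob': prob}
        return leaf

    node = {'is_leaf': False, 'left': left_child, 'right': right_child,
            'split_col': column, 'threshold': value}
    return node


def _find_best_split(X, y, max_features):
    """Greedy algorithm to find the best feature and value for a split"""
    subset = np.random.choice(X.shape[1], max_features, replace=False)
    max_col, max_val, max_gain = None, None, None
    parent_entropy = _compute_entropy(y)

    for column in subset:
        split_values = _find_splits(X, column)
        for value in split_values:
            splits = _split(X, y, column, value, return_X=False)
            gain = parent_entropy - _compute_splits_entropy(y, splits)
            if max_gain is None or gain > max_gain:  # ">" assumed
                max_col, max_val, max_gain = column, value, gain

    return max_col, max_val, max_gain


def _compute_entropy(split):
    """entropy score using a fixed log base 2"""
    _, counts = np.unique(split, return_counts=True)
    p = counts / split.shape[0]
    entropy = -np.sum(p * np.log2(p))  # leading minus assumed (entropy is non-negative)
    return entropy


def _find_splits(X, column):
    """
    find all possible split values (threshold),
    by getting unique values in a sorted order
    and finding cutoff point (average) between every two values
    """
    X_unique = np.unique(X[:, column])
    split_values = np.empty(X_unique.shape[0] - 1)
    for i in range(1, X_unique.shape[0]):
        average = (X_unique[i - 1] + X_unique[i]) / 2
        split_values[i - 1] = average

    return split_values


def _compute_splits_entropy(y, splits):
    """compute the entropy for the splits (the two child nodes)"""
    splits_entropy = 0
    for split in splits:
        splits_entropy += (split.shape[0] / y.shape[0]) * _compute_entropy(split)

    return splits_entropy


def _split(X, y, column, value, return_X=True):
    """split the response column using the cutoff threshold"""
    left_mask = X[:, column] <= value  # "<=" / ">" assumed
    right_mask = X[:, column] > value
    left_y, right_y = y[left_mask], y[right_mask]

    if not return_X:
        return left_y, right_y
    else:
        left_X, right_X = X[left_mask], X[right_mask]
        return left_X, right_X, left_y, right_y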