Gradient boosting over decision trees¶

Let a training set $X = \{(x_i, y_i)\}_{i=1}^N$ be given, together with a quality functional

$$Q(a, X) = \sum_{i=1}^N L(y_i, a(x_i)),$$

which we want to minimize. Gradient boosting builds compositions of the form

$$a_M(x) = \sum_{j=0}^M \gamma_j b_j(x),$$

where $b_j \in \mathcal{B}$ are base algorithms from some parametric family $\mathcal{B}$.

The composition is built step by step. At step $M$ the algorithm $b_M$ is added to the composition as follows:

1) Compute the shifts of the current composition on the sample $X$: $$s_i^{(M)} = - \frac{\partial L}{\partial z} (y_i, z)\Bigg|_{z = a_{M-1}(x_i)}$$
2) Train a new base algorithm on the sample $\{(x_i, s_i^{(M)})\}_{i=1}^N$: $$b_M = \arg\min_{b \in \mathcal{B}} \sum_{i=1}^N (b(x_i) - s_i^{(M)})^2$$
3) Choose the coefficient $\gamma_M$ of the new base algorithm: $$\gamma_M = \arg \min_{\gamma} \sum_{i=1}^N L(y_i, a_{M-1}(x_i) + \gamma b_M(x_i))$$

Shallow decision trees are a convenient choice of base algorithms: they have high bias and low variance, and boosting reduces the bias of the composition step by step.
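The three steps above can be sketched in a few lines for the squared loss $L(y, z) = (y - z)^2 / 2$, where the shifts are simply the residuals. This is a minimal illustration and not the scikit-learn implementation; the helper names `fit_gradient_boosting` and `predict_gradient_boosting`, the constant initial model and the demo data are assumptions made for this sketch.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_gradient_boosting(x, y, n_stages=50, max_depth=1):
    """Gradient boosting for L(y, z) = (y - z)^2 / 2 with tree base learners."""
    # b_0 is taken to be the constant prediction mean(y) with gamma_0 = 1 (assumption)
    init = float(np.mean(y))
    current = np.full(len(y), init)
    stages = []
    for _ in range(n_stages):
        # Step 1: shifts s_i = -dL/dz at z = a_{M-1}(x_i); for this loss they are residuals
        shifts = y - current
        # Step 2: fit the new base learner to the shifts by least squares
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(x, shifts)
        pred = tree.predict(x)
        # Step 3: line search for gamma; for the squared loss it has a closed form
        denom = float(pred @ pred)
        gamma = float(shifts @ pred) / denom if denom > 0 else 0.0
        current = current + gamma * pred
        stages.append((gamma, tree))
    return init, stages


def predict_gradient_boosting(model, x):
    """Evaluate a_M(x) = b_0 + sum_j gamma_j * b_j(x)."""
    init, stages = model
    out = np.full(x.shape[0], init)
    for gamma, tree in stages:
        out += gamma * tree.predict(x)
    return out


# Hypothetical demo data: a noisy step function
rng = np.random.RandomState(0)
x_demo = np.linspace(0, 2, 100).reshape(-1, 1)
y_demo = (x_demo.ravel() < 1.0) + rng.randn(100) * 0.1

model = fit_gradient_boosting(x_demo, y_demo, n_stages=20)
mse_start = np.mean((y_demo - y_demo.mean()) ** 2)
mse_end = np.mean((y_demo - predict_gradient_boosting(model, x_demo)) ** 2)
print(mse_start, mse_end)
```

For other losses only steps 1 and 3 change: the shifts come from the derivative of the chosen loss, and the line search for $\gamma_M$ generally has to be done numerically.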

Comparing boosting and bagging on a toy example¶

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

X_train = np.linspace(0, 2, 100)
X_test = np.linspace(0, 2, 1000)

@np.vectorize
def stair(x):
    return x < 1.0

Y_train = stair(X_train) + np.random.randn(*X_train.shape) * 0.1  # the target is a step function with noise

plt.figure(figsize = (16, 9))
plt.scatter(X_train, Y_train, s=50)

Let us train a large number of decision trees on this sample with bagging.

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor

mod1 = BaggingRegressor(DecisionTreeRegressor(max_depth=1), warm_start=True)
# Note the warm_start parameter: with warm_start=True, a later increase of n_estimators
# adds new base algorithms to the ones already fitted instead of refitting the ensemble from scratch.
plt.figure(figsize=(20, 30))
sizes = [1, 2, 5, 20, 100, 500, 1000, 2000]
for i in range(len(sizes)):
    mod1.n_estimators = sizes[i]
    mod1.fit(X_train.reshape(-1, 1), Y_train)
    plt.subplot(4, 2, i+1)
    plt.xlim([0, 2])
    plt.scatter(X_train, Y_train, s=30)
    plt.plot(X_test, mod1.predict(X_test.reshape(-1, 1)), c='red', linewidth=2)
    plt.title('The number of trees: {} '.format(sizes[i]))
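The effect of warm_start can be checked directly on a small ensemble. The sketch below is an illustration under assumed demo data (the names `x_ws`, `y_ws` and `bag` are hypothetical); it grows a bagging ensemble from 2 to 5 stumps and verifies that the first two fitted trees are kept as the same objects.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Hypothetical demo data: a noisy step function
rng = np.random.RandomState(0)
x_ws = np.linspace(0, 2, 100).reshape(-1, 1)
y_ws = (x_ws.ravel() < 1.0) + rng.randn(100) * 0.1

bag = BaggingRegressor(DecisionTreeRegressor(max_depth=1), n_estimators=2,
                       warm_start=True, random_state=0)
bag.fit(x_ws, y_ws)
first_trees = list(bag.estimators_)  # remember the trees fitted so far

# Raising n_estimators and refitting only trains the three missing trees
bag.n_estimators = 5
bag.fit(x_ws, y_ws)
kept = all(old is new for old, new in zip(first_trees, bag.estimators_))
print(len(bag.estimators_), kept)
```

This is why the loop over sizes must go through increasing values: with warm_start=True the ensemble can only be extended, not shrunk.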

mod2 = GradientBoostingRegressor(max_depth=1, learning_rate=1,warm_start=True)
plt.figure(figsize=(20, 30))
for i in range(len(sizes)):
    mod2.n_estimators = sizes[i]
    mod2.fit(X_train.reshape(-1, 1), Y_train)
    plt.subplot(4, 2, i+1)
    plt.xlim([0, 2])
    plt.scatter(X_train, Y_train, s=30)
    plt.plot(X_test, mod2.predict(X_test.reshape(-1, 1)), c='green', linewidth=2)
    plt.title('The number of trees: {} '.format(sizes[i]))

Gradient boosting recovered the true dependence fairly quickly, after which it started fitting individual objects of the training set and ended up heavily overfitted.

This problem can be countered by artificially reducing the weight of each new algorithm with a step size $\eta$ (learning_rate):

$$a_M(x) = \sum_{j=0}^M \eta \gamma_j b_j(x).$$
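To see the effect of the step size numerically rather than only on plots, one can track the error on held-out data after every boosting stage. The sketch below assumes freshly generated step-function data (the helper `make_stair` and all variable names are hypothetical) and uses `staged_predict`, which yields the prediction of the composition after each stage.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)


def make_stair(n):
    """Hypothetical helper: noisy step function on [0, 2]."""
    x = rng.uniform(0, 2, size=n)
    return x.reshape(-1, 1), (x < 1.0) + rng.randn(n) * 0.1


x_fit, y_fit = make_stair(100)
x_val, y_val = make_stair(1000)

val_curves = {}
for eta in (1.0, 0.1):
    gbr = GradientBoostingRegressor(max_depth=1, learning_rate=eta,
                                    n_estimators=500, random_state=0)
    gbr.fit(x_fit, y_fit)
    # staged_predict yields the predictions of a_1, a_2, ..., a_M in turn
    val_curves[eta] = np.array([np.mean((y_val - p) ** 2)
                                for p in gbr.staged_predict(x_val)])

# For each step size: best number of stages, best validation MSE, final validation MSE
for eta, curve in val_curves.items():
    print(eta, curve.argmin() + 1, curve.min(), curve[-1])
```

Comparing the minimum of each curve with its final value gives a rough measure of how much the composition overfits as more trees are added.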

mod3 = GradientBoostingRegressor(max_depth=1, learning_rate=0.1,warm_start=True)
plt.figure(figsize=(20, 30))
for i in range(len(sizes)):
    mod3.n_estimators = sizes[i]
    mod3.fit(X_train.reshape(-1, 1), Y_train)
    plt.subplot(4, 2, i+1)
    plt.xlim([0, 2])
    plt.scatter(X_train, Y_train, s=30)
    plt.plot(X_test, mod3.predict(X_test.reshape(-1, 1)), c='gray', linewidth=2)
    plt.title('The number of trees: {} '.format(sizes[i]))

Task 1.¶

Compare with the random forest algorithm.

from sklearn.ensemble import RandomForestRegressor
mod4 = RandomForestRegressor(warm_start=True)
plt.figure(figsize=(20, 30))
for i in range(len(sizes)):
    mod4.n_estimators = sizes[i]
    mod4.fit(X_train.reshape(-1, 1), Y_train)
    plt.subplot(4, 2, i+1)
    plt.xlim([0, 2])
    plt.scatter(X_train, Y_train, s=30)
    plt.plot(X_test, mod4.predict(X_test.reshape(-1, 1)), c='green', linewidth=2)
    plt.title('The number of trees: {} '.format(sizes[i]))

from sklearn.ensemble import RandomForestRegressor
mod5 = RandomForestRegressor(warm_start=True, max_depth=1)
plt.figure(figsize=(20, 30))
for i in range(len(sizes)):
    mod5.n_estimators = sizes[i]
    mod5.fit(X_train.reshape(-1, 1), Y_train)
    plt.subplot(4, 2, i+1)
    plt.xlim([0, 2])
    plt.scatter(X_train, Y_train, s=30)
    plt.plot(X_test, mod5.predict(X_test.reshape(-1, 1)), c='green', linewidth=2)
    plt.title('The number of trees: {} '.format(sizes[i]))

Task 2.¶

Check the results obtained above on real data. Use the boston dataset and split it at random into 406 training objects and 100 test objects. Plot the root-mean-square error on the training and test sets against the number of trees (up to 2000 trees in steps of 100, with warm_start=True) for the following algorithms:

BaggingRegressor

GradientBoostingRegressor (learning_rate=1.0)

GradientBoostingRegressor (learning_rate=0.1)

Random forest
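The error curves asked for here can be collected with one small helper. The sketch below is an illustration only: the helper name `rmse_curve` is hypothetical, and synthetic data from `make_regression` stands in for the Boston dataset so that the example is self-contained.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def rmse_curve(model, sizes, x_tr, y_tr, x_te, y_te):
    """Grow a warm_start ensemble to each size and record train/test RMSE."""
    train_rmse, test_rmse = [], []
    for n in sizes:
        model.set_params(n_estimators=n)
        model.fit(x_tr, y_tr)  # with warm_start=True only the missing trees are added
        train_rmse.append(np.sqrt(mean_squared_error(y_tr, model.predict(x_tr))))
        test_rmse.append(np.sqrt(mean_squared_error(y_te, model.predict(x_te))))
    return train_rmse, test_rmse


# Synthetic stand-in data (an assumption of this sketch, not the Boston dataset)
x_all, y_all = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=0)
x_a, x_b, y_a, y_b = train_test_split(x_all, y_all, test_size=100, random_state=7)

demo_sizes = [1, 11, 21]  # kept short here; the task uses sizes up to 2000
forest = RandomForestRegressor(max_depth=3, warm_start=True, random_state=0)
tr_curve, te_curve = rmse_curve(forest, demo_sizes, x_a, y_a, x_b, y_b)
print(tr_curve, te_curve)
```

The same helper works for BaggingRegressor and GradientBoostingRegressor, since both accept n_estimators and warm_start.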

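# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this cell needs an older version
from sklearn.datasets import load_boston
data = load_boston()
print(data.DESCR)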

Boston House Prices dataset
===========================

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
**References**

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
   - many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)

from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
X_tr, X_test, y_tr, y_test = train_test_split(data.data, data.target, test_size=100, random_state=7)
X_tr.shape

(406, 13)

Bagging regressor¶

mod6 = BaggingRegressor(DecisionTreeRegressor(max_depth=1),warm_start=True)
mod6_1 = BaggingRegressor(DecisionTreeRegressor(max_depth=3),warm_start=True)
sizes = range(1, 2000, 100)
train_errors = []
test_errors = []
train_errors_1 = []
test_errors_1 = []
for i in range(len(sizes)):
    mod6.n_estimators = sizes[i]
    mod6.fit(X_tr, y_tr)
    mod6_1.n_estimators = sizes[i]
    mod6_1.fit(X_tr, y_tr)
    # score() returns R^2, so 1 - R^2 is used here as the error measure
    test_errors.append(1-mod6.score(X_test, y_test))
    train_errors.append(1-mod6.score(X_tr, y_tr))
    test_errors_1.append(1-mod6_1.score(X_test, y_test))
    train_errors_1.append(1-mod6_1.score(X_tr, y_tr))

plt.figure(figsize=(20, 30))
plt.subplot(4, 2, 1)
plt.plot(sizes, train_errors, 'o-b', label=r"$Train \ errors$")
plt.plot(sizes, test_errors, 'o-r', label=r"$Test \ errors$")
plt.xlabel("$Trees$", size=15)
plt.ylabel("$Error$", size=15)
plt.title(r"$Bagging \ regressor, \ max\_depth=1$", size=20)
plt.legend(loc='best', fontsize=15)
plt.subplot(4, 2, 2)
plt.plot(sizes, train_errors_1, 'o-b', label=r"$Train \ errors$")
plt.plot(sizes, test_errors_1, 'o-r', label=r"$Test \ errors$")
plt.xlabel("$Trees$", size=15)
plt.ylabel("$Error$", size=15)
plt.title(r"$Bagging \ regressor, \ max\_depth=3$", size=20)
plt.legend(loc='best', fontsize=15)

Gradient boosting regressor, lr=1.0¶

mod7 = GradientBoostingRegressor(max_depth=1, learning_rate=1,warm_start=True)
mod7_1 = GradientBoostingRegressor(max_depth=3, learning_rate=1,warm_start=True)
train_errors = []
test_errors = []
train_errors_1 = []
test_errors_1 = []
for i in range(len(sizes)):
    mod7.n_estimators = sizes[i]
    mod7.fit(X_tr, y_tr)
    mod7_1.n_estimators = sizes[i]
    mod7_1.fit(X_tr, y_tr)
    test_errors.append(1-mod7.score(X_test, y_test))
    train_errors.append(1-mod7.score(X_tr, y_tr))
    test_errors_1.append(1-mod7_1.score(X_test, y_test))
    train_errors_1.append(1-mod7_1.score(X_tr, y_tr))

plt.figure(figsize=(20, 30))
plt.subplot(4, 2, 1)
plt.plot(sizes, train_errors, 'o-b', label=r"$Train \ errors$")
plt.plot(sizes, test_errors, 'o-r', label=r"$Test \ errors$")
plt.xlabel("$Trees$", size=15)
plt.ylabel("$Error$", size=15)
plt.title(r"$Gradient \ boosting, \ max\_depth=1, \ lr=1.0$", size=20)
plt.legend(loc='best', fontsize=15)
plt.subplot(4, 2, 2)
plt.plot(sizes, train_errors_1, 'o-b', label=r"$Train \ errors$")
plt.plot(sizes, test_errors_1, 'o-r', label=r"$Test \ errors$")
plt.xlabel("$Trees$", size=15)
plt.ylabel("$Error$", size=15)
plt.title(r"$Gradient \ boosting, \ max\_depth=3, \ lr=1.0$", size=20)
plt.legend(loc='best', fontsize=15)

Gradient boosting regressor, lr=0.1¶

mod8 = GradientBoostingRegressor(max_depth=1, learning_rate=0.1,warm_start=True)
mod8_1 = GradientBoostingRegressor(max_depth=3, learning_rate=0.1,warm_start=True)
train_errors = []
test_errors = []
train_errors_1 = []
test_errors_1 = []
for i in range(len(sizes)):
    mod8.n_estimators = sizes[i]
    mod8.fit(X_tr, y_tr)
    mod8_1.n_estimators = sizes[i]
    mod8_1.fit(X_tr, y_tr)
    test_errors.append(1-mod8.score(X_test, y_test))
    train_errors.append(1-mod8.score(X_tr, y_tr))
    test_errors_1.append(1-mod8_1.score(X_test, y_test))
    train_errors_1.append(1-mod8_1.score(X_tr, y_tr))

plt.figure(figsize=(20, 30))
plt.subplot(4, 2, 1)
plt.plot(sizes, train_errors, 'o-b', label=r"$Train \ errors$")
plt.plot(sizes, test_errors, 'o-r', label=r"$Test \ errors$")
plt.xlabel("$Trees$", size=15)
plt.ylabel("$Error$", size=15)
plt.title(r"$Gradient \ boosting, \ max\_depth=1, \ lr=0.1$", size=20)
plt.legend(loc='best', fontsize=15)
plt.subplot(4, 2, 2)
plt.plot(sizes, train_errors_1, 'o-b', label=r"$Train \ errors$")
plt.plot(sizes, test_errors_1, 'o-r', label=r"$Test \ errors$")
plt.xlabel("$Trees$", size=15)
plt.ylabel("$Error$", size=15)
plt.title(r"$Gradient \ boosting, \ max\_depth=3, \ lr=0.1$", size=20)
plt.legend(loc='best', fontsize=15)

Random forest regressor¶

mod9 = RandomForestRegressor(warm_start=True, max_depth=1)
mod9_1 = RandomForestRegressor(warm_start=True, max_depth=3)
train_errors = []
test_errors = []
train_errors_1 = []
test_errors_1 = []
for i in range(len(sizes)):
    mod9.n_estimators = sizes[i]
    mod9.fit(X_tr, y_tr)
    mod9_1.n_estimators = sizes[i]
    mod9_1.fit(X_tr, y_tr)
    test_errors.append(1-mod9.score(X_test, y_test))
    train_errors.append(1-mod9.score(X_tr, y_tr))
    test_errors_1.append(1-mod9_1.score(X_test, y_test))
    train_errors_1.append(1-mod9_1.score(X_tr, y_tr))

plt.figure(figsize=(20, 30))
plt.subplot(4, 2, 1)
plt.plot(sizes, train_errors, 'o-b', label=r"$Train \ errors$")
plt.plot(sizes, test_errors, 'o-r', label=r"$Test \ errors$")
plt.xlabel("$Trees$", size=15)
plt.ylabel("$Error$", size=15)
plt.title(r"$Random \ forest \ regressor, \ max\_depth=1$", size=20)
plt.legend(loc='best', fontsize=15)
plt.subplot(4, 2, 2)
plt.plot(sizes, train_errors_1, 'o-b', label=r"$Train \ errors$")
plt.plot(sizes, test_errors_1, 'o-r', label=r"$Test \ errors$")
plt.xlabel("$Trees$", size=15)
plt.ylabel("$Error$", size=15)
plt.title(r"$Random \ forest \ regressor, \ max\_depth=3$", size=20)
plt.legend(loc='best', fontsize=15)

Task 3. Bias and variance of learning methods¶

Let us study the variance and bias of different learning methods numerically. The data are generated as follows: x is drawn from a one-dimensional distribution (consider three different ones: normal, np.random.normal(0, 0.3); exponential, np.random.exponential(0.3); uniform, np.random.uniform(0, 1)), and y is the sum of f(x) = cos(2 pi x) and random noise (uniform on [-0.2, 0.2]).

Take a learning method (consider four of them: "LinearRegression", "DecisionTree", "RandomForest", "GradientBoosting").

1) Train the model with default parameters on the generated sample (100 objects). Plot the sample objects and the fitted regression curve on one graph.

2) Consider an ensemble of trained models. Generate N_samples=25 random samples, train a model on each, and plot all samples and all regression curves on one graph (semi-transparent). Compute the mean curve F(x) as the arithmetic mean over the ensemble of regression curves and plot it as well. Plot the true response curve.

3) Compute the bias of the learning method. The bias is the mean squared difference between the vector of mean answers and the vector of mean predictions:

$$ Bias = E_{x,y} \big( (\mathbb{E}[y|x] - \mathbb{E}_X [\mu(X)])^2 \big). $$

To compute this expectation, generate a sample of size N_objects=100. Compute the mean answer $\mathbb{E}[y|x]$ at object x as f(x) plus the mean noise over the N_objects sample (for this you can generate N_objects noise components separately). The mean prediction $\mathbb{E}_X [\mu(X)]$ at object x is the value of the function F(x) built in step 2. The expectation is replaced by the arithmetic mean over the N_objects sample.

4) Compute the variance of the learning method. The variance is the mean squared deviation of the predictions of the algorithms produced by the method $\mu$ at object x from the mean prediction $\mathbb{E}_X [\mu(X)]$:

$$ Variance = E_{x,y} \Big( E_{X} \big[ (\mathbb{E}_X [\mu(X)] - \mu(X))^2 \big] \Big). $$

To compute it, on the N_objects sample take the arithmetic mean over the N_samples algorithms (see step 2) for each of the N_objects objects.

5) Present the results as a table and analyze them: which method gives the smallest variance, which gives the smallest bias, and why? How are bias and variance related to the distribution of the feature x?
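One way to turn the two formulas above into code is to evaluate every trained model on the same fixed set of objects, so that predictions of different models refer to the same x. The sketch below is an illustration under stated assumptions: the helper names `bias_variance` and `uniform_x` are hypothetical, only the uniform distribution and two of the four methods are shown, and $\mathbb{E}[y|x]$ is taken to be f(x) because the noise has zero mean.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor


def bias_variance(make_model, sample_x, n_samples=25, n_objects=100, seed=0):
    """Estimate squared bias and variance on a fixed evaluation set."""
    rng = np.random.RandomState(seed)
    # Fixed evaluation objects drawn from the same distribution as the training x
    x_eval = np.sort(sample_x(rng, n_objects)).reshape(-1, 1)
    f_eval = np.cos(2 * np.pi * x_eval).ravel()  # E[y|x], assuming zero-mean noise
    preds = np.empty((n_samples, n_objects))
    for k in range(n_samples):
        # A fresh training sample for every model
        x = sample_x(rng, n_objects).reshape(-1, 1)
        y = np.cos(2 * np.pi * x).ravel() + rng.uniform(-0.2, 0.2, size=n_objects)
        preds[k] = make_model().fit(x, y).predict(x_eval)
    mean_pred = preds.mean(axis=0)             # F(x), the mean prediction
    bias = np.mean((f_eval - mean_pred) ** 2)  # squared bias averaged over x
    variance = np.mean(preds.var(axis=0))      # spread around F(x), averaged over x
    return bias, variance


def uniform_x(rng, n):
    return rng.uniform(0, 1, size=n)


lin_bias, lin_var = bias_variance(LinearRegression, uniform_x)
tree_bias, tree_var = bias_variance(lambda: DecisionTreeRegressor(random_state=0), uniform_x)
print(lin_bias, lin_var)
print(tree_bias, tree_var)
```

Other distributions and methods can be plugged in by passing a different `sample_x` function or model factory.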

x_1 = np.sort(np.random.normal(0, 0.3, size=100)).reshape(100, 1)
x_2 = np.sort(np.random.exponential(0.3, size=100)).reshape(100, 1)
x_3 = np.sort(np.random.uniform(0, 1, size=100)).reshape(100, 1)
b = np.random.uniform(-0.2, 0.2, size=100).reshape(100, 1)
y_1 = (np.cos(2*np.pi*x_1) + b).reshape(100)
y_2 = (np.cos(2*np.pi*x_2) + b).reshape(100)
y_3 = (np.cos(2*np.pi*x_3) + b).reshape(100)

def plotfunc(model1):
    # For each distribution of x: a sampler and the global sorted sample
    # that serves as the x axis of the mean curve
    setups = [
        (lambda: np.random.normal(0, 0.3, size=100), x_1),
        (lambda: np.random.exponential(0.3, size=100), x_2),
        (lambda: np.random.uniform(0, 1, size=100), x_3),
    ]
    plt.figure(figsize=(20, 30))
    means = []
    for k, (sample_x, x_axis) in enumerate(setups):
        plt.subplot(4, 2, k + 1)
        s = np.zeros(100)
        for i in range(25):
            x = np.sort(sample_x()).reshape(100, 1)
            b = np.random.uniform(-0.2, 0.2, size=100).reshape(100, 1)
            y = (np.cos(2*np.pi*x) + b).reshape(100)
            model1.fit(x, y)
            pred = model1.predict(x)
            plt.scatter(x, y, alpha=0.5)
            plt.plot(x, pred, color='red', alpha=0.5)
            # Predictions are accumulated by the rank of the sorted x, not on a common grid
            s += pred
        means.append(s/25)
        # Mean curve over the 25 models
        plt.plot(x_axis, means[-1], 'g', linewidth=3)
    plt.show()
    return tuple(means)

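def mf():
    # Averages the noisy answers over 25 iterations.
    # Note: fresh samples x_1_t, x_2_t, x_3_t and b_t are drawn in the loop, but the targets
    # are built from the global x_1, x_2, x_3 and b, so every iteration adds the same
    # vectors and the function returns the global y_1, y_2, y_3.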
    y_1_s = np.array([0])
    y_2_s = np.array([0])
    y_3_s = np.array([0])
    for i in range(25):
        x_1_t = np.sort(np.random.normal(0, 0.3, size=100)).reshape(100, 1)
        x_2_t = np.sort(np.random.exponential(0.3, size=100)).reshape(100, 1)
        x_3_t = np.sort(np.random.uniform(0, 1, size=100)).reshape(100, 1)
        b_t = np.random.uniform(-0.2, 0.2, size=100).reshape(100, 1)
        y_1_t = (np.cos(2*np.pi*x_1) + b).reshape(100)
        y_2_t = (np.cos(2*np.pi*x_2) + b).reshape(100)
        y_3_t = (np.cos(2*np.pi*x_3) + b).reshape(100)
        
        if y_1_s.all() == 0:
            y_1_s = y_1_t
        else:
            y_1_s += y_1_t
        if y_2_s.all() == 0:
            y_2_s = y_2_t
        else:
            y_2_s += y_2_t
        if y_3_s.all() == 0:
            y_3_s = y_3_t
        else:
            y_3_s += y_3_t
    return y_1_s/25, y_2_s/25, y_3_s/25

Linear regression¶

from sklearn.linear_model import LinearRegression
model1 = LinearRegression()
model2 = DecisionTreeRegressor()
model3 = RandomForestRegressor()
model4 = GradientBoostingRegressor()

1)¶

plt.figure(figsize=(20,30))
model1.fit(x_1, y_1)
plt.subplot(4, 2, 1)
plt.scatter(x_1, y_1)
plt.plot(x_1, model1.predict(x_1), color='blue')

model1.fit(x_2, y_2)
plt.subplot(4, 2, 2)
plt.scatter(x_2, y_2)
plt.plot(x_2, model1.predict(x_2), color='blue')

model1.fit(x_3, y_3)
plt.subplot(4, 2, 3)
plt.scatter(x_3, y_3)
plt.plot(x_3, model1.predict(x_3), color='blue')
plt.show()

2)¶

s1, s2, s3=plotfunc(model1)

3)¶

t_y1, t_y2, t_y3 = mf()
bias1_1 = np.mean(((t_y1 - s1)**2))
bias1_2 = np.mean(((t_y2 - s2)**2))
bias1_3 = np.mean(((t_y3 - s3)**2))
print(bias1_1, bias1_2, bias1_3)

0.482673080222 0.468019512762 0.569561742236

4)¶

var1_1 = np.var(s1)
var1_2 = np.var(s2)
var1_3 = np.var(s3)
print(var1_1, var1_2, var1_3)

5.38218982636e-05 0.12828607186 5.74140635415e-06

Decision tree¶

1)¶

plt.figure(figsize=(20,30))
model2.fit(x_1, y_1)
plt.subplot(4, 2, 1)
plt.scatter(x_1, y_1)
plt.plot(x_1, model2.predict(x_1), color='red')

model2.fit(x_2, y_2)
plt.subplot(4, 2, 2)
plt.scatter(x_2, y_2)
plt.plot(x_2, model2.predict(x_2), color='red')

model2.fit(x_3, y_3)
plt.subplot(4, 2, 3)
plt.scatter(x_3, y_3)
plt.plot(x_3, model2.predict(x_3), color='red')
plt.show()

2)¶

s1, s2, s3=plotfunc(model2)

3)¶

t_y1, t_y2, t_y3 = mf()
bias2_1 = np.mean(((t_y1 - s1)**2))
bias2_2 = np.mean(((t_y2 - s2)**2))
bias2_3 = np.mean(((t_y3 - s3)**2))
print(bias2_1, bias2_2, bias2_3)

0.0415791501162 0.110103819265 0.0453014334807

4)¶

var2_1 = np.var(s1)
var2_2 = np.var(s2)
var2_3 = np.var(s3)
print(var2_1, var2_2, var2_3)

0.434807701606 0.446868243727 0.48955216351

Random forest¶

1)¶

plt.figure(figsize=(20,30))
model3.fit(x_1, y_1)
plt.subplot(4, 2, 1)
plt.scatter(x_1, y_1)
plt.plot(x_1, model3.predict(x_1), color='red')

model3.fit(x_2, y_2)
plt.subplot(4, 2, 2)
plt.scatter(x_2, y_2)
plt.plot(x_2, model3.predict(x_2), color='red')

model3.fit(x_3, y_3)
plt.subplot(4, 2, 3)
plt.scatter(x_3, y_3)
plt.plot(x_3, model3.predict(x_3), color='red')
plt.show()

2)¶

s1, s2, s3 = plotfunc(model3)

3)¶

t_y1, t_y2, t_y3 = mf()
bias3_1 = np.mean(((t_y1 - s1)**2))
bias3_2 = np.mean(((t_y2 - s2)**2))
bias3_3 = np.mean(((t_y3 - s3)**2))
print(bias3_1, bias3_2, bias3_3)

0.050326669053 0.114417895936 0.0587126336738

4)¶

var3_1 = np.var(s1)
var3_2 = np.var(s2)
var3_3 = np.var(s3)
print(var3_1, var3_2, var3_3)

0.431074023184 0.423468439936 0.469122132341

Gradient boosting¶

1)¶

plt.figure(figsize=(20,30))
model4.fit(x_1, y_1)
plt.subplot(4, 2, 1)
plt.scatter(x_1, y_1)
plt.plot(x_1, model4.predict(x_1), color='red')

model4.fit(x_2, y_2)
plt.subplot(4, 2, 2)
plt.scatter(x_2, y_2)
plt.plot(x_2, model4.predict(x_2), color='red')

model4.fit(x_3, y_3)
plt.subplot(4, 2, 3)
plt.scatter(x_3, y_3)
plt.plot(x_3, model4.predict(x_3), color='red')
plt.show()

2)¶

s1, s2, s3 = plotfunc(model4)

3)¶

t_y1, t_y2, t_y3 = mf()
bias4_1 = np.mean(((t_y1 - s1)**2))
bias4_2 = np.mean(((t_y2 - s2)**2))
bias4_3 = np.mean(((t_y3 - s3)**2))
print(bias4_1, bias4_2, bias4_3)

0.0459675591723 0.0931393199501 0.0618913212458

4)¶

var4_1 = np.var(s1)
var4_2 = np.var(s2)
var4_3 = np.var(s3)
print(var4_1, var4_2, var4_3)

0.429572978357 0.408253132101 0.46541616365

biases = [bias1_1, bias1_2, bias1_3, bias2_1, bias2_2, bias2_3, bias3_1, bias3_2, bias3_3, bias4_1, bias4_2, bias4_3]
variances = [var1_1, var1_2, var1_3, var2_1, var2_2, var2_3, var3_1, var3_2, var3_3, var4_1, var4_2, var4_3]

import pandas as pd
bias = pd.DataFrame(np.reshape(biases, (4,3)),columns=["normal(0, 0.3)", "exponential(0.3)", "uniform(0, 1)"],
                      index=["LinearRegression", "DecisionTree", "RandomForest", "GradientBoosting"])
variance = pd.DataFrame(np.reshape(variances, (4,3)), columns=bias.columns, index=bias.index)

bias

variance

Task 4. Naive Bayes classifier¶

Given the sample (x1, x2, y): (1,1,0); (0,1,0); (0,1,0); (0,1,0); (1,0,0);

(0,0,1); (1,0,1); (1,0,1); (1,0,1); (0,0,1)

Using a naive Bayes classifier, estimate the posterior probabilities Pr(y | x) for (a) x1 = 1, x2 = 0; (b) x1 = 0, x2 = 1.

(a) $P(y=0 \mid x_1=1, x_2=0) = \displaystyle\frac{2}{17}$, $P(y=1 \mid x_1=1, x_2=0) = \displaystyle\frac{15}{17}$

(b) $P(y=0 \mid x_1=0, x_2=1) = 1$, $P(y=1 \mid x_1=0, x_2=1) = 0$
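The posteriors above can be reproduced with exact fractions. The sketch below is a minimal illustration: it estimates the class prior and the per-feature conditional probabilities by relative frequencies, without smoothing, and normalizes the products; the helper name `naive_bayes_posterior` is hypothetical.

```python
from fractions import Fraction

# The sample from the task as (x1, x2, y) triples
data = [(1, 1, 0), (0, 1, 0), (0, 1, 0), (0, 1, 0), (1, 0, 0),
        (0, 0, 1), (1, 0, 1), (1, 0, 1), (1, 0, 1), (0, 0, 1)]


def naive_bayes_posterior(rows, x1, x2):
    """Posterior P(y | x1, x2) from frequency estimates, without smoothing."""
    scores = {}
    for label in sorted({r[2] for r in rows}):
        cls = [r for r in rows if r[2] == label]
        prior = Fraction(len(cls), len(rows))                    # P(y)
        p_x1 = Fraction(sum(r[0] == x1 for r in cls), len(cls))  # P(x1 | y)
        p_x2 = Fraction(sum(r[1] == x2 for r in cls), len(cls))  # P(x2 | y)
        scores[label] = prior * p_x1 * p_x2
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


post_a = naive_bayes_posterior(data, 1, 0)
post_b = naive_bayes_posterior(data, 0, 1)
print(post_a)  # {0: Fraction(2, 17), 1: Fraction(15, 17)}
print(post_b)  # {0: Fraction(1, 1), 1: Fraction(0, 1)}
```

In case (b) the posterior for y = 1 is exactly zero because x2 = 1 never occurs together with y = 1 in the sample; with additive smoothing it would be small but positive.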

Output of bias (rows: learning methods, columns: distributions of x):

                  normal(0, 0.3)  exponential(0.3)  uniform(0, 1)
LinearRegression        0.482673          0.468020       0.569562
DecisionTree            0.041579          0.110104       0.045301
RandomForest            0.050327          0.114418       0.058713
GradientBoosting        0.045968          0.093139       0.061891

Output of variance (rows: learning methods, columns: distributions of x):

                  normal(0, 0.3)  exponential(0.3)  uniform(0, 1)
LinearRegression        0.000054          0.128286       0.000006
DecisionTree            0.434808          0.446868       0.489552
RandomForest            0.431074          0.423468       0.469122
GradientBoosting        0.429573          0.408253       0.465416