Path: Econometrics/Spring 2018 / Correlation_EDA.ipynb

Kernel: Python 3 (Ubuntu Linux)

Correlation analysis

In [3]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [2]:

data = pd.read_csv("telecom-churn.csv")

Let's test the association between international plan and churn using the chi-square test of independence.

In [4]:

from scipy.stats import chi2_contingency

In [5]:

chi2_contingency(pd.crosstab(data['international plan'],
                             data['churn']))

(222.56575664993761,
 2.4931077033159556e-50,
 1,
 array([[ 2573.80738074,   436.19261926],
        [  276.19261926,    46.80738074]]))

The association is confirmed: the p-value is about 2.5e-50, so the hypothesis of independence is rejected.
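As a hedged illustration of how to read this output, the sketch below rebuilds the 2x2 table from the counts shown in the crosstab further down and unpacks the four returned values. The names `observed` and `alpha` are illustrative and not part of the original notebook. Note that for a 2x2 table SciPy applies Yates' continuity correction by default (`correction=True`).

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts copied from the crosstab of international plan (rows) vs churn (columns).
observed = np.array([[2664, 346],
                     [186, 137]])

# The result unpacks into: statistic, p-value, degrees of freedom, expected counts.
chi2, p_value, dof, expected = chi2_contingency(observed)

alpha = 0.05  # illustrative significance level (an assumption)
print(f"chi2 = {chi2:.4f}, dof = {dof}, p-value = {p_value:.3g}")
print("reject independence" if p_value < alpha else "cannot reject independence")
```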

We can also compute the association and contingency coefficients using the formulas from the course handout.

In [6]:

ct = pd.crosstab(data['international plan'], data['churn'])
ct

churn	False	True
international plan
no	2664	346
yes	186	137

In [8]:

a, b, c, d = ct.iloc[0, 0], ct.iloc[0, 1], ct.iloc[1, 0], ct.iloc[1, 1]

In [9]:

assoc = (a * d - b * c) / (a * d + b * c)
assoc

0.7001984515191324

In [11]:

cont = (a * d - b * c) / np.sqrt((a+b)*(b+d)*(d+c)*(a+c))
cont

0.25985184734548217
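The two formulas can be wrapped in a small helper. This is a sketch, not the handout's code: the function name `association_coefficients` is hypothetical, and reading the association coefficient as Yule's Q and the contingency coefficient as the phi coefficient is an assumption based on the formulas above. The last lines check the identity phi² = chi2 / n for 2x2 tables, which holds when the chi-square statistic is computed without Yates' correction.

```python
import numpy as np
from scipy.stats import chi2_contingency

def association_coefficients(table):
    """Return (Q, phi) for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = np.asarray(table, dtype=float)  # floats avoid integer overflow
    q = (a * d - b * c) / (a * d + b * c)
    phi = (a * d - b * c) / np.sqrt((a + b) * (b + d) * (c + d) * (a + c))
    return q, phi

table = [[2664, 346], [186, 137]]  # counts from the crosstab above
q, phi = association_coefficients(table)

# For a 2x2 table, phi**2 equals chi2 / n when no continuity correction is applied.
chi2_uncorrected = chi2_contingency(table, correction=False)[0]
n = np.sum(table)
print(q, phi, chi2_uncorrected / n)
```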

Let's test the association between total day minutes and total night minutes using the Pearson correlation coefficient.

In [12]:

from scipy.stats import pearsonr
r, p_value = pearsonr(data['total day minutes'], 
                      data['total night minutes'])

In [13]:

r

0.004323366578518016

In [14]:

p_value

0.80297036983069459

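There is no evidence of correlation: r is close to zero, and with p ≈ 0.80 the null hypothesis of zero correlation cannot be rejected.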
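To see where this p-value comes from, the sketch below recomputes it from r alone. Under the null hypothesis of zero correlation (and the usual normality assumption), t = r * sqrt((n - 2) / (1 - r²)) follows a Student t distribution with n - 2 degrees of freedom. The sample size n = 3333 is taken from the total of the crosstab counts and assumes no rows are missing for these two columns.

```python
import numpy as np
from scipy.stats import t as student_t

r = 0.004323366578518016  # Pearson r reported above
n = 3333                  # assumed sample size (2664 + 346 + 186 + 137)

# t statistic for testing H0: correlation = 0
t_stat = r * np.sqrt((n - 2) / (1 - r**2))

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_two_sided = 2 * student_t.sf(abs(t_stat), df=n - 2)
print(t_stat, p_two_sided)
```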

Let's test the association between account length and churn. Since churn is binary and account length is numeric, we use the point-biserial correlation coefficient.

In [16]:

from scipy.stats import pointbiserialr

In [17]:

pointbiserialr(data['churn'], data['account length'])

PointbiserialrResult(correlation=0.016540742243674137, pvalue=0.33976000705691278)

We do not have sufficient grounds to reject the null hypothesis of no association (p ≈ 0.34).
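The point-biserial coefficient is numerically the Pearson correlation with the binary variable coded as 0/1. A quick check on synthetic data (the arrays `group` and `value` are made up for illustration and are not the telecom data):

```python
import numpy as np
from scipy.stats import pearsonr, pointbiserialr

rng = np.random.default_rng(0)
group = rng.integers(0, 2, size=200)        # synthetic binary variable (0/1)
value = rng.normal(size=200) + 0.3 * group  # synthetic continuous variable

r_pb, p_pb = pointbiserialr(group, value)
r_p, p_p = pearsonr(group, value)
print(r_pb, r_p)  # the two coefficients coincide
```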

Let's test whether the account length feature is normally distributed.

In [18]:

from scipy.stats import normaltest

In [19]:

normaltest(data['account length'])

NormaltestResult(statistic=6.8844259831139638, pvalue=0.031993804907765627)

The p-value (about 0.032) is below 0.05 but above 0.01. At the 5% level we therefore reject the null hypothesis that the sample comes from a normal distribution; at the 1% level we cannot reject it.
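`normaltest` is the D'Agostino-Pearson omnibus test; SciPy computes its p-value from a chi-square distribution with 2 degrees of freedom. The sketch below reproduces the p-value from the reported statistic and shows the decision at two illustrative significance levels (the choice of 0.05 and 0.01 is an assumption, not something the notebook fixes).

```python
from scipy.stats import chi2

statistic = 6.8844259831139638  # K^2 statistic reported by normaltest above
p_value = chi2.sf(statistic, df=2)

for alpha in (0.05, 0.01):
    decision = "reject" if p_value < alpha else "cannot reject"
    print(f"alpha = {alpha}: {decision} normality (p = {p_value:.4f})")
```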

Let's try other tests.

In [20]:

from scipy.stats import shapiro, anderson, kstest

In [21]:

shapiro(data['account length'])

(0.9982772469520569, 0.0011495520593598485)

In [23]:

kstest(data['account length'], 'norm')  # 'norm' without args means the standard normal N(0, 1)

KstestResult(statistic=0.99594983194136721, pvalue=0.0)
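A caveat on the call above: `kstest(x, 'norm')` with no extra arguments compares the sample with the standard normal N(0, 1), so any variable on a different scale yields a statistic near 1 regardless of its shape. One common workaround is to pass the sample mean and standard deviation through `args`. The sketch uses synthetic data (location 100 and scale 40 are arbitrary illustrative values). Estimating the parameters from the same sample makes the nominal p-value too lenient toward normality, which the Lilliefors variant of the test corrects for.

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(0)
sample = rng.normal(loc=100, scale=40, size=3000)  # synthetic, not the telecom data

# Compared against N(0, 1): rejected only because of location and scale.
stat_std, p_std = kstest(sample, 'norm')

# Compared against a normal with the sample's own mean and standard deviation.
stat_fit, p_fit = kstest(sample, 'norm', args=(sample.mean(), sample.std(ddof=1)))
print(stat_std, stat_fit)
```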

In [24]:

anderson(data['account length'], 'norm')

AndersonResult(statistic=0.42615789312776542, critical_values=array([ 0.575,  0.655,  0.786,  0.917,  1.091]), significance_level=array([ 15. ,  10. ,   5. ,   2.5,   1. ]))

In [25]:

anderson(np.random.normal(size=3000))

AndersonResult(statistic=0.53644590843168771, critical_values=array([ 0.575,  0.655,  0.786,  0.917,  1.091]), significance_level=array([ 15. ,  10. ,   5. ,   2.5,   1. ]))
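`anderson` returns no p-value; the statistic is compared with the critical value at each significance level, and normality is rejected at a level when the statistic exceeds that level's critical value. A small sketch using the numbers printed above for account length:

```python
import numpy as np

statistic = 0.42615789312776542  # Anderson-Darling statistic for account length
critical_values = np.array([0.575, 0.655, 0.786, 0.917, 1.091])
significance_levels = np.array([15.0, 10.0, 5.0, 2.5, 1.0])  # in percent

# Normality is rejected at a given level if the statistic exceeds its critical value.
rejected = statistic > critical_values
for level, cv, rej in zip(significance_levels, critical_values, rejected):
    verdict = "reject" if rej else "cannot reject"
    print(f"{level:>4}%: critical value {cv:.3f} -> {verdict} normality")
```

Here the statistic is below every critical value, so this test does not reject normality at any listed level. The same holds for the simulated normal sample (0.536 < 0.575).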
