{ "cells": [
{ "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "# Visual Data Analysis" ] },
{ "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Import the required libraries." ] },
{ "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "import numpy as np\n", "import scipy as sp\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "import seaborn as sns" ] },
{ "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Load the dataset." ] },
{ "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "data = pd.read_csv(\"telecom-churn.csv\")" ] },
{ "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Check that everything was read and parsed correctly." ] },
{ "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr style=\"text-align: right;\"><th></th><th>state</th><th>account length</th><th>area code</th><th>phone number</th><th>international plan</th><th>voice mail plan</th><th>number vmail messages</th><th>total day minutes</th><th>total day calls</th><th>total day charge</th><th>...</th><th>total eve calls</th><th>total eve charge</th><th>total night minutes</th><th>total night calls</th><th>total night charge</th><th>total intl minutes</th><th>total intl calls</th><th>total intl charge</th><th>customer service calls</th><th>churn</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>KS</td><td>128</td><td>415</td><td>382-4657</td><td>no</td><td>yes</td><td>25</td><td>265.1</td><td>110</td><td>45.07</td><td>...</td><td>99</td><td>16.78</td><td>244.7</td><td>91</td><td>11.01</td><td>10.0</td><td>3</td><td>2.70</td><td>1</td><td>False</td></tr>\n",
"<tr><th>1</th><td>OH</td><td>107</td><td>415</td><td>371-7191</td><td>no</td><td>yes</td><td>26</td><td>161.6</td><td>123</td><td>27.47</td><td>...</td><td>103</td><td>16.62</td><td>254.4</td><td>103</td><td>11.45</td><td>13.7</td><td>3</td><td>3.70</td><td>1</td><td>False</td></tr>\n",
"<tr><th>2</th><td>NJ</td><td>137</td><td>415</td><td>358-1921</td><td>no</td><td>no</td><td>0</td><td>243.4</td><td>114</td><td>41.38</td><td>...</td><td>110</td><td>10.30</td><td>162.6</td><td>104</td><td>7.32</td><td>12.2</td><td>5</td><td>3.29</td><td>0</td><td>False</td></tr>\n",
"<tr><th>3</th><td>OH</td><td>84</td><td>408</td><td>375-9999</td><td>yes</td><td>no</td><td>0</td><td>299.4</td><td>71</td><td>50.90</td><td>...</td><td>88</td><td>5.26</td><td>196.9</td><td>89</td><td>8.86</td><td>6.6</td><td>7</td><td>1.78</td><td>2</td><td>False</td></tr>\n",
"<tr><th>4</th><td>OK</td><td>75</td><td>415</td><td>330-6626</td><td>yes</td><td>no</td><td>0</td><td>166.7</td><td>113</td><td>28.34</td><td>...</td><td>122</td><td>12.61</td><td>186.9</td><td>121</td><td>8.41</td><td>10.1</td><td>3</td><td>2.73</td><td>3</td><td>False</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>5 rows × 21 columns</p>\n",
"</div>" ] }, "execution_count": 3, "metadata": { }, "output_type": "execute_result" } ], "source": [ "data.head()" ] },
{ "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "We can get a summary and a general idea of the data types." ] },
{ "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'pandas.core.frame.DataFrame'>\n", "RangeIndex: 3333 entries, 0 to 3332\n", "Data columns (total 21 columns):\n", "state 3333 non-null object\n", "account length 3333 non-null int64\n", "area code 3333 non-null int64\n", "phone number 3333 non-null object\n", "international plan 3333 non-null object\n", "voice mail plan 3333 non-null object\n", "number vmail messages 3333 non-null int64\n", "total day minutes 3333 non-null float64\n", "total day calls 3333 non-null int64\n", "total day charge 3333 non-null float64\n", "total eve minutes 3333 non-null float64\n", "total eve calls 3333 non-null int64\n", "total eve charge 3333 non-null float64\n", "total night minutes 3333 non-null float64\n", "total night calls 3333 non-null int64\n", "total night charge 3333 non-null float64\n", "total intl minutes 3333 non-null float64\n", "total intl calls 3333 non-null int64\n", "total intl charge 3333 non-null float64\n", "customer service calls 3333 non-null int64\n", "churn 3333 non-null bool\n", "dtypes: bool(1), float64(8), int64(8), object(4)\n", "memory usage: 524.1+ KB\n" ] } ], "source": [ "data.info()" ] },
{ "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "The target variable is churn (whether the subscriber left). It is a categorical feature (binary, to be precise). Let's see how its values are distributed."
] },
{ "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "False 2850\n", "True 483\n", "Name: churn, dtype: int64" ] }, "execution_count": 5, "metadata": { }, "output_type": "execute_result" } ], "source": [ "data['churn'].value_counts()" ] },
{ "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "We see that 2850 of the 3333 subscribers are loyal. What share of the total is that?" ] },
{ "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "False 0.855086\n", "True 0.144914\n", "Name: churn, dtype: float64" ] }, "execution_count": 6, "metadata": { }, "output_type": "execute_result" } ], "source": [ "data['churn'].value_counts(normalize=True)" ] },
{ "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Let's visualize this." ] },
{ "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "image/png": "eb58c1303680f70e0c840b8024b3cd7c2294f3a2" }, "output_type": "display_data" } ], "source": [ "data['churn'].value_counts(normalize=True).plot(kind='bar',\n", "                                                title='The churn feature');" ] },
{ "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "We may also want to know how many of our customers have the international plan." ] },
{ "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "image/png": "1b30217f75cfba5c7c0f37dd809752b3e089062b" }, "output_type": "display_data" } ], "source": [ "data['international plan'].value_counts(normalize=True).plot(kind='bar');" ] },
{ "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "And what about the customers who left (churn == True)?"
] },
{ "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "image/png": "bf924362d8de0d0b916bc5413e923970540953ce" }, "output_type": "display_data" } ], "source": [ "churn_users = data[data['churn'] == True]\n", "churn_users['international plan'].value_counts(normalize=True).plot(kind='bar');" ] },
{ "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "The share of customers with the international plan is higher among those who left than in the full sample. \n", "\n", "We can hypothesize that the binary features **international plan** and **churn** are correlated. Let's now look at them together." ] },
{ "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr style=\"text-align: right;\"><th>international plan</th><th>no</th><th>yes</th><th>All</th></tr>\n",
"<tr><th>churn</th><th></th><th></th><th></th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>False</th><td>2664</td><td>186</td><td>2850</td></tr>\n",
"<tr><th>True</th><td>346</td><td>137</td><td>483</td></tr>\n",
"<tr><th>All</th><td>3010</td><td>323</td><td>3333</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>" ] }, "execution_count": 19, "metadata": { }, "output_type": "execute_result" } ], "source": [ "pd.crosstab(data['churn'], data['international plan'], margins=True)" ] },
{ "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "data": { "image/png": "5739f5954f8d9545ddac940d39ad0f001ea9422f" }, "output_type": "display_data" } ], "source": [ "sns.countplot(x='international plan', hue='churn', data=data);" ] },
{ "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "A large share of the customers with the international plan left us: 137 of 323 (about 42%), compared with 346 of 3010 (about 11%) of those without it." ] },
{ "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "international plan\n", "no 3010\n", "yes 323\n", "Name: churn, dtype: int64" ] }, "execution_count": 25, "metadata": { }, "output_type": "execute_result" } ], "source": [ "data.groupby('international plan')['churn'].count()" ] },
{ "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Let's look at the distribution of the **account length** feature." ] },
{ "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "data": { "image/png": "52816f2fd37519b286aea91df17b0cdcad6892c3" }, "output_type": "display_data" } ], "source": [ "sns.distplot(data['account length']);" ] },
{ "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "It looks roughly like a normal distribution!" ] },
{ "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "What can we say about the relationship between **account length** and churn?"
] },
{ "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "churn\n", "False 100.793684\n", "True 102.664596\n", "Name: account length, dtype: float64" ] }, "execution_count": 28, "metadata": { }, "output_type": "execute_result" } ], "source": [ "data.groupby('churn')['account length'].mean()" ] },
{ "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "churn\n", "False 39.88235\n", "True 39.46782\n", "Name: account length, dtype: float64" ] }, "execution_count": 29, "metadata": { }, "output_type": "execute_result" } ], "source": [ "data.groupby('churn')['account length'].std()" ] },
{ "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "churn\n", "False 100\n", "True 103\n", "Name: account length, dtype: int64" ] }, "execution_count": 30, "metadata": { }, "output_type": "execute_result" } ], "source": [ "data.groupby('churn')['account length'].median()" ] },
{ "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "At first glance, the two are not related: the group means, standard deviations, and medians are nearly identical." ] },
{ "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "data": { "image/png": "963212841239ba8b9f8b019138180325d0ed8199" }, "output_type": "display_data" } ], "source": [ "fig, ax = plt.subplots(1, 2, sharey=True)\n", "sns.distplot(data[data['churn'] == False]['account length'],\n", "             ax=ax[0]).set_title('Loyal');\n", "sns.distplot(churn_users['account length'],\n", "             ax=ax[1]).set_title('Churned');" ] },
{ "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "A second look shows no relationship either." ] },
{ "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Now let's check whether the durations of daytime and nighttime calls are related."
] },
{ "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [ { "data": { "image/png": "8fa7b65953d6716500a10aa52d0d816da03e9b9b" }, "output_type": "display_data" } ], "source": [ "sns.regplot(x=data['total day minutes'], y=data['total night minutes']);" ] },
{ "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "What about the number of calls?" ] },
{ "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [ { "data": { "image/png": "7f444c1709be3b358ed668456f53846126753f2d" }, "output_type": "display_data" } ], "source": [ "sns.regplot(x=data['total day calls'], y=data['total night calls']);" ] },
{ "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "So far, no relationship is visible." ] },
{ "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Let's build a correlation matrix for the numeric features." ] },
{ "cell_type": "code", "execution_count": 54, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "numeric_data = data.select_dtypes(['int64', 'float64'])" ] },
{ "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr style=\"text-align: right;\"><th></th><th>account length</th><th>area code</th><th>number vmail messages</th><th>total day minutes</th><th>total day calls</th><th>total day charge</th><th>total eve minutes</th><th>total eve calls</th><th>total eve charge</th><th>total night minutes</th><th>total night calls</th><th>total night charge</th><th>total intl minutes</th><th>total intl calls</th><th>total intl charge</th><th>customer service calls</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>128</td><td>415</td><td>25</td><td>265.1</td><td>110</td><td>45.07</td><td>197.4</td><td>99</td><td>16.78</td><td>244.7</td><td>91</td><td>11.01</td><td>10.0</td><td>3</td><td>2.70</td><td>1</td></tr>\n",
"<tr><th>1</th><td>107</td><td>415</td><td>26</td><td>161.6</td><td>123</td><td>27.47</td><td>195.5</td><td>103</td><td>16.62</td><td>254.4</td><td>103</td><td>11.45</td><td>13.7</td><td>3</td><td>3.70</td><td>1</td></tr>\n",
"<tr><th>2</th><td>137</td><td>415</td><td>0</td><td>243.4</td><td>114</td><td>41.38</td><td>121.2</td><td>110</td><td>10.30</td><td>162.6</td><td>104</td><td>7.32</td><td>12.2</td><td>5</td><td>3.29</td><td>0</td></tr>\n",
"<tr><th>3</th><td>84</td><td>408</td><td>0</td><td>299.4</td><td>71</td><td>50.90</td><td>61.9</td><td>88</td><td>5.26</td><td>196.9</td><td>89</td><td>8.86</td><td>6.6</td><td>7</td><td>1.78</td><td>2</td></tr>\n",
"<tr><th>4</th><td>75</td><td>415</td><td>0</td><td>166.7</td><td>113</td><td>28.34</td><td>148.3</td><td>122</td><td>12.61</td><td>186.9</td><td>121</td><td>8.41</td><td>10.1</td><td>3</td><td>2.73</td><td>3</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>" ] }, "execution_count": 45, "metadata": { }, "output_type": "execute_result" } ], "source": [ "numeric_data.head()" ] },
{ "cell_type": "code", "execution_count": 60, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr style=\"text-align: right;\"><th></th><th>account length</th><th>number vmail messages</th><th>total day minutes</th><th>total day calls</th><th>total day charge</th><th>total eve minutes</th><th>total eve calls</th><th>total eve charge</th><th>total night minutes</th><th>total night calls</th><th>total night charge</th><th>total intl minutes</th><th>total intl calls</th><th>total intl charge</th><th>customer service calls</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>account length</th><td>1.000000</td><td>-0.004628</td><td>0.006216</td><td>0.038470</td><td>0.006214</td><td>-0.006757</td><td>0.019260</td><td>-0.006745</td><td>-0.008955</td><td>-0.013176</td><td>-0.008960</td><td>0.009514</td><td>0.020661</td><td>0.009546</td><td>-0.003796</td></tr>\n",
"<tr><th>number vmail messages</th><td>-0.004628</td><td>1.000000</td><td>0.000778</td><td>-0.009548</td><td>0.000776</td><td>0.017562</td><td>-0.005864</td><td>0.017578</td><td>0.007681</td><td>0.007123</td><td>0.007663</td><td>0.002856</td><td>0.013957</td><td>0.002884</td><td>-0.013263</td></tr>\n",
"<tr><th>total day minutes</th><td>0.006216</td><td>0.000778</td><td>1.000000</td><td>0.006750</td><td>1.000000</td><td>0.007043</td><td>0.015769</td><td>0.007029</td><td>0.004323</td><td>0.022972</td><td>0.004300</td><td>-0.010155</td><td>0.008033</td><td>-0.010092</td><td>-0.013423</td></tr>\n",
"<tr><th>total day calls</th><td>0.038470</td><td>-0.009548</td><td>0.006750</td><td>1.000000</td><td>0.006753</td><td>-0.021451</td><td>0.006462</td><td>-0.021449</td><td>0.022938</td><td>-0.019557</td><td>0.022927</td><td>0.021565</td><td>0.004574</td><td>0.021666</td><td>-0.018942</td></tr>\n",
"<tr><th>total day charge</th><td>0.006214</td><td>0.000776</td><td>1.000000</td><td>0.006753</td><td>1.000000</td><td>0.007050</td><td>0.015769</td><td>0.007036</td><td>0.004324</td><td>0.022972</td><td>0.004301</td><td>-0.010157</td><td>0.008032</td><td>-0.010094</td><td>-0.013427</td></tr>\n",
"<tr><th>total eve minutes</th><td>-0.006757</td><td>0.017562</td><td>0.007043</td><td>-0.021451</td><td>0.007050</td><td>1.000000</td><td>-0.011430</td><td>1.000000</td><td>-0.012584</td><td>0.007586</td><td>-0.012593</td><td>-0.011035</td><td>0.002541</td><td>-0.011067</td><td>-0.012985</td></tr>\n",
"<tr><th>total eve calls</th><td>0.019260</td><td>-0.005864</td><td>0.015769</td><td>0.006462</td><td>0.015769</td><td>-0.011430</td><td>1.000000</td><td>-0.011423</td><td>-0.002093</td><td>0.007710</td><td>-0.002056</td><td>0.008703</td><td>0.017434</td><td>0.008674</td><td>0.002423</td></tr>\n",
"<tr><th>total eve charge</th><td>-0.006745</td><td>0.017578</td><td>0.007029</td><td>-0.021449</td><td>0.007036</td><td>1.000000</td><td>-0.011423</td><td>1.000000</td><td>-0.012592</td><td>0.007596</td><td>-0.012601</td><td>-0.011043</td><td>0.002541</td><td>-0.011074</td><td>-0.012987</td></tr>\n",
"<tr><th>total night minutes</th><td>-0.008955</td><td>0.007681</td><td>0.004323</td><td>0.022938</td><td>0.004324</td><td>-0.012584</td><td>-0.002093</td><td>-0.012592</td><td>1.000000</td><td>0.011204</td><td>0.999999</td><td>-0.015207</td><td>-0.012353</td><td>-0.015180</td><td>-0.009288</td></tr>\n",
"<tr><th>total night calls</th><td>-0.013176</td><td>0.007123</td><td>0.022972</td><td>-0.019557</td><td>0.022972</td><td>0.007586</td><td>0.007710</td><td>0.007596</td><td>0.011204</td><td>1.000000</td><td>0.011188</td><td>-0.013605</td><td>0.000305</td><td>-0.013630</td><td>-0.012802</td></tr>\n",
"<tr><th>total night charge</th><td>-0.008960</td><td>0.007663</td><td>0.004300</td><td>0.022927</td><td>0.004301</td><td>-0.012593</td><td>-0.002056</td><td>-0.012601</td><td>0.999999</td><td>0.011188</td><td>1.000000</td><td>-0.015214</td><td>-0.012329</td><td>-0.015186</td><td>-0.009277</td></tr>\n",
"<tr><th>total intl minutes</th><td>0.009514</td><td>0.002856</td><td>-0.010155</td><td>0.021565</td><td>-0.010157</td><td>-0.011035</td><td>0.008703</td><td>-0.011043</td><td>-0.015207</td><td>-0.013605</td><td>-0.015214</td><td>1.000000</td><td>0.032304</td><td>0.999993</td><td>-0.009640</td></tr>\n",
"<tr><th>total intl calls</th><td>0.020661</td><td>0.013957</td><td>0.008033</td><td>0.004574</td><td>0.008032</td><td>0.002541</td><td>0.017434</td><td>0.002541</td><td>-0.012353</td><td>0.000305</td><td>-0.012329</td><td>0.032304</td><td>1.000000</td><td>0.032372</td><td>-0.017561</td></tr>\n",
"<tr><th>total intl charge</th><td>0.009546</td><td>0.002884</td><td>-0.010092</td><td>0.021666</td><td>-0.010094</td><td>-0.011067</td><td>0.008674</td><td>-0.011074</td><td>-0.015180</td><td>-0.013630</td><td>-0.015186</td><td>0.999993</td><td>0.032372</td><td>1.000000</td><td>-0.009675</td></tr>\n",
"<tr><th>customer service calls</th><td>-0.003796</td><td>-0.013263</td><td>-0.013423</td><td>-0.018942</td><td>-0.013427</td><td>-0.012985</td><td>0.002423</td><td>-0.012987</td><td>-0.009288</td><td>-0.012802</td><td>-0.009277</td><td>-0.009640</td><td>-0.017561</td><td>-0.009675</td><td>1.000000</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>" ] }, "execution_count": 60, "metadata": { }, "output_type": "execute_result" } ], "source": [ "corr_matrix = numeric_data.drop('area code', axis=1).corr()\n", "corr_matrix" ] },
{ "cell_type": "code", "execution_count": 62, "metadata": { "collapsed": false }, "outputs": [ { "data": { "image/png": "c6aa458fc6fdd72653818dc98b9a432ae6bd0cb6" }, "output_type": "display_data" } ], "source": [ "sns.heatmap(corr_matrix);" ] },
{ "cell_type": "code", "execution_count": 58, "metadata": { "collapsed": false }, "outputs": [ { "data": { "image/png": "2601088f3e858ef8830a07e5b3e531d36c72eeb4" }, "output_type": "display_data" } ], "source": [ "sns.pairplot(numeric_data[['total day minutes',\n", "                           'total day calls',\n", "                           'total day charge']]);" ] },
{ "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ ] }
], "metadata": { "kernelspec": { "display_name": "Python 3 (Ubuntu Linux)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 0 }