GitHub Repository: suyashi29/python-su
Path: blob/master/Data Analysis using Python/Titanic EDA(final).ipynb

Kernel: Python 3

Learning EDA from a Historic Disaster: "The Titanic Wreck"

Objective

The objective here is to conduct exploratory data analysis (EDA) on the Titanic dataset in order to gather insights and eventually predict survival on the basis of factors such as Pclass, Sex and Age.

Why EDA?

EDA is an approach to summarizing, visualizing, and becoming intimately familiar with the important characteristics of a data set.
It defines and refines the selection of feature variables that will be used for machine learning.
It helps to find hidden insights.
It provides the context needed to develop an appropriate model with minimal errors.

About Event

The RMS Titanic was a British passenger liner that sank in the North Atlantic Ocean in the early morning hours of 15 April 1912, after it collided with an iceberg during its maiden voyage from Southampton to New York City. There were an estimated 2,224 passengers and crew aboard the ship, and more than 1,500 died, making it one of the deadliest commercial peacetime maritime disasters in modern history. This sensational tragedy shocked the international community and led to better safety regulations for ships.

2. Data Description

The dataset contains information about the people who boarded the famous RMS Titanic. Its variables include age, sex, fare, ticket and so on. The dataset comprises 891 observations of 12 columns. The table below lists the column names and their descriptions.

| Column Name | Description |
| ------------- |:-------------:|
| PassengerId | Passenger identity |
| Survived | Survival (0 = No; 1 = Yes) |
| Pclass | Passenger class (1 = 1st; 2 = 2nd; 3 = 3rd) |
| Name | Name of passenger |
| Sex | Sex of passenger |
| Age | Age of passenger |
| SibSp | Number of siblings and/or spouses travelling with the passenger |
| Parch | Number of parents and/or children travelling with the passenger |
| Ticket | Ticket number |
| Fare | Price of ticket |
| Cabin | Cabin number |
| Embarked | Port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton) |

In [2]:

import numpy as np               # For linear algebra
import pandas as pd              # For data manipulation
import matplotlib.pyplot as plt  # For 2D visualization
import pandas_profiling
import seaborn as sns
%matplotlib inline
sns.set()

Importing Data

In [3]:

Titanic_data = pd.read_csv("https://raw.githubusercontent.com/insaid2018/Term-1/master/Data/Casestudy/titanic_train.csv")
Titanic_test = pd.read_csv("https://raw.githubusercontent.com/insaid2018/Term-2/master/Data/test.csv")
combine = [Titanic_data, Titanic_test]

In [4]:

Titanic_data.columns

Out[4]:

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [5]:

Titanic_data.head()

Out[5]:

In [6]:

Titanic_data.tail()

Out[6]:

Examining Data

In [7]:

Titanic_data.shape #shows total number of rows and columns in data set

Out[7]:

(891, 12)

In [8]:

Titanic_data.describe()

Out[8]:

Insights:

1. The training set has 891 samples, about 40% of the estimated 2,224 people on board the Titanic.

2. Survived is a categorical feature with values 0 or 1.

3. Around 38% of the samples survived, which is broadly in line with the actual survival rate of 32%.

4. Fares varied significantly, with few passengers (<1%) paying as much as $512.

5. Few passengers (<1%) were elderly, in the age range 65-80.
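
The percentages in the first and third insights follow directly from figures quoted in this notebook. As a quick check, a minimal sketch (the variable names are illustrative; the survivor count of 342 is the one reported by the groupby in section 4.1):

```python
# Figures quoted in this notebook
sample_size = 891           # rows in the training set
aboard = 2224               # estimated passengers and crew aboard
survivors_in_sample = 342   # survivors among the 891 samples

coverage = sample_size / aboard
sample_survival_rate = survivors_in_sample / sample_size

print(f"Sample coverage: {coverage:.1%}")
print(f"Survival rate in the sample: {sample_survival_rate:.1%}")
```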

Data Profiling

pandas profiling generates an interactive HTML report that contains information about every column of the dataset, such as its counts and type.

1. It gives detailed information about each column, the correlation between columns, and a sample of the dataset.

2. It gives a visual interpretation of each column in the data.

3. The spread of the data can be better understood from the distribution plots.

4. It allows granular analysis of each column.

In [9]:

profile = pandas_profiling.ProfileReport(Titanic_data)
profile.to_file(outputfile="Titanic_before_preprocessing.html")

Data Preprocessing

Check for errors and null values
Replace null values with appropriate values
Drop features that are incomplete and not very relevant for analysis
Create new features that could help to improve prediction

Check for null or empty values in Data

In [10]:

miss1=Titanic_data.isnull().sum()
miss= (Titanic_data.isnull().sum()/len(Titanic_data))*100
miss_data=pd.concat([miss1,miss],axis=1,keys=['Total','%'])
print(miss_data)

Out[10]:

             Total          %
PassengerId      0   0.000000
Survived         0   0.000000
Pclass           0   0.000000
Name             0   0.000000
Sex              0   0.000000
Age            177  19.865320
SibSp            0   0.000000
Parch            0   0.000000
Ticket           0   0.000000
Fare             0   0.000000
Cabin          687  77.104377
Embarked         2   0.224467

Age, Cabin and Embarked have null values. Let's fix them.

Filling missing age by median

In [11]:

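new_age = Titanic_data['Age'].median()   # median of the training ages, reused for the test set
# Assign the result back instead of calling fillna(inplace=True) on an attribute-accessed column
Titanic_data['Age'] = Titanic_data['Age'].fillna(new_age)
Titanic_test['Age'] = Titanic_test['Age'].fillna(new_age)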
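
The cell above fills the test set with the median computed on the training set, so no information from the test data leaks into the imputation. A minimal sketch of the same idea on toy values (the two Series below are made up for illustration and are not rows of the Titanic data):

```python
import numpy as np
import pandas as pd

train_age = pd.Series([22.0, np.nan, 38.0, 26.0, np.nan, 35.0])  # toy values
test_age = pd.Series([np.nan, 47.0])                              # toy values

# Compute the statistic on the training data only, then reuse it for the test data
age_median = train_age.median()   # NaN values are skipped by default
train_filled = train_age.fillna(age_median)
test_filled = test_age.fillna(age_median)

print(age_median)
print(test_filled.tolist())
```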

Filling missing Embarked by mode

In [12]:

Titanic_data.Embarked = Titanic_data.Embarked.fillna(Titanic_data['Embarked'].mode()[0])
Titanic_test.Embarked = Titanic_test.Embarked.fillna(Titanic_data['Embarked'].mode()[0])

The Cabin feature can be dropped because it is highly incomplete (about 77% of its values are null).

In [13]:

Titanic_data.drop('Cabin', axis = 1,inplace = True)

The PassengerId feature can be dropped from the training dataset because it does not contribute to survival.

In [14]:

Titanic_data.drop('PassengerId', axis = 1,inplace = True)

The Ticket feature can also be dropped.

In [15]:

Titanic_data.drop('Ticket', axis = 1,inplace = True)

Creating New Fields

Create New Age Bands to improve prediction Insights
Create a new feature called Family based on Parch and SibSp to get total count of family members on board
Create a Fare range feature if it helps our analysis

AGE-BAND

In [16]:

Titanic_data['Age_band']=0
Titanic_data.loc[Titanic_data['Age']<=1,'Age_band']="Infant"
Titanic_data.loc[(Titanic_data['Age']>1)&(Titanic_data['Age']<=12),'Age_band']="Children"
Titanic_data.loc[Titanic_data['Age']>12,'Age_band']="Adults"
Titanic_data.head(2)

Out[16]:
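
The same banding can be expressed with pd.cut, which avoids initialising the column with 0 and then overwriting it with strings. This is a sketch of an alternative, not the method used in the cell above; the ages are toy values chosen to sit on the band edges:

```python
import numpy as np
import pandas as pd

ages = pd.Series([0.42, 1, 5, 12, 13, 80])  # toy values on and around the band edges

# Right-closed intervals: (-inf, 1], (1, 12], (12, inf)
age_band = pd.cut(ages, bins=[-np.inf, 1, 12, np.inf],
                  labels=["Infant", "Children", "Adults"])
print(age_band.tolist())
```

The fare bands in the next cell could be built the same way with their own bin edges.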

Fare-Band

In [17]:

Titanic_data['FareBand']=0
Titanic_data.loc[(Titanic_data['Fare']>=0)&(Titanic_data['Fare']<=10),'FareBand']=1
Titanic_data.loc[(Titanic_data['Fare']>10)&(Titanic_data['Fare']<=15),'FareBand']=2
Titanic_data.loc[(Titanic_data['Fare']>15)&(Titanic_data['Fare']<=35),'FareBand']=3
Titanic_data.loc[Titanic_data['Fare']>35,'FareBand']=4
Titanic_data.head(2)

Out[17]:

We want to check whether the Name feature can be engineered to extract titles, and test the correlation between titles and survival, before dropping the Name feature.

In the following code we extract the Title feature using a regular expression. The RegEx pattern ' ([A-Za-z]+)\.' matches the first run of letters that follows a space and ends with a dot character within the Name feature. With a single capture group, the expand=False flag returns a Series rather than a DataFrame.

In [18]:


for dataset in combine:
    # Raw string so that \. reaches the regex engine as an escaped dot
    dataset['Title'] = dataset.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)

pd.crosstab(Titanic_data['Title'], Titanic_data['Sex'])

Out[18]:
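
To see what the pattern captures, here is a small sketch on three names written in the dataset's "Surname, Title. Given names" format (the short Series is built here only for illustration):

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
])

# One capture group + expand=False gives back a Series of the captured text
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)
print(type(titles).__name__)
print(titles.tolist())
```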

We can replace many titles with a more common name or classify them as Rare.

In [19]:

for dataset in combine:
    dataset['Title'] = dataset['Title'].replace(
        ['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'],
        'Rare')

    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
    
Titanic_data[['Title', 'Survived']].groupby(['Title'], as_index=False).mean()

Out[19]:

We can convert the categorical titles to ordinal.

In [20]:

title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
for dataset in combine:
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)

Titanic_data.head()

Out[20]:
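
The consolidation and the ordinal mapping can be traced on a handful of titles. In this sketch the input list is invented, and "Xyz" stands for a hypothetical title that is missing from the mapping, to show why fillna(0) is needed:

```python
import pandas as pd

titles = pd.Series(["Mr", "Mlle", "Dr", "Mme", "Master", "Xyz"])  # "Xyz" is a made-up unseen title

# Step 1: fold uncommon titles into "Rare" and normalise French variants
titles = titles.replace(["Dr", "Rev", "Col"], "Rare")
titles = titles.replace({"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"})

# Step 2: map to integer codes; anything not in the mapping becomes NaN, then 0
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
codes = titles.map(title_mapping).fillna(0).astype(int)
print(codes.tolist())
```

Note that integer codes impose an order on the titles, which a linear model will treat as meaningful.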

Insights

Most titles band Age groups accurately. For example: Master title has Age mean of 5 years.
Survival among Title Age bands varies slightly.
Certain titles mostly survived (Mme, Lady, Sir) or did not (Don, Rev, Jonkheer).

Decision

We decide to retain the new Title feature for model training

Now we can convert features which contain strings to numerical values. This is required by most model algorithms. Doing so will also help us in achieving the feature completing goal.

Convert the Sex feature to numeric values in place, where female=1 and male=0.

In [21]:

for dataset in combine:
    dataset['Sex'] = dataset['Sex'].map( {'female': 1, 'male': 0} ).astype(int)

Titanic_data.head()

Out[21]:

With the titles extracted, we can now drop the Name feature.

In [22]:

Titanic_data.drop('Name', axis = 1,inplace = True)

In [23]:

for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)

Titanic_data.head()

Out[23]:

We can also create an artificial feature combining Pclass and Age.

In [24]:

for dataset in combine:
    dataset['Age*Class'] = dataset.Age * dataset.Pclass

Titanic_data.loc[:, ['Age*Class', 'Age', 'Pclass']].head(10)

Out[24]:

Post Pandas Profiling : Checking Data after data preparation

In [25]:

import pandas_profiling
profile = pandas_profiling.ProfileReport(Titanic_data)
profile.to_file(outputfile="Titanic_after_preprocessing.html")

Data Visualization

4.1 What is the total count of survivors and victims?

In [31]:

Titanic_data.groupby(['Survived'])['Survived'].count()# similar functions unique(),sum(),mean() etc

Out[31]:

Survived
0    549
1    342
Name: Survived, dtype: int64

In [29]:

# Store the Axes in `ax` so the name `plt` keeps referring to matplotlib.pyplot
ax = Titanic_data.Survived.value_counts().plot(kind='bar')
ax.set_xlabel('DIED OR SURVIVED')
ax.set_ylabel('Passenger Count')

Out[29]:

Text(0, 0.5, 'Passenger Count')

Insights

Only 342 of the 891 passengers survived.
The majority died, so the overall chance of survival was low.

4.2 Which gender had the higher survival rate?

In [37]:

Titanic_data.groupby(['Survived', 'Sex']).count()["Age"]

Out[37]:

Survived  Sex
0         0      468
          1       81
1         0      109
          1      233
Name: Age, dtype: int64

In [35]:

sns.countplot(x='Survived', data=Titanic_data, hue='Sex')  # pass the column as a keyword argument

Out[35]:

<matplotlib.axes._subplots.AxesSubplot at 0x252aa2077f0>

In [38]:

Titanic_data[['Sex','Survived']].groupby(['Sex']).mean().plot.bar()

Out[38]:

<matplotlib.axes._subplots.AxesSubplot at 0x252aa43c278>

Insights

Females had a much better chance of survival: "LADIES FIRST".
There were more males than females, but most of the males died.
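
The survival rates behind these insights can be derived from the counts printed by the groupby above (Sex is encoded as 0 = male, 1 = female at this point). A minimal sketch that rebuilds the counts by hand; the column names died, survived, total and survival_rate are chosen here for illustration:

```python
import pandas as pd

# Counts copied from the groupby output above
counts = pd.DataFrame(
    {"died": [468, 81], "survived": [109, 233]},
    index=pd.Index(["male", "female"], name="Sex"),
)
counts["total"] = counts["died"] + counts["survived"]
counts["survival_rate"] = counts["survived"] / counts["total"]
print(counts.round(3))
```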

4.3 What is Survival rate based on Person type?

In [39]:

Titanic_data.groupby(['Survived', 'Age_band']).count()['Sex']

Out[39]:

Survived  Age_band
0         Adults      520
          Children     27
          Infant        2
1         Adults      302
          Children     28
          Infant       12
Name: Sex, dtype: int64

In [42]:

# Work on the returned Axes (`ax`) rather than on a name that shadows matplotlib.pyplot
ax = Titanic_data[Titanic_data['Age_band'] == 'Adults'].Survived.groupby(Titanic_data.Survived).count().plot(kind='pie', figsize=(6, 6),explode=[0,0.05],autopct='%1.1f%%')
ax.axis('equal')   # equal aspect ratio so the pie is drawn as a circle
#ax.legend(["Died","Survived"])
#ax.set_title("Adult survival rate")

Out[42]:

------------------------------------------ADULT-SURVIVAL RATE--------------------------------------------------------------

In [35]:

# Work on the returned Axes (`ax`) rather than on a name that shadows matplotlib.pyplot
ax = Titanic_data[Titanic_data['Age_band'] == 'Children'].Survived.groupby(Titanic_data.Survived).count().plot(kind='pie', figsize=(6, 6),explode=[0,0.05],autopct='%1.1f%%')
ax.axis('equal')   # equal aspect ratio so the pie is drawn as a circle
#ax.legend(["Died","Survived"])
ax.set_title("Child survival rate")

Out[35]:

Text(0.5, 1.0, 'Child survival rate')

------------------------------------------CHILD-SURVIVAL RATE--------------------------------------------------------------

In [36]:

# Work on the returned Axes (`ax`) rather than on a name that shadows matplotlib.pyplot
ax = Titanic_data[Titanic_data['Age_band'] == 'Infant'].Survived.groupby(Titanic_data.Survived).count().plot(kind='pie', figsize=(6, 6),explode=[0,0.05],autopct='%1.1f%%')
ax.axis('equal')   # equal aspect ratio so the pie is drawn as a circle
#ax.legend(["Died","Survived"])
ax.set_title("Infant survival rate")

Out[36]:

Text(0.5, 1.0, 'Infant survival rate')

Insights

The majority of passengers were adults.
About half of the children survived.
Most of the adults did not survive.
More than 85 percent of infants survived.
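
These proportions can be checked against the Age_band counts printed earlier in this section. A short sketch, with the (died, survived) pairs copied from that output and the dictionary names chosen here for illustration:

```python
# (died, survived) counts copied from the Age_band groupby output above
counts = {"Adults": (520, 302), "Children": (27, 28), "Infant": (2, 12)}

rates = {band: survived / (died + survived)
         for band, (died, survived) in counts.items()}

for band, rate in rates.items():
    print(f"{band}: {rate:.1%}")
```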

4.4 Did passenger class have an impact on survival rate?

In [37]:

Titanic_data.groupby(['Pclass', 'Survived'])['Survived'].count()

Out[37]:

Pclass  Survived
1       0            80
        1           136
2       0            97
        1            87
3       0           372
        1           119
Name: Survived, dtype: int64

In [38]:

sns.barplot(x='Pclass', y='Survived', data=Titanic_data)  # pass columns as keyword arguments

Out[38]:

C:\Users\HP\Anaconda3\lib\site-packages\scipy\stats\stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval

<matplotlib.axes._subplots.AxesSubplot at 0x2035aef26d8>

In [39]:

sns.barplot(x='Pclass', y='Survived', hue='Sex', data=Titanic_data)  # pass columns as keyword arguments

Out[39]:

<matplotlib.axes._subplots.AxesSubplot at 0x2035a4b70f0>

Insights

Most passengers travelled in third class, but only about 24% of them survived.
In terms of survival, more first-class passengers survived, and females were again given priority.
Economic class affected the survival rate: passengers travelling first class had a higher survival ratio than those in classes 2 and 3.

4.5 What is the survival probability based on port of embarkation?

On its maiden voyage to New York, before crossing the Atlantic Ocean, the Titanic picked up passengers at three ports: Cherbourg (C), Queenstown (Q) and Southampton (S). Most of the passengers embarked at Southampton. Let's see how the port of embarkation affected survival probability. In the plots below, Embarked uses the numeric codes assigned during preprocessing: S = 0, C = 1, Q = 2.

In [40]:


sns.countplot(x='Embarked', data=Titanic_data)  # pass the column as a keyword argument

Out[40]:

<matplotlib.axes._subplots.AxesSubplot at 0x2035a0c7b70>

In [41]:

# Store the Axes in `ax` so the name `plt` keeps referring to matplotlib.pyplot
ax = Titanic_data[['Embarked', 'Survived']].groupby('Embarked').mean().Survived.plot(kind='bar')
ax.set_xlabel('Embarked')
ax.set_ylabel('Survival Probability')

Out[41]:

Text(0, 0.5, 'Survival Probability')

Survival by gender, port of embarkation and Pclass

In [42]:

pd.crosstab([Titanic_data.Sex, Titanic_data.Survived,Titanic_data.Pclass],[Titanic_data.Embarked], margins=True)

Out[42]:

In [43]:

sns.violinplot(x='Embarked',y='Pclass',hue='Survived',data=Titanic_data,split=True)

Out[43]:

<matplotlib.axes._subplots.AxesSubplot at 0x20358d9b710>

In [43]:

sns.catplot(x="Embarked", y="Survived", hue="Sex",
            col="Pclass", aspect=.8,kind='bar',
             data=Titanic_data);

Out[43]:

Insights:

Most passengers from port C survived.
Most passengers were from Southampton (S).
There is an exception at Embarked=C, where males had a higher survival rate. This could reflect a correlation between Pclass and Embarked, and in turn between Pclass and Survived, rather than a direct correlation between Embarked and Survived.
Males had a better survival rate at port C than at ports S and Q.
Females had their lowest survival rate in Pclass 3.

4.6 How is Fare distributed among passengers?

In [42]:

Titanic_data['Fare'].min()

Out[42]:

0.0

In [43]:

Titanic_data['Fare'].max()

Out[43]:

512.3292

In [44]:

Titanic_data[['FareBand', 'Survived']].groupby(['FareBand'], as_index=False).mean().sort_values(by='FareBand', ascending=True)

Out[44]:

In [45]:

Titanic_data.groupby(['FareBand', 'Survived'])['Survived'].count()

Out[45]:

FareBand  Survived
1         0           269
          1            67
2         0            75
          1            47
3         0           130
          1           105
4         0            75
          1           123
Name: Survived, dtype: int64

In [45]:

sns.swarmplot(x='Survived', y='Fare', data=Titanic_data);

Out[45]:

Insights

The majority of fares lie in the 0-100 dollar range.
Passengers who paid higher fares had a better chance of survival.
Very few passengers paid a fare as high as 512 dollars (an outlier).

4.7 What was Average fare by Pclass & Embark location?

In [49]:

sns.boxplot(x="Pclass", y="Fare", data=Titanic_data,hue="Embarked")

Out[49]:

<matplotlib.axes._subplots.AxesSubplot at 0x252aa3b7828>

In [48]:

sns.boxplot(x="Embarked", y="Fare", data=Titanic_data)

Out[48]:

<matplotlib.axes._subplots.AxesSubplot at 0x1b996a8a080>

Insights

First-class passengers paid the major part of the total fare.
Passengers who embarked from port C paid the highest fares.

4.8 Segment Age into bins

In [49]:

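# Store the Axes in `ax` so the name `plt` keeps referring to matplotlib.pyplot
ax = Titanic_data['Age'].hist(bins=20)   # 20 equal-width bins, roughly 4 years each
ax.set_ylabel('Passengers')
ax.set_xlabel('Age of Passengers')
ax.set_title('Age Distribution of Titanic Passengers',size=17, y=1.08)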

Out[49]:

Text(0.5, 1.08, 'Age Distribution of Titanic Passengers')
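
For bins that are exactly 10 years wide, the bin edges can be given explicitly instead of a bin count. The sketch below uses np.histogram on toy ages to show the counting; the same list of edges could be passed as bins to the hist call above. The ages are invented for illustration:

```python
import numpy as np

ages = np.array([0.42, 4, 19, 22, 28, 28, 35, 41, 58, 80])  # toy values
edges = np.arange(0, 100, 10)   # 0, 10, ..., 90 -> nine 10-year bins

# Each bin is [left, right), except the last one, which also includes its right edge
counts, _ = np.histogram(ages, bins=edges)

for left, n in zip(edges[:-1].tolist(), counts.tolist()):
    print(f"{left:>2}-{left + 10:<3}: {n}")
```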

Insights:

The youngest passengers on the Titanic were infants under 6 months old.
The oldest passenger was 80 years old.
The mean age was a bit over 29 years, i.e. there were more young passengers on the ship.

Let's see how Age correlates with Survival

In [50]:


sns.distplot(Titanic_data[Titanic_data['Survived']==1]['Age'])

Out[50]:

<matplotlib.axes._subplots.AxesSubplot at 0x1b996826240>

In [51]:

sns.distplot(Titanic_data[Titanic_data['Survived']==0]['Age'])

Out[51]:

<matplotlib.axes._subplots.AxesSubplot at 0x1b990d69a90>

In [52]:

sns.violinplot(x='Sex',y='Age',hue='Survived',data=Titanic_data,split=True)

Out[52]:

<matplotlib.axes._subplots.AxesSubplot at 0x1b994a52550>

Insights

Most of the passengers died.
The majority of passengers were between 25 and 40, and most of them died.
Females were more likely to survive.

4.9 Did solo passengers have a lower chance of survival?

In [52]:

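# FamilySize here counts relatives aboard (siblings/spouses + parents/children), excluding the passenger
Titanic_data['FamilySize'] = Titanic_data['Parch'] + Titanic_data['SibSp']
# SoloPassenger is 1 when the passenger has no relatives aboard, otherwise 0
Titanic_data['SoloPassenger'] = (Titanic_data['FamilySize'] == 0).astype(int)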
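
A tiny worked example of the two derived columns, on a toy frame with made-up SibSp and Parch values:

```python
import pandas as pd

toy = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})  # toy values

toy["FamilySize"] = toy["SibSp"] + toy["Parch"]               # relatives aboard, excluding the passenger
toy["SoloPassenger"] = (toy["FamilySize"] == 0).astype(int)   # 1 when travelling alone
print(toy)
```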

In [53]:

# factorplot was renamed to catplot; kind='point' reproduces factorplot's default plot type
sns.catplot(x='SoloPassenger', y='Survived', kind='point', data=Titanic_data)

Out[53]:

<seaborn.axisgrid.FacetGrid at 0x252aa3835f8>

In [55]:

sns.violinplot(y='SoloPassenger',x='Sex',hue='Survived',data=Titanic_data,split=True)

Out[55]:

<matplotlib.axes._subplots.AxesSubplot at 0x1b9964d4978>

In [56]:

# factorplot was renamed to catplot; kind='point' reproduces factorplot's default plot type
sns.catplot(x='SoloPassenger', y='Survived', hue='Pclass', col='Embarked', kind='point', data=Titanic_data)

Out[56]:

<seaborn.axisgrid.FacetGrid at 0x1b9964e24e0>

Insights

Most passengers were travelling solo, and most of them died.
Solo females were more likely to survive than solo males.
Passenger class has a positive correlation with solo-passenger survival.
Passengers who embarked from port Q had a fifty-fifty chance of survival.

4.10 How did total family size affect survival?

In [57]:

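# No loop is needed: the column arithmetic is vectorised.
# FamilySize now includes the passenger: siblings/spouses + parents/children + 1
Titanic_data['FamilySize'] = Titanic_data['SibSp'] + Titanic_data['Parch'] + 1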

Titanic_data[['FamilySize', 'Survived']].groupby(['FamilySize'], as_index=False).mean().sort_values(by='Survived', ascending=False)

Out[57]:

In [58]:

sns.barplot(x='FamilySize', y='Survived', hue='Sex', data=Titanic_data)

Out[58]:

<matplotlib.axes._subplots.AxesSubplot at 0x1b99772a358>

Insights

Both men and women had a massive drop in survival with a FamilySize over 4.
The chance of survival as a man increased with FamilySize up to a size of 4.
Men were not likely to survive with a FamilySize of 5 or 6.
Large families had a lower likelihood of survival.

4.11 How can you correlate Pclass/Age/Fare with Survival rate?

In [55]:

sns.pairplot(Titanic_data[["FareBand","Age","Pclass","Survived"]],vars= ["FareBand","Age","Pclass"],hue="Survived", dropna=True,markers=["o", "s"])
#plt.set_title('Pair Plot')

Out[55]:

<seaborn.axisgrid.PairGrid at 0x252a9590a90>

Insights:

Fare and Survival have a positive correlation.
We cannot relate Age and Survival, as the majority of travellers were middle-aged.
Higher-class passengers had a greater likelihood of survival.

4.12 Which features had most impact on Survival rate?

In [60]:

# Correlate numeric columns only (Age_band holds strings at this point)
sns.heatmap(Titanic_data.select_dtypes('number').corr(), annot=True)

Out[60]:

<matplotlib.axes._subplots.AxesSubplot at 0x1b998d97fd0>
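
A heatmap shows every pairwise correlation; to rank features by their correlation with the target alone, the Survived column of the correlation matrix can be sorted. A minimal sketch on a toy frame (the values are invented, and Sex is deliberately identical to Survived so that its correlation is exactly 1):

```python
import pandas as pd

toy = pd.DataFrame({
    "Survived": [0, 0, 1, 1, 1, 0],
    "Sex":      [0, 0, 1, 1, 1, 0],   # identical to Survived in this toy frame
    "Pclass":   [3, 3, 1, 2, 1, 3],
    "Age_band": ["Adults"] * 6,       # non-numeric column, excluded below
})

corr_with_target = (
    toy.select_dtypes("number")   # keep numeric columns only
       .corr()["Survived"]
       .drop("Survived")          # a column's correlation with itself is always 1
       .sort_values(ascending=False)
)
print(corr_with_target)
```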

Insights:

Older women had a higher rate of survival than older men. Also, older women had a higher rate of survival than younger women, the opposite of the trend for male passengers.
Not all features are necessary to predict Survival.
More features create complexity.
Fare has a positive correlation with Survival.
Females had the better survival chances overall; only at port C did males have a greater likelihood of survival.

Conclusion: "If you were a young female travelling in First Class who embarked from port C, you had the best chances of survival on the Titanic"

Most of the passengers died.
"Ladies & Children First": about 74% of females and about half of the children survived.
Gender, passenger type and class are most strongly related to survival.
The survival rate diminishes significantly for solo passengers.
The majority of males died.
Males with family had a better survival rate than solo males.

Part -2

Machine Learning

Importing Machine Learning Packages

In [61]:

# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
# Data manipulation packages
import pandas as pd 
import numpy as np
import random as rnd

In [62]:

Titanic_data.head()

Out[62]:

In [63]:

Titanic_data['Age_band']=0
Titanic_data.loc[Titanic_data['Age']<=1,'Age_band']=1
Titanic_data.loc[(Titanic_data['Age']>1)&(Titanic_data['Age']<=12),'Age_band']=2
Titanic_data.loc[Titanic_data['Age']>12,'Age_band']=3
Titanic_data.head(2)

Out[63]:

Analyze by pivoting features

To confirm some of our observations and assumptions, we can quickly analyze our feature correlations by pivoting features against each other. We can only do so at this stage for features which do not have any empty values. It also makes sense doing so only for features which are categorical (Sex), ordinal (Pclass) or discrete (SibSp, Parch) type.

Pclass: Passengers in Pclass=1 had a survival rate above 0.5 (136 of 216, about 63%). We decide to include this feature in our model.
Sex: We confirm the earlier observation that Sex=female had a very high survival rate, at 74%.
SibSp and Parch: These features have zero correlation for certain values. It may be best to derive a feature or a set of features from these individual features.

In [64]:

Titanic_data[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)

Out[64]:

In [65]:

Titanic_data[['Sex', 'Survived']].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False)

Out[65]:

In [66]:

Titanic_data[['FamilySize', 'Survived']].groupby(['FamilySize'], as_index=False).mean().sort_values(by='Survived', ascending=False)

Out[66]:

Observations from EDA on Categorical Features

Female passengers had a much better survival rate than males.
There is an exception at Embarked=C, where males had a higher survival rate. This could reflect a correlation between Pclass and Embarked, and in turn between Pclass and Survived, rather than a direct correlation between Embarked and Survived.
Males had a better survival rate in Pclass=3 than in Pclass=2 for the C and Q ports.
Ports of embarkation have varying survival rates for Pclass=3 and among male passengers.

Decisions.

Add Sex feature to model training.
Complete and add Embarked feature to model training.

There are 60+ predictive modelling algorithms to choose from. We must understand the type of problem and the solution requirement to narrow these down to a select few models that we can evaluate. Here our problem is a classification problem: predicting whether a passenger survived.

Let's identify the relationship between the output (Survived or not) and the other variables or features (Gender, Age, Port) and apply the category of machine learning called supervised learning.

1. Logistic Regression

Logistic regression is a statistical method for analyzing a dataset in which one or more independent variables determine an outcome.
Logistic regression is used when the dependent variable (target) is categorical.
Logistic regression measures the relationship between the categorical dependent variable (target) and one or more independent variables (features) by estimating probabilities using a logistic function, which is the cumulative distribution function of the logistic distribution.
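
The logistic function mentioned above maps any real number (the log-odds) to a probability between 0 and 1. A minimal sketch in plain Python; the function name logistic is chosen here for illustration:

```python
import math

def logistic(z: float) -> float:
    """Map a log-odds value z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

for z in (-2.0, 0.0, 2.0):
    print(f"z = {z:+.1f} -> p = {logistic(z):.3f}")
```

A log-odds of 0 corresponds to a probability of 0.5, and the curve is symmetric around that point.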

In [67]:

Titanic_data.shape, Titanic_test.shape

Out[67]:

((891, 14), (418, 13))

In [68]:

Titanic_test = Titanic_test.drop(['Ticket', 'Cabin','Name'], axis=1)

In [69]:

X_titanic = Titanic_data.drop("Survived", axis=1)
Y_titanic = Titanic_data["Survived"]
X_test  = Titanic_test.drop("PassengerId", axis=1).copy()
X_titanic.shape, Y_titanic.shape, X_test.shape

Out[69]:

((891, 13), (891,), (418, 9))

In [78]:

#Titanic_test  = Titanic_test.drop("PassengerId", axis=1)
Titanic_test.head()

Out[78]:

In [ ]:

# Fit on the columns that are also present in X_test, so fit() and predict() see the same features.
# predict() also requires X_test to be free of missing values.
logreg = LogisticRegression()
logreg.fit(X_titanic[X_test.columns], Y_titanic)
Y_pred = logreg.predict(X_test)
acc_log = round(logreg.score(X_titanic[X_test.columns], Y_titanic) * 100, 2)  # accuracy on the training data
acc_log

We can use Logistic Regression to validate our assumptions and decisions for feature creating and completing goals. This can be done by calculating the coefficient of the features in the decision function.

In [80]:

# Pair each coefficient with the column the model was fitted on
coeff_df = pd.DataFrame({'Feature': X_test.columns, 'Coefficient': logreg.coef_[0]})

coeff_df.sort_values(by='Coefficient', ascending=False)

Out[80]:

Positive coefficients increase the log-odds of the response (and thus increase the probability), and negative coefficients decrease the log-odds of the response (and thus decrease the probability).
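
To make the link between a coefficient and a probability concrete: adding a coefficient to the log-odds multiplies the odds by exp(coefficient). The sketch below uses a hypothetical coefficient of 2.0 and a hypothetical baseline probability of 0.2; neither value is taken from the fitted model:

```python
import math

coef = 2.0                    # hypothetical coefficient, not from the fitted model
odds_ratio = math.exp(coef)   # multiplicative change in the odds per one-unit increase

def shift_probability(p: float, coef: float) -> float:
    """Probability after adding `coef` to the log-odds of a baseline probability p."""
    odds = p / (1.0 - p) * math.exp(coef)
    return odds / (1.0 + odds)

print(f"Odds ratio: {odds_ratio:.3f}")
print(f"Baseline 0.20 becomes {shift_probability(0.2, coef):.3f}")
```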

Insights

Sex has the highest positive coefficient, implying that as the Sex value increases (male: 0 to female: 1), the probability of Survived=1 increases the most.
Inversely, as Pclass increases, the probability of Survived=1 decreases the most.
Age*Class is therefore a good artificial feature to model, as it has the second-highest negative coefficient.
So is Title, which has the second-highest positive coefficient.

Support Vector Machines(SVM)

Support-vector machines (also called support-vector networks) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis.

In [ ]:


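# Train on the columns shared with X_test; X_train / Y_train are not defined in this notebook
svc = SVC()
svc.fit(X_titanic[X_test.columns], Y_titanic)
Y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_titanic[X_test.columns], Y_titanic) * 100, 2)  # accuracy on the training data
acc_svc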

k-Nearest Neighbors algorithm

In pattern recognition, the k-Nearest Neighbors algorithm (or k-NN for short) is a non-parametric method used for classification and regression. A sample is classified by a majority vote of its neighbors, with the sample being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.
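
The majority vote described above can be written in a few lines of NumPy. This is a from-scratch sketch for illustration, not how scikit-learn's KNeighborsClassifier is implemented; the function name knn_predict and the five training points are made up:

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points (Euclidean distance)."""
    distances = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(distances)[:k]   # indices of the k closest points
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

# Toy data: three points of class 0 near (1, 1), two points of class 1 near (5, 5)
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9]])
y_train = np.array([0, 0, 0, 1, 1])

print(knn_predict(X_train, y_train, np.array([1.0, 0.9])))
print(knn_predict(X_train, y_train, np.array([5.1, 5.0])))
```

For the second query, the three nearest neighbours are the two class-1 points and one class-0 point, so the vote is 2 to 1.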

In [ ]:

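# Train on the columns shared with X_test; X_train / Y_train are not defined in this notebook
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_titanic[X_test.columns], Y_titanic)
Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_titanic[X_test.columns], Y_titanic) * 100, 2)  # accuracy on the training data
acc_knn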

Naive Bayes

Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features. Naive Bayes classifiers are highly scalable, requiring a number of parameters linear in the number of variables (features) in a learning problem.

The model generated confidence score is the lowest among the models evaluated so far.

In [ ]:

# Gaussian Naive Bayes

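# Train on the columns shared with X_test; X_train / Y_train are not defined in this notebook
gaussian = GaussianNB()
gaussian.fit(X_titanic[X_test.columns], Y_titanic)
Y_pred = gaussian.predict(X_test)
acc_gaussian = round(gaussian.score(X_titanic[X_test.columns], Y_titanic) * 100, 2)  # accuracy on the training data
acc_gaussian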

Perceptron

The perceptron is an algorithm for supervised learning of binary classifiers (functions that can decide whether an input, represented by a vector of numbers, belongs to some specific class or not). It is a type of linear classifier, i.e. a classification algorithm that makes its predictions based on a linear predictor function combining a set of weights with the feature vector. The algorithm allows for online learning, in that it processes elements in the training set one at a time.
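
The one-sample-at-a-time update can be sketched from scratch. The example below trains on the logical AND of two inputs, which is linearly separable; the function name train_perceptron, the learning rate and the epoch limit are illustrative choices, and this is not scikit-learn's implementation:

```python
import numpy as np

def train_perceptron(X, y, epochs=20, lr=1.0):
    """Perceptron rule: after each misclassified sample, move the weights toward the target."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for xi, target in zip(X, y):
            pred = 1 if xi @ w + b > 0 else 0
            update = lr * (target - pred)   # 0 when the prediction is correct
            if update != 0:
                w += update * xi
                b += update
                errors += 1
        if errors == 0:   # a full pass without mistakes: training has converged
            break
    return w, b

# Logical AND: only the input (1, 1) belongs to class 1
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])

w, b = train_perceptron(X, y)
preds = [1 if xi @ w + b > 0 else 0 for xi in X]
print(preds)
```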

In [ ]:

# Perceptron

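# Train on the columns shared with X_test; X_train / Y_train are not defined in this notebook
perceptron = Perceptron()
perceptron.fit(X_titanic[X_test.columns], Y_titanic)
Y_pred = perceptron.predict(X_test)
acc_perceptron = round(perceptron.score(X_titanic[X_test.columns], Y_titanic) * 100, 2)  # accuracy on the training data
acc_perceptron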
