Were the women and children first to the lifeboats? We compare the demographics of the survivors with those of the passengers as a whole.
How strongly does each feature correlate with survival? Answering this question would allow us to construct a naïve classifier for passenger survival.
import numpy as np
import pandas as pd
df=pd.read_csv('titanic.csv')
df.head()
df.describe()
print(df.Cabin.describe())
print(df.Ticket.describe())
print(df.Embarked.describe())
We notice several extraneous features in our data. First of all, 'Name' and 'PassengerId' each uniquely identify the passenger. Since we are not interested in reconstructing family trees in this exploration, we discard both features. Next, 'Ticket' and 'Cabin' seem unlikely to correlate in any significant way with survival. While the latter feature may have some impact on survivability (we would expect a positive correlation between closeness of cabin to lifeboats and survivability), the 'Cabin' column is missing over $75\%$ of its entries. Therefore, we discard both of these features.
del df['PassengerId']
del df['Name']
del df['Cabin']
del df['Ticket']
df.head()
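The share of missing entries quoted above for 'Cabin' can be checked for every column at once. The following is a minimal sketch on a small made-up table (the values and the name `toy` are our assumptions, not the Titanic sample); the same expression could be applied to `df` before the columns are dropped.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only (not the Titanic sample).
toy = pd.DataFrame({
    'Cabin': ['C85', np.nan, np.nan, np.nan],
    'Age':   [22.0, 38.0, np.nan, 35.0],
    'Fare':  [7.25, 71.28, 7.92, 53.10],
})

# isnull() marks the missing cells; the column-wise mean of those
# booleans is the fraction of missing entries in each column.
missing_fraction = toy.isnull().mean()
print(missing_fraction)
```

A column whose missing fraction is close to one carries little usable information, which is the argument made above for discarding 'Cabin'.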
Finally, we note that 'Embarked' is missing two data points. We fill the missing points with S for Southampton, which was the first point of departure [4].
#df['Sex']=df['Sex'].map({'female':1,'male':0})
#df['Embarked']=df['Embarked'].map({'S':0,'C':1,'Q':2})
df['Embarked']=df['Embarked'].fillna('S')
df['Embarked'].describe()
The column statistics show us two problems. First, the minimum Fare was $\$0$. It is unclear if this is a problem in the data, or if some fares were free. Since this only affects $15$ data points, we ignore this issue. Next, the 'Age' column is missing 177 entries. The survival rate for the passengers with no age data appears to be substantially lower than the overall survival rate ($22\%$ vs $38\%$). Therefore, filling missing entries with the sample average may skew our analysis. We also note that the entries with missing age are disproportionately male ($71\%$ vs the population's $65\%$).
print(df.describe())
print('------------- ')
print('Missing ages:')
# isnull() selects the rows whose 'Age' is NaN
print(df[df['Age'].isnull()].describe())
print('------------- ')
print('$0 Fares:')
print(df[df['Fare'] == 0].describe())
We further investigate how far the survival statistics of the age-less subsample differ from the survival statistics of the overall sample by grouping by gender:
print('Survival statistics, by sex, for passengers with Age')
print(df[df['Age'].notnull()].groupby('Sex')['Survived'].mean())
print('Survival statistics, by sex, for passengers missing Age')
print(df[df['Age'].isnull()].groupby('Sex')['Survived'].mean())
We still note some differences, although it is uncertain if they are statistically significant. For a finer analysis, we could try to fill the missing values with the mean or median of similar samples, but for now we simply discard the entries missing an age.
# Keep only the rows with a recorded age
df=df[df['Age'].notnull()]
df.describe()
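As a sketch of the alternative mentioned above (filling missing ages with the median of similar passengers), one could group by 'Sex' and 'Pclass' and fill each gap with its group's median age. The choice of grouping columns, the made-up table `toy`, and the column name 'AgeFilled' are our assumptions for illustration; this is not applied in the analysis that follows.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only (not the Titanic sample).
toy = pd.DataFrame({
    'Sex':    ['male', 'male', 'male', 'female', 'female', 'female'],
    'Pclass': [3, 3, 3, 1, 1, 1],
    'Age':    [20.0, 30.0, np.nan, 40.0, np.nan, 50.0],
})

# Median age within each (Sex, Pclass) group, aligned to the original rows.
# Missing ages are skipped when the median is computed.
group_median = toy.groupby(['Sex', 'Pclass'])['Age'].transform('median')

# Fill only the missing ages; observed ages are left untouched.
toy['AgeFilled'] = toy['Age'].fillna(group_median)
print(toy)
```

Writing to a separate column keeps the original 'Age' values available, so results with and without imputation can be compared.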
Our primary question was whether women and children were more likely to survive than the men. It will be convenient for us to group our data into these demographic groups, so we create a new feature, 'MWC', for "Man, Woman, or Child". We take as our definition of child any passenger of age $\leq 16$, regardless of sex.
def wmc_bin(passenger):
    #Label a passenger as 'Child' (age <= 16), 'Man', or 'Woman'
    out=''
    if passenger['Age'] <= 16:
        out='Child'
    else:
        if passenger['Sex'] == 'male':
            out='Man'
        if passenger['Sex'] == 'female':
            out='Woman'
    return out
df['MWC']=df.apply(wmc_bin,axis = 1)
df.head()
With this new feature, we can describe the breakdown of the survivors in terms of Men/Women/Children. The graph shows that only $24.48\%$ of the survivors were men, and that most survivors were Women and Children.
import seaborn
#Summing (over each category) the 'Survived' column gives us a total count of survivors (in that category).
df.groupby('MWC')['Survived'].sum().plot(kind = 'pie', autopct='%.2f', fontsize=25)
Of course, this does not completely answer our question. To infer if Women and Children were actually given priority in the rush to the lifeboats, we have to better understand the statistics of our population. We determine the number of Men, Women, and Children in our sample, and also determine what proportion of each demographic survived.
def field_count(data,field,flags):
    #This function counts the number of entries in our data with field = flag, returning a tuple of counts for each flag in flags
    out=[]
    for flag in flags:
        out.append(len(data[data[field] == flag]))
    return tuple(out)
M,W,C = field_count(df,'MWC',['Man','Woman','Child'])
print('There were '+str(M)+' Men, '+str(W)+' Women, and '+str(C)+' Children.')
df.groupby('MWC')['Survived'].mean()
We see that $17.7\%$ of the $402$ men survived, $77.4\%$ of the $212$ women survived, and $55\%$ of the $100$ children survived.
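To gauge whether the gap between two of these survival rates could plausibly be chance, one option is a two-proportion z-test. The sketch below is our own illustration, not part of the original analysis: the function name `two_proportion_z_test` is hypothetical, and the survivor counts are reconstructed from the rounded percentages above ($17.7\%$ of $402$ is about $71$ men, $77.4\%$ of $212$ is about $164$ women), so they should be treated as approximate.

```python
import math

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return (z, two-sided p-value) for H0: both groups share one rate."""
    p_a = successes_a / float(n_a)
    p_b = successes_b / float(n_b)
    # Pooled rate under the null hypothesis of a common survival rate.
    pooled = (successes_a + successes_b) / float(n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Approximate counts reconstructed from the rounded percentages (assumption).
z, p = two_proportion_z_test(164, 212, 71, 402)
print('z = %.2f, p = %.3g' % (z, p))
```

With these counts the z statistic lies far beyond conventional thresholds, so the difference between women and men is very unlikely to be a sampling artifact. The test relies on a normal approximation, which is reasonable for groups of this size.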
First, we analyze correlations between numerical features:
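# Restrict to numeric columns; recent pandas versions raise an error on string columns such as 'Sex'
df.select_dtypes(include=[np.number]).corr()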
We notice another strong trend in the data. There is a pronounced negative correlation between survival and fare class. Further investigation shows that, of the three fare classes, a substantially higher proportion of first and second class passengers survived, compared to the third class passengers. The following histogram plots the total number of passengers in each fare class (blue) and the total number of survivors in each fare class (green). From this, we can see that over half of the first class passengers in our sample survived, roughly half of the second class passengers survived, and less than a quarter of the third class passengers survived.
df['Pclass'].hist(bins=3)
df[df['Survived'] == 1]['Pclass'].hist(bins=3)
We see the same trend when we plot by Fare (omitting, for the sake of the visualization, the outlier high fares). A low percentage of the passengers who paid fares under $\$25$ survived, while passengers who paid higher fares had much higher survival rates. Of course, this is just a higher resolution picture of the first image, as the Fare essentially predicts class.
df[df['Fare'] < 300]['Fare'].hist(bins=12)
df[(df['Survived'] == 1) & (df['Fare']< 300)]['Fare'].hist(bins=12)
Moving on, it remains to investigate whether the point of embarkation had any influence on survival rate.
df.groupby('Embarked')['Survived'].mean()
It appears that passengers departing from Cherbourg, France, survived at a higher rate than passengers departing from Southampton or Queenstown. We investigate this further, to determine if this is a genuine signal or if other hidden factors can explain the discrepancy.
df.groupby('Embarked')['Pclass'].describe()
df.groupby('Embarked')['Sex'].describe()
It appears that most passengers departing from Cherbourg were in first or second class. Thus, we would tentatively conclude that the higher proportion of Cherbourg passengers surviving was an artifact of the higher proportion of Cherbourg passengers riding in first or second class.
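One way to probe this explanation is to compare survival rates by port within each fare class: if the port effect largely disappears once class is held fixed, class is the more likely driver. The sketch below uses a small made-up table (the values and the name `toy` are our assumptions, not the Titanic sample), constructed so that the ports differ overall but not within a class.

```python
import pandas as pd

# Toy data, invented for illustration only (not the Titanic sample).
toy = pd.DataFrame({
    'Embarked': ['C'] * 6 + ['S'] * 6,
    'Pclass':   [1, 1, 1, 1, 3, 3] + [1, 1, 3, 3, 3, 3],
    'Survived': [1, 1, 1, 1, 1, 0] + [1, 1, 1, 1, 0, 0],
})

# Overall rate by port: this mixes any port effect with the class mix.
by_port = toy.groupby('Embarked')['Survived'].mean()

# Rate within each class, with one column per port.
by_class_and_port = toy.groupby(['Pclass', 'Embarked'])['Survived'].mean().unstack()

print(by_port)
print(by_class_and_port)
```

In this toy table port C has the higher overall rate only because more of its passengers are in first class; within each class the two ports are identical. Running the same two lines on `df` would show how much of the Cherbourg difference survives the same comparison.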
The data suggests that women and children were indeed more likely to survive than men. In addition, passengers in first and second class appear to have survived at a substantially higher rate than those in third class. Breaking down our sample by Man/Woman/Child and Class, we have the following survival rates:
#Build a combined label such as 'Woman, First class'; the cells below rely on this column
df['MWC_Class']=df['Pclass'].map({1: 'First class', 2: 'Second class', 3: 'Third class'})
df['MWC_Class']=df.apply(lambda x: x['MWC']+', '+x['MWC_Class'], axis = 1)
df.head()
print('Demographic Counts')
print(df.groupby('MWC_Class')['Survived'].count())
print('Survival Rates')
print(df.groupby('MWC_Class')['Survived'].mean())
The tables above list the counts and survival rates for each demographic (Man/Woman/Child + Class) in our sample. We note that the $79$ women in first class survived at a remarkably high rate of $97.5\%$, the $64$ women in second class survived at a similarly high rate of $90.6\%$, and the small number of children in first and second class survived at a high rate of $>88.9\%$. The survival rates for all other demographics were less than $50\%$, with a strikingly low percentage of men in second class ($6.8\%$) and men in third class ($13.0\%$) surviving.
We tentatively conclude that survival was highly correlated with gender/age (Man vs. Woman vs. Child) and fare class (First vs. Second vs. Third), but at this early stage we cannot infer any causal effects. We would speculate, though, that passengers in first and second class were closer to the lifeboats, and of those passengers, women and children were more likely to be put into a lifeboat. In future work, we would try to establish confidence intervals for our estimates of the survival mean for each demographic and compare these survival statistics with a different sample population.
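As a first step toward the confidence intervals mentioned above, the following sketch computes a Wilson score interval for a single survival rate. The Wilson interval is our choice of method (it behaves better than the simple normal interval for small groups and rates near $0$ or $1$), the function name `wilson_interval` is hypothetical, and the survivor count is reconstructed from the rounded rate reported above ($97.5\%$ of $79$ women in first class is about $77$).

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p_hat = successes / float(n)
    denom = 1 + z ** 2 / float(n)
    # The center is pulled slightly toward 0.5 relative to the raw rate.
    center = (p_hat + z ** 2 / (2.0 * n)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / n + z ** 2 / (4.0 * n ** 2)
    ) / denom
    return center - half_width, center + half_width

# Approximate count reconstructed from the rounded percentage (assumption).
low, high = wilson_interval(77, 79)
print('Women, first class: %.3f to %.3f' % (low, high))
```

Applying the same function to each row of the demographic table would show which groups have rates that are well separated and which, such as the small groups of children, have intervals too wide to support firm conclusions.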