Questions:

1. Did women and children survive at a higher rate than men?

Were the women and children first to the lifeboats? We compare the demographics of the survivors with the demographics of all passengers.

2. What other factors were strongly correlated with survival?

How strongly does each feature correlate with survival? Answering this question would allow us to construct a naïve classifier for passenger survival.

Data Cleaning

In [2]:
import numpy as np
import pandas as pd

df=pd.read_csv('titanic.csv')
df.head()
Out[2]:
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S
In [3]:
df.describe()
/projects/sage/sage-6.10/local/lib/python2.7/site-packages/numpy/lib/function_base.py:3834: RuntimeWarning: Invalid value encountered in percentile
  RuntimeWarning)
Out[3]:
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 NaN 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 NaN 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 NaN 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
In [4]:
print df.Cabin.describe()
print df.Ticket.describe()
print df.Embarked.describe()
count             204
unique            147
top       C23 C25 C27
freq                4
Name: Cabin, dtype: object
count          891
unique         681
top       CA. 2343
freq             7
Name: Ticket, dtype: object
count     889
unique      3
top         S
freq      644
Name: Embarked, dtype: object

We notice several extraneous features in our data. First, 'Name' and 'PassengerId' each uniquely identify a passenger. Since we are not interested in reconstructing family trees in this exploration, we discard both features. Next, 'Ticket' and 'Cabin' seem unlikely to correlate in any significant way with survival. While the latter feature may have some impact on survivability (we would expect a positive correlation between the closeness of a cabin to the lifeboats and survivability), the 'Cabin' column is missing over $75\%$ of its entries. Therefore, we discard both of these features.

In [5]:
del df['PassengerId']
del df['Name']
del df['Cabin']
del df['Ticket']
df.head()
Out[5]:
Survived Pclass Sex Age SibSp Parch Fare Embarked
0 0 3 male 22.0 1 0 7.2500 S
1 1 1 female 38.0 1 0 71.2833 C
2 1 3 female 26.0 0 0 7.9250 S
3 1 1 female 35.0 1 0 53.1000 S
4 0 3 male 35.0 0 0 8.0500 S
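
The missing-data fractions used in this cleaning step can be checked with simple arithmetic. The sketch below is an illustration only: it hard-codes the non-null counts printed by the describe() calls above (891 rows in total), and the names TOTAL_ROWS, non_null, and missing_fraction are made up for this example.

```python
# Non-null counts copied from the describe() outputs above.
TOTAL_ROWS = 891
non_null = {'Age': 714, 'Cabin': 204, 'Ticket': 891, 'Embarked': 889}

# Fraction of rows with no value recorded, per column.
missing_fraction = {col: 1.0 - float(count) / TOTAL_ROWS
                    for col, count in non_null.items()}

# Print the columns from most to least incomplete.
for col in sorted(missing_fraction, key=missing_fraction.get, reverse=True):
    print('%-8s %5.1f%% missing' % (col, 100 * missing_fraction[col]))
```

On the DataFrame itself, calling `df.isnull().mean()` before the columns are dropped should give the same fractions in one step.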

Finally, we note that 'Embarked' is missing two data points. We fill the missing points with S for Southampton, which was the first point of departure [4].

In [6]:
#df['Sex']=df['Sex'].map({'female':1,'male':0})
#df['Embarked']=df['Embarked'].map({'S':0,'C':1,'Q':2})
df['Embarked']=df['Embarked'].fillna('S')
df['Embarked'].describe()
Out[6]:
count     891
unique      3
top         S
freq      646
Name: Embarked, dtype: object

The column statistics show us two problems. First, the minimum Fare was $\$0$. It is unclear whether this is an error in the data or whether some fares were free. Since this only affects $15$ data points, we ignore this issue. Next, the 'Age' column is missing $177$ entries. The survival rate for the passengers with no age data appears to be substantially lower than the overall survival rate ($29\%$ vs $38\%$). Therefore, filling missing entries with the sample average may skew our analysis. We also note that the entries with missing age are disproportionately male (about $70\%$ vs the population's $65\%$).

In [7]:
print df.describe()
print '------------- '
print 'Missing ages:'
print df[df['Age'].isnull()].describe()
print '------------- '
print '$0 Fares:'
print df[df['Fare'] == 0].describe()
         Survived      Pclass         Age       SibSp       Parch        Fare
count  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean     0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std      0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min      0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%      0.000000    2.000000         NaN    0.000000    0.000000    7.910400
50%      0.000000    3.000000         NaN    0.000000    0.000000   14.454200
75%      1.000000    3.000000         NaN    1.000000    0.000000   31.000000
max      1.000000    3.000000   80.000000    8.000000    6.000000  512.329200
------------- 
Missing ages:
         Survived      Pclass  Age       SibSp       Parch        Fare
count  177.000000  177.000000  0.0  177.000000  177.000000  177.000000
mean     0.293785    2.598870  NaN    0.564972    0.180791   22.158567
std      0.456787    0.763216  NaN    1.626316    0.534145   31.874608
min      0.000000    1.000000  NaN    0.000000    0.000000    0.000000
25%      0.000000    3.000000  NaN    0.000000    0.000000    7.750000
50%      0.000000    3.000000  NaN    0.000000    0.000000    8.050000
75%      1.000000    3.000000  NaN    0.000000    0.000000   24.150000
max      1.000000    3.000000  NaN    8.000000    2.000000  227.525000
------------- 
$0 Fares:
        Survived     Pclass        Age  SibSp  Parch  Fare
count  15.000000  15.000000   7.000000   15.0   15.0  15.0
mean    0.066667   1.933333  35.142857    0.0    0.0   0.0
std     0.258199   0.798809  10.023781    0.0    0.0   0.0
min     0.000000   1.000000  19.000000    0.0    0.0   0.0
25%     0.000000   1.000000        NaN    0.0    0.0   0.0
50%     0.000000   2.000000        NaN    0.0    0.0   0.0
75%     0.000000   2.500000        NaN    0.0    0.0   0.0
max     1.000000   3.000000  49.000000    0.0    0.0   0.0

We further investigate how far the survival statistics of the age-less subsample differ from the survival statistics of the overall sample by grouping by gender:

In [8]:
print 'Survival statistics, by sex, for passengers with Age'
print df[df['Age'].notnull()].groupby('Sex')['Survived'].mean()
print 'Survival statistics, by sex, for passengers missing Age'
print df[df['Age'].isnull()].groupby('Sex')['Survived'].mean()
Survival statistics, by sex, for passengers with Age
Sex
female    0.754789
male      0.205298
Name: Survived, dtype: float64
Survival statistics, by sex, for passengers missing Age
Sex
female    0.679245
male      0.129032
Name: Survived, dtype: float64

We still note some differences, although it is uncertain if they are statistically significant. For a finer analysis, we could try to fill the missing values with the mean or median of similar samples, but for now we simply discard the entries missing an age.
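
One way to put a number on that uncertainty is a two-proportion z-test. The sketch below is an addition to the original analysis and uses only the standard library. The survivor counts are back-computed from the rates and group sizes printed above (for example, 0.293785 x 177 is approximately 52), and the per-sex group sizes (261 and 53 women, 453 and 124 men) are inferred from those rates, so all of them should be treated as assumptions. The helper name two_proportion_z is hypothetical.

```python
import math

def two_proportion_z(k1, n1, k2, n2):
    """Return (z, two-sided p) for H0: both groups share one survival rate."""
    p1, p2 = float(k1) / n1, float(k2) / n2
    pooled = float(k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1.0 / n1 + 1.0 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # normal approximation
    return z, p_value

# (survivors, group size) for passengers with Age vs. passengers missing Age.
# Counts are inferred from the printed rates, not read from the data.
groups = {
    'all':    ((290, 714), (52, 177)),
    'female': ((197, 261), (36, 53)),
    'male':   ((93, 453), (16, 124)),
}

results = {}
for name in ('all', 'female', 'male'):
    (k1, n1), (k2, n2) = groups[name]
    results[name] = two_proportion_z(k1, n1, k2, n2)
    print('%-6s z = %5.2f, p = %.3f' % ((name,) + results[name]))
```

With these inputs the pooled comparison gives a z value well above 2, while neither within-sex comparison reaches the conventional 1.96 threshold. That would be compatible with part of the pooled gap coming from the different mix of men and women in the two groups, though the test alone cannot confirm it.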

In [9]:
df=df[df['Age'].notnull()]
df.describe()
Out[9]:
Survived Pclass Age SibSp Parch Fare
count 714.000000 714.000000 714.000000 714.000000 714.000000 714.000000
mean 0.406162 2.236695 29.699118 0.512605 0.431373 34.694514
std 0.491460 0.838250 14.526497 0.929783 0.853289 52.918930
min 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 0.000000 1.000000 20.125000 0.000000 0.000000 8.050000
50% 0.000000 2.000000 28.000000 0.000000 0.000000 15.741700
75% 1.000000 3.000000 38.000000 1.000000 1.000000 33.375000
max 1.000000 3.000000 80.000000 5.000000 6.000000 512.329200

Our primary question was whether women and children were more likely to survive than men. It will be convenient to group our data into these demographic groups, so we create a new feature, 'MWC', for "Man, Woman, or Child". We define a child as any passenger of age $\leq 16$, regardless of sex.

In [10]:
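def wmc_bin(passenger):
    #Label a passenger row as 'Child' (age <= 16), 'Man', or 'Woman'.
    if passenger['Age'] <= 16:
        return 'Child'
    if passenger['Sex'] == 'male':
        return 'Man'
    if passenger['Sex'] == 'female':
        return 'Woman'
    #Fall back to an empty label for an unexpected 'Sex' value.
    return ''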
df['MWC']=df.apply(wmc_bin,axis = 1)
df.head()
Out[10]:
Survived Pclass Sex Age SibSp Parch Fare Embarked MWC
0 0 3 male 22.0 1 0 7.2500 S Man
1 1 1 female 38.0 1 0 71.2833 C Woman
2 1 3 female 26.0 0 0 7.9250 S Woman
3 1 1 female 35.0 1 0 53.1000 S Woman
4 0 3 male 35.0 0 0 8.0500 S Man

With this new feature, we can describe the breakdown of the survivors in terms of Men/Women/Children. The graph shows that only $24.48\%$ of the survivors were men, and that most survivors were Women and Children.

In [11]:
import seaborn
#Summing the 'Survived' column over each category gives the total count of survivors in that category.
df.groupby('MWC')['Survived'].sum().plot(kind = 'pie', autopct='%.2f', fontsize=25)
/projects/sage/sage-6.10/local/lib/python2.7/site-packages/matplotlib-1.5.0-py2.7-linux-x86_64.egg/matplotlib/__init__.py:872: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.
  warnings.warn(self.msg_depr % (key, alt_key))
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5c0dc581d0>

Of course, this does not completely answer our question. To infer if Women and Children were actually given priority in the rush to the lifeboats, we have to better understand the statistics of our population. We determine the number of Men, Women, and Children in our sample, and also determine what proportion of each demographic survived.

In [23]:
def field_count(data,field,flags):
    #This function counts the number of entries in our data with field = flag, returning a tuple of counts for each flag in flags
    out=[]
    for flag in flags:
        out.append(len(data[data[field] == flag]))
    return tuple(out)

M,W,C = field_count(df,'MWC',['Man','Woman','Child'])
print 'There were '+str(M)+' Men, '+str(W)+' Women, and '+str(C)+' Children.'
df.groupby('MWC')['Survived'].mean()
There were 402 Men, 212 Women, and 100 Children.
Out[23]:
MWC
Child    0.550000
Man      0.176617
Woman    0.773585
Name: Survived, dtype: float64

We see that $17.7\%$ of the $402$ men survived, $77.4\%$ of the $212$ women survived, and $55\%$ of the $100$ children survived.

The higher survival rates for women and children are consistent with women and children being first to the lifeboats, although the rates alone cannot establish this.
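
A chi-square test of independence is one simple way to check that these differences are larger than chance alone would produce. The following sketch is an addition to the original analysis. The survivor counts (71, 164, 55) are back-computed from the rates and group sizes above, so they are assumptions, and chi_square_independence is a hypothetical helper name. For a table with two degrees of freedom the p-value has the closed form exp(-chi2/2), which avoids the need for SciPy.

```python
import math

# (survived, died) per group; survivors inferred as rate * group size.
table = {
    'Man':   (71, 402 - 71),
    'Woman': (164, 212 - 164),
    'Child': (55, 100 - 55),
}

def chi_square_independence(rows):
    """Pearson chi-square statistic for a list of (a, b) count pairs."""
    col_totals = [sum(r[j] for r in rows) for j in range(2)]
    grand = float(sum(col_totals))
    stat = 0.0
    for row in rows:
        row_total = sum(row)
        for j in range(2):
            expected = row_total * col_totals[j] / grand
            stat += (row[j] - expected) ** 2 / expected
    return stat

chi2 = chi_square_independence(list(table.values()))
dof = (len(table) - 1) * (2 - 1)      # 2 degrees of freedom here
p_value = math.exp(-chi2 / 2.0)       # exact only when dof == 2
print('chi2 = %.1f on %d dof, p = %.2g' % (chi2, dof, p_value))
```

A statistic this large relative to two degrees of freedom would indicate that survival and the Man/Woman/Child label are far from independent in this sample. It says nothing about why.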

Exploratory Analysis continued

First, we analyze correlations between numerical features:

In [34]:
df.corr()
Out[34]:
Survived Pclass Age SibSp Parch Fare
Survived 1.000000 -0.359653 -0.077221 -0.017358 0.093317 0.268189
Pclass -0.359653 1.000000 -0.369226 0.067247 0.025683 -0.554182
Age -0.077221 -0.369226 1.000000 -0.308247 -0.189119 0.096067
SibSp -0.017358 0.067247 -0.308247 1.000000 0.383820 0.138329
Parch 0.093317 0.025683 -0.189119 0.383820 1.000000 0.205119
Fare 0.268189 -0.554182 0.096067 0.138329 0.205119 1.000000
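
To make the Survived/Pclass entry of the matrix above concrete, the sketch below recomputes a Pearson correlation from grouped counts using only the standard library. It is an illustration, not part of the original notebook. The per-class totals and survivor counts are assumptions obtained by summing the Man/Woman/Child-by-class table reported in the conclusions (for example, 9 + 98 + 79 = 186 first class passengers), and pearson_from_groups is a hypothetical helper.

```python
import math

# Pclass -> (passengers, survivors); summed from the demographic table.
by_class = {1: (186, 122), 2: (173, 83), 3: (355, 85)}

def pearson_from_groups(groups):
    """Pearson r between a numeric group label and a 0/1 outcome."""
    n = float(sum(total for total, _ in groups.values()))
    mean_x = sum(x * total for x, (total, _) in groups.items()) / n
    mean_y = sum(hits for _, hits in groups.values()) / n
    # For a 0/1 outcome, sum(x * y) is the sum of x over the "1" rows.
    mean_xy = sum(x * hits for x, (_, hits) in groups.items()) / n
    mean_x2 = sum(x * x * total for x, (total, _) in groups.items()) / n
    cov = mean_xy - mean_x * mean_y
    var_x = mean_x2 - mean_x ** 2
    var_y = mean_y * (1 - mean_y)
    return cov / math.sqrt(var_x * var_y)

r = pearson_from_groups(by_class)
print('corr(Survived, Pclass) = %.4f' % r)
```

If the assumed counts are right, the result should agree with the value of about $-0.36$ in the matrix. The sign is negative because first class is coded as 1 and third class as 3.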

We notice another strong trend in the data. There is a pronounced negative correlation ($-0.36$) between survival and fare class; because first class is coded as $1$ and third class as $3$, a negative value means that passengers in the higher classes survived more often. Further investigation shows that, of the three fare classes, a substantially higher proportion of first and second class passengers survived, compared to the third class passengers. The following histogram plots the total number of passengers in each fare class (Blue) and the total number of survivors in each fare class (Green). From this, we can see that over half of the first class passengers in our sample survived, roughly half of the second class passengers survived, and less than a quarter of the third class passengers survived.

In [13]:
df['Pclass'].hist(bins=3) 
df[df['Survived'] == 1]['Pclass'].hist(bins=3)
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5c30bcc910>

We see the same trend when we plot by Fare (omitting, for the sake of the visualization, the outlier high fares). A low percentage of the passengers who paid fares under $\$25$ survived, while passengers who paid higher fares had much higher survival rates. Of course, this is just a higher resolution picture of the first image, as the Fare is strongly tied to class (correlation $-0.55$).

In [14]:
df[df['Fare'] < 300]['Fare'].hist(bins=12)
df[(df['Survived'] == 1) & (df['Fare']< 300)]['Fare'].hist(bins=12)
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5c0dc2aa50>

It appears that high fare/first class passengers had the highest survival rates.

Moving on, it remains to investigate whether the point of embarkation had any influence on the survival rate.

In [36]:
df.groupby('Embarked')['Survived'].mean()
Out[36]:
Embarked
C    0.607692
Q    0.285714
S    0.365108
Name: Survived, dtype: float64

It appears that passengers departing from Cherbourg, France, survived at a higher rate than passengers departing from Southampton or Queenstown. We investigate this further, to determine if this is a genuine signal or if other hidden factors can explain the discrepancy.

In [40]:
df.groupby('Embarked')['Pclass'].describe()
Out[40]:
Embarked       
C         count    130.000000
          mean       1.746154
          std        0.909140
          min        1.000000
          25%        1.000000
          50%        1.000000
          75%        3.000000
          max        3.000000
Q         count     28.000000
          mean       2.785714
          std        0.568112
          min        1.000000
          25%        3.000000
          50%        3.000000
          75%        3.000000
          max        3.000000
S         count    556.000000
          mean       2.323741
          std        0.784681
          min        1.000000
          25%        2.000000
          50%        3.000000
          75%        3.000000
          max        3.000000
Name: Pclass, dtype: float64
In [42]:
df.groupby('Embarked')['Sex'].describe()
Out[42]:
Embarked        
C         count      130
          unique       2
          top       male
          freq        69
Q         count       28
          unique       2
          top       male
          freq        16
S         count      556
          unique       2
          top       male
          freq       368
Name: Sex, dtype: object

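It appears that most passengers departing from Cherbourg were in first or second class: the median class for Cherbourg is $1$, whereas it is $3$ for both Southampton and Queenstown. Cherbourg also had a smaller share of men ($69$ of $130$, about $53\%$) than Southampton ($368$ of $556$, about $66\%$). Thus, we would tentatively conclude that the higher proportion of Cherbourg passengers surviving was an artifact of the higher proportion of Cherbourg passengers riding in first or second class, possibly together with the different mix of men and women.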
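
A rough way to probe this explanation is direct standardisation: predict each port's survival rate from its class mix alone and compare it with the observed rate. The sketch below is an addition, not part of the original notebook. It recovers each port's class mix from the count, mean, and sample standard deviation printed by describe() above, and takes class-level survival rates from sums over the demographic table in the conclusions. Both steps are assumptions, and the helper names class_mix and class_only_rate are hypothetical.

```python
# Per-port Pclass summary from describe(): (count, mean, sample std).
pclass_summary = {
    'C': (130, 1.746154, 0.909140),
    'Q': (28, 2.785714, 0.568112),
    'S': (556, 2.323741, 0.784681),
}
observed_rate = {'C': 0.607692, 'Q': 0.285714, 'S': 0.365108}

# Class-level survival rates, summed from the demographic table (assumed).
class_rate = {1: 122.0 / 186, 2: 83.0 / 173, 3: 85.0 / 355}

def class_mix(n, mean, std):
    """Solve for (n1, n2, n3) given that Pclass only takes values 1, 2, 3."""
    s1 = n * mean                               # sum of Pclass
    s2 = (n - 1) * std ** 2 + n * mean ** 2     # sum of Pclass squared
    n3 = int(round((s2 - 3 * s1 + 2 * n) / 2.0))
    n2 = int(round(s1 - n - 2 * n3))
    return n - n2 - n3, n2, n3

def class_only_rate(mix):
    """Survival rate expected if class alone determined survival."""
    weighted = sum(k * class_rate[c] for c, k in zip((1, 2, 3), mix))
    return weighted / float(sum(mix))

expected_rate = {}
for port in sorted(pclass_summary):
    mix = class_mix(*pclass_summary[port])
    expected_rate[port] = class_only_rate(mix)
    print('%s mix=%s expected=%.3f observed=%.3f'
          % (port, mix, expected_rate[port], observed_rate[port]))
```

Under these assumptions the class-only prediction for Cherbourg comes out near $0.50$, above the overall rate of $0.41$ but below the observed $0.61$, which would suggest that class mix explains part of the Cherbourg advantage rather than all of it. With the DataFrame available, `df.groupby(['Embarked', 'Pclass'])['Survived'].mean()` gives the stratified rates directly.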

Tentative conclusions:

The data suggests that women and children were indeed more likely to survive than men. In addition, passengers in first and second class appear to have survived at a substantially higher rate than those in third class. Breaking down our sample by Man/Woman/Child and Class, we have the following survival rates:

In [28]:
#Combine the Man/Woman/Child label with the fare class, e.g. 'Man, First class'.
df['MWC_Class']=df['Pclass'].map({1: 'First class', 2: 'Second class', 3: 'Third class'})
df['MWC_Class']=df.apply(lambda x: x['MWC']+', '+x['MWC_Class'], axis = 1)
#df.head()
print 'Demographic Counts'
print (df.groupby('MWC_Class')['Survived']).count()
print 'Survival Rates'
print df.groupby('MWC_Class')['Survived'].mean()
Demographic Counts
MWC_Class
Child, First class       9
Child, Second class     21
Child, Third class      70
Man, First class        98
Man, Second class       88
Man, Third class       216
Woman, First class      79
Woman, Second class     64
Woman, Third class      69
Name: Survived, dtype: int64
Survival Rates
MWC_Class
Child, First class     0.888889
Child, Second class    0.904762
Child, Third class     0.400000
Man, First class       0.377551
Man, Second class      0.068182
Man, Third class       0.129630
Woman, First class     0.974684
Woman, Second class    0.906250
Woman, Third class     0.420290
Name: Survived, dtype: float64

The tables above list the counts and survival rates for each demographic (Man/Woman/Child + Class) in our sample. We note that the $79$ women in first class survived at a remarkably high rate of $97.5\%$, the $64$ women in second class survived at a similarly high rate of $90.6\%$, and the small number of children in first and second class survived at high rates of $88.9\%$ and $90.5\%$, respectively. The survival rates for all other demographics were less than $50\%$, with a strikingly low percentage of men in second class ($6.8\%$) and men in third class ($13.0\%$) surviving.

We tentatively conclude that survival was highly correlated with gender/age (Man vs. Woman vs. Child) and fare class (First vs. Second vs. Third), but at this early stage we cannot infer any causal effects. We would speculate, though, that passengers in first and second class were closer to the lifeboats, and that, of those passengers, women and children were more likely to be put into a lifeboat. In future work, we would try to establish confidence intervals for our estimates of the survival rate for each demographic and compare these survival statistics with a different sample population.
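
As a first step toward the confidence intervals mentioned above, the sketch below computes 95% Wilson score intervals for each demographic using only the standard library. It is an illustration rather than part of the original analysis: the survivor counts are back-computed from the printed rates and group sizes, the choice of the Wilson interval is an assumption, and wilson_interval is a hypothetical helper name.

```python
import math

# label -> (survivors, group size); survivors = round(rate * size).
demographics = {
    'Child, First class':  (8, 9),
    'Child, Second class': (19, 21),
    'Child, Third class':  (28, 70),
    'Man, First class':    (37, 98),
    'Man, Second class':   (6, 88),
    'Man, Third class':    (28, 216),
    'Woman, First class':  (77, 79),
    'Woman, Second class': (58, 64),
    'Woman, Third class':  (29, 69),
}

def wilson_interval(k, n, z=1.96):
    """Wilson score interval for a binomial proportion k/n."""
    p = float(k) / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4.0 * n ** 2)) / denom
    return centre - half, centre + half

intervals = {}
for label in sorted(demographics):
    k, n = demographics[label]
    intervals[label] = wilson_interval(k, n)
    print('%-20s %.3f  [%.3f, %.3f]'
          % ((label, float(k) / n) + intervals[label]))
```

The interval for a small group such as the nine first class children is wide, so its survival rate is much less certain than the rates of the larger groups.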