GitHub Repository: suyashi29/python-su
Path: blob/master/Advanced Data Analysis using Python/Titanic Survival Prediction EDA and Data Modeling.ipynb
Kernel: Python 3 (ipykernel)

Titanic Survival Prediction: EDA and Data Modeling

What is EDA?

Exploratory Data Analysis (EDA) is the process of examining datasets to summarize their main characteristics, often using visual methods. It helps to:

  • Understand the structure of the data.

  • Detect missing values or outliers.

  • Identify patterns and relationships between variables.

  • Formulate hypotheses based on visual insights.

Advantages of EDA:

  • Improves understanding of data structure and quality.

  • Supports feature selection and engineering by highlighting important variables.

  • Guides data cleaning by identifying missing or inconsistent data.

  • Informs model choice and evaluation strategies.


What is Data Modeling?

Data modeling involves applying mathematical and statistical techniques to data to build predictive or descriptive models. In this context, we are using a supervised learning technique where:

  • The features (X) are used to predict the target (y).

  • The target variable is whether the passenger survived (binary classification: 0 or 1).


What is Logistic Regression?

Logistic Regression is a classification algorithm used to predict binary outcomes. Unlike linear regression, which predicts continuous values, logistic regression estimates the probability that a given input belongs to a particular category.

Formula:

The logistic regression model uses the sigmoid function:

$$P(Y=1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n)}}$$

Where:

  • $P$ is the probability of survival.

  • $X_i$ are the feature variables.

  • $\beta_i$ are the coefficients learned from the data.
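
To make the formula concrete, here is a minimal sketch (not from the original notebook) that evaluates the sigmoid for made-up coefficients and feature values; in practice, LogisticRegression learns the $\beta_i$ values from the training data during fitting.

import numpy as np

def sigmoid(z):
    # Map a real-valued score z to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Made-up intercept, coefficients and feature values, purely for illustration
beta_0, beta_1, beta_2 = 0.5, -2.0, -0.8
x1, x2 = 1, 3

z = beta_0 + beta_1 * x1 + beta_2 * x2   # linear combination of the features
print(sigmoid(z))                        # estimated P(Y=1 | X) ≈ 0.02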

Why Logistic Regression?

  • Simple and interpretable.

  • Performs well on linearly separable data.

  • Useful baseline model for binary classification tasks.


Steps Followed in the Titanic Project:

1. Import Libraries

We imported essential libraries for data analysis, visualization, and modeling.

2. Load Dataset

Used the dataset available at: https://raw.githubusercontent.com/suyashi29/python-su/refs/heads/master/Data%20Visualization%20using%20Python/titanic.csv

3. Initial Inspection

  • Checked data shape and summary.

  • Looked at data types and statistical properties.

4. Drop Unnecessary Columns

Removed columns that do not contribute to prediction (PassengerId, Name, Ticket, Cabin).

5. Handle Missing Values

  • Filled missing values in Age with the median.

  • Filled missing values in Embarked with the mode.

6. Feature Engineering

  • Created a new column FamilySize = SibSp + Parch + 1.

  • Converted categorical features (Sex, Embarked) to numeric using LabelEncoder.

7. Visualization (EDA)

  • Used seaborn and matplotlib to explore:

    • Survival distribution.

    • Age distribution.

    • Correlation heatmap.

    • Survival by gender.

8. Data Preparation

  • Separated features and target.

  • Scaled features using StandardScaler.

  • Performed train-test split (80/20).

9. Modeling

  • Used LogisticRegression to fit the model on training data.

10. Evaluation

  • Evaluated the model using accuracy, confusion matrix, and classification report.

# 1. Importing libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
# Set custom color palette
custom_palette = ["orange", "yellow", "pink", "green"]
sns.set_palette(custom_palette)
# Load the data
data_url = "https://raw.githubusercontent.com/suyashi29/python-su/refs/heads/master/Data%20Visualization%20using%20Python/titanic.csv"
df = pd.read_csv(data_url)
## Displaying data
df.head()
# Initial inspection
print("Initial Data Shape:", df.shape)
Initial Data Shape: (891, 12)
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Name         891 non-null    object
 3   Sex          891 non-null    object
 4   Age          714 non-null    float64
 5   SibSp        891 non-null    int64
 6   Parch        891 non-null    int64
 7   Fare         891 non-null    float64
 8   Cabin        204 non-null    object
 9   Embarked     889 non-null    object
 10  Pclass       891 non-null    int64
 11  Ticket       891 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None
## Statistical description
df.describe()
df.describe(include="object")
# 4. Drop unnecessary columns
df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
df
# Handle missing values
print("Missing values before:")
print(df.isnull().sum())
Missing values before:
Survived      0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Embarked      2
Pclass        0
dtype: int64
df['Age'].fillna(df['Age'].median(), inplace=True)
df['Embarked'].fillna(df['Embarked'].mode()[0], inplace=True)
print("Missing values after:")
df.isnull().sum()
Missing values after:
Survived    0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0
Pclass      0
dtype: int64
df

Feature Engineering

# Creating new feature: FamilySize
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
# Creating Age Groups
def age_group(age):
    if age < 1:
        return 'Infant'
    elif age < 12:
        return 'Child'
    elif age < 24:
        return 'Young'
    elif age < 60:
        return 'Adult'
    else:
        return 'Senior'

df['AgeGroup'] = df['Age'].apply(age_group)
df['AgeGroup'].describe()
count       891
unique        5
top       Adult
freq        618
Name: AgeGroup, dtype: object
# Encode 'Sex', 'Embarked', 'AgeGroup'
le = LabelEncoder()
df['Sex'] = le.fit_transform(df['Sex'])
df['Embarked'] = le.fit_transform(df['Embarked'])
df['AgeGroup'] = le.fit_transform(df['AgeGroup'])
df

Visualizations

sns.set(style="darkgrid")

# Donut Chart - Survival Distribution
survived_counts = df['Survived'].value_counts()
plt.figure(figsize=(6, 6))
plt.pie(survived_counts, labels=['Did Not Survive', 'Survived'],
        colors=custom_palette[:2], startangle=90, wedgeprops={'width': 0.4})
plt.title('Survival Distribution (Donut Chart)')
plt.show()
Image in a Jupyter notebook

More passengers lost their lives in the incident than survived.

# Stacked Bar - Survival by Sex
plt.figure(figsize=(10, 6))
df_temp = df.copy()
df_temp['SexLabel'] = df_temp['Sex'].replace({0: 'Female', 1: 'Male'})
sns.histplot(data=df_temp, x='SexLabel', hue='Survived', multiple='stack',
             palette=custom_palette[:2], shrink=0.8)
plt.title('Survival by Gender (Stacked Bar)')
plt.show()
Image in a Jupyter notebook

More males were travelling, but a larger share of the females survived; being female substantially increased the chance of survival.

# Scatter Plot - Age vs Fare colored by Survival
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='Age', y='Fare', hue='Survived', palette=custom_palette[:2])
plt.title('Scatter Plot of Age vs Fare (Colored by Survival)')
plt.show()
Image in a Jupyter notebook
# Correlation Heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.title('Feature Correlation')
plt.show()
Image in a Jupyter notebook

Preparing data for modeling

df
# Drop SibSp and Parch, which are now captured by FamilySize
df.drop(['SibSp', 'Parch'], axis=1, inplace=True)
df
X = df.drop('Survived', axis=1)  # input features
y = df['Survived']               # outcome (target)
X
# Feature scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

Data Modelling:

  • X_train, y_train: used to train (fit) the model.

  • X_test: fed to the trained model to obtain the predictions y_pred.

  • y_test: held back for evaluation; comparing y_pred against y_test shows how accurate the model is.

# 9. Modeling - Logistic Regression
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

Evaluation

  • Accuracy: total correctly classified / total instances.

  • Confusion matrix: counts of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN).

  • Accuracy = (TP + TN) / (TP + TN + FP + FN)

  • Precision = TP / (TP + FP)

  • Recall = TP / (TP + FN)
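
As a small aside (not part of the original notebook), the sketch below shows how scikit-learn lays out the confusion matrix, with rows as actual classes and columns as predicted classes, so for labels [0, 1] it is [[TN, FP], [FN, TP]], and how the metrics above follow from it, using made-up predictions.

from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

# Toy ground truth and predictions, made up for illustration only
y_true = [0, 0, 1, 1, 1, 0]
y_hat  = [0, 1, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_hat)      # [[TN, FP], [FN, TP]] for labels [0, 1]
tn, fp, fn, tp = cm.ravel()

print(cm)                                           # [[2 1]
                                                    #  [1 2]]
print((tp + tn) / (tp + tn + fp + fn))              # accuracy  = 4/6 ≈ 0.667
print(accuracy_score(y_true, y_hat))                # same value from sklearn
print(tp / (tp + fp), precision_score(y_true, y_hat))  # precision = 2/3
print(tp / (tp + fn), recall_score(y_true, y_hat))     # recall    = 2/3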

print("Accuracy:", accuracy_score(y_test, y_pred))
Accuracy: 0.7988826815642458

For roughly 80% of the passengers in the test set, the model predicts the correct survival outcome (1 or 0).

print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
Confusion Matrix:
 [[90 15]
 [21 53]]
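
As a quick sanity check, reading the matrix with rows as actual and columns as predicted classes gives TN = 90, FP = 15, FN = 21, TP = 53, and plugging these counts into the formulas above reproduces the accuracy printed earlier and the class-1 scores in the classification report below:

$$\text{Accuracy} = \frac{90 + 53}{179} \approx 0.80, \qquad \text{Precision}_1 = \frac{53}{53 + 15} \approx 0.78, \qquad \text{Recall}_1 = \frac{53}{53 + 21} \approx 0.72$$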
print("Classification Report:\n", classification_report(y_test, y_pred))
Classification Report:
               precision    recall  f1-score   support

           0       0.81      0.86      0.83       105
           1       0.78      0.72      0.75        74

    accuracy                           0.80       179
   macro avg       0.80      0.79      0.79       179
weighted avg       0.80      0.80      0.80       179

A precision of 0.81 for class 0 means that, of all the passengers the model predicted as not surviving, 81% actually did not survive and about 19% were misclassified; likewise, a precision of 0.78 for class 1 means that 78% of the passengers predicted to survive actually did.