Titanic Survival Prediction: EDA and Data Modeling
What is EDA?
Exploratory Data Analysis (EDA) is the process of examining datasets to summarize their main characteristics, often using visual methods. It helps to:
Understand the structure of the data.
Detect missing values or outliers.
Identify patterns and relationships between variables.
Formulate hypotheses based on visual insights.
Advantages of EDA:
Improves understanding of data structure and quality.
Supports feature selection and engineering by highlighting important variables.
Guides data cleaning by identifying missing or inconsistent data.
Informs model choice and evaluation strategies.
What is Data Modeling?
Data modeling involves applying mathematical and statistical techniques to data to build predictive or descriptive models. In this context, we are using a supervised learning technique where:
The features (X) are used to predict the target (y).
The target variable is whether the passenger survived (binary classification: 0 or 1).
What is Logistic Regression?
Logistic Regression is a classification algorithm used to predict binary outcomes. Unlike linear regression, which predicts continuous values, logistic regression estimates the probability that a given input belongs to a particular category.
Formula:
The logistic regression model uses the sigmoid function:
$$P(Y=1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n)}}$$
Where:
$P(Y=1 \mid X)$ is the probability of survival.
$X_i$ are the feature variables.
$\beta_i$ are the coefficients learned from the data.
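As an illustration of the formula, here is a minimal sketch of the sigmoid applied to a linear combination of features; the coefficient and feature values below are purely hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients (beta_0 is the intercept) and feature values
beta = np.array([0.5, -1.2, 0.8])   # beta_0, beta_1, beta_2
x = np.array([1.0, 0.3, 2.0])       # 1 for the intercept, then X_1, X_2

print(sigmoid(beta @ x))            # estimated P(Y=1 | X)
```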
Why Logistic Regression?
Simple and interpretable.
Performs well on linearly separable data.
Useful baseline model for binary classification tasks.
Steps Followed in the Titanic Project:
1. Import Libraries
We imported essential libraries for data analysis, visualization, and modeling.
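To make the later steps concrete, here is a minimal sketch of the imports the notebook relies on (pandas, numpy, seaborn, matplotlib, and the scikit-learn classes named in the steps below):

```python
# Data handling and visualization
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Preprocessing, splitting, modeling, and evaluation from scikit-learn
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
```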
2. Load Dataset
Used the dataset available at: https://raw.githubusercontent.com/suyashi29/python-su/refs/heads/master/Data Visualization using Python/titanic.csv
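A sketch of loading the CSV with pandas; the space in the URL path is percent-encoded here so the request resolves, and the DataFrame name `df` is assumed in the examples that follow:

```python
# Load the Titanic dataset from the raw GitHub URL
url = ("https://raw.githubusercontent.com/suyashi29/python-su/refs/heads/master/"
       "Data%20Visualization%20using%20Python/titanic.csv")
df = pd.read_csv(url)
df.head()
```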
3. Initial Inspection
Checked data shape and summary.
Looked at data types and statistical properties.
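A quick inspection sketch covering the checks listed above:

```python
# Shape of the dataset (rows, columns)
print(df.shape)

# Column data types and non-null counts
df.info()

# Summary statistics for numeric columns
df.describe()

# Missing values per column
df.isnull().sum()
```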
4. Drop Unnecessary Columns
Removed columns that do not contribute to prediction (PassengerId, Name, Ticket, Cabin).
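A sketch of the column removal, assuming `df` from the load step:

```python
# Drop identifier-like columns with little predictive value
df = df.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"])
```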
5. Handle Missing Values
Filled missing values in Age with the median.
Filled missing values in Embarked with the mode.
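A sketch of the imputation described above:

```python
# Impute numeric Age with the median; categorical Embarked with the mode
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])
```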
6. Feature Engineering
Created a new column FamilySize = SibSp + Parch + 1.
Converted categorical features (Sex, Embarked) to numeric using LabelEncoder.
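A sketch of the feature-engineering step, using the `LabelEncoder` import from step 1:

```python
# FamilySize = passenger + siblings/spouses + parents/children aboard
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# Encode categorical columns as integers (a fresh fit per column)
le = LabelEncoder()
df["Sex"] = le.fit_transform(df["Sex"])
df["Embarked"] = le.fit_transform(df["Embarked"])
```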
7. Visualization (EDA)
Used seaborn and matplotlib to explore:
Survival distribution.
Age distribution.
Correlation heatmap.
Survival by gender.
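A sketch of the plots listed above, assuming the encoded `df` from the previous steps (exact styling choices are assumptions):

```python
# Survival distribution (0 = died, 1 = survived)
sns.countplot(x="Survived", data=df)
plt.show()

# Age distribution
sns.histplot(df["Age"], bins=30, kde=True)
plt.show()

# Correlation heatmap of numeric features
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.show()

# Survival by gender
sns.countplot(x="Sex", hue="Survived", data=df)
plt.show()
```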
8. Data Preparation
Separated features and target.
Scaled features using StandardScaler.
Performed train-test split (80/20).
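A sketch of the preparation step; the `random_state` value is an assumption made only for reproducibility:

```python
# Separate features (X) and target (y)
X = df.drop(columns=["Survived"])
y = df["Survived"]

# Standardize features to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 80/20 train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42  # random_state is an assumed value
)
```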
9. Modeling
Used LogisticRegression to fit the model on the training data.
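A sketch of the model fit; `max_iter=1000` is an assumption to give the solver ample iterations to converge:

```python
# Fit logistic regression on the training data
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict survival on the held-out test set
y_pred = model.predict(X_test)
```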
10. Evaluation
Evaluated the model using accuracy, confusion matrix, and classification report.
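A sketch of the evaluation, using the metric functions imported in step 1:

```python
# Overall accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))

# Confusion matrix: rows are actual classes, columns are predicted classes
print(confusion_matrix(y_test, y_pred))

# Per-class precision, recall, and F1-score
print(classification_report(y_test, y_pred))
```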
Feature Engineering
Visualizations
More passengers lost their lives in the incident than survived.
More males were travelling, but a larger proportion of females survived (being female was associated with a much higher chance of survival).
Preparing data for modeling
Data Modelling:
X_train, y_train: used for model training.
X_test: fed to the trained model to obtain predictions (y_pred).
y_test: held back for evaluation; comparing y_test with y_pred shows how accurate the model is.
Evaluation
Accuracy: total correctly classified / total instances
Confusion Matrix: TP, TN, FP, FN
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
An accuracy of about 0.80 means the model correctly predicts the survival outcome (1 or 0) for roughly 80% of the test instances.
A precision of 0.81 means that, of the passengers the model predicts as survivors, about 81% actually survived and the remaining 19% did not.