Applied Data Modelling using Gradio: 1.1 Data Modelling
Data Modelling is the process of structuring and organizing data to represent real-world entities, relationships, and business rules in a way that supports analysis, storage, and decision-making.
Think of it as a blueprint for data systems—it defines:
What data to store (entities)
How the data relate (relationships)
How the data are structured (schemas)
Types of Data Modelling
Conceptual Model (High-Level)
Focus: Business understanding
Example:
Customer → Orders → Products
Logical Model (Detailed Structure)
Defines attributes, keys, relationships
Example:
Customer(ID, Name, Email), Order(OrderID, Date, CustomerID)
Physical Model (Implementation)
Actual database tables, data types, constraints
Example:
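The physical model can be sketched directly as DDL. A minimal example using Python's built-in sqlite3 module (table and column names follow the logical model above; the in-memory database is a stand-in for a real server):

```python
import sqlite3

# In-memory database standing in for a real server
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Physical model: concrete tables, data types, and constraints
cur.execute("""
    CREATE TABLE Customer (
        ID    INTEGER PRIMARY KEY,
        Name  TEXT NOT NULL,
        Email TEXT UNIQUE
    )
""")
cur.execute("""
    CREATE TABLE "Order" (
        OrderID    INTEGER PRIMARY KEY,
        OrderDate  TEXT,
        CustomerID INTEGER NOT NULL,
        FOREIGN KEY (CustomerID) REFERENCES Customer(ID)
    )
""")
conn.commit()
```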
Two quick Python refreshers that surfaced in the notebook's exploratory cells: tuples are immutable, and dictionaries are looked up by key rather than by position.

```python
d = (1, 23)
d[2] = 9
# TypeError: 'tuple' object does not support item assignment
```

```python
d = {"a": 1}   # example contents; the original dict is not shown
d.keys()
d.values()
d[0]
# KeyError: 0  (0 is not a key of this dict)
```
Python Implementation (Using Pandas)
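A sketch of the Customer/Order logical model in pandas (the data values are invented for illustration):

```python
import pandas as pd

# Entity tables; Customer.ID is the primary key,
# Order.CustomerID is the foreign key linking back to Customer
customers = pd.DataFrame({
    "ID": [1, 2, 3],
    "Name": ["Asha", "Ravi", "Meera"],
    "Email": ["asha@x.com", "ravi@x.com", "meera@x.com"],
})
orders = pd.DataFrame({
    "OrderID": [101, 102, 103],
    "Date": ["2024-01-05", "2024-01-07", "2024-02-01"],
    "CustomerID": [1, 1, 3],
})

# Resolve the relationship: join each order to its customer
order_details = orders.merge(customers, left_on="CustomerID", right_on="ID")
print(order_details[["OrderID", "Name", "Date"]])
```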
Key Concepts in Data Modelling
Entity → Object (Customer, Product)
Attribute → Properties (Name, Price)
Primary Key → Unique identifier
Foreign Key → Links between tables
Normalization → Avoid redundancy
Advanced Example: Star Schema (Used in BI)
Fact Table → Sales
Dimension Tables → Customer, Product, Time
Python Implementation
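A minimal star-schema sketch in pandas: one fact table (sales) joined to its dimension tables, then aggregated the way a BI query would (all figures are invented):

```python
import pandas as pd

# Dimension tables
dim_customer = pd.DataFrame({"customer_id": [1, 2], "name": ["Asha", "Ravi"]})
dim_product = pd.DataFrame({"product_id": [10, 20], "product": ["Pen", "Book"]})
dim_time = pd.DataFrame({"time_id": [1, 2], "month": ["Jan", "Feb"]})

# Fact table: one row per sale, keyed into each dimension
fact_sales = pd.DataFrame({
    "customer_id": [1, 2, 1],
    "product_id": [10, 20, 20],
    "time_id": [1, 1, 2],
    "amount": [5.0, 12.0, 12.0],
})

# Typical BI query: revenue per product per month
report = (fact_sales
          .merge(dim_product, on="product_id")
          .merge(dim_time, on="time_id")
          .groupby(["product", "month"])["amount"].sum())
print(report)
```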
Why Data Modelling Matters
Improves data quality & consistency
Enables efficient querying
Supports analytics & AI models
Reduces redundancy & errors
Real-World Use Cases
Banking → Customer transactions model
E-commerce → Order management system
Healthcare → Patient records
AI → Feature engineering datasets
End-to-End Applied Data Modelling Lifecycle
Problem Definition (Business → Analytical Translation)
Example Problem:
An e-commerce company wants to predict customer churn (who will stop purchasing).
Translate to Data Problem:
Type: Supervised Learning (Classification)
Target Variable: churn (0/1)
Goal: Predict probability of churn
Data Acquisition & Understanding
Example Dataset
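The notebook's dataset is not reproduced here; a small synthetic stand-in with the features used later (age, purchase frequency, spend) might look like:

```python
import pandas as pd

# Synthetic stand-in for the churn dataset (all values invented)
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6],
    "age":         [25, 60, 35, 55, 29, 48],
    "frequency":   [10, 1, 8, 2, 12, 3],     # purchases per month
    "spend":       [200.0, 20.0, 150.0, 30.0, 260.0, 45.0],
    "churn":       [0, 1, 0, 1, 0, 1],       # target: 1 = churned
})
print(df.head())
```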
Decide Algorithm to be used: Logistic Regression
Logistic Regression is a supervised machine learning algorithm used for binary classification problems (e.g., Yes/No, 0/1, Churn/No Churn).
Unlike linear regression, it does not predict continuous values. Instead, it predicts the probability of a class, which is then converted into a label.
It models the probability that an input belongs to a class: P(Y = 1 | X).
It uses a sigmoid (logistic) function, σ(z) = 1 / (1 + e^(−z)), to map any real value into the range (0, 1).

Predict whether a customer will churn:
Inputs: Age, Purchase Frequency, Spend
Output: Probability of churn
P = 0.78 → Churn
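The sigmoid mapping and the thresholding step can be shown in a few lines (the weight values here are made up, not fitted):

```python
import numpy as np

def sigmoid(z):
    """Map any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients for [age, frequency, spend]
w = np.array([0.02, -0.8, -0.01])
b = 1.5
x = np.array([55, 1, 30.0])          # one customer

p = sigmoid(w @ x + b)               # P(Y=1|X)
label = "Churn" if p >= 0.5 else "No Churn"
print(p, label)
```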
Why Logistic Regression?
Simple & interpretable
Works well for baseline models
Provides feature importance via coefficients
Efficient for large datasets
Key Assumptions
Linear relationship between features and log-odds
Independent observations
No strong multicollinearity
Summary
Converts linear equation → probability using sigmoid
Outputs classification via threshold
Widely used in classification problems
Key Tasks:
Schema validation
Missing values check
Data types correction
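These three checks map directly onto pandas calls; a sketch on an illustrative frame with a missing value and a mis-typed column:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 60, None, 55],
    "frequency": ["10", "1", "8", "2"],   # wrong dtype: numbers stored as strings
    "churn": [0, 1, 0, 1],
})

# Schema validation: inspect columns and dtypes
print(df.dtypes)

# Missing values check
print(df.isna().sum())

# Data types correction and simple imputation
df["frequency"] = pd.to_numeric(df["frequency"])
df["age"] = df["age"].fillna(df["age"].median())
```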
Data Modelling (Structure & Relationships)
Here we define:
Features (X) → Inputs
Target (y) → Output
This is the logical data model for ML:
Entities → Customers
Attributes → Behavioral features
Relationship → Features → Target
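In code, this logical model reduces to selecting the feature columns and the target (column names are the hypothetical churn features: age, frequency, spend):

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 60, 35, 55],
    "frequency": [10, 1, 8, 2],
    "spend": [200.0, 20.0, 150.0, 30.0],
    "churn": [0, 1, 0, 1],
})

# Features (X) → inputs; Target (y) → output
X = df[["age", "frequency", "spend"]]
y = df["churn"]
```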
Data Preprocessing & Feature Engineering
Enhancements:
Feature scaling
Encoding categorical variables
Creating derived features
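A sketch of these three enhancements with pandas and scikit-learn (feature and column names are invented):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "age": [25, 60, 35, 55],
    "frequency": [10, 1, 8, 2],
    "spend": [200.0, 20.0, 150.0, 30.0],
    "plan": ["basic", "premium", "basic", "premium"],
})

# Creating derived features: average spend per purchase
df["spend_per_purchase"] = df["spend"] / df["frequency"]

# Encoding categorical variables
df = pd.get_dummies(df, columns=["plan"], drop_first=True)

# Feature scaling: zero mean, unit variance for numeric columns
num_cols = ["age", "frequency", "spend", "spend_per_purchase"]
df[num_cols] = StandardScaler().fit_transform(df[num_cols])
```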
Model Building
Train the model on X_train and y_train (modelling)
Generate predictions for X_test: y_pred = model.predict(X_test)
Compare y_test (actual) against y_pred to measure performance
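Put together, the train/predict/compare steps look like this (the churn data is synthetic, generated from a toy rule):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Synthetic churn data: columns are [age, frequency, spend]
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([
    rng.integers(18, 70, n),       # age
    rng.integers(0, 15, n),        # purchase frequency
    rng.uniform(10, 300, n),       # spend
])
y = (X[:, 1] < 4).astype(int)      # toy rule: churn if low frequency

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)  # modelling
y_pred = model.predict(X_test)                                   # prediction
accuracy = (y_pred == y_test).mean()                             # actual vs predicted
print(f"accuracy: {accuracy:.2f}")
```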
Advanced Metrics:
Precision / Recall
ROC-AUC
Confusion Matrix
Here are the core evaluation metric formulas used in classification problems, expressed clearly with context:
1. Confusion Matrix (Foundation)
A confusion matrix summarizes prediction outcomes:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP (True Positive) | FN (False Negative) |
| Actual Negative | FP (False Positive) | TN (True Negative) |
TP → Correct positive prediction
TN → Correct negative prediction
FP → Incorrect positive prediction
FN → Missed positive
2. Precision
Precision = TP / (TP + FP)
Interpretation: Out of all predicted positives, how many were actually correct?
Use when false positives are costly (e.g., spam detection, fraud alerts)
3. Recall (Sensitivity / True Positive Rate)
Recall = TP / (TP + FN)
Interpretation: Out of all actual positives, how many did we correctly identify?
Use when false negatives are costly (e.g., disease detection)
4. ROC Curve & AUC
ROC Curve:
Plots:
X-axis: False Positive Rate (FPR)
Y-axis: True Positive Rate (Recall)

Interpretation (AUC = Area Under the ROC Curve):
Measures the model's overall ability to distinguish between the classes
Range: 0 → 1
| AUC Value | Meaning |
|---|---|
| 0.5 | Random model |
| 0.7–0.8 | Good |
| 0.8–0.9 | Very good |
| 0.9+ | Excellent |
Quick Summary
Precision → Accuracy of positive predictions
Recall → Coverage of actual positives
Confusion Matrix → Base for all metrics
ROC-AUC → Overall classification performance
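All four metrics are available in scikit-learn; a sketch on made-up labels and probabilities:

```python
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.8, 0.4, 0.1, 0.3, 0.7]   # predicted P(churn)
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]     # threshold at 0.5

cm = confusion_matrix(y_true, y_pred)        # rows: actual, cols: predicted
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
auc = roc_auc_score(y_true, y_prob)          # uses probabilities, not labels
print(cm, precision, recall, auc)
```

Note that scikit-learn orders the matrix by ascending label, so row 0 / column 0 is the negative class, unlike the table above.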
Model Interpretation (Critical in Applied Modelling)
Helps answer:
Which features drive churn?
Business insight extraction
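With logistic regression, feature influence can be read straight from the fitted coefficients (the data below is synthetic and the "low frequency drives churn" rule is invented; signs and magnitudes depend on feature scaling):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

feature_names = ["age", "frequency", "spend"]
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
# Toy rule: low frequency drives churn
y = (X[:, 1] + 0.3 * rng.normal(size=300) < 0).astype(int)

model = LogisticRegression().fit(X, y)
coefs = pd.Series(model.coef_[0], index=feature_names).sort_values()
print(coefs)   # a large negative coefficient on frequency:
               # higher frequency lowers the odds of churn
```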
Deployment (Making Model Usable)
Option 1: Simple Function API
Interactive Deployment using Gradio
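A minimal Gradio wrapper around a churn model might look like this. The model and its training data are stand-ins; `gr.Interface` and `gr.Number` are standard Gradio components, and the UI block is left commented so the script also runs headless:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy model standing in for the trained churn classifier
X = np.array([[25, 10, 200.0], [60, 1, 20.0], [35, 8, 150.0], [55, 2, 30.0]])
y = np.array([0, 1, 0, 1])
model = LogisticRegression(max_iter=1000).fit(X, y)

def predict_churn(age, frequency, spend):
    """Option 1, a simple function API: churn probability as display text."""
    proba = model.predict_proba([[age, frequency, spend]])[0, 1]
    return f"Churn probability: {proba:.2f}"

# Option 2, interactive UI (requires `pip install gradio`); uncomment to launch:
# import gradio as gr
# gr.Interface(
#     fn=predict_churn,
#     inputs=[gr.Number(label="Age"),
#             gr.Number(label="Purchase Frequency"),
#             gr.Number(label="Monthly Spend")],
#     outputs="text",
#     title="Customer Churn Predictor",
# ).launch()
```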
Problem → Data → Model Design → Feature Engineering → Training → Evaluation → Deployment → Monitoring
Data modelling is not just database design — it extends to ML feature structuring
Strong modelling = better model performance
Deployment (Gradio/API) is what makes models business usable
Iteration is continuous (MLOps mindset)
Real-World Extensions
Fraud Detection
Recommendation Systems
Demand Forecasting
NLP Applications