GitHub Repository: suyashi29/python-su
Path: blob/master/Applied Data Modelling using Gradio/1.1 Data Modelling .ipynb
Kernel: Python (TensorFlow 3.10)

Data Modelling is the process of structuring and organizing data to represent real-world entities, relationships, and business rules in a way that supports analysis, storage, and decision-making.

Think of it as a blueprint for data systems—it defines:

  • What data to store (entities)

  • How the data relate (relationships)

  • How the data are structured (schemas)

Types of Data Modelling

    1. Conceptual Model (High-Level)

Focus: business understanding

Example:

Customer → Orders → Products

    2. Logical Model (Detailed Structure)

Defines attributes, keys, and relationships

Example:

Customer(ID, Name, Email)
Order(OrderID, Date, CustomerID)

    3. Physical Model (Implementation)

Actual database tables, data types, and constraints

Example:

CREATE TABLE Customer (
    CustomerID INT PRIMARY KEY,
    Name VARCHAR(100),
    Email VARCHAR(100)
);
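The physical model can be exercised directly from Python; a minimal sketch using the standard-library sqlite3 module and an in-memory database (table and column names follow the SQL above; the email value is a made-up placeholder):

```python
import sqlite3

# In-memory database standing in for a real server
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Customer (
        CustomerID INTEGER PRIMARY KEY,
        Name TEXT,
        Email TEXT
    )
""")
conn.execute("INSERT INTO Customer VALUES (1, 'Ali', '[email protected]')")
rows = conn.execute("SELECT Name, Email FROM Customer").fetchall()
print(rows)
```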
# Python: variables
2 * 4 + 18
a, b = "ashi", 29
a
b
# Numbers
type(-2333333333)
type(-23.98)
type(3 + 4j)
# Data containers
type("Ashi is beautiful")
str
a = "ashi65"
a[0]
len(a)
# Strings are immutable: a.append("o") and del a[1] would raise errors
del a
type([1, 23, 4, "a"])
list
a = [1, 2, "a"]
len(a)
a.append(3)
a
[1, 2, 'a', 3]
### Type casting
a = list("ashi29")
a
['a', 's', 'h', 'i', '2', '9']
type((1,2,"a"))
tuple
d = (1, 23)
d[2] = 9
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[36], line 2
      1 d = (1, 23)
----> 2 d[2] = 9

TypeError: 'tuple' object does not support item assignment
a = {}  # empty braces create a dict, not a set
s = {1, 2, 4, 5, 87, -1, -1, 87, 1, 2}  # duplicates are dropped
s
{-1, 1, 2, 4, 5, 87}
d = {}
d = {"A": 1, "B": 2}
d
d.keys()
d.values()
d[0]  # keys are the index in a dictionary; 0 is not a key here
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[43], line 6
      4 d.keys()
      5 d.values()
----> 6 d[0]

KeyError: 0

Python Implementation (Using Pandas)

import pandas as pd

a = [1, 2, 3, 4]
s = pd.Series(a, index=["one", "two", "three", "four"])
s
one      1
two      2
three    3
four     4
dtype: int64
d = pd.DataFrame(a, index=["one", "two", "three", "four"], columns=["Name"])
d
## Step 1: create tables
import pandas as pd

# Customer table
customers = pd.DataFrame({
    'customer_id': [1, 2],
    'name': ['Ali', 'Ashi'],
    'email': ['[email protected]', '[email protected]']
})

# Product table
products = pd.DataFrame({
    'product_id': [101, 102],
    'product_name': ['Laptop', 'Phone'],
    'price': [800, 500]
})

# Orders table
orders = pd.DataFrame({
    'order_id': [1001, 1002],
    'customer_id': [1, 2],
    'product_id': [101, 102],
    'quantity': [1, 2]
})
customers
## Step 2: establish relationships (joins)
# Merge orders with customers
order_customer = pd.merge(orders, customers, on='customer_id')
# Merge with products
final_data = pd.merge(order_customer, products, on='product_id')
final_data

Key Concepts in Data Modelling

  • Entity → Object (Customer, Product)

  • Attribute → Properties (Name, Price)

  • Primary Key → Unique identifier

  • Foreign Key → Links between tables

  • Normalization → Avoid redundancy
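The primary/foreign-key concepts above can be checked in pandas with a membership test; a sketch on toy tables (an intentionally orphaned order is added for illustration):

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2]})
orders = pd.DataFrame({
    "order_id": [1001, 1002, 1003],
    "customer_id": [1, 2, 3],   # customer 3 does not exist
})

# Foreign-key check: every customer_id in orders must exist in customers
orphans = orders[~orders["customer_id"].isin(customers["customer_id"])]
print(orphans)
```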

Advanced Example: Star Schema (Used in BI)

  • Fact Table → Sales

  • Dimension Tables → Customer, Product, Time

Python Implementation

1, 1000, 20
1/1000, 1000/1000, 20/1000
(0.001, 1.0, 0.02)
# Fact table
sales = pd.DataFrame({
    'sale_id': [1, 2],
    'customer_id': [1, 2],
    'product_id': [101, 102],
    'amount': [800, 1000]
})
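The fact table links back to its dimension tables through the foreign keys; a sketch of that join (the customer and product dimensions from Step 1 are re-created here so the snippet runs standalone):

```python
import pandas as pd

customers = pd.DataFrame({'customer_id': [1, 2], 'name': ['Ali', 'Ashi']})
products = pd.DataFrame({'product_id': [101, 102],
                         'product_name': ['Laptop', 'Phone']})
sales = pd.DataFrame({'sale_id': [1, 2], 'customer_id': [1, 2],
                      'product_id': [101, 102], 'amount': [800, 1000]})

# Enrich the fact table with its dimensions
report = (sales.merge(customers, on='customer_id')
               .merge(products, on='product_id'))
print(report[['sale_id', 'name', 'product_name', 'amount']])
```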

Why Does Data Modelling Matter?

  • Improves data quality & consistency

  • Enables efficient querying

  • Supports analytics & AI models

  • Reduces redundancy & errors

Real-World Use Cases

  • Banking → Customer transactions model

  • E-commerce → Order management system

  • Healthcare → Patient records

  • AI → Feature engineering datasets

End-to-End Applied Data Modelling Lifecycle

Problem Definition (Business → Analytical Translation)

Example Problem:

  • An e-commerce company wants to predict customer churn (who will stop purchasing).

Translate to Data Problem:

  • Type: Supervised Learning (Classification)

  • Target Variable: churn (0/1)

  • Goal: Predict probability of churn

Data Acquisition & Understanding

Example Dataset

Decide Algorithm to be used: Logistic Regression

  • Logistic Regression is a supervised machine learning algorithm used for binary classification problems (e.g., Yes/No, 0/1, Churn/No Churn).

  • Unlike linear regression, it does not predict continuous values. Instead, it predicts the probability of a class, which is then converted into a label.

  • It models the probability that an input belongs to the positive class: P(Y = 1 | X)

  • Uses the sigmoid (logistic) function, σ(z) = 1 / (1 + e^(−z)), to map any real value into the range (0, 1).

Predict whether a customer will churn:

  • Inputs: Age, Purchase Frequency, Spend

  • Output: Probability of churn

  • P = 0.78 → Churn (above the 0.5 threshold)
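The sigmoid mapping above is easy to verify numerically; a quick sketch (the score 1.27 is a made-up linear-combination value chosen to reproduce P ≈ 0.78, and 0.5 is the assumed threshold):

```python
import math

def sigmoid(z):
    # Maps any real-valued score into the open interval (0, 1)
    return 1 / (1 + math.exp(-z))

p = sigmoid(1.27)
label = "Churn" if p >= 0.5 else "No Churn"
print(round(p, 2), label)  # → 0.78 Churn
```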

Why Logistic Regression?

  • Simple & interpretable

  • Works well for baseline models

  • Provides feature importance via coefficients

  • Efficient for large datasets

Key Assumptions

  • Linear relationship between features and log-odds

  • Independent observations

  • No strong multicollinearity
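The multicollinearity assumption can be eyeballed with a correlation matrix; a sketch on the same toy feature columns used later in the notebook (df is a local name to avoid clashing with the notebook's data):

```python
import pandas as pd

df = pd.DataFrame({
    'age': [25, 40, 35, 50, 45, 10],
    'purchase_freq': [5, 2, 3, 1, 2, 1],
    'avg_spend': [200, 500, 300, 700, 1000, 300],
})
# Pairwise Pearson correlations; values near +/-1 flag collinear features
corr = df.corr().round(2)
print(corr)
```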

Summary

  • Converts linear equation → probability using sigmoid

  • Outputs classification via threshold

  • Widely used in classification problems

import pandas as pd

data = pd.DataFrame({
    'age': [25, 40, 35, 50, 45, 10],
    'purchase_freq': [5, 2, 3, 1, 2, 1],
    'avg_spend': [200, 500, 300, 700, 1000, 300],
    'churn': [0, 1, 0, 1, 1, 0]
})
data.head()

Key Tasks:

  • Schema validation

  • Missing values check

  • Data types correction

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   age            6 non-null      int64
 1   purchase_freq  6 non-null      int64
 2   avg_spend      6 non-null      int64
 3   churn          6 non-null      int64
dtypes: int64(4)
memory usage: 324.0 bytes
data.isnull().sum()
age              0
purchase_freq    0
avg_spend        0
churn            0
dtype: int64
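The schema-validation task from the list above can be sketched as a simple dtype assertion (the expected dtypes and the two-row toy table are assumptions for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    'age': [25, 40],
    'purchase_freq': [5, 2],
    'avg_spend': [200, 500],
    'churn': [0, 1],
})
expected = {'age': 'int64', 'purchase_freq': 'int64',
            'avg_spend': 'int64', 'churn': 'int64'}
actual = {col: str(dtype) for col, dtype in toy.dtypes.items()}
# Fail loudly if the table drifts from the agreed schema
assert actual == expected, f"schema mismatch: {actual}"
print("schema OK")
```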

Data Modelling (Structure & Relationships)

Here we define:

  • Features (X) → Inputs

  • Target (y) → Output

X = data[['age', 'purchase_freq', 'avg_spend']]
y = data['churn']

This is the logical data model for ML:

  • Entities → Customers

  • Attributes → Behavioral features

  • Relationship → Features → Target

Data Preprocessing & Feature Engineering

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Enhancements:

  • Feature scaling

  • Encoding categorical variables

  • Creating derived features

## For example: a derived feature
data['spend_per_purchase'] = data['avg_spend'] / data['purchase_freq']
data['spend_per_purchase']
0     40.0
1    250.0
2    100.0
3    700.0
4    500.0
5    300.0
Name: spend_per_purchase, dtype: float64
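Encoding categorical variables, the second enhancement listed above, is not shown in the notebook; a minimal one-hot sketch with a hypothetical region column:

```python
import pandas as pd

df = pd.DataFrame({'region': ['N', 'S', 'N', 'E']})
# One-hot encode: one indicator column per category
encoded = pd.get_dummies(df, columns=['region'])
print(encoded)
```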

Model Building

  • Fit the model on X_train and y_train (modelling)

  • Predict on X_test to obtain y_pred

  • Compare y_test (actual) against y_pred to measure performance

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2)
model = LogisticRegression()
model.fit(X_train, y_train)
Actual    Predicted    Label
1         1            TP
0         1            FP
1         0            FN
0         0            TN

Accuracy = (TP + TN) / (TP + FP + TN + FN)
from sklearn.metrics import accuracy_score

y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
Accuracy: 0.5

Advanced Metrics:

  • Precision / Recall

  • ROC-AUC

  • Confusion Matrix

Here are the core evaluation metric formulas used in classification problems, expressed clearly with context:


1. Confusion Matrix (Foundation)

A confusion matrix summarizes prediction outcomes:

                   Predicted Positive     Predicted Negative
Actual Positive    TP (True Positive)     FN (False Negative)
Actual Negative    FP (False Positive)    TN (True Negative)
  • TP → Correct positive prediction

  • TN → Correct negative prediction

  • FP → Incorrect positive prediction

  • FN → Missed positive


2. Precision

Precision = TP / (TP + FP)

Interpretation: Out of all predicted positives, how many were actually correct?

  • Use when false positives are costly (e.g., spam detection, fraud alerts)


3. Recall (Sensitivity / True Positive Rate)

Recall = TP / (TP + FN)

Interpretation: Out of all actual positives, how many did we correctly identify?

  • Use when false negatives are costly (e.g., disease detection)


4. ROC Curve & AUC

ROC Curve:

Plots:

  • X-axis: False Positive Rate (FPR)

  • Y-axis: True Positive Rate (Recall)

Interpretation:

  • Measures overall model ability to distinguish classes

  • Range: 0 → 1

AUC Value    Meaning
0.5          Random model
0.7–0.8      Good
0.8–0.9      Very good
0.9+         Excellent

Quick Summary

  • Precision → Accuracy of positive predictions

  • Recall → Coverage of actual positives

  • Confusion Matrix → Base for all metrics

  • ROC-AUC → Overall classification performance
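The metrics summarized above are all available in scikit-learn; a sketch on hard-coded toy labels and probabilities (not the notebook's actual train/test split):

```python
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, roc_auc_score)

y_true = [1, 0, 1, 0, 1, 1]
y_pred = [1, 1, 1, 0, 0, 1]
y_prob = [0.9, 0.6, 0.8, 0.2, 0.4, 0.7]  # predicted P(churn)

cm = confusion_matrix(y_true, y_pred)   # rows: actual 0/1, cols: predicted 0/1
prec = precision_score(y_true, y_pred)  # TP / (TP + FP)
rec = recall_score(y_true, y_pred)      # TP / (TP + FN)
auc = roc_auc_score(y_true, y_prob)

print(cm)
print("precision:", prec, "recall:", rec, "roc_auc:", auc)
```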

Model Interpretation (Critical in Applied Modelling)

Helps answer:

  • Which features drive churn?

  • Business insight extraction

print("Coefficients:", model.coef_)
Coefficients: [[ 0.64497049 -0.43204911 0.52753669]]
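One way to read those coefficients: on standardized features, exp(beta) is the multiplicative change in churn odds per one-standard-deviation increase in the feature. A sketch using the printed values, rounded:

```python
import math

coefs = {'age': 0.645, 'purchase_freq': -0.432, 'avg_spend': 0.528}
# exp(beta) > 1 raises churn odds; < 1 lowers them
ratios = {name: round(math.exp(b), 2) for name, b in coefs.items()}
print(ratios)
```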

Deployment (Making Model Usable)

  • Option 1: Simple Function API

def predict_churn(age, purchase_freq, avg_spend):
    input_data = scaler.transform([[age, purchase_freq, avg_spend]])
    prediction = model.predict(input_data)
    return prediction[0]

Interactive Deployment using Gradio

import gradio as gr

def churn_app(age, purchase_freq, avg_spend):
    result = predict_churn(age, purchase_freq, avg_spend)
    return "Churn" if result == 1 else "No Churn"

interface = gr.Interface(
    fn=churn_app,
    inputs=["number", "number", "number"],
    outputs="text",
    title="Customer Churn Prediction"
)
interface.launch()
* Running on local URL: http://127.0.0.1:7860
* To create a public link, set `share=True` in `launch()`.
C:\Users\Suyashi144893\AppData\Local\anaconda3\Lib\site-packages\sklearn\base.py:464: UserWarning: X does not have valid feature names, but StandardScaler was fitted with feature names
  warnings.warn(
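The UserWarning above appears because StandardScaler was fitted on a DataFrame with named columns, but predict_churn later passes a bare nested list. Wrapping the input row in a DataFrame with the same column names removes it; a self-contained sketch (the two-row training set here is a stand-in, not the notebook's data):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

cols = ['age', 'purchase_freq', 'avg_spend']
X = pd.DataFrame([[25, 5, 200], [50, 1, 700]], columns=cols)
y = [0, 1]
scaler = StandardScaler().fit(X)
model = LogisticRegression().fit(scaler.transform(X), y)

def predict_churn(age, purchase_freq, avg_spend):
    # Named columns keep feature names consistent with fitting, so no warning
    row = pd.DataFrame([[age, purchase_freq, avg_spend]], columns=cols)
    return int(model.predict(scaler.transform(row))[0])

print(predict_churn(50, 1, 700))
```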

Problem → Data → Model Design → Feature Engineering → Training → Evaluation → Deployment → Monitoring

  • Data modelling is not just database design — it extends to ML feature structuring

  • Strong modelling = better model performance

  • Deployment (Gradio/API) is what makes models usable by the business

  • Iteration is continuous (MLOps mindset)

Real-World Extensions

  • Fraud Detection

  • Recommendation Systems

  • Demand Forecasting

  • NLP Applications