GitHub Repository: suyashi29/python-su
Path: blob/master/Applied Data Modelling using Gradio/1.1 Data Modelling .ipynb
Kernel: Python (TensorFlow 3.10)

Data Modelling is the process of structuring and organizing data to represent real-world entities, relationships, and business rules in a way that supports analysis, storage, and decision-making.

Think of it as a blueprint for data systems—it defines:

  • What data to store (entities)

  • How the data relate (relationships)

  • How the data are structured (schemas)

Types of Data Modelling

    1. Conceptual Model (High-Level)

Focus: business understanding

Example:

Customer → Orders → Products

    2. Logical Model (Detailed Structure)

Defines attributes, keys, and relationships

Example:

Customer(ID, Name, Email)
Order(OrderID, Date, CustomerID)

    3. Physical Model (Implementation)

Actual database tables, data types, and constraints

Example:

CREATE TABLE Customer (
    CustomerID INT PRIMARY KEY,
    Name VARCHAR(100),
    Email VARCHAR(100)
);
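The physical model can be exercised directly from Python; a minimal sketch using the standard-library sqlite3 module and an in-memory database (table and column names follow the SQL above; the email value is a made-up placeholder):

```python
import sqlite3

# In-memory database standing in for a real server
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Customer (
        CustomerID INTEGER PRIMARY KEY,
        Name TEXT,
        Email TEXT
    )
""")
conn.execute("INSERT INTO Customer VALUES (1, 'Ali', '[email protected]')")
rows = conn.execute("SELECT Name, Email FROM Customer").fetchall()
print(rows)
```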
# Python: variables
2 * 4 + 18
a, b = "ashi", 29
a
b
# Numbers
type(-2333333333)
type(-23.98)
type(3 + 4j)
# Data containers
type("Ashi is beautiful")
str
a = "ashi65"
a[0]
len(a)
# Strings are immutable: a.append("o") and del a[1] would raise errors
del a
type([1, 23, 4, "a"])
list
a = [1, 2, "a"]
len(a)
a.append(3)
a
[1, 2, 'a', 3]
### Type casting
a = list("ashi29")
a
['a', 's', 'h', 'i', '2', '9']
type((1,2,"a"))
tuple
d = (1, 23)
d[2] = 9
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[36], line 2
      1 d = (1, 23)
----> 2 d[2] = 9

TypeError: 'tuple' object does not support item assignment
a = {}  # empty braces create a dict, not a set
s = {1, 2, 4, 5, 87, -1, -1, 87, 1, 2}  # duplicates are dropped
s
{-1, 1, 2, 4, 5, 87}
d = {}
d = {"A": 1, "B": 2}
d
d.keys()
d.values()
d[0]  # keys are the index in a dictionary; 0 is not a key here
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[43], line 6
      4 d.keys()
      5 d.values()
----> 6 d[0]

KeyError: 0

Python Implementation (Using Pandas)

import pandas as pd

a = [1, 2, 3, 4]
s = pd.Series(a, index=["one", "two", "three", "four"])
s
one      1
two      2
three    3
four     4
dtype: int64
d = pd.DataFrame(a, index=["one", "two", "three", "four"], columns=["Name"])
d
## Step 1: create tables
import pandas as pd

# Customer table
customers = pd.DataFrame({
    'customer_id': [1, 2],
    'name': ['Ali', 'Ashi'],
    'email': ['[email protected]', '[email protected]']
})

# Product table
products = pd.DataFrame({
    'product_id': [101, 102],
    'product_name': ['Laptop', 'Phone'],
    'price': [800, 500]
})

# Orders table
orders = pd.DataFrame({
    'order_id': [1001, 1002],
    'customer_id': [1, 2],
    'product_id': [101, 102],
    'quantity': [1, 2]
})
customers
## Step 2: establish relationships (joins)
# Merge orders with customers
order_customer = pd.merge(orders, customers, on='customer_id')
# Merge with products
final_data = pd.merge(order_customer, products, on='product_id')
final_data

Key Concepts in Data Modelling

  • Entity → Object (Customer, Product)

  • Attribute → Properties (Name, Price)

  • Primary Key → Unique identifier

  • Foreign Key → Links between tables

  • Normalization → Avoid redundancy
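The primary/foreign-key concepts above can be checked in pandas with a membership test; a sketch on toy tables (an intentionally orphaned order is added for illustration):

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2]})
orders = pd.DataFrame({
    "order_id": [1001, 1002, 1003],
    "customer_id": [1, 2, 3],   # customer 3 does not exist
})

# Foreign-key check: every customer_id in orders must exist in customers
orphans = orders[~orders["customer_id"].isin(customers["customer_id"])]
print(orphans)
```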

Advanced Example: Star Schema (Used in BI)

  • Fact Table → Sales

  • Dimension Tables → Customer, Product, Time

Python Implementation

1, 1000, 20
1/1000, 1000/1000, 20/1000
(0.001, 1.0, 0.02)
# Fact table
sales = pd.DataFrame({
    'sale_id': [1, 2],
    'customer_id': [1, 2],
    'product_id': [101, 102],
    'amount': [800, 1000]
})
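The fact table links back to its dimension tables through the foreign keys; a sketch of that join (the customer and product dimensions from Step 1 are re-created here so the snippet runs standalone):

```python
import pandas as pd

customers = pd.DataFrame({'customer_id': [1, 2], 'name': ['Ali', 'Ashi']})
products = pd.DataFrame({'product_id': [101, 102],
                         'product_name': ['Laptop', 'Phone']})
sales = pd.DataFrame({'sale_id': [1, 2], 'customer_id': [1, 2],
                      'product_id': [101, 102], 'amount': [800, 1000]})

# Enrich the fact table with its dimensions
report = (sales.merge(customers, on='customer_id')
               .merge(products, on='product_id'))
print(report[['sale_id', 'name', 'product_name', 'amount']])
```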

Why Does Data Modelling Matter?

  • Improves data quality & consistency

  • Enables efficient querying

  • Supports analytics & AI models

  • Reduces redundancy & errors

Real-World Use Cases

  • Banking → Customer transactions model

  • E-commerce → Order management system

  • Healthcare → Patient records

  • AI → Feature engineering datasets

End-to-End Applied Data Modelling Lifecycle

Problem Definition (Business → Analytical Translation)

Example Problem:

  • An e-commerce company wants to predict customer churn (who will stop purchasing).

Translate to Data Problem:

  • Type: Supervised Learning (Classification)

  • Target Variable: churn (0/1)

  • Goal: Predict probability of churn

Data Acquisition & Understanding

Example Dataset

Decide Algorithm to be used: Logistic Regression

  • Logistic Regression is a supervised machine learning algorithm used for binary classification problems (e.g., Yes/No, 0/1, Churn/No Churn).

  • Unlike linear regression, it does not predict continuous values. Instead, it predicts the probability of a class, which is then converted into a label.

  • It models the probability that an input belongs to the positive class: P(Y = 1 | X)

  • Uses the sigmoid (logistic) function, σ(z) = 1 / (1 + e^(−z)), to map any real value into the range (0, 1).

Predict whether a customer will churn:

  • Inputs: Age, Purchase Frequency, Spend

  • Output: Probability of churn

  • P = 0.78 → Churn (above the 0.5 threshold)
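The sigmoid mapping above is easy to verify numerically; a quick sketch (the score 1.27 is a made-up linear-combination value chosen to reproduce P ≈ 0.78, and 0.5 is the assumed threshold):

```python
import math

def sigmoid(z):
    # Maps any real-valued score into the open interval (0, 1)
    return 1 / (1 + math.exp(-z))

p = sigmoid(1.27)
label = "Churn" if p >= 0.5 else "No Churn"
print(round(p, 2), label)  # → 0.78 Churn
```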

Why Logistic Regression?

  • Simple & interpretable

  • Works well for baseline models

  • Provides feature importance via coefficients

  • Efficient for large datasets

Key Assumptions

  • Linear relationship between features and log-odds

  • Independent observations

  • No strong multicollinearity
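The multicollinearity assumption can be eyeballed with a correlation matrix; a sketch on the same toy feature columns used later in the notebook (df is a local name to avoid clashing with the notebook's data):

```python
import pandas as pd

df = pd.DataFrame({
    'age': [25, 40, 35, 50, 45, 10],
    'purchase_freq': [5, 2, 3, 1, 2, 1],
    'avg_spend': [200, 500, 300, 700, 1000, 300],
})
# Pairwise Pearson correlations; values near +/-1 flag collinear features
corr = df.corr().round(2)
print(corr)
```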

Summary

  • Converts linear equation → probability using sigmoid

  • Outputs classification via threshold

  • Widely used in classification problems

import pandas as pd

data = pd.DataFrame({
    'age': [25, 40, 35, 50, 45, 10],
    'purchase_freq': [5, 2, 3, 1, 2, 1],
    'avg_spend': [200, 500, 300, 700, 1000, 300],
    'churn': [0, 1, 0, 1, 1, 0]
})
data.head()

Key Tasks:

  • Schema validation

  • Missing values check

  • Data types correction

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   age            6 non-null      int64
 1   purchase_freq  6 non-null      int64
 2   avg_spend      6 non-null      int64
 3   churn          6 non-null      int64
dtypes: int64(4)
memory usage: 324.0 bytes
data.isnull().sum()
age              0
purchase_freq    0
avg_spend        0
churn            0
dtype: int64
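The schema-validation task from the list above can be sketched as a simple dtype assertion (the expected dtypes and the two-row toy table are assumptions for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    'age': [25, 40],
    'purchase_freq': [5, 2],
    'avg_spend': [200, 500],
    'churn': [0, 1],
})
expected = {'age': 'int64', 'purchase_freq': 'int64',
            'avg_spend': 'int64', 'churn': 'int64'}
actual = {col: str(dtype) for col, dtype in toy.dtypes.items()}
# Fail loudly if the table drifts from the agreed schema
assert actual == expected, f"schema mismatch: {actual}"
print("schema OK")
```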

Data Modelling (Structure & Relationships)

Here we define:

  • Features (X) → Inputs

  • Target (y) → Output

X = data[['age', 'purchase_freq', 'avg_spend']]
y = data['churn']

This is the logical data model for ML:

  • Entities → Customers

  • Attributes → Behavioral features

  • Relationship → Features → Target

Data Preprocessing & Feature Engineering

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Enhancements:

  • Feature scaling

  • Encoding categorical variables

  • Creating derived features

## For example: a derived feature
data['spend_per_purchase'] = data['avg_spend'] / data['purchase_freq']
data['spend_per_purchase']
0     40.0
1    250.0
2    100.0
3    700.0
4    500.0
5    300.0
Name: spend_per_purchase, dtype: float64
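Encoding categorical variables, the second enhancement listed above, is not shown in the notebook; a minimal one-hot sketch with a hypothetical region column:

```python
import pandas as pd

df = pd.DataFrame({'region': ['N', 'S', 'N', 'E']})
# One-hot encode: one indicator column per category
encoded = pd.get_dummies(df, columns=['region'])
print(encoded)
```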

Model Building

  • Fit the model on X_train and y_train (modelling)

  • Predict on X_test to obtain y_pred

  • Compare y_test (actual) against y_pred to measure performance

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2)
model = LogisticRegression()
model.fit(X_train, y_train)
Actual    Predicted    Label
1         1            TP
0         1            FP
1         0            FN
0         0            TN

Accuracy = (TP + TN) / (TP + FP + TN + FN)
from sklearn.metrics import accuracy_score

y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
Accuracy: 0.5

Advanced Metrics:

  • Precision / Recall

  • ROC-AUC

  • Confusion Matrix

Here are the core evaluation metric formulas used in classification problems, expressed clearly with context:


1. Confusion Matrix (Foundation)

A confusion matrix summarizes prediction outcomes:

                   Predicted Positive     Predicted Negative
Actual Positive    TP (True Positive)     FN (False Negative)
Actual Negative    FP (False Positive)    TN (True Negative)
  • TP → Correct positive prediction

  • TN → Correct negative prediction

  • FP → Incorrect positive prediction

  • FN → Missed positive


2. Precision

Precision = TP / (TP + FP)

Interpretation: Out of all predicted positives, how many were actually correct?

  • Use when false positives are costly (e.g., spam detection, fraud alerts)


3. Recall (Sensitivity / True Positive Rate)

Recall = TP / (TP + FN)

Interpretation: Out of all actual positives, how many did we correctly identify?

  • Use when false negatives are costly (e.g., disease detection)


4. ROC Curve & AUC

ROC Curve:

Plots:

  • X-axis: False Positive Rate (FPR)

  • Y-axis: True Positive Rate (Recall)

Interpretation:

  • Measures overall model ability to distinguish classes

  • Range: 0 → 1

AUC Value    Meaning
0.5          Random model
0.7–0.8      Good
0.8–0.9      Very good
0.9+         Excellent

Quick Summary

  • Precision → Accuracy of positive predictions

  • Recall → Coverage of actual positives

  • Confusion Matrix → Base for all metrics

  • ROC-AUC → Overall classification performance
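The metrics summarized above are all available in scikit-learn; a sketch on hard-coded toy labels and probabilities (not the notebook's actual train/test split):

```python
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, roc_auc_score)

y_true = [1, 0, 1, 0, 1, 1]
y_pred = [1, 1, 1, 0, 0, 1]
y_prob = [0.9, 0.6, 0.8, 0.2, 0.4, 0.7]  # predicted P(churn)

cm = confusion_matrix(y_true, y_pred)   # rows: actual 0/1, cols: predicted 0/1
prec = precision_score(y_true, y_pred)  # TP / (TP + FP)
rec = recall_score(y_true, y_pred)      # TP / (TP + FN)
auc = roc_auc_score(y_true, y_prob)

print(cm)
print("precision:", prec, "recall:", rec, "roc_auc:", auc)
```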

Model Interpretation (Critical in Applied Modelling)

Helps answer:

  • Which features drive churn?

  • Business insight extraction

print("Coefficients:", model.coef_)
Coefficients: [[ 0.64497049 -0.43204911 0.52753669]]
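One way to read those coefficients: on standardized features, exp(beta) is the multiplicative change in churn odds per one-standard-deviation increase in the feature. A sketch using the printed values, rounded:

```python
import math

coefs = {'age': 0.645, 'purchase_freq': -0.432, 'avg_spend': 0.528}
# exp(beta) > 1 raises churn odds; < 1 lowers them
ratios = {name: round(math.exp(b), 2) for name, b in coefs.items()}
print(ratios)
```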

Deployment (Making Model Usable)

  • Option 1: Simple Function API

def predict_churn(age, purchase_freq, avg_spend):
    input_data = scaler.transform([[age, purchase_freq, avg_spend]])
    prediction = model.predict(input_data)
    return prediction[0]

Interactive Deployment using Gradio

import gradio as gr

def churn_app(age, purchase_freq, avg_spend):
    result = predict_churn(age, purchase_freq, avg_spend)
    return "Churn" if result == 1 else "No Churn"

interface = gr.Interface(
    fn=churn_app,
    inputs=["number", "number", "number"],
    outputs="text",
    title="Customer Churn Prediction"
)
interface.launch()
* Running on local URL: http://127.0.0.1:7860
* To create a public link, set `share=True` in `launch()`.
C:\Users\Suyashi144893\AppData\Local\anaconda3\Lib\site-packages\sklearn\base.py:464: UserWarning: X does not have valid feature names, but StandardScaler was fitted with feature names
  warnings.warn(
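The UserWarning above appears because StandardScaler was fitted on a DataFrame with named columns, but predict_churn later passes a bare nested list. Wrapping the input row in a DataFrame with the same column names removes it; a self-contained sketch (the two-row training set here is a stand-in, not the notebook's data):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

cols = ['age', 'purchase_freq', 'avg_spend']
X = pd.DataFrame([[25, 5, 200], [50, 1, 700]], columns=cols)
y = [0, 1]
scaler = StandardScaler().fit(X)
model = LogisticRegression().fit(scaler.transform(X), y)

def predict_churn(age, purchase_freq, avg_spend):
    # Named columns keep feature names consistent with fitting, so no warning
    row = pd.DataFrame([[age, purchase_freq, avg_spend]], columns=cols)
    return int(model.predict(scaler.transform(row))[0])

print(predict_churn(50, 1, 700))
```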

Problem → Data → Model Design → Feature Engineering → Training → Evaluation → Deployment → Monitoring

  • Data modelling is not just database design — it extends to ML feature structuring

  • Strong modelling = better model performance

  • Deployment (Gradio/API) is what makes models usable by the business

  • Iteration is continuous (MLOps mindset)

Real-World Extensions

  • Fraud Detection

  • Recommendation Systems

  • Demand Forecasting

  • NLP Applications