CoCalc -- 1.1 Data Science Overview and WorkFlow.ipynb

GitHub Repository: suyashi29/python-su
Path: blob/master/Data Science Essentials for Data Analysts/1.1 Data Science Overview and WorkFlow.ipynb

Kernel: Python 3 (ipykernel)

Data Science is the practice of obtaining knowledge from data sets that are often enormously large.
The process includes preparing data for analysis, analysing it, and presenting results to support organisational decisions.

Data Science is the combination of:

statistics,
mathematics,
programming,
problem-solving,
capturing data in ingenious ways,
the ability to look at things differently,

and the activity of cleansing, preparing and aligning the data.

             +-------------------+
             |   Problem         |
             |   Definition      |
             +---------+---------+
                       |
                       v
             +-------------------+
             |   Data Collection |
             +---------+---------+
                       |
                       v
             +-------------------+
             | Data Preprocessing|
             +---------+---------+
                       |
                       v
             +-------------------+
             | Exploratory       |
             | Data Analysis     |
             +---------+---------+
                       |
                       v
             +-------------------+
             |   Feature         |
             |   Engineering     |
             +---------+---------+
                       |
                       v
             +-------------------+
             |  Model Building   |
             +---------+---------+
                       |
                       v
             +-------------------+
             |  Model Evaluation |
             +---------+---------+
                       |
                       v
             +-------------------+
             | Deployment/       |
             | Communication     |
             +-------------------+

In [1]:

import matplotlib.pyplot as plt
import networkx as nx

# Define the workflow stages
stages = [
    "Problem Definition",
    "Data Collection",
    "Data Preprocessing",
    "Exploratory Data Analysis",
    "Feature Engineering",
    "Model Building",
    "Model Evaluation",
    "Deployment/Communication"
]

# Create a directed graph
G = nx.DiGraph()

# Add edges between stages
for i in range(len(stages) - 1):
    G.add_edge(stages[i], stages[i + 1])

# Draw the graph
pos = nx.spring_layout(G, seed=42)  # Layout for consistent plotting
plt.figure(figsize=(12, 8))

# Draw nodes
nx.draw_networkx_nodes(G, pos, node_size=3000, node_color="skyblue")

# Draw edges
nx.draw_networkx_edges(G, pos, arrows=True, arrowstyle='-|>', arrowsize=20)

# Draw labels
nx.draw_networkx_labels(G, pos, font_size=10, font_weight="bold")

plt.title("Data Science Workflow", fontsize=14)
plt.axis("off")
plt.show()

Out[1]: [Figure: "Data Science Workflow", a directed graph linking the eight stages in order]

Use Cases for Data Science WorkFlow

Use Case	Problem Description	Example Models	Key Metrics
Customer Churn Prediction	Predict if a customer will leave a service.	Logistic Regression, Decision Trees	Precision, Recall, F1
Product Recommendation System	Suggest products to users based on their behavior.	Collaborative Filtering, Content-Based Filtering	Mean Average Precision
House Price Prediction	Predict house prices based on features like location.	Linear Regression, Random Forest	RMSE, R² Score
Fraud Detection	Detect fraudulent transactions in financial systems.	Random Forest, SVM, Neural Networks	Precision, Recall, False Negatives
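The regression metrics named for house price prediction can be computed as in the minimal sketch below. The prices and predictions are made-up numbers chosen so the arithmetic is easy to check by hand; they do not come from any real data set.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical house prices (in thousands) and hypothetical model predictions
y_true = np.array([200.0, 250.0, 300.0, 350.0])
y_pred = np.array([210.0, 240.0, 310.0, 340.0])

# RMSE: square root of the mean squared error (every error here is 10)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

# R²: 1 - (residual sum of squares / total sum of squares)
r2 = r2_score(y_true, y_pred)

print(f"RMSE: {rmse:.1f}")
print(f"R²:   {r2:.3f}")
```

RMSE is expressed in the same units as the target, which makes it easier to explain to stakeholders than the squared error.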

Customer Churn Prediction

Problem Definition: Predict whether a customer will leave a service (churn) based on their historical usage data.
Data Collection: Collect data from CRM systems, transaction logs, and support tickets.
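As a rough sketch of the collection step, the three sources can be combined into one row per customer. The tables, column names, and the `customer_id` key below are assumptions made for illustration; real extracts will look different.

```python
import pandas as pd

# Hypothetical extracts from the three source systems
crm = pd.DataFrame({"customer_id": [1, 2, 3], "age": [34, 51, 27]})
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "amount": [20.0, 35.0, 50.0, 10.0, 15.0, 12.5],
})
tickets = pd.DataFrame({"customer_id": [2, 2, 3], "ticket_id": [101, 102, 103]})

# Aggregate the event-level tables to one row per customer
spend = (
    transactions.groupby("customer_id", as_index=False)["amount"]
    .sum()
    .rename(columns={"amount": "total_spend"})
)
n_tickets = tickets.groupby("customer_id").size().reset_index(name="n_tickets")

# Left joins keep every CRM customer, even those with no tickets
customers = (
    crm.merge(spend, on="customer_id", how="left")
    .merge(n_tickets, on="customer_id", how="left")
    .fillna({"n_tickets": 0})
)
print(customers)
```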

Data Preprocessing:

Clean missing data.
Standardize features like age, income, or usage duration.
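The two preprocessing steps above might look like the following sketch. The column names and values are invented, and median imputation is only one possible way to handle missing data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical customer features with some missing values
df = pd.DataFrame({
    "age": [25, 40, np.nan, 55],
    "income": [30000, np.nan, 52000, 75000],
    "usage_months": [3, 12, 24, np.nan],
})

# Fill each missing value with the median of its column
imputer = SimpleImputer(strategy="median")
filled = imputer.fit_transform(df)

# Rescale every column to zero mean and unit variance
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(filled), columns=df.columns)
print(scaled.round(2))
```

In a real project the imputer and scaler would be fitted on the training split only and then applied to the test split, so that no information leaks from the test data.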

Exploratory Data Analysis (EDA):

Analyze trends (e.g., higher churn among low-usage customers).
Visualize correlations between churn and features like complaints or discounts.
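A small sketch of both EDA ideas, using a made-up table: churn rate per usage band, and the correlation of each feature with churn. The band boundaries (10 and 30 hours) are arbitrary assumptions.

```python
import pandas as pd

# Hypothetical customer data
df = pd.DataFrame({
    "monthly_usage_hours": [2, 5, 8, 20, 25, 30, 40, 45],
    "complaints":          [3, 2, 2, 1, 0, 1, 0, 0],
    "churn":               [1, 1, 0, 1, 0, 0, 0, 0],
})

# Trend: churn rate within each usage band
df["usage_band"] = pd.cut(
    df["monthly_usage_hours"], bins=[0, 10, 30, 50], labels=["low", "medium", "high"]
)
churn_by_band = df.groupby("usage_band", observed=True)["churn"].mean()
print(churn_by_band)

# Correlation of each numeric feature with churn
corr = df[["monthly_usage_hours", "complaints", "churn"]].corr()["churn"]
print(corr)
```

The same correlation matrix can be passed to a heatmap function for a visual check.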

Feature Engineering:

Create new features like average time on platform or discount utilization rate.
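The two example features could be derived as ratios of raw counters, as in this sketch. The raw column names are hypothetical, and treating a zero denominator as a rate of 0 is a design assumption, not the only reasonable choice.

```python
import numpy as np
import pandas as pd

# Hypothetical raw usage counters per customer
df = pd.DataFrame({
    "total_minutes": [600, 90, 0],
    "n_sessions": [30, 6, 0],
    "discounts_offered": [4, 0, 5],
    "discounts_used": [2, 0, 5],
})

# Average time on platform per session; customers with no sessions get 0
df["avg_session_minutes"] = (
    df["total_minutes"] / df["n_sessions"].replace(0, np.nan)
).fillna(0.0)

# Share of offered discounts that were used; no offers means a rate of 0
df["discount_utilization"] = (
    df["discounts_used"] / df["discounts_offered"].replace(0, np.nan)
).fillna(0.0)

print(df[["avg_session_minutes", "discount_utilization"]])
```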

Model Building: Use logistic regression or decision trees to predict churn probability.
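A minimal sketch of the logistic regression option follows. The data is synthetic, and the assumed relationship (more complaints raise churn risk, longer tenure lowers it) is built in by hand purely so the model has something to learn.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
tenure = rng.integers(1, 60, n)
complaints = rng.poisson(1.0, n)

# Assumed ground truth: churn odds rise with complaints and fall with tenure
logit = 0.8 * complaints - 0.05 * tenure
churn = rng.random(n) < 1 / (1 + np.exp(-logit))

X = np.column_stack([tenure, complaints])
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, churn)

# Churn probability for a new, unhappy customer and a long-standing, quiet one
proba = model.predict_proba([[3, 4], [48, 0]])[:, 1]
print(proba)
```

`predict_proba` returns one column per class; column 1 is the probability of churn, which is usually more useful to a sales team than a hard 0/1 label.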

Model Evaluation: Validate with metrics like accuracy, precision, recall, or F1 score.
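The four metrics can be computed with scikit-learn as sketched below. The labels are a hand-made example with 2 true positives, 2 false negatives, 1 false positive, and 5 true negatives.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hand-made labels: 1 = churn, 0 = no churn
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here accuracy is 0.70 even though half of the churners are missed, which is why recall and F1 are worth reporting alongside it when churners are the minority class.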

Deployment:

Integrate into a CRM tool.
Notify sales teams about high-risk customers.
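One simple way to turn model scores into notifications is a probability cut-off, sketched below. The scores, the 0.6 threshold, and the alert format are all assumptions; a real integration would write to the CRM tool instead of printing.

```python
import pandas as pd

# Hypothetical churn probabilities produced by a trained model
scores = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "churn_probability": [0.82, 0.15, 0.64, 0.40],
})

RISK_THRESHOLD = 0.6  # assumed cut-off; tune it to the sales team's capacity

# Keep only high-risk customers, most at-risk first
high_risk = scores[scores["churn_probability"] >= RISK_THRESHOLD].sort_values(
    "churn_probability", ascending=False
)

for row in high_risk.itertuples(index=False):
    print(f"ALERT: customer {row.customer_id} churn probability {row.churn_probability:.2f}")
```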

Communication: Present insights to the marketing team for proactive engagement.

In [3]:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Step 1: Generate Sample Data
np.random.seed(42)
n_samples = 1000

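# Generate synthetic data.
# Note: Churn is drawn independently of every other column, so the features carry
# no real signal; the scores below illustrate the workflow, not a useful model.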
data = {
    "Customer_Age": np.random.randint(18, 65, n_samples),
    "Monthly_Charges": np.random.uniform(20, 120, n_samples),
    "Tenure_Months": np.random.randint(1, 60, n_samples),
    "Contract_Type": np.random.choice(["Month-to-Month", "One-Year", "Two-Year"], n_samples, p=[0.6, 0.3, 0.1]),
    "Internet_Service": np.random.choice(["DSL", "Fiber Optic", "No"], n_samples, p=[0.4, 0.4, 0.2]),
    "Churn": np.random.choice([0, 1], n_samples, p=[0.7, 0.3]),  # 0: No Churn, 1: Churn
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Step 2: Preprocessing
# Encode categorical columns
df_encoded = pd.get_dummies(df, columns=["Contract_Type", "Internet_Service"], drop_first=True)

# Split features and target
X = df_encoded.drop("Churn", axis=1)
y = df_encoded["Churn"]

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 3: Train a Random Forest Classifier
clf = RandomForestClassifier(random_state=42, n_estimators=100)
clf.fit(X_train, y_train)

# Step 4: Evaluate the Model
y_pred = clf.predict(X_test)

# Metrics
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

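# Step 5: Predict Churn for a New Customer
# The dummy columns must match those produced by get_dummies above
# (drop_first=True removed "Month-to-Month" and "DSL").
new_customer = pd.DataFrame({
    "Customer_Age": [45],
    "Monthly_Charges": [75.0],
    "Tenure_Months": [24],
    "Contract_Type_One-Year": [1],
    "Contract_Type_Two-Year": [0],
    "Internet_Service_Fiber Optic": [1],
    "Internet_Service_No": [0],
})

# Align the column order with the training features before predicting
new_customer = new_customer[X.columns]

print("\nChurn Prediction for New Customer:", clf.predict(new_customer))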

Out[3]:

Accuracy: 0.6933333333333334

Classification Report:
               precision    recall  f1-score   support

           0       0.73      0.90      0.81       217
           1       0.36      0.14      0.21        83

    accuracy                           0.69       300
   macro avg       0.55      0.52      0.51       300
weighted avg       0.63      0.69      0.64       300


Churn Prediction for New Customer: [0]
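To put the accuracy above in context, it helps to compare it with a majority-class baseline. The sketch below uses only the class counts from the report (217 non-churners, 83 churners) and computes the accuracy of always predicting "no churn".

```python
import numpy as np

# Test-set labels rebuilt from the support column of the report above
y_test = np.array([0] * 217 + [1] * 83)

# Accuracy of a model that always predicts the most frequent class
baseline_accuracy = np.bincount(y_test).max() / len(y_test)

print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

The baseline is about 0.723, slightly higher than the random forest's 0.693. That is expected here, because the synthetic `Churn` column is sampled independently of the features.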

Data Science is a field that deals with both structured and unstructured data and comprises everything related to data cleansing, preparation, and analysis.
