GitHub Repository: suyashi29/python-su
Path: blob/master/ML Classification using Python/Machine Learning and Implementation.ipynb
Kernel: Python 3 (ipykernel)

Machine Learning Classification

What is Classification in Machine Learning?

Classification is a type of Supervised Learning where the goal is to predict categories or labels.

Examples:

  • Email → Spam or Not Spam

  • Diagnosis → Diabetic or Non-Diabetic

  • Banking → Fraud or Legit

  • Customer → Will Churn or Not

The output is categorical.

| Algorithm           | Type           | Best For               | Example Use         |
|---------------------|----------------|------------------------|---------------------|
| Logistic Regression | Linear         | Binary classification  | Spam vs Not Spam    |
| Decision Tree       | Non-linear     | Simple & interpretable | Loan approval       |
| Random Forest       | Ensemble       | High accuracy, robust  | Fraud detection     |
| KNN                 | Distance-based | Small datasets         | Recommender         |
| SVM                 | Margin-based   | High-dimensional data  | Text classification |
| Naive Bayes         | Probabilistic  | Text data              | Sentiment analysis  |
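All of the classifiers in the table share scikit-learn's uniform fit/predict API, so they can be swapped in and out of the same training loop. A minimal sketch on synthetic data (the dataset and model settings here are illustrative, not tuned):

```python
# Sketch: the table's classifiers all follow the same scikit-learn API.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data stands in for a real dataset
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),
}

# Every model trains and scores through the same two calls
for name, model in models.items():
    score = model.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name}: {score:.2f}")
```

Because the API is uniform, comparing algorithms is usually a loop like this rather than six separate scripts.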

Machine Learning Workflow

Below is the typical ML workflow:

  1. Data Collection

  2. Data Preprocessing

  3. Model Training

  4. Model Evaluation

  5. Model Deployment

  6. Prediction

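Steps 2 through 6 of this workflow can be chained into a single scikit-learn `Pipeline` object, which keeps preprocessing and the model together. A minimal sketch, using synthetic data rather than the hiring dataset:

```python
# Sketch: the preprocessing -> training -> evaluation -> prediction steps
# of the workflow, expressed as one scikit-learn Pipeline.
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=4, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # data preprocessing
    ("scale", StandardScaler()),                   # data preprocessing
    ("model", LogisticRegression()),               # the classifier itself
])

pipe.fit(X_train, y_train)           # model training
acc = pipe.score(X_test, y_test)     # model evaluation
pred = pipe.predict(X_test[:1])      # prediction on new data
print(f"accuracy={acc:.2f}, first prediction={pred[0]}")
```

A pipeline also prevents a common mistake: fitting the scaler or imputer on the test data.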

Predicting Hiring Chances

We will cover the basics of Machine Learning Classification using Python with a real-world-inspired example: predicting whether a candidate will be hired based on their profile.

1. Python Setup

We will use the following libraries:

  • pandas and numpy for data manipulation

  • scikit-learn for ML algorithms and preprocessing

  • matplotlib and seaborn for visualization

# Install required libraries if not already installed
# !pip install pandas numpy scikit-learn matplotlib seaborn

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

2. Data Preprocessing for Classification

H_data = pd.read_excel("Hiringdataset.xlsx")
H_data

print("\nInfo:")
H_data.info()
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1200 entries, 0 to 1199
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   YearsExperience     1140 non-null   float64
 1   EducationLevel      1140 non-null   float64
 2   SkillsScore         1140 non-null   float64
 3   CertificationCount  1140 non-null   float64
 4   Hired               1200 non-null   int64
dtypes: float64(4), int64(1)
memory usage: 47.0 KB
print("\nStatistical Summary:")
H_data.describe()  # numerical columns
Statistical Summary:
# H_data.describe(include=object)  # categorical columns

Data Preparation

H_data.isnull().sum()
YearsExperience       60
EducationLevel        60
SkillsScore           60
CertificationCount    60
Hired                  0
dtype: int64
missing_percentage = (H_data.isnull().sum() / len(H_data)) * 100
missing_percentage
YearsExperience       5.0
EducationLevel        5.0
SkillsScore           5.0
CertificationCount    5.0
Hired                 0.0
dtype: float64
# Fill numeric columns with the median
num_cols = H_data.select_dtypes(include=['float64', 'int64']).columns
for col in num_cols:
    H_data[col] = H_data[col].fillna(H_data[col].median())

# Fill categorical columns with the mode
cat_cols = H_data.select_dtypes(include=['object']).columns
for col in cat_cols:
    H_data[col] = H_data[col].fillna(H_data[col].mode()[0])
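An alternative to looping with `fillna` is scikit-learn's `SimpleImputer`, which learns the medians once and can re-apply them to new data at prediction time. A small sketch with made-up column values (not the hiring dataset):

```python
# Sketch: SimpleImputer performs the same median fill as the loop above,
# but stores the learned medians for reuse on future data.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one missing value per column (illustrative values)
df = pd.DataFrame({"YearsExperience": [1.0, np.nan, 5.0, 3.0],
                   "SkillsScore": [70.0, 80.0, np.nan, 90.0]})

imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# Median of [1, 5, 3] is 3, so the NaN in YearsExperience becomes 3.0
print(filled.isnull().sum().sum())  # 0 missing values remain
```

Fitting the imputer on training data only, then calling `transform` on test data, avoids leaking test statistics into preprocessing.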
missing_percentage = (H_data.isnull().sum() / len(H_data)) * 100
missing_percentage

YearsExperience       0.0
EducationLevel        0.0
SkillsScore           0.0
CertificationCount    0.0
Hired                 0.0
dtype: float64

EDA & Visualizations

H_data.hist(figsize=(10, 6), bins=20, edgecolor='yellow')
plt.tight_layout()
plt.show()
Image in a Jupyter notebook
sns.pairplot(H_data, hue="Hired")
plt.show()
Image in a Jupyter notebook
plt.figure(figsize=(10, 6))
sns.heatmap(H_data.corr(), annot=True, cmap="Blues")
plt.title("Correlation Heatmap")
plt.show()
Image in a Jupyter notebook

Prepare Data for ML

X = H_data.drop("Hired", axis=1)
y = H_data["Hired"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
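Note the asymmetry above: `fit_transform` on the training split, but plain `transform` on the test split. The scaler must learn its mean and standard deviation from training data only, and reuse those statistics on anything it scales later. A tiny sketch with made-up numbers:

```python
# Sketch: why the scaler is fit on the training split only.
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0], [4.0]])
test = np.array([[10.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean/std from train only
test_scaled = scaler.transform(test)        # reuses the training statistics

print(scaler.mean_[0])        # 2.5, the training mean
print(test_scaled[0, 0] > 0)  # the test point sits far above that mean
```

Fitting the scaler on the test split (or on the full dataset before splitting) would leak test-set information into preprocessing and inflate evaluation scores.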
Hiring_m = LogisticRegression()
Hiring_m.fit(X_train_scaled, y_train)

y_pred = Hiring_m.predict(X_test_scaled)
print("Accuracy:", accuracy_score(y_test, y_pred))
Accuracy: 0.9083333333333333

Classification metrics

Each prediction falls into one of four outcomes, based on the real label versus the predicted label:

  • Real 1, Predicted 1 → True Positive (TP)

  • Real 1, Predicted 0 → False Negative (FN)

  • Real 0, Predicted 1 → False Positive (FP)

  • Real 0, Predicted 0 → True Negative (TN)

  • Accuracy = (TP + TN) / (TP + TN + FP + FN)
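The four counts above are all that is needed to compute accuracy, precision, and recall by hand. A sketch with made-up counts (not the notebook's actual confusion matrix):

```python
# Sketch: computing the classification metrics from illustrative
# TP/FN/FP/TN counts. These numbers are invented for the example.
tp, fn, fp, tn = 40, 10, 5, 45

accuracy = (tp + tn) / (tp + tn + fp + fn)  # all correct / all predictions
precision = tp / (tp + fp)                  # of predicted positives, how many are real
recall = tp / (tp + fn)                     # of real positives, how many were found

print(f"accuracy={accuracy:.2f}")   # 0.85
print(f"precision={precision:.2f}")
print(f"recall={recall:.2f}")       # 0.80
```

With imbalanced classes, accuracy alone can look good while recall on the minority class is poor, which is exactly what the classification report below reveals for the "Hired" class.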

print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.92      0.98      0.95       213
           1       0.67      0.37      0.48        27

    accuracy                           0.91       240
   macro avg       0.80      0.67      0.71       240
weighted avg       0.90      0.91      0.90       240
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, cmap="Purples", fmt="d")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()
Image in a Jupyter notebook
import pickle

# Save the trained model
with open("logistic_model.pkl", "wb") as file:
    pickle.dump(Hiring_m, file)

# Save the scaler (important for prediction)
with open("scaler.pkl", "wb") as file:
    pickle.dump(scaler, file)

print("Model and scaler saved successfully!")
Model and scaler saved successfully!
import pickle

# Load model
with open("logistic_model.pkl", "rb") as file:
    loaded_model = pickle.load(file)

# Load scaler
with open("scaler.pkl", "rb") as file:
    loaded_scaler = pickle.load(file)

print("Model and scaler loaded!")
Model and scaler loaded!

Predict on NEW Data Points

YearsExperience | EducationLevel | SkillsScore | CertificationCount

new_data = pd.DataFrame({
    "YearsExperience": [2, 7, 10],
    "EducationLevel": [1, 2, 3],
    "SkillsScore": [78, 91, 88],
    "CertificationCount": [1, 4, 5]
})

new_data_scaled = loaded_scaler.transform(new_data)

new_predictions = loaded_model.predict(new_data_scaled)
new_predictions
array([0, 0, 1], dtype=int64)
new_data["Predicted_Hired"] = new_predictions
new_data
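Beyond the hard 0/1 labels, logistic regression can report the probability behind each prediction via `predict_proba`, which is often more useful for ranking candidates. A self-contained sketch with a toy one-feature model (the notebook would call `loaded_model` on scaled data instead):

```python
# Sketch: predict_proba exposes the probability behind each 0/1 label.
# The toy model here stands in for the saved hiring model.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Clearly separated toy data: low values -> class 0, high values -> class 1
X = np.array([[1.0], [2.0], [3.0], [8.0], [9.0], [10.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)

proba = model.predict_proba([[9.5]])[0]  # [P(class 0), P(class 1)]
label = model.predict([[9.5]])[0]

print(label)           # the hard 0/1 decision
print(proba[1] > 0.5)  # predict() just picks the larger probability
```

Choosing a threshold other than 0.5 on these probabilities is one way to trade precision against recall, which matters here given the low recall on the "Hired" class.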