
Motivating examples:

  • Regression: given historical Profit/Sales data such as

      Profit  Sales
      100     20
      200     10

    we can fit a line y = mx + c (y = Profit, x = Sales) and use it to predict Profit for 2025. A minimal sketch follows.

  • Classification: given features such as color, weight, and sound, a model predicts the animal class (Cat, Dog, or Mouse); here the target is a discrete label, so this is a classification task.

  • Similarly, Salary and Spend (e.g. Salary 1000, Spend 10) could serve as features for grouping customers by their spending habits.
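As a quick illustration of the regression idea, here is a minimal sketch using NumPy's polyfit on the toy Profit/Sales numbers above (the data and the prediction point are purely illustrative):

import numpy as np

# Toy data from the notes above (illustrative only)
sales = np.array([20, 10])     # x
profit = np.array([100, 200])  # y

# Least-squares fit of y = m*x + c
m, c = np.polyfit(sales, profit, 1)
print(f"m = {m:.1f}, c = {c:.1f}")                    # m = -10.0, c = 300.0
print("Predicted profit at sales = 15:", m * 15 + c)  # 150.0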

What is Machine Learning?

Machine learning is a branch of artificial intelligence that enables computers to learn from data and make predictions or decisions without being explicitly programmed. It involves using algorithms to identify patterns in data and improve performance over time based on experience.

Difference between Supervised and Unsupervised ML

(Figures comparing supervised and unsupervised machine learning.)

Unsupervised learning, also known as unsupervised machine learning, uses machine learning algorithms to analyze and cluster unlabeled datasets.

  • These algorithms discover hidden patterns or data groupings without the need for human intervention.

  • Its ability to discover similarities and differences in information makes it the ideal solution for:

    - Exploratory data analysis
    - Cross-selling strategies
    - Customer segmentation
    - Image recognition

Examples

  • Determining the target market for a new product we want to release, when we have no historical data on the demographics of that market.

  • Google News is an instance of clustering that uses unsupervised learning to group news items based on their content.

  • Social network analysis is conducted to cluster friends based on the frequency of connections between them; such analysis reveals the links between the users of a social networking site.

  • The geographic placement of servers is determined by clustering web requests received from specific areas of the world.

  • Recommendation engines: unsupervised learning applied to past purchase behavior can help businesses discover data trends and develop effective cross-selling strategies.

Standard Process involved in Machine Learning

Problem Definition: Clearly define the problem you want to solve and the goal of the machine learning model.

Data Collection: Gather relevant data that will be used to train and test the model.

Data Preprocessing:

  • Clean the data (handle missing values, remove duplicates).

  • Transform and normalize the data (scaling, encoding categorical variables).

  • Split the data into training and testing sets; the test data should be a statistically representative sample of the whole dataset (see the stratified-split sketch after this list).
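One way to keep the test set representative is a stratified split, which preserves class proportions across the two sets; here is a minimal sketch using scikit-learn's train_test_split (the data is synthetic, for illustration):

import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 100 samples, 3 features, an 80/20 class imbalance
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 80 + [1] * 20)

# stratify=y keeps the 80/20 class ratio in both splits,
# so the test set mirrors the class balance of the full dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
print(y_test.mean())  # ~0.2, same positive-class ratio as y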

Feature Engineering:

  • Select relevant features (feature selection); a minimal sketch follows this list.

  • Create new features (feature extraction) if needed to improve model performance.
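A minimal sketch of the feature-selection step using scikit-learn's SelectKBest (the data is synthetic, and keeping k=2 features is an illustrative assumption):

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data where only features 0 and 2 carry signal
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 2] > 0).astype(int)

# Keep the k features with the highest ANOVA F-scores
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
print(selector.get_support())  # boolean mask of the selected features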

Model Selection: Choose an appropriate machine learning algorithm based on the problem type (e.g., classification, regression).

Model Training: Train the selected model using the training dataset, allowing it to learn patterns from the data.

Model Evaluation: Assess the model's performance on the test dataset using appropriate metrics (e.g., accuracy, precision, recall, RMSE).

Hyperparameter Tuning: Optimize the model's hyperparameters to improve performance.
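A minimal hyperparameter-tuning sketch using scikit-learn's GridSearchCV; the parameter grid here is an illustrative assumption, not something fixed by this notebook:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic classification data, just for illustration
X, y = make_classification(n_samples=200, random_state=42)

# Search over the regularization strength C with 5-fold cross-validation
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={"C": [0.01, 0.1, 1, 10]},
                    cv=5, scoring="accuracy")
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)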

Model Deployment: Implement the trained model into a production environment where it can make predictions on new data.

Monitoring and Maintenance: Continuously monitor the model’s performance in production and update it as needed to maintain accuracy over time.

There are three main tasks when performing unsupervised learning (in no particular order):

1. Clustering

Clustering involves grouping unlabeled data based on their similarities and differences; therefore, when two instances appear in different groups, we can infer that they have dissimilar properties and/or features.

E.g.: exclusive clustering, overlapping clustering, hierarchical clustering, and probabilistic clustering.
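For instance, hierarchical clustering is available in scikit-learn as AgglomerativeClustering; here is a minimal sketch on synthetic blobs (all parameter choices are illustrative):

from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

# Synthetic 2-D data with three natural groups
X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

# Bottom-up (agglomerative) hierarchical clustering into 3 groups
labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)
print(labels[:10])  # cluster index assigned to the first 10 points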

2. Association Rules

Association rule learning is a rule-based machine learning method for discovering interesting relationships between variables in a given dataset. The intention of the method is to identify strong rules discovered in data using a measure of interest. E.g.: the Apriori algorithm.
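A minimal market-basket sketch using the mlxtend library listed below; the transactions are invented for illustration, and the support and confidence thresholds are assumptions:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Toy transactions (illustrative only)
transactions = [["milk", "bread"],
                ["milk", "bread", "butter"],
                ["bread", "butter"],
                ["milk", "butter"]]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions),
                  columns=te.columns_)

# Frequent itemsets with support >= 0.5, then rules with confidence >= 0.6
itemsets = apriori(df, min_support=0.5, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])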

3. Dimensionality Reduction

This refers to the transformation of data from a high-dimensional space to a low-dimensional space such that the low-dimensional representation retains meaningful properties of the original data. One reason to reduce the dimensionality of data is to simplify the modeling problem, since more input features can make the modeling task more challenging; this is known as the curse of dimensionality.
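A minimal dimensionality-reduction sketch using PCA from scikit-learn (the data is synthetic and 10-dimensional; projecting to 2 components is an illustrative choice):

from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic 10-dimensional data projected down to 2 dimensions
X, _ = make_blobs(n_samples=200, n_features=10, random_state=42)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(X_2d.shape)                      # (200, 2)
print(pca.explained_variance_ratio_)   # variance retained by each component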

Library installation (through a proxy, if needed): pip install --proxy http://user:password@proxy.server:8000 <package-name>

Important Libraries

  • NumPy

  • pandas

  • scikit-learn

  • apyori

  • mlxtend

Supervised ML Example: Predicting whether a person has diabetes based on their health data.

# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Step 1: Load the dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
                'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
data = pd.read_csv(url, header=None, names=column_names)
data.head()
data.describe()
data.shape
(768, 9)
# Step 2: Data Preprocessing
X = data.drop('Outcome', axis=1)  # Features (input variables)
y = data['Outcome']               # Target variable (output variable)

# Step 3: Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Feature Scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Step 5: Model Selection and Training
model = LogisticRegression()
model.fit(X_train, y_train)
# Step 6: Model Prediction
y_pred = model.predict(X_test)
y_pred
array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0], dtype=int64)

Confusion-matrix terms: TP (true positive), FP (false positive), TN (true negative), FN (false negative).

Accuracy = (TP + TN) / (TP + FP + TN + FN)

Comparing y_test (actual) against y_pred: a diabetic predicted as diabetic is a TP; a non-diabetic predicted as non-diabetic is a TN; a non-diabetic predicted as diabetic is an FP; a diabetic predicted as non-diabetic is an FN.

# Step 7: Model Evaluation
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
Accuracy: 0.75
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)
print("Classification Report:")
print(classification_report(y_test, y_pred))
Confusion Matrix:
[[79 20]
 [18 37]]
Classification Report:
              precision    recall  f1-score   support

           0       0.81      0.80      0.81        99
           1       0.65      0.67      0.66        55

    accuracy                           0.75       154
   macro avg       0.73      0.74      0.73       154
weighted avg       0.76      0.75      0.75       154
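As a sanity check, read the confusion matrix with scikit-learn's convention (rows are actual classes, columns are predicted classes, class 0 first): TN = 79, FP = 20, FN = 18, TP = 37. Then Accuracy = (TP + TN) / (TP + FP + TN + FN) = (37 + 79) / 154 ≈ 0.75, matching the reported accuracy.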

Unsupervised Example: Grouping similar data points (customers) based on their spending habits.

  • Data Generation: Here, we generate synthetic data with 500 samples and 4 distinct clusters using make_blobs from sklearn.datasets. This is just for illustration purposes, and in a real-world scenario, you'd use actual data.

  • Data Visualization: The data points are visualized in a 2D scatter plot to understand the distribution. This step is optional but helps in visualizing the clustering effect later.

  • Model Selection and Training: We select the K-Means clustering algorithm and fit it to the data. We specify n_clusters=4, meaning we want to identify 4 clusters in the data.

  • Model Prediction: After training, we predict the cluster labels for each data point. These labels indicate which cluster each data point belongs to.

  • Cluster Visualization: The final plot shows the data points color-coded by cluster and the cluster centers marked with a red 'X'. This visualization helps in understanding how the data points are grouped based on their similarity.

# Import necessary libraries
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Step 1: Generate synthetic data (for illustration purposes)
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.60, random_state=42)

# Step 2: Visualize the data (optional, just to see the input data distribution)
plt.scatter(X[:, 0], X[:, 1], s=50, cmap='viridis')
plt.title("Input Data")
plt.show()
(Figure: scatter plot of the input data.)
# Step 3: Model Selection and Training (Applying K-Means Clustering)
kmeans = KMeans(n_clusters=4)  # Assume we want to find 4 clusters
kmeans.fit(X)

# Step 4: Model Prediction (Assign data points to clusters)
y_kmeans = kmeans.predict(X)
# Step 5: Visualize the Clusters
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75, marker='X')
plt.title("K-Means Clustering")
plt.show()
(Figure: the K-Means clusters, with cluster centers marked by red 'X' markers.)
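The choice of n_clusters=4 above is an assumption (here we happen to know it matches how the synthetic data was generated). One common way to check the number of clusters is the silhouette score, where values closer to 1 indicate better-separated clusters; here is a minimal sketch on the same kind of synthetic data:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Same synthetic data as the example above
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.60, random_state=42)

# Compare candidate cluster counts by silhouette score
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))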