Regression example (sample data): Profit 100, Sales 20; Profit 200, Sales 10.
Predicting Profit (2025): Y = mx + c, where Y is Profit and x is Sales.
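As a minimal sketch of the Y = mx + c idea, the snippet below fits a straight line through the two sample rows and uses it to predict Profit for a new Sales value. The pairing of values into rows (Profit 100 with Sales 20, Profit 200 with Sales 10) is an assumption about how the notes were laid out, and the helper name `predict_profit` is hypothetical.

```python
# Two sample rows; the row pairing is an assumption about the original notes.
sales = [20.0, 10.0]     # x
profit = [100.0, 200.0]  # Y

# With exactly two points, the line Y = m*x + c passes through both of them.
m = (profit[1] - profit[0]) / (sales[1] - sales[0])  # slope
c = profit[0] - m * sales[0]                         # intercept


def predict_profit(x):
    """Predict Profit for a given Sales value using Y = m*x + c."""
    return m * x + c


print(f"m = {m}, c = {c}")
print("Predicted profit for sales = 15:", predict_profit(15.0))
```

With only two points the line fits them exactly; with real data you would fit the slope and intercept by least squares over many rows.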
Classification example: the features are color, weight, and sound; the type (label) is Cat, Dog, or Mouse. The model predicts the animal class, so the problem type is classification.
Sample data: Salary 1000, Spend 10.
What is Machine Learning?
Machine learning is a branch of artificial intelligence that enables computers to learn from data and make predictions or decisions without being explicitly programmed. It involves using algorithms to identify patterns in data and improve performance over time based on experience.
Difference between Supervised and Unsupervised ML
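Supervised learning trains a model on labeled data, where every example comes with a known target (for example, diabetic or not diabetic). Unsupervised learning, by contrast, uses machine learning algorithms to analyze and cluster unlabeled datasets.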
These algorithms discover hidden patterns or data groupings without the need for human intervention.
Its ability to discover similarities and differences in information makes it well suited to exploratory data analysis, cross-selling strategies, customer segmentation, and image recognition.
Examples
Determine the target market for a new product that we want to release (when we have no historical data on the demographics of that market).
Google uses clustering, a form of unsupervised learning, to group news items based on their content.
Social network analysis clusters friends based on how frequently they connect with each other. Such analysis reveals the links between the users of a social networking website.
The geographic areas for servers are determined by clustering the web requests received from specific areas of the world.
Recommendation engines: Combining past purchase behavior with unsupervised learning helps businesses discover data trends that they can use to develop effective cross-selling strategies.
Standard Process involved in Machine Learning
Problem Definition: Clearly define the problem you want to solve and the goal of the machine learning model.
Data Collection: Gather relevant data that will be used to train and test the model.
Data Preprocessing:
Feature Engineering:
Select relevant features (feature selection).
Create new features (feature extraction) if needed to improve model performance.
Model Selection: Choose an appropriate machine learning algorithm based on the problem type (e.g., classification, regression).
Model Training: Train the selected model using the training dataset, allowing it to learn patterns from the data.
Model Evaluation: Assess the model's performance on the test dataset using appropriate metrics (e.g., accuracy, precision, recall, RMSE).
Hyperparameter Tuning: Optimize the model's hyperparameters to improve performance.
Model Deployment: Implement the trained model into a production environment where it can make predictions on new data.
Monitoring and Maintenance: Continuously monitor the model’s performance in production and update it as needed to maintain accuracy over time.
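The steps above can be sketched end to end with scikit-learn. This is only an illustration under stated assumptions: the data is synthetic (from `make_classification`), and the choice of logistic regression, the scaler, the `C` grid, and the 80/20 split are placeholders rather than recommendations from the notebook. Deployment and monitoring are outside what a short snippet can show.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Data collection: synthetic stand-in for a real labeled dataset (assumption).
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0
)

# Hold out a test set so evaluation uses data the model has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Preprocessing + model selection: scale the features, then classify.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Hyperparameter tuning: try a few regularization strengths with 5-fold CV.
search = GridSearchCV(
    pipeline,
    param_grid={"logisticregression__C": [0.1, 1.0, 10.0]},
    cv=5,
)

# Model training: fits every candidate, then refits the best on all training data.
search.fit(X_train, y_train)

# Model evaluation on the held-out test set.
y_pred = search.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print("Best C:", search.best_params_["logisticregression__C"])
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

Wrapping the scaler and the model in one pipeline means the scaler is fit only on training folds during cross-validation, which avoids leaking test information into preprocessing.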
There are three main tasks when performing unsupervised learning (in no particular order):
1. Clustering
Clustering groups unlabeled data based on their similarities and differences. When two instances appear in different groups, we can therefore infer that they have dissimilar properties and/or features.
Types include exclusive clustering, overlapping clustering, hierarchical clustering, and probabilistic clustering.
2. Association Rules
Association rule learning is a rule-based machine learning method for discovering interesting relationships between variables in a given dataset. The method aims to identify strong rules in the data using a measure of interest. E.g., the Apriori algorithm.
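To make "measure of interest" concrete, here is a small sketch that computes three common measures (support, confidence, and lift) for a single rule. It is not the Apriori algorithm itself, which additionally searches for frequent itemsets; the transactions and the rule {bread} -> {butter} are made-up illustration data.

```python
# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)


antecedent = {"bread"}
consequent = {"butter"}

# Support of the whole rule: how often both sides appear together.
rule_support = support(antecedent | consequent)

# Confidence: of the baskets with the antecedent, how many also have the consequent.
confidence = rule_support / support(antecedent)

# Lift: confidence relative to how common the consequent is overall.
# Values above 1 suggest the two sides occur together more than by chance.
lift = confidence / support(consequent)

print(f"support={rule_support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

Libraries such as Apyori and mlxtend automate the search over many candidate rules and report these same measures.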
3. Dimensionality reduction
This refers to the transformation of data from a high-dimensional space to a low-dimensional space such that the low-dimensional space retains the meaningful properties of the original data. One reason to reduce the dimensionality of data is to simplify the modeling problem, since more input features can make the modeling task more challenging, a difficulty known as the curse of dimensionality.
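One common technique for this is principal component analysis (PCA), which the text above does not name, so treat it as one illustrative choice. The sketch below applies PCA with NumPy's SVD to synthetic 5-dimensional data that is built from 2 underlying factors plus a little noise; the data, the noise level, and the choice of 2 components are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 samples in 5 dimensions driven by 2 hidden factors.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 5))

# PCA step 1: center each feature at zero.
X_centered = X - X.mean(axis=0)

# PCA step 2: the SVD gives the principal directions as the rows of Vt.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# PCA step 3: project onto the first 2 principal directions.
n_components = 2
X_reduced = X_centered @ Vt[:n_components].T

# Share of the total variance captured by each component.
explained_ratio = S**2 / np.sum(S**2)

print("Original shape:", X.shape)
print("Reduced shape:", X_reduced.shape)
print("Variance kept by 2 components:", explained_ratio[:n_components].sum())
```

Because the synthetic data really has only two underlying factors, two components keep almost all of the variance; on real data you would inspect `explained_ratio` to decide how many components to keep.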
Library installation: pip install --proxy http://u:[email protected]:8000 pac
Important Libraries
NumPy
pandas
scikit-learn
Apyori
mlxtend
Supervised ML Example: Predicting whether a person has diabetes based on their health data.
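Confusion-matrix terms: tp (true positive), fp (false positive), tn (true negative), fn (false negative).
Accuracy = (tp + tn) / (tp + fp + tn + fn)
With Ytest as the actual label and YPred as the predicted label (dia = diabetic, ndia = not diabetic):
tp: actual dia, predicted dia
tn: actual ndia, predicted ndia
fp: actual ndia, predicted dia
fn: actual dia, predicted ndia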
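The counting behind the accuracy formula can be sketched in plain Python. The label lists below are made-up illustration data, not results from a real model; `y_test` holds the actual labels and `y_pred` the predicted ones.

```python
# Hypothetical actual and predicted labels ("dia" = diabetic, "ndia" = not diabetic).
y_test = ["dia", "dia", "ndia", "ndia", "dia", "ndia", "ndia", "dia"]
y_pred = ["dia", "ndia", "ndia", "dia", "dia", "ndia", "ndia", "dia"]

tp = tn = fp = fn = 0
for actual, predicted in zip(y_test, y_pred):
    if actual == "dia" and predicted == "dia":
        tp += 1  # diabetic, correctly flagged
    elif actual == "ndia" and predicted == "ndia":
        tn += 1  # not diabetic, correctly cleared
    elif actual == "ndia" and predicted == "dia":
        fp += 1  # not diabetic, wrongly flagged
    else:
        fn += 1  # diabetic, missed by the model

# Note the parentheses: correct predictions over all predictions.
accuracy = (tp + tn) / (tp + fp + tn + fn)

print(f"tp={tp} tn={tn} fp={fp} fn={fn} accuracy={accuracy}")
```

In practice, `sklearn.metrics.confusion_matrix` and `accuracy_score` compute the same quantities directly from the two label arrays.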
Unsupervised Example: Grouping similar data points (customers) based on their spending habits.
Data Generation: Here, we generate synthetic data with 300 samples and 4 distinct clusters using make_blobs from sklearn.datasets. This is just for illustration purposes, and in a real-world scenario, you'd use actual data.
Data Visualization: The data points are visualized in a 2D scatter plot to understand the distribution. This step is optional but helps in visualizing the clustering effect later.
Model Selection and Training: We select the K-Means clustering algorithm and fit it to the data. We specify n_clusters=4, meaning we want to identify 4 clusters in the data.
Model Prediction: After training, we predict the cluster labels for each data point. These labels indicate which cluster each data point belongs to.
Cluster Visualization: The final plot shows the data points color-coded by cluster and the cluster centers marked with a red 'X'. This visualization helps in understanding how the data points are grouped based on their similarity.
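A minimal sketch of the steps described above, assuming scikit-learn and matplotlib are installed. The values `cluster_std=0.60`, `random_state=0`, and `n_init=10` are assumptions chosen for a reproducible illustration, and the helper name `plot_points` is hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Data generation: 300 synthetic 2D samples around 4 centers.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Model selection and training: K-Means with 4 clusters.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
kmeans.fit(X)

# Model prediction: one cluster label per data point.
labels = kmeans.predict(X)
centers = kmeans.cluster_centers_


def plot_points(X, labels=None, centers=None):
    """Scatter plot of the data; colors by cluster and red X centers if given."""
    import matplotlib.pyplot as plt  # imported here so clustering runs without it

    plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="viridis", s=30)
    if centers is not None:
        plt.scatter(centers[:, 0], centers[:, 1], c="red", marker="x", s=200)
    plt.xlabel("Feature 1")
    plt.ylabel("Feature 2")
    plt.show()


print("Cluster sizes:", [int((labels == k).sum()) for k in range(4)])
```

In a notebook cell, `plot_points(X)` shows the raw data distribution and `plot_points(X, labels, centers)` shows the color-coded clusters with their centers.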