Jack Montoro's K-Means and K-Nearest Neighbors Project
Jack Montoro
K-Means Clustering Final Project Lecture
Math 157
Monday, March 20, 2023
Run this first
K-means clustering is an algorithm that assigns each data point to a category on a basis not dissimilar to the K-Nearest Neighbors algorithm, but instead of comparing a point against its individual nearest neighbors, it compares the point against clusters. We will review KNN below and then take a look at k-means clustering.
K Nearest Neighbor Review:
K Nearest Neighbors is a supervised machine learning algorithm that we may use for regression or classification tasks. The inherent advantage of the KNN algorithm is that we do not need to make assumptions about the distributions of the data we are analyzing.
However, we do assume similarity between the existing case data we have access to and newly collected data, and we use that assumption to group the new data. We group each new datum according to its proximity to the categories we have formed from the existing cases. For example, in the following figure we evaluate a new datum against the classes A and B established from the existing data, assigning it a category on the basis of its proximity to the existing entries.

In the previous example we were looking at the 3 neighbor points closest to our new entry to classify it, so we can deduce that k = 3 here. How we define "proximity" is up to the data analyst, but one common choice is the Euclidean distance formula.
Once we have the count of the categories of the k nearest neighbors, we can assign a category to our newest entry depending on which category among the k neighbors had the greatest count.
This same distance formula will later be used to calculate the distances between our points for k-means clusters.
If we decide to go with the Euclidean distance formula as our distance function, we can calculate
$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$
where $p$ and $q$ are separate points of dimension $n$. The following example is the trivial case $n = 2$:
$$d(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2}$$
Participation Check:
In the cell below, implement a function that returns the Euclidean distance of two points in dimension 2:
Here, we will use the iris dataset to build a dataframe and demonstrate our algorithm.
150×6 DataFrame
Row │ Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Specie ⋯
│ Int64 Float64 Float64 Float64 Float64 String ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ 1 5.1 3.5 1.4 0.2 Iris-s ⋯
2 │ 2 4.9 3.0 1.4 0.2 Iris-s
3 │ 3 4.7 3.2 1.3 0.2 Iris-s
4 │ 4 4.6 3.1 1.5 0.2 Iris-s
5 │ 5 5.0 3.6 1.4 0.2 Iris-s ⋯
6 │ 6 5.4 3.9 1.7 0.4 Iris-s
7 │ 7 4.6 3.4 1.4 0.3 Iris-s
8 │ 8 5.0 3.4 1.5 0.2 Iris-s
9 │ 9 4.4 2.9 1.4 0.2 Iris-s ⋯
10 │ 10 4.9 3.1 1.5 0.1 Iris-s
11 │ 11 5.4 3.7 1.5 0.2 Iris-s
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
141 │ 141 6.7 3.1 5.6 2.4 Iris-v
142 │ 142 6.9 3.1 5.1 2.3 Iris-v ⋯
143 │ 143 5.8 2.7 5.1 1.9 Iris-v
144 │ 144 6.8 3.2 5.9 2.3 Iris-v
145 │ 145 6.7 3.3 5.7 2.5 Iris-v
146 │ 146 6.7 3.0 5.2 2.3 Iris-v ⋯
147 │ 147 6.3 2.5 5.0 1.9 Iris-v
148 │ 148 6.5 3.0 5.2 2.0 Iris-v
149 │ 149 6.2 3.4 5.4 2.3 Iris-v
150 │ 150 5.9 3.0 5.1 1.8 Iris-v ⋯
1 column and 129 rows omitted

Here we obtain a new DataFrame X, which is iris, but with only the 4 relevant columns that we will use in our distance formula.
150×4 DataFrame
Row │ SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
│ Float64 Float64 Float64 Float64
─────┼──────────────────────────────────────────────────────────
1 │ 5.1 3.5 1.4 0.2
2 │ 4.9 3.0 1.4 0.2
3 │ 4.7 3.2 1.3 0.2
4 │ 4.6 3.1 1.5 0.2
5 │ 5.0 3.6 1.4 0.2
6 │ 5.4 3.9 1.7 0.4
7 │ 4.6 3.4 1.4 0.3
8 │ 5.0 3.4 1.5 0.2
9 │ 4.4 2.9 1.4 0.2
10 │ 4.9 3.1 1.5 0.1
11 │ 5.4 3.7 1.5 0.2
⋮ │ ⋮ ⋮ ⋮ ⋮
141 │ 6.7 3.1 5.6 2.4
142 │ 6.9 3.1 5.1 2.3
143 │ 5.8 2.7 5.1 1.9
144 │ 6.8 3.2 5.9 2.3
145 │ 6.7 3.3 5.7 2.5
146 │ 6.7 3.0 5.2 2.3
147 │ 6.3 2.5 5.0 1.9
148 │ 6.5 3.0 5.2 2.0
149 │ 6.2 3.4 5.4 2.3
150 │ 5.9 3.0 5.1 1.8
129 rows omitted

Once we have chosen our distance function, we can calculate the distance from our test data point to each of the other points in our dataset. Once we have these values in an array, we can sort the array to order the data points by their distance to our test value.
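A rough sketch of this step, assuming the iris and X DataFrames from above; the test point values and the `dist` column name are illustrative, not the ones used in the original notebook:

```julia
using DataFrames

# Euclidean distance between two numeric vectors
euclidean(p, q) = sqrt(sum((p .- q) .^ 2))

# a hypothetical new observation: sepal length, sepal width, petal length, petal width
test = [5.0, 3.6, 1.5, 0.3]

# distance from the test point to every row of X, attached as a column to a copy of iris
iris_d = copy(iris)
iris_d.dist = [euclidean(collect(row), test) for row in eachrow(X)]

# rows ordered from nearest to farthest
sorted = sort(iris_d, :dist)
```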
150×7 DataFrame
Row │ Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Specie ⋯
│ Int64 Float64 Float64 Float64 Float64 String ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ 1 5.1 3.5 1.4 0.2 Iris-s ⋯
2 │ 2 4.9 3.0 1.4 0.2 Iris-s
3 │ 3 4.7 3.2 1.3 0.2 Iris-s
4 │ 4 4.6 3.1 1.5 0.2 Iris-s
5 │ 5 5.0 3.6 1.4 0.2 Iris-s ⋯
6 │ 6 5.4 3.9 1.7 0.4 Iris-s
7 │ 7 4.6 3.4 1.4 0.3 Iris-s
8 │ 8 5.0 3.4 1.5 0.2 Iris-s
9 │ 9 4.4 2.9 1.4 0.2 Iris-s ⋯
10 │ 10 4.9 3.1 1.5 0.1 Iris-s
11 │ 11 5.4 3.7 1.5 0.2 Iris-s
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
141 │ 141 6.7 3.1 5.6 2.4 Iris-v
142 │ 142 6.9 3.1 5.1 2.3 Iris-v ⋯
143 │ 143 5.8 2.7 5.1 1.9 Iris-v
144 │ 144 6.8 3.2 5.9 2.3 Iris-v
145 │ 145 6.7 3.3 5.7 2.5 Iris-v
146 │ 146 6.7 3.0 5.2 2.3 Iris-v ⋯
147 │ 147 6.3 2.5 5.0 1.9 Iris-v
148 │ 148 6.5 3.0 5.2 2.0 Iris-v
149 │ 149 6.2 3.4 5.4 2.3 Iris-v
150 │ 150 5.9 3.0 5.1 1.8 Iris-v ⋯
2 columns and 129 rows omitted

10×7 DataFrame
Row │ Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Specie ⋯
│ Int64 Float64 Float64 Float64 Float64 String ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ 45 5.1 3.8 1.9 0.4 Iris-s ⋯
2 │ 44 5.0 3.5 1.6 0.6 Iris-s
3 │ 6 5.4 3.9 1.7 0.4 Iris-s
4 │ 22 5.1 3.7 1.5 0.4 Iris-s
5 │ 20 5.1 3.8 1.5 0.3 Iris-s ⋯
6 │ 24 5.1 3.3 1.7 0.5 Iris-s
7 │ 47 5.1 3.8 1.6 0.2 Iris-s
8 │ 27 5.0 3.4 1.6 0.4 Iris-s
9 │ 17 5.4 3.9 1.3 0.4 Iris-s ⋯
10 │ 25 4.8 3.4 1.9 0.2 Iris-s
2 columns omitted

We can then use the most common label among the k nearest points in the distance array to find the category of our test point. We repeat this process until we have categorized all of our test points.
Of our 10 nearest neighbors, we can see that Iris-setosa is the most frequent species category that appears, so we would assign our test value this species label.
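A minimal sketch of this majority vote, reusing the `sorted` DataFrame from the sketch above with k = 10; the Species column name follows the Kaggle iris file, and the StatsBase dependency is an assumption:

```julia
using StatsBase   # for countmap

k = 10
nearest = first(sorted, k)             # the 10 nearest rows, which still carry species labels

# tally the species among the neighbors and take the most common one
tallies = countmap(nearest.Species)
votes, predicted = findmax(tallies)    # e.g. (10, "Iris-setosa")
```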
K-Means Clustering Algorithm
The distance-based categorization we use in the KNN algorithm is also at the heart of k-means, except that we no longer assume we have any label information (such as species). Instead, we categorize the data into a selected number of clusters, with each cluster defined by our chosen distance formula.
We select k clusters, for which we calculate k means. Selecting the optimal k value can be achieved by iterating over values of k from 1 up to n (where n is the number of data entries) and plotting the variance of the clusters against each value of k.
To get the optimal value of k, we find the value of k after which the rate of variance reduction diminishes.
In this example, the optimal k value would be 2, as we can observe above that the rate of decrease in variation between clusters drops off at 2 clusters.
Below, we obtain the optimal cluster set for our collected iris data, where k = 3.
We typically start by choosing k points at random (here, three) and designating them as our initial cluster centers.
We then form our clusters by calculating the distance from each point to each of the three centers and assigning each point to the center it is closest to.
Once every point is assigned, we take the mean of each cluster and recompute the clusters, reassigning each point according to its proximity to these newly calculated means.
We repeat this last step until the cluster means calculated in our latest iteration are the same as the means calculated in the previous iteration, as sketched below.
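A from-scratch sketch of this loop (Lloyd's algorithm), assuming the data is an n×d matrix with one observation per row; the function and variable names are illustrative:

```julia
using Random

# data: n×d matrix with one observation per row; k: number of clusters
function naive_kmeans(data::AbstractMatrix, k::Int; maxiter = 100)
    n, d = size(data)
    centers = float(data[randperm(n)[1:k], :])       # k random points as initial centers
    assignments = zeros(Int, n)
    for _ in 1:maxiter
        # assignment step: each point joins the cluster whose center is nearest
        for i in 1:n
            dists = [sqrt(sum((data[i, :] .- centers[j, :]) .^ 2)) for j in 1:k]
            assignments[i] = argmin(dists)
        end
        # update step: each center becomes the mean of its assigned points
        new_centers = copy(centers)
        for j in 1:k
            members = data[assignments .== j, :]
            if size(members, 1) > 0
                new_centers[j, :] = vec(sum(members, dims = 1)) ./ size(members, 1)
            end
        end
        new_centers == centers && break              # means stopped changing: we are done
        centers = new_centers
    end
    return centers, assignments
end
```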
This is a lot of work, but thankfully with the Julia package Clustering, we can calculate the k-means for our data with just a single function call.
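For instance, a minimal sketch of that call with Clustering.jl's `kmeans`, together with the loop that could produce the SSE-versus-k comparison discussed below; the Plots usage and the k = 2:10 range are assumptions:

```julia
using Clustering, Plots

# Clustering.jl expects one column per observation, so transpose the 150×4 matrix built from X
data = permutedims(Matrix(X))

# a single k-means fit with k = 3
result = kmeans(data, 3)
result.assignments   # cluster label (1, 2, or 3) for each of the 150 points
result.centers       # 4×3 matrix of cluster means
result.totalcost     # within-cluster sum of squared distances (SSE)

# SSE over a range of k values, to look for the "elbow"
# (sketched from k = 2, since kmeans may reject k = 1)
ks = 2:10
sse = [kmeans(data, k).totalcost for k in ks]
plot(ks, sse; xlabel = "k (number of clusters)", ylabel = "SSE",
     marker = :circle, legend = false)
```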
We can see from the plot that our "elbow" value is k=3, as this is the point where the rate of decrease for SSE per k value falls off.
We know this is the case as there are in fact three species of iris, but we can verify visually with scatter plots that this is true.
In the plots above, we can see that the variation within the point clusters drops off most precipitously at k = 3. As visualized in the SSE plot, adding more clusters beyond this point has diminishing returns for the distance of each data point to its cluster mean, so additional clusters will probably not improve our categorization. Therefore, we may conclude that k = 3 is the optimal choice.
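As an example, one such scatter plot could be produced as follows, coloring the petal measurements by the cluster assignment from the k = 3 fit above; the Plots `group` usage is an assumption:

```julia
using Plots

# color each iris observation by its assigned cluster from the k = 3 fit
scatter(X.PetalLengthCm, X.PetalWidthCm;
        group = result.assignments,
        xlabel = "PetalLengthCm", ylabel = "PetalWidthCm",
        title = "Petal measurements by k-means cluster")
```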
K-means clustering may be applied to a plethora of real-world problems, such as targeted ad campaigns that group customers according to available data to determine potential interest in a product or service. It may also be used to detect patterns in fraud, data breaches, or IT alerts based on observed commonalities between incidents.
Homework Questions:
Problem 1) Alternative distance formula:
You may run into situations where you would like to use some formula other than Euclidean distance to calculate the distance between datapoints. One alternative you may choose is the Minkowski Distance Formula.
This formula is represented as
$$D(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$
where n is the dimension of the data points, x is your first point, y is your second point, and p is the power parameter. For instance, if your power parameter is 2, your distance is Euclidean. In the cell below, implement a function that returns the Minkowski distance of two points of dimension n for a given power parameter p.
Problem 2) K-means for credit card data:
Using the kmeans function, find a k value that optimizes the k-means clusters for the following dataset:
660×7 DataFrame
Row │ Sl_No Customer Key Avg_Credit_Limit Total_Credit_Cards Total_visits ⋯
│ Int64 Int64 Int64 Int64 Int64 ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ 1 87073 100000 2 ⋯
2 │ 2 38414 50000 3
3 │ 3 17341 50000 7
4 │ 4 40496 30000 5
5 │ 5 47437 100000 6 ⋯
6 │ 6 58634 20000 3
7 │ 7 48370 100000 5
8 │ 8 37376 15000 3
9 │ 9 82490 5000 2 ⋯
10 │ 10 44770 3000 4
11 │ 11 52741 10000 4
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
651 │ 651 78996 195000 10
652 │ 652 78404 132000 9 ⋯
653 │ 653 28525 156000 8
654 │ 654 51826 95000 10
655 │ 655 65750 172000 10
656 │ 656 51108 99000 10 ⋯
657 │ 657 60732 84000 10
658 │ 658 53834 145000 8
659 │ 659 80655 172000 10
660 │ 660 80150 167000 9 ⋯
3 columns and 639 rows omitted

We notice that the row number is equal to the Sl_No value and that Customer Key is an arbitrary identifier, so these columns are superfluous. We can clean up the data further:
660×5 DataFrame
Row │ Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_o ⋯
│ Int64 Int64 Int64 Int64 ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ 100000 2 1 ⋯
2 │ 50000 3 0
3 │ 50000 7 1
4 │ 30000 5 1
5 │ 100000 6 0 ⋯
6 │ 20000 3 0
7 │ 100000 5 0
8 │ 15000 3 0
9 │ 5000 2 0 ⋯
10 │ 3000 4 0
11 │ 10000 4 0
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋱
651 │ 195000 10 1
652 │ 132000 9 1 ⋯
653 │ 156000 8 1
654 │ 95000 10 0
655 │ 172000 10 1
656 │ 99000 10 1 ⋯
657 │ 84000 10 1
658 │ 145000 8 1
659 │ 172000 10 1
660 │ 167000 9 0 ⋯
2 columns and 639 rows omitted

For this dataframe, utilize the kmeans function to get the best approximate categorization into clusters.
Part 1: Arrange the Dataframe "credit" so we can get k-means for a k-value of 2.
Part 2: Write a function to return a vector of size 10 filled with all the kmeans objects for our given data from k=1 to k=10.
Part 3: Plot your information with the k-values on the X-axis and the SSE for each k-value on the Y-axis.
Solutions:
Participation Check Euclidean Distance:
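The original solution cell is not reproduced here; one possible implementation for points in dimension 2:

```julia
# Euclidean distance between two points in dimension 2,
# given as (x, y) tuples or 2-element vectors
function euclidean_distance(p, q)
    return sqrt((p[1] - q[1])^2 + (p[2] - q[2])^2)
end

euclidean_distance((0, 0), (3, 4))   # 5.0
```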
Minkowski Distance Solution:
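Again, the original cell is not reproduced here; one possible implementation of the Minkowski distance:

```julia
# Minkowski distance between two points of dimension n with power parameter p;
# p = 2 recovers the Euclidean distance, p = 1 the Manhattan distance
function minkowski_distance(x, y, p)
    return sum(abs.(x .- y) .^ p)^(1 / p)
end

minkowski_distance([1, 2, 3], [4, 6, 3], 2)   # 5.0 (Euclidean)
minkowski_distance([1, 2, 3], [4, 6, 3], 1)   # 7.0 (Manhattan)
```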
K-Means For Credit Card Data Solution:
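The original solution code is not shown; a sketch under the assumption that the cleaned 660×5 DataFrame is named `credit` (as in Part 1) and that Clustering.jl and Plots.jl are available:

```julia
using Clustering, Plots

# one observation per column, as kmeans expects; integer counts converted to Float64
data = permutedims(Matrix{Float64}(credit))

# Part 2: k-means results over a range of k values
# (sketched from k = 2, since kmeans may reject k = 1)
ks = 2:10
results = [kmeans(data, k) for k in ks]

# Part 3: SSE (totalcost) against k, to locate the elbow
sse = [r.totalcost for r in results]
plot(ks, sse; xlabel = "k (number of clusters)", ylabel = "SSE",
     marker = :circle, legend = false)
```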
We can see that the "elbow" point on this plot is k=3, so we can conclude that the optimal number of k-means clusters is 3.
We verify the results below.