- Introduction
In today's world, driven by big data, businesses and individuals strive to make sense of the massive amounts of available data. Data mining employs analysis tools to unearth previously unknown relationships between data objects. Clustering, a prevalent data mining technique, entails grouping similar data points to reveal hidden patterns and insights. Clustering algorithms are instrumental in data organization and analysis, offering valuable insights for decision-making and comprehending complex datasets. (Gupta, et al., 2021)
Clustering algorithms are critical in today's data-driven society, enabling precise analysis and organization of massive datasets. Selecting clustering algorithm that can adapt to the quality and quantity is crucial. The k-means clustering algorithm is widely used in industries such as marketing, healthcare, and image processing.
The k-means clustering algorithm is one of many clustering algorithms used in machine learning. Each algorithm has advantages, disadvantages, and specific applications based on the similarities and differences between the K-means and other prominent clustering methods. Overall, K-means clustering is an effective and simple unsupervised learning approach with a wide range of applications. It is important to consider its limitations and select appropriate number of clusters for optimal outcomes (Karthikeyan, et al., 2020).
It partitions data points into clusters, enabling consumer segmentation, trend analysis in patient data, and image compression, and more. The k-means algorithm maximizes intra-cluster similarity while minimizing inter-cluster similarity. Understanding and utilizing the full potential of the k-means algorithm is essential for uncovering hidden insights within data. (Mohammed Elhassan & Ahamed, 2020).
This paper delves into various clustering algorithms, including Random, K-means++, Canopy, and Farthest First. It examines their methodologies and advantages in data analysis. Understanding these techniques is crucial for effectively extracting valuable insights from large datasets.
.K-means clustering algorithm:
K-means clustering is an unsupervised machine learning method that groups data points into clusters based on similarities. The "k" in K-means represents the number of clusters to identify. It aims to minimize variation within each cluster using the within-cluster sum of squares metric (Karthikeyan, et al., 2020). The algorithm works iteratively by:
- Determining the number of clusters "k".
- Assigning data points to their closest centroid can be done randomly or using approaches such as k-means++, canopy, and farthest first.
- Recalculating cluster centers, the process repeats until cluster stability is achieved.
Using k-means clustering, a set of observations (x1, x2, ..., xn)is partitioned into with in cluster sum squares (WCSS) (i.e., variance) (Hochreiter, 2014). Formally, the objective is to find:argmin�∑�=1�∑�∈��‖�−��‖2=argmin�∑�=1�|��|Var��
Where μi is the mean (also called centroid) of points in , i.e.
(2)
The size of , denoted as , represents the standard L2 norm. This standard minimizes the pairwise squared deviations of points within the same cluster.
(3)
Because the overall variance remains constant, this is equivalent to maximizing the sum of squared deviations between points in distinct clusters.
The k-mean clustering algorithm has various advantages that make it a useful tool for data analysis. It offers a quick and effective method for analyzing huge datasets. By grouping comparable data points together, the method decreases data complexity and simplifies interpretation. This is especially valuable in exploratory data analysis, which requires a grasp of the underlying patterns and structure (Syakur, et al., 2018).
K-means clustering is an indispensable tool for data scientists exploring unsupervised machine learning and/or for data analysts seeking insights from unlabeled data, especially when acquiring labeled data is expensive and/or time-consuming.
The ability to effectively employ this potent strategy will depend on comprehension of the inner workings of the k-mean clustering algorithm, careful feature selection, data preparation, and clustering result analysis. The k-means clustering algorithm provides multiple initialization methods for cluster centroids, including random, k-means++, canopy, and farthest first.
These techniques are used to position the initial centroids in a way that increases the convergence and overall quality of the clustering. Here's a summary of each method:
- Random clustering Algorithm:
The k-means clustering algorithm assigns data points to clusters based on their similarity. It randomly selects cluster centers and iteratively updates them until convergence is achieved. The steps involve choosing cluster centers, measuring distances to data points, assigning points to clusters, and calculating new centroids. The steps in the k-means clustering algorithm are as follows:
- Cluster centers 'c' are randomly chosen.
- Calculate the separation between each data point and cluster center.
- Allocate data points to cluster centers that are as close together as possible.
- Calculate the new centroids using the assigned locations.
- Repeat steps 2-4 until convergence is attained.
The algorithm tries to minimize the squared error function, which is given by
4)
Where is a data point, is a cluster center, and is the set of centers.This clustering algorithm is widely used in unsupervised machine learning to group observations into clusters. It is important to recognize that the algorithm's outcomes can be influenced by the initial cluster assignments. Hence, it is common practice to run the algorithm multiple times with different starting conditions (Edsa & Chandra, 2023).
- k-means++ clustering Algorithm:
The K-means++ technique is an intelligent centroid that initializes centroids randomly in this algorithm. The goal is to position the centroids as far apart as possible, ensuring maximum distance between them. There are basic procedures to initialize centroids with K-means++:
- Randomly select the initial centroid (C₁).
- Calculate the separation between all data points and the first cluster center:
(5)
This formula represents the distance between the farthest centroid and data point .
- Assign the data point as a new centroid.
- Re-initialize the centroids by determining the average of all the data points in that cluster.
Repeat steps 3 and 4 until all of the defined K clusters have been found (Arthur & Vassilvitskii, 2007).
- Canopy Clustering Algorithm:
Proposed by Andrew et al. (2000), the unsupervised pre-clustering method accelerates clustering operations on large datasets and is often used as a preliminary step for the K-means technique, also known as hierarchical clustering. The algorithm consists of two phases:
- Phase 1: Use an "non-expansive" distance measurement to approximately divide the data into overlapping subsets known as canopies. This is done using two thresholds, T1 (the loose distance) and T2 (the tight distance), where T1>T2.
- Phase 2: Perform elaborate clustering within each canopy by using an "expensive" distance measurement.
The canopy clustering algorithm is a faster and simpler algorithm commonly used as an early stage in more stringent clustering algorithms. (Andrew, et al., 2000)
- Farthest first clustering algorithm:
The farthest first clustering, also known as complete-linkage clustering, is a hierarchical clustering technique. It entails identifying the most distant pair of items, one from each group, and utilizing the distance between these two objects to calculate the distance between the groups. In complete-linkage clustering, the following formula can be used to get the distance between clusters X and Y:
(7)
It represents the distance between the elements and . and represent two groups of items (clusters). The farthest neighbor strategy is part of the agglomerative hierarchical approach, which classifies each data point as a cluster before gradually joining the closest clusters. In which the distance between groups is presented as the distance between the most distant pair of clusters (Jayakameswaraia, et al., 2017).
Distance metrics are crucial in clustering algorithms such as K-means clustering because they determine the similarity between data points. The choice of distance metric has a significant impact on clustering outcomes. The Euclidean distance is typically employed in k-means clustering, but alternative measures may be preferable based on the data's characteristics. Study measures include:
- Euclidean distance: Euclidean distance measures the straight-line distance between two points in Euclidean space and is commonly used as a distance metric. It is calculated using the square root of the sum of squared coordinate differences and determined using the formula(Edsa & Chandra, 2023):
(8)
Where is dimensional space for the similarity calculation distance, attribute of data points, and this distance measure is appropriate when the measurements are comparable and the data is continuous. (Amira , et al., 2023)
- Manhattan Distance: Commonly referred to as the city block distance, it is appropriate for categorical or ordinal data. It calculates the total of the absolute coordinate differences, determined by summing the values of x and y. The formula is:
(9)
and are the variables of vectors and in the two-dimensional vector space. (Gan, et al., 2021).
- Literature review:
First proposed by Andrew in 1967, the K-means clustering technique was introduced as a simple, unsupervised learning clustering system. Over time, researchers have proposed various enhancements and modifications to the K-means algorithm to address its limitations and improve its effectiveness (Andrew, et al., 2000).
Arthur and Vassilvitskii (2007) studied the selecting initial centroids of k-means algorithm that are far apart, the algorithm delivers better clustering outcomes than random initialization.
Furthermore, Jain (2008) discussed the k-mean algorithm's history, modifications, and applications across numerous areas.
Rodriguez and Luciano da F.'s (2019) study, “Clustering algorithms: A comparative approach” evaluates and contrasts various performance metrics for clustering algorithms, including the K-means approach. The study highlights the merits and weaknesses of the K-means algorithm and its performance across different dataset variants.
Sinaga & Yang (2020) suggested that the unsupervised K-means algorithm does not require any initializations or parameters. The U-k-means algorithm can determine the ideal number of clusters automatically. The approach was adaptable to various cluster sizes and forms. Experimental results show that the U-k-means algorithm outperforms other known algorithms.
Ghazal et al. (2021) assessed the K-means clustering method using both Euclidean and Manhattan distances. The performance of the K-means algorithm was measured in terms of execution time and the number of clusters formed. The results indicated that using Manhattan distance measurement metrics yielded the best outcomes.
Tabianan et al. (2022) conducted a study on clustering analysis using the K-means algorithm and the Elbow method to determine the optimal number of clusters. The study employed pre-processing techniques for data cleaning on repository questionnaires for a quantitative study, utilizing Malaysia's e-commerce dataset.
Hamad (2023) developed a predictive model for variables impacting evaporation, using climatic data from meteorological stations in Iraq's Kurdistan Region from January 2020 to December 2022. The study utilized cluster analysis and regression to highlight the strengths of each technique. It also explored the application of hierarchical cluster analysis using Euclidean and Manhattan distances. Two clusters were created, and the results were used to compare K-means clustering algorithms in the current study.
The body of work on the K-means clustering algorithm demonstrates the research community's dedication to enhancing and refining its capabilities. These efforts aim to improve the algorithm's adaptability and effectiveness across various disciplines. By expanding its capabilities and addressing its limitations, researchers are making K-means a more powerful tool for data analysis.
- Experiments and Results:
The variables impacting evaporation rate were discovered in this study utilizing climate data from meteorological stations in the Kurdistan Region of Iraq from January 2020 to December 2022.
A series of experiments were conducted on an 8-attribute dataset consisting of 1,095 cases. The dataset was divided into a predetermined number of clusters using the Random, K-means++, Canopy, and Farthest First methods.
The initial centroids for the K-means approach were randomly determined using Euclidean and Manhattan distance metrics. Clustering accuracy and algorithmic proficiency were evaluated. The testing examined the accuracy and speed of simple K-means algorithms using the Weka tool, version 3.8.6, with predefined datasets. This includes a data mining document that outlines and thoroughly explains all the techniques covered. The attributes used in the K-means algorithm analysis are displayed in Table 1.
Table 1: Attributes of the Climate Data Set
|
Variables
|
No.
|
Variables
|
No.
|
|
SP-Pressure at Sea Level
|
5
|
AE-Amount of Evaporation
|
1
|
|
WD-Wind Direction
|
6
|
T-Temperature
|
2
|
|
RT-Relative Humidity
|
7
|
WT-Wet Temperature
|
3
|
|
WS-Wind Speed
|
8
|
SS-Sunshine
|
4
|
The following table represents the analysis process of all algorithms, displays the findings for the dataset, which has 8 attributes and 1,095 instances, and compares the clustering algorithms using Euclidean distance metrics. The parameters chosen were the number of clusters produced, the number of iterations, the time (seconds) required to construct the model, and the within cluster sum of squared errors (WCSS).
Table 2 shows that the number of clusters for each algorithm was two (2) when using the Euclidean distance metric. The K-means++ and Canopy techniques required the most iterations, while the Farthest First algorithm required the fewest.
The k-means++ approach takes the least amount of time, while the canopy algorithm takes the most. The within-cluster sum square (WCSS) of errors showed no significant variations, there is a slight difference between the lowest value and the highest value of WCSS. The random and canopy algorithms had the lowest WCSS scores, while the k-means++ and farthest first algorithms had the highest.
Table 2: Clustering algorithms result for Climate Dataset with Euclidean distance
|
Algorithm
|
No. of
Clusters
|
Cluster
instance
|
No. of
Iteration
|
Time taken
( seconds)
|
WCSS
|
|
Random
|
2
|
513 (47%)
582 (53%)
|
8
|
0.25
|
154.867
|
|
K-means++
|
2
|
515 (47%)
580 (53%)
|
9
|
0.10
|
154.869
|
|
Canopy
|
2
|
582 (53%)
513 (47%)
|
9
|
0.31
|
154.867
|
|
Farthest First
|
2
|
515 (47%)
580 (53%)
|
6
|
0.14
|
154.869
|
Table 3 presents the results for the same attributes using Manhattan distance metric to compare the clustering algorithms. The chosen parameters include the number of clusters formed, the number of iterations, the time (in seconds) taken to build the model, and the within-cluster sum of squared errors.
When using the Manhattan distance measure, the number of clusters for each algorithm was two (2). The Canopy algorithm exhibited the highest number of iterations, while the Farthest First algorithm required the fewest iterations. The Farthest First algorithm also had the shortest time to build the model, whereas the Random algorithm took the longest. There were no significant differences in the sum of squared errors within the clusters. The lowest within-cluster sum of squares (WCSS) values were observed for the Random, Canopy, and Farthest First algorithms, with the highest value for the K-means++ algorithm.
Table3: Clustering algorithms result for Climate Dataset with Manhattan distance
|
Algorithm
|
No. of
Clusters
|
Cluster
instance
|
No. of
Iteration
|
Time take
( seconds)
|
WCSS
|
|
Random
|
2
|
547 (50%)
548 (50%)
|
7
|
0.14
|
835.429
|
|
K-means++
|
2
|
551 (50%)
544 (50%)
|
6
|
0.13
|
835.448
|
|
Canopy
|
2
|
547 (50%)
548 (50%)
|
9
|
0.10
|
835.429
|
|
Farthest First
|
2
|
548 (50%)
547 (50%)
|
5
|
0.05
|
835.429
|
Figure 1 shows a comparison graph for Euclidean distance between the first and second cluster instances for all K-means methods. The graph indicates that the Canopy algorithm produces different outcomes compared to the Random, K-means++, and Farthest First algorithms. The Canopy algorithm has 582 instances in the first cluster and 513 instances in the second cluster, as shown in Table 2, whereas the other algorithms have fewer instances in the first cluster and more instances in the second cluster.
Figure 1:Comparison of First and Second Cluster Instance for Euclidean Distance
Figure 2 shows a comparison graph of the first and second cluster instances for all K-means algorithms using the Manhattan distance. The graph reveals that the K-means++ algorithm produces different outcomes compared to the Random, Canopy, and Farthest First algorithms. This algorithm has 551 instances in the first cluster and 544 instances in the second cluster, as shown in Table 3, whereas the other algorithms have fewer instances in the first cluster and more instances in the second cluster.
Figure 2:Comparison of First and Second Cluster Instance for Manhattan Distance
Figure 3 shows a comparison graph of the number of iterations for Euclidean and Manhattan distances across all K-means algorithms. The Farthest First and Random algorithms have the fewest iterations for both distances, while the Canopy algorithm has the most iterations for both distances. The K-means++ algorithm exhibits clear differences in the number of iterations between the two distances metrics.
Figure 3:Comparison of Number of Iteration for Euclidean and Manhattan Distance
Figure 4 presents a comparison graph of the time (seconds) taken for Euclidean and Manhattan distances using all K-means algorithms. The K-means++ algorithm using Euclidean distance takes the shortest time, while the Canopy algorithm takes the longest. For Manhattan distance, the Farthest First algorithm takes the shortest time, while the Random algorithm takes the longest.
Figure 4:Comparison for Time Taken by k-means algorithms Using Euclidean and Manhattan Distance
Figure 5 shows a comparison graph of the sum of squared errors within each cluster for both Euclidean and Manhattan distances using all K-means algorithms. The graph indicates that Euclidean distance results in the lowest sum of squared errors within the clusters compared to Manhattan distance.
Figure 5:Comparison of Clustering Algorithms With in Sum Square Error (WCSSE)
Figure 6 illustrates insights into the clusters within the climate dataset using the K-means clustering technique with Euclidean distance. It visualizes the cluster data as an interactive map, offering an overview and detailed information on attributes and conceptually related clusters. The distribution of cluster instances for attributes (WT, RH, SP, SS, T) between the first and second clusters is clearer, while the distribution of attribute (RH) instances shows overlap compared to Manhattan distance, as depicted in Figure 7.
|
AE
|
AE
|
AE
|
|
WS
|
WT
|
RH
|
|
AE
|
AE
|
AE
|
|
SP
|
SS
|
T
|
|
Cluster Instance: Cluster 1 Cluster 2 Euclidean Distance
|
Figure 6: Distribution of Cluster Instances using Euclidean Distance
|
AE
|
AE
|
AE
|
|
WS
|
WT
|
RH
|
|
AE
|
AE
|
AE
|
|
SP
|
SS
|
T
|
|
Cluster Instance: Cluster 1 Cluster 2 Manhattan Distance
|