Partitioning algorithms are among the most widely used and deeply studied clustering methods. They aim to partition a dataset into several clusters of similar objects while maximizing the between-cluster variation (Dabbura, 2018). Although there are many variants of partitioning algorithms, this blog focuses on the K-means algorithm. K-means has been widely applied in data mining, pattern recognition, image compression and many other machine learning fields thanks to its simplicity and efficiency. In this blog, the K-means algorithm is first briefly introduced, then an example is presented, the advantages and limitations are discussed, and finally the scenarios in which K-means is applicable are summarized.
Introduction
K-means clustering is the most popular partitioning method. In this algorithm, the number of required clusters must be specified before clustering. Following Reddy and Vinzamuri (2014), we introduce several indicators relevant to choosing the optimal number of clusters:
- “wss” (within sum of square): the lower, the better
- “silhouette” (average clusters silhouette): the higher, the better
- “gap_stat” (gap statistics): the higher, the better
- “NbClust”: this function directly returns the optimal number of clusters, based on a frequency distribution of the numbers proposed by many individual indices
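The first indicator can be computed by hand with base R alone. The sketch below uses the built-in iris measurements as stand-in data (an assumption; substitute your own standardized data set) and plots the total within-cluster sum of squares for a range of k:

```r
# A minimal sketch of the "wss" (elbow) indicator on the built-in
# iris measurements (stand-in data, an assumption)
set.seed(123)
data <- scale(iris[, 1:4])

# Total within-cluster sum of squares for k = 1..10
wss <- sapply(1:10, function(k) {
  kmeans(data, centers = k, nstart = 25)$tot.withinss
})

# wss always shrinks as k grows, so we look for the "elbow" where
# the decrease levels off rather than for the minimum itself
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The same plot can be produced with fviz_nbclust(data, kmeans, method = "wss") from the factoextra package, as used later in this post for the other indicators.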
After specifying the number of clusters (usually denoted by k), we randomly pick k objects from the data set as the initial cluster centers (these initial objects are also known as “seeds”). Each remaining object is then assigned to the cluster whose center is closest to it. Once every non-center object has been assigned, we have k preliminary clusters. We then recalculate and update each cluster's center as the mean of its members, and repeat the assignment step. The iteration terminates when the membership of every cluster stops changing or the maximum number of iterations is reached. The final k clusters are the required clustering result.
Basic Steps of Kmeans Algorithm
- Prepare the data
- Determine the required number of clusters as k
- Pick k objects as the initial clusters’ centers
- Assign non-center objects to their closest clusters
- Calculate and update the centers of clusters
- Repeat steps 4-5 until the clustering result converges
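The steps above can be sketched directly in a few lines of R. The function name simple_kmeans and its internals are illustrative only (in practice, use the built-in kmeans() function):

```r
# A bare-bones sketch of steps 3-6 (Lloyd's algorithm) on a numeric
# matrix X; illustrative code, not from any package
simple_kmeans <- function(X, k, max_iter = 10) {
  set.seed(1)
  centers <- X[sample(nrow(X), k), , drop = FALSE]  # step 3: random seeds
  assign <- rep(0L, nrow(X))
  for (iter in seq_len(max_iter)) {
    # step 4: assign every object to the cluster with the closest center
    d <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]
    new_assign <- max.col(-d, ties.method = "first")
    if (all(new_assign == assign)) break  # step 6: memberships unchanged
    assign <- new_assign
    # step 5: recompute each center as the mean of its current members
    for (j in 1:k) {
      members <- X[assign == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(cluster = assign, centers = centers)
}

res <- simple_kmeans(as.matrix(iris[, 1:4]), k = 3)
table(res$cluster)  # sizes of the three resulting clusters
```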
Example & Codes
Now we apply the K-means algorithm to a test data set in R to show how a typical K-means clustering works.
##Load the required packages
library(tidyverse)   #read_csv and the pipe operator
library(factoextra)  #fviz_nbclust, fviz_cluster
library(NbClust)     #the NbClust function

##Load data from a csv file and prepare data
mydata <- read_csv("test.csv")  #the header row is read by default
mydata <- mydata %>%
  na.omit() %>% #drop rows with missing values
  scale()       #standardize the raw data
##Determining the optimal number of clusters
#1st indicator: the average silhouette
fviz_nbclust(mydata, kmeans, method = "silhouette", k.max = 10)

#2nd indicator: the gap statistic
fviz_nbclust(mydata, kmeans, method = "gap_stat", k.max = 10)

#3rd indicator: the NbClust function
res.nbclust <- NbClust(mydata, distance="euclidean", min.nc = 2, max.nc = 10, method="kmeans", index="all")
*** : The Hubert index is a graphical method of determining the number of clusters.
In the plot of Hubert index, we seek a significant knee that corresponds to a
significant increase of the value of the measure i.e the significant peak in Hubert
index second differences plot.
*** : The D index is a graphical method of determining the number of clusters.
In the plot of D index, we seek a significant knee (the significant peak in Dindex
second differences plot) that corresponds to a significant increase of the value of
the measure.
*******************************************************************
* Among all indices:
* 3 proposed 2 as the best number of clusters
* 9 proposed 3 as the best number of clusters
* 1 proposed 4 as the best number of clusters
* 2 proposed 6 as the best number of clusters
* 6 proposed 8 as the best number of clusters
* 2 proposed 10 as the best number of clusters
***** Conclusion *****
* According to the majority rule, the best number of clusters is 3
*******************************************************************
#Visualize the distribution of the proposed numbers of clusters
#The "Number_clusters" row of Best.nc holds each index's proposed k
cluster_number <- res.nbclust$Best.nc["Number_clusters", ]
ftable <- data.frame(k = 0:10,
                     freq = sapply(0:10, function(i) sum(cluster_number == i)))
ggplot(ftable, aes(x = k, y = freq)) +
  geom_bar(stat = "identity") +
  scale_x_continuous(breaks = seq(0, 10, by = 1)) +
  scale_y_continuous(breaks = seq(0, 10, by = 2)) +
  labs(x = "Number of clusters", y = "Frequency",
       title = "Frequency Distribution Histogram")

##Compute the kmeans clustering with k=2 or k=3
res.km2 <- kmeans(mydata, centers = 2, iter.max = 10, nstart = 25)
res.km3 <- kmeans(mydata, centers = 3, iter.max = 10, nstart = 25)
##Visualize the kmeans clusters with k=2 or k=3
#Visualize the kmeans clusters with k=2
fviz_cluster(res.km2, data = mydata,
geom = "point", ellipse.type = "convex",
ellipse.level = 0.7, outlier.color = "black",
ggtheme = theme_minimal())
#Visualize the kmeans clusters with k = 3
fviz_cluster(res.km3, data = mydata,
geom = "point", ellipse.type = "convex",
ellipse.level = 0.7, outlier.color = "black",
ggtheme = theme_minimal())
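Beyond the eyeball test, the two candidate values of k can also be compared numerically by their average silhouette width. The sketch below uses the built-in iris data as a stand-in (an assumption, since test.csv is not reproduced here) and the silhouette() function from the cluster package:

```r
library(cluster)  #for silhouette()

# Stand-in data: the scaled iris measurements (an assumption)
set.seed(123)
d <- scale(iris[, 1:4])
km2 <- kmeans(d, centers = 2, nstart = 25)
km3 <- kmeans(d, centers = 3, nstart = 25)

# Average silhouette width: the higher, the better
sil2 <- mean(silhouette(km2$cluster, dist(d))[, "sil_width"])
sil3 <- mean(silhouette(km3$cluster, dist(d))[, "sil_width"])
c(k2 = sil2, k3 = sil3)
```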

We can see from the diagrams that the test data contain one big cluster and many outliers. However, neither the k=2 nor the k=3 result successfully isolates the big cluster. Instead, both split the big cluster into several parts and merge those parts with some of the outliers. As a result, the K-means algorithm does not behave as expected on this test data.
Advantages and Limitation
The advantages of the K-means method are its simplicity and its ease of implementation compared with other methods.
However, it is sensitive to noise and outliers. The clustering result is nondeterministic: it depends partly on the selection of the initial centers. Besides, the number of clusters must be predetermined. Most importantly, the K-means algorithm performs poorly on non-convex data (as seen in the example above) compared with density-based methods such as DBSCAN.
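The non-convexity weakness is easy to reproduce on synthetic data (an illustrative example, not the test set above): two noisy concentric rings that no convex boundary can separate.

```r
# Two noisy concentric rings: the "true" clusters are non-convex
set.seed(42)
theta <- runif(200, 0, 2 * pi)
r <- c(rep(1, 100), rep(4, 100))  #inner ring, then outer ring
rings <- cbind(x = r * cos(theta), y = r * sin(theta)) +
  matrix(rnorm(400, sd = 0.1), ncol = 2)
truth <- rep(1:2, each = 100)

# k-means can only draw convex (Voronoi) boundaries, so each of its
# two clusters ends up mixing points from both rings
km <- kmeans(rings, centers = 2, nstart = 25)
table(km$cluster, truth)
```

On data like this, a density-based method such as DBSCAN would typically recover the two rings, because it follows connected dense regions rather than distances to a center.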
Conclusion
In conclusion, K-means algorithms partition a data set into a pre-determined number of clusters with maximum within-cluster similarity. Their simplicity and efficiency make K-means one of the most popular clustering methods. However, K-means is a nondeterministic algorithm: the resulting clusters depend on the choice of initial centers, and it is sensitive to noise and outliers. Most importantly, it performs poorly on non-convex data sets.
Bibliography
Dabbura, I. (2018). K-means Clustering: Algorithm, Applications, Evaluation Methods, and Drawbacks – Towards Data Science. [online] Towards data science. Available at: https://towardsdatascience.com/k-means-clustering-algorithm-applications-evaluation-methods-and-drawbacks-aa03e644b48a [Accessed 15 Jul. 2019].
Kassambara, A. (2018). K-Means Clustering in R: Algorithm and Practical Examples – Datanovia. [online] Datanovia. Available at: https://www.datanovia.com/en/lessons/k-means-clustering-in-r-algorith-and-practical-examples/ [Accessed 17 Jun. 2019].
Kassambara, A. (2018). Types of Clustering methods: Overview and Quick Start R Code – Datanovia. [online] Datanovia. Available at: https://www.datanovia.com/en/blog/types-of-clustering-methods-overview-and-quick-start-r-code/#partitioning-clustering/ [Accessed 17 Jun. 2019].
Reddy, C. and Vinzamuri, B. (2014). Chapter 4: A Survey of Partitional and Hierarchical Clustering Algorithms. In: Aggarwal, C. and Reddy, C., Data clustering: algorithms and applications. 1st ed. CRC Press, pp. 88-100.
