Comparative Study between Parallel K-Means and Parallel K-Medoids using Message Passing Interface (MPI)

Data mining combines several techniques, such as classification and clustering, to extract useful information from a dataset. Clustering is one of the most widely used data mining techniques today. K-Means and K-Medoids are among the most commonly used clustering algorithms because they are easy to implement, efficient, and give good results. Besides the quality of the mined information, the time spent mining the data is also a concern, since real-world applications produce huge volumes of data. This research analyzes the clustering results and time performance of the K-Means and K-Medoids algorithms, which were parallelized on a High Performance Computing (HPC) cluster using the Message Passing Interface (MPI) library. The results show that K-Means gives a smaller Sum of Squared Errors (SSE) than K-Medoids, and that the parallel algorithms using MPI give a faster computation time than the sequential algorithms.


I. INTRODUCTION
Nowadays, data generation is growing massively and rapidly. Data can be collected anytime and anywhere. Many types of data, in varying amounts, are stored in data warehouses, such as sales production data, satellite orbit data, disease data, and other data from many disciplines.
Information can be gathered and processed into knowledge with data mining techniques. Data mining is a technology that combines data analysis methods with algorithms for processing massive data. Data mining is also used to find and analyze new information in data that has previously been analyzed with a different method. Clustering is one such data mining technique.
Clustering is a data mining technique that is very useful for real-world problems [9]. The concept of clustering is that objects in the same group are similar, based on distance, and differ from objects in other groups [9,10]. Many clustering algorithms are available. The choice of clustering algorithm can be based on the data type or the intended use of the data.
Today's data can also have up to thousands of dimensions. The problem is how to process data with thousands of dimensions not only with high accuracy but also in the shortest possible time. High Performance Computing (HPC) is considered capable of supporting the data mining process, and HPC is widely used in data mining, for example in the work of Jing Zhang, Gongqing Wu, Xuegang Hu, Shiying Li, and Shuilang Hao titled "A Parallel K-means Clustering Algorithm with MPI" [1]. Other studies have already compared K-Means and K-Medoids [8,9]. We propose an HPC cluster approach to implement K-Means and K-Medoids on a parallel platform. In this research, parallel data clustering using the Message Passing Interface (MPI) is performed to obtain high accuracy and low computation time for the clustering results in the data mining process.

A. K-MEANS CLUSTERING
K-Means is the most widely used clustering algorithm [1]. First, determine the number of clusters K. Then pick the initial cluster centroids randomly; the K-Means algorithm repeats the following steps until the centroids no longer change [2,9]:
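The loop described above can be illustrated with a minimal sequential sketch. It is not the implementation used in this study: the function names, the `max_iter` and `seed` parameters, the handling of empty clusters, and the toy data are all assumptions made for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iter=100, seed=0):
    """Return (centroids, labels) for a list of numeric tuples."""
    rng = random.Random(seed)
    # Pick K distinct data points as the random initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its closest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    [sum(col) / len(members) for col in zip(*members)]
                )
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:  # stop when centroids do not change
            break
        centroids = new_centroids
    return centroids, labels


# Illustrative toy data: two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = k_means(points, k=2)
```

On this toy data the two groups end up in separate clusters, and each centroid is the mean of its three points.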

B. K-MEDOIDS CLUSTERING
Partitioning Around Medoids (PAM), also known as K-Medoids, is similar to K-Means clustering [3,9]. The difference is that the centroid in K-Means is the mean of the objects assigned to the cluster, whereas in K-Medoids the centroid is one of the data objects itself, chosen so that the cost is minimal:
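To make the contrast with K-Means concrete, the sketch below uses a simple alternating scheme: assign every object to its nearest medoid, then pick as the new medoid the cluster member with the smallest total distance to the other members. This is a simplification, not the full PAM swap search and not necessarily the procedure used in this study. The names `k_medoids`, `assign`, and `init_indices` and the toy data are hypothetical.

```python
import math
import random


def distance(a, b):
    """Euclidean distance between two points of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign(points, medoid_idx):
    """Label each point with the position of its nearest medoid."""
    return [
        min(range(len(medoid_idx)),
            key=lambda c: distance(p, points[medoid_idx[c]]))
        for p in points
    ]


def k_medoids(points, k, init_indices=None, max_iter=100, seed=0):
    """Return (medoids, labels); every medoid is an actual data object."""
    rng = random.Random(seed)
    if init_indices is not None:
        medoids = list(init_indices)
    else:
        medoids = rng.sample(range(len(points)), k)
    for _ in range(max_iter):
        labels = assign(points, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i, label in enumerate(labels) if label == c]
            if not members:
                # Assumption: an empty cluster keeps its previous medoid.
                new_medoids.append(medoids[c])
                continue
            # Cost of a candidate = total distance to all cluster members.
            best = min(
                members,
                key=lambda i: sum(distance(points[i], points[j])
                                  for j in members),
            )
            new_medoids.append(best)
        if new_medoids == medoids:  # stop when the medoids do not change
            break
        medoids = new_medoids
    return [points[i] for i in medoids], assign(points, medoids)


# Illustrative toy data, started from two medoids in the same group.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
medoids, labels = k_medoids(points, k=2, init_indices=[0, 1])
```

Unlike the K-Means centroids, the returned medoids are members of the dataset, which is why each update needs pairwise distances inside the cluster and costs more computation.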

C. MESSAGE PASSING INTERFACE (MPI)
MPI is a standard library that uses a message-passing mechanism to parallelize algorithms in support of parallel computing [1]. A parallel program using MPI is defined by choosing which functions to use. The general structure of an MPI program is: 1) include the MPI header file, 2) start the serial code, 3) initialize MPI, 4) do work and make message-passing calls, 5) terminate the MPI environment, 6) continue the serial code, 7) end the program.
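The seven steps can be mapped onto a short program. The sketch below uses the mpi4py Python binding, which is an assumption for illustration; the source does not state which language binding was used. In mpi4py, importing `MPI` initializes the MPI environment and finalization happens automatically at interpreter exit, so steps 3 and 5 are implicit. The single-process fallback and the summing task are also illustrative only.

```python
# Step 1: MPI "include". Importing mpi4py.MPI also performs step 3
# (MPI initialization) by default.
try:
    from mpi4py import MPI
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
except ImportError:
    # Assumption for illustration: run as one process without mpi4py.
    MPI, comm, rank, size = None, None, 0, 1

# Step 2: serial code executed identically by every process.
numbers = list(range(100))

# Step 4: do work on a local share, then make a message-passing call.
local_sum = sum(numbers[rank::size])  # every size-th item, offset by rank
if comm is not None:
    total = comm.allreduce(local_sum, op=MPI.SUM)  # combine partial sums
else:
    total = local_sum

# Step 5: mpi4py terminates the MPI environment automatically at exit.

# Steps 6 and 7: continue the serial code and end the program.
if rank == 0:
    print("total =", total)
```

With mpi4py installed, such a script is typically launched with a command like `mpiexec -n 4 python script.py`, where the script name is a placeholder.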

D. PARALLEL K-MEANS AND K-MEDOID CLUSTERING
Parallelization basically means partitioning the data so that the partitions can be processed at the same time [4]. K-Means and K-Medoids clustering share the same computationally intensive part, namely computing the distances. In a parallel system, the main idea is to partition the data across the processes so that each process has the same amount of data and all processes can work at the same time.
Each compute node runs the algorithm and holds the centroids in its own process. Afterwards, the results of the processes are merged on the head node. The parallel K-Means and K-Medoids algorithms are explained in Tables 3 and 4.
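The partition-and-merge idea can be sketched for one K-Means iteration without any MPI dependency by simulating the processes in a loop. This is an illustrative assumption about how the merge could work, not a reproduction of the algorithms in the tables: each simulated process returns per-cluster coordinate sums and counts, and the head node combines them into new centroids. All function names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def split_evenly(points, n_procs):
    """Give each simulated process a near-equal share of the data."""
    return [points[r::n_procs] for r in range(n_procs)]


def local_partial_sums(chunk, centroids):
    """Work of one process: assign its points, sum them per cluster."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in chunk:
        c = min(range(k), key=lambda j: squared_distance(p, centroids[j]))
        counts[c] += 1
        for d in range(dim):
            sums[c][d] += p[d]
    return sums, counts


def merge_on_head(partials, centroids):
    """Work of the head node: combine partial sums into new centroids."""
    k, dim = len(centroids), len(centroids[0])
    merged = []
    for c in range(k):
        n = sum(counts[c] for _, counts in partials)
        if n == 0:
            merged.append(list(centroids[c]))  # keep an empty cluster as is
            continue
        merged.append(
            [sum(sums[c][d] for sums, _ in partials) / n for d in range(dim)]
        )
    return merged


def parallel_step(points, centroids, n_procs):
    """One simulated parallel K-Means iteration."""
    chunks = split_evenly(points, n_procs)
    # With MPI, each chunk would be processed by a separate process and
    # the partial results gathered or reduced onto the head node.
    partials = [local_partial_sums(chunk, centroids) for chunk in chunks]
    return merge_on_head(partials, centroids)


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
start = [[0.0, 0.0], [10.0, 10.0]]
serial = parallel_step(points, start, n_procs=1)
parallel = parallel_step(points, start, n_procs=3)
```

Because sums and counts are additive, the merged centroids match the single-process result up to floating-point rounding, regardless of how the data is partitioned.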

E. CLUSTER EVALUATION
Cluster evaluation is a part of clustering analysis. Because the algorithms use the Euclidean distance as the closeness measure, the objective function that can be used to measure cluster quality is the Sum of Squared Errors (SSE). The formula is given below [2]:
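In its usual form, SSE adds up, over all clusters, the squared Euclidean distance between each object and the centroid of the cluster it belongs to. The short sketch below computes that quantity; the function name, argument layout, and sample values are assumptions for illustration.

```python
def sse(points, labels, centroids):
    """Sum of squared Euclidean distances from points to their centroids."""
    total = 0
    for p, label in zip(points, labels):
        c = centroids[label]
        total += sum((x - y) ** 2 for x, y in zip(p, c))
    return total


# Illustrative values: two points around (1, 0) and one point on (10, 0).
example_points = [(0, 0), (2, 0), (10, 0)]
example_labels = [0, 0, 1]
example_centroids = [(1, 0), (10, 0)]
example_sse = sse(example_points, example_labels, example_centroids)
```

A smaller SSE means the objects lie closer to their centroids, so it allows two clusterings of the same data with the same K to be compared.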

F. PARALLEL PERFORMANCE EVALUATION
Parallel performance can be measured using speedup, performance improvement, and efficiency [5]. These three measures evaluate how well a given number of processors performs in terms of computation time.
Speedup measures how much faster the parallel algorithm is than the sequential algorithm. The speedup formula can be written as:
Performance improvement describes the relative improvement of the parallel process over the sequential process. The performance improvement formula can be written as:
Efficiency estimates how well the processors are utilized for processing, compared with the work spent on communication and synchronization. The efficiency formula can be written as:
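The three measures can be sketched in code using their commonly used definitions. Speedup as the ratio of sequential to parallel time and efficiency as speedup divided by the number of processors are standard. The percentage form of performance improvement used here is an assumption and may differ from the formula intended by the authors. The timing values are made up for illustration.

```python
def speedup(t_sequential, t_parallel):
    """How many times faster the parallel run is than the sequential run."""
    return t_sequential / t_parallel


def performance_improvement(t_sequential, t_parallel):
    """Assumed definition: time saved relative to sequential time, in %."""
    return (t_sequential - t_parallel) / t_sequential * 100


def efficiency(t_sequential, t_parallel, n_processors):
    """Speedup per processor; 1.0 would mean ideal use of every processor."""
    return speedup(t_sequential, t_parallel) / n_processors


# Hypothetical timings in seconds, not measurements from this study.
t_seq, t_par, p = 10.0, 5.0, 4
s = speedup(t_seq, t_par)
pi = performance_improvement(t_seq, t_par)
e = efficiency(t_seq, t_par, p)
```

With these made-up timings, four processors halve the run time, so the speedup is 2 but the efficiency is only 0.5, which reflects time spent on communication and synchronization rather than computation.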

G. DATASET
The datasets used in this research were selected from the UCI Machine Learning Repository [6] and the KentRidge Biomedical Dataset [7]. We expect these varied datasets to show whether the number of attributes or the number of records affects the processing time.

H. EXPERIMENTAL ENVIRONMENT
The hardware platform in this paper is an HPC cluster with a total of 12 compute nodes and 48 Intel i7 processor cores @ 3.0 GHz, 96 GB of memory, and a LAN over UTP cable as the network environment.

I. PERFORMANCE EVALUATION
This research evaluates both the K-Means and K-Medoids algorithms in terms of the Sum of Squared Errors (SSE), the sequential and parallel computation time, and the performance gained by using MPI to parallelize the algorithms.

III. RESEARCH METHOD
The proposed method compares the algorithms in both sequential and parallel form. The process is divided into five steps, as described in Fig. 1: (1) pre-processing the data, (2) training on the dataset using the K-Means and K-Medoids algorithms, (3) parallelizing the algorithms with MPI, (4) validation, and (5) performance analysis. The training step builds clusters from the data. Each cluster has a membership and a centroid. The centroids are used in the validation step. Training was performed for every method and every algorithm.

C. Parallelized algorithm with MPI
This step uses the same mechanism as the sequential step; the difference is that the algorithm is parallelized with MPI so that the computation time is reduced.

D. Validation
This step takes data different from the training data as input. It produces the membership result using the centroids from the training step. Training was performed for every method and every algorithm.
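As an illustration of this step, the sketch below assigns previously unseen objects to the nearest centroid obtained from training. The function name and all sample values are hypothetical.

```python
def assign_to_nearest(points, centroids):
    """Return, for each point, the index of its closest centroid."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(centroids)),
            key=lambda c: squared_distance(p, centroids[c]))
        for p in points
    ]


# Hypothetical centroids from a training run and new validation objects.
trained_centroids = [(0.0, 0.0), (10.0, 10.0)]
validation_points = [(1, 1), (9, 8)]
membership = assign_to_nearest(validation_points, trained_centroids)
```

The centroids stay fixed during validation; only the membership of the new objects is computed.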

E. Performance Evaluation
This step evaluates the clustering results, the computation time, and the processor usage of the parallelized algorithms. The numbers of processors used in this research are 2, 4, 6, and 8.

A. CLUSTERING PERFORMANCE
As mentioned before, the clustering results of both algorithms are evaluated using SSE.

B. TIME EVALUATION OF SEQUENTIAL COMPUTATION
Clustering performance can also be seen in the sequential computation time.

C. PARALLEL PERFORMANCE EVALUATION
This section evaluates the speedup, performance improvement, and efficiency obtained from the processors used. Fig. 4 shows that the improvement does not differ much between processor counts. In contrast, for the Tumor dataset, performance does not improve for either K-Means or K-Medoids.

V. CONCLUSION
The experimental results show that the K-Means algorithm performs better than the K-Medoids algorithm. Conversely, K-Medoids is better suited to parallelization than K-Means because it has a larger computational part than K-Means. For parallel processing of the data, MPI gives a faster computation time, but the gain depends on the size of the data and the number of cores used. In our study, the data were not big enough to make use of 8 cores, but 2 cores gave better performance than the serial algorithm.

Fig. 1. Design of System

Fig. 2 shows that the computation time of K-Means is shorter than that of K-Medoids. This matches the complexity of K-Means and K-Medoids [8]. The comparison of computation time is shown in Fig. 2.

Fig. 2. Comparison of Sequential Time

Figs. 3, 4, and 5 show the speedup, performance improvement, and efficiency results for both algorithms. The results were obtained from formulas (2), (3), and (4).

Fig. 3. Speedup Graph

Fig. 3 shows that every dataset has a different 'best' speedup, but all have the same 'worst' speedup. With eight processors, the speedup always decreases. This shows that the datasets are not big enough to be parallelized at that scale, so the algorithm spends more time on communication and synchronization. The performance improvement is shown in Fig. 4.

Fig. 5. Efficiency Graph

In Fig. 5, the graph clearly shows that using two processors is the most efficient for data of this size range.

TABLE II. K-MEDOID ALGORITHM
Input: Data, K Cluster
Output: K Centroid
Table VI shows the SSE results of the K-Means and K-Medoids algorithms for each dataset. As can be seen in Table VI, the SSE of K-Means is smaller than that of K-Medoids. Thus a centroid computed as the mean is closer to the cluster members than a data object chosen as the centroid.