A Non-MCMC Procedure for Fitting Dirichlet Process Mixture Models
Cluster analysis is concerned with partitioning cases into clusters such that the cases in a cluster are similar in terms of a set of variables. The K-Means Clustering Algorithm is a popular clustering method. It finds k clusters by choosing k data points as initial cluster centroids. Each data point is then assigned to the cluster with center that is closest to that point. In K-Means, the number of clusters has to be supplied in advance, which may be difficult in practice. A new method, the X-Means Clustering Algorithm, was proposed to solve this problem, which starts with an initial partition, then recursively runs a local K-Means in each cluster to split it until a lower Bayesian Information Criterion (BIC) value is reached compared with the previous larger cluster. However, this method would introduce a more severe local mode problem, that is, the previous inappropriate partition of cases cannot be corrected in the following local splitting. In this work, we develop a new algorithm that is based on Bayesian Dirichlet process mixture models, called the Non-MCMC DPM clustering algorithm. In the new clustering algorithm, we run the EM algorithm with all the cases to find a tentative partition, and then decide whether to accept the new partion with a criterion called DPC. We have tested our new clustering algorithm based on several simulated data sets, and found that it performs better than X-Means. We have also applied the algorithm to a real micorarray sample data set for predicting the class label (cancer or normal) based on the clustering results found by our new algorithm, and found that the prediction performance is comparable to state-of-the-art methods.
DegreeMaster of Science (M.Sc.)
DepartmentMathematics and Statistics
CommitteeSoteros, Chris; Laverty, William H.; McQuillan, Ian
Copyright DateJune 2012
Cluster Analysis, Non-MCMC