Abstract:Grid density peak clustering, taking advantage of the strong ability that the density peak clustering algorithm can identify arbitrary shape clusters, greatly reduces the computational cost by gridding the data set. It has become a popular clustering method nowadays. However, for large-scale data, how to further distinguish the dense and sparse grids and reduce the number of non-empty grids involved in the calculation of the grid density peak clustering is the key to solve the "grid disaster". In view of the probability density distribution with grid density as variable shows a Zipf-like distribution, a grid density peak clustering algorithm based on Zipf distribution is proposed. Firstly, the density of all non-empty grids is calculated and mapped to Zipf distribution, so the dense center grids and sparse edge grids are filtered according to the Zipf distribution. And then, only the dense center grids are clustered with peak density, and the calculation of Euclidean distance is reduced while the cluster centers are determined heuristically, which reduces the complexity of the algorithm. Finally, the sparse edge grids are processed to further optimize the cluster boundary and improve the clustering accuracy. Experimental results on the synthetic datasets and UCI datasets show that the proposed algorithm has obvious advantages for the clustering of large-scale and cross-cluster data, and can reduce the time complexity while ensuring the clustering accuracy.