基于Zipf分布的网格密度峰值聚类算法
作者:
作者单位:

1.南京财经大学;2.南京邮电大学

作者简介:

通讯作者:

中图分类号:

TP 18

基金项目:

国家自然科学基金项目(面上项目,重点项目,重大项目)


Grid Density Peak Clustering Algorithm Based on Zipf Distribution
Author:
Affiliation:

Nanjing University of Finance and Economics

Fund Project:

The National Natural Science Foundation of China (General Program, Key Program, Major Research Plan)

  • 摘要
  • |
  • 图/表
  • |
  • 访问统计
  • |
  • 参考文献
  • |
  • 相似文献
  • |
  • 引证文献
  • |
  • 资源附件
  • |
  • 文章评论
    摘要:

    网格密度峰值聚类在兼顾密度峰值聚类算法可识别任意形状类簇的基础上,通过数据集的网格化简化整体计算量,成为当前备受关注的聚类方法. 针对大规模数据,如何进一步区分稠密与稀疏网格,减少网格密度峰值聚类中参与计算的非空网格代表点的数量是解决“网格灾难”的关键. 结合以网格密度为变量的概率密度分布呈现出类Zipf分布的特点,提出一种基于Zipf分布的网格密度峰值聚类算法. 首先计算所有非空网格的密度并映射为Zipf分布,根据对应的Zipf分布筛选出稠密中心网格和稀疏边缘网格;然后仅对稠密中心网格进行密度峰值聚类,在自适应确定潜在聚类中心的同时减少欧氏距离的计算量,降低算法复杂度;最后通过对稀疏边缘网格的处理,进一步优化类簇边界并提高聚类精度. 人工数据集和UCI数据集下的实验结果表明,所提出算法对大规模、类簇交叉数据的聚类具有明显优势,能够在保证聚类精度的同时降低时间复杂度.

    Abstract:

    Grid density peak clustering, taking advantage of the strong ability that the density peak clustering algorithm can identify arbitrary shape clusters, greatly reduces the computational cost by gridding the data set. It has become a popular clustering method nowadays. However, for large-scale data, how to further distinguish the dense and sparse grids and reduce the number of non-empty grids involved in the calculation of the grid density peak clustering is the key to solve the ”grid disaster”. In view of the probability density distribution with grid density as variable shows a Zipf-like distribution, a grid density peak clustering algorithm based on Zipf distribution is proposed. Firstly, the density of all non-empty grids is calculated and mapped to Zipf distribution, so the dense center grids and sparse edge grids are filtered according to the Zipf distribution. And then, only the dense center grids are clustered with peak density, and the calculation of Euclidean distance is reduced while the cluster centers are determined heuristically, which reduces the complexity of the algorithm. Finally, the sparse edge grids are processed to further optimize the cluster boundary and improve the clustering accuracy. Experimental results on the synthetic datasets and UCI datasets show that the proposed algorithm has obvious advantages for the clustering of large-scale and cross-cluster data, and can reduce the time complexity while ensuring the clustering accuracy.

    参考文献
    相似文献
    引证文献
引用本文
分享
文章指标
  • 点击次数:
  • 下载次数:
  • HTML阅读次数:
  • 引用次数:
历史
  • 收稿日期:2022-03-09
  • 最后修改日期:2023-08-08
  • 录用日期:2022-10-10
  • 在线发布日期: 2022-10-22
  • 出版日期: