Clustering algorithm based on filter model
Control and Decision, 2020, Vol. 35, Issue 5: 1091-1101

### Cite this article

QIU Bao-zhi, ZHANG Rui-lin, LI Xiang-li. Clustering algorithm based on filter model[J]. Control and Decision, 2020, 35(5): 1091-1101. DOI: 10.13195/j.kzyjc.2018.1089.


Clustering algorithm based on filter model
QIU Bao-zhi , ZHANG Rui-lin , LI Xiang-li
School of Information Engineering, Zhengzhou University, Zhengzhou 450001, China
Abstract: A reasonable clustering prototype is the prerequisite for correct clustering. Most existing clustering algorithms suffer from shortcomings such as unreasonable selection of clustering prototypes and miscalculation of the number of clusters. A clustering algorithm based on a filter model (CA-FM) is proposed. The algorithm uses the proposed filter model to remove the boundary and noise objects that interfere with the clustering process. An adjacency matrix is generated according to the neighbor relationships among the core objects, and the number of clusters is obtained by traversing the matrix. Then, the objects are sorted by density factor, and clustering prototypes are selected among them. Finally, each remaining object is assigned to the cluster of its nearest higher-density object. Experiments on synthetic datasets, UCI datasets and the Olivetti face dataset demonstrate the effectiveness of the proposed algorithm. Compared with similar algorithms, CA-FM achieves higher clustering accuracy.
Keywords: clustering algorithm; filter model; deviation factor; clustering prototype; local density; density factor
0 Introduction

Partition-based clustering algorithms, represented by K-means and FCM, take k randomly selected objects as the initial clustering prototypes, assign each data object to the most similar prototype so as to form clusters, and then repeatedly recompute each cluster's prototype and reassign objects until the objective function converges. This mechanism means that such algorithms cannot handle non-spherical clusters effectively and that their clustering accuracy is limited.
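The iterate-until-convergence mechanism described above can be sketched as follows. This is a generic K-means sketch for illustration, not code from the paper:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal K-means sketch: random initial prototypes, assign each
    object to its nearest prototype, recompute, repeat until stable."""
    rng = np.random.default_rng(seed)
    # random initial prototypes drawn from the data objects
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # assign every object to its nearest prototype
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # recompute each cluster's prototype (keep old one if cluster is empty)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):   # objective has converged
            break
        centers = new
    return labels, centers
```

On two well-separated blobs this converges in a couple of iterations; on non-spherical clusters it fails in exactly the way the text describes.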

Algorithms such as DPC, LPC and CDP compute object densities and select the density-peak objects in the decision graph as clustering prototypes; the remaining data objects are then assigned to clusters according to the minimum distance to a peak. Although these algorithms are structurally simple and easy to understand, the density measures of DPC and CDP depend on a manually supplied cutoff-distance parameter, and an unreasonable setting triggers a chain reaction of object misassignments. The LPC algorithm borrows the idea of spectral clustering: it treats the dataset as an undirected graph and characterizes object density by Laplacian centrality[14]. Since extracting the eigenvalues of every dimension to form the Laplacian matrix is computationally expensive, the running time of LPC increases exponentially on high-dimensional datasets, so it cannot meet the requirements of high-dimensional clustering.

1 Clustering algorithm based on the filter model
1.1 Related definitions

 (1)

 (2)

 (3)

 (4)

The DPC algorithm holds that a cluster center has a relatively high density and is relatively far from any object of higher density[11]. To facilitate the selection of cluster centers, the concept of a density factor is used here to amplify the feature difference between cluster centers and the other objects.

 (5)

 (6)
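The bodies of Eqs. (5) and (6) did not survive extraction, so the following is only a minimal sketch of a DPC-style density factor as the text describes it: den is taken here as a k-nearest-neighbour local density, δ as the distance to the nearest higher-density object, and R = den·δ. The exact formulas are assumptions, not the paper's definitions:

```python
import numpy as np

def density_factor(X, k):
    """Hypothetical sketch of the density factor R (eqs. (5)-(6) assumed):
    cluster centers combine high density with a large distance delta to
    any higher-density object, so R = den * delta amplifies them."""
    # pairwise Euclidean distances
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # local density: inverse of the mean distance to the k nearest neighbours
    knn = np.sort(D, axis=1)[:, 1:k + 1]
    den = 1.0 / (knn.mean(axis=1) + 1e-12)
    # delta: distance to the nearest object of higher density;
    # the globally densest object gets the maximum distance instead
    order = np.argsort(-den)
    delta = np.empty(len(X))
    delta[order[0]] = D[order[0]].max()
    for rank in range(1, len(order)):
        i = order[rank]
        delta[i] = D[i, order[:rank]].min()
    return den * delta
```

Under this sketch, one object per cluster stands out with a large R, which is what makes sorting by density factor usable for prototype selection.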

1.2 Obtaining the core objects

 (7)

 (8)

α is the filter factor (a percentile position over the deviation factors sorted in descending order). For a dataset D, the filter model FM divides it into a non-core object set non_core_set and a core object set core_set, defined as follows:

 (9)
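The split described above can be sketched as follows. Since the body of Eq. (9) is not recoverable, the thresholding rule here is an assumption: objects whose deviation factor falls in the top α percent are treated as boundary/noise objects and filtered out, and the rest become core objects:

```python
import numpy as np

def filter_model(df_vals, alpha):
    """Sketch of the filter model FM (thresholding rule assumed):
    df_vals are the deviation factors, alpha the filter factor in percent."""
    # cut point: the (100 - alpha)-th percentile of the deviation factors
    thr = np.percentile(df_vals, 100 - alpha)
    idx = np.arange(len(df_vals))
    non_core_set = idx[df_vals > thr]   # boundary / noise objects, removed
    core_set = idx[df_vals <= thr]      # objects kept for clustering
    return core_set, non_core_set
```

With α = 20, roughly the 20% of objects with the largest deviation factors are filtered out before the adjacency matrix is built.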

Fig. 1 Process of obtaining the core points (α = 19.84)
1.3 Selection of clustering prototypes

 (10)

1.4 The clustering algorithm based on the filter model

step 1: Computation. Compute each data object's local density den, density factor R, distance δ, and deviation factor df according to Eqs. (4), (5), (6) and (8).

step 2: Obtain the core objects.

step 2.1: Sort the objects in descending order of deviation factor df to obtain the sequence df_desc;

step 2.2: Take the objects ranked after the position determined by the filter factor α in df_desc as core objects, forming the core object set core_set.

step 3: Obtain the prototypes.

step 3.1: Determine the neighbor relationships among the core objects according to Eq. (10) and build the adjacency matrix Conn_M;

step 3.2: Traverse the adjacency matrix and count its connected subgraphs to obtain the number of cluster centers I;

step 3.3: Take the top I objects in descending order of density factor as the cluster centers.

step 4: Assignment. Assign each remaining data object to the cluster of its nearest higher-density object, and return the clustering labels Label.
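Step 3.2 can be sketched as a plain breadth-first traversal that counts the connected subgraphs of the adjacency matrix (the construction of Conn_M itself depends on Eq. (10) and is not shown here):

```python
from collections import deque

def count_components(conn_m):
    """Count connected subgraphs in the adjacency matrix Conn_M by BFS;
    the component count is the number of cluster centers I (step 3.2)."""
    n = len(conn_m)
    seen = [False] * n
    components = 0
    for s in range(n):
        if seen[s]:
            continue
        components += 1          # a new, unvisited subgraph starts here
        q = deque([s])
        seen[s] = True
        while q:                 # traverse everything reachable from s
            u = q.popleft()
            for v in range(n):
                if conn_m[u][v] and not seen[v]:
                    seen[v] = True
                    q.append(v)
    return components
```

Since boundary and noise objects were already filtered out in step 2, each connected subgraph of core objects corresponds to one cluster, which is why no user-supplied cluster count is needed.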

2 Experimental results and analysis

2.1 Synthetic datasets

Overall, DPC clusters better than K-means, FCM and DBSCAN. However, its center-selection process depends on the distribution of the data objects in the decision graph: different cutoff parameters produce decision graphs of different shapes, which may shrink the feature gap between cluster centers and the other objects, making it hard to select the correct centers and ultimately degrading the clustering result. For example, DPC performs poorly on the 4k2-far dataset.

LPC clusters better than DPC. Although LPC requires no explicit input parameter, it still selects cluster centers according to the actual number of clusters and the decision graph.

CDP further separates the cluster centers from the other points in the decision graph using the difference between the minimum distance to a higher-density point and the minimum distance to a lower-density point, which makes the centers easier to pick out. However, the algorithm requires the number of clusters as input, and because it computes density with a cutoff distance dc, its ability to handle multi-density data is weak; its results on the Compound and Jain datasets, for example, are unsatisfactory.

CA-FM achieves the best clustering result on four of the six synthetic datasets. On the Compound and Aggregation datasets, most of its clustering indices are the best and the remaining indices fall only slightly short of the best values, indicating that CA-FM is effective for clustering data with diverse distributions.

2.2 UCI datasets

2.3 High-dimensional datasets

Fig. 2 Clustering results of the algorithms on the face recognition dataset

2.4 Algorithm analysis
2.4.1 Time complexity analysis

The main computational costs of CA-FM are: computing the deviation factors, computing the local densities, computing the δ distances, building the adjacency matrix, counting the subgraphs, and assigning the objects.

2.4.2 Parameter sensitivity analysis

CA-FM has two parameters: the number of nearest neighbors k and the filter factor α. Four two-dimensional datasets and four high-dimensional datasets are used for the sensitivity analysis. Figs. 3(a) to 3(c) show the relationship between k and clustering accuracy. As k increases, an object's sampling neighborhood keeps expanding, the local density characterizes the true data distribution more faithfully, and the clustering accuracy rises. If k grows too large, however, an object's neighbor relationships extend into other clusters, causing clusters to merge and accuracy to drop. In general, the algorithm achieves good clustering results when k takes a value between 5 and 31.

Fig. 3 Relationship between clustering accuracy and the parameters k and α

2.4.3 Scalability analysis

Fig. 4 Running time of the algorithm versus sample size and dimensionality
3 Conclusion

[1] Kacprzyk J, Pedrycz W. Springer handbook of computational intelligence[M]. Berlin: Springer Publishing Company, 2015: 578-600.
[2] Li X L, Han Q, Qiu B Z. A clustering algorithm using skewness-based boundary detection[J]. Neurocomputing, 2018, 275: 618-626. DOI: 10.1016/j.neucom.2017.09.023.
[3] Mondal S A. An improved approximation algorithm for hierarchical clustering[J]. Pattern Recognition Letters, 2018, 104: 23-28. DOI: 10.1016/j.patrec.2018.01.015.
[4] Chen H Z, Wang W W, Feng X C, et al. Discriminative and coherent subspace clustering[J]. Neurocomputing, 2018, 284: 177-186. DOI: 10.1016/j.neucom.2018.01.006.
[5] Li H L, Wang C, Deng X Y. Multivariate time series clustering based on affinity propagation of component attributes[J]. Control and Decision, 2018, 33(4): 649-656.
[6] Li X L, Cao X F, Qiu B Z. Clustering boundary pattern discovery for high dimensional space based on matrix model[J]. Acta Automatica Sinica, 2017, 43(11): 1962-1972.
[7] Zhang Q, Fang Z G, Cai J J, et al. Generalized interval grey entropy-weight clustering model based on multiple heterogeneous uncertainty cases study[J]. Control and Decision, 2018, 33(8): 1481-1488.
[8] Macqueen J. Some methods for classification and analysis of multivariate observations[C]. Proc of the 5th Berkeley Symposium on Mathematical Statistics and Probability. Berkeley: University of California Press, 1967: 281-297.
[9] Bezdek J C, Ehrlich R, Full W. FCM: The fuzzy c-means clustering algorithm[J]. Computers & Geosciences, 1984, 10(2/3): 191-203.
[10] Ester M, Kriegel H P, Xu X W, et al. A density-based algorithm for discovering clusters in large spatial databases with noise[C]. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96). Portland: AAAI Press, 1996: 226-231.
[11] Rodriguez A, Laio A. Clustering by fast search and find of density peaks[J]. Science, 2014, 344(6191): 1492-1496. DOI: 10.1126/science.1242072.
[12] Yang X H, Zhu Q P, Huang Y J, et al. Parameter-free Laplacian centrality peaks clustering[J]. Pattern Recognition Letters, 2017, 100: 167-173. DOI: 10.1016/j.patrec.2017.10.025.
[13] Li Z J, Tang Y C. Comparative density peaks clustering[J]. Expert Systems with Applications, 2018, 95: 236-247. DOI: 10.1016/j.eswa.2017.11.020.
[14] Qi X, Fuller E, Wu Q, et al. Laplacian centrality: A new centrality measure for weighted networks[J]. Information Sciences, 2012, 194: 240-253. DOI: 10.1016/j.ins.2011.12.027.
[15] Kumar K M, Reddy A R M. A fast DBSCAN clustering algorithm by accelerating neighbor searching using groups method[J]. Pattern Recognition, 2016, 58: 39-48. DOI: 10.1016/j.patcog.2016.03.008.
[16] Li X L, Han Q, Qiu B Z. A clustering algorithm with affine space-based boundary detection[J]. Applied Intelligence, 2018, 48(2): 432-444. DOI: 10.1007/s10489-017-0979-z.
[17] Qiu B Z, Cao X F. Clustering boundary detection for high dimensional space based on space inversion and Hopkins statistics[J]. Knowledge-Based Systems, 2016, 98: 216-225. DOI: 10.1016/j.knosys.2016.01.035.
[18] Chen H P, Shen X J, Lv Y D, et al. A novel automatic fuzzy clustering algorithm based on soft partition and membership information[J]. Neurocomputing, 2017, 236: 104-112. DOI: 10.1016/j.neucom.2016.09.103.
[19] Zhang Y, Mańdziuk J, Chai H Q, et al. Curvature-based method for determining the number of clusters[J]. Information Sciences, 2017, 415: 414-428.
[20] Chen J Y, He H H. Research on density-based clustering algorithm for mixed data with cluster centers determined automatically[J]. Acta Automatica Sinica, 2015, 41(10): 1798-1813.
[21] Ding S F, Du M J, Sun T F, et al. An entropy-based density peaks clustering algorithm for mixed type data employing fuzzy neighborhood[J]. Knowledge-Based Systems, 2017, 133: 294-313. DOI: 10.1016/j.knosys.2017.07.027.
[22] Aliguliyev R M. Performance evaluation of density-based clustering methods[J]. Information Sciences, 2009, 179(20): 3583-3602. DOI: 10.1016/j.ins.2009.06.012.
[23] Qiao Y, Wang S T, Hang W L. Clustering by gravitational synchronization on large scale dataset[J]. Control and Decision, 2017, 32(6): 1075-1083.
[24] Xu M L, Wang S T, Hang W L. A semi-supervised affinity propagation clustering method with homogeneity constraint[J]. Acta Automatica Sinica, 2016, 42(2): 255-269.
[25] Chen M, Li L J, Wang B, et al. Effectively clustering by finding density backbone based on kNN[J]. Pattern Recognition, 2016, 60: 486-498. DOI: 10.1016/j.patcog.2016.04.018.
[26] Campo D N, Stegmayer G, Milone D H. A new index for clustering validation with overlapped clusters[J]. Expert Systems with Applications, 2016, 64: 549-556. DOI: 10.1016/j.eswa.2016.08.021.
[27] Du M, Ding S, Xue Y. A novel density peaks clustering algorithm for mixed data[J]. Pattern Recognition Letters, 2017, 97: 46-53. DOI: 10.1016/j.patrec.2017.07.001.