Spark parallel K-means large scale remote sensing image segmentation based on optimized RDD partition

Control and Decision (控制与决策), 2024, 39(5): 1612-1619. DOI: 10.13195/j.kzyjc.2022.1717

Authors: Li Yu (李玉), Cui Shulin (崔书琳), Zhao Quanhua (赵泉华)
Affiliation: School of Geomatics, Liaoning Technical University, Fuxin 123000, Liaoning, China
Corresponding author: E-mail: liyu@lntu.edu.cn
CLC number: TP751
Funding: Natural Science Foundation of Liaoning Province (2022-MS-400); Key Research Project of the Education Department of Liaoning Province (LJ2020ZD003).


Abstract:

Segmenting large-scale remote sensing images is a major challenge for single-machine processing. The Spark platform makes it possible to build a distributed computing environment for big data processing on a single machine. When Spark's built-in K-means algorithm is applied to digital image processing, the resilient distributed dataset (RDD) partitioning used in the Spark Shuffle is generally left at its default setting. Although this default is simple and convenient, in large-scale image segmentation it tends to produce "many partitions, little data", that is, a large number of partitions that each hold very little data, which greatly slows segmentation. To address this, WorldView-3 remote sensing images covering part of urban Shanghai are used as test data. In the cluster-center initialization stage of K-means, the parameter spark.sql.shuffle.partitions, which governs RDD partitioning, is set explicitly; in the iterative computation stage, the coalesce() operator is called to reduce the number of partitions. Comparison with a serial K-means algorithm verifies the feasibility and effectiveness of processing big data on a single machine, and comparison with the Spark parallel K-means algorithm before optimization shows that the proposed approach segments large-scale remote sensing images faster. The experimental results show that setting the number of RDD partitions to 1 to 10 times the number of CPU cores, in both the initialization and iteration stages, reduces the total running time from 145 s before optimization to 97 s. The gain is greatest in the cluster-center initialization stage, where the time efficiency after optimization is 500 to 1000 times that before optimization.
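The partition-sizing rule reported in the abstract (1 to 10 times the CPU core count) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names `target_partitions`, `coalesce_target`, `apply_to_spark` and `speedup`, and the default multiplier of 2, are assumptions made for the example. Only `spark.sql.shuffle.partitions`, `coalesce()` and the 145 s / 97 s timings come from the abstract. The function `apply_to_spark` shows how the values might be wired into PySpark; it is defined but not executed here, since it needs a running Spark session.

```python
import os


def target_partitions(num_cores, multiplier=2):
    """Return a partition count equal to a multiple of the core count.

    The multiplier is clamped to the 1..10 range reported in the abstract.
    The default of 2 is an arbitrary choice for this sketch.
    """
    if num_cores < 1:
        raise ValueError("num_cores must be at least 1")
    clamped = min(max(multiplier, 1), 10)
    return num_cores * clamped


def coalesce_target(current_partitions, num_cores, multiplier=2):
    """Partition count to pass to coalesce().

    coalesce() without a shuffle can only reduce the number of partitions,
    so the result never exceeds the current count.
    """
    return min(current_partitions, target_partitions(num_cores, multiplier))


def apply_to_spark(spark, rdd, num_cores, multiplier=2):
    """Hypothetical wiring into PySpark (requires pyspark; not run here).

    `spark` is assumed to be a SparkSession and `rdd` an RDD of pixel vectors.
    """
    n = target_partitions(num_cores, multiplier)
    # Replaces the default shuffle partition count (200 in Spark SQL).
    spark.conf.set("spark.sql.shuffle.partitions", str(n))
    # Shrink the RDD to the target count before the iterative stage.
    return rdd.coalesce(coalesce_target(rdd.getNumPartitions(), num_cores, multiplier))


def speedup(before_seconds, after_seconds):
    """Ratio of running time before optimization to running time after."""
    return before_seconds / after_seconds


if __name__ == "__main__":
    cores = os.cpu_count() or 1
    print("cores:", cores)
    print("target partitions:", target_partitions(cores))
    print("coalesce target from 200 partitions:", coalesce_target(200, cores))
    # Total times reported in the abstract: 145 s before, 97 s after.
    print("overall speedup:", round(speedup(145, 97), 2))
```

With the reported totals, the overall speedup is about 1.49, far smaller than the 500 to 1000 times improvement reported for the initialization stage alone. This suggests that the iterative stage dominates the optimized running time.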


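Cite this article

Li Yu, Cui Shulin, Zhao Quanhua. Spark parallel K-means large scale remote sensing image segmentation based on optimized RDD partition[J]. Control and Decision, 2024, 39(5): 1612-1619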


History

Published online: 2024-04-17
Publication date: 2024-05-20
