Abstract: 3D object detection is essential to the perception and understanding capabilities of intelligent systems in complex indoor environments. However, existing detection methods based on single-modality point clouds generally suffer from insufficient semantic information and limited generalization, making it difficult to detect novel object categories in indoor scenes. To address these issues, this paper proposes an indoor 3D object detection method built on image–point cloud–text multimodal fusion. First, the method introduces an early fusion strategy for images and point clouds based on dense depth maps, accurately mapping image semantic features into 3D space under depth constraints; this enriches the semantic expressiveness of the point cloud and alleviates the spatial misalignment caused by occlusion. Second, a hybrid query-guided indoor Transformer detector is designed, whose dual-branch query mechanism combines geometric queries and learnable queries to jointly support fine-grained modeling of local objects and scene-level context modeling. Finally, a dynamic decoupled 3D-IoU loss enhancement strategy is proposed, which decouples spatial gradients and dynamically adjusts weights according to object scale, improving the localization quality of candidate boxes for novel objects. Experimental results on the SUN RGB-D dataset show that the proposed method outperforms existing state-of-the-art methods on multiple evaluation metrics, validating its effectiveness and robustness for indoor open-domain 3D object detection.
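
The abstract's early fusion step can be read as depth-constrained "feature painting": project each point into the image, and attach per-pixel semantic features only where the point's depth agrees with the dense depth map, which filters occluded associations. The sketch below is a minimal illustration under that reading; the function name `paint_points`, the tolerance `tau`, and the camera-frame assumptions are ours, not the paper's implementation.

```python
# A minimal sketch of depth-constrained image-to-point feature painting
# (one reading of the paper's early fusion step; names and the depth
# tolerance `tau` are illustrative assumptions).
import torch

def paint_points(points, img_feats, dense_depth, K, tau=0.1):
    """
    points:      (N, 3) point cloud in camera coordinates.
    img_feats:   (C, H, W) per-pixel semantic features from a 2D backbone.
    dense_depth: (H, W) dense depth map aligned with the image.
    K:           (3, 3) camera intrinsics.
    tau:         depth-consistency tolerance in metres (assumed).
    Returns (N, 3 + C) points with painted features; occluded points
    (whose depth disagrees with the dense depth map) get zero features.
    """
    C, H, W = img_feats.shape
    # Project points onto the image plane: u = fx*x/z + cx, v = fy*y/z + cy.
    z = points[:, 2].clamp(min=1e-6)
    u = (K[0, 0] * points[:, 0] / z + K[0, 2]).round().long().clamp(0, W - 1)
    v = (K[1, 1] * points[:, 1] / z + K[1, 2]).round().long().clamp(0, H - 1)
    # Depth constraint: keep a pixel's features only if the point's depth
    # matches the dense depth map there, suppressing occluded matches.
    visible = (z - dense_depth[v, u]).abs() < tau
    painted = img_feats[:, v, u].t() * visible.unsqueeze(1)  # (N, C)
    return torch.cat([points, painted], dim=1)
```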
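The dual-branch query mechanism can likewise be sketched as geometry-derived queries (e.g., lifted from sampled point centers) concatenated with learnable scene-level queries before a standard Transformer decoder. The module below is a plausible minimal form, assuming farthest-point-sampled seeds and illustrative dimensions; it is not the authors' architecture.

```python
# A minimal sketch of the dual-branch query idea: geometric queries from
# sampled scene points plus learnable scene-level queries, decoded jointly
# (dimensions and layer counts are assumptions).
import torch
import torch.nn as nn

class HybridQueryDecoder(nn.Module):
    def __init__(self, d_model=256, n_learnable=128, n_layers=6):
        super().__init__()
        # Scene-level learnable queries, shared across all scenes.
        self.learnable_q = nn.Parameter(torch.randn(n_learnable, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.pos_embed = nn.Linear(3, d_model)  # lift sampled xyz into queries

    def forward(self, point_feats, sampled_xyz):
        """
        point_feats: (B, P, d_model) encoded scene tokens.
        sampled_xyz: (B, K, 3) geometric query seeds (e.g., FPS centers).
        Returns (B, K + n_learnable, d_model) decoded object queries.
        """
        B = point_feats.shape[0]
        geo_q = self.pos_embed(sampled_xyz)                        # local branch
        learn_q = self.learnable_q.unsqueeze(0).expand(B, -1, -1)  # scene branch
        queries = torch.cat([geo_q, learn_q], dim=1)
        return self.decoder(queries, point_feats)
```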
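Finally, one way to realize a "decoupled, scale-weighted" 3D-IoU loss is to compute a 1D IoU term per axis, so gradients along x, y, and z do not interfere, and to reweight each box by its ground-truth scale so small objects contribute more. The exact decoupling and weighting below are our assumptions in the spirit of the abstract, not the paper's formulation.

```python
# A minimal sketch of an axis-decoupled, scale-weighted 3D-IoU loss
# (assumed form; axis-aligned boxes only).
import torch

def decoupled_iou3d_loss(pred, gt, eps=1e-6):
    """
    pred, gt: (N, 6) axis-aligned boxes as (cx, cy, cz, dx, dy, dz).
    Computes 1 - IoU per axis so spatial gradients are decoupled, then
    weights each box by its scale so small objects are not drowned out
    by large ones (illustrative dynamic weighting).
    """
    losses = []
    for axis in range(3):
        c_p, d_p = pred[:, axis], pred[:, 3 + axis]
        c_g, d_g = gt[:, axis], gt[:, 3 + axis]
        lo = torch.max(c_p - d_p / 2, c_g - d_g / 2)
        hi = torch.min(c_p + d_p / 2, c_g + d_g / 2)
        inter = (hi - lo).clamp(min=0)
        union = d_p + d_g - inter
        losses.append(1.0 - inter / (union + eps))   # per-axis 1D IoU loss
    loss = torch.stack(losses, dim=1).sum(dim=1)     # (N,) decoupled sum
    # Dynamic scale weight: smaller ground-truth boxes get larger weight
    # (inverse cube-root volume, mean-normalized; an illustrative choice).
    scale = gt[:, 3:].prod(dim=1).clamp(min=eps).pow(1.0 / 3.0)
    w = 1.0 / scale
    w = w / (w.mean() + eps)
    return (w.detach() * loss).mean()
```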