Abstract: 3D object detection is essential to the perception and understanding capabilities of intelligent systems in complex indoor environments. However, existing detection methods based on single-modality point clouds generally suffer from insufficient semantic information and limited generalization, making it difficult to detect novel object categories in indoor scenes. To address these issues, this paper proposes an indoor 3D object detection method built on image–point cloud–text multimodal fusion. First, the method introduces an early fusion strategy for images and point clouds based on dense depth maps, accurately mapping image semantic features into 3D space under depth constraints; this enriches the semantic expressiveness of the point cloud and alleviates the spatial misalignment caused by occlusion. Second, a hybrid query-guided indoor Transformer detector is designed, whose dual-branch query mechanism combines geometric queries and learnable queries to jointly support fine-grained modeling of local objects and scene-level context modeling. Finally, a dynamic decoupled 3D-IoU loss enhancement strategy is proposed, which decouples spatial gradients and dynamically adjusts weights according to object scale, improving the localization quality of candidate boxes for novel objects. Experimental results on the SUN RGB-D dataset show that the proposed method outperforms existing state-of-the-art methods on multiple evaluation metrics, validating its effectiveness and robustness for indoor open-domain 3D object detection.
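
The abstract's early fusion step can be read as depth-constrained "feature painting": project each point into the image, and attach per-pixel semantic features only where the point's depth agrees with the dense depth map, which filters occluded associations. The sketch below is a minimal illustration under that reading; the function name `paint_points`, the tolerance `tau`, and the camera-frame assumptions are ours, not the paper's implementation.

```python
# A minimal sketch of depth-constrained image-to-point feature painting
# (one reading of the paper's early fusion step; names and the depth
# tolerance `tau` are illustrative assumptions).
import torch

def paint_points(points, img_feats, dense_depth, K, tau=0.1):
    """
    points:      (N, 3) point cloud in camera coordinates.
    img_feats:   (C, H, W) per-pixel semantic features from a 2D backbone.
    dense_depth: (H, W) dense depth map aligned with the image.
    K:           (3, 3) camera intrinsics.
    tau:         depth-consistency tolerance in metres (assumed).
    Returns (N, 3 + C) points with painted features; occluded points
    (whose depth disagrees with the dense depth map) get zero features.
    """
    C, H, W = img_feats.shape
    # Project points onto the image plane: u = fx*x/z + cx, v = fy*y/z + cy.
    z = points[:, 2].clamp(min=1e-6)
    u = (K[0, 0] * points[:, 0] / z + K[0, 2]).round().long().clamp(0, W - 1)
    v = (K[1, 1] * points[:, 1] / z + K[1, 2]).round().long().clamp(0, H - 1)
    # Depth constraint: keep a pixel's features only if the point's depth
    # matches the dense depth map there, suppressing occluded matches.
    visible = (z - dense_depth[v, u]).abs() < tau
    painted = img_feats[:, v, u].t() * visible.unsqueeze(1)  # (N, C)
    return torch.cat([points, painted], dim=1)
```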
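The dual-branch query mechanism can likewise be sketched as geometry-derived queries (e.g., lifted from sampled point centers) concatenated with learnable scene-level queries before a standard Transformer decoder. The module below is a plausible minimal form, assuming farthest-point-sampled seeds and illustrative dimensions; it is not the authors' architecture.

```python
# A minimal sketch of the dual-branch query idea: geometric queries from
# sampled scene points plus learnable scene-level queries, decoded jointly
# (dimensions and layer counts are assumptions).
import torch
import torch.nn as nn

class HybridQueryDecoder(nn.Module):
    def __init__(self, d_model=256, n_learnable=128, n_layers=6):
        super().__init__()
        # Scene-level learnable queries, shared across all scenes.
        self.learnable_q = nn.Parameter(torch.randn(n_learnable, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.pos_embed = nn.Linear(3, d_model)  # lift sampled xyz into queries

    def forward(self, point_feats, sampled_xyz):
        """
        point_feats: (B, P, d_model) encoded scene tokens.
        sampled_xyz: (B, K, 3) geometric query seeds (e.g., FPS centers).
        Returns (B, K + n_learnable, d_model) decoded object queries.
        """
        B = point_feats.shape[0]
        geo_q = self.pos_embed(sampled_xyz)                        # local branch
        learn_q = self.learnable_q.unsqueeze(0).expand(B, -1, -1)  # scene branch
        queries = torch.cat([geo_q, learn_q], dim=1)
        return self.decoder(queries, point_feats)
```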
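Finally, one way to realize a "decoupled, scale-weighted" 3D-IoU loss is to compute a 1D IoU term per axis, so gradients along x, y, and z do not interfere, and to reweight each box by its ground-truth scale so small objects contribute more. The exact decoupling and weighting below are our assumptions in the spirit of the abstract, not the paper's formulation.

```python
# A minimal sketch of an axis-decoupled, scale-weighted 3D-IoU loss
# (assumed form; axis-aligned boxes only).
import torch

def decoupled_iou3d_loss(pred, gt, eps=1e-6):
    """
    pred, gt: (N, 6) axis-aligned boxes as (cx, cy, cz, dx, dy, dz).
    Computes 1 - IoU per axis so spatial gradients are decoupled, then
    weights each box by its scale so small objects are not drowned out
    by large ones (illustrative dynamic weighting).
    """
    losses = []
    for axis in range(3):
        c_p, d_p = pred[:, axis], pred[:, 3 + axis]
        c_g, d_g = gt[:, axis], gt[:, 3 + axis]
        lo = torch.max(c_p - d_p / 2, c_g - d_g / 2)
        hi = torch.min(c_p + d_p / 2, c_g + d_g / 2)
        inter = (hi - lo).clamp(min=0)
        union = d_p + d_g - inter
        losses.append(1.0 - inter / (union + eps))   # per-axis 1D IoU loss
    loss = torch.stack(losses, dim=1).sum(dim=1)     # (N,) decoupled sum
    # Dynamic scale weight: smaller ground-truth boxes get larger weight
    # (inverse cube-root volume, mean-normalized; an illustrative choice).
    scale = gt[:, 3:].prod(dim=1).clamp(min=eps).pow(1.0 / 3.0)
    w = 1.0 / scale
    w = w / (w.mean() + eps)
    return (w.detach() * loss).mean()
```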