Abstract: Accurate 6D pose estimation is crucial for flexible manufacturing, robotic grasping, and intelligent assembly. However, it still faces three major limitations: first, small targets are difficult to detect against complex backgrounds; second, traditional registration methods are sensitive to the initial estimate and have narrow convergence basins, while deep learning methods generalize poorly to small industrial objects; third, existing neural rendering approaches, designed primarily for view synthesis from RGB-D images, struggle to meet the industrial demand for lossless, precise pose estimation from the geometry of high-accuracy point clouds. To address these challenges, a collaborative estimation framework termed "Cross-Modal Coarse Localization and Differentiated Fine Registration" (CMCL-DFR) is proposed. In the first stage, a Virtual Rendering-based Neural Pose Estimation (VR-NPE) method is introduced: differentiable rendering bridges the point cloud and image domains, and a purpose-built Geometry-aware Multi-Scale Network (GMS-Net) fuses multimodal features to make small-target detection and coarse localization more robust. In the second stage, a Pose-Guided Multi-Scale Geometric-Aware Registration (PG-MSGAR) method is proposed. In this method, the point cloud is adaptively segmented into regions through curvature analysis, differentiated constraint weights are assigned to regions according to their geometric saliency, and TEASER++ is utilized to suppress outliers, thereby enabling high-precision pose refinement. Experimental results on a self-built Industrial Parts Dataset (IPD) demonstrate that the proposed method achieves an Average Distance (ADD) error of 0.95 mm with a success rate of 91.8%, a 48.6% reduction in ADD error relative to FoundationPose.
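For readers unfamiliar with the ADD metric reported above, the following is a minimal sketch of its standard definition: the mean Euclidean distance between the model points transformed by the estimated pose and by the ground-truth pose. The function names, the toy model points, and the example poses are illustrative assumptions, not the evaluation code or data used in this work; the result is in the same unit as the model points (for example, millimetres).

```python
import math


def transform(point, R, t):
    """Apply the rigid transform R @ point + t to a single 3D point."""
    return tuple(
        sum(R[i][j] * point[j] for j in range(3)) + t[i]
        for i in range(3)
    )


def add_error(points, R_est, t_est, R_gt, t_gt):
    """Mean distance between model points under the estimated and true poses.

    Standard ADD definition; `points` is a hypothetical list of 3D model points.
    """
    if not points:
        raise ValueError("points must be non-empty")
    total = 0.0
    for p in points:
        total += math.dist(transform(p, R_est, t_est), transform(p, R_gt, t_gt))
    return total / len(points)


# Toy example (assumed values): a pure 0.5-unit translation error along x.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ROT_Z90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
model = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 10.0, 0.0)]

translation_err = add_error(model, IDENTITY, (0.5, 0.0, 0.0), IDENTITY, (0.0, 0.0, 0.0))
rotation_err = add_error([(1.0, 0.0, 0.0)], ROT_Z90, (0.0, 0.0, 0.0), IDENTITY, (0.0, 0.0, 0.0))
```

With a pure translation offset, every model point is displaced by the same amount, so the ADD error equals the length of the offset. A success rate is then typically obtained by counting the test cases whose ADD error falls below a chosen threshold; the threshold used in this work is not stated in the abstract, so none is assumed here.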