Abstract: Small object detection in UAV aerial images is challenging because targets are typically small, densely distributed, and lacking in clear texture detail. To address these challenges, this paper proposes a guided feature fusion algorithm based on a cross-domain dual-stream network. Specifically, a spatial–frequency collaborative dual-stream architecture is introduced in the backbone, in which spatial-domain and frequency-domain feature extraction pathways are constructed in parallel. The spatial stream focuses on capturing local detail features, while the frequency stream incorporates an edge and frequency enhancement module. This module performs a three-band frequency decomposition via a frequency transform and dynamic Gaussian masking, and employs a context-aware gating mechanism to adaptively enhance the features in each frequency band, thereby improving the network's perception of global context. Subsequently, an adaptive spatial–frequency collaborative fusion module is designed to integrate the cross-domain features efficiently through dynamic weight allocation. Finally, a guided three-branch fusion module is adopted in the neck network, where the main-branch features serve as guidance for adaptively fusing semantic and detail information from the upsampled, main-branch, and cross-layer features, effectively alleviating semantic discrepancies across scales. Experiments conducted on the public VisDrone2019 and TinyPerson datasets demonstrate the effectiveness of the proposed method.
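
As an illustration of the three-band decomposition mentioned above, the following is a minimal NumPy sketch, not the paper's implementation. It assumes a 2-D FFT as the frequency transform and builds low, mid, and high bands from two Gaussian masks. The function names, the fixed widths `sigma_low` and `sigma_high`, and the scalar `gates` are all hypothetical: in the proposed method the masks are described as dynamic and the per-band weights come from a context-aware gating mechanism, neither of which is reproduced here.

```python
import numpy as np


def gaussian_band_masks(h, w, sigma_low, sigma_high):
    """Return low/mid/high masks on a centered frequency grid (they sum to 1).

    Assumption: sigma_low < sigma_high, both in normalized frequency units.
    """
    fy = np.fft.fftshift(np.fft.fftfreq(h))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(w))[None, :]
    r2 = fy ** 2 + fx ** 2                      # squared radial frequency
    g_narrow = np.exp(-r2 / (2.0 * sigma_low ** 2))
    g_wide = np.exp(-r2 / (2.0 * sigma_high ** 2))
    low = g_narrow                              # keeps frequencies near DC
    mid = g_wide - g_narrow                     # ring between the two Gaussians
    high = 1.0 - g_wide                         # everything beyond the wide Gaussian
    return low, mid, high


def three_band_decompose(feat, sigma_low=0.05, sigma_high=0.2):
    """Split a 2-D feature map into low-, mid-, and high-frequency components."""
    h, w = feat.shape
    spec = np.fft.fftshift(np.fft.fft2(feat))   # centered spectrum
    masks = gaussian_band_masks(h, w, sigma_low, sigma_high)
    # Mask the spectrum per band and transform back to the spatial domain.
    return [np.fft.ifft2(np.fft.ifftshift(spec * m)).real for m in masks]


def gated_recombine(bands, gates):
    """Weighted sum of the bands; `gates` stands in for learned gate outputs."""
    return sum(g * b for g, b in zip(gates, bands))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feat = rng.standard_normal((32, 32))
    low, mid, high = three_band_decompose(feat)
    # Hypothetical gates that emphasize mid and high bands (edges, fine detail).
    enhanced = gated_recombine([low, mid, high], gates=[1.0, 1.5, 2.0])
    print(enhanced.shape)
```

Because the three masks sum to one at every frequency, the bands add back to the input exactly when all gates equal 1, so any gate other than 1 acts as a per-band gain on top of an identity mapping.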