Abstract: In complex indoor environments, UWB positioning accuracy is severely limited by non-line-of-sight (NLoS) propagation, multipath, and dynamic interference. Moreover, conventional anchor selection strategies fail to adaptively balance signal quality against favorable geometric distribution. To address these challenges, this paper proposes an adaptive anchor selection method based on a Double-layer Deep Proximal Policy Optimization (DDPPO) framework, modeling anchor selection as a sequential decision problem solved with deep reinforcement learning. First, a multi-source state space integrating channel impulse response (CIR) signals, geometric layout, temporal trajectory information, and signal quality metrics is constructed to comprehensively perceive the dynamic environment. Second, a hierarchical dual-layer PPO architecture is designed to decouple anchor selection into quantity and combination decisions, combined with a curriculum learning strategy for rapid convergence. Finally, the anchor subset produced by the policy network is used directly for position estimation via a weighted least squares (WLS) model. The positioning error serves as the dominant term of the reward function, which incorporates a sample-difficulty adaptive mechanism that dynamically adjusts the preferred anchor count based on real-time error. The resulting reward signal is fed back to the decision network, forming a closed-loop learning system. Experiments on four real-world indoor datasets show that the DDPPO method achieves an average error of 0.219 m, an improvement of 45.9% to 72.5% over six existing methods, while balancing accuracy and efficiency.
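To make the WLS position estimation step concrete, the sketch below shows one standard formulation of linearized weighted multilateration that is consistent with, but not taken from, the paper: ranges to a selected anchor subset are linearized against a reference anchor and solved in closed form. The function name `wls_position`, the example room geometry, and the per-anchor weights are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def wls_position(anchors, ranges, weights):
    """Linearized weighted least-squares (WLS) multilateration sketch.

    anchors : (n, 2) array of selected anchor coordinates (n >= 3)
    ranges  : (n,)  array of measured UWB distances to each anchor
    weights : (n,)  array of per-anchor quality weights (e.g. derived
              from CIR-based signal quality metrics; assumed here)
    """
    a0, d0 = anchors[0], ranges[0]
    # Subtracting the reference anchor's range equation cancels the
    # quadratic term: 2*(a_i - a_0)^T x = d_0^2 - d_i^2 + |a_i|^2 - |a_0|^2
    A = 2.0 * (anchors[1:] - a0)
    b = (d0**2 - ranges[1:]**2
         + np.sum(anchors[1:]**2, axis=1) - a0 @ a0)
    W = np.diag(weights[1:])
    # Closed-form WLS solution of the normal equations
    return np.linalg.solve(A.T @ W @ A, A.T @ W @ b)

# Toy usage: four anchors in an 8 m x 6 m room, 5 cm ranging noise,
# with one anchor down-weighted as if flagged NLoS.
rng = np.random.default_rng(0)
anchors = np.array([[0.0, 0.0], [8.0, 0.0], [8.0, 6.0], [0.0, 6.0]])
true_pos = np.array([3.0, 2.5])
ranges = np.linalg.norm(anchors - true_pos, axis=1) + rng.normal(0.0, 0.05, 4)
weights = np.array([1.0, 1.0, 0.4, 0.9])
print(wls_position(anchors, ranges, weights))  # approx. [3.0, 2.5]
```

In the paper's pipeline, both the anchor subset and any quality weighting would come from the DDPPO policy and the CIR-derived metrics; here they are fixed constants purely for illustration.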