Abstract: In complex indoor environments, UWB positioning accuracy is severely limited by non-line-of-sight (NLoS) propagation, multipath, and dynamic interference. Moreover, conventional anchor selection strategies fail to adaptively balance signal quality against favorable geometric distribution. To address these challenges, this paper proposes an adaptive anchor selection method based on a Double-layer Deep Proximal Policy Optimization (DDPPO) framework, modeling anchor selection as a sequential decision problem solved with deep reinforcement learning. First, a multi-source state space integrating channel impulse response (CIR) signals, geometric layout, temporal trajectory information, and signal quality metrics is constructed to comprehensively perceive the dynamic environment. Second, a hierarchical dual-layer PPO architecture is designed to decouple anchor selection into quantity and combination decisions, combined with a curriculum learning strategy for rapid convergence. Finally, the anchor subset produced by the policy network is used directly for position estimation via a weighted least squares (WLS) model. The positioning error serves as the dominant term of the reward function, which incorporates a sample-difficulty adaptive mechanism that dynamically adjusts the preferred anchor count based on real-time error. The resulting reward signal is fed back to the decision network, forming a closed-loop learning system. Experiments on four real-world indoor datasets show that the DDPPO method achieves an average error of 0.219 m, an improvement of 45.9% to 72.5% over six existing methods, while balancing accuracy and efficiency.
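To make the WLS position estimation step concrete, the sketch below shows one standard formulation of linearized weighted multilateration that is consistent with, but not taken from, the paper: ranges to a selected anchor subset are linearized against a reference anchor and solved in closed form. The function name `wls_position`, the example room geometry, and the per-anchor weights are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def wls_position(anchors, ranges, weights):
    """Linearized weighted least-squares (WLS) multilateration sketch.

    anchors : (n, 2) array of selected anchor coordinates (n >= 3)
    ranges  : (n,)  array of measured UWB distances to each anchor
    weights : (n,)  array of per-anchor quality weights (e.g. derived
              from CIR-based signal quality metrics; assumed here)
    """
    a0, d0 = anchors[0], ranges[0]
    # Subtracting the reference anchor's range equation cancels the
    # quadratic term: 2*(a_i - a_0)^T x = d_0^2 - d_i^2 + |a_i|^2 - |a_0|^2
    A = 2.0 * (anchors[1:] - a0)
    b = (d0**2 - ranges[1:]**2
         + np.sum(anchors[1:]**2, axis=1) - a0 @ a0)
    W = np.diag(weights[1:])
    # Closed-form WLS solution of the normal equations
    return np.linalg.solve(A.T @ W @ A, A.T @ W @ b)

# Toy usage: four anchors in an 8 m x 6 m room, 5 cm ranging noise,
# with one anchor down-weighted as if flagged NLoS.
rng = np.random.default_rng(0)
anchors = np.array([[0.0, 0.0], [8.0, 0.0], [8.0, 6.0], [0.0, 6.0]])
true_pos = np.array([3.0, 2.5])
ranges = np.linalg.norm(anchors - true_pos, axis=1) + rng.normal(0.0, 0.05, 4)
weights = np.array([1.0, 1.0, 0.4, 0.9])
print(wls_position(anchors, ranges, weights))  # approx. [3.0, 2.5]
```

In the paper's pipeline, both the anchor subset and any quality weighting would come from the DDPPO policy and the CIR-derived metrics; here they are fixed constants purely for illustration.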