Uncertainty-aware offline reinforcement learning with mild generalization

CLC number: TP18

Fund projects: National Natural Science Foundation of China (62373364, 62176259); Key R&D Program of Jiangsu Province (BE2022095).



Abstract:

Offline reinforcement learning (ORL) addresses dynamic decision-making problems using pre-collected datasets, offering significant potential for applying reinforcement learning in real-world scenarios. Existing ORL methods focus primarily on countering out-of-distribution actions, which increases the likelihood of learning suboptimal policies. To this end, this paper investigates the epistemic error that leads to suboptimality and proposes an uncertainty-aware ORL method with mild generalization (UAMG). First, a mildly generalized policy, consisting of a learned policy and a perturbation model, is designed to adapt to unseen actions. Then, an annealed behavioral-cloning penalty is introduced into the policy update network to gradually improve the generalization capability of the learned policy. Furthermore, uncertainty is incorporated into the Q-function estimation, and an uncertainty quantifier is constructed from the mildly generalized policy, enabling effective quantification of uncertainty and thereby reducing the epistemic error. Theoretical analysis shows that UAMG effectively reduces the suboptimality of the learned policy. Experimental results on the D4RL benchmark demonstrate that, compared with the baseline methods, UAMG excels at suppressing epistemic error and achieves the highest returns on most tasks.
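The abstract describes three mechanisms: a mildly generalized policy (the learned policy's action plus a bounded perturbation), a behavioral-cloning penalty whose weight is annealed over training, and an uncertainty penalty on the Q-estimate. A minimal sketch of how such pieces could be computed is given below; the function names, the linear annealing schedule, the perturbation bound `phi`, and the ensemble-standard-deviation uncertainty quantifier are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def annealed_bc_weight(step, total_steps, w0=1.0, w_min=0.0):
    # Linearly anneal the behavioral-cloning penalty from w0 down to w_min,
    # so the policy is tightly constrained early and generalizes mildly later.
    frac = min(step / total_steps, 1.0)
    return w0 + (w_min - w0) * frac

def mild_generalization_action(policy_action, perturbation, phi=0.05):
    # Add a perturbation clipped to [-phi, phi] to the learned policy's
    # action, allowing limited adaptation to unseen actions while keeping
    # the result inside the valid action range [-1, 1].
    return np.clip(policy_action + np.clip(perturbation, -phi, phi), -1.0, 1.0)

def uncertainty_penalized_q(q_ensemble, beta=1.0):
    # Quantify epistemic uncertainty as the standard deviation across an
    # ensemble of Q-estimates and subtract it from the ensemble mean,
    # yielding a conservative (lower-confidence-bound) value estimate.
    q = np.asarray(q_ensemble, dtype=float)
    return q.mean(axis=0) - beta * q.std(axis=0)
```

For instance, `uncertainty_penalized_q([[2.0], [4.0]])` penalizes the disagreeing ensemble (mean 3.0, std 1.0) down to 2.0, while a unanimous ensemble would be returned unchanged.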

Cite this article:

王雪松, 杨露, 程玉虎. Uncertainty-aware offline reinforcement learning with mild generalization[J]. Control and Decision, 2025, 40(11): 3329-3339.

History
  • Received: 2025-01-27
  • Published online: 2025-10-14
  • Published: 2025-11-20