Offline Reinforcement Learning Based on Dual Value Correction

CSTR:
Author:
Affiliation: School of Information and Control Engineering, China University of Mining and Technology
Author Bio:
Corresponding Author:
CLC Number: TP18
Fund Project: The National Natural Science Foundation of China (General Program, Key Program, Major Research Plan)

Abstract:

Offline reinforcement learning (ORL) learns dynamic decision-making from fixed datasets but is prone to extrapolation errors caused by out-of-distribution actions. Existing approaches typically mitigate this issue by constraining the policy distribution or applying conservative Q-value estimation, but the induced pessimism often leads to suboptimal policies. To address this, this paper introduces a Q-value correction (QVC) Bellman operator from the perspective of improving the accuracy of value-function estimation. The QVC Bellman operator uses the difference between the learned and behavior Q-functions as a directional signal and applies bounded adjustments to the Q-value update target. Building on this, we combine the QVC Bellman operator with the in-distribution Bellman operator to form a balanced Bellman operator, enabling more effective use of both in-distribution and out-of-distribution data. Theoretical analysis shows that the Q-function obtained by iteratively applying the balanced Bellman operator converges, and that its deviation from the true Q-function is bounded. Furthermore, we integrate the balanced Bellman operator into implicit Q-learning and incorporate an adaptive correction mechanism into the V-function update to jointly address Q-value overestimation and underestimation, thus proposing an ORL method based on dual value correction (ORL-DVC). Experimental results on the D4RL benchmark, covering Gym-MuJoCo locomotion and AntMaze navigation tasks, show that ORL-DVC achieves average normalized scores of 80.9 and 62.7, respectively, outperforming existing mainstream ORL methods and exhibiting superior generalization.
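The abstract names the QVC and balanced Bellman operators but does not state their definitions. Purely as an illustrative sketch of the kind of update described, and not the paper's actual formulation, a bounded directional correction and an operator mixture could take the following form, where $\hat{Q}$ (learned Q-function), $Q_{\beta}$ (behavior Q-function), $\epsilon$ (correction bound), and $\lambda$ (mixing weight) are all assumed notation:

\[
\mathcal{T}^{\mathrm{QVC}} Q(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s' \sim P(\cdot \mid s,a)}\!\left[ V(s') \right] + \operatorname{clip}\!\left( \hat{Q}(s,a) - Q_{\beta}(s,a),\, -\epsilon,\, \epsilon \right)
\]

\[
\mathcal{T}^{\mathrm{bal}} Q = \lambda\,\mathcal{T}^{\mathrm{QVC}} Q + (1-\lambda)\,\mathcal{T}^{\mathrm{in}} Q, \qquad \lambda \in [0,1]
\]

Under this reading, clipping keeps each correction within $\pm\epsilon$ of the standard backup, which is consistent with the abstract's claims that the iterated Q-function converges and remains boundedly close to the true Q-function.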

History
  • Received: 2025-11-06
  • Revised: 2026-03-02
  • Accepted: 2026-03-03
  • Published online: 2026-03-23
  • Publication date: