Offline reinforcement learning based on advantage-constrained diffusion policy
Authors: WANG Xuesong, ZHANG Hengrui, ZHANG Jiazhi, et al.

CLC number: TP18

Fund projects: National Natural Science Foundation of China (62373364, 62176259); Key Research and Development Program of Jiangsu Province (BE2022095).


Abstract:

Offline reinforcement learning aims to learn policies from static datasets of previously collected experience. This data-driven paradigm greatly expands the potential for applying reinforcement learning to real-world problems. However, offline datasets are typically collected by policies of varying quality, so the action distribution is multimodal and difficult to model, and high-return trajectories are scarce, which lowers the efficiency of policy learning. To address these challenges, this paper proposes an offline reinforcement learning method based on an advantage-constrained diffusion policy. First, the reverse process of a diffusion model is used to generate the policy, allowing the multimodal action distribution to be fitted more faithfully. Then, during policy improvement, an advantage function is used to constrain the policy so that the agent focuses on the scarce high-return trajectories; two task-specific advantage functions are constructed for continuous control tasks and sparse-reward navigation tasks, respectively. Experimental results on bandit tasks and the D4RL benchmark show that the proposed method effectively mitigates the limited expressiveness of the behavior policy and the scarcity of high-return trajectories, achieving the highest normalized score on most tasks.
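The abstract only summarizes the approach. As a rough illustration of the general idea of combining a diffusion-based policy with an advantage-based constraint, the sketch below (PyTorch) trains a conditional noise-prediction network on offline (state, action) pairs and reweights the denoising loss with an exponential advantage term exp(A/λ). The network architecture, the exponential weighting form, and the toy data are assumptions made for illustration only; they are not taken from the paper.

# Minimal, self-contained sketch of an advantage-weighted diffusion-policy
# training step. All design choices below are illustrative assumptions.
import torch
import torch.nn as nn

T = 50                                    # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)     # noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

class NoisePredictor(nn.Module):
    """epsilon_theta(a_t, s, t): predicts the noise added to the clean action."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, a_t, s, t):
        t_emb = t.float().unsqueeze(-1) / T          # scalar timestep embedding
        return self.net(torch.cat([a_t, s, t_emb], dim=-1))

def weighted_denoising_loss(model, s, a0, advantage, lam=1.0):
    """Denoising loss reweighted so high-advantage actions dominate training (assumed form)."""
    b = a0.shape[0]
    t = torch.randint(0, T, (b,))
    eps = torch.randn_like(a0)
    ab = alphas_bar[t].unsqueeze(-1)
    a_t = ab.sqrt() * a0 + (1.0 - ab).sqrt() * eps   # forward diffusion q(a_t | a_0)
    eps_hat = model(a_t, s, t)
    w = torch.exp(advantage / lam).clamp(max=100.0)  # advantage-based weight
    per_sample = ((eps_hat - eps) ** 2).mean(dim=-1)
    return (w * per_sample).mean()

# Toy usage on random data standing in for an offline batch (s, a, A(s, a)).
state_dim, action_dim, batch = 11, 3, 64
model = NoisePredictor(state_dim, action_dim)
opt = torch.optim.Adam(model.parameters(), lr=3e-4)
s = torch.randn(batch, state_dim)
a0 = torch.randn(batch, action_dim)
adv = torch.randn(batch)                             # would come from a learned critic
loss = weighted_denoising_loss(model, s, a0, adv)
opt.zero_grad(); loss.backward(); opt.step()

At deployment, an action would be sampled by iterating the learned reverse diffusion steps from Gaussian noise conditioned on the current state; that sampling loop is omitted here for brevity.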

Cite this article:

WANG Xuesong, ZHANG Hengrui, ZHANG Jiazhi, et al. Offline reinforcement learning based on advantage-constrained diffusion policy[J]. Control and Decision, 2025, 40(6): 1903-1912.

History
  • Received: 2024-05-22
  • Published online: 2025-04-30
  • Published in print: 2025-06-20