Offline Reinforcement Learning based on Advantage-constrained Diffusion Policy
CSTR:
Author:
Affiliation:

China University of Mining and Technology

Author Biography:

Corresponding Author:

CLC Number:

TP18

Fund Project:

National Natural Science Foundation of China (General Program, Key Program, Major Program)



Abstract:

Offline reinforcement learning aims to learn policies from fixed, static datasets, providing a significant avenue for the transition of reinforcement learning from simulated environments to real-world applications. However, offline datasets are typically collected by policies of varying proficiency, resulting in a multi-modal action distribution that is difficult to represent. Meanwhile, high-return trajectories in offline datasets are scarce, impeding the efficiency of policy learning. To address these challenges, this paper proposes an offline reinforcement learning approach based on an advantage-constrained diffusion policy. First, the policy is generated through the reverse diffusion steps of a diffusion model to better fit the multi-modal behavior policy. Second, a method is proposed to guide the cloning step of policy improvement with advantage functions, helping the agent focus on the scarce high-return trajectories. Finally, two types of advantage functions are developed for continuous control tasks and sparse-reward navigation tasks, respectively. Experimental results on bandit tasks and the D4RL benchmark show that the proposed method effectively alleviates both the limited expressiveness of the behavior policy and the scarcity of high-return trajectories, achieving the highest normalized scores on most tasks.
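
As a rough, hedged illustration of the mechanism described in the abstract (not the authors' released code), the sketch below shows one way an advantage-weighted diffusion behavior-cloning loss could look in PyTorch. The DDPM-style noise-prediction parameterization, the exponential advantage weight in the spirit of advantage-weighted regression, and all identifiers (NoisePredictor, advantage_weighted_diffusion_loss, temperature) are assumptions introduced here for illustration; the paper's exact advantage constraint and its two task-specific advantage functions are not reproduced.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisePredictor(nn.Module):
    """eps_theta(a_k, s, k): predicts the noise added to an action (illustrative)."""
    def __init__(self, state_dim, action_dim, n_timesteps, hidden=256):
        super().__init__()
        self.embed_t = nn.Embedding(n_timesteps, hidden)
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + hidden, hidden), nn.Mish(),
            nn.Linear(hidden, hidden), nn.Mish(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, noisy_action, state, k):
        return self.net(torch.cat([noisy_action, state, self.embed_t(k)], dim=-1))

def advantage_weighted_diffusion_loss(eps_model, states, actions, advantages,
                                      alphas_cumprod, temperature=3.0):
    """Diffusion BC loss, re-weighted so high-advantage actions dominate (assumed form)."""
    B = actions.shape[0]
    K = alphas_cumprod.shape[0]
    k = torch.randint(0, K, (B,), device=actions.device)       # random diffusion step
    noise = torch.randn_like(actions)
    a_bar = alphas_cumprod[k].unsqueeze(-1)                     # \bar{alpha}_k per sample
    noisy_actions = a_bar.sqrt() * actions + (1 - a_bar).sqrt() * noise
    pred = eps_model(noisy_actions, states, k)
    per_sample = F.mse_loss(pred, noise, reduction="none").mean(dim=-1)
    # Exponential advantage weight (clipped, detached), in the spirit of
    # advantage-weighted regression; the paper's exact constraint may differ.
    weights = torch.exp(temperature * advantages).clamp(max=100.0).detach()
    return (weights * per_sample).mean()

if __name__ == "__main__":
    # Toy usage with random data, purely to exercise the loss.
    torch.manual_seed(0)
    K, S, A, B = 10, 17, 6, 32
    betas = torch.linspace(1e-4, 0.02, K)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
    model = NoisePredictor(S, A, K)
    loss = advantage_weighted_diffusion_loss(
        model, torch.randn(B, S), torch.randn(B, A).tanh(),
        torch.randn(B), alphas_cumprod)
    loss.backward()
    print(float(loss))

Sampling an action would then run the learned reverse diffusion chain from Gaussian noise conditioned on the state; the detached exponential weight is what biases the cloned distribution toward the scarce high-return modes rather than the dataset average.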


History
  • Received: 2024-05-22
  • Revised: 2024-10-06
  • Accepted: 2024-10-07
  • Published online: 2024-10-16
  • Publication date: