An Efficient Diffusion Policy Guided by the Lower Confidence Bound of a Q-Network Ensemble
CSTR:
Author:
Affiliation:

School of Mathematics and Information Science, Hebei University

Author biography:

Corresponding author:

CLC number: TP18

Fund project: Key R&D Program of the Ministry of Science and Technology (2022YFE0196100); Chunhui Cooperation Project of the Ministry of Education (HZKY20220256-202200417)


Efficient Q-Ensemble Diffusion Policy for Offline Reinforcement Learning
Author:
Affiliation:

Fund Project:

Abstract (translated):

    Offline reinforcement learning data are often collected by a mixture of behaviour policies, so the action space exhibits a complex multimodal distribution. Existing diffusion policies model such distributions well, but their action generation relies on multi-step reverse inference and is therefore slow. Moreover, generated actions may lie close to, without fully leaving, the support of the data; this uncertainty readily causes Q-value overestimation, leading to unstable policies or degraded performance. This paper proposes an Efficient Q-Ensemble Diffusion Policy (E2DP) guided by the lower confidence bound of a Q-network ensemble. A two-step reverse inference mechanism sharply reduces inference cost while preserving the ability to model multimodal action distributions. To counter Q-value overestimation, a lower-confidence-bound estimate over an ensemble of Q networks is introduced; independent target functions and variance regularization with random weight coefficients increase ensemble diversity, so that E2DP needs only a few Q networks to trade off effectively between in-distribution actions and potentially high-value candidates near the data support, improving both performance and robustness. Experiments on Bandit tasks and the D4RL benchmark show that E2DP matches the distributional expressiveness of existing diffusion policies while inferring roughly 2.5× faster, and achieves the best normalized scores on several tasks. The experimental code is open-sourced on GitHub: https://github.com/smartredcat/E2DP.
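The two-step reverse inference idea can be sketched as a DDIM-style deterministic sampler restricted to two denoising steps. This is a minimal illustration, not the authors' implementation; the noise-prediction model `eps_model`, the schedule `alpha_bars`, and the two chosen timesteps are all hypothetical:

```python
import numpy as np

def two_step_reverse(eps_model, state, alpha_bars, rng):
    """Deterministic DDIM-style sampling with only two reverse steps.

    eps_model(state, a_t, t) -> predicted noise for action a_t at step t;
    alpha_bars maps the two chosen timesteps to their cumulative noise
    schedule values (illustrative, not the paper's actual schedule).
    """
    a = rng.standard_normal(2)  # start from Gaussian noise (action dim = 2)
    timesteps = sorted(alpha_bars, reverse=True)  # e.g. [T, T // 2]
    for i, t in enumerate(timesteps):
        ab_t = alpha_bars[t]
        eps = eps_model(state, a, t)
        # Predict the clean action from the current noisy action.
        a0 = (a - np.sqrt(1.0 - ab_t) * eps) / np.sqrt(ab_t)
        if i + 1 < len(timesteps):
            # Re-noise deterministically to the next (smaller) timestep.
            ab_next = alpha_bars[timesteps[i + 1]]
            a = np.sqrt(ab_next) * a0 + np.sqrt(1.0 - ab_next) * eps
        else:
            a = a0
    return np.clip(a, -1.0, 1.0)
```

Because both the clean-action prediction and the re-noising reuse the same noise estimate, only two network evaluations are needed per action, which is the source of the inference speedup.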

    Abstract:

    Offline reinforcement learning datasets are often collected by mixtures of different behaviour policies, resulting in complex multimodal action distributions. Diffusion-based policies can effectively model such action distributions; however, their action generation relies on multi-step reverse diffusion, which incurs high inference latency and becomes a significant bottleneck for practical deployment. Additionally, while generated actions may still fall within the distribution's support, the inherent uncertainty can lead to overestimation of Q-values, thereby compromising policy stability. To address these challenges, we propose an Efficient Q-Ensemble Diffusion Policy (E2DP). E2DP significantly reduces the computational cost of action generation through a two-step reverse diffusion mechanism. Meanwhile, an ensemble of Q networks with variance regularization is employed to improve uncertainty estimation under a small ensemble size, and a lower-confidence-bound constraint is incorporated during policy improvement to balance in-distribution actions against high-risk candidates near the data support. Experimental results on Bandit tasks and the D4RL benchmark demonstrate that E2DP achieves an inference speedup of approximately 2.5× while maintaining distribution modeling capability comparable to existing diffusion-based policies, and obtains improved normalized performance across multiple tasks. The code of E2DP is available on GitHub at https://github.com/smartredcat/E2DP.
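A minimal sketch of the lower-confidence-bound idea over a Q-ensemble (the penalty coefficient `beta` and the function name are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def lcb_q_value(q_values, beta=1.0):
    """Lower confidence bound over an ensemble of Q estimates.

    q_values: array of shape (n_ensemble,) holding Q_i(s, a) for one
    state-action pair. Penalizing the mean by the ensemble standard
    deviation down-weights actions whose value the ensemble disagrees
    on, which is how near-out-of-distribution candidates are discouraged
    while in-distribution actions keep their full value.
    """
    q = np.asarray(q_values, dtype=float)
    return q.mean() - beta * q.std()
```

For example, a unanimous ensemble `[1.0, 1.0, 1.0]` keeps its bound at 1.0, while a disagreeing ensemble `[0.0, 1.0, 2.0]` with the same mean is penalized below 1.0.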

History
  • Received: 2025-10-29
  • Revised: 2026-02-25
  • Accepted: 2026-02-26
  • Published online: 2026-03-26
  • Publication date: