Delayed deep deterministic policy gradient based on combinatorial network optimization

doi:10.13195/j.kzyjc.2024.0147

2025, Vol. 40, No. 3, pp. 1015-1023

CLC number: TP18
Funding: National Natural Science Foundation of China (62176259, 62006232).


Abstract:

Correcting value-function estimation bias has become an important research direction in deep reinforcement learning. Most existing work focuses on alleviating overestimation bias but overlooks the underestimation bias that this mitigation itself introduces. To address this, multiple Actor and Critic networks are flexibly configured within the Actor-Critic framework to alleviate value-function underestimation bias, and a delayed deep deterministic policy gradient based on combinatorial network optimization (D3PG-CNO) is proposed. The main idea of D3PG-CNO is as follows. In the experience collection phase, a single Critic network evaluates the actions output by multiple Actor networks, and the best action is selected and stored in the experience pool. In the training phase, the Critic network with the smallest estimate for the current state-action pair is selected from the multiple Critic networks and used to evaluate the actions output by the multiple Actor networks; the maximum of these evaluations is used to compute the target value. Experimental results on the MuJoCo platform show that, compared with existing deterministic policy gradient algorithms, D3PG-CNO significantly reduces estimation bias, improves stability and convergence speed, and achieves better performance on multiple tasks.
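The two selection rules described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no network details, so Actors and Critics are modeled here as plain Python callables. The function names, the toy actors and critics, the `done` flag, the discount `gamma`, and the choice to evaluate candidate actions at the next state when forming the target are all assumptions made for illustration.

```python
from typing import Callable, Sequence, Tuple

State = Tuple[float, ...]
Action = Tuple[float, ...]
Actor = Callable[[State], Action]
Critic = Callable[[State, Action], float]


def select_behavior_action(state: State, actors: Sequence[Actor], critic: Critic) -> Action:
    """Collection phase: one critic scores every actor's proposed action; keep the best."""
    candidates = [actor(state) for actor in actors]
    return max(candidates, key=lambda a: critic(state, a))


def compute_target(state: State, action: Action, reward: float, next_state: State,
                   done: bool, actors: Sequence[Actor], critics: Sequence[Critic],
                   gamma: float = 0.99) -> float:
    """Training phase: min over critics to pick the evaluator, max over actors to pick the value."""
    # Choose the critic with the smallest estimate for the sampled (state, action) pair.
    pessimistic = min(critics, key=lambda q: q(state, action))
    # ASSUMPTION: the actors propose actions at next_state, as in a standard TD target.
    best_next = max(pessimistic(next_state, actor(next_state)) for actor in actors)
    # ASSUMPTION: terminal transitions do not bootstrap.
    return reward + gamma * (0.0 if done else 1.0) * best_next


# Hypothetical toy actors and critics on a 1-D state, for illustration only.
toy_actors = [lambda s: (s[0],), lambda s: (0.0,)]
q1 = lambda s, a: -(a[0] - s[0]) ** 2   # prefers actions close to the state value
q2 = lambda s, a: q1(s, a) - 1.0        # uniformly more pessimistic copy of q1

behavior_action = select_behavior_action((2.0,), toy_actors, q1)
target = compute_target((2.0,), (2.0,), 1.0, (1.0,), False, toy_actors, [q1, q2], gamma=0.5)
```

In this toy setup the first actor's action is chosen during collection because `q1` scores it highest, and `q2` is picked as the evaluator during training because it gives the lower estimate for the sampled pair. Taking the maximum over the actors' proposals after the minimum over critics is what counteracts the pessimism of a pure minimum, which is the underestimation effect the abstract targets.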

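Cite this article

Cheng Yuhu (程玉虎), An Bingqing (安冰清), Kong Yi (孔毅). Delayed deep deterministic policy gradient based on combinatorial network optimization[J]. Control and Decision (控制与决策), 2025, 40(3): 1015-1023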

History

Received: 2024-02-06
Published online: 2025-02-11
Publication date: 2025-03-20
