Deep reinforcement learning-based charging gun assembly strategy with hindsight experience replay and delayed policy updates

CLC number: TP242

Fund projects: National Natural Science Foundation of China (62203116, 62273095); National Key R&D Program of China (2024YFB3312403); Guangdong Basic and Applied Basic Research Foundation General Program (2024A1515010222, 2022A1515240058); Guangdong Provincial Special Projects in Key Fields for Universities (2023ZDZX1040, 2022ZDZX1045).




    Abstract:

    To address the challenges of low sample efficiency, unstable policy updates, and insufficient hardware utilization in traditional reinforcement learning methods for charging gun assembly, we propose an enhanced soft actor-critic (SAC) algorithm that integrates hindsight experience replay (HER) and delayed policy updates (DPU), denoted SAC with HER-DPU. First, a charging gun assembly model is established, and HER is incorporated into the replay buffer to redefine goals and generate "pseudo-success" experiences. Then, DPU is applied in the gradient-update phase: the value network is updated multiple times before each policy update, so that policy improvement is based on more stable value estimates. Finally, training with the SAC with HER-DPU algorithm adopts a dual-thread architecture that decouples data collection from neural network training, improving overall training efficiency. Experimental results show that the proposed algorithm converges in 33.2 hours with an average of 75 assembly steps; compared with the baseline SAC algorithm, it reduces convergence time by 21.4 hours and the average number of assembly steps by 17, effectively improving sample efficiency, policy stability, and training speed.
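The HER idea summarized in the abstract can be sketched in a few lines: a failed assembly episode is relabeled with goals the agent actually reached, so the same trajectory yields "pseudo-success" experiences. The sketch below is illustrative only; the transition fields (`achieved_goal`, `desired_goal`) and the sparse −1/0 reward follow the common HER "future" relabeling strategy and are assumptions, not the paper's exact implementation.

```python
import random

def her_relabel(episode, k=4):
    """Relabel an episode with hindsight goals ("future" strategy sketch).

    episode: list of transition dicts with keys 'obs', 'action', 'reward',
    'achieved_goal', 'desired_goal'. For each transition, k extra copies
    are added whose desired goal is an achieved goal from the same step or
    later, turning a failed trajectory into pseudo-success experiences.
    """
    relabeled = []
    for t, tr in enumerate(episode):
        relabeled.append(dict(tr))  # keep the original transition
        future = episode[t:]        # candidate hindsight goals
        for _ in range(k):
            new_goal = random.choice(future)['achieved_goal']
            new_tr = dict(tr)
            new_tr['desired_goal'] = new_goal
            # sparse reward: success iff the achieved goal matches the new goal
            new_tr['reward'] = 0.0 if tr['achieved_goal'] == new_goal else -1.0
            relabeled.append(new_tr)
    return relabeled
```

In practice the relabeled transitions are simply appended to the SAC replay buffer alongside the originals, densifying the otherwise sparse success signal.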
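The DPU component reduces to an update schedule: the critic is refined on every gradient step, while the actor is updated only once every `policy_delay` critic updates. The class below shows only that schedule, with counters standing in for the actual SAC gradient steps; the name `policy_delay` and the default value are assumptions for illustration.

```python
class DelayedPolicyUpdater:
    """Delayed policy update (DPU) schedule, a minimal sketch.

    The value (critic) networks are updated on every gradient step, while
    the policy (actor) network is updated only once every `policy_delay`
    critic updates, so each policy improvement sees a more stable value
    estimate. Counters here are placeholders for real gradient steps.
    """

    def __init__(self, policy_delay=2):
        self.policy_delay = policy_delay
        self.critic_updates = 0
        self.actor_updates = 0

    def gradient_step(self):
        self.critic_updates += 1                 # always refine the Q estimate
        if self.critic_updates % self.policy_delay == 0:
            self.actor_updates += 1              # delayed policy update
```

The same pattern is used by TD3; here it is grafted onto SAC so that policy updates are driven by better-trained value networks.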
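The dual-thread architecture can likewise be sketched with two plain threads sharing a buffer: one interacts with the environment and stores transitions, the other consumes them for gradient updates. A FIFO queue is used here for simplicity; a real SAC setup would instead sample random minibatches from a lock-protected replay buffer, and the transition tuples and update step below are placeholders.

```python
import queue
import threading

def collect(buffer, n, done):
    """Interaction thread: steps the environment and stores transitions."""
    for i in range(n):
        buffer.put((i, 'action', -1.0, i + 1))  # placeholder transition
    done.set()                                  # signal end of data collection

def train(buffer, done, updates):
    """Training thread: consumes transitions independently of collection."""
    while not (done.is_set() and buffer.empty()):
        try:
            batch = buffer.get(timeout=0.05)
        except queue.Empty:
            continue
        updates.append(batch)                   # stand-in for one gradient step
```

Decoupling the two loops lets environment interaction and network training overlap instead of alternating, which is the source of the reported wall-clock speedup.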

Cite this article:

Wang F J, Peng Y G, Li X, et al. Deep reinforcement learning-based charging gun assembly strategy with hindsight experience replay and delayed policy updates[J]. Control and Decision, 2026, 41(5): 1439-1448.

History
  • Received: 2025-04-15
  • Online: 2026-04-17
  • Published: 2026-05-10