Deep reinforcement learning framework and algorithms integrated with cognitive behavior models

doi:10.13195/j.kzyjc.2022.0281

Issue 11, pp. 3209-3218

Affiliation: National University of Defense Technology
CLC number: TP183
Fund Project: The National Natural Science Foundation of China (General Program, Key Program, Major Research Plan)

Abstract:

When facing complex tasks with high-dimensional continuous state spaces or sparse rewards, it is very difficult for a deep reinforcement learning (DRL) algorithm to learn an optimal policy from scratch. How to represent existing knowledge in a form that both humans and learning agents can understand, and how to use that knowledge to accelerate policy convergence effectively, remains an open problem. This paper proposes a DRL framework that integrates cognitive behavior models: prior domain knowledge is modeled as a cognitive behavior model based on belief-desire-intention (BDI), which is then used to guide the agent's policy learning. Based on this framework, we propose deep Q-learning with the cognitive behavior model (COG-DQN) and proximal policy optimization with the cognitive behavior model (COG-PPO), and we quantitatively design how the cognitive behavior model guides the agent's policy updates. Finally, experiments in a typical gym environment and in an air combat maneuver decision-making confrontation environment verify that the proposed algorithms can efficiently use the cognitive behavior model to accelerate policy learning, and that they effectively alleviate the impact of a huge state space and sparse environment rewards.
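
The abstract does not specify how the cognitive behavior model guides policy updates, so the following is only a minimal sketch of the general idea, not the COG-DQN or COG-PPO algorithm. It assumes a hypothetical rule-based `BDIAdvisor` whose suggested action is followed with a probability that decays linearly over training, and it swaps in tabular Q-learning on a toy sparse-reward chain in place of deep Q-learning. All names (`BDIAdvisor`, `guided_action`, `train`), the chain environment, and the decay schedule are illustrative assumptions.

```python
import random


class BDIAdvisor:
    """Hypothetical stand-in for a BDI cognitive behavior model (an assumption)."""

    def __init__(self, goal):
        self.desire = goal   # desire: the state the agent wants to reach
        self.belief = None   # belief: the most recently observed state

    def update_belief(self, state):
        self.belief = state

    def intention(self):
        # Intention: move right (1) if the goal lies ahead, otherwise left (0).
        return 1 if self.belief < self.desire else 0


def guided_action(q_row, advisor, guide_prob, epsilon, rng):
    """Follow the advisor with probability guide_prob, else act epsilon-greedily."""
    if rng.random() < guide_prob:
        return advisor.intention()
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=q_row.__getitem__)


def train(n_states=10, episodes=200, seed=0):
    """Tabular Q-learning on a chain whose only reward is at the last state."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]
    advisor = BDIAdvisor(goal=n_states - 1)
    alpha, gamma, epsilon = 0.5, 0.95, 0.1
    for ep in range(episodes):
        # Assumed schedule: guidance fades to zero over the first half of training.
        guide_prob = max(0.0, 1.0 - ep / (episodes / 2))
        state = 0
        for _ in range(4 * n_states):  # step limit per episode
            advisor.update_belief(state)
            action = guided_action(q[state], advisor, guide_prob, epsilon, rng)
            if action == 1:
                next_state = min(n_states - 1, state + 1)
            else:
                next_state = max(0, state - 1)
            done = next_state == n_states - 1
            reward = 1.0 if done else 0.0  # sparse reward: only at the goal
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q
```

In this sketch the advisor only influences which actions are tried. The Q-update itself is unchanged, so the learned values still reflect the environment reward once guidance has faded.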

History

Received: 2022-02-22
Last revised: 2023-01-16
Accepted: 2022-06-24
Published online: 2022-07-10
