Coordination control of multi-agent networks is a frontier topic in the control field; it has attracted wide attention from researchers and has been applied successfully in many engineering areas, such as aggregation of self-assembling robots, UAV fire rescue, satellite attitude adjustment, and smart-grid power dispatch. As a typical coordination problem, containment control has promising military and civilian applications, including hazardous-material transport and fire rescue. In a containment control system there are multiple leaders, and the followers' motion is confined to the minimal geometric space spanned by the leaders. To date, many excellent results on containment control of multi-agent networks have been reported [1-4].
However, most existing results require known system dynamics and yield non-optimal control. In practice, rescue and transport robots must move people or materials to safety in the shortest possible time and with minimal energy consumption; they must therefore adapt to unpredictable, continuously changing environments and learn to take optimal actions. Game theory provides a highly suitable tool for solving dynamic optimization problems in multi-agent networks: it offers a setting for multi-player decision and control over dynamically interacting networks, so the strategic interactions among agents can be modeled as a multi-player simultaneous-move game [5]. For linear discrete-time networks, [6] solved the data-driven consensus problem based on game-theoretic ideas. For nonlinear multi-agent networks, [7-8] formulated leader-follower nonlinear differential graphical games, estimated the optimal control policies via a critic-actor framework and gradient descent, and designed dynamics-dependent algorithms for distributed tracking control.
In practical applications, network agents are often subject to external perturbations, such as measurement noise, adversarial attacks on individual agents, and dynamic uncertainties caused by a changing environment. To guarantee that the agents complete their tasks, or remain resilient and recoverable after an attack, researchers have mainly adopted the zero-sum game framework to study distributed robust control of multi-agent networks. A zero-sum game is purely competitive: whatever one player wins, the other loses. In control systems, the zero-sum game is closely related to disturbance attenuation, with the controller acting as the minimizing player and the disturbance as the maximizing player, which leads to H∞-type designs for single systems (see, e.g., [9-10]).
The aforementioned results are limited to single systems. Based on zero-sum game theory and gradient descent, [11-12] derived approximately optimal control policies to solve, respectively, the synchronization problem of multiple wheeled robots and the disturbance rejection problem of linear multi-agent networks. For nonlinear multi-agent networks, [13] combined zero-sum game theory with adaptive dynamic programming, constructing critic neural networks to approximate the coordination cost functions online and achieve network tracking control. However, the optimal policies in all of these works depend on the system dynamics, which are difficult to obtain precisely in complex environments. Inspired by [10, 12], this paper therefore adopts zero-sum game theory and integral reinforcement learning (IRL): a zero-sum graphical game is formulated in terms of the containment error, and a model-free policy iteration algorithm is developed to learn robust containment control policies without knowledge of the system dynamics.
Consider a multi-agent network composed of $n$ agents, where the first $m$ agents are followers (the set $F$) and the remaining $n-m$ agents are leaders (the set $L$). The dynamics of the $i$-th follower are described by
| $ \begin{align} \dot {x}_i = f(x_{i})+g(x_{i})u_{i}+k(x_{i})\omega_{i}, \; i\in F. \end{align} $ | (1) |
where $x_{i}$ is the state of follower $i$, $u_{i}$ is its control input, $\omega_{i}$ is the external disturbance, and $f(\cdot)$, $g(\cdot)$, $k(\cdot)$ denote the drift, input, and disturbance dynamics, respectively.
The dynamics of the leaders are described by
| $ \begin{align} \dot {x}_i = h_{i}(x_{i}), \; i\in L. \end{align} $ | (2) |
where $x_{i}$ is the state of the $i$-th leader and $h_{i}(\cdot)$ is its intrinsic (possibly heterogeneous) dynamics.
Assume that the leaders do not communicate with each other and that communication between leaders and followers is unidirectional, i.e., leaders only send information. The follower-follower topology together with the leader-follower topology then determines the communication of the whole network, and the Laplacian matrix $\mathcal {L}$ can be partitioned as
| $ \begin{align*} \mathcal {L} = \begin{bmatrix} \mathcal {T} & {\mathcal {T}}_{d} \\ {{0}}_{(n-m)\times m} & {{0}}_{(n-m)\times(n-m)} \end{bmatrix}. \end{align*} $ |
where $\mathcal{T}\in R^{m\times m}$ characterizes the coupling among the followers and $\mathcal{T}_{d}\in R^{m\times(n-m)}$ characterizes the influence of the leaders on the followers.
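To make the partition concrete, the following minimal sketch (Python; the function and variable names are illustrative, not from the paper) builds the Laplacian of a leader-follower digraph with a 0/1 adjacency matrix and extracts the blocks $\mathcal{T}$ and $\mathcal{T}_{d}$:

```python
import numpy as np

def leader_follower_laplacian(A, m):
    """Laplacian of a digraph whose first m agents are followers.

    A[i, j] = 1 if agent i receives information from agent j.
    Leaders receive nothing, so their Laplacian rows are zero and
    L has the block form [[T, T_d], [0, 0]] used in the text.
    """
    A = A.copy()
    A[m:, :] = 0.0                    # leaders have no incoming edges
    D = np.diag(A.sum(axis=1))        # in-degree matrix
    L = D - A
    return L, L[:m, :m], L[:m, m:]    # L, T, T_d

# 3 followers (0, 1, 2) and 2 leaders (3, 4); topology chosen arbitrarily
A = np.zeros((5, 5))
A[0, 3] = A[1, 0] = A[1, 4] = A[2, 1] = 1.0
L, T, T_d = leader_follower_laplacian(A, m=3)
```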
Assumption 1  For each follower, there exists at least one leader that communicates with it through a directed path.
1.3 Network error

Define the network error as
| $ \begin{align} e_{i} = \sum\limits_{j = 1}^{n}a_{ij}(x_{i}-x_{j}), \; i\in F, \end{align} $ | (3) |
Then the network error dynamics are
| $ \begin{align} &\dot{e_{i}} = \sum\limits_{j = 1}^{n}a_{ij}(\dot{x_{i}}-\dot{x_{j}}) = \\ &\varPhi_{i}+d_{i}g(x_{i})u_{i}-\sum\limits_{j\in F}a_{ij}g(x_{j})u_{j}+ \\ &d_{i}k(x_{i})\omega_{i}-\sum\limits_{j\in F}a_{ij}k(x_{j})\omega_{j}. \end{align} $ | (4) |
where $d_{i} = \sum_{j = 1}^{n}a_{ij}$ and $\varPhi_{i} = d_{i}f(x_{i})-\sum_{j\in F}a_{ij}f(x_{j})-\sum_{j\in L}a_{ij}h_{j}(x_{j})$ collects the drift terms of the error dynamics.
Stacking (4) over all followers and using the topology blocks above yields the error dynamics of the whole network.
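As a numerical illustration of (3) (a sketch with assumed names, not the paper's code), the follower errors can be evaluated directly from the adjacency matrix; stacking them reproduces $\mathcal{T}X_{F}+\mathcal{T}_{d}X_{L}$:

```python
import numpy as np

def network_errors(A, X, m):
    """Neighborhood errors e_i = sum_j a_ij (x_i - x_j) of eq. (3).

    A : (n, n) adjacency matrix; X : (n, d) stacked agent states;
    m : number of followers. Returns an (m, d) array of errors.
    """
    n, d = X.shape
    E = np.zeros((m, d))
    for i in range(m):
        for j in range(n):
            E[i] += A[i, j] * (X[i] - X[j])
    return E   # equals (L @ X)[:m] for the Laplacian built above
```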
Definition 1  For a set $X$ in a real vector space, its convex hull ${\rm Co}(X)$ is defined as
| $ \begin{align*} {\rm Co}(X) = \, &\Big\{\sum\limits_{i = 1}^k {\alpha _i x_i } \vert x_i \in X, \alpha _i \in R, \; \alpha _i \geqslant 0, \\[3pt] &\sum\limits_{i = 1}^k {\alpha _i } = 1, \; k = 1, 2, \ldots \Big\}. \end{align*} $ |
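Definition 1 can be checked numerically: $p\in{\rm Co}(X)$ iff there exist $\alpha_{i}\geqslant 0$ with $\sum\alpha_{i} = 1$ and $\sum\alpha_{i}x_{i} = p$, a linear feasibility problem. A minimal sketch using scipy (function name illustrative):

```python
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(points, p):
    """Feasibility LP for Definition 1: alpha >= 0, sum(alpha) = 1,
    points.T @ alpha = p.  points : (k, d) array of x_i; p : (d,)."""
    k = points.shape[0]
    A_eq = np.vstack([points.T, np.ones((1, k))])
    b_eq = np.concatenate([p, [1.0]])
    res = linprog(c=np.zeros(k), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * k, method="highs")
    return res.status == 0   # 0 means a feasible alpha was found

leaders = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 3.0]])
print(in_convex_hull(leaders, np.array([2.0, 1.0])))   # True: inside
print(in_convex_hull(leaders, np.array([5.0, 5.0])))   # False: outside
```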
Definition 2  Consider the multi-agent network composed of dynamics (1) and (2). For a given level $\gamma>0$, the network is said to achieve robust containment with disturbance attenuation if, for every follower,
| $\begin{array}{l} \int_{{t_0}}^\infty {\left( {e_i^{\rm{T}}{Q_i}{e_i} + u_i^{\rm{T}}{R_i}{u_i}} \right)} {\rm{d}}t \le \\ {\gamma ^2}\int_{{t_0}}^\infty {\omega _i^{\rm{T}}{P_i}{\omega _i}{\rm{d}}t + {V_i}({e_i}({t_0})).} \end{array}$ | (5) |
where $Q_{i}$, $R_{i}$, and $P_{i}$ are symmetric positive-definite weighting matrices and $V_{i}(e_{i}(t_{0}))\geqslant 0$ depends only on the initial network error.
For each follower, define the performance index
| $ \begin{align} & J_{i}(e_{i}(t_{0}), u_{i}, u_{-i}, \omega_{i}, \omega_{-i}) = \\[3pt] &\int_{{t_0}}^\infty (e_{i}^{\rm T}Q_{i}e_{i}+u_{i}^{\rm T}R_{i}u_{i}-\gamma^{2}\omega_{i}^{\rm T}P_{i}\omega_{i}){\rm d}t, \; i \in F. \end{align} $ | (6) |
where $u_{-i}$ and $\omega_{-i}$ denote the control and disturbance policies of the other followers. The value function is defined as
| $ \begin{align} V_{i}(e_{i}(t_{0})) = \min\limits_{u_{i}}\max\limits_{\omega_{i}}J_{i}(e_{i}(t_{0}), u_{i}, u_{-i}, \omega_{i}, \omega_{-i}). \end{align} $ | (7) |
If the saddle point exists in the game-theoretic sense, i.e.,
| $ \begin{align} V_{i}^{\ast}(e_{i}(t_{0})) = \;&\min\limits_{u_{i}}\max\limits_{\omega_{i}}J_{i}(e_{i}(t_{0}), u_{i}, u_{-i}^{\ast}, \omega_{i}, \omega_{-i}^{\ast}) = \\ &\max\limits_{\omega_{i}}\min\limits_{u_{i}}J_{i}(e_{i}(t_{0}), u_{i}, u_{-i}^{\ast}, \omega_{i}, \omega_{-i}^{\ast}), \end{align} $ | (8) |
then the policy pair $(u_{i}^{\ast}, \omega_{i}^{\ast})$ is a saddle-point solution and satisfies
| $ \begin{align} & J_{i}(e_{i}(t_{0}), u_{i}^{\ast}, u_{-i}^{\ast}, \omega_{i}^{\ast}, \omega_{-i}^{\ast}) \leqslant\\ & J_{i}(e_{i}(t_{0}), u_{i}, u_{-i}^{\ast}, \omega_{i}^{\ast}, \omega_{-i}^{\ast}), \\ & J_{i}(e_{i}(t_{0}), u_{i}^{\ast}, u_{-i}^{\ast}, \omega_{i}^{\ast}, \omega_{-i}^{\ast}) \geqslant\\ & J_{i}(e_{i}(t_{0}), u_{i}^{\ast}, u_{-i}^{\ast}, \omega_{i}, \omega_{-i}^{\ast}). \end{align} $ | (9) |
Hence, the Nash equilibrium condition equivalent to (8) is
| $ \begin{align} & J_{i}(e_{i}(t_{0}), u_{i}^{\ast}, u_{-i}^{\ast}, \omega_{i}, \omega_{-i}^{\ast})\leqslant\\ & J_{i}(e_{i}(t_{0}), u_{i}^{\ast}, u_{-i}^{\ast}, \omega_{i}^{\ast}, \omega_{-i}^{\ast}) \leqslant\\ & J_{i}(e_{i}(t_{0}), u_{i}, u_{-i}^{\ast}, \omega_{i}^{\ast}, \omega_{-i}^{\ast}), \end{align} $ | (10) |
where $u_{-i}^{\ast}$ and $\omega_{-i}^{\ast}$ denote the optimal policies of the other followers.
For the $i$-th follower, given admissible policies $(u_{i}, u_{-i}, \omega_{i}, \omega_{-i})$, the associated value function is
| $ \begin{align} V_{i}(e_{i}(t_{0})) = \int_{{t_0}}^\infty (e_{i}^{\rm T}Q_{i}e_{i}+u_{i}^{\rm T}R_{i}u_{i}-\gamma^{2}\omega_{i}^{\rm T}P_{i}\omega_{i}){\rm d}t. \end{align} $ | (11) |
Differentiating (11) along the error dynamics (4) gives the following Bellman equation, whose Hamiltonian is
| $ \begin{align} & H_{i}(e_{i}, \nabla V_{i}, u_{i}, u_{-i}, \omega_{i}, \omega_{-i})\equiv \\ &(\nabla V_{i})^{\rm T}\Big(\varPhi_{i}+d_{i}g(x_{i})u_{i}-\sum\limits_{j\in F}a_{ij}g(x_{j})u_{j}+\\ & d_{i}k(x_{i})\omega_{i}-\sum\limits_{j\in F}a_{ij}k(x_{j})\omega_{j}\Big)+ \\ & e_{i}^{\rm T}Q_{i}e_{i}+u_{i}^{\rm T}R_{i}u_{i}-\gamma^{2}\omega_{i}^{\rm T}P_{i}\omega_{i}, \end{align} $ | (12) |
where the running cost is $r(e_{i}, u_{i}, \omega_{i}) = e_{i}^{\rm T}Q_{i}e_{i}+u_{i}^{\rm T}R_{i}u_{i}-\gamma^{2}\omega_{i}^{\rm T}P_{i}\omega_{i}$. Setting $H_{i} = 0$ and applying the stationarity conditions $\partial H_{i}/\partial u_{i} = 0$ and $\partial H_{i}/\partial \omega_{i} = 0$ yields the optimal policies
| $ \begin{align} & u_{i}^{\ast}(t) = -\frac{1}{2}d_{i}R_{i}^{-1}g^{\rm T}(x_{i})\nabla V_{i}^{\ast}, \end{align} $ | (13) |
| $ \begin{align} & \omega_{i}^{\ast}(t) = \frac{1}{2\gamma^{2}}d_{i}P_{i}^{-1}k^{\rm T}(x_{i})\nabla V_{i}^{\ast}, \end{align} $ | (14) |
Substituting (13) and (14) into the Bellman equation gives the coupled Hamilton-Jacobi-Isaacs (HJI) equations
| $ \begin{align} (\nabla V_{i}^{\ast})^{\rm T}\varPi_{i}+e_{i}^{\rm T}Q_{i}e_{i}+\varXi_{i} = 0. \end{align} $ | (15) |
where
| $ \begin{align} \varPi_{i} = \, &\varPhi_{i}-\dfrac{d_{i}^{2}}{2}g(x_{i})R_{i}^{-1}g^{\rm T}(x_{i})\nabla V_{i}^{\ast}+ \\[3pt] &\dfrac{d_{i}^{2}}{2\gamma^{2}}k(x_{i})P_{i}^{-1}k^{\rm T}(x_{i})\nabla V_{i}^{\ast}+ \\[3pt] &\dfrac{1}{2}\sum\limits_{j\in F}a_{ij}d_{j}g(x_{j})R_{j}^{-1}g^{\rm T}(x_{j})\nabla V_{j}^{\ast}- \\[3pt] &\dfrac{1}{2\gamma^{2}}\sum\limits_{j\in F}a_{ij}d_{j}k(x_{j})P_{j}^{-1}k^{\rm T}(x_{j})\nabla V_{j}^{\ast}, \end{align} $ | (16) |
| $ \begin{align} \varXi_{i} = \, &\dfrac{d_{i}^{2}}{4}(\nabla V_{i}^{\ast})^{\rm T}g(x_{i})R_{i}^{-1}g^{\rm T}(x_{i})\nabla V_{i}^{\ast}- \\[3pt] &\dfrac{d_{i}^{2}}{4\gamma^{2}}(\nabla V_{i}^{\ast})^{\rm T}k(x_{i})P_{i}^{-1}k^{\rm T}(x_{i})\nabla V_{i}^{\ast}. \end{align} $ | (17) |
Thus, solving this zero-sum graphical game amounts to solving the coupled HJI equations (15) for $V_{i}^{\ast}$, from which the optimal policies (13)-(14) follow.
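Once $\nabla V_{i}^{\ast}$ is available, evaluating (13)-(14) is a direct matrix computation. A small sketch (shapes and names are illustrative; the gradient itself must come from solving (15), e.g., by the algorithms of Section 2.3):

```python
import numpy as np

def optimal_policies(grad_V, g_xi, k_xi, d_i, R_i, P_i, gamma):
    """Evaluate eqs. (13)-(14) for follower i.

    grad_V : (N,) gradient of V_i* at the current error e_i
    g_xi   : (N, q) input matrix g(x_i); k_xi : (N, l) matrix k(x_i)
    """
    u_star = -0.5 * d_i * np.linalg.solve(R_i, g_xi.T @ grad_V)              # eq. (13)
    w_star = d_i / (2.0 * gamma**2) * np.linalg.solve(P_i, k_xi.T @ grad_V)  # eq. (14)
    return u_star, w_star
```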
For a given disturbance attenuation level $\gamma$, the stability and optimality of these policies are established below.
Theorem 1  Let $V_{i}^{\ast}>0$ be a smooth solution of the coupled HJI equations (15), and let $u_{i}^{\ast}$ and $\omega_{i}^{\ast}$ be given by (13) and (14). Then:
1) when the followers apply the optimal control policies $u_{i}^{\ast}$ and the disturbances vanish ($\omega_{i} = 0$), the network error dynamics are asymptotically stable;
2) when all followers adopt their respective optimal control policies $u_{i}^{\ast}$, the disturbance attenuation condition (5) of Definition 2 holds.
Proof  Since $V_{i}^{\ast}$ satisfies (15), along the network error trajectories under the optimal control we have
| $ \begin{align*} &V_{i}^{\ast}(e_{i}(t+\Delta t))-V_{i}^{\ast}(e_{i}(t)) = \\[4pt] &-\int_{t}^{t+\Delta t} (e_{i}^{\rm T}Q_{i}e_{i}+(u_{i}^{\ast})^{\rm T}R_{i}u_{i}^{\ast}-\gamma^{2}\omega_{i}^{\rm T}P_{i}\omega_{i}){\rm d}\tau. \end{align*} $ |
Dividing by $\Delta t$ and letting $\Delta t \to 0$ gives
| $ \begin{align*} \dfrac{{\rm d} V_{i}^{\ast}(e_{i})}{{\rm d}t} = -(e_{i}^{\rm T}Q_{i}e_{i}+(u_{i}^{\ast})^{\rm T}R_{i}u_{i}^{\ast}-\gamma^{2}\omega_{i}^{\rm T}P_{i}\omega_{i}). \end{align*} $ |
1) When $\omega_{i} = 0$, for $e_{i}\neq 0$ we have
| $ \begin{align*} \dfrac{{\rm d} V_{i}^{\ast}(e_{i})}{{\rm d}t} = -(e_{i}^{\rm T}Q_{i}e_{i}+(u_{i}^{\ast})^{\rm T}R_{i}u_{i}^{\ast})<0. \end{align*} $ |
Hence the network error dynamics are asymptotically stable, and by Lemma 3.1 in [14] the followers asymptotically converge to the desired states.
2) When all followers adopt their respective optimal control policies $u_{i}^{\ast}$ and the disturbances are arbitrary, integrating $\dot V_{i}^{\ast}$ from $t_{0}$ to $\infty$ gives
| $ \begin{align*} & V_{i}^{\ast}(e_{i}(\infty))-V_{i}^{\ast}(e_{i}(t_{0})) = \\[4pt] &-\int_{t_{0}}^{\infty}(e_{i}^{\rm T}Q_{i}e_{i}+(u_{i}^{\ast})^{\rm T}R_{i}u_{i}^{\ast}-\gamma^{2}\omega_{i}^{\rm T}P_{i}\omega_{i}){\rm d}\tau, \end{align*} $ |
and rearranging terms yields
| $ \begin{align*} & V_{i}^{\ast}(e_{i}(\infty))+\int_{t_{0}}^{\infty}(e_{i}^{\rm T}Q_{i}e_{i}+(u_{i}^{\ast})^{\rm T}R_{i}u_{i}^{\ast}){\rm d}\tau = \\[5pt] &\int_{t_{0}}^{\infty}\gamma^{2}\omega_{i}^{\rm T}P_{i}\omega_{i}{\rm d}\tau+V_{i}^{\ast}(e_{i}(t_{0})), \end{align*} $ |
Since $V_{i}^{\ast}(e_{i}(\infty))\geqslant 0$, it follows that
| $ \begin{align*} &\int_{t_{0}}^{\infty}(e_{i}^{\rm T}Q_{i}e_{i}+(u_{i}^{\ast})^{\rm T}R_{i}u_{i}^{\ast}){\rm d}\tau\leqslant \\[5pt] &\gamma^{2}\int_{t_{0}}^{\infty}\omega_{i}^{\rm T}P_{i}\omega_{i}{\rm d}\tau+V_{i}^{\ast}(e_{i}(t_{0})). \end{align*} $ |
Therefore the attenuation condition (5) in Definition 2 is satisfied. This completes the proof.
Corollary 1  If the conditions of Theorem 1 and Assumption 1 hold, then the containment error is bounded, i.e., the multi-agent network achieves robust containment control.
Remark 1  Under Assumption 1, $\mathcal{T}$ is nonsingular and the stacked containment error equals $\mathcal{T}^{-1}$ applied to the stacked network error. Hence, by Theorem 1, boundedness of the network errors $e_{i}$ implies boundedness of the containment error, and the followers converge to a neighborhood of the convex hull spanned by the leaders.
Theorem 2  Let $V_{i}^{\ast}>0$ be a smooth solution of the coupled HJI equations (15), with $u_{i}^{\ast}$ and $\omega_{i}^{\ast}$ given by (13) and (14). Then $(u_{i}^{\ast}, \omega_{i}^{\ast})$ constitutes a Nash (saddle-point) equilibrium of the multi-player zero-sum graphical game, and the game value is $V_{i}^{\ast}(e_{i}(t_{0}))$.
Proof  For admissible policies under which $V_{i}^{\ast}(e_{i}(\infty)) = 0$, the performance index (6) can be rewritten as
| $ \begin{align*} & J_{i}(e_{i}(t_{0}), u_{i}, u_{-i}, \omega_{i}, \omega_{-i}) = \\[3pt] & V_{i}^{\ast}(e_{i}(\infty))\!+\!\int_{t_{0}}^\infty(e_{i}^{\rm T}Q_{i}e_{i}\!+\!u_{i}^{\rm T}R_{i}u_{i}\!-\!\gamma^{2}\omega_{i}^{\rm T}P_{i}\omega_{i}){\rm d}t\! = \\[4pt] & V_{i}^{\ast}(e_{i}(t_{0}))\!+\!\int_{t_{0}}^\infty(e_{i}^{\rm T}Q_{i}e_{i}\!+\!u_{i}^{\rm T}R_{i}u_{i}\!-\!\gamma^{2}\omega_{i}^{\rm T}P_{i}\omega_{i}){\rm d}t\!- \\[4pt] &\int_{t_{0}}^\infty(e_{i}^{\rm T}Q_{i}e_{i}+(u_{i}^{\ast})^{\rm T}R_{i}u_{i}^{\ast}-\gamma^{2}(\omega_{i}^{\ast})^{\rm T}P_{i}\omega_{i}^{\ast}){\rm d}t. \end{align*} $ |
Completing the squares with (13) and (14), the difference of the last two integrals becomes
| $ \begin{align*} &\int_{t_{0}}^\infty(e_{i}^{\rm T}Q_{i}e_{i}+u_{i}^{\rm T}R_{i}u_{i}-\gamma^{2}\omega_{i}^{\rm T}P_{i}\omega_{i}){\rm d}t- \\[4pt] &\int_{t_{0}}^\infty(e_{i}^{\rm T}Q_{i}e_{i}+(u_{i}^{\ast})^{\rm T}R_{i}u_{i}^{\ast}-\gamma^{2}(\omega_{i}^{\ast})^{\rm T}P_{i}\omega_{i}^{\ast}){\rm d}t = \\[4pt] &\int_{t_{0}}^\infty\Big((u_{i}-u_{i}^{\ast})^{\rm T}R_{i}(u_{i}-u_{i}^{\ast})- \\[4pt] &\gamma^{2}(\omega_{i}-\omega_{i}^{\ast})^{\rm T}P_{i}(\omega_{i}-\omega_{i}^{\ast})+ \\[4pt] &(\nabla V_{i}^{\ast})^{\rm T}\sum\limits_{j\in F}a_{ij}g(x_{j})(u_{j}^{\ast}-u_{j})+ \\[4pt] &(\nabla V_{i}^{\ast})^{\rm T}\sum\limits_{j\in F}a_{ij}k(x_{j})(\omega_{j}^{\ast}-\omega_{j})\Big){\rm d}t. \end{align*} $ |
When the other followers adopt their optimal policies, i.e., $u_{-i} = u_{-i}^{\ast}$ and $\omega_{-i} = \omega_{-i}^{\ast}$, the neighbor-deviation terms vanish, so
| $ \begin{align*} & J_{i}(e_{i}(t_{0}), u_{i}, u_{-i}^{\ast}, \omega_{i}, \omega_{-i}^{\ast}) = \\[3pt] & V_{i}^{\ast}(e_{i}(t_{0}))+\int_{t_{0}}^\infty(u_{i}-u_{i}^{\ast})^{\rm T}R_{i}(u_{i}-u_{i}^{\ast}){\rm d}\tau-\\[4pt] &\gamma^{2}\int_{t_{0}}^\infty(\omega_{i}-\omega_{i}^{\ast})^{\rm T}P_{i}(\omega_{i}-\omega_{i}^{\ast}){\rm d}\tau, \end{align*} $ |
which shows that the Nash equilibrium condition (10) is satisfied:
| $ \begin{align*} & J_{i}(e_{i}(t_{0}), u_{i}^{\ast}, u_{-i}^{\ast}, \omega_{i}, \omega_{-i}^{\ast})\leqslant \\ & J_{i}(e_{i}(t_{0}), u_{i}^{\ast}, u_{-i}^{\ast}, \omega_{i}^{\ast}, \omega_{-i}^{\ast})\leqslant\\ & J_{i}(e_{i}(t_{0}), u_{i}, u_{-i}^{\ast}, \omega_{i}^{\ast}, \omega_{-i}^{\ast}), \end{align*} $ |
with the game value
| $ \begin{align*} \; \; \; \; \; \; \; \; J_{i}(e_{i}(t_{0}), u_{i}^{\ast}, u_{-i}^{\ast}, \omega_{i}^{\ast}, \omega_{-i}^{\ast}) = V_{i}^{\ast}(e_{i}(t_{0})).\end{align*} $ |
Remark 2  Theorem 1 shows, from the stability viewpoint, that when all followers obtain their optimal control policies, the containment error is bounded and robust containment control is achieved. Theorem 2 shows, from the game-theoretic viewpoint, that the optimal control and disturbance policies jointly satisfy the Nash equilibrium condition. Theorem 2 thus characterizes the optimality behind Theorem 1: when the whole network reaches the Nash equilibrium, the multi-agent network achieves robust containment with high accuracy and minimal energy consumption under the worst-case disturbance.
2.3 Policy iteration algorithm for solving the HJI equations

As shown above, achieving robust containment control requires solving the coupled HJI equations (15) for $V_{i}^{\ast}$. These are nonlinear partial differential equations that generally admit no analytical solution, so a policy iteration algorithm is employed to solve them approximately.
Algorithm 1  Model-based policy iteration.
Choose an initial value function $V_{i}^{(0)}\geqslant 0$ such that the corresponding initial control policy
| $ \begin{align*} u_{i}^{(0)} = - \frac{1}{2}d_{i}R_{i}^{-1}g^{\rm T}(x_{i}) \nabla V_{i}^{(0)}, \end{align*} $ |
and the initial disturbance policy
| $ \begin{align*} \omega_{i}^{(0)} = \frac{1}{2\gamma^{2}}d_{i}P_{i}^{-1}k^{\rm T}(x_{i})\nabla V_{i}^{(0)} \end{align*} $ |
are admissible. Let $k = 0$ and iterate the following steps.
step 1: solve the value function $V_{i}^{(k+1)}$ from
| $ \begin{align} &(\nabla V_{i}^{(k+1)})^{\rm T}\Big(\varPhi_{i}+d_{i}g(x_{i})u_{i}^{(k)}-\sum\limits_{j\in F}a_{ij}g(x_{j})u_{j}^{(k)}+ \\ & d_{i}k(x_{i})\omega_{i}^{(k)}-\sum\limits_{j\in F}a_{ij}k(x_{j})\omega_{j}^{(k)}\Big)+r(e_{i}, u_{i}^{(k)}, \omega_{i}^{(k)}) = 0, \end{align} $ | (18) |
where $r(e_{i}, u_{i}^{(k)}, \omega_{i}^{(k)}) = e_{i}^{\rm T}Q_{i}e_{i}+(u_{i}^{(k)})^{\rm T}R_{i}u_{i}^{(k)}-\gamma^{2}(\omega_{i}^{(k)})^{\rm T}P_{i}\omega_{i}^{(k)}$.
step 2: update the control and disturbance policies by
| $ \begin{align*} & u_{i}^{(k+1)} = - \frac{1}{2}d_{i}R_{i}^{-1}g^{\rm T}(x_{i})\nabla V_{i}^{(k+1)}, \\[2pt] &\omega_{i}^{(k+1)} = \frac{d_{i}}{2\gamma^{2}}P_{i}^{-1}k^{\rm T}(x_{i})\nabla V_{i}^{(k+1)}. \end{align*} $ |
step 3: set $k = k+1$ and return to step 1, until the value function converges.
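For intuition, the sketch below runs Algorithm 1 on the simplest case it covers: a single linear follower with $d_{i} = 1$ and no neighbors, where $V_{i}(e) = e^{\rm T}Se$ and the evaluation step (18) reduces to a Lyapunov equation. All matrices are illustrative assumptions; this is a sketch of the iteration, not the paper's nonlinear implementation:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

A = np.array([[0.0, 1.0], [-1.0, -2.0]])   # drift; open-loop stable, so zero gains are admissible
B = np.array([[0.0], [1.0]])               # plays the role of g
K = np.array([[0.0], [0.5]])               # plays the role of k
Q, R, P, gamma = np.eye(2), np.eye(1), np.eye(1), 5.0

Fu = np.zeros((1, 2))                      # control gain:      u = -Fu @ e
Fw = np.zeros((1, 2))                      # disturbance gain:  w = +Fw @ e
for _ in range(50):
    A_cl = A - B @ Fu + K @ Fw
    # step 1 (policy evaluation): the linear counterpart of eq. (18)
    S = solve_continuous_lyapunov(
        A_cl.T, -(Q + Fu.T @ R @ Fu - gamma**2 * Fw.T @ P @ Fw))
    # step 2 (policy improvement): eqs. (13)-(14) with grad V = 2 S e
    Fu_new = np.linalg.solve(R, B.T @ S)
    Fw_new = np.linalg.solve(P, K.T @ S) / gamma**2
    done = max(abs(Fu_new - Fu).max(), abs(Fw_new - Fw).max()) < 1e-10
    Fu, Fw = Fu_new, Fw_new
    if done:
        break
print("value matrix S:\n", S)
```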
The convergence of Algorithm 1 is guaranteed by the following theorem.
Theorem 3  In Algorithm 1, the iterative value-function sequence $\{V_{i}^{(k)}\}$ converges to the solution $V_{i}^{\ast}$ of the coupled HJI equations (15), and accordingly $u_{i}^{(k)}\to u_{i}^{\ast}$ and $\omega_{i}^{(k)}\to \omega_{i}^{\ast}$ as $k\to\infty$.
Remark 3  The convergence of Algorithm 1 can be proved via Newton's method and the relationship between the Gâteaux and Fréchet derivatives; see Theorem 1 in [15] for details.
Algorithm 1 clearly depends on the system dynamics, which are difficult to obtain in complex environments. A model-free policy iteration algorithm is therefore developed below.
Algorithm 2  Model-free policy iteration.
Inspired by the balance between exploring unknown information and exploiting available information in reinforcement learning, the network error dynamics (4) can be rewritten as
| $ \begin{align} &\dot{e_{i}} = \\ &\varPhi_{i}+d_{i}g(x_{i})u_{i}^{(k)}-\sum\limits_{j\in F}a_{ij}g(x_{j})u_{j}^{(k)}+d_{i}k(x_{i})\omega_{i}^{(k)} - \\ &\sum\limits_{j\in F}a_{ij}k(x_{j})\omega_{j}^{(k)}+d_{i}g(x_{i})(u_{i}-u_{i}^{(k)}) - \\[3pt] &\sum\limits_{j\in F}a_{ij}g(x_{j})(u_{j}-u_{j}^{(k)})+d_{i}k(x_{i})(\omega_{i}-\omega_{i}^{(k)})- \\[3pt] &\sum\limits_{j\in F}a_{ij}k(x_{j})(\omega_{j}-\omega_{j}^{(k)}). \end{align} $ | (19) |
where $u_{i}$, $\omega_{i}$ are the inputs actually applied to the system (behavior policies) and $u_{i}^{(k)}$, $\omega_{i}^{(k)}$ are the policies being evaluated. Writing $n_{ui} = u_{i}-u_{i}^{(k)}$ and $n_{\omega i} = \omega_{i}-\omega_{i}^{(k)}$ for the exploration differences, the derivative of $V_{i}^{(k+1)}$ along (19) is
| $ \begin{align} &\dfrac{{\rm d} V_{i}^{(k+1)}}{{\rm d}t} = \\[4pt] &(\nabla V_{i}^{(k+1)})^{\rm T}\Big[\varPhi_{i}+d_{i}g(x_{i})u_{i}^{(k)}-\sum\limits_{j\in F}a_{ij}g(x_{j})u_{j}^{(k)} + \\[3pt] & d_{i}k(x_{i})\omega_{i}^{(k)}-\sum\limits_{j\in F}a_{ij}k(x_{j})\omega_{j}^{(k)}+d_{i}g(x_{i})n_{ui} - \\[3pt] &\sum\limits_{j\in F}a_{ij}g(x_{j})n_{uj}+d_{i}k(x_{i})n_{\omega i}- \\[3pt] &\sum\limits_{j\in F}a_{ij}k(x_{j})n_{\omega j}\Big]. \end{align} $ | (20) |
Applying (18) and the policy-update expressions in step 2 gives
| $ \begin{align} &\dfrac{{\rm d} V_{i}^{(k+1)}}{{\rm d}t} = \\[3pt] &-r(e_{i}, u_{i}^{(k)}, \omega_{i}^{(k)})-2(u_{i}^{(k+1)})^{\rm T}R_{i}n_{ui} + \\[3pt] &\dfrac{2}{d_{i}}(u_{i}^{(k+1)})^{\rm T}R_{i}\sum\limits_{j\in F}a_{ij}n_{uj}+2\gamma^{2}(\omega^{(k+1)}_{i})^{\rm T}P_{i}n_{\omega i}- \\[3pt] &\dfrac{2\gamma^{2}}{d_{i}}(\omega^{(k+1)}_{i})^{\rm T}P_{i}\sum\limits_{j\in F}a_{ij}n_{\omega j}. \end{align} $ | (21) |
Integrating both sides of (21) over the interval $[t, t+T]$ yields the model-free IRL Bellman equation
| $ \begin{align} & V_{i}^{(k+1)}(e_{i}(t+T)) = \\[4pt] & V_{i}^{(k+1)}(e_{i}(t))-\int_{t}^{t+T}r(e_{i}, u_{i}^{(k)}, \omega_{i}^{(k)}){\rm d}\tau- \\[5pt] &2\int_{t}^{t+T}(u_{i}^{(k+1)})^{\rm T}R_{i}\Big(n_{ui}-\dfrac{1}{d_{i}}\sum\limits_{j\in F}a_{ij}n_{uj}\Big){\rm d}\tau+ \\[5pt] &2\gamma^{2}\int_{t}^{t+T}(\omega^{(k+1)}_{i})^{\rm T}P_{i}\Big(n_{\omega i}-\frac{1}{d_{i}}\sum\limits_{j\in F}a_{ij}n_{\omega j}\Big){\rm d}\tau, \end{align} $ | (22) |
where $T>0$ is the integral reinforcement interval. The flow of the model-free IRL iteration is shown in Fig. 1.
Fig. 1  Flowchart of the model-free IRL iteration algorithm
Theorem 4  With Algorithm 2, the iterative sequences converge to the saddle-point solution as $k\to\infty$, i.e., $V_{i}^{(k)}\to V_{i}^{\ast}$, $u_{i}^{(k)}\to u_{i}^{\ast}$, and $\omega_{i}^{(k)}\to\omega_{i}^{\ast}$.
Remark 4  From the derivation of Algorithm 2, it can be shown that Algorithms 1 and 2 are equivalent. Hence, by the convergence of Algorithm 1, the iterates of Algorithm 2 also converge to the solution of the coupled HJI equations (15) as $k\to\infty$.
Remark 5  In fact, the state, control input, and disturbance input data used in Algorithm 2 already contain the information of the unknown dynamics, so Algorithms 1 and 2 achieve equivalent control performance. Algorithm 2 can therefore accomplish robust containment control even when the system dynamics are unknown, which relaxes the requirement of known dynamics and avoids identifying the system dynamics.
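In practice, the behavior inputs in Algorithm 2 are obtained by adding probing signals to the current target policies, which produces the exploration differences $n_{ui}$, $n_{\omega i}$ appearing in (19)-(22). A minimal sketch (the sinusoidal probing form and its amplitude are illustrative assumptions, chosen for persistent excitation):

```python
import numpy as np

def behavior_input(u_target, t, amp=0.1):
    """Apply u_i = u_i^{(k)} + n_ui, cf. eq. (19).

    Returns both the applied input and the probing signal n_ui,
    which must be recorded to assemble the integrals in eq. (22).
    """
    n_u = amp * (np.sin(3.7 * t) + np.cos(1.3 * t)) * np.ones_like(u_target)
    return u_target + n_u, n_u
```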
2.4 Online implementation of Algorithm 2

To implement Algorithm 2 online, a critic-actor-disturbance approximation structure is introduced for the $i$-th follower: the value function, control policy, and disturbance policy are approximated as
| $ \begin{align} &\; \; \; \; \; \; \hat{V}_{i}^{(k+1)}(e_{i}) = \hat{\theta}_{i}^{\rm T}\varphi(e_{i}), \\ &\; \; \; \; \; \; \hat{u}_{i}^{(k+1)}(e_{i}) = \hat{\varpi}_{i}^{\rm T}\phi(e_{i}), \\ &\; \; \; \; \; \; \hat{\omega}_{i}^{(k+1)}(e_{i}) = \hat{\vartheta}_{i}^{\rm T}\rho(e_{i}). \end{align} $ | (23) |
where $\varphi(\cdot)$, $\phi(\cdot)$, and $\rho(\cdot)$ are suitable basis-function vectors and $\hat{\theta}_{i}$, $\hat{\varpi}_{i}$, $\hat{\vartheta}_{i}$ are the weight matrices to be learned (with $\hat{\varpi}_{i, j'}$ and $\hat{\vartheta}_{i, j'}$ denoting their $j'$-th columns). Substituting (23) into (22) yields the residual error
| $ \begin{align} \delta_{i}(t) = \, &\hat{\theta}_{i}^{\rm T}(\varphi(e_{i}(t+T))-\varphi(e_{i}(t)))+ \\ &\int_{t}^{t+T}r(e_{i}, u_{i}^{(k)}, \omega_{i}^{(k)}){\rm d}\tau+ \\[2pt] &2\sum\limits_{j' = 1}^q r_{ij'}\int_{t}^{t+T}\hat{\varpi}_{i, j'}^{\rm T}\phi(e_{i}(\tau))\delta_{u}{\rm d}\tau- \\ &2\gamma^{2}\sum\limits_{j' = 1}^l p_{ij'}\int_{t}^{t+T}\hat{\vartheta}_{i, j'}^{\rm T}\rho(e_{i}(\tau))\delta_{\omega}{\rm d}\tau. \end{align} $ | (24) |
where $\delta_{u}$ and $\delta_{\omega}$ denote the $j'$-th components of $n_{ui}-\frac{1}{d_{i}}\sum_{j\in F}a_{ij}n_{uj}$ and $n_{\omega i}-\frac{1}{d_{i}}\sum_{j\in F}a_{ij}n_{\omega j}$, respectively; $r_{ij'}$ and $p_{ij'}$ are the corresponding diagonal entries of $R_{i}$ and $P_{i}$ (assumed diagonal); and $q$, $l$ are the dimensions of $u_{i}$ and $\omega_{i}$. Rearranging (24) gives the regression form
| $ \begin{align} z_{i}(t)+\delta_{i}(t) = \hat{W_{i}}^{\rm T}y_{i}(t). \end{align} $ | (25) |
where
| $ \begin{align*} & z_{i}(t) = -\int_{t}^{t+T}r(e_{i}, u_{i}^{(k)}, \omega_{i}^{(k)}){\rm d}\tau, \notag\\[3pt] &\hat{W_{i}} = [\theta_{i}^{\rm T}, {\varpi}_{i, 1}^{\rm T}, \ldots, {\varpi}_{i, q}^{\rm T}, {\vartheta}_{i, 1}^{\rm T}, \ldots, {\vartheta}_{i, l}^{\rm T}]^{\rm T}, \notag \\ & y_{i}(t) = \\ &\begin{bmatrix} \varphi(e_{i}(t+T))-\varphi(e_{i}(t)) \\[3pt] 2r_{i1} \int_{t}^{t+T}\phi(e_{i}(\tau))\Big(n_{ui}- \frac{1}{d_{i}}\sum\limits_{j\in F}a_{ij}n_{uj}\Big)_{1}{\rm d}\tau \\ \vdots \\ 2r_{iq} \int_{t}^{t+T}\phi(e_{i}(\tau))\Big(n_{ui}- \frac{1}{d_{i}}\sum\limits_{j\in F}a_{ij}n_{uj}\Big)_{q}{\rm d}\tau \\[14pt] -2\gamma^{2}p_{i1} \int_{t}^{t+T}\rho(e_{i}(\tau))\Big(n_{\omega i}- \frac{1}{d_{i}}\sum\limits_{j\in F}a_{ij}n_{\omega j}\Big)_{1}{\rm d}\tau \\ \vdots \\ -2\gamma^{2}p_{il} \int_{t}^{t+T}\rho(e_{i}(\tau))\Big(n_{\omega i}- \frac{1}{d_{i}}\sum\limits_{j\in F}a_{ij}n_{\omega j}\Big)_{l}{\rm d}\tau \end{bmatrix}. \end{align*} $ |
To minimize the approximation error, the weights are computed by least squares: starting from time $t_{0}$, collect $N\geqslant \dim(\hat{W_{i}})$ data tuples $(y_{i}(t_{k}), z_{i}(t_{k}))$ over successive intervals of length $T$, and solve the stacked linear equations (25) for $\hat{W_{i}}$ in the least-squares sense.
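A batch least-squares sketch for (25) (names illustrative): each row of `Y` is one regressor $y_{i}(t_{k})$ assembled from the integrals above and `z` stacks the targets $z_{i}(t_{k})$; persistent excitation corresponds to `Y` having full column rank:

```python
import numpy as np

def solve_weights(Y, z):
    """Least-squares solution of z_i + delta_i = W^T y_i over N samples.

    Y : (N, nW) stacked regressors; z : (N,) stacked targets.
    Requires N >= nW and rank(Y) == nW (persistent excitation).
    """
    W, _, rank, _ = np.linalg.lstsq(Y, z, rcond=None)
    if rank < Y.shape[1]:
        raise RuntimeError("regressors are not persistently exciting")
    return W
```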
Experiment 1  Consider a multi-agent network composed of eight agents, with the directed topology shown in Fig. 2.
Fig. 2  Network topology 1
The dynamics of the $i$-th follower are
| $ \begin{align} \dot {x}_i = f(x_{i})+g(x_{i})u_{i}+k(x_{i})\omega_{i}, i\in F. \end{align} $ | (26) |
where the follower dynamics, leader dynamics, design parameters, and critic-actor-disturbance structures are specified for the simulation. The resulting agent trajectories and containment-error curves are shown in Figs. 3 and 4: the followers converge into the convex hull spanned by the leaders and the containment errors remain bounded, which verifies the effectiveness of the proposed scheme.
Fig. 3  Trajectories of the multi-agent network 1
Fig. 4  Containment error curves 1
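A sketch of the simulation loop behind trajectory and error plots such as Figs. 3 and 4 (Euler integration; every function handle is a placeholder, since the experiment's exact dynamics and learned policies are not reproduced here):

```python
import numpy as np

def simulate_follower(f, g, k, u_pol, w_pol, x0, dt=1e-3, steps=20000):
    """Euler-integrate one follower of eq. (26).

    f, g, k are used only to simulate the plant; the learned policies
    u_pol and w_pol come from Algorithm 2 and never see f, g, k.
    """
    X = np.zeros((steps + 1, x0.size))
    X[0] = x0
    for t in range(steps):
        x = X[t]
        X[t + 1] = x + dt * (f(x) + g(x) @ u_pol(x) + k(x) @ w_pol(x))
    return X
```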
Experiment 2  Consider a multi-agent network composed of four leaders and six followers, with the directed topology shown in Fig. 5. The follower dynamics, leader dynamics, parameter choices, and critic-actor-disturbance structures are the same as in Experiment 1. The agent trajectories and containment-error curves are shown in Figs. 6 and 7, which again confirm that the proposed control scheme is effective and feasible.
Fig. 5  Network topology 2
Fig. 6  Trajectories of the multi-agent network 2
Fig. 7  Containment error curves 2
To enable agents to learn optimal actions and thereby achieve fast, accurate, and optimal performance in their tasks, this paper has proposed a new robust containment control method for disturbed multi-agent networks. Based on zero-sum game theory and the integral reinforcement learning algorithm, after proving that the Nash equilibrium solution of the zero-sum game exists and that the network containment error is bounded, a model-free policy iteration algorithm was developed that learns the optimal control and worst-case disturbance policies online without knowledge of the system dynamics, and its effectiveness was verified by simulation experiments.
References

[1] Li D Y, Zhang W, He W, et al. Two-layer distributed formation-containment control of multiple Euler-Lagrange systems by output feedback[J]. IEEE Transactions on Cybernetics, 2019, 49(2): 675-687. DOI:10.1109/TCYB.2017.2786318

[2] Zhu Y R, Zheng Y S, Wang L. Containment control of switched multi-agent systems[J]. International Journal of Control, 2015, 88(12): 2570-2577. DOI:10.1080/00207179.2015.1050698

[3] Mei J, Ren W, Li B, et al. Distributed containment control for multiple unknown second-order nonlinear systems with application to networked Lagrangian systems[J]. IEEE Transactions on Neural Networks and Learning Systems, 2015, 26(9): 1885-1899. DOI:10.1109/TNNLS.2014.2359955

[4] Yu D, Ji X Y. Finite-time containment control of perturbed multi-agent systems based on sliding-mode control[J]. International Journal of Systems Science, 2018, 49(2): 299-311. DOI:10.1080/00207721.2017.1406553

[5] Tan F X, Liu D R, Guan X P, et al. Review and perspective of nonlinear systems control based on differential games[J]. Acta Automatica Sinica, 2014, 40(1): 1-15. (in Chinese)

[6] Ren H, Zhang H G, Wen Y L, et al. Integral reinforcement learning off-policy method for solving nonlinear multi-player nonzero-sum games with saturated actuator[J]. Neurocomputing, 2019, 335: 96-104.

[7] Tatari F, Naghibi-Sistani M B, Vamvoudakis K G. Distributed learning algorithm for non-linear differential graphical games[J]. Transactions of the Institute of Measurement and Control, 2017, 39(2): 173-182. DOI:10.1177/0142331215603791

[8] Mazouchi M, Naghibi-Sistani M B, Sani S K H. A novel distributed optimal adaptive control algorithm for nonlinear multi-agent differential graphical games[J]. IEEE/CAA Journal of Automatica Sinica, 2018, 5(1): 331-341. DOI:10.1109/JAS.2017.7510784

[9] Zhang H G, Cui X H, Luo Y H, et al. Finite-horizon H∞ tracking control for unknown nonlinear systems with saturating actuators[J]. IEEE Transactions on Neural Networks and Learning Systems, 2018, 29(4): 1200-1212.

[10] Modares H, Lewis F L, Jiang Z P. H∞ tracking control of completely unknown continuous-time systems via off-policy reinforcement learning[J]. IEEE Transactions on Neural Networks and Learning Systems, 2015, 26(10): 2550-2562. DOI:10.1109/TNNLS.2015.2441749

[11] Wen G X, Chen C L P, Ge S S, et al. Optimized adaptive nonlinear tracking control using actor-critic reinforcement learning strategy[J]. IEEE Transactions on Industrial Informatics, 2019, 15(9): 4969-4977.

[12] Jiao Q, Modares H, Xu S Y, et al. Multi-agent zero-sum differential graphical games for disturbance rejection in distributed control[J]. Automatica, 2016, 69: 24-34.

[13] Sun J L, Liu C S. Distributed zero-sum differential game for multi-agent nonlinear systems via adaptive dynamic programming[C]. The 37th Chinese Control Conference. Wuhan: IEEE, 2018: 2770-2775.

[14] Yu D, Wu Q H, Song L. Finite time estimation and containment control of second order perturbed directed networks[C]. The 50th IEEE Conference on Decision and Control and European Control Conference. Orlando: IEEE, 2011: 4126-4131.

[15] Wu H N, Luo B. Neural network based online simultaneous policy update algorithm for solving the HJI equation in nonlinear H∞ control[J]. IEEE Transactions on Neural Networks and Learning Systems, 2012, 23(12): 1884-1895.

[16] Yang X, Liu D R, Luo B, et al. Data-based robust adaptive control for a class of unknown nonlinear constrained-input systems via integral reinforcement learning[J]. Information Sciences, 2016, 369: 731-747.