Baird's Counterexample

Source: Internet. Editor: 程序博客网. Time: 2024/06/15 07:09

Reinforcement Learning: An Introduction (2nd edition), Section 11.2

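Chapter 7 of the RL book (2nd edition) introduced several tabular off-policy algorithms. To apply off-policy algorithms with function approximation, we update a weight vector θ instead of a table, using the approximate value function and its gradient. Many of these methods rely on the importance sampling ratio: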

$$\rho_t \doteq \frac{\pi(A_t \mid S_t)}{\mu(A_t \mid S_t)}$$
For the episodic, one-step, state-value case, the semi-gradient off-policy TD(0) update is:

$$\theta_{t+1} \doteq \theta_t + \alpha \rho_t \delta_t \nabla \hat{v}(S_t, \theta_t)$$

where

$$\delta_t \doteq R_{t+1} + \gamma \hat{v}(S_{t+1}, \theta_t) - \hat{v}(S_t, \theta_t)$$
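As a minimal sketch of this update, assume a linear approximator, so that the gradient of v̂(S_t, θ) with respect to θ is simply the feature vector x(S_t). The function name and argument layout below are illustrative and are not taken from the linked code.

```python
def semi_gradient_off_policy_td0(theta, x_s, x_next, reward, rho, alpha, gamma):
    """Return theta after one semi-gradient off-policy TD(0) step.

    Assumes a linear value estimate v_hat(s, theta) = sum_i theta[i] * x(s)[i],
    so the gradient of v_hat(S_t, theta) with respect to theta is x(S_t).
    """
    v_s = sum(t * x for t, x in zip(theta, x_s))        # v_hat(S_t, theta_t)
    v_next = sum(t * x for t, x in zip(theta, x_next))  # v_hat(S_{t+1}, theta_t)
    delta = reward + gamma * v_next - v_s               # TD error delta_t
    # theta_{t+1} = theta_t + alpha * rho_t * delta_t * grad v_hat(S_t, theta_t)
    return [t + alpha * rho * delta * x for t, x in zip(theta, x_s)]
```

Note that when rho is 0 the weights are left unchanged, so transitions that the target policy would never take contribute nothing to learning.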

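Here is an example.
As shown in the figure, the MDP has 7 states and 2 actions. Numbering the states from top to bottom and left to right gives 0, 1, 2, 3, 4, 5, 6. The behavior policy μ chooses the dashed action with probability 6/7 and the solid action with probability 1/7. The dashed action leads to one of the other six states (0 to 5), and the solid action leads to state 6. The target policy π always chooses the solid action, so the on-policy distribution is concentrated on state 6. All rewards are 0.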

[Figure: the seven-state, two-action MDP of Baird's counterexample]
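The MDP described above can be sketched in a few lines. The names `behavior_action` and `step` are hypothetical, and the uniform choice among states 0 to 5 under the dashed action follows the book's description of the example rather than anything stated explicitly in this post.

```python
import random

LOWER = 6             # states 0..5 are the upper states, state 6 is the lower one
P_DASHED = 6 / 7      # behavior policy mu: probability of choosing "dashed"

def behavior_action(rng):
    """Sample an action from the behavior policy mu."""
    return "dashed" if rng.random() < P_DASHED else "solid"

def step(action, rng):
    """Return (next_state, reward); the transition ignores the current state."""
    if action == "solid":
        next_state = LOWER                 # solid always leads to state 6
    else:
        next_state = rng.randrange(LOWER)  # dashed: uniform over states 0..5 (assumed)
    return next_state, 0.0                 # every reward is 0
```

Passing a `random.Random` instance explicitly keeps runs reproducible when a seed is fixed.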

With these policies, the importance sampling ratios are $\rho_{\text{solid}} = 1/(1/7) = 7$ and $\rho_{\text{dashed}} = 0/(6/7) = 0$.
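The same arithmetic can be checked exactly with rational numbers. This is only an illustrative sketch: the dictionaries `pi` and `mu` and the helper `importance_ratio` are names introduced here, and both policies are taken to be independent of the state.

```python
from fractions import Fraction

# Target policy pi always picks solid; behavior policy mu picks solid 1/7 of the time.
pi = {"solid": Fraction(1), "dashed": Fraction(0)}
mu = {"solid": Fraction(1, 7), "dashed": Fraction(6, 7)}

def importance_ratio(action):
    """rho = pi(action) / mu(action), the same in every state of this MDP."""
    return pi[action] / mu[action]
```

Using `Fraction` avoids floating-point rounding, so the ratios come out as exact rationals.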

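To understand the problem setup (as implemented in the code), see the Udacity RL course. The experimental results show that θ never converges: the system is unstable.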
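A rough sketch of such an experiment follows. Several details are assumptions borrowed from the book's version of the example and are not stated in this post: the linear features (8 weights for 7 states), the initial weights (1, 1, 1, 1, 1, 1, 10, 1), γ = 0.99 and α = 0.01. The task is also treated as continuing for simplicity. This is not the linked repository's code.

```python
import math
import random

GAMMA, ALPHA = 0.99, 0.01   # assumed values, following the book's setup
LOWER = 6

def features(state):
    """Assumed linear features: 8 weights shared across 7 states."""
    x = [0.0] * 8
    if state == LOWER:
        x[6], x[7] = 1.0, 2.0        # v_hat(6) = theta[6] + 2 * theta[7]
    else:
        x[state], x[7] = 2.0, 1.0    # v_hat(i) = 2 * theta[i] + theta[7]
    return x

def value(theta, state):
    return sum(t * x for t, x in zip(theta, features(state)))

def run(steps=1000, seed=0):
    """Run semi-gradient off-policy TD(0) under the behavior policy; return theta."""
    rng = random.Random(seed)
    theta = [1.0] * 6 + [10.0, 1.0]  # assumed initial weights
    state = rng.randrange(7)
    for _ in range(steps):
        solid = rng.random() < 1 / 7             # behavior policy mu
        next_state = LOWER if solid else rng.randrange(LOWER)
        rho = 7.0 if solid else 0.0              # importance sampling ratio
        # The reward is always 0, so the TD error has no reward term.
        delta = GAMMA * value(theta, next_state) - value(theta, state)
        theta = [t + ALPHA * rho * delta * x
                 for t, x in zip(theta, features(state))]
        state = next_state
    return theta

def norm(theta):
    return math.sqrt(sum(t * t for t in theta))
```

Comparing `norm(run(steps=0))` with `norm(run(steps=1000))` gives a quick sense of how far the weights drift from their initial values; increasing `steps` makes the growth more apparent.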

The example shows that even the simplest combination of bootstrapping and function approximation can be unstable if the backups are not done according to the on-policy distribution.

Code:
https://github.com/Mandalalala/Reinforcement-Learning-an-introduction/tree/master/Chapter%2011

References:
Udacity, Baird's counterexample:
https://classroom.udacity.com/courses/ud600/lessons/4627968925/concepts/46743885780923
