Summary of Chapter 13, "Policy Gradient Methods", of Reinforcement Learning: An Introduction


A new colleague has joined our group and I need to help him get started with RL, so we are beginning with Silver's course.

For myself, I am adding the requirement of carefully rereading Reinforcement Learning: An Introduction.

I did not read it very carefully the first time; this time I hope to be more thorough and also write a brief summary of the corresponding key points.


13.1 Policy Approximation and its Advantages
13.2 The Policy Gradient Theorem
13.3 REINFORCE: Monte Carlo Policy Gradient
13.4 REINFORCE with Baseline
13.5 Actor-Critic Methods
13.6 Policy Gradient for Continuing Problems (Average Reward Rate)
13.7 Policy Parameterization for Continuous Actions




Policy gradient methods learn a parameterized policy that can select actions directly; a value function may still be used to learn the policy weights, but it is not required for action selection.




13.1 Policy Approximation and its Advantages

Advantages and disadvantages of policy gradient methods:

1) the policy may be a simpler function to approximate than the value function

2) action-value methods have no natural way of finding stochastic optimal policies, whereas policy-approximating methods can

3) the policy parameterization is sometimes a good way of injecting prior knowledge

Advantages:
Better convergence properties (with a sufficiently small step size, policy gradient is guaranteed to keep improving the policy; correspondingly, it tends to get stuck in a local optimum)
Effective in high-dimensional or continuous action spaces (value-based methods need a max over actions)
Can learn stochastic policies (taking the max of Q over actions can roughly be viewed as a deterministic policy)

Disadvantages:
Typically converge to a local rather than global optimum (a softmax policy apparently converges to the global optimum; with an ANN it is certainly a local optimum)
Evaluating a policy is typically inefficient and high variance (hence actor-critic, baselines, and actor with a multi-step critic)


13.2 The Policy Gradient Theorem 

In the episodic case, we define the performance measure as the value of the start state of the episode: η(θ) = v_{π_θ}(s_0).
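For reference, the theorem itself (in the book's notation, with μ(s) the on-policy state distribution under π) says that this gradient can be written without any derivative of the state distribution:

$$
\nabla \eta(\theta) \propto \sum_s \mu(s) \sum_a q_\pi(s,a)\, \nabla_\theta \pi(a \mid s, \theta)
$$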


13.3 REINFORCE: Monte Carlo Policy Gradient

The update increases the weight vector in this direction proportional to the return, and inversely proportional to the action probability. The former makes sense because it causes the weights to move most in the directions that favor actions that yield the highest return. The latter makes sense because otherwise actions that are selected frequently are at an advantage (the updates will be more often in their direction) and might win out even if they do not yield the highest return.
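In symbols, the update described above can be written as

$$
\theta_{t+1} = \theta_t + \alpha\, G_t\, \frac{\nabla_\theta \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)}
= \theta_t + \alpha\, G_t\, \nabla_\theta \ln \pi(A_t \mid S_t, \theta_t)
$$

(some versions of the book also multiply the increment by a γ^t factor).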

As a stochastic gradient method, REINFORCE has good theoretical convergence properties (G_t is unbiased and REINFORCE will converge asymptotically to a local optimum). This assures an improvement in expected performance for sufficiently small α, and convergence to a local optimum under standard stochastic approximation conditions for decreasing α. However, as a Monte Carlo method REINFORCE may be of high variance and thus slow to learn.
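As a rough illustration (not the book's pseudocode), here is a minimal REINFORCE sketch with a tabular softmax policy on a made-up 4-state corridor MDP; the environment, step sizes, and episode count are all hypothetical choices for this example:

```python
# Minimal REINFORCE sketch: tabular softmax policy on a toy corridor MDP.
# Everything here (environment, step sizes, episode count) is illustrative only.
import numpy as np

N_STATES, N_ACTIONS, GAMMA, ALPHA = 4, 2, 0.99, 0.05
rng = np.random.default_rng(0)
theta = np.zeros((N_STATES, N_ACTIONS))          # action preferences per state

def policy(s):
    prefs = theta[s] - theta[s].max()            # softmax over preferences
    p = np.exp(prefs)
    return p / p.sum()

def step(s, a):
    """Toy dynamics: action 1 moves right, action 0 moves left; reward -1 per
    step; the episode ends when the right end (state N_STATES-1) is reached."""
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    return s2, -1.0, s2 == N_STATES - 1

for episode in range(2000):
    # 1) Generate one episode by following the current policy.
    s, done, traj = 0, False, []
    while not done:
        a = rng.choice(N_ACTIONS, p=policy(s))
        s2, r, done = step(s, a)
        traj.append((s, a, r))
        s = s2
    # 2) For each step t: theta += alpha * gamma^t * G_t * grad log pi(A_t|S_t).
    G = 0.0
    for t in reversed(range(len(traj))):
        s, a, r = traj[t]
        G = r + GAMMA * G                        # return G_t
        grad_log = -policy(s)                    # grad of log-softmax wrt theta[s]
        grad_log[a] += 1.0
        theta[s] += ALPHA * (GAMMA ** t) * G * grad_log
```

With this toy setup, the policy should gradually learn to prefer moving right, since shorter episodes yield larger (less negative) returns.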



13.4 REINFORCE with Baseline

The baseline can be any function, even a random variable, as long as it does not vary with the action a; subtracting it leaves the expected value of the update unchanged, but a good baseline can significantly reduce the variance (and thus speed up learning).

One natural choice for the baseline is an estimate of the state value, v̂(S_t, w). In some states all actions have high values and we need a high baseline to differentiate the higher valued actions from the less highly valued ones; in other states all actions will have low values and a low baseline is appropriate.

REINFORCE with baseline is unbiased and will converge asymptotically to a local optimum, but like all Monte Carlo methods it tends to be slow to learn (high variance) and inconvenient to implement online or for continuing problems.
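Roughly following the book's pseudocode, with the learned state-value estimate v̂(S_t, w) used as the baseline, the per-step updates within each generated episode are:

$$
\delta = G_t - \hat v(S_t, \mathbf{w}), \qquad
\mathbf{w} \leftarrow \mathbf{w} + \alpha^{\mathbf{w}}\, \delta\, \nabla_{\mathbf{w}} \hat v(S_t, \mathbf{w}), \qquad
\theta \leftarrow \theta + \alpha^{\theta}\, \delta\, \nabla_\theta \ln \pi(A_t \mid S_t, \theta)
$$

(again, some versions of the book include a γ^t factor in the policy update).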





13.5 Actor-Critic Methods

Although the REINFORCE-with-baseline method learns both a policy and a state-value function, we do not consider it to be an actor-critic method because its state-value function is used only as a baseline, not as a critic. That is, it is not used for bootstrapping (updating a state from the estimated values of subsequent states), but only as a baseline for the state being updated.

One-step actor-critic methods replace the full return of REINFORCE (13.9) with the one-step return, as follows:
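The update, reconstructed here from the book with v̂(·, w) denoting the learned state-value function that now serves as both baseline and critic, is

$$
\theta_{t+1} = \theta_t + \alpha \left( R_{t+1} + \gamma\, \hat v(S_{t+1}, \mathbf{w}) - \hat v(S_t, \mathbf{w}) \right) \nabla_\theta \ln \pi(A_t \mid S_t, \theta_t)
$$

i.e. θ_{t+1} = θ_t + α δ_t ∇_θ ln π(A_t | S_t, θ_t) with δ_t the one-step TD error; the critic's weights w are updated by ordinary semi-gradient TD(0).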




The forward view of multi-step actor-critic, the forward view of λ-return actor-critic, and the backward view of λ-return actor-critic are all fairly straightforward extensions of the one-step case.




13.6 Policy Gradient for Continuing Problems (Average Reward Rate)

13.7 Policy Parameterization for Continuous Actions (nowadays DPG and NAF are more commonly used)

It is enough to just skim these two sections; honestly, Sutton's treatment of them in this chapter is not great, and I recommend the 2017 ICML Deep RL Tutorial instead.
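For completeness, the continuous-action parameterization in the book is a Gaussian policy whose mean and standard deviation are parameterized functions of the state (for example linear in state features, with the standard deviation passed through an exponential so it stays positive):

$$
\pi(a \mid s, \theta) = \frac{1}{\sigma(s,\theta)\sqrt{2\pi}}\, \exp\!\left( -\frac{(a - \mu(s,\theta))^2}{2\,\sigma(s,\theta)^2} \right)
$$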












Below are the points from Silver's course, "Lecture 7: Policy Gradient Methods", that I think you should know (the leading numbers refer to slide numbers):

4: the differences between value-based, policy-based, and actor-critic methods.

5: advantages and disadvantages of policy-based methods (the same list as in the chapter summary above).


10: policy gradient methods optimize the policy by computing the gradient of the policy π(a|s;θ) with respect to its parameters θ and updating θ along it; so what is the objective J(θ) that measures how good a policy is? Note that the episodic and continuing cases are handled differently (the three objectives from the slides are recalled below).
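In the notation of Silver's slides, with d^{π_θ} the stationary distribution of the Markov chain induced by π_θ, the three common objectives are the start-state value, the average value, and the average reward per time-step:

$$
J_1(\theta) = V^{\pi_\theta}(s_1), \qquad
J_{avV}(\theta) = \sum_s d^{\pi_\theta}(s)\, V^{\pi_\theta}(s), \qquad
J_{avR}(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(s,a)\, \mathcal{R}_s^a
$$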

13: the first way to compute the policy gradient: finite differences, which works for arbitrary policies, even if the policy is not differentiable.

12: the second way to compute the policy gradient: analytically, assuming the policy π_θ is differentiable whenever it is non-zero.

16-18: likelihood ratios and the score function ∇_θ log π_θ(s,a); know that, for both the softmax policy and the Gaussian policy, the score function measures how much the chosen action deviates from the average action (their explicit forms are given below).
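From the slides: for a softmax policy with linear action preferences in features φ(s,a), and for a Gaussian policy with linear mean μ(s) = φ(s)ᵀθ and fixed variance σ², the score functions are

$$
\nabla_\theta \log \pi_\theta(s,a) = \phi(s,a) - \mathbb{E}_{\pi_\theta}\!\left[ \phi(s,\cdot) \right] \quad \text{(softmax)}, \qquad
\nabla_\theta \log \pi_\theta(s,a) = \frac{(a - \mu(s))\, \phi(s)}{\sigma^2} \quad \text{(Gaussian)}
$$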

19: the policy gradient theorem for one-step MDPs.

20: the policy gradient theorem, generalized from one-step MDPs to multi-step MDPs.

21: the Monte-Carlo policy gradient (REINFORCE) method computes an unbiased policy gradient, but with high variance.

23-25: actor-critic policy gradient: a critic Q_w(s,a) is used in place of the return G_t of the original REINFORCE method, so what is computed is an approximate policy gradient (see below). This is one way of reducing variance.
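The approximation in question, in the slides' notation, is

$$
\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(s,a)\, Q_w(s,a) \right]
$$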

26-28: compatible function approximation: in actor-critic methods, replacing the return G_t of REINFORCE with a critic Q_w(s,a) generally introduces bias, which can cause the algorithm not to converge. So what conditions must the critic Q_w(s,a) satisfy so that no bias is introduced? That is the question compatible function approximation answers (see the two conditions below).
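The two conditions of the Compatible Function Approximation Theorem are: the critic's features must be compatible with the policy, and its parameters must minimize the mean-squared error; then the policy gradient computed with Q_w is exact:

$$
\text{(1)}\;\; \nabla_w Q_w(s,a) = \nabla_\theta \log \pi_\theta(s,a), \qquad
\text{(2)}\;\; w \text{ minimizes } \varepsilon = \mathbb{E}_{\pi_\theta}\!\left[ \left( Q^{\pi_\theta}(s,a) - Q_w(s,a) \right)^2 \right]
$$
$$
\Rightarrow \quad \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(s,a)\, Q_w(s,a) \right]
$$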

29-31: the second way to reduce variance: a baseline (which must not depend on the action) and the advantage function, which can be estimated in two ways: A(s,a) = Q_w(s,a) − V(s), or via the TD error, A ≈ r + γV_w(s') − V_w(s) (see below).
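The fact behind the TD-error estimator (from the slides) is that the TD error of the true value function is an unbiased estimate of the advantage, so the policy gradient can be written in terms of it:

$$
\delta^{\pi_\theta} = r + \gamma V^{\pi_\theta}(s') - V^{\pi_\theta}(s), \qquad
\mathbb{E}_{\pi_\theta}\!\left[ \delta^{\pi_\theta} \mid s, a \right] = A^{\pi_\theta}(s,a), \qquad
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(s,a)\, \delta^{\pi_\theta} \right]
$$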

32-34: the critic can estimate its target at different time-scales (MC / TD(0) / TD(λ) / backward-view TD(λ)); correspondingly, the policy gradient also comes in forms at these different time-scales.

35-37: the natural policy gradient, worth a quick look.

41: summary of policy gradient algorithms.




