Summary of Chapter 13, "Policy Gradient Methods", of Reinforcement Learning: An Introduction
A new colleague has joined our group and needs an introduction to RL, so we are starting from Silver's course.
For myself, I am adding the requirement of carefully reading Reinforcement Learning: An Introduction.
My previous read was not very careful; this time I want to be more thorough and write a short summary of the corresponding material.
13.1 Policy Approximation and its Advantages
13.2 The Policy Gradient Theorem
13.3 REINFORCE: Monte Carlo Policy Gradient
13.4 REINFORCE with Baseline
13.5 Actor-Critic Methods
13.6 Policy Gradient for Continuing Problems (Average Reward Rate)
13.7 Policy Parameterization for Continuous Actions
Policy gradient methods learn a parameterized policy that can select actions directly; a value function may still be used to learn the policy weights, but is not required for action selection.
Advantages of policy gradient methods, per the book:
1) the policy may be a simpler function to approximate than the value function;
2) action-value methods have no natural way of finding stochastic optimal policies, whereas policy-approximating methods can;
3) policy parameterization is sometimes a good way of injecting prior knowledge about the policy.
13.2 The Policy Gradient Theorem
In the episodic case, we define the performance measure as the value of the start state of the episode: η(θ) = Vπθ(s0).
13.3 REINFORCE: Monte Carlo Policy Gradient
The update increases the weight vector in this direction proportional to the return, and inversely proportional to the action probability. The former makes sense because it causes the weights to move most in the directions that favor actions that yield the highest return. The latter makes sense because otherwise actions that are selected frequently are at an advantage (the updates will be more often in their direction) and might win out even if they do not yield the highest return.
As a stochastic gradient method, REINFORCE has good theoretical convergence properties (Gt is unbiased, and REINFORCE will converge asymptotically to a local optimum). This assures an improvement in expected performance for sufficiently small α, and convergence to a local optimum under standard stochastic approximation conditions for decreasing α. However, as a Monte Carlo method REINFORCE has high variance and is thus slow to learn.
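As a rough sketch of the update described above (my own illustration, not the book's code; the linear softmax parameterization, episode format, and hyperparameters are all assumptions), REINFORCE for a discrete-action task could look like:

```python
import numpy as np

def softmax_policy(theta, feats):
    """pi(a|s): softmax over linear action preferences theta^T phi(s,a).

    feats: (num_actions x d) matrix whose rows are the features phi(s,a).
    """
    prefs = feats @ theta
    prefs = prefs - prefs.max()          # subtract max for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def reinforce_update(theta, episode, alpha=0.01, gamma=0.99):
    """One REINFORCE update from a single completed episode.

    episode: list of (feats, action, reward) tuples.
    Update rule (13.8): theta += alpha * gamma^t * G_t * grad log pi(A_t|S_t).
    """
    G = 0.0
    updates = []
    # walk the episode backwards to accumulate the returns G_t
    for t in reversed(range(len(episode))):
        feats, a, r = episode[t]
        G = r + gamma * G
        probs = softmax_policy(theta, feats)
        # for a linear softmax policy: grad log pi = phi(s,a) - E_pi[phi(s,.)]
        grad_log_pi = feats[a] - probs @ feats
        updates.append(alpha * (gamma ** t) * G * grad_log_pi)
    for u in updates:
        theta = theta + u
    return theta
```

Because the update is scaled by the full Monte Carlo return G_t, it is unbiased but noisy, which is exactly the high-variance problem the baseline and actor-critic variants below address.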
13.4 REINFORCE with Baseline
The baseline can be any function, even a random variable, as long as it does not vary with the action a; subtracting it leaves the expected update unchanged but can reduce the variance (and thus speed the learning).
One natural choice for the baseline is an estimate of the state value, V(St;w). In some states all actions have high values and we need a high baseline to differentiate the higher valued actions from the less highly valued ones; in other states all actions will have low values and a low baseline is appropriate.
REINFORCE with baseline is unbiased and will converge asymptotically to a local optimum, but like all Monte Carlo methods it tends to be slow to learn (high variance) and inconvenient to implement online or for continuing problems.
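A per-time-step sketch of the baseline variant (again my own illustration; the linear value function and softmax policy, step sizes, and feature layout are assumptions) might look like:

```python
import numpy as np

def reinforce_baseline_step(theta, w, phi_s, action_feats, a, G, t,
                            alpha_theta=0.01, alpha_w=0.1, gamma=0.99):
    """One time-step update of REINFORCE with a learned state-value baseline.

    phi_s:        state feature vector, baseline v_hat(s,w) = w^T phi_s.
    action_feats: (num_actions x d) matrix of phi(s,a) for the policy.
    G:            Monte Carlo return from time t.
    Updates:  delta  = G - v_hat(S_t, w)
              w     += alpha_w * delta * phi_s          (baseline/value step)
              theta += alpha_theta * gamma^t * delta * grad log pi
    """
    delta = G - w @ phi_s                 # return minus baseline
    w = w + alpha_w * delta * phi_s       # move the baseline toward G
    prefs = action_feats @ theta
    probs = np.exp(prefs - prefs.max())
    probs = probs / probs.sum()
    grad_log_pi = action_feats[a] - probs @ action_feats
    theta = theta + alpha_theta * (gamma ** t) * delta * grad_log_pi
    return theta, w
```

The only change from plain REINFORCE is that the return G is replaced by (G minus the baseline); since the baseline does not depend on the action, the expected gradient is unchanged.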
13.5 Actor-Critic Methods
Although the REINFORCE-with-baseline method learns both a policy and a state-value function, we do not consider it to be an actor-critic method because its state-value function is used only as a baseline, not as a critic. That is, it is not used for bootstrapping (updating a state from the estimated values of subsequent states), but only as a baseline for the state being updated.
One-step actor-critic methods replace the full return of REINFORCE (13.9) with the one-step return Rt+1 + γV(St+1;w).
The forward view of multi-step actor-critic, the forward view of λ-return actor-critic, and the backward view of λ-return actor-critic all follow fairly directly from this.
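The one-step variant can be sketched as follows (my own illustration under the same linear softmax/linear critic assumptions as above; for simplicity this omits the γ^t discounting factor on the actor step that the book carries along):

```python
import numpy as np

def actor_critic_step(theta, w, s_feats, s_next_feats, action_feats, a, r,
                      done, alpha_theta=0.01, alpha_w=0.1, gamma=0.99):
    """One-step actor-critic update.

    Critic: linear v_hat(s,w) = w^T phi(s).  Actor: linear softmax policy.
    The TD error replaces the full return of REINFORCE:
        delta = R + gamma * v_hat(S') - v_hat(S)
    so the update can be made online, after every step.
    """
    v_s = w @ s_feats
    v_next = 0.0 if done else w @ s_next_feats
    delta = r + gamma * v_next - v_s
    w = w + alpha_w * delta * s_feats                     # critic (TD(0)) step
    prefs = action_feats @ theta
    probs = np.exp(prefs - prefs.max())
    probs = probs / probs.sum()
    grad_log_pi = action_feats[a] - probs @ action_feats
    theta = theta + alpha_theta * delta * grad_log_pi     # actor step
    return theta, w
```

Unlike REINFORCE, this bootstraps from v_hat(S'), which introduces bias but sharply reduces variance and allows fully online updates.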
13.6 Policy Gradient for Continuing Problems (Average Reward Rate)
13.7 Policy Parameterization for Continuous Actions (in practice, DPG and NAF are now used more often)
These two sections only need a quick read; honestly Sutton's treatment in this chapter is not great, and I recommend the 2017 ICML Deep RL Tutorial instead.
Below are the points from Silver's course, Lecture 7 "Policy Gradient Methods", that I think one should know:
Slide 4: the differences between value-based, policy-based, and actor-critic methods.
Slide 5: advantages and disadvantages of policy-based methods.
Advantages:
Better convergence properties (with a small enough step size, policy gradient is guaranteed to keep improving the policy; correspondingly, it often gets stuck in a local optimum)
Effective in high-dimensional or continuous action spaces (value-based methods need a max over actions)
Can learn stochastic policies (taking the max of Q over actions can be roughly viewed as a deterministic policy)
Disadvantages:
Typically converge to a local rather than global optimum (a softmax policy seems to converge to the global optimum; with an ANN it is certainly only a local optimum)
Evaluating a policy is typically inefficient and high variance (hence actor-critic, baselines, and actor with multi-step critic)
Slide 10: policy gradient methods update the policy parameters θ by computing the gradient of the policy π(a|s;θ), thereby improving the policy; so what is the objective J(θ) that measures how good a policy is? Note that the episodic and continuing cases differ.
Slide 13: first way to compute the policy gradient: finite differences. Works for arbitrary policies, even if the policy is not differentiable.
Slide 12: second way to compute the policy gradient: analytically. Assume the policy πθ is differentiable whenever it is non-zero.
Slides 16-18: likelihood ratios and the score function (∇θ log πθ(s,a)). For both the softmax policy and the Gaussian policy, the score function measures the advantage of the current action over the average action.
Slide 19: the policy gradient theorem for a one-step MDP.
Slide 20: the policy gradient theorem, generalized from the one-step MDP to multi-step MDPs.
Slide 21: Monte-Carlo policy gradient (REINFORCE) computes an unbiased policy gradient, but with high variance.
Slides 23-25: actor-critic policy gradient: use a critic Qw(s,a) to approximate the return Gt of REINFORCE, yielding an approximate policy gradient. This is one way to reduce variance.
Slides 26-28: compatible function approximation. Because the critic Qw(s,a) approximates the return Gt, it generally introduces bias, which can prevent the algorithm from converging. What conditions must the critic Qw(s,a) satisfy so that no bias is introduced? That is the question compatible function approximation answers.
Slides 29-31: a second way to reduce variance: a baseline (which must not depend on the action) and the advantage function (two ways to compute it: A(s,a) = Qw(s,a) - V(s), or the TD error A = r + γVw(s') - Vw(s)).
Slides 32-34: the critic can estimate the TD target at different time-scales (MC / TD(0) / TD(λ) / backward-view TD(λ)); correspondingly, the policy gradient also has forms at these different time-scales.
Slides 35-37: natural policy gradient, worth a look.
Slide 41: summary of policy gradient algorithms.
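The score-function identity from slides 16-18 can be checked numerically: for a linear softmax policy, the analytic score ∇θ log πθ(s,a) should equal phi(s,a) minus the policy-weighted average of the action features, which is also a tiny demonstration of the finite-differences approach from the lecture (the feature matrix and parameters below are made up for illustration):

```python
import numpy as np

def log_pi(theta, feats, a):
    """log pi(a|s) for a linear softmax policy; feats rows are phi(s,a)."""
    prefs = feats @ theta
    prefs = prefs - prefs.max()          # stabilize the log-sum-exp
    return prefs[a] - np.log(np.exp(prefs).sum())

def score_numeric(theta, feats, a, eps=1e-6):
    """Finite-difference estimate of grad_theta log pi(a|s)."""
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (log_pi(theta + e, feats, a) - log_pi(theta - e, feats, a)) / (2 * eps)
    return g

def score_analytic(theta, feats, a):
    """Closed form for linear softmax: phi(s,a) - sum_b pi(b|s) phi(s,b)."""
    prefs = feats @ theta
    probs = np.exp(prefs - prefs.max())
    probs = probs / probs.sum()
    return feats[a] - probs @ feats
```

The analytic form makes the "advantage over the average action" reading concrete: the score is the chosen action's features minus the average features under the current policy.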