Summary of Chapter 5, "Monte Carlo Methods", of "Reinforcement Learning: An Introduction"


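A new student has joined our group and I need to help him get started with RL, so we are beginning with Silver's course.

For myself, I am adding one requirement: read "Reinforcement Learning: An Introduction" carefully.

I did not read it very carefully before, so this time I want to be more thorough and also write a short summary of the corresponding key points.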

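Note: this chapter covers model-free prediction and control. There are still two approaches, policy iteration and value iteration (the evaluation step uses a model-free method, and the improvement step is greedy). This chapter mainly covers policy iteration based on Monte Carlo methods.

In the model-free setting it is more common to estimate Q(S,A) directly, because even with an estimate of V(S), without a model we still do not know how to choose an action (how to derive a policy): acting greedily on V requires a one-step lookahead through the transition model.

The MC idea: run until the episode ends, then use the empirical mean return to estimate the expected return. It applies to model-free, episodic tasks.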

Monte Carlo Prediction (policy evaluation)

Follow the policy π being evaluated until the episode ends, then use the empirical mean return to estimate the expected return.

First-visit and every-visit MC policy evaluation both apply to estimating V(S) as well as Q(S,A).
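The first-visit procedure can be sketched roughly as follows. This is a minimal illustration, not the book's pseudocode: the function name `first_visit_mc` and the episode format (a list of `(state, reward)` pairs, where the reward is the one received on leaving that state) are assumptions made for the example.

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """Estimate V(s) as the average return following the first visit to s.

    Assumed (hypothetical) format: each episode is a list of (state, reward)
    pairs, with reward being the reward received after leaving that state.
    """
    returns_sum = defaultdict(float)
    visit_count = defaultdict(int)
    for episode in episodes:
        # Time step at which each state first appears in this episode.
        first_visit = {}
        for t, (state, _) in enumerate(episode):
            first_visit.setdefault(state, t)
        # Walk backwards so G holds the discounted return from step t onward.
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            state, reward = episode[t]
            G = gamma * G + reward
            if first_visit[state] == t:
                returns_sum[state] += G
                visit_count[state] += 1
    return {s: returns_sum[s] / visit_count[s] for s in returns_sum}

episodes = [
    [("A", 0.0), ("B", 1.0)],
    [("B", 0.0)],
    [("A", 1.0), ("A", 0.0), ("B", 1.0)],
]
V = first_visit_mc(episodes)
```

Dropping the `first_visit[state] == t` check turns this into the every-visit variant.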

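If every state is visited infinitely often, the law of large numbers guarantees convergence to Vπ. For action values this means we need to estimate the value of all the actions from each state, which can be ensured in two ways: 1) exploring starts, where the initial (s,a) pair is chosen at random so that every (s,a) can be visited, although this is hard to apply in some real environments; 2) in every state, selecting actions with something like ε-greedy (rather than greedily), so that every action, and hence every (s,a), can be selected.

MC estimates each state independently of the others; in other words, it does not bootstrap and makes no use of the MDP structure at all.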

Monte Carlo Control

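Based on generalized policy iteration (GPI).

Policy evaluation: compute Qπ(S,A) with the Monte Carlo prediction method.

Policy improvement (exploring-starts variant): update π_new(S) = argmax_a Qπ(S,a).

Policy improvement (ε-greedy variant): update π_new to the ε-greedy policy with respect to Qπ, which gives the greedy action probability 1 − ε + ε/|A(S)| and every other action probability ε/|A(S)|.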
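A minimal sketch of the ε-greedy improvement step for a single state, assuming the standard form in which the greedy action gets probability 1 − ε + ε/|A(s)| and every other action gets ε/|A(s)|. The function name `epsilon_greedy_probs` and the list layout of the Q values are assumptions made for the example.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of the epsilon-greedy policy for one state.

    q_values is a list holding Q(s, a) for each action index a
    (a hypothetical layout chosen for this sketch).
    """
    n = len(q_values)
    # Greedy action; ties are broken in favor of the lowest index.
    best = max(range(n), key=lambda a: q_values[a])
    # Every action gets epsilon / n; the greedy one gets the remaining mass.
    probs = [epsilon / n] * n
    probs[best] += 1.0 - epsilon
    return probs

probs = epsilon_greedy_probs([0.1, 0.5, 0.2, 0.0], epsilon=0.2)
```

Because every action keeps probability at least ε/|A(s)|, all (s,a) pairs remain reachable without exploring starts.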

With exploring starts we can find Q*/π*. With ε-greedy we can only find the best policy among all ε-greedy policies, ε-π*; the advantage is that exploring starts are no longer needed.

On-policy and off-policy:

On-policy methods attempt to evaluate or improve the policy that is used to make decisions.

Off-policy methods evaluate or improve a policy different from that used to generate the data.

Off-policy Monte Carlo Prediction:

We want to estimate Vπ for the target policy π from samples generated by the behavior policy μ. The two policies generate the same sample with different probabilities, so importance sampling is used to reweight the corresponding returns.

Importance Sampling: The importance sampling ratio ends up depending only on the two policies and not at all on the MDP.

There are two ways to use importance sampling: ordinary (simple-average) importance sampling and weighted importance sampling. Both can be implemented online in incremental form (see Section 5.6).

In practice, the weighted estimator usually has dramatically lower variance and is strongly preferred. Nevertheless, we will not totally abandon ordinary importance sampling as it is easier to extend to the approximate methods using function approximation.
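The two estimators differ only in the denominator, which a small sketch makes concrete. The function names, the nested-dict policy layout, and the toy data below are all assumptions made for illustration; episodes are given as lists of `(state, action)` pairs with their returns listed separately.

```python
def importance_ratio(episode, target_probs, behavior_probs):
    """Product over the episode of pi(a|s) / mu(a|s)."""
    rho = 1.0
    for state, action in episode:
        rho *= target_probs[state][action] / behavior_probs[state][action]
    return rho

def ordinary_is(ratios, returns):
    # Divide by the number of episodes: unbiased, but variance can be large.
    return sum(r * g for r, g in zip(ratios, returns)) / len(returns)

def weighted_is(ratios, returns):
    # Divide by the sum of ratios: biased for finite data, lower variance.
    total = sum(ratios)
    if total == 0:
        return 0.0
    return sum(r * g for r, g in zip(ratios, returns)) / total

# Toy data (hypothetical): one state, target always picks "l", behavior is uniform.
target = {"s": {"l": 1.0, "r": 0.0}}
behavior = {"s": {"l": 0.5, "r": 0.5}}
episodes = [[("s", "l")], [("s", "r")], [("s", "l"), ("s", "l")]]
returns = [1.0, 5.0, 2.0]

ratios = [importance_ratio(ep, target, behavior) for ep in episodes]
v_ordinary = ordinary_is(ratios, returns)
v_weighted = weighted_is(ratios, returns)
```

Note that the ratio only multiplies policy probabilities; the transition probabilities of the MDP cancel out, which is why they never appear in the code.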

Off-policy Monte Carlo Control:

Policy evaluation: off-policy Monte Carlo prediction, that is, learning Qπ from data generated by the behavior policy μ.

Policy improvement on the target policy π: π(St) = argmax_a Qπ(St, a)

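Below is what I think you should know from Silver's course, "Lecture 4: Model-Free Prediction" and "Lecture 5: Model-Free Control":

The content of these two lectures is what will be used most often later on, but there is actually not much that has to be mastered; for most of it a general understanding is enough.

Specifically, you are required to master:

lecture 4: the difference between MC and TD

lecture 5: SARSA and Q-learning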

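lecture 4: mainly understand the ideas behind MC and TD
3: Know that model-free prediction ==> Estimate the value function of an unknown MDP
Know that there are two families of methods: Monte-Carlo Learning (MC-Learning) and Temporal-Difference Learning (TD-Learning)
4: Characteristics of MC-Learning: run to the end of the episode, then estimate the expected return with the empirical mean return; ==> model-free, no bootstrapping (requires MDPs whose episodes clearly terminate)
6-7: First- and every-visit MC policy evaluation. By the law of large numbers (from probability theory), the empirical mean return converges to the expected return.
10-11: Know that MC in essence gives every sample the same weight, but in the incremental update form the step size / learning rate (lr) applied to (Gt - V(St)) keeps shrinking; specifically, lr_t = 1/N(St).
For non-stationary problems, lr_t = constant is actually more suitable, because a constant lr_t essentially means that the more recent a return is, the more it contributes to V(s) (you will understand this more deeply later). Most practical problems are non-stationary, so setting lr_t = constant directly is very common.
12: Characteristics of TD-Learning: take one step, then use the estimated value of the next state to estimate the value of the current state; ==> model-free, with bootstrapping.
13-20: Comparison of MC and TD, including update rules, pros and cons, and the bias-variance trade-off (a very important concept in machine learning). Know what the TD target and the TD error are.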
MC has high variance, zero bias
Good convergence properties
(even with function approximation)
Not very sensitive to initial value
Very simple to understand and use
TD has low variance, some bias
Usually more efficient than MC
TD(0) converges to vπ(s)
(but not always with function approximation)
More sensitive to initial value
21-25: The batch MC/TD example. Slide 24 is very important; it lets you understand the essential difference between MC and TD.
TD exploits Markov property
Usually more efficient in Markov environments
MC does not exploit Markov property
Usually more effective in non-Markov environments
26-30: Just get a feel for these unified views.
The content after slide 30 does not need to be read for now.
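To make the batch MC/TD contrast in the notes above concrete, here is a rough sketch on the well-known "AB" batch of eight episodes from Sutton and Barto ("You are the Predictor"), which as far as I recall is also the batch example used in this lecture. The function names, the `(state, reward)` episode format, and the choice of `alpha` and `sweeps` are assumptions; the batch TD(0) loop sums the increments over the whole batch before applying them.

```python
def batch_mc(episodes):
    """Average observed return per state (every-visit, undiscounted)."""
    totals, counts = {}, {}
    for episode in episodes:
        G = 0.0
        for state, reward in reversed(episode):
            G += reward
            totals[state] = totals.get(state, 0.0) + G
            counts[state] = counts.get(state, 0) + 1
    return {s: totals[s] / counts[s] for s in totals}

def batch_td0(episodes, alpha=0.01, sweeps=5000):
    """Replay the same batch repeatedly with undiscounted TD(0) updates."""
    V = {s: 0.0 for ep in episodes for s, _ in ep}
    for _ in range(sweeps):
        delta = {s: 0.0 for s in V}
        for episode in episodes:
            for t, (state, reward) in enumerate(episode):
                # Bootstrap from the successor state; 0 after the last step.
                next_v = V[episode[t + 1][0]] if t + 1 < len(episode) else 0.0
                delta[state] += alpha * (reward + next_v - V[state])
        # Apply the summed increments once per sweep (batch updating).
        for s in V:
            V[s] += delta[s]
    return V

# One episode A,0,B,0; six episodes B,1; one episode B,0.
episodes = [[("A", 0.0), ("B", 0.0)]] + [[("B", 1.0)]] * 6 + [[("B", 0.0)]]
v_mc = batch_mc(episodes)
v_td = batch_td0(episodes)
```

MC gives V(A) = 0 because the only return ever observed from A was 0, while batch TD(0) gives V(A) close to 0.75 because A always led to B and V(B) = 0.75. This is the sense in which TD exploits the Markov property and MC does not.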

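lecture 5: mainly understand SARSA and Q-learning
3: Know that model-free control ==> Optimise the value function of an unknown MDP
MDP model is unknown, but experience can be sampled
MDP model is known, but is too big to use, except by samples
5: Know the difference between on-policy and off-policy
8: This explains why model-free problems generally compute the action value Q(s,a) rather than the state value V(s): our ultimate goal is to find the policy π*, and in the model-free case π* cannot be computed from V(s) alone.
10-11: ε-greedy exploration
12: ε-greedy policy improvement: applying ε-greedy policy improvement to an ε-greedy policy π yields π', which still satisfies π' >= π
14-16: On-policy Monte-Carlo control and GLIE Monte-Carlo control; a general understanding is enough
20-22: On-policy learning, SARSA. This algorithm is fairly well known and must be mastered.
23: Convergence conditions. The intuition for the first (the step sizes sum to infinity) is that the steps must be large enough that the final result is not affected by the initial values of V/Q; the intuition for the second (the squared step sizes have a finite sum) is that the steps must eventually become small enough to guarantee convergence.
Consider the lr mentioned in lecture 4: lr_t = 1/N(St) satisfies the convergence conditions (1 + 1/2 + 1/3 + ... + 1/n ≈ ln(n) + C, where C is the Euler-Mascheroni constant, about 0.5772, so the sum -> +infinity; 1^2 + (1/2)^2 + (1/3)^2 + ... < +infinity; I do not know the detailed proof). In contrast, lr_t = constant = c satisfies the first condition but not the second, since the sum of c^2 diverges; even so, this does not stop lr_t = constant from being used in practice.
26-30: can be skipped
31: Off-policy learning
32-34: Importance sampling; a general understanding is enough. The point is that the behavior policy and the target policy generate the same sample with different probabilities, so this probability has to be taken into account when computing the return / TD target.
35-38: Off-policy learning, Q-learning. This is the most commonly used algorithm and must be mastered.
40: Think about why SARSA obtains a larger reward than Q-learning in the example on this slide.
41-42: Understand this summary. Regarding the TD-Learning mentioned here, Silver's slides seem to use the term specifically for the "method" applied to V(s), but more generally it refers to the family of "ideas" covered in lecture 4.
One last question: why is SARSA on-policy while Q-learning is off-policy? Try to answer it yourself and see whether you understand the difference between the two. See slides 22 and 36.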
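As a hint for that last question, the two update rules can be put side by side. This is a tabular sketch under assumed names: `Q` is a nested dict `Q[state][action]`, and the function names, step size, and toy transition are all hypothetical.

```python
import copy

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=1.0):
    # Bootstrap from a_next, the action the behavior policy actually takes next.
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.5, gamma=1.0):
    # Bootstrap from the greedy action in s_next, whatever is taken next.
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])

# Toy table (hypothetical values): in s1, "r" looks better than "l".
Q0 = {"s0": {"l": 0.0, "r": 0.0}, "s1": {"l": 1.0, "r": 3.0}}

# Same transition for both: from s0 take "r", receive -1, land in s1,
# where the epsilon-greedy behavior policy happens to explore with "l".
q_sarsa = copy.deepcopy(Q0)
sarsa_update(q_sarsa, "s0", "r", -1.0, "s1", "l")

q_ql = copy.deepcopy(Q0)
q_learning_update(q_ql, "s0", "r", -1.0, "s1")
```

SARSA's target depends on the exploratory action that was really taken, so it evaluates the policy being followed; Q-learning's target uses the max regardless of what is taken next, so it evaluates the greedy policy while following a different one.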

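An update on what needs to be mastered:

In lecture 4 I said the content after slide 30 does not need to be read for now, and in lecture 5 that slides 26-30 can be skipped. If you have time, have a quick look and try to understand the idea of eligibility traces; this part is indeed not required, so just do your best.

An update on the lecture 5 content.

The items 8, 10-11, 12, and 14-16 in my previous email were each written from the viewpoint of a single knowledge point. In fact, slides 7-16 are better read from the viewpoint of on-policy Monte-Carlo. Specifically: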

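7: Poses the problem ==> what is wrong with Monte-Carlo-based policy iteration. First, in the model-free case π* cannot be computed from V(s). Second, greedy policy improvement leaves no exploration.

8: Addresses the first problem. It explains why, in the model-free case, we generally compute the action value Q(s,a) rather than the state value V(s): our ultimate goal is to find the policy π*, and without a model π* cannot be computed from V(s) alone.

9-11: Addresses the second problem. Because greedy policy improvement leaves no exploration, it has to be replaced by ε-greedy policy improvement (the content of slide 11).

12: Applying ε-greedy policy improvement to an ε-greedy policy π yields π', which still satisfies π' >= π.

13-14: Monte-Carlo policy iteration and Monte-Carlo control. In actual control, the evaluation step only needs to reach Q ≈ Qπ.

15-16: GLIE Monte-Carlo control, a Monte-Carlo control method that is guaranteed to converge to Q*. It requires the GLIE conditions ==> (1) every (s,a) is visited infinitely often; (2) the behavior policy (the ε-greedy policy produced by the improvement step) gradually converges to a greedy policy (one can take it that only a greedy policy is optimal). The simplest way to satisfy GLIE is to set ε = 1/k (the content of slide 16).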
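The per-episode bookkeeping of GLIE Monte-Carlo control might look like the sketch below. It assumes an every-visit incremental mean with step size 1/N(s,a) and the schedule ε_k = 1/k; the function names and the `(state, action, reward)` episode format are made up for the example, and episode generation and the ε-greedy action choice are left out.

```python
from collections import defaultdict

def glie_mc_update(Q, N, episode, gamma=1.0):
    """Update Q in place from one finished episode (every-visit).

    episode is a list of (state, action, reward) triples
    (a hypothetical format chosen for this sketch).
    """
    G = 0.0
    for state, action, reward in reversed(episode):
        G = gamma * G + reward
        N[(state, action)] += 1
        # Step size 1/N(s,a) keeps Q equal to the running mean of returns.
        Q[(state, action)] += (G - Q[(state, action)]) / N[(state, action)]

def glie_epsilon(k):
    # epsilon_k = 1/k decays to 0, so the policy becomes greedy in the limit.
    return 1.0 / k

Q = defaultdict(float)
N = defaultdict(int)
glie_mc_update(Q, N, [("s", "a", 1.0), ("s", "a", 3.0)])
eps_for_episode_4 = glie_epsilon(4)
```

After episode k the policy would be reset to ε-greedy with respect to the updated Q, using `glie_epsilon(k)`.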
