《Reinforcement Learning: An Introduction》 Chapter 7 "Multi-step Bootstrapping" Summary
A new colleague has joined our group and needs an introduction to RL, so I am walking him through it starting from Silver's course.
For myself, I am adding the requirement of reading 《Reinforcement Learning: An Introduction》 carefully.
I did not read it very attentively the first time; this time I hope to do better and write a short summary of the corresponding material.
7.1 n-step TD Prediction
The methods that use n-step backups are still TD methods because they still change an earlier estimate based on how it differs from a later estimate.
n-step return:
G_{t:t+n} = R_{t+1} + γ R_{t+2} + … + γ^{n-1} R_{t+n} + γ^n V_{t+n-1}(S_{t+n})
If t + n ≥ T (if the n-step return extends to or beyond termination), then all the missing terms are taken as zero. This is easy to understand: within the last n steps, the return is just the sum of the rewards over however many steps remain. Correspondingly, there are no updates at the start of the episode either; as the book notes, "no changes at all are made during the first n-1 steps of each episode."
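The truncation rule above can be sketched in code (a minimal sketch; the trajectory layout, with rewards[i] standing for R_{i+1} and state_values[i] for V(S_i), is an assumed convention of this example, not the book's pseudocode):

```python
def n_step_return(rewards, state_values, t, n, gamma):
    """G_{t:t+n}: up to n discounted rewards plus a bootstrapped value.

    rewards[i]      -- reward R_{i+1} received on the step from time i
    state_values[i] -- current estimate V(S_i)
    If t + n reaches termination (T = len(rewards)), the missing terms
    are taken as zero: we just sum the remaining rewards and skip the
    bootstrap, so the return degrades into the ordinary MC return.
    """
    T = len(rewards)                         # episode terminates at time T
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, min(t + n, T)))
    if t + n < T:                            # bootstrap only before the end
        G += gamma ** n * state_values[t + n]
    return G
```

For t + n < T this returns rewards plus a discounted bootstrap; for t within the last n steps it simply sums whatever rewards remain.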
The n-step return often has a smaller error:
Methods that involve an intermediate amount of bootstrapping are important because they will typically perform better than either extreme.
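The full prediction procedure can be sketched as follows, following the book's pseudocode (a sketch; the env interface, reset() -> state and step(action) -> (state, reward, done), and policy(state) -> action are assumptions of this example):

```python
def n_step_td_prediction(env, policy, n, alpha, gamma, num_episodes):
    """Tabular n-step TD for estimating V ~ v_pi (a sketch)."""
    V = {}
    for _ in range(num_episodes):
        S = [env.reset()]
        R = [0.0]                       # R[0] is unused; keeps book indexing
        T = float('inf')
        t = 0
        while True:
            if t < T:
                s, r, done = env.step(policy(S[t]))
                S.append(s); R.append(r)
                if done:
                    T = t + 1
            tau = t - n + 1             # time whose state estimate is updated
            if tau >= 0:
                G = sum(gamma ** (i - tau - 1) * R[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:         # bootstrap unless we ran off the end
                    G += gamma ** n * V.get(S[tau + n], 0.0)
                v = V.get(S[tau], 0.0)
                V[S[tau]] = v + alpha * (G - v)
            if tau == T - 1:
                break
            t += 1
    return V

# Usage on a hypothetical 3-step deterministic chain (reward 1 per step):
class Chain:
    def reset(self):
        self.s = 0
        return 0
    def step(self, a):
        self.s += 1
        return self.s, 1.0, self.s == 3

V = n_step_td_prediction(Chain(), lambda s: 0, n=2, alpha=0.5,
                         gamma=1.0, num_episodes=200)
```

On this toy chain with gamma = 1, the values converge to the remaining-reward sums 3, 2, 1 for states 0, 1, 2.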
7.2 n-step SARSA
Simply replace V with Q; the overall procedure is essentially the same:
Consider why the n-step method updates the Q-table more effectively: every good or bad (s, a) pair is used n times during the updates (see the pseudocode above), so its effect is backed up to the related (s, a) pairs up to n steps earlier.
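Swapping V for Q gives n-step SARSA; a sketch under the same kind of assumed interfaces (env with reset()/step(), a flat actions list, and an ε-greedy helper, all conventions of this example rather than the book's code):

```python
import random

def n_step_sarsa(env, n, alpha, gamma, epsilon, actions, num_episodes):
    """Tabular n-step SARSA sketch with an epsilon-greedy behavior policy."""
    Q = {}

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q.get((s, a), 0.0))

    for _ in range(num_episodes):
        S = [env.reset()]
        A = [eps_greedy(S[0])]
        R = [0.0]                       # R[0] unused; keeps book indexing
        T = float('inf')
        t = 0
        while True:
            if t < T:
                s, r, done = env.step(A[t])
                S.append(s); R.append(r)
                if done:
                    T = t + 1
                else:
                    A.append(eps_greedy(s))
            tau = t - n + 1
            if tau >= 0:
                G = sum(gamma ** (i - tau - 1) * R[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:
                    G += gamma ** n * Q.get((S[tau + n], A[tau + n]), 0.0)
                # each recorded (s, a) influences updates up to n steps back
                q = Q.get((S[tau], A[tau]), 0.0)
                Q[(S[tau], A[tau])] = q + alpha * (G - q)
            if tau == T - 1:
                break
            t += 1
    return Q

# Usage on a hypothetical 3-step chain with a single action and epsilon = 0:
class Chain:
    def reset(self):
        self.s = 0
        return 0
    def step(self, a):
        self.s += 1
        return self.s, 1.0, self.s == 3

Q = n_step_sarsa(Chain(), n=2, alpha=0.5, gamma=1.0, epsilon=0.0,
                 actions=[0], num_episodes=200)
```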
7.3 n-step Off-policy Learning by Importance Sampling
The importance sampling that we have used in this section and in Chapter 5 enables off-policy learning, but at the cost of increasing the variance of the updates. The high variance forces us to use a small step-size parameter, resulting in slow learning. It is probably inevitable that off-policy training is slower than on-policy training; after all, the data is less relevant to what you are trying to learn.
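The variance problem in that quote comes from the product of per-step probability ratios; a tiny sketch of the ratio itself (the lists-of-probabilities interface is purely for illustration):

```python
def importance_ratio(target_probs, behavior_probs):
    """rho_{t:h} = product over k of pi(A_k|S_k) / b(A_k|S_k), for the
    actions actually taken under the behavior policy b. A few actions
    that are likely under pi but unlikely under b multiply the ratio up
    quickly, which is exactly what forces a small step size."""
    rho = 1.0
    for p_pi, p_b in zip(target_probs, behavior_probs):
        rho *= p_pi / p_b
    return rho
```

A single action that the target policy would never take zeroes the whole product, while several 2x ratios compound multiplicatively.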
7.4 Off-policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm
7.3/7.4 are rarely used in practice, so I skipped them. Eligibility traces (Chapter 12) are the key topic to master within the n-step family of methods.
The material after slide 30 of Lecture 4 and slides 26-30 of Lecture 5 in Silver's course introduce n-step algorithms and eligibility traces very clearly and simply.
forward-view TD(λ): the λ-return averages all the n-step returns,
G_t^λ = (1 - λ) Σ_{n=1}^∞ λ^{n-1} G_t^{(n)}
backward-view TD(λ): each state keeps an eligibility trace E_t(s) = γλ E_{t-1}(s) + 1(S_t = s), and every step's TD error δ_t = R_{t+1} + γ V(S_{t+1}) - V(S_t) updates all states in proportion to their traces: V(s) ← V(s) + α δ_t E_t(s)
Forward view provides theory
Backward view provides mechanism
Update online, every step, from incomplete sequences
With λ = 0, TD(λ) is exactly the TD(0) method introduced earlier; with λ = 1, forward-view TD(λ) is exactly the MC method introduced earlier, and backward-view online TD(λ) approximates the MC method.
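The backward-view mechanism can be sketched with accumulating eligibility traces over a recorded episode (the trajectory layout is an assumption of this example; the real point of the backward view is that the identical update can be applied fully online, step by step):

```python
def td_lambda_episode(states, rewards, V, alpha, gamma, lam):
    """One episode of backward-view TD(lambda) with accumulating traces.

    states[0..T] with states[T] terminal; rewards[i] stands for R_{i+1}.
    Updates the value table V in place and returns it.
    """
    E = {}                                   # eligibility traces
    T = len(rewards)
    for t in range(T):
        s, s_next = states[t], states[t + 1]
        v_next = V.get(s_next, 0.0) if t + 1 < T else 0.0  # terminal value 0
        delta = rewards[t] + gamma * v_next - V.get(s, 0.0)  # TD error
        E[s] = E.get(s, 0.0) + 1.0           # bump trace for the visited state
        for x in E:                          # every traced state shares delta
            V[x] = V.get(x, 0.0) + alpha * delta * E[x]
            E[x] *= gamma * lam              # then all traces decay
    return V
```

With lam = 0.0, only the current state gets a nonzero trace by update time, recovering TD(0); with lam = 1.0 and gamma = 1.0, starting from V = 0 the per-state updates sum to the MC update for this episode.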
forward-view SARSA(λ):
backward-view SARSA(λ):
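The control version keeps one trace per (state, action) pair; a sketch of backward-view SARSA(λ) over a recorded episode (the transition-tuple format below is an assumption of this example):

```python
def sarsa_lambda_episode(transitions, Q, alpha, gamma, lam):
    """One episode of backward-view SARSA(lambda) with accumulating
    traces on (state, action) pairs; updates Q in place and returns it.

    transitions: list of (s, a, r, s_next, a_next, done) tuples recorded
    while following the behavior policy.
    """
    E = {}
    for s, a, r, s_next, a_next, done in transitions:
        q_next = 0.0 if done else Q.get((s_next, a_next), 0.0)
        delta = r + gamma * q_next - Q.get((s, a), 0.0)   # SARSA TD error
        E[(s, a)] = E.get((s, a), 0.0) + 1.0              # bump visited pair
        for key in E:                    # every traced pair shares delta
            Q[key] = Q.get(key, 0.0) + alpha * delta * E[key]
            E[key] *= gamma * lam        # then all traces decay
    return Q
```

This is the mechanism behind the earlier remark about n-step SARSA: a single TD error propagates credit to every (s, a) pair still carrying a trace, rather than only to the most recent one.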