Continuous Multi-Step TD, Eligibility Traces and TD(λ): A brief note
Thanks to Richard S. Sutton and Andrew G. Barto for their great work, Reinforcement Learning: An Introduction.
We focus on the episodic case only and deal with continuous state and action spaces. We assume you already have basic knowledge of reinforcement learning and value-function approximation.
Stochastic-gradient and Semi-gradient Methods
Stochastic-gradient methods are among the most widely used of all function approximation methods and are particularly well suited to online reinforcement learning.
Let $\theta \in \mathbb{R}^d$ denote the weight vector of a differentiable approximate value function $\hat v(s,\theta)$. On each step, SGD adjusts the weights a small amount in the direction that most reduces the squared error on the observed example:

$$\theta_{t+1} = \theta_t + \alpha\left[v_\pi(S_t) - \hat v(S_t,\theta_t)\right]\nabla\hat v(S_t,\theta_t),$$

where $\alpha$ is a positive step-size parameter.
SGD methods are gradient-descent methods because the overall step in $\theta_t$ is proportional to the negative gradient of the example's squared error $\left[v_\pi(S_t) - \hat v(S_t,\theta_t)\right]^2$, which is the direction in which the error falls most rapidly.
Why do we take only a small step in the direction of the gradient? We do not seek a value function that has zero error for all states, but only an approximation that balances the errors in different states. In fact, the convergence results for SGD methods assume that $\alpha$ decreases over time so as to satisfy the standard stochastic-approximation conditions $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$, in which case SGD is guaranteed to converge to a local optimum.
However, we do not know the true value function $v_\pi(S_t)$, so in practice we substitute a target $U_t$ for it. If $U_t$ is an unbiased estimate, that is, if $\mathbb{E}[U_t \mid S_t] = v_\pi(S_t)$ (as with the Monte Carlo target $U_t = G_t$), then $\theta_t$ is still guaranteed to converge to a local optimum under the same step-size conditions.
If a bootstrapping estimate of $v_\pi(S_t)$ is used as the target $U_t$, this guarantee is lost: a bootstrapped target such as $R_{t+1} + \gamma\hat v(S_{t+1},\theta_t)$ depends on the current weight vector, yet the update takes the gradient only of the estimate $\hat v(S_t,\theta_t)$ and ignores the target's dependence on $\theta_t$. Because such updates include only a part of the true gradient, we call them semi-gradient methods.
Although semi-gradient methods do not converge as robustly as gradient methods, they do converge reliably in important cases such as the linear case.
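As a concrete illustration, here is a minimal Python sketch of both kinds of update with linear function approximation; the feature vectors `phi_s`, `phi_s_next` and the return `g_return` are assumed to be supplied by the caller:

```python
import numpy as np

def gradient_mc_update(theta, phi_s, g_return, alpha):
    """Full-gradient Monte Carlo update: the target G_t does not depend
    on theta, so this is true SGD on the squared error."""
    v_hat = float(theta @ phi_s)                # linear v̂(s,θ) = θᵀφ(s)
    return theta + alpha * (g_return - v_hat) * phi_s   # ∇v̂(s,θ) = φ(s)

def semi_gradient_td0_update(theta, phi_s, phi_s_next, reward, gamma, alpha):
    """Semi-gradient TD(0) update: the bootstrapped target R + γ v̂(S',θ)
    also depends on theta, but that dependence is ignored when
    differentiating -- hence 'semi-gradient'."""
    target = reward + gamma * float(theta @ phi_s_next)
    return theta + alpha * (target - float(theta @ phi_s)) * phi_s
```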
Episodic Semi-gradient Control
In the view of control instead of prediction, the action-value function $\hat q(s,a,\theta) \approx q_\pi(s,a)$ is approximated rather than the state-value function. The general gradient-descent update for action-value prediction is

$$\theta_{t+1} = \theta_t + \alpha\left[U_t - \hat q(S_t,A_t,\theta_t)\right]\nabla\hat q(S_t,A_t,\theta_t).$$
For example, the update for the one-step SARSA method is

$$\theta_{t+1} = \theta_t + \alpha\left[R_{t+1} + \gamma\hat q(S_{t+1},A_{t+1},\theta_t) - \hat q(S_t,A_t,\theta_t)\right]\nabla\hat q(S_t,A_t,\theta_t).$$
We call this method episodic semi-gradient one-step SARSA. For a constant policy, this method converges in the same way that one-step TD prediction does, with the same kind of error bound.
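A minimal sketch of this update, assuming a linear action-value function $\hat q(s,a,\theta) = \theta^\top\phi(s,a)$ with feature vectors provided by the caller:

```python
import numpy as np

def semi_gradient_sarsa_update(theta, phi_sa, phi_sa_next, reward,
                               gamma, alpha, terminal):
    """One-step semi-gradient SARSA with linear q̂(s,a,θ) = θᵀφ(s,a).
    phi_sa and phi_sa_next are feature vectors for (S,A) and (S',A')."""
    q = float(theta @ phi_sa)
    target = reward if terminal else reward + gamma * float(theta @ phi_sa_next)
    return theta + alpha * (target - q) * phi_sa
```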
n-Step Semi-gradient SARSA
We can obtain an $n$-step version of episodic semi-gradient SARSA by using an $n$-step return as the update target. The $n$-step return is

$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1}R_{t+n} + \gamma^n \hat q(S_{t+n},A_{t+n},\theta_{t+n-1}), \qquad t+n < T,$$

with $G_{t:t+n} = G_t$ if $t+n \ge T$, and the $n$-step update is

$$\theta_{t+n} = \theta_{t+n-1} + \alpha\left[G_{t:t+n} - \hat q(S_t,A_t,\theta_{t+n-1})\right]\nabla\hat q(S_t,A_t,\theta_{t+n-1}).$$
The performance is typically best when an intermediate level of bootstrapping is used, corresponding to an $n$ larger than 1 but smaller than the episode length.
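A sketch of how the $n$-step target could be computed, assuming the caller buffers the last $n$ rewards and the bootstrap feature vector:

```python
import numpy as np

def n_step_sarsa_target(rewards, phi_sa_boot, theta, gamma, done):
    """n-step return G_{t:t+n}: n discounted rewards plus a bootstrapped
    tail γ^n q̂(S_{t+n}, A_{t+n}, θ), dropped if the episode ended early.
    `rewards` holds R_{t+1}, ..., R_{t+n}; q̂ is linear: θᵀφ(s,a)."""
    g = sum(gamma**i * r for i, r in enumerate(rewards))
    if not done:
        g += gamma**len(rewards) * float(theta @ phi_sa_boot)
    return g

# Hypothetical usage: a 3-step target with 4 features.
theta = np.zeros(4)
print(n_step_sarsa_target([1.0, 0.0, 1.0], np.ones(4), theta, 0.9, False))
```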
The λ-Return
To begin with, we note that a valid update can be done not just toward any $n$-step return, but toward any average of $n$-step returns. For example, an update can be done toward a target that is half of a two-step return and half of a four-step return: $\frac{1}{2}G_{t:t+2} + \frac{1}{2}G_{t:t+4}$. Such a compound update can only be made when the longest of its component returns is complete.
The $\lambda$-return is defined as

$$G_t^\lambda = (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}G_{t:t+n},$$

which can be understood as one particular way of averaging $n$-step returns: each $n$-step return is weighted proportionally to $\lambda^{n-1}$, and the weights are normalized by the factor $1-\lambda$ so that they sum to one. In the episodic case, all $n$-step returns with $t+n \ge T$ equal the conventional return $G_t$, so

$$G_t^\lambda = (1-\lambda)\sum_{n=1}^{T-t-1}\lambda^{n-1}G_{t:t+n} + \lambda^{T-t-1}G_t.$$
Now it is time to define our first learning algorithm based on the $\lambda$-return: the offline $\lambda$-return algorithm. It makes no changes to the weight vector during the episode; then, at the end of the episode, a whole sequence of offline updates is made according to the usual semi-gradient rule, using the $\lambda$-return as the target:

$$\theta_{t+1} = \theta_t + \alpha\left[G_t^\lambda - \hat v(S_t,\theta_t)\right]\nabla\hat v(S_t,\theta_t), \qquad t = 0,\ldots,T-1.$$
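For intuition, here is a small sketch that assembles the λ-return from a buffer of n-step returns, assumed precomputed for a single state:

```python
def lambda_return(n_step_returns, lam):
    """λ-return from the list [G_{t:t+1}, ..., G_{t:T}] for one state.
    The final entry is the full Monte Carlo return G_t; it absorbs all
    the remaining weight λ^{T-t-1}."""
    g_lambda = 0.0
    for n, g in enumerate(n_step_returns[:-1], start=1):
        g_lambda += (1.0 - lam) * lam**(n - 1) * g
    g_lambda += lam**(len(n_step_returns) - 1) * n_step_returns[-1]
    return g_lambda

# With λ = 0 only the one-step return matters; with λ = 1 only G_t.
print(lambda_return([0.5, 0.7, 1.0], lam=0.0))   # -> 0.5
print(lambda_return([0.5, 0.7, 1.0], lam=1.0))   # -> 1.0
```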
This approach is what we call the theoretical, or forward, view of a learning algorithm: for each state visited, we look forward in time to all the future rewards and decide how best to combine them.
Eligibility Traces
Eligibility traces are one of the basic mechanisms of reinforcement learning. Almost any temporal-difference method, such as Q-learning or SARSA, can be combined with eligibility traces to obtain a more general method that may learn more efficiently. What eligibility traces offer beyond $n$-step methods is an elegant algorithmic mechanism with significant computational advantages: only a single trace vector is required rather than a store of the last $n$ feature vectors, and learning occurs continually and uniformly in time rather than being delayed to the end of the episode.
The mechanism is a short-term memory vector, the eligibility trace $e_t \in \mathbb{R}^d$, that parallels the long-term weight vector $\theta_t$.
The idea behind eligibility traces is very simple. Each time a state is visited it initializes a short-term memory process, a trace, which then decays gradually over time. This trace marks the state as eligible for learning. If an unexpectedly good or bad event occurs while the trace is non-zero, the state is assigned credit accordingly. There are two kinds of traces: the accumulating trace and the replacing trace. In an accumulating trace, the trace builds up additively each time the state is entered. In a replacing trace, on the other hand, each time the state is visited the trace is reset to 1 regardless of any prior trace. Typically, eligibility traces decay exponentially according to the product of a decay parameter $\lambda \in [0,1]$ and the discount factor $\gamma$. In the tabular case, the accumulating trace is

$$e_t(s) = \begin{cases} \gamma\lambda e_{t-1}(s) + 1 & \text{if } s = S_t, \\ \gamma\lambda e_{t-1}(s) & \text{otherwise,} \end{cases}$$

where $\gamma$ is the discount rate and $\lambda$ is the trace-decay parameter.
In the gradient-descent case, the (accumulating) eligibility trace is initialized to zero at the beginning of the episode, is incremented on each time step by the value gradient, and then fades away by $\gamma\lambda$:

$$e_0 = 0, \qquad e_t = \gamma\lambda e_{t-1} + \nabla\hat v(S_t,\theta_t).$$
The eligibility trace keeps track of which components of the weight vector have contributed, positively or negatively, to recent state valuations, where "recent" is defined in terms of $\gamma\lambda$.
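A tiny demonstration of the accumulating-trace recursion, assuming a linear value function so that the gradient is simply the (hypothetical, one-hot) feature vector:

```python
import numpy as np

gamma, lam = 0.9, 0.8
e = np.zeros(4)
visits = [np.array([1., 0., 0., 0.]),   # visit a state with feature 0 active
          np.array([0., 1., 0., 0.]),   # then one with feature 1 active
          np.array([1., 0., 0., 0.])]   # revisit the first state
for phi in visits:
    e = gamma * lam * e + phi           # decay by γλ, then accumulate ∇v̂ = φ
    print(e)                            # older entries fade by γλ per step
```

Note how the revisited component ends up above 1: the old, decayed trace and the new visit add together, which is exactly what distinguishes the accumulating trace from the replacing trace.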
TD(λ) Methods
TD(λ) improves over the offline λ-return algorithm in three ways:
- It updates the weight vector on every step of an episode rather than only at the end.
- Its computations are equally distributed in time rather than all at the end of the episode.
- It can be applied to continuing problems rather than just episodic problems.
To begin with, the TD error for state-value prediction is

$$\delta_t = R_{t+1} + \gamma\hat v(S_{t+1},\theta_t) - \hat v(S_t,\theta_t).$$

In TD(λ), the weight vector is updated on every step in proportion to the scalar TD error and the vector eligibility trace:

$$\theta_{t+1} = \theta_t + \alpha\delta_t e_t.$$

The complete algorithm, semi-gradient TD(λ) for estimating $\hat v \approx v_\pi$, is:
```
Initialize S, e = 0
Repeat (for each step of the episode):
    Choose A ∼ π(·|S); take action A, observe R, S′
    e = γλe + ∇v̂(S, θ)
    δ = R + γ v̂(S′, θ) − v̂(S, θ)
    θ = θ + αδe
    S = S′
until S′ is terminal
```
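Below is a sketch of the same algorithm in Python with linear features and accumulating traces; the Gym-style `reset()`/`step()` environment interface, `policy`, and `phi` are all assumptions, not part of the original note:

```python
import numpy as np

def semi_gradient_td_lambda(env, policy, phi, n_features,
                            alpha=0.01, gamma=0.99, lam=0.9,
                            n_episodes=100):
    """Semi-gradient TD(λ) with accumulating traces and linear
    v̂(s,θ) = θᵀφ(s). `env` is assumed to expose reset()/step(),
    `policy(s)` returns an action, and `phi(s)` a feature vector."""
    theta = np.zeros(n_features)
    for _ in range(n_episodes):
        s = env.reset()
        e = np.zeros(n_features)            # eligibility trace, e = 0
        done = False
        while not done:
            a = policy(s)                   # A ~ π(.|S)
            s_next, r, done, _ = env.step(a)
            e = gamma * lam * e + phi(s)    # accumulate trace (∇v̂ = φ)
            v_next = 0.0 if done else float(theta @ phi(s_next))
            delta = r + gamma * v_next - float(theta @ phi(s))
            theta += alpha * delta * e      # update all eligible weights
            s = s_next
    return theta
```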
In another variant of the algorithm, the eligibility traces are updated according to

$$e_t(s) = \begin{cases} 1 & \text{if } s = S_t, \\ \gamma\lambda e_{t-1}(s) & \text{otherwise.} \end{cases}$$

This is called the replacing-traces update.
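A sketch of the replacing-trace variant for binary feature vectors (an assumption; with general real-valued features the replacing trace is less standard):

```python
import numpy as np

def replace_trace(e, phi, gamma, lam):
    """Replacing-trace update, assuming binary feature vectors: decay
    all components by γλ, then set the components of the currently
    active features to 1 instead of adding to them."""
    e = gamma * lam * e
    e = np.where(phi > 0, 1.0, e)    # replace rather than accumulate
    return e
```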