Continuous Multi-Step TD, Eligibility Traces and TD(λ): A brief note


Thanks to Richard S. Sutton and Andrew G. Barto for their great work, Reinforcement Learning: An Introduction.

We focus on the episodic case only and deal with continuous state and action spaces. We assume you already have basic knowledge of TD(0) methods in reinforcement learning.

Stochastic-gradient and Semi-gradient Methods

Stochastic-gradient methods are among the most widely used of all function approximation methods and are particularly well suited to online reinforcement learning.

Let $\hat{v}(s,\boldsymbol{\theta})$ be the approximation of the true value function $v_\pi(s)$ for all $s \in \mathcal{S}$, where the weight vector $\boldsymbol{\theta} = (\theta_1, \theta_2, \dots, \theta_n)^T$ has real-valued components. We assume that states appear in examples with the same distribution $d$, over which we are trying to minimize the Mean Squared Value Error,

$$\mathrm{MSVE}(\boldsymbol{\theta}) = \sum_{s \in \mathcal{S}} d(s)\,\big[v_\pi(s) - \hat{v}(s,\boldsymbol{\theta})\big]^2.$$

A straightforward strategy is to try to minimize the error on the observed examples. Stochastic-gradient methods do this by adjusting the weight vector after each example by a small amount in the direction that would most reduce the error on that example:

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \tfrac{1}{2}\alpha \nabla_{\boldsymbol{\theta}} \big[v_\pi(S_t) - \hat{v}(S_t,\boldsymbol{\theta}_t)\big]^2 = \boldsymbol{\theta}_t + \alpha \big[v_\pi(S_t) - \hat{v}(S_t,\boldsymbol{\theta}_t)\big]\,\nabla \hat{v}(S_t,\boldsymbol{\theta}_t)$$

SGD methods are gradient descent methods because the overall step in $\boldsymbol{\theta}_t$ is proportional to the negative gradient of the example's squared error. This is the direction in which the error falls most rapidly. Meanwhile, stochastic means that the update is done on only a single example, which might have been selected stochastically.

Why do we take only a small step in the direction of the gradient? We do not seek a value function that has zero error for all states, but only an approximation that balances the errors in different states. In fact, the convergence results for SGD methods assume that $\alpha$ decreases over time. Under that condition, the SGD method is guaranteed to converge to a local optimum.

However, we do not know the true value function $v_\pi(s)$ in practice. Hence it is necessary to substitute an estimate $U_t$ in place of $v_\pi(S_t)$. If $U_t$ is an unbiased estimate, that is, if $\mathbb{E}[U_t] = v_\pi(S_t)$ for each $t$, then $\boldsymbol{\theta}_t$ is guaranteed to converge to a local optimum. For example, the Monte Carlo target $U_t = G_t$ is by definition an unbiased estimate of $v_\pi(S_t)$.
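As a concrete illustration, here is a minimal sketch of the gradient Monte Carlo update with linear function approximation, assuming a hypothetical feature function `features(s)` so that $\hat{v}(s,\boldsymbol{\theta}) = \boldsymbol{\theta}^T x(s)$ and an episode buffer of `(state, reward)` pairs; none of these names come from the text.

```python
import numpy as np

def gradient_mc_update(theta, episode, features, alpha=0.01, gamma=1.0):
    """One sweep of gradient Monte Carlo updates over a finished episode.

    episode:  list of (state, reward) pairs, where reward is R_{t+1}
              received after leaving state S_t.
    features: maps a state to a feature vector x(s); with linear
              approximation, v_hat(s, theta) = theta @ x(s) and the
              gradient of v_hat is simply x(s).
    """
    G = 0.0
    # Work backwards so the return G_t can be accumulated incrementally.
    for state, reward in reversed(episode):
        G = reward + gamma * G                    # G_t = R_{t+1} + gamma * G_{t+1}
        x = features(state)
        v_hat = theta @ x                         # current estimate v_hat(S_t, theta)
        theta = theta + alpha * (G - v_hat) * x   # SGD step toward the MC target
    return theta
```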

If a bootstrapping estimate of $v_\pi(S_t)$ is used as the target $U_t$ (such as $U_t = R_{t+1} + \gamma \hat{v}(S_{t+1},\boldsymbol{\theta}_t)$), one does not obtain the same guarantees, because the target itself depends on $\boldsymbol{\theta}_t$ rather than being independent of it. Bootstrapping methods take into account the effect of changing the weight vector $\boldsymbol{\theta}_t$ on the estimate, but ignore its effect on the target. They include only a part of the gradient and, accordingly, we call them semi-gradient methods.

Although semi-gradient methods do not converge as robustly as gradient methods, they do converge reliably in important cases such as the linear case.
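For comparison, a minimal sketch of a semi-gradient TD(0) step for a single transition, under the same hypothetical linear setup as above; note that the gradient is taken only through the estimate, not through the bootstrapped target.

```python
def semi_gradient_td0_step(theta, s, r, s_next, done, features,
                           alpha=0.01, gamma=1.0):
    """One semi-gradient TD(0) update for a single transition (S, R, S')."""
    x = features(s)
    v_hat = theta @ x
    # Bootstrapped target: no gradient is taken through this term,
    # which is what makes the method "semi"-gradient.
    target = r if done else r + gamma * (theta @ features(s_next))
    return theta + alpha * (target - v_hat) * x
```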

Episodic Semi-gradient Control

From the viewpoint of control rather than prediction, the action-value function $q_\pi(s,a)$ is more relevant to the application than the state-value function $v_\pi(s)$. The general gradient-descent update for action-value prediction is

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha \big[U_t - \hat{q}(S_t,A_t,\boldsymbol{\theta}_t)\big]\,\nabla \hat{q}(S_t,A_t,\boldsymbol{\theta}_t)$$

For example, the update for the one-step SARSA method is

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha \big[R_{t+1} + \gamma \hat{q}(S_{t+1},A_{t+1},\boldsymbol{\theta}_t) - \hat{q}(S_t,A_t,\boldsymbol{\theta}_t)\big]\,\nabla \hat{q}(S_t,A_t,\boldsymbol{\theta}_t)$$

We call this method episodic semi-gradient one-step SARSA. For a constant policy, this method converges in the same way that TD(0) does, with the same kind of error bound: $\mathrm{MSVE}(\boldsymbol{\theta}_{TD}) \leq \frac{1}{1-\gamma}\,\min_{\boldsymbol{\theta}} \mathrm{MSVE}(\boldsymbol{\theta})$.
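A minimal sketch of this update, assuming a hypothetical state–action feature function `features_sa(s, a)` so that $\hat{q}(s,a,\boldsymbol{\theta}) = \boldsymbol{\theta}^T x(s,a)$:

```python
def semi_gradient_sarsa_step(theta, s, a, r, s_next, a_next, done,
                             features_sa, alpha=0.01, gamma=1.0):
    """One episodic semi-gradient one-step SARSA update."""
    x = features_sa(s, a)
    q_hat = theta @ x
    # Bootstrapped SARSA target; no gradient flows through it.
    target = r if done else r + gamma * (theta @ features_sa(s_next, a_next))
    return theta + alpha * (target - q_hat) * x
```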

n-Step Semi-gradient SARSA

We can obtain an n-step version of episodic semi-gradient SARSA by using an n-step return as the update target.

$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{q}(S_{t+n},A_{t+n},\boldsymbol{\theta}_{t+n-1}), \qquad n \geq 1,\ 0 \leq t < T-n$$

The n-step update equation is

$$\boldsymbol{\theta}_{t+n} = \boldsymbol{\theta}_{t+n-1} + \alpha \big[G_t^{(n)} - \hat{q}(S_t,A_t,\boldsymbol{\theta}_{t+n-1})\big]\,\nabla \hat{q}(S_t,A_t,\boldsymbol{\theta}_{t+n-1})$$

Performance is typically best with an intermediate level of bootstrapping, corresponding to an $n$ larger than 1.
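A sketch of how the n-step target might be assembled from buffered experience; the `rewards` buffer, the `bootstrap` flag, and the reuse of `features_sa` are assumptions of this illustration, not part of the original text.

```python
def n_step_return(rewards, s_n, a_n, theta, features_sa, gamma, bootstrap=True):
    """Compute G_t^(n) from the buffered rewards R_{t+1}, ..., R_{t+n}.

    If the episode terminated before reaching t+n, call with
    bootstrap=False so the final bootstrapping term is dropped.
    """
    G = 0.0
    for k, r in enumerate(rewards):            # k = 0 .. n-1
        G += (gamma ** k) * r                  # gamma^k * R_{t+k+1}
    if bootstrap:
        G += (gamma ** len(rewards)) * (theta @ features_sa(s_n, a_n))
    return G

# The update itself reuses the semi-gradient step, e.g.
#   x = features_sa(s_t, a_t)
#   theta = theta + alpha * (G - theta @ x) * x
```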

The λ-Return

To begin with, we note that a valid update can be done not just toward any n-step return, but toward any average of n-step returns. The composite return possesses an error reduction property similar to that of individual n-step returns and thus can be used to construct backups with guaranteed convergence properties. A backup that averages simpler component backups is called a compound backup.

The λ-return is defined by

$$G_t^\lambda = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$$

which can be understood as one particular way of averaging n-step backups. This average contains all the n-step backups, each weighted proportionally to $\lambda^{n-1}$, where $\lambda \in [0,1]$, and normalized by a factor of $1-\lambda$ to ensure that the weights sum to 1. After a terminal state (if one exists) has been reached, all subsequent n-step returns are equal to $G_t$. If we want, we can separate these post-termination terms from the main sum, yielding

$$G_t^\lambda = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_t^{(n)} + \lambda^{T-t-1} G_t$$
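A sketch of how the truncated λ-return could be assembled once an episode has finished; `n_step_returns` is a hypothetical list holding $G_t^{(1)}, \dots, G_t^{(T-t-1)}$ and `G` is the full Monte Carlo return $G_t$.

```python
def lambda_return(n_step_returns, G, lam):
    """G_t^lambda = (1-lam) * sum_{n=1}^{T-t-1} lam^(n-1) * G_t^(n)
                    + lam^(T-t-1) * G_t."""
    G_lambda = 0.0
    for n, G_n in enumerate(n_step_returns, start=1):
        G_lambda += (1 - lam) * (lam ** (n - 1)) * G_n
    # All post-termination n-step returns equal G_t; their weights sum
    # to lam^(T-t-1), so they collapse into a single term.
    G_lambda += (lam ** len(n_step_returns)) * G
    return G_lambda
```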

Now it is time to define our first learning algorithm based on the λ-return: the off-line λ-return algorithm. As an off-line algorithm, it makes no changes to the weight vector during the episode. Then, at the end of the episode, a whole sequence of off-line updates is made according to the usual semi-gradient rule:

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha \big[G_t^\lambda - \hat{v}(S_t,\boldsymbol{\theta}_t)\big]\,\nabla \hat{v}(S_t,\boldsymbol{\theta}_t), \qquad t = 0, \dots, T-1$$

The approach is what we call the theoretical, or forward view of a learning algorithm. For each state visited, we look forward in time to all the future rewards and decide how best to combine them.
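Putting the pieces together, a sketch of the off-line λ-return algorithm for state-value prediction: the weights are frozen during the episode, and after termination one semi-gradient update toward $G_t^\lambda$ is made for every visited state. The `targets` list is assumed to hold the λ-returns precomputed with a helper like `lambda_return` above, and `features` is the hypothetical linear feature map from earlier sketches.

```python
def offline_lambda_return_episode(theta, states, targets, features, alpha=0.01):
    """states[t] is S_t; targets[t] is the precomputed lambda-return G_t^lambda."""
    for s, G_lam in zip(states, targets):
        x = features(s)
        # Forward-view semi-gradient update toward the lambda-return.
        theta = theta + alpha * (G_lam - theta @ x) * x
    return theta
```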

Eligibility Traces

Eligibility traces are one of the basic mechanisms of reinforcement learning. Almost any temporal-difference method, such as Q-learning or SARSA, can be combined with eligibility traces to obtain a more general method that may learn more efficiently. What eligibility traces offer beyond n-step methods is an elegant algorithmic mechanism with significant computational advantages. Only a single trace vector is required rather than a store of the last n feature vectors. Learning also occurs continually and uniformly in time rather than being delayed and then catching up at the end of the episode.

The mechanism is a short-term memory vector, the eligibility trace $\boldsymbol{e}_t \in \mathbb{R}^n$, that parallels the long-term memory weight vector $\boldsymbol{\theta}_t \in \mathbb{R}^n$. The rough idea is that when a component of $\boldsymbol{\theta}_t$ participates in producing an estimated value, the corresponding component of $\boldsymbol{e}_t$ is bumped up and then begins to fade away. Learning then occurs in that component of $\boldsymbol{\theta}_t$ if a nonzero TD error occurs before the trace falls back to zero. The trace-decay parameter $\lambda \in [0,1]$ determines the rate at which the trace falls.

The idea behind eligibility traces is very simple. Each time a state is visited it initializes a short-term memory process, a trace, which then decays gradually over time. This trace marks the state as eligible for learning. If an unexpectedly good or bad event occurs while the trace is non-zero, then the state is assigned credit accordingly. There are two kinds of traces: the accumulating trace and the replacing trace. In an accumulating trace, the trace builds up each time the state is entered. In a replacing trace, on the other hand, each time the state is visited the trace is reset to 1 regardless of the presence of a prior trace. Typically, eligibility traces decay exponentially according to the product of a decay parameter $\lambda$, $0 \leq \lambda \leq 1$, and a discount-rate parameter $\gamma$, $0 \leq \gamma \leq 1$. The accumulating trace is defined by:

$$e_{t+1}(s) = \begin{cases} \gamma\lambda\, e_t(s), & s \neq s_t \\ \gamma\lambda\, e_t(s) + 1, & s = s_t \end{cases}$$

where $e_t(s)$ represents the trace for state $s$ at time $t$, and $s_t$ is the actual state at time $t$. The corresponding replacing trace is defined by:

$$e_{t+1}(s) = \begin{cases} \gamma\lambda\, e_t(s), & s \neq s_t \\ 1, & s = s_t \end{cases}$$
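A sketch of the two tabular trace updates for a single time step; `e` is assumed to be a dict holding a trace entry for every state, and `s_t` is the state actually visited.

```python
def decay_and_bump(e, s_t, gamma, lam, replacing=False):
    """Decay every trace by gamma*lambda, then update the visited state's trace."""
    for s in e:
        e[s] *= gamma * lam        # all traces fade
    if replacing:
        e[s_t] = 1.0               # replacing trace: reset to 1
    else:
        e[s_t] += 1.0              # accumulating trace: build up
    return e
```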

In the gradient-descent case, the (accumulating) eligibility trace is initialized to zero at the beginning of the episode, is incremented on each time step by the value gradient, and then fades away by $\gamma\lambda$:

$$\boldsymbol{e}_0 = \boldsymbol{0}, \qquad \boldsymbol{e}_t = \nabla \hat{v}(S_t,\boldsymbol{\theta}_t) + \gamma\lambda\, \boldsymbol{e}_{t-1}$$

The eligibility trace keeps track of which components of the weight vector have contributed, positively or negatively, to recent state valuations, where "recent" is defined in terms of $\gamma\lambda$.

TD(λ) Methods

TD(λ) improves over the off-line λ-return algorithm in three ways:

  1. It updates the weight vector on every step of an episode rather than only at the end.
  2. Its computations are equally distributed in time rather than all at the end of the episode.
  3. It can be applied to continuing problems rather than just episodic problems.

To begin with, the TD error for state-value prediction is:

$$\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1},\boldsymbol{\theta}_t) - \hat{v}(S_t,\boldsymbol{\theta}_t)$$

In TD(λ), the weight vector is updated on each step proportional to the scalar TD error and the vector eligibility trace:
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha \delta_t \boldsymbol{e}_t$$

TD(λ) is oriented backward in time. At each moment we look at the current TD error and assign it backward to each prior state according to how much that state contributed to the current eligibility trace at that time. The complete update rule of TD(λ) (with so-called accumulating traces) is as follows:

  1. Initialize $\boldsymbol{e} = \boldsymbol{0}$ at the start of the episode
  2. Choose $A \sim \pi(\cdot \mid S)$
  3. Take action $A$, observe $R$, $S'$
  4. $\boldsymbol{e} = \gamma\lambda\, \boldsymbol{e} + \nabla \hat{v}(S,\boldsymbol{\theta})$
  5. $\delta = R + \gamma \hat{v}(S',\boldsymbol{\theta}) - \hat{v}(S,\boldsymbol{\theta})$
  6. $\boldsymbol{\theta} = \boldsymbol{\theta} + \alpha\delta\boldsymbol{e}$
  7. $S = S'$

Steps 2–7 are repeated for each step of the episode until $S$ is terminal.

In another variant of the algorithm, the eligibility traces are updated according to:

$$\boldsymbol{e} = \max\big(\gamma\lambda\, \boldsymbol{e},\ \nabla \hat{v}(S,\boldsymbol{\theta})\big)$$

where the maximum is taken component-wise. This is called the replacing-traces update.
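The sketch below puts these pieces together into a single-episode semi-gradient TD(λ) loop with linear function approximation. The `env`, `policy`, and `features` objects are hypothetical stand-ins (with `env.step` assumed to return a `(next_state, reward, done)` tuple), and the replacing variant uses the component-wise maximum described above.

```python
import numpy as np

def td_lambda_episode(theta, env, policy, features,
                      alpha=0.01, gamma=1.0, lam=0.9, replacing=False):
    """Run one episode of semi-gradient TD(lambda) for state-value prediction."""
    e = np.zeros_like(theta)                     # eligibility trace, reset each episode
    s = env.reset()
    done = False
    while not done:
        a = policy(s)                            # choose A ~ pi(.|S)
        s_next, r, done = env.step(a)            # take A, observe R, S'
        x = features(s)                          # gradient of v_hat(S, theta) is x(S)
        if replacing:
            e = np.maximum(gamma * lam * e, x)   # replacing traces (component-wise max)
        else:
            e = gamma * lam * e + x              # accumulating traces
        v = theta @ x
        v_next = 0.0 if done else theta @ features(s_next)
        delta = r + gamma * v_next - v           # TD error
        theta = theta + alpha * delta * e        # backward-view update
        s = s_next
    return theta
```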

TD(0) is the case in which only the one state preceding the current one is changed by the TD error. For larger values of λ, but still λ<1, more of the preceding states are changed but each more temporally distant state is changed less because the corresponding eligibility trace is smaller. We say that the earlier states are given less credit for the TD error.

TD(1) is a way of implementing Monte Carlo algorithms. On-line TD(1) learns in an n-step TD way from the incomplete ongoing episode, where the n steps are all the way up to the current step. If something unusually good or bad happens during an episode, control methods based on TD(1) can learn immediately and alter their behavior on that same episode.
