Importance Sampling in Reinforcement Learning


Thanks to Sutton and Barto for their great book, *Reinforcement Learning: An Introduction*.

Almost all off-policy reinforcement learning methods utilize importance sampling, a general technique for estimating expected values under one distribution given samples from another. We apply it by weighting returns according to the relative probability of their trajectories occurring under the target and behavior policies, called the importance-sampling ratio.

Given a trajectory $S_t, A_t, S_{t+1}, A_{t+1}, \dots, S_T$ occurring under policy $\pi$, its probability is

$$\prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k),$$

where $p$ is the state-transition probability function. Thus, the relative probability of the trajectory under the target and behavior policies (the importance-sampling ratio) is

$$\rho_t^T = \frac{\prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)}{\prod_{k=t}^{T-1} \mu(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{\mu(A_k \mid S_k)}.$$
Note that the importance-sampling ratio depends only on the two policies and not at all on the MDP: the transition probabilities appear identically in the numerator and denominator and cancel.
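Because the transition probabilities cancel, the ratio can be computed from the two policies alone. A minimal sketch in Python (the function name and the callable policy representation are illustrative assumptions, not from the book):

```python
def importance_ratio(states, actions, target_pi, behavior_mu):
    """Product of pi(A_k|S_k) / mu(A_k|S_k) along a trajectory.

    target_pi and behavior_mu are hypothetical callables mapping
    (action, state) -> probability.  The transition probabilities
    p(S_{k+1}|S_k, A_k) cancel, so they never need to be known.
    """
    rho = 1.0
    for s, a in zip(states, actions):
        rho *= target_pi(a, s) / behavior_mu(a, s)
    return rho
```

When the target and behavior policies agree on every action taken, the ratio is exactly 1, and the off-policy return reduces to the on-policy one.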

Define $J(s)$ as the set of all time steps on which state $s$ is visited. This is for an every-visit method; for a first-visit method, $J(s)$ would include only time steps that were first visits to $s$ within their episodes. Let $T(t)$ denote the first time of termination following time $t$, and $G_t$ the return from $t$ up through $T(t)$.

To estimate vπ(s), we simply scale the returns by the ratios and average the results:

$$V(s) = \frac{\sum_{t \in J(s)} \rho_t^{T(t)} G_t}{|J(s)|}$$

When importance sampling is done as a simple average in this way it is called ordinary importance sampling. An important alternative is weighted importance sampling, which uses a weighted average, defined as:
$$V(s) = \frac{\sum_{t \in J(s)} \rho_t^{T(t)} G_t}{\sum_{t \in J(s)} \rho_t^{T(t)}},$$
or zero if the denominator is zero.
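Both estimators can be sketched in a few lines of Python. The function names and list-based inputs are hypothetical: `ratios[i]` holds $\rho_t^{T(t)}$ and `returns[i]` holds $G_t$ for the $i$-th visit in $J(s)$.

```python
def ordinary_is(ratios, returns):
    """Ordinary importance sampling: simple average of rho * G."""
    if not ratios:
        return 0.0
    return sum(r * g for r, g in zip(ratios, returns)) / len(ratios)

def weighted_is(ratios, returns):
    """Weighted importance sampling: rho-weighted average of returns,
    defined as zero when the denominator is zero."""
    denom = sum(ratios)
    if denom == 0:
        return 0.0
    return sum(r * g for r, g in zip(ratios, returns)) / denom
```

Note that the weighted estimate is always a convex combination of observed returns, so it can never leave their range; the ordinary estimate can, since a single large ratio scales a return directly.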

The difference between the two kinds of importance sampling lies in their biases and variances. The ordinary importance-sampling estimator is unbiased, whereas the weighted importance-sampling estimator is biased (though the bias converges asymptotically to zero). On the other hand, the variance of the ordinary estimator is in general unbounded, because the variance of the ratios can be unbounded, whereas in the weighted estimator the largest weight on any single return is one.
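The contrast shows up even on a toy one-step problem (the setup below is a hypothetical illustration, not from the book): the target policy always takes action 0, while the behavior policy takes it only 10% of the time, so the ratio on those episodes is $1/0.1 = 10$.

```python
import random

random.seed(0)

def episode():
    # Behavior policy mu: action 0 with prob 0.1, action 1 with prob 0.9.
    # Target policy pi: always action 0.  Return: 1 for action 0, else 0.
    a = 0 if random.random() < 0.1 else 1
    rho = (1.0 / 0.1) if a == 0 else 0.0  # pi(a)/mu(a)
    g = 1.0 if a == 0 else 0.0
    return rho, g

samples = [episode() for _ in range(1000)]
ordinary = sum(r * g for r, g in samples) / len(samples)
weighted = sum(r * g for r, g in samples) / sum(r for r, _ in samples)
# Every term with nonzero weight has g = 1, so the weighted estimate is
# exactly the true value v_pi = 1.0, while the ordinary estimate only
# fluctuates around 1, built from individual terms as large as 10.
```

Here the weighted estimator's weights are bounded, keeping the estimate inside the range of observed returns, while the ordinary estimator relies on the ratios averaging out over many episodes.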