Typical Policy Evaluation Strategies in Model-free Policy Search


Thanks to J. Peters et al. for their great work, A Survey on Policy Search for Robotics.

Policy evaluation strategies are used to assess the quality of the executed policy. They may be used to transform the sampled trajectories $\tau^{[i]}$ into a data set $\mathcal{D}$ that contains samples of either the state-action pairs $(\boldsymbol{x}_t^{[i]}, \boldsymbol{u}_t^{[i]})$ or the parameter vectors $\boldsymbol{\theta}^{[i]}$. The data set $\mathcal{D}$ is subsequently processed by the policy update strategies to determine the new policy.
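For concreteness, here is a minimal Python sketch of such a sampled trajectory; the container and its field names (`states`, `actions`, `rewards`, `theta`) are assumptions chosen for the examples further below, not notation from the survey.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Trajectory:
    """One sampled rollout tau^[i] executed with parameter vector theta^[i] (hypothetical container)."""
    states: np.ndarray   # shape (T, state_dim):  x_0, ..., x_{T-1}
    actions: np.ndarray  # shape (T, action_dim): u_0, ..., u_{T-1}
    rewards: np.ndarray  # shape (T,):            r_0, ..., r_{T-1}
    theta: np.ndarray    # parameter vector of the lower-level policy used for this rollout
```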

Step-based Policy Evaluation

In step-based policy evaluation, we decompose the sampled trajectories $\tau^{[i]}$ into their single state-action pairs $(\boldsymbol{x}_t^{[i]}, \boldsymbol{u}_t^{[i]})$ and estimate the quality of the individual actions. The quality of an action is given by the state-action value function:

$$Q_t^{[i]} = Q_t^{\pi}(\boldsymbol{x}_t^{[i]}, \boldsymbol{u}_t^{[i]}) = \mathbb{E}_{p_{\theta}(\tau)}\left[\sum_{h=t}^{T} r_h(\boldsymbol{x}_h, \boldsymbol{u}_h) \,\middle|\, \boldsymbol{x}_t = \boldsymbol{x}_t^{[i]},\, \boldsymbol{u}_t = \boldsymbol{u}_t^{[i]}\right]$$
Estimating the state-action value function usually suffers from high-dimensional continuous spaces, approximation errors, and a bias induced by the bootstrapping approach. Monte-Carlo estimates are unbiased; however, they typically exhibit a high variance.

Algorithms based on step-based policy evaluation use a data set $\mathcal{D}_{\text{step}} = \{\boldsymbol{x}_t^{[i]}, \boldsymbol{u}_t^{[i]}, Q_t^{[i]}\}$ to determine the policy update step.
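
A minimal sketch of how such a step-based data set could be assembled, assuming the `Trajectory` container introduced above and using the undiscounted Monte-Carlo reward-to-go as the estimate of $Q_t^{[i]}$ (unbiased but high-variance, as noted above):

```python
import numpy as np


def monte_carlo_q(rewards: np.ndarray) -> np.ndarray:
    """Undiscounted Monte-Carlo reward-to-go: Q_t = sum_{h=t}^{T} r_h."""
    return np.cumsum(rewards[::-1])[::-1]


def build_step_dataset(trajectories):
    """Assemble D_step = {x_t^[i], u_t^[i], Q_t^[i]} over all rollouts and time steps."""
    states, actions, q_values = [], [], []
    for traj in trajectories:
        states.append(traj.states)
        actions.append(traj.actions)
        q_values.append(monte_carlo_q(traj.rewards))
    return (np.concatenate(states),
            np.concatenate(actions),
            np.concatenate(q_values))
```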

Episode-based Policy Evaluation

Episode-based policy evaluation strategies directly use the expected return $R^{[i]} = R(\boldsymbol{\theta}^{[i]})$ to evaluate the quality of a parameter vector $\boldsymbol{\theta}^{[i]}$:

$$R(\boldsymbol{\theta}^{[i]}) = \mathbb{E}_{p_{\theta}(\tau)}\left[\sum_{t=0}^{T} r_t \,\middle|\, \boldsymbol{\theta} = \boldsymbol{\theta}^{[i]}\right]$$
However, this form of the return is by no means the only choice: any reward function $R(\boldsymbol{\theta}^{[i]})$ that depends on the resulting trajectory of the robot can be used. The expected return can be estimated by performing multiple rollouts on the real system. To avoid such an expensive operation, some approaches can cope with noisy estimates of $R^{[i]}$ and hence directly use the return $\sum_{t=0}^{T} r_t^{[i]}$ of a single trajectory $\tau^{[i]}$ as the estimate of $R^{[i]}$.

Episode-based policy evaluation produces a data set $\mathcal{D}_{\text{ep}} = \{\boldsymbol{\theta}^{[i]}, R^{[i]}\}$ and is typically combined with parameter-based exploration strategies; hence, such algorithms can be formalized as the problem of learning an upper-level policy $\pi_{\boldsymbol{w}}(\boldsymbol{\theta})$.

An underlying problem of episode-based evaluation is the variance of the $R^{[i]}$ estimates, which grows with the number of time steps and with the stochasticity of the system. For a high number of time steps and highly stochastic systems, step-based algorithms should therefore be preferred.
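
The following sketch, again assuming the `Trajectory` container from above, builds the episode-based data set with the single-rollout return as a noisy estimate of $R^{[i]}$; averaging several rollouts per parameter vector would reduce the variance at the cost of additional evaluations on the system.

```python
import numpy as np


def build_episode_dataset(trajectories):
    """Assemble D_ep = {theta^[i], R^[i]}, using each single-rollout return
    as a noisy estimate of the expected return R(theta^[i])."""
    thetas = np.stack([traj.theta for traj in trajectories])
    returns = np.array([traj.rewards.sum() for traj in trajectories])  # sum_t r_t^[i]
    return thetas, returns
```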

Generalization to Multiple Tasks

For generalizing the learned policies to multiple tasks, so far mainly episode-based policy evaluation strategies have been used, which learn an upper-level policy. We define a context vector $\boldsymbol{s}$ that describes all variables which do not change during the execution of a task but might change from task to task. The upper-level policy is extended to generalize the lower-level policy $\pi_{\boldsymbol{\theta}}(\boldsymbol{u}|\boldsymbol{x})$ to different tasks by conditioning it on the context, i.e., $\pi_{\boldsymbol{w}}(\boldsymbol{\theta}|\boldsymbol{s})$.

The problem of learning $\pi_{\boldsymbol{w}}(\boldsymbol{\theta}|\boldsymbol{s})$ can be characterized as maximizing the expected return over all contexts:

$$J_{\boldsymbol{w}} = \int_{\boldsymbol{s}} \mu(\boldsymbol{s}) \int_{\boldsymbol{\theta}} \pi_{\boldsymbol{w}}(\boldsymbol{\theta}|\boldsymbol{s}) \int_{\tau} p(\tau|\boldsymbol{\theta},\boldsymbol{s})\, R(\tau,\boldsymbol{s}) \, d\tau \, d\boldsymbol{\theta} \, d\boldsymbol{s} = \int_{\boldsymbol{s}} \mu(\boldsymbol{s}) \int_{\boldsymbol{\theta}} \pi_{\boldsymbol{w}}(\boldsymbol{\theta}|\boldsymbol{s})\, R(\boldsymbol{\theta},\boldsymbol{s}) \, d\boldsymbol{\theta} \, d\boldsymbol{s}$$
where $R(\boldsymbol{\theta},\boldsymbol{s})$ is the expected return for executing the lower-level policy with parameter vector $\boldsymbol{\theta}$ in context $\boldsymbol{s}$, and $\mu(\boldsymbol{s})$ is the distribution over contexts. The data set $\mathcal{D}_{\text{ep}} = \{\boldsymbol{s}^{[i]}, \boldsymbol{\theta}^{[i]}, R^{[i]}\}$ is used for updating the policy.
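
As an illustrative sketch only: the loop below draws contexts from $\mu(\boldsymbol{s})$, samples parameters from an assumed linear-Gaussian upper-level policy $\pi_{\boldsymbol{w}}(\boldsymbol{\theta}|\boldsymbol{s})$, and collects $\mathcal{D}_{\text{ep}} = \{\boldsymbol{s}^{[i]}, \boldsymbol{\theta}^{[i]}, R^{[i]}\}$ together with a Monte-Carlo estimate of $J_{\boldsymbol{w}}$. The helpers `sample_context` and `rollout_return`, as well as the Gaussian parameterization, are hypothetical; the survey does not prescribe a particular form of the upper-level policy.

```python
import numpy as np


def sample_theta(rng, W, Sigma, s):
    """Assumed linear-Gaussian upper-level policy: theta ~ N(W s, Sigma)."""
    return rng.multivariate_normal(W @ s, Sigma)


def build_contextual_dataset(rng, W, Sigma, sample_context, rollout_return, n_episodes):
    """Collect D_ep = {s^[i], theta^[i], R^[i]} and a Monte-Carlo estimate of J_w."""
    data = []
    for _ in range(n_episodes):
        s = sample_context(rng)                  # s ~ mu(s)
        theta = sample_theta(rng, W, Sigma, s)   # theta ~ pi_w(theta | s)
        R = rollout_return(theta, s)             # noisy single-rollout estimate of R(theta, s)
        data.append((s, theta, R))
    J_w = float(np.mean([R for _, _, R in data]))  # Monte-Carlo estimate of J_w
    return data, J_w
```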