Typical Policy Representation in Policy Search Methods

Thanks to Jan Peters et al. for their great work, A Survey on Policy Search for Robotics.

Policy representations may be categorized into time-independent representations $\pi(x)$ and time-dependent representations $\pi(x, t)$. Since time-dependent representations can use different policies for different time steps, they allow a simpler structure for the individual policies.

In what follows, we describe these representations in their deterministic formulation $\pi_\theta(x, t)$. In stochastic formulations, a zero-mean Gaussian noise vector $\epsilon_t$ is typically added to $\pi_\theta(x, t)$; in this case, the parameter vector $\theta$ typically also includes the covariance matrix used for generating the noise $\epsilon_t$.
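
As a concrete illustration, here is a minimal NumPy sketch of this construction; the function and argument names (`stochastic_policy`, `det_policy`, `cov`) are illustrative, not from the survey:

```python
import numpy as np

def stochastic_policy(x, det_policy, cov, rng):
    """Sample an action u = pi_theta(x) + eps_t, with eps_t ~ N(0, cov).

    `det_policy` is any deterministic policy pi_theta(x); the covariance
    `cov` is treated as part of the parameter vector theta.
    """
    mean = det_policy(x)
    eps = rng.multivariate_normal(np.zeros(len(mean)), cov)
    return mean + eps

# Usage: a toy 2-D deterministic policy with small isotropic noise.
rng = np.random.default_rng(0)
det = lambda x: np.array([0.5, -0.2]) * x.sum()
print(stochastic_policy(np.array([1.0, 2.0]), det, 0.01 * np.eye(2), rng))
```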

Linear Policies

A linear policy $\pi_\theta$ is given as

$$\pi_\theta(x) = \theta^T \phi(x)$$

where $\phi$ is a basis-function vector. Linear policies are generally limited to problems where appropriate basis functions are known.
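
A minimal NumPy sketch of such a policy; the hand-picked polynomial basis and the numbers are purely illustrative:

```python
import numpy as np

def linear_policy(x, theta, phi):
    """Linear policy pi_theta(x) = theta^T phi(x).

    `phi` maps the state to a basis-function vector; `theta` is an
    (n_basis, n_actions) weight matrix.
    """
    return theta.T @ phi(x)

# Usage: polynomial basis for a 1-D state, 1-D action. The appropriate
# basis must be known in advance, which is the main limitation.
phi = lambda x: np.array([1.0, x, x**2])
theta = np.array([[0.1], [0.5], [-0.2]])   # one column per action dimension
print(linear_policy(0.3, theta, phi))
```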

Radial Basis Functions Networks

An RBF policy $\pi_\theta$ is given as

$$\pi_\theta(x) = w^T \phi(x), \qquad \phi_i(x) = \exp\left(-\frac{1}{2}(x - \mu_i)^T D_i (x - \mu_i)\right)$$

where $D_i = \mathrm{diag}(d_i)$. The parameters $\beta = \{\mu_i, d_i\}_{i=1,\dots,n}$ of the basis functions are now free parameters to be learned. Hence $\theta = \{w, \beta\}$.
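
A minimal NumPy sketch of an RBF policy with diagonal matrices $D_i$; all names and numbers are illustrative:

```python
import numpy as np

def rbf_policy(x, w, mu, d):
    """RBF-network policy pi_theta(x) = w^T phi(x).

    phi_i(x) = exp(-0.5 * (x - mu_i)^T D_i (x - mu_i)) with D_i = diag(d_i).
    Unlike the linear policy, the centers mu_i and bandwidths d_i are
    free parameters learned alongside the weights w.
    """
    diff = x - mu                                    # (n_basis, state_dim)
    phi = np.exp(-0.5 * np.sum(diff**2 * d, axis=1))
    return w.T @ phi

# Usage: 3 Gaussians over a 2-D state, 1-D action (values are arbitrary).
rng = np.random.default_rng(0)
mu = rng.standard_normal((3, 2))   # centers mu_i
d = np.ones((3, 2))                # diagonal entries d_i of D_i
w = rng.standard_normal((3, 1))    # output weights
print(rbf_policy(np.array([0.2, -0.4]), w, mu, d))
```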

Dynamic Movement Primitives

DMPs are the most widely used time-dependent policy representation in robotics. The key principle is to use a linear spring-damper system that is modulated by a nonlinear forcing function:

$$\ddot{y}_t = \tau^2 \alpha_y\left(\beta_y (g - y_t) - \dot{y}_t\right) + \tau^2 f_t$$

where the variable $y_t$ specifies the desired joint position, $\tau$ is the time-scaling coefficient, the coefficients $\alpha_y$ and $\beta_y$ define the spring and damping constants, and the goal parameter $g$ is the unique point attractor of the system. The forcing function $f_t$ modulates the attractor dynamics and thereby shapes the movement.

One key innovation of the DMP approach is the use of a phase variable $z_t$ to scale the execution speed of the movement:

$$\dot{z} = -\tau \alpha_z z, \qquad z(0) = 1$$
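
Since the phase dynamics are linear, $z_t$ simply decays exponentially from 1 toward 0, and $\tau$ rescales how fast this happens. A short Euler-integration sketch (illustrative, assuming NumPy):

```python
import numpy as np

def phase_trajectory(tau, alpha_z, dt, n_steps):
    """Euler integration of z_dot = -tau * alpha_z * z with z(0) = 1.

    A larger tau makes the phase decay faster and hence speeds up
    the execution of the whole movement.
    """
    z = np.empty(n_steps)
    z[0] = 1.0
    for t in range(1, n_steps):
        z[t] = z[t - 1] - tau * alpha_z * z[t - 1] * dt
    return z

# The phase decays from 1 toward 0; doubling tau halves the duration.
z = phase_trajectory(tau=1.0, alpha_z=3.0, dt=0.01, n_steps=200)
```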

For each degree of freedom, an individual spring-damper system and forcing function are used:

$$f(z) = \frac{\sum_{i=1}^{K} \phi_i(z)\, w_i}{\sum_{i=1}^{K} \phi_i(z)}\, z, \qquad \phi_i(z) = \exp\left(-\frac{1}{2\sigma^2}(z - c_i)^2\right)$$

The parameters $w_i$ are denoted as the shape parameters of the DMP, as they modulate the acceleration profile and hence indirectly specify the shape of the movement. The nonlinear dynamical system is globally stable. Intuitively, the goal parameter $g$ specifies the final position, while the shape parameters $w_i$ specify how that position is reached.
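
A minimal sketch of this normalized-RBF forcing function; the centers, bandwidth, and weights are illustrative assumptions:

```python
import numpy as np

def forcing_function(z, w, c, sigma):
    """Normalized-RBF forcing function f(z), gated by the phase z.

    phi_i(z) = exp(-(z - c_i)^2 / (2 * sigma^2)); multiplying by z
    drives f to zero as the phase decays, which keeps the overall
    spring-damper system globally stable.
    """
    phi = np.exp(-0.5 * ((z - c) / sigma) ** 2)
    return (phi @ w) / (phi.sum() + 1e-12) * z

# Usage: K = 5 basis functions with centers spread over the phase range.
c = np.linspace(1.0, 0.05, 5)             # centers c_i (illustrative)
w = np.array([0.0, 2.0, -1.0, 0.5, 0.0])  # shape parameters w_i
print(forcing_function(z=0.7, w=w, c=c, sigma=0.15))
```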

A policy $\pi_\theta(x_t, t)$ specified by a DMP directly controls the acceleration of the joint and is given by

$$\pi_\theta(x_t, t) = \tau^2 \alpha_y\left(\beta_y (g - y_t) - \dot{y}_t\right) + \tau^2 f(z_t)$$

Note that the DMP policy is linear in the shape parameters $w$ and the goal attractor $g$, but nonlinear in the time-scaling constant $\tau$. Hence $\theta = \{w, g, \tau\}$.
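
Putting the pieces together, the following sketch rolls out a single DMP degree of freedom by Euler integration of the phase and spring-damper dynamics. This is a minimal illustration, not the survey's code; the gain values ($\alpha_y$, $\beta_y$, $\alpha_z$) are common default choices, not values prescribed by the text:

```python
import numpy as np

def dmp_rollout(w, c, sigma, g, tau, y0, alpha_y=25.0, beta_y=6.25,
                alpha_z=8.0, dt=0.002, n_steps=500):
    """Roll out one DMP degree of freedom with Euler integration.

    At each step the policy outputs the commanded acceleration
    y_ddot = tau^2 * alpha_y * (beta_y * (g - y) - y_dot) + tau^2 * f(z).
    The gains alpha_y, beta_y, alpha_z are common defaults, not values
    prescribed by the survey.
    """
    y, y_dot, z = y0, 0.0, 1.0
    traj = np.empty(n_steps)
    for t in range(n_steps):
        phi = np.exp(-0.5 * ((z - c) / sigma) ** 2)
        f = (phi @ w) / (phi.sum() + 1e-12) * z
        y_ddot = tau**2 * alpha_y * (beta_y * (g - y) - y_dot) + tau**2 * f
        y_dot += y_ddot * dt
        y += y_dot * dt
        z += -tau * alpha_z * z * dt
        traj[t] = y
    return traj

# With theta = {w, g, tau}: the trajectory converges to the goal g, the
# shape parameters w bend how it gets there, and tau rescales the duration.
c = np.linspace(1.0, 0.05, 10)
traj = dmp_rollout(w=np.zeros(10), c=c, sigma=0.1, g=1.0, tau=1.0, y0=0.0)
```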

Miscellaneous Representations

Other representations also exist, such as central pattern generators for robot walking and feed-forward neural networks used in simulation.
