Typical Policy Representation in Policy Search Methods

Thanks to Jan Peters et al. for their great work, A Survey on Policy Search for Robotics.

Policy representations may be categorized into time-independent representations $\pi(x)$ and time-dependent representations $\pi(x, t)$. Since time-dependent representations can use different policies for different time steps, they allow a simpler structure for the individual policies.

In what follows, we describe these representations in their deterministic formulation $\pi_\theta(x, t)$. In stochastic formulations, a zero-mean Gaussian noise vector $\epsilon_t$ is typically added to $\pi_\theta(x, t)$; in this case, the parameter vector $\theta$ typically also includes the covariance matrix used for generating the noise $\epsilon_t$.
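
As a concrete illustration, here is a minimal NumPy sketch of this construction; the function and argument names (`stochastic_policy`, `det_policy`, `cov`) are illustrative, not from the survey:

```python
import numpy as np

def stochastic_policy(x, det_policy, cov, rng):
    """Sample an action u = pi_theta(x) + eps_t, with eps_t ~ N(0, cov).

    `det_policy` is any deterministic policy pi_theta(x); the covariance
    `cov` is treated as part of the parameter vector theta.
    """
    mean = det_policy(x)
    eps = rng.multivariate_normal(np.zeros(len(mean)), cov)
    return mean + eps

# Usage: a toy 2-D deterministic policy with small isotropic noise.
rng = np.random.default_rng(0)
det = lambda x: np.array([0.5, -0.2]) * x.sum()
print(stochastic_policy(np.array([1.0, 2.0]), det, 0.01 * np.eye(2), rng))
```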

Linear Policies

A linear policy $\pi_\theta$ is given as

$$\pi_\theta(x) = \theta^T \phi(x)$$

where $\phi$ is a basis-function vector. Linear policies are generally limited to problems where appropriate basis functions are known.
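
A minimal NumPy sketch of such a policy; the hand-picked polynomial basis and the numbers are purely illustrative:

```python
import numpy as np

def linear_policy(x, theta, phi):
    """Linear policy pi_theta(x) = theta^T phi(x).

    `phi` maps the state to a basis-function vector; `theta` is an
    (n_basis, n_actions) weight matrix.
    """
    return theta.T @ phi(x)

# Usage: polynomial basis for a 1-D state, 1-D action. The appropriate
# basis must be known in advance, which is the main limitation.
phi = lambda x: np.array([1.0, x, x**2])
theta = np.array([[0.1], [0.5], [-0.2]])   # one column per action dimension
print(linear_policy(0.3, theta, phi))
```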

Radial Basis Functions Networks

An RBF policy $\pi_\theta$ is given as

$$\pi_\theta(x) = w^T \phi(x), \qquad \phi_i(x) = \exp\left(-\frac{1}{2}(x - \mu_i)^T D_i (x - \mu_i)\right)$$

where $D_i = \mathrm{diag}(d_i)$. The parameters $\beta = \{\mu_i, d_i\}_{i=1,\dots,n}$ of the basis functions are now free parameters to be learned. Hence $\theta = \{w, \beta\}$.
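
A minimal NumPy sketch of an RBF policy with diagonal matrices $D_i$; all names and numbers are illustrative:

```python
import numpy as np

def rbf_policy(x, w, mu, d):
    """RBF-network policy pi_theta(x) = w^T phi(x).

    phi_i(x) = exp(-0.5 * (x - mu_i)^T D_i (x - mu_i)) with D_i = diag(d_i).
    Unlike the linear policy, the centers mu_i and bandwidths d_i are
    free parameters learned alongside the weights w.
    """
    diff = x - mu                                    # (n_basis, state_dim)
    phi = np.exp(-0.5 * np.sum(diff**2 * d, axis=1))
    return w.T @ phi

# Usage: 3 Gaussians over a 2-D state, 1-D action (values are arbitrary).
rng = np.random.default_rng(0)
mu = rng.standard_normal((3, 2))   # centers mu_i
d = np.ones((3, 2))                # diagonal entries d_i of D_i
w = rng.standard_normal((3, 1))    # output weights
print(rbf_policy(np.array([0.2, -0.4]), w, mu, d))
```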

Dynamic Movement Primitives

DMPs are the most widely used time-dependent policy representation in robotics. The key principle is to use a linear spring-damper system that is modulated by a nonlinear forcing function:

$$\ddot{y}_t = \tau^2 \alpha_y\left(\beta_y (g - y_t) - \dot{y}_t\right) + \tau^2 f_t$$

where the variable $y_t$ specifies the desired joint position, $\tau$ is the time-scaling coefficient, the coefficients $\alpha_y$ and $\beta_y$ define the spring and damping constants, and the goal parameter $g$ is the unique point attractor of the system. The forcing function $f_t$ modulates the attractor dynamics and thereby shapes the movement.

One key innovation of the DMP approach is the use of a phase variable $z_t$ to scale the execution speed of the movement:

$$\dot{z} = -\tau \alpha_z z, \qquad z(0) = 1$$
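
Since the phase dynamics are linear, $z_t$ simply decays exponentially from 1 toward 0, and $\tau$ rescales how fast this happens. A short Euler-integration sketch (illustrative, assuming NumPy):

```python
import numpy as np

def phase_trajectory(tau, alpha_z, dt, n_steps):
    """Euler integration of z_dot = -tau * alpha_z * z with z(0) = 1.

    A larger tau makes the phase decay faster and hence speeds up
    the execution of the whole movement.
    """
    z = np.empty(n_steps)
    z[0] = 1.0
    for t in range(1, n_steps):
        z[t] = z[t - 1] - tau * alpha_z * z[t - 1] * dt
    return z

# The phase decays from 1 toward 0; doubling tau halves the duration.
z = phase_trajectory(tau=1.0, alpha_z=3.0, dt=0.01, n_steps=200)
```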

For each degree of freedom, an individual spring-damper system and forcing function are used:

$$f(z) = \frac{\sum_{i=1}^{K} \phi_i(z)\, w_i}{\sum_{i=1}^{K} \phi_i(z)}\, z, \qquad \phi_i(z) = \exp\left(-\frac{1}{2\sigma^2}(z - c_i)^2\right)$$

The parameters $w_i$ are denoted as the shape parameters of the DMP, as they modulate the acceleration profile and hence indirectly specify the shape of the movement. The nonlinear dynamical system is globally stable. Intuitively, the goal parameter $g$ specifies the final position, while the shape parameters $w_i$ specify how that position is reached.
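
A minimal sketch of this normalized-RBF forcing function; the centers, bandwidth, and weights are illustrative assumptions:

```python
import numpy as np

def forcing_function(z, w, c, sigma):
    """Normalized-RBF forcing function f(z), gated by the phase z.

    phi_i(z) = exp(-(z - c_i)^2 / (2 * sigma^2)); multiplying by z
    drives f to zero as the phase decays, which keeps the overall
    spring-damper system globally stable.
    """
    phi = np.exp(-0.5 * ((z - c) / sigma) ** 2)
    return (phi @ w) / (phi.sum() + 1e-12) * z

# Usage: K = 5 basis functions with centers spread over the phase range.
c = np.linspace(1.0, 0.05, 5)             # centers c_i (illustrative)
w = np.array([0.0, 2.0, -1.0, 0.5, 0.0])  # shape parameters w_i
print(forcing_function(z=0.7, w=w, c=c, sigma=0.15))
```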

A policy $\pi_\theta(x_t, t)$ specified by a DMP directly controls the acceleration of the joint and is given by

$$\pi_\theta(x_t, t) = \tau^2 \alpha_y\left(\beta_y (g - y_t) - \dot{y}_t\right) + \tau^2 f(z_t)$$

Note that the DMP policy is linear in the shape parameters $w$ and the goal attractor $g$, but nonlinear in the time-scaling constant $\tau$. Hence $\theta = \{w, g, \tau\}$.
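
Putting the pieces together, the following sketch rolls out a single DMP degree of freedom by Euler integration of the phase and spring-damper dynamics. This is a minimal illustration, not the survey's code; the gain values ($\alpha_y$, $\beta_y$, $\alpha_z$) are common default choices, not values prescribed by the text:

```python
import numpy as np

def dmp_rollout(w, c, sigma, g, tau, y0, alpha_y=25.0, beta_y=6.25,
                alpha_z=8.0, dt=0.002, n_steps=500):
    """Roll out one DMP degree of freedom with Euler integration.

    At each step the policy outputs the commanded acceleration
    y_ddot = tau^2 * alpha_y * (beta_y * (g - y) - y_dot) + tau^2 * f(z).
    The gains alpha_y, beta_y, alpha_z are common defaults, not values
    prescribed by the survey.
    """
    y, y_dot, z = y0, 0.0, 1.0
    traj = np.empty(n_steps)
    for t in range(n_steps):
        phi = np.exp(-0.5 * ((z - c) / sigma) ** 2)
        f = (phi @ w) / (phi.sum() + 1e-12) * z
        y_ddot = tau**2 * alpha_y * (beta_y * (g - y) - y_dot) + tau**2 * f
        y_dot += y_ddot * dt
        y += y_dot * dt
        z += -tau * alpha_z * z * dt
        traj[t] = y
    return traj

# With theta = {w, g, tau}: the trajectory converges to the goal g, the
# shape parameters w bend how it gets there, and tau rescales the duration.
c = np.linspace(1.0, 0.05, 10)
traj = dmp_rollout(w=np.zeros(10), c=c, sigma=0.1, g=1.0, tau=1.0, y0=0.0)
```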

Miscellaneous Representations

Other representations also exist, such as central pattern generators for robot walking and feed-forward neural networks used in simulation.
