Typical Exploration Strategies in Model-free Policy Search
Thanks to J. Peters et al. for their great work, *A Survey on Policy Search for Robotics*.
The exploration strategy is used to generate new trajectory samples. Many model-free policy search approaches update the exploration distribution and, hence, the covariance of the Gaussian policy. Typically, a large exploration rate is used at the beginning of learning and is then gradually decreased to fine-tune the policy parameters.
Action Space vs Parameter Space
In action space, we can simply add exploration noise to the actions produced by the policy. Exploration in parameter space instead perturbs the parameter vector of the policy. Many approaches can be formulated with the concept of an upper-level policy: the parameter vector of the lower-level control policy is treated as the "action" selected by the upper-level policy.
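The contrast between the two exploration types can be sketched as follows; the linear lower-level policy and the noise scale are illustrative assumptions, not taken from the survey.

```python
import numpy as np

rng = np.random.default_rng(0)

def lower_level_policy(theta, state):
    """Hypothetical linear lower-level policy: u = theta . s."""
    return theta @ state

theta = np.array([0.5, -0.2])   # parameter vector of the lower-level policy
state = np.array([1.0, 0.3])

# Action-space exploration: compute the action deterministically,
# then add Gaussian noise directly to the action.
action_noisy = lower_level_policy(theta, state) + rng.normal(0.0, 0.1)

# Parameter-space exploration: perturb the parameter vector instead
# (as if sampled from an upper-level policy), then act deterministically.
theta_perturbed = theta + rng.normal(0.0, 0.1, size=theta.size)
action_param = lower_level_policy(theta_perturbed, state)
```

In the upper-level view, `theta_perturbed` plays the role of the "action" chosen by the upper-level policy, while the lower-level control law itself stays noise-free.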
Episode-based vs Step-based
Step-based exploration uses different exploration noise at each time step and can act either in action space or in parameter space. It can be problematic, as it may produce action sequences that are not reproducible by a noise-free control law. Episode-based exploration uses exploration noise only at the beginning of the episode, which leads to exploration in parameter space. Episode-based exploration may therefore produce more reliable policy updates.
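The distinction can be made concrete with a small sketch: step-based exploration draws fresh parameter noise at every time step, while episode-based exploration draws it once per episode. The time-dependent feature vector and noise scale below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
T, dim_theta = 20, 3
theta = np.array([0.2, -0.1, 0.4])  # nominal policy parameters

def features(t):
    """Hypothetical time-dependent features for a linear-in-parameters policy."""
    return np.array([1.0, t / T, (t / T) ** 2])

# Step-based exploration: a new noise sample at every time step,
# so the resulting action sequence jitters from step to step.
step_actions = [(theta + rng.normal(0.0, 0.1, dim_theta)) @ features(t)
                for t in range(T)]

# Episode-based exploration: one noise sample for the whole episode,
# so the trajectory is consistent with a single noise-free control law.
eps = rng.normal(0.0, 0.1, dim_theta)
episode_actions = [(theta + eps) @ features(t) for t in range(T)]
```

The episode-based trajectory is exactly what the deterministic policy with parameters `theta + eps` would produce, which is why such rollouts tend to give more reliable policy updates.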
Uncorrelated vs Correlated
As most policies are represented as Gaussian distributions, uncorrelated exploration noise is obtained by using a diagonal covariance matrix. Correlated exploration can also be achieved by maintaining a full covariance matrix. Exploration in action space typically uses a diagonal covariance matrix. In parameter space, many approaches can update the full covariance matrix of the Gaussian policy; using the full covariance matrix often results in considerably faster learning.
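The two cases differ only in the shape of the covariance matrix passed to the Gaussian; a minimal NumPy sketch (the dimensions and variances are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 4
mu = np.zeros(dim)  # mean of the Gaussian exploration policy

# Uncorrelated exploration: diagonal covariance,
# one independent variance per parameter dimension.
sigma_diag = np.diag([0.1, 0.2, 0.1, 0.3])
theta_uncorr = rng.multivariate_normal(mu, sigma_diag)

# Correlated exploration: full covariance matrix, built here as
# A A^T + small ridge so it is symmetric positive definite.
A = rng.normal(size=(dim, dim))
sigma_full = A @ A.T + 1e-3 * np.eye(dim)
theta_corr = rng.multivariate_normal(mu, sigma_full)
```

The full matrix couples the noise across dimensions, so exploration samples follow the correlations the policy update has learned, at the cost of estimating O(dim²) covariance entries instead of O(dim).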