Reinforcement Learning Paper Notes


HIGH-DIMENSIONAL CONTINUOUS CONTROL USING GENERALIZED ADVANTAGE ESTIMATION

John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan and Pieter Abbeel
Department of Electrical Engineering and Computer Science
University of California, Berkeley
{joschu,pcmoritz,levine,jordan,pabbeel}@eecs.berkeley.edu

  • The main point: use GAE to approximate the advantage function and reduce variance, with parameters that control how far each action's influence on the reward extends (the shaped-reward identity is spelled out after this list). This observation suggests an interpretation of Equation (16): reshape the rewards using $V$ to shrink the temporal extent of the response function, and then introduce a "steeper" discount $\gamma\lambda$ to cut off the noise arising from long delays, i.e., ignore terms $\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\delta^V_{t+l}$ where $l \gg 1/(1-\gamma\lambda)$.

  • GAE: generalized advantage estimation

  • Two main challenges
    • large number of samples
    • difficulty of obtaining stable and steady improvement
  • Solutions

    • We address the first challenge by using value functions to substantially reduce the variance of policy gradient estimates, at the cost of some bias, with an exponentially-weighted estimator of the advantage function that is analogous to TD(λ).
    • We address the second challenge by using a trust region optimization procedure for both the policy and the value function, which are represented by neural networks.

  1. A family of policy gradient estimators that substantially reduce variance while keeping bias at a tolerable level: the generalized advantage estimator (GAE), parameterized by $\gamma \in [0,1]$ and $\lambda \in [0,1]$.
  2. A more general analysis that applies to both the online and batch settings, including a discussion of an interpretation of the method as an instance of reward shaping.
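
The shaped-reward identity referenced in the first bullet above, reconstructed from the paper's reward-shaping discussion: taking the shaping potential $\Phi = V$ turns each reward into the TD residual, and GAE is then just the $\gamma\lambda$-discounted sum of shaped rewards:

$$\tilde{r}(s_t, a_t, s_{t+1}) = r_t + \gamma V(s_{t+1}) - V(s_t) = \delta^V_t, \qquad \hat{A}_t^{GAE(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \, \delta^V_{t+l}$$

Shaping with $V$ shrinks the temporal extent of the response function, and the extra discount $\gamma\lambda$ cuts off the noise from long delays.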

Three contributions of the paper:

  • We provide justification and intuition for an effective variance reduction scheme for policy gradients, which we call generalized advantage estimation (GAE). While the formula has been proposed in prior work (Kimura & Kobayashi, 1998; Wawrzyński, 2009), our analysis is novel and enables GAE to be applied with a more general set of algorithms, including the batch trust-region algorithm we use for our experiments.
  • We propose the use of a trust region optimization method for the value function, which we find is a robust and efficient way to train neural network value functions with thousands of parameters (the constraint is sketched after this list).
  • By combining (1) and (2) above, we obtain an algorithm that empirically is effective at learning neural network policies for challenging control tasks. The results extend the state of the art in using reinforcement learning for high-dimensional continuous control.
    Videos are available at
    https://sites.google.com/site/gaepapersupp.
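
The value-function trust region from the second bullet, as I reconstruct it from Section 5 of the paper: fit the value network by solving the constrained problem

$$\underset{\phi}{\text{minimize}} \; \sum_{n=1}^{N} \lVert V_\phi(s_n) - \hat{V}_n \rVert^2 \quad \text{subject to} \quad \frac{1}{N} \sum_{n=1}^{N} \frac{\lVert V_\phi(s_n) - V_{\phi_{\text{old}}}(s_n) \rVert^2}{2\sigma^2} \le \epsilon$$

where $\hat{V}_n$ are the empirical value targets and $\sigma^2$ is the previous value function's mean squared error on the batch; the constraint keeps each update from overfitting the most recent batch of data.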

Several different policy gradient methods
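
The figure that belonged under this heading did not survive extraction; from Section 2 of the paper, the various estimators share the common form

$$g = \mathbb{E}\left[\sum_{t=0}^{\infty} \Psi_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$$

where $\Psi_t$ may be, among others: the total trajectory reward $\sum_{t'=0}^{\infty} r_{t'}$; the reward following action $a_t$, $\sum_{t'=t}^{\infty} r_{t'}$; the baselined version $\sum_{t'=t}^{\infty} r_{t'} - b(s_t)$; the state-action value $Q^\pi(s_t, a_t)$; the advantage $A^\pi(s_t, a_t)$; or the TD residual $r_t + V^\pi(s_{t+1}) - V^\pi(s_t)$.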


Key points:

  • Update the value function afterwards (i.e., after the policy update)
  • The choice $\Psi_t = A^\pi(s_t, a_t)$ yields almost the lowest possible variance, though in practice, the advantage function is not known and must be estimated.
  • Introduce a parameter $\gamma$ to reduce variance by downweighting rewards corresponding to delayed effects, at the cost of introducing bias.


    $$\hat{g} = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=0}^{\infty} \hat{A}^n_t \, \nabla_\theta \log \pi_\theta(a^n_t \mid s^n_t) \qquad (9)$$
    where $n$ indexes the trajectories in the batch

  • $V$ is an approximate value function. Define $\delta^V_t = r_t + \gamma V(s_{t+1}) - V(s_t)$, which can be regarded as an estimate of the advantage of action $a_t$ (if $V = V^{\pi,\gamma}$, it is in fact an unbiased estimator of $A^{\pi,\gamma}(s_t, a_t)$).

  • Continuing the derivation: the $k$-step estimator $\hat{A}^{(k)}_t = \sum_{l=0}^{k-1} \gamma^l \delta^V_{t+l}$ telescopes into $-V(s_t) + r_t + \gamma r_{t+1} + \cdots + \gamma^{k-1} r_{t+k-1} + \gamma^k V(s_{t+k})$, i.e., each $\hat{A}^{(k)}_t$ is a telescoping sum of advantage estimates.
  • The generalized advantage estimator $\hat{A}^{GAE(\gamma,\lambda)}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta^V_{t+l}$ has two notable special cases: $\lambda = 0$ gives $\hat{A}_t = \delta^V_t$ (low variance, but biased unless $V = V^{\pi,\gamma}$), and $\lambda = 1$ gives $\hat{A}_t = \sum_{l=0}^{\infty} \gamma^l r_{t+l} - V(s_t)$ (low bias, but high variance).
  • Using GAE yields a biased estimator of $g^\gamma$, obtained by substituting it into Equation (6): $g^\gamma \approx \mathbb{E}\left[\sum_{t=0}^{\infty} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, \hat{A}^{GAE(\gamma,\lambda)}_t\right]$.
  • The concrete algorithm (a minimal code sketch follows below):

[Figure: the paper's algorithm, alternating trust-region policy updates using GAE advantages with trust-region value-function fits]
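
Since the algorithm figure itself did not survive, here is a minimal Python sketch of the GAE computation at its core. This is my own illustration, not code from the paper; the function name `compute_gae` and the array conventions are assumptions:

```python
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """GAE(gamma, lambda) advantages for a single trajectory (illustrative sketch).

    rewards: length-T array of rewards r_t
    values:  length-(T+1) array of V(s_t); the last entry bootstraps the final state
    """
    T = len(rewards)
    # TD residuals: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.empty(T)
    gae = 0.0
    # Backward recursion: A_t = delta_t + (gamma * lam) * A_{t+1}
    for t in reversed(range(T)):
        gae = deltas[t] + gamma * lam * gae
        advantages[t] = gae
    return advantages

# Toy usage: lam=0 recovers the one-step TD residual; lam=1 the discounted return minus V(s_t)
rewards = np.array([1.0, 0.0, 1.0])
values = np.array([0.5, 0.4, 0.3, 0.0])
print(compute_gae(rewards, values))
```

The policy step then averages $\hat{A}_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$ over the batch as in Equation (9), and the value network is refit on the empirical returns under the trust-region constraint sketched earlier.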

