Reinforcement Learning Paper Notes
High-Dimensional Continuous Control Using Generalized Advantage Estimation
John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan and Pieter Abbeel
Department of Electrical Engineering and Computer Science
University of California, Berkeley
{joschu,pcmoritz,levine,jordan,pabbeel}@eecs.berkeley.edu
The paper's main idea: use GAE to approximate the advantage function and thereby reduce the variance of policy gradient estimates, with parameters that control over what temporal range actions are credited for rewards. This observation suggests an interpretation of Equation (16): reshape the rewards using $V$ to shrink the temporal extent of the response function, and then introduce a "steeper" discount $\gamma\lambda$ to cut off the noise arising from long delays, i.e., ignore terms $\nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot \delta^V_{t+l}$ where $l \gg 1/(1-\gamma\lambda)$. (GAE: generalized advantage estimation.)
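As a quick numerical illustration of that cutoff (parameter values here are common defaults chosen for illustration, not taken from the paper), the weight $(\gamma\lambda)^l$ on the residual $\delta^V_{t+l}$ decays geometrically with effective horizon $1/(1-\gamma\lambda)$:

```python
# Illustrative values only; gamma and lam are typical choices, not the paper's.
gamma, lam = 0.99, 0.95
print(1 / (1 - gamma * lam))     # effective horizon: ~16.8 steps
for l in (0, 10, 50, 100):
    # weights on delta_{t+l}: approximately 1.0, 0.54, 0.047, 0.0022
    print(l, (gamma * lam) ** l)
```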
- Two main challenges:
- the large number of samples typically required
- the difficulty of obtaining stable and steady improvement despite the nonstationarity of the incoming data
Solutions:
- We address the first challenge by using value functions to substantially reduce the variance of policy gradient estimates at the cost of some bias, with an exponentially-weighted estimator of the advantage function that is analogous to TD(λ).
- We address the second challenge by using a trust region optimization procedure for both the policy and the value function, which are represented by neural networks.
- The paper proposes a family of policy gradient estimators that significantly reduce variance while keeping bias at a tolerable level: the generalized advantage estimator (GAE), parameterized by $\gamma \in [0,1]$ and $\lambda \in [0,1]$.
- It also gives a more general analysis, valid in both the online and batch settings, and discusses an interpretation of the method as an instance of reward shaping.
The paper's three contributions:
- We provide justification and intuition for an effective variance reduction scheme for policy gradients, which we call generalized advantage estimation (GAE). While the formula has been proposed in prior work (Kimura & Kobayashi, 1998; Wawrzyński, 2009), our analysis is novel and enables GAE to be applied with a more general set of algorithms, including the batch trust-region algorithm we use for our experiments.
- We propose the use of a trust region optimization method for the value function, which we find is a robust and efficient way to train neural network value functions with thousands of parameters (a simplified sketch appears after this list).
- By combining (1) and (2) above, we obtain an algorithm that empirically is effective at learning neural network policies for challenging control tasks. The results extend the state of the art in using reinforcement learning for high-dimensional continuous control.
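The paper's trust-region value fit solves a constrained least-squares problem with conjugate gradient; as a rough stand-in, here is a minimal penalized variant for a linear value function (the function name, `beta`, and all hyperparameters are hypothetical illustration choices, not the paper's procedure):

```python
import numpy as np

def fit_value_penalized(phi, states, returns, v_old, beta=10.0, lr=1e-2, iters=200):
    """Fit a linear value function V(s) = states @ phi toward empirical returns,
    with a quadratic penalty keeping predictions near the previous value
    function -- a soft approximation of the paper's trust-region constraint."""
    phi = phi.copy()
    N = len(returns)
    for _ in range(iters):
        v = states @ phi                          # current predictions, shape [N]
        # Gradient of (1/N) * ||v - returns||^2 ...
        grad = 2.0 * states.T @ (v - returns) / N
        # ... plus gradient of the penalty (beta/N) * ||v - v_old||^2.
        grad += 2.0 * beta * states.T @ (v - v_old) / N
        phi -= lr * grad
    return phi
```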
Videos are available at https://sites.google.com/site/gaepapersupp.
Key points:
- Update the value function after the policy update (updating the value function first would introduce additional bias into the advantage estimates).
- The choice $\Psi_t = A^\pi(s_t, a_t)$ yields almost the lowest possible variance, though in practice the advantage function is not known and must be estimated. Introduce a parameter $\gamma$ to reduce variance by downweighting rewards corresponding to delayed effects, at the cost of introducing bias.
- The batch policy gradient estimator is $\hat{g} = \frac{1}{N}\sum_{n=1}^{N}\sum_{t=0}^{\infty} \hat{A}_t^n \nabla_\theta \log \pi_\theta(a_t^n \mid s_t^n)$ (9), where $n$ indexes the episodes in the batch.
- Let $V$ be an approximate value function and define $\delta_t^V = r_t + \gamma V(s_{t+1}) - V(s_t)$, which can be regarded as an estimate of the advantage of the action $a_t$; the $k$-step advantage estimators are telescoping sums of these $\delta^V$ terms.
- The generalized advantage estimator (GAE): $\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}^V$.
- Use GAE to construct a biased estimator of $g^\gamma$ by rewriting Equation (6).
- Concrete algorithm: a minimal sketch follows.
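A minimal sketch of the GAE computation for a single trajectory, using the equivalent backward recursion $\hat{A}_t = \delta_t^V + \gamma\lambda\,\hat{A}_{t+1}$ (function and variable names are my own, not the paper's):

```python
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one trajectory.

    rewards -- r_0 ... r_{T-1}, shape [T]
    values  -- V(s_0) ... V(s_T), shape [T+1] (last entry is the bootstrap value)
    Returns A_t = sum_l (gamma*lam)**l * delta_{t+l}, computed backward.
    """
    T = len(rewards)
    # TD residuals: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        gae = deltas[t] + gamma * lam * gae
        advantages[t] = gae
    return advantages
```

The resulting advantages are the $\hat{A}_t^n$ terms plugged into the batch estimator of Equation (9).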