Summary of "Reinforcement Learning: An Introduction", Chapter 10: "On-policy Control with Approximation"


A new member has joined our group and I need to help him get started with RL, so we are starting from Silver's course.

For myself, I am adding the requirement of reading "Reinforcement Learning: An Introduction" carefully.

I did not read it very carefully the first time, so this time I hope to be more thorough and also write a short summary of the corresponding topics.




The present chapter features the semi-gradient Sarsa algorithm (i.e., on-policy control with approximation), the natural extension of semi-gradient TD(0) (last chapter) to action values and to on-policy control.




In the episodic case, the extension is straightforward.
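The original post appears to have embedded the chapter's equations as images, which are missing here; as a stand-in, the one-step episodic semi-gradient Sarsa update has the standard form (a reconstruction from the chapter's definitions, not the original figure):

$$
\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha\,\big[R_{t+1} + \gamma\,\hat q(S_{t+1},A_{t+1},\mathbf{w}_t) - \hat q(S_t,A_t,\mathbf{w}_t)\big]\,\nabla \hat q(S_t,A_t,\mathbf{w}_t).
$$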


n-step Semi-gradient SARSA:
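The n-step return and update that this heading refers to should be roughly the following (again a reconstruction in the book's notation, since the original image is missing):

$$
G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n}\,\hat q(S_{t+n}, A_{t+n}, \mathbf{w}_{t+n-1}), \qquad t+n < T,
$$

with $G_{t:t+n} \doteq G_t$ if $t+n \ge T$, and

$$
\mathbf{w}_{t+n} \doteq \mathbf{w}_{t+n-1} + \alpha\,\big[G_{t:t+n} - \hat q(S_t, A_t, \mathbf{w}_{t+n-1})\big]\,\nabla \hat q(S_t, A_t, \mathbf{w}_{t+n-1}), \qquad 0 \le t < T.
$$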








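To make the episodic algorithm concrete, here is a minimal sketch of the one-step (n = 1) case with a linear approximator and an ε-greedy policy. The `env` object with `reset()`/`step()` returning `(next_state, reward, done)` and the `features(state, action)` function are assumptions for illustration; this mirrors the structure of the chapter's pseudocode rather than reproducing the book's exact listing.

```python
import numpy as np

def episodic_semi_gradient_sarsa(env, features, n_actions, d,
                                 alpha=0.1, gamma=1.0, epsilon=0.1,
                                 n_episodes=500):
    """One-step semi-gradient Sarsa with a linear approximator.

    `features(s, a)` must return a length-d feature vector x(s, a),
    so that q_hat(s, a, w) = w . x(s, a).
    """
    w = np.zeros(d)

    def q_hat(s, a):
        return w @ features(s, a)

    def epsilon_greedy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax([q_hat(s, a) for a in range(n_actions)]))

    for _ in range(n_episodes):
        s = env.reset()
        a = epsilon_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            if done:
                # Terminal target: just the reward.
                w += alpha * (r - q_hat(s, a)) * features(s, a)
                break
            a_next = epsilon_greedy(s_next)
            td_error = r + gamma * q_hat(s_next, a_next) - q_hat(s, a)
            # Semi-gradient update: the gradient of the linear q_hat is x(s, a).
            w += alpha * td_error * features(s, a)
            s, a = s_next, a_next
    return w
```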
In the continuing case, we have to give up discounting and switch to a new "average-reward" formulation of the control problem, with new value functions. See the box "The Futility of Discounting in Continuing Problems" (p. 257): in fact, for a policy π, the average of the discounted returns is always η(π)/(1−γ), that is, it is essentially the average reward η(π). In particular, the ordering of all policies in the average-discounted-return setting would be exactly the same as in the average-reward setting; the discount rate γ thus has no effect on the problem formulation. (So there is no need to keep discounting; studying the average reward directly is enough. Of course, one could just as well keep discounting and skip the average-reward formulation, but presumably for historical reasons the latter is used: the average-reward setting is one of the major settings considered in the classical theory of dynamic programming and, though less often, in reinforcement learning.) A sketch of the derivation follows.
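A minimal sketch of why this holds, following the book's boxed argument and assuming an ergodic MDP with steady-state distribution $\mu_\pi$ (so that $\sum_{s}\mu_\pi(s)\sum_{a}\pi(a\mid s)\,p(s'\mid s,a)=\mu_\pi(s')$):

$$
\sum_{s} \mu_\pi(s)\, v_\gamma^\pi(s)
= \sum_{s}\mu_\pi(s)\sum_{a}\pi(a\mid s)\sum_{s',r}p(s',r\mid s,a)\big[r + \gamma\, v_\gamma^\pi(s')\big]
= \eta(\pi) + \gamma \sum_{s'} \mu_\pi(s')\, v_\gamma^\pi(s'),
$$

and unrolling this recursion gives $\eta(\pi)\,(1 + \gamma + \gamma^2 + \cdots) = \eta(\pi)/(1-\gamma)$.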


The average-reward setting applies to continuing problems, problems for which the interaction between agent and environment goes on and on forever without termination or start states.

In the average-reward setting, the quality of a policy π is defined as the average rate of reward while following that policy, which we denote as η(π):
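The definition this sentence refers to (the original equation image is missing; this is the standard form from the chapter, written in the post's η notation, with $\mu_\pi$ the steady-state distribution under π):

$$
\eta(\pi) \doteq \lim_{h\to\infty} \frac{1}{h}\sum_{t=1}^{h} \mathbb{E}\big[R_t \mid A_{0:t-1}\sim\pi\big]
= \lim_{t\to\infty} \mathbb{E}\big[R_t \mid A_{0:t-1}\sim\pi\big]
= \sum_{s}\mu_\pi(s)\sum_{a}\pi(a\mid s)\sum_{s',r}p(s',r\mid s,a)\,r.
$$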



In the average-reward setting, returns are defined in terms of differences between rewards and the average reward: 
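The differential return (reconstructed, since the original equation image is missing):

$$
G_t \doteq R_{t+1} - \eta(\pi) + R_{t+2} - \eta(\pi) + R_{t+3} - \eta(\pi) + \cdots
$$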



This is known as the differential return, and the corresponding value functions are known as differential value functions. Differential value functions also have Bellman equations, just slightly different from those we have seen earlier. We simply remove all γs and replace all rewards by the difference between the reward and the true average reward. There is also a differential form of the two TD errors.
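A sketch of the two differential TD errors in this notation, with $\bar\eta_t$ the current estimate of the average reward $\eta(\pi)$ (a reconstruction, not the original figure):

$$
\delta_t \doteq R_{t+1} - \bar\eta_t + \hat v(S_{t+1},\mathbf{w}_t) - \hat v(S_t,\mathbf{w}_t),
\qquad
\delta_t \doteq R_{t+1} - \bar\eta_t + \hat q(S_{t+1},A_{t+1},\mathbf{w}_t) - \hat q(S_t,A_t,\mathbf{w}_t).
$$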



n-step Differential Semi-gradient SARSA:
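The n-step differential return that this heading refers to should be roughly the following, with $\bar\eta_{t+n-1}$ the average-reward estimate at time $t+n-1$ (again a reconstruction, since the original image is missing):

$$
G_{t:t+n} \doteq R_{t+1} - \bar\eta_{t+n-1} + R_{t+2} - \bar\eta_{t+n-1} + \cdots + R_{t+n} - \bar\eta_{t+n-1} + \hat q(S_{t+n}, A_{t+n}, \mathbf{w}_{t+n-1}),
$$

and the weight update keeps the same semi-gradient form as in the episodic case, $\mathbf{w}_{t+n} \doteq \mathbf{w}_{t+n-1} + \alpha\,\big[G_{t:t+n} - \hat q(S_t,A_t,\mathbf{w}_{t+n-1})\big]\,\nabla \hat q(S_t,A_t,\mathbf{w}_{t+n-1})$.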







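To close, here is a minimal sketch of the one-step differential semi-gradient Sarsa loop with a linear approximator, in the same style as the episodic sketch above. It assumes a continuing-task `env` whose `step()` returns just `(next_state, reward)` (no terminal flag), the same hypothetical `features(state, action)` function, and a separate step size `beta` for the average-reward estimate; it follows the structure of the chapter's pseudocode rather than reproducing the book's exact listing.

```python
import numpy as np

def differential_semi_gradient_sarsa(env, features, n_actions, d,
                                     alpha=0.1, beta=0.01, epsilon=0.1,
                                     n_steps=100_000):
    """One-step differential semi-gradient Sarsa for a continuing task.

    Keeps a running estimate `eta_bar` of the average reward and uses the
    differential TD error R - eta_bar + q(S',A') - q(S,A); no discounting.
    """
    w = np.zeros(d)
    eta_bar = 0.0  # estimate of the average reward eta(pi)

    def q_hat(s, a):
        return w @ features(s, a)

    def epsilon_greedy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax([q_hat(s, a) for a in range(n_actions)]))

    s = env.reset()
    a = epsilon_greedy(s)
    for _ in range(n_steps):
        s_next, r = env.step(a)          # continuing task: no terminal flag
        a_next = epsilon_greedy(s_next)
        delta = r - eta_bar + q_hat(s_next, a_next) - q_hat(s, a)
        eta_bar += beta * delta          # update the average-reward estimate
        w += alpha * delta * features(s, a)
        s, a = s_next, a_next
    return w, eta_bar
```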
