Example 7.1: n-step TD Methods on the Random Walk


Consider using n-step TD methods on the random walk task described in Example 6.2 and shown in Figure 6.2. Suppose the first episode progressed directly from the center state,
C, to the right, through D and E, and then terminated on the right with a return of 1. Recall that the estimated values of all the states started at an intermediate value, V(s) = 0.5. As a result of this experience, a one-step method would change only the estimate for the last state, V(E), which would be incremented toward 1, the observed return. A two-step method, on the other hand, would increment the values of the two states preceding termination: V(D) and V(E) both would be incremented toward 1. A three-step method, or any n-step method for n > 2, would increment
the values of all three of the visited states toward 1, all by the same amount.
Which value of n is better? Figure 7.2 shows the results of a simple empirical test for a larger random walk process, with 19 states (and with a −1 outcome on the left, all values initialized to 0), which we use as a running example in this chapter.
Results are shown for n-step TD methods with a range of values for n and α. The performance measure for each parameter setting, shown on the vertical axis, is the square-root of the average squared error between the predictions at the end of the episode for the 19 states and their true values, then averaged over the first 10 episodes and 100 repetitions of the whole experiment (the same sets of walks were used for all parameter settings). Note that methods with an intermediate value of n worked best.
This illustrates how the generalization of TD and Monte Carlo methods to n-step methods can potentially perform better than either of the two extreme methods.

Summary: we have discussed the random walk before, but with only five states; here we consider a random walk with 19 states. Reaching the leftmost terminal state gives a reward of -1, and reaching the rightmost gives +1; every episode starts from the center. The true state values are -0.9, -0.8, ..., 0.9, and all value estimates are initialized to 0.
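As a rough sketch of this setup (the constant names N_STATES, START_STATE, TRUE_VALUES and GAMMA are my own, not taken from the linked code):

```python
import numpy as np

# 19 non-terminal states in a row; states 0 and 20 are the terminal ends
N_STATES = 19
START_STATE = 10                   # every episode starts from the center state

# true values of the 19 non-terminal states: -0.9, -0.8, ..., 0.9
TRUE_VALUES = np.arange(-9, 10) / 10.0

GAMMA = 1.0                        # the task is undiscounted
```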

$$R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots + \gamma^{n-1} r_{t+n} + \gamma^n V_t(s_{t+n})$$

$$V(s_t) \leftarrow V(s_t) + \alpha \left[ R_t^{(n)} - V_t(s_t) \right]$$
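Translated directly into code (a minimal sketch; the function and argument names are mine, and `rewards` is assumed to hold r_{t+1}, ..., r_{t+n}):

```python
def n_step_return(rewards, bootstrap_value, n, gamma=1.0):
    # R_t^(n) = r_{t+1} + gamma*r_{t+2} + ... + gamma^(n-1)*r_{t+n} + gamma^n * V_t(s_{t+n})
    ret = sum(gamma ** k * r for k, r in enumerate(rewards[:n]))
    return ret + gamma ** n * bootstrap_value

def td_update(values, state, target, alpha):
    # V(s_t) <- V(s_t) + alpha * [R_t^(n) - V(s_t)]
    values[state] += alpha * (target - values[state])
```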

The idea is simple enough; now let's write the code.
First, to be clear, we need 100 runs. At the start of each run the value function is re-initialized, and each run consists of 10 episodes; the RMS error over those 10 episodes should gradually decrease, and we then average over them.
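The outer experiment loop might then look like the sketch below, reusing the constants from the earlier snippet and assuming a temporal_difference(values, n, alpha) function (sketched after the pseudocode) that runs one episode of online n-step TD in place:

```python
import numpy as np

RUNS = 100      # repetitions of the whole experiment
EPISODES = 10   # episodes per run; values re-initialized at the start of each run

def average_rms_error(n, alpha):
    total_error = 0.0
    for run in range(RUNS):
        values = np.zeros(N_STATES + 2)            # indices 0..20; the two ends are terminal
        for episode in range(EPISODES):
            temporal_difference(values, n, alpha)  # one episode of online n-step TD
            # RMS error over the 19 non-terminal states, measured at the end of the episode
            total_error += np.sqrt(np.mean((values[1:-1] - TRUE_VALUES) ** 2))
    return total_error / (RUNS * EPISODES)
```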

    run 100 times:
        initialize state values
        run 10 episodes:
            while not end:
                take action, get reward (and judge whether this episode ends)
                if more than n actions taken:
                    returns = sum(gamma, reward, value)
                    update state
            error += rms_error
    error /= 100 * 10
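Fleshed out in Python, one episode of online n-step TD could look roughly like this (a sketch in the spirit of the pseudocode and of the Chapter 7 code linked at the end, not a verbatim copy; note that this version keeps updating the remaining states after the episode ends, using the truncated return):

```python
import numpy as np

def temporal_difference(values, n, alpha, gamma=1.0):
    # One episode of online n-step TD on the 19-state random walk; `values` is updated in place.
    state = START_STATE
    states = [state]      # states[t] is the state visited at time t
    rewards = [0.0]       # rewards[t] is the reward received on arriving at states[t]
    T = float('inf')      # episode length, unknown until a terminal state is reached
    time = 0
    while True:
        time += 1
        if time < T:
            # random walk: step left or right with equal probability
            state = state + 1 if np.random.rand() < 0.5 else state - 1
            reward = -1.0 if state == 0 else (1.0 if state == N_STATES + 1 else 0.0)
            states.append(state)
            rewards.append(reward)
            if state == 0 or state == N_STATES + 1:
                T = time
        # once more than n steps have been taken, the state at time - n can be updated
        update_time = time - n
        if update_time >= 0:
            # n-step return, truncated at the end of the episode
            returns = 0.0
            for t in range(update_time + 1, min(T, update_time + n) + 1):
                returns += gamma ** (t - update_time - 1) * rewards[t]
            if update_time + n < T:
                # bootstrap from the current estimate of the state n steps later
                returns += gamma ** n * values[states[update_time + n]]
            s = states[update_time]
            values[s] += alpha * (returns - values[s])
        if update_time == T - 1:   # every state of the episode has now been updated
            break
```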

That is roughly the procedure. Within each episode we update as we go, storing all the states and rewards encountered along the way. When do we start updating the state values? Once the number of steps taken exceeds n, we can begin updating the earliest states: at time step time, we update the state at updatetime = time - n, i.e. value[updatetime].

returns = reward[updatetime+1] + gamma*reward[updatetime+2] + ... + gamma^(n-1)*reward[updatetime+n] + gamma^n * value[updatetime+n]

Note that if a state is only m < n steps away from the terminal state, the return is truncated to $R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots + \gamma^{m-1} r_{t+m}$.
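Isolating that truncation in a small helper (hypothetical names, same conventions as the sketch above): the reward sum stops at the end of the episode, and the bootstrap term is dropped when fewer than n steps remain:

```python
def truncated_return(rewards, states, values, update_time, n, T, gamma=1.0):
    # n-step return for the state visited at `update_time`, truncated at episode end T
    returns = 0.0
    for t in range(update_time + 1, min(T, update_time + n) + 1):
        returns += gamma ** (t - update_time - 1) * rewards[t]
    # bootstrap only if the episode has not ended within the next n steps
    if update_time + n < T:
        returns += gamma ** n * values[states[update_time + n]]
    return returns
```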


The experimental results are shown below:
[Figure: RMS error averaged over the 19 states, the first 10 episodes, and 100 runs, plotted against α for different values of n]

The horizontal axis is α, from 0.1 to 1 (apologies for the rough plot).
As can be seen, with n = 1 the RMS error is still relatively high, and with n = 2 it drops noticeably; n = 3 achieves the lowest RMS error in this experiment. Once n exceeds 15 the results are no longer very good, so looking further ahead is not always better. When n is greater than 200 the state values are never updated at all, because in this experiment an episode typically lasts a few dozen steps and rarely exceeds one hundred, so the update is never triggered. It is foreseeable, though, that even if updates did occur, the resulting RMS error would still be large.
As for α: the larger n is, the smaller the α that achieves the lowest RMS error. For example, at α = 0.5 the RMS error for n = 30 is very large, while for n = 3 it is at its best. One man's meat is another man's poison, as the saying goes.

Everything discussed so far is online n-step TD: state values are updated while acting.
There is also offline n-step TD, in which the state values are held fixed during an episode and are updated only once the episode has finished, using $V(S_t) \leftarrow V(S_t) + \sum_{t=0}^{T-1} \alpha \left[ R_t^{(n)} - V_t(S_t) \right]$.
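A sketch of this offline variant under the same setup (again with my own naming; the increments are accumulated during the episode and applied only at the end):

```python
import numpy as np

def offline_temporal_difference(values, n, alpha, gamma=1.0):
    # One episode of offline n-step TD: the value function is frozen while acting,
    # and all increments are applied together once the episode has finished.
    state = START_STATE
    states, rewards = [state], [0.0]
    while state != 0 and state != N_STATES + 1:
        state = state + 1 if np.random.rand() < 0.5 else state - 1
        rewards.append(-1.0 if state == 0 else (1.0 if state == N_STATES + 1 else 0.0))
        states.append(state)
    T = len(states) - 1                     # time of termination
    increments = np.zeros_like(values)
    for t in range(T):
        # truncated n-step return for the state visited at time t
        returns = 0.0
        for k in range(t + 1, min(T, t + n) + 1):
            returns += gamma ** (k - t - 1) * rewards[k]
        if t + n < T:
            returns += gamma ** n * values[states[t + n]]
        increments[states[t]] += alpha * (returns - values[states[t]])
    values += increments                    # apply all updates at the end of the episode
```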

The code is available at:
https://github.com/Mandalalala/Reinforcement-Learning-an-introduction/tree/master/Chapter%207

I was stuck on this code for quite a while and planned to write it up in great detail once I understood it. After figuring it out, though, it all felt so easy that I ended up writing this rather casually.
