"Reinforcement Learning: An Introduction", Chapter 2 "Multi-armed Bandits": Summary


A new colleague joined the group and needs an introduction to RL, so we decided to start from Silver's course.

For myself, I am adding the requirement of reading "Reinforcement Learning: An Introduction" carefully.

Since I did not read it very carefully the first time, this time I hope to be more thorough and write a short summary of the corresponding knowledge points.




K-armed bandit problem:

    Consider the following learning problem. You are faced repeatedly with a choice among k different options, or actions. After each choice you receive a numerical reward chosen from a stationary probability distribution that depends on the action you selected. Your objective is to maximize the expected total reward over some time period, for example, over 1000 action selections, or time steps. 

    In other words: within a fixed number of time steps, find the best action (each action has its own reward distribution, so pulling an action once does not reveal its true value exactly). Finding the best ad-placement strategy under a fixed budget, or the best treatment plan under a fixed budget, can be viewed approximately as problems of this kind. As the book puts it: "Another analogy is that of a doctor choosing between experimental treatments for a series of seriously ill patients. Each action selection is a treatment selection, and each reward is the survival or well-being of the patient."
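
As a concrete anchor, here is a minimal sketch (my own illustration, not code from the book) of a stationary k-armed bandit environment in Python; the Gaussian true values and unit-variance rewards mirror the book's 10-armed testbed, while the class and method names are assumptions of mine:

    import numpy as np

    class KArmedBandit:
        """Stationary k-armed bandit: each action a has a fixed true value q_star[a]."""
        def __init__(self, k=10, seed=0):
            self.rng = np.random.default_rng(seed)
            self.q_star = self.rng.normal(0.0, 1.0, size=k)  # true action values

        def pull(self, action):
            # a single pull returns a noisy reward ~ N(q_star[action], 1),
            # so one visit never reveals the action's true value exactly
            return self.rng.normal(self.q_star[action], 1.0)

    env = KArmedBandit()
    print(env.pull(3))  # one noisy reward for action 3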



Consider the action-value function:

    Q_t(a) = sum_{i=1}^{t-1} [ R_i * 1(A_i = a) ] / sum_{i=1}^{t-1} 1(A_i = a)

   By the law of large numbers, this sample-average method of computing Q(a) is guaranteed to converge: as the denominator goes to infinity, Q_t(a) converges to q*(a).
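
A small sketch (mine, using the notation above) of the sample-average method in its incremental form, Q_{n+1} = Q_n + (1/n)(R_n - Q_n), which is algebraically the same as averaging all rewards observed for that action:

    import numpy as np

    k = 10
    Q = np.zeros(k)  # action-value estimates Q_t(a)
    N = np.zeros(k)  # visit counts N_t(a)

    def update_sample_average(action, reward):
        # incremental mean: equivalent to sum(R_i * 1(A_i=a)) / sum(1(A_i=a))
        N[action] += 1
        Q[action] += (reward - Q[action]) / N[action]

    update_sample_average(0, 1.0)
    update_sample_average(0, 0.0)
    print(Q[0])  # 0.5, the average of the two rewards seen for action 0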



exploration and exploitation: pure exploitation is generally not good; some exploration is needed.

   ε-greedy action selection: despite the name, it is not "greedy with probability ε"; rather, it explores with probability ε and acts greedily the rest of the time. It is also possible to reduce ε over time. (A sketch of this rule, together with UCB, follows the next paragraph.)

    Upper-Confidence-Bound action selection (UCB): A_t = argmax_a [ Q_t(a) + c * sqrt(ln t / N_t(a)) ]; select among the non-greedy actions according to their potential for actually being optimal, taking into account both how close their estimates are to being maximal and the uncertainties in those estimates. UCB appears to be one of the stronger methods for bandit problems, but it is hard to extend to full RL: "it is more difficult than e-greedy to extend beyond bandits to the more general reinforcement learning settings".
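
For concreteness, here are minimal sketches of both selection rules (my own code; Q and N are assumed to be NumPy arrays of estimates and visit counts, and untried actions are treated as maximally uncertain in the UCB rule):

    import numpy as np

    def epsilon_greedy(Q, epsilon, rng):
        # with probability epsilon EXPLORE (uniform random action),
        # otherwise exploit the currently greedy action
        if rng.random() < epsilon:
            return int(rng.integers(len(Q)))
        return int(np.argmax(Q))

    def ucb(Q, N, t, c=2.0):
        # A_t = argmax_a [ Q_t(a) + c * sqrt(ln t / N_t(a)) ]
        bonus = c * np.sqrt(np.log(t) / np.maximum(N, 1e-12))
        bonus[N == 0] = np.inf  # never-tried actions get selected first
        return int(np.argmax(Q + bonus))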




Exercise 2.1: The greedy method generally fails to find the best action; ε-greedy will eventually (after enough steps) identify the best action; but the larger ε is, the more exploration happens (and once the best action has been found, further exploration is wasted), so the probability of executing the best action is lower and the cumulative reward ends up smaller than with a small ε. The conclusion is:
in terms of cumulative reward: ε = 0.01 performs best;
in terms of cumulative probability of selecting the best action: ε = 0.01 performs best; asymptotically it selects the optimal action with probability 1 - ε + ε/k = 0.991, i.e. about 99% on the 10-armed testbed.



The general incremental update rule (the same form that TD learning uses later): NewEstimate <-- OldEstimate + StepSize * [Target - OldEstimate]

    Essentially this reduces the estimated error (Target - OldEstimate) by taking a step toward the Target (it helps to picture this as moving partway along a vector toward the target). Note that even though the Target may be noisy, the target is presumed to indicate a desirable direction in which to move.



Q_{n+1} = Q_n + α [R_n - Q_n]
        = (1 - α)^n Q_1 + sum_{i=1}^{n} α (1 - α)^(n-i) R_i

    This shows that a constant step size is essentially a weighted average of the rewards from the different steps, with more recent rewards getting exponentially larger weights ==> it can track non-stationary problems. "In such cases it makes sense to weight recent rewards more heavily than long-past ones."
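
A quick numerical check (my own sketch) that the incremental constant-α update and the closed-form weighted average above are the same thing:

    alpha, Q1 = 0.1, 0.0
    rewards = [1.0, 2.0, 0.5, 3.0]

    # incremental form: Q <- Q + alpha * (R - Q)
    Q = Q1
    for r in rewards:
        Q += alpha * (r - Q)

    # closed form: (1 - alpha)^n * Q_1 + sum_i alpha * (1 - alpha)^(n - i) * R_i
    n = len(rewards)
    Q_closed = (1 - alpha) ** n * Q1 + sum(
        alpha * (1 - alpha) ** (n - i) * r for i, r in enumerate(rewards, start=1)
    )

    print(Q, Q_closed)  # equal up to floating point; recent rewards carry larger weights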

There are two more very important concepts: non-deterministic and non-stationary.

    non-deterministic: the value of an action is described by a distribution rather than a single fixed number (and that distribution does not change); the action's true value is fixed but individual outcomes are uncertain (the simplest way to think about the uncertainty is as a distribution).

    non-stationary: the distribution of an action's value keeps changing over time (if the action corresponds to a single value, that value keeps changing); the action's true value changes over time (whether it is a whole distribution or a single number that changes).

    Real problems are usually both non-stationary and non-deterministic. Even if the environment is stationary and deterministic, exploration is still necessary, because "the learner faces a set of banditlike decision tasks each of which changes over time as learning proceeds and the agent's policy changes". Put plainly: even in a fully deterministic world you still have to explore in order to learn how the other actions perform; otherwise, how would you know which one is best?
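
As an illustration of the non-stationary case (my own sketch, similar in spirit to the random-walk testbed the book suggests in its exercises), the true values can be made to drift a little at every step:

    import numpy as np

    class NonStationaryBandit:
        def __init__(self, k=10, walk_std=0.01, seed=0):
            self.rng = np.random.default_rng(seed)
            self.q_star = np.zeros(k)     # true values, all start equal
            self.walk_std = walk_std

        def pull(self, action):
            reward = self.rng.normal(self.q_star[action], 1.0)   # non-deterministic
            self.q_star += self.rng.normal(0.0, self.walk_std,
                                           size=self.q_star.shape)  # non-stationary drift
            return reward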



In practice we often consider gradually decreasing the step size α; let α(n) denote the step size at step n. Which step-size sequences {α(n)} guarantee that the computed Q converges? A well-known result in stochastic approximation theory gives the conditions required to assure convergence with probability 1: sum_n α(n) = infinity (the steps are large enough to eventually overcome any initial conditions or random fluctuations) and sum_n α(n)^2 < infinity (eventually the steps become small enough to assure convergence).

        In the sample-average method, α(n) = 1/n satisfies both conditions and therefore guarantees convergence. A constant step size α(n) = α violates the second condition, so convergence is not guaranteed. The upside, as analyzed above, is that a constant α weights the rewards R_i with more recent rewards carrying larger weight, i.e. the estimate Q keeps responding to the most recently received rewards; "this is actually desirable in a non-stationary environment, and problems that are effectively non-stationary are the norm in reinforcement learning". In addition, "sequences of step-size parameters that meet the above conditions often converge very slowly or need considerable tuning". In short, the constant step size α(n) = α is actually the more common choice in practice; step-size sequences that satisfy the convergence conditions are often used in theoretical work but seldom in applications and empirical research.
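
A small sketch (mine) contrasting the two schedules: α(n) = 1/n satisfies both conditions and settles down, while a constant α violates the second one and keeps responding to recent rewards:

    import numpy as np

    rng = np.random.default_rng(0)

    def run(step_size_fn, steps=10000, true_value=1.0):
        # repeatedly estimate a fixed noisy target with the given step-size schedule
        Q = 0.0
        for n in range(1, steps + 1):
            reward = rng.normal(true_value, 1.0)
            Q += step_size_fn(n) * (reward - Q)
        return Q

    print(run(lambda n: 1.0 / n))  # sample average: converges toward 1.0
    print(run(lambda n: 0.1))      # constant alpha: keeps fluctuating around 1.0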


2.7 Gradient Bandit Algorithms

    Gradient bandit algorithms are essentially a (stochastic) gradient ascent method and have robust convergence properties.

    The section also introduces the concept of a baseline: the only requirement is that it be independent of the action. Choosing V(s) as the baseline is very common (since a bandit has only one state, this is just average(R)); it depends only on the state and is therefore independent of the action.
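
A minimal sketch of the gradient bandit update written from the textbook equations (the variable names and the choice of a sample-average baseline are mine): preferences H define a softmax policy, and the average reward serves as the action-independent baseline.

    import numpy as np

    k, alpha = 10, 0.1
    H = np.zeros(k)     # action preferences
    baseline = 0.0      # running average reward (independent of the action)
    t = 0

    def softmax(h):
        e = np.exp(h - h.max())
        return e / e.sum()

    def gradient_bandit_update(action, reward):
        global H, baseline, t
        t += 1
        baseline += (reward - baseline) / t     # sample-average baseline
        pi = softmax(H)
        # H(A_t) += alpha * (R - baseline) * (1 - pi(A_t)); all other actions move the opposite way
        H += alpha * (reward - baseline) * (np.eye(k)[action] - pi)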



2.8 Associative Search (Contextual Bandits)

        bandit problem (nonassociative search): a single state; only the immediate reward is considered
        associative search (contextual bandits): multiple states; only the immediate reward is considered
        reinforcement learning: multiple states; long-term value is considered

        Associative search tasks are intermediate between the k-armed bandit problem and the full reinforcement learning problem. They are like the full reinforcement learning problem in that they involve learning a policy, but like our version of the k-armed bandit problem in that each action affects only the immediate reward. If actions are allowed to affect the next situation as well as the reward, then we have the full reinforcement learning problem.
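
A minimal sketch (my own illustration) of associative search: keep a separate value estimate per observed state/context, but still learn from the immediate reward only.

    import numpy as np

    n_states, k = 3, 10
    Q = np.zeros((n_states, k))   # one row of estimates per context
    N = np.zeros((n_states, k))
    rng = np.random.default_rng(0)

    def act(state, epsilon=0.1):
        # epsilon-greedy within the current context
        if rng.random() < epsilon:
            return int(rng.integers(k))
        return int(np.argmax(Q[state]))

    def update(state, action, reward):
        # sample-average update for the (state, action) pair
        N[state, action] += 1
        Q[state, action] += (reward - Q[state, action]) / N[state, action]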







