Summary of Chapter 1, "The Reinforcement Learning Problem", of Reinforcement Learning: An Introduction
Source: Internet. Editor: 程序博客网. Time: 2024/06/05 18:30
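A new student has joined our group and I need to help him get started with RL, so I chose to begin with Silver's course.
For myself, I am adding the requirement of carefully reading Reinforcement Learning: An Introduction.
I did not read it very carefully before, so this time I hope to be more thorough and to write a brief summary of the corresponding key points.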
Reinforcement learning problems involve learning what to do - how to map situations to actions - so as to maximize a numerical reward signal.
RL is different from supervised learning and unsupervised learning:
There is no supervisor (to tell the agent what is best), only a reward signal; the agent must discover which actions yield the most reward by trying them out
Actions influence the environment and the subsequent data, so the data are not i.i.d.
Feedback is (sometimes) delayed, not instantaneous
There is a trade-off between exploration and exploitation
For a stochastic task, each action must be tried many times to gain a reliable estimate of its expected reward
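The last two points can be illustrated with a small sketch. It is not from the book's Chapter 1 text; it assumes a hypothetical Gaussian multi-armed bandit and an epsilon-greedy agent (the names `run_bandit`, `true_means`, and `epsilon` are made up for illustration). Each arm's expected reward is estimated by a running average, which only becomes reliable after the arm has been tried many times.

```python
import random

def run_bandit(true_means, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a hypothetical Gaussian bandit (illustrative only)."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    estimates = [0.0] * n_arms  # running average reward of each arm
    counts = [0] * n_arms       # number of times each arm has been tried
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: pick a random arm
        else:
            # exploit: pick the arm with the highest current estimate
            arm = max(range(n_arms), key=lambda a: estimates[a])
        # Rewards are noisy, so a single sample says little about the mean.
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        # Incremental sample average: new = old + (reward - old) / n
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts

estimates, counts = run_bandit([0.0, 2.0, 1.0])
print(estimates, counts)
```

With `epsilon=0` the agent never explores and can lock onto whichever arm it tried first; a small positive `epsilon` keeps every estimate improving at the cost of some suboptimal pulls.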
Elements of RL:
policy: the agent's behaviour function, a mapping from states to actions
reward signal: Reward Hypothesis: all goals can be described by the maximisation of expected cumulative reward
value function: Whereas the reward signal indicates what is good in an immediate sense, a value function specifies what is good in the long run.
Roughly speaking, the value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state.
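(sometimes) model of the environment: P(s'|s,a), which predicts the next state, and R(s,a), which predicts the expected immediate reward. Models are used for planning (deciding on a course of action without actually interacting with the environment)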
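To make "value" and "planning with a model" concrete, here is a minimal sketch. It assumes a made-up two-state model, a discount factor `gamma` (not discussed in these notes), and value iteration as one possible planning method; none of these choices come from the chapter itself. The value of a state is computed purely from the model, with no interaction with a real environment.

```python
# Hypothetical model: P[s][a] maps next state -> probability,
# R[s][a] is the expected immediate reward for taking a in s.
P = {
    "low":  {"wait": {"low": 1.0}, "charge": {"high": 1.0}},
    "high": {"wait": {"high": 1.0}, "search": {"high": 0.7, "low": 0.3}},
}
R = {
    "low":  {"wait": 0.0, "charge": 0.0},
    "high": {"wait": 1.0, "search": 2.0},
}

def backup(P, R, V, s, a, gamma):
    """Expected immediate reward plus discounted value of the next state."""
    return R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Planning: sweep over states until the values stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            best = max(backup(P, R, V, s, a, gamma) for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

def greedy_policy(P, R, V, gamma=0.9):
    """Policy: map each state to the action with the best one-step backup."""
    return {s: max(P[s], key=lambda a: backup(P, R, V, s, a, gamma)) for s in P}

V = value_iteration(P, R)
print(V, greedy_policy(P, R, V))
```

In this toy model "wait" gives reward in the "high" state while "charge" gives none in the "low" state, yet "charge" is still the better choice there because it leads to a state of higher value. That is the difference between reward (immediate) and value (long run).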
1.4 Limitations and Scope
This section spends a lot of space on how genetic algorithms/evolutionary methods (EM) and optimization methods (OM) differ from RL, pointing out:
Evolutionary methods suit problems where the space of policies is small, or where the agent cannot accurately sense the state of its environment. However, EM looks only at the final outcome of a policy and ignores what happens along the way (the details of individual behavioral interactions), so it is less efficient than RL: they do not use the fact that the policy they are searching for is a function from states to actions; they do not notice which states an individual passes through during its lifetime, or which actions it selects. In some cases this information can be misleading (e.g., when states are misperceived), but more often it should enable more efficient search.
1.5 An Extended Example: Tic-Tac-Toe
This section uses an example to show that classical AI methods such as minimax, dynamic programming, and evolutionary methods are not well suited even to an RL problem this simple.
minimax: the classical "minimax" solution from game theory is not correct here because it assumes a particular way of playing by the opponent.
dynamic programming: can compute an optimal solution for any opponent, but requires as input a complete specification of that opponent
evolutionary method: To evaluate a policy an evolutionary method holds the policy fixed and plays many games against the opponent, or simulates many games using a model of the opponent. The frequency of wins gives an unbiased estimate of the probability of winning with that policy, and can be used to direct the next policy selection. But each policy change is made only after many games, and only the final outcome of each game is used: what happens during the games is ignored. For example, if the player wins, then all of its behavior in the game is given credit, independently of how specific moves might have been critical to the win. Credit is even given to moves that never occurred!
RL: Value function methods, in contrast, allow individual states to be evaluated. In the end, evolutionary and value function methods both search the space of policies, but learning a value function takes advantage of information available during the course of play.
A key feature of the reinforcement learning solution is that it can achieve the effects of planning and lookahead without using a model of the opponent and without conducting an explicit search over possible sequences of future states and actions. (Viewed from the perspective of TD learning, RL indeed gets a degree of lookahead without needing a model.)
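The temporal-difference update that the book uses for the tic-tac-toe player is V(s) <- V(s) + alpha * [V(s') - V(s)], where s' is the state after the move. The sketch below applies that update to a made-up sequence of positions rather than to a real tic-tac-toe game; the state names, the 0.5 default value, and the helper `replay` are illustrative assumptions.

```python
from collections import defaultdict

def td_update(values, state, next_state, alpha=0.1):
    """Move V(state) a fraction alpha of the way toward V(next_state)."""
    values[state] += alpha * (values[next_state] - values[state])

def replay(episode, values, passes, alpha=0.1):
    """Apply the TD update along the same sequence of positions several times."""
    for _ in range(passes):
        for state, next_state in zip(episode, episode[1:]):
            td_update(values, state, next_state, alpha)

# Unknown positions start at 0.5 (estimated probability of winning);
# the hypothetical terminal position "win" has value 1.
values = defaultdict(lambda: 0.5)
values["win"] = 1.0

episode = ["s0", "s1", "s2", "win"]
replay(episode, values, passes=50)
print(dict(values))
```

After a single pass only the position just before the win has changed. With repeated play the information propagates back to earlier positions, which is how the value table comes to encode lookahead without an explicit search or a model of the opponent.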
1.7 History of Reinforcement Learning
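Three main threads of RL research: learning by trial and error; optimal control and its solution using value functions and dynamic programming (planning); and temporal-difference (TD) methods. The section then surveys the work of many researchers.
Below is what I think one should know from Silver's course, "Lecture 1: Introduction to Reinforcement Learning":
6: When RL comes up, understand that it is a branch of ML, that it is about decision making, that it overlaps with optimal control, and that it draws on mathematics, CS, and other fields
8: Characteristics that make RL different from other ML paradigms: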
There is no supervisor (to tell what is best!), only a reward signal
Feedback is delayed, not instantaneous
Time really matters (sequential, non i.i.d data)
Agent’s actions affect the subsequent data it receives
13: Reward Hypothesis: all goals can be described by the maximisation of expected cumulative reward
18: History is the sequence of observations, rewards, and actions, H_t = O_1, R_1, A_1, ..., A_{t-1}, O_t, R_t
State is the information used to determine what happens next, S_t = fun(H_t)
A Markov state contains all useful information from the history: a state S_t is Markov iff P[S_{t+1} | S_t] = P[S_{t+1} | S_1, ..., S_t],
i.e., the future is independent of the past given the present
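23: Full observability: the agent directly observes the environment state, i.e., O_t = S_t (agent state = environment state); formally, this is a Markov decision process (MDP)
Partial observability: the agent indirectly observes the environment, O_t != S_t; formally, this is a partially observable Markov decision process (POMDP)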
Agent must construct its own state representation:
Complete history: S_t = H_t
Beliefs of environment state: S_t = (P[S = s1]; ...; P[S = sn])
RNN: S_t = σ(W * S_{t-1} + V * O_t)
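The recurrent option above can be sketched in a few lines of plain Python. This is only an illustration of the formula S_t = σ(W * S_{t-1} + V * O_t) with σ taken to be the logistic sigmoid; the weight matrices, dimensions, and function names are made-up assumptions, and in practice W and V would be learned.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_state_update(W, V, prev_state, observation):
    """Compute S_t = sigmoid(W * S_{t-1} + V * O_t), elementwise sigmoid."""
    pre = [a + b for a, b in zip(matvec(W, prev_state), matvec(V, observation))]
    return [sigmoid(z) for z in pre]

# Hypothetical sizes: 2-dimensional agent state, 3-dimensional observation.
W = [[0.5, -0.2], [0.1, 0.3]]
V = [[1.0, 0.0, -1.0], [0.0, 0.5, 0.5]]

state = [0.0, 0.0]
for obs in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]):
    state = rnn_state_update(W, V, state, obs)  # state summarises the history so far
print(state)
```

Unlike the complete history S_t = H_t, this state has a fixed size no matter how long the agent has been running, at the cost of being only a learned summary of the past.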
25: Four main subelements of a reinforcement learning system:
a policy: the agent's behaviour function
a reward signal: indicates what is good in an immediate sense, the primary basis for altering the policy
a value function: specifies what is good in the long run, the expected total future reward
optionally, a model of the environment: something that mimics the behavior of the environment, P[S'|S,A] for the next state and R[S,A] for the expected immediate reward
34: Value-based, policy-based, actor-critic; model-free, model-based
37: Planning and (reinforcement) learning. Most board games are planning problems (the environment/model/rules are known, and tree search is used)
40: Exploration and exploitation
43: Prediction and control