Reinforcement Learning Study Notes (2)
Source: Internet · Editor: 程序博客网 · Date: 2024/06/04 17:48
I've been studying reinforcement learning recently, so I'm writing up my notes here. My knowledge is limited and there may well be mistakes; corrections from readers are very welcome.
1. Model-Based Reinforcement Learning
Model-based reinforcement learning is the branch of reinforcement learning in which the Markov decision process underlying the problem is fully known: we have complete access to the states, actions, transitions, and rewards.
As in the previous note, this section uses the example of a robot searching for a triangle to illustrate the concepts, as shown in the figure below.
In this problem, each position in the figure is a distinct state, and together these positions make up the state set.
The two main algorithms for model-based reinforcement learning are policy iteration (Policy Iteration) and value iteration (Value Iteration). The following sections introduce each in turn.
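The code in the sections below imports an Mdp class from a module the post doesn't show. Purely to illustrate the interface that code relies on (`state`, `terminalstate`, `action`, `gamma`, and a deterministic `transform` method), here is a hypothetical stand-in: a one-dimensional corridor whose rightmost cell holds the triangle. The layout, actions, and rewards are my own assumptions, not the post's.

```python
class Mdp:
    """Hypothetical stand-in for the post's Mdp module: a corridor of
    states 1..5 with a terminal goal state 6 (the triangle) on the right."""

    def __init__(self):
        self.state = list(range(1, 7))   # states 1..6
        self.terminalstate = [6]         # state 6 holds the triangle
        self.action = ["left", "right"]
        self.gamma = 0.8                 # discount factor

    def transform(self, state, action):
        """Deterministic transition: returns (is_terminal, next_state, reward)."""
        next_state = state - 1 if action == "left" else state + 1
        next_state = max(1, min(6, next_state))          # stay inside the corridor
        reward = 1.0 if next_state in self.terminalstate else 0.0
        return next_state in self.terminalstate, next_state, reward
```

Any class exposing this interface can be dropped into the policy-iteration and value-iteration code below.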
2. Policy Iteration
The idea of policy iteration is: first initialize a policy at random, then alternate between evaluating the current policy and improving it, until the policy stops changing.
2.1 Policy Evaluation
Policy evaluation is based on the Bellman equation. With the deterministic transitions used in this example, the value of a state s under policy π is

    v_π(s) = r + γ · v_π(s')

where s' is the state reached by taking action π(s) in state s, and r is the reward received. Notice that the value of a state is defined in terms of the values of its successor states, so we can start from arbitrary values and repeatedly apply this update until the values stop changing.
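As a standalone sketch of this update (on a hypothetical three-state chain, not the post's example), repeatedly backing up each state's value under a fixed policy converges in a few sweeps:

```python
# Iterative policy evaluation on a deterministic chain 0 -> 1 -> 2,
# where state 2 is terminal, entering it yields reward 1.0, gamma = 0.9.
gamma = 0.9
v = [0.0, 0.0, 0.0]
# fixed policy "always move right": state -> (next_state, reward)
transition = {0: (1, 0.0), 1: (2, 1.0)}

for _ in range(100):
    delta = 0.0
    for s, (s_next, r) in transition.items():
        new_v = r + gamma * v[s_next]   # Bellman backup: v(s) = r + gamma * v(s')
        delta += abs(new_v - v[s])
        v[s] = new_v
    if delta < 1e-6:                    # values have converged
        break
# converges to v[1] = 1.0 and v[0] = 0.9
```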
2.2 Policy Improvement
Deriving a new policy from the state values is called policy improvement. For each state s, we greedily pick the action whose one-step return r + γ·v(s') is largest, i.e. π'(s) = argmax over actions of [r + γ·v(s')].
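This greedy step can be sketched in a few lines (a hypothetical two-action example; the values and rewards are made up for illustration):

```python
# Greedy improvement at one state: action "a" leads to a state worth 1.0
# with reward 0; action "b" leads to a state worth 0.2 with reward 0.5.
gamma = 0.9
v = {1: 1.0, 2: 0.2}                      # current state values
transitions = {"a": (1, 0.0), "b": (2, 0.5)}   # action -> (next_state, reward)

# pi'(s) = argmax_a [ r + gamma * v(s') ]
best = max(transitions,
           key=lambda a: transitions[a][1] + gamma * v[transitions[a][0]])
# "a": 0 + 0.9 * 1.0 = 0.90;  "b": 0.5 + 0.9 * 0.2 = 0.68  ->  best == "a"
```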
An earlier theorem established that an optimal policy exists, and this improvement step guarantees that the new policy is at least as good as the old one. Alternating evaluation and improvement therefore converges to the optimal policy.
2.3 Policy Iteration Code
from Mdp import Mdp


class PolicyIteration:
    def __init__(self, mdp):
        # state values, indexed by state number
        self.v = [0.0 for _ in range(len(mdp.state) + 1)]
        # policy: maps each non-terminal state to an action
        self.pi = dict()
        for state in mdp.state:
            if state in mdp.terminalstate:
                continue
            self.pi[state] = mdp.action[0]  # arbitrary initial policy

    def __policy_evaluate(self, mdp):
        max_iteration_num = 1000
        for _ in range(max_iteration_num):
            delta = 0.0
            for state in mdp.state:
                if state in mdp.terminalstate:
                    continue
                action = self.pi[state]
                is_terminal, next_state, reward = mdp.transform(state, action)
                value = reward + mdp.gamma * self.v[next_state]
                delta += abs(value - self.v[state])
                self.v[state] = value
            if delta < 1e-6:  # values under the current policy have converged
                break

    def __policy_improve(self, mdp):
        for state in mdp.state:
            if state in mdp.terminalstate:
                continue
            # greedily pick the action with the largest one-step return
            a = mdp.action[0]
            is_terminal, next_state, reward = mdp.transform(state, a)
            value = reward + mdp.gamma * self.v[next_state]
            for action in mdp.action:
                is_terminal, next_state, reward = mdp.transform(state, action)
                if value < reward + mdp.gamma * self.v[next_state]:
                    value = reward + mdp.gamma * self.v[next_state]
                    a = action
            self.pi[state] = a

    def policy_iteration(self, mdp):
        max_iteration_num = 100
        for _ in range(max_iteration_num):
            self.__policy_evaluate(mdp)
            self.__policy_improve(mdp)


if __name__ == "__main__":
    mdp = Mdp()
    policy_value = PolicyIteration(mdp)
    policy_value.policy_iteration(mdp)
    print("value:")
    for state in range(1, 6):
        print("state:%d value:%f" % (state, policy_value.v[state]))
    print("policy:")
    for state in range(1, 6):
        print("state:%d policy:%s" % (state, policy_value.pi[state]))
Policy iteration alternates between policy evaluation and policy improvement until the policy converges, at which point it is the optimal policy. The figure below shows the optimal solution that policy iteration finds for the robot-finding-the-triangle problem.
3. Value Iteration
Policy iteration must repeatedly alternate between evaluation and improvement. The improvement step is just a greedy choice of the best action, but the evaluation step itself iterates until the state values converge, which is expensive. Value iteration merges the two steps into a single loop: each sweep backs up the value of the greedy action directly, performing evaluation and improvement at the same time.
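The merged update can be sketched on the same kind of hypothetical toy chain used above: each sweep takes a max over actions, updating the value and the policy together rather than evaluating a fixed policy to convergence first.

```python
# Value iteration on a chain: states 0 and 1 are non-terminal, state 2 is
# terminal; each state offers "stay" (reward 0) or "right"; gamma = 0.9.
gamma = 0.9
v = [0.0, 0.0, 0.0]
pi = {0: None, 1: None}
actions = {
    0: {"stay": (0, 0.0), "right": (1, 0.0)},
    1: {"stay": (1, 0.0), "right": (2, 1.0)},
}

for _ in range(100):
    delta = 0.0
    for s, acts in actions.items():
        # back up the value of the greedy action directly
        best_a, best_q = None, float("-inf")
        for a, (s_next, r) in acts.items():
            q = r + gamma * v[s_next]
            if q > best_q:
                best_a, best_q = a, q
        delta += abs(best_q - v[s])
        v[s] = best_q       # evaluation and ...
        pi[s] = best_a      # ... improvement in the same sweep
    if delta < 1e-6:
        break
# converges to pi = {0: "right", 1: "right"}, v = [0.9, 1.0, 0.0]
```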
3.1 Value Iteration Code
from Mdp import Mdp


class ValueIteration:
    def __init__(self, mdp):
        # state values, indexed by state number
        self.v = [0.0 for _ in range(len(mdp.state) + 1)]
        # policy: maps each non-terminal state to an action
        self.pi = dict()
        for state in mdp.state:
            if state in mdp.terminalstate:
                continue
            self.pi[state] = mdp.action[0]  # arbitrary initial policy

    def value_iteration(self, mdp):
        max_iteration_num = 1000
        for _ in range(max_iteration_num):
            delta = 0.0
            for state in mdp.state:
                if state in mdp.terminalstate:
                    continue
                # back up the value of the greedy action directly
                a = mdp.action[0]
                is_terminal, next_state, reward = mdp.transform(state, a)
                value = reward + mdp.gamma * self.v[next_state]
                for action in mdp.action:
                    is_terminal, next_state, reward = mdp.transform(state, action)
                    if value < reward + mdp.gamma * self.v[next_state]:
                        value = reward + mdp.gamma * self.v[next_state]
                        a = action
                delta += abs(value - self.v[state])
                self.v[state] = value
                self.pi[state] = a
            if delta < 1e-6:  # values have converged
                break


if __name__ == "__main__":
    mdp = Mdp()
    policy_value = ValueIteration(mdp)
    policy_value.value_iteration(mdp)
    print("value:")
    for state in range(1, 6):
        print("state:%d value:%f" % (state, policy_value.v[state]))
    print("policy:")
    for state in range(1, 6):
        print("state:%d policy:%s" % (state, policy_value.pi[state]))
Running the value iteration code on the robot-finding-the-triangle example produces the result shown in the figure below, which matches the result of policy iteration.