Reinforcement Learning


https://github.com/lyuwenyu/RL




1. 

MDP (Markov Decision Process):


(S, A, P, R, γ), with policy π

S (state space)

A (action space)

P (state-transition probability)

γ (discount factor, 0 ≤ γ ≤ 1)

R (reward function)

π (policy)

G (return)
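
As a minimal sketch of the return G, assuming a finite episode and writing gamma for the discount factor listed above, the discounted sum G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ... can be computed backwards from the last reward. The function name `discounted_return` is hypothetical and used only for illustration.

```python
def discounted_return(rewards, gamma):
    """Return G_t for a finite list of rewards [R_{t+1}, R_{t+2}, ...]."""
    g = 0.0
    # Work backwards using the recursion G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Three rewards of 1 with gamma = 0.5: 1 + 0.5 * 1 + 0.25 * 1
print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1.75
```

The same backward recursion is what the Bellman equation expresses in expectation.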


Bellman equation


State-value function  v(s)

Action-value function  q(s,a)
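
For reference, the Bellman expectation equations for the two value functions can be written as follows. The notation is an assumption borrowed from common lecture-note conventions: \gamma is the discount, \pi(a|s) is the probability of taking action a in state s, \mathcal{R}_s^a is the expected immediate reward, and \mathcal{P}_{ss'}^a is the probability of moving from s to s' under action a.

```latex
\begin{align}
v_\pi(s)   &= \mathbb{E}_\pi\left[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s \right]
            = \sum_{a} \pi(a \mid s) \Big( \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a\, v_\pi(s') \Big) \\
q_\pi(s,a) &= \mathbb{E}_\pi\left[ R_{t+1} + \gamma\, q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a \right]
            = \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a')
\end{align}
```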


Optimal state-value function

Optimal action-value function

Optimal policy
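
The three optimal quantities above are usually defined as below; this is a sketch in the same assumed notation (\gamma for the discount, \mathcal{R}_s^a for the expected immediate reward, \mathcal{P}_{ss'}^a for the transition probability). The last two lines are the Bellman optimality equations.

```latex
\begin{align}
v_*(s)   &= \max_\pi v_\pi(s), \qquad q_*(s,a) = \max_\pi q_\pi(s,a) \\
\pi_*(s) &= \arg\max_{a} q_*(s,a) \\
v_*(s)   &= \max_{a} \Big( \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a\, v_*(s') \Big) \\
q_*(s,a) &= \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a \max_{a'} q_*(s',a')
\end{align}
```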


2. 

Model-based solution


Dynamic Programming


Value Iteration
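
A minimal sketch of value iteration on a known model may help here. It assumes the model is given as a nested list `P[s][a]` of `(probability, next_state, reward)` triples; that layout, the tiny two-state MDP, and all function names are hypothetical choices for illustration. Values are updated in place during each sweep.

```python
def q_value(P, v, s, a, gamma):
    """One-step lookahead: expected reward plus discounted value of the successor."""
    return sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])


def value_iteration(P, gamma=0.9, tol=1e-8):
    n_states, n_actions = len(P), len(P[0])
    v = [0.0] * n_states
    while True:
        delta = 0.0
        for s in range(n_states):
            # Bellman optimality backup: take the best action value.
            best = max(q_value(P, v, s, a, gamma) for a in range(n_actions))
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < tol:
            break
    # Read off a greedy policy from the converged values.
    policy = [max(range(n_actions), key=lambda a: q_value(P, v, s, a, gamma))
              for s in range(n_states)]
    return v, policy


# Hypothetical MDP: in state 0, action 0 stays (reward 0) and action 1 moves
# to state 1 (reward 1); state 1 is absorbing with reward 0.
P = [
    [[(1.0, 0, 0.0)], [(1.0, 1, 1.0)]],
    [[(1.0, 1, 0.0)], [(1.0, 1, 0.0)]],
]
values, policy = value_iteration(P)
print(values, policy)  # [1.0, 0.0] [1, 0]
```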


Policy Iteration:

Policy evaluation

Policy improvement (greedy)
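
The two alternating steps can be sketched as follows. This is an illustrative sketch, not a reference implementation: it assumes a deterministic policy stored as one action index per state and a model given as `P[s][a]`, a list of `(probability, next_state, reward)` triples. Those names and the small example MDP are hypothetical.

```python
def q_value(P, v, s, a, gamma):
    """Expected one-step reward plus discounted value of the next state."""
    return sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])


def policy_evaluation(P, policy, gamma, tol=1e-8):
    """Iterate the Bellman expectation backup until the values stop changing."""
    v = [0.0] * len(P)
    while True:
        delta = 0.0
        for s in range(len(P)):
            new_v = q_value(P, v, s, policy[s], gamma)
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v


def policy_iteration(P, gamma=0.9):
    n_states, n_actions = len(P), len(P[0])
    policy = [0] * n_states  # arbitrary starting policy
    while True:
        v = policy_evaluation(P, policy, gamma)
        # Greedy improvement: pick the action with the highest q-value per state.
        new_policy = [max(range(n_actions), key=lambda a: q_value(P, v, s, a, gamma))
                      for s in range(n_states)]
        if new_policy == policy:  # policy is stable, so stop
            return policy, v
        policy = new_policy


# Hypothetical MDP: state 0 can stay (reward 0) or move to absorbing state 1 (reward 1).
P = [
    [[(1.0, 0, 0.0)], [(1.0, 1, 1.0)]],
    [[(1.0, 1, 0.0)], [(1.0, 1, 0.0)]],
]
pi, v = policy_iteration(P)
print(pi, v)  # [1, 0] [1.0, 0.0]
```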


3. 

Model-free solution


Policy Evaluation

MC (Monte Carlo)

TD (Temporal Difference)
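
To make the difference between the two estimators concrete, here is a small sketch under stated assumptions: an episode is a list of `(state, reward)` pairs where the reward is the one received after leaving that state, the episode ends after the last pair, and the step size `alpha` is constant. The function names are hypothetical. MC moves each value toward the full observed return, while TD(0) moves it toward a one-step bootstrapped target.

```python
def mc_evaluate(v, episode, gamma, alpha):
    """Every-visit Monte Carlo: update V(S_t) toward the actual return G_t."""
    g = 0.0
    for state, reward in reversed(episode):
        g = reward + gamma * g
        v[state] += alpha * (g - v[state])


def td0_evaluate(v, episode, gamma, alpha):
    """TD(0): update V(S_t) toward R_{t+1} + gamma * V(S_{t+1}) after every step."""
    for t, (state, reward) in enumerate(episode):
        is_last = t == len(episode) - 1
        next_value = 0.0 if is_last else v[episode[t + 1][0]]
        target = reward + gamma * next_value
        v[state] += alpha * (target - v[state])


episode = [("A", 0.0), ("B", 1.0)]  # A -> B -> terminal, reward 1 on the last step

v_mc = {"A": 0.0, "B": 0.0}
mc_evaluate(v_mc, episode, gamma=1.0, alpha=0.5)

v_td = {"A": 0.0, "B": 0.0}
td0_evaluate(v_td, episode, gamma=1.0, alpha=0.5)

print(v_mc)  # {'A': 0.5, 'B': 0.5}
print(v_td)  # {'A': 0.0, 'B': 0.5}
```

After one episode, MC has already credited state A with the final reward, whereas TD(0) only propagates it to A once the estimate for B has become non-zero.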


4. 

On-policy

Off-policy


SARSA (on-policy)

Q-learning (off-policy)


Off-policy: a method is called off-policy when the policy being learned (the target policy) can differ from the policy being executed (the behavior policy).

On-policy: the method updates value functions strictly on the basis of experience gained from executing the (possibly non-stationary) policy that is being learned.
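
The distinction shows up in a single line of the update rule. The sketch below assumes a tabular `q[state][action]` list, a non-terminal next state, and hypothetical function names. SARSA bootstraps from the action the agent actually takes next, while Q-learning bootstraps from the greedy action regardless of what is executed.

```python
def sarsa_update(q, s, a, r, s2, a2, alpha, gamma):
    """On-policy: the target uses the next action a2 chosen by the behavior policy."""
    target = r + gamma * q[s2][a2]
    q[s][a] += alpha * (target - q[s][a])


def q_learning_update(q, s, a, r, s2, alpha, gamma):
    """Off-policy: the target uses the greedy action in s2, whatever is executed."""
    target = r + gamma * max(q[s2])
    q[s][a] += alpha * (target - q[s][a])


# Same transition (s=0, a=0, r=0, s2=1) for both methods. Suppose the agent's
# next action is the exploratory a2=0, although a=1 has the higher value in s2.
q_sarsa = [[0.0, 0.0], [1.0, 3.0]]
sarsa_update(q_sarsa, s=0, a=0, r=0.0, s2=1, a2=0, alpha=0.5, gamma=1.0)

q_learn = [[0.0, 0.0], [1.0, 3.0]]
q_learning_update(q_learn, s=0, a=0, r=0.0, s2=1, alpha=0.5, gamma=1.0)

print(q_sarsa[0][0])  # 0.5
print(q_learn[0][0])  # 1.5
```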



-----------------------reference-----------------------------

1. https://www.youtube.com/watch?v=0g4j2k_Ggc4

2. http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html

3. http://www.algorithmdog.com/reinforcement-learning-value-function-approximation
