Reinforcement Learning


https://github.com/lyuwenyu/RL

MDP (Markov Decision Process):

(S, A, P, R, γ), with policy π

S (state)

A (action)

P (state transition probability)

R (reward)

γ (discount factor)

π (policy)

G (return)
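As a small illustration of the return, the sketch below computes the discounted sum of a reward sequence. The function name `discounted_return` and the convention that `rewards` lists the rewards received from the current step onward are assumptions made for this example, and `gamma` stands for the discount factor.

```python
def discounted_return(rewards, gamma):
    """Return G = r_1 + gamma * r_2 + gamma^2 * r_3 + ... for a finite reward list."""
    g = 0.0
    # Accumulate from the last reward backwards: G_t = r_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
```

Working backwards avoids computing powers of the discount factor explicitly.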

Bellman equation

State-value function v(s)

Action-value function q(s,a)
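For reference, the Bellman expectation equations for the two value functions can be written as below. This is the standard textbook form; the notation is an assumption, since the notes do not fix one: \gamma is the discount factor, \pi(a|s) the policy, P(s'|s,a) the transition probability, and R(s,a) the expected immediate reward.

```latex
\begin{align}
v_\pi(s) &= \mathbb{E}_\pi\left[ G_t \mid S_t = s \right]
          = \sum_{a} \pi(a \mid s) \Big( R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, v_\pi(s') \Big) \\
q_\pi(s,a) &= \mathbb{E}_\pi\left[ G_t \mid S_t = s, A_t = a \right]
            = R(s,a) + \gamma \sum_{s'} P(s' \mid s,a) \sum_{a'} \pi(a' \mid s')\, q_\pi(s',a')
\end{align}
```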

Optimal state-value function

Optimal action-value function

Optimal policy
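The optimal quantities can be summarized by the Bellman optimality equations. Again this is the standard textbook form under assumed notation (\gamma discount factor, P(s'|s,a) transition probability, R(s,a) expected immediate reward), not something spelled out in the notes.

```latex
\begin{align}
v_*(s) &= \max_{a} \Big( R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, v_*(s') \Big) \\
q_*(s,a) &= R(s,a) + \gamma \sum_{s'} P(s' \mid s,a) \max_{a'} q_*(s',a') \\
\pi_*(s) &= \arg\max_{a} q_*(s,a)
\end{align}
```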

Model-based solution

Dynamic Programming

Value Iteration
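A minimal sketch of value iteration on a known model might look like the following. The three-state chain MDP, the `P[s][a]` layout (a list of `(probability, next_state, reward)` tuples), and the function name are all assumptions chosen for illustration.

```python
def value_iteration(P, gamma=0.9, tol=1e-8):
    """Repeatedly apply the Bellman optimality backup until values stop changing."""
    v = [0.0] * len(P)

    def q(s, a):
        # Expected one-step return of taking action a in state s under the current v.
        return sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])

    while True:
        delta = 0.0
        for s in range(len(P)):
            best = max(q(s, a) for a in range(len(P[s])))
            delta = max(delta, abs(best - v[s]))
            v[s] = best  # in-place update
        if delta < tol:
            break
    # Extract a greedy policy from the converged values.
    policy = [max(range(len(P[s])), key=lambda a: q(s, a)) for s in range(len(P))]
    return v, policy


# Hypothetical chain: action 0 moves left (or stays), action 1 moves right.
# Entering terminal state 2 from state 1 pays reward 1.
P = [
    [[(1.0, 0, 0.0)], [(1.0, 1, 0.0)]],
    [[(1.0, 0, 0.0)], [(1.0, 2, 1.0)]],
    [[(1.0, 2, 0.0)], [(1.0, 2, 0.0)]],  # absorbing terminal state
]
v, policy = value_iteration(P)
print(v, policy)  # [0.9, 1.0, 0.0] [1, 1, 0]
```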

Policy Iteration:

Policy evaluation

Policy improvement (greedy)
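The two steps of policy iteration can be sketched as follows: evaluate the current policy, then improve it greedily, and stop once the policy no longer changes. The toy chain MDP and the `P[s][a]` layout (lists of `(probability, next_state, reward)` tuples) are assumptions for this example only.

```python
def q_value(P, v, s, a, gamma):
    """Expected one-step return of action a in state s under values v."""
    return sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])


def policy_evaluation(P, policy, gamma, tol=1e-8):
    """Iteratively compute v for a fixed deterministic policy."""
    v = [0.0] * len(P)
    while True:
        delta = 0.0
        for s in range(len(P)):
            new_v = q_value(P, v, s, policy[s], gamma)
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v


def policy_iteration(P, gamma=0.9):
    policy = [0] * len(P)  # arbitrary initial policy
    while True:
        v = policy_evaluation(P, policy, gamma)
        # Greedy improvement: pick the action with the highest one-step lookahead value.
        new_policy = [
            max(range(len(P[s])), key=lambda a: q_value(P, v, s, a, gamma))
            for s in range(len(P))
        ]
        if new_policy == policy:
            return v, policy
        policy = new_policy


# Hypothetical chain: action 0 moves left (or stays), action 1 moves right.
P = [
    [[(1.0, 0, 0.0)], [(1.0, 1, 0.0)]],
    [[(1.0, 0, 0.0)], [(1.0, 2, 1.0)]],
    [[(1.0, 2, 0.0)], [(1.0, 2, 0.0)]],  # absorbing terminal state
]
v, policy = policy_iteration(P)
print(v, policy)  # [0.9, 1.0, 0.0] [1, 1, 0]
```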

Model-free solution

Policy Evaluation

MC (Monte Carlo)
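Without a model, Monte Carlo evaluation estimates v(s) by averaging sampled returns over complete episodes. Below is a minimal first-visit sketch; the episode format (a list of `(state, reward)` pairs, where the reward is the one received after leaving that state) and the function name are assumptions for illustration.

```python
from collections import defaultdict


def first_visit_mc(episodes, gamma=0.9):
    """Estimate v(s) as the average return following the first visit to s."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        g = 0.0
        first_return = {}
        # Walk backwards so g is the return from each step onward.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            # Earlier time steps are processed later and overwrite,
            # so the stored value ends up being the first-visit return.
            first_return[state] = g
        for state, g_first in first_return.items():
            returns_sum[state] += g_first
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}


episodes = [
    [("A", 0.0), ("B", 1.0)],  # A -> B -> terminal
    [("B", 1.0)],              # B -> terminal
]
print(first_visit_mc(episodes))
```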

TD (Temporal Difference)
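Temporal-difference evaluation updates the estimate after every step by bootstrapping from the next state's current value instead of waiting for the full return. A minimal TD(0) sketch, assuming a hypothetical transition format `(state, reward, next_state, done)` and replaying a fixed list of transitions:

```python
from collections import defaultdict


def td0_evaluation(transitions, alpha=0.1, gamma=0.9, sweeps=1000):
    """TD(0): v(s) <- v(s) + alpha * (r + gamma * v(s') - v(s))."""
    v = defaultdict(float)
    for _ in range(sweeps):
        for s, r, s_next, done in transitions:
            # A terminal transition has no bootstrap term.
            target = r if done else r + gamma * v[s_next]
            v[s] += alpha * (target - v[s])
    return dict(v)


transitions = [
    ("A", 0.0, "B", False),  # A -> B, reward 0
    ("B", 1.0, None, True),  # B -> terminal, reward 1
]
print(td0_evaluation(transitions))
```

On this toy data the estimates approach v(B) = 1 and v(A) = 0.9, the same values the Monte Carlo average would give for these deterministic transitions.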

SARSA (on-policy)

Q-learning (off-policy)

Off-policy: the method is called off-policy because the policy being learned can differ from the policy being executed.

On-policy: the method updates value functions strictly on the basis of experience gained from executing some (possibly non-stationary) policy.
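The distinction shows up directly in the update targets. In the sketch below, SARSA bootstraps from the action actually taken next, while Q-learning bootstraps from the greedy action regardless of what is executed. The function names, the list-of-lists Q table, and the default step size are assumptions for illustration.

```python
def sarsa_update(Q, s, a, r, s2, a2, alpha=0.5, gamma=0.9):
    """On-policy: the target uses Q[s2][a2], the action the agent actually takes next."""
    target = r + gamma * Q[s2][a2]
    Q[s][a] += alpha * (target - Q[s][a])


def q_learning_update(Q, s, a, r, s2, alpha=0.5, gamma=0.9):
    """Off-policy: the target uses the best action in s2, whatever is executed."""
    target = r + gamma * max(Q[s2])
    Q[s][a] += alpha * (target - Q[s][a])


# Same transition, different targets: the next action taken (a2=0) is not the greedy one.
Q_sarsa = [[0.0, 0.0], [1.0, 2.0]]
Q_qlearn = [[0.0, 0.0], [1.0, 2.0]]
sarsa_update(Q_sarsa, s=0, a=0, r=0.0, s2=1, a2=0)
q_learning_update(Q_qlearn, s=0, a=0, r=0.0, s2=1)
print(Q_sarsa[0][0], Q_qlearn[0][0])
```

When the next action happens to be the greedy one, the two updates coincide; they differ only on exploratory steps.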

-----------------------reference-----------------------------

1. https://www.youtube.com/watch?v=0g4j2k_Ggc4

2. http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html

3. http://www.algorithmdog.com/reinforcement-learning-value-function-approximation

0 0