Reinforcement Learning

Source: Internet  Editor: 程序博客网  Time: 2024/05/17 06:44

1 value iteration

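   for i in max-iteration:

         for j in states:

              v[j] = max over a of ( r[j,a] + gamma * sum over j' of p(j'|j,a) * v[j'] )

   (gamma is the discount factor, 0 <= gamma < 1; the original update corresponds to gamma = 1)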
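The update above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the table layout (`P[s][a]` as a list of `(prob, next_state)` pairs, `R[s][a]` as the immediate reward), the discount `gamma`, the stopping tolerance `tol`, and the two-state MDP are all assumptions made for the example. It also updates all states synchronously from the previous sweep, whereas the pseudocode overwrites v[j] in place.

```python
def value_iteration(P, R, gamma=0.9, max_iteration=1000, tol=1e-8):
    """P[s][a]: list of (prob, next_state) pairs; R[s][a]: immediate reward."""
    n_states = len(P)
    v = [0.0] * n_states
    for _ in range(max_iteration):
        # Bellman optimality backup for every state, using the previous v.
        new_v = [
            max(
                R[s][a] + gamma * sum(p * v[s2] for p, s2 in P[s][a])
                for a in range(len(P[s]))
            )
            for s in range(n_states)
        ]
        delta = max(abs(x - y) for x, y in zip(new_v, v))
        v = new_v
        if delta < tol:  # stop early once the values have stabilised
            break
    return v


# Hypothetical 2-state MDP: action a deterministically moves to state a.
P = [
    [[(1.0, 0)], [(1.0, 1)]],
    [[(1.0, 0)], [(1.0, 1)]],
]
R = [[0.0, 1.0], [0.0, 2.0]]
v = value_iteration(P, R)
```

In this toy MDP, staying in state 1 earns 2 per step, so v[1] approaches 2 / (1 - 0.9) = 20 and v[0] approaches 1 + 0.9 * 20 = 19.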

2 policy iteration

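   for i in max-iteration:

          policy-evaluation

         (iterate v[state] until it converges, with the action for each state fixed by the current policy)

          policy-improvement

         (for each state in turn, set its action to the one with the highest value under the current v)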
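The two phases can be sketched as follows. As before, this is only an illustration under assumptions: the `P[s][a]` / `R[s][a]` table layout, the discount `gamma`, the tolerance `tol`, and the toy MDP are hypothetical and not part of the original notes. The outer loop also stops early once the policy no longer changes.

```python
def policy_iteration(P, R, gamma=0.9, max_iteration=100, tol=1e-8):
    """P[s][a]: list of (prob, next_state) pairs; R[s][a]: immediate reward."""
    n_states = len(P)
    policy = [0] * n_states
    v = [0.0] * n_states

    def q(s, a):
        # One-step lookahead value of taking action a in state s.
        return R[s][a] + gamma * sum(p * v[s2] for p, s2 in P[s][a])

    for _ in range(max_iteration):
        # Policy evaluation: the action of each state is fixed by `policy`.
        while True:
            delta = 0.0
            for s in range(n_states):
                new = q(s, policy[s])
                delta = max(delta, abs(new - v[s]))
                v[s] = new
            if delta < tol:
                break
        # Policy improvement: pick the greedy action for each state.
        new_policy = [
            max(range(len(P[s])), key=lambda a: q(s, a))
            for s in range(n_states)
        ]
        if new_policy == policy:  # policy is stable, so stop
            break
        policy = new_policy
    return policy, v


# Hypothetical 2-state MDP: action a deterministically moves to state a.
P = [
    [[(1.0, 0)], [(1.0, 1)]],
    [[(1.0, 0)], [(1.0, 1)]],
]
R = [[0.0, 1.0], [0.0, 2.0]]
policy, v = policy_iteration(P, R)
```

On this toy problem the greedy policy chooses action 1 in both states, matching the values found by value iteration.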


3 model based learning

   for i in max-iteration:

         1) follow policy pi and record the observed transitions (state, action, reward, next_state) as history

         2) estimate the reward r[state, action] and the transition probability p(next_state | state, action) from history

         3) update policy pi by running value iteration on the estimated model
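A rough sketch of the three steps is given below. Everything concrete in it is an assumption for illustration: the `step` environment, the epsilon-greedy exploration (pure policy following might never visit some state-action pairs), the count-based estimates, and the choice to treat unvisited pairs as zero reward with no successors.

```python
import random
from collections import defaultdict


def estimate_model(history, n_states, n_actions):
    """Turn (s, a, r, s2) transitions into empirical P and R tables."""
    counts = defaultdict(lambda: defaultdict(int))
    reward_sum = defaultdict(float)
    for s, a, r, s2 in history:
        counts[(s, a)][s2] += 1
        reward_sum[(s, a)] += r
    # Unvisited (s, a) pairs keep an empty successor list and zero reward.
    P = [[[] for _ in range(n_actions)] for _ in range(n_states)]
    R = [[0.0] * n_actions for _ in range(n_states)]
    for (s, a), successors in counts.items():
        n = sum(successors.values())
        P[s][a] = [(c / n, s2) for s2, c in successors.items()]
        R[s][a] = reward_sum[(s, a)] / n
    return P, R


def greedy_policy_by_value_iteration(P, R, gamma=0.9, sweeps=500):
    """Run in-place value iteration on the model, then act greedily."""
    n_states = len(P)
    v = [0.0] * n_states

    def q(s, a):
        return R[s][a] + gamma * sum(p * v[s2] for p, s2 in P[s][a])

    for _ in range(sweeps):
        for s in range(n_states):
            v[s] = max(q(s, a) for a in range(len(P[s])))
    return [max(range(len(P[s])), key=lambda a: q(s, a)) for s in range(n_states)]


def step(s, a):
    """Hypothetical deterministic environment: action a moves to state a."""
    reward = [[0.0, 1.0], [0.0, 2.0]][s][a]
    return reward, a


random.seed(0)
n_states, n_actions, epsilon = 2, 2, 0.3
policy, history, s = [0, 0], [], 0
for _ in range(10):
    # 1) follow the policy (with some exploration) and record transitions
    for _ in range(100):
        a = random.randrange(n_actions) if random.random() < epsilon else policy[s]
        r, s2 = step(s, a)
        history.append((s, a, r, s2))
        s = s2
    # 2) estimate rewards and transition probabilities from history
    P, R = estimate_model(history, n_states, n_actions)
    # 3) update the policy with value iteration on the estimated model
    policy = greedy_policy_by_value_iteration(P, R)
```

The history is accumulated across iterations here, so the estimates improve as more transitions are observed; resetting it each iteration would be an equally valid reading of the notes.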
