Machine Learning: An Analysis of the Q-Learning Algorithm for Grid World

Source: Internet  Editor: 程序博客网  Date: 2024/06/05 15:18

A Q-Learning algorithm for the Grid World game, taken from an open-source GitHub project.




Q-learning is a model-free reinforcement learning technique. Specifically, Q-learning can be used to find an optimal action-selection policy for any given (finite) Markov decision process (MDP). It works by learning an action-value function that ultimately gives the expected utility of taking a given action in a given state and following the optimal policy thereafter. A policy is a rule that the agent follows in selecting actions, given the state it is in. When such an action-value function is learned, the optimal policy can be constructed by simply selecting the action with the highest value in each state. One of the strengths of Q-learning is that it is able to compare the expected utility of the available actions without requiring a model of the environment. Additionally, Q-learning can handle problems with stochastic transitions and rewards, without requiring any adaptations. It has been proven that for any finite MDP, Q-learning eventually finds an optimal policy, in the sense that the expected value of the total reward return over all successive steps, starting from the current state, is the maximum achievable.[1]
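To make the step "selecting the action with the highest value in each state" concrete, here is a minimal Python sketch. The table layout (a dict mapping each state to a list of action values) and the names `greedy_action`, `greedy_policy` and `example_q_table` are assumptions for illustration only; they are not taken from the GitHub project.

```python
from typing import Dict, Hashable, List


def greedy_action(action_values: List[float]) -> int:
    """Return the index of the highest-valued action (the first one on ties)."""
    return max(range(len(action_values)), key=lambda a: action_values[a])


def greedy_policy(q_table: Dict[Hashable, List[float]]) -> Dict[Hashable, int]:
    """Build a deterministic policy: in each state, pick the best-valued action."""
    return {state: greedy_action(values) for state, values in q_table.items()}


# Hypothetical learned values for two states with three actions each.
example_q_table = {
    "s0": [0.1, 0.7, 0.2],
    "s1": [0.5, 0.3, 0.9],
}
example_policy = greedy_policy(example_q_table)
```

During training an agent would normally mix this greedy choice with some exploration (for example epsilon-greedy); the purely greedy policy is what is read off once the values have been learned.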



    # update q function with sample <s, a, r, s'>
    def learn(self, state, action, reward, next_state):
        current_q = self.q_table[state][action]
        # using Bellman Optimality Equation to update q function
        new_q = reward + self.discount_factor * max(self.q_table[next_state])
        self.q_table[state][action] += self.learning_rate * (new_q - current_q)
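The `learn` method above depends on a surrounding class that is not shown here. The following is a self-contained sketch of what such a class might look like, followed by one worked update. The class name `QLearningAgent`, the `defaultdict`-based table, and the hyperparameter values are assumptions for illustration, not the project's actual code.

```python
from collections import defaultdict


class QLearningAgent:
    """Minimal tabular agent; only the update rule is shown."""

    def __init__(self, n_actions, learning_rate=0.1, discount_factor=0.9):
        self.learning_rate = learning_rate
        self.discount_factor = discount_factor
        # Unseen states start with a value of 0.0 for every action.
        self.q_table = defaultdict(lambda: [0.0] * n_actions)

    def learn(self, state, action, reward, next_state):
        current_q = self.q_table[state][action]
        # Target: immediate reward plus the discounted best value of the next state.
        target = reward + self.discount_factor * max(self.q_table[next_state])
        # Move the current estimate a fraction (learning_rate) towards the target.
        self.q_table[state][action] += self.learning_rate * (target - current_q)


# Worked example with made-up numbers.
agent = QLearningAgent(n_actions=4, learning_rate=0.5, discount_factor=0.9)
agent.q_table["B"] = [0.0, 2.0, 0.0, 0.0]  # pretend state "B" was already explored
agent.learn(state="A", action=1, reward=1.0, next_state="B")
# target = 1.0 + 0.9 * 2.0 = 2.8, so Q("A", 1) moves halfway from 0.0 towards 2.8
updated_value = agent.q_table["A"][1]
```

With a learning rate of 0.5 the estimate ends up at roughly 1.4, halfway between the old value and the target. A smaller learning rate would make each sample count for less and smooth out noisy rewards.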

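It differs from the SARSA algorithm in that SARSA's learning function takes one extra final parameter, the next action A' (the last letter in the sequence S, A, R, S', A'), which is the action the agent's policy actually selects in the next state. Q-Learning instead uses the action with the highest value in the next state. The advantage is that learning may be faster; the drawback is that Q-values may be overestimated, a problem that Double Q-Learning is designed to address.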
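The difference between the two update targets, and the idea behind Double Q-Learning, can be sketched in Python as follows. All function names, the dict-of-lists table layout, and the numbers are assumptions for illustration; the Double Q-Learning step follows the standard two-table formulation (one table selects the action, the other evaluates it), not any code from the project.

```python
import random


def q_learning_target(q_table, reward, next_state, gamma):
    """Off-policy target: assume the best-valued action is taken in next_state."""
    return reward + gamma * max(q_table[next_state])


def sarsa_target(q_table, reward, next_state, next_action, gamma):
    """On-policy target: use the action the agent actually chose in next_state."""
    return reward + gamma * q_table[next_state][next_action]


def double_q_update(q_a, q_b, state, action, reward, next_state,
                    alpha, gamma, rng=random):
    """One Double Q-Learning step using two separate value tables."""
    # Randomly decide which table is updated on this step.
    if rng.random() < 0.5:
        update, evaluate = q_a, q_b
    else:
        update, evaluate = q_b, q_a
    # The updated table selects the action ...
    values = update[next_state]
    best_action = max(range(len(values)), key=lambda a: values[a])
    # ... and the other table supplies its value, which reduces overestimation.
    target = reward + gamma * evaluate[next_state][best_action]
    update[state][action] += alpha * (target - update[state][action])


# Hypothetical values: state "s1" has two actions worth 1.0 and 3.0.
q = {"s0": [0.0, 0.0], "s1": [1.0, 3.0]}
off_policy = q_learning_target(q, reward=1.0, next_state="s1", gamma=0.5)  # 2.5
on_policy = sarsa_target(q, reward=1.0, next_state="s1", next_action=0, gamma=0.5)  # 1.5
```

When the agent's exploratory choice in "s1" is the lower-valued action 0, SARSA's target is smaller than Q-Learning's, because SARSA accounts for what the policy actually does while Q-Learning always assumes the maximum.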




An application of Q-learning to deep learning by Google DeepMind, titled “deep reinforcement learning” or “deep Q-networks”, has been successful at playing some Atari 2600 games at expert human levels. Preliminary results were presented in 2014, and a paper was published in Nature in February 2015.[12]
