TensorFlow in Practice 13: Implementing a Policy Network (Reinforcement Learning, Part 1)
Source: Internet  Editor: 程序博客网  Date: 2024/05/16 14:52
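1. Introduction to Policy Networks
A policy network is a neural network model that observes the state of the environment and directly predicts the policy (the action that should be taken right now) that yields the largest expected return, counting both the immediate reward and future rewards. Unlike the earlier tasks, reinforcement learning may have no absolutely correct learning target, and a sample's features and label no longer correspond one to one. What we learn toward is the expected value: the reward obtained now plus the rewards that can potentially be obtained later. A policy network therefore does not use the current reward alone as the label. It uses the Discounted Future Reward, in which each future reward is multiplied by the discount factor γ once for every step it lies in the future. γ is a number slightly below, but close to, 1. It keeps the reward target from diverging, which would happen if rewards accumulated without any loss, and it also expresses how uncertain future rewards are.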
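As a rough illustration of the discounting described above, the sketch below computes the discounted future reward for every step of a short episode with NumPy. The function name `discounted_returns` and the sample numbers are chosen here purely for illustration, and γ = 0.5 is used only to keep the arithmetic easy to follow; in practice γ would sit close to 1.

```python
import numpy as np


def discounted_returns(rewards, gamma):
    """Return, for each step t, r[t] + gamma*r[t+1] + gamma^2*r[t+2] + ..."""
    rewards = np.asarray(rewards, dtype=np.float64)
    returns = np.zeros_like(rewards)
    running = 0.0
    # Walk backwards so each step reuses the discounted sum of the steps after it.
    for t in reversed(range(rewards.size)):
        running = running * gamma + rewards[t]
        returns[t] = running
    return returns


# Illustrative episode: three steps, each paying a reward of 1.
example = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
print(example)
```

With these numbers the first step is credited 1 + 0.5 + 0.25 = 1.75, the second 1 + 0.5 = 1.5, and the last step only its own reward of 1.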
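2. Gym
Gym is an open-source toolkit from OpenAI for creating reinforcement learning environments. It is built around two core concepts. The Environment is the task or problem we want to solve, and the Agent is the policy or algorithm we write. The Agent passes the Action it wants to perform to the Environment. After receiving the Action, the Environment returns the resulting Observation (the environment state) and the Reward to the Agent.
Installing Gym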
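The Agent/Environment exchange described above can be sketched without installing anything. `ToyEnvironment`, `random_agent` and `run_episode` below are hypothetical names invented for this illustration and are not part of Gym; the environment only mimics the `reset()` call and the four-value `step()` return that the CartPole code in section 3 relies on.

```python
import random


class ToyEnvironment:
    """Hypothetical stand-in for a Gym environment (not a Gym class)."""

    def __init__(self, max_steps=5):
        self.max_steps = max_steps
        self.steps = 0

    def reset(self):
        # Start a new episode and return the initial observation.
        self.steps = 0
        return 0

    def step(self, action):
        # Made-up rule: action 1 earns a reward of 1, action 0 earns nothing.
        self.steps += 1
        observation = self.steps
        reward = 1.0 if action == 1 else 0.0
        done = self.steps >= self.max_steps
        return observation, reward, done, {}


def random_agent(observation):
    """A trivial Agent that ignores the observation and acts at random."""
    return random.randint(0, 1)


def run_episode(env, agent):
    """Agent sends an Action; Environment answers with Observation and Reward."""
    observation = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = agent(observation)
        observation, reward, done, _ = env.step(action)
        total_reward += reward
    return total_reward


total = run_episode(ToyEnvironment(), random_agent)
print("Total reward:", total)
```

Replacing `random_agent` with a function that looks at the observation is, in miniature, what the policy network in section 3 does.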
sudo pip install gym
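A few notes. If you are on Python 3, use pip3; otherwise the package is installed for Python 2.
Another error that comes up often is Error: could not create 'some path': Permission denied. The main cause is running the command without sudo. Permission denied shows up frequently when installing software and simply means the current user lacks the required privileges. In most cases it is easiest to add sudo when installing, and to change permissions with chmod only when that is really needed.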
3. Implementing CartPole in Code
# coding: utf-8
import numpy as np
import tensorflow as tf
#%matplotlib inline
import matplotlib.pyplot as plt
import math
import gym

env = gym.make('CartPole-v0')

# Baseline: play 10 episodes with random actions.
env.reset()
random_episodes = 0
reward_sum = 0
while random_episodes < 10:
    env.render()
    observation, reward, done, _ = env.step(np.random.randint(0, 2))
    reward_sum += reward
    if done:
        random_episodes += 1
        print("Reward for this episode was: %s" % reward_sum)
        reward_sum = 0
        env.reset()

# Hyperparameters
H = 50                # number of hidden layer neurons
batch_size = 25       # every how many episodes to do a param update?
learning_rate = 1e-1  # feel free to play with this to train faster or more stably.
gamma = 0.99          # discount factor for reward
D = 4                 # input dimensionality

tf.reset_default_graph()

# The network takes the environment state as input and outputs
# the probability of choosing action 1 (left/right).
observations = tf.placeholder(tf.float32, [None, D], name="input_x")
W1 = tf.get_variable("W1", shape=[D, H],
                     initializer=tf.contrib.layers.xavier_initializer())
layer1 = tf.nn.relu(tf.matmul(observations, W1))
W2 = tf.get_variable("W2", shape=[H, 1],
                     initializer=tf.contrib.layers.xavier_initializer())
score = tf.matmul(layer1, W2)
probability = tf.nn.sigmoid(score)

# Remaining pieces of the graph
tvars = tf.trainable_variables()
input_y = tf.placeholder(tf.float32, [None, 1], name="input_y")
advantages = tf.placeholder(tf.float32, name="reward_signal")

# Loss function: log-likelihood of the action taken, weighted by its advantage
loglik = tf.log(input_y * (input_y - probability) + (1 - input_y) * (input_y + probability))
loss = -tf.reduce_mean(loglik * advantages)
newGrads = tf.gradients(loss, tvars)

# To reduce the noise in the reward signal, gradients are accumulated over
# several episodes before the network parameters are updated.
adam = tf.train.AdamOptimizer(learning_rate=learning_rate)  # Our optimizer
# Placeholders to send the final gradients through when we update.
W1Grad = tf.placeholder(tf.float32, name="batch_grad1")
W2Grad = tf.placeholder(tf.float32, name="batch_grad2")
batchGrad = [W1Grad, W2Grad]
updateGrads = adam.apply_gradients(zip(batchGrad, tvars))


def discount_rewards(r):
    """ take 1D float array of rewards and compute discounted reward """
    discounted_r = np.zeros_like(r)
    running_add = 0
    for t in reversed(range(0, r.size)):
        running_add = running_add * gamma + r[t]
        discounted_r[t] = running_add
    return discounted_r


xs, hs, dlogps, drs, ys, tfps = [], [], [], [], [], []
running_reward = None
reward_sum = 0
episode_number = 1
total_episodes = 10000
init = tf.global_variables_initializer()

# Launch the graph
with tf.Session() as sess:
    rendering = False
    sess.run(init)
    observation = env.reset()  # Obtain an initial observation of the environment

    # Reset the gradient placeholder. We will collect gradients in
    # gradBuffer until we are ready to update our policy network.
    gradBuffer = sess.run(tvars)
    for ix, grad in enumerate(gradBuffer):
        gradBuffer[ix] = grad * 0

    while episode_number <= total_episodes:
        # Rendering the environment slows things down,
        # so let's only look at it once our agent is doing a good job.
        if reward_sum / batch_size > 100 or rendering == True:
            env.render()
            rendering = True

        # Make sure the observation is in a shape the network can handle.
        x = np.reshape(observation, [1, D])

        # Run the policy network and get an action to take.
        tfprob = sess.run(probability, feed_dict={observations: x})
        action = 1 if np.random.uniform() < tfprob else 0

        xs.append(x)  # observation
        y = 1 if action == 0 else 0  # a "fake label"
        ys.append(y)

        # step the environment and get new measurements
        observation, reward, done, info = env.step(action)
        reward_sum += reward
        # record reward (has to be done after we call step() to get reward for previous action)
        drs.append(reward)

        # Batch update
        if done:
            episode_number += 1
            # stack together all inputs, hidden states, action gradients, and rewards for this episode
            epx = np.vstack(xs)
            epy = np.vstack(ys)
            epr = np.vstack(drs)
            tfp = tfps
            xs, hs, dlogps, drs, ys, tfps = [], [], [], [], [], []  # reset array memory

            # compute the discounted reward backwards through time
            discounted_epr = discount_rewards(epr)
            # size the rewards to be unit normal (helps control the gradient estimator variance)
            discounted_epr -= np.mean(discounted_epr)
            discounted_epr /= np.std(discounted_epr)

            # Get the gradient for this episode, and save it in the gradBuffer
            tGrad = sess.run(newGrads, feed_dict={observations: epx,
                                                  input_y: epy,
                                                  advantages: discounted_epr})
            for ix, grad in enumerate(tGrad):
                gradBuffer[ix] += grad

            # If we have completed enough episodes, then update the policy network with our gradients.
            if episode_number % batch_size == 0:
                sess.run(updateGrads, feed_dict={W1Grad: gradBuffer[0], W2Grad: gradBuffer[1]})
                for ix, grad in enumerate(gradBuffer):
                    gradBuffer[ix] = grad * 0

                # Give a summary of how well our network is doing for each batch of episodes.
                running_reward = reward_sum if running_reward is None else running_reward * 0.99 + reward_sum * 0.01
                print('Average reward for episode %f. Total average reward %f.'
                      % (reward_sum / batch_size, running_reward / batch_size))

                if reward_sum / batch_size >= 200:
                    print("Task solved in %d episodes!" % episode_number)
                    break

                reward_sum = 0

            observation = env.reset()

print("%d Episodes completed." % episode_number)
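The `loglik` expression in the listing can look cryptic at first. The NumPy sketch below, written here only as an illustration (the name `action_log_likelihood` and the value p = 0.8 are made up), evaluates the same formula outside TensorFlow. Because the "fake label" is 1 when action 0 was taken and 0 when action 1 was taken, the formula reduces to log(1 - p) or log(p) respectively, where p is the network's probability of choosing action 1.

```python
import numpy as np


def action_log_likelihood(fake_label, probability):
    """Same formula as loglik in the listing, evaluated with NumPy."""
    return np.log(fake_label * (fake_label - probability)
                  + (1 - fake_label) * (fake_label + probability))


p = 0.8  # illustrative probability of choosing action 1

# Action 1 was taken, so the fake label is 0 and the result is log(p).
ll_action1 = action_log_likelihood(0.0, p)
# Action 0 was taken, so the fake label is 1 and the result is log(1 - p).
ll_action0 = action_log_likelihood(1.0, p)

print(ll_action1, np.log(p))
print(ll_action0, np.log(1 - p))
```

Multiplying this log-likelihood by the normalized discounted reward and negating the mean gives the loss: actions followed by above-average returns are made more likely, and actions followed by below-average returns less likely.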
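Note that I commented out one line of the code, %matplotlib inline. This line is a magic command. Running it directly in a terminal fails with UsageError: Invalid GUI request 'inline', valid ones are ['qt4', 'glut', ...] (the listed values are GUI backends). The error occurs because no suitable GUI is available; magic commands are meant to be used in IPython. The purpose of the statement is to embed figures inline in an IPython notebook, so that plots are displayed directly once the program has finished running.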