DeepLearning.ai Deep Learning Course Series, Study Notes: 12. Optimization Algorithms in Practice
Source: Internet  Editor: 程序博客网  Date: 2024/06/02 06:58
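Images were lost and code formatting was garbled when this post was republished.
For a better reading experience, please visit the original version:
http://www.missshi.cn/api/view/blog/59bbcae0e519f50d04000204
P.S. The JavaScript bundle is large, so the first visit may take about 8 seconds to load.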
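So far we have always used gradient descent to optimize the cost function.
In this article we use several more advanced optimization algorithms. They usually speed up convergence and can lead to better final classification results.
Picture the cost function as a mountainous landscape like this:
First, we import the required libraries: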
- import numpy as np
- import matplotlib.pyplot as plt
- import scipy.io
- import math
- import sklearn
- import sklearn.datasets
- from opt_utils import load_params_and_grads, initialize_parameters, forward_propagation, backward_propagation
- from opt_utils import compute_cost, predict, predict_dec, plot_decision_boundary, load_dataset
- from testCases import *
- %matplotlib inline
- plt.rcParams['figure.figsize'] = (7.0, 4.0) # set default size of plots
- plt.rcParams['image.interpolation'] = 'nearest'
- plt.rcParams['image.cmap'] = 'gray'
Some of the helper functions are listed below:
- def load_params_and_grads(seed=1):
-     np.random.seed(seed)
-     W1 = np.random.randn(2,3)
-     b1 = np.random.randn(2,1)
-     W2 = np.random.randn(3,3)
-     b2 = np.random.randn(3,1)
-     dW1 = np.random.randn(2,3)
-     db1 = np.random.randn(2,1)
-     dW2 = np.random.randn(3,3)
-     db2 = np.random.randn(3,1)
-     return W1, b1, W2, b2, dW1, db1, dW2, db2
- def initialize_parameters(layer_dims):
-     """
-     Arguments:
-     layer_dims -- python array (list) containing the dimensions of each layer in our network
-     Returns:
-     parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
-                     Wl -- weight matrix of shape (layer_dims[l], layer_dims[l-1])
-                     bl -- bias vector of shape (layer_dims[l], 1)
-     Tips:
-     - For example: the layer_dims for the "Planar Data classification model" would have been [2,2,1].
-     This means W1's shape was (2,2), b1 was (2,1), W2 was (1,2) and b2 was (1,1). Now you have to generalize it!
-     - In the for loop, use parameters['W' + str(l)] to access Wl, where l is the iterative integer.
-     """
-     np.random.seed(3)
-     parameters = {}
-     L = len(layer_dims) # number of layers in the network
-     for l in range(1, L):
-         parameters['W' + str(l)] = np.random.randn(layer_dims[l], layer_dims[l-1]) * np.sqrt(2 / layer_dims[l-1])
-         parameters['b' + str(l)] = np.zeros((layer_dims[l], 1))
-         # shape checks: compare each shape against the expected tuple
-         assert parameters['W' + str(l)].shape == (layer_dims[l], layer_dims[l-1])
-         assert parameters['b' + str(l)].shape == (layer_dims[l], 1)
-     return parameters
- def sigmoid(x):
-     """Element-wise sigmoid; helper used by forward_propagation."""
-     return 1 / (1 + np.exp(-x))
- def relu(x):
-     """Element-wise ReLU; helper used by forward_propagation."""
-     return np.maximum(0, x)
- def forward_propagation(X, parameters):
-     """
-     Implements the forward propagation presented in Figure 2.
-     Arguments:
-     X -- input dataset, of shape (input size, number of examples)
-     parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3"
-     Returns:
-     a3 -- output of the final sigmoid activation
-     cache -- tuple of intermediate values, used by backward_propagation()
-     """
-     # retrieve parameters
-     W1 = parameters["W1"]
-     b1 = parameters["b1"]
-     W2 = parameters["W2"]
-     b2 = parameters["b2"]
-     W3 = parameters["W3"]
-     b3 = parameters["b3"]
-     # LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID
-     z1 = np.dot(W1, X) + b1
-     a1 = relu(z1)
-     z2 = np.dot(W2, a1) + b2
-     a2 = relu(z2)
-     z3 = np.dot(W3, a2) + b3
-     a3 = sigmoid(z3)
-     cache = (z1, a1, W1, b1, z2, a2, W2, b2, z3, a3, W3, b3)
-     return a3, cache
- def backward_propagation(X, Y, cache):
-     """
-     Implement the backward propagation presented in figure 2.
-     Arguments:
-     X -- input dataset, of shape (input size, number of examples)
-     Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (1, number of examples)
-     cache -- cache output from forward_propagation()
-     Returns:
-     gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables
-     """
-     m = X.shape[1]
-     (z1, a1, W1, b1, z2, a2, W2, b2, z3, a3, W3, b3) = cache
-     dz3 = 1./m * (a3 - Y)
-     dW3 = np.dot(dz3, a2.T)
-     db3 = np.sum(dz3, axis=1, keepdims = True)
-     da2 = np.dot(W3.T, dz3)
-     dz2 = np.multiply(da2, np.int64(a2 > 0))
-     dW2 = np.dot(dz2, a1.T)
-     db2 = np.sum(dz2, axis=1, keepdims = True)
-     da1 = np.dot(W2.T, dz2)
-     dz1 = np.multiply(da1, np.int64(a1 > 0))
-     dW1 = np.dot(dz1, X.T)
-     db1 = np.sum(dz1, axis=1, keepdims = True)
-     gradients = {"dz3": dz3, "dW3": dW3, "db3": db3,
-                  "da2": da2, "dz2": dz2, "dW2": dW2, "db2": db2,
-                  "da1": da1, "dz1": dz1, "dW1": dW1, "db1": db1}
-     return gradients
- def compute_cost(a3, Y):
-     """
-     Implement the cost function
-     Arguments:
-     a3 -- post-activation, output of forward propagation
-     Y -- "true" labels vector, same shape as a3
-     Returns:
-     cost - value of the cost function
-     """
-     m = Y.shape[1]
-     logprobs = np.multiply(-np.log(a3),Y) + np.multiply(-np.log(1 - a3), 1 - Y)
-     cost = 1./m * np.sum(logprobs)
-     return cost
- def predict(X, y, parameters):
-     """
-     This function is used to predict the results of a n-layer neural network.
-     Arguments:
-     X -- data set of examples you would like to label
-     y -- true labels, of shape (1, number of examples), used to report the accuracy
-     parameters -- parameters of the trained model
-     Returns:
-     p -- predictions for the given dataset X
-     """
-     m = X.shape[1]
-     p = np.zeros((1,m), dtype = int) # builtin int; the np.int alias no longer exists in recent NumPy
-     # Forward propagation
-     a3, caches = forward_propagation(X, parameters)
-     # convert probas to 0/1 predictions
-     for i in range(0, a3.shape[1]):
-         if a3[0,i] > 0.5:
-             p[0,i] = 1
-         else:
-             p[0,i] = 0
-     # print results
-     #print ("predictions: " + str(p[0,:]))
-     #print ("true labels: " + str(y[0,:]))
-     print("Accuracy: " + str(np.mean((p[0,:] == y[0,:]))))
-     return p
- def predict_dec(parameters, X):
-     """
-     Used for plotting decision boundary.
-     Arguments:
-     parameters -- python dictionary containing your parameters
-     X -- input data, of shape (input size, number of examples)
-     Returns
-     predictions -- vector of predictions of our model (red: 0 / blue: 1)
-     """
-     # Predict using forward propagation and a classification threshold of 0.5
-     a3, cache = forward_propagation(X, parameters)
-     predictions = (a3 > 0.5)
-     return predictions
- def plot_decision_boundary(model, X, y):
-     # Set min and max values and give it some padding
-     x_min, x_max = X[0, :].min() - 1, X[0, :].max() + 1
-     y_min, y_max = X[1, :].min() - 1, X[1, :].max() + 1
-     h = 0.01
-     # Generate a grid of points with distance h between them
-     xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
-     # Predict the function value for the whole grid
-     Z = model(np.c_[xx.ravel(), yy.ravel()])
-     Z = Z.reshape(xx.shape)
-     # Plot the contour and training examples
-     plt.contourf(xx, yy, Z, cmap=plt.cm.Spectral)
-     plt.ylabel('x2')
-     plt.xlabel('x1')
-     # flatten y from shape (1, m) to (m,) so there is one color value per point
-     plt.scatter(X[0, :], X[1, :], c=y.ravel(), cmap=plt.cm.Spectral)
-     plt.show()
- def load_dataset():
-     np.random.seed(3)
-     train_X, train_Y = sklearn.datasets.make_moons(n_samples=300, noise=.2) #300 #0.2
-     # Visualize the data
-     plt.scatter(train_X[:, 0], train_X[:, 1], c=train_Y, s=40, cmap=plt.cm.Spectral)
-     train_X = train_X.T
-     train_Y = train_Y.reshape((1, train_Y.shape[0]))
-     return train_X, train_Y
Gradient descent
The parameter-update function is implemented as follows:
- def update_parameters_with_gd(parameters, grads, learning_rate):
-     """
-     Update parameters using one step of gradient descent
-     Arguments:
-     parameters -- python dictionary containing your parameters to be updated:
-                     parameters['W' + str(l)] = Wl
-                     parameters['b' + str(l)] = bl
-     grads -- python dictionary containing your gradients to update each parameters:
-                     grads['dW' + str(l)] = dWl
-                     grads['db' + str(l)] = dbl
-     learning_rate -- the learning rate, scalar.
-     Returns:
-     parameters -- python dictionary containing your updated parameters
-     """
-     L = len(parameters) // 2 # number of layers in the neural networks
-     # Update rule for each parameter
-     for l in range(L):
-         parameters["W" + str(l+1)] = parameters["W" + str(l+1)] - learning_rate * grads['dW' + str(l+1)]
-         parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - learning_rate * grads['db' + str(l+1)]
-     return parameters
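As a minimal sketch of the same update rule in action, the snippet below applies it repeatedly to a toy problem. The helper `gd_step` and the cost J(W1) = sum(W1 ** 2) are hypothetical, chosen only for illustration; they are not part of the course code.

```python
import numpy as np

def gd_step(parameters, grads, learning_rate):
    """One gradient-descent step on every entry of a parameter dictionary."""
    return {name: value - learning_rate * grads["d" + name]
            for name, value in parameters.items()}

# Toy problem (hypothetical): minimize J(W1) = sum(W1 ** 2), whose gradient is 2 * W1.
params = {"W1": np.array([[1.0, -2.0]])}
for _ in range(50):
    grads = {"dW1": 2 * params["W1"]}
    params = gd_step(params, grads, learning_rate=0.1)

print(params["W1"])  # both entries have shrunk towards the minimum at 0
```

With this learning rate every step multiplies each entry by 0.8, so the parameters decay geometrically towards zero.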
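Two variants derived from gradient descent are stochastic gradient descent (SGD) and mini-batch gradient descent.
(Batch) gradient descent is implemented as follows: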
- X = data_input
- Y = labels
- parameters = initialize_parameters(layers_dims)
- for i in range(0, num_iterations):
-     # Forward propagation
-     a, caches = forward_propagation(X, parameters)
-     # Compute cost
-     cost = compute_cost(a, Y)
-     # Backward propagation
-     grads = backward_propagation(X, Y, caches)
-     # Update parameters
-     parameters = update_parameters_with_gd(parameters, grads, learning_rate)
Stochastic gradient descent is implemented as follows:
- X = data_input
- Y = labels
- m = X.shape[1] # number of training examples
- parameters = initialize_parameters(layers_dims)
- for i in range(0, num_iterations):
-     for j in range(0, m):
-         # Select one example, kept as a column of shape (input size, 1)
-         x_j, y_j = X[:, j:j+1], Y[:, j:j+1]
-         # Forward propagation
-         a, caches = forward_propagation(x_j, parameters)
-         # Compute cost
-         cost = compute_cost(a, y_j)
-         # Backward propagation
-         grads = backward_propagation(x_j, y_j, caches)
-         # Update parameters
-         parameters = update_parameters_with_gd(parameters, grads, learning_rate)
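In stochastic gradient descent, each update uses only a single training example.
When the training set is large, stochastic gradient descent makes progress quickly, but the cost fluctuates from one update to the next.
In practice, mini-batch gradient descent is usually the better choice.
Mini-batch gradient descent combines batch gradient descent and stochastic gradient descent.
Each update uses neither the whole training set nor a single example, but a batch of examples, for example 64 or 128.
On the one hand this makes full use of the GPU's parallelism; on the other hand a single update does not take too long to compute.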
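To make the trade-off concrete, the sketch below counts how many parameter updates one pass over the training set produces under each scheme. The helper name `updates_per_epoch` is hypothetical; the value m = 300 matches the `n_samples=300` used by `load_dataset()`.

```python
import math

def updates_per_epoch(m, batch_size):
    """Number of parameter updates in one pass over m training examples."""
    return math.ceil(m / batch_size)

m = 300  # number of training examples
print(updates_per_epoch(m, m))   # batch gradient descent: 1
print(updates_per_epoch(m, 1))   # stochastic gradient descent: 300
print(updates_per_epoch(m, 64))  # mini-batch gradient descent: 5 (4 full batches + 1 of 44)
```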
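Mini-batch gradient descent
First, we need to learn how to split the training set into batches.
Batching takes two steps:
Step 1: Shuffle.
We randomly shuffle the input examples and their labels using the same permutation, so that every batch is a random sample and each example stays paired with its label.
Step 2: Partition.
Once the training set has been shuffled, we split it into batches. Common batch sizes are 64, 128, 256, and so on. If the number of examples is not a multiple of the batch size, the last batch is smaller.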
- def random_mini_batches(X, Y, mini_batch_size = 64, seed = 0):
-     """
-     Creates a list of random minibatches from (X, Y)
-     Arguments:
-     X -- input data, of shape (input size, number of examples)
-     Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (1, number of examples)
-     mini_batch_size -- size of the mini-batches, integer
-     Returns:
-     mini_batches -- list of synchronous (mini_batch_X, mini_batch_Y)
-     """
-     np.random.seed(seed) # To make your "random" minibatches the same as ours
-     m = X.shape[1] # number of training examples
-     mini_batches = []
-     # Step 1: Shuffle (X, Y)
-     permutation = list(np.random.permutation(m))
-     shuffled_X = X[:, permutation]
-     shuffled_Y = Y[:, permutation].reshape((1,m))
-     # Step 2: Partition (shuffled_X, shuffled_Y). Minus the end case.
-     num_complete_minibatches = m // mini_batch_size # number of mini batches of size mini_batch_size in the partitioning
-     for k in range(0, num_complete_minibatches):
-         mini_batch_X = shuffled_X[:, mini_batch_size * k : (k+1) * mini_batch_size]
-         mini_batch_Y = shuffled_Y[:, mini_batch_size * k : (k+1) * mini_batch_size]
-         mini_batch = (mini_batch_X, mini_batch_Y)
-         mini_batches.append(mini_batch)
-     # Handling the end case (last mini-batch < mini_batch_size)
-     if m % mini_batch_size != 0:
-         # start from the end of the last complete batch; also works when there is no complete batch
-         mini_batch_X = shuffled_X[:, mini_batch_size * num_complete_minibatches : m]
-         mini_batch_Y = shuffled_Y[:, mini_batch_size * num_complete_minibatches : m]
-         mini_batch = (mini_batch_X, mini_batch_Y)
-         mini_batches.append(mini_batch)
-     return mini_batches
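The shuffle-then-partition idea can be checked on toy data with a compact, self-contained sketch. The function `split_into_mini_batches` and the toy arrays are hypothetical stand-ins written for this illustration; they follow the same two steps but are not the course implementation.

```python
import numpy as np

def split_into_mini_batches(X, Y, mini_batch_size=64, seed=0):
    """Shuffle the columns of X and Y with one shared permutation, then slice into batches."""
    rng = np.random.RandomState(seed)
    m = X.shape[1]
    permutation = rng.permutation(m)
    shuffled_X, shuffled_Y = X[:, permutation], Y[:, permutation]
    return [(shuffled_X[:, start:start + mini_batch_size],
             shuffled_Y[:, start:start + mini_batch_size])
            for start in range(0, m, mini_batch_size)]

# Toy data (hypothetical): 148 examples; the label equals the first feature,
# so a broken example/label pairing would be easy to detect.
X = np.vstack([np.arange(148), np.arange(148) * 10.0])
Y = np.arange(148).reshape(1, 148)

batches = split_into_mini_batches(X, Y, mini_batch_size=64)
print([bx.shape for bx, _ in batches])  # [(2, 64), (2, 64), (2, 20)]
print(all(np.array_equal(bx[0], by[0]) for bx, by in batches))  # True: pairs stay aligned
```

Because 148 = 64 + 64 + 20, the last batch holds the 20 leftover examples.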
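Momentum
With mini-batch gradient descent, each update is computed from only a subset of the training examples.
The resulting gradient can therefore deviate from the true gradient of the full training set.
Momentum can be used to reduce the effect of this deviation.
When updating, Momentum takes the directions of previous updates into account; combining them damps the oscillations introduced by the noisy gradients.
The Momentum update rule is as follows: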
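The rule can be written out to match the code in `update_parameters_with_momentum`. The notation is an assumption made here: $\alpha$ stands for `learning_rate`, $\beta$ for `beta`, and $dW^{[l]}$, $db^{[l]}$ for the gradients of layer $l$.

$$
\begin{aligned}
v_{dW^{[l]}} &= \beta\, v_{dW^{[l]}} + (1-\beta)\, dW^{[l]}, &\qquad W^{[l]} &= W^{[l]} - \alpha\, v_{dW^{[l]}} \\
v_{db^{[l]}} &= \beta\, v_{db^{[l]}} + (1-\beta)\, db^{[l]}, &\qquad b^{[l]} &= b^{[l]} - \alpha\, v_{db^{[l]}}
\end{aligned}
$$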
- def initialize_velocity(parameters):
-     """
-     Initializes the velocity as a python dictionary with:
-                 - keys: "dW1", "db1", ..., "dWL", "dbL"
-                 - values: numpy arrays of zeros of the same shape as the corresponding gradients/parameters.
-     Arguments:
-     parameters -- python dictionary containing your parameters.
-                     parameters['W' + str(l)] = Wl
-                     parameters['b' + str(l)] = bl
-     Returns:
-     v -- python dictionary containing the current velocity.
-                     v['dW' + str(l)] = velocity of dWl
-                     v['db' + str(l)] = velocity of dbl
-     """
-     L = len(parameters) // 2 # number of layers in the neural networks
-     v = {}
-     # Initialize velocity
-     for l in range(L):
-         v["dW" + str(l+1)] = np.zeros(parameters['W' + str(l+1)].shape)
-         v["db" + str(l+1)] = np.zeros(parameters['b' + str(l+1)].shape)
-     return v
- def update_parameters_with_momentum(parameters, grads, v, beta, learning_rate):
-     """
-     Update parameters using Momentum
-     Arguments:
-     parameters -- python dictionary containing your parameters:
-                     parameters['W' + str(l)] = Wl
-                     parameters['b' + str(l)] = bl
-     grads -- python dictionary containing your gradients for each parameters:
-                     grads['dW' + str(l)] = dWl
-                     grads['db' + str(l)] = dbl
-     v -- python dictionary containing the current velocity:
-                     v['dW' + str(l)] = ...
-                     v['db' + str(l)] = ...
-     beta -- the momentum hyperparameter, scalar
-     learning_rate -- the learning rate, scalar
-     Returns:
-     parameters -- python dictionary containing your updated parameters
-     v -- python dictionary containing your updated velocities
-     """
-     L = len(parameters) // 2 # number of layers in the neural networks
-     # Momentum update for each parameter
-     for l in range(L):
-         # compute velocities
-         v["dW" + str(l+1)] = beta * v["dW" + str(l+1)] + (1-beta) * grads['dW' + str(l+1)]
-         v["db" + str(l+1)] = beta * v["db" + str(l+1)] + (1-beta) * grads['db' + str(l+1)]
-         # update parameters
-         parameters["W" + str(l+1)] = parameters["W" + str(l+1)] - learning_rate * v["dW" + str(l+1)]
-         parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - learning_rate * v["db" + str(l+1)]
-     return parameters, v
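Momentum has one hyperparameter, beta.
When beta = 0, Momentum reduces to standard gradient descent without momentum.
The larger beta is, the stronger the smoothing. In practice, 0.9 is usually a reasonable value.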
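The smoothing effect of beta can be seen on a synthetic gradient sequence. The sketch below is only an illustration under assumed data: `noisy_grads` simulates a constant true gradient of 1.0 plus Gaussian mini-batch noise, and `momentum_velocities` is a hypothetical helper that applies the velocity recursion to scalars.

```python
import numpy as np

def momentum_velocities(grads, beta):
    """Apply v = beta * v + (1 - beta) * g to a sequence of scalar gradients."""
    v = 0.0
    velocities = []
    for g in grads:
        v = beta * v + (1 - beta) * g
        velocities.append(v)
    return np.array(velocities)

rng = np.random.RandomState(0)
noisy_grads = 1.0 + rng.randn(200)  # assumed: true gradient 1.0 plus noise

for beta in (0.0, 0.5, 0.9):
    v = momentum_velocities(noisy_grads, beta)
    # skip the first 50 steps so the warm-up from v = 0 does not dominate
    print("beta = %.1f, std of velocity = %.3f" % (beta, v[50:].std()))
```

With beta = 0 the velocity is exactly the raw gradient; as beta grows, the spread of the velocity shrinks, which is the damping described above.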
Adam
Adam is one of the most effective algorithms for training neural networks.
It combines RMSProp and Momentum.
Its update formulas are as follows:
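Written to match the code in `update_parameters_with_adam`, the update for the weights of layer $l$ reads as below; the biases $b^{[l]}$ follow the same pattern with $db^{[l]}$. The notation is an assumption made here: $\alpha$ is `learning_rate`, $t$ is the number of Adam steps taken so far, and the square and square root are element-wise.

$$
\begin{aligned}
v_{dW^{[l]}} &= \beta_1\, v_{dW^{[l]}} + (1-\beta_1)\, dW^{[l]} \\
v^{\text{corrected}}_{dW^{[l]}} &= \frac{v_{dW^{[l]}}}{1-\beta_1^{\,t}} \\
s_{dW^{[l]}} &= \beta_2\, s_{dW^{[l]}} + (1-\beta_2)\, \bigl(dW^{[l]}\bigr)^2 \\
s^{\text{corrected}}_{dW^{[l]}} &= \frac{s_{dW^{[l]}}}{1-\beta_2^{\,t}} \\
W^{[l]} &= W^{[l]} - \alpha\, \frac{v^{\text{corrected}}_{dW^{[l]}}}{\sqrt{s^{\text{corrected}}_{dW^{[l]}}} + \varepsilon}
\end{aligned}
$$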
It is implemented as follows:
- def initialize_adam(parameters):
-     """
-     Initializes v and s as two python dictionaries with:
-                 - keys: "dW1", "db1", ..., "dWL", "dbL"
-                 - values: numpy arrays of zeros of the same shape as the corresponding gradients/parameters.
-     Arguments:
-     parameters -- python dictionary containing your parameters.
-                     parameters["W" + str(l)] = Wl
-                     parameters["b" + str(l)] = bl
-     Returns:
-     v -- python dictionary that will contain the exponentially weighted average of the gradient.
-                     v["dW" + str(l)] = ...
-                     v["db" + str(l)] = ...
-     s -- python dictionary that will contain the exponentially weighted average of the squared gradient.
-                     s["dW" + str(l)] = ...
-                     s["db" + str(l)] = ...
-     """
-     L = len(parameters) // 2 # number of layers in the neural networks
-     v = {}
-     s = {}
-     # Initialize v, s. Input: "parameters". Outputs: "v, s".
-     for l in range(L):
-         v["dW" + str(l+1)] = np.zeros(parameters['W' + str(l+1)].shape)
-         v["db" + str(l+1)] = np.zeros(parameters['b' + str(l+1)].shape)
-         s["dW" + str(l+1)] = np.zeros(parameters['W' + str(l+1)].shape)
-         s["db" + str(l+1)] = np.zeros(parameters['b' + str(l+1)].shape)
-     return v, s
- def update_parameters_with_adam(parameters, grads, v, s, t, learning_rate = 0.01,
-                                 beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8):
-     """
-     Update parameters using Adam
-     Arguments:
-     parameters -- python dictionary containing your parameters:
-                     parameters['W' + str(l)] = Wl
-                     parameters['b' + str(l)] = bl
-     grads -- python dictionary containing your gradients for each parameters:
-                     grads['dW' + str(l)] = dWl
-                     grads['db' + str(l)] = dbl
-     v -- Adam variable, moving average of the first gradient, python dictionary
-     s -- Adam variable, moving average of the squared gradient, python dictionary
-     t -- Adam step counter (number of updates so far, starting at 1), used for bias correction
-     learning_rate -- the learning rate, scalar.
-     beta1 -- Exponential decay hyperparameter for the first moment estimates
-     beta2 -- Exponential decay hyperparameter for the second moment estimates
-     epsilon -- hyperparameter preventing division by zero in Adam updates
-     Returns:
-     parameters -- python dictionary containing your updated parameters
-     v -- Adam variable, moving average of the first gradient, python dictionary
-     s -- Adam variable, moving average of the squared gradient, python dictionary
-     """
-     L = len(parameters) // 2 # number of layers in the neural networks
-     v_corrected = {} # Initializing first moment estimate, python dictionary
-     s_corrected = {} # Initializing second moment estimate, python dictionary
-     # Perform Adam update on all parameters
-     for l in range(L):
-         # Moving average of the gradients. Inputs: "v, grads, beta1". Output: "v".
-         v["dW" + str(l+1)] = beta1 * v["dW" + str(l+1)] + (1-beta1) * grads['dW' + str(l+1)]
-         v["db" + str(l+1)] = beta1 * v["db" + str(l+1)] + (1-beta1) * grads['db' + str(l+1)]
-         # Compute bias-corrected first moment estimate. Inputs: "v, beta1, t". Output: "v_corrected".
-         v_corrected["dW" + str(l+1)] = v["dW" + str(l+1)] / (1 - beta1 ** t)
-         v_corrected["db" + str(l+1)] = v["db" + str(l+1)] / (1 - beta1 ** t)
-         # Moving average of the squared gradients. Inputs: "s, grads, beta2". Output: "s".
-         s["dW" + str(l+1)] = beta2 * s["dW" + str(l+1)] + (1-beta2) * (grads['dW' + str(l+1)] ** 2)
-         s["db" + str(l+1)] = beta2 * s["db" + str(l+1)] + (1-beta2) * (grads['db' + str(l+1)] ** 2)
-         # Compute bias-corrected second raw moment estimate. Inputs: "s, beta2, t". Output: "s_corrected".
-         s_corrected["dW" + str(l+1)] = s["dW" + str(l+1)] / (1 - beta2 ** t)
-         s_corrected["db" + str(l+1)] = s["db" + str(l+1)] / (1 - beta2 ** t)
-         # Update parameters. Inputs: "parameters, learning_rate, v_corrected, s_corrected, epsilon". Output: "parameters".
-         parameters["W" + str(l+1)] = parameters["W" + str(l+1)] - learning_rate * v_corrected["dW" + str(l+1)] / (s_corrected["dW" + str(l+1)] ** 0.5 + epsilon)
-         parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - learning_rate * v_corrected["db" + str(l+1)] / (s_corrected["db" + str(l+1)] ** 0.5 + epsilon)
-     return parameters, v, s
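To see what the bias correction does, the sketch below runs a single Adam step on one scalar parameter. The function `adam_step` and the cost J(w) = w ** 2 are hypothetical and exist only for this illustration; the arithmetic mirrors the update above.

```python
import numpy as np

def adam_step(w, g, v, s, t, learning_rate=0.01, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam update for a scalar parameter w with gradient g at step t (t starts at 1)."""
    v = beta1 * v + (1 - beta1) * g        # moving average of the gradient
    s = beta2 * s + (1 - beta2) * g ** 2   # moving average of the squared gradient
    v_corrected = v / (1 - beta1 ** t)     # undo the bias towards the zero initialization
    s_corrected = s / (1 - beta2 ** t)
    w = w - learning_rate * v_corrected / (np.sqrt(s_corrected) + epsilon)
    return w, v, s, v_corrected, s_corrected

w, v, s = 5.0, 0.0, 0.0
g = 2 * w  # gradient of the hypothetical cost J(w) = w ** 2
w_new, v, s, v_hat, s_hat = adam_step(w, g, v, s, t=1)

print(v, v_hat)   # the raw average is only 10% of g; the corrected value recovers g
print(w - w_new)  # the first step has size close to learning_rate
```

Without the correction, the first few updates would be scaled down because v and s start at zero.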
Comparing models trained with different optimization algorithms
- train_X, train_Y = load_dataset()
The base model (a three-layer neural network) is implemented as follows:
- def model(X, Y, layers_dims, optimizer, learning_rate = 0.0007, mini_batch_size = 64, beta = 0.9,
-           beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8, num_epochs = 10000, print_cost = True):
-     """
-     3-layer neural network model which can be run in different optimizer modes.
-     Arguments:
-     X -- input data, of shape (2, number of examples)
-     Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (1, number of examples)
-     layers_dims -- python list, containing the size of each layer
-     optimizer -- which update rule to use: "gd", "momentum" or "adam"
-     learning_rate -- the learning rate, scalar.
-     mini_batch_size -- the size of a mini batch
-     beta -- Momentum hyperparameter
-     beta1 -- Exponential decay hyperparameter for the past gradients estimates
-     beta2 -- Exponential decay hyperparameter for the past squared gradients estimates
-     epsilon -- hyperparameter preventing division by zero in Adam updates
-     num_epochs -- number of epochs
-     print_cost -- True to print the cost every 1000 epochs
-     Returns:
-     parameters -- python dictionary containing your updated parameters
-     """
-     L = len(layers_dims) # number of layers in the neural networks
-     costs = [] # to keep track of the cost
-     t = 0 # initializing the counter required for Adam update
-     seed = 10 # For grading purposes, so that your "random" minibatches are the same as ours
-     # Initialize parameters
-     parameters = initialize_parameters(layers_dims)
-     # Initialize the optimizer
-     if optimizer == "gd":
-         pass # no initialization required for gradient descent
-     elif optimizer == "momentum":
-         v = initialize_velocity(parameters)
-     elif optimizer == "adam":
-         v, s = initialize_adam(parameters)
-     # Optimization loop
-     for i in range(num_epochs):
-         # Define the random minibatches. We increment the seed to reshuffle differently the dataset after each epoch
-         seed = seed + 1
-         minibatches = random_mini_batches(X, Y, mini_batch_size, seed)
-         for minibatch in minibatches:
-             # Select a minibatch
-             (minibatch_X, minibatch_Y) = minibatch
-             # Forward propagation
-             a3, caches = forward_propagation(minibatch_X, parameters)
-             # Compute cost
-             cost = compute_cost(a3, minibatch_Y)
-             # Backward propagation
-             grads = backward_propagation(minibatch_X, minibatch_Y, caches)
-             # Update parameters
-             if optimizer == "gd":
-                 parameters = update_parameters_with_gd(parameters, grads, learning_rate)
-             elif optimizer == "momentum":
-                 parameters, v = update_parameters_with_momentum(parameters, grads, v, beta, learning_rate)
-             elif optimizer == "adam":
-                 t = t + 1 # Adam counter
-                 parameters, v, s = update_parameters_with_adam(parameters, grads, v, s,
-                                                                t, learning_rate, beta1, beta2, epsilon)
-         # Print the cost (of the last mini-batch) every 1000 epochs
-         if print_cost and i % 1000 == 0:
-             print("Cost after epoch %i: %f" % (i, cost))
-         if print_cost and i % 100 == 0:
-             costs.append(cost)
-     # plot the cost
-     plt.plot(costs)
-     plt.ylabel('cost')
-     plt.xlabel('epochs (per 100)')
-     plt.title("Learning rate = " + str(learning_rate))
-     plt.show()
-     return parameters
Mini-batch gradient descent
- layers_dims = [train_X.shape[0], 5, 2, 1]
- parameters = model(train_X, train_Y, layers_dims, optimizer = "gd")
- # Predict
- predictions = predict(train_X, train_Y, parameters)
- # Plot decision boundary
- plt.title("Model with Gradient Descent optimization")
- axes = plt.gca()
- axes.set_xlim([-1.5,2.5])
- axes.set_ylim([-1,1.5])
- plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)
Mini-batch gradient descent with Momentum
- layers_dims = [train_X.shape[0], 5, 2, 1]
- parameters = model(train_X, train_Y, layers_dims, beta = 0.9, optimizer = "momentum")
- # Predict
- predictions = predict(train_X, train_Y, parameters)
- # Plot decision boundary
- plt.title("Model with Momentum optimization")
- axes = plt.gca()
- axes.set_xlim([-1.5,2.5])
- axes.set_ylim([-1,1.5])
- plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)
Mini-batch gradient descent with Adam
- layers_dims = [train_X.shape[0], 5, 2, 1]
- parameters = model(train_X, train_Y, layers_dims, optimizer = "adam")
- # Predict
- predictions = predict(train_X, train_Y, parameters)
- # Plot decision boundary
- plt.title("Model with Adam optimization")
- axes = plt.gca()
- axes.set_xlim([-1.5,2.5])
- axes.set_ylim([-1,1.5])
- plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)
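To summarize:
The results of the different optimization algorithms are compared in the following table:
Comparing the three methods, we usually find that Adam tends to give better results.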