COURSE 1 Neural Networks and Deep Learning


Week1

What is a neural network?

It is a powerful learning algorithm inspired by how the brain works.

Example 1 - single neural network

Given data about the size of houses on the real estate market, you want to fit a function that predicts their price. This is a linear regression problem, because the price as a function of size is a continuous output.

We know that prices can never be negative, so we use a function that starts at zero: the Rectified Linear Unit (ReLU).
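
As a minimal sketch of this idea (the variable names are illustrative, not from the course):

```python
def relu(z):
    # Rectified Linear Unit: zero for negative inputs, identity otherwise
    return max(0.0, z)

def predict_price(size, w, b):
    # a single "neuron": a linear function of the size, clipped at zero
    return relu(w * size + b)
```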

(Figure: housing price prediction)

The input is the size of the house (x)

The output is the price (y)

The “neuron” implements the function ReLU (blue line)

(Figure: single neural network)

Example 2 – Neural network with multiple neurons

The price of a house can be affected by several features, such as size, number of bedrooms, zip code, and neighborhood wealth. The role of the neural network is to predict the price, and it automatically learns the hidden units. We only need to supply the inputs x and the outputs y.

(Figure: housing price prediction with multiple features)

Supervised learning with Neural Networks

In supervised learning, we are given a data set and already know what our correct output should look like,
having the idea that there is a relationship between the input and the output.
Supervised learning problems are categorized into “regression” and “classification” problems. In a
regression problem, we are trying to predict results within a continuous output, meaning that we are
trying to map input variables to some continuous function. In a classification problem, we are instead
trying to predict results in a discrete output. In other words, we are trying to map input variables into
discrete categories.

There are different types of neural networks. For example, Convolutional Neural Networks (CNNs) are often used for image applications, and Recurrent Neural Networks (RNNs) are used for one-dimensional sequence data with a temporal component, such as translating English to Chinese or transcribing audio to text. For autonomous driving, a hybrid neural network architecture is used.

Neural Network examples

(Figure: neural network examples)

Structured vs unstructured data

Structured data refers to things that has a defined meaning such as price, age whereas unstructured
data refers to thing like pixel, raw audio, text.

(Figure: structured vs. unstructured data)

Why is deep learning taking off?

Deep learning is taking off due to the large amount of data available through the digitization of society, faster computation, and innovation in neural network algorithms.

(Figure: scale drives deep learning progress)

Two things have to be in place to reach a high level of performance:

  1. Being able to train a big enough neural network
  2. Huge amount of labeled data

The process of training a neural network is iterative.

(Figure: the Idea → Code → Experiment cycle)

It can take a long time to train a neural network, which hurts productivity. Faster computation helps you iterate and improve algorithms more quickly.

Week2

Binary Classification

In a binary classification problem, the result is a discrete value output, e.g. $y \in \{0, 1\}$.

Notation

  1. a training example:

    $(x, y)$, where $x \in \mathbb{R}^{n_x}$ and $y \in \{0, 1\}$

  2. m training examples:

    $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})\}$, where $m = m_{\text{train}} =$ number of training examples

  3. matrix:

    $X = [x^{(1)}, x^{(2)}, \dots, x^{(m)}] \in \mathbb{R}^{n_x \times m}$
    $Y = [y^{(1)}, y^{(2)}, \dots, y^{(m)}] \in \mathbb{R}^{1 \times m}$

  4. goal:

    Given $x$, $\hat{y} = P(y = 1 \mid x)$, where $0 \le \hat{y} \le 1$
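
As a quick illustration of this notation in NumPy (a hypothetical toy dataset, not course data), the training examples are stacked as columns:

```python
import numpy as np

n_x, m = 3, 5                          # 3 features, 5 training examples
X = np.random.randn(n_x, m)            # column i is the example x^(i)
Y = np.random.randint(0, 2, (1, m))    # labels y^(i) in {0, 1}

print(X.shape)  # (3, 5)  -> (n_x, m)
print(Y.shape)  # (1, 5)  -> (1, m)
```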

Logistic Regression

parameters

  1. The input features vector:

    $x \in \mathbb{R}^{n_x}$, where $n_x$ is the number of features

  2. The training label:

    $y \in \{0, 1\}$

  3. The weights:

    $w \in \mathbb{R}^{n_x}$, where $n_x$ is the number of features

  4. The threshold:

    $b \in \mathbb{R}$

  5. The output:

    $\hat{y} = \sigma(w^T x + b)$

  6. Sigmoid function:

    $s = \sigma(w^T x + b) = \sigma(z) = \dfrac{1}{1 + e^{-z}}$

Loss (error) function:

$\mathcal{L}(\hat{y}, y) = -\left(y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})\right)$

Cost function:

$J(w, b) = \dfrac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) = -\dfrac{1}{m} \sum_{i=1}^{m} \left[y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)})\right]$
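
A minimal NumPy sketch of these formulas (the function names are mine, and a small epsilon is added inside the logs for numerical safety):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, X, Y, eps=1e-12):
    """Cross-entropy cost J(w, b) over the m examples stored as columns of X."""
    m = X.shape[1]
    Y_hat = sigmoid(np.dot(w.T, X) + b)                     # (1, m) predictions
    losses = -(Y * np.log(Y_hat + eps) + (1 - Y) * np.log(1 - Y_hat + eps))
    return np.sum(losses) / m
```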

Gradient Descent

Want to find w and b that minimize J(w, b)

Process

Repeat

$w := w - \alpha \dfrac{\partial J(w, b)}{\partial w}$
$b := b - \alpha \dfrac{\partial J(w, b)}{\partial b}$

Logistic Regression Gradient Descent

Recap

$z = w^T x + b$
$\hat{y} = a = \sigma(z)$
$\mathcal{L}(a, y) = -\left(y \log(a) + (1 - y) \log(1 - a)\right)$

Gradient Descent

$dz = \dfrac{\partial \mathcal{L}}{\partial z} = \dfrac{\partial \mathcal{L}}{\partial a} \cdot \dfrac{\partial a}{\partial z} = a - y$, using $\dfrac{\partial a}{\partial z} = a(1 - a)$
$dw_1 = \dfrac{\partial \mathcal{L}}{\partial w_1} = x_1 \, dz$
$dw_2 = \dfrac{\partial \mathcal{L}}{\partial w_2} = x_2 \, dz$
$\dots$
$db = \dfrac{\partial \mathcal{L}}{\partial b} = dz$

Process

$w_1 := w_1 - \alpha \, dw_1$
$w_2 := w_2 - \alpha \, dw_2$
$\dots$
$b := b - \alpha \, db$

Gradient Descent on m examples

Recap

$J(w, b) = \dfrac{1}{m} \sum_{i=1}^{m} \mathcal{L}(a^{(i)}, y^{(i)}) = -\dfrac{1}{m} \sum_{i=1}^{m} \left[y^{(i)} \log(a^{(i)}) + (1 - y^{(i)}) \log(1 - a^{(i)})\right]$
$a^{(i)} = \hat{y}^{(i)} = \sigma(z^{(i)}) = \sigma(w^T x^{(i)} + b)$

Gradient Descent

$dz^{(i)} = \dfrac{\partial \mathcal{L}}{\partial z^{(i)}} = a^{(i)} - y^{(i)}$
$dw_1 = \dfrac{1}{m} \sum_{i=1}^{m} \dfrac{\partial \mathcal{L}}{\partial w_1} = \dfrac{1}{m} \sum_{i=1}^{m} x_1^{(i)} \, dz^{(i)}$
$dw_2 = \dfrac{1}{m} \sum_{i=1}^{m} \dfrac{\partial \mathcal{L}}{\partial w_2} = \dfrac{1}{m} \sum_{i=1}^{m} x_2^{(i)} \, dz^{(i)}$
$\dots$
$db = \dfrac{1}{m} \sum_{i=1}^{m} \dfrac{\partial \mathcal{L}}{\partial b} = \dfrac{1}{m} \sum_{i=1}^{m} dz^{(i)}$

Pseudocode

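A sketch of this computation as explicit loops over the $m$ examples and the $n_x$ features, following the formulas above (variable names are mine, not from the course code):

```python
import numpy as np

def gradient_descent_step(w, b, X, Y, alpha):
    """One loop-based gradient-descent step for logistic regression.

    X: (n_x, m) inputs, Y: (1, m) labels, w: (n_x, 1), b: scalar.
    """
    n_x, m = X.shape
    J, dw, db = 0.0, np.zeros((n_x, 1)), 0.0
    for i in range(m):                           # loop over training examples
        z = np.dot(w[:, 0], X[:, i]) + b         # w^T x^(i) + b
        a = 1.0 / (1.0 + np.exp(-z))
        J += -(Y[0, i] * np.log(a) + (1 - Y[0, i]) * np.log(1 - a))
        dz = a - Y[0, i]
        for j in range(n_x):                     # loop over features
            dw[j, 0] += X[j, i] * dz
        db += dz
    J, dw, db = J / m, dw / m, db / m
    w = w - alpha * dw
    b = b - alpha * db
    return w, b, J
```

The two explicit loops are exactly what vectorization, discussed next, removes.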

Vectorization

Logistic Regression Derivatives


Vectorizing Logistic Regression

$X = [x^{(1)}, x^{(2)}, \dots, x^{(m)}]$
$Y = [y^{(1)}, y^{(2)}, \dots, y^{(m)}]$
$Z = [z^{(1)}, z^{(2)}, \dots, z^{(m)}] = w^T X + b$
$A = [a^{(1)}, a^{(2)}, \dots, a^{(m)}] = \sigma(Z)$

Implementing Logistic Regression

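A sketch of one fully vectorized iteration in NumPy (names are mine), replacing both loops from the pseudocode above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def vectorized_step(w, b, X, Y, alpha):
    """One vectorized gradient-descent iteration over all m examples at once."""
    m = X.shape[1]
    Z = np.dot(w.T, X) + b           # (1, m)
    A = sigmoid(Z)                   # (1, m)
    dZ = A - Y                       # (1, m)
    dw = np.dot(X, dZ.T) / m         # (n_x, 1)
    db = np.sum(dZ) / m              # scalar
    w = w - alpha * dw
    b = b - alpha * db
    return w, b
```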

Broadcasting in Python

General Principle

$(m, n) \; [+\ -\ *\ /] \; (1, n) \;\longrightarrow\; (m, n) \; [+\ -\ *\ /] \; (m, n)$
$(m, n) \; [+\ -\ *\ /] \; (m, 1) \;\longrightarrow\; (m, n) \; [+\ -\ *\ /] \; (m, n)$
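
A small example of this principle (the numbers are made up):

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])        # shape (2, 3)
row = np.array([[10.0, 20.0, 30.0]])   # shape (1, 3): copied down the rows
col = np.array([[100.0], [200.0]])     # shape (2, 1): copied across the columns

print(A + row)   # [[ 11.  22.  33.]  [ 14.  25.  36.]]
print(A * col)   # [[ 100.  200.  300.]  [ 800. 1000. 1200.]]
```

In the logistic-regression code above, this is what lets the scalar $b$ be added directly to the $(1, m)$ matrix $w^T X$.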

Week3

Neural Networks Overview


Neural Network Representation


Computing a Neural Network’s Output

$z^{[1]} = W^{[1]} x + b^{[1]} = W^{[1]} a^{[0]} + b^{[1]}$
$a^{[1]} = \sigma(z^{[1]})$
$z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}$
$a^{[2]} = \sigma(z^{[2]})$
$\dots$

Vectorizing across multiple examples

$a^{[2](i)}$: example $i$, layer 2

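Stacking the training examples as columns of $X = [x^{(1)}, \dots, x^{(m)}]$, the forward pass for all $m$ examples at once becomes (following the notation above):

$Z^{[1]} = W^{[1]} X + b^{[1]}$
$A^{[1]} = \sigma(Z^{[1]})$
$Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}$
$A^{[2]} = \sigma(Z^{[2]})$

where column $i$ of $A^{[l]}$ is $a^{[l](i)}$, and $b^{[l]}$ is broadcast across the $m$ columns.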

Activation functions

sigmoid: $a = \dfrac{1}{1 + e^{-z}}$, $\quad a' = a(1 - a)$
tanh: $a = \dfrac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$, $\quad a' = 1 - a^{2}$
ReLU: $a = \max(0, z)$, $\quad a' = \begin{cases} 0 & \text{if } z < 0 \\ 1 & \text{if } z \ge 0 \end{cases}$
leaky ReLU: $a = \max(0.01z, z)$, $\quad a' = \begin{cases} 0.01 & \text{if } z < 0 \\ 1 & \text{if } z \ge 0 \end{cases}$
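
A compact NumPy sketch of these four activations and their derivatives (the function names are mine):

```python
import numpy as np

def sigmoid(z):
    a = 1.0 / (1.0 + np.exp(-z))
    return a, a * (1 - a)                    # activation, derivative

def tanh(z):
    a = np.tanh(z)
    return a, 1 - a ** 2

def relu(z):
    return np.maximum(0, z), (z >= 0).astype(float)

def leaky_relu(z):
    return np.maximum(0.01 * z, z), np.where(z >= 0, 1.0, 0.01)
```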


Why do you need non-linear activation functions

Suppose

$z^{[1]} = W^{[1]} x + b^{[1]}$
$a^{[1]} = g^{[1]}(z^{[1]}) = z^{[1]}$
$z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}$
$a^{[2]} = g^{[2]}(z^{[2]}) = z^{[2]}$

Then
$a^{[1]} = z^{[1]} = W^{[1]} x + b^{[1]}$
$a^{[2]} = z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}$
$a^{[2]} = W^{[2]} (W^{[1]} x + b^{[1]}) + b^{[2]} = (W^{[2]} W^{[1]}) x + (W^{[2]} b^{[1]} + b^{[2]})$

It is similar to
$a^{[2]} = W' x + b'$, with $W' = W^{[2]} W^{[1]}$ and $b' = W^{[2]} b^{[1]} + b^{[2]}$

If you use a linear activation function (also called an identity activation function), the network just outputs a linear function of the input. This is true for deep networks with many hidden layers as well: no matter how many layers you stack, with linear activations (or, equivalently, no activation function) the network is only ever computing a linear function of the input.

Gradient Descent for Neural Networks

Backpropagation

$dZ^{[2]} = A^{[2]} - Y$
$dW^{[2]} = \dfrac{1}{m} dZ^{[2]} A^{[1]T}$
$db^{[2]} = \dfrac{1}{m}\,\text{np.sum}(dZ^{[2]}, \text{axis}=1, \text{keepdims}=\text{True})$
$dZ^{[1]} = W^{[2]T} dZ^{[2]} * g^{[1]\prime}(Z^{[1]})$
$dW^{[1]} = \dfrac{1}{m} dZ^{[1]} X^{T}$
$db^{[1]} = \dfrac{1}{m}\,\text{np.sum}(dZ^{[1]}, \text{axis}=1, \text{keepdims}=\text{True})$
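
A sketch of these six formulas in NumPy, assuming a 2-layer network with a tanh hidden layer and a sigmoid output unit (the variable names are mine):

```python
import numpy as np

def backward(X, Y, A1, A2, W2):
    """Backprop for a 2-layer net: tanh hidden layer, sigmoid output.

    X: (n_x, m), Y: (1, m), A1: (n_1, m), A2: (1, m), W2: (1, n_1).
    """
    m = X.shape[1]
    dZ2 = A2 - Y                                   # sigmoid output + cross-entropy loss
    dW2 = np.dot(dZ2, A1.T) / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = np.dot(W2.T, dZ2) * (1 - A1 ** 2)        # tanh'(Z1) = 1 - A1^2
    dW1 = np.dot(dZ1, X.T) / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    return dW1, db1, dW2, db2
```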

Random Initialization

If the weights are initialized to zeros, every unit in a layer computes the same function and receives the same gradient, so the weights update symmetrically. No matter how many nodes a layer has, it then behaves as if it had only one node, which is why the weights should instead be initialized to small random values.
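
A minimal sketch of the usual fix, small random weights and zero biases (the 0.01 scaling factor follows the lectures; the layer sizes are placeholders):

```python
import numpy as np

def initialize(n_x, n_h, n_y):
    # small random weights break the symmetry; biases can safely start at zero
    W1 = np.random.randn(n_h, n_x) * 0.01
    b1 = np.zeros((n_h, 1))
    W2 = np.random.randn(n_y, n_h) * 0.01
    b2 = np.zeros((n_y, 1))
    return W1, b1, W2, b2
```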

Week4

Building Blocks of Deep Neural Networks

(Figure: building blocks of a deep neural network)

Propagation

Forward Propagation for Layer l

Input

$a^{[l-1]}$

Cache
$z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$

Output
$a^{[l]} = g^{[l]}(z^{[l]})$

Vectorized

Input

$A^{[l-1]}$

Cache
$Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$

Output
$A^{[l]} = g^{[l]}(Z^{[l]})$
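
As a sketch, the vectorized forward step for layer $l$, together with the cache the backward pass will need (the function and cache names are mine, not the course's assignment code):

```python
import numpy as np

def layer_forward(A_prev, W, b, activation):
    """Forward step for one layer: Z = W A_prev + b, then A = g(Z)."""
    Z = np.dot(W, A_prev) + b
    if activation == "sigmoid":
        A = 1.0 / (1.0 + np.exp(-Z))
    elif activation == "relu":
        A = np.maximum(0, Z)
    else:
        raise ValueError("unsupported activation")
    cache = (A_prev, W, b, Z)          # saved for the backward pass
    return A, cache
```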

Backward Propagation for Layer l

Input

$da^{[l]}$

Local
$dz^{[l]} = da^{[l]} * g^{[l]\prime}(z^{[l]})$

Output
$dW^{[l]} = dz^{[l]} \, a^{[l-1]T}$
$db^{[l]} = dz^{[l]}$
$da^{[l-1]} = W^{[l]T} dz^{[l]}$

Vectorized

Input

$dA^{[l]}$

Local
$dZ^{[l]} = dA^{[l]} * g^{[l]\prime}(Z^{[l]})$

Output
$dW^{[l]} = \dfrac{1}{m} dZ^{[l]} A^{[l-1]T}$
$db^{[l]} = \dfrac{1}{m}\,\text{np.sum}(dZ^{[l]}, \text{axis}=1, \text{keepdims}=\text{True})$
$dA^{[l-1]} = W^{[l]T} dZ^{[l]}$
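
And a matching sketch of the vectorized backward step for layer $l$ (again, names are mine; `g_prime` stands for the derivative of the layer's activation function):

```python
import numpy as np

def layer_backward(dA, cache, g_prime):
    """Backward step for one layer, given dA = dL/dA[l] and the forward cache."""
    A_prev, W, b, Z = cache
    m = A_prev.shape[1]
    dZ = dA * g_prime(Z)                           # dZ[l] = dA[l] * g'(Z[l])
    dW = np.dot(dZ, A_prev.T) / m                  # dW[l]
    db = np.sum(dZ, axis=1, keepdims=True) / m     # db[l]
    dA_prev = np.dot(W.T, dZ)                      # dA[l-1]
    return dA_prev, dW, db
```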

Parameters vs Hyperparameters

Parameters

$W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}, \dots$

Hyperparameters

Hyperparameters are the settings that control how the parameters $W$ and $b$ are ultimately learned. They include:

- learning rate $\alpha$
- number of iterations
- number of hidden layers $L$
- number of hidden units $n^{[1]}, n^{[2]}, \dots$
- choice of activation function
- momentum term
- mini-batch size
- various forms of regularization parameters