The Basic Formulas in Neural Networks

by resema

"Do not worry about your difficulties in Mathematics. I can assure you mine are still greater." Einstein

What to keep in mind regarding Calculus for Back Propagation.

Introduction

To compute derivatives in Neural Networks, a Computation Graph can help. It consists of a forward propagation pass and a backward propagation pass.

Back propagation involves Calculus; more specifically, we have to take various derivatives. Look at it as a computation through the graph from right to left.

Chain Rule

If the cost $J$ depends on an intermediate value $v$, which in turn depends on $a$, the chain rule states

$$\frac{dJ}{da} = \frac{dJ}{dv} \cdot \frac{dv}{da}$$

This is the rule applied at every node of the computation graph when propagating derivatives from right to left.
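As a small worked example, take the graph $J = 3v$ with $v = a + u$ and $u = bc$. Then

$$\frac{\partial J}{\partial a} = \frac{\partial J}{\partial v} \cdot \frac{\partial v}{\partial a} = 3 \cdot 1 = 3, \qquad \frac{\partial J}{\partial b} = \frac{\partial J}{\partial v} \cdot \frac{\partial v}{\partial u} \cdot \frac{\partial u}{\partial b} = 3 \cdot 1 \cdot c = 3c$$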
Wording

$a^{[2](i)}_j$ := layer 2, example $i$, neuron $j$ (the superscript $[2]$ selects the layer, the superscript $(i)$ the training example, and the subscript $j$ the neuron).

Formulas for Forward and Back Propagation

(Figure: overview of the forward and backward propagation formulas.)

Forward Propagation

Estimation function: $\hat{y} = \sigma(w^T x + b)$
Sigmoid function: $\sigma(z) = \frac{1}{1 + e^{-z}}$
Node: $z = w^T x + b$, followed by the activation $a = \sigma(z)$
Loss (error) function: $\mathcal{L}(\hat{y}, y) = -\big(y \log \hat{y} + (1 - y) \log(1 - \hat{y})\big)$
Logistic Regression Cost Function: $J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\big(\hat{y}^{(i)}, y^{(i)}\big)$

Vectorized implementation

Stacking the training examples as columns of $X$, the forward pass over all $m$ examples becomes

$$Z = w^T X + b, \qquad A = \sigma(Z)$$

and, for a deeper network, layer by layer

$$Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}, \qquad A^{[l]} = g^{[l]}\big(Z^{[l]}\big)$$
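A minimal NumPy sketch of this vectorized forward pass for one layer (the function and variable names are illustrative, not from the original formulas):

import numpy as np

def sigmoid(Z):
    # Element-wise sigmoid activation
    return 1.0 / (1.0 + np.exp(-Z))

def forward_layer(A_prev, W, b):
    # Linear step followed by the sigmoid activation for one layer,
    # vectorized over all m examples (columns of A_prev)
    Z = W @ A_prev + b        # shape: (n_l, m)
    A = sigmoid(Z)            # shape: (n_l, m)
    return Z, A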

Forward Propagation with Dropout

Create a matrix with the same dimensions as the corresponding activation matrix and initialize it randomly. Then set all its values to 0 or 1 according to the keep probability. Finally multiply the activation matrix element-wise with this dropout matrix and divide the remaining values by the keep probability (inverted dropout), so that the expected value of the activations stays the same.
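A minimal NumPy sketch of this step (the names A1, D1 and keep_prob as well as the shapes are illustrative):

import numpy as np

keep_prob = 0.8                                  # assumed keep probability
A1 = np.random.rand(4, 5)                        # stand-in activation matrix of layer 1
D1 = np.random.rand(A1.shape[0], A1.shape[1])    # random matrix with the same dimensions
D1 = (D1 < keep_prob).astype(int)                # entries become 0 or 1 w.r.t. keep_prob
A1 = A1 * D1                                     # shut down the dropped-out neurons
A1 = A1 / keep_prob                              # rescale by the keep probability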

Backward Propagation

Suppose you have already calculated the derivative $dZ^{[l]} = \frac{\partial \mathcal{L}}{\partial Z^{[l]}}$; then you want to get $dW^{[l]}$, $db^{[l]}$ and $dA^{[l-1]}$.

$$dW^{[l]} = \frac{1}{m}\, dZ^{[l]} A^{[l-1]T}$$

$$db^{[l]} = \frac{1}{m} \sum_{i=1}^{m} dZ^{[l](i)}$$

$$dA^{[l-1]} = W^{[l]T} dZ^{[l]}$$

where

$$dZ^{[l]} = dA^{[l]} * g^{[l]\prime}\big(Z^{[l]}\big)$$
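A small NumPy sketch of the three formulas above for a single layer (the function name and argument order are illustrative):

import numpy as np

def linear_backward(dZ, A_prev, W):
    # Given dZ of the current layer, return the gradients of the cost
    # with respect to W, b and the previous layer's activations.
    m = A_prev.shape[1]
    dW = (1.0 / m) * dZ @ A_prev.T                       # dW = 1/m * dZ * A_prev^T
    db = (1.0 / m) * np.sum(dZ, axis=1, keepdims=True)   # db = 1/m * sum over examples
    dA_prev = W.T @ dZ                                   # dA_prev = W^T * dZ
    return dA_prev, dW, db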


Summary of gradient descent

For a network with one hidden layer, the gradients for a single training example are

$$dz^{[2]} = a^{[2]} - y$$

$$dW^{[2]} = dz^{[2]}\, a^{[1]T}$$

$$db^{[2]} = dz^{[2]}$$

$$dz^{[1]} = W^{[2]T} dz^{[2]} * g^{[1]\prime}\big(z^{[1]}\big)$$

$$dW^{[1]} = dz^{[1]} x^T$$

$$db^{[1]} = dz^{[1]}$$
Vectorized implementation example

Over all $m$ training examples (examples stacked as columns of $X$, $A^{[1]}$, $A^{[2]}$):

$$dZ^{[2]} = A^{[2]} - Y$$

$$dW^{[2]} = \frac{1}{m}\, dZ^{[2]} A^{[1]T}$$

$$db^{[2]} = \frac{1}{m} \sum_{i=1}^{m} dZ^{[2](i)}$$

$$dZ^{[1]} = W^{[2]T} dZ^{[2]} * g^{[1]\prime}\big(Z^{[1]}\big)$$

$$dW^{[1]} = \frac{1}{m}\, dZ^{[1]} X^T$$

$$db^{[1]} = \frac{1}{m} \sum_{i=1}^{m} dZ^{[1](i)}$$

Here $d$ in front of a quantity denotes the derivative of the cost with respect to that quantity, e.g. $dW^{[2]} = \frac{\partial J}{\partial W^{[2]}}$.
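A compact NumPy sketch of this vectorized backward pass, assuming a tanh hidden layer (so $g^{[1]\prime}(Z^{[1]}) = 1 - (A^{[1]})^2$) and a sigmoid output unit:

import numpy as np

def backward_propagation(X, Y, A1, A2, W2):
    # Vectorized gradients for a network with one tanh hidden layer
    # and a sigmoid output unit.
    m = X.shape[1]
    dZ2 = A2 - Y
    dW2 = (1.0 / m) * dZ2 @ A1.T
    db2 = (1.0 / m) * np.sum(dZ2, axis=1, keepdims=True)
    dZ1 = W2.T @ dZ2 * (1 - A1 ** 2)          # tanh'(Z1) = 1 - A1**2
    dW1 = (1.0 / m) * dZ1 @ X.T
    db1 = (1.0 / m) * np.sum(dZ1, axis=1, keepdims=True)
    return {"dW1": dW1, "db1": db1, "dW2": dW2, "db2": db2}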

Calculation Problem with Zero Matrix

Initializing the weights with a zero matrix (for example a 2x2 zero matrix) is a problem: the computed $a$'s will all be the same, and the rows of $dW$ will also be the same. This holds no matter how many gradient descent cycles are computed, so the hidden units never learn different features.
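A tiny NumPy demonstration of this symmetry problem (the shapes and toy data are illustrative), reusing the vectorized backward-pass formulas from above:

import numpy as np

np.random.seed(0)
X = np.random.randn(2, 3)                 # 3 toy examples with 2 features
Y = np.array([[0, 1, 0]])

# Zero initialization of a 2x2 hidden layer and the output layer
W1 = np.zeros((2, 2)); b1 = np.zeros((2, 1))
W2 = np.zeros((1, 2)); b2 = np.zeros((1, 1))

# Forward pass: both hidden units compute identical activations
A1 = np.tanh(W1 @ X + b1)
A2 = 1 / (1 + np.exp(-(W2 @ A1 + b2)))

# Backward pass: the rows of dW1 come out identical as well
m = X.shape[1]
dZ2 = A2 - Y
dZ1 = W2.T @ dZ2 * (1 - A1 ** 2)
dW1 = (1.0 / m) * dZ1 @ X.T
print(dW1)                                # identical rows, so the symmetry is never broken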

Random Initialization

We should initialize the parameters randomly. The layer sizes are $n_x$ := size of the input layer, $n_h$ := size of the hidden layer and $n_y$ := size of the output layer.

$$W^{[1]} = \text{np.random.randn}(n_h, n_x) \cdot 0.01, \qquad b^{[1]} = \text{np.zeros}((n_h, 1))$$

$$W^{[2]} = \text{np.random.randn}(n_y, n_h) \cdot 0.01, \qquad b^{[2]} = \text{np.zeros}((n_y, 1))$$

Where does the constant 0.01 come from? We prefer very small initialization values, so that $z$ does not start out on the flat parts of the sigmoid or tanh curve, where the gradients are small and learning is slow.
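A short NumPy sketch of this initialization (the function name is illustrative):

import numpy as np

def initialize_parameters(n_x, n_h, n_y):
    # Small random weights break the symmetry; the biases can stay zero.
    W1 = np.random.randn(n_h, n_x) * 0.01
    b1 = np.zeros((n_h, 1))
    W2 = np.random.randn(n_y, n_h) * 0.01
    b2 = np.zeros((n_y, 1))
    return {"W1": W1, "b1": b1, "W2": W2, "b2": b2}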

Xavier or He Initialization

Xavier initialization uses the factor $\sqrt{\frac{1}{n^{[l-1]}}}$ instead of 0.01, and He et al. proposed the slightly adapted factor $\sqrt{\frac{2}{n^{[l-1]}}}$, which works well with ReLU activations.
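In code only the scaling factor changes; a sketch assuming the layer widths are given in a list layer_dims (an illustrative name):

import numpy as np

def initialize_parameters_he(layer_dims):
    # He initialization: scale by sqrt(2 / n^[l-1]) instead of 0.01
    params = {}
    for l in range(1, len(layer_dims)):
        params["W" + str(l)] = (np.random.randn(layer_dims[l], layer_dims[l - 1])
                                * np.sqrt(2.0 / layer_dims[l - 1]))
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return params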

General Gradient Descent Rule

$$\theta = \theta - \alpha\, \frac{\partial J}{\partial \theta}$$

where $\alpha$ is the learning rate.

Update Rule For Each Parameter

$$W^{[l]} = W^{[l]} - \alpha\, dW^{[l]}$$

$$b^{[l]} = b^{[l]} - \alpha\, db^{[l]}$$
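A minimal sketch of this update, assuming the parameters and gradients are stored in dictionaries with keys such as W1, b1, W2, b2 (an assumed data layout):

def update_parameters(params, grads, learning_rate=0.01):
    # One gradient descent step: theta = theta - alpha * dtheta
    for key in params:                     # keys like "W1", "b1", "W2", "b2"
        params[key] = params[key] - learning_rate * grads["d" + key]
    return params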

L2 Regularization

The standard way to avoid overfitting is called L2 regularization. It consists of appropriately modifying your cost function, from

$$J = -\frac{1}{m} \sum_{i=1}^{m} \Big( y^{(i)} \log a^{[L](i)} + \big(1 - y^{(i)}\big) \log\big(1 - a^{[L](i)}\big) \Big)$$

to:

$$J_{regularized} = -\frac{1}{m} \sum_{i=1}^{m} \Big( y^{(i)} \log a^{[L](i)} + \big(1 - y^{(i)}\big) \log\big(1 - a^{[L](i)}\big) \Big) + \frac{\lambda}{2m} \sum_{l} \sum_{k} \sum_{j} \big(W^{[l]}_{k,j}\big)^2$$

The first term is named the cross-entropy cost and the second term the L2 regularization cost.

Observations

  • The value of $\lambda$ is a hyperparameter that you can tune using a dev set.
  • L2 regularization makes your decision boundary smoother. If $\lambda$ is too large, it is also possible to oversmooth, resulting in a model with high bias.

Implementation Details

To calculate $\sum_{k} \sum_{j} \big(W^{[l]}_{k,j}\big)^2$ for a weight matrix such as $W^{[1]}$, use

np.sum(np.square(W1))
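Putting this together, a sketch of the regularized cost for a network with two weight matrices W1 and W2 (the function name is an illustrative placeholder):

import numpy as np

def compute_cost_with_regularization(A_out, Y, W1, W2, lambd):
    # Cross-entropy part of the cost
    m = Y.shape[1]
    cross_entropy_cost = -(1.0 / m) * np.sum(
        Y * np.log(A_out) + (1 - Y) * np.log(1 - A_out))
    # L2 part: lambda / (2m) times the sum of all squared weights
    l2_cost = (lambd / (2.0 * m)) * (np.sum(np.square(W1)) + np.sum(np.square(W2)))
    return cross_entropy_cost + l2_cost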

Backpropagation With Regularization

For each weight matrix $W^{[l]}$ we have to add the regularization term's gradient $\frac{\lambda}{m} W^{[l]}$ to $dW^{[l]}$, since $\frac{d}{dW}\Big(\frac{\lambda}{2m} W^2\Big) = \frac{\lambda}{m} W$.
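In code each dW simply gets this extra term; a minimal sketch (the helper name is illustrative):

def add_l2_gradient(dW, W, lambd, m):
    # Backpropagation with L2 regularization:
    # d/dW [ lambda/(2m) * sum(W**2) ] = (lambda / m) * W
    return dW + (lambd / m) * W

# e.g. dW2 = add_l2_gradient(dW2, W2, lambd, m)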

Backpropagation with Dropout

Multiply the derivative $dA$ with the corresponding dropout matrix cached during forward propagation. Finally divide the remaining values in the activation gradient by the keep probability.
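A minimal sketch of this step, reusing the mask cached in the forward pass (the names dA, D and keep_prob are illustrative):

def dropout_backward(dA, D, keep_prob):
    # D is the 0/1 dropout mask cached during forward propagation.
    dA = dA * D              # zero out the gradients of the dropped-out units
    dA = dA / keep_prob      # apply the same scaling as in the forward pass
    return dA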
