M.E. Irizarry-Gelpí

Physics impostor. Mathematics interloper. Husband. Father.

Deep Learning Fundamentals with Keras 2


Say that you have some data and wish to fit a function of the form

\begin{equation*} Z = w X \end{equation*}

You need to find the value of \(w\) that best fits the data. One way to do this is to work with a loss function that is convex. For example,

\begin{equation*} J(w) = \frac{1}{2 N} \sum_{j = 1}^{N} \left( z_{j} - w x_{j} \right)^{2} \end{equation*}

The best fit corresponds to the value of \(w\) that minimizes the loss function \(J(w)\). Note that the loss function is a function on weight (and, in general, bias) space: the data are held fixed, and the weights are the variables.

In general, the weight values for the best fit can be determined with the gradient descent algorithm. With this approach, you take steps that are proportional to the negative of the gradient of the loss function at the current point. You start with a random weight \(w_{0}\) and compute the gradient of the loss function at that initial value. This gradient is

\begin{equation*} \frac{\partial J}{\partial w} \end{equation*}
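For the quadratic loss above, this gradient can be written out explicitly:

\begin{equation*} \frac{\partial J}{\partial w} = -\frac{1}{N} \sum_{j = 1}^{N} x_{j} \left( z_{j} - w x_{j} \right) \end{equation*}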

The gradient determines the direction of the step in weight space: the step is taken against the gradient, toward decreasing loss. The size of the step is controlled by a parameter \(\eta\) called the learning rate; the larger the learning rate, the larger the step. The new value of the weight is

\begin{equation*} w_{1} = w_{0} - \eta \frac{\partial J}{\partial w} \end{equation*}

This is the first iteration of the gradient descent algorithm.

The value of the learning rate needs to be chosen with care. If it is too large, the steps can overshoot the minimum and the algorithm may fail to converge. If it is too small, the algorithm can take a very long time to converge.
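As a rough sketch, here is what this iteration could look like in NumPy for the one-parameter model above. The data, the initial weight, the learning rate, and the number of iterations are all arbitrary placeholder choices.

```python
import numpy as np

# Placeholder data for z = w x, generated with a "true" weight of 3.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)
z = 3.0 * x + rng.normal(scale=0.1, size=100)

w = rng.normal()  # random initial weight w_0
eta = 0.1         # learning rate

for step in range(200):
    # Gradient of J(w) = (1/2N) sum_j (z_j - w x_j)^2 with respect to w.
    grad = -np.mean(x * (z - w * x))
    # Gradient descent update: w -> w - eta * dJ/dw.
    w = w - eta * grad

print(w)  # should end up close to the best-fit weight
```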

In order for a neural network to provide good results, it needs to optimize its weights and biases. One way to do this is with the backpropagation algorithm. During training, you have input data \(x\) and also the ground truth \(T\). For simplicity, assume a neural network with a single input and two layers, each with a single neuron. In the first layer,

\begin{equation*} x_{1} \longrightarrow z_{1} = w_{1} x_{1} + b_{1} \longrightarrow a_{1} = f(z_{1}) \end{equation*}

In the second layer,

\begin{equation*} a_{1} \longrightarrow z_{2} = w_{2} a_{1} + b_{2} \longrightarrow a_{2} = f(z_{2}) \end{equation*}

Thus, \(a_{1}\) depends on the weight \(w_{1}\) and the bias \(b_{1}\), while the output \(a_{2}\) depends on the weight \(w_{2}\) and the bias \(b_{2}\), as well as on the output \(a_{1}\) of the hidden layer (and hence, implicitly, on \(w_{1}\) and \(b_{1}\)). The idea behind backpropagation is to compute the error between the ground truth and the estimate, and then use that error to update the weights and biases via gradient descent:

\begin{align*} w_{j + 1} &= w_{j} - \eta \frac{\partial E}{\partial w_{j}} & b_{j + 1} &= b_{j} - \eta \frac{\partial E}{\partial b_{j}} \end{align*}
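Before working through the gradients for a concrete error function, here is a minimal sketch of the forward pass of this two-layer network. The names are illustrative, and the activation \(f\) is passed in as a function so that it stays generic.

```python
def forward(x1, w1, b1, w2, b2, f):
    """Forward pass of the one-input, two-layer network described above."""
    z1 = w1 * x1 + b1  # weighted input of the first (hidden) layer
    a1 = f(z1)         # output of the first layer
    z2 = w2 * a1 + b2  # weighted input of the second (output) layer
    a2 = f(z2)         # output of the network
    return z1, a1, z2, a2
```

With the logistic function introduced below, one could pass, for example, f = lambda z: 1.0 / (1.0 + math.exp(-z)) after importing the standard math module.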

As an example, consider the following error function:

\begin{equation*} E = \frac{1}{2} \left( T - a_{2} \right)^{2} \end{equation*}

Note that \(E\) depends implicitly on the two weights and the two biases, all four of which need to be updated. Consider the following gradients:

\begin{align*} \frac{\partial E}{\partial w_{1}} &= -(T - a_{2}) \frac{\partial a_{2}}{\partial w_{1}} \\ \frac{\partial E}{\partial w_{2}} &= -(T - a_{2}) \frac{\partial a_{2}}{\partial w_{2}} \\ \frac{\partial E}{\partial b_{1}} &= -(T - a_{2}) \frac{\partial a_{2}}{\partial b_{1}} \\ \frac{\partial E}{\partial b_{2}} &= -(T - a_{2}) \frac{\partial a_{2}}{\partial b_{2}} \end{align*}

You need the gradients of the output \(a_{2}\):

\begin{align*} \frac{\partial a_{2}}{\partial w_{1}} &= \frac{\partial f}{\partial z_{2}} \frac{\partial z_{2}}{\partial w_{1}} \\ \frac{\partial a_{2}}{\partial w_{2}} &= \frac{\partial f}{\partial z_{2}} \frac{\partial z_{2}}{\partial w_{2}} \\ \frac{\partial a_{2}}{\partial b_{1}} &= \frac{\partial f}{\partial z_{2}} \frac{\partial z_{2}}{\partial b_{1}} \\ \frac{\partial a_{2}}{\partial b_{2}} &= \frac{\partial f}{\partial z_{2}} \frac{\partial z_{2}}{\partial b_{2}} \end{align*}

Furthermore, you need the derivative of the activation function \(f\) and the gradients of \(z_{2}\), the weighted input of the second layer:

\begin{align*} \frac{\partial z_{2}}{\partial w_{1}} &= w_{2} \frac{\partial a_{1}}{\partial w_{1}} \\ \frac{\partial z_{2}}{\partial w_{2}} &= a_{1} \\ \frac{\partial z_{2}}{\partial b_{1}} &= w_{2} \frac{\partial a_{1}}{\partial b_{1}} \\ \frac{\partial z_{2}}{\partial b_{2}} &= 1 \end{align*}

Finally, you need the gradients of the output \(a_{1}\) of the first (hidden) layer:

\begin{align*} \frac{\partial a_{1}}{\partial w_{1}} &= \frac{\partial f}{\partial z_{1}} \frac{\partial z_{1}}{\partial w_{1}} = \frac{\partial f}{\partial z_{1}} x_{1} \\ \frac{\partial a_{1}}{\partial b_{1}} &= \frac{\partial f}{\partial z_{1}} \frac{\partial z_{1}}{\partial b_{1}} = \frac{\partial f}{\partial z_{1}} \end{align*}

Thus,

\begin{align*} \frac{\partial E}{\partial w_{1}} &= -(T - a_{2}) \frac{\partial f}{\partial z_{2}} \frac{\partial f}{\partial z_{1}} w_{2} x_{1} \\ \frac{\partial E}{\partial w_{2}} &= -(T - a_{2}) \frac{\partial f}{\partial z_{2}} a_{1} \\ \frac{\partial E}{\partial b_{1}} &= -(T - a_{2}) \frac{\partial f}{\partial z_{2}} \frac{\partial f}{\partial z_{1}} w_{2} \\ \frac{\partial E}{\partial b_{2}} &= -(T - a_{2}) \frac{\partial f}{\partial z_{2}} \end{align*}

If \(f\) is the logistic function,

\begin{equation*} f(z) = \frac{1}{1 + \exp(-z)} \end{equation*}

then

\begin{equation*} \frac{\partial f}{\partial z} = \frac{\exp(-z)}{\left[ 1 + \exp(-z) \right]^{2}} = f(z) \left[1 - f(z)\right] \end{equation*}

This means that

\begin{align*} \frac{\partial f}{\partial z_{1}} &= a_{1} (1 - a_{1}) \\ \frac{\partial f}{\partial z_{2}} &= a_{2} (1 - a_{2}) \end{align*}
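With the chain-rule gradients and the logistic derivative in hand, a result such as \(\partial E / \partial w_{1}\) can be checked numerically against a finite difference. Here is a small sketch; all parameter values are arbitrary.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def error(x1, T, w1, b1, w2, b2):
    """E = (1/2)(T - a2)^2 for the two-layer network."""
    a1 = logistic(w1 * x1 + b1)
    a2 = logistic(w2 * a1 + b2)
    return 0.5 * (T - a2) ** 2

x1, T = 0.7, 1.0
w1, b1, w2, b2 = 0.3, -0.1, 0.8, 0.2

# Analytic gradient dE/dw1 from the chain rule above.
a1 = logistic(w1 * x1 + b1)
a2 = logistic(w2 * a1 + b2)
analytic = -(T - a2) * a2 * (1.0 - a2) * a1 * (1.0 - a1) * w2 * x1

# Central finite-difference approximation of the same gradient.
h = 1e-6
numeric = (error(x1, T, w1 + h, b1, w2, b2)
           - error(x1, T, w1 - h, b1, w2, b2)) / (2.0 * h)

print(analytic, numeric)  # the two values should agree closely
```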

Gradient descent leads to the following updates for the weights and the biases:

\begin{align*} w_{1} &\longrightarrow w_{1} + \eta (T - a_{2}) a_{2} (1 - a_{2}) a_{1} (1 - a_{1}) w_{2} x_{1} \\ w_{2} &\longrightarrow w_{2} + \eta (T - a_{2}) a_{2} (1 - a_{2}) a_{1} \\ b_{1} &\longrightarrow b_{1} + \eta (T - a_{2}) a_{2} (1 - a_{2}) a_{1} (1 - a_{1}) w_{2} \\ b_{2} &\longrightarrow b_{2} + \eta (T - a_{2}) a_{2} (1 - a_{2}) \end{align*}
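Putting the pieces together, a single backpropagation update for this network with the logistic activation might look like the sketch below. The function and variable names are illustrative, not from any particular library.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x1, T, w1, b1, w2, b2, eta):
    """One gradient-descent update of (w1, b1, w2, b2) for a single sample."""
    # Forward pass.
    a1 = logistic(w1 * x1 + b1)
    a2 = logistic(w2 * a1 + b2)
    # Common factor -(T - a2) f'(z2) shared by all four gradients.
    delta2 = -(T - a2) * a2 * (1.0 - a2)
    # Gradients of E with respect to each parameter, as derived above.
    dE_dw2 = delta2 * a1
    dE_db2 = delta2
    dE_dw1 = delta2 * w2 * a1 * (1.0 - a1) * x1
    dE_db1 = delta2 * w2 * a1 * (1.0 - a1)
    # Gradient-descent updates.
    return (w1 - eta * dE_dw1, b1 - eta * dE_db1,
            w2 - eta * dE_dw2, b2 - eta * dE_db2)
```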

The outline of the training algorithm is as follows (a Keras sketch follows the list):

  1. Initialize the weights and biases with random values.
  2. Use forward propagation to calculate the output of the network.
  3. Calculate the error between the ground truth and the estimated output.
  4. Check if the error is below a pre-defined threshold. If it is, stop. Otherwise, continue.
  5. Update the weights and biases via gradient descent and backpropagation.
  6. If all iterations have been done, stop. Otherwise, go back to step 2.
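With Keras (here, the Keras bundled with TensorFlow), steps 2 through 6 are handled by compile and fit. A minimal sketch of a comparable two-layer model could look like the following; the placeholder data, the optimizer settings, and the number of epochs are arbitrary choices, and Keras' mean_squared_error differs from the error function above by a constant factor.

```python
import numpy as np
from tensorflow import keras

# Placeholder training data: inputs x and ground truth t.
x = np.linspace(-1.0, 1.0, 100).reshape(-1, 1)
t = 0.5 * (x + 1.0)

model = keras.Sequential([
    keras.Input(shape=(1,)),
    keras.layers.Dense(1, activation="sigmoid"),  # hidden layer: a1 = f(w1 x + b1)
    keras.layers.Dense(1, activation="sigmoid"),  # output layer: a2 = f(w2 a1 + b2)
])

model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.5),
              loss="mean_squared_error")

# fit repeatedly runs forward propagation, computes the loss,
# backpropagates, and updates the weights and biases.
model.fit(x, t, epochs=100, verbose=0)
```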

The logistic function made the previous calculation convenient to do by hand, but numerically it can lead to problems. One of them is the vanishing gradient problem: the gradients take small values (the derivative of the logistic function is at most \(1/4\)), and these small values get compounded by all the multiplications involved as you move farther backward into the network. As a result, the layers farther from the output receive much smaller updates and train much more slowly than the layers closer to the output.

In order to overcome the vanishing gradient problem, other activation functions besides the logistic function can be used. One such activation function is the Rectified Linear Unit or ReLU:

\begin{equation*} f(z) = \operatorname{max}(0, z) \end{equation*}

ReLU is usually used in hidden layers. Another example is the SoftMax function:

\begin{equation*} f(z_{j}) = \dfrac{\exp(z_{j})}{\displaystyle\sum_{k = 1}^{N} \exp(z_{k})} \end{equation*}

This one is usually used in the output layer of the network. Since

\begin{equation*} \sum_{j = 1}^{N} f(z_{j}) = 1 \end{equation*}

the SoftMax output can be understood as a probability distribution over the \(N\) classes. Indeed, with

\begin{equation*} z_{j} = - \frac{E_{j}}{k_{B}T} \end{equation*}

you can recognize the partition function from thermal physics and statistical mechanics.
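For reference, here is a minimal NumPy sketch of both activation functions. The shift by the maximum inside the SoftMax is a standard numerical trick to avoid overflow; it does not change the result, since it cancels between the numerator and the denominator.

```python
import numpy as np

def relu(z):
    """Rectified Linear Unit: max(0, z), applied elementwise."""
    return np.maximum(0.0, z)

def softmax(z):
    """SoftMax: exp(z_j) / sum_k exp(z_k), computed in a numerically stable way."""
    weights = np.exp(z - np.max(z))
    return weights / np.sum(weights)
```

For example, softmax(np.array([1.0, 2.0, 3.0])) returns values that sum to 1, as expected of a probability distribution.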