Training Neural Networks

This is a guide to implementing a neural network and training it with gradient descent.


Background

We will start with an example identifying handwritten digits using the MNIST dataset. This problem has 10 classes (digits $0$-$9$). Each image in the database is $28 \times 28$ pixels, meaning each input is a linearized $1 \times 784$ vector. The output of the network will be a $1 \times 10$ vector, where each feature represents the probability of the input being a digit $0$-$9$.
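The flattening described above can be sketched in NumPy (using a random array as a stand-in for a real MNIST image):

```python
import numpy as np

# Stand-in for a single 28x28 MNIST image (random values, not real data)
image = np.random.default_rng(0).random((28, 28))

x = image.reshape(1, 784)  # linearize into a 1 x 784 row vector
print(x.shape)  # (1, 784)
```
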

For now, let’s suppose our network has one fully-connected layer with 10 neurons, one neuron per class or digit. Each neuron has a weight for each input and a bias term.

A neuron is a single node within a neural network layer. Neurons, also known as perceptrons, take in a set of inputs, process them, and produce an output. More information on the neuron (perceptron) can be found here.

Because our first layer is a fully-connected layer, each neuron in this layer takes in all the input data, making 784 connections and weights. More on different neural network layers here.

Definitions

Let’s define variables for our network:

  • $x$ is the input data, a $1 \times 784$ vector.
x = \begin{bmatrix} x_1 & x_2 & \cdots & x_{784} \end{bmatrix}
  • $y$ is the input data label, a $1 \times 10$ vector.
y = \begin{bmatrix} y_1 & y_2 & \cdots & y_{10} \end{bmatrix}
  • $p$ is the probability output of the network, a $1 \times 10$ vector. Each value in $p$ corresponds to a value in $y$.
p = \begin{bmatrix} p_1 & p_2 & \cdots & p_{10} \end{bmatrix}
  • $w$ are the weights per neuron, a $784 \times 1$ vector per neuron. For all 10 neurons, $w$ is a $784 \times 10$ matrix.
w = \begin{bmatrix} w_{1,1} & w_{1,2} & \cdots & w_{1,10} \\ w_{2,1} & w_{2,2} & \cdots & w_{2,10} \\ \vdots & \vdots & \ddots & \vdots \\ w_{784,1} & w_{784,2} & \cdots & w_{784,10} \end{bmatrix}
  • $b$ is the bias of a neuron. For all 10 neurons, $b$ is a $1 \times 10$ vector.
b = \begin{bmatrix} b_1 & b_2 & \cdots & b_{10} \end{bmatrix}
  • $l$ is the output of the fully-connected layer, a $1 \times 10$ vector ($l$ stands for logits).
l = \begin{bmatrix} l_1 & l_2 & \cdots & l_{10} \end{bmatrix}

Now we can define an output per neuron in the fully connected layer:

l_j = w_j \cdot x + b_j = \sum_{i=1}^{784} w_{i,j} \cdot x_i + b_j

where $j$ is the index of the current neuron.
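The per-neuron sum above can be computed for all 10 neurons at once with a single matrix product. A minimal NumPy sketch (the variable values here are random placeholders, not trained parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((1, 784))                   # one flattened input image
w = rng.standard_normal((784, 10)) * 0.01  # column j holds neuron j's weights
b = np.zeros((1, 10))                      # one bias per neuron

# l_j = sum_i w_{i,j} * x_i + b_j, computed for all 10 neurons at once
l = x @ w + b
print(l.shape)  # (1, 10)
```
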

We can turn our logits into probabilities for each class:

p_j = \frac{e^{l_j}}{\sum_{k=1}^{10} e^{l_k}}

This is simply the exponential of the current logit divided by the sum of all logit exponentials. This is known as the softmax function.

The softmax function is guaranteed to output a probability distribution ($\sum_j p_j = 1$), and is popular for determining the best class in a classification problem for convolutional neural networks.
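A direct translation of the softmax formula into NumPy, with one addition not discussed above: subtracting the maximum logit before exponentiating, a standard trick to avoid overflow.

```python
import numpy as np

def softmax(l):
    """Turn a logit vector into a probability distribution."""
    shifted = l - np.max(l)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp)

p = softmax(np.array([2.0, 1.0, 0.1]))
print(np.sum(p))  # sums to 1 (up to floating-point rounding)
```
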

To train our model, we want to define a loss function for the difference between $p$ and $y$, our predicted and actual values:

L(w,b,x) = -\sum_{j=1}^{10} y_j \log(p_j)

where $y_j = 1$ if the input is class $j$, and $0$ otherwise.

This is known as the cross-entropy loss function. Note that loss is distinct from classification error, which is defined as $1 - \text{accuracy}$.

Cross-entropy loss measures how well the predicted probability distribution matches the actual distribution. Minimizing this loss minimizes the amount of extra information needed to represent the true distribution using our predicted distribution. When that amount is small, the loss is low and the two distributions are similar. There are other loss functions available, but cross-entropy is the most popular for classification problems.
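With a one-hot label, the cross-entropy sum reduces to the negative log-probability of the true class. A minimal sketch (the small `eps` guard is an implementation detail, not part of the math above):

```python
import numpy as np

def cross_entropy(p, y):
    """L = -sum_j y_j * log(p_j) for a one-hot label y."""
    eps = 1e-12  # guard against log(0)
    return -np.sum(y * np.log(p + eps))

y = np.zeros(10)
y[3] = 1.0        # the true class is digit 3
p = np.full(10, 0.05)
p[3] = 0.55       # the network is fairly confident in class 3

loss = cross_entropy(p, y)
print(loss)  # equals -log(0.55), about 0.598
```
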

Training

To train our model, we calculate the loss function $L$ for every training example; a full pass over all training examples is known as an epoch. We repeat this for many epochs until the loss over all training examples is minimized.

A training example is a single input and output pair. An epoch is a single pass through the entire dataset. Both are used to update the weights and biases of the network.

The most common strategy for minimizing the loss function is gradient descent. For each training example, we will use backpropagation to update weights and biases via a learning rate.

Gradient descent is an optimization algorithm that minimizes a function by iteratively moving in the direction of steepest descent. That direction is given by the negative gradient of the function at the current point.

The learning rate is one of the neural network’s hyperparameters. It determines how far each step of gradient descent should go.

Backpropagation is a method to calculate the gradient of the loss function with respect to the weights and biases of the network. Backpropagation is used with gradient descent to update the weights and biases.

The weights and biases are updated as follows:

w_{i,j} = w_{i,j} - \lambda \frac{\partial L}{\partial w_{i,j}}

b_j = b_j - \lambda \frac{\partial L}{\partial b_j}

where $\lambda$ is the scalar learning rate.
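The update rules above are a single elementwise step. A sketch with stand-in gradient values (the gradients here are placeholders for illustration, not computed from a real loss):

```python
import numpy as np

lam = 0.1  # learning rate lambda
w = np.ones((784, 10))
b = np.zeros((1, 10))

# Stand-in gradients, purely for illustration
dL_dw = np.full((784, 10), 0.5)
dL_db = np.full((1, 10), 0.5)

w = w - lam * dL_dw  # w_{i,j} <- w_{i,j} - lambda * dL/dw_{i,j}
b = b - lam * dL_db  # b_j     <- b_j     - lambda * dL/db_j
print(w[0, 0], b[0, 0])  # each parameter moved a small step downhill
```
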

In order to calculate the partial derivatives, we need to express $\frac{\partial L}{\partial w_{i,j}}$ and $\frac{\partial L}{\partial b_j}$ in terms of $x_i$ and $p_j$.

The derivatives are as follows:

\frac{\partial L}{\partial w_{i,j}} = \frac{\partial L}{\partial p_a} \frac{\partial p_a}{\partial l_j} \frac{\partial l_j}{\partial w_{i,j}} = \begin{cases} x_i (p_j - 1), & a = j \\ x_i p_j, & a \neq j \end{cases}

\frac{\partial L}{\partial b_j} = \frac{\partial L}{\partial p_a} \frac{\partial p_a}{\partial l_j} \frac{\partial l_j}{\partial b_j} = \begin{cases} p_j - 1, & a = j \\ p_j, & a \neq j \end{cases}

where $a$ is the index of the true class ($y_a = 1$).

We skip much of the calculation here, but the derivatives follow from the chain rule via backpropagation. More extensive derivation walkthroughs can be found here.
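The two case analyses above collapse into the compact form $\frac{\partial L}{\partial l_j} = p_j - y_j$. The following sketch uses that form for one full training step on synthetic data (names like `forward` are illustrative, not from any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((1, 784))      # one synthetic "image" (not real MNIST data)
y = np.zeros((1, 10))
y[0, 3] = 1.0                 # true class: digit 3

w = rng.standard_normal((784, 10)) * 0.01
b = np.zeros((1, 10))
lam = 0.01                    # learning rate

def forward(x, w, b):
    l = x @ w + b                   # fully-connected layer
    exp = np.exp(l - np.max(l))     # softmax (with stability shift)
    return exp / np.sum(exp)

def cross_entropy(p, y):
    return -np.sum(y * np.log(p + 1e-12))

p = forward(x, w, b)
loss_before = cross_entropy(p, y)

# dL/dl_j = p_j - y_j combines the softmax and cross-entropy derivatives
dl = p - y
w -= lam * (x.T @ dl)  # dL/dw_{i,j} = x_i * (p_j - y_j)
b -= lam * dl          # dL/db_j    = p_j - y_j

loss_after = cross_entropy(forward(x, w, b), y)
print(loss_before > loss_after)  # True: one step lowered the loss
```
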

Learn More