The aim of this article is to discuss the **intuition** behind training an ANN. The emphasis is on the fundamental ideas rather than on detailed maths or full implementations. I will assume the reader has a basic understanding of a neural network. I aim to give the reader an understanding of the key concepts of ANNs, their rationale and some intuitions about the mathematics behind them.

*What’s your problem?*

To begin with, we require a problem to solve. A typical application of ANNs is classification: we train a classifier that predicts the class membership of a record from some input data, for example Yes/No or Male/Female.

Before deciding on a modelling approach, it’s imperative to first understand the data. The principle of parsimony applies in model selection, meaning that we should choose the simplest model that adequately explains the data. For instance, say we wanted to build a classifier to predict Yes/No from a given data set (assume [x, y, z] inputs). We would first begin by plotting the data as a scatterplot to judge whether a linear classifier, such as Logistic Regression, is sufficient, or whether we’ll need to create non-linear features.

*Are we there yet?*

When you first begin using ANNs, it’s a good idea to benchmark their results against a simpler model built with another Machine Learning approach. For instance, assuming the data broadly lends itself to linear classification, we could initially train a Logistic Regression classifier. The input variables are [x, y, z], and the output is a binary class Yes/No. This would effectively give us a baseline, or “lower bound”, for the performance we expect from the ANN: the decision boundary determined by the Logistic classifier will separate the classes as best it can via a straight line (a flat plane, with three inputs), but won’t be able to capture the true shape of the curve that separates the two classes.
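As a rough illustration of such a baseline, here is a minimal sketch of a Logistic Regression classifier in plain Python. The toy data, the function names (`train_logistic`, `predict`) and the learning rate are all assumptions made for this example, and the fitting procedure is the Gradient Descent idea discussed later in the article. In practice you would more likely reach for an existing library.

```python
import math

def sigmoid(a):
    """Logistic function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))

def train_logistic(rows, labels, lr=1.0, epochs=2000):
    """Fit weights and a bias with plain batch gradient descent."""
    n_features = len(rows[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of the log loss w.r.t. the linear score
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        # Step against the average gradient.
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(rows)
    return w, b

def predict(w, b, x):
    """Return 1 (Yes) if the predicted probability is at least 0.5, else 0 (No)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if sigmoid(score) >= 0.5 else 0

# Made-up, linearly separable toy data: "Yes" (1) when x, y and z are all large.
rows = [[0.1, 0.2, 0.1], [0.3, 0.1, 0.2], [0.2, 0.3, 0.3],
        [0.9, 0.8, 0.7], [0.7, 0.9, 0.8], [0.8, 0.7, 0.9]]
labels = [0, 0, 0, 1, 1, 1]

w, b = train_logistic(rows, labels)
correct = sum(predict(w, b, x) == y for x, y in zip(rows, labels))
baseline_accuracy = correct / len(rows)
print("baseline accuracy:", baseline_accuracy)
```

Whatever accuracy the baseline reaches on your real data is the figure the ANN then has to beat to justify its extra complexity.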

*Training time*

The next step is to train an ANN. A simple ANN has 3 layers, ie an input layer, one hidden layer and an output layer. Deep Neural Networks, ie Deep Learning, typically consist of many hidden layers, and can contain millions of parameters or more.

NB: the number of nodes in the input layer is determined by the dimensionality of our data, ie 3 in this case, the [x, y, z] values. Similarly, the number of nodes in the output layer is determined by the number of classes we have, ie 2 in this case, Yes/No, resulting in a probability for each. Given there are only 2 classes, we could actually use just 1 output node to represent the prediction.

But how do we choose the number of nodes for the hidden layer? We need to consider computational efficiency (more nodes = more computational power required) and function complexity (more nodes = ability to fit functions of greater complexity BUT increased risk of overfitting). Guidelines exist for choosing the right number, but it depends on your problem, so it is often best to try a few different numbers, plot the resulting decision boundaries, and try to determine whether any increase in accuracy comes at the expense of overfitting.
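To get a feel for how quickly the model grows with the hidden layer, the small sketch below counts the weights and biases of a network with one hidden layer. The helper name `count_parameters` is made up for this example, and it assumes every node is connected to every node in the previous layer.

```python
def count_parameters(n_inputs, n_hidden, n_outputs):
    """Number of weights and biases in a fully connected net with one hidden layer."""
    hidden_layer = n_inputs * n_hidden + n_hidden    # weights + biases
    output_layer = n_hidden * n_outputs + n_outputs  # weights + biases
    return hidden_layer + output_layer

# 3 inputs ([x, y, z]) and 2 outputs (Yes/No), with a few candidate hidden sizes.
for n_hidden in (1, 3, 5, 20, 50):
    print(n_hidden, "hidden nodes ->", count_parameters(3, n_hidden, 2), "parameters")
```

More parameters means more to compute and more freedom to fit the training data, which is exactly the trade-off described above.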

*Activate all engines*

Now that we’ve chosen the number of nodes in our hidden layer, we next need to choose an Activation Function. This function simply transforms the input of each hidden node into its output, typically squashing it into a fixed range (between 0 and 1 for the Sigmoid, between -1 and 1 for Tanh). Activation functions are an integral part of an ANN as they determine how strongly a neuron is ‘activated’, and they are what allows the network to model non-linear relationships. This information is then sent to the next layer of neurons as input.

Some of the most common activation functions are the Sigmoid (Logistic function) and Tanh, both ‘S’-shaped functions.

NB: a computational advantage of the sigmoid and tanh activation functions is that their derivatives can be written in terms of the original function (but more importantly, they’re differentiable functions). For instance, the derivative of tanh is 1 - tanh², and the derivative of the sigmoid is sigmoid × (1 - sigmoid). This has the distinct advantage of requiring us to calculate tanh only once, as we can then reuse the result in the derivative. This is very important later when we need the derivatives to reduce the error.
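The reuse trick can be sketched in a few lines. The function names below are made up for this example; each derivative takes the already-computed output of the activation rather than the original input, and a finite-difference estimate is included purely as a sanity check.

```python
import math

def sigmoid(a):
    """Logistic function, output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))

def sigmoid_derivative(s):
    """Derivative of the sigmoid, given its already-computed output s."""
    return s * (1.0 - s)

def tanh_derivative(t):
    """Derivative of tanh, given its already-computed output t."""
    return 1.0 - t * t

a = 0.5                # an arbitrary input value
s = sigmoid(a)         # computed once, reused in the derivative
t = math.tanh(a)       # computed once, reused in the derivative

# Finite-difference estimates of the same slopes, for comparison.
eps = 1e-6
numeric_sigmoid = (sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)
numeric_tanh = (math.tanh(a + eps) - math.tanh(a - eps)) / (2 * eps)

print(sigmoid_derivative(s), numeric_sigmoid)
print(tanh_derivative(t), numeric_tanh)
```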

*Have you been converted?*

Now, how will the ANN actually make a prediction? Basically, this is done via a set of matrix multiplications, and the application of the activation function, and is known as Forward Propagation.

In order to make a prediction (ie Yes/No in our example), we need to convert the raw values produced by the output layer to probabilities that represent this prediction. To do this, we use the Softmax function (a multivariate generalisation of the Logistic function, also known as the Normalised Exponential function), which transforms the output values so that each lies between 0 and 1 and together they sum to 1.
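A minimal sketch of Forward Propagation followed by Softmax is shown below, written in plain Python so the matrix multiplications are visible. The weights are arbitrary made-up numbers, the 4-node hidden layer and the tanh activation are assumptions for this example, and the helper names (`layer`, `forward`) are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def layer(inputs, weights, biases):
    """Matrix-vector product plus bias; weights holds one row per output node."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def forward(x, W1, b1, W2, b2):
    """Input -> hidden layer (tanh) -> output layer -> softmax probabilities."""
    hidden = [math.tanh(a) for a in layer(x, W1, b1)]
    scores = layer(hidden, W2, b2)
    return softmax(scores)

# Made-up parameters: 3 inputs -> 4 hidden nodes -> 2 outputs (Yes/No).
W1 = [[0.2, -0.1, 0.4], [0.5, 0.3, -0.2], [-0.3, 0.1, 0.2], [0.1, 0.1, 0.1]]
b1 = [0.0, 0.1, -0.1, 0.0]
W2 = [[0.3, -0.2, 0.5, 0.1], [-0.4, 0.2, 0.1, 0.3]]
b2 = [0.0, 0.0]

probs = forward([1.0, 2.0, 3.0], W1, b1, W2, b2)
print(probs)  # two probabilities, one per class
```

The predicted class is simply the one with the larger probability.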

*Learning to learn*

But what does it actually mean for the ANN to ‘learn’ in order to make these predictions? All we are doing here is trying to find the values of the network parameters (weights and biases) that minimise the error on our training data (ie the difference between the predicted and the actual results). The function that actually measures the error is called the Loss Function. Typically for the Softmax output, the Cross-Entropy Loss (ie negative Log Likelihood) is used. Simply put, for each training example it takes the negative log of the probability the network assigned to the correct class, and sums these values over the training examples. Confident wrong predictions are penalised heavily, while confident correct predictions add almost nothing. By minimising the loss, we make the training data as likely as possible under the model, which usually goes hand in hand with higher accuracy.
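As a small illustration, the sketch below computes the Cross-Entropy Loss for two made-up sets of predictions. The function name and the numbers are assumptions for this example, and the loss is averaged rather than summed, which only rescales it.

```python
import math

def cross_entropy(predicted_probs, true_classes):
    """Mean negative log of the probability assigned to the correct class."""
    total = 0.0
    for probs, true_class in zip(predicted_probs, true_classes):
        total += -math.log(probs[true_class])
    return total / len(true_classes)

true_classes = [0, 1]  # the correct class index for each of two examples

# Mostly right: high probability on the correct class each time.
mostly_right = cross_entropy([[0.9, 0.1], [0.2, 0.8]], true_classes)

# Mostly wrong: high probability on the incorrect class each time.
mostly_wrong = cross_entropy([[0.1, 0.9], [0.8, 0.2]], true_classes)

print(mostly_right, mostly_wrong)  # the second value is much larger
```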

So, how do we minimise the loss? One common, and well known, algorithm is Gradient Descent. The idea behind it is to start from some initial parameter values and repeatedly take a small step in the direction that reduces the loss the fastest, ie the direction of the negative gradient, with the size of each step set by a learning rate. The important thing to note is that gradient descent needs the (partial) derivatives of the loss function with respect to the parameters (weights and biases). We mentioned above that tanh and sigmoid both allow computationally efficient ways to calculate their derivatives, so this is where it becomes important.
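The mechanics can be seen on a deliberately simple, made-up loss with just two parameters. This is only a sketch: the bowl-shaped function, the starting point and the learning rate of 0.1 are assumptions chosen so the behaviour is easy to follow, not the loss of a real network.

```python
def loss(w, b):
    """A made-up bowl-shaped loss with its minimum at w = 3, b = -1."""
    return (w - 3.0) ** 2 + (b + 1.0) ** 2

def gradient(w, b):
    """Partial derivatives of the loss with respect to w and b."""
    return 2.0 * (w - 3.0), 2.0 * (b + 1.0)

w, b = 0.0, 0.0          # arbitrary starting point
learning_rate = 0.1      # size of each step
history = [loss(w, b)]

for _ in range(50):
    dw, db = gradient(w, b)
    w -= learning_rate * dw   # step against the gradient
    b -= learning_rate * db
    history.append(loss(w, b))

print(w, b)                     # close to 3 and -1
print(history[0], history[-1])  # the loss shrinks towards 0
```

Training an ANN follows the same loop; the only difference is that the loss depends on many more parameters and its gradient has to be calculated through the layers of the network.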

*To move forward, you must first step backwards*

To calculate these derivatives, we use the Backward Propagation algorithm (backpropagation), which is a special case of reverse-mode Automatic Differentiation. It calculates the derivatives for the network by starting from the output and propagating towards the input. Most importantly, a single backward pass gives us the partial derivative of an output with respect to EVERY input and parameter. The alternative, forward-mode differentiation, which propagates derivatives from the input towards the output (not to be confused with the Forward Propagation used for prediction), only gives us the derivative with respect to a single input per pass. For instance, imagine a function with 10 inputs and one output. With the forward approach, we’d have to cycle through 10 times to calculate the derivatives we need, but with Backward Propagation we can do it in one pass, a speed up of roughly 10 times. However, if a function has more outputs than inputs, the forward approach can be the faster one. Since the loss is a single number that depends on many parameters, Backward Propagation is the natural fit for training.
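The 10-input example can be sketched as follows. The “forward” approach is implemented here with dual numbers (forward-mode differentiation), and the backward pass is written out by hand for one made-up function, f(x) = tanh(w · x). The `Dual` class, the weights and the inputs are all assumptions for this illustration.

```python
import math

class Dual:
    """A value paired with its derivative w.r.t. one chosen input (forward mode)."""
    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        # Product rule.
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

def dual_tanh(d):
    t = math.tanh(d.value)
    return Dual(t, (1.0 - t * t) * d.deriv)  # chain rule

weights = [0.1 * (i + 1) for i in range(10)]  # made-up weights
inputs = [0.05 * (i + 1) for i in range(10)]  # made-up inputs

def f_dual(xs):
    """f(x) = tanh(sum_i w_i * x_i), evaluated on Dual numbers."""
    total = Dual(0.0)
    for w, x in zip(weights, xs):
        total = total + Dual(w) * x
    return dual_tanh(total)

# Forward mode: one pass per input, seeding that input's derivative with 1.
forward_grad = []
forward_passes = 0
for i in range(len(inputs)):
    xs = [Dual(x, 1.0 if j == i else 0.0) for j, x in enumerate(inputs)]
    forward_grad.append(f_dual(xs).deriv)
    forward_passes += 1

# Reverse mode: evaluate once, then sweep backwards from the output.
pre_activation = sum(w * x for w, x in zip(weights, inputs))
output = math.tanh(pre_activation)
d_pre = 1.0 - output * output                # d output / d pre_activation
reverse_grad = [d_pre * w for w in weights]  # chain rule back to every input
reverse_passes = 1

print(forward_passes, reverse_passes)  # 10 passes versus 1
```

Both approaches produce the same ten derivatives; the difference is only in how many passes are needed to get them.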

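NB: the mathematics behind Backward Propagation, used to calculate the derivatives, is based on the Chain rule, along with the Quotient and Product rules. The important point to note is that the rate at which the network learns is directly related to the error in the output, which is clearly seen via the partial derivatives calculated by the Chain rule: with the Softmax and Cross-Entropy combination, the larger the error, the larger the gradient, and the bigger the correction applied to the parameters.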
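For the Softmax and Cross-Entropy combination, the Chain rule gives a particularly tidy result: the derivative of the loss with respect to each raw output value is the predicted probability minus the target (1 for the correct class, 0 otherwise). The sketch below checks this against a finite-difference estimate; the raw output values are made up for this example.

```python
import math

def softmax(scores):
    m = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def loss(scores, true_class):
    """Cross-entropy of the softmax output for a single example."""
    return -math.log(softmax(scores)[true_class])

scores = [2.0, -1.0]  # made-up raw outputs: the network favours class 0
true_class = 1        # but the correct answer is class 1

# Analytic gradient from the Chain rule: predicted probability minus target.
probs = softmax(scores)
target = [1.0 if k == true_class else 0.0 for k in range(len(scores))]
analytic_grad = [p - y for p, y in zip(probs, target)]

# Finite-difference estimate of the same gradient.
eps = 1e-6
numeric_grad = []
for k in range(len(scores)):
    up = list(scores)
    up[k] += eps
    down = list(scores)
    down[k] -= eps
    numeric_grad.append((loss(up, true_class) - loss(down, true_class)) / (2 * eps))

print(analytic_grad)
print(numeric_grad)
```

Because the gradient is literally the error in the output, a badly wrong prediction produces a large gradient and a correct, confident prediction produces a gradient close to zero.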

We now have what we need, at least at an intuitive level, to implement a simple ANN model.