From a mathematical perspective, Deep Learning can effectively be defined as:
The application of a set of complex geometric transformations to map the input space to the output space.
In more detail, we are simply doing the following when developing a Deep Learning model:
- Vectorisation: convert input and output data into vectors, i.e. positions of points in space relative to one another
- Transformation: perform geometric transformations of the vectors from layer to layer, from input to output
- Parametrisation: iteratively update the weights of the layers to maximise output accuracy
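These three steps can be sketched with a tiny two-layer network in NumPy (a minimal illustration on an invented toy dataset, not a production implementation):

```python
import numpy as np

# Vectorisation: encode inputs and outputs as points in space (toy XOR data).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 8))           # layer 1 weights
W2 = rng.normal(size=(8, 1))           # layer 2 weights
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

losses = []
for step in range(5000):
    # Transformation: geometric transforms of the vectors, layer to layer.
    h = sigmoid(X @ W1)                # hidden representation
    out = sigmoid(h @ W2)              # predicted output
    losses.append(float(np.mean((out - y) ** 2)))

    # Parametrisation: backpropagate the error and revise the weights.
    grad_out = (out - y) * out * (1 - out)
    grad_W2 = h.T @ grad_out
    grad_h = (grad_out @ W2.T) * h * (1 - h)
    grad_W1 = X.T @ grad_h
    W2 -= 0.5 * grad_W2
    W1 -= 0.5 * grad_W1

print(losses[0], losses[-1])           # the error shrinks as the model 'learns'
```

The loop is nothing more than trial-and-error refinement: transform, measure the error, nudge the weights downhill, repeat.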
“All models are wrong; some models are useful” - George E. P. Box
As with any model developed to try to emulate the ‘real world’, there are assumptions made. It’s imperative to understand these assumptions and hence know the limitations of the model.
One of the key assumptions made in Deep Learning models is hidden away within the Backward Propagation step. This is the crucial phase where the model does the actual ‘learning’. It does this by iterating backwards through the network and, together with Gradient Descent, continually revising the model parameters (weights) to minimise the error, and hence ‘learn’ the correct output. Mathematically, this is done by calculating the partial derivatives of the error with respect to every weight in the network.
The key assumption in most cases is that the function is differentiable, which means that it is continuous and smooth. This is a serious restriction, limiting the transformation from input to output, throughout the network, to one that is both smooth and continuous, i.e. the derivative exists at each point in the domain. If we were to plot the function on a graph, it must be relatively smooth, can’t have any breaks (i.e. it is continuous) and must have a tangent at each point. However, some variants can also handle pseudo-gradients (such as subgradients), and thus can be applied to non-smooth functions.
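The requirement can be checked numerically: a finite-difference estimate agrees with the analytic derivative for a smooth activation like the sigmoid, while ReLU has a kink at zero where the one-sided slopes disagree (a sketch of the idea, not any particular framework's code):

```python
import numpy as np

def num_grad(f, x, eps=1e-6):
    # Central finite difference: only meaningful where f is smooth around x.
    return (f(x + eps) - f(x - eps)) / (2 * eps)

sigmoid = lambda x: 1 / (1 + np.exp(-x))
d_sigmoid = lambda x: sigmoid(x) * (1 - sigmoid(x))   # analytic derivative

gap = abs(num_grad(sigmoid, 0.3) - d_sigmoid(0.3))    # tiny: the derivative exists

relu = lambda x: max(x, 0.0)
# At x = 0 ReLU has a kink: the one-sided slopes disagree, so no true
# derivative exists at that point.
left = (relu(0.0) - relu(-1e-6)) / 1e-6               # slope from the left: 0.0
right = (relu(1e-6) - relu(0.0)) / 1e-6               # slope from the right: 1.0
# Frameworks simply pick a subgradient (commonly 0 at the kink) and carry on.
print(gap, left, right)
```

This is exactly the "pseudo-gradient" escape hatch: at the isolated point where the tangent doesn't exist, any value between the one-sided slopes still lets Gradient Descent proceed.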
A Mathematical Enigma
So at its core, Deep Learning is just the application of complex geometric transformations, using a brute force approach to ‘learn’ the best parameters that map the input space to the output space.
All models have somewhat unrealistic assumptions, so why should modelling via Deep Learning be any different? How do these assumptions affect the types of systems that can sensibly be modelled, and the interpretation of these models?
If all Deep Learning is doing is a bunch of fancy transformations and iterations to find the parameters of best fit by trial and error, can we say it’s ‘learning’ in a real way?
I agree with Francois Chollet that current AI systems, being devoid of abstraction and reasoning, do not learn in the human sense. They are incredibly powerful and useful, but we need to be aware of what they can do, and not overestimate their abilities as sometimes happens with vendors and non-practitioners caught up in the hype. We also need to understand that they simply cannot (yet) tackle problems that require reasoning. For instance, some systems create the expectation that they can think like a human, yet they are nowhere near having this capability.
Big Data + Cheap GPUs + Deep Learning Algorithms = Artificial Intelligence?
An example that illustrates the limitations of the ‘learning’ in Deep Learning comes from the paper “Intriguing Properties of Neural Networks” (2014), published by researchers at Google. They define the concept of ‘adversarial examples’, showing how a barely perceptible change to an input image can cause the network to misclassify it. A slight tweak to the pixel values, effectively undetectable to the human eye, resulted in the network misclassifying the image. To achieve this, they purposefully adjusted pixels to maximise the network’s prediction error, i.e. an ‘adversarial’ image.
It was shown that the same perturbation can cause other networks, trained on different subsets of the data, to also misclassify the same input image. We know a model is only as good as the data we give it, and how often is the data clean, accurate and trustworthy? It begs the question: how might adversarial examples be used to trick Deep Learning systems, intentionally causing a model to misclassify inputs via changes that are imperceptible to us? Think of autonomous vehicles and adversarial signage…
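The construction can be sketched with a toy linear classifier standing in for a trained network (the weights and input here are invented for illustration; the gradient-sign trick is the fast-gradient-sign idea from Goodfellow et al.'s follow-up work, whereas the 2014 paper used a more expensive optimisation):

```python
import numpy as np

# Toy 'model': logistic regression over a 4-pixel 'image' (invented weights).
w = np.array([1.5, -2.0, 0.8, -1.2])
predict = lambda x: 1 / (1 + np.exp(-(x @ w)))   # P(class 1)

x = np.array([0.6, 0.1, 0.7, 0.2])               # clean input, classified as class 1
p = predict(x)                                   # > 0.5

# Nudge every pixel by at most eps in the direction that maximises the
# prediction error for the true label y = 1:
eps = 0.25
grad_loss_x = (p - 1.0) * w                      # d(-log p)/dx for logistic regression
x_adv = x + eps * np.sign(grad_loss_x)

print(predict(x), predict(x_adv))                # the prediction flips below 0.5
```

With only 4 pixels the nudge must be fairly large, but a real image has ~10^5 pixels, so the cumulative shift in the logit is huge even when the per-pixel change is visually imperceptible.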
What this proved was that the network does not have an understanding of its input and output. As the model is trained on a bounded input domain in order to learn, its understanding is restricted to within this range, which limits its ability to generalise to previously unseen domains.
A Deeply Butterfly Effect?
“When the present determines the future, but the approximate present does not approximately determine the future” - Edward Lorenz
Chaos theory is the study of the behaviour of dynamical systems that are very sensitive to their initial conditions, and is most popularly associated with weather forecasting. This high sensitivity to initial inputs is known as the ‘Butterfly Effect’, a phrase coined by Edward Lorenz to describe a metaphorical tornado whose path is influenced by the flapping of a distant butterfly’s wings several weeks earlier. Lorenz discovered that a very small change in the initial conditions of his weather models (caused by rounding approximations) resulted in significantly different outcomes.
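Lorenz's observation is easy to reproduce with his own equations: integrate the system twice from initial conditions that differ by one part in a hundred million, and the trajectories end up in completely different places (a rough sketch using simple Euler steps, not Lorenz's original numerics):

```python
import numpy as np

def lorenz_step(state, dt=0.005, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    # One explicit Euler step of the Lorenz equations.
    x, y, z = state
    return state + dt * np.array([sigma * (y - x),
                                  x * (rho - z) - y,
                                  x * y - beta * z])

a = np.array([1.0, 1.0, 1.0])
b = a + np.array([1e-8, 0.0, 0.0])     # a 'butterfly flap' in the input

max_sep = 0.0
for _ in range(6000):                  # 30 time units
    a, b = lorenz_step(a), lorenz_step(b)
    max_sep = max(max_sep, float(np.linalg.norm(a - b)))

print(max_sep)                         # the two trajectories have fully diverged
```

The initial gap of 1e-8 grows exponentially until the two runs are as far apart as the attractor allows: the approximate present did not approximately determine the future.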
Given the aforementioned sensitivity of Deep Learning models to their inputs, does this mean that we are implicitly creating ‘chaos-type’ systems?
After all, the problems are deterministic (for a given input, the correct answer is always the same), yet a tiny perturbation in the input can lead to a misclassified result. Given there is no randomness involved, the data should be enough for us to make correct predictions that aren’t easily misled by extremely small changes in the input.
Given the bounded nature of these models, their limited ‘intelligence’, and their sensitivity to input data, do we really understand how fragile they are?
The Current State of Deep Learning
The following quote from Francois Chollet succinctly summarises the current state of AI:
“The only real success of Deep Learning so far has been the ability to map space X to space Y using a continuous geometric transform, given large amounts of human-annotated data. Doing this well is a game-changer for essentially every industry, but it is still a very long way from human-level AI”
Deep Learning is incredibly powerful with a multitude of practical applications. There’s some exciting work currently being done to overcome its limitations and advance towards true artificial intelligence.
For instance, in the field of handwriting recognition, John Launchbury is developing generative models that can be taught the strokes behind any given character. Rather than giving the system 100,000 examples, it can learn from just a few by using a model that describes how a hand moves on the page.
Another example is the ‘energy-based’ models proposed by Yann LeCun. Rather than training an ANN to produce only one output, these models produce an entire set of possible outputs, along with associated scores.
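The flavour of that idea can be shown with a toy energy function (everything here — the prototypes, the candidates, the distance-based energy — is invented purely to illustrate the output shape, not LeCun's actual models):

```python
import numpy as np

# Hypothetical energy function: lower energy = more compatible (input, output) pair.
PROTOTYPES = {"cat": np.array([1.0, 0.0]),
              "dog": np.array([0.0, 1.0]),
              "fox": np.array([0.7, 0.7])}

def energy(x, label):
    # Toy choice: squared distance between the input and a per-class prototype.
    return float(np.sum((x - PROTOTYPES[label]) ** 2))

x = np.array([0.8, 0.3])
# Instead of a single answer, return every candidate output with its score:
scored = sorted(((energy(x, label), label) for label in PROTOTYPES), key=lambda t: t[0])
print(scored)   # full ranked set of (energy, label) pairs, best first
```

The caller then sees not just the winner but how strongly each alternative was rejected, which is exactly the extra information a single-output classifier throws away.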
*thanks to Nikolaj Van Omme for valuable feedback