Over the years, I’ve worked with talented Data Scientists whose backgrounds weren’t in typical quantitative disciplines, such as mathematics or statistics. I’ve had the privilege of helping some of them better understand the underlying mathematics behind many commonly used Machine Learning and Deep Learning algorithms.
This, along with the current growth in the popularity of Data Science, has made me realise that people will continue to transition from all sorts of disciplines, which means they may not have a strong grounding in mathematics.
With this motivation in mind, I decided to write an article that explains, from a mathematically intuitive perspective, one of the most fundamental concepts used in Machine Learning: Matrix Calculus.
With an abundance of Machine Learning libraries and packages available in tools such as R and Python, you can be almost completely oblivious to the underlying mathematics which powers most techniques and algorithms, such as Gradient Descent, Optimisation and Linear Regression.
A solid understanding of the underlying mathematics, however, is fundamental in allowing you to:
- Select the right algorithm to use;
- Understand the limitations and bounds of the algorithm;
- Select appropriate parameters;
- Know which validation techniques to use;
- Understand and resolve poor performance and results;
- Apply appropriate confidence levels on the results; and
- Recognise over/under fitting.
I aim to discuss some key ideas and concepts commonly used in the context of Machine Learning in order to help you get started. To further develop your knowledge, I encourage you to explore the numerous materials available online, in textbooks or through courses. After all, the purpose of this article is not to be a reference manual or textbook, but rather an introduction to some fundamental concepts from an intuitive perspective.
Fear of the Unknown
Pick up any book on Machine Learning, take a peek under any Deep Learning algorithm, and you’re likely to be bombarded by squiggly lines and Greek letters. This will either seem comforting to you or will result in sweats, swearing and complete dismay. If you fall into the latter group, then please read on, as I intend to shed some light on what inflicts such terror!
Matrix calculus forms the foundations of so many Machine Learning techniques, and is the culmination of two fields of mathematics:
- Linear Algebra: a set of mathematical tools used for manipulating groups of numbers simultaneously. It includes the structures (vectors and matrices, to be defined below) and operations and rules (addition, subtraction, multiplication) to enable this; and
- Multivariate Calculus: an extension of ordinary calculus, allowing us to move from functions of one variable to functions of many variables, along with a range of associated tools and theorems.
Formally Derivative
Consider two points, \((x_{0}, f(x_{0}))\) and \((x_{1}, f(x_{1}))\), plotted on a graph (where the red curve represents some function \(f\)), and joined by a straight line (denoted in blue):
Understanding how a function behaves as a result of changes to its inputs is of great importance. In this case we have \(f\) as a function of \(x\) ie \(f(x)\). In order to understand how \(f\) behaves in different conditions, it would be good to know how any change in \(x\) will affect \(f\).
The average slope between two points can help us approximate the relationship between \(x\) and \(f(x)\). For instance, as plotted above, a positive change in \(x\) results in a proportional positive change in \(f(x)\).
The slope can be defined as follows:
$$Slope = \frac{f(x_{1}) - f(x_{0})}{x_{1} - x_{0}}$$
Now, in practice it’s much more useful to determine the slope at a ‘specific point’, as we are often interested in calculating the effect on \(f\) caused by changes in all values of \(x\), hence needing to know the slope for each value of \(x\).
To do this, we introduce the notion of a ‘derivative’. A derivative can simply be defined as the slope at a specific point. The act of calculating the derivative is known as differentiation.
When we calculate the derivative of \(f\) with respect to \(x\), all we’re doing is asking ‘how much does \(f\) change for a specific change in \(x\)?’.
In order to calculate the slope at a single point, we must first insert an imaginary second point (as two points are required to calculate a slope), which we set at an ‘infinitesimally small’ distance away from \(x\). We’ll use \(\Delta x\) to denote this very tiny distance from \(x\), where \(\Delta\) represents ‘change in’. This will allow us to ‘estimate’ the slope from an arbitrarily small distance away.
To do this, we need to introduce the concept of a limit. It simply states that the slope ‘approaches’ some value, ie \(\frac{df}{dx}\) (the derivative of \(f\) with respect to \(x\)), as \(\Delta x\), a very small change in \(x\), approaches \(0\):
$$\frac{d f }{d x} = \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x} $$
Intuitively, the limit tells us what value the slope settles on at a specific point. We’re basically asking: as \(\Delta x\) shrinks towards \(0\), what would the slope theoretically look like at \(x\)? That value is our best estimate of the derivative at that point.
The derivative of the function at a specific point is equal to the slope of the tangent line at that point. The derivative can also be thought of as a measure of the steepness of the graph of a function, which can be visualised as the steepness of the tangent line at a particular point on the graph.
The \(d\) used to define the derivative for some function \(f(x)\), ie \(\frac{df}{dx}\), can be interpreted as ‘a very small change’ in \(f\) and \(x\), respectively.
Derivatives can also be thought of as defining a ‘rate of change’, ie how much does \(y\) change for a change in \(t\), where \(t\) represents time rather than the distance usually denoted by \(x\).
Note that the derivative of a function is also a function itself, which simply provides us with an expression that lets us determine the slope, or instantaneous rate of change, at any single point on a line.
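To make this concrete, here’s a minimal sketch in Python (the choice of \(f(x) = x^2\) and the step size are purely illustrative) that approximates the derivative using the limit definition above, and compares it to the known analytic answer \(2x\):

```python
# Approximate the derivative using the limit definition:
# df/dx ~ (f(x + dx) - f(x)) / dx, for a very small dx.

def f(x):
    return x ** 2

def numerical_derivative(f, x, dx=1e-6):
    """Finite-difference estimate of the slope of f at the point x."""
    return (f(x + dx) - f(x)) / dx

x = 3.0
print(numerical_derivative(f, x))  # ~6.000001, close to the analytic value 2*x = 6
```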
A Point of Difference
Now that we’ve defined the concept of a derivative, what can we actually do with it?
First, some notation: When we discuss derivatives, we typically label the function we’re interested in measuring changes in as \(f(x)\). The derivative of \(f(x)\) with respect to \(x\) can be represented in four common ways:
- \( \frac{d}{d(x)} f(x) \);
- \( \frac{d f}{d x} \);
- \( f'(x) \); or
- \( f' \).
It may help to think of \( \frac{d}{d x} \) as a mathematical operator, ie an operator that turns one function \( f(x) \) into another function \( f'(x) \).
There are a vast number of rules for differentiating different functions, but here are some basic and common ones (a quick check of a couple of them is sketched just after the list):
- Constant: \( \frac{d}{dx} c = 0 \), where \(c\) is a constant
- Multiplication by constant: \( \frac{d}{dx} c f(x) = c \frac{df}{dx} \), where \(c\) is a constant
- Summation Rule: \( \frac{d}{dx} \left(f(x) + g(x)\right) = \frac{df}{dx} + \frac{dg}{dx} \)
- Difference Rule: \( \frac{d}{dx} \left(f(x) - g(x) \right) = \frac{df}{dx} - \frac{dg}{dx} \)
- Product Rule: \( \frac{d}{dx} f(x) g(x) = \frac{df}{dx} g(x) + f(x) \frac{dg}{dx} \)
- Quotient Rule: \( \frac{d}{dx} \left(\frac{f(x)}{g(x)} \right) = \frac{ \left( \frac{df}{dx} g(x) - \frac{dg}{dx} f(x) \right)}{g(x)^2} \)
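As a quick check of a couple of these rules, here’s a small sketch using SymPy, with arbitrary illustrative choices of \(f\) and \(g\):

```python
import sympy as sp

x = sp.symbols('x')
f = sp.sin(x)        # an arbitrary choice of f(x)
g = x ** 2 + 1       # an arbitrary choice of g(x)

# Product Rule: d/dx (f*g) = f'*g + f*g'
lhs = sp.diff(f * g, x)
rhs = sp.diff(f, x) * g + f * sp.diff(g, x)
print(sp.simplify(lhs - rhs))    # 0, so the rule holds

# Quotient Rule: d/dx (f/g) = (f'*g - g'*f) / g**2
lhs = sp.diff(f / g, x)
rhs = (sp.diff(f, x) * g - sp.diff(g, x) * f) / g ** 2
print(sp.simplify(lhs - rhs))    # 0
```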
One rule of derivatives that is of particular importance is the Chain Rule. The Chain Rule is used when we have a composition of functions, ie a function of a function (or functions). The composition is denoted either by \( f\left( g(x) \right)\) or by \( \left(f \circ g\right)(x) \). In this case we have a ‘chaining’ of functions, whereby the output of the inner function, \( g(x) \), becomes the input to the outer function, \( f \).
The Chain Rule for differentiating a composite function is given by
$$ \frac{d}{dx} f \left( g(x) \right) = \frac{df}{dg} \frac{dg}{dx} $$
The Chain Rule is fundamental to the Gradient Descent algorithm, which is key to understanding how to implement an Artificial Neural Network.
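To see why, here’s a minimal sketch of Gradient Descent in one dimension (the functions, learning rate and iteration count are illustrative assumptions): we minimise \(h(x) = f(g(x))\) with \(f(u) = u^2\) and \(g(x) = 3x + 1\), using the Chain Rule to compute \(h'(x) = 2\,g(x) \cdot 3\):

```python
# Minimise h(x) = f(g(x)) with f(u) = u**2 and g(x) = 3*x + 1.
# By the Chain Rule: dh/dx = df/dg * dg/dx = 2*g(x) * 3.

def g(x):
    return 3 * x + 1

def dh_dx(x):
    return 2 * g(x) * 3     # outer derivative times inner derivative

x = 5.0                     # arbitrary starting point
learning_rate = 0.01

for _ in range(1000):
    x -= learning_rate * dh_dx(x)   # step against the derivative

print(x)  # approaches -1/3, where g(x) = 0 and h(x) is at its minimum
```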
All these rules and definitions are well and good when we’re using them for functions of only one parameter, but what about when our function depends on multiple parameters, as is often the case?
To handle such situations, we turn to multivariate (multivariable) calculus which simply extends the above concepts to functions of more than one variable.
The Calculus of Multiples
Multivariate Calculus helps us answer such questions as “what’s the derivative of \( f(x,y) \) with respect to \( x \) ie \( \frac{d}{d x} f(x,y) \)?”.
We now need to determine how changes in both \( x \) AND \( y \) affect \( f(x,y) \). To do this, we need to calculate two separate derivatives. First of all, we calculate how a change in \( x \) affects \( f \) WHILE treating \( y \) as a constant. Similarly, we then calculate how a change in \( y \) affects \( f \) WHILE holding \( x \) constant.
In doing so, we are calculating the partial derivatives of \( f \), or ‘partials’, as each one forms a ‘part’ of the total answer.
To denote the fact we’re working with partials, and not ordinary derivatives, the notation we use is slightly different. Instead of \( d \) we use the symbol \( \partial \).
We then end up with two separate derivatives: \( \frac{\partial}{\partial x} f(x,y) \) and \( \frac{\partial}{\partial y} f(x,y) \).
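As a concrete sketch (the choice of \(f\) is purely illustrative), here’s how the partials of \(f(x,y) = x^2 y + y^3\) can be approximated by nudging one variable while holding the other constant, and compared against the analytic answers \(\frac{\partial f}{\partial x} = 2xy\) and \(\frac{\partial f}{\partial y} = x^2 + 3y^2\):

```python
def f(x, y):
    return x ** 2 * y + y ** 3

def partial_x(f, x, y, h=1e-6):
    """Nudge x while holding y constant."""
    return (f(x + h, y) - f(x, y)) / h

def partial_y(f, x, y, h=1e-6):
    """Nudge y while holding x constant."""
    return (f(x, y + h) - f(x, y)) / h

x, y = 2.0, 3.0
print(partial_x(f, x, y), 2 * x * y)            # ~12.0 vs 12.0
print(partial_y(f, x, y), x ** 2 + 3 * y ** 2)  # ~31.0 vs 31.0
```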
The version of the Chain Rule for such situations is best defined by an example. Consider \(f\) to be a function of \(x\) and \(y\), which are both functions of \(t\), ie \(f \left( x(t), y(t) \right) \).
The Chain Rule derivative of \(f\) with respect to \(t\) is defined as follows:
$$ \frac{df}{dt} = \frac{\partial f}{\partial x} \frac{dx}{dt} + \frac{\partial f}{\partial y} \frac{dy}{dt} $$
Note the mix of notation used, based on whether the variable we’re differentiating is a function of one variable or two. Also note that the contributions of the partials with respect to \(x\) and \(y\) are ADDED to form the total, not multiplied.
The Total Derivative of \(f\) is then defined as follows, and is calculated by simply multiplying both sides by \(dt\):
$$ df = \frac{\partial f}{\partial x} dx + \frac{\partial f}{\partial y} dy $$
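Here’s a small sketch checking this version of the Chain Rule numerically (the particular functions are illustrative assumptions): take \(f(x,y) = xy\), with \(x(t) = \cos t\) and \(y(t) = \sin t\), so the rule gives \(\frac{df}{dt} = y \cdot (-\sin t) + x \cdot \cos t\):

```python
import math

def x(t): return math.cos(t)
def y(t): return math.sin(t)
def f(t): return x(t) * y(t)       # f(x(t), y(t)) = x*y

def chain_rule_df_dt(t):
    # df/dt = (df/dx)(dx/dt) + (df/dy)(dy/dt) = y*(-sin t) + x*(cos t)
    return y(t) * (-math.sin(t)) + x(t) * math.cos(t)

def numerical_df_dt(t, h=1e-6):
    return (f(t + h) - f(t)) / h

t = 0.7
print(chain_rule_df_dt(t))   # ~0.1700
print(numerical_df_dt(t))    # the same value, up to finite-difference error
```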
But why are derivatives, especially partial derivatives, such important concepts in mathematics?
To answer this, let’s take a brief detour and discuss mathematical modelling…
Mathemagical Modelling
Historically, mathematicians (and statisticians, physicists, biologists, engineers, economists etc) have wanted to understand the ‘real world’ and to represent its concepts in a universally known language. To do so, they came up with the notion of a mathematical model, ie a representation of a process in the language of mathematics, written as equations that describe physical (or theoretical) processes.
This has the power of abstracting complex concepts into simpler forms, providing insight and allowing us to predict, perform what-if analysis, etc.
To do so, we commonly need to consider the concept of rates of change of a quantity, ie how a change in input variables affects a change in the output. This can include a rate of change in one variable in relation to distance, time, etc.
For instance, to determine the temperature distribution throughout an object, we normally can’t simply define this in terms of \( T \), the temperature of the object. What we do instead is work up to it by starting with bits of information we already know. In this case, we use the well-known Newton’s Law of Cooling, which states that the rate of change in the temperature of an object is proportional to the difference between the object’s temperature and that of its surroundings:
$$ \frac{d T}{d t} = k (T - T_{s}) $$
where \( T_{s} \) is the surrounding temperature, \( k \) is the cooling constant and \( T \) is the temperature of the object.
So what we do is define the problem in terms of these changes. This is easy as we now have a way to describe this mathematically (via derivatives!). We then need to ‘solve’ the equation in order to determine an expression for \( T \) (solving differential equations is a topic for another article). Typically, with the complexity of models we create these days to describe real-world systems, we need to solve the equations numerically, using computational power. Simpler models however can be solved mathematically to give an explicit expression for \( T \), for instance. Regardless, without the concept of derivatives, none of this would be possible!
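As an illustration, here’s a minimal sketch that solves Newton’s Law of Cooling numerically with Euler’s method, repeatedly using the derivative as the local slope (the values of \(k\), \(T_{s}\), the initial temperature and the step size are assumptions chosen purely for illustration; \(k\) is negative here so that the object cools towards \(T_{s}\)):

```python
# Euler's method for dT/dt = k * (T - Ts):
# take small time steps, using the model's derivative as the local slope.

k = -0.1        # cooling constant (negative, so the object cools)
Ts = 20.0       # surrounding temperature
T = 90.0        # initial temperature of the object
dt = 0.1        # time step
steps = 600     # simulate 60 time units

for _ in range(steps):
    dT_dt = k * (T - Ts)    # the rate of change given by the model
    T += dT_dt * dt         # step forward in time

print(T)  # ~20.2, ie the object has almost reached the surrounding temperature
```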
Now that we know how to calculate the partials for a multivariate function, what can we actually do with them? To answer this, we need to first jump over to the land of Linear Algebra and discuss vectors and matrices. They are the structures that we’ll store our data in, before applying the above operations to, in order to do powerful things like perform Gradient Descent and Linear Regression.
Linear Algebra
In most tools that we use to solve real world problems (R, Python, SQL, SAS, Excel, etc), we store the data in 2D arrays, such as dataframes and tables. This is mathematically represented as a ‘matrix’.
Matrices
A matrix is simply a rectangular array of numbers, arranged in rows and columns. This is a general way of representing how we normally think of information stored in this form, ie with the rows representing records and the columns representing features/variables:
$$ \mathbf A_{m,n} = \begin{bmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,n} \\ a_{2,1} & a_{2,2} & \cdots & a_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m,1} & a_{m,2} & \cdots & a_{m,n} \end{bmatrix} $$
where \(a_{i, j} \in \mathbb{R} \), \(i = 1, 2, \ldots, m\) and \( j = 1, 2, \ldots, n\).
We typically refer to a matrix as being of dimension \( m \times n \), ie \( m \) rows by \( n \) columns, and we use bold font capitals by way of notation.
An individual element of matrix \(\mathbf A\) is denoted by \( a_{i,j} \), and is also referred to as an entry of the matrix, where \( i \) = row index and \( j \) = column index.
Below is a list of some common special cases and operations of matrices:
- A square matrix is one with the same number of rows and columns, ie dimension \( n \times n \). This is known as a square matrix of order \(n\);
- An identity matrix is a square matrix with \(1\)’s along the main diagonal (top left to bottom right), and \(0\)’s everywhere else;
- The transpose of a matrix means to swap the rows and columns, ie flip the matrix about the main diagonal. The transpose of matrix \( \mathbf A \) is denoted \( \mathbf A^{T} \);
- Associative rule: \( \mathbf A \left( \mathbf B \mathbf C \right ) = \left ( \mathbf A \mathbf B \right ) \mathbf C \);
- Matrix products are generally not commutative, ie \( \mathbf A \mathbf B \ne \mathbf B \mathbf A \); and
- Transposition rule: \( \left( \mathbf A \mathbf B \right )^{T} = \mathbf B^{T} \mathbf A^{T} \).
One important operation to be aware of is how to multiply two matrices together:
Consider matrix \(\mathbf A\) of dimension \( m \times n \) and matrix \( \mathbf B \) of dimension \( n \times p \). In order to be able to multiply these two matrices together, the number of columns in matrix \( \mathbf A\) must equal the number of rows in matrix \( \mathbf B \), ie \( n \) in this case. The dimension of the resulting matrix \( \mathbf C \) is then \( m \times p \):
$$ c_{i, j} = \sum_{k = 1}^{n} a_{i, k} b_{k, j} $$
where \(i = 1, 2,\ldots,m\) and \( j = 1, 2,\ldots,p\).
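Here’s a short sketch implementing the product element by element, exactly as the formula above describes, and checking it against NumPy’s built-in `@` operator along with the transposition rule from earlier (the matrices themselves are arbitrary examples):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])       # 2 x 3
B = np.array([[7, 8],
              [9, 10],
              [11, 12]])        # 3 x 2

m, n = A.shape
_, p = B.shape

# c_ij = sum over k of a_ik * b_kj
C = np.zeros((m, p))
for i in range(m):
    for j in range(p):
        C[i, j] = sum(A[i, k] * B[k, j] for k in range(n))

print(np.allclose(C, A @ B))              # True: matches NumPy's built-in product
print((A @ B).shape, (B @ A).shape)       # (2, 2) vs (3, 3): AB and BA differ in general
print(np.allclose((A @ B).T, B.T @ A.T))  # True: the transposition rule
```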
As an aside, GPUs are highly efficient at performing matrix operations, hence their use in Deep Learning, where the number of layers, and hence matrix calculations and manipulations, can number in the millions.
Vectors
A vector is just a special case of a matrix: specifically, a matrix with only one column. Vectors are typically denoted in lower case bold font, ie \(\mathbf v\):
$$ \mathbf v_{m} = \begin{bmatrix} v_{1} \\ v_{2} \\ \vdots \\ v_{m} \end{bmatrix} $$
When we talk about numbers in relation to vectors, such as multiplying a vector by a (real) number to get another vector, we refer to these numbers as scalars. Formally, when defining vectors and scalars, we really should discuss fields and vector spaces, but we’ll leave that for the astute reader to pursue further for mathematical completeness.
NB: when working with vectors, we typically represent them as column (vertical) vectors. To represent a vector horizontally, ie as a row, we simply transpose the vector.
An important operation for vectors, worth being aware of, is the dot product, which allows us to multiply two vectors together. It’s defined as follows:
$$ \mathbf u \cdot \mathbf v = \sum_{i = 1}^{m} u_{i} v_{i} = u_{1} v_{1} + u_{2} v_{2} + \ldots + u_{m} v_{m} $$
which is equivalent to \( \mathbf u \cdot \mathbf v = \mathbf u^{T} \mathbf v \).
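Here’s a quick sketch of the dot product (the vectors are arbitrary examples), showing that summing the element-wise products agrees with both `np.dot` and the \( \mathbf u^{T} \mathbf v \) formulation:

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

elementwise_sum = sum(u_i * v_i for u_i, v_i in zip(u, v))  # u1*v1 + u2*v2 + ...
print(elementwise_sum)   # 32.0
print(np.dot(u, v))      # 32.0
print(u.T @ v)           # 32.0 (transposing a 1-D array is a no-op, but this
                         # mirrors the u^T v notation)
```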
Now that we’ve defined the concepts of derivatives (multivariate calculus) and vectors and matrices (linear algebra), we can combine them to calculate derivatives of vectors and matrices, which is what ultimately allows us to build Deep Learning and Machine Learning models.
Vector Calculus
Earlier we defined the concept of a multivariate derivative ie the derivative of a function of more than one variable. We’ll now extend that concept to calculating the derivative of vector functions.
When calculating partials in the world of vector calculus, we need to introduce the concept of a gradient vector, and denote it by the del (or nabla) symbol, \( \nabla \).
The purpose of the gradient is to store all the partials of a function into one vector so we can use it for performing operations and calculations in vector calculus land.
For instance, if we have a function \( f(x,y) \) of two variables, then its gradient is defined as follows:
$$ \nabla f(x,y) = \left [ \frac{\partial f(x,y)}{\partial x} , \frac{\partial f(x,y)}{\partial y} \right ] $$
Note that the gradient always points in the direction of greatest increase of a function ie the direction of steepest ascent, and it is zero at a local maximum or minimum. These are important points to keep in mind when trying to understand Gradient Descent and other algorithms.
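To tie the gradient back to Gradient Descent, here’s a minimal sketch (the function \(f(x,y) = x^2 + y^2\), starting point, learning rate and iteration count are all illustrative choices): since the gradient points in the direction of steepest ascent, stepping against it moves us towards the minimum at \((0, 0)\), where the gradient is zero:

```python
import numpy as np

def grad_f(p):
    """Gradient of f(x, y) = x**2 + y**2, ie [2x, 2y]."""
    x, y = p
    return np.array([2 * x, 2 * y])

p = np.array([3.0, -4.0])   # arbitrary starting point
learning_rate = 0.1

for _ in range(100):
    p -= learning_rate * grad_f(p)   # step in the direction of steepest DESCENT

print(p)  # very close to [0, 0], the minimum, where the gradient vanishes
```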
The Chain Rule for vectors has the same structure as that for scalars:
$$ \frac{\partial}{\partial \mathbf x} \mathbf f (\mathbf g( \mathbf x )) = \frac{\partial \mathbf f}{\partial \mathbf g} \frac{\partial \mathbf g}{\partial \mathbf x} $$
Note that order is important as the product is not commutative.
Matrix Calculus
We can also extend the concept of differentiating a function to differentiating matrix functions.
First, a word on notation. Scalars are represented in lower case, \(x\), vectors in bold font, \(\mathbf v\) and matrices in bold font capitals, \(\mathbf A \).
If gradient vectors allow us to store all the partials for one function into a vector, what do we do when we have multiple functions?
The Jacobian matrix is used to store the gradient vectors for each function as rows. The Jacobian matrix, \(\mathbf J\), for functions \(f\) and \(g\), is defined as follows:
$$ \mathbf J = \begin{bmatrix} \nabla f(x,y) \\ \nabla g(x,y) \end{bmatrix} = \begin{bmatrix} \frac{\partial f(x,y)}{\partial x} & \frac{\partial f(x,y)}{\partial y} \\ \frac{\partial g(x,y)}{\partial x} & \frac{\partial g(x,y)}{\partial y} \end{bmatrix} $$
Each row of the Jacobian matrix simply contains all the partials for that function. Here we’ve used the numerator layout. The transpose of this is known as the denominator layout, so always make sure you’re consistent, and understand which layout any reference material is using.
A more general version of the Jacobian matrix is as follows, assuming \( \mathbf y = \mathbf f(\mathbf x)\) is a vector of \(m\) scalar valued functions that each take a vector \( \mathbf x \) of length \(n\):
$$ \frac{\partial \mathbf y}{\partial \mathbf x} = \begin{bmatrix} \nabla f_{1}(\mathbf x) \\ \nabla f_{2}(\mathbf x) \\ \vdots \\ \nabla f_m(\mathbf x) \end{bmatrix} = \begin{bmatrix} \frac{\partial}{\partial x_{1}} f_{1}(\mathbf x) & \frac{\partial}{\partial x_{2}} f_{1}(\mathbf x) & \cdots & \frac{\partial}{\partial x_{n}} f_{1}(\mathbf x) \\ \frac{\partial}{\partial x_{1}} f_{2}(\mathbf x) & \frac{\partial}{\partial x_{2}} f_{2}(\mathbf x) & \cdots & \frac{\partial}{\partial x_{n}} f_{2}(\mathbf x)\\ \vdots & \vdots & & \vdots \\ \frac{\partial}{\partial x_{1}} f_{m}(\mathbf x) & \frac{\partial}{\partial x_{2}} f_{m}(\mathbf x) & \cdots & \frac{\partial}{\partial x_{n}} f_{m}(\mathbf x) \end{bmatrix}$$
The Jacobian matrix contains all \(m\) gradients of \(\mathbf f\) with respect to \(\mathbf x\), stacked on top of one another, resulting in an \(m \times n \) matrix of all possible partials of \( \mathbf f \).
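Here’s a sketch of building the Jacobian numerically, column by column, for an example vector function \(\mathbf f : \mathbb{R}^{3} \to \mathbb{R}^{2}\) (the function itself is an arbitrary illustration), confirming the resulting shape is \(m \times n = 2 \times 3\) in the numerator layout:

```python
import numpy as np

def f(x):
    """Example vector function mapping R^3 -> R^2."""
    return np.array([x[0] * x[1],           # f1(x) = x1 * x2
                     x[1] ** 2 + x[2]])     # f2(x) = x2^2 + x3

def jacobian(f, x, h=1e-6):
    """Finite-difference Jacobian: row i holds the gradient of f_i wrt x."""
    fx = f(x)
    m, n = len(fx), len(x)
    J = np.zeros((m, n))
    for j in range(n):
        x_step = x.copy()
        x_step[j] += h                      # nudge one input at a time
        J[:, j] = (f(x_step) - fx) / h      # column j: partials wrt x_j
    return J

x = np.array([1.0, 2.0, 3.0])
print(jacobian(f, x))        # approximately [[2, 1, 0], [0, 4, 1]]
print(jacobian(f, x).shape)  # (2, 3), ie m x n
```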
Here is a list of Jacobian shapes, worth keeping note of, for some common derivatives:
$$ \frac{\partial \mathbf f}{\partial x} = \mathbf v, \frac{\partial f}{\partial \mathbf x} = \mathbf v^{T}, \frac{\partial \mathbf f}{\partial \mathbf x} = \mathbf M $$
Below are the all important Chain Rules for common scalar by matrix derivatives:
$$ \frac{\partial g(y)}{\partial \mathbf X} = \frac{\partial g(y)}{\partial y} \frac{\partial y}{\partial \mathbf X} $$
and
$$ \frac{\partial f (g(y))}{\partial \mathbf X} = \frac{\partial f(g)}{\partial g} \frac{\partial g(y)}{\partial y} \frac{\partial y}{\partial \mathbf X} $$
whereby the first chain rule above has been applied to the interim derivative \(\frac{\partial g(y)}{\partial \mathbf X}\).
When dealing with derivatives of scalars, vectors and matrices, here’s a list of the shapes of the expected results, with \(s\) representing a scalar, \(\mathbf v\) representing a vector and \(\mathbf M\) representing a matrix:
$$ \frac{\partial s}{\partial s} = s,\frac{\partial \mathbf v}{\partial s} = \mathbf v, \frac{\partial \mathbf M }{\partial s} = \mathbf M, \frac{\partial s}{\partial \mathbf v} = \mathbf v^{T}, \frac{\partial \mathbf v}{\partial \mathbf v} = \mathbf M, \frac{\partial s}{\partial \mathbf M} = \mathbf M$$
You should now have a tool bag ready to take with you on your journey in Machine Learning. When you next lift the lid on a model, or peek inside the inner workings of an algorithm, you will have a better understanding of the moving parts, allowing you to delve deeper and acquire more tools as you need them.
To test your understanding, fill any gaps you may still have, and learn how Matrix Calculus is used in Deep Learning specifically, see this excellent paper.