This post dissects fundamental ML concepts and explains some key principles of deep learning. It is targeted at readers who have a high-level understanding of ML but lack the fundamentals, or who need a refresher. I will assume basic math knowledge (multivariable calculus and linear algebra).

ML: Capacity, Overfitting, and Underfitting

What is ML?

Machine Learning (ML) is a form of applied statistics that leverages computing power to estimate functions.

An ML algorithm is a model that can learn from data. But what does that mean? There is a strict definition provided by Tom Mitchell:

A computer program is said to learn from experience $E$ with respect to some class of tasks $T$ and performance metric $P$, if its performance at tasks in $T$, as measured by $P$, improves with experience $E$. Let’s break these down one by one.

The Task

Tasks loosely define how an ML model should process an example. An example is a collection of quantitatively measured features that the ML model will process, often represented in vectorized form. For instance, the pixels (features) of an image (example) fed to an ML model responsible for image classification (task). For simplicity, I’ll refer to examples as “data”.

There are a breadth of combinations of tasks and data in today’s ML models.

The Performance Metric

But what about $P$? A performance measure is needed for any type of learning, so that we can tell whether the model is wrong or not. This metric usually depends on the task at hand. But regardless of what we choose, once we have a performance metric, we can quantify the accuracy of our ML model. Strictly speaking, accuracy is defined as the proportion of examples for which the model produces the correct output.

The Experience

Broadly, you can categorize an ML algorithm's learning experience $E$ into two groups: unsupervised or supervised, depending on the experience it has during the learning process. However, all experiences involve the ML algorithm experiencing a dataset of training data.

Unsupervised learning algorithms experience a dataset containing many features and learn the useful properties of this dataset on their own. In statistical terms, this means learning the probability distribution that generated the dataset. Other examples include clustering, where the algorithm learns how to divide the dataset into clusters of similar examples (e.g. grouping balls by their colours).

Supervised learning algorithms experience a dataset containing many features, where each example is also associated with a label/target. The term supervised comes from the POV that there is a “teacher” telling the machine what the correct answer is.

There is no formal definition distinguishing the two, but both are used to broadly categorize ML models.

There is a third category of experience called reinforcement learning, but I won’t dive into that because

  • I don’t know reinforcement learning
  • It’s a very niche field

Datasets and Training

Now you understand at a high level what an ML model is. But how can we ensure it can behave intelligently, as in, be able to perform well on tasks, particularly on examples it hasn’t seen before? This is a central challenge in machine learning.

This ability to do well on new data is called generalization. It is critical because, in practical use, an ML model will encounter many new examples once deployed in the real world. Therefore, it makes sense to “test” the model on data separate from what it has seen before. Akin to how you are given tests in school to determine your understanding, ML models are evaluated on a test set of data - one that they have never seen during training.

When we train an ML model, we use a particular distribution of data to train it. On every iteration of training (showing it several examples, and measuring the performance metric on the given task), we measure its training error. The equations governing this error function are used to motivate an optimization problem, where the independent variables are the model parameters.

There are many algorithms used today to solve this optimization problem. Most of them are based on an algorithm called gradient descent (more on this later).

We Need Good Data

So now we know that the optimization algorithm helps improve model performance on the training data. However, just like learning how to ride a bike, learning to ride on the road doesn’t mean you will know how to ride on mountain trails. The examples we show the model are super important.

If our training data distribution isn’t representative of what inputs our ML model will see in production, it can be completely useless. This is why rather than thinking about ML training as training error optimization, we need to think of it as a generalization error optimization problem. Commonly, generalization error is called the test error (how well it does in the real world).

We can easily measure test error from our model’s performance on test data - a distribution of data that is representative of what the ML model will see in practice. However, this poses an inherent problem. How can we expect to do well on a set of data we haven’t seen before/can’t observe during training?

In reality, if the training and test datasets were collected arbitrarily, there is nothing we can do. This is why curating good datasets is so important, where our training data’s distribution is close to our test data’s. Going forward, we’ll make some key assumptions (the i.i.d. assumptions):

  1. the examples in each dataset are independent from each other
  2. the training set and test set are identically distributed

A lot of people today have done the grunt work for us. There are nice datasets (with these assumptions upheld to the best of their ability) everywhere online (Kaggle, Hugging Face, etc.).

The Devil’s in the Details

To summarize, the big question now is “how can we improve our generalization error?” There are 2 main steps that go into improving an ML model:

  • first reducing the training error (AKA reducing bias),
  • then making the gap between training and test error small (AKA reducing variance)

The benchmark for when the training error is “sufficiently small” is often referred to as the Bayes error. It’s the lowest possible prediction error that can be achieved and is irreducible. Often, we substitute human performance as a proxy for this measure. If the model does better than an average human, then it’s a good model.

You can of course have a combination of both high/low bias and variance. Nonetheless, these two factors correspond to the most central problems in ML: underfitting and overfitting.

Underfitting is when the model is unable to reduce the training error (bias) sufficiently. Overfitting occurs when the gap between training and test error (variance) is large.

Fortunately, we can engineer many mitigations to make these problems less likely to occur.

Note to reader: From now on, I will be speaking strictly about these ML concepts in the context of deep learning, and deep neural networks. This is because it’s the area most relevant to my own research and to the latest innovations across the entire ML field today (ChatGPT, etc.).

Quick Review

Before we continue with these techniques, it’s a good idea to refresh the concept of a NN, and understand the notation I’ll be using for them.

Neural Networks

Single Node/Perceptron in a NN (Image source: Ananda Hange).

Neural networks are composed of layers of hidden units (aka nodes or perceptrons). Each node takes in values from the previous layer and forms a linear combination of its inputs; the output is governed by a set of operations (explained in the next section). This structure was motivated by the design of a neuron (it is a very oversimplified depiction of a neuron).

Mathematical Notation for Neural Networks

For each $i^{th}$ node in layer $l$:

$$ \begin{align} z_i ^{[l]} &= w_i ^{[l]T} x + b_i ^{[l]} \newline a_i ^{[l]} &= g(z_i^{[l]}) \end{align} $$

where $a_i ^{[l]}$ is the output of the $i^{th}$ node of layer $l$ (sometimes called a hidden layer), $w_i ^{[l]}$ is the weight vector describing the linear combination, $b_i ^{[l]}$ is a bias constant added to the linear combination, and $x$ is the vector of outputs from the previous layer’s nodes. $g$ is an activation function, which is used to restrict the output values of each layer. This provides the model stability as you propagate into deeper layers.

In modern computing, these computations are usually vectorized to make them faster. This is because computing matrix products is faster than doing scalar multiplications with nested for loops. We have numpy to thank for that. So, in each layer, you get something like this:

$$ \begin{bmatrix} \cdots & w_1 ^{[l]T} & \cdots \newline \cdots & w_2 ^{[l]T} & \cdots \newline \cdots & \vdots & \cdots \newline \cdots & w_{k} ^{[l]T} &\cdots \end{bmatrix} \cdot \begin{bmatrix} x_1 \newline x_2 \newline \vdots \newline x_{n_x} \end{bmatrix} + \begin{bmatrix} b_1 \newline b_2 \newline \vdots\newline b_{k} \end{bmatrix} = z^{[l]} \newline $$

We denote the first matrix in the above equation as the weights matrix $W^{[l]}$ for that layer $l$. If we vectorize across multiple examples, each layer will get:

$$ W^{[l]} \cdot \begin{bmatrix} \vdots & \vdots & \newline x^{(1)} & x^{(2)} & \cdots \newline \vdots & \vdots & \end{bmatrix} + b^{[l]} = Z^{[l]} \newline $$

Notes:

  • $b^{[l]}$ is added through broadcasting.
  • $x^{(1)} \cdots$ are the vectorized outputs of layer $l-1$ for examples 1, 2, 3, … etc.
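
To make the vectorized form concrete, here’s a minimal numpy sketch of a single layer’s forward pass (the function name and the toy shapes are my own, purely for illustration):

import numpy as np

def layer_forward(W, b, A_prev, g):
    # W: (k, n_prev), b: (k, 1), A_prev: (n_prev, m) -- m examples stacked as columns
    Z = W @ A_prev + b   # b is broadcast across the m columns
    A = g(Z)             # activation applied elementwise
    return Z, A

# toy usage: a layer with 4 units, 3 inputs, and a batch of 5 examples
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3)) * 0.01
b = np.zeros((4, 1))
A_prev = rng.standard_normal((3, 5))
Z, A = layer_forward(W, b, A_prev, np.tanh)   # Z and A have shape (4, 5)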

Over $m$ examples during training, our cost function (the performance metric) $J$ can be written as:

$$ J(W, b) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) $$

where $L$ is our loss function, $\hat{y}^{(i)}$ is our model’s prediction, $y^{(i)}$ is the ground truth, and $W, b$ are our parameters (weights and biases).

I’ve now introduced the idea of a neural network. Let’s look at things in more detail.

Activation and Loss Functions

Activations

$$ a_i ^{[l]} = g(z_i^{[l]}) $$

There are a variety of activation functions used today, but their purpose remains the same: to stabilize the output values of each node in a hidden layer. We want stability in our numbers because, as you’ll see later on in gradient descent, we want our values to avoid extremes (i.e. avoiding $\rightarrow \infty$ or $0$). Otherwise, through long chains of multiplication, we can easily lose stability in our model’s performance. The following are the most popular activation functions:

Sigmoid

$$ g = \frac{1}{1+e^{-z}} $$

tanh

$$ g = \frac{e^z - e^{-z}}{e^z + e^{-z}} $$

Notes:

  • tanh usually works better for hidden layers (because of more symmetric outputs - allowing negatives)
  • sigmoid is useful for the output layer
  • both have small gradients for large $|z|$, which can make learning slow

To overcome this, researchers came up with ReLU, the Rectified Linear Unit

$$ g = \max(0, z) $$

This allows larger inputs to yield larger outputs, which greatly helps speed up training.
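
To make the formulas concrete, here’s a minimal numpy sketch of the three activations (numpy already ships tanh, so that one is just a wrapper):

import numpy as np

def sigmoid(z):
    # squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # squashes any real number into (-1, 1); symmetric around 0
    return np.tanh(z)

def relu(z):
    # passes positive values through unchanged, zeroes out negatives
    return np.maximum(0.0, z)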

Graphs of mentioned Activations (Image source: ResearchGate).

One thing you might have noticed is that none of these functions is linear everywhere. The reason comes from how layers compose when stacked. Let’s illustrate this by looking at the operations for 2 consecutive layers (e.g. layers 1 and 2; I’ve also simplified some notation):

$$ \begin{align*} z_i ^{[1]} &= w_i ^{[1]} x + b_i ^{[1]} \newline a_i ^{[1]} &= g(z_i^{[1]}) \newline z_i ^{[2]} &= w_i ^{[2]} a^{[1]} + b_i ^{[2]} \newline a_i ^{[2]} &= g(z_i^{[2]}) \end{align*} $$

If we expand the last line:

$$ \begin{align*} a_i ^{[2]} &= g(z_i^{[2]}) \newline &= g(w_i ^{[2]} a^{[1]} + b_i ^{[2]}) \newline &= g(w_i ^{[2]} g(z_i^{[1]}) + b_i ^{[2]}) \newline &= g(w_i ^{[2]} g(w_i ^{[1]} x + b_i ^{[1]}) + b_i ^{[2]}) \end{align*} $$

We can see that if $g$ were linear, the second hidden layer would be useless, because the composition of 2 linear functions is still linear. In other words, we would not increase the expressivity of our model at all. This is like trying to build a Fourier series representation of a signal with only one trig function.
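
Here’s a tiny numpy check of that claim: two stacked layers with a linear (identity) activation collapse into a single equivalent linear layer (the shapes are arbitrary toy values):

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 1))
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal((2, 1))

# two "layers" with g = identity
a2 = W2 @ (W1 @ x + b1) + b2

# one equivalent linear layer: no extra expressivity was gained
W_eq, b_eq = W2 @ W1, W2 @ b1 + b2
assert np.allclose(a2, W_eq @ x + b_eq)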

Losses

There actually isn’t much I can say here without diving into application-specific losses. Losses usually differ a lot depending on the context in which they’re used. For binary classification, it makes sense to use a loss function that yields a big loss when the model is confidently wrong and a small one when it is right, with very little in between. For things like image recognition, you might want a loss function with greater expressiveness - one that captures structural similarities, colour differences, etc. However, regardless of the application, the loss function used in the cost function must be differentiable. Otherwise, we cannot run any of our optimization algorithms (since we cannot compute derivatives).
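
As one concrete example (my choice, not the only option), here’s binary cross-entropy, a differentiable loss commonly used for binary classification, which behaves exactly as described: large when confidently wrong, small when right.

import numpy as np

def binary_cross_entropy(y_hat, y, eps=1e-12):
    # y_hat: predicted probabilities in (0, 1); y: ground-truth labels in {0, 1}
    y_hat = np.clip(y_hat, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(binary_cross_entropy(np.array([0.9, 0.1]), np.array([1, 0])))   # small: confident and right
print(binary_cross_entropy(np.array([0.1, 0.9]), np.array([1, 0])))   # large: confident and wrong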

Optimization Algorithms

Optimization algorithms are what we use to make the model “learn”. They describe how the model “understands” its mistakes during the learning experience.

I use a lot of personification here - it’s simply for illustrative purposes and not meant literally. The mechanisms behind human learning are vastly more complex.

Gradient Descent: the Holy Grail

This is your bread and butter. Recall our cost function over m examples:

$$ J(W, b) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) $$

We want to minimize $J(W,b)$, which we can do by adjusting the values of our parameters $W, b$. We can do this optimally by using the gradient of $J$ with respect to each parameter. In particular, after every calculation of the cost, we use the following update rule, also known as the process of gradient descent (GD). At every iteration:

$$ \text{Update} = \begin{cases} W := W - \alpha \cdot \frac{\partial J(W,b)}{\partial W} & \ \newline b := b - \alpha \cdot \frac{\partial J(W,b)}{\partial b} & \end{cases} $$

where $\alpha$ is the learning rate (LR) hyperparameter. Ok, I just used a bunch of new words and ideas, so let’s explain them:

  • Hyperparameters are numeric values determined BEFORE TRAINING by the user. You can almost think of it as character customization in a video game (deciding the character hair colour, height, etc.), but for ML models. These values are not determined by the ML model while training. The method of finding optimal hyperparameters is a detailed one, something I’ll address in another post in the future.
  • The computation of the partial derivatives often involves a long chain rule computation. When these chains get long for large models,
    • computation gets expensive - we look for ways to reduce computation time. The way we do this is by vectorizing things.
    • gradients can easily vanish or explode in value due to recursive multiplication (e.g. $0.1 \cdot 0.1 \cdots 0.1 = 10^{-n}$ for large $n$). This is known as the exploding/vanishing gradient problem (EVGP). Again, another topic to be explored another day.

Some more terminology:

  • An iteration is one update step: the model “experiences” (learns from) one batch of the data it was given and updates its parameters once. A good analogy is feeding a baby: 1 iteration is the baby eating and swallowing one spoonful of food.
  • An epoch is a word used to describe the number of times our model has “experienced” (learned from) all our examples in the training data. In the baby example, this is the number of times our baby has eaten all of its meal.

I’m going to use the term “learning” instead of “experiencing” from now on, as it’s more intuitive and common in the literature.

Over many epochs through all $m$ examples, your parameters will eventually converge to a minimum of the cost function, in turn minimizing training set error (bias). Vanilla GD is also often called Batch GD, since every update requires a full “batch” of training data.
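
Here’s a minimal sketch of batch GD on a toy least-squares problem (a linear model with no bias term and made-up data - my own simplifications) just to show the update rule in action; every update touches all $m$ examples:

import numpy as np

rng = np.random.default_rng(0)
m = 200
X = rng.standard_normal((m, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(m)

W = np.zeros(3)
alpha = 0.1                               # learning rate
for epoch in range(100):
    y_hat = X @ W
    dW = (2.0 / m) * X.T @ (y_hat - y)    # gradient of the mean squared error
    W -= alpha * dW                       # the GD update rule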

Visualization of Batch GD (Image source: Panagiotis Antoniadis).

Stochastic Gradient Descent

Stochastic simply means there is a randomness involved.

In both gradient descent (GD) and stochastic gradient descent (SGD), you still use the same update rule iteratively. Your objectives are the same (minimizing the cost function).

The difference is: vanilla GD requires you to run through ALL $m$ samples in your training set to do a single update of the parameters. In SGD, you use only one example, or a randomly sampled subset of the training set, to update the parameters.

The advantage of SGD shows up when you’re dealing with a very large number of training examples. Using vanilla GD on such datasets takes a long time, since it must read through all the data for one iterative update. SGD is faster because it uses one training example at a time and starts improving the parameters right away from the first example.

Visualization of Stochastic GD (Image source: Panagiotis Antoniadis).

SGD often converges much faster, and more chaotically, compared to GD. The cost function also often struggles to converge nicely. Due to its random nature, SGD almost never settles on a fixed minimum, and instead ends up oscillating in a neighbourhood of “good enough” parameter values.

Let’s revisit what I said earlier: only one or a randomly sampled subset. If you choose a subset of size > 1, the process is called Mini-batch Stochastic Gradient Descent. Most people just call this Mini-batch Gradient Descent, since most samplers in popular ML libraries (e.g. PyTorch) are inherently random samplers. But how big should we make our mini-batches?

Choosing Mini-batch Size

Visualization of Mini-batch GD (Image source: Panagiotis Antoniadis).

In practice, batch size is between 1 and $m$. It is itself another hyperparameter! Towards 1, you lose the processing speed gained from vectorized math operations (at that point you are just doing scalar arithmetic). Towards $m$, each update takes a long time.

Andrew Ng, the Coursera Deep Learning instructor, recommends the following guideline:

  • If your training set is small, use Batch GD ($m \leq 2000$)
  • Typical mini-batch size: 64, 128, 256, 512. The key thing is to make sure your mini-batches can fit in system memory.
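
Here’s a minimal sketch of how mini-batches are typically formed (shuffle once, then slice); the helper name and sizes are my own:

import numpy as np

def minibatches(X, Y, batch_size=128, seed=0):
    # shuffle the m examples, then yield them in chunks of batch_size
    m = X.shape[0]
    idx = np.random.default_rng(seed).permutation(m)
    for start in range(0, m, batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], Y[batch]

# each (X_mb, Y_mb) pair would drive one gradient update:
# for X_mb, Y_mb in minibatches(X_train, Y_train, batch_size=256): ...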

Learning Rate Decay

A quick optimization trick you can add to GD is to add a learning rate decay. As we converge closer to the cost function minima during training, a common problem is that we end up oscillating in a neighbourhood of values near the minima. We can try mitigating this (essentially reducing the neighbourhood size) by decreasing the learning rate as we get closer to the minima.

We can do this by setting our learning rate $\alpha$ to be a function of our epochs. The more epochs we pass through, the lower our $\alpha$ should be. Specifically:

$$ \begin{align} \alpha &= \frac{1}{1 + \zeta \cdot \gamma} \alpha_0 \end{align} $$

where $\zeta$ is a fixed decay-rate, $\gamma$ is our epoch count, and $\alpha_0$ is our initial learning rate.

There are many ways to theoretically do this:

$$ \begin{align} \alpha &= \zeta^{\gamma} \alpha_0 \newline \alpha &= \frac{k}{\sqrt{\gamma}} \alpha_0 \end{align} $$

or even some discrete piecewise staircase functions.

We could in fact also do this manually. After observing your model training performance over time, you could choose to pause training, change the learning rate, and continue training.
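
Here’s the first schedule as a tiny Python sketch (the helper name and example values are my own):

def decayed_lr(alpha_0, decay_rate, epoch):
    # the 1 / (1 + decay_rate * epoch) schedule from above
    return alpha_0 / (1.0 + decay_rate * epoch)

for epoch in range(5):
    print(epoch, decayed_lr(0.1, 1.0, epoch))
# prints 0.1, 0.05, 0.0333..., 0.025, 0.02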

The Exponentially Weighted Average

This is an underrated algorithm that’s quite important. It underpins optimizers that converge faster than vanilla gradient descent, and it’s used a lot on time series data, or any data that forms a sequence. Its simplest form is called the Exponentially Weighted Average (EWA) in statistics.

The key equation is as follows:

$$ v_t = \beta v_{t-1} + (1-\beta)\theta_t $$

where $v_t$ is the average of the unit you are measuring at time $t$, $\beta$ is a hyperparameter, and $\theta_t$ is the actual unit value at time $t$. For example, $\theta_t$ could be describing the daily temperature at time $t$, and $v_t$ describing the moving average of temperature. You could think of $v_t$ as the approximate average over $\frac{1}{1-\beta}$ days.

Continuing with the temperature example, here’s a scatter plot (blue) of daily temperature over a year.

Visualizations of EWA (Image source: DeepLearningAI).

The red line is the graph of $v_t$ when $\beta = 0.9$, which translates to roughly the average of the last 10 days’ temperature. The green line is when $\beta = 0.98$, which translates to roughly the average of the last 50 days’. This makes sense - the green line appears to be “lagging”, and that’s because it’s averaging over a larger window of temperatures, which makes it adapt more slowly to changes. If $\beta$ were smaller, e.g. 0.5, you would see a really noisy graph.

Intuition

If you break down what the moving average equation at, e.g., $t=10$ days actually looks like, you’ll see:

$$ \begin{align} v_{10} &= (1-\beta) \theta_{10} + \beta v_9 \newline &= (1-\beta) \theta_{10} + \beta( (1-\beta)\theta_9 + \beta v_8) \newline &= (1-\beta) \theta_{10} + (1-\beta) \beta \theta_9 + (1-\beta)\beta^2 \theta_8 + (1-\beta) \beta^3 \theta_7 + \cdots \end{align} $$

If you factor out the $(1-\beta)$, you are left with a sum which is just the dot product of $[\theta_{10}, \theta_9, \dots , \theta_1]$ and $[1, \beta, \beta^2, \dots, \beta^9]$.

The former is simply the dataset of what we are measuring over time (the log of daily temperatures). But what’s $[1, \beta, \beta^2, \dots, \beta^9]$? These weights decay geometrically - that is, exponentially fast - which is where the “exponential” in Exponentially Weighted Average comes from.

Implementation

The pseudo algorithm runs like this:

Vt = 0
Repeat{
  Get next theta_t
  Vt := beta * Vt + (1 - beta) * theta_t
}

Notice how we don’t need more memory with each iteration, and that each update only involves one line of arithmetic. This is why it’s so efficient.

Note: why is $v_t$ initialized as 0? This is in fact not good practice. If you initialize at zero, the beginning of your moving average isn’t actually a good fit for your data (because there are no previous elements to average with)! So, we do something called bias correction: we divide $v_t$ by $1-\beta^t$. Updated equation:

$$ v_t = \frac{\beta v_{t-1} + (1-\beta)\theta_t}{1-\beta^t} $$

This scales up the predicted $v_t$ at small values of $t$ so that it doesn’t trail near 0. For sufficiently large $t$, the denominator becomes essentially 1 (since $\beta < 1$, so $\beta^t \rightarrow 0$), keeping it consistent with what we had before.
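
Putting the recurrence and the bias correction together, here’s a minimal numpy sketch (the function name and toy temperatures are my own):

import numpy as np

def ewa(thetas, beta=0.9, bias_correct=True):
    # exponentially weighted average of a sequence, with optional bias correction
    v, out = 0.0, []
    for t, theta in enumerate(thetas, start=1):
        v = beta * v + (1 - beta) * theta
        out.append(v / (1 - beta ** t) if bias_correct else v)
    return np.array(out)

temps = np.array([10.0, 12.0, 11.0, 15.0, 14.0])
print(ewa(temps))                       # tracks the data even at small t
print(ewa(temps, bias_correct=False))   # early values trail towards 0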

Momentum

Momentum is a technique added on top of GD, leveraging EWA to make things better. It computes an exponentially weighted average of the gradients, and uses that average to update the parameters in training. It almost always works faster than the standard GD.

The pseudocode for momentum is the following:

v_dW = 0, v_db = 0

On iteration t:
    Compute dW, db on current mini-batch  # gradients wrt. W and b
    v_dW = beta * v_dW + (1 - beta) * dW  # compute exponentially weighted averages
    v_db = beta * v_db + (1 - beta) * db  #         for W and b

    W := W - LR * v_dW                    # update with learning rate and the EWAs of dW and db
    b := b - LR * v_db

This essentially smooths out the learning steps of GD. You can also think of it in a physical sense: let’s pretend the cost function can be visualized as a sheet, and our error minimization is the journey of a ball rolling to the bottom. As the ball gains “momentum”, it is less disturbed on the way down by external forces (new gradients), keeping it “on track” to reach the bottom.

Specifically, it damps oscillatory behaviour during GD by combining gradients with opposite signs. It builds up speed in directions with a gentle but consistent gradient.

Specifics

In practice, we have two hyperparameters: learning rate LR and beta.

What about bias correction? In practice, we don’t actually use it. This is because most of the time, with sufficient data, after several examples our moving average will have attained the value it should be around.

I also wanted to touch on specific behaviours of momentum. If the error surface resembles a smooth tilted plane, you can imagine momentum allowing the “ball” (AKA our current position on $J(W,b)$) to reach terminal velocity. This allows it to be much faster than GD.

If momentum fascinates you, there’s further reading you can do. Countless improvements to momentum have been proposed over the years. One of them is the Nesterov method. Vanilla momentum first computes the gradient and then takes a jump in the direction of the accumulated gradient. The Nesterov method first makes the jump in the direction of the previously accumulated gradient, and then measures the gradient where it ends up to make a correction. This is based on the principle that it’s “better to correct a mistake after you make it, than before it happened”.

RMSprop

RMSprop stands for Root Mean Square Propagation. It is really similar to momentum. The motivation for RMSprop is, similarly, to prevent oscillations during GD. During GD, we may get oscillations on the way down to minimizing the cost function. This is akin to zig-zagging down a ramp rather than taking a direct flight of stairs down the hill. This wastes time, so we want our algorithm to penalize this perpendicular, oscillating behaviour. The algorithm is as follows:

S_dW = 0, S_db = 0

On iteration t:
    Compute dW, db on current mini-batch    # gradients wrt. W and b
    S_dW = beta * S_dW + (1 - beta) * dW^2  # compute exponentially weighted averages
    S_db = beta * S_db + (1 - beta) * db^2  #         for W and b

    W := W - LR * dW/(sqrt(S_dW) + epsilon)  # update scaled by the RMS of the recent gradients
    b := b - LR * db/(sqrt(S_db) + epsilon)

Explaining the code:

  • dW^2 is an elementwise squaring operation on the dW matrix. Same with db^2.
  • S_dW and S_db are the moving averages of the squared gradients (I used another variable S to differentiate from momentum’s v).
  • the very small, manually set epsilon term prevents us from ever dividing by 0 (a reasonable default is $10^{-8}$)

We want our learning to avoid being trapped in oscillations (high-magnitude gradients in unwanted directions). Let’s put some of these variables into context in this scenario: some elements in $dW$ are large, so those elements of $dW^2$ will be relatively larger than all the others. We can omit $db$ for now, as the same principles apply.

To penalize this, the updates in the large, unwanted directions are divided by a much larger number in the RMS denominator. This keeps the update/movement in the unwanted direction smaller, creating a dampening effect on the oscillatory behaviour. For the wanted directions, where the gradients are smaller, the RMS denominator is smaller, which scales up the movement towards the minimum.

The Differences with Momentum

The key here is that the learning rate for RMSprop is adaptive.

In hindsight, this part could be way longer than it is. It’s hard to explain the key difference without covering adaptive gradient algorithms (AdaGrad) in detail - an entire subset of ML that deserves its own treatment. If I ever revisit this (I’m sure I will), I will make sure to add it.

We have to first start with the flaw of momentum.

Hypothetical Loss Surface (Image source: D2L AI (not https)).

Say the red circle is our starting point. We have a very large gradient (orange) in the x2 direction and very little in x1. Note where the global minimum lies: towards x1.

Here x1 and x2 are two arbitrary dimensions; in effect they represent different dimensions of the weight matrices.

In momentum, we accumulate the gradients, with more recent gradients heavily favoured - this makes our update step point mostly in the x2 direction. This means we achieve very little in terms of getting close to the minimum. At some point, the gradient in that direction will flip and push us back the other way (creating perpetual oscillatory cycles).

AdaGrad was introduced to:

  • manage gradients for each coordinate separately, and
  • add a scaling factor in the denominator to act as a brake. The brake scaling was based on the square of the previous gradients.

Now x2 will have a large brake, meaning it won’t move so fast with momentum in the x2 direction. But keep in mind, our gradients in the direction of x1 were already small. Which means, there was a boost to movement towards x1.

This seems perfect - it helps the little guy and takes away from the big guy. But the underlying issue with AdaGrad was that it aggregated all the past gradients into the scaling factor. This is best seen through the AdaGrad equations:

$$ \begin{align} G_t &= \sum_{\tau=1}^{t} dW_\tau \odot dW_\tau \newline W &:= W - \frac{\nu}{\sqrt{G_t + \epsilon}} \odot dW_t \end{align} $$

The first equation describes the scaling brake ($\odot$ denotes elementwise multiplication, so $G_t$ accumulates the squares of past gradients). The second describes the adaptive update rule, with learning rate $\nu$.

Because $G_t$ is cumulative, the brake eventually scales up in every direction after a large number of iterations - every new gradient computed on every mini-batch increases $G_t$. If this occurred before we reached the minimum, learning would become super slow, almost impossible in any direction.

To prevent that, RMSprop made the aggregation leaky (i.e. it brought back the recency bias of the EWA seen in momentum). This stops the aggregate from growing without bound - it stays roughly constant instead - in turn allowing the model to keep learning on later examples.

Adam

I could go on in this list with dozens of optimization algorithms, but this will be the last one.

Adaptive Moment Estimation (Adam) is a similar method to the ones seen previously. It similarly has an adaptive learning rate. In fact, it almost looks identical to a merged momentum and RMSprop. Here’s the algorithm:

V_dW = 0, V_db = 0, S_dW = 0, S_db = 0

On iteration t:
    Compute dW, db on current mini-batch      # gradients wrt. W and b
    V_dW = beta1 * V_dW + (1 - beta1) * dW    # momentum-style EWA of the gradients
    V_db = beta1 * V_db + (1 - beta1) * db
    S_dW = beta2 * S_dW + (1 - beta2) * dW^2  # RMSprop-style EWA of the squared gradients
    S_db = beta2 * S_db + (1 - beta2) * db^2

    # In Adam, you DO perform bias correction
    V_dW_c = V_dW/(1-beta1^t)
    V_db_c = V_db/(1-beta1^t)
    S_dW_c = S_dW/(1-beta2^t)
    S_db_c = S_db/(1-beta2^t)

    # Update rules
    W := W - LR * V_dW_c/(sqrt(S_dW_c) + epsilon)
    b := b - LR * V_db_c/(sqrt(S_db_c) + epsilon)

Recommended Hyperparameters (from authors of paper):

  • LR: needs to be tuned
  • $\beta_1$: $0.9$
  • $\beta_2$: $0.999$
  • $\epsilon$: $10^{-8}$

Unlike a lot of optimizations that come and go, Adam was able to work well in practice and compare favourably to other algorithms for many different tasks. It’s one of the most popular optimization algorithms today.
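
In practice you rarely implement Adam yourself; in a framework like PyTorch it is one line. A minimal sketch, where `model`, `loss_fn`, and `loader` are hypothetical placeholders for your own network, loss, and data loader:

import torch

# `model`, `loss_fn`, and `loader` are placeholders for your own nn.Module, loss, and DataLoader
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8)

for X_mb, y_mb in loader:
    optimizer.zero_grad()                 # clear old gradients
    loss = loss_fn(model(X_mb), y_mb)
    loss.backward()                       # backprop: compute dW, db for every layer
    optimizer.step()                      # one Adam update (bias correction included)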

That’s it for optimization algorithms. If you want to learn more you can check out this blog here.

Reducing Bias

A reminder: bias is our training set error.

Simply Training for Longer

If you are building a large ML model (e.g. an image classifier for multiple objects), and you observe a high bias early in your training, it might be a good idea to just let the training run for more epochs on your data. You can never really predict how long a model might take to “converge” to a minimum: it can take minutes, days, or even months! Sometimes it’s worth waiting a bit.

Choosing a Good Hypothesis Space

A great example to illustrate this is regression on a set of datapoints. A higher-degree polynomial can fit the points better than a linear approximation, because a polynomial model has a higher capacity to express a wider variety of functions (the basis is larger). Abstracting away from the example, this makes sense: ML training is a process of finding a function that maps inputs to outputs well. With limited tools (i.e. a small basis), it cannot produce complex models to fit complex data. However, the problem can easily be flipped on its head: models with high capacity can easily overfit the data by memorizing irrelevant properties - outliers, noise, etc.

Visualizations of under, good, and overfitting. (Image source: Deep Learning Book).

This is however easier said than done. In practice, methods you can use (specifically in deep learning models) are to:

  • increase the size of your model (adding more layers, increasing layer size)
  • use other architectures (i.e. choosing a better model for your task with better priors/architectural improvements)

Priors

Put simply, priors are initial beliefs about an event, expressed as a probability distribution. They can be used to improve model architectures, biasing your ML model to “learn” in a way that is beneficial. They are also often called inductive biases, and are often designed to exploit assumptions about symmetry. If you’re familiar with some state-of-the-art architectures: transformers are built assuming the data is equivariant to permutations, and ConvNets are built on the prior that there is an equivariance to translations.

Ok, so we have ways where we can increase the hypothesis space of our model (e.g. adding higher degree polynomials to our regression model). We can also give a learning algorithm a preference for one solution over another within its hypothesis space. The way we do this is through regularization.

Reducing Variance

A reminder: variance is when our test set error is higher than our training set error. This is called overfitting. But why does this happen?

The training data contains information about the regularities in the mapping from input to output. But it also contains sampling errors. Even if we try our best to match the test data distribution, there will be accidental regularities just because of the particular training examples that were chosen! As the model trains, it cannot tell which regularities are real and which are caused by sampling error, so it learns both. And if we’re dealing with large models, they can pick up this sampling error really well.

There are several approaches to prevent overfitting. The most common practice to reduce overfitting is to use regularization.

More Data

Simply expanding the training set with more data that represents the test set can help the model improve. An example of this could be when creating a facial recognition system. If our training dataset only contains images of light-skinned faces, our model will have a problem in practice identifying dark-skinned faces. We can improve our model by expanding our training dataset to include dark-skinned faces.

Data Augmentation

Sometimes data is hard to come by, especially when building custom ML models without pre-defined datasets. There are several techniques to create more data from existing data.

Let’s say you are building a model to detect cat images. If you only have 100 pictures of cats, you can perform augmentations to synthesize more copies of your 100 images: rotating each image, cropping smaller portions of each image, imposing distortions, etc. If you apply any combination of these to your 100 images, you can easily generate thousands of synthesized training images.
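
Here’s a minimal numpy sketch of the idea, applying a couple of random transforms to an image array (the function, probabilities, and crop sizes are my own assumptions; real pipelines usually use a library such as torchvision):

import numpy as np

def augment(image, rng):
    # image: (H, W, C) array; returns a randomly transformed copy
    out = image
    if rng.random() < 0.5:
        out = np.fliplr(out)                             # horizontal flip
    if rng.random() < 0.5:
        out = np.rot90(out, k=int(rng.integers(1, 4)))   # random 90/180/270 degree rotation
    h, w = out.shape[:2]
    top = int(rng.integers(0, max(1, h // 10)))
    left = int(rng.integers(0, max(1, w // 10)))
    return out[top:, left:]                              # crop away a small random border

rng = np.random.default_rng(0)
fake_cat = rng.random((64, 64, 3))
augmented = [augment(fake_cat, rng) for _ in range(10)]  # 10 new variants of one image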

Choosing the Right Capacity

I mentioned that choosing a good hypothesis space is important to ensure your model is capable of learning the data. However, it’s also a good idea to limit your model capacity such that:

  • it is enough to fit true regularities
  • it is not enough to also fit spurious regularities

This capacity can be controlled in several ways:

  • Architecture: limiting the number of layers and/or the number of units in each layer
  • Early stopping: stopping the learning before it overfits
  • Weight decay: this penalizes large weights using penalty terms in the cost function
  • Noise: adding noise to the weights (not explained in this post, too complex)

Choosing Capacity Controlling Meta-parameters

We want to maximize test set performance as we find the ideal meta-parameters - but keep in mind we don’t want to necessarily pick the meta-parameters that yield the best performance on our test set. This is because the settings that work best on one test set are unlikely to work as well on another test set.

So what do we do? We introduce the idea of cross-validation.

Cross-validation

There’s two main ways to implement this:

  • Creating a validation dataset
    • Instead of dividing our data into training and test data, we also include a third set called the validation set.
    • Validation data is not used for learning, but is used for deciding what settings of these hyperparameters work best.
  • Dividing the total dataset into $N$ partitions: one test set and $N-1$ other subsets, and training on all of those $N-1$ sets. This repeats $N$ times, where each partition becomes the test set once, yielding $N$ different estimates of the validation error rate.
    • This method is called N-fold cross-validation (or k-fold).

The cross-validation set is often referred to as the dev set.
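
Here’s a minimal sketch of the N-fold split itself (the helper name is my own; in practice libraries like scikit-learn provide this):

import numpy as np

def kfold_indices(m, n_folds=5, seed=0):
    # yields (train_idx, held_out_idx) pairs, one per fold
    idx = np.random.default_rng(seed).permutation(m)
    folds = np.array_split(idx, n_folds)
    for i in range(n_folds):
        held_out = folds[i]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train, held_out

# train and evaluate once per fold, then average the n_folds error estimates
for train_idx, held_out_idx in kfold_indices(m=100, n_folds=5):
    pass  # fit on train_idx, evaluate on held_out_idx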

Early Stopping

If we have lots of data with a big model, it’s economically expensive to keep re-training to find an optimal set of meta and hyperparameters.

So, what we can do is plot both the dev and train errors over training time. We can stop training when the dev error starts rising. This avoids overfitting, but might leave $J$ under-optimized if we stop too early.
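
A minimal sketch of that rule, assuming hypothetical `train_one_epoch()` and `dev_error()` helpers and a small patience window:

def train_with_early_stopping(train_one_epoch, dev_error, max_epochs=100, patience=5):
    # stop once dev error hasn't improved for `patience` consecutive epochs
    best_err, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch()
        err = dev_error()
        if err < best_err:
            best_err, best_epoch = err, epoch   # ideally also checkpoint the weights here
        elif epoch - best_epoch >= patience:
            break                               # dev error has stopped improving
    return best_err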

Early Stopping. (Image source: Ramazan Gençay).

Limiting Weight Size

I’ll introduce the equations first for each method, and then conduct a detailed analysis.

Recall our cost function (our average loss over $m$ training examples):

$$ J(W,b) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) $$

Regularization involves adding a regularization term for $W$ ($\lambda$ = regularization parameter), whose purpose is to penalize extreme values of the weights.

Weight Decay (L2 Regularization)

In L2 regularization, we add the square of the weight matrices to our loss function.

$$ J(W,b) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \lVert W \rVert^2 $$

where $\lVert W \rVert^2 = \sum_{j=1}^{n_x} w_j ^2 = W^T W$.

A few observations about our penalty term $\frac{\lambda}{2m} \lVert W \rVert^2$:

  • Why don’t we also regularize $b$? In practice, $W$ is usually a very high-dimensional parameter, whereas $b$ is low-dimensional. This means the effect of regularizing $b$ is negligible if we also regularize $W$ - penalizing $W$ has a far greater effect on the overall model.
  • We included a $\frac{\lambda}{2m}$:
    • $\lambda$ is our L2-regularization hyperparameter
    • the 2 in the denominator is there for mathematical convenience - when we take the gradient of the cost function, it cancels with the exponent 2 from the squared-weight term (power rule!)

What does this do?

When we break down the math during gradient descent (slightly simplified notation):

$$ \frac{\partial J}{\partial W} = \frac{\partial L}{\partial W} + \lambda W $$

we see that the penalty adds a term linear in $W$ to the gradient, reflecting the quadratic relationship between $W$ and the cost $J$.
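
In code, this just means adding one term to the cost and one to the gradient. A sketch using the $\frac{\lambda}{2m}$ convention from above, where `L_avg` and `dL_dW` are the unregularized cost and gradient, assumed computed elsewhere:

import numpy as np

def l2_regularize(W, L_avg, dL_dW, lam, m):
    # add the L2 penalty to an already-computed cost and gradient
    cost = L_avg + (lam / (2 * m)) * np.sum(W ** 2)
    dW = dL_dW + (lam / m) * W        # the extra (lam/m) * W term is the weight decay
    return cost, dW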

Visualization of Cost to Weight graph. (Image source: Geoffrey Hinton).

This keeps the weights small unless they have big error derivatives.

If you abstract this idea into linear algebra:

Visualization of Weight Space. (Image source: Deep Learning Book).

The solid ellipses represent contours of equal value of the unregularized cost function $J$. The dotted circles represent contours of equal value of the regularization term. At $\tilde w$, these two competing costs reach an equilibrium.

The following math requires a bit of advanced linear algebra knowledge.

Along the horizontal dimension, the eigenvalue of the Hessian (the equivalent of a second derivative) is small, meaning the cost function doesn’t increase much when moving horizontally from $w^{*}$. Because of this, the regularizer has a much bigger effect in this dimension, pulling (visually compressing) the contours in this direction much more. On the other hand, the cost function is already very sensitive to movements in the vertical dimension. Its corresponding eigenvalue is large, indicating high curvature (a large second derivative). Thus, the weight decay doesn’t affect the vertical dimension as much.

In practice, this prevents the model from using weights that aren’t needed. This helps improve generalization performance a lot, since it stops the model from fitting sampling errors. It also makes the model “smoother” (the output changes more slowly as the input changes). However, these obsolete weights are never reduced to zero - rather just kept as very small values. This makes L2 regularization yield a non-sparse solution. Moreover, if the network has two similar inputs, it prefers to put half the weight on each rather than all the weight on one.

L1 Regularization

Similarly to L2, L1 regularization involves using the L1 penalty term, making our loss function look like this:

$$ J(W,b) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \lVert W \rVert_1 $$

where $\lVert W \rVert_1 = \sum_{j=1}^{n_x} |w_j|$.

Unfortunately, the math gets hairy when we try to break it down like we did for L2 regularization. L1 does not yield clean algebraic expressions, because the sign and derivative computations of the absolute value function are tricky. But overall, its effects are pretty similar.

A key difference is that L1 norm can make the weights 0. This results in a more sparse solution (sparsity referring to the fact that some parameters have optimal values at 0).

Weight Constraints

So far, all the methods we’ve seen apply a penalty to each weight separately. We can also put a constraint on the maximum size (L1 or L2 norm) of each unit’s incoming weight vector. If an update violates this constraint, we scale down all of that unit’s incoming weights to the allowed range.

This provides an advantage to classic penalties:

  • prevents weights exploding (hard cap)
  • when a unit hits its limit, the effective “penalty” on all of its weights is determined by only the big gradients
    • this is better than a fixed penalty which pushes slightly irrelevant weights to 0
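
Here’s a minimal numpy sketch of such a max-norm constraint applied after an update (the threshold and the layout - one row of W per unit - are my own assumptions):

import numpy as np

def apply_max_norm(W, max_norm=3.0):
    # rescale each unit's incoming weight vector if its L2 norm exceeds max_norm
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.minimum(1.0, max_norm / (norms + 1e-12))
    return W * scale

# after each gradient update: W = apply_max_norm(W)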

Dropout

This can be a rather complex topic if you dive into the math, so we’ll treat this as a gentle introduction to the idea. Dropout is an efficient way to actually average over many large neural nets. What do I mean?

The idea of dropout is this: every time we present a training example to the model, we randomly omit each hidden unit with a given dropout probability (a hyperparameter). This lets us randomly sample from $2^h$ different architectures ($h$ is the number of hidden units), where all architectures share the same weights.

In practice, we use the idea of inverted dropout. This simply means working with the probability that we keep a unit, rather than the probability that we drop one: $P(keep) = 1 - P(drop)$.

The algorithm looks like this in practice (I’ll use Python):

import numpy as np

# example code section for implementing inverted dropout for layer 3
# a3 = activations of layer 3, keep_prob = e.g. 0.8 (both assumed set/computed earlier)
# generate binary mask of what to keep
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob
# activation layer gets masked
a3 = np.multiply(a3, d3)

a3 /= keep_prob     # THIS IS IMPORTANT!

I’ll answer two questions:

  • Why inverted dropout?
  • Why did we scale our activations in the last line?
    • Remember that dropout is only applied during training. With fewer hidden units “working” to map inputs to outputs, the expected value of the layer’s output is reduced by a factor of keep_prob compared to validation/test time, when all activations are used. Dividing by keep_prob during training offsets this, so the scale of the outputs matches at test time. I’ll illustrate with an example:

Imagine you had 50 units in layer 3 above. If 0.8 is our keep_prob, then ~10 units shut off during training. Recall that $z^{[4]} = w^{[4]}a^{[3]} + b^{[4]}$. This means the values produced by $a^{[3]}$ are reduced by 20% (in expectation), hence reducing the value of $z^{[4]}$.

Dividing this by 0.8 offsets this value reduction!

So important things to keep in mind:

  • we do not use dropout @ test/dev time!
  • effects of dropout are similar to L2 regularization
  • you can also add more control by varying keep_prob per layer
  • the main downside to dropout is that the cost function $J$ becomes less well defined

If the idea hasn’t stuck with you yet, you can also think about it this way. If a hidden unit knows which other hidden units are present, it can co-adapt to them on the training data. But complex co-adaptations are likely to go wrong on new test data. Big, complex conspiracies are not robust.

Moreover, if a hidden unit has to work well with combinatorially many sets of co-workers, it is more likely to do something that is individually useful. You can even think of this on a human scale: if you play on the basketball, soccer, and track teams, it generally means you are a well-rounded athlete and good on your feet (rather than just being good at shooting or passing). This makes you more versatile and a better player to pick if your school starts a new sports team (e.g. football).

Normalizing Inputs

Another common technique that helps our model train (and generalize) well is normalizing the first input layer (i.e. normalizing our data).

There are many ways to do this, but the most common ways are:

  • Subtracting mean: $$ \begin{align} \mu &= \frac{1}{m} \sum_{i=1}^{m} x^{(i)} \newline x &:= x - \mu \end{align} $$
  • Normalizing variance (using standard deviation): $$ \begin{align} \sigma^2 &= \frac{1}{m} \sum_{i=1}^{m} [x^{(i)}]^2 \newline x &:= x/ \sigma \end{align} $$

This makes the contours of our loss function more even, which makes gradient descent easier.
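
A minimal numpy sketch of both steps; note that the same $\mu$ and $\sigma$ computed on the training data are reused on the test data, so both live on the same scale:

import numpy as np

def normalize_inputs(X_train, X_test):
    # zero-mean, unit-variance scaling using TRAINING statistics only
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0) + 1e-8    # small epsilon avoids division by zero
    return (X_train - mu) / sigma, (X_test - mu) / sigma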

Batch Normalization

The idea is simple: can we normalize $a^{[l-1]}$ to make the training of $w^{[l]}$ and $b^{[l]}$ faster? The motivation is to reduce internal covariate shift - the change in the distribution of network activations caused by the change in network parameters during training. By normalizing each layer, we reduce how much these distributions shift around. For particularly deep neural networks, this limits how much earlier layers affect later layers, which stabilizes training and makes the deeper layers more independent.

It also provides a slight regularization effect, because each mini-batch is normalized by the mean and variance of the mini-batch, while also injecting a bit of noise to $z^{[l]}$. Let’s see how Batch Normalization (BN) works:

Given some intermediate values in a neural network: $z^{[l],1}, z^{[l],2}$, $\cdots$, $z^{[l],m}$, we can calculate:

$$ \begin{align} \mu &= \frac{1}{m} \sum_{i} z^{[l],i} \newline \sigma ^2 &= \frac{1}{m}\sum_{i} (z^{[l],i} - \mu)^2 \end{align} $$

allowing us to normalize by:

$$ z_{norm}^{[l],i} = \frac{z^{[l],i} - \mu}{\sqrt{\sigma^2 + \epsilon}} $$

which will be used in this new update equation:

$$ \hat{z}^{[l],i} = \gamma z_{norm}^{[l],i} + \beta $$

where $\gamma, \beta$ are learnable parameters (like $w, b$). This happens at every layer of the network. Alongside these parameters, we also save each layer’s mean $\mu$ and variance $\sigma^2$ via an exponentially weighted average across the mini-batches. This will be important at test time.

Some notes before continuing:

  • If you use BN, $b^{[l]}$ gets cancelled out by the mean subtraction in every layer, so we don’t need to train it
  • $dim(\beta^{[l]}, \gamma^{[l]}) = (n^{[l]}, 1)$
  • BN operation can be done before or after the activation. In this case, I presented BN before activation. Some literature suggests doing it after performs better sometimes. This is up to you.

Pseudo-implementation with GD:

mean_avg, var_avg = 0, 0
for t = 1 ... num_minibatches:
    compute forward prop on mini-batch X{t}
      in each hidden layer, use BN to replace Z with Zhat
        update mean_avg, var_avg
    use back prop to find dW, dbeta, dgamma
    update parameters

BN works with momentum, RMSprop, Adam, and many other optimizers.

At test time

“At test time” is often phrased as “at/during inference”. Recall that during training, BN starts by calculating the mean and variance of a mini-batch. During inference, however, we may have a single sample rather than a mini-batch. How do we obtain the mean and variance in that case?

This is where the two moving averages we saved during training (and stored with the model) come in. We just use those saved values for the BN operation during inference.
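
Here’s a minimal numpy sketch of a BN forward step that covers both cases: during training it normalizes with the mini-batch statistics and updates the running averages, while at inference it reuses the saved ones (the momentum value and shapes are my own assumptions):

import numpy as np

def batchnorm_forward(Z, gamma, beta, running_mu, running_var,
                      momentum=0.9, eps=1e-8, training=True):
    # Z: (n_units, m_batch) pre-activations for one layer
    if training:
        mu = Z.mean(axis=1, keepdims=True)
        var = Z.var(axis=1, keepdims=True)
        # update the saved moving averages used later at inference
        running_mu = momentum * running_mu + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        mu, var = running_mu, running_var          # use the saved statistics
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    return gamma * Z_norm + beta, running_mu, running_var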

Conclusion

This is the end of the fundamentals post. I hope this was helpful for those needing a refresher or wanting to dip their toes in ML and deep learning. Thanks for reading.