Introduction

Vision forecasting is something I’ve been fascinated with, so this post serves as a reflection of my understanding.

This post will take you through the latest advancements in deep learning techniques used for sequential modelling, in particular with the use case of time-series vision forecasting. I’ll start with the ideas behind staple architectures, like a vanilla RNN, LSTM, and GRU, and then discuss the advancements made by their successors and why/how each new development did better than the previous one.

While much more detailed reviews could be done on each architecture, in the spirit of an overview I’ll avoid the nitty-gritty details.

Note: This post is a work in progress, and is being updated regularly with more content.

Basic Overview


RNN

Architecture

Let’s say I wanted to build a neural network (NN) that could model sequential data (e.g. sentences, where time steps are words/characters; weather forecasts, where time steps are satellite images over time, etc.). Building models to handle individual time steps is easy. These are your feedforward NNs, CNNs, etc. that are commonly used for time-independent tasks like image classification. A naive approach would then be to use a time-independent model at each time step, treating each time step as an independent point in time. But intuitively, if we’re considering sequential data, future time steps very likely depend on previous time steps, which is why this method fails.

To address this, we can use a recurrent neural network (RNN) structure. RNNs provide a way to relate the computations that a NN is doing at a particular time step to both the prior history of its computations from previous time steps as well as the input at that time step. They also give the network the ability to pass on current information to future time steps.

An RNN does so by cleverly linking the computations of the network at different time steps to each other, through what is often called the “cell state” or hidden state, $h_t$. By capturing this recurrence relation, the NN gains a sense of “memory”. The figure below shows this relationship:

Fig 1. Diagram of RNN unit. (Image source MIT 6.S191, 2021).

There are 3 main categories of sequence modelling problems:

  • Many-to-one: having a sequential input and one output. E.g. sentence sentiment analysis.
  • One-to-many: having one input that produces a sequential output. E.g. image captioning
  • Many-to-many: sequential input, sequential output. E.g. language translation and forecasting

Moreover, sequence modelling as a whole imposes criteria on an ideal RNN. It should be able to:

  • Handle variable length sequences. E.g. input sentences could have varying lengths.
  • Track long-term dependencies. E.g. remembering the introduction/thesis of an essay to put body paragraph text into context.
  • Maintain information about order. E.g. “I like dogs, not cats” is semantically opposite from “I like cats, not dogs”. However, these sentences have the same words, same counts of appearance, but are in a different order. This distinction must be noted.

Let’s begin by looking at a vanilla RNN (the OG). I’ll be referring to this as just “RNN” for simplicity.

The math can get a bit hairy when we’re dissecting what’s happening inside an RNN. But in layman’s terms, at each time step, the current hidden state is calculated as a function of the previous hidden state and the current time step’s input. This operation involves:

  1. Two weight matrices: one multiplies the previous hidden state, and the other multiplies the current input vector.
  2. The two products are added (along with a bias) and passed through a tanh function, producing the new hidden state.
  3. This newly calculated hidden state is then multiplied by a third, separate weight matrix that determines the output vector/prediction at that time step.

This means you make an output prediction at each time step, which in turn means you can compute a loss at each time step. These per-step losses are summed once the RNN has seen all time steps, and that total loss is used to train the network. These three weight matrices are what get learned through NN training.
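
To make this concrete, here’s a minimal, hypothetical sketch of a single vanilla RNN step in NumPy. The names `W_hh`, `W_xh`, and `W_hy` are my own labels for the three weight matrices described above, and the sizes are arbitrary:

```python
import numpy as np

H, D, O = 16, 8, 4            # hidden, input, and output sizes (arbitrary)
rng = np.random.default_rng(0)

# The three weight matrices that get learned during training (random here).
W_hh = rng.normal(scale=0.1, size=(H, H))   # previous hidden state -> hidden
W_xh = rng.normal(scale=0.1, size=(H, D))   # current input -> hidden
W_hy = rng.normal(scale=0.1, size=(O, H))   # hidden state -> output/prediction

def rnn_step(h_prev, x_t):
    """One vanilla RNN time step: returns the new hidden state and the output."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)   # steps 1 and 2
    y_t = W_hy @ h_t                            # step 3
    return h_t, y_t

# Unroll over a toy sequence of 5 time steps, starting from a zero hidden state.
h = np.zeros(H)
for x in rng.normal(size=(5, D)):
    h, y = rnn_step(h, x)
```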

Backpropagation Through Time

So what happens in training? Unlike the simpler backpropagation in feedforward NNs, RNNs require backpropagation through time (BPTT). This means that when calculating gradients w.r.t. the initial cell state $h_0$, you need to backpropagate through every intermediate time step, repeatedly multiplying by the same weight matrices along the way.

Deep chains of gradient computation like this pose two main issues:

  1. If many of the repeated factors are greater than $1$, the chain of multiplications blows up, resulting in something called exploding gradients. This is a problem because the updates become huge and unstable, and the network cannot optimize.
  2. Conversely, if many of the factors are less than $1$, the chain of multiplications shrinks towards zero, leading to something called vanishing gradients. Early time steps then receive almost no learning signal, which makes long-term dependencies hard to learn.
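
To see where these two failure modes come from, it helps to write the repeated factor out explicitly. Using the $W_{hh}$ notation from the sketch above, the standard BPTT expansion of the gradient of the loss at step $T$ with respect to the initial hidden state is:

$$\frac{\partial L_T}{\partial h_0} = \frac{\partial L_T}{\partial h_T}\prod_{t=1}^{T}\frac{\partial h_t}{\partial h_{t-1}}, \qquad \frac{\partial h_t}{\partial h_{t-1}} = \mathrm{diag}\big(\tanh'(\cdot)\big)\,W_{hh}$$

If the norms of these per-step factors are consistently greater than 1 the product blows up, and if they are consistently less than 1 it shrinks towards zero.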

There are several ways to address these problems:

  • Gradient Clipping
  • Using a better activation function
  • Weight initialization
  • Using other architectures

Gradient Clipping

This is when you force the gradient values (element-wise) to a specific minimum or maximum value if the gradient exceeds an expected range. That is, every partial derivative of the loss w.r.t. each trainable parameter is clipped to a specified range. This range is ultimately another hyperparameter that you can tune while training. However, it is often better to clip by the norm of the gradient rather than by its individual values.

Let’s illustrate this with an example. Say I have a gradient (in this case a 1x2 vector) whose computed value is $[0.9, 100]$, and I want to clip its element values to the range $[-1, 1]$. This clipping operation results in $[0.9, 1]$, whose direction (imagine these are vector components) is very different from the original gradient: the clipped gradient points near 45 degrees, whereas the original pointed nearly straight up at 90 degrees. If we clip by the norm instead, we can preserve the orientation of the gradient. In this example, setting the L2 norm threshold to 1.0 rescales $[0.9, 100]$ to roughly $[0.009, 0.99996]$, keeping the direction intact.
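
Here is a small NumPy sketch of both clipping modes on the example above (the thresholds are the ones from the example; in practice a framework utility such as PyTorch’s `clip_grad_value_` or `clip_grad_norm_` would do this for you):

```python
import numpy as np

grad = np.array([0.9, 100.0])

# Clip by value: each element is forced into [-1, 1]; the direction changes a lot.
clipped_by_value = np.clip(grad, -1.0, 1.0)      # -> [0.9, 1.0]

# Clip by L2 norm: rescale the whole vector only if its norm exceeds the threshold.
max_norm = 1.0
norm = np.linalg.norm(grad)                      # ~100.004
clipped_by_norm = grad * (max_norm / norm) if norm > max_norm else grad
# -> roughly [0.009, 0.99996], same direction as the original gradient
```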

Using a better activation function

If you look at the sigmoid and tanh activation functions, their derivatives fall off dramatically as $x$ moves away from zero (the functions saturate). This means that inputs with large magnitude barely change the weights, biasing the network towards smaller and smaller gradients. A solution is to use a different activation function, such as the Rectified Linear Unit (ReLU).

Fig 2. Function graph of ReLU and its variants. (Image source: Applied Intelligence).

Not only is ReLU cheaper to compute (which speeds up training), its gradient on the positive side is a constant 1, so those units update readily during training. While this may solve the vanishing gradient problem, it brings a problem of its own: the zero gradient on the left-hand side of the y-axis leads to “dead neurons”. When a gradient update pushes the inputs of a ReLU to be always negative, its output is always zero, and the neuron simply “dies” and never turns back on. Variants of ReLU such as ELU, Scaled ELU (SELU), Leaky ReLU, and PReLU prevent this by providing small but non-zero gradients for $x < 0$.
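
These variants are available as drop-in modules in common frameworks; here’s a quick PyTorch sketch (the module names are real PyTorch layers, and the slope/alpha values are just defaults used for illustration):

```python
import torch
import torch.nn as nn

x = torch.linspace(-3, 3, 7)

relu       = nn.ReLU()             # 0 for x < 0, identity otherwise
leaky_relu = nn.LeakyReLU(0.01)    # small constant slope (0.01) for x < 0
elu        = nn.ELU(alpha=1.0)     # smooth exponential curve for x < 0
selu       = nn.SELU()             # scaled ELU with fixed self-normalizing constants

for act in (relu, leaky_relu, elu, selu):
    print(act.__class__.__name__, act(x))
```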

Weight Initialization

You can set the initial values of the weights and biases in a deliberate way; this is called “weight initialization”. Although these initial values are usually random, it is common practice to scale them based on layer sizes to keep the gradients from exploding early. Layer size matters because if your layer size $n$ is large, the dot product of the weights and inputs has more terms, so you want each term to be smaller.

One way to do this is to choose the variance of the randomly initialized weights based on the size of the previous layer. Depending on the layer’s activation function, different scales are typically used: for ReLU layers, a standard deviation of $\sqrt{\frac{2}{n_{l-1}}}$ is a standard choice (He-et-al initialization). Other variants include the Xavier initialization and many more.
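
As a rough sketch of what these schemes look like in PyTorch (the `nn.init` functions are real PyTorch APIs; the layer sizes are arbitrary, and you would normally pick one scheme rather than applying both):

```python
import torch.nn as nn

layer = nn.Linear(256, 128)   # n_{l-1} = 256 inputs feeding 128 units

# He initialization: std = sqrt(2 / fan_in), the standard choice for ReLU layers.
nn.init.kaiming_normal_(layer.weight, mode="fan_in", nonlinearity="relu")

# Alternatively, Xavier (Glorot) initialization: variance balanced between
# fan_in and fan_out, typically used with tanh/sigmoid layers.
nn.init.xavier_uniform_(layer.weight)

# Biases are commonly just initialized to zero.
nn.init.zeros_(layer.bias)
```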

Other architectures

Researchers quickly realized the limitations of this basic architecture, and the constant evolution of the recurrent unit began. The idea was to design a more complex recurrent unit that could adhere more effectively to the three design criteria discussed before. This began with ideating different “gates” that control what information passes through to the next time step and how the cell state is updated. These ideas enabled more complex architectures, many of which I will cover below.

Gated Networks

LSTM

Long Short Term Memory (LSTM) is an RNN architecture that uses a “gated” cell to track information across large time steps.

In a standard RNN, the repeating module is a simple computation node with a single activation function. In contrast, LSTMs use several interacting computation blocks that control information flow. Information is added or removed through structures called gates, which are essentially an activation-function layer (like a sigmoid) followed by a pointwise (element-wise) multiplication.

Another key distinction is that more information is passed from previous time steps. In vanilla RNNs, we used the term “previous cell state” to describe the single input from the previous time step. LSTMs instead pass two things: the calculation at time $t+1$ requires both the previous cell state $C_t$ and what I’ll call the previous cell’s output state $h_t$.

The information flow in an LSTM module is a four-step process:

  1. Forget: the forget gate takes $h_{t-1}$ and the current time step’s input, applies a learned weight matrix and bias, and passes the result through a sigmoid function. The sigmoid (range: 0 to 1) modulates how much of the previous cell state should be passed in/left out. Intuitively, you could consider the result to be binary, where 0 means “forget” and 1 means “don’t forget”.
Fig 3a. The forget gate. (Image source: Christopher Olah).

  2. Input/Store: this input gate takes $h_{t-1}$ and the current time step’s input, and determines which bits of the new input we need to remember.

    Fig 3b. The input gate. (Image source: Christopher Olah).

  3. Update: the result of the input gate and the previous time step’s cell state $C_{t-1}$ are fed into the update gate, where the LSTM module forgets what needs to be forgotten and memorizes what needs to be remembered. This updates the cell’s memory, $C_t$.

    Fig 3c. The update gate. (Image source: Christopher Olah).

  4. Output: the output gate is fed the newly calculated cell state $C_t$, the previous output state $h_{t-1}$, and the current input to determine the cell output $h_t$. It controls whether or not this bit of information should be sent to deeper LSTM layers.

    Fig 3d. The output gate. (Image source: Christopher Olah).

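Putting the four steps together, the standard LSTM equations (the common formulation, e.g. in the Olah post the figures are taken from; $\sigma$ is the sigmoid, $\odot$ is element-wise multiplication, and $[h_{t-1}, x_t]$ denotes concatenation) are:

$$
\begin{aligned}
f_t &= \sigma(W_f\,[h_{t-1}, x_t] + b_f) && \text{(forget)}\\
i_t &= \sigma(W_i\,[h_{t-1}, x_t] + b_i), \quad \tilde{C}_t = \tanh(W_C\,[h_{t-1}, x_t] + b_C) && \text{(input/store)}\\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{(update)}\\
o_t &= \sigma(W_o\,[h_{t-1}, x_t] + b_o), \quad h_t = o_t \odot \tanh(C_t) && \text{(output)}
\end{aligned}
$$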

There’s a key advantage to keeping the cell state separate from the output state: one step of backpropagation from $C_t$ to $C_{t-1}$ only passes through a single element-wise multiplication by the forget gate $f_t$. This allows the cell state to be updated through backprop without going through the compounding weight matrix multiplications we saw in vanilla RNNs, making training much more stable. This advantage is called uninterrupted gradient flow!

GRU

There are many variations of this LSTM-gated structure. These include adding peephole connections or coupled forget and input gates. A notable one is the GRU:

The Gated Recurrent Unit is a variation of the LSTM architecture with two key modifications. First, it combines the forget and input gates into a single “update gate”. Second, it merges the cell state and hidden state (which the LSTM keeps separate) into a single hidden state, adding a “reset” gate that controls how much of the previous hidden state is used when computing the new candidate. This condensation of gates means fewer parameters, which makes training GRUs slightly faster than LSTMs.

Fig 4. The GRU. (Image source: Christopher Olah).

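For completeness, the corresponding GRU update (again the common formulation; same notation as the LSTM equations above) is:

$$
\begin{aligned}
z_t &= \sigma(W_z\,[h_{t-1}, x_t]) && \text{(update gate)}\\
r_t &= \sigma(W_r\,[h_{t-1}, x_t]) && \text{(reset gate)}\\
\tilde{h}_t &= \tanh(W\,[r_t \odot h_{t-1}, x_t]) && \text{(candidate state)}\\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
$$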

Advantages/Disadvantages of Gated Units (LSTM, GRU, …)

Advantages

The GRU and LSTM both have an additive component in their update from the previous time step (keeping existing memory and adding new content), which distinguishes them from traditional RNNs (which replace the old memory with a new value). This additive nature makes it easier for each unit to remember the existence of a specific feature in the input stream over a long series of steps. In practice, this means any important feature (as decided by the forget gate in LSTMs, the update gate in GRUs, or analogous gates in other architectures of this nature) will not be overwritten but maintained as it is. Furthermore, the additive update of the cell state gives a derivative that is much more “well behaved”, reducing the chances of a vanishing gradient.

Moreover, having different gates regulating information flow makes the vanishing gradient problem even less common. These gates allow the network to decide how much of the gradient vanishes, and they can take on different values at each time step. A detailed mathematical look can be found here.

Disadvantages

By nature of having gates, these models have more parameters to learn during training. Hence, they take MUCH longer to train and require more memory. For those with limited compute, this is a real bottleneck.
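
To make the parameter overhead concrete, here is a quick (hypothetical) comparison using PyTorch’s built-in recurrent layers, with an arbitrary input size of 64 and hidden size of 128:

```python
import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

input_size, hidden_size = 64, 128

rnn  = nn.RNN(input_size, hidden_size)    # one transform per state update
gru  = nn.GRU(input_size, hidden_size)    # 3 gate transforms -> roughly 3x the parameters
lstm = nn.LSTM(input_size, hidden_size)   # 4 gate transforms -> roughly 4x the parameters

print(n_params(rnn), n_params(gru), n_params(lstm))
```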

Novel Methods

ConvLSTM

Vision forecasting often involves video data, which is a sequence of images. The motivation behind the Convolutional LSTM (ConvLSTM) was to use the idea of convolutions (which work really well with standalone image data) in conjunction with LSTM’s ability to map temporal dependencies through recurrent layers.

Capturing Temporal Dependencies

The LSTM encoder-decoder framework proposed in this paper provides a general framework for sequence-to-sequence (many to many) problems by training temporally concatenated LSTMs – one for the input sequence and another for the output sequence. This idea was extended into a technique of unsupervised learning of video representations using LSTMs. This paper can be found here. They built an LSTM encoder-decoder-predictor model which can reconstruct the input sequence (kind of like autoencoders), and predict the future sequence simultaneously. While their model was promising for sequence forecasting, the “fully-connected-ness” did not take spatial correlation into consideration, which is exactly what ConvLSTMs try to include.

Fig 5. The Composite Model: The LSTM predicts the future as well as the input sequence. (Image source: Srivastava et al).

Essentially, ConvLSTMs incorporate convolutional structures in both the input-to-state and state-to-state transitions. This targets the main drawback of fully-connected (FC) LSTMs: they must unfold input data into 1D vectors, which destroys any spatial information.

Let’s dig into why this is so important. Images are inherently spatial, so if you flatten 2D images into 1D vectors, you lose that spatial structure. For example, think of two identical images, one of which is rotated. You would need to retain the spatial structure to conclude they show the same content, whereas in a flattened 1D representation the two would look completely different.

ConvLSTMs overcome this by replacing the matrix multiplications inside each “LSTM unit” with convolutions. Essentially, all inputs, cell outputs, hidden states, and gates in this version of the LSTM are 3D tensors, whose last two dimensions are the spatial dimensions of the original image (rows/cols).

At every step of the ConvLSTM, the future state of a pixel position is predicted from the inputs and past states of its local neighbours (figure shown below). The neighbourhood dependencies that this convolution captures depend on the size of the convolution kernel. For example, if the states are hidden representations of moving objects, then larger transition kernels better capture faster motions (variance across larger areas) and smaller kernels better capture slower motions (variance across smaller areas).

Fig 6. Inner structure of a ConvLSTM. (Image source: Shi et al).

It’s also interesting to consider how, with this new architecture, the previously mentioned FC-LSTM is essentially a special case of a ConvLSTM where all the features stand on a single 1x1 cell.
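
To make the idea concrete, here is a minimal, hypothetical ConvLSTM cell sketch in PyTorch. It is my own simplified reading of the Shi et al. formulation rather than their reference code: it omits details such as peephole connections and uses “same” padding so the spatial size is preserved. The `kernel_size` argument corresponds to the neighbourhood size discussed above.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """One ConvLSTM step: LSTM gates whose transforms are convolutions."""

    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2  # "same" padding keeps rows/cols unchanged
        # A single convolution computes all four gates at once from [x_t, h_{t-1}].
        self.conv = nn.Conv2d(in_channels + hidden_channels,
                              4 * hidden_channels, kernel_size, padding=padding)

    def forward(self, x_t, h_prev, c_prev):
        # x_t: (B, C_in, H, W); h_prev, c_prev: (B, C_hidden, H, W)
        gates = self.conv(torch.cat([x_t, h_prev], dim=1))
        i, f, o, g = torch.chunk(gates, 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(g)                 # candidate cell content
        c_t = f * c_prev + i * g          # additive cell-state update, kept as a spatial map
        h_t = o * torch.tanh(c_t)         # output/hidden state keeps its spatial layout
        return h_t, c_t

# Toy usage: a batch of 4 single-channel 32x32 frames, unrolled for one step.
cell = ConvLSTMCell(in_channels=1, hidden_channels=8)
x = torch.randn(4, 1, 32, 32)
h = torch.zeros(4, 8, 32, 32)
c = torch.zeros(4, 8, 32, 32)
h, c = cell(x, h, c)
```

A full ConvLSTM layer would simply apply this cell across the time dimension of a video tensor.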


That’s it for now! Thanks for reading. Topics that will be added soon here are: ConvGRU, TrajGRU, PredRNN (++), E3D-LSTM, CrevNet, Conv-TT-LSTM, PredRNN-V2, and Attention mechanisms.