Overview

ClimateHack.AI was an international hackathon themed around fighting climate change. This post dissects some of the thought process behind our decisions and walks through the methods we used to build our solution. Much credit goes to my teammates Tony Liu and Andy Cai.

Background

Better near-term forecasting of solar electricity generation would let electricity grid operators around the world schedule their grids more effectively.

For example, the UK National Energy Grid Operator currently uses a combination of solar and natural gas sources to generate power. The grid's objective is to reliably supply electricity to meet demand at all times, so natural gas is kept on 24/7 standby in case of sudden falls in solar production (e.g. due to dense cloud coverage).

By developing better solar forecasting techniques, the grid operator could minimize the use of standby gas turbines, potentially cutting carbon emissions by up to 100 kilotonnes a year. While the climate impact is incredibly difficult to predict accurately, a rough estimate suggests that better solar power forecasts, if deployed worldwide, could reduce global carbon emissions by about 100 million tonnes of CO2 a year by 2030.

Challenge

Our specific challenge was cloud prediction. We had roughly 1.5 months to work as individuals and one week as a team to apply machine learning techniques and develop the best satellite imagery forecasting algorithm we could, one that could eventually be used in solar power output forecasting.

The specific challenge: from a series of 12 images covering a 128×128-pixel region cropped out of a series of much larger satellite images taken five minutes apart, accurately predict the next 24 images for the central 64×64-pixel area, corresponding to the next two hours of satellite imagery.

Open Climate Fix provided us with ~2 years of high resolution satellite imagery over the UK and north-western Europe from EUMETSAT’s Spinning Enhanced Visible and InfraRed Imager Rapid Scanning Service with a spatial resolution of about 2-3 km.

The competition scored submissions using the Multi-Scale Structural Similarity (MS-SSIM) metric.
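To make the objective concrete, here is a minimal sketch of an MS-SSIM loss. We are assuming the third-party pytorch_msssim package here (the competition's official scorer may use a different implementation and settings), with a reduced window size so the five default scales fit a 64×64 crop.

```python
# Minimal MS-SSIM loss sketch using pytorch_msssim (an assumption; the
# official evaluation code may differ in implementation and settings).
import torch
from pytorch_msssim import ms_ssim

pred = torch.rand(8, 24, 64, 64)    # 24 predicted frames, stacked as channels
truth = torch.rand(8, 24, 64, 64)   # 24 ground-truth frames

# win_size reduced so the five default scales fit a 64x64 crop
score = ms_ssim(pred, truth, data_range=1.0, win_size=3)
loss = 1.0 - score                  # MS-SSIM is a similarity, so minimize 1 - score
```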

Processing Data

The data was provided in the Zarr format, a chunked, compressed storage format for large array-like data; its official documentation is here. Loading and keeping the data in this format turned out to be very slow, so we split the dataset into NumPy arrays, which made loading significantly faster. During training we took random crops from the whole image; interestingly, we saw overfitting when we pre-cut the crops instead.
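As a rough sketch (the paths, the single-array layout, and the variable name "data" are placeholders, not our exact pipeline), the conversion and the random-crop sampling looked something like this:

```python
# Sketch: convert the Zarr store to a NumPy file once, then sample random
# spatio-temporal crops at training time. Names and paths are illustrative.
import numpy as np
import xarray as xr

ds = xr.open_zarr("satellite.zarr")              # lazy, chunked access
frames = ds["data"].values.astype(np.float32)    # (T, H, W) array in memory
np.save("satellite_frames.npy", frames)          # much faster to reload later

def random_crop(frames, t, in_len=12, out_len=24, size=128):
    """Sample a random 128x128 crop of 12 input + 24 target frames starting at time t."""
    _, H, W = frames.shape
    y = np.random.randint(0, H - size + 1)
    x = np.random.randint(0, W - size + 1)
    clip = frames[t : t + in_len + out_len, y : y + size, x : x + size]
    inputs, targets = clip[:in_len], clip[in_len:]
    return inputs, targets[:, 32:96, 32:96]      # targets are the central 64x64
```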

Approaches

Optical Flow (Baseline comparison)

Optical flow (OF) can be thought of as a field of displacement vectors: given two images, OF describes, for each pixel, the velocity vector along which that pixel moves between the images. It essentially assumes brightness constancy, i.e. $I(x, y, t) = I(x + \delta x, y + \delta y, t + \delta t)$.

Fig 1. Optical flow (green highlight) of a short clip of walking football players. Pixels on the players have velocity vectors representing the speed at which they walk; other pixels are stationary because the camera is stationary. (Image source: NVIDIA).

We thought the same line of thinking was applicable to clouds. If we assume clouds are moved by an underlying wind field, OF essentially approximates that wind field. The naive approach was to take the average of all previous OFs and use that as the OF for the entire image set. The results were disappointing and performance was poor. However, unlike most models we tried, it produced images that actually looked realistic to the human eye; the importance of this is explored in the conclusion.
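A minimal sketch of this baseline, assuming OpenCV's Farneback estimator (the parameters below are illustrative, not our tuned values): average the flow over the input frames, then repeatedly warp the last frame forward.

```python
# Optical-flow baseline sketch: estimate flow between consecutive input frames,
# average it, and advect the last frame forward 24 times.
import cv2
import numpy as np

def optical_flow_forecast(frames, n_future=24):
    """frames: (12, 128, 128) array scaled to [0, 255]."""
    flows = [
        cv2.calcOpticalFlowFarneback(
            prev.astype(np.uint8), curr.astype(np.uint8), None,
            0.5, 3, 15, 3, 5, 1.2, 0)
        for prev, curr in zip(frames[:-1], frames[1:])
    ]
    mean_flow = np.mean(flows, axis=0)            # our stand-in for the wind field

    H, W = frames.shape[1:]
    xs, ys = np.meshgrid(np.arange(W), np.arange(H))
    # each output pixel is sampled from where the mean flow says it came from
    map_x = (xs - mean_flow[..., 0]).astype(np.float32)
    map_y = (ys - mean_flow[..., 1]).astype(np.float32)

    preds, last = [], frames[-1].astype(np.float32)
    for _ in range(n_future):
        last = cv2.remap(last, map_x, map_y, cv2.INTER_LINEAR)
        preds.append(last[32:96, 32:96])          # central 64x64 prediction
    return np.stack(preds)
```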

Trajectory GRU

Fig 2. TrajGRU. (Image source: Shi et al).

TrajGRU is a model architecture that improves on previous attempts at incorporating convolutions into autoregressive models (ConvLSTM, ConvGRU). Those past attempts use fixed convolutions on the hidden states at each time step, which means the locality of the connections between the current hidden state and the past hidden state is always the same. This is undesirable when predicting fast-moving objects, because the past hidden-state representations of these objects should ideally have their information propagated to a different location in the future hidden-state representation. TrajGRU solves this problem by dynamically generating these connections.

You can read more about the code implementation here: TrajGRU.
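To give a flavour of the mechanism without reproducing the full cell, here is a simplified sketch of just the trajectory-warping step (all names are ours, and the offsets are kept in normalized coordinates for brevity): a small subnetwork predicts L flow fields from the input and previous hidden state, the hidden state is warped along each flow, and the warped copies are mixed with a 1×1 convolution.

```python
# Simplified TrajGRU-style warping sketch (not the full gated cell).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrajWarp(nn.Module):
    def __init__(self, in_ch, hid_ch, n_links=5):
        super().__init__()
        self.n_links = n_links
        # predicts (dx, dy) offsets for each of the L dynamic connections
        self.flow_net = nn.Sequential(
            nn.Conv2d(in_ch + hid_ch, 32, 5, padding=2),
            nn.LeakyReLU(0.2),
            nn.Conv2d(32, 2 * n_links, 5, padding=2),
        )
        self.mix = nn.Conv2d(hid_ch * n_links, hid_ch, 1)  # aggregate warped states

    def forward(self, x, h):
        b, _, H, W = h.shape
        flows = self.flow_net(torch.cat([x, h], dim=1)).view(b, self.n_links, 2, H, W)
        # base sampling grid in [-1, 1], as grid_sample expects
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, H, device=h.device),
            torch.linspace(-1, 1, W, device=h.device),
            indexing="ij",
        )
        base = torch.stack([xs, ys], dim=-1).unsqueeze(0)   # (1, H, W, 2)
        warped = []
        for l in range(self.n_links):
            offset = flows[:, l].permute(0, 2, 3, 1)        # (B, H, W, 2)
            warped.append(F.grid_sample(h, base + offset, align_corners=True))
        return self.mix(torch.cat(warped, dim=1))           # dynamically-connected state
```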

The main drawback of this model was its compute cost. Like other gated network architectures, it took a long time to converge, especially at this parameter scale. Although we wanted to play around with hyperparameters, we were limited in resources as students, so we quickly abandoned gated networks (including ConvLSTMs) and focused on U-Nets.

U-Net

Fig 3. Original U-Net architecture. This was designed for medical segmentation tasks, but has fared well in vision forecasting tasks. Red box highlights the location of the bottleneck. (Image source: Ronneberger et al).

The U-Net is a convolutional encoder-decoder. The encoder computes progressively higher-level abstract features as the receptive fields of the convolutional kernels grow with the depth of the encoder. After a few layers, it forms a final latent representation at the bottleneck (the bottom of the network, visually).

From the bottleneck, the decoder part of the U-Net computes feature maps of increasing resolution from this latent representation until it reaches the original input resolution and produces the output. It also takes the abstract features computed at the different resolution levels of the encoder and feeds them into the decoder via skip connections (grey arrows in the image), which allow the decoder to integrate information from different resolutions into its final output.
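For illustration, here is a bare-bones three-level U-Net in PyTorch, assuming 12 stacked input frames and 24 output frames; our real model was deeper (five levels) and used the additions described later.

```python
# Minimal 3-level U-Net sketch: encoder, bottleneck, decoder with skip connections.
import torch
import torch.nn as nn

def block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    def __init__(self, in_ch=12, out_ch=24):
        super().__init__()
        self.enc1, self.enc2 = block(in_ch, 32), block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = block(64, 128)
        self.up2 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec2 = block(128, 64)               # 128 = 64 upsampled + 64 skip
        self.up1 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec1 = block(64, 32)
        self.head = nn.Conv2d(32, out_ch, 1)

    def forward(self, x):                        # x: (B, 12, 128, 128) input frames
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))       # latent representation
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))   # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection
        return self.head(d1)                     # (B, 24, 128, 128); crop the centre downstream
```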

Vision Transformer

To understand the Vision Transformer, here's a little background on transformers:

Transformers are sequence-to-sequence models whose main mechanism is attention. For each output element, the attention mechanism lets the transformer learn which parts of the input sequence are most important. The issue with applying standard transformers to vision tasks is that the memory requirement of the attention mechanism scales quadratically with the number of input elements; if every pixel is an input element, memory becomes a huge issue even at small resolutions.

Fig 4. ViT. (Image source: Dosovitskiy et al).

So, some clever researchers pioneered the Vision Transformer (ViT). To overcome this memory issue, it breaks images down into smaller patches (e.g. 4x4, 8x8, etc.), unrolls each patch into a 1D vector, projects it through a learned embedding, and uses the resulting vectors as input tokens to a transformer. This technique has been shown to work on par with, or better than, convolution-based networks on many vision tasks, often with faster training.
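A small sketch of the patch-embedding step and why it tames the memory cost (the patch size and dimensions here are arbitrary, not those of any particular ViT):

```python
# Patch embedding sketch: split the image into patches, flatten each patch,
# project it to the embedding dimension, and attend over patches (not pixels).
import torch
import torch.nn as nn

B, C, H, W = 4, 12, 64, 64
patch, dim = 8, 256

x = torch.rand(B, C, H, W)
patches = nn.functional.unfold(x, kernel_size=patch, stride=patch)  # (B, C*p*p, 64)
tokens = patches.transpose(1, 2)                # (B, 64 patches, C*p*p)
tokens = nn.Linear(C * patch * patch, dim)(tokens)

# attention now costs O(64^2) over patches instead of O((H*W)^2) over pixels
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
    num_layers=2,
)
out = encoder(tokens)                           # (B, 64, 256)
```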

Final Solution

Model

Our best and final solution was a 5-level ViT-augmented U-Net.

Fig 5. Final model architecture.

U-Nets were good at combining spatial representations at different resolutions, but their convolutions weren't able to model temporal relationships or global feature connections; this is inherent to the limited scope of the convolution kernel. Adding a ViT at the bottleneck lets us model global relationships between image patches, overcoming this shortcoming of the U-Net.
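A simplified sketch of the idea (not the exact code of our final model): flatten the bottleneck feature map into tokens, run a transformer encoder over them so every spatial position can attend to every other, and reshape back before decoding.

```python
# ViT-style bottleneck sketch: spatial positions become tokens for global attention.
import torch
import torch.nn as nn

class ViTBottleneck(nn.Module):
    def __init__(self, channels=256, nhead=8, layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=channels, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, feat):                      # feat: (B, C, h, w) from the U-Net encoder
        B, C, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)  # (B, h*w, C): one token per position
        tokens = self.encoder(tokens)             # global attention across all positions
        return tokens.transpose(1, 2).reshape(B, C, h, w)  # back to a feature map for the decoder
```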

Other Enhancements

Group Normalization

Fig 6. Normalization comparisons. (Image source: this blog).

We used group normalization instead of batch normalization.

The idea is that we stack the input frames along the channel dimension (i.e. each channel of our input feature is a time step). This means each channel will have a different data distribution, so to account for these differing distributions across time steps, we normalize separately across groups of channels.
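In PyTorch this is just nn.GroupNorm; a tiny sketch with one group per time step (the exact group count is illustrative):

```python
# Group normalization sketch: each group of channels gets its own statistics,
# computed per sample rather than across the batch.
import torch
import torch.nn as nn

frames = torch.rand(8, 12, 128, 128)                # (batch, time-as-channels, H, W)
gn = nn.GroupNorm(num_groups=12, num_channels=12)   # one group per time step
normalized = gn(frames)                             # per-sample, per-group mean/variance
```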

Depth-Point Convolutions

Depth-point convolutions are a two-step combination of a depthwise convolution followed by a pointwise convolution. A depthwise convolution convolves each channel separately (convolutions separated by depth) and concatenates the outputs into a 3D result. A pointwise convolution acts like a regular convolution that mixes all channels, but it uses a 1x1 filter.

Fig 7. Depth-point convolution structure.

This form of convolution significantly reduces the model's parameter count. In our case, it dropped from around 50M to 15M parameters without a noticeable drop in performance. This idea was inspired by the EfficientNet paper.
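A sketch of the pattern in PyTorch, with a quick parameter comparison against a standard convolution (the channel counts are illustrative, not our exact layer sizes):

```python
# Depthwise-separable ("depth-point") convolution vs a standard convolution.
import torch.nn as nn

in_ch, out_ch, k = 64, 128, 3

standard = nn.Conv2d(in_ch, out_ch, k, padding=1)
depth_point = nn.Sequential(
    nn.Conv2d(in_ch, in_ch, k, padding=1, groups=in_ch),  # depthwise: one filter per channel
    nn.Conv2d(in_ch, out_ch, 1),                          # pointwise: 1x1 across channels
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard), count(depth_point))   # ~73.9k vs ~9.0k parameters
```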

Things we tried that didn’t make it into our solution

CBAM (Convolutional Block Attention Module)

Fig 8. CBAM structure. (Image source: Github repo).

CBAMs are useful for modelling channel dependencies. The module uses max and average pooling to aggregate spatial information, collapsing the spatial map into a single vector per channel, and then computes attention across input channels to capture inter-channel (in our case, temporal) dependencies. These modules weren't included in our final model because they provided minimal benefit to performance.
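For reference, a sketch of the channel-attention half of CBAM, the part relevant to our time-as-channels stacking (the reduction ratio and names are illustrative):

```python
# Channel attention sketch: pool away the spatial dimensions, run the pooled
# descriptors through a shared MLP, and reweight the channels (time steps).
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):                                   # x: (B, C, H, W)
        avg = self.mlp(x.mean(dim=(2, 3)))                  # average-pooled descriptor
        mx = self.mlp(x.amax(dim=(2, 3)))                   # max-pooled descriptor
        weights = torch.sigmoid(avg + mx)[..., None, None]  # (B, C, 1, 1)
        return x * weights                                  # channel-reweighted features
```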

Multiscale Vision Transformers

Fig 9. Multiscale ViT schematic. (Image source: Zhang et al).

Multiscale vision transformers use multiple ViT blocks that take in features at different resolutions after each block. They can be used in the encoder or decoder part of a U-Net. Unfortunately, we never had a chance to fully train these models to convergence.

Focal Transformers

A focal transformer is a multi-resolution vision transformer architecture suited to high-resolution images. It attends at fine resolution locally and coarse resolution globally, allowing focal transformers to capture high-resolution local information while remaining scalable to larger input features.

We used focal transformers to replace both the encoder and the decoder in our U-Net architecture, similar to how a Swin-UNet is organized. Our experiments were promising, but we ran out of time to train them to convergence.

Code Reference: Focal Transformer.

Model analysis

Performance Comparison

Fig 10. Model performance over time.

We compared our models' performance over the prediction horizon and found that it drops off more slowly than the baselines (optical flow and persistence). We theorized that the slow decay is likely because our model compensates for uncertainty about the future by blurring its images. Although this preserves the MS-SSIM score, it doesn't produce realistic images.

Checkerboarding

A big issue we ran into was checkerboarding. Checkerboarding is an artifact caused by the transposed convolutions in the U-Net's convolutional decoder. When strided transpose convolutions overlap unevenly with their neighbouring applications, the input signal gets concentrated in a regular grid, producing the checkerboard pattern.

Fig 11. Clear checkerboarding issue.

We investigated potential solutions, but never got a proper chance to implement them.

Using Different Convolutions

We could have replaced the transpose convolutions in the decoder with upsampling, e.g. traditional bilinear or nearest-neighbour upsampling followed by a regular convolution, as sketched below.
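A sketch of what such a decoder block could look like (a possible fix, not something we shipped):

```python
# Resize-then-convolve upsampling block to replace ConvTranspose2d in the decoder.
import torch.nn as nn

def upsample_block(in_ch, out_ch):
    # bilinear resize (nearest-neighbour is another option), then a regular convolution
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
        nn.Conv2d(in_ch, out_ch, 3, padding=1),
    )
```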

Non-convolution based Upsampling

Upsampling is not limited to convolution-based operators either: we could have tried using a ViT as the decoder.

Performance by Quartile

We also analyzed our performance by quartile. The motivation was to see which images our model predicted well and which it predicted poorly. On top of looking at absolute pixel differences between ground truth and prediction, we also did an FFT analysis of our images.

FFTs decompose signals into a linear combination of pure frequency components. This was applied to the predictions to help determine quantitatively which types of images the model was failing to predict. In the FFT diagram, low frequency features are shown in the center, with the frequency of the features increasing as you move away from it.
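Concretely, the diagnostic is just a shifted 2D FFT and a log-magnitude plot, along the lines of:

```python
# FFT diagnostic sketch: centred log-magnitude spectrum of a single frame.
import numpy as np

def log_spectrum(frame):
    """frame: (H, W) array; returns a centred log-magnitude spectrum."""
    spectrum = np.fft.fftshift(np.fft.fft2(frame))
    return np.log1p(np.abs(spectrum))   # low frequencies end up at the centre

# Comparing log_spectrum(prediction) with log_spectrum(ground_truth) makes the
# missing mid/high-frequency content visible away from the centre.
```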

Quartile Animations (rows: top = image, bottom = FFT; columns: left = prediction, center = truth, right = difference)
1st Quartile (worst)
Fig 12. Low performance predictions.

4th Quartile (best)
Fig 13. High performance predictions.

Our model did not predict high frequency features well (they disappear quickly). In the 1st quartile, the FFT signals are lost very quickly, whereas in the ground truth they are maintained.

The samples the model performs well on are those that mostly contain low frequency features to begin with. This analysis shows that the model predicts low frequency features well but fails to predict mid to high frequency features, as shown by the fact that the model's predictions' FFTs are significantly more concentrated in the center than the ground truth's.

Why?

Our theory is that the high frequency content decays quickly because (tying back to the averaging hypothesis) the model effectively takes the average of possible futures. If you look at the ground truth FFT, the high frequency features look like white noise: they appear randomized at a glance, so it makes sense that averaging over possibilities largely cancels them out. This is why these high frequency features disappear from the model's predictions. The low frequency features concentrated at the centre of the FFT, on the other hand, are consistent across possibilities, so they survive the averaging.

Future Improvements?

Instead of predicting the image itself, we could try predicting features: e.g. predict optical flow and/or several other features and composite them afterwards to generate an image. The motivation is that optical flow retains high frequency content better (more realistic, less blurry images), so the model might have an easier time predicting it over longer periods of time. In fact, the difference in FFT signal retention is clearly visible below when we compare the FFTs of our model to optical flow.

Fig 14. Comparison between ViT-U-Net and Optical Flow.

Another suggestion could be to use an adversarial loss: a discriminator, which determines whether or not a predicted image looks real, would force the model to generate plausible high frequency features to maintain "realism". While this might improve the visual appearance of the predictions, it might also come at a cost to the measured performance, because the model could no longer rely on the "averaging" from our theory.

This also raises the question: is MS-SSIM a good metric for cloud forecasting tasks? Looking into alternative metrics could be a good idea too.


As always, thanks for reading.