🌦️ AI to the Rescue: Making Weather Forecasts Better and Cheaper for South Africa!
Should Africa rely on a global AI Weather Foundation model or create its own?
Ever wondered why your weather app sometimes gets forecasts so wrong? Well, buckle up, because we're diving into how Artificial Intelligence (AI) is shaking things up to give South Africa better weather forecasts, without needing a supercomputer the size of a house! This fun guide is based on a study we conducted at AIMS South Africa in collaboration with Ana Lucic from the University of Amsterdam and Jan Ravnik from Ishango, and we're breaking it down with simple explanations and a nifty diagram to make it all crystal clear. Let's go! 🚀
☁️ Why AI Weather Forecasting in South Africa May Need a Boost
Let us start by looking at how one of the most powerful AI weather foundation models, Aurora, performs over South Africa compared to Europe and the USA. We use this model not only because it outperforms traditional Numerical Weather Prediction models, but also because it outperforms other state-of-the-art AI weather models on many targets. Its most remarkable performance is the prediction of Storm Ciarán, where most other AI weather models failed.

As you can see, even Aurora struggles to predict the weather over the most data-rich region in Africa, while it's doing incredibly well over Europe and the USA. This is called differential performance bias, fancy words for "the AI works better in some places than others." Imagine you're planning an outdoor party, but your weather app says "sunny" while a storm is brewing. Why does this happen? In places like South Africa, weather forecasting can be tricky because:
Not Enough Weather Stations: South Africa has fewer weather stations than places like Europe or the USA. It's like trying to guess what's cooking in a kitchen with only half the recipe!
Tricky Weather Patterns: From scorching summers to sudden storms, South Africa's weather is a wild ride. Standard weather models often struggle to keep up.
AI Models Love Europe and the USA: Many AI weather models are trained on data from places with tons of weather info (like Europe). This makes them superstars there, but less accurate in South Africa. It's like an AI that's great at predicting snow in London but clueless about a Johannesburg thunderstorm.
Since Aurora is trained on global data, we can't say its weaker performance is simply because it has never seen data from South Africa. But what we definitely know is that, historically, the USA and Europe have collected much more weather data than Africa, South Africa included.
Before we continue, let's talk a little bit more about Aurora.
🌌 Meet Aurora: The AI Weather Wizard
The genius behind this architecture is awe-inspiring, and I am going to tell you briefly why.
Aurora, Microsoft AI's 1.3-billion-parameter model for weather and climate forecasting, uses a 3D Perceiver encoder to address the challenge of heterogeneous weather data (varying variables, resolutions, and pressure levels) by mapping it into a standardized 3D tensor, leveraging Perceivers' ability to handle diverse input sizes and modalities efficiently. The 3D Swin Transformer U-Net backbone captures multi-scale atmospheric dynamics: Swin Transformers are chosen for their linear computational complexity and ability to model long-range dependencies, and the U-Net's multi-scale processing mirrors real atmospheric interactions. The 3D Perceiver decoder reconstructs outputs to the original variables and resolutions, using the same flexibility to handle diverse outputs. For more details about the potential reasons behind this architecture, you can check this Medium post by Devansh. Note that Microsoft's researchers did not release only the 1.3-billion-parameter model. Aware of how challenging it can be for a regular user to run such a model on a consumer-grade computer, they also released a smaller, 112-million-parameter version of Aurora (AuroraSmall) to allow a broader public to take advantage of this masterpiece. This small model is the one we use in this study, and it really makes our life easier in terms of computation and memory needs.
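To get a feel for why the smaller model matters in practice, here is a rough back-of-the-envelope sketch (ours, not from the study) of the memory the weights alone would occupy at float32 precision; it ignores activations, gradients, and optimizer state, which make the real footprint larger still:

```python
# Back-of-the-envelope memory comparison between Aurora (1.3B parameters)
# and AuroraSmall (112M parameters), assuming float32 (4 bytes) weights.
# Rough figures: weights only, no activations or optimizer state.

def weights_size_gb(n_params: int, bytes_per_param: int = 4) -> float:
    """Approximate size of the model weights in gigabytes."""
    return n_params * bytes_per_param / 1e9

aurora_gb = weights_size_gb(1_300_000_000)   # ~5.2 GB just for the weights
small_gb = weights_size_gb(112_000_000)      # ~0.45 GB

print(f"Aurora:      {aurora_gb:.2f} GB")
print(f"AuroraSmall: {small_gb:.2f} GB")
```

The small model's weights fit comfortably in the RAM of an ordinary laptop, which is precisely what makes the fine-tuning experiments in this study feasible without a supercomputer.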
💪 The Magic of Low-Rank Adaptation (LoRA): Fine-Tuning Without Breaking the Bank
Adapting a foundation model with millions or billions of parameters to a new task is not easy: finding the exact change to make to each parameter is far from obvious. The question, then, is how to learn that change efficiently without learning a specific update for every parameter; this is where LoRA comes in. Think of LoRA as a clever trick that makes adapting a model to a new task as easy as tweaking a recipe instead of rewriting the whole cookbook. AuroraSmall is like a giant recipe book with millions of instructions (parameters) for predicting weather. Instead of rewriting the whole book, LoRA adds a few sticky notes in parallel (matrices A and B, as in the figure below) with South Africa-specific tips, like "pay extra attention to floods." These matrices represent the change you would have made to the pretrained layer if you were training it directly. Because of the down-projection to a lower dimension r, these sticky notes are small and have few parameters to train, so they don't need a supercomputer to update. We only train these sticky notes to learn the local weather patterns we are interested in. We recover the original pretrained layer size by taking the product of the learned sticky notes A and B; then we add this product to the original pretrained matrix to get updated weights that take the local weather patterns into account. Mathematically, let us consider a matrix

$$W_0 \in \mathbb{R}^{d \times k}$$

representing the pretrained layer weights. If $x$ is the input to the layer, then the output is:

$$h = W_0 x + \frac{\alpha}{r} B A x, \quad B \in \mathbb{R}^{d \times r}, \; A \in \mathbb{R}^{r \times k}$$

The scaling factor $\frac{\alpha}{r}$ is how we control how much of the learned change we take into account during prediction. It can be greater than 1, but most of the time it's set to less than 1 to avoid an aggressive parameter update during training that may cause instability (NaN values in gradients, for example). Note that $\alpha$ here is a positive integer like $r$. In this study, we set $r$ to 16 and tried two different values for $\alpha$, 2 and 4; 4 turned out to work better.
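To make the sticky-note picture concrete, here is a minimal, framework-free sketch of a LoRA forward pass on toy matrices. The shapes follow the LoRA convention (W0 is d×k, B is d×r, A is r×k), but the dimensions and all the numbers are made up for illustration, much smaller than anything in the study:

```python
import random

# Minimal LoRA forward pass on toy matrices (pure Python, no framework).
# Adapted output: h = W0 @ x + (alpha / r) * (B @ (A @ x)).

def matmul(M, N):
    """Multiply two matrices given as lists of rows."""
    return [[sum(M[i][t] * N[t][j] for t in range(len(N)))
             for j in range(len(N[0]))] for i in range(len(M))]

d, k, r, alpha = 4, 3, 2, 4          # the study uses r=16, alpha=4; tiny here
random.seed(0)

W0 = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]  # frozen
A = [[random.gauss(0, 1) for _ in range(k)] for _ in range(r)]   # trainable
B = [[0.0 for _ in range(r)] for _ in range(d)]                  # zero init
x = [[1.0], [2.0], [3.0]]                                        # k x 1 input

def lora_forward(W0, A, B, x, alpha, r):
    base = matmul(W0, x)                     # pretrained layer output
    delta = matmul(B, matmul(A, x))          # low-rank correction B A x
    return [[base[i][0] + (alpha / r) * delta[i][0]] for i in range(len(base))]

# With B = 0, the adapter contributes nothing: the first forward pass is
# exactly the pretrained layer's output, so the model keeps its knowledge.
assert lora_forward(W0, A, B, x, alpha, r) == matmul(W0, x)
```

Note how the zero-initialized B makes the very first forward pass identical to the pretrained layer's output; only r·(d + k) adapter entries ever need training, instead of the d·k entries of the full weight matrix.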
The formula for the uniform bound is:

$$\text{bound} = \sqrt{\frac{6}{(1 + a^2)\,\text{fan\_in}}}$$

where:
fan_in: number of input units (i.e., number of columns in A)
a: slope of the activation function (here, math.sqrt(5) is used as a common default for leaky ReLU)
So this initialization ensures that the values in A are drawn from the interval $[-\text{bound}, +\text{bound}]$, which helps gradients flow well at the beginning of training.
Now you are probably wondering: why set B to zero? Why not A? Or why not set both to zero?
I can start by telling you that we need at least one of these two matrices to be zero before training starts; otherwise, the first forward pass (first prediction) will not produce anything useful that the model can use to adjust the weights in A and B. If neither A nor B is zero, i.e., we initialize both from a random distribution, then the output of our first prediction is just a random prediction. But we don't want that, because we need the model to keep its initial knowledge and add new knowledge to it. Setting one of them to zero keeps the original model's knowledge intact during the first forward pass, because $BA = 0$; then, based on the computed error, we start updating the weights.
Now what about setting both to zero? Why not? We need to dive into the math a little to see why. Don't worry, it's just some simple derivatives.
We all know that backpropagation is the algorithm that governs the parameter updates during training. Backpropagation requires the gradients of the loss with respect to the parameters being updated, which are represented here by the matrices A and B. I don't think you need to know the exact loss we are using here to understand what is going on. However, if you are curious, you can still look it up in the Aurora paper. Just keep in mind that our loss is a function of the output we talked about a little earlier. Therefore, we can write something like this:

$$\mathcal{L} = \mathcal{L}(h), \quad \text{with } h = W_0 x + \frac{\alpha}{r} B A x$$
Let's derive the gradient with respect to A first. The application of the chain rule gives:

$$\frac{\partial \mathcal{L}}{\partial A} = \frac{\partial \mathcal{L}}{\partial h} \cdot \frac{\partial h}{\partial A}$$

Therefore,

$$\frac{\partial \mathcal{L}}{\partial A} = \frac{\alpha}{r} B^\top \frac{\partial \mathcal{L}}{\partial h} x^\top$$

If you are wondering how to get this derivative, just look at the formulas in The Matrix Cookbook.
Time for the gradient with respect to B:

$$\frac{\partial \mathcal{L}}{\partial B} = \frac{\partial \mathcal{L}}{\partial h} \cdot \frac{\partial h}{\partial B}$$

Therefore,

$$\frac{\partial \mathcal{L}}{\partial B} = \frac{\alpha}{r} \frac{\partial \mathcal{L}}{\partial h} (A x)^\top$$
If you look carefully at the two gradients we just computed, each one contains the other matrix as a factor: the gradient with respect to A contains B, and the gradient with respect to B contains A. I am sure you already guessed it: if we set both A and B to zero at initialization, the gradients will always be zero during backpropagation, and the parameters we are trying to update will stay the same, no matter how many iterations we train for.
We also realize through this analysis that theoretically, it does not matter which one we set to zero to start with. However, setting B to zero seems to work better empirically.
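The argument above is easy to verify numerically. This toy sketch plugs zero matrices into the two gradient formulas; the shapes and the upstream gradient g are made up for illustration and do not come from the study:

```python
# Numeric check: with both A and B zero, both LoRA gradients vanish, so
# training can never start; with only B zero, the gradient w.r.t. B is
# nonzero and updates can begin.

def matmul(M, N):
    return [[sum(M[i][t] * N[t][j] for t in range(len(N)))
             for j in range(len(N[0]))] for i in range(len(M))]

def transpose(M):
    return [list(row) for row in zip(*M)]

def scale(c, M):
    return [[c * v for v in row] for row in M]

d, k, r, alpha = 2, 2, 1, 4
x = [[1.0], [2.0]]           # k x 1 input
g = [[0.5], [-1.0]]          # d x 1 upstream gradient dL/dh (made up)

def lora_grads(A, B):
    # dL/dA = (alpha/r) B^T g x^T   and   dL/dB = (alpha/r) g (A x)^T
    dA = scale(alpha / r, matmul(transpose(B), matmul(g, transpose(x))))
    dB = scale(alpha / r, matmul(g, transpose(matmul(A, x))))
    return dA, dB

zero_A = [[0.0] * k for _ in range(r)]
zero_B = [[0.0] * r for _ in range(d)]

dA, dB = lora_grads(zero_A, zero_B)              # both zero: stuck forever
assert all(v == 0.0 for row in dA + dB for v in row)

rand_A = [[0.3, -0.7]]                           # nonzero A, B still zero
dA, dB = lora_grads(rand_A, zero_B)
assert all(v == 0.0 for row in dA for v in row)  # A frozen on step one...
assert any(v != 0.0 for row in dB for v in row)  # ...but B starts moving
```

With B = 0 and A random, B receives a nonzero gradient on the very first step; once B moves away from zero, A starts updating too, which is why training gets off the ground.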
🔍 What We Discovered
Does fine-tuning AuroraSmall make it better for South Africa?
Yes! Fine-tuning with LoRA made AuroraSmall better at predicting almost all the considered variables, especially for forecasts up to 48 hours. But it wasn't perfect for everything, such as specific humidity. Think of it as teaching Aurora the gumboot dance: it got the steps right, but it needs practice for the fancy spins.

Can fine-tuned AuroraSmall beat the big Aurora in South Africa?
Here is probably the most striking finding: the fine-tuned small model actually performed modestly but consistently better than the pretrained large model across almost all variables over South Africa. So a small model, once tuned efficiently for a region, can outperform the original pretrained large model on that region. It suggests that regional adaptation, even with a smaller model and efficient techniques, can be more effective locally than relying on the larger globally trained model alone. It really speaks to the potential of making accurate weather forecasting more affordable and accessible.

Now, it's clear that we do not necessarily need to create our own AI weather model from scratch in Africa. We can save a significant amount of time and money by adapting existing solutions to African realities. Here we began with Aurora, but we could also use GraphCast, NeuralGCM, GenCast, or many of the other weather foundation models available.
⚠️ The Catch: It's Not Perfect Yet
The fine-tuned AuroraSmall got better at predicting some variables (like air pressure at high altitudes) in South Africa than in Europe. But overall, the model's accuracy in South Africa is still lower than in data-rich regions like Europe and the USA. It's like Aurora learned some Afrikaans but isn't fluent yet.
Fully closing the performance gap between South Africa and Europe or the USA is still a major challenge, and this may be due to some limitations we identify at the end of the study. The number-one limitation is the lack of high-resolution local South African weather data for fine-tuning. Getting access to, or generating, such datasets is critical for future work: it is a key step toward unlocking better regional performance of AI weather prediction models. Compute resources are also a constraint, limiting training time and the amount of data used, so more compute power would allow longer training runs. We also suggest exploring other PEFT techniques beyond LoRA that may be more effective.

📚 Want to Learn More?
This study demonstrates how AI can enhance weather forecasting, making it more cost-effective and accurate for regions like South Africa. For further details, you can access the full report here. Please feel free to contact us with any questions or suggestions. You can also contribute to the project on our GitHub repository. Thank you for your time and interest.





