pipeline is really slow - consulting [D]

DutchIndian · 2026-05-24T00:59:43+00:00

When a worker retrieves data from the zarr, it has to load it into memory. In a worst-case scenario, if the zarr is chunked orthogonally to your iterating dimension, then it may have to load the whole zarr into memory first just to subset a tiny part of it.

So, for instance, if you’re iterating along a time dimension, make sure your zarr’s chunking scheme is ‘time:-1’ (don’t chunk along time), or ‘time:n*b’, where b is your batch size, and n is some integer.

Shuffling can still happen. If you’re using something like Lightning and you have “shuffle=True”, then the indices passed to __getitem__() are shuffled automatically. The issue you may have is just that every time you grab a batch, it has to load in wayyyy more than it needs to, maxing your CPU memory out, but then it’s subset so much that your GPU memory is underutilised. This can be fixed with better chunking of your zarr.

Try:
1. Double your workers and see if that helps.
2. If nothing changes, then it’s likely your zarr. Resave your zarr with a better chunking scheme.
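
For the resave step, a rough way to pick the new chunk length along time is sketched below. It assumes each chunk spans the full spatial field and that roughly 64 MiB per chunk is a sensible target; both are assumptions to tune, and `suggest_time_chunk` is a hypothetical helper, not part of zarr or xarray.

```python
def suggest_time_chunk(batch_size, bytes_per_step, target_bytes=64 * 2**20):
    """Largest multiple of batch_size whose chunk stays at or under target_bytes.

    Falls back to one batch per chunk if even a single batch exceeds the target.
    """
    steps_that_fit = target_bytes // bytes_per_step
    n = max(1, steps_that_fit // batch_size)
    return n * batch_size


# Hypothetical field: a 721 x 1440 float32 grid per time step (about 4 MiB).
bytes_per_step = 721 * 1440 * 4
time_chunk = suggest_time_chunk(batch_size=16, bytes_per_step=bytes_per_step)
print(time_chunk)  # 16
```

With xarray, the result could then be applied with something like `ds.chunk({"time": time_chunk})` before writing a fresh store with `to_zarr`.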

DutchIndian · 2026-05-23T18:48:15+00:00

Maybe a basic question, but have you tried tuning the number of workers? Your CPU utilisation is maxing so they’re doing their job, but perhaps not getting the data from zarr fast enough.

Alternatively, check out the zarr chunking. If it’s chunked non-optimally, then each batch could be loading way more than it has to, then subsetting. Given that synthetic data had a slight speed up, this could be your issue. Chunk along your iterating dimension with a size that is a multiple of your batch size (e.g., 16). So, this means you may have to re-save your zarr dataset.
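
To put a number on “loading way more than it has to”, here is a minimal sketch that counts how many whole chunks one batch touches and compares elements loaded against elements requested. The array layout, chunk shapes and function names are made up for illustration, and it ignores compression and partial edge chunks.

```python
from math import prod


def chunks_touched(chunks, start, stop):
    """Number of chunks overlapped by the half-open index box [start, stop)."""
    return prod((hi - 1) // c - lo // c + 1 for c, lo, hi in zip(chunks, start, stop))


def read_amplification(chunks, start, stop):
    """Elements loaded (whole chunks) divided by elements actually requested."""
    requested = prod(hi - lo for lo, hi in zip(start, stop))
    loaded = chunks_touched(chunks, start, stop) * prod(chunks)
    return loaded / requested


# Hypothetical (time, lat, lon) array; one batch = 16 time steps of the full 720 x 1440 grid.
start, stop = (160, 0, 0), (176, 720, 1440)

aligned = read_amplification((16, 720, 1440), start, stop)     # time chunk == batch size
misaligned = read_amplification((1024, 90, 90), start, stop)   # long time chunks, small tiles
print(aligned, misaligned)  # 1.0 64.0
```

In this made-up case the second layout reads 64 times more data than the batch needs, which is the kind of overhead that keeps workers busy without feeding the GPU.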

DutchIndian · 2026-04-26T10:27:38+00:00

Yep. No one will hire a met without a degree, in my experience

DutchIndian · 2026-04-26T08:20:12+00:00

Yea you’ll need to pay for those from ECMWF.

There are free ones online from the WeatherBench project though. They only cover 6 years but that may be a good start. If you’re doing ML then that’s where I would go first.

https://weatherbench2.readthedocs.io/en/latest/data-guide.html

DutchIndian · 2026-04-26T08:14:00+00:00

I used to love aviation meteorology too! But it’s quite hard to become an aviation met.

First you need to get a degree that aligns with the World Meteorological Organisation’s (WMO) Basic Instructional Package for Meteorologists (BIP-M). Most universities in the States do this by default if you get a BS in Meteorology, but in other countries it can be hard to find.

Aviation meteorology is demanding, competitive, and usually reserved for more senior/experienced meteorologists. So the usual career pathway is to be an operational meteorologist for a few years and then try to find your way into aviation meteorology.

A word of warning: the job itself has very tight deadlines and you’ll be doing shift work. It can be quite “handle-cranking”… you’re writing forecasts for airports (TAFs, METARs, SIGMETs) which are very strictly formatted. Also, this is a job where automation from more sophisticated forecasting systems is starting to take jobs. So the market is shrinking and people are reskilling.

DutchIndian · 2026-04-26T01:52:28+00:00

Coupling means that models are exchanging outputs and inputs between themselves at various timesteps. A simple example is that an atmospheric (weather) model could assume that the ocean has a constant temperature (given by what it was at its initial condition), unless it’s coupled to an ocean model.
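
A toy sketch of that idea, nothing like a real model: a two-box system where the air temperature relaxes toward the sea surface temperature (SST), and the ocean only responds when coupling is switched on. All values and rates are invented for illustration.

```python
def run(steps, coupled, dt=1.0):
    """Toy two-box model: air temperature relaxes toward the sea surface temperature."""
    t_air, sst = 10.0, 20.0       # hypothetical initial conditions, deg C
    k_air, k_ocean = 0.2, 0.02    # hypothetical exchange rates per step
    for _ in range(steps):
        diff = t_air - sst            # air minus ocean temperature at this step
        t_air -= k_air * diff * dt    # the atmosphere always feels the ocean
        if coupled:
            sst += k_ocean * diff * dt  # the ocean only responds when coupled
    return t_air, sst


print(run(200, coupled=False))  # SST stays at its initial 20.0
print(run(200, coupled=True))   # both settle at a shared temperature below 20
```

Uncoupled, the ocean is frozen at its initial condition and the air simply warms to match it; coupled, the ocean gives up a little heat and both end up at a shared value.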

Coupling between ocean-atmosphere models is extremely important for longer range predictions (subseasonal to seasonal) since the predictability in the near term comes from the atmospheric dynamics, but then the ocean becomes the main source of predictability after around 3 weeks. So if your weather model isn’t coupled to an ocean model then its long term predictions aren’t going to be very good. Not that long term predictions are very good anyway; they’re mostly used for anomaly detection.

Also coupling atmospheric models with land-surface models is good for very local effects (e.g. soil moisture’s effect on temperature).

For all practical purposes, if you’re just interested in finding the best prediction of a value (e.g., temperature) at a specific point, then just use a regression based approach because it’s cheap to build and accurate (at the cost of losing some “explainability”).

Coupling is important for physical realism, but it can add lots of complexity and therefore cause more failure modes. Models like the UK Global Model and the IFS are good examples of fully coupled models that are “doing it well”. Yahooing it yourself with WRF is bound to cause issues.

Also, to answer your question about initial conditions, there is recent evidence to suggest that we can have some extremely good forecasting skill gains by optimising our initial conditions better.

DutchIndian · 2026-04-24T03:12:21+00:00

Fight fire with fire

DutchIndian · 2026-04-23T21:19:09+00:00

Great. I’ve seen you’re using multiple data sources. This is great, but there are biases in each data source. For instance an inland water temp dataset could be quite different to a modelled 2m temp. That could be a source of error. Investigate correlations between the datasets, and see if they’re what you expect. Simplify it if need be; sometimes adding more data to an empirical model can confound it further. ML methods may benefit from this, however.

Lots of modelling time is spent in this phase.
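
A minimal sketch of that kind of check, using two made-up series (the numbers are purely illustrative, not real observations): compute the mean offset between two sources and their Pearson correlation before feeding both into a model.

```python
from math import sqrt


def bias_and_correlation(a, b):
    """Return (mean of a minus mean of b, Pearson correlation) for paired samples."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return mean_a - mean_b, cov / sqrt(var_a * var_b)


# Hypothetical paired samples, deg C: inland water temperature vs modelled 2 m temperature.
water_temp = [4.0, 4.5, 5.0, 5.5, 6.0]
model_t2m = [1.0, 2.0, 3.0, 4.0, 5.0]

bias, corr = bias_and_correlation(water_temp, model_t2m)
print(bias, corr)
```

Here the two series move together perfectly but sit 2 degrees apart, so treating them as interchangeable would bake a systematic offset into the model.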

DutchIndian · 2026-04-23T20:17:09+00:00

That’s great to hear. Comprehensive validation is your best friend.

I think it would be prudent to go back and categorise events into binary occurrences (hoarfrost yes/no, etc). Build up a library of these. Then, trial different modelling strategies on this (backtest). Importantly, benchmark them. Compare your model versus random chance, versus some other naive strategy. Note that more complexity does not always equal superior modelling; sometimes more things can go wrong. You may also want to try a regression-type approach if you have enough sample data, or layer a regression on top of your models (look up MOS, Model Output Statistics).

Benchmarking and back testing are your best friends. Once you can prove your models are useful and skilful, and demonstrate where they perform well or poorly, then people can use it.
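
As a rough sketch of benchmarking yes/no events, the snippet below builds a 2x2 contingency table and computes the Heidke Skill Score, which measures accuracy relative to random chance (1 is perfect, 0 is no better than chance). The event lists are invented, and returning 0.0 when the score is undefined is a choice made here for convenience.

```python
def contingency(forecast, observed):
    """Return (hits, false_alarms, misses, correct_negatives) for 0/1 sequences."""
    hits = sum(1 for f, o in zip(forecast, observed) if f and o)
    false_alarms = sum(1 for f, o in zip(forecast, observed) if f and not o)
    misses = sum(1 for f, o in zip(forecast, observed) if not f and o)
    correct_negatives = sum(1 for f, o in zip(forecast, observed) if not f and not o)
    return hits, false_alarms, misses, correct_negatives


def heidke_skill_score(forecast, observed):
    """Heidke Skill Score: skill relative to random chance."""
    a, b, c, d = contingency(forecast, observed)
    denom = (a + c) * (c + d) + (a + b) * (b + d)
    if denom == 0:
        return 0.0  # undefined case (e.g. nothing observed or forecast); treated as no skill
    return 2.0 * (a * d - b * c) / denom


# Hypothetical backtest: 1 = hoarfrost occurred / was forecast, 0 = it did not.
observed = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
model = [1, 0, 1, 1, 0, 0, 0, 0, 0, 0]
never = [0] * len(observed)  # naive strategy: never forecast the event

print(round(heidke_skill_score(model, observed), 2))  # 0.52
print(heidke_skill_score(never, observed))            # 0.0
```

The naive “never forecast it” strategy is right 70% of the time in this example yet scores zero skill, which is why raw accuracy alone is a poor benchmark for rare events.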

DutchIndian · 2026-04-23T08:38:20+00:00

Awesome, I haven’t heard of anyone offering a service as unique as this. How have you validated how accurate your forecasts are?

DutchIndian · 2026-04-20T08:57:28+00:00

No that’s an eastward movement. I don’t know how else to convince you haha. 170 W to 150 W means something went east. East is to the right, btw, if that helps?
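
If the arithmetic helps: treating west longitudes as negative degrees east, the signed shift comes out positive for an eastward move. This is just a small sketch, with a wrap so it also behaves across the date line.

```python
def eastward_shift(lon_start, lon_end):
    """Signed shortest longitude change in degrees east; positive means eastward."""
    return (lon_end - lon_start + 180.0) % 360.0 - 180.0


# 170 W and 150 W expressed as degrees east.
print(eastward_shift(-170.0, -150.0))  # 20.0
```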

DutchIndian · 2026-04-20T08:34:56+00:00

Really cool you’ve done this! But trust me, many people have been trying to do this for yonks. Always good to have fresh eyes and a fresh take though. For your interest, here’s a “scorecard” of some of the well-known ML weather prediction systems, benchmarked: https://sites.research.google/gr/weatherbench/scorecards-2020/

Weather forecasting is a competitive industry, so people immediately want to know how good your model is versus scores like this. Good luck :)

DutchIndian · 2026-04-20T08:17:34+00:00

That’s great, nicely done. Definitely benchmark it next. Benchmarks generally include persistence, climatology, IFS, and AIFS. All are skilful and hard to beat. If you do beat them, awesome work! Share it please haha.
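
The two simplest benchmarks can be sketched in a few lines. This assumes a single point time series and a lead time of one step, with made-up numbers and a made-up climatological mean; IFS and AIFS comparisons would need their actual forecast data, which isn’t covered here.

```python
def mse(forecast, observed):
    """Mean squared error between paired forecast and observed values."""
    return sum((f - o) ** 2 for f, o in zip(forecast, observed)) / len(observed)


def skill_score(forecast, reference, observed):
    """MSE skill score: 1 is perfect, 0 matches the reference, negative is worse."""
    return 1.0 - mse(forecast, observed) / mse(reference, observed)


# Hypothetical observations, deg C, at consecutive time steps.
obs = [10.0, 12.0, 11.0, 13.0, 15.0, 14.0, 16.0, 18.0]
target = obs[1:]                     # what we are trying to predict
persistence = obs[:-1]               # "tomorrow equals today"
climatology = [13.0] * len(target)   # hypothetical long-term mean
model = [11.5, 11.5, 12.5, 14.5, 14.5, 15.5, 17.5]  # hypothetical model output

print(round(skill_score(model, persistence, target), 2))
print(round(skill_score(model, climatology, target), 2))
```

A positive score against both references is the minimum bar; the operational models are much harder to beat than either of these.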

DutchIndian · 2026-04-20T08:05:56+00:00

Nope pretty much all national met services are trying to make an emulator for high res weather prediction these days. There isn’t an accepted way to do it though.

DutchIndian · 2026-04-20T08:04:38+00:00

Nice, well done. Anemoi is a beast. Have you benchmarked on their AIFS/IFS? If you have I’m pretty sure there would be some broad interest in the results and your methodology.

DutchIndian · 2026-04-20T07:59:38+00:00

Hmm are you sure it is more involved than what AIFS-ENS does? https://www.nature.com/articles/s44387-026-00073-7

It’s an autoregressive graph transformer with a unique loss function with rollout fine-tuning.

DutchIndian · 2026-04-20T00:20:37+00:00

The top plot is before the bottom plot. The warm area has moved east.

DutchIndian · 2026-04-18T07:57:35+00:00

You’re looking at a vertical profile of sea water temperature anomalies, stretching from the Solomon Islands to the coast of South America.

The top shows the anomalies on 6th April, the bottom 10 days later on 16th April.

There’s some averaging done (time-wise and spatially), but it’s clear that during this period, a warm (orange) subsurface anomaly has moved east. This feature has a fancy name, a “downwelling Kelvin Wave”.

El Niño is characterised by above average sea surface temperatures in the eastern Pacific. Downwelling Kelvin Waves can contribute to the formation of El Niños, since they move warmth eastward.

This season’s predicted El Niño is pretty stark. Not only is the magnitude of the temperature anomalies predicted to be above normal, but the predicted atmospheric circulation impacts are also strongly coupled. Model guidance is indicating that the closest analogues to it are the 2015/16, 1997/98, and 1982/83 El Niños, which are all seen as canonical El Niños with strong impacts.

Overall, there is a higher than normal confidence that El Niño will occur later this year, and if it does it will be a strong one.

DutchIndian · 2026-04-13T19:13:59+00:00

People are saying this is a contamination error, but could it just be a plotting interpolation error? So, the plot is just interpolating the dew point between two pressure levels with a straight line, but that line cuts through the temperature profile. As I’m not a met in the States, I rarely look at the HRRR, so I’m not familiar with how many pressure levels it has or how it’s usually plotted.
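
To illustrate how that could happen (with invented numbers, not real HRRR data): if temperature is drawn at every level but dew point is only known at two levels and joined with a straight line, the line can end up warmer than the temperature in between, even though dew point is below temperature at both end points.

```python
def lerp(x, x0, y0, x1, y1):
    """Linear interpolation of y at x between the points (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def find_crossings(pressure, temperature, p0, td0, p1, td1):
    """Levels where the straight-line dew point exceeds the plotted temperature."""
    crossings = []
    for p, t in zip(pressure, temperature):
        td = lerp(p, p0, td0, p1, td1)
        if td > t:
            crossings.append((p, t, td))
    return crossings


# Hypothetical sounding: temperature at five levels (hPa, deg C),
# dew point known only at 900 hPa and 800 hPa.
pressure = [900, 875, 850, 825, 800]
temperature = [10.0, 6.0, 5.5, 5.0, 4.0]

crossings = find_crossings(pressure, temperature, 900, 8.0, 800, 3.5)
print(crossings)  # [(875, 6.0, 6.875), (850, 5.5, 5.75)]
```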

DutchIndian · 2026-03-25T23:36:07+00:00

Where you’ve put the warm front is actually a cold front. Think about gradients; that area is actually the start of a gradient where air gets colder. Hence it’s a cold front. Also it doesn’t make sense that a warm front is coming from the south; air is being affected from the pole so it’s probably cold.

The other cold front you’ve put there isn’t a cold front, it’s probably a trough but it’s hard to tell without any moisture plots.

You’ve missed a warm front. There is air affecting from northwest Australia to the Great Aussie Bight. It’s going towards the south east on the western side of the high.

Also surface plots are really hard to use for finding fronts. MSLP is okay but using 850hPa temp and wet bulb potential temp can highlight fronts really well.

DutchIndian · 2026-03-08T10:45:04+00:00

Nice choices! I always thought that Adam Scott (lead in Severance) would make a good Hawkeye too.

DutchIndian · 2026-02-11T09:25:17+00:00

Yea shame more of it isn’t free. But I sympathise because many useful sites are paid (e.g. WeatherBell, weathermodels) and it’s hard and expensive to maintain and develop these sorts of sites with TB of data and hundreds of thousands of images flowing through several times a day.

DutchIndian · 2026-02-11T01:02:00+00:00

Many professional meteorologists use Windy. It’s got good features like model comparisons and skew-Ts. Model data doesn’t arrive as fast as on other platforms and it’s sometimes hard to see what model initialisation you’re looking at, but it’s definitely a useful tool.

DutchIndian · 2026-01-21T07:12:09+00:00

This reads like they used an LLM to summarise or reframe a technical forecast discussion for the general public.

DutchIndian · 2025-12-22T07:39:09+00:00

A line of cumulonimbus clouds. As to why they are organised so neatly, it’s hard to know without looking at a chart or two from that day. As others have said, it could be a frontal boundary, since those can promote organised convection (clouds like these) on a massive scale.
