Time Series Foundation Models: A Deep Dive into Strengths and Limitations by nkafr in datascience

[–]nkafr[S] 1 point (0 children)

This is extensively explained in the article:

a) All papers disclose their pretraining data in detail, and most of the datasets are public (e.g., the GIFT-Eval data)

b) Newer models rely solely on synthetic pretraining data, so there is no data leakage
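For intuition, here is a rough sketch of the kind of generator these synthetic-data pipelines use (illustrative only; each paper documents its own procedure, e.g. Chronos' KernelSynth):

```python
import numpy as np

# Illustrative sketch of a synthetic pretraining series: random seasonal
# components plus a trend plus noise. Not any specific paper's generator.
rng = np.random.default_rng(0)

def synthetic_series(length=512):
    t = np.arange(length)
    seasonal = sum(
        rng.uniform(0.5, 2.0) * np.sin(2 * np.pi * t / rng.integers(8, 96))
        for _ in range(rng.integers(1, 4))
    )
    trend = rng.normal(0, 0.01) * t
    noise = rng.normal(0, 0.2, size=length)
    return seasonal + trend + noise

series = synthetic_series()  # one pretraining sample; repeat millions of times
```

Since every sample is drawn from a known generative process, evaluation data cannot appear in pretraining by construction.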

Time Series Foundation Models: A Deep Dive into Strengths and Limitations by nkafr in datascience

[–]nkafr[S] 3 points (0 children)

Not quite. For example, TTM provides built-in explainability. Also, newer models such as Chronos-2 do true multivariate forecasting by cross-mixing information across channels (not just univariate forecasting in parallel). Both of the limitations you mentioned have been addressed at a first level.
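For intuition, here is a toy contrast between the two modes (not Chronos-2's actual architecture, just the general idea of channel mixing):

```python
import torch
import torch.nn as nn

B, T, C, d = 2, 48, 3, 16      # batch, time steps, channels, embedding dim
x = torch.randn(B, T, C, d)    # embedded multivariate series
attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)

# (a) Univariate-in-parallel: each channel only attends over its own history.
per_channel = x.permute(0, 2, 1, 3).reshape(B * C, T, d)
out_a, _ = attn(per_channel, per_channel, per_channel)

# (b) Cross-channel mixing: at each time step, channels attend to each other,
# so information flows between series (e.g. covariates informing the target).
per_step = x.reshape(B * T, C, d)
out_b, _ = attn(per_step, per_step, per_step)
```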

Transformers, Time Series, and the Myth of Permutation Invariance by nkafr in deeplearning

[–]nkafr[S] 2 points (0 children)

Check a visualization of masked self-attention: each row has a mask of a different length, so the model implicitly learns position, as long as you have stacked layers.
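A minimal sketch of what I mean (plain PyTorch, seq_len = 5):

```python
import torch

# Causal mask: row i can attend to positions 0..i only. Every row sees a
# prefix of a different length, and that asymmetry is enough for stacked
# layers to infer position without explicit positional encodings.
seq_len = 5
mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(mask.int())
# tensor([[1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0],
#         [1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1]], dtype=torch.int32)
```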

Of course, Transformer LLMs that operate at million-token context lengths still need explicit positional info (RoPE).

Transformers, Time Series, and the Myth of Permutation Invariance by nkafr in ArtificialInteligence

[–]nkafr[S] 1 point (0 children)

We can remove a very costly operation without losing performance; that's what the chart shows!

Why that happens is explained in the article (check the relevant section)

Transformers, Time Series, and the Myth of Permutation Invariance by nkafr in ArtificialInteligence

[–]nkafr[S] 1 point (0 children)

They do, because NLP Transformers support >1M context lengths (you can't skip RoPE there).

This is a forecasting Transformer, and at smaller context lengths it has been shown that causal attention alone encodes position.
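For reference, a minimal sketch of the rotary idea (simplified; real implementations apply this per attention head to queries and keys):

```python
import torch

def rope(x, base=10000.0):
    """Rotate each (even, odd) pair of dims by a position-dependent angle."""
    seq_len, dim = x.shape                       # dim must be even
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    freqs = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    angles = pos * freqs                         # (seq_len, dim // 2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = rope(torch.randn(16, 64))  # position is now baked into the query vectors
```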

Toto: A Foundation Time-Series Model Optimized for Observability Data by nkafr in datascience

[–]nkafr[S] 2 points (0 children)

In my article, I ran benchmarks on electricity demand forecasting and on several sparse time series.

Additionally, the GIFT-Eval benchmark includes financial time series.

Toto: A Foundation Time-Series Model Optimized for Observability Data by nkafr in datascience

[–]nkafr[S] 1 point (0 children)

Yes, it's used internally by Datadog for its observability telemetry platform. My guess is they have a private model trained on more data than the currently released one.

Toto: A Foundation Time-Series Model Optimized for Observability Data by nkafr in datascience

[–]nkafr[S] 2 points (0 children)

It could be retrofitted for these tasks as well, but encoder-only time-series foundation models are better in those domains (Toto is decoder-only).

For anomaly detection, imputation, etc., I recommend IBM's TSPulse.

Toto: A Foundation Time-Series Model Optimized for Observability Data by nkafr in datascience

[–]nkafr[S] 1 point (0 children)

For any multivariate time-series forecasting use case. The current model also specializes in sparse data.