TimesFM: Google's 200M Parameter Decoder for Time Series Forecasting

April 2, 2026

repo-review

by Florian Narr


What it does

TimesFM is a pretrained time series foundation model from Google Research. You feed it historical data — sales, sensor readings, stock prices, whatever — and it forecasts forward with quantile uncertainty estimates. No training required. The 2.5 release (200M parameters, down from 500M) supports up to 16k context length and 1k forecast horizon.

Why I starred it

Most time series forecasting still means training a model per dataset. TimesFM flips that. It's a general-purpose pretrained model that works zero-shot on arbitrary time series, similar to how LLMs work on arbitrary text. The paper (ICML 2024) backs this up with benchmarks across multiple domains.

What caught my eye was the architecture: a decoder-only transformer applied to time series. Not a novelty wrapper around GPT — an actual purpose-built model with domain-specific decisions throughout.

How it works

The core architecture lives in src/timesfm/timesfm_2p5/. The model definition in timesfm_2p5_base.py sets the constants: 32-token input patches, 128-token output patches, 20 transformer layers, 16 attention heads, 1280 model dimensions. That's the 200M parameter budget.

The input pipeline is where it gets interesting. Raw time series values are chunked into patches of 32, then each patch gets concatenated with its mask and run through a ResidualBlock tokenizer — a two-layer MLP with swish activation that maps from 64 dimensions (32 values + 32 mask bits) into the 1280-dimensional model space.

# timesfm_2p5_torch.py
tokenizer_inputs = torch.cat([inputs, masks.to(inputs.dtype)], dim=-1)
input_embeddings = self.tokenizer(tokenizer_inputs)
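
A minimal sketch of that tokenizer, assuming a conventional residual-MLP layout (the class and layer names are my guesses for illustration, not the repo's exact code):

```python
import torch
import torch.nn as nn

class ResidualBlockSketch(nn.Module):
    """Two-layer MLP with swish (SiLU) activation plus a residual
    projection, mapping a 64-dim patch+mask vector into model space.
    Layer names are illustrative, not the repo's."""

    def __init__(self, in_dim: int = 64, hidden_dim: int = 1280, out_dim: int = 1280):
        super().__init__()
        self.hidden = nn.Linear(in_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, out_dim)
        self.residual = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.output(nn.functional.silu(self.hidden(x))) + self.residual(x)

tokenizer = ResidualBlockSketch()
patches = torch.randn(2, 16, 64)   # (batch, num_patches, 32 values + 32 mask bits)
embeddings = tokenizer(patches)    # -> (2, 16, 1280)
```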

The masking is crucial. TimesFM handles variable-length inputs by left-padding with zeros and tracking which positions are real data versus padding. The forecast method in timesfm_2p5_base.py does this explicitly:

# timesfm_2p5_base.py: truncate long series, left-pad short ones
if (w := len(value)) >= context:
    value = value[-context:]
    mask = np.zeros_like(value, dtype=bool)  # every position is real data
else:
    mask = np.array([True] * (context - w) + [False] * w)  # True marks padding
    value = np.pad(value, (context - w, 0), "constant", constant_values=0.0)

The normalization story is the cleverest part. In src/timesfm/torch/util.py, there's a revin function — reversible instance normalization. Instead of normalizing the entire input once, the model tracks running statistics (mean, standard deviation) patch by patch using Welford's online algorithm in update_running_stats. Each patch gets normalized against the running stats up to that point, and the output gets denormalized with the same stats. This means the model sees relative patterns rather than absolute magnitudes.

def revin(x, mu, sigma, reverse=False):
    # Reversible instance normalization: normalize on the way in,
    # denormalize with the same running stats on the way out.
    if reverse:
        return x * sigma + mu
    else:
        # Guard against near-zero sigma (_TOLERANCE is defined in util.py).
        return (x - mu) / torch.where(sigma < _TOLERANCE, 1.0, sigma)
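
The running-stats side can be sketched with textbook Welford. This is a hand-rolled illustration; the repo's update_running_stats works on whole patches at once and may differ in detail:

```python
import math

def update_running_stats(count, mean, m2, patch):
    """Welford's online algorithm over one patch of values.
    Returns updated (count, mean, m2); sigma = sqrt(m2 / count)."""
    for x in patch:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    return count, mean, m2

# Accumulate stats patch by patch, the order the model sees them.
count, mean, m2 = 0, 0.0, 0.0
for patch in [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]:
    count, mean, m2 = update_running_stats(count, mean, m2, patch)
sigma = math.sqrt(m2 / count)  # population std of everything seen so far
```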

Decoding is autoregressive. The first forward pass processes all context patches, produces an output patch of 128 values, then feeds those back as 4 new input patches (128 / 32 = 4) for the next step. KV caching is implemented via DecodeCache — standard transformer inference optimization. The decode method in timesfm_2p5_torch.py:96 orchestrates this loop.
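
The shape of that loop, with a toy stand-in for the model's single-step forward (the real decode also threads the KV cache and running stats through each step):

```python
def autoregressive_decode(step_fn, context_patches, num_steps):
    """Sketch of the decode loop: each step emits one 128-value output
    patch, which is split into four 32-value patches and fed back.
    step_fn is a stand-in for the model's forward pass."""
    outputs = []
    patches = list(context_patches)
    for _ in range(num_steps):
        out128 = step_fn(patches)                 # one 128-value output patch
        outputs.extend(out128)
        # 128 / 32 = 4 new input patches for the next step
        patches += [out128[i * 32:(i + 1) * 32] for i in range(4)]
    return outputs

# Persistence stand-in: repeat the last seen value 128 times per step.
forecast = autoregressive_decode(
    lambda ps: [ps[-1][-1]] * 128,
    context_patches=[[float(i) for i in range(32)]],
    num_steps=2,
)
# len(forecast) == 256
```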

One flag worth noting: force_flip_invariance. TimesFM guarantees that TimesFM(aX + b) = a * TimesFM(X) + b for positive a by default. When you enable flip invariance, it also handles negative a by running inference on both the original and negated input, then averaging. Two forward passes for symmetry — an expensive but mathematically clean solution.
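
The averaging trick is easy to sketch in numpy. Note that forecast_fn here is a stand-in for one TimesFM inference pass, not the library API:

```python
import numpy as np

def flip_invariant(forecast_fn, x):
    """Sketch of force_flip_invariance: forecast x and -x, negate the
    second result, and average the two passes."""
    return (forecast_fn(x) - forecast_fn(-x)) / 2.0

# With a persistence "model", the symmetrized forecast satisfies
# f(-x) == -f(x) by construction.
persist = lambda x: np.full(3, x[-1])
x = np.array([1.0, 2.0, 3.0])
assert np.allclose(flip_invariant(persist, -x), -flip_invariant(persist, x))
```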

The transformer itself uses rotary position embeddings, RMS normalization, fused QKV projections, and swish activations. The attention implementation in src/timesfm/torch/transformer.py offers both a manual einsum path and PyTorch's fused scaled_dot_product_attention with scale=1.0 (unscaled attention, since the model was trained that way).

Using it

import torch
import numpy as np
import timesfm

torch.set_float32_matmul_precision("high")

model = timesfm.TimesFM_2p5_200M_torch.from_pretrained(
    "google/timesfm-2.5-200m-pytorch"
)

model.compile(
    timesfm.ForecastConfig(
        max_context=1024,
        max_horizon=256,
        normalize_inputs=True,
        use_continuous_quantile_head=True,
        force_flip_invariance=True,
    )
)

# Forecast 12 steps on two dummy series
point_forecast, quantile_forecast = model.forecast(
    horizon=12,
    inputs=[
        np.linspace(0, 1, 100),
        np.sin(np.linspace(0, 20, 67)),
    ],
)
# point_forecast.shape: (2, 12)
# quantile_forecast.shape: (2, 12, 10)  mean + 10th-90th percentiles

The covariate support (forecast_with_covariates) adds external regressors via a linear model. Two modes: "timesfm + xreg" fits the linear model on TimesFM's residuals, while "xreg + timesfm" removes the linear component first and lets TimesFM forecast the remainder. Both use ridge regression under the hood via xreg_lib.py.
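
The "timesfm + xreg" idea in miniature. The function name and the closed-form ridge solve are my illustration, not the xreg_lib.py API:

```python
import numpy as np

def ridge_on_residuals(y, timesfm_fit, X, l2=1e-6):
    """Ridge-fit covariates X on whatever the base forecast leaves
    unexplained. Sketch only; the repo's implementation differs."""
    resid = y - timesfm_fit
    beta = np.linalg.solve(X.T @ X + l2 * np.eye(X.shape[1]), X.T @ resid)
    return beta  # apply to future covariates, add back to the forecast

# Toy check: recover a known linear effect on top of a "model" fit.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
beta_true = np.array([1.5, -0.5])
base = np.sin(np.linspace(0, 10, 200))   # pretend TimesFM explains this part
y = base + X @ beta_true
beta = ridge_on_residuals(y, base, X)
```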

Installation uses uv and offers separate extras for torch, flax, and xreg. The base dependencies are surprisingly lean: just numpy, huggingface_hub, and safetensors.

Rough edges

The compile step has rigid alignment requirements. max_context must be a multiple of the 32-token patch size and max_horizon a multiple of 128. The code silently rounds up, which can waste memory if you're not paying attention.
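
The effect of that silent round-up, using the standard ceiling-to-multiple trick (my illustration, not the repo's code):

```python
def round_up(n: int, multiple: int) -> int:
    """Round n up to the next multiple."""
    return ((n + multiple - 1) // multiple) * multiple

round_up(1000, 32)    # a max_context of 1000 is padded out to 1024
round_up(200, 128)    # a max_horizon of 200 silently becomes 256
```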

Documentation is thin. The config dataclass in configs.py has a window_size field with a TODO comment (TODO(siriuz42):implement it), which suggests decomposed forecasting was planned but never shipped. The docstrings are decent, but there are no tutorials beyond the README code snippet.

No test files in the repository. For a model with 15k stars that's integrated into BigQuery, the absence of a test suite is surprising. You're trusting the ICML paper and Google's internal testing.

The v1 directory still sits in the repo root — old 500M model code that's effectively dead but not cleaned up. Minor, but it makes navigation slightly confusing for newcomers.

torch.compile is called by default during checkpoint loading. On first run, expect a significant warmup delay as PyTorch traces and compiles the model. There's a torch_compile=False flag if you need faster startup at the cost of slower inference.

Bottom line

If you need time series forecasting without training a model per dataset, TimesFM is the strongest open option from a major lab. The architecture is clean, the quantile estimates are useful for uncertainty-aware applications, and the 200M parameter count keeps it runnable on a single GPU. Best suited for batch forecasting workloads — the compile-then-predict pattern isn't ideal for one-off predictions.

google-research/timesfm on GitHub