Modern Forecasting Methods

Transformer-based models, N-BEATS, foundation models, and probabilistic forecasting


The latest advances in time series forecasting leverage Transformer architectures, specialized neural network designs, and foundation models pre-trained on massive datasets.

Temporal Fusion Transformers (TFT)

TFT (Google, 2019) is an attention-based architecture specifically designed for multi-horizon forecasting:

  • Variable selection networks: Automatically select relevant features
  • Static covariate encoders: Handles time-invariant metadata (e.g., store ID)
  • LSTM encoder-decoder: Captures temporal patterns
  • Multi-head attention: Focuses on the most relevant time steps
  • Quantile outputs: Produces prediction intervals, not just point forecasts

TFT excels when you have:

  • Multiple time series (e.g., sales for 1000 products)
  • Rich metadata (categories, locations)
  • Known future inputs (holidays, promotions)
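The variable selection idea above can be sketched as a softmax gate that assigns each input feature a learned importance weight. This is a simplified illustration, not TFT's actual implementation (which uses gated residual networks and per-variable embeddings); `variable_selection`, `W_gate`, and `b_gate` are illustrative names:

```python
import numpy as np

def variable_selection(x, W_gate, b_gate):
    """Re-weight input features by learned softmax importances.

    x: (n_features,) input at one time step
    W_gate, b_gate: parameters of the gating network
    """
    logits = x @ W_gate + b_gate           # one logit per feature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()               # softmax -> importances sum to 1
    return x * weights, weights

rng = np.random.default_rng(0)
n_features = 5
x = rng.normal(size=n_features)
W_gate = rng.normal(size=(n_features, n_features)) * 0.1
b_gate = np.zeros(n_features)

selected, importances = variable_selection(x, W_gate, b_gate)
print("importances:", np.round(importances, 3))
```

Because the weights are produced by a network, the model can learn to downweight uninformative covariates per time step.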

N-BEATS (Neural Basis Expansion)

    N-BEATS (Oreshkin et al., 2019) is a pure deep learning architecture that achieved state-of-the-art results:

    Architecture

  • Stack of blocks, each with a fully connected network
  • Each block outputs a backcast (reconstruction of input) and a forecast
  • Blocks are organized into stacks (trend stack, seasonality stack)
  • Residual connections: each block processes what the previous one couldn't explain
```python
import numpy as np

class NBEATSBlock:
    """A single N-BEATS block (simplified)."""

    def __init__(self, input_dim, hidden_dim, backcast_dim, forecast_dim):
        self.input_dim = input_dim
        self.backcast_dim = backcast_dim
        self.forecast_dim = forecast_dim

        scale = np.sqrt(2.0 / hidden_dim)

        # Fully connected layers
        self.W1 = np.random.randn(input_dim, hidden_dim) * scale
        self.b1 = np.zeros(hidden_dim)
        self.W2 = np.random.randn(hidden_dim, hidden_dim) * scale
        self.b2 = np.zeros(hidden_dim)

        # Backcast and forecast heads
        self.W_back = np.random.randn(hidden_dim, backcast_dim) * scale
        self.b_back = np.zeros(backcast_dim)
        self.W_fore = np.random.randn(hidden_dim, forecast_dim) * scale
        self.b_fore = np.zeros(forecast_dim)

    def forward(self, x):
        """Forward pass returning backcast and forecast."""
        h = np.maximum(0, x @ self.W1 + self.b1)
        h = np.maximum(0, h @ self.W2 + self.b2)
        backcast = h @ self.W_back + self.b_back
        forecast = h @ self.W_fore + self.b_fore
        return backcast, forecast


class SimpleNBEATS:
    """Simplified N-BEATS model."""

    def __init__(self, input_dim, forecast_dim, n_blocks=3, hidden_dim=64):
        self.blocks = [
            NBEATSBlock(input_dim, hidden_dim, input_dim, forecast_dim)
            for _ in range(n_blocks)
        ]

    def forward(self, x):
        """
        Process input through all blocks with residual learning.
        Each block sees the residual from previous blocks.
        """
        residual = x.copy()
        total_forecast = np.zeros(self.blocks[0].forecast_dim)

        for block in self.blocks:
            backcast, forecast = block.forward(residual)
            residual = residual - backcast  # Subtract what was explained
            total_forecast = total_forecast + forecast  # Add forecast contribution

        return total_forecast


# Demo
input_dim = 30     # Look-back window
forecast_dim = 10  # Forecast horizon

model = SimpleNBEATS(input_dim, forecast_dim, n_blocks=3, hidden_dim=32)

# Test with a sample input
x = np.random.randn(input_dim)
forecast = model.forward(x)
print(f"Input shape: {x.shape}")
print(f"Forecast shape: {forecast.shape}")
print(f"Forecast: {np.round(forecast, 3)}")
```

    PatchTST (Patch Time Series Transformer)

    PatchTST (Nie et al., 2023) adapts Vision Transformer ideas to time series:

1. Patching: Divide the time series into non-overlapping patches (e.g., 16 consecutive points = 1 patch)
2. Patch embedding: Project each patch to an embedding vector
3. Transformer encoder: Apply self-attention across patches
4. Channel independence: Process each variable independently

Attending over patches is much cheaper than point-level attention: for a series of length n and patch length p, self-attention cost drops from O(n^2) to O((n/p)^2).
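Steps 1 and 2 can be sketched in a few lines of NumPy. This is an illustration only (`patchify` and `W_embed` are made-up names; a real PatchTST adds positional encodings and a Transformer encoder on top):

```python
import numpy as np

def patchify(series, patch_len):
    """Split a 1-D series into non-overlapping patches of length patch_len."""
    n_patches = len(series) // patch_len
    return series[: n_patches * patch_len].reshape(n_patches, patch_len)

rng = np.random.default_rng(0)
series = rng.normal(size=96)

patches = patchify(series, patch_len=16)   # (6, 16): 6 patches of 16 points
W_embed = rng.normal(size=(16, 32)) * 0.1  # linear patch embedding
tokens = patches @ W_embed                 # (6, 32): 6 tokens instead of 96

print(patches.shape, tokens.shape)
```

Attention now operates over 6 patch tokens rather than 96 individual points, which is where the quadratic saving comes from.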

    Foundation Models for Time Series

    Just as GPT revolutionized NLP, foundation models are emerging for time series:

    TimeGPT (Nixtla)

  • Pre-trained on 100B+ time series data points
  • Zero-shot forecasting: works without task-specific training
  • API-based service

Chronos (Amazon)

  • Pre-trained language model adapted for time series
  • Tokenizes time series values into bins
  • Generates forecasts autoregressively
  • Open-source (available on HuggingFace)
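The tokenization step mentioned above (scale the series, then quantize values into bins that act as vocabulary tokens) can be sketched as follows. This is a simplified illustration of the idea, not Chronos's actual implementation; `tokenize`, `detokenize`, and the bin range are assumptions:

```python
import numpy as np

def tokenize(series, n_bins=100, low=-5.0, high=5.0):
    """Mean-scale a series and quantize values into integer token bins."""
    scale = np.mean(np.abs(series)) + 1e-8
    scaled = series / scale                    # bring values into a common range
    edges = np.linspace(low, high, n_bins + 1)
    tokens = np.clip(np.digitize(scaled, edges) - 1, 0, n_bins - 1)
    return tokens, scale

def detokenize(tokens, scale, n_bins=100, low=-5.0, high=5.0):
    """Map tokens back to approximate values (bin centers)."""
    centers = low + (np.asarray(tokens) + 0.5) * (high - low) / n_bins
    return centers * scale

series = np.array([1.0, 2.0, -1.5, 0.5, 3.0])
tokens, scale = tokenize(series)
recovered = detokenize(tokens, scale)
print(tokens)
print(np.round(recovered, 2))  # close to the original values
```

Once values are tokens, an off-the-shelf language-model architecture can predict the next token, i.e., generate the forecast autoregressively.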

Lag-Llama

  • Decoder-only Transformer foundation model (LLaMA-style architecture)
  • Uses lagged features as input tokens
  • Probabilistic outputs via distribution parameters

Probabilistic Forecasting

Point forecasts (a single number) are often insufficient. Probabilistic forecasts provide uncertainty estimates:

  • Quantile regression: Predict the 10th, 50th, 90th percentiles
  • Distributional: Predict parameters of a distribution (mean + variance for Gaussian)
  • Conformal prediction: Provides calibrated prediction intervals with guaranteed coverage
  • Monte Carlo dropout: Use dropout at inference time for uncertainty via multiple passes

Knowing uncertainty is critical for decision-making: ordering inventory, managing risk, etc.
```python
import numpy as np

class QuantileForecaster:
    """
    Quantile regression for probabilistic forecasting.
    Predicts multiple quantiles to form prediction intervals.
    """

    def __init__(self, input_dim, quantiles=(0.1, 0.5, 0.9)):
        self.quantiles = quantiles
        self.models = {}

        for q in quantiles:
            # Separate linear model for each quantile
            self.models[q] = {
                'W': np.random.randn(input_dim, 1) * 0.01,
                'b': np.zeros(1),
            }

    def quantile_loss(self, y_true, y_pred, q):
        """Pinball loss for quantile regression."""
        errors = y_true - y_pred
        return np.mean(np.maximum(q * errors, (q - 1) * errors))

    def fit(self, X, y, epochs=200, lr=0.001):
        """Train each quantile model separately."""
        for q in self.quantiles:
            W = self.models[q]['W']
            b = self.models[q]['b']

            for _ in range(epochs):
                pred = (X @ W + b).flatten()
                errors = y - pred

                # Gradient of pinball loss
                grad = np.where(errors >= 0, -q, -(q - 1))
                grad_W = (X.T @ grad.reshape(-1, 1)) / len(y)
                grad_b = np.mean(grad)

                W -= lr * grad_W
                b -= lr * grad_b

            self.models[q]['W'] = W
            self.models[q]['b'] = b

    def predict(self, X):
        """Predict all quantiles."""
        predictions = {}
        for q in self.quantiles:
            W = self.models[q]['W']
            b = self.models[q]['b']
            predictions[q] = (X @ W + b).flatten()
        return predictions


# Demo: probabilistic forecast
np.random.seed(42)
n = 300
t = np.arange(n, dtype=float)
y = 2 * np.sin(t / 10) + t * 0.01 + np.random.randn(n) * 0.5

# Create features (lagged values)
window = 10
X = np.array([y[i:i+window] for i in range(n - window)])
targets = y[window:]

# Split
split = 250
X_train, X_test = X[:split], X[split:]
y_train, y_test = targets[:split], targets[split:]

# Train quantile forecaster
model = QuantileForecaster(input_dim=window, quantiles=(0.1, 0.25, 0.5, 0.75, 0.9))
model.fit(X_train, y_train, epochs=300, lr=0.001)

# Predict
preds = model.predict(X_test)

# Evaluate coverage
p10, p50, p90 = preds[0.1], preds[0.5], preds[0.9]
coverage_80 = np.mean((y_test >= p10) & (y_test <= p90))
mae_median = np.mean(np.abs(y_test - p50))

print(f"80% interval coverage: {coverage_80*100:.1f}% (target: 80%)")
print(f"Median forecast MAE: {mae_median:.4f}")
print(f"Average interval width: {np.mean(p90 - p10):.4f}")
```
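Conformal prediction, also listed above, is the easiest of these to bolt onto any point forecaster. Here is a minimal split-conformal sketch: hold out a calibration set, take a quantile of its absolute residuals, and widen the point forecasts by that amount. The least-squares base model and all names are illustrative choices, not a specific library's API:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
X = rng.normal(size=(n, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 1.5]) + rng.normal(scale=0.5, size=n)

# Split: train / calibration / test
X_tr, y_tr = X[:200], y[:200]
X_cal, y_cal = X[200:300], y[200:300]
X_te, y_te = X[300:], y[300:]

# Any point forecaster works; here, plain least squares
w, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)

# Calibration: quantile of absolute residuals with conformal correction
alpha = 0.2  # target 80% coverage
resid = np.abs(y_cal - X_cal @ w)
k = int(np.ceil((1 - alpha) * (len(resid) + 1)))
q = np.sort(resid)[min(k, len(resid)) - 1]

# Intervals on test data: point forecast +/- calibrated width
pred = X_te @ w
lower, upper = pred - q, pred + q
coverage = np.mean((y_te >= lower) & (y_te <= upper))
print(f"Empirical coverage: {coverage*100:.1f}% (target: {(1-alpha)*100:.0f}%)")
```

The coverage guarantee holds regardless of the base model, which is what makes the approach attractive for wrapping black-box forecasters.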

    Choosing the Right Method

| Method | Best For | Data Requirement | Interpretability |
| --- | --- | --- | --- |
| ARIMA/SARIMA | Single series, clear patterns | Small-medium | High |
| Prophet | Business data with holidays | Medium | High |
| LSTM/GRU | Complex nonlinear patterns | Large | Low |
| TCN | Long sequences, need speed | Large | Low |
| N-BEATS | Pure forecasting benchmark | Large | Medium |
| TFT | Multi-series with metadata | Very large | Medium |
| PatchTST | Long-context multivariate | Large | Low |
| Foundation models | Zero/few-shot scenarios | Pre-trained | Low |

    Decision Framework:

1. Start with naive baselines and exponential smoothing
2. Try ARIMA/Prophet for interpretable models
3. Move to deep learning if you have enough data and classical methods underperform
4. Consider foundation models for quick prototyping or when training data is limited
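Step 1 costs only a few lines of code. A sketch of the two standard baselines, naive (last value carried forward) and seasonal-naive (repeat the last full cycle); the helper names are illustrative:

```python
import numpy as np

def naive_forecast(history, horizon):
    """Repeat the last observed value over the horizon."""
    return np.full(horizon, history[-1])

def seasonal_naive_forecast(history, horizon, season=12):
    """Repeat the last full seasonal cycle over the horizon."""
    cycle = history[-season:]
    return np.resize(cycle, horizon)  # np.resize tiles the cycle as needed

history = np.array([10, 12, 14, 11, 10, 12, 15, 11], dtype=float)
print(naive_forecast(history, 3))                     # [11. 11. 11.]
print(seasonal_naive_forecast(history, 6, season=4))  # tiles [10 12 15 11]
```

Any model that cannot beat these baselines on a held-out set is not worth deploying, which is why they come first in the framework.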