Modern Forecasting Methods

Transformer-based models, N-BEATS, foundation models, and probabilistic forecasting


The latest advances in time series forecasting leverage Transformer architectures, specialized neural network designs, and foundation models pre-trained on massive datasets.

Temporal Fusion Transformers (TFT)

TFT (Google, 2019) is an attention-based architecture specifically designed for multi-horizon forecasting:

  • Variable selection networks: Automatically select relevant features
  • Static covariate encoders: Handles time-invariant metadata (e.g., store ID)
  • LSTM encoder-decoder: Captures temporal patterns
  • Multi-head attention: Focuses on the most relevant time steps
  • Quantile outputs: Produces prediction intervals, not just point forecasts

TFT excels when you have:

  • Multiple time series (e.g., sales for 1000 products)
  • Rich metadata (categories, locations)
  • Known future inputs (holidays, promotions)
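The variable selection idea above can be sketched as a softmax gate that assigns each input feature a learned importance weight. This is a simplified illustration, not TFT's actual implementation (which uses gated residual networks and per-variable embeddings); `variable_selection`, `W_gate`, and `b_gate` are illustrative names:

```python
import numpy as np

def variable_selection(x, W_gate, b_gate):
    """Re-weight input features by learned softmax importances.

    x: (n_features,) input at one time step
    W_gate, b_gate: parameters of the gating network
    """
    logits = x @ W_gate + b_gate           # one logit per feature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()               # softmax -> importances sum to 1
    return x * weights, weights

rng = np.random.default_rng(0)
n_features = 5
x = rng.normal(size=n_features)
W_gate = rng.normal(size=(n_features, n_features)) * 0.1
b_gate = np.zeros(n_features)

selected, importances = variable_selection(x, W_gate, b_gate)
print("importances:", np.round(importances, 3))
```

Because the weights are produced by a network, the model can learn to downweight uninformative covariates per time step.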

N-BEATS (Neural Basis Expansion)

    N-BEATS (Oreshkin et al., 2019) is a pure deep learning architecture that achieved state-of-the-art results:

    Architecture

  • Stack of blocks, each with a fully connected network
  • Each block outputs a backcast (reconstruction of input) and a forecast
  • Blocks are organized into stacks (trend stack, seasonality stack)
  • Residual connections: each block processes what the previous one couldn't explain
```python
import numpy as np

class NBEATSBlock:
    """A single N-BEATS block (simplified)."""

    def __init__(self, input_dim, hidden_dim, backcast_dim, forecast_dim):
        self.input_dim = input_dim
        self.backcast_dim = backcast_dim
        self.forecast_dim = forecast_dim

        scale = np.sqrt(2.0 / hidden_dim)

        # Fully connected layers
        self.W1 = np.random.randn(input_dim, hidden_dim) * scale
        self.b1 = np.zeros(hidden_dim)
        self.W2 = np.random.randn(hidden_dim, hidden_dim) * scale
        self.b2 = np.zeros(hidden_dim)

        # Backcast and forecast heads
        self.W_back = np.random.randn(hidden_dim, backcast_dim) * scale
        self.b_back = np.zeros(backcast_dim)
        self.W_fore = np.random.randn(hidden_dim, forecast_dim) * scale
        self.b_fore = np.zeros(forecast_dim)

    def forward(self, x):
        """Forward pass returning backcast and forecast."""
        h = np.maximum(0, x @ self.W1 + self.b1)
        h = np.maximum(0, h @ self.W2 + self.b2)
        backcast = h @ self.W_back + self.b_back
        forecast = h @ self.W_fore + self.b_fore
        return backcast, forecast


class SimpleNBEATS:
    """Simplified N-BEATS model."""

    def __init__(self, input_dim, forecast_dim, n_blocks=3, hidden_dim=64):
        self.blocks = [
            NBEATSBlock(input_dim, hidden_dim, input_dim, forecast_dim)
            for _ in range(n_blocks)
        ]

    def forward(self, x):
        """
        Process input through all blocks with residual learning.
        Each block sees the residual from previous blocks.
        """
        residual = x.copy()
        total_forecast = np.zeros(self.blocks[0].forecast_dim)

        for block in self.blocks:
            backcast, forecast = block.forward(residual)
            residual = residual - backcast  # Subtract what was explained
            total_forecast = total_forecast + forecast  # Add forecast contribution

        return total_forecast


# Demo
input_dim = 30     # Look-back window
forecast_dim = 10  # Forecast horizon

model = SimpleNBEATS(input_dim, forecast_dim, n_blocks=3, hidden_dim=32)

# Test with a sample input
x = np.random.randn(input_dim)
forecast = model.forward(x)
print(f"Input shape: {x.shape}")
print(f"Forecast shape: {forecast.shape}")
print(f"Forecast: {np.round(forecast, 3)}")
```

    PatchTST (Patch Time Series Transformer)

    PatchTST (Nie et al., 2023) adapts Vision Transformer ideas to time series:

1. Patching: Divide the time series into non-overlapping patches (e.g., 16 consecutive points = 1 patch)
2. Patch embedding: Project each patch to an embedding vector
3. Transformer encoder: Apply self-attention across patches
4. Channel independence: Process each variable independently

Attending over patches is much cheaper than point-level attention: for a series of length n and patch length p, self-attention cost drops from O(n^2) to O((n/p)^2).
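Steps 1 and 2 can be sketched in a few lines of NumPy. This is an illustration only (`patchify` and `W_embed` are made-up names; a real PatchTST adds positional encodings and a Transformer encoder on top):

```python
import numpy as np

def patchify(series, patch_len):
    """Split a 1-D series into non-overlapping patches of length patch_len."""
    n_patches = len(series) // patch_len
    return series[: n_patches * patch_len].reshape(n_patches, patch_len)

rng = np.random.default_rng(0)
series = rng.normal(size=96)

patches = patchify(series, patch_len=16)   # (6, 16): 6 patches of 16 points
W_embed = rng.normal(size=(16, 32)) * 0.1  # linear patch embedding
tokens = patches @ W_embed                 # (6, 32): 6 tokens instead of 96

print(patches.shape, tokens.shape)
```

Attention now operates over 6 patch tokens rather than 96 individual points, which is where the quadratic saving comes from.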

    Foundation Models for Time Series

    Just as GPT revolutionized NLP, foundation models are emerging for time series:

    TimeGPT (Nixtla)

  • Pre-trained on 100B+ time series data points
  • Zero-shot forecasting: works without task-specific training
  • API-based service

Chronos (Amazon)

  • Pre-trained language model adapted for time series
  • Tokenizes time series values into bins
  • Generates forecasts autoregressively
  • Open-source (available on HuggingFace)
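The tokenization step mentioned above (scale the series, then quantize values into bins that act as vocabulary tokens) can be sketched as follows. This is a simplified illustration of the idea, not Chronos's actual implementation; `tokenize`, `detokenize`, and the bin range are assumptions:

```python
import numpy as np

def tokenize(series, n_bins=100, low=-5.0, high=5.0):
    """Mean-scale a series and quantize values into integer token bins."""
    scale = np.mean(np.abs(series)) + 1e-8
    scaled = series / scale                    # bring values into a common range
    edges = np.linspace(low, high, n_bins + 1)
    tokens = np.clip(np.digitize(scaled, edges) - 1, 0, n_bins - 1)
    return tokens, scale

def detokenize(tokens, scale, n_bins=100, low=-5.0, high=5.0):
    """Map tokens back to approximate values (bin centers)."""
    centers = low + (np.asarray(tokens) + 0.5) * (high - low) / n_bins
    return centers * scale

series = np.array([1.0, 2.0, -1.5, 0.5, 3.0])
tokens, scale = tokenize(series)
recovered = detokenize(tokens, scale)
print(tokens)
print(np.round(recovered, 2))  # close to the original values
```

Once values are tokens, an off-the-shelf language-model architecture can predict the next token, i.e., generate the forecast autoregressively.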

Lag-Llama

  • Decoder-only Transformer foundation model (LLaMA-style architecture)
  • Uses lagged features as input tokens
  • Probabilistic outputs via distribution parameters

Probabilistic Forecasting

Point forecasts (a single number) are often insufficient. Probabilistic forecasts provide uncertainty estimates:

  • Quantile regression: Predict the 10th, 50th, 90th percentiles
  • Distributional: Predict parameters of a distribution (mean + variance for Gaussian)
  • Conformal prediction: Provides calibrated prediction intervals with guaranteed coverage
  • Monte Carlo dropout: Use dropout at inference time for uncertainty via multiple passes

Knowing uncertainty is critical for decision-making: ordering inventory, managing risk, etc.
```python
import numpy as np

class QuantileForecaster:
    """
    Quantile regression for probabilistic forecasting.
    Predicts multiple quantiles to form prediction intervals.
    """

    def __init__(self, input_dim, quantiles=(0.1, 0.5, 0.9)):
        self.quantiles = quantiles
        self.models = {}

        for q in quantiles:
            # Separate linear model for each quantile
            self.models[q] = {
                'W': np.random.randn(input_dim, 1) * 0.01,
                'b': np.zeros(1),
            }

    def quantile_loss(self, y_true, y_pred, q):
        """Pinball loss for quantile regression."""
        errors = y_true - y_pred
        return np.mean(np.maximum(q * errors, (q - 1) * errors))

    def fit(self, X, y, epochs=200, lr=0.001):
        """Train each quantile model separately."""
        for q in self.quantiles:
            W = self.models[q]['W']
            b = self.models[q]['b']

            for _ in range(epochs):
                pred = (X @ W + b).flatten()
                errors = y - pred

                # Gradient of pinball loss
                grad = np.where(errors >= 0, -q, -(q - 1))
                grad_W = (X.T @ grad.reshape(-1, 1)) / len(y)
                grad_b = np.mean(grad)

                W -= lr * grad_W
                b -= lr * grad_b

            self.models[q]['W'] = W
            self.models[q]['b'] = b

    def predict(self, X):
        """Predict all quantiles."""
        predictions = {}
        for q in self.quantiles:
            W = self.models[q]['W']
            b = self.models[q]['b']
            predictions[q] = (X @ W + b).flatten()
        return predictions


# Demo: probabilistic forecast
np.random.seed(42)
n = 300
t = np.arange(n, dtype=float)
y = 2 * np.sin(t / 10) + t * 0.01 + np.random.randn(n) * 0.5

# Create features (lagged values)
window = 10
X = np.array([y[i:i+window] for i in range(n - window)])
targets = y[window:]

# Split
split = 250
X_train, X_test = X[:split], X[split:]
y_train, y_test = targets[:split], targets[split:]

# Train quantile forecaster
model = QuantileForecaster(input_dim=window, quantiles=(0.1, 0.25, 0.5, 0.75, 0.9))
model.fit(X_train, y_train, epochs=300, lr=0.001)

# Predict
preds = model.predict(X_test)

# Evaluate coverage
p10, p50, p90 = preds[0.1], preds[0.5], preds[0.9]
coverage_80 = np.mean((y_test >= p10) & (y_test <= p90))
mae_median = np.mean(np.abs(y_test - p50))

print(f"80% interval coverage: {coverage_80*100:.1f}% (target: 80%)")
print(f"Median forecast MAE: {mae_median:.4f}")
print(f"Average interval width: {np.mean(p90 - p10):.4f}")
```
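Conformal prediction, also listed above, is the easiest of these to bolt onto any point forecaster. Here is a minimal split-conformal sketch: hold out a calibration set, take a quantile of its absolute residuals, and widen the point forecasts by that amount. The least-squares base model and all names are illustrative choices, not a specific library's API:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
X = rng.normal(size=(n, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 1.5]) + rng.normal(scale=0.5, size=n)

# Split: train / calibration / test
X_tr, y_tr = X[:200], y[:200]
X_cal, y_cal = X[200:300], y[200:300]
X_te, y_te = X[300:], y[300:]

# Any point forecaster works; here, plain least squares
w, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)

# Calibration: quantile of absolute residuals with conformal correction
alpha = 0.2  # target 80% coverage
resid = np.abs(y_cal - X_cal @ w)
k = int(np.ceil((1 - alpha) * (len(resid) + 1)))
q = np.sort(resid)[min(k, len(resid)) - 1]

# Intervals on test data: point forecast +/- calibrated width
pred = X_te @ w
lower, upper = pred - q, pred + q
coverage = np.mean((y_te >= lower) & (y_te <= upper))
print(f"Empirical coverage: {coverage*100:.1f}% (target: {(1-alpha)*100:.0f}%)")
```

The coverage guarantee holds regardless of the base model, which is what makes the approach attractive for wrapping black-box forecasters.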

    Choosing the Right Method

| Method | Best For | Data Requirement | Interpretability |
| --- | --- | --- | --- |
| ARIMA/SARIMA | Single series, clear patterns | Small-medium | High |
| Prophet | Business data with holidays | Medium | High |
| LSTM/GRU | Complex nonlinear patterns | Large | Low |
| TCN | Long sequences, need speed | Large | Low |
| N-BEATS | Pure forecasting benchmark | Large | Medium |
| TFT | Multi-series with metadata | Very large | Medium |
| PatchTST | Long-context multivariate | Large | Low |
| Foundation models | Zero/few-shot scenarios | Pre-trained | Low |

    Decision Framework:

1. Start with naive baselines and exponential smoothing
2. Try ARIMA/Prophet for interpretable models
3. Move to deep learning if you have enough data and classical methods underperform
4. Consider foundation models for quick prototyping or when training data is limited
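Step 1 costs only a few lines of code. A sketch of the two standard baselines, naive (last value carried forward) and seasonal-naive (repeat the last full cycle); the helper names are illustrative:

```python
import numpy as np

def naive_forecast(history, horizon):
    """Repeat the last observed value over the horizon."""
    return np.full(horizon, history[-1])

def seasonal_naive_forecast(history, horizon, season=12):
    """Repeat the last full seasonal cycle over the horizon."""
    cycle = history[-season:]
    return np.resize(cycle, horizon)  # np.resize tiles the cycle as needed

history = np.array([10, 12, 14, 11, 10, 12, 15, 11], dtype=float)
print(naive_forecast(history, 3))                     # [11. 11. 11.]
print(seasonal_naive_forecast(history, 6, season=4))  # tiles [10 12 15 11]
```

Any model that cannot beat these baselines on a held-out set is not worth deploying, which is why they come first in the framework.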