Deep Learning for Time Series

LSTM, GRU, sequence-to-sequence, and temporal convolutional networks


When classical methods struggle with complex nonlinear patterns, high-dimensional inputs, or very long sequences, deep learning offers powerful alternatives.

Why Deep Learning for Time Series?

  • Handle nonlinear relationships automatically
  • Process multivariate inputs (many features simultaneously)
  • Learn hierarchical temporal patterns
  • Enable sequence-to-sequence (multi-step) forecasting
  • Scale to massive datasets

    LSTM for Forecasting

    Long Short-Term Memory (LSTM) networks are the most popular RNN variant for time series. They mitigate the vanishing gradient problem with a gating mechanism:

  • Forget gate: What information to discard from memory
  • Input gate: What new information to store
  • Output gate: What to output from memory

    Windowed Input Format

    Time series data must be transformed into windows for LSTM input:

  • Input: Window of past values [y(t-w), y(t-w+1), ..., y(t-1)]
  • Output: Next value y(t) or next h values [y(t), ..., y(t+h-1)]
  • Shape: (batch_size, sequence_length, n_features)
```python
import numpy as np

def create_sequences(data, seq_length, forecast_horizon=1):
    """
    Create windowed sequences for LSTM training.

    Args:
        data: 1D or 2D array of time series data
        seq_length: Number of past steps to use as input
        forecast_horizon: Number of future steps to predict

    Returns:
        X: shape (n_samples, seq_length, n_features)
        y: shape (n_samples, forecast_horizon)
    """
    if data.ndim == 1:
        data = data.reshape(-1, 1)

    n_samples = len(data) - seq_length - forecast_horizon + 1
    n_features = data.shape[1]

    X = np.zeros((n_samples, seq_length, n_features))
    y = np.zeros((n_samples, forecast_horizon))

    for i in range(n_samples):
        X[i] = data[i:i + seq_length]
        y[i] = data[i + seq_length:i + seq_length + forecast_horizon, 0]

    return X, y

# Demo
np.random.seed(42)
series = np.sin(np.linspace(0, 20, 200)) + np.random.randn(200) * 0.1

X, y = create_sequences(series, seq_length=30, forecast_horizon=5)
print(f"Input shape: {X.shape}")   # (n, 30, 1)
print(f"Target shape: {y.shape}")  # (n, 5)
print(f"First window: [{X[0, 0, 0]:.3f}, ..., {X[0, -1, 0]:.3f}]")
print(f"First target: [{y[0, 0]:.3f}, ..., {y[0, -1]:.3f}]")
```

    GRU (Gated Recurrent Unit)

    GRU is a simplified LSTM with only two gates:

  • Reset gate: Controls how much past information to forget
  • Update gate: Controls how much new information to let in

    GRU has fewer parameters than LSTM and often performs comparably. Use GRU when:

  • You want faster training
  • Your sequences aren't extremely long
  • You want a simpler model
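
    By way of illustration, here is a minimal numpy sketch of the GRU forward step, written in the same style as the LSTM cell shown later in this lesson. The weights are random and untrained, and the class name `GRUCell` is just a label for this demo; the point is the two-gate mechanics.

```python
import numpy as np

class GRUCell:
    """Single GRU cell (forward pass only) -- a toy sketch, not a trained model."""

    def __init__(self, input_dim, hidden_dim):
        self.hidden_dim = hidden_dim
        scale = np.sqrt(2.0 / (input_dim + hidden_dim))

        # Reset gate
        self.Wr = np.random.randn(input_dim + hidden_dim, hidden_dim) * scale
        self.br = np.zeros(hidden_dim)

        # Update gate
        self.Wz = np.random.randn(input_dim + hidden_dim, hidden_dim) * scale
        self.bz = np.zeros(hidden_dim)

        # Candidate hidden state
        self.Wh = np.random.randn(input_dim + hidden_dim, hidden_dim) * scale
        self.bh = np.zeros(hidden_dim)

    def forward(self, x, h_prev):
        """Single step forward pass. Note: only one state vector (no cell state)."""
        combined = np.concatenate([x, h_prev])

        r = self._sigmoid(combined @ self.Wr + self.br)  # Reset gate
        z = self._sigmoid(combined @ self.Wz + self.bz)  # Update gate

        # Candidate uses the reset-scaled previous hidden state
        combined_reset = np.concatenate([x, r * h_prev])
        h_hat = np.tanh(combined_reset @ self.Wh + self.bh)

        # Update gate interpolates between old state and candidate
        h = (1 - z) * h_prev + z * h_hat
        return h

    def _sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

# Demo: process a short sequence, tracking only h (LSTM would also carry c)
np.random.seed(0)
cell = GRUCell(input_dim=1, hidden_dim=16)
h = np.zeros(16)
for t, x in enumerate([0.5, 0.8, 1.2, 0.3, 0.9]):
    h = cell.forward(np.array([x]), h)
    print(f"  t={t}: input={x}, h_norm={np.linalg.norm(h):.4f}")
```

    Compare the parameter count: three weight matrices here versus four in the LSTM cell, and a single state vector instead of separate hidden and cell states.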

    Sequence-to-Sequence (Seq2Seq)

    For multi-step forecasting, the encoder-decoder architecture is powerful:

    1. Encoder: Processes the input sequence and compresses it into a context vector
    2. Decoder: Takes the context vector and autoregressively generates the forecast

    This decouples input length from output length, allowing flexible forecast horizons.
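
    A toy numpy sketch of the encoder-decoder loop, using a vanilla RNN step as a stand-in for an LSTM/GRU cell. All weights are random and untrained, so the forecast values are meaningless; the point is the data flow from input sequence to context vector to autoregressive output.

```python
import numpy as np

def simple_rnn_step(x, h, Wx, Wh, b):
    """One step of a vanilla RNN (stand-in for an LSTM/GRU cell)."""
    return np.tanh(x @ Wx + h @ Wh + b)

np.random.seed(42)
input_dim, hidden_dim, horizon = 1, 8, 3

# Separate (random, untrained) weights for encoder and decoder
Wx_e = np.random.randn(input_dim, hidden_dim) * 0.3
Wh_e = np.random.randn(hidden_dim, hidden_dim) * 0.3
b_e = np.zeros(hidden_dim)
Wx_d = np.random.randn(input_dim, hidden_dim) * 0.3
Wh_d = np.random.randn(hidden_dim, hidden_dim) * 0.3
b_d = np.zeros(hidden_dim)
W_out = np.random.randn(hidden_dim, 1) * 0.3  # hidden state -> scalar forecast

# 1. Encoder: compress the input sequence into a context vector
sequence = [0.5, 0.8, 1.2, 0.3, 0.9]
h = np.zeros(hidden_dim)
for x in sequence:
    h = simple_rnn_step(np.array([x]), h, Wx_e, Wh_e, b_e)
context = h  # final hidden state summarizes the whole input

# 2. Decoder: start from the context and generate autoregressively
y_prev = np.array([sequence[-1]])  # seed with the last observed value
h = context
forecasts = []
for step in range(horizon):
    h = simple_rnn_step(y_prev, h, Wx_d, Wh_d, b_d)
    y_prev = h @ W_out  # prediction fed back as the next decoder input
    forecasts.append(float(y_prev[0]))

print(f"Context vector norm: {np.linalg.norm(context):.4f}")
print(f"{horizon}-step forecast (untrained): {[f'{v:.3f}' for v in forecasts]}")
```

    Note how `horizon` is independent of `len(sequence)`: the decoder loop can run for any number of steps, which is exactly the decoupling described above.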

```python
import numpy as np

# LSTM cell implementation (forward pass only)
class LSTMCell:
    """Single LSTM cell for understanding the mechanics."""

    def __init__(self, input_dim, hidden_dim):
        self.hidden_dim = hidden_dim
        scale = np.sqrt(2.0 / (input_dim + hidden_dim))

        # Forget gate
        self.Wf = np.random.randn(input_dim + hidden_dim, hidden_dim) * scale
        self.bf = np.zeros(hidden_dim)

        # Input gate
        self.Wi = np.random.randn(input_dim + hidden_dim, hidden_dim) * scale
        self.bi = np.zeros(hidden_dim)

        # Cell candidate
        self.Wc = np.random.randn(input_dim + hidden_dim, hidden_dim) * scale
        self.bc = np.zeros(hidden_dim)

        # Output gate
        self.Wo = np.random.randn(input_dim + hidden_dim, hidden_dim) * scale
        self.bo = np.zeros(hidden_dim)

    def forward(self, x, h_prev, c_prev):
        """Single step forward pass."""
        combined = np.concatenate([x, h_prev])

        # Gates
        f = self._sigmoid(combined @ self.Wf + self.bf)  # Forget
        i = self._sigmoid(combined @ self.Wi + self.bi)  # Input
        c_hat = np.tanh(combined @ self.Wc + self.bc)    # Candidate
        o = self._sigmoid(combined @ self.Wo + self.bo)  # Output

        # Update cell state and hidden state
        c = f * c_prev + i * c_hat
        h = o * np.tanh(c)

        return h, c

    def _sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))


# Demo forward pass
cell = LSTMCell(input_dim=1, hidden_dim=16)
h = np.zeros(16)
c = np.zeros(16)

# Process a sequence of 5 time steps
sequence = [0.5, 0.8, 1.2, 0.3, 0.9]
print("Processing sequence through LSTM:")
for t, x in enumerate(sequence):
    h, c = cell.forward(np.array([x]), h, c)
    print(f"  t={t}: input={x}, h_norm={np.linalg.norm(h):.4f}, c_norm={np.linalg.norm(c):.4f}")
```

    Temporal Convolutional Networks (TCN)

    TCNs use 1D convolutions with causal padding (no future information leaks) and dilated convolutions (exponentially increasing receptive field).

    Advantages over RNNs:

  • Parallelizable: All positions computed simultaneously (no sequential bottleneck)
  • Stable gradients: No vanishing/exploding gradient issues
  • Flexible receptive field: Dilations allow looking back very far

    Architecture

  • Stack of dilated causal convolution layers
  • Dilation factors: 1, 2, 4, 8, 16, ... (doubles each layer)
  • Residual connections between layers
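
    A single-channel toy sketch of one residual block, assuming a simple `causal_conv` helper defined just for this demo. In a real TCN each layer has many channels and the skip path may need a 1x1 convolution to match dimensions; here the residual add works directly because input and output shapes agree.

```python
import numpy as np

def causal_conv(x, weight, dilation):
    """Dilated causal 1D convolution for a single channel; x has shape (time,)."""
    k = len(weight)
    pad = (k - 1) * dilation
    xp = np.pad(x, (pad, 0))  # pad only the left: no future leakage
    out = np.zeros_like(x)
    for j, w in enumerate(weight):  # weight[-1] taps the current sample
        shift = (k - 1 - j) * dilation
        out += w * xp[pad - shift : pad - shift + len(x)]
    return out

def residual_block(x, weight, dilation):
    """y = x + f(x): the skip connection stabilizes deep TCN stacks."""
    return x + np.tanh(causal_conv(x, weight, dilation))

# Stack blocks with doubling dilation, as in the architecture above
np.random.seed(0)
x = np.sin(np.linspace(0, 6, 50))
out = x
for layer, d in enumerate([1, 2, 4, 8]):
    w = np.random.randn(3) * 0.3  # random kernel_size=3 filter (untrained)
    out = residual_block(out, w, dilation=d)
    print(f"Layer {layer}: dilation={d}, output_norm={np.linalg.norm(out):.3f}")
```

    Because each block computes `x + f(x)`, the identity path gives gradients a direct route through the stack, which is what makes dozens of layers trainable.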

    WaveNet-Style Models

    WaveNet (originally for audio) uses dilated causal convolutions and has been adapted for time series. It can model very long-range dependencies efficiently.

```python
import numpy as np

class CausalConv1D:
    """Causal 1D convolution (no future leakage)."""

    def __init__(self, in_channels, out_channels, kernel_size, dilation=1):
        self.kernel_size = kernel_size
        self.dilation = dilation
        self.padding = (kernel_size - 1) * dilation  # Causal: pad only the left

        scale = np.sqrt(2.0 / (kernel_size * in_channels))
        self.weight = np.random.randn(out_channels, in_channels, kernel_size) * scale
        self.bias = np.zeros(out_channels)

    def forward(self, x):
        """
        x: shape (batch, channels, time)
        returns: shape (batch, out_channels, time)
        """
        batch, channels, time = x.shape

        # Pad left side only (causal)
        x_padded = np.pad(x, ((0, 0), (0, 0), (self.padding, 0)))

        # Dilated convolution
        out_time = time
        out = np.zeros((batch, self.weight.shape[0], out_time))

        for t in range(out_time):
            for k in range(self.kernel_size):
                idx = t + self.padding - k * self.dilation
                if 0 <= idx < x_padded.shape[2]:
                    out[:, :, t] += np.einsum(
                        'bc,oc->bo',
                        x_padded[:, :, idx],
                        self.weight[:, :, k]
                    )

        out += self.bias.reshape(1, -1, 1)
        return out

    def receptive_field(self):
        return (self.kernel_size - 1) * self.dilation + 1


# Show how dilated convolutions expand the receptive field
print("TCN Receptive Field Growth:")
total_rf = 1
for layer, dilation in enumerate([1, 2, 4, 8, 16]):
    conv = CausalConv1D(1, 1, kernel_size=3, dilation=dilation)
    rf = conv.receptive_field()
    total_rf += rf - 1
    print(f"  Layer {layer}: dilation={dilation}, layer_rf={rf}, total_rf={total_rf}")

print(f"\nWith 5 layers of kernel_size=3: total receptive field = {total_rf} time steps")
```

    Multi-Step Forecasting Strategies

    1. **Direct**: Train a separate model for each horizon h (one model for t+1, another for t+2, etc.)
    2. **Recursive/Iterative**: Train one 1-step model and feed predictions back as inputs
    3. **Multi-output**: A single model outputs all horizons at once (the Seq2Seq approach)
    4. **Direct-recursive hybrid**: Combine both approaches

    Recursive is simplest but accumulates errors. Direct avoids error accumulation but needs multiple models. Multi-output is the most common deep learning approach.
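
    The recursive-vs-direct trade-off can be demonstrated with a simple linear autoregressive model standing in for a neural network (the helper names `make_xy`, `fit_linear`, and `predict` are just for this sketch):

```python
import numpy as np

np.random.seed(42)
series = np.sin(np.linspace(0, 20, 200)) + np.random.randn(200) * 0.1
train, test = series[:150], series[150:]
w = 10       # window of past values used as features
horizon = 5

def make_xy(data, w, shift=1):
    """Lagged feature matrix: predict data[t + shift - 1] from data[t-w:t]."""
    X = np.array([data[i:i + w] for i in range(len(data) - w - shift + 1)])
    y = data[w + shift - 1:]
    return X, y

def fit_linear(X, y):
    Xb = np.hstack([X, np.ones((len(X), 1))])  # add bias column
    coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return coef

def predict(coef, x):
    return np.append(x, 1.0) @ coef

last_window = train[-w:]

# Recursive: one 1-step model, predictions fed back as inputs
X1, y1 = make_xy(train, w, shift=1)
coef1 = fit_linear(X1, y1)
window = last_window.copy()
recursive = []
for _ in range(horizon):
    yhat = predict(coef1, window)
    recursive.append(yhat)
    window = np.append(window[1:], yhat)  # errors can accumulate here

# Direct: a separate model per horizon, no feedback of predictions
direct = []
for h in range(1, horizon + 1):
    Xh, yh = make_xy(train, w, shift=h)
    direct.append(predict(fit_linear(Xh, yh), last_window))

actual = test[:horizon]
print(f"Recursive MAE: {np.mean(np.abs(np.array(recursive) - actual)):.4f}")
print(f"Direct MAE:    {np.mean(np.abs(np.array(direct) - actual)):.4f}")
```

    The same two loops apply unchanged if the linear model is swapped for an LSTM: recursive reuses one trained network, direct trains `horizon` of them, and a multi-output Seq2Seq model would replace both loops with a single forward pass.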