
Deploying ML Models

Export models in production formats and serve them via APIs

~50 min

Training a great model is only half the battle. The other half — and often the harder half — is getting that model into production where it can serve real users reliably, efficiently, and at scale.

In this lesson you'll learn the most common export formats, how to serve models via REST APIs, and how to monitor them once they're live.

The Deployment Gap

According to industry surveys, only about 50% of ML models ever make it to production. The gap between a working notebook and a production system is one of the biggest challenges in applied ML.

SavedModel Format

TensorFlow's SavedModel is the standard serialization format for production deployment. It captures the complete model — architecture, weights, optimizer state, and the computation graph — in a self-contained directory.

saved_model/
├── saved_model.pb          # Graph definition + metadata
├── variables/
│   ├── variables.data-00000-of-00001   # Weight values
│   └── variables.index                 # Weight index
└── assets/                 # External files (vocab, etc.)

Key advantages:

  • Language-agnostic: Can be loaded from Python, C++, Java, JavaScript
  • Signature definitions: Describes exactly what inputs/outputs the model expects
  • Versioned: Easy to swap between model versions

```python
import tensorflow as tf

# Define and compile a simple model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Save in SavedModel format (TF 2.x; in Keras 3, use model.export("my_model/1"))
model.save("my_model/1")  # The "1" is the version number

# Inspect the saved model (shell command; prefix with ! in a notebook)
# saved_model_cli show --dir my_model/1 --all

# Reload the model
reloaded = tf.keras.models.load_model("my_model/1")
reloaded.summary()  # summary() prints its report and returns None
```

TensorFlow Lite (TFLite) Conversion

TFLite is optimized for mobile and edge devices — phones, Raspberry Pi, microcontrollers. It produces a much smaller file using the FlatBuffer format.

Quantization

Quantization reduces model size and speeds up inference by converting 32-bit floats to smaller types:

| Technique       | Precision                      | Size Reduction | Speed-up | Accuracy Loss |
|-----------------|--------------------------------|----------------|----------|---------------|
| No quantization | float32                        | 1x (baseline)  | 1x       | None          |
| Dynamic range   | float32 → int8 (weights only)  | ~4x            | 2-3x     | Minimal       |
| Full integer    | int8 (weights + activations)   | ~4x            | 3-4x     | Small         |
| float16         | float16                        | ~2x            | 1.5-2x   | Very small    |

```python
import numpy as np
import tensorflow as tf

# Load a trained Keras model
model = tf.keras.models.load_model("my_model/1")

# --- Basic conversion (no quantization) ---
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)
print(f"Basic TFLite size: {len(tflite_model) / 1024:.1f} KB")

# --- Dynamic range quantization ---
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quant = converter.convert()

with open("model_quant.tflite", "wb") as f:
    f.write(tflite_quant)
print(f"Quantized TFLite size: {len(tflite_quant) / 1024:.1f} KB")
print(f"Size reduction: {len(tflite_model) / len(tflite_quant):.1f}x")

# --- Run inference with TFLite ---
interpreter = tf.lite.Interpreter(model_path="model_quant.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

test_input = np.random.rand(1, 784).astype(np.float32)
interpreter.set_tensor(input_details[0]['index'], test_input)
interpreter.invoke()
output = interpreter.get_tensor(output_details[0]['index'])
print(f"Prediction: {np.argmax(output)}")
```
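
The table above also lists full integer quantization, which the snippet doesn't cover. A sketch of that path, using a tiny stand-in model and random calibration data (in practice you would reuse your trained model and a few hundred real input samples):

```python
import numpy as np
import tensorflow as tf

# Stand-in model; in practice, load your trained model instead
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(784,)),
    tf.keras.layers.Dense(10, activation='softmax'),
])

# Full integer quantization needs sample inputs to calibrate activation ranges
def representative_data_gen():
    for _ in range(100):
        yield [np.random.rand(1, 784).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
# Fail conversion if any op cannot be expressed with int8 builtins
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_int8 = converter.convert()
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_int8)
```

With int8 inputs and outputs, callers must quantize and dequantize at the boundary using the scale and zero point stored in the model's input and output details.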

TensorFlow.js Export

TF.js lets you run models directly in the browser or in Node.js. This is ideal for interactive demos, privacy-sensitive applications (data never leaves the client), or reducing server costs.

```python
import tensorflow as tf
import tensorflowjs as tfjs

# Convert a Keras model to TF.js format
model = tf.keras.models.load_model("my_model/1")
tfjs.converters.save_keras_model(model, "tfjs_model/")

# This creates:
# tfjs_model/
# ├── model.json            # Architecture + weight manifest
# └── group1-shard1of1.bin  # Binary weight data

# In JavaScript:
# const model = await tf.loadLayersModel('/tfjs_model/model.json');
# const prediction = model.predict(tf.tensor2d([[...features]]));
```

TensorFlow Serving with Docker

TF Serving is Google's production-grade serving system. It's designed for high-throughput, low-latency inference with features like model versioning, request batching, and hardware acceleration.

```bash
# Pull the TF Serving Docker image
docker pull tensorflow/serving

# Serve a SavedModel
# Model must be in: /path/to/models/<model_name>/<version>/
docker run -p 8501:8501 \
  --mount type=bind,source=/path/to/my_model,target=/models/my_model \
  -e MODEL_NAME=my_model \
  tensorflow/serving

# Test with curl (REST API)
curl -X POST http://localhost:8501/v1/models/my_model:predict \
  -H "Content-Type: application/json" \
  -d '{"instances": [[0.1, 0.2, 0.3, 0.4, 0.5]]}'

# Check model status
curl http://localhost:8501/v1/models/my_model

# Response format:
# {
#   "predictions": [[0.01, 0.02, 0.85, ...]]
# }
```

gRPC vs REST

TF Serving supports both REST (port 8501) and gRPC (port 8500). gRPC is significantly faster for production workloads due to Protocol Buffers serialization and HTTP/2 multiplexing. Use REST for testing; use gRPC when latency matters.
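
The curl call shown earlier can also be made from Python with just the standard library. A minimal REST client sketch, assuming a TF Serving container is running locally with the model name from the Docker example:

```python
import json
import urllib.request

SERVING_URL = "http://localhost:8501/v1/models/my_model:predict"

def build_payload(instances):
    """Encode a batch of inputs in TF Serving's REST 'instances' format."""
    return json.dumps({"instances": instances}).encode("utf-8")

def predict(instances, url=SERVING_URL):
    """POST a predict request and return the 'predictions' list."""
    req = urllib.request.Request(
        url,
        data=build_payload(instances),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["predictions"]

# Example (requires the server from the Docker example to be running):
# print(predict([[0.1, 0.2, 0.3, 0.4, 0.5]]))
```

For the gRPC path you would instead build a `PredictRequest` protobuf with the `tensorflow-serving-api` package, which avoids JSON encoding entirely.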

TFX Pipeline Overview

TensorFlow Extended (TFX) is an end-to-end platform for deploying production ML pipelines. It provides standard components for each stage:

| Component        | Purpose                                               |
|------------------|-------------------------------------------------------|
| ExampleGen       | Ingests and splits data                               |
| StatisticsGen    | Computes dataset statistics                           |
| SchemaGen        | Infers a schema from the data                         |
| ExampleValidator | Detects anomalies and data drift                      |
| Transform        | Feature engineering (runs at training and serving time) |
| Trainer          | Trains the model                                      |
| Tuner            | Hyperparameter tuning                                 |
| Evaluator        | Validates model quality before pushing                |
| InfraValidator   | Checks the model can be served                        |
| Pusher           | Deploys the model to serving infrastructure           |

Each component consumes and produces artifacts tracked in a metadata store, ensuring full reproducibility and lineage tracking.

Model Monitoring Is Not Optional

Once deployed, models degrade silently. Watch for:

  • Data drift: Input data distribution changes over time
  • Concept drift: The relationship between inputs and outputs changes
  • Latency spikes: Model or infrastructure performance degrades

Without monitoring, you won't know your model is making bad predictions until users complain — or worse, until real damage is done.
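
Data drift can be caught by comparing a production window of a feature against a training-time baseline with a two-sample test. A minimal sketch using SciPy's Kolmogorov-Smirnov test (the 0.05 threshold and window sizes are illustrative choices, not part of the lesson):

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_drift(baseline, live, alpha=0.05):
    """Flag drift when the KS test rejects 'same distribution' at level alpha."""
    stat, p_value = ks_2samp(baseline, live)
    return bool(p_value < alpha)

rng = np.random.default_rng(42)
baseline = rng.normal(loc=0.0, scale=1.0, size=1000)  # training-time feature values
stable = rng.normal(loc=0.0, scale=1.0, size=1000)    # production window, same distribution
shifted = rng.normal(loc=0.5, scale=1.0, size=1000)   # production window, mean has drifted

print(detect_drift(baseline, stable))   # usually False (expect ~5% false alarms)
print(detect_drift(baseline, shifted))  # True
```

In practice you would run a check like this per feature on a schedule, and alert when several consecutive windows flag drift rather than on a single rejection.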

A/B Testing for Models

A/B testing compares two model versions by routing a percentage of traffic to each:

                  ┌──── Model v1 (90% traffic) ──── Response
User Request ────►┤
                  └──── Model v2 (10% traffic) ──── Response

Steps:

1. Deploy the new model alongside the existing one
2. Route a small percentage (e.g., 5-10%) of traffic to the new model
3. Compare key metrics (accuracy, latency, business KPIs)
4. Gradually increase traffic if the new model wins
5. Roll back immediately if metrics degrade

This is safer than a full cutover because you limit the blast radius of a bad model.
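
The routing step is usually made deterministic per user by hashing the user ID into buckets, so each user consistently sees the same version. A minimal sketch (the function name and the 10% split are illustrative):

```python
import hashlib

def assign_model(user_id: str, v2_fraction: float = 0.10) -> str:
    """Deterministically route a user to model 'v1' or 'v2'."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10_000  # 10,000 equal-width buckets
    return "v2" if bucket < v2_fraction * 10_000 else "v1"

# The same user always gets the same version, so their experience is stable
assert assign_model("user-123") == assign_model("user-123")

# Across many users, roughly v2_fraction of traffic goes to v2
share = sum(assign_model(f"user-{i}") == "v2" for i in range(10_000)) / 10_000
print(f"v2 share: {share:.3f}")  # ≈ 0.10
```

Hash-based assignment also makes "gradually increase traffic" a one-line change: raising `v2_fraction` moves more buckets to v2 without reshuffling users already in the treatment group.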