Deploying ML Models
Training a great model is only half the battle. The other half — and often the harder half — is getting that model into production where it can serve real users reliably, efficiently, and at scale.
In this lesson you'll learn the most common export formats, how to serve models via REST APIs, and how to monitor them once they're live.
The Deployment Gap
A model that scores well in a notebook is not yet a product: it still has to be packaged, versioned, served, and watched. The sections below walk through each of those steps.
SavedModel Format
TensorFlow's SavedModel is the standard serialization format for production deployment. It captures the complete model — architecture, weights, optimizer state, and the computation graph — in a self-contained directory.
```
saved_model/
├── saved_model.pb                     # Graph definition + metadata
├── variables/
│   ├── variables.data-00000-of-00001  # Weight values
│   └── variables.index                # Weight index
└── assets/                            # External files (vocab, etc.)
```
Key advantages:
- Language-agnostic: the same artifact can be loaded from Python, C++, or Java, or served directly.
- Self-contained: graph, weights, and assets travel together, so no model-definition code is needed at inference time.
- Directly consumable by TensorFlow Serving and by the TFLite and TensorFlow.js converters.
```python
import tensorflow as tf

# Train a simple model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Save in SavedModel format; the "1" is the version number.
# (In Keras 3, model.save() requires a .keras extension -- use
# model.export("my_model/1") to write a SavedModel instead.)
model.save("my_model/1")

# Inspect the saved model (notebook shell escape; drop the "!" in a terminal)
!saved_model_cli show --dir my_model/1 --all

# Reload the model
reloaded = tf.keras.models.load_model("my_model/1")
reloaded.summary()  # summary() prints directly and returns None
```

TensorFlow Lite (TFLite) Conversion
TFLite is optimized for mobile and edge devices — phones, Raspberry Pi, microcontrollers. It produces a much smaller file using a FlatBuffer format.
Quantization
Quantization reduces model size and speeds up inference by converting 32-bit floats to smaller types:
| Technique | Precision | Size Reduction | Speed-up | Accuracy Loss |
|---|---|---|---|---|
| No quantization | float32 | 1x (baseline) | 1x | None |
| Dynamic range | float32→int8 (weights only) | ~4x | 2-3x | Minimal |
| Full integer | int8 (weights + activations) | ~4x | 3-4x | Small |
| float16 | float16 | ~2x | 1.5-2x | Very small |
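The full-integer row in the table needs one extra ingredient: a representative dataset, so the converter can calibrate the activation ranges before fixing them to int8. A minimal, self-contained sketch — it builds a tiny stand-in model inline and calibrates on random data purely for illustration; a real deployment would load the trained model and feed it a few hundred real training samples:

```python
import numpy as np
import tensorflow as tf

# Tiny stand-in model (the lesson's real model would be loaded
# from "my_model/1" instead).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(784,)),
    tf.keras.layers.Dense(10, activation='softmax'),
])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# The converter runs these samples through the model to measure
# activation ranges; random data stands in for real samples here.
def representative_dataset():
    for _ in range(100):
        yield [np.random.rand(1, 784).astype(np.float32)]

converter.representative_dataset = representative_dataset
# Force int8 everywhere (conversion fails if an op has no int8 kernel)
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_int8 = converter.convert()
print(f"Full-int8 size: {len(tflite_int8) / 1024:.1f} KB")
```

With inputs and outputs forced to int8, the calling code must quantize its inputs itself — that is exactly what microcontroller targets without float hardware want.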
```python
import numpy as np
import tensorflow as tf

# Load a trained Keras model
model = tf.keras.models.load_model("my_model/1")

# --- Basic conversion (no quantization) ---
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)
print(f"Basic TFLite size: {len(tflite_model) / 1024:.1f} KB")

# --- Dynamic range quantization ---
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quant = converter.convert()

with open("model_quant.tflite", "wb") as f:
    f.write(tflite_quant)
print(f"Quantized TFLite size: {len(tflite_quant) / 1024:.1f} KB")
print(f"Size reduction: {len(tflite_model) / len(tflite_quant):.1f}x")

# --- Run inference with TFLite ---
interpreter = tf.lite.Interpreter(model_path="model_quant.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

test_input = np.random.rand(1, 784).astype(np.float32)
interpreter.set_tensor(input_details[0]['index'], test_input)
interpreter.invoke()
output = interpreter.get_tensor(output_details[0]['index'])
print(f"Prediction: {np.argmax(output)}")
```

TensorFlow.js Export
TF.js lets you run models directly in the browser or in Node.js. This is ideal for interactive demos, privacy-sensitive applications (data never leaves the client), or reducing server costs.
```python
import tensorflow as tf
import tensorflowjs as tfjs

# Convert a Keras model to TF.js format
model = tf.keras.models.load_model("my_model/1")
tfjs.converters.save_keras_model(model, "tfjs_model/")

# This creates:
# tfjs_model/
# ├── model.json            # Architecture + weight manifest
# └── group1-shard1of1.bin  # Binary weight data

# In JavaScript:
# const model = await tf.loadLayersModel('/tfjs_model/model.json');
# const prediction = model.predict(tf.tensor2d([[...features]]));
```

TensorFlow Serving with Docker
TF Serving is Google's production-grade serving system. It's designed for high-throughput, low-latency inference with features like model versioning, request batching, and hardware acceleration.
```shell
# Pull the TF Serving Docker image
docker pull tensorflow/serving

# Serve a SavedModel. The mounted directory must contain version
# subdirectories: /path/to/my_model/1/, /path/to/my_model/2/, ...
docker run -p 8501:8501 \
  --mount type=bind,source=/path/to/my_model,target=/models/my_model \
  -e MODEL_NAME=my_model \
  tensorflow/serving

# Test with curl (REST API). The instances must match the model's
# input shape -- 784 features for the model trained above; a short
# vector is shown here only to keep the example readable.
curl -X POST http://localhost:8501/v1/models/my_model:predict \
  -H "Content-Type: application/json" \
  -d '{"instances": [[0.1, 0.2, 0.3, 0.4, 0.5]]}'

# Check model status
curl http://localhost:8501/v1/models/my_model

# Response format:
# {
#   "predictions": [[0.01, 0.02, 0.85, ...]]
# }
```

gRPC vs REST
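TF Serving exposes both protocols: REST (port 8501) exchanges human-readable JSON, which is easy to debug but pays a serialization cost on large tensors, while gRPC (port 8500) exchanges binary protocol buffers and needs the `tensorflow-serving-api` client package. A minimal REST client sketch using only the standard library — it assumes the Docker container above is running, and `build_predict_request` / `predict` are illustrative helper names, not TF Serving APIs:

```python
import json
import urllib.request

# REST endpoint exposed by the Docker container above
SERVER = "http://localhost:8501/v1/models/my_model"

def build_predict_request(instances):
    """Build the JSON body TF Serving's REST :predict endpoint expects."""
    return json.dumps({"instances": instances}).encode("utf-8")

def predict(instances):
    """POST to the :predict endpoint and return the predictions list."""
    req = urllib.request.Request(
        f"{SERVER}:predict",
        data=build_predict_request(instances),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["predictions"]

# With a running server: predictions = predict([[0.1] * 784])
print(build_predict_request([[0.1, 0.2]]))
```

A rule of thumb: start with REST for simplicity, and switch to gRPC when request payloads are large or latency budgets are tight.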
TFX Pipeline Overview
TensorFlow Extended (TFX) is an end-to-end platform for deploying production ML pipelines. It provides standard components for each stage:
| Component | Purpose |
|---|---|
| ExampleGen | Ingests and splits data |
| StatisticsGen | Computes dataset statistics |
| SchemaGen | Infers a schema from the data |
| ExampleValidator | Detects anomalies and data drift |
| Transform | Feature engineering (runs at training and serving time) |
| Trainer | Trains the model |
| Tuner | Hyperparameter tuning |
| Evaluator | Validates model quality before pushing |
| InfraValidator | Checks the model can be served |
| Pusher | Deploys the model to serving infrastructure |
Model Monitoring Is Not Optional
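Once a model is live, the data it sees can drift away from the data it was trained on, silently degrading accuracy. One widely used drift signal is the Population Stability Index (PSI), which compares the binned distribution of a feature at training time against its live distribution. A minimal NumPy sketch — the thresholds in the docstring are conventional rules of thumb, not TensorFlow APIs:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a training (expected) and a
    live (actual) feature distribution. Rule of thumb: < 0.1 stable,
    0.1-0.25 moderate shift, > 0.25 significant drift."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor the proportions to avoid log(0) on empty bins
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, 10_000)   # feature at training time
live_same = rng.normal(0.0, 1.0, 10_000)       # live traffic, no drift
live_shifted = rng.normal(0.5, 1.0, 10_000)    # live traffic, mean shift

psi_same = psi(train_feature, live_same)
psi_shifted = psi(train_feature, live_shifted)
print(f"PSI (no drift):   {psi_same:.3f}")
print(f"PSI (mean shift): {psi_shifted:.3f}")
```

Computing a PSI per feature on a schedule, and alerting when it crosses a threshold, catches input drift even before labeled outcomes arrive to measure accuracy directly.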
A/B Testing for Models
A/B testing compares two model versions by routing a percentage of traffic to each:
```
                   ┌──── Model v1 (90% traffic) ──── Response
User Request ────► │
                   └──── Model v2 (10% traffic) ──── Response
```
Steps:
1. Deploy the new model alongside the existing one
2. Route a small percentage (e.g., 5-10%) of traffic to the new model
3. Compare key metrics (accuracy, latency, business KPIs)
4. Gradually increase traffic if the new model wins
5. Roll back immediately if metrics degrade
This is safer than a full cutover because you limit the blast radius of a bad model.
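The routing step above is usually made deterministic per user, so the same person always sees the same model version during the experiment. A minimal sketch — `route_request` and the salt value are illustrative names, not part of any serving framework:

```python
import random

def route_request(request_id, traffic_split=0.10, salt="ab-test-1"):
    """Deterministically route a request: seeding on (salt, request_id)
    means the same user always lands on the same model version."""
    rng = random.Random(f"{salt}:{request_id}")
    return "v2" if rng.random() < traffic_split else "v1"

# Simulate 10,000 distinct users hitting the router
counts = {"v1": 0, "v2": 0}
for i in range(10_000):
    counts[route_request(f"user-{i}")] += 1
print(counts)  # roughly a 90/10 split

# Stickiness: the same user is always routed the same way
assert route_request("user-42") == route_request("user-42")
```

Changing the salt starts a fresh experiment with a new assignment of users to arms, while keeping each experiment internally consistent.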