Deploying ML Models
Training a great model is only half the battle. The other half — and often the harder half — is getting that model into production where it can serve real users reliably, efficiently, and at scale.
In this lesson you'll learn the most common export formats, how to serve models via REST APIs, and how to monitor them once they're live.
The Deployment Gap
A model that scores well in a notebook is not yet a product: it still has to be packaged, versioned, served, and watched. The sections below walk through each of those steps.
SavedModel Format
TensorFlow's SavedModel is the standard serialization format for production deployment. It captures the complete model — architecture, weights, optimizer state, and the computation graph — in a self-contained directory.
```
saved_model/
├── saved_model.pb                     # Graph definition + metadata
├── variables/
│   ├── variables.data-00000-of-00001  # Weight values
│   └── variables.index                # Weight index
└── assets/                            # External files (vocab, etc.)
```
Key advantages:
- Language-agnostic: the same artifact can be loaded from Python, C++, or Java, or served directly.
- Self-contained: graph, weights, and assets travel together, so no model-definition code is needed at inference time.
- Directly consumable by TensorFlow Serving and by the TFLite and TensorFlow.js converters.
```python
import tensorflow as tf

# Train a simple model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Save in SavedModel format; the "1" is the version number.
# (In Keras 3, model.save() requires a .keras extension -- use
# model.export("my_model/1") to write a SavedModel instead.)
model.save("my_model/1")

# Inspect the saved model (notebook shell escape; drop the "!" in a terminal)
!saved_model_cli show --dir my_model/1 --all

# Reload the model
reloaded = tf.keras.models.load_model("my_model/1")
reloaded.summary()  # summary() prints directly and returns None
```

TensorFlow Lite (TFLite) Conversion
TFLite is optimized for mobile and edge devices — phones, Raspberry Pi, microcontrollers. It produces a much smaller file using a FlatBuffer format.
Quantization
Quantization reduces model size and speeds up inference by converting 32-bit floats to smaller types:
| Technique | Precision | Size Reduction | Speed-up | Accuracy Loss |
|---|---|---|---|---|
| No quantization | float32 | 1x (baseline) | 1x | None |
| Dynamic range | float32→int8 (weights only) | ~4x | 2-3x | Minimal |
| Full integer | int8 (weights + activations) | ~4x | 3-4x | Small |
| float16 | float16 | ~2x | 1.5-2x | Very small |
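The full-integer row in the table needs one extra ingredient: a representative dataset, so the converter can calibrate the activation ranges before fixing them to int8. A minimal, self-contained sketch — it builds a tiny stand-in model inline and calibrates on random data purely for illustration; a real deployment would load the trained model and feed it a few hundred real training samples:

```python
import numpy as np
import tensorflow as tf

# Tiny stand-in model (the lesson's real model would be loaded
# from "my_model/1" instead).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(784,)),
    tf.keras.layers.Dense(10, activation='softmax'),
])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# The converter runs these samples through the model to measure
# activation ranges; random data stands in for real samples here.
def representative_dataset():
    for _ in range(100):
        yield [np.random.rand(1, 784).astype(np.float32)]

converter.representative_dataset = representative_dataset
# Force int8 everywhere (conversion fails if an op has no int8 kernel)
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_int8 = converter.convert()
print(f"Full-int8 size: {len(tflite_int8) / 1024:.1f} KB")
```

With inputs and outputs forced to int8, the calling code must quantize its inputs itself — that is exactly what microcontroller targets without float hardware want.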
```python
import numpy as np
import tensorflow as tf

# Load a trained Keras model
model = tf.keras.models.load_model("my_model/1")

# --- Basic conversion (no quantization) ---
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)
print(f"Basic TFLite size: {len(tflite_model) / 1024:.1f} KB")

# --- Dynamic range quantization ---
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quant = converter.convert()

with open("model_quant.tflite", "wb") as f:
    f.write(tflite_quant)
print(f"Quantized TFLite size: {len(tflite_quant) / 1024:.1f} KB")
print(f"Size reduction: {len(tflite_model) / len(tflite_quant):.1f}x")

# --- Run inference with TFLite ---
interpreter = tf.lite.Interpreter(model_path="model_quant.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

test_input = np.random.rand(1, 784).astype(np.float32)
interpreter.set_tensor(input_details[0]['index'], test_input)
interpreter.invoke()
output = interpreter.get_tensor(output_details[0]['index'])
print(f"Prediction: {np.argmax(output)}")
```

TensorFlow.js Export
TF.js lets you run models directly in the browser or in Node.js. This is ideal for interactive demos, privacy-sensitive applications (data never leaves the client), or reducing server costs.
```python
import tensorflow as tf
import tensorflowjs as tfjs

# Convert a Keras model to TF.js format
model = tf.keras.models.load_model("my_model/1")
tfjs.converters.save_keras_model(model, "tfjs_model/")

# This creates:
# tfjs_model/
# ├── model.json            # Architecture + weight manifest
# └── group1-shard1of1.bin  # Binary weight data

# In JavaScript:
# const model = await tf.loadLayersModel('/tfjs_model/model.json');
# const prediction = model.predict(tf.tensor2d([[...features]]));
```

TensorFlow Serving with Docker
TF Serving is Google's production-grade serving system. It's designed for high-throughput, low-latency inference with features like model versioning, request batching, and hardware acceleration.
```shell
# Pull the TF Serving Docker image
docker pull tensorflow/serving

# Serve a SavedModel. The mounted directory must contain version
# subdirectories: /path/to/my_model/1/, /path/to/my_model/2/, ...
docker run -p 8501:8501 \
  --mount type=bind,source=/path/to/my_model,target=/models/my_model \
  -e MODEL_NAME=my_model \
  tensorflow/serving

# Test with curl (REST API). The instances must match the model's
# input shape -- 784 features for the model trained above; a short
# vector is shown here only to keep the example readable.
curl -X POST http://localhost:8501/v1/models/my_model:predict \
  -H "Content-Type: application/json" \
  -d '{"instances": [[0.1, 0.2, 0.3, 0.4, 0.5]]}'

# Check model status
curl http://localhost:8501/v1/models/my_model

# Response format:
# {
#   "predictions": [[0.01, 0.02, 0.85, ...]]
# }
```

gRPC vs REST
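TF Serving exposes both protocols: REST (port 8501) exchanges human-readable JSON, which is easy to debug but pays a serialization cost on large tensors, while gRPC (port 8500) exchanges binary protocol buffers and needs the `tensorflow-serving-api` client package. A minimal REST client sketch using only the standard library — it assumes the Docker container above is running, and `build_predict_request` / `predict` are illustrative helper names, not TF Serving APIs:

```python
import json
import urllib.request

# REST endpoint exposed by the Docker container above
SERVER = "http://localhost:8501/v1/models/my_model"

def build_predict_request(instances):
    """Build the JSON body TF Serving's REST :predict endpoint expects."""
    return json.dumps({"instances": instances}).encode("utf-8")

def predict(instances):
    """POST to the :predict endpoint and return the predictions list."""
    req = urllib.request.Request(
        f"{SERVER}:predict",
        data=build_predict_request(instances),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["predictions"]

# With a running server: predictions = predict([[0.1] * 784])
print(build_predict_request([[0.1, 0.2]]))
```

A rule of thumb: start with REST for simplicity, and switch to gRPC when request payloads are large or latency budgets are tight.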
TFX Pipeline Overview
TensorFlow Extended (TFX) is an end-to-end platform for deploying production ML pipelines. It provides standard components for each stage:
| Component | Purpose |
|---|---|
| ExampleGen | Ingests and splits data |
| StatisticsGen | Computes dataset statistics |
| SchemaGen | Infers a schema from the data |
| ExampleValidator | Detects anomalies and data drift |
| Transform | Feature engineering (runs at training and serving time) |
| Trainer | Trains the model |
| Tuner | Hyperparameter tuning |
| Evaluator | Validates model quality before pushing |
| InfraValidator | Checks the model can be served |
| Pusher | Deploys the model to serving infrastructure |
Model Monitoring Is Not Optional
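Once a model is live, the data it sees can drift away from the data it was trained on, silently degrading accuracy. One widely used drift signal is the Population Stability Index (PSI), which compares the binned distribution of a feature at training time against its live distribution. A minimal NumPy sketch — the thresholds in the docstring are conventional rules of thumb, not TensorFlow APIs:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a training (expected) and a
    live (actual) feature distribution. Rule of thumb: < 0.1 stable,
    0.1-0.25 moderate shift, > 0.25 significant drift."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor the proportions to avoid log(0) on empty bins
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, 10_000)   # feature at training time
live_same = rng.normal(0.0, 1.0, 10_000)       # live traffic, no drift
live_shifted = rng.normal(0.5, 1.0, 10_000)    # live traffic, mean shift

psi_same = psi(train_feature, live_same)
psi_shifted = psi(train_feature, live_shifted)
print(f"PSI (no drift):   {psi_same:.3f}")
print(f"PSI (mean shift): {psi_shifted:.3f}")
```

Computing a PSI per feature on a schedule, and alerting when it crosses a threshold, catches input drift even before labeled outcomes arrive to measure accuracy directly.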
A/B Testing for Models
A/B testing compares two model versions by routing a percentage of traffic to each:
```
                   ┌──── Model v1 (90% traffic) ──── Response
User Request ────► │
                   └──── Model v2 (10% traffic) ──── Response
```
Steps:
1. Deploy the new model alongside the existing one
2. Route a small percentage (e.g., 5-10%) of traffic to the new model
3. Compare key metrics (accuracy, latency, business KPIs)
4. Gradually increase traffic if the new model wins
5. Roll back immediately if metrics degrade
This is safer than a full cutover because you limit the blast radius of a bad model.
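The routing step above is usually made deterministic per user, so the same person always sees the same model version during the experiment. A minimal sketch — `route_request` and the salt value are illustrative names, not part of any serving framework:

```python
import random

def route_request(request_id, traffic_split=0.10, salt="ab-test-1"):
    """Deterministically route a request: seeding on (salt, request_id)
    means the same user always lands on the same model version."""
    rng = random.Random(f"{salt}:{request_id}")
    return "v2" if rng.random() < traffic_split else "v1"

# Simulate 10,000 distinct users hitting the router
counts = {"v1": 0, "v2": 0}
for i in range(10_000):
    counts[route_request(f"user-{i}")] += 1
print(counts)  # roughly a 90/10 split

# Stickiness: the same user is always routed the same way
assert route_request("user-42") == route_request("user-42")
```

Changing the salt starts a fresh experiment with a new assignment of users to arms, while keeping each experiment internally consistent.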