TF Lite, Core ML & Inference Runtimes

TF Lite conversion and inference, Core ML with coremltools, ONNX Runtime, TensorRT, and model benchmarking

~45 min

Inference Runtimes for Edge Deployment

Training a model is only half the battle — you need an inference runtime optimized for your target hardware. Each runtime applies hardware-specific optimizations (operator fusion, memory planning, platform-specific instructions) that make models run faster than naive execution.

Why Not Just Use PyTorch/TensorFlow?

Training frameworks (PyTorch, TensorFlow) are designed for flexibility and debugging during development. Inference runtimes (TF Lite, Core ML, ONNX Runtime, TensorRT) are designed for speed and efficiency during deployment. They strip out training-only features (autograd, optimizer state) and apply aggressive optimizations for specific hardware.
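
To make that concrete, here is a minimal PyTorch sketch (assuming `torch` is installed) of the autograd bookkeeping a training framework performs on every forward pass, which inference runtimes drop entirely:

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 512)
x = torch.randn(64, 512)

# Default forward pass: autograd records a graph for a possible backward()
y_train_mode = model(x)
print(y_train_mode.requires_grad)  # True: extra memory and bookkeeping

# Inference mode: autograd disabled, closer to what a dedicated runtime does
with torch.inference_mode():
    y_infer = model(x)
print(y_infer.requires_grad)  # False
```

Inference runtimes go further still, since they can assume weights never change: they fuse operators, pre-plan memory, and discard the graph machinery altogether.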

TensorFlow Lite (TF Lite)

Google's lightweight inference runtime for mobile and edge devices:

  • Platforms: Android, iOS, Linux, microcontrollers
  • Optimizations: Quantization, operator fusion, delegate support (GPU, NNAPI, Hexagon DSP)
  • Model format: .tflite (FlatBuffer-based, no parsing overhead)
  • Conversion Pipeline

    TensorFlow Model (.pb / SavedModel / Keras)
        → TFLite Converter (optimize, quantize)
            → .tflite file
                → TFLite Interpreter (on-device inference)
    

    python
    import tensorflow as tf
    import numpy as np
    import os

    # --- Step 1: Create a Keras model ---
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(3, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

    # Train on dummy data
    X_train = np.random.randn(1000, 10).astype(np.float32)
    y_train = np.random.randint(0, 3, 1000)
    model.fit(X_train, y_train, epochs=5, verbose=0)

    # --- Step 2: Convert to TF Lite ---
    converter = tf.lite.TFLiteConverter.from_keras_model(model)

    # Option A: Default conversion (FP32)
    tflite_model_fp32 = converter.convert()

    # Option B: Dynamic range quantization (INT8 weights, FP32 activations)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    tflite_model_int8 = converter.convert()

    # Option C: Full integer quantization (requires calibration data)
    def representative_dataset():
        for i in range(100):
            yield [X_train[i:i+1]]

    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = representative_dataset
    converter.target_spec.supported_ops = [
        tf.lite.OpsSet.TFLITE_BUILTINS_INT8
    ]
    converter.inference_input_type = tf.uint8
    converter.inference_output_type = tf.uint8
    tflite_model_full_int8 = converter.convert()

    # --- Step 3: Save and compare sizes ---
    for name, model_bytes in [
        ("fp32", tflite_model_fp32),
        ("int8_dynamic", tflite_model_int8),
        ("int8_full", tflite_model_full_int8),
    ]:
        path = f"/tmp/model_{name}.tflite"
        with open(path, "wb") as f:
            f.write(model_bytes)
        size_kb = os.path.getsize(path) / 1024
        print(f"{name:15s}: {size_kb:>8.1f} KB")

    # --- Step 4: Run inference with TF Lite Interpreter ---
    interpreter = tf.lite.Interpreter(model_content=tflite_model_int8)
    interpreter.allocate_tensors()

    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    # Single inference (dynamic-range models still take FP32 inputs)
    test_input = np.random.randn(1, 10).astype(np.float32)
    interpreter.set_tensor(input_details[0]["index"], test_input)
    interpreter.invoke()

    output = interpreter.get_tensor(output_details[0]["index"])
    print(f"\nPrediction: {output}")
    print(f"Predicted class: {np.argmax(output)}")
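
Quantization trades precision for size, so it is worth checking that the smaller model still agrees with the original. The helper below is a sketch (not part of any runtime's API) that compares the argmax predictions of two models; the commented usage assumes the variables from the conversion script above are in scope:

```python
import numpy as np

def top1_agreement(predict_a, predict_b, samples):
    """Fraction of inputs on which two predictors pick the same argmax class.

    predict_a / predict_b -- callables mapping a (1, features) batch to scores
    samples               -- iterable of input batches
    """
    samples = list(samples)
    agree = sum(
        int(np.argmax(predict_a(x)) == np.argmax(predict_b(x)))
        for x in samples
    )
    return agree / len(samples)

# Usage with the example above:
#   def tflite_predict(x):
#       interpreter.set_tensor(input_details[0]["index"], x)
#       interpreter.invoke()
#       return interpreter.get_tensor(output_details[0]["index"])
#
#   score = top1_agreement(
#       lambda x: model.predict(x, verbose=0), tflite_predict,
#       (X_train[i:i+1] for i in range(100)),
#   )
```

On a model this small, dynamic-range quantization should agree with the FP32 original on nearly every sample; a large drop signals calibration problems.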

Apple Core ML

Apple's inference framework optimized for iPhone, iPad, Mac, and Apple Watch:

  • Hardware: CPU, GPU, and Neural Engine (Apple's dedicated ML accelerator)
  • Model format: .mlmodel / .mlpackage
  • Integration: Native Swift/Objective-C APIs, on-device only (no data sent to cloud)
  • Conversion with coremltools

    python
    import coremltools as ct
    import torch
    import torch.nn as nn

    # --- Convert PyTorch model to Core ML ---
    class ImageClassifier(nn.Module):
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 16, 3, padding=1),
                nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )
            self.classifier = nn.Linear(16, 10)

        def forward(self, x):
            x = self.features(x)
            x = x.view(x.size(0), -1)
            return self.classifier(x)

    model = ImageClassifier()
    model.eval()

    # Trace the model
    example_input = torch.randn(1, 3, 224, 224)
    traced_model = torch.jit.trace(model, example_input)

    # Convert to Core ML
    mlmodel = ct.convert(
        traced_model,
        inputs=[ct.ImageType(
            name="image",
            shape=(1, 3, 224, 224),
            scale=1/255.0,  # Normalize pixel values
        )],
        classifier_config=ct.ClassifierConfig(
            class_labels=[f"class_{i}" for i in range(10)]
        ),
        compute_precision=ct.precision.FLOAT16,  # FP16 for Neural Engine
    )

    # Save
    mlmodel.save("ImageClassifier.mlpackage")
    print("Core ML model saved!")

    # Model metadata
    spec = mlmodel.get_spec()
    print(f"Input: {spec.description.input[0].name}")
    print(f"Output: {spec.description.output[0].name}")
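
Once saved, the model can also be exercised from Python on macOS via coremltools' `predict`. The sketch below shows the expected input format; the `classLabel` output name is Core ML's default for classifier models and `image` matches the input name chosen during conversion above:

```python
import numpy as np
from PIL import Image

# Core ML ImageType inputs take PIL images (H, W, C, uint8), not numpy arrays
pixels = np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8)
img = Image.fromarray(pixels)

# Prediction itself runs on macOS only, where the Core ML engine is available:
# import coremltools as ct
# mlmodel = ct.models.MLModel("ImageClassifier.mlpackage")
# result = mlmodel.predict({"image": img})   # input name set during conversion
# print(result["classLabel"])                # top predicted label
```

In a shipping app you would instead load the compiled model through the native Swift/Objective-C APIs; the Python path is mainly useful for validating the conversion.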

ONNX Runtime

ONNX (Open Neural Network Exchange) is an open model format; ONNX Runtime is a cross-platform, hardware-agnostic engine that executes it:

  • Converts from: PyTorch, TensorFlow, scikit-learn, and more
  • Runs on: CPU, GPU, mobile, web (WASM), edge devices
  • Key advantage: Write once, deploy anywhere — single model format for all platforms
TensorRT

NVIDIA's inference optimizer and runtime for GPU deployment:

  • Applies layer fusion, kernel auto-tuning, and precision calibration
  • Typically achieves the highest throughput on NVIDIA GPUs (often 2-5x faster than unoptimized framework execution)
  • Best for server-side inference on NVIDIA hardware
Runtime comparison:

    Runtime        Best For                       Platforms                  Quantization
    TF Lite        Mobile, IoT, microcontrollers  Android, iOS, Linux, MCU   INT8, FP16
    Core ML        Apple devices                  iOS, macOS, watchOS        FP16, INT8
    ONNX Runtime   Cross-platform                 All platforms              INT8, FP16
    TensorRT       NVIDIA GPU servers             Linux, Windows (NVIDIA)    INT8, FP16, INT4

    python
    # ONNX Runtime — Cross-platform inference
    import torch
    import torch.nn as nn
    import onnxruntime as ort
    import numpy as np

    # --- Export PyTorch model to ONNX ---
    class SimpleNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.fc1 = nn.Linear(10, 64)
            self.relu = nn.ReLU()
            self.fc2 = nn.Linear(64, 3)

        def forward(self, x):
            return self.fc2(self.relu(self.fc1(x)))

    model = SimpleNet()
    model.eval()

    dummy = torch.randn(1, 10)
    torch.onnx.export(
        model, dummy, "/tmp/model.onnx",
        input_names=["input"],
        output_names=["output"],
        dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
    )

    # --- Run inference with ONNX Runtime ---
    session = ort.InferenceSession("/tmp/model.onnx")

    # Single inference
    input_data = np.random.randn(1, 10).astype(np.float32)
    result = session.run(None, {"input": input_data})
    print(f"Output shape: {result[0].shape}")
    print(f"Predictions: {result[0]}")

    # Batch inference
    batch_data = np.random.randn(100, 10).astype(np.float32)
    batch_result = session.run(None, {"input": batch_data})
    print(f"\nBatch output shape: {batch_result[0].shape}")

Benchmarking Is Essential

Never assume one runtime is faster than another — always benchmark on your target hardware with your actual model. TF Lite may be fastest on a Pixel phone, Core ML on an iPhone, and TensorRT on an NVIDIA server. Use each runtime's built-in benchmarking tools and test with representative input data.
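
A runtime-agnostic harness along these lines keeps such comparisons fair: same warm-up, same inputs, same statistics. This is a sketch; wrap each runtime's inference call behind the same one-argument callable before timing it:

```python
import time
import statistics

def benchmark(infer, x, warmup=10, runs=100):
    """Measure inference latency of any runtime behind a uniform interface.

    infer  -- callable taking one input batch and returning predictions
    x      -- a representative input (use realistic data, not zeros)
    Returns (mean_ms, p95_ms).
    """
    # Warm-up amortizes one-time costs: allocation, JIT, caches
    for _ in range(warmup):
        infer(x)
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        infer(x)
        latencies.append((time.perf_counter() - start) * 1000.0)
    latencies.sort()
    mean_ms = statistics.mean(latencies)
    p95_ms = latencies[int(0.95 * len(latencies)) - 1]
    return mean_ms, p95_ms

# Example wrappers (assuming the sessions/interpreters from earlier sections):
#   benchmark(lambda x: session.run(None, {"input": x}), input_data)
#   benchmark(lambda x: tflite_predict(x), test_input)
```

Reporting a tail percentile alongside the mean matters on edge devices, where thermal throttling and scheduler jitter can make occasional inferences far slower than the average.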