TF Lite, Core ML & Inference Runtimes

TF Lite conversion and inference, Core ML with coremltools, ONNX Runtime, TensorRT, and model benchmarking

~45 min

Inference Runtimes for Edge Deployment

Training a model is only half the battle — you need an inference runtime optimized for your target hardware. Each runtime applies hardware-specific optimizations (operator fusion, memory planning, platform-specific instructions) that make models run faster than naive execution.

Why Not Just Use PyTorch/TensorFlow?

Training frameworks (PyTorch, TensorFlow) are designed for flexibility and debugging during development. Inference runtimes (TF Lite, Core ML, ONNX Runtime, TensorRT) are designed for speed and efficiency during deployment. They strip out training-only features (autograd, optimizer state) and apply aggressive optimizations for specific hardware.
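
To make that concrete, here is a minimal PyTorch sketch (assuming `torch` is installed) of the autograd bookkeeping a training framework performs on every forward pass, which inference runtimes drop entirely:

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 512)
x = torch.randn(64, 512)

# Default forward pass: autograd records a graph for a possible backward()
y_train_mode = model(x)
print(y_train_mode.requires_grad)  # True: extra memory and bookkeeping

# Inference mode: autograd disabled, closer to what a dedicated runtime does
with torch.inference_mode():
    y_infer = model(x)
print(y_infer.requires_grad)  # False
```

Inference runtimes go further still, since they can assume weights never change: they fuse operators, pre-plan memory, and discard the graph machinery altogether.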

TensorFlow Lite (TF Lite)

Google's lightweight inference runtime for mobile and edge devices:

  • Platforms: Android, iOS, Linux, microcontrollers
  • Optimizations: Quantization, operator fusion, delegate support (GPU, NNAPI, Hexagon DSP)
  • Model format: .tflite (FlatBuffer-based, no parsing overhead)
  • Conversion Pipeline

    TensorFlow Model (.pb / SavedModel / Keras)
        → TFLite Converter (optimize, quantize)
            → .tflite file
                → TFLite Interpreter (on-device inference)
    

    python
    import tensorflow as tf
    import numpy as np
    import os

    # --- Step 1: Create a Keras model ---
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(3, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

    # Train on dummy data
    X_train = np.random.randn(1000, 10).astype(np.float32)
    y_train = np.random.randint(0, 3, 1000)
    model.fit(X_train, y_train, epochs=5, verbose=0)

    # --- Step 2: Convert to TF Lite ---
    converter = tf.lite.TFLiteConverter.from_keras_model(model)

    # Option A: Default conversion (FP32)
    tflite_model_fp32 = converter.convert()

    # Option B: Dynamic range quantization (INT8 weights, FP32 activations)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    tflite_model_int8 = converter.convert()

    # Option C: Full integer quantization (requires calibration data)
    def representative_dataset():
        for i in range(100):
            yield [X_train[i:i+1]]

    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = representative_dataset
    converter.target_spec.supported_ops = [
        tf.lite.OpsSet.TFLITE_BUILTINS_INT8
    ]
    converter.inference_input_type = tf.uint8
    converter.inference_output_type = tf.uint8
    tflite_model_full_int8 = converter.convert()

    # --- Step 3: Save and compare sizes ---
    for name, model_bytes in [
        ("fp32", tflite_model_fp32),
        ("int8_dynamic", tflite_model_int8),
        ("int8_full", tflite_model_full_int8),
    ]:
        path = f"/tmp/model_{name}.tflite"
        with open(path, "wb") as f:
            f.write(model_bytes)
        size_kb = os.path.getsize(path) / 1024
        print(f"{name:15s}: {size_kb:>8.1f} KB")

    # --- Step 4: Run inference with TF Lite Interpreter ---
    interpreter = tf.lite.Interpreter(model_content=tflite_model_int8)
    interpreter.allocate_tensors()

    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    # Single inference (dynamic-range models still take FP32 inputs)
    test_input = np.random.randn(1, 10).astype(np.float32)
    interpreter.set_tensor(input_details[0]["index"], test_input)
    interpreter.invoke()

    output = interpreter.get_tensor(output_details[0]["index"])
    print(f"\nPrediction: {output}")
    print(f"Predicted class: {np.argmax(output)}")
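
Quantization trades precision for size, so it is worth checking that the smaller model still agrees with the original. The helper below is a sketch (not part of any runtime's API) that compares the argmax predictions of two models; the commented usage assumes the variables from the conversion script above are in scope:

```python
import numpy as np

def top1_agreement(predict_a, predict_b, samples):
    """Fraction of inputs on which two predictors pick the same argmax class.

    predict_a / predict_b -- callables mapping a (1, features) batch to scores
    samples               -- iterable of input batches
    """
    samples = list(samples)
    agree = sum(
        int(np.argmax(predict_a(x)) == np.argmax(predict_b(x)))
        for x in samples
    )
    return agree / len(samples)

# Usage with the example above:
#   def tflite_predict(x):
#       interpreter.set_tensor(input_details[0]["index"], x)
#       interpreter.invoke()
#       return interpreter.get_tensor(output_details[0]["index"])
#
#   score = top1_agreement(
#       lambda x: model.predict(x, verbose=0), tflite_predict,
#       (X_train[i:i+1] for i in range(100)),
#   )
```

On a model this small, dynamic-range quantization should agree with the FP32 original on nearly every sample; a large drop signals calibration problems.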

Apple Core ML

Apple's inference framework optimized for iPhone, iPad, Mac, and Apple Watch:

  • Hardware: CPU, GPU, and Neural Engine (Apple's dedicated ML accelerator)
  • Model format: .mlmodel / .mlpackage
  • Integration: Native Swift/Objective-C APIs, on-device only (no data sent to cloud)
  • Conversion with coremltools

    python
    import coremltools as ct
    import torch
    import torch.nn as nn

    # --- Convert PyTorch model to Core ML ---
    class ImageClassifier(nn.Module):
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 16, 3, padding=1),
                nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )
            self.classifier = nn.Linear(16, 10)

        def forward(self, x):
            x = self.features(x)
            x = x.view(x.size(0), -1)
            return self.classifier(x)

    model = ImageClassifier()
    model.eval()

    # Trace the model
    example_input = torch.randn(1, 3, 224, 224)
    traced_model = torch.jit.trace(model, example_input)

    # Convert to Core ML
    mlmodel = ct.convert(
        traced_model,
        inputs=[ct.ImageType(
            name="image",
            shape=(1, 3, 224, 224),
            scale=1/255.0,  # Normalize pixel values
        )],
        classifier_config=ct.ClassifierConfig(
            class_labels=[f"class_{i}" for i in range(10)]
        ),
        compute_precision=ct.precision.FLOAT16,  # FP16 for Neural Engine
    )

    # Save
    mlmodel.save("ImageClassifier.mlpackage")
    print("Core ML model saved!")

    # Model metadata
    spec = mlmodel.get_spec()
    print(f"Input: {spec.description.input[0].name}")
    print(f"Output: {spec.description.output[0].name}")
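
Once saved, the model can also be exercised from Python on macOS via coremltools' `predict`. The sketch below shows the expected input format; the `classLabel` output name is Core ML's default for classifier models and `image` matches the input name chosen during conversion above:

```python
import numpy as np
from PIL import Image

# Core ML ImageType inputs take PIL images (H, W, C, uint8), not numpy arrays
pixels = np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8)
img = Image.fromarray(pixels)

# Prediction itself runs on macOS only, where the Core ML engine is available:
# import coremltools as ct
# mlmodel = ct.models.MLModel("ImageClassifier.mlpackage")
# result = mlmodel.predict({"image": img})   # input name set during conversion
# print(result["classLabel"])                # top predicted label
```

In a shipping app you would instead load the compiled model through the native Swift/Objective-C APIs; the Python path is mainly useful for validating the conversion.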

ONNX Runtime

ONNX (Open Neural Network Exchange) is an open model format; ONNX Runtime is a cross-platform, hardware-agnostic engine that executes it:

  • Converts from: PyTorch, TensorFlow, scikit-learn, and more
  • Runs on: CPU, GPU, mobile, web (WASM), edge devices
  • Key advantage: Write once, deploy anywhere — single model format for all platforms
TensorRT

NVIDIA's inference optimizer and runtime for GPU deployment:

  • Applies layer fusion, kernel auto-tuning, and precision calibration
  • Typically achieves the highest throughput on NVIDIA GPUs (often 2-5x faster than unoptimized framework execution)
  • Best for server-side inference on NVIDIA hardware
Runtime comparison:

    Runtime        Best For                       Platforms                  Quantization
    TF Lite        Mobile, IoT, microcontrollers  Android, iOS, Linux, MCU   INT8, FP16
    Core ML        Apple devices                  iOS, macOS, watchOS        FP16, INT8
    ONNX Runtime   Cross-platform                 All platforms              INT8, FP16
    TensorRT       NVIDIA GPU servers             Linux, Windows (NVIDIA)    INT8, FP16, INT4

    python
    # ONNX Runtime — Cross-platform inference
    import torch
    import torch.nn as nn
    import onnxruntime as ort
    import numpy as np

    # --- Export PyTorch model to ONNX ---
    class SimpleNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.fc1 = nn.Linear(10, 64)
            self.relu = nn.ReLU()
            self.fc2 = nn.Linear(64, 3)

        def forward(self, x):
            return self.fc2(self.relu(self.fc1(x)))

    model = SimpleNet()
    model.eval()

    dummy = torch.randn(1, 10)
    torch.onnx.export(
        model, dummy, "/tmp/model.onnx",
        input_names=["input"],
        output_names=["output"],
        dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
    )

    # --- Run inference with ONNX Runtime ---
    session = ort.InferenceSession("/tmp/model.onnx")

    # Single inference
    input_data = np.random.randn(1, 10).astype(np.float32)
    result = session.run(None, {"input": input_data})
    print(f"Output shape: {result[0].shape}")
    print(f"Predictions: {result[0]}")

    # Batch inference
    batch_data = np.random.randn(100, 10).astype(np.float32)
    batch_result = session.run(None, {"input": batch_data})
    print(f"\nBatch output shape: {batch_result[0].shape}")

Benchmarking Is Essential

Never assume one runtime is faster than another — always benchmark on your target hardware with your actual model. TF Lite may be fastest on a Pixel phone, Core ML on an iPhone, and TensorRT on an NVIDIA server. Use each runtime's built-in benchmarking tools and test with representative input data.
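
A runtime-agnostic harness along these lines keeps such comparisons fair: same warm-up, same inputs, same statistics. This is a sketch; wrap each runtime's inference call behind the same one-argument callable before timing it:

```python
import time
import statistics

def benchmark(infer, x, warmup=10, runs=100):
    """Measure inference latency of any runtime behind a uniform interface.

    infer  -- callable taking one input batch and returning predictions
    x      -- a representative input (use realistic data, not zeros)
    Returns (mean_ms, p95_ms).
    """
    # Warm-up amortizes one-time costs: allocation, JIT, caches
    for _ in range(warmup):
        infer(x)
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        infer(x)
        latencies.append((time.perf_counter() - start) * 1000.0)
    latencies.sort()
    mean_ms = statistics.mean(latencies)
    p95_ms = latencies[int(0.95 * len(latencies)) - 1]
    return mean_ms, p95_ms

# Example wrappers (assuming the sessions/interpreters from earlier sections):
#   benchmark(lambda x: session.run(None, {"input": x}), input_data)
#   benchmark(lambda x: tflite_predict(x), test_input)
```

Reporting a tail percentile alongside the mean matters on edge devices, where thermal throttling and scheduler jitter can make occasional inferences far slower than the average.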