Engine-build replay

TensorRT Inference Optimization

Shipping a trained model to the edge: ONNX → TensorRT engine build → layer/tensor fusion (Conv+BN+ReLU collapses into one CBR kernel) → reduced precision (FP16, or INT8 with PTQ calibration) → a custom NMS plugin (IPluginV2 / IPluginV2DynamicExt) → SSD object-detection inference. The senior MLSys piece the portfolio lacks.

A guided replay of the TensorRT Builder compiling an ONNX model into an inference engine: raw Conv→BN→ReLU chains collapse into single CBR kernels (layer fusion), INT8 PTQ calibration runs, a custom NMS plugin slots into the graph, and a before/after latency bar closes it out.

TensorRT · INT8 · Layer Fusion · CUDA Plugin · ONNX


Why this local version exists

This page replays TensorRT's standard, publicly documented behavior; it runs no real engine build and never touches a GPU. The source 贪心 course is video-only (133 MP4 files; no code, slides, or subtitles), so the techniques are grounded in the TensorRT documentation, which the course confirms it teaches, rather than lifted from course code. The latency numbers are labeled illustrative: the course gave none, and nothing is presented as measured.

Interactive Preview

TensorRT engine build (ONNX → SSD inference)

Replays the Builder compiling an ONNX model into an inference engine: layer/tensor fusion collapses Conv+BN+ReLU into one CBR kernel, INT8 PTQ calibration runs, then a custom NMS plugin slots into the graph. Latency numbers are illustrative.
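
Why Conv+BN+ReLU can collapse into one kernel: at inference time batch norm is a fixed per-channel affine transform, so it can be folded into the convolution's weights and bias ahead of time. The sketch below checks that arithmetic in pure Python. It assumes a 1×1 convolution evaluated at a single pixel (so the convolution reduces to a dot product per output channel); the function names and all numbers are illustrative, not taken from TensorRT or from the course.

```python
import math

def conv1x1(x, w, b):
    """1x1 convolution at one pixel: one dot product per output channel."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]

def batchnorm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch norm: a fixed per-channel affine transform."""
    return [g * (v - m) / math.sqrt(s + eps) + bt
            for v, g, bt, m, s in zip(y, gamma, beta, mean, var)]

def relu(y):
    return [max(0.0, v) for v in y]

def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BN into the conv: w' = w * k, b' = (b - mean) * k + beta,
    where k = gamma / sqrt(var + eps) per output channel."""
    k = [g / math.sqrt(s + eps) for g, s in zip(gamma, var)]
    w_f = [[wi * kc for wi in row] for row, kc in zip(w, k)]
    b_f = [(bi - m) * kc + bt for bi, m, kc, bt in zip(b, mean, k, beta)]
    return w_f, b_f

# Illustrative values: 3 input channels, 2 output channels.
x = [0.5, -1.0, 2.0]
w = [[0.2, -0.4, 0.1], [0.7, 0.3, -0.5]]
b = [0.05, -0.1]
gamma, beta = [1.5, 0.8], [0.1, -0.2]
mean, var = [0.3, -0.6], [0.25, 1.0]

# Unfused path: three separate ops.
unfused = relu(batchnorm(conv1x1(x, w, b), gamma, beta, mean, var))

# Fused path: one conv with folded parameters, ReLU applied in the same pass.
w_f, b_f = fold_bn(w, b, gamma, beta, mean, var)
fused = relu(conv1x1(x, w_f, b_f))

max_diff = max(abs(a - c) for a, c in zip(unfused, fused))
```

Here `max_diff` stays at floating-point rounding level, which is the property that makes the fusion safe: the fused kernel computes the same function with fewer passes over memory.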

Compute graph (raw ONNX)

Conv → ReLU

Conv → ReLU

Conv → ReLU

NMS (pending)

SSD detection head

fusion drops kernel launches 9 → 3 · auto-tuning picks the fastest kernel
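
The 9 → 3 count can be reproduced with a toy pattern-matching pass. This is a minimal sketch under two loud assumptions: the graph is a flat list of op names rather than a real DAG, and each node costs exactly one kernel launch. `fuse_cbr` is a hypothetical helper, not a TensorRT function; the real Builder matches patterns on a full graph and checks more conditions before fusing.

```python
def fuse_cbr(nodes):
    """Collapse every consecutive Conv, BN, ReLU triple into one 'CBR' node."""
    fused, i = [], 0
    while i < len(nodes):
        if nodes[i:i + 3] == ["Conv", "BN", "ReLU"]:
            fused.append("CBR")
            i += 3          # skip the three ops that were merged
        else:
            fused.append(nodes[i])
            i += 1          # leave any other op untouched
    return fused

# Three Conv -> BN -> ReLU chains, one launch per node (toy assumption).
backbone = ["Conv", "BN", "ReLU"] * 3
fused_backbone = fuse_cbr(backbone)

# An op that matches no pattern (here the NMS node) passes through unchanged.
with_nms = fuse_cbr(backbone + ["NMS"])
```

`len(backbone)` is 9 and `len(fused_backbone)` is 3, matching the caption; the NMS node survives as its own node because nothing in the pattern table covers it.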

1. Parse ONNX → TensorRT graph

2. Layer/tensor fusion: Conv+BN+ReLU → CBR

3. INT8 PTQ calibration (FP32 → INT8)

4. Custom NMS plugin (IPluginV2DynamicExt)

5. Engine built → before/after latency

builder.build_serialized_network(network, config) → engine.plan
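
For orientation, the five steps above roughly correspond to the following calls in the TensorRT Python API. This is a hedged sketch written against the 8.x-era API, not code from the course or from this demo (which runs no real build). `build_engine`, `onnx_path`, `plan_path`, and `calibrator` are hypothetical names; `calibrator` is assumed to be an object implementing one of TensorRT's INT8 calibrator interfaces. It needs a TensorRT install and an NVIDIA GPU, so the import is deferred until the function is called.

```python
def build_engine(onnx_path, plan_path, calibrator=None, fp16=True):
    """Parse an ONNX file and write a serialized TensorRT engine (sketch)."""
    import tensorrt as trt  # deferred: only needed when a build is actually run

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    # Explicit-batch flag: needed on TensorRT 8.x; newer releases deprecate it.
    flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    network = builder.create_network(flags)

    # Step 1: parse ONNX into a TensorRT network definition.
    parser = trt.OnnxParser(network, logger)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
            raise RuntimeError("ONNX parse failed: " + "; ".join(errors))

    # Step 3: allow reduced precision; INT8 additionally needs a calibrator.
    config = builder.create_builder_config()
    if fp16:
        config.set_flag(trt.BuilderFlag.FP16)
    if calibrator is not None:
        config.set_flag(trt.BuilderFlag.INT8)
        config.int8_calibrator = calibrator

    # Steps 2 and 5: fusion and kernel auto-tuning happen inside this call.
    serialized = builder.build_serialized_network(network, config)
    if serialized is None:
        raise RuntimeError("engine build failed")
    with open(plan_path, "wb") as f:
        f.write(serialized)
    return plan_path
```

A call would look like `build_engine("model.onnx", "engine.plan")`, with both file names being placeholders. Step 4 is not shown: a custom plugin has to be registered with the plugin registry before parsing so the parser can resolve the NMS node.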

What to try

Run the build and watch the most compelling step: three Conv→BN→ReLU chains visibly merge into single CBR kernels (9 → 3 kernel launches).

See INT8 PTQ calibration tag every fused node, then the custom NMS plugin slot into the graph as its own node.

Read the before/after latency bar — labeled illustrative, since the video-only course shipped no benchmarks.
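
What calibration produces for each tagged node is, in essence, a scale that maps FP32 values onto the INT8 range. The sketch below uses plain max calibration with symmetric per-tensor quantization as a stand-in; TensorRT's own calibrators (for example its entropy-based one) choose the threshold differently, so treat this as an illustration of the FP32 → INT8 mapping and its trade-off, not of TensorRT's algorithm. All names and numbers are illustrative.

```python
def calibrate_scale(samples):
    """Max calibration: map the largest observed |activation| to 127."""
    amax = max(abs(v) for batch in samples for v in batch)
    return amax / 127.0

def quantize(x, scale):
    """FP32 -> INT8: divide by scale, round to nearest, clamp to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in x]

def dequantize(q, scale):
    """INT8 -> FP32 approximation of the original values."""
    return [v * scale for v in q]

# Calibration batches stand in for activations seen on representative inputs.
calib_batches = [[0.1, -0.8, 2.54], [1.2, -2.0, 0.3]]
scale = calibrate_scale(calib_batches)   # 2.54 / 127, about 0.02

x = [0.5, -1.3, 3.0]                     # 3.0 lies outside the calibrated range
q = quantize(x, scale)                   # the out-of-range value saturates at 127
x_hat = dequantize(q, scale)
```

In-range values come back within half a quantization step of the original, while the out-of-range value saturates at the calibrated maximum. That is the core INT8 trade-off: a wider calibrated range avoids clipping but makes every step coarser.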

What this demo proves

You can ship models to the edge — compile a trained model into a low-latency inference engine, not just train it.

You understand graph compilers: layer fusion, kernel auto-tuning, and the line between decomposable and non-decomposable ops.

You can write a custom C++/CUDA op (IPluginV2 NMS plugin) against a production inference runtime and reason about INT8 quantization trade-offs — the senior MLSys skill the rest of the portfolio lacks.
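
For reference, this is the computation an NMS op performs, written as greedy NMS in pure Python. It is a CPU sketch of the algorithm only, not the plugin: a real plugin would implement the same logic in C++/CUDA behind the plugin interface. The box format `(x1, y1, x2, y2)`, the 0.5 threshold, and the sample boxes are assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score and keep a box only if it
    does not overlap an already-kept box by more than the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)   # indices of surviving boxes
```

The second box overlaps the first with IoU 81/119 and is suppressed, so `kept` holds indices 0 and 2. The data-dependent loop and variable-length output are what make an op like this awkward to express as a chain of standard layers.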

Pipeline

ONNX → TensorRT engine build → fusion → INT8 PTQ → NMS plugin → SSD inference

Custom op

NMS as an IPluginV2 / IPluginV2DynamicExt CUDA plugin

Best signal

Graph compilers + quantization + custom CUDA ops — edge deployment
