Engine-build replay

TensorRT Inference Optimization

Shipping a trained model to the edge: ONNX → TensorRT engine build → layer/tensor fusion (Conv+BN+ReLU collapses into one CBR kernel) → INT8 PTQ calibration (FP16 needs no calibration data) → a custom NMS plugin (IPluginV2 / IPluginV2DynamicExt) → SSD object-detection inference. This is the senior MLSys piece the rest of the portfolio lacks.

A guided replay of the TensorRT Builder compiling an ONNX model into an inference engine: raw Conv→BN→ReLU chains collapse into single CBR kernels (layer fusion), INT8 PTQ calibration runs, a custom NMS plugin slots into the graph, and a before/after latency bar closes it out.

TensorRT · INT8 · Layer Fusion · CUDA Plugin · ONNX

Why this local version exists

This replays TensorRT's standard, publicly documented behavior; it runs no real engine build and calls no GPU. The source 贪心 course is video-only (133 MP4 files, with no code, slides, or subtitles), so the techniques are grounded in the TensorRT docs (the course confirms it teaches them), not lifted from course code. The latency numbers are labeled illustrative: the course gave none, and nothing is presented as measured.

Interactive Preview

TensorRT engine build (ONNX → SSD inference)

Replays the Builder compiling an ONNX model into an inference engine: layer/tensor fusion collapses Conv+BN+ReLU into one CBR kernel, INT8 PTQ calibration runs, then a custom NMS plugin slots into the graph. Latency numbers are illustrative.

Compute graph (raw ONNX)

Conv
BN
ReLU
Conv
BN
ReLU
Conv
BN
ReLU
NMS (pending)
SSD detection head

fusion drops kernel launches 9 → 3 · auto-tuning picks the fastest kernel
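The 9 → 3 count above can be reproduced with a toy pattern-matching pass. This is a minimal sketch over a flat op list, not TensorRT's actual optimizer; the function name `fuse_cbr` and the list-of-strings graph are assumptions made for illustration.

```python
# Toy fusion pass over a linear op list (illustrative only, not TensorRT's optimizer).
def fuse_cbr(ops):
    """Collapse each consecutive Conv, BN, ReLU triple into one "CBR" node."""
    fused = []
    i = 0
    while i < len(ops):
        if ops[i:i + 3] == ["Conv", "BN", "ReLU"]:
            fused.append("CBR")   # one kernel replaces three
            i += 3
        else:
            fused.append(ops[i])  # ops that do not match pass through unchanged
            i += 1
    return fused

# Same shape as the graph in the preview: three Conv/BN/ReLU chains, then NMS.
raw_graph = ["Conv", "BN", "ReLU"] * 3 + ["NMS"]
fused_graph = fuse_cbr(raw_graph)

# Count only the conv-chain kernels; NMS stays a separate node either way.
launches_before = sum(op != "NMS" for op in raw_graph)
launches_after = sum(op != "NMS" for op in fused_graph)
print(fused_graph)
print(launches_before, "->", launches_after)
```

A real graph compiler matches on a DAG with shape and dtype constraints rather than a flat list, but the bookkeeping is the same: each matched chain costs one launch instead of three.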

1. Parse ONNX → TensorRT graph

2. Layer/tensor fusion: Conv+BN+ReLU → CBR

3. INT8 PTQ calibration (FP32 → INT8)

4. Custom NMS plugin (IPluginV2DynamicExt)

5. Engine built → before/after latency

builder.build_serialized_network(network, config) → engine.plan
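For readers who want to see where that call sits in a real build, here is a hedged sketch using the TensorRT Python API. It is not part of the replay (which calls no GPU). The function name `build_engine` and both path parameters are hypothetical, and the explicit-batch flag assumes a TensorRT 8.x style setup; newer releases treat explicit batch as the default and deprecate the flag.

```python
def build_engine(onnx_path, plan_path, use_fp16=True):
    """Parse an ONNX file and write a serialized TensorRT engine to plan_path."""
    # Imported lazily: this needs a TensorRT install and an NVIDIA GPU.
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    # Assumption: TensorRT 8.x, where the ONNX parser requires explicit batch.
    flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    network = builder.create_network(flags)
    parser = trt.OnnxParser(network, logger)

    # Step 1: parse ONNX into a TensorRT network definition.
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
            raise RuntimeError("ONNX parse failed: " + "; ".join(errors))

    # Steps 2-3: fusion and kernel auto-tuning happen inside the build call;
    # the config only selects which precisions the builder may use.
    config = builder.create_builder_config()
    if use_fp16:
        config.set_flag(trt.BuilderFlag.FP16)

    serialized = builder.build_serialized_network(network, config)
    if serialized is None:
        raise RuntimeError("engine build failed")
    with open(plan_path, "wb") as f:
        f.write(serialized)
    return plan_path
```

INT8 would additionally need `trt.BuilderFlag.INT8` plus a calibrator supplying representative batches, and a custom NMS plugin would have to be registered before parsing; both are left out to keep the sketch short.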

What to try

Run the build and watch the most compelling step: three Conv→BN→ReLU chains visibly merge into single CBR kernels (9 → 3 kernel launches).

See INT8 PTQ calibration tag every fused node, then the custom NMS plugin slot into the graph as its own node.

Read the before/after latency bar — labeled illustrative, since the video-only course shipped no benchmarks.
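The INT8 calibration step boils down to choosing a scale per tensor from representative data. The sketch below uses plain max calibration on made-up activations. That is a deliberate simplification: TensorRT's entropy calibrator instead picks a clipping threshold by minimizing KL divergence. All names here are hypothetical.

```python
def calibrate_scale(samples):
    """Max calibration: map the largest observed magnitude to the INT8 limit 127."""
    amax = max(abs(x) for x in samples)
    return amax / 127.0

def quantize(x, scale):
    """FP32 -> INT8: divide by scale, round to nearest, clamp to [-127, 127]."""
    q = round(x / scale)
    return max(-127, min(127, q))

def dequantize(q, scale):
    """INT8 -> FP32 approximation of the original value."""
    return q * scale

# Made-up activations standing in for a calibration batch.
calibration_batch = [-2.0, -0.5, 0.0, 0.25, 1.0, 4.0]
scale = calibrate_scale(calibration_batch)

for x in calibration_batch:
    q = quantize(x, scale)
    err = abs(x - dequantize(q, scale))
    print(f"x={x:+.2f}  q={q:+4d}  abs_err={err:.4f}")
```

In-range values land within half a quantization step of the original; values beyond the calibrated maximum are clamped. That clamping versus resolution tension is the trade-off an entropy-based calibrator tries to balance.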

What this demo proves

You can ship models to the edge — compile a trained model into a low-latency inference engine, not just train it.

You understand graph compilers: layer fusion, kernel auto-tuning, and the line between decomposable and non-decomposable ops.

You can write a custom C++/CUDA op (IPluginV2 NMS plugin) against a production inference runtime and reason about INT8 quantization trade-offs — the senior MLSys skill the rest of the portfolio lacks.
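Conv+BN counts as a decomposable pair because inference-mode BatchNorm is an affine transform, so it can be folded into the convolution's weights and bias ahead of time. The sketch below checks that on a single scalar channel (a 1x1 convolution reduced to one multiply-add). The parameter values are invented, and this is the textbook folding identity, not code lifted from TensorRT.

```python
import math

def conv_bn_relu(x, w, b, gamma, beta, mean, var, eps=1e-5):
    """Unfused reference: three separate ops on one scalar channel."""
    y = w * x + b                                          # Conv (1x1, one channel)
    y = gamma * (y - mean) / math.sqrt(var + eps) + beta   # BatchNorm, inference mode
    return max(0.0, y)                                     # ReLU

def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BN statistics into the conv weight and bias."""
    s = gamma / math.sqrt(var + eps)
    return w * s, (b - mean) * s + beta

def cbr(x, w_folded, b_folded):
    """Fused kernel: one multiply-add plus ReLU."""
    return max(0.0, w_folded * x + b_folded)

# Invented parameters for illustration.
params = dict(w=0.8, b=0.1, gamma=1.5, beta=-0.2, mean=0.3, var=4.0)
w_f, b_f = fold_bn(**params)

for x in (-3.0, 0.0, 2.5):
    ref = conv_bn_relu(x, **params)
    out = cbr(x, w_f, b_f)
    print(x, ref, out)
```

The fused and unfused paths agree up to floating-point rounding, which is why this fusion changes latency but not results. INT8 quantization, by contrast, does change the numbers.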

Pipeline

ONNX → TensorRT engine build → fusion → INT8 PTQ → NMS plugin → SSD inference

Custom op

NMS as an IPluginV2 / IPluginV2DynamicExt CUDA plugin
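As a point of reference for what such a plugin has to compute, here is a CPU sketch of greedy non-maximum suppression in plain Python. It is not the CUDA plugin and does not use the TensorRT plugin interface. The box format `(x1, y1, x2, y2)` and the 0.5 threshold are assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop any later box that overlaps a kept one."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep

# Two heavily overlapping boxes and one separate box (made-up data).
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

Each keep-or-drop decision depends on the boxes kept before it, so the op has data-dependent control flow and a variable-length output. That makes it a poor fit for fusion into a CBR-style kernel and a natural candidate for a standalone plugin node.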

Best signal

Graph compilers + quantization + custom CUDA ops — edge deployment