TensorRT Inference Optimization
Shipping a trained model to the edge: ONNX → TensorRT engine build → layer/tensor fusion (Conv+BN+ReLU collapses into one CBR kernel) → reduced precision (FP16, or INT8 with PTQ calibration) → a custom NMS plugin (IPluginV2) → SSD object-detection inference. The senior MLSys piece the portfolio lacks.
A guided replay of the TensorRT Builder compiling an ONNX model into an inference engine: raw Conv→BN→ReLU chains collapse into single CBR kernels (layer fusion), INT8 PTQ calibration runs, a custom NMS plugin slots into the graph, and a before/after latency bar closes it out.
Why this local version exists
This replays TensorRT's standard, publicly documented behavior; it runs no real engine build and makes no GPU calls. The source 贪心 course is video-only (133 MP4 files, with no code, slides, or subtitles), so the techniques are grounded in the TensorRT docs (the course confirms it teaches them), not lifted from course code. The latency numbers are labeled illustrative: the course gave none, and nothing is presented as measured.
TensorRT engine build (ONNX → SSD inference)
Replays the Builder compiling an ONNX model into an inference engine: layer/tensor fusion collapses Conv+BN+ReLU into one CBR kernel, INT8 PTQ calibration runs, then a custom NMS plugin slots into the graph. Latency numbers are illustrative.
Compute graph (raw ONNX)
fusion drops kernel launches 9 → 3 · auto-tuning picks the fastest kernel
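The 9 → 3 count can be reproduced with a toy rewrite pass. The sketch below is not TensorRT's optimizer; it assumes a linear list of ops where each op costs one kernel launch, and `fuse_cbr` is a hypothetical helper name.

```python
# Toy fusion pass: collapse each Conv -> BN -> ReLU run into one "CBR" node.
# Illustrative list rewrite only; assumes one op equals one kernel launch.

def fuse_cbr(ops):
    """Return a new op list with every Conv, BN, ReLU triple replaced by CBR."""
    fused = []
    i = 0
    while i < len(ops):
        if ops[i:i + 3] == ["Conv", "BN", "ReLU"]:
            fused.append("CBR")   # one launch instead of three
            i += 3
        else:
            fused.append(ops[i])  # non-matching ops pass through unchanged
            i += 1
    return fused


raw_graph = ["Conv", "BN", "ReLU"] * 3   # three chains, nine ops
fused_graph = fuse_cbr(raw_graph)

print(len(raw_graph), "->", len(fused_graph))  # 9 -> 3
```

A chain that does not match the full pattern, such as Conv → ReLU with no BN, is left as-is by this sketch.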
1. Parse ONNX → TensorRT graph
2. Layer/tensor fusion: Conv+BN+ReLU → CBR
3. INT8 PTQ calibration (FP32 → INT8)
4. Custom NMS plugin (IPluginV2DynamicExt)
5. Engine built → before/after latency
builder.build_serialized_network(network, config) → engine.plan
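For context, a real build around that call might look roughly like the sketch below. It assumes the TensorRT 8.x-style Python API and a machine with TensorRT and a GPU installed; `build_engine`, `onnx_path`, and `plan_path` are hypothetical names, and the demo itself runs none of this.

```python
def build_engine(onnx_path, plan_path, use_fp16=True):
    """Parse an ONNX file and write a serialized TensorRT engine (.plan)."""
    # Imported lazily so this module still loads where TensorRT is absent.
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    # Explicit-batch network (assumption: needed on 8.x, deprecated on 10.x).
    flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    network = builder.create_network(flags)
    parser = trt.OnnxParser(network, logger)

    # Step 1: parse ONNX into a TensorRT network definition.
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
            raise RuntimeError("ONNX parse failed: " + "; ".join(errors))

    # Fusion and kernel auto-tuning happen inside the build call itself.
    config = builder.create_builder_config()
    if use_fp16:
        config.set_flag(trt.BuilderFlag.FP16)

    serialized = builder.build_serialized_network(network, config)
    if serialized is None:
        raise RuntimeError("engine build failed")
    with open(plan_path, "wb") as f:
        f.write(serialized)
    return plan_path
```

INT8 would additionally need `trt.BuilderFlag.INT8` plus a calibrator or explicit dynamic ranges, which this sketch leaves out.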
What to try
Run the build and watch the most compelling step: three Conv→BN→ReLU chains visibly merge into single CBR kernels (9 → 3 kernel launches).
See INT8 PTQ calibration tag every fused node, then the custom NMS plugin slot into the graph as its own node.
Read the before/after latency bar — labeled illustrative, since the video-only course shipped no benchmarks.
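The Conv→BN→ReLU merge rests on simple arithmetic: at inference time BatchNorm is a fixed per-channel scale and shift, so it can be folded into the convolution's weights and bias. The sketch below checks this for a single output channel, with the convolution reduced to a dot product; all values and helper names are made up for illustration.

```python
import math


def conv_bn_relu(x, w, b, gamma, beta, mean, var, eps=1e-5):
    """Unfused reference: dot-product conv, then BatchNorm, then ReLU."""
    conv = sum(wi * xi for wi, xi in zip(w, x)) + b
    bn = gamma * (conv - mean) / math.sqrt(var + eps) + beta
    return max(0.0, bn)


def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm statistics into the conv weights and bias."""
    s = gamma / math.sqrt(var + eps)          # per-channel scale
    return [wi * s for wi in w], (b - mean) * s + beta


def cbr(x, w_folded, b_folded):
    """Fused kernel: one multiply-accumulate pass plus ReLU."""
    return max(0.0, sum(wi * xi for wi, xi in zip(w_folded, x)) + b_folded)


x = [1.0, -2.0, 0.5]
w, b = [0.2, -0.1, 0.4], 0.05
bn_params = dict(gamma=1.5, beta=-0.1, mean=0.3, var=0.25)

w_f, b_f = fold_bn(w, b, **bn_params)
unfused = conv_bn_relu(x, w, b, **bn_params)
fused = cbr(x, w_f, b_f)
print(math.isclose(unfused, fused, rel_tol=1e-9))
```

The folded weights are computed once at build time, so the fused kernel does no extra work per inference.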
What this demo proves
You can ship models to the edge — compile a trained model into a low-latency inference engine, not just train it.
You understand graph compilers: layer fusion, kernel auto-tuning, and the line between ops the builder can decompose into built-in layers and ops, such as NMS, that need a plugin.
You can write a custom C++/CUDA op (IPluginV2 NMS plugin) against a production inference runtime and reason about INT8 quantization trade-offs — the senior MLSys skill the rest of the portfolio lacks.
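The INT8 trade-off can be made concrete with a small sketch. It uses plain symmetric max calibration, which is an assumption chosen for brevity; TensorRT's own calibrators, such as the entropy calibrator, pick ranges differently. The sample activations are made up.

```python
def calibrate_scale(samples):
    """Max calibration: map the largest observed magnitude to 127."""
    amax = max(abs(x) for x in samples)
    return amax / 127.0


def quantize(x, scale):
    """FP32 -> INT8: round to the nearest step, then clamp to [-127, 127]."""
    q = round(x / scale)
    return max(-127, min(127, q))


def dequantize(q, scale):
    """INT8 -> FP32 approximation."""
    return q * scale


calib = [-2.0, -0.5, 0.0, 0.75, 1.27]   # made-up calibration activations
scale = calibrate_scale(calib)
q = [quantize(x, scale) for x in calib]

# In-range values lose at most half a step; out-of-range values are clipped.
errors = [abs(dequantize(qi, scale) - x) for qi, x in zip(q, calib)]
clipped = dequantize(quantize(10.0, scale), scale)
print(max(errors) <= scale / 2 + 1e-12, clipped)
```

A wider calibration range reduces clipping but enlarges the step size, so every in-range value is represented more coarsely; that tension is what a calibrator resolves.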
Pipeline
ONNX → TensorRT engine build → fusion → INT8 PTQ → NMS plugin → SSD inference
Custom op
NMS as an IPluginV2 / IPluginV2DynamicExt CUDA plugin
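As a reference for what such a plugin computes, here is a minimal greedy NMS in plain Python. It is a CPU sketch of the algorithm only, not the plugin's CUDA code; the box format `(x1, y1, x2, y2)`, the 0.5 threshold, and the sample boxes are assumptions.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop boxes that overlap it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it overlaps no already-kept box above threshold.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # [0, 2]
```

The loop's dependence on previously kept boxes is data-dependent control flow, which is one reason NMS is usually supplied as a dedicated op and not fused with neighboring layers.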
Best signal
Graph compilers + quantization + custom CUDA ops — edge deployment