Multimodal Vision LLM (PandaGPT)
ImageBind binds 6 modalities (image/text/audio/depth/thermal/IMU) into one embedding space; a single linear projection feeds Vicuna — PandaGPT trains only on image-text yet emergently understands audio/depth. Plus VPT visual-prompt tuning for pathology downstream transfer.
From Zhimo's "Multimodal Vision LLM" course. Two threads: ① how to turn a language model into one that can see, hear, and sense (PandaGPT = ImageBind + Vicuna); ② how to transfer to a vertical domain (pathology images) with almost no backbone changes via visual prompt tuning (VPT).
PandaGPT: one linear layer connects 6 modalities to an LLM
PandaGPT's trick is to stand on two giants and train only a projection + LoRA:
6 modalities → ImageBind_huge (frozen, out 1024) → llama_proj (1024→4096) → 1 soft token → LoRA-Vicuna-7B → text
- ImageBind (Meta FAIR, CVPR 2023) aligns image / text / audio / depth / thermal / IMU into one embedding space (out 1024) — the basis of "emergent zero-shot"; fully frozen
- Vicuna-7B is the language brain, with LoRA (r=32, alpha=32, target=q/k/v/o_proj)
- The bridge is a single nn.Linear(1024, 4096): each image embedding gets unsqueeze(1), so one image = one soft token; multimodal inference sums the per-modality tokens
- Only LoRA + the projection train (delta weights saved); ImageBind + Vicuna stay frozen. Training: DeepSpeed on 8×A100, 2 epochs / batch 64 / lr 5e-4 / max_len 1024
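The trainable budget implied by the list above can be worked out by hand. The sketch below is a back-of-the-envelope count, not taken from the course code. It assumes Vicuna-7B follows the LLaMA-7B layout (32 decoder layers, hidden size 4096, bias-free 4096×4096 q/k/v/o projections) and that LoRA adds the usual pair of rank-r matrices per target module; the function names are illustrative.

```python
def linear_params(in_features: int, out_features: int, bias: bool = True) -> int:
    """Parameter count of a dense layer such as nn.Linear(in, out)."""
    return in_features * out_features + (out_features if bias else 0)


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) next to a frozen weight."""
    return rank * in_features + out_features * rank


# Values stated in the text.
IMAGEBIND_DIM = 1024
LLM_DIM = 4096
LORA_RANK = 32
LORA_TARGETS = ("q_proj", "k_proj", "v_proj", "o_proj")

# Assumption (not from the text): LLaMA-7B-style depth of 32 decoder layers.
NUM_LAYERS = 32

projection = linear_params(IMAGEBIND_DIM, LLM_DIM)
lora_total = NUM_LAYERS * len(LORA_TARGETS) * lora_params(LLM_DIM, LLM_DIM, LORA_RANK)
trainable = projection + lora_total

print(f"projection: {projection:,}")
print(f"LoRA:       {lora_total:,}")
print(f"trainable:  {trainable:,}")
```

Under these assumptions the projection is about 4.2M parameters and the LoRA adapters about 33.6M, so roughly 38M weights train while the 7B-scale base model stays frozen.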
The most counterintuitive part: emergent cross-modal ability
PandaGPT is trained only on image-text instruction pairs, but because ImageBind already binds 6 modalities into one space, the model can handle modalities such as audio and depth that it never saw in training: drop in a clip of ocean waves and it "hears" them. That's the power of "One Embedding Space To Bind Them All."
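A toy illustration of why this can work: the projection never sees which modality an embedding came from, only a vector in the shared space. The sketch below uses tiny made-up vectors (4-d instead of 1024-d, 3-d instead of 4096-d) and hypothetical helper names. It shows the same projection being reused for an audio embedding and the per-modality soft tokens being summed at inference; it says nothing about how well the real model generalizes.

```python
from typing import Dict, List

Vector = List[float]


def project(embedding: Vector, weight: List[Vector], bias: Vector) -> Vector:
    """Apply y = W x + b, the role played by the single linear projection."""
    return [
        sum(w_ij * x_j for w_ij, x_j in zip(row, embedding)) + b_i
        for row, b_i in zip(weight, bias)
    ]


def fuse(embeddings: Dict[str, Vector], weight: List[Vector], bias: Vector) -> Vector:
    """Project each modality with the SAME weights, then sum the soft tokens."""
    tokens = [project(e, weight, bias) for e in embeddings.values()]
    return [sum(values) for values in zip(*tokens)]


# Toy numbers, purely illustrative (not real ImageBind outputs).
WEIGHT = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]
BIAS = [0.0, 0.0, 0.0]

shared_space = {
    "image": [1.0, 2.0, 3.0, 4.0],  # modality seen during training
    "audio": [0.5, 0.5, 0.5, 0.5],  # unseen modality, same embedding space
}

image_token = project(shared_space["image"], WEIGHT, BIAS)
audio_token = project(shared_space["audio"], WEIGHT, BIAS)
fused_token = fuse(shared_space, WEIGHT, BIAS)
print(image_token, audio_token, fused_token)
```

Because the projection is modality-agnostic, nothing in the code path changes when the audio embedding is added; the only requirement is that the frozen encoder places it in the same space as the image embeddings.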
The course's multimodal lineage (papers in the materials)
The course traces the evolution of multimodal vision LLMs, with three close-read papers:
| Paper | Key contribution |
|---|---|
| ImageBind (Meta) | binds 6 modalities into one space, emergent zero-shot |
| InternVL | scaling vision foundation models + aligning for visual-linguistic tasks |
| Gemini | a natively multimodal model family |
Upstream it also covers self-supervised learning, vision-foundation-model architectures, and downstream transfer / visual prompting — PandaGPT is where that line lands in practice.
Companion: VPT visual prompt tuning (pathology transfer)
The second hands-on project is VPT (Visual Prompt Tuning): freeze the entire ViT-B/16 backbone, insert a few learnable prompt tokens right after the CLS token and before the patch tokens (incorporate_prompt, default NUM_TOKENS=5, prepend + random init), and transfer downstream:
- VPT-Shallow (prompts inserted once, at the input) vs VPT-Deep (deep_prompt_embeddings of shape (11, NUM_TOKENS, 768), re-inserted at every later layer)
- Only prompts + head train: if "prompt" not in k: requires_grad=False on the encoder, which stays in eval mode; logs "tuned percent" = trainable/total params
- Benchmarks: FGVC (CUB/NABirds/Flowers/Dogs/Cars) / VTAB (19 tasks; an 800/200 train/val split picks hyperparameters, then 5 seeded runs); can run on MoCo-v3 / MAE backbones
- Pathology transfer (configs/prompt/bci.yaml): the BCI dataset (HE-stained breast-cancer pathology), 4 classes, backbone untouched → low-cost transfer with very few params
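The prompt handling described above can be sketched with token labels instead of tensors. This is a minimal illustration, not the project's code: replace_prompt is a hypothetical name, and the 12-layer depth, the 196 patches (224×224 input), the single linear head, and the rough 86M backbone size are assumptions used only to make the "tuned percent" arithmetic concrete.

```python
from typing import List


def incorporate_prompt(tokens: List[str], prompts: List[str]) -> List[str]:
    """Shallow step: keep CLS first, then prompts, then the patch tokens."""
    return tokens[:1] + prompts + tokens[1:]


def replace_prompt(hidden: List[str], prompts: List[str]) -> List[str]:
    """Deep step: swap the previous layer's prompt slots for fresh prompts."""
    n = len(prompts)
    return hidden[:1] + prompts + hidden[1 + n:]


NUM_TOKENS = 5     # default stated in the text
HIDDEN = 768       # prompt width stated in the text
NUM_LAYERS = 12    # assumption: ViT-B depth, consistent with 11 deep prompt sets
NUM_PATCHES = 196  # assumption: 224x224 input split into 16x16 patches
NUM_CLASSES = 4    # BCI classes stated in the text

sequence = ["CLS"] + [f"patch{i}" for i in range(NUM_PATCHES)]
layer0 = incorporate_prompt(sequence, [f"p0_{i}" for i in range(NUM_TOKENS)])
layer1 = replace_prompt(layer0, [f"p1_{i}" for i in range(NUM_TOKENS)])

# Trainable parameters for VPT-Deep: input prompts, 11 deeper sets, linear head.
prompt_params = NUM_TOKENS * HIDDEN + (NUM_LAYERS - 1) * NUM_TOKENS * HIDDEN
head_params = HIDDEN * NUM_CLASSES + NUM_CLASSES  # assumption: one linear layer
trainable = prompt_params + head_params

# Assumption: a ViT-B/16 backbone has roughly 86M frozen parameters.
BACKBONE_PARAMS = 86_000_000
tuned_percent = 100.0 * trainable / (BACKBONE_PARAMS + trainable)

print(len(layer0), layer1[:3], trainable, f"{tuned_percent:.3f}%")
```

The sequence length grows by NUM_TOKENS once and then stays fixed, since deeper layers overwrite the prompt slots rather than appending to them. With these assumed sizes the trainable share stays well under 0.1% of the model.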
What this signals
- Engineering intuition for multimodal alignment: knowing that "bind modalities" (ImageBind) + "attach an LLM" (a projection) builds a multimodal model — no need to train from scratch
- Understanding emergence: why training only on image-text still handles audio — the shared embedding space
- Efficient transfer: VPT tunes only prompt tokens for cheap vertical-domain (pathology) onboarding
- Reading the frontier: the ImageBind / InternVL / Gemini lineage
What the demo replays
The demo replays PandaGPT inference: pick modalities (image+text required, audio/depth optional) → ImageBind encodes them into one shared space → linear projection → Vicuna generates the answer; using a non-image-text modality is flagged as 'emergent.' The 6 modalities, the ImageBind+Vicuna+1-linear-projection architecture, and the emergent ability all come from Zhimo's PandaGPT hands-on code and the ImageBind paper; no model runs in the browser.
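The selection logic of the replay can be sketched as a small planner. This is an illustrative guess at the flow described above, not the demo's actual source: plan_demo, the constant names, and the step strings are hypothetical, and no model is invoked.

```python
from typing import Dict, List

REQUIRED = {"image", "text"}    # required inputs, per the demo description
OPTIONAL = {"audio", "depth"}   # optional inputs, per the demo description
TRAINED_ON = {"image", "text"}  # modalities PandaGPT saw during training


def plan_demo(selected: List[str]) -> Dict[str, List[str]]:
    """Validate the chosen modalities and list the replayed pipeline steps."""
    chosen = set(selected)
    missing = REQUIRED - chosen
    if missing:
        raise ValueError(f"missing required modalities: {sorted(missing)}")
    unknown = chosen - REQUIRED - OPTIONAL
    if unknown:
        raise ValueError(f"unsupported modalities: {sorted(unknown)}")

    # Anything outside the training modalities is flagged as emergent.
    emergent = sorted(chosen - TRAINED_ON)
    steps = [
        "ImageBind encodes the selected inputs into one shared space",
        "linear projection maps 1024 -> 4096",
        "Vicuna generates the answer",
    ]
    return {"emergent": emergent, "steps": steps}


plan = plan_demo(["image", "text", "audio"])
print(plan["emergent"])
for step in plan["steps"]:
    print("-", step)
```

Keeping the emergent flag as a set difference against the training modalities means a newly supported optional input would be flagged automatically.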