Multimodal Vision LLM (PandaGPT)

From Zhimo's "Multimodal Vision LLM" course. Two threads: ① how to turn a language model into one that can see, hear, and sense (PandaGPT = ImageBind + Vicuna); ② how to transfer to a vertical domain (pathology images) with almost no backbone changes via visual prompt tuning (VPT).

PandaGPT: one linear layer connects 6 modalities to an LLM

PandaGPT's trick is to stand on the shoulders of two giants and train only a projection + LoRA:

6 modalities → ImageBind_huge (frozen, out 1024) → llama_proj (1024→4096) → 1 soft token → LoRA-Vicuna-7B → text

ImageBind (Meta FAIR, CVPR 2023) aligns image, text, audio, depth, thermal, and IMU data in one 1024-dimensional embedding space, which is the basis of its "emergent zero-shot" behavior; it stays fully frozen
Vicuna-7B is the language brain, adapted with LoRA (r=32, alpha=32, target modules q_proj / k_proj / v_proj / o_proj)
The only new layer is a single nn.Linear(1024, 4096): its output gets a sequence dimension via unsqueeze(1), so one image becomes one soft token; at inference with several modalities, the per-modality tokens are summed
Only the LoRA weights and the projection are trained (and saved as delta weights); ImageBind and the Vicuna base weights stay frozen. Training setup: DeepSpeed on 8×A100, 2 epochs, batch size 64, lr 5e-4, max_len 1024
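
To make "train only a projection + LoRA" concrete, here is a back-of-envelope parameter count. It is a sketch under stated assumptions rather than a readout from the course code: it assumes Vicuna-7B follows the LLaMA-7B layout (32 decoder layers, hidden size 4096) and that each LoRA adapter adds two bias-free matrices, A (r × d_in) and B (d_out × r). The helper names are made up for illustration.

```python
# Back-of-envelope count of PandaGPT's trainable parameters.
# Assumptions (not stated in the notes): 32 decoder layers, hidden size 4096,
# and bias-free LoRA matrices A (r x d_in) and B (d_out x r) per target module.

IMAGEBIND_DIM = 1024   # ImageBind_huge output width (from the notes)
LLM_HIDDEN = 4096      # Vicuna-7B hidden size (from the notes)
NUM_LAYERS = 32        # assumed decoder depth
LORA_RANK = 32         # r=32 (from the notes)
LORA_TARGETS = ("q_proj", "k_proj", "v_proj", "o_proj")


def linear_params(d_in: int, d_out: int, bias: bool = True) -> int:
    """Parameters of a dense layer: weight matrix plus optional bias."""
    return d_in * d_out + (d_out if bias else 0)


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter (A and B, no bias)."""
    return rank * d_in + d_out * rank


projection = linear_params(IMAGEBIND_DIM, LLM_HIDDEN)
lora = NUM_LAYERS * len(LORA_TARGETS) * lora_params(LLM_HIDDEN, LLM_HIDDEN, LORA_RANK)
trainable = projection + lora

print(f"llama_proj: {projection:,}")
print(f"LoRA:       {lora:,}")
print(f"trainable:  {trainable:,}")
```

Under these assumptions the projection contributes about 4.2M parameters and LoRA about 33.6M, so roughly 38M parameters train while the multi-billion-parameter Vicuna base and ImageBind stay untouched.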

The most counterintuitive part: emergent cross-modal ability

PandaGPT is trained only on image-text instruction pairs, but because ImageBind already binds 6 modalities into one space, the model can handle audio, depth, and other inputs it never saw in training: drop in a clip of ocean waves and it "hears" them. That is the promise in the ImageBind paper's title, "One Embedding Space To Bind Them All."

The course's multimodal lineage (papers in the materials)

The course traces the evolution of multimodal vision LLMs, with three close-read papers:

Paper	Key contribution
ImageBind (Meta)	binds 6 modalities into one space, emergent zero-shot
InternVL	scaling vision foundation models + aligning for visual-linguistic tasks
Gemini	a natively multimodal model family

Upstream it also covers self-supervised learning, vision-foundation-model architectures, and downstream transfer / visual prompting — PandaGPT is where that line lands in practice.

Companion: VPT visual prompt tuning (pathology transfer)

The second hands-on project is VPT (Visual Prompt Tuning): freeze the entire ViT-B/16 backbone, insert a few learnable prompt tokens between the CLS token and the patch tokens (incorporate_prompt; defaults: NUM_TOKENS=5, location "prepend", random init), and transfer downstream:

VPT-Shallow (prompts inserted once, at the input) vs VPT-Deep (deep_prompt_embeddings of shape (11, NUM_TOKENS, 768), so each of the 11 later layers gets its own fresh prompts)
Only the prompts and the head train: on the encoder, if "prompt" not in k: requires_grad=False, and the encoder stays in eval mode; the log reports "tuned percent" = trainable / total params
Benchmarks: FGVC (CUB / NABirds / Flowers / Dogs / Cars) and VTAB (19 tasks; an 800/200 split picks hyperparameters, followed by 5 seeded runs); MoCo-v3 and MAE backbones are also supported
Pathology transfer (configs/prompt/bci.yaml): the BCI dataset (H&E-stained breast-cancer pathology images), 4 classes, backbone untouched, giving low-cost transfer with very few trainable parameters
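
The two VPT variants and the "tuned percent" figure can be sketched without any tensors. The snippet below tracks token positions with string labels and then does the parameter arithmetic. It is an illustration, not the repo's code: the function names are hypothetical, and it assumes a 224×224 input (196 patches), 12 encoder layers, a single linear classification head, and a headless ViT-B/16 parameter count of 85,798,656.

```python
# Token-layout and parameter sketch of VPT on ViT-B/16, using labels, not tensors.
# Assumed (not from the notes): 224x224 input, 12 layers, one linear head,
# and BACKBONE_PARAMS as the headless ViT-B/16 parameter count.

NUM_TOKENS = 5                   # default prompt length (from the notes)
HIDDEN = 768                     # ViT-B/16 width (from the notes)
NUM_LAYERS = 12                  # assumed encoder depth
NUM_PATCHES = (224 // 16) ** 2   # assumed input size -> 196 patch tokens
NUM_CLASSES = 4                  # BCI classes (from the notes)
BACKBONE_PARAMS = 85_798_656     # assumed frozen backbone size


def prompts_for(layer):
    """Labels for the learnable prompt tokens owned by one layer."""
    return [f"P{layer}_{i}" for i in range(NUM_TOKENS)]


def token_layout(deep):
    """Return the token sequence fed to the last encoder layer."""
    tokens = ["CLS"] + [f"patch{i}" for i in range(NUM_PATCHES)]
    # Both variants: insert prompts between CLS and the patch tokens.
    tokens = tokens[:1] + prompts_for(0) + tokens[1:]
    # (frozen block 0 would process `tokens` here)
    for layer in range(1, NUM_LAYERS):
        if deep:
            # VPT-Deep: overwrite the prompt slots with this layer's prompts.
            tokens = tokens[:1] + prompts_for(layer) + tokens[1 + NUM_TOKENS:]
        # (frozen block `layer` would process `tokens` here)
    return tokens


def tuned_percent(deep):
    """Trainable parameter count and its share of all parameters, in percent."""
    prompt_sets = NUM_LAYERS if deep else 1   # 1 input set + 11 deep sets
    prompt_params = prompt_sets * NUM_TOKENS * HIDDEN
    head_params = HIDDEN * NUM_CLASSES + NUM_CLASSES
    trainable = prompt_params + head_params
    return trainable, 100.0 * trainable / (BACKBONE_PARAMS + trainable)


for name, deep in (("shallow", False), ("deep", True)):
    layout = token_layout(deep)
    trainable, percent = tuned_percent(deep)
    print(name, len(layout), layout[:3], f"{trainable:,} trainable", f"{percent:.4f}%")
```

In both variants the sequence grows from 197 to 202 tokens; the difference is only whether the five prompt slots are rewritten at each later layer. Under the assumptions above, the trainable share stays below 0.1% of all parameters even for VPT-Deep.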

What this signals

Engineering intuition for multimodal alignment: knowing that "bind modalities" (ImageBind) + "attach an LLM" (a projection) builds a multimodal model — no need to train from scratch
Understanding emergence: why training only on image-text still handles audio — the shared embedding space
Efficient transfer: VPT tunes only prompt tokens for cheap vertical-domain (pathology) onboarding
Reading the frontier: the ImageBind / InternVL / Gemini lineage

Demo strategy

What the demo replays

The demo replays PandaGPT inference: pick modalities (image and text required, audio and depth optional) → ImageBind encodes them into one shared space → linear projection → Vicuna generates the answer. Any selected modality other than image or text is flagged as "emergent." The 6 modalities, the ImageBind + Vicuna + one-linear-projection architecture, and the emergent ability all come from Zhimo's PandaGPT hands-on code and the ImageBind paper; no model runs in the browser.
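
The replay logic described above can be sketched as a small planner. This is a hypothetical illustration, not the demo's actual code: plan_replay, the step strings, and the dictionary layout are all invented names, and nothing here loads or runs a model.

```python
# Hypothetical replay planner mirroring the flow described above.
# All names are illustrative; no model is loaded or executed.

REQUIRED = {"image", "text"}    # modalities PandaGPT was trained on
OPTIONAL = {"audio", "depth"}   # extra modalities the demo offers


def plan_replay(selected):
    """Validate the chosen modalities and flag the emergent ones."""
    chosen = set(selected)
    missing = REQUIRED - chosen
    if missing:
        raise ValueError(f"missing required modalities: {sorted(missing)}")
    unknown = chosen - REQUIRED - OPTIONAL
    if unknown:
        raise ValueError(f"unsupported modalities: {sorted(unknown)}")
    return {
        "modalities": sorted(chosen),
        # Anything outside image/text was never seen in training.
        "emergent": sorted(chosen - REQUIRED),
        "steps": [
            "ImageBind encodes inputs into the shared space",
            "linear projection maps 1024 -> 4096",
            "Vicuna generates the answer",
        ],
    }


plan = plan_replay(["image", "text", "audio"])
print(plan["modalities"], "emergent:", plan["emergent"])
```

Keeping the plan as plain data makes it easy to render the same three steps for every selection while only the "emergent" flag changes.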

Public preview can be enabled later without redesigning the case-study layout
