Multimodal inference replay

Multimodal Vision LLM (PandaGPT)

ImageBind binds 6 modalities into one embedding space, and a single linear projection feeds that space into Vicuna. PandaGPT is trained only on image-text data yet emergently understands audio and depth. The case study also covers VPT (visual prompt tuning) for downstream transfer to pathology.

Pick modalities, then watch PandaGPT infer: ImageBind encodes them into one shared embedding space, a linear projection feeds Vicuna, and the answer comes back. Any selected modality other than image or text is flagged as emergent.

ImageBind · PandaGPT · Vicuna · Multimodal · VPT


Why this local version exists

The 6 modalities, the architecture (frozen ImageBind + one linear projection + Vicuna), and the emergent cross-modal ability come from Zhimo's PandaGPT hands-on code and the ImageBind paper (Meta, CVPR 2023). No model runs in the browser.

Interactive Preview

Six modalities into one LLM (PandaGPT)

ImageBind aligns 6 modalities into one embedding space, and a linear projection feeds Vicuna. PandaGPT trains only on image-text data yet emergently understands audio, depth, and the other non-image modalities.

ImageBind · 6 modalities (toggle)

Multimodal input

🏖️ beach photo + 🌊 wave audio + prompt: "What's happening and what do you hear?"

Architecture

modalities → ImageBind (frozen, 1024) → linear proj 1024→4096 → 1 soft token → LoRA-Vicuna-7B → text

1. ImageBind encodes → one shared embedding space

2. linear projection → Vicuna embedding space

3. Vicuna generates a multimodal answer
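
The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not PandaGPT's actual code: the ImageBind encoder and Vicuna are left out entirely, the projection weights are random stand-ins for the trained layer, the bias term is an assumption, and every function name here is hypothetical. Only the 1024 and 4096 widths and the single soft token come from the architecture line.

```python
import random

IMAGEBIND_DIM = 1024  # ImageBind embedding width (from the architecture line)
VICUNA_DIM = 4096     # Vicuna-7B hidden width (from the architecture line)


def make_projection(in_dim, out_dim, seed=0):
    """Random stand-in for the trained linear layer: (weight rows, bias)."""
    rng = random.Random(seed)
    scale = in_dim ** -0.5
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim  # assumption: the layer carries a bias term
    return weight, bias


def project(embedding, weight, bias):
    """y = W x + b: map one ImageBind vector into the LLM embedding space."""
    return [sum(w * x for w, x in zip(row, embedding)) + b
            for row, b in zip(weight, bias)]


def build_llm_inputs(soft_token, text_token_embeddings):
    """Prepend the single soft token to the already-embedded text prompt."""
    return [soft_token] + list(text_token_embeddings)


def projection_param_count(in_dim, out_dim):
    """Weights plus biases of the one projection layer."""
    return in_dim * out_dim + out_dim
```

Under these assumptions, `projection_param_count(IMAGEBIND_DIM, VICUNA_DIM)` gives 4,198,400 parameters for the projection, which is tiny next to the 7B-parameter LLM it feeds.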

What to try

Toggle audio/depth/thermal/IMU on or off, then run inference.

Watch ImageBind bind the selected modalities into one shared space → projection → Vicuna.

Note: audio still works even though PandaGPT was never trained on it, so the demo flags it as emergent.
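
A small sketch of the toggle logic may help here. It assumes the demo treats any selected modality outside image and text as emergent, and that several modality embeddings are combined by element-wise addition, which is one simple way to compose vectors that live in a shared space. The page does not state how it fuses them, so `fuse` is an illustrative assumption, and both function names are hypothetical.

```python
# Modalities PandaGPT saw during training, per the description above.
TRAINED_MODALITIES = {"image", "text"}
# The six modalities ImageBind embeds into one space.
IMAGEBIND_MODALITIES = ("image", "text", "audio", "depth", "thermal", "imu")


def emergent_modalities(selected):
    """Return the selected modalities that were never part of training."""
    unknown = set(selected) - set(IMAGEBIND_MODALITIES)
    if unknown:
        raise ValueError(f"not an ImageBind modality: {sorted(unknown)}")
    return sorted(set(selected) - TRAINED_MODALITIES)


def fuse(embeddings):
    """Element-wise sum of same-length embedding vectors (assumed fusion rule)."""
    if not embeddings:
        raise ValueError("need at least one embedding")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must share one dimension")
    return [sum(values) for values in zip(*embeddings)]
```

For the beach example, `emergent_modalities(["image", "audio"])` returns `["audio"]`, which is the modality the demo would flag.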

What this demo proves

You can build a multimodal model by composition (ImageBind + a projection + an LLM) instead of training from scratch.

You understand emergence: one shared embedding space lets image-text training generalize to audio/depth.

You know efficient transfer (VPT, visual prompt tuning) for vertical domains like pathology.

Backbone

ImageBind (6 modalities, frozen) + Vicuna + 1 linear projection

Lineage

ImageBind · InternVL · Gemini (papers read in the course)

Efficient transfer

VPT (visual prompt tuning) on a frozen ViT → pathology downstream
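
The core move of VPT can be sketched as follows, under stated assumptions: this is the shallow variant, where learnable prompt vectors are inserted into the input token sequence between the class token and the patch tokens while the ViT itself stays frozen. The transformer is omitted, tokens are plain lists of floats, and all function names are hypothetical rather than taken from any library.

```python
import random


def init_prompts(num_prompts, dim, seed=0):
    """Create the learnable prompt vectors (the only new backbone inputs)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
            for _ in range(num_prompts)]


def insert_prompts(tokens, prompts):
    """[CLS, patch_1..patch_n] -> [CLS, prompt_1..prompt_p, patch_1..patch_n]."""
    if not tokens:
        raise ValueError("expected a class token followed by patch tokens")
    cls_token, patches = tokens[0], tokens[1:]
    return [cls_token] + list(prompts) + list(patches)


def trainable_params(num_prompts, dim, num_classes):
    """Prompt vectors plus a linear classification head (weights and biases)."""
    return num_prompts * dim + dim * num_classes + num_classes
```

As an illustration with made-up settings (10 prompts, a 768-wide ViT-B, a 2-class pathology task), `trainable_params(10, 768, 2)` comes to 9,218 trained values, while every backbone weight stays untouched.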
