Multimodal Vision LLM (PandaGPT)
ImageBind binds 6 modalities into one embedding space, and a single linear projection feeds that space into Vicuna. PandaGPT trains only on image-text pairs, yet it emergently understands audio and depth. The page also covers VPT (visual prompt tuning) for downstream transfer to pathology.
Pick modalities, then watch PandaGPT infer: ImageBind encodes them into one shared embedding space, a linear projection feeds Vicuna, and the answer comes back. Any selected modality other than image or text is flagged as emergent.
Why this local version exists
The 6 modalities, the architecture (frozen ImageBind + one linear projection + Vicuna), and the emergent cross-modal ability come from Zhimo's PandaGPT hands-on code and the ImageBind paper (Meta, CVPR 2023). No model runs in the browser.
Six modalities into one LLM (PandaGPT)
ImageBind aligns 6 modalities into one embedding space, and a linear projection feeds Vicuna. PandaGPT trains only on image-text pairs yet emergently understands audio, depth, and other non-image modalities.
ImageBind · 6 modalities (toggle)
Multimodal input
🏖️ beach photo + 🌊 wave audio + prompt: "What's happening and what do you hear?"
Architecture
modalities → ImageBind (frozen, 1024-d embedding) → linear projection 1024→4096 → 1 soft token → LoRA-Vicuna-7B → text
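The architecture line above can be sketched in a few lines of plain Python. This is a minimal illustration, not PandaGPT's actual code: the dimensions are shrunk to toy sizes (the demo's real sizes are 1024 and 4096), the weights are random stand-ins for the trained projection, and the names `make_linear`, `project_to_soft_token`, and `build_llm_input` are hypothetical. Prepending the soft token to the prompt is also a simplification of how a real prompt template would place it.

```python
import random

# Toy sizes so the sketch runs instantly; the demo's real sizes are
# 1024 (ImageBind embedding) and 4096 (Vicuna-7B hidden size).
IMAGEBIND_DIM = 8
LLM_DIM = 16


def make_linear(in_dim, out_dim, seed=0):
    """Random weights and zero bias standing in for the trained projection."""
    rng = random.Random(seed)
    weight = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def project_to_soft_token(embedding, weight, bias):
    """Map one ImageBind embedding to one vector in the LLM's embedding space."""
    return [
        sum(w * x for w, x in zip(row, embedding)) + b
        for row, b in zip(weight, bias)
    ]


def build_llm_input(soft_token, prompt_token_embeddings):
    """Prepend the single soft token to the embedded text prompt (simplified)."""
    return [soft_token] + prompt_token_embeddings


weight, bias = make_linear(IMAGEBIND_DIM, LLM_DIM)
imagebind_embedding = [0.1] * IMAGEBIND_DIM               # stand-in for a frozen encoder output
prompt_embeddings = [[0.0] * LLM_DIM for _ in range(5)]   # stand-in for 5 embedded prompt tokens

soft_token = project_to_soft_token(imagebind_embedding, weight, bias)
llm_input = build_llm_input(soft_token, prompt_embeddings)
print(len(llm_input), len(llm_input[0]))  # 6 16
```

The point of the design is that the projection is the only new component between two pretrained models: the encoder output is a single vector, so the LLM sees the whole non-text input as one extra token.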
What to try
Toggle audio/depth/thermal/IMU on or off, then run inference.
Watch ImageBind bind the selected modalities into one shared space → projection → Vicuna.
Note: audio input, which PandaGPT was never trained on, still works and is flagged as emergent.
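The toggle-and-flag behavior in these steps can be sketched as follows. This is a hypothetical illustration: `TRAINED_MODALITIES`, `emergent_modalities`, and `fuse_embeddings` are names invented here, and averaging the per-modality embeddings is an assumed fusion rule, chosen only because all embeddings live in one shared space. The demo itself does not state how it combines them.

```python
TRAINED_MODALITIES = {"image", "text"}   # PandaGPT is trained on image-text pairs only
ALL_MODALITIES = {"image", "text", "audio", "depth", "thermal", "imu"}


def emergent_modalities(selected):
    """Return the selected modalities that were never seen during training."""
    unknown = set(selected) - ALL_MODALITIES
    if unknown:
        raise ValueError(f"unsupported modalities: {sorted(unknown)}")
    return sorted(set(selected) - TRAINED_MODALITIES)


def fuse_embeddings(embeddings):
    """Average same-sized embeddings element-wise (assumed fusion rule)."""
    if not embeddings:
        raise ValueError("need at least one embedding")
    n = len(embeddings)
    return [sum(values) / n for values in zip(*embeddings)]


selected = ["image", "audio"]
print(emergent_modalities(selected))                 # ['audio']
fused = fuse_embeddings([[1.0, 0.0], [0.0, 1.0]])
print(fused)                                         # [0.5, 0.5]
```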
What this demo proves
You can build a multimodal model by composition (ImageBind + a projection + an LLM) instead of training from scratch.
You understand emergence: one shared embedding space lets image-text training generalize to audio/depth.
You know efficient transfer (VPT visual prompt tuning) for vertical domains like pathology.
Backbone
ImageBind (6 modalities, frozen) + Vicuna + 1 linear projection
Lineage
ImageBind · InternVL · Gemini (papers read in the course)
Efficient transfer
VPT (visual prompt tuning) on a frozen ViT → downstream pathology tasks
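Why VPT counts as efficient transfer can be sketched with a parameter count. In this hypothetical plain-Python illustration, the ViT backbone stays frozen and only a handful of prompt tokens plus a linear classification head are trained; the prompt tokens are prepended to the patch-token sequence (the class token is omitted for brevity). All sizes (`HIDDEN_DIM`, `NUM_PROMPTS`, `NUM_CLASSES`, `BACKBONE_PARAMS`) and the two-class pathology task are assumptions chosen for illustration, not values from the course code.

```python
HIDDEN_DIM = 768               # assumed ViT-B hidden size
NUM_PROMPTS = 10               # assumed number of learnable prompt tokens
NUM_CLASSES = 2                # assumed task, e.g. tumor vs. normal patches
BACKBONE_PARAMS = 86_000_000   # rough ViT-B/16 size, kept frozen


def prepend_prompts(prompt_tokens, patch_tokens):
    """Shallow VPT: learnable prompts go in front of the frozen patch tokens."""
    return prompt_tokens + patch_tokens


def trainable_params(num_prompts, hidden_dim, num_classes):
    """Only the prompts and the linear head (weights + bias) receive gradients."""
    prompts = num_prompts * hidden_dim
    head = hidden_dim * num_classes + num_classes
    return prompts + head


patch_tokens = [[0.0] * HIDDEN_DIM for _ in range(196)]           # 14 x 14 patches
prompt_tokens = [[0.0] * HIDDEN_DIM for _ in range(NUM_PROMPTS)]  # learnable in practice
sequence = prepend_prompts(prompt_tokens, patch_tokens)

tuned = trainable_params(NUM_PROMPTS, HIDDEN_DIM, NUM_CLASSES)
print(len(sequence))   # 206
print(tuned)           # 9218
print(f"{tuned / BACKBONE_PARAMS:.4%} of the frozen backbone")
```

Under these assumed sizes the tuned parameters are a tiny fraction of the backbone, which is what makes the approach attractive for vertical domains with limited labeled data.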