CLIP Cross-Modal Retrieval RAG
CLIP encodes text and images into one shared 512-dim space, which makes text→image and image→image retrieval possible. Built on LlamaIndex, this demo goes from a CLIP MVP to VLM captioning plus BM25 hybrid retrieval fused with RRF, with vectors persisted in Milvus.
Switch between text→image, image→image, and hybrid modes. Each query is encoded into the shared 512-dim space and the nearest images are retrieved by cosine similarity. The demo shows the realistically low cross-modal scores and filters out the self-hit in image→image mode.
Why this local version exists
The 512-dim shared space, the ~0.24 text→image scores, the image→image self-hit (1.0) filter, RRF (k=60), and the 1536-dim VLM-caption index are all taken from the LlamaIndex multimodal notebook. The 2D plot is only a projection for intuition.
Text and images in one vector space
CLIP encodes text and images into the same 512-dim space, so one nearest-neighbor search serves both text→image and image→image queries. Hybrid mode adds BM25 and fuses the two rankings with RRF.
Shared 512-dim space (2D projection)
Retrieval results (cosine)
Run retrieval to pull the nearest images in the shared space.
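The retrieval step can be illustrated with a minimal sketch, assuming the embeddings are already computed. The names `cosine`, `nearest_images`, and `image_index` are hypothetical, and the 3-dim toy vectors stand in for real 512-dim CLIP embeddings. It ranks images by cosine similarity and, for image→image queries, drops the query image itself, which would otherwise come back with a score of 1.0.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def nearest_images(query_vec, image_index, top_k=3, exclude_id=None):
    """Return the top_k (image_id, score) pairs, best first.

    exclude_id is the self-hit filter: pass the query image's own id
    for image->image retrieval so it is not returned as its own match.
    """
    scored = [
        (image_id, cosine(query_vec, vec))
        for image_id, vec in image_index.items()
        if image_id != exclude_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy index (assumption): 3-dim vectors instead of 512-dim CLIP embeddings.
image_index = {
    "img_a": [1.0, 0.0, 0.0],
    "img_b": [0.8, 0.6, 0.0],
    "img_c": [0.0, 0.0, 1.0],
}

# image->image: query with img_a's own vector and filter out the self-hit.
hits = nearest_images(image_index["img_a"], image_index, top_k=2, exclude_id="img_a")
print(hits)
```

Without `exclude_id`, the first result for this query is `img_a` itself, which is why the filter matters for image→image mode.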
Beyond: VLM caption + hybrid
CLIP cannot read text inside images and is weak on Chinese queries. To work around this, caption each image with a VLM (GPT-4o / Qwen-VL-Max) and embed the captions with text-embedding-3-small (1536-dim). At query time, QueryFusionRetriever fuses the BM25 and vector results via RRF (k=60).
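Reciprocal rank fusion itself is simple enough to sketch in a few lines. The following is a standalone illustration, not LlamaIndex's QueryFusionRetriever code: the function name `rrf_fuse`, the document ids, and the two input rankings are hypothetical, and ranks are assumed to start at 1, as in the usual formulation score(d) = Σ 1 / (k + rank).

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of ids with reciprocal rank fusion.

    Each list contributes 1 / (k + rank) for every id it contains,
    with rank starting at 1 (an assumption of this sketch).
    Returns (id, fused_score) pairs, best first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

# Hypothetical rankings from a BM25 retriever and a vector retriever.
bm25_ranking = ["img_b", "img_a", "img_d"]
vector_ranking = ["img_a", "img_c", "img_b"]

fused = rrf_fuse([bm25_ranking, vector_ranking], k=60)
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

Because RRF uses only rank positions, it does not need BM25 scores and cosine scores to be on a comparable scale. Ids that appear in both lists, such as `img_a` and `img_b` here, accumulate two contributions and rise above ids found by only one retriever.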
What to try
Run text→image and watch a text query pull the nearest image diagrams.
Switch to image→image and see the self-hit (1.0) get filtered out.
Try hybrid and note BM25 + vector fused by RRF (k=60).
What this demo proves
You can ship cross-modal retrieval (text↔image), not just chat over text.
You know CLIP's limits (in-image text / Chinese) and patch them with VLM captions.
You engineer hybrid retrieval (vector + BM25 via RRF) on the LlamaIndex multimodal stack.
Shared space
CLIP (OpenAI) · get_text_embedding / get_image_embedding, both 512-dim
Beyond CLIP
VLM caption (GPT-4o / Qwen-VL-Max) → text-embedding-3-small 1536-dim
Hybrid
BM25 + vector → QueryFusionRetriever, RRF k=60 · Milvus