Shared-space retrieval

CLIP Cross-Modal Retrieval RAG

CLIP encodes text and images into one 512-dim space, so text→image and image→image retrieval both work. Built on LlamaIndex, the project grows from a CLIP MVP to VLM captioning plus BM25 hybrid retrieval fused with RRF, with vectors persisted in Milvus.

Switch between text→image, image→image, and hybrid: a query is encoded to 512-dim and lands in one shared space, pulling the nearest images by cosine similarity. The demo shows the real, low cross-modal scores and applies a self-hit filter.

CLIP · LlamaIndex · Multimodal · Milvus · RRF

Why this local version exists

The 512-dim shared space, the ~0.24 text→image scores, the image→image self-hit (1.0) filter, RRF (k=60), and the 1536-dim VLM-caption index are all from the LlamaIndex multimodal notebook. The 2D plot is a projection for intuition.

Interactive Preview

Text and images in one vector space

CLIP encodes text and images into the same 512-dim space, so you can run text→image and image→image retrieval. Hybrid mode adds BM25 and fuses the two rankings with RRF.

Shared 512-dim space (2D projection)

architecture · flowchart · diagram · bar chart · UI screenshot · landscape photo
text: "system architecture diagram"

Retrieval results (cosine)

Run retrieval to pull the nearest images in the shared space.
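The ranking step behind these results can be sketched in a few lines. This is a minimal illustration, not the notebook's code: the function names, the `exclude_id` parameter, and the 3-dim toy vectors (standing in for 512-dim CLIP embeddings) are all assumptions made for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_images(query_vec, image_vecs, top_k=3, exclude_id=None):
    """Rank images by cosine similarity to the query vector.

    exclude_id (hypothetical parameter) drops the query image itself in
    image→image mode, where it would otherwise rank first with score 1.0.
    """
    scored = [
        (image_id, cosine(query_vec, vec))
        for image_id, vec in image_vecs.items()
        if image_id != exclude_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Made-up 3-dim vectors standing in for 512-dim CLIP embeddings.
images = {
    "architecture.png": [0.9, 0.1, 0.2],
    "flowchart.png": [0.8, 0.3, 0.1],
    "landscape.jpg": [0.1, 0.2, 0.9],
}

# image→image: query with an indexed image and filter out the self-hit.
hits = nearest_images(
    images["architecture.png"], images, top_k=2, exclude_id="architecture.png"
)
print(hits)
```

Without `exclude_id`, the query image comes back first with a similarity of 1.0, which is the self-hit the demo filters out. Text→image uses the same ranking; only the encoder that produces `query_vec` changes.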

Beyond: VLM caption + hybrid

CLIP cannot read text inside images and is weak on Chinese queries. To work around this, each image is captioned with a VLM (GPT-4o / Qwen-VL-Max) and the caption is embedded with text-embedding-3-small (1536-dim). At query time, QueryFusionRetriever fuses the BM25 and vector rankings with RRF (k=60).
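To see why captions help the keyword side, here is a rough pure-Python sketch of BM25 scoring over image captions. It is not the retriever the project uses: the captions are invented, `k1=1.5` and `b=0.75` are common defaults assumed for illustration, and whitespace tokenization is a simplification.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a basic BM25 formula."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs

    # Document frequency: in how many captions each term appears.
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            idf = math.log(
                (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0
            )
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores[doc_id] = score
    return scores


# Invented captions of the kind a VLM might produce for each image.
captions = {
    "architecture.png": "system architecture diagram with api gateway and milvus vector store",
    "bar_chart.png": "bar chart comparing retrieval latency across three indexes",
    "landscape.jpg": "landscape photo of mountains at sunset",
}

scores = bm25_scores("milvus architecture", captions)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[0], scores[ranked[0]])
```

A word that appears only as text inside the image, such as a label on a diagram, becomes matchable once the caption mentions it, which is the gap in CLIP that the caption index is meant to cover.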

What to try

Run text→image and watch a text query pull the nearest diagram images.

Switch to image→image and see the self-hit (1.0) get filtered out.

Try hybrid and note BM25 + vector fused by RRF (k=60).
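The fusion step in hybrid mode can be sketched as follows. This is an illustrative implementation of reciprocal rank fusion with k=60, using 1-based ranks as in the usual formulation; the function name and the two example rankings are made up, and the library's internal details may differ.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical result lists from the two retrievers, best match first.
bm25_ranking = ["flowchart.png", "architecture.png", "bar_chart.png"]
vector_ranking = ["architecture.png", "ui_screenshot.png", "flowchart.png"]

fused = rrf_fuse([bm25_ranking, vector_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

Because RRF only looks at ranks, the BM25 scores and the cosine scores never need to be put on a common scale. An image that appears near the top of both lists beats one that tops only a single list.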

What this demo proves

You can ship cross-modal retrieval (text↔image), not just chat over text.

You know CLIP's limits (in-image text / Chinese) and patch them with VLM captions.

You engineer hybrid retrieval (vector + BM25 via RRF) on the LlamaIndex multimodal stack.

Shared space

CLIP (OpenAI) · get_text_embedding / get_image_embedding, both 512-dim

Beyond CLIP

VLM caption (GPT-4o / Qwen-VL-Max) → text-embedding-3-small 1536-dim

Hybrid

BM25 + vector → QueryFusionRetriever, RRF k=60 · Milvus