CLIP Cross-Modal Retrieval RAG

Plain RAG retrieves over text. This project does cross-modal retrieval: text-to-image and image-to-image search. The key is CLIP (OpenAI, 2021), a contrastive dual-encoder that maps text and images into a single 512-dim vector space, so cosine similarity across modalities is meaningful. Built on LlamaIndex.

CLIP MVP: text & images in one space

from llama_index.embeddings.clip import ClipEmbedding

embed = ClipEmbedding()                       # auto-downloads ~400MB weights
t = embed.get_text_embedding("architecture diagram")   # len = 512
i = embed.get_image_embedding("diagram.png")           # len = 512  ← same space!

Both get_text_embedding and get_image_embedding return 512-dim vectors (verified in the notebook), so you can compute text ↔ image cosine directly. CLIP encodes at ~10ms/image.
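To make "compute text ↔ image cosine directly" concrete, here is a minimal pure-Python sketch. The helper names (cosine_similarity, rank_images) and the toy 3-dim vectors are illustrative assumptions, not part of the notebook; real inputs would be the 512-dim lists returned by ClipEmbedding.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def rank_images(text_vec, image_vecs, top_k=3):
    """Return the top_k (name, score) pairs, best match first."""
    scored = [(name, cosine_similarity(text_vec, vec)) for name, vec in image_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy 3-dim vectors standing in for 512-dim CLIP embeddings (hypothetical data).
text_vec = [1.0, 0.0, 0.0]
image_vecs = {
    "diagram.png": [0.9, 0.1, 0.0],
    "cat.png": [0.0, 1.0, 0.0],
}
ranked = rank_images(text_vec, image_vecs)
```

Because the text and image vectors share one space, the same function serves text→image and image→image queries; only the query vector changes.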

CLIP's limits (stated in the course): can't read in-image text, weak fine-grained discrimination, weak Chinese (suggests Chinese-CLIP).

text→image / image→image

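from llama_index.core.indices import MultiModalVectorStoreIndex

# docs: the loaded text and image documents
index = MultiModalVectorStoreIndex.from_documents(docs, image_embed_model=embed)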
retriever = index.as_retriever(similarity_top_k=3, image_similarity_top_k=3)

# text→image
retriever.text_to_image_retrieve("system architecture diagram")
# image→image
retriever.image_to_image_retrieve("query.png")

Two practical details:

Low text→image scores are normal: the Chinese query "架构图" ("architecture diagram") scores about 0.24, and the English query "architecture diagram" scores slightly higher at 0.28. The absolute value doesn't matter; the ranking does. Precision comes from a downstream reranker.
image→image hits itself: the query image has a self-similarity of 1.0 and is filtered out in practice by comparing Path.resolve() paths.
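The self-hit filter can be sketched as below. This is an assumption about how the Path.resolve() comparison might look, not the notebook's actual code: drop_self_hit is a hypothetical helper, and the (path, score) pairs with their scores are made-up stand-ins for retriever results.

```python
from pathlib import Path

def drop_self_hit(query_path, hits):
    """Remove hits whose file resolves to the same path as the query image.

    `hits` is a list of (image_path, score) pairs standing in for retriever output.
    """
    query_resolved = Path(query_path).resolve()
    return [(p, s) for p, s in hits if Path(p).resolve() != query_resolved]

hits = [
    ("images/../query.png", 1.0),   # same file as the query, spelled differently
    ("images/arch_v2.png", 0.83),   # illustrative scores, not measured values
    ("images/flow.png", 0.71),
]
filtered = drop_self_hit("query.png", hits)
```

Resolving both sides first means a relative path, an absolute path, and a path containing ".." all compare equal when they point at the same file, which a plain string comparison would miss.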

Persistence: MilvusVectorStore(uri, collection_name, dim=512, overwrite=True) (Milvus v2.3.21, localhost:19530), separate text/image collections.

Beyond: VLM captioning + hybrid retrieval

CLIP's weaknesses (in-image text, Chinese) are patched with VLM-generated captions:

# VLMSelector: < 50 images → GPT-4o ($0.003/img), ≥ 50 → Qwen-VL-Max (¥0.01/img)
# captions cached in ./caption_cache
# caption index uses OpenAIEmbedding(text-embedding-3-small), 1536-dim
# ImageNode(text=caption, image_path=...) → Milvus collection (dim=1536)
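A rough sketch of the selection rule and caption cache described in the comments above. Only the 50-image threshold, the per-image prices, and the cache idea come from the text; the function names, the SHA-256 cache key, and the JSON file layout are assumptions, and caption_fn is a placeholder for whatever actually calls the VLM.

```python
import hashlib
import json
from pathlib import Path

BATCH_THRESHOLD = 50            # from the text
GPT4O_USD_PER_IMAGE = 0.003     # from the text
QWEN_CNY_PER_IMAGE = 0.01       # from the text

def select_vlm(num_images):
    """Pick a captioning model by batch size: (model, estimated_cost, currency)."""
    if num_images < BATCH_THRESHOLD:
        return "gpt-4o", num_images * GPT4O_USD_PER_IMAGE, "USD"
    return "qwen-vl-max", num_images * QWEN_CNY_PER_IMAGE, "CNY"

def cached_caption(image_bytes, cache_dir, caption_fn):
    """Return a caption for the image, calling caption_fn only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Key on content, not filename, so renamed copies reuse the same caption (assumption).
    key = hashlib.sha256(image_bytes).hexdigest()
    entry = cache_dir / f"{key}.json"
    if entry.exists():
        return json.loads(entry.read_text(encoding="utf-8"))["caption"]
    caption = caption_fn(image_bytes)
    entry.write_text(json.dumps({"caption": caption}, ensure_ascii=False), encoding="utf-8")
    return caption
```

With a cache in front of the VLM, re-running the pipeline over the same images costs nothing after the first pass, which is the point of ./caption_cache.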

At query time, layer in BM25 (keyword) + vector (semantic) hybrid:

from llama_index.core.retrievers import QueryFusionRetriever

fusion = QueryFusionRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    mode="reciprocal_rerank",     # RRF
    num_queries=1,
)
# RRF: score = Σ 1/(k + rank_i),  k = 60
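The RRF formula above can be worked through in a few lines of plain Python. This is a standalone sketch, not LlamaIndex's internal implementation: reciprocal_rank_fusion and the image ids are hypothetical, and ranks are counted from 1 here (implementations differ on 0- versus 1-based ranks).

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of ids: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # 1-based ranks (assumption)
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

vector_hits = ["img_a", "img_b", "img_c"]   # hypothetical vector-retriever order
bm25_hits = ["img_b", "img_d", "img_a"]     # hypothetical BM25 order
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

RRF uses only ranks, never raw scores, so it can combine BM25 scores and cosine similarities without putting them on a common scale. In this toy input, img_b (ranks 2 and 1) edges out img_a (ranks 1 and 3) because it sits near the top of both lists.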

Evolution path

The notebook also points to an evolution path (a Qwen3-VL "golden architecture," Agentic RAG), but the load-bearing, runnable scope is CLIP + VLM captioning + hybrid retrieval.

What this signals

Shipping cross-modal retrieval: not "chat with text" but text ↔ image search, built on an understanding of the shared vector space
Knowing CLIP's limits: weak on in-image text / Chinese, and patching with VLM captions
Hybrid-retrieval engineering: vector + BM25 fused via RRF, not a single lane
LlamaIndex multimodal stack: MultiModalVectorStoreIndex + Milvus + QueryFusionRetriever wired together

Demo strategy

What the demo replays

The demo visualizes CLIP's "same space": a text or image query is encoded to a 512-dim vector and lands in one 2D plot, which pulls the nearest images by cosine similarity (text→image scores around 0.24; the image→image self-match at 1.0 is filtered out). Hybrid mode adds BM25 + RRF (k=60). The 512-dim space, the ~0.24 scores, RRF k=60, and the 1536-dim VLM-caption index all come from the notebook "LlamaIndex 多模态文搜图图搜图 RAG 实战" ("LlamaIndex Multimodal Text-to-Image and Image-to-Image Search RAG in Practice").

Public preview can be enabled later without redesigning the case-study layout