Multimodal Document RAG Platform
A multimodal document RAG system that turns upload, parsing, retrieval, and grounded chat into one integrated product flow.
This project is positioned as a full-stack product system rather than a “RAG notebook.” The key signal is that the document workflow is packaged into upload, retrieval, and chat experiences that a real user can understand and operate end to end.
Overview
Multimodal Document RAG Platform was built around a recurring enterprise workflow: teams have PDFs, scanned pages, and mixed-layout documents, and they want more than raw extraction. They want to upload a document, build a searchable knowledge base, inspect what was retrieved, and ask questions with traceable context.
Instead of collapsing everything into one service, the system separates parsing, chunking, retrieval, and chat into deployable modules. That makes the stack easier to reason about, test, and evolve.
Product Shape
The user experience is organized as a document workflow:
- upload PDF or multimodal document assets
- parse content into searchable representations
- build or update the knowledge base
- inspect retrieval outputs
- ask grounded questions against indexed content
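The workflow above can be sketched as a minimal pipeline. Everything here is an illustrative stand-in, not the project's actual API: `Document`, `parse`, `index`, and `retrieve` are hypothetical names, parsing is faked by splitting on blank lines, and keyword overlap stands in for real vector retrieval.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    name: str
    raw_text: str
    chunks: list[str] = field(default_factory=list)

def parse(doc: Document) -> Document:
    # Stand-in for real PDF/layout parsing: split on blank lines.
    doc.chunks = [p.strip() for p in doc.raw_text.split("\n\n") if p.strip()]
    return doc

def index(kb: dict[str, list[str]], doc: Document) -> None:
    # The "knowledge base" here is just a dict of document name -> chunks.
    kb[doc.name] = doc.chunks

def retrieve(kb: dict[str, list[str]], query: str) -> list[str]:
    # Naive keyword overlap standing in for embedding search.
    q = set(query.lower().split())
    return [c for chunks in kb.values() for c in chunks
            if q & set(c.lower().split())]

kb: dict[str, list[str]] = {}
doc = parse(Document("guide.pdf",
                     "Milvus stores vectors.\n\nLangChain orchestrates answers."))
index(kb, doc)
print(retrieve(kb, "What does Milvus store?"))
```

The point of the sketch is the shape, not the internals: each stage has a clear input and output, which is what lets the frontend surface every step to the user.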
The React frontend is important here because it turns the backend pipeline into something a user can actually operate.
System Design
The backend combines several layers:
- PDF and multimodal parsing services
- text chunking and document preprocessing
- vector retrieval with Milvus
- LangChain orchestration for answer generation
- service startup and environment management for local or bundled deployment
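To make the retrieval layer concrete, here is a brute-force sketch of top-k similarity search. This is a teaching stand-in, not the project's implementation: the real system would embed chunks with a model and search them in Milvus, while this version uses bag-of-words counts and cosine similarity computed in plain Python.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Bag-of-words "embedding"; a real system would call an embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_k(chunks: list[str], query: str, k: int = 2) -> list[str]:
    # Brute-force ranking; Milvus replaces this with an ANN index at scale.
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(embed(c), q), reverse=True)
    return ranked[:k]

chunks = [
    "Milvus indexes dense vectors for similarity search.",
    "The frontend is built with React.",
    "LangChain assembles retrieved chunks into a grounded prompt.",
]
print(top_k(chunks, "vector similarity search", k=1))
```

Separating `embed`, the index, and the ranking step is what makes the real layers independently swappable: the embedding model, the vector store, and the orchestration can each evolve without touching the others.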
That architecture matters because most RAG demos stop at retrieval quality. This project goes one step further and treats deployability and usability as part of the system.
Why This Project Is Strong
This is a strong portfolio project for applied AI roles because it shows:
- practical document intelligence use cases
- full-stack implementation instead of notebook-only experimentation
- multi-service backend thinking
- production-adjacent concerns such as startup scripts, environment config, and debugging
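The environment-config concern can be illustrated with a small loader. The variable names (`MILVUS_URI`, `UPLOAD_DIR`, and so on) and defaults are assumptions for the sketch, not the project's actual configuration keys; only the Milvus default port 19530 is a real convention.

```python
import os

def load_config(env: dict[str, str]) -> dict[str, str]:
    # Centralize env lookups with explicit local-dev defaults, so a
    # bundled deployment only has to override what actually differs.
    return {
        "milvus_uri": env.get("MILVUS_URI", "http://localhost:19530"),
        "llm_model": env.get("LLM_MODEL", "gpt-4o-mini"),  # illustrative default
        "upload_dir": env.get("UPLOAD_DIR", "/tmp/uploads"),
    }

cfg = load_config(dict(os.environ))
```

Taking the environment as a plain dict argument, rather than reading `os.environ` inside the function, also makes the startup path trivially testable.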
UX and Operator Value
For hiring managers, the most useful signal is that the project is understandable from the outside. A visitor can see where a document enters the system, how it becomes indexed, and how the final grounded answer is produced.
That is much stronger than showing only a model API or only a retrieval benchmark.
Demo Strategy
If this project is exposed publicly, the safest and most convincing live-demo mode is:
- a small capped document set
- one or two preloaded sample files
- a retrieval-inspection screen
- a controlled chat flow over that indexed content
That lets visitors experience the end-to-end workflow without opening up an unlimited-cost sandbox.
Recommended public demo format
Expose a capped sandbox with two or three sample documents, visible retrieval chunks, and a grounded chat interface. This preserves the strongest product signal without turning the portfolio site into an unrestricted document-processing endpoint.
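The caps described above could be enforced with a few guards at the upload and chat endpoints. The limits and the `SandboxError` name are hypothetical choices for this sketch; the real deployment would pick its own quotas.

```python
# Illustrative quotas for a capped public demo (not the project's values).
MAX_DOCS = 3
MAX_FILE_BYTES = 2_000_000
MAX_QUESTIONS_PER_SESSION = 20

class SandboxError(Exception):
    """Raised when a demo-sandbox quota is exceeded."""

def check_upload(current_doc_count: int, file_bytes: int) -> None:
    # Keep the indexed set small and reject oversized files outright.
    if current_doc_count >= MAX_DOCS:
        raise SandboxError("document cap reached for the public demo")
    if file_bytes > MAX_FILE_BYTES:
        raise SandboxError("file too large for the demo sandbox")

def check_question(questions_asked: int) -> None:
    # Bound LLM spend per visitor session.
    if questions_asked >= MAX_QUESTIONS_PER_SESSION:
        raise SandboxError("question quota exhausted for this session")
```

Failing fast at the boundary keeps the demo cheap to run while leaving the full pipeline intact behind it.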
What This Project Signals
- full-stack AI application development
- document-intelligence product design
- multi-service backend architecture
- practical RAG systems thinking beyond toy demos