NL2SQL Data-Analysis Agent

This is not the "NL2SQL fine-tuning" project (that one trains a model with LlamaFactory LoRA). This is a runtime RAG agent: it trains nothing — it retrieves your schema, business knowledge, and past SQL, feeds them to a general LLM, and turns a single-sentence question into SQL that runs on MySQL, then renders a chart + explanation. The base is a fork of the open-source project Vanna.

Not the same as "NL2SQL fine-tuning"

	NL2SQL fine-tuning (case 7)	This project (NL2SQL data-analysis agent)
Core method	training (LlamaFactory LoRA)	retrieval (RAG) + prompting
What changes	model weights	the context fed to the model
Deployed form	a fine-tuned private model	a general LLM + vector store + agent orchestration
When to use	stable schema, high volume, local inference	shifting schema, fast onboarding, no training

The two are complementary — one bakes capability into the model, the other feeds capability into the context.

Four-layer architecture (forked from Vanna)

NL2SQLAgent/
├── frontend/                     # React + Vite (port 3000)
├── backend/
│   └── vanna/
│       ├── api_server.py         # FastAPI main service (port 8000)
│       └── ...                   # forked from Vanna, adapted for Chinese + Jina
├── jina-embedding/
│   └── app_jina_embedding_v4.py  # Jina embedding service, standalone FastAPI
└── milvus-deployment/
    └── start.sh                  # Docker Compose for Milvus (port 19550, COSINE)

The agent runs a ReAct loop via LangChain 1.0's create_agent (AGENT_RECURSION_LIMIT=50), with Qwen qwen-flash as the LLM (DashScope OpenAI-compatible API, temperature=0.1, max_tokens=14000).

RAG is the core: three Milvus collections

NL2SQL accuracy depends less on model size than on whether the model knows what your database looks like. The agent's get_table_schema tool vectorizes the question and retrieves in parallel from three Milvus collections:

question ──embed──┬──▶ collection: DDL / table schemas  → CREATE TABLE for relevant tables
                  ├──▶ collection: business docs        → column meanings / definitions / terms
                  └──▶ collection: historical SQL        → correct SQL for similar questions
                                                            ↓
                            assembled context → qwen-flash generates SQL

The three hits are assembled into context the model writes SQL from — like fetching a fresh cheat-sheet for this database on every question. The course quotes RAG lifting SQL accuracy from ~40% to over 85% (a course figure; this project ships no eval table, so treat it as a reference).

Jina Embeddings v4 as its own service

Vector quality directly drives retrieval quality, so the embedding is not Vanna's built-in one — it is swapped for Jina Embeddings v4, deployed as a standalone FastAPI microservice:

# app_jina_embedding_v4.py (key points)
# model: jina-embeddings-v4-vllm-retrieval
# OpenAI-compatible endpoint: POST /v1/embeddings
# output vector dimension: 2048

Environment jina_run (py3.13.5, torch==2.8.0 / transformers==4.56.2 / fastapi==0.118.3). Splitting it off follows the same logic as the multimodal RAG project: the embedding is a GPU-heavy component, so start / scale / swap it independently.

A full question's ReAct trace

Q: "Last month's deal amount per sales region, descending"
  └─> Agent (qwen-flash) runs a ReAct loop
        ├─ tool: get_table_schema(question)
        │     → 3 Milvus collections in parallel → DDL + business terms + similar SQL
        ├─ generates SQL: SELECT region, SUM(amount) ... GROUP BY region ORDER BY ... DESC
        ├─ tool: run_sql(sql)   → MySQL executes → result set
        ├─ tool: plot(result)   → picks chart type → bar chart
        └─ NL explanation: "East China is highest at ~…; QoQ …"
  └─> Frontend renders: SQL + data table + chart + explanation

What was changed in Vanna

After forking Vanna, four changes are spelled out in the courseware:

Embedding swap: native embedding → Jina Embeddings v4 (better Chinese retrieval)
Chinese support: handling for Chinese questions / Chinese-commented SQL
Batching: batch embedding + batch insert for much faster indexing
Dedup: deduplicate repeated questions so the historical-SQL collection doesn't pile up duplicates

What this signals

Training-vs-retrieval judgment: for the same NL2SQL goal, knowing when to fine-tune vs when RAG is cheaper
Agent tool design: three tools (get_table_schema / run_sql / plot) orchestrated autonomously by ReAct
Retrieval is the accuracy lever: parallel three-collection retrieval (DDL + business docs + historical SQL) is what makes NL2SQL land — not a bigger model
Engineering on top of OSS: forking Vanna and adapting embedding / Chinese / batching / dedup to real needs

Demo strategy

Why this is a case study, not an interactive demo

Demonstrating this system needs MySQL (a real business database) + Milvus (three vector collections) + the Jina embedding service running together, which can't be safely replayed in the browser. So this page documents the architecture, the RAG retrieval design, and the Vanna modifications; for the shared agent-orchestration pattern, compare the demos on deep-research-agent and the document-review agent.

Public preview can be enabled later without redesigning the case-study layout