返回案例页
真实数据生成流水线 + LoRA 评估

企业私有化 NL2SQL 微调系统

从数据库元数据出发生成训练样本,配套数据构造、训练、校验与评估流程,用于企业私有化 SQL 问答能力建设。

项目真实的两层结构:6 阶段流水线把私有 DB 转成 Alpaca 训练数据,然后 LLaMA-Factory 在 Qwen3-4B 上跑 LoRA。前后评估数据(BLEU-4 10.25 → 22.90、ROUGE-L 10.31 → 28.05)原样取自 notebook。

LoRAQLoRAFastAPIWebSocketSQL
企业私有化 NL2SQL 微调系统

本地版本说明

没有捏造内容。Alpaca 样例、LLaMA-Factory 命令、评估表、DeepSeek-Coder 分支数字(语法 91%、执行 61%、可训参 0.59%)都直接来自案例7 notebook《企业私有化Nl2SQL模型微调实战.ipynb》。

Faithful walkthrough · data_create + LoRA

From a private DB to a fine-tuned Qwen3-4B NL2SQL model

The project ships two halves: a six-step pipeline that turns DB metadata into Alpaca-format training data (data_create), and a LLaMA-Factory LoRA recipe on Qwen3-4B (nl2sql_fine_tuning). The numbers below are the predict_with_generate eval before vs after that LoRA pass.

data_create · 6-step generation pipeline

1

modules/db_connector.py

PyMySQL / psycopg2 / pyodbc · validate reachability of MySQL / Postgres / SQL Server.

2

modules/metadata_extractor.py

All columns (name, type, nullable, comment) + primary keys + foreign keys.

3

modules/table_cards.py

Compress full metadata into LLM-digestible table cards; preserve FK links.

4

modules/planner.py · LLM stage A

LLM groups related tables into business topics and allocates a per-topic sample budget.

5

modules/generator.py · LLM stage B

Per topic: build prompt with simplified DDL + dialect → LLM emits NL ↔ SQL pairs; auto-补充 if topic count is short.

6

modules/validator.py + exporter.py

SQLGlot syntax check → write Alpaca / ShareGPT JSONL (nl2sql.jsonl) → ready for LLaMA-Factory --dataset sql_train_alpaca.

Real Alpaca-format training sample

The canonical Spider / CSpider DB used in the notebook (cell 91) as the schema-aware NL2SQL textbook case.

CREATE TABLE singer (

Singer_ID INT PRIMARY KEY,

Name TEXT,

Country TEXT,

Song_Name TEXT,

Song_release_year TEXT,

Age INT,

Is_male BOOL

);

CREATE TABLE concert (

concert_ID INT PRIMARY KEY,

concert_Name TEXT,

Theme TEXT,

Stadium_ID INT,

Year TEXT

);

{
  "instruction": "将下面的自然语言问题转换为SQL查询语句。",
  "input":       "How many singers do we have?",
  "output":      "SELECT count(*) FROM singer"
}

This is exactly the Alpaca shape LLaMA-Factory expects via --dataset sql_train_alpaca.

Stage 1 · LLaMA-Factory CLI · real train command

llamafactory-cli train \
    --stage sft \
    --model_name_or_path /home/ubuntu/Qwen3-4B \
    --finetuning_type lora \
    --template qwen3_nothink \
    --dataset_dir data \
    --dataset sql_train_alpaca \
    --cutoff_len 2048 \
    --learning_rate 5e-05 \
    --num_train_epochs 3.0 \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --bf16 True \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_target all

Effective batch = 8 × 8 = 64 · cosine scheduler · bf16 mixed precision · LoRA on all linear layers. Loss drops from ~3.0 → ~0.3 over ~400 steps (notebook §10).

predict_with_generate eval · before → after LoRA

MetricBeforeAfterΔ
BLEU-410.2522.90+123%, 不可用 → 可用
ROUGE-119.1444.67+133%, 差 → 良好
ROUGE-25.3514.54+172%, 很差 → 中等
ROUGE-L10.3128.05+172%, 差 → 良好

Same dataset, same generation config (max_new_tokens=512, top_p=0.7, temperature=0.95). Inference speed essentially unchanged (0.94 → 0.904 samples/sec on ~1034 dev samples).

Stage 2 · nl2sql_fine_tuning/ on DeepSeek-Coder 6.7B

Files: train_lora.py · eval_model.py · vllm_test.py · utils.py · README.md · requirements.txt. Default params: rank=16, alpha=32, early-stopping patience=3, auto Alpaca / ShareGPT format detection.

Trainable params~40M / 6.7B (0.59%)
Syntax-valid SQL rate91%
Execution-matches-gold rate61%
vLLM batch throughput (batch=32)~0.7 sample/s/GPU

eval_model.py runs both --eval_text_match (Exact-Match + Token-F1) and --eval_sql_execution (run the predicted SQL on a real DB, compare result sets). The project counts execution-match as the truth signal. Deploy via vLLM API server (OpenAI-compatible) on port 8000.

Real LoRA target_modules (train_lora.py:265): [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj] · all 7 linear layers of the Llama / DeepSeek attention + MLP blocks.

Real planner prompt · modules/planner.py:_build_planning_prompt

你是数据库分析专家。以下是数据库表卡片摘要:
{table_cards_text}

请基于以上数据库结构,规划出若干个业务主题(topics),
每个主题选择 {min_tables}~{max_tables} 张相关联的表,
用于生成 NL2SQL 训练样本。

要求:
1. 主题应覆盖不同业务场景(用户分析、订单统计、销售报表…)
2. 每个主题的表应有业务关联性(外键或业务逻辑)
3. 所有主题的 count 之和 = {total_samples}
4. 每个主题至少 20 个样本

输出 JSON:
{
  "topics": [
    { "name": "...", "tables": [...], "reason": "...",
      "count": 100, "dialect": "{dialect}" }
  ]
}

Output is JSON-validated by _validate_and_adjust_plan: tables must resolve to metadata; counts re-scaled by ratio if drift; at least 1 sample per topic; topics auto-trimmed if total_samples is small.

modules/validator.py · 3-layer SQL check (in order)

1_check_syntax(sql, dialect)

sqlglot.parse_one(sql, read=dialect) — 失败立即 reject。

2_check_schema(sql, dialect)

遍历 exp.Table / exp.Column;构建别名映射 alias_to_table;表名+列名必须落到 metadata.table_columns。

3_check_execution(sql) · 可选

只允许 SELECT;自动加 LIMIT 1;db_connector.execute_query 真跑一次。

Same table_columns index serves both data_create (validates LLM-generated SQL) and nl2sql_fine_tuning (validates the model's predictions before grading).

nl2sql_fine_tuning/utils.py · 5 metric functions used by eval_model.py

normalize_sql(sql)

sqlparse.format(keyword_case="lower", strip_comments=True) → 单行空格归一。

calculate_exact_match(preds, refs)

两边都 normalize_sql 后字符串比较 — 严格但稳定。

calculate_token_f1(pred, ref)

re.findall(r"\w+", lower) → set;precision · recall · F1。

calculate_execution_accuracy(preds, refs, executor)

逐条 executor.execute_sql 跑生成 SQL → execution_success_rate;再 executor.compare_results 比执行结果集 → result_match_rate。

extract_sql_from_output(output)

从模型输出里抠 ```sql ... ``` 代码块;处理「我先解释一下…」之类的冗余。

SQLExecutor reuses one PyMySQL connection across the whole eval pass (execute_batch()) — that's the trick behind the 351 samples/s number; without connection reuse it's ~10x slower.

建议体验

在 Spider 的 concert_singer 和项目自带的中文销售 schema 之间切换 — Alpaca 形态相同,SQL 方言不同。

读 llamafactory-cli 命令 — LoRA rank 8 / alpha 16 / cosine / bf16 / 3 epochs。有效 batch = 8 × 8 = 64。

对照 LoRA 前后的 BLEU-4 / ROUGE-L — 这是真实的 predict_with_generate 数字,不是挑出来的。

这个试玩能说明什么

你能交付完整生命周期:私有 DB 元数据 → Alpaca 数据 → LLaMA-Factory LoRA → predict_with_generate 评估。

你把文本相似度(BLEU/ROUGE)和执行匹配(真跑 SQL 比对结果集)分开 — 项目把后者当成最终真值。

你清楚每个杠杆的作用:LoRA 提语法、扩数据提执行准确率、vLLM 提推理吞吐。

技术栈

Qwen3-4B + LLaMA-Factory LoRA (r=8, α=16) · DeepSeek-Coder 6.7B 分支 · vLLM 351 samples/s

真实评估

BLEU-4 10.25→22.90 · ROUGE-1 19.14→44.67 · ROUGE-L 10.31→28.05

执行匹配

DeepSeek-Coder 分支:语法 91%、执行匹配 61%