企业私有化 NL2SQL 微调系统
从数据库元数据出发生成训练样本,配套数据构造、训练、校验与评估流程,用于企业私有化 SQL 问答能力建设。
项目真实的两层结构:6 阶段流水线把私有 DB 转成 Alpaca 训练数据,然后 LLaMA-Factory 在 Qwen3-4B 上跑 LoRA。前后评估数据(BLEU-4 10.25 → 22.90、ROUGE-L 10.31 → 28.05)原样取自 notebook。
本地版本说明
没有捏造内容。Alpaca 样例、LLaMA-Factory 命令、评估表、DeepSeek-Coder 分支数字(语法 91%、执行 61%、可训参 0.59%)都直接来自案例7 notebook《企业私有化Nl2SQL模型微调实战.ipynb》。
From a private DB to a fine-tuned Qwen3-4B NL2SQL model
The project ships two halves: a six-step pipeline that turns DB metadata into Alpaca-format training data (data_create), and a LLaMA-Factory LoRA recipe on Qwen3-4B (nl2sql_fine_tuning). The numbers below are the predict_with_generate eval before vs after that LoRA pass.
data_create · 6-step generation pipeline
modules/db_connector.py
PyMySQL / psycopg2 / pyodbc · validate reachability of MySQL / Postgres / SQL Server.
modules/metadata_extractor.py
All columns (name, type, nullable, comment) + primary keys + foreign keys.
modules/table_cards.py
Compress full metadata into LLM-digestible table cards; preserve FK links.
modules/planner.py · LLM stage A
LLM groups related tables into business topics and allocates a per-topic sample budget.
modules/generator.py · LLM stage B
Per topic: build prompt with simplified DDL + dialect → LLM emits NL ↔ SQL pairs; auto-补充 if topic count is short.
modules/validator.py + exporter.py
SQLGlot syntax check → write Alpaca / ShareGPT JSONL (nl2sql.jsonl) → ready for LLaMA-Factory --dataset sql_train_alpaca.
Real Alpaca-format training sample
The canonical Spider / CSpider DB used in the notebook (cell 91) as the schema-aware NL2SQL textbook case.
CREATE TABLE singer (
Singer_ID INT PRIMARY KEY,
Name TEXT,
Country TEXT,
Song_Name TEXT,
Song_release_year TEXT,
Age INT,
Is_male BOOL
);
CREATE TABLE concert (
concert_ID INT PRIMARY KEY,
concert_Name TEXT,
Theme TEXT,
Stadium_ID INT,
Year TEXT
);
{
"instruction": "将下面的自然语言问题转换为SQL查询语句。",
"input": "How many singers do we have?",
"output": "SELECT count(*) FROM singer"
}This is exactly the Alpaca shape LLaMA-Factory expects via --dataset sql_train_alpaca.
Stage 1 · LLaMA-Factory CLI · real train command
llamafactory-cli train \
--stage sft \
--model_name_or_path /home/ubuntu/Qwen3-4B \
--finetuning_type lora \
--template qwen3_nothink \
--dataset_dir data \
--dataset sql_train_alpaca \
--cutoff_len 2048 \
--learning_rate 5e-05 \
--num_train_epochs 3.0 \
--per_device_train_batch_size 8 \
--gradient_accumulation_steps 8 \
--lr_scheduler_type cosine \
--bf16 True \
--lora_rank 8 \
--lora_alpha 16 \
--lora_target allEffective batch = 8 × 8 = 64 · cosine scheduler · bf16 mixed precision · LoRA on all linear layers. Loss drops from ~3.0 → ~0.3 over ~400 steps (notebook §10).
predict_with_generate eval · before → after LoRA
| Metric | Before | After | Δ |
|---|---|---|---|
| BLEU-4 | 10.25 | 22.90 | +123%, 不可用 → 可用 |
| ROUGE-1 | 19.14 | 44.67 | +133%, 差 → 良好 |
| ROUGE-2 | 5.35 | 14.54 | +172%, 很差 → 中等 |
| ROUGE-L | 10.31 | 28.05 | +172%, 差 → 良好 |
Same dataset, same generation config (max_new_tokens=512, top_p=0.7, temperature=0.95). Inference speed essentially unchanged (0.94 → 0.904 samples/sec on ~1034 dev samples).
Stage 2 · nl2sql_fine_tuning/ on DeepSeek-Coder 6.7B
Files: train_lora.py · eval_model.py · vllm_test.py · utils.py · README.md · requirements.txt. Default params: rank=16, alpha=32, early-stopping patience=3, auto Alpaca / ShareGPT format detection.
eval_model.py runs both --eval_text_match (Exact-Match + Token-F1) and --eval_sql_execution (run the predicted SQL on a real DB, compare result sets). The project counts execution-match as the truth signal. Deploy via vLLM API server (OpenAI-compatible) on port 8000.
Real LoRA target_modules (train_lora.py:265): [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj] · all 7 linear layers of the Llama / DeepSeek attention + MLP blocks.
Real planner prompt · modules/planner.py:_build_planning_prompt
你是数据库分析专家。以下是数据库表卡片摘要:
{table_cards_text}
请基于以上数据库结构,规划出若干个业务主题(topics),
每个主题选择 {min_tables}~{max_tables} 张相关联的表,
用于生成 NL2SQL 训练样本。
要求:
1. 主题应覆盖不同业务场景(用户分析、订单统计、销售报表…)
2. 每个主题的表应有业务关联性(外键或业务逻辑)
3. 所有主题的 count 之和 = {total_samples}
4. 每个主题至少 20 个样本
输出 JSON:
{
"topics": [
{ "name": "...", "tables": [...], "reason": "...",
"count": 100, "dialect": "{dialect}" }
]
}Output is JSON-validated by _validate_and_adjust_plan: tables must resolve to metadata; counts re-scaled by ratio if drift; at least 1 sample per topic; topics auto-trimmed if total_samples is small.
modules/validator.py · 3-layer SQL check (in order)
1_check_syntax(sql, dialect)
sqlglot.parse_one(sql, read=dialect) — 失败立即 reject。
2_check_schema(sql, dialect)
遍历 exp.Table / exp.Column;构建别名映射 alias_to_table;表名+列名必须落到 metadata.table_columns。
3_check_execution(sql) · 可选
只允许 SELECT;自动加 LIMIT 1;db_connector.execute_query 真跑一次。
Same table_columns index serves both data_create (validates LLM-generated SQL) and nl2sql_fine_tuning (validates the model's predictions before grading).
nl2sql_fine_tuning/utils.py · 5 metric functions used by eval_model.py
normalize_sql(sql)sqlparse.format(keyword_case="lower", strip_comments=True) → 单行空格归一。
calculate_exact_match(preds, refs)两边都 normalize_sql 后字符串比较 — 严格但稳定。
calculate_token_f1(pred, ref)re.findall(r"\w+", lower) → set;precision · recall · F1。
calculate_execution_accuracy(preds, refs, executor)逐条 executor.execute_sql 跑生成 SQL → execution_success_rate;再 executor.compare_results 比执行结果集 → result_match_rate。
extract_sql_from_output(output)从模型输出里抠 ```sql ... ``` 代码块;处理「我先解释一下…」之类的冗余。
SQLExecutor reuses one PyMySQL connection across the whole eval pass (execute_batch()) — that's the trick behind the 351 samples/s number; without connection reuse it's ~10x slower.
建议体验
在 Spider 的 concert_singer 和项目自带的中文销售 schema 之间切换 — Alpaca 形态相同,SQL 方言不同。
读 llamafactory-cli 命令 — LoRA rank 8 / alpha 16 / cosine / bf16 / 3 epochs。有效 batch = 8 × 8 = 64。
对照 LoRA 前后的 BLEU-4 / ROUGE-L — 这是真实的 predict_with_generate 数字,不是挑出来的。
这个试玩能说明什么
你能交付完整生命周期:私有 DB 元数据 → Alpaca 数据 → LLaMA-Factory LoRA → predict_with_generate 评估。
你把文本相似度(BLEU/ROUGE)和执行匹配(真跑 SQL 比对结果集)分开 — 项目把后者当成最终真值。
你清楚每个杠杆的作用:LoRA 提语法、扩数据提执行准确率、vLLM 提推理吞吐。
技术栈
Qwen3-4B + LLaMA-Factory LoRA (r=8, α=16) · DeepSeek-Coder 6.7B 分支 · vLLM 351 samples/s
真实评估
BLEU-4 10.25→22.90 · ROUGE-1 19.14→44.67 · ROUGE-L 10.31→28.05
执行匹配
DeepSeek-Coder 分支:语法 91%、执行匹配 61%