RL-Tuned Function-Calling Agent Pipeline
A function-calling agent pipeline for preference data generation and evaluation, designed to improve tool selection and argument quality.
AutoToolDPO is the actual project name. The samples below are taken verbatim from its generated dataset — versioned tool names (get_stock_price@v1, web_search@v1, ...), <function_call> / <final> tags, and the chosen/rejected shape LLaMA-Factory consumes for DPO.
Why this local version exists
No live LLM call. The samples and the 6-stage pipeline visualization come straight from cells 47, 68, and §6 of the 案例9 notebook, 企业级Agent Function-Calling RL微调.ipynb (Case 9: Enterprise Agent Function-Calling RL Fine-Tuning). The ConcurrentEngine concurrency limit (asyncio.Semaphore(10)) and the use of the DeepSeek API as the underlying LLM are project facts.
DPO data for function-calling agents · chosen vs rejected
The samples below are taken verbatim from the AutoToolDPO project's generated dataset (notebook cells 47, 68). Each sample carries the tool registry, full conversation, and a chosen / rejected pair — the exact JSONL shape consumed by LLaMA-Factory's DPO trainer.
Sample dataset rows
Tool registry · this sample
get_stock_price@v1
获取股票实时价格 (get a real-time stock price) · args: symbol (e.g. AAPL, TSLA)
web_search@v1
在互联网上搜索信息 (search the internet for information) · args: query, max_results?
send_email@v1
发送电子邮件 (send an email) · args: to, subject, body
The @v1 suffix is intentional — the project uses versioned tool names so older agent traces stay parseable when schemas evolve.
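As a minimal sketch of how a versioned name could be split when reading older traces: the helper name parse_tool_ref and the ToolRef type are hypothetical, and the assumption that versions are plain integers after "@v" is mine, not something the project states.

```python
from typing import NamedTuple


class ToolRef(NamedTuple):
    """Hypothetical container for a parsed tool reference."""
    name: str
    version: int


def parse_tool_ref(raw: str) -> ToolRef:
    """Split a name such as 'get_stock_price@v1' into name and version.

    Assumes the '<name>@v<integer>' convention seen in the samples.
    """
    name, sep, version = raw.rpartition("@v")
    if not sep or not name or not version.isdecimal():
        raise ValueError(f"not a versioned tool name: {raw!r}")
    return ToolRef(name, int(version))


ref = parse_tool_ref("get_stock_price@v1")
print(ref.name, ref.version)  # prints: get_stock_price 1
```

A trace recorded against v1 can then be routed to the v1 schema even after a v2 of the same tool is registered.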
messages[] · conversation prefix
user
请使用获取股票实时价格工具 (Please use the real-time stock price tool)
chosen (positive sample)
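我需要知道您想查询哪只股票的实时价格。请提供股票代码,比如 AAPL(苹果公司)或 TSLA(特斯拉公司)。
(English: I need to know which stock's real-time price you want to look up. Please provide a ticker symbol, such as AAPL (Apple) or TSLA (Tesla). The placeholder symbol in the call below reads "please provide a ticker symbol".)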
<function_call>
{
"name": "get_stock_price@v1",
"arguments": { "symbol": "请提供股票代码" }
}
</function_call>
rejected (negative sample)
<function_call>
{
"name": "get_stock_price@v1",
"arguments": { "symbol": "" }
}
</function_call>
Why this is rejected
Empty symbol: the tool call will fail, and the reply never asks the user for the required argument.
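The failure mode above can be detected mechanically. The following is a rough sketch, not the project's validator: REQUIRED_ARGS is a hypothetical table (the real one would presumably be derived from the tool registry), and the check only catches missing or blank values, so a placeholder string like the one in the chosen sample would still pass.

```python
import json
import re

# Matches the <function_call> ... </function_call> blocks used in the samples.
CALL_RE = re.compile(r"<function_call>(.*?)</function_call>", re.DOTALL)

# Hypothetical required-argument table, for illustration only.
REQUIRED_ARGS = {"get_stock_price@v1": ["symbol"]}


def extract_calls(text: str) -> list:
    """Return every function_call payload in `text` as a parsed dict."""
    return [json.loads(body) for body in CALL_RE.findall(text)]


def empty_required_args(call: dict) -> list:
    """List required arguments that are missing or blank in one call."""
    args = call.get("arguments", {})
    required = REQUIRED_ARGS.get(call.get("name"), [])
    return [k for k in required if args.get(k) is None or not str(args[k]).strip()]


rejected = (
    '<function_call>\n'
    '{"name": "get_stock_price@v1", "arguments": {"symbol": ""}}\n'
    '</function_call>'
)
print(empty_required_args(extract_calls(rejected)[0]))  # prints: ['symbol']
```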
backend/configs/tools_registry.json · 10 tools shipped
get_current_time@v1
获取当前时间 (get the current time)
category="time" · args: (none)
get_weather@v1
查询指定城市的天气信息 (look up the weather for a given city)
category="weather" · args: city
calculate@v1
执行数学计算 (perform a mathematical calculation)
category="math" · args: expression
web_search@v1
在互联网上搜索信息 (search the internet for information)
category="search" · args: query, max_results?
translate_text@v1
翻译文本到目标语言 (translate text into a target language)
category="translation" · args: text, target_language
send_email@v1
发送电子邮件 (send an email)
category="communication" · args: to, subject, body
get_stock_price@v1
获取股票实时价格 (get a real-time stock price)
category="finance" · args: symbol
create_reminder@v1
创建提醒事项 (create a reminder)
category="productivity" · args: title, time
get_news@v1
获取最新新闻 (get the latest news)
category="news" · args: category, country?
convert_currency@v1
货币汇率转换 (convert between currencies)
category="finance" · args: amount, from_currency, to_currency
The 3 samples above each draw a subset of these 10 tools (the project also supports a tool_count_min/max range mode that picks 2-5 tools per sample at random). Adding a tool means editing this JSON file; no code change is needed.
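A minimal sketch of what the range mode could look like. The parameter names tool_count_min and tool_count_max come from the text; the function name pick_tools and the use of random.sample are assumptions, and the real project may weight or filter tools differently.

```python
import random

# Tool names as listed in the registry above.
TOOL_NAMES = [
    "get_current_time@v1", "get_weather@v1", "calculate@v1", "web_search@v1",
    "translate_text@v1", "send_email@v1", "get_stock_price@v1",
    "create_reminder@v1", "get_news@v1", "convert_currency@v1",
]


def pick_tools(rng: random.Random, tool_count_min: int = 2, tool_count_max: int = 5) -> list:
    """Pick a random number of distinct tools within the given range."""
    count = rng.randint(tool_count_min, tool_count_max)  # inclusive on both ends
    return rng.sample(TOOL_NAMES, count)                 # sampling without replacement


toolset = pick_tools(random.Random(42))
print(len(toolset), toolset)
```

Passing an explicit random.Random instance keeps a generation run reproducible from a seed.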
backend/core/task_generator.py · TASK_TEMPLATES · 8 categories · 76 templates total
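e.g. {city}今天天气怎么样? (What's the weather like in {city} today?)
e.g. 现在几点了? (What time is it now?)
e.g. 帮我计算{expr} (Calculate {expr} for me)
e.g. 帮我搜索关于{query}的信息 (Search for information about {query})
e.g. 请把'{text}'翻译成{target_lang} (Please translate '{text}' into {target_lang})
e.g. 把{amount}{from_currency}转换成{to_currency} (Convert {amount} {from_currency} to {to_currency})
e.g. 给我看看{category}类的新闻 (Show me {category} news)
e.g. 请使用合适的工具帮我完成这个任务 (Please use a suitable tool to complete this task for me)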
Single vs multi-turn split by multi_ratio (default 0.3). Multi-turn joins via 7 connectors: 然后 / 接着 / 同时 / 另外 / 还有 / 并且 / 以及 (then / next / meanwhile / in addition / also / and / as well as).
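One way the split and the join could work is sketched below. Only multi_ratio and the connector list come from the text; build_query is a hypothetical name, and joining exactly two queries with no extra punctuation is an assumption about behavior the page does not spell out.

```python
import random

# The 7 connectors listed above.
CONNECTORS = ["然后", "接着", "同时", "另外", "还有", "并且", "以及"]


def build_query(rng: random.Random, single_queries: list, multi_ratio: float = 0.3):
    """Return (query, is_multi): a joined query with probability multi_ratio."""
    if len(single_queries) >= 2 and rng.random() < multi_ratio:
        first, second = rng.sample(single_queries, 2)  # two distinct sub-tasks
        return f"{first}{rng.choice(CONNECTORS)}{second}", True
    return rng.choice(single_queries), False


pool = ["现在几点了", "帮我计算1+1", "北京今天天气怎么样"]
print(build_query(random.Random(7), pool))
```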
PARAMS pools · diversity behind the 76 templates
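cities: 北京 / 上海 / 广州 / 深圳 / 杭州 … (Beijing / Shanghai / Guangzhou / Shenzhen / Hangzhou)
expressions: 1+1 / 25*4 / sqrt(144) / 2^10 …
search_queries: 人工智能 / 机器学习 / 量子计算 … (artificial intelligence / machine learning / quantum computing)
texts: 你好 / 谢谢 / 早上好 … (hello / thank you / good morning)
target_langs: 英语 / 日语 / 法语 … (English / Japanese / French)
currencies_from × to: 美元↔人民币 / 欧元↔英镑 … (USD↔CNY / EUR↔GBP)
amounts: 100 / 500 / 1000 / 5000 …
news_categories: 科技 / 体育 / 财经 / 娱乐 … (technology / sports / finance / entertainment)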
76 templates × multiple param slots → thousands of realistic, unique user queries even before the multi-turn joiner kicks in.
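To make the multiplication concrete, here is a small sketch that expands three of the templates over the pool values shown above. The pools on this page are truncated, so only the listed values are used; the mapping from slot names to pools and the treatment of currencies as two independent lists are assumptions.

```python
from itertools import product
from string import Formatter

# Only the values visible above; the real pools are longer.
POOLS = {
    "city": ["北京", "上海", "广州", "深圳", "杭州"],
    "expr": ["1+1", "25*4", "sqrt(144)", "2^10"],
    "amount": ["100", "500", "1000", "5000"],
    "from_currency": ["美元", "欧元"],
    "to_currency": ["人民币", "英镑"],
}

TEMPLATES = [
    "{city}今天天气怎么样?",
    "帮我计算{expr}",
    "把{amount}{from_currency}转换成{to_currency}",
]


def expand(template: str) -> list:
    """Fill every slot of one template with every combination of pool values."""
    slots = [field for _, field, _, _ in Formatter().parse(template) if field]
    combos = product(*(POOLS[slot] for slot in slots))
    return [template.format(**dict(zip(slots, combo))) for combo in combos]


all_queries = [query for template in TEMPLATES for query in expand(template)]
print(len(all_queries))  # prints: 25  (5 + 4 + 4*2*2)
```

Even this three-template subset with truncated pools yields 25 distinct queries, which is why 76 templates over full pools reach the thousands.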
Backend 6-module pipeline · FastAPI + asyncio.Semaphore(10)
stage 1
task_generator.py
TaskGenerator.generate_tasks(): draws a task template, binds a random toolset, and produces Task(user_query, tools, system_prompt).
stage 2
data_synthesizer.py · chosen
DataSynthesizer._generate_chosen(task): branches on single-turn vs multi-turn; the multi-turn branch calls generate_multi_turn_dialogue() and writes the result back to task._multi_turn_context.
stage 3
data_synthesizer.py · smart_rejected
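synthesize_sample_with_smart_rejected(): 5 steps. Generate chosen and rejected concurrently → LLM self-evaluation yields quality_score and similarity_score → strategy 1 (quality < 5 and a correction is available: use corrected_chosen as the new chosen) → strategy 2 (similarity > 80%: regenerate a worse rejected at temperature=1.2) → finalize.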
stage 4
validator.py
Validator.validate_sample(): all required fields present · chosen ≠ rejected · function_call JSON parses · optional LLM self-evaluation score.
stage 5
concurrent_engine.py
ConcurrentEngine.process_tasks(): asyncio.Semaphore plus ProgressStats (progress_percent / generation_rate / validation_success_rate) pushed over WebSocket; retries with exponential backoff.
stage 6
exporter.py
Exporter.export_to_jsonl(): writes data_dpo.jsonl, dataset_info.json, generation_stats.json, and invalid_samples.jsonl.
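The three structural checks of the validator stage can be sketched as follows. This is an illustration under assumptions, not the project's Validator: check_sample is a hypothetical name, the required-field list is taken from the sample shape {system, tools, messages, chosen, rejected}, and the optional LLM scoring step is left out.

```python
import json
import re

REQUIRED_FIELDS = ("system", "tools", "messages", "chosen", "rejected")
CALL_RE = re.compile(r"<function_call>(.*?)</function_call>", re.DOTALL)


def check_sample(sample: dict) -> list:
    """Return a list of human-readable issues; an empty list means the sample passes."""
    missing = [field for field in REQUIRED_FIELDS if field not in sample]
    if missing:
        return [f"missing fields: {', '.join(missing)}"]

    issues = []
    if sample["chosen"].strip() == sample["rejected"].strip():
        issues.append("chosen equals rejected")

    for side in ("chosen", "rejected"):
        for body in CALL_RE.findall(sample[side]):
            try:
                json.loads(body)
            except json.JSONDecodeError:
                issues.append(f"{side}: function_call is not valid JSON")
    return issues


sample = {
    "system": "", "tools": [], "messages": [],
    "chosen": '<function_call>{"name": "get_weather@v1", "arguments": {"city": "北京"}}</function_call>',
    "rejected": '<function_call>{"name": "get_weather@v1", "arguments": {city}}</function_call>',
}
print(check_sample(sample))  # prints: ['rejected: function_call is not valid JSON']
```

Samples that return a non-empty issue list are the kind of rows one would expect to end up in invalid_samples.jsonl.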
smart_rejected strategy · 5 steps · data_synthesizer.py:89-200
step 1
Concurrent generation
asyncio.gather(_generate_chosen, _generate_rejected): chosen and rejected are generated concurrently, saving one round of LLM latency.
step 2
Build a temporary sample
Pack task + chosen + rejected into a temporary sample dict and hand it to the LLM for self-evaluation.
step 3
LLM self-evaluation
llm_client.validate_and_correct(sample) → quality_score (0-10) + similarity_score (0-100) + optional corrected_chosen.
step 4
Strategy 1 · correction
If quality_score < 5.0 and corrected_chosen exists → use corrected_chosen as the new chosen; the original rejected is kept as a genuine error case.
step 5
Strategy 2 · regeneration
If similarity_score > 80% → regenerate a worse rejected at temperature=1.2 (a near-identical pair is a "fake contrast" that gives DPO nothing to learn from).
Sample fields after all 5 steps: {task_id, task_type, system, tools, messages, chosen, rejected, quality_score, similarity_score}
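The five steps can be sketched end to end with a stand-in client. Everything named here is illustrative: FakeLLM replaces the real llm_client and returns canned text, synthesize is a hypothetical name for the flow, and the default temperature of 0.7 for the first rejected draft is an assumption (only the 1.2 used for regeneration comes from the text).

```python
import asyncio


class FakeLLM:
    """Stand-in client that returns canned values instead of calling an API."""

    def __init__(self, quality: float, similarity: float):
        self.quality, self.similarity = quality, similarity

    async def generate_chosen(self, task):
        return "chosen-draft"

    async def generate_rejected(self, task, temperature=0.7):  # 0.7 is assumed
        return f"rejected@T={temperature}"

    async def validate_and_correct(self, sample):
        return {"quality_score": self.quality,
                "similarity_score": self.similarity,
                "corrected_chosen": "chosen-fixed"}


async def synthesize(task: dict, llm) -> dict:
    # Step 1: generate both sides concurrently.
    chosen, rejected = await asyncio.gather(
        llm.generate_chosen(task), llm.generate_rejected(task))
    # Step 2: build a temporary sample for review.
    sample = {**task, "chosen": chosen, "rejected": rejected}
    # Step 3: LLM self-evaluation.
    review = await llm.validate_and_correct(sample)
    # Step 4 (strategy 1): swap in the corrected chosen when quality is low.
    if review["quality_score"] < 5.0 and review.get("corrected_chosen"):
        sample["chosen"] = review["corrected_chosen"]
    # Step 5 (strategy 2): regenerate a worse rejected when the pair is too similar.
    if review["similarity_score"] > 80.0:
        sample["rejected"] = await llm.generate_rejected(task, temperature=1.2)
    sample["quality_score"] = review["quality_score"]
    sample["similarity_score"] = review["similarity_score"]
    return sample


result = asyncio.run(synthesize({"task_id": "t1"}, FakeLLM(quality=3.0, similarity=90.0)))
print(result["chosen"], result["rejected"])  # prints: chosen-fixed rejected@T=1.2
```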
validate_and_correct LLM prompt · 4 axes (services/llm_client.py:269)
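1. Chosen response quality: whether the tool is called correctly and the arguments are accurate
2. Rejected response quality: whether it is genuinely worse than chosen
3. Difference between the two: whether the gap is clear and worth learning from
4. Format compliance: whether the function_call format requirements are met
Returned JSON fields: {is_valid, quality_score, similarity_score, issues[], corrected_chosen?, corrected_rejected?}. Quality: 9-10 excellent / 5-6 average / <5 poor. Similarity: <50% excellent / >80% means rejected is not bad enough.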
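JSON parse fallback
DeepSeek occasionally wraps its output in ```json … ``` fences. The client first strips three fence variants, then calls json.loads(); on parse failure it falls back to {quality_score: 7.0, similarity_score: 50.0, issues: ["LLM返回格式错误"]} (the issue string means "LLM returned a malformed response"), so that a single LLM hiccup does not kill the whole batch.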
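A rough sketch of this fallback follows. The page does not say which three fence variants the client strips, so the regex below handles a generic fence with an optional language tag; parse_review and fallback are hypothetical names, and only the fallback values themselves come from the text.

```python
import json
import re

FENCE = "`" * 3  # built programmatically to keep this example fence-safe

# A fenced payload: opening fence, optional language tag, body, closing fence.
FENCE_RE = re.compile(rf"^{FENCE}[A-Za-z]*\s*(.*?)\s*{FENCE}$", re.DOTALL)


def fallback() -> dict:
    """Neutral scores used when the LLM output cannot be parsed."""
    return {"quality_score": 7.0, "similarity_score": 50.0,
            "issues": ["LLM返回格式错误"]}


def parse_review(raw: str) -> dict:
    """Strip an optional code fence, then parse JSON; never raises."""
    text = raw.strip()
    match = FENCE_RE.match(text)
    if match:
        text = match.group(1)
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return fallback()
    return parsed if isinstance(parsed, dict) else fallback()


wrapped = FENCE + 'json\n{"quality_score": 9, "similarity_score": 30}\n' + FENCE
print(parse_review(wrapped))       # prints: {'quality_score': 9, 'similarity_score': 30}
print(parse_review("not json")["quality_score"])  # prints: 7.0
```

Note that the fallback quality of 7.0 sits above the 5.0 threshold, so an unparseable review never triggers the correction strategy.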
LLM provider
DeepSeek API · deepseek-chat
OpenAI-compatible client; swappable to GPT-4 or a local model without changing business code
Concurrency control
asyncio.Semaphore(10) + exponential backoff
Backoff: 2/4/8 s (ordinary errors) → 3/9/27 s (timeouts), MAX_RETRIES=15
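The combination of a semaphore and the two backoff schedules might look like the sketch below. Reading 2/4/8 and 3/9/27 as powers of 2 and 3 is my interpretation; the max_delay cap, the injectable sleep, and the choice to release the semaphore slot while waiting are all assumptions of this sketch rather than documented behavior.

```python
import asyncio

MAX_RETRIES = 15


def backoff_delay(attempt: int, timed_out: bool) -> float:
    """Delay in seconds before the next try: 2, 4, 8... or 3, 9, 27... on timeouts."""
    base = 3 if timed_out else 2
    return float(base ** attempt)


async def call_with_retry(fn, sem: asyncio.Semaphore, *,
                          sleep=asyncio.sleep, max_delay: float = 60.0):
    """Run `fn` under the semaphore, retrying failures with exponential backoff."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            async with sem:          # the slot is held only while the call runs
                return await fn()
        except Exception as exc:
            if attempt == MAX_RETRIES:
                raise                # retries exhausted: surface the last error
            timed_out = isinstance(exc, (asyncio.TimeoutError, TimeoutError))
            await sleep(min(backoff_delay(attempt, timed_out), max_delay))


async def main():
    sem = asyncio.Semaphore(10)      # at most 10 LLM calls in flight

    async def fake_llm_call():
        return "ok"

    return await asyncio.gather(*(call_with_retry(fake_llm_call, sem) for _ in range(25)))


print(len(asyncio.run(main())))  # prints: 25
```

Without some cap, 15 retries at base 2 would end with a wait of several hours, which is why the sketch bounds the delay.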
Training target
LLaMA-Factory · DPO · Qwen family
Dataset registration: the columns in dataset_info.json must match the JSONL keys exactly
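An alignment check along these lines can catch a mismatch before training starts. The registration entry below is an assumed layout based on LLaMA-Factory's documented dataset_info.json conventions (a columns mapping plus a ranking flag for preference data); the dataset name autotool_dpo is hypothetical and the project's actual file may differ.

```python
import json

# Assumed registration entry; the dataset name is made up for this example.
DATASET_INFO = {
    "autotool_dpo": {
        "file_name": "data_dpo.jsonl",
        "formatting": "sharegpt",
        "ranking": True,
        "columns": {
            "system": "system", "tools": "tools", "messages": "messages",
            "chosen": "chosen", "rejected": "rejected",
        },
    }
}


def misaligned_columns(dataset_info: dict, dataset_name: str, jsonl_text: str) -> list:
    """Return (line_no, column_role, expected_key) for every key a row lacks."""
    columns = dataset_info[dataset_name]["columns"]
    problems = []
    for line_no, line in enumerate(jsonl_text.split("\n"), start=1):
        if not line.strip():
            continue  # tolerate blank lines such as a trailing newline
        row = json.loads(line)
        for role, key in columns.items():
            if key not in row:
                problems.append((line_no, role, key))
    return problems


row = {"system": "", "messages": [], "chosen": "a", "rejected": "b"}  # no "tools"
print(misaligned_columns(DATASET_INFO, "autotool_dpo", json.dumps(row)))
# prints: [(1, 'tools', 'tools')]
```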
What to try
Click through the 3 samples — read the exact chosen vs rejected text and the "why this is rejected" rationale.
Spot the @v1 version suffix on every tool name — it is on purpose, so older traces stay parseable when schemas evolve.
Trace the 6-stage backend pipeline (TaskGenerator → DataSynthesizer chosen → smart_rejected → Validator → ConcurrentEngine → Exporter) along the bottom strip.
What this demo proves
You can ship the data infrastructure modern DPO actually needs — not just hand-wave about "preference learning".
You design rejected samples on purpose (wrong tool / empty args / skipped tool / over-confident <final>) — a real DPO failure-mode taxonomy, not abstract scoring.
You handle production concerns: concurrency (Semaphore), retries (exponential backoff), JSONL strictness for LLaMA-Factory ingestion.
Project name
AutoToolDPO · FastAPI + React + DeepSeek API
Sample shape
{system, tools, messages, chosen, rejected} JSONL · versioned tool names @v1
Training target
LLaMA-Factory DPO trainer on Qwen-7B family