AutoToolDPO · 真实 chosen/rejected 对

函数调用 Agent 偏好优化流水线

围绕 function calling 代理构建数据生成、偏好对构造和评估流程，用于提升工具选择与参数调用质量。

AutoToolDPO 是项目的真实名字。下面 3 个样本原样取自它生成的数据集 — 带版本号的工具名（get_stock_price@v1、web_search@v1...）、<function_call> / <final> 标签、LLaMA-Factory DPO 训练直接消费的 chosen/rejected 形态。

DPOFunction CallingEvaluationFastAPIAgents

案例拆解源码

本地版本说明

不调真实 LLM。样本和 6 阶段流水线可视化直接来自案例9 notebook《企业级Agent Function-Calling RL微调.ipynb》第 47、68 个 cell 和第 6 章。并发引擎（asyncio.Semaphore=10）+ DeepSeek API 都是项目的真实事实。

AutoToolDPO · faithful walkthrough

DPO data for function-calling agents · chosen vs rejected

The samples below are taken verbatim from the AutoToolDPO project's generated dataset (notebook cells 47, 68). Each sample carries the tool registry, full conversation, and a chosen / rejected pair — the exact JSONL shape consumed by LLaMA-Factory's DPO trainer.

Sample dataset rows · click to switch

Tool registry · this sample

get_stock_price@v1

获取股票实时价格 · args: symbol (e.g. AAPL, TSLA)

web_search@v1

在互联网上搜索信息 · args: query, max_results?

send_email@v1

发送电子邮件 · args: to, subject, body

The @v1 suffix is intentional — the project uses versioned tool names so older agent traces stay parseable when schemas evolve.

messages[] · conversation prefix

user

请使用获取股票实时价格工具

chosen (positive sample)

DPO chosen

我需要知道您想查询哪只股票的实时价格。请提供股票代码，比如 AAPL（苹果公司）或 TSLA（特斯拉公司）。

<function_call>
{
  "name": "get_stock_price@v1",
  "arguments": { "symbol": "请提供股票代码" }
}
</function_call>

rejected (negative sample)

DPO rejected

<function_call>
{
  "name": "get_stock_price@v1",
  "arguments": { "symbol": "" }
}
</function_call>

Why this is rejected

空 symbol — 工具会失败，且没有向用户索取必填参数。

backend/configs/tools_registry.json · 10 tools shipped

get_current_time@v1

获取当前时间

category="time" · args: (无参数)

get_weather@v1

查询指定城市的天气信息

category="weather" · args: city

calculate@v1

执行数学计算

category="math" · args: expression

web_search@v1

在互联网上搜索信息

category="search" · args: query, max_results?

translate_text@v1

翻译文本到目标语言

category="translation" · args: text, target_language

send_email@v1

发送电子邮件

category="communication" · args: to, subject, body

get_stock_price@v1

获取股票实时价格

category="finance" · args: symbol

create_reminder@v1

创建提醒事项

category="productivity" · args: title, time

get_news@v1

获取最新新闻

category="news" · args: category, country?

convert_currency@v1

货币汇率转换

category="finance" · args: amount, from_currency, to_currency

The 3 samples above sample subsets of these 10 tools (the project also supports tool_count_min/max range mode to pick 2-5 tools per sample at random). Adding tools = edit this JSON, no code change.

backend/core/task_generator.py · TASK_TEMPLATES · 8 categories · 76 templates total

天气查询12 templates

e.g. {city}今天天气怎么样？

时间查询10 templates

e.g. 现在几点了？

计算10 templates

e.g. 帮我计算{expr}

搜索12 templates

e.g. 帮我搜索关于{query}的信息

翻译10 templates

e.g. 请把'{text}'翻译成{target_lang}

货币转换7 templates

e.g. 把{amount}{from_currency}转换成{to_currency}

新闻获取7 templates

e.g. 给我看看{category}类的新闻

通用8 templates

e.g. 请使用合适的工具帮我完成这个任务

Single vs multi-turn split by multi_ratio (default 0.3). Multi-turn joins via 7 connectors: 然后 / 接着 / 同时 / 另外 / 还有 / 并且 / 以及.

PARAMS pools · diversity behind the 76 templates

cities

北京 / 上海 / 广州 / 深圳 / 杭州 …

×20

expressions

1+1 / 25*4 / sqrt(144) / 2^10 …

×15

search_queries

人工智能 / 机器学习 / 量子计算 …

×18

texts

你好 / 谢谢 / 早上好 …

×13

target_langs

英语 / 日语 / 法语 …

×8

currencies_from × to

美元↔人民币 / 欧元↔英镑 …

×25

amounts

100 / 500 / 1000 / 5000 …

×7

news_categories

科技 / 体育 / 财经 / 娱乐 …

×8

76 templates × multiple param slots → realistic 数千 unique user queries even before the multi-turn joiner kicks in.

Backend 6-module pipeline · FastAPI + asyncio.Semaphore(concurrency=10)

stage 1

task_generator.py

TaskGenerator.generate_tasks() — 抽 task 模板、随机绑定 toolset、产出 Task(user_query, tools, system_prompt)。

stage 2

data_synthesizer.py · chosen

DataSynthesizer._generate_chosen(task) — 单轮 vs 多轮分支：多轮调 generate_multi_turn_dialogue(), 写回 task._multi_turn_context。

stage 3

data_synthesizer.py · smart_rejected

synthesize_sample_with_smart_rejected() — 5 步：并发跑 chosen+rejected → LLM 自评 quality_score + similarity_score → 策略 1 (质量<5 且能修正→ 拿 corrected_chosen 当新 chosen) → 策略 2 (相似度>80% → 用 temperature=1.2 重生成更差的 rejected) → 收尾。

stage 4

validator.py

Validator.validate_sample() — 必填字段齐 · chosen ≠ rejected · function_call JSON 解析通过 · 可选 LLM 自评打分。

stage 5

concurrent_engine.py

ConcurrentEngine.process_tasks() — asyncio.Semaphore + ProgressStats(progress_percent / generation_rate / validation_success_rate) 推 WebSocket; 指数退避重试。

stage 6

exporter.py

Exporter.export_to_jsonl() — data_dpo.jsonl + dataset_info.json + generation_stats.json + invalid_samples.jsonl。

smart_rejected 策略 · 5 steps · data_synthesizer.py:89-200

step 1

并发生成

asyncio.gather(_generate_chosen, _generate_rejected) — chosen 和 rejected 并发跑，省一轮 LLM 等待。

step 2

构造临时样本

把 task + chosen + rejected 装成临时 sample 字典，丢给 LLM 自评。

step 3

LLM 自评

llm_client.validate_and_correct(sample) → quality_score (0-10) + similarity_score (0-100) + 可选 corrected_chosen。

step 4

策略 1 · 修正

如果 quality_score < 5.0 且 corrected_chosen 存在 → 用 corrected_chosen 当新 chosen，原 rejected 保留为真实错误案例。

step 5

策略 2 · 重生成

如果 similarity_score > 80% → 用 temperature=1.2 重新生成更差的 rejected (避免「假对比」对 DPO 没用)。

5 步走完后样本字段：{task_id, task_type, system, tools, messages, chosen, rejected, quality_score, similarity_score}

validate_and_correct LLM prompt · 4 axes (services/llm_client.py:269)

1. Chosen 回复质量 — 是否正确调用了工具，参数是否准确

2. Rejected 回复质量 — 是否确实比 chosen 更差

3. 两者差异度 — 差异是否明显，是否具有学习价值

4. 格式规范性 — 是否符合 function_call 格式要求

返回 JSON 含字段：{is_valid, quality_score, similarity_score, issues[], corrected_chosen?, corrected_rejected?}; quality 9-10 极好 / 5-6 一般 / <5 差 ; similarity <50% 优秀 / >80% rejected 不够差。

JSON parse 兜底

DeepSeek 偶尔包 ```json … ``` 输出。客户端先 strip 三种围栏，再 json.loads()；解析失败时回退到 {quality_score: 7.0, similarity_score: 50.0, issues: ["LLM返回格式错误"]}，不让单次 LLM 抖动掐死整批生成。

LLM provider

DeepSeek API · deepseek-chat

OpenAI-compatible client, swappable to GPT-4 / 本地模型不改业务代码

Concurrency control

asyncio.Semaphore(10) + exponential backoff

退避：2/4/8s（普通）→ 3/9/27s（超时）, MAX_RETRIES=15

Training target

LLaMA-Factory · DPO · Qwen 系列

Dataset 注册：dataset_info.json columns ↔ JSONL keys 严格对齐

建议体验

依次点 3 个样本 — 看每条样本里 chosen vs rejected 的真实文本以及「为什么 rejected」的归因。

注意每个工具名后的 @v1 后缀 — 这是故意的，让 schema 演化后旧轨迹仍然可解析。

沿着底部条带跟 6 阶段后端流水线（TaskGenerator → DataSynthesizer → 智能 rejected → LLM 自评 → Validator → Exporter）。

这个试玩能说明什么

你能交付现代 DPO 真正需要的数据基础设施 — 不是空谈「偏好学习」。

你刻意设计 rejected 样本（错工具 / 空参 / 跳过工具 / 太自信的 <final>）— 一份真实的 DPO 失败模式分类，不是抽象打分。

你顾及到生产实践细节：并发控制（Semaphore）、重试（指数退避）、给 LLaMA-Factory 喂数据时 JSONL 的严格要求。

项目名

AutoToolDPO · FastAPI + React + DeepSeek API

样本形态

{system, tools, messages, chosen, rejected} JSONL · 带版本号的工具名 @v1

训练目标

LLaMA-Factory DPO trainer · Qwen-7B 系列

返回案例页