AutoToolDPO · real chosen/rejected pairs

RL-Tuned Function-Calling Agent Pipeline

A function-calling agent pipeline for preference data generation and evaluation, designed to improve tool selection and argument quality.

AutoToolDPO is the actual project name. The samples below are taken verbatim from its generated dataset — versioned tool names (get_stock_price@v1, web_search@v1, ...), <function_call> / <final> tags, and the chosen/rejected shape LLaMA-Factory consumes for DPO.

DPOFunction CallingEvaluationFastAPIAgents

Case Study Source Code

Why this local version exists

No live LLM call. The samples and 6-stage pipeline visualization come straight from cells 47, 68, and §6 of 案例9 notebook 企业级Agent Function-Calling RL微调.ipynb. The Concurrent Engine concurrency (asyncio.Semaphore=10) and DeepSeek API as the underlying LLM are project facts.

AutoToolDPO · faithful walkthrough

DPO data for function-calling agents · chosen vs rejected

The samples below are taken verbatim from the AutoToolDPO project's generated dataset (notebook cells 47, 68). Each sample carries the tool registry, full conversation, and a chosen / rejected pair — the exact JSONL shape consumed by LLaMA-Factory's DPO trainer.

Sample dataset rows · click to switch

Tool registry · this sample

get_stock_price@v1

获取股票实时价格 · args: symbol (e.g. AAPL, TSLA)

web_search@v1

在互联网上搜索信息 · args: query, max_results?

send_email@v1

发送电子邮件 · args: to, subject, body

The @v1 suffix is intentional — the project uses versioned tool names so older agent traces stay parseable when schemas evolve.

messages[] · conversation prefix

user

请使用获取股票实时价格工具

chosen (positive sample)

DPO chosen

我需要知道您想查询哪只股票的实时价格。请提供股票代码，比如 AAPL（苹果公司）或 TSLA（特斯拉公司）。

<function_call>
{
  "name": "get_stock_price@v1",
  "arguments": { "symbol": "请提供股票代码" }
}
</function_call>

rejected (negative sample)

DPO rejected

<function_call>
{
  "name": "get_stock_price@v1",
  "arguments": { "symbol": "" }
}
</function_call>

Why this is rejected

空 symbol — 工具会失败，且没有向用户索取必填参数。

backend/configs/tools_registry.json · 10 tools shipped

get_current_time@v1

获取当前时间

category="time" · args: (无参数)

get_weather@v1

查询指定城市的天气信息

category="weather" · args: city

calculate@v1

执行数学计算

category="math" · args: expression

web_search@v1

在互联网上搜索信息

category="search" · args: query, max_results?

translate_text@v1

翻译文本到目标语言

category="translation" · args: text, target_language

send_email@v1

发送电子邮件

category="communication" · args: to, subject, body

get_stock_price@v1

获取股票实时价格

category="finance" · args: symbol

create_reminder@v1

创建提醒事项

category="productivity" · args: title, time

get_news@v1

获取最新新闻

category="news" · args: category, country?

convert_currency@v1

货币汇率转换

category="finance" · args: amount, from_currency, to_currency

The 3 samples above sample subsets of these 10 tools (the project also supports tool_count_min/max range mode to pick 2-5 tools per sample at random). Adding tools = edit this JSON, no code change.

backend/core/task_generator.py · TASK_TEMPLATES · 8 categories · 76 templates total

天气查询12 templates

e.g. {city}今天天气怎么样？

时间查询10 templates

e.g. 现在几点了？

计算10 templates

e.g. 帮我计算{expr}

搜索12 templates

e.g. 帮我搜索关于{query}的信息

翻译10 templates

e.g. 请把'{text}'翻译成{target_lang}

货币转换7 templates

e.g. 把{amount}{from_currency}转换成{to_currency}

新闻获取7 templates

e.g. 给我看看{category}类的新闻

通用8 templates

e.g. 请使用合适的工具帮我完成这个任务

Single vs multi-turn split by multi_ratio (default 0.3). Multi-turn joins via 7 connectors: 然后 / 接着 / 同时 / 另外 / 还有 / 并且 / 以及.

PARAMS pools · diversity behind the 76 templates

cities

北京 / 上海 / 广州 / 深圳 / 杭州 …

×20

expressions

1+1 / 25*4 / sqrt(144) / 2^10 …

×15

search_queries

人工智能 / 机器学习 / 量子计算 …

×18

texts

你好 / 谢谢 / 早上好 …

×13

target_langs

英语 / 日语 / 法语 …

×8

currencies_from × to

美元↔人民币 / 欧元↔英镑 …

×25

amounts

100 / 500 / 1000 / 5000 …

×7

news_categories

科技 / 体育 / 财经 / 娱乐 …

×8

76 templates × multiple param slots → realistic 数千 unique user queries even before the multi-turn joiner kicks in.

Backend 6-module pipeline · FastAPI + asyncio.Semaphore(concurrency=10)

stage 1

task_generator.py

TaskGenerator.generate_tasks() — 抽 task 模板、随机绑定 toolset、产出 Task(user_query, tools, system_prompt)。

stage 2

data_synthesizer.py · chosen

DataSynthesizer._generate_chosen(task) — 单轮 vs 多轮分支：多轮调 generate_multi_turn_dialogue(), 写回 task._multi_turn_context。

stage 3

data_synthesizer.py · smart_rejected

synthesize_sample_with_smart_rejected() — 5 步：并发跑 chosen+rejected → LLM 自评 quality_score + similarity_score → 策略 1 (质量<5 且能修正→ 拿 corrected_chosen 当新 chosen) → 策略 2 (相似度>80% → 用 temperature=1.2 重生成更差的 rejected) → 收尾。

stage 4

validator.py

Validator.validate_sample() — 必填字段齐 · chosen ≠ rejected · function_call JSON 解析通过 · 可选 LLM 自评打分。

stage 5

concurrent_engine.py

ConcurrentEngine.process_tasks() — asyncio.Semaphore + ProgressStats(progress_percent / generation_rate / validation_success_rate) 推 WebSocket; 指数退避重试。

stage 6

exporter.py

Exporter.export_to_jsonl() — data_dpo.jsonl + dataset_info.json + generation_stats.json + invalid_samples.jsonl。

smart_rejected 策略 · 5 steps · data_synthesizer.py:89-200

step 1

并发生成

asyncio.gather(_generate_chosen, _generate_rejected) — chosen 和 rejected 并发跑，省一轮 LLM 等待。

step 2

构造临时样本

把 task + chosen + rejected 装成临时 sample 字典，丢给 LLM 自评。

step 3

LLM 自评

llm_client.validate_and_correct(sample) → quality_score (0-10) + similarity_score (0-100) + 可选 corrected_chosen。

step 4

策略 1 · 修正

如果 quality_score < 5.0 且 corrected_chosen 存在 → 用 corrected_chosen 当新 chosen，原 rejected 保留为真实错误案例。

step 5

策略 2 · 重生成

如果 similarity_score > 80% → 用 temperature=1.2 重新生成更差的 rejected (避免「假对比」对 DPO 没用)。

5 步走完后样本字段：{task_id, task_type, system, tools, messages, chosen, rejected, quality_score, similarity_score}

validate_and_correct LLM prompt · 4 axes (services/llm_client.py:269)

1. Chosen 回复质量 — 是否正确调用了工具，参数是否准确

2. Rejected 回复质量 — 是否确实比 chosen 更差

3. 两者差异度 — 差异是否明显，是否具有学习价值

4. 格式规范性 — 是否符合 function_call 格式要求

返回 JSON 含字段：{is_valid, quality_score, similarity_score, issues[], corrected_chosen?, corrected_rejected?}; quality 9-10 极好 / 5-6 一般 / <5 差 ; similarity <50% 优秀 / >80% rejected 不够差。

JSON parse 兜底

DeepSeek 偶尔包 ```json … ``` 输出。客户端先 strip 三种围栏，再 json.loads()；解析失败时回退到 {quality_score: 7.0, similarity_score: 50.0, issues: ["LLM返回格式错误"]}，不让单次 LLM 抖动掐死整批生成。

LLM provider

DeepSeek API · deepseek-chat

OpenAI-compatible client, swappable to GPT-4 / 本地模型不改业务代码

Concurrency control

asyncio.Semaphore(10) + exponential backoff

退避：2/4/8s（普通）→ 3/9/27s（超时）, MAX_RETRIES=15

Training target

LLaMA-Factory · DPO · Qwen 系列

Dataset 注册：dataset_info.json columns ↔ JSONL keys 严格对齐

What to try

Click through the 3 samples — read the exact chosen vs rejected text and the "why this is rejected" rationale.

Spot the @v1 version suffix on every tool name — it is on purpose, so older traces stay parseable when schemas evolve.

Trace the 6-stage backend pipeline (TaskGenerator → DataSynthesizer → smart rejected → LLM self-validate → Validator → Exporter) along the bottom strip.

What this demo proves

You can ship the data infrastructure modern DPO actually needs — not just hand-wave about "preference learning".

You design rejected samples on purpose (wrong tool / empty args / skipped tool / over-confident <final>) — a real DPO failure-mode taxonomy, not abstract scoring.

You handle production concerns: concurrency (Semaphore), retries (exponential backoff), JSONL strictness for LLaMA-Factory ingestion.

Project name

AutoToolDPO · FastAPI + React + DeepSeek API

Sample shape

{system, tools, messages, chosen, rejected} JSONL · versioned tool names @v1

Training target

LLaMA-Factory DPO trainer on Qwen-7B family

Back to case study