Before/after comparison

Multimodal Fine-Tuning for Chinese Chart VQA

Fine-tuning a general VLM into a Chinese chart-VQA specialist using LlamaFactory and a zh-train chart dataset. The data-generation tool is a companion React + FastAPI project.

Same Chinese chart, same question — a general VLM vs the LoRA fine-tuned model. The base model misreads Chinese labels; the fine-tuned model returns the exact label + number from the training target.

MultimodalLlamaFactoryQwen-VLChart VQAFine-tuning

Case Study Source Code

Why this local version exists

The fine-tuned answers are the real assistant targets from llamafactory_train.jsonl; the training command is the real LlamaFactory setup. No model runs in the browser — this isolates the strongest signal: what fine-tuning fixes.

Interactive Preview

Before vs after fine-tuning: Chinese chart VQA

Same Chinese chart, same question — a general VLM vs a vertical model fine-tuned with LlamaFactory, whose target answers come from the real llamafactory_train.jsonl.

Input chart (Chinese labels)

2024 revenue by industry (¥100M)

156

Tech

134

Finance

112

Mfg.

Consumer

Pick a question

LlamaFactory training setup

--model_name_or_path Qwen2.5-VL-7B-Instruct
--finetuning_type lora  --template qwen2_vl
--dataset chart_vqa_train  --image_resolution 448
--cutoff_len 4096  --lora_rank 16  --lora_alpha 32

Current question

Which industry has the highest revenue, and its YoY growth?

General VLM (base)

Click "Run both models" to see the answer.

Fine-tuned vertical model (LoRA)

Targets come from the assistant content in llamafactory_train.jsonl.

What to try

Switch between the three sample questions and re-run both models.

Compare the base model (vague, misreads Chinese labels) with the fine-tuned model (exact label + number).

Read the LlamaFactory setup — qwen2_vl template, 448 image resolution, LoRA rank 16 / alpha 32.

What this demo proves

You can run an end-to-end vertical multimodal fine-tune, not just call an API.

You understand data construction is as important as training — the data-gen tool is its own React + FastAPI project.

You stay inside the LlamaFactory ecosystem, so this composes with the NL2SQL / function-calling / Qwen-VL RL projects.

Base model

Qwen2.5-VL-7B-Instruct · LoRA (rank 16, alpha 32) · template qwen2_vl

Data

llamafactory_train.jsonl — synthetic Chinese charts → 5–10 Q&A per image

Best signal

A targeted fine-tune that puts domain labels into the vocabulary

Back to case study