Multimodal Fine-Tuning for Chinese Chart VQA
Fine-tuning a general VLM into a Chinese chart-VQA specialist using LlamaFactory and a zh-train chart dataset. The data-generation tool is a companion React + FastAPI project.
Same Chinese chart, same question — a general VLM vs the LoRA fine-tuned model. The base model misreads Chinese labels; the fine-tuned model returns the exact label + number from the training target.
Why this local version exists
The fine-tuned answers are the real assistant targets from llamafactory_train.jsonl; the training command is the real LlamaFactory setup. No model runs in the browser — this isolates the strongest signal: what fine-tuning fixes.
Before vs after fine-tuning: Chinese chart VQA
Same Chinese chart, same question — a general VLM vs a vertical model fine-tuned with LlamaFactory, whose target answers come from the real llamafactory_train.jsonl.
Input chart (Chinese labels)
2024 revenue by industry (¥100M)
Pick a question
LlamaFactory training setup
--model_name_or_path Qwen2.5-VL-7B-Instruct --finetuning_type lora --template qwen2_vl --dataset chart_vqa_train --image_resolution 448 --cutoff_len 4096 --lora_rank 16 --lora_alpha 32
Current question
Which industry has the highest revenue, and its YoY growth?
General VLM (base)
Click "Run both models" to see the answer.
Fine-tuned vertical model (LoRA)
Targets come from the assistant content in llamafactory_train.jsonl.
What to try
Switch between the three sample questions and re-run both models.
Compare the base model (vague, misreads Chinese labels) with the fine-tuned model (exact label + number).
Read the LlamaFactory setup — qwen2_vl template, 448 image resolution, LoRA rank 16 / alpha 32.
What this demo proves
You can run an end-to-end vertical multimodal fine-tune, not just call an API.
You understand data construction is as important as training — the data-gen tool is its own React + FastAPI project.
You stay inside the LlamaFactory ecosystem, so this composes with the NL2SQL / function-calling / Qwen-VL RL projects.
Base model
Qwen2.5-VL-7B-Instruct · LoRA (rank 16, alpha 32) · template qwen2_vl
Data
llamafactory_train.jsonl — synthetic Chinese charts → 5–10 Q&A per image
Best signal
A targeted fine-tune that puts domain labels into the vocabulary