Back to projects
Multimodal Fine-Tuning for Chinese Chart VQA
Case Study

Multimodal Fine-Tuning for Chinese Chart VQA

Fine-tuning a general VLM into a Chinese chart VQA specialist using LlamaFactory and a curated zh-train chart dataset. The data-generation tool is a companion React + FastAPI project; the whole pipeline is reusable.

MultimodalLlamaFactoryQwen-VLChart VQAFine-tuning

Generic VLMs read English screenshots fine, but choke on a 中文 chart with 万/亿/同比 labels and mixed Chinese/English axes. This project fine-tunes one to actually read those, on top of LlamaFactory. The dataset construction is itself a small React + FastAPI product, mirroring case 7's data_create design.

Two halves

case6/
├── data_create/             # React + Vite UI + FastAPI backend that GENERATES the training data
└── llamafactory_train.jsonl + llamafactory_val.jsonl  # the actual dataset used to fine-tune

The dataset format is LlamaFactory-native: each row is a multi-turn conversation with an image reference, all bundled into JSONL ready for the trainer.

Why not just SFT GPT-4o

Three reasons:

  1. Vocabulary specificity — 中文 chart labels (营业收入 / 同比增长 / 占比) need to land in the model's vocabulary as single tokens, not be fragmented. A small fine-tune fixes this fast.
  2. Cost at scale — once it works, you can run inference on your own GPU; no per-call API spend.
  3. Domain familiarity — your charts have a house style (color palette, font, axis convention). The fine-tune learns that style and stops asking "is this a bar or a column?"

Dataset construction (data_create)

Mirrors case 7's NL2SQL data_create design, but for VLM:

  • pick a chart template (柱状 / 折线 / 饼图 / 堆叠柱)
  • generate synthetic data with realistic Chinese labels (industries / regions / time periods)
  • render the chart (matplotlib / plotly) and save as PNG
  • LLM generates 5-10 Q&A pairs per chart ("营收最高的是哪个?" "同比增长率多少?")
  • export as LlamaFactory JSONL

Training (LlamaFactory)

llamafactory-cli train \
    --stage sft \
    --model_name_or_path /home/ubuntu/Qwen2.5-VL-7B-Instruct \
    --finetuning_type lora \
    --template qwen2_vl \
    --dataset_dir data \
    --dataset chart_vqa_train \
    --cutoff_len 4096 \
    --image_resolution 448 \
    --num_train_epochs 3.0 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --lora_rank 16 \
    --lora_alpha 32 \
    --bf16 True

What this signals

  • You can run vertical-domain multimodal fine-tuning end to end — not just call the API
  • You understand the data generation half matters as much as the trainer — and you can build it
  • You stay in the LlamaFactory ecosystem so it composes with the rest of the lineup (NL2SQL / function-calling / Qwen-VL RL)
Demo strategy

What the demo replays

The interactive demo takes the same Chinese chart + the same question and compares a general VLM with the fine-tuned model: the base model misreads Chinese labels, the fine-tuned one returns the exact label + number. The fine-tuned answers are the real assistant targets from llamafactory_train.jsonl, and the training command is the real LlamaFactory setup — no model runs in the browser.

Public preview can be enabled later without redesigning the case-study layout