convfinqa-qwen3.5-4b

Qwen/Qwen3.5-4B fine-tuned on the ConvFinQA train split for conversational, multi-turn numerical questions over single-page financial documents (a 10-K page consisting of pre-text, a table, and post-text). LoRA-merged into the base weights, so this is a standalone HuggingFace model directory: load with transformers or vLLM directly, no PEFT needed.

Trained on Tinker via tinker-cookbook.

A standalone LoRA-adapter version is at sharick008/convfinqa-qwen3.5-4b-lora (~290 MB) if you would rather mount the adapter on top of the base model yourself.

Try it without setting anything up: sharick008/convfinqa-agent is a Gradio Space that lets you replay any of the 421 dev records turn-by-turn or chat freely against any record's document. Runs on Hugging Face ZeroGPU.

Reproducibility update (2026-05-07, post-submission): the original Hub README named the two tools but did not show the system prompt, the tool-call grammar, or a runnable loop. The Usage section below now inlines the system prompt, the tool specs, the tool-call parser, and a reference Python loop that runs end-to-end against transformers alone. The loop was verified end-to-end on Hugging Face Jobs; that verification surfaced and fixed two issues: this repo was missing a generation_config.json (now added), and the loop's model.generate() needs an explicit eos_token_id set to <|im_end|> so generation stops at the end of one assistant turn. With both in place, the worked example below produces 218.6 then 351.0 as documented.

Result

Execution accuracy on the 421-record ConvFinQA dev split, graded turn-by-turn against the dataset's gold executed answers:

| Metric | Value |
|---|---|
| Records | 421 / 421 (zero failures) |
| Turns graded | 1,490 |
| Turns correct | 1,227 |
| Execution accuracy | 82.35% |

For reference, the same agent harness on the base Qwen/Qwen3.5-4B (no fine-tune, with the calculator and submit_answer tool spec injected via the renderer's tools-prefix helper) reaches 69.56% on the same dev split: 1,026 correct out of 1,475 graded turns. The base model's denominator is 15 turns short of the fine-tune's 1,490 because the bare base model fails to emit a parseable submit_answer on 15 turns, a failure the fine-tune eliminates. The fine-tune adds 12.79 points through identical inference plumbing.
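The headline figures can be re-derived from the turn counts quoted above. The check below is only arithmetic on those counts. Its last value is an extra illustration, not a figure reported here: it shows what the base-model score would be if its 15 unparseable turns were counted as wrong instead of being left out of the denominator.

```python
# Turn counts quoted in this section.
FT_CORRECT, FT_GRADED = 1227, 1490      # fine-tune
BASE_CORRECT, BASE_GRADED = 1026, 1475  # base model, parseable turns only


def pct(correct: int, total: int) -> float:
    """Accuracy as a percentage, rounded to two decimals."""
    return round(100 * correct / total, 2)


ft_acc = pct(FT_CORRECT, FT_GRADED)
base_acc = pct(BASE_CORRECT, BASE_GRADED)
gain = round(ft_acc - base_acc, 2)
# Illustration only: base accuracy if the 15 unparseable turns count as wrong.
base_strict = pct(BASE_CORRECT, FT_GRADED)

print(ft_acc, base_acc, gain, base_strict)
# 82.35 69.56 12.79 68.86
```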

Training config

| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3.5-4B |
| Method | LoRA SFT (Tinker, qwen3_5 renderer) |
| LoRA rank | 32 |
| Learning rate | 3e-4, cosine decay |
| Batch size | 64 |
| Epochs | 3 |
| Optimisation steps | 837 |
| Max sequence length | 16,384 |
| Train rows | 18,078 per-target rows from 3,002 ConvFinQA train conversations |
| Held-out test | 200 rows |
| Final train mean NLL | 0.0002 |
| Final test mean NLL | 0.010 |
| Train-on-what | Last assistant message per row |
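The step count is consistent with the other rows of the table under two assumptions that the source does not state: that the 200 held-out rows are taken out of the 18,078 per-target rows, and that a trailing partial batch is dropped each epoch. A quick check under those assumptions:

```python
TOTAL_ROWS = 18_078  # per-target rows
HELD_OUT = 200       # assumed to be carved out of TOTAL_ROWS
BATCH_SIZE = 64
EPOCHS = 3

train_rows = TOTAL_ROWS - HELD_OUT
# Assumption: the last, incomplete batch of each epoch is dropped.
steps_per_epoch = train_rows // BATCH_SIZE
total_steps = steps_per_epoch * EPOCHS

print(train_rows, steps_per_epoch, total_steps)
# 17878 279 837
```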

Each training row is a prefix of one full multi-turn dialogue, ending at one assistant target message. Each question turn contributes two rows, one per assistant message (the calculate tool call and the submit_answer tool call), so an N-turn dialogue produces 2N rows.
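A minimal sketch of that row expansion, assuming dialogues are stored as plain role/content message lists. The function name per_target_rows and the abbreviated message contents are made up for illustration; this is not the tinker-cookbook data pipeline.

```python
def per_target_rows(messages: list[dict]) -> list[list[dict]]:
    """Return one row per assistant message: the dialogue prefix ending at it."""
    return [
        messages[: i + 1]
        for i, m in enumerate(messages)
        if m["role"] == "assistant"
    ]


def turn(question: str, expr: str, result: str, unit: str) -> list[dict]:
    """Build the four messages of one question turn (contents abbreviated)."""
    return [
        {"role": "user", "content": question},
        {"role": "assistant", "content": f"calculate({expr})"},
        {"role": "tool", "content": result},
        {"role": "assistant", "content": f"submit_answer({result}, {unit})"},
    ]


dialogue = (
    [{"role": "system", "content": "<system prompt + document>"}]
    + turn("total for 2015 and 2016?", "121.1 + 97.5", "218.6", "absolute")
    + turn("including 2014?", "218.6 + 132.4", "351.0", "absolute")
)

rows = per_target_rows(dialogue)
print(len(rows), [len(r) for r in rows])
# 4 [3, 5, 7, 9]
```

Every row ends on the assistant message that carries the loss, which is what the "last assistant message per row" setting relies on.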

The qwen3.5 renderer does not satisfy the extension property, so training with ALL_ASSISTANT_MESSAGES over a multi-turn dialogue puts loss on prefix tokens that do not match what build_generation_prompt would produce at that point. Splitting into per-target rows and switching to LAST_ASSISTANT_MESSAGE gives every loss-bearing sample the same on-policy prefix the model would actually see at inference.
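As a rough illustration of what the extension property means here: a renderer has it when the rendering of every dialogue prefix is itself a string prefix of the rendering of the longer dialogue. Both renderers below are toys invented for this sketch, not the tinker-cookbook qwen3_5 renderer, and the `<think></think>` marker is only a stand-in for any formatting applied to the final assistant message but not to earlier ones.

```python
Message = dict[str, str]


def plain_render(messages: list[Message]) -> str:
    """Toy renderer: every message is rendered the same way."""
    return "".join(f"<{m['role']}>{m['content']}</{m['role']}>" for m in messages)


def last_special_render(messages: list[Message]) -> str:
    """Toy renderer: only the final assistant message gets an extra marker."""
    last = max(
        (i for i, m in enumerate(messages) if m["role"] == "assistant"),
        default=-1,
    )
    parts = []
    for i, m in enumerate(messages):
        body = ("<think></think>" if i == last else "") + m["content"]
        parts.append(f"<{m['role']}>{body}</{m['role']}>")
    return "".join(parts)


def has_extension_property(render, messages: list[Message]) -> bool:
    """True if every prefix ending at an assistant message renders to a
    string prefix of the full rendering."""
    full = render(messages)
    return all(
        full.startswith(render(messages[: i + 1]))
        for i, m in enumerate(messages)
        if m["role"] == "assistant"
    )


dialogue = [
    {"role": "user", "content": "q1"},
    {"role": "assistant", "content": "a1"},
    {"role": "user", "content": "q2"},
    {"role": "assistant", "content": "a2"},
]
print(has_extension_property(plain_render, dialogue))         # True
print(has_extension_property(last_special_render, dialogue))  # False
```

With the second renderer, the tokens for "a1" inside the full dialogue differ from what the model saw when "a1" was the message being generated, which is the mismatch that per-target rows avoid.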

Assistant tool calls are synthesised from the dataset's gold turn_program. Multi-step DSL programs like subtract(206588, 181001), divide(#0, 181001) fold into a single Python expression ((206588 - 181001) / 181001) emitted as one calculate call, followed by a submit_answer call carrying the gold value and an inferred unit.
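The folding step can be sketched as follows. This is an illustrative reconstruction, not the code used to build the SFT data: the function name fold_program is hypothetical, the op-name table beyond subtract and divide is an assumption, and the sketch handles only numeric literals and `#k` back-references to earlier steps.

```python
import re

# Assumed mapping from DSL op names to Python operators.
_OPS = {"add": "+", "subtract": "-", "multiply": "*", "divide": "/", "exp": "**"}
_STEP_RE = re.compile(r"(\w+)\(([^()]*)\)")


def fold_program(program: str) -> str:
    """Fold a multi-step DSL program into one parenthesised Python expression."""
    steps: list[str] = []
    for op, raw_args in _STEP_RE.findall(program):
        if op not in _OPS:
            raise ValueError(f"unsupported op: {op}")
        args = [a.strip() for a in raw_args.split(",")]
        if len(args) != 2:
            raise ValueError(f"expected two arguments, got {args}")
        # "#k" refers to the result of step k; splice in that step's expression.
        resolved = [steps[int(a[1:])] if a.startswith("#") else a for a in args]
        steps.append(f"({resolved[0]} {_OPS[op]} {resolved[1]})")
    if not steps:
        raise ValueError("empty program")
    # Assumption: the last step is the answer to the turn.
    return steps[-1]


print(fold_program("subtract(206588, 181001), divide(#0, 181001)"))
# ((206588 - 181001) / 181001)
```

An op outside the table, such as greater, raises ValueError in this sketch, which mirrors the decision to skip the boolean records at SFT-build time.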

Usage

This is not a vanilla chat model. It expects an agentic loop with two tools, calculate and submit_answer, plus a specific system prompt. Everything below is self-contained: with transformers alone you can reproduce a turn end-to-end without any additional repository.

System prompt

The model was trained against the prompt below. Append the record-specific document at the bottom and use the result as the role: system message. Verbatim text matters: the worked examples are how the model learnt to pick the right unit.

You are a financial analyst reading one page from a 10-K annual report filed
with the U.S. Securities and Exchange Commission. Your job is to answer
numerical questions about this page.

## Conversation format

Each turn asks one question. Later turns in the same conversation may
reference earlier answers (for example "the difference", "that amount").
Reuse numbers you have already derived.

## Tools

- `calculate(expression)`: evaluate an arithmetic expression. Call this
  for every +, -, *, /, ** or percentage step. Copy the returned number
  through verbatim; it is already at the precision the grader expects.
- `submit_answer(value, unit)`: submit the final answer. Call this once
  you have the answer for the current question. The conversation does
  not advance until you call `submit_answer`.

## Table units and scale

Tables often state their scale in headers like
"(in millions, except per share data)" or "(amounts in thousands)".
When the question asks for a value that appears in such a table,
return the cell value AS WRITTEN. Do NOT multiply it out into raw
units. Gold answers in this dataset are in the table's stated scale.

Example: if the table caption says "(in millions)" and the cell
shows 29,500, the answer is 29500 (unit `absolute`), not
29,500,000,000.

If a calculation combines two values from a "(in millions)" table,
the magnitude is preserved: both inputs are in millions and so is the
result.

## Answer units

When you call `submit_answer`, choose the `unit` that matches your
`value`:

- `fraction`: a decimal ratio, e.g. 0.14136 for 14.136%.
- `percent`: a percentage value, e.g. 14.136 for 14.136%.
- `absolute`: a raw count or currency amount with no unit symbol,
  e.g. 206588.
- `count`: a whole-number count, e.g. 4.
- `yes_no`: the string 'yes' or 'no'.

## Worked examples

Percentage change, answered as a fraction.
Question: net cash was 206588 in 2009 and 181001 in 2008; what is the
percentage change? Call `calculate("(206588 - 181001) / 181001")`
which returns 0.14136. Then call
`submit_answer(value=0.14136, unit="fraction")`.

Ratio, answered as a percent.
Question: what amortisation rate does an 8-year useful life represent?
Call `calculate("100 / 8")` which returns 12.5. Then call
`submit_answer(value=12.5, unit="percent")`.

Raw currency amount, no arithmetic needed.
Question: what long-term debt matures in 2017? Read the cell directly
from the table and call `submit_answer(value=307403, unit="absolute")`.

Scaled-table value, no arithmetic.
Question: what was Net cash from financing activities in 2014?
Table caption says "(in millions)" and the cell reads 29,500. Call
`submit_answer(value=29500, unit="absolute")`. Do NOT submit
29500000000 or 29.5 billion. Keep the value in the table's stated
scale.

Boolean.
Question: is net income higher in 2009 than in 2008? Call
`calculate("103102 - 104222")` which returns -1120. Then call
`submit_answer(value="no", unit="yes_no")`.

Unit mismatch.
For the percentage-change question above, do not submit value=14.136
with unit="absolute". That would be flagged as a unit mismatch.
Either submit 0.14136 as `fraction`, or 14.136 as `percent`.

## Document

The ## Document heading is followed by the record-specific document, rendered as XML wrapping a markdown table:

def render_document(pre_text: str, table: dict, post_text: str) -> str:
    """Render a ConvFinQA document as XML wrapping a markdown table.

    `table` is a column-keyed nested dict: {col_header: {row_header: cell}}.
    """
    cols = list(table)
    rows = list(dict.fromkeys(r for col in table.values() for r in col))
    md = ["| | " + " | ".join(cols) + " |", "|---" * (len(cols) + 1) + "|"]
    for row in rows:
        cells = [str(table[c].get(row, "")) for c in cols]
        md.append(f"| {row} | " + " | ".join(cells) + " |")
    return (
        "<document>\n"
        f"<pre_text>\n{pre_text}\n</pre_text>\n"
        "<table>\n" + "\n".join(md) + "\n</table>\n"
        f"<post_text>\n{post_text}\n</post_text>\n"
        "</document>"
    )

Tool specs

Two functions, no others. The model emits one calculate then one submit_answer per turn in the typical case, but the loop should allow multiple calculate calls before the terminal submit_answer.

calculate(expression) — evaluate a Python arithmetic expression. Use for any +, -, *, /, ** or percentage step. The returned number is rounded to 5 decimals.

| Parameter | Type | Description |
|---|---|---|
| expression | string | Arithmetic over numeric literals using + - * / ** and parentheses, e.g. "(206588 - 181001) / 181001". No variables or function calls. |

Returns: the rounded result, passed back to the model as a string.

submit_answer(value, unit) — submit the final answer for the current question and terminate the turn.

| Parameter | Type | Description |
|---|---|---|
| value | number or string | Number for numeric questions; 'yes' or 'no' when unit is yes_no. |
| unit | string enum | One of fraction, percent, absolute, count, yes_no. |
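Because parsed tool-call arguments arrive as strings, a caller may want to check them against this spec before grading. The helper below is a sketch of one way to do that; validate_submission is a hypothetical name, and the checks it applies (integer-valued count, quote-stripping for yes/no) are assumptions rather than the behaviour of the evaluation harness.

```python
UNITS = {"fraction", "percent", "absolute", "count", "yes_no"}


def validate_submission(arguments: dict) -> tuple:
    """Check parsed submit_answer arguments and return (typed_value, unit)."""
    unit = str(arguments.get("unit", "")).strip()
    raw = str(arguments.get("value", "")).strip()
    if unit not in UNITS:
        raise ValueError(f"unknown unit: {unit!r}")
    if unit == "yes_no":
        # Tolerate quoted or capitalised answers such as 'Yes'.
        answer = raw.strip("'\"").lower()
        if answer not in {"yes", "no"}:
            raise ValueError(f"expected 'yes' or 'no', got {raw!r}")
        return answer, unit
    # float() raises ValueError on non-numeric text.
    value = float(raw.replace(",", ""))
    if unit == "count" and not value.is_integer():
        raise ValueError(f"count must be a whole number, got {value}")
    return value, unit


print(validate_submission({"value": "218.6", "unit": "absolute"}))
# (218.6, 'absolute')
print(validate_submission({"value": "no", "unit": "yes_no"}))
# ('no', 'yes_no')
```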

Tool-call grammar

The model emits tool calls in Qwen3.5 XML, possibly multiple per assistant turn:

<tool_call>
<function=calculate>
<parameter=expression>
(206588 - 181001) / 181001
</parameter>
</function>
</tool_call>

Each parameter value is wrapped in newlines that are not part of the value itself, so trim it with .strip() after extraction. Two regexes are enough to parse the stream:

import re

TOOL_CALL_RE = re.compile(
    r"<tool_call>\s*<function=([^>]+)>\s*(.*?)\s*</function>\s*</tool_call>",
    re.DOTALL,
)
PARAM_RE = re.compile(
    r"<parameter=([^>]+)>\s*(.*?)\s*</parameter>",
    re.DOTALL,
)


def parse_tool_calls(text: str) -> list[dict]:
    out = []
    for m in TOOL_CALL_RE.finditer(text):
        params = {
            p.group(1).strip(): p.group(2).strip()
            for p in PARAM_RE.finditer(m.group(2))
        }
        out.append({"name": m.group(1).strip(), "arguments": params})
    return out
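Going the other way is useful for writing parser tests or synthesising traces. The sketch below serialises a call into the layout shown in the grammar example above; format_tool_call is a hypothetical helper, and the exact whitespace used in the training data is assumed to match that single example.

```python
def format_tool_call(name: str, arguments: dict) -> str:
    """Serialise one tool call in the XML layout of the example above."""
    lines = ["<tool_call>", f"<function={name}>"]
    for key, value in arguments.items():
        # Each parameter value sits on its own line between the tags.
        lines += [f"<parameter={key}>", str(value), "</parameter>"]
    lines += ["</function>", "</tool_call>"]
    return "\n".join(lines)


example = format_tool_call(
    "calculate", {"expression": "(206588 - 181001) / 181001"}
)
print(example)
```

Feeding the resulting string to the parser above should give back the same name and arguments.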

Calculator implementation

Safe and dependency-free. Walks the AST and refuses anything beyond arithmetic over numeric literals; never calls Python's eval():

import ast

_BIN = (ast.Add, ast.Sub, ast.Mult, ast.Div, ast.Pow)
_UN = (ast.UAdd, ast.USub)


def _eval_node(node):
    if isinstance(node, ast.Constant):
        if isinstance(node.value, bool) or not isinstance(node.value, (int, float)):
            raise ValueError(f"non-numeric literal: {node.value!r}")
        return float(node.value)
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, _UN):
        v = _eval_node(node.operand)
        return -v if isinstance(node.op, ast.USub) else +v
    if isinstance(node, ast.BinOp) and isinstance(node.op, _BIN):
        l, r = _eval_node(node.left), _eval_node(node.right)
        if isinstance(node.op, ast.Add):  return l + r
        if isinstance(node.op, ast.Sub):  return l - r
        if isinstance(node.op, ast.Mult): return l * r
        if isinstance(node.op, ast.Div):  return l / r
        return l ** r
    raise ValueError(f"disallowed construct: {type(node).__name__}")


def calculate(expression: str) -> float:
    """Evaluate an arithmetic expression; round to 5 decimals."""
    cleaned = str(expression).replace(",", "").strip()
    return round(float(_eval_node(ast.parse(cleaned, mode="eval").body)), 5)

Reference loop

End-to-end loop for one ConvFinQA dialogue, using only transformers plus the helpers above. Put everything in one module: assign the system prompt text above, ending at the ## Document heading, to a string named SYSTEM_PROMPT_PREFIX, and paste in render_document, parse_tool_calls, and calculate from the sections above:

from transformers import AutoModelForImageTextToText, AutoTokenizer

MODEL_ID = "sharick008/convfinqa-qwen3.5-4b"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, dtype="bfloat16", device_map="auto"
)
model.eval()


def run_turn(messages: list[dict], max_iterations: int = 6) -> dict:
    """Run one question turn until submit_answer or budget exhausts."""
    for _ in range(max_iterations):
        prompt = tokenizer.apply_chat_template(
            messages,
            add_generation_prompt=True,
            enable_thinking=False,
            tokenize=False,
        )
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        out = model.generate(
            **inputs,
            max_new_tokens=512,
            do_sample=False,
            eos_token_id=tokenizer.convert_tokens_to_ids("<|im_end|>"),
            pad_token_id=tokenizer.convert_tokens_to_ids("<|im_end|>"),
        )
        text = tokenizer.decode(
            out[0, inputs.input_ids.shape[1] :], skip_special_tokens=True
        )
        messages.append({"role": "assistant", "content": text})
        calls = parse_tool_calls(text)
        if not calls:
            messages.append(
                {
                    "role": "user",
                    "content": (
                        "Please call submit_answer with the final answer "
                        "for the current question."
                    ),
                }
            )
            continue
        for c in calls:
            if c["name"] == "submit_answer":
                return c["arguments"]
            if c["name"] == "calculate":
                result = calculate(c["arguments"]["expression"])
                messages.append({"role": "tool", "content": str(result)})
    raise RuntimeError("no submit_answer after max_iterations")


# Worked example: dev record Single_APD/2016/page_96.pdf-1.
doc = render_document(
    pre_text=(
        "15 . debt the tables below summarize our outstanding debt at "
        "30 september 2016 and 2015 : total debt ."
    ),
    table={
        "2016": {
            "current portion of long-term debt": 371.3,
            "long-term debt": 4918.1,
            "total debt": 6225.2,
            "bank obligations": 133.1,
            "commercial paper": 802.7,
            "total short-term borrowings": 935.8,
        },
        "2015": {
            "current portion of long-term debt": 435.6,
            "long-term debt": 3949.1,
            "total debt": 5879.0,
            "bank obligations": 234.3,
            "commercial paper": 1260.0,
            "total short-term borrowings": 1494.3,
        },
    },
    post_text=(
        "the weighted average interest rate of short-term borrowings "
        "outstanding at 30 september 2016 and 2015 was 1.1% and 0.8% "
        "respectively. cash paid for interest, net of amounts capitalized, "
        "was $121.1 in 2016, $97.5 in 2015, and $132.4 in 2014."
    ),
)

messages = [
    {"role": "system", "content": SYSTEM_PROMPT_PREFIX + doc},
    {
        "role": "user",
        "content": (
            "what was the total cash paid for interest in the years of "
            "2015 and 2016, combined?"
        ),
    },
]
print(run_turn(messages))
# {'value': '218.6', 'unit': 'absolute'}

# Multi-turn: append the next user question and call run_turn(messages) again.
messages.append(
    {
        "role": "user",
        "content": "including the year of 2014, what then becomes this total?",
    }
)
print(run_turn(messages))
# {'value': '351.0', 'unit': 'absolute'}

The fine-tune was trained with thinking disabled, so enable_thinking=False matches the prompt format the model expects. Greedy decoding (do_sample=False) matches how the dev-split numbers were produced; sampled generation will give different results. eos_token_id is set to <|im_end|> so generation stops at the end of one assistant turn; without this, the model runs to max_new_tokens and emits a hallucinated multi-turn transcript, because its BPE-level <|endoftext|> is rarely produced by the chat-trained weights.

4-bit loading (~6 GB VRAM)

The reference loop above is unchanged; only the model load differs:

from transformers import AutoModelForImageTextToText, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "sharick008/convfinqa-qwen3.5-4b"
quant = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype="bfloat16")

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, quantization_config=quant, device_map="auto"
)

Serving via vLLM

vllm serve sharick008/convfinqa-qwen3.5-4b

vLLM selects the model class from the architectures field of config.json, so it allocates the vision tower at serve time even though only text prompts are sent. Drive it via the OpenAI-compatible endpoint at http://localhost:8000/v1 with model="sharick008/convfinqa-qwen3.5-4b". The agentic loop above applies unchanged: feed it the same system prompt, parse the same XML tool calls from the assistant content, and feed calculate results back as role: "tool" messages.
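A minimal client sketch for that endpoint, using only the standard library. Two things here are assumptions rather than verified settings for this model: that the server accepts a top-level chat_template_kwargs field to disable thinking, and that temperature 0 with max_tokens 512 is an adequate stand-in for the greedy transformers settings above. The helper names build_payload and chat are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"
MODEL_ID = "sharick008/convfinqa-qwen3.5-4b"


def build_payload(messages: list[dict]) -> dict:
    """Build the JSON body for one chat-completions request."""
    return {
        "model": MODEL_ID,
        "messages": messages,
        "temperature": 0.0,  # approximate greedy decoding
        "max_tokens": 512,
        # Assumption: forwarded by the server to the chat template.
        "chat_template_kwargs": {"enable_thinking": False},
    }


def chat(messages: list[dict]) -> str:
    """POST the conversation and return the assistant's raw text content."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_payload(messages)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


payload = build_payload([{"role": "user", "content": "ping"}])
print(sorted(payload))
```

Calling chat(messages) requires the server from the command above to be running; its return value is the text to hand to the tool-call parser.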

Notes on the AutoModelForImageTextToText class

Qwen/Qwen3.5-4B is an image-text-to-text model (the architecture is Qwen3_5ForConditionalGeneration, with a vision tower in addition to the language tower). This fine-tune only updates the language tower, but the multimodal config travels with the merged weights, so use AutoModelForImageTextToText (not AutoModelForCausalLM) to load the multimodal config without partial-weight warnings. The vision tower stays cold for ConvFinQA (no images in the prompts) but does sit in VRAM; if you are short on memory, prefer the 4-bit path or the adapter repo mounted on a base model you already have cached.

Evaluation: where errors come from

Per-turn-position accuracy is flat between 75% and 81% across positions 0 to 4, so multi-turn co-reference is not the bottleneck. Compared to the un-tuned base model on the same harness, fine-tuning specifically fixes:

  • Sign errors. The base model occasionally drops a negative sign on subtractions; the fine-tune does not.
  • Table-scale errors. Many tables are stated "(in millions)". The base model sometimes multiplies out (e.g. predicting 932,000,000 when the table cell reads 932 and gold is 932); the fine-tune handles these consistently.

A deliberate trade-off shows up on yes/no questions:

  • Boolean comparisons. 35 of 3,037 ConvFinQA train records (1.1%) use the greater(a, b) op in their gold programs to answer "is X higher than Y?" style questions. Our calculator tool exposes only arithmetic, so at SFT-build time those 35 records were skipped instead of expanding the tool surface for ~1% of the data. The cost is visible on the dev split: there are a handful of yes/no questions that the base model's general boolean reasoning answers correctly, while the fine-tune leans harder on its newly strengthened "emit a number" prior. Closing this is a mechanical change, listed in follow-up work below.

What the fine-tune did not fix:

  • Wrong cell selection from the table remains the largest residual error class. The model occasionally pulls the cell next to the right one, or picks the wrong year column.
  • Inverted divisions on a small number of "what fraction of X is Y" turns.
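The per-turn-position breakdown quoted at the start of this section is a simple group-by over graded turns. The sketch below shows the computation on made-up data; the (position, correct) tuple format and the function name accuracy_by_position are assumptions, not the harness's actual output format.

```python
from collections import defaultdict


def accuracy_by_position(graded: list[tuple[int, bool]]) -> dict[int, float]:
    """Map each turn position to the fraction of turns answered correctly."""
    totals: dict[int, int] = defaultdict(int)
    hits: dict[int, int] = defaultdict(int)
    for position, correct in graded:
        totals[position] += 1
        hits[position] += int(correct)
    return {p: hits[p] / totals[p] for p in sorted(totals)}


# Made-up grades: three dialogues' first turns, two dialogues' second turns.
toy = [(0, True), (0, True), (0, False), (1, True), (1, False)]
by_position = accuracy_by_position(toy)
print({p: round(a, 3) for p, a in by_position.items()})
# {0: 0.667, 1: 0.5}
```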

Limitations

  • English only. All training data is English-language US 10-K excerpts.
  • Single-page documents only. Each ConvFinQA record contains exactly one page of context. The model has not been trained to retrieve from longer documents.
  • Numerical reasoning over tables and short prose. It will not generalise well to free-form financial commentary, multi-document synthesis, or domains outside US-equity 10-Ks.

Suggested follow-up work

  1. Cell-grounding in SFT traces. Current assistant turns go straight from "user question" to calculate(literal_numbers). The literal numbers come from the table or post-text, but the trace never says where. Adding one sentence of assistant content before the calculate call ("Reading 2014 cash paid for interest = 132.4 from the post-text") would give the model a place to ground its numbers and should attack the dominant failure mode directly.
  2. Scale-aware data augmentation. Rewrite a subset of train traces so the assistant explicitly reads the "(in millions)" caption before quoting a cell. The current SFT data gives the model no demonstrated example of checking scale captions, only an instruction in the system prompt.
  3. Restore the 35 boolean train records. Extend the calculator to accept Python comparison operators (>, <, >=, <=), returning "yes" / "no" strings instead of Python True / False. Fold the dataset's greater(a, b) programs into (a > b) at SFT-build time. Closes the yes/no trade-off described in the evaluation section above.
  4. DPO on cell-selection mistakes. Build preference pairs from a dev run: chosen = a re-prompted trace that produces the correct executed answer; rejected = the original incorrect trace. Bias the model away from the off-by-one cell errors that dominate the failure mix.
  5. Verifier pass. Add a second-pass call that, given the question, the document, and the proposed answer, judges whether the answer is plausible and either accepts or asks the agent to retry. The cost is manageable (one extra short call per turn) and would absorb a chunk of the cell-selection misses.
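Item 3 above can be sketched as a small extension of the AST walker. The code below is a self-contained illustration of the proposal, not a shipped implementation: the name calculate_or_compare is hypothetical, and the model has not been trained against a calculator that returns "yes" / "no".

```python
import ast
import operator
from typing import Union

_BIN = {
    ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
    ast.Div: operator.truediv, ast.Pow: operator.pow,
}
_CMP = {
    ast.Gt: operator.gt, ast.Lt: operator.lt,
    ast.GtE: operator.ge, ast.LtE: operator.le,
}


def _num(node: ast.AST) -> float:
    """Evaluate arithmetic over numeric literals; reject everything else."""
    if isinstance(node, ast.Constant):
        if isinstance(node.value, bool) or not isinstance(node.value, (int, float)):
            raise ValueError(f"non-numeric literal: {node.value!r}")
        return float(node.value)
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, (ast.UAdd, ast.USub)):
        value = _num(node.operand)
        return -value if isinstance(node.op, ast.USub) else value
    if isinstance(node, ast.BinOp) and type(node.op) in _BIN:
        return _BIN[type(node.op)](_num(node.left), _num(node.right))
    raise ValueError(f"disallowed construct: {type(node).__name__}")


def calculate_or_compare(expression: str) -> Union[float, str]:
    """Return a rounded number, or 'yes' / 'no' for a single comparison."""
    cleaned = str(expression).replace(",", "").strip()
    body = ast.parse(cleaned, mode="eval").body
    if isinstance(body, ast.Compare):
        # Only one top-level comparison is allowed, e.g. a > b (not a < b < c).
        if len(body.ops) != 1 or type(body.ops[0]) not in _CMP:
            raise ValueError("only a single >, <, >= or <= comparison is allowed")
        left, right = _num(body.left), _num(body.comparators[0])
        return "yes" if _CMP[type(body.ops[0])](left, right) else "no"
    return round(_num(body), 5)


print(calculate_or_compare("(103102 > 104222)"))           # no
print(calculate_or_compare("(206588 - 181001) / 181001"))  # 0.14136
```

With this in place, a gold program such as greater(a, b) could be folded into (a > b) at SFT-build time, and the returned string passed straight to submit_answer with unit yes_no.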

Citation

ConvFinQA dataset:

Chen, Z., Li, S., Smiley, C., Ma, Z., Shah, S., & Wang, W. Y. (2022). ConvFinQA: Exploring the Chain of Numerical Reasoning in Conversational Finance Question Answering. EMNLP.

This model is open-weight under Apache-2.0.

Framework versions

  • tinker-cookbook: 0.3.0
  • transformers: 5.6.2
  • torch: 2.11.0