convfinqa-qwen3.5-4b
Qwen/Qwen3.5-4B fine-tuned on the ConvFinQA train split for conversational, multi-turn numerical questions over single-page financial documents (a 10-K page consisting of pre-text, a table, and post-text). The LoRA adapter is merged into the base weights, so this repo is a standalone Hugging Face model directory: load it with transformers or vLLM directly, no PEFT needed.
Trained on Tinker via tinker-cookbook.
A standalone LoRA-adapter version is at sharick008/convfinqa-qwen3.5-4b-lora (~290 MB) if you would rather mount the adapter on top of the base model yourself.
Try it without setting anything up: sharick008/convfinqa-agent is a Gradio Space that lets you replay any of the 421 dev records turn-by-turn or chat freely against any record's document. Runs on Hugging Face ZeroGPU.
Reproducibility update (2026-05-07, post-submission): the original Hub README named the two tools but did not show the system prompt, the tool-call grammar, or a runnable loop. The Usage section below now inlines the system prompt, the tool specs, the tool-call parser, and a reference Python loop that runs end-to-end against transformers alone. The loop was verified end-to-end on Hugging Face Jobs; that verification surfaced and fixed two issues: this repo was missing a generation_config.json (now added), and the loop's model.generate() needs an explicit eos_token_id set to <|im_end|> so generation stops at the end of one assistant turn. With both in place, the worked example below produces 218.6 then 351.0 as documented.
Result
Execution accuracy on the 421-record ConvFinQA dev split, graded turn-by-turn against the dataset's gold executed answers:
| Metric | Value |
|---|---|
| Records | 421 / 421 (zero failures) |
| Turns graded | 1,490 |
| Turns correct | 1,227 |
| Execution accuracy | 82.35% |
For reference, the same agent harness on the base Qwen/Qwen3.5-4B (no fine-tune, with the calculate and submit_answer tool specs injected via the renderer's tools-prefix helper) reaches 69.56% on the same dev split (1,026 / 1,475 turns). The bare base model fails to emit a parseable submit_answer on 15 turns, which accounts for the denominator of 1,475 rather than 1,490; the fine-tune fixes those failures. The fine-tune adds +12.79 points through identical inference plumbing.
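The headline percentages can be re-derived from the turn counts quoted above. The snippet below is only an arithmetic check; the helper name accuracy_pct is illustrative and not part of the evaluation harness.

```python
def accuracy_pct(correct: int, graded: int) -> float:
    """Execution accuracy as a percentage, rounded to two decimals."""
    return round(100.0 * correct / graded, 2)


fine_tuned = accuracy_pct(1227, 1490)   # fine-tuned model, all graded turns
base = accuracy_pct(1026, 1475)         # base model, denominator as reported
gain = round(fine_tuned - base, 2)      # difference in percentage points

print(fine_tuned, base, gain)
```

The three printed values should match the 82.35%, 69.56%, and +12.79 figures reported in this section.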
Training config
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3.5-4B |
| Method | LoRA SFT (Tinker, qwen3_5 renderer) |
| LoRA rank | 32 |
| Learning rate | 3e-4, cosine decay |
| Batch size | 64 |
| Epochs | 3 |
| Optimisation steps | 837 |
| Max sequence length | 16,384 |
| Train rows | 18,078 per-target rows from 3,002 ConvFinQA train conversations |
| Held-out test | 200 rows |
| Final train mean NLL | 0.0002 |
| Final test mean NLL | 0.010 |
| Train-on-what | Last assistant message per row |
Each training row is a prefix of one full multi-turn dialogue, ending at one assistant target message. Two rows per question turn (one for each assistant message: the calculate tool call, and the submit_answer tool call), so an N-turn dialogue produces 2N rows.
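As a rough illustration of the per-target split, the sketch below cuts one dialogue into prefix rows, each ending at an assistant message. The function name split_into_rows and the placeholder message contents are assumptions for illustration; the actual tinker-cookbook data pipeline is not shown in this README.

```python
def split_into_rows(messages: list[dict]) -> list[list[dict]]:
    """Return one row per assistant message: the dialogue prefix ending at it."""
    return [
        messages[: i + 1]
        for i, message in enumerate(messages)
        if message["role"] == "assistant"
    ]


# A toy 2-turn dialogue with placeholder contents (not real training data).
dialogue = [
    {"role": "system", "content": "<system prompt + document>"},
    {"role": "user", "content": "question 1"},
    {"role": "assistant", "content": "<calculate call>"},
    {"role": "tool", "content": "0.14136"},
    {"role": "assistant", "content": "<submit_answer call>"},
    {"role": "user", "content": "question 2"},
    {"role": "assistant", "content": "<calculate call>"},
    {"role": "tool", "content": "12.5"},
    {"role": "assistant", "content": "<submit_answer call>"},
]

rows = split_into_rows(dialogue)
print(len(rows))  # 2 turns -> 4 rows (2N)
```

With loss placed only on the last assistant message of each row, every target is conditioned on exactly the prefix that precedes it in the dialogue.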
The qwen3.5 renderer does not satisfy the extension property, so training with ALL_ASSISTANT_MESSAGES over a multi-turn dialogue puts loss on prefix tokens that do not match what build_generation_prompt would produce at that point. Splitting into per-target rows and switching to LAST_ASSISTANT_MESSAGE gives every loss-bearing sample the same on-policy prefix the model would actually see at inference.
Assistant tool calls are synthesised from the dataset's gold turn_program. Multi-step DSL programs like subtract(206588, 181001), divide(#0, 181001) fold into a single Python expression ((206588 - 181001) / 181001) emitted as one calculate call, followed by a submit_answer call carrying the gold value and an inferred unit.
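The folding step can be sketched as follows. This is a minimal version under stated assumptions: the name fold_program is hypothetical, only binary ops with the names in _OPS are handled, and "#k" is taken to mean "the result of step k". Special constants and table ops that may appear in the dataset's DSL are not covered.

```python
import re

# Assumed mapping from DSL op names to Python operators.
_OPS = {"add": "+", "subtract": "-", "multiply": "*", "divide": "/", "exp": "**"}
_STEP_RE = re.compile(r"(\w+)\(([^()]*)\)")


def fold_program(program: str) -> str:
    """Fold a multi-step DSL program into a single Python arithmetic expression."""
    steps: list[str] = []
    for op, raw_args in _STEP_RE.findall(program):
        if op not in _OPS:
            raise ValueError(f"unsupported op: {op}")
        args = [a.strip() for a in raw_args.split(",")]
        if len(args) != 2:
            raise ValueError(f"expected two arguments, got: {raw_args!r}")
        # "#k" refers to the result of step k; inline that sub-expression.
        left, right = (steps[int(a[1:])] if a.startswith("#") else a for a in args)
        steps.append(f"({left} {_OPS[op]} {right})")
    if not steps:
        raise ValueError("empty program")
    return steps[-1]


print(fold_program("subtract(206588, 181001), divide(#0, 181001)"))
# ((206588 - 181001) / 181001)
```

Inlining earlier steps as parenthesised sub-expressions keeps operator precedence explicit, so the folded string can be passed to calculate unchanged.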
Usage
This is not a vanilla chat model. It expects an agentic loop with two
tools, calculate and submit_answer, plus a specific system prompt.
Everything below is self-contained: with transformers alone you can
reproduce a turn end-to-end without any additional repository.
System prompt
The model was trained against the prompt below. Append the
record-specific document at the bottom and use the result as the
role: system message. Verbatim text matters: the worked examples are
how the model learnt to pick the right unit.
You are a financial analyst reading one page from a 10-K annual report filed
with the U.S. Securities and Exchange Commission. Your job is to answer
numerical questions about this page.
## Conversation format
Each turn asks one question. Later turns in the same conversation may
reference earlier answers (for example "the difference", "that amount").
Reuse numbers you have already derived.
## Tools
- `calculate(expression)`: evaluate an arithmetic expression. Call this
for every +, -, *, /, ** or percentage step. Copy the returned number
through verbatim; it is already at the precision the grader expects.
- `submit_answer(value, unit)`: submit the final answer. Call this once
you have the answer for the current question. The conversation does
not advance until you call `submit_answer`.
## Table units and scale
Tables often state their scale in headers like
"(in millions, except per share data)" or "(amounts in thousands)".
When the question asks for a value that appears in such a table,
return the cell value AS WRITTEN. Do NOT multiply it out into raw
units. Gold answers in this dataset are in the table's stated scale.
Example: if the table caption says "(in millions)" and the cell
shows 29,500, the answer is 29500 (unit `absolute`), not
29,500,000,000.
If a calculation combines two values from a "(in millions)" table,
the magnitude is preserved: both inputs are in millions and so is the
result.
## Answer units
When you call `submit_answer`, choose the `unit` that matches your
`value`:
- `fraction`: a decimal ratio, e.g. 0.14136 for 14.136%.
- `percent`: a percentage value, e.g. 14.136 for 14.136%.
- `absolute`: a raw count or currency amount with no unit symbol,
e.g. 206588.
- `count`: a whole-number count, e.g. 4.
- `yes_no`: the string 'yes' or 'no'.
## Worked examples
Percentage change, answered as a fraction.
Question: net cash was 206588 in 2009 and 181001 in 2008; what is the
percentage change? Call `calculate("(206588 - 181001) / 181001")`
which returns 0.14136. Then call
`submit_answer(value=0.14136, unit="fraction")`.
Ratio, answered as a percent.
Question: what amortisation rate does an 8-year useful life represent?
Call `calculate("100 / 8")` which returns 12.5. Then call
`submit_answer(value=12.5, unit="percent")`.
Raw currency amount, no arithmetic needed.
Question: what long-term debt matures in 2017? Read the cell directly
from the table and call `submit_answer(value=307403, unit="absolute")`.
Scaled-table value, no arithmetic.
Question: what was Net cash from financing activities in 2014?
Table caption says "(in millions)" and the cell reads 29,500. Call
`submit_answer(value=29500, unit="absolute")`. Do NOT submit
29500000000 or 29.5 billion. Keep the value in the table's stated
scale.
Boolean.
Question: is net income higher in 2009 than in 2008? Call
`calculate("103102 - 104222")` which returns -1120. Then call
`submit_answer(value="no", unit="yes_no")`.
Unit mismatch.
For the percentage-change question above, do not submit value=14.136
with unit="absolute". That would be flagged as a unit mismatch.
Either submit 0.14136 as `fraction`, or 14.136 as `percent`.
## Document
The ## Document heading is followed by the record-specific document,
rendered as XML wrapping a markdown table:
def render_document(pre_text: str, table: dict, post_text: str) -> str:
    """Render a ConvFinQA document as XML wrapping a markdown table.
    `table` is a column-keyed nested dict: {col_header: {row_header: cell}}.
    """
    cols = list(table)
    rows = list(dict.fromkeys(r for col in table.values() for r in col))
    md = ["| | " + " | ".join(cols) + " |", "|---" * (len(cols) + 1) + "|"]
    for row in rows:
        cells = [str(table[c].get(row, "")) for c in cols]
        md.append(f"| {row} | " + " | ".join(cells) + " |")
    return (
        "<document>\n"
        f"<pre_text>\n{pre_text}\n</pre_text>\n"
        "<table>\n" + "\n".join(md) + "\n</table>\n"
        f"<post_text>\n{post_text}\n</post_text>\n"
        "</document>"
    )
Tool specs
Two functions, no others. The model emits one calculate then one
submit_answer per turn in the typical case, but the loop should
allow multiple calculate calls before the terminal submit_answer.
calculate(expression) — evaluate a Python arithmetic expression. Use
for any +, -, *, /, ** or percentage step. The returned number
is rounded to 5 decimals.
| Parameter | Type | Description |
|---|---|---|
| expression | string | Arithmetic over numeric literals using + - * / ** and parentheses, e.g. "(206588 - 181001) / 181001". No variables or function calls. |
Returns: a stringified number (the rounded result).
submit_answer(value, unit) — submit the final answer for the current
question and terminate the turn.
| Parameter | Type | Description |
|---|---|---|
| value | number or string | Number for numeric questions; 'yes' or 'no' when unit is yes_no. |
| unit | string enum | One of fraction, percent, absolute, count, yes_no. |
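Because the tool-call parser returns every parameter as a string, a caller will usually want to coerce and validate submit_answer arguments against the spec above. The sketch below is one way to do that; coerce_submission is a hypothetical helper, not part of the released harness, and its error handling is an assumption.

```python
from typing import Union

UNITS = {"fraction", "percent", "absolute", "count", "yes_no"}


def coerce_submission(arguments: dict) -> tuple[Union[float, str], str]:
    """Validate submit_answer arguments and convert the value to its typed form."""
    unit = str(arguments.get("unit", "")).strip()
    if unit not in UNITS:
        raise ValueError(f"unknown unit: {unit!r}")
    raw = str(arguments.get("value", "")).strip()
    if unit == "yes_no":
        # Tolerate quoted or capitalised answers such as 'No'.
        answer = raw.strip("'\"").lower()
        if answer not in {"yes", "no"}:
            raise ValueError(f"yes_no value must be 'yes' or 'no', got {raw!r}")
        return answer, unit
    # Numeric units: drop thousands separators before converting.
    return float(raw.replace(",", "")), unit


print(coerce_submission({"value": "218.6", "unit": "absolute"}))
print(coerce_submission({"value": "no", "unit": "yes_no"}))
```

Rejecting unknown units early keeps a malformed submission from being silently graded as a number.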
Tool-call grammar
The model emits tool calls in Qwen3.5 XML, possibly multiple per assistant turn:
<tool_call>
<function=calculate>
<parameter=expression>
(206588 - 181001) / 181001
</parameter>
</function>
</tool_call>
Each <parameter> value sits on its own line, so the raw capture carries
surrounding whitespace; trim it with .strip() after extraction. Two
regexes are enough to parse the stream:
import re

# One <tool_call> block: captures the function name and the raw parameter body.
TOOL_CALL_RE = re.compile(
    r"<tool_call>\s*<function=([^>]+)>\s*(.*?)\s*</function>\s*</tool_call>",
    re.DOTALL,
)
# One <parameter> block: captures the parameter name and its value.
PARAM_RE = re.compile(
    r"<parameter=([^>]+)>\s*(.*?)\s*</parameter>",
    re.DOTALL,
)


def parse_tool_calls(text: str) -> list[dict]:
    """Return every tool call in `text` as {"name": ..., "arguments": {...}}."""
    out = []
    for m in TOOL_CALL_RE.finditer(text):
        params = {
            p.group(1).strip(): p.group(2).strip()
            for p in PARAM_RE.finditer(m.group(2))
        }
        out.append({"name": m.group(1).strip(), "arguments": params})
    return out
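Going the other way, it can be handy to serialise a tool call into the same XML, for example when building test fixtures for the parser. The helper below is a sketch: format_tool_call is a hypothetical name, and the layout simply mirrors the example block shown in this section rather than any official Qwen serialiser.

```python
def format_tool_call(name: str, arguments: dict) -> str:
    """Serialise one tool call in the XML layout shown above."""
    lines = ["<tool_call>", f"<function={name}>"]
    for key, value in arguments.items():
        # Each parameter value goes on its own line between the tags.
        lines += [f"<parameter={key}>", str(value), "</parameter>"]
    lines += ["</function>", "</tool_call>"]
    return "\n".join(lines)


print(format_tool_call("calculate", {"expression": "(206588 - 181001) / 181001"}))
```

Feeding the printed string to the parser above should return a single calculate call with the same expression.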
Calculator implementation
Safe and dependency-free. Walks the AST and refuses anything beyond
arithmetic over numeric literals; never calls Python's eval():
import ast

_BIN = (ast.Add, ast.Sub, ast.Mult, ast.Div, ast.Pow)
_UN = (ast.UAdd, ast.USub)


def _eval_node(node):
    if isinstance(node, ast.Constant):
        if isinstance(node.value, bool) or not isinstance(node.value, (int, float)):
            raise ValueError(f"non-numeric literal: {node.value!r}")
        return float(node.value)
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, _UN):
        v = _eval_node(node.operand)
        return -v if isinstance(node.op, ast.USub) else +v
    if isinstance(node, ast.BinOp) and isinstance(node.op, _BIN):
        left, right = _eval_node(node.left), _eval_node(node.right)
        if isinstance(node.op, ast.Add):
            return left + right
        if isinstance(node.op, ast.Sub):
            return left - right
        if isinstance(node.op, ast.Mult):
            return left * right
        if isinstance(node.op, ast.Div):
            return left / right
        return left ** right
    raise ValueError(f"disallowed construct: {type(node).__name__}")


def calculate(expression: str) -> float:
    """Evaluate an arithmetic expression; round to 5 decimals."""
    cleaned = str(expression).replace(",", "").strip()
    return round(float(_eval_node(ast.parse(cleaned, mode="eval").body)), 5)
Reference loop
End-to-end loop for one ConvFinQA dialogue, using only transformers
plus the helpers above. Paste SYSTEM_PROMPT_PREFIX, render_document,
parse_tool_calls, and calculate from the sections above into the
same module:
from transformers import AutoModelForImageTextToText, AutoTokenizer

MODEL_ID = "sharick008/convfinqa-qwen3.5-4b"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, dtype="bfloat16", device_map="auto"
)
model.eval()


def run_turn(messages: list[dict], max_iterations: int = 6) -> dict:
    """Run one question turn until submit_answer or budget exhausts."""
    for _ in range(max_iterations):
        prompt = tokenizer.apply_chat_template(
            messages,
            add_generation_prompt=True,
            enable_thinking=False,
            tokenize=False,
        )
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        out = model.generate(
            **inputs,
            max_new_tokens=512,
            do_sample=False,
            eos_token_id=tokenizer.convert_tokens_to_ids("<|im_end|>"),
            pad_token_id=tokenizer.convert_tokens_to_ids("<|im_end|>"),
        )
        # Decode only the newly generated tokens, not the prompt.
        text = tokenizer.decode(
            out[0, inputs.input_ids.shape[1] :], skip_special_tokens=True
        )
        messages.append({"role": "assistant", "content": text})
        calls = parse_tool_calls(text)
        if not calls:
            # No tool call emitted: nudge the model and try again.
            messages.append(
                {
                    "role": "user",
                    "content": (
                        "Please call submit_answer with the final answer "
                        "for the current question."
                    ),
                }
            )
            continue
        for c in calls:
            if c["name"] == "submit_answer":
                return c["arguments"]
            if c["name"] == "calculate":
                # Feed the calculator result back as a tool message.
                result = calculate(c["arguments"]["expression"])
                messages.append({"role": "tool", "content": str(result)})
    raise RuntimeError("no submit_answer after max_iterations")


# Worked example: dev record Single_APD/2016/page_96.pdf-1.
doc = render_document(
    pre_text=(
        "15 . debt the tables below summarize our outstanding debt at "
        "30 september 2016 and 2015 : total debt ."
    ),
    table={
        "2016": {
            "current portion of long-term debt": 371.3,
            "long-term debt": 4918.1,
            "total debt": 6225.2,
            "bank obligations": 133.1,
            "commercial paper": 802.7,
            "total short-term borrowings": 935.8,
        },
        "2015": {
            "current portion of long-term debt": 435.6,
            "long-term debt": 3949.1,
            "total debt": 5879.0,
            "bank obligations": 234.3,
            "commercial paper": 1260.0,
            "total short-term borrowings": 1494.3,
        },
    },
    post_text=(
        "the weighted average interest rate of short-term borrowings "
        "outstanding at 30 september 2016 and 2015 was 1.1% and 0.8% "
        "respectively. cash paid for interest, net of amounts capitalized, "
        "was $121.1 in 2016, $97.5 in 2015, and $132.4 in 2014."
    ),
)
messages = [
    {"role": "system", "content": SYSTEM_PROMPT_PREFIX + doc},
    {
        "role": "user",
        "content": (
            "what was the total cash paid for interest in the years of "
            "2015 and 2016, combined?"
        ),
    },
]
print(run_turn(messages))
# {'value': '218.6', 'unit': 'absolute'}

# Multi-turn: append the next user question and call run_turn(messages) again.
messages.append(
    {
        "role": "user",
        "content": "including the year of 2014, what then becomes this total?",
    }
)
print(run_turn(messages))
# {'value': '351.0', 'unit': 'absolute'}
The fine-tune was trained with thinking disabled, so
enable_thinking=False matches the prompt format the model expects.
Greedy decoding (do_sample=False) matches how the dev-split numbers
were produced; sampled generation will give different results.
eos_token_id is set to <|im_end|> so generation stops at the end of
one assistant turn; without this, the model runs to max_new_tokens and
emits a hallucinated multi-turn transcript, because its BPE-level
<|endoftext|> is rarely produced by the chat-trained weights.
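The reference loop calls the calculator directly, so a malformed expression (for example a division by zero or a stray identifier) would raise and abort the turn. One possible hardening, sketched below, is to turn such failures into a tool message the model can react to. The wrapper name run_calculate and the error-string format are assumptions; the fine-tune was not necessarily trained on error messages, so its recovery behaviour is untested.

```python
from typing import Callable


def run_calculate(calc: Callable[[str], float], arguments: dict) -> str:
    """Run the calculator and always return a string for the tool message."""
    expression = arguments.get("expression")
    if expression is None:
        return "error: missing 'expression' parameter"
    try:
        return str(calc(expression))
    except (ValueError, SyntaxError, ZeroDivisionError, OverflowError) as exc:
        # Report the failure to the model instead of crashing the loop.
        return f"error: {type(exc).__name__}: {exc}"


# Stand-in calculators for demonstration only.
print(run_calculate(lambda e: 218.6, {"expression": "121.1 + 97.5"}))
print(run_calculate(lambda e: 1 / 0, {"expression": "1 / 0"}))
```

In the loop, the returned string would replace str(result) as the content of the role: "tool" message.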
4-bit loading (~6 GB VRAM)
The reference loop above is unchanged; only the model load differs:
from transformers import AutoModelForImageTextToText, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "sharick008/convfinqa-qwen3.5-4b"
quant = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype="bfloat16")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, quantization_config=quant, device_map="auto"
)
Serving via vLLM
vllm serve sharick008/convfinqa-qwen3.5-4b
vLLM selects the model class from the architectures field in config.json, so it
allocates the vision tower at serve time even though only text prompts are sent. Drive it via the
OpenAI-compatible endpoint at http://localhost:8000/v1 with
model="sharick008/convfinqa-qwen3.5-4b". The agentic loop above
applies unchanged: feed it the same system prompt, parse the same XML
tool calls from the assistant content, and feed calculate results
back as role: "tool" messages.
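A minimal client for the served endpoint might look like the sketch below, using only the standard library. The helper names build_chat_request and chat are illustrative. Passing chat_template_kwargs to disable thinking is an assumption about the vLLM server's request extensions and should be checked against the vLLM version in use; temperature 0 is used to approximate the greedy decoding of the reference loop.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"
MODEL_ID = "sharick008/convfinqa-qwen3.5-4b"


def build_chat_request(messages: list[dict], max_tokens: int = 512) -> dict:
    """Build the JSON body for one chat-completions call."""
    return {
        "model": MODEL_ID,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": 0.0,  # approximate greedy decoding
        # Assumption: the server forwards these kwargs to the chat template.
        "chat_template_kwargs": {"enable_thinking": False},
    }


def chat(messages: list[dict]) -> str:
    """POST the request and return the assistant message content."""
    body = json.dumps(build_chat_request(messages)).encode("utf-8")
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]


payload = build_chat_request([{"role": "user", "content": "hello"}])
print(json.dumps(payload, indent=2))
```

The string returned by chat would then go through the same XML tool-call parsing as in the transformers loop. If the server is started with its own tool-call parser enabled, calls may be returned in a separate field instead of the content, which this sketch does not handle.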
Notes on the AutoModelForImageTextToText class
Qwen/Qwen3.5-4B is an image-text-to-text model (the architecture is
Qwen3_5ForConditionalGeneration, with a vision tower in addition to
the language tower). This fine-tune only updates the language tower,
but the multimodal config travels with the merged weights, so use
AutoModelForImageTextToText (not AutoModelForCausalLM) to load the
multimodal config without partial-weight warnings. The vision tower
stays cold for ConvFinQA (no images in the prompts) but does sit in
VRAM; if you are short on memory, prefer the 4-bit path or the
adapter repo
mounted on a base model you already have cached.
Evaluation: where errors come from
Per-turn-position accuracy is flat between 75% and 81% across positions 0 to 4, so multi-turn co-reference is not the bottleneck. Compared to the un-tuned base model on the same harness, fine-tuning specifically fixes:
- Sign errors. The base model occasionally drops a negative sign on subtractions; the fine-tune does not.
- Table-scale errors. Many tables are stated "(in millions)". The base model sometimes multiplies out (e.g. predicting 932,000,000 when the table cell reads 932 and gold is 932); the fine-tune handles these consistently.
A deliberate trade-off shows up on yes/no questions:
- Boolean comparisons. 35 of 3,037 ConvFinQA train records (1.2%) use the greater(a, b) op in their gold programs to answer "is X higher than Y?" style questions. Our calculator tool exposes only arithmetic, so at SFT-build time those 35 records were skipped rather than expanding the tool surface for ~1% of data. The cost is visible on the dev split: a small handful of yes/no questions where the base model's general boolean reasoning answers correctly, while the fine-tune leans harder on its newly strengthened "emit a number" prior. Closing this is a mechanical change, listed in follow-up work below.
What the fine-tune did not fix:
- Wrong cell selection from the table remains the largest residual error class. The model occasionally pulls the cell next to the right one, or picks the wrong year column.
- Inverted divisions on a small number of "what fraction of X is Y" turns.
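The per-turn-position breakdown mentioned at the start of this section can be computed from graded turns with a small helper like the one below. The function name and the (position, is_correct) input format are assumptions, and the sample data is synthetic, not taken from the dev run.

```python
from collections import defaultdict


def accuracy_by_position(results) -> dict:
    """Map turn position -> fraction correct, from (position, is_correct) pairs."""
    totals = defaultdict(lambda: [0, 0])  # position -> [correct, graded]
    for position, is_correct in results:
        totals[position][0] += int(is_correct)
        totals[position][1] += 1
    return {pos: c / n for pos, (c, n) in sorted(totals.items())}


# Synthetic example: two graded turns at position 0, one at position 1.
sample = [(0, True), (0, False), (1, True)]
print(accuracy_by_position(sample))
```

A flat profile across positions suggests that later turns, which depend on earlier answers, are not systematically harder than first turns.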
Limitations
- English only. All training data is English-language US 10-K excerpts.
- Single-page documents only. Each ConvFinQA record contains exactly one page of context. The model has not been trained to retrieve from longer documents.
- Numerical reasoning over tables and short prose. It will not generalise well to free-form financial commentary, multi-document synthesis, or domains outside US-equity 10-Ks.
Suggested follow-up work
- Cell-grounding in SFT traces. Current assistant turns go straight from "user question" to calculate(literal_numbers). The literal numbers come from the table or post-text, but the trace never says where. Adding one sentence of assistant content before the calculate call ("Reading 2014 cash paid for interest = 132.4 from the post-text") would give the model a place to ground and should attack the dominant failure mode directly.
- Scale-aware data augmentation. Rewrite a subset of train traces so the assistant explicitly reads the "(in millions)" caption before quoting a cell. The current SFT data gives the model no demonstrated example of checking scale captions, only an instruction in the system prompt.
- Restore the 35 boolean train records. Extend the calculator to accept Python comparison operators (>, <, >=, <=), returning "yes" / "no" strings instead of Python True / False. Fold the dataset's greater(a, b) programs into (a > b) at SFT-build time. Closes the yes/no trade-off described in the evaluation section above.
- DPO on cell-selection mistakes. Build preference pairs from a dev run: chosen = a re-prompted trace that produces the correct executed answer; rejected = the original incorrect trace. Bias the model away from the off-by-one cell errors that dominate the failure mix.
- Verifier pass. Add a second-pass call that, given the question, the document, and the proposed answer, judges whether the answer is plausible and either accepts or asks the agent to retry. The cost is manageable (one extra short call per turn) and would absorb a chunk of the cell-selection misses.
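The comparison-operator extension proposed in the list above could look roughly like this. It is a sketch of the suggested follow-up, not code shipped with this model: calculate_with_compare is a hypothetical name, and it accepts at most one top-level comparison over arithmetic sub-expressions.

```python
import ast
import operator

_ARITH = {
    ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
    ast.Div: operator.truediv, ast.Pow: operator.pow,
}
_COMPARE = {
    ast.Gt: operator.gt, ast.Lt: operator.lt,
    ast.GtE: operator.ge, ast.LtE: operator.le,
}


def _num(node) -> float:
    """Evaluate an arithmetic-only AST node over numeric literals."""
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return float(node.value)
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, (ast.UAdd, ast.USub)):
        value = _num(node.operand)
        return -value if isinstance(node.op, ast.USub) else value
    if isinstance(node, ast.BinOp) and type(node.op) in _ARITH:
        return _ARITH[type(node.op)](_num(node.left), _num(node.right))
    raise ValueError(f"disallowed construct: {type(node).__name__}")


def calculate_with_compare(expression: str):
    """Return 'yes'/'no' for a single comparison, else a number rounded to 5 decimals."""
    body = ast.parse(str(expression).replace(",", "").strip(), mode="eval").body
    if isinstance(body, ast.Compare):
        if len(body.ops) != 1 or type(body.ops[0]) not in _COMPARE:
            raise ValueError("only a single >, <, >=, <= comparison is supported")
        left, right = _num(body.left), _num(body.comparators[0])
        return "yes" if _COMPARE[type(body.ops[0])](left, right) else "no"
    return round(_num(body), 5)


print(calculate_with_compare("103102 > 104222"))
print(calculate_with_compare("(206588 - 181001) / 181001"))
```

With this in place, a gold program such as greater(a, b) could be folded into the expression a > b at SFT-build time, and the resulting "yes" or "no" passed straight to submit_answer with unit yes_no.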
Citation
ConvFinQA dataset:
Chen, Z., Li, S., Smiley, C., Ma, Z., Shah, S., & Wang, W. Y. (2022). ConvFinQA: Exploring the Chain of Numerical Reasoning in Conversational Finance Question Answering. EMNLP.
This model is open-weight under Apache-2.0.
Framework versions
- tinker-cookbook: 0.3.0
- transformers: 5.6.2
- torch: 2.11.0