Funding Extraction LoRA (Qwen3.5-9B)

LoRA adapter for extracting structured funding metadata (funder names + award IDs) from academic paper funding statements. Fine-tuned from Qwen3.5-9B with SFT followed by GRPO reinforcement learning.

This is the Qwen3.5-9B counterpart to cometadata/funding-extraction-llama-3.1-8b-instruct-artifact-data-mix-grpo-mixed-reward, trained with the same data, pipeline, and reward. See Comparison to the Llama 3.1 8B baseline below.

Training Pipeline

Trained on the Tinker training service using the cometadata/funding-extraction-artifact-data-mix-grpo-mixed-reward dataset and its predefined sft / rl / test splits.

Stage 1: Supervised Fine-Tuning (SFT)

Base model: Qwen/Qwen3.5-9B
Data (data/sft/): 3,528 real + 7,240 synthetic funding statements with gold-standard funder/award labels (synthetic upsampled 2×)
Data augmentation: 50% of training examples augmented with synthetic noise (OCR-like case errors, digit/letter swaps, Unicode artifacts, XML/HTML tags, LaTeX markup)
Renderer: qwen3_5_disable_thinking (no chain-of-thought; keep thinking disabled at inference, see Usage)
LoRA rank: 128
Epochs: 2
Result: eval NLL 0.116 → 0.0035 over 252 steps
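
The noise augmentation described above can be illustrated with a minimal sketch. It covers only two of the listed noise types (case errors and digit/letter swaps); the confusion table, function names, and per-character rate are hypothetical and are not taken from the actual training pipeline.

```python
import random

# Hypothetical look-alike pairs; the real augmentation tables are not given in this card.
OCR_SWAPS = {"0": "O", "O": "0", "1": "l", "l": "1", "5": "S", "S": "5"}


def add_ocr_noise(text: str, rng: random.Random, char_rate: float = 0.05) -> str:
    """Swap look-alike characters or flip letter case on a fraction of characters."""
    out = []
    for ch in text:
        if rng.random() < char_rate:
            if ch in OCR_SWAPS:
                ch = OCR_SWAPS[ch]      # digit/letter confusion
            elif ch.isalpha():
                ch = ch.swapcase()      # OCR-like case error
        out.append(ch)
    return "".join(out)


def maybe_augment(text: str, rng: random.Random, example_rate: float = 0.5) -> str:
    """Add noise to roughly `example_rate` of the examples and leave the rest clean."""
    return add_ocr_noise(text, rng) if rng.random() < example_rate else text


rng = random.Random(0)  # fixed seed so the sketch is reproducible
print(maybe_augment("Supported by NSF grant DMS-1613002.", rng))
```

In a sketch like this only the input statement is perturbed; whether the gold labels are also adjusted for noisy award IDs is not stated in this card.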

Stage 2: Reinforcement Learning (GRPO)

Algorithm: Group Relative Policy Optimization (GRPO) with importance sampling loss
Data (data/rl/): 1,160 real + 1,916 synthetic (train); 576 real + 968 synthetic (eval)
Reward: Hierarchical F0.5 scoring with binary funder/award-ID matching + flat award-ID association bonus
- reward = 0.50 * funder_F0.5 + 0.40 * hierarchical_award_id_F0.5 + 0.10 * flat_award_id_F0.5
- Funder matching — fuzzy (token_sort_ratio ≥ 0.80 threshold, Hungarian optimal assignment)
- Award ID matching — binary exact after normalization (strip whitespace/hyphens/slashes, uppercase), with soft (edit-distance-1) partial credit during training
- Flat award-ID term — awards partial credit when the correct award ID is extracted under the wrong funder, providing gradient on funder-award association errors
KL penalty: 0.03 (anchored to SFT checkpoint)
Group size: 8 rollouts per prompt
Temperature: 0.8
Learning rate: 3e-5
Steps: 193 batches
Checkpoint: final (batch 193)
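
As a rough illustration of the reward settings listed above, the sketch below shows the award-ID normalization, the weighted combination of the three F0.5 terms, and a group-relative advantage over one group of rollouts. The function names are hypothetical, the three F0.5 inputs are assumed to be computed elsewhere by the matching logic, and the advantage step follows the common GRPO formulation (reward minus group mean, optionally divided by the group standard deviation), which may differ from what the training service does internally.

```python
import re
import statistics


def normalize_award_id(award_id: str) -> str:
    """Strip whitespace, hyphens, and slashes, then uppercase."""
    return re.sub(r"[\s\-/]", "", award_id).upper()


def mixed_reward(funder_f05: float, hier_award_f05: float, flat_award_f05: float) -> float:
    """Weighted sum with the weights given in the reward formula above."""
    return 0.50 * funder_f05 + 0.40 * hier_award_f05 + 0.10 * flat_award_f05


def group_relative_advantages(rewards: list[float], normalize_std: bool = False) -> list[float]:
    """Center each rollout's reward on the mean of its group (assumed GRPO-style baseline)."""
    mean = statistics.fmean(rewards)
    centered = [r - mean for r in rewards]
    if normalize_std:
        std = statistics.pstdev(rewards)
        if std > 0:
            centered = [c / std for c in centered]
    return centered


print(normalize_award_id("dms-1613 002"))
# One group of 8 rollouts for a single prompt, matching the group size above.
print(group_relative_advantages([1.0, 0.0, 0.5, 0.5, 1.0, 0.0, 0.5, 0.5]))
```

Because the flat award-ID term has weight 0.10, a rollout that extracts the right award ID under the wrong funder can still earn up to 0.10 from that term even when the hierarchical term is zero.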

Evaluation Results

arxiv_test.jsonl (300 held-out examples)

Permissive (partial_ratio + token_set, no damping)

Field	P	R	F1	F0.5	F1.5
Funder	0.9384	0.9362	0.9373	0.9379	0.9369
Award ID	0.9069	0.8909	0.8988	0.9037	0.8957
Scheme	0.7407	0.8264	0.7812	0.7564	0.7980
Title	0.9048	0.3958	0.5507	0.7197	0.4787

Balanced (length-damped + acronym detection)

Field	P	R	F1	F0.5	F1.5
Funder	0.8882	0.8960	0.8921	0.8897	0.8936
Award ID	0.8889	0.8732	0.8810	0.8857	0.8779
Scheme	0.6889	0.7686	0.7266	0.7035	0.7422
Title	0.9048	0.3958	0.5507	0.7197	0.4787

Strict (token_sort_ratio only)

Field	P	R	F1	F0.5	F1.5
Funder	0.8796	0.8874	0.8835	0.8812	0.8850
Award ID	0.8859	0.8702	0.8780	0.8827	0.8750
Scheme	0.6667	0.7438	0.7031	0.6808	0.7182
Title	0.8095	0.3542	0.4928	0.6439	0.4283

All 300 outputs were valid JSON.
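
The F1, F0.5, and F1.5 columns are consistent with the standard F-beta formula, where beta < 1 weights precision more heavily and beta > 1 weights recall more heavily. The sketch below recomputes them for the permissive Title row (P = 0.9048, R = 0.3958); it assumes the harness uses the standard definition, and small last-digit differences are possible because the published P and R are already rounded.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Standard F-beta: (1 + b^2) * P * R / (b^2 * P + R)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Permissive Title row from the arxiv_test table above.
p, r = 0.9048, 0.3958
for beta in (1.0, 0.5, 1.5):
    print(f"F{beta:g} = {f_beta(p, r, beta):.4f}")
```

With high precision and low recall, as in the Title rows, F0.5 comes out above F1 and F1.5 below it.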

funding-entity-extraction-dataset-mix test sets

Evaluated on the held-out test sets from cometadata/funding-entity-extraction-dataset-mix with the same evaluation harness. test_with_context uses the full_text field (the funding statement with its surrounding document text) as the model input.

`test.jsonl` (347 examples)

Permissive (partial_ratio + token_set, no damping)

Field	P	R	F1	F0.5	F1.5
Funder	0.9376	0.8923	0.9144	0.9282	0.9058
Award ID	0.8407	0.8339	0.8373	0.8394	0.8360
Scheme	0.4118	0.5927	0.4860	0.4385	0.5221
Title	0.1034	0.0170	0.0293	0.0514	0.0229

Balanced (length-damped + acronym detection)

Field	P	R	F1	F0.5	F1.5
Funder	0.9008	0.8555	0.8776	0.8913	0.8689
Award ID	0.8138	0.8072	0.8105	0.8125	0.8092
Scheme	0.3725	0.5363	0.4397	0.3968	0.4724
Title	0.0690	0.0114	0.0195	0.0342	0.0153

Strict (token_sort_ratio only)

Field	P	R	F1	F0.5	F1.5
Funder	0.8722	0.8276	0.8493	0.8629	0.8408
Award ID	0.7963	0.7898	0.7930	0.7949	0.7918
Scheme	0.3333	0.4798	0.3934	0.3550	0.4227
Title	0.0690	0.0114	0.0195	0.0342	0.0153

`test_degraded.jsonl` (1,288 examples)

Permissive (partial_ratio + token_set, no damping)

Field	P	R	F1	F0.5	F1.5
Funder	0.9285	0.9216	0.9250	0.9271	0.9237
Award ID	0.8586	0.8560	0.8573	0.8581	0.8568
Scheme	0.7413	0.6704	0.7041	0.7260	0.6907
Title	0.7723	0.2267	0.3506	0.5214	0.2897

Balanced (length-damped + acronym detection)

Field	P	R	F1	F0.5	F1.5
Funder	0.9001	0.8906	0.8953	0.8981	0.8935
Award ID	0.8416	0.8390	0.8403	0.8411	0.8398
Scheme	0.6757	0.6110	0.6417	0.6617	0.6296
Title	0.6634	0.1948	0.3011	0.4479	0.2489

Strict (token_sort_ratio only)

Field	P	R	F1	F0.5	F1.5
Funder	0.8801	0.8690	0.8745	0.8778	0.8724
Award ID	0.8317	0.8291	0.8304	0.8312	0.8299
Scheme	0.6039	0.5461	0.5735	0.5913	0.5627
Title	0.6139	0.1802	0.2787	0.4144	0.2303

`test_with_context.jsonl` (322 examples)

Permissive (partial_ratio + token_set, no damping)

Field	P	R	F1	F0.5	F1.5
Funder	0.9348	0.9383	0.9365	0.9355	0.9372
Award ID	0.8711	0.8690	0.8700	0.8707	0.8696
Scheme	0.7515	0.6844	0.7164	0.7371	0.7037
Title	0.8750	0.2442	0.3818	0.5769	0.3138

Balanced (length-damped + acronym detection)

Field	P	R	F1	F0.5	F1.5
Funder	0.9072	0.9061	0.9066	0.9070	0.9064
Award ID	0.8538	0.8517	0.8527	0.8534	0.8523
Scheme	0.6871	0.6257	0.6550	0.6739	0.6434
Title	0.7500	0.2093	0.3273	0.4945	0.2690

Strict (token_sort_ratio only)

Field	P	R	F1	F0.5	F1.5
Funder	0.8863	0.8842	0.8852	0.8859	0.8848
Award ID	0.8439	0.8418	0.8428	0.8434	0.8424
Scheme	0.6074	0.5531	0.5789	0.5957	0.5687
Title	0.7083	0.1977	0.3091	0.4670	0.2540

Comparison to the Llama 3.1 8B baseline

Balanced-mode F1 on the two test sets reported by the Llama baseline card:

arxiv_test (300 examples)

Field	Llama 3.1 8B	Qwen3.5-9B	Δ
Funder	0.9001	0.8921	−0.008
Award ID	0.8780	0.8810	+0.003
Scheme	0.6466	0.7266	+0.080
Title	0.5316	0.5507	+0.019

test_degraded (1,288 examples)

Field	Llama 3.1 8B	Qwen3.5-9B	Δ
Funder	0.8999	0.8953	−0.005
Award ID	0.8477	0.8403	−0.007
Scheme	0.6370	0.6417	+0.005
Title	0.4110	0.3011	−0.110

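Funder and award ID (the reward-weighted fields) are within 0.008 F1 of the Llama baseline on both sets. Scheme and title carry zero reward weight, so their larger swings (up to +0.080 for scheme and −0.110 for title) are in fields that GRPO did not directly optimize.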
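
The Δ columns and the 0.008 bound can be rechecked from the balanced-mode F1 values in the two tables above. This is a small arithmetic sketch; the variable and function names are illustrative only.

```python
# Balanced-mode F1 values copied from the comparison tables above.
LLAMA = {
    "arxiv_test": {"Funder": 0.9001, "Award ID": 0.8780, "Scheme": 0.6466, "Title": 0.5316},
    "test_degraded": {"Funder": 0.8999, "Award ID": 0.8477, "Scheme": 0.6370, "Title": 0.4110},
}
QWEN = {
    "arxiv_test": {"Funder": 0.8921, "Award ID": 0.8810, "Scheme": 0.7266, "Title": 0.5507},
    "test_degraded": {"Funder": 0.8953, "Award ID": 0.8403, "Scheme": 0.6417, "Title": 0.3011},
}
REWARDED_FIELDS = ("Funder", "Award ID")


def deltas(baseline: dict, candidate: dict) -> dict:
    """Per-field F1 difference (candidate minus baseline), rounded to 3 decimals."""
    return {field: round(candidate[field] - baseline[field], 3) for field in baseline}


all_deltas = {name: deltas(LLAMA[name], QWEN[name]) for name in LLAMA}
max_rewarded_gap = max(
    abs(all_deltas[name][field]) for name in all_deltas for field in REWARDED_FIELDS
)
print(all_deltas)
print(max_rewarded_gap)
```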

Usage

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-9B")
model = PeftModel.from_pretrained(base_model, "cometadata/funding-extraction-qwen3.5-9B-artifact-data-mix-grpo-mixed-reward")
tokenizer = AutoTokenizer.from_pretrained("cometadata/funding-extraction-qwen3.5-9B-artifact-data-mix-grpo-mixed-reward")

prompt = """Extract funding information from the following statement:

This work was supported by the National Science Foundation under grant DMS-1613002 and by the NIH (R01-AI123456)."""

messages = [
    {"role": "system", "content": "You are an expert at extracting structured funding metadata from academic papers. Given a funding statement, extract all funders and their associated awards. Return a JSON array of funder objects. Each funder has:\n- \"funder_name\": string or null\n- \"awards\": array of objects with \"award_ids\" (array of strings), \"funding_scheme\" (array of strings), and \"award_title\" (array of strings)\nReturn ONLY the JSON array, no other text."},
    {"role": "user", "content": prompt},
]

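# Model trained with thinking disabled; keep enable_thinking=False.
# return_dict=True yields input_ids plus attention_mask, which generate() takes as keyword arguments.
inputs = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    return_dict=True,
    add_generation_prompt=True,
    enable_thinking=False,
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[-1]
print(tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True))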

Output Format

[
  {
    "funder_name": "National Science Foundation",
    "awards": [
      {
        "award_ids": ["DMS-1613002"],
        "funding_scheme": [],
        "award_title": []
      }
    ]
  },
  {
    "funder_name": "NIH",
    "awards": [
      {
        "award_ids": ["R01-AI123456"],
        "funding_scheme": [],
        "award_title": []
      }
    ]
  }
]
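
For downstream use, the nested output is often easier to handle as flat (funder, award ID) pairs. The sketch below parses a generated string and flattens it; the helper name is hypothetical, and it assumes the output follows the schema shown above.

```python
import json
from typing import List, Optional, Tuple


def flatten_awards(raw_output: str) -> List[Tuple[Optional[str], str]]:
    """Parse the model's JSON array and return (funder_name, award_id) pairs."""
    funders = json.loads(raw_output)
    if not isinstance(funders, list):
        raise ValueError("expected a JSON array of funder objects")
    pairs = []
    for funder in funders:
        name = funder.get("funder_name")  # may be None when no funder is named
        for award in funder.get("awards", []):
            for award_id in award.get("award_ids", []):
                pairs.append((name, award_id))
    return pairs


example = """[
  {"funder_name": "National Science Foundation",
   "awards": [{"award_ids": ["DMS-1613002"], "funding_scheme": [], "award_title": []}]},
  {"funder_name": "NIH",
   "awards": [{"award_ids": ["R01-AI123456"], "funding_scheme": [], "award_title": []}]}
]"""
print(flatten_awards(example))
```

Funders listed without any award IDs produce no pairs in this sketch, so callers that need funder-only records should read `funder_name` separately.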
