---
base_model: Qwen/Qwen3.5-9B
library_name: peft
license: apache-2.0
datasets:
- cometadata/funding-extraction-artifact-data-mix-grpo-mixed-reward
tags:
- funding-extraction
- lora
- grpo
- rl
- scholarly-metadata
language:
- en
pipeline_tag: text-generation
---

# Funding Extraction LoRA (Qwen3.5-9B)

LoRA adapter for extracting structured funding metadata (funder names + award IDs) from academic paper funding statements. Fine-tuned on Qwen3.5-9B via SFT, then GRPO reinforcement learning.

This is the Qwen3.5-9B counterpart to [`cometadata/funding-extraction-llama-3.1-8b-instruct-artifact-data-mix-grpo-mixed-reward`](https://huggingface.co/cometadata/funding-extraction-llama-3.1-8b-instruct-artifact-data-mix-grpo-mixed-reward), trained with the same data, pipeline, and reward. See [Comparison to the Llama 3.1 8B baseline](#comparison-to-the-llama-31-8b-baseline) below.

## Training Pipeline

Trained on the [`cometadata/funding-extraction-artifact-data-mix-grpo-mixed-reward`](https://huggingface.co/datasets/cometadata/funding-extraction-artifact-data-mix-grpo-mixed-reward) dataset, using its predefined `sft` / `rl` / `test` splits, on the [Tinker](https://thinkingmachines.ai) training service.
### Stage 1: Supervised Fine-Tuning (SFT)

- **Base model:** `Qwen/Qwen3.5-9B`
- **Data (`data/sft/`):** 3,528 real + 7,240 synthetic funding statements with gold-standard funder/award labels (synthetic upsampled 2×)
- **Data augmentation:** 50% of training examples augmented with synthetic noise (OCR-like case errors, digit/letter swaps, Unicode artifacts, XML/HTML tags, LaTeX markup)
- **Renderer:** `qwen3_5_disable_thinking` (no chain-of-thought; keep thinking disabled at inference, see [Usage](#usage))
- **LoRA rank:** 128
- **Epochs:** 2
- **Result:** eval NLL 0.116 → 0.0035 over 252 steps

### Stage 2: Reinforcement Learning (GRPO)

- **Algorithm:** Group Relative Policy Optimization (GRPO) with importance sampling loss
- **Data (`data/rl/`):** 1,160 real + 1,916 synthetic (train); 576 real + 968 synthetic (eval)
- **Reward:** Hierarchical F0.5 scoring with binary funder/award-ID matching + flat award-ID association bonus
  - `reward = 0.50 * funder_F0.5 + 0.40 * hierarchical_award_id_F0.5 + 0.10 * flat_award_id_F0.5`
  - Funder matching: fuzzy (token_sort_ratio ≥ 0.80 threshold, Hungarian optimal assignment)
  - Award ID matching: binary exact after normalization (strip whitespace/hyphens/slashes, uppercase), with soft (edit-distance-1) partial credit during training
  - Flat award-ID term: awards partial credit when the correct award ID is extracted under the wrong funder, providing gradient on funder-award association errors
- **KL penalty:** 0.03 (anchored to SFT checkpoint)
- **Group size:** 8 rollouts per prompt
- **Temperature:** 0.8
- **Learning rate:** 3e-5
- **Steps:** 193 batches
- **Checkpoint:** final (batch 193)

## Evaluation Results

### arxiv_test.jsonl (300 held-out examples)

#### Permissive (partial_ratio + token_set, no damping)

| Field | P | R | F1 | F0.5 | F1.5 |
|-------|---|---|----|----|------|
| Funder | 0.9384 | 0.9362 | 0.9373 | 0.9379 | 0.9369 |
| Award ID | 0.9069 | 0.8909 | 0.8988 | 0.9037 | 0.8957 |
| Scheme | 0.7407 | 0.8264 | 0.7812 | 0.7564 | 0.7980 |
| Title | 0.9048 | 0.3958 | 0.5507 | 0.7197 | 0.4787 |

#### Balanced (length-damped + acronym detection)

| Field | P | R | F1 | F0.5 | F1.5 |
|-------|---|---|----|----|------|
| Funder | 0.8882 | 0.8960 | 0.8921 | 0.8897 | 0.8936 |
| Award ID | 0.8889 | 0.8732 | 0.8810 | 0.8857 | 0.8779 |
| Scheme | 0.6889 | 0.7686 | 0.7266 | 0.7035 | 0.7422 |
| Title | 0.9048 | 0.3958 | 0.5507 | 0.7197 | 0.4787 |

#### Strict (token_sort_ratio only)

| Field | P | R | F1 | F0.5 | F1.5 |
|-------|---|---|----|----|------|
| Funder | 0.8796 | 0.8874 | 0.8835 | 0.8812 | 0.8850 |
| Award ID | 0.8859 | 0.8702 | 0.8780 | 0.8827 | 0.8750 |
| Scheme | 0.6667 | 0.7438 | 0.7031 | 0.6808 | 0.7182 |
| Title | 0.8095 | 0.3542 | 0.4928 | 0.6439 | 0.4283 |

All 300 outputs were valid JSON.

### funding-entity-extraction-dataset-mix test sets

Evaluated on the held-out test sets from [`cometadata/funding-entity-extraction-dataset-mix`](https://huggingface.co/datasets/cometadata/funding-entity-extraction-dataset-mix) with the same evaluation harness. `test_with_context` uses the `full_text` field (the funding statement with its surrounding document text) as the model input.
#### `test.jsonl` (347 examples)

Permissive (partial_ratio + token_set, no damping)

| Field | P | R | F1 | F0.5 | F1.5 |
|-------|---|---|----|----|------|
| Funder | 0.9376 | 0.8923 | 0.9144 | 0.9282 | 0.9058 |
| Award ID | 0.8407 | 0.8339 | 0.8373 | 0.8394 | 0.8360 |
| Scheme | 0.4118 | 0.5927 | 0.4860 | 0.4385 | 0.5221 |
| Title | 0.1034 | 0.0170 | 0.0293 | 0.0514 | 0.0229 |

Balanced (length-damped + acronym detection)

| Field | P | R | F1 | F0.5 | F1.5 |
|-------|---|---|----|----|------|
| Funder | 0.9008 | 0.8555 | 0.8776 | 0.8913 | 0.8689 |
| Award ID | 0.8138 | 0.8072 | 0.8105 | 0.8125 | 0.8092 |
| Scheme | 0.3725 | 0.5363 | 0.4397 | 0.3968 | 0.4724 |
| Title | 0.0690 | 0.0114 | 0.0195 | 0.0342 | 0.0153 |

Strict (token_sort_ratio only)

| Field | P | R | F1 | F0.5 | F1.5 |
|-------|---|---|----|----|------|
| Funder | 0.8722 | 0.8276 | 0.8493 | 0.8629 | 0.8408 |
| Award ID | 0.7963 | 0.7898 | 0.7930 | 0.7949 | 0.7918 |
| Scheme | 0.3333 | 0.4798 | 0.3934 | 0.3550 | 0.4227 |
| Title | 0.0690 | 0.0114 | 0.0195 | 0.0342 | 0.0153 |

#### `test_degraded.jsonl` (1,288 examples)

Permissive (partial_ratio + token_set, no damping)

| Field | P | R | F1 | F0.5 | F1.5 |
|-------|---|---|----|----|------|
| Funder | 0.9285 | 0.9216 | 0.9250 | 0.9271 | 0.9237 |
| Award ID | 0.8586 | 0.8560 | 0.8573 | 0.8581 | 0.8568 |
| Scheme | 0.7413 | 0.6704 | 0.7041 | 0.7260 | 0.6907 |
| Title | 0.7723 | 0.2267 | 0.3506 | 0.5214 | 0.2897 |

Balanced (length-damped + acronym detection)

| Field | P | R | F1 | F0.5 | F1.5 |
|-------|---|---|----|----|------|
| Funder | 0.9001 | 0.8906 | 0.8953 | 0.8981 | 0.8935 |
| Award ID | 0.8416 | 0.8390 | 0.8403 | 0.8411 | 0.8398 |
| Scheme | 0.6757 | 0.6110 | 0.6417 | 0.6617 | 0.6296 |
| Title | 0.6634 | 0.1948 | 0.3011 | 0.4479 | 0.2489 |

Strict (token_sort_ratio only)

| Field | P | R | F1 | F0.5 | F1.5 |
|-------|---|---|----|----|------|
| Funder | 0.8801 | 0.8690 | 0.8745 | 0.8778 | 0.8724 |
| Award ID | 0.8317 | 0.8291 | 0.8304 | 0.8312 | 0.8299 |
| Scheme | 0.6039 | 0.5461 | 0.5735 | 0.5913 | 0.5627 |
| Title | 0.6139 | 0.1802 | 0.2787 | 0.4144 | 0.2303 |

#### `test_with_context.jsonl` (322 examples)

Permissive (partial_ratio + token_set, no damping)

| Field | P | R | F1 | F0.5 | F1.5 |
|-------|---|---|----|----|------|
| Funder | 0.9348 | 0.9383 | 0.9365 | 0.9355 | 0.9372 |
| Award ID | 0.8711 | 0.8690 | 0.8700 | 0.8707 | 0.8696 |
| Scheme | 0.7515 | 0.6844 | 0.7164 | 0.7371 | 0.7037 |
| Title | 0.8750 | 0.2442 | 0.3818 | 0.5769 | 0.3138 |

Balanced (length-damped + acronym detection)

| Field | P | R | F1 | F0.5 | F1.5 |
|-------|---|---|----|----|------|
| Funder | 0.9072 | 0.9061 | 0.9066 | 0.9070 | 0.9064 |
| Award ID | 0.8538 | 0.8517 | 0.8527 | 0.8534 | 0.8523 |
| Scheme | 0.6871 | 0.6257 | 0.6550 | 0.6739 | 0.6434 |
| Title | 0.7500 | 0.2093 | 0.3273 | 0.4945 | 0.2690 |

Strict (token_sort_ratio only)

| Field | P | R | F1 | F0.5 | F1.5 |
|-------|---|---|----|----|------|
| Funder | 0.8863 | 0.8842 | 0.8852 | 0.8859 | 0.8848 |
| Award ID | 0.8439 | 0.8418 | 0.8428 | 0.8434 | 0.8424 |
| Scheme | 0.6074 | 0.5531 | 0.5789 | 0.5957 | 0.5687 |
| Title | 0.7083 | 0.1977 | 0.3091 | 0.4670 | 0.2540 |

### Comparison to the Llama 3.1 8B baseline

Balanced-mode F1 on the two test sets reported by the Llama baseline card:

**arxiv_test (300 examples)**

| Field | Llama 3.1 8B | Qwen3.5-9B | Δ |
|-------|:---:|:---:|:---:|
| Funder | 0.9001 | 0.8921 | −0.008 |
| Award ID | 0.8780 | 0.8810 | +0.003 |
| Scheme | 0.6466 | 0.7266 | +0.080 |
| Title | 0.5316 | 0.5507 | +0.019 |

**`test_degraded` (1,288 examples)**

| Field | Llama 3.1 8B | Qwen3.5-9B | Δ |
|-------|:---:|:---:|:---:|
| Funder | 0.8999 | 0.8953 | −0.005 |
| Award ID | 0.8477 | 0.8403 | −0.007 |
| Scheme | 0.6370 | 0.6417 | +0.005 |
| Title | 0.4110 | 0.3011 | −0.110 |

Funder and award ID (the reward-weighted fields) are within 0.008 F1 of the Llama baseline on both sets. Scheme and title carry zero reward weight.
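To make the reward weighting concrete, here is a minimal sketch of the F-beta score, the weighted sum from the Stage 2 reward formula, and the award-ID normalization described there. The function names (`f_beta`, `combined_reward`, `normalize_award_id`) are illustrative and are not taken from the training code; the component scores are assumed to be computed elsewhere by the matching logic, and only ASCII hyphens and slashes are assumed to be stripped. Plugging in the precision and recall from the arxiv_test Permissive Title row recovers the tabulated F1, F0.5, and F1.5 to within rounding.

```python
import re


def f_beta(precision: float, recall: float, beta: float) -> float:
    """Standard F-beta: beta < 1 favors precision, beta > 1 favors recall."""
    if precision <= 0.0 or recall <= 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


def combined_reward(funder_f05: float, hier_award_f05: float, flat_award_f05: float) -> float:
    """Weighted sum from the Stage 2 formula; scheme and title do not appear in it."""
    return 0.50 * funder_f05 + 0.40 * hier_award_f05 + 0.10 * flat_award_f05


def normalize_award_id(award_id: str) -> str:
    """Strip whitespace, hyphens, and slashes, then uppercase (assumed ASCII only)."""
    return re.sub(r"[\s\-/]", "", award_id).upper()


# Title row of the arxiv_test Permissive table: P = 0.9048, R = 0.3958.
p, r = 0.9048, 0.3958
for beta in (1.0, 0.5, 1.5):
    print(f"F{beta:g} = {f_beta(p, r, beta):.4f}")

print(normalize_award_id("dms-1613 002"))
```

Because the weights sum to 1.0, a perfect extraction scores 1.0, while a response that finds every award ID but attaches all of them to the wrong funders would still receive the 0.10 flat term.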
## Usage

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

adapter_id = "cometadata/funding-extraction-qwen3.5-9B-artifact-data-mix-grpo-mixed-reward"

# Load the base model, then attach the LoRA adapter on top of it.
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-9B")
model = PeftModel.from_pretrained(base_model, adapter_id)
tokenizer = AutoTokenizer.from_pretrained(adapter_id)

prompt = """Extract funding information from the following statement:

This work was supported by the National Science Foundation under grant DMS-1613002 and by the NIH (R01-AI123456)."""

messages = [
    {"role": "system", "content": "You are an expert at extracting structured funding metadata from academic papers. Given a funding statement, extract all funders and their associated awards. Return a JSON array of funder objects. Each funder has:\n- \"funder_name\": string or null\n- \"awards\": array of objects with \"award_ids\" (array of strings), \"funding_scheme\" (array of strings), and \"award_title\" (array of strings)\nReturn ONLY the JSON array, no other text."},
    {"role": "user", "content": prompt},
]

# Model trained with thinking disabled; keep enable_thinking=False.
# return_dict=True is requested explicitly so the result always carries
# input_ids and attention_mask, regardless of the library's default.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    enable_thinking=False,
    return_tensors="pt",
    return_dict=True,
)

outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# Decode only the newly generated tokens, not the echoed prompt.
prompt_len = inputs["input_ids"].shape[-1]
print(tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True))
```

## Output Format

```json
[
  {
    "funder_name": "National Science Foundation",
    "awards": [
      {
        "award_ids": ["DMS-1613002"],
        "funding_scheme": [],
        "award_title": []
      }
    ]
  },
  {
    "funder_name": "NIH",
    "awards": [
      {
        "award_ids": ["R01-AI123456"],
        "funding_scheme": [],
        "award_title": []
      }
    ]
  }
]
```
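Downstream code often needs the extracted awards as flat records rather than nested objects. The sketch below parses a JSON array in the output format and yields one `(funder_name, award_id)` pair per award ID. `flatten_awards` is a hypothetical helper, not part of the adapter, and it assumes the model returned the schema described in the system prompt; invalid JSON raises `json.JSONDecodeError`, which callers may want to catch.

```python
import json
from typing import Iterator, Optional, Tuple


def flatten_awards(raw_output: str) -> Iterator[Tuple[Optional[str], str]]:
    """Yield (funder_name, award_id) pairs from the model's JSON output."""
    funders = json.loads(raw_output)
    for funder in funders:
        name = funder.get("funder_name")  # may be None per the schema
        for award in funder.get("awards", []):
            for award_id in award.get("award_ids", []):
                yield name, award_id


example = """[
  {"funder_name": "National Science Foundation",
   "awards": [{"award_ids": ["DMS-1613002"], "funding_scheme": [], "award_title": []}]},
  {"funder_name": "NIH",
   "awards": [{"award_ids": ["R01-AI123456"], "funding_scheme": [], "award_title": []}]}
]"""

for funder_name, award_id in flatten_awards(example):
    print(f"{funder_name}\t{award_id}")
```

Funders with an empty `awards` list produce no pairs, so a caller that also needs funder-only mentions should iterate over the parsed array directly.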