cometadata/funding-extraction-qwen3.5-9B-non-thinking-artifact-data-mix-grpo-mixed-reward

@@ -29,8 +29,8 @@ Trained on the [`cometadata/funding-extraction-artifact-data-mix-grpo-mixed-rewa
 - **Base model:** `Qwen/Qwen3.5-9B`
 - **Data (`data/sft/`):** 3,528 real + 7,240 synthetic funding statements with gold-standard funder/award labels (synthetic upsampled 2×)
-- **Data augmentation:** 50% of training examples augmented with synthetic noise (OCR-like case errors, digit/letter swaps, Unicode artifacts, XML/HTML tags, LaTeX markup) for robustness to real-world document formats
-- **Renderer:** `qwen3_5_disable_thinking` — the model is trained to emit JSON directly (no chain-of-thought), so inference should disable thinking (see [Usage](#usage))
 - **LoRA rank:** 128
 - **Epochs:** 2
 - **Result:** eval NLL 0.116 → 0.0035 over 252 steps
@@ -82,11 +82,11 @@ Trained on the [`cometadata/funding-extraction-artifact-data-mix-grpo-mixed-rewa
 | Scheme | 0.6667 | 0.7438 | 0.7031 | 0.6808 | 0.7182 |
 | Title | 0.8095 | 0.3542 | 0.4928 | 0.6439 | 0.4283 |
-Inference on the 300 examples produced 100% parseable JSON (no truncations), averaging 126 output tokens per example.
 ### funding-entity-extraction-dataset-mix test sets
-Evaluated on the held-out test sets from [`cometadata/funding-entity-extraction-dataset-mix`](https://huggingface.co/datasets/cometadata/funding-entity-extraction-dataset-mix), same evaluation harness. 100% parseable JSON across all 1,957 examples. For `test_with_context`, the model is given the funding statement embedded in its surrounding document text (the `full_text` field) — performance on the primary fields is maintained (in fact highest of the three sets), showing the model is not distracted by surrounding paper content.
 #### `test.jsonl` (347 examples)
@@ -117,7 +117,7 @@ Strict (token_sort_ratio only)
 | Scheme | 0.3333 | 0.4798 | 0.3934 | 0.3550 | 0.4227 |
 | Title | 0.0690 | 0.0114 | 0.0195 | 0.0342 | 0.0153 |
-#### `test_degraded.jsonl` (1,288 examples — the `synthetic_edges` set from the Llama baseline card)
 Permissive (partial_ratio + token_set, no damping)
@@ -146,7 +146,7 @@ Strict (token_sort_ratio only)
 | Scheme | 0.6039 | 0.5461 | 0.5735 | 0.5913 | 0.5627 |
 | Title | 0.6139 | 0.1802 | 0.2787 | 0.4144 | 0.2303 |
-#### `test_with_context.jsonl` (322 examples — funding statement embedded in surrounding document text, avg 1,143 vs 375 chars)
 Permissive (partial_ratio + token_set, no damping)
@@ -177,7 +177,7 @@ Strict (token_sort_ratio only)
 ### Comparison to the Llama 3.1 8B baseline
-Both test sets the Llama baseline card reports, scored with the same harness and pipeline. Balanced-mode F1:
 **arxiv_test (300 examples)**
@@ -188,7 +188,7 @@ Both test sets the Llama baseline card reports, scored with the same harness and
 | Scheme | 0.6466 | 0.7266 | +0.080 |
 | Title | 0.5316 | 0.5507 | +0.019 |
-**synthetic_edges / `test_degraded` (1,288 examples)**
 | Field | Llama 3.1 8B | Qwen3.5-9B | Δ |
 |-------|:---:|:---:|:---:|
@@ -197,7 +197,7 @@ Both test sets the Llama baseline card reports, scored with the same harness and
 | Scheme | 0.6370 | 0.6417 | +0.005 |
 | Title | 0.4110 | 0.3011 | −0.110 |
-On both sets the two RL-optimized fields (funder, award ID) are statistically tied with the Llama baseline (≤0.008 F1, within run-to-run noise). The un-weighted secondary fields are mixed — scheme is comparable-to-better, while title is the one regression (notably on the degraded set); both carry zero reward weight, and Qwen extracts titles conservatively (high precision, low recall).
 ## Usage
@@ -218,7 +218,7 @@ messages = [
     {"role": "user", "content": prompt},
 ]
-# Trained without chain-of-thought: disable thinking to match the training distribution.
 inputs = tokenizer.apply_chat_template(
     messages, return_tensors="pt", add_generation_prompt=True, enable_thinking=False
 )

 - **Base model:** `Qwen/Qwen3.5-9B`
 - **Data (`data/sft/`):** 3,528 real + 7,240 synthetic funding statements with gold-standard funder/award labels (synthetic upsampled 2×)
+- **Data augmentation:** 50% of training examples augmented with synthetic noise (OCR-like case errors, digit/letter swaps, Unicode artifacts, XML/HTML tags, LaTeX markup)
+- **Renderer:** `qwen3_5_disable_thinking` (no chain-of-thought; keep thinking disabled at inference, see [Usage](#usage))
 - **LoRA rank:** 128
 - **Epochs:** 2
 - **Result:** eval NLL 0.116 → 0.0035 over 252 steps
 | Scheme | 0.6667 | 0.7438 | 0.7031 | 0.6808 | 0.7182 |
 | Title | 0.8095 | 0.3542 | 0.4928 | 0.6439 | 0.4283 |
+All 300 outputs were valid JSON.
 ### funding-entity-extraction-dataset-mix test sets
+Evaluated on the held-out test sets from [`cometadata/funding-entity-extraction-dataset-mix`](https://huggingface.co/datasets/cometadata/funding-entity-extraction-dataset-mix) with the same evaluation harness. `test_with_context` uses the `full_text` field (the funding statement with its surrounding document text) as the model input.
 #### `test.jsonl` (347 examples)
 | Scheme | 0.3333 | 0.4798 | 0.3934 | 0.3550 | 0.4227 |
 | Title | 0.0690 | 0.0114 | 0.0195 | 0.0342 | 0.0153 |
+#### `test_degraded.jsonl` (1,288 examples)
 Permissive (partial_ratio + token_set, no damping)
 | Scheme | 0.6039 | 0.5461 | 0.5735 | 0.5913 | 0.5627 |
 | Title | 0.6139 | 0.1802 | 0.2787 | 0.4144 | 0.2303 |
+#### `test_with_context.jsonl` (322 examples)
 Permissive (partial_ratio + token_set, no damping)
 ### Comparison to the Llama 3.1 8B baseline
+Balanced-mode F1 on the two test sets reported by the Llama baseline card:
 **arxiv_test (300 examples)**
 | Scheme | 0.6466 | 0.7266 | +0.080 |
 | Title | 0.5316 | 0.5507 | +0.019 |
+**`test_degraded` (1,288 examples)**
 | Field | Llama 3.1 8B | Qwen3.5-9B | Δ |
 |-------|:---:|:---:|:---:|
 | Scheme | 0.6370 | 0.6417 | +0.005 |
 | Title | 0.4110 | 0.3011 | −0.110 |
+Funder and award ID (the reward-weighted fields) are within 0.008 F1 of the Llama baseline on both sets. Scheme and title carry zero reward weight.
 ## Usage
     {"role": "user", "content": prompt},
 ]
+# Model trained with thinking disabled; keep enable_thinking=False.
 inputs = tokenizer.apply_chat_template(
     messages, return_tensors="pt", add_generation_prompt=True, enable_thinking=False
 )
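
The card reports that every evaluation output was valid JSON. Below is a minimal sketch of how such a check could be run on decoded generations, assuming the generated tokens have already been decoded to plain strings. `try_parse_json` and `parseable_rate` are hypothetical helpers written for illustration, not part of the card's evaluation harness, and nothing is assumed about the keys of the output.

```python
import json
from typing import Any, Iterable, Tuple

# Three backticks, built at runtime so this snippet contains no literal fence.
TICKS = "`" * 3


def try_parse_json(text: str) -> Tuple[bool, Any]:
    """Return (True, value) if text is valid JSON, else (False, None).

    Hypothetical helper: tolerates an optional Markdown code fence
    around the JSON, which some chat models emit.
    """
    cleaned = text.strip()
    if cleaned.startswith(TICKS):
        lines = cleaned.splitlines()[1:]  # drop the opening fence line
        if lines and lines[-1].strip() == TICKS:
            lines = lines[:-1]  # drop the closing fence line
        cleaned = "\n".join(lines).strip()
    try:
        return True, json.loads(cleaned)
    except json.JSONDecodeError:
        return False, None


def parseable_rate(outputs: Iterable[str]) -> float:
    """Fraction of outputs that parse as JSON (0.0 for an empty input)."""
    results = [try_parse_json(o)[0] for o in outputs]
    return sum(results) / len(results) if results else 0.0
```

A truncated generation typically fails `json.loads`, so a rate below 1.0 on a batch can point to an output-length limit that is too small.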