cometadata/funding-extraction-qwen3.5-9B-non-thinking-artifact-data-mix-grpo-mixed-reward

@@ -84,9 +84,102 @@ Trained on the [`cometadata/funding-extraction-artifact-data-mix-grpo-mixed-rewa
 Inference on the 300 examples produced 100% parseable JSON (no truncations), averaging 126 output tokens per example.
 ### Comparison to the Llama 3.1 8B baseline
-Same `arxiv_test.jsonl` (300 examples), same evaluation harness and pipeline. Balanced-mode F1:
 | Field | Llama 3.1 8B | Qwen3.5-9B | Δ |
 |-------|:---:|:---:|:---:|
@@ -95,7 +188,16 @@ Same `arxiv_test.jsonl` (300 examples), same evaluation harness and pipeline. Ba
 | Scheme | 0.6466 | 0.7266 | +0.080 |
 | Title | 0.5316 | 0.5507 | +0.019 |
-The two RL-optimized fields (funder, award ID) are statistically tied with the Llama baseline (≤0.008 F1, within run-to-run noise), while the un-weighted secondary fields (scheme, title) improve — most clearly scheme, consistently across Permissive/Balanced/Strict modes.
 ## Usage

 Inference on the 300 examples produced 100% parseable JSON (no truncations), averaging 126 output tokens per example.
+### funding-entity-extraction-dataset-mix test sets
+Evaluated on the held-out test sets from [`cometadata/funding-entity-extraction-dataset-mix`](https://huggingface.co/datasets/cometadata/funding-entity-extraction-dataset-mix), using the same evaluation harness. The model produced 100% parseable JSON across all 1,957 examples. For `test_with_context`, the model is given the funding statement embedded in its surrounding document text (the `full_text` field). Performance on the primary fields (funder, award ID) holds up there and is in fact the highest of the three sets, which suggests the model is not distracted by surrounding paper content.
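One way to check a parseable-JSON figure like the one above is to try `json.loads` on each raw completion and count the successes. The sketch below is a minimal illustration, not the evaluation harness itself; the function name `parse_rate`, the sample strings, and the field names inside them are hypothetical.

```python
import json


def parse_rate(outputs):
    """Return the fraction of strings in `outputs` that parse as JSON."""
    if not outputs:
        return 0.0
    ok = 0
    for text in outputs:
        try:
            json.loads(text)
            ok += 1
        except json.JSONDecodeError:
            # Truncated or malformed completions land here.
            pass
    return ok / len(outputs)


# Hypothetical completions (field names are made up): two valid, one truncated.
samples = [
    '[{"funder": "NSF", "award_ids": ["1234567"]}]',
    "[]",
    '[{"funder": "NSF", "award_ids": [',
]
print(f"{parse_rate(samples):.2%}")
```

A completion cut off by the token limit fails to parse, so a 100% rate also implies that no output was truncated mid-object.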
+#### `test.jsonl` (347 examples)
+Permissive (partial_ratio + token_set, no damping)
+| Field | P | R | F1 | F0.5 | F1.5 |
+|-------|---|---|----|----|------|
+| Funder | 0.9376 | 0.8923 | 0.9144 | 0.9282 | 0.9058 |
+| Award ID | 0.8407 | 0.8339 | 0.8373 | 0.8394 | 0.8360 |
+| Scheme | 0.4118 | 0.5927 | 0.4860 | 0.4385 | 0.5221 |
+| Title | 0.1034 | 0.0170 | 0.0293 | 0.0514 | 0.0229 |
+Balanced (length-damped + acronym detection)
+| Field | P | R | F1 | F0.5 | F1.5 |
+|-------|---|---|----|----|------|
+| Funder | 0.9008 | 0.8555 | 0.8776 | 0.8913 | 0.8689 |
+| Award ID | 0.8138 | 0.8072 | 0.8105 | 0.8125 | 0.8092 |
+| Scheme | 0.3725 | 0.5363 | 0.4397 | 0.3968 | 0.4724 |
+| Title | 0.0690 | 0.0114 | 0.0195 | 0.0342 | 0.0153 |
+Strict (token_sort_ratio only)
+| Field | P | R | F1 | F0.5 | F1.5 |
+|-------|---|---|----|----|------|
+| Funder | 0.8722 | 0.8276 | 0.8493 | 0.8629 | 0.8408 |
+| Award ID | 0.7963 | 0.7898 | 0.7930 | 0.7949 | 0.7918 |
+| Scheme | 0.3333 | 0.4798 | 0.3934 | 0.3550 | 0.4227 |
+| Title | 0.0690 | 0.0114 | 0.0195 | 0.0342 | 0.0153 |
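The F1, F0.5 and F1.5 columns are consistent with the standard F-beta formula applied to the P and R columns. The sketch below recomputes the permissive-mode Funder row of `test.jsonl` as a sanity check. The helper name `f_beta` is illustrative, and because the published P and R are rounded to four places, the recomputed values agree with the table only up to rounding.

```python
def f_beta(precision, recall, beta=1.0):
    """Standard F-beta: beta > 1 favors recall, beta < 1 favors precision."""
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + b2) * precision * recall / denom


# Funder row, permissive mode, test.jsonl (P and R copied from the table).
p, r = 0.9376, 0.8923
for beta in (1.0, 0.5, 1.5):
    print(f"F{beta:g} = {f_beta(p, r, beta):.4f}")
```

Since F0.5 weights precision more heavily and F1.5 weights recall, a field with high precision and low recall, such as Title, scores noticeably higher on F0.5 than on F1.5.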
+#### `test_degraded.jsonl` (1,288 examples — the `synthetic_edges` set from the Llama baseline card)
+Permissive (partial_ratio + token_set, no damping)
+| Field | P | R | F1 | F0.5 | F1.5 |
+|-------|---|---|----|----|------|
+| Funder | 0.9285 | 0.9216 | 0.9250 | 0.9271 | 0.9237 |
+| Award ID | 0.8586 | 0.8560 | 0.8573 | 0.8581 | 0.8568 |
+| Scheme | 0.7413 | 0.6704 | 0.7041 | 0.7260 | 0.6907 |
+| Title | 0.7723 | 0.2267 | 0.3506 | 0.5214 | 0.2897 |
+Balanced (length-damped + acronym detection)
+| Field | P | R | F1 | F0.5 | F1.5 |
+|-------|---|---|----|----|------|
+| Funder | 0.9001 | 0.8906 | 0.8953 | 0.8981 | 0.8935 |
+| Award ID | 0.8416 | 0.8390 | 0.8403 | 0.8411 | 0.8398 |
+| Scheme | 0.6757 | 0.6110 | 0.6417 | 0.6617 | 0.6296 |
+| Title | 0.6634 | 0.1948 | 0.3011 | 0.4479 | 0.2489 |
+Strict (token_sort_ratio only)
+| Field | P | R | F1 | F0.5 | F1.5 |
+|-------|---|---|----|----|------|
+| Funder | 0.8801 | 0.8690 | 0.8745 | 0.8778 | 0.8724 |
+| Award ID | 0.8317 | 0.8291 | 0.8304 | 0.8312 | 0.8299 |
+| Scheme | 0.6039 | 0.5461 | 0.5735 | 0.5913 | 0.5627 |
+| Title | 0.6139 | 0.1802 | 0.2787 | 0.4144 | 0.2303 |
+#### `test_with_context.jsonl` (322 examples — funding statement embedded in surrounding document text, avg 1,143 vs 375 chars)
+Permissive (partial_ratio + token_set, no damping)
+| Field | P | R | F1 | F0.5 | F1.5 |
+|-------|---|---|----|----|------|
+| Funder | 0.9348 | 0.9383 | 0.9365 | 0.9355 | 0.9372 |
+| Award ID | 0.8711 | 0.8690 | 0.8700 | 0.8707 | 0.8696 |
+| Scheme | 0.7515 | 0.6844 | 0.7164 | 0.7371 | 0.7037 |
+| Title | 0.8750 | 0.2442 | 0.3818 | 0.5769 | 0.3138 |
+Balanced (length-damped + acronym detection)
+| Field | P | R | F1 | F0.5 | F1.5 |
+|-------|---|---|----|----|------|
+| Funder | 0.9072 | 0.9061 | 0.9066 | 0.9070 | 0.9064 |
+| Award ID | 0.8538 | 0.8517 | 0.8527 | 0.8534 | 0.8523 |
+| Scheme | 0.6871 | 0.6257 | 0.6550 | 0.6739 | 0.6434 |
+| Title | 0.7500 | 0.2093 | 0.3273 | 0.4945 | 0.2690 |
+Strict (token_sort_ratio only)
+| Field | P | R | F1 | F0.5 | F1.5 |
+|-------|---|---|----|----|------|
+| Funder | 0.8863 | 0.8842 | 0.8852 | 0.8859 | 0.8848 |
+| Award ID | 0.8439 | 0.8418 | 0.8428 | 0.8434 | 0.8424 |
+| Scheme | 0.6074 | 0.5531 | 0.5789 | 0.5957 | 0.5687 |
+| Title | 0.7083 | 0.1977 | 0.3091 | 0.4670 | 0.2540 |
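The mode labels above reference fuzzy-matching scorers by name. As a rough illustration of what a token-sort comparison does, the sketch below sorts the whitespace tokens of both strings before scoring them, so that word order stops mattering. It uses `difflib` from the standard library as a stand-in scorer; that choice is an assumption made for illustration and is not the harness's implementation, and the helper names are hypothetical.

```python
from difflib import SequenceMatcher


def plain_ratio(a, b):
    """Character-level similarity in [0, 1] on lowercased strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_sort_similarity(a, b):
    """Sort whitespace-separated tokens, then compare the rejoined strings."""
    a_sorted = " ".join(sorted(a.lower().split()))
    b_sorted = " ".join(sorted(b.lower().split()))
    return SequenceMatcher(None, a_sorted, b_sorted).ratio()


gold = "National Science Foundation"
pred = "Science Foundation National"
print(plain_ratio(gold, pred))            # below 1.0: word order differs
print(token_sort_similarity(gold, pred))  # 1.0: same tokens after sorting
```

A token-sort scorer forgives reordering but, unlike partial or token-set matching, it still penalizes extra or missing words, which is consistent with the Strict rows being the lowest of the three modes.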
 ### Comparison to the Llama 3.1 8B baseline
+Both test sets reported on the Llama baseline card were scored with the same harness and pipeline. Balanced-mode F1:
+**arxiv_test (300 examples)**
 | Field | Llama 3.1 8B | Qwen3.5-9B | Δ |
 |-------|:---:|:---:|:---:|
 | Scheme | 0.6466 | 0.7266 | +0.080 |
 | Title | 0.5316 | 0.5507 | +0.019 |
+**synthetic_edges / `test_degraded` (1,288 examples)**
+| Field | Llama 3.1 8B | Qwen3.5-9B | Δ |
+|-------|:---:|:---:|:---:|
+| Funder | 0.8999 | 0.8953 | −0.005 |
+| Award ID | 0.8477 | 0.8403 | −0.007 |
+| Scheme | 0.6370 | 0.6417 | +0.005 |
+| Title | 0.4110 | 0.3011 | −0.110 |
+On both sets, the two RL-optimized fields (funder, award ID) are statistically tied with the Llama baseline (≤0.008 F1, within run-to-run noise). The un-weighted secondary fields are mixed: scheme is better on arxiv_test (+0.080) and roughly level on the degraded set (+0.005), while title improves slightly on arxiv_test (+0.019) but regresses on the degraded set (−0.110). Both secondary fields carry zero reward weight, and Qwen extracts titles conservatively (high precision, low recall).
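The Δ column and the "tied" reading can be reproduced from the table directly. The sketch below recomputes the `test_degraded` deltas as Qwen minus Llama and flags which ones fall within 0.008 F1. Treating the 0.008 figure quoted above as a tolerance is an assumption of this sketch, and the names `NOISE` and `compare` are hypothetical.

```python
NOISE = 0.008  # assumed tolerance, taken from the figure quoted in the text

# Balanced-mode F1 on test_degraded: (Llama 3.1 8B, Qwen3.5-9B), from the table.
scores = {
    "Funder": (0.8999, 0.8953),
    "Award ID": (0.8477, 0.8403),
    "Scheme": (0.6370, 0.6417),
    "Title": (0.4110, 0.3011),
}


def compare(scores, noise=NOISE):
    """Return {field: (delta, within_noise)} with delta = Qwen minus Llama."""
    out = {}
    for field, (llama, qwen) in scores.items():
        delta = qwen - llama
        out[field] = (round(delta, 3), abs(delta) <= noise)
    return out


for field, (delta, tied) in compare(scores).items():
    print(f"{field:9s} {delta:+.3f} {'within noise' if tied else 'outside noise'}")
```

Under that tolerance, only Title falls outside the noise band on the degraded set.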
 ## Usage