Text Generation
PEFT
Safetensors
English
funding-extraction
lora
grpo
rl
scholarly-metadata
conversational
Instructions to use cometadata/funding-extraction-qwen3.5-9B-non-thinking-artifact-data-mix-grpo-mixed-reward with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- PEFT
How to use cometadata/funding-extraction-qwen3.5-9B-non-thinking-artifact-data-mix-grpo-mixed-reward with PEFT:
from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-9B")
model = PeftModel.from_pretrained(base_model, "cometadata/funding-extraction-qwen3.5-9B-non-thinking-artifact-data-mix-grpo-mixed-reward")
- Notebooks
- Google Colab
- Kaggle
Add funding-entity-extraction-dataset-mix eval results (test, degraded/synthetic_edges, with_context) + extend Llama comparison
README.md CHANGED
@@ -84,9 +84,102 @@ Trained on the [`cometadata/funding-extraction-artifact-data-mix-grpo-mixed-rewa
@@ -95,7 +188,16 @@ Same `arxiv_test.jsonl` (300 examples), same evaluation harness and pipeline. Ba
Inference on the 300 examples produced 100% parseable JSON (no truncations), averaging 126 output tokens per example.
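
A parseability figure of this kind can be checked along the following lines. This is a minimal sketch, not the card's evaluation harness: the `parse_rate` helper, the sample strings, and the field names inside them are hypothetical, and it assumes an output counts as parseable when `json.loads` accepts the whole string.

```python
import json


def parse_rate(outputs):
    """Return the fraction of raw model outputs that parse as JSON.

    Assumption: "parseable" means json.loads succeeds on the full string.
    The actual harness may apply extra cleanup before parsing.
    """
    if not outputs:
        return 0.0
    ok = 0
    for text in outputs:
        try:
            json.loads(text)
            ok += 1
        except json.JSONDecodeError:
            # Truncated or malformed generations end up here.
            pass
    return ok / len(outputs)


# Hypothetical outputs: two valid documents and one truncated one.
samples = ['{"funders": []}', '[{"funder_name": "NSF"}]', '{"funders": [']
print(f"{parse_rate(samples):.2%} parseable")
```

A truncated generation typically fails this check because its brackets are left unclosed, which is why parse rate and truncation are reported together.
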

### funding-entity-extraction-dataset-mix test sets

Evaluated on the held-out test sets from [`cometadata/funding-entity-extraction-dataset-mix`](https://huggingface.co/datasets/cometadata/funding-entity-extraction-dataset-mix) with the same evaluation harness. The model produced 100% parseable JSON across all 1,957 examples. For `test_with_context`, the model is given the funding statement embedded in its surrounding document text (the `full_text` field). F1 on the primary fields (funder, award ID) is maintained, and is in fact the highest of the three sets, which suggests the model is not distracted by surrounding paper content.

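The F1, F0.5 and F1.5 columns in the tables of this section are consistent with the standard F-beta score computed from precision and recall. The sketch below assumes that definition; the `f_beta` helper is illustrative and not taken from the harness, which may compute scores from raw match counts and so differ in the last digit.

```python
def f_beta(precision, recall, beta=1.0):
    """Standard F-beta score: beta > 1 favours recall, beta < 1 favours precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Funder row of the permissive table for test.jsonl (P and R as published).
p, r = 0.9376, 0.8923
for beta in (1.0, 0.5, 1.5):
    print(f"F{beta:g} = {f_beta(p, r, beta):.4f}")
```

Reading F0.5 next to F1.5 shows whether a field leans on precision or on recall: for Title, F0.5 is well above F1.5 on every set, reflecting high precision and low recall.
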
#### `test.jsonl` (347 examples)

Permissive (partial_ratio + token_set, no damping)

| Field | P | R | F1 | F0.5 | F1.5 |
|-------|---|---|----|----|------|
| Funder | 0.9376 | 0.8923 | 0.9144 | 0.9282 | 0.9058 |
| Award ID | 0.8407 | 0.8339 | 0.8373 | 0.8394 | 0.8360 |
| Scheme | 0.4118 | 0.5927 | 0.4860 | 0.4385 | 0.5221 |
| Title | 0.1034 | 0.0170 | 0.0293 | 0.0514 | 0.0229 |

Balanced (length-damped + acronym detection)

| Field | P | R | F1 | F0.5 | F1.5 |
|-------|---|---|----|----|------|
| Funder | 0.9008 | 0.8555 | 0.8776 | 0.8913 | 0.8689 |
| Award ID | 0.8138 | 0.8072 | 0.8105 | 0.8125 | 0.8092 |
| Scheme | 0.3725 | 0.5363 | 0.4397 | 0.3968 | 0.4724 |
| Title | 0.0690 | 0.0114 | 0.0195 | 0.0342 | 0.0153 |

Strict (token_sort_ratio only)

| Field | P | R | F1 | F0.5 | F1.5 |
|-------|---|---|----|----|------|
| Funder | 0.8722 | 0.8276 | 0.8493 | 0.8629 | 0.8408 |
| Award ID | 0.7963 | 0.7898 | 0.7930 | 0.7949 | 0.7918 |
| Scheme | 0.3333 | 0.4798 | 0.3934 | 0.3550 | 0.4227 |
| Title | 0.0690 | 0.0114 | 0.0195 | 0.0342 | 0.0153 |

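The strict mode is labelled by a token-sort similarity, which ignores word order when comparing a predicted string with a gold string. The sketch below illustrates that idea using only the standard library. It is a stand-in, not the harness: `token_sort_ratio` here is a local approximation built on `difflib` (whose ratio differs from an edit-distance ratio), and the `is_match` helper with its threshold of 80 is hypothetical.

```python
from difflib import SequenceMatcher


def token_sort_ratio(a, b):
    """Order-insensitive similarity on a 0-100 scale.

    Stand-in only: tokens are lowercased, sorted, and rejoined before the
    comparison. difflib's ratio is not an edit-distance ratio, so the
    scores are indicative rather than comparable to the published ones.
    """
    norm_a = " ".join(sorted(a.lower().split()))
    norm_b = " ".join(sorted(b.lower().split()))
    return 100.0 * SequenceMatcher(None, norm_a, norm_b).ratio()


def is_match(pred, gold, threshold=80.0):
    # The threshold is a hypothetical value, not taken from the harness.
    return token_sort_ratio(pred, gold) >= threshold


# Same tokens in a different order count as a match.
print(is_match("Science Foundation National", "National Science Foundation"))
# Unrelated funder names do not.
print(is_match("National Science Foundation", "Deutsche Forschungsgemeinschaft"))
```

Because only token order is normalised, an abbreviation such as "NSF" would not match its expanded name under this measure, which is consistent with strict scores being the lowest of the three modes.
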
#### `test_degraded.jsonl` (1,288 examples; the `synthetic_edges` set from the Llama baseline card)

Permissive (partial_ratio + token_set, no damping)

| Field | P | R | F1 | F0.5 | F1.5 |
|-------|---|---|----|----|------|
| Funder | 0.9285 | 0.9216 | 0.9250 | 0.9271 | 0.9237 |
| Award ID | 0.8586 | 0.8560 | 0.8573 | 0.8581 | 0.8568 |
| Scheme | 0.7413 | 0.6704 | 0.7041 | 0.7260 | 0.6907 |
| Title | 0.7723 | 0.2267 | 0.3506 | 0.5214 | 0.2897 |

Balanced (length-damped + acronym detection)

| Field | P | R | F1 | F0.5 | F1.5 |
|-------|---|---|----|----|------|
| Funder | 0.9001 | 0.8906 | 0.8953 | 0.8981 | 0.8935 |
| Award ID | 0.8416 | 0.8390 | 0.8403 | 0.8411 | 0.8398 |
| Scheme | 0.6757 | 0.6110 | 0.6417 | 0.6617 | 0.6296 |
| Title | 0.6634 | 0.1948 | 0.3011 | 0.4479 | 0.2489 |

Strict (token_sort_ratio only)

| Field | P | R | F1 | F0.5 | F1.5 |
|-------|---|---|----|----|------|
| Funder | 0.8801 | 0.8690 | 0.8745 | 0.8778 | 0.8724 |
| Award ID | 0.8317 | 0.8291 | 0.8304 | 0.8312 | 0.8299 |
| Scheme | 0.6039 | 0.5461 | 0.5735 | 0.5913 | 0.5627 |
| Title | 0.6139 | 0.1802 | 0.2787 | 0.4144 | 0.2303 |

#### `test_with_context.jsonl` (322 examples; funding statement embedded in surrounding document text, avg 1,143 vs 375 chars)

Permissive (partial_ratio + token_set, no damping)

| Field | P | R | F1 | F0.5 | F1.5 |
|-------|---|---|----|----|------|
| Funder | 0.9348 | 0.9383 | 0.9365 | 0.9355 | 0.9372 |
| Award ID | 0.8711 | 0.8690 | 0.8700 | 0.8707 | 0.8696 |
| Scheme | 0.7515 | 0.6844 | 0.7164 | 0.7371 | 0.7037 |
| Title | 0.8750 | 0.2442 | 0.3818 | 0.5769 | 0.3138 |

Balanced (length-damped + acronym detection)

| Field | P | R | F1 | F0.5 | F1.5 |
|-------|---|---|----|----|------|
| Funder | 0.9072 | 0.9061 | 0.9066 | 0.9070 | 0.9064 |
| Award ID | 0.8538 | 0.8517 | 0.8527 | 0.8534 | 0.8523 |
| Scheme | 0.6871 | 0.6257 | 0.6550 | 0.6739 | 0.6434 |
| Title | 0.7500 | 0.2093 | 0.3273 | 0.4945 | 0.2690 |

Strict (token_sort_ratio only)

| Field | P | R | F1 | F0.5 | F1.5 |
|-------|---|---|----|----|------|
| Funder | 0.8863 | 0.8842 | 0.8852 | 0.8859 | 0.8848 |
| Award ID | 0.8439 | 0.8418 | 0.8428 | 0.8434 | 0.8424 |
| Scheme | 0.6074 | 0.5531 | 0.5789 | 0.5957 | 0.5687 |
| Title | 0.7083 | 0.1977 | 0.3091 | 0.4670 | 0.2540 |

### Comparison to the Llama 3.1 8B baseline

Both test sets reported on the Llama baseline card are scored here with the same harness and pipeline. The tables show balanced-mode F1.

**arxiv_test (300 examples)**

| Field | Llama 3.1 8B | Qwen3.5-9B | Δ |
|-------|:---:|:---:|:---:|
| Scheme | 0.6466 | 0.7266 | +0.080 |
| Title | 0.5316 | 0.5507 | +0.019 |

**synthetic_edges / `test_degraded` (1,288 examples)**

| Field | Llama 3.1 8B | Qwen3.5-9B | Δ |
|-------|:---:|:---:|:---:|
| Funder | 0.8999 | 0.8953 | −0.005 |
| Award ID | 0.8477 | 0.8403 | −0.007 |
| Scheme | 0.6370 | 0.6417 | +0.005 |
| Title | 0.4110 | 0.3011 | −0.110 |

On both sets, the two RL-optimized fields (funder, award ID) are effectively tied with the Llama baseline (differences of at most 0.008 F1, within run-to-run noise). The unweighted secondary fields are mixed: scheme is comparable to better (+0.080 on arxiv_test, +0.005 on the degraded set), while title is the one regression, and only on the degraded set (−0.110, against +0.019 on arxiv_test). Both secondary fields carry zero reward weight, and Qwen extracts titles conservatively (high precision, low recall).

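The Δ column and the tie criterion can be re-derived from the published scores. The sketch below does this for the synthetic_edges / `test_degraded` comparison; the `deltas` helper and the `TIE_TOLERANCE` constant are illustrative names, and the tolerance simply mirrors the 0.008 F1 figure quoted in the text rather than any statistical test.

```python
# Balanced-mode F1 on synthetic_edges / test_degraded: (Llama 3.1 8B, Qwen3.5-9B).
scores = {
    "Funder": (0.8999, 0.8953),
    "Award ID": (0.8477, 0.8403),
    "Scheme": (0.6370, 0.6417),
    "Title": (0.4110, 0.3011),
}


def deltas(table):
    """Return Qwen minus Llama for each field, rounded to three decimals."""
    return {field: round(qwen - llama, 3) for field, (llama, qwen) in table.items()}


result = deltas(scores)
for field, delta in result.items():
    print(f"{field:<9} {delta:+.3f}")

# Hypothetical tolerance taken from the "0.008 F1" remark in the text.
TIE_TOLERANCE = 0.008
tied = [f for f in ("Funder", "Award ID") if abs(result[f]) <= TIE_TOLERANCE]
print("Within tolerance:", tied)
```
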
## Usage