Commit 370dad3 by adambuttrick (verified)
1 Parent(s): 390796c

Add funding-entity-extraction-dataset-mix eval results (test, degraded/synthetic_edges, with_context) + extend Llama comparison

  1. README.md +104 -2
README.md CHANGED
@@ -84,9 +84,102 @@ Trained on the [`cometadata/funding-extraction-artifact-data-mix-grpo-mixed-rewa
 
  Inference on the 300 examples produced 100% parseable JSON (no truncations), averaging 126 output tokens per example.
 
  ### Comparison to the Llama 3.1 8B baseline
 
- Same `arxiv_test.jsonl` (300 examples), same evaluation harness and pipeline. Balanced-mode F1:
 
  | Field | Llama 3.1 8B | Qwen3.5-9B | Δ |
  |-------|:---:|:---:|:---:|
@@ -95,7 +188,16 @@ Same `arxiv_test.jsonl` (300 examples), same evaluation harness and pipeline. Ba
  | Scheme | 0.6466 | 0.7266 | +0.080 |
  | Title | 0.5316 | 0.5507 | +0.019 |
 
- The two RL-optimized fields (funder, award ID) are statistically tied with the Llama baseline (≤0.008 F1, within run-to-run noise), while the un-weighted secondary fields (scheme, title) improve — most clearly scheme, consistently across Permissive/Balanced/Strict modes.
 
  ## Usage
 
 
  Inference on the 300 examples produced 100% parseable JSON (no truncations), averaging 126 output tokens per example.
 
+ ### funding-entity-extraction-dataset-mix test sets
+
+ Evaluated on the held-out test sets from [`cometadata/funding-entity-extraction-dataset-mix`](https://huggingface.co/datasets/cometadata/funding-entity-extraction-dataset-mix), same evaluation harness. 100% parseable JSON across all 1,957 examples. For `test_with_context`, the model is given the funding statement embedded in its surrounding document text (the `full_text` field) — performance on the primary fields is maintained (in fact highest of the three sets), showing the model is not distracted by surrounding paper content.
+
+ #### `test.jsonl` (347 examples)
+
+ Permissive (partial_ratio + token_set, no damping)
+
+ | Field | P | R | F1 | F0.5 | F1.5 |
+ |-------|---|---|----|----|------|
+ | Funder | 0.9376 | 0.8923 | 0.9144 | 0.9282 | 0.9058 |
+ | Award ID | 0.8407 | 0.8339 | 0.8373 | 0.8394 | 0.8360 |
+ | Scheme | 0.4118 | 0.5927 | 0.4860 | 0.4385 | 0.5221 |
+ | Title | 0.1034 | 0.0170 | 0.0293 | 0.0514 | 0.0229 |
+
+ Balanced (length-damped + acronym detection)
+
+ | Field | P | R | F1 | F0.5 | F1.5 |
+ |-------|---|---|----|----|------|
+ | Funder | 0.9008 | 0.8555 | 0.8776 | 0.8913 | 0.8689 |
+ | Award ID | 0.8138 | 0.8072 | 0.8105 | 0.8125 | 0.8092 |
+ | Scheme | 0.3725 | 0.5363 | 0.4397 | 0.3968 | 0.4724 |
+ | Title | 0.0690 | 0.0114 | 0.0195 | 0.0342 | 0.0153 |
+
+ Strict (token_sort_ratio only)
+
+ | Field | P | R | F1 | F0.5 | F1.5 |
+ |-------|---|---|----|----|------|
+ | Funder | 0.8722 | 0.8276 | 0.8493 | 0.8629 | 0.8408 |
+ | Award ID | 0.7963 | 0.7898 | 0.7930 | 0.7949 | 0.7918 |
+ | Scheme | 0.3333 | 0.4798 | 0.3934 | 0.3550 | 0.4227 |
+ | Title | 0.0690 | 0.0114 | 0.0195 | 0.0342 | 0.0153 |
+
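The F1, F0.5 and F1.5 columns in these tables appear consistent with the standard F-beta formula applied to the reported precision and recall. The harness's own implementation is not part of this commit, so the following is only a minimal sketch under that assumption; the function name `f_beta` is illustrative.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-beta score: weighted harmonic mean of precision and recall.

    beta < 1 weights precision more heavily, beta > 1 weights recall more heavily.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing was matched
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Funder row of the Permissive table for test.jsonl (values copied from the table).
p, r = 0.9376, 0.8923
for beta in (1.0, 0.5, 1.5):
    print(f"F{beta:g} = {f_beta(p, r, beta):.4f}")
```

For the Funder row this agrees with the tabulated F1, F0.5 and F1.5 to four decimal places. Rows with very small counts (such as Title) may differ in the last digit, because the table's P and R are themselves rounded.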
+ #### `test_degraded.jsonl` (1,288 examples — the `synthetic_edges` set from the Llama baseline card)
+
+ Permissive (partial_ratio + token_set, no damping)
+
+ | Field | P | R | F1 | F0.5 | F1.5 |
+ |-------|---|---|----|----|------|
+ | Funder | 0.9285 | 0.9216 | 0.9250 | 0.9271 | 0.9237 |
+ | Award ID | 0.8586 | 0.8560 | 0.8573 | 0.8581 | 0.8568 |
+ | Scheme | 0.7413 | 0.6704 | 0.7041 | 0.7260 | 0.6907 |
+ | Title | 0.7723 | 0.2267 | 0.3506 | 0.5214 | 0.2897 |
+
+ Balanced (length-damped + acronym detection)
+
+ | Field | P | R | F1 | F0.5 | F1.5 |
+ |-------|---|---|----|----|------|
+ | Funder | 0.9001 | 0.8906 | 0.8953 | 0.8981 | 0.8935 |
+ | Award ID | 0.8416 | 0.8390 | 0.8403 | 0.8411 | 0.8398 |
+ | Scheme | 0.6757 | 0.6110 | 0.6417 | 0.6617 | 0.6296 |
+ | Title | 0.6634 | 0.1948 | 0.3011 | 0.4479 | 0.2489 |
+
+ Strict (token_sort_ratio only)
+
+ | Field | P | R | F1 | F0.5 | F1.5 |
+ |-------|---|---|----|----|------|
+ | Funder | 0.8801 | 0.8690 | 0.8745 | 0.8778 | 0.8724 |
+ | Award ID | 0.8317 | 0.8291 | 0.8304 | 0.8312 | 0.8299 |
+ | Scheme | 0.6039 | 0.5461 | 0.5735 | 0.5913 | 0.5627 |
+ | Title | 0.6139 | 0.1802 | 0.2787 | 0.4144 | 0.2303 |
+
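The mode labels (`partial_ratio`, `token_set`, `token_sort_ratio`) share their names with scorers found in fuzzy-matching libraries such as RapidFuzz. The harness code is not shown in this commit, so the sketch below only illustrates the general idea behind a token-sort comparison, using the standard library's `difflib`. It is an approximation: it will not reproduce the exact scores of any particular library, and the function name `token_sort_similarity` is hypothetical.

```python
from difflib import SequenceMatcher


def token_sort_similarity(a: str, b: str) -> float:
    """Score two strings on a 0-100 scale after lowercasing and sorting tokens.

    Sorting the whitespace-separated tokens makes the score insensitive to
    word order, but not to missing, extra or abbreviated words.
    """
    norm_a = " ".join(sorted(a.lower().split()))
    norm_b = " ".join(sorted(b.lower().split()))
    return 100.0 * SequenceMatcher(None, norm_a, norm_b).ratio()


# Same words in a different order: identical after sorting.
reordered = token_sort_similarity("National Science Foundation",
                                  "science foundation national")

# An acronym shares almost no characters with the full name.
acronym = token_sort_similarity("National Science Foundation", "NSF")

print(reordered, acronym)
```

A word-order-only difference scores as a perfect match here, while an acronym scores very low. That is the kind of case a stricter string-only mode penalizes.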
+ #### `test_with_context.jsonl` (322 examples — funding statement embedded in surrounding document text, avg 1,143 vs 375 chars)
+
+ Permissive (partial_ratio + token_set, no damping)
+
+ | Field | P | R | F1 | F0.5 | F1.5 |
+ |-------|---|---|----|----|------|
+ | Funder | 0.9348 | 0.9383 | 0.9365 | 0.9355 | 0.9372 |
+ | Award ID | 0.8711 | 0.8690 | 0.8700 | 0.8707 | 0.8696 |
+ | Scheme | 0.7515 | 0.6844 | 0.7164 | 0.7371 | 0.7037 |
+ | Title | 0.8750 | 0.2442 | 0.3818 | 0.5769 | 0.3138 |
+
+ Balanced (length-damped + acronym detection)
+
+ | Field | P | R | F1 | F0.5 | F1.5 |
+ |-------|---|---|----|----|------|
+ | Funder | 0.9072 | 0.9061 | 0.9066 | 0.9070 | 0.9064 |
+ | Award ID | 0.8538 | 0.8517 | 0.8527 | 0.8534 | 0.8523 |
+ | Scheme | 0.6871 | 0.6257 | 0.6550 | 0.6739 | 0.6434 |
+ | Title | 0.7500 | 0.2093 | 0.3273 | 0.4945 | 0.2690 |
+
+ Strict (token_sort_ratio only)
+
+ | Field | P | R | F1 | F0.5 | F1.5 |
+ |-------|---|---|----|----|------|
+ | Funder | 0.8863 | 0.8842 | 0.8852 | 0.8859 | 0.8848 |
+ | Award ID | 0.8439 | 0.8418 | 0.8428 | 0.8434 | 0.8424 |
+ | Scheme | 0.6074 | 0.5531 | 0.5789 | 0.5957 | 0.5687 |
+ | Title | 0.7083 | 0.1977 | 0.3091 | 0.4670 | 0.2540 |
+
  ### Comparison to the Llama 3.1 8B baseline
 
+ Both test sets the Llama baseline card reports, scored with the same harness and pipeline. Balanced-mode F1:
+
+ **arxiv_test (300 examples)**
 
  | Field | Llama 3.1 8B | Qwen3.5-9B | Δ |
  |-------|:---:|:---:|:---:|
@@ -95,7 +188,16 @@ Same `arxiv_test.jsonl` (300 examples), same evaluation harness and pipeline. Ba
  | Scheme | 0.6466 | 0.7266 | +0.080 |
  | Title | 0.5316 | 0.5507 | +0.019 |
 
+ **synthetic_edges / `test_degraded` (1,288 examples)**
+
+ | Field | Llama 3.1 8B | Qwen3.5-9B | Δ |
+ |-------|:---:|:---:|:---:|
+ | Funder | 0.8999 | 0.8953 | −0.005 |
+ | Award ID | 0.8477 | 0.8403 | −0.007 |
+ | Scheme | 0.6370 | 0.6417 | +0.005 |
+ | Title | 0.4110 | 0.3011 | −0.110 |
+
+ On both sets the two RL-optimized fields (funder, award ID) are statistically tied with the Llama baseline (≤0.008 F1, within run-to-run noise). The un-weighted secondary fields are mixed — scheme is comparable-to-better, while title is the one regression (notably on the degraded set); both carry zero reward weight, and Qwen extracts titles conservatively (high precision, low recall).
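The Δ column and the "statistically tied" reading can be checked directly from the Balanced-mode F1 values on `synthetic_edges` / `test_degraded`. The sketch below assumes the 0.008 bound quoted in the text is used as a simple absolute-difference threshold; the variable names are illustrative and not taken from the evaluation harness.

```python
# Balanced-mode F1 on synthetic_edges / test_degraded, copied from the table.
llama = {"Funder": 0.8999, "Award ID": 0.8477, "Scheme": 0.6370, "Title": 0.4110}
qwen = {"Funder": 0.8953, "Award ID": 0.8403, "Scheme": 0.6417, "Title": 0.3011}

# Assumption: treat the quoted "<= 0.008 F1" noise bound as an absolute threshold.
TIE_THRESHOLD = 0.008

deltas = {field: round(qwen[field] - llama[field], 4) for field in llama}
tied = {field: abs(delta) <= TIE_THRESHOLD for field, delta in deltas.items()}

for field, delta in deltas.items():
    label = "tied" if tied[field] else "differs"
    print(f"{field:<9} {delta:+.3f}  {label}")
```

Under this threshold, Funder, Award ID and Scheme fall inside the noise bound on the degraded set, and Title is the only field that does not.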
 
  ## Usage