adambuttrick committed on
Commit e453c3e · verified · 1 parent: 370dad3

Trim editorial prose from model card

Files changed (1)
  1. README.md +10 -10
README.md CHANGED
@@ -29,8 +29,8 @@ Trained on the [`cometadata/funding-extraction-artifact-data-mix-grpo-mixed-rewa
 
  - **Base model:** `Qwen/Qwen3.5-9B`
  - **Data (`data/sft/`):** 3,528 real + 7,240 synthetic funding statements with gold-standard funder/award labels (synthetic upsampled 2×)
- - **Data augmentation:** 50% of training examples augmented with synthetic noise (OCR-like case errors, digit/letter swaps, Unicode artifacts, XML/HTML tags, LaTeX markup) for robustness to real-world document formats
- - **Renderer:** `qwen3_5_disable_thinking` — the model is trained to emit JSON directly (no chain-of-thought), so inference should disable thinking (see [Usage](#usage))
  - **LoRA rank:** 128
  - **Epochs:** 2
  - **Result:** eval NLL 0.116 → 0.0035 over 252 steps
@@ -82,11 +82,11 @@ Trained on the [`cometadata/funding-extraction-artifact-data-mix-grpo-mixed-rewa
  | Scheme | 0.6667 | 0.7438 | 0.7031 | 0.6808 | 0.7182 |
  | Title | 0.8095 | 0.3542 | 0.4928 | 0.6439 | 0.4283 |
 
- Inference on the 300 examples produced 100% parseable JSON (no truncations), averaging 126 output tokens per example.
 
  ### funding-entity-extraction-dataset-mix test sets
 
- Evaluated on the held-out test sets from [`cometadata/funding-entity-extraction-dataset-mix`](https://huggingface.co/datasets/cometadata/funding-entity-extraction-dataset-mix), same evaluation harness. 100% parseable JSON across all 1,957 examples. For `test_with_context`, the model is given the funding statement embedded in its surrounding document text (the `full_text` field) — performance on the primary fields is maintained (in fact highest of the three sets), showing the model is not distracted by surrounding paper content.
 
  #### `test.jsonl` (347 examples)
 
@@ -117,7 +117,7 @@ Strict (token_sort_ratio only)
  | Scheme | 0.3333 | 0.4798 | 0.3934 | 0.3550 | 0.4227 |
  | Title | 0.0690 | 0.0114 | 0.0195 | 0.0342 | 0.0153 |
 
- #### `test_degraded.jsonl` (1,288 examples — the `synthetic_edges` set from the Llama baseline card)
 
  Permissive (partial_ratio + token_set, no damping)
 
@@ -146,7 +146,7 @@ Strict (token_sort_ratio only)
  | Scheme | 0.6039 | 0.5461 | 0.5735 | 0.5913 | 0.5627 |
  | Title | 0.6139 | 0.1802 | 0.2787 | 0.4144 | 0.2303 |
 
- #### `test_with_context.jsonl` (322 examples — funding statement embedded in surrounding document text, avg 1,143 vs 375 chars)
 
  Permissive (partial_ratio + token_set, no damping)
 
@@ -177,7 +177,7 @@ Strict (token_sort_ratio only)
 
  ### Comparison to the Llama 3.1 8B baseline
 
- Both test sets the Llama baseline card reports, scored with the same harness and pipeline. Balanced-mode F1:
 
  **arxiv_test (300 examples)**
 
@@ -188,7 +188,7 @@ Both test sets the Llama baseline card reports, scored with the same harness and
  | Scheme | 0.6466 | 0.7266 | +0.080 |
  | Title | 0.5316 | 0.5507 | +0.019 |
 
- **synthetic_edges / `test_degraded` (1,288 examples)**
 
  | Field | Llama 3.1 8B | Qwen3.5-9B | Δ |
  |-------|:---:|:---:|:---:|
@@ -197,7 +197,7 @@ Both test sets the Llama baseline card reports, scored with the same harness and
  | Scheme | 0.6370 | 0.6417 | +0.005 |
  | Title | 0.4110 | 0.3011 | −0.110 |
 
- On both sets the two RL-optimized fields (funder, award ID) are statistically tied with the Llama baseline (≤0.008 F1, within run-to-run noise). The un-weighted secondary fields are mixed — scheme is comparable-to-better, while title is the one regression (notably on the degraded set); both carry zero reward weight, and Qwen extracts titles conservatively (high precision, low recall).
 
  ## Usage
 
@@ -218,7 +218,7 @@ messages = [
      {"role": "user", "content": prompt},
  ]
 
- # Trained without chain-of-thought: disable thinking to match the training distribution.
  inputs = tokenizer.apply_chat_template(
      messages, return_tensors="pt", add_generation_prompt=True, enable_thinking=False
  )
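
The metric tables in the hunks above show five unlabeled numeric columns per field, because their header rows fall outside the diff context. As a sanity check, and assuming those columns are precision, recall, F1, F0.5, and F1.5 in that order (an inference from the arithmetic, not something this diff states), the following Python sketch recomputes the F-scores from the first two columns. The `f_beta` helper and the `rows` dictionary are illustrative names, not part of the evaluation harness.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Weighted harmonic mean of precision and recall (standard F-beta)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Rows copied from the first table hunk; the column meaning is an assumption:
# (precision, recall, F1, F0.5, F1.5)
rows = {
    "Scheme": (0.6667, 0.7438, 0.7031, 0.6808, 0.7182),
    "Title": (0.8095, 0.3542, 0.4928, 0.6439, 0.4283),
}

TOLERANCE = 5e-4  # the table values are rounded to four decimals

consistent = {}
for field, (p, r, f1, f05, f15) in rows.items():
    recomputed = (f_beta(p, r, 1.0), f_beta(p, r, 0.5), f_beta(p, r, 1.5))
    reported = (f1, f05, f15)
    consistent[field] = all(
        abs(a - b) < TOLERANCE for a, b in zip(recomputed, reported)
    )

for field, ok in consistent.items():
    print(field, "matches the assumed column layout" if ok else "does not match")
```

Under that reading, a high-precision, low-recall row such as Title scores better on F0.5 (which favors precision) than on F1.5 (which favors recall).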
 
 
  - **Base model:** `Qwen/Qwen3.5-9B`
  - **Data (`data/sft/`):** 3,528 real + 7,240 synthetic funding statements with gold-standard funder/award labels (synthetic upsampled 2×)
+ - **Data augmentation:** 50% of training examples augmented with synthetic noise (OCR-like case errors, digit/letter swaps, Unicode artifacts, XML/HTML tags, LaTeX markup)
+ - **Renderer:** `qwen3_5_disable_thinking` (no chain-of-thought; keep thinking disabled at inference, see [Usage](#usage))
  - **LoRA rank:** 128
  - **Epochs:** 2
  - **Result:** eval NLL 0.116 → 0.0035 over 252 steps
 
  | Scheme | 0.6667 | 0.7438 | 0.7031 | 0.6808 | 0.7182 |
  | Title | 0.8095 | 0.3542 | 0.4928 | 0.6439 | 0.4283 |
 
+ All 300 outputs were valid JSON.
 
  ### funding-entity-extraction-dataset-mix test sets
 
+ Evaluated on the held-out test sets from [`cometadata/funding-entity-extraction-dataset-mix`](https://huggingface.co/datasets/cometadata/funding-entity-extraction-dataset-mix) with the same evaluation harness. `test_with_context` uses the `full_text` field (the funding statement with its surrounding document text) as the model input.
 
  #### `test.jsonl` (347 examples)
 
 
  | Scheme | 0.3333 | 0.4798 | 0.3934 | 0.3550 | 0.4227 |
  | Title | 0.0690 | 0.0114 | 0.0195 | 0.0342 | 0.0153 |
 
+ #### `test_degraded.jsonl` (1,288 examples)
 
  Permissive (partial_ratio + token_set, no damping)
 
 
  | Scheme | 0.6039 | 0.5461 | 0.5735 | 0.5913 | 0.5627 |
  | Title | 0.6139 | 0.1802 | 0.2787 | 0.4144 | 0.2303 |
 
+ #### `test_with_context.jsonl` (322 examples)
 
  Permissive (partial_ratio + token_set, no damping)
 
 
 
  ### Comparison to the Llama 3.1 8B baseline
 
+ Balanced-mode F1 on the two test sets reported by the Llama baseline card:
 
  **arxiv_test (300 examples)**
 
 
  | Scheme | 0.6466 | 0.7266 | +0.080 |
  | Title | 0.5316 | 0.5507 | +0.019 |
 
+ **`test_degraded` (1,288 examples)**
 
  | Field | Llama 3.1 8B | Qwen3.5-9B | Δ |
  |-------|:---:|:---:|:---:|
 
  | Scheme | 0.6370 | 0.6417 | +0.005 |
  | Title | 0.4110 | 0.3011 | −0.110 |
 
+ Funder and award ID (the reward-weighted fields) are within 0.008 F1 of the Llama baseline on both sets. Scheme and title carry zero reward weight.
 
  ## Usage
 
 
      {"role": "user", "content": prompt},
  ]
 
+ # Model trained with thinking disabled; keep enable_thinking=False.
  inputs = tokenizer.apply_chat_template(
      messages, return_tensors="pt", add_generation_prompt=True, enable_thinking=False
  )
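
The updated card states that all 300 outputs were valid JSON. A minimal sketch of how such a parse rate could be measured on decoded generations follows, assuming each output is a plain JSON string with no surrounding text. The `json_parse_rate` function and the sample strings are hypothetical and do not reflect the model's actual output schema or the project's evaluation harness.

```python
import json
from typing import Any, List, Optional, Tuple


def json_parse_rate(outputs: List[str]) -> Tuple[float, List[Optional[Any]]]:
    """Return the fraction of outputs that parse as JSON, plus the parsed values.

    Entries that fail to parse are recorded as None so positions stay aligned
    with the input list.
    """
    parsed: List[Optional[Any]] = []
    failures = 0
    for text in outputs:
        try:
            parsed.append(json.loads(text))
        except json.JSONDecodeError:
            parsed.append(None)
            failures += 1
    rate = (len(outputs) - failures) / len(outputs) if outputs else 0.0
    return rate, parsed


# Made-up generations: one complete object and one truncated mid-string.
samples = ['{"funder": "Example Foundation"}', '{"funder": "Exam']
rate, parsed = json_parse_rate(samples)
print(f"parseable: {rate:.0%}")
```

A truncated generation, for example one cut off by a low `max_new_tokens` limit, shows up here as a parse failure, which is one reason to track this rate alongside the field-level scores.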