Text Generation
PEFT
Safetensors
English
lora
qwen3
affiliation-parsing
information-extraction

affiliation-parsing-lora-Qwen3-8B-distil-GLM_4.5_Air-dapo-latex

A single LoRA adapter on top of Qwen/Qwen3-8B for extracting authors and institutional affiliations from academic papers into structured JSON.

This adapter combines two training stages into one rank-40 LoRA (the combination is exact — no SVD compression):

  1. Stage 1: SFT distillation from zai-org/GLM-4.5-Air, reproducing the adapter at cometadata/affiliation-parsing-lora-Qwen3-8B-distil-GLM_4.5_Air (r=8, α=32).
  2. Stage 2: DAPO continual RL on the LaTeX-extracted arXiv format (cometadata/arxiv-author-affiliations-latex-extract-train-test, 1222 train / 987 test), starting from the merged stage-1 model (r=32, α=64). Training used true DAPO (grpo_norm_by_std=false) with clip-higher (0.2 / 0.28), a dynamic-sampling filter, token-level loss, ×10 reward rescale, lr 1e-5, and 16 rollouts/prompt on 2× H100.
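For readers unfamiliar with the stage-2 terms, the following is a minimal sketch of what clip-higher, token-level loss, and the dynamic-sampling filter usually mean. It is not the training code: the function names are hypothetical, advantages are taken as given inputs (so the normalization choice and the ×10 reward rescale are not modeled), and plain Python lists stand in for tensors.

```python
def dapo_token_loss(ratios, advantages, eps_low=0.2, eps_high=0.28):
    """Token-level clipped surrogate loss with asymmetric (clip-higher) bounds.

    ratios:     one list per rollout of per-token ratios pi_new / pi_old
    advantages: one scalar advantage per rollout, shared by all its tokens
    """
    total, n_tokens = 0.0, 0
    for rollout_ratios, adv in zip(ratios, advantages):
        for r in rollout_ratios:
            clipped = min(max(r, 1.0 - eps_low), 1.0 + eps_high)
            # Pessimistic (PPO-style) choice between raw and clipped terms.
            total += min(r * adv, clipped * adv)
            n_tokens += 1
    # Token-level: average over all tokens in the batch, not per rollout.
    return -total / n_tokens


def keep_prompt(rewards):
    """Dynamic-sampling filter: keep a prompt only if its rollouts disagree."""
    return max(rewards) > min(rewards)


# A long rollout (3 tokens, positive advantage) and a short one (1 token, negative).
loss = dapo_token_loss([[1.0, 1.0, 1.0], [1.0]], [1.0, -1.0])
upside = dapo_token_loss([[1.5]], [1.0])      # ratio capped at 1 + eps_high
downside = dapo_token_loss([[0.5]], [-1.0])   # ratio floored at 1 - eps_low
```

With a per-rollout average the two rollouts in the first call would cancel; averaging over tokens lets the longer rollout weigh more, which is the point of the token-level loss.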

The two adapters target the same 7 projection modules, so for every module the additive delta (α₁/r₁)·B₁A₁ + (α₂/r₂)·B₂A₂ is expressible exactly as one LoRA at rank 40 by concatenating the A/B matrices with the per-adapter scale baked into A (max reconstruction error: 3.3e-9).
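The concatenation argument above can be checked on toy matrices. This is a sketch only: it uses small pure-Python matrices instead of the real weights, the helper names are hypothetical, and it assumes the merged adapter is applied with an effective scale of 1 (for example α = r = 40), since both per-adapter scales are already folded into A.

```python
import random


def matmul(X, Y):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def concat_lora(A1, B1, s1, A2, B2, s2):
    """Stack scaled A matrices row-wise and B matrices column-wise."""
    A_cat = [[s1 * a for a in row] for row in A1] + [[s2 * a for a in row] for row in A2]
    B_cat = [row1 + row2 for row1, row2 in zip(B1, B2)]
    return A_cat, B_cat


rng = random.Random(0)
d_in, d_out = 6, 5                 # toy sizes, not the real projection shapes
r1, alpha1 = 8, 32                 # stage 1
r2, alpha2 = 32, 64                # stage 2
A1, B1 = rand_matrix(r1, d_in, rng), rand_matrix(d_out, r1, rng)
A2, B2 = rand_matrix(r2, d_in, rng), rand_matrix(d_out, r2, rng)
s1, s2 = alpha1 / r1, alpha2 / r2

A_cat, B_cat = concat_lora(A1, B1, s1, A2, B2, s2)

# Sum of the two scaled deltas versus the single rank-40 product.
delta_sum = [[s1 * x + s2 * y for x, y in zip(ra, rb)]
             for ra, rb in zip(matmul(B1, A1), matmul(B2, A2))]
delta_cat = matmul(B_cat, A_cat)
max_err = max(abs(x - y) for ra, rb in zip(delta_sum, delta_cat) for x, y in zip(ra, rb))
```

The merged A has r1 + r2 = 40 rows, and `max_err` is limited only by floating-point rounding.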

Eval

Evaluation uses the 987-prompt LaTeX-extracted test split. Author names are matched case-insensitively with fuzz.ratio (threshold 85). Affiliations are matched with token_sort_ratio after domain normalization (expanding abbreviations, dropping postal codes), also at threshold 85. The affiliation matcher was audited at precision 1.0 on a 64-pair labeled set, so the metric does not credit genuinely different institutions.

Reward (normalized (format + author_IoU + affiliation_IoU) / 3)

| Stage | Test reward |
|---|---|
| Stage 1 only (distil) | 0.918 |
| Stage 1 + Stage 2 (this adapter) | 0.921 |
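As a rough illustration of the reward formula in the heading above, the sketch below averages a format term with two IoU terms. The helper names are hypothetical, and it assumes a binary format term and IoU computed as TP / (TP + FP + FN) from matched counts; the card states neither detail.

```python
def iou(tp, fp, fn):
    """Intersection over union from match counts."""
    union = tp + fp + fn
    # Assumption: two empty lists count as a perfect match.
    return tp / union if union else 1.0


def reward(format_ok, author_counts, affiliation_counts):
    """(format + author_IoU + affiliation_IoU) / 3, each term in [0, 1]."""
    return (float(format_ok) + iou(*author_counts) + iou(*affiliation_counts)) / 3


# Valid JSON, 3 of 4 gold authors found, 2 affiliations right, 1 wrong, 1 missed.
example = reward(True, (3, 0, 1), (2, 1, 1))
# Invalid output with nothing matched.
worst = reward(False, (0, 0, 2), (0, 0, 2))
```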

Per-category precision / recall / F0.5 / F1

Pooled (micro) TP/FP/FN across all 987 prompts. Parse rate (schema-valid JSON emitted): 0.992 for both stages.

Authors — fuzzy name match across each prompt's gold vs. predicted author list.

| Stage | TP | FP | FN | P | R | F0.5 | F1 |
|---|---|---|---|---|---|---|---|
| Stage 1 only | 3095 | 147 | 680 | 0.955 | 0.820 | 0.924 | 0.882 |
| Stage 1 + 2 | 3008 | 90 | 767 | 0.971 | 0.797 | 0.930 | 0.875 |
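The precision, recall, and F-scores in this table follow directly from the pooled counts. A small sketch (the function name is hypothetical) that reproduces the Stage 1 author row:

```python
def prf(tp, fp, fn, beta=1.0):
    """Precision, recall, and F-beta from pooled (micro) counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    b2 = beta * beta
    denom = b2 * precision + recall
    f_beta = (1 + b2) * precision * recall / denom if denom else 0.0
    return precision, recall, f_beta


# Stage 1 only, authors: TP 3095, FP 147, FN 680.
p, r, f1 = prf(3095, 147, 680)
_, _, f05 = prf(3095, 147, 680, beta=0.5)
row = tuple(round(v, 3) for v in (p, r, f05, f1))
```

F0.5 weights precision more heavily than recall, which is why it rises after stage 2 even though F1 dips slightly for authors.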

Affiliations — affiliation matching within matched-author pairs; gold affiliations of unmatched authors count as FN, predicted of unmatched as FP.

| Stage | TP | FP | FN | P | R | F0.5 | F1 |
|---|---|---|---|---|---|---|---|
| Stage 1 only | 2821 | 596 | 1348 | 0.826 | 0.677 | 0.791 | 0.744 |
| Stage 1 + 2 | 2793 | 361 | 1376 | 0.886 | 0.670 | 0.832 | 0.763 |
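The affiliation counting rule (match within matched-author pairs, charge unmatched authors' affiliations as FN or FP) can be sketched as follows. This is an illustration under assumptions: the function name is hypothetical, authors and affiliations are paired greedily in list order, and the matchers are passed in as callables, with simple case-insensitive equality standing in for the fuzzy matchers in the example.

```python
def count_affiliations(gold, pred, same_name, same_affil):
    """Return (tp, fp, fn) for one paper.

    gold, pred: lists of {"name": str, "affiliations": [str, ...]}
    """
    tp = fp = fn = 0
    unmatched_pred = list(pred)
    for g in gold:
        match = next((p for p in unmatched_pred if same_name(g["name"], p["name"])), None)
        if match is None:
            # Gold author never predicted: all of their affiliations are missed.
            fn += len(g["affiliations"])
            continue
        unmatched_pred.remove(match)
        remaining = list(match["affiliations"])
        for gold_affil in g["affiliations"]:
            hit = next((a for a in remaining if same_affil(gold_affil, a)), None)
            if hit is None:
                fn += 1
            else:
                remaining.remove(hit)
                tp += 1
        fp += len(remaining)  # predicted affiliations with no gold counterpart
    # Predicted authors with no gold counterpart: all affiliations are spurious.
    fp += sum(len(p["affiliations"]) for p in unmatched_pred)
    return tp, fp, fn


def same(a, b):
    return a.casefold() == b.casefold()


# Made-up records for illustration only.
gold = [{"name": "A. Author", "affiliations": ["Inst X", "Inst Y"]},
        {"name": "B. Author", "affiliations": ["Inst Z"]}]
pred = [{"name": "a. author", "affiliations": ["inst x", "Inst W"]},
        {"name": "C. Other", "affiliations": ["Inst Q"]}]
counts = count_affiliations(gold, pred, same, same)
```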

Macro scores (mean of per-prompt P/R/F) for Stage 1 + 2 show higher recall and F1 than the micro scores, because large multi-author papers with many missed authors dominate the pooled counts: authors P 0.954 / R 0.954 / F0.5 0.953 / F1 0.953; affiliations P 0.840 / R 0.822 / F0.5 0.833 / F1 0.826.
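A toy example makes the micro/macro gap concrete. The counts below are invented for illustration: two small papers extracted perfectly and one large paper where most authors are missed.

```python
def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0


# (tp, fn) per paper; invented numbers, not from the test set.
papers = [(2, 0), (2, 0), (10, 40)]

# Micro: pool the counts first, then compute recall once.
micro_recall = recall(sum(tp for tp, _ in papers), sum(fn for _, fn in papers))

# Macro: compute recall per paper, then average.
macro_recall = sum(recall(tp, fn) for tp, fn in papers) / len(papers)
```

The single large paper contributes 40 of the 40 pooled misses, so it pulls micro recall far below the per-paper average.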

DAPO moves precision, not recall: authors FP drops 39% (147 → 90) and affiliations FP drops 39% (596 → 361). The model learned to stop emitting hallucinated or wrong items. Recall is roughly flat, dipping slightly (authors 0.820 → 0.797, affiliations 0.677 → 0.670). The unreachable items are ~4% truly bad data (the author block was lost during source extraction for those papers) plus very large author lists where some authors are consistently skipped.

A discovery during training was that the naive case-sensitive fuzz.ratio metric was scoring ~22% of correctly extracted papers as 0. Gold labels are often ALL-CAPS (LUKASZ PAWELEC), while a correct extraction from the paper text is mixed-case (Łukasz Pawelec). The corrected metric reveals the model was always at ~0.91, not the 0.81 that the buggy metric showed.
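The effect of the case bug can be reproduced with the standard library. In this sketch `difflib.SequenceMatcher` stands in for fuzz.ratio (an assumption: its scores can differ somewhat from the fuzzy-matching library used in the evaluation), and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Similarity on a 0-100 scale; a stand-in for fuzz.ratio."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def names_match(gold, pred, threshold=85.0, case_sensitive=False):
    if not case_sensitive:
        gold, pred = gold.casefold(), pred.casefold()
    return ratio(gold, pred) >= threshold


gold, pred = "LUKASZ PAWELEC", "Łukasz Pawelec"
buggy = names_match(gold, pred, case_sensitive=True)   # naive comparison
fixed = names_match(gold, pred)                        # case-insensitive comparison
```

After case folding, only the Ł/l difference remains, which stays above the threshold of 85; without it, almost no characters line up.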

Usage

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype="bfloat16")
tok  = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
model = PeftModel.from_pretrained(
    base,
    "cometadata/affiliation-parsing-lora-Qwen3-8B-distil-GLM_4.5_Air-dapo-latex",
)

SYSTEM = (
    "You are an expert at reading academic articles and parsing information "
    "about their affiliations. The user will show you an academic article and "
    "your job is to extract the authors and their affiliations in a structured "
    "format (a JSON array of {name, affiliations}). Respond after </think>."
)

messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user",   "content": "<the paper text>"},
]
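# Build the chat prompt; return_dict=True yields input_ids plus attention_mask.
inputs = tok.apply_chat_template(messages, add_generation_prompt=True,
                                 return_dict=True, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=2048, temperature=0.7, do_sample=True)
# Decode only the newly generated tokens, not the prompt.
prompt_len = inputs["input_ids"].shape[1]
print(tok.decode(out[0, prompt_len:], skip_special_tokens=True))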

The model emits <think>…</think> reasoning followed by a JSON array [{"name": ..., "affiliations": [...]}, ...].
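To consume the output downstream, the reasoning has to be stripped and the JSON validated. A minimal sketch, assuming the answer is everything after the last </think> and that a record is valid when it has a string name and a list of string affiliations (the function name and the exact validation rules are assumptions, not the evaluation code):

```python
import json


def parse_response(text):
    """Return the list of author records, or None if the output is not schema-valid."""
    # Keep only what follows the last </think>; if the tag is absent, use the whole text.
    answer = text.rpartition("</think>")[2].strip()
    try:
        records = json.loads(answer)
    except json.JSONDecodeError:
        return None
    if not isinstance(records, list):
        return None
    for rec in records:
        if not (isinstance(rec, dict)
                and isinstance(rec.get("name"), str)
                and isinstance(rec.get("affiliations"), list)
                and all(isinstance(a, str) for a in rec["affiliations"])):
            return None
    return records


# Made-up model output for illustration.
sample = ('<think>One author, one affiliation.</think>\n'
          '[{"name": "Ada Lovelace", "affiliations": ["University of London"]}]')
parsed = parse_response(sample)
```

Returning None rather than raising makes it easy to count unparseable outputs separately, in the spirit of the parse rate reported in the Eval section.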

Training & evaluation code

github.com/cometadata/affiliation-parsing-cl-latex (or the project directory /scratch/m000152-pm05/affiliation-parsing-cl-latex/).

License

Apache-2.0 (matches the Qwen3-8B base).
