affiliation-parsing-lora-Qwen3-8B-distil-GLM_4.5_Air-dapo-latex

A single LoRA adapter on top of Qwen/Qwen3-8B for extracting authors and institutional affiliations from academic papers into structured JSON.

This adapter combines two training stages into one rank-40 LoRA (the combination is exact — no SVD compression):

Stage 1 — SFT distillation from zai-org/GLM-4.5-Air, reproducing the adapter at cometadata/affiliation-parsing-lora-Qwen3-8B-distil-GLM_4.5_Air (r=8, α=32).
Stage 2 — DAPO continual RL on the LaTeX-extracted arXiv format (cometadata/arxiv-author-affiliations-latex-extract-train-test, 1222 train / 987 test) starting from the merged stage-1 model (r=32, α=64). True DAPO (grpo_norm_by_std=false), clip-higher (0.2 / 0.28), dynamic-sampling filter, token-level loss, ×10 reward rescale, lr 1e-5, 16 rollouts/prompt on 2× H100.
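The stage-2 ingredients listed above (dynamic-sampling filter, clip-higher, token-level loss) can be sketched in a few lines. This is an illustrative NumPy sketch, not the training code from the repository: the function names are hypothetical, and the mean-centred advantage without std division is an assumption read off the grpo_norm_by_std=false flag.

```python
import numpy as np

EPS_LOW, EPS_HIGH = 0.2, 0.28  # clip-higher bounds quoted in this card


def keep_group(rewards):
    """Dynamic-sampling filter: drop prompts whose rollouts all got the same reward."""
    return float(np.ptp(rewards)) > 0.0


def group_advantages(rewards):
    """Mean-centred advantages per prompt group (assumption: no std division)."""
    r = np.asarray(rewards, dtype=float)
    return r - r.mean()


def token_level_loss(ratios, advantages, mask):
    """Clipped surrogate averaged over every valid token in the batch.

    ratios, mask: (n_rollouts, n_tokens); advantages: (n_rollouts,)
    """
    adv = advantages[:, None]
    clipped = np.clip(ratios, 1.0 - EPS_LOW, 1.0 + EPS_HIGH)
    per_token = np.minimum(ratios * adv, clipped * adv)
    return -(per_token * mask).sum() / mask.sum()


# A positive-advantage token with ratio 1.5 is clipped at 1 + EPS_HIGH.
loss_pos = token_level_loss(np.array([[1.5]]), np.array([1.0]), np.array([[1.0]]))
# A negative-advantage token with ratio 0.5 is clipped at 1 - EPS_LOW.
loss_neg = token_level_loss(np.array([[0.5]]), np.array([-1.0]), np.array([[1.0]]))
```

Dividing by the total token count, rather than averaging per rollout first, is what makes the loss token-level: long rollouts contribute in proportion to their length.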

The two adapters target the same 7 projection modules, so for every module the additive delta (α₁/r₁)·B₁A₁ + (α₂/r₂)·B₂A₂ is expressible exactly as one LoRA at rank 40 by concatenating the A/B matrices with the per-adapter scale baked into A (max reconstruction error: 3.3e-9).
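The concatenation argument can be checked numerically on toy matrices. The sketch below uses random weights and a made-up layer size rather than the real adapter tensors, and it assumes the combined adapter is stored with α equal to r, so that its own scale factor is 1.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in = 64, 48  # toy layer size, not the real Qwen3-8B shapes


def random_lora(rank):
    """Return a random (A, B) pair with A: (rank, d_in) and B: (d_out, rank)."""
    return rng.standard_normal((rank, d_in)), rng.standard_normal((d_out, rank))


(A1, B1), s1 = random_lora(8), 32 / 8    # stage 1: r=8, alpha=32
(A2, B2), s2 = random_lora(32), 64 / 32  # stage 2: r=32, alpha=64

# Sum of the two scaled low-rank deltas.
delta_sum = s1 * (B1 @ A1) + s2 * (B2 @ A2)

# One rank-40 adapter: stack A row-wise with the scales baked in, B column-wise.
A_cat = np.vstack([s1 * A1, s2 * A2])  # (40, d_in)
B_cat = np.hstack([B1, B2])            # (d_out, 40)
delta_cat = B_cat @ A_cat              # assumes combined scale alpha/r == 1

max_err = np.abs(delta_sum - delta_cat).max()
print(A_cat.shape, B_cat.shape, max_err)
```

The residual is pure floating-point rounding, which is why no SVD truncation is needed.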

Eval

Evaluated on the 987-prompt LaTeX-extracted test split. Author names are matched case-insensitively with fuzz.ratio (threshold 85). Affiliations are matched with token_sort_ratio after domain normalization (expanding abbreviations, dropping postal codes), also at threshold 85. The affiliation matcher was audited at precision 1.0 on a 64-pair labeled set, so the metric does not credit genuinely different institutions.
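A rough sketch of the affiliation matcher described above, using only the standard library. difflib.SequenceMatcher stands in for token_sort_ratio here, so scores may differ slightly from the real metric. The abbreviation table and the postal-code pattern are hypothetical placeholders; the actual normalization rules live in the evaluation code.

```python
import re
from difflib import SequenceMatcher

THRESHOLD = 85

# Hypothetical abbreviation table; the real list is not given in this card.
ABBREVIATIONS = {"univ": "university", "dept": "department", "inst": "institute"}
# Assumption: a postal code is a bare run of 4 to 6 digits.
POSTAL_CODE = re.compile(r"\b\d{4,6}\b")


def normalize(affiliation):
    """Lowercase, drop postal codes, expand abbreviations, sort the tokens."""
    text = POSTAL_CODE.sub(" ", affiliation.lower())
    tokens = [ABBREVIATIONS.get(t, t) for t in re.findall(r"\w+", text)]
    return " ".join(sorted(tokens))


def affiliation_score(a, b):
    """Similarity in [0, 100] on the normalized, token-sorted strings."""
    return 100 * SequenceMatcher(None, normalize(a), normalize(b)).ratio()


same = affiliation_score(
    "Dept. of Physics, MIT, Cambridge, MA 02139",
    "Department of Physics, MIT, Cambridge MA",
)
different = affiliation_score("University of Oxford", "University of Cambridge")
print(same >= THRESHOLD, different >= THRESHOLD)
```

Sorting tokens makes the score insensitive to word order, while the threshold keeps institutions that merely share generic words from matching.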

Reward (normalized `(format + author_IoU + affiliation_IoU) / 3`)

Stage	Test reward
Stage 1 only (distil)	0.918
Stage 1 + Stage 2 (this adapter)	0.921
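For orientation, the normalized reward could be computed per prompt roughly as follows. This is a sketch under assumptions: the format term is taken to be 0 or 1, and IoU is taken to be TP / (TP + FP + FN) over fuzzy-matched items. Neither definition is spelled out in this card, and the function names are illustrative.

```python
def iou(tp, fp, fn):
    """Intersection over union from match counts (assumed definition)."""
    union = tp + fp + fn
    # Assumption: an empty gold list with an empty prediction counts as perfect.
    return tp / union if union else 1.0


def reward(format_ok, author_counts, affiliation_counts):
    """Mean of the format, author-IoU and affiliation-IoU terms, in [0, 1]."""
    return (float(format_ok) + iou(*author_counts) + iou(*affiliation_counts)) / 3


# Toy prompt: valid JSON, 3 of 4 gold authors found, 2 affiliations right,
# 1 wrong and 1 missed.
example = reward(True, (3, 0, 1), (2, 1, 1))
print(example)
```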

Per-category precision / recall / F0.5 / F1

Pooled (micro) TP/FP/FN across all 987 prompts. Parse rate (schema-valid JSON emitted): 0.992 for both stages.

Authors — fuzzy name match across each prompt's gold vs. predicted author list.

Stage	TP	FP	FN	P	R	F0.5	F1
Stage 1 only	3095	147	680	0.955	0.820	0.924	0.882
Stage 1 + 2	3008	90	767	0.971	0.797	0.930	0.875

Affiliations — affiliation matching within matched-author pairs; gold affiliations of unmatched authors count as FN, predicted of unmatched as FP.

Stage	TP	FP	FN	P	R	F0.5	F1
Stage 1 only	2821	596	1348	0.826	0.677	0.791	0.744
Stage 1 + 2	2793	361	1376	0.886	0.670	0.832	0.763
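The P / R / F0.5 / F1 columns follow directly from the pooled TP / FP / FN counts. The snippet below recomputes them from the counts in the two tables; the helper name is illustrative.

```python
def prf(tp, fp, fn):
    """Precision, recall, F0.5 and F1 from pooled counts, rounded to 3 places."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f_half = 1.25 * p * r / (0.25 * p + r)  # F-beta with beta = 0.5
    f1 = 2 * p * r / (p + r)
    return tuple(round(x, 3) for x in (p, r, f_half, f1))


counts = {
    "authors, stage 1": (3095, 147, 680),
    "authors, stage 1 + 2": (3008, 90, 767),
    "affiliations, stage 1": (2821, 596, 1348),
    "affiliations, stage 1 + 2": (2793, 361, 1376),
}
for name, c in counts.items():
    print(name, prf(*c))
```

F0.5 weights precision more heavily than recall, which is why it rises for authors in stage 2 even though F1 dips slightly.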

Macro (mean of per-prompt P/R/F) for Stage 1 + 2 is higher because large multi-author papers drag the micro denominators down: authors P 0.954 / R 0.954 / F0.5 0.953 / F1 0.953; affiliations P 0.840 / R 0.822 / F0.5 0.833 / F1 0.826.
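A toy example shows why macro recall can sit well above micro recall. The counts below are made up for illustration and are not taken from the test split.

```python
# Hypothetical (TP, FN) per prompt: two small papers fully recovered,
# one large paper where half of the authors were skipped.
prompts = [(2, 0), (3, 0), (50, 50)]

# Micro: pool the counts first, then divide.
total_tp = sum(tp for tp, _ in prompts)
total_fn = sum(fn for _, fn in prompts)
micro_recall = total_tp / (total_tp + total_fn)

# Macro: compute recall per prompt, then average.
macro_recall = sum(tp / (tp + fn) for tp, fn in prompts) / len(prompts)

print(round(micro_recall, 3), round(macro_recall, 3))
```

The single large paper dominates the pooled counts but counts as only one of three prompts in the macro average.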

DAPO moves precision, not recall: authors FP drops 39% (147 → 90) and affiliations FP drops 39% (596 → 361). The model learned to stop emitting hallucinated or wrong items. Recall does not improve (authors 0.820 → 0.797, affiliations 0.677 → 0.670). The unreachable items are ~4% truly bad data (the author block was lost during source extraction for those papers) plus very large author lists where some authors are consistently skipped.

A discovery during training was that the naive case-sensitive fuzz.ratio metric scored ~22% of correctly extracted papers as 0. Gold labels are often ALL-CAPS (LUKASZ PAWELEC), while a correct extraction from the paper text is mixed-case (Łukasz Pawelec). Under the corrected metric the model was at ~0.91 all along, not the 0.81 that the buggy metric reported.
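The effect of case on the name match can be reproduced with the example pair from the text. The sketch uses difflib.SequenceMatcher from the standard library as a stand-in for fuzz.ratio, so the exact scores are approximate, but the pass/fail outcome at threshold 85 is the point.

```python
from difflib import SequenceMatcher

THRESHOLD = 85


def ratio(a, b):
    """Character-level similarity in [0, 100] (stand-in for fuzz.ratio)."""
    return 100 * SequenceMatcher(None, a, b).ratio()


gold, predicted = "LUKASZ PAWELEC", "Łukasz Pawelec"

# Case-sensitive: almost no characters line up, so the pair is rejected.
case_sensitive = ratio(gold, predicted)
# Case-insensitive: only the Ł / L difference remains, so the pair is accepted.
case_insensitive = ratio(gold.casefold(), predicted.casefold())

print(round(case_sensitive, 1), round(case_insensitive, 1))
```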

Usage

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype="bfloat16")
tok  = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
model = PeftModel.from_pretrained(
    base,
    "cometadata/affiliation-parsing-lora-Qwen3-8B-distil-GLM_4.5_Air-dapo-latex",
)

SYSTEM = (
    "You are an expert at reading academic articles and parsing information "
    "about their affiliations. The user will show you an academic article and "
    "your job is to extract the authors and their affiliations in a structured "
    "format (a JSON array of {name, affiliations}). Respond after </think>."
)

messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user",   "content": "<the paper text>"},
]
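# return_dict=True also returns the attention mask, which generate() uses.
inputs = tok.apply_chat_template(messages, add_generation_prompt=True,
                                 return_dict=True, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=2048, temperature=0.7, do_sample=True)
# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[1]
print(tok.decode(out[0, prompt_len:], skip_special_tokens=True))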

The model emits <think>…</think> reasoning followed by a JSON array [{"name": ..., "affiliations": [...]}, ...].
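Downstream code usually wants only the JSON part of that output. One possible helper is sketched below: it drops everything up to the last </think> and then validates the shape of the array. The function name and the exact validation rules are assumptions, not part of the released code.

```python
import json


def parse_prediction(generated):
    """Return the list of {name, affiliations} records, or None if invalid."""
    # Keep only the text after the final </think>; if the tag is absent,
    # rpartition leaves the whole string in the last slot.
    _, _, answer = generated.rpartition("</think>")
    try:
        data = json.loads(answer.strip())
    except json.JSONDecodeError:
        return None
    if not isinstance(data, list):
        return None
    for item in data:
        if not (
            isinstance(item, dict)
            and isinstance(item.get("name"), str)
            and isinstance(item.get("affiliations"), list)
        ):
            return None
    return data


sample = '<think>one author</think>\n[{"name": "A. Author", "affiliations": ["Univ X"]}]'
records = parse_prediction(sample)
print(records)
```

Returning None instead of raising makes it easy to count unparseable outputs, in the spirit of the parse rate reported in the Eval section.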

Training & evaluation code

github.com/cometadata/affiliation-parsing-cl-latex (or the project directory /scratch/m000152-pm05/affiliation-parsing-cl-latex/).

License

Apache-2.0 (matches the Qwen3-8B base).
