funding-parsing-and-extraction-Llama-3.1-8B-Instruct-lora

Two-stage LoRA for joint funding-statement extraction (stage 1) and funder parsing (stage 2), trained on top of meta-llama/Llama-3.1-8B-Instruct.

Repository contents

sft/   # Stage A: supervised fine-tuning LoRA adapter
dapo/  # Stage B: DAPO (RL) LoRA adapter, applied after SFT is merged

Each folder is a standard PEFT LoRA: adapter_config.json + adapter_model.safetensors.

What the model does

Given a research article (or a chunk of one), the adapter produces funding metadata in two stages:

  1. Extract: copy any funding-acknowledgment sentences verbatim.
  2. Parse: take those sentences and emit structured funder / award records.

Both stages share the same weights; only the system prompt changes.

System prompts

Both stages must be called with the exact prompts below.

Stage 1: extract

You are a funding statement extractor. Given an article or text chunk, identify all funding acknowledgment statements. Return ONLY valid JSON in this exact format: {"statements": ["statement1", "statement2", ...]}. Each statement must be copied verbatim from the source text. If no funding statements exist, return {"statements": []}. Do not include any text outside the JSON object.

User message: the article text (or chunk) as-is.

Stage 2: parse

You are an expert at extracting structured funding metadata from academic papers. Given a funding statement, extract all funders and their associated awards. Return a JSON array of funder objects. Each funder has:
- "funder_name": string or null
- "awards": array of objects with "award_ids" (array of strings), "funding_scheme" (array of strings), and "award_title" (array of strings)
Return ONLY the JSON array, no other text.

User message:

Extract funding information from the following statement:

<funding statement here>

Usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "meta-llama/Llama-3.1-8B-Instruct"
REPO = "cometadata/funding-parsing-and-extraction-Llama-3.1-8B-Instruct-lora"

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="auto")

# Stage A: apply and merge the SFT adapter
model = PeftModel.from_pretrained(model, REPO, subfolder="sft")
model = model.merge_and_unload()

# Stage B: apply and merge the DAPO adapter on top of the SFT-merged model
model = PeftModel.from_pretrained(model, REPO, subfolder="dapo")
model = model.merge_and_unload()

model.eval()

The DAPO adapter's deltas are computed relative to the SFT-merged base, so the adapters must be applied in order: SFT first, then DAPO.

Running stage 1 (extract)

SYSTEM_EXTRACT = (
    "You are a funding statement extractor. Given an article or text chunk, "
    "identify all funding acknowledgment statements. Return ONLY valid JSON "
    'in this exact format: {"statements": ["statement1", "statement2", ...]}. '
    "Each statement must be copied verbatim from the source text. "
    'If no funding statements exist, return {"statements": []}. '
    "Do not include any text outside the JSON object."
)

def extract(article_text: str) -> str:
    messages = [
        {"role": "system", "content": SYSTEM_EXTRACT},
        {"role": "user", "content": article_text},
    ]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=512, do_sample=False)
    return tokenizer.decode(out[0, inputs.shape[1]:], skip_special_tokens=True)
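
extract() returns the raw model string, so callers still need to decode the JSON. A minimal sketch of a defensive decoder, assuming the output follows the {"statements": [...]} format from the stage 1 prompt. parse_statements and its optional source argument are hypothetical helpers, not part of this repo; the verbatim check is a strict substring test and will drop statements whose whitespace differs from the source.

```python
import json
from typing import List, Optional

def parse_statements(raw: str, source: Optional[str] = None) -> List[str]:
    """Decode stage 1 output; return [] if it is not the expected JSON object."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    statements = data.get("statements") if isinstance(data, dict) else None
    if not isinstance(statements, list):
        return []
    # Keep only non-empty strings.
    kept = [s for s in statements if isinstance(s, str) and s.strip()]
    if source is not None:
        # Assumption: enforce the "copied verbatim" rule with a plain substring test.
        kept = [s for s in kept if s in source]
    return kept
```

Typical use would be parse_statements(extract(article_text), source=article_text).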

Running stage 2 (parse)

SYSTEM_PARSE = (
    "You are an expert at extracting structured funding metadata from academic papers. "
    "Given a funding statement, extract all funders and their associated awards. "
    "Return a JSON array of funder objects. Each funder has:\n"
    '- "funder_name": string or null\n'
    '- "awards": array of objects with "award_ids" (array of strings), '
    '"funding_scheme" (array of strings), and "award_title" (array of strings)\n'
    "Return ONLY the JSON array, no other text."
)

def parse(funding_statement: str) -> str:
    user = f"Extract funding information from the following statement:\n\n{funding_statement}"
    messages = [
        {"role": "system", "content": SYSTEM_PARSE},
        {"role": "user", "content": user},
    ]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=1024, do_sample=False)
    return tokenizer.decode(out[0, inputs.shape[1]:], skip_special_tokens=True)
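
parse() also returns a raw string. The schema in the stage 2 prompt can be checked mechanically before the records are used downstream. A minimal sketch, assuming the model returns exactly the JSON array described in the prompt; validate_funders is a hypothetical helper, and the example record is illustrative data, not model output.

```python
import json

AWARD_KEYS = ("award_ids", "funding_scheme", "award_title")

def validate_funders(raw: str) -> list:
    """Decode stage 2 output and check it against the schema in the system prompt."""
    data = json.loads(raw)  # raises ValueError (JSONDecodeError) on malformed JSON
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of funder objects")
    for funder in data:
        if not isinstance(funder, dict):
            raise ValueError("each funder must be a JSON object")
        name = funder.get("funder_name")
        if name is not None and not isinstance(name, str):
            raise ValueError("funder_name must be a string or null")
        awards = funder.get("awards")
        if not isinstance(awards, list):
            raise ValueError("awards must be an array")
        for award in awards:
            if not isinstance(award, dict):
                raise ValueError("each award must be a JSON object")
            for key in AWARD_KEYS:
                value = award.get(key)
                if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
                    raise ValueError(f"{key} must be an array of strings")
    return data

# Illustrative input only (invented funder and award number).
example = (
    '[{"funder_name": "National Science Foundation", "awards": '
    '[{"award_ids": ["1234567"], "funding_scheme": [], "award_title": []}]}]'
)
funders = validate_funders(example)
```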

Chunked long-document pipeline

At training time, articles longer than ~7500 tokens are split into overlapping 7500-token chunks with a 1024-token overlap. For inference on long documents, apply stage 1 to each chunk, deduplicate and concatenate the extracted statements, then pass the joined text to stage 2 once per article.
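
The chunk, extract, dedupe, parse flow described above can be sketched as follows. This is one possible arrangement, not the training-time implementation. All three function names are hypothetical. Assumptions: extract_fn returns a list of statement strings (for example, extract() followed by JSON decoding), parse_fn takes the joined text, statements are joined with a newline, duplicates are detected after whitespace normalization, and stage 2 is skipped when nothing was extracted.

```python
from typing import Callable, List

def chunk_token_ids(ids: List[int], size: int = 7500, overlap: int = 1024) -> List[List[int]]:
    """Split token IDs into windows of `size` that share `overlap` tokens."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    if len(ids) <= size:
        return [ids]
    stride = size - overlap
    chunks, start = [], 0
    while True:
        chunks.append(ids[start:start + size])
        if start + size >= len(ids):  # this window reached the end of the document
            break
        start += stride
    return chunks

def dedupe_statements(statements: List[str]) -> List[str]:
    """Drop repeats (overlapping chunks see the same text), keeping first-seen order."""
    seen, unique = set(), []
    for s in statements:
        key = " ".join(s.split())  # assumption: compare on normalized whitespace
        if key and key not in seen:
            seen.add(key)
            unique.append(s)
    return unique

def run_long_document(
    chunk_texts: List[str],
    extract_fn: Callable[[str], List[str]],
    parse_fn: Callable[[str], str],
    sep: str = "\n",
) -> str:
    """Stage 1 per chunk, then a single stage 2 call per article."""
    statements: List[str] = []
    for text in chunk_texts:
        statements.extend(extract_fn(text))
    unique = dedupe_statements(statements)
    if not unique:
        return "[]"  # assumption: nothing to parse, so return an empty funder array
    return parse_fn(sep.join(unique))
```

With the tokenizer from Usage, tokenizer(text, add_special_tokens=False)["input_ids"] yields the IDs to chunk and tokenizer.decode(chunk) turns each window back into text for stage 1. Decoding may not reproduce the source text byte for byte, so a verbatim check against the original article can miss statements near chunk boundaries.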

Training recipe

  • Base: meta-llama/Llama-3.1-8B-Instruct
  • LoRA: r=64, α=128, dropout=0.05, target_modules=all-linear (q,k,v,o,gate,up,down proj)
  • Data: Adam's cometadata/funding-extraction-artifact-data-mix-grpo-mixed-reward dataset (articles + gold funding statements + structured funder metadata)

Stage A: SFT

  • 2 epochs, batch 2, grad-accum 16, bf16, max_length 8192
  • LR 1e-4, 2×H100
  • Mixture: extract prompts + parse prompts on rows with markdown and structured funder labels
  • Trains the model to format JSON correctly for both stages

Stage B: DAPO (RL)

  • Algorithm: GRPO variant with DAPO-style clipping (eps_clip_low=0.2, eps_clip_high=0.28, token-mean loss reduction)
  • Regularization: use_kl_loss=true (coef=0.001), entropy_loss_coef=0.01; essential to prevent the all-empty collapse we observed without regularization
  • Sampling: temperature=0.9, top_p=0.9, n_samples=16 per prompt, step-wise trajectories (10 extract rollouts + 1 parse rollout per article)
  • Optimizer: LR 2e-5 (constant with 20-step warmup), max_grad_norm=1.0
  • Reward: pure parse reward (funder F0.5 + award-id F0.5, soft ID matching; see below). An earlier variant added a per-chunk extract reward which induced reward hacking (always emitting {"statements": []}), so the chunk component was dropped in favor of KL / entropy regularization.
  • OCR pre-processing: zero-width / BOM characters are stripped from input markdown before chunking.
  • Compute: 8×H100, ~14h wall, 36 DAPO steps
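
For readers unfamiliar with the clipping settings above, the sketch below shows a clipped policy-gradient surrogate with asymmetric bounds (eps_clip_low=0.2, eps_clip_high=0.28) and token-mean reduction, together with GRPO-style group-normalized advantages. It is a plain-Python illustration of the general technique, not the training code. The function names are hypothetical, the KL and entropy terms are omitted, and the choice of population standard deviation is an assumption.

```python
import math
from statistics import mean, pstdev
from typing import List

def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one prompt's group of samples (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]

def clipped_token_mean_loss(
    logp_new: List[List[float]],   # per-token log-probs under the current policy
    logp_old: List[List[float]],   # per-token log-probs under the sampling policy
    advantages: List[float],       # one advantage per sequence
    eps_low: float = 0.2,
    eps_high: float = 0.28,
) -> float:
    """Asymmetric-clip surrogate averaged over all tokens in the batch."""
    total, n_tokens = 0.0, 0
    for new_seq, old_seq, adv in zip(logp_new, logp_old, advantages):
        for lp_new, lp_old in zip(new_seq, old_seq):
            ratio = math.exp(lp_new - lp_old)
            clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
            # Pessimistic (min) surrogate, as in PPO-style objectives.
            total += min(ratio * adv, clipped * adv)
            n_tokens += 1
    # Token-mean: divide by the total token count, not per sequence.
    return -total / max(n_tokens, 1)
```

Token-mean reduction weights every token equally, so long responses contribute more to the gradient than they would under a per-sequence mean.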

Reward formula (stage B)

r = 0.50 · funder_F0.5 + 0.40 · award_id_F0.5 + 0.10 · flat_award_id_F0.5

with soft ID matching (edit-distance-1 partial credit on award_id). Empty gold ↔ empty prediction gets r=1.0; mismatched empty prediction with non-empty gold gets r=0.0.
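
The weighting and the empty-case rules above can be sketched as follows. This is an illustration under assumptions, not the reward code used in training. f_beta is the standard F-beta score computed from counts; soft ID matching is left out, and the three component scores are taken as inputs because their exact definitions are not given here. The case of a non-empty prediction against empty gold is not specified above and simply falls through to the weighted sum.

```python
def f_beta(n_correct: float, n_pred: int, n_gold: int, beta: float = 0.5) -> float:
    """Standard F-beta from match counts; beta=0.5 weights precision over recall."""
    if n_correct <= 0 or n_pred == 0 or n_gold == 0:
        return 0.0
    precision = n_correct / n_pred
    recall = n_correct / n_gold
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)

def stage_b_reward(
    funder_f: float,
    award_id_f: float,
    flat_award_id_f: float,
    gold_empty: bool,
    pred_empty: bool,
) -> float:
    """Combine the three F0.5 components with the weights from the formula."""
    if gold_empty and pred_empty:
        return 1.0  # correctly predicting "no funders"
    if pred_empty and not gold_empty:
        return 0.0  # empty prediction when funders exist
    return 0.50 * funder_f + 0.40 * award_id_f + 0.10 * flat_award_id_f
```

For example, one correct funder out of two predicted against one gold funder gives precision 0.5 and recall 1.0, so f_beta(1, 2, 1) is 5/9, lower than the F1 of 2/3 because F0.5 penalizes the spurious prediction more heavily.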

Final training-time metrics (rollout average over last 5 steps)

metric                   value
reward/avg_raw_reward    ~0.90
reward/avg_pass_at_16    ~0.99
avg response length      ~30 tokens

Best single step: reward 0.941, pass@16 1.0 at step 33/36.

Citation / provenance

Training code: [github TODO: the rl/ directory of the funding-statement-identification repo].

Trained by the comet-data / funding extraction effort, 2026-04.
