Dataset: cometadata/funding-extraction-sft-data
cometadata/funding-parsing-lora-Llama_3.1_8B-instruct-ep2-r64-a32-grpo is a LoRA adapter for extracting structured funding information from funding statements in scholarly works.
train.jsonl) + 2,531 synthetic examples (synthetic.jsonl) upsampled 2x = 6,378 total
See https://github.com/cometadata/funding-metadata-enrichment/tree/main/train for the full training code.
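The totals in the data mix above determine the size of its first term, whose count is cut off in this copy. A quick arithmetic check, assuming the 2x upsampling applies only to the synthetic set (the variable names are illustrative, not taken from the training code):

```python
# Data-mix arithmetic using the figures stated above.
# Assumption: only the synthetic examples are upsampled 2x.
SYNTHETIC_EXAMPLES = 2_531
UPSAMPLE_FACTOR = 2
TOTAL_EXAMPLES = 6_378

upsampled_synthetic = SYNTHETIC_EXAMPLES * UPSAMPLE_FACTOR
# Whatever remains of the total must come from train.jsonl.
implied_real_examples = TOTAL_EXAMPLES - upsampled_synthetic

print(f"synthetic after upsampling: {upsampled_synthetic}")
print(f"implied examples in train.jsonl: {implied_real_examples}")
```

If the upsampling also covered train.jsonl, the stated total would not work out to whole numbers, which supports the assumption.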
Reward: gated, hierarchical matching on funder. The Hungarian algorithm pairs predicted and reference funders 1:1, and subordinate fields are compared only within a matched pair.
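The 1:1 funder pairing can be illustrated with a small standard-library sketch. This is not the project's reward code (see the linked repository for that): the Jaccard name similarity, the `threshold` gate, and both function names are assumptions made for illustration, and an exhaustive search over assignments stands in for the Hungarian algorithm. On small inputs it finds the same optimal pairing; for larger inputs a real implementation such as `scipy.optimize.linear_sum_assignment` would be the usual choice.

```python
from itertools import permutations


def name_similarity(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two funder names (illustrative)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def match_funders(pred, gold, threshold=0.5):
    """Return (pred_index, gold_index) pairs from the best 1:1 assignment.

    Exhaustive search replaces the Hungarian algorithm here. Pairs whose
    similarity falls below `threshold` are gated out (hypothetical gate).
    """
    # Assign each item of the shorter list to a distinct item of the longer one.
    swap = len(pred) > len(gold)
    small, large = (gold, pred) if swap else (pred, gold)

    best_score, best_pairs = -1.0, []
    for perm in permutations(range(len(large)), len(small)):
        pairs = [(i, j, name_similarity(small[i], large[j])) for i, j in enumerate(perm)]
        score = sum(s for _, _, s in pairs)
        if score > best_score:
            best_score, best_pairs = score, pairs

    # Gate: only sufficiently similar pairs go on to subordinate-field scoring.
    kept = [(i, j) for i, j, s in best_pairs if s >= threshold]
    return [(j, i) if swap else (i, j) for i, j in kept]


pred = ["National Science Foundation", "European Research Council", "Acme Corp"]
gold = ["European Research Council", "National Science Foundation"]
result = match_funders(pred, gold)
print(sorted(result))
```

Under this reading, award IDs and other subordinate fields would then be scored only for the returned pairs, so an unmatched funder such as "Acme Corp" earns no credit for its awards.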
Usage:
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

ADAPTER = "cometadata/funding-parsing-lora-Llama_3.1_8B-instruct-ep2-r64-a32-grpo"

# Load the base model, then attach the LoRA adapter on top of it.
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
model = PeftModel.from_pretrained(base_model, ADAPTER)
tokenizer = AutoTokenizer.from_pretrained(ADAPTER)

messages = [
    {"role": "system", "content": "Extract funding information from the text. Return a JSON array of funders."},
    {"role": "user", "content": "Extract funding information from the following statement:\n\nThis work was supported by the National Science Foundation (Grant No. 2045678) and the European Research Council (ERC-2021-StG-101039567)."},
]

# Build the chat prompt; return_dict=True also returns the attention mask.
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_dict=True, return_tensors="pt")
# temperature only takes effect when sampling is enabled.
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.1)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
Expected output:
[
  {
    "funder_name": "National Science Foundation",
    "awards": [
      {
        "award_ids": ["2045678"],
        "funding_scheme": [],
        "award_title": []
      }
    ]
  },
  {
    "funder_name": "European Research Council",
    "awards": [
      {
        "award_ids": ["ERC-2021-StG-101039567"],
        "funding_scheme": [],
        "award_title": []
      }
    ]
  }
]
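Since the model returns its answer as text, downstream code will usually want to parse it and check its shape before use. A minimal sketch, assuming the schema shown in the expected output above; the function name `parse_funders` and the exact validity rules are illustrative and may differ from whatever the "Format Valid" metric checks:

```python
import json

# Keys every award object carries in the expected output above.
REQUIRED_AWARD_KEYS = {"award_ids", "funding_scheme", "award_title"}


def parse_funders(raw: str):
    """Parse model output; return the funder list, or None if the shape is wrong."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, list):
        return None
    for funder in data:
        if not isinstance(funder, dict) or "funder_name" not in funder:
            return None
        awards = funder.get("awards")
        if not isinstance(awards, list):
            return None
        for award in awards:
            if not isinstance(award, dict) or not REQUIRED_AWARD_KEYS.issubset(award):
                return None
            # Each award field is a list in the expected output.
            if not all(isinstance(award[key], list) for key in REQUIRED_AWARD_KEYS):
                return None
    return data


sample = """[{"funder_name": "National Science Foundation",
  "awards": [{"award_ids": ["2045678"], "funding_scheme": [], "award_title": []}]}]"""
funders = parse_funders(sample)
print(funders[0]["funder_name"], funders[0]["awards"][0]["award_ids"])
```

Returning `None` rather than raising keeps batch processing simple: malformed generations can be counted or retried without wrapping every call in a `try` block.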
Trained on Tinker by Thinking Machines Lab. Metrics by training step:
| Step | Eval Reward | Funder F0.5 | Award F0.5 | Format Valid | KL |
|---|---|---|---|---|---|
| 0 | 0.944 | 0.961 | 0.926 | 99.7% | 0.0005 |
| 10 | 0.937 | 0.954 | 0.920 | 99.7% | 0.0015 |
| 20 | 0.950 | 0.969 | 0.932 | 100% | 0.0020 |
| 30 | 0.954 | 0.966 | 0.942 | 100% | 0.0025 |
| 40 | 0.951 | 0.971 | 0.931 | 100% | 0.0013 |
| 50 | 0.938 | 0.956 | 0.919 | 100% | 0.0051 |
| 60 | 0.949 | 0.967 | 0.931 | 100% | 0.0047 |
| 70 | 0.954 | 0.968 | 0.939 | 100% | 0.0025 |
| 80 | 0.951 | 0.962 | 0.940 | 100% | 0.0021 |
| 90 | 0.945 | 0.959 | 0.931 | 100% | 0.0026 |
| 100 | 0.943 | 0.963 | 0.923 | 99.7% | 0.0016 |
| 110 | 0.945 | 0.961 | 0.929 | 99.5% | 0.0036 |
| 120 | 0.950 | 0.964 | 0.936 | 99.5% | 0.0028 |
| 130 | 0.961 | 0.974 | 0.948 | 100% | 0.0026 |
| 140 | 0.955 | 0.973 | 0.938 | 100% | 0.0020 |
| 150 | 0.957 | 0.972 | 0.942 | 100% | 0.0012 |
| 160 | 0.947 | 0.963 | 0.931 | 99.7% | 0.0034 |
| 170 | 0.951 | 0.957 | 0.944 | 100% | 0.0023 |
| 180 | 0.944 | 0.960 | 0.928 | 100% | 0.0013 |
| 190 | 0.933 | 0.956 | 0.910 | 99.5% | 0.0004 |
| 200 | 0.942 | 0.961 | 0.922 | 99.7% | 0.0017 |
| 210 | 0.957 | 0.967 | 0.947 | 100% | 0.0014 |
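The F0.5 columns in the table use the F-beta score with beta = 0.5, which weights precision more heavily than recall. A short sketch of the standard formula follows; how precision and recall are counted for funders and awards is defined by the training code, not here, and the function name is illustrative:

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score: (1 + b^2) * P * R / (b^2 * P + R)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid dividing by zero when nothing was matched
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# With beta = 0.5, high precision and low recall scores better than the reverse.
precise = f_beta(precision=1.0, recall=0.5)
exhaustive = f_beta(precision=0.5, recall=1.0)
print(f"{precise:.3f} vs {exhaustive:.3f}")
```

In this setting the metric penalizes a spurious funder or award ID more than a missed one.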
Base model: meta-llama/Llama-3.1-8B