SoilFM Language Tower — Qwen2.5-14B Literature CPT

A domain-adapted large language model for soil science and soil microbiology, created by continued pretraining of Qwen2.5-14B-Instruct on 200,000 curated soil science text passages.

This model is the Language Tower component of SoilFM2, a multi-modal foundation model for soil microbiome analysis developed at Lawrence Berkeley National Laboratory.

Model Details

| Property | Value |
|---|---|
| Base model | Qwen/Qwen2.5-14B-Instruct (14.2B parameters) |
| Method | Continued pretraining via QLoRA (4-bit NF4) |
| Format | Full merged model (LoRA weights merged into base) |
| Precision | BF16 |
| Context length | 32,768 tokens |
| Size on disk | ~28 GB |
| LoRA adapter | Also available at northenlab/soilfm-qwen2.5-14b-qlora (263 MB) |

Intended Uses

  • Generating explanations of soil microbial processes, rhizosphere ecology, and plant-microbe interactions
  • Providing domain-grounded context within the SoilFM2 multi-modal pipeline (prebiotic recommendation, substrate preference prediction)
  • Serving as a soil-science-aware backbone for downstream fine-tuning or RAG systems
  • Research and educational applications in soil microbiology

Training Data

The training corpus was assembled from four sources of soil science domain knowledge, stratified-sampled to 200,000 training examples and 10,000 validation examples (seed = 42):

| Source | Description | Proportion | Train | Val |
|---|---|---|---|---|
| PubMed Central | Full-text soil microbiology papers (39,853 articles) | 55% | 110,000 | 5,500 |
| Wikipedia | Soil science articles | 20% | 40,000 | 2,000 |
| USDA Soil Survey Manual | Official USDA technical reference | 10% | 20,000 | 1,000 |
| Wikipedia General Biology | Broad biology context to prevent catastrophic forgetting | 15% | 30,000 | 1,500 |
| Total | | 100% | 200,000 | 10,000 |
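
The per-source quotas in the table can be reproduced with a seeded stratified draw. The following is a minimal sketch, not the project's actual sampling code: the source keys and the `stratified_sample` helper are hypothetical names, and the fallback to sampling with replacement for sources smaller than their quota is an assumption, since the card does not say how such sources were handled.

```python
import random

# Per-source training quotas copied from the table above.
# The dictionary keys are hypothetical labels chosen for this sketch.
TRAIN_QUOTAS = {
    "pubmed_central": 110_000,
    "wikipedia_soil": 40_000,
    "usda_soil_survey_manual": 20_000,
    "wikipedia_general_biology": 30_000,
}


def stratified_sample(chunks_by_source, quotas, seed=42):
    """Draw a fixed number of chunks per source, then shuffle the result."""
    rng = random.Random(seed)  # seed = 42 as stated in the card
    sample = []
    for source, quota in quotas.items():
        pool = chunks_by_source[source]
        if len(pool) >= quota:
            picked = rng.sample(pool, quota)  # without replacement
        else:
            # Assumption: oversample small sources with replacement.
            picked = rng.choices(pool, k=quota)
        sample.extend({"text": text, "source": source} for text in picked)
    rng.shuffle(sample)
    return sample
```

Using a dedicated `random.Random(seed)` instance keeps the draw reproducible without touching the global random state.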

Text was chunked to 1,024 tokens with 100-token overlap. The full corpus contained 329M tokens across 388,563 chunks; the 200K stratified subsample was used for this training run.
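
A sliding window over token IDs is one common way to produce fixed-size chunks with overlap. The sketch below assumes the text has already been tokenized (the card does not say which tokenizer was used for chunking), and `chunk_token_ids` is a hypothetical helper, not a function from the project repository.

```python
def chunk_token_ids(token_ids, chunk_size=1024, overlap=100):
    """Split a token sequence into windows of chunk_size tokens,
    where each window repeats the last `overlap` tokens of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - overlap  # 924 new tokens per chunk with the defaults
    chunks = []
    start = 0
    while start < len(token_ids):
        chunks.append(token_ids[start:start + chunk_size])
        if start + chunk_size >= len(token_ids):
            break  # the window already reached the end of the document
        start += stride
    return chunks
```

With the default settings each chunk after the first contributes 924 new tokens, so the final chunk of a document is usually shorter than 1,024 tokens.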

Preprocessing

  • PubMed Central articles retrieved via BioC JSON API, cleaned of XML artifacts
  • Soil Survey Manual cleaned of page headers, footers, and index content (57 of 435 chunks removed)
  • All sources standardized to JSONL format with text and source fields
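
As an illustration of the standardized format, the sketch below writes and reads JSONL records that carry only the `text` and `source` fields named above. The helper names are hypothetical, and the project's own preprocessing code may differ.

```python
import json


def write_jsonl(records, path):
    """Write one JSON object per line, keeping only the text and source fields."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            row = {"text": rec["text"], "source": rec["source"]}
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```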

Training Procedure

Configuration

| Parameter | Value |
|---|---|
| Quantization | 4-bit NF4 (BitsAndBytes, double quantization) |
| LoRA rank | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| Rank-stabilized LoRA (rsLoRA) | Yes |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable parameters | ~263M (1.85% of total) |
| Optimizer | PagedAdamW8bit |
| Learning rate | 2e-5 (cosine schedule, 10% warmup) |
| Effective batch size | 128 (micro-batch 2, gradient accumulation 64) |
| Max gradient norm | 1.0 |
| Weight decay | 0.01 |
| Precision | BF16 mixed precision |
| Flash Attention 2 | Enabled |
| Epochs | 1 |
| Total steps | 1,500 |
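
For readers who want to set up a comparable run, the quantization and LoRA rows of the table map onto the `transformers` and `peft` configuration objects roughly as follows. This is a sketch under stated assumptions rather than an excerpt from the training script: `bias="none"` and the BF16 compute dtype for the quantized layers are assumed, and the remaining values are copied from the table.

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization with double quantization, as listed in the table.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumption: matches BF16 mixed precision
)

# LoRA settings from the table. With use_rslora=True the adapter update is
# scaled by lora_alpha / sqrt(r) = 32 / 4 = 8 rather than lora_alpha / r = 2.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    use_rslora=True,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none",  # assumption: not stated in the card
    task_type="CAUSAL_LM",
)
```

The `bnb_config` object would be passed as `quantization_config` when loading the base model, and `lora_config` to `peft.get_peft_model`.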

Infrastructure

  • GPU: NVIDIA A100 PCIe 80GB (RunPod)
  • Training time: ~42 hours
  • Peak VRAM: 54 GB (67% utilization)

Training Script

Training used a hand-written PyTorch loop rather than the HuggingFace Trainer, for compatibility and finer control over the training process. The script is available in the project repository as train_soilfm_cpt_MANUAL.py.
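
A manual loop has to compute the learning-rate schedule and the accumulation arithmetic itself. The sketch below shows one way to do so with the values from the configuration table. It is not taken from train_soilfm_cpt_MANUAL.py: the linear warmup shape and the decay to zero are assumptions, and `lr_at` is a hypothetical helper.

```python
import math

PEAK_LR = 2e-5
TOTAL_STEPS = 1_500
WARMUP_STEPS = TOTAL_STEPS // 10  # 10% warmup -> 150 optimizer steps
MICRO_BATCH = 2
GRAD_ACCUM = 64


def lr_at(step):
    """Learning rate for a 0-indexed optimizer step.

    Assumptions: linear warmup to PEAK_LR, then cosine decay toward zero.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))


# One optimizer step consumes MICRO_BATCH * GRAD_ACCUM sequences.
effective_batch = MICRO_BATCH * GRAD_ACCUM        # 2 * 64 = 128
sequences_seen = effective_batch * TOTAL_STEPS    # 128 * 1,500 = 192,000
```

In a loop of this kind the optimizer and scheduler step once every `GRAD_ACCUM` micro-batches, after gradients have been clipped to the maximum norm.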

Results

Validation Loss

| Step | Validation Loss |
|---|---|
| 500 | 1.7369 |
| 1,000 | 1.6281 |
| 1,500 | 1.6130 |

Validation loss fell by about 7.1% between the first and last logged checkpoints (1.7369 at step 500 to 1.6130 at step 1,500). Loss was still decreasing at the end of training, with no signs of overfitting. Gradient norms remained stable in the 0.2–0.8 range throughout.
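
The checkpoint losses can also be read as perplexities. The short sketch below converts the three logged values and computes the relative drop between the step-500 and step-1,500 checkpoints. It assumes the reported loss is the mean per-token cross-entropy in nats, which the card does not state explicitly.

```python
import math

# Validation losses copied from the table above (step -> loss).
val_loss = {500: 1.7369, 1000: 1.6281, 1500: 1.6130}


def perplexity(loss):
    """Perplexity implied by a mean per-token cross-entropy in nats."""
    return math.exp(loss)


def relative_drop(start, end):
    """Fractional decrease from start to end."""
    return (start - end) / start


for step, loss in val_loss.items():
    print(f"step {step}: loss {loss:.4f}, perplexity {perplexity(loss):.2f}")

drop = relative_drop(val_loss[500], val_loss[1500])
print(f"relative loss drop, step 500 to 1,500: {drop:.1%}")
```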

Qualitative Evaluation

Prompt: "The role of root exudates in shaping rhizosphere microbial communities involves"

Output: The model produces coherent, technically accurate continuations using appropriate domain terminology (root exudates, rhizosphere, primary/secondary metabolites, phytohormones), demonstrating successful domain adaptation.

Usage

Direct Loading (Recommended)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "northenlab/soilfm-qwen2.5-14b-literature-cpt",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "northenlab/soilfm-qwen2.5-14b-literature-cpt"
)

prompt = "The role of mycorrhizal fungi in soil nutrient cycling"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

With vLLM (Production)

python3 -m vllm.entrypoints.openai.api_server \
  --model northenlab/soilfm-qwen2.5-14b-literature-cpt \
  --host 0.0.0.0 --port 8001
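
Once the server is running, it can be queried over vLLM's OpenAI-compatible HTTP API. The sketch below uses only the Python standard library and assumes the server is reachable at localhost on port 8001, as in the command above. The helper names and the sampling parameters are illustrative choices, not recommendations from the model authors.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8001/v1"  # assumption: server runs on the same machine
MODEL = "northenlab/soilfm-qwen2.5-14b-literature-cpt"


def build_completion_request(prompt, max_tokens=256, temperature=0.7):
    """Build a POST request for the /v1/completions endpoint."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated continuation text."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

Calling `complete("The role of mycorrhizal fungi in soil nutrient cycling")` returns the model's continuation as a string, provided the server is up.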

LoRA Adapter Only

If you prefer to load the adapter separately (e.g., for 4-bit inference):

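import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

# Load the base model in 4-bit NF4, then attach the SoilFM LoRA adapter
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-14B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "northenlab/soilfm-qwen2.5-14b-qlora")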

Part of SoilFM2

This Language Tower works alongside other SoilFM2 components:

| Component | Description | HuggingFace |
|---|---|---|
| Language Tower (this model) | Domain-adapted LLM | northenlab/soilfm-qwen2.5-14b-literature-cpt |
| Graph Tower | Heterogeneous GNN on 2.39M-node knowledge graph | northenlab/soilfm2-graph-tower-joint-v0.1 |
| BSPR | Bayesian substrate preference model (AUC 0.94) | |

Together these components power a prebiotic recommendation pipeline that takes 16S microbiome profiles as input and suggests soil amendments to steer community function.

Use Restrictions

This model is intended for research and non-commercial use only. The training corpus includes PubMed Central Open Access articles under various Creative Commons licenses, some of which may carry non-commercial (CC BY-NC) terms. Users should ensure their use complies with the underlying data licenses.

The base model (Qwen2.5-14B-Instruct) is released under the Apache 2.0 license.

Limitations

  • Trained for 1 epoch on a 200K subsample of the full 667K corpus; additional training may further improve performance
  • Domain adaptation was evaluated primarily via validation loss and qualitative generation; systematic benchmarking on soil science Q&A tasks is ongoing
  • The model inherits the base Qwen2.5-14B-Instruct capabilities and limitations
  • Not intended for medical, agricultural, or regulatory decision-making without expert review

Citation

@misc{soilfm-language-tower-2025,
  title={SoilFM Language Tower: Domain Adaptation of Qwen2.5-14B for Soil Science},
  author={Northen Lab, Lawrence Berkeley National Laboratory},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/northenlab/soilfm-qwen2.5-14b-literature-cpt},
  note={Continued pretraining on 200K soil science literature examples via QLoRA}
}

License

Apache 2.0 (inherited from the base Qwen2.5-14B-Instruct model). Training data includes PubMed Central Open Access articles under various CC licenses — see Use Restrictions above.
