OmniGene-4-SFT-v3-Merged

Full BF16 model with CPT (continued pre-training) and SFT (supervised fine-tuning) weights merged

This is the complete instruction-tuned model, not a LoRA adapter. You can load and use it directly without the base Gemma-4 model.

Model Description

OmniGene-4-SFT-v3-Merged is the final instruction-tuned biological foundation model with:

Base: Gemma-4-26B-A4B-Instruct + Bio CPT + Bio SFT
Vocabulary: 290,048 tokens
SFT data: 199,576 instruction examples across 8 task families
Precision: BF16 (~50 GB)

Performance

Benchmark	Accuracy
Standard Homology	99.95% (6,000 pairs)
Remote Homology	59.50% (2,000 pairs)
BixBench Knowledge	93.66%

Comparison with ESM-2 (650M) on remote homology: OmniGene-4 reaches 59.5% versus 50.5% for ESM-2 (+9 percentage points on the same 2,000 pairs).
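
To make the remote-homology comparison concrete, the short sketch below converts the two reported accuracies into implied counts of correctly classified pairs. It only re-derives numbers from the figures quoted above; the helper name `correct_pairs` is an illustrative assumption, and nothing is re-measured.

```python
# Figures copied from the Performance section; nothing here is re-measured.
N_PAIRS = 2_000
ACCURACY_PCT = {"OmniGene-4": 59.5, "ESM-2 (650M)": 50.5}


def correct_pairs(accuracy_pct: float, n_pairs: int) -> int:
    """Number of correctly classified pairs implied by an accuracy figure."""
    return round(accuracy_pct / 100 * n_pairs)


# Difference between the two accuracies, in percentage points.
gap_pp = ACCURACY_PCT["OmniGene-4"] - ACCURACY_PCT["ESM-2 (650M)"]

# Implied number of correct pairs for each model.
counts = {name: correct_pairs(acc, N_PAIRS) for name, acc in ACCURACY_PCT.items()}

print(f"gap: {gap_pp:.1f} percentage points")
for name, n in counts.items():
    print(f"{name}: {n} of {N_PAIRS} pairs correct")
```

Under these figures, the 9-point gap corresponds to 180 more pairs classified correctly out of 2,000.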

Quick Start

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model (requires ~50GB GPU memory)
model = AutoModelForCausalLM.from_pretrained(
    "dnagpt/OmniGene-4-SFT-v3-merged",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("dnagpt/OmniGene-4-SFT-v3-merged")

# Example: Protein homology detection
prompt = """### Instruction:
Determine if the two protein sequences below are structurally related (homologous).

### Sequence 1:
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKRQTLGQHDFSAGEGLYTHMKALRPDEDRLSPLHSVYVDQWDWERVMGDGERQFSTLKSTVEAIWAGIKATEAAVSEEFGLAPFLPDQIHFVHSQELLSRYPDLDAKGRERAIAKDLGAVFLVGIGGKLSDGHRHDVRAPDYDDWSTPSELGHAGLNGDILVWNPVLEDAFELSSMGIRVDADTLKHQLALTGDEDRLELEWHQALLRGEMPQTIGGGIGQSRLTMLLLQLPHIGQVQAGVWPAAVRESVPSLL

### Sequence 2:
MKKFDRGEQVVKVKALPQAQFEEVHSLAKWKRQTLGQHDFSAGEGLYTHMKALRPDEDRLSPLHSVYVDQWDWERVMGDGERQFSTLKSTVEAIWAGIKATEAAVSEEFGLAPFLPDQIHFVHSQELLSRYPDLDAKGRERAIAKDLGAVFLVGIGGKLSDGHRHDVRAPDYDDWSTPSELGHAGLNGDILVWNPVLEDAFELSSMGIRVDADTLKHQLALTGDEDRLELEWHQALLRGEMPQTIGGGIGQSRLTMLLLQLPHIGQVQAGVWPAAVRESVPSLL

### Answer:
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=5, do_sample=False)
answer = tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
print(answer)  # Expected: "Yes" or "No"
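
Because the generated text may carry whitespace, punctuation, or extra tokens, it can help to normalize it before using it as a label. The helper below is a minimal sketch; `parse_yes_no` is a hypothetical name, not part of the model or of transformers, and it assumes the answer starts with "Yes" or "No" as the Quick Start comment suggests.

```python
from typing import Optional


def parse_yes_no(text: str) -> Optional[bool]:
    """Map a generated answer to True ("Yes"), False ("No"), or None if unclear."""
    words = text.strip().lower().split()
    if not words:
        return None
    # Look only at the first word and drop surrounding punctuation.
    first = words[0].strip(".,:;!\"'")
    if first == "yes":
        return True
    if first == "no":
        return False
    return None


# Stand-in strings; in the Quick Start flow this would be parse_yes_no(answer).
for sample in ["Yes", " no.\n", "Unclear"]:
    print(repr(sample), "->", parse_yes_no(sample))
```

Returning `None` for anything other than a clear "Yes" or "No" keeps unexpected generations from being silently counted as either class.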

For Lower Memory Usage

If you have limited GPU memory, use 4-bit quantization:

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "dnagpt/OmniGene-4-SFT-v3-merged",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("dnagpt/OmniGene-4-SFT-v3-merged")

With 4-bit weights, memory usage drops to roughly 13 GB.
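
The two memory figures quoted in this card (~50 GB in BF16, ~13 GB in 4-bit) are consistent with a simple parameters-times-bits estimate. The sketch below shows that arithmetic; `weight_memory_gb` is an illustrative helper, and the estimate covers weights only, so activations, the KV cache, and quantization metadata are assumed to come on top.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 26e9  # "~26B" total parameters, as stated in this card

bf16_gb = weight_memory_gb(TOTAL_PARAMS, 16)  # BF16 stores 16 bits per weight
nf4_gb = weight_memory_gb(TOTAL_PARAMS, 4)    # NF4 stores about 4 bits per weight

print(f"BF16 weights: ~{bf16_gb:.0f} GB")
print(f"4-bit weights: ~{nf4_gb:.0f} GB")
```

Actual usage will differ somewhat, since the parameter count is rounded and some layers are typically kept in higher precision under quantization.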

Supported Tasks

The model is instruction-tuned on 8 task families:

Task	Examples	Proportion
Protein homology	49,894	25.0%
Literature (UniProtQA)	39,915	20.0%
Mutation (MutaDescribe)	29,936	15.0%
Cell biology	29,936	15.0%
Molecule (SMILES)	25,945	13.0%
Structure (3D)	19,958	10.0%
DNA homology	3,992	2.0%
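
As a quick consistency check, the proportions in the table can be recomputed from the example counts. The sketch below uses only the seven rows shown in the table; the dictionary name `TASK_EXAMPLES` is an illustrative assumption.

```python
# Example counts copied from the seven rows of the table above.
TASK_EXAMPLES = {
    "Protein homology": 49_894,
    "Literature (UniProtQA)": 39_915,
    "Mutation (MutaDescribe)": 29_936,
    "Cell biology": 29_936,
    "Molecule (SMILES)": 25_945,
    "Structure (3D)": 19_958,
    "DNA homology": 3_992,
}

total = sum(TASK_EXAMPLES.values())

# Share of each task as a percentage, rounded to one decimal place.
shares = {task: round(100 * n / total, 1) for task, n in TASK_EXAMPLES.items()}

print(f"total examples: {total:,}")
for task, pct in shares.items():
    print(f"{task}: {pct}%")
```

The listed counts sum to the 199,576 SFT examples stated in the model description, and the recomputed shares match the Proportion column.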

Example Tasks

1. Protein Homology Detection

prompt = """### Instruction:
Determine if the two protein sequences below are structurally related (homologous).

### Sequence 1:
[protein sequence 1]

### Sequence 2:
[protein sequence 2]

### Answer:
"""

2. Protein Function Prediction

prompt = """### Instruction:
Predict the biological function of the following protein sequence.

### Protein Sequence:
[protein sequence]

### Answer:
"""

3. Mutation Effect Prediction

prompt = """### Instruction:
Describe the effect of the mutation on protein function.

### Wild-type:
[wild-type sequence]

### Mutant:
[mutant sequence]

### Answer:
"""

4. Cell Type Identification

prompt = """### Instruction:
Identify the cell type based on the gene expression profile.

### Gene Expression:
CD4: high, CD8: low, IL2: high

### Answer:
"""

5. SMILES to Properties

prompt = """### Instruction:
Predict the drug-likeness of the following molecule.

### SMILES:
CC(C)Cc1ccc(cc1)C(C)C(O)=O

### Answer:
"""
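
All five example prompts share one layout: an instruction block, one or more labelled input blocks, and a trailing answer header. A small helper can assemble that layout so the spacing stays consistent across tasks. This is a minimal sketch; `build_prompt` is a hypothetical function, not something shipped with the model.

```python
from typing import Mapping


def build_prompt(instruction: str, fields: Mapping[str, str]) -> str:
    """Assemble a prompt in the '### Header:' layout used by the examples above."""
    parts = [f"### Instruction:\n{instruction}\n"]
    for header, value in fields.items():
        parts.append(f"### {header}:\n{value}\n")
    # The prompt ends right after the answer header so the model completes it.
    parts.append("### Answer:\n")
    return "\n".join(parts)


prompt = build_prompt(
    "Predict the drug-likeness of the following molecule.",
    {"SMILES": "CC(C)Cc1ccc(cc1)C(C)C(O)=O"},
)
print(prompt)
```

Field order follows the order of the mapping, so a homology prompt would pass "Sequence 1" before "Sequence 2".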

Model Architecture

Layers: 30 transformer layers
Experts: 128 experts per layer (top-8 routing)
Hidden size: 2816
Attention heads: 22
Active parameters: ~3.8B per token
Total parameters: ~26B
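
The routing and parameter figures above can be related with two ratios: the fraction of experts selected per token, and the fraction of all parameters that are active per token. The sketch below computes both from the listed values; the variable names are illustrative.

```python
# Values copied from the architecture list above.
N_EXPERTS = 128        # experts per layer
TOP_K = 8              # experts selected per token
TOTAL_PARAMS = 26e9    # "~26B"
ACTIVE_PARAMS = 3.8e9  # "~3.8B per token"

expert_fraction = TOP_K / N_EXPERTS           # share of experts used per token
param_fraction = ACTIVE_PARAMS / TOTAL_PARAMS  # share of parameters used per token

print(f"experts used per token: {expert_fraction:.2%}")
print(f"parameters active per token: {param_fraction:.1%}")
```

The active-parameter share (about 15%) is larger than the expert share (6.25%). One likely reason is that components outside the routed experts, such as attention and embeddings, are used for every token, although the card does not give a breakdown.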

Differences from LoRA Version

LoRA version (dnagpt/OmniGene-4-SFT-v3): 1.9 GB, requires base Gemma-4 model
Merged version (this repo): ~50 GB, standalone, no base model needed

Previous Checkpoints

CPT only: https://huggingface.co/dnagpt/OmniGene-4-CPT-v2-merged
LoRA adapters: https://huggingface.co/dnagpt/OmniGene-4-SFT-v3

Citation

@article{wang2026omnigene4,
  title={OmniGene-4: A Unified Bio-Language MoE Model with Router-Level Interpretability},
  author={Wang, Liang},
  journal={bioRxiv},
  year={2026}
}

Paper

Full paper: https://github.com/maris205/omnigene4

License

Apache 2.0

Contact

Liang Wang (wangliang.f@gmail.com)
School of Artificial Intelligence and Automation
Huazhong University of Science and Technology
