OmniGene-4-SFT-v3-Merged
Full BF16 model with continued pre-training (CPT) and supervised fine-tuning (SFT) merged
This is the complete instruction-tuned model, not a LoRA adapter. You can load and use it directly, without the base Gemma-4 model.
Model Description
OmniGene-4-SFT-v3-Merged is the final instruction-tuned biological foundation model with:
- Base: Gemma-4-26B-A4B-Instruct + Bio CPT + Bio SFT
- Vocabulary: 290,048 tokens
- SFT data: 199,576 instruction examples across 8 task families
- Precision: BF16 (~50 GB)
Performance
| Benchmark | Accuracy |
|---|---|
| Standard Homology | 99.95% (6,000 pairs) |
| Remote Homology | 59.50% (2,000 pairs) |
| BixBench Knowledge | 93.66% |
Comparison with ESM-2 (650M) on remote homology: OmniGene-4 reaches 59.5% versus 50.5% for ESM-2 (+9 pp on the same 2,000 pairs).
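To make the 9 pp gap concrete, the short sketch below converts the two remote-homology accuracies into counts of correctly classified pairs. It assumes the reported percentages are exact on the 2,000-pair set; the variable names are illustrative only.

```python
# Convert the reported remote-homology accuracies into pair counts.
# Assumption: the percentages in the card are exact for the 2,000-pair set.
N_PAIRS = 2000

omnigene_correct = round(59.5 * N_PAIRS / 100)  # OmniGene-4
esm2_correct = round(50.5 * N_PAIRS / 100)      # ESM-2 (650M)
extra_correct = omnigene_correct - esm2_correct

print(omnigene_correct, esm2_correct, extra_correct)
```

Under that assumption, OmniGene-4 classifies 1,190 pairs correctly versus 1,010 for ESM-2, a difference of 180 pairs.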
Quick Start
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
# Load model (requires ~50GB GPU memory)
model = AutoModelForCausalLM.from_pretrained(
"dnagpt/OmniGene-4-SFT-v3-merged",
torch_dtype=torch.bfloat16,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("dnagpt/OmniGene-4-SFT-v3-merged")
# Example: Protein homology detection
prompt = """### Instruction:
Determine if the two protein sequences below are structurally related (homologous).
### Sequence 1:
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKRQTLGQHDFSAGEGLYTHMKALRPDEDRLSPLHSVYVDQWDWERVMGDGERQFSTLKSTVEAIWAGIKATEAAVSEEFGLAPFLPDQIHFVHSQELLSRYPDLDAKGRERAIAKDLGAVFLVGIGGKLSDGHRHDVRAPDYDDWSTPSELGHAGLNGDILVWNPVLEDAFELSSMGIRVDADTLKHQLALTGDEDRLELEWHQALLRGEMPQTIGGGIGQSRLTMLLLQLPHIGQVQAGVWPAAVRESVPSLL
### Sequence 2:
MKKFDRGEQVVKVKALPQAQFEEVHSLAKWKRQTLGQHDFSAGEGLYTHMKALRPDEDRLSPLHSVYVDQWDWERVMGDGERQFSTLKSTVEAIWAGIKATEAAVSEEFGLAPFLPDQIHFVHSQELLSRYPDLDAKGRERAIAKDLGAVFLVGIGGKLSDGHRHDVRAPDYDDWSTPSELGHAGLNGDILVWNPVLEDAFELSSMGIRVDADTLKHQLALTGDEDRLELEWHQALLRGEMPQTIGGGIGQSRLTMLLLQLPHIGQVQAGVWPAAVRESVPSLL
### Answer:
"""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=5, do_sample=False)
answer = tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
print(answer) # Expected: "Yes" or "No"
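Because generation is capped at a few new tokens, the decoded text may carry stray whitespace or punctuation around the label. A minimal sketch of a post-processing step follows; `parse_yes_no` is a hypothetical helper written for this example and is not part of the model repository or of `transformers`.

```python
from typing import Optional


def parse_yes_no(answer: str) -> Optional[bool]:
    """Map decoded model output to True ("Yes"), False ("No"), or None.

    Hypothetical helper: only the first whitespace-separated word is
    inspected, after stripping common punctuation and lowercasing.
    """
    words = answer.strip().lower().split()
    if not words:
        return None
    first = words[0].strip(".,:;!\"'")
    if first == "yes":
        return True
    if first == "no":
        return False
    return None  # the output was neither "Yes" nor "No"
```

In the Quick Start flow this would be called as `parse_yes_no(answer)`; a `None` result signals output that should be inspected by hand rather than counted as a prediction.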
For Lower Memory Usage
If you have limited GPU memory, use 4-bit quantization:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
"dnagpt/OmniGene-4-SFT-v3-merged",
quantization_config=bnb_config,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("dnagpt/OmniGene-4-SFT-v3-merged")
This reduces memory usage to ~13 GB.
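The BF16 and 4-bit figures can be roughly reproduced from the total parameter count of about 26B listed under Model Architecture. The sketch below is a back-of-the-envelope estimate for the weights only: it ignores activations, the KV cache, quantization constants, and any layers that the quantizer leaves in higher precision, so real usage will differ. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 26e9  # "~26B" from the card; approximate

bf16_gb = weight_memory_gb(TOTAL_PARAMS, 16)  # 2 bytes per parameter
nf4_gb = weight_memory_gb(TOTAL_PARAMS, 4)    # 0.5 bytes per parameter

print(f"BF16: ~{bf16_gb:.0f} GB, 4-bit: ~{nf4_gb:.0f} GB")
```

The estimates (about 52 GB and 13 GB) are in line with the ~50 GB and ~13 GB quoted in this card.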
Supported Tasks
The model is instruction-tuned on 8 task families:
| Task | Examples | Proportion |
|---|---|---|
| Protein homology | 49,894 | 25.0% |
| Literature (UniProtQA) | 39,915 | 20.0% |
| Mutation (MutaDescribe) | 29,936 | 15.0% |
| Cell biology | 29,936 | 15.0% |
| Molecule (SMILES) | 25,945 | 13.0% |
| Structure (3D) | 19,958 | 10.0% |
| DNA homology | 3,992 | 2.0% |
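As a consistency check, the proportions can be recomputed from the example counts. The sketch below copies the seven rows of the table; the dictionary name is illustrative.

```python
# Example counts copied from the table above.
sft_counts = {
    "Protein homology": 49_894,
    "Literature (UniProtQA)": 39_915,
    "Mutation (MutaDescribe)": 29_936,
    "Cell biology": 29_936,
    "Molecule (SMILES)": 25_945,
    "Structure (3D)": 19_958,
    "DNA homology": 3_992,
}

total = sum(sft_counts.values())
proportions = {
    task: round(100 * n / total, 1) for task, n in sft_counts.items()
}

print(total)
for task, pct in proportions.items():
    print(f"{task}: {pct}%")
```

The listed counts sum to 199,576, which matches the SFT data size given in the Model Description.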
Example Tasks
1. Protein Homology Detection
prompt = """### Instruction:
Determine if the two protein sequences below are structurally related (homologous).
### Sequence 1:
[protein sequence 1]
### Sequence 2:
[protein sequence 2]
### Answer:
"""
2. Protein Function Prediction
prompt = """### Instruction:
Predict the biological function of the following protein sequence.
### Protein Sequence:
[protein sequence]
### Answer:
"""
3. Mutation Effect Prediction
prompt = """### Instruction:
Describe the effect of the mutation on protein function.
### Wild-type:
[wild-type sequence]
### Mutant:
[mutant sequence]
### Answer:
"""
4. Cell Type Identification
prompt = """### Instruction:
Identify the cell type based on the gene expression profile.
### Gene Expression:
CD4: high, CD8: low, IL2: high
### Answer:
"""
5. SMILES to Properties
prompt = """### Instruction:
Predict the drug-likeness of the following molecule.
### SMILES:
CC(C)Cc1ccc(cc1)C(C)C(O)=O
### Answer:
"""
Model Architecture
- Layers: 30 transformer layers
- Experts: 128 experts per layer (top-8 routing)
- Hidden size: 2816
- Attention heads: 22
- Active parameters: ~3.8B per token
- Total parameters: ~26B
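The sparsity implied by these numbers can be worked out directly. The sketch below uses only the figures listed above; the parameter counts are the approximate values from this card, so the resulting fractions are approximate as well.

```python
# Figures taken from the architecture list above (approximate where marked).
EXPERTS_PER_LAYER = 128
EXPERTS_ROUTED = 8       # top-8 routing
ACTIVE_PARAMS = 3.8e9    # "~3.8B per token"
TOTAL_PARAMS = 26e9      # "~26B"

expert_fraction = EXPERTS_ROUTED / EXPERTS_PER_LAYER
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"Experts used per token per layer: {expert_fraction:.2%}")
print(f"Parameters active per token: {active_fraction:.1%}")
```

Each token is routed to 8 of 128 experts (6.25%) in a layer, while roughly 15% of all parameters are active per token. The active share is larger than the expert share because the active-parameter count also includes weights that are not per-expert, such as attention.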
Differences from LoRA Version
- LoRA version (dnagpt/OmniGene-4-SFT-v3): 1.9 GB, requires base Gemma-4 model
- Merged version (this repo): ~50 GB, standalone, no base model needed
Previous Checkpoints
- CPT only: https://huggingface.co/dnagpt/OmniGene-4-CPT-v2-merged
- LoRA adapters: https://huggingface.co/dnagpt/OmniGene-4-SFT-v3
Citation
@article{wang2026omnigene4,
title={OmniGene-4: A Unified Bio-Language MoE Model with Router-Level Interpretability},
author={Wang, Liang},
journal={bioRxiv},
year={2026}
}
Paper
Full paper: https://github.com/maris205/omnigene4
License
Apache 2.0
Contact
Liang Wang (wangliang.f@gmail.com)
School of Artificial Intelligence and Automation
Huazhong University of Science and Technology