---
library_name: transformers
tags:
- biology
- protein-language-model
- protein-generation
- causal-lm
- mixture-of-experts
- transformers
- dpo
- alignment
- protein-design
---

# Model Card for ProtGPT3-1.3B-dpo

## Model Details

### Model Description

ProtGPT3-1.3B-dpo is a DPO-aligned single-sequence autoregressive protein language model for protein sequence generation. It is part of the ProtGPT3 family, an open-source suite of promptable and aligned protein language models for protein design.

The base ProtGPT3-1.3B model is a causal decoder-only protein language model using a Mixtral-style sparse Mixture-of-Experts architecture. It was trained for causal language modeling on protein sequences and supports generation in both N-to-C and C-to-N directions using special directional tokens.

This checkpoint was further aligned with Direct Preference Optimization (DPO) to improve generation quality. The alignment procedure shifts the model toward protein sequences with higher predicted structural confidence and reduced low-complexity content, while preserving sequence diversity.

- **Developed by:** Anonymous authors
- **Model type:** DPO-aligned autoregressive protein language model; causal decoder-only Mixture-of-Experts model
- **Language(s):** Protein sequences / amino-acid sequences
- **License:** More Information Needed
- **Finetuned from model:** `protgpt3/ProtGPT3-1.3B`

### Model Sources

- **Repository:** https://huggingface.co/protgpt3
- **Paper:** ProtGPT3: an Open-source family of Promptable and Aligned Protein Language Models
- **Code:** https://anonymous.4open.science/r/protGPT3-2053/README.md

## Uses

### Direct Use

ProtGPT3-1.3B-dpo can be used for single-sequence autoregressive protein generation. Users can generate protein sequences unconditionally or condition generation on an amino-acid prefix.

Compared with the base ProtGPT3-1.3B checkpoint, this DPO-aligned model is intended for users who want generations biased toward higher-complexity sequences with improved predicted structural confidence.

### Downstream Use

The model may be used in protein design workflows, computational screening pipelines, protein variant generation, and candidate sequence proposal. Generated sequences can be further evaluated with structure prediction, sequence-complexity filters, solubility filters, fitness predictors, or experimental validation.

### Out-of-Scope Use

The model should not be used as the sole basis for experimental, clinical, environmental, or safety-critical decisions. Generated sequences require downstream computational and experimental validation. The model is not guaranteed to generate functional, soluble, safe, synthesizable, or experimentally successful proteins.

The model should not be used for irresponsible or harmful biological design applications.

## Bias, Risks, and Limitations

ProtGPT3-1.3B-dpo learns from public protein sequence datasets and may reproduce biases present in those datasets. DPO alignment reduces low-complexity generations and improves generation quality as measured by the alignment objective (a binary criterion combining pLDDT and low-complexity-region (LCR) content; see the main manuscript). Even so, generated sequences may still be nonfunctional, unstable, insoluble, repetitive, biologically implausible, or unsuitable for a user’s intended application.

The DPO alignment objective uses predicted structural confidence and low-complexity filtering as proxy objectives. These proxies do not guarantee biological function, experimental success, safety, solubility, or manufacturability.

As with other generative protein models, ProtGPT3-1.3B-dpo may present dual-use risks if applied irresponsibly.

### Recommendations

Users should validate generated sequences with appropriate downstream computational and experimental methods. Recommended checks include sequence-complexity filtering, structure prediction, predicted confidence scoring, similarity searches against known proteins, solubility assessment, and task-specific functional evaluation.
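
As an illustration of the sequence-complexity check, here is a minimal sketch of a sliding-window Shannon-entropy filter. It is not the low-complexity method used in the paper; the function name `low_complexity_fraction`, the window size, and the entropy threshold are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def low_complexity_fraction(seq: str, window: int = 12, min_entropy: float = 2.2) -> float:
    """Fraction of residues covered by at least one low-entropy window.

    Hypothetical helper: window and min_entropy are illustrative defaults,
    not values taken from the ProtGPT3 paper.
    """
    seq = seq.upper()
    if len(seq) < window:
        return 0.0
    flagged = [False] * len(seq)
    for start in range(len(seq) - window + 1):
        counts = Counter(seq[start:start + window])
        # Shannon entropy (bits) of the residue composition in this window
        entropy = -sum((c / window) * math.log2(c / window) for c in counts.values())
        if entropy < min_entropy:
            for i in range(start, start + window):
                flagged[i] = True
    return sum(flagged) / len(seq)
```

A dedicated low-complexity tool is likely a better choice for production screening; this sketch only shows the shape of such a filter.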

## How to Get Started with the Model

Install dependencies:

```bash
pip install transformers accelerate torch
```

Load the model and tokenizer:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "protgpt3/ProtGPT3-1.3B-dpo"

# Load tokenizer for generation
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
    add_bos_token=True,
    add_eos_token=False,
    padding_side="left",
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

model.eval()
```

### Generate a protein sequence

```python
import torch

# Optionally provide a directional token ("1" for N-to-C, "2" for C-to-N)
# followed by an amino-acid prefix, e.g. "1MKT"
prompt = ""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        inputs["input_ids"],
        max_new_tokens=512,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )

sequence = tokenizer.decode(output_ids[0], skip_special_tokens=True)
# The output includes the directional token "1" (N-to-C) or "2" (C-to-N),
# which denotes the direction in which the sequence was generated.
print(sequence)
```

### Generate from an amino-acid prefix

```python
import torch

# Forward N-to-C generation uses the special token "1";
# use "2" instead for reverse C-to-N generation.
prefix = "1MKT"

inputs = tokenizer(prefix, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        inputs["input_ids"],
        max_new_tokens=256,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
    )

sequence = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(sequence)
```

### Batch generation

```python
import torch

prompts = [
    "",
    "1MKT", # N-to-C generation
    "2MAV", # C-to-N generation
]

inputs = tokenizer(
    prompts,
    return_tensors="pt",
    padding=True,
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],  # ignore left-padding in shorter prompts
        max_new_tokens=256,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.bos_token_id,
    )

sequences = tokenizer.batch_decode(output_ids, skip_special_tokens=True)

for sequence in sequences:
    print(sequence)
```

### Notes on generation

- Use this checkpoint for single-sequence protein generation.
- Sampling parameters such as `temperature` and `top_p` can strongly affect sequence quality and diversity.
- Lower temperatures may produce more conservative sequences.
- Higher temperatures may increase diversity but can also increase failure modes.
- Generated sequences should be validated before experimental use.
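
Because decoded outputs carry a leading directional token, a small post-processing step can be useful before validation. The sketch below separates the token from the residues and checks for canonical amino acids. The helper names are hypothetical, and the card does not state whether C-to-N outputs need to be reversed before use, so this sketch deliberately leaves residue order untouched.

```python
VALID_AA = set("ACDEFGHIKLMNPQRSTVWY")
# Directional tokens as described in this card: "1" = N-to-C, "2" = C-to-N
DIRECTION_TOKENS = {"1": "N-to-C", "2": "C-to-N"}


def split_direction(decoded: str):
    """Return (direction, residues) for a decoded generation.

    Hypothetical helper: direction is None when no directional token is found.
    Residue order is returned exactly as decoded.
    """
    text = decoded.strip().replace(" ", "")
    direction = DIRECTION_TOKENS.get(text[:1])
    residues = text[1:] if direction else text
    return direction, residues


def is_canonical(residues: str) -> bool:
    """True if the sequence is non-empty and uses only the 20 canonical residues."""
    return bool(residues) and set(residues) <= VALID_AA
```

For example, `split_direction("2MAV")` separates the `"2"` token from the residues `"MAV"`, which can then be passed to `is_canonical` before any downstream scoring.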

## Training Details

### Training Data

The base ProtGPT3-1.3B model was trained on publicly available protein sequence data from UniRef90 and the GigaRef subset of the Dayhoff Atlas. The 1.3B-parameter model used approximately 64M UniRef90 sequences and 120M GigaRef sequences, corresponding to approximately 43B training tokens.

The DPO alignment dataset was constructed from model-generated sequences. Sequences were scored using predicted structural confidence and low-complexity-region content. Sequences with pLDDT greater than 0.7 and fewer than 25% low-complexity residues were treated as positive examples, while the remaining generations were treated as negative examples.
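
The pass/fail rule above can be sketched as follows. The thresholds are the ones stated in this card; the function names and the `(sequence, plddt, lcr_fraction)` record layout are assumptions made for illustration, and pLDDT is assumed to be on a 0 to 1 scale.

```python
def passes_quality_filter(plddt, lcr_fraction, min_plddt=0.7, max_lcr=0.25):
    """True when a generation meets both criteria stated in this card."""
    return plddt > min_plddt and lcr_fraction < max_lcr


def split_generations(records):
    """Split (sequence, plddt, lcr_fraction) records into positives and negatives.

    Hypothetical record layout; any generation failing either criterion
    is treated as a negative example.
    """
    positives, negatives = [], []
    for seq, plddt, lcr in records:
        target = positives if passes_quality_filter(plddt, lcr) else negatives
        target.append(seq)
    return positives, negatives
```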

### Training Procedure

#### Preprocessing

For base-model pretraining, protein sequences were sampled from UniRef90 and GigaRef. During training, each sequence was assigned a generation direction, either N-to-C or C-to-N, with a special token prepended to indicate the direction.
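
A minimal sketch of direction assignment is shown below. It uses the directional tokens `"1"` and `"2"` described elsewhere in this card. Two things are assumptions rather than facts from the card: that C-to-N examples are stored in reversed residue order, and that the direction is sampled with probability `p_reverse` (a hypothetical parameter).

```python
import random


def add_direction(seq: str, rng: random.Random, p_reverse: float = 0.5) -> str:
    """Prepend a directional token to a training sequence.

    Assumption: C-to-N examples are written C-terminus first (reversed);
    the actual preprocessing code may differ.
    """
    if rng.random() < p_reverse:
        return "2" + seq[::-1]  # C-to-N
    return "1" + seq  # N-to-C
```

Passing an explicit `random.Random` instance keeps the assignment reproducible across runs.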

For DPO alignment, generated sequences were classified as pass or fail according to predicted pLDDT and low-complexity-region thresholds. Pass and fail sequences were clustered separately at 50% sequence identity and 0.8 coverage. Preference pairs were constructed by pairing positive and negative examples with matched sequence lengths, helping prevent the model from learning sequence length as a shortcut.
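
One way to build length-matched preference pairs is to bucket sequences by length and pair positives with negatives inside each bucket. The sketch below illustrates that idea only; the bucketing scheme, the `bin_width` parameter, and the function name are assumptions, and the clustering step is omitted.

```python
import random
from collections import defaultdict


def length_matched_pairs(positives, negatives, bin_width=10, seed=0):
    """Return (chosen, rejected) pairs whose lengths fall in the same bin.

    Hypothetical pairing scheme: sequences are grouped by len(seq) // bin_width,
    shuffled within each group, and paired one-to-one.
    """
    rng = random.Random(seed)
    buckets = defaultdict(lambda: ([], []))
    for seq in positives:
        buckets[len(seq) // bin_width][0].append(seq)
    for seq in negatives:
        buckets[len(seq) // bin_width][1].append(seq)

    pairs = []
    for key in sorted(buckets):
        pos, neg = buckets[key]
        rng.shuffle(pos)
        rng.shuffle(neg)
        # zip stops at the shorter list, so unmatched sequences are dropped
        pairs.extend(zip(pos, neg))
    return pairs
```

Because chosen and rejected sequences in every pair have similar lengths, length alone carries no preference signal.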

#### Training Hyperparameters

Base ProtGPT3-1.3B pretraining:

- **Training regime:** bfloat16
- **Architecture:** Mixtral-style sparse Mixture-of-Experts causal decoder
- **Maximum sequence length:** 1024
- **Optimizer:** AdamW
- **Learning rate:** 2.5e-4
- **Optimizer betas:** β1 = 0.9, β2 = 0.999
- **Weight decay:** 0.1
- **Gradient clipping:** 1.0
- **Gradient accumulation steps:** 4
- **Batch size:** 100
- **Router auxiliary loss coefficient:** 0.05
- **Number of training GPUs:** 16
- **Precision:** bfloat16

DPO alignment:

- **Alignment method:** Direct Preference Optimization
- **Positive-example criterion:** pLDDT > 0.7 and low-complexity regions < 25%
- **Negative-example criterion:** all other generated sequences
- **Pairing strategy:** length-matched positive and negative sequence pairs
- **Preference-data clustering:** 50% sequence identity, 0.8 coverage
- **Alignment objective:** shift the model toward higher-complexity, higher-pLDDT generations
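
For readers unfamiliar with DPO, the standard per-pair loss can be sketched in plain Python as below. The inputs are summed sequence log-probabilities under the policy being trained and under a frozen reference model. The value of `beta` is an assumption; this card does not report the one used for ProtGPT3.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single (chosen, rejected) pair.

    beta is a hypothetical value; the card does not state it.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) computed as softplus(-margin) for numerical stability
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

The loss equals `log(2)` when the policy and reference agree, and decreases as the policy raises the likelihood of the chosen sequence relative to the rejected one.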

#### Speeds, Sizes, Times

- **Model size:** 1.3B parameters
- **Base-model training tokens:** Approximately 43B
- **Hardware:** NVIDIA H100 GPUs

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

ProtGPT3 models were evaluated on held-out protein sequences with at most 50% sequence identity to the training set. The model family was also benchmarked on ProteinGym and assessed for generation quality across sampling settings.

The DPO-aligned models were evaluated on generated sequences and on naturally occurring protein sequences from PDB-derived data to assess whether the alignment objective generalized beyond the model-generated preference data.

#### Factors

Evaluation considered model scale, sampling temperature, nucleus sampling parameter `top_p`, sequence direction, predicted structure confidence, low-complexity-region content, and sequence diversity.

#### Metrics

Evaluation included:

- Validation perplexity
- ProteinGym Spearman correlation
- Predicted pLDDT
- Fraction of low-complexity generations
- Sequence diversity
- Fraction of sequences passing the pLDDT and low-complexity filters
- Intrinsic reward discrimination between high-quality and low-quality natural sequences

### Results

DPO alignment improved generation quality across the ProtGPT3 single-sequence model family. Alignment reduced the fraction of low-complexity generations while preserving high predicted structural confidence and sequence diversity.

For the 1.3B-scale model, DPO alignment increased the pass rate of generated sequences under the pLDDT and low-complexity criteria. The paper reports that alignment reduced low-complexity generations by more than 20% for the 112M and 1B-scale models, while preserving diversity and causing little change in held-out pretraining perplexity.

#### Summary

ProtGPT3-1.3B-dpo is the DPO-aligned version of ProtGPT3-1.3B. It is intended for users who want a single-sequence protein generator biased toward higher-complexity and higher-predicted-confidence generations compared with the base checkpoint.

## Model Examination

ProtGPT3-1.3B-dpo was examined as part of the ProtGPT3 alignment study. The DPO alignment pipeline was designed to reduce repetitive or low-complexity protein generations while maintaining diversity and preserving base-model knowledge.

The aligned models were also examined using an intrinsic reward discrimination analysis on real protein sequences, where aligned models assigned systematically higher intrinsic rewards to high-quality sequences than to low-quality sequences.
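
A sketch of this kind of analysis is given below, assuming that "intrinsic reward" refers to the implicit DPO reward, that is, `beta` times the log-likelihood ratio between the aligned and reference models. The exact procedure in the paper is not described in this card, so the function names, the `beta` value, and the pairwise summary statistic are assumptions.

```python
def implicit_reward(policy_logp: float, ref_logp: float, beta: float = 0.1) -> float:
    """Implicit DPO reward of a sequence (beta is a hypothetical value)."""
    return beta * (policy_logp - ref_logp)


def discrimination_rate(high_rewards, low_rewards) -> float:
    """Fraction of (high, low) pairs in which the high-quality sequence scores higher."""
    total = len(high_rewards) * len(low_rewards)
    if total == 0:
        raise ValueError("both reward lists must be non-empty")
    wins = sum(h > l for h in high_rewards for l in low_rewards)
    return wins / total
```

A rate near 0.5 would indicate no discrimination, while values approaching 1.0 indicate that high-quality sequences consistently receive higher rewards.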

## Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator.

- **Hardware Type:** NVIDIA H100 GPUs
- **Hours used:** More Information Needed
- **Cloud Provider:** More Information Needed
- **Compute Region:** More Information Needed
- **Carbon Emitted:** More Information Needed

## Technical Specifications

### Model Architecture and Objective

ProtGPT3-1.3B-dpo is a decoder-only autoregressive protein language model using a Mixtral-style sparse Mixture-of-Experts architecture. The base model was trained with a causal language modeling objective on protein sequences.

The DPO-aligned checkpoint was optimized to prefer generated sequences with higher predicted structural confidence and lower low-complexity-region content.

### Compute Infrastructure

#### Hardware

The base ProtGPT3-1.3B model was trained on NVIDIA H100 GPUs.

#### Software

Training used FlashAttention-2, online mini-batch packing, Liger Kernel, and DeepSpeed.

## Citation

**BibTeX:**

```bibtex
@article{protgpt3,
  title={ProtGPT3: an Open-source family of Promptable and Aligned Protein Language Models},
  author={Anonymous Authors},
  year={2026}
}
```

**APA:**

Anonymous Authors. (2026). *ProtGPT3: an Open-source family of Promptable and Aligned Protein Language Models*.

## Glossary

- **DPO:** Direct Preference Optimization, an alignment method that optimizes a model using preference pairs.
- **pLDDT:** A predicted local structure confidence score.
- **Low-complexity region:** A repetitive or compositionally simple sequence region.
- **Causal language modeling:** Autoregressive prediction of the next token given previous tokens.
- **Mixture-of-Experts:** A sparse neural architecture using multiple expert subnetworks.
- **N-to-C / C-to-N:** Protein sequence generation directions from N-terminus to C-terminus or C-terminus to N-terminus.

## More Information

All models and code are released through the Hugging Face ecosystem and accompanying code repository.

## Model Card Authors

Anonymous authors

## Model Card Contact

More Information Needed