Text Generation
Transformers
Safetensors
English
llama
Llama-3.1
instruct
finetune
reasoning
hybrid-mode
chatml
function calling
tool use
json mode
structured outputs
atropos
dataforge
long context
roleplaying
chat
conversational
text-generation-inference
2-bit
exl3
Update README.md
README.md
CHANGED
@@ -18,7 +18,8 @@ tags:
 - long context
 - roleplaying
 - chat
-base_model:
 library_name: transformers
 widget:
 - example_title: Hermes 4
@@ -28,194 +29,54 @@ widget:
 You are Hermes 4, a capable, neutrally-aligned assistant. Prefer concise,
 correct answers.
 - role: user
-content:
-Explain the difference between BFS and DFS to a new CS student.
 model-index:
 - name: Hermes-4-Llama-3.1-405B
 results: []
 ---
 
-# Hermes 4 — Llama-3.1 405B
-
-
-
-## Model Description
-
-Hermes 4 405B is a frontier, hybrid-mode **reasoning** model based on Llama-3.1-405B by Nous Research that is aligned to **you**.
-
-Read the Hermes 4 technical report here: <a href="https://arxiv.org/abs/2508.18255">Hermes 4 Technical Report</a>
-
-Chat with Hermes in Nous Chat: https://chat.nousresearch.com
-
-Training highlights include a newly synthesized post-training corpus emphasizing verified reasoning traces, massive improvements in math, code, STEM, logic, creativity, and format-faithful outputs, while preserving general assistant quality and broadly neutral alignment.
-
-## What’s new vs Hermes 3
-
-- **Post-training corpus**: Massively increased dataset size from 1M samples and 1.2B tokens to **~5M samples / ~60B tokens** blended across reasoning and non-reasoning data.
-- **Hybrid reasoning mode** with explicit `<think>…</think>` segments when the model decides to deliberate, and options to make your responses faster when you want.
-- **Reasoning** that is top quality, expressive, improves math, code, STEM, logic, and even creative writing and subjective responses.
-- **Schema adherence & structured outputs**: trained to produce valid JSON for given schemas and to repair malformed objects.
-- **Much easier to steer and align**: extreme improvements on steerability, especially on reduced refusal rates.
-
-## Our Mission: Frontier Capabilities Aligned to You
-
-In pursuit of the mission of producing models that are open, steerable and capable of producing the full range of human expression, while being able to be aligned to your values, we created a new benchmark, RefusalBench, that tests the models willingness to be helpful in a variety of scenarios commonly disallowed by closed and open models.
-
-
-
-Hermes 4 achieves SOTA on RefusalBench across all popular closed and open models in being helpful and conforming to your values, without censorship.
-
-## Benchmarks (Hermes 4 405B)
-
-
-
-> Full tables, settings, and comparisons are in the technical report.
 
-
 
-Hermes 4 uses Llama-3-Chat format with role headers and special tags.
-
-**Basic chat:**
-```
-<|start_header_id|>system<|end_header_id|>
-
-You are Hermes 4. Be concise and helpful.<|eot_id|>
-<|start_header_id|>user<|end_header_id|>
-
-Explain the photoelectric effect simply.<|eot_id|>
-<|start_header_id|>assistant<|end_header_id|>
 ```
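As a rough illustration of the Llama-3-Chat layout quoted in the removed README text above, the sketch below renders a message list into the same role-header form. `build_llama3_prompt` is a hypothetical helper, not part of any library, and the exact whitespace and any beginning-of-text token are assumptions; in practice the chat template bundled with the tokenizer (`tokenizer.apply_chat_template`) should be treated as authoritative.

```python
def build_llama3_prompt(messages, add_generation_prompt=True):
    """Render chat messages in the role-header layout quoted above.

    Hypothetical helper for illustration only. The template shipped with the
    tokenizer may add a BOS token or use different whitespace.
    """
    parts = []
    for message in messages:
        # Each turn: role header, blank line, content, end-of-turn marker.
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # An open assistant header asks the model to write the next turn.
        parts.append("<|start_header_id|>assistant<|end_header_id|>")
    return "\n".join(parts)


messages = [
    {"role": "system", "content": "You are Hermes 4. Be concise and helpful."},
    {"role": "user", "content": "Explain the photoelectric effect simply."},
]
print(build_llama3_prompt(messages))
```

Building the string by hand is mainly useful for inspecting what the model sees; it is easy to get a separator wrong, which is why the tokenizer's own template is usually the safer path.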
-
-
-
-
 
 ```
-You are a deep thinking AI, you may use extremely long chains of thought to deeply consider the problem and deliberate with yourself via systematic reasoning processes to help come to a correct solution prior to answering. You should enclose your thoughts and internal monologue inside <think> </think> tags, and then provide your solution or response to the problem.
-```
-
-Note that you can add any additional system instructions before or after this system message, and it will adjust the models policies, style, and effort of thinking, as well as its post-thinking style, format, identity, and more. You may also interleave the tool definition system message with the reasoning one.
-
-When the model chooses to deliberate, it emits:
-
-```
-<|start_header_id|>assistant<|end_header_id|>
-<think>
-…model’s internal reasoning may appear here…
-</think>
-Final response starts here…<|eot_id|>
-```
-
-Additionally, we provide a flag to keep the content inbetween the `<think> ... </think>` that you can play with by setting `keep_cots=True`
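The `<think>…</think>` convention described above lends itself to simple post-processing on the client side. The following is a minimal sketch, assuming the reasoning is always wrapped in literal `<think>` and `</think>` tags as in the quoted example; `split_reasoning` is a hypothetical helper and is not the implementation behind the `keep_cots` flag.

```python
import re

# Non-greedy match so that several <think> blocks are handled separately.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text):
    """Return (reasoning, answer) for one decoded assistant turn."""
    thoughts = [t.strip() for t in THINK_RE.findall(text)]
    answer = THINK_RE.sub("", text).strip()
    return "\n".join(thoughts), answer


reasoning, answer = split_reasoning("<think>\n2 + 2 = 4\n</think>\nThe answer is 4.")
print("reasoning:", reasoning)
print("answer:", answer)
```

If the generation is cut off before `</think>` is emitted, the pattern does not match and the whole text is returned as the answer, so a caller may want to check for an unclosed tag separately.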
-
-
-## Function Calling & Tool Use
 
-
 
-**System message (example):**
-
-```
-<|start_header_id|>system<|end_header_id|>
-You are a function-calling AI. Tools are provided inside <tools>…</tools>.
-When appropriate, call a tool by emitting a <tool_call>{...}</tool_call> object.
-After a tool responds (as <tool_response>), continue reasoning inside <think> and produce the final answer.
-<tools>
-{"type":"function","function":{"name":"get_weather","description":"Get weather by city","parameters":{"type":"object","properties":{"city":{"type":"string"}},"required":["city"]}}}
-</tools><|eot_id|>
 ```
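To make the `<tool_call>{...}</tool_call>` convention in the system message above concrete, here is a small sketch of how a client might pull tool calls out of generated text. The payload shape (`name` plus `arguments`) is an assumption, since the quoted README only shows `{...}`, and `extract_tool_calls` is a hypothetical helper rather than an API of any serving engine.

```python
import json
import re

# Capture the JSON object between the tags; DOTALL allows multi-line payloads.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text):
    """Return the JSON payload of every well-formed <tool_call> block."""
    calls = []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            calls.append(json.loads(raw))
        except json.JSONDecodeError:
            # Skip malformed payloads instead of failing the whole turn.
            continue
    return calls


# Assumed payload shape for illustration only.
output = 'Checking. <tool_call>{"name": "get_weather", "arguments": {"city": "Paris"}}</tool_call>'
for call in extract_tool_calls(output):
    print(call["name"], call["arguments"])
```

A caller would typically run the named function, wrap the result in `<tool_response>` as the system message describes, and send it back for the model to finish its answer.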
-
-
-
-
-
-
-
--
--
-
-### Transformers example
-
-```python
-from transformers import AutoTokenizer, AutoModelForCausalLM
-import torch
-
-model_id = "NousResearch/Hermes-4-Llama-3.1-405B"
-
-tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
-model = AutoModelForCausalLM.from_pretrained(
-    model_id,
-    torch_dtype=torch.float16,
-    device_map="auto"
-)
-
-messages = [
-    {"role":"system","content":"You are Hermes 4. Be concise."},
-    {"role":"user","content":"Summarize CRISPR in 3 sentences."}
-]
-
-inputs = tokenizer.apply_chat_template(
-    messages, add_generation_prompt=True, return_tensors="pt"
-).to(model.device)
-
-outputs = model.generate(
-    **inputs, max_new_tokens=400, temperature=0.6, top_p=0.95, top_k=20, do_sample=True
-)
-print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 ```
-
-For production serving on multi-GPU nodes, consider tensor parallel inference engines (e.g., SGLang/vLLM backends) with prefix caching.
-
-## Inference Providers:
-
-### Nous Portal:
-
-<a href="https://portal.nousresearch.com"><img width=256 alt="chutes logo" src="https://cdn-uploads.huggingface.co/production/uploads/6317aade83d8d2fd903192d9/6YytY7N0mjCnBQvWo3qtv.png"></a>
-
-### Chutes:
-
-<a href="https://chutes.ai/app"><img width=256 alt="chutes logo" src="https://cdn-uploads.huggingface.co/production/uploads/6317aade83d8d2fd903192d9/l14AWPv6cSvaprpwK_IWY.png"></a>
-
-### Nebius:
-
-<a href="https://nebius.com/services/studio-inference-service">
-<picture>
-<source media="(prefers-color-scheme: dark)" srcset="https://cdn-uploads.huggingface.co/production/uploads/6317aade83d8d2fd903192d9/vhL0oAomFa_awBdt2KF_x.png">
-<source media="(prefers-color-scheme: light)" srcset="https://cdn-uploads.huggingface.co/production/uploads/64b21cbb2fc8324fcb1dac03/LjAfeFfAz8ac5rV-iiwj5.png">
-<img width=256 alt="nebius.com logo" src="https://cdn-uploads.huggingface.co/production/uploads/64b21cbb2fc8324fcb1dac03/LjAfeFfAz8ac5rV-iiwj5.png">
-</picture>
-</a>
-
-### Luminal:
-
-<a href="https://luminalai.com/">
-<img width=256 alt="luminal logo" src="https://cdn-uploads.huggingface.co/production/uploads/6317aade83d8d2fd903192d9/FIHsRdjMMP0HUjebiuJyH.png">
-</a>
-
-# Quantized / Smaller Variants
-
-Hermes 4 is available as BF16 original weights as well as FP8 variants and GGUF variants by LM Studio.
-
-FP8: https://huggingface.co/NousResearch/Hermes-4-405B-FP8
-
-GGUF (Courtesy of LM Studio team!):
-https://huggingface.co/lmstudio-community/Hermes-4-405B-GGUF
-
-Hermes 4 is also available in smaller sizes (e.g., 70B and 14B) with similar prompt formats.
-
-See the Hermes 4 collection to explore them all:
-https://huggingface.co/collections/NousResearch/hermes-4-collection-68a731bfd452e20816725728
-
-# How to cite
-
-```bibtex
-@misc{teknium2025hermes4technicalreport,
-title={Hermes 4 Technical Report},
-author={Ryan Teknium and Roger Jin and Jai Suphavadeeprasit and Dakota Mahan and Jeffrey Quesnelle and Joe Li and Chen Guang and Shannon Sands and Karan Malhotra},
-year={2025},
-eprint={2508.18255},
-archivePrefix={arXiv},
-primaryClass={cs.AI},
-url={https://arxiv.org/abs/2508.18255},
-}
-```
@@ -18,7 +18,8 @@ tags:
 - long context
 - roleplaying
 - chat
+base_model:
+- NousResearch/Hermes-4-405B
 library_name: transformers
 widget:
 - example_title: Hermes 4
@@ -28,194 +29,54 @@ widget:
 You are Hermes 4, a capable, neutrally-aligned assistant. Prefer concise,
 correct answers.
 - role: user
+content: Explain what Hadamard Transform is.
 model-index:
 - name: Hermes-4-Llama-3.1-405B
 results: []
 ---
 
+# Hermes 4 — Llama-3.1 405B EXL 3 2.00bpw
 
+2.00 BPW H8 exllamav3 quant of Hermes 4 405B.
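For a sense of scale, the bits-per-weight figure in the line above can be turned into an approximate weight footprint. This is back-of-the-envelope arithmetic under stated assumptions: the parameter count is taken as the nominal 405 billion from the model name, and the estimate ignores the 8-bit head layer, embeddings, and any container overhead, so the real download size will differ.

```python
def quantized_weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate storage for n_params weights at an average bit width, in GiB."""
    total_bytes = n_params * bits_per_weight / 8  # 8 bits per byte
    return total_bytes / 2**30


N_PARAMS = 405e9  # assumption: nominal count implied by the "405B" name

print(f"2.0 bpw: {quantized_weight_gib(N_PARAMS, 2.0):.1f} GiB")
print(f"16 bpw (BF16): {quantized_weight_gib(N_PARAMS, 16.0):.1f} GiB")
```

Under these assumptions the 2.0 bpw weights take roughly one eighth of the space of the BF16 originals, before accounting for the KV cache and activations needed at inference time.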
 
 ```
+ -- A perplexity: 1.50484401
+ -- B perplexity: 4.46562014
+ -- A label in top-K:
+      K = 1: 0.8938
+      K = 2: 0.9486
+      K = 3: 0.9640
+      K = 4: 0.9714
+      K = 5: 0.9757
+ -- B label in top-K:
+      K = 1: 0.6383
+      K = 2: 0.7622
+      K = 3: 0.8163
+      K = 4: 0.8482
+      K = 5: 0.8698
+ -- Top-K agreement, A vs B:
+      K = 1: 0.6743
+      K = 2: 0.2721
+      K = 3: 0.0833
+      K = 4: 0.0222
+      K = 5: 0.0056
+ -- KL divergence (A, B): 2.27405149
+ -- KL divergence (B, A): 1.05870732
 
 ```
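The report above compares two models, labelled A and B, on the same token stream. The sketch below shows one way such figures could be computed from per-position next-token distributions. The definitions are assumptions rather than the evaluation script's actual code: "label in top-K" is taken as the fraction of positions where the true next token is among a model's K most likely tokens, and "top-K agreement" as the fraction of positions where both models rank the same K tokens in the same order. The toy distributions are made up for illustration.

```python
import math


def perplexity(dists, labels):
    """exp of the mean negative log-likelihood of the true tokens."""
    nll = -sum(math.log(d[y]) for d, y in zip(dists, labels)) / len(labels)
    return math.exp(nll)


def top_k(probs, k):
    """Indices of the k most probable tokens, most probable first."""
    return sorted(range(len(probs)), key=lambda i: -probs[i])[:k]


def label_in_top_k(dists, labels, k):
    """Fraction of positions whose true token is in the model's top k."""
    return sum(y in top_k(d, k) for d, y in zip(dists, labels)) / len(labels)


def top_k_agreement(dists_a, dists_b, k):
    """Fraction of positions where both models give the same ordered top k."""
    pairs = list(zip(dists_a, dists_b))
    return sum(top_k(a, k) == top_k(b, k) for a, b in pairs) / len(pairs)


def kl_divergence(p, q):
    """KL(p || q) in nats; asymmetric, hence two values in the report."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy data: two positions, vocabulary of three tokens (illustrative only).
A = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]]
B = [[0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
labels = [0, 1]

print("A perplexity:", perplexity(A, labels))
print("B label in top-1:", label_in_top_k(B, labels, 1))
print("top-1 agreement:", top_k_agreement(A, B, 1))
print("mean KL(A, B):", sum(kl_divergence(a, b) for a, b in zip(A, B)) / len(A))
```

Under the exact-ordered-match assumption, agreement can only fall as K grows, which is consistent with the steep drop from K = 1 to K = 5 in the report.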
 
+command used to generate this quant
 
 ```
+ulimit -n 100000
+PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python convert.py -i /home/ubuntu/workspace/models/Hermes-4-405B \
+    -o /home/ubuntu/workspace/models/final/hermes4-405b-2bpw \
+    -w /home/ubuntu/workspace/models/workdir \
+    -b 2.0 \
+    -hq \
+    -ss 2048 \
+    -cpi 3600 \
+    -hb 8 \
+    -d 0
 ```
+<img src="https://cdn-uploads.huggingface.co/production/uploads/6317aade83d8d2fd903192d9/roT9o5bMYBtQziRMlaSDf.jpeg" width="300" style="float:center" />