cpral committed on
Commit
b89fe27
·
verified ·
1 Parent(s): 7e6eda0

Update README.md

  1. README.md +39 -178
README.md CHANGED
@@ -18,7 +18,8 @@ tags:
18
  - long context
19
  - roleplaying
20
  - chat
21
- base_model: meta-llama/Meta-Llama-3.1-405B
 
22
  library_name: transformers
23
  widget:
24
  - example_title: Hermes 4
@@ -28,194 +29,54 @@ widget:
28
  You are Hermes 4, a capable, neutrally-aligned assistant. Prefer concise,
29
  correct answers.
30
  - role: user
31
- content: >-
32
- Explain the difference between BFS and DFS to a new CS student.
33
  model-index:
34
  - name: Hermes-4-Llama-3.1-405B
35
  results: []
36
  ---
37
 
38
- # Hermes 4 — Llama-3.1 405B
39
-
40
- ![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/6317aade83d8d2fd903192d9/roT9o5bMYBtQziRMlaSDf.jpeg)
41
-
42
- ## Model Description
43
-
44
- Hermes 4 405B is a frontier, hybrid-mode **reasoning** model based on Llama-3.1-405B by Nous Research that is aligned to **you**.
45
-
46
- Read the Hermes 4 technical report here: <a href="https://arxiv.org/abs/2508.18255">Hermes 4 Technical Report</a>
47
-
48
- Chat with Hermes in Nous Chat: https://chat.nousresearch.com
49
-
50
- Training highlights include a newly synthesized post-training corpus emphasizing verified reasoning traces, massive improvements in math, code, STEM, logic, creativity, and format-faithful outputs, while preserving general assistant quality and broadly neutral alignment.
51
-
52
- ## What’s new vs Hermes 3
53
-
54
- - **Post-training corpus**: Massively increased dataset size from 1M samples and 1.2B tokens to **~5M samples / ~60B tokens** blended across reasoning and non-reasoning data.
55
- - **Hybrid reasoning mode** with explicit `<think>…</think>` segments when the model decides to deliberate, and options to make your responses faster when you want.
56
- - **Reasoning** that is top quality, expressive, improves math, code, STEM, logic, and even creative writing and subjective responses.
57
- - **Schema adherence & structured outputs**: trained to produce valid JSON for given schemas and to repair malformed objects.
58
- - **Much easier to steer and align**: extreme improvements on steerability, especially on reduced refusal rates.
59
-
60
- ## Our Mission: Frontier Capabilities Aligned to You
61
-
62
- To pursue our mission of producing models that are open, steerable, and capable of the full range of human expression, while remaining alignable to your values, we created a new benchmark, RefusalBench, that tests the model's willingness to be helpful in a variety of scenarios commonly disallowed by closed and open models.
63
-
64
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6317aade83d8d2fd903192d9/t_HvRYPEHV0pc8iS2zHHn.png)
65
-
66
- Hermes 4 achieves SOTA on RefusalBench across all popular closed and open models in being helpful and conforming to your values, without censorship.
67
-
68
- ## Benchmarks (Hermes 4 405B)
69
-
70
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6317aade83d8d2fd903192d9/ZOj3LrFweV7MYwlfP_eiO.png)
71
-
72
- > Full tables, settings, and comparisons are in the technical report.
73
 
74
- ## Prompt Format
75
 
76
- Hermes 4 uses Llama-3-Chat format with role headers and special tags.
77
-
78
- **Basic chat:**
79
- ```
80
- <|start_header_id|>system<|end_header_id|>
81
-
82
- You are Hermes 4. Be concise and helpful.<|eot_id|>
83
- <|start_header_id|>user<|end_header_id|>
84
-
85
- Explain the photoelectric effect simply.<|eot_id|>
86
- <|start_header_id|>assistant<|end_header_id|>
87
  ```
88
-
89
- ### Reasoning mode
90
-
91
- Reasoning mode can be activated with the chat template via the flag `thinking=True` or by using the following system prompt:
92
 
93
  ```
94
- You are a deep thinking AI, you may use extremely long chains of thought to deeply consider the problem and deliberate with yourself via systematic reasoning processes to help come to a correct solution prior to answering. You should enclose your thoughts and internal monologue inside <think> </think> tags, and then provide your solution or response to the problem.
95
- ```
96
-
97
- Note that you can add any additional system instructions before or after this system message, and they will adjust the model's policies, style, and effort of thinking, as well as its post-thinking style, format, identity, and more. You may also interleave the tool definition system message with the reasoning one.
98
-
99
- When the model chooses to deliberate, it emits:
100
-
101
- ```
102
- <|start_header_id|>assistant<|end_header_id|>
103
- <think>
104
- …model’s internal reasoning may appear here…
105
- </think>
106
- Final response starts here…<|eot_id|>
107
- ```
108
-
109
- Additionally, we provide a flag that keeps the content between the `<think> ... </think>` tags, which you can enable by setting `keep_cots=True`.
110
-
111
-
112
- ## Function Calling & Tool Use
113
 
114
- Hermes 4 supports function/tool calls *within* a single assistant turn, interleaved with its reasoning:
115
 
116
- **System message (example):**
117
-
118
- ```
119
- <|start_header_id|>system<|end_header_id|>
120
- You are a function-calling AI. Tools are provided inside <tools>…</tools>.
121
- When appropriate, call a tool by emitting a <tool_call>{...}</tool_call> object.
122
- After a tool responds (as <tool_response>), continue reasoning inside <think> and produce the final answer.
123
- <tools>
124
- {"type":"function","function":{"name":"get_weather","description":"Get weather by city","parameters":{"type":"object","properties":{"city":{"type":"string"}},"required":["city"]}}}
125
- </tools><|eot_id|>
126
  ```
127
-
128
- Note that you may also simply place tool definitions into the "tools:" field of your messages, and the chat template will parse and create the system prompt for you. This also works with reasoning mode for improved accuracy of tool use.
129
-
130
- The model will then generate tool calls within `<tool_call> {tool_call} </tool_call>` tags for easy parsing. The tool_call tags are also added tokens, which makes them easy to parse while streaming. Automatic tool parsers for Hermes are also built into vLLM and SGLang: set the tool parser to `hermes` in vLLM and to `qwen25` in SGLang.
131
-
132
- ## Inference Notes
133
-
134
- - **Sampling defaults that work well:** `temperature=0.6, top_p=0.95, top_k=20`.
135
- - **Template:** Use the Llama chat format for Hermes 4 70B and 405B as shown above, or set `add_generation_prompt=True` when using `tokenizer.apply_chat_template(...)`.
136
-
137
- ### Transformers example
138
-
139
- ```python
140
- from transformers import AutoTokenizer, AutoModelForCausalLM
141
- import torch
142
-
143
- model_id = "NousResearch/Hermes-4-Llama-3.1-405B"
144
-
145
- tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
146
- model = AutoModelForCausalLM.from_pretrained(
147
- model_id,
148
- torch_dtype=torch.float16,
149
- device_map="auto"
150
- )
151
-
152
- messages = [
153
- {"role":"system","content":"You are Hermes 4. Be concise."},
154
- {"role":"user","content":"Summarize CRISPR in 3 sentences."}
155
- ]
156
-
157
- inputs = tokenizer.apply_chat_template(
158
- messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
159
- ).to(model.device)
160
-
161
- outputs = model.generate(
162
- **inputs, max_new_tokens=400, temperature=0.6, top_p=0.95, top_k=20, do_sample=True
163
- )
164
- print(tokenizer.decode(outputs[0], skip_special_tokens=True))
165
  ```
166
-
167
- For production serving on multi-GPU nodes, consider tensor parallel inference engines (e.g., SGLang/vLLM backends) with prefix caching.
168
-
169
- ## Inference Providers:
170
-
171
- ### Nous Portal:
172
-
173
- <a href="https://portal.nousresearch.com"><img width=256 alt="nous portal logo" src="https://cdn-uploads.huggingface.co/production/uploads/6317aade83d8d2fd903192d9/6YytY7N0mjCnBQvWo3qtv.png"></a>
174
-
175
- ### Chutes:
176
-
177
- <a href="https://chutes.ai/app"><img width=256 alt="chutes logo" src="https://cdn-uploads.huggingface.co/production/uploads/6317aade83d8d2fd903192d9/l14AWPv6cSvaprpwK_IWY.png"></a>
178
-
179
- ### Nebius:
180
-
181
- <a href="https://nebius.com/services/studio-inference-service">
182
- <picture>
183
- <source media="(prefers-color-scheme: dark)" srcset="https://cdn-uploads.huggingface.co/production/uploads/6317aade83d8d2fd903192d9/vhL0oAomFa_awBdt2KF_x.png">
184
- <source media="(prefers-color-scheme: light)" srcset="https://cdn-uploads.huggingface.co/production/uploads/64b21cbb2fc8324fcb1dac03/LjAfeFfAz8ac5rV-iiwj5.png">
185
- <img width=256 alt="nebius.com logo" src="https://cdn-uploads.huggingface.co/production/uploads/64b21cbb2fc8324fcb1dac03/LjAfeFfAz8ac5rV-iiwj5.png">
186
- </picture>
187
- </a>
188
-
189
- ### Luminal:
190
-
191
- <a href="https://luminalai.com/">
192
- <img width=256 alt="luminal logo" src="https://cdn-uploads.huggingface.co/production/uploads/6317aade83d8d2fd903192d9/FIHsRdjMMP0HUjebiuJyH.png">
193
- </a>
194
-
195
- # Quantized / Smaller Variants
196
-
197
- Hermes 4 is available as BF16 original weights as well as FP8 variants and GGUF variants by LM Studio.
198
-
199
- FP8: https://huggingface.co/NousResearch/Hermes-4-405B-FP8
200
-
201
- GGUF (Courtesy of LM Studio team!):
202
- https://huggingface.co/lmstudio-community/Hermes-4-405B-GGUF
203
-
204
- Hermes 4 is also available in smaller sizes (e.g., 70B and 14B) with similar prompt formats.
205
-
206
- See the Hermes 4 collection to explore them all:
207
- https://huggingface.co/collections/NousResearch/hermes-4-collection-68a731bfd452e20816725728
208
-
209
- # How to cite
210
-
211
- ```bibtex
212
- @misc{teknium2025hermes4technicalreport,
213
- title={Hermes 4 Technical Report},
214
- author={Ryan Teknium and Roger Jin and Jai Suphavadeeprasit and Dakota Mahan and Jeffrey Quesnelle and Joe Li and Chen Guang and Shannon Sands and Karan Malhotra},
215
- year={2025},
216
- eprint={2508.18255},
217
- archivePrefix={arXiv},
218
- primaryClass={cs.AI},
219
- url={https://arxiv.org/abs/2508.18255},
220
- }
221
- ```
 
18
  - long context
19
  - roleplaying
20
  - chat
21
+ base_model:
22
+ - NousResearch/Hermes-4-405B
23
  library_name: transformers
24
  widget:
25
  - example_title: Hermes 4
 
29
  You are Hermes 4, a capable, neutrally-aligned assistant. Prefer concise,
30
  correct answers.
31
  - role: user
32
+ content: Explain what Hadamard Transform is.
 
33
  model-index:
34
  - name: Hermes-4-Llama-3.1-405B
35
  results: []
36
  ---
37
 
38
+ # Hermes 4 — Llama-3.1 405B EXL 3 2.00bpw
39
 
40
+ 2.00 BPW H8 exllamav3 quant of Hermes 4 405B.
41
 
42
  ```
43
+ -- A perplexity: 1.50484401
44
+ -- B perplexity: 4.46562014
45
+ -- A label in top-K:
46
+ K = 1: 0.8938
47
+ K = 2: 0.9486
48
+ K = 3: 0.9640
49
+ K = 4: 0.9714
50
+ K = 5: 0.9757
51
+ -- B label in top-K:
52
+ K = 1: 0.6383
53
+ K = 2: 0.7622
54
+ K = 3: 0.8163
55
+ K = 4: 0.8482
56
+ K = 5: 0.8698
57
+ -- Top-K agreement, A vs B:
58
+ K = 1: 0.6743
59
+ K = 2: 0.2721
60
+ K = 3: 0.0833
61
+ K = 4: 0.0222
62
+ K = 5: 0.0056
63
+ -- KL divergence (A, B): 2.27405149
64
+ -- KL divergence (B, A): 1.05870732
65
 
66
  ```
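For readers unfamiliar with these metrics, the following is a minimal sketch of how perplexity, label-in-top-K, top-K agreement, and KL divergence can be computed from two models' logits over the same token positions. It is not the evaluation script that produced the numbers above, and the README does not say which model is A and which is B. The function names (`softmax`, `top_k_ids`, `compare_logits`) are hypothetical. Top-K agreement is assumed here to mean that both models produce the same top-K token ids in the same order, which would be consistent with the values falling sharply as K grows.

```python
import math


def softmax(logits):
    """Convert one position's logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_ids(probs, k):
    """Token ids of the k most probable tokens, most probable first."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]


def compare_logits(logits_a, logits_b, labels, max_k=5):
    """Compare two models position by position.

    logits_a, logits_b: one list of logits per token position.
    labels: the reference next-token id for each position.
    """
    n = len(labels)
    nll_a = nll_b = kl_ab = kl_ba = 0.0
    label_hits_a = [0] * max_k
    label_hits_b = [0] * max_k
    agree = [0] * max_k

    for la, lb, y in zip(logits_a, logits_b, labels):
        pa, pb = softmax(la), softmax(lb)
        # Negative log-likelihood of the reference token under each model.
        nll_a -= math.log(pa[y])
        nll_b -= math.log(pb[y])
        # KL(A, B) = sum_i pa[i] * log(pa[i] / pb[i]), and the reverse.
        kl_ab += sum(p * math.log(p / q) for p, q in zip(pa, pb) if p > 0)
        kl_ba += sum(q * math.log(q / p) for p, q in zip(pa, pb) if q > 0)
        for k in range(1, max_k + 1):
            ta, tb = top_k_ids(pa, k), top_k_ids(pb, k)
            label_hits_a[k - 1] += y in ta
            label_hits_b[k - 1] += y in tb
            # Assumption: agreement means identical ordered top-k lists.
            agree[k - 1] += ta == tb

    return {
        "ppl_a": math.exp(nll_a / n),
        "ppl_b": math.exp(nll_b / n),
        "label_in_topk_a": [h / n for h in label_hits_a],
        "label_in_topk_b": [h / n for h in label_hits_b],
        "topk_agreement": [c / n for c in agree],
        "kl_ab": kl_ab / n,
        "kl_ba": kl_ba / n,
    }


if __name__ == "__main__":
    # Toy data: two positions, a three-token vocabulary.
    a = [[2.0, 1.0, 0.0], [0.0, 3.0, 1.0]]
    b = [[1.0, 2.0, 0.0], [0.0, 3.0, 1.0]]
    print(compare_logits(a, b, labels=[0, 1], max_k=3))
```

KL divergence is asymmetric, which is why the two directions are reported separately.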
67
 
68
+ Command used to generate this quant:
69
 
70
  ```
71
+ ulimit -n 100000
72
+ PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python convert.py -i /home/ubuntu/workspace/models/Hermes-4-405B \
73
+ -o /home/ubuntu/workspace/models/final/hermes4-405b-2bpw \
74
+ -w /home/ubuntu/workspace/models/workdir \
75
+ -b 2.0 \
76
+ -hq \
77
+ -ss 2048 \
78
+ -cpi 3600 \
79
+ -hb 8 \
80
+ -d 0
81
  ```
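As a rough sanity check on what `-b 2.0` implies for storage, the arithmetic below estimates the weight footprint at a given bits-per-weight. It is only a back-of-the-envelope sketch. The parameter count of 405e9 is the nominal figure taken from the model name, and the helper `quantized_weight_gb` is hypothetical. The estimate ignores the output head (presumably kept at 8 bits, per `-hb 8`), embeddings, and any per-tensor metadata, so the real files on disk will differ.

```python
def quantized_weight_gb(n_params, bits_per_weight):
    """Approximate weight storage in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    n_params = 405e9  # assumption: nominal parameter count from the model name
    print(f"2.0 bpw : {quantized_weight_gb(n_params, 2.0):.2f} GB")
    print(f"16 bpw  : {quantized_weight_gb(n_params, 16.0):.2f} GB")
```

The second line gives the same estimate for 16-bit weights, for comparison.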
82
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/6317aade83d8d2fd903192d9/roT9o5bMYBtQziRMlaSDf.jpeg" width="300" style="float:center" />