---
base_model:
- deepreinforce-ai/Ornith-1.0-397B
---
**Warning: all of these quants work only with ik_llama.cpp.**
Quants of Ornith 1.0, a fine-tune built on Qwen 3.5 397B A17B. They come with an mmproj for vision, but are not shipped with MTP.
You can use DFLASH with it, a novel diffusion-based, MTP-like draft model, to speed up token generation (TG). DFLASH comes in a variety of quants, so you can download the one that works best for your model size.
DFLASH paper: https://arxiv.org/abs/2602.06036
Thanks to:
https://huggingface.co/z-lab/Qwen3.5-397B-A17B-DFlash
https://huggingface.co/modal-labs/Qwen3.5-397B-A17B-DFlash
https://huggingface.co/lmsys/Qwen3.5-397B-A17B-DFlash
Load DFLASH with:
```
--model-draft path/to/Ornith-DFLASH.gguf
--spec-type dflash:n_max=1,cross_ctx=256
```
All quants target 16/24/32GB GPUs, with varying amounts of RAM depending on the quant.
Specific quant details (memory footprint with mmproj, without MTP/DFLASH):
IQ4_K - for 256GB RAM + 24GB VRAM
- Will eat 20180MB of VRAM and 198GB of RAM with standard config:
```
./build/bin/llama-server \
-m pmodels/Ornith-1.0-397B-A17B-IQ4_K.gguf \
--mmproj pmodels/Ornith-mmproj-BF16.gguf \
--mmproj-gpu-lazy \
-a Ornith \
--slot-save-path slots \
--context-shift off \
-ot "blk\.(?:[0-9]|[1-5][0-9])\.ffn.*_exps.*=CPU" \
-ot "token_embd\.weight=CPU" \
-c 200000 \
--ctx-checkpoints 8 \
--ctx-checkpoints-interval 0 \
--ctx-checkpoints-tolerance 4 \
--parallel 1 \
-cram 0 \
-b 4096 -ub 4096 \
-wgt 1 \
-ctk q8_0 -ctv q8_0 \
-khad \
-mqkv \
--threads 15 --threads-batch 16 -ngl 100 \
-cuda fusion=1,offload-batch-size=16,mmq-id-size=0,fa-offset=0 \
--host 127.0.0.1 \
--port 8080 \
--webui none \
--jinja
```
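The `-ot` override in the config above is easy to misread, so here is a small Python check of which tensors it would send to the CPU. It is only a sketch: the tensor names are assumed from the recipe patterns in this card, and Python's `re.search` is assumed to behave like the matching used for `-ot` on this particular pattern.

```python
import re

# Pattern copied from the -ot flag in the standard config (the part before "=CPU").
PATTERN = re.compile(r"blk\.(?:[0-9]|[1-5][0-9])\.ffn.*_exps.*")

def goes_to_cpu(tensor_name: str) -> bool:
    """Return True if the override pattern matches this tensor name."""
    return PATTERN.search(tensor_name) is not None

# Hypothetical tensor names, following the naming used in the quant recipes.
routed = [f"blk.{i}.ffn_{kind}_exps.weight"
          for i in range(60) for kind in ("gate", "up", "down")]
others = ["blk.0.ffn_down_shexp.weight", "blk.59.attn_q.weight", "output.weight"]

print(all(goes_to_cpu(n) for n in routed))  # routed experts of layers 0-59
print(any(goes_to_cpu(n) for n in others))  # shared experts, attention, output
```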
Details:
```
# 60 Repeating Layers [0-59] + MTP
## Gated Attention/Delta Net [Blended 0-59]
blk\..*\.attn_gate\.weight=q8_0
blk\..*\.attn_qkv\.weight=q8_0
blk\..*\.ssm_alpha\.weight=bf16
blk\..*\.ssm_beta\.weight=bf16
blk\..*\.ssm_out\.weight=bf16
# Normal attention
blk\..*\.attn_output\.weight=q8_0
blk\..*\.attn_q\.weight=q8_0
blk\..*\.attn_k\.weight=q8_0
blk\..*\.attn_v\.weight=q8_0
# Shared Expert Layers [0-59]
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0
# Routed Experts Layers [0-59]
blk\..*\.ffn_down_exps\.weight=iq4_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kss
# Non-Repeating Layers
token_embd\.weight=q8_0
output\.weight=iq6_k
```
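To see how a recipe like the one above assigns a type to a given tensor, the following Python sketch resolves tensor names against a few of the IQ4_K rules. The tensor names are hypothetical examples, and the first-match-wins behaviour is an assumption (the rules used here do not overlap for these names, so the order does not change the result).

```python
import re

# A subset of the IQ4_K recipe from this card, as (regex, type) pairs.
RULES = [
    (r"blk\..*\.attn_qkv\.weight", "q8_0"),
    (r"blk\..*\.ssm_alpha\.weight", "bf16"),
    (r"blk\..*\.ffn_down_shexp\.weight", "q8_0"),
    (r"blk\..*\.ffn_down_exps\.weight", "iq4_k"),
    (r"blk\..*\.ffn_(gate|up)_exps\.weight", "iq4_kss"),
    (r"token_embd\.weight", "q8_0"),
    (r"output\.weight", "iq6_k"),
]

def quant_type(tensor_name):
    """Return the type of the first rule matching tensor_name, or None."""
    for pattern, qtype in RULES:
        if re.search(pattern, tensor_name):
            return qtype
    return None

for name in ("blk.3.ffn_down_exps.weight", "blk.3.ffn_up_exps.weight",
             "blk.3.ffn_down_shexp.weight", "output.weight"):
    print(name, "->", quant_type(name))
```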
---
IQ4_KSS - for 256GB RAM + 24GB VRAM
- Will eat 18826MB of VRAM and 191GB of RAM with standard config:
```
./build/bin/llama-server \
-m pmodels/Ornith-1.0-397B-A17B-IQ4_KSS.gguf \
--mmproj pmodels/Ornith-mmproj-BF16.gguf \
--mmproj-gpu-lazy \
-a Ornith \
--slot-save-path slots \
--context-shift off \
-ot "blk\.(?:[0-9]|[1-5][0-9])\.ffn.*_exps.*=CPU" \
-ot "token_embd\.weight=CPU" \
-c 200000 \
--ctx-checkpoints 8 \
--ctx-checkpoints-interval 0 \
--ctx-checkpoints-tolerance 4 \
--parallel 1 \
-cram 0 \
-b 4096 -ub 4096 \
-wgt 1 \
-ctk q8_0 -ctv q8_0 \
-khad \
-mqkv \
--threads 15 --threads-batch 16 -ngl 100 \
-cuda fusion=1,offload-batch-size=16,mmq-id-size=0,fa-offset=0 \
--host 127.0.0.1 \
--port 8080 \
--webui none \
--jinja
```
Details:
```
# 60 Repeating Layers [0-59] + MTP
## Gated Attention/Delta Net [Blended 0-59]
blk\..*\.attn_gate\.weight=q8_0
blk\..*\.attn_qkv\.weight=q8_0
blk\..*\.ssm_alpha\.weight=bf16
blk\..*\.ssm_beta\.weight=bf16
blk\..*\.ssm_out\.weight=q8_0
# Normal attention
blk\..*\.attn_output\.weight=q8_0
blk\..*\.attn_q\.weight=q8_0
blk\..*\.attn_k\.weight=q8_0
blk\..*\.attn_v\.weight=q8_0
# Shared Expert Layers [0-59]
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0
# Routed Experts Layers [0-59]
blk\..*\.ffn_down_exps\.weight=iq4_kss
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kss
# Non-Repeating Layers
token_embd\.weight=q8_0
output\.weight=iq6_k
```
---
IQ3_KS - for 192GB RAM + 24GB VRAM
- Will eat 17600MB of VRAM and 137GB of RAM with standard config:
```
./build/bin/llama-server \
-m pmodels/Ornith-1.0-397B-A17B-IQ3_KS.gguf \
--mmproj pmodels/Ornith-mmproj-BF16.gguf \
--mmproj-gpu-lazy \
-a Ornith \
--slot-save-path slots \
--context-shift off \
-ot "blk\.(?:[0-9]|[1-5][0-9])\.ffn.*_exps.*=CPU" \
-ot "token_embd\.weight=CPU" \
-c 200000 \
--ctx-checkpoints 8 \
--ctx-checkpoints-interval 0 \
--ctx-checkpoints-tolerance 4 \
--parallel 1 \
-cram 0 \
-b 4096 -ub 4096 \
-wgt 1 \
-ctk q8_0 -ctv q8_0 \
-khad \
-mqkv \
--threads 15 --threads-batch 16 -ngl 100 \
-cuda fusion=1,offload-batch-size=16,mmq-id-size=0,fa-offset=0 \
--host 127.0.0.1 \
--port 8080 \
--webui none \
--jinja
```
Details:
```
# 60 Repeating Layers [0-59] + MTP
## Gated Attention/Delta Net [Blended 0-59]
blk\..*\.attn_gate\.weight=q8_0
blk\..*\.attn_qkv\.weight=q8_0
blk\..*\.ssm_alpha\.weight=bf16
blk\..*\.ssm_beta\.weight=bf16
blk\..*\.ssm_out\.weight=q8_0
# Normal attention
blk\..*\.attn_output\.weight=q8_0
blk\..*\.attn_q\.weight=q8_0
blk\..*\.attn_k\.weight=q8_0
blk\..*\.attn_v\.weight=q8_0
# Shared Expert Layers [0-59]
blk\..*\.ffn_down_shexp\.weight=iq6_k
blk\..*\.ffn_(gate|up)_shexp\.weight=iq6_k
# Routed Experts Layers [0-59]
blk\..*\.ffn_down_exps\.weight=iq3_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kl
# Non-Repeating Layers
token_embd\.weight=iq6_k
output\.weight=iq6_k
```
---
IQ2_KS - for 128GB RAM + 16GB VRAM
- Will eat 13988MB of VRAM and 92.4GB of RAM with standard config:
```
./build/bin/llama-server \
-m pmodels/Ornith-1.0-397B-A17B-IQ2_KS.gguf \
--mmproj pmodels/Ornith-mmproj-BF16.gguf \
--mmproj-gpu-lazy \
-a Ornith \
--slot-save-path slots \
--context-shift off \
-ot "blk\.(?:[0-9]|[1-5][0-9])\.ffn.*_exps.*=CPU" \
-ot "token_embd\.weight=CPU" \
-c 200000 \
--ctx-checkpoints 8 \
--ctx-checkpoints-interval 0 \
--ctx-checkpoints-tolerance 4 \
--parallel 1 \
-cram 0 \
-b 4096 -ub 4096 \
-wgt 1 \
-ctk q8_0 -ctv q8_0 \
-khad \
-mqkv \
--threads 15 --threads-batch 16 -ngl 100 \
-cuda fusion=1,offload-batch-size=16,mmq-id-size=0,fa-offset=0 \
--host 127.0.0.1 \
--port 8080 \
--webui none \
--jinja
```
Details:
```
# 60 Repeating Layers [0-59] + MTP
## Gated Attention/Delta Net [Blended 0-59]
blk\..*\.attn_gate\.weight=iq4_ks
blk\..*\.attn_qkv\.weight=iq4_ks
blk\..*\.ssm_alpha\.weight=q8_0
blk\..*\.ssm_beta\.weight=q8_0
blk\..*\.ssm_out\.weight=q8_0
# Normal attention
blk\..*\.attn_output\.weight=iq4_kss
blk\..*\.attn_q\.weight=iq4_kss
blk\..*\.attn_k\.weight=iq4_kss
blk\..*\.attn_v\.weight=iq4_kss
# Shared Expert Layers [0-59]
blk\..*\.ffn_down_shexp\.weight=iq4_kss
blk\..*\.ffn_(gate|up)_shexp\.weight=iq4_kss
# Routed Experts Layers [0-59]
blk\..*\.ffn_down_exps\.weight=iq2_kt
blk\..*\.ffn_(gate|up)_exps\.weight=iq1_kt
# Non-Repeating Layers
token_embd\.weight=iq4_ks
output\.weight=iq4_ks
```
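As a rough way to choose between the four quants, the footprints listed above can be put in a small lookup. This Python sketch is only illustrative: the function name, the fit rule, and the 16GB of RAM kept free for the OS are assumptions, and the numbers are the standard-config figures from this card (200000 context, with mmproj, without DFLASH).

```python
# (name, VRAM in MB, RAM in GB) with the standard config, largest quant first.
QUANTS = [
    ("IQ4_K", 20180, 198.0),
    ("IQ4_KSS", 18826, 191.0),
    ("IQ3_KS", 17600, 137.0),
    ("IQ2_KS", 13988, 92.4),
]

def pick_quant(vram_mb, ram_gb, ram_headroom_gb=16):
    """Return the largest listed quant that fits, or None.

    ram_headroom_gb is an arbitrary assumption for RAM left to the OS.
    """
    usable_ram = ram_gb - ram_headroom_gb
    for name, need_vram, need_ram in QUANTS:
        if need_vram <= vram_mb and need_ram <= usable_ram:
            return name
    return None

print(pick_quant(24 * 1024, 256))  # 24GB GPU, 256GB RAM
print(pick_quant(24 * 1024, 192))  # 24GB GPU, 192GB RAM
print(pick_quant(16 * 1024, 128))  # 16GB GPU, 128GB RAM
```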
---
Each additional 65536 tokens of context window requires one additional GB of VRAM with a Q8 KV cache (`-ctk q8_0 -ctv q8_0`).
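As a quick worked example of that rule, this Python sketch estimates the extra VRAM relative to the 200000-token standard configs. The strictly linear scaling and the function name are assumptions for illustration; actual usage may differ with your build and settings.

```python
TOKENS_PER_GB = 65536  # rule of thumb from this card (Q8 KV cache)
BASE_CTX = 200000      # -c value used in the standard configs

def extra_vram_gb(ctx, base_ctx=BASE_CTX):
    """Estimate additional VRAM in GB when raising the context above base_ctx."""
    return max(0, ctx - base_ctx) / TOKENS_PER_GB

print(round(extra_vram_gb(262144), 2))  # going up to the native context ceiling
```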
The model was natively trained with a 262144-token context window, so to go beyond 262144 you need the additional YaRN flags (the same for ik_llama.cpp and mainline llama.cpp):
```
--rope-scaling yarn
--rope-scale N
--yarn-orig-ctx 262144
```
Here N is the context ceiling multiplier (2 for 524288, 4 for 1048576). Expect close to no quality loss at scale 2 and some quality loss at scale 4.
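
To pick N for a given target context, a small Python helper along these lines can do the arithmetic. It is a sketch under assumptions: the helper name is made up, and it rounds N up to a whole number; only the three flags themselves come from this card.

```python
import math

NATIVE_CTX = 262144  # native training context stated in this card

def yarn_flags(target_ctx):
    """Return the extra YaRN flags for target_ctx, or [] if none are needed."""
    if target_ctx <= NATIVE_CTX:
        return []
    scale = math.ceil(target_ctx / NATIVE_CTX)  # assumption: whole-number N
    return ["--rope-scaling", "yarn",
            "--rope-scale", str(scale),
            "--yarn-orig-ctx", str(NATIVE_CTX)]

print(yarn_flags(200000))
print(" ".join(yarn_flags(524288)))
```

The returned flags would be appended to the standard config, with `-c` raised to the target context separately.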