---
base_model:
- deepreinforce-ai/Ornith-1.0-397B
---

**Warning: all quants here only work with ik_llama.cpp.**

Quants of Ornith 1.0, a fine-tune built on Qwen 3.5 397B A17B. They come with an mmproj for vision, but ship without MTP.
To speed up token generation (TG) you can use DFLASH, a novel diffusion-based, MTP-like draft model. It comes in a variety of quants; download the one that works best for your model size.


DFLASH paper: https://arxiv.org/abs/2602.06036


Thanks to:

https://huggingface.co/z-lab/Qwen3.5-397B-A17B-DFlash

https://huggingface.co/modal-labs/Qwen3.5-397B-A17B-DFlash

https://huggingface.co/lmsys/Qwen3.5-397B-A17B-DFlash

Load DFLASH by adding these flags to the `llama-server` command:

```
--model-draft path/to/Ornith-DFLASH.gguf
--spec-type dflash:n_max=1,cross_ctx=256
```

All quants target 16/24/32GB GPUs, with varying amounts of RAM depending on the quant.

Specific quant details (memory footprint with mmproj, without MTP/DFLASH):

<details>
<summary>IQ4_K - for 256GB RAM + 24GB VRAM</summary>

- Will eat 20180MB of VRAM and 198GB of RAM with standard config:
  ```
      ./build/bin/llama-server
      -m pmodels/Ornith-1.0-397B-A17B-IQ4_K.gguf
      --mmproj pmodels/Ornith-mmproj-BF16.gguf
      --mmproj-gpu-lazy
      -a Ornith
      --slot-save-path slots
      --context-shift off
      -ot "blk\.(?:[0-9]|[1-5][0-9])\.ffn.*_exps.*=CPU"
      -ot "token_embd\.weight=CPU"
      -c 200000
      --ctx-checkpoints 8
      --ctx-checkpoints-interval 0
      --ctx-checkpoints-tolerance 4
      --parallel 1
      -cram 0
      -b 4096 -ub 4096
      -wgt 1
      -ctk q8_0 -ctv q8_0
      -khad
      -mqkv
      --threads 15 --threads-batch 16 -ngl 100
      -cuda fusion=1,offload-batch-size=16,mmq-id-size=0,fa-offset=0
      --host 127.0.0.1
      --port 8080
      --webui none
      --jinja
  ```

Details:
  
  ```
    # 60 Repeating Layers [0-59] + MTP

    ## Gated Attention/Delta Net [Blended 0-59]
    blk\..*\.attn_gate\.weight=q8_0
    blk\..*\.attn_qkv\.weight=q8_0
    blk\..*\.ssm_alpha\.weight=bf16
    blk\..*\.ssm_beta\.weight=bf16
    blk\..*\.ssm_out\.weight=bf16

    # Normal attention
    blk\..*\.attn_output\.weight=q8_0
    blk\..*\.attn_q\.weight=q8_0
    blk\..*\.attn_k\.weight=q8_0
    blk\..*\.attn_v\.weight=q8_0

    # Shared Expert Layers [0-59]
    blk\..*\.ffn_down_shexp\.weight=q8_0
    blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

    # Routed Experts Layers [0-59]
    blk\..*\.ffn_down_exps\.weight=iq4_k
    blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kss

    # Non-Repeating Layers
    token_embd\.weight=q8_0
    output\.weight=iq6_k
  ```

---

</details>


<details>
<summary>IQ4_KSS - for 256GB RAM + 24GB VRAM</summary>

- Will eat 18826MB of VRAM and 191GB of RAM with standard config:
  ```
      ./build/bin/llama-server
      -m pmodels/Ornith-1.0-397B-A17B-IQ4_KSS.gguf
      --mmproj pmodels/Ornith-mmproj-BF16.gguf
      --mmproj-gpu-lazy
      -a Ornith
      --slot-save-path slots
      --context-shift off
      -ot "blk\.(?:[0-9]|[1-5][0-9])\.ffn.*_exps.*=CPU"
      -ot "token_embd\.weight=CPU"
      -c 200000
      --ctx-checkpoints 8
      --ctx-checkpoints-interval 0
      --ctx-checkpoints-tolerance 4
      --parallel 1
      -cram 0
      -b 4096 -ub 4096
      -wgt 1
      -ctk q8_0 -ctv q8_0
      -khad
      -mqkv
      --threads 15 --threads-batch 16 -ngl 100
      -cuda fusion=1,offload-batch-size=16,mmq-id-size=0,fa-offset=0
      --host 127.0.0.1
      --port 8080
      --webui none
      --jinja
  ```

Details:
  
  ```
    # 60 Repeating Layers [0-59] + MTP

    ## Gated Attention/Delta Net [Blended 0-59]
    blk\..*\.attn_gate\.weight=q8_0
    blk\..*\.attn_qkv\.weight=q8_0
    blk\..*\.ssm_alpha\.weight=bf16
    blk\..*\.ssm_beta\.weight=bf16
    blk\..*\.ssm_out\.weight=q8_0

    # Normal attention
    blk\..*\.attn_output\.weight=q8_0
    blk\..*\.attn_q\.weight=q8_0
    blk\..*\.attn_k\.weight=q8_0
    blk\..*\.attn_v\.weight=q8_0

    # Shared Expert Layers [0-59]
    blk\..*\.ffn_down_shexp\.weight=q8_0
    blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

    # Routed Experts Layers [0-59]
    blk\..*\.ffn_down_exps\.weight=iq4_kss
    blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kss

    # Non-Repeating Layers
    token_embd\.weight=q8_0
    output\.weight=iq6_k
  ```

---

</details>

<details>
<summary>IQ3_KS - for 192GB RAM + 24GB VRAM</summary>

- Will eat 17600MB of VRAM and 137GB of RAM with standard config:
  ```
      ./build/bin/llama-server
      -m pmodels/Ornith-1.0-397B-A17B-IQ3_KS.gguf
      --mmproj pmodels/Ornith-mmproj-BF16.gguf
      --mmproj-gpu-lazy
      -a Ornith
      --slot-save-path slots
      --context-shift off
      -ot "blk\.(?:[0-9]|[1-5][0-9])\.ffn.*_exps.*=CPU"
      -ot "token_embd\.weight=CPU"
      -c 200000
      --ctx-checkpoints 8
      --ctx-checkpoints-interval 0
      --ctx-checkpoints-tolerance 4
      --parallel 1
      -cram 0
      -b 4096 -ub 4096
      -wgt 1
      -ctk q8_0 -ctv q8_0
      -khad
      -mqkv
      --threads 15 --threads-batch 16 -ngl 100
      -cuda fusion=1,offload-batch-size=16,mmq-id-size=0,fa-offset=0
      --host 127.0.0.1
      --port 8080
      --webui none
      --jinja
  ```

Details:
  
  ```
    # 60 Repeating Layers [0-59] + MTP

    ## Gated Attention/Delta Net [Blended 0-59]
    blk\..*\.attn_gate\.weight=q8_0
    blk\..*\.attn_qkv\.weight=q8_0
    blk\..*\.ssm_alpha\.weight=bf16
    blk\..*\.ssm_beta\.weight=bf16
    blk\..*\.ssm_out\.weight=q8_0

    # Normal attention
    blk\..*\.attn_output\.weight=q8_0
    blk\..*\.attn_q\.weight=q8_0
    blk\..*\.attn_k\.weight=q8_0
    blk\..*\.attn_v\.weight=q8_0

    # Shared Expert Layers [0-59]
    blk\..*\.ffn_down_shexp\.weight=iq6_k
    blk\..*\.ffn_(gate|up)_shexp\.weight=iq6_k

    # Routed Experts Layers [0-59]
    blk\..*\.ffn_down_exps\.weight=iq3_ks
    blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kl

    # Non-Repeating Layers
    token_embd\.weight=iq6_k
    output\.weight=iq6_k
  ```

---

</details>

<details>
<summary>IQ2_KS - for 128GB RAM + 16GB VRAM</summary>

- Will eat 13988MB of VRAM and 92.4GB of RAM with standard config:
  ```
      ./build/bin/llama-server
      -m pmodels/Ornith-1.0-397B-A17B-IQ2_KS.gguf
      --mmproj pmodels/Ornith-mmproj-BF16.gguf
      --mmproj-gpu-lazy
      -a Ornith
      --slot-save-path slots
      --context-shift off
      -ot "blk\.(?:[0-9]|[1-5][0-9])\.ffn.*_exps.*=CPU"
      -ot "token_embd\.weight=CPU"
      -c 200000
      --ctx-checkpoints 8
      --ctx-checkpoints-interval 0
      --ctx-checkpoints-tolerance 4
      --parallel 1
      -cram 0
      -b 4096 -ub 4096
      -wgt 1
      -ctk q8_0 -ctv q8_0
      -khad
      -mqkv
      --threads 15 --threads-batch 16 -ngl 100
      -cuda fusion=1,offload-batch-size=16,mmq-id-size=0,fa-offset=0
      --host 127.0.0.1
      --port 8080
      --webui none
      --jinja
  ```

Details:
  
  ```
    # 60 Repeating Layers [0-59] + MTP

    ## Gated Attention/Delta Net [Blended 0-59]
    blk\..*\.attn_gate\.weight=iq4_ks
    blk\..*\.attn_qkv\.weight=iq4_ks
    blk\..*\.ssm_alpha\.weight=q8_0
    blk\..*\.ssm_beta\.weight=q8_0
    blk\..*\.ssm_out\.weight=q8_0

    # Normal attention
    blk\..*\.attn_output\.weight=iq4_kss
    blk\..*\.attn_q\.weight=iq4_kss
    blk\..*\.attn_k\.weight=iq4_kss
    blk\..*\.attn_v\.weight=iq4_kss

    # Shared Expert Layers [0-59]
    blk\..*\.ffn_down_shexp\.weight=iq4_kss
    blk\..*\.ffn_(gate|up)_shexp\.weight=iq4_kss

    # Routed Experts Layers [0-59]
    blk\..*\.ffn_down_exps\.weight=iq2_kt
    blk\..*\.ffn_(gate|up)_exps\.weight=iq1_kt

    # Non-Repeating Layers
    token_embd\.weight=iq4_ks
    output\.weight=iq4_ks
  ```

---

</details>

---

Every additional 65536 tokens of context window requires one additional GB of VRAM with a Q8 KV cache (`-ctk q8_0 -ctv q8_0`).
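
As a quick sanity check, the rule of thumb above can be turned into a small shell helper. This is only a sketch: `extra_kv_vram_mb` is a hypothetical name, and it assumes the cost scales linearly with context and that 1 GB means 1024 MB.

```shell
# Rough extra VRAM in MB for a number of additional context tokens,
# assuming 65536 tokens cost 1024 MB with a q8_0 KV cache (linear scaling).
extra_kv_vram_mb() {
  echo $(( $1 * 1024 / 65536 ))
}

extra_kv_vram_mb 65536   # prints 1024
extra_kv_vram_mb 62144   # prints 971 (raising -c from 200000 to 262144)
```

Add the result to the VRAM figure listed for your quant to estimate whether a larger `-c` still fits on your GPU.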

The model was natively trained on a 262144-token context window, so to go beyond 262144 you need the additional YaRN flags (the same for ik_llama.cpp and mainline llama.cpp):
```
  --rope-scaling yarn
  --rope-scale N
  --yarn-orig-ctx 262144
```
Here N is the context ceiling multiplier (2 for 524288, 4 for 1048576). Expect close to no quality loss at scale 2 and some quality loss at scale 4.
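
To pick N for a given target context, ceiling division by the native window is enough. A minimal sketch, assuming whole-number scales only; `yarn_scale` is a hypothetical helper, not part of ik_llama.cpp or llama.cpp:

```shell
# Smallest whole-number --rope-scale that covers the target context,
# given the 262144-token native window. A result of 1 means YaRN is not needed.
yarn_scale() {
  echo $(( ($1 + 262144 - 1) / 262144 ))
}

# Print the context and YaRN flags for a 524288-token window.
target=524288
echo "-c $target --rope-scaling yarn --rope-scale $(yarn_scale "$target") --yarn-orig-ctx 262144"
```

The printed flags can be appended to the `llama-server` command for your quant; remember that the larger `-c` also raises the KV cache VRAM cost.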