---
base_model:
- deepreinforce-ai/Ornith-1.0-397B
---

**Warning: all models only work with ik_llama.cpp**

Quants of Ornith 1.0, a fine-tune built on Qwen 3.5 397B A17B. Comes with an mmproj for vision, but isn't shipped with MTP.

You can use DFLASH with it, a novel diffusion-based, MTP-like method, to speed up token generation (TG). It comes in a variety of quants; download the one that works best for your model size.

DFLASH paper: https://arxiv.org/abs/2602.06036

Thanks to:
- https://huggingface.co/z-lab/Qwen3.5-397B-A17B-DFlash
- https://huggingface.co/modal-labs/Qwen3.5-397B-A17B-DFlash
- https://huggingface.co/lmsys/Qwen3.5-397B-A17B-DFlash

Load DFLASH with:
```
--model-draft path/to/Ornith-DFLASH.gguf --spec-type dflash:n_max=1,cross_ctx=256
```

All quants target 16/24/32GB GPUs, with varying amounts of RAM depending on the quant.

Specific quant details (memory footprint with mmproj, without MTP/DFLASH):

IQ4_K - for 256GB RAM + 24GB VRAM - Will eat 20180MB of VRAM and 198GB of RAM with standard config:
```
./build/bin/llama-server \
  -m pmodels/Ornith-1.0-397B-A17B-IQ4_K.gguf \
  --mmproj pmodels/Ornith-mmproj-BF16.gguf --mmproj-gpu-lazy \
  -a Orinth --slot-save-path slots --context-shift off \
  -ot "blk\.(?:[0-9]|[1-5][0-9])\.ffn.*_exps.*=CPU" \
  -ot "token_embd\.weight=CPU" \
  -c 200000 --ctx-checkpoints 8 --ctx-checkpoints-interval 0 --ctx-checkpoints-tolerance 4 \
  --parallel 1 -cram 0 -b 4096 -ub 4096 -wgt 1 \
  -ctk q8_0 -ctv q8_0 -khad -mqkv \
  --threads 15 --threads-batch 16 -ngl 100 \
  -cuda fusion=1,offload-batch-size=16,mmq-id-size=0,fa-offset=0 \
  --host 127.0.0.1 --port 8080 --webui none --jinja
```

Details:
```
# 60 Repeating Layers [0-59] + MTP

## Gated Attention/Delta Net [Blended 0-59]
blk\..*\.attn_gate\.weight=q8_0
blk\..*\.attn_qkv\.weight=q8_0
blk\..*\.ssm_alpha\.weight=bf16
blk\..*\.ssm_beta\.weight=bf16
blk\..*\.ssm_out\.weight=bf16

# Normal attention
blk\..*\.attn_output\.weight=q8_0
blk\..*\.attn_q\.weight=q8_0
blk\..*\.attn_k\.weight=q8_0
blk\..*\.attn_v\.weight=q8_0

# Shared Expert Layers [0-59]
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

# Routed Experts Layers [0-59]
blk\..*\.ffn_down_exps\.weight=iq4_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kss

# Non-Repeating Layers
token_embd\.weight=q8_0
output\.weight=iq6_k
```

---

IQ4_KSS - for 256GB RAM + 24GB VRAM - Will eat 18826MB of VRAM and 191GB of RAM with standard config:
```
./build/bin/llama-server \
  -m pmodels/Ornith-1.0-397B-A17B-IQ4_KSS.gguf \
  --mmproj pmodels/Ornith-mmproj-BF16.gguf --mmproj-gpu-lazy \
  -a Orinth --slot-save-path slots --context-shift off \
  -ot "blk\.(?:[0-9]|[1-5][0-9])\.ffn.*_exps.*=CPU" \
  -ot "token_embd\.weight=CPU" \
  -c 200000 --ctx-checkpoints 8 --ctx-checkpoints-interval 0 --ctx-checkpoints-tolerance 4 \
  --parallel 1 -cram 0 -b 4096 -ub 4096 -wgt 1 \
  -ctk q8_0 -ctv q8_0 -khad -mqkv \
  --threads 15 --threads-batch 16 -ngl 100 \
  -cuda fusion=1,offload-batch-size=16,mmq-id-size=0,fa-offset=0 \
  --host 127.0.0.1 --port 8080 --webui none --jinja
```

Details:
```
# 60 Repeating Layers [0-59] + MTP

## Gated Attention/Delta Net [Blended 0-59]
blk\..*\.attn_gate\.weight=q8_0
blk\..*\.attn_qkv\.weight=q8_0
blk\..*\.ssm_alpha\.weight=bf16
blk\..*\.ssm_beta\.weight=bf16
blk\..*\.ssm_out\.weight=q8_0

# Normal attention
blk\..*\.attn_output\.weight=q8_0
blk\..*\.attn_q\.weight=q8_0
blk\..*\.attn_k\.weight=q8_0
blk\..*\.attn_v\.weight=q8_0

# Shared Expert Layers [0-59]
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

# Routed Experts Layers [0-59]
blk\..*\.ffn_down_exps\.weight=iq4_kss
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kss

# Non-Repeating Layers
token_embd\.weight=q8_0
output\.weight=iq6_k
```

---

IQ3_KS - for 192GB RAM + 24GB VRAM - Will eat 17600MB of VRAM and 137GB of RAM with standard config:
```
./build/bin/llama-server \
  -m pmodels/Ornith-1.0-397B-A17B-IQ3_KS.gguf \
  --mmproj pmodels/Ornith-mmproj-BF16.gguf --mmproj-gpu-lazy \
  -a Orinth --slot-save-path slots --context-shift off \
  -ot "blk\.(?:[0-9]|[1-5][0-9])\.ffn.*_exps.*=CPU" \
  -ot "token_embd\.weight=CPU" \
  -c 200000 --ctx-checkpoints 8 --ctx-checkpoints-interval 0 --ctx-checkpoints-tolerance 4 \
  --parallel 1 -cram 0 -b 4096 -ub 4096 -wgt 1 \
  -ctk q8_0 -ctv q8_0 -khad -mqkv \
  --threads 15 --threads-batch 16 -ngl 100 \
  -cuda fusion=1,offload-batch-size=16,mmq-id-size=0,fa-offset=0 \
  --host 127.0.0.1 --port 8080 --webui none --jinja
```

Details:
```
# 60 Repeating Layers [0-59] + MTP

## Gated Attention/Delta Net [Blended 0-59]
blk\..*\.attn_gate\.weight=q8_0
blk\..*\.attn_qkv\.weight=q8_0
blk\..*\.ssm_alpha\.weight=bf16
blk\..*\.ssm_beta\.weight=bf16
blk\..*\.ssm_out\.weight=q8_0

# Normal attention
blk\..*\.attn_output\.weight=q8_0
blk\..*\.attn_q\.weight=q8_0
blk\..*\.attn_k\.weight=q8_0
blk\..*\.attn_v\.weight=q8_0

# Shared Expert Layers [0-59]
blk\..*\.ffn_down_shexp\.weight=iq6_k
blk\..*\.ffn_(gate|up)_shexp\.weight=iq6_k

# Routed Experts Layers [0-59]
blk\..*\.ffn_down_exps\.weight=iq3_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kl

# Non-Repeating Layers
token_embd\.weight=iq6_k
output\.weight=iq6_k
```

---

IQ2_KS - for 128GB RAM + 16GB VRAM - Will eat 13988MB of VRAM and 92.4GB of RAM with standard config:
```
./build/bin/llama-server \
  -m pmodels/Ornith-1.0-397B-A17B-IQ2_KS.gguf \
  --mmproj pmodels/Ornith-mmproj-BF16.gguf --mmproj-gpu-lazy \
  -a Orinth --slot-save-path slots --context-shift off \
  -ot "blk\.(?:[0-9]|[1-5][0-9])\.ffn.*_exps.*=CPU" \
  -ot "token_embd\.weight=CPU" \
  -c 200000 --ctx-checkpoints 8 --ctx-checkpoints-interval 0 --ctx-checkpoints-tolerance 4 \
  --parallel 1 -cram 0 -b 4096 -ub 4096 -wgt 1 \
  -ctk q8_0 -ctv q8_0 -khad -mqkv \
  --threads 15 --threads-batch 16 -ngl 100 \
  -cuda fusion=1,offload-batch-size=16,mmq-id-size=0,fa-offset=0 \
  --host 127.0.0.1 --port 8080 --webui none --jinja
```

Details:
```
# 60 Repeating Layers [0-59] + MTP

## Gated Attention/Delta Net [Blended 0-59]
blk\..*\.attn_gate\.weight=iq4_ks
blk\..*\.attn_qkv\.weight=iq4_ks
blk\..*\.ssm_alpha\.weight=q8_0
blk\..*\.ssm_beta\.weight=q8_0
blk\..*\.ssm_out\.weight=q8_0

# Normal attention
blk\..*\.attn_output\.weight=iq4_kss
blk\..*\.attn_q\.weight=iq4_kss
blk\..*\.attn_k\.weight=iq4_kss
blk\..*\.attn_v\.weight=iq4_kss

# Shared Expert Layers [0-59]
blk\..*\.ffn_down_shexp\.weight=iq4_kss
blk\..*\.ffn_(gate|up)_shexp\.weight=iq4_kss

# Routed Experts Layers [0-59]
blk\..*\.ffn_down_exps\.weight=iq2_kt
blk\..*\.ffn_(gate|up)_exps\.weight=iq1_kt

# Non-Repeating Layers
token_embd\.weight=iq4_ks
output\.weight=iq4_ks
```

---
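
The two `-ot` overrides used in every command above are what keep the routed experts of layers 0-59 and the token embedding in system RAM. The following is a minimal sketch for checking what those patterns catch. It uses Python's `re` as a stand-in for the regex engine in ik_llama.cpp, and it assumes a pattern may match anywhere in the tensor name. The tensor names are hypothetical, modeled on the names in the recipes above, and `blk.60` stands for a possible extra block outside the 0-59 range.

```python
import re

# Patterns copied from the two -ot flags above (the part before "=CPU").
CPU_OVERRIDES = [
    re.compile(r"blk\.(?:[0-9]|[1-5][0-9])\.ffn.*_exps.*"),
    re.compile(r"token_embd\.weight"),
]


def goes_to_cpu(tensor_name: str) -> bool:
    """Return True if any override pattern matches somewhere in the name."""
    return any(p.search(tensor_name) for p in CPU_OVERRIDES)


# Hypothetical tensor names, for illustration only.
SAMPLES = [
    "blk.0.ffn_down_exps.weight",    # routed expert, first layer
    "blk.59.ffn_up_exps.weight",     # routed expert, last repeating layer
    "blk.60.ffn_gate_exps.weight",   # outside the 0-59 range
    "blk.12.ffn_down_shexp.weight",  # shared expert, no "_exps" in the name
    "blk.12.attn_q.weight",          # attention tensor
    "token_embd.weight",             # token embedding
]

if __name__ == "__main__":
    for name in SAMPLES:
        target = "CPU" if goes_to_cpu(name) else "not overridden"
        print(f"{name:32s} -> {target}")
```

Tensors that are not overridden follow `-ngl 100`, so in these configs shared experts and attention weights stay on the GPU while only the bulky routed experts and the embedding table live in RAM.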

Every additional 65536 tokens of context window require one additional GB of VRAM at Q8 KV cache.

The model was natively trained on a 262144-token context window, so to go beyond 262144 you need the additional YaRN flags (both for ik_llama.cpp and mainline llama.cpp):
```
--rope-scaling yarn --rope-scale N --yarn-orig-ctx 262144
```
Where N is the context ceiling multiplier (2 for 524288, 4 for 1M). Close to no quality loss at scale 2, some quality loss at scale 4.
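
The context rules of thumb above can be turned into a small helper when planning a launch. This is a rough sketch, not a measurement: it assumes the 1 GB per 65536 tokens figure scales linearly from the `-c 200000` baseline used in the standard configs, and it picks the smallest integer N that covers the target context. The function names are made up for this example.

```python
import math

NATIVE_CTX = 262144          # native training context, from the text above
TOKENS_PER_EXTRA_GB = 65536  # rule of thumb from the text above (Q8 KV cache)
BASE_CTX = 200000            # context used by the standard configs (-c 200000)


def extra_vram_gb(target_ctx: int, base_ctx: int = BASE_CTX) -> float:
    """Approximate VRAM delta in GB versus the standard config.

    Assumes linear scaling; a negative value means a smaller context
    than the baseline and therefore a saving.
    """
    return (target_ctx - base_ctx) / TOKENS_PER_EXTRA_GB


def yarn_flags(target_ctx: int) -> list[str]:
    """Return the YaRN flags for target_ctx, or [] within the native window."""
    if target_ctx <= NATIVE_CTX:
        return []
    scale = math.ceil(target_ctx / NATIVE_CTX)  # smallest integer multiplier
    return [
        "--rope-scaling", "yarn",
        "--rope-scale", str(scale),
        "--yarn-orig-ctx", str(NATIVE_CTX),
    ]


if __name__ == "__main__":
    for ctx in (200000, 262144, 524288, 1048576):
        flags = " ".join(yarn_flags(ctx)) or "(no YaRN flags needed)"
        print(f"-c {ctx}: {extra_vram_gb(ctx):+.2f} GB VRAM, {flags}")
```

Append the returned flags to the `llama-server` command and set `-c` to the target context. The VRAM figure is only an estimate, so check actual usage after loading.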