
Warning: all models only work with ik_llama.cpp.

Quants of Ornith 1.0, a fine-tune built on Qwen 3.5 397B A17B. The model comes with an mmproj for vision but is not shipped with MTP. You can use DFLASH with it, a novel diffusion-based MTP-like draft model, to speed up token generation (TG). DFLASH comes in a variety of quants; download the one that works best for your model size.

DFLASH paper: https://arxiv.org/abs/2602.06036

Thanks to:

https://huggingface.co/z-lab/Qwen3.5-397B-A17B-DFlash

https://huggingface.co/modal-labs/Qwen3.5-397B-A17B-DFlash

https://huggingface.co/lmsys/Qwen3.5-397B-A17B-DFlash

Load DFLASH with:

--model-draft path/to/Ornith-DFLASH.gguf
--spec-type dflash:n_max=1,cross_ctx=256
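
The two flags above are appended to an ordinary llama-server invocation. A minimal sketch, assuming the IQ4_K model path from the standard config and a placeholder draft path (dflash_args is a hypothetical helper, not part of ik_llama.cpp), prints the combined command so it can be inspected before running:

```shell
#!/bin/sh
# Sketch: append the DFLASH draft flags to an existing llama-server command.
# MODEL and DRAFT are placeholder paths; substitute your own files.
MODEL="pmodels/Ornith-1.0-397B-A17B-IQ4_K.gguf"
DRAFT="path/to/Ornith-DFLASH.gguf"

# dflash_args DRAFT_PATH
# Prints the two flags from this card, one token per line.
dflash_args() {
    printf '%s\n' --model-draft "$1" --spec-type "dflash:n_max=1,cross_ctx=256"
}

# Print the command instead of executing it. The unquoted substitution
# relies on word splitting, so paths containing spaces are not supported.
echo ./build/bin/llama-server -m "$MODEL" $(dflash_args "$DRAFT")
```

The remaining flags from the standard config for your quant go on the same command line.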

All quants target 16/24/32GB GPUs, with varying amounts of RAM depending on the quant.

Specific quant details (memory footprint with mmproj, without MTP/DFLASH):

IQ4_K - for 256GB RAM + 24GB VRAM
  • Uses 20180 MB of VRAM and 198 GB of RAM with the standard config:
        ./build/bin/llama-server \
        -m pmodels/Ornith-1.0-397B-A17B-IQ4_K.gguf \
        --mmproj pmodels/Ornith-mmproj-BF16.gguf \
        --mmproj-gpu-lazy \
        -a Ornith \
        --slot-save-path slots \
        --context-shift off \
        -ot "blk\.(?:[0-9]|[1-5][0-9])\.ffn.*_exps.*=CPU" \
        -ot "token_embd\.weight=CPU" \
        -c 200000 \
        --ctx-checkpoints 8 \
        --ctx-checkpoints-interval 0 \
        --ctx-checkpoints-tolerance 4 \
        --parallel 1 \
        -cram 0 \
        -b 4096 -ub 4096 \
        -wgt 1 \
        -ctk q8_0 -ctv q8_0 \
        -khad \
        -mqkv \
        --threads 15 --threads-batch 16 -ngl 100 \
        -cuda fusion=1,offload-batch-size=16,mmq-id-size=0,fa-offset=0 \
        --host 127.0.0.1 \
        --port 8080 \
        --webui none \
        --jinja
    

Details:

  # 60 Repeating Layers [0-59] + MTP

  ## Gated Attention/Delta Net [Blended 0-59]
  blk\..*\.attn_gate\.weight=q8_0
  blk\..*\.attn_qkv\.weight=q8_0
  blk\..*\.ssm_alpha\.weight=bf16
  blk\..*\.ssm_beta\.weight=bf16
  blk\..*\.ssm_out\.weight=bf16

  # Normal attention
  blk\..*\.attn_output\.weight=q8_0
  blk\..*\.attn_q\.weight=q8_0
  blk\..*\.attn_k\.weight=q8_0
  blk\..*\.attn_v\.weight=q8_0

  # Shared Expert Layers [0-59]
  blk\..*\.ffn_down_shexp\.weight=q8_0
  blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

  # Routed Experts Layers [0-59]
  blk\..*\.ffn_down_exps\.weight=iq4_k
  blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kss

  # Non-Repeating Layers
  token_embd\.weight=q8_0
  output\.weight=iq6_k

IQ4_KSS - for 256GB RAM + 24GB VRAM
  • Uses 18826 MB of VRAM and 191 GB of RAM with the standard config:
        ./build/bin/llama-server \
        -m pmodels/Ornith-1.0-397B-A17B-IQ4_KSS.gguf \
        --mmproj pmodels/Ornith-mmproj-BF16.gguf \
        --mmproj-gpu-lazy \
        -a Ornith \
        --slot-save-path slots \
        --context-shift off \
        -ot "blk\.(?:[0-9]|[1-5][0-9])\.ffn.*_exps.*=CPU" \
        -ot "token_embd\.weight=CPU" \
        -c 200000 \
        --ctx-checkpoints 8 \
        --ctx-checkpoints-interval 0 \
        --ctx-checkpoints-tolerance 4 \
        --parallel 1 \
        -cram 0 \
        -b 4096 -ub 4096 \
        -wgt 1 \
        -ctk q8_0 -ctv q8_0 \
        -khad \
        -mqkv \
        --threads 15 --threads-batch 16 -ngl 100 \
        -cuda fusion=1,offload-batch-size=16,mmq-id-size=0,fa-offset=0 \
        --host 127.0.0.1 \
        --port 8080 \
        --webui none \
        --jinja
    

Details:

  # 60 Repeating Layers [0-59] + MTP

  ## Gated Attention/Delta Net [Blended 0-59]
  blk\..*\.attn_gate\.weight=q8_0
  blk\..*\.attn_qkv\.weight=q8_0
  blk\..*\.ssm_alpha\.weight=bf16
  blk\..*\.ssm_beta\.weight=bf16
  blk\..*\.ssm_out\.weight=q8_0

  # Normal attention
  blk\..*\.attn_output\.weight=q8_0
  blk\..*\.attn_q\.weight=q8_0
  blk\..*\.attn_k\.weight=q8_0
  blk\..*\.attn_v\.weight=q8_0

  # Shared Expert Layers [0-59]
  blk\..*\.ffn_down_shexp\.weight=q8_0
  blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

  # Routed Experts Layers [0-59]
  blk\..*\.ffn_down_exps\.weight=iq4_kss
  blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kss

  # Non-Repeating Layers
  token_embd\.weight=q8_0
  output\.weight=iq6_k

IQ3_KS - for 192GB RAM + 24GB VRAM
  • Uses 17600 MB of VRAM and 137 GB of RAM with the standard config:
        ./build/bin/llama-server \
        -m pmodels/Ornith-1.0-397B-A17B-IQ3_KS.gguf \
        --mmproj pmodels/Ornith-mmproj-BF16.gguf \
        --mmproj-gpu-lazy \
        -a Ornith \
        --slot-save-path slots \
        --context-shift off \
        -ot "blk\.(?:[0-9]|[1-5][0-9])\.ffn.*_exps.*=CPU" \
        -ot "token_embd\.weight=CPU" \
        -c 200000 \
        --ctx-checkpoints 8 \
        --ctx-checkpoints-interval 0 \
        --ctx-checkpoints-tolerance 4 \
        --parallel 1 \
        -cram 0 \
        -b 4096 -ub 4096 \
        -wgt 1 \
        -ctk q8_0 -ctv q8_0 \
        -khad \
        -mqkv \
        --threads 15 --threads-batch 16 -ngl 100 \
        -cuda fusion=1,offload-batch-size=16,mmq-id-size=0,fa-offset=0 \
        --host 127.0.0.1 \
        --port 8080 \
        --webui none \
        --jinja
    

Details:

  # 60 Repeating Layers [0-59] + MTP

  ## Gated Attention/Delta Net [Blended 0-59]
  blk\..*\.attn_gate\.weight=q8_0
  blk\..*\.attn_qkv\.weight=q8_0
  blk\..*\.ssm_alpha\.weight=bf16
  blk\..*\.ssm_beta\.weight=bf16
  blk\..*\.ssm_out\.weight=q8_0

  # Normal attention
  blk\..*\.attn_output\.weight=q8_0
  blk\..*\.attn_q\.weight=q8_0
  blk\..*\.attn_k\.weight=q8_0
  blk\..*\.attn_v\.weight=q8_0

  # Shared Expert Layers [0-59]
  blk\..*\.ffn_down_shexp\.weight=iq6_k
  blk\..*\.ffn_(gate|up)_shexp\.weight=iq6_k

  # Routed Experts Layers [0-59]
  blk\..*\.ffn_down_exps\.weight=iq3_ks
  blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kl

  # Non-Repeating Layers
  token_embd\.weight=iq6_k
  output\.weight=iq6_k

IQ2_KS - for 128GB RAM + 16GB VRAM
  • Uses 13988 MB of VRAM and 92.4 GB of RAM with the standard config:
        ./build/bin/llama-server \
        -m pmodels/Ornith-1.0-397B-A17B-IQ2_KS.gguf \
        --mmproj pmodels/Ornith-mmproj-BF16.gguf \
        --mmproj-gpu-lazy \
        -a Ornith \
        --slot-save-path slots \
        --context-shift off \
        -ot "blk\.(?:[0-9]|[1-5][0-9])\.ffn.*_exps.*=CPU" \
        -ot "token_embd\.weight=CPU" \
        -c 200000 \
        --ctx-checkpoints 8 \
        --ctx-checkpoints-interval 0 \
        --ctx-checkpoints-tolerance 4 \
        --parallel 1 \
        -cram 0 \
        -b 4096 -ub 4096 \
        -wgt 1 \
        -ctk q8_0 -ctv q8_0 \
        -khad \
        -mqkv \
        --threads 15 --threads-batch 16 -ngl 100 \
        -cuda fusion=1,offload-batch-size=16,mmq-id-size=0,fa-offset=0 \
        --host 127.0.0.1 \
        --port 8080 \
        --webui none \
        --jinja
    

Details:

  # 60 Repeating Layers [0-59] + MTP

  ## Gated Attention/Delta Net [Blended 0-59]
  blk\..*\.attn_gate\.weight=iq4_ks
  blk\..*\.attn_qkv\.weight=iq4_ks
  blk\..*\.ssm_alpha\.weight=q8_0
  blk\..*\.ssm_beta\.weight=q8_0
  blk\..*\.ssm_out\.weight=q8_0

  # Normal attention
  blk\..*\.attn_output\.weight=iq4_kss
  blk\..*\.attn_q\.weight=iq4_kss
  blk\..*\.attn_k\.weight=iq4_kss
  blk\..*\.attn_v\.weight=iq4_kss

  # Shared Expert Layers [0-59]
  blk\..*\.ffn_down_shexp\.weight=iq4_kss
  blk\..*\.ffn_(gate|up)_shexp\.weight=iq4_kss

  # Routed Experts Layers [0-59]
  blk\..*\.ffn_down_exps\.weight=iq2_kt
  blk\..*\.ffn_(gate|up)_exps\.weight=iq1_kt

  # Non-Repeating Layers
  token_embd\.weight=iq4_ks
  output\.weight=iq4_ks
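
Every standard config above uses the same -ot pattern to keep the routed experts in system RAM. A small sketch, assuming the pattern is applied as an unanchored search over tensor names, shows which of the tensor names from the Details lists it catches. The pattern is rewritten here as a POSIX extended regex, since grep -E has no "(?:...)" groups, and placement is a hypothetical helper; "GPU" only means "not caught by this override":

```shell
#!/bin/sh
# ERE form of the card's override pattern (the "=CPU" suffix is dropped
# because it is the target buffer, not part of the tensor name).
CPU_EXPERTS_RE='blk\.([0-9]|[1-5][0-9])\.ffn.*_exps.*'

# placement TENSOR_NAME
# Prints CPU if the override pattern matches the name, otherwise GPU.
placement() {
    if printf '%s\n' "$1" | grep -Eq "$CPU_EXPERTS_RE"; then
        echo CPU
    else
        echo GPU
    fi
}

# Routed experts in layers 0-59 match; shared experts and attention do not.
for t in blk.0.ffn_down_exps.weight blk.59.ffn_up_exps.weight \
         blk.7.ffn_down_shexp.weight blk.7.attn_q.weight; do
    printf '%s -> %s\n' "$t" "$(placement "$t")"
done
```

Narrowing the layer range in the pattern would leave some routed experts on the GPU, at the cost of more VRAM than the figures quoted for each quant.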


Each additional 65536 tokens of context window requires one additional GB of VRAM with a Q8 KV cache (-ctk q8_0 -ctv q8_0).
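
As a rough sketch of this rule of thumb, the helper below estimates the extra VRAM needed to grow the context beyond the -c 200000 used in the standard configs. It assumes the cost scales linearly and that 1 GB means 1024 MB; extra_kv_vram_mb is a hypothetical name, not a tool shipped with the model:

```shell
#!/bin/sh
# extra_kv_vram_mb BASE_CTX NEW_CTX
# Estimates the additional VRAM in MB for a larger context window,
# assuming 1 GB (taken as 1024 MB) per 65536 tokens with a Q8 KV cache.
extra_kv_vram_mb() {
    base=$1
    new=$2
    delta=$((new - base))
    # Shrinking the context is treated as needing no extra memory.
    [ "$delta" -gt 0 ] || delta=0
    # 1024 MB / 65536 tokens = 1 MB per 64 tokens; round up.
    echo $(( (delta + 63) / 64 ))
}

# Going from the standard 200000 to the native 262144 window.
extra_kv_vram_mb 200000 262144   # prints 971
```

Add the result to the VRAM figure listed for your quant to check that it still fits on your GPU.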

The model was natively trained with a 262144-token context window, so to go beyond 262144 you need the additional YaRN flags (the same for ik_llama.cpp and mainline llama.cpp):

  --rope-scaling yarn
  --rope-scale N
  --yarn-orig-ctx 262144

N is the context ceiling multiplier (2 for 524288, 4 for 1048576). There is close to no quality loss at scale 2 and some quality loss at scale 4.
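
A minimal sketch of picking N for a target context follows. Rounding up to the next whole multiplier for targets that are not exact multiples of 262144 is an assumption of this example, and yarn_flags is a hypothetical helper:

```shell
#!/bin/sh
NATIVE_CTX=262144

# yarn_flags TARGET_CTX
# Prints the YaRN flags for a target context, or nothing when the target
# fits inside the native window and no scaling is needed.
yarn_flags() {
    target=$1
    if [ "$target" -le "$NATIVE_CTX" ]; then
        return 0
    fi
    # Ceiling division: smallest N with N * NATIVE_CTX >= target.
    scale=$(( (target + NATIVE_CTX - 1) / NATIVE_CTX ))
    echo "--rope-scaling yarn --rope-scale $scale --yarn-orig-ctx $NATIVE_CTX"
}

yarn_flags 524288    # scale 2
yarn_flags 1048576   # scale 4
```

Set -c to the target context as well; the flags printed here only change the RoPE scaling.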
