Warning: all models only work with ik_llama.cpp.
Quants of Ornith 1.0, a fine-tune built on Qwen 3.5 397B A17B. It comes with an mmproj for vision, but isn't shipped with MTP. You can use DFLASH with it, a novel diffusion-based, MTP-like drafter, to speed up token generation (TG). DFLASH comes in a variety of quants, so you can download the one that best fits your model size.
DFLASH paper: https://arxiv.org/abs/2602.06036
Thanks to:
https://huggingface.co/z-lab/Qwen3.5-397B-A17B-DFlash
https://huggingface.co/modal-labs/Qwen3.5-397B-A17B-DFlash
https://huggingface.co/lmsys/Qwen3.5-397B-A17B-DFlash
Load DFLASH with:
--model-draft path/to/Ornith-DFLASH.gguf
--spec-type dflash:n_max=1,cross_ctx=256
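As a rough sketch, the two flags above can be appended to any of the llama-server commands listed further down. The snippet below only assembles and prints a command line; it does not launch anything. The binary path and both GGUF paths are placeholders, and `with_dflash` is a hypothetical helper name, not part of ik_llama.cpp.

```python
import shlex

def with_dflash(base_cmd, draft_path, n_max=1, cross_ctx=256):
    """Return a copy of base_cmd with the DFLASH flags from this card appended."""
    return list(base_cmd) + [
        "--model-draft", draft_path,
        "--spec-type", f"dflash:n_max={n_max},cross_ctx={cross_ctx}",
    ]

# Placeholder paths: point these at your own files.
base = ["./build/bin/llama-server", "-m", "path/to/Ornith-1.0-397B-A17B-IQ4_K.gguf"]
cmd = with_dflash(base, "path/to/Ornith-DFLASH.gguf")

# Print a shell-ready command line for copy and paste.
print(shlex.join(cmd))
```

The defaults mirror the values shown above (`n_max=1`, `cross_ctx=256`); other values are untested here.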
All quants target 16/24/32GB GPUs, with varying amounts of RAM depending on the quant.
Specific quant details (memory footprint with mmproj, without MTP/DFLASH):
IQ4_K - for 256GB RAM + 24GB VRAM
- Will eat 20180MB of VRAM and 198GB of RAM with standard config:
./build/bin/llama-server -m pmodels/Ornith-1.0-397B-A17B-IQ4_K.gguf --mmproj pmodels/Ornith-mmproj-BF16.gguf --mmproj-gpu-lazy -a Orinth --slot-save-path slots --context-shift off -ot "blk\.(?:[0-9]|[1-5][0-9])\.ffn.*_exps.*=CPU" -ot "token_embd\.weight=CPU" -c 200000 --ctx-checkpoints 8 --ctx-checkpoints-interval 0 --ctx-checkpoints-tolerance 4 --parallel 1 -cram 0 -b 4096 -ub 4096 -wgt 1 -ctk q8_0 -ctv q8_0 -khad -mqkv --threads 15 --threads-batch 16 -ngl 100 -cuda fusion=1,offload-batch-size=16,mmq-id-size=0,fa-offset=0 --host 127.0.0.1 --port 8080 --webui none --jinja
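The two `-ot` patterns in the command above decide which tensors stay in system RAM. They can be checked offline with Python's `re` module, as in the sketch below. The tensor names are illustrative examples that follow the `blk.N.*` naming used in the Details lists, and applying the patterns with `re.search` in listed order is an assumption about how the server evaluates them.

```python
import re

# (pattern, buffer type) pairs copied from the two -ot arguments above.
OVERRIDES = [
    (r"blk\.(?:[0-9]|[1-5][0-9])\.ffn.*_exps.*", "CPU"),
    (r"token_embd\.weight", "CPU"),
]

def placement(tensor_name):
    """Return the override buffer type for tensor_name, or None if no pattern hits."""
    for pattern, buffer_type in OVERRIDES:
        if re.search(pattern, tensor_name):
            return buffer_type
    return None

examples = [
    "blk.0.ffn_down_exps.weight",    # routed experts, first layer
    "blk.59.ffn_gate_exps.weight",   # routed experts, last layer
    "blk.0.ffn_down_shexp.weight",   # shared expert, not matched
    "blk.12.attn_q.weight",          # attention, not matched
    "token_embd.weight",
]
for name in examples:
    print(f"{name:32s} -> {placement(name) or 'no override'}")
```

Under these assumptions, only the routed expert tensors of layers 0-59 and the token embedding are pinned to CPU; everything else follows `-ngl`.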
Details:
# 60 Repeating Layers [0-59] + MTP
## Gated Attention/Delta Net [Blended 0-59]
blk\..*\.attn_gate\.weight=q8_0
blk\..*\.attn_qkv\.weight=q8_0
blk\..*\.ssm_alpha\.weight=bf16
blk\..*\.ssm_beta\.weight=bf16
blk\..*\.ssm_out\.weight=bf16
# Normal attention
blk\..*\.attn_output\.weight=q8_0
blk\..*\.attn_q\.weight=q8_0
blk\..*\.attn_k\.weight=q8_0
blk\..*\.attn_v\.weight=q8_0
# Shared Expert Layers [0-59]
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0
# Routed Experts Layers [0-59]
blk\..*\.ffn_down_exps\.weight=iq4_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kss
# Non-Repeating Layers
token_embd\.weight=q8_0
output\.weight=iq6_k
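Each Details list is a set of `regex=type` rules. The sketch below parses a subset of the IQ4_K rules above and looks up the quant type for a few tensor names. The tensor names are illustrative, and the matching semantics (rules tried in listed order, first hit wins, `re.search`) are assumptions, not a description of how the quantization tool resolves them.

```python
import re

# Subset of the IQ4_K recipe above; comment lines are kept to show they are skipped.
RECIPE = r"""
# Routed Experts Layers [0-59]
blk\..*\.ffn_down_exps\.weight=iq4_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kss
# Shared Expert Layers [0-59]
blk\..*\.ffn_down_shexp\.weight=q8_0
# Non-Repeating Layers
token_embd\.weight=q8_0
output\.weight=iq6_k
"""

def parse_recipe(text):
    """Turn 'regex=type' lines into (compiled regex, type) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        pattern, _, qtype = line.rpartition("=")
        rules.append((re.compile(pattern), qtype))
    return rules

def quant_for(tensor_name, rules):
    """Return the type of the first rule that matches, or None."""
    for pattern, qtype in rules:
        if pattern.search(tensor_name):
            return qtype
    return None

rules = parse_recipe(RECIPE)
for name in ["blk.12.ffn_down_exps.weight", "blk.12.ffn_up_exps.weight", "output.weight"]:
    print(name, "->", quant_for(name, rules))
```

In this recipe the routed `ffn_down_exps` tensors get `iq4_k` while the gate and up projections get the smaller `iq4_kss`.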
IQ4_KSS - for 256GB RAM + 24GB VRAM
- Will eat 18826MB of VRAM and 191GB of RAM with standard config:
./build/bin/llama-server -m pmodels/Ornith-1.0-397B-A17B-IQ4_KSS.gguf --mmproj pmodels/Ornith-mmproj-BF16.gguf --mmproj-gpu-lazy -a Orinth --slot-save-path slots --context-shift off -ot "blk\.(?:[0-9]|[1-5][0-9])\.ffn.*_exps.*=CPU" -ot "token_embd\.weight=CPU" -c 200000 --ctx-checkpoints 8 --ctx-checkpoints-interval 0 --ctx-checkpoints-tolerance 4 --parallel 1 -cram 0 -b 4096 -ub 4096 -wgt 1 -ctk q8_0 -ctv q8_0 -khad -mqkv --threads 15 --threads-batch 16 -ngl 100 -cuda fusion=1,offload-batch-size=16,mmq-id-size=0,fa-offset=0 --host 127.0.0.1 --port 8080 --webui none --jinja
Details:
# 60 Repeating Layers [0-59] + MTP
## Gated Attention/Delta Net [Blended 0-59]
blk\..*\.attn_gate\.weight=q8_0
blk\..*\.attn_qkv\.weight=q8_0
blk\..*\.ssm_alpha\.weight=bf16
blk\..*\.ssm_beta\.weight=bf16
blk\..*\.ssm_out\.weight=q8_0
# Normal attention
blk\..*\.attn_output\.weight=q8_0
blk\..*\.attn_q\.weight=q8_0
blk\..*\.attn_k\.weight=q8_0
blk\..*\.attn_v\.weight=q8_0
# Shared Expert Layers [0-59]
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0
# Routed Experts Layers [0-59]
blk\..*\.ffn_down_exps\.weight=iq4_kss
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kss
# Non-Repeating Layers
token_embd\.weight=q8_0
output\.weight=iq6_k
IQ3_KS - for 192GB RAM + 24GB VRAM
- Will eat 17600MB of VRAM and 137GB of RAM with standard config:
./build/bin/llama-server -m pmodels/Ornith-1.0-397B-A17B-IQ3_KS.gguf --mmproj pmodels/Ornith-mmproj-BF16.gguf --mmproj-gpu-lazy -a Orinth --slot-save-path slots --context-shift off -ot "blk\.(?:[0-9]|[1-5][0-9])\.ffn.*_exps.*=CPU" -ot "token_embd\.weight=CPU" -c 200000 --ctx-checkpoints 8 --ctx-checkpoints-interval 0 --ctx-checkpoints-tolerance 4 --parallel 1 -cram 0 -b 4096 -ub 4096 -wgt 1 -ctk q8_0 -ctv q8_0 -khad -mqkv --threads 15 --threads-batch 16 -ngl 100 -cuda fusion=1,offload-batch-size=16,mmq-id-size=0,fa-offset=0 --host 127.0.0.1 --port 8080 --webui none --jinja
Details:
# 60 Repeating Layers [0-59] + MTP
## Gated Attention/Delta Net [Blended 0-59]
blk\..*\.attn_gate\.weight=q8_0
blk\..*\.attn_qkv\.weight=q8_0
blk\..*\.ssm_alpha\.weight=bf16
blk\..*\.ssm_beta\.weight=bf16
blk\..*\.ssm_out\.weight=q8_0
# Normal attention
blk\..*\.attn_output\.weight=q8_0
blk\..*\.attn_q\.weight=q8_0
blk\..*\.attn_k\.weight=q8_0
blk\..*\.attn_v\.weight=q8_0
# Shared Expert Layers [0-59]
blk\..*\.ffn_down_shexp\.weight=iq6_k
blk\..*\.ffn_(gate|up)_shexp\.weight=iq6_k
# Routed Experts Layers [0-59]
blk\..*\.ffn_down_exps\.weight=iq3_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kl
# Non-Repeating Layers
token_embd\.weight=iq6_k
output\.weight=iq6_k
IQ2_KS - for 128GB RAM + 16GB VRAM
- Will eat 13988MB of VRAM and 92.4GB of RAM with standard config:
./build/bin/llama-server -m pmodels/Ornith-1.0-397B-A17B-IQ2_KS.gguf --mmproj pmodels/Ornith-mmproj-BF16.gguf --mmproj-gpu-lazy -a Orinth --slot-save-path slots --context-shift off -ot "blk\.(?:[0-9]|[1-5][0-9])\.ffn.*_exps.*=CPU" -ot "token_embd\.weight=CPU" -c 200000 --ctx-checkpoints 8 --ctx-checkpoints-interval 0 --ctx-checkpoints-tolerance 4 --parallel 1 -cram 0 -b 4096 -ub 4096 -wgt 1 -ctk q8_0 -ctv q8_0 -khad -mqkv --threads 15 --threads-batch 16 -ngl 100 -cuda fusion=1,offload-batch-size=16,mmq-id-size=0,fa-offset=0 --host 127.0.0.1 --port 8080 --webui none --jinja
Details:
# 60 Repeating Layers [0-59] + MTP
## Gated Attention/Delta Net [Blended 0-59]
blk\..*\.attn_gate\.weight=iq4_ks
blk\..*\.attn_qkv\.weight=iq4_ks
blk\..*\.ssm_alpha\.weight=q8_0
blk\..*\.ssm_beta\.weight=q8_0
blk\..*\.ssm_out\.weight=q8_0
# Normal attention
blk\..*\.attn_output\.weight=iq4_kss
blk\..*\.attn_q\.weight=iq4_kss
blk\..*\.attn_k\.weight=iq4_kss
blk\..*\.attn_v\.weight=iq4_kss
# Shared Expert Layers [0-59]
blk\..*\.ffn_down_shexp\.weight=iq4_kss
blk\..*\.ffn_(gate|up)_shexp\.weight=iq4_kss
# Routed Experts Layers [0-59]
blk\..*\.ffn_down_exps\.weight=iq2_kt
blk\..*\.ffn_(gate|up)_exps\.weight=iq1_kt
# Non-Repeating Layers
token_embd\.weight=iq4_ks
output\.weight=iq4_ks
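The footprints listed above can be turned into a small lookup for choosing a quant. This is only a sketch: the VRAM and RAM numbers are the ones from this card (standard config, with mmproj, without MTP/DFLASH), while the 16 GB of RAM headroom for the OS and the "first listed is best" ordering are assumptions. `pick_quant` is a hypothetical helper name.

```python
# (name, VRAM in MB, RAM in GB) as listed on this card, largest quant first.
QUANTS = [
    ("IQ4_K", 20180, 198.0),
    ("IQ4_KSS", 18826, 191.0),
    ("IQ3_KS", 17600, 137.0),
    ("IQ2_KS", 13988, 92.4),
]

def pick_quant(ram_gb, vram_mb, ram_headroom_gb=16.0):
    """Return the first quant whose footprint fits, or None if nothing fits.

    ram_headroom_gb is an assumed reserve for the OS and other processes.
    """
    for name, need_vram_mb, need_ram_gb in QUANTS:
        fits_vram = need_vram_mb <= vram_mb
        fits_ram = need_ram_gb + ram_headroom_gb <= ram_gb
        if fits_vram and fits_ram:
            return name
    return None

print(pick_quant(256, 24 * 1024))
print(pick_quant(192, 24 * 1024))
print(pick_quant(128, 16 * 1024))
```

With the assumed headroom, the three calls reproduce the pairings given in the headings above. DFLASH and a larger context window add to these numbers, so leave extra room if you use either.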
Every additional 65536 tokens of context window require one additional GB of VRAM with a Q8 KV cache (-ctk q8_0 -ctv q8_0).
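The rule of thumb above is easy to apply by hand, but a short sketch makes the arithmetic explicit. It assumes the cost scales linearly with context length, which is how the statement reads; `extra_kv_vram_gb` is a hypothetical helper name.

```python
TOKENS_PER_GB = 65536  # from this card: 1 GB of VRAM per 65536 tokens at Q8 KV cache

def extra_kv_vram_gb(base_ctx, new_ctx):
    """Estimate extra VRAM (GB) needed when growing the context from base_ctx to new_ctx."""
    extra_tokens = max(0, new_ctx - base_ctx)
    return extra_tokens / TOKENS_PER_GB

# Going from the standard config (-c 200000) to the native 262144 window:
print(f"{extra_kv_vram_gb(200000, 262144):.2f} GB")
```

Add the result to the VRAM figure listed for your quant before picking a context size.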
The model was natively trained with a 262144-token context window, so to go beyond 262144 you need the additional YaRN flags (the same for both ik_llama.cpp and mainline llama.cpp):
--rope-scaling yarn
--rope-scale N
--yarn-orig-ctx 262144
Where N is the context ceiling multiplier (2 for 524288, 4 for 1048576). Expect close to no quality loss at scale 2 and some quality loss at scale 4.
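A minimal sketch of how N could be derived from a target context size is shown below. Rounding up to the next whole multiplier is an assumption (this card only gives 2 and 4 as examples), and `yarn_flags` is a hypothetical helper name.

```python
import math

NATIVE_CTX = 262144  # native training context from this card

def yarn_flags(target_ctx):
    """Return the YaRN flags for target_ctx, or an empty list if none are needed."""
    if target_ctx <= NATIVE_CTX:
        return []
    # Assumption: round up to the next whole multiplier of the native window.
    scale = math.ceil(target_ctx / NATIVE_CTX)
    return [
        "--rope-scaling", "yarn",
        "--rope-scale", str(scale),
        "--yarn-orig-ctx", str(NATIVE_CTX),
    ]

print(" ".join(yarn_flags(524288)))
print(" ".join(yarn_flags(1048576)))
```

The returned flags are meant to be appended to the llama-server command together with a matching `-c` value.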
Model tree for paragon-of-brah/Ornith-1.0-397B-DFLASH-GGUF
Base model
deepreinforce-ai/Ornith-1.0-397B