openPangu · 2.0 · Flash

EXL3  ·  4.0 bpw  ·  51.0 GB  ·  Mixture‑of‑Experts  ·  46 layers × 256 experts


format bpw size arch

base model quantized by collection


An ExLlamaV3 build of openpangu/openPangu-2.0-Flash at 4.0 bits per weight. See Quants for sibling repos at other bit‑widths or browse the collection.

Quants

BPW     Head bits     Calibration rows     Size     KL ÷ fp16     Status
4.0 8 250 51.0 GB 0.0817 this repo

KL ÷ fp16: mean KL-divergence from the fp16 source over wikitext rows — lower is closer to the original.

Inference

Loader Use it for
TabbyAPI OpenAI‑compatible HTTP server. Drop‑in for OpenAI clients.
text‑generation‑webui Local chat UI. Pick the ExLlamaV3 loader from the model dropdown.
ExLlamaV3 Direct Python API for embedding the model in your own code or pipeline.

Download

pip install -U huggingface_hub

hf download \
  blockblockblock/openPangu-2.0-Flash-exl3-4.0bpw \
  --local-dir ./openPangu-2.0-Flash-exl3-4.0bpw
Quantization recipe  (advanced, embedded in quantization_config.json)
Setting Value
Format EXL3
Bits per weight 4.0
Head bits 8
Calibration rows 250
Codebook MCG
Out‑scales always
Parallel mode enabled (MoE expert batching)

Loaded automatically by every ExLlamaV3 loader; reproduced here for searchability.

License & use

Use and license follow the base model. Quantization adds no additional restrictions. Refer to the upstream repository for terms, citation, and safety documentation.


Quantized with BlockQuant  ·  convention {org}/{model}-exl3-{bpw}bpw

Requirements

First EXL3 of this architecture. Loading needs exllamav3 with OpenPanguV2 support from Honkware/exllamav3@openpangu; stock exllamav3 releases will not load it yet. mHC, MoME conv, attention sink and DSA indexer tensors ship unquantized.

Multi-token prediction

All three MTP draft layers (46-48) are included and quantized. The per-depth heads and embeddings are tied to the trunk in the base checkpoint, so they are borrowed at load rather than duplicated. Teacher-forced acceptance against the quantized trunk: 92.5% / 89.2% / 88.6% by depth (bf16 reference: 89.2% / 88.6% / 85.9%).

Scope

DSA layers run dense in this build. Batch forward and the MTP chain are validated against the bf16 reference. Cached (paged) generation and per-depth MTP draft dispatch are available on the pangu-paged branch: single-stream chat runs at about 14 tok/s on an H100 and up to 32 tok/s with 3-token MTP drafting (86-96 percent depth-0 acceptance). Batch size 1 only for now; concurrent-job decoding has a known issue under investigation. Quality vs the bf16 original on natural text: mean KL 0.0817, top-1 agreement 92.9%.

License

Use and license follow the base model (LICENSE included in this repo, attribution required). Powered by openPangu.

Downloads last month
-
Safetensors
Model size
25B params
Tensor type
BF16
·
F16
·
I16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for blockblockblock/openPangu-2.0-Flash-exl3-4.0bpw

Quantized
(5)
this model

Collection including blockblockblock/openPangu-2.0-Flash-exl3-4.0bpw