openPangu · 2.0 · Flash

_{EXL3 · 4.0 bpw · 51.0 GB · Mixture‑of‑Experts · 46 layers × 256 experts}

An ExLlamaV3 build of openpangu/openPangu-2.0-Flash at 4.0 bits per weight. See Quants for sibling repos at other bit‑widths or browse the collection.

Quants

BPW	Head bits	Calibration rows	Size	KL ÷ fp16	Status
4.0	8	250	51.0 GB	0.0817	`this repo`

_{KL ÷ fp16: mean KL-divergence from the fp16 source over wikitext rows — lower is closer to the original.}

Inference

Loader	Use it for
TabbyAPI	OpenAI‑compatible HTTP server. Drop‑in for OpenAI clients.
text‑generation‑webui	Local chat UI. Pick the ExLlamaV3 loader from the model dropdown.
ExLlamaV3	Direct Python API for embedding the model in your own code or pipeline.

Download

pip install -U huggingface_hub

hf download \
  blockblockblock/openPangu-2.0-Flash-exl3-4.0bpw \
  --local-dir ./openPangu-2.0-Flash-exl3-4.0bpw

Quantization recipe _{(advanced, embedded in quantization_config.json)}

Setting	Value
Format	`EXL3`
Bits per weight	`4.0`
Head bits	`8`
Calibration rows	`250`
Codebook	`MCG`
Out‑scales	`always`
Parallel mode	`enabled` (MoE expert batching)

Loaded automatically by every ExLlamaV3 loader; reproduced here for searchability.

License & use

Use and license follow the base model. Quantization adds no additional restrictions. Refer to the upstream repository for terms, citation, and safety documentation.

_{Quantized with BlockQuant · convention {org}/{model}-exl3-{bpw}bpw}

Requirements

First EXL3 of this architecture. Loading needs exllamav3 with OpenPanguV2 support from Honkware/exllamav3@openpangu; stock exllamav3 releases will not load it yet. mHC, MoME conv, attention sink and DSA indexer tensors ship unquantized.

Multi-token prediction

All three MTP draft layers (46-48) are included and quantized. The per-depth heads and embeddings are tied to the trunk in the base checkpoint, so they are borrowed at load rather than duplicated. Teacher-forced acceptance against the quantized trunk: 92.5% / 89.2% / 88.6% by depth (bf16 reference: 89.2% / 88.6% / 85.9%).

Scope

DSA layers run dense in this build. Batch forward and the MTP chain are validated against the bf16 reference. Cached (paged) generation and per-depth MTP draft dispatch are available on the pangu-paged branch: single-stream chat runs at about 14 tok/s on an H100 and up to 32 tok/s with 3-token MTP drafting (86-96 percent depth-0 acceptance). Batch size 1 only for now; concurrent-job decoding has a known issue under investigation. Quality vs the bf16 original on natural text: mean KL 0.0817, top-1 agreement 92.9%.

License

Use and license follow the base model (LICENSE included in this repo, attribution required). Powered by openPangu.

Downloads last month: -

Safetensors

Model size

25B params

Tensor type

BF16

F16

I16

Model tree for blockblockblock/openPangu-2.0-Flash-exl3-4.0bpw

Base model

openpangu/openPangu-2.0-Flash

Quantized

(5)

this model

Collection including blockblockblock/openPangu-2.0-Flash-exl3-4.0bpw

openPangu-2.0-Flash EXL3

Collection

EXL3 quants of openPangu-2.0-Flash, produced by BlockQuant. • 1 item • Updated 1 day ago