An ExLlamaV3 build of
openpangu/openPangu-2.0-Flashat 4.0 bits per weight. See Quants for sibling repos at other bit‑widths or browse the collection.
Quants
| BPW | Head bits | Calibration rows | Size | KL ÷ fp16 | Status |
|---|---|---|---|---|---|
| 4.0 | 8 | 250 | 51.0 GB | 0.0817 | this repo |
KL ÷ fp16: mean KL-divergence from the fp16 source over wikitext rows — lower is closer to the original.
Inference
| Loader | Use it for |
|---|---|
| TabbyAPI | OpenAI‑compatible HTTP server. Drop‑in for OpenAI clients. |
| text‑generation‑webui | Local chat UI. Pick the ExLlamaV3 loader from the model dropdown. |
| ExLlamaV3 | Direct Python API for embedding the model in your own code or pipeline. |
Download
pip install -U huggingface_hub
hf download \
blockblockblock/openPangu-2.0-Flash-exl3-4.0bpw \
--local-dir ./openPangu-2.0-Flash-exl3-4.0bpw
Quantization recipe (advanced, embedded in quantization_config.json)
| Setting | Value |
|---|---|
| Format | EXL3 |
| Bits per weight | 4.0 |
| Head bits | 8 |
| Calibration rows | 250 |
| Codebook | MCG |
| Out‑scales | always |
| Parallel mode | enabled (MoE expert batching) |
Loaded automatically by every ExLlamaV3 loader; reproduced here for searchability.
License & use
Use and license follow the base model. Quantization adds no additional restrictions. Refer to the upstream repository for terms, citation, and safety documentation.
{org}/{model}-exl3-{bpw}bpw
Requirements
First EXL3 of this architecture. Loading needs exllamav3 with OpenPanguV2 support from
Honkware/exllamav3@openpangu; stock
exllamav3 releases will not load it yet. mHC, MoME conv, attention sink and DSA indexer tensors
ship unquantized.
Multi-token prediction
All three MTP draft layers (46-48) are included and quantized. The per-depth heads and embeddings are tied to the trunk in the base checkpoint, so they are borrowed at load rather than duplicated. Teacher-forced acceptance against the quantized trunk: 92.5% / 89.2% / 88.6% by depth (bf16 reference: 89.2% / 88.6% / 85.9%).
Scope
DSA layers run dense in this build. Batch forward and the MTP chain are validated against the bf16 reference. Cached (paged) generation and per-depth MTP draft dispatch are available on the pangu-paged branch: single-stream chat runs at about 14 tok/s on an H100 and up to 32 tok/s with 3-token MTP drafting (86-96 percent depth-0 acceptance). Batch size 1 only for now; concurrent-job decoding has a known issue under investigation. Quality vs the bf16 original on natural text: mean KL 0.0817, top-1 agreement 92.9%.
License
Use and license follow the base model (LICENSE included in this repo, attribution required). Powered by openPangu.
- Downloads last month
- -
Model tree for blockblockblock/openPangu-2.0-Flash-exl3-4.0bpw
Base model
openpangu/openPangu-2.0-Flash