m51Lab-MiniMax-M2.7-REAP-139B-A10B-NVFP4-GB10

v1.1.1 — router-gate quantization fix (2026-04-16)

What happened: The initial upload (2026-04-15) used ignore=["lm_head"] in the llm-compressor recipe, which meant the 62 MoE routers (block_sparse_moe.gate) got quantized along with the expert weights. vLLM's MiniMax-M2 loader expects an unquantized ReplicatedLinear router and fails at engine-init with:

KeyError: 'layers.0.block_sparse_moe.gate.weight_scale'       # FP8
KeyError: 'layers.0.block_sparse_moe.gate.input_global_scale' # NVFP4

This is a hard load failure — the engine never initializes, so no tokens are generated. (The earlier "degraded output" framing understated the severity.)

Root cause: Missing MoE-aware entries in the llm-compressor ignore list. The correct pattern (per saricles/MiniMax-M2.5-REAP-139B-A10B-NVFP4-GB10):

ignore = [
    "lm_head",
    "model.embed_tokens",
    r"re:.*block_sparse_moe\.gate$",
]

Fix: This variant was re-rolled 2026-04-16 with the corrected recipe. quantization_config.ignore now lists all 62 per-layer router gates alongside lm_head.

Verification: config.json on this repo now contains 62 model.layers.N.block_sparse_moe.gate entries in the ignore list. Loaders should open the model without the KeyError above.

Credit: Thanks to the community user who reported this first on the NVFP4-GB10 DGX Spark load. The saricles reference repo was invaluable for confirming the exact pattern.

Unaffected variants (no re-roll needed): BF16 safetensors, all GGUF quantizations.


NVFP4 W4A4 (FP4 weights and activations) of dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B — the first publicly available REAP-40 % pruned variant of MiniMax-M2.7 — specifically targeting GB10 (NVIDIA DGX Spark / Project Digits, SM12.1) and Blackwell FP4-native workloads.

Aspect Value
Base model dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B (BF16)
Quantization NVFP4 (microscaled FP4 for both weights and activations — W4A4)
Format compressed-tensors (vLLM / SGLang native)
Tool llmcompressor
File size ~80 GB across ~20 safetensors shards
Ignored layers lm_head (kept in BF16)

Why "GB10"?

This variant exists specifically because W4A16 NVFP4 (our sibling NVFP4 repo) does not run on GB10:

  • SGLang's CompressedTensorsW4A4Fp4 kernel requires FP4 activations (rejects W4A16)
  • CompressedTensorsWNA16 / Marlin rejects NVFP4's microscaling packing (expects INT4 pack layout)
  • Dequanting W4A16 to BF16 at load costs ~260 GB — exceeds 128 GB unified memory

This W4A4 variant is the canonical format for GB10 and routes through the native FP4 kernel path with Marlin fallback. Follows the established saricles/MiniMax-M2.5-REAP-172B-A10B-NVFP4-GB10 convention.

Hardware compatibility

Hardware Status Notes
GB10 (DGX Spark / Project Digits, 128 GB) Primary target Fits comfortably: ~80 GB weights + ~48 GB KV headroom
NVIDIA Blackwell B100 / B200 ✅ Native FP4 tensor cores accelerate both weights and activations
Hopper H100 / H200 ⚠️ Not supported No FP4 tensor cores; use FP8 variant instead
Ampere A100 ⚠️ Not supported Use AWQ variant

Inference

vLLM (Blackwell)

from vllm import LLM, SamplingParams

llm = LLM(
    model="dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-NVFP4-GB10",
    tensor_parallel_size=1,
    trust_remote_code=True,
    max_model_len=32768,
)

params = SamplingParams(temperature=1.0, top_p=0.95, top_k=40, max_tokens=2048)
out = llm.generate(["Explain REAP pruning briefly."], params)
print(out[0].outputs[0].text)

SGLang (GB10)

python -m sglang.launch_server \
    --model-path dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-NVFP4-GB10 \
    --trust-remote-code \
    --context-length 32768

Quality

Inference quality validated on the BF16 parent via a 5 / 5 pre-publish smoke test and full HumanEval evaluation (see parent safetensors card). W4A4 quantization has more aggressive compression than W4A16 — activation quantization adds a modest quality delta vs FP8 or the W4A16 NVFP4 — typically 1-3 % on reasoning benchmarks for this class of model. For maximum quality on Blackwell, prefer the FP8 or W4A16 NVFP4 variants; for GB10 deployment where 128 GB memory is the binding constraint, this W4A4 variant is the canonical choice.

Base model summary

Property Value
Architecture MoE, 62 layers, 154 experts (pruned from 256), top-8 routing
Active parameters / token ~10 B
Total parameters ~139 B
Max position embeddings 196,608
Vocabulary size 200,064
Pruning REAP 40 %, seed 42

See the parent safetensors card for full architecture, pruning details, and known minor layer-0 bias imperfection.

Recommended generation parameters

  • temperature: 1.0
  • top_p: 0.95
  • top_k: 40
  • repeat_penalty: 1.05

Companion repos

Acknowledgements

The W4A4 recipe and GB10-specific naming follow saricles/MiniMax-M2.5-REAP-172B-A10B-NVFP4-GB10 — thanks to saricles for establishing this convention in the community.

Citation & License

See the safetensors repo. Core references: Lasby et al., REAP the Experts (arXiv:2510.13999); MiniMax AI, MiniMax-M2.7.

Inherits the Modified MIT License from MiniMaxAI/MiniMax-M2.7.


Published by m51Lab — open-source LLM contributions from the M51 AI OS group.

Downloads last month
380
Safetensors
Model size
79B params
Tensor type
BF16
·
F32
·
F8_E4M3
·
U8
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-NVFP4-GB10

Quantized
(6)
this model

Paper for dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-NVFP4-GB10