Libraries

How to use dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-NVFP4-GB10 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-NVFP4-GB10", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-NVFP4-GB10", trust_remote_code=True)
# MiniMax-M2.7 is a text-only causal LM, so the causal-LM auto class is used here
model = AutoModelForCausalLM.from_pretrained("dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-NVFP4-GB10", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

vLLM

How to use dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-NVFP4-GB10 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-NVFP4-GB10"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-NVFP4-GB10",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

SGLang

How to use dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-NVFP4-GB10 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-NVFP4-GB10" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-NVFP4-GB10",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-NVFP4-GB10" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-NVFP4-GB10 with Docker Model Runner:
```
docker model run hf.co/dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-NVFP4-GB10
```

m51Lab-MiniMax-M2.7-REAP-139B-A10B-NVFP4-GB10

`v1.1.1` — router-gate quantization fix (2026-04-16)

What happened: The initial upload (2026-04-15) used ignore=["lm_head"] in the llm-compressor recipe, which meant the 62 MoE routers (block_sparse_moe.gate) got quantized along with the expert weights. vLLM's MiniMax-M2 loader expects an unquantized ReplicatedLinear router and fails at engine-init with:

KeyError: 'layers.0.block_sparse_moe.gate.weight_scale'       # FP8
KeyError: 'layers.0.block_sparse_moe.gate.input_global_scale' # NVFP4

This is a hard load failure — the engine never initializes, so no tokens are generated. (The earlier "degraded output" framing understated the severity.)

Root cause: Missing MoE-aware entries in the llm-compressor ignore list. The correct pattern (per saricles/MiniMax-M2.5-REAP-139B-A10B-NVFP4-GB10):

ignore = [
    "lm_head",
    "model.embed_tokens",
    r"re:.*block_sparse_moe\.gate$",
]
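
To illustrate which module names the regex entry in the list above selects, the sketch below applies the three entries to a few names. The `matches_ignore` helper and the sample module names are hypothetical, and the matching rule (a `re:` prefix marks a regular expression, anything else is an exact name) is an assumption about how such lists are interpreted; llm-compressor's own matching logic may differ in detail.

```python
import re

IGNORE = [
    "lm_head",
    "model.embed_tokens",
    r"re:.*block_sparse_moe\.gate$",
]


def matches_ignore(module_name: str, patterns=IGNORE) -> bool:
    """Return True if module_name is selected by any ignore entry.

    Assumption: entries prefixed with "re:" are regular expressions
    matched from the start of the name; other entries are exact names.
    """
    for pattern in patterns:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], module_name):
                return True
        elif pattern == module_name:
            return True
    return False


# Hypothetical module names in the style of the error messages above.
names = [
    "model.layers.0.block_sparse_moe.gate",
    "model.layers.0.block_sparse_moe.experts.3.w1",
    "lm_head",
]
for name in names:
    print(name, "->", "ignored" if matches_ignore(name) else "quantized")
```

The trailing `$` keeps the pattern from selecting anything nested below the gate, so only the router module itself is left unquantized.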

Fix: This variant was re-rolled on 2026-04-16 with the corrected recipe. quantization_config.ignore now lists all 62 per-layer router gates alongside lm_head.

Verification: config.json on this repo now contains 62 model.layers.N.block_sparse_moe.gate entries in the ignore list. Loaders should open the model without the KeyError above.
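
The verification step can be reproduced locally with a short script. This is a minimal sketch, assuming the ignore list sits under `quantization_config.ignore` and that gate entries follow the `model.layers.N.block_sparse_moe.gate` naming described above; the helper names and the synthetic sample config are illustrative, not part of the repo.

```python
import json
import re

GATE_RE = re.compile(r"^model\.layers\.(\d+)\.block_sparse_moe\.gate$")


def count_ignored_gates(config: dict) -> int:
    """Count distinct per-layer router gates in quantization_config.ignore."""
    ignore = config.get("quantization_config", {}).get("ignore", [])
    layers = set()
    for name in ignore:
        match = GATE_RE.match(name)
        if match:
            layers.add(int(match.group(1)))
    return len(layers)


def count_ignored_gates_in_file(path: str) -> int:
    """Same check, reading a downloaded config.json from disk."""
    with open(path, encoding="utf-8") as f:
        return count_ignored_gates(json.load(f))


# Synthetic stand-in for the repo's config.json (only the relevant field).
sample = {
    "quantization_config": {
        "ignore": ["lm_head"]
        + [f"model.layers.{i}.block_sparse_moe.gate" for i in range(62)]
    }
}
print("router gates in ignore list:", count_ignored_gates(sample))
```

Running `count_ignored_gates_in_file("config.json")` against a downloaded copy should report 62 if the re-rolled recipe is in place.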

Credit: Thanks to the community user who reported this first on the NVFP4-GB10 DGX Spark load. The saricles reference repo was invaluable for confirming the exact pattern.

Unaffected variants (no re-roll needed): BF16 safetensors, all GGUF quantizations.

NVFP4 W4A4 (FP4 weights and activations) quantization of dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B, the first publicly available REAP-40 % pruned variant of MiniMax-M2.7. It specifically targets GB10 (NVIDIA DGX Spark / Project Digits, SM12.1) and Blackwell FP4-native workloads.

| Aspect | Value |
| --- | --- |
| Base model | `dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B` (BF16) |
| Quantization | NVFP4 (microscaled FP4 for both weights and activations, i.e. W4A4) |
| Format | `compressed-tensors` (vLLM / SGLang native) |
| Tool | `llmcompressor` |
| File size | ~80 GB across ~20 safetensors shards |
| Ignored layers | `lm_head` and the 62 `block_sparse_moe.gate` routers (left unquantized) |

Why "GB10"?

This variant exists specifically because W4A16 NVFP4 (our sibling NVFP4 repo) does not run on GB10:

SGLang's CompressedTensorsW4A4Fp4 kernel requires FP4 activations (rejects W4A16)
CompressedTensorsWNA16 / Marlin rejects NVFP4's microscaling packing (expects INT4 pack layout)
Dequantizing W4A16 to BF16 at load time costs ~260 GB, which exceeds the 128 GB of unified memory
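
A back-of-the-envelope check of the last point: the sketch below assumes the ~260 GB figure comes from holding all ~139 B parameters at 2 bytes each, which the card does not state explicitly. The parameter count and memory size are taken from this card; the variable names are illustrative.

```python
TOTAL_PARAMS = 139e9        # ~139 B total parameters (from this card)
BYTES_PER_BF16 = 2          # BF16 stores each value in 2 bytes
UNIFIED_MEMORY_GB = 128     # GB10 unified memory (from this card)

bf16_bytes = TOTAL_PARAMS * BYTES_PER_BF16
bf16_gb = bf16_bytes / 1e9       # decimal gigabytes
bf16_gib = bf16_bytes / 2**30    # binary gibibytes

print(f"BF16 weights: {bf16_gb:.0f} GB ({bf16_gib:.0f} GiB)")
print(f"Fits in {UNIFIED_MEMORY_GB} GB: {bf16_gb <= UNIFIED_MEMORY_GB}")
```

Whichever unit is used, the dequantized weights alone come to roughly twice the available memory, before any KV cache is allocated.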

This W4A4 variant is the canonical format for GB10 and routes through the native FP4 kernel path with Marlin fallback. It follows the established saricles/MiniMax-M2.5-REAP-172B-A10B-NVFP4-GB10 convention.

Hardware compatibility

| Hardware | Status | Notes |
| --- | --- | --- |
| GB10 (DGX Spark / Project Digits, 128 GB) | ✅ Primary target | Fits comfortably: ~80 GB weights + ~48 GB KV headroom |
| NVIDIA Blackwell B100 / B200 | ✅ Native | FP4 tensor cores accelerate both weights and activations |
| Hopper H100 / H200 | ⚠️ Not supported | No FP4 tensor cores; use FP8 variant instead |
| Ampere A100 | ⚠️ Not supported | Use AWQ variant |

Inference

vLLM (Blackwell)

from vllm import LLM, SamplingParams

llm = LLM(
    model="dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-NVFP4-GB10",
    tensor_parallel_size=1,
    trust_remote_code=True,
    max_model_len=32768,
)

params = SamplingParams(temperature=1.0, top_p=0.95, top_k=40, max_tokens=2048)
out = llm.generate(["Explain REAP pruning briefly."], params)
print(out[0].outputs[0].text)

SGLang (GB10)

python -m sglang.launch_server \
    --model-path dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-NVFP4-GB10 \
    --trust-remote-code \
    --context-length 32768

Quality

Inference quality was validated on the BF16 parent via a 5/5 pre-publish smoke test and a full HumanEval evaluation (see the parent safetensors card). W4A4 compresses more aggressively than W4A16: quantizing activations adds a modest quality delta relative to FP8 or the W4A16 NVFP4 variant, typically 1-3 % on reasoning benchmarks for this class of model. For maximum quality on Blackwell, prefer the FP8 or W4A16 NVFP4 variants. For GB10 deployment, where the 128 GB of memory is the binding constraint, this W4A4 variant is the canonical choice.

Base model summary

| Property | Value |
| --- | --- |
| Architecture | MoE, 62 layers, 154 experts (pruned from 256), top-8 routing |
| Active parameters / token | ~10 B |
| Total parameters | ~139 B |
| Max position embeddings | 196,608 |
| Vocabulary size | 200,064 |
| Pruning | REAP 40 %, seed 42 |
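
The expert count in the table is consistent with the stated pruning ratio. The short check below assumes the kept count is the nearest whole number of experts after removing 40 % of 256; the card does not state the rounding rule, so treat that as an assumption.

```python
ORIGINAL_EXPERTS = 256   # experts before pruning (from the table)
PRUNE_FRACTION = 0.40    # REAP 40 %

kept_exact = ORIGINAL_EXPERTS * (1 - PRUNE_FRACTION)
kept = round(kept_exact)  # assumption: nearest whole expert
pruned = ORIGINAL_EXPERTS - kept

print(f"kept {kept} of {ORIGINAL_EXPERTS} experts, pruned {pruned}")
```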

See the parent safetensors card for full architecture, pruning details, and a known minor layer-0 bias imperfection.

Recommended generation parameters

temperature: 1.0
top_p: 0.95
top_k: 40
repeat_penalty: 1.05
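
As a sketch of how these settings might be sent to an OpenAI-compatible `/v1/chat/completions` endpoint, the helper below builds the request body. `build_chat_payload` is a hypothetical helper. `top_k` and `repetition_penalty` are not part of the core OpenAI schema and are assumed here to be accepted as extra sampling parameters by the serving engine; mapping `repeat_penalty` to `repetition_penalty` is likewise an assumption about naming.

```python
import json

MODEL_ID = "dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-NVFP4-GB10"


def build_chat_payload(prompt: str, max_tokens: int = 2048) -> dict:
    """Build a chat-completions request body with the recommended sampling settings."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 1.0,
        "top_p": 0.95,
        # Not part of the core OpenAI schema; assumed to be accepted
        # as extra sampling parameters by the serving engine.
        "top_k": 40,
        "repetition_penalty": 1.05,  # assumed equivalent of repeat_penalty
    }


payload = build_chat_payload("Explain REAP pruning briefly.")
print(json.dumps(payload, indent=2))
```

The printed JSON can be passed as the `--data` argument of a `curl -X POST` call against the running server.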

Companion repos

Parent safetensors (BF16): dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B
GGUF (Mac / llama.cpp): dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF
FP8 (Hopper-native): dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-FP8
NVFP4 W4A16 (Blackwell B100/B200 + Hopper fallback): dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-NVFP4
AWQ-4bit (vLLM / HF Transformers INT4): dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-AWQ

Acknowledgements

The W4A4 recipe and GB10-specific naming follow saricles/MiniMax-M2.5-REAP-172B-A10B-NVFP4-GB10 — thanks to saricles for establishing this convention in the community.

Citation & License

See the safetensors repo. Core references: Lasby et al., REAP the Experts (arXiv:2510.13999); MiniMax AI, MiniMax-M2.7.

Inherits the Modified MIT License from MiniMaxAI/MiniMax-M2.7.

Published by m51Lab — open-source LLM contributions from the M51 AI OS group.

Paper for dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-NVFP4-GB10: REAP the Experts: Why Pruning Prevails for One-Shot MoE compression (arXiv:2510.13999, published Oct 15, 2025)