How to use from
Docker Model Runner
docker model run hf.co/olka-fi/Mistral-Medium-3.5-128B-MXFP4
Quick Links

Mistral-Medium-3.5-128B-MXFP4

Mixed MXFP4 + FP8 quantization of Mistral-Medium-3.5-128B — Mistral AI's flagship merged dense 128B model with 256k context window, instruct + reasoning + coding in unified weights, multimodal input (text + image), and 24+ language support.

MLP weights quantized to MXFP4. Self-attention weights kept at the source's per-tensor static FP8. Vision tower, multi-modal projector, embeddings, layer norms, and lm_head kept at original BF16.

Base (FP8) MXFP4 mixed
Size 133.6 GB 90.0 GB
Perplexity (WikiText-2) -- 2.349
Compression 1x 1.48x

Baseline FP8 perplexity not measured: the source model and quantized model could not be loaded alongside each other on the available hardware (1× GB10, 128 GB unified memory). PPL 2.35 is in the expected range for a 123B frontier model on WikiText-2 (Llama-3 70B ≈ 3.0; Mixtral 8×22B ≈ 3.0-3.5). Multi-step arithmetic and chain-of-reasoning prompts produce coherent, correct answers.

Format

Two-group compressed-tensors config:

MLP linears — MXFP4 block-32

  • weight_packed: uint8 [out, in//2] — two 4-bit E2M1 values packed per byte
  • weight_scale: uint8 e8m0 [out, in//32] — one shared exponent per block of 32 input channels

Self-attention linears — FP8 E4M3FN, per-tensor static

  • weight: float8_e4m3fn [out, in]
  • weight_scale: float32 (1,) — per-tensor weight scale
  • input_scale: float32 (1,) — per-tensor activation scale (static)

The source model uses per-tensor static FP8 (weight_block_size: null, activation_scheme: "static"), distinct from the more common DeepSeek/Qwen-style 128×128 block-scaled FP8.

Untouched (BF16 passthrough): model.vision_tower.*, model.multi_modal_projector.*, model.language_model.embed_tokens, lm_head, all *_layernorm, all ffn_norm.

Quantized with qstream.

Serving

vLLM

Requires vLLM with compressed-tensors MXFP4 + FP8 mixed-config support. Tested on vLLM 0.19.2.dev (April 2026).

GB10 / DGX Spark (Blackwell, 128 GB unified)

TORCH_CUDA_ARCH_LIST="12.1" \
VLLM_USE_PRECOMPILED=1 \
MAX_JOBS=3 \
PYTORCH_NO_CUDA_MEMORY_CACHING=1 \
vllm serve olka-fi/Mistral-Medium-3.5-128B-MXFP4 \
    --max-num-seqs 1 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.80 \
    --load-format safetensors \
    --enforce-eager \
    --kv-cache-dtype fp8 \
    --no-enable-prefix-caching

The env vars are GB10-specific:

  • TORCH_CUDA_ARCH_LIST="12.1" — target Blackwell sm_121 directly. PyTorch ≤ 2.11 ships sm_120 SASS only; without this, kernels PTX-fallback and JIT-compile at load time.
  • VLLM_USE_PRECOMPILED=1 — skip rebuilding vLLM's custom ops if the wheel already has them.
  • MAX_JOBS=3 — cap parallel nvcc invocations during first-time flashinfer kernel build (sm_121a is not yet pre-shipped). On unified memory each nvcc + cc1plus pair peaks ~5-6 GB; raise on machines with more headroom.
  • PYTORCH_NO_CUDA_MEMORY_CACHING=1 — release intermediate buffers immediately. On unified memory, host- and device-side allocations share the same pool, so caching has no benefit and increases peak.

--max-num-seqs 1 + --max-model-len 8192 reflect the practical concurrency/context budget after weights and overhead — power-of-2 lengths are friendlier to KV-block alignment. --enforce-eager skips CUDA-graph capture (saves a few GB at small throughput cost). --no-enable-prefix-caching is recommended for multimodal: prefix caching has historically interacted badly with image tokens in vLLM and can trigger KV-block index assertions on long image-conditioned prompts.

First load triggers 30 minutes of flashinfer kernel compilation into `/.cache/flashinfer/0.6.8.post1/121a/`. Subsequent loads are cache-hit and start in ~8 minutes (weight-load only).

Discrete-GPU systems (HBM, ≥96 GB)

vllm serve olka-fi/Mistral-Medium-3.5-128B-MXFP4 \
    --gpu-memory-utilization 0.85 \
    --load-format safetensors

Memory Budget

At 90 GB the model fits on a single GB10 / DGX Spark (128 GB unified LPDDR5X) with ~20 GB left for KV cache, system, and CUDA buffers. Per-token KV cost at BF16 is ~360 KB (88 layers × 8 KV heads × 128 dim × 2 × 2 bytes), so the practical context budget is roughly 50K tokens total across all concurrent sequences. With --kv-cache-dtype fp8 it doubles to ~100K. The source 133.6 GB FP8 model does not fit on this hardware — vLLM weight-load OOMs before reaching steady state.

Throughput (single-stream, GB10, BF16 KV)

Value
Decode (autoregressive) ~1.5 tokens/sec
Prefill ~170 tokens/sec
Theoretical decode ceiling ~3.4 tokens/sec (LPDDR5X 273 GB/s ÷ 80 GB weight read per token)

Decode is bandwidth-bound: ~44% of the LPDDR5X peak, typical for vLLM on memory-bound dense workloads. To go meaningfully faster on this model class you need HBM (H100 ≈ 12× the bandwidth).

Evaluation Details

Sliding-window perplexity on WikiText-2 test split, queried against the running vLLM server via /v1/completions with echo: true, logprobs: 1.

Setting Value
Window length 2048 tokens
Stride 512 tokens
Windows 5 (smoke)
Tokens scored 4,095
Mean NLL 0.854
Perplexity 2.349
Wall clock ~60 s

Acknowledgments

Based on Mistral-Medium-3.5-128B by Mistral AI. Original model license applies.

Downloads last month
514
Safetensors
Model size
128B params
Tensor type
F32
·
BF16
·
F8_E4M3
·
U8
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for olka-fi/Mistral-Medium-3.5-128B-MXFP4

Quantized
(21)
this model