Mistral-Medium-3.5-128B-MXFP4
Mixed MXFP4 + FP8 quantization of Mistral-Medium-3.5-128B — Mistral AI's flagship merged dense 128B model with 256k context window, instruct + reasoning + coding in unified weights, multimodal input (text + image), and 24+ language support.
MLP weights quantized to MXFP4. Self-attention weights kept at the source's per-tensor static FP8. Vision tower, multi-modal projector, embeddings, layer norms, and lm_head kept at original BF16.
| | Base (FP8) | MXFP4 mixed |
|---|---|---|
| Size | 133.6 GB | 90.0 GB |
| Perplexity (WikiText-2) | -- | 2.349 |
| Compression | 1x | 1.48x |
Baseline FP8 perplexity was not measured: the source model and the quantized model could not be loaded side by side on the available hardware (1× GB10, 128 GB unified memory). PPL 2.35 is in the expected range for a 128B frontier model on WikiText-2 (Llama-3 70B ≈ 3.0; Mixtral 8×22B ≈ 3.0-3.5). Multi-step arithmetic and chain-of-reasoning prompts produce coherent, correct answers.
Format
Two-group compressed-tensors config:
MLP linears — MXFP4 block-32
- `weight_packed`: `uint8[out, in//2]`, two 4-bit E2M1 values packed per byte
- `weight_scale`: `uint8 e8m0[out, in//32]`, one shared exponent per block of 32 input channels
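As an illustration of the MXFP4 block-32 layout listed above, here is a minimal pure-Python sketch that dequantizes one packed row. The E2M1 value table and the E8M0 bias of 127 follow the OCP microscaling format. The nibble order (low nibble holds the even-indexed element) and the helper names `decode_e2m1`, `decode_e8m0`, and `dequantize_row` are assumptions made for this example, not taken from the qstream or compressed-tensors code.

```python
# E2M1 magnitudes indexed by the low 3 bits (2 exponent bits, 1 mantissa bit).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_e2m1(nibble: int) -> float:
    """Decode one 4-bit E2M1 value: bit 3 is the sign, bits 0-2 the magnitude."""
    sign = -1.0 if nibble & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[nibble & 0x7]


def decode_e8m0(byte: int) -> float:
    """Decode an E8M0 shared exponent: a pure power of two with bias 127.

    The NaN encoding (0xFF) is not handled in this sketch.
    """
    return 2.0 ** (byte - 127)


def dequantize_row(packed: bytes, scales: bytes, block: int = 32) -> list[float]:
    """Expand one row of weight_packed using its per-block weight_scale bytes."""
    out = []
    for i, byte in enumerate(packed):
        # Assumption: low nibble is element 2*i, high nibble is element 2*i + 1.
        low, high = byte & 0x0F, byte >> 4
        scale = decode_e8m0(scales[(2 * i) // block])
        out.append(decode_e2m1(low) * scale)
        out.append(decode_e2m1(high) * scale)
    return out


# One block of 32 weights: 16 packed bytes plus a single scale byte.
row = dequantize_row(bytes([0x21] * 16), bytes([128]))
print(row[:4])
```

Each block of 32 weights costs 16 bytes of payload plus 1 scale byte, which is where the roughly 4.25 bits per MLP weight comes from.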
Self-attention linears — FP8 E4M3FN, per-tensor static
- `weight`: `float8_e4m3fn[out, in]`
- `weight_scale`: `float32(1,)`, per-tensor weight scale
- `input_scale`: `float32(1,)`, per-tensor activation scale (static)
The source model uses per-tensor static FP8 (weight_block_size: null, activation_scheme: "static"), distinct from the more common DeepSeek/Qwen-style 128×128 block-scaled FP8.
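To make the per-tensor scheme concrete, the sketch below decodes raw FP8 E4M3FN bytes and applies a single scale to the whole tensor. The bit layout (1 sign, 4 exponent, 3 mantissa bits, bias 7, no infinities, maximum 448) is the standard E4M3FN definition. The helper names and the example scale value are hypothetical.

```python
import math

FP8_E4M3FN_MAX = 448.0  # largest finite E4M3FN magnitude


def decode_e4m3fn(byte: int) -> float:
    """Decode one FP8 E4M3FN byte into a Python float."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0x0F
    mantissa = byte & 0x07
    if exponent == 0x0F and mantissa == 0x07:
        return math.nan  # the only NaN pattern; the format has no infinities
    if exponent == 0:
        return sign * (mantissa / 8.0) * 2.0 ** -6  # subnormal range
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


def dequantize_weight(raw: bytes, weight_scale: float) -> list[float]:
    """Per-tensor dequantization: every element shares the same scale."""
    return [decode_e4m3fn(b) * weight_scale for b in raw]


# Hypothetical scale of 0.5 applied to the encodings of 1.0 and 2.0.
print(dequantize_weight(bytes([0x38, 0x40]), 0.5))
```

A block-scaled checkpoint would instead look up a separate scale for each 128×128 tile, so the same decode step would be followed by a tile-indexed multiply rather than a single scalar one.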
Untouched (BF16 passthrough): model.vision_tower.*, model.multi_modal_projector.*, model.language_model.embed_tokens, lm_head, all *_layernorm, all ffn_norm.
Quantized with qstream.
Serving
vLLM
Requires vLLM with compressed-tensors MXFP4 + FP8 mixed-config support. Tested on vLLM 0.19.2.dev (April 2026).
GB10 / DGX Spark (Blackwell, 128 GB unified)
TORCH_CUDA_ARCH_LIST="12.1" \
VLLM_USE_PRECOMPILED=1 \
MAX_JOBS=3 \
PYTORCH_NO_CUDA_MEMORY_CACHING=1 \
vllm serve olka-fi/Mistral-Medium-3.5-128B-MXFP4 \
--max-num-seqs 1 \
--max-model-len 8192 \
--gpu-memory-utilization 0.80 \
--load-format safetensors \
--enforce-eager \
--kv-cache-dtype fp8 \
--no-enable-prefix-caching
The env vars are GB10-specific:
- `TORCH_CUDA_ARCH_LIST="12.1"`: targets Blackwell sm_121 directly. PyTorch ≤ 2.11 ships sm_120 SASS only; without this, kernels fall back to PTX and are JIT-compiled at load time.
- `VLLM_USE_PRECOMPILED=1`: skips rebuilding vLLM's custom ops if the wheel already has them.
- `MAX_JOBS=3`: caps parallel nvcc invocations during the first-time flashinfer kernel build (sm_121a is not yet pre-shipped). On unified memory each nvcc + cc1plus pair peaks at ~5-6 GB; raise it on machines with more headroom.
- `PYTORCH_NO_CUDA_MEMORY_CACHING=1`: releases intermediate buffers immediately. On unified memory, host- and device-side allocations share the same pool, so caching has no benefit and increases peak usage.
--max-num-seqs 1 + --max-model-len 8192 reflect the practical concurrency/context budget after weights and overhead — power-of-2 lengths are friendlier to KV-block alignment. --enforce-eager skips CUDA-graph capture (saves a few GB at small throughput cost). --no-enable-prefix-caching is recommended for multimodal: prefix caching has historically interacted badly with image tokens in vLLM and can trigger KV-block index assertions on long image-conditioned prompts.
The first load triggers roughly 30 minutes of flashinfer kernel compilation into `/.cache/flashinfer/0.6.8.post1/121a/`. Subsequent loads hit that cache and start in ~8 minutes (weight load only).
Discrete-GPU systems (HBM, ≥96 GB)
vllm serve olka-fi/Mistral-Medium-3.5-128B-MXFP4 \
--gpu-memory-utilization 0.85 \
--load-format safetensors
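Once either server is up, it can be queried through vLLM's OpenAI-compatible chat endpoint. The following is a minimal standard-library sketch, assuming the server listens on vLLM's default `http://localhost:8000`; `build_chat_request` and `ask` are hypothetical helper names, and `max_tokens=256` is an arbitrary choice.

```python
import json
import urllib.request

MODEL = "olka-fi/Mistral-Medium-3.5-128B-MXFP4"
BASE_URL = "http://localhost:8000"  # assumption: vLLM default host and port


def build_chat_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the prompt to a running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With a server running, `print(ask("What is the capital of France?"))` should print the model's reply. On GB10, expect the answer to stream out slowly given the decode rate reported below.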
Memory Budget
At 90 GB the model fits on a single GB10 / DGX Spark (128 GB unified LPDDR5X) with ~20 GB left for KV cache, system, and CUDA buffers. Per-token KV cost at BF16 is ~360 KB (88 layers × 8 KV heads × 128 dim × 2 × 2 bytes), so the practical context budget is roughly 50K tokens total across all concurrent sequences. With --kv-cache-dtype fp8 it doubles to ~100K. The source 133.6 GB FP8 model does not fit on this hardware — vLLM weight-load OOMs before reaching steady state.
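The arithmetic behind these figures can be reproduced in a few lines. This sketch takes the layer count, KV-head count, and head dimension from the paragraph above and treats the full ~20 GB as available to the KV cache, which is an optimistic assumption since that headroom is shared with the system and CUDA buffers.

```python
def kv_bytes_per_token(layers: int = 88, kv_heads: int = 8,
                       head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache per token: one K and one V vector per layer and KV head."""
    return layers * kv_heads * head_dim * 2 * bytes_per_elem


def token_budget(free_bytes: int, bytes_per_token: int) -> int:
    """Upper bound on cached tokens across all concurrent sequences."""
    return free_bytes // bytes_per_token


FREE_BYTES = 20 * 10**9  # assumption: all ~20 GB of headroom goes to KV cache

bf16 = kv_bytes_per_token(bytes_per_elem=2)
fp8 = kv_bytes_per_token(bytes_per_elem=1)
print(f"BF16: {bf16} B/token, budget {token_budget(FREE_BYTES, bf16)} tokens")
print(f"FP8:  {fp8} B/token, budget {token_budget(FREE_BYTES, fp8)} tokens")
```

The computed budgets come out slightly above the rounded 50K and 100K figures, which leaves room for the non-KV consumers of that headroom.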
Throughput (single-stream, GB10, BF16 KV)
| Metric | Value |
|---|---|
| Decode (autoregressive) | ~1.5 tokens/sec |
| Prefill | ~170 tokens/sec |
| Theoretical decode ceiling | ~3.4 tokens/sec (LPDDR5X 273 GB/s ÷ 80 GB weight read per token) |
Decode is bandwidth-bound: ~44% of the LPDDR5X peak, typical for vLLM on memory-bound dense workloads. To go meaningfully faster on this model class you need HBM (H100 ≈ 12× the bandwidth).
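The ceiling and the efficiency figure follow from a simple roofline estimate: every generated token must stream the weights once, so bandwidth divided by bytes read bounds the token rate. A short sketch using the numbers from the table (the function name is illustrative):

```python
def decode_ceiling(bandwidth_gb_s: float, weight_gb_per_token: float) -> float:
    """Bandwidth-bound upper limit on single-stream decode, in tokens/sec."""
    return bandwidth_gb_s / weight_gb_per_token


ceiling = decode_ceiling(bandwidth_gb_s=273.0, weight_gb_per_token=80.0)
measured = 1.5  # tokens/sec, from the throughput table
efficiency = measured / ceiling

print(f"ceiling: {ceiling:.2f} tokens/sec, achieved: {efficiency:.0%} of peak")
```

The same function gives a rough expectation for other hardware: plug in that device's memory bandwidth and keep the 80 GB per-token read.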
Evaluation Details
Sliding-window perplexity on WikiText-2 test split, queried against the running vLLM server via /v1/completions with echo: true, logprobs: 1.
| Setting | Value |
|---|---|
| Window length | 2048 tokens |
| Stride | 512 tokens |
| Windows | 5 (smoke) |
| Tokens scored | 4,095 |
| Mean NLL | 0.854 |
| Perplexity | 2.349 |
| Wall clock | ~60 s |
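The token count in the table is consistent with a common sliding-window scheme in which each window scores only the tokens not already scored by an earlier window, and the very first token has no context and therefore no log-probability. The sketch below is one plausible reconstruction under that assumption, not the exact evaluation script; `window_plan` and `perplexity` are hypothetical names.

```python
import math


def window_plan(n_tokens: int, window: int = 2048, stride: int = 512,
                max_windows=None) -> list[tuple[int, int, int]]:
    """Return (start, end, first_scored) token indices for each window.

    Tokens in [first_scored, end) are scored; earlier tokens are context only.
    """
    plans = []
    start, prev_end = 0, 0
    while start < n_tokens and prev_end < n_tokens:
        end = min(start + window, n_tokens)
        # Skip tokens already scored, and the window's first token (no context).
        first_scored = max(prev_end, start + 1)
        plans.append((start, end, first_scored))
        prev_end = end
        if max_windows is not None and len(plans) == max_windows:
            break
        start += stride
    return plans


def perplexity(total_nll: float, n_scored: int) -> float:
    """Perplexity is the exponential of the mean negative log-likelihood."""
    return math.exp(total_nll / n_scored)


plans = window_plan(n_tokens=300_000, max_windows=5)  # n_tokens is a placeholder
scored = sum(end - first for _, end, first in plans)
print(scored, round(perplexity(0.854 * scored, scored), 3))
```

In a real run, each window's token log-probabilities would come from the `/v1/completions` response with `echo: true`, and only the slice starting at `first_scored` would be added to the running NLL.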
Acknowledgments
Based on Mistral-Medium-3.5-128B by Mistral AI. Original model license applies.