Instructions to use olka-fi/Mistral-Medium-3.5-128B-MXFP4 with libraries, inference providers, notebooks, and local apps.

Libraries

How to use olka-fi/Mistral-Medium-3.5-128B-MXFP4 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

# The messages below include an image, so use the image-text-to-text task
pipe = pipeline("image-text-to-text", model="olka-fi/Mistral-Medium-3.5-128B-MXFP4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("olka-fi/Mistral-Medium-3.5-128B-MXFP4")
model = AutoModelForMultimodalLM.from_pretrained("olka-fi/Mistral-Medium-3.5-128B-MXFP4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use olka-fi/Mistral-Medium-3.5-128B-MXFP4 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "olka-fi/Mistral-Medium-3.5-128B-MXFP4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "olka-fi/Mistral-Medium-3.5-128B-MXFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/olka-fi/Mistral-Medium-3.5-128B-MXFP4

SGLang

How to use olka-fi/Mistral-Medium-3.5-128B-MXFP4 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "olka-fi/Mistral-Medium-3.5-128B-MXFP4" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "olka-fi/Mistral-Medium-3.5-128B-MXFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "olka-fi/Mistral-Medium-3.5-128B-MXFP4" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "olka-fi/Mistral-Medium-3.5-128B-MXFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'


Mistral-Medium-3.5-128B-MXFP4

Mixed MXFP4 + FP8 quantization of Mistral-Medium-3.5-128B — Mistral AI's flagship merged dense 128B model with 256k context window, instruct + reasoning + coding in unified weights, multimodal input (text + image), and 24+ language support.

MLP weights quantized to MXFP4. Self-attention weights kept at the source's per-tensor static FP8. Vision tower, multi-modal projector, embeddings, layer norms, and lm_head kept at original BF16.

| | Base (FP8) | MXFP4 mixed |
|---|---|---|
| Size | 133.6 GB | 90.0 GB |
| Perplexity (WikiText-2) | -- | 2.349 |
| Compression | 1x | 1.48x |

Baseline FP8 perplexity was not measured: the source model and the quantized model could not be loaded side by side on the available hardware (1× GB10, 128 GB unified memory). A perplexity of 2.35 is in the expected range for a 128B frontier model on WikiText-2 (Llama-3 70B ≈ 3.0; Mixtral 8×22B ≈ 3.0-3.5). Multi-step arithmetic and chain-of-reasoning prompts produce coherent, correct answers.

Format

Two-group compressed-tensors config:

MLP linears — MXFP4 block-32

weight_packed: uint8 [out, in//2] — two 4-bit E2M1 values packed per byte
weight_scale: uint8 e8m0 [out, in//32] — one shared exponent per block of 32 input channels
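
The layout above can be illustrated with a small pure-Python decoder. This is a minimal sketch, not the kernel vLLM or qstream uses: the function names are hypothetical, and the nibble order (low nibble holds the even-indexed element) is an assumption that should be checked against the actual checkpoint.

```python
# E2M1 magnitudes for the 3 low bits of a 4-bit code; bit 3 is the sign.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 32  # input channels that share one e8m0 scale


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit E2M1 code (1 sign, 2 exponent, 1 mantissa bit)."""
    magnitude = E2M1_MAGNITUDES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def decode_e8m0(scale_byte: int) -> float:
    """Decode an e8m0 scale: a pure power of two with exponent bias 127."""
    if scale_byte == 0xFF:
        return float("nan")  # reserved NaN encoding
    return 2.0 ** (scale_byte - 127)


def dequantize_row(packed: bytes, scales: bytes) -> list:
    """Dequantize one output row.

    `packed` holds in//2 bytes and `scales` holds in//32 bytes.
    Assumption: the low nibble of each byte is the even-indexed element.
    """
    values = []
    for byte in packed:
        for nibble in (byte & 0x0F, byte >> 4):
            scale = decode_e8m0(scales[len(values) // BLOCK_SIZE])
            values.append(decode_e2m1(nibble) * scale)
    return values
```

For example, a block of 16 bytes of `0x21` with a scale byte of 128 (that is, 2^1) decodes to alternating 1.0 and 2.0 under the nibble-order assumption above.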

Self-attention linears — FP8 E4M3FN, per-tensor static

weight: float8_e4m3fn [out, in]
weight_scale: float32 (1,) — per-tensor weight scale
input_scale: float32 (1,) — per-tensor activation scale (static)
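
To make the per-tensor scheme concrete, the sketch below decodes a single FP8 E4M3FN byte and applies the two scales. It is an illustration only: the function names are hypothetical, the final rounding of activations to FP8 is omitted, and the convention that activations are divided by `input_scale` before clamping is an assumption based on how static FP8 is commonly applied.

```python
import math

FP8_E4M3FN_MAX = 448.0  # largest finite E4M3FN magnitude


def decode_e4m3fn(byte: int) -> float:
    """Decode one E4M3FN byte (1 sign, 4 exponent, 3 mantissa bits, bias 7)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0xF
    mantissa = byte & 0x7
    if exponent == 0xF and mantissa == 0x7:
        return math.nan  # the only NaN pattern; the format has no infinities
    if exponent == 0:
        return sign * (mantissa / 8.0) * 2.0 ** -6  # subnormal
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


def dequantize_weight(byte: int, weight_scale: float) -> float:
    """Recover a weight value: one float32 scale is shared by the whole tensor."""
    return decode_e4m3fn(byte) * weight_scale


def scale_activation(x: float, input_scale: float) -> float:
    """Scale and clamp an activation to the FP8 range (rounding to FP8 omitted).

    Assumption: static quantization divides by the stored input_scale.
    """
    return max(-FP8_E4M3FN_MAX, min(FP8_E4M3FN_MAX, x / input_scale))
```

Because both scales are fixed per tensor, no per-block metadata is needed at inference time, which is the difference from 128×128 block-scaled FP8.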

The source model uses per-tensor static FP8 (weight_block_size: null, activation_scheme: "static"), distinct from the more common DeepSeek/Qwen-style 128×128 block-scaled FP8.

Untouched (BF16 passthrough): model.vision_tower.*, model.multi_modal_projector.*, model.language_model.embed_tokens, lm_head, all *_layernorm, all ffn_norm.

Quantized with qstream.

Serving

vLLM

Requires vLLM with compressed-tensors MXFP4 + FP8 mixed-config support. Tested on vLLM 0.19.2.dev (April 2026).

GB10 / DGX Spark (Blackwell, 128 GB unified)

TORCH_CUDA_ARCH_LIST="12.1" \
VLLM_USE_PRECOMPILED=1 \
MAX_JOBS=3 \
PYTORCH_NO_CUDA_MEMORY_CACHING=1 \
vllm serve olka-fi/Mistral-Medium-3.5-128B-MXFP4 \
    --max-num-seqs 1 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.80 \
    --load-format safetensors \
    --enforce-eager \
    --kv-cache-dtype fp8 \
    --no-enable-prefix-caching

The env vars are GB10-specific:

TORCH_CUDA_ARCH_LIST="12.1" — target Blackwell sm_121 directly. PyTorch ≤ 2.11 ships sm_120 SASS only; without this, kernels PTX-fallback and JIT-compile at load time.
VLLM_USE_PRECOMPILED=1 — skip rebuilding vLLM's custom ops if the wheel already has them.
MAX_JOBS=3 — cap parallel nvcc invocations during first-time flashinfer kernel build (sm_121a is not yet pre-shipped). On unified memory each nvcc + cc1plus pair peaks ~5-6 GB; raise on machines with more headroom.
PYTORCH_NO_CUDA_MEMORY_CACHING=1 — release intermediate buffers immediately. On unified memory, host- and device-side allocations share the same pool, so caching has no benefit and increases peak.

--max-num-seqs 1 + --max-model-len 8192 reflect the practical concurrency/context budget after weights and overhead — power-of-2 lengths are friendlier to KV-block alignment. --enforce-eager skips CUDA-graph capture (saves a few GB at small throughput cost). --no-enable-prefix-caching is recommended for multimodal: prefix caching has historically interacted badly with image tokens in vLLM and can trigger KV-block index assertions on long image-conditioned prompts.

First load triggers ~30 minutes of flashinfer kernel compilation into `~/.cache/flashinfer/0.6.8.post1/121a/`. Subsequent loads are cache-hit and start in ~8 minutes (weight-load only).

Discrete-GPU systems (HBM, ≥96 GB)

vllm serve olka-fi/Mistral-Medium-3.5-128B-MXFP4 \
    --gpu-memory-utilization 0.85 \
    --load-format safetensors
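
Once a server is running, it can be queried through the OpenAI-compatible chat endpoint. The following is a minimal standard-library sketch; the helper names, the default port 8000, and the `max_tokens` default are assumptions for illustration, and the `image_url` content part follows the OpenAI chat format that vLLM accepts for multimodal models.

```python
import json
import urllib.request
from typing import Optional

MODEL_ID = "olka-fi/Mistral-Medium-3.5-128B-MXFP4"


def build_chat_payload(prompt: str, image_url: Optional[str] = None,
                       max_tokens: int = 128) -> dict:
    """Build a chat-completions request body, optionally with one image."""
    content = []
    if image_url is not None:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    content.append({"type": "text", "text": prompt})
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
    }


def send_chat_request(payload: dict,
                      base_url: str = "http://localhost:8000") -> dict:
    """POST the payload to a running server and return the parsed JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A call such as `send_chat_request(build_chat_payload("What is the capital of France?"))` returns the server's JSON response; the generated text is normally found under `choices[0]["message"]["content"]`.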

Memory Budget

At 90 GB the model fits on a single GB10 / DGX Spark (128 GB unified LPDDR5X) with ~20 GB left for KV cache, system, and CUDA buffers. Per-token KV cost at BF16 is ~360 KB (88 layers × 8 KV heads × 128 dim × 2 × 2 bytes), so the practical context budget is roughly 50K tokens total across all concurrent sequences. With --kv-cache-dtype fp8 it doubles to ~100K. The source 133.6 GB FP8 model does not fit on this hardware — vLLM weight-load OOMs before reaching steady state.
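
The KV-cache arithmetic above can be reproduced in a few lines. This sketch uses the layer, head, and dimension counts quoted in the paragraph; the 20 GB budget is a simplifying assumption that treats the whole remainder as KV cache, so the real figure is somewhat lower once system and CUDA buffers take their share.

```python
NUM_LAYERS = 88
NUM_KV_HEADS = 8
HEAD_DIM = 128


def kv_bytes_per_token(bytes_per_element: int) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # The factor of 2 covers the separate key and value tensors.
    return NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * 2 * bytes_per_element


def max_cached_tokens(budget_bytes: float, bytes_per_element: int) -> int:
    """Upper bound on tokens that fit in the given KV-cache budget."""
    return int(budget_bytes // kv_bytes_per_token(bytes_per_element))


budget = 20e9  # assumption: the full ~20 GB remainder goes to KV cache

print(kv_bytes_per_token(2))         # BF16: 360448 bytes per token
print(max_cached_tokens(budget, 2))  # BF16 token budget
print(max_cached_tokens(budget, 1))  # FP8 token budget
```

Under that assumption the bound comes out at roughly 55K tokens for BF16 and 111K for FP8, consistent with the ~50K and ~100K practical figures once overhead is subtracted.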

Throughput (single-stream, GB10, BF16 KV)

| Metric | Value |
|---|---|
| Decode (autoregressive) | ~1.5 tokens/sec |
| Prefill | ~170 tokens/sec |
| Theoretical decode ceiling | ~3.4 tokens/sec (LPDDR5X 273 GB/s ÷ 80 GB weight read per token) |

Decode is bandwidth-bound: ~44% of the LPDDR5X peak, typical for vLLM on memory-bound dense workloads. To go meaningfully faster on this model class you need HBM (H100 ≈ 12× the bandwidth).
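
The ceiling and utilization figures follow from a simple bandwidth model in which every decoded token requires one full pass over the weights. A short sketch of that calculation, using the numbers quoted in this section (the function name is illustrative):

```python
def decode_ceiling_tokens_per_sec(bandwidth_gb_s: float,
                                  weight_read_gb: float) -> float:
    """Upper bound on decode speed when each token reads all weights once."""
    return bandwidth_gb_s / weight_read_gb


ceiling = decode_ceiling_tokens_per_sec(273.0, 80.0)  # LPDDR5X peak, GB read per token
utilization = 1.5 / ceiling                           # measured decode rate vs. ceiling

print(f"ceiling: {ceiling:.2f} tokens/sec")
print(f"utilization: {utilization:.0%}")
```

The same function gives a rough estimate for other hardware by substituting its memory bandwidth, although real throughput also depends on kernel efficiency and batch size.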

Evaluation Details

Sliding-window perplexity on WikiText-2 test split, queried against the running vLLM server via /v1/completions with echo: true, logprobs: 1.

| Setting | Value |
|---|---|
| Window length | 2048 tokens |
| Stride | 512 tokens |
| Windows | 5 (smoke) |
| Tokens scored | 4,095 |
| Mean NLL | 0.854 |
| Perplexity | 2.349 |
| Wall clock | ~60 s |
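
The relationship between the settings in the table can be sketched as follows. The scoring rule is an assumption, not taken from the evaluation script: the first window is assumed to score every token except its first (which has no preceding context), and each later window is assumed to score only its `stride` newest tokens. Function names are illustrative.

```python
import math


def scored_tokens_per_window(window: int, stride: int, num_windows: int) -> list:
    """Number of tokens that contribute to the loss in each window."""
    # Assumed rule: window 0 scores window-1 tokens, later windows score `stride`.
    return [window - 1 if i == 0 else stride for i in range(num_windows)]


def perplexity(token_logprobs: list) -> float:
    """Perplexity is exp of the mean negative log-likelihood per scored token."""
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


counts = scored_tokens_per_window(window=2048, stride=512, num_windows=5)
print(sum(counts))  # total scored tokens under the assumed rule
```

Under this rule the five windows score 4,095 tokens in total, which matches the table, and a mean NLL of 0.854 maps to a perplexity of about 2.349.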

Acknowledgments

Based on Mistral-Medium-3.5-128B by Mistral AI. Original model license applies.
