How to use from vLLM
Install from pip and serve model
# Install vLLM from pip (this model needs a nightly build; see the note further down):
pip install vllm
# Start the vLLM server:
vllm serve "Sehyo/Qwen3.5-122B-A10B-NVFP4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Sehyo/Qwen3.5-122B-A10B-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
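The same request can also be sent from Python. The following is a minimal sketch using only the standard library; it assumes the server started by `vllm serve` is listening on `localhost:8000` as in the curl call above. The helper names `build_chat_request` and `describe_image` are made up for this example and are not part of vLLM.

```python
import json
import urllib.request

# Assumed: the server from `vllm serve` is reachable at this address.
BASE_URL = "http://localhost:8000/v1"
MODEL = "Sehyo/Qwen3.5-122B-A10B-NVFP4"


def build_chat_request(prompt, image_url):
    """Build the POST request for one text + image chat message (hypothetical helper)."""
    payload = {
        "model": MODEL,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def describe_image(prompt, image_url):
    """Send the request and return the assistant's reply text (hypothetical helper)."""
    with urllib.request.urlopen(build_chat_request(prompt, image_url)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example call (requires a running server):
# print(describe_image("Describe this image in one sentence.",
#                      "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"))
```

Building the request separately from sending it makes it possible to inspect the payload without a running server.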
Use Docker
docker model run hf.co/Sehyo/Qwen3.5-122B-A10B-NVFP4

Qwen3.5-122B-A10B-NVFP4

This is a quantized version of Qwen/Qwen3.5-122B-A10B using the NVFP4 quantization scheme.

Support for this model currently requires a nightly build of vLLM.

Changelog

  • 02/03/2026: Added MTP (multi-token prediction) weights from source checkpoint, enabling speculative decoding with vLLM.
  • 25/02/2026: Initial upload.
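As a rough sketch of how the MTP weights might be used, the snippet below assembles a `vllm serve` command line that passes a speculative decoding configuration through vLLM's `--speculative-config` JSON option. The method name `qwen3_next_mtp` and the draft length of 2 are assumptions, not values stated in this model card; check the speculative decoding documentation of the vLLM build you install for the method name it expects. `build_serve_command` is a hypothetical helper written for this example.

```python
import json
import shlex

MODEL = "Sehyo/Qwen3.5-122B-A10B-NVFP4"


def build_serve_command(method, num_speculative_tokens, model=MODEL):
    """Return the argv list for `vllm serve` with a speculative decoding config."""
    spec = {"method": method, "num_speculative_tokens": num_speculative_tokens}
    return ["vllm", "serve", model, "--speculative-config", json.dumps(spec)]


# ASSUMPTION: "qwen3_next_mtp" and 2 are illustrative values only.
command = build_serve_command("qwen3_next_mtp", 2)

# Print a shell-safe version of the command for copy and paste.
print(shlex.join(command))
```

Keeping the method name as a parameter makes it easy to swap in whatever value the installed vLLM version documents.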

Calibration

Creation

This model was created using vLLM's LLM Compressor, with Qwen3.5 MoE support added via PR #2383. The PR adds a custom CalibrationQwen3MoeSparseMoeBlock that routes calibration data to all experts during quantization, so that every expert receives calibration data and its NVFP4 quantization parameters are based on real activations.
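The idea of routing calibration data to all experts can be illustrated with a toy example. This is a conceptual sketch in plain Python, not the code from the PR: `CountingExpert`, `moe_forward`, and the `calibrate_all_experts` flag are hypothetical names, and it assumes the calibration block runs every expert on every token while still combining only the router's top-k outputs.

```python
import math


class CountingExpert:
    """Toy expert: scales its input and counts how many tokens it has seen."""

    def __init__(self, scale):
        self.scale = scale
        self.tokens_seen = 0

    def __call__(self, token):
        self.tokens_seen += 1
        return [self.scale * x for x in token]


def moe_forward(token, router_logits, experts, top_k, calibrate_all_experts=False):
    """Top-k MoE forward pass for one token, optionally feeding every expert."""
    # Softmax over the router logits.
    peak = max(router_logits)
    exps = [math.exp(logit - peak) for logit in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the top_k experts and renormalize their weights.
    selected = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in selected)

    output = [0.0] * len(token)
    for i, expert in enumerate(experts):
        if i in selected:
            weight = probs[i] / norm
            output = [o + weight * y for o, y in zip(output, expert(token))]
        elif calibrate_all_experts:
            # Run the expert only so it observes the token; discard the result.
            expert(token)
    return output


experts = [CountingExpert(s) for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 1.5]
token = [1.0, -1.0]

normal_out = moe_forward(token, logits, experts, top_k=2)
seen_normal = [e.tokens_seen for e in experts]

calib_out = moe_forward(token, logits, experts, top_k=2, calibrate_all_experts=True)
seen_calib = [e.tokens_seen for e in experts]

print(seen_normal, seen_calib, normal_out == calib_out)
```

With ordinary top-k routing, rarely selected experts may see few or no calibration tokens, which leaves their quantization scales poorly estimated. In the sketch, the layer output is unchanged in calibration mode; only the set of experts that observe each token grows.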

Safetensors
Model size: 71B params
Tensor types: F32 · BF16 · F8_E4M3 · U8