Possible future 32 GB single-GPU variant?

#31
by vanbukin - opened

Your model delivers phenomenal results for its size. If possible, I’d love to see a future version that fits on a single RTX 5090 (32 GB) together with the KV cache. An NVFP4 version would be especially valuable if that is feasible.

Thank you for the great work!

Why a future version and not this one? There are at least 24 NVFP4 quantizations of this model as of today.

To check: open the "Model Card" tab, click the "Quantizations 254 models" link, and filter by "NVFP4": 24 results.

Yes, there are already many NVFP4 community quantizations available, but they are extremely inconsistent in practice.

Some work well for images but not MTP, some support MTP but lose other capabilities, and some over-quantize critical layers enough that answer quality degrades noticeably. A few are genuinely good, but even those can become very tight on VRAM once you add KV cache and a usable context length.
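To illustrate why the budget gets tight, here is a rough back-of-the-envelope sketch. Every number in it is a hypothetical placeholder (parameter count, layer count, KV heads, head dimension, context length), not this model's actual configuration. It also assumes plain attention with a KV cache in every layer, an FP16 cache, and roughly 4.5 bits per weight for NVFP4 (4-bit values plus one 8-bit scale per 16-element block). Real checkpoints usually keep some layers in higher precision, so treat the result as a lower bound.

```python
GIB = 1024 ** 3


def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in bytes."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """KV cache size for one sequence.

    Two tensors (keys and values) per layer, each holding
    n_kv_heads * head_dim elements per token.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def total_gib(n_params: float, bits_per_weight: float, **kv_args) -> float:
    """Weights plus KV cache in GiB; ignores activations and runtime overhead."""
    return (weight_bytes(n_params, bits_per_weight) + kv_cache_bytes(**kv_args)) / GIB


if __name__ == "__main__":
    # Hypothetical configuration, for illustration only.
    used = total_gib(
        n_params=30e9, bits_per_weight=4.5,
        n_layers=48, n_kv_heads=8, head_dim=128, seq_len=32768,
    )
    print(f"weights + KV cache: {used:.1f} GiB of a 32 GiB card")
```

Activations, CUDA context, and framework overhead come on top of this, and the cache grows linearly with context length and with the number of concurrent sequences, so the headroom left on a 32 GB card shrinks quickly.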

That’s why I mentioned a future official variant. I think a quantization prepared or validated by the Qwen team themselves would likely deliver a much better balance of quality, compatibility, and memory efficiency.

"future official variant."

Got it, thank you for the explanation.
