Compatible version with Ampere? SM8.6

#6
by bullerwins - opened

Hi!
Would it be possible to get a version that doesn't have the fp8 block so it's also compatible with Ampere?
Maybe using W4A16/W8A16 GPTQ for the attention paths too?
And/or a full W8A16 model that would still have full 8-bit precision but can load on Ampere cards?

Canada Quant Labs org

Disclosure: this comment was generated with AI assistance.

Hey @bullerwins — sorry for the delay on this. Short answer: this specific artifact won't load on Ampere, and a no-FP8 sibling is non-trivial to produce. Long answer below.

Why this artifact is Hopper/Blackwell-only

The attention path uses FP8 quantization with 128×128 blocks (re:.*attn\\.(wq_a|wq_b|wkv|wo_a|wo_b|fused_wqa_wkv|q_a_proj|q_b_proj|kv_proj|o_a_proj|o_b_proj)$ per the quantization_config.config_groups). Ampere (SM 8.6) has no FP8 hardware path: neither cuBLAS FP8 GEMM nor the Marlin FP8 fast path is available on SM 8.6, so vLLM's compressed-tensors loader will refuse to build the attention modules.
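
As a rough illustration of how that target pattern selects modules, here is a minimal Python sketch. It assumes the re: prefix marks the rest of the string as a regular expression matched from the start of the module name, and that the doubled backslash in the config is JSON escaping for a single one. The module names are hypothetical, made up for the example rather than read from the checkpoint.

```python
import re

# Target pattern from the post, with the JSON-escaped "\\." read as "\.".
ATTN_TARGET = (
    r"re:.*attn\.(wq_a|wq_b|wkv|wo_a|wo_b|fused_wqa_wkv|"
    r"q_a_proj|q_b_proj|kv_proj|o_a_proj|o_b_proj)$"
)


def is_fp8_attention_module(name: str, target: str = ATTN_TARGET) -> bool:
    """Return True if a module name falls under the FP8 attention target.

    Assumption: targets starting with "re:" are regexes anchored at the
    start of the name; anything else is treated as an exact name.
    """
    if target.startswith("re:"):
        return re.match(target[len("re:"):], name) is not None
    return name == target


# Hypothetical module names, for illustration only.
candidates = [
    "model.layers.0.attn.wq_a",
    "model.layers.0.self_attn.q_a_proj",
    "model.layers.0.ffn.experts.3.w1",
    "model.layers.0.attn.wq_a.weight",
]

for name in candidates:
    print(f"{name}: {is_fp8_attention_module(name)}")
```

Under these assumptions, every name the pattern matches falls under the FP8 block scheme, while expert and parameter-level names fall outside it.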

What an Ampere-compatible variant would actually require

A re-quantization pass — not just a config edit. Options, roughly in increasing effort:

  1. W4A16 routed experts + BF16 attention. Drop the FP8 attention scheme and leave attention in BF16. A larger sibling: attention is dense, and storing it in BF16 gives up the size savings FP8 provided. Cheapest to produce (just re-run the calibration with attention in the ignore list); decode speed will drop on Hopper too, so this is an Ampere-specific build.
  2. W4A16 routed experts + W8A16 INT8 attention (GPTQ). Closer to your suggestion. Needs a fresh GPTQ pass over the attention projections: a couple of hours of calibration on H200/B300 plus a verification run. Quality should be near-identical to FP8 (INT8 weights with FP16 activations are well-tuned upstream).
  3. Full W8A16 INT8 model. Largest disk but loads everywhere. Requires re-running the whole compressor pipeline against a W8A16 recipe; we'd want to validate that GSM8K/HumanEval don't regress vs the FP8 baseline.
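
To get a feel for how the attention precision changes the footprint across these options, a back-of-the-envelope sketch may help. It counts only weight bits and ignores scales, zero points, and any unquantized tensors. The parameter count is a made-up placeholder, not a figure for this model.

```python
# Bits stored per weight under each scheme (quantization metadata ignored).
BITS_PER_WEIGHT = {"bf16": 16, "fp8": 8, "int8": 8, "int4": 4}


def weight_gib(num_params: int, scheme: str) -> float:
    """Approximate weight storage in GiB for num_params weights."""
    return num_params * BITS_PER_WEIGHT[scheme] / 8 / 2**30


# Hypothetical attention parameter count, chosen only to keep the numbers round.
ATTN_PARAMS = 4 * 2**30

for scheme in ("fp8", "int8", "bf16"):
    print(f"attention as {scheme}: {weight_gib(ATTN_PARAMS, scheme):.1f} GiB")
```

Under this approximation, INT8 attention (option 2) takes the same space as FP8, while BF16 attention (option 1) takes twice as much.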

Honest cost estimate: each of these is 1–2 days of calibration + bench work on our side, then a fresh artifact upload. Not a config patch.

Near-term suggestions

  • If you can wait, (2) above is the most likely thing we'd publish — it's a useful Ampere/MI-series sibling and the calibration recipe is short. No ETA yet.
  • Today, on Ampere, the closest working options are:
    • deepseek-ai/DeepSeek-V4-Flash-Base (BF16, full size, fits on ≥4× A100 80GB or ≥8× A100 40GB)
    • Other community W8A8 / GGUF / GPTQ-only quants if any have been published (check the DeepSeek-V4-Flash model page) — we haven't audited those personally.

If your use-case is large enough that an Ampere W4A16+INT8-attn artifact would be useful for others too, drop a +1 here and we'll prioritize it. Otherwise the FP8-attn requirement is structural to this particular recipe and not something we can paper over with metadata.
