Compatible version with Ampere? SM8.6

by bullerwins - opened 20 days ago

Hi!
Would it be possible to get a version that doesn't have the fp8 block so it's also compatible with Ampere?
Maybe using W4A16/W8A16 GPTQ for the attention paths too?
And/or a full W8A16 model that would still have full 8-bit precision but can load on Ampere cards?

pastapaul

Canada Quant Labs org 7 days ago

Disclosure: this comment was generated with AI assistance.

Hey @bullerwins — sorry for the delay on this. Short answer: this specific artifact won't load on Ampere, and a no-FP8 sibling is non-trivial to produce. Long answer below.
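For anyone unsure which side of the line their card falls on, here is a rough sketch of the check. The helper name and the (9, 0) floor are assumptions drawn from the "Hopper/Blackwell-only" wording in this thread, not a published requirement of the artifact or of any loader. With PyTorch installed, the tuple would normally come from `torch.cuda.get_device_capability()`; it is hard-coded here so the snippet runs without a GPU.

```python
# Hypothetical helper: the (9, 0) floor is an assumption taken from the
# "Hopper/Blackwell-only" statement in this thread, not an official spec.
FP8_MIN_CAPABILITY = (9, 0)


def can_load_fp8_artifact(capability, minimum=FP8_MIN_CAPABILITY):
    """Return True if a (major, minor) compute capability meets the floor."""
    return tuple(capability) >= tuple(minimum)


# Compute capabilities of a few well-known cards, hard-coded for illustration.
cards = {
    "A100 (SM 8.0)": (8, 0),
    "RTX 3090 (SM 8.6)": (8, 6),
    "H100 (SM 9.0)": (9, 0),
}

for name, capability in cards.items():
    verdict = "ok" if can_load_fp8_artifact(capability) else "needs a non-FP8 build"
    print(f"{name}: {verdict}")
```

Tuple comparison is lexicographic, so (8, 6) sorts below (9, 0) without any extra parsing.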

Why this artifact is Hopper/Blackwell-only

What an Ampere-compatible variant would actually require

A re-quantization pass — not just a config edit. Options, roughly in increasing effort:

1. W4A16 routed experts + BF16 attention. Drop the FP8 attention scheme and leave attention in BF16. Larger than the FP8 build: attention is dense, so keeping it in BF16 gives back much of the FP8 savings. Cheapest to produce (just re-run the calibration with attention in the ignore list); decode speed would drop on Hopper too, so this would be an Ampere-specific build.
2. W4A16 routed experts + W8A16 INT8 attention (GPTQ). Closer to your suggestion. Needs a fresh GPTQ pass over the attention projections: a couple of hours of calibration on H200/B300 plus a verification run. Quality should be near-identical to FP8 (INT8 weights with FP16 activations is well-tuned upstream).
3. Full W8A16 INT8 model. Largest on disk, but loads everywhere. Requires re-running the whole compressor pipeline against a W8A16 recipe; we'd want to validate that GSM8K/HumanEval don't regress vs the FP8 baseline.
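To make the size trade-offs between these three options concrete, here is a back-of-the-envelope sketch. The parameter counts are placeholders chosen only for illustration (they are not this model's real numbers), the "current" row assumes the existing artifact is INT4 experts plus FP8 attention as this thread implies, and the arithmetic ignores quantization scales, zero-points, and any layers left unquantized.

```python
# Bytes per stored weight; scales, zero-points and unquantized layers are ignored.
BYTES_PER_WEIGHT = {"BF16": 2.0, "FP8": 1.0, "INT8": 1.0, "INT4": 0.5}

# Placeholder parameter counts for illustration only (NOT the real model's).
EXPERT_PARAMS = 100e9
ATTN_PARAMS = 10e9


def weight_gib(expert_fmt, attn_fmt,
               expert_params=EXPERT_PARAMS, attn_params=ATTN_PARAMS):
    """Approximate weight footprint in GiB for an expert/attention format pair."""
    total_bytes = (expert_params * BYTES_PER_WEIGHT[expert_fmt]
                   + attn_params * BYTES_PER_WEIGHT[attn_fmt])
    return total_bytes / 2**30


variants = {
    "current (INT4 experts + FP8 attention)": ("INT4", "FP8"),
    "option 1 (INT4 experts + BF16 attention)": ("INT4", "BF16"),
    "option 2 (INT4 experts + INT8 attention)": ("INT4", "INT8"),
    "option 3 (INT8 everywhere)": ("INT8", "INT8"),
}

for name, (expert_fmt, attn_fmt) in variants.items():
    print(f"{name}: ~{weight_gib(expert_fmt, attn_fmt):.1f} GiB")
```

Under these assumptions, option 2 lands at the same weight size as the FP8-attention build because INT8 and FP8 both store one byte per weight, option 1 pays double for the attention weights, and option 3 roughly doubles the expert footprint.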

Honest cost estimate: each of these is 1–2 days of calibration + bench work on our side, then a fresh artifact upload. Not a config patch.

Near-term suggestions

If you can wait, (2) above is the most likely thing we'd publish — it's a useful Ampere/MI-series sibling and the calibration recipe is short. No ETA yet.
Today, on Ampere, the closest working options are:
- deepseek-ai/DeepSeek-V4-Flash-Base (BF16, full size, fits on ≥4× A100 80GB or ≥8× A100 40GB)
- Other community W8A8 / GGUF / GPTQ-only quants if any have been published (check the DeepSeek-V4-Flash model page) — we haven't audited those personally.

If your use-case is large enough that an Ampere W4A16+INT8-attn artifact would be useful for others too, drop a +1 here and we'll prioritize it. Otherwise the FP8-attn requirement is structural to this particular recipe and not something we can paper over with metadata.
