smoothquant: fix auto-alpha KeyError when quantizing convs

Running with --op-types MatMul,Conv (default --alpha auto) crashed in the
neural-compressor smoother:

_reshape_scale_for_input -> KeyError '/pre_encode/conv/conv.0/Conv'

Convs can be statically quantized but cannot be SMOOTHED by this library: the
existing layout shim always returns None for a conv weight ((out,in,kh,kw) has
no per-input-channel max matching the activation), so the auto-alpha search
iterates the conv, finds no scale in tensor_scales_info, and raises KeyError.
With a fixed --alpha the auto-tune path is skipped, so the crash never
surfaced there.
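
The failure mode can be reproduced in miniature. The sketch below is not
neural-compressor code: the table layout, the reshape_scale and tune helpers,
and the MatMul node name are hypothetical stand-ins that only mimic a scale
lookup keyed by node name.

```python
# Hypothetical stand-in for the smoother's per-node scale table. The layout
# shim produced a scale for the MatMul but returned None for the conv, so the
# conv never got an entry. The MatMul node name is invented for illustration.
tensor_scales_info = {"/layers.0/linear/MatMul": [1.0, 0.5, 2.0]}

nodes = [
    ("/layers.0/linear/MatMul", "MatMul"),
    ("/pre_encode/conv/conv.0/Conv", "Conv"),
]


def reshape_scale(node_name):
    # Plain dict lookup: raises KeyError for any node without a scale.
    return tensor_scales_info[node_name]


def tune(nodes, smooth_op_types):
    # Visit only the nodes whose op type is in the smoother's op set.
    return {name: reshape_scale(name)
            for name, op in nodes if op in smooth_op_types}


# Smoother op set includes Conv: the lookup fails on the conv node.
try:
    tune(nodes, {"MatMul", "Conv"})
    failed_key = None
except KeyError as exc:
    failed_key = exc.args[0]

# Smoother op set excludes Conv: the conv is never looked up.
smoothed = tune(nodes, {"MatMul"})
print(failed_key, list(smoothed))
```

Keeping Conv out of the set the smoother iterates avoids the lookup entirely,
which is the approach this commit takes.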

Decouple the smoother's op set from the quantizer's: smooth only the
matmul-family ops (smooth_op_types), but still hand Conv to the static
quantizer. This is wired both via extra_options['SmoothQuantOpTypes'] and,
because transform() drops that knob just as it drops alpha, via the existing
_SMOOTH_OVERRIDE shim ('op_types'). --op-types MatMul,Conv now runs under
auto-alpha.
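
As a standalone sketch of the split described above: the parsing and filtering
expressions follow the diff, while the split_op_types wrapper and the
NON_SMOOTHABLE constant are hypothetical names introduced only for this
example.

```python
# Op types excluded from smoothing in the diff; the constant name is an
# assumption made for this sketch.
NON_SMOOTHABLE = ("Conv", "FusedConv")


def split_op_types(op_types_arg):
    """Split a comma-separated --op-types value into the three sets used.

    Returns (all requested ops, ops to smooth, ops that are quantize-only).
    """
    op_types = [t.strip() for t in op_types_arg.split(",") if t.strip()]
    smooth_op_types = [t for t in op_types if t not in NON_SMOOTHABLE]
    quant_only = [t for t in op_types if t not in smooth_op_types]
    return op_types, smooth_op_types, quant_only


op_types, smooth_op_types, quant_only = split_op_types("MatMul, Conv")
print(op_types, smooth_op_types, quant_only)
```

The static quantizer still receives the full op_types list; only the smoother
sees the filtered one.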

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Files changed (1)

scripts/quantize-int8-smoothquant.py +19 -1

scripts/quantize-int8-smoothquant.py

@@ -468,6 +468,14 @@ def main():
              "entropy": CalibrationMethod.Entropy,
              "percentile": CalibrationMethod.Percentile}[args.calibrate_method]
     op_types = [t.strip() for t in args.op_types.split(",") if t.strip()]
+    # Convs can be statically QUANTIZED but never SMOOTHED by this library: the
+    # layout shim above always returns None for a conv (its (out,in,kh,kw) weight
+    # has no per-input-channel max that matches the activation), so the auto-alpha
+    # search (_auto_tune_alpha -> _reshape_scale_for_input) KeyErrors on the conv's
+    # missing scale. So smooth only the matmul-family ops, but still hand the full
+    # requested set (incl. Conv) to the static quantizer below. This is also how
+    # istupakov ends up smaller: convs become plain static int8, no SmoothQuant.
+    smooth_op_types = [t for t in op_types if t not in ("Conv", "FusedConv")]
     cfg = config.StaticQuantConfig(
         calibration_data_reader=FeatureReader(feats),
         quant_format=fmt,
@@ -490,6 +498,10 @@ def main():
         execution_provider="CPUExecutionProvider",
         extra_options={
             "SmoothQuant": True,
+            # Smoother op set != quantizer op set: never send Conv to the smoother
+            # (see smooth_op_types above). Forwarded for a fixed/future library;
+            # the armed shim below also injects it into transform() directly.
+            "SmoothQuantOpTypes": smooth_op_types,
             # alpha="auto" makes the smoother search a per-layer optimal alpha
             # (minimising each layer's QDQ output error vs fp32) instead of forcing
             # one global value onto FastConformer's very uneven outlier profile.
@@ -519,10 +531,16 @@ def main():
             "alpha": eo["SmoothQuantAlpha"],
             "folding": eo["SmoothQuantFolding"],
             "auto_alpha_args": eo["AutoAlphaArgs"],
+            # transform() also forgets SmoothQuantOpTypes; inject it so Conv never
+            # reaches the smoother (auto-alpha would KeyError on its skipped scale).
+            "op_types": eo["SmoothQuantOpTypes"],
         })
+    quant_only = [t for t in op_types if t not in smooth_op_types]
+    quant_only_note = f", quantize-only(no smooth)={quant_only}" if quant_only else ""
     logger.info(f"[sq] SmoothQuant(alpha={alpha}) static int8, per-channel, "
-                f"calib={args.calibrate_method}, ops={op_types}, format={args.quant_format} ...")
+                f"calib={args.calibrate_method}, smooth-ops={smooth_op_types}"
+                f"{quant_only_note}, format={args.quant_format} ...")
     logger.info(f"[sq]   {human(os.path.getsize(in_encoder) + os.path.getsize(str(in_encoder) + '.data'))} fp32 encoder")
     t0 = time.time()
     # ORT_DISABLE_ALL skips neural-compressor's pre-optimization InferenceSession
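
To show what the reworked log line prints, here is a minimal sketch that
builds the same string outside main(). The describe_run helper is hypothetical,
and the "entropy" and "QDQ" argument values are assumed examples rather than
values taken from an actual run.

```python
def describe_run(alpha, calibrate_method, op_types, quant_format):
    # Same filtering as the diff: Conv/FusedConv are quantized but not smoothed.
    smooth_op_types = [t for t in op_types if t not in ("Conv", "FusedConv")]
    quant_only = [t for t in op_types if t not in smooth_op_types]
    # The quantize-only note is appended only when something was filtered out.
    quant_only_note = (f", quantize-only(no smooth)={quant_only}"
                       if quant_only else "")
    return (f"[sq] SmoothQuant(alpha={alpha}) static int8, per-channel, "
            f"calib={calibrate_method}, smooth-ops={smooth_op_types}"
            f"{quant_only_note}, format={quant_format} ...")


with_conv = describe_run("auto", "entropy", ["MatMul", "Conv"], "QDQ")
without_conv = describe_run("auto", "entropy", ["MatMul"], "QDQ")
print(with_conv)
print(without_conv)
```

When no conv op is requested, the message contains no quantize-only note at
all.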