---
license: apache-2.0
base_model: InternScience/Agents-A1
base_model_relation: quantized
pipeline_tag: text-generation
tags:
- int4
- w4a16
- rtn
- compressed-tensors
- llm-compressor
language:
- en
---

# Agents-A1-W4A16

INT4 (W4A16) quantization of [InternScience/Agents-A1](https://huggingface.co/InternScience/Agents-A1), a 35B-A3B multimodal Mixture-of-Experts agentic model (`qwen3_5_moe`: hybrid GatedDeltaNet linear-attention + full-attention over 40 layers, 256 routed experts with 8 active per token plus a shared expert, and a 27-block vision tower) built for task decomposition, planning, tool use / function calling, and scientific & professional reasoning.

**Variant**: **W4A16**: 4-bit symmetric integer weights, group size 128, BF16 activations. Round-to-nearest (data-free, no activation-aware scaling).

**Quantized by**: [sahilchachra](https://huggingface.co/sahilchachra)

**Tooling**: `llm-compressor` `model_free_ptq` (data-free, RTN) -> `compressed-tensors`

> This is a quantized derivative. Weights, behavior, and license follow the base
> model; see the
> [original card](https://huggingface.co/InternScience/Agents-A1) for full details, benchmarks, and citation.

## What is quantized

Quantized to 4-bit:

- routed experts `mlp.experts.*.{gate,up,down}_proj` (all layers)
- shared expert `{gate,up,down}_proj`
- full-attention `self_attn.{q,k,v,o}_proj`

Kept in **BF16**: GatedDeltaNet `linear_attn` (mamba) layers, MoE router `mlp.gate` + `shared_expert_gate`, vision tower (`model.visual.*`, 27 blocks), token embeddings, lm_head, all norms.

## Calibration

Data-free and weight-only (`model_free_ptq`, round-to-nearest); no calibration data is used. Weights are quantized by streaming the safetensors from disk.

## Usage (vLLM)

```python
from vllm import LLM, SamplingParams

# This is a multimodal checkpoint: the vision tower is kept in BF16
# (only the text / MoE weights are 4-bit). vLLM builds the full model.
llm = LLM(
    model="sahilchachra/Agents-A1-W4A16",
    trust_remote_code=True,
)

out = llm.chat(
    [{"role": "user", "content": "Hello!"}],
    SamplingParams(temperature=0.6, top_p=0.95, max_tokens=512),
)
print(out[0].outputs[0].text)
```

To serve via the CLI, pass the flags directly:

```bash
vllm serve sahilchachra/Agents-A1-W4A16 \
  --trust-remote-code \
  --max-model-len 262144 --reasoning-parser qwen3
```
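
For intuition about the weight scheme this card describes (4-bit symmetric integers, group size 128, round-to-nearest), the following NumPy sketch quantizes a small random matrix and dequantizes it again. It is an illustration only, not the `llm-compressor` implementation: the function names `rtn_quantize` and `rtn_dequantize` are hypothetical, and the scale formula (max |w| / 7 per group) is one common choice assumed here; the actual tooling may compute scales differently.

```python
import numpy as np

GROUP_SIZE = 128     # group size stated in this card
QMIN, QMAX = -8, 7   # signed 4-bit integer range


def rtn_quantize(weight: np.ndarray, group_size: int = GROUP_SIZE):
    """Symmetric round-to-nearest quantization with one scale per group.

    `weight` has shape (out_features, in_features). Groups run along the
    input dimension, so in_features must be divisible by group_size.
    """
    out_features, in_features = weight.shape
    assert in_features % group_size == 0
    groups = weight.reshape(out_features, in_features // group_size, group_size)
    # Symmetric scheme: no zero point; the scale maps max |w| onto QMAX.
    # (Assumed formula for illustration; real tooling may differ.)
    max_abs = np.abs(groups).max(axis=-1, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / QMAX, 1.0).astype(np.float32)
    # Round to nearest integer, then clamp to the 4-bit range.
    q = np.clip(np.rint(groups / scale), QMIN, QMAX).astype(np.int8)
    return q, scale


def rtn_dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Rebuild an approximate float matrix from integer codes and scales."""
    out_features, n_groups, group_size = q.shape
    return (q.astype(np.float32) * scale).reshape(out_features, n_groups * group_size)


rng = np.random.default_rng(0)
w = rng.standard_normal((4, 256)).astype(np.float32)
q, scale = rtn_quantize(w)
w_hat = rtn_dequantize(q, scale)
print("codes:", q.shape, "scales:", scale.shape)
print("max abs error:", float(np.abs(w - w_hat).max()))
```

Because no calibration data is involved, the only inputs are the weights themselves. In this sketch the per-element reconstruction error is bounded by half of the group's scale, which is why smaller groups (more scales) generally reduce error at the cost of extra storage.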