Agents-A1-W4A16

INT4 (W4A16) quantization of InternScience/Agents-A1, a 35B-A3B multimodal Mixture-of-Experts agentic model (qwen3_5_moe: hybrid GatedDeltaNet linear-attention and full-attention layers, 40 layers in total; 256 routed experts with 8 active per token, plus a shared expert; and a 27-block vision tower). It is built for task decomposition, planning, tool use / function calling, and scientific and professional reasoning.

Variant: W4A16 (4-bit symmetric integer weights, group size 128, BF16 activations), quantized with round-to-nearest (data-free, no activation-aware scaling).

Quantized by: sahilchachra

Tooling: llm-compressor model_free_ptq (data-free, RTN) -> compressed-tensors
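
As a rough guide to what this scheme costs in storage, the arithmetic below estimates the effective bits per quantized weight. The weight width and group size come from this card; the 16-bit scale width is an assumption, not something the card states. Treat it as a back-of-the-envelope sketch rather than a measurement of the checkpoint.

```python
# Storage cost of group-wise symmetric INT4 weights.
WEIGHT_BITS = 4    # from the card: 4-bit integer weights
GROUP_SIZE = 128   # from the card: one scale per group of 128 weights
SCALE_BITS = 16    # assumption: each group scale is stored as a 16-bit float


def effective_bits_per_weight(weight_bits, group_size, scale_bits):
    """Bits per weight including the per-group scale.

    Symmetric quantization needs no zero point, so the scale is the
    only per-group overhead counted here.
    """
    return weight_bits + scale_bits / group_size


bits = effective_bits_per_weight(WEIGHT_BITS, GROUP_SIZE, SCALE_BITS)
ratio = 16 / bits  # relative to BF16, for the quantized tensors only
print(f"{bits:.3f} bits/weight, {ratio:.2f}x smaller than BF16")
```

The ratio applies only to the tensors that are actually quantized. Modules kept in BF16 do not shrink, so the whole-checkpoint reduction is smaller.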

This is a quantized derivative. Weights, behavior, and license follow the base model — see the original card for full details, benchmarks, and citation.

What is quantized

Quantized to 4-bit:

routed experts mlp.experts.*.{gate,up,down}_proj (all layers)
shared expert {gate,up,down}_proj
full-attention self_attn.{q,k,v,o}_proj

Kept in BF16: GatedDeltaNet linear_attn (mamba) layers, MoE router mlp.gate + shared_expert_gate, vision tower (model.visual.*, 27 blocks), token embeddings, lm_head, all norms.
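
The split above can be expressed as name patterns over the checkpoint's tensors. The sketch below is illustrative only: the regular expressions, the `shared_expert` module name, and every example tensor name are assumptions derived from the module names listed in this card, not the exact ignore list used to produce the checkpoint.

```python
import re

# Hypothetical patterns for the modules this card lists as 4-bit.
QUANTIZED_PATTERNS = [
    r"\.mlp\.experts\.\d+\.(gate|up|down)_proj\.weight$",     # routed experts
    r"\.mlp\.shared_expert\.(gate|up|down)_proj\.weight$",    # shared expert
    r"\.self_attn\.(q|k|v|o)_proj\.weight$",                  # full attention
]


def is_quantized(tensor_name):
    """Return True if the tensor would be 4-bit under the patterns above."""
    if tensor_name.startswith("model.visual."):
        return False  # the vision tower stays in BF16
    return any(re.search(p, tensor_name) for p in QUANTIZED_PATTERNS)


# Hypothetical tensor names, for illustration only.
examples = [
    "model.language_model.layers.3.mlp.experts.17.down_proj.weight",
    "model.language_model.layers.3.mlp.shared_expert.up_proj.weight",
    "model.language_model.layers.3.self_attn.q_proj.weight",
    "model.language_model.layers.3.mlp.gate.weight",
    "model.language_model.layers.3.mlp.shared_expert_gate.weight",
    "model.visual.blocks.0.attn.qkv.weight",
    "lm_head.weight",
]
for name in examples:
    print(f"{'INT4' if is_quantized(name) else 'BF16'}  {name}")
```

Matching on the full suffix keeps the router (`mlp.gate`) and `shared_expert_gate` out of the 4-bit set even though their names resemble the expert projections.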

Calibration

Data-free — weight-only (model_free_ptq, round-to-nearest); no calibration data. Weights are quantized by streaming the safetensors from disk.
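
To make the data-free claim concrete, here is a minimal NumPy sketch of symmetric, group-wise round-to-nearest quantization. It needs only the weight tensor, which is why no calibration set is involved. This is not the llm-compressor implementation: the function names are hypothetical, and the scale convention (largest magnitude in each group mapped to 7) is an assumption that the real tooling may not share.

```python
import numpy as np


def quantize_rtn_int4(w, group_size=128):
    """Symmetric group-wise round-to-nearest to signed 4-bit integers.

    Returns integer codes in [-8, 7] and one scale per group.
    """
    out_features, in_features = w.shape
    assert in_features % group_size == 0, "input dim must divide into groups"
    groups = w.reshape(out_features, in_features // group_size, group_size)
    max_abs = np.abs(groups).max(axis=-1, keepdims=True)
    # Assumed convention: the largest magnitude in a group maps to code 7.
    scale = np.where(max_abs > 0, max_abs / 7.0, 1.0)
    q = np.clip(np.rint(groups / scale), -8, 7).astype(np.int8)
    return q, scale


def dequantize(q, scale, shape):
    """Reconstruct an approximate float weight matrix from codes and scales."""
    return (q.astype(np.float32) * scale).reshape(shape)


rng = np.random.default_rng(0)
w = rng.standard_normal((4, 256)).astype(np.float32)  # toy weight matrix
q, scale = quantize_rtn_int4(w)
w_hat = dequantize(q, scale, w.shape)
print("max abs error:", float(np.abs(w - w_hat).max()))
```

Under this convention the reconstruction error of each weight is at most half of its group's scale, so a single outlier in a group widens the error for the other 127 weights in that group.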

Usage (vLLM)

from vllm import LLM, SamplingParams

# This is a multimodal checkpoint: the vision tower is kept in BF16
# (only the text / MoE weights are 4-bit). vLLM builds the full model.
llm = LLM(
    model="sahilchachra/Agents-A1-W4A16",
    trust_remote_code=True,
)
out = llm.chat(
    [{"role": "user", "content": "Hello!"}],
    SamplingParams(temperature=0.6, top_p=0.95, max_tokens=512),
)
print(out[0].outputs[0].text)

To serve via the CLI, pass the flags directly:

vllm serve sahilchachra/Agents-A1-W4A16 \
    --trust-remote-code \
    --max-model-len 262144 --reasoning-parser qwen3
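
Once the server is running, it can be queried over vLLM's OpenAI-compatible HTTP API. The standard-library sketch below assumes the server listens on vLLM's default address (localhost, port 8000); `build_chat_request` and `chat` are hypothetical helper names. The sampling values mirror the Python example in this card.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: default vLLM host and port


def build_chat_request(prompt, model="sahilchachra/Agents-A1-W4A16"):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
        "top_p": 0.95,
        "max_tokens": 512,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the prompt to a running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Building the request does not touch the network; chat() does.
request = build_chat_request("Hello!")
print(request.full_url)
```

Calling `chat("Hello!")` requires the `vllm serve` process above to be up. If the server uses a different host or port, adjust `BASE_URL` accordingly.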


Safetensors

Model size: 35B params

Tensor types: I64, I32, BF16
