Agents-A1-W4A16
INT4 (W4A16) quantization of InternScience/Agents-A1, a 35B-A3B multimodal Mixture-of-Experts agentic model (qwen3_5_moe: hybrid GatedDeltaNet linear-attention + full-attention over 40 layers, 256 routed experts with 8 active per token plus a shared expert, and a 27-block vision tower) built for task decomposition, planning, tool use / function calling, and scientific and professional reasoning.
Variant: W4A16 — 4-bit symmetric integer weights, group size 128, activations BF16. Round-to-nearest (data-free, no activation-aware scaling).
Quantized by: sahilchachra
Tooling: llm-compressor model_free_ptq (data-free, RTN) -> compressed-tensors
This is a quantized derivative. Weights, behavior, and license follow the base model — see the original card for full details, benchmarks, and citation.
What is quantized
Quantized to 4-bit:
- routed experts: mlp.experts.*.{gate,up,down}_proj (all layers)
- shared expert: {gate,up,down}_proj
- full-attention: self_attn.{q,k,v,o}_proj
Kept in BF16: GatedDeltaNet linear_attn (mamba) layers, MoE router mlp.gate + shared_expert_gate, vision tower (model.visual.*, 27 blocks), token embeddings, lm_head, all norms.
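The split above can be pictured as a name-based filter over the model's linear modules. The following is only a sketch: the regex patterns, the `mlp.shared_expert` prefix, the `model.language_model.layers.N` naming, and the helper `is_quantized` are assumptions made for illustration, not the recipe actually used for this checkpoint.

```python
import re

# Hypothetical patterns mirroring the lists above (module names, without
# the trailing ".weight"). The real recipe is not shown in this card.
QUANTIZED = [
    r"mlp\.experts\.\d+\.(gate|up|down)_proj",   # routed experts
    r"mlp\.shared_expert\.(gate|up|down)_proj",  # shared expert (assumed prefix)
    r"self_attn\.(q|k|v|o)_proj",                # full-attention projections
]
KEPT_BF16 = [
    r"linear_attn\.",        # GatedDeltaNet linear-attention layers
    r"mlp\.gate$",           # MoE router
    r"shared_expert_gate",   # gate on the shared expert
    r"^model\.visual\.",     # vision tower
    r"embed_tokens",         # token embeddings
    r"lm_head",
    r"norm",                 # all norm layers
]


def is_quantized(module_name: str) -> bool:
    """Return True if a module would be stored as 4-bit under this sketch."""
    # The keep-list wins, so anything matching it stays in BF16.
    if any(re.search(p, module_name) for p in KEPT_BF16):
        return False
    return any(re.search(p, module_name) for p in QUANTIZED)
```

Checking the keep-list first means an unexpected module name falls back to BF16 rather than being quantized by accident.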
Calibration
Data-free and weight-only (model_free_ptq, round-to-nearest), so no calibration data is used. Weights are quantized by streaming the safetensors from disk.
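To make "round-to-nearest, symmetric, group size 128" concrete, here is a minimal pure-Python sketch of the idea. It is not the llm-compressor implementation: the function names are hypothetical, and the scale convention (largest magnitude in the group maps to 7) is an assumption that may differ from what compressed-tensors uses.

```python
def quantize_rtn_symmetric(weights, group_size=128, bits=4):
    """Round-to-nearest symmetric quantization of a flat list of floats.

    Returns (int_values, scales) with one scale per group and no zero point.
    """
    qmax = 2 ** (bits - 1) - 1   # 7 for 4-bit
    qmin = -(2 ** (bits - 1))    # -8 for 4-bit
    q, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Assumed convention: the largest magnitude in the group maps to qmax.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        for w in group:
            # Python's round() rounds exact halves to the nearest even integer.
            q.append(max(qmin, min(qmax, round(w / scale))))
    return q, scales


def dequantize(q, scales, group_size=128):
    """Map integers back to floats using the scale of their group."""
    return [v * scales[i // group_size] for i, v in enumerate(q)]
```

Each scale depends only on the weights in its own group, which is why no calibration data is needed. If each scale is stored as a 16-bit value (an assumption), the cost is roughly 4 + 16/128 = 4.125 bits per quantized weight.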
Usage (vLLM)
from vllm import LLM, SamplingParams

# This is a multimodal checkpoint: the vision tower is kept in BF16
# (only the text / MoE weights are 4-bit). vLLM builds the full model.
llm = LLM(
    model="sahilchachra/Agents-A1-W4A16",
    trust_remote_code=True,
)
out = llm.chat(
    [{"role": "user", "content": "Hello!"}],
    SamplingParams(temperature=0.6, top_p=0.95, max_tokens=512),
)
print(out[0].outputs[0].text)
To serve via the CLI, pass the flags directly:
vllm serve sahilchachra/Agents-A1-W4A16 \
    --trust-remote-code \
    --max-model-len 262144 \
    --reasoning-parser qwen3
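Once the server is up, it exposes vLLM's OpenAI-compatible HTTP API. The sketch below queries it using only the standard library. It assumes the default address http://localhost:8000/v1 (adjust `base_url` if the server was started with a different host or port) and reuses the sampling settings from the Python example; the helper names `build_chat_request` and `chat` are illustrative only.

```python
import json
import urllib.request


def build_chat_request(prompt, model="sahilchachra/Agents-A1-W4A16",
                       base_url="http://localhost:8000/v1"):
    """Build a POST request for the /chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
        "top_p": 0.95,
        "max_tokens": 512,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the prompt to a running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Example, with the server running:
#   print(chat("Hello!"))
```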