---
license: apache-2.0
base_model: InternScience/Agents-A1
base_model_relation: quantized
pipeline_tag: text-generation
tags:
  - int4
  - w4a16
  - rtn
  - compressed-tensors
  - llm-compressor
language:
  - en
---

# Agents-A1-W4A16

INT4 (W4A16) quantization of
[InternScience/Agents-A1](https://huggingface.co/InternScience/Agents-A1),
a 35B-A3B multimodal Mixture-of-Experts agentic model built for task decomposition, planning, tool use / function calling, and scientific and professional reasoning. The architecture (`qwen3_5_moe`) combines hybrid GatedDeltaNet linear-attention and full-attention layers (40 layers in total), 256 routed experts plus a shared expert with 8 experts active per token, and a 27-block vision tower.

- **Variant**: **W4A16** (4-bit symmetric integer weights, group size 128, BF16 activations), quantized with round-to-nearest (data-free, no activation-aware scaling)
- **Quantized by**: [sahilchachra](https://huggingface.co/sahilchachra)
- **Tooling**: `llm-compressor` `model_free_ptq` (data-free, RTN) -> `compressed-tensors`
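
To make the W4A16 scheme concrete, here is a minimal pure-Python sketch of symmetric, group-wise round-to-nearest quantization with group size 128. It is an illustration only, not the `llm-compressor` implementation: the function names are hypothetical, and the scale convention (`absmax / 7`) is an assumption that may differ from what the library uses.

```python
def quantize_rtn_int4(row, group_size=128):
    """Quantize one weight row; returns (codes, scales), one scale per group."""
    if len(row) % group_size != 0:
        raise ValueError("row length must be a multiple of group_size")
    codes, scales = [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        absmax = max(abs(w) for w in group)
        # Symmetric: a single scale per group and no zero point.
        # Assumed convention: the largest magnitude maps to code 7.
        scale = absmax / 7 if absmax > 0 else 1.0
        scales.append(scale)
        for w in group:
            q = round(w / scale)              # round-to-nearest, no calibration data
            codes.append(max(-8, min(7, q)))  # clamp to the signed 4-bit range
    return codes, scales


def dequantize(codes, scales, group_size=128):
    """Reconstruct approximate weights from integer codes and group scales."""
    return [q * scales[i // group_size] for i, q in enumerate(codes)]
```

Under this convention the reconstruction error of each weight is at most half of its group's scale. If each scale is stored in 16 bits (an assumption about the on-disk format), the cost is roughly 4 + 16/128 = 4.125 bits per quantized weight.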

> This is a quantized derivative. Weights, behavior, and license follow the base
> model — see the
> [original card](https://huggingface.co/InternScience/Agents-A1) for full details, benchmarks, and citation.

## What is quantized

Quantized to 4-bit:

- routed experts `mlp.experts.*.{gate,up,down}_proj` (all layers)
- shared expert `{gate,up,down}_proj`
- full-attention `self_attn.{q,k,v,o}_proj`

Kept in **BF16**:

- GatedDeltaNet `linear_attn` (mamba) layers
- MoE router `mlp.gate` + `shared_expert_gate`
- vision tower (`model.visual.*`, 27 blocks)
- token embeddings, lm_head, all norms
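
The split above can be sketched as a name-based filter. This is a hypothetical helper for inspecting a checkpoint, not the configuration that was passed to `llm-compressor`; the regular expressions paraphrase the module names listed in this section, and the full tensor names in the actual checkpoint are an assumption.

```python
import re

# Modules listed above as quantized to 4-bit (patterns are assumptions).
QUANTIZED = re.compile(
    r"(mlp\.experts\.\d+\.(gate|up|down)_proj"
    r"|shared_expert\.(gate|up|down)_proj"
    r"|self_attn\.[qkvo]_proj)\.weight$"
)

# Modules listed above as kept in BF16 (patterns are assumptions).
KEPT_BF16 = re.compile(
    r"(^model\.visual\.|linear_attn\.|mlp\.gate\.|shared_expert_gate"
    r"|embed_tokens|lm_head|norm)"
)


def precision_for(tensor_name):
    """Return "int4" or "bf16" for a tensor name, following the lists above."""
    if KEPT_BF16.search(tensor_name):
        return "bf16"  # the exclusions take priority
    if QUANTIZED.search(tensor_name):
        return "int4"
    return "bf16"      # anything not explicitly listed stays unquantized
```

For example, `precision_for("lm_head.weight")` returns `"bf16"`, while a routed-expert tensor whose name ends in `mlp.experts.17.down_proj.weight` returns `"int4"`.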

## Calibration

None. Quantization is data-free and weight-only (`model_free_ptq`, round-to-nearest), so no calibration dataset is used. Weights are quantized by streaming the safetensors from disk.

## Usage (vLLM)

```python
from vllm import LLM, SamplingParams

# This is a multimodal checkpoint: the vision tower is kept in BF16
# (only the text / MoE weights are 4-bit). vLLM builds the full model.
llm = LLM(
    model="sahilchachra/Agents-A1-W4A16",
    trust_remote_code=True,
)
out = llm.chat(
    [{"role": "user", "content": "Hello!"}],
    SamplingParams(temperature=0.6, top_p=0.95, max_tokens=512),
)
print(out[0].outputs[0].text)
```

To serve the model from the CLI, pass the same options as flags:

```bash
vllm serve sahilchachra/Agents-A1-W4A16 \
    --trust-remote-code \
    --max-model-len 262144 --reasoning-parser qwen3
```