---
license: agpl-3.0
base_model: lordx64/Qwable-v1
base_model_relation: quantized
pipeline_tag: text-generation
library_name: transformers
tags:
- awq
- int4
- w4a16
- compressed-tensors
- llm-compressor
- quantized
- moe
- qwen3.6
- chain-of-thought
- agentic
- tool-use
- vllm
language:
- en
---

# Qwable-v1-AWQ

AWQ 4-bit (W4A16) quantization of [lordx64/Qwable-v1](https://huggingface.co/lordx64/Qwable-v1), a **35B-total / 3B-active text generation** Mixture-of-Experts model (`Qwen3_5MoeForConditionalGeneration`, Qwen3.6 family, with hybrid linear / full attention). Per the base model card it is **text-only** and aimed at reasoning, agentic tool-use, and coding (see [Capabilities](#capabilities)).

- **Variant**: AWQ weight-only (W4A16), with int4 symmetric weights, group size 128, and activation-aware scaling; activations stay BF16
- **Disk size**: ~22 GB (vs ~72 GB BF16, ~3.3×)
- **Quantized by**: [sahilchachra](https://huggingface.co/sahilchachra)
- **Tooling**: `llm-compressor` AWQ (`oneshot`), activation-aware, calibrated on general instruct chat (UltraChat-200k)

> **Note on what is quantized**: only the linear weights that hold the bulk of the parameters are
> taken to int4: the 256-way routed experts, the shared experts, and the full-attention
> projections. The linear/Gated-Delta-Net (mamba-style) layers, the MoE routers, embeddings,
> `lm_head`, the MTP head and all norms are kept in BF16 for stability. The architecture also carries
> a vision tower (`Qwen3_5MoeForConditionalGeneration`), which is likewise kept in BF16, but the base
> model is documented as text-only, so this quantization neither adds nor validates any image
> capability. The headline variant name reflects the dominant (expert/attention) quantization; the
> on-disk size averages the int4 and BF16 halves of the model.

## Capabilities

Unchanged from the base model: quantization only changes weight precision, not behavior.
Per the [base model card](https://huggingface.co/lordx64/Qwable-v1):

- **Reasoning**: thinks in explicit `` chains-of-thought.
- **Agentic tool-use**: emits `` XML blocks for file/shell operations (activates with agent-style system prompts or prior `` turns).
- **Coding**: designed for agentic coding tasks with multi-turn agent interactions.
- **Context length**: 4096 tokens (training) / 16384 tokens (serving).

See the base card for limitations (narrow training distribution, tool-name differences, reasoning inherited from the Opus-4.7 distill).

## Smoke test

Loaded and run with **transformers** on an NVIDIA Thor (Blackwell) device. The model loads, runs the hybrid linear-attention + int4 MoE path, and produces coherent text from a chat-templated prompt. A structure census confirms that only the intended decoder Linears are int4 (routed experts, shared expert, full-attention `q/k/v/o`), with the routers, linear-attention, vision, MTP and norms left in BF16.

This is a functional smoke test only; it is **not** a quality benchmark.
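The structure census mentioned above can be approximated with a small name-based check. The sketch below is illustrative only and is not the script used for this card: the regular expressions are derived from the module patterns this card lists as quantized, while the `model.language_model.` and `model.visual.blocks.` prefixes, the `shared_expert_gate` name, and the `in_proj_qkvz` name in the demo are hypothetical placeholders.

```python
import re
from collections import Counter

# Suffix patterns for the modules this card lists as int4.
# Everything that does not match is expected to stay in BF16.
INT4_PATTERNS = [
    r"\.mlp\.experts\.\d+\.(gate|up|down)_proj$",   # routed experts
    r"\.mlp\.shared_expert\.(gate|up|down)_proj$",  # shared experts
    r"\.self_attn\.(q|k|v|o)_proj$",                # full-attention projections
]
_INT4_RE = re.compile("|".join(INT4_PATTERNS))


def expected_precision(module_name: str) -> str:
    """Return 'int4' or 'bf16' for a module name, following the card's rules."""
    # The vision tower is kept in BF16 regardless of its submodule names.
    if ".visual." in module_name or module_name.startswith("visual."):
        return "bf16"
    return "int4" if _INT4_RE.search(module_name) else "bf16"


def census(module_names):
    """Count how many of the given module names fall into each precision."""
    return Counter(expected_precision(name) for name in module_names)


# Hypothetical module names, for illustration only.
demo_names = [
    "model.language_model.layers.3.mlp.experts.17.down_proj",
    "model.language_model.layers.3.mlp.shared_expert.up_proj",
    "model.language_model.layers.3.self_attn.q_proj",
    "model.language_model.layers.0.linear_attn.in_proj_qkvz",
    "model.language_model.layers.3.mlp.gate",
    "model.language_model.layers.3.mlp.shared_expert_gate",
    "lm_head",
    "model.visual.blocks.0.attn.qkv",
]
counts = census(demo_names)
print(dict(counts))
```

On a loaded checkpoint, the same function could be fed the names yielded by `model.named_modules()` and compared against the precision each module actually holds; the exact module paths depend on the `transformers` version and should be checked rather than assumed.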
### Test device

- **GPU**: NVIDIA Thor (Blackwell)
- **CPU / memory**: 14-core ARM (aarch64), 122 GB unified memory
- **Software**: JetPack / L4T R38.4 (Ubuntu 24.04), CUDA 13.0, driver 580, kernel 6.8.12-tegra

## What's quantized

| Quantized → int4 (AWQ W4A16) | Kept in BF16 |
|---|---|
| Routed experts (`mlp.experts.*.{gate,up,down}_proj`, 40 layers × 256 experts) | Linear / Gated-Delta-Net layers (`*.linear_attn.*`) |
| Shared experts (`mlp.shared_expert.{gate,up,down}_proj`) | MoE routers (`mlp.gate`), shared-expert gates |
| Full-attention projections (`self_attn.{q,k,v,o}_proj`) | Embeddings, `lm_head`, MTP head, all norms |
| | Vision tower (`model.visual.*`), present in the arch, unused for text |

## Usage (vLLM)

```python
from vllm import LLM, SamplingParams

llm = LLM(model="sahilchachra/Qwable-v1-AWQ", dtype="bfloat16",
          max_model_len=16384, trust_remote_code=True)
out = llm.generate(["Hello!"], SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128))
print(out[0].outputs[0].text)
```

Runs on GPUs with `compressed-tensors` W4A16 support (vLLM unpacks the int4 weights for you).

## Notes

- Weight-only AWQ (W4A16): weights are int4 (group size 128, symmetric, activation-aware scales), activations remain BF16.
- Format: `pack-quantized` (compressed-tensors), per-expert layout, which is the standard layout vLLM consumes for quantized MoE.
- Loading requires `compressed-tensors` and a recent `transformers` (the `qwen3_5_moe` architecture).
- Smoke-tested only; not formally benchmarked for quality.
- Sibling quantization: [sahilchachra/Qwable-v1-NVFP4A16](https://huggingface.co/sahilchachra/Qwable-v1-NVFP4A16) (NVFP4 for Blackwell GPUs).

## Original model

See [lordx64/Qwable-v1](https://huggingface.co/lordx64/Qwable-v1) for full lineage, intended use, and limitations. License (AGPL-3.0) is inherited from the base model.