---
license: agpl-3.0
base_model: lordx64/Qwable-v1
base_model_relation: quantized
pipeline_tag: text-generation
library_name: transformers
tags:
- nvfp4
- fp4
- compressed-tensors
- llm-compressor
- quantized
- moe
- qwen3.6
- chain-of-thought
- agentic
- tool-use
- vllm
language:
- en
---
# Qwable-v1-NVFP4A16
NVFP4 quantization of [lordx64/Qwable-v1](https://huggingface.co/lordx64/Qwable-v1) — a **35B-total /
3B-active text generation** Mixture-of-Experts model (`Qwen3_5MoeForConditionalGeneration`, Qwen3.6
family, with hybrid linear / full attention). Per the base model card it is **text-only** and aimed at
reasoning, agentic tool-use, and coding (see [Capabilities](#capabilities)).
**Variant**: NVFP4 weight-only (W4A16) — 4-bit float weights, group size 16, per-group FP8 (e4m3) scales + per-tensor FP32 global scales; activations stay BF16
**Disk size**: ~24 GB (vs ~67 GB BF16, ~2.8×)
**Quantized by**: [sahilchachra](https://huggingface.co/sahilchachra)
**Tooling**: `llm-compressor` `model_free_ptq` (data-free, streaming PTQ — no calibration data)
> **Note on what is quantized**: only the linear weights that hold the bulk of the parameters are
> taken to NVFP4 — the 256-way routed experts, the shared experts, and the full-attention
> projections. The linear/Gated-Delta-Net (mamba-style) layers, the MoE routers, embeddings,
> `lm_head`, the MTP head and all norms are kept in BF16 for stability. The architecture also carries
> a vision tower (`Qwen3_5MoeForConditionalGeneration`), which is likewise kept in BF16 — but the base
> model is documented as text-only, so this quantization neither adds nor validates any image
> capability. The headline variant name reflects the dominant (expert/attention) quantization; the
> on-disk size averages the NVFP4 and BF16 halves of the model.
## Capabilities
Unchanged from the base model — quantization only changes weight precision, not behavior. Per the
[base model card](https://huggingface.co/lordx64/Qwable-v1):
- **Reasoning** — thinks in explicit `…` chains-of-thought.
- **Agentic tool-use** — emits `` XML blocks for file/shell operations (activates with
agent-style system prompts or prior `` turns).
- **Coding** — designed for agentic coding tasks with multi-turn agent interactions.
- **Context length**: 4096 tokens (training) / 16384 tokens (serving).
See the base card for limitations (narrow training distribution, tool-name differences, reasoning
inherited from the Opus-4.7 distill).
## Smoke test
Loaded and run with **vLLM 0.19** on an NVIDIA Thor (Blackwell) device. The model loads, captures
CUDA graphs, runs the hybrid linear-attention + NVFP4 MoE path, and produces coherent text. This is a
functional smoke test only — it is **not** a quality benchmark.
### Generation speed
Quick on-device measurement (not a tuned benchmark): warmed, short chat-templated prompt, greedy
decoding, CUDA graphs enabled, identical settings for both variants, single GPU.
| | This model (NVFP4 W4A16) | BF16 source |
|---|---:|---:|
| Single-stream decode (tok/s) | **41.8** | 30.3 |
| Batched ×16 aggregate decode (tok/s) | 330.8 | 303.0 |
| On-disk size | ~24 GB | ~67 GB |
Single-stream decode is memory-bandwidth bound, so the ~4× smaller weights give the largest gain
(~1.4×); batched decode is more compute-bound and the W4A16 dequant cost narrows the gap. Numbers
will vary with prompt length, batch size and KV-cache growth (this is a reasoning model — long
thinking traces decode more tokens).
### Test device
- **GPU**: NVIDIA Thor (Blackwell, native NVFP4)
- **CPU / memory**: 14-core ARM (aarch64), 122 GB unified memory
- **Software**: JetPack / L4T R38.4 (Ubuntu 24.04), CUDA 13.0, driver 580, kernel 6.8.12-tegra
- **Serving**: vLLM 0.19 (`ghcr.io/nvidia-ai-iot/vllm:latest-jetson-thor`)
## What's quantized
| Quantized → NVFP4 | Kept in BF16 |
|---|---|
| Routed experts (`mlp.experts.*.{gate,up,down}_proj`, 40 layers × 256 experts) | Linear / Gated-Delta-Net layers (`*.linear_attn.*`) |
| Shared experts (`mlp.shared_expert.{gate,up,down}_proj`) | MoE routers (`mlp.gate`), shared-expert gates |
| Full-attention projections (`self_attn.{q,k,v,o}_proj`) | Embeddings, `lm_head`, MTP head, all norms |
| | Vision tower (`model.visual.*`) — present in the arch, unused for text |
## Usage (vLLM)
```python
from vllm import LLM, SamplingParams
llm = LLM(model="sahilchachra/Qwable-v1-NVFP4A16", dtype="bfloat16", max_model_len=16384)
out = llm.generate(["Hello!"], SamplingParams(temperature=0.0, max_tokens=128))
print(out[0].outputs[0].text)
```
Runs on Blackwell GPUs with native NVFP4 support.
## Notes
- Weight-only NVFP4 (W4A16): weights are 4-bit, activations remain BF16.
- Format: `nvfp4-pack-quantized` (compressed-tensors), per-expert layout — the standard layout vLLM consumes for quantized MoE.
- Smoke-tested only; not formally benchmarked for quality.
## Original model
See [lordx64/Qwable-v1](https://huggingface.co/lordx64/Qwable-v1) for full lineage, intended use, and
limitations. License (AGPL-3.0) is inherited from the base model.