Libraries

How to use sahilchachra/Qwable-v1-NVFP4A16 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="sahilchachra/Qwable-v1-NVFP4A16")
# The base model is documented as text-only, so send a plain-text chat prompt.
messages = [
    {"role": "user", "content": "What is the capital of France?"},
]
# The text-generation pipeline takes the chat messages as its first positional argument.
print(pipe(messages))

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("sahilchachra/Qwable-v1-NVFP4A16")
model = AutoModelForMultimodalLM.from_pretrained("sahilchachra/Qwable-v1-NVFP4A16")
# The base model is documented as text-only, so the prompt contains text parts only.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is the capital of France?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use sahilchachra/Qwable-v1-NVFP4A16 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "sahilchachra/Qwable-v1-NVFP4A16"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "sahilchachra/Qwable-v1-NVFP4A16",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
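
The same request can be sent from Python without extra dependencies. The sketch below uses only the standard library and mirrors the curl call above; the helper names `build_chat_request` and `chat`, and the `max_tokens` default, are illustrative assumptions rather than part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "sahilchachra/Qwable-v1-NVFP4A16"


def build_chat_request(prompt, model=MODEL_ID, base_url="http://localhost:8000", max_tokens=256):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,  # hypothetical default, tune for long reasoning traces
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the prompt to a running server and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server from the previous step running, `chat("What is the capital of France?")` returns the reply as a string. For the SGLang server described further down, pass `base_url="http://localhost:30000"`.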

Use Docker

docker model run hf.co/sahilchachra/Qwable-v1-NVFP4A16

SGLang

How to use sahilchachra/Qwable-v1-NVFP4A16 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "sahilchachra/Qwable-v1-NVFP4A16" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "sahilchachra/Qwable-v1-NVFP4A16",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "sahilchachra/Qwable-v1-NVFP4A16" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "sahilchachra/Qwable-v1-NVFP4A16",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use sahilchachra/Qwable-v1-NVFP4A16 with Docker Model Runner:
```
docker model run hf.co/sahilchachra/Qwable-v1-NVFP4A16
```

Qwable-v1-NVFP4A16

NVFP4 quantization of lordx64/Qwable-v1 — a 35B-total / 3B-active text generation Mixture-of-Experts model (Qwen3_5MoeForConditionalGeneration, Qwen3.6 family, with hybrid linear / full attention). Per the base model card it is text-only and aimed at reasoning, agentic tool-use, and coding (see Capabilities).

Variant: NVFP4 weight-only (W4A16), with 4-bit float weights, group size 16, per-group FP8 (e4m3) scales plus per-tensor FP32 global scales; activations stay BF16
Disk size: ~24 GB (vs ~67 GB BF16, ~2.8× smaller)
Quantized by: sahilchachra
Tooling: llm-compressor model_free_ptq (data-free, streaming PTQ; no calibration data)
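
To make the group-wise format concrete, here is a simplified pure-Python sketch of 4-bit float (E2M1) quantization with one scale per group of 16 weights. It is an illustration only, not llm-compressor's implementation: the per-group scale is kept as a Python float instead of being rounded to FP8 (e4m3), and the per-tensor FP32 global scale is omitted. All function names are hypothetical.

```python
# Magnitudes representable by a 4-bit E2M1 float (the sign is a separate bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
GROUP_SIZE = 16


def quantize_group(values):
    """Return (scale, codes) for one group; codes are signed E2M1 grid values."""
    amax = max(abs(v) for v in values)
    # Map the largest magnitude in the group onto the largest grid value (6.0).
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = []
    for v in values:
        nearest = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        codes.append(-nearest if v < 0 else nearest)
    return scale, codes


def dequantize_group(scale, codes):
    """Reconstruct approximate weights from the group scale and the 4-bit codes."""
    return [scale * c for c in codes]


def fake_quantize(weights):
    """Quantize then dequantize a flat list of weights, group by group."""
    out = []
    for start in range(0, len(weights), GROUP_SIZE):
        scale, codes = quantize_group(weights[start:start + GROUP_SIZE])
        out.extend(dequantize_group(scale, codes))
    return out
```

Because the grid is denser near zero, small weights in a group are reconstructed more precisely than large ones; the worst-case error in a group is one scale unit, half the gap between the grid values 4 and 6.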

Note on what is quantized: NVFP4 is applied only to the linear weights that hold the bulk of the parameters, namely the 256-way routed experts, the shared experts, and the full-attention projections. The linear-attention (Gated-Delta-Net, Mamba-style) layers, the MoE routers, embeddings, lm_head, the MTP head, and all norms are kept in BF16 for stability. The architecture also carries a vision tower (Qwen3_5MoeForConditionalGeneration), which is likewise kept in BF16; because the base model is documented as text-only, this quantization neither adds nor validates any image capability. The variant name reflects the dominant (expert and attention) quantization, while the on-disk size reflects the mix of NVFP4 and BF16 tensors, which is why the reduction is ~2.8× rather than ~4×.
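
The size figures can be sanity-checked with a little arithmetic. The sketch below computes the effective bits per NVFP4 weight from the stated format (4-bit values plus one 8-bit scale per 16 weights; the per-tensor FP32 scale is negligible) and then backs out roughly what share of the weights must be in NVFP4 to explain the ~24 GB vs ~67 GB sizes. The resulting share is a back-of-envelope estimate from rounded sizes, not a number measured from the checkpoint.

```python
GROUP_SIZE = 16   # weights per scale group
WEIGHT_BITS = 4   # FP4 value
SCALE_BITS = 8    # FP8 (e4m3) scale shared by one group
BF16_BITS = 16

# Effective storage cost of one NVFP4 weight, including its share of the group scale.
nvfp4_bits = WEIGHT_BITS + SCALE_BITS / GROUP_SIZE

# Compression if every weight were NVFP4, versus what the card reports.
ideal_ratio = BF16_BITS / nvfp4_bits
observed_ratio = 67 / 24

# Solve f * (nvfp4_bits / 16) + (1 - f) = 24 / 67 for the quantized share f.
quantized_fraction = (1 - 24 / 67) / (1 - nvfp4_bits / BF16_BITS)

print(f"bits per NVFP4 weight: {nvfp4_bits}")
print(f"ideal compression:     {ideal_ratio:.2f}x")
print(f"observed compression:  {observed_ratio:.2f}x")
print(f"implied NVFP4 share:   {quantized_fraction:.0%}")
```

Under these assumptions roughly nine tenths of the weights are in NVFP4, which is consistent with the experts and attention projections holding most of the parameters.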

Capabilities

Unchanged from the base model: quantization changes weight precision, not the prompt format or intended capabilities, although output quality has not been benchmarked. Per the base model card:

Reasoning — thinks in explicit <think>…</think> chains-of-thought.
Agentic tool-use — emits <tool_use> XML blocks for file/shell operations (activates with agent-style system prompts or prior <tool_result> turns).
Coding — designed for agentic coding tasks with multi-turn agent interactions.
Context length: 4096 tokens (training) / 16384 tokens (serving).

See the base card for limitations (narrow training distribution, tool-name differences, reasoning inherited from the Opus-4.7 distill).
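
Since responses interleave reasoning and tool blocks with the final answer, client code usually needs to separate them. The following is a minimal sketch that assumes the tags appear literally and well-formed in the decoded text, as described above; the inner structure of a `<tool_use>` block is not specified here, so it is returned as raw text. The function name `split_response` is hypothetical.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
TOOL_RE = re.compile(r"<tool_use>(.*?)</tool_use>", re.DOTALL)


def split_response(text):
    """Split decoded model output into reasoning, raw tool blocks, and the visible answer."""
    reasoning = [m.strip() for m in THINK_RE.findall(text)]
    tool_calls = [m.strip() for m in TOOL_RE.findall(text)]
    # Whatever remains after removing both kinds of blocks is the user-facing answer.
    answer = TOOL_RE.sub("", THINK_RE.sub("", text)).strip()
    return {"reasoning": reasoning, "tool_calls": tool_calls, "answer": answer}
```

If generation stops at the token limit before `</think>` is emitted, the unfinished reasoning stays in `answer`, so a generous `max_tokens` is advisable for this model.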

Smoke test

Loaded and run with vLLM 0.19 on an NVIDIA Thor (Blackwell) device. The model loads, captures CUDA graphs, runs the hybrid linear-attention + NVFP4 MoE path, and produces coherent text. This is a functional smoke test only — it is not a quality benchmark.

Generation speed

Quick on-device measurement (not a tuned benchmark): warmed, short chat-templated prompt, greedy decoding, CUDA graphs enabled, identical settings for both variants, single GPU.

| Metric | This model (NVFP4 W4A16) | BF16 source |
| --- | --- | --- |
| Single-stream decode (tok/s) | 41.8 | 30.3 |
| Batched ×16 aggregate decode (tok/s) | 330.8 | 303.0 |
| On-disk size | ~24 GB | ~67 GB |

Single-stream decode is memory-bandwidth bound, so the ~4× smaller quantized weights give the largest gain (~1.4×); batched decode is more compute-bound and the W4A16 dequantization cost narrows the gap (~1.1×). Numbers will vary with prompt length, batch size, and KV-cache growth; this is a reasoning model, so long thinking traces decode more tokens.
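
The ratios quoted above follow directly from the measurements in the table. A short worked calculation, using only those numbers:

```python
# Measurements copied from the table above (tokens per second, GB).
single_stream = {"nvfp4": 41.8, "bf16": 30.3}
batched_x16 = {"nvfp4": 330.8, "bf16": 303.0}
disk_gb = {"nvfp4": 24, "bf16": 67}
BATCH = 16

single_speedup = single_stream["nvfp4"] / single_stream["bf16"]
batched_speedup = batched_x16["nvfp4"] / batched_x16["bf16"]
size_reduction = disk_gb["bf16"] / disk_gb["nvfp4"]
# Aggregate throughput divided by the batch size gives the per-request decode rate.
per_stream_in_batch = batched_x16["nvfp4"] / BATCH

print(f"single-stream speedup: {single_speedup:.2f}x")
print(f"batched x16 speedup:   {batched_speedup:.2f}x")
print(f"on-disk reduction:     {size_reduction:.2f}x")
print(f"per-stream rate at batch 16: {per_stream_in_batch:.1f} tok/s")
```

At batch 16 each request decodes at about half the single-stream rate, while aggregate throughput is roughly eight times higher.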

Test device

GPU: NVIDIA Thor (Blackwell, native NVFP4)
CPU / memory: 14-core ARM (aarch64), 122 GB unified memory
Software: JetPack / L4T R38.4 (Ubuntu 24.04), CUDA 13.0, driver 580, kernel 6.8.12-tegra
Serving: vLLM 0.19 (ghcr.io/nvidia-ai-iot/vllm:latest-jetson-thor)

What's quantized

| Quantized → NVFP4 | Kept in BF16 |
| --- | --- |
| Routed experts (`mlp.experts.*.{gate,up,down}_proj`, 40 layers × 256 experts) | Linear / Gated-Delta-Net layers (`.linear_attn.`) |
| Shared experts (`mlp.shared_expert.{gate,up,down}_proj`) | MoE routers (`mlp.gate`), shared-expert gates |
| Full-attention projections (`self_attn.{q,k,v,o}_proj`) | Embeddings, `lm_head`, MTP head, all norms |
| | Vision tower (`model.visual.*`): present in the architecture, unused for text |
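
The table can be read as a name-matching rule. The sketch below turns the patterns listed above into regular expressions and classifies a parameter name as NVFP4 or BF16. The full parameter names used in the usage note are hypothetical examples, and this is not the ignore list actually passed to llm-compressor.

```python
import re

# Patterns taken from the "Quantized → NVFP4" column.
QUANTIZED_PATTERNS = [
    re.compile(r"\.mlp\.experts\.\d+\.(gate|up|down)_proj\b"),
    re.compile(r"\.mlp\.shared_expert\.(gate|up|down)_proj\b"),
    re.compile(r"\.self_attn\.(q|k|v|o)_proj\b"),
]

# Modules that stay in BF16 even if a projection name inside them looks similar.
EXCLUDED_PATTERNS = [
    re.compile(r"\.linear_attn\."),
    re.compile(r"(^|\.)visual\."),
]


def is_nvfp4(param_name):
    """Return True if the table above places this parameter in the NVFP4 column."""
    if any(p.search(param_name) for p in EXCLUDED_PATTERNS):
        return False
    return any(p.search(param_name) for p in QUANTIZED_PATTERNS)
```

For example, `is_nvfp4("model.layers.3.mlp.experts.17.up_proj.weight")` is true, while `is_nvfp4("model.layers.3.mlp.gate.weight")` and `is_nvfp4("lm_head.weight")` are false, because anything not explicitly matched stays in BF16.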

Usage (vLLM)

from vllm import LLM, SamplingParams

llm = LLM(model="sahilchachra/Qwable-v1-NVFP4A16", dtype="bfloat16", max_model_len=16384)
out = llm.generate(["Hello!"], SamplingParams(temperature=0.0, max_tokens=128))
print(out[0].outputs[0].text)

Runs on Blackwell GPUs with native NVFP4 support.
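
For chat-style prompts it is usually preferable to let vLLM apply the model's chat template rather than passing a raw string. The sketch below uses `LLM.chat` for that; the helper names `build_messages` and `run_chat` and the `max_tokens` default are illustrative assumptions, and vLLM is imported inside the function so that the message helper can be used without a GPU.

```python
MODEL_ID = "sahilchachra/Qwable-v1-NVFP4A16"


def build_messages(prompt, system_prompt=None):
    """Build an OpenAI-style message list, optionally with a system prompt."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": prompt})
    return messages


def run_chat(prompt, system_prompt=None, max_tokens=512):
    """Load the model with vLLM and return the reply to a single chat prompt."""
    # Imported lazily: loading vLLM requires a supported GPU environment.
    from vllm import LLM, SamplingParams

    llm = LLM(model=MODEL_ID, dtype="bfloat16", max_model_len=16384)
    params = SamplingParams(temperature=0.0, max_tokens=max_tokens)
    outputs = llm.chat(build_messages(prompt, system_prompt), sampling_params=params)
    return outputs[0].outputs[0].text
```

Calling `run_chat("What is the capital of France?")` on a suitable device returns the decoded reply, including any `<think>` block the model emits.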

Notes

Weight-only NVFP4 (W4A16): weights are 4-bit, activations remain BF16.
Format: nvfp4-pack-quantized (compressed-tensors), per-expert layout — the standard layout vLLM consumes for quantized MoE.
Smoke-tested only; not formally benchmarked for quality.