Qwable-v1-NVFP4A16
NVFP4 quantization of lordx64/Qwable-v1 — a 35B-total /
3B-active text generation Mixture-of-Experts model (Qwen3_5MoeForConditionalGeneration, Qwen3.6
family, with hybrid linear / full attention). Per the base model card it is text-only and aimed at
reasoning, agentic tool-use, and coding (see Capabilities).
- Variant: NVFP4 weight-only (W4A16). 4-bit float weights, group size 16, per-group FP8 (e4m3) scales plus per-tensor FP32 global scales; activations stay BF16.
- Disk size: ~24 GB (vs ~67 GB for BF16, about 2.8× smaller)
- Quantized by: sahilchachra
- Tooling: llm-compressor model_free_ptq (data-free, streaming PTQ; no calibration data)
Note on what is quantized: only the linear weights that hold the bulk of the parameters are taken to NVFP4, namely the 256-way routed experts, the shared experts, and the full-attention projections. The linear / Gated-Delta-Net (Mamba-style) layers, the MoE routers, embeddings, lm_head, the MTP head, and all norms are kept in BF16 for stability. The architecture also carries a vision tower (Qwen3_5MoeForConditionalGeneration), which is likewise kept in BF16; the base model is documented as text-only, so this quantization neither adds nor validates any image capability. The variant name reflects the dominant (expert/attention) quantization, and the on-disk size reflects the mix of NVFP4 and BF16 tensors.
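To make the format description concrete, here is a minimal pure-Python sketch of group-wise 4-bit float (E2M1) rounding with one scale per 16 weights. It is an illustration under simplifying assumptions, not the llm-compressor implementation: the per-group scale is kept as a Python float instead of being rounded to FP8 (e4m3), the per-tensor FP32 global scale is omitted, and the function names are hypothetical.

```python
import math
import random

# Magnitudes representable by a 4-bit E2M1 float (the sign is a separate bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
GROUP_SIZE = 16  # per the card: one scale per group of 16 weights


def fake_quantize_group(group):
    """Quantize then dequantize one group of weights.

    Assumption: the scale stays a full-precision float here; the real
    format stores it as FP8 (e4m3) alongside a per-tensor FP32 global scale.
    """
    amax = max(abs(x) for x in group)
    if amax == 0.0:
        return [0.0] * len(group)
    # Map the group's largest magnitude onto the largest grid value (6.0).
    scale = amax / E2M1_GRID[-1]
    out = []
    for x in group:
        # Snap the scaled magnitude to the nearest representable value.
        nearest = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(math.copysign(nearest * scale, x))
    return out


def fake_quantize_nvfp4(weights):
    """Apply group-wise fake quantization to a flat list of weights."""
    if len(weights) % GROUP_SIZE:
        raise ValueError("length must be a multiple of the group size")
    out = []
    for start in range(0, len(weights), GROUP_SIZE):
        out.extend(fake_quantize_group(weights[start:start + GROUP_SIZE]))
    return out


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(64)]
dequantized = fake_quantize_nvfp4(weights)
max_err = max(abs(a - b) for a, b in zip(weights, dequantized))
print(f"max abs round-trip error: {max_err:.4f}")
```

The widest gap on the grid is between 4 and 6, so in this sketch the rounding error of any weight is at most one sixth of its group's largest magnitude.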
Capabilities
Unchanged from the base model: quantization changes weight precision, not the intended behavior (output quality has not been benchmarked). Per the base model card:
- Reasoning: thinks in explicit <think>…</think> chains-of-thought.
- Agentic tool-use: emits <tool_use> XML blocks for file/shell operations (activates with agent-style system prompts or prior <tool_result> turns).
- Coding: designed for agentic coding tasks with multi-turn agent interactions.
- Context length: 4096 tokens (training) / 16384 tokens (serving).
See the base card for limitations (narrow training distribution, tool-name differences, reasoning inherited from the Opus-4.7 distill).
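Because the model emits its reasoning inside <think>…</think> tags, client code usually wants to separate the trace from the final answer. The sketch below shows one way to do that with a regular expression; the helper name `split_reasoning` and the sample string are hypothetical, and the tag format is assumed to be exactly as described above.

```python
import re

# Non-greedy match so multiple <think> blocks are captured separately.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text):
    """Return (reasoning, answer) from a generated string.

    Only complete <think>...</think> pairs are handled; a trace cut off
    by the token limit (no closing tag) is left in the answer untouched.
    """
    reasoning = "\n".join(part.strip() for part in THINK_RE.findall(text))
    answer = THINK_RE.sub("", text).strip()
    return reasoning, answer


sample = "<think>2 + 2 is 4.</think>The answer is 4."
reasoning, answer = split_reasoning(sample)
print("reasoning:", reasoning)
print("answer:", answer)
```

Text without any tags passes through unchanged as the answer, with an empty reasoning string.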
Smoke test
Loaded and run with vLLM 0.19 on an NVIDIA Thor (Blackwell) device. The model loads, captures CUDA graphs, runs the hybrid linear-attention + NVFP4 MoE path, and produces coherent text. This is a functional smoke test only — it is not a quality benchmark.
Generation speed
Quick on-device measurement (not a tuned benchmark): warmed, short chat-templated prompt, greedy decoding, CUDA graphs enabled, identical settings for both variants, single GPU.
| | This model (NVFP4 W4A16) | BF16 source |
|---|---|---|
| Single-stream decode (tok/s) | 41.8 | 30.3 |
| Batched ×16 aggregate decode (tok/s) | 330.8 | 303.0 |
| On-disk size | ~24 GB | ~67 GB |
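Single-stream decode is memory-bandwidth bound, so it benefits most from the smaller weights (about 1.4×). The quantized tensors are roughly 3.6× smaller than BF16 once the per-group scales are counted (4.5 vs 16 bits per weight), while the model as a whole is about 2.8× smaller on disk. Batched decode is more compute-bound, and the W4A16 dequantization cost narrows the gap to about 1.1×. Numbers will vary with prompt length, batch size, and KV-cache growth (this is a reasoning model, so long thinking traces decode more tokens).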
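The ratios quoted here follow directly from the table and the format description. The short calculation below reproduces them; it ignores the per-tensor FP32 global scales, whose contribution to size is assumed to be negligible.

```python
# Storage cost of one NVFP4 weight: 4 bits plus a share of the group's FP8 scale.
WEIGHT_BITS = 4
SCALE_BITS = 8
GROUP_SIZE = 16
BF16_BITS = 16

bits_per_weight = WEIGHT_BITS + SCALE_BITS / GROUP_SIZE
tensor_compression = BF16_BITS / bits_per_weight

# Figures taken from the table above.
disk_ratio = 67 / 24                # BF16 size / NVFP4 size, in GB
single_speedup = 41.8 / 30.3        # single-stream decode, tok/s
batched_speedup = 330.8 / 303.0     # batched x16 aggregate decode, tok/s

print(f"bits per quantized weight: {bits_per_weight}")
print(f"quantized tensors vs BF16: {tensor_compression:.2f}x smaller")
print(f"whole model on disk:       {disk_ratio:.2f}x smaller")
print(f"single-stream speedup:     {single_speedup:.2f}x")
print(f"batched speedup:           {batched_speedup:.2f}x")
```

The whole-model ratio is lower than the per-tensor ratio because part of the model stays in BF16.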
Test device
- GPU: NVIDIA Thor (Blackwell, native NVFP4)
- CPU / memory: 14-core ARM (aarch64), 122 GB unified memory
- Software: JetPack / L4T R38.4 (Ubuntu 24.04), CUDA 13.0, driver 580, kernel 6.8.12-tegra
- Serving: vLLM 0.19 (ghcr.io/nvidia-ai-iot/vllm:latest-jetson-thor)
What's quantized
| Quantized → NVFP4 | Kept in BF16 |
|---|---|
| Routed experts (`mlp.experts.*.{gate,up,down}_proj`, 40 layers × 256 experts) | Linear / Gated-Delta-Net layers (`*.linear_attn.*`) |
| Shared experts (`mlp.shared_expert.{gate,up,down}_proj`) | MoE routers (`mlp.gate`), shared-expert gates |
| Full-attention projections (`self_attn.{q,k,v,o}_proj`) | Embeddings, lm_head, MTP head, all norms |
| | Vision tower (`model.visual.*`), present in the arch, unused for text |
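The split in the table can be expressed as a small name-matching rule. The sketch below is only an illustration of that rule: the regular expressions are derived from the patterns in the table, the full tensor names in the examples are hypothetical, and this is not the ignore list actually passed to llm-compressor.

```python
import re

# Checked first: subtrees the card says stay in BF16 even if they contain
# projection-like names (assumed prefixes; real checkpoint names may differ).
KEEP_BF16 = [re.compile(p) for p in (r"(^|\.)visual\.", r"(^|\.)mtp\.")]

# Patterns taken from the "Quantized" column of the table.
QUANTIZED = [
    re.compile(p)
    for p in (
        r"mlp\.experts\.\d+\.(gate|up|down)_proj",
        r"mlp\.shared_expert\.(gate|up|down)_proj",
        r"self_attn\.[qkvo]_proj",
    )
]


def precision_for(tensor_name):
    """Return "NVFP4" or "BF16" for a tensor name, following the table."""
    if any(p.search(tensor_name) for p in KEEP_BF16):
        return "BF16"
    if any(p.search(tensor_name) for p in QUANTIZED):
        return "NVFP4"
    # Everything else (linear attention, routers, embeddings, norms, ...).
    return "BF16"


# Hypothetical tensor names, for illustration only.
examples = [
    "model.language_model.layers.3.mlp.experts.17.up_proj.weight",
    "model.language_model.layers.3.mlp.shared_expert.down_proj.weight",
    "model.language_model.layers.3.self_attn.q_proj.weight",
    "model.language_model.layers.0.linear_attn.in_proj.weight",
    "model.language_model.layers.3.mlp.gate.weight",
    "model.visual.blocks.0.attn.qkv.weight",
    "lm_head.weight",
]
for name in examples:
    print(f"{precision_for(name):5s}  {name}")
```

Checking the keep-list first mirrors the card's statement that the MTP head and vision tower stay in BF16 regardless of what layers they contain.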
Usage (vLLM)
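```python
from vllm import LLM, SamplingParams

# Activations run in BF16; 16384 matches the serving context length above.
llm = LLM(model="sahilchachra/Qwable-v1-NVFP4A16", dtype="bfloat16", max_model_len=16384)

# temperature=0.0 selects greedy decoding.
out = llm.generate(["Hello!"], SamplingParams(temperature=0.0, max_tokens=128))
print(out[0].outputs[0].text)
```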
Runs on Blackwell GPUs with native NVFP4 support.
Notes
- Weight-only NVFP4 (W4A16): weights are 4-bit, activations remain BF16.
- Format: nvfp4-pack-quantized (compressed-tensors), per-expert layout, the standard layout vLLM consumes for quantized MoE.
- Smoke-tested only; not formally benchmarked for quality.
Original model
See lordx64/Qwable-v1 for full lineage, intended use, and limitations. License (AGPL-3.0) is inherited from the base model.
Model tree for sahilchachra/Qwable-v1-NVFP4A16
Base model
Qwen/Qwen3.6-35B-A3B