Qwen-AgentWorld-35B-A3B-NVFP4
An NVFP4 (mixed-precision, compressed-tensors) quantization of Qwen/Qwen-AgentWorld-35B-A3B, produced with llm-compressor for fast inference with vLLM on NVIDIA Blackwell GPUs (sm120 / RTX PRO 6000).
Benchmark vs official BF16
Both models were measured on the same hardware (1× RTX PRO 6000 Blackwell) with an identical vLLM 0.23 configuration
(--max-model-len 8192 --max-num-seqs 64 --gpu-memory-utilization 0.90, temperature=0):
| Metric | Official BF16 | NVFP4 (this model) | Δ |
|---|---|---|---|
| Disk size | 66 GB | 24.96 GB | −62% |
| First-token latency (TTFT) | 35 ms | 32 ms | −8.6% |
| Single-stream decode | 157.9 tok/s | 184.1 tok/s | +16.6% |
| Concurrent throughput (N=16) | 1351.6 tok/s | 1430.5 tok/s | +5.8% |
Quality spot check (temperature=0): 17×23, a factorial function, "why is the sky blue", and
echo $((6*7)) all produced correct answers equivalent to the BF16 reference. On this setup the NVFP4 build is faster
and about 38% of the BF16 size (24.96 GB vs 66 GB), with matching output on these four prompts.
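
The Δ column of the benchmark table can be recomputed from the two measured columns. The snippet below is a minimal sketch of that arithmetic; the dictionary keys and helper names are illustrative, and the numbers are copied from the table rather than re-measured.

```python
# Benchmark figures copied from the table: (official BF16, NVFP4).
BENCHMARKS = {
    "Disk size (GB)": (66.0, 24.96),
    "TTFT (ms)": (35.0, 32.0),
    "Single-stream decode (tok/s)": (157.9, 184.1),
    "Concurrent throughput, N=16 (tok/s)": (1351.6, 1430.5),
}


def relative_change(baseline, value):
    """Percent change of `value` relative to `baseline`."""
    return (value - baseline) / baseline * 100.0


def size_ratio(baseline, value):
    """Fraction of the baseline that `value` represents."""
    return value / baseline


for name, (bf16, nvfp4) in BENCHMARKS.items():
    print(f"{name}: {relative_change(bf16, nvfp4):+.1f}%")

bf16_gb, nvfp4_gb = BENCHMARKS["Disk size (GB)"]
print(f"NVFP4 / BF16 size ratio: {size_ratio(bf16_gb, nvfp4_gb):.2f}")
```

For latency and disk size a negative change is an improvement; for the two throughput rows a positive change is.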
Quantization
- Tool: llm-compressor 0.12, compressed-tensors (format mixed-precision).
- Scheme:
  - Attention (self_attn.{q,k,v,o}_proj, GDN linear_attn.{in_proj_qkv,in_proj_z,out_proj}) → FP8 (block [128,128]).
  - MoE experts (gate_proj/up_proj/down_proj) → NVFP4 (group size 16, fp8_e4m3 scales).
  - Left in BF16 (ignored): lm_head, embed_tokens, router mlp.gate, shared expert, GDN state (linear_attn.{A_log,conv1d,in_proj_a,in_proj_b}), first/last MoE layer experts, and the vision tower.
- MoE calibration: all 256 experts calibrated (moe_calibrate_all_experts=True) on HuggingFaceH4/ultrachat_200k (256 samples, seq len 2048).
- Size: ~25 GB (down from ~70 GB BF16). No MTP / speculative module.
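
To make "NVFP4 (group size 16, fp8_e4m3 scales)" concrete, here is a simplified, pure-Python sketch of group-wise FP4 (E2M1) fake quantization. It is an illustration only, not the llm-compressor implementation: the function names are hypothetical, the per-group scale is kept in full precision instead of being rounded to FP8 E4M3, and any additional per-tensor scaling used by real kernels is omitted.

```python
# Magnitudes representable in FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
GROUP_SIZE = 16  # group size stated in the scheme above


def fake_quantize_group(values):
    """Quantize then dequantize one group using a single shared scale."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return list(values)
    # Map the largest magnitude in the group onto the largest grid value.
    # Simplification: a real NVFP4 scale would be stored as FP8 E4M3.
    scale = amax / E2M1_GRID[-1]
    out = []
    for v in values:
        # Pick the nearest representable magnitude, then restore the sign.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append((mag if v >= 0 else -mag) * scale)
    return out


def fake_quantize(weights, group_size=GROUP_SIZE):
    """Apply group-wise fake quantization over a flat list of weights."""
    out = []
    for start in range(0, len(weights), group_size):
        out.extend(fake_quantize_group(weights[start:start + group_size]))
    return out


def bits_per_weight(group_size=GROUP_SIZE, weight_bits=4, scale_bits=8):
    """Storage cost per weight: 4-bit value plus its share of an 8-bit scale."""
    return weight_bits + scale_bits / group_size


weights = [0.01 * i - 0.2 for i in range(32)]
dequantized = fake_quantize(weights)
worst = max(abs(a - b) for a, b in zip(weights, dequantized))
print(f"max abs error: {worst:.4f}, bits/weight: {bits_per_weight()}")
```

With one 8-bit scale shared by 16 four-bit values, the quantized expert weights cost 4.5 bits each under these assumptions, compared with 16 bits in BF16.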
Serving (vLLM)
Text-only deployment (the source defines a vision tower but sets language_model_only=true).
vLLM auto-detects compressed-tensors — no --quantization flag needed.
vllm serve kyaky/Qwen-AgentWorld-35B-A3B-NVFP4 \
--max-model-len 262144 --max-num-batched-tokens 2096 --max-num-seqs 256 \
--enable-prefix-caching --disable-custom-all-reduce --trust-remote-code \
--reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_xml
Notes: hybrid Gated-DeltaNet attention requires --max-num-batched-tokens 2096 (cache alignment);
head_dim=256 auto-selects the Triton attention backend.
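
Once the server is running, it can be queried through vLLM's OpenAI-compatible chat completions endpoint. The sketch below uses only the Python standard library and assumes the server listens on vLLM's default address, http://localhost:8000. The get_weather tool is hypothetical and exists only to show the shape of a tool-calling request, which the --enable-auto-tool-choice and --tool-call-parser flags above are meant to support.

```python
import json
import urllib.request

MODEL = "kyaky/Qwen-AgentWorld-35B-A3B-NVFP4"
BASE_URL = "http://localhost:8000/v1"  # assumption: vLLM default host and port

# Hypothetical tool definition in the OpenAI function-calling format.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_payload(prompt, tools=None, max_tokens=256):
    """Build a chat completions request body for a single user message."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
        "max_tokens": max_tokens,
    }
    if tools:
        payload["tools"] = tools
        payload["tool_choice"] = "auto"
    return payload


def chat(payload, base_url=BASE_URL, timeout=60):
    """POST the payload to the server and return the parsed JSON response."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Example usage (requires the vLLM server to be running):
# reply = chat(build_payload("What is the weather in Paris?", tools=[WEATHER_TOOL]))
# print(reply["choices"][0]["message"])
```

Building the payload separately from sending it keeps the request shape easy to inspect or log before any network call is made.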
License
Apache-2.0 (inherited from the base model).
Model tree for kyaky/Qwen-AgentWorld-35B-A3B-NVFP4
Base model
Qwen/Qwen3.5-35B-A3B-Base