How to use kyaky/Qwen-AgentWorld-35B-A3B-NVFP4 with libraries, inference servers, and local apps.

Libraries

How to use kyaky/Qwen-AgentWorld-35B-A3B-NVFP4 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="kyaky/Qwen-AgentWorld-35B-A3B-NVFP4")
# The text-generation pipeline takes the chat messages as its first positional argument.
messages = [
    {"role": "user", "content": "What is the capital of France?"},
]
print(pipe(messages))

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("kyaky/Qwen-AgentWorld-35B-A3B-NVFP4")
model = AutoModelForMultimodalLM.from_pretrained("kyaky/Qwen-AgentWorld-35B-A3B-NVFP4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use kyaky/Qwen-AgentWorld-35B-A3B-NVFP4 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "kyaky/Qwen-AgentWorld-35B-A3B-NVFP4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "kyaky/Qwen-AgentWorld-35B-A3B-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

SGLang

How to use kyaky/Qwen-AgentWorld-35B-A3B-NVFP4 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "kyaky/Qwen-AgentWorld-35B-A3B-NVFP4" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "kyaky/Qwen-AgentWorld-35B-A3B-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "kyaky/Qwen-AgentWorld-35B-A3B-NVFP4" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "kyaky/Qwen-AgentWorld-35B-A3B-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use kyaky/Qwen-AgentWorld-35B-A3B-NVFP4 with Docker Model Runner:
```
docker model run hf.co/kyaky/Qwen-AgentWorld-35B-A3B-NVFP4
```

Qwen-AgentWorld-35B-A3B-NVFP4

NVFP4 (mixed-precision, compressed-tensors) quantization of Qwen/Qwen-AgentWorld-35B-A3B, produced with llm-compressor for fast inference with vLLM on NVIDIA Blackwell (sm120 / RTX PRO 6000).

Benchmark vs official BF16

Measured on the same hardware (1× RTX PRO 6000 Blackwell) and identical vLLM 0.23 config (--max-model-len 8192 --max-num-seqs 64 --gpu-memory-utilization 0.90, temperature=0):

| Metric | Official BF16 | NVFP4 (this model) | Δ |
|---|---|---|---|
| Disk size | 66 GB | 24.96 GB | −62% |
| First-token latency (TTFT) | 35 ms | 32 ms | −8.6% |
| Single-stream decode | 157.9 tok/s | 184.1 tok/s | +16.6% |
| Concurrent throughput (N=16) | 1351.6 tok/s | 1430.5 tok/s | +5.8% |
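
The Δ column is the relative change from the BF16 figure to the NVFP4 figure. A minimal Python sketch that recomputes it from the numbers in the table (the dictionary keys are just labels chosen for this example):

```python
# Benchmark figures copied from the table: (official BF16, NVFP4).
rows = {
    "Disk size (GB)": (66.0, 24.96),
    "TTFT (ms)": (35.0, 32.0),
    "Single-stream decode (tok/s)": (157.9, 184.1),
    "Concurrent throughput, N=16 (tok/s)": (1351.6, 1430.5),
}


def relative_change(baseline: float, value: float) -> float:
    """Return the change from baseline to value, in percent."""
    return (value - baseline) / baseline * 100.0


# One delta per metric, keyed by the same labels as `rows`.
deltas = {name: relative_change(bf16, nvfp4) for name, (bf16, nvfp4) in rows.items()}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}%")
```

For size and latency a negative Δ is an improvement; for the two throughput rows a positive Δ is.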

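Quality spot check (temperature=0): four prompts (17×23, a factorial function, "why is the sky blue", and echo $((6*7))) were all answered correctly and equivalently to the BF16 reference. On this hardware the NVFP4 build is faster and about 38% of the BF16 size (24.96 GB vs 66 GB), with matching output on these spot checks.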

Quantization

Tool: llm-compressor 0.12, compressed-tensors format (mixed-precision).
Scheme:
- Attention (self_attn.{q,k,v,o}_proj, GDN linear_attn.{in_proj_qkv,in_proj_z,out_proj}) → FP8 (block [128,128]).
- MoE experts (gate_proj/up_proj/down_proj) → NVFP4 (group size 16, fp8_e4m3 scales).
- Left in BF16 (ignored): lm_head, embed_tokens, router mlp.gate, shared expert, GDN state (linear_attn.{A_log,conv1d,in_proj_a,in_proj_b}), first/last MoE layer experts, and the vision tower.
MoE calibration: all 256 experts calibrated (moe_calibrate_all_experts=True) on HuggingFaceH4/ultrachat_200k (256 samples, seq len 2048).
Size: ~25 GB, down from 66 GB for the official BF16 checkpoint (per the benchmark table). No MTP / speculative-decoding module is included.
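
To make the scheme easier to read, here is a rough Python sketch that maps a module name to the precision the list above assigns it. It is not the llm-compressor recipe used for this model. The dotted module paths, the `visual`/`vision` prefix for the vision tower, the `num_layers` value in the usage note, and the assumption that every decoder layer is an MoE layer (so "first/last" means layer 0 and layer `num_layers - 1`) are all illustrative guesses.

```python
import re

# Modules the scheme leaves in BF16 (module paths are assumed, not verified).
BF16_PATTERNS = [
    r"lm_head",
    r"embed_tokens",
    r"mlp\.gate$",                                      # MoE router
    r"shared_expert",
    r"linear_attn\.(A_log|conv1d|in_proj_a|in_proj_b)$",  # GDN state
    r"(visual|vision)",                                 # vision tower (assumed prefix)
]

# Attention projections quantized to FP8 (block [128,128]).
FP8_PATTERNS = [
    r"self_attn\.(q|k|v|o)_proj$",
    r"linear_attn\.(in_proj_qkv|in_proj_z|out_proj)$",
]

# Routed MoE expert projections, quantized to NVFP4 except in the first/last layer.
EXPERT_RE = re.compile(
    r"layers\.(?P<layer>\d+)\.mlp\.experts\.\d+\.(gate|up|down)_proj$"
)


def precision_for(name: str, num_layers: int) -> str:
    """Return "BF16", "FP8", or "NVFP4" for a module name under the scheme."""
    if any(re.search(p, name) for p in BF16_PATTERNS):
        return "BF16"
    if any(re.search(p, name) for p in FP8_PATTERNS):
        return "FP8"
    match = EXPERT_RE.search(name)
    if match:
        layer = int(match.group("layer"))
        # Assumption: layer 0 and layer num_layers - 1 are the first/last MoE layers.
        return "BF16" if layer in (0, num_layers - 1) else "NVFP4"
    return "BF16"  # assumption: anything not targeted stays unquantized


def nvfp4_bits_per_weight(group_size: int = 16, scale_bits: int = 8) -> float:
    """4-bit values plus one FP8 scale per group; ignores any per-tensor scale."""
    return 4 + scale_bits / group_size
```

With a hypothetical `num_layers=40`, `precision_for("model.layers.3.mlp.experts.7.up_proj", 40)` falls in the NVFP4 group, while the same expert in layer 0 or layer 39 stays in BF16. `nvfp4_bits_per_weight()` shows why the expert weights shrink to roughly 4.5 bits each once the per-group scale is counted.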

Serving (vLLM)

This is a text-only deployment: the source model defines a vision tower but sets language_model_only=true. vLLM auto-detects the compressed-tensors format, so no --quantization flag is needed.

vllm serve kyaky/Qwen-AgentWorld-35B-A3B-NVFP4 \
  --max-model-len 262144 --max-num-batched-tokens 2096 --max-num-seqs 256 \
  --enable-prefix-caching --disable-custom-all-reduce --trust-remote-code \
  --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_xml

Notes: hybrid Gated-DeltaNet attention requires --max-num-batched-tokens 2096 (cache alignment); head_dim=256 auto-selects the Triton attention backend.
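
Since the command above enables automatic tool choice, a client can attach tool definitions to a chat request. The sketch below uses only the Python standard library and the OpenAI-compatible request shape. It assumes the server listens on localhost:8000 as in the curl example earlier, and `get_weather` is a hypothetical tool invented purely to show the payload layout.

```python
import json
import urllib.request

MODEL = "kyaky/Qwen-AgentWorld-35B-A3B-NVFP4"
# Assumed endpoint: the same host and port as the curl example for vLLM.
URL = "http://localhost:8000/v1/chat/completions"

# Hypothetical tool, defined only to illustrate the request shape.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_chat_request(prompt, tools=None):
    """Build the JSON body for a single-turn chat completion."""
    body = {"model": MODEL, "messages": [{"role": "user", "content": prompt}]}
    if tools:
        body["tools"] = tools
        body["tool_choice"] = "auto"  # let the model decide whether to call a tool
    return body


def send_chat_request(body, url=URL, timeout=60):
    """POST the body to the server and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_tool_calls(response):
    """Return the tool calls from the first choice, or an empty list."""
    message = response["choices"][0]["message"]
    return message.get("tool_calls") or []
```

With the server running, `send_chat_request(build_chat_request("What is the weather in Paris?", [WEATHER_TOOL]))` sends the request, and `extract_tool_calls` pulls out any calls the model chose to make. Nothing in the block touches the network until `send_chat_request` is called.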

License

Apache-2.0 (inherited from the base model).

Downloads last month: 100

Safetensors
Model size: 22B params
Tensor types: F32, BF16, F8_E4M3

Model tree for kyaky/Qwen-AgentWorld-35B-A3B-NVFP4
Base model: Qwen/Qwen3.5-35B-A3B-Base
Finetuned: Qwen/Qwen-AgentWorld-35B-A3B
Quantized (34): this model