Libraries

How to use MrVolts/Qwen3-30B-A3B-Thinking-2507-NVFP4 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="MrVolts/Qwen3-30B-A3B-Thinking-2507-NVFP4")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("MrVolts/Qwen3-30B-A3B-Thinking-2507-NVFP4")
model = AutoModelForCausalLM.from_pretrained("MrVolts/Qwen3-30B-A3B-Thinking-2507-NVFP4")
messages = [
    {"role": "user", "content": "Who are you?"},
]
# Render the chat template straight to tensors on the model's device
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Generate a short reply and decode only the newly generated tokens
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use MrVolts/Qwen3-30B-A3B-Thinking-2507-NVFP4 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "MrVolts/Qwen3-30B-A3B-Thinking-2507-NVFP4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "MrVolts/Qwen3-30B-A3B-Thinking-2507-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
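
The same request can be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `ask` are illustrative, not part of vLLM, and the default URL assumes the server started above is listening on port 8000.

```python
import json
import urllib.request

MODEL_ID = "MrVolts/Qwen3-30B-A3B-Thinking-2507-NVFP4"


def build_chat_request(prompt, model=MODEL_ID, base_url="http://localhost:8000"):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the request and return the assistant text (requires a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `ask("What is the capital of France?")` should return the reply text. Passing `base_url="http://localhost:30000"` targets a server on a different port.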

Use Docker

docker model run hf.co/MrVolts/Qwen3-30B-A3B-Thinking-2507-NVFP4

SGLang

How to use MrVolts/Qwen3-30B-A3B-Thinking-2507-NVFP4 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "MrVolts/Qwen3-30B-A3B-Thinking-2507-NVFP4" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "MrVolts/Qwen3-30B-A3B-Thinking-2507-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "MrVolts/Qwen3-30B-A3B-Thinking-2507-NVFP4" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "MrVolts/Qwen3-30B-A3B-Thinking-2507-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use MrVolts/Qwen3-30B-A3B-Thinking-2507-NVFP4 with Docker Model Runner:
```
docker model run hf.co/MrVolts/Qwen3-30B-A3B-Thinking-2507-NVFP4
```

Qwen3-30B-A3B-Thinking-2507-NVFP4

This is a 4-bit NVFP4 quantized version of Qwen/Qwen3-30B-A3B-Thinking-2507, compressed using llmcompressor.

Model Description

This model compresses the original 30B-parameter Qwen3 thinking model from roughly 60GB in BF16 to about 18GB, a reduction of about 70%, while retaining most of its reasoning capability. Quantization uses NVFP4, NVIDIA's 4-bit floating-point format, which has native hardware support on NVIDIA GPUs with the Blackwell architecture.
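
To make the 4-bit floating-point idea concrete, here is a rough pure-Python sketch of block-scaled FP4 rounding. It assumes the publicly described NVFP4 layout (E2M1 values with one shared scale per block of 16 elements) and deliberately omits the FP8 rounding of the block scale and the per-tensor scale, so it illustrates the principle and is not the kernel llmcompressor or vLLM actually runs.

```python
import math

# Magnitudes representable by E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # assumed NVFP4 block size: one scale per 16 values


def quantize_block(values):
    """Fake-quantize one block: scale into [-6, 6], snap to the grid, scale back."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return [0.0] * len(values)
    scale = amax / E2M1_GRID[-1]  # largest magnitude maps to 6.0
    out = []
    for v in values:
        nearest = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append(math.copysign(nearest * scale, v))
    return out


def quantize(weights):
    """Apply block-wise fake quantization to a flat list of weights."""
    out = []
    for start in range(0, len(weights), BLOCK_SIZE):
        out.extend(quantize_block(weights[start:start + BLOCK_SIZE]))
    return out
```

Because each block carries its own scale, an outlier only coarsens the 16 values it shares a block with rather than the whole tensor.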

Quantization Details

Method: NVFP4 (NVIDIA 4-bit Floating Point)
Tool: llmcompressor v0.3.0+
Original Size: ~60GB in BF16, ~120GB in FP32
Compressed Size: ~18GB
Compression Ratio: ~3.3x vs. BF16, ~6.7x vs. FP32
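
As a quick sanity check, the approximate sizes listed above can be plugged into a short calculation to see what ratio each baseline implies. The sketch below uses only the rounded figures from this card (60GB and 120GB for the original, 18GB compressed); the function name is illustrative.

```python
def compression_summary(original_gb, compressed_gb):
    """Return (size ratio, fractional reduction) for a pair of checkpoint sizes."""
    ratio = original_gb / compressed_gb
    reduction = 1 - compressed_gb / original_gb
    return ratio, reduction


for label, original_gb in [("BF16", 60), ("FP32", 120)]:
    ratio, reduction = compression_summary(original_gb, 18)
    print(f"{label}: {ratio:.1f}x smaller, {reduction:.0%} reduction")
# BF16: 3.3x smaller, 70% reduction
# FP32: 6.7x smaller, 85% reduction
```

The ratio stays below the nominal 4x (16-bit to 4-bit) because block scales add overhead and several layers remain unquantized.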

Quantization Configuration

targets: Linear
scheme: NVFP4
ignore:
  - lm_head
  - model.embed_tokens
  - re:.*input_layernorm$
  - re:.*post_attention_layernorm$
  - model.norm
  - re:.*mlp.gate$

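Layers kept unquantized (at their original precision):

Output head (lm_head)
Embeddings (model.embed_tokens)
Layer normalization layers (input_layernorm, post_attention_layernorm, model.norm)
MoE router gates (mlp.gate)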
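
The ignore list mixes exact names with `re:`-prefixed regular expressions. The sketch below is a simplified approximation of that matching, not llmcompressor's actual implementation, and the module names in the usage lines are assumed examples that follow the usual Transformers naming for Qwen3 MoE layers.

```python
import re

IGNORE = [
    "lm_head",
    "model.embed_tokens",
    "re:.*input_layernorm$",
    "re:.*post_attention_layernorm$",
    "model.norm",
    "re:.*mlp.gate$",
]


def is_ignored(module_name, ignore=IGNORE):
    """Return True if a module name matches an exact entry or a 're:' pattern."""
    for pattern in ignore:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], module_name):
                return True
        elif pattern == module_name:
            return True
    return False


print(is_ignored("model.layers.0.mlp.gate"))                 # True: router stays unquantized
print(is_ignored("model.layers.0.mlp.experts.3.gate_proj"))  # False: expert weights are quantized
```

The `$` anchor is what keeps `gate_proj` inside the experts from matching the router pattern. The unescaped `.` in `mlp.gate` matches any character, which is harmless for these names.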

Calibration Dataset

The model was calibrated using 1,250 samples from the NVIDIA Llama-Nemotron Post-Training Dataset:

250 samples from math split
250 samples from code split
250 samples from science split
250 samples from chat split
250 samples from safety split

All samples were filtered for:

Reasoning mode enabled (reasoning: on)
Maximum sequence length of 20,000 tokens
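
The selection described above can be sketched as follows. This is a reconstruction under assumptions, not the script used for this model: the record fields `split`, `reasoning`, and `num_tokens` are hypothetical, and random sampling with a fixed seed is assumed because the card does not say how the 250 samples per split were chosen.

```python
import random

SPLITS = ["math", "code", "science", "chat", "safety"]


def select_calibration_samples(records, splits=SPLITS, per_split=250,
                               max_tokens=20_000, seed=0):
    """Keep reasoning-on records within the length limit, then sample per split."""
    rng = random.Random(seed)
    selected = []
    for split in splits:
        eligible = [
            r for r in records
            if r["split"] == split
            and r["reasoning"] == "on"
            and r["num_tokens"] <= max_tokens
        ]
        rng.shuffle(eligible)
        selected.extend(eligible[:per_split])
    return selected
```

With five splits and 250 samples each, this yields the 1,250 calibration samples, provided every split has at least 250 eligible records.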

Usage

With Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MrVolts/Qwen3-30B-A3B-Thinking-2507-NVFP4"

# Load the quantized model; "auto" keeps the dtype stored in the checkpoint
# (loading llmcompressor checkpoints typically requires the compressed-tensors package)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Use the model
messages = [
    {"role": "user", "content": "Solve this step by step: What is 25 * 48?"}
]

# Render the chat template and append the assistant turn marker so the model starts its reply
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True
)

# Decode only the newly generated tokens, not the prompt
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
response = tokenizer.decode(new_tokens, skip_special_tokens=True)
print(response)
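
Because this is a thinking model, the generated text usually contains a reasoning trace followed by the final answer. The helper below is a small sketch for separating the two. It assumes the text holds only the generated continuation, that the reasoning ends with a `</think>` marker as in the upstream Qwen3 thinking models, and that the marker survives decoding; `split_thinking` is an illustrative name, not a Transformers API.

```python
def split_thinking(text, close_tag="</think>", open_tag="<think>"):
    """Split generated text into (reasoning, answer) around the last closing tag."""
    reasoning, sep, answer = text.rpartition(close_tag)
    if not sep:
        # No closing tag: generation was probably cut off mid-reasoning.
        return text.replace(open_tag, "").strip(), ""
    return reasoning.replace(open_tag, "").strip(), answer.strip()


reasoning, answer = split_thinking("<think>\n25 * 48 = 1200\n</think>\n\nThe answer is 1200.")
print(answer)  # The answer is 1200.
```

If the answer comes back empty, the token budget likely ran out before the reasoning finished, and a larger `max_new_tokens` is worth trying.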

With vLLM (Recommended for Production)

NVFP4 quantized models are optimized for deployment with vLLM:

from vllm import LLM, SamplingParams

model_id = "MrVolts/Qwen3-30B-A3B-Thinking-2507-NVFP4"

llm = LLM(model=model_id)
sampling_params = SamplingParams(temperature=0.7, max_tokens=512)

# llm.chat applies the model's chat template before generating
messages = [{"role": "user", "content": "Solve step by step: What is 25 * 48?"}]
outputs = llm.chat(messages, sampling_params)

for output in outputs:
    print(output.outputs[0].text)

Performance Characteristics

Advantages

Memory Efficiency: ~70% less weight memory than BF16 (~18GB vs. ~60GB)
Faster Inference: Less memory traffic per token, which can speed up token generation
Deployment Flexibility: Can run on GPUs with smaller VRAM
Preserved Quality: Sensitive layers (output head, embeddings, normalization, router gates) are left unquantized

Trade-offs

Slight accuracy degradation compared to full precision model
Best performance on NVIDIA GPUs with FP4 support
May require specific deployment frameworks for optimal performance

Limitations

This is a quantized model with some accuracy trade-offs
Performance is optimized for NVIDIA GPUs
Not all inference frameworks support NVFP4 format natively
The model retains the same context length limitations as the original

Citation

If you use this model, please cite both the original model and the quantization method:

@misc{qwen3-thinking-2507,
  title={Qwen3-30B-A3B-Thinking-2507},
  author={Qwen Team},
  year={2025},
  publisher={Hugging Face}
}

@software{llmcompressor,
  title={LLM Compressor},
  author={vLLM Team},
  url={https://github.com/vllm-project/llm-compressor},
  year={2024}
}

License

This model follows the same license as the original Qwen3-30B-A3B-Thinking-2507 model.

Acknowledgments

Original model by Qwen Team
Quantization performed using llmcompressor by vLLM Team
Calibration dataset provided by NVIDIA

Safetensors

Model size: 17B params
Tensor type: F32, BF16, F8_E4M3
