Apriel 1.6 15B Thinker - NVFP4

This is an NVFP4-quantized version of ServiceNow-AI/Apriel-1.6-15b-Thinker, optimized for NVIDIA Blackwell GPUs (RTX 50xx series).

✅ FULLY FUNCTIONAL - This model loads and generates correctly in vLLM 0.15.1!

Note: HuggingFace may incorrectly display this as 9B params, most likely because the 4-bit weights are stored in packed form and the automatic count sees fewer tensor elements. This is a 15B parameter model compressed to 4-bit precision.

Model Details

Base Model: Apriel 1.6 15B Thinker (Multimodal reasoning model)
Parameters: 15 billion (not 9B - HF auto-detection is confused by quantization)
Quantization: NVFP4 (4-bit floating point)
Size: 11GB (down from 29GB, 62% reduction)
Quantizer: SILVERTHRONE
Method: llmcompressor with 128 calibration samples
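
The size figures above can be sanity-checked with a few lines of arithmetic. This is a minimal sketch that treats 1 GB as 10**9 bytes; the helper names are illustrative and not part of any library.

```python
def size_reduction(original_gb: float, quantized_gb: float) -> float:
    """Fractional size reduction, e.g. 0.62 for a checkpoint that is 62% smaller."""
    return 1.0 - quantized_gb / original_gb


def average_bits_per_param(size_gb: float, params_billion: float) -> float:
    """Average storage cost per parameter, assuming 1 GB = 10**9 bytes."""
    return size_gb * 8.0 / params_billion


if __name__ == "__main__":
    print(f"reduction: {size_reduction(29, 11):.0%}")                # 62%
    print(f"avg bits/param: {average_bits_per_param(11, 15):.2f}")  # 5.87
```

An average above 4 bits per parameter is expected here: some components are kept at full precision, and NVFP4 stores scale factors alongside the 4-bit values.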

Quantization Strategy

✅ Quantized to NVFP4: All linear layers in the language model (except lm_head)
❌ Preserved at full precision: Vision encoder, embeddings, lm_head, multimodal projector

Calibration:

Dataset: open-platypus
Samples: 128
Sequence length: 512

This selective approach maintains multimodal quality while achieving significant size reduction.
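
The selection logic can be illustrated without running a quantizer. The sketch below classifies module names with regular expressions; the patterns and the example module names are assumptions made for illustration, not the recipe actually used for this checkpoint. In a real llmcompressor recipe, equivalent patterns would typically be passed to the quantization modifier's ignore list.

```python
import re

# Hypothetical patterns for the components kept at full precision.
IGNORE_PATTERNS = [
    r".*vision.*",                 # vision encoder
    r".*embed_tokens.*",           # embeddings
    r".*lm_head.*",                # output head
    r".*multi_modal_projector.*",  # multimodal projector
]


def is_preserved(module_name, patterns=IGNORE_PATTERNS):
    """True if the module name matches any ignore pattern."""
    return any(re.fullmatch(p, module_name) for p in patterns)


def split_modules(module_names):
    """Split names into (quantized, preserved), keeping the input order."""
    quantized = [n for n in module_names if not is_preserved(n)]
    preserved = [n for n in module_names if is_preserved(n)]
    return quantized, preserved


if __name__ == "__main__":
    # Illustrative names only; inspect the real model for the actual ones.
    example = [
        "language_model.model.layers.0.self_attn.q_proj",
        "language_model.model.layers.0.mlp.down_proj",
        "language_model.model.embed_tokens",
        "language_model.lm_head",
        "vision_tower.encoder.layers.0.attn.qkv",
        "multi_modal_projector.linear_1",
    ]
    quantized, preserved = split_modules(example)
    print("quantize:", quantized)
    print("preserve:", preserved)
```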

Hardware Requirements

GPU: NVIDIA RTX 50xx series (Blackwell) or newer
VRAM: 14-16GB minimum
Framework: vLLM 0.15.1+

Note: This model will NOT load in llama.cpp, the transformers library, or GGUF-based tools. The transformers package is still needed, but only to load the tokenizer.

Usage

With vLLM

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

# Load tokenizer for chat template
tokenizer = AutoTokenizer.from_pretrained("SILVERTHRONE/apriel-1.6-15b-thinker-nvfp4")

# Load model (CRITICAL: set max_model_len=2048 and gpu_memory_utilization=0.90)
model = LLM(
    model="SILVERTHRONE/apriel-1.6-15b-thinker-nvfp4",
    gpu_memory_utilization=0.90,
    max_model_len=2048,  # Required! Model is 11GB, need tight memory settings
    enforce_eager=True,  # Skip CUDA graph capture to save VRAM
    max_num_seqs=1,  # Process one sequence at a time
)

# Format message with chat template
messages = [{"role": "user", "content": "What is 2+2?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate
sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=200
)

outputs = model.generate(prompt, sampling_params)
response = outputs[0].outputs[0].text

print(response)

Chat Template

Apriel uses a custom reasoning format with a [BEGIN FINAL RESPONSE] marker:

<s><|begin_system|>
You are a thoughtful, systematic AI assistant from ServiceNow Language Models (SLAM) lab. 
Analyze each question carefully, present your reasoning step-by-step, then provide the 
final response after the marker [BEGIN FINAL RESPONSE].
<|begin_user|>
What is 2+2?
<|begin_assistant|>
Here are my reasoning steps:
The user asks "What is 2+2?" It's a simple arithmetic question. The answer is 4.
[BEGIN FINAL RESPONSE]
2 + 2 = 4.

The model outputs reasoning steps first, then marks the final answer with [BEGIN FINAL RESPONSE]. Parse responses to extract the answer after this marker.
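
A minimal sketch of that parsing step, assuming the marker appears at most once in the output. The function name is illustrative. If generation stops before the marker is emitted (for example, when max_tokens is reached), the answer comes back empty.

```python
FINAL_MARKER = "[BEGIN FINAL RESPONSE]"


def split_reasoning(text: str, marker: str = FINAL_MARKER) -> tuple[str, str]:
    """Return (reasoning, final_answer) split at the first marker.

    If the marker is missing, the whole text is treated as reasoning
    and the final answer is an empty string.
    """
    reasoning, found, answer = text.partition(marker)
    if not found:
        return text.strip(), ""
    return reasoning.strip(), answer.strip()


if __name__ == "__main__":
    sample = (
        'The user asks "What is 2+2?" It\'s a simple arithmetic question. '
        "The answer is 4.\n[BEGIN FINAL RESPONSE]\n2 + 2 = 4."
    )
    reasoning, answer = split_reasoning(sample)
    print(answer)  # 2 + 2 = 4.
```

Depending on how the output is decoded, trailing special tokens may also need stripping; that is not handled here.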

Performance

Inference Speed: ~25 tokens/sec output on RTX 5060 Ti (16GB)
Quality: Good - coherent reasoning, correct answers on factual questions
Memory: Requires 14GB+ VRAM with tight settings

Example output:

Input: "What is 2+2?"
Output: 
The user asks "What is 2+2?" It's a simple arithmetic question. The answer is 4.
[BEGIN FINAL RESPONSE]
2 + 2 = 4.

Known Limitations

Does NOT work with transformers library - Crashes with KeyError: 'weight_scale' due to incomplete NVFP4 support in compressed-tensors for VLM architectures
vLLM only - This is by design; NVFP4 is optimized for vLLM inference
Tight memory requirements - Must use max_model_len=2048 and gpu_memory_utilization=0.90 to fit in 16GB VRAM
Small context window - Limited to 2048 tokens due to VRAM constraints on 16GB GPUs
VLM-specific bug - The weight_scale error affects vision-language models but not pure text models
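
The memory constraint can be made concrete with a rough budget. vLLM limits itself to the gpu_memory_utilization fraction of GPU memory, and the weights, activations, and KV cache all have to fit inside that budget. The sketch below is a back-of-the-envelope estimate that ignores the difference between GB and GiB and any memory held by other processes; the helper name is illustrative.

```python
def vram_headroom_gb(total_vram_gb: float,
                     gpu_memory_utilization: float,
                     weights_gb: float) -> float:
    """Memory left for activations and KV cache after loading the weights."""
    budget = total_vram_gb * gpu_memory_utilization
    return budget - weights_gb


if __name__ == "__main__":
    # 16 GB card, 0.90 utilization, 11 GB of weights
    print(f"{vram_headroom_gb(16, 0.90, 11):.1f} GB")  # 3.4 GB
```

With only a few GB left after the weights are loaded, a short max_model_len is what keeps the KV cache within budget.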

Testing Environment

GPU: NVIDIA GeForce RTX 5060 Ti (16GB VRAM, Blackwell)
CPU: AMD Ryzen 9 5950X
RAM: 32GB DDR4 2666MHz
OS: Ubuntu 24.04.3 LTS (Noble)
vLLM: 0.15.1
transformers: 4.57.3 (for tokenizer only)
llmcompressor: Latest (Feb 2026)
NVIDIA Driver: 580.126.09

Comparison: VLM NVFP4 Challenges

Vision-language models (LLaVA-style architectures such as Apriel) have additional challenges with NVFP4:

transformers library cannot load them (weight_scale bug)
Larger memory footprint (11GB model + vision encoder)
Requires very tight vLLM settings

Pure text models (like Nanbeige 4.1 3B) are easier to run with more generous settings.

Credits

Original Model: ServiceNow-AI
Quantization: SILVERTHRONE (@SILVERTHRONE)
Method: Neural Magic llmcompressor

License

MIT (inherited from base model)
