Apriel 1.6 15B Thinker - NVFP4

This is an NVFP4 quantized version of ServiceNow-AI/Apriel-1.6-15b-Thinker, optimized for NVIDIA Blackwell GPUs (RTX 50xx series).

✅ FULLY FUNCTIONAL - This model loads and generates correctly in vLLM 0.15.1!

Note: Hugging Face may display this model as 9B params because of the quantized storage format. It is a 15B-parameter model compressed to 4-bit precision.
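One plausible explanation for the 9B figure, offered as an assumption rather than something the model card confirms: the page counts tensor elements, and NVFP4 checkpoints typically pack two 4-bit weights into each U8 element and store one F8_E4M3 scale per block of 16 weights. The sketch below works through that arithmetic. The 13.7B/1.3B split between quantized and preserved parameters is hypothetical, picked only so the totals add up to 15B.

```python
def displayed_param_count(quantized: float, unquantized: float, block_size: int = 16) -> float:
    """Estimate the element count a tensor-counting tool would report."""
    packed_weights = quantized / 2         # assumption: two 4-bit weights per uint8 element
    block_scales = quantized / block_size  # assumption: one FP8 scale per block of weights
    return unquantized + packed_weights + block_scales


# Hypothetical split of the 15B parameters (not taken from the model card).
quantized, unquantized = 13.7e9, 1.3e9
shown = displayed_param_count(quantized, unquantized)
print(f"true: {(quantized + unquantized) / 1e9:.1f}B, displayed: {shown / 1e9:.1f}B")
```

Under these assumptions the displayed count lands well below the true count, which would be consistent with the mismatch described above.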

Model Details

  • Base Model: Apriel 1.6 15B Thinker (Multimodal reasoning model)
  • Parameters: 15 billion (not 9B; the Hugging Face parameter count is thrown off by quantization)
  • Quantization: NVFP4 (4-bit floating point)
  • Size: 11GB (down from 29GB, 62% reduction)
  • Quantizer: SILVERTHRONE
  • Method: llmcompressor with 128 calibration samples

Quantization Strategy

  • Quantized to NVFP4: All linear layers except the components preserved at full precision
  • Preserved at full precision: Vision encoder, embeddings, lm_head, multimodal projector
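For readers unfamiliar with the format, here is a minimal sketch of NVFP4-style block quantization in plain Python. It assumes the usual description of the format: 4-bit E2M1 values (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) sharing one scale per block of 16 weights. Real kernels also store that scale in FP8 and apply a per-tensor scale, which this sketch leaves out. The function names are illustrative and not part of llmcompressor or vLLM.

```python
# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block(block):
    """Map one block of weights to (scale, E2M1 values)."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0.0, [0.0] * len(block)
    scale = amax / 6.0  # the largest weight lands on the largest grid value
    codes = []
    for x in block:
        target = abs(x) / scale
        magnitude = min(E2M1_GRID, key=lambda g: abs(target - g))  # round to nearest
        codes.append(magnitude if x >= 0 else -magnitude)
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate weights from the scale and E2M1 values."""
    return [scale * c for c in codes]


block = [0.1 * i for i in range(-8, 8)]  # 16 example weights
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(f"scale={scale:.4f}, max abs error={max_error:.4f}")
```

The coarse grid is why sensitive components are often left unquantized: every weight in a block is snapped to one of only eight magnitudes.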

Calibration:

  • Dataset: open-platypus
  • Samples: 128
  • Sequence length: 512

This selective approach is intended to preserve multimodal quality while cutting the model size by 62%.
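The strategy and calibration settings above could be expressed roughly as follows with llmcompressor. This is a sketch, not the exact script used for this checkpoint. The ignore patterns are assumptions modeled on common LLaVA-style module names, and whether the string-based model loader handles this multimodal architecture is not confirmed here. The embeddings need no entry because only `Linear` modules are targeted. The helper `is_ignored` is hypothetical and only approximates how `re:` patterns select modules.

```python
import re

# Assumed module-name patterns for the components kept at full precision.
IGNORE = ["re:.*lm_head", "re:.*vision_tower.*", "re:.*multi_modal_projector.*"]


def is_ignored(module_name: str) -> bool:
    """Approximate the 're:' pattern matching used for the ignore list."""
    return any(re.match(p[len("re:"):], module_name) for p in IGNORE)


def quantize(output_dir: str) -> None:
    """Run one-shot NVFP4 quantization (requires llmcompressor and a GPU)."""
    # Imported lazily so the pattern helper above works without llmcompressor.
    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier

    recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=IGNORE)
    oneshot(
        model="ServiceNow-AI/Apriel-1.6-15b-Thinker",
        dataset="open_platypus",       # llmcompressor's name for the calibration set
        recipe=recipe,
        output_dir=output_dir,
        max_seq_length=512,            # sequence length from the list above
        num_calibration_samples=128,   # sample count from the list above
    )


# Hypothetical module names, used only to check the patterns.
print(is_ignored("model.vision_tower.encoder.layers.0.mlp.fc1"))
print(is_ignored("model.language_model.layers.0.mlp.down_proj"))
```

Checking the patterns against `model.named_modules()` before a long calibration run is a cheap way to confirm that the intended components are skipped.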

Hardware Requirements

  • GPU: NVIDIA RTX 50xx series (Blackwell) or newer
  • VRAM: 16GB tested; roughly 14GB is the minimum with tight settings
  • Framework: vLLM 0.15.1+

Note: This model will NOT work with llama.cpp, the transformers library, or GGUF-based tools.

Usage

With vLLM

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

# Load tokenizer for chat template
tokenizer = AutoTokenizer.from_pretrained("SILVERTHRONE/apriel-1.6-15b-thinker-nvfp4")

# Load model (CRITICAL: set max_model_len=2048 and gpu_memory_utilization=0.90)
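model = LLM(
    model="SILVERTHRONE/apriel-1.6-15b-thinker-nvfp4",
    gpu_memory_utilization=0.90,  # Fraction of VRAM vLLM may claim
    max_model_len=2048,  # Required on 16GB GPUs: the 11GB of weights leave little room for the KV cache
    enforce_eager=True,  # Skip CUDA graph capture to save memory
    max_num_seqs=1,  # Serve one sequence at a time
)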

# Format message with chat template
messages = [{"role": "user", "content": "What is 2+2?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate
sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=200
)

outputs = model.generate(prompt, sampling_params)
response = outputs[0].outputs[0].text

print(response)

Chat Template

Apriel uses a custom reasoning format with a [BEGIN FINAL RESPONSE] marker:

<s><|begin_system|>
You are a thoughtful, systematic AI assistant from ServiceNow Language Models (SLAM) lab. 
Analyze each question carefully, present your reasoning step-by-step, then provide the 
final response after the marker [BEGIN FINAL RESPONSE].
<|begin_user|>
What is 2+2?
<|begin_assistant|>
Here are my reasoning steps:
The user asks "What is 2+2?" It's a simple arithmetic question. The answer is 4.
[BEGIN FINAL RESPONSE]
2 + 2 = 4.

The model outputs reasoning steps first, then marks the final answer with [BEGIN FINAL RESPONSE]. Parse responses to extract the answer after this marker.
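A minimal sketch of that parsing step, assuming the marker appears at most once and that a missing marker means generation stopped before the final answer (for example, when max_tokens is reached). The function name `split_reasoning` is illustrative.

```python
MARKER = "[BEGIN FINAL RESPONSE]"


def split_reasoning(text: str) -> tuple[str, str]:
    """Split raw model output into (reasoning, final_answer)."""
    reasoning, found, answer = text.partition(MARKER)
    if not found:
        # No marker: treat everything as reasoning and report no answer.
        return text.strip(), ""
    return reasoning.strip(), answer.strip()


raw = (
    'The user asks "What is 2+2?" It\'s a simple arithmetic question. The answer is 4.\n'
    "[BEGIN FINAL RESPONSE]\n"
    "2 + 2 = 4."
)
reasoning, answer = split_reasoning(raw)
print(answer)
```

An empty answer is a useful signal to retry with a larger max_tokens, since reasoning can consume the whole budget before the marker is emitted.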

Performance

  • Inference Speed: ~25 tokens/sec output on RTX 5060 Ti (16GB)
  • Quality: Good - coherent reasoning, correct answers on factual questions
  • Memory: Requires 14GB+ VRAM with tight settings

Example output:

Input: "What is 2+2?"
Output: 
The user asks "What is 2+2?" It's a simple arithmetic question. The answer is 4.
[BEGIN FINAL RESPONSE]
2 + 2 = 4.

Known Limitations

  1. Does NOT work with the transformers library - Loading crashes with KeyError: 'weight_scale' due to incomplete NVFP4 support in compressed-tensors for VLM architectures
  2. vLLM only - This is by design; the checkpoint is packaged for NVFP4 inference in vLLM
  3. Tight memory requirements - Must use max_model_len=2048 and gpu_memory_utilization=0.90 to fit in 16GB VRAM
  4. Small context window - Limited to 2048 tokens due to VRAM constraints on 16GB GPUs
  5. VLM-specific bug - The weight_scale error affects vision-language models but not pure text models
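The memory limits above follow from simple arithmetic, sketched here under the assumption that the 11GB checkpoint size approximates the weight footprint in VRAM. Real usage also depends on the driver, the CUDA context, and how the card reports its capacity, so the numbers are rough. The function name is illustrative.

```python
def vram_budget_gb(total_gb: float, utilization: float, weights_gb: float) -> tuple[float, float]:
    """Return (memory vLLM may claim, memory left after loading weights)."""
    usable = total_gb * utilization
    return usable, usable - weights_gb


usable, headroom = vram_budget_gb(total_gb=16.0, utilization=0.90, weights_gb=11.0)
print(f"{usable:.1f} GB usable, {headroom:.1f} GB left for KV cache and activations")
```

With only a few gigabytes of headroom, a short context (max_model_len=2048) and a single sequence are what keep the KV cache within budget.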

Testing Environment

  • GPU: NVIDIA GeForce RTX 5060 Ti (16GB VRAM, Blackwell)
  • CPU: AMD Ryzen 9 5950X
  • RAM: 32GB DDR4 2666MHz
  • OS: Ubuntu 24.04.3 LTS (Noble)
  • vLLM: 0.15.1
  • transformers: 4.57.3 (for tokenizer only)
  • llmcompressor: Latest (Feb 2026)
  • NVIDIA Driver: 580.126.09

Comparison: VLM NVFP4 Challenges

Vision-language models with a LLaVA-style architecture, such as Apriel, face additional challenges with NVFP4:

  • transformers library cannot load them (weight_scale bug)
  • Larger memory footprint (11GB model + vision encoder)
  • Requires very tight vLLM settings

Pure text models (like Nanbeige 4.1 3B) are easier to run and tolerate more generous settings.

Credits

License

MIT (inherited from base model)
