Text Generation
Transformers
Safetensors
PyTorch
nemotron_h
nvidia
conversational
custom_code
Eval Results

MLX 4-bit quant available (M4 Pro benchmarks included)

#63
by BrendanL79 - opened

MLX 4-bit quant for Apple Silicon:
https://huggingface.co/BrendanL79/Nemotron-3-Nano-30B-A3B-MLX-4bit

Recipe: mlx_lm.convert -q --q-bits 4 --q-group-size 64 (mlx-lm supports nemotron_h out of the box).

Measured on a Mac Mini M4 Pro 24GB, same harness for both models:

metric Nemotron 3 Nano 30B-A3B 4bit Qwen3.6-35B-A3B 4bit
decode (1024 tok, greedy) ~95 tok/s ~92 tok/s
peak memory 17.1 GB, flat across generation 18.7 GB, growing
prefill @1K prompt ~620 tok/s ~686 tok/s
prose perplexity (1024-tok window) 2.98 1.45
chat trigram diversity 0.962 0.913

The flat memory curve is the hybrid-Mamba effect — only 6 of 52 layers carry KV cache. Thinking mode is on by default via the chat template; give it ≥512 tokens or it spends the budget reasoning.

Sign up or log in to comment