Nemotron-3-Nano-30B-A3B (context tower) — MLX bfloat16 (unquantized)

The autoregressive context tower of nvidia/Nemotron-Labs-TwoTower-30B-A3B-Base-BF16, extracted and converted to MLX in unquantized bfloat16. This is the frozen base model, a NemotronH hybrid with 52 layers (23 Mamba-2, 6 attention, 23 MoE), 128 experts (6 active + 1 shared), and ~30B params (~3B active). It runs directly in stock mlx-lm as an ordinary text model, with no custom code.

This is the AR backbone only. For the actual two-tower diffusion behavior, use the Nemotron-Labs-TwoTower-30B-A3B-mlx-* repos below.

📦 Usage guide, examples & benchmarks: https://github.com/PipeNetwork/nemotron-twotower-mlx

Usage

pip install mlx-lm
mlx_lm.generate --model pipenetwork/Nemotron-3-Nano-30B-A3B-context-mlx-bf16 \
  --prompt "The capital of France is" --max-tokens 128

from mlx_lm import load, generate
model, tok = load("pipenetwork/Nemotron-3-Nano-30B-A3B-context-mlx-bf16")
print(generate(model, tok, prompt="The key idea behind Mamba is", max_tokens=128))

Model family

Format	AR / context tower	Full TwoTower diffusion
4bit	4bit	4bit
6bit	6bit	6bit
8bit	8bit	8bit
bf16	bf16 ◄ (this repo)	bf16

Validation

The MLX conversion was checked against NVIDIA's reference implementation running on an NVIDIA GB10 (CUDA). Greedy decoding matched token-for-token: 120/120 tokens (100%), with 5/5 top-1 agreement across the test prompts. For example, both implementations produce "George Washington. He was elected in 1789 and served two terms until 1797." The AR/context tower is exercised directly; it is the shared backbone that the diffusion denoiser also uses.
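
A token-for-token comparison of this kind can be sketched as follows. This is a minimal illustration, not the script used for the validation above: the function name `greedy_match` and the toy token IDs are hypothetical, and in practice the two lists would come from greedy decoding with the reference implementation and with the MLX build.

```python
from typing import Sequence


def greedy_match(reference: Sequence[int], candidate: Sequence[int]) -> tuple[int, int]:
    """Return (matching_positions, compared_positions) for two token-ID sequences.

    Positions are compared pairwise. If the lengths differ, the missing
    positions count as mismatches, so truncated output cannot score 100%.
    """
    total = max(len(reference), len(candidate))
    matches = sum(r == c for r, c in zip(reference, candidate))
    return matches, total


# Toy token IDs (hypothetical), standing in for two greedy decodes.
ref_ids = [101, 7, 42, 42, 9]
mlx_ids = [101, 7, 42, 42, 9]

matches, total = greedy_match(ref_ids, mlx_ids)
print(f"{matches}/{total} tokens match ({100 * matches / total:.0f}%)")
```

Summing the per-prompt counts over all test prompts gives a figure in the same form as the 120/120 reported above.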

Benchmarks

Measured on an Apple M3 Ultra (512 GB unified memory) with MLX 0.31, at steady state (post-warmup). Peak RAM is the unified-memory high-water mark during generation.

AR / context tower — 128-token single-stream generation:

Quant	Size	Generation	Peak RAM
4-bit	17 GB	16.1 tok/s	17.9 GB
6-bit	24 GB	13.2 tok/s	25.7 GB
8-bit	31 GB	13.3 tok/s	33.6 GB
bf16	59 GB	13.3 tok/s	63.2 GB
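
The Size column is roughly what a bits-per-weight estimate predicts. The sketch below rests on several assumptions that are not stated in this card: about 30B parameters, decimal gigabytes, and group-wise quantization that stores one 16-bit scale and one 16-bit bias per group of 64 weights (assumed here to be the MLX default). The helper name `estimated_size_gb` is hypothetical, and any layers left unquantized will shift the numbers slightly.

```python
def estimated_size_gb(n_params: float, bits: int, group_size: int = 64,
                      quantized: bool = True) -> float:
    """Rough weight size in decimal GB for a given precision."""
    bits_per_weight = float(bits)
    if quantized:
        # Assumed overhead: a 16-bit scale and a 16-bit bias per group.
        bits_per_weight += 32 / group_size
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 30e9  # "~30B params" from the description; an approximation

for label, bits, quantized in [
    ("4-bit", 4, True),
    ("6-bit", 6, True),
    ("8-bit", 8, True),
    ("bf16", 16, False),
]:
    size = estimated_size_gb(N_PARAMS, bits, quantized=quantized)
    print(f"{label}: ~{size:.1f} GB")
```

Under these assumptions the estimates come out near 16.9, 24.4, 31.9 and 60 GB, close to the 17, 24, 31 and 59 GB listed in the table.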

TwoTower diffusion — 64 new tokens, block size 16, ≤16 steps/block:

Quant	Size	Throughput	Denoiser evals	Peak RAM
4-bit	34 GB	3.8 tok/s	64	37.1 GB
6-bit	48 GB	3.3 tok/s	64	52.5 GB
8-bit	63 GB	3.4 tok/s	64	67.9 GB
bf16	118 GB	1.5 tok/s	39	136.9 GB

Diffusion runs up to steps_per_block denoiser passes per block, so it is slower per token than the AR tower; lowering --steps-per-block trades quality for speed. Higher-precision builds tend to converge in fewer denoiser evaluations (39 for bf16 versus 64 for the quantized builds above), but each pass moves more memory.
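
The upper bound on denoiser evaluations follows from the benchmark settings. The sketch below is a worked example only: `max_denoiser_evals` is a hypothetical helper, not part of the linked repository, and it assumes blocks are generated one after another and that a partial final block still costs up to a full block's steps.

```python
import math


def max_denoiser_evals(new_tokens: int, block_size: int, steps_per_block: int) -> int:
    """Upper bound on denoiser passes: number of blocks times steps per block."""
    if min(new_tokens, block_size, steps_per_block) <= 0:
        raise ValueError("all arguments must be positive")
    n_blocks = math.ceil(new_tokens / block_size)
    return n_blocks * steps_per_block


# Benchmark settings from the table above: 64 new tokens, block size 16,
# at most 16 steps per block, i.e. 4 blocks x 16 steps.
print(max_denoiser_evals(64, 16, 16))
```

The 64 evaluations reported for the quantized builds equal this bound, while the bf16 run stopped at 39.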

License

Governed by the NVIDIA Open Model License of the base model.
