Nemotron-3-Nano-30B-A3B (context tower) — MLX bfloat16 (unquantized)

The autoregressive context tower of nvidia/Nemotron-Labs-TwoTower-30B-A3B-Base-BF16, extracted and converted to MLX in unquantized bfloat16. This is the frozen base model (NemotronH hybrid: 52 layers = 23 Mamba-2, 6 attention, 23 MoE; 128 experts, 6 active + 1 shared; ~30B params, ~3B active). It runs directly in stock mlx-lm as an ordinary text model, with no custom code.
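
As a quick sanity check on the figures above, the layer mix and expert routing can be tallied in a few lines. This is only arithmetic on the numbers quoted in this card; the variable names are illustrative and are not taken from the model's config.

```python
# Layer mix quoted in this card (names are illustrative, not config keys).
layer_counts = {"mamba2": 23, "attention": 6, "moe": 23}
total_layers = sum(layer_counts.values())

# Routed experts per MoE layer: 128 total, 6 selected per token, plus 1 shared.
routed_experts = 128
active_routed = 6
shared_experts = 1

# Fraction of routed experts that a single token actually visits.
routed_fraction = active_routed / routed_experts
# Experts evaluated per token in one MoE layer (routed + always-on shared).
experts_per_token = active_routed + shared_experts

print(f"total layers: {total_layers}")
print(f"routed experts used per token: {routed_fraction:.1%}")
print(f"experts evaluated per token per MoE layer: {experts_per_token}")
```

The small routed fraction is consistent with the card's statement that only about 3B of the roughly 30B parameters are active per token.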

This is the AR backbone only. For the actual two-tower diffusion behavior, use the Nemotron-Labs-TwoTower-30B-A3B-mlx-* repos below.

📦 Usage guide, examples & benchmarks: https://github.com/PipeNetwork/nemotron-twotower-mlx

Usage

Command line:

pip install mlx-lm
mlx_lm.generate --model pipenetwork/Nemotron-3-Nano-30B-A3B-context-mlx-bf16 \
  --prompt "The capital of France is" --max-tokens 128

Python:

from mlx_lm import load, generate
model, tok = load("pipenetwork/Nemotron-3-Nano-30B-A3B-context-mlx-bf16")
print(generate(model, tok, prompt="The key idea behind Mamba is", max_tokens=128))

Model family

Format | AR / context tower | Full TwoTower diffusion
4bit | 4bit | 4bit
6bit | 6bit | 6bit
8bit | 8bit | 8bit
bf16 | bf16 ◄ (this repo) | bf16

Validation

The MLX conversion was checked against NVIDIA's reference implementation running on an NVIDIA GB10 (CUDA). Greedy decoding matched token-for-token: 120/120 tokens (100%) and 5/5 top-1 across the test prompts — e.g. both produce "George Washington. He was elected in 1789 and served two terms until 1797." (The AR/context tower is exercised directly; it is the shared backbone the diffusion denoiser also uses.)
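
A token-for-token comparison of this kind can be sketched as below. This is a minimal illustration, not the script used for the validation above: the function name `token_match_rate` and the token ids are made up, and it assumes both outputs are already available as lists of token ids from greedy decoding.

```python
from typing import Sequence


def token_match_rate(reference: Sequence[int], candidate: Sequence[int]) -> float:
    """Fraction of positions where two token-id sequences agree.

    Positions beyond the shorter sequence count as mismatches, so a
    truncated output cannot score 100%.
    """
    longest = max(len(reference), len(candidate))
    if longest == 0:
        return 1.0  # two empty outputs trivially agree
    matches = sum(r == c for r, c in zip(reference, candidate))
    return matches / longest


# Made-up token ids, for illustration only.
ref = [101, 7, 42, 9, 13]
out_same = [101, 7, 42, 9, 13]
out_diff = [101, 7, 42, 8, 13]

print(token_match_rate(ref, out_same))
print(token_match_rate(ref, out_diff))
```

Comparing token ids rather than decoded strings avoids false matches where two different token sequences happen to decode to the same text.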

Benchmarks

Measured on Apple M3 Ultra (512 GB unified memory), MLX 0.31 — steady-state (post-warmup). Peak RAM is the unified-memory high-water mark during generation.

AR / context tower — 128-token single-stream generation:

Quant | Size | Generation | Peak RAM
4-bit | 17 GB | 16.1 tok/s | 17.9 GB
6-bit | 24 GB | 13.2 tok/s | 25.7 GB
8-bit | 31 GB | 13.3 tok/s | 33.6 GB
bf16 | 59 GB | 13.3 tok/s | 63.2 GB
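
One way to read the table above is as a rough guide to which build fits a given memory budget. The sketch below copies the size and peak-RAM figures from the table; the helper names `overhead_gb` and `largest_fitting` are hypothetical. It assumes the measured peak (128-token generation on the M3 Ultra) is representative, which may not hold for longer contexts or other machines.

```python
from typing import Optional

# Size and peak RAM (GB) copied from the AR / context tower table above.
AR_BENCH = {
    "4-bit": {"size_gb": 17.0, "peak_gb": 17.9},
    "6-bit": {"size_gb": 24.0, "peak_gb": 25.7},
    "8-bit": {"size_gb": 31.0, "peak_gb": 33.6},
    "bf16": {"size_gb": 59.0, "peak_gb": 63.2},
}


def overhead_gb(quant: str) -> float:
    """Peak RAM beyond the weights themselves, rounded to 0.1 GB."""
    row = AR_BENCH[quant]
    return round(row["peak_gb"] - row["size_gb"], 1)


def largest_fitting(available_gb: float) -> Optional[str]:
    """Largest build whose measured peak fits in available_gb, else None."""
    by_size = sorted(AR_BENCH, key=lambda q: AR_BENCH[q]["size_gb"], reverse=True)
    for quant in by_size:
        if AR_BENCH[quant]["peak_gb"] <= available_gb:
            return quant
    return None


print(overhead_gb("bf16"))
print(largest_fitting(32.0))
```

In practice some unified memory is also needed by the OS and other processes, so the usable budget is smaller than the installed total.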

TwoTower diffusion — 64 new tokens, block size 16, ≤16 steps/block:

Quant | Size | Throughput | Denoiser evals | Peak RAM
4-bit | 34 GB | 3.8 tok/s | 64 | 37.1 GB
6-bit | 48 GB | 3.3 tok/s | 64 | 52.5 GB
8-bit | 63 GB | 3.4 tok/s | 64 | 67.9 GB
bf16 | 118 GB | 1.5 tok/s | 39 | 136.9 GB

Diffusion runs up to steps_per_block denoiser passes per block, so it is slower per token than the AR tower; lowering --steps-per-block trades quality for speed. In the runs above, only the bf16 build converged early (39 denoiser evaluations against 64 for the quantized builds), but each bf16 pass moves more memory, so its throughput is still the lowest.
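
The evaluation counts in the diffusion table can be related to the settings with a little arithmetic. This sketch assumes that the number of blocks is the new-token count divided by the block size (rounded up) and that each block uses at most steps_per_block passes, matching the "≤16 steps/block" in the benchmark heading. The function name `max_denoiser_evals` is hypothetical and is not part of any library.

```python
import math


def max_denoiser_evals(new_tokens: int, block_size: int, steps_per_block: int) -> int:
    """Upper bound on denoiser passes: one budget of steps per block."""
    blocks = math.ceil(new_tokens / block_size)
    return blocks * steps_per_block


# Benchmark settings from this card: 64 new tokens, block size 16, <=16 steps/block.
budget = max_denoiser_evals(new_tokens=64, block_size=16, steps_per_block=16)
blocks = math.ceil(64 / 16)

# The bf16 run reported 39 evaluations in total.
avg_steps_bf16 = 39 / blocks

print(f"budget: {budget} evals over {blocks} blocks")
print(f"bf16 average: {avg_steps_bf16} steps per block")
```

Under these assumptions the quantized builds used the full budget of 64 passes, while the bf16 build averaged just under 10 passes per block.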

License

Governed by the NVIDIA Open Model License of the base model.
