Nemotron-Labs-TwoTower-30B-A3B — MLX bfloat16 (unquantized)

An MLX port of nvidia/Nemotron-Labs-TwoTower-30B-A3B-Base-BF16, a block-wise autoregressive diffusion language model, running on Apple Silicon.

The model pairs two NemotronH towers: a frozen context tower, which prefills the prompt into KV and Mamba states, and a denoiser tower, which is adaLN-conditioned on the diffusion timestep. Generation is block-wise mask diffusion: each block starts fully masked and is iteratively denoised, committing high-confidence tokens and remasking the rest.
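
The commit-and-remask loop described above can be illustrated with a toy sketch. This is not the code shipped in this repo: toy_denoiser, denoise_block, generate, and the 0.7 confidence threshold are hypothetical stand-ins, and the real denoiser conditions on the context tower's cached states rather than on random numbers.

```python
import random

MASK = 3  # mask token id, matching the convention stated in this card


def toy_denoiser(block, rng):
    """Hypothetical stand-in for the denoiser tower: for every position,
    propose a token id and a confidence in [0, 1)."""
    return [(rng.randrange(10, 100), rng.random()) for _ in block]


def denoise_block(block_size, steps, rng, threshold=0.7):
    """Start fully masked; each step commits confident proposals and leaves
    the rest masked. The final step commits whatever is still masked."""
    block = [MASK] * block_size
    evals = 0
    for step in range(steps):
        proposals = toy_denoiser(block, rng)
        evals += 1
        last = step == steps - 1
        for i, (tok, conf) in enumerate(proposals):
            if block[i] == MASK and (conf >= threshold or last):
                block[i] = tok
        if MASK not in block:
            break  # block converged before using all steps
    return block, evals


def generate(prompt_ids, max_new_tokens, block_size, steps, seed=0):
    """Generate block by block; a finished block becomes context for the next."""
    rng = random.Random(seed)
    out = list(prompt_ids)
    total_evals = 0
    for start in range(0, max_new_tokens, block_size):
        size = min(block_size, max_new_tokens - start)
        block, evals = denoise_block(size, steps, rng)
        out.extend(block)
        total_evals += evals
    return out, total_evals


tokens, evals = generate([1, 2], max_new_tokens=64, block_size=16, steps=16)
print(len(tokens) - 2, "new tokens in", evals, "denoiser evaluations")
```

Because a block stops as soon as no masked positions remain, the number of denoiser evaluations can be lower than the number of blocks times the step limit.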

⚠️ Custom code required

Stock mlx-lm has no two-tower diffusion architecture, so this repo ships its own MLX modeling code (nemotron_twotower_mlx.py) and a runner (run_twotower_mlx.py). The entry point is generate_mask_diffusion, not mlx_lm.generate.

📦 Usage guide, examples & benchmarks: https://github.com/PipeNetwork/nemotron-twotower-mlx

pip install mlx mlx-lm transformers
huggingface-cli download pipenetwork/Nemotron-Labs-TwoTower-30B-A3B-mlx-bf16 --local-dir tt-bf16
python tt-bf16/run_twotower_mlx.py --model tt-bf16 \
  --prompt "The capital of France is" --max-new-tokens 64 \
  --block-size 16 --steps-per-block 16 --mask-token-id 3

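# Python API: make the downloaded repo importable, then load model and tokenizer.
import sys
sys.path.insert(0, "tt-bf16")

import mlx.core as mx
from run_twotower_mlx import load

model, tok = load("tt-bf16")
# Tokenize the prompt and add a batch dimension.
ids = mx.array([tok("The capital of France is")["input_ids"]])
# Denoise 64 new tokens in blocks of 16, with up to 16 steps per block.
out = model.generate_mask_diffusion(
    ids, max_new_tokens=64, block_size=16, steps_per_block=16,
    mask_token_id=3, eos_token_id=tok.eos_token_id,
)
print(tok.decode(out[0].tolist()))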

Token id 3 is the mask token (the model's training convention). Example output: "The capital of France is Paris, the capital of Germany is Berlin, the capital of Japan is Tokyo, ..."

Model family

Format	AR / context tower	Full TwoTower diffusion
4bit	4bit	4bit
6bit	6bit	6bit
8bit	8bit	8bit
bf16	bf16	bf16 ◄

Validation

The MLX conversion was checked against NVIDIA's reference implementation running on an NVIDIA GB10 (CUDA). Greedy decoding matched token-for-token: 120/120 tokens (100%) and 5/5 top-1 across the test prompts — e.g. both produce "George Washington. He was elected in 1789 and served two terms until 1797." (The AR/context tower is exercised directly; it is the shared backbone the diffusion denoiser also uses.)
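
A token-for-token comparison of this kind can be sketched as follows. This is an illustration only, not the harness used for the validation above: token_match is a hypothetical helper, and the token ids are made-up placeholders rather than real model output.

```python
def token_match(reference, candidate):
    """Return (matching positions, compared positions) for two token-id lists.
    Positions missing from the shorter list count as mismatches."""
    compared = max(len(reference), len(candidate))
    hits = sum(1 for r, c in zip(reference, candidate) if r == c)
    return hits, compared


# Placeholder ids standing in for greedy outputs from the two implementations.
reference_ids = [12, 7, 99, 4]
mlx_ids = [12, 7, 99, 4]

hits, compared = token_match(reference_ids, mlx_ids)
print(f"{hits}/{compared} tokens match")
```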

Benchmarks

Measured on Apple M3 Ultra (512 GB unified memory), MLX 0.31 — steady-state (post-warmup). Peak RAM is the unified-memory high-water mark during generation.

AR / context tower — 128-token single-stream generation:

Quant	Size	Generation	Peak RAM
4-bit	17 GB	16.1 tok/s	17.9 GB
6-bit	24 GB	13.2 tok/s	25.7 GB
8-bit	31 GB	13.3 tok/s	33.6 GB
bf16	59 GB	13.3 tok/s	63.2 GB

TwoTower diffusion — 64 new tokens, block size 16, ≤16 steps/block:

Quant	Size	Throughput	Denoiser evals	Peak RAM
4-bit	34 GB	3.8 tok/s	64	37.1 GB
6-bit	48 GB	3.3 tok/s	64	52.5 GB
8-bit	63 GB	3.4 tok/s	64	67.9 GB
bf16	118 GB	1.5 tok/s	39	136.9 GB
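
As a rough aid for choosing a build, the peak-RAM column above can be turned into a small lookup. This is a sketch under stated assumptions: largest_fitting_build and the 8 GB headroom default are hypothetical, and the figures apply only to the benchmark configuration above (64 new tokens on an M3 Ultra); longer prompts or outputs may need more memory.

```python
# Peak RAM in GB for TwoTower diffusion, copied from the table above.
DIFFUSION_PEAK_RAM_GB = {
    "4-bit": 37.1,
    "6-bit": 52.5,
    "8-bit": 67.9,
    "bf16": 136.9,
}

# Highest precision first, so the first fit is the most precise build.
PRECISION_ORDER = ["bf16", "8-bit", "6-bit", "4-bit"]


def largest_fitting_build(ram_budget_gb, headroom_gb=8.0):
    """Return the highest-precision build whose measured peak RAM plus an
    assumed headroom fits the budget, or None if no build fits."""
    for name in PRECISION_ORDER:
        if DIFFUSION_PEAK_RAM_GB[name] + headroom_gb <= ram_budget_gb:
            return name
    return None


for budget in (32, 64, 96, 192):
    print(budget, "GB ->", largest_fitting_build(budget))
```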

Diffusion runs steps_per_block denoiser passes per block, so it is slower per token than the AR tower — lower --steps-per-block trades quality for speed. Higher-precision builds tend to converge in fewer denoiser evaluations, but each pass moves more memory.
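
The upper bound on denoiser passes follows from the block arithmetic. The sketch below assumes one denoiser pass per step per block, as described above; max_denoiser_evals is a hypothetical helper, not part of the repo. For the benchmark settings it gives 4 blocks of up to 16 steps, which is consistent with the 64 evaluations reported for the quantized builds, while the 39 reported for bf16 would correspond to blocks finishing before the step limit.

```python
import math


def max_denoiser_evals(max_new_tokens, block_size, steps_per_block):
    """Upper bound on denoiser passes: number of blocks times the step limit."""
    blocks = math.ceil(max_new_tokens / block_size)
    return blocks * steps_per_block


print(max_denoiser_evals(64, 16, 16))  # benchmark settings: 4 blocks x 16 steps
print(max_denoiser_evals(64, 16, 8))   # halving the step limit halves the bound
```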

License

Governed by the NVIDIA Open Model License of the base model.

Safetensors model size: 63B params. Tensor types: BF16, F32.
