Nemotron-Labs-TwoTower-30B-A3B — MLX bfloat16 (unquantized)

An MLX port of nvidia/Nemotron-Labs-TwoTower-30B-A3B-Base-BF16, a block-wise autoregressive diffusion language model, running on Apple Silicon.

The model consists of two NemotronH towers: a frozen context tower, which prefills the prompt into KV and Mamba states, and a denoiser tower, which is adaLN-conditioned on the diffusion timestep. Generation is block-wise mask diffusion: each block starts fully masked and is iteratively denoised, committing high-confidence tokens and remasking the rest.
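
The commit-and-remask loop can be illustrated with a toy sketch. This is not the code shipped in nemotron_twotower_mlx.py: the denoiser here is a random stand-in, and the confidence threshold, the "commit at least one token per step" rule, and all function names are assumptions made for illustration only.

```python
import random

MASK = -1  # placeholder mask id for this toy; the real model uses token id 3


def toy_denoiser(block, rng):
    """Hypothetical stand-in for the denoiser tower: (token, confidence) per position."""
    return [(rng.randrange(100), rng.random()) for _ in block]


def denoise_block(block_size, steps, threshold, rng):
    block = [MASK] * block_size  # each block starts fully masked
    evals = 0
    for step in range(steps):
        masked = [i for i, t in enumerate(block) if t == MASK]
        if not masked:
            break  # every position committed: stop early
        preds = toy_denoiser(block, rng)
        evals += 1
        last_step = step == steps - 1
        # Assumed rule: always commit the most confident masked position so the
        # loop makes progress, and commit everything on the final step.
        best = max(masked, key=lambda i: preds[i][1])
        for i in masked:
            tok, conf = preds[i]
            if last_step or conf >= threshold or i == best:
                block[i] = tok  # commit this token
            # otherwise the position stays masked for the next step
    return block, evals


def generate(max_new_tokens, block_size, steps_per_block, threshold=0.9, seed=0):
    rng = random.Random(seed)
    out, total_evals = [], 0
    while len(out) < max_new_tokens:
        block, evals = denoise_block(block_size, steps_per_block, threshold, rng)
        out.extend(block)  # blocks are produced left to right
        total_evals += evals
    return out[:max_new_tokens], total_evals
```

Calling generate(64, 16, 16) in this toy produces four blocks and never spends more than 16 denoiser evaluations on any one of them. A block that fills up early uses fewer.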

⚠️ Custom code required

Stock mlx-lm has no two-tower diffusion architecture, so this repo ships the MLX modeling code (nemotron_twotower_mlx.py) and a runner (run_twotower_mlx.py). The entry point is generate_mask_diffusion, not mlx_lm.generate.

📦 Usage guide, examples & benchmarks: https://github.com/PipeNetwork/nemotron-twotower-mlx

Command line:

pip install mlx mlx-lm transformers
huggingface-cli download pipenetwork/Nemotron-Labs-TwoTower-30B-A3B-mlx-bf16 --local-dir tt-bf16
python tt-bf16/run_twotower_mlx.py --model tt-bf16 \
  --prompt "The capital of France is" --max-new-tokens 64 \
  --block-size 16 --steps-per-block 16 --mask-token-id 3

Python:

import sys
sys.path.insert(0, "tt-bf16")
from run_twotower_mlx import load
import mlx.core as mx
model, tok = load("tt-bf16")
ids = mx.array([tok("The capital of France is")["input_ids"]])
out = model.generate_mask_diffusion(ids, max_new_tokens=64, block_size=16,
        steps_per_block=16, mask_token_id=3, eos_token_id=tok.eos_token_id)
print(tok.decode(out[0].tolist()))

Token id 3 is the mask token (the model's training convention). Example output: "The capital of France is Paris, the capital of Germany is Berlin, the capital of Japan is Tokyo, ..."

Model family

Format   AR / context tower   Full TwoTower diffusion
4bit     4bit                 4bit
6bit     6bit                 6bit
8bit     8bit                 8bit
bf16     bf16                 bf16

Validation

The MLX conversion was checked against NVIDIA's reference implementation running on an NVIDIA GB10 (CUDA). Greedy decoding matched token-for-token: 120/120 tokens (100%) and 5/5 top-1 across the test prompts — e.g. both produce "George Washington. He was elected in 1789 and served two terms until 1797." (The AR/context tower is exercised directly; it is the shared backbone the diffusion denoiser also uses.)
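
A token-for-token comparison of this kind can be sketched as follows. The helper name and the rule that a length mismatch counts as a miss are assumptions for illustration. This is not the script used to produce the 120/120 figure.

```python
def token_match(reference_ids, candidate_ids):
    """Return (matching positions, positions compared) for two token-id sequences.

    Positions present in only one sequence count as mismatches (an assumption).
    """
    compared = max(len(reference_ids), len(candidate_ids))
    matches = sum(r == c for r, c in zip(reference_ids, candidate_ids))
    return matches, compared


# Made-up token ids, purely illustrative.
matches, compared = token_match([11, 42, 7, 9], [11, 42, 7, 9])
print(f"{matches}/{compared} tokens match")  # prints "4/4 tokens match"
```

Summing the two return values over all test prompts gives an aggregate ratio in the same form as the one reported above.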

Benchmarks

Measured on Apple M3 Ultra (512 GB unified memory), MLX 0.31 — steady-state (post-warmup). Peak RAM is the unified-memory high-water mark during generation.
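
For readers who want comparable numbers on their own hardware, here is a minimal sketch of a steady-state (post-warmup) throughput measurement. The helper and its parameters are hypothetical, and it is not necessarily how the figures below were produced. Because MLX evaluates lazily, the callable passed in should force the result to materialize, for example by converting the output to a Python list.

```python
import time


def measure_tok_per_s(generate_fn, n_tokens, warmup=1, runs=3):
    """Average tokens/second of generate_fn over `runs` timed calls.

    generate_fn: zero-argument callable that generates n_tokens tokens and
    forces evaluation of the result before returning.
    """
    for _ in range(warmup):
        generate_fn()  # discarded: absorbs one-time setup costs
    elapsed = []
    for _ in range(runs):
        start = time.perf_counter()
        generate_fn()
        elapsed.append(time.perf_counter() - start)
    return n_tokens / (sum(elapsed) / len(elapsed))
```

With the Python API shown earlier, generate_fn could be a lambda that calls model.generate_mask_diffusion(...) and then out[0].tolist().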

AR / context tower — 128-token single-stream generation:

Quant   Size    Generation   Peak RAM
4-bit   17 GB   16.1 tok/s   17.9 GB
6-bit   24 GB   13.2 tok/s   25.7 GB
8-bit   31 GB   13.3 tok/s   33.6 GB
bf16    59 GB   13.3 tok/s   63.2 GB

TwoTower diffusion — 64 new tokens, block size 16, ≤16 steps/block:

Quant   Size     Throughput   Denoiser evals   Peak RAM
4-bit   34 GB    3.8 tok/s    64               37.1 GB
6-bit   48 GB    3.3 tok/s    64               52.5 GB
8-bit   63 GB    3.4 tok/s    64               67.9 GB
bf16    118 GB   1.5 tok/s    39               136.9 GB

Diffusion runs up to steps_per_block denoiser passes per block, so it is slower per token than the AR tower; lowering --steps-per-block trades quality for speed. Higher-precision builds tend to converge in fewer denoiser evaluations (in the table above, bf16 finished in 39 versus 64 for the quantized builds), but each pass moves more memory.
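
The upper bound on denoiser evaluations follows directly from the generation settings. The helper below is a hypothetical sketch of that arithmetic. It assumes a final partial block costs as much as a full one.

```python
import math


def max_denoiser_evals(max_new_tokens, block_size, steps_per_block):
    """Worst-case number of denoiser passes for one generation call."""
    n_blocks = math.ceil(max_new_tokens / block_size)
    return n_blocks * steps_per_block


print(max_denoiser_evals(64, 16, 16))  # 4 blocks x 16 steps -> 64
print(max_denoiser_evals(64, 16, 8))   # halving the steps halves the bound -> 32
```

The benchmark settings (64 new tokens, block size 16, at most 16 steps per block) therefore cap a run at 64 evaluations, which is the count reported for the quantized builds.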

License

Governed by the NVIDIA Open Model License of the base model.
