Nemotron-Labs-TwoTower-30B-A3B — MLX bfloat16 (unquantized)
An MLX port of
nvidia/Nemotron-Labs-TwoTower-30B-A3B-Base-BF16,
a block-wise autoregressive diffusion language model, running on Apple Silicon.
The model consists of two NemotronH towers: a frozen context tower, which prefills the prompt into KV and Mamba states, and a denoiser tower, which is adaLN-conditioned on the diffusion timestep. Generation is block-wise mask diffusion: each block starts fully masked and is iteratively denoised, committing high-confidence tokens and remasking the rest.
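The commit-and-remask loop can be sketched in plain Python. This is a toy illustration under stated assumptions, not the code shipped in this repo: `toy_denoiser`, the confidence threshold, the rule that the most confident masked position is always committed, and the rule that the last step commits everything are all hypothetical choices made so the example terminates. The real denoiser is the adaLN-conditioned NemotronH tower.

```python
import random

MASK_ID = 3  # mask token id, per the convention stated in this card


def toy_denoiser(block, rng):
    """Hypothetical stand-in for the denoiser tower: one (token, confidence) guess per position."""
    return [(rng.randrange(10, 100), rng.random()) for _ in block]


def denoise_block(block_size, steps, threshold, rng):
    block = [MASK_ID] * block_size  # every block starts fully masked
    evals = 0
    for step in range(steps):
        masked = [i for i, tok in enumerate(block) if tok == MASK_ID]
        if not masked:
            break  # converged before the step budget was used up
        preds = toy_denoiser(block, rng)
        evals += 1
        last_step = step == steps - 1
        # Assumption: always commit the most confident masked position,
        # so every step makes progress.
        best = max(masked, key=lambda i: preds[i][1])
        for i in masked:
            token, confidence = preds[i]
            if last_step or i == best or confidence >= threshold:
                block[i] = token  # commit this position
            # otherwise the position stays masked and is retried next step
    return block, evals


def generate(num_blocks, block_size=16, steps=16, threshold=0.9, seed=0):
    """Denoise blocks one after another and count denoiser evaluations."""
    rng = random.Random(seed)
    tokens, total_evals = [], 0
    for _ in range(num_blocks):
        block, evals = denoise_block(block_size, steps, threshold, rng)
        tokens.extend(block)
        total_evals += evals
    return tokens, total_evals
```

Calling `generate(num_blocks=4)` returns 64 committed tokens and the number of denoiser evaluations used, which can never exceed `num_blocks * steps`.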
⚠️ Custom code required
Stock mlx-lm has no two-tower diffusion architecture, so this repo ships the MLX
modeling code (nemotron_twotower_mlx.py) and a runner (run_twotower_mlx.py). The entry
point is generate_mask_diffusion, not mlx_lm.generate.
📦 Usage guide, examples & benchmarks: https://github.com/PipeNetwork/nemotron-twotower-mlx
pip install mlx mlx-lm transformers
huggingface-cli download pipenetwork/Nemotron-Labs-TwoTower-30B-A3B-mlx-bf16 --local-dir tt-bf16
python tt-bf16/run_twotower_mlx.py --model tt-bf16 \
--prompt "The capital of France is" --max-new-tokens 64 \
--block-size 16 --steps-per-block 16 --mask-token-id 3
import sys; sys.path.insert(0, "tt-bf16")
from run_twotower_mlx import load
import mlx.core as mx
model, tok = load("tt-bf16")
ids = mx.array([tok("The capital of France is")["input_ids"]])
out = model.generate_mask_diffusion(ids, max_new_tokens=64, block_size=16,
steps_per_block=16, mask_token_id=3, eos_token_id=tok.eos_token_id)
print(tok.decode(out[0].tolist()))
Token id 3 is the mask token (the model's training convention). Example output: "The capital of France is Paris, the capital of Germany is Berlin, the capital of Japan is Tokyo, ..."
Model family
| Format | AR / context tower | Full TwoTower diffusion |
|---|---|---|
| 4bit | 4bit | 4bit |
| 6bit | 6bit | 6bit |
| 8bit | 8bit | 8bit |
| bf16 | bf16 | bf16 ◄ |
Validation
The MLX conversion was checked against NVIDIA's reference implementation running on an NVIDIA GB10 (CUDA). Greedy decoding matched token-for-token: 120/120 tokens (100%) and 5/5 top-1 across the test prompts — e.g. both produce "George Washington. He was elected in 1789 and served two terms until 1797." (The AR/context tower is exercised directly; it is the shared backbone the diffusion denoiser also uses.)
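A token-for-token comparison of this kind can be expressed as a short helper. The function name and the token ids in the usage note are hypothetical. The actual validation harness is not part of this card, so this is only a sketch of how a match count such as 120/120 could be computed from two greedy decodes.

```python
def token_match(reference, candidate):
    """Return (matching positions, match rate) for two equal-length token-id sequences."""
    if len(reference) != len(candidate):
        raise ValueError("sequences must have the same length")
    if not reference:
        raise ValueError("sequences must not be empty")
    matches = sum(r == c for r, c in zip(reference, candidate))
    return matches, matches / len(reference)
```

For example, `token_match([5, 8, 13, 21], [5, 8, 13, 21])` gives `(4, 1.0)`, while a single differing position in four gives `(3, 0.75)`.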
Benchmarks
Measured on Apple M3 Ultra (512 GB unified memory), MLX 0.31 — steady-state (post-warmup). Peak RAM is the unified-memory high-water mark during generation.
AR / context tower — 128-token single-stream generation:
| Quant | Size | Generation | Peak RAM |
|---|---|---|---|
| 4-bit | 17 GB | 16.1 tok/s | 17.9 GB |
| 6-bit | 24 GB | 13.2 tok/s | 25.7 GB |
| 8-bit | 31 GB | 13.3 tok/s | 33.6 GB |
| bf16 | 59 GB | 13.3 tok/s | 63.2 GB |
TwoTower diffusion — 64 new tokens, block size 16, ≤16 steps/block:
| Quant | Size | Throughput | Denoiser evals | Peak RAM |
|---|---|---|---|---|
| 4-bit | 34 GB | 3.8 tok/s | 64 | 37.1 GB |
| 6-bit | 48 GB | 3.3 tok/s | 64 | 52.5 GB |
| 8-bit | 63 GB | 3.4 tok/s | 64 | 67.9 GB |
| bf16 | 118 GB | 1.5 tok/s | 39 | 136.9 GB |
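The Peak RAM column can be used to check which diffusion builds are likely to fit on a given machine. The sketch below copies the figures from the table above. The helper name and the 8 GB headroom default are assumptions made for illustration, and numbers measured on an M3 Ultra may differ on other hardware or MLX versions.

```python
# Peak RAM in GB for full TwoTower diffusion, copied from the table above.
DIFFUSION_PEAK_RAM_GB = {
    "4-bit": 37.1,
    "6-bit": 52.5,
    "8-bit": 67.9,
    "bf16": 136.9,
}


def builds_that_fit(available_gb, headroom_gb=8.0):
    """List builds whose measured peak RAM plus headroom fits in available_gb.

    headroom_gb is an arbitrary safety margin (an assumption, not a measured value).
    """
    return [
        name
        for name, peak in DIFFUSION_PEAK_RAM_GB.items()
        if peak + headroom_gb <= available_gb
    ]
```

With these figures, `builds_that_fit(64)` returns `["4-bit", "6-bit"]`, and `builds_that_fit(192)` returns all four builds.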
Diffusion runs up to steps_per_block denoiser passes per block, so it is slower per token
than the AR tower; lowering --steps-per-block trades quality for speed. In these measurements
only the bf16 build converged early (39 denoiser evaluations against 64 for the quantized
builds), but each bf16 pass moves more memory.
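The Denoiser evals column can be read against the step budget implied by the benchmark settings. The sketch below assumes generation covers `ceil(max_new_tokens / block_size)` blocks and that each block uses at most `steps_per_block` passes. The helper name is hypothetical.

```python
import math


def max_denoiser_evals(max_new_tokens, block_size, steps_per_block):
    """Upper bound on denoiser passes: number of blocks times the per-block step budget."""
    num_blocks = math.ceil(max_new_tokens / block_size)
    return num_blocks * steps_per_block


# Benchmark settings from this card: 64 new tokens, block size 16, at most 16 steps per block.
budget = max_denoiser_evals(64, 16, 16)  # 4 blocks x 16 steps = 64

# Share of the budget the bf16 run actually used (39 evaluations in the table).
bf16_share = 39 / budget
```

Under these assumptions the quantized builds used the full budget of 64 passes, while the bf16 run used 39 of 64, roughly 61%.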
License
Governed by the NVIDIA Open Model License of the base model.