Nemotron-3-Nano-30B-A3B (context tower) — MLX bfloat16 (unquantized)
The autoregressive context tower of
nvidia/Nemotron-Labs-TwoTower-30B-A3B-Base-BF16,
extracted and converted to MLX in unquantized bfloat16. This is the
frozen base model (NemotronH hybrid: 52 layers = 23 Mamba-2, 6 attention, 23 MoE;
128 experts, 6 active + 1 shared; ~30B params, ~3B active). It runs directly in stock
mlx-lm as an ordinary text model, with no custom code.
This is the AR backbone only. For the actual two-tower diffusion behavior, use the
Nemotron-Labs-TwoTower-30B-A3B-mlx-* repos below.
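As a quick sanity check on the architecture figures above, the layer mix and the routing ratio can be tallied in a few lines. This is plain arithmetic on the numbers quoted in this card; the variable names are illustrative and are not taken from the model's config file.

```python
# Layer mix quoted above for the NemotronH hybrid backbone.
layers = {"mamba2": 23, "attention": 6, "moe": 23}
total_layers = sum(layers.values())

# Routing figures quoted above: 128 routed experts, 6 active per token, 1 shared.
routed_experts = 128
active_routed = 6
shared_experts = 1

# Experts that process each token in an MoE layer (routed + shared).
experts_per_token = active_routed + shared_experts
# Share of the routed expert pool that is used for any single token.
routed_fraction = active_routed / routed_experts

print(total_layers)               # 52
print(experts_per_token)          # 7
print(f"{routed_fraction:.2%}")   # 4.69%
```

The small routed fraction is what lets a ~30B-parameter model run with only ~3B parameters active per token.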
📦 Usage guide, examples & benchmarks: https://github.com/PipeNetwork/nemotron-twotower-mlx
Usage
pip install mlx-lm
mlx_lm.generate --model pipenetwork/Nemotron-3-Nano-30B-A3B-context-mlx-bf16 \
--prompt "The capital of France is" --max-tokens 128
from mlx_lm import load, generate
model, tok = load("pipenetwork/Nemotron-3-Nano-30B-A3B-context-mlx-bf16")
print(generate(model, tok, prompt="The key idea behind Mamba is", max_tokens=128))
Model family
| Format | AR / context tower | Full TwoTower diffusion |
|---|---|---|
| 4bit | 4bit | 4bit |
| 6bit | 6bit | 6bit |
| 8bit | 8bit | 8bit |
| bf16 | bf16 ◄ | bf16 |
Validation
The MLX conversion was checked against NVIDIA's reference implementation running on an NVIDIA GB10 (CUDA). Greedy decoding matched token-for-token: 120/120 tokens (100%) and 5/5 top-1 across the test prompts — e.g. both produce "George Washington. He was elected in 1789 and served two terms until 1797." (The AR/context tower is exercised directly; it is the shared backbone the diffusion denoiser also uses.)
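A token-for-token comparison of this kind can be sketched as below. This is a minimal illustration, not the script used for the validation above: `token_match` is a hypothetical helper, the token ids are made up, and the choice to count any length difference as a mismatch is an assumption.

```python
def token_match(reference, candidate):
    """Return (matches, total) for two greedy-decoded token-id sequences.

    Positions present in only one sequence count as mismatches (assumption).
    """
    total = max(len(reference), len(candidate))
    matches = sum(1 for r, c in zip(reference, candidate) if r == c)
    return matches, total


# Hypothetical token ids standing in for the reference and MLX decodes.
ref_ids = [101, 7, 42, 9, 13]
mlx_ids = [101, 7, 42, 9, 13]

matches, total = token_match(ref_ids, mlx_ids)
print(f"{matches}/{total} tokens ({matches / total:.0%})")  # 5/5 tokens (100%)
```

Comparing token ids rather than decoded strings avoids false matches where different token sequences happen to render as the same text.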
Benchmarks
Measured on Apple M3 Ultra (512 GB unified memory), MLX 0.31 — steady-state (post-warmup). Peak RAM is the unified-memory high-water mark during generation.
AR / context tower — 128-token single-stream generation:
| Quant | Size | Generation | Peak RAM |
|---|---|---|---|
| 4-bit | 17 GB | 16.1 tok/s | 17.9 GB |
| 6-bit | 24 GB | 13.2 tok/s | 25.7 GB |
| 8-bit | 31 GB | 13.3 tok/s | 33.6 GB |
| bf16 | 59 GB | 13.3 tok/s | 63.2 GB |
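To put the table in wall-clock terms, the sketch below derives the time for a 128-token generation and the peak-RAM headroom over the listed size. The inputs are copied from the table; treating "Size" as the weight footprint is an assumption, and `summarize` is an illustrative helper.

```python
# (size in GB, generation speed in tok/s, peak RAM in GB), copied from the table.
ar_results = {
    "4-bit": (17, 16.1, 17.9),
    "6-bit": (24, 13.2, 25.7),
    "8-bit": (31, 13.3, 33.6),
    "bf16": (59, 13.3, 63.2),
}


def summarize(results, tokens=128):
    """Map each build to (seconds for `tokens` tokens, GB of peak RAM above size)."""
    return {
        quant: (tokens / tok_s, peak_gb - size_gb)
        for quant, (size_gb, tok_s, peak_gb) in results.items()
    }


summary = summarize(ar_results)
for quant, (seconds, overhead_gb) in summary.items():
    print(f"{quant}: {seconds:.1f} s for 128 tokens, {overhead_gb:.1f} GB above size")
```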
TwoTower diffusion — 64 new tokens, block size 16, ≤16 steps/block:
| Quant | Size | Throughput | Denoiser evals | Peak RAM |
|---|---|---|---|---|
| 4-bit | 34 GB | 3.8 tok/s | 64 | 37.1 GB |
| 6-bit | 48 GB | 3.3 tok/s | 64 | 52.5 GB |
| 8-bit | 63 GB | 3.4 tok/s | 64 | 67.9 GB |
| bf16 | 118 GB | 1.5 tok/s | 39 | 136.9 GB |
Diffusion runs up to steps_per_block denoiser passes per block, so it is slower per token
than the AR tower; lowering --steps-per-block trades quality for speed. Higher-precision
builds tend to converge in fewer denoiser evaluations (39 for bf16 versus 64 for the
quantized builds in the table above), but each pass moves more memory.
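The relationship between block size, step cap, and denoiser evaluations can be worked through with the benchmark settings. This is a sketch under one assumption: that the number of blocks is the new-token count divided by the block size, rounded up. `diffusion_budget` is an illustrative helper, not part of any released code.

```python
import math


def diffusion_budget(new_tokens, block_size, steps_per_block):
    """Return (number of blocks, maximum denoiser evaluations)."""
    blocks = math.ceil(new_tokens / block_size)  # assumed block count
    return blocks, blocks * steps_per_block


# Benchmark settings: 64 new tokens, block size 16, at most 16 steps per block.
blocks, max_evals = diffusion_budget(64, 16, 16)
print(blocks, max_evals)  # 4 64

# Denoiser evaluations reported in the table above.
reported = {"4-bit": 64, "6-bit": 64, "8-bit": 64, "bf16": 39}
for quant, evals in reported.items():
    print(f"{quant}: {evals}/{max_evals} evals, {evals / blocks:.2f} per block")
```

Under that assumption the quantized builds use the full budget of 64 evaluations, while bf16 averages 9.75 per block.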
License
Governed by the NVIDIA Open Model License of the base model.