---
library_name: transformers
license: other
license_name: nvidia-open-model-license
license_link: >-
  https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
pipeline_tag: text-generation
language:
- en
- es
- fr
- de
- ja
- it
- pt
- zh
- ar
- da
- ko
- nl
- pl
- ru
- sv
- th
tags:
- nvidia
- pytorch
- two-tower
- diffusion
- mamba
datasets:
- nvidia/Nemotron-Pretraining-Code-v1
- nvidia/Nemotron-CC-v2
- nvidia/Nemotron-Pretraining-SFT-v1
- nvidia/Nemotron-CC-Math-v1
- nvidia/Nemotron-Pretraining-Code-v2
- nvidia/Nemotron-Pretraining-Specialized-v1
- nvidia/Nemotron-CC-v2.1
- nvidia/Nemotron-CC-Code-v1
- nvidia/Nemotron-Pretraining-Dataset-sample
track_downloads: true
---

# Nemotron-TwoTower-30B-A3B-Base-BF16

## Model Overview

**Model Developer:** NVIDIA Corporation

**Model Dates:** September 2025 to April 2026

**Data Freshness:** The pre-training data has a cutoff date of June 25, 2025.

## Description

Nemotron-TwoTower-30B-A3B-Base-BF16 is a **two-tower** variant of [NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16). It uses the same Mamba2-Transformer Hybrid MoE architecture but splits the model into two separate towers:

- **Context Tower**: processes the prompt and generated context (causal, autoregressive)
- **Denoiser Tower**: generates new tokens given the context (can be used for AR or block-wise diffusion)

Both towers share the same architecture (52 layers, `MEMEM*EMEMEM*...` hybrid pattern) but have **independently trained weights**. The context tower is initialized from the single-tower base model and frozen; the denoiser tower is trained to predict next tokens given the context tower's representations.

### Two-Tower Generation Modes

| Mode | Description | Tokens/step | API |
|------|-------------|-------------|-----|
| **Mask Diffusion** | Block-wise iterative denoising with confidence-based unmasking (flagship two-tower mode). | up to `block_size` | `generate_mask_diffusion()` |
| **Mock-AR** | Two-tower autoregressive. Context tower builds cache, denoiser predicts next token. | 1 | `generate_mock_ar()` |
| **AR** | Standard autoregressive via `generate()`. Uses context tower only (single GPU). | 1 | `generate()` |

### What is Two-Tower?

The two-tower architecture decouples "understanding context" from "generating tokens" into separate networks:

- **Context Tower** runs causally over the prompt and all previously committed tokens, producing the layer-aligned KV cache (attention) and Mamba states that the denoiser conditions on.
- **Denoiser Tower** generates a *block* of tokens at once.
Within a block it is **bidirectional** (every position attends to the whole noisy block + the full causal context); across blocks it is causal via the context cache.

This enables **block-wise parallel generation**: the denoiser fills `block_size` masked positions per block and commits the most confident ones each step, so a block resolves in a handful of denoising steps rather than `block_size` autoregressive steps.

### Mask Diffusion: how it works

Generation proceeds block by block. For each new block of `block_size` positions:

1. Initialize the block as all `[MASK]` tokens (`mask_token_id`).
2. For `steps_per_block` iterations:
   - Compute the diffusion timestep `t` = current masked fraction of the block, and feed it to the **time-conditioned denoiser** (PixArt-α adaLN-single modulation on every denoiser layer).
   - Run the denoiser over the whole block (bidirectional self-attention + cross-attention to the context cache; Mamba chunk-scan seeded from the context state).
   - Constrain to `p(x₀ | xₜ)` (mask token forbidden; already-decoded positions fixed), then **commit** the highest-confidence positions (all above `confidence_threshold`, with a floor that guarantees completion in `steps_per_block`) and re-mask the rest.
3. Append the resolved block to the context, extend the context cache, and continue.

This model is ready for commercial use.

## License/Terms of Use

GOVERNING TERMS: Use of this model is governed by the [NVIDIA Nemotron Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-nemotron-open-model-license/).

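Returning to the loop described under "Mask Diffusion: how it works", the commit rule can be illustrated with a small toy. This is a minimal sketch, not the model's implementation: `select_commits` and `denoise_block` are hypothetical helpers, random numbers stand in for the denoiser's per-position confidence, and the floor of `ceil(masked / steps_left)` is an assumed way to guarantee that a block finishes within `steps_per_block`.

```python
import math
import random


def masked_fraction(is_masked):
    """Diffusion timestep t: share of block positions that are still masked."""
    return sum(is_masked) / len(is_masked)


def select_commits(confidence, is_masked, confidence_threshold, steps_left):
    """Pick the masked positions to commit this step (hypothetical helper)."""
    masked_idx = [i for i, m in enumerate(is_masked) if m]
    if not masked_idx:
        return []
    # Assumed floor: spread the remaining masked positions over the remaining steps.
    floor = math.ceil(len(masked_idx) / steps_left)
    ranked = sorted(masked_idx, key=lambda i: confidence[i], reverse=True)
    above = [i for i in ranked if confidence[i] >= confidence_threshold]
    # Commit everything above the threshold, but never fewer than the floor.
    return above if len(above) >= floor else ranked[:floor]


def denoise_block(block_size=16, steps_per_block=16,
                  confidence_threshold=0.9, seed=0):
    """Toy loop: returns the final mask and the timestep seen at each step."""
    rng = random.Random(seed)
    is_masked = [True] * block_size  # step 1: the block starts fully masked
    timesteps = []
    for step in range(steps_per_block):
        if not any(is_masked):
            break  # block resolved early
        timesteps.append(masked_fraction(is_masked))
        # Stand-in for the denoiser: one confidence value per position.
        confidence = [rng.random() for _ in range(block_size)]
        steps_left = steps_per_block - step
        for i in select_commits(confidence, is_masked,
                                confidence_threshold, steps_left):
            is_masked[i] = False  # committed positions stay fixed afterwards
    return is_masked, timesteps


final_mask, timesteps = denoise_block()
print("resolved:", not any(final_mask), "steps used:", len(timesteps))
```

With this floor, the last permitted step always commits every remaining position, so the block is fully resolved after at most `steps_per_block` iterations; a higher `confidence_threshold` trades speed for more conservative commits.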
## Benchmark Evaluations

*Benchmark scores will be added in a future update.*

## Model Architecture

- **Architecture Type:** Two-Tower Mamba2-Transformer Hybrid Mixture of Experts (MoE)
- **Network Architecture:** Nemotron Hybrid MoE (Two-Tower)
- **Number of model parameters:** ~60B total (30B context tower + 30B denoiser tower)
- **Active parameters per token:** ~3B per tower (~6B total for two-tower generation)
- **Number of layers:** 52 per tower
- **Layer pattern:** `MEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEMEM*EMEMEMEME`
  - `M` = Mamba2, `*` = Attention, `E` = MoE, `-` = MLP

## Training Methodology

The two-tower model is trained in two stages:

1. **Stage 1: Base Pre-Training.** The single-tower [NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16) is pre-trained with standard next-token prediction (~25T tokens).
2. **Stage 2: Two-Tower Training.** The model is duplicated into context + denoiser towers. The context tower is frozen; the denoiser tower is trained with the two-tower objective where it learns to predict tokens given context tower representations.

Software used for training: [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)

## Input

- **Input Type(s):** Text
- **Input Format(s):** String
- **Input Parameters:** One-Dimensional (1D): Sequences
- **Maximum input size:** 128K tokens

## Output

- **Output Type(s):** Text
- **Output Format:** String
- **Output Parameters:** One-Dimensional (1D): Sequences
- **Maximum output size:** 128K tokens

## Software Integration

- Supported Hardware: NVIDIA H100-80GB, NVIDIA A100 (requires 2 GPUs for full two-tower inference, ~59GB per GPU)
- Operating System(s): Linux

### Use it with Transformers

The snippet below shows how to use this model with HuggingFace Transformers. **Two-tower inference requires 2 GPUs** (~59GB per GPU for bf16 weights); the towers are placed on separate devices.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "nvidia/Nemotron-TwoTower-30B-A3B-Base-BF16"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# Context tower -> GPU 0, denoiser tower -> GPU 1
model.place_towers_on_devices("cuda:0", "cuda:1")
model.eval()

prompt = "France is a country "
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")

# Flagship mode: block-wise mask diffusion
outputs = model.generate_mask_diffusion(
    inputs["input_ids"],
    max_new_tokens=128,
    block_size=16,  # tokens generated per block
    steps_per_block=16,  # denoising iterations per block
    mask_token_id=3, #