fitsumreda committed 30 days ago

Commit a203471 (verified) · 1 parent: 8d7e74f

Two-tower mask diffusion: fix denoiser (adaLN norm order, bidirectional in-block attention, block-wise chunk-scan Mamba) + fp64 router; refresh README

Files changed (5)

README.md +41 -21
config.json +1 -1
inference.py +76 -5
modeling_nemotron_h.py +4 -2
modeling_nemotron_twotower.py +242 -29

README.md CHANGED

@@ -77,19 +77,31 @@ Both towers share the same architecture (52 layers, `MEMEM*EMEMEM*...` hybrid pa
 ### Two-Tower Generation Modes
-| Mode | Description | Tokens/step |
-|------|-------------|-------------|
-| **AR** | Standard autoregressive via `generate()`. Uses context tower only. | 1 |
-| **Mock-AR** | Two-tower autoregressive. Context tower builds cache, denoiser predicts next token. | 1 |
-| **Mask Diffusion** | Block-wise iterative denoising with confidence-based unmasking. *(Coming soon)* | block_size |
 ### What is Two-Tower?
-The two-tower architecture decouples the "understanding context" and "generating tokens" responsibilities into separate networks. This enables:
-1. **Block-wise parallel generation** — the denoiser can generate multiple tokens simultaneously via iterative diffusion
-2. **Architectural flexibility** — context and denoiser can be optimized independently
-3. **Speculative decoding** — the denoiser can be a smaller/faster model
 This model is ready for commercial use.
@@ -142,7 +154,7 @@ Software used for training: [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)
 ### Use it with Transformers
-The snippet below shows how to use this model with HuggingFace Transformers. **Two-tower inference requires 2 GPUs** (~59GB per GPU for bf16 weights).
 ```python
 import torch
@@ -156,39 +168,47 @@ model = AutoModelForCausalLM.from_pretrained(
     trust_remote_code=True,
 )
-# Place context tower on GPU 0, denoiser tower on GPU 1
 model.place_towers_on_devices("cuda:0", "cuda:1")
 model.eval()
 prompt = "France is a country "
 inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
-# Two-tower mock-AR generation
-outputs = model.generate_mock_ar(
     inputs["input_ids"],
     max_new_tokens=128,
-    temperature=0.0,
     eos_token_id=tokenizer.eos_token_id,
 )
 print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 ```
-For **AR-only mode** (single GPU, context tower only):
 ```python
-model = AutoModelForCausalLM.from_pretrained(
-    model_name,
-    torch_dtype=torch.bfloat16,
-    trust_remote_code=True,
-).cuda()
 outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
 print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 ```
 ## Model Version(s)
 - v1.0 — Two-tower AR (mock-AR) checkpoint
 # Training, Testing, and Evaluation Datasets

 ### Two-Tower Generation Modes
+| Mode | Description | Tokens/step | API |
+|------|-------------|-------------|-----|
+| **Mask Diffusion** | Block-wise iterative denoising with confidence-based unmasking (flagship two-tower mode). | up to `block_size` | `generate_mask_diffusion()` |
+| **Mock-AR** | Two-tower autoregressive. Context tower builds cache, denoiser predicts next token. | 1 | `generate_mock_ar()` |
+| **AR** | Standard autoregressive via `generate()`. Uses context tower only (single GPU). | 1 | `generate()` |
 ### What is Two-Tower?
+The two-tower architecture decouples "understanding context" from "generating tokens" into separate networks:
+- **Context Tower** runs causally over the prompt and all previously committed tokens, producing the layer-aligned KV cache (attention) and Mamba states that the denoiser conditions on.
+- **Denoiser Tower** generates a *block* of tokens at once. Within a block it is **bidirectional** (every position attends to the whole noisy block + the full causal context); across blocks it is causal via the context cache.
+This enables **block-wise parallel generation** — the denoiser fills `block_size` masked positions per block and commits the most confident ones each step, so a block resolves in a handful of denoising steps rather than `block_size` autoregressive steps.
+### Mask Diffusion: how it works
+Generation proceeds block by block. For each new block of `block_size` positions:
+1. Initialize the block as all `[MASK]` tokens (`mask_token_id`).
+2. For `steps_per_block` iterations:
+   - Compute the diffusion timestep `t` = current masked fraction of the block, and feed it to the **time-conditioned denoiser** (PixArt-α adaLN-single modulation on every denoiser layer).
+   - Run the denoiser over the whole block (bidirectional self-attention + cross-attention to the context cache; Mamba chunk-scan seeded from the context state).
+   - Constrain to `p(x₀ | xₜ)` (mask token forbidden; already-decoded positions fixed), then **commit** the highest-confidence positions (all above `confidence_threshold`, with a floor that guarantees completion in `steps_per_block`) and re-mask the rest.
+3. Append the resolved block to the context, extend the context cache, and continue.
 This model is ready for commercial use.
 ### Use it with Transformers
+The snippet below shows how to use this model with HuggingFace Transformers. **Two-tower inference requires 2 GPUs** (~59GB per GPU for bf16 weights); the towers are placed on separate devices.
 ```python
 import torch
     trust_remote_code=True,
 )
+# Context tower -> GPU 0, denoiser tower -> GPU 1
 model.place_towers_on_devices("cuda:0", "cuda:1")
 model.eval()
 prompt = "France is a country "
 inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
+# Flagship mode: block-wise mask diffusion
+outputs = model.generate_mask_diffusion(
     inputs["input_ids"],
     max_new_tokens=128,
+    block_size=16,            # tokens generated per block
+    steps_per_block=16,       # denoising iterations per block
+    mask_token_id=3,          # <mask>
+    temperature=0.1,
+    confidence_threshold=0.8, # commit positions above this confidence each step
     eos_token_id=tokenizer.eos_token_id,
 )
+print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
+```
+**Mock-AR** (two-tower, one token per step):
+```python
+outputs = model.generate_mock_ar(
+    inputs["input_ids"], max_new_tokens=128, temperature=0.0,
+    eos_token_id=tokenizer.eos_token_id,
+)
 print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 ```
+**AR-only** (single GPU, context tower only — load with `.cuda()` instead of `place_towers_on_devices`):
 ```python
 outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
 print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 ```
 ## Model Version(s)
+- v1.1 — Block-wise **mask-diffusion** generation enabled (time-conditioned denoiser, bidirectional in-block attention, chunk-scan Mamba); AR and mock-AR also supported.
 - v1.0 — Two-tower AR (mock-AR) checkpoint
 # Training, Testing, and Evaluation Datasets
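
The mask-diffusion loop described in the README change above can be sketched in a few lines of plain Python. This is a minimal illustration, not the model's implementation: `denoise_block`, `toy_predict`, the `MASK` sentinel, and the exact floor formula (`ceil(masked / remaining_steps)`) are assumptions made for the example. The README only states that a floor guarantees completion within `steps_per_block`.

```python
import math
from typing import Callable, List, Tuple

MASK = -1  # hypothetical sentinel for this toy example (the README passes mask_token_id=3)

Prediction = Tuple[int, float]  # (predicted token id, confidence in [0, 1])


def denoise_block(
    predict: Callable[[List[int], float], List[Prediction]],
    block_size: int,
    steps_per_block: int,
    confidence_threshold: float,
) -> List[int]:
    """Resolve one block by confidence-based unmasking (illustrative sketch)."""
    block = [MASK] * block_size  # step 1: start fully masked
    for step in range(steps_per_block):
        masked = [i for i, tok in enumerate(block) if tok == MASK]
        if not masked:
            break  # block resolved early
        t = len(masked) / block_size  # timestep = current masked fraction
        preds = predict(block, t)
        # Rank the still-masked positions by confidence, highest first.
        ranked = sorted(masked, key=lambda i: preds[i][1], reverse=True)
        above = [i for i in ranked if preds[i][1] > confidence_threshold]
        # Assumed floor: commit enough positions to finish in the steps left.
        floor = math.ceil(len(masked) / (steps_per_block - step))
        n_commit = max(len(above), floor)
        for i in ranked[:n_commit]:
            block[i] = preds[i][0]
        # Positions not committed simply stay masked for the next step.
    return block


def toy_predict(block: List[int], t: float) -> List[Prediction]:
    # Stand-in for the denoiser: always predicts token 100 + i and becomes
    # more confident as the masked fraction t falls.
    return [(100 + i, min(1.0, 0.5 + 0.1 * i + 0.5 * (1.0 - t))) for i in range(len(block))]


result = denoise_block(toy_predict, block_size=4, steps_per_block=4, confidence_threshold=0.8)
print(result)
```

Because the floor commits every remaining position on the last step, the block never contains a mask token after `steps_per_block` iterations, whatever the confidences are.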

config.json CHANGED

@@ -54,7 +54,7 @@
   "time_step_floor": 0.0001,
   "time_step_limit": [
     0.0,
-    "Infinity"
   ],
   "time_step_max": 0.1,
   "time_step_min": 0.001,

   "time_step_floor": 0.0001,
   "time_step_limit": [
     0.0,
+    Infinity
   ],
   "time_step_max": 0.1,
   "time_step_min": 0.001,
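
The practical effect of switching `"Infinity"` (a string) to a bare `Infinity` can be checked with Python's standard `json` module, which accepts the bare token by default. The two fragments below are trimmed-down stand-ins for the real `config.json`, not copies of it. Bare `Infinity` is not part of strict JSON, so parsers other than Python's may reject the new file.

```python
import json
import math

# Hypothetical one-key fragments mirroring the old and new config values.
old_fragment = '{"time_step_limit": [0.0, "Infinity"]}'
new_fragment = '{"time_step_limit": [0.0, Infinity]}'

old_limit = json.loads(old_fragment)["time_step_limit"]
new_limit = json.loads(new_fragment)["time_step_limit"]

# Old form: the upper bound is the string "Infinity", not a number.
old_is_string = isinstance(old_limit[1], str)
# New form: Python's json parses the bare token to float("inf").
new_is_inf = isinstance(new_limit[1], float) and math.isinf(new_limit[1])

print(old_is_string, new_is_inf)
```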

inference.py CHANGED

@@ -11,19 +11,46 @@ Usage:
   # AR (context tower only, 1 GPU):
   python inference.py --mode ar
 """
 import argparse
 import torch
 from pathlib import Path
 from transformers import AutoTokenizer
 from modeling_nemotron_twotower import NemotronHTwoTowerForCausalLM
 parser = argparse.ArgumentParser()
-parser.add_argument("--prompt", default="France is a country ")
 parser.add_argument("--model", default=str(Path(__file__).resolve().parent))
 parser.add_argument("--max-new-tokens", type=int, default=128)
-parser.add_argument("--mode", choices=["ar", "mock_ar"], default="mock_ar")
 args = parser.parse_args()
 tokenizer = AutoTokenizer.from_pretrained(args.model)
 model = NemotronHTwoTowerForCausalLM.from_pretrained(
@@ -31,23 +58,67 @@ model = NemotronHTwoTowerForCausalLM.from_pretrained(
 )
 num_gpus = torch.cuda.device_count()
-if args.mode == "mock_ar" and num_gpus >= 2:
     model.place_towers_on_devices("cuda:0", "cuda:1")
 else:
     model.cuda()
 model.eval()
-inputs = tokenizer(args.prompt, return_tensors="pt").to(
     next(model.context_tower.parameters()).device
 )
 if args.mode == "ar":
     outputs = model.generate(**inputs, max_new_tokens=args.max_new_tokens, do_sample=False)
-else:
     outputs = model.generate_mock_ar(
         inputs["input_ids"], max_new_tokens=args.max_new_tokens,
         temperature=0.0, eos_token_id=tokenizer.eos_token_id,
     )
 text = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
 print(text)

   # AR (context tower only, 1 GPU):
   python inference.py --mode ar
+  # Mask diffusion (two-tower, 2 GPUs):
+  python inference.py --mode mask_diffusion --model /path/to/diffusion_hf_out
 """
 import argparse
+import inspect
 import torch
+import random
+import numpy as np
 from pathlib import Path
 from transformers import AutoTokenizer
 from modeling_nemotron_twotower import NemotronHTwoTowerForCausalLM
 parser = argparse.ArgumentParser()
+parser.add_argument("prompt_arg", nargs="?", default=None)
+parser.add_argument("--prompt", default=None)
 parser.add_argument("--model", default=str(Path(__file__).resolve().parent))
 parser.add_argument("--max-new-tokens", type=int, default=128)
+parser.add_argument("--mode", choices=["ar", "mock_ar", "mask_diffusion"], default="mock_ar")
+parser.add_argument("--block-size", type=int, default=16)
+parser.add_argument("--steps-per-block", type=int, default=16)
+parser.add_argument("--mask-token-id", type=int, default=3)
+parser.add_argument("--temperature", type=float, default=0.0)
+parser.add_argument("--top-k", "--top_k", dest="top_k", type=int, default=None)
+parser.add_argument("--confidence-threshold", type=float, default=0.9)
+parser.add_argument("--deterministic", action="store_true")
+parser.add_argument("--seed", type=int, default=42)
+parser.add_argument("--print-diffusion-steps", action="store_true")
+parser.add_argument("--trace-context-layers", action="store_true")
+parser.add_argument("--trace-denoiser-layers", action="store_true")
 args = parser.parse_args()
+prompt = args.prompt if args.prompt is not None else (args.prompt_arg or "France is a country ")
+if args.deterministic:
+    random.seed(args.seed)
+    np.random.seed(args.seed)
+    torch.manual_seed(args.seed)
+    torch.cuda.manual_seed_all(args.seed)
+    torch.backends.cudnn.deterministic = True
+    torch.backends.cudnn.benchmark = False
 tokenizer = AutoTokenizer.from_pretrained(args.model)
 model = NemotronHTwoTowerForCausalLM.from_pretrained(
 )
 num_gpus = torch.cuda.device_count()
+if num_gpus >= 2:
+    # Split towers across GPUs (both towers don't fit on one 80GB card).
+    # AR mode only uses the context tower (cuda:0), but placing both is fine.
     model.place_towers_on_devices("cuda:0", "cuda:1")
+elif args.mode == "ar":
+    # AR uses only the context tower + context head; keep the denoiser tower
+    # off the GPU so a single card suffices.
+    model.context_tower = model.context_tower.cuda()
+    model.context_lm_head = model.context_lm_head.cuda()
 else:
     model.cuda()
 model.eval()
+model.trace_context_layers = args.trace_context_layers
+model.trace_denoiser_layers = args.trace_denoiser_layers
+inputs = tokenizer(prompt, return_tensors="pt").to(
     next(model.context_tower.parameters()).device
 )
 if args.mode == "ar":
     outputs = model.generate(**inputs, max_new_tokens=args.max_new_tokens, do_sample=False)
+elif args.mode == "mock_ar":
     outputs = model.generate_mock_ar(
         inputs["input_ids"], max_new_tokens=args.max_new_tokens,
         temperature=0.0, eos_token_id=tokenizer.eos_token_id,
     )
+else:
+    def step_callback(step_idx, total_steps, tokens, t=None, logits=None, block_idx=0):
+        if not args.print_diffusion_steps:
+            return
+        if logits is None:
+            print(f"\n--- Block {block_idx} Step {step_idx}/{total_steps} | init ---")
+            print("xt:", tokenizer.decode(tokens[0], skip_special_tokens=False))
+            return
+        log_x = model._mdlm_forward(logits, tokens.to(logits.device), args.mask_token_id)
+        probs = log_x.exp()[0]
+        top2_probs, top2_ids = probs.topk(2, dim=-1)
+        n_masked = int((tokens == args.mask_token_id).sum().item())
+        print(f"\n--- Block {block_idx} Step {step_idx}/{total_steps} | masked={n_masked}/{tokens.shape[1]} | t={t:.4f} ---")
+        print("xt:   " + repr(tokenizer.decode(tokens[0], skip_special_tokens=False)))
+        print("top1: " + "|".join(tokenizer.decode([tid.item()])[:9].rjust(9) for tid in top2_ids[:, 0]))
+        print("prb1: " + "|".join(f"{p.item():.3f}".rjust(9) for p in top2_probs[:, 0]))
+        print("top2: " + "|".join(tokenizer.decode([tid.item()])[:9].rjust(9) for tid in top2_ids[:, 1]))
+        print("prb2: " + "|".join(f"{p.item():.3f}".rjust(9) for p in top2_probs[:, 1]))
+    generate_kwargs = dict(
+        max_new_tokens=args.max_new_tokens,
+        block_size=args.block_size,
+        steps_per_block=args.steps_per_block,
+        mask_token_id=args.mask_token_id,
+        temperature=args.temperature,
+        top_k=args.top_k,
+        confidence_threshold=args.confidence_threshold,
+        eos_token_id=tokenizer.eos_token_id,
+    )
+    if (
+        args.print_diffusion_steps
+        and "step_callback" in inspect.signature(model.generate_mask_diffusion).parameters
+    ):
+        generate_kwargs["step_callback"] = step_callback
+    outputs = model.generate_mask_diffusion(inputs["input_ids"], **generate_kwargs)
 text = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
 print(text)
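
The new prompt handling accepts either a positional argument or `--prompt`, with the flag taking precedence and the original default as a fallback. The sketch below isolates just that logic so the precedence can be tried without loading the model; `resolve_prompt` is a hypothetical helper name, and the two `add_argument` calls and the fallback expression are copied from the diff.

```python
import argparse
from typing import List


def resolve_prompt(argv: List[str]) -> str:
    """Reproduce the prompt fallback from inference.py for a given argv."""
    parser = argparse.ArgumentParser()
    parser.add_argument("prompt_arg", nargs="?", default=None)
    parser.add_argument("--prompt", default=None)
    args = parser.parse_args(argv)
    # --prompt wins if given; otherwise the positional; otherwise the default.
    return args.prompt if args.prompt is not None else (args.prompt_arg or "France is a country ")


print(repr(resolve_prompt([])))
print(repr(resolve_prompt(["Hello"])))
print(repr(resolve_prompt(["Hello", "--prompt", "Hi"])))
```

One asymmetry follows from the expression: an explicitly empty `--prompt ""` is kept as an empty prompt, because it is only compared against `None`, while a missing positional falls back to the default.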

modeling_nemotron_h.py CHANGED

@@ -910,7 +910,9 @@ class NemotronHTopkRouter(nn.Module):
     def forward(self, hidden_states):
         hidden_states = hidden_states.view(-1, self.config.hidden_size)
-        router_logits = F.linear(hidden_states.type(torch.float32), self.weight.type(torch.float32))
         scores = router_logits.sigmoid()
         topk_indices = self.get_topk_indices(scores)
         topk_weights = scores.gather(1, topk_indices)
@@ -918,7 +920,7 @@ class NemotronHTopkRouter(nn.Module):
             denominator = topk_weights.sum(dim=-1, keepdim=True) + 1e-20
             topk_weights /= denominator
         topk_weights = topk_weights * self.routed_scaling_factor
-        return topk_indices, topk_weights
 # Copied from transformers.models.llama.modeling_llama.repeat_kv
 def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:

     def forward(self, hidden_states):
         hidden_states = hidden_states.view(-1, self.config.hidden_size)
+        # mcore runs the MoE router in fp64 (--moe-router-dtype fp64); match it so
+        # top-k expert selection is bit-identical at borderline scores.
+        router_logits = F.linear(hidden_states.type(torch.float64), self.weight.type(torch.float64))
         scores = router_logits.sigmoid()
         topk_indices = self.get_topk_indices(scores)
         topk_weights = scores.gather(1, topk_indices)
             denominator = topk_weights.sum(dim=-1, keepdim=True) + 1e-20
             topk_weights /= denominator
         topk_weights = topk_weights * self.routed_scaling_factor
+        return topk_indices, topk_weights.type(torch.float32)
 # Copied from transformers.models.llama.modeling_llama.repeat_kv
 def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
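
The comment added to the router says fp64 is used so that top-k selection matches at borderline scores. A small stdlib-only sketch of why precision matters in that situation: two hypothetical logits that differ by less than fp32 resolution tie after rounding to fp32 but remain ordered in fp64. The helper `to_fp32` and the logit values are assumptions for illustration; the real router works on torch tensors via `F.linear`, which this does not reproduce.

```python
import math
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (fp64) to the nearest fp32-representable value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


# Two made-up expert logits, closer together than fp32 can resolve near 1.0.
logit_a = 1.0
logit_b = 1.0 + 1e-9

# fp64 path: scores stay distinct, so the ranking of the two experts is defined.
score_a64, score_b64 = sigmoid(logit_a), sigmoid(logit_b)
fp64_distinguishes = score_b64 > score_a64

# fp32 path: both logits round to the same value, so the scores tie and the
# top-k choice between them depends on tie-breaking rather than the values.
score_a32 = to_fp32(sigmoid(to_fp32(logit_a)))
score_b32 = to_fp32(sigmoid(to_fp32(logit_b)))
fp32_ties = score_a32 == score_b32

print(fp64_distinguishes, fp32_ties)
```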

modeling_nemotron_twotower.py CHANGED

@@ -30,6 +30,7 @@ try:
         NemotronHForCausalLM,
         NemotronHModel,
         NemotronHPreTrainedModel,
     )
     from .configuration_nemotron_h import NemotronHConfig
 except ImportError:
@@ -39,6 +40,7 @@ except ImportError:
         NemotronHForCausalLM,
         NemotronHModel,
         NemotronHPreTrainedModel,
     )
     from configuration_nemotron_h import NemotronHConfig
@@ -180,8 +182,11 @@ class NemotronHTwoTowerForCausalLM(NemotronHPreTrainedModel, GenerationMixin):
             elif input_ids.shape[1] != cache_position.shape[0]:
                 input_ids = input_ids[:, cache_position]
         else:
-            past_key_values = HybridMambaAttentionDynamicCache(
-                self.config, input_ids.shape[0], self.dtype, device=self.device
             )
         if attention_mask is not None and position_ids is None:
             position_ids = attention_mask.long().cumsum(-1) - 1
@@ -330,18 +335,20 @@ class NemotronHTwoTowerForCausalLM(NemotronHPreTrainedModel, GenerationMixin):
         return {"ctx_cache": cache_p2, "mamba_s2": mamba_s2, "ctx_len": S}
     def _extend_context_cache(self, new_tokens, cache_state):
-        """Extend context cache by new_tokens (B, L). Old S-1 -> new S-2.
-        Processes tokens one at a time so HF Mamba can use its single-step
-        cached path (seq_len=1, cache_position[0] > 0).
         """
         ctx_cache = cache_state["ctx_cache"]
         pattern = self.config.hybrid_override_pattern
         ctx_len = cache_state["ctx_len"]
-        ctx_device = next(self.context_tower.parameters()).device
         L = new_tokens.shape[1]
-        tokens_on_device = new_tokens.to(ctx_device)
         new_s2 = {}
         for i in range(self.config.num_hidden_layers):
             if pattern[i] == "M":
@@ -350,12 +357,37 @@ class NemotronHTwoTowerForCausalLM(NemotronHPreTrainedModel, GenerationMixin):
         cache_state["mamba_s2"] = new_s2
         ctx_cache.has_previous_state = True
-        for j in range(L):
-            cp = torch.tensor([ctx_len + j], device=ctx_device)
-            self._forward_tower_with_cache(
-                self.context_tower, self.context_lm_head,
-                tokens_on_device[:, j:j+1], ctx_cache, cp,
-            )
         cache_state["ctx_len"] = ctx_len + L
         return cache_state
@@ -415,13 +447,139 @@ class NemotronHTwoTowerForCausalLM(NemotronHPreTrainedModel, GenerationMixin):
             self.denoiser_tower, self.lm_head, den_input, den_cache, cp,
         )
-    def _run_denoiser_step_diffusion(self, block_ids, cache_state, t=None):
-        """Diffusion denoiser: pos=ctx_len..ctx_len+L-1, full KV, Mamba S-1.
-        Processes the block token-by-token so the HF Mamba mixer can use its
-        single-step cached path (seq_len=1 with cache_position[0] > 0).
-        This is mathematically equivalent to full-block processing since all
-        layers are causal, and it properly propagates Mamba states from context.
         Args:
             block_ids: (B, L) tokens to denoise
@@ -431,28 +589,74 @@ class NemotronHTwoTowerForCausalLM(NemotronHPreTrainedModel, GenerationMixin):
         Returns: logits (B, L, V)
         """
         ctx_len = cache_state["ctx_len"]
-        den_device = next(self.denoiser_tower.parameters()).device
         den_input = block_ids.to(den_device)
         L = den_input.shape[1]
         t_emb = None
         if t is not None:
             t_dev = t.to(device=den_device, dtype=self.dtype)
             t_repr = self.t_embedder(t_dev)
             t_emb = self.t_block(t_repr)
         den_cache = self._build_denoiser_cache_diffusion(cache_state, den_device)
-        all_logits = []
-        for i in range(L):
-            cp = torch.tensor([ctx_len + i], device=den_device)
-            logits_i = self._forward_tower_with_cache(
-                self.denoiser_tower, self.lm_head, den_input[:, i:i+1],
-                den_cache, cp, t_emb=t_emb,
-            )
-            all_logits.append(logits_i)
-        return torch.cat(all_logits, dim=1)
     # ------------------------------------------------------------------
     # Mock-AR generation (unchanged)
@@ -516,6 +720,7 @@ class NemotronHTwoTowerForCausalLM(NemotronHPreTrainedModel, GenerationMixin):
         top_k=None,
         confidence_threshold=0.9,
         eos_token_id=None,
     ):
         """Block-wise mask diffusion with confidence_unmasking.
@@ -558,6 +763,9 @@ class NemotronHTwoTowerForCausalLM(NemotronHPreTrainedModel, GenerationMixin):
             # Initialize fully masked block
             xt = torch.full((B, block_size), mask_token_id, dtype=torch.long,
                             device=device)
             for step_idx in range(steps_per_block):
                 # t_model = current mask fraction
@@ -626,6 +834,11 @@ class NemotronHTwoTowerForCausalLM(NemotronHPreTrainedModel, GenerationMixin):
                         remask_idx = masked_indices[sort_idx[:num_to_remask[b]]]
                         output[b, remask_idx] = mask_token_id
                 xt = output
             # Block complete — extend context

         NemotronHForCausalLM,
         NemotronHModel,
         NemotronHPreTrainedModel,
+        repeat_kv,
     )
     from .configuration_nemotron_h import NemotronHConfig
 except ImportError:
         NemotronHForCausalLM,
         NemotronHModel,
         NemotronHPreTrainedModel,
+        repeat_kv,
     )
     from configuration_nemotron_h import NemotronHConfig
             elif input_ids.shape[1] != cache_position.shape[0]:
                 input_ids = input_ids[:, cache_position]
         else:
+            # FixedHybridCache (not the base class) so the Mamba mixer finds
+            # conv_kernel_size during the cached forward (needed for AR generate).
+            past_key_values = FixedHybridCache(
+                self.config, input_ids.shape[0], self.dtype,
+                device=next(self.context_tower.parameters()).device,
             )
         if attention_mask is not None and position_ids is None:
             position_ids = attention_mask.long().cumsum(-1) - 1
         return {"ctx_cache": cache_p2, "mamba_s2": mamba_s2, "ctx_len": S}
     def _extend_context_cache(self, new_tokens, cache_state):
+        """Extend context cache by new_tokens (B, L), block-wise (matches mcore).
+        Mamba layers advance via the block chunk-scan from the current state;
+        attention layers append the block KV (causal within block); MoE is plain.
         """
         ctx_cache = cache_state["ctx_cache"]
         pattern = self.config.hybrid_override_pattern
         ctx_len = cache_state["ctx_len"]
+        tower = self.context_tower
+        ctx_device = next(tower.parameters()).device
         L = new_tokens.shape[1]
+        tokens = new_tokens.to(ctx_device)
+        # Snapshot pre-extension Mamba states as the new S-2 (used by mock-AR).
         new_s2 = {}
         for i in range(self.config.num_hidden_layers):
             if pattern[i] == "M":
         cache_state["mamba_s2"] = new_s2
         ctx_cache.has_previous_state = True
+        cache_position = torch.arange(ctx_len, ctx_len + L, device=ctx_device)
+        hidden = tower.embeddings(tokens)
+        causal_mask = tower._update_causal_mask(None, hidden, cache_position)
+        for layer_idx, block in enumerate(tower.layers):
+            residual = hidden
+            h = block.norm(hidden.to(dtype=block.norm.weight.dtype))
+            if block.residual_in_fp32:
+                residual = residual.to(torch.float32)
+            if block.block_type == "mamba":
+                d_conv = block.mixer.conv_kernel_size
+                init_conv = ctx_cache.conv_states[layer_idx][..., -(d_conv - 1):]
+                init_ssm = ctx_cache.ssm_states[layer_idx].contiguous()
+                h, new_conv, new_ssm = self._denoiser_block_mamba(
+                    block.mixer, h, init_conv, init_ssm, return_states=True,
+                )
+                ctx_cache.conv_states[layer_idx] = new_conv
+                ctx_cache.ssm_states[layer_idx] = new_ssm
+            elif block.block_type == "attention":
+                # Standard cached attention appends block KV (causal within block).
+                h, _, _ = block.mixer(
+                    h, attention_mask=causal_mask,
+                    past_key_value=ctx_cache, cache_position=cache_position,
+                )
+            elif block.block_type in ["mlp", "moe"]:
+                h = block.mixer(h)
+            else:
+                raise ValueError(f"Unknown block_type: {block.block_type}")
+            hidden = residual + h
         cache_state["ctx_len"] = ctx_len + L
         return cache_state
             self.denoiser_tower, self.lm_head, den_input, den_cache, cp,
         )
+    def _denoiser_block_attention(self, mixer, hidden, ctx_k, ctx_v):
+        """Bidirectional denoiser self-attention over [context_KV | block_KV].
+        Mirrors the mcore `_forward_attn_with_past` (is_causal=False, no mask):
+        every block position attends to ALL context positions and ALL block
+        positions (the noisy block is processed bidirectionally within itself).
+        Args:
+            mixer: NemotronHAttention module (provides q/k/v/o projections)
+            hidden: (B, L, D) post-norm (and post-modulation) block hidden states
+            ctx_k, ctx_v: context KV, each (B, num_kv_heads, ctx_len, head_dim)
+        Returns: (B, L, D) attention output (before residual add)
+        """
+        bsz, q_len, _ = hidden.shape
+        q = mixer.q_proj(hidden).view(bsz, q_len, mixer.num_heads, mixer.head_dim).transpose(1, 2)
+        k = mixer.k_proj(hidden).view(bsz, q_len, mixer.num_key_value_heads, mixer.head_dim).transpose(1, 2)
+        v = mixer.v_proj(hidden).view(bsz, q_len, mixer.num_key_value_heads, mixer.head_dim).transpose(1, 2)
+        # Concatenate context KV (past) with current block KV on the sequence dim.
+        k = torch.cat([ctx_k.to(k.dtype), k], dim=2)
+        v = torch.cat([ctx_v.to(v.dtype), v], dim=2)
+        # GQA: expand KV heads to match query heads.
+        k = repeat_kv(k, mixer.num_key_value_groups)
+        v = repeat_kv(v, mixer.num_key_value_groups)
+        # Full (non-causal) attention: block sees all context + whole block.
+        attn_output = F.scaled_dot_product_attention(
+            q, k, v, attn_mask=None, dropout_p=0.0, is_causal=False,
+        )
+        attn_output = attn_output.transpose(1, 2).contiguous().view(
+            bsz, q_len, mixer.num_heads * mixer.head_dim
+        )
+        return mixer.o_proj(attn_output)
+    def _denoiser_block_mamba(self, mixer, hidden, init_conv, init_ssm, return_states=False):
+        """Chunk-scan the whole block through the Mamba mixer, seeded from the
+        context state — mirrors mcore `forward_mamba_layer_with_states`
+        (non-bidirectional). Uses the same mamba_ssm/causal_conv1d kernels as
+        mcore, instead of HF's token-by-token single-step path (which is both a
+        numerical mismatch and crashes in this env's causal_conv1d_update).
+        Args:
+            mixer: NemotronHMamba2Mixer
+            hidden: (B, L, D) post-norm (and post-modulation) block hidden states
+            init_conv: (B, conv_dim, d_conv-1) context conv state, or None
+            init_ssm:  (B, nheads, headdim, d_state) context SSM state, or None
+            return_states: also return the updated (conv_state[width d_conv], ssm_state)
+                so the caller can advance a KV/Mamba cache (used by context extend).
+        Returns: (B, L, D) mixer output (before adaLN gate / residual);
+                 or (output, new_conv_state, new_ssm_state) if return_states.
+        """
+        from einops import rearrange
+        from mamba_ssm.ops.triton.ssd_combined import mamba_chunk_scan_combined
+        from causal_conv1d import causal_conv1d_fn
+        d_inner = mixer.intermediate_size
+        ngroups = mixer.n_groups
+        d_state = mixer.ssm_state_size
+        headdim = mixer.head_dim
+        conv_dim = mixer.conv_dim
+        d_conv = mixer.conv_kernel_size
+        proj = mixer.in_proj(hidden)                       # (B, L, d_inner+conv_dim+nheads)
+        z, xBC, dt = torch.split(proj, [d_inner, conv_dim, mixer.num_heads], dim=-1)
+        # causal_conv1d_fn with initial_states requires channel-last layout:
+        #  - input (B, conv_dim, L): use the transpose VIEW (stride(1)==1), no .contiguous()
+        #  - initial_states (B, conv_dim, d_conv-1): force channel-last via the
+        #    transpose->contiguous->transpose trick (mcore _run_denoiser_step).
+        if init_conv is not None:
+            init_conv = init_conv.transpose(-1, -2).contiguous().transpose(-1, -2)
+        xBC_conv = causal_conv1d_fn(
+            xBC.transpose(1, 2),                           # (B, conv_dim, L) channel-last view
+            mixer.conv1d.weight.squeeze(1),
+            mixer.conv1d.bias,
+            activation=mixer.activation,
+            initial_states=init_conv,
+        ).transpose(1, 2)                                  # (B, L, conv_dim)
+        x, B_proj, C_proj = torch.split(
+            xBC_conv, [d_inner, ngroups * d_state, ngroups * d_state], dim=-1
+        )
+        x = rearrange(x, "b s (h p) -> b s h p", p=headdim).contiguous()
+        B_proj = rearrange(B_proj, "b s (g n) -> b s g n", n=d_state).contiguous()
+        C_proj = rearrange(C_proj, "b s (g n) -> b s g n", n=d_state).contiguous()
+        A = -torch.exp(mixer.A_log.float())
+        scan = mamba_chunk_scan_combined(
+            x, dt.contiguous(), A, B_proj, C_proj, mixer.chunk_size,
+            D=mixer.D, z=None,
+            dt_bias=mixer.dt_bias.float(), dt_softplus=True,
+            initial_states=init_ssm,
+            return_final_states=return_states,
+        )
+        if return_states:
+            y, new_ssm = scan
+        else:
+            y = scan
+        y = rearrange(y, "b s h p -> b s (h p)")
+        y = mixer.norm(y, z)                               # Mamba2 z-gated RMSNorm
+        out = mixer.out_proj(y)
+        if not return_states:
+            return out
+        # New conv state: the HF cache stores the last d_conv raw (pre-conv) xBC
+        # inputs, most recent at index -1. Blocks normally satisfy L >= d_conv;
+        # the else branch prepends prior history (or zeros) for shorter blocks.
+        L = xBC.shape[1]
+        if L >= d_conv:
+            new_conv = xBC[:, -d_conv:, :].transpose(1, 2).contiguous()
+        else:
+            hist = init_conv if init_conv is not None else xBC.new_zeros(xBC.shape[0], conv_dim, d_conv - 1)
+            comb = torch.cat([hist.transpose(1, 2), xBC], dim=1)
+            new_conv = comb[:, -d_conv:, :].transpose(1, 2).contiguous()
+        return out, new_conv, new_ssm
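Review note: the conv-state update at the end of `_denoiser_block_mamba` is a rolling window over the most recent `d_conv` raw inputs. The sketch below reproduces that logic for a single channel using plain Python lists, so the short-block branch is easy to check by hand. The function name `update_conv_state` and the list-based layout are illustrative assumptions, not part of the patch.

```python
def update_conv_state(prev_tail, new_inputs, d_conv):
    """Return the d_conv most recent inputs, oldest first (hypothetical helper).

    prev_tail:  the d_conv - 1 inputs seen before this block, or None for
                no history (treated as zeros, as in the patch).
    new_inputs: raw inputs of the current block, oldest first.
    """
    if len(new_inputs) >= d_conv:
        # The block alone fills the window; earlier history is irrelevant.
        return list(new_inputs[-d_conv:])
    if prev_tail is None:
        prev_tail = [0] * (d_conv - 1)
    # Short block: prepend history, then keep the last d_conv entries.
    combined = list(prev_tail) + list(new_inputs)
    return combined[-d_conv:]


long_block = update_conv_state([1, 2, 3], [4, 5, 6, 7, 8], d_conv=4)
short_block = update_conv_state([1, 2, 3], [4, 5], d_conv=4)
no_history = update_conv_state(None, [7], d_conv=4)
```

Because the history has `d_conv - 1` entries and a block has at least one token, the concatenation always holds at least `d_conv` entries, so the final slice never comes up short.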
+    def _run_denoiser_step_diffusion(self, block_ids, cache_state, t=None):
+        """Diffusion denoiser forward over the FULL block (B, L) in one pass.
+        Parity with mcore `_run_denoiser_step`:
+          - Attention layers run BIDIRECTIONALLY within the block, attending to
+            the full context KV cache + the whole noisy block (is_causal=False).
+            A token-by-token causal pass would hide later block positions from
+            earlier ones.
+          - Mamba layers are causal/forward-only (bidirectional_mamba=False) and
+            are chunk-scanned over the whole block from the context state (S-1),
+            matching mcore's `forward_mamba_layer_with_states`.
+          - Time conditioning (adaLN-single) is applied per layer. The modulate/norm
+            ORDER depends on where mcore's norm lives: mamba & attention norms are
+            FUSED into in_proj/linear_qkv (applied AFTER modulate) -> modulate THEN
+            norm; MoE uses a separate pre_mlp_layernorm -> norm THEN modulate.
+            Gate is applied to the mixer output in all cases.
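         Args:
             block_ids: (B, L) tokens to denoise
+            cache_state: context cache state; must provide "ctx_len" and is
+                used to seed the denoiser cache
+            t: optional diffusion time for adaLN conditioning; if None, no
+                modulation or gating is applied
         Returns: logits (B, L, V)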
         """
         ctx_len = cache_state["ctx_len"]
+        tower = self.denoiser_tower
+        den_device = next(tower.parameters()).device
         den_input = block_ids.to(den_device)
         L = den_input.shape[1]
+        # Time embedding -> per-layer modulation params (shift, scale, gate).
         t_emb = None
         if t is not None:
             t_dev = t.to(device=den_device, dtype=self.dtype)
             t_repr = self.t_embedder(t_dev)
             t_emb = self.t_block(t_repr)
+        # Fresh denoiser cache seeded from context: Mamba S-1 state + full context KV.
         den_cache = self._build_denoiser_cache_diffusion(cache_state, den_device)
+        hidden = tower.embeddings(den_input)
+        for layer_idx, block in enumerate(tower.layers):
+            residual = hidden
+            if block.residual_in_fp32:
+                residual = residual.to(torch.float32)
+            mod = None
+            if t_emb is not None:
+                mod = _get_mod_params(t_emb, self.scale_shift_tables[layer_idx])
+                shift, scale, gate = mod
+            # adaLN modulate vs norm ORDER depends on where mcore's norm lives:
+            #   - mamba/attention: norm is FUSED into in_proj/linear_qkv and is
+            #     applied AFTER the explicit modulate  -> modulate THEN norm.
+            #   - moe/mlp: separate pre_mlp_layernorm applied BEFORE modulate
+            #     -> norm THEN modulate.
+            if block.block_type in ("mamba", "attention"):
+                h = hidden
+                if mod is not None:
+                    h = _modulate(h, shift, scale)
+                h = block.norm(h.to(dtype=block.norm.weight.dtype))
+            else:  # mlp / moe
+                h = block.norm(hidden.to(dtype=block.norm.weight.dtype))
+                if mod is not None:
+                    h = _modulate(h, shift, scale)
+            if block.block_type == "mamba":
+                # Chunk-scan the whole block in one kernel launch, seeded from the
+                # context Mamba state (matches mcore forward_mamba_layer_with_states).
+                # HF conv_states are width d_conv; causal_conv1d_fn's initial_states
+                # wants the d_conv-1 most-recent columns.
+                d_conv = block.mixer.conv_kernel_size
+                init_conv = den_cache.conv_states[layer_idx][..., -(d_conv - 1):]
+                init_ssm = den_cache.ssm_states[layer_idx].contiguous()
+                h = self._denoiser_block_mamba(block.mixer, h, init_conv, init_ssm)
+            elif block.block_type == "attention":
+                ctx_k = den_cache.key_cache[layer_idx]
+                ctx_v = den_cache.value_cache[layer_idx]
+                h = self._denoiser_block_attention(block.mixer, h, ctx_k, ctx_v)
+            elif block.block_type in ("mlp", "moe"):
+                h = block.mixer(h)
+            else:
+                raise ValueError(f"Unknown block_type: {block.block_type}")
+            if mod is not None:
+                h = gate.unsqueeze(1) * h
+            hidden = residual + h
+        hidden = tower.norm_f(hidden)
+        logits = self.lm_head(hidden.to(self.lm_head.weight.dtype)).float()
+        return logits
     # ------------------------------------------------------------------
     # Mock-AR generation (unchanged)
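Review note: the modulate/norm ordering in `_run_denoiser_step_diffusion` changes the numerical result, which is why the two branches cannot be merged. The toy below shows this on a single feature vector in plain Python. It assumes that `_modulate` follows the common adaLN form `x * (1 + scale) + shift` and that `block.norm` behaves like an RMSNorm; neither helper is shown in this hunk, so both are assumptions. All function names here are hypothetical.

```python
import math


def rms_norm(x, weight, eps=1e-5):
    """Toy RMSNorm over one feature vector (assumed norm type)."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def modulate(x, shift, scale):
    """Assumed adaLN form: x * (1 + scale) + shift."""
    return [v * (1.0 + s) + b for v, s, b in zip(x, scale, shift)]


def mixer_input(x, weight, shift, scale, block_type):
    """Mirror the branch in the patch: order depends on the block type."""
    if block_type in ("mamba", "attention"):
        # Norm is fused downstream in mcore: modulate THEN norm.
        return rms_norm(modulate(x, shift, scale), weight)
    # mlp / moe use a separate pre-norm: norm THEN modulate.
    return modulate(rms_norm(x, weight), shift, scale)


def gated_residual(residual, mixer_out, gate):
    """Gate scales the mixer output before the residual add."""
    return [r + g * h for r, g, h in zip(residual, gate, mixer_out)]


x = [1.0, 2.0, 3.0, 4.0]
weight = [1.0] * 4
shift, scale = [0.5] * 4, [0.1] * 4
fused = mixer_input(x, weight, shift, scale, "mamba")
separate = mixer_input(x, weight, shift, scale, "moe")
```

With zero shift and scale the two paths coincide, so an ordering mistake is invisible without time conditioning and only shows up once `t` is supplied.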
         top_k=None,
         confidence_threshold=0.9,
         eos_token_id=None,
+        step_callback=None,
     ):
         """Block-wise mask diffusion with confidence_unmasking.
             # Initialize fully masked block
             xt = torch.full((B, block_size), mask_token_id, dtype=torch.long,
                             device=device)
+            if step_callback is not None:
+                step_callback(0, steps_per_block, xt, t=1.0, logits=None,
+                              block_idx=block_idx)
             for step_idx in range(steps_per_block):
                 # t_model = current mask fraction
                         remask_idx = masked_indices[sort_idx[:num_to_remask[b]]]
                         output[b, remask_idx] = mask_token_id
+                if step_callback is not None:
+                    step_callback(step_idx, steps_per_block, xt,
+                                  t=float(t_model.detach().cpu()), logits=logits,
+                                  block_idx=block_idx)
                 xt = output
             # Block complete — extend context
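Review note: a minimal sketch of a callable that could be passed as `step_callback`, matching the two call sites in this hunk (three positional arguments plus the `t`, `logits`, and `block_idx` keywords). The class name `StepRecorder` and the positional parameter names are assumptions. Since the pre-loop call and the first in-loop call both pass step index 0, the recorder uses `logits is None` to tell them apart.

```python
class StepRecorder:
    """Collects one record per callback invocation (hypothetical helper)."""

    def __init__(self):
        self.records = []

    def __call__(self, step_idx, steps_per_block, xt,
                 t=None, logits=None, block_idx=None):
        # Snapshot the tokens so later in-place edits do not alter the record.
        tokens = xt.tolist() if hasattr(xt, "tolist") else list(xt)
        self.records.append({
            "block_idx": block_idx,
            "step_idx": step_idx,
            "steps_per_block": steps_per_block,
            "t": t,
            "tokens": tokens,
            # The pre-loop call passes logits=None; in-loop calls pass logits.
            "is_initial": logits is None,
        })


recorder = StepRecorder()
# Simulate the two call shapes from the patch using plain lists.
recorder(0, 2, [[9, 9]], t=1.0, logits=None, block_idx=0)
recorder(0, 2, [[9, 9]], t=1.0, logits=[[0.1, 0.9]], block_idx=0)
```

The recorder deliberately drops the logits themselves and keeps only a flag, which avoids holding a (B, L, V) tensor alive for every step.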