fitsumreda Claude Opus 4.8 committed 29 days ago

Commit 67bf233 (1 parent: c739325)

Fix NaN corruption in long-context diffusion (fp32 denoiser SSM scan) + multi-request inference

Denoiser Mamba chunk-scan ran in bf16. With a long context the seeded SSM
state grows large (e.g. ~5e3 at L00 for a 1042-token prompt) and the bf16
scan overflows to NaN. Because the Triton kernel's reductions are not
bit-deterministic this struck nondeterministically: a NaN on a block's
all-masked first step makes every confidence NaN, so `NaN > threshold` is
False, the fallback commits 1 token, and sorting NaN confidences force-commits
an arbitrary garbage token (e.g. "katalog"/"hips"), wrecking the answer.
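
The failure chain above rests on how NaN behaves in comparisons. A minimal sketch in plain Python follows; `pick_positions_to_commit` is a hypothetical stand-in for a threshold-then-fallback selection step and is not the model's code. The exact position chosen by the real tensor sort may differ, but the point carries over: once every confidence is NaN, the choice no longer depends on the model's predictions.

```python
import math


def pick_positions_to_commit(confidences, threshold):
    """Hypothetical threshold-then-fallback selection (not the model's code)."""
    # Commit every masked position whose confidence clears the threshold.
    above = [i for i, c in enumerate(confidences) if c > threshold]
    if above:
        return above
    # Fallback: commit exactly one position, nominally the most confident one.
    best = max(range(len(confidences)), key=lambda i: confidences[i])
    return [best]


healthy = [0.20, 0.95, 0.40, 0.91]
poisoned = [math.nan] * 4  # every confidence is NaN after a NaN scan output

healthy_pick = pick_positions_to_commit(healthy, threshold=0.9)
poisoned_pick = pick_positions_to_commit(poisoned, threshold=0.9)

print(healthy_pick)                    # [1, 3]
print(any(c > 0.9 for c in poisoned))  # False: NaN never clears the threshold
print(poisoned_pick)                   # [0]: chosen by iteration order, not confidence
```

In this sketch a cheap guard such as `all(math.isfinite(c) for c in confidences)` before the selection would surface the problem instead of committing a token silently.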

Fix: run the SSM scan in fp32 (x/dt/B/C/D upcast; init_ssm too; cast back
before the gated norm). The scan spans one <=16-token block so cost is
negligible. Also covers the block-to-block context-extend path (same helper).
NOTE: this is broader than mcore (which keeps x/B/C/dt in bf16, fp32 only for
A/D/dt_bias/state); kept as a stability safety net pending the state-magnitude
parity investigation vs mcore.
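
As a rough illustration of why bf16 is a harsh format at the state magnitude quoted above, the sketch below emulates bfloat16 rounding with the standard library. It does not reproduce the NaN itself or the Triton kernel; `to_bf16` and `to_fp32` are hypothetical helpers, and the per-token update of 1.0 is an arbitrary assumption. It only shows that near 5e3 bf16 values sit 32 apart, so small updates across a 16-token block vanish, while fp32 keeps them.

```python
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (ties to even) via its float32 bits."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest even on the low 16 bits
    bits &= 0xFFFF0000                   # keep sign, 8 exponent bits, 7 mantissa bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def to_fp32(x: float) -> float:
    """Round x to the nearest float32 value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]


state_bf16 = to_bf16(5000.0)  # magnitude taken from the commit message (~5e3)
state_fp32 = to_fp32(5000.0)

for _ in range(16):  # one block of 16 tokens, each adding an assumed update of 1.0
    state_bf16 = to_bf16(state_bf16 + 1.0)
    state_fp32 = to_fp32(state_fp32 + 1.0)

print(state_bf16)  # 4992.0: every update was rounded away
print(state_fp32)  # 5016.0
```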

inference.py: add --prompt-file (jsonl, mcore format) -> Request i/N loop.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Files changed (2)

inference.py +95 -70
modeling_nemotron_twotower.py +13 -4

inference.py CHANGED

@@ -28,6 +28,9 @@ from modeling_nemotron_twotower import NemotronHTwoTowerForCausalLM
 parser = argparse.ArgumentParser()
 parser.add_argument("prompt_arg", nargs="?", default=None)
 parser.add_argument("--prompt", default=None)
 parser.add_argument("--model", default=str(Path(__file__).resolve().parent))
 parser.add_argument("--max-new-tokens", type=int, default=128)
 parser.add_argument("--mode", choices=["ar", "mock_ar", "mask_diffusion"], default="mock_ar")
@@ -74,76 +77,97 @@ else:
 model.eval()
 model.trace_context_layers = args.trace_context_layers
 model.trace_denoiser_layers = args.trace_denoiser_layers
-inputs = tokenizer(prompt, return_tensors="pt").to(
-    next(model.context_tower.parameters()).device
-)
-t0 = time.perf_counter()
-if args.mode == "ar":
-    # Context-tower-only AR via our cached single-step path (the fair ST-AR
-    # baseline). Avoids HF generate()'s cache path that crashes on this env.
-    outputs = model.generate_ar(
-        inputs["input_ids"], max_new_tokens=args.max_new_tokens,
-        temperature=0.0, eos_token_id=tokenizer.eos_token_id,
-    )
-elif args.mode == "mock_ar":
-    outputs = model.generate_mock_ar(
-        inputs["input_ids"], max_new_tokens=args.max_new_tokens,
-        temperature=0.0, eos_token_id=tokenizer.eos_token_id,
-    )
 else:
-    def step_callback(step_idx, total_steps, tokens, t=None, logits=None, block_idx=0):
-        if not args.print_diffusion_steps:
-            return
-        if logits is None:
-            print(f"\n--- Block {block_idx} Step {step_idx}/{total_steps} | init ---")
-            print("xt:", tokenizer.decode(tokens[0], skip_special_tokens=False))
-            return
-        log_x = model._mdlm_forward(logits, tokens.to(logits.device), args.mask_token_id)
-        probs = log_x.exp()[0]
-        top2_probs, top2_ids = probs.topk(2, dim=-1)
-        n_masked = int((tokens == args.mask_token_id).sum().item())
-        print(f"\n--- Block {block_idx} Step {step_idx}/{total_steps} | masked={n_masked}/{tokens.shape[1]} | t={t:.4f} ---")
-        print("xt:   " + repr(tokenizer.decode(tokens[0], skip_special_tokens=False)))
-        print("top1: " + "|".join(tokenizer.decode([tid.item()])[:9].rjust(9) for tid in top2_ids[:, 0]))
-        print("prb1: " + "|".join(f"{p.item():.3f}".rjust(9) for p in top2_probs[:, 0]))
-        print("top2: " + "|".join(tokenizer.decode([tid.item()])[:9].rjust(9) for tid in top2_ids[:, 1]))
-        print("prb2: " + "|".join(f"{p.item():.3f}".rjust(9) for p in top2_probs[:, 1]))
-    generate_kwargs = dict(
-        max_new_tokens=args.max_new_tokens,
-        block_size=args.block_size,
-        steps_per_block=args.steps_per_block,
-        mask_token_id=args.mask_token_id,
-        temperature=args.temperature,
-        top_k=args.top_k,
-        confidence_threshold=args.confidence_threshold,
-        eos_token_id=tokenizer.eos_token_id,
-    )
-    if (
-        args.print_diffusion_steps
-        and "step_callback" in inspect.signature(model.generate_mask_diffusion).parameters
-    ):
-        generate_kwargs["step_callback"] = step_callback
-    outputs = model.generate_mask_diffusion(inputs["input_ids"], **generate_kwargs)
-if torch.cuda.is_available():
-    torch.cuda.synchronize()
-elapsed = max(time.perf_counter() - t0, 1e-9)
-prompt_len = inputs["input_ids"].shape[1]
-gen_ids = outputs[0][prompt_len:]
-n_new = int(gen_ids.shape[0])
-text = tokenizer.decode(gen_ids, skip_special_tokens=True)
-nfe = getattr(model, "_last_nfe", None)
 print("\n" + "=" * 70)
-print("--- Request 1/1 ---")
-print(f"Prompt: {prompt}")
-_nfe_str = f"{nfe} NFE, " if (args.mode == "mask_diffusion" and nfe is not None) else ""
-print(f"Generated ({_nfe_str}{n_new} tokens, {elapsed:.2f}s, {n_new / elapsed:.1f} tok/s):")
-print(text)
-print("=" * 70)
 if args.mode == "mask_diffusion":
     print("Two-Tower mask-diffusion generation complete")
     print("=" * 70)
@@ -156,8 +180,9 @@ if args.mode == "mask_diffusion":
     print(f"  top_k:                {args.top_k}")
     print(f"  confidence_threshold: {args.confidence_threshold}")
     print(f"  mask_token_id:        {args.mask_token_id}")
-    print(f"  NFE:                  {nfe}")
-    print(f"  wall_clock:           {elapsed:.2f}s")
-    print(f"  throughput:           {n_new / elapsed:.1f} tokens/s")
     print(f"  model:                {args.model}")
     print("=" * 70)

 parser = argparse.ArgumentParser()
 parser.add_argument("prompt_arg", nargs="?", default=None)
 parser.add_argument("--prompt", default=None)
+parser.add_argument("--prompt-file", dest="prompt_file", default=None,
+                    help="jsonl of {\"text\": ...} per line (same format as mcore "
+                         "--prompt-file); each line is run as its own Request i/N.")
 parser.add_argument("--model", default=str(Path(__file__).resolve().parent))
 parser.add_argument("--max-new-tokens", type=int, default=128)
 parser.add_argument("--mode", choices=["ar", "mock_ar", "mask_diffusion"], default="mock_ar")
 model.eval()
 model.trace_context_layers = args.trace_context_layers
 model.trace_denoiser_layers = args.trace_denoiser_layers
+# Build the request list. A --prompt-file (jsonl, one {"text": ...} per line,
+# same format mcore consumes) runs as multiple Requests i/N; otherwise the
+# single positional/--prompt is the lone request.
+if args.prompt_file:
+    import json
+    prompts = []
+    with open(args.prompt_file) as f:
+        for line in f:
+            line = line.strip()
+            if line:
+                prompts.append(json.loads(line)["text"])
+    if not prompts:
+        raise ValueError(f"No prompts found in {args.prompt_file}")
 else:
+    prompts = [prompt]
+def step_callback(step_idx, total_steps, tokens, t=None, logits=None, block_idx=0):
+    if not args.print_diffusion_steps:
+        return
+    if logits is None:
+        print(f"\n--- Block {block_idx} Step {step_idx}/{total_steps} | init ---")
+        print("xt:", tokenizer.decode(tokens[0], skip_special_tokens=False))
+        return
+    log_x = model._mdlm_forward(logits, tokens.to(logits.device), args.mask_token_id)
+    probs = log_x.exp()[0]
+    top2_probs, top2_ids = probs.topk(2, dim=-1)
+    n_masked = int((tokens == args.mask_token_id).sum().item())
+    print(f"\n--- Block {block_idx} Step {step_idx}/{total_steps} | masked={n_masked}/{tokens.shape[1]} | t={t:.4f} ---")
+    print("xt:   " + repr(tokenizer.decode(tokens[0], skip_special_tokens=False)))
+    print("top1: " + "|".join(tokenizer.decode([tid.item()])[:9].rjust(9) for tid in top2_ids[:, 0]))
+    print("prb1: " + "|".join(f"{p.item():.3f}".rjust(9) for p in top2_probs[:, 0]))
+    print("top2: " + "|".join(tokenizer.decode([tid.item()])[:9].rjust(9) for tid in top2_ids[:, 1]))
+    print("prb2: " + "|".join(f"{p.item():.3f}".rjust(9) for p in top2_probs[:, 1]))
+ctx_device = next(model.context_tower.parameters()).device
+n_requests = len(prompts)
+for ridx, prompt in enumerate(prompts):
+    inputs = tokenizer(prompt, return_tensors="pt").to(ctx_device)
+    if args.print_diffusion_steps and args.mode == "mask_diffusion":
+        print(f"\n--- Diffusion steps for request {ridx + 1} ---")
+    t0 = time.perf_counter()
+    if args.mode == "ar":
+        # Context-tower-only AR via our cached single-step path (the fair ST-AR
+        # baseline). Avoids HF generate()'s cache path that crashes on this env.
+        outputs = model.generate_ar(
+            inputs["input_ids"], max_new_tokens=args.max_new_tokens,
+            temperature=0.0, eos_token_id=tokenizer.eos_token_id,
+        )
+    elif args.mode == "mock_ar":
+        outputs = model.generate_mock_ar(
+            inputs["input_ids"], max_new_tokens=args.max_new_tokens,
+            temperature=0.0, eos_token_id=tokenizer.eos_token_id,
+        )
+    else:
+        generate_kwargs = dict(
+            max_new_tokens=args.max_new_tokens,
+            block_size=args.block_size,
+            steps_per_block=args.steps_per_block,
+            mask_token_id=args.mask_token_id,
+            temperature=args.temperature,
+            top_k=args.top_k,
+            confidence_threshold=args.confidence_threshold,
+            eos_token_id=tokenizer.eos_token_id,
+        )
+        if (
+            args.print_diffusion_steps
+            and "step_callback" in inspect.signature(model.generate_mask_diffusion).parameters
+        ):
+            generate_kwargs["step_callback"] = step_callback
+        outputs = model.generate_mask_diffusion(inputs["input_ids"], **generate_kwargs)
+    if torch.cuda.is_available():
+        torch.cuda.synchronize()
+    elapsed = max(time.perf_counter() - t0, 1e-9)
+    prompt_len = inputs["input_ids"].shape[1]
+    gen_ids = outputs[0][prompt_len:]
+    n_new = int(gen_ids.shape[0])
+    text = tokenizer.decode(gen_ids, skip_special_tokens=True)
+    nfe = getattr(model, "_last_nfe", None)
+    print(f"\n--- Request {ridx + 1}/{n_requests} ---")
+    print(f"Prompt: {prompt}")
+    _nfe_str = f"{nfe} NFE, " if (args.mode == "mask_diffusion" and nfe is not None) else ""
+    print(f"Generated ({_nfe_str}{n_new} tokens, {elapsed:.2f}s, {n_new / elapsed:.1f} tok/s):")
+    print(text)
 print("\n" + "=" * 70)
 if args.mode == "mask_diffusion":
     print("Two-Tower mask-diffusion generation complete")
     print("=" * 70)
     print(f"  top_k:                {args.top_k}")
     print(f"  confidence_threshold: {args.confidence_threshold}")
     print(f"  mask_token_id:        {args.mask_token_id}")
+    print(f"  num_requests:         {n_requests}")
     print(f"  model:                {args.model}")
     print("=" * 70)
+else:
+    print("Two-tower generation complete")
+    print("=" * 70)
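
The request-list construction added in this file can be tried on its own. The sketch below mirrors the parsing loop from the diff (one JSON object with a `text` key per line, blank lines skipped); the wrapper name `load_prompts`, the file name `prompts.jsonl`, and the sample prompts are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def load_prompts(path):
    """Read a jsonl prompt file: one {"text": ...} object per non-blank line."""
    prompts = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                prompts.append(json.loads(line)["text"])
    if not prompts:
        raise ValueError(f"No prompts found in {path}")
    return prompts


with tempfile.TemporaryDirectory() as tmp:
    prompt_path = Path(tmp) / "prompts.jsonl"  # hypothetical file name
    lines = [
        json.dumps({"text": "Once upon a time,"}),
        "",  # blank lines are tolerated and skipped
        json.dumps({"text": "The capital of France is"}),
    ]
    prompt_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    prompts = load_prompts(prompt_path)

for ridx, prompt in enumerate(prompts):
    print(f"--- Request {ridx + 1}/{len(prompts)} ---")
    print(f"Prompt: {prompt}")
```

A file written this way would then be passed to the script as `--prompt-file prompts.jsonl`, with each line reported as its own Request i/N.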

modeling_nemotron_twotower.py CHANGED

@@ -554,19 +554,28 @@ class NemotronHTwoTowerForCausalLM(NemotronHPreTrainedModel, GenerationMixin):
         B_proj = rearrange(B_proj, "b s (g n) -> b s g n", n=d_state).contiguous()
         C_proj = rearrange(C_proj, "b s (g n) -> b s g n", n=d_state).contiguous()
         A = -torch.exp(mixer.A_log.float())
         scan = mamba_chunk_scan_combined(
-            x, dt.contiguous(), A, B_proj, C_proj, mixer.chunk_size,
-            D=mixer.D, z=None,
             dt_bias=mixer.dt_bias.float(), dt_softplus=True,
-            initial_states=init_ssm,
             return_final_states=return_states,
         )
         if return_states:
             y, new_ssm = scan
         else:
             y = scan
-        y = rearrange(y, "b s h p -> b s (h p)")
         y = mixer.norm(y, z)                               # Mamba2 z-gated RMSNorm
         out = mixer.out_proj(y)
         if not return_states:

         B_proj = rearrange(B_proj, "b s (g n) -> b s g n", n=d_state).contiguous()
         C_proj = rearrange(C_proj, "b s (g n) -> b s g n", n=d_state).contiguous()
+        # Run the SSM scan in fp32. With a long context the seeded SSM state gets
+        # large (O(1e3)+); the bf16 chunk-scan then overflows to NaN, and because
+        # the Triton kernel's reductions are not bit-deterministic this strikes
+        # nondeterministically (a NaN on a block's first/all-masked step force-
+        # commits a garbage token, e.g. "katalog"/"hips", and wrecks the answer).
+        # The scan spans only one block (<=16 tokens) so fp32 is essentially free,
+        # and it is strictly more accurate. Cast back before the gated norm.
+        _y_dtype = z.dtype
         A = -torch.exp(mixer.A_log.float())
         scan = mamba_chunk_scan_combined(
+            x.float(), dt.float().contiguous(), A, B_proj.float(), C_proj.float(),
+            mixer.chunk_size,
+            D=mixer.D.float(), z=None,
             dt_bias=mixer.dt_bias.float(), dt_softplus=True,
+            initial_states=(init_ssm.float() if init_ssm is not None else None),
             return_final_states=return_states,
         )
         if return_states:
             y, new_ssm = scan
         else:
             y = scan
+        y = rearrange(y, "b s h p -> b s (h p)").to(_y_dtype)
         y = mixer.norm(y, z)                               # Mamba2 z-gated RMSNorm
         out = mixer.out_proj(y)
         if not return_states: