Nemotron TwoTower NVFP4 for Atlas

This repository contains a working, Atlas-compatible NVFP4 quantization of nvidia/Nemotron-Labs-TwoTower-30B-A3B-Base-BF16.

The checkpoint was prepared from a local ModelOpt NVFP4 export of NemotronHTwoTowerForCausalLM and repaired for Atlas causal inference. The repaired payload is intended for the OpenAI-compatible Atlas inference API using the context tower.

What was repaired

The original local ModelOpt NVFP4 export had defective scale tensors for the routed experts in the context tower, which caused incoherent output when loaded by Atlas. To fix this, the context-tower routed expert matrices were re-quantized from the BF16 source weights and written back into the NVFP4 safetensors layout.
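For readers unfamiliar with what re-quantizing a matrix to NVFP4 involves, the following is a minimal NumPy sketch of block-scaled FP4 (E2M1) quantization. It is not the repair script used for this checkpoint. The function names are hypothetical, and the sketch keeps block scales in float32, whereas the real format stores them as FP8 E4M3 alongside a per-tensor FP32 scale and packs two 4-bit codes per byte.

```python
import numpy as np

# Magnitudes representable by FP4 E2M1 (the sign is tracked separately here).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
BLOCK = 16  # NVFP4 shares one scale across 16 consecutive values


def quantize_blocks(weights):
    """Return (codes, signs, scales) for a 2-D float matrix (hypothetical helper)."""
    w = np.asarray(weights, dtype=np.float32)
    rows, cols = w.shape
    assert cols % BLOCK == 0, "column count must be a multiple of the block size"
    blocks = w.reshape(rows, cols // BLOCK, BLOCK)
    # One scale per block, chosen so the largest magnitude maps to 6.0.
    amax = np.abs(blocks).max(axis=-1, keepdims=True)
    scales = np.where(amax > 0, amax / 6.0, 1.0).astype(np.float32)
    scaled = np.abs(blocks) / scales
    # Round each scaled magnitude to the nearest grid point.
    codes = np.abs(scaled[..., None] - E2M1_GRID).argmin(axis=-1).astype(np.uint8)
    signs = np.signbit(blocks)
    return codes, signs, scales


def dequantize_blocks(codes, signs, scales):
    """Rebuild an approximate float32 matrix from codes, signs, and block scales."""
    mags = E2M1_GRID[codes] * scales
    return np.where(signs, -mags, mags).reshape(codes.shape[0], -1)
```

A wrong or corrupted block scale shifts all 16 values in that block at once, which is one plausible way defective scale tensors can turn otherwise valid 4-bit codes into incoherent output.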

Repair scope:

Tower: context_tower
Layers: 23 MoE layers
Experts: 128 routed experts per MoE layer
Matrices: up_proj and down_proj
Total repaired matrices: 5,888
Total replaced tensor payloads: 23,552
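The totals above follow directly from the listed scope. A quick arithmetic check, using only the figures stated in this card:

```python
MOE_LAYERS = 23
ROUTED_EXPERTS_PER_LAYER = 128
MATRICES_PER_EXPERT = 2  # up_proj and down_proj

repaired_matrices = MOE_LAYERS * ROUTED_EXPERTS_PER_LAYER * MATRICES_PER_EXPERT
replaced_payloads = 23_552  # figure stated in the repair scope
payloads_per_matrix = replaced_payloads // repaired_matrices

print(repaired_matrices)    # 5888
print(payloads_per_matrix)  # 4
```

Each repaired matrix therefore accounts for four replaced tensor payloads. The card does not name them; a packed weight tensor plus its accompanying scale tensors would be consistent with that count, but that reading is an assumption.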

The denoiser tower was not repaired in this checkpoint. Atlas causal/OpenAI-compatible inference uses the context tower.

Atlas usage

Example:

ATLAS_TARGET_MODEL=nemotron-3-nano-30b-a3b \
ATLAS_TARGET_QUANT=nvfp4 \
CUDARC_CUDA_VERSION=12000 \
./target/debug/spark serve \
  --model-from-path /path/to/nemotron-twotower-nvfp4 \
  --port 8891 \
  --max-seq-len 4096 \
  --max-num-seqs 1 \
  --max-batch-size 1 \
  --gpu-memory-utilization 0.70 \
  --kv-cache-dtype bf16 \
  --lm-head-dtype bf16

English completion prompts verified with Atlas include:

The capital of France is -> coherent answer mentioning Paris.
Question: What is 2 + 2? Answer: -> 4.
Write one concise sentence about the Moon: -> coherent factual sentence.
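The prompts above could be replayed against a running server with a small standard-library client such as the sketch below. Several details are assumptions rather than documented Atlas behavior: the server is taken to be the one from the example command on port 8891, the route is the conventional OpenAI-style `/v1/completions`, the response is assumed to follow the OpenAI completions shape (`choices[0].text`), and `MODEL_NAME` is a hypothetical placeholder.

```python
import json
import urllib.request

# Assumed endpoint: port 8891 from the serve example, OpenAI-style route.
BASE_URL = "http://localhost:8891/v1"
# Hypothetical model identifier; substitute whatever name your server reports.
MODEL_NAME = "nemotron-twotower-nvfp4"

PROMPTS = [
    "The capital of France is",
    "Question: What is 2 + 2? Answer:",
    "Write one concise sentence about the Moon:",
]


def build_request(prompt, max_tokens=32, temperature=0.0):
    """Build a POST request for a plain text completion."""
    payload = {
        "model": MODEL_NAME,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, timeout=60):
    """Send one prompt and return the generated text (assumes OpenAI response shape)."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

With the server running, `for p in PROMPTS: print(p, "->", complete(p))` prints each completion for manual inspection. Since this is a base model, plain completion prompts like these are a more natural fit than chat-style messages.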

Notes

This is a derived quantized checkpoint. Use is governed by the NVIDIA Nemotron Open Model License Agreement linked in the metadata above.

Safetensors

Model size: 34B params
Tensor type: F32, BF16, F8_E4M3

Model tree for syscall42/nemotron-twotower-nvfp4

Base model: nvidia/Nemotron-Labs-TwoTower-30B-A3B-Base-BF16
Quantized (7): this model