Nemotron TwoTower NVFP4 for Atlas
This repository contains an Atlas-compatible working NVFP4 quantization of nvidia/Nemotron-Labs-TwoTower-30B-A3B-Base-BF16.
The checkpoint was prepared from a local ModelOpt NVFP4 export of NemotronHTwoTowerForCausalLM and repaired for Atlas causal inference. The repaired payload is intended for the OpenAI-compatible Atlas inference API using the context tower.
What was repaired
The original local ModelOpt NVFP4 export had defective routed expert scale tensors in the context tower, which caused incoherent output when loaded by Atlas. The context-tower routed expert matrices were re-quantized from the BF16 source weights and written back into the NVFP4 safetensors layout.
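To make the role of the scale tensors concrete, here is a minimal sketch of block-wise 4-bit quantization in the NVFP4 style: values are rounded to the FP4 E2M1 grid in blocks of 16, with one scale per block. It is an illustration under simplifying assumptions, not the ModelOpt export code or the repair script used for this checkpoint. In particular, it keeps block scales in float32 and omits the per-tensor global scale, and the function names `quantize_blocks` and `dequantize_blocks` are hypothetical.

```python
import numpy as np

# Magnitudes representable by FP4 E2M1 (the sign is handled separately).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
BLOCK = 16  # elements that share one scale


def quantize_blocks(w):
    """Return (codes, scales): grid-valued elements and one scale per 16-element block."""
    w = np.asarray(w, dtype=np.float32)
    blocks = w.reshape(-1, BLOCK)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    # Map each block's largest magnitude onto the largest grid value (6.0).
    # Simplification: real NVFP4 stores these scales in FP8, not float32.
    scales = np.where(amax > 0, amax / E2M1_GRID[-1], 1.0).astype(np.float32)
    scaled = blocks / scales
    # Round every magnitude to the nearest grid point, then restore the sign.
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    codes = (np.sign(scaled) * E2M1_GRID[idx]).astype(np.float32)
    return codes, scales


def dequantize_blocks(codes, scales, shape):
    """Rebuild an approximation of the original tensor from codes and block scales."""
    return (codes * scales).reshape(shape)
```

Because every element is reconstructed as `code * scale`, a wrong scale distorts all 16 values in its block at once, which is consistent with the incoherent output described above when the scale tensors were defective.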
Repair scope:
- Tower: context_tower
- Layers: 23 MoE layers
- Experts: 128 routed experts per MoE layer
- Matrices: up_proj and down_proj
- Total repaired matrices: 5,888
- Total replaced tensor payloads: 23,552
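The totals in the repair scope can be checked with a few lines of arithmetic. The figure of four payloads per matrix is inferred from the two totals; the card does not name the individual tensors, so treat that breakdown as an assumption.

```python
MOE_LAYERS = 23
ROUTED_EXPERTS_PER_LAYER = 128
MATRICES_PER_EXPERT = 2  # up_proj and down_proj

# 23 layers x 128 experts x 2 matrices
repaired_matrices = MOE_LAYERS * ROUTED_EXPERTS_PER_LAYER * MATRICES_PER_EXPERT

# Inferred, not stated in the card: how many tensor payloads each matrix accounts for.
REPLACED_PAYLOADS = 23_552
payloads_per_matrix, remainder = divmod(REPLACED_PAYLOADS, repaired_matrices)

print(repaired_matrices, payloads_per_matrix, remainder)
```

The product matches the stated 5,888 matrices, and the payload total divides evenly into four payloads per matrix.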
The denoiser tower was not repaired in this checkpoint. Atlas causal/OpenAI-compatible inference uses the context tower.
Atlas usage
Example:
ATLAS_TARGET_MODEL=nemotron-3-nano-30b-a3b \
ATLAS_TARGET_QUANT=nvfp4 \
CUDARC_CUDA_VERSION=12000 \
./target/debug/spark serve \
--model-from-path /path/to/nemotron-twotower-nvfp4 \
--port 8891 \
--max-seq-len 4096 \
--max-num-seqs 1 \
--max-batch-size 1 \
--gpu-memory-utilization 0.70 \
--kv-cache-dtype bf16 \
--lm-head-dtype bf16
Verified English completion prompts with Atlas included:
- "The capital of France is" -> coherent answer mentioning Paris.
- "Question: What is 2 + 2? Answer:" -> 4.
- "Write one concise sentence about the Moon:" -> coherent factual sentence.
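A running server could be exercised with prompts like these from a small client. The sketch below uses only the Python standard library and rests on several assumptions: that Atlas exposes the usual OpenAI-style `/v1/completions` route, that the server is reachable on localhost at the port from the serve command, and that the model name in the request body is accepted (the value shown is a placeholder). The helper names are hypothetical.

```python
import json
import urllib.request

# Port taken from the serve command; host and route are assumptions.
BASE_URL = "http://localhost:8891"


def build_completion_request(prompt, model="nemotron-twotower-nvfp4",
                             max_tokens=32, base_url=BASE_URL):
    """Build an OpenAI-style completion request without sending it."""
    payload = {
        "model": model,  # placeholder name; the server may expect a different id
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the first completion's text."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server up, `complete("The capital of France is")` would return the generated continuation. Since this is a base model, plain completion prompts are likely a better fit than chat-style messages.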
Notes
This is a derived quantized checkpoint. Use is governed by the NVIDIA Nemotron Open Model License Agreement linked in the metadata above.