Cosmos3-Super — Weight-Only NVFP4 (NVIDIA ModelOpt)

Weight-only quantization of the Cosmos3OmniTransformer from NVIDIA's nvidia/Cosmos3-Super — the 64B omnimodal Cosmos 3 world model (text-to-image, text-to-video, image-to-video, optional synchronized sound). Produced with NVIDIA TensorRT Model Optimizer (ModelOpt) on a single 96 GB workstation GPU, via a streaming method that never materializes the ~128 GB bf16 model (method scripts included).

Only the transformer is quantized. The VAEs and tokenizers are the original bf16 components, bundled so the repo is self-contained. Loading requires the bundled load_cosmos3_modelopt.py (see How to use).

Variants & measured performance

Measured on an RTX 6000 Pro Blackwell (96 GB), 1024×1024 single-frame render, 50 steps. Drop-in loading of these repos performs identically to the in-memory quantization path they were validated against.

| Build | Bits (weights) | Repo size | Resident VRAM | s/it (1024² still) |
|---|---|---|---|---|
| NVFP4 (this repo) | 4-bit (E2M1 + scales) | ~36 GB | ~43 GB (meas.) | ~4.6 |
| FP8 (sibling) | 8-bit (E4M3) | ~64 GB | ~67 GB (meas.) | ~1.2 |
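
The repo sizes and per-render times in the table can be sanity-checked with a little arithmetic. The sketch below is a back-of-envelope estimate, not a measurement: it assumes all 64e9 parameters are stored at the stated width (the layers kept in bf16 are ignored) and, for NVFP4, one 8-bit scale shared by every 16 weights, which is an assumption about the block layout rather than something read from this checkpoint. The function names are illustrative.

```python
PARAMS = 64e9  # "64B" from the model description, treated as exact here


def footprint_gb(bits_per_weight, params=PARAMS):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


def render_seconds(steps, seconds_per_iteration):
    """Wall-clock estimate for the denoising loop only."""
    return steps * seconds_per_iteration


BF16_BITS = 16
FP8_BITS = 8
# Assumed NVFP4 layout: 4-bit values plus one 8-bit scale per 16-weight block.
NVFP4_BITS = 4 + 8 / 16

print(f"bf16  ~{footprint_gb(BF16_BITS):.0f} GB")
print(f"FP8   ~{footprint_gb(FP8_BITS):.0f} GB")
print(f"NVFP4 ~{footprint_gb(NVFP4_BITS):.0f} GB")
print(f"NVFP4 50-step still ~{render_seconds(50, 4.6):.0f} s")
print(f"FP8   50-step still ~{render_seconds(50, 1.2):.0f} s")
```

Under these assumptions the estimates land on the ~128 GB, ~64 GB and ~36 GB figures quoted in this card; the gap between repo size and resident VRAM is the bf16 components and runtime overhead, which this sketch does not model.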

Pick NVFP4 for footprint — it brings the model into ~48 GB-card territory for stills. Pick FP8 if it fits — in this serving path it is both higher fidelity and ~4× faster, because FP8 dequant is a single cheap scale on a native float8 tensor, while NVFP4 dequant must unpack two 4-bit values per byte and apply two-level block scales in PyTorch. Note this is dequant-on-the-fly: quantization here buys memory, not speed — NVFP4's hardware FP4 tensor-core advantage only materializes in engines with FP4 GEMM kernels (TRT-LLM/vLLM territory), not in diffusers.
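
To make the cost difference concrete, here is a minimal pure-Python sketch of the two dequantization paths described above. It is not the bundled loader's code. The E2M1 value table is the standard FP4 one; the 16-element block size, the per-tensor global scale, and the low-nibble-first packing order are assumptions made for illustration, and all function names are hypothetical.

```python
# E2M1 magnitudes indexed by the low 3 bits of a 4-bit code; bit 3 is the sign.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(code):
    """Decode one 4-bit E2M1 code into a Python float."""
    magnitude = E2M1_MAGNITUDES[code & 0b0111]
    return -magnitude if code & 0b1000 else magnitude


def unpack_nibbles(packed):
    """Split each byte into two 4-bit codes (assumed order: low nibble first)."""
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)
        codes.append(byte >> 4)
    return codes


def dequantize_nvfp4(packed, block_scales, global_scale, block_size=16):
    """Unpack, decode, then apply the per-block and per-tensor scales."""
    codes = unpack_nibbles(packed)
    if len(codes) != len(block_scales) * block_size:
        raise ValueError("need exactly one block scale per block of weights")
    return [
        decode_e2m1(code) * block_scales[i // block_size] * global_scale
        for i, code in enumerate(codes)
    ]


def dequantize_fp8(values, scale):
    """FP8 path: the stored values are already floats, so one multiply suffices."""
    return [value * scale for value in values]


# One 16-weight block: eight bytes, each holding the codes 0x1 (0.5) and 0x2 (1.0).
weights = dequantize_nvfp4(bytes([0x21] * 8), block_scales=[2.0], global_scale=0.5)
print(weights[:4])
```

The NVFP4 path does a bit-unpack, a table lookup and two multiplies per weight, where the FP8 path does a single multiply. Without a fused FP4 kernel, that extra work is paid on every forward pass.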

Layers kept in bf16 (not quantized): embeddings, norms, the reasoner head, in/out projections, time/modality adapters, audio adapter. The 64 transformer blocks' attention + MLP linears (incl. MoE experts) are quantized.

Status

  • Drop-in loading verified end to end (load → render → performance parity with the in-memory method) on Blackwell (sm_120), via the bundled loader.
  • modelopt_state.pth is part of the checkpoint and is required — it restores the quantized module structure at load. Do not delete it.
  • ⚠️ The loader (load_cosmos3_modelopt.py) is required, not optional. The current diffusers/accelerate/modelopt combination cannot materialize a pre-quantized ModelOpt checkpoint unaided; the loader applies three small, source-verified workarounds (parameter materialization for packed weights, payload-dtype restoration for FP8, and weight-only quantizer enforcement) plus the validated bf16 dtype normalization. ModelOpt marks this path experimental; expect the loader to become unnecessary as upstream catches up.
  • vLLM-Omni: not a working path as of 0.22.0. This is an upstream roadmap gap, not a defect of this checkpoint: vLLM-Omni's ModelOpt integration is currently wired for LLMs only, and ModelOpt-quantized diffusion support is an open RFC (#2709, #1959).
  • ComfyUI: no known node support for this ModelOpt layout (the NF4 build linked below has community nodes; this one does not).
  • Validated only on Blackwell (sm_120). The NVFP4 packed format should load on other architectures, but that has not been tested.

How to use

Requires a diffusers build with Cosmos 3 support (currently from source) plus modelopt and accelerate. Pin to the verified versions for guaranteed reproducibility (newer versions may also work, but this code path moves fast):

pip install "git+https://github.com/huggingface/diffusers.git@2c7efb95349296cf6bcce981ea036275a82a94df"
pip install "accelerate==1.13.0" "nvidia-modelopt==0.44.0"

from load_cosmos3_modelopt import load_pipe   # bundled in this repo
from diffusers import UniPCMultistepScheduler

pipe = load_pipe("prometheusAIR/Cosmos3-Super-NVFP4")   # or a local path
pipe.scheduler = UniPCMultistepScheduler.from_config(
    pipe.scheduler.config, flow_shift=3.0   # NVIDIA's text-to-image setting; use 5.0 for image-to-video
)

# Single image -- pass parameters EXPLICITLY (see warning below):
r = pipe("a weathered lighthouse on a cliff at golden hour, photoreal, 50mm",
         height=1024, width=1024, num_frames=1,
         num_inference_steps=50, guidance_scale=4.0)
r.video[0].save("out.png")   # .video is the list of PIL frames; [0] is the image

# Video (~2 s): frame counts of the form 4n+1 map cleanly to the VAE's 4x
# temporal compression; 24 fps is the native rate and conditions the model.
r = pipe("The lighthouse beam sweeps slowly across the water. Static camera.",
         height=704, width=1280, num_frames=49, fps=24.0,
         num_inference_steps=35, guidance_scale=6.0)

These still-image settings (1024², 50 steps, guidance 4.0, flow_shift=3.0), together with reading the image from r.video[0], match NVIDIA's first-party Cosmos3 text-to-image reference.

⚠️ A bare pipe(prompt) call renders a 189-frame 720×1280 video (~8 s at 24 fps) — that is the pipeline's built-in default, not a still. It takes ~40× the compute of a single frame and is the most common reason this model "seems slow." Always pass num_frames/height/width explicitly.
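
The frame-count rule and the "~40×" figure can be checked with a few helper functions. This is a rough sketch with hypothetical names. It assumes that a 4n+1 frame clip encodes to n+1 latent frames (first frame alone, then one latent frame per group of four), that cost scales linearly with latent volume, and that the step count is the same for both renders. Attention cost grows faster than linearly, so treat the ratio as a lower-bound estimate.

```python
TEMPORAL_COMPRESSION = 4  # the VAE's 4x temporal compression


def is_valid_frame_count(num_frames):
    """True for frame counts of the form 4n+1 (1, 5, 9, ..., 49, ..., 189)."""
    return num_frames >= 1 and (num_frames - 1) % TEMPORAL_COMPRESSION == 0


def nearest_valid_frame_count(seconds, fps=24.0):
    """Closest 4n+1 frame count for a target duration."""
    n = max(0, round(seconds * fps / TEMPORAL_COMPRESSION))
    return TEMPORAL_COMPRESSION * n + 1


def latent_frames(num_frames):
    """Assumed mapping: first frame alone, then one latent frame per 4 frames."""
    return (num_frames - 1) // TEMPORAL_COMPRESSION + 1


def relative_cost(height, width, num_frames, reference=(1024, 1024, 1)):
    """Latent-volume ratio against a reference render (default: 1024² still)."""
    ref_height, ref_width, ref_frames = reference
    volume = height * width * latent_frames(num_frames)
    ref_volume = ref_height * ref_width * latent_frames(ref_frames)
    return volume / ref_volume


print(nearest_valid_frame_count(2.0))                 # frame count for a ~2 s clip
print(f"{relative_cost(720, 1280, 189):.1f}x")        # default call vs. 1024² still
```

With these assumptions the pipeline default works out to roughly 42 times the latent volume of a single 1024² frame, in line with the ~40× estimate above.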

Cosmos 3 expects a dense structured-JSON prompt for best quality; plain prompts work but render softer. See NVIDIA's prompt-upsampling docs.

Reproducing from scratch: quantize_cosmos3_super_streaming.py (included) streams the bf16 shards directly into compressed FP8/NVFP4 form. Peak memory is roughly the compressed footprint, so a single 96 GB card suffices. repackage_for_hf.py then emits this repo's round-trippable layout via save_pretrained + enable_huggingface_checkpointing(). Note that ModelOpt's export_hf_checkpoint() produces a deployment checkpoint that diffusers cannot round-trip; the modelopt_state.pth written by save_pretrained is what makes drop-in loading possible. serve_cosmos3_diffusers.py is a small FastAPI server (text→image, image→video) around the same model.
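
Why streaming keeps peak memory near the compressed footprint can be shown with simple byte accounting. The sketch below is not the included script. It is a toy model that assumes 32 equal bf16 shards of 4 GB each (a hypothetical split of the ~128 GB model) and that each shard's bf16 tensors are released as soon as their compressed form exists.

```python
def load_all_peak_gb(shard_sizes_gb):
    """Lower bound on peak memory if every bf16 shard is resident at once."""
    return sum(shard_sizes_gb)


def streaming_peak_gb(shard_sizes_gb, compression_ratio):
    """Peak memory when shards are compressed one at a time.

    At any moment memory holds everything compressed so far plus the single
    bf16 shard currently being processed.
    """
    compressed_so_far = 0.0
    peak = 0.0
    for shard_gb in shard_sizes_gb:
        compressed_so_far += shard_gb * compression_ratio
        peak = max(peak, compressed_so_far + shard_gb)
        # The bf16 tensors of this shard are assumed to be freed here.
    return peak


shards = [4.0] * 32            # hypothetical: 32 shards x 4 GB = 128 GB of bf16
nvfp4_ratio = 36 / 128         # ~36 GB compressed vs ~128 GB bf16
fp8_ratio = 64 / 128

print(load_all_peak_gb(shards))
print(streaming_peak_gb(shards, nvfp4_ratio))
print(streaming_peak_gb(shards, fp8_ratio))
```

In this model the streaming peak exceeds the final compressed size by only one shard's worth of bf16 data, which is why both builds fit on a 96 GB card while loading the full bf16 model would not.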

Known limitations / caveats

  • The bundled loader is required (see Status).
  • QKV scale unification was skipped at export (ModelOpt's fusion probe doesn't recognize this architecture); q/k/v keep independent scales. Harmless here; relevant only to engines that fuse QKV.
  • Render sharpness depends heavily on prompt density, scheduler settings, and guidance — tune these; they are not quantization loss.
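
For readers unfamiliar with the QKV point in the list above, the sketch below illustrates what scale unification usually means. It assumes the common approach of replacing the three per-tensor scales with their maximum so that a fused QKV GEMM can use a single scale. This is an illustration with hypothetical names and toy weights, not ModelOpt's implementation.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def per_tensor_scale(weights, qmax=FP8_E4M3_MAX):
    """Scale that maps the largest absolute weight onto the format's maximum."""
    return max(abs(w) for w in weights) / qmax


def unify_qkv_scales(q, k, v, qmax=FP8_E4M3_MAX):
    """One shared scale for a fused QKV projection (assumed: max of the three)."""
    return max(per_tensor_scale(t, qmax) for t in (q, k, v))


# Toy weights with different dynamic ranges.
q = [0.5, -1.0]
k = [2.0, 0.25]
v = [-4.0, 1.0]

independent = [per_tensor_scale(t) for t in (q, k, v)]
shared = unify_qkv_scales(q, k, v)
print(independent, shared)
```

A shared scale is never smaller than any of the independent ones, so the projection with the smallest range loses resolution when scales are unified. Keeping q/k/v scales independent, as this checkpoint does, avoids that and only matters to engines that require a fused QKV layout.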

Guardrails

Cosmos 3 ships an optional safety checker (cosmos_guardrail). The bundled loader passes enable_safety_checker=False for local single-user use. If you deploy this or publish generated media, install cosmos-guardrail, accept the gated nvidia/Cosmos-Guardrail1 model (released under its own NVIDIA Open Model License, separate from this repo's OpenMDW-1.1), and run with load_pipe(..., enable_safety_checker=True).

Provenance & License

  • Derivative of: nvidia/Cosmos3-Super (bf16). This repo modifies only the weight encoding of the transformer.
  • Produced with: NVIDIA TensorRT Model Optimizer + diffusers (from source).
  • Exact versions used: diffusers 0.39.0.dev0 @ 2c7efb9, nvidia-modelopt 0.44.0, accelerate 1.13.0, torch 2.12.0, CUDA 13.3.
  • License: OpenMDW-1.1, inherited from the base model. This repo includes a copy of the agreement (LICENSE) and documents its origin above; the upstream repo ships no separate NOTICE file. OpenMDW-1.1 permits modification and redistribution and places no restrictions on generated outputs; you remain responsible for clearing any third-party rights embodied in the materials.

Related repos
