# NVIDIA Cosmos World Foundation Models - Deployment Report

## Summary

| Item | Value |
|------|-------|
| **Space URL** | https://huggingface.co/spaces/wbw2000/cosmos-predict-transfer-demo |
| **Commit Hash** | `f829a7a65deb6afc884d48e479a0c02f16885e24` |
| **Created** | 2026-01-26 |
| **Status** | Building / Pending Verification |

---

## 1. Model Information

### Cosmos Predict2.5-2B

| Property | Value |
|----------|-------|
| **Model ID** | `nvidia/Cosmos-Predict2.5-2B` |
| **Parameters** | 2,059,174,912 (~2.06B) |
| **VRAM Required** | 32.54 GB |
| **Precision** | BF16 only |
| **Capabilities** | Text2World, Image2World, Video2World |

### Cosmos Transfer2.5-2B

| Property | Value |
|----------|-------|
| **Model ID** | `nvidia/Cosmos-Transfer2.5-2B` |
| **Parameters** | 2,358,047,744 (~2.36B) |
| **VRAM Required** | 65.4 GB |
| **Precision** | BF16 only |
| **Control Inputs** | Blur, Edge, Depth, Segmentation |

---

## 2. Hardware Configuration

| Property | Value |
|----------|-------|
| **Hardware** | ZeroGPU (NVIDIA H200) |
| **VRAM** | 70 GB |
| **Supported** | Both Predict2.5 (~32.5 GB) and Transfer2.5 (~65.4 GB) |
| **GPU Duration** | Predict: 300 s, Transfer: 420 s |

### Why ZeroGPU/H200

- Predict2.5-2B requires 32.54 GB → an A10G (24 GB) is insufficient
- Transfer2.5-2B requires 65.4 GB → only an H200 (70 GB) or A100 (80 GB) is sufficient
- ZeroGPU H200 (70 GB) is the most cost-effective Hugging Face option that fits both models

---

## 3. Key Dependencies

```
torch==2.5.1
diffusers>=0.34.0
transformers>=4.52.4
accelerate>=1.7.0
gradio>=5.0.0
av>=14.0.0
opencv-python-headless>=4.8.0
imageio>=2.31.0
scikit-image>=0.21.0
```

---

## 4. Default Parameters

### Predict2.5

| Parameter | Default | Range |
|-----------|---------|-------|
| Resolution | 720×480 | 720×480 or 1280×720 |
| Frames | 49 (~3 s) | 17-97 |
| Inference Steps | 30 | 10-50 |
| Guidance Scale | 7.0 | 1.0-15.0 |
| Seed | 42 | Any integer |

### Transfer2.5

| Parameter | Default | Range |
|-----------|---------|-------|
| Control Type | blur | blur, edge, depth, segmentation |
| Inference Steps | 30 | 10-50 |
| Guidance Scale | 7.0 | 1.0-15.0 |
| Control Scale | 1.0 | 0.5-2.0 |
| Seed | 42 | Any integer |

---

## 5. Smoke Test Design

### Tests Implemented

#### Predict2.5 Tests (`tests/smoke_predict.py`)

1. **Output Validation**: Verify video utilities work correctly
2. **Model Loading**: Verify Predict2.5-2B can be loaded
3. **Text2World Inference**: Generate video from text prompt

#### Transfer2.5 Tests (`tests/smoke_transfer.py`)

1. **Control Extraction**: Verify edge/depth extraction works
2. **Style Consistency**: Verify SSIM computation works
3. **Model Loading**: Verify Transfer2.5-2B can be loaded
4. **Video Inference**: Apply style transfer to video

### Running Tests

```bash
# All tests
python -m tests.smoke_all

# Predict2.5 only (if VRAM < 65GB)
python -m tests.smoke_all --predict-only

# Individual modules
python -m tests.smoke_predict
python -m tests.smoke_transfer
```

### Expected Output

```
======================================================================
NVIDIA COSMOS WORLD FOUNDATION MODELS - SMOKE TESTS
======================================================================

[System Information]
python_version: 3.11.x
torch_version: 2.5.1
cuda_available: True
gpu_name: NVIDIA H200
gpu_total_vram_gb: 70.0

[PREDICT2.5 TESTS]
output_validation: PASSED
model_loading: PASSED
text2world_inference: PASSED

[TRANSFER2.5 TESTS]
control_extraction: PASSED
style_consistency: PASSED
model_loading: PASSED
video_inference: PASSED

Overall: 7/7 tests passed
STATUS: ALL TESTS PASSED
```

---

## 6. Paper Consistency Validation

### Reference

**Paper**: arXiv 2511.00062 - "World Simulation with Video Foundation Models for Physical AI"

### Validation Points

#### 6.1 Predict2.5 - Temporal Consistency (Section 4.2)

**Paper Claim**: "Generated videos maintain reasonable spatiotemporal continuity in short-term prediction"

**Validation Method**:

- Generate N=3 videos with different seeds
- Compute mean frame-to-frame pixel difference
- Verify differences are smooth (`mean_diff < 50`)

**Metric**: Mean adjacent frame difference (pixel intensity 0-255 scale)

**Pass Criteria**: `mean_diff < 50` for the majority of samples

#### 6.2 Predict2.5 - Reproducibility (Section 5)

**Paper Claim**: "Fixed random seeds produce deterministic outputs"

**Validation Method**:

- Run the same prompt with the same seed twice
- Compute SSIM between corresponding frames
- Verify outputs are nearly identical

**Metric**: Mean SSIM between run1 and run2

**Pass Criteria**: `mean_ssim > 0.95`

#### 6.3 Transfer2.5 - Structure Preservation (Section 3.2)

**Paper Claim**: "Cosmos-Transfer2.5 preserves structural consistency during domain transfer"

**Validation Method**:

- Extract edge maps from input and output
- Compute SSIM between edge maps
- Verify edges are preserved

**Metric**: Edge SSIM between input and output

**Pass Criteria**: `mean_edge_ssim > 0.3`

#### 6.4 Transfer2.5 - Domain Change (Section 4.3)

**Paper Claim**: "Model can perform world-to-world translation (e.g., day→night)"

**Validation Method**:

- Apply day→night transfer
- Compute SSIM between input and output
- Verify output differs from input while maintaining structure

**Metric**: Pixel SSIM between input and output

**Pass Criteria**: `0.1 < mean_ssim < 0.9` (different but not random)

### Running Validation

```bash
python -m tests.paper_validation
python -m tests.paper_validation --skip-transfer  # If VRAM limited
```

---

## 7. Limitations & Future Work

### Current Limitations

| Limitation | Reason | Potential Solution |
|------------|--------|-------------------|
| Transfer2.5 may fail on ZeroGPU | 65.4 GB is very close to the 70 GB limit | Use A100 80GB or quantization |
| Long cold start | Large model downloads | Use model caching |
| Limited output length | Kept short to avoid the 5-minute GPU timeout | Increase GPU duration |
| No multi-view support | Not implemented | Add multi-view inference |

### Not Covered from Paper

1. **Multi-view generation** (Section 3.3) - Not implemented
2. **Autonomous Vehicle post-training** (Section 4.1) - Not included
3. **Full benchmark evaluation** (Section 5.2) - Only smoke tests
4. **14B model variants** - VRAM insufficient

### Future Improvements

1. Add A100 80GB option for better Transfer2.5 stability
2. Implement multi-view inference for robotics use cases
3. Add quantized model variants for lower VRAM
4. Implement full paper benchmark suite
5. Add video-to-video inference for Predict2.5

---

## 8. API Usage Examples

### Gradio Client (Python)

```python
from gradio_client import Client

# Connect to Space
client = Client("wbw2000/cosmos-predict-transfer-demo")

# Text2World
result = client.predict(
    prompt="A peaceful garden with butterflies",
    negative_prompt="low quality, blurry",
    num_frames=49,
    height=480,
    width=720,
    num_inference_steps=30,
    guidance_scale=7.0,
    seed=42,
    api_name="/run_predict_text2world"
)

# The endpoint returns the generated video path and a text log
video_path, log = result
print(f"Video saved to: {video_path}")
```

### REST API (curl)

```bash
# With gradio>=5.0.0 (Section 3), REST calls go through /gradio_api/call/<api_name>.
# The POST returns an event ID; a follow-up GET on
# .../gradio_api/call/run_predict_text2world/<event_id> streams the result.
# Positional order: prompt, negative_prompt, num_frames, height, width,
# num_inference_steps, guidance_scale, seed
curl -X POST "https://wbw2000-cosmos-predict-transfer-demo.hf.space/gradio_api/call/run_predict_text2world" \
  -H "Content-Type: application/json" \
  -d '{
    "data": [
      "A futuristic city at sunset",
      "low quality, blurry",
      49, 480, 720, 30, 7.0, 42
    ]
  }'
```

---

## 9. References

- **Paper**: https://arxiv.org/abs/2511.00062
- **Predict2.5 GitHub**: https://github.com/nvidia-cosmos/cosmos-predict2.5
- **Transfer2.5 GitHub**: https://github.com/nvidia-cosmos/cosmos-transfer2.5
- **Predict2.5 Hugging Face**: https://huggingface.co/nvidia/Cosmos-Predict2.5-2B
- **Transfer2.5 Hugging Face**: https://huggingface.co/nvidia/Cosmos-Transfer2.5-2B
- **License**: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license

---

## 10. Changelog

| Date | Change |
|------|--------|
| 2026-01-26 | Initial deployment to Hugging Face Spaces |
| - | Commit: `f829a7a65deb6afc884d48e479a0c02f16885e24` |

---

*Report generated: 2026-01-26*
*Author: Claude Code (Anthropic)*
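
---

## 11. Appendix: VRAM Headroom Check

The hardware reasoning in Section 2 and the first limitation in Section 7 both come down to simple subtraction: GPU VRAM minus the model's stated requirement. The sketch below redoes that arithmetic using only the figures quoted in this report. The function names and the `min_headroom_gb` safety margin are hypothetical illustrations and are not part of the deployed Space.

```python
# VRAM figures copied from Sections 1 and 2 of this report (in GB).
MODEL_VRAM_GB = {
    "Cosmos-Predict2.5-2B": 32.54,
    "Cosmos-Transfer2.5-2B": 65.4,
}
GPU_VRAM_GB = {
    "A10G": 24.0,
    "ZeroGPU H200": 70.0,
    "A100 80GB": 80.0,
}


def headroom_gb(model: str, gpu: str) -> float:
    """Free VRAM (GB) left after loading `model` on `gpu`.

    A negative value means the model does not fit at all.
    """
    return round(GPU_VRAM_GB[gpu] - MODEL_VRAM_GB[model], 2)


def fits(model: str, gpu: str, min_headroom_gb: float = 0.0) -> bool:
    """True if `model` fits on `gpu` with at least `min_headroom_gb` spare.

    `min_headroom_gb` is a hypothetical safety margin, not a measured value.
    """
    return headroom_gb(model, gpu) >= min_headroom_gb


if __name__ == "__main__":
    # Print the fit verdict for every model/GPU pair mentioned in the report.
    for model in MODEL_VRAM_GB:
        for gpu in GPU_VRAM_GB:
            gap = headroom_gb(model, gpu)
            verdict = "fits" if fits(model, gpu) else "does not fit"
            print(f"{model} on {gpu}: {verdict} ({gap:+.2f} GB headroom)")
```

With these numbers, Transfer2.5-2B leaves 4.6 GB free on the 70 GB ZeroGPU H200 (about 6.6% of the card), while Predict2.5-2B misses the A10G by 8.54 GB. The check treats the stated VRAM requirement as a fixed total, so it says nothing about extra activation memory at higher resolutions or frame counts.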