# NVIDIA Cosmos World Foundation Models - Deployment Report

## Summary

| Item | Value |
|------|-------|
| **Space URL** | https://huggingface.co/spaces/wbw2000/cosmos-predict-transfer-demo |
| **Commit Hash** | `f829a7a65deb6afc884d48e479a0c02f16885e24` |
| **Created** | 2026-01-26 |
| **Status** | Building / Pending Verification |

---

## 1. Model Information

### Cosmos Predict2.5-2B
| Property | Value |
|----------|-------|
| **Model ID** | `nvidia/Cosmos-Predict2.5-2B` |
| **Parameters** | 2,059,174,912 (~2.06B) |
| **VRAM Required** | 32.54 GB |
| **Precision** | BF16 only |
| **Capabilities** | Text2World, Image2World, Video2World |

### Cosmos Transfer2.5-2B
| Property | Value |
|----------|-------|
| **Model ID** | `nvidia/Cosmos-Transfer2.5-2B` |
| **Parameters** | 2,358,047,744 (~2.36B) |
| **VRAM Required** | 65.4 GB |
| **Precision** | BF16 only |
| **Control Inputs** | Blur, Edge, Depth, Segmentation |

---

## 2. Hardware Configuration

| Property | Value |
|----------|-------|
| **Hardware** | ZeroGPU (NVIDIA H200) |
| **VRAM** | 70 GB |
| **Supported Models** | Both Predict2.5 (32.54 GB) and Transfer2.5 (65.4 GB) |
| **GPU Duration** | Predict: 300s, Transfer: 420s |
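
The per-model GPU durations in the table map onto the `duration` argument of ZeroGPU's `spaces.GPU` decorator. The following is a minimal sketch, not the Space's actual code: the function names `run_predict` and `run_transfer` and their placeholder bodies are hypothetical, and the no-op fallback exists only so the snippet can be imported outside a Space.

```python
try:
    import spaces  # available on Hugging Face ZeroGPU Spaces
    gpu = spaces.GPU
except ImportError:
    # Local fallback (assumption): a no-op decorator with the same call shape.
    def gpu(duration=60):
        def wrap(fn):
            return fn
        return wrap

PREDICT_GPU_SECONDS = 300   # Predict duration from the table above
TRANSFER_GPU_SECONDS = 420  # Transfer duration from the table above


@gpu(duration=PREDICT_GPU_SECONDS)
def run_predict(prompt: str) -> str:
    # Hypothetical placeholder: the real function would run the Predict2.5 pipeline.
    return f"predict:{prompt}"


@gpu(duration=TRANSFER_GPU_SECONDS)
def run_transfer(video_path: str) -> str:
    # Hypothetical placeholder: the real function would run the Transfer2.5 pipeline.
    return f"transfer:{video_path}"
```

Keeping the two durations as named constants makes it easier to raise one of them later without touching the inference code.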

### Why ZeroGPU/H200
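- Predict2.5-2B requires 32.54 GB, so an A10G (24 GB) is insufficient.
- Transfer2.5-2B requires 65.4 GB, so only the ZeroGPU H200 (70 GB) or an A100 (80 GB) is sufficient.
- Of these, the ZeroGPU H200 (70 GB) was judged the most cost-effective option on HuggingFace, although Transfer2.5 leaves only about 4.6 GB of headroom on it.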
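
The sizing argument above is simple subtraction, and can be sketched as a small script. The VRAM figures are copied from this report; the helper name `headroom_gb` is illustrative, and the check ignores allocator overhead and fragmentation, so a small positive margin does not guarantee a successful run.

```python
# VRAM figures in GB, copied from the tables in this report.
MODEL_VRAM_GB = {"Cosmos-Predict2.5-2B": 32.54, "Cosmos-Transfer2.5-2B": 65.4}
GPU_VRAM_GB = {"A10G": 24, "ZeroGPU H200": 70, "A100": 80}


def headroom_gb(model: str, gpu: str) -> float:
    """Free VRAM in GB after loading the model (negative if it does not fit)."""
    return round(GPU_VRAM_GB[gpu] - MODEL_VRAM_GB[model], 2)


for model in MODEL_VRAM_GB:
    for gpu_name in GPU_VRAM_GB:
        margin = headroom_gb(model, gpu_name)
        verdict = "fits" if margin >= 0 else "does not fit"
        print(f"{model} on {gpu_name}: {verdict} ({margin:+.2f} GB)")
```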

---

## 3. Key Dependencies

```
torch==2.5.1
diffusers>=0.34.0
transformers>=4.52.4
accelerate>=1.7.0
gradio>=5.0.0
av>=14.0.0
opencv-python-headless>=4.8.0
imageio>=2.31.0
scikit-image>=0.21.0
```
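
A startup check can confirm that the installed packages meet these minimums before any model is loaded. The sketch below uses only the standard library. The helper names `as_tuple` and `check_minimums` are illustrative, and the version parsing is deliberately simple (leading digits of each dotted component); `packaging.version` would be the more robust choice if that package is available.

```python
import re
from importlib import metadata

# Minimum versions copied from the list above (torch is pinned exactly and not checked here).
MINIMUMS = {
    "diffusers": "0.34.0",
    "transformers": "4.52.4",
    "accelerate": "1.7.0",
    "gradio": "5.0.0",
}


def as_tuple(version: str) -> tuple:
    """Turn '2.5.1+cu121' into (2, 5, 1), ignoring local and pre-release suffixes."""
    parts = []
    for piece in version.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def check_minimums(minimums=MINIMUMS) -> dict:
    """Return {package: problem} for every missing or too-old package."""
    problems = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if as_tuple(installed) < as_tuple(minimum):
            problems[name] = f"{installed} < {minimum}"
    return problems
```

An empty dictionary from `check_minimums()` means every listed package satisfies its minimum.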

---

## 4. Default Parameters

### Predict2.5
| Parameter | Default | Range |
|-----------|---------|-------|
| Resolution | 720×480 | 720×480 or 1280×720 |
| Frames | 49 (~3s) | 17-97 |
| Inference Steps | 30 | 10-50 |
| Guidance Scale | 7.0 | 1.0-15.0 |
| Seed | 42 | Any integer |

### Transfer2.5
| Parameter | Default | Range |
|-----------|---------|-------|
| Control Type | blur | blur, edge, depth, segmentation |
| Inference Steps | 30 | 10-50 |
| Guidance Scale | 7.0 | 1.0-15.0 |
| Control Scale | 1.0 | 0.5-2.0 |
| Seed | 42 | Any integer |
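
The defaults and ranges in both tables can be captured as dataclasses that reject out-of-range requests before GPU time is spent. This is a sketch under assumptions: the class names `PredictParams` and `TransferParams` and the `validate` method are hypothetical and are not taken from the Space's code.

```python
from dataclasses import dataclass

ALLOWED_RESOLUTIONS = {(720, 480), (1280, 720)}
ALLOWED_CONTROLS = {"blur", "edge", "depth", "segmentation"}


def _check_range(name: str, value: float, low: float, high: float) -> None:
    """Raise ValueError when value lies outside the inclusive range [low, high]."""
    if not low <= value <= high:
        raise ValueError(f"{name}={value} is outside {low}-{high}")


@dataclass
class PredictParams:
    width: int = 720
    height: int = 480
    num_frames: int = 49
    num_inference_steps: int = 30
    guidance_scale: float = 7.0
    seed: int = 42

    def validate(self) -> None:
        if (self.width, self.height) not in ALLOWED_RESOLUTIONS:
            raise ValueError(f"unsupported resolution {self.width}x{self.height}")
        _check_range("num_frames", self.num_frames, 17, 97)
        _check_range("num_inference_steps", self.num_inference_steps, 10, 50)
        _check_range("guidance_scale", self.guidance_scale, 1.0, 15.0)


@dataclass
class TransferParams:
    control_type: str = "blur"
    num_inference_steps: int = 30
    guidance_scale: float = 7.0
    control_scale: float = 1.0
    seed: int = 42

    def validate(self) -> None:
        if self.control_type not in ALLOWED_CONTROLS:
            raise ValueError(f"unsupported control type {self.control_type!r}")
        _check_range("num_inference_steps", self.num_inference_steps, 10, 50)
        _check_range("guidance_scale", self.guidance_scale, 1.0, 15.0)
        _check_range("control_scale", self.control_scale, 0.5, 2.0)
```

Calling `PredictParams().validate()` with no arguments checks the documented defaults themselves, which is a cheap guard against the tables and the code drifting apart.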

---

## 5. Smoke Test Design

### Tests Implemented

#### Predict2.5 Tests (`tests/smoke_predict.py`)
1. **Output Validation**: Verify video utilities work correctly
2. **Model Loading**: Verify Predict2.5-2B can be loaded
3. **Text2World Inference**: Generate video from text prompt

#### Transfer2.5 Tests (`tests/smoke_transfer.py`)
1. **Control Extraction**: Verify edge/depth extraction works
2. **Style Consistency**: Verify SSIM computation works
3. **Model Loading**: Verify Transfer2.5-2B can be loaded
4. **Video Inference**: Apply style transfer to video

### Running Tests

```bash
# All tests
python -m tests.smoke_all

# Predict2.5 only (if VRAM < 65GB)
python -m tests.smoke_all --predict-only

# Individual modules
python -m tests.smoke_predict
python -m tests.smoke_transfer
```

### Expected Output

```
======================================================================
NVIDIA COSMOS WORLD FOUNDATION MODELS - SMOKE TESTS
======================================================================

[System Information]
  python_version: 3.11.x
  torch_version: 2.5.1
  cuda_available: True
  gpu_name: NVIDIA H200
  gpu_total_vram_gb: 70.0

[PREDICT2.5 TESTS]
  output_validation: PASSED
  model_loading: PASSED
  text2world_inference: PASSED

[TRANSFER2.5 TESTS]
  control_extraction: PASSED
  style_consistency: PASSED
  model_loading: PASSED
  video_inference: PASSED

Overall: 7/7 tests passed
STATUS: ALL TESTS PASSED
```
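
Because the Space is still pending verification, it may help to check a captured log against this format automatically. The parser below is a sketch that assumes the log follows the layout shown above. The function name `parse_smoke_output` is hypothetical, and the `FAILED` label is an assumption, since only `PASSED` appears in the expected output.

```python
import re


def parse_smoke_output(text: str):
    """Return (passed, total, failed_test_names) from a smoke-test log."""
    # Indented "name: PASSED" / "name: FAILED" lines; system-info lines do not match.
    results = re.findall(r"^[ \t]+(\w+): (PASSED|FAILED)[ \t]*$", text, flags=re.MULTILINE)
    failed = [name for name, status in results if status != "PASSED"]
    overall = re.search(r"Overall: (\d+)/(\d+) tests passed", text)
    if overall is None:
        raise ValueError("no 'Overall: X/Y tests passed' line found")
    return int(overall.group(1)), int(overall.group(2)), failed
```

A caller could treat any result where `passed != total` or `failed` is non-empty as a failed verification.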

---

## 6. Paper Consistency Validation

### Reference
**Paper**: arXiv 2511.00062 - "World Simulation with Video Foundation Models for Physical AI"

### Validation Points

#### 6.1 Predict2.5 - Temporal Consistency (Section 4.2)

**Paper Claim**: "Generated videos maintain reasonable spatiotemporal continuity in short-term prediction"

**Validation Method**:
- Generate N=3 videos with different seeds
- Compute mean frame-to-frame pixel difference
- Verify differences are smooth (mean_diff < 50)

**Metric**: Mean adjacent frame difference (pixel intensity 0-255 scale)

**Pass Criteria**: mean_diff < 50 for majority of samples
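
The metric and pass rule can be sketched with NumPy as follows. The sketch assumes frames are stacked into a `uint8` array of shape `(T, H, W, C)` and interprets "pixel difference" as the mean absolute difference. Both function names are illustrative rather than the ones used in `tests/paper_validation`.

```python
import numpy as np


def mean_adjacent_frame_diff(frames: np.ndarray) -> float:
    """Mean absolute difference between consecutive frames on the 0-255 scale.

    frames: uint8 array shaped (T, H, W, C).
    """
    frames = frames.astype(np.float32)  # avoid uint8 wrap-around on subtraction
    return float(np.abs(frames[1:] - frames[:-1]).mean())


def majority_pass(diffs, threshold: float = 50.0) -> bool:
    """True when more than half of the samples fall below the threshold."""
    passed = sum(d < threshold for d in diffs)
    return passed > len(diffs) / 2
```

With N=3 videos, `majority_pass` requires at least two of the three mean differences to be below 50.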

#### 6.2 Predict2.5 - Reproducibility (Section 5)

**Paper Claim**: "Fixed random seeds produce deterministic outputs"

**Validation Method**:
- Run same prompt with same seed twice
- Compute SSIM between corresponding frames
- Verify outputs are nearly identical

**Metric**: Mean SSIM between run1 and run2

**Pass Criteria**: mean_ssim > 0.95
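
To make the metric concrete, the sketch below computes SSIM from whole-frame statistics using NumPy only. This is a simplification and is named plainly as such: the standard SSIM, as implemented by scikit-image's `structural_similarity`, averages the same formula over local windows, so absolute values will differ from the validation script's. The names `global_ssim` and `mean_ssim` are illustrative.

```python
import numpy as np


def global_ssim(a: np.ndarray, b: np.ndarray, data_range: float = 255.0) -> float:
    """SSIM from whole-frame statistics (a single window covering the frame)."""
    a = a.astype(np.float64)
    b = b.astype(np.float64)
    c1 = (0.01 * data_range) ** 2  # stabilizer for the luminance term
    c2 = (0.03 * data_range) ** 2  # stabilizer for the contrast/structure term
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a**2 + mu_b**2 + c1) * (var_a + var_b + c2)
    return float(numerator / denominator)


def mean_ssim(run1: np.ndarray, run2: np.ndarray) -> float:
    """Average SSIM over corresponding frames of two (T, H, W, C) videos."""
    return float(np.mean([global_ssim(f1, f2) for f1, f2 in zip(run1, run2)]))
```

Two bit-identical runs give a mean SSIM of 1.0 (up to floating-point rounding), which clears the 0.95 threshold.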

#### 6.3 Transfer2.5 - Structure Preservation (Section 3.2)

**Paper Claim**: "Cosmos-Transfer2.5 preserves structural consistency during domain transfer"

**Validation Method**:
- Extract edge maps from input and output
- Compute SSIM between edge maps
- Verify edges are preserved

**Metric**: Edge SSIM between input and output

**Pass Criteria**: mean_edge_ssim > 0.3

#### 6.4 Transfer2.5 - Domain Change (Section 4.3)

**Paper Claim**: "Model can perform world-to-world translation (e.g., day→night)"

**Validation Method**:
- Apply day→night transfer
- Compute SSIM between input and output
- Verify output differs from input while maintaining structure

**Metric**: Pixel SSIM between input and output

**Pass Criteria**: 0.1 < mean_ssim < 0.9 (different but not random)
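
The four pass criteria in 6.1 through 6.4 can be collected in one function so the thresholds live in a single place. This is a sketch: the function name `evaluate` and the dictionary keys are hypothetical, and the numbers in the example call are placeholders, not measured results.

```python
def evaluate(metrics: dict) -> dict:
    """Map each validation point (6.1-6.4) to True/False using the stated thresholds."""
    diffs = metrics["temporal_mean_diffs"]  # one mean_diff value per generated video
    return {
        "6.1_temporal_consistency": sum(d < 50 for d in diffs) > len(diffs) / 2,
        "6.2_reproducibility": metrics["repro_mean_ssim"] > 0.95,
        "6.3_structure_preservation": metrics["mean_edge_ssim"] > 0.3,
        "6.4_domain_change": 0.1 < metrics["transfer_mean_ssim"] < 0.9,
    }


# Placeholder inputs for illustration only; these are not measurements.
example = evaluate({
    "temporal_mean_diffs": [12.0, 18.5, 61.0],
    "repro_mean_ssim": 0.99,
    "mean_edge_ssim": 0.42,
    "transfer_mean_ssim": 0.55,
})
print(example)
```

Note that 6.4 is a band check: an SSIM at or above 0.9 fails because the output barely changed, and one at or below 0.1 fails because the structure was lost.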

### Running Validation

```bash
python -m tests.paper_validation
python -m tests.paper_validation --skip-transfer  # If VRAM limited
```

---

## 7. Limitations & Future Work

### Current Limitations

| Limitation | Reason | Potential Solution |
|------------|--------|-------------------|
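| Transfer2.5 may fail on ZeroGPU | The 65.4 GB requirement leaves only ~4.6 GB of headroom under the 70 GB limit | Use A100 80GB or quantization |
| Long cold start | Large model weights must be downloaded on first run | Cache model weights between runs |
| Limited output length | Generation must finish within the configured GPU duration (300 s for Predict2.5, 420 s for Transfer2.5) | Increase GPU duration |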
| No multi-view support | Not implemented | Add multi-view inference |

### Not Covered from Paper

1. **Multi-view generation** (Section 3.3) - Not implemented
2. **Autonomous Vehicle post-training** (Section 4.1) - Not included
3. **Full benchmark evaluation** (Section 5.2) - Only smoke tests
4. **14B model variants** - VRAM insufficient

### Future Improvements

1. Add A100 80GB option for better Transfer2.5 stability
2. Implement multi-view inference for robotics use cases
3. Add quantized model variants for lower VRAM
4. Implement full paper benchmark suite
5. Expose Video2World (video-to-video) inference for Predict2.5 in the demo

---

## 8. API Usage Examples

### Gradio Client (Python)

```python
from gradio_client import Client

# Connect to Space
client = Client("wbw2000/cosmos-predict-transfer-demo")

# Text2World
result = client.predict(
    prompt="A peaceful garden with butterflies",
    negative_prompt="low quality, blurry",
    num_frames=49,
    height=480,
    width=720,
    num_inference_steps=30,
    guidance_scale=7.0,
    seed=42,
    api_name="/run_predict_text2world"
)

video_path, log = result
print(f"Video saved to: {video_path}")
```

### REST API (curl)

```bash
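# Gradio 5 uses a two-step API: the POST returns an event ID, the GET streams the result.
EVENT_ID=$(curl -s -X POST "https://wbw2000-cosmos-predict-transfer-demo.hf.space/gradio_api/call/run_predict_text2world" \
  -H "Content-Type: application/json" \
  -d '{
    "data": [
      "A futuristic city at sunset",
      "low quality, blurry",
      49, 480, 720, 30, 7.0, 42
    ]
  }' | awk -F'"' '{ print $4 }')

# Stream server-sent events until the "complete" event carries the output.
curl -N "https://wbw2000-cosmos-predict-transfer-demo.hf.space/gradio_api/call/run_predict_text2world/$EVENT_ID"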
```

---

## 9. References

- **Paper**: https://arxiv.org/abs/2511.00062
- **Predict2.5 GitHub**: https://github.com/nvidia-cosmos/cosmos-predict2.5
- **Transfer2.5 GitHub**: https://github.com/nvidia-cosmos/cosmos-transfer2.5
- **Predict2.5 HuggingFace**: https://huggingface.co/nvidia/Cosmos-Predict2.5-2B
- **Transfer2.5 HuggingFace**: https://huggingface.co/nvidia/Cosmos-Transfer2.5-2B
- **License**: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license

---

## 10. Changelog

| Date | Change |
|------|--------|
| 2026-01-26 | Initial deployment to HuggingFace Spaces (commit `f829a7a65deb6afc884d48e479a0c02f16885e24`) |

---

*Report generated: 2026-01-26*

*Author: Claude Code (Anthropic)*