# NVIDIA Cosmos World Foundation Models - Deployment Report

## Summary

| Item | Value |
|------|-------|
| **Space URL** | https://huggingface.co/spaces/wbw2000/cosmos-predict-transfer-demo |
| **Commit Hash** | `f829a7a65deb6afc884d48e479a0c02f16885e24` |
| **Created** | 2026-01-26 |
| **Status** | Building / Pending Verification |

---

## 1. Model Information

### Cosmos Predict2.5-2B

| Property | Value |
|----------|-------|
| **Model ID** | `nvidia/Cosmos-Predict2.5-2B` |
| **Parameters** | 2,059,174,912 (~2.06B) |
| **VRAM Required** | 32.54 GB |
| **Precision** | BF16 only |
| **Capabilities** | Text2World, Image2World, Video2World |

### Cosmos Transfer2.5-2B

| Property | Value |
|----------|-------|
| **Model ID** | `nvidia/Cosmos-Transfer2.5-2B` |
| **Parameters** | 2,358,047,744 (~2.36B) |
| **VRAM Required** | 65.4 GB |
| **Precision** | BF16 only |
| **Control Inputs** | Blur, Edge, Depth, Segmentation |
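
As a quick sanity check on the VRAM figures, the sketch below computes how much memory the BF16 weights alone occupy (2 bytes per parameter, parameter counts from the tables above, GB taken as 1e9 bytes). It gives roughly 4.1 GB for Predict2.5 and 4.7 GB for Transfer2.5, so most of the stated requirement presumably comes from other pipeline components and activations rather than the 2B-parameter network itself; that breakdown is an inference, not a measurement from this deployment.

```python
# Parameter counts and VRAM requirements copied from the tables above.
MODELS = {
    "Cosmos-Predict2.5-2B": {"params": 2_059_174_912, "vram_gb": 32.54},
    "Cosmos-Transfer2.5-2B": {"params": 2_358_047_744, "vram_gb": 65.4},
}
BYTES_PER_BF16 = 2  # bfloat16 stores each value in 16 bits


def bf16_weight_gb(params: int) -> float:
    """Memory for the model weights alone, in decimal GB (1e9 bytes)."""
    return params * BYTES_PER_BF16 / 1e9


for name, spec in MODELS.items():
    weights = bf16_weight_gb(spec["params"])
    share = weights / spec["vram_gb"]  # fraction of the stated requirement
    print(f"{name}: weights ~{weights:.2f} GB of {spec['vram_gb']} GB ({share:.0%})")
```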

---

## 2. Hardware Configuration

| Property | Value |
|----------|-------|
| **Hardware** | ZeroGPU (NVIDIA H200) |
| **VRAM** | 70 GB |
| **Supported Models** | Predict2.5 (32.54 GB) and Transfer2.5 (65.4 GB) |
| **GPU Duration** | Predict: 300 s, Transfer: 420 s |

### Why ZeroGPU/H200

- Predict2.5-2B requires 32.54 GB, so an A10G (24 GB) is insufficient.
- Transfer2.5-2B requires 65.4 GB, so among the options considered only the H200 (70 GB) or an A100 (80 GB) has enough memory.
- ZeroGPU H200 (70 GB) was chosen as the most cost-effective HuggingFace option that fits both models.
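
The fit comparison above is simple arithmetic and can be reproduced as follows. This sketch uses only the figures quoted in this report and treats the stated requirements as exact; in practice CUDA context and allocator overhead reduce the usable headroom further. It makes the margin explicit: Transfer2.5 leaves only about 4.6 GB (roughly 7%) free on the 70 GB H200.

```python
# VRAM requirements from Section 1 and GPU memory sizes from this section.
REQUIRED_GB = {"Predict2.5-2B": 32.54, "Transfer2.5-2B": 65.4}
GPUS_GB = {"A10G": 24, "ZeroGPU H200": 70, "A100 80GB": 80}


def fits(required_gb: float, available_gb: float) -> bool:
    """Nominal check: ignores CUDA context and allocator overhead."""
    return required_gb <= available_gb


for gpu, available in GPUS_GB.items():
    for model, required in REQUIRED_GB.items():
        headroom = available - required  # negative means it does not fit
        status = "fits" if fits(required, available) else "does not fit"
        print(f"{gpu:>14} | {model:<15} | {status:<12} | headroom {headroom:+.2f} GB")
```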

---

## 3. Key Dependencies

```
torch==2.5.1
diffusers>=0.34.0
transformers>=4.52.4
accelerate>=1.7.0
gradio>=5.0.0
av>=14.0.0
opencv-python-headless>=4.8.0
imageio>=2.31.0
scikit-image>=0.21.0
```
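
Before running the smoke tests it can help to confirm that the installed packages match these pins. The sketch below reads installed versions with the standard-library `importlib.metadata`; `release_tuple`, `satisfies`, and `check_environment` are hypothetical helpers, and the version comparison is deliberately simplified (numeric release parts only, pre-release suffixes ignored). `packaging.version.Version` would be the more robust choice where that package is available.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the list above: "==" means exact match, ">=" means minimum.
REQUIREMENTS = {
    "torch": ("==", "2.5.1"),
    "diffusers": (">=", "0.34.0"),
    "transformers": (">=", "4.52.4"),
    "accelerate": (">=", "1.7.0"),
    "gradio": (">=", "5.0.0"),
    "av": (">=", "14.0.0"),
    "opencv-python-headless": (">=", "4.8.0"),
    "imageio": (">=", "2.31.0"),
    "scikit-image": (">=", "0.21.0"),
}


def release_tuple(v: str) -> tuple:
    """Numeric release part of a version, e.g. '2.5.1+cu124' -> (2, 5, 1)."""
    nums = []
    for piece in v.split("+")[0].split("."):  # drop the local '+cu124' part
        m = re.match(r"\d+", piece)
        if m is None:
            break
        nums.append(int(m.group()))
        if m.end() != len(piece):  # e.g. '0rc1': keep 0, ignore the suffix
            break
    return tuple(nums)


def satisfies(installed: str, op: str, required: str) -> bool:
    a, b = release_tuple(installed), release_tuple(required)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))  # treat '0.34' as '0.34.0'
    b += (0,) * (width - len(b))
    return a == b if op == "==" else a >= b


def check_environment() -> dict:
    report = {}
    for dist, (op, required) in REQUIREMENTS.items():
        try:
            installed = version(dist)
        except PackageNotFoundError:
            report[dist] = "MISSING"
            continue
        status = "ok" if satisfies(installed, op, required) else f"needs {op}{required}"
        report[dist] = f"{installed} ({status})"
    return report


if __name__ == "__main__":
    for dist, status in check_environment().items():
        print(f"{dist:<24} {status}")
```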

---

## 4. Default Parameters

### Predict2.5

| Parameter | Default | Range |
|-----------|---------|-------|
| Resolution | 720×480 | 720×480 or 1280×720 |
| Frames | 49 (~3 s) | 17-97 |
| Inference Steps | 30 | 10-50 |
| Guidance Scale | 7.0 | 1.0-15.0 |
| Seed | 42 | Any integer |

### Transfer2.5

| Parameter | Default | Range |
|-----------|---------|-------|
| Control Type | blur | blur, edge, depth, segmentation |
| Inference Steps | 30 | 10-50 |
| Guidance Scale | 7.0 | 1.0-15.0 |
| Control Scale | 1.0 | 0.5-2.0 |
| Seed | 42 | Any integer |
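
A minimal sketch of how these defaults and ranges could be enforced before a request reaches the GPU. The `PredictParams`/`TransferParams` dataclasses and their `validate` methods are hypothetical names for illustration, not the demo's actual code; the values are copied from the two tables above, with resolutions written as (width, height).

```python
from dataclasses import dataclass

# Values copied from the tables above; the classes themselves are hypothetical.
PREDICT_RESOLUTIONS = {(720, 480), (1280, 720)}  # (width, height)
CONTROL_TYPES = {"blur", "edge", "depth", "segmentation"}


def _check_range(name: str, value: float, low: float, high: float) -> None:
    if not low <= value <= high:
        raise ValueError(f"{name}={value} is outside [{low}, {high}]")


@dataclass
class PredictParams:
    width: int = 720
    height: int = 480
    num_frames: int = 49
    num_inference_steps: int = 30
    guidance_scale: float = 7.0
    seed: int = 42  # any integer is accepted, so it is not range-checked

    def validate(self) -> None:
        if (self.width, self.height) not in PREDICT_RESOLUTIONS:
            raise ValueError(f"unsupported resolution {self.width}x{self.height}")
        _check_range("num_frames", self.num_frames, 17, 97)
        _check_range("num_inference_steps", self.num_inference_steps, 10, 50)
        _check_range("guidance_scale", self.guidance_scale, 1.0, 15.0)


@dataclass
class TransferParams:
    control_type: str = "blur"
    num_inference_steps: int = 30
    guidance_scale: float = 7.0
    control_scale: float = 1.0
    seed: int = 42

    def validate(self) -> None:
        if self.control_type not in CONTROL_TYPES:
            raise ValueError(f"unknown control_type {self.control_type!r}")
        _check_range("num_inference_steps", self.num_inference_steps, 10, 50)
        _check_range("guidance_scale", self.guidance_scale, 1.0, 15.0)
        _check_range("control_scale", self.control_scale, 0.5, 2.0)
```

For example, `PredictParams().validate()` passes silently, while `PredictParams(num_frames=200).validate()` raises `ValueError`. Failing fast on the CPU side avoids spending a ZeroGPU allocation on a request that cannot succeed.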

---

## 5. Smoke Test Design

### Tests Implemented

#### Predict2.5 Tests (`tests/smoke_predict.py`)
1. **Output Validation**: Verify that the video utilities work correctly
2. **Model Loading**: Verify that Predict2.5-2B can be loaded
3. **Text2World Inference**: Generate a video from a text prompt
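
The report does not spell out what the output-validation test checks, so the following is only a sketch of the kind of structural check it might perform on a decoded clip. It assumes frames arrive as a `(T, H, W, 3)` `uint8` NumPy array; `validate_video` is a hypothetical helper that returns a list of problems rather than raising, so a smoke test can report all of them at once.

```python
import numpy as np


def validate_video(frames: np.ndarray, num_frames: int, height: int, width: int) -> list:
    """Return a list of problems; an empty list means the clip looks well-formed."""
    if frames.ndim != 4 or frames.shape[-1] != 3:
        return [f"expected a (T, H, W, 3) array, got shape {frames.shape}"]
    problems = []
    if frames.dtype != np.uint8:
        problems.append(f"expected uint8 pixels, got {frames.dtype}")
    t, h, w, _ = frames.shape
    if t != num_frames:
        problems.append(f"expected {num_frames} frames, got {t}")
    if (h, w) != (height, width):
        problems.append(f"expected {width}x{height}, got {w}x{h}")
    if frames.size and frames.std() == 0:
        problems.append("all pixels identical (blank output?)")
    return problems
```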

#### Transfer2.5 Tests (`tests/smoke_transfer.py`)
1. **Control Extraction**: Verify that edge/depth extraction works
2. **Style Consistency**: Verify that the SSIM computation works
3. **Model Loading**: Verify that Transfer2.5-2B can be loaded
4. **Video Inference**: Apply style transfer to a video

### Running Tests

```bash
# All tests
python -m tests.smoke_all

# Predict2.5 only (if VRAM < 65 GB)
python -m tests.smoke_all --predict-only

# Individual modules
python -m tests.smoke_predict
python -m tests.smoke_transfer
```
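
One way a runner like `tests.smoke_all` could implement the `--predict-only` switch and the summary format shown in the expected output is sketched below. This is an assumption about structure, not the repository's code: `run_suites`, `main`, and the suite registry are hypothetical, and the placeholder lambdas stand in for real model-loading and inference tests.

```python
import argparse
from typing import Callable, Dict, Optional, Sequence

# Hypothetical registry: suite name -> {test name -> zero-argument test returning bool}.
Suites = Dict[str, Dict[str, Callable[[], bool]]]
PREDICT_SUITE = "PREDICT2.5 TESTS"


def run_suites(suites: Suites, predict_only: bool = False) -> bool:
    """Run every selected test, print one line per test, return True if all passed."""
    passed = total = 0
    for suite_name, tests in suites.items():
        if predict_only and suite_name != PREDICT_SUITE:
            continue  # --predict-only skips the 65.4 GB Transfer2.5 suite
        print(f"\n[{suite_name}]")
        for name, test in tests.items():
            try:
                ok = bool(test())
            except Exception as exc:  # a crashing test counts as a failure
                ok = False
                print(f"{name}: ERROR ({exc})")
            else:
                print(f"{name}: {'PASSED' if ok else 'FAILED'}")
            passed += ok
            total += 1
    print(f"\nOverall: {passed}/{total} tests passed")
    return passed == total


def main(argv: Optional[Sequence[str]] = None) -> int:
    parser = argparse.ArgumentParser(description="Cosmos smoke tests (sketch)")
    parser.add_argument("--predict-only", action="store_true",
                        help="skip Transfer2.5 tests, e.g. when VRAM < 65 GB")
    args = parser.parse_args(argv)
    # Placeholder tests; real suites would load the models and run inference.
    suites: Suites = {
        PREDICT_SUITE: {"output_validation": lambda: True},
        "TRANSFER2.5 TESTS": {"control_extraction": lambda: True},
    }
    ok = run_suites(suites, predict_only=args.predict_only)
    print("STATUS:", "ALL TESTS PASSED" if ok else "SOME TESTS FAILED")
    return 0 if ok else 1
```

Calling `main(["--predict-only"])` runs only the Predict2.5 suite. Returning an exit code rather than printing alone lets CI treat any failed or crashing test as a failed job; as a module entry point this would typically be wired up with `raise SystemExit(main())`.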

### Expected Output

```
======================================================================
NVIDIA COSMOS WORLD FOUNDATION MODELS - SMOKE TESTS
======================================================================

[System Information]
python_version: 3.11.x
torch_version: 2.5.1
cuda_available: True
gpu_name: NVIDIA H200
gpu_total_vram_gb: 70.0

[PREDICT2.5 TESTS]
output_validation: PASSED
model_loading: PASSED
text2world_inference: PASSED

[TRANSFER2.5 TESTS]
control_extraction: PASSED
style_consistency: PASSED
model_loading: PASSED
video_inference: PASSED

Overall: 7/7 tests passed
STATUS: ALL TESTS PASSED
```

---

## 6. Paper Consistency Validation

### Reference

**Paper**: arXiv 2511.00062 - "World Simulation with Video Foundation Models for Physical AI"

### Validation Points

#### 6.1 Predict2.5 - Temporal Consistency (Section 4.2)

**Paper Claim**: "Generated videos maintain reasonable spatiotemporal continuity in short-term prediction"

**Validation Method**:
- Generate N=3 videos with different seeds
- Compute the mean frame-to-frame pixel difference
- Check that frame-to-frame changes stay small (mean_diff < 50)

**Metric**: Mean adjacent-frame difference (pixel intensity, 0-255 scale)

**Pass Criteria**: mean_diff < 50 for the majority of samples (at least 2 of 3)
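
A sketch of this metric and pass rule, assuming each video is a `(T, H, W, C)` `uint8` array and that "difference" means the absolute per-pixel difference (a signed mean would let brightening and darkening cancel out). Both function names are hypothetical.

```python
import numpy as np


def mean_adjacent_frame_diff(video: np.ndarray) -> float:
    """Mean absolute difference between consecutive frames, on the 0-255 scale."""
    frames = video.astype(np.float32)  # avoid uint8 wrap-around when subtracting
    return float(np.abs(np.diff(frames, axis=0)).mean())


def temporal_consistency_passes(videos, threshold: float = 50.0) -> bool:
    """Pass if a strict majority of videos have mean_diff below `threshold`."""
    diffs = [mean_adjacent_frame_diff(v) for v in videos]
    n_ok = sum(d < threshold for d in diffs)
    return n_ok > len(diffs) / 2
```

Casting to `float32` before `np.diff` matters: subtracting `uint8` arrays directly wraps around (for example, 10 - 20 becomes 246), which would inflate the metric.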

#### 6.2 Predict2.5 - Reproducibility (Section 5)

**Paper Claim**: "Fixed random seeds produce deterministic outputs"

**Validation Method**:
- Run the same prompt with the same seed twice
- Compute SSIM between corresponding frames
- Verify that the outputs are nearly identical

**Metric**: Mean SSIM between run1 and run2

**Pass Criteria**: mean_ssim > 0.95
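
A sketch of the reproducibility check. The deployed tests list scikit-image as a dependency and may use its windowed `skimage.metrics.structural_similarity`; to stay dependency-light, this sketch instead computes a simplified single-window (global) SSIM over each whole frame, which will generally give different numbers than the windowed version. Clips are assumed to be `(T, H, W, C)` `uint8` arrays, and the function names are hypothetical.

```python
import numpy as np


def global_ssim(x: np.ndarray, y: np.ndarray, data_range: float = 255.0) -> float:
    """SSIM with one window spanning the whole image (simplified, not windowed)."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    c1 = (0.01 * data_range) ** 2  # standard SSIM stabilising constants
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = ((x - mx) ** 2).mean(), ((y - my) ** 2).mean()
    cov = ((x - mx) * (y - my)).mean()
    num = (2 * mx * my + c1) * (2 * cov + c2)
    den = (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    return float(num / den)


def mean_frame_ssim(video_a: np.ndarray, video_b: np.ndarray) -> float:
    """Average global SSIM over corresponding frames of two clips."""
    if video_a.shape != video_b.shape:
        raise ValueError(f"shape mismatch: {video_a.shape} vs {video_b.shape}")
    return float(np.mean([global_ssim(a, b) for a, b in zip(video_a, video_b)]))


def reproducibility_passes(run1: np.ndarray, run2: np.ndarray, threshold: float = 0.95) -> bool:
    return mean_frame_ssim(run1, run2) > threshold
```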

#### 6.3 Transfer2.5 - Structure Preservation (Section 3.2)

**Paper Claim**: "Cosmos-Transfer2.5 preserves structural consistency during domain transfer"

**Validation Method**:
- Extract edge maps from the input and the output
- Compute SSIM between the edge maps
- Verify that edges are preserved

**Metric**: Edge SSIM between input and output

**Pass Criteria**: mean_edge_ssim > 0.3

#### 6.4 Transfer2.5 - Domain Change (Section 4.3)

**Paper Claim**: "Model can perform world-to-world translation (e.g., day→night)"

**Validation Method**:
- Apply a day→night transfer
- Compute SSIM between the input and the output
- Verify that the output differs from the input while maintaining structure

**Metric**: Pixel SSIM between input and output

**Pass Criteria**: 0.1 < mean_ssim < 0.9 (different, but not random)
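
Both Transfer2.5 checks (6.3 and 6.4) can be sketched together because they share the SSIM computation. Two substitutions are assumptions of this sketch: edges come from a thresholded NumPy gradient magnitude rather than `cv2.Canny`, and SSIM is the simplified global form rather than scikit-image's windowed version, so thresholds tuned on the real tests may not carry over. The edge threshold is relative to each frame's strongest edge, so a darker "night" frame with the same layout yields the same edge map. Clips are assumed to be `(T, H, W, C)` arrays; all function names are hypothetical.

```python
import numpy as np


def global_ssim(x, y, data_range: float = 255.0) -> float:
    """Single-window SSIM over the whole image (simplified, not windowed)."""
    x, y = x.astype(np.float64), y.astype(np.float64)
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = ((x - mx) ** 2).mean(), ((y - my) ** 2).mean()
    cov = ((x - mx) * (y - my)).mean()
    return float((2 * mx * my + c1) * (2 * cov + c2)
                 / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2)))


def edge_map(frame: np.ndarray, rel_threshold: float = 0.2) -> np.ndarray:
    """Binary edge map (0 or 255): gradient magnitude above a fraction of its max."""
    gray = frame.astype(np.float64).mean(axis=-1)  # (H, W, C) -> (H, W)
    gy, gx = np.gradient(gray)  # derivatives along rows, then columns
    mag = np.hypot(gx, gy)
    return (mag > rel_threshold * mag.max()).astype(np.float64) * 255.0


def structure_preserved(inp, out, threshold: float = 0.3) -> bool:
    """6.3: mean edge-map SSIM between input and output frames exceeds threshold."""
    scores = [global_ssim(edge_map(a), edge_map(b)) for a, b in zip(inp, out)]
    return float(np.mean(scores)) > threshold


def domain_changed(inp, out, low: float = 0.1, high: float = 0.9) -> bool:
    """6.4: pixel SSIM low enough to show a change, high enough to rule out noise."""
    scores = [global_ssim(a, b) for a, b in zip(inp, out)]
    return low < float(np.mean(scores)) < high
```

As a synthetic usage example, a bright frame with a dark square and a uniformly darkened copy of it passes both checks: the edge maps match exactly, and the pixel SSIM lands near 0.64, inside the 0.1-0.9 band. An identical copy fails 6.4 (SSIM of 1.0), and so does unrelated noise (SSIM near 0).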

### Running Validation

```bash
python -m tests.paper_validation
python -m tests.paper_validation --skip-transfer  # if VRAM is limited (< 65 GB)
```

---

## 7. Limitations & Future Work

### Current Limitations

| Limitation | Reason | Potential Solution |
|------------|--------|--------------------|
| Transfer2.5 may fail on ZeroGPU | Its 65.4 GB requirement leaves little headroom under the 70 GB limit | Use an A100 (80 GB) or quantization |
| Long cold start | Large model downloads | Cache model weights |
| Limited output length | Outputs are kept short to stay within the GPU duration (300 s for Predict, 420 s for Transfer) | Increase the GPU duration |
| No multi-view support | Not implemented | Add multi-view inference |

### Not Covered from Paper

1. **Multi-view generation** (Section 3.3) - Not implemented
2. **Autonomous Vehicle post-training** (Section 4.1) - Not included
3. **Full benchmark evaluation** (Section 5.2) - Only smoke tests
4. **14B model variants** - Insufficient VRAM

### Future Improvements

1. Add an A100 80GB option for better Transfer2.5 stability
2. Implement multi-view inference for robotics use cases
3. Add quantized model variants for lower VRAM
4. Implement the full paper benchmark suite
5. Add video-to-video inference for Predict2.5

---

## 8. API Usage Examples

### Gradio Client (Python)

```python
from gradio_client import Client

# Connect to the Space
client = Client("wbw2000/cosmos-predict-transfer-demo")

# Text2World
result = client.predict(
    prompt="A peaceful garden with butterflies",
    negative_prompt="low quality, blurry",
    num_frames=49,
    height=480,
    width=720,
    num_inference_steps=30,
    guidance_scale=7.0,
    seed=42,
    api_name="/run_predict_text2world",
)

video, log = result
# Recent gradio_client versions return a gr.Video output as a dict with a "video" key
video_path = video["video"] if isinstance(video, dict) else video
print(f"Video saved to: {video_path}")
print(log)
```

### REST API (curl)

```bash
# Gradio 5 uses a two-step API: POST the inputs to get an EVENT_ID, then stream the result.
# Inputs are positional, in the same order as the Python example above.
EVENT_ID=$(curl -s -X POST "https://wbw2000-cosmos-predict-transfer-demo.hf.space/gradio_api/call/run_predict_text2world" \
  -H "Content-Type: application/json" \
  -d '{
    "data": [
      "A futuristic city at sunset",
      "low quality, blurry",
      49, 480, 720, 30, 7.0, 42
    ]
  }' | awk -F'"' '{ print $4 }')

curl -N "https://wbw2000-cosmos-predict-transfer-demo.hf.space/gradio_api/call/run_predict_text2world/$EVENT_ID"
```
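
Gradio's `/call/` REST endpoints, used by current Gradio versions, return results as a stream of server-sent events rather than plain JSON. A small parser for that stream is sketched below, assuming the `event:`/`data:` line layout Gradio documents for these endpoints (with events such as `heartbeat`, `complete`, and `error`). `parse_sse` and `completed_result` are hypothetical helpers, and the structure of the decoded outputs depends on the Space's output components.

```python
import json
from typing import Any, List, Optional, Tuple


def parse_sse(text: str) -> List[Tuple[str, Optional[str]]]:
    """Split a server-sent-event stream into (event, data) pairs."""
    pairs: List[Tuple[str, Optional[str]]] = []
    event: Optional[str] = None
    for line in text.splitlines():
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:") and event is not None:
            pairs.append((event, line[len("data:"):].strip()))
            event = None  # each data line belongs to the event just before it
    return pairs


def completed_result(text: str) -> Any:
    """Return the decoded outputs of the 'complete' event; raise on 'error'."""
    for event, data in parse_sse(text):
        if event == "error":
            raise RuntimeError(f"Space reported an error: {data}")
        if event == "complete":
            return json.loads(data)
    raise ValueError("stream ended without a 'complete' event")
```

The full text of a streamed response can be passed to `completed_result`; heartbeat events are skipped, and an `error` event surfaces as an exception instead of a silently empty result.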

---

## 9. References

- **Paper**: https://arxiv.org/abs/2511.00062
- **Predict2.5 GitHub**: https://github.com/nvidia-cosmos/cosmos-predict2.5
- **Transfer2.5 GitHub**: https://github.com/nvidia-cosmos/cosmos-transfer2.5
- **Predict2.5 HuggingFace**: https://huggingface.co/nvidia/Cosmos-Predict2.5-2B
- **Transfer2.5 HuggingFace**: https://huggingface.co/nvidia/Cosmos-Transfer2.5-2B
- **License**: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license

---

## 10. Changelog

| Date | Change |
|------|--------|
| 2026-01-26 | Initial deployment to HuggingFace Spaces |
| - | Commit: `f829a7a65deb6afc884d48e479a0c02f16885e24` |

---

*Report generated: 2026-01-26*
*Author: Claude Code (Anthropic)*