Add files using upload-large-folder tool
- .gitattributes +2 -0
- assets/foundation-pipelines/README.md +16 -16
- assets/foundation-pipelines/human-video-world-model-pipeline.png +2 -2
- assets/foundation-pipelines/prompts.md +19 -44
- assets/foundation-pipelines/source-photos/human-video-world-model-source.jpg +3 -0
- assets/foundation-pipelines/source-photos/vision-language-action-source.jpg +3 -0
- assets/foundation-pipelines/spatial-intelligence-pipeline.png +2 -2
- assets/foundation-pipelines/vision-language-action-pipeline.png +2 -2
- docs/data/research_roadmap_interactive.json +4 -4
- docs/data/scope_claims_audit.json +1 -1
.gitattributes
CHANGED
|
@@ -63,3 +63,5 @@ assets/foundation-pipelines/human-video-world-model-pipeline.png filter=lfs diff
|
|
| 63 |
assets/foundation-pipelines/vision-language-action-pipeline.png filter=lfs diff=lfs merge=lfs -text
|
| 64 |
assets/foundation-pipelines/spatial-intelligence-pipeline.png filter=lfs diff=lfs merge=lfs -text
|
| 65 |
results/omni_finetune/xperience10m_128ep_dense_multiscale_hierarchical_v1_20260608/dense_multiscale_windows.jsonl filter=lfs diff=lfs merge=lfs -text
|
| 63 |
assets/foundation-pipelines/vision-language-action-pipeline.png filter=lfs diff=lfs merge=lfs -text
|
| 64 |
assets/foundation-pipelines/spatial-intelligence-pipeline.png filter=lfs diff=lfs merge=lfs -text
|
| 65 |
results/omni_finetune/xperience10m_128ep_dense_multiscale_hierarchical_v1_20260608/dense_multiscale_windows.jsonl filter=lfs diff=lfs merge=lfs -text
|
| 66 |
+
assets/foundation-pipelines/source-photos/human-video-world-model-source.jpg filter=lfs diff=lfs merge=lfs -text
|
| 67 |
+
assets/foundation-pipelines/source-photos/vision-language-action-source.jpg filter=lfs diff=lfs merge=lfs -text
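The two added `.gitattributes` rules route the new source photos through Git LFS. As a minimal sketch (not part of this commit), the rules can be checked with a small parser. The helper name `lfs_tracked_paths` is hypothetical, and the parser assumes whitespace-separated patterns without quoting, which holds for the lines shown here but not for every valid `.gitattributes` file.

```python
def lfs_tracked_paths(gitattributes_text: str) -> list[str]:
    """Return the path patterns whose attribute list includes filter=lfs.

    Simplified on purpose: quoted patterns containing spaces are not handled.
    """
    tracked = []
    for raw in gitattributes_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        pattern, *attrs = line.split()
        if "filter=lfs" in attrs:
            tracked.append(pattern)
    return tracked


# The two rules added by this commit, copied from the diff above.
ADDED_RULES = """\
assets/foundation-pipelines/source-photos/human-video-world-model-source.jpg filter=lfs diff=lfs merge=lfs -text
assets/foundation-pipelines/source-photos/vision-language-action-source.jpg filter=lfs diff=lfs merge=lfs -text
"""

if __name__ == "__main__":
    for path in lfs_tracked_paths(ADDED_RULES):
        print(path)
```

Running the same helper over the full `.gitattributes` file and comparing the result with the image paths referenced in the README table is one way to spot an asset that is referenced but not yet tracked.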
|
assets/foundation-pipelines/README.md
CHANGED
|
@@ -1,21 +1,21 @@
|
|
| 1 |
-
# Foundation Pipeline
|
| 2 |
|
| 3 |
-
These three
|
| 4 |
-
|
|
|
|
| 5 |
`docs/data/three_foundation_pipelines.json`.
|
| 6 |
|
| 7 |
-
They replace the earlier concept-art images
|
| 8 |
-
|
| 9 |
-
|
| 10 |
-
|
| 11 |
-
Markdown, JSON, and website labels.
|
| 12 |
|
| 13 |
-
| Track |
|
| 14 |
-
| --- | --- |
|
| 15 |
-
| Spatial intelligence models | `spatial-intelligence-pipeline.png` |
|
| 16 |
-
| Human-video world models | `human-video-world-model-pipeline.png` |
|
| 17 |
-
| Vision-language-action models | `vision-language-action-pipeline.png` |
|
| 18 |
|
| 19 |
-
The deterministic
|
| 20 |
-
`scripts/render_foundation_pipeline_diagrams.py`;
|
| 21 |
-
|
|
|
|
| 1 |
+
# Foundation Pipeline Presentation Photos
|
| 2 |
|
| 3 |
+
These three public images are restored high-resolution photos from the
|
| 4 |
+
foundation-direction presentation slides. They are used for the pipeline tracks
|
| 5 |
+
documented in `THREE_FOUNDATION_PIPELINES.md` and
|
| 6 |
`docs/data/three_foundation_pipelines.json`.
|
| 7 |
|
| 8 |
+
They replace the earlier concept-art images and keep the public visuals tied to
|
| 9 |
+
the original direction slides. They are still **pipeline communication
|
| 10 |
+
assets**, not evidence of completed foundation-model quality. Exact technical
|
| 11 |
+
claims live in the surrounding Markdown, JSON, and website labels.
|
|
|
|
| 12 |
|
| 13 |
+
| Track | Enhanced asset | Source photo |
|
| 14 |
+
| --- | --- | --- |
|
| 15 |
+
| Spatial intelligence models | `spatial-intelligence-pipeline.png` | `source-photos/spatial-intelligence-source.jpg` |
|
| 16 |
+
| Human-video world models | `human-video-world-model-pipeline.png` | `source-photos/human-video-world-model-source.jpg` |
|
| 17 |
+
| Vision-language-action models | `vision-language-action-pipeline.png` | `source-photos/vision-language-action-source.jpg` |
|
| 18 |
|
| 19 |
+
The deterministic restoration script is
|
| 20 |
+
`scripts/render_foundation_pipeline_diagrams.py`; restoration notes and source
|
| 21 |
+
photo mapping are in `prompts.md`.
|
assets/foundation-pipelines/human-video-world-model-pipeline.png
CHANGED
|
Git LFS Details
|
|
assets/foundation-pipelines/prompts.md
CHANGED
|
@@ -1,49 +1,24 @@
|
|
| 1 |
-
# Foundation Pipeline
|
| 2 |
|
| 3 |
-
The
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
|
| 7 |
|
| 8 |
-
|
| 9 |
|
| 10 |
-
|
| 11 |
-
Xperience-10M foundation pipeline track. Create a structured diagram, not
|
| 12 |
-
concept art, for a spatial intelligence model training direction. Show four
|
| 13 |
-
left-to-right zones: inputs, task targets, model training, and evaluation
|
| 14 |
-
gates. The content should represent multiview RGB, egocentric video, depth,
|
| 15 |
-
camera pose, calibration, object/contact/language cues, spatial QA, object
|
| 16 |
-
counting, object permanence, relative location, multiview retrieval, 3D
|
| 17 |
-
consistency, spatial-memory encoders, and held-out episode metrics. Use a
|
| 18 |
-
premium dark research-product style, high contrast, crisp panels, clean
|
| 19 |
-
technical linework, no decorative blobs, no logos, no watermark.
|
| 20 |
|
| 21 |
-
|
| 22 |
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
training, and held-out future evaluation. The content should represent
|
| 28 |
-
observed video/audio/sensor windows, hand/body motion, camera pose,
|
| 29 |
-
object/contact state, action/subtask labels, next action, next subtask, future
|
| 30 |
-
object set, contact transition, camera-motion delta, latent future state, Qwen
|
| 31 |
-
structured future probes, Cosmos/dynamics branches, rollout or latent
|
| 32 |
-
reconstruction, no future leakage, and future-task metrics. Use a premium dark
|
| 33 |
-
research-product style, high contrast, crisp panels, clean technical linework,
|
| 34 |
-
no decorative blobs, no logos, no watermark.
|
| 35 |
-
|
| 36 |
-
## Vision-Language-Action
|
| 37 |
-
|
| 38 |
-
Use case: infographic-diagram. Asset type: 16:9 website figure for Ropedia
|
| 39 |
-
Xperience-10M foundation pipeline track. Create a structured diagram, not
|
| 40 |
-
concept art, for a vision-language-action model training direction. Show four
|
| 41 |
-
left-to-right zones: observation/language inputs, action task targets,
|
| 42 |
-
VLA/policy-compatible training, and held-out action evaluation. The content
|
| 43 |
-
should represent egocentric video, captions, objects, contacts, procedures,
|
| 44 |
-
hand/body motion windows, subtask labels, action-token vocabulary, next action,
|
| 45 |
-
action chunks, object-conditioned action, contact state, subtask transition,
|
| 46 |
-
action-space conversion, normalization, leakage and retargeting reports, VLA
|
| 47 |
-
or policy heads, and held-out policy/action metrics. Use a premium dark
|
| 48 |
-
research-product style, high contrast, crisp panels, clean technical linework,
|
| 49 |
-
no decorative blobs, no logos, no watermark.
|
|
|
|
| 1 |
+
# Foundation Pipeline Photo Restoration Notes
|
| 2 |
|
| 3 |
+
The current public assets are not generated concept art. They are restored
|
| 4 |
+
high-resolution PNGs rebuilt from original presentation photos supplied by the
|
| 5 |
+
project owner. The filename is kept as `prompts.md` because older public
|
| 6 |
+
manifests and mirrors already link here as the provenance note.
|
| 7 |
|
| 8 |
+
| Track | Source photo | Enhanced public PNG |
|
| 9 |
+
| --- | --- | --- |
|
| 10 |
+
| Spatial intelligence models | `source-photos/spatial-intelligence-source.jpg` | `spatial-intelligence-pipeline.png` |
|
| 11 |
+
| Human-video world models | `source-photos/human-video-world-model-source.jpg` | `human-video-world-model-pipeline.png` |
|
| 12 |
+
| Vision-language-action models | `source-photos/vision-language-action-source.jpg` | `vision-language-action-pipeline.png` |
|
| 13 |
|
| 14 |
+
Restoration is deterministic and local:
|
| 15 |
|
| 16 |
+
- EXIF orientation normalization.
|
| 17 |
+
- Autocontrast and moderate brightness/color/contrast correction.
|
| 18 |
+
- Lanczos resize to a 2560-pixel public width.
|
| 19 |
+
- Gentle sharpening and unsharp masking.
|
| 20 |
|
| 21 |
+
The restoration script deliberately does not synthesize, redraw, or hallucinate
|
| 22 |
+
slide text. Technical task/training/evaluation claims are maintained in
|
| 23 |
+
`THREE_FOUNDATION_PIPELINES.md` and
|
| 24 |
+
`docs/data/three_foundation_pipelines.json`.
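The four restoration steps listed in the new `prompts.md` map onto standard Pillow operations. The following is a rough sketch under stated assumptions, not the contents of `scripts/render_foundation_pipeline_diagrams.py`: the function name `restore_photo` and every enhancement strength (cutoff, factors, unsharp-mask parameters) are illustrative guesses. Only the step order and the 2560-pixel width come from the notes.

```python
from PIL import Image, ImageEnhance, ImageFilter, ImageOps

PUBLIC_WIDTH = 2560  # from the notes: "2560-pixel public width"


def restore_photo(img: Image.Image) -> Image.Image:
    """Apply the four listed steps in order. All strengths are assumed values."""
    # 1. EXIF orientation normalization.
    img = ImageOps.exif_transpose(img).convert("RGB")

    # 2. Autocontrast, then moderate brightness/color/contrast correction.
    img = ImageOps.autocontrast(img, cutoff=1)
    img = ImageEnhance.Brightness(img).enhance(1.05)
    img = ImageEnhance.Color(img).enhance(1.05)
    img = ImageEnhance.Contrast(img).enhance(1.08)

    # 3. Lanczos resize to the public width, preserving aspect ratio.
    height = round(img.height * PUBLIC_WIDTH / img.width)
    img = img.resize((PUBLIC_WIDTH, height), Image.LANCZOS)

    # 4. Gentle sharpening followed by unsharp masking.
    img = ImageEnhance.Sharpness(img).enhance(1.2)
    img = img.filter(ImageFilter.UnsharpMask(radius=2, percent=60, threshold=3))
    return img


if __name__ == "__main__":
    # Synthetic stand-in for a source photo, so the sketch runs without any files.
    demo = Image.new("RGB", (640, 480), (120, 90, 60))
    print(restore_photo(demo).size)
```

Every operation here is a fixed pixel transform with no model or random sampling involved, which is consistent with the notes' statement that restoration is deterministic and does not synthesize or redraw slide text.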
|
assets/foundation-pipelines/source-photos/human-video-world-model-source.jpg
ADDED
|
Git LFS Details
|
assets/foundation-pipelines/source-photos/vision-language-action-source.jpg
ADDED
|
Git LFS Details
|
assets/foundation-pipelines/spatial-intelligence-pipeline.png
CHANGED
|
Git LFS Details
|
|
assets/foundation-pipelines/vision-language-action-pipeline.png
CHANGED
|
Git LFS Details
|
|
docs/data/research_roadmap_interactive.json
CHANGED
|
@@ -2222,7 +2222,7 @@
|
|
| 2222 |
],
|
| 2223 |
"status": "planning_artifact"
|
| 2224 |
},
|
| 2225 |
-
"generated_at_utc": "2026-06-
|
| 2226 |
"omni_plan": {
|
| 2227 |
"adapter": "LoRA rank 16, alpha 32, dropout 0.05",
|
| 2228 |
"backbone": "Qwen/Qwen3-Omni-30B-A3B-Instruct",
|
|
@@ -3307,7 +3307,7 @@
|
|
| 3307 |
"diagram_image": "docs/assets/foundation-pipelines/spatial-intelligence-pipeline.png",
|
| 3308 |
"first_pipeline": "Build a spatial-memory exporter, start with metric depth and pose consistency tasks, then evaluate spatial QA, object permanence, counting, retrieval, and pose-aware consistency.",
|
| 3309 |
"id": "spatial_intelligence",
|
| 3310 |
-
"image_alt": "
|
| 3311 |
"intermediate_artifacts": [
|
| 3312 |
"synchronized camera window manifest",
|
| 3313 |
"pose and depth availability report",
|
|
@@ -3386,7 +3386,7 @@
|
|
| 3386 |
"diagram_image": "docs/assets/foundation-pipelines/human-video-world-model-pipeline.png",
|
| 3387 |
"first_pipeline": "Keep Qwen-style structured future probes for task interpretability, keep Cosmos-style dynamics branches separate, and add latent or feature-reconstruction metrics before claiming world-model quality.",
|
| 3388 |
"id": "human_video_world_models",
|
| 3389 |
-
"image_alt": "
|
| 3390 |
"intermediate_artifacts": [
|
| 3391 |
"observed and future window pairs",
|
| 3392 |
"future label targets",
|
|
@@ -3463,7 +3463,7 @@
|
|
| 3463 |
"diagram_image": "docs/assets/foundation-pipelines/vision-language-action-pipeline.png",
|
| 3464 |
"first_pipeline": "Define the action space, use existing 20-task next-action/contact/object-conditioned tasks first, then add hand-trajectory or policy-compatible action chunks after conversion is traceable.",
|
| 3465 |
"id": "vision_language_action",
|
| 3466 |
-
"image_alt": "
|
| 3467 |
"intermediate_artifacts": [
|
| 3468 |
"action-token vocabulary",
|
| 3469 |
"action-chunk windows",
|
|
|
|
| 2222 |
],
|
| 2223 |
"status": "planning_artifact"
|
| 2224 |
},
|
| 2225 |
+
"generated_at_utc": "2026-06-18T08:24:55+00:00",
|
| 2226 |
"omni_plan": {
|
| 2227 |
"adapter": "LoRA rank 16, alpha 32, dropout 0.05",
|
| 2228 |
"backbone": "Qwen/Qwen3-Omni-30B-A3B-Instruct",
|
|
|
|
| 3307 |
"diagram_image": "docs/assets/foundation-pipelines/spatial-intelligence-pipeline.png",
|
| 3308 |
"first_pipeline": "Build a spatial-memory exporter, start with metric depth and pose consistency tasks, then evaluate spatial QA, object permanence, counting, retrieval, and pose-aware consistency.",
|
| 3309 |
"id": "spatial_intelligence",
|
| 3310 |
+
"image_alt": "Restored presentation photo showing the Spatial intelligence models direction slide for Xperience-10M.",
|
| 3311 |
"intermediate_artifacts": [
|
| 3312 |
"synchronized camera window manifest",
|
| 3313 |
"pose and depth availability report",
|
|
|
|
| 3386 |
"diagram_image": "docs/assets/foundation-pipelines/human-video-world-model-pipeline.png",
|
| 3387 |
"first_pipeline": "Keep Qwen-style structured future probes for task interpretability, keep Cosmos-style dynamics branches separate, and add latent or feature-reconstruction metrics before claiming world-model quality.",
|
| 3388 |
"id": "human_video_world_models",
|
| 3389 |
+
"image_alt": "Restored presentation photo showing the Human-video world models direction slide for Xperience-10M.",
|
| 3390 |
"intermediate_artifacts": [
|
| 3391 |
"observed and future window pairs",
|
| 3392 |
"future label targets",
|
|
|
|
| 3463 |
"diagram_image": "docs/assets/foundation-pipelines/vision-language-action-pipeline.png",
|
| 3464 |
"first_pipeline": "Define the action space, use existing 20-task next-action/contact/object-conditioned tasks first, then add hand-trajectory or policy-compatible action chunks after conversion is traceable.",
|
| 3465 |
"id": "vision_language_action",
|
| 3466 |
+
"image_alt": "Restored presentation photo showing the Vision-language-action models direction slide for Xperience-10M.",
|
| 3467 |
"intermediate_artifacts": [
|
| 3468 |
"action-token vocabulary",
|
| 3469 |
"action-chunk windows",
|
docs/data/scope_claims_audit.json
CHANGED
|
@@ -1,6 +1,6 @@
|
|
| 1 |
{
|
| 2 |
"status": "pass",
|
| 3 |
-
"generated_at_utc": "2026-06-
|
| 4 |
"summary": {
|
| 5 |
"qwen3_omni_verified_diagnostic_pilot": true,
|
| 6 |
"dataset_manifest_num_episodes": 119,
|
|
|
|
| 1 |
{
|
| 2 |
"status": "pass",
|
| 3 |
+
"generated_at_utc": "2026-06-18T08:27:01+00:00",
|
| 4 |
"summary": {
|
| 5 |
"qwen3_omni_verified_diagnostic_pilot": true,
|
| 6 |
"dataset_manifest_num_episodes": 119,
|