Upload folder using huggingface_hub
- .gitattributes +9 -0
- docs/assets/benchmarks/generation.webp +2 -2
- docs/assets/benchmarks/interleaved.webp +2 -2
- docs/assets/perform_vs_speed_avg3.png +3 -0
- docs/assets/perform_vs_speed_avg8.png +3 -0
- docs/assets/showcases/interleave/case_0001_makeup_three_looks.webp +3 -0
- docs/assets/showcases/interleave/case_0003_beachfront_villa.webp +3 -0
- docs/assets/showcases/interleave/case_0004_scented_candle_promo.webp +3 -0
- docs/assets/showcases/interleave/case_0005_matchgirl_warm_au.webp +3 -0
- docs/assets/showcases/interleave/case_0006_orange_cat_travel.webp +3 -0
- docs/assets/showcases/interleave/case_0007_bowie_slide_design.webp +3 -0
- docs/assets/showcases/vqa/agentic_case.webp +2 -2
- docs/assets/showcases/vqa/agentic_case_2.webp +2 -2
- docs/assets/teaser_1.png +2 -2
- docs/assets/teaser_2.png +3 -0
- docs/inference_infra.md +45 -31
- docs/inference_infra_CN.md +46 -32
- docs/showcases.md +7 -7
- docs/showcases_CN.md +8 -6
.gitattributes
CHANGED
|
@@ -164,3 +164,12 @@ docs/assets/showcases/vla/2.png filter=lfs diff=lfs merge=lfs -text
| 164 |
docs/assets/showcases/vla/3.mp4 filter=lfs diff=lfs merge=lfs -text
|
| 165 |
docs/assets/showcases/vla/3.png filter=lfs diff=lfs merge=lfs -text
|
| 166 |
docs/assets/teaser_1.png filter=lfs diff=lfs merge=lfs -text
|
| 167 |
+
docs/assets/perform_vs_speed_avg3.png filter=lfs diff=lfs merge=lfs -text
|
| 168 |
+
docs/assets/perform_vs_speed_avg8.png filter=lfs diff=lfs merge=lfs -text
|
| 169 |
+
docs/assets/showcases/interleave/case_0001_makeup_three_looks.webp filter=lfs diff=lfs merge=lfs -text
|
| 170 |
+
docs/assets/showcases/interleave/case_0003_beachfront_villa.webp filter=lfs diff=lfs merge=lfs -text
|
| 171 |
+
docs/assets/showcases/interleave/case_0004_scented_candle_promo.webp filter=lfs diff=lfs merge=lfs -text
|
| 172 |
+
docs/assets/showcases/interleave/case_0005_matchgirl_warm_au.webp filter=lfs diff=lfs merge=lfs -text
|
| 173 |
+
docs/assets/showcases/interleave/case_0006_orange_cat_travel.webp filter=lfs diff=lfs merge=lfs -text
|
| 174 |
+
docs/assets/showcases/interleave/case_0007_bowie_slide_design.webp filter=lfs diff=lfs merge=lfs -text
|
| 175 |
+
docs/assets/teaser_2.png filter=lfs diff=lfs merge=lfs -text
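The added `.gitattributes` entries route each new asset through Git LFS. As a minimal sketch (the helper name `lfs_tracked_paths` is hypothetical, and it assumes patterns contain no quoted whitespace), the tracked patterns can be listed like this:

```python
def lfs_tracked_paths(gitattributes_text: str) -> list[str]:
    """Return the path patterns whose attributes include filter=lfs."""
    tracked = []
    for raw_line in gitattributes_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # Assumption: the pattern has no spaces, so a plain split is enough.
        pattern, *attributes = line.split()
        if "filter=lfs" in attributes:
            tracked.append(pattern)
    return tracked


sample = """\
docs/assets/teaser_1.png filter=lfs diff=lfs merge=lfs -text
docs/assets/perform_vs_speed_avg3.png filter=lfs diff=lfs merge=lfs -text
docs/assets/teaser_2.png filter=lfs diff=lfs merge=lfs -text
"""

for path in lfs_tracked_paths(sample):
    print(path)
```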
|
docs/assets/benchmarks/generation.webp
CHANGED
docs/assets/benchmarks/interleaved.webp
CHANGED
docs/assets/perform_vs_speed_avg3.png
ADDED
docs/assets/perform_vs_speed_avg8.png
ADDED
docs/assets/showcases/interleave/case_0001_makeup_three_looks.webp
ADDED
docs/assets/showcases/interleave/case_0003_beachfront_villa.webp
ADDED
docs/assets/showcases/interleave/case_0004_scented_candle_promo.webp
ADDED
docs/assets/showcases/interleave/case_0005_matchgirl_warm_au.webp
ADDED
docs/assets/showcases/interleave/case_0006_orange_cat_travel.webp
ADDED
docs/assets/showcases/interleave/case_0007_bowie_slide_design.webp
ADDED
docs/assets/showcases/vqa/agentic_case.webp
CHANGED
docs/assets/showcases/vqa/agentic_case_2.webp
CHANGED
docs/assets/teaser_1.png
CHANGED
docs/assets/teaser_2.png
ADDED
docs/inference_infra.md
CHANGED
|
@@ -17,8 +17,8 @@ These two engines exchange generation state through pinned shared memory and hig
|
|
| 17 |
|
| 18 |
This design provides practical benefits in production:
|
| 19 |
|
| 20 |
-
- Independent parallelism (for example, understanding with `TP=2`, generation
|
| 21 |
-
with `CFG=2` or `SP=2`).
|
| 22 |
- Independent resource allocation (different GPU counts and memory budgets).
|
| 23 |
- Independent scaling for text-heavy vs. image-heavy traffic.
|
| 24 |
- Better operational isolation and simpler performance tuning.
|
|
@@ -32,7 +32,7 @@ In most production setups, `Separate` is the default choice because it gives cle
|
|
| 32 |
|
| 33 |
### Attention for Multimodal Prefill of NEO-Unify
|
| 34 |
|
| 35 |
-
NEO-Unify's prefill attention is not standard causal attention. Text tokens remain causal, while image tokens attend to the full text prefix together with the entire image span. To support this hybrid masking pattern, we modified both attention implementations in our stack: the Triton kernel and the official FA3 codebase. Our FA3 branch is available at [WANDY666/flash-attention](https://github.com/WANDY666/flash-attention).
|
| 36 |
|
| 37 |
Concretely, we introduced an optional image_token_tag argument that adjusts the mask row by row. Text rows keep the standard causal mask. Image rows, instead of using plain causal truncation, are allowed to attend to all preceding text tokens and all image tokens within the image span.
|
| 38 |
|
|
@@ -47,23 +47,27 @@ The benchmark below compares two implementations for Neo-style multimodal prefil
|
|
| 47 |
integration cost and faster iteration.
|
| 48 |
- **FA3 implementation**: higher absolute performance on supported hardware.
|
| 49 |
|
| 67 |
|
| 68 |
|
| 69 |
### Deployment
|
|
@@ -76,14 +80,19 @@ see [`deployment.md`](./deployment.md).
|
|
| 76 |
|
| 77 |
The table below is the benchmark template for **2048x2048** image generation.
|
| 78 |
Fill in measured numbers for each machine and deployment profile.
|
|
|
|
| 79 |
|
| 82 |
| H100 | TP2+CFG2 / colocate | 0.158 | 9.23 |
|
| 83 |
| H200 | TP2+CFG2 / colocate | 0.152 | 9.54 |
|
| 84 |
| 5090 | TP2+CFG2 / separate | 0.415 | 23.04 |
|
| 85 |
| L40S | TP2+CFG2 / separate | 0.443 | 25.62 |
|
| 86 |
|
|
|
|
|
|
|
| 87 |
In NEO-Unify, the KV cache for the generation stage is provided by the understanding module, so T2I (generation) and I2I (editing) have very similar runtime characteristics. For brevity, we report only T2I latency here.
|
| 88 |
|
| 89 |
|
|
@@ -93,13 +102,18 @@ The table below compares the latency of a single diffusion step for
|
|
| 93 |
**2048x2048** image generation with **CFG enabled**. Unless otherwise noted,
|
| 94 |
all measurements are taken on **H100**; the `NEO-Unify (TP2+CFG2)` result uses
|
| 95 |
`2x H100`.
|
| 105 |
| NEO-Unify (TP2+CFG2) | 8B | 8B | 0.158 |
|
|
|
|
|
|
|
|
|
| 17 |
|
| 18 |
This design provides practical benefits in production:
|
| 19 |
|
| 20 |
+
- Independent parallelism (for example, understanding with `TP=2` (Tensor Parallel=2), generation
|
| 21 |
+
with `CFG=2` (CFG Parallel=2) or `SP=2` (Sequence Parallel=2)).
|
| 22 |
- Independent resource allocation (different GPU counts and memory budgets).
|
| 23 |
- Independent scaling for text-heavy vs. image-heavy traffic.
|
| 24 |
- Better operational isolation and simpler performance tuning.
|
|
|
|
| 32 |
|
| 33 |
### Attention for Multimodal Prefill of NEO-Unify
|
| 34 |
|
| 35 |
+
NEO-Unify's prefill attention is not standard causal attention. Text tokens remain causal, while image tokens attend to the full text prefix together with the entire image span. To support this hybrid masking pattern, we modified both attention implementations in our stack: the Triton kernel and the official FlashAttention3 (FA3) codebase. Our FA3 branch is available at [WANDY666/flash-attention](https://github.com/WANDY666/flash-attention).
|
| 36 |
|
| 37 |
Concretely, we introduced an optional image_token_tag argument that adjusts the mask row by row. Text rows keep the standard causal mask. Image rows, instead of using plain causal truncation, are allowed to attend to all preceding text tokens and all image tokens within the image span.
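The row-by-row masking rule described above can be sketched in plain NumPy. This is only an illustration, not the Triton or FA3 kernel: the function name `hybrid_prefill_mask` and the tag encoding (0 for text tokens, a positive span id for image tokens) are assumptions made for the example, and earlier image spans are assumed to stay visible through the causal part of the mask.

```python
import numpy as np


def hybrid_prefill_mask(image_token_tag):
    """Build a boolean attention mask (True = query row may attend to key column).

    Assumed encoding (hypothetical): 0 marks a text token, and a positive
    integer marks an image token and identifies its image span.
    """
    tag = np.asarray(image_token_tag)
    n = tag.shape[0]
    # Standard causal mask: every row sees itself and all earlier positions.
    causal = np.tril(np.ones((n, n), dtype=bool))
    # Image rows additionally see every token of their own image span,
    # including positions that come later in the sequence.
    same_span = (tag[:, None] == tag[None, :]) & (tag[:, None] > 0)
    return causal | same_span


# Two text tokens, a three-token image span, then one more text token.
tags = [0, 0, 1, 1, 1, 0]
mask = hybrid_prefill_mask(tags)
print(mask.astype(int))
```

Text rows reduce to the usual lower-triangular pattern, while the three image rows form a fully visible block over their own span in addition to the text prefix.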
|
| 38 |
|
|
|
|
| 47 |
integration cost and faster iteration.
|
| 48 |
- **FA3 implementation**: higher absolute performance on supported hardware.
|
| 49 |
|
| 50 |
+
<div align="center">
|
| 51 |
+
|
| 52 |
+
| batch | max_seq_len | image_token_num | triton (ms) | FA3 (ms) | speedup (×) |
|
| 53 |
+
|:-------:|:-----------:|:---------------:|:-----------:|:--------:|:-----------:|
|
| 54 |
+
| 8 | 4096 | 88 | 1.95 | 0.81 | **2.41×** |
|
| 55 |
+
| 8 | 8192 | 171 | 6.55 | 2.68 | **2.45×** |
|
| 56 |
+
| 8 | 65536 | 150 | 43.30 | 14.95 | **2.90×** |
|
| 57 |
+
| 16 | 4096 | 379 | 4.12 | 1.68 | **2.46×** |
|
| 58 |
+
| 16 | 8192 | 246 | 17.76 | 7.40 | **2.40×** |
|
| 59 |
+
| 16 | 65536 | 206 | 107.74 | 33.66 | **3.20×** |
|
| 60 |
+
| 32 | 4096 | 726 | 8.46 | 3.46 | **2.44×** |
|
| 61 |
+
| 32 | 8192 | 536 | 31.74 | 13.24 | **2.40×** |
|
| 62 |
+
| 32 | 65536 | 417 | 171.00 | 58.26 | **2.94×** |
|
| 63 |
+
| 64 | 4096 | 1170 | 16.08 | 6.88 | **2.34×** |
|
| 64 |
+
| 64 | 8192 | 1177 | 55.48 | 22.91 | **2.42×** |
|
| 65 |
+
| 64 | 65536 | 1291 | 348.89 | 124.82 | **2.80×** |
|
| 66 |
+
| 128 | 4096 | 2057 | 30.89 | 12.53 | **2.47×** |
|
| 67 |
+
| 128 | 8192 | 2196 | 104.73 | 43.22 | **2.42×** |
|
| 68 |
+
| 128 | 65536 | 2205 | 706.60 | 241.67 | **2.92×** |
|
| 69 |
+
|
| 70 |
+
</div>
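The speedup column is the ratio of Triton latency to FA3 latency. A short sketch that recomputes it for three rows copied from the table (the tuple layout and function name are illustrative choices; because the published latencies are already rounded, some other rows may differ from the table in the last digit):

```python
# (batch, max_seq_len, triton_ms, fa3_ms) copied from the benchmark table.
rows = [
    (8, 4096, 1.95, 0.81),
    (16, 65536, 107.74, 33.66),
    (128, 65536, 706.60, 241.67),
]


def speedup(triton_ms: float, fa3_ms: float) -> float:
    """Speedup of FA3 over Triton: how many times faster FA3 finishes."""
    return triton_ms / fa3_ms


for batch, seq_len, triton_ms, fa3_ms in rows:
    print(f"batch={batch:<4d} max_seq_len={seq_len:<6d} "
          f"speedup={speedup(triton_ms, fa3_ms):.2f}x")
```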
|
| 71 |
|
| 72 |
|
| 73 |
### Deployment
|
|
|
|
| 80 |
|
| 81 |
The table below is the benchmark template for **2048x2048** image generation.
|
| 82 |
Fill in measured numbers for each machine and deployment profile.
|
| 83 |
+
Note: TP2+CFG2 means Tensor Parallel=2 + CFG Parallel=2.
|
| 84 |
|
| 85 |
+
<div align="center">
|
| 86 |
+
|
| 87 |
+
| GPU | Deployment Config | Per-step Latency (s/step) | End-to-end Latency (s) |
|
| 88 |
+
|:----:|:-----------------:|:-------------------------:|:----------------------:|
|
| 89 |
| H100 | TP2+CFG2 / colocate | 0.158 | 9.23 |
|
| 90 |
| H200 | TP2+CFG2 / colocate | 0.152 | 9.54 |
|
| 91 |
| 5090 | TP2+CFG2 / separate | 0.415 | 23.04 |
|
| 92 |
| L40S | TP2+CFG2 / separate | 0.443 | 25.62 |
|
| 93 |
|
| 94 |
+
</div>
|
| 95 |
+
|
| 96 |
In NEO-Unify, the KV cache for the generation stage is provided by the understanding module, so T2I (generation) and I2I (editing) have very similar runtime characteristics. For brevity, we report only T2I latency here.
|
| 97 |
|
| 98 |
|
|
|
|
| 102 |
**2048x2048** image generation with **CFG enabled**. Unless otherwise noted,
|
| 103 |
all measurements are taken on **H100**; the `NEO-Unify (TP2+CFG2)` result uses
|
| 104 |
`2x H100`.
|
| 105 |
+
Note: TP2+CFG2 means Tensor Parallel=2 + CFG Parallel=2.
|
| 106 |
+
|
| 107 |
+
<div align="center">
|
| 108 |
+
|
| 109 |
+
| Model | Understanding | Generation | Per-step latency (s/step) |
|
| 110 |
+
|:-----------------:|:-------------:|:----------:|:-------------------------:|
|
| 111 |
+
| Qwen-Image-2512 | 7B | 20B | 1.478 |
|
| 112 |
+
| Z-Image | 4B | 6B | 1.110 |
|
| 113 |
+
| GLM-Image | 9B | 7B | 1.394 |
|
| 114 |
+
| ERNIE-Image | 8B | 8B | 1.565 |
|
| 115 |
+
| LongCat-Image | 8B | 6B | 0.796 |
|
| 116 |
+
| NEO-Unify (1x, no TP/CFG parallelism) | 8B | 8B | 0.312 |
|
| 117 |
| NEO-Unify (TP2+CFG2) | 8B | 8B | 0.158 |
|
| 118 |
+
|
| 119 |
+
</div>
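To read the comparison as ratios, each per-step latency can be divided by the `NEO-Unify (TP2+CFG2)` figure. The sketch below only rearranges the numbers from the table; the dictionary and function names are illustrative. Note that the TP2+CFG2 result uses `2x H100`, so these ratios compare wall-clock latency per step, not per-GPU efficiency.

```python
# Per-step latency in seconds, copied from the comparison table.
per_step_latency_s = {
    "Qwen-Image-2512": 1.478,
    "Z-Image": 1.110,
    "GLM-Image": 1.394,
    "ERNIE-Image": 1.565,
    "LongCat-Image": 0.796,
    "NEO-Unify (1x, no TP/CFG parallelism)": 0.312,
    "NEO-Unify (TP2+CFG2)": 0.158,
}


def relative_latency(model: str, baseline: str = "NEO-Unify (TP2+CFG2)") -> float:
    """Return how many times slower `model` is per step than `baseline`."""
    return per_step_latency_s[model] / per_step_latency_s[baseline]


for name in per_step_latency_s:
    print(f"{name:<40s} {relative_latency(name):5.2f}x")
```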
|
docs/inference_infra_CN.md
CHANGED
|
@@ -17,7 +17,7 @@ SenseNova-U1 is presented externally as a single unified multimodal model, but in actual production
|
|
| 17 |
|
| 18 |
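This design provides practical benefits in production:
|
| 19 |
|
| 20 |
-
- Independent parallelism (for example, understanding with `TP=2`, generation with `CFG=2` or `SP=2`);
|
| 21 |
- Independent resource allocation (different GPU counts and memory budgets);
|
| 22 |
- Independent elastic scaling for text-heavy and image-heavy traffic;
|
| 23 |
- Clearer operational isolation and simpler performance tuning.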
|
|
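@@ -31,7 +31,7 @@ SenseNova-U1 is presented externally as a single unified multimodal model, but in actual production
|
|
| 31 |
|
| 32 |
### Attention for Multimodal Prefill of NEO-Unify
|
| 33 |
|
| 34 |
-
NEO-Unify's prefill attention is not standard causal attention: text tokens remain causal, while image tokens attend to the full text prefix together with the entire image span. To support this hybrid mask, we modified both attention implementations in our stack: the Triton kernel and the official FA3 codebase. Our FA3 branch is available at [WANDY666/flash-attention](https://github.com/WANDY666/flash-attention).
|
| 35 |
|
| 36 |
Concretely, we added an optional image_token_tag argument that adjusts the mask row by row: text rows keep the standard causal mask; image rows, instead of plain causal truncation, are allowed to attend to all preceding text tokens and all image tokens within their image span.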
|
| 37 |
|
|
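@@ -46,23 +46,27 @@ NEO-Unify's prefill attention is not standard causal attention: text tokens remain
|
|
| 46 |
- **Triton implementation**: easier to port to existing codebases, with lower integration cost and faster iteration;
|
| 47 |
- **FA3 implementation**: higher absolute performance on supported hardware.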
|
| 48 |
|
| 66 |
|
| 67 |
|
| 68 |
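### Deployment
|
|
@@ -73,26 +77,36 @@ For a concise guide to Docker images, launch commands, and API testing, see [`de
|
|
| 73 |
### Generation Performance
|
| 74 |
|
| 75 |
The table below is the benchmark template for **2048x2048** image generation, listing measured numbers for different machines and deployment configurations.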
|
|
|
|
| 76 |
|
| 79 |
| H100 | TP2+CFG2 / colocate | 0.158 | 9.23 |
|
| 80 |
| H200 | TP2+CFG2 / colocate | 0.152 | 9.54 |
|
| 81 |
| 5090 | TP2+CFG2 / separate | 0.415 | 23.04 |
|
| 82 |
| L40S | TP2+CFG2 / separate | 0.443 | 25.62 |
|
| 83 |
|
|
|
|
|
|
|
| 84 |
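In NEO-Unify, the KV cache used by the generation stage is provided by the understanding module, so T2I (text-to-image) and I2I (image editing) have nearly identical runtime characteristics. For brevity, only T2I latency is reported here.
|
| 85 |
|
| 86 |
### Cross-Model Speed Comparison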
|
| 87 |
|
| 88 |
-
The table below compares, when enabling
|
| 17 |
|
| 18 |
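This design provides practical benefits in production:
|
| 19 |
|
| 20 |
+
- Independent parallelism (for example, understanding with `TP=2` (Tensor Parallel=2), generation with `CFG=2` (CFG Parallel=2) or `SP=2` (Sequence Parallel=2));
|
| 21 |
- Independent resource allocation (different GPU counts and memory budgets);
|
| 22 |
- Independent elastic scaling for text-heavy and image-heavy traffic;
|
| 23 |
- Clearer operational isolation and simpler performance tuning.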
|
|
|
|
| 31 |
|
| 32 |
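### Attention for Multimodal Prefill of NEO-Unify
|
| 33 |
|
| 34 |
+
NEO-Unify's prefill attention is not standard causal attention: text tokens remain causal, while image tokens attend to the full text prefix together with the entire image span. To support this hybrid mask, we modified both attention implementations in our stack: the Triton kernel and the official FlashAttention3 (FA3) codebase. Our FA3 branch is available at [WANDY666/flash-attention](https://github.com/WANDY666/flash-attention).
|
| 35 |
|
| 36 |
Concretely, we added an optional image_token_tag argument that adjusts the mask row by row: text rows keep the standard causal mask; image rows, instead of plain causal truncation, are allowed to attend to all preceding text tokens and all image tokens within their image span.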
|
| 37 |
|
|
|
|
| 46 |
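- **Triton implementation**: easier to port to existing codebases, with lower integration cost and faster iteration;
|
| 47 |
- **FA3 implementation**: higher absolute performance on supported hardware.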
|
| 48 |
|
| 49 |
+
<div align="center">
|
| 50 |
+
|
| 51 |
+
| batch | max_seq_len | image_token_num | triton (ms) | FA3 (ms) | speedup (×) |
|
| 52 |
+
|:-------:|:-----------:|:---------------:|:-----------:|:--------:|:----------:|
|
| 53 |
+
| 8 | 4096 | 88 | 1.95 | 0.81 | **2.41×** |
|
| 54 |
+
| 8 | 8192 | 171 | 6.55 | 2.68 | **2.45×** |
|
| 55 |
+
| 8 | 65536 | 150 | 43.30 | 14.95 | **2.90×** |
|
| 56 |
+
| 16 | 4096 | 379 | 4.12 | 1.68 | **2.46×** |
|
| 57 |
+
| 16 | 8192 | 246 | 17.76 | 7.40 | **2.40×** |
|
| 58 |
+
| 16 | 65536 | 206 | 107.74 | 33.66 | **3.20×** |
|
| 59 |
+
| 32 | 4096 | 726 | 8.46 | 3.46 | **2.44×** |
|
| 60 |
+
| 32 | 8192 | 536 | 31.74 | 13.24 | **2.40×** |
|
| 61 |
+
| 32 | 65536 | 417 | 171.00 | 58.26 | **2.94×** |
|
| 62 |
+
| 64 | 4096 | 1170 | 16.08 | 6.88 | **2.34×** |
|
| 63 |
+
| 64 | 8192 | 1177 | 55.48 | 22.91 | **2.42×** |
|
| 64 |
+
| 64 | 65536 | 1291 | 348.89 | 124.82 | **2.80×** |
|
| 65 |
+
| 128 | 4096 | 2057 | 30.89 | 12.53 | **2.47×** |
|
| 66 |
+
| 128 | 8192 | 2196 | 104.73 | 43.22 | **2.42×** |
|
| 67 |
+
| 128 | 65536 | 2205 | 706.60 | 241.67 | **2.92×** |
|
| 68 |
+
|
| 69 |
+
</div>
|
| 70 |
|
| 71 |
|
| 72 |
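### Deployment
|
|
|
|
| 77 |
### Generation Performance
|
| 78 |
|
| 79 |
The table below is the benchmark template for **2048x2048** image generation, listing measured numbers for different machines and deployment configurations.
|
| 80 |
+
Note: TP2+CFG2 means Tensor Parallel=2 + CFG Parallel=2.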
|
| 81 |
|
| 82 |
+
<div align="center">
|
| 83 |
+
|
| 84 |
+
| GPU | Deployment Config | Per-step Latency (s/step) | End-to-end Latency (s) |
|
| 85 |
+
|:----:|:--------:|:-----------------:|:--------------:|
|
| 86 |
| H100 | TP2+CFG2 / colocate | 0.158 | 9.23 |
|
| 87 |
| H200 | TP2+CFG2 / colocate | 0.152 | 9.54 |
|
| 88 |
| 5090 | TP2+CFG2 / separate | 0.415 | 23.04 |
|
| 89 |
| L40S | TP2+CFG2 / separate | 0.443 | 25.62 |
|
| 90 |
|
| 91 |
+
</div>
|
| 92 |
+
|
| 93 |
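In NEO-Unify, the KV cache used by the generation stage is provided by the understanding module, so T2I (text-to-image) and I2I (image editing) have nearly identical runtime characteristics. For brevity, only T2I latency is reported here.
|
| 94 |
|
| 95 |
### Cross-Model Speed Comparison
|
| 96 |
|
| 97 |
+
The table below compares the latency of a single diffusion step for **2048x2048** image generation with **CFG enabled**. Unless otherwise noted, all measurements are taken on **H100**; the `NEO-Unify (TP2+CFG2)` result uses `2x H100`.
|
| 98 |
+
Note: TP2+CFG2 means Tensor Parallel=2 + CFG Parallel=2.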
|
| 99 |
+
|
| 100 |
+
<div align="center">
|
| 101 |
+
|
| 102 |
+
| Model | Understanding | Generation | Per-step latency (s/step) |
|
| 103 |
+
|:-------------------------:|:--------:|:--------:|:-----------------:|
|
| 104 |
+
| Qwen-Image-2512 | 7B | 20B | 1.478 |
|
| 105 |
+
| Z-Image | 4B | 6B | 1.110 |
|
| 106 |
+
| GLM-Image | 9B | 7B | 1.394 |
|
| 107 |
+
| ERNIE-Image | 8B | 8B | 1.565 |
|
| 108 |
+
| LongCat-Image | 8B | 6B | 0.796 |
|
| 109 |
+
| NEO-Unify (1x, no TP/CFG parallelism) | 8B | 8B | 0.312 |
|
| 110 |
+
| NEO-Unify (TP2+CFG2) | 8B | 8B | 0.158 |
|
| 111 |
+
|
| 112 |
+
</div>
|
docs/showcases.md
CHANGED
|
@@ -215,15 +215,17 @@ answer.
|
|
| 215 |
|
| 216 |
Reproducible prompts are in
|
| 217 |
[`examples/interleave/data/samples.jsonl`](../examples/interleave/data/samples.jsonl).
|
|
|
|
| 218 |
|
| 219 |
|
| 220 |
| |
|
| 221 |
| :---: |
|
| 222 |
-
| [<img alt="interleave case
|
| 223 |
-
| [<img alt="interleave case
|
| 224 |
-
| [<img alt="interleave case
|
| 225 |
-
| [<img alt="interleave case
|
| 226 |
-
| [<img alt="interleave case
|
|
|
|
| 227 |
|
| 228 |
|
| 229 |
#### ♻️ *Interleaved Generation (Reasoning)*
|
|
@@ -252,8 +254,6 @@ Reproducible prompts are in [`examples/vqa/data/samples.jsonl`](../examples/vqa/
|
|
| 252 |
|
| 253 |
#### 📝 *Visual Understanding (Agentic)*
|
| 254 |
|
| 255 |
-
Reproducible prompts are in [`examples/vqa/data/samples_agentic.jsonl`](../examples/vqa/data/samples_agentic.jsonl).
|
| 256 |
-
|
| 257 |
| |
|
| 258 |
| :---: |
|
| 259 |
| [<img alt="vqa agentic case" src="./assets/showcases/vqa/agentic_case.webp">](./assets/showcases/vqa/agentic_case.webp) |
|
|
|
|
| 215 |
|
| 216 |
Reproducible prompts are in
|
| 217 |
[`examples/interleave/data/samples.jsonl`](../examples/interleave/data/samples.jsonl).
|
| 218 |
+
All examples are generated with think-mode reasoning; the chain-of-thought is omitted in some cases for cleaner visualization.
|
| 219 |
|
| 220 |
|
| 221 |
| |
|
| 222 |
| :---: |
|
| 223 |
+
| [<img alt="interleave case 03" src="./assets/showcases/interleave/case_0003_beachfront_villa.webp">](./assets/showcases/interleave/case_0003_beachfront_villa.webp) |
|
| 224 |
+
| [<img alt="interleave case 04" src="./assets/showcases/interleave/case_0004_scented_candle_promo.webp">](./assets/showcases/interleave/case_0004_scented_candle_promo.webp) |
|
| 225 |
+
| [<img alt="interleave case 05" src="./assets/showcases/interleave/case_0005_matchgirl_warm_au.webp">](./assets/showcases/interleave/case_0005_matchgirl_warm_au.webp) |
|
| 226 |
+
| [<img alt="interleave case 06" src="./assets/showcases/interleave/case_0006_orange_cat_travel.webp">](./assets/showcases/interleave/case_0006_orange_cat_travel.webp) |
|
| 227 |
+
| [<img alt="interleave case 01" src="./assets/showcases/interleave/case_0001_makeup_three_looks.webp">](./assets/showcases/interleave/case_0001_makeup_three_looks.webp) |
|
| 228 |
+
| [<img alt="interleave case 07" src="./assets/showcases/interleave/case_0007_bowie_slide_design.webp">](./assets/showcases/interleave/case_0007_bowie_slide_design.webp) |
|
| 229 |
|
| 230 |
|
| 231 |
#### ♻️ *Interleaved Generation (Reasoning)*
|
|
|
|
| 254 |
|
| 255 |
#### 📝 *Visual Understanding (Agentic)*
|
| 256 |
|
|
|
|
|
|
|
| 257 |
| |
|
| 258 |
| :---: |
|
| 259 |
| [<img alt="vqa agentic case" src="./assets/showcases/vqa/agentic_case.webp">](./assets/showcases/vqa/agentic_case.webp) |
|
docs/showcases_CN.md
CHANGED
|
@@ -197,16 +197,18 @@
|
|
| 197 |
|
| 198 |
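Each case below is one complete response from `model.interleave_gen`: the model first generates several intermediate images inside a `<think>...</think>` reasoning block, then outputs the final interleaved image-text answer.
|
| 199 |
|
| 200 |
-
Reproducible prompts are in [`examples/interleave/data/samples.jsonl`](../examples/interleave/data/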
|
|
|
|
| 201 |
|
| 202 |
|
| 203 |
| |
|
| 204 |
| :---: |
|
| 205 |
-
| [<img alt="interleave case
|
| 206 |
-
| [<img alt="interleave case
|
| 207 |
-
| [<img alt="interleave case
|
| 208 |
-
| [<img alt="interleave case
|
| 209 |
-
| [<img alt="interleave case
|
|
|
|
| 210 |
|
| 211 |
---
|
| 212 |
|
|
|
|
| 197 |
|
| 198 |
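Each case below is one complete response from `model.interleave_gen`: the model first generates several intermediate images inside a `<think>...</think>` reasoning block, then outputs the final interleaved image-text answer.
|
| 199 |
|
| 200 |
+
Reproducible prompts are in [`examples/interleave/data/samples.jsonl`](../examples/interleave/data/samples.jsonl).
|
| 201 |
+
All examples are generated with think-mode reasoning; the chain-of-thought is omitted in some cases for cleaner visualization.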
|
| 202 |
|
| 203 |
|
| 204 |
| |
|
| 205 |
| :---: |
|
| 206 |
+
| [<img alt="interleave case 03" src="./assets/showcases/interleave/case_0003_beachfront_villa.webp">](./assets/showcases/interleave/case_0003_beachfront_villa.webp) |
|
| 207 |
+
| [<img alt="interleave case 04" src="./assets/showcases/interleave/case_0004_scented_candle_promo.webp">](./assets/showcases/interleave/case_0004_scented_candle_promo.webp) |
|
| 208 |
+
| [<img alt="interleave case 05" src="./assets/showcases/interleave/case_0005_matchgirl_warm_au.webp">](./assets/showcases/interleave/case_0005_matchgirl_warm_au.webp) |
|
| 209 |
+
| [<img alt="interleave case 06" src="./assets/showcases/interleave/case_0006_orange_cat_travel.webp">](./assets/showcases/interleave/case_0006_orange_cat_travel.webp) |
|
| 210 |
+
| [<img alt="interleave case 01" src="./assets/showcases/interleave/case_0001_makeup_three_looks.webp">](./assets/showcases/interleave/case_0001_makeup_three_looks.webp) |
|
| 211 |
+
| [<img alt="interleave case 07" src="./assets/showcases/interleave/case_0007_bowie_slide_design.webp">](./assets/showcases/interleave/case_0007_bowie_slide_design.webp) |
|
| 212 |
|
| 213 |
---
|
| 214 |
|