TokenWang committed on
Commit
ebf17aa
·
verified ·
1 Parent(s): a1096ae

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +162 -37
README.md CHANGED
@@ -1,7 +1,7 @@
1
  # SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-Unify Architecture
2
 
3
  <p align="center">
4
- <strong>English</strong> | <a href="./README_CN.md">简体中文</a>
5
  </p>
6
 
7
  <p align="center">
@@ -15,34 +15,71 @@
15
  <p align="center">
16
  <img src="docs/assets/teaser.png" alt="SenseNova-U1" width="900">
17
  </p>
 
 
 
18
 
19
  ## 🌟 Overview
20
 
21
- 🚀 **SenseNova-U1**, a native unified paradigm (based on **[NEO-Unify](https://huggingface.co/blog/sensenova/neo-unify)**) where models no longer translate between modalities, but think and act across them natively.
22
- Multimodal AI is no longer about connecting separate systems, but about building a unified one and trusting the necessary capabilities to emerge from within.
 
 
 
 
 
 
 
 
23
 
 
24
 
25
- #### 🏗️ *Key Pillars :*
 
 
26
 
27
- - 🖼️ Near-Lossless Visual Interface: Preserving semantic richness + pixel fidelity (no VAEs or Vision Encoders) !
28
 
29
- - 🧠 Native Mixture-of-Transformers: Modality-agnostic reasoning with high efficiency and minimal conflict !
30
 
31
- - 🔗 Unified End-to-End Learning: Modeling directly on pixels + text from the first principles !
 
 
 
 
32
 
 
33
 
34
- #### 🌍 *Beyond Multimodality :*
35
 
36
- - 🤖 Vision–Language–Action (VLA)
37
 
 
38
  - 🌐 World Modeling (WM)
39
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
40
 
41
  ## 📣 Updated News
42
 
43
- - `[2026.04.23]` Initial release of the weights for [SenseNova-U1-Mini-SFT](https://huggingface.co/sensenova/SenseNova-U1-Mini-Beta) and [SenseNova-U1-Mini-Beta](https://huggingface.co/sensenova/SenseNova-U1-Mini-Beta).
44
 
45
- - `[2026.04.23]` Initial release of the [inference code](https://github.com/OpenSenseNova/SenseNova-U1/blob/main/examples/README.md) for SenseNova-U1.
46
 
47
  ## 📋 ToDo List
48
 
@@ -51,17 +88,6 @@ Multimodal AI is no longer about connecting separate systems, but about building
51
  - [ ] Final weights and technical report of SenseNova-U1
52
 
53
 
54
- ## 🦁 Model Zoo
55
-
56
- | Model | Params | HF Weights |
57
- | :---- | :------- | :--------- |
58
- | SenseNova-U1-Mini-SFT | 8B MoT | [🤗 link](https://huggingface.co/sensenova/SenseNova-U1-Mini-SFT) |
59
- | SenseNova-U1-Mini-Beta | 8B MoT | [🤗 link](https://huggingface.co/sensenova/SenseNova-U1-Mini-Beta) |
60
- | SenseNova-U1-Flash-SFT | A3B MoT | 🤗 link |
61
- | SenseNova-U1-Flash-Beta | A3B MoT | 🤗 link |
62
-
63
- Note that the **SFT models** are trained in four stages: (1) *Understanding Warmup*, (2) *Generation Pre-training*, (3) *Unified Mid-training*, and (4) *Unified Supervised Fine-tuning*. The **Beta models** are obtained from the base model following an initial round of T2I reinforcement learning (RL) training.
64
-
65
  ## 🎨 Showcases
66
 
67
  <details>
@@ -96,6 +122,16 @@ Note that the **SFT models** are trained in four stages: (1) *Understanding Warm
96
  <td><div style="max-height: 200px; overflow-y: auto;">1. <b>Instruction Understanding:</b> The core subjects are a small piece of dry wood and a dense iron block positioned within a transparent water tank. The wood is floating on the surface while the iron block is submerged at the bottom. 2. <b>Reasoning Process:</b> The wood is less dense than water so it will float, while the iron is denser and will sink to the bottom. 3. <b>Establish the frame:</b> The composition is a vertical medium shot centering the rectangular tank within the frame. The camera angle is eye-level to clearly display the water line and the submerged base. Focus is sharp across the entire depth of the tank to ensure both materials are distinct. 4. <b>Build the environment:</b> The scene is contained entirely within the clear glass walls of the water tank. The water fills the majority of the volume, providing a medium for the floating wood and sunken iron block. The background remains out of focus to keep attention on the tank's interior. 5. <b>Set the lighting and color:</b> Soft natural light illuminates the scene from the left, creating gentle reflections on the water surface. The color palette features the brown grain of the wood contrasting against the dark grey metallic finish of the iron. Shadows are soft and diffused through the liquid. 6. <b>Explicit Prompt:</b> A realistic photo of a transparent water tank showing a piece of wood floating on the surface and an iron block resting at the bottom.</div></td>
97
  <td style="vertical-align: top;"><img src="./docs/assets/showcases/t2i_reasoning/5_reasoning.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
98
  </tr>
 
 
 
 
 
 
 
 
 
 
99
  </table>
100
 
101
  </details>
@@ -103,11 +139,44 @@ Note that the **SFT models** are trained in four stages: (1) *Understanding Warm
103
  <details>
104
  <summary>🖼️ Text-to-Image (Infographics)</summary>
105
 
106
- | | | |
107
- | :---: | :---: | :---: |
108
- | [<img width="300" alt="t2i landscape 0001" src="./docs/assets/showcases/t2i_infographic/0001_2720x1536.webp">](./docs/assets/showcases/t2i_infographic/0001_2720x1536.webp) | [<img width="300" alt="t2i landscape 0002" src="./docs/assets/showcases/t2i_infographic/0002_2720x1536.webp">](./docs/assets/showcases/t2i_infographic/0002_2720x1536.webp) | [<img width="300" alt="t2i landscape 0003" src="./docs/assets/showcases/t2i_infographic/0003_2720x1536.webp">](./docs/assets/showcases/t2i_infographic/0003_2720x1536.webp) |
109
- | [<img width="300" alt="t2i square 0004" src="./docs/assets/showcases/t2i_infographic/0004_2048x2048.webp">](./docs/assets/showcases/t2i_infographic/0004_2048x2048.webp) | [<img width="300" alt="t2i square 0005" src="./docs/assets/showcases/t2i_infographic/0005_2048x2048.webp">](./docs/assets/showcases/t2i_infographic/0005_2048x2048.webp) | [<img width="300" alt="t2i square 0006" src="./docs/assets/showcases/t2i_infographic/0006_2048x2048.webp">](./docs/assets/showcases/t2i_infographic/0006_2048x2048.webp) |
110
- | [<img width="200" alt="t2i portrait 0007" src="./docs/assets/showcases/t2i_infographic/0007_1536x2720.webp">](./docs/assets/showcases/t2i_infographic/0007_1536x2720.webp) | [<img width="200" alt="t2i portrait 0008" src="./docs/assets/showcases/t2i_infographic/0008_1536x2720.webp">](./docs/assets/showcases/t2i_infographic/0008_1536x2720.webp) | [<img width="200" alt="t2i portrait 0009" src="./docs/assets/showcases/t2i_infographic/0009_1536x2720.webp">](./docs/assets/showcases/t2i_infographic/0009_1536x2720.webp) |
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
111
 
112
  </details>
113
 
@@ -171,7 +240,7 @@ Note that the **SFT models** are trained in four stages: (1) *Understanding Warm
171
  > 📸 **More editing samples:** see [Image Editing Gallery](./docs/showcases.md#image-editing).
172
 
173
  <details>
174
- <summary>♻️ Interleaved Generation</summary>
175
 
176
  | |
177
  | :---: |
@@ -180,21 +249,50 @@ Note that the **SFT models** are trained in four stages: (1) *Understanding Warm
180
 
181
  </details>
182
 
 
 
 
 
 
 
 
 
 
 
183
  > 📸 **More interleaved samples:** see [Interleaved Generation Gallery](./docs/showcases.md#interleaved-generation).
184
 
185
  <details>
186
- <summary>📝 Visual Understanding</summary>
187
 
188
  | |
189
  | :---: |
190
- | [<img alt="vqa agentic case" src="./docs/assets/showcases/vqa/agentic_case.webp">](./docs/assets/showcases/vqa/agentic_case.webp) |
191
  | [<img alt="vqa general cases" src="./docs/assets/showcases/vqa/general_case.webp">](./docs/assets/showcases/vqa/general_case.webp) |
192
 
 
 
 
 
 
 
 
 
 
 
193
  </details>
194
 
195
  > 📸 **More understanding samples:** see [Visual Understanding Gallery](./docs/showcases.md#visual-understanding).
196
 
197
 
 
 
 
 
 
 
 
 
 
 
198
  ## 📊 Key Benchmarks
199
 
200
  <details>
@@ -213,7 +311,6 @@ Note that the **SFT models** are trained in four stages: (1) *Understanding Warm
213
  <img src="docs/assets/benchmarks/generation.webp" alt="Generation Benchmarks">
214
  </p>
215
 
216
-
217
  </details>
218
 
219
  <details>
@@ -242,8 +339,19 @@ The easiest way to integrate SenseNova-U1 into your own agent or application is
242
 
243
  > Refer to the [SenseNova-Skills README](https://github.com/OpenSenseNova/SenseNova-Skills) for installation and usage details.
244
 
 
 
245
 
246
- ### 🤗 Run with transformers
 
 
 
 
 
 
 
 
 
247
 
248
  > **Setup:** Follow the [Installation Guide](./docs/installation.md) to clone the repo and install dependencies with uv.
249
 
@@ -251,7 +359,7 @@ The easiest way to integrate SenseNova-U1 into your own agent or application is
251
  <summary>📝 Visual Understanding</summary>
252
 
253
  ```bash
254
- python examples/vqa/inference.py --model_path SenseNova/SenseNova-U1-Mini-Beta --image examples/vqa/data/images/menu.jpg --question "My friend and I are dining together tonight. Looking at this menu, can you recommend a good combination of dishes for 2 people? We want a balanced meal — a mix of mains and maybe a starter or dessert. Budget-conscious but want to try the highlights." --output outputs/answer.txt --max_new_tokens 8192 --do_sample --temperature 0.6 --top_p 0.95 --top_k 20 --repetition_penalty 1.05 --profile
255
  ```
256
 
257
  </details>
@@ -262,7 +370,7 @@ python examples/vqa/inference.py --model_path SenseNova/SenseNova-U1-Mini-Beta -
262
  <summary>🖼️ Text-to-Image</summary>
263
 
264
  ```bash
265
- python examples/t2i/inference.py --model_path SenseNova/SenseNova-U1-Mini-Beta --prompt "这张信息图的标题是“SenseNova-U1”,采用代极简科技矩阵风格。整体布局为水平三列网格结构,背景是带有极浅银灰色细密点阵的哑光纯白高级纸张纹理,画面长宽比为16:9。\n\n排版采用严谨的视觉层级:主标题使用粗体无衬线黑体字,正文使用清晰的现代等宽字体。配色方案极其克制,以纯白色为底,深炭黑为主视觉文字和边框,浅石板灰用于背景色块和次要信息区分,图标采用精致的银灰色线框绘制。\n\n在画面正上方居中位置,使用醒目的深炭黑粗体字排布着大标题“SenseNova-U1”。标题正下方是浅石板灰色的等宽字体副标题“新一代端到端统一多模态大模型家族”。\n\n画面主体分为左、中、右三个相等的垂直信息区块,区块之间通过充足的负空间进行物理隔离。\n\n左侧区块的主题是概述。顶部有一个银灰色线框绘制的、由放大镜和齿轮交织的图标,旁边是粗体小标题“Overview”。该区块内从上到下垂直排列着三个要点:第一个要点旁边是一个代表文档与照片重叠的极简图标,紧跟着文字“多模态模型家族,统一文本/图像理解和生成”。向下是由两个相连的同心圆组成的架构图标,配有文字“基于NEO-Unify架构(端到端统一理解和生成)”。最下方是一个带有斜线划掉的眼睛和漏斗形状的图标,明确指示文本“无需视觉编码器(VE)和变分自编码器(VAE)”。\n\n中间区块展示模型矩阵。顶部是一个包含两个分支节点的树状网络图标,旁边是粗体小标题“两个模型版本”。区块内分为上下两个包裹在浅石板灰色极细边框内的卡片。上方的卡片内画着一个代表高密度的实心几何立方体图标,大字标注“SenseNova-U1-Mini”,下方是等宽字体说明“18B参数密集模型”。下方的卡片内画着一个带有闪电符号的网状发光大脑图标,大字标注“SenseNova-U1-Flash”,下方是等宽字体说明“38B参数,3B激活的混合专家(MoE)模型”。在这两个独立卡片的正下方,左侧放置一个笑脸轮廓图标搭配文字“将在HF等平台公开”,右侧放置一个带有折角的书面报告图标搭配文字“将发布技术报告”。\n\n右侧区块呈现核心优势。顶部是一个代表巅峰的上升阶梯折线图图标,旁边是粗体小标题“Highlights”。该区块内部垂直分布着四个带有浅石板灰底色的长方形色块,每个色块内部左侧对应一个具体的图标,右侧为文字。第一个色块内是一个无缝相连的莫比乌斯环图标,配文“原生统一架构,无VE和VAE”。第二个色块内是一个顶端带有星星的奖杯图标,配文“单一统一模型在理解和生成任务上均达到SOTA性能”。第三个色块内是代表文本行与拍立得照片交替穿插的图标,配文“强大的原生交错推理能力(模型原生生成图像进行推理)”。最后一个色块内是一个被切分出一小块的硬币与详细饼状图结合的图标,配文“能生成复杂信息图表,成本仅为商业模型的1/10”。" --width 2048 --height 2048 --cfg_scale 4.0 --cfg_norm none --timestep_shift 3.0 --num_steps 50 --output output.png --profile
266
  ```
267
 
268
  </details>
@@ -274,7 +382,7 @@ python examples/t2i/inference.py --model_path SenseNova/SenseNova-U1-Mini-Beta -
274
  <summary>✏️ Image Editing</summary>
275
 
276
  ```bash
277
- python examples/editing/inference.py --model_path SenseNova/SenseNova-U1-Mini-Beta --prompt "Change the animal's fur color to a darker shade." --image examples/editing/data/images/1.jpg --cfg_scale 4.0 --img_cfg_scale 1.0 --cfg_norm none --timestep_shift 3.0 --num_steps 50 --output output_edited.png --profile --compare
278
  ```
279
 
280
  </details>
@@ -287,14 +395,14 @@ python examples/editing/inference.py --model_path SenseNova/SenseNova-U1-Mini-Be
287
  <summary>♻️ Interleaved Generation</summary>
288
 
289
  ```bash
290
- python examples/interleave/inference.py --model_path SenseNova/SenseNova-U1-Mini-Beta --prompt "I want to learn how to cook tomato and egg stir-fry. Please give me a beginner-friendly illustrated tutorial." --resolution "16:9" --output_dir outputs/interleave/ --stem demo --profile
291
  ```
292
  </details>
293
 
294
  > See [`examples/README.md`](./examples/README.md) for batched inference, JSONL format, prompt enhancement, resolution buckets, and full flag reference.
295
 
296
 
297
- ### ⚡ Run with LightLLM + LightX2V
298
 
299
  For production serving, we co-design a dedicated inference stack on top of **[LightLLM](https://github.com/ModelTC/lightllm)** (understanding) and **[LightX2V](https://github.com/ModelTC/lightx2v)** (generation). The two engines are disaggregated so that each path can use its own parallelism and resource budget, with a low-overhead transfer channel in between.
300
 
@@ -316,6 +424,23 @@ docker pull lightx2v/lightllm_lightx2v:20260407
316
 
317
  ``` -->
318
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
319
  ## ⚖️ License
320
 
321
  This project is released under the [Apache 2.0 License](./LICENSE).
 
1
  # SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-Unify Architecture
2
 
3
  <p align="center">
4
+ <strong>English</strong> | <a href="https://huggingface.co/sensenova/SenseNova-U1-Mini-Beta/blob/main/README_CN.md">简体中文</a>
5
  </p>
6
 
7
  <p align="center">
 
15
  <p align="center">
16
  <img src="docs/assets/teaser.png" alt="SenseNova-U1" width="900">
17
  </p>
18
+ <p align="center">
19
+ <img src="docs/assets/teaser_1.png" alt="radar plot" width="900">
20
+ </p>
21
 
22
  ## 🌟 Overview
23
 
24
+ 🚀 **SenseNova U1** is a new series of native multimodal models that unifies multimodal understanding, reasoning, and generation within a single architecture.
25
+ It marks a fundamental paradigm shift in multimodal AI: **from modality integration to true unification**. Rather than relying on adapters to translate between modalities, SenseNova U1 models think and act across language and vision natively.
26
+
27
+ The unification of visual understanding and generation opens up new possibilities. SenseNova U1 belongs to the current stage of data-driven learning (like ChatGPT), yet points toward the next stage: agentic learning (like OpenClaw) and natively multimodal thinking.
28
+
29
+
30
+ #### 🏗️ *Key Pillars:*
31
+
32
+ At the core of SenseNova U1 is **[NEO-Unify](https://huggingface.co/blog/sensenova/neo-unify)**, a novel architecture designed from first principles for multimodal AI: language and visual information are inherently and deeply correlated.
33
+ NEO-Unify eliminates both the Visual Encoder (VE) and the Variational Auto-Encoder (VAE), replacing them with a unified representation.
34
 
35
+ This architecture has several important features:
36
 
37
+ - 🔗 Model language and visual information end-to-end as a unified compound.
38
+ - 🖼️ Preserve semantic richness while maintaining pixel-level visual fidelity.
39
+ - 🧠 Reason across modalities with high efficiency & minimal conflict via native MoTs.
40
 
 
41
 
42
+ #### *What This Unlocks:*
43
 
44
+ Powered by this new core architecture, SenseNova U1 delivers exceptional efficiency in multimodal learning:
45
+
46
+ - 🏆 **Open-source SoTA in both understanding and generation**: SenseNova U1 sets a new standard for unified multimodal understanding and generation, achieving state-of-the-art performance among open-source models across a wide range of understanding, reasoning, and generation benchmarks.
47
+
48
+ - 📖 **Native interleaved image-text generation**: SenseNova U1 can generate coherent interleaved text and images in a single flow with one model, enabling use cases such as practical guides and travel diaries that combine clear communication with vivid storytelling and transform complex information into intuitive visuals.
49
 
50
+ - 📰 **High-density information rendering**: SenseNova U1 demonstrates strong capabilities in dense visual communication, generating richly structured layouts for knowledge illustrations, posters, presentations, comics, resumes, and other information-rich formats.
51
 
 
52
 
53
+ #### 🌍 *Beyond Multimodality:*
54
 
55
+ - 🤖 Vision–Language–Action (VLA)
56
  - 🌐 World Modeling (WM)
57
 
58
+ ## 🦁 Models
59
+
60
+ In this release, we are open-sourcing the SenseNova U1 Lite series in two sizes:
61
+
62
+ - SenseNova U1-8B-MoT — dense backbone
63
+ - SenseNova U1-A3B-MoT — MoE backbone
64
+
65
+
66
+ | Model | Params | HF Weights |
67
+ | :---- | :------- | :--------- |
68
+ | SenseNova-U1-8B-MoT-SFT | 8B MoT | [🤗 link](https://huggingface.co/sensenova/SenseNova-U1-8B-MoT-SFT) |
69
+ | SenseNova-U1-8B-MoT | 8B MoT | [🤗 link](https://huggingface.co/sensenova/SenseNova-U1-8B-MoT) |
70
+ | SenseNova-U1-A3B-MoT-SFT | A3B MoT | 🤗 link |
71
+ | SenseNova-U1-A3B-MoT | A3B MoT | 🤗 link |
72
+
73
+ Note that the **SFT models** are trained in four stages: (1) *Understanding Warmup*, (2) *Generation Pre-training*, (3) *Unified Mid-training*, and (4) *Unified Supervised Fine-tuning*. The **final models** are obtained from the base model following an initial round of T2I reinforcement learning (RL) training.
74
+
75
+ Although these models are relatively compact by today's standards, they already demonstrate strong potential across a wide range of tasks, delivering performance comparable to commercial models while offering outstanding cost efficiency. That said, we plan to release larger-scale models in the future, which we believe can deliver stronger capabilities and higher performance.
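
To fetch the released 8B weights ahead of time, a minimal sketch follows. It assumes the `huggingface-cli` tool from `huggingface_hub` is installed and uses the repository ID from the table above. The `checkpoints/` directory and the `DRY_RUN` switch are illustrative choices, not part of the project.

```bash
#!/usr/bin/env bash
# Sketch: download SenseNova-U1 weights from the Hugging Face Hub.
# Assumptions: huggingface-cli is on PATH; checkpoints/ and DRY_RUN are hypothetical.
set -euo pipefail

REPO_ID="${REPO_ID:-sensenova/SenseNova-U1-8B-MoT}"
# Use the part after the last "/" as the local folder name.
LOCAL_DIR="${LOCAL_DIR:-checkpoints/${REPO_ID##*/}}"
DRY_RUN="${DRY_RUN:-1}"

# Keep the command in an array so arguments with spaces stay intact.
DOWNLOAD_CMD=(huggingface-cli download "$REPO_ID" --local-dir "$LOCAL_DIR")

if [ "$DRY_RUN" = "1" ]; then
  # Print the command instead of running it.
  printf '%q ' "${DOWNLOAD_CMD[@]}"; printf '\n'
else
  "${DOWNLOAD_CMD[@]}"
fi
```

Set `DRY_RUN=0` to perform the download. Whether the example scripts accept a local directory for `--model_path` is an assumption here; check `examples/README.md` before relying on it.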
76
+
77
 
78
  ## 📣 Updated News
79
 
80
+ - `[2026.04.27]` Initial release of the weights for [SenseNova-U1-8B-MoT-SFT](https://huggingface.co/sensenova/SenseNova-U1-8B-MoT-SFT) and [SenseNova-U1-8B-MoT](https://huggingface.co/sensenova/SenseNova-U1-8B-MoT).
81
 
82
+ - `[2026.04.27]` Initial release of the [inference code](https://github.com/OpenSenseNova/SenseNova-U1/blob/main/examples/README.md) for SenseNova-U1.
83
 
84
  ## 📋 ToDo List
85
 
 
88
  - [ ] Final weights and technical report of SenseNova-U1
89
 
90
 
 
 
 
 
 
 
 
 
 
 
 
91
  ## 🎨 Showcases
92
 
93
  <details>
 
122
  <td><div style="max-height: 200px; overflow-y: auto;">1. <b>Instruction Understanding:</b> The core subjects are a small piece of dry wood and a dense iron block positioned within a transparent water tank. The wood is floating on the surface while the iron block is submerged at the bottom. 2. <b>Reasoning Process:</b> The wood is less dense than water so it will float, while the iron is denser and will sink to the bottom. 3. <b>Establish the frame:</b> The composition is a vertical medium shot centering the rectangular tank within the frame. The camera angle is eye-level to clearly display the water line and the submerged base. Focus is sharp across the entire depth of the tank to ensure both materials are distinct. 4. <b>Build the environment:</b> The scene is contained entirely within the clear glass walls of the water tank. The water fills the majority of the volume, providing a medium for the floating wood and sunken iron block. The background remains out of focus to keep attention on the tank's interior. 5. <b>Set the lighting and color:</b> Soft natural light illuminates the scene from the left, creating gentle reflections on the water surface. The color palette features the brown grain of the wood contrasting against the dark grey metallic finish of the iron. Shadows are soft and diffused through the liquid. 6. <b>Explicit Prompt:</b> A realistic photo of a transparent water tank showing a piece of wood floating on the surface and an iron block resting at the bottom.</div></td>
123
  <td style="vertical-align: top;"><img src="./docs/assets/showcases/t2i_reasoning/5_reasoning.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
124
  </tr>
125
+ <tr>
126
+ <td style="vertical-align: top;">A typical dish from the country where Naples is located</td>
127
+ <td><div style="max-height: 200px; overflow-y: auto;">1. <b>Instruction Understanding:</b> The core subject is a Neapolitan pizza presented as a typical dish. The context specifies Naples, Italy, as the country of origin for this food item. 2. <b>Reasoning Process:</b> Naples is in Italy, and a classic dish is a Neapolitan pizza. 3. <b>Establish the frame:</b> The Neapolitan pizza is captured in a close-up shot that fills the central frame. The angle is slightly elevated to show the round form of the dish clearly. 4. <b>Set the lighting and color:</b> Soft lighting illuminates the surface of the Neapolitan pizza to reveal texture. Warm tones dominate the color palette, emphasizing the baked nature of the dish. 5. <b>Lock the style:</b> The image utilizes a realistic photographic style with sharp focus on the main subject. The finish appears natural and appetizing, suitable for food documentation. 6. <b>Explicit Prompt:</b> A delicious Neapolitan pizza with a soft, charred crust, tomato sauce, and fresh mozzarella, served on a rustic wooden table, realistic food photography.</div></td>
128
+ <td style="vertical-align: top;"><img src="./docs/assets/showcases/t2i_reasoning/2_reasoning.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
129
+ </tr>
130
+ <tr>
131
+ <td style="vertical-align: top;">A solution of calcium carbonate reacting with acetic acid</td>
132
+ <td><div style="max-height: 200px; overflow-y: auto;">1. <b>Instruction Understanding:</b> The core subject is a solution of calcium carbonate and acetic acid. The prompt specifies the reacting state of the chemical mixture. 2. <b>Reasoning Process:</b> The reaction produces carbon dioxide gas, which would be visible as a steady stream of bubbles rising through the liquid. 3. <b>Establish the frame:</b> The camera frames the solution closely to capture the details of the reaction. The composition centers on the liquid where the gas is visible. 4. <b>Set the lighting and color:</b> The liquid appears clear, allowing the white bubbles to stand out distinctly. The lighting is bright and even to illuminate the stream of gas. 5. <b>Lock the style:</b> The image maintains a realistic photographic style suitable for scientific observation. The focus is sharp on the reacting solution and bubbles. 6. <b>Explicit Prompt:</b> A test tube filled with a clear liquid and a rapid, effervescent stream of carbon dioxide bubbles rising to the surface, laboratory experiment.</div></td>
133
+ <td style="vertical-align: top;"><img src="./docs/assets/showcases/t2i_reasoning/7_reasoning.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
134
+ </tr>
135
  </table>
136
 
137
  </details>
 
139
  <details>
140
  <summary>🖼️ Text-to-Image (Infographics)</summary>
141
 
142
+ <table align="center">
143
+ <tr>
144
+ <td align="center"><a href="./docs/assets/showcases/t2i_infographic/0004.webp"><img width="300" alt="t2i landscape 0001" src="./docs/assets/showcases/t2i_infographic/0004.webp"></a></td>
145
+ <td align="center"><a href="./docs/assets/showcases/t2i_infographic/0012.webp"><img width="300" alt="t2i landscape 0002" src="./docs/assets/showcases/t2i_infographic/0012.webp"></a></td>
146
+ <td align="center"><a href="./docs/assets/showcases/t2i_infographic/0005.webp"><img width="300" alt="t2i landscape 0003" src="./docs/assets/showcases/t2i_infographic/0005.webp"></a></td>
147
+ </tr>
148
+ <tr>
149
+ <td align="center"><a href="./docs/assets/showcases/t2i_infographic/0018.webp"><img width="300" alt="t2i landscape 0004" src="./docs/assets/showcases/t2i_infographic/0018.webp"></a></td>
150
+ <td align="center"><a href="./docs/assets/showcases/t2i_infographic/0024.webp"><img width="300" alt="t2i landscape 0005" src="./docs/assets/showcases/t2i_infographic/0024.webp"></a></td>
151
+ <td align="center"><a href="./docs/assets/showcases/t2i_infographic/0013.webp"><img width="300" alt="t2i landscape 0006" src="./docs/assets/showcases/t2i_infographic/0013.webp"></a></td>
152
+ </tr>
153
+ <tr>
154
+ <td align="center"><a href="./docs/assets/showcases/t2i_infographic/0006.webp"><img width="300" alt="t2i landscape 0007" src="./docs/assets/showcases/t2i_infographic/0006.webp"></a></td>
155
+ <td align="center"><a href="./docs/assets/showcases/t2i_infographic/0015.webp"><img width="300" alt="t2i landscape 0008" src="./docs/assets/showcases/t2i_infographic/0015.webp"></a></td>
156
+ <td align="center"><a href="./docs/assets/showcases/t2i_infographic/0025.webp"><img width="300" alt="t2i landscape 0009" src="./docs/assets/showcases/t2i_infographic/0025.webp"></a></td>
157
+ </tr>
158
+ </table>
159
+
160
+ <table align="center">
161
+ <tr>
162
+ <td align="center"><a href="./docs/assets/showcases/t2i_infographic/0000.webp"><img width="220" alt="t2i landscape 0010" src="./docs/assets/showcases/t2i_infographic/0000.webp"></a></td>
163
+ <td align="center"><a href="./docs/assets/showcases/t2i_infographic/0003.webp"><img width="220" alt="t2i landscape 0011" src="./docs/assets/showcases/t2i_infographic/0003.webp"></a></td>
164
+ <td align="center"><a href="./docs/assets/showcases/t2i_infographic/0001.webp"><img width="220" alt="t2i landscape 0012" src="./docs/assets/showcases/t2i_infographic/0001.webp"></a></td>
165
+ <td align="center"><a href="./docs/assets/showcases/t2i_infographic/0022.webp"><img width="220" alt="t2i landscape 0012" src="./docs/assets/showcases/t2i_infographic/0022.webp"></a></td>
166
+ </tr>
167
+ <tr>
168
+ <td align="center"><a href="./docs/assets/showcases/t2i_infographic/0016.webp"><img width="220" alt="t2i image 0022" src="./docs/assets/showcases/t2i_infographic/0016.webp"></a></td>
169
+ <td align="center"><a href="./docs/assets/showcases/t2i_infographic/0010.webp"><img width="220" alt="t2i image 0020" src="./docs/assets/showcases/t2i_infographic/0010.webp"></a></td>
170
+ <td align="center"><a href="./docs/assets/showcases/t2i_infographic/0007.webp"><img width="220" alt="t2i image 0021" src="./docs/assets/showcases/t2i_infographic/0007.webp"></a></td>
171
+ <td align="center"><a href="./docs/assets/showcases/t2i_infographic/0021.webp"><img width="220" alt="t2i image 0023" src="./docs/assets/showcases/t2i_infographic/0021.webp"></a></td>
172
+ </tr>
173
+ <tr>
174
+ <td align="center"><a href="./docs/assets/showcases/t2i_infographic/0009.webp"><img width="220" alt="t2i image 0024" src="./docs/assets/showcases/t2i_infographic/0009.webp"></a></td>
175
+ <td align="center"><a href="./docs/assets/showcases/t2i_infographic/0020.webp"><img width="220" alt="t2i image 0025" src="./docs/assets/showcases/t2i_infographic/0020.webp"></a></td>
176
+ <td align="center"><a href="./docs/assets/showcases/t2i_infographic/0008.webp"><img width="220" alt="t2i image 0026" src="./docs/assets/showcases/t2i_infographic/0008.webp"></a></td>
177
+ <td align="center"><a href="./docs/assets/showcases/t2i_infographic/0002.webp"><img width="220" alt="t2i image 0027" src="./docs/assets/showcases/t2i_infographic/0002.webp"></a></td>
178
+ </tr>
179
+ </table>
180
 
181
  </details>
182
 
 
240
  > 📸 **More editing samples:** see [Image Editing Gallery](./docs/showcases.md#image-editing).
241
 
242
  <details>
243
+ <summary>♻️ Interleaved Generation (General)</summary>
244
 
245
  | |
246
  | :---: |
 
249
 
250
  </details>
251
 
252
+
253
+ <details>
254
+ <summary>♻️ Interleaved Generation (Reasoning)</summary>
255
+
256
+ | |
257
+ | :---: |
258
+ | [<img alt="interleave case 05" src="./docs/assets/showcases/interleave/reasoning_case1.png">](./docs/assets/showcases/interleave/reasoning_case1.png) |
259
+
260
+ </details>
261
+
262
  > 📸 **More interleaved samples:** see [Interleaved Generation Gallery](./docs/showcases.md#interleaved-generation).
263
 
264
  <details>
265
+ <summary>📝 Visual Understanding (General)</summary>
266
 
267
  | |
268
  | :---: |
 
269
  | [<img alt="vqa general cases" src="./docs/assets/showcases/vqa/general_case.webp">](./docs/assets/showcases/vqa/general_case.webp) |
270
 
271
+ </details>
272
+
273
+ <details>
274
+ <summary>📝 Visual Understanding (Agentic)</summary>
275
+
276
+ | |
277
+ | :---: |
278
+ | [<img alt="vqa agentic case" src="./docs/assets/showcases/vqa/agentic_case.webp">](./docs/assets/showcases/vqa/agentic_case.webp) |
279
+
280
+
281
  </details>
282
 
283
  > 📸 **More understanding samples:** see [Visual Understanding Gallery](./docs/showcases.md#visual-understanding).
284
 
285
 
286
+ <details>
287
+ <summary>🦾 Vision–Language–Action (VLA)</summary>
288
+
289
+ [![YouTube](./docs/assets/showcases/vla/1.png)](https://www.youtube.com/watch?v=3mvBPPgv8vo)
290
+ [![YouTube](./docs/assets/showcases/vla/2.png)](https://www.youtube.com/watch?v=2QZY8gf0Vsk)
291
+ [![YouTube](./docs/assets/showcases/vla/3.png)](https://www.youtube.com/watch?v=tznVbuYf0yw)
292
+
293
+ </details>
294
+
295
+
296
  ## 📊 Key Benchmarks
297
 
298
  <details>
 
311
  <img src="docs/assets/benchmarks/generation.webp" alt="Generation Benchmarks">
312
  </p>
313
 
 
314
  </details>
315
 
316
  <details>
 
339
 
340
  > Refer to the [SenseNova-Skills README](https://github.com/OpenSenseNova/SenseNova-Skills) for installation and usage details.
341
 
342
+ <details>
343
+ <summary>✨ Some interesting cases produced through our Skills and Studio</summary>
344
 
345
+ <p align="center">
346
+ <img width="800" alt="u1 case" src="./docs/assets/showcases/t2i_infographic/u1-case.webp">
347
+ </p>
348
+
349
+ <p align="center">
350
+ <img width="800" alt="neo case 2" src="./docs/assets/showcases/t2i_infographic/neo-case2.webp">
351
+ </p>
352
+ </details>
353
+
354
+ ### 🤗 Run with transformers (Default)
355
 
356
  > **Setup:** Follow the [Installation Guide](./docs/installation.md) to clone the repo and install dependencies with uv.
357
 
 
359
  <summary>📝 Visual Understanding</summary>
360
 
361
  ```bash
362
+ python examples/vqa/inference.py --model_path SenseNova/SenseNova-U1-8B-MoT --image examples/vqa/data/images/menu.jpg --question "My friend and I are dining together tonight. Looking at this menu, can you recommend a good combination of dishes for 2 people? We want a balanced meal — a mix of mains and maybe a starter or dessert. Budget-conscious but want to try the highlights." --output outputs/answer.txt --max_new_tokens 8192 --do_sample --temperature 0.6 --top_p 0.95 --top_k 20 --repetition_penalty 1.05 --profile
363
  ```
364
 
365
  </details>
 
370
  <summary>🖼️ Text-to-Image</summary>
371
 
372
  ```bash
373
+ python examples/t2i/inference.py --model_path SenseNova/SenseNova-U1-8B-MoT --prompt "这张信息图的标题是“SenseNova-U1”,采用现代极简科技矩阵风格。整体布局为水平三列网格结构,背景是带有极浅银灰色细密点阵的哑光纯白高级纸张纹理,画面长宽比为16:9。\n\n排版采用严谨的视觉层级:主标题使用粗体无衬线黑体字,正文使用清晰的现代等宽字体。配色方案极其克制,以纯白色为底,深炭黑为主视觉文字和边框,浅石板灰用于背景色块和次要信息区分,图标采用精致的银灰色线框绘制。\n\n在画面正上方居中位置,使用醒目的深炭黑粗体字排布着大标题“SenseNova-U1”。标题正下方是浅石板灰色的等宽字体副标题“新一代端到端统一多模态大模型家族”。\n\n画面主体分为左、中、右三个相等的垂直信息区块,区块之间通过充足的负空间进行物理隔离。\n\n左侧区块的主题是概述。顶部有一个银灰色线框绘制的、由放大镜和齿轮交织的图标,旁边是粗体小标题“Overview”。该区块内从上到下垂直排列着三个要点:第一个要点旁边是一个代表文档与照片重叠的极简图标,紧跟着文字“多模态模型家族,统一文本/图像理解和生成”。向下是由两个相连的同心圆组成的架构图标,配有文字“基于NEO-Unify架构(端到端统一理解和生成)”。最下方是一个带有斜线划掉的眼睛和漏斗形状的图标,明确指示文本“无需视觉编码器(VE)和变分自编码器(VAE)”。\n\n中间区块展示模型矩阵。顶部是一个包含两个分支节点的树状网络图标,旁边是粗体小标题“两个模型版本”。区块内分为上下两个包裹在浅石板灰色极细边框内的卡片。上方的卡片内画着一个代表高密度的实心几何立方体图标,大字标注“SenseNova-U1-Mini”,下方是等宽字体说明“18B参数密集模型”。下方的卡片内画着一个带有闪电符号的网状发光大脑图标,大字标注“SenseNova-U1-Flash”,下方是等宽字体说明“38B参数,3B激活的混合专家(MoE)模型”。在这两个独立卡片的正下方,左侧放置一个笑脸轮廓图标搭配文字“将在HF等平台公开”,右侧放置一个带有折角的书面报告图标搭配文字“将发布技术报告”。\n\n右侧区块呈现核心优势。顶部是一个代表巅峰的上升阶梯折线图图标,旁边是粗体小标题“Highlights”。该区块内部垂直分布着四个带有浅石板灰底色的长方形色块,每个色块内部左侧对应一个具体的图标,右侧为文字。第一个色块内是一个无缝相连的莫比乌斯环图标,配文“原生统一架构,无VE和VAE”。第二个色块内是一个顶端带有星星的奖杯图标,配文“单一统一模型在理解和生成任务上均达到SOTA性能”。第三个色块内是代表文本行与拍立得照片交替穿插的图标,配文“强大的原生交错推理能力(模型原生生成图像进行推理)”。最后一个色块内是一个被切分出一小块的硬币与详细饼状图结合的图标,配文“能生成复杂信息图表,成本仅为商业模型的1/10”。" --width 2048 --height 2048 --cfg_scale 4.0 --cfg_norm none --timestep_shift 3.0 --num_steps 50 --output output.png --profile
374
  ```
375
 
376
  </details>
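
The one-line text-to-image command is hard to read and edit. The sketch below assembles the same flags from named variables. It uses only flags that appear in the command above. The `build_t2i_cmd` helper, the `DRY_RUN` switch, and the short English prompt are hypothetical and serve only as illustration.

```bash
#!/usr/bin/env bash
# Sketch: build the text-to-image command from the README with readable variables.
# Assumptions: run from the repository root; build_t2i_cmd and DRY_RUN are hypothetical.
set -euo pipefail

MODEL_PATH="${MODEL_PATH:-SenseNova/SenseNova-U1-8B-MoT}"
DRY_RUN="${DRY_RUN:-1}"

# Fill the global array T2I_CMD with the full command for one prompt.
build_t2i_cmd() {
  local prompt="$1" output="$2"
  T2I_CMD=(python examples/t2i/inference.py
    --model_path "$MODEL_PATH"
    --prompt "$prompt"
    --width 2048 --height 2048
    --cfg_scale 4.0 --cfg_norm none
    --timestep_shift 3.0 --num_steps 50
    --output "$output" --profile)
}

build_t2i_cmd "A minimalist infographic titled SenseNova-U1" "output.png"

if [ "$DRY_RUN" = "1" ]; then
  # Print the shell-quoted command for inspection instead of running it.
  printf '%q ' "${T2I_CMD[@]}"; printf '\n'
else
  "${T2I_CMD[@]}"
fi
```

Storing the command in an array keeps a long, multi-paragraph prompt as a single argument, which is easy to get wrong with plain string concatenation.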
 
382
  <summary>✏️ Image Editing</summary>
383
 
384
  ```bash
385
+ python examples/editing/inference.py --model_path SenseNova/SenseNova-U1-8B-MoT --prompt "Change the animal's fur color to a darker shade." --image examples/editing/data/images/1.jpg --cfg_scale 4.0 --img_cfg_scale 1.0 --cfg_norm none --timestep_shift 3.0 --num_steps 50 --output output_edited.png --profile --compare
386
  ```
387
 
388
  </details>
 
395
  <summary>♻️ Interleaved Generation</summary>
396
 
397
  ```bash
398
+ python examples/interleave/inference.py --model_path SenseNova/SenseNova-U1-8B-MoT --prompt "I want to learn how to cook tomato and egg stir-fry. Please give me a beginner-friendly illustrated tutorial." --resolution "16:9" --output_dir outputs/interleave/ --stem demo --profile
399
  ```
400
  </details>
401
 
402
  > See [`examples/README.md`](./examples/README.md) for batched inference, JSONL format, prompt enhancement, resolution buckets, and full flag reference.
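
For a quick experiment with a handful of prompts, a plain shell loop over the interleaved-generation command may be enough. This is a sketch, not the project's batched-inference path. It reuses only the flags shown above, the second prompt is made up, and it assumes `--stem` names the output files so that runs do not overwrite each other.

```bash
#!/usr/bin/env bash
# Sketch: run the interleaved-generation example once per prompt.
# Assumptions: run from the repository root; DRY_RUN and the demo_<i> stems are hypothetical.
set -euo pipefail

MODEL_PATH="${MODEL_PATH:-SenseNova/SenseNova-U1-8B-MoT}"
DRY_RUN="${DRY_RUN:-1}"

PROMPTS=(
  "I want to learn how to cook tomato and egg stir-fry. Please give me a beginner-friendly illustrated tutorial."
  "Please give me an illustrated travel diary for a weekend in Naples."
)

STEMS=()
for i in "${!PROMPTS[@]}"; do
  stem="demo_${i}"   # one stem per prompt
  STEMS+=("$stem")
  cmd=(python examples/interleave/inference.py
    --model_path "$MODEL_PATH"
    --prompt "${PROMPTS[$i]}"
    --resolution "16:9"
    --output_dir outputs/interleave/
    --stem "$stem"
    --profile)
  if [ "$DRY_RUN" = "1" ]; then
    # Print the shell-quoted command instead of running it.
    printf '%q ' "${cmd[@]}"; printf '\n'
  else
    "${cmd[@]}"
  fi
done
```

Each iteration starts a new Python process and presumably reloads the model, so for more than a few prompts the batched mode described in `examples/README.md` is likely the better choice.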
403
 
404
 
405
+ ### ⚡ Run with LightLLM + LightX2V (Recommended)
406
 
407
  For production serving, we co-design a dedicated inference stack on top of **[LightLLM](https://github.com/ModelTC/lightllm)** (understanding) and **[LightX2V](https://github.com/ModelTC/lightx2v)** (generation). The two engines are disaggregated so that each path can use its own parallelism and resource budget, with a low-overhead transfer channel in between.
408
 
 
424
 
425
  ``` -->
426
 
427
+ ## 🌐 Join the Community!
428
+
429
+ Join our growing community to share feedback, get support, and stay updated on the latest SenseNova-U1 developments — we'd love to hear from you!
430
+
431
+ <div align="center">
432
+ <table>
433
+ <tr>
434
+ <td align="center"><b><a href="https://discord.gg/cxkwXWjp">Discord</a></b></td>
435
+ <td align="center"><b>WeChat Group</b></td>
436
+ </tr>
437
+ <tr>
438
+ <td align="center"><a href="https://discord.gg/cxkwXWjp"><img src="docs/assets/discord_qr.webp" width="160"/></a></td>
439
+ <td align="center"><img src="docs/assets/wechat_qr.webp" width="160"/></td>
440
+ </tr>
441
+ </table>
442
+ </div>
443
+
444
  ## ⚖️ License
445
 
446
  This project is released under the [Apache 2.0 License](./LICENSE).