---
license: cc-by-nc-4.0
library_name: transformers
tags:
- mobile-o
- multimodal
- unified-model
- vision-language
- text-to-image
- image-understanding
- on-device
- mobile
pipeline_tag: text-to-image
datasets:
- Amshaker/Mobile-O-Post-Train
- Amshaker/Mobile-O-SFT
- Amshaker/Mobile-O-Pre-Train
base_model:
- Efficient-Large-Model/Sana_600M_512px_diffusers
- apple/FastVLM-0.5B
---

<div align="center">

<h1>
  <img src="https://github.com/Amshaker/Mobile-O/blob/main/assets/mobile-o-logo.png?raw=true" width="30" /> Mobile-O-0.5B
</h1>

**Unified Multimodal Understanding and Generation on Mobile Device**

<p>
<a href="https://arxiv.org/abs/2602.20161"><img src="https://img.shields.io/badge/arXiv-2602.20161-b31b1b.svg" alt="arXiv"></a>
<a href="https://github.com/Amshaker/Mobile-O"><img src="https://img.shields.io/badge/GitHub-Code-black.svg" alt="Code"></a>
<a href="https://amshaker.github.io/Mobile-O/"><img src="https://img.shields.io/badge/🌐-Project_Page-2563eb.svg" alt="Project Page"></a>
<a href="https://mobileo.cvmbzuai.com/"><img src="https://img.shields.io/badge/🚀-Live_Demo-10b981.svg" alt="Demo"></a>
<a href="https://huggingface.co/collections/Amshaker/mobile-o-datasets"><img src="https://img.shields.io/badge/🤗-Datasets-yellow.svg" alt="Datasets"></a>
<a href="https://apps.apple.com/app/mobile-o/id6759238106"><img src="https://img.shields.io/badge/%EF%A3%BF-App_Store-black.svg" alt="App Store"></a>
</p>

</div>

## 📌 Overview

Mobile-O-0.5B is a compact unified vision–language–diffusion model that performs both **multimodal understanding** (VQA, OCR, reasoning) and **image generation** within a single architecture, designed for mobile and edge deployment.

| Spec | Detail |
|------|--------|
| **Total Parameters** | 1.6B |
| **Image Resolution** | 512×512 |
| **Image Generation** | ~3 seconds on iPhone |
| **Visual Understanding** | ~0.4 seconds on iPhone |
| **Memory Footprint** | < 2GB |

## 🎯 Supported Tasks

| Task | Input → Output |
|------|---------------|
| 💬 Conversational AI | Text → Text |
| 👁️ Image Understanding | Image + Text → Text |
| 🖼️ Image Generation | Text → Image |
| ✏️ Image Editing | Image + Text → Image |

## 🚀 Quick Start

### Download

```python
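from huggingface_hub import snapshot_download

# Fetch only the merged checkpoint folder from the model repo into ./checkpoints.
snapshot_download(
    repo_id="Amshaker/Mobile-O-0.5B",
    repo_type="model",
    local_dir="checkpoints",
    # Restrict the download to the checkpoint directory used by the commands below.
    allow_patterns=["final_merged_model_23620/*"],
)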
```

### Image Understanding

```bash
python infer_und.py \
    --model_path checkpoints/final_merged_model_23620/ \
    --image_path assets/cute_cat.png \
    --prompt "What is in the image?"
```

### Image Generation

```bash
python infer_gen.py \
    --model_path checkpoints/final_merged_model_23620/ \
    --prompt "A vibrant tropical rainforest scene with a scarlet macaw perched on a moss-covered branch"
```

### Image Editing

```bash
python infer_edit.py \
    --model_path checkpoints/final_merged_model_23620/ \
    --image_path assets/cute_cat.png \
    --prompt "Make the cat wear a hat"
```
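
The three commands above share the same flags, so they can be driven from Python when scripting batches of prompts. The following is a minimal sketch, not part of the repository: the helper names `build_command` and `run`, and the task keys `understand`, `generate`, and `edit`, are hypothetical. Only the script names and the `--model_path`, `--image_path`, and `--prompt` flags come from the commands shown above.

```python
import subprocess
import sys
from pathlib import Path

# Checkpoint folder created by the download step (local_dir + allowed subfolder).
MODEL_PATH = Path("checkpoints") / "final_merged_model_23620"

# Script names are taken from the commands above; the task keys are hypothetical.
SCRIPTS = {
    "understand": "infer_und.py",
    "generate": "infer_gen.py",
    "edit": "infer_edit.py",
}


def build_command(task, prompt, image_path=None, model_path=MODEL_PATH):
    """Return the argv list for one of the three inference scripts."""
    if task not in SCRIPTS:
        raise ValueError(f"unknown task {task!r}; expected one of {sorted(SCRIPTS)}")
    needs_image = task in ("understand", "edit")
    if needs_image and image_path is None:
        raise ValueError(f"task {task!r} requires image_path")
    if not needs_image and image_path is not None:
        raise ValueError("generation takes only a text prompt")

    cmd = [
        sys.executable,
        SCRIPTS[task],
        "--model_path",
        Path(model_path).as_posix() + "/",
    ]
    if needs_image:
        cmd += ["--image_path", Path(image_path).as_posix()]
    cmd += ["--prompt", prompt]
    return cmd


def run(task, prompt, image_path=None, model_path=MODEL_PATH):
    """Run the chosen script from the repository root; raises if it exits non-zero."""
    return subprocess.run(
        build_command(task, prompt, image_path, model_path), check=True
    )
```

For example, `run("edit", "Make the cat wear a hat", image_path="assets/cute_cat.png")` would launch the same command as the Image Editing snippet, assuming the working directory is the repository root.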

## 🏗️ Architecture

Mobile-O consists of three main components:

- **Vision-Language Model (VLM):** [FastVLM-0.5B](https://github.com/apple/ml-fastvlm), a FastViT vision encoder paired with a Qwen2-0.5B language backbone
- **Diffusion Decoder:** [SANA-600M-512](https://github.com/NVlabs/Sana), a lightweight linear DiT with a VAE for 512×512 generation
- **Mobile Conditioning Projector (MCP):** a ~2.4M-parameter connector that uses layerwise feature fusion with temperature-scaled weights, depthwise-separable 1D convolutions, and efficient channel attention
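
To make "layerwise feature fusion with temperature-scaled weights" concrete, here is a minimal pure-Python sketch of one common way to do it: a softmax over per-layer logits divided by a temperature, followed by a weighted sum of the per-layer features. The softmax form, the function names `softmax` and `fuse_layers`, and the toy inputs are assumptions made for illustration. This card does not specify the exact formulation used in the MCP, and the sketch omits the convolution and channel-attention stages.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_layers(layer_features, layer_logits, temperature=1.0):
    """Weighted sum of per-layer feature vectors (illustrative, not the MCP code).

    layer_features: list of L vectors, each a list of D floats (one per layer).
    layer_logits:   list of L scalars; these would be learnable in a real model.
    """
    if len(layer_features) != len(layer_logits):
        raise ValueError("need exactly one logit per layer")
    weights = softmax(layer_logits, temperature)
    dim = len(layer_features[0])
    fused = [0.0] * dim
    for weight, features in zip(weights, layer_features):
        if len(features) != dim:
            raise ValueError("all layers must share the same feature size")
        for i, value in enumerate(features):
            fused[i] += weight * value
    return fused
```

With equal logits the fusion reduces to a plain average of the layers. Lowering the temperature sharpens the weights toward the layer with the largest logit, and raising it flattens them toward uniform.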

## 🏋️ Training

Trained in three stages:

1. **Pre-training:** Cross-modal alignment on [4M text-image pairs](https://huggingface.co/datasets/Amshaker/Mobile-O-Pre-Train)
2. **SFT:** Supervised fine-tuning on [~105K curated pairs](https://huggingface.co/datasets/Amshaker/Mobile-O-SFT)
3. **Post-training:** Unified multimodal training on [~105K quadruplets](https://huggingface.co/datasets/Amshaker/Mobile-O-Post-Train)

## 🔗 Related Resources

| Resource | Link |
|----------|------|
| 🤗 Mobile-O-1.5B | [Model](https://huggingface.co/Amshaker/Mobile-O-1.5B) |
| 🤗 Mobile-O-0.5B-iOS | [iOS Components](https://huggingface.co/Amshaker/Mobile-O-0.5B-iOS) |
| 📱 iOS App Source Code | [Mobile-O-App](https://github.com/Amshaker/Mobile-O/tree/main/Mobile-O-App) |

## 📄 Citation

```bibtex
@article{shaker2026mobileo,
  title={Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device},
  author={Shaker, Abdelrahman and Heakl, Ahmed and Muhammad, Jaseel and Thawkar, Ritesh and Thawakar, Omkar and Li, Senmao and Cholakkal, Hisham and Reid, Ian and Xing, Eric P. and Khan, Salman and Khan, Fahad Shahbaz},
  journal={arXiv preprint arXiv:2602.20161},
  year={2026}
}
```

## ⚖️ License

Released under [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/). For research purposes only.