---
license: cc-by-nc-4.0
library_name: transformers
tags:
- mobile-o
- multimodal
- unified-model
- vision-language
- text-to-image
- image-understanding
- on-device
- mobile
pipeline_tag: text-to-image
datasets:
- Amshaker/Mobile-O-Post-Train
- Amshaker/Mobile-O-SFT
- Amshaker/Mobile-O-Pre-Train
base_model:
- Efficient-Large-Model/Sana_600M_512px_diffusers
- apple/FastVLM-0.5B
---
<div align="center">
<h1>
<img src="https://github.com/Amshaker/Mobile-O/blob/main/assets/mobile-o-logo.png?raw=true" width="30" /> Mobile-O-0.5B
</h1>
**Unified Multimodal Understanding and Generation on Mobile Device**
<p>
<a href="https://arxiv.org/abs/2602.20161"><img src="https://img.shields.io/badge/arXiv-2602.20161-b31b1b.svg" alt="arXiv"></a>
<a href="https://github.com/Amshaker/Mobile-O"><img src="https://img.shields.io/badge/GitHub-Code-black.svg" alt="Code"></a>
<a href="https://amshaker.github.io/Mobile-O/"><img src="https://img.shields.io/badge/Project_Page-2563eb.svg" alt="Project Page"></a>
<a href="https://mobileo.cvmbzuai.com/"><img src="https://img.shields.io/badge/Live_Demo-10b981.svg" alt="Demo"></a>
<a href="https://huggingface.co/collections/Amshaker/mobile-o-datasets"><img src="https://img.shields.io/badge/🤗-Datasets-yellow.svg" alt="Datasets"></a>
<a href="https://apps.apple.com/app/mobile-o/id6759238106"><img src="https://img.shields.io/badge/-App_Store-black.svg" alt="App Store"></a>
</p>
</div>
## Overview
Mobile-O-0.5B is a compact unified vision–language–diffusion model that performs both **multimodal understanding** (VQA, OCR, reasoning) and **image generation** within a single architecture, designed for mobile and edge deployment.
| Spec | Detail |
|------|--------|
| **Total Parameters** | 1.6B |
| **Image Resolution** | 512×512 |
| **Image Generation** | ~3 seconds on iPhone |
| **Visual Understanding** | ~0.4 seconds on iPhone |
| **Memory Footprint** | < 2GB |
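As a rough sanity check on the table above, the sketch below computes how much storage 1.6B parameters would need at a few common weight precisions. The card does not say which precision the on-device build uses or what the memory figure covers, so the precisions listed are illustrative assumptions rather than the released configuration.

```python
# Back-of-the-envelope weight storage for 1.6B parameters (figure from the table).
# The precisions below are illustrative assumptions; the card does not state
# which one the on-device build uses.
TOTAL_PARAMS = 1.6e9
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gigabytes(num_params: float, bytes_per_param: float) -> float:
    """Return weight storage in GB (1 GB = 1e9 bytes), ignoring activations."""
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    for name, width in BYTES_PER_PARAM.items():
        print(f"{name}: {weight_gigabytes(TOTAL_PARAMS, width):.1f} GB")
```

Only weights are counted here; activations and runtime overhead come on top, so treat the numbers as lower bounds for each precision.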
## Supported Tasks
| Task | Input → Output |
|------|---------------|
| Conversational AI | Text → Text |
| Image Understanding | Image + Text → Text |
| Image Generation | Text → Image |
| Image Editing | Image + Text → Image |
## Quick Start
### Download
```python
from huggingface_hub import snapshot_download

# Fetch only the merged checkpoint folder into ./checkpoints
snapshot_download(
    repo_id="Amshaker/Mobile-O-0.5B",
    repo_type="model",
    local_dir="checkpoints",
    allow_patterns=["final_merged_model_23620/*"],
)
```
### Image Understanding
```bash
python infer_und.py \
    --model_path checkpoints/final_merged_model_23620/ \
    --image_path assets/cute_cat.png \
    --prompt "What is in the image?"
```
### Image Generation
```bash
python infer_gen.py \
    --model_path checkpoints/final_merged_model_23620/ \
    --prompt "A vibrant tropical rainforest scene with a scarlet macaw perched on a moss-covered branch"
```
### Image Editing
```bash
python infer_edit.py \
    --model_path checkpoints/final_merged_model_23620/ \
    --image_path assets/cute_cat.png \
    --prompt "Make the cat wear a hat"
```
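To generate images for several prompts in one go, the commands above can be wrapped in a small Python driver. The sketch below is one possible way to do that. It assumes it is run from the repository root (where `infer_gen.py` lives) and uses only the `--model_path` and `--prompt` flags shown above. The helper names `build_gen_command` and `run_prompts` are hypothetical and not part of the Mobile-O code base.

```python
import shlex
import subprocess
import sys
from pathlib import Path

# Checkpoint folder produced by the Download step above.
MODEL_PATH = "checkpoints/final_merged_model_23620/"


def build_gen_command(prompt: str, model_path: str = MODEL_PATH) -> list[str]:
    """Build the argv list for one infer_gen.py call (flags as shown above)."""
    return [sys.executable, "infer_gen.py", "--model_path", model_path, "--prompt", prompt]


def run_prompts(prompts: list[str], model_path: str = MODEL_PATH) -> None:
    """Run infer_gen.py once per prompt, stopping at the first failure."""
    if not Path(model_path).is_dir():
        raise FileNotFoundError(f"Checkpoint folder not found: {model_path}")
    for prompt in prompts:
        subprocess.run(build_gen_command(prompt, model_path), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch is safe to try anywhere.
    print(shlex.join(build_gen_command("A scarlet macaw on a moss-covered branch")))
```

Passing the arguments as a list avoids shell quoting problems with prompts that contain quotes or spaces. Each call reloads the model, so this is convenient rather than fast.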
## Architecture
Mobile-O consists of three main components:
- **Vision-Language Model (VLM):** [FastVLM-0.5B](https://github.com/apple/ml-fastvlm), which pairs a FastViT vision encoder with a Qwen2-0.5B language backbone
- **Diffusion Decoder:** [SANA-600M-512](https://github.com/NVlabs/Sana), a lightweight linear DiT with a VAE for 512×512 generation
- **Mobile Conditioning Projector (MCP):** a ~2.4M-parameter connector that uses layerwise feature fusion with temperature-scaled weights, depthwise-separable 1D convolutions, and efficient channel attention
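Two of the MCP ingredients named above can be illustrated in a few lines of plain Python. This is a minimal sketch of the general techniques, not the Mobile-O implementation: the card does not give the exact formulation, so the fusion rule (a softmax over per-layer logits divided by a temperature) and all sizes used here are assumptions for illustration.

```python
import math


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Temperature-scaled softmax; lower temperature gives sharper weights."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_layers(layer_features: list[list[float]], layer_logits: list[float],
                temperature: float = 1.0) -> list[float]:
    """Weighted sum of per-layer feature vectors using softmax(logits / T)."""
    if len(layer_features) != len(layer_logits):
        raise ValueError("need one logit per layer")
    weights = softmax(layer_logits, temperature)
    dim = len(layer_features[0])
    return [
        sum(w * feats[i] for w, feats in zip(weights, layer_features))
        for i in range(dim)
    ]


def standard_conv1d_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weight count of a dense 1D convolution (bias terms omitted)."""
    return kernel * c_in * c_out


def separable_conv1d_params(c_in: int, c_out: int, kernel: int) -> int:
    """Depthwise (kernel * c_in) plus pointwise (c_in * c_out) weights, no bias."""
    return kernel * c_in + c_in * c_out


if __name__ == "__main__":
    # Three hypothetical VLM layers, each giving a 2-dim feature for one token.
    feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(fuse_layers(feats, layer_logits=[0.0, 0.0, 0.0]))  # equal weights
    print(fuse_layers(feats, layer_logits=[0.0, 0.0, 2.0], temperature=0.1))  # last layer dominates
    print(standard_conv1d_params(512, 512, 3), separable_conv1d_params(512, 512, 3))
```

The parameter-count helpers show why a depthwise-separable convolution suits a connector with a small budget: the kernel size multiplies only the input channels instead of the full input-by-output product.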
## Training
Trained in three stages:
1. **Pre-training:** Cross-modal alignment on [4M text-image pairs](https://huggingface.co/datasets/Amshaker/Mobile-O-Pre-Train)
2. **SFT:** Supervised fine-tuning on [~105K curated pairs](https://huggingface.co/datasets/Amshaker/Mobile-O-SFT)
3. **Post-training:** Unified multimodal training on [~105K quadruplets](https://huggingface.co/datasets/Amshaker/Mobile-O-Post-Train)
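The three stages above map to three dataset repositories, which could be fetched with the same `snapshot_download` call used for the checkpoint. The sketch below assumes `huggingface_hub` is installed. The short stage names and the `data/<stage>` folder layout are hypothetical choices made for this example, not something the training code is known to expect.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str          # short label chosen for this example
    dataset_repo: str  # Hugging Face dataset id, as linked in the list above


STAGES = [
    Stage("pre-train", "Amshaker/Mobile-O-Pre-Train"),
    Stage("sft", "Amshaker/Mobile-O-SFT"),
    Stage("post-train", "Amshaker/Mobile-O-Post-Train"),
]


def download_stage(stage: Stage, root: str = "data") -> str:
    """Download one stage's dataset snapshot and return its local folder."""
    # Imported lazily so the stage table can be inspected without the dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=stage.dataset_repo,
        repo_type="dataset",
        local_dir=f"{root}/{stage.name}",
    )


if __name__ == "__main__":
    # List the stages without downloading anything.
    for stage in STAGES:
        print(f"{stage.name}: {stage.dataset_repo}")
```

The pre-training set holds about 4M pairs, so it is worth checking available disk space before calling `download_stage` on it.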
## Related Resources
| Resource | Link |
|----------|------|
| 🤗 Mobile-O-1.5B | [Model](https://huggingface.co/Amshaker/Mobile-O-1.5B) |
| 🤗 Mobile-O-0.5B-iOS | [iOS Components](https://huggingface.co/Amshaker/Mobile-O-0.5B-iOS) |
| 📱 iOS App Source Code | [Mobile-O-App](https://github.com/Amshaker/Mobile-O/tree/main/Mobile-O-App) |
## Citation
```bibtex
@article{shaker2026mobileo,
  title={Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device},
  author={Shaker, Abdelrahman and Heakl, Ahmed and Muhammad, Jaseel and Thawkar, Ritesh and Thawakar, Omkar and Li, Senmao and Cholakkal, Hisham and Reid, Ian and Xing, Eric P. and Khan, Salman and Khan, Fahad Shahbaz},
  journal={arXiv preprint arXiv:2602.20161},
  year={2026}
}
```
## License
Released under [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/). For research purposes only.