| # Oculus 0.1 Architecture |
|
|
| ## Overview |
| Oculus is a ~3.3B parameter multimodal vision-language model that combines two vision encoders (DINOv3 and SigLIP2) with the LFM2.5-1.2B language model. It is designed to run on Apple Silicon using MLX. |
|
|
| ## Architecture Components |
|
|
| ### 1. DINOv3 Encoder (ViT-L/16) |
| - **Model**: DINOv3 ViT-L/16 (pretrained) |
| - **Parameters**: ~1.7B |
| - **Input**: 224×224 images |
| - **Output**: 197 tokens (1 CLS + 196 patches) |
| - **Patch Grid**: 14×14 |
| - **Feature Dimension**: 1024D |
| - **Capabilities**: Universal vision backbone, dense prediction |
|
|
| ### 2. SigLIP2 Encoder (SO400M) |
| - **Model**: SigLIP2 SO400M (pretrained) |
| - **Parameters**: ~400M |
| - **Input**: 384×384 images |
| - **Output**: 576 patch tokens |
| - **Patch Grid**: 24×24 |
| - **Feature Dimension**: 1152D |
| - **Capabilities**: Vision-language understanding, fine-grained features |
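
The token counts listed for both encoders follow from the input resolution and the patch size. A minimal sketch of that arithmetic, assuming square inputs and non-overlapping patches. The SigLIP2 patch size of 16 is inferred from 384 / 24 rather than stated above, and `patch_grid` and `token_count` are illustrative helper names:

```python
def patch_grid(image_size: int, patch_size: int) -> int:
    """Number of patches along one side of a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    return image_size // patch_size


def token_count(image_size: int, patch_size: int, extra_tokens: int = 0) -> int:
    """Patch tokens plus any extra tokens (for example a CLS token)."""
    side = patch_grid(image_size, patch_size)
    return side * side + extra_tokens


# DINOv3 ViT-L/16 at 224x224: 14x14 patches plus one CLS token.
dino_tokens = token_count(224, 16, extra_tokens=1)

# SigLIP2 at 384x384: 24x24 patches, no CLS token (patch size 16 is inferred).
siglip_tokens = token_count(384, 16)

print(dino_tokens, siglip_tokens)
```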
|
|
| ### 3. Feature Fusion |
| - **Method**: Concatenation |
| - **Input**: DINOv3 patches (1024D) + SigLIP2 patches (1152D) |
| - **Output**: 2176D per spatial location |
| - **Note**: SigLIP2 features resampled to 14×14 to match DINOv3 |
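
The fusion step can be sketched in plain Python. The resampling method is not specified above, so this sketch assumes adaptive average pooling. `adaptive_avg_pool_grid` and `fuse` are hypothetical names, and grids are nested lists indexed as `[row][col][channel]` rather than MLX arrays:

```python
def adaptive_avg_pool_grid(grid, out_size):
    """Average-pool a square grid of feature vectors down to out_size x out_size.

    Assumption: adaptive average pooling stands in for the unspecified
    resampling method.
    """
    in_size = len(grid)
    out = []
    for i in range(out_size):
        r0 = (i * in_size) // out_size
        r1 = ((i + 1) * in_size + out_size - 1) // out_size  # integer ceil
        row = []
        for j in range(out_size):
            c0 = (j * in_size) // out_size
            c1 = ((j + 1) * in_size + out_size - 1) // out_size
            cells = [grid[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            # Channel-wise mean over all cells that fall in this bin.
            row.append([sum(vals) / len(cells) for vals in zip(*cells)])
        out.append(row)
    return out


def fuse(dino_grid, siglip_grid):
    """Resample SigLIP2 features to the DINOv3 grid, then concatenate channels."""
    size = len(dino_grid)
    pooled = adaptive_avg_pool_grid(siglip_grid, size)
    return [[dino_grid[i][j] + pooled[i][j] for j in range(size)]
            for i in range(size)]


# Dummy features with the dimensions listed above.
dino = [[[0.0] * 1024 for _ in range(14)] for _ in range(14)]
siglip = [[[1.0] * 1152 for _ in range(24)] for _ in range(24)]
fused = fuse(dino, siglip)
print(len(fused), len(fused[0]), len(fused[0][0]))
```

Each of the 14×14 locations ends up with a 1024 + 1152 = 2176 dimensional vector. Bilinear interpolation would be an equally plausible choice for the resampling step.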
|
|
| ### 4. Vision-Language Projector |
| - **Type**: 2-layer MLP with GELU |
| - **Input**: 2176D |
| - **Hidden**: 4352D |
| - **Output**: 1536D (LFM2.5 embedding dimension) |
| - **Parameters**: ~16M (2176×4352 + 4352×1536 weights, plus biases) |
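
The size of the projector can be derived from the layer widths listed above. A minimal sketch, assuming two fully connected layers with bias terms and no normalization layers (`linear_params` and `mlp_params` are illustrative helpers):

```python
def linear_params(in_dim: int, out_dim: int, bias: bool = True) -> int:
    """Parameter count of one fully connected layer."""
    return in_dim * out_dim + (out_dim if bias else 0)


def mlp_params(dims, bias: bool = True) -> int:
    """Parameter count of an MLP whose layer widths are given by dims."""
    return sum(linear_params(a, b, bias) for a, b in zip(dims, dims[1:]))


# Projector: 2176 -> 4352 -> 1536 (GELU has no parameters).
projector = mlp_params([2176, 4352, 1536])
print(f"{projector:,}")
```

Under these assumptions the count comes to 16,160,512, or about 16.2M.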
|
|
| ### 5. LFM2.5-1.2B Language Model |
| - **Model**: LFM2.5-1.2B-Base (pretrained) |
| - **Parameters**: ~1.2B |
| - **Architecture**: Hybrid transformer (full_attention + conv layers) |
| - **Embedding Dimension**: 1536D |
| - **Depth**: 16 layers |
| - **Attention Heads**: 24 |
| - **Vocab Size**: 131072 |
| - **Context Length**: 32768 tokens |
| - **Why LFM2.5**: 3x faster training, 2x faster inference than Qwen3 on CPU |
| |
| ### 6. Task-Specific Heads |
| |
| #### Segmentation Head |
| - **Type**: MLP |
| - **Input**: 2176D |
| - **Hidden**: 256D |
| - **Output**: num_classes (e.g., 150 for ADE20K) |
| - **Output Shape**: (batch, 14, 14, num_classes) |
| |
| #### Classification Head |
| - **Type**: MLP |
| - **Input**: 2176D |
| - **Hidden**: 256D |
| - **Output**: num_classes (e.g., 1000 for ImageNet) |
| - **Uses**: CLS token from fused features |
|
|
| #### Detection Head |
| - **Type**: MLP |
| - **Input**: 2176D |
| - **Hidden**: 256D |
| - **Outputs**: |
| - Class logits: (batch, 196, anchors, num_classes) |
| - Box predictions: (batch, 196, anchors, 4) |
| |
| #### OCR Head |
| - **Type**: CNN + MLP |
| - **Input**: 2176D |
| - **Outputs**: |
| - Text logits: (batch, 14, 14, max_seq_len) |
| - Geometry: (batch, 196, 4) [x, y, w, h] |
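
The detection head's output shapes follow from the 14×14 grid and the number of anchors per location. A small sketch of that bookkeeping, using 9 anchors and 80 classes as in the shape table later in this document. The flat per-location layout (each anchor's class logits followed by its 4 box values) is an assumption, as are the helper names:

```python
def detection_output_shapes(batch, grid=14, anchors=9, num_classes=80):
    """Shapes of the two detection outputs for a grid x grid feature map."""
    locations = grid * grid
    return {
        "class_logits": (batch, locations, anchors, num_classes),
        "boxes": (batch, locations, anchors, 4),
    }


def split_anchor_outputs(vec, anchors, num_classes):
    """Split one location's flat output into per-anchor logits and boxes.

    Assumed layout: [anchor0 classes, anchor0 box, anchor1 classes, ...].
    """
    stride = num_classes + 4
    if len(vec) != anchors * stride:
        raise ValueError("unexpected vector length")
    logits, boxes = [], []
    for a in range(anchors):
        chunk = vec[a * stride:(a + 1) * stride]
        logits.append(chunk[:num_classes])
        boxes.append(chunk[num_classes:])
    return logits, boxes


shapes = detection_output_shapes(batch=2)
logits, boxes = split_anchor_outputs(list(range(9 * 84)), anchors=9, num_classes=80)
print(shapes["class_logits"], shapes["boxes"])
```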
| |
| ## Model Flow |
| |
| ``` |
| Input Image 1 (224×224) ──→ DINOv3 Encoder |
|                                    ↓ |
|                           196 patches (14×14) |
|                             1024D per patch |
|                                    ↓ |
|                                    └──────────────────┐ |
|                                                       │ |
| Input Image 2 (384×384) ──→ SigLIP2 Encoder           │ |
|                                    ↓                  │ |
|                           576 patches (24×24)         │ |
|                             1152D per patch           │ |
|                                    ↓                  │ |
|                            Resample to 14×14          │ |
|                                    ↓                  │ |
|                                    └───────────→ Concatenate ──→ 2176D features |
|                                                       ↓ |
|                                            Vision Projector (MLP) |
|                                                       ↓ |
|                                               1536D embeddings |
|                                                       │ |
|         ┌──────────────────────┬──────────────────────┴─────────────────┐ |
|         ↓                      ↓                                        ↓ |
|   Segmentation          Classification                              LFM2.5 LM |
|       Head                   Head                                    (1.2B) |
|         ↓                      ↓                                        ↓ |
| (14×14, classes)          (class_id)                               Text Output |
|                                                                   (Caption/VQA) |
|         ↓                      ↓                                        ↓ |
|   Segmentation          Classification                              Generated |
|    Predictions            Predictions                                 Text |
| |
|         ┌───────────────────────┐ |
|         ↓                       ↓ |
|  Detection Head             OCR Head |
|         ↓                       ↓ |
| (boxes + classes)       (text + geometry) |
| ``` |
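
The flow above can also be read as a sequence of tensor shapes. A minimal sketch that traces them from the encoders to the projector output, assuming the DINOv3 CLS token is dropped before fusion and tensors are laid out as `(batch, tokens, channels)`. `trace_shapes` is a hypothetical helper, not part of any Oculus API:

```python
def trace_shapes(batch=1, grid=14, siglip_grid=24,
                 dino_dim=1024, siglip_dim=1152, lm_dim=1536):
    """Return (stage, shape) pairs for the vision path of the model."""
    tokens = grid * grid
    fused_dim = dino_dim + siglip_dim
    return [
        ("dino_patches", (batch, tokens, dino_dim)),
        ("siglip_patches", (batch, siglip_grid * siglip_grid, siglip_dim)),
        ("siglip_resampled", (batch, tokens, siglip_dim)),
        ("fused", (batch, tokens, fused_dim)),
        ("projected", (batch, tokens, lm_dim)),
    ]


for stage, shape in trace_shapes(batch=2):
    print(f"{stage:18s} {shape}")
```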
| |
| ## Parameter Count |
|
|
| | Component | Parameters | |
| |-----------|------------| |
| | DINOv3 Encoder | 1,700,000,000 | |
| | SigLIP2 Encoder | 400,000,000 | |
| | Projector | 16,000,000 | |
| | LFM2.5 Language Model | 1,200,000,000 | |
| | Segmentation Head | 500,000 | |
| | Classification Head | 300,000 | |
| | Detection Head | 500,000 | |
| | OCR Head | 300,000 | |
| | **Total** | **~3,317,600,000** | |
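
The three pretrained backbones account for almost all of the parameters. A short sketch that sums their rows from the table and reports each component's share (`total_params` and `share` are illustrative helpers):

```python
BACKBONES = {
    "DINOv3 Encoder": 1_700_000_000,
    "SigLIP2 Encoder": 400_000_000,
    "LFM2.5 Language Model": 1_200_000_000,
}


def total_params(components: dict) -> int:
    """Sum the parameter counts of all listed components."""
    return sum(components.values())


def share(components: dict, name: str) -> float:
    """Fraction of the listed parameters that belongs to one component."""
    return components[name] / total_params(components)


backbone_total = total_params(BACKBONES)
for name in BACKBONES:
    print(f"{name:24s} {share(BACKBONES, name):.1%}")
print(f"{'Backbones combined':24s} {backbone_total:,}")
```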
|
|
| ## Training Strategy |
|
|
| ### Stage 1: Connector Pretraining |
| - **Freeze**: All vision encoders, LFM2.5 |
| - **Train**: Projector only |
| - **Data**: Image-caption pairs (CC3M, LAION) |
| - **Goal**: Align vision and language representations |
| - **Batch Size**: 8-16 |
| - **Learning Rate**: 1e-3 |
|
|
| ### Stage 2: Head Training |
| - **Freeze**: Encoders, LFM2.5, Projector |
| - **Train**: Task heads only |
| - **Data**: Task-specific datasets |
| - **Goal**: Learn task-specific heads |
| - **Batch Size**: 8-16 |
| - **Learning Rate**: 1e-3 |
|
|
| ### Stage 3: Full Fine-tuning |
| - **Freeze**: None |
| - **Train**: All components |
| - **Data**: Multi-task or specific task |
| - **Goal**: End-to-end optimization |
| - **Learning Rate**: 1e-5 (encoders), 1e-4 (heads) |
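
The three stages can be written down as a small schedule that records which components are trainable in each. This is a sketch only: the component names and the `Stage` class are hypothetical, and learning rates are filled in only where the text above states them (Stage 3 gives no rate for the projector or the language model, so none is set here):

```python
from dataclasses import dataclass, field

COMPONENTS = frozenset({"dinov3", "siglip2", "projector", "lfm2_5", "heads"})


@dataclass
class Stage:
    name: str
    trainable: frozenset
    # Per-component learning rates; only values stated in the text are listed.
    learning_rates: dict = field(default_factory=dict)

    def frozen(self) -> frozenset:
        """Components that receive no gradient updates in this stage."""
        return COMPONENTS - self.trainable


STAGES = [
    Stage("connector_pretraining", frozenset({"projector"}), {"projector": 1e-3}),
    Stage("head_training", frozenset({"heads"}), {"heads": 1e-3}),
    Stage("full_finetuning", COMPONENTS,
          {"dinov3": 1e-5, "siglip2": 1e-5, "heads": 1e-4}),
]

for stage in STAGES:
    print(stage.name, "frozen:", sorted(stage.frozen()))
```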
|
|
| ## Memory Requirements |
|
|
| | Mode | Memory | |
| |------|--------| |
| | Inference | ~10 GB | |
| | Training (frozen encoders) | ~12 GB | |
| | Training (full) | ~30 GB | |
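
For a rough sense of where these figures come from, the memory taken by the weights alone is the parameter count times the bytes per parameter. A back-of-the-envelope sketch, assuming the combined 3.3B backbone parameters and standard 16-bit and 32-bit storage. It ignores activations, the KV cache, gradients, and optimizer state, and the precision behind the table above is not stated:

```python
def weight_memory_gb(n_params: int, bytes_per_param: int) -> float:
    """Memory needed to store the weights only, in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9


BACKBONE_PARAMS = 3_300_000_000  # DINOv3 + SigLIP2 + LFM2.5 rows of the parameter table

fp16_gb = weight_memory_gb(BACKBONE_PARAMS, 2)  # 16-bit weights
fp32_gb = weight_memory_gb(BACKBONE_PARAMS, 4)  # 32-bit weights
print(f"fp16: {fp16_gb:.1f} GB, fp32: {fp32_gb:.1f} GB")
```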
|
|
| ## Why LFM2.5? |
|
|
| - **3x faster training** than Qwen3 on CPU |
| - **2x faster decode/prefill** on CPU |
| - **Optimized for edge** - the language model alone runs in under 1 GB of memory |
| - **Native MLX support** |
| - **Hybrid architecture** - mix of attention and conv layers |
|
|
| ## Comparison with Alternatives |
|
|
| | Aspect | Oculus (LFM2.5) | Oculus (Qwen2) | |
| |--------|---------------|--------------| |
| | LM Parameters | 1.2B | 1.5B | |
| | Training Speed | 3x faster | Baseline | |
| | Inference Speed | 2x faster | Baseline | |
| | MLX Support | Native | Via mlx-lm | |
| | Edge Performance | Excellent | Good | |
|
|
| ## Supported Tasks |
|
|
| | Task | Input | Output | |
| |------|-------|--------| |
| | Captioning | Image + prompt | Generated text | |
| | VQA | Image + question | Answer text | |
| | Segmentation | Image | Class per patch (14×14 grid) | |
| | Classification | Image | Class label | |
| | Detection | Image | Boxes + classes | |
| | OCR | Image | Text + bounding boxes | |
| | Feature Extraction | Image | 2176D features | |
|
|
| ## Input/Output Shapes |
|
|
| | Input | Shape | |
| |-------|-------| |
| | DINOv3 Image | (B, 3, 224, 224) | |
| | SigLIP2 Image | (B, 3, 384, 384) | |
| | Input IDs | (B, seq_len) | |
| |
| | Output | Shape | |
| |--------|-------| |
| | Generated Text | (B, seq_len + new_tokens) | |
| | Segmentation | (B, 14, 14) | |
| | Classification | (B,) | |
| | Detection | (B, 196, 9, 80), (B, 196, 9, 4) | |
| | OCR Text | (B, 14, 14, max_seq_len) | |
| |