	# Oculus 0.1 Architecture

	## Overview
	Oculus is a ~3.3B parameter multimodal vision-language model that combines two vision encoders (DINOv3 and SigLIP2) with the LFM2.5-1.2B language model. It is designed to run on Apple Silicon using MLX.

	## Architecture Components

	### 1. DINOv3 Encoder (ViT-L/16)
	- Model: DINOv3 ViT-L/16 (pretrained)
	- Parameters: ~1.7B
	- Input: 224×224 images
	- Output: 197 tokens (1 CLS + 196 patches)
	- Patch Grid: 14×14
	- Feature Dimension: 1024D
	- Capabilities: Universal vision backbone, dense prediction

	### 2. SigLIP2 Encoder (SO400M)
	- Model: SigLIP2 SO400M (pretrained)
	- Parameters: ~400M
	- Input: 384×384 images
	- Output: 576 patch tokens
	- Patch Grid: 24×24
	- Feature Dimension: 1152D
	- Capabilities: Vision-language understanding, fine-grained features
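
The token counts for both encoders follow from the input resolution and the patch size. A minimal sketch of that arithmetic is shown here; the helper name `patch_tokens` is hypothetical, and the SigLIP2 patch size of 16 is inferred from 384 / 24 rather than stated in this document.

```python
def patch_tokens(image_size: int, patch_size: int, cls_token: bool = False):
    """Return (grid_side, num_tokens) for a square image cut into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    grid = image_size // patch_size
    return grid, grid * grid + (1 if cls_token else 0)


# DINOv3 ViT-L/16: 224 px input, 16 px patches, plus one CLS token.
dino_grid, dino_tokens = patch_tokens(224, 16, cls_token=True)
# SigLIP2: 384 px input; patch size 16 is an inference from the 24×24 grid.
siglip_grid, siglip_tokens = patch_tokens(384, 16)

print(dino_grid, dino_tokens)      # 14 197
print(siglip_grid, siglip_tokens)  # 24 576
```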

	### 3. Feature Fusion
	- Method: Concatenation
	- Input: DINOv3 patches (1024D) + SigLIP2 patches (1152D)
	- Output: 2176D per spatial location
	- Note: SigLIP2 features resampled to 14×14 to match DINOv3
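
The fusion step can be sketched in plain Python as follows. The document does not say which resampling method is used, so adaptive average pooling is an assumption here, and the function names are hypothetical. Toy channel widths stand in for 1024 and 1152 so the example runs quickly.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


def adaptive_avg_pool_2d(grid, out_size):
    """Pool a square grid of feature vectors down to out_size × out_size."""
    in_size = len(grid)
    channels = len(grid[0][0])
    pooled = []
    for i in range(out_size):
        r0, r1 = (i * in_size) // out_size, ceil_div((i + 1) * in_size, out_size)
        row = []
        for j in range(out_size):
            c0, c1 = (j * in_size) // out_size, ceil_div((j + 1) * in_size, out_size)
            cells = [grid[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append([sum(v[k] for v in cells) / len(cells) for k in range(channels)])
        pooled.append(row)
    return pooled


def fuse(dino_grid, siglip_grid):
    """Resample SigLIP2 to the DINOv3 grid, then concatenate per location."""
    resampled = adaptive_avg_pool_2d(siglip_grid, len(dino_grid))
    return [[d + s for d, s in zip(d_row, s_row)]
            for d_row, s_row in zip(dino_grid, resampled)]


# Toy widths (4 and 3) replace 1024 and 1152; the grid sizes are the real ones.
dino = [[[0.0] * 4 for _ in range(14)] for _ in range(14)]
siglip = [[[1.0] * 3 for _ in range(24)] for _ in range(24)]
fused = fuse(dino, siglip)
print(len(fused), len(fused[0]), len(fused[0][0]))  # 14 14 7
```

With the real widths, the last dimension would be 1024 + 1152 = 2176.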

	### 4. Vision-Language Projector
	- Type: 2-layer MLP with GELU
	- Input: 2176D
	- Hidden: 4352D
	- Output: 1536D (LFM2.5 embedding dimension)
	- Parameters: ~16M (2176×4352 + 4352×1536 weights, plus biases)
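
A minimal pure-Python sketch of a projector with this shape is given below. It assumes two bias-carrying linear layers with an exact (erf-based) GELU in between; the initialization, function names, and toy widths are illustrative choices, not details taken from the Oculus code.

```python
import math
import random


def gelu(x: float) -> float:
    """Exact GELU using the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, weight, bias):
    """y = W x + b, with weight stored as a list of output rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def init_mlp(d_in, d_hidden, d_out, seed=0):
    """Create two (weight, bias) pairs with small random weights (assumed init)."""
    rng = random.Random(seed)

    def layer(n_in, n_out):
        w = [[rng.gauss(0.0, 0.02) for _ in range(n_in)] for _ in range(n_out)]
        return w, [0.0] * n_out

    return layer(d_in, d_hidden), layer(d_hidden, d_out)


def project(x, params):
    """Linear -> GELU -> Linear for one feature vector."""
    (w1, b1), (w2, b2) = params
    hidden = [gelu(h) for h in linear(x, w1, b1)]
    return linear(hidden, w2, b2)


def count_params(d_in, d_hidden, d_out):
    """Weights plus biases for Linear(d_in, d_hidden) -> Linear(d_hidden, d_out)."""
    return d_in * d_hidden + d_hidden + d_hidden * d_out + d_out


# Toy widths (8 -> 16 -> 6) keep the demo fast; this section lists 2176 -> 4352 -> 1536.
params = init_mlp(8, 16, 6)
out = project([1.0] * 8, params)
print(len(out))  # 6
```

Calling `count_params(2176, 4352, 1536)` gives the parameter count for the documented widths under the same bias assumption.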

	### 5. LFM2.5-1.2B Language Model
	- Model: LFM2.5-1.2B-Base (pretrained)
	- Parameters: ~1.2B
	- Architecture: Hybrid transformer (full_attention + conv layers)
	- Embedding Dimension: 1536D
	- Depth: 16 layers
	- Attention Heads: 24
	- Vocab Size: 131072
	- Context Length: 32768 tokens
	- Why LFM2.5: 3x faster training, 2x faster inference than Qwen3 on CPU
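
The figures above imply a few derived quantities, such as the per-head dimension and the size of the token embedding table. The sketch below simply records the listed values in a config object; the class and field names are hypothetical, and the assumption that 196 projected patch tokens are placed in the context ahead of the text is an illustration rather than something this document states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LMConfig:
    """Hyperparameters copied from the list above; field names are hypothetical."""
    hidden_size: int = 1536
    num_layers: int = 16
    num_heads: int = 24
    vocab_size: int = 131072
    max_context: int = 32768

    @property
    def head_dim(self) -> int:
        if self.hidden_size % self.num_heads != 0:
            raise ValueError("hidden_size must be divisible by num_heads")
        return self.hidden_size // self.num_heads

    @property
    def embedding_params(self) -> int:
        return self.vocab_size * self.hidden_size


cfg = LMConfig()
visual_tokens = 14 * 14  # assumed: one projected token per fused patch

print(cfg.head_dim)                     # 64
print(cfg.embedding_params)             # 201326592
print(cfg.max_context - visual_tokens)  # 32572
```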

	### 6. Task-Specific Heads

	#### Segmentation Head
	- Type: MLP
	- Input: 2176D
	- Hidden: 256D
	- Output: num_classes (e.g., 150 for ADE20K)
	- Output Shape: (batch, 14, 14, num_classes)
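
The head predicts one class per patch, so producing a pixel-level mask needs an upsampling step that this document does not specify. As one possible approach, the sketch below uses nearest-neighbor upsampling, where each patch label is repeated over its 16×16 pixel footprint; the function name is hypothetical.

```python
def upsample_nearest(class_map, patch_size=16):
    """Expand an H×W grid of class ids so each cell covers patch_size × patch_size pixels."""
    out = []
    for row in class_map:
        wide = [cls for cls in row for _ in range(patch_size)]
        out.extend(list(wide) for _ in range(patch_size))
    return out


# A 14×14 grid of class ids (all zeros except one patch) becomes 224×224.
grid = [[0] * 14 for _ in range(14)]
grid[1][2] = 7
pixels = upsample_nearest(grid)

print(len(pixels), len(pixels[0]))                     # 224 224
print(pixels[16][32], pixels[31][47], pixels[32][48])  # 7 7 0
```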

	#### Classification Head
	- Type: MLP
	- Input: 2176D
	- Hidden: 256D
	- Output: num_classes (e.g., 1000 for ImageNet)
	- Uses: CLS token from fused features

	#### Detection Head
	- Type: MLP
	- Input: 2176D
	- Hidden: 256D
	- Outputs:
	  - Class logits: (batch, 196, anchors, num_classes)
	  - Box predictions: (batch, 196, anchors, 4)
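
To make the two output shapes concrete, the sketch below splits one location's flat prediction vector into per-anchor class logits and box values. The memory layout (each anchor's logits followed by its four box values) is an assumption for illustration, as is the function name; the values 9 and 80 are the anchor and class counts that appear in the output-shape table.

```python
def split_detection_output(flat, anchors, num_classes):
    """Split one location's flat vector into per-anchor class logits and boxes.

    Assumed layout: for each anchor, num_classes logits followed by 4 box values.
    """
    stride = num_classes + 4
    if len(flat) != anchors * stride:
        raise ValueError(f"expected {anchors * stride} values, got {len(flat)}")
    logits, boxes = [], []
    for a in range(anchors):
        chunk = flat[a * stride:(a + 1) * stride]
        logits.append(chunk[:num_classes])
        boxes.append(chunk[num_classes:])
    return logits, boxes


anchors, num_classes = 9, 80
flat = [0.0] * (anchors * (num_classes + 4))
logits, boxes = split_detection_output(flat, anchors, num_classes)

print(len(flat))                    # 756
print(len(logits), len(logits[0]))  # 9 80
print(len(boxes), len(boxes[0]))    # 9 4
```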

	#### OCR Head
	- Type: CNN + MLP
	- Input: 2176D
	- Outputs:
	  - Text logits: (batch, 14, 14, max_seq_len)
	  - Geometry: (batch, 196, 4) [x, y, w, h]

	## Model Flow

	```
	Input Image 1 (224×224) ──→ DINOv3 Encoder
	                                   ↓
	                          196 patches (14×14)
	                            1024D per patch
	                                   ↓
	                                   └────────────┐
	                                                │
	Input Image 2 (384×384) ──→ SigLIP2 Encoder     │
	                                   ↓            │
	                          576 patches (24×24)   │
	                            1152D per patch     │
	                                   ↓            │
	                           Resample to 14×14    │
	                                   ↓            │
	                                   └────── Concatenate ──→ 2176D features
	                                                │
	                                                ↓
	                                     Vision Projector (MLP)
	                                                │
	                                                ↓
	                                        1536D embeddings
	                                                │
	        ┌──────────────────┬────────────────────┴────────────────────┐
	        ↓                  ↓                                         ↓
	  Segmentation      Classification                               LFM2.5 LM
	      Head               Head                                     (1.2B)
	        ↓                  ↓                                         ↓
	(14×14, classes)      (class_id)                                Text Output
	                                                               (Caption/VQA)
	        ↓                  ↓                                         ↓
	  Segmentation      Classification                               Generated
	   Predictions        Predictions                                  Text

	        ┌───────────────────────┐
	        ↓                       ↓
	 Detection Head             OCR Head
	        ↓                       ↓
	(boxes + classes)       (text + geometry)
	```
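
The vision path in the diagram can also be read as a sequence of tensor shapes. The sketch below only does the shape bookkeeping with the sizes documented above; the function name is hypothetical, the CLS token is left out, and no model code is involved.

```python
def trace_shapes(batch: int = 1):
    """Return (stage, shape) pairs for the vision path, using the documented sizes."""
    dino_grid, siglip_grid = 224 // 16, 384 // 16
    dino_dim, siglip_dim, lm_dim = 1024, 1152, 1536
    n = dino_grid * dino_grid
    return [
        ("DINOv3 patches", (batch, n, dino_dim)),
        ("SigLIP2 patches", (batch, siglip_grid * siglip_grid, siglip_dim)),
        ("SigLIP2 resampled", (batch, n, siglip_dim)),
        ("Fused features", (batch, n, dino_dim + siglip_dim)),
        ("Projected embeddings", (batch, n, lm_dim)),
    ]


for stage, shape in trace_shapes(batch=2):
    print(f"{stage:<22}{shape}")
```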

	## Parameter Count

	| Component | Parameters |
	|-----------|------------|
	| DINOv3 Encoder | 1,700,000,000 |
	| SigLIP2 Encoder | 400,000,000 |
	| Projector | 16,200,000 |
	| LFM2.5 Language Model | 1,200,000,000 |
	| Segmentation Head | 500,000 |
	| Classification Head | 300,000 |
	| Detection Head | 500,000 |
	| OCR Head | 300,000 |
	| Total | ~3,317,800,000 |

	## Training Strategy

	### Stage 1: Connector Pretraining
	- Freeze: All vision encoders, LFM2.5
	- Train: Projector only
	- Data: Image-caption pairs (CC3M, LAION)
	- Goal: Align vision and language representations
	- Batch Size: 8-16
	- Learning Rate: 1e-3

	### Stage 2: Head Training
	- Freeze: Encoders, LFM2.5, Projector
	- Train: Task heads only
	- Data: Task-specific datasets
	- Goal: Learn task-specific heads
	- Batch Size: 8-16
	- Learning Rate: 1e-3

	### Stage 3: Full Fine-tuning
	- Freeze: None
	- Train: All components
	- Data: Multi-task or specific task
	- Goal: End-to-end optimization
	- Learning Rate: 1e-5 (encoders), 1e-4 (heads)
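
The three stages can be written down as a small lookup table that a training script might consult. This is a sketch under assumptions: the component keys and helper names are hypothetical, and learning rates are recorded only where the text states them (Stage 3 gives none for the projector or the language model, so those return `None`).

```python
COMPONENTS = {"dinov3", "siglip2", "projector", "lfm2_5", "heads"}

# Trainable sets and learning rates transcribed from the three stages above.
STAGES = {
    1: {"train": {"projector"}, "lr": {"projector": 1e-3}},
    2: {"train": {"heads"}, "lr": {"heads": 1e-3}},
    3: {"train": set(COMPONENTS),
        "lr": {"dinov3": 1e-5, "siglip2": 1e-5, "heads": 1e-4}},
}


def is_trainable(stage: int, component: str) -> bool:
    """True if the component receives gradient updates in the given stage."""
    if component not in COMPONENTS:
        raise KeyError(component)
    return component in STAGES[stage]["train"]


def learning_rate(stage: int, component: str):
    """Return the documented learning rate, or None where the text gives none."""
    if not is_trainable(stage, component):
        return None
    return STAGES[stage]["lr"].get(component)


for stage in sorted(STAGES):
    print(stage, sorted(STAGES[stage]["train"]))
# 1 ['projector']
# 2 ['heads']
# 3 ['dinov3', 'heads', 'lfm2_5', 'projector', 'siglip2']
```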

	## Memory Requirements

	| Mode | Memory |
	|------|--------|
	| Inference | ~10 GB |
	| Training (frozen encoders) | ~12 GB |
	| Training (full) | ~30 GB |
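
As a rough cross-check on figures like these, the memory taken by weights alone is the parameter count times the bytes per parameter. The sketch below is a lower bound only: it ignores activations, the KV cache, gradients, and optimizer state, so real usage will be higher. The function name is hypothetical, and the 1.2B figure is the language-model count from the parameter table.

```python
def weight_memory_gb(num_params: int, bytes_per_param: float) -> float:
    """Memory for weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


lm_params = 1_200_000_000  # LFM2.5 language model
for name, width in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1), ("4-bit", 0.5)]:
    print(f"{name:<10}{weight_memory_gb(lm_params, width):.1f} GB")
# fp32      4.8 GB
# fp16/bf16 2.4 GB
# int8      1.2 GB
# 4-bit     0.6 GB
```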

	## Why LFM2.5?

	- 3x faster training than Qwen3 on CPU
	- 2x faster decode/prefill on CPU
	- Optimized for edge: runs in under 1 GB of memory
	- Native MLX support
	- Hybrid architecture: a mix of attention and convolution layers

	## Comparison with Alternatives

	| Aspect | Oculus (LFM2.5) | Oculus (Qwen2) |
	|--------|-----------------|----------------|
	| LM Parameters | 1.2B | 1.5B |
	| Training Speed | 3x faster | Baseline |
	| Inference Speed | 2x faster | Baseline |
	| MLX Support | Native | Via mlx-lm |
	| Edge Performance | Excellent | Good |

	## Supported Tasks

	| Task | Input | Output |
	|------|-------|--------|
	| Captioning | Image + prompt | Generated text |
	| VQA | Image + question | Answer text |
	| Segmentation | Image | Class per patch (14×14 grid) |
	| Classification | Image | Class label |
	| Detection | Image | Boxes + classes |
	| OCR | Image | Text + bounding boxes |
	| Feature Extraction | Image | 2176D features |

	## Input/Output Shapes

	| Input | Shape |
	|-------|-------|
	| DINOv3 Image | (B, 3, 224, 224) |
	| SigLIP2 Image | (B, 3, 384, 384) |
	| Input IDs | (B, seq_len) |

	| Output | Shape |
	|--------|-------|
	| Generated Text | (B, seq_len + new_tokens) |
	| Segmentation | (B, 14, 14) |
	| Classification | (B,) |
	| Detection | (B, 196, 9, 80), (B, 196, 9, 4) |
	| OCR Text | (B, 14, 14, max_seq_len) |
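
The segmentation output of shape (B, 14, 14) holds class ids, whereas the segmentation head emits logits of shape (B, 14, 14, num_classes). One common way to get from the second to the first is an argmax over the class axis, sketched below on nested lists; whether Oculus does exactly this is an assumption, and the function names are hypothetical.

```python
def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def segmentation_ids(logits):
    """Map (B, H, W, num_classes) nested lists to (B, H, W) class ids."""
    return [[[argmax(cell) for cell in row] for row in image] for image in logits]


# One image, a 14×14 grid, 3 classes; class 2 has the highest logit everywhere.
logits = [[[[0.1, 0.2, 0.9] for _ in range(14)] for _ in range(14)]]
ids = segmentation_ids(logits)
print(len(ids), len(ids[0]), len(ids[0][0]), ids[0][0][0])  # 1 14 14 2
```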