---
license: other
license_name: openarlow
license_link: LICENSE
language:
- en
pipeline_tag: text-generation
library_name: transformers
---

# THIS MODEL IS NOT OFFICIAL BUT RATHER A PROOF OF CONCEPT OF THE ARLOW VISION ARCHITECTURE

**Arlow Vision is the standalone vision-pretraining stage for the Arlow multimodal stack. It trains the visual tower to produce visual tokens that match the Arlow text backbone width and can later be plugged into a full vision-language model.**

## This model requires a specific Transformers fork because the architecture code has not been merged into official Transformers yet.

**Special transformers fork:** https://github.com/yuchenxie4645/transformers/tree/ArlowVL

```bash
git clone --branch ArlowVL --single-branch https://github.com/yuchenxie4645/transformers
cd transformers
pip install -e .
```

## Training Summary

| Item | Value |
| --- | --- |
| Objective | Masked autoencoding over visual patch tokens |
| Modalities | Images, with optional video mixed into training |
| Output width | `3072` |
| Next stage | Multimodal alignment with the Arlow text backbone |

## Model

| Item | Value |
| --- | --- |
| Vision encoder | `ArlowVLVisionModel` |
| Depth | `48` |
| Embedding dimension | `1536` |
| Hidden size | `3072` |
| Attention heads | `24` |
| Patch size | `14` |
| Temporal patch size | `2` |
| Spatial merge size | `2` |
| Activation | `gelu_pytorch_tanh` |
| Deformable attention | Enabled |
| Progressive patches | Enabled |
| DeepStack visual features | Enabled |
| M-ROPE | Enabled |

## Data

| Item | Value |
| --- | --- |
| Primary modality | Images |
| Optional modality | Video |
| Default video sampling probability | `0.25` |
| Default image data | `ILSVRC/imagenet-1k` train split |
| Default video data | `ucf101` train split |
| Recommended larger-scale direction | YFCC-style image data and OpenVid-style video data |

## Optimization

| Item | Value |
| --- | --- |
| Hardware target | `8x RTX 8000` with `48 GB` each |
| System RAM target | `200 GB` |
| Precision | `fp16` |
| Attention backend | `sdpa` |
| Distributed strategy | DeepSpeed ZeRO-2 |
| Epochs | `1` |
| Steps per epoch cap | `2621440` |
| Per-device batch size | `2` |
| Gradient accumulation | `16` |
| Effective global batch size on 8 GPUs | `256` |
| Learning rate | `1.5e-4` |
| Weight decay | `0.05` |
| Warmup steps | `40000` |
| Max grad norm | `1.0` |

## MAE Objective

| Item | Value |
| --- | --- |
| Mask ratio | `0.75` |
| Decoder embedding size | `512` |
| Decoder depth | `8` |
| Decoder heads | `8` |
| Normalized pixel loss | Enabled |

## Exported Artifacts

| Item | Value |
| --- | --- |
| Main artifact to keep | `checkpoint-*/vision_encoder/` |
| Matching preprocessing artifacts | `image_processor/`, `video_processor/`, `processor_config.json` |
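
## Sanity-checking the numbers

The values in the Optimization, Model, and MAE Objective tables can be cross-checked with a little arithmetic. The sketch below is plain Python and does not touch the fork's API. The `224x224` input resolution is an assumption chosen only for illustration (the card does not state a training resolution), and the helper names `effective_batch` and `token_counts` are hypothetical. It also assumes the mask ratio is applied to raw patches before spatial merging, as in a standard MAE setup; the actual training code may differ.

```python
# Values copied from the tables in this card.
PER_DEVICE_BATCH = 2
GRAD_ACCUM = 16
NUM_GPUS = 8
PATCH_SIZE = 14
SPATIAL_MERGE = 2
MASK_RATIO = 0.75

# Assumption (not stated in the card): a square 224x224 input image.
IMAGE_SIZE = 224


def effective_batch(per_device: int, accum: int, gpus: int) -> int:
    """Samples contributing to one optimizer step across all GPUs."""
    return per_device * accum * gpus


def token_counts(image_size: int, patch_size: int, merge: int, mask_ratio: float) -> dict:
    """Patch and token counts for one square image (hypothetical helper)."""
    if image_size % patch_size:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size           # patches along one edge
    patches = side * side                     # raw patch tokens
    merged = patches // (merge * merge)       # tokens after merge x merge spatial merging
    masked = int(patches * mask_ratio)        # patches hidden from the encoder (assumed pre-merge)
    return {
        "patches": patches,
        "merged_tokens": merged,
        "masked": masked,
        "visible": patches - masked,
    }


if __name__ == "__main__":
    print(effective_batch(PER_DEVICE_BATCH, GRAD_ACCUM, NUM_GPUS))  # 256
    print(token_counts(IMAGE_SIZE, PATCH_SIZE, SPATIAL_MERGE, MASK_RATIO))
    # {'patches': 256, 'merged_tokens': 64, 'masked': 192, 'visible': 64}
```

The first result matches the `256` effective global batch size listed under Optimization. The token counts scale with resolution, so a different input size changes every figure in the second result.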