---
license: other
license_name: openarlow
license_link: LICENSE
language:
- en
pipeline_tag: image-feature-extraction
library_name: transformers
---

# This model is not official; it is a proof of concept of the Arlow Vision architecture

**Arlow Vision is the standalone vision-pretraining stage of the Arlow multimodal stack. It trains the vision tower to produce visual tokens whose width matches the Arlow text backbone, so the encoder can later be plugged into a full vision-language model.**

## Requirements

This model requires a specific Transformers fork because the architecture code has not yet been merged into official Transformers.

**Required Transformers fork:** https://github.com/yuchenxie4645/transformers/tree/ArlowVL

```bash
git clone --branch ArlowVL --single-branch https://github.com/yuchenxie4645/transformers
cd transformers
pip install -e .
```
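
After installing, a quick check can confirm that the fork, rather than a stock Transformers release, is the one being imported. The sketch below assumes the fork exposes `ArlowVLVisionModel` from the top-level `transformers` namespace; the card does not confirm this, so adjust the lookup if the class lives elsewhere.

```python
import importlib
import importlib.util


def has_arlow_vision(class_name="ArlowVLVisionModel"):
    """Return True if the installed transformers package exposes class_name.

    Assumption: the fork exports the class at the top level of `transformers`.
    """
    if importlib.util.find_spec("transformers") is None:
        return False  # transformers is not installed at all
    try:
        transformers = importlib.import_module("transformers")
        # getattr with a default avoids raising when the class is absent.
        return getattr(transformers, class_name, None) is not None
    except Exception:
        return False  # installed, but the import itself failed


if __name__ == "__main__":
    print("Arlow vision classes available:", has_arlow_vision())
```

If this prints `False` after the editable install, another Transformers installation is probably shadowing the fork in the active environment.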

## Training Summary

| Item | Value |
| --- | --- |
| Objective | Masked autoencoding over visual patch tokens |
| Modalities | Images, with optional video mixed into training |
| Output width | `3072` |
| Next stage | Multimodal alignment with the Arlow text backbone |

## Model

| Item | Value |
| --- | --- |
| Vision encoder | `ArlowVLVisionModel` |
| Depth | `48` |
| Embedding dimension | `1536` |
| Hidden size | `3072` |
| Attention heads | `24` |
| Patch size | `14` |
| Temporal patch size | `2` |
| Spatial merge size | `2` |
| Activation | `gelu_pytorch_tanh` |
| Deformable attention | Enabled |
| Progressive patches | Enabled |
| DeepStack visual features | Enabled |
| M-ROPE | Enabled |
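
The patch and merge settings above determine how many visual tokens an input yields. The following is a rough sketch of that arithmetic, assuming a still image occupies one temporal group and that each `2 x 2` block of patches is merged into a single token; features such as progressive patches or deformable attention may change the real counts, and the function names here are hypothetical.

```python
# Values copied from the Model table above.
PATCH_SIZE = 14
TEMPORAL_PATCH_SIZE = 2
SPATIAL_MERGE_SIZE = 2
EMBED_DIM = 1536
NUM_HEADS = 24


def head_dim():
    """Per-head width, assuming heads split the embedding dimension evenly."""
    return EMBED_DIM // NUM_HEADS


def visual_token_count(height, width, num_frames=None):
    """Merged visual tokens for one image (num_frames=None) or one video clip.

    Assumption: an image is a single temporal group, and every
    SPATIAL_MERGE_SIZE x SPATIAL_MERGE_SIZE block of patches becomes one token.
    """
    unit = PATCH_SIZE * SPATIAL_MERGE_SIZE
    if height % unit or width % unit:
        raise ValueError(f"height and width must be multiples of {unit}")
    if num_frames is None:
        temporal_groups = 1
    elif num_frames > 0 and num_frames % TEMPORAL_PATCH_SIZE == 0:
        temporal_groups = num_frames // TEMPORAL_PATCH_SIZE
    else:
        raise ValueError(
            f"num_frames must be a positive multiple of {TEMPORAL_PATCH_SIZE}"
        )
    patches = temporal_groups * (height // PATCH_SIZE) * (width // PATCH_SIZE)
    return patches // (SPATIAL_MERGE_SIZE ** 2)


if __name__ == "__main__":
    print("head dim:", head_dim())
    print("224x224 image tokens:", visual_token_count(224, 224))
    print("16-frame 224x224 clip tokens:", visual_token_count(224, 224, 16))
```

Under these assumptions a `224 x 224` image gives `16 x 16 = 256` patches and `64` merged tokens, each projected to the `3072` output width.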

## Data

| Item | Value |
| --- | --- |
| Primary modality | Images |
| Optional modality | Video |
| Default video sampling probability | `0.25` |
| Default image data | `ILSVRC/imagenet-1k` train split |
| Default video data | `ucf101` train split |
| Recommended larger-scale direction | YFCC-style image data and OpenVid-style video data |
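
One way to read the video sampling probability is as a per-sample coin flip between the two datasets. The sketch below illustrates that interpretation only; whether the training script decides per sample or per batch is not stated here, and `pick_modality` is a hypothetical helper.

```python
import random

VIDEO_SAMPLING_PROB = 0.25  # default from the Data table above


def pick_modality(rng, video_prob=VIDEO_SAMPLING_PROB):
    """Return "video" with probability video_prob, otherwise "image"."""
    return "video" if rng.random() < video_prob else "image"


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    draws = [pick_modality(rng) for _ in range(10_000)]
    print("video fraction:", draws.count("video") / len(draws))
```

Over many draws roughly one sample in four comes from the video set, so images remain the primary modality.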

## Optimization

| Item | Value |
| --- | --- |
| Hardware target | `8x RTX 8000` with `48 GB` each |
| System RAM target | `200 GB` |
| Precision | `fp16` |
| Attention backend | `sdpa` |
| Distributed strategy | DeepSpeed ZeRO-2 |
| Epochs | `1` |
| Steps per epoch cap | `2621440` |
| Per-device batch size | `2` |
| Gradient accumulation | `16` |
| Effective global batch size on 8 GPUs | `256` |
| Learning rate | `1.5e-4` |
| Weight decay | `0.05` |
| Warmup steps | `40000` |
| Max grad norm | `1.0` |
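
The batch-size row follows directly from the other settings: `2` per device times `16` accumulation steps times `8` GPUs gives `256`. The sketch below redoes that arithmetic and writes out a minimal DeepSpeed ZeRO-2 config that matches the table. The config is an illustration built from standard DeepSpeed keys, not the file used for this run, which is not included in this card.

```python
import json

# Values copied from the Optimization table above.
NUM_GPUS = 8
PER_DEVICE_BATCH = 2
GRAD_ACCUM = 16
WARMUP_STEPS = 40_000
STEP_CAP = 2_621_440
MAX_GRAD_NORM = 1.0

# Samples consumed per optimizer step across all GPUs.
effective_batch = NUM_GPUS * PER_DEVICE_BATCH * GRAD_ACCUM

# Share of the step cap spent in warmup.
warmup_fraction = WARMUP_STEPS / STEP_CAP

# Illustrative DeepSpeed config; optimizer and scheduler blocks are omitted
# because the card does not specify them.
ds_config = {
    "train_micro_batch_size_per_gpu": PER_DEVICE_BATCH,
    "gradient_accumulation_steps": GRAD_ACCUM,
    "train_batch_size": effective_batch,
    "gradient_clipping": MAX_GRAD_NORM,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
}

if __name__ == "__main__":
    print("effective global batch size:", effective_batch)
    print(f"warmup fraction of step cap: {warmup_fraction:.2%}")
    print(json.dumps(ds_config, indent=2))
```

DeepSpeed checks that `train_batch_size` equals micro batch size times accumulation steps times world size, so this config only validates on exactly 8 GPUs.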

## MAE Objective

| Item | Value |
| --- | --- |
| Mask ratio | `0.75` |
| Decoder embedding size | `512` |
| Decoder depth | `8` |
| Decoder heads | `8` |
| Normalized pixel loss | Enabled |
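
To make the objective concrete, here is a dependency-free sketch of the two pieces the table names: random masking at a `0.75` ratio and a pixel loss computed against per-patch normalized targets on masked patches only. It follows the usual MAE recipe as an assumption; the actual training code may differ in details such as the variance estimator, and all function names are hypothetical.

```python
import math
import random
import statistics

MASK_RATIO = 0.75  # from the MAE Objective table above


def random_mask(num_patches, mask_ratio=MASK_RATIO, rng=None):
    """Return a list of booleans; True marks a patch hidden from the encoder."""
    rng = rng or random.Random()
    num_masked = int(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


def normalize_patch(pixels, eps=1e-6):
    """Standardize one patch by its own mean and (population) variance."""
    mean = statistics.fmean(pixels)
    var = statistics.pvariance(pixels, mu=mean)
    return [(p - mean) / math.sqrt(var + eps) for p in pixels]


def masked_mse(predictions, target_patches, mask):
    """Mean squared error against normalized targets, over masked patches only."""
    losses = []
    for pred, target, is_masked in zip(predictions, target_patches, mask):
        if not is_masked:
            continue  # visible patches do not contribute to the loss
        norm_target = normalize_patch(target)
        losses.append(
            statistics.fmean((a - b) ** 2 for a, b in zip(pred, norm_target))
        )
    return statistics.fmean(losses) if losses else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    mask = random_mask(256, rng=rng)
    print("masked patches:", sum(mask), "of", len(mask))
```

With 256 patches, 192 are masked, so the encoder only processes the remaining 64 while the lightweight decoder reconstructs the rest.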

## Exported Artifacts

| Item | Value |
| --- | --- |
| Main artifact to keep | `checkpoint-*/vision_encoder/` |
| Matching preprocessing artifacts | `image_processor/`, `video_processor/`, `processor_config.json` |
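
When several checkpoints exist, a small helper can pick the encoder directory to keep and check that the preprocessing artifacts are present. This sketch assumes checkpoint directories are named `checkpoint-<step>` with a numeric step; the directory that holds the preprocessing artifacts is not specified above, so it is passed in by the caller. Both function names are hypothetical.

```python
from pathlib import Path

PREPROCESSING_ARTIFACTS = (
    "image_processor",
    "video_processor",
    "processor_config.json",
)


def checkpoint_step(checkpoint_dir):
    """Parse the numeric suffix of `checkpoint-<step>`; -1 if not numeric."""
    suffix = checkpoint_dir.name.split("-", 1)[-1]
    return int(suffix) if suffix.isdigit() else -1


def find_vision_encoder(output_dir):
    """Return the vision_encoder dir of the highest-step checkpoint, or None."""
    candidates = [
        p for p in Path(output_dir).glob("checkpoint-*/vision_encoder")
        if p.is_dir()
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda p: checkpoint_step(p.parent))


def missing_preprocessing(root):
    """List the preprocessing artifacts that are absent under root."""
    root = Path(root)
    return [name for name in PREPROCESSING_ARTIFACTS if not (root / name).exists()]


if __name__ == "__main__":
    encoder = find_vision_encoder(".")
    print("vision encoder:", encoder)
    if encoder is not None:
        # Assumption: artifacts sit next to vision_encoder/; change as needed.
        print("missing:", missing_preprocessing(encoder.parent))
```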