UniverSat — Resolution- and Modality-Agnostic Transformer for Earth Observation

One set of weights for many sensors, resolutions, scales, and modalities.

UniverSat is a ViT-style Earth Observation backbone built around a Universal Patch Encoder (UPE) that maps patches of arbitrary spatial, spectral, and temporal shape into a shared embedding space — no resampling, no channel selection, no per-sensor encoder. A single model is trained jointly on 13 sensors from 7 datasets spanning ~3 orders of magnitude in resolution, channel count, and revisit frequency, and generalises to unseen sensors within this gamut without input resampling.

📄 Paper (arXiv): https://arxiv.org/abs/2606.23503
📦 Code / Torch Hub: https://github.com/gastruc/UniverSat
🌐 Project page: https://gastruc.github.io/universat

Highlights

🌐 Universal. One weight set processes many combinations of modalities (optical, SAR, hyperspectral, and elevation) at arbitrary resolutions, without channel filtering or resampling.
📏 Resolution-flexible. The output spatial resolution is chosen at inference and decoupled from the input patch size: coarse maps, native resolution, or per-pixel features from the same weights and inputs.
🔍 Granular. A sub-patch skip cross-attention recovers fine spatial detail (field boundaries, roads) beyond patch-level embeddings.
🧊 Frozen-backbone friendly. Competitive using only ~9K-parameter linear probes on frozen features; strong in low-label regimes.

Usage

The model is published with PyTorchModelHubMixin, so from_pretrained pulls the weights (config.json + model.safetensors) straight from this repo:

from hubconf import UniverSat   # from a local checkout on your path

model = UniverSat.from_pretrained("g-astruc/UniverSat").eval()

Equivalently, through Torch Hub — same weights, same tracked download, no local checkout needed:

import torch

model = torch.hub.load("gastruc/UniverSat", "from_pretrained").eval()

Loading requires huggingface_hub (and safetensors); building the model needs only torch.

Encode any combination of sensors

model.encode(...) looks up per-modality wavelengths, physical resolution, and sub-patch factors automatically from a built-in registry, so you only pass {modality_name: tensor}:

import torch

# Snapshot modalities: (B, C, H, W). Time series: (B, T, C, H, W) + a "<mod>_dates" tensor.
# Every modality below covers the same 360 m x 360 m footprint.
data = {
    "spot":     torch.randn(2,  3, 360, 360),       # 1 m VHR RGB snapshot
    "s2":       torch.randn(2, 20, 10,  36,  36),   # 10 m Sentinel-2 time series
    "s2_dates": torch.randint(0, 365, (2, 20)),     # day-of-year per timestamp
    "s1":       torch.randn(2, 12,  3,  36,  36),   # 10 m Sentinel-1 (VV, VH, ratio)
    "s1_dates": torch.randint(0, 365, (2, 12)),     # day-of-year per timestamp
    "dsm":      torch.randn(2,  1,  12,  12),       # 30 m elevation snapshot
}

features, _ = model.encode(data, patch_size=40, output_grid=36)
# features: (2, 1296, 768)  ->  a 36×36 dense feature grid (register tokens stripped for you)

patch_size — patch size in metres (patch_size=40 → 40 m patches; scale = patch_size / 10 internally).
output_grid — side G of the output grid (a G×G map, G² tokens), decoupled from the input patch size. The same model + inputs produce coarse or per-pixel maps just by changing it:

patch, _   = model.encode(data, patch_size=40, output_grid=9)     #   9×9   patch-level
dense, _   = model.encode(data, patch_size=40, output_grid=36)    #  36×36  dense
highres, _ = model.encode(data, patch_size=40, output_grid=180)   # 180×180 high-res
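
The relationship between tile size, patch_size, and output_grid can be checked with simple arithmetic. The sketch below is not part of the UniverSat API: the helper name grid_summary is hypothetical, and it assumes every modality in the batch covers the same square ground footprint, as in the example batch above (360 m per side).

```python
def grid_summary(height_px, res_m, patch_size_m, output_grid):
    """Relate a square input tile to the patch grid and the output grid.

    Hypothetical helper for illustration; not part of the UniverSat API.
    """
    footprint_m = height_px * res_m               # ground extent of one tile side
    patches_per_side = footprint_m / patch_size_m
    return {
        "footprint_m": footprint_m,
        "patches_per_side": patches_per_side,
        "output_tokens": output_grid ** 2,        # G x G map -> G^2 tokens
        "metres_per_output_cell": footprint_m / output_grid,
    }

# Tile sizes (pixels) and resolutions (metres) from the example batch above.
tiles = {"spot": (360, 1.0), "s2": (36, 10.0), "s1": (36, 10.0), "dsm": (12, 30.0)}
for name, (height, res) in tiles.items():
    print(name, grid_summary(height, res, patch_size_m=40, output_grid=36))
```

Under these assumptions, output_grid=9 gives one output cell per 40 m patch, output_grid=36 gives 10 m cells, and output_grid=180 gives 2 m cells.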

Unseen sensors? Pass the sensor's wavelengths={...} (optical/hyperspectral), polarization codes (SAR), input_res={...}, and subpatches={...} overrides to encode(...). The UPE uses these as positional encodings — no retraining needed.

Inputs should be normalised (per-channel z-score). For the low-level forward(...) API (explicit wavelengths, latent grid, masking), see hubconf.py.
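
Per-channel z-score normalisation can be sketched as follows. The helper name zscore_per_channel is hypothetical, and estimating the statistics from the batch itself is for illustration only; this card does not state which statistics the released weights expect, so check the repository before relying on this.

```python
import torch

def zscore_per_channel(x, mean=None, std=None, eps=1e-6):
    """Standardise each channel of a (B, C, H, W) or (B, T, C, H, W) tensor.

    Hypothetical helper, not part of the UniverSat API. If mean/std are not
    given they are estimated from x itself (illustration only); if given,
    they must broadcast against x, e.g. shape (1, C, 1, 1).
    """
    channel_dim = x.dim() - 3                         # C sits just before H, W
    reduce_dims = [d for d in range(x.dim()) if d != channel_dim]
    if mean is None:
        mean = x.mean(dim=reduce_dims, keepdim=True)
    if std is None:
        std = x.std(dim=reduce_dims, keepdim=True)
    return (x - mean) / (std + eps)

raw_s2 = torch.rand(2, 20, 10, 36, 36) * 3000.0       # fake reflectance-like values
s2 = zscore_per_channel(raw_s2)
print(s2.mean(dim=(0, 1, 3, 4)))                       # close to 0 for every channel
```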

Supported sensors

The encoder accepts any combination of the registered modalities below (and more — see modality_registry.py). Time-series modalities take a 5-D tensor plus a "<modality>_dates" companion (day-of-year, Jan 1 = 0).
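
The "<modality>_dates" companion can be built from acquisition dates with the standard library. This is a minimal sketch: the helper name day_of_year_zero_based is hypothetical, and the dates are made up for illustration.

```python
from datetime import date

def day_of_year_zero_based(d):
    """Return the day of year with Jan 1 = 0, the convention stated above."""
    return d.timetuple().tm_yday - 1   # tm_yday is 1-based

# Made-up acquisition dates for one time series.
acquisitions = [date(2021, 1, 1), date(2021, 3, 15), date(2021, 12, 31)]
doy = [day_of_year_zero_based(d) for d in acquisitions]
print(doy)  # [0, 73, 364]
```

One such list per sample can then be stacked into a (B, T) integer tensor. Note that 31 December of a leap year maps to 365, one past the range used in the random example above; how the model treats that value is not stated here.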

Modality	Type	Channels	Resolution
`aerial` / `aerialflair`	snapshot	RGB-NiR (4)	0.2 m
`spot`	snapshot	RGB (3)	1 m
`spotRGBN`	snapshot	RGB-NiR (4)	1.6 m
`naip`	snapshot	RGB-NiR (4)	1.25 m
`rgbneon`	snapshot	RGB (3)	0.1 m
`dem` / `dsm` / `ndemneon`	snapshot	DSM / nDEM (1–2)	0.2–30 m
`s2` (Sentinel-2)	time series	10	10 m
`s1` (Sentinel-1)	time series	VV, VH, ratio (3)	10 m
`l7` / `l8` (Landsat)	time series	6 / 11	30 / 10 m
`alos` (ALOS-2)	time series	HH, HV, ratio (3)	30 m
`modis`	time series	7	250 m
`enmap` / `EO1` / `neon`	snapshot	hyperspectral	30 m / 30 m / 1 m
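
A quick shape check before calling encode can catch the most common input mistakes. The sketch below is not part of UniverSat: EXPECTED hand-copies a few rows from the table above (the authoritative source is modality_registry.py), and check_inputs is a hypothetical helper that works on plain shape tuples.

```python
# (type, channels) hand-copied from the table above for a few modalities.
EXPECTED = {
    "spot": ("snapshot", 3),
    "naip": ("snapshot", 4),
    "s2": ("time series", 10),
    "s1": ("time series", 3),
    "modis": ("time series", 7),
}

def check_inputs(shapes):
    """Return a list of problems found in {name: shape}; an empty list means OK."""
    problems = []
    for name, (kind, channels) in EXPECTED.items():
        if name not in shapes:
            continue
        shape = shapes[name]
        ndim = 5 if kind == "time series" else 4   # (B, T, C, H, W) vs (B, C, H, W)
        if len(shape) != ndim:
            problems.append(f"{name}: expected {ndim}-D input, got {len(shape)}-D")
            continue
        if shape[-3] != channels:
            problems.append(f"{name}: expected {channels} channels, got {shape[-3]}")
        if kind == "time series":
            want = (shape[0], shape[1])            # dates must be (B, T)
            got = shapes.get(f"{name}_dates")
            if got != want:
                problems.append(f"{name}_dates: expected shape {want}, got {got}")
    return problems

shapes = {"s2": (2, 20, 10, 36, 36), "s2_dates": (2, 20), "spot": (2, 3, 360, 360)}
print(check_inputs(shapes))  # []
```

With real tensors, the shapes dictionary can be built as {k: tuple(v.shape) for k, v in data.items()}.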

Training

UniverSat is pre-trained self-supervised on 13 sensors from 7 datasets with a combination of latent multimodal masked modeling (LM³) and cross-modal contrastive learning under aggressive (~90%) masking across channels, time, space, and modalities.
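
To give a feel for what ~90% masking means, here is a generic random-masking sketch. It is not the authors' training code: this card does not describe how masks are sampled across channels, time, space, and modalities, so the sketch simply hides a uniform random 90% of a flat token list. The name random_mask and its arguments are hypothetical.

```python
import random

def random_mask(num_tokens, mask_ratio=0.9, seed=None):
    """Split token indices into (visible, masked) using a uniform random mask."""
    rng = random.Random(seed)
    num_masked = round(num_tokens * mask_ratio)
    masked = set(rng.sample(range(num_tokens), num_masked))
    visible = [i for i in range(num_tokens) if i not in masked]
    return visible, sorted(masked)

# 1296 tokens corresponds to a 36 x 36 grid.
visible, masked = random_mask(num_tokens=1296, mask_ratio=0.9, seed=0)
print(len(visible), len(masked))  # 130 1166
```

In masked-modelling setups generally, only the visible tokens are fed to the encoder and the objective is to predict targets for the hidden ones.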

Dataset	Sensors used
FLAIR-Hub	SPOT 6/7 + aerial UHR + Sentinel-1 + Sentinel-2 + DSM + nDEM
PASTIS-HD	SPOT 6/7 + Sentinel-1 + Sentinel-2 time series
TreeSatAI-TS	aerial UHR + Sentinel-1 + Sentinel-2 time series
Planted	Sentinel-1 + Sentinel-2 + Landsat-7/8/9 + ALOS-2 + MODIS
S2NAIP-Urban	NAIP + Landsat-8 + Sentinel-1 + Sentinel-2
HyperGlobal	EO-1 Hyperion (175 bands) + Gaofen-5 (150 bands)
EarthView (NEON)	NEON RGB/UAV + NIS hyperspectral (396 bands) + nDEM

Combined coverage: spatial resolution 0.1–300 m, temporal depth 1–150 images/year, spectral width 1–396 channels. Fold 1 of PASTIS is held out of pretraining for downstream benchmarking.

Evaluation

UniverSat is evaluated on 16 datasets across GeoBench, PangaeaBench, and SpectralEarth with strict kNN / linear probing. A 9K-parameter linear probe on UniverSat's dense embeddings matches or exceeds UPerNet decoders with 33–47 M parameters, including on configurations unseen at pretraining (mono-temporal Sentinel inputs, the synthetic HLS sensor) and on hyperspectral EnMAP (SpectralEarth), which it was not trained on. See the project page and paper for full tables.
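
A probe of roughly 9K parameters is what a single linear layer on 768-dimensional features (the width shown in the usage example) amounts to for a task with about a dozen classes. As a worked example, assuming a hypothetical 12-class task (the class counts of the benchmark tasks are not given here):

```python
def linear_probe_params(feature_dim, num_classes):
    """Weights plus biases of one linear layer mapping features to class scores."""
    return feature_dim * num_classes + num_classes

# 768 is the feature width from the usage example; 12 classes is an assumption.
print(linear_probe_params(768, 12))  # 9228
```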

Intended use and limitations

Intended use. A general-purpose EO feature extractor for classification, semantic segmentation, and change detection — via fine-tuning or frozen-backbone probing — across heterogeneous sensors and resolutions.

Limitations. UniverSat trades specialisation for generality: in homogeneous settings (e.g. VHR-RGB-only or mono-temporal Sentinel-2) modality-specific models can be more accurate or efficient. Generalisation to unseen non-optical sensors is less seamless than to optical ones (it benefits from learning a small modality-encoding vector). As with any large EO model, it may enable large-scale monitoring; consider surveillance and misuse implications.

Citation

@article{perron2026universat,
  title   = {UniverSat: Resolution- and Modality-Agnostic Transformers for Earth Observation},
  author  = {Perron, Yohann and Astruc, Guillaume and Gonthier, Nicolas
             and Mallet, Clement and Landrieu, Loic},
  journal = {arXiv preprint arXiv:2606.23503},
  eprint  = {2606.23503},
  archivePrefix = {arXiv},
  year    = {2026}
}

Acknowledgements

Transformer blocks from timm; L-TAE/PSE from utae-paps; MP-Fourier features inspired by EDM2; axial attention follows Ho et al., 2019. Project skeleton: lightning-hydra-template.

License

MIT.

Model size: 0.2B params, stored as F32 Safetensors.
