UniverSat: Resolution- and Modality-Agnostic Transformer for Earth Observation
One set of weights for many sensors, resolutions, scales, and modalities.
UniverSat is a ViT-style Earth Observation backbone built around a Universal Patch Encoder (UPE) that maps patches of arbitrary spatial, spectral, and temporal shape into a shared embedding space, with no resampling, no channel selection, and no per-sensor encoder. A single model is trained jointly on 13 sensors from 7 datasets spanning ~3 orders of magnitude in resolution, channel count, and revisit frequency, and generalises to unseen sensors within this gamut without input resampling.
- Paper (arXiv): https://arxiv.org/abs/2606.23503
- Code / Torch Hub: https://github.com/gastruc/UniverSat
- Project page: https://gastruc.github.io/universat
Highlights
- Universal. One weight set processes many modality combinations and arbitrary resolutions (optical, SAR, hyperspectral, and elevation) without channel filtering or resampling.
- Resolution-flexible. The output spatial resolution is chosen at inference and decoupled from the input patch size: coarse maps, native resolution, or per-pixel features from the same forward pass.
- Granular. A sub-patch skip cross-attention recovers fine spatial detail (field boundaries, roads) beyond patch-level embeddings.
- Frozen-backbone friendly. Competitive using only ~9K-parameter linear probes on frozen features, which makes it strong in low-label regimes.
Usage
The model is published with PyTorchModelHubMixin,
so from_pretrained pulls the weights (config.json + model.safetensors) straight from this repo:
from hubconf import UniverSat # from a local checkout on your path
model = UniverSat.from_pretrained("g-astruc/UniverSat").eval()
Equivalently, through Torch Hub (same weights, same tracked download, no local checkout needed):
import torch
model = torch.hub.load("gastruc/UniverSat", "from_pretrained").eval()
Loading requires huggingface_hub (and safetensors); building the model needs only torch.
Encode any combination of sensors
model.encode(...) looks up per-modality wavelengths, physical resolution, and sub-patch factors
automatically from a built-in registry, so you only pass {modality_name: tensor}:
# Snapshot modalities: (B, C, H, W). Time series: (B, T, C, H, W) + a "<mod>_dates" tensor.
data = {
"spot": torch.randn(2, 3, 360, 360), # 1 m VHR RGB snapshot
"s2": torch.randn(2, 20, 10, 36, 36), # 10 m Sentinel-2 time series
"s2_dates": torch.randint(0, 365, (2, 20)), # day-of-year per timestamp
"s1": torch.randn(2, 12, 3, 36, 36), # 10 m Sentinel-1 (VV, VH, ratio)
"s1_dates": torch.randint(0, 365, (2, 12)),
"dsm": torch.randn(2, 1, 12, 12), # 30 m elevation snapshot
}
features, _ = model.encode(data, patch_size=40, output_grid=36)
# features: (2, 1296, 768) -> a 36×36 dense feature grid (register tokens stripped for you)
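The flat token tensor reported above can be folded back into a spatial map, for example to feed a convolutional head or to visualise features. The sketch below uses a random tensor as a stand-in for the model output and assumes the tokens are stored in row-major order, which this card does not state. `tokens_to_grid` is a hypothetical helper, not part of the UniverSat API.

```python
import torch


def tokens_to_grid(features: torch.Tensor, grid: int) -> torch.Tensor:
    """Fold (B, G*G, D) tokens into a (B, D, G, G) feature map.

    Assumes row-major token order (an assumption, not confirmed by the source).
    """
    batch, n_tokens, dim = features.shape
    if n_tokens != grid * grid:
        raise ValueError(f"expected {grid * grid} tokens, got {n_tokens}")
    # (B, G*G, D) -> (B, G, G, D) -> (B, D, G, G)
    return features.reshape(batch, grid, grid, dim).permute(0, 3, 1, 2).contiguous()


# Random stand-in with the shape reported for output_grid=36.
features = torch.randn(2, 36 * 36, 768)
feature_map = tokens_to_grid(features, grid=36)
print(feature_map.shape)  # torch.Size([2, 768, 36, 36])
```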
- patch_size: patch size in metres (patch_size=40 → 40 m patches; scale = patch_size / 10 internally).
- output_grid: side G of the output grid (a G×G map, G² tokens), decoupled from the input patch size. The same model and inputs produce coarse or per-pixel maps just by changing it:
patch, _ = model.encode(data, patch_size=40, output_grid=9) # 9×9 patch-level
dense, _ = model.encode(data, patch_size=40, output_grid=36) # 36×36 dense
highres, _ = model.encode(data, patch_size=40, output_grid=180) # 180×180 high-res
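The relation between input size, patch size, and output grid is simple arithmetic, sketched below in plain Python. It assumes the output grid spans the full input footprint and that all modalities cover the same ground extent, which holds for the example tensors (360 px at 1 m, 36 px at 10 m, 12 px at 30 m are all 360 m wide). `grid_geometry` is a hypothetical helper for illustration only.

```python
def grid_geometry(pixels: int, input_res_m: float, patch_size_m: float, output_grid: int):
    """Return (footprint in m, patches per side, ground size of one output cell in m)."""
    footprint_m = pixels * input_res_m            # ground extent of the tile
    patches_per_side = footprint_m / patch_size_m  # size of the internal patch grid
    output_cell_m = footprint_m / output_grid      # spacing of the output features
    return footprint_m, patches_per_side, output_cell_m


# Sentinel-2 tile from the example: 36 px at 10 m, with 40 m patches.
for g in (9, 36, 180):
    footprint, patches, cell = grid_geometry(36, 10.0, 40.0, g)
    print(f"output_grid={g}: {footprint:.0f} m tile, "
          f"{patches:.0f} patches per side, {cell:.0f} m per output cell")
```

Under these assumptions, output_grid=9 matches the 9×9 patch grid, 36 matches the native 10 m Sentinel grid, and 180 yields one feature every 2 m.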
Unseen sensors? Pass the sensor's wavelengths={...} (optical/hyperspectral), polarization codes (SAR), input_res={...}, and subpatches={...} overrides to encode(...). The UPE uses these as positional encodings, so no retraining is needed.
Inputs should be normalised (per-channel z-score). For the low-level forward(...) API (explicit
wavelengths, latent grid, masking), see hubconf.py.
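A per-channel z-score, as recommended above, could be applied along these lines. This is a minimal sketch: `zscore_per_channel` is a hypothetical helper, the raw value range is made up, and the statistics are computed from the batch itself only for illustration. In practice they would normally come from the training set, and this card does not say which statistics the authors used.

```python
import torch


def zscore_per_channel(x: torch.Tensor, mean: torch.Tensor, std: torch.Tensor,
                       eps: float = 1e-6) -> torch.Tensor:
    """Normalise x of shape (B, C, H, W) or (B, T, C, H, W) with per-channel stats of shape (C,)."""
    # Broadcast the (C,) statistics over the channel axis, which is third from the end.
    shape = (1,) * (x.dim() - 3) + (-1, 1, 1)
    return (x - mean.view(shape)) / (std.view(shape) + eps)


# Hypothetical raw Sentinel-2-like time series: (B, T, C, H, W).
s2_raw = torch.rand(2, 20, 10, 36, 36) * 3000.0
mean = s2_raw.mean(dim=(0, 1, 3, 4))  # one value per channel
std = s2_raw.std(dim=(0, 1, 3, 4))
s2_norm = zscore_per_channel(s2_raw, mean, std)
print(s2_norm.shape)  # torch.Size([2, 20, 10, 36, 36])
```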
Supported sensors
The encoder accepts any combination of the registered modalities below (and more; see modality_registry.py). Time-series modalities take a 5-D tensor plus a "<modality>_dates" companion (day-of-year, Jan 1 = 0).
| Modality | Type | Channels | Resolution |
|---|---|---|---|
| aerial / aerialflair | snapshot | RGB-NiR (4) | 0.2 m |
| spot | snapshot | RGB (3) | 1 m |
| spotRGBN | snapshot | RGB-NiR (4) | 1.6 m |
| naip | snapshot | RGB-NiR (4) | 1.25 m |
| rgbneon | snapshot | RGB (3) | 0.1 m |
| dem / dsm / ndemneon | snapshot | DSM / nDEM (1–2) | 0.2–30 m |
| s2 (Sentinel-2) | time series | 10 | 10 m |
| s1 (Sentinel-1) | time series | VV, VH, ratio (3) | 10 m |
| l7 / l8 (Landsat) | time series | 6 / 11 | 30 / 10 m |
| alos (ALOS-2) | time series | HH, HV, ratio (3) | 30 m |
| modis | time series | 7 | 250 m |
| enmap / EO1 / neon | snapshot | hyperspectral | 30 m / 30 m / 1 m |
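The "<modality>_dates" companion uses a zero-based day of year (Jan 1 = 0). A minimal way to derive it from calendar dates with the standard library is sketched below. `day_of_year_zero_based` is a hypothetical helper and the acquisition dates are made up.

```python
from datetime import date


def day_of_year_zero_based(d: date) -> int:
    """Day of year with Jan 1 mapped to 0, matching the convention stated above."""
    return d.timetuple().tm_yday - 1


# Hypothetical acquisition dates for one time series.
acquisitions = [date(2021, 1, 1), date(2021, 3, 15), date(2021, 12, 31)]
doy = [day_of_year_zero_based(d) for d in acquisitions]
print(doy)  # [0, 73, 364]
```

Wrapping one such list per sample in an integer tensor gives the (B, T) shape used in the encode example. Note that Dec 31 of a leap year maps to 365; how the model treats that value is not stated here.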
Training
UniverSat is pre-trained self-supervised on 13 sensors from 7 datasets with a combination of latent multimodal masked modeling (LM³) and cross-modal contrastive learning under aggressive (~90%) masking across channels, time, space, and modalities.
| Dataset | Sensors used |
|---|---|
| FLAIR-Hub | SPOT 6/7 + aerial UHR + Sentinel-1 + Sentinel-2 + DSM + nDEM |
| PASTIS-HD | SPOT 6/7 + Sentinel-1 + Sentinel-2 time series |
| TreeSatAI-TS | aerial UHR + Sentinel-1 + Sentinel-2 time series |
| Planted | Sentinel-1 + Sentinel-2 + Landsat-7/8/9 + ALOS-2 + MODIS |
| S2NAIP-Urban | NAIP + Landsat-8 + Sentinel-1 + Sentinel-2 |
| HyperGlobal | EO-1 Hyperion (175 bands) + Gaofen-5 (150 bands) |
| EarthView (NEON) | NEON RGB/UAV + NIS hyperspectral (396 bands) + nDEM |
Combined coverage: spatial resolution 0.1–300 m, temporal depth 1–150 images/year, spectral width 1–396 channels. Fold 1 of PASTIS is held out of pretraining for downstream benchmarking.
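To give a sense of what a ~90% masking ratio means in practice, the sketch below draws a uniform random token mask. It is a generic illustration of high-ratio masking, not the authors' scheme, which masks jointly across channels, time, space, and modalities. `random_token_mask` is a hypothetical helper.

```python
import torch


def random_token_mask(batch: int, tokens: int, mask_ratio: float = 0.9) -> torch.Tensor:
    """Return a (batch, tokens) boolean mask where True marks a masked (hidden) token."""
    n_keep = max(1, round(tokens * (1.0 - mask_ratio)))
    # Rank random scores per sample and keep the n_keep lowest-scoring tokens visible.
    keep_idx = torch.rand(batch, tokens).argsort(dim=1)[:, :n_keep]
    mask = torch.ones(batch, tokens, dtype=torch.bool)
    mask[torch.arange(batch).unsqueeze(1), keep_idx] = False
    return mask


mask = random_token_mask(batch=2, tokens=1296)
print(mask.shape, mask.float().mean().item())  # about 0.9 of the tokens are hidden
```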
Evaluation
UniverSat is evaluated on 16 datasets across GeoBench, PangaeaBench, and SpectralEarth with strict kNN / linear probing. A 9K-parameter linear probe on UniverSat's dense embeddings matches or exceeds UperNet decoders with 33–47 M parameters, including on configurations unseen at pretraining (mono-temporal Sentinel inputs, the synthetic HLS sensor) and on hyperspectral EnMAP (SpectralEarth), which it was not trained on. See the project page and paper for full tables.
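For intuition about the size of such a probe, the sketch below builds a single linear layer on 768-dimensional features (the embedding width shown in the usage example) and runs one training step. The 12-class setting is a hypothetical choice that lands near 9K parameters, and the random tensors stand in for frozen UniverSat features and labels. The exact probing protocol is described in the paper, not here.

```python
import torch
from torch import nn

embed_dim, num_classes = 768, 12  # 12 classes is a hypothetical choice

# One linear layer applied independently to every token of the dense grid.
probe = nn.Linear(embed_dim, num_classes)
n_params = sum(p.numel() for p in probe.parameters())
print(n_params)  # 9228 = 768 * 12 weights + 12 biases

# Random stand-ins for frozen backbone features and per-token labels.
features = torch.randn(2, 36 * 36, embed_dim)
labels = torch.randint(0, num_classes, (2, 36 * 36))

optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
logits = probe(features)  # (2, 1296, 12)
loss = nn.functional.cross_entropy(logits.reshape(-1, num_classes), labels.reshape(-1))
loss.backward()
optimizer.step()
```

Only the probe's parameters receive gradients, so the backbone output can be computed once under torch.no_grad() and cached.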
Intended use and limitations
Intended use. A general-purpose EO feature extractor for classification, semantic segmentation, and change detection, via fine-tuning or frozen-backbone probing, across heterogeneous sensors and resolutions.
Limitations. UniverSat trades specialisation for generality: in homogeneous settings (e.g. VHR-RGB-only or mono-temporal Sentinel-2) modality-specific models can be more accurate or efficient. Generalisation to unseen non-optical sensors is less seamless than to optical ones (it benefits from learning a small modality-encoding vector). As with any large EO model, it may enable large-scale monitoring; consider surveillance and misuse implications.
Citation
@article{perron2026universat,
title = {UniverSat: Resolution- and Modality-Agnostic Transformers for Earth Observation},
author = {Perron, Yohann and Astruc, Guillaume and Gonthier, Nicolas
and Mallet, Clement and Landrieu, Loic},
journal = {arXiv preprint arXiv:2606.23503},
eprint = {2606.23503},
archivePrefix = {arXiv},
year = {2026}
}
Acknowledgements
Transformer blocks from timm; L-TAE/PSE from utae-paps; MP-Fourier features inspired by EDM2; axial attention follows Ho et al., 2019. Project skeleton: lightning-hydra-template.
License
MIT.