Swin-STCLN × PASTIS

Swin-STCLN: Hierarchical Swin Transformer Enhanced Spatio-Temporal Contrastive Learning Network for Crop Mapping

This repository implements an improved STCLN architecture for Sentinel-2 satellite image time-series semantic segmentation on the PASTIS benchmark.

The original STCLN pretrain → finetune workflow is preserved while replacing the flat CNN spatial encoder with a hierarchical Swin Transformer encoder and introducing cross-scale spatiotemporal fusion with boundary-aware refinement.

🏆 Results — 5-Fold Cross Validation

Summary

Metric	Fold 1	Fold 2	Fold 3	Fold 4	Fold 5	Mean ± Std
mFscore	58.37%	56.69%	59.18%	61.00%	56.49%	58.35% ± 1.70%
mIoU	44.35%	42.90%	45.54%	46.80%	42.58%	44.43% ± 1.62%
OA	67.10%	66.85%	69.96%	71.48%	66.41%	68.36% ± 2.13%
Kappa	60.02%	59.71%	62.79%	64.33%	58.99%	61.17% ± 2.19%
mPrecision	52.78%	51.33%	54.51%	57.40%	51.24%	53.45% ± 2.47%
mRecall	70.72%	69.00%	68.69%	68.93%	68.72%	69.21% ± 0.82%

Official benchmark result:

mFscore = 58.35% ± 1.70%

Trained from scratch on AMD MI300X GPU.

📊 Per-Class IoU — All Folds

Class	Mean IoU
🌽 Corn	75.90%
🌿 Winter rapeseed	75.55%
🌾 Beet	73.90%
🌾 Soft winter wheat	71.47%
🌿 Soybeans	61.48%
🌾 Winter barley	58.64%
🌱 Meadow	56.69%
🌻 Sunflower	54.87%
🟫 Background	50.98%
🌾 Winter durum wheat	44.56%
🌿 Grapevine	36.70%
🥔 Potatoes	36.34%
🌾 Spring barley	35.08%
🌿 Leguminous fodder	22.41%
🌾 Winter triticale	22.34%
🍎 Fruits/veg/flowers	21.53%
🍑 Orchard	17.27%
🌾 Mixed cereal	14.66%
🌿 Sorghum	13.91%

🏗️ Model Architecture — Swin-STCLN

1. Swin Spatial Encoder (SEncoder Replacement)

Original STCLN:

DoubleConv
+
DoubleConv

filters:
32 → 256

kernel:
3×3

Output:
(B×T,256,32,32)

Problem:

flat representation
single scale
limited long range spatial modeling

Swin-STCLN replaces it:

Input:

(B,T,10,32,32)

Patch Embedding

Conv2D:

10 → 96 channels

kernel = 2

stride = 2

Spatial:

32×32 → 16×16

Output:

(B×T,96,16,16)

Swin Stage 1

2 Swin Transformer Blocks

Window attention
Shifted window attention

Configuration:

dim = 96

window = 4

Output:

f_fine:

(B×T,96,16,16)

Patch Merging

16×16 → 8×8

Channel:

96 → 192

Swin Stage 2

2 Swin Transformer Blocks

dim = 192

window = 4

Output:

f_coarse:

(B×T,192,8,8)

2. Temporal Encoder

The original STCLN Transformer Temporal Encoder is kept unchanged.

Configuration:

Transformer layers: 3
Attention heads: 8
Sinusoidal positional encoding
GroupNorm

Only input representation changes.

Original STCLN:

(B×32×32) × T × 256

Swin-STCLN:

(B×8×8) × T × 192

Benefits:

reduces temporal tokens from 1024 → 64 locations
temporal reasoning happens on semantic regions
lower memory usage

3. STFusion — Cross Scale Fusion

Replacement for STCLN STA module.

Inputs:

Temporal branch:

coarse_agg

(B,192,8,8)

Spatial branch:

fine_agg

(B,96,16,16)

Process:

Upsample

8×8 →16×16

Projection

192 →96

Cross Attention

Query:

coarse semantic features

Key / Value:

fine spatial features

Residual fusion
Upsample

16×16 →32×32

Output:

(B,128,32,32)

4. Pretraining Reconstruction Decoder

The original STCLN self-supervised learning objective is preserved.

Unchanged:

Spatiotemporal masking
Mask ratio = 0.4
MSE reconstruction loss
Unlabeled pretraining patches = 7936

Difference:

Base STCLN reconstructs directly using a Linear layer because temporal features remain at 32×32.

Swin-STCLN reconstructs from hierarchical features.

TEncoder output:

(B,T,192,8,8)

Reconstruction Head:

ConvTranspose2D

8×8
 ↓
16×16


ConvTranspose2D

16×16
 ↓
32×32


Conv2D

192 → 10 Sentinel-2 bands

Final reconstruction:

(B,T,10,32,32)

The same masked pixel MSE objective is applied.

5. Finetuning Decoder

The original STCLN linear decoder is replaced with a boundary-aware segmentation decoder.

Input:

STFusion output

(B,128,32,32)

Semantic Decoder

Architecture:

Conv-BN-ReLU

128 → 64

Conv-BN-ReLU

64 →64

1×1 Conv classifier

Output:

(B,18,32,32)

Boundary Decoder

Parallel boundary prediction branch:

Conv-BN-ReLU

128 →64

Conv-BN-ReLU

64 →64

1×1 Conv

Output:

(B,1,32,32)

Boundary supervision is generated automatically using morphological gradient from semantic masks.

No additional annotation required.

Gated Refinement

Semantic feature

Boundary feature

↓

Concatenation

↓

Convolution refinement

↓

Final refined prediction:

(B,18,32,32)

6. Finetuning Loss

Original STCLN:

Cross Entropy

Swin-STCLN:

Total Loss =

CE(semantic output)

+

CE(refined output)

+

0.5 × BCE(boundary output)

The boundary loss improves separation between neighbouring crop parcels.

Boundary weight = 0.5

⚡ Inference Speed

Measured on AMD MI300X GPU.

Batch Size	Time (ms)	Throughput	VRAM Used
1	7.8 ms	128.9 patches/sec	1.12 GB
4	19.6 ms	203.8 patches/sec	2.08 GB
8	37.1 ms	215.4 patches/sec	3.40 GB
16	73.4 ms	218.0 patches/sec	6.03 GB
32	143.0 ms	223.8 patches/sec	11.31 GB
64	281.0 ms	227.8 patches/sec	21.85 GB

📁 Repository Structure

Swin-STCLN-PASTIS/

├── models/

│   ├── swin_encoder.py
│       # PatchEmbed + Swin Blocks + PatchMerging

│   ├── temporal_encoder.py
│       # STCLN Transformer Encoder

│   ├── stfusion.py
│       # Cross-scale attention fusion

│   ├── decoder.py
│       # Semantic decoder
│       # Boundary decoder
│       # Gated refinement

│   ├── reconstruction.py
│       # ConvTranspose reconstruction head

│   └── swin_stcln.py
│       # Complete architecture


├── datasets/

│   └── pastis_dataset.py


├── losses/

│   ├── segmentation_loss.py
│   └── boundary_loss.py


├── evaluation/

│   └── metrics.py


├── train.py

├── pretrain.py

├── finetune.py

├── visualize_results.py

├── checkpoints/

└── results/

🚀 Quick Start

Installation

git clone https://huggingface.co/Dhruv1000/Swin-STCLN-PASTIS

cd Swin-STCLN-PASTIS


pip install torch torchvision timm einops geopandas matplotlib scikit-learn

Training

Single fold:

python train.py \
    --data_root /path/to/PASTIS \
    --fold 1 \
    --epochs 100 \
    --batch_size 16 \
    --lr 5e-5 \
    --weight_decay 0.05 \
    --warmup_iters 500 \
    --num_workers 4 \
    --amp \
    --work_dir ./work_dirs/fold1

5-Fold Cross Validation

for fold in 1 2 3 4 5
do

python train.py \
    --data_root /path/to/PASTIS \
    --fold $fold \
    --epochs 100 \
    --batch_size 16 \
    --lr 5e-5 \
    --work_dir ./work_dirs/fold${fold}

done

Inference

import torch

from models.swin_stcln import build_swin_stcln


model = build_swin_stcln(
    num_classes=18
)


checkpoint = torch.load(
    "checkpoints/best_model.pth",
    weights_only=False
)


model.load_state_dict(
    checkpoint["model"]
)


model.eval()



# Input:
# Batch
# Time
# Sentinel-2 Bands
# Height
# Width


x = torch.randn(
    1,
    32,
    10,
    32,
    32
)


logits = model(x)


# Output:

# (1,18,32,32)


prediction = logits.argmax(dim=1)

📋 Training Configuration

Parameter	Value
Model	Swin-STCLN
Spatial Encoder	Swin Transformer
Temporal Encoder	STCLN Transformer Encoder
Fusion	Cross-scale STFusion
Optimizer	AdamW β=(0.9,0.999)
Learning rate	5e-5
Weight decay	0.05
Schedule	Warmup 500 iters + cosine decay
Batch size	16
Epochs	100
AMP	Enabled
Gradient clipping	max_norm=5.0
Loss	CE + CE + 0.5 BCE
Input bands	Sentinel-2 10 bands
Input size	32×32
Classes	18

📦 Dataset — PASTIS

The model is evaluated on the PASTIS (Panoptic Agricultural Satellite Time Series) benchmark.

Property	Details
Total patches	2,433 geo-referenced tiles
Satellite	Sentinel-2
Spectral bands	10
Temporal observations	61
Input crop	32×32 pixels
Classes	18 crop classes
Splits	Official 5-fold geographic split
Pretraining data	7936 unlabeled patches
Label fraction	2% labelled setting

The official train / validation / test split is preserved.

🌾 PASTIS Classes

ID	Class	Avg IoU
0	Background	50.98%
1	Meadow	56.69%
2	Soft winter wheat	71.47%
3	Corn	75.90%
4	Winter barley	58.64%
5	Winter rapeseed	75.55%
6	Spring barley	35.08%
7	Sunflower	54.87%
8	Grapevine	36.70%
9	Beet	73.90%
10	Winter triticale	22.34%
11	Winter durum wheat	44.56%
12	Fruits/veg/flowers	21.53%
13	Potatoes	36.34%
14	Leguminous fodder	22.41%
15	Soybeans	61.48%
16	Orchard	17.27%
17	Mixed cereal	14.66%
18	Sorghum	13.91%

🔥 What Remains Identical To Original STCLN

The following components are unchanged:

✅ Pretrain → finetune workflow

✅ Spatiotemporal masked reconstruction

✅ Mask ratio:

0.4

✅ Reconstruction objective:

Mean Squared Error

✅ Temporal Encoder design:

3 Transformer layers
8 attention heads
sinusoidal positional encoding
GroupNorm

✅ Dataset protocol:

PASTIS32
Sentinel-2
10 spectral channels
Official folds

✅ Weight transfer:

Pretrained:

SEncoder
+
TEncoder


        ↓


Finetuning initialization

🆚 STCLN vs Swin-STCLN

Component	STCLN	Swin-STCLN
Spatial Encoder	DoubleConv CNN	Swin Transformer
Spatial hierarchy	Single scale	Multi scale
Feature output	256 @32×32	96@16×16 + 192@8×8
Spatial attention	Local CNN	Window self-attention
Temporal input	1024 spatial tokens	64 semantic tokens
TEncoder	Transformer	Same Transformer
Fusion	STA	Cross-scale STFusion
Reconstruction	Linear	ConvTranspose decoder
Decoder	Linear classifier	Semantic + Boundary Decoder
Boundary learning	No	Yes
Final refinement	No	Gated refinement
Loss	CE	CE + CE + Boundary BCE

📈 Training Dynamics (Fold 1)

Epoch	Train Loss	Val Loss	mFscore	mIoU	Kappa
1	0.878	0.671	2.79%	1.47%	2.39%
4	0.431	0.472	20.08%	12.34%	10.76%
10	0.320	0.380	~33%	~22%	~24%
18	0.222	0.323	35.55%	24.17%	24.93%
55	0.083	0.363	53.37%	39.92%	53.16%
92	0.050	0.350	58.20%	44.20%	60.0%
100	0.048	0.360	57.90%	44.10%	59.8%

Best checkpoint:

Epoch 92

Total training time per fold:

~32 minutes on AMD MI300X

🖼️ Visualizations

Generated evaluation plots:

Per Fold

results/fold{N}/plots/

Includes:

Training curves
Per-class IoU plots
Metrics radar chart
Confusion matrix
Prediction maps
Error maps
IoU scatter analysis
Overfitting analysis

🔧 Implementation Details

Pure PyTorch Implementation

No dependency on:

MMSegmentation
MMEngine
external training framework

Implemented with:

PyTorch
timm
einops

Major Architectural Changes

1. Hierarchical Spatial Learning

CNN encoder replaced by Swin Transformer blocks.

Benefits:

larger receptive field
window attention
shifted window information exchange

2. Efficient Temporal Modelling

Instead of:

1024 temporal sequences/image

Swin-STCLN uses:

64 temporal sequences/image

This reduces memory while keeping semantic information.

3. Cross Scale Feature Recovery

The coarse temporal representation loses boundaries.

STFusion restores:

high level semantics
fine spatial details

using cross attention fusion.

4. Boundary Aware Learning

Boundary decoder learns crop separation using automatically generated masks.

No manual boundary annotation needed.

📚 Citation

If this implementation is useful, cite the original STCLN work and PASTIS benchmark.

@inproceedings{garnot2021pastis,

title={Panoptic Segmentation of Satellite Image Time Series with Convolutional Temporal Attention Networks},

author={Garnot, Vivien Sainte Fare and Landrieu, Loic},

booktitle={ICCV},

year={2021}

}

For Swin Transformer:

@inproceedings{liu2021swin,

title={Swin Transformer: Hierarchical Vision Transformer using Shifted Windows},

author={Liu, Ze and Lin, Yutong and Cao, Yue and others},

booktitle={ICCV},

year={2021}

}

📄 License

Apache-2.0

Trained on AMD MI300X
ROCm 7.0
PyTorch 2.x

Swin-STCLN × PASTIS

Hierarchical Swin Spatial Encoder
+ STCLN Temporal Encoder
+ Cross Scale Boundary-Aware Fusion

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Evaluation results

Mean IoU (5-fold CV) on PASTIS
self-reported

44.430
F1 Score (5-fold CV) on PASTIS
self-reported

58.350
Overall Accuracy (5-fold CV) on PASTIS
self-reported

68.360