---
title: Dense-Iso-ViT SR
emoji: 🔭
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: "5.16.0"
python_version: "3.10"
app_file: app.py
pinned: true
license: mit
tags:
  - image-super-resolution
  - vision-transformer
  - super-resolution
  - isotropic-vit
  - dense-connections
  - GDFN
  - pytorch
  - image-restoration
  - computer-vision
  - encoder-only-vision-transformer-super-resolution
  - constant-spatial-resolution
  - DenseNet-style-feature-propagation
  - isotropic-token-grid
  - gated-depthwise-feed-forward-network
---

# Dense-Iso-ViT
## Constant-Resolution Hierarchical Vision Transformer for ×4 Image Super-Resolution

---

## What is Dense-Iso-ViT?

Dense-Iso-ViT is a pure Vision Transformer built from scratch for ×4 image
super-resolution. The core design idea: keep the spatial token grid constant
throughout all transformer stages — no patch merging, no spatial compression —
and connect all stage outputs directly to the reconstruction head using
DenseNet-style dense concatenation.

The name captures the two central ideas:
- **Dense** — dense inter-stage feature aggregation (DenseNet principle applied
  to transformers)
- **Iso** — isotropic spatial resolution, constant 16×16 token grid across all
  4 stages

---

## Architecture

### How it works

```
Input [B, 3, 64, 64]
  → PatchEmbedding (patch=4) → [B, 256, 192]   (16×16 grid, fixed throughout)

  → Stage 1: 3× TransformerBlock(embed=192, heads=6, GDFN) → h1 [B, 256, 192]
  → Linear(192→256)

  → Stage 2: 3× TransformerBlock(embed=256, heads=8, GDFN) → h2 [B, 256, 256]
  → Linear(256→288)

  → Stage 3: 3× TransformerBlock(embed=288, heads=6, GDFN) → h3 [B, 256, 288]
  → Linear(288→384)

  → Stage 4: 3× TransformerBlock(embed=384, heads=8, GDFN) → h4 [B, 256, 384]

  → Dense concat: cat([h1, h2, h3, h4]) → [B, 256, 1120] → [B, 1120, 16, 16]
  → SR Head: fusion conv → 4× PixelShuffle → [B, 3, 256, 256]
  → + F.interpolate(lr_img, 256×256) bilinear skip
  → Output [B, 3, 256, 256]
```

### Design decisions and reasoning

**Isotropic token grid — constant 16×16 spatial resolution**

All 4 transformer stages operate on the same 256 tokens (16×16 grid). No patch
merging, no token downsampling at any point. Every token maps to the same 4×4
pixel region of the 64×64 input (a 16×16 region of the 256×256 output) from
embedding through to reconstruction, so spatial coordinates are preserved
exactly throughout the network.
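
The token-to-pixel correspondence can be sketched as follows. This is a minimal illustration, not the Space's actual code: the `PatchEmbedding` here is assumed to be a strided convolution (a common way to implement non-overlapping patch projection), and `token_to_pixel_box` is a hypothetical helper added only to show the mapping.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Non-overlapping patch embedding: [B, 3, H, W] -> [B, (H/p)*(W/p), embed_dim]."""
    def __init__(self, in_ch=3, embed_dim=192, patch=4):
        super().__init__()
        # Assumption: a strided conv implements the per-patch linear projection.
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch, stride=patch)

    def forward(self, x):
        x = self.proj(x)                     # [B, embed_dim, H/p, W/p]
        return x.flatten(2).transpose(1, 2)  # [B, N, embed_dim], row-major tokens

def token_to_pixel_box(idx, grid=16, patch=4):
    """Hypothetical helper: (top, left, bottom, right) of the LR region for token idx."""
    row, col = divmod(idx, grid)
    return row * patch, col * patch, (row + 1) * patch, (col + 1) * patch

tokens = PatchEmbedding()(torch.randn(2, 3, 64, 64))  # 64/4 = 16 -> 256 tokens
```

Because no stage changes the token count, the same index-to-region mapping holds for the output of every stage.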

**Hierarchical embed dims [192, 256, 288, 384]**

Representational capacity increases with depth. Early stages are intended to
capture local edges and textures, for which 192 dimensions should suffice.
Deeper stages are intended to model global scene structure and semantics, so
384 dimensions add capacity where the task is harder. Linear projections
between stages change the channel dimension without affecting spatial
resolution.
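
A rough sketch of the stage and projection layout, under one loud assumption: the blocks below are PyTorch's built-in `nn.TransformerEncoderLayer` with a plain MLP, used purely as a stand-in for the model's GDFN-based `TransformerBlock`. The `make_stage` and `run_stages` names and the `dim_feedforward` value are illustrative, not taken from the source.

```python
import torch
import torch.nn as nn

EMBED_DIMS = [192, 256, 288, 384]
HEADS = [6, 8, 6, 8]
DEPTH = 3

def make_stage(dim, heads, depth=DEPTH):
    # Stand-in block (assumption): built-in encoder layer, not the GDFN block.
    return nn.Sequential(*[
        nn.TransformerEncoderLayer(dim, heads, dim_feedforward=2 * dim,
                                   batch_first=True, norm_first=True)
        for _ in range(depth)
    ])

stages = nn.ModuleList([make_stage(d, h) for d, h in zip(EMBED_DIMS, HEADS)])
# Linear projections widen the channels between stages; the token count is untouched.
projs = nn.ModuleList([nn.Linear(a, b)
                       for a, b in zip(EMBED_DIMS[:-1], EMBED_DIMS[1:])])

def run_stages(x):
    """x: [B, 256, 192] -> list of per-stage outputs h1..h4."""
    outs = []
    for i, stage in enumerate(stages):
        x = stage(x)
        outs.append(x)          # keep the stage output before projecting
        if i < len(projs):
            x = projs[i](x)     # e.g. 192 -> 256 ahead of stage 2
    return outs

with torch.no_grad():
    feats = run_stages(torch.randn(1, 256, 192))
shapes = [tuple(f.shape) for f in feats]
```

Only the last dimension changes from stage to stage; the 256-token axis stays fixed.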

**Dense inter-stage concatenation**

Outputs from all 4 stages are concatenated before the reconstruction head:
`cat([h1, h2, h3, h4]) → [B, 256, 1120]`. The head receives low-level edge
information (stage 1) and high-level semantic context (stage 4) simultaneously,
without early-stage features being filtered through subsequent blocks.
The design follows DenseNet's feature-reuse principle, applied across transformer stages.
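
The concatenation and the reshape back to a feature map can be sketched like this. `dense_concat` is a hypothetical helper name, and the row-major token ordering is an assumption consistent with a standard patch embedding.

```python
import torch

def dense_concat(feats, grid=16):
    """Concatenate per-stage token features on the channel axis and reshape
    to a feature map: list of [B, N, C_i] -> [B, sum(C_i), grid, grid]."""
    x = torch.cat(feats, dim=-1)          # [B, N, sum(C_i)]
    b, n, c = x.shape
    assert n == grid * grid, "token count must match the grid"
    # Assumes row-major token order: token k sits at (k // grid, k % grid).
    return x.transpose(1, 2).reshape(b, c, grid, grid)

# Random stand-ins for h1..h4 with the channel widths of the four stages.
feats = [torch.randn(2, 256, c) for c in (192, 256, 288, 384)]
fused = dense_concat(feats)               # 192 + 256 + 288 + 384 = 1120 channels
```

Stage 1 features occupy the first 192 channels of `fused` unchanged, which is what lets the head read them without passing through later blocks.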

**Gated Depthwise Feed-Forward Network (GDFN)**

Standard transformer MLPs process each token independently, with no spatial
awareness in the feed-forward step. GDFN replaces the MLP with gated depthwise
3×3 convolutions, giving each token access to its 8 spatial neighbors during
the feed-forward computation. Local spatial context is injected in every
transformer block, at almost no extra parameter cost.

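```python
import torch.nn as nn
import torch.nn.functional as F

class GDFN(nn.Module):  # replaces standard Linear → SiLU → Linear
    def __init__(self, embed_dim, hidden, grid=16):
        super().__init__()
        self.grid = grid
        self.g_proj = nn.Linear(embed_dim, 2 * hidden)
        self.path_1 = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.path_2 = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.out_proj = nn.Linear(hidden, embed_dim)

    def forward(self, x):  # x: [B, N, embed_dim], N = grid * grid
        g = self.g_proj(x).transpose(1, 2).unflatten(2, (self.grid, self.grid))
        a, b = g.chunk(2, dim=1)  # assumed: one half of g_proj per path
        y = self.path_1(a) * F.gelu(self.path_2(b))  # gate: path1 × GELU(path2)
        return self.out_proj(y.flatten(2).transpose(1, 2))
```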

**Bilinear skip connection**

`output = F.interpolate(lr_img, size=(256, 256), mode="bilinear") + vit_residual`

The model learns a residual correction on top of a bilinear upscale of the
input, not a full reconstruction from scratch. Residual learning of this kind
typically converges faster and trains more stably.
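
A possible shape-correct sketch of the head and skip follows. The internals are assumptions: going from the 16×16 grid to 256×256 requires ×16 upsampling overall, so "4× PixelShuffle" is read here as four ×2 `nn.PixelShuffle` stages. The `SRHead` name, the `mid_ch` width, and the final `to_rgb` conv are hypothetical choices made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SRHead(nn.Module):
    def __init__(self, in_ch=1120, mid_ch=64, out_ch=3, n_stages=4):
        super().__init__()
        self.fuse = nn.Conv2d(in_ch, mid_ch, 3, padding=1)   # fusion conv
        ups = []
        for _ in range(n_stages):                             # 2**4 = x16 overall
            ups += [nn.Conv2d(mid_ch, mid_ch * 4, 3, padding=1), nn.PixelShuffle(2)]
        self.up = nn.Sequential(*ups)
        self.to_rgb = nn.Conv2d(mid_ch, out_ch, 3, padding=1)

    def forward(self, feat, lr_img):
        residual = self.to_rgb(self.up(self.fuse(feat)))      # [B, 3, 256, 256]
        base = F.interpolate(lr_img, size=residual.shape[-2:],
                             mode="bilinear", align_corners=False)
        return base + residual                                # bilinear skip

head = SRHead()
lr = torch.rand(1, 3, 64, 64)
with torch.no_grad():
    out = head(torch.randn(1, 1120, 16, 16), lr)
```

If the residual branch outputs zeros, the result is exactly the bilinear upscale, so the network only has to learn the missing high-frequency detail.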

### Summary

| Component | Choice |
|-----------|--------|
| Attention | Global self-attention, O(N²), N=256 |
| Spatial resolution | Constant 16×16 throughout |
| Embed dims | [192, 256, 288, 384] |
| Heads | [6, 8, 6, 8] |
| Depths | [3, 3, 3, 3] — 12 blocks total |
| Feed-forward | GDFN (gated depthwise conv) |
| Feature routing | Dense concat all 4 stages |
| Skip | Bilinear upscale + residual |
| Parameters | 23.8M |
| Model size | 90.99 MB (fp32) |
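
The figures in the table can be cross-checked with a few lines of arithmetic. This is only a sanity check on the listed configuration. The size estimate assumes 4 bytes per fp32 parameter and binary mebibytes, and uses the rounded 23.8M count, so it lands slightly below the stated 90.99 MB.

```python
EMBED_DIMS = [192, 256, 288, 384]
HEADS = [6, 8, 6, 8]
DEPTHS = [3, 3, 3, 3]
GRID = 16

# Each embed dim must divide evenly across its attention heads.
assert all(d % h == 0 for d, h in zip(EMBED_DIMS, HEADS))
head_dims = [d // h for d, h in zip(EMBED_DIMS, HEADS)]

n_tokens = GRID * GRID             # N = 256
attn_entries = n_tokens ** 2       # attention matrix entries per head, per image
concat_channels = sum(EMBED_DIMS)  # width of the dense concat
total_blocks = sum(DEPTHS)

# Rough fp32 footprint from the rounded parameter count (assumes MiB units).
fp32_mib = 23.8e6 * 4 / 2**20
```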

---

## Results

| Benchmark | PSNR | SSIM |
|-----------|------|------|
| DIV2K validation | **25.20 dB** | **0.8298** |
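
For reference, PSNR is conventionally computed from the mean squared error as below. This is a generic sketch, not the evaluation script behind the table: the source does not state whether the metric was taken on RGB or on the Y channel, or whether borders were cropped, so none of that is modelled here. Images are assumed to be scaled to [0, 1].

```python
import torch

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for tensors scaled to [0, max_val]."""
    mse = torch.mean((pred - target) ** 2)
    return float(10 * torch.log10(max_val ** 2 / mse))

# A uniform error of 0.1 gives MSE = 0.01, i.e. 10 * log10(1 / 0.01) = 20 dB.
target = torch.zeros(1, 3, 8, 8)
pred = torch.full_like(target, 0.1)
value = psnr(pred, target)
```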

---

## Keywords

encoder-only vision transformer super-resolution ·
DenseNet-style skip connections and feature propagation ·
constant spatial resolution across 4 hierarchical stages ·
isotropic token grid no patch merging ·
gated depthwise feed-forward network GDFN ·
×4 sub-pixel convolution upscale shuffling ·
dense inter-stage feature aggregation ·
pure ViT image restoration ·
Dense-Iso-ViT

---

## License

MIT — free to use, modify, and distribute with attribution.