---
title: Dense-Iso-ViT SR
emoji: 🔭
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: "5.16.0"
python_version: "3.10"
app_file: app.py
pinned: true
license: mit
tags:
  - image-super-resolution
  - vision-transformer
  - super-resolution
  - isotropic-vit
  - dense-connections
  - GDFN
  - pytorch
  - image-restoration
  - computer-vision
  - encoder-only-vision-transformer-super-resolution
  - constant-spatial-resolution
  - DenseNet-style-feature-propagation
  - isotropic-token-grid
  - gated-depthwise-feed-forward-network
---

# Dense-Iso-ViT

## Constant-Resolution Hierarchical Vision Transformer for ×4 Image Super-Resolution

---

## What is Dense-Iso-ViT?

Dense-Iso-ViT is a pure Vision Transformer built from scratch for ×4 image super-resolution. The core design idea is to keep the spatial token grid constant throughout all transformer stages (no patch merging, no spatial compression) and to connect all stage outputs directly to the reconstruction head using DenseNet-style dense concatenation.

The name captures the two central ideas:

- **Dense**: dense inter-stage feature aggregation (the DenseNet principle applied to transformers)
- **Iso**: isotropic spatial resolution, a constant 16×16 token grid across all 4 stages

---

## Architecture

### How it works

```
Input [B, 3, 64, 64]
  → PatchEmbedding (patch=4) → [B, 256, 192]   (16×16 grid, fixed throughout)
  → Stage 1: 3× TransformerBlock(embed=192, heads=6, GDFN) → h1 [B, 256, 192]
  → Linear(192→256)
  → Stage 2: 3× TransformerBlock(embed=256, heads=8, GDFN) → h2 [B, 256, 256]
  → Linear(256→288)
  → Stage 3: 3× TransformerBlock(embed=288, heads=6, GDFN) → h3 [B, 256, 288]
  → Linear(288→384)
  → Stage 4: 3× TransformerBlock(embed=384, heads=8, GDFN) → h4 [B, 256, 384]
  → Dense concat: cat([h1, h2, h3, h4]) → [B, 256, 1120] → [B, 1120, 16, 16]
  → SR Head: fusion conv → 4× PixelShuffle → [B, 3, 256, 256]
  → + F.interpolate(lr_img, 256×256) bilinear skip
  → Output [B, 3, 256, 256]
```

### Design decisions and reasoning

**Isotropic token grid: constant 16×16 spatial resolution**

All 4 transformer stages operate on the same 256 tokens (16×16 grid). There is no patch merging and no token downsampling at any point. Every token maps to the same 4×4 pixel region from input through to reconstruction, so spatial coordinates are preserved exactly throughout the network.

**Hierarchical embed dims [192, 256, 288, 384]**

Representational capacity increases with depth. Early stages learn local edges and textures, for which 192 dimensions are sufficient. Deeper stages reason about global scene structure and semantics, and 384 dimensions give more capacity where the task is harder. Linear projections between stages change the channel dimension without affecting spatial resolution.

**Dense inter-stage concatenation**

Outputs from all 4 stages are concatenated before the reconstruction head: `cat([h1, h2, h3, h4]) → [B, 256, 1120]`. The head receives low-level edge information (stage 1) and high-level semantic context (stage 4) simultaneously, without early-stage features being filtered through subsequent blocks. This is inspired by DenseNet's feature reuse principle, applied across transformer stages.

**Gated Depthwise Feed-Forward Network (GDFN)**

Standard transformer MLPs process each token independently, so the feed-forward step has no spatial awareness. GDFN replaces the MLP with gated depthwise 3×3 convolutions, giving each token access to its 8 spatial neighbors during the feed-forward computation. This injects local spatial context in every transformer block at little extra parameter cost.
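Stepping back from the individual blocks, the stage widths, dense concatenation, and bilinear skip described above can be traced with a shape-only sketch. This is not the author's implementation. The class name `DenseIsoShapeSketch` and the `head_width` parameter are hypothetical, the transformer stages are replaced by `nn.Identity`, the patch embedding is assumed to be a strided convolution, and the head layout (a 1×1 fusion conv followed by four conv + `PixelShuffle(2)` steps) is one possible reading of "fusion conv → 4× PixelShuffle".

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIMS = [192, 256, 288, 384]  # per-stage widths from the README


class DenseIsoShapeSketch(nn.Module):
    """Shape-only stand-in for the data flow; not the author's implementation."""

    def __init__(self, dims=EMBED_DIMS, patch=4, head_width=32):
        super().__init__()
        self.patch = patch
        # Assumption: patch embedding is a strided conv (64×64 image -> 16×16 grid).
        self.patch_embed = nn.Conv2d(3, dims[0], kernel_size=patch, stride=patch)
        # Placeholder stages: the real model runs 3 transformer blocks per stage.
        self.stages = nn.ModuleList([nn.Identity() for _ in dims])
        # Linear projections change the channel width only, never the token count.
        self.projs = nn.ModuleList(
            [nn.Linear(c_in, c_out) for c_in, c_out in zip(dims[:-1], dims[1:])]
        )
        # Assumed head: 1×1 fusion conv, then four conv + PixelShuffle(2) steps
        # (16 -> 32 -> 64 -> 128 -> 256), then a conv down to RGB.
        layers = [nn.Conv2d(sum(dims), head_width, kernel_size=1)]
        for _ in range(4):
            layers += [
                nn.Conv2d(head_width, 4 * head_width, kernel_size=3, padding=1),
                nn.PixelShuffle(2),
            ]
        layers.append(nn.Conv2d(head_width, 3, kernel_size=3, padding=1))
        self.head = nn.Sequential(*layers)

    def dense_features(self, lr):
        """Return the concatenated stage outputs as tokens: [B, 256, 1120]."""
        x = self.patch_embed(lr).flatten(2).transpose(1, 2)  # [B, 256, 192]
        feats = []
        for i, stage in enumerate(self.stages):
            x = stage(x)
            feats.append(x)              # keep every stage output
            if i < len(self.projs):
                x = self.projs[i](x)     # widen channels for the next stage
        return torch.cat(feats, dim=-1)

    def forward(self, lr):
        b, _, h, w = lr.shape
        dense = self.dense_features(lr)
        # Tokens back to a feature map on the unchanged 16×16 grid.
        fmap = dense.transpose(1, 2).reshape(b, -1, h // self.patch, w // self.patch)
        residual = self.head(fmap)       # [B, 3, 256, 256]
        skip = F.interpolate(lr, scale_factor=4, mode="bilinear", align_corners=False)
        return skip + residual


lr = torch.randn(1, 3, 64, 64)
model = DenseIsoShapeSketch()
print(model.dense_features(lr).shape)  # torch.Size([1, 256, 1120])
print(model(lr).shape)                 # torch.Size([1, 3, 256, 256])
```

The token count stays at 256 from the patch embedding to the concatenation. Only the channel width changes, which is why the four stage outputs can be concatenated along the last dimension without any resampling.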
In condensed form, the GDFN layers look like this:

```python
import torch
import torch.nn.functional as F
from torch.nn import Conv2d, Linear

embed_dim, hidden = 192, 384        # hidden width is a placeholder, not stated in this README
x = torch.randn(1, 256, embed_dim)  # [B, N, C] tokens on the 16×16 grid

# GDFN: replaces the standard Linear → SiLU → Linear MLP
g_proj   = Linear(embed_dim, 2 * hidden)
path_1   = Conv2d(hidden, hidden, 3, padding=1, groups=hidden)  # depthwise 3×3
path_2   = Conv2d(hidden, hidden, 3, padding=1, groups=hidden)  # depthwise 3×3
out_proj = Linear(hidden, embed_dim)

# project, reshape tokens to a 16×16 map, split into two branches (split assumed from 2 * hidden)
a, b = g_proj(x).transpose(1, 2).reshape(1, 2 * hidden, 16, 16).chunk(2, dim=1)
# gate: path1 × GELU(path2), then back to tokens
x = out_proj((path_1(a) * F.gelu(path_2(b))).flatten(2).transpose(1, 2))
```

**Bilinear skip connection**

`output = F.interpolate(lr, size=(256, 256), mode="bilinear") + vit_residual`

The model learns a residual correction on top of a bilinear upscale of the input rather than reconstructing the full image from scratch, which speeds up convergence and makes training more stable.

### Summary

| Component | Choice |
|-----------|--------|
| Attention | Global self-attention, O(N²), N=256 |
| Spatial resolution | Constant 16×16 throughout |
| Embed dims | [192, 256, 288, 384] |
| Heads | [6, 8, 6, 8] |
| Depths | [3, 3, 3, 3] (12 blocks total) |
| Feed-forward | GDFN (gated depthwise conv) |
| Feature routing | Dense concat of all 4 stages |
| Skip | Bilinear upscale + residual |
| Parameters | 23.8M |
| Model size | 90.99 MB (fp32) |

---

## Results

| Benchmark | PSNR | SSIM |
|-----------|------|------|
| DIV2K validation | **25.20 dB** | **0.8298** |

---

## Keywords

encoder-only vision transformer super-resolution · DenseNet-style skip connections and feature propagation · constant spatial resolution across 4 hierarchical stages · isotropic token grid no patch merging · gated depthwise feed-forward network GDFN · ×4 sub-pixel convolution upscale shuffling · dense inter-stage feature aggregation · pure ViT image restoration · Dense-Iso-ViT

---

## License

MIT: free to use, modify, and distribute with attribution.