---
title: Dense-Iso-ViT SR
emoji: π
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: "5.16.0"
app_file: app.py
pinned: true
license: mit
tags:
  - image-super-resolution
  - vision-transformer
  - super-resolution
  - isotropic-vit
  - dense-connections
  - GDFN
  - pytorch
  - image-restoration
  - computer-vision
  - encoder-only-vision-transformer-super-resolution
  - constant-spatial-resolution
  - DenseNet-style-feature-propagation
  - isotropic-token-grid
  - gated-depthwise-feed-forward-network
---
# Dense-Iso-ViT

## Constant-Resolution Hierarchical Vision Transformer for ×4 Image Super-Resolution

**Author:** Sathya77 | **Parameters:** 23.8M | **Training:** from scratch, no pretrained weights

---
## What is Dense-Iso-ViT?

Dense-Iso-ViT is a pure Vision Transformer built from scratch for ×4 image
super-resolution. The core design idea is to keep the spatial token grid constant
through all transformer stages (no patch merging, no spatial compression)
and to connect all stage outputs directly to the reconstruction head using
DenseNet-style dense concatenation.

The name captures the two central ideas:

- **Dense**: dense inter-stage feature aggregation (the DenseNet principle applied
  to transformers)
- **Iso**: isotropic spatial resolution, a constant 16×16 token grid across all
  4 stages

---
## Architecture

### How it works

```
Input [B, 3, 64, 64]
  ↓ PatchEmbedding (patch=4) → [B, 256, 192]   (16×16 grid, fixed throughout)
  ↓ Stage 1: 3× TransformerBlock(embed=192, heads=6, GDFN) → h1 [B, 256, 192]
  ↓ Linear(192→256)
  ↓ Stage 2: 3× TransformerBlock(embed=256, heads=8, GDFN) → h2 [B, 256, 256]
  ↓ Linear(256→288)
  ↓ Stage 3: 3× TransformerBlock(embed=288, heads=6, GDFN) → h3 [B, 256, 288]
  ↓ Linear(288→384)
  ↓ Stage 4: 3× TransformerBlock(embed=384, heads=8, GDFN) → h4 [B, 256, 384]
  ↓ Dense concat: cat([h1, h2, h3, h4]) → [B, 256, 1120] → [B, 1120, 16, 16]
  ↓ SR Head: fusion conv → 4× PixelShuffle → [B, 3, 256, 256]
  ↓ + F.interpolate(lr_img, 256×256)   bilinear skip
  ↓ Output [B, 3, 256, 256]
```
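
The spatial sizes in the diagram can be checked with a few lines of arithmetic. The sketch below is illustrative only: `trace_resolution` is a hypothetical helper, and it assumes that "4× PixelShuffle" means four stacked `PixelShuffle(2)` stages, which is one reading consistent with going from a 16×16 feature map to a 256×256 output.

```python
import math

def trace_resolution(lr_size=64, patch=4, sr_scale=4, shuffle_factor=2):
    """Trace spatial sizes from the LR input to the SR output (illustrative)."""
    grid = lr_size // patch            # tokens per side after patch embedding
    tokens = grid * grid               # sequence length seen by every stage
    out_size = lr_size * sr_scale      # target SR resolution
    head_upscale = out_size // grid    # factor the SR head has to recover
    # Assumption: the head stacks PixelShuffle(shuffle_factor) stages.
    n_shuffles = round(math.log(head_upscale, shuffle_factor))
    return {
        "grid": grid,
        "tokens": tokens,
        "out_size": out_size,
        "head_upscale": head_upscale,
        "n_shuffles": n_shuffles,
    }

info = trace_resolution()
print(info)
```

Under these assumptions the head has to upscale by 16 in each dimension: a factor of 4 to undo the patch embedding and a further factor of 4 for the super-resolution itself.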
### Design decisions and reasoning

**Isotropic token grid: constant 16×16 spatial resolution**

All 4 transformer stages operate on the same 256 tokens (16×16 grid), with no patch
merging or token downsampling at any point. Every token maps to the same 4×4
pixel region from input through to reconstruction, so spatial coordinates are
preserved exactly throughout the network.
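
To make the fixed token-to-pixel correspondence concrete, here is a small sketch. `token_footprint` is a hypothetical helper (not part of the model code), and row-major token ordering is an assumption.

```python
def token_footprint(token_idx, grid=16, patch=4, sr_scale=4):
    """Return the LR and SR pixel boxes (top, left, bottom, right) of one token."""
    row, col = divmod(token_idx, grid)   # row-major token order is assumed
    lr_box = (row * patch, col * patch, (row + 1) * patch, (col + 1) * patch)
    s = patch * sr_scale                 # each token covers s x s output pixels
    sr_box = (row * s, col * s, (row + 1) * s, (col + 1) * s)
    return lr_box, sr_box

lr_box, sr_box = token_footprint(17)
print(lr_box, sr_box)
```

Because no stage changes the grid, the same index-to-region mapping holds for the output of every stage, which is what allows the stage outputs to be stacked channel-wise later on.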
**Hierarchical embed dims [192, 256, 288, 384]**

Representational capacity increases with depth. Early stages learn local edges
and textures, for which 192 dimensions is sufficient. Deep stages reason about global
scene structure and semantics, so 384 dimensions provide more capacity where the
task is harder. Linear projections between stages change the channel
dimension without affecting spatial resolution.
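
As a quick consistency check on the stage widths, the following sketch derives the per-head width in each stage and the size of the three inter-stage projections. It assumes each projection is a plain `Linear(in, out)` with a bias term; the variable names are illustrative.

```python
embed_dims = [192, 256, 288, 384]
heads = [6, 8, 6, 8]

# Per-head width in each stage: embed_dim must divide evenly by the head count.
assert all(d % h == 0 for d, h in zip(embed_dims, heads))
head_dims = [d // h for d, h in zip(embed_dims, heads)]

# Parameters of each inter-stage Linear(in, out), assuming a bias term.
proj_params = [i * o + o for i, o in zip(embed_dims[:-1], embed_dims[1:])]

print(head_dims)          # [32, 32, 48, 48]
print(proj_params)        # [49408, 74016, 110976]
print(sum(proj_params))   # 234400
```

Under that assumption the projections add up to about 0.23M parameters, roughly 1% of the reported 23.8M, so widening between stages is cheap compared with the transformer blocks themselves.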
**Dense inter-stage concatenation**

Outputs from all 4 stages are concatenated before the reconstruction head:
`cat([h1, h2, h3, h4]) → [B, 256, 1120]`. The head receives low-level edge
information (stage 1) and high-level semantic context (stage 4) simultaneously,
without early-stage features being filtered through subsequent blocks.
The design is inspired by DenseNet's feature reuse principle, applied across transformer stages.
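
A minimal sketch of the concatenation and the token-to-grid reshape, using random tensors as stand-ins for the stage outputs. The exact reshape in the model may differ; row-major token order is assumed here.

```python
import torch

B, N, grid = 2, 256, 16
embed_dims = [192, 256, 288, 384]

# Stand-ins for the four stage outputs h1..h4, each of shape [B, N, C_i].
stage_outputs = [torch.randn(B, N, c) for c in embed_dims]

# Concatenate along the channel axis: [B, 256, 1120].
dense = torch.cat(stage_outputs, dim=-1)

# Fold tokens back into a 2D feature map for the convolutional head:
# [B, N, C] -> [B, C, N] -> [B, C, 16, 16] (row-major token order assumed).
feat = dense.transpose(1, 2).reshape(B, -1, grid, grid)

print(dense.shape, feat.shape)
```

The first 192 channels of `feat` are exactly the stage 1 output, untouched by later stages, which is the feature-reuse property the paragraph above describes.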
**Gated Depthwise Feed-Forward Network (GDFN)**

Standard transformer MLPs process each token independently, so the feed-forward
step has no spatial awareness. GDFN replaces the MLP with gated depthwise
3×3 convolutions, giving each token access to its 8 spatial neighbors during
the feed-forward computation. Local spatial context is injected in every
transformer block, at almost no extra parameter cost.
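```python
# GDFN: replaces the standard Linear → SiLU → Linear MLP (schematic)
# Assumes embed_dim, hidden and the token tensor x are defined by the enclosing block.
import torch.nn.functional as F
from torch.nn import Conv2d, Linear

g_proj = Linear(embed_dim, 2 * hidden)
path_1 = Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
path_2 = Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
out_proj = Linear(hidden, embed_dim)

# x: [B, 256, embed_dim] tokens → [B, 2*hidden, 16, 16] grid, split into two branches
# (the reshape and split are assumed; the original snippet left them implicit)
a, b = g_proj(x).transpose(1, 2).reshape(x.shape[0], -1, 16, 16).chunk(2, dim=1)
# gate: path_1 × GELU(path_2), then flatten the grid back to tokens
x = out_proj((path_1(a) * F.gelu(path_2(b))).flatten(2).transpose(1, 2))
```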
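
The "almost no extra parameter cost" claim can be sanity-checked by counting parameters per component. This is a rough sketch: `gdfn_param_counts` is a hypothetical helper, biases are assumed on every layer, and the hidden width of 4 × `embed_dim` is an assumption, since the README does not state the expansion ratio.

```python
def gdfn_param_counts(embed_dim, hidden):
    """Count GDFN parameters by component (biases included)."""
    g_proj = embed_dim * 2 * hidden + 2 * hidden   # Linear(embed_dim, 2*hidden)
    depthwise = 2 * (hidden * 3 * 3 + hidden)      # two 3x3 depthwise convs
    out_proj = hidden * embed_dim + embed_dim      # Linear(hidden, embed_dim)
    return {"g_proj": g_proj, "depthwise": depthwise, "out_proj": out_proj}

# Hypothetical expansion ratio of 4; the actual hidden width is not documented.
counts = gdfn_param_counts(embed_dim=192, hidden=4 * 192)
share = counts["depthwise"] / sum(counts.values())
print(counts, f"depthwise share: {share:.1%}")
```

With these assumed widths the two depthwise convolutions account for only a few percent of the block's feed-forward parameters, because a depthwise 3×3 kernel costs 9 weights per channel instead of one weight per channel pair.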
**Bilinear skip connection**

`output = F.interpolate(lr, 256×256) + vit_residual`

The model learns a residual correction on top of a bilinear upscale of the
input rather than reconstructing the full image from scratch. This speeds up
convergence and makes training more stable.
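
The skip path can be written out as follows. This is a sketch under assumptions: the tensor names are placeholders, the residual is a zero stand-in for the transformer branch, and `align_corners=False` is a guess at the setting used.

```python
import torch
import torch.nn.functional as F

lr = torch.rand(1, 3, 64, 64)                # low-resolution input in [0, 1]
vit_residual = torch.zeros(1, 3, 256, 256)   # stand-in for the transformer branch

# Bilinear upscale of the input; align_corners=False is an assumption.
base = F.interpolate(lr, size=(256, 256), mode="bilinear", align_corners=False)

# Final prediction: interpolated image plus the learned correction.
output = base + vit_residual

print(output.shape)
```

With a zero residual the output is exactly the bilinear upscale, so an untrained network that emits near-zero corrections already starts from a reasonable baseline image.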
### Summary

| Component | Choice |
|-----------|--------|
| Attention | Global self-attention, O(N²), N=256 |
| Spatial resolution | Constant 16×16 throughout |
| Embed dims | [192, 256, 288, 384] |
| Heads | [6, 8, 6, 8] |
| Depths | [3, 3, 3, 3], 12 blocks total |
| Feed-forward | GDFN (gated depthwise conv) |
| Feature routing | Dense concat of all 4 stages |
| Skip | Bilinear upscale + residual |
| Parameters | 23.8M |
| Model size | 90.99 MB (fp32) |
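
Two of the table entries follow from simple arithmetic, sketched below. The parameter count is the rounded figure from the table, so the computed size is approximate.

```python
params = 23.8e6               # rounded parameter count from the table
bytes_fp32 = params * 4       # 4 bytes per float32 weight

size_mb = bytes_fp32 / 1e6    # decimal megabytes (10^6 bytes)
size_mib = bytes_fp32 / 2**20 # binary mebibytes (2^20 bytes)

n_tokens = 16 * 16            # N in the O(N^2) attention cost
attn_entries = n_tokens ** 2  # attention matrix entries per head, per image

print(f"{size_mb:.1f} MB, {size_mib:.1f} MiB, {attn_entries} attention entries")
```

The rounded count gives about 95.2 MB or 90.8 MiB, which suggests the 90.99 figure in the table is measured in binary megabytes. With N fixed at 256, each attention matrix has 65,536 entries per head, which is why global self-attention stays affordable at this input size.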
---

## Results

| Benchmark | PSNR | SSIM |
|-----------|------|------|
| DIV2K validation | **25.20 dB** | **0.8298** |
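
For readers unfamiliar with the metric, the sketch below shows the standard PSNR definition on pixel values scaled to [0, 1]. It is not the evaluation script behind the table: the color space, border cropping, and any other protocol details used for the reported number are not stated, and both helper functions are illustrative.

```python
import math

def psnr(pred, target, max_val=1.0):
    """PSNR in dB between two equally sized flat sequences of pixel values."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(max_val ** 2 / mse)

def mse_from_psnr(psnr_db, max_val=1.0):
    """Invert the PSNR formula to recover the mean squared error."""
    return max_val ** 2 / (10 ** (psnr_db / 10.0))

toy_target = [0.0, 0.5, 1.0, 0.25]
toy_pred = [0.1, 0.5, 0.9, 0.25]
toy_psnr = psnr(toy_pred, toy_target)
implied_mse = mse_from_psnr(25.20)

print(round(toy_psnr, 2), round(implied_mse, 5))
```

Read this way, 25.20 dB corresponds to a mean squared error of about 0.003 on a [0, 1] scale, or a root-mean-square pixel error of roughly 0.055.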
---

## Keywords

encoder-only vision transformer super-resolution ·
DenseNet-style skip connections and feature propagation ·
constant spatial resolution across 4 hierarchical stages ·
isotropic token grid, no patch merging ·
gated depthwise feed-forward network GDFN ·
×4 sub-pixel convolution upscale shuffling ·
dense inter-stage feature aggregation ·
pure ViT image restoration ·
Dense-Iso-ViT

---

## License

MIT: free to use, modify, and distribute with attribution.