
SathyaSantosh77

update content

d073163 8 days ago

metadata

title: Dense-Iso-ViT SR
emoji: 🔭
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 5.16.0
python_version: '3.10'
app_file: app.py
pinned: true
license: mit
tags:
  - image-super-resolution
  - vision-transformer
  - super-resolution
  - isotropic-vit
  - dense-connections
  - GDFN
  - pytorch
  - image-restoration
  - computer-vision
  - encoder-only-vision-transformer-super-resolution
  - constant-spatial-resolution
  - DenseNet-style-feature-propagation
  - isotropic-token-grid
  - gated-depthwise-feed-forward-network

Dense-Iso-ViT

Constant-Resolution Hierarchical Vision Transformer for ×4 Image Super-Resolution

What is Dense-Iso-ViT?

Dense-Iso-ViT is a pure Vision Transformer built from scratch for ×4 image super-resolution. The core design idea: keep the spatial token grid constant throughout all transformer stages — no patch merging, no spatial compression — and connect all stage outputs directly to the reconstruction head using DenseNet-style dense concatenation.

The name captures the two central ideas:

Dense — dense inter-stage feature aggregation (DenseNet principle applied to transformers)
Iso — isotropic spatial resolution, constant 16×16 token grid across all 4 stages

Architecture

How it works

Input [B, 3, 64, 64]
  → PatchEmbedding (patch=4) → [B, 256, 192]   (16×16 grid, fixed throughout)

  → Stage 1: 3× TransformerBlock(embed=192, heads=6, GDFN) → h1 [B, 256, 192]
  → Linear(192→256)

  → Stage 2: 3× TransformerBlock(embed=256, heads=8, GDFN) → h2 [B, 256, 256]
  → Linear(256→288)

  → Stage 3: 3× TransformerBlock(embed=288, heads=6, GDFN) → h3 [B, 256, 288]
  → Linear(288→384)

  → Stage 4: 3× TransformerBlock(embed=384, heads=8, GDFN) → h4 [B, 256, 384]

  → Dense concat: cat([h1, h2, h3, h4]) → [B, 256, 1120] → [B, 1120, 16, 16]
  → SR Head: fusion conv → 4× PixelShuffle → [B, 3, 256, 256]
  → + F.interpolate(lr_img, 256×256) bilinear skip
  → Output [B, 3, 256, 256]
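The shape bookkeeping in the diagram above can be checked with plain arithmetic. The following is a minimal sketch; the function name `trace_shapes` and its parameters are illustrative and not part of the repository.

```python
def trace_shapes(lr_size=64, patch=4, dims=(192, 256, 288, 384), scale=4):
    """Return the key shapes (batch axis omitted) along the pipeline."""
    grid = lr_size // patch              # 64 / 4 = 16 tokens per side
    tokens = grid * grid                 # 16 * 16 = 256 tokens
    stage_shapes = [(tokens, d) for d in dims]
    concat_channels = sum(dims)          # 192 + 256 + 288 + 384 = 1120
    hr_size = lr_size * scale            # 64 * 4 = 256
    head_upscale = hr_size // grid       # grid-to-output factor the head must cover
    return {
        "grid": grid,
        "tokens": tokens,
        "stage_shapes": stage_shapes,
        "concat_channels": concat_channels,
        "hr_size": hr_size,
        "head_upscale": head_upscale,
    }


shapes = trace_shapes()
print(shapes)
```

Note that the head has to expand the 16×16 grid by a factor of 16 per side overall: 4× to undo the patch embedding and 4× for the super-resolution itself.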

Design decisions and reasoning

Isotropic token grid — constant 16×16 spatial resolution

All 4 transformer stages operate on the same 256 tokens (16×16 grid). No patch merging, no token downsampling at any point. Each token corresponds to the same 4×4 pixel patch of the 64×64 input in every stage, and to a 16×16 pixel block of the 256×256 output, so spatial coordinates are preserved exactly throughout the network.
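As a rough illustration of the fixed token grid, here is a minimal patch-embedding sketch. It assumes the embedding is a strided convolution, a common choice that the README does not confirm; `PatchEmbedding` here and the helper `token_to_lr_box` are illustrative names.

```python
import torch
import torch.nn as nn


class PatchEmbedding(nn.Module):
    """Assumed conv-based patch embedding: [B, 3, 64, 64] -> [B, 256, 192]."""

    def __init__(self, in_ch=3, embed_dim=192, patch=4):
        super().__init__()
        # kernel == stride == patch, so patches do not overlap
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch, stride=patch)

    def forward(self, img):
        x = self.proj(img)                    # [B, 192, 16, 16]
        return x.flatten(2).transpose(1, 2)   # [B, 256, 192]


def token_to_lr_box(idx, grid=16, patch=4):
    """Map a token index to its (top, left, bottom, right) box in the LR image."""
    row, col = divmod(idx, grid)
    return (row * patch, col * patch, (row + 1) * patch, (col + 1) * patch)


tokens = PatchEmbedding()(torch.randn(2, 3, 64, 64))
print(tokens.shape)
print(token_to_lr_box(17))
```

Because no stage merges or drops tokens, the index-to-box mapping stays valid for the output of every stage.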

Hierarchical embed dims [192, 256, 288, 384]

Representational capacity increases with depth. Early stages are meant to capture local edges and textures, for which 192 dimensions should be sufficient. Deeper stages are meant to model global scene structure and semantics, so the width grows to 384 dimensions where the task is harder. Linear projections between stages change the channel dimension without affecting spatial resolution.
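The widening schedule can be sketched as a backbone skeleton. This is only an approximation under stated assumptions: PyTorch's stock `nn.TransformerEncoderLayer` stands in for the project's own TransformerBlock (which uses GDFN), and `dim_feedforward=2 * d` is an arbitrary placeholder. The class name `IsoBackbone` is hypothetical.

```python
import torch
import torch.nn as nn

DIMS, HEADS, DEPTH = (192, 256, 288, 384), (6, 8, 6, 8), 3


class IsoBackbone(nn.Module):
    """Four constant-resolution stages with Linear projections in between."""

    def __init__(self, dims=DIMS, heads=HEADS, depth=DEPTH):
        super().__init__()
        # Stand-in blocks: NOT the GDFN-based block described in this README.
        self.stages = nn.ModuleList([
            nn.Sequential(*[
                nn.TransformerEncoderLayer(
                    d, h, dim_feedforward=2 * d, dropout=0.0,
                    batch_first=True, norm_first=True)
                for _ in range(depth)
            ])
            for d, h in zip(dims, heads)
        ])
        # 192 -> 256 -> 288 -> 384: only the channel axis changes
        self.proj = nn.ModuleList(
            [nn.Linear(a, b) for a, b in zip(dims[:-1], dims[1:])])

    def forward(self, x):
        feats = []
        for i, stage in enumerate(self.stages):
            x = stage(x)
            feats.append(x)                  # keep h1..h4 for the dense concat
            if i < len(self.proj):
                x = self.proj[i](x)          # token count stays at 256
        return feats


feats = IsoBackbone()(torch.randn(1, 256, 192))
print([tuple(f.shape) for f in feats])
```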

Dense inter-stage concatenation

Outputs from all 4 stages are concatenated before the reconstruction head: cat([h1, h2, h3, h4]) → [B, 256, 1120]. The head receives low-level edge information (stage 1) and high-level semantic context (stage 4) simultaneously, without early-stage features being filtered through subsequent blocks. The design is inspired by DenseNet's feature-reuse principle, applied across transformer stages.
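A possible shape-level sketch of the concatenation and reconstruction head follows. The head layout is an assumption: a 1×1 fusion conv down to a hypothetical 64 channels, then four conv + `PixelShuffle(2)` steps (16× per side in total) and a final RGB conv. The README only states "fusion conv → 4× PixelShuffle", so the actual head may differ.

```python
import torch
import torch.nn as nn


class DenseSRHead(nn.Module):
    """Assumed head: concat stage outputs -> fusion conv -> 16x upsample -> RGB."""

    def __init__(self, in_ch=1120, mid_ch=64, grid=16, n_shuffle=4):
        super().__init__()
        self.grid = grid
        self.fusion = nn.Conv2d(in_ch, mid_ch, kernel_size=1)
        ups = []
        for _ in range(n_shuffle):
            # each step trades 4x channels for 2x spatial resolution
            ups += [nn.Conv2d(mid_ch, mid_ch * 4, 3, padding=1), nn.PixelShuffle(2)]
        self.upsample = nn.Sequential(*ups)
        self.to_rgb = nn.Conv2d(mid_ch, 3, 3, padding=1)

    def forward(self, feats):
        x = torch.cat(feats, dim=-1)                       # [B, 256, 1120]
        b, n, c = x.shape
        x = x.transpose(1, 2).reshape(b, c, self.grid, self.grid)  # [B, 1120, 16, 16]
        return self.to_rgb(self.upsample(self.fusion(x)))  # [B, 3, 256, 256]


feats = [torch.randn(2, 256, d) for d in (192, 256, 288, 384)]
print(DenseSRHead()(feats).shape)
```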

Gated Depthwise Feed-Forward Network (GDFN)

Standard transformer MLPs process each token independently, so the feed-forward step has no spatial awareness. GDFN replaces the MLP with gated depthwise 3×3 convolutions, giving each token access to its 8 spatial neighbors during the feed-forward computation. This injects local spatial context in every transformer block at almost no extra parameter cost.

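# GDFN: replaces standard Linear → SiLU → Linear
g_proj   = Linear(embed_dim, 2 * hidden)
path_1   = Conv2d(hidden, hidden, 3, padding=1, groups=hidden)   # depthwise 3×3
path_2   = Conv2d(hidden, hidden, 3, padding=1, groups=hidden)   # depthwise 3×3
out_proj = Linear(hidden, embed_dim)

# reshape tokens to the 16×16 grid, then split into two branches
# (assumed: the g_proj output is split in half, one half per path)
x1, x2 = g_proj(x).transpose(1, 2).reshape(B, 2 * hidden, 16, 16).chunk(2, dim=1)
# gate: path1 × GELU(path2), then flatten back to [B, 256, hidden]
y = path_1(x1) * F.gelu(path_2(x2))
x = out_proj(y.flatten(2).transpose(1, 2))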
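A self-contained module version of the GDFN pseudocode might look like the sketch below. Two details are assumptions rather than facts from the source: the `g_proj` output is split in half to feed the two depthwise paths, and `hidden = 2 * embed_dim` in the usage example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GDFN(nn.Module):
    """Gated depthwise feed-forward over a square token grid (sketch)."""

    def __init__(self, embed_dim, hidden, grid=16):
        super().__init__()
        self.grid = grid
        self.g_proj = nn.Linear(embed_dim, 2 * hidden)
        # groups=hidden makes each 3x3 conv depthwise (one filter per channel)
        self.path_1 = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.path_2 = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.out_proj = nn.Linear(hidden, embed_dim)

    def forward(self, x):                                   # x: [B, N, C]
        b = x.shape[0]
        y = self.g_proj(x).transpose(1, 2)                  # [B, 2*hidden, N]
        y = y.reshape(b, -1, self.grid, self.grid)          # [B, 2*hidden, 16, 16]
        x1, x2 = y.chunk(2, dim=1)                          # assumed split
        y = self.path_1(x1) * F.gelu(self.path_2(x2))       # gate
        return self.out_proj(y.flatten(2).transpose(1, 2))  # [B, N, C]


ffn = GDFN(embed_dim=192, hidden=384)
dw_params = sum(p.numel() for m in (ffn.path_1, ffn.path_2) for p in m.parameters())
total = sum(p.numel() for p in ffn.parameters())
print(ffn(torch.randn(2, 256, 192)).shape, dw_params, total)
```

Under these assumptions the two depthwise convolutions contribute 20 × hidden parameters (weights plus biases), a small share of the module next to the two Linear layers.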

Bilinear skip connection

output = F.interpolate(lr, size=(256, 256), mode="bilinear") + vit_residual

The model learns a residual correction on top of a bilinear upscale of the input rather than reconstructing the full image from scratch. This is intended to give faster convergence and more stable training.
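The skip connection can be written as a small helper. This is a sketch: the function name `add_bilinear_skip` is illustrative, and `align_corners=False` is an assumed setting the README does not specify.

```python
import torch
import torch.nn.functional as F


def add_bilinear_skip(lr_img, vit_residual):
    """Upscale the LR input to the residual's size and add the two."""
    up = F.interpolate(
        lr_img,
        size=vit_residual.shape[-2:],   # (256, 256) for a 64x64 input at x4
        mode="bilinear",
        align_corners=False,            # assumption, not stated in the README
    )
    return up + vit_residual


lr = torch.rand(2, 3, 64, 64)
residual = torch.zeros(2, 3, 256, 256)
print(add_bilinear_skip(lr, residual).shape)
```

With a zero residual the output is exactly the bilinear upscale, so the network starts from a reasonable baseline and only has to learn the missing detail.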

Summary

Component	Choice
Attention	Global self-attention, O(N²), N=256
Spatial resolution	Constant 16×16 throughout
Embed dims	[192, 256, 288, 384]
Heads	[6, 8, 6, 8]
Depths	[3, 3, 3, 3] — 12 blocks total
Feed-forward	GDFN (gated depthwise conv)
Feature routing	Dense concat all 4 stages
Skip	Bilinear upscale + residual
Parameters	23.8M
Model size	90.99 MB (fp32)
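A few figures in the table can be cross-checked with simple arithmetic. The sketch below assumes 4 bytes per fp32 parameter and that the reported "MB" is measured in units of 2^20 bytes; variable names are illustrative.

```python
dims = [192, 256, 288, 384]
heads = [6, 8, 6, 8]

# per-head width in each stage
head_dims = [d // h for d, h in zip(dims, heads)]

# global self-attention: one N x N score matrix per head
n_tokens = 16 * 16
attn_entries_per_head = n_tokens ** 2

# fp32 checkpoint size from the rounded parameter count
params = 23.8e6
size_mib = params * 4 / 2 ** 20

print(head_dims, attn_entries_per_head, round(size_mib, 2))
```

The estimate from the rounded 23.8M count comes out slightly below the reported 90.99 MB, which is consistent with the exact parameter count being a little higher than 23.8M.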

Results

Benchmark	PSNR	SSIM
DIV2K validation	25.20 dB	0.8298
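For reference, PSNR follows the standard definition 10·log10(MAX² / MSE). The sketch below assumes images scaled to [0, 1] and computes the metric over all RGB pixels. The README does not state the evaluation protocol (color space, border cropping), so this will not necessarily reproduce the number above.

```python
import torch


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two same-shaped image tensors."""
    mse = torch.mean((pred.double() - target.double()) ** 2)
    if mse == 0:
        return float("inf")              # identical images
    return float(10.0 * torch.log10(max_val ** 2 / mse))


a = torch.full((1, 3, 8, 8), 0.5)
b = a + 0.1                              # uniform error of 0.1, so MSE is about 0.01
print(psnr(a, b))
```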

Keywords

encoder-only vision transformer super-resolution · DenseNet-style skip connections and feature propagation · constant spatial resolution across 4 hierarchical stages · isotropic token grid no patch merging · gated depthwise feed-forward network GDFN · ×4 sub-pixel convolution upscale shuffling · dense inter-stage feature aggregation · pure ViT image restoration · Dense-Iso-ViT

License

MIT — free to use, modify, and distribute with attribution.