	---
	title: Dense-Iso-ViT SR
	emoji: 🔭
	colorFrom: blue
	colorTo: green
	sdk: gradio
	sdk_version: "5.16.0"
	app_file: app.py
	pinned: true
	license: mit
	tags:
	- image-super-resolution
	- vision-transformer
	- super-resolution
	- isotropic-vit
	- dense-connections
	- GDFN
	- pytorch
	- image-restoration
	- computer-vision
	- encoder-only-vision-transformer-super-resolution
	- constant-spatial-resolution
	- DenseNet-style-feature-propagation
	- isotropic-token-grid
	- gated-depthwise-feed-forward-network
	---

	# Dense-Iso-ViT
	## Constant-Resolution Hierarchical Vision Transformer for ×4 Image Super-Resolution

	Author: Sathya77 | Parameters: 23.8M | Training: from scratch, no pretrained weights

	---

	## What is Dense-Iso-ViT?

	Dense-Iso-ViT is a pure Vision Transformer built from scratch for ×4 image
	super-resolution. The core design idea: keep the spatial token grid constant
	throughout all transformer stages — no patch merging, no spatial compression —
	and connect all stage outputs directly to the reconstruction head using
	DenseNet-style dense concatenation.

	The name captures the two central ideas:
	- Dense — dense inter-stage feature aggregation (DenseNet principle applied
	to transformers)
	- Iso — isotropic spatial resolution, constant 16×16 token grid across all
	4 stages

	---

	## Architecture

	### How it works

	```
	Input [B, 3, 64, 64]
	→ PatchEmbedding (patch=4) → [B, 256, 192] (16×16 grid, fixed throughout)

	→ Stage 1: 3× TransformerBlock(embed=192, heads=6, GDFN) → h1 [B, 256, 192]
	→ Linear(192→256)

	→ Stage 2: 3× TransformerBlock(embed=256, heads=8, GDFN) → h2 [B, 256, 256]
	→ Linear(256→288)

	→ Stage 3: 3× TransformerBlock(embed=288, heads=6, GDFN) → h3 [B, 256, 288]
	→ Linear(288→384)

	→ Stage 4: 3× TransformerBlock(embed=384, heads=8, GDFN) → h4 [B, 256, 384]

	→ Dense concat: cat([h1, h2, h3, h4]) → [B, 256, 1120] → [B, 1120, 16, 16]
	→ SR Head: fusion conv → 4× PixelShuffle → [B, 3, 256, 256]
	→ + F.interpolate(lr_img, 256×256) bilinear skip
	→ Output [B, 3, 256, 256]
	```
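
The shape flow above can be checked with plain arithmetic. The sketch below is a minimal illustration, not code from the repository: the function name `trace_shapes` and its parameter names are hypothetical. It derives the token count, the dense-concat width, and the total upscale factor the SR head has to cover.

```python
def trace_shapes(lr_size=64, patch=4, dims=(192, 256, 288, 384), scale=4):
    """Derive the shapes listed in the flow diagram (batch dimension omitted)."""
    grid = lr_size // patch            # tokens per side after patch embedding
    tokens = grid * grid               # total tokens, constant across all stages
    concat_dim = sum(dims)             # channel width after dense concatenation
    out_size = lr_size * scale         # output pixels per side for x4 SR
    head_upscale = out_size // grid    # spatial factor from token grid to output
    return {
        "grid": grid,
        "tokens": tokens,
        "stage_shapes": [(tokens, d) for d in dims],
        "concat_dim": concat_dim,
        "out_size": out_size,
        "head_upscale": head_upscale,
    }


shapes = trace_shapes()
for key, value in shapes.items():
    print(f"{key}: {value}")
```

Going from the 16×16 token grid to 256×256 pixels is a ×16 spatial upscale in total: ×4 to undo the patch embedding and ×4 for the super-resolution itself.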

	### Design decisions and reasoning

	Isotropic token grid — constant 16×16 spatial resolution

	All 4 transformer stages operate on the same 256 tokens (16×16 grid). No patch
	merging, no token downsampling at any point. Every token maps to the same 4×4
	pixel region from input through to reconstruction — spatial coordinates are
	preserved exactly throughout the network.
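
To make the token-to-pixel correspondence concrete, here is a small sketch that maps a token index to its pixel box in the low-resolution input and in the ×4 output. The helper `token_to_regions` is hypothetical, and it assumes row-major token ordering, which the README does not state.

```python
def token_to_regions(index, grid=16, patch=4, scale=4):
    """Map a token index to (top, left, bottom, right) boxes, bottom/right exclusive.

    Assumes tokens are laid out row by row (row-major); this is an assumption,
    not something confirmed by the source.
    """
    if not 0 <= index < grid * grid:
        raise ValueError("token index out of range")
    row, col = divmod(index, grid)
    lr_box = (row * patch, col * patch, (row + 1) * patch, (col + 1) * patch)
    out_patch = patch * scale  # each token covers a 16x16 box in the output
    sr_box = (row * out_patch, col * out_patch,
              (row + 1) * out_patch, (col + 1) * out_patch)
    return lr_box, sr_box


lr_box, sr_box = token_to_regions(17)      # second row, second column
last_lr, last_sr = token_to_regions(255)   # bottom-right token
print(lr_box, sr_box)
print(last_lr, last_sr)
```

Because no stage merges or drops tokens, the same index refers to the same image region at every depth.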

	Hierarchical embed dims [192, 256, 288, 384]

	Representational capacity increases with depth. The design assumes that early
	stages mostly capture local edges and textures, for which 192 dimensions are
	sufficient, while deeper stages model global scene structure and benefit from
	the extra capacity of 384 dimensions. Linear projections between stages change
	the channel dimension without affecting spatial resolution.
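
As a rough check on these widths, the sketch below computes the per-head dimension of each stage and the parameter count of the three inter-stage projections. It assumes each projection is a plain `Linear` layer with a bias term; the helper `linear_params` is hypothetical.

```python
dims = [192, 256, 288, 384]
heads = [6, 8, 6, 8]

# Each embed dim must divide evenly by its head count.
assert all(d % h == 0 for d, h in zip(dims, heads))
head_dims = [d // h for d, h in zip(dims, heads)]


def linear_params(d_in, d_out, bias=True):
    """Parameter count of a Linear(d_in -> d_out) layer (bias assumed)."""
    return d_in * d_out + (d_out if bias else 0)


# Projections sit between consecutive stages: 192->256, 256->288, 288->384.
proj_params = [linear_params(a, b) for a, b in zip(dims[:-1], dims[1:])]

print("head dims:", head_dims)
print("projection params:", proj_params, "total:", sum(proj_params))
```

Under these assumptions the three projections total about 0.23M parameters, roughly 1% of the 23.8M budget, so widening between stages is cheap compared with the transformer blocks themselves.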

	Dense inter-stage concatenation

	Outputs from all 4 stages are concatenated before the reconstruction head:
	`cat([h1, h2, h3, h4]) → [B, 256, 1120]`. The head receives low-level edge
	information (stage 1) and high-level semantic context (stage 4) simultaneously,
	without early-stage features being filtered through subsequent blocks.
	This follows DenseNet's feature reuse principle, applied across transformer stages.
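
The concatenation and the reshape into a feature map can be sketched in a few lines of PyTorch. Random tensors stand in for the stage outputs here; the variable names are illustrative, and the row-major reshape is an assumption rather than the repository's confirmed layout.

```python
import torch

B, N, grid = 2, 256, 16
dims = (192, 256, 288, 384)

# Stand-ins for the four stage outputs h1..h4, each of shape [B, N, C_i].
stage_outputs = [torch.randn(B, N, c) for c in dims]

# Concatenate along the channel axis: [B, 256, 1120].
fused_tokens = torch.cat(stage_outputs, dim=-1)

# Tokens -> 2D feature map for the convolutional head: [B, 1120, 16, 16].
fused_map = fused_tokens.transpose(1, 2).reshape(B, sum(dims), grid, grid)

print(fused_tokens.shape, fused_map.shape)
```

Each spatial location of `fused_map` holds the features of one token from all four stages side by side, which is what lets the head read shallow and deep features at once.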

	Gated Depthwise Feed-Forward Network (GDFN)

	Standard transformer MLPs process each token independently — no spatial
	awareness in the feed-forward step. GDFN replaces this with gated depthwise
	3×3 convolutions, giving each token access to its 8 spatial neighbors during
	the feed-forward computation. This injects local spatial context in every
	transformer block at almost no extra parameter cost.

	```python
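	import torch.nn as nn
	import torch.nn.functional as F

	class GDFN(nn.Module):
	    """Gated depthwise FFN; replaces the standard Linear → SiLU → Linear MLP."""
	    def __init__(self, embed_dim, hidden):
	        super().__init__()
	        self.g_proj = nn.Linear(embed_dim, 2 * hidden)
	        self.path_1 = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
	        self.path_2 = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
	        self.out_proj = nn.Linear(hidden, embed_dim)
	    def forward(self, x, grid=16):  # x: [B, N, C] with N = grid * grid
	        # Assumed wiring: reshape tokens to a 2D map, split g_proj output into two paths.
	        a, b = self.g_proj(x).transpose(1, 2).unflatten(2, (grid, grid)).chunk(2, dim=1)
	        y = self.path_1(a) * F.gelu(self.path_2(b))  # gate: path1 × GELU(path2)
	        return self.out_proj(y.flatten(2).transpose(1, 2))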
	```
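
The "almost no extra parameter cost" claim can be checked by counting parameters for the layer layout listed above (one input projection, two depthwise 3×3 convolutions, one output projection). The README does not give the hidden width, so `hidden = 4 * embed_dim` below is purely an assumption, and `gdfn_param_split` is a hypothetical helper.

```python
def gdfn_param_split(embed_dim, hidden):
    """Return (linear_params, depthwise_params) for one GDFN, biases included."""
    g_proj = embed_dim * 2 * hidden + 2 * hidden
    out_proj = hidden * embed_dim + embed_dim
    # A depthwise 3x3 conv has 9 weights + 1 bias per channel; there are two paths.
    depthwise = 2 * (hidden * 3 * 3 + hidden)
    return g_proj + out_proj, depthwise


for embed_dim in (192, 256, 288, 384):
    hidden = 4 * embed_dim  # assumed expansion ratio, not stated in the source
    linear, depthwise = gdfn_param_split(embed_dim, hidden)
    share = depthwise / (linear + depthwise)
    print(f"embed={embed_dim}: linear={linear}, depthwise={depthwise}, "
          f"depthwise share={share:.2%}")
```

Under this assumed expansion ratio the depthwise convolutions account for only a few percent of the feed-forward parameters, because their cost grows linearly with the hidden width while the linear projections grow with `embed_dim * hidden`.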

	Bilinear skip connection

	`output = F.interpolate(lr, size=(256, 256), mode="bilinear") + vit_residual`

	The model learns a residual correction on top of a bilinear upscale of the
	input rather than reconstructing the full image from scratch. This typically
	gives faster convergence and more stable training.
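
A minimal PyTorch sketch of the skip path follows. The zero tensor stands in for the network's residual output, and `align_corners=False` is simply PyTorch's default for bilinear mode; the setting actually used in this project is not stated.

```python
import torch
import torch.nn.functional as F

lr = torch.rand(1, 3, 64, 64)                # low-resolution input
vit_residual = torch.zeros(1, 3, 256, 256)   # stand-in for the SR head output

# Bilinear upscale of the input to the target size (x4 per side).
base = F.interpolate(lr, size=(256, 256), mode="bilinear", align_corners=False)

# Final prediction = cheap upscale + learned correction.
output = base + vit_residual

print(output.shape)
```

With a zero residual the output is exactly the bilinear upscale, so the network only has to learn the detail that bilinear interpolation misses.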

	### Summary

	| Component | Choice |
	|-----------|--------|
	| Attention | Global self-attention, O(N²), N=256 |
	| Spatial resolution | Constant 16×16 throughout |
	| Embed dims | [192, 256, 288, 384] |
	| Heads | [6, 8, 6, 8] |
	| Depths | [3, 3, 3, 3] (12 blocks total) |
	| Feed-forward | GDFN (gated depthwise conv) |
	| Feature routing | Dense concat all 4 stages |
	| Skip | Bilinear upscale + residual |
	| Parameters | 23.8M |
	| Model size | 90.99 MB (fp32) |
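
Two numbers in the table can be cross-checked with simple arithmetic: the size of the attention maps implied by N=256, and the parameter count implied by the fp32 file size. The sketch assumes that "MB" means MiB (1024² bytes) and that the file holds only fp32 weights; neither is stated in the source.

```python
n_tokens = 256
heads = [6, 8, 6, 8]
depths = [3, 3, 3, 3]

# Global self-attention builds one N x N score matrix per head per block.
entries_per_head = n_tokens * n_tokens
entries_per_image = sum(h * d * entries_per_head for h, d in zip(heads, depths))

# fp32 stores 4 bytes per parameter.
size_mib = 90.99
approx_params = size_mib * 1024 * 1024 / 4

print("attention entries per head:", entries_per_head)
print("attention entries per image (all blocks):", entries_per_image)
print(f"params implied by file size: {approx_params / 1e6:.2f}M")
```

Under the MiB assumption the file size implies a little under 23.9M parameters, which is consistent with the 23.8M figure quoted above.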

	---

	## Results

	| Benchmark | PSNR | SSIM |
	|-----------|------|------|
	| DIV2K validation | 25.20 dB | 0.8298 |
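
For readers unfamiliar with the metric, the sketch below shows the standard PSNR definition on values scaled to [0, 1]. It is a generic illustration, not the evaluation script behind the table: the protocol used here (RGB versus luminance channel, border cropping) is not stated in the README.

```python
import math


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB for two equal-length pixel sequences."""
    if not reference or len(reference) != len(estimate):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy example: every pixel is off by 0.1 on a [0, 1] scale.
reference = [0.0, 0.5, 1.0, 0.25]
estimate = [0.1, 0.6, 0.9, 0.35]
toy_psnr = psnr(reference, estimate)

# Invert the formula: RMS error that corresponds to a given PSNR.
rmse_at_reported = math.sqrt(10 ** (-25.20 / 10))

print(f"toy PSNR: {toy_psnr:.2f} dB")
print(f"RMS error at 25.20 dB: {rmse_at_reported:.4f}")
```

Read this way, 25.20 dB corresponds to a root-mean-square pixel error of about 0.055 on a [0, 1] intensity scale.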

	---

	## Keywords

	encoder-only vision transformer super-resolution ·
	DenseNet-style skip connections and feature propagation ·
	constant spatial resolution across 4 hierarchical stages ·
	isotropic token grid no patch merging ·
	gated depthwise feed-forward network GDFN ·
	×4 sub-pixel convolution upscale shuffling ·
	dense inter-stage feature aggregation ·
	pure ViT image restoration ·
	Dense-Iso-ViT

	---

	## License

	MIT — free to use, modify, and distribute with attribution.