@Sathya77 on Hugging Face: "My previous ViT-SR model handled broad structure decently well but lost fine…"


My previous ViT-SR model handled broad structure decently well but lost fine detail. Scaling parameters alone didn't fix it — a larger model with the same spatial compression through patch merging scored worse than the baseline. The problem wasn't capacity. It was architecture.

The reconstruction head was being asked to recover spatial detail from a heavily compressed token grid — information that was already gone before upsampling began. Fixing that required a different design, not just more parameters.

What changed:
Constant 16×16 token grid across all 4 transformer stages. No patch merging, no token downsampling. Every token maps to the same 4×4 pixel region from input through to reconstruction (DenseNet's principle applied to Transformers).

Hierarchical embed dims [192 → 256 → 288 → 384]. These architectural changes and the parameter count are not separable: GDFN, dense concatenation, and hierarchical dims each require sufficient channel depth to function. A 786K-parameter version of this architecture would be a fundamentally different, weaker model.

Dense concatenation of all 4 stage outputs directly to the reconstruction head — simultaneously accessing low-level edge maps and high-level semantic context.

Gated Depthwise Feed-Forward Network (GDFN) — replaces standard MLPs with gated depthwise 3×3 convolutions, injecting local spatial context into every attention step.

Bilinear skip connection: the model learns only a residual correction on top of the bilinearly upsampled input, not the full reconstruction from scratch.
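
As a rough illustration of how the pieces listed above could fit together, here is a minimal PyTorch sketch of a gated depthwise feed-forward block and a dense-concatenation head with a bilinear skip. It is not the code from the Space: the class names `GDFN` and `DenseHead`, the `expansion` ratio, the ×4 `scale`, the 1×1 fusion conv, and the pixel-shuffle upsampler are all assumptions made for illustration. Only the 16×16 grid, the 4×4 pixel region per token, and the embed dims come from the post. The gating layout follows the gated depthwise-conv feed-forward used in Restormer, which may differ from the layer used here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIMS = (192, 256, 288, 384)  # per-stage widths, from the post
GRID = 16                          # constant 16x16 token grid, from the post
PATCH = 4                          # each token covers a 4x4 pixel region


class GDFN(nn.Module):
    """Gated depthwise feed-forward block on a (B, C, H, W) token grid."""

    def __init__(self, dim: int, expansion: float = 2.0):  # expansion: assumed
        super().__init__()
        hidden = int(dim * expansion)
        self.project_in = nn.Conv2d(dim, hidden * 2, kernel_size=1)
        # groups == channels makes this a depthwise 3x3 convolution
        self.dwconv = nn.Conv2d(hidden * 2, hidden * 2, kernel_size=3,
                                padding=1, groups=hidden * 2)
        self.project_out = nn.Conv2d(hidden, dim, kernel_size=1)

    def forward(self, x):
        # Split into a gate branch and a value branch, then gate elementwise
        gate, value = self.dwconv(self.project_in(x)).chunk(2, dim=1)
        return self.project_out(F.gelu(gate) * value)


class DenseHead(nn.Module):
    """Concatenate all stage outputs, predict a residual, add a bilinear skip."""

    def __init__(self, dims=EMBED_DIMS, scale: int = 4):  # scale: assumed
        super().__init__()
        self.scale = scale
        r = PATCH * scale  # output pixels per token along each axis
        self.fuse = nn.Conv2d(sum(dims), 3 * r * r, kernel_size=1)
        self.shuffle = nn.PixelShuffle(r)

    def forward(self, stage_outputs, lr):
        dense = torch.cat(stage_outputs, dim=1)    # (B, 1120, 16, 16)
        residual = self.shuffle(self.fuse(dense))  # (B, 3, 64*scale, 64*scale)
        base = F.interpolate(lr, scale_factor=self.scale,
                             mode="bilinear", align_corners=False)
        return base + residual                     # network only learns the correction


# Random tensors stand in for the image and the four stage outputs
lr = torch.rand(1, 3, GRID * PATCH, GRID * PATCH)
feats = [torch.rand(1, d, GRID, GRID) for d in EMBED_DIMS]
feats[0] = feats[0] + GDFN(EMBED_DIMS[0])(feats[0])  # residual feed-forward step
sr = DenseHead()(feats, lr)
print(sr.shape)  # torch.Size([1, 3, 256, 256])
```

Because the grid never shrinks, every stage output has the same 16×16 spatial shape and the concatenation needs no resampling. That is the practical payoff of dropping patch merging.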

Results:

DIV2K validation: 25.20 dB PSNR, SSIM 0.8298
~23.8M parameters, trained from scratch on LSDIR (84,991 images)
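
To put the PSNR figure in pixel terms, the standard definition PSNR = 10·log10(MAX² / MSE) can be inverted to get a root-mean-square error. The helper names below are hypothetical, and the sketch assumes pixel values in [0, 1]. The post does not say whether PSNR was computed on RGB or on the luminance channel, or how it was averaged across images, so treat the conversion as indicative only.

```python
import math


def psnr_from_mse(mse: float, max_val: float = 1.0) -> float:
    """PSNR in dB for a given mean squared error and peak signal value."""
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


def rmse_from_psnr(psnr_db: float, max_val: float = 1.0) -> float:
    """Invert the PSNR definition to recover the root-mean-square error."""
    return max_val / (10 ** (psnr_db / 20.0))


rmse = rmse_from_psnr(25.20)
print(f"RMSE at 25.20 dB: {rmse:.4f} on [0, 1], {rmse * 255:.1f} on [0, 255]")
```

Under these assumptions, 25.20 dB corresponds to an RMS error of roughly 0.055 on a [0, 1] scale, or about 14 grey levels on an 8-bit scale.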

The model handles smooth regions and structured content well. High-frequency fine detail remains the harder case — likely a property of global attention at this token count rather than anything specific to this architecture.

🤗 Space: Sathya77/Dense-Iso-ViT-SR

Feedback welcome!
