Santosh Kompella
PRO
Sathya77
5 followers
·
13 following
AI & ML interests
LLMs Natural Language Processing (NLP) Transformers Deep Learning Machine Learning
Recent Activity
Posted an update 8 days ago
My previous ViT-SR model handled broad structure reasonably well but lost fine detail. Scaling parameters alone didn't fix it: a larger model with the same spatial compression through patch merging scored worse than the baseline.

The problem wasn't capacity. It was architecture. The reconstruction head was being asked to recover spatial detail from a heavily compressed token grid, information that was already gone before upsampling began. Fixing that required a different design, not just more parameters.

What changed:
- Constant 16×16 token grid across all 4 transformer stages. No patch merging, no token downsampling. Every token maps to the same 4×4 pixel region from input through to reconstruction (DenseNet's principle applied to Transformers).
- Hierarchical embed dims [192 → 256 → 288 → 384]. These architectural changes and the parameter count are not separable: GDFN, dense concatenation, and hierarchical dims each require sufficient channel depth to function. A 786K version of this architecture would be a fundamentally different, weaker model.
- Dense concatenation of all 4 stage outputs directly to the reconstruction head, which gives it simultaneous access to low-level edge maps and high-level semantic context.
- Gated Depthwise Feed-Forward Network (GDFN): replaces standard MLPs with gated depthwise 3×3 convolutions, injecting local spatial context into every attention step.
- Bilinear skip connection: the model learns a residual correction only, not full reconstruction from scratch.

Results:
- DIV2K validation: 25.20 dB PSNR, SSIM 0.8298
- ~23.8M parameters, trained from scratch on LSDIR (84,991 images)

The model handles smooth regions and structured content well. High-frequency fine detail remains the harder case, likely a property of global attention at this token count rather than anything specific to this architecture.

🤗 Space: https://huggingface.co/spaces/Sathya77/Dense-Iso-ViT-SR

Feedback welcome!
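The shape bookkeeping implied by the post can be sketched in a few lines of plain Python. This is only an illustration, not the model's actual code. The grid size, pixels per token, and embed dims come from the post. The function names are hypothetical, the 64-pixel input side assumes the 16×16 grid tiles the input without overlap, and the halving schedule used for the patch-merging contrast is an assumed Swin-style schedule that the post does not specify.

```python
# Values taken from the post.
GRID = 16                          # tokens per side, constant across all stages
PIXELS_PER_TOKEN = 4               # each token covers a 4x4 pixel region
EMBED_DIMS = [192, 256, 288, 384]  # per-stage embedding widths


def input_side(grid=GRID, patch=PIXELS_PER_TOKEN):
    """Input side length in pixels, assuming non-overlapping patches."""
    return grid * patch


def stage_shapes(grid=GRID, dims=EMBED_DIMS):
    """(tokens, channels) per stage when the token grid is never downsampled."""
    tokens = grid * grid
    return [(tokens, d) for d in dims]


def dense_concat_channels(dims=EMBED_DIMS):
    """Channels per token seen by the head after concatenating all stage outputs."""
    return sum(dims)


def merged_grid_sides(grid=GRID, stages=4):
    """Contrast case (assumption): grid side if each later stage halved the grid."""
    return [grid // (2 ** i) for i in range(stages)]


if __name__ == "__main__":
    print(input_side())             # 64
    print(stage_shapes())           # [(256, 192), (256, 256), (256, 288), (256, 384)]
    print(dense_concat_channels())  # 1120
    print(merged_grid_sides())      # [16, 8, 4, 2]
```

Under these assumptions the reconstruction head sees 256 tokens with 1120 channels each, whereas a halving schedule would leave only a 2×2 grid at the last stage. That difference is the spatial compression the post identifies as the bottleneck.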
Updated a Space 8 days ago: Sathya77/Dense-Iso-ViT-SR
Published a Space 8 days ago: Sathya77/Dense-Iso-ViT-SR
Organizations
None yet
Sathya77's datasets (1)
Sathya77/telecom_plans • Updated Aug 30, 2025 • 125 • 4