New architecture: TemporalMesh Transformer — dynamic kNN graph attention + per-token exit routing, 29.4 PPL at 48% compute

#180
by vigneshwar234 - opened

TemporalMesh Transformer (TMT) — open-source, 120M params, state-of-the-art efficiency

Sharing a new transformer architecture for the community's feedback and comparison.

TMT reaches 29.4 PPL on WikiText-2 (−30.2% vs. the vanilla transformer's 42.1) at 48% of the vanilla model's compute. At ~120M parameters it also outperforms Mamba (31.8), RWKV (33.1), and Longformer (39.6).

Five innovations unified in one forward pass

  1. Mesh Attention — dynamic kNN graph (k=8) rebuilt per-layer from cosine similarity. O(S·k) vs O(S²). At S=1024: 128× fewer attention ops.
  2. Temporal Decay Encoding — learned per-head multiplicative scalar post-softmax: ã_ij = α_ij × σ(w·|t_i−t_j|)
  3. Adaptive Depth Routing — per-token exit gate, avg 5.76/12 layers used (52% compute saved)
  4. Dual-Stream FFN — syntax + semantic parallel streams with sigmoid fusion gate
  5. EMA Memory Anchors — 16 persistent fast-weight vectors (β=0.99), 32KB params
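
To make items 1 and 2 concrete, here is a minimal NumPy sketch of attention over a cosine-similarity kNN graph, with the temporal decay applied after the softmax. It is an illustration under assumptions, not the repository's implementation: it is single-head, has no query/key/value projections, and the function name `mesh_attention`, the fixed decay weight `w`, and the use of token positions as timestamps are all hypothetical. It also builds the graph by brute force from the full S×S similarity matrix, so only the attention step is O(S·k) here.

```python
import numpy as np

def mesh_attention(x, t, k=8, w=-0.1):
    """Sparse attention over a cosine-similarity kNN graph (illustrative only).

    x: (S, d) token features; t: (S,) token timestamps or positions.
    w: decay weight. The post describes it as learned per head; it is a
       fixed, hypothetical value here.
    """
    S, d = x.shape
    # Cosine similarity between all token pairs (brute force, O(S^2 * d)).
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    sim = xn @ xn.T                                   # (S, S)
    # Indices of the k most similar tokens per query (the token itself included).
    nbrs = np.argsort(-sim, axis=1)[:, :k]            # (S, k)
    # Scaled dot-product scores restricted to those k neighbours: O(S * k * d).
    scores = np.einsum("sd,skd->sk", x, x[nbrs]) / np.sqrt(d)
    scores -= scores.max(axis=1, keepdims=True)       # numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=1, keepdims=True)         # softmax over neighbours
    # Post-softmax temporal decay: a~_ij = alpha_ij * sigmoid(w * |t_i - t_j|).
    dt = np.abs(t[:, None] - t[nbrs])                 # (S, k)
    alpha_tilde = alpha * (1.0 / (1.0 + np.exp(-w * dt)))
    out = np.einsum("sk,skd->sd", alpha_tilde, x[nbrs])
    return out, nbrs, alpha_tilde

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 16))
t = np.arange(32, dtype=float)
out, nbrs, alpha_tilde = mesh_attention(x, t)
print(out.shape, nbrs.shape)
```

Because the sigmoid factor is applied after the softmax and the formula in item 2 shows no renormalization, the rows of `alpha_tilde` in this sketch no longer sum to 1.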

Results (4 of the 8 benchmarks shown)

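| Model      | WT-2 PPL↓ | WT-103 PPL↓ | LongBench↑ | C4 PPL↓ | Compute |
|------------|-----------|-------------|------------|---------|---------|
| Vanilla    | 42.1      | 51.3        | 41.2       | 38.4    | 100%    |
| Longformer | 39.6      | 47.2        | 49.8       | 36.1    | 62%     |
| Mamba      | 31.8      | 38.4        | 51.3       | 30.1    | 55%     |
| TMT        | 29.4      | 36.1        | 53.4       | 27.4    | 48%     |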
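
The headline figures in this post can be cross-checked with a few lines of arithmetic. The snippet below uses only numbers quoted above; the 32 KB check additionally assumes the anchors are `d_model=512` wide and stored in fp32, which the post does not state explicitly.

```python
# WikiText-2 perplexities as quoted in the results table.
wt2_ppl = {"Vanilla": 42.1, "Longformer": 39.6, "Mamba": 31.8, "TMT": 29.4}

def relative_change(new, base):
    """Signed relative change of `new` with respect to `base`."""
    return (new - base) / base

wt2_change = relative_change(wt2_ppl["TMT"], wt2_ppl["Vanilla"])
print(f"WT-2 PPL change vs vanilla: {wt2_change:+.1%}")   # -30.2%

# Attention ops at S=1024 with k=8 neighbours: S^2 / (S * k).
attn_ratio = (1024 * 1024) // (1024 * 8)                  # 128

# Average depth 5.76 of 12 layers, i.e. the fraction of layers actually run.
depth_fraction = 5.76 / 12                                # about 0.48

# 16 anchors x 512 dims x 4 bytes (assumed d_model=512, fp32).
anchor_bytes = 16 * 512 * 4                               # 32768 bytes = 32 KB
```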

Quick start

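import torch
from tmt.model.config import TMTConfig
from tmt.model.model import TMTModel

model = TMTModel(TMTConfig(vocab_size=50257, d_model=512, n_heads=8, n_layers=12))
# Dummy batch of token IDs; the (batch, seq_len) input shape is an assumption
input_ids = torch.randint(0, 50257, (1, 128))
out = model(input_ids)
# out.logits, out.exit_masks, out.graph_edges, out.confidences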
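
The post does not document the layout of `out.confidences` or `out.exit_masks`, nor the rule the exit gate uses. As a rough illustration of per-token exit routing, the sketch below works on a plain array with an assumed `(n_layers, seq_len)` layout and a hypothetical confidence threshold: a token exits at the first layer whose confidence reaches the threshold. The function name `exit_layers` and the toy values are made up for the example.

```python
import numpy as np

def exit_layers(confidences, threshold=0.9):
    """Per-token exit layer (1-based) under a simple threshold rule.

    confidences: (n_layers, seq_len) gate confidences in [0, 1] (assumed layout).
    A token exits at the first layer whose confidence reaches `threshold`,
    or at the last layer if it never does.
    """
    n_layers, _ = confidences.shape
    passed = confidences >= threshold          # (n_layers, seq_len) booleans
    first = passed.argmax(axis=0) + 1          # first True per token, 1-based
    never = ~passed.any(axis=0)                # tokens that never reach it
    return np.where(never, n_layers, first)

# Toy confidences for 4 layers and 3 tokens.
conf = np.array([
    [0.95, 0.20, 0.10],
    [0.99, 0.92, 0.30],
    [0.99, 0.95, 0.50],
    [0.99, 0.99, 0.60],
])
layers = exit_layers(conf)                     # token exits at layers 1, 2, 4
compute_fraction = layers.mean() / conf.shape[0]
print(layers, compute_fraction)
```

Under this rule, the fraction of layers run is the mean exit layer divided by the depth, which is the same ratio as the 5.76/12 figure quoted for TMT.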

📄 Paper: https://zenodo.org/records/20287390 · DOI: 10.5281/zenodo.20287197
💻 Code (226 tests): https://github.com/vignesh2027/TemporalMesh-Transformer
🎮 Live Demo: https://huggingface.co/spaces/vigneshwar234/TemporalMesh-Transformer-Demo
🤗 Model: https://huggingface.co/vigneshwar234/TemporalMesh-Transformer
