Vanilla Transformer + Unigram β€” WebFAQ Turkish Retrieval

Monolingual Turkish dense retrieval model trained from scratch on the PaDaS-Lab/webfaq-retrieval Turkish subset.

Pipeline

  1. Unigram tokenizer (vocab=32k) trained on Turkish corpus
  2. MLM pre-training β€” 3 epochs, lr=1e-4, batch=64, grad_accum=2
  3. Contrastive fine-tuning β€” 5 epochs, lr=2e-5, batch=32, grad_accum=4, temp=0.05

Architecture

  • Vanilla Transformer encoder (pure PyTorch, random init)
  • 6 layers, hidden=512, heads=8, FFN=2048

Evaluation Metrics

nDCG@1, nDCG@5, nDCG@10, MRR@10, R@50, R@100

Downloads last month
56
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Dataset used to train NIRVLab/unigram-vanilla-webfaq-tur