Vanilla Transformer + Unigram β WebFAQ Turkish Retrieval
Monolingual Turkish dense retrieval model trained from scratch on the
PaDaS-Lab/webfaq-retrieval Turkish subset.
Pipeline
- Unigram tokenizer (vocab=32k) trained on Turkish corpus
- MLM pre-training β 3 epochs, lr=1e-4, batch=64, grad_accum=2
- Contrastive fine-tuning β 5 epochs, lr=2e-5, batch=32, grad_accum=4, temp=0.05
Architecture
- Vanilla Transformer encoder (pure PyTorch, random init)
- 6 layers, hidden=512, heads=8, FFN=2048
Evaluation Metrics
nDCG@1, nDCG@5, nDCG@10, MRR@10, R@50, R@100