NIRVLab/unigram-vanilla-webfaq-tur

Vanilla Transformer + Unigram — WebFAQ Turkish Retrieval

Monolingual Turkish dense retrieval model trained from scratch on the PaDaS-Lab/webfaq-retrieval Turkish subset.

Pipeline

1. Unigram tokenizer (vocab=32k) trained on a Turkish corpus
2. MLM pre-training: 3 epochs, lr=1e-4, batch=64, grad_accum=2
3. Contrastive fine-tuning: 5 epochs, lr=2e-5, batch=32, grad_accum=4, temp=0.05
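
The card gives the contrastive step only as hyperparameters and does not state the loss. A common choice for dense retrieval is InfoNCE with in-batch negatives: each question's paired answer is the positive, and the other answers in the batch act as negatives. The sketch below assumes that objective, with cosine similarity scaled by the temperature 0.05 listed above. The function names and toy vectors are hypothetical, and plain Python stands in for whatever tensor code the actual training uses.

```python
import math

TEMPERATURE = 0.05  # temp=0.05 from the fine-tuning settings


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce_loss(query_vecs, answer_vecs, temperature=TEMPERATURE):
    """Mean InfoNCE loss with in-batch negatives (assumed objective).

    query_vecs[i] and answer_vecs[i] form the positive pair; every other
    answer in the batch is treated as a negative for query i.
    """
    queries = [l2_normalize(q) for q in query_vecs]
    answers = [l2_normalize(a) for a in answer_vecs]
    total = 0.0
    for i, q in enumerate(queries):
        # Temperature-scaled cosine similarity to every answer in the batch.
        logits = [sum(x * y for x, y in zip(q, a)) / temperature for a in answers]
        # Numerically stable log-sum-exp over the row.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the matching answer (index i) as the target.
        total += log_denominator - logits[i]
    return total / len(queries)


# Toy batch: aligned pairs should give a lower loss than shuffled pairs.
toy_queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_answers = [[0.9, 0.1], [0.1, 0.9]]
shuffled_answers = [[0.1, 0.9], [0.9, 0.1]]

aligned_loss = info_nce_loss(toy_queries, aligned_answers)
shuffled_loss = info_nce_loss(toy_queries, shuffled_answers)
print(f"aligned: {aligned_loss:.4f}  shuffled: {shuffled_loss:.4f}")
```

With gradient accumulation, in-batch negatives usually come from each micro-batch of 32 rather than from the accumulated 32 × 4 = 128 pairs, unless embeddings are cached across accumulation steps. The card does not say which applies here.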

Architecture

Vanilla Transformer encoder (pure PyTorch, random init)
6 layers, hidden=512, heads=8, FFN=2048
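
For a rough sense of scale, the dimensions above determine most of the parameter count. The arithmetic below is an estimate under assumptions the card does not state: a standard encoder layer with biased projections and two LayerNorms per layer, a vocabulary of exactly 32,000 tokens, and no position embeddings or pooling head counted, since the maximum sequence length is not given.

```python
# Dimensions from the card.
NUM_LAYERS = 6
HIDDEN = 512
NUM_HEADS = 8
FFN = 2048
VOCAB_SIZE = 32_000  # assumption: "32k" read as exactly 32,000

# Each attention head works on an equal slice of the hidden dimension.
head_dim = HIDDEN // NUM_HEADS

# Self-attention: Q, K, V and output projections, each HIDDEN x HIDDEN plus bias.
attention_params = 4 * (HIDDEN * HIDDEN + HIDDEN)

# Feed-forward: HIDDEN -> FFN -> HIDDEN, with biases.
ffn_params = (HIDDEN * FFN + FFN) + (FFN * HIDDEN + HIDDEN)

# Two LayerNorms per layer, each with a scale and a shift vector (assumed).
layer_norm_params = 2 * (2 * HIDDEN)

per_layer_params = attention_params + ffn_params + layer_norm_params
encoder_params = NUM_LAYERS * per_layer_params
token_embedding_params = VOCAB_SIZE * HIDDEN

# Position embeddings and any pooling head are left out: the card gives no sizes.
total_params = encoder_params + token_embedding_params

print(f"head_dim          = {head_dim}")
print(f"per encoder layer = {per_layer_params:,}")
print(f"encoder stack     = {encoder_params:,}")
print(f"token embeddings  = {token_embedding_params:,}")
print(f"total (estimate)  = {total_params:,}")
```

Under these assumptions the estimate comes to roughly 35M parameters, with the token embeddings accounting for a little under half of them.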

Evaluation Metrics

nDCG@1, nDCG@5, nDCG@10, MRR@10, Recall@50 (R@50), Recall@100 (R@100)
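
As an illustration of what these metrics measure, the sketch below computes them for a single query from a ranked list of document IDs, assuming binary relevance (a document is either relevant or not) and no duplicate IDs in the ranking. The function names and the toy ranking are hypothetical; the card does not say which evaluation tooling was used, and reported scores would normally be averages over all test queries.

```python
import math


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """nDCG@k with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def mrr_at_k(ranked_ids, relevant_ids, k):
    """Reciprocal rank of the first relevant document within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


# Toy query: one relevant answer, retrieved at rank 3.
ranking = ["d7", "d2", "d5", "d9"]
relevant = {"d5"}

print("nDCG@1  =", ndcg_at_k(ranking, relevant, 1))
print("nDCG@10 =", ndcg_at_k(ranking, relevant, 10))
print("MRR@10  =", mrr_at_k(ranking, relevant, 10))
print("R@50    =", recall_at_k(ranking, relevant, 50))
```

The nDCG and MRR cutoffs reward placing the correct answer near the top, while R@50 and R@100 only check whether it was retrieved at all within a deeper candidate list.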
