Base + Language-Specific LangMAP: phi3-mini × tur_Latn

An unsupervised tokenizer specialised for tur_Latn (Turkish, Latin script), derived from the phi3-mini base BPE tokenizer using the LangMAP framework.

This repository bundles:

  • base_tokenizer.json: joint LAS Unigram base
  • langspec_tur_Latn.json: language-specific overlay (re-EM on tur_Latn corpus)
  • tokenizer.json: alias for the overlay (default load target)

Inference uses the base and language-specific scores together (the LangMAP variant); do not use either the bare overlay or the base on its own.
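
To illustrate why the two score tables matter together, here is a minimal pure-Python sketch of Unigram-style Viterbi segmentation over combined scores. It is not the LangMAP implementation: the combination rule (a plain sum of log-scores), the function names, the fallback penalty, and the toy score tables are all assumptions made for illustration, and none of the numbers come from the bundled files.

```python
# Hypothetical sketch: Viterbi segmentation with summed base + overlay log-scores.
# The sum rule and all scores below are assumptions, not LangMAP's actual method.

def combined_score(piece, base, overlay):
    """Return the summed log-score, or None if the piece is missing from either table."""
    if piece in base and piece in overlay:
        return base[piece] + overlay[piece]
    return None

def segment(text, base, overlay, fallback=-30.0, max_len=16):
    """Find the highest-scoring split of `text` into known pieces."""
    n = len(text)
    best = [float("-inf")] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)              # back[i]: start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            score = combined_score(piece, base, overlay)
            if score is None:
                if end - start != 1:
                    continue
                # Unknown single character: flat penalty as a stand-in for byte fallback.
                score = fallback
            if best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = start
    pieces, pos = [], n
    while pos > 0:
        pieces.append(text[back[pos]:pos])
        pos = back[pos]
    return pieces[::-1]

# Toy tables (invented): the base slightly prefers "evl" + "erde",
# while the Turkish overlay favours the morphemes "ev" + "ler" + "de".
BASE = {"e": -6.0, "v": -6.0, "l": -6.0, "r": -6.0, "d": -6.0,
        "ev": -4.0, "ler": -5.0, "de": -4.0, "evl": -5.0, "erde": -6.0}
OVERLAY = {"e": -6.0, "v": -6.0, "l": -6.0, "r": -6.0, "d": -6.0,
           "ev": -2.0, "ler": -2.0, "de": -2.0, "evl": -7.0, "erde": -7.0}

if __name__ == "__main__":
    print(segment("evlerde", BASE, OVERLAY))
```

In this toy setup, scoring with the base table alone and scoring with both tables yield different segmentations of the same word, which is the kind of divergence the warning above guards against.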

Trained from job smoke.tur.phi3-mini.v32064 (vocab=32064, langs=[tur_Latn], iters=5, em_mode=soft, byte_fallback=True, seed-fix applied).

Loading

from tokenizers import Tokenizer
tok = Tokenizer.from_file("tokenizer.json")