Base + Language-Specific LangMAP: phi3-mini × fin_Latn
Unsupervised tokenization specialised for fin_Latn, derived from the phi3-mini base BPE tokenizer using the LangMAP framework.
This repository bundles:
base_tokenizer.json: joint LAS Unigram base
langspec_fin_Latn.json: language-specific overlay (re-EM on fin_Latn corpus)
tokenizer.json: alias for the overlay (default load target)
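Before loading, it can be useful to confirm that all three bundled files are present in a local copy of the repository. The following is a minimal sketch using only the standard library; the helper name `missing_bundle_files` and the `repo_dir` argument are hypothetical and not part of this repository.

```python
from pathlib import Path

# File names as listed in this model card.
BUNDLE_FILES = (
    "base_tokenizer.json",
    "langspec_fin_Latn.json",
    "tokenizer.json",
)


def missing_bundle_files(repo_dir):
    """Return the bundled file names that are absent from repo_dir.

    Hypothetical helper for illustration; it only checks that the files
    exist and does not validate their contents.
    """
    root = Path(repo_dir)
    return [name for name in BUNDLE_FILES if not (root / name).is_file()]
```

Calling `missing_bundle_files(".")` from the repository root should return an empty list when the bundle is complete.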
Inference uses base + language-specific scores together (the LangMAP variant); do not use the bare overlay or base on its own.
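This card does not spell out how the base and language-specific scores are combined. Purely as an illustration, the toy sketch below assumes both are per-token log-scores that are summed, then runs a standard Unigram-style Viterbi segmentation over the result. The additive rule, the function names, and the toy vocabulary and scores are all assumptions made for this example; they are not taken from the LangMAP framework or from the files in this repository.

```python
import math


def combined_scores(base, overlay):
    # Assumption: scores combine additively; a token absent from the
    # overlay keeps its base score unchanged.
    return {tok: s + overlay.get(tok, 0.0) for tok, s in base.items()}


def viterbi_segment(text, scores):
    """Return the highest-scoring segmentation of text into known tokens."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best total score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last token
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in scores and best[start] + scores[piece] > best[end]:
                best[end] = best[start] + scores[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    pieces, pos = [], n
    while pos > 0:
        pieces.append(text[back[pos]:pos])
        pos = back[pos]
    return pieces[::-1]


# Toy scores, invented for illustration only.
toy_base = {"t": -4.0, "a": -4.0, "l": -4.0, "o": -4.0, "s": -4.0,
            "ta": -5.0, "lo": -5.0, "talo": -12.0, "ssa": -13.0}
toy_overlay = {"talo": 4.0, "ssa": 4.0}

base_only = viterbi_segment("talossa", toy_base)
with_overlay = viterbi_segment("talossa", combined_scores(toy_base, toy_overlay))
print(base_only)     # ['ta', 'lo', 's', 's', 'a']
print(with_overlay)  # ['talo', 'ssa']
```

In this toy setup the overlay raises the scores of the Finnish stem and case ending enough that the combined model prefers them over shorter pieces, which is the kind of effect a language-specific overlay is meant to have.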
Trained from job smokeM5.fin.phi3-mini.v32064 (vocab=32064, langs=[fin_Latn], iters=5,
em_mode=soft, byte_fallback=True, seed-fix applied).
Loading
from tokenizers import Tokenizer

# tokenizer.json is the alias for the language-specific overlay (the default load target).
tok = Tokenizer.from_file("tokenizer.json")