Turkish Subwords Research
A collection of models, tokenizers, and test sets for the research work "Optimal Turkish Subword Strategies at Scale". The models are experimental.
How to use turkish-nlp-suite/wordpiece_2k_cased_minimal with Transformers:
# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("turkish-nlp-suite/wordpiece_2k_cased_minimal", dtype="auto")

This is a tokenizer from the Turkish tokenizer collection of the research work "Optimal Turkish Subword Strategies at Scale: Systematic Evaluation of Data, Vocabulary, Morphology Interplay".
The collection Turkish Subwords Research contains tokenizers, and this tokenizer's name reads as: 2K vocabulary, cased, trained on the minimal-sized corpus. Corpora come in three sizes: Minimal, Medium, and Alldata. The collection contains all tokenizers named wordpiece_{vocab-size}k_{corpus-size}. For more information, please refer to the research paper.
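The naming scheme can be unpacked programmatically, which may help when iterating over the collection. The following is a minimal sketch, not part of the collection's tooling: `parse_tokenizer_name` and `TokenizerName` are hypothetical helpers, the casing field is inferred from this tokenizer's name (wordpiece_2k_cased_minimal), and the `uncased` value and the lowercase spellings `medium` and `alldata` are assumptions that should be checked against the actual repository names.

```python
import re
from typing import NamedTuple


class TokenizerName(NamedTuple):
    """Fields read off a tokenizer repository name (hypothetical helper type)."""
    algorithm: str      # e.g. "wordpiece"
    vocab_size_k: int   # vocabulary size in thousands, e.g. 2 for "2k"
    casing: str         # "cased"; "uncased" is an assumed alternative
    corpus_size: str    # "minimal", "medium" or "alldata" (spellings assumed)


# Pattern inferred from the name "wordpiece_2k_cased_minimal".
_NAME_PATTERN = re.compile(
    r"^(?P<algorithm>[a-z]+)"
    r"_(?P<vocab_k>\d+)k"
    r"_(?P<casing>cased|uncased)"
    r"_(?P<corpus>minimal|medium|alldata)$"
)


def parse_tokenizer_name(repo_id: str) -> TokenizerName:
    """Split a repo id such as 'org/wordpiece_2k_cased_minimal' into its fields."""
    name = repo_id.split("/")[-1].lower()  # drop the organisation prefix if present
    match = _NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"name does not follow the assumed scheme: {repo_id!r}")
    return TokenizerName(
        algorithm=match["algorithm"],
        vocab_size_k=int(match["vocab_k"]),
        casing=match["casing"],
        corpus_size=match["corpus"],
    )


info = parse_tokenizer_name("turkish-nlp-suite/wordpiece_2k_cased_minimal")
print(info)
```

Names that do not match the assumed scheme raise a `ValueError` rather than being guessed at, so a mismatch with the real collection surfaces immediately.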