Turkish spaCy models
We currently ship four packages: tr_core_news_[trf|lg|md|sm]. You can find them all in our Hugging Face collection.
If you want to try things immediately, we also provide a kick-start Colab notebook.
Each package comes with a pretrained pipeline that includes sentence segmentation, POS tagging, morphological analysis (morphologizer), lemmatization, dependency parsing, and named entity recognition (NER). Below, we’ll look at these components one by one—but first, let’s install a model. In this post I’ll use the Transformer-based package, tr_core_news_trf:
pip install https://huggingface.co/turkish-nlp-suite/tr_core_news_trf/resolve/main/tr_core_news_trf-1.0-py3-none-any.whl
Then load it:
>>> import spacy
>>> nlp = spacy.load("tr_core_news_trf")
Now we're ready for action. Let's dive in!
Finding sentence boundaries
Before we can parse syntax, recognize entities, or run any sentence-level analytics, we need one thing: reliable sentence boundaries. In spaCy, sentence segmentation is represented by the Token.is_sent_start flag, and exposed via Doc.sents. In a full Turkish pipeline, there are two main ways those boundaries can be produced: a rule-based sentencizer or a dependency parser–based segmenter. They solve the same problem, but with different trade-offs.
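To make the relationship between Token.is_sent_start and Doc.sents concrete, here is a minimal sketch. It assumes a blank Turkish pipeline with only the rule-based sentencizer added (so no trained model is needed), and the sample text is made up for illustration.

```python
import spacy

# Blank Turkish pipeline: tokenizer only, plus the rule-based sentencizer.
nlp = spacy.blank("tr")
nlp.add_pipe("sentencizer")

# Illustrative text (an assumption, not taken from any corpus).
doc = nlp("Bugün hava güzel. Yarın yağmur yağacak.")

# Each token carries a flag saying whether it opens a sentence.
starts = [t.text for t in doc if t.is_sent_start]
print("Sentence-initial tokens:", starts)

# Doc.sents groups the tokens into Span objects based on those flags.
sentences = [sent.text for sent in doc.sents]
for sent in sentences:
    print(sent)
```

Whichever component writes the flags (sentencizer or parser), downstream code reads them the same way through Doc.sents.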
1) Rule-based: the sentencizer
The sentencizer is spaCy’s lightweight, deterministic sentence boundary component. It primarily uses punctuation and token patterns (e.g. . ! ? and related rules) to decide where sentences start. It does not need a tagger or parser, and it’s extremely fast—great for large-scale batch processing and production settings where predictability matters.
In Turkish, the sentencizer tends to work very well on clean, editorial text, but it can be challenged by cases where a period doesn’t mean “end of sentence”, such as:
- abbreviations (Dr., Prof., T.C., Örn., vb.)
- ordinals and list items (1. madde, 2. kez, 3. sınıf)
- dates/times and decimals (12.04.2026, 14.30, 36.5)
- quotes/parentheses where punctuation appears inside spans
Because it’s rule-based, its behavior is easy to reason about and consistent—if it’s wrong, it’s often wrong in a repeatable way (which can be a feature when you want stable preprocessing).
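The underlying ambiguity is easy to reproduce without spaCy. The snippet below is a deliberately naive regex splitter (a hypothetical helper, not the sentencizer's actual implementation, which works on tokens and so also depends on how the tokenizer handles abbreviations). It only illustrates why "a period followed by a space" is a weak signal in Turkish text.

```python
import re


def naive_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return [part for part in re.split(r"(?<=[.!?])\s+", text) if part]


text = (
    "Dr. Ayşe Yılmaz bugün Ankara’ya geldi. "
    "Prof. Dr. Mehmet K. Demir de toplantıdaydı."
)
pieces = naive_split(text)
for i, piece in enumerate(pieces, 1):
    print(f"{i:02d}. {piece}")

# The text has 2 real sentences, but every dotted abbreviation adds a split.
print(f"{len(pieces)} pieces for 2 real sentences")
```

The mistakes are at least repeatable: the same abbreviation is split the same way every time, which is the kind of predictability described above.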
2) Model-based: sentence boundaries from the dependency parser
In spaCy, the dependency parser can also set sentence boundaries. Instead of relying mainly on punctuation, it uses a learned model that considers broader context: surrounding tokens, structure, and the kinds of sequences that typically form a sentence in the training data. This can help in exactly the tricky Turkish cases above—especially when punctuation is ambiguous—because the model can learn patterns like “1. followed by madde is probably not a boundary by itself” or that T.C. is often internal to a sentence.
The trade-off is that parser-based segmentation is:
- heavier (more compute than the sentencizer)
- data-dependent (quality reflects the data the parser was trained on)
- sometimes less “predictable” than explicit rules in edge cases
How they fit in a full pipeline
In a full package, you typically choose one primary source of boundaries:
- Use the sentencizer when you want speed and deterministic behavior, or when your text domain is consistent and punctuation is reliable.
- Use parser-based segmentation when you already run the dependency parser anyway and you want better handling of ambiguous punctuation, abbreviations, and “list-like” constructs.
A practical pattern is to treat this as a configurable choice: sentence boundaries are foundational, and different downstream tasks (NER vs parsing vs sentence-level classification) can prefer different segmentation behaviors. Let's see both strategies in action:
import spacy

MODEL = "tr_core_news_trf"  # TODO: replace with your pipeline name

TEXTS = {
    "abbreviations": (
        "Dr. Ayşe Yılmaz bugün Ankara’ya geldi. Prof. Dr. Mehmet K. Demir de toplantıdaydı. "
        "Toplantı T.C. Sağlık Bakanlığı’nda yapıldı. Örn. bu tür kısaltmalar cümle sonu değildir."
    ),
    "ordinals": (
        "1. madde yürürlüğe girdi. 2. madde ise yeniden düzenlendi. "
        "3. kez denedik ama sonuç değişmedi. 4. sınıf öğrencileri katıldı."
    ),
    "dates_and_decimals": (
        "Rapor 12.04.2026 tarihinde yayımlandı. Sıcaklık 36.5 dereceye çıktı. "
        "Saat 09.15’te başladık. 10. bölümde konu tekrar ele alınıyor."
    ),
}

def print_sents(doc, title):
    print("\n" + title)
    print("-" * len(title))
    for i, sent in enumerate(doc.sents, 1):
        print(f"{i:02d}. {sent.text}")

# --- A) Rule-based segmentation: tokenizer + sentencizer
nlp_sent = spacy.blank("tr")
nlp_sent.add_pipe("sentencizer")

# --- B) Parser-based segmentation: load full pipeline, remove sentencizer (if present)
nlp_full = spacy.load(MODEL)
if "sentencizer" in nlp_full.pipe_names:
    nlp_full.remove_pipe("sentencizer")

# --- Run both and print results
for name, text in TEXTS.items():
    doc_sent = nlp_sent(text)
    doc_full = nlp_full(text)
    print_sents(doc_sent, f"[{name}] sentencizer (rule-based)")
    print_sents(doc_full, f"[{name}] parser-based (full pipeline)")
In the printed results, each block shows the same Turkish text split into sentences by two different strategies: the rule-based sentencizer (fast, punctuation-driven) and your full model with the dependency parser (context-aware). When the two outputs match, it means the boundaries are unambiguous (clear sentence-ending punctuation, straightforward numbering, dates/decimals that don’t confuse tokenization). When they differ—most often around abbreviations and dotted forms like Dr., T.C., Örn. or section-like patterns such as 1. followed by a noun—it typically reflects the trade-off: the sentencizer may treat a period as sentence-final and split too early, while the parser can often keep the span together because it has learned that these dotted tokens frequently continue the same sentence in Turkish text.
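If you like the sentencizer's speed but keep hitting early splits after abbreviations, one possible middle ground is a small custom component that runs right after it and withdraws boundaries following known abbreviations. This is only a sketch: the component name tr_abbrev_guard and the ABBREVIATIONS set are hypothetical, and the list is hand-picked rather than taken from the models.

```python
import spacy
from spacy.language import Language

# Hypothetical, hand-picked abbreviation list; extend it for your domain.
ABBREVIATIONS = {"Dr.", "Prof.", "T.C.", "Örn."}


@Language.component("tr_abbrev_guard")
def tr_abbrev_guard(doc):
    """Undo sentence starts that directly follow a known abbreviation."""
    for i, token in enumerate(doc):
        if i == 0 or not token.is_sent_start:
            continue
        prev = doc[i - 1]
        # Cover both tokenizations: "Dr." as one token, or "Dr" + ".".
        if prev.text == "." and i >= 2:
            candidate = doc[i - 2].text + "."
        else:
            candidate = prev.text
        if candidate in ABBREVIATIONS:
            token.is_sent_start = False
    return doc


nlp = spacy.blank("tr")
nlp.add_pipe("sentencizer")
nlp.add_pipe("tr_abbrev_guard", after="sentencizer")

doc = nlp("Dr. Ayşe Yılmaz geldi. Toplantı başladı.")
sentences = [sent.text for sent in doc.sents]
for sent in sentences:
    print(sent)
```

The guard stays deterministic, but it inherits the usual weakness of rules: an abbreviation that really does end a sentence (vb. is a common case, which is why it is left out of the set here) would be merged with the next sentence.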
Morphologizer: Turkish grammar as UD features
In tr_core_news_trf, the pipeline order is:
['transformer', 'tagger', 'morphologizer', 'trainable_lemmatizer', 'parser', 'ner']
That placement is exactly what you want for Turkish: the tagger runs first and assigns its tag (token.tag_), then the morphologizer predicts UD morphological features (FEATS) and exposes them as token.morph. In spaCy, the morphologizer is also the component that sets the coarse-grained category (token.pos_).
For Turkish, token.morph is where the language becomes “machine-readable”: case, possession, agreement, negation, TAM, voice, and—most importantly—whether a verb is finite or one of the Turkish workhorse non-finite forms (VerbForm=Conv/Part/Vnoun). Those features then become strong signals for the next components, especially the trainable_lemmatizer and the dependency parser. Let's see an example.
A “killer” example: when a whole clause becomes one token
Take:
Ali’nin gelmediğini biliyorum.
"I know that Ali didn’t come."
import spacy

nlp = spacy.load("tr_core_news_trf")
print("Pipeline:", nlp.pipe_names)

def show_morph(doc):
    for t in doc:
        if t.is_space:
            continue
        # token.morph prints as a UD FEATS bundle, and lemma_ is useful context
        print(f"{t.text:<12} {str(t.morph):<90} {t.lemma_:<10} {t.pos_}")

text = "Ali’nin gelmediğini biliyorum."
doc = nlp(text)
show_morph(doc)
Here is the exact output of the above code (token, UD FEATS, lemma, POS):
Ali’nin Case=Gen|Number=Sing|Person=3 Ali PROPN
gelmediğini Aspect=Perf|Case=Acc|Number=Sing|Number[psor]=Sing|Person=3|Person[psor]=3|Polarity=Neg|Tense=Past|VerbForm=Part gel VERB
biliyorum Aspect=Prog|Number=Sing|Person=1|Polarity=Pos|Tense=Pres bil VERB
. . . PUNCT
This is such a good morphology demo because each token exposes a different layer of the clause packaging: Ali’nin is marked Case=Gen, signaling the embedded-clause subject as a genitive dependent; gelmediğini carries the embedded meaning in a single word— Polarity=Neg + Tense=Past + Aspect=Perf gives “didn’t come,” VerbForm=Part shows it’s non-finite (one of the forms Turkish uses to build clause-like nominalizations/modifiers), Case=Acc marks the whole embedded event as an accusative object of biliyorum, and Person[psor]=3 / Number[psor]=Sing adds possessive agreement on that nominalized event (roughly “his not-coming”); finally, biliyorum anchors the matrix clause with Person=1|Number=Sing (“I”) and present/progressive-style marking (Tense=Pres, Aspect=Prog). In Turkish, morphology doesn’t just decorate words—it reveals when an entire clause has been turned into a noun-like object, and you can see that directly in VerbForm + Case + possessive features on a single token.
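In code, you rarely want to parse the FEATS string yourself: token.morph offers get() for single features and to_dict() for the whole bundle. The sketch below builds the Doc by hand and copies the FEATS strings from the output above, so it runs without the trained model; with the real pipeline you would simply call nlp(text) instead.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("tr")  # only the vocab is needed here

# Hand-built Doc; the FEATS strings are copied from the output shown above.
words = ["Ali’nin", "gelmediğini", "biliyorum", "."]
doc = Doc(nlp.vocab, words=words)
doc[0].set_morph("Case=Gen|Number=Sing|Person=3")
doc[1].set_morph(
    "Aspect=Perf|Case=Acc|Number=Sing|Number[psor]=Sing|Person=3|"
    "Person[psor]=3|Polarity=Neg|Tense=Past|VerbForm=Part"
)
doc[2].set_morph("Aspect=Prog|Number=Sing|Person=1|Polarity=Pos|Tense=Pres")

embedded = doc[1]
verb_form = embedded.morph.get("VerbForm")  # list of values for one feature
case = embedded.morph.get("Case")
feats = embedded.morph.to_dict()            # whole bundle as a plain dict

print("VerbForm:", verb_form, "Case:", case)
print("Negated:", feats.get("Polarity") == "Neg")
print("Possessor person:", feats.get("Person[psor]"))
```

Note that get() returns a list, because a UD feature can hold more than one value; an absent feature gives an empty list.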
Practical reading guide: the highest-value FEATS in Turkish
When you scan token.morph, these are the features that tend to pay off immediately:
- Nominals (NOUN/PROPN/PRON): Case=Nom/Acc/Dat/Loc/Abl/Gen/Ins, Number=Sing/Plur, plus possessive markers via Person[psor]/Number[psor].
- Verbs (VERB/AUX): VerbForm=Fin/Conv/Part/Vnoun, Tense, Mood, Person, Number, Polarity, sometimes Voice and Evident.
If you only watch one thing: VerbForm is often the “syntax switch” in Turkish:
- Conv frequently aligns with adverbial-clause behavior (often advcl downstream),
- Part frequently aligns with modifier behavior (often acl),
- Vnoun frequently behaves nominally (can take case and appear as arguments).
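The "syntax switch" idea can be written down as a small lookup. The sketch below is plain Python over FEATS strings; the helper names parse_feats and likely_role are hypothetical, the table just restates the tendencies listed above, and treating a missing VerbForm as finite is an assumption.

```python
# Heuristic lookup restating the reading guide above; real parses vary.
LIKELY_ROLE = {
    "Conv": "adverbial clause (often advcl)",
    "Part": "modifier (often acl)",
    "Vnoun": "nominal (can take case, can be an argument)",
    "Fin": "finite verb",
}


def parse_feats(feats):
    """Turn a UD FEATS string like 'Case=Acc|VerbForm=Part' into a dict."""
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def likely_role(feats):
    """Guess a verb's typical syntactic behavior from its VerbForm value."""
    verb_form = parse_feats(feats).get("VerbForm")
    # Assumption: a verb without any VerbForm feature is treated as finite.
    return LIKELY_ROLE.get(verb_form or "Fin", "unknown")


role = likely_role("Case=Acc|Polarity=Neg|Tense=Past|VerbForm=Part")
print(role)
print(likely_role("Aspect=Prog|Number=Sing|Person=1|Tense=Pres"))
```

Treat the result as a hint, not a label: gelmediğini from the example above is VerbForm=Part, yet its Case=Acc shows it acting as an object, so the parser's output remains the final word.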
Now that we've covered the morphologizer, the natural next component is the lemmatizer.
trainable_lemmatizer: stable base forms for Turkish tokens
In all our pipelines, the lemmatizer comes right after morphology. This order matters. In Turkish, deciding “what is the lemma?” is rarely a simple suffix strip: the same surface-looking ending can participate in different constructions, and the model often needs POS + UD morphological features to decide which base form is intended. spaCy’s trainable_lemmatizer is designed for that reality: it learns to predict lemmas from examples, using the contextual and grammatical signals produced earlier in the pipeline.
Practically, this component gives you token.lemma_: a normalized dictionary form that makes downstream tasks much easier—counting vocabulary, matching against lexicons, building search indexes, and normalizing entity mentions. If token.text is what was written, token.lemma_ is the form you usually want for “what word is this, really?”
Let's get started: run the code below to see some lemmas.
import spacy

nlp = spacy.load("tr_core_news_trf")

def show_lemmas(text):
    doc = nlp(text)
    for t in doc:
        if t.is_space:
            continue
        print(
            f"{t.text:<14} lemma={t.lemma_:<10}"
        )

show_lemmas("Ali’nin gelmediğini biliyorum.")
show_lemmas("Dün verdiğin konuyu googleladım")
The output of trainable_lemmatizer is simply token.lemma_: one normalized base form per token. The key point—especially for Turkish—is that this normalization is learned, not just a hand-written suffix stripper, so it can cope with messy real text and productive “Frankenstein” formations like googleladım → googlela, where an English brand/stem is Turkish-ified with native verbal morphology. In practice, that means you get consistent lemmas for (1) proper nouns with case/possessive suffixes (Ali’nin → Ali), (2) long inflected or non-finite verb forms (gelmediğini → gel, verdiğin → ver, biliyorum → bil), and (3) non-standard or newly coined loanword verbs—reducing sparsity and making downstream matching, counting, and search much more reliable.
A tiny, practical pattern: lemmatized vocabulary vs surface vocabulary
One practical use of lemmatization is to compact the vocabulary of statistical models, which reduces sparsity:
text = "Geldim, geliyorum, geleceğim; gelmediğini de biliyorum."
doc = nlp(text)
surface = [t.text for t in doc if t.is_alpha]
lemmas = [t.lemma_ for t in doc if t.is_alpha]
print("Surface:", surface)
print("Lemmas :", lemmas)
print("Unique surface:", len(set(surface)))
print("Unique lemmas :", len(set(lemmas)))
In our beautiful Turkish, the “unique surface” count often balloons quickly; lemmatization pulls many of those back into a smaller set of bases.
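The same idea carries over to frequency counts. The sketch below uses a hand-annotated Doc so that it runs without the trained model; the lemmas are assumptions chosen to mirror the examples in this post (gel, bil), not real model output. With the full pipeline you would count t.lemma_ over nlp(text) instead.

```python
from collections import Counter

import spacy
from spacy.tokens import Doc

nlp = spacy.blank("tr")  # vocab only; no trained components

# Hand-annotated stand-in for a processed doc (lemmas are assumptions).
words = ["Geldim", "geliyorum", "geleceğim", "gelmediğini", "biliyorum"]
lemmas = ["gel", "gel", "gel", "gel", "bil"]
doc = Doc(nlp.vocab, words=words, lemmas=lemmas)

# Surface counts treat every inflected form as its own entry.
surface_counts = Counter(t.text.lower() for t in doc)
# Lemma counts pool all inflections of the same base form.
lemma_counts = Counter(t.lemma_ for t in doc)

print("Surface:", dict(surface_counts))
print("Lemmas :", dict(lemma_counts))
```

Five distinct surface forms collapse into two lemma entries here, which is the sparsity reduction in miniature.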
Once we have both:
- token.morph (what the suffixes mean), and
- token.lemma_ (what word to normalize to),
the next component, the dependency parser, becomes easier to understand: case features line up with relations like obj/obl/nmod, and non-finite VerbForm values help explain attachments like advcl and acl. Let's dive into the dependency parser component.
Dependency parser
After tagging, morphology, and lemmatization, the dependency parser in tr_core_news_trf predicts the syntactic head of each token and the dependency relation connecting them. In spaCy you’ll typically read this via token.dep_ (relation label) and token.head (the token it attaches to). For Turkish, this step is especially useful because the parser can combine case-rich morphology (e.g., Case=Acc/Dat/Gen) with context to decide who did what to whom—even when word order is flexible.
A quick look in code
Let's run a quick snippet that prints each token's relation label and head:
import spacy

nlp = spacy.load("tr_core_news_trf")
text = "Keloğlan çıktı, kapıyı kapadı, sonra sessizce bekledi."
doc = nlp(text)

for t in doc:
    if t.is_space:
        continue
    print(f"{t.text:<10} dep={t.dep_:<8} head={t.head.text:<10} pos={t.pos_}")
This gives us the parse tree:
The image above shows a very typical Turkish narrative chain with coordination:
- Keloğlan → nsubj of çıktı (subject of the first verb).
- kapıyı → obj of kapadı (accusative-marked object attached to the verb).
- sonra and sessizce → advmod of bekledi (adverbial modifiers).
- The later verbs (kapadı, bekledi) attach to the earlier clause via conj (coordination), forming a sequence: he went out, (he) closed the door, then (he) waited quietly.
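Once heads and labels are available, token.head and token.children are enough to pull out simple "who did what" triples. The sketch below builds the parse by hand from the analysis above (the punctuation attachments are my assumption) so that it runs without the trained model; child_with is a hypothetical helper, and sharing the subject across conj-linked verbs is a simplification.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("tr")  # vocab only; the parse below is set by hand

words = ["Keloğlan", "çıktı", ",", "kapıyı", "kapadı", ",",
         "sonra", "sessizce", "bekledi", "."]
# Absolute head indices and labels following the analysis above.
heads = [1, 1, 4, 4, 1, 8, 8, 8, 1, 1]
deps = ["nsubj", "ROOT", "punct", "obj", "conj", "punct",
        "advmod", "advmod", "conj", "punct"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)


def child_with(token, label):
    """Return the first direct child with the given relation, if any."""
    return next((c for c in token.children if c.dep_ == label), None)


triples = []
for token in doc:
    if token.dep_ not in ("ROOT", "conj"):
        continue
    subj = child_with(token, "nsubj")
    if subj is None and token.dep_ == "conj":
        # Simplification: a coordinated verb borrows its head's subject.
        subj = child_with(token.head, "nsubj")
    obj = child_with(token, "obj")
    triples.append((subj.text if subj else None, token.text,
                    obj.text if obj else None))

for triple in triples:
    print(triple)
```

With the real pipeline, replace the hand-built Doc with nlp(text); the loop itself stays the same.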
For deeper parsing details
I’m keeping parsing intentionally light here. If you want a deeper dive into labels like obl, nmod, acl/advcl, coordination, and how Turkish case features interact with dependencies, see my longer write-up here: GitHub blog.
Tag your entities: NER
The last component of our pipeline is NER. This component is independent of the other pipeline components and is trained separately.
spaCy's Turkish models have a pretty nice story on the NER side: the NER layer is trained on our professionally annotated WikiNER dataset, so the model’s entity boundaries and label choices are grounded in the same conventions we used when building that corpus. The tagset is also pleasantly “modern” and fairly rich: you get the classic core labels like PERSON, ORG, and GPE/LOC, plus people/affiliation-ish categories like NORP and TITLE; more fine-grained named-object categories like FAC, PRODUCT, WORK_OF_ART, EVENT, and LAW; and all the numeric/time-ish ones you actually need in practice (DATE, TIME, and quantities such as QUANTITY, ORDINAL, CARDINAL, MONEY, and PERCENT), with LANGUAGE covering mentions like “Türkçe”. What makes it convenient for Turkish is that NER runs in the same pipeline you’ve already seen: proper nouns don’t get “broken” just because they carry suffixes (so something like “Ankara’dan” can still be recognized as a GPE with the lemma Ankara), and every entity token still carries its morphology and dependency annotations when you need more context. The figure below shows the idea in one glance: a single sentence cleanly surfaces a DATE span (“5 Temmuz 2005”), a FAC (“Reebok Stadyum”), and a GPE (“Bolton, İngiltere”).
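Before running the model, it may help to see how entities are represented: doc.ents is a tuple of Span objects, each with a label and token offsets. The sketch below sets the three spans described above by hand on a hand-tokenized Doc, so it runs without the trained NER; splitting the suffix 'de into its own token is my assumption and may not match the real tokenizer.

```python
from collections import defaultdict

import spacy
from spacy.tokens import Doc, Span

nlp = spacy.blank("tr")  # vocab only; no trained NER here

# Hand-built tokens; splitting off the suffix "'de" is an assumption.
words = ["Çekimler", "5", "Temmuz", "2005", "tarihinde", "Reebok", "Stadyum",
         ",", "Bolton", ",", "İngiltere", "'de", "yapılmıştır", "."]
spaces = [True, True, True, True, True, True, False,
          True, False, True, False, True, False, False]
doc = Doc(nlp.vocab, words=words, spaces=spaces)

# Entity spans set by hand to match the labels described above.
doc.ents = [
    Span(doc, 1, 4, label="DATE"),
    Span(doc, 5, 7, label="FAC"),
    Span(doc, 8, 11, label="GPE"),
]

# Group entity texts by label, a common first step for downstream use.
by_label = defaultdict(list)
for ent in doc.ents:
    by_label[ent.label_].append(ent.text)

for label, texts in by_label.items():
    print(f"{label:<6} {texts}")
```

The grouping loop works unchanged on a Doc produced by the trained pipeline.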
Let's see the code in action:
import spacy
nlp = spacy.load("tr_core_news_trf")
text = "Çekimler 5 Temmuz 2005 tarihinde Reebok Stadyum, Bolton, İngiltere'de yapılmıştır."
doc = nlp(text)
for ent in doc.ents:
    print(f"{ent.text:<25} label={ent.label_:<10} chars=({ent.start_char},{ent.end_char})")
Resulting in:
That's it, lovely readers. We hope you enjoyed the tour. It was a very compact one, and there is much more to discover and play with; below you'll find some resources to dive into the subject further.


