Turkish spaCy models

Published April 28, 2026

Official Turkish spaCy models have been around for a while now, giving you ready-to-use statistical pipelines for processing Turkish text in a practical, “just run it” way.

We currently ship four packages: tr_core_news_[trf|lg|md|sm]. You can find them all in our Hugging Face collection.

If you want to try things immediately, we also provide a kick-start Colab notebook.

Each package comes with a pretrained pipeline that includes sentence segmentation, POS tagging, morphological analysis (morphologizer), lemmatization, dependency parsing, and named entity recognition (NER). Below, we’ll look at these components one by one—but first, let’s install a model. In this post I’ll use the Transformer-based package, tr_core_news_trf:

pip install https://huggingface.co/turkish-nlp-suite/tr_core_news_trf/resolve/main/tr_core_news_trf-1.0-py3-none-any.whl

Then load it:

>>> import spacy
>>> nlp = spacy.load("tr_core_news_trf")

Now we're ready for action. Let's dive in!

Finding sentence boundaries

Before we can parse syntax, recognize entities, or run any sentence-level analytics, we need one thing: reliable sentence boundaries. In spaCy, sentence segmentation is represented by the Token.is_sent_start flag, and exposed via Doc.sents. In a full Turkish pipeline, there are two main ways those boundaries can be produced: a rule-based sentencizer or a dependency parser–based segmenter. They solve the same problem, but with different trade-offs.

1) Rule-based: the `sentencizer`

The sentencizer is spaCy’s lightweight, deterministic sentence boundary component. It primarily uses punctuation and token patterns (e.g. . ! ? and related rules) to decide where sentences start. It does not need a tagger or parser, and it’s extremely fast—great for large-scale batch processing and production settings where predictability matters.

In Turkish, the sentencizer tends to work very well on clean, editorial text, but it can be challenged by cases where a period doesn’t mean “end of sentence”, such as:

abbreviations (Dr., Prof., T.C., Örn., vb.)
ordinals and list items (1. madde, 2. kez, 3. sınıf)
dates/times and decimals (12.04.2026, 14.30, 36.5)
quotes/parentheses where punctuation appears inside spans

Because it’s rule-based, its behavior is easy to reason about and consistent—if it’s wrong, it’s often wrong in a repeatable way (which can be a feature when you want stable preprocessing).
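One way to make those repeatable errors visible is to scan the segmenter's output for sentences that end in a token that rarely closes a sentence. The sketch below is a minimal, illustrative audit helper and not part of spaCy: the function name suspicious_boundaries and the abbreviation set are assumptions built from the list above, and the input is a hypothetical over-split result of the kind you might collect with [s.text for s in doc.sents].

```python
import re

# Abbreviations taken from the list above; extend this set for your own domain.
ABBREVIATIONS = {"Dr.", "Prof.", "T.C.", "Örn.", "vb."}
# A bare number followed by a period, e.g. "1." or "12."
ORDINAL_RE = re.compile(r"^\d+\.$")

def suspicious_boundaries(sentences):
    """Return (index, last_token) pairs for sentences ending in a likely non-final period."""
    flagged = []
    for i, sent in enumerate(sentences):
        parts = sent.split()
        if not parts:
            continue
        last = parts[-1]
        if last in ABBREVIATIONS or ORDINAL_RE.match(last):
            flagged.append((i, last))
    return flagged

# Hypothetical over-split output, written by hand for illustration.
sentences = ["Dr.", "Ayşe Yılmaz bugün Ankara’ya geldi.", "1.", "madde yürürlüğe girdi."]
print(suspicious_boundaries(sentences))  # [(0, 'Dr.'), (2, '1.')]
```

Note that vb. can legitimately end a Turkish sentence, so treat the flags as candidates for review rather than definite errors.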

2) Model-based: sentence boundaries from the dependency parser

In spaCy, the dependency parser can also set sentence boundaries. Instead of relying mainly on punctuation, it uses a learned model that considers broader context: surrounding tokens, structure, and the kinds of sequences that typically form a sentence in the training data. This can help in exactly the tricky Turkish cases above—especially when punctuation is ambiguous—because the model can learn patterns like “1. followed by madde is probably not a boundary by itself” or that T.C. is often internal to a sentence.

The trade-off is that parser-based segmentation is:

heavier (more compute than the sentencizer)
data-dependent (quality reflects the data the parser was trained on)
sometimes less “predictable” than explicit rules in edge cases

How they fit in a full pipeline

In a full package, you typically choose one primary source of boundaries:

Use sentencizer when you want speed and deterministic behavior, or when your text domain is consistent and punctuation is reliable.
Use parser-based segmentation when you already run the dependency parser anyway and you want better handling of ambiguous punctuation, abbreviations, and “list-like” constructs.

A practical pattern is to treat this as a configurable choice: sentence boundaries are foundational, and different downstream tasks (NER vs parsing vs sentence-level classification) can prefer different segmentation behaviors. Let's see both approaches in action:

import spacy

MODEL = "tr_core_news_trf"  # TODO: replace with your pipeline name

TEXTS = {
    "abbreviations": (
        "Dr. Ayşe Yılmaz bugün Ankara’ya geldi. Prof. Dr. Mehmet K. Demir de toplantıdaydı. "
        "Toplantı T.C. Sağlık Bakanlığı’nda yapıldı. Örn. bu tür kısaltmalar cümle sonu değildir."
    ),
    "ordinals": (
        "1. madde yürürlüğe girdi. 2. madde ise yeniden düzenlendi. "
        "3. kez denedik ama sonuç değişmedi. 4. sınıf öğrencileri katıldı."
    ),
    "dates_and_decimals": (
        "Rapor 12.04.2026 tarihinde yayımlandı. Sıcaklık 36.5 dereceye çıktı. "
        "Saat 09.15’te başladık. 10. bölümde konu tekrar ele alınıyor."
    ),
}

def print_sents(doc, title):
    print("\n" + title)
    print("-" * len(title))
    for i, sent in enumerate(doc.sents, 1):
        print(f"{i:02d}. {sent.text}")

# --- A) Rule-based segmentation: tokenizer + sentencizer
nlp_sent = spacy.blank("tr")
nlp_sent.add_pipe("sentencizer")

# --- B) Parser-based segmentation: load full pipeline, remove sentencizer (if present)
nlp_full = spacy.load(MODEL)

if "sentencizer" in nlp_full.pipe_names:
    nlp_full.remove_pipe("sentencizer")


# --- Run both and print results
for name, text in TEXTS.items():
    doc_sent = nlp_sent(text)
    doc_full = nlp_full(text)

    print_sents(doc_sent, f"[{name}] sentencizer (rule-based)")
    print_sents(doc_full, f"[{name}] parser-based (full pipeline)")

In the printed results, each block shows the same Turkish text split into sentences by two different strategies: the rule-based sentencizer (fast, punctuation-driven) and the full pipeline with the dependency parser (context-aware). When the two outputs match, the boundaries are unambiguous (clear sentence-ending punctuation, straightforward numbering, dates/decimals that don’t confuse tokenization). When they differ, it is most often around abbreviations and dotted forms like Dr., T.C., Örn. or section-like patterns such as 1. followed by a noun. That difference typically reflects the trade-off: the sentencizer may treat a period as sentence-final and split too early, while the parser can often keep the span together because it has learned that these dotted tokens frequently continue the same sentence in Turkish text.
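To turn that side-by-side reading into something you can count, one option is to compare the character offsets where each strategy starts a sentence. The helper below is a small sketch under assumptions: boundary_diff is a hypothetical name, and the two offset lists are hand-written stand-ins for what you could collect from spaCy with [s.start_char for s in doc.sents].

```python
def boundary_diff(starts_a, starts_b):
    """Compare two lists of sentence-start character offsets."""
    a, b = set(starts_a), set(starts_b)
    return {
        "shared": sorted(a & b),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
    }

text = "Dr. Ayşe geldi. Prof. Demir de geldi."

# Hypothetical offsets, written by hand for illustration (not real model output).
rule_based_starts = [0, 4, 16, 22]  # splits after "Dr." and "Prof."
parser_starts = [0, 16]             # keeps each title inside its sentence

diff = boundary_diff(rule_based_starts, parser_starts)
for offset in diff["only_a"]:
    # Show the text right after each boundary that only the first strategy produced
    print(offset, repr(text[offset:offset + 10]))
```

Boundaries that appear in only one list are the places worth inspecting by hand.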

Morphologizer: Turkish grammar as UD features

In tr_core_news_trf, the pipeline order is:

['transformer', 'tagger', 'morphologizer', 'trainable_lemmatizer', 'parser', 'ner']

That placement is exactly what you want for Turkish: the tagger assigns part-of-speech tags first, and the morphologizer then predicts UD morphological features (FEATS) and exposes them as token.morph. (In spaCy, the morphologizer is also the component that can set the coarse category, token.pos_.)

For Turkish, token.morph is where the language becomes “machine-readable”: case, possession, agreement, negation, TAM, voice, and—most importantly—whether a verb is finite or one of the Turkish workhorse non-finite forms (VerbForm=Conv/Part/Vnoun). Those features then become strong signals for the next components, especially the trainable_lemmatizer and the dependency parser. Let's see an example.

A “killer” example: when a whole clause becomes one token

Take:

Ali’nin gelmediğini biliyorum.
"I know that Ali didn’t come."

import spacy

nlp = spacy.load("tr_core_news_trf")
print("Pipeline:", nlp.pipe_names)

def show_morph(doc):
    for t in doc:
        if t.is_space:
            continue
        # token.morph prints as a UD FEATS bundle, and lemma_ is useful context
        print(f"{t.text:<12} {str(t.morph):<90} {t.lemma_:<10} {t.pos_}")

text = "Ali’nin gelmediğini biliyorum."
doc = nlp(text)
show_morph(doc)

Here is the output of the above code (token, UD FEATS, lemma, POS):

Ali’nin      Case=Gen|Number=Sing|Person=3    Ali  PROPN
gelmediğini  Aspect=Perf|Case=Acc|Number=Sing|Number[psor]=Sing|Person=3|Person[psor]=3|Polarity=Neg|Tense=Past|VerbForm=Part  gel  VERB
biliyorum    Aspect=Prog|Number=Sing|Person=1|Polarity=Pos|Tense=Pres  bil  VERB
.            .      .    PUNCT

This is such a good morphology demo because each token exposes a different layer of the clause packaging:

Ali’nin is marked Case=Gen, signaling the embedded-clause subject as a genitive dependent.
gelmediğini carries the embedded meaning in a single word: Polarity=Neg + Tense=Past + Aspect=Perf gives “didn’t come,” VerbForm=Part shows it’s non-finite (one of the forms Turkish uses to build clause-like nominalizations/modifiers), Case=Acc marks the whole embedded event as an accusative object of biliyorum, and Person[psor]=3 / Number[psor]=Sing adds possessive agreement on that nominalized event (roughly “his not-coming”).
biliyorum anchors the matrix clause with Person=1|Number=Sing (“I”) and present/progressive-style marking (Tense=Pres, Aspect=Prog).

In Turkish, morphology doesn’t just decorate words: it reveals when an entire clause has been turned into a noun-like object, and you can see that directly in VerbForm + Case + possessive features on a single token.
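The same reading can be done programmatically. The sketch below is a plain-Python illustration that parses a FEATS string (the form you get from str(token.morph)) and checks for the combination discussed above: a non-finite VerbForm together with Case and possessive agreement. The names parse_feats and looks_like_nominalized_clause are hypothetical helpers, and the rule itself is a heuristic assumption, not something the pipeline exposes.

```python
NON_FINITE = {"Conv", "Part", "Vnoun"}

def parse_feats(feats):
    """Turn a UD FEATS string such as 'Case=Acc|Number=Sing' into a dict."""
    # Items without "=" (e.g. an empty or placeholder value) are skipped
    pairs = (item.split("=", 1) for item in feats.split("|") if "=" in item)
    return {key: value for key, value in pairs}

def looks_like_nominalized_clause(feats):
    """Heuristic: a non-finite verb form that also carries case and possessor agreement."""
    f = parse_feats(feats)
    return (
        f.get("VerbForm") in NON_FINITE
        and "Case" in f
        and "Person[psor]" in f
    )

# FEATS strings copied from the output above
gelmedigini = (
    "Aspect=Perf|Case=Acc|Number=Sing|Number[psor]=Sing|Person=3|"
    "Person[psor]=3|Polarity=Neg|Tense=Past|VerbForm=Part"
)
biliyorum = "Aspect=Prog|Number=Sing|Person=1|Polarity=Pos|Tense=Pres"

print(looks_like_nominalized_clause(gelmedigini))  # True
print(looks_like_nominalized_clause(biliyorum))    # False
```

On live tokens, token.morph.get("VerbForm") and token.morph.to_dict() give you the same information without string parsing.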

Practical reading guide: the highest-value FEATS in Turkish

When you scan token.morph, these are the features that tend to pay off immediately:

Nominals (NOUN/PROPN/PRON):
Case=Nom/Acc/Dat/Loc/Abl/Gen/Ins, Number=Sing/Plur, plus possessive markers via Person[psor] / Number[psor].
Verbs (VERB/AUX):
VerbForm=Fin/Conv/Part/Vnoun, Tense, Mood, Person, Number, Polarity, sometimes Voice and Evident.

If you only watch one thing, watch VerbForm, which is often the “syntax switch” in Turkish:

Conv frequently aligns with adverbial-clause behavior (often advcl downstream),
Part frequently aligns with modifier behavior (often acl),
Vnoun frequently behaves nominally (can take case and appear as arguments).

Now that we've covered the morphologizer, the natural next component is the lemmatizer.

`trainable_lemmatizer`: stable base forms for Turkish tokens

In all our pipelines, the lemmatizer comes right after morphology. This order matters. In Turkish, deciding “what is the lemma?” is rarely a simple suffix strip: the same surface-looking ending can participate in different constructions, and the model often needs POS + UD morphological features to decide which base form is intended. spaCy’s trainable_lemmatizer is designed for that reality: it learns to predict lemmas from examples, using the contextual and grammatical signals produced earlier in the pipeline.

Practically, this component gives you token.lemma_: a normalized dictionary form that makes downstream tasks much easier—counting vocabulary, matching against lexicons, building search indexes, and normalizing entity mentions. If token.text is what was written, token.lemma_ is the form you usually want for “what word is this, really?”

Let's get started: run the code below to see some lemmas.

import spacy

nlp = spacy.load("tr_core_news_trf")

def show_lemmas(text):
    doc = nlp(text)
    for t in doc:
        if t.is_space:
            continue
        print(
            f"{t.text:<14} lemma={t.lemma_:<10}"
        )

show_lemmas("Ali’nin gelmediğini biliyorum.")
show_lemmas("Dün verdiğin konuyu googleladım")

The output of trainable_lemmatizer is simply token.lemma_: one normalized base form per token. The key point—especially for Turkish—is that this normalization is learned, not just a hand-written suffix stripper, so it can cope with messy real text and productive “Frankenstein” formations like googleladım → googlela, where an English brand/stem is Turkish-ified with native verbal morphology. In practice, that means you get consistent lemmas for (1) proper nouns with case/possessive suffixes (Ali’nin → Ali), (2) long inflected or non-finite verb forms (gelmediğini → gel, verdiğin → ver, biliyorum → bil), and (3) non-standard or newly coined loanword verbs—reducing sparsity and making downstream matching, counting, and search much more reliable.

A tiny, practical pattern: lemmatized vocabulary vs surface vocabulary

One practical use of lemmatization is to compact the vocabulary of statistical models, which reduces sparsity:

text = "Geldim, geliyorum, geleceğim; gelmediğini de biliyorum."
doc = nlp(text)

surface = [t.text for t in doc if t.is_alpha]
lemmas  = [t.lemma_ for t in doc if t.is_alpha]

print("Surface:", surface)
print("Lemmas :", lemmas)
print("Unique surface:", len(set(surface)))
print("Unique lemmas :", len(set(lemmas)))

In our beautiful Turkish, the “unique surface” count often balloons quickly; lemmatization pulls many of those back into a smaller set of bases.
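To see which surface forms collapse together, you can group them by lemma. The sketch below is plain Python; group_by_lemma is a hypothetical helper name, and the (surface, lemma) pairs are hand-written stand-ins for [(t.text, t.lemma_) for t in doc]. The lemmas for gelmediğini and biliyorum come from the examples earlier in this post, while the other three are the expected dictionary forms, assumed here rather than taken from model output.

```python
from collections import defaultdict

def group_by_lemma(pairs):
    """Map each lemma to the sorted list of distinct surface forms that share it."""
    groups = defaultdict(set)
    for surface, lemma in pairs:
        groups[lemma].add(surface)
    return {lemma: sorted(forms) for lemma, forms in groups.items()}

# Hand-written pairs for illustration (see the assumptions in the paragraph above).
pairs = [
    ("Geldim", "gel"),
    ("geliyorum", "gel"),
    ("geleceğim", "gel"),
    ("gelmediğini", "gel"),
    ("biliyorum", "bil"),
]

groups = group_by_lemma(pairs)
for lemma, forms in groups.items():
    print(f"{lemma}: {len(forms)} surface form(s) -> {forms}")
```

Five distinct surface forms map to two lemmas here, which is the sparsity reduction described above in miniature.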

Once we have both:

token.morph (what the suffixes mean), and
token.lemma_ (what word to normalize to),

the next component, the dependency parser, becomes easier to understand: case features line up with relations like obj/obl/nmod, and non-finite VerbForm values help explain attachments like advcl and acl. Let's dive into the dependency parser component.

Dependency parser

After tagging, morphology, and lemmatization, the dependency parser in tr_core_news_trf predicts the syntactic head of each token and the dependency relation connecting them. In spaCy you’ll typically read this via token.dep_ (relation label) and token.head (the token it attaches to). For Turkish, this step is especially useful because the parser can combine case-rich morphology (e.g., Case=Acc/Dat/Gen) with context to decide who did what to whom—even when word order is flexible.

A quick look in code

Let's run our quick code

import spacy
nlp = spacy.load("tr_core_news_trf")

text = "Keloğlan çıktı, kapıyı kapadı, sonra sessizce bekledi."
doc = nlp(text)

for t in doc:
    if t.is_space:
        continue
    print(f"{t.text:<10} dep={t.dep_:<8} head={t.head.text:<10} pos={t.pos_}")

The resulting parse tree shows a very typical Turkish narrative chain with coordination:

Keloğlan → nsubj of çıktı (subject of the first verb).
kapıyı → obj of kapadı (accusative-marked object attached to the verb).
sonra and sessizce → advmod of bekledi (adverbial modifiers).
The later verbs (kapadı, bekledi) attach to the earlier clause via conj (coordination), forming a sequence: he went out, (he) closed the door, then (he) waited quietly.
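If you want to work with these relations rather than just read them, a small grouping helper goes a long way. The sketch below only reads .text, .dep_, and .head, so it should also accept spaCy tokens; to keep it runnable without a model, it uses a tiny stand-in class (Tok, a hypothetical name) and a hand-built parse that follows the description above. One assumption to flag: both later verbs are attached to the first verb via conj, which is the usual UD convention but is not confirmed by the text.

```python
class Tok:
    """Minimal stand-in exposing the three attributes read below, like a spaCy Token."""
    def __init__(self, text, dep_, head=None):
        self.text = text
        self.dep_ = dep_
        # In spaCy the root token is its own head; mirror that here
        self.head = head if head is not None else self

def dependents_by_head(tokens, relations=("nsubj", "obj", "advmod")):
    """Collect (relation, dependent text) pairs under each head's text."""
    table = {}
    for t in tokens:
        if t.dep_ in relations:
            table.setdefault(t.head.text, []).append((t.dep_, t.text))
    return table

# Hand-built parse for illustration (not model output).
cikti = Tok("çıktı", "ROOT")
kapadi = Tok("kapadı", "conj", cikti)
bekledi = Tok("bekledi", "conj", cikti)
tokens = [
    Tok("Keloğlan", "nsubj", cikti), cikti,
    Tok("kapıyı", "obj", kapadi), kapadi,
    Tok("sonra", "advmod", bekledi),
    Tok("sessizce", "advmod", bekledi), bekledi,
]

table = dependents_by_head(tokens)
verb_chain = [t.text for t in tokens if t.dep_ in ("ROOT", "conj")]
print(table)
print(verb_chain)
```

With a loaded pipeline you would pass the tokens of a parsed doc instead of the stand-ins.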

For deeper parsing details

I’m keeping parsing intentionally light here. If you want a deeper dive into labels like obl, nmod, acl/advcl, coordination, and how Turkish case features interact with dependencies, see my longer write-up here: GitHub blog.

Tag your entities: NER

The last component of our pipeline is NER. This component is independent of the other pipeline components and is trained separately.

spaCy's Turkish models have a pretty nice story on the NER side: the NER layer is trained on our professionally annotated WikiNER dataset, so the model’s entity boundaries and label choices are grounded in the same conventions we used when building that corpus.

The tagset is also pleasantly “modern” and fairly rich: you get the classic core labels like PERSON, ORG, and GPE/LOC, plus people/affiliation-ish categories like NORP and TITLE; more fine-grained named-object categories like FAC, PRODUCT, WORK_OF_ART, EVENT, and LAW; and all the numeric/time-ish ones you actually need in practice (DATE, TIME, and quantities such as QUANTITY, ORDINAL, CARDINAL, MONEY, and PERCENT), with LANGUAGE covering mentions like “Türkçe”.

What makes it work well for Turkish is that proper nouns don’t get “broken” just because they carry suffixes: something like “Ankara’dan” can still be recognized as a GPE, with the lemmatizer supplying the lemma Ankara, and the surrounding context helps the model make sensible decisions without fragmenting spans. The example below shows the idea at a glance: a single sentence cleanly surfaces a DATE span (“5 Temmuz 2005”), a FAC (“Reebok Stadyum”), and a GPE (“Bolton, İngiltere”).

Let's see the code in action:

import spacy
nlp = spacy.load("tr_core_news_trf")

text = "Çekimler 5 Temmuz 2005 tarihinde Reebok Stadyum, Bolton, İngiltere'de yapılmıştır."
doc = nlp(text)

for ent in doc.ents:
    print(f"{ent.text:<25} label={ent.label_:<10} chars=({ent.start_char},{ent.end_char})")

Resulting in the three entity spans described above: a DATE, a FAC, and a GPE.
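Downstream code often wants named entities and numeric/time expressions handled separately. The sketch below shows one way to split them in plain Python; split_entities and NUMERIC_LABELS are hypothetical names, the label grouping follows the tagset description above, and the input pairs are written by hand from the spans mentioned in the text as stand-ins for [(e.text, e.label_) for e in doc.ents].

```python
# Numeric and time-like labels, as grouped in the tagset description above
NUMERIC_LABELS = {"DATE", "TIME", "QUANTITY", "ORDINAL", "CARDINAL", "MONEY", "PERCENT"}

def split_entities(ents):
    """Split (text, label) pairs into named entities and numeric/time expressions."""
    named, numeric = [], []
    for text, label in ents:
        (numeric if label in NUMERIC_LABELS else named).append((text, label))
    return named, numeric

# Hand-written pairs based on the spans described in the text (not copied from model output).
ents = [
    ("5 Temmuz 2005", "DATE"),
    ("Reebok Stadyum", "FAC"),
    ("Bolton, İngiltere", "GPE"),
]

named, numeric = split_entities(ents)
print("Named  :", named)
print("Numeric:", numeric)
```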

That's it, lovely readers; we hope you enjoyed the tour. It was a very compact one, and there is much more to discover and play with. Below you'll find some resources to dive further into the subject.

References

WikiNER: Your Ultimate Turkish NER Set
