---
license: cc-by-nc-sa-4.0
language:
  - sv
tags:
  - segmentation
  - tokenization
  - combo-seg
  - universal-dependencies
datasets:
  - universal_dependencies
model-name: combo-seg-xlm-roberta-base-swedish-lines-ud2.17
pipeline_tag: token-classification
---

# COMBO-SEG Model for Swedish

## Model Description

This is a Swedish-language character-level segmentation model based on [COMBO-SEG](https://gitlab.clarin-pl.eu/syntactic-tools/combo-seg), an open-source text segmentation system. It performs:

- sentence segmentation
- tokenisation (including multi-word token detection)

The Swedish model uses `FacebookAI/xlm-roberta-base` as its base encoder and was trained on [UD_Swedish-LinES](https://github.com/UniversalDependencies/UD_Swedish-LinES) (UD v2.17).


## Evaluation

| Metric | Tokens | Words | Sentences |
| ------ | ------ | ----- | --------- |
| F1 | 99.99 | 99.98 | 89.42 |
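
The card does not state which evaluation script produced these scores. As a rough illustration only, the sketch below assumes exact span matching in the style of the CoNLL 2018 UD evaluation, where a predicted unit counts as correct only if its character span equals a gold span. The function name `span_f1` and the toy spans are hypothetical and not part of COMBO-SEG.

```python
def span_f1(gold_spans, pred_spans):
    """Return (precision, recall, F1) in percent for exact span matches.

    Each span is a (start, end) pair of character offsets.
    """
    gold, pred = set(gold_spans), set(pred_spans)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return 100 * precision, 100 * recall, 100 * f1


# Toy data (hypothetical): three gold sentences, but the prediction
# misses the boundary between the second and the third.
gold_sentences = [(0, 12), (13, 30), (31, 40)]
pred_sentences = [(0, 12), (13, 40)]

precision, recall, f1 = span_f1(gold_sentences, pred_sentences)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Under this assumption, a single missed sentence boundary invalidates two gold sentences at once, while the tokens inside them can all still match.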


## Usage

Install the library from PyPI:

```bash
pip install combo-seg
```

Then load a pre-trained model by language name and segment raw text:

```python
from combo_seg import ComboSeg

# Load a pre-trained model
nlp = ComboSeg("Swedish")

# Segment raw text; returns a Document with the hierarchy Document -> Turn -> Sentence -> Token
doc = nlp("Den snabba bruna räven hoppar över den lata hunden.")

# Inspect results
for turn in doc.turns:
    for sentence in turn.sentences:
        print(f"Sentence: {sentence.text}")
        for token in sentence.tokens:
            if token.is_multi_word:
                print(f"  MWT: {token.text} -> {token.subwords}")
            else:
                print(f"  Token: {token.text}")
```

Alternatively, load the model directly from the Hugging Face Hub:

```python
from combo_seg import ComboSeg

nlp = ComboSeg.from_pretrained("clarin-pl/combo-seg-xlm-roberta-base-swedish-lines-ud2.17")
doc = nlp("Den snabba bruna räven hoppar över den lata hunden.")
```

## License

The license of this model (CC BY-NC-SA 4.0) is derived from the license of its training data, the Universal Dependencies treebank UD_Swedish-LinES. For the full license terms, refer to the `LICENSE.txt` file in the treebank repository:

- [UD_Swedish-LinES LICENSE.txt](https://github.com/UniversalDependencies/UD_Swedish-LinES/blob/master/LICENSE.txt)


## Citation

If you use this model, please cite:

Ulewicz, M., & Wróblewska, A. (2026). *COMBO-SEG Models Trained on UD v2.17*. https://doi.org/10.5281/zenodo.19651441

```bibtex
@software{combo_seg_2026,
  author    = {Ulewicz, Michał and Wróblewska, Alina},
  title     = {{COMBO-SEG} Models Trained on {UD} v2.17},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.19651441},
  url       = {https://doi.org/10.5281/zenodo.19651441}
}
```

## Resources

- COMBO-SEG: [https://gitlab.clarin-pl.eu/syntactic-tools/combo-seg](https://gitlab.clarin-pl.eu/syntactic-tools/combo-seg)
- UD_Swedish-LinES: [https://github.com/UniversalDependencies/UD_Swedish-LinES](https://github.com/UniversalDependencies/UD_Swedish-LinES)