---
license: cc-by-nc-sa-4.0
language:
- sv
tags:
- segmentation
- tokenization
- combo-seg
- universal-dependencies
datasets:
- universal_dependencies
model-name: combo-seg-xlm-roberta-base-swedish-lines-ud2.17
pipeline_tag: token-classification
---

# COMBO-SEG Model for Swedish

## Model Description

This is a Swedish-language character-level segmentation model based on [COMBO-SEG](https://gitlab.clarin-pl.eu/syntactic-tools/combo-seg), an open-source text segmentation system. It performs:

- sentence segmentation
- tokenisation (including multi-word token detection)

The Swedish model uses `FacebookAI/xlm-roberta-base` as its base encoder and is trained on [UD_Swedish-LinES](https://github.com/UniversalDependencies/UD_Swedish-LinES) (UD v2.17).

## Evaluation

| Metric | Tokens | Words | Sentences |
| ------ | ------ | ----- | --------- |
| F1     | 99.99  | 99.98 | 89.42     |

## Usage

Install the library from PyPI:

```bash
pip install combo-seg
```

```python
from combo_seg import ComboSeg

# Load a pre-trained model
nlp = ComboSeg("Swedish")

# Segment raw text; returns a Document with hierarchy: Document -> Turn -> Sentence -> Token
doc = nlp("Den snabba bruna räven hoppar över den lata hunden.")

# Inspect results
for turn in doc.turns:
    for sentence in turn.sentences:
        print(f"Sentence: {sentence.text}")
        for token in sentence.tokens:
            if token.is_multi_word:
                print(f"  MWT: {token.text} -> {token.subwords}")
            else:
                print(f"  Token: {token.text}")
```

Or load the model directly from Hugging Face:

```python
from combo_seg import ComboSeg

nlp = ComboSeg.from_pretrained("clarin-pl/combo-seg-xlm-roberta-base-swedish-lines-ud2.17")
doc = nlp("Den snabba bruna räven hoppar över den lata hunden.")
```

## License

The license (cc-by-nc-sa-4.0) is derived from the training data, a Universal Dependencies treebank.
For the full license terms, refer to the `LICENSE.txt` file in the treebank repository:

- [UD_Swedish-LinES LICENSE.txt](https://github.com/UniversalDependencies/UD_Swedish-LinES/blob/master/LICENSE.txt)

## Citation

If you use this model, please cite:

Ulewicz, M., & Wróblewska, A. (2026). *COMBO-SEG Models Trained on UD v2.17*. https://doi.org/10.5281/zenodo.19651441

```bibtex
@software{combo_seg_2026,
  author    = {Ulewicz, Michał and Wróblewska, Alina},
  title     = {{COMBO-SEG} Models Trained on {UD} v2.17},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.19651441},
  url       = {https://doi.org/10.5281/zenodo.19651441}
}
```

## Resources

- COMBO-SEG: [https://gitlab.clarin-pl.eu/syntactic-tools/combo-seg](https://gitlab.clarin-pl.eu/syntactic-tools/combo-seg)
- UD_Swedish-LinES: [https://github.com/UniversalDependencies/UD_Swedish-LinES](https://github.com/UniversalDependencies/UD_Swedish-LinES)