---
license: apache-2.0
datasets:
- mamei16/multilingual-wikipedia-paragraphs
base_model:
- distilbert/distilbert-base-multilingual-cased
pipeline_tag: token-classification
library_name: transformers
tags:
- RAG
- rag
- chunking
language:
- afr
- ara
- arg
- ast
- azb
- aze
- bak
- bar
- bel
- bos
- bpy
- bre
- bul
- bxg
- cat
- ceb
- ces
- che
- chv
- cym
- dan
- deu
- eng
- est
- eus
- fas
- fin
- fra
- fry
- gle
- glg
- guj
- hbs
- heb
- hin
- hrv
- ht
- hun
- hye
- ido
- ind
- isl
- ita
- jav
- jpn
- kan
- kat
- kaz
- kor
- ky
- lat
- lav
- lit
- lmo
- ltz
- lzh
- mal
- mar
- mbp
- min
- mkd
- mlg
- mon
- mus
- mya
- nds
- ne
- new
- nld
- nno
- nor
- oci
- pan
- pms
- pnb
- pol
- por
- ron
- rus
- scn
- sco
- slk
- slv
- spa
- sqi
- srp
- sun
- swe
- swh
- tam
- tat
- tel
- tgk
- tgl
- tha
- tur
- ukr
- urd
- uzb
- vie
- vol
- wbf
- yor
- zho
---

## Model Details

### Model Description

Fine-tune of [distilbert/distilbert-base-multilingual-cased](https://huggingface.co/distilbert/distilbert-base-multilingual-cased) trained on nearly 11 billion tokens from more than 34 million Wikipedia articles to predict paragraph breaks. This model can be used to split arbitrary natural language texts into semantic chunks.

### Model Sources

- **Repository:** https://github.com/mamei16/chonky
- **Demo:** https://huggingface.co/spaces/mamei16/chonky_chunk

## Uses

This model can be used as the chunking step of a RAG pipeline, with the aim of improving downstream performance.

## Bias, Risks, and Limitations

This model has been fine-tuned on non-fictional natural language from Wikipedia. As such, it may not work as well on fictional texts containing dialog, on poems, or on mathematical expressions and code.

## How to Get Started with the Model

```
pip install git+https://github.com/mamei16/chonky
```

## Usage:

```python
from chonky import ParagraphSplitter

splitter = ParagraphSplitter(device="cpu", model_id="mamei16/chonky_distilbert-base-multilingual-cased")

text = """Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights."""

# Print each predicted chunk, followed by a separator line
for chunk in splitter(text):
    print(chunk)
    print("--")

# Output:
# Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.
# --
# The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.
# --
```

## Training Details

### Training Data

Link: https://huggingface.co/datasets/mamei16/multilingual-wikipedia-paragraphs

Note that the data has been pre-tokenized and truncated using the tokenizer from [distilbert/distilbert-base-multilingual-cased](https://huggingface.co/distilbert/distilbert-base-multilingual-cased).

The training data is based on [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia). Although that dataset claims that unwanted sections such as "References", "See more" etc. have been removed, this is not the case. Hence, a semi-manual and only approximately accurate procedure was used to identify and remove those sections. This procedure involved finding the most common short paragraphs in the last 500 characters of the articles in each language and translating each of them to identify the section headers corresponding to the English "References" etc.

Another issue was the existence of several million articles, most notably in Waray, Cebuano and Swedish, which had been written by a bot named Lsjbot. To avoid artificially inflating low-resource languages and to ensure that most (if not all) articles were written by humans, these articles were removed.

Lastly, stub articles and articles containing too many extremely long or short paragraphs were removed.

### Training Procedure

#### Preprocessing

All articles were truncated to a maximum of 512 tokens.
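
The search for candidate section headers described under Training Data can be sketched roughly as follows. This is only an illustration, not the script that was actually used: the function name, the `max_len` cutoff for what counts as a "short" paragraph, and `top_n` are hypothetical; only the 500-character tail comes from the description above.

```python
from collections import Counter


def common_trailing_paragraphs(articles, tail_chars=500, max_len=40, top_n=20):
    """Count short paragraphs that appear near the end of articles.

    tail_chars=500 follows the description above; max_len and top_n
    are illustrative assumptions.
    """
    counts = Counter()
    for text in articles:
        pieces = text[-tail_chars:].split("\n")
        # If the tail was cut out of a longer article, the first piece
        # may be a truncated paragraph, so it is skipped.
        if len(text) > tail_chars:
            pieces = pieces[1:]
        # Count each distinct short paragraph once per article.
        short = {p.strip() for p in pieces if 0 < len(p.strip()) <= max_len}
        counts.update(short)
    return counts.most_common(top_n)
```

The most frequent results for each language would then be translated and checked by hand to decide which of them are headers such as "References".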

For designing the training strategy, the following desirable traits were identified:

- Avoid catastrophic forgetting
- Avoid sudden drastic changes in distribution
- Boost low-resource languages
- Uphold good performance on high-resource/most commonly used languages
- Must be implementable with static training data

In the end, the training data was generated by combining the datasets of all 104 languages using temperature sampling without replacement, with τ = 2. This boosts low-resource languages, and languages "die out" one by one, spread over the entire epoch (see figure below), which avoids sudden large changes in distribution. Furthermore, the model keeps seeing high-resource languages until the end, making it likely that it maintains good performance on them. At the same time, the linear learning rate schedule ensures that the more low-resource languages have been exhausted during training, the lower the learning rate is, which makes catastrophic forgetting less likely.

Fig. 1: Number of languages remaining vs. number of training steps taken
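
As a rough illustration of temperature sampling without replacement, the sketch below picks a language at each step with probability proportional to `n ** (1 / tau)`, where `n` is that language's number of articles, and drops a language once all of its articles have been used. It is a minimal sketch under assumptions, not the actual data-generation code: the function names are hypothetical, and it assumes the weights are computed once from the original dataset sizes and renormalized over the languages that remain.

```python
import random


def sampling_probs(sizes, tau=2.0):
    """Temperature-scaled probabilities: p_i proportional to n_i ** (1 / tau)."""
    weights = {lang: n ** (1.0 / tau) for lang, n in sizes.items() if n > 0}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}


def temperature_sample_order(sizes, tau=2.0, seed=0):
    """Return the language of each training example, in sampling order.

    Assumption: weights are fixed from the original sizes; a language is
    removed (it "dies out") once its examples are exhausted.
    """
    rng = random.Random(seed)
    remaining = {lang: n for lang, n in sizes.items() if n > 0}
    weights = {lang: n ** (1.0 / tau) for lang, n in remaining.items()}
    order = []
    while weights:
        langs = list(weights)
        lang = rng.choices(langs, weights=[weights[l] for l in langs], k=1)[0]
        order.append(lang)
        remaining[lang] -= 1
        if remaining[lang] == 0:
            # No articles left: stop sampling this language.
            del weights[lang]
    return order
```

With τ = 2, a language with 100 articles gets weight 10 and one with 4 articles gets weight 2, so the smaller language is drawn one time in six instead of roughly one time in 26. Because it is drawn more often than its size would suggest, it is also exhausted well before the larger one.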

#### Training Hyperparameters

- Training regime: fp16 mixed precision
- Epochs: 1
- Batch size: 64
- Start learning rate: 2e-5
- Optimizer: Adam
- Weight decay: 0.01
- Loss: NLLLoss
- Label Smoothing factor: 0.1

## Evaluation

### [MTCB](https://github.com/chonkie-inc/mtcb) Nano Benchmark (Aggregated Score)

**Note: This benchmark is English only (and includes code in multiple programming languages)**

*Score = Mean of `mean_recall`, `mean_precision`, `mean_mrr`, and `mean_ndcg` across `k=[1, 3, 5, 10]`* ([Metrics reference](https://github.com/chonkie-inc/mtcb#-metrics))

| Model / Chunker | Chunk Size 512 | Chunk Size 1024 | Chunk Size 2048 | Avg Score |
| :--- | :---: | :---: | :---: | :---: |
| [mirth/chonky_modernbert_large_1](https://huggingface.co/mirth/chonky_modernbert_large_1) | 0.5621 | 0.5621 | 0.5621 | 0.5621 |
| [mamei16/chonky_mdistilbert-base-english-cased](https://huggingface.co/mamei16/chonky_mdistilbert-base-english-cased) | 0.5517 | 0.5517 | 0.5517 | 0.5517 |
| [mamei16/chonky_distilbert_base_uncased_1.1](https://huggingface.co/mamei16/chonky_distilbert_base_uncased_1.1) | 0.5342 | 0.5342 | 0.5342 | 0.5342 |
| [mirth/chonky_modernbert_base_1](https://huggingface.co/mirth/chonky_modernbert_base_1) | 0.5305 | 0.5305 | 0.5305 | 0.5305 |
| **[mamei16/chonky_distilbert-base-multilingual-cased](https://huggingface.co/mamei16/chonky_distilbert-base-multilingual-cased)** | **0.5294** | **0.5294** | **0.5294** | **0.5294** |
| [mirth/chonky_distilbert_base_uncased_1](https://huggingface.co/mirth/chonky_distilbert_base_uncased_1) | 0.5116 | 0.5116 | 0.5116 | 0.5116 |
| RecursiveChunker | 0.4596 | 0.5214 | 0.5431 | 0.5080 |
| SentenceChunker | 0.4612 | 0.5026 | 0.5263 | 0.4967 |
| TokenChunker | 0.3155 | 0.4338 | 0.4801 | 0.4098 |
| SemanticChunker_potion-32M | 0.4022 | 0.4021 | 0.4019 | 0.4021 |
| SemanticChunker_potion-multi-128M | 0.4004 | 0.3999 | 0.3991 | 0.4001 |
| SemanticChunker_potion-8M | 0.3987 | 0.3966 | 0.3966 | 0.3973 |

##### Model Implementation Details

| Benchmark Name | Implementation |
| :--- | :--- |
| RecursiveChunker | `RecursiveChunker(chunk_size=chunk_size)` |
| SentenceChunker | `SentenceChunker(chunk_size=chunk_size)` |
| TokenChunker | `TokenChunker(chunk_size=chunk_size)` |
| SemanticChunker_potion-32M | `SemanticChunker(chunk_size=chunk_size, embedding_model="minishlab/potion-base-32M")` |
| SemanticChunker_potion-multi-128M | `SemanticChunker(chunk_size=chunk_size, embedding_model="minishlab/potion-multilingual-128M")` |
| SemanticChunker_potion-8M | `SemanticChunker(chunk_size=chunk_size, embedding_model="minishlab/potion-base-8M")` |

### Testing Data

The testing data can be found in the "test" split of each language in the dataset.

### Testing Metrics

Due to the extreme class imbalance, the F1-score was chosen as the main evaluation metric.

### Results

| Language | F1 Score |
|-------------------|----------|
| Chechen | 0.994 |
| Cebuano | 0.993 |
| Newari | 0.989 |
| Volapük | 0.989 |
| Minangkabau | 0.984 |
| Bishnupriya | 0.982 |
| Malagasy | 0.971 |
| Haitian Creole | 0.966 |
| Tatar | 0.96 |
| Waray | 0.956 |
| Piedmontese | 0.936 |
| South Azerbaijani | 0.934 |
| Ido | 0.916 |
| Telugu | 0.912 |
| Kazakh | 0.907 |
| Welsh | 0.897 |
| Serbo-Croatian | 0.893 |
| Aragonese | 0.886 |
| Basque | 0.879 |
| Lombard | 0.879 |
| Tajik | 0.876 |
| Urdu | 0.876 |
| Kyrgyz | 0.872 |
| Chuvash | 0.868 |
| Marathi | 0.865 |
| Dutch | 0.854 |
| Sundanese | 0.851 |
| Ukrainian | 0.848 |
| Serbian | 0.847 |
| Polish | 0.841 |
| Luxembourgish | 0.84 |
| Slovak | 0.84 |
| Hungarian | 0.834 |
| Armenian | 0.833 |
| Malay | 0.832 |
| Latin | 0.83 |
| French | 0.829 |
| Swedish | 0.829 |
| Bosnian | 0.828 |
| Bavarian | 0.826 |
| German | 0.826 |
| Belarusian | 0.825 |
| Korean | 0.821 |
| Slovenian | 0.821 |
| Persian | 0.82 |
| Italian | 0.82 |
| Uzbek | 0.82 |
| Japanese | 0.818 |
| Swahili | 0.818 |
| Macedonian | 0.817 |
| English | 0.815 |
| Georgian | 0.814 |
| Indonesian | 0.813 |
| Occitan | 0.813 |
| Romanian | 0.813 |
| Russian | 0.813 |
| Vietnamese | 0.812 |
| Norwegian | 0.811 |
| Portuguese | 0.811 |
| Afrikaans | 0.81 |
| Bulgarian | 0.809 |
| Catalan | 0.807 |
| Czech | 0.807 |
| Scots | 0.806 |
| Tamil | 0.805 |
| Western Frisian | 0.804 |
| Arabic | 0.804 |
| Turkish | 0.803 |
| Bashkir | 0.801 |
| Spanish | 0.8 |
| Lithuanian | 0.797 |
| Asturian | 0.796 |
| Breton | 0.796 |
| Norwegian Nynorsk | 0.795 |
| Galician | 0.794 |
| Bangla | 0.793 |
| Latvian | 0.793 |
| Estonian | 0.792 |
| Danish | 0.787 |
| Azerbaijani | 0.784 |
| Sicilian | 0.783 |
| Finnish | 0.781 |
| Javanese | 0.781 |
| Hindi | 0.777 |
| Greek | 0.772 |
| Gujarati | 0.772 |
| Low Saxon | 0.765 |
| Tagalog | 0.764 |
| Croatian | 0.763 |
| Irish | 0.751 |
| Hebrew | 0.75 |
| Icelandic | 0.743 |
| Malayalam | 0.742 |
| Kannada | 0.73 |
| Yoruba | 0.729 |
| Chinese | 0.727 |
| Thai | 0.725 |
| Albanian | 0.724 |
| Punjabi | 0.724 |
| Mongolian | 0.722 |
| Burmese | 0.711 |
| Classical Chinese | 0.706 |
| Western Punjabi | 0.658 |
| Nepali | 0.593 |

**Mean F1 Score:** 0.824

## Technical Specifications

#### Hardware

RTX 5090