--- library_name: transformers tags: - generated_from_trainer metrics: - accuracy model-index: - name: ModernCantoneseBert-base results: [] --- # ModernCantoneseBert-base ![ModernCantoneseBert](./ModernCantoneseBert.png) This model was train from sctratch with a 487M tokens Cantonese dataset. It achieves the following results on the evaluation set: - Loss: 2.0096 - Accuracy: 0.6119 ## Model description ModernCantoneseBert is a BERT model specifically designed for Cantonese language understanding. It leverages the ModernBERT architecture from HuggingFace Transformers and is trained using masked language modeling (MLM) on Cantonese text data. ## Intended Uses & Limitations ### Intended Uses This model is intended for research and academic purposes in Cantonese natural language processing tasks, including: - Masked language modeling (fill-mask) - Feature extraction for downstream NLP tasks - Fine-tuning for Cantonese text classification, named entity recognition, and other sequence labeling tasks ### Limitations - **Model Size**: The model has 134,004,544 trainable parameters, which is relatively small compared to larger language models. This may limit its performance on complex language understanding tasks. - **Research Purpose Only**: This project is intended for research and academic use. It may not be suitable for production environments without further evaluation and fine-tuning. - **Language Coverage**: The model is specifically trained on Cantonese text and may not perform well on other Chinese dialects or languages. ## Training and Evaluation Data The model was trained on open source datasets and web-scraped content, including: - Cantonese filtered Common Crawl - Web scraped news and articles Due to respect for copyright, the training dataset will not be released. ## Quick Start ### Installation ```bash pip install transformers torch ``` ### Usage Use the model for fill-mask tasks with the HuggingFace Transformers library: ```python from transformers import pipeline mask_filler = pipeline( "fill-mask", model="hon9kon9ize/ModernCantoneseBert-Base" ) mask_filler("雞蛋六隻,糖呢就兩茶匙,仲有[MASK]橙皮添。") ``` **Output:** ```python [{'score': 0.19885674118995667, 'token': 2494, 'token_str': '啲', 'sequence': '雞 蛋 六 隻 , 糖 呢 就 兩 茶 匙 , 仲 有 啲 橙 皮 添 。'}, {'score': 0.12493402510881424, 'token': 1617, 'token_str': '個', 'sequence': '雞 蛋 六 隻 , 糖 呢 就 兩 茶 匙 , 仲 有 個 橙 皮 添 。'}, {'score': 0.051472704857587814, 'token': 1804, 'token_str': '兩', 'sequence': '雞 蛋 六 隻 , 糖 呢 就 兩 茶 匙 , 仲 有 兩 橙 皮 添 。'}, {'score': 0.03404267504811287, 'token': 11419, 'token_str': '隻', 'sequence': '雞 蛋 六 隻 , 糖 呢 就 兩 茶 匙 , 仲 有 隻 橙 皮 添 。'}, {'score': 0.028425632044672966, 'token': 1572, 'token_str': '係', 'sequence': '雞 蛋 六 隻 , 糖 呢 就 兩 茶 匙 , 仲 有 係 橙 皮 添 。'}] ``` Another example: ```python mask_filler("香港特首係李家[MASK]。") ``` **Output:** ```python [{'score': 0.3403128683567047, 'token': 10162, 'token_str': '超', 'sequence': '香 港 特 首 係 李 家 超 。'}, {'score': 0.04880792275071144, 'token': 10360, 'token_str': '輝', 'sequence': '香 港 特 首 係 李 家 輝 。'}, {'score': 0.013930004090070724, 'token': 11425, 'token_str': '雄', 'sequence': '香 港 特 首 係 李 家 雄 。'}, {'score': 0.01386457122862339, 'token': 1407, 'token_str': '人', 'sequence': '香 港 特 首 係 李 家 人 。'}, {'score': 0.01234334148466587, 'token': 3774, 'token_str': '庭', 'sequence': '香 港 特 首 係 李 家 庭 。'}] ``` ## Training ### Data Preprocessing 1. Prepare your JSONL data files with a `text` field 2. Run the preprocessing script: ```bash python preprocess.py \ --model_path ./ModernCBert-Large/ \ --data_path ./data/ \ --output_path ./pretrain/data/ \ --max_seq_len 4096 ``` ### Training the Tokenizer ```bash python train_tokenizer.py \ --files "./data/*.txt" \ --out ./tokenizer/ \ --name bert-wordpiece ``` ### Training the Model ```bash python run_mlm.py \ --model_name_or_path \ --tokenizer_name \ --train_file \ --do_train \ --do_eval \ --output_dir ./output/ ``` ## Training procedure ### Training hyperparameters The following hyperparameters were used during training: - learning_rate: 0.0001 - train_batch_size: 16 - eval_batch_size: 8 - seed: 42 - gradient_accumulation_steps: 2 - total_train_batch_size: 32 - optimizer: Use OptimizerNames.ADAMW_TORCH with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments - lr_scheduler_type: cosine - lr_scheduler_warmup_ratio: 0.05 - num_epochs: 1.0 ### Training results | Training Loss | Epoch | Step | Validation Loss | Accuracy | |:-------------:|:------:|:----:|:---------------:|:--------:| | 3.2329 | 0.2042 | 2000 | 3.0884 | 0.4568 | | 2.5065 | 0.4084 | 4000 | 2.3732 | 0.5643 | | 2.2408 | 0.6125 | 6000 | 2.1655 | 0.5903 | | 2.1828 | 0.8167 | 8000 | 2.0549 | 0.6103 | ### Framework versions - Transformers 4.57.1 - Pytorch 2.7.1+cu128 - Datasets 3.6.0 - Tokenizers 0.22.1 ## License This project is licensed under the Apache License 2.0.