---
tags:
- translation
- transformers
- safetensors
- chinese
- vietnamese
- marian
- text2text-generation
- zh-vi
- chinese-vietnamese
- marianmt
- machine-translation
language:
- zh
- vi
pipeline_tag: translation
library_name: transformers
license: apache-2.0
---

# MoxhiMT 60 zh-vi

Chinese → Vietnamese Marian-style machine translation model, tuned for xianxia / web-novel text.

## Intended Use

- Chinese → Vietnamese web novel / fiction translation
- Local or server inference
- Experimental release; review output for high-stakes / publication use

## Model Details

- **Architecture**: Marian seq2seq (8 encoder + 2 decoder layers)
- **Parameters**: ~57M (d_model 576, ffn 2304)
- **Tokenizer**: SentencePiece source/target, joint ZH+VI, vocab 24k
- **Suggested decoding**: `num_beams=4`, `max_length=512`

## Quick Start

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "DanVP/MoxhiMT-60"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

text = "他抬头看向远处的山门。"
inputs = tok(text, return_tensors="pt", truncation=True, max_length=512)
out = model.generate(**inputs, max_length=512, num_beams=4)
print(tok.decode(out[0], skip_special_tokens=True))
```

## CTranslate2 (INT8)

A CTranslate2 INT8 build is included under `ct2-int8/` for faster CPU inference.

```python
import ctranslate2
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("DanVP/MoxhiMT-60")
translator = ctranslate2.Translator("ct2-int8", device="cpu", compute_type="int8")

text = "他抬头看向远处的山门。"
src = tok.convert_ids_to_tokens(tok(text, truncation=True, max_length=512).input_ids)
results = translator.translate_batch([src], beam_size=4, max_decoding_length=512)
print(tok.decode(tok.convert_tokens_to_ids(results[0].hypotheses[0]), skip_special_tokens=True))
```

## Notes

- Prioritizes translation quality on xianxia / cultivation terminology.
- Trained from scratch with a custom SentencePiece-BPE 24k joint ZH+VI tokenizer.
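
Both snippets above truncate input at 512 tokens, so a full chapter is usually better translated sentence by sentence. The following is a minimal sketch of one way to do that, not the author's pipeline: `split_sentences`, `batched`, and `translate_chapter` are hypothetical helper names, the punctuation-based splitter is a simple heuristic, and the batch size of 8 is an arbitrary assumption.

```python
import re
from typing import Iterator, List

# Heuristic (assumption): a sentence is a run of non-terminator characters,
# followed by any sentence-ending punctuation and any closing quotes/brackets.
# Newlines also act as boundaries. Runs made only of punctuation are dropped.
_SENTENCE = re.compile(r"[^。！？!?\n]+[。！？!?]*[”’」』）]*")


def split_sentences(text: str) -> List[str]:
    """Split Chinese prose into sentences, skipping empty fragments."""
    pieces = (m.group().strip() for m in _SENTENCE.finditer(text))
    return [p for p in pieces if p]


def batched(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_chapter(text: str, tok, model, batch_size: int = 8) -> List[str]:
    """Translate `text` sentence by sentence using the suggested decoding."""
    translations: List[str] = []
    for batch in batched(split_sentences(text), batch_size):
        # padding=True aligns sentences of different lengths within a batch.
        inputs = tok(batch, return_tensors="pt", padding=True,
                     truncation=True, max_length=512)
        generated = model.generate(**inputs, max_length=512, num_beams=4)
        translations.extend(tok.batch_decode(generated, skip_special_tokens=True))
    return translations


# Usage, with `tok` and `model` loaded as in Quick Start:
# lines = translate_chapter(chapter_text, tok, model)
# print("\n".join(lines))
```

Translating each sentence in isolation discards cross-sentence context such as pronoun referents, so output for long passages may still need review.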