Hachimi MT 30 zh-vi

A fast Chinese-to-Vietnamese machine translation model with a Marian-style architecture.

Intended Use

Chinese -> Vietnamese web novel / fiction translation.
Fast local or server inference where a small model is preferred.
This is an experimental release; output should still be reviewed for high-stakes or publication use.

Model Details

Architecture: Marian seq2seq
Parameters: ~32M (31.7M, stored as F32 safetensors)
Tokenizer: SentencePiece source/target tokenizer
Suggested decoding: num_beams=1, max_length=512

Benchmark Snapshot

FLORES-200 zho_Hans -> vie_Latn devtest, decoded with num_beams=1, max_length=512.

Evaluation set	Rows	BLEU	chrF	Chinese-character leak rows
FLORES-200 devtest	1,012	14.29	37.30	0

This is a general-domain benchmark; it is useful for public comparability but does not fully reflect web-novel style or domain terminology.

Metric notes:

BLEU and chrF are automatic reference-based metrics. Higher is generally better, but human quality may differ, especially for fiction/web-novel style.
"Chinese-character leak rows" is the number of output rows that still contain Chinese characters after translation. Lower is better.
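
The leak-row count described above can be approximated with a short check. This is a minimal sketch, not the script used for the reported numbers: the Unicode ranges, the helper names `has_chinese` and `count_leak_rows`, and the sample strings are all assumptions made for illustration.

```python
import re
from typing import Iterable

# Assumed definition of "Chinese character": CJK Unified Ideographs,
# Extension A, and CJK Compatibility Ideographs. The model card does not
# state which ranges its own metric uses.
_HAN = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff]")


def has_chinese(text: str) -> bool:
    """Return True if the text contains at least one Han character."""
    return _HAN.search(text) is not None


def count_leak_rows(outputs: Iterable[str]) -> int:
    """Count translated rows that still contain Han characters."""
    return sum(1 for row in outputs if has_chinese(row))


# Hand-written sample rows (not model output): the second one leaks.
sample_outputs = [
    "Hắn ngẩng đầu nhìn về phía sơn môn ở đằng xa.",
    "Hắn ngẩng đầu nhìn về phía 山门 ở đằng xa.",
]
leak_rows = count_leak_rows(sample_outputs)
print(f"leak rows: {leak_rows} of {len(sample_outputs)}")
```

Vietnamese diacritics fall outside these ranges, so ordinary Vietnamese text is not flagged.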

Quick Start

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and model from the Hugging Face Hub.
model_id = "ngocdang83/HachimiMT-30-zh-vi"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Tokenize one Chinese sentence, truncating inputs longer than 512 tokens.
text = "他抬头看向远处的山门。"
inputs = tok(text, return_tensors="pt", truncation=True, max_length=512)

# Greedy decoding (num_beams=1), matching the suggested settings.
out = model.generate(**inputs, max_length=512, num_beams=1)
print(tok.decode(out[0], skip_special_tokens=True))
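
Because inputs are truncated at 512 tokens, a full web-novel chapter usually has to be split before translation. The sketch below shows one possible way to do that, splitting on Chinese sentence-final punctuation and packing sentences into chunks. The helper name `chunk_sentences`, the punctuation set, and the character budget are assumptions, and character count is only a rough proxy for token count.

```python
import re
from typing import List

# A sentence is any run of text ending in sentence-final punctuation plus
# optional closing quotes; the second alternative catches a trailing
# remainder with no terminator. The punctuation set is an assumption.
_SENTENCE = re.compile(r"[^。！？…]*[。！？…]+[”’」』]*|[^。！？…]+")


def chunk_sentences(text: str, max_chars: int = 200) -> List[str]:
    """Pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept as its own chunk, so
    tokenizer truncation may still apply to it.
    """
    chunks: List[str] = []
    current = ""
    for sentence in _SENTENCE.findall(text):
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current)
            current = ""
        current += sentence
    if current:
        chunks.append(current)
    return chunks


chapter = "他抬头看向远处的山门。“走吧！”他说。"
chunks = chunk_sentences(chapter, max_chars=12)
for piece in chunks:
    print(piece)
```

Each chunk can then be passed to `tok(...)` and `model.generate(...)` as in the Quick Start. The chunks concatenate back to the original text, so no content is dropped.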

Notes

This model prioritizes speed and small footprint.
A CTranslate2 INT8 runtime is available in ct2-int8_float32/ for faster CPU inference.
Known hard cases include rare proper nouns, idioms, and highly domain-specific, out-of-distribution (OOD) terminology.
For production-style usage, pair with reviewed glossary/guard layers where appropriate.
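
As an illustration of what a glossary guard might look like, the sketch below flags source terms whose approved Vietnamese rendering is missing from a translation. It is one possible design, not part of this release: the function `missing_glossary_terms`, the glossary entry, and both sample translations are hypothetical and hand-written.

```python
import unicodedata
from typing import Dict, List


def _norm(text: str) -> str:
    """Normalize to NFC and lowercase so composed and decomposed
    Vietnamese diacritics compare equal."""
    return unicodedata.normalize("NFC", text).lower()


def missing_glossary_terms(
    source: str, translation: str, glossary: Dict[str, str]
) -> List[str]:
    """Return Chinese terms that occur in the source but whose approved
    Vietnamese rendering does not appear in the translation."""
    target = _norm(translation)
    return [
        zh_term
        for zh_term, vi_term in glossary.items()
        if zh_term in source and _norm(vi_term) not in target
    ]


# Hypothetical reviewed glossary: Chinese term -> approved Vietnamese term.
glossary = {"山门": "sơn môn"}
source = "他抬头看向远处的山门。"

# Hand-written sample translations, not model output.
ok = missing_glossary_terms(
    source, "Hắn ngẩng đầu nhìn về phía sơn môn ở đằng xa.", glossary
)
bad = missing_glossary_terms(
    source, "Hắn ngẩng đầu nhìn về phía cổng núi ở đằng xa.", glossary
)
print(ok, bad)
```

Rows with a non-empty result could be routed to human review or retried. Substring matching is deliberately simple and will miss inflected or reordered renderings.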
