
Hachimi MT 30 zh-vi

Fast Chinese to Vietnamese Marian-style machine translation model.

Intended Use

  • Chinese -> Vietnamese web novel / fiction translation.
  • Fast local or server inference where a small model is preferred.
  • This is an experimental release; output should still be reviewed for high-stakes or publication use.

Model Details

  • Architecture: Marian seq2seq
  • Parameters: ~32M (31.7M, stored as F32 Safetensors)
  • Tokenizer: SentencePiece source/target tokenizer
  • Suggested decoding: num_beams=1, max_length=512

Benchmark Snapshot

FLORES-200 zho_Hans -> vie_Latn devtest, decoded with num_beams=1, max_length=512.

| Evaluation set | Rows | BLEU | chrF | Chinese-character leak rows |
|---|---|---|---|---|
| FLORES-200 devtest | 1,012 | 14.29 | 37.30 | 0 |

This is a general-domain benchmark; it is useful for public comparability but does not fully reflect web-novel style or domain terminology.

Metric notes:

  • BLEU and chrF are automatic reference-based metrics. Higher is generally better, but human quality may differ, especially for fiction/web-novel style.
  • "Chinese-character leak rows" is the number of outputs that still contain Chinese characters after translation. Lower is better.
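
A leak-row count of this kind can be approximated with a few lines of Python. This is a minimal sketch, not the evaluation script used for the table: the helper name count_leak_rows and the exact Unicode ranges (CJK Unified Ideographs plus Extension A) are assumptions, since the card does not say how Chinese characters were detected.

```python
import re

# Assumption: "Chinese characters" means CJK Unified Ideographs
# (U+4E00-U+9FFF) and Extension A (U+3400-U+4DBF).
HAN_PATTERN = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff]")


def count_leak_rows(outputs):
    """Return how many translated rows still contain a Chinese character."""
    return sum(1 for row in outputs if HAN_PATTERN.search(row))


# Hand-written sample rows, not model output.
sample_outputs = [
    "Anh ấy ngẩng đầu.",
    "Anh ấy nhìn 山门.",  # this row still contains Chinese characters
]
print(count_leak_rows(sample_outputs))
```

Counting rows rather than characters keeps the metric easy to read: one leaked character and ten leaked characters both mark the row as needing review.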

Quick Start

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "ngocdang83/HachimiMT-30-zh-vi"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

text = "他抬头看向远处的山门。"
inputs = tok(text, return_tensors="pt", truncation=True, max_length=512)
out = model.generate(**inputs, max_length=512, num_beams=1)
print(tok.decode(out[0], skip_special_tokens=True))
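
Because the snippet above truncates inputs at 512 tokens, longer passages such as full chapters may lose text unless they are split first. The following is a minimal sketch of one way to do that, assuming that splitting on Chinese sentence-final punctuation is acceptable for the text at hand. The helper name split_sentences is hypothetical, and the sample passage is made up for illustration.

```python
import re

# One sentence = a run of non-terminal characters, then optional
# sentence-final punctuation, then optional closing quotation marks.
_SENTENCE = re.compile(r"[^。！？]+[。！？]*[”」』]*")


def split_sentences(text):
    """Split Chinese text into sentences, line by line."""
    sentences = []
    for line in text.splitlines():
        for match in _SENTENCE.findall(line):
            sentence = match.strip()
            if sentence:
                sentences.append(sentence)
    return sentences


# Illustrative passage (hand-written, not from any dataset).
passage = "他抬头看向远处的山门。山门很高！\n“走吧。”他说。"
for sentence in split_sentences(passage):
    print(sentence)
```

Each returned sentence can then be passed to the tokenizer and model.generate call from the Quick Start, either one at a time or as a batch.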

Notes

  • This model prioritizes speed and small footprint.
  • A CTranslate2 INT8 runtime is available in ct2-int8_float32/ for faster CPU inference.
  • Known hard cases include rare proper nouns, idioms, and highly domain-specific OOD terminology.
  • For production-style usage, pair with reviewed glossary/guard layers where appropriate.
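
The glossary/guard pairing mentioned in the last note could look roughly like the sketch below. It is an illustration only, not part of this release: review_flags, the one-entry glossary, and the sample Vietnamese strings are hypothetical and hand-written rather than model output.

```python
import re

# Assumption: CJK Unified Ideographs plus Extension A count as Chinese.
HAN_PATTERN = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff]")


def review_flags(source, output, glossary):
    """Return a list of reasons why a translation should be reviewed."""
    flags = []
    if not output.strip():
        flags.append("empty output")
    if HAN_PATTERN.search(output):
        flags.append("untranslated Chinese characters")
    lowered = output.lower()
    for zh_term, vi_term in glossary.items():
        # Only enforce a term when it actually occurs in the source.
        if zh_term in source and vi_term.lower() not in lowered:
            flags.append(f"glossary term missing: {zh_term} -> {vi_term}")
    return flags


# Hypothetical reviewed glossary: Chinese term -> required Vietnamese rendering.
glossary = {"山门": "sơn môn"}
source = "他抬头看向远处的山门。"
print(review_flags(source, "Hắn ngẩng đầu nhìn về phía cổng núi ở xa.", glossary))
```

A check like this only flags rows for a human or a later correction step. It does not repair the translation itself.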