A newer version of this model is available: Bapynshngain/opus-mt-kha-en-v2

This repository is publicly accessible, but you have to accept the conditions to access its files and content.


Model Card for Bapynshngain/opus-mt-kha-en-v1

Model Details

Model Description

Bapynshngain/opus-mt-kha-en-v1 is a Transformer-based neural machine translation (NMT) model that translates Khasi to English. It is fine-tuned from a checkpoint in the OPUS-MT model family and adapted to Khasi.

This model is part of an effort to improve digital accessibility and NLP tooling for Khasi through data-centric and transfer learning approaches.

  • Developed by: Bapynshngainlang Nongkynrih
  • Funded by: Independent / self-initiated research
  • Shared by: Bapynshngainlang Nongkynrih
  • Model type: Transformer-based Neural Machine Translation (Seq2Seq)
  • Language(s): Khasi → English
  • License: CC BY-NC-SA 4.0
  • Finetuned from model: Helsinki-NLP/opus-mt-en-vi
  • Demo: N/A

Uses

Direct Use

  • Khasi → English translation for:
    • Research in low-resource NLP
    • Digital content accessibility
    • Linguistic analysis
    • Dataset creation / augmentation

Downstream Use

  • Integration into:
    • Translation pipelines
    • Cross-lingual information retrieval systems
    • Multilingual chatbots
    • Preprocessing for English-centric NLP models

Out-of-Scope Use

  • High-stakes domains (legal, medical, financial) without human validation
  • Highly domain-specific translation (technical/legal/formal jargon)
  • Long-form document translation without segmentation
  • English → Khasi translation (not supported)
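
Since long-form input is out of scope without segmentation, one option is to split a document into sentences and translate them one at a time. The sketch below is illustrative only: `split_sentences` and `translate_document` are hypothetical helper names, the punctuation-based splitter is a naive assumption that may not suit all Khasi text, and `translate_batch` stands for any function you supply that maps a list of Khasi sentences to a list of English sentences.

```python
import re
from typing import Callable, List

# Naive assumption: a sentence ends at ".", "!" or "?" followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> List[str]:
    """Split text into sentence-like pieces, dropping empty fragments."""
    return [piece.strip() for piece in _SENTENCE_END.split(text) if piece.strip()]


def translate_document(
    text: str,
    translate_batch: Callable[[List[str]], List[str]],
) -> str:
    """Translate a long text sentence by sentence and rejoin the results.

    `translate_batch` is supplied by the caller (hypothetical interface):
    it receives a list of source sentences and returns their translations
    in the same order.
    """
    sentences = split_sentences(text)
    if not sentences:
        return ""
    return " ".join(translate_batch(sentences))
```

Translating sentence by sentence keeps each input short, at the cost of losing context that spans sentence boundaries.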

Bias, Risks, and Limitations

  • Low-resource constraints: Although the training data was scaled to ~60,000 parallel pairs, coverage remains limited compared to high-resource languages
  • Domain bias: Training data may not fully represent all domains or dialectal variation in Khasi
  • Structural inconsistencies: Complex sentence structures may degrade translation quality
  • Hallucination risk: The model may occasionally generate plausible but unsupported content, particularly on noisy or unfamiliar inputs
  • Named entity handling: Proper nouns and rare entities may be mistranslated or dropped
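
Some of the risks listed above can be partially screened for with cheap heuristics before a human reviews the output. The following is a rough sketch, not a validated quality metric: `flag_suspicious` is a hypothetical helper, and the ratio thresholds are arbitrary assumptions that would need tuning on real Khasi-English data.

```python
import re
from typing import List


def flag_suspicious(
    source: str,
    translation: str,
    min_ratio: float = 0.3,  # assumed lower bound on output/input word ratio
    max_ratio: float = 3.0,  # assumed upper bound on output/input word ratio
) -> List[str]:
    """Return a list of warning labels for a (source, translation) pair."""
    warnings: List[str] = []

    # 1. Length ratio: very short or very long output can hint at
    #    dropped content or hallucinated additions.
    source_words = max(len(source.split()), 1)
    ratio = len(translation.split()) / source_words
    if ratio < min_ratio or ratio > max_ratio:
        warnings.append("length_ratio")

    # 2. Digits in the source should normally survive translation.
    source_numbers = set(re.findall(r"\d+", source))
    output_numbers = set(re.findall(r"\d+", translation))
    if not source_numbers <= output_numbers:
        warnings.append("missing_number")

    # 3. The same token three times in a row is a common degenerate pattern.
    tokens = translation.lower().split()
    if any(tokens[i] == tokens[i + 1] == tokens[i + 2] for i in range(len(tokens) - 2)):
        warnings.append("repetition")

    return warnings
```

An empty list does not mean the translation is correct; it only means none of these simple checks fired. The number check will also raise false positives when a digit is legitimately spelled out in English.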

How to Get Started with the Model

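# Requires the transformers, sentencepiece, and torch packages.
from transformers import MarianMTModel, MarianTokenizer

model_name = "Bapynshngain/opus-mt-kha-en-v1"

# Load the tokenizer and the fine-tuned Marian checkpoint from the Hub.
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tokenize a Khasi sentence into PyTorch tensors.
text = "Phi long kumno?"
inputs = tokenizer(text, return_tensors="pt", padding=True)

# Generate the English translation and decode it back to text.
translated = model.generate(**inputs)
output = tokenizer.decode(translated[0], skip_special_tokens=True)

print(output)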
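
For more than one sentence, the same calls can be wrapped in a batching loop. The sketch below is one possible arrangement rather than a recommended configuration: `batched` and `translate_batch` are hypothetical helpers, and the values for `batch_size`, `num_beams`, and `max_new_tokens` are assumptions, not settings published for this model. Because the repository asks users to accept its conditions, loading may also require being logged in to the Hugging Face Hub (for example via `huggingface-cli login`).

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def translate_batch(
    sentences: Sequence[str],
    model_name: str = "Bapynshngain/opus-mt-kha-en-v1",
    batch_size: int = 8,        # assumed value, tune for your hardware
    num_beams: int = 4,         # assumed value, beam search width
    max_new_tokens: int = 128,  # assumed cap on generated length
) -> List[str]:
    """Translate Khasi sentences to English in batches."""
    # Imported lazily so that `batched` can be used without these packages.
    import torch
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device).eval()

    results: List[str] = []
    for batch in batched(sentences, batch_size):
        # Pad to the longest sentence in the batch and truncate overly long inputs.
        inputs = tokenizer(
            batch, return_tensors="pt", padding=True, truncation=True
        ).to(device)
        with torch.no_grad():
            generated = model.generate(
                **inputs, num_beams=num_beams, max_new_tokens=max_new_tokens
            )
        results.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))
    return results
```

Calling `translate_batch(["Phi long kumno?"])` returns a one-element list of English strings. Inputs that exceed the model's maximum length are truncated, so long passages should be segmented first.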

Model size: 0.1B params (Safetensors, tensor type F32)