A newer version of this model is available: Bapynshngain/opus-mt-kha-en-v2

Model Card for Bapynshngain/opus-mt-kha-en-v1

Model Details

Model Description

Bapynshngain/opus-mt-kha-en-v1 is a Transformer-based neural machine translation (NMT) model for translating Khasi to English. It is fine-tuned from an OPUS-MT checkpoint (Helsinki-NLP/opus-mt-en-vi) and adapted for Khasi.

This model is part of an effort to improve digital accessibility and NLP tooling for Khasi through data-centric and transfer learning approaches.

Developed by: Bapynshngainlang Nongkynrih
Funded by: Independent / self-initiated research
Shared by: Bapynshngainlang Nongkynrih
Model type: Transformer-based Neural Machine Translation (Seq2Seq)
Language(s): Khasi → English
License: CC BY-NC-SA 4.0
Finetuned from model: Helsinki-NLP/opus-mt-en-vi
Demo: N/A

Uses

Direct Use

Khasi → English translation for:
- Research in low-resource NLP
- Digital content accessibility
- Linguistic analysis
- Dataset creation / augmentation

Downstream Use

Integration into:
- Translation pipelines
- Cross-lingual information retrieval systems
- Multilingual chatbots
- Preprocessing for English-centric NLP models

Out-of-Scope Use

- High-stakes domains (legal, medical, financial) without human validation
- Highly domain-specific translation (technical, legal, or formal jargon)
- Long-form document translation without segmentation
- English → Khasi translation (not supported)
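
Since long-form input without segmentation is out of scope, a document can be split into short chunks before translation. The following is a minimal sketch of one way to do that, not the author's preprocessing: the punctuation-based splitting rule, the `MAX_WORDS` limit, and the function names are all assumptions made for illustration, and the rule is not a Khasi-aware sentence segmenter.

```python
import re
from typing import List

# Hypothetical limit on whitespace-separated words per chunk (an assumption).
MAX_WORDS = 40


def split_sentences(text: str) -> List[str]:
    """Split after '.', '!' or '?' when followed by whitespace (naive rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_sentences(sentences: List[str], max_words: int = MAX_WORDS) -> List[str]:
    """Greedily pack consecutive sentences into chunks of about max_words words.

    A single sentence longer than max_words is kept whole as its own chunk.
    """
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk returned by `chunk_sentences(split_sentences(document))` can then be translated separately and the outputs joined in order.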

Bias, Risks, and Limitations

- Low-resource constraints: Despite scaling to ~60,000 parallel pairs, coverage remains limited compared to high-resource languages.
- Domain bias: Training data may not fully represent all domains or dialectal variation in Khasi.
- Structural inconsistencies: Complex sentence structures may degrade translation quality.
- Hallucination risk: The model may occasionally generate plausible but unsupported content, particularly on noisy or unfamiliar inputs.
- Named entity handling: Proper nouns and rare entities may be mistranslated or dropped.
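
Some of these failure modes can be surfaced with cheap checks before a human reviews the output. The sketch below is a set of illustrative heuristics, not part of the model or its evaluation: the function name `flag_translation`, the `max_length_ratio` threshold of 3.0, and the digit-matching rule are assumptions, and none of them detects mistranslated proper nouns.

```python
import re
from typing import List


def flag_translation(source: str, translation: str,
                     max_length_ratio: float = 3.0) -> List[str]:
    """Return warning strings; an empty list means no heuristic fired."""
    warnings = []
    src_words = source.split()
    out_words = translation.split()

    # Non-empty input that produced no output at all.
    if src_words and not out_words:
        warnings.append("empty output")

    # Output far longer than the input may contain unsupported content.
    if src_words and len(out_words) > max_length_ratio * len(src_words):
        warnings.append("output much longer than input")

    # Digit sequences in the source are expected to appear in the output.
    src_numbers = set(re.findall(r"\d+", source))
    out_numbers = set(re.findall(r"\d+", translation))
    missing = sorted(src_numbers - out_numbers)
    if missing:
        warnings.append("numbers missing from output: " + ", ".join(missing))

    return warnings
```

A flagged pair is only a candidate for review; an empty list does not mean the translation is correct.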

How to Get Started with the Model

from transformers import MarianMTModel, MarianTokenizer

model_name = "Bapynshngain/opus-mt-kha-en-v1"

# Load the tokenizer and the model weights from the Hub.
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Khasi source sentence, tokenized into PyTorch tensors.
text = "Phi long kumno?"
inputs = tokenizer(text, return_tensors="pt", padding=True)

# Generate the English translation and decode it back to a string.
translated = model.generate(**inputs)
output = tokenizer.decode(translated[0], skip_special_tokens=True)

print(output)
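
To translate several sentences at once, the same calls can be wrapped in a small batching helper. This is a sketch under stated assumptions rather than a recommended configuration: `batched` and `translate_batch` are hypothetical helper names, and the defaults `batch_size=8` and `max_new_tokens=128` are illustrative values, not settings from this model card. The helper expects the `tokenizer` and `model` objects loaded as in the snippet above.

```python
from typing import Iterator, List


def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of items with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def translate_batch(texts: List[str], tokenizer, model,
                    batch_size: int = 8, max_new_tokens: int = 128) -> List[str]:
    """Translate texts in batches and return one output string per input."""
    outputs: List[str] = []
    for batch in batched(texts, batch_size):
        # Pad to the longest sentence in the batch; truncate over-long inputs.
        inputs = tokenizer(batch, return_tensors="pt",
                           padding=True, truncation=True)
        generated = model.generate(**inputs, max_new_tokens=max_new_tokens)
        outputs.extend(
            tokenizer.batch_decode(generated, skip_special_tokens=True)
        )
    return outputs
```

For example, `translate_batch(["Phi long kumno?"], tokenizer, model)` returns a list containing one English string.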

Model size: 0.1B params (Safetensors)
Tensor type: F32
Base model: Helsinki-NLP/opus-mt-en-vi