license: cc-by-4.0
language:
  - kha
  - en
tags:
  - spelling-correction
  - text-processing
  - language-modeling
  - low-resource-languages
  - khasi

Khasi Spell Checker v1

Khasi Spell Checker v1 is a statistical spell-checking system for the Khasi language, built from a large monolingual corpus (~700k sentences). It detects and corrects spelling errors by generating candidates with edit-distance operations and ranking them in context with a probabilistic language model.

The goal of this project is to provide basic NLP infrastructure for Khasi, a low-resource language, enabling improved writing assistance, text preprocessing, and downstream NLP applications.

🚀 Live Demo: https://huggingface.co/spaces/Bapynshngain/Khasi-Spell-Checker

Overview

This spell checker follows the classical pipeline used in early search engines and spelling-correction systems.

Input sentence
      ↓
Tokenization
      ↓
Suspicious word detection
      ↓
Candidate generation (edit distance)
      ↓
Language model scoring
      ↓
Best correction selection
      ↓
Corrected sentence

The system combines:

Edit-distance candidate generation
Word frequency model
Bigram language model
Bidirectional context scoring

Training Data

The model is derived from a Khasi monolingual corpus containing ~700,000 sentences.

From this corpus we extracted:

Resource	Description
Vocabulary	Unique Khasi tokens
Word frequencies	Frequency counts for each token
Bigram frequencies	Context probabilities between word pairs

After cleaning, the vocabulary contains ~58,000 unique Khasi words.
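
As a rough illustration of how these three resources could be extracted, the sketch below counts unigrams and bigrams from a list of sentences. The tokenizer, the function names, and the three toy sentences are assumptions made for this example; the project's actual cleaning and tokenization rules are not described here.

```python
import re
from collections import Counter

# Hypothetical tokenizer: lowercase, then keep runs of letters (including
# non-ASCII letters), optionally joined by apostrophes.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")

def tokenize(sentence):
    return TOKEN_RE.findall(sentence.lower())

def build_counts(sentences):
    """Return (word frequencies, bigram frequencies) as Counters."""
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        unigrams.update(tokens)
        # Adjacent word pairs within one sentence.
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

# Toy corpus for illustration only; the real corpus has ~700,000 sentences.
corpus = ["Nga sngewthuh", "nga leit sha shnong", "ka shnong"]
unigrams, bigrams = build_counts(corpus)
vocabulary = set(unigrams)
```

Keeping the counts in `Counter` objects means unseen words and pairs simply return a count of zero, which is convenient for the later lookup steps.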

Detection of Misspelled Words

The system first determines whether a word is likely to be incorrect.

A word is trusted if it is:

Present in the vocabulary
Sufficiently frequent in the corpus

Formally:

$\text{If } w \in V \text{ and } freq(w) > \tau$

then the word is accepted. Here $V$ is the vocabulary and $\tau$ is a minimum-frequency threshold.

Otherwise the system attempts correction.

Example:

Word	Frequency	Action
nga	high	keep
ka	very high	keep
shnogn	low	correct
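
The trust rule can be sketched as a small predicate. This is a minimal version under stated assumptions: the threshold value `tau=5`, the function names, and the frequency counts are all made up for illustration and are not taken from the project.

```python
def is_trusted(word, word_freq, tau=5):
    """True if the word is in the vocabulary and occurs more than tau times.

    word_freq maps each vocabulary word to its corpus count, so a word
    outside the vocabulary has count 0 and is never trusted.
    """
    return word_freq.get(word, 0) > tau

def suspicious_words(tokens, word_freq, tau=5):
    """Return (position, word) pairs for tokens that should be corrected."""
    return [(i, w) for i, w in enumerate(tokens) if not is_trusted(w, word_freq, tau)]

# Illustrative counts only (assumed, not measured on the real corpus).
word_freq = {"nga": 12000, "ka": 90000, "shnong": 3500, "shnogn": 1}
flagged = suspicious_words(["ka", "shnogn"], word_freq)
```

Here `flagged` holds only the low-frequency token `shnogn`, which is then passed to candidate generation.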

Candidate Generation

Candidate corrections are generated using edit distance operations.

Allowed operations:

Operation	Description
Deletion	remove a character
Insertion	add a character
Replacement	replace a character
Transposition	swap adjacent characters

Example:

sngewhuh → sngewthuh

shnogn → shnong

Candidates are generated using edits1(word) and edits2(word).

Where:

$edits_2(w) = edits_1(edits_1(w))$

After generation:

$Candidates = edits(w) \cap Vocabulary$

Only candidates present in the vocabulary are retained.
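
The generation step can be sketched in the style of the well-known edit-distance spelling corrector. The alphabet below (ASCII letters plus ï and ñ) and the choice to pool distance-1 and distance-2 candidates together are assumptions for this example; the project's actual character set and candidate policy may differ.

```python
# Assumed alphabet: ASCII letters plus ï and ñ. The real character set is not stated.
ALPHABET = "abcdefghijklmnopqrstuvwxyzïñ"

def edits1(word):
    """All strings one deletion, transposition, replacement, or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + ch + right[1:] for left, right in splits if right for ch in ALPHABET]
    inserts = [left + ch + right for left, right in splits for ch in ALPHABET]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    """Lazily yield every string two edits away: edits1 applied to edits1(word)."""
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

def candidates(word, vocabulary):
    """Keep only generated strings that are real vocabulary words."""
    near = {c for c in edits1(word) if c in vocabulary}
    far = {c for c in edits2(word) if c in vocabulary}
    return near | far

# Toy vocabulary for illustration.
vocabulary = {"shnong", "sngewthuh", "khlem", "kum"}
```

With this toy vocabulary, `candidates("shnogn", vocabulary)` yields `shnong` through a single transposition. Filtering `edits2` lazily against the vocabulary avoids holding every distance-2 string in memory at once.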

Probabilistic Ranking

Once candidate corrections are generated, the system ranks them probabilistically.

The classical spelling correction objective is:

$\hat{c} = \arg\max_c P(c | w)$

Using Bayes' theorem:

$P(c | w) = \frac{P(w|c)P(c)}{P(w)}$

Since $P(w)$ is constant across candidates:

$\hat{c} = \arg\max_c P(w|c)P(c)$

In this implementation:

$P(c)$ is approximated using word frequency
contextual probabilities are modeled using bigram statistics
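
A minimal sketch of the frequency part of this ranking follows, assuming the error term $P(w|c)$ is treated as the same for every candidate so that the argmax reduces to the most frequent candidate. The function names and counts are hypothetical.

```python
import math

def unigram_log_prob(word, word_freq, total):
    """log P(c), estimated as count(c) / total number of tokens."""
    return math.log(word_freq[word] / total)

def best_by_frequency(cands, word_freq):
    """Pick the candidate with the highest unigram probability.

    Assumes every candidate is a vocabulary word with a nonzero count.
    """
    total = sum(word_freq.values())
    return max(cands, key=lambda c: unigram_log_prob(c, word_freq, total))

# Illustrative counts only (assumed).
word_freq = {"kum": 9000, "khlem": 1200, "shnong": 3500}
choice = best_by_frequency({"kum", "khlem"}, word_freq)
```

With these made-up counts the more frequent word is chosen regardless of the surrounding words, which is why the contextual bigram terms are needed.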

Language Model

A bigram language model is used to model contextual probability.

$P(w_i | w_{i-1}) = \frac{count(w_{i-1}, w_i)}{count(w_{i-1})}$

Example:

ban sngewthuh → common

ban sngewleh → rare

Thus:

$P(\text{sngewthuh} \mid \text{ban}) > P(\text{sngewleh} \mid \text{ban})$
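
The maximum-likelihood bigram estimate above can be sketched as a single lookup. The counts are invented for illustration, and returning 0.0 for an unseen previous word is an assumption of this sketch rather than documented behavior.

```python
def bigram_prob(prev, word, unigrams, bigrams):
    """P(word | prev) = count(prev, word) / count(prev)."""
    prev_count = unigrams.get(prev, 0)
    if prev_count == 0:
        # Assumed fallback: no evidence for this context.
        return 0.0
    return bigrams.get((prev, word), 0) / prev_count

# Illustrative counts only (assumed).
unigrams = {"ban": 1000, "sngewthuh": 300, "sngewleh": 20}
bigrams = {("ban", "sngewthuh"): 40, ("ban", "sngewleh"): 2}

p_common = bigram_prob("ban", "sngewthuh", unigrams, bigrams)
p_rare = bigram_prob("ban", "sngewleh", unigrams, bigrams)
```

An unsmoothed estimate like this assigns zero probability to unseen pairs, so a practical scorer usually adds some form of smoothing before taking logarithms.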

Bidirectional Context Scoring

To improve correction accuracy, both left and right context are used.

The final candidate score is:

$Score(c) = \log P(c) + \log P(c | w_{i-1}) + \log P(w_{i+1} | c)$

Where:

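Term	Meaning
$P(c)$	candidate word frequency
$P(c | w_{i-1})$	probability of the candidate given the previous word (left context)
$P(w_{i+1} | c)$	probability of the next word given the candidate (right context)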

This allows the system to evaluate phrases such as: me khlem leit

instead of only evaluating: me khlem
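
The scoring formula can be sketched as follows. Add-k smoothing is used here only to keep the logarithms finite for unseen pairs; it is an assumption of this example, as are the function names and the toy counts.

```python
import math

def score(cand, prev, nxt, unigrams, bigrams, k=1.0):
    """log P(c) + log P(c | prev) + log P(next | c), with add-k smoothing."""
    total = sum(unigrams.values())
    vocab_size = len(unigrams)

    def p_uni(w):
        return (unigrams.get(w, 0) + k) / (total + k * vocab_size)

    def p_bi(a, b):
        return (bigrams.get((a, b), 0) + k) / (unigrams.get(a, 0) + k * vocab_size)

    s = math.log(p_uni(cand))
    if prev is not None:  # left context is missing at the sentence start
        s += math.log(p_bi(prev, cand))
    if nxt is not None:   # right context is missing at the sentence end
        s += math.log(p_bi(cand, nxt))
    return s

def best_candidate(cands, prev, nxt, unigrams, bigrams):
    return max(cands, key=lambda c: score(c, prev, nxt, unigrams, bigrams))

# Illustrative counts only (assumed).
unigrams = {"me": 500, "kum": 9000, "khlem": 1200, "leit": 800}
bigrams = {("me", "khlem"): 60, ("khlem", "leit"): 45, ("me", "kum"): 1}

with_context = best_candidate({"kum", "khlem"}, "me", "leit", unigrams, bigrams)
without_context = best_candidate({"kum", "khlem"}, None, None, unigrams, bigrams)
```

With these made-up counts, the two context terms outweigh the higher raw frequency of `kum`, so `khlem` is selected between `me` and `leit`, while the context-free score still prefers `kum`.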

Implementation Details

Language: Python

Framework: Gradio (via Hugging Face Spaces)

Limitations

Current limitations include:

No explicit typo probability model $P(w|c)$
Candidate explosion for short words
No phonetic error modeling
No neural context understanding

Example challenging case:

khlm → kum vs khlem

The ranking can favor kum even when khlem was intended, because:

frequency(kum) >> frequency(khlem)

Future Improvements

Character Error Model

Learn probabilities for common typing errors.

Trigram Language Model

Replace bigram model with:

$P(w_i | w_{i-1}, w_{i-2})$

using tools such as KenLM.

Neural Spell Correction

Future versions may incorporate neural models such as:

BERT
T5
sequence-to-sequence transformers

for improved contextual understanding.

Intended Use

This spell checker is designed for:

Khasi writing assistance
educational tools
preprocessing Khasi text
improving downstream NLP pipelines

Citation

If you use this work, please cite:

@software{nongkynrih2026khasi_spellchecker_v1,
  author       = {Nongkynrih, Bapynshngainlang},
  title        = {Khasi Spell Checker v1},
  version      = {1.0},
  year         = {2026},
  month        = mar,
  day          = 13,
  publisher    = {Hugging Face},
  doi          = {10.57967/hf/7999},
  url          = {https://huggingface.co/Bapynshngain/Khasi-SpellChecker-v1}
}

APA Citation

Nongkynrih, B. (2026, March 13). Khasi Spell Checker v1 (Version 1.0) [Software]. Hugging Face. https://doi.org/10.57967/hf/7999