metadata
license: cc-by-4.0
language:
  - kha
  - en
tags:
  - spelling-correction
  - text-processing
  - language-modeling
  - low-resource-languages
  - khasi

Khasi Spell Checker v1

Khasi Spell Checker v1 is a statistical spell-checking system for the Khasi language, built from a large monolingual corpus (~700k sentences). It detects spelling errors and corrects them by generating candidates with edit distance and ranking those candidates in context with a probabilistic language model.

The goal of this project is to provide basic NLP infrastructure for Khasi, a low-resource language, enabling improved writing assistance, text preprocessing, and downstream NLP applications.

🚀 Live Demo: https://huggingface.co/spaces/Bapynshngain/Khasi-Spell-Checker


Overview

This spell checker follows a classical architecture historically used in early search engines and spelling correction systems.

           Input sentence
                 ↓
            Tokenization
                 ↓
     Suspicious word detection
                 ↓
Candidate generation (edit distance)
                 ↓
       Language model scoring
                 ↓
     Best correction selection
                 ↓
         Corrected sentence

The system combines:

  • Edit-distance candidate generation
  • Word frequency model
  • Bigram language model
  • Bidirectional context scoring

Training Data

The model is derived from a Khasi monolingual corpus containing ~700,000 sentences.

From this corpus we extracted:

| Resource | Description |
| --- | --- |
| Vocabulary | Unique Khasi tokens |
| Word frequencies | Frequency counts for each token |
| Bigram frequencies | Context probabilities between word pairs |

After cleaning, the vocabulary contains ~58,000 unique Khasi words.


Detection of Misspelled Words

The system first determines whether a word is likely to be incorrect.

A word is trusted if it is:

  1. Present in the vocabulary
  2. Sufficiently frequent in the corpus

Formally:

$$ \text{If } w \in V \text{ and } \mathrm{freq}(w) > \tau $$

then the word is accepted. Here $V$ is the vocabulary and $\tau$ is a frequency threshold.

Otherwise the system attempts correction.

Example:

| Word | Frequency | Action |
| --- | --- | --- |
| nga | high | keep |
| ka | very high | keep |
| shnogn | low | correct |
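
The detection rule can be sketched in a few lines of Python. This is only an illustration of the rule stated above: the function name `is_trusted`, the default threshold `tau=5`, and the toy counts are assumptions, not values taken from the actual system.

```python
from collections import Counter

def is_trusted(word: str, freq: Counter, tau: int = 5) -> bool:
    """Return True when the word is in the vocabulary and occurs more than tau times.

    A Counter returns 0 for unseen words, so for tau >= 0 the single
    comparison covers both conditions (w in V and freq(w) > tau).
    """
    return freq[word] > tau

# Toy counts invented for illustration; real counts come from the corpus.
toy_freq = Counter({"nga": 12000, "ka": 90000, "shnong": 800, "shnogn": 1})

flags = {w: is_trusted(w, toy_freq) for w in ["nga", "ka", "shnogn", "xyz"]}
```

With these toy counts, `nga` and `ka` are kept, while `shnogn` (too rare) and `xyz` (unseen) are passed on to correction.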

Candidate Generation

Candidate corrections are generated using edit distance operations.

Allowed operations:

| Operation | Description |
| --- | --- |
| Deletion | remove a character |
| Insertion | add a character |
| Replacement | replace a character |
| Transposition | swap adjacent characters |

Example:

sngewhuh → sngewthuh

shnogn → shnong

Candidates are generated using two functions, edits1(word) and edits2(word), which return all strings one and two edits away from the input word.

Where:

$$ \text{edits}_2(w) = \text{edits}_1(\text{edits}_1(w)) $$

After generation:

$$ \text{Candidates} = \text{edits}(w) \cap \text{Vocabulary} $$

Only candidates present in the vocabulary are retained.
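
A minimal sketch of this step, in the style of the well-known Norvig spelling corrector, is shown below. It is not the project's actual code. Two assumptions are made: the alphabet is plain lowercase ASCII (any extra letters used in Khasi orthography would need to be added), and edits(w) is taken to be the union of edits1(w) and edits2(w).

```python
import string

# Assumption: lowercase ASCII only; extend for additional Khasi letters.
ALPHABET = string.ascii_lowercase

def edits1(word, alphabet=ALPHABET):
    """All strings exactly one edit operation away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    transposes = {left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1}
    replaces = {left + ch + right[1:]
                for left, right in splits if right for ch in alphabet}
    inserts = {left + ch + right for left, right in splits for ch in alphabet}
    return deletes | transposes | replaces | inserts

def edits2(word, alphabet=ALPHABET):
    """All strings reachable by applying edits1 twice."""
    return {e2 for e1 in edits1(word, alphabet) for e2 in edits1(e1, alphabet)}

def candidates(word, vocabulary, alphabet=ALPHABET):
    """Keep only the generated edits that are real vocabulary words."""
    generated = edits1(word, alphabet) | edits2(word, alphabet)
    return generated & set(vocabulary)

# Toy vocabulary for illustration only.
toy_vocab = {"shnong", "sngewthuh", "ka", "nga"}
shnogn_candidates = candidates("shnogn", toy_vocab)      # transposition
sngewhuh_candidates = candidates("sngewhuh", toy_vocab)  # insertion
```

The number of strings produced by `edits2` grows quickly with word length and alphabet size, so intersecting with the vocabulary early keeps the ranking step small.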


Probabilistic Ranking

Once candidate corrections are generated, the system ranks them probabilistically.

The classical spelling correction objective is:

$$ \hat{c} = \arg\max_c P(c \mid w) $$

Using Bayes' theorem:

$$ P(c \mid w) = \frac{P(w \mid c)\,P(c)}{P(w)} $$

Since $P(w)$ is the same for every candidate, it can be dropped from the maximization:

$$ \hat{c} = \arg\max_c P(w \mid c)\,P(c) $$

In this implementation:

  • $P(c)$ is approximated using word frequency
  • contextual probabilities are modeled using bigram statistics
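
To make the frequency term concrete, the sketch below ranks candidates by $P(c)$ alone, which is what the objective reduces to if every candidate is given the same $P(w \mid c)$. The helper names and the toy counts are assumptions for illustration; context scoring is deliberately left out here.

```python
from collections import Counter

def unigram_prob(word, freq, total):
    """Relative frequency of `word`, used as an estimate of P(c)."""
    return freq[word] / total

def best_by_frequency(cands, freq):
    """Pick the candidate with the highest unigram probability."""
    total = sum(freq.values())
    return max(cands, key=lambda c: unigram_prob(c, freq, total))

# Toy counts invented for illustration only.
toy_freq = Counter({"kum": 5000, "khlem": 900, "ka": 90000})

chosen = best_by_frequency({"kum", "khlem"}, toy_freq)
```

Under these toy counts the more frequent word wins regardless of the surrounding sentence, which is why a context term is added on top of the frequency term.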

Language Model

A bigram language model is used to model contextual probability.

$$ P(w_i \mid w_{i-1}) = \frac{\mathrm{count}(w_{i-1}, w_i)}{\mathrm{count}(w_{i-1})} $$

Example:

ban sngewthuh → common

ban sngewleh → rare

Thus:

$$ P(\text{sngewthuh} \mid \text{ban}) > P(\text{sngewleh} \mid \text{ban}) $$
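
The maximum-likelihood estimate above can be computed directly from counts. The following is a small sketch under stated assumptions: whitespace tokenization, lowercasing, and a four-line toy corpus built from the two example phrases. The function names are hypothetical and not part of the released system.

```python
from collections import Counter

def train_bigram_counts(sentences):
    """Count unigrams and adjacent word pairs in whitespace-tokenized text."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(prev, word, unigrams, bigrams):
    """P(word | prev) = count(prev, word) / count(prev); 0.0 if prev is unseen."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

# Toy corpus for illustration only.
toy_corpus = ["ban sngewthuh", "ban sngewthuh", "ban sngewthuh", "ban sngewleh"]
uni, bi = train_bigram_counts(toy_corpus)

p_thuh = bigram_prob("ban", "sngewthuh", uni, bi)
p_leh = bigram_prob("ban", "sngewleh", uni, bi)
```

Unsmoothed estimates assign zero probability to unseen pairs, so a practical scorer would typically add some form of smoothing before taking logarithms.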


Bidirectional Context Scoring

To improve correction accuracy, both left and right context are used.

The final candidate score is:

$$ \mathrm{Score}(c) = \log P(c) + \log P(c \mid w_{i-1}) + \log P(w_{i+1} \mid c) $$

Where:

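| Term | Meaning |
| --- | --- |
| $P(c)$ | candidate word frequency |
| $P(c \mid w_{i-1})$ | probability of the candidate given the previous word (left context) |
| $P(w_{i+1} \mid c)$ | probability of the next word given the candidate (right context) |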

This allows the system to evaluate phrases such as: me khlem leit

instead of only evaluating: me khlem
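
The scoring formula can be sketched as follows. Add-k smoothing is used here only to avoid taking the logarithm of zero; the smoothing scheme, the handling of missing neighbours at sentence boundaries, the function name `score`, and all counts are assumptions rather than details of the actual implementation.

```python
import math
from collections import Counter

def score(cand, prev, nxt, unigrams, bigrams, k=1.0):
    """log P(c) + log P(c | prev) + log P(next | c), with add-k smoothing."""
    vocab_size = len(unigrams)
    total = sum(unigrams.values())

    def p_uni(w):
        return (unigrams[w] + k) / (total + k * vocab_size)

    def p_bi(first, second):
        return (bigrams[(first, second)] + k) / (unigrams[first] + k * vocab_size)

    s = math.log(p_uni(cand))
    if prev is not None:   # skip the left term at the start of a sentence
        s += math.log(p_bi(prev, cand))
    if nxt is not None:    # skip the right term at the end of a sentence
        s += math.log(p_bi(cand, nxt))
    return s

# Toy counts invented for illustration only.
uni = Counter({"me": 50, "kum": 500, "khlem": 100, "leit": 80})
bi = Counter({("me", "khlem"): 20, ("khlem", "leit"): 15, ("me", "kum"): 1})

# Rank the two candidates for the middle word of "me khlm leit".
best = max(["kum", "khlem"], key=lambda c: score(c, "me", "leit", uni, bi))
```

In this toy setting `kum` is the more frequent word, but the two context terms outweigh the frequency term and `khlem` is selected.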


Implementation Details

Language: Python

Framework: Gradio (via Hugging Face Spaces)


Limitations

Current limitations include:

  • No explicit typo probability model $P(w \mid c)$
  • Candidate explosion for short words
  • No phonetic error modeling
  • No neural context understanding

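Example challenging case:

khlm → kum vs khlem

Both kum and khlem are valid candidates for khlm, but the frequency term can push the ranking toward kum even where khlem is intended, because:

frequency(kum) >> frequency(khlem)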



Future Improvements

Character Error Model

Learn probabilities for common typing errors.

Trigram Language Model

Replace bigram model with:

$$ P(w_i \mid w_{i-1}, w_{i-2}) $$

using tools such as KenLM.


Neural Spell Correction

Future versions may incorporate neural models such as:

  • BERT
  • T5
  • sequence-to-sequence transformers

for improved contextual understanding.


Intended Use

This spell checker is designed for:

  • Khasi writing assistance
  • educational tools
  • preprocessing Khasi text
  • improving downstream NLP pipelines

Citation

If you use this work, please cite:

@software{nongkynrih2026khasi_spellchecker_v1,
  author       = {Nongkynrih, Bapynshngainlang},
  title        = {Khasi Spell Checker v1},
  version      = {1.0},
  year         = {2026},
  month        = mar,
  day          = 13,
  publisher    = {Hugging Face},
  doi          = {10.57967/hf/7999},
  url          = {https://huggingface.co/Bapynshngain/Khasi-SpellChecker-v1}
}

APA Citation

Nongkynrih, B. (2026, March 13). Khasi Spell Checker v1 (Version 1.0) [Software]. Hugging Face. https://doi.org/10.57967/hf/7999