---
license: cc-by-4.0
language:
- kha
- en
tags:
- spelling-correction
- text-processing
- language-modeling
- low-resource-languages
- khasi
---

# Khasi Spell Checker v1

**Khasi Spell Checker v1** is a statistical spell-checking system for the Khasi language, built from a large monolingual corpus (~700k sentences). It detects and corrects spelling errors by combining **edit-distance candidate generation** with **contextual ranking under a probabilistic language model**.

The goal of this project is to provide **basic NLP infrastructure for Khasi**, a low-resource language, enabling improved writing assistance, text preprocessing, and downstream NLP applications.

🚀 Live Demo: https://huggingface.co/spaces/Bapynshngain/Khasi-Spell-Checker

---

# Overview

This spell checker follows a classical architecture historically used in early search engines and spelling correction systems.
  
1. Input sentence
2. Tokenization
3. Suspicious word detection
4. Candidate generation (edit distance)
5. Language model scoring
6. Best correction selection
7. Corrected sentence

The system combines:

- Edit-distance candidate generation
- Word frequency model
- Bigram language model
- Bidirectional context scoring

---

# Training Data

The model is derived from a **Khasi monolingual corpus containing ~700,000 sentences**.

From this corpus we extracted:

| Resource | Description |
|--------|-------------|
| Vocabulary | Unique Khasi tokens |
| Word frequencies | Frequency counts for each token |
| Bigram frequencies | Context probabilities between word pairs |

After cleaning, the vocabulary contains **~58,000 unique Khasi words**.
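
As a rough illustration of how these three resources can be derived, the sketch below counts tokens and adjacent word pairs with `collections.Counter`. The whitespace tokenizer, the lowercasing step, the function name `build_counts`, and the three-sentence toy corpus are assumptions made for illustration, not the project's actual preprocessing.

```python
from collections import Counter


def build_counts(sentences):
    """Return (word_freq, bigram_freq) Counters for an iterable of sentences.

    Tokenization here is a plain lowercase + whitespace split (an assumption).
    """
    word_freq = Counter()
    bigram_freq = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        word_freq.update(tokens)
        # Adjacent pairs (w_{i-1}, w_i) within the same sentence.
        bigram_freq.update(zip(tokens, tokens[1:]))
    return word_freq, bigram_freq


# Toy corpus for illustration only (not the real ~700k-sentence corpus).
toy_corpus = [
    "nga khlem leit",
    "me khlem leit",
    "nga leit",
]

word_freq, bigram_freq = build_counts(toy_corpus)
vocabulary = set(word_freq)
```

The vocabulary is simply the key set of the word-frequency table, so a single pass over the corpus yields all three resources.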

---

# Detection of Misspelled Words

The system first determines whether a word is likely to be incorrect.

A word is trusted if it is:

1. Present in the vocabulary
2. Sufficiently frequent in the corpus

Formally, a word \(w\) is accepted when

$$
w \in V \quad \text{and} \quad freq(w) > \tau
$$

where \(V\) is the vocabulary and \(\tau\) is a minimum-frequency threshold.

Otherwise, the system attempts correction.

Example:

| Word | Frequency | Action |
|-----|-----------|--------|
| nga | high | keep |
| ka  | very high | keep |
| shnogn | low | correct |
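
The detection rule can be sketched as a single predicate. The function name `is_trusted`, the threshold value `tau=5`, and the counts in `toy_freq` are hypothetical and chosen only to mirror the table above; the actual threshold used by the system is not stated here.

```python
def is_trusted(word, word_freq, tau=5):
    """Accept a word only if it is in the vocabulary and frequent enough.

    `word_freq` maps each vocabulary word to its corpus count, so the
    vocabulary V is the set of its keys. `tau` is an assumed threshold.
    """
    return word in word_freq and word_freq[word] > tau


# Made-up counts for illustration; "shnogn" is a rare noisy token.
toy_freq = {"nga": 5000, "ka": 20000, "shnong": 300, "shnogn": 1}

for token in ["nga", "ka", "shnogn", "xyz"]:
    action = "keep" if is_trusted(token, toy_freq) else "correct"
    print(f"{token}: {action}")
```

Requiring both conditions means that rare misspellings that slipped into the corpus are still sent to the correction stage.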

---

# Candidate Generation

Candidate corrections are generated using **edit distance operations**.

Allowed operations:

| Operation | Description |
|-----------|-------------|
| Deletion | remove a character |
| Insertion | add a character |
| Replacement | replace a character |
| Transposition | swap adjacent characters |

Example:

sngewhuh → sngewthuh 

shnogn → shnong

Candidates are generated using `edits1(word)` and `edits2(word)`, where:

$$
edits_2(w) = edits_1(edits_1(w))
$$

After generation:

$$
Candidates(w) = \left( edits_1(w) \cup edits_2(w) \right) \cap V
$$

Only candidates present in the vocabulary are retained.
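
A minimal sketch of this step, in the style of the well-known edit-based generator, is shown below. The alphabet (ASCII lowercase plus the Khasi letters `ï` and `ñ`) and the four-word toy vocabulary are assumptions; the project's actual character set and implementation may differ.

```python
import string

# Assumed alphabet: ASCII lowercase plus the Khasi letters ï and ñ.
ALPHABET = string.ascii_lowercase + "ïñ"


def edits1(word):
    """All strings exactly one edit operation away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [
        left + ch + right[1:] for left, right in splits if right for ch in ALPHABET
    ]
    inserts = [left + ch + right for left, right in splits for ch in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def edits2(word):
    """All strings reachable by applying edits1 twice."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}


def candidates(word, vocabulary):
    """Keep only generated strings that are real vocabulary words."""
    return (edits1(word) | edits2(word)) & vocabulary


# Toy vocabulary for illustration only.
toy_vocab = {"shnong", "sngewthuh", "khlem", "kum"}
print(candidates("shnogn", toy_vocab))
print(candidates("sngewhuh", toy_vocab))
```

Intersecting with the vocabulary keeps the candidate set small even though `edits2` generates a very large number of strings.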

---

# Probabilistic Ranking

Once candidate corrections are generated, the system ranks them probabilistically.

The classical spelling correction objective is:

$$
\hat{c} = \arg\max_c P(c | w)
$$

Using **Bayes' theorem**:

$$
P(c | w) = \frac{P(w|c)P(c)}{P(w)}
$$

Since \(P(w)\) is the same for every candidate \(c\), it can be dropped:

$$
\hat{c} = \arg\max_c P(w|c)P(c)
$$

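In this implementation:

- \(P(c)\) is approximated using **word frequency**
- \(P(w|c)\) is not modeled explicitly (see Limitations), so it does not distinguish between retained candidates
- contextual probabilities are modeled using **bigram statistics**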
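
When the channel term \(P(w|c)\) plays no role, the objective without context reduces to choosing the most frequent candidate. The sketch below shows that reduced case only; the helper name `best_by_frequency` and the counts are hypothetical, and the real system additionally uses bigram context.

```python
def best_by_frequency(cands, word_freq):
    """Pick argmax_c P(c), with P(c) proportional to the corpus count of c.

    The normalizing total is the same for all candidates, so raw counts
    are enough for the comparison.
    """
    return max(cands, key=lambda c: word_freq.get(c, 0))


# Made-up counts for illustration only.
toy_freq = {"kum": 900, "khlem": 400}

print(best_by_frequency(["khlem", "kum"], toy_freq))
```

With frequency as the only signal, a common word can outrank a rarer word that is closer to the typed form, which is why context scoring matters.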

---

# Language Model

A **bigram language model** is used to model contextual probability.

$$
P(w_i | w_{i-1}) =
\frac{count(w_{i-1}, w_i)}{count(w_{i-1})}
$$

Example:

- `ban sngewthuh` → common
- `ban sngewleh` → rare

Thus:

$$
P(sngewthuh | ban) > P(sngewleh | ban)
$$
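
The maximum-likelihood estimate above can be computed directly from the count tables. In this sketch the function name `bigram_prob` and the counts are hypothetical; returning `0.0` for an unseen left word is a choice made here for illustration.

```python
def bigram_prob(prev, word, word_freq, bigram_freq):
    """Maximum-likelihood estimate of P(word | prev) from raw counts."""
    prev_count = word_freq.get(prev, 0)
    if prev_count == 0:
        # Unseen left context: no estimate is available, return 0.
        return 0.0
    return bigram_freq.get((prev, word), 0) / prev_count


# Made-up counts for illustration only.
toy_word_freq = {"ban": 10, "sngewthuh": 8, "sngewleh": 2}
toy_bigram_freq = {("ban", "sngewthuh"): 6, ("ban", "sngewleh"): 1}

p_common = bigram_prob("ban", "sngewthuh", toy_word_freq, toy_bigram_freq)
p_rare = bigram_prob("ban", "sngewleh", toy_word_freq, toy_bigram_freq)
print(p_common > p_rare)
```

Unsmoothed estimates assign zero probability to unseen pairs, so a practical scorer typically needs some form of smoothing before taking logarithms.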

---

# Bidirectional Context Scoring

To improve correction accuracy, both **left and right context** are used.

The final candidate score is:

$$
Score(c) =
\log P(c)
+
\log P(c | w_{i-1})
+
\log P(w_{i+1} | c)
$$

Where:

| Term | Meaning |
|------|--------|
| \(P(c)\) | candidate word frequency |
| \(P(c|w_{i-1})\) | left context probability |
| \(P(w_{i+1}|c)\) | right context probability |

This allows the system to evaluate the full phrase `me khlem leit` instead of only the left-context pair `me khlem`.
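
The scoring formula can be sketched as follows. Add-k smoothing (with `k=1.0`) is an assumption introduced here so that unseen pairs do not produce `log(0)`; this document does not specify how the system handles zero counts. The function name `score` and all counts are hypothetical.

```python
import math


def score(cand, left, right, word_freq, bigram_freq, k=1.0):
    """Score(c) = log P(c) + log P(c | left) + log P(right | c).

    Probabilities use add-k smoothing (an assumption of this sketch).
    `left` or `right` may be None at sentence boundaries.
    """
    vocab_size = len(word_freq)
    total = sum(word_freq.values())

    def p_unigram(w):
        return (word_freq.get(w, 0) + k) / (total + k * vocab_size)

    def p_bigram(prev, w):
        pair_count = bigram_freq.get((prev, w), 0)
        return (pair_count + k) / (word_freq.get(prev, 0) + k * vocab_size)

    s = math.log(p_unigram(cand))
    if left is not None:
        s += math.log(p_bigram(left, cand))   # left context
    if right is not None:
        s += math.log(p_bigram(cand, right))  # right context
    return s


# Made-up counts for illustration only.
toy_word_freq = {"me": 50, "khlem": 40, "kum": 90, "leit": 60}
toy_bigram_freq = {("me", "khlem"): 12, ("khlem", "leit"): 15, ("me", "kum"): 1}

# Rank candidates for the middle word of "me ? leit".
best = max(
    ["khlem", "kum"],
    key=lambda c: score(c, "me", "leit", toy_word_freq, toy_bigram_freq),
)
print(best)
```

In this toy setting the two context terms outweigh the higher unigram frequency of `kum`, so `khlem` is selected.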

---

# Implementation Details

Language: **Python**

Framework: **Gradio** (via Hugging Face Spaces)

---

# Limitations

Current limitations include:

- No explicit typo probability model \(P(w|c)\)
- Candidate explosion for short words
- No phonetic error modeling
- No neural context understanding

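Example challenging case: for the input `khlm`, both `kum` and `khlem` are valid candidates. Because

$$
freq(kum) \gg freq(khlem)
$$

and there is no error model to favor the candidate closer to the typed form, the system may prefer `kum` even where `khlem` is intended.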

---


# Future Improvements

## Character Error Model

Learn probabilities for common typing errors.

## Trigram Language Model

Replace bigram model with:

$$
P(w_i | w_{i-1}, w_{i-2})
$$

using tools such as **KenLM**.

---

## Neural Spell Correction

Future versions may incorporate neural models such as:

- BERT
- T5
- sequence-to-sequence transformers

for improved contextual understanding.

---

# Intended Use

This spell checker is designed for:

- Khasi writing assistance
- educational tools
- preprocessing Khasi text
- improving downstream NLP pipelines

---

# Citation

If you use this work, please cite:

```bibtex
@software{nongkynrih2026khasi_spellchecker_v1,
  author       = {Nongkynrih, Bapynshngainlang},
  title        = {Khasi Spell Checker v1},
  version      = {1.0},
  year         = {2026},
  month        = mar,
  day          = 13,
  publisher    = {Hugging Face},
  doi          = {10.57967/hf/7999},
  url          = {https://huggingface.co/Bapynshngain/Khasi-SpellChecker-v1}
}
```

## APA Citation

Nongkynrih, B. (2026, March 13). Khasi Spell Checker v1 (Version 1.0) [Software]. Hugging Face. https://doi.org/10.57967/hf/7999