---
license: cc-by-4.0
language:
- kha
- en
tags:
- spelling-correction
- text-processing
- language-modeling
- low-resource-languages
- khasi
---
# Khasi Spell Checker v1
**Khasi Spell Checker v1** is a statistical spell-checking system for the Khasi language built using a large monolingual corpus (~700k sentences). The system detects and corrects spelling errors using **edit-distance candidate generation** and **contextual ranking using a probabilistic language model**.
The goal of this project is to provide **basic NLP infrastructure for Khasi**, a low-resource language, enabling improved writing assistance, text preprocessing, and downstream NLP applications.
🚀 Live Demo: https://huggingface.co/spaces/Bapynshngain/Khasi-Spell-Checker
---
# Overview
This spell checker follows a classical architecture historically used in early search engines and spelling correction systems.
1. Input sentence
2. Tokenization
3. Suspicious word detection
4. Candidate generation (edit distance)
5. Language model scoring
6. Best correction selection
7. Corrected sentence
The system combines:
- Edit-distance candidate generation
- Word frequency model
- Bigram language model
- Bidirectional context scoring
---
# Training Data
The model is derived from a **Khasi monolingual corpus containing ~700,000 sentences**.
From this corpus we extracted:
| Resource | Description |
|--------|-------------|
| Vocabulary | Unique Khasi tokens |
| Word frequencies | Frequency counts for each token |
| Bigram frequencies | Context probabilities between word pairs |
After cleaning, the vocabulary contains **~58,000 unique Khasi words**.
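A minimal sketch of how the three resources could be extracted is shown below. It assumes whitespace tokenization and lowercasing, neither of which is confirmed by this README, and the function name `build_counts` is hypothetical. The toy corpus is a set of short strings reusing words that appear elsewhere in this document, not data from the real corpus.

```python
from collections import Counter

def build_counts(sentences):
    """Count unigrams and adjacent-word bigrams over an iterable of sentences."""
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        # Assumption: lowercase + whitespace split stands in for the real tokenizer.
        tokens = sentence.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

# Toy strings for illustration only (not taken from the training corpus).
toy_corpus = ["me khlem leit", "ban sngewthuh", "ka shnong", "nga ban sngewthuh"]
unigrams, bigrams = build_counts(toy_corpus)
vocabulary = set(unigrams)
```

Here `vocabulary`, `unigrams`, and `bigrams` correspond to the three rows of the table above.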
---
# Detection of Misspelled Words
The system first determines whether a word is likely to be incorrect.
A word is trusted if it is:
1. Present in the vocabulary
2. Sufficiently frequent in the corpus
Formally:
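$$
\text{If } w \in V \text{ and } freq(w) > \tau
$$
then the word is accepted, where \(V\) is the vocabulary, \(freq(w)\) is the corpus frequency of \(w\), and \(\tau\) is a frequency threshold.
Otherwise the system attempts correction.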
Example:
| Word | Frequency | Action |
|-----|-----------|--------|
| nga | high | keep |
| ka | very high | keep |
| shnogn | low | correct |
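The detection rule can be sketched as a single lookup. The threshold value `tau=5`, the function name `is_trusted`, and the counts below are all hypothetical; the README does not state the actual threshold or frequencies.

```python
def is_trusted(word, unigrams, tau=5):
    """Return True if the word is in the vocabulary and its count exceeds tau.

    A missing word has count 0, so for tau >= 0 the single comparison
    covers both conditions (w in V and freq(w) > tau).
    """
    return unigrams.get(word, 0) > tau

# Made-up counts for illustration only.
example_counts = {"nga": 12000, "ka": 45000, "shnogn": 1}

for w in ["nga", "ka", "shnogn"]:
    action = "keep" if is_trusted(w, example_counts) else "correct"
    print(w, action)
```

Raising `tau` makes the checker more aggressive, since rare but valid words are then also sent to the correction step.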
---
# Candidate Generation
Candidate corrections are generated using **edit distance operations**.
Allowed operations:
| Operation | Description |
|-----------|-------------|
| Deletion | remove a character |
| Insertion | add a character |
| Replacement | replace a character |
| Transposition | swap adjacent characters |
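Examples:
- sngewhuh → sngewthuh (insertion of `t`)
- shnogn → shnong (transposition of `g` and `n`)

Candidates are generated using:
- `edits1(word)`: all strings one edit away from `word`
- `edits2(word)`: all strings two edits away from `word`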
Where:
$$
edits_2(w) = \bigcup_{e \in edits_1(w)} edits_1(e)
$$
After generation:
$$
Candidates = \left(edits_1(w) \cup edits_2(w)\right) \cap V
$$
Only candidates present in the vocabulary \(V\) are retained.
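The following sketch shows one common way to implement `edits1` and `edits2`, in the style of classical edit-distance spell checkers. The alphabet (ASCII lowercase plus `ï` and `ñ`) and the helper name `candidates` are assumptions, and the actual implementation may differ, for example by preferring one-edit candidates over two-edit ones.

```python
import string

# Assumption: ASCII lowercase plus the Khasi letters ï and ñ.
ALPHABET = string.ascii_lowercase + "ïñ"

def edits1(word, alphabet=ALPHABET):
    """All strings one deletion, transposition, replacement, or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in alphabet]
    inserts = [l + c + r for l, r in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def edits2(word, alphabet=ALPHABET):
    """All strings reachable by applying edits1 twice."""
    return {e2 for e1 in edits1(word, alphabet) for e2 in edits1(e1, alphabet)}

def candidates(word, vocabulary):
    """Keep only generated strings that are known vocabulary words."""
    return (edits1(word) | edits2(word)) & vocabulary

# Toy vocabulary for illustration only.
toy_vocab = {"kum", "khlem", "ka", "shnong", "sngewthuh"}
print(candidates("khlm", toy_vocab))
```

Because the number of two-edit strings grows quickly with the alphabet size, and short words have many vocabulary neighbors, intersecting with the vocabulary early keeps the candidate list manageable.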
---
# Probabilistic Ranking
Once candidate corrections are generated, the system ranks them probabilistically.
The classical spelling correction objective is:
$$
\hat{c} = \arg\max_c P(c | w)
$$
Using **Bayes' theorem**:
$$
P(c | w) = \frac{P(w|c)P(c)}{P(w)}
$$
Since \(P(w)\) does not depend on the candidate \(c\), it can be dropped from the maximization:
$$
\hat{c} = \arg\max_c P(w|c)P(c)
$$
In this implementation:
- \(P(c)\) is approximated using **word frequency**
- contextual probabilities are modeled using **bigram statistics**
- \(P(w|c)\) is not modeled explicitly (see Limitations)
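As a rough illustration of the objective above, the sketch below picks the candidate that maximizes \(P(w|c)P(c)\). The function name `best_correction`, the optional `error_model` callable, and the counts are hypothetical. With no error model supplied, \(P(w|c)\) is treated as a constant, so the ranking reduces to word frequency alone.

```python
def best_correction(word, candidates, unigrams, error_model=None):
    """Return the candidate c maximizing P(word | c) * P(c)."""
    total = sum(unigrams.values())

    def score(c):
        prior = unigrams.get(c, 0) / total          # P(c) from word frequency
        # Assumption: without an error model, P(w|c) is a constant 1.0.
        likelihood = error_model(word, c) if error_model else 1.0
        return likelihood * prior

    return max(candidates, key=score)

# Made-up counts for illustration only.
hypothetical_counts = {"kum": 9000, "khlem": 1200}
print(best_correction("khlm", {"kum", "khlem"}, hypothetical_counts))
```

With these made-up counts the more frequent word wins regardless of how many edits separate it from the input, which is why context scoring is added on top of the frequency term.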
---
# Language Model
A **bigram language model** is used to model contextual probability.
$$
P(w_i | w_{i-1}) =
\frac{count(w_{i-1}, w_i)}{count(w_{i-1})}
$$
Example:
- ban sngewthuh → common
- ban sngewleh → rare
Thus:
$$
P(sngewthuh | ban) > P(sngewleh | ban)
$$
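The maximum-likelihood estimate above can be sketched directly from the count tables. The function name `bigram_prob` and the counts are hypothetical, and returning 0.0 for an unseen left word is an assumption rather than documented behavior.

```python
def bigram_prob(prev, word, unigrams, bigrams):
    """Estimate P(word | prev) = count(prev, word) / count(prev)."""
    prev_count = unigrams.get(prev, 0)
    if prev_count == 0:
        # Assumption: an unseen left context yields probability 0.
        return 0.0
    return bigrams.get((prev, word), 0) / prev_count

# Made-up counts for illustration only.
toy_unigrams = {"ban": 100, "sngewthuh": 50, "sngewleh": 5}
toy_bigrams = {("ban", "sngewthuh"): 40, ("ban", "sngewleh"): 2}

p_common = bigram_prob("ban", "sngewthuh", toy_unigrams, toy_bigrams)
p_rare = bigram_prob("ban", "sngewleh", toy_unigrams, toy_bigrams)
print(p_common > p_rare)
```

An unsmoothed estimate like this assigns zero probability to any unseen word pair, so a practical scorer would likely need some form of smoothing before taking logarithms.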
---
# Bidirectional Context Scoring
To improve correction accuracy, both **left and right context** are used.
The final candidate score is:
$$
Score(c) =
\log P(c)
+
\log P(c | w_{i-1})
+
\log P(w_{i+1} | c)
$$
Where:
| Term | Meaning |
|------|--------|
| \(P(c)\) | candidate word frequency |
| \(P(c|w_{i-1})\) | left context probability |
| \(P(w_{i+1}|c)\) | right context probability |
This allows the system to score a candidate against both of its neighbors, as in the full phrase `me khlem leit`, instead of only evaluating the left context `me khlem`.
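The scoring formula can be sketched as below. Add-k smoothing is used here only to avoid taking the logarithm of zero; it is an assumption and is not described in this README. The function name `context_score` and all counts are hypothetical.

```python
import math

def context_score(candidate, left, right, unigrams, bigrams, k=1.0):
    """log P(c) + log P(c | left) + log P(right | c), with add-k smoothing."""
    vocab_size = len(unigrams)
    total = sum(unigrams.values())

    def log_unigram(w):
        return math.log((unigrams.get(w, 0) + k) / (total + k * vocab_size))

    def log_bigram(prev, w):
        numerator = bigrams.get((prev, w), 0) + k
        denominator = unigrams.get(prev, 0) + k * vocab_size
        return math.log(numerator / denominator)

    score = log_unigram(candidate)
    if left is not None:       # no left neighbor at the start of a sentence
        score += log_bigram(left, candidate)
    if right is not None:      # no right neighbor at the end of a sentence
        score += log_bigram(candidate, right)
    return score

# Made-up counts for illustration only.
toy_unigrams = {"me": 500, "kum": 9000, "khlem": 1200, "leit": 800}
toy_bigrams = {("me", "khlem"): 60, ("khlem", "leit"): 90,
               ("me", "kum"): 1, ("kum", "leit"): 1}

best = max(["kum", "khlem"],
           key=lambda c: context_score(c, "me", "leit", toy_unigrams, toy_bigrams))
print(best)
```

In this toy setup the two context terms outweigh the higher unigram frequency of `kum`, so `khlem` is selected for the slot between `me` and `leit`.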
---
# Implementation Details
Language: **Python**
Framework: **Gradio** (via Hugging Face Spaces)
---
# Limitations
Current limitations include:
- No explicit typo probability model \(P(w|c)\)
- Candidate explosion for short words
- No phonetic error modeling
- No neural context understanding
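Example challenging case: for the input `khlm`, the candidates `kum` and `khlem` compete. Although `khlem` is a single insertion away and `kum` requires two edits, the frequency term favors `kum` because:
frequency(kum) >> frequency(khlem)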
---
# Future Improvements
## Character Error Model
Learn probabilities for common typing errors.
## Trigram Language Model
Replace bigram model with:
$$
P(w_i | w_{i-1}, w_{i-2})
$$
using tools such as **KenLM**.
---
## Neural Spell Correction
Future versions may incorporate neural models such as:
- BERT
- T5
- sequence-to-sequence transformers
for improved contextual understanding.
---
# Intended Use
This spell checker is designed for:
- Khasi writing assistance
- educational tools
- preprocessing Khasi text
- improving downstream NLP pipelines
---
# Citation
If you use this work, please cite:
```bibtex
@software{nongkynrih2026khasi_spellchecker_v1,
  author    = {Nongkynrih, Bapynshngainlang},
  title     = {Khasi Spell Checker v1},
  version   = {1.0},
  year      = {2026},
  month     = mar,
  day       = 13,
  publisher = {Hugging Face},
  doi       = {10.57967/hf/7999},
  url       = {https://huggingface.co/Bapynshngain/Khasi-SpellChecker-v1}
}
```
## APA Citation
Nongkynrih, B. (2026, March 13). Khasi Spell Checker v1 (Version 1.0) [Software]. Hugging Face. https://doi.org/10.57967/hf/7999