---
license: cc-by-4.0
language:
- kha
- en
tags:
- spelling-correction
- text-processing
- language-modeling
- low-resource-languages
- khasi
---

# Khasi Spell Checker v1

**Khasi Spell Checker v1** is a statistical spell-checking system for the Khasi language, built from a large monolingual corpus (~700k sentences). It detects and corrects spelling errors using **edit-distance candidate generation** and **contextual ranking with a probabilistic language model**.

The goal of this project is to provide **basic NLP infrastructure for Khasi**, a low-resource language, in support of writing assistance, text preprocessing, and downstream NLP applications.

🚀 Live Demo: https://huggingface.co/spaces/Bapynshngain/Khasi-Spell-Checker

---

# Overview

This spell checker follows the classical architecture used in early search engines and spelling correction systems. A sentence passes through the following stages:

1. Input sentence
2. Tokenization
3. Suspicious word detection
4. Candidate generation (edit distance)
5. Language model scoring
6. Best correction selection
7. Corrected sentence

The system combines:

- Edit-distance candidate generation
- Word frequency model
- Bigram language model
- Bidirectional context scoring

---

# Training Data

The model is derived from a **Khasi monolingual corpus containing ~700,000 sentences**.

From this corpus we extracted:

| Resource | Description |
|----------|-------------|
| Vocabulary | Unique Khasi tokens |
| Word frequencies | Frequency counts for each token |
| Bigram frequencies | Counts of adjacent word pairs, used to estimate context probabilities |

After cleaning, the vocabulary contains **~58,000 unique Khasi words**.
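
The extraction of these three resources can be sketched as follows. This is a minimal illustration, not the project's actual preprocessing: the `tokenize` rule (lowercasing and keeping runs of letters), the function names, and the two toy sentences are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(sentence):
    # Assumed tokenization: lowercase, keep runs of Unicode letters
    # (optionally joined by apostrophes); the real cleaning rules may differ.
    return re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", sentence.lower())

def build_counts(sentences):
    """Return (word frequency counts, adjacent word-pair counts)."""
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent pairs only
    return unigrams, bigrams

# Toy input for illustration only; not sentences from the real corpus.
corpus = ["Nga sngewthuh ia ka", "nga leit sha shnong"]
unigrams, bigrams = build_counts(corpus)
vocabulary = set(unigrams)
```

Keeping the word counts and pair counts in two separate `Counter` objects mirrors the table above: the vocabulary is just the key set of the word counts.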

---

# Detection of Misspelled Words

The system first determines whether a word is likely to be incorrect.

A word is trusted if it is:

1. Present in the vocabulary
2. Sufficiently frequent in the corpus

Formally, a word \(w\) is accepted if

$$
w \in V \quad \text{and} \quad \mathrm{freq}(w) > \tau
$$

where \(V\) is the vocabulary and \(\tau\) is a frequency threshold. Otherwise the system attempts correction.

Example:

| Word | Frequency | Action |
|------|-----------|--------|
| nga | high | keep |
| ka | very high | keep |
| shnogn | low | correct |
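
The acceptance rule can be written as a one-line check. The sketch below assumes word counts are held in a plain dictionary; the counts and the threshold value `tau=2` are invented for illustration and are not the values used by the system.

```python
def is_trusted(word, word_counts, tau):
    """Accept a word when it is in the vocabulary and occurs more than tau times.

    A word missing from word_counts has count 0, so for tau >= 0 the
    single comparison covers both conditions.
    """
    return word_counts.get(word, 0) > tau

# Hypothetical counts, for illustration only.
toy_counts = {"nga": 5000, "ka": 20000, "shnong": 800, "shnogn": 1}

keep_nga = is_trusted("nga", toy_counts, tau=2)        # frequent word
keep_shnogn = is_trusted("shnogn", toy_counts, tau=2)  # rare form
keep_unseen = is_trusted("xyzq", toy_counts, tau=2)    # not in vocabulary
```

Requiring a minimum frequency, rather than vocabulary membership alone, stops typos that happen to appear a few times in the corpus from being treated as correct.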

---

# Candidate Generation

Candidate corrections are generated using **edit distance operations**.

Allowed operations:

| Operation | Description |
|-----------|-------------|
| Deletion | remove a character |
| Insertion | add a character |
| Replacement | replace a character |
| Transposition | swap adjacent characters |

Examples:

- sngewhuh → sngewthuh (insertion)
- shnogn → shnong (transposition)

Candidates are generated using `edits1(word)` and `edits2(word)`, where:

$$
\text{edits}_2(w) = \text{edits}_1(\text{edits}_1(w))
$$

After generation:

$$
\text{Candidates} = \text{edits}(w) \cap \text{Vocabulary}
$$

Only candidates present in the vocabulary are retained.
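
A common way to implement the four operations is the split-based approach sketched below. The function names follow the text, but the alphabet (ASCII lowercase plus `ï` and `ñ`) and the decision to take the union of `edits1` and `edits2` before filtering are assumptions; the actual system may combine the two sets differently.

```python
import string

# Assumed alphabet: ASCII lowercase plus the Khasi letters ï and ñ.
ALPHABET = string.ascii_lowercase + "ïñ"

def edits1(word, alphabet=ALPHABET):
    """All strings one edit away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in alphabet]
    inserts = [l + c + r for l, r in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def edits2(word, alphabet=ALPHABET):
    """All strings reachable by applying edits1 twice."""
    return {e2 for e1 in edits1(word, alphabet) for e2 in edits1(e1, alphabet)}

def candidates(word, vocabulary):
    """Keep only generated strings that are real vocabulary words."""
    return (edits1(word) | edits2(word)) & vocabulary

# Toy vocabulary for illustration only.
toy_vocab = {"shnong", "sngewthuh", "nga"}
found = candidates("shnogn", toy_vocab)
```

Because `edits2` grows roughly with the square of `edits1`, intersecting with the vocabulary right away keeps the candidate list small.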

---

# Probabilistic Ranking

Once candidate corrections are generated, the system ranks them probabilistically.

The classical spelling correction objective is to find the candidate \(c\) that is most probable given the observed word \(w\):

$$
\hat{c} = \arg\max_c P(c \mid w)
$$

Using **Bayes' theorem**:

$$
P(c \mid w) = \frac{P(w \mid c)\,P(c)}{P(w)}
$$

Since \(P(w)\) is the same for every candidate, it can be dropped:

$$
\hat{c} = \arg\max_c P(w \mid c)\,P(c)
$$

In this implementation:

- \(P(c)\) is approximated using **word frequency**
- \(P(w \mid c)\) is not modeled explicitly (see Limitations)
- contextual probabilities are modeled using **bigram statistics**
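
As a rough illustration of the frequency term alone, the sketch below picks the candidate with the largest unigram estimate of \(P(c)\), treating \(P(w \mid c)\) as equal for all candidates. The counts and helper names are hypothetical, and the real system also adds the bigram context terms to this score.

```python
# Hypothetical counts for illustration only; not taken from the real corpus.
word_counts = {"shnong": 800, "shong": 40, "nga": 5000}
total = sum(word_counts.values())

def prior(word):
    """Unigram estimate of P(c): relative frequency of the word."""
    return word_counts.get(word, 0) / total

def best_by_prior(candidate_words):
    """argmax over candidates of P(c), with P(w | c) assumed uniform."""
    return max(candidate_words, key=prior)

best = best_by_prior({"shnong", "shong"})
```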

---

# Language Model

A **bigram language model** is used to model contextual probability.

$$
P(w_i \mid w_{i-1}) = \frac{count(w_{i-1}, w_i)}{count(w_{i-1})}
$$

Example:

| Bigram | Corpus frequency |
|--------|------------------|
| ban sngewthuh | common |
| ban sngewleh | rare |

Thus:

$$
P(\text{sngewthuh} \mid \text{ban}) > P(\text{sngewleh} \mid \text{ban})
$$
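
The maximum-likelihood estimate above translates directly into a count ratio. In this sketch the function name and the counts are assumptions chosen to reproduce the "common" versus "rare" contrast; they are not real corpus statistics.

```python
def bigram_prob(prev, word, unigram_counts, bigram_counts):
    """MLE estimate P(word | prev) = count(prev, word) / count(prev)."""
    prev_count = unigram_counts.get(prev, 0)
    if prev_count == 0:
        return 0.0  # unseen context: no estimate available without smoothing
    return bigram_counts.get((prev, word), 0) / prev_count

# Hypothetical counts for illustration only.
unigram_counts = {"ban": 1000}
bigram_counts = {("ban", "sngewthuh"): 50, ("ban", "sngewleh"): 2}

p_common = bigram_prob("ban", "sngewthuh", unigram_counts, bigram_counts)
p_rare = bigram_prob("ban", "sngewleh", unigram_counts, bigram_counts)
```

An unsmoothed estimate like this returns zero for any unseen pair, so a practical scorer usually adds some form of smoothing before taking logarithms.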

---

# Bidirectional Context Scoring

To improve correction accuracy, both **left and right context** are used.

The final candidate score is:

$$
Score(c) = \log P(c) + \log P(c \mid w_{i-1}) + \log P(w_{i+1} \mid c)
$$

Where:

| Term | Meaning |
|------|---------|
| \(P(c)\) | candidate word frequency |
| \(P(c \mid w_{i-1})\) | left context probability |
| \(P(w_{i+1} \mid c)\) | right context probability |

This allows the system to evaluate full phrases such as `me khlem leit` instead of only the left-context pair `me khlem`.
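
The scoring formula can be sketched as below. The add-k smoothing is an assumption introduced here so that unseen words and pairs do not produce `log(0)`; the source does not say how the system handles zero counts. All counts are invented for the example.

```python
import math

def score(candidate, left, right, unigram_counts, bigram_counts, k=1.0):
    """log P(c) + log P(c | left) + log P(right | c), with add-k smoothing."""
    vocab_size = len(unigram_counts)
    total = sum(unigram_counts.values())

    def log_unigram(w):
        return math.log((unigram_counts.get(w, 0) + k) / (total + k * vocab_size))

    def log_bigram(prev, w):
        pair = bigram_counts.get((prev, w), 0)
        return math.log((pair + k) / (unigram_counts.get(prev, 0) + k * vocab_size))

    s = log_unigram(candidate)
    if left is not None:    # no left neighbour at sentence start
        s += log_bigram(left, candidate)
    if right is not None:   # no right neighbour at sentence end
        s += log_bigram(candidate, right)
    return s

# Hypothetical counts for illustration only.
unigram_counts = {"me": 500, "khlem": 300, "kum": 900, "leit": 400}
bigram_counts = {("me", "khlem"): 40, ("khlem", "leit"): 30, ("me", "kum"): 1}

# Correct the middle word of "me khlm leit".
options = ["khlem", "kum"]
best = max(options, key=lambda c: score(c, "me", "leit", unigram_counts, bigram_counts))
```

With these toy counts the two context terms outweigh the higher raw frequency of `kum`, which is the effect bidirectional scoring is meant to provide.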

---

# Implementation Details

Language: **Python**

Framework: **Gradio** (via Hugging Face Spaces)

---

# Limitations

Current limitations include:

- No explicit typo probability model \(P(w \mid c)\)
- Candidate explosion for short words
- No phonetic error modeling
- No neural context understanding

Example of a challenging case: for the input `khlm`, the system must choose between `kum` and `khlem`. Because frequency(kum) >> frequency(khlem), the word-frequency term can favor `kum`, even though `khlem` is only a single insertion away from the input.

---

# Future Improvements

## Character Error Model

Learn probabilities for common typing errors.

## Trigram Language Model

Replace the bigram model with a trigram model:

$$
P(w_i \mid w_{i-2}, w_{i-1})
$$

using tools such as **KenLM**.

## Neural Spell Correction

Future versions may incorporate neural models such as:

- BERT
- T5
- sequence-to-sequence transformers

for improved contextual understanding.

---

# Intended Use

This spell checker is designed for:

- Khasi writing assistance
- educational tools
- preprocessing Khasi text
- improving downstream NLP pipelines

---

# Citation

If you use this work, please cite:

```bibtex
@software{nongkynrih2026khasi_spellchecker_v1,
  author    = {Nongkynrih, Bapynshngainlang},
  title     = {Khasi Spell Checker v1},
  version   = {1.0},
  year      = {2026},
  month     = mar,
  day       = 13,
  publisher = {Hugging Face},
  doi       = {10.57967/hf/7999},
  url       = {https://huggingface.co/Bapynshngain/Khasi-SpellChecker-v1}
}
```

## APA Citation

Nongkynrih, B. (2026, March 13). Khasi Spell Checker v1 (Version 1.0) [Software]. Hugging Face. https://doi.org/10.57967/hf/7999