---
license: cc-by-4.0
language:
- kha
- en
tags:
- spelling-correction
- text-processing
- language-modeling
- low-resource-languages
- khasi
---

# Khasi Spell Checker v1

**Khasi Spell Checker v1** is a statistical spell-checking system for the Khasi language built using a large monolingual corpus (~700k sentences). The system detects and corrects spelling errors using **edit-distance candidate generation** and **contextual ranking using a probabilistic language model**.

The goal of this project is to provide **basic NLP infrastructure for Khasi**, a low-resource language, enabling improved writing assistance, text preprocessing, and downstream NLP applications.

🚀 Live Demo: https://huggingface.co/spaces/Bapynshngain/Khasi-Spell-Checker

---

# Overview

This spell checker follows a classical architecture historically used in early search engines and spelling correction systems.

    Input sentence
    ↓
    Tokenization
    ↓
    Suspicious word detection
    ↓
    Candidate generation (edit distance)
    ↓
    Language model scoring
    ↓
    Best correction selection
    ↓
    Corrected sentence

The system combines:

- Edit-distance candidate generation
- Word frequency model
- Bigram language model
- Bidirectional context scoring

---

# Training Data

The model is derived from a **Khasi monolingual corpus containing ~700,000 sentences**.

From this corpus we extracted:

| Resource | Description |
|----------|-------------|
| Vocabulary | Unique Khasi tokens |
| Word frequencies | Frequency counts for each token |
| Bigram frequencies | Context probabilities between word pairs |

After cleaning, the vocabulary contains **~58,000 unique Khasi words**.

---

# Detection of Misspelled Words

The system first determines whether a word is likely to be incorrect.

A word is trusted if it is:

1. Present in the vocabulary
2. Sufficiently frequent in the corpus

Formally:

$$
\text{If } w \in V \text{ and } freq(w) > \tau
$$

then the word is accepted. Otherwise the system attempts correction.
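
The acceptance rule above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the function name `is_trusted`, the threshold value `tau = 5`, and the toy frequency table are all assumptions made for the example.

```python
from collections import Counter


def is_trusted(word: str, freq: Counter, tau: int = 5) -> bool:
    """Return True when the word is in the vocabulary and its count exceeds tau.

    Both `tau` and its default value are hypothetical; the real threshold
    is not stated in this document.
    """
    return word in freq and freq[word] > tau


# Toy counts invented for illustration; real counts come from the corpus.
toy_freq = Counter({"nga": 120, "ka": 900, "shnong": 35})

for token in ["nga", "ka", "shnogn"]:
    action = "keep" if is_trusted(token, toy_freq) else "correct"
    print(token, action)
```

Checking membership explicitly keeps the rule correct even if the threshold were set below zero, and avoids relying on `Counter` returning 0 for unseen words.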
Example:

| Word | Frequency | Action |
|------|-----------|--------|
| nga | high | keep |
| ka | very high | keep |
| shnogn | low | correct |

---

# Candidate Generation

Candidate corrections are generated using **edit distance operations**.

Allowed operations:

| Operation | Description |
|-----------|-------------|
| Deletion | remove a character |
| Insertion | add a character |
| Replacement | replace a character |
| Transposition | swap adjacent characters |

Example:

    sngewhuh → sngewthuh
    shnogn → shnong

Candidates are generated using:

    edits1(word)
    edits2(word)

Where:

$$
\mathrm{edits}_2(w) = \mathrm{edits}_1(\mathrm{edits}_1(w))
$$

that is, \(\mathrm{edits}_1\) applied to every string in \(\mathrm{edits}_1(w)\).

After generation:

$$
\mathrm{Candidates} = \mathrm{edits}(w) \cap \mathrm{Vocabulary}
$$

where \(\mathrm{edits}(w)\) denotes the strings produced by \(\mathrm{edits}_1\) and \(\mathrm{edits}_2\). Only candidates present in the vocabulary are retained.

---

# Probabilistic Ranking

Once candidate corrections are generated, the system ranks them probabilistically.

The classical spelling correction objective is:

$$
\hat{c} = \arg\max_c P(c \mid w)
$$

Using **Bayes' theorem**:

$$
P(c \mid w) = \frac{P(w \mid c)\,P(c)}{P(w)}
$$

Since \(P(w)\) is the same for every candidate \(c\):

$$
\hat{c} = \arg\max_c P(w \mid c)\,P(c)
$$

In this implementation:

- \(P(c)\) is approximated using **word frequency**
- contextual probabilities are modeled using **bigram statistics**
- \(P(w \mid c)\) is not modeled explicitly (see Limitations)

---

# Language Model

A **bigram language model** is used to model contextual probability.

$$
P(w_i \mid w_{i-1}) = \frac{count(w_{i-1}, w_i)}{count(w_{i-1})}
$$

Example:

    ban sngewthuh → common
    ban sngewleh → rare

Thus:

$$
P(\text{sngewthuh} \mid \text{ban}) > P(\text{sngewleh} \mid \text{ban})
$$

---

# Bidirectional Context Scoring

To improve correction accuracy, both **left and right context** are used.
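
The steps described so far (edit-distance candidates, vocabulary filtering, and ranking with unigram and bigram counts from both neighbours) can be sketched end to end. Everything here is illustrative rather than the project's actual code: the function names, the assumed `ALPHABET`, taking the union of `edits1` and `edits2`, the add-one smoothing, and the toy counts are all assumptions. The sketch also assumes both a left and a right neighbour are available.

```python
import math
from collections import Counter

# Assumed character set for generating edits; adjust to the real orthography.
ALPHABET = "abcdefghijklmnopqrstuvwxyzïñ"


def edits1(word):
    """All strings one deletion, transposition, replacement or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in ALPHABET]
    inserts = [l + c + r for l, r in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def edits2(word):
    """All strings two edits away: edits1 applied to every edits1 result."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}


def candidates(word, vocab):
    """Union of one- and two-edit strings, restricted to the vocabulary."""
    return (edits1(word) | edits2(word)) & vocab


def log_bigram(prev, word, unigrams, bigrams):
    """Add-one smoothed log P(word | prev); the smoothing is an assumption."""
    return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + len(unigrams)))


def score(cand, left, right, unigrams, bigrams):
    """log P(c) + log P(c | left) + log P(right | c)."""
    total = sum(unigrams.values())
    return (math.log(unigrams[cand] / total)
            + log_bigram(left, cand, unigrams, bigrams)
            + log_bigram(cand, right, unigrams, bigrams))


def correct(word, left, right, unigrams, bigrams):
    """Return the best-scoring candidate, or the word itself if none exist."""
    cands = candidates(word, set(unigrams))
    if not cands:
        return word
    return max(cands, key=lambda c: score(c, left, right, unigrams, bigrams))


# Toy counts invented for illustration only.
toy_unigrams = Counter({"me": 50, "khlem": 40, "leit": 30, "kum": 200})
toy_bigrams = Counter({("me", "khlem"): 20, ("khlem", "leit"): 15})

print(correct("khlm", "me", "leit", toy_unigrams, toy_bigrams))
```

With these toy counts, ranking by frequency alone would pick `kum` because it is far more frequent, while the two-sided context score picks `khlem`, since `kum` never appears next to `me` or `leit` in the toy bigram table.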
The final candidate score is:

$$
Score(c) = \log P(c) + \log P(c \mid w_{i-1}) + \log P(w_{i+1} \mid c)
$$

Where:

| Term | Meaning |
|------|---------|
| \(P(c)\) | candidate word frequency |
| \(P(c \mid w_{i-1})\) | left context probability |
| \(P(w_{i+1} \mid c)\) | right context probability |

This allows the system to evaluate phrases such as:

    me khlem leit

instead of only evaluating:

    me khlem

---

# Implementation Details

Language: **Python**

Framework: **Gradio** (via Hugging Face Spaces)

---

# Limitations

Current limitations include:

- No explicit typo probability model \(P(w \mid c)\)
- Candidate explosion for short words
- No phonetic error modeling
- No neural context understanding

Example challenging case:

    khlm → kum vs khlem

Because:

    frequency(kum) >> frequency(khlem)

the more frequent word `kum` can outrank the intended `khlem`.

---

# Future Improvements

## Character Error Model

Learn probabilities for common typing errors.

## Trigram Language Model

Replace the bigram model with:

$$
P(w_i \mid w_{i-1}, w_{i-2})
$$

using tools such as **KenLM**.

---

## Neural Spell Correction

Future versions may incorporate neural models such as:

- BERT
- T5
- sequence-to-sequence transformers

for improved contextual understanding.

---

# Intended Use

This spell checker is designed for:

- Khasi writing assistance
- educational tools
- preprocessing Khasi text
- improving downstream NLP pipelines

---

# Citation

If you use this work, please cite:

```bibtex
@software{nongkynrih2026khasi_spellchecker_v1,
  author = {Nongkynrih, Bapynshngainlang},
  title = {Khasi Spell Checker v1},
  version = {1.0},
  year = {2026},
  month = mar,
  day = {13},
  publisher = {Hugging Face},
  doi = {10.57967/hf/7999},
  url = {https://huggingface.co/Bapynshngain/Khasi-SpellChecker-v1}
}
```

APA Citation:

Nongkynrih, B. (2026, March 13). Khasi Spell Checker v1 (Version 1.0) [Software]. Hugging Face. https://doi.org/10.57967/hf/7999