	---
	license: cc-by-4.0
	language:
	- kha
	- en
	tags:
	- spelling-correction
	- text-processing
	- language-modeling
	- low-resource-languages
	- khasi
	---

	# Khasi Spell Checker v1

	Khasi Spell Checker v1 is a statistical spell-checking system for the Khasi language, built from a large monolingual corpus (~700k sentences). It detects spelling errors, generates correction candidates with edit-distance operations, and ranks those candidates in context with a probabilistic language model.

	The goal of this project is to provide basic NLP infrastructure for Khasi, a low-resource language, enabling improved writing assistance, text preprocessing, and downstream NLP applications.

	🚀 Live Demo: https://huggingface.co/spaces/Bapynshngain/Khasi-Spell-Checker

	---

	# Overview

	This spell checker follows a classical architecture historically used in early search engines and spelling correction systems.

	Input sentence
	↓
	Tokenization
	↓
	Suspicious word detection
	↓
	Candidate generation (edit distance)
	↓
	Language model scoring
	↓
	Best correction selection
	↓
	Corrected sentence

	The system combines:

	- Edit-distance candidate generation
	- Word frequency model
	- Bigram language model
	- Bidirectional context scoring
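The first stage of the pipeline, tokenization, is not specified further in this README. A minimal sketch, assuming a simple regex tokenizer (the pattern, the lowercasing step, and the `tokenize` name are illustrative assumptions, not the project's actual code):

```python
import re

# Assumption: a word is a run of letters, optionally joined by internal
# apostrophes or hyphens; digits form their own tokens; any other
# non-space character is treated as a punctuation token.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*|\d+|[^\w\s]")

def tokenize(text: str) -> list[str]:
    """Split text into lowercase word, number, and punctuation tokens."""
    return TOKEN_RE.findall(text.lower())

tokens = tokenize("Me khlem leit.")
```

Because `[^\W\d_]` matches any Unicode letter, characters such as `ï` and `ñ` stay inside word tokens.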

	---

	# Training Data

	The model is derived from a Khasi monolingual corpus containing ~700,000 sentences.

	From this corpus we extracted:

	| Resource | Description |
	|----------|-------------|
	| Vocabulary | Unique Khasi tokens |
	| Word frequencies | Frequency counts for each token |
	| Bigram frequencies | Context probabilities between word pairs |

	After cleaning, the vocabulary contains ~58,000 unique Khasi words.
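The three resources can be extracted in a single pass over the tokenized corpus. The sketch below assumes the corpus is already split into token lists; `build_counts` and the toy sentences are hypothetical (the sentences are assembled from words used in this README, not taken from the real corpus).

```python
from collections import Counter

def build_counts(sentences):
    """Return (unigram, bigram) counters from an iterable of token lists."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        # Pair each token with its right neighbour to count adjacent pairs.
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

# Toy token sequences for illustration only.
corpus = [
    ["nga", "khlem", "leit"],
    ["me", "khlem", "leit"],
    ["nga", "ban", "sngewthuh"],
]
unigrams, bigrams = build_counts(corpus)
vocabulary = set(unigrams)
```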

	---

	# Detection of Misspelled Words

	The system first determines whether a word is likely to be incorrect.

	A word is trusted if it is:

	1. Present in the vocabulary
	2. Sufficiently frequent in the corpus

	Formally:

	$$
	\text{If } w \in V \text{ and } freq(w) > \tau
	$$

	then the word is accepted. Here $V$ is the vocabulary and $\tau$ is a frequency threshold.

	Otherwise the system attempts correction.

	Example:

	| Word | Frequency | Action |
	|------|-----------|--------|
	| nga | high | keep |
	| ka | very high | keep |
	| shnogn | low | correct |
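The acceptance rule can be sketched as a small predicate. The function names, the counts, and the threshold value below are hypothetical placeholders; the README does not state the value of the threshold used in practice.

```python
def is_trusted(word, vocab, freq, tau):
    """Accept a word only if it is in the vocabulary and its count exceeds tau."""
    return word in vocab and freq.get(word, 0) > tau

def suspicious_positions(tokens, vocab, freq, tau):
    """Return the indices of tokens that should be sent to the corrector."""
    return [i for i, w in enumerate(tokens) if not is_trusted(w, vocab, freq, tau)]

# Hypothetical counts and threshold, for illustration only.
freq = {"nga": 5000, "ka": 90000, "shnogn": 1}
vocab = set(freq)
tau = 5

flagged = suspicious_positions(["nga", "ka", "shnogn"], vocab, freq, tau)
```

A rare misspelling that slipped into the corpus, such as `shnogn` here, is in the vocabulary but still falls below the threshold, which is why both conditions are checked.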

	---

	# Candidate Generation

	Candidate corrections are generated using edit distance operations.

	Allowed operations:

	| Operation | Description |
	|-----------|-------------|
	| Deletion | remove a character |
	| Insertion | add a character |
	| Replacement | replace a character |
	| Transposition | swap adjacent characters |

	Example:

	sngewhuh → sngewthuh

	shnogn → shnong

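	Candidates are generated using:

	- `edits1(word)`: strings reachable from `word` with one edit operation
	- `edits2(word)`: strings reachable from `word` with two successive edit operations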

	Where:

	$$
	edits_2(w) = edits_1(edits_1(w))
	$$

	After generation:

	$$
	Candidates(w) = \left( edits_1(w) \cup edits_2(w) \right) \cap V
	$$

	Only candidates present in the vocabulary $V$ are retained.
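A minimal sketch of this step, written in the style of the well-known edit-based spelling corrector. The alphabet (ASCII lowercase plus `ï` and `ñ`) and the toy vocabulary are assumptions; the project's actual character set and code may differ.

```python
import string

# Assumption: lowercase ASCII letters plus the Khasi letters ï and ñ.
ALPHABET = string.ascii_lowercase + "ïñ"

def edits1(word):
    """All strings one deletion, transposition, replacement, or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in ALPHABET]
    inserts = [l + c + r for l, r in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    """All strings obtained by applying edits1 twice."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}

def candidates(word, vocab):
    """Keep only the generated strings that are real vocabulary entries."""
    return (edits1(word) | edits2(word)) & vocab

# Toy vocabulary for illustration only.
vocab = {"khlem", "kum", "leit", "shnong", "sngewthuh"}
found = candidates("khlm", vocab)
```

Intersecting with the vocabulary is what keeps the step tractable: `edits2` produces a very large set of strings, but only a handful are real words.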

	---

	# Probabilistic Ranking

	Once candidate corrections are generated, the system ranks them probabilistically.

	The classical spelling correction objective is:

	$$
	\hat{c} = \arg\max_c P(c \mid w)
	$$

	Using Bayes' theorem:

	$$
	P(c \mid w) = \frac{P(w \mid c)P(c)}{P(w)}
	$$

	Since $P(w)$ does not depend on the candidate $c$, it can be dropped from the maximization:

	$$
	\hat{c} = \arg\max_c P(w \mid c)P(c)
	$$

	In this implementation:

	- $P(c)$ is approximated using word frequency
	- contextual probabilities are modeled using bigram statistics
	- the error model $P(w \mid c)$ is not modeled explicitly (see Limitations)
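As an illustration of the context-free part of this objective, the sketch below treats the error model as uniform over candidates, so the argmax reduces to picking the most frequent candidate. The function names and counts are hypothetical.

```python
def unigram_prob(word, freq, total):
    """Relative-frequency estimate of P(c)."""
    return freq.get(word, 0) / total

def best_by_frequency(cands, freq):
    """argmax_c P(c), assuming P(w | c) is the same for every candidate."""
    total = sum(freq.values())
    # sorted() makes tie-breaking deterministic.
    return max(sorted(cands), key=lambda c: unigram_prob(c, freq, total))

# Hypothetical counts, for illustration only.
freq = {"kum": 9000, "khlem": 1200, "leit": 800}
choice = best_by_frequency({"kum", "khlem"}, freq)
```

With frequency alone, the more common word always wins, which is why the context terms described in the following sections matter.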

	---

	# Language Model

	A bigram language model is used to model contextual probability.

	$$
	P(w_i \mid w_{i-1}) =
	\frac{count(w_{i-1}, w_i)}{count(w_{i-1})}
	$$

	Example:

	ban sngewthuh → common
	ban sngewleh → rare

	Thus:

	$$
	P(sngewthuh \mid ban) > P(sngewleh \mid ban)
	$$
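The maximum-likelihood estimate above translates directly into a lookup over the count tables. The counts below are hypothetical and chosen only to reproduce the common/rare contrast in the example.

```python
from collections import Counter

def bigram_prob(prev, word, unigrams, bigrams):
    """MLE estimate count(prev, word) / count(prev); 0.0 if prev is unseen."""
    denom = unigrams[prev]  # a Counter returns 0 for missing keys
    if denom == 0:
        return 0.0
    return bigrams[(prev, word)] / denom

# Hypothetical counts, for illustration only.
unigrams = Counter({"ban": 100})
bigrams = Counter({("ban", "sngewthuh"): 20, ("ban", "sngewleh"): 1})

p_common = bigram_prob("ban", "sngewthuh", unigrams, bigrams)
p_rare = bigram_prob("ban", "sngewleh", unigrams, bigrams)
```

An unsmoothed estimate returns zero for any unseen pair, so a practical scorer typically adds smoothing before taking logarithms.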

	---

	# Bidirectional Context Scoring

	To improve correction accuracy, both left and right context are used.

	The final candidate score is:

	$$
	Score(c) =
	\log P(c)
	+
	\log P(c \mid w_{i-1})
	+
	\log P(w_{i+1} \mid c)
	$$

	Where:

	| Term | Meaning |
	|------|---------|
	| $P(c)$ | candidate word frequency |
	| $P(c \mid w_{i-1})$ | left context probability |
	| $P(w_{i+1} \mid c)$ | right context probability |

	This allows the system to evaluate phrases such as:
	me khlem leit

	instead of only evaluating:
	me khlem
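The score can be sketched as follows. Add-k smoothing is an assumption made here so that unseen pairs do not produce `log(0)`; the README does not say how zero counts are handled. The function name and all counts are hypothetical.

```python
import math
from collections import Counter

def context_score(cand, left, right, unigrams, bigrams, k=1.0):
    """log P(c) + log P(c | left) + log P(right | c), with add-k smoothing."""
    vocab_size = len(unigrams)
    total = sum(unigrams.values())

    def p_uni(w):
        return (unigrams[w] + k) / (total + k * vocab_size)

    def p_bi(prev, w):
        return (bigrams[(prev, w)] + k) / (unigrams[prev] + k * vocab_size)

    score = math.log(p_uni(cand))
    if left is not None:   # no left neighbour at the start of a sentence
        score += math.log(p_bi(left, cand))
    if right is not None:  # no right neighbour at the end of a sentence
        score += math.log(p_bi(cand, right))
    return score

# Hypothetical counts, for illustration only.
unigrams = Counter({"me": 50, "khlem": 120, "kum": 900, "leit": 80})
bigrams = Counter({("me", "khlem"): 10, ("khlem", "leit"): 15, ("me", "kum"): 1})

# Correcting the middle word of "me khlm leit".
best = max(["khlem", "kum"],
           key=lambda c: context_score(c, "me", "leit", unigrams, bigrams))
```

With these toy counts the two context terms outweigh the higher unigram frequency of `kum`, so `khlem` is selected.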

	---

	# Implementation Details

	Language: Python

	Framework: Gradio (via Hugging Face Spaces)

	---

	# Limitations

	Current limitations include:

	- No explicit typo probability model $P(w \mid c)$
	- Candidate explosion for short words
	- No phonetic error modeling
	- No neural context understanding

	Example challenging case:

	khlm → kum vs khlem

	The ranking can favor kum, even though khlem is only one edit away from khlm, because:

	frequency(kum) >> frequency(khlem)

	---

	# Future Improvements

	## Character Error Model

	Learn probabilities for common typing errors.
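As a stand-in for a learned model, the sketch below uses a fixed penalty per edit, so that candidates closer to the typo receive a higher $P(w \mid c)$. The penalty value, the function names, and the counts are all hypothetical; a learned model would replace the constant with per-character probabilities estimated from pairs of typos and corrections.

```python
import math

def osa_distance(a, b):
    """Optimal string alignment distance (edits include adjacent transposition)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # replacement or match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def noisy_channel_score(typo, cand, freq, total, log_penalty=math.log(1e-3)):
    """log P(w | c) + log P(c), with a fixed log-penalty per edit."""
    return osa_distance(typo, cand) * log_penalty + math.log(freq[cand] / total)

# Hypothetical counts, for illustration only.
freq = {"kum": 9000, "khlem": 1200}
total = 1_000_000

best = max(sorted(freq), key=lambda c: noisy_channel_score("khlm", c, freq, total))
```

Under these assumptions the single-edit candidate `khlem` outranks the more frequent two-edit candidate `kum`.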

	## Trigram Language Model

	Replace the bigram model with a trigram model:

	$$
	P(w_i \mid w_{i-2}, w_{i-1})
	$$

	using tools such as KenLM.
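For illustration, an unsmoothed trigram estimate can be computed with plain counters, as sketched below. This is not how KenLM works internally (KenLM builds smoothed models from text files); the function names, the `<s>` padding symbol, and the toy sentences are assumptions.

```python
from collections import Counter

def build_trigram_counts(sentences):
    """Count trigrams and their two-word contexts, padding sentence starts."""
    context_counts, trigram_counts = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>", "<s>"] + tokens
        for a, b, c in zip(padded, padded[1:], padded[2:]):
            context_counts[(a, b)] += 1
            trigram_counts[(a, b, c)] += 1
    return context_counts, trigram_counts

def trigram_prob(prev2, prev1, word, context_counts, trigram_counts):
    """MLE estimate count(prev2, prev1, word) / count(prev2, prev1)."""
    denom = context_counts[(prev2, prev1)]
    return trigram_counts[(prev2, prev1, word)] / denom if denom else 0.0

# Toy token sequences for illustration only.
corpus = [
    ["me", "khlem", "leit"],
    ["me", "khlem", "leit"],
    ["me", "khlem", "sngewthuh"],
]
ctx, tri = build_trigram_counts(corpus)
p = trigram_prob("me", "khlem", "leit", ctx, tri)
```

Trigram counts are much sparser than bigram counts, which is one reason a dedicated toolkit with built-in smoothing is attractive for this step.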

	---

	## Neural Spell Correction

	Future versions may incorporate neural models such as:

	- BERT
	- T5
	- sequence-to-sequence transformers

	for improved contextual understanding.

	---

	# Intended Use

	This spell checker is designed for:

	- Khasi writing assistance
	- educational tools
	- preprocessing Khasi text
	- improving downstream NLP pipelines

	---

	# Citation

	If you use this work, please cite:

	```bibtex
	@software{nongkynrih2026khasi_spellchecker_v1,
	author = {Nongkynrih, Bapynshngainlang},
	title = {Khasi Spell Checker v1},
	version = {1.0},
	year = {2026},
	month = mar,
	day = 13,
	publisher = {Hugging Face},
	doi = {10.57967/hf/7999},
	url = {https://huggingface.co/Bapynshngain/Khasi-SpellChecker-v1}
	}
	```

	APA citation:

	Nongkynrih, B. (2026, March 13). Khasi Spell Checker v1 (Version 1.0) [Software]. Hugging Face. https://doi.org/10.57967/hf/7999