---
library_name: scikit-learn
pipeline_tag: video-classification
language:
- ko
tags:
- korean-sign-language
- sign-language-recognition
- mediapipe
- keypoint-classification
- real-time
- joblib
license: other
---

# KSL Word Recognition Model Repo

This folder is a Hugging Face-compatible model package for word-level Korean Sign Language recognition. It contains the model structure, inference contract, and the MediaPipe MVP runtime used by the backend.

## Current MVP Artifact

The repository loads a lightweight MediaPipe MVP artifact when `mediapipe_mvp.joblib` is present. This artifact is trained for a controlled one-person proof-of-concept, not for production-quality Korean Sign Language translation.

Current MVP vocabulary:

```text
수어, 좋다, 감사, 괜찮다, 싫다, 이해, 부탁, 모르다, 맞다, 힘
```

Measured locally on the development machine:

```text
Validation accuracy: 98.0% on a 50-sample held-out signer split
Training samples: 900 balanced samples, 10 labels, 5 camera angles
Mean inference latency: about 112.3 ms/frame on CPU
Approximate throughput: about 8.9 fps
```

Important limitation: this metric is from controlled AIHub word clips and a small 10-word vocabulary. It is suitable for a one-person demo, but it is not evidence of robust real-world meeting performance.

Training command used for the current artifact:

```powershell
python training\train_mediapipe_mvp.py --data-root "D:\수어 영상" --cache-dir "D:\ksl_cache\mediapipe_mvp_features" --max-per-label 90 --sequence-length 16 --angles F D U L R --classifier extra_trees --confidence-threshold 0.45
```

Benchmark command:

```powershell
python training\benchmark_mediapipe_mvp.py --data-root "D:\수어 영상" --cache-dir "D:\ksl_cache\mediapipe_mvp_features" --max-frames 120
```

The trained `mediapipe_mvp.joblib` file is intentionally not tracked in GitHub. It should be distributed through Hugging Face.

Hugging Face model repo used by the backend:

```text
TechieMoon/realtime-ksl-captioning-mediapipe-mvp
```

Backend environment:

```powershell
$env:MODEL_BACKEND = "huggingface"
$env:HF_MODEL_ID = "TechieMoon/realtime-ksl-captioning-mediapipe-mvp"
$env:MODEL_DEVICE = "cpu"
uvicorn app.main:app --host 0.0.0.0 --port 8000
```

This MVP artifact does not contain the AIHub video dataset. It only contains the trained lightweight classifier and inference code.

## Runtime Structure

```text
RGB webcam frame
  -> MediaPipe MVP recognizer when mediapipe_mvp.joblib exists
       - resizes large RGB frames to max width 640 before MediaPipe
       - extracts pose and hand landmarks
       - classifies a short keypoint sequence with a lightweight classifier
  -> otherwise YoloRoiExtractor
       - uses yolo/yolo.pt when present and enabled
       - falls back to the full frame when no detector checkpoint exists
  -> VideoMAEWordClassifier
       - keeps a short frame clip buffer
       - loads a transformers VideoMAE checkpoint from videomae/ when present
       - returns unknown until enough clip frames and a checkpoint are available
  -> KslWordRecognizer
       - confidence thresholding
       - short-window smoothing
       - word-level caption JSON
```

The current repository intentionally does not include trained checkpoints. Add them later with this layout:

```text
ai_model/
  inference.py
  modeling.py
  labels.json
  model_config.json
  preprocessor_config.json
  yolo/
    yolo.pt
  videomae/
    config.json
    model.safetensors
    preprocessor_config.json
```

## Contract

```python
def load_model(model_dir: str, device: str):
    ...

def predict(model, frames: list, timestamps_ms: list[int | None]) -> list[dict]:
    ...
```

Each `frame` must be an RGB `numpy.ndarray` with shape `(height, width, 3)` and dtype `uint8`.

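
The frame requirement can be checked before calling `predict`. The following is a minimal sketch, not code from this repository: `check_frame` and `check_batch` are hypothetical helpers, the check is duck-typed on `shape` and `dtype` so that it runs without importing NumPy, and the rule that `frames` and `timestamps_ms` have equal length is an assumption that the contract does not state.

```python
def check_frame(frame) -> None:
    """Raise ValueError if `frame` does not look like an RGB uint8 image array."""
    shape = getattr(frame, "shape", None)
    dtype = getattr(frame, "dtype", None)
    if shape is None or dtype is None:
        raise ValueError("frame must expose shape and dtype, like numpy.ndarray")
    # The contract asks for (height, width, 3).
    if len(shape) != 3 or shape[2] != 3:
        raise ValueError(f"expected shape (height, width, 3), got {tuple(shape)}")
    # str() of a NumPy uint8 dtype is "uint8".
    if str(dtype) != "uint8":
        raise ValueError(f"expected dtype uint8, got {dtype}")


def check_batch(frames, timestamps_ms) -> None:
    """Check every frame in a batch.

    Assumption: one timestamp (an int or None) is supplied per frame.
    """
    if len(frames) != len(timestamps_ms):
        raise ValueError("frames and timestamps_ms must have the same length")
    for frame in frames:
        check_frame(frame)
```

A caller could run `check_batch(frames, timestamps_ms)` immediately before `predict(model, frames, timestamps_ms)`. A stricter variant would also test `isinstance(frame, numpy.ndarray)`.
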
Output:

```json
{
  "text": "안녕하세요",
  "words": [
    {
      "text": "안녕하세요",
      "confidence": 0.91,
      "start_ms": 1200,
      "end_ms": 1700
    }
  ],
  "is_final": true
}
```

If no checkpoint exists, inference stays conservative and returns an empty caption:

```json
{
  "text": "",
  "words": [],
  "is_final": false
}
```

## Config

`model_config.json` controls the optional detector and classifier:

```json
{
  "roi_detector": {
    "enabled": false,
    "weights_path": "yolo/yolo.pt"
  },
  "classifier": {
    "checkpoint_path": "videomae",
    "clip_size": 8
  }
}
```

Set `roi_detector.enabled` to `true` after adding YOLO weights. The classifier automatically loads `videomae/` when that checkpoint directory exists.

## AIHub Training Fit

For the AIHub 수어 영상 (sign language video) data, the later training side should create word-level clips from video annotations and fine-tune a VideoMAE classifier with label IDs aligned to `labels.json`. The resulting Hugging Face checkpoint can be saved into `videomae/`; no backend changes are required.

YOLO can be trained or fine-tuned separately for signer/hand/upper-body ROI detection, then exported as `yolo/yolo.pt`.

## Local Smoke Test

From the repository root:

```powershell
cd server
$env:MODEL_BACKEND = "huggingface"
$env:HF_MODEL_ID = "..\ai_model"
$env:MODEL_DEVICE = "cpu"
pytest tests/test_ai_model_contract.py tests/test_huggingface_adapter.py
```

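
The pytest run above remains the authoritative check. For a quick manual look at a single caption, a shape validator along the following lines may help. This is a minimal sketch: `validate_caption` is a hypothetical helper that is not part of this repository, and the value constraints it enforces (confidence within [0, 1], timestamps either `None` or non-decreasing integers) are assumptions inferred from the Output example in the Contract section rather than documented rules.

```python
def validate_caption(caption: dict) -> None:
    """Raise ValueError if `caption` does not match the documented Output shape."""
    if not isinstance(caption.get("text"), str):
        raise ValueError("'text' must be a string")
    if not isinstance(caption.get("is_final"), bool):
        raise ValueError("'is_final' must be a boolean")
    words = caption.get("words")
    if not isinstance(words, list):
        raise ValueError("'words' must be a list")

    for word in words:
        if not isinstance(word, dict) or not isinstance(word.get("text"), str):
            raise ValueError("each word must be a dict with a string 'text'")
        confidence = word.get("confidence")
        # Assumption: confidence is a probability-like score in [0, 1].
        if not isinstance(confidence, (int, float)) or not 0.0 <= confidence <= 1.0:
            raise ValueError("'confidence' must be a number in [0, 1]")
        start_ms, end_ms = word.get("start_ms"), word.get("end_ms")
        # Assumption: timestamps may be None, mirroring timestamps_ms in the contract.
        for value in (start_ms, end_ms):
            if value is not None and not isinstance(value, int):
                raise ValueError("timestamps must be integers or None")
        if start_ms is not None and end_ms is not None and start_ms > end_ms:
            raise ValueError("'start_ms' must not exceed 'end_ms'")


# The empty caption documented for the no-checkpoint case passes the check.
validate_caption({"text": "", "words": [], "is_final": False})
```

The function returns nothing on success and raises on the first mismatch, so it can be dropped into an interactive session or wrapped in an `assert`-style test without extra plumbing.
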