---
language:
- en
license: mit
tags:
- gesture-recognition
- hand-gesture
- pytorch
- mediapipe
- temporal-model
- lstm
- attention
- bidirectional
datasets:
- IPN-Hand
metrics:
- accuracy
- f1
model-index:
- name: two_stream_attn_v1_2layer_ld_cong_finetune_20260515T134706Z
  results:
  - task:
      type: gesture-recognition
    dataset:
      name: IPN Hand
      type: IPN-Hand
    metrics:
    - type: accuracy
      value: 0.9606
    - type: f1
      value: 0.9587
---

# two_stream_attn_v1_2layer_ld_cong_finetune_20260515T134706Z

A real-time hand gesture classifier trained on a Hybrid Jester+IPN gesture dataset (Jester dynamic classes + IPN pointing classes). This model is part of the **Maestro** pipeline, which enables touchless control of presentation and meeting software through hand gestures captured from a standard webcam, using MediaPipe for landmark extraction.

## Model Description

- **Architecture**: EnhancedTwoStreamLSTM (BiLSTM h=96×2, MHA 8 heads, proj=96, mean+max pool, MLP gate)
- **Parameters**: 1,208,554
- **Input**: `(batch, 16, 147)`, a 16-frame sliding window at 30 FPS ≈ 533 ms
- **Output**: Logits over 10 gesture classes (apply softmax to obtain probabilities)
- **Inference latency**: < 1 ms per call (CPU, single sample)
- **Feature schema**: `feature-schema-v1`

## Architecture

`EnhancedTwoStreamLSTM` splits the 147-dim feature vector into two parallel streams and processes them through a BiLSTM + self-attention + MLP-gate pipeline:

```
Input (B, T=32, 147)
│
├─ Stream A: Pose/Shape (73 dims)
│    Linear+LN+GELU → 96
│    2-layer BiLSTM (h=96) → (B, T, 192)
│    LayerNorm → Self-MHA (8 heads) + residual + post-LN
│    mean+max pool → pool_LN → ctx_a (B, 192)
│
├─ Stream B: Motion/Dynamics (74 dims)
│    (identical structure) → ctx_b (B, 192)
│
├─ MLP cross-stream gate
│    gate_a = Sigmoid(
│               Linear(96→192)(
│                 Tanh(Linear(192→96)(ctx_b))))
│    ctx_a  = LN(ctx_a × gate_a + ctx_a)
│    gate_b = Sigmoid(
│               Linear(96→192)(
│                 Tanh(Linear(192→96)(ctx_a))))
│    ctx_b  = LN(ctx_b × gate_b + ctx_b)
│
└─ cat(ctx_a, ctx_b) → (384,)
     LN → Linear(384→192) → GELU → Dropout → Linear(192→10)
```

**Design rationale:**

- BiLSTMs encode temporal order via their recurrent cell state, so no positional encoding is needed.
- Mean+Max pooling captures both sustained gesture shape (mean) and transient click events (max).
- The 2-layer MLP gate provides non-linear cross-modal recalibration at ~37 K params (vs ~263 K for full MHA cross-attention with a degenerate mean-pooled query).

## Gesture Classes

| Class | Description |
|-------|-------------|
| `fist` | Closed fist (all fingers curled, thumb tucked) |
| `swiping_right` | Horizontal swipe from left to right |
| `swiping_left` | Horizontal swipe from right to left |
| `swiping_down` | Vertical swipe downward |
| `swiping_up` | Vertical swipe upward |
| `zooming_in_full_hand` | Pinch-open / spread fingers away from each other |
| `zooming_out_full_hand` | Pinch-close / bring fingers together |
| `point_one` | Single-finger pointing gesture (continuous laser-pointer control) |
| `point_two` | Two-finger pointing gesture (continuous annotation-pen control) |
| `unknown` | Background / transition / no gesture |

## Gesture Usage In Presentation System

| Class | Mode | Command | Runtime handling |
|-------|------|---------|------------------|
| `fist` | `discrete` | `erase_annotations` | Discrete command via GestureActivationController → CommandDispatcher |
| `swiping_right` | `discrete` | `next_slide` | Discrete command via GestureActivationController → CommandDispatcher |
| `swiping_left` | `discrete` | `previous_slide` | Discrete command via GestureActivationController → CommandDispatcher |
| `swiping_down` | `discrete` | `stop_presentation` | Discrete command via GestureActivationController → CommandDispatcher |
| `swiping_up` | `discrete` | `start_presentation` | Discrete command via GestureActivationController → CommandDispatcher |
| `zooming_in_full_hand` | `discrete` | `zoom_in_view` | Discrete command via GestureActivationController → CommandDispatcher |
| `zooming_out_full_hand` | `discrete` | `zoom_out_view` | Discrete command via GestureActivationController → CommandDispatcher |
| `point_one` | `continuous` | N/A | Continuous tracker: LaserPointerTracker (bypasses discrete dispatcher) |
| `point_two` | `continuous` | N/A | Continuous tracker: AnnotationPenTracker (bypasses discrete dispatcher) |
| `unknown` | `discrete` | `no_action` | No-op background class |

## Feature Schema (`feature-schema-v1`)

| Block | Indices | Description |
|-------|---------|-------------|
| `position` | 0–62 | 21 wrist-relative, scale-normalised landmark positions (x, y, z) |
| `fingertip_spread` | 63–67 | 5 inter-fingertip Euclidean distances |
| `wrist_trajectory` | 68–70 | Net wrist displacement from oldest frame in the window |
| `velocity` | 71–133 | 21 per-landmark wrist-relative velocity vectors (Δposition per unit time) |
| `joint_angles` | 134–143 | 10 MCP + PIP joint angles in radians |
| `wrist_vel_raw` | 144–146 | Camera-normalised wrist velocity (x, y, z); key directional signal |

## How to Use

```python
import torch
from huggingface_hub import hf_hub_download

from maestro.infrastructure.model.checkpoint_loader import load_inference_artifact

# Download the artifact (cached after first call)
local_path = hf_hub_download(
    repo_id="ntsrigaud/maestro-lstm-hybrid",
    filename="two_stream_attn_v1_2layer_ld_cong_finetune_20260515T134706Z_inference.pt",
)

# Load the artifact (includes model, class labels, and feature schema)
artifact = load_inference_artifact(
    artifact_path=local_path,
    device=torch.device("cpu"),
)
artifact.model.eval()

# Build a 147-dim feature vector using LandmarkFeatureTransformer
# and fill a 32-frame SlidingWindowSequenceBuffer. `window_np` below is
# assumed to hold the resulting window as an array of shape (32, 147). Then:
with torch.no_grad():
    # tensor shape: (batch=1, T=32, F=147)
    window_tensor = torch.tensor(window_np, dtype=torch.float32).unsqueeze(0)
    logits = artifact.model(window_tensor)
    pred_class = artifact.class_labels[logits.argmax(dim=1).item()]
```

## Training Dataset

- **Source**: Hybrid merge of Jester and IPN-Hand
  windows: Jester provides unknown/swiping/zoom/stop_sign classes; IPN-Hand provides point_one and point_two
- **Used classes**: 10 (9 active gestures + `unknown` background)
- **Dataset split**: 70% train / 15% val / 15% test (stratified by class)
- **Augmentation**: temporal scale ±20%, spatial jitter σ=0.005

## Training Strategy

Two-phase transfer learning pipeline:

- **Phase 1 (pretraining):** backbone weights come from the external pretraining checkpoint `two_stream_attn_v1_2layer_pretrain_20260515T125437Z.pt`, trained to learn generic gesture dynamics.
- **Phase 2 (fine-tuning):** head replaced and model adapted on the Hybrid Jester+IPN 10-gesture vocabulary.
  - **Stage A (frozen backbone):** 10 epochs of head-only warmup.
  - **Stage B (full model):** up to 66 epochs of joint fine-tuning with scheduler and early stopping.
  - **Stage B retention defences:** `replay_max_samples_per_class=500`, `distillation_weight=0.0`, `replay_ce_weight=0.0`, `backbone_lr_multiplier=0.1`, `ewc_weight=N/A`, `gpm_components=0`, `forgetting_penalty_weight=0.5`.
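
The staged fine-tuning schedule above can be read as a per-epoch plan. The following is a minimal sketch, not the pipeline's actual code: `EpochPlan` and `plan_for_epoch` are hypothetical names, and it assumes that `backbone_lr_multiplier` scales the base learning rate of 3e-05 listed under Training Configuration and that the head keeps the base rate in both stages. Scheduler decay and early stopping are left out.

```python
from dataclasses import dataclass

# Values quoted in this model card.
BASE_LR = 3e-05                # "Learning rate" in Training Configuration
BACKBONE_LR_MULTIPLIER = 0.1   # Stage B retention defence
STAGE_A_EPOCHS = 10            # head-only warmup
STAGE_B_MAX_EPOCHS = 66        # upper bound for joint fine-tuning


@dataclass(frozen=True)
class EpochPlan:
    """Hypothetical container describing what trains in a given epoch."""
    stage: str
    backbone_trainable: bool
    backbone_lr: float
    head_lr: float


def plan_for_epoch(epoch: int) -> EpochPlan:
    """Return the assumed plan for a zero-based fine-tuning epoch."""
    if epoch < 0 or epoch >= STAGE_A_EPOCHS + STAGE_B_MAX_EPOCHS:
        raise ValueError(f"epoch {epoch} is outside the fine-tuning budget")
    if epoch < STAGE_A_EPOCHS:
        # Stage A: backbone frozen, only the head receives updates.
        return EpochPlan("A", False, 0.0, BASE_LR)
    # Stage B: the full model trains, with the backbone at a reduced rate
    # (assumption: the multiplier is applied to the base learning rate).
    return EpochPlan("B", True, BASE_LR * BACKBONE_LR_MULTIPLIER, BASE_LR)
```

Under these assumptions, `plan_for_epoch(0)` describes a head-only epoch and `plan_for_epoch(10)` the first joint fine-tuning epoch, with the backbone at one tenth of the base rate. In PyTorch, the two rates would typically map onto two optimizer parameter groups.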
## Training Configuration

| Parameter | Value |
|-----------|-------|
| Architecture | EnhancedTwoStreamLSTM (BiLSTM h=96×2, MHA 8 heads, proj=96, mean+max pool, MLP gate) |
| Input size | 147 |
| Hidden size | 96/stream (BiLSTM output: 192) |
| Projection dim | 96 |
| Num layers | 2 |
| MHA heads | 8 (head dim: 24) |
| Dropout | 0.4 |
| Learning rate | 3e-05 |
| Weight decay | 0.001 |
| Batch size | 128 |
| Max epochs | 80 |
| Early stopping patience | 20 |
| Label smoothing | 0.05 |
| Class weighting | disabled |
| Max samples per class | 3000 |
| LR scheduler | ReduceLROnPlateau (factor=0.5, patience=10) |

## Evaluation Results (Test Set)

| Metric | Value |
|--------|-------|
| Accuracy | 96.1% |
| Macro F1 | 95.9% |

### Per-Class Recall

| Class | Recall |
|-------|--------|
| `fist` | 97.3% |
| `swiping_right` | 97.1% |
| `swiping_left` | 98.3% |
| `swiping_down` | 98.0% |
| `swiping_up` | 98.2% |
| `zooming_in_full_hand` | 97.0% |
| `zooming_out_full_hand` | 95.1% |
| `point_one` | 97.4% |
| `point_two` | 95.1% |
| `unknown` | 85.7% |

## Comparison with Previous Architecture

| Feature | TwoStreamGestureLSTM | EnhancedTwoStreamLSTM |
|---------|---------------------|-----------------------|
| LSTM direction | Unidirectional | **Bidirectional** |
| Attention | Bahdanau (scalar) | **MHA Q/K/V (8 heads)** |
| Feature projection | No | **Yes (→96)** |
| Temporal pooling | Mean only | **Mean + Max** |
| Cross-stream fusion | Concat only | **2-layer MLP gate** |
| Parameters | ~182 K | 1,208,554 |

## Limitations and Risks

- Trained only on subjects from the Jester and IPN-Hand datasets. Performance may degrade with unusual hand sizes, skin tones, or lighting conditions not represented in the training data.
- The `unknown` class represents background/transition frames. At runtime, predictions are filtered through per-class confidence thresholds defined in `production_hybrid.yaml`.
- Requires **mediapipe>=0.10.14** for landmark extraction at inference time.
- Not intended for safety-critical or accessibility-critical applications.
- Performance was measured on a held-out test split from the same dataset; real-world generalisation may differ.

## Environmental Impact

Training was performed on CPU/MPS. Estimated training time: ~10 minutes. Estimated CO₂ equivalent: negligible (<0.001 kg CO₂eq).

---

*Generated by the Maestro training pipeline on 2026-05-15.*