| --- |
| language: |
| - en |
| license: mit |
| tags: |
| - gesture-recognition |
| - hand-gesture |
| - pytorch |
| - mediapipe |
| - temporal-model |
| - lstm |
| - attention |
| - bidirectional |
| datasets: |
| - IPN-Hand |
| metrics: |
| - accuracy |
| - f1 |
| model-index: |
| - name: two_stream_attn_v1_2layer_ld_cong_finetune_20260515T134706Z |
| results: |
| - task: |
| type: gesture-recognition |
| dataset: |
| name: IPN Hand |
| type: IPN-Hand |
| metrics: |
| - type: accuracy |
| value: 0.9606 |
| - type: f1 |
| value: 0.9587 |
| --- |
| |
| # two_stream_attn_v1_2layer_ld_cong_finetune_20260515T134706Z |
|
|
| A real-time hand gesture classifier trained on a hybrid Jester + IPN Hand gesture dataset |
| (dynamic gesture classes from Jester, pointing classes from IPN Hand). |
|
|
| This model is part of the **Maestro** pipeline that enables touchless |
| control of presentation and meeting software through hand gestures captured from a |
| standard webcam using MediaPipe for landmark extraction. |
|
|
| ## Model Description |
|
|
| - **Architecture**: EnhancedTwoStreamLSTM (BiLSTM h=96×2, MHA 8 heads, proj=96, mean+max pool, MLP gate) |
| - **Parameters**: 1,208,554 |
| - **Input**: `(batch, 16, 147)`, a 16-frame sliding window at 30 FPS (≈ 533 ms) |
| - **Output**: Logits over 10 gesture classes (apply softmax to obtain probabilities) |
| - **Inference latency**: < 1 ms per call (CPU, single sample) |
| - **Feature schema**: `feature-schema-v1` |
| |
| ## Architecture |
|
|
| `EnhancedTwoStreamLSTM` splits the 147-dim feature vector into two parallel streams and |
| processes them through a BiLSTM + self-attention + MLP-gate pipeline: |
|
|
| ``` |
| Input (B, T=32, 147) |
| │ |
| ├─ Stream A: Pose/Shape (73 dims) |
| │    Linear+LN+GELU → 96 |
| │    2-layer BiLSTM (h=96) → (B, T, 192) |
| │    LayerNorm → Self-MHA (8 heads) + residual + post-LN |
| │    mean+max pool → pool_LN → ctx_a (B, 192) |
| │ |
| ├─ Stream B: Motion/Dynamics (74 dims) |
| │    (identical structure) → ctx_b (B, 192) |
| │ |
| ├─ MLP cross-stream gate |
| │    gate_a = Sigmoid(Linear(96→192)(Tanh(Linear(192→96)(ctx_b)))) |
| │    ctx_a  = LN(ctx_a × gate_a + ctx_a) |
| │    gate_b = Sigmoid(Linear(96→192)(Tanh(Linear(192→96)(ctx_a)))) |
| │    ctx_b  = LN(ctx_b × gate_b + ctx_b) |
| │ |
| └─ cat(ctx_a, ctx_b) → (384,) |
|      LN → Linear(384→192) → GELU → Dropout → Linear(192→10) |
| ``` |
|
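The pooling and gating steps in the diagram can be sketched in PyTorch as follows. This is a minimal illustration, not the Maestro source: `CrossStreamGate` and `mean_max_pool` are hypothetical names, the pool is assumed to add the temporal mean and max so the width stays at 192, the two gates are assumed to have separate weights, and `gate_b` is assumed to read the already-updated `ctx_a`, following the order in which the diagram lists the steps.

```python
import torch
from torch import nn


class CrossStreamGate(nn.Module):
    """Hypothetical gate: rescales one stream's context using the other stream."""

    def __init__(self, dim: int = 192, bottleneck: int = 96) -> None:
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, bottleneck),
            nn.Tanh(),
            nn.Linear(bottleneck, dim),
            nn.Sigmoid(),
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, ctx: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        gate = self.mlp(other)              # (B, dim), values in (0, 1)
        return self.norm(ctx * gate + ctx)  # gated residual, then LayerNorm


def mean_max_pool(seq: torch.Tensor) -> torch.Tensor:
    """Assumed pooling: temporal mean plus temporal max, (B, T, D) -> (B, D)."""
    return seq.mean(dim=1) + seq.max(dim=1).values


torch.manual_seed(0)
seq_a = torch.randn(4, 32, 192)  # stand-in for Stream A after BiLSTM + attention
seq_b = torch.randn(4, 32, 192)  # stand-in for Stream B
ctx_a, ctx_b = mean_max_pool(seq_a), mean_max_pool(seq_b)

gate_a, gate_b = CrossStreamGate(), CrossStreamGate()
ctx_a = gate_a(ctx_a, ctx_b)     # Stream A recalibrated by Stream B
ctx_b = gate_b(ctx_b, ctx_a)     # Stream B recalibrated by the updated Stream A
fused = torch.cat([ctx_a, ctx_b], dim=-1)
print(fused.shape)               # torch.Size([4, 384])
```

The residual form `ctx * gate + ctx` means a gate value near zero leaves the context unchanged instead of suppressing it.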
|
| **Design rationale:** |
| - BiLSTMs encode temporal order via their recurrent cell state — no positional encoding needed. |
| - Mean+Max pooling captures both sustained gesture shape (mean) and transient click events (max). |
| - The 2-layer MLP gate provides non-linear cross-modal recalibration at ~37 K params |
| (vs ~263 K for full MHA cross-attention with a degenerate mean-pooled query). |
|
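The ~37 K figure can be checked by hand: a `Linear(in, out)` layer holds `in × out` weights plus `out` biases. The sketch below counts a single gate MLP and assumes biases are enabled; the helper name is illustrative.

```python
def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


# One gate MLP from the diagram: Linear(192→96) → Tanh → Linear(96→192) → Sigmoid
gate_params = linear_params(192, 96) + linear_params(96, 192)
print(gate_params)  # 37152
```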
|
| ## Gesture Classes |
|
|
| | Class | Description | |
| |-------|-------------| |
| | `fist` | Closed fist (all fingers curled, thumb tucked) | |
| | `swiping_right` | Horizontal swipe from left to right | |
| | `swiping_left` | Horizontal swipe from right to left | |
| | `swiping_down` | Vertical swipe downward | |
| | `swiping_up` | Vertical swipe upward | |
| | `zooming_in_full_hand` | Pinch-open / spread fingers away from each other | |
| | `zooming_out_full_hand` | Pinch-close / bring fingers together | |
| | `point_one` | Single-finger pointing gesture (continuous laser-pointer control) | |
| | `point_two` | Two-finger pointing gesture (continuous annotation-pen control) | |
| | `unknown` | Background / transition / no gesture | |
|
|
| ## Gesture Usage In Presentation System |
|
|
| | Class | Mode | Command | Runtime handling | |
| |-------|------|---------|------------------| |
| | `fist` | `discrete` | `erase_annotations` | Discrete command via GestureActivationController → CommandDispatcher | |
| | `swiping_right` | `discrete` | `next_slide` | Discrete command via GestureActivationController → CommandDispatcher | |
| | `swiping_left` | `discrete` | `previous_slide` | Discrete command via GestureActivationController → CommandDispatcher | |
| | `swiping_down` | `discrete` | `stop_presentation` | Discrete command via GestureActivationController → CommandDispatcher | |
| | `swiping_up` | `discrete` | `start_presentation` | Discrete command via GestureActivationController → CommandDispatcher | |
| | `zooming_in_full_hand` | `discrete` | `zoom_in_view` | Discrete command via GestureActivationController → CommandDispatcher | |
| | `zooming_out_full_hand` | `discrete` | `zoom_out_view` | Discrete command via GestureActivationController → CommandDispatcher | |
| | `point_one` | `continuous` | `—` | Continuous tracker: LaserPointerTracker (bypasses discrete dispatcher) | |
| | `point_two` | `continuous` | `—` | Continuous tracker: AnnotationPenTracker (bypasses discrete dispatcher) | |
| | `unknown` | `discrete` | `no_action` | No-op background class | |
|
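The routing in the table can be expressed as a small lookup. The sketch below only transcribes the table; `Route`, `ROUTES`, and `route_prediction` are hypothetical names and do not mirror the actual `GestureActivationController` or `CommandDispatcher` interfaces.

```python
from typing import NamedTuple


class Route(NamedTuple):
    mode: str    # "discrete" or "continuous"
    target: str  # command name (discrete) or tracker name (continuous)


# Transcribed from the table above.
ROUTES = {
    "fist": Route("discrete", "erase_annotations"),
    "swiping_right": Route("discrete", "next_slide"),
    "swiping_left": Route("discrete", "previous_slide"),
    "swiping_down": Route("discrete", "stop_presentation"),
    "swiping_up": Route("discrete", "start_presentation"),
    "zooming_in_full_hand": Route("discrete", "zoom_in_view"),
    "zooming_out_full_hand": Route("discrete", "zoom_out_view"),
    "point_one": Route("continuous", "LaserPointerTracker"),
    "point_two": Route("continuous", "AnnotationPenTracker"),
    "unknown": Route("discrete", "no_action"),
}


def route_prediction(label: str) -> Route:
    """Look up how a predicted class is handled; unrecognised labels are no-ops."""
    return ROUTES.get(label, ROUTES["unknown"])


print(route_prediction("swiping_right"))
print(route_prediction("point_one"))
```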
|
| ## Feature Schema (`feature-schema-v1`) |
|
|
| | Block | Dims | Description | |
| |-------|------|-------------| |
| | `position` | 0–62 | 21 wrist-relative, scale-normalised landmark positions (x, y, z) | |
| | `fingertip_spread` | 63–67 | 5 inter-fingertip Euclidean distances | |
| | `wrist_trajectory` | 68–70 | Net wrist displacement from oldest frame in the window | |
| | `velocity` | 71–133 | 21 per-landmark wrist-relative velocity vectors (Δposition per unit time) | |
| | `joint_angles` | 134–143 | 10 MCP + PIP joint angles in radians | |
| | `wrist_vel_raw` | 144–146 | Camera-normalised wrist velocity (x, y, z) — key directional signal | |
|
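To slice individual blocks out of a 147-dim frame, the index ranges above can be kept in one table and validated for gaps. This is an illustrative sketch: `FEATURE_BLOCKS`, `block_slice`, and `check_contiguous` are hypothetical names, not part of the Maestro code base.

```python
# Inclusive index ranges transcribed from the table above.
FEATURE_BLOCKS = {
    "position": (0, 62),
    "fingertip_spread": (63, 67),
    "wrist_trajectory": (68, 70),
    "velocity": (71, 133),
    "joint_angles": (134, 143),
    "wrist_vel_raw": (144, 146),
}


def block_slice(name: str) -> slice:
    """Convert an inclusive (start, end) range into a Python slice."""
    start, end = FEATURE_BLOCKS[name]
    return slice(start, end + 1)


def check_contiguous(blocks: dict) -> int:
    """Verify the blocks tile the vector without gaps; return the total width."""
    expected_start = 0
    for start, end in blocks.values():
        if start != expected_start:
            raise ValueError(f"gap or overlap at index {start}")
        expected_start = end + 1
    return expected_start


frame = list(range(147))  # stand-in for one feature vector
wrist_velocity = frame[block_slice("wrist_vel_raw")]
print(check_contiguous(FEATURE_BLOCKS), wrist_velocity)  # 147 [144, 145, 146]
```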
|
|
|
| ## How to Use |
|
|
| ```python |
| import torch |
| from huggingface_hub import hf_hub_download |
| from maestro.infrastructure.model.checkpoint_loader import load_inference_artifact |
| |
| # Download the artifact (cached after first call) |
| local_path = hf_hub_download( |
|     repo_id="ntsrigaud/maestro-lstm-hybrid", |
|     filename="two_stream_attn_v1_2layer_ld_cong_finetune_20260515T134706Z_inference.pt", |
| ) |
| |
| # Load the artifact (includes model, class labels, and feature schema) |
| artifact = load_inference_artifact( |
|     artifact_path=local_path, |
|     device=torch.device("cpu"), |
| ) |
| artifact.model.eval() |
| |
| # Build 147-dim feature vectors with LandmarkFeatureTransformer and fill a |
| # 32-frame SlidingWindowSequenceBuffer. `window_np` is the resulting array of |
| # shape (32, 147); it must be produced by that pipeline before this point. |
| with torch.no_grad(): |
|     # tensor shape: (batch=1, T=32, F=147) |
|     window_tensor = torch.tensor(window_np, dtype=torch.float32).unsqueeze(0) |
|     logits = artifact.model(window_tensor) |
| pred_class = artifact.class_labels[logits.argmax(dim=1).item()] |
| ``` |
|
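The snippet above expects a full window of per-frame feature vectors. A fixed-length buffer of that kind can be sketched with `collections.deque`; the `SlidingWindow` class below is a hypothetical stand-in and not Maestro's `SlidingWindowSequenceBuffer`, whose interface is not shown in this card. The window length is a constructor argument because this card quotes both 16 and 32 frames.

```python
from collections import deque
from typing import Deque, List, Optional, Sequence


class SlidingWindow:
    """Hypothetical fixed-length frame buffer (not Maestro's implementation)."""

    def __init__(self, window_size: int, feature_dim: int = 147) -> None:
        self.window_size = window_size
        self.feature_dim = feature_dim
        self._frames: Deque[List[float]] = deque(maxlen=window_size)

    def push(self, features: Sequence[float]) -> Optional[List[List[float]]]:
        """Append one frame; return the full window once enough frames exist."""
        if len(features) != self.feature_dim:
            raise ValueError(
                f"expected {self.feature_dim} features, got {len(features)}"
            )
        self._frames.append(list(features))  # the oldest frame drops out when full
        if len(self._frames) < self.window_size:
            return None
        return list(self._frames)


buffer = SlidingWindow(window_size=32)
window = None
for t in range(40):  # 40 synthetic frames, each filled with its frame index
    window = buffer.push([float(t)] * 147)
print(len(window), len(window[0]), window[0][0])  # 32 147 8.0
```

The returned nested list can be converted to an array with `numpy.asarray(window, dtype=numpy.float32)` before building the tensor.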
|
| ## Training Dataset |
|
|
| - **Source**: Hybrid merge of Jester and IPN-Hand windows: Jester provides unknown/swiping/zoom/stop_sign classes; IPN-Hand provides point_one and point_two |
| - **Used classes**: 10 (9 active gestures + `unknown` background) |
| - **Dataset split**: 70% train / 15% val / 15% test (stratified by class) |
| - **Augmentation**: temporal scale ±20%, spatial jitter σ=0.005 |
| |
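A stratified 70/15/15 split shuffles and divides each class separately, so every subset keeps the class proportions of the full dataset. The sketch below is a generic standard-library version; the function name and the toy labels are illustrative, and this card does not say how the Maestro pipeline implements its split.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[str],
    fractions: Tuple[float, float, float] = (0.70, 0.15, 0.15),
    seed: int = 0,
) -> Dict[str, List[int]]:
    """Split sample indices into train/val/test, preserving class proportions."""
    by_class: Dict[str, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    splits: Dict[str, List[int]] = {"train": [], "val": [], "test": []}
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        splits["train"] += indices[:n_train]
        splits["val"] += indices[n_train:n_train + n_val]
        splits["test"] += indices[n_train + n_val:]  # remainder goes to test
    return splits


labels = ["fist"] * 100 + ["unknown"] * 200  # toy label list
splits = stratified_split(labels)
print({name: len(idx) for name, idx in splits.items()})
# {'train': 210, 'val': 45, 'test': 45}
```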
| ## Training Strategy |
| |
| Two-phase transfer learning pipeline: |
| - **Phase 1 (pretraining):** backbone weights loaded from the external checkpoint `two_stream_attn_v1_2layer_pretrain_20260515T125437Z.pt`, which was pretrained to learn generic gesture dynamics. |
| - **Phase 2 (fine-tuning):** head replaced and model adapted to the Hybrid Jester+IPN 10-gesture vocabulary, in two stages: |
|   - **Stage A (frozen backbone):** 10 epochs of head-only warmup. |
|   - **Stage B (full model):** up to 66 epochs of joint fine-tuning with LR scheduling and early stopping. |
|   - **Stage B retention defences:** `replay_max_samples_per_class=500`, `distillation_weight=0.0`, `replay_ce_weight=0.0`, `backbone_lr_multiplier=0.1`, `ewc_weight=N/A`, `gpm_components=0`, `forgetting_penalty_weight=0.5`. |
|
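The two fine-tuning stages map onto standard PyTorch mechanics: freeze the backbone for Stage A, then unfreeze it and give it its own parameter group at a reduced learning rate for Stage B. The sketch below uses a toy backbone/head pair, since the real module layout of `EnhancedTwoStreamLSTM` is not shown in this card. AdamW is an assumption (the card does not name the optimizer); the learning rate, weight decay, multiplier, and scheduler settings are the values listed in this card.

```python
import torch
from torch import nn

# Toy stand-in for the real model; only the backbone/head split matters here.
model = nn.ModuleDict({
    "backbone": nn.LSTM(147, 96, num_layers=2, batch_first=True, bidirectional=True),
    "head": nn.Linear(192, 10),
})

base_lr = 3e-5
backbone_lr_multiplier = 0.1


def set_backbone_trainable(trainable: bool) -> None:
    """Enable or disable gradient updates for every backbone parameter."""
    for param in model["backbone"].parameters():
        param.requires_grad = trainable


# Stage A: head-only warmup with the backbone frozen.
set_backbone_trainable(False)
stage_a_optimizer = torch.optim.AdamW(  # optimizer choice is an assumption
    model["head"].parameters(), lr=base_lr, weight_decay=0.001
)

# Stage B: unfreeze everything; the backbone trains at a reduced learning rate.
set_backbone_trainable(True)
stage_b_optimizer = torch.optim.AdamW(
    [
        {"params": model["backbone"].parameters(), "lr": base_lr * backbone_lr_multiplier},
        {"params": model["head"].parameters(), "lr": base_lr},
    ],
    weight_decay=0.001,
)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    stage_b_optimizer, factor=0.5, patience=10
)
print([group["lr"] for group in stage_b_optimizer.param_groups])
```

A lower backbone learning rate limits how far the pretrained weights drift while the new head adapts.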
|
| ## Training Configuration |
|
|
| | Parameter | Value | |
| |-----------|-------| |
| | Architecture | EnhancedTwoStreamLSTM (BiLSTM h=96×2, MHA 8 heads, proj=96, mean+max pool, MLP gate) | |
| | Input size | 147 | |
| | Hidden size | 96/stream (BiLSTM output: 192) | |
| | Projection dim | 96 | |
| | Num layers | 2 | |
| | MHA heads | 8 (head dim: 24) | |
| | Dropout | 0.4 | |
| | Learning rate | 3e-05 | |
| | Weight decay | 0.001 | |
| | Batch size | 128 | |
| | Max epochs | 80 | |
| | Early stopping patience | 20 | |
| | Label smoothing | 0.05 | |
| | Class weighting | disabled | |
| | Max samples per class | 3000 | |
| | LR scheduler | ReduceLROnPlateau (factor=0.5, patience=10) | |
|
|
| ## Evaluation Results (Test Set) |
|
|
| | Metric | Value | |
| |--------|-------| |
| | Accuracy | 96.1% | |
| | Macro F1 | 95.9% | |
|
|
| ### Per-Class Recall |
|
|
| | Class | Recall | |
| |-------|--------| |
| | `fist` | 97.3% | |
| | `swiping_right` | 97.1% | |
| | `swiping_left` | 98.3% | |
| | `swiping_down` | 98.0% | |
| | `swiping_up` | 98.2% | |
| | `zooming_in_full_hand` | 97.0% | |
| | `zooming_out_full_hand` | 95.1% | |
| | `point_one` | 97.4% | |
| | `point_two` | 95.1% | |
| | `unknown` | 85.7% | |
|
|
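Averaging the per-class recalls without weighting gives the macro recall, which makes the effect of the weaker `unknown` class easy to see. The short calculation below uses only the numbers from the table above; the variable names are illustrative.

```python
# Per-class recall (%) transcribed from the table above.
recall = {
    "fist": 97.3,
    "swiping_right": 97.1,
    "swiping_left": 98.3,
    "swiping_down": 98.0,
    "swiping_up": 98.2,
    "zooming_in_full_hand": 97.0,
    "zooming_out_full_hand": 95.1,
    "point_one": 97.4,
    "point_two": 95.1,
    "unknown": 85.7,
}

macro_recall = sum(recall.values()) / len(recall)
weakest = min(recall, key=recall.get)
print(f"{macro_recall:.2f}% macro recall, weakest class: {weakest}")
# 95.92% macro recall, weakest class: unknown
```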
| ## Comparison with Previous Architecture |
|
|
| | Feature | TwoStreamGestureLSTM | EnhancedTwoStreamLSTM | |
| |---------|---------------------|-----------------------| |
| | LSTM direction | Unidirectional | **Bidirectional** | |
| | Attention | Bahdanau (scalar) | **MHA Q/K/V (8 heads)** | |
| | Feature projection | No | **Yes (→96)** | |
| | Temporal pooling | Mean only | **Mean + Max** | |
| | Cross-stream fusion | Concat only | **2-layer MLP gate** | |
| | Parameters | ~182 K | 1,208,554 | |
|
|
| ## Limitations and Risks |
|
|
| - Trained only on subjects from the Jester and IPN Hand datasets. Performance may degrade with unusual hand sizes, |
| skin tones, or lighting conditions not represented in the training data. |
| - The `unknown` class represents background/transition frames. At runtime, predictions |
| are filtered through per-class confidence thresholds defined in `production_hybrid.yaml`. |
| - Requires **mediapipe>=0.10.14** for landmark extraction at inference time. |
| - Not intended for safety-critical or accessibility-critical applications. |
| - Performance was measured on a held-out test split from the same dataset; real-world |
| generalisation may differ. |
|
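The per-class confidence filtering mentioned above can be sketched as a softmax followed by a threshold lookup. Everything specific here is an assumption for illustration: the label order, the threshold values, the default threshold, and the function names are hypothetical, and the actual keys and values in `production_hybrid.yaml` are not shown in this card.

```python
import math
from typing import Dict, List, Sequence

# Label order is assumed; at runtime it would come from artifact.class_labels.
CLASS_LABELS = [
    "fist", "swiping_right", "swiping_left", "swiping_down", "swiping_up",
    "zooming_in_full_hand", "zooming_out_full_hand", "point_one", "point_two",
    "unknown",
]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def filter_prediction(
    logits: Sequence[float],
    thresholds: Dict[str, float],
    default_threshold: float = 0.5,
) -> str:
    """Return the top class, or "unknown" if its probability is below threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    label = CLASS_LABELS[best]
    if probs[best] < thresholds.get(label, default_threshold):
        return "unknown"
    return label


thresholds = {"swiping_right": 0.80}  # made-up value for illustration
confident = [0.0, 6.0] + [0.0] * 8    # swiping_right probability ≈ 0.98
hesitant = [0.0, 1.0] + [0.0] * 8     # swiping_right probability ≈ 0.23
print(filter_prediction(confident, thresholds))  # swiping_right
print(filter_prediction(hesitant, thresholds))   # unknown
```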
|
| ## Environmental Impact |
|
|
| Training was performed on CPU/MPS. Estimated training time: ~10 minutes. |
| Estimated CO₂ equivalent: negligible (<0.001 kg CO₂eq). |
|
|
| --- |
|
|
| *Generated by the Maestro training pipeline on 2026-05-15.* |
|
|