+---
+language:
+- en
+license: mit
+tags:
+- gesture-recognition
+- hand-gesture
+- pytorch
+- mediapipe
+- temporal-model
+- lstm
+- attention
+- bidirectional
+datasets:
+- IPN-Hand
+metrics:
+- accuracy
+- f1
+model-index:
+- name: two_stream_attn_v1_finetune_20260512T041947Z
+  results:
+  - task:
+      type: gesture-recognition
+    dataset:
+      name: IPN Hand
+      type: IPN-Hand
+    metrics:
+    - type: accuracy
+      value: 0.9898
+    - type: f1
+      value: 0.9917
+---
+# two_stream_attn_v1_finetune_20260512T041947Z
+A real-time hand gesture classifier trained on a hybrid Jester+IPN gesture dataset
+(Jester dynamic classes plus IPN pointing classes). The model is part of the **Maestro**
+pipeline, which enables touchless control of presentation and meeting software through
+hand gestures captured from a standard webcam, with MediaPipe used for landmark extraction.
+## Model Description
+- **Architecture**: EnhancedTwoStreamLSTM (BiLSTM h=96×2, MHA 8 heads, proj=96, mean+max pool, MLP gate)
+- **Parameters**: 2,099,434
+- **Input**: `(batch, 32, 147)`
+    — 32-frame sliding window at 30 FPS ≈ 1067 ms
+- **Output**: Raw logits over 10 gesture classes (apply softmax to obtain probabilities)
+- **Inference latency**: < 1 ms per call (CPU, single sample)
+- **Feature schema**: `feature-schema-v5`
+## Architecture
+`EnhancedTwoStreamLSTM` splits the 147-dim feature vector into two parallel streams and
+processes them through a BiLSTM + self-attention + MLP-gate pipeline:
+```
+Input (B, T=32, 147)
+    │
+    ├─ Stream A — Pose/Shape (73 dims)
+    │   Linear+LN+GELU → 96
+    │   4-layer BiLSTM (h=96) → (B, T, 192)
+    │   LayerNorm → Self-MHA (8 heads) + residual + post-LN
+    │   mean+max pool → pool_LN → ctx_a (B, 192)
+    │
+    ├─ Stream B — Motion/Dynamics (74 dims)
+    │   (identical structure) → ctx_b (B, 192)
+    │
+    ├─ MLP cross-stream gate
+    │   gate_a = Sigmoid(
+    │     Linear(96→192)(
+    │       Tanh(Linear(192→96)(ctx_b))))
+    │   ctx_a  = LN(ctx_a × gate_a + ctx_a)
+    │   gate_b = Sigmoid(
+    │     Linear(96→192)(
+    │       Tanh(Linear(192→96)(ctx_a))))
+    │   ctx_b  = LN(ctx_b × gate_b + ctx_b)
+    │
+    └─ cat(ctx_a, ctx_b) → (384,)
+       LN → Linear(384→192) → GELU → Dropout → Linear(192→10)
+```
+**Design rationale:**
+- BiLSTMs encode temporal order via their recurrent cell state — no positional encoding needed.
+- Mean+Max pooling captures both sustained gesture shape (mean) and transient click events (max).
+- The 2-layer MLP gate provides non-linear cross-modal recalibration at ~37 K params per gate
+  (vs ~263 K for full MHA cross-attention with a degenerate mean-pooled query).
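The parameter figures quoted in this card can be cross-checked with plain arithmetic. The sketch below is not the Maestro implementation; it only counts weights under stated assumptions: PyTorch-style LSTM parameters (two bias vectors per direction and layer), attention with biased Q/K/V and output projections, affine LayerNorms, a 192-dim pooled context, one LayerNorm per gated stream, and four BiLSTM layers as given by `num_layers` in config.json. All helper names are hypothetical.

```python
def linear(n_in, n_out):
    """Weights plus bias of a fully connected layer."""
    return n_in * n_out + n_out


def layer_norm(dim):
    """Affine LayerNorm: one scale and one shift per feature."""
    return 2 * dim


def bilstm(n_in, hidden, layers):
    """Bidirectional LSTM, assuming input/recurrent weights plus two bias vectors."""
    total = 0
    for layer in range(layers):
        layer_in = n_in if layer == 0 else 2 * hidden
        per_direction = 4 * hidden * (layer_in + hidden) + 8 * hidden
        total += 2 * per_direction
    return total


def self_attention(dim):
    """Fused Q/K/V input projection plus output projection, all with biases."""
    return linear(dim, 3 * dim) + linear(dim, dim)


def stream(n_in, proj=96, hidden=96, layers=4):
    """One stream: projection, BiLSTM, pre-LN, attention, post-LN, pool LN."""
    ctx = 2 * hidden
    return (linear(n_in, proj) + layer_norm(proj)
            + bilstm(proj, hidden, layers)
            + layer_norm(ctx) + self_attention(ctx) + layer_norm(ctx)
            + layer_norm(ctx))


def gate(ctx=192, bottleneck=96):
    """Two-layer MLP gate: Linear(192->96) then Linear(96->192)."""
    return linear(ctx, bottleneck) + linear(bottleneck, ctx)


def head(ctx=192, classes=10):
    """Classifier head: LN(384), Linear(384->192), Linear(192->10)."""
    return layer_norm(2 * ctx) + linear(2 * ctx, ctx) + linear(ctx, classes)


total = stream(73) + stream(74) + 2 * (gate() + layer_norm(192)) + head()
print(gate())  # 37152
print(total)   # 2099434
```

Under these assumptions the count reproduces the reported 2,099,434 parameters exactly, and each gate contributes about 37 K of them. Passing `layers=2` to `stream` gives a much smaller total (about 1.2 M).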
+## Gesture Classes
+| Class | Description |
+|-------|-------------|
+| `no_gesture` | — |
+| `point_one` | Single-finger pointing gesture (continuous laser-pointer control) |
+| `point_two` | Two-finger pointing gesture (continuous annotation-pen control) |
+| `stop_sign` | — |
+| `swiping_down` | — |
+| `swiping_left` | — |
+| `swiping_right` | — |
+| `swiping_up` | — |
+| `zooming_in_full_hand` | — |
+| `zooming_out_full_hand` | — |
+## Gesture Usage In Presentation System
+| Class | Mode | Command | Runtime handling |
+|-------|------|---------|------------------|
+| `no_gesture` | `unmapped` | `—` | Not mapped in command_map_presentation.yaml |
+| `point_one` | `continuous` | `—` | Continuous tracker: LaserPointerTracker (bypasses discrete dispatcher) |
+| `point_two` | `continuous` | `—` | Continuous tracker: AnnotationPenTracker (bypasses discrete dispatcher) |
+| `stop_sign` | `unmapped` | `—` | Not mapped in command_map_presentation.yaml |
+| `swiping_down` | `unmapped` | `—` | Not mapped in command_map_presentation.yaml |
+| `swiping_left` | `unmapped` | `—` | Not mapped in command_map_presentation.yaml |
+| `swiping_right` | `unmapped` | `—` | Not mapped in command_map_presentation.yaml |
+| `swiping_up` | `unmapped` | `—` | Not mapped in command_map_presentation.yaml |
+| `zooming_in_full_hand` | `unmapped` | `—` | Not mapped in command_map_presentation.yaml |
+| `zooming_out_full_hand` | `unmapped` | `—` | Not mapped in command_map_presentation.yaml |
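The table above can be reproduced with an exact-match lookup of each class label in a mode table. This is a minimal sketch that uses the `modes` block of config.json as a stand-in for `command_map_presentation.yaml`; the function name `resolve_mode` and the exact-match behaviour are assumptions, not the pipeline's confirmed logic.

```python
# Mode table copied from the "modes" block of config.json.
MODES = {
    "swipe_up": "discrete",
    "swipe_down": "discrete",
    "swipe_right": "discrete",
    "swipe_left": "discrete",
    "zoom_in": "discrete",
    "zoom_out": "discrete",
    "open_palm_hold": "discrete",
    "point_one": "continuous",
    "point_two": "continuous",
}

# Class labels in the order stored in the artifact.
CLASS_LABELS = [
    "no_gesture", "point_one", "point_two", "stop_sign",
    "swiping_down", "swiping_left", "swiping_right", "swiping_up",
    "zooming_in_full_hand", "zooming_out_full_hand",
]


def resolve_mode(label, modes=MODES):
    """Exact-match lookup; labels without an entry are reported as unmapped."""
    return modes.get(label, "unmapped")


for label in CLASS_LABELS:
    print(f"{label:24s} {resolve_mode(label)}")
```

With exact matching, only `point_one` and `point_two` resolve, because the mapping keys (`swipe_left`, `zoom_in`, and so on) differ from the model's class labels (`swiping_left`, `zooming_in_full_hand`).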
+## Feature Schema (`feature-schema-v5`)
+| Block | Indices (inclusive) | Description |
+|-------|------|-------------|
+| `position` | 0–62 | 21 wrist-relative, scale-normalised landmark positions (x, y, z) |
+| `fingertip_spread` | 63–67 | 5 inter-fingertip Euclidean distances |
+| `wrist_trajectory` | 68–70 | Net wrist displacement from oldest frame in the window |
+| `velocity` | 71–133 | 21 per-landmark wrist-relative velocity vectors (Δposition per unit time) |
+| `joint_angles` | 134–143 | 10 MCP + PIP joint angles in radians |
+| `wrist_vel_raw` | 144–146 | Camera-normalised wrist velocity (x, y, z) — key directional signal |
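The index ranges in the schema table translate directly into slices over a 147-dim frame. The sketch below is illustrative only; `FEATURE_BLOCKS` and `block_slice` are hypothetical names, and the ranges are copied from the table.

```python
# Inclusive (first, last) index ranges from the feature-schema-v5 table.
FEATURE_BLOCKS = {
    "position": (0, 62),
    "fingertip_spread": (63, 67),
    "wrist_trajectory": (68, 70),
    "velocity": (71, 133),
    "joint_angles": (134, 143),
    "wrist_vel_raw": (144, 146),
}


def block_slice(name):
    """Convert an inclusive index range into a Python slice."""
    first, last = FEATURE_BLOCKS[name]
    return slice(first, last + 1)


frame = [0.0] * 147  # one feature vector
widths = {name: len(frame[block_slice(name)]) for name in FEATURE_BLOCKS}
print(widths)
print(sum(widths.values()))  # 147
```

The block widths (63, 5, 3, 63, 10, 3) sum to the 147 features the model expects.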
+## How to Use
+```python
+import torch
+from huggingface_hub import hf_hub_download
+from maestro.infrastructure.model.checkpoint_loader import load_inference_artifact
+# Download the artifact (cached after first call)
+local_path = hf_hub_download(
+    repo_id="ntsrigaud/maestro-lstm-hybrid",
+    filename="two_stream_attn_v1_finetune_20260512T041947Z_inference.pt",
+)
+# Load the artifact (includes model, class labels, and feature schema)
+artifact = load_inference_artifact(
+    artifact_path=local_path,
+    device=torch.device("cpu"),
+)
+artifact.model.eval()
+# `window_np` must be a float array of shape (32, 147): build each 147-dim
+# feature vector with LandmarkFeatureTransformer and fill a 32-frame
+# SlidingWindowSequenceBuffer before running inference, then:
+with torch.no_grad():
+    # tensor shape: (batch=1, T=32, F=147)
+    window_tensor = torch.tensor(window_np, dtype=torch.float32).unsqueeze(0)
+    logits = artifact.model(window_tensor)
+    pred_class = artifact.class_labels[logits.argmax(dim=1).item()]
+```
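The 32-frame window consumed above can be pictured as a fixed-length queue of feature vectors. The class below is a hypothetical stand-in for `SlidingWindowSequenceBuffer`, whose real interface is not shown in this card; it only illustrates the sliding-window idea with the standard library.

```python
from collections import deque


class WindowBuffer:
    """Hypothetical fixed-length buffer holding the most recent feature frames."""

    def __init__(self, length=32, feature_dim=147):
        self.length = length
        self.feature_dim = feature_dim
        self._frames = deque(maxlen=length)  # oldest frame is dropped automatically

    def push(self, features):
        """Append one frame, rejecting vectors of the wrong size."""
        if len(features) != self.feature_dim:
            raise ValueError(f"expected {self.feature_dim} features, got {len(features)}")
        self._frames.append(list(features))

    def ready(self):
        """True once a full window is available."""
        return len(self._frames) == self.length

    def window(self):
        """Return the current window as a (length, feature_dim) nested list."""
        if not self.ready():
            raise RuntimeError("window is not full yet")
        return [list(frame) for frame in self._frames]


buffer = WindowBuffer()
for i in range(40):  # simulate 40 incoming frames
    buffer.push([float(i)] * 147)

window = buffer.window()
print(len(window), len(window[0]))  # 32 147
```

After 40 pushes the buffer holds frames 8 through 39, so every inference call sees the latest 32 frames.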
+## Training Dataset
+- **Source**: Hybrid merge of Jester and IPN-Hand windows. Jester provides the `no_gesture`, swiping, zooming, and `stop_sign` classes; IPN-Hand provides `point_one` and `point_two`.
+- **Used classes**: 10 (9 active gestures + `no_gesture` background)
+- **Dataset split**: 70% train / 15% val / 15% test (stratified by class)
+- **Augmentation**: temporal scale ±20%, spatial jitter σ=0.005;
+  label-aware horizontal mirror (`swiping_left` ↔ `swiping_right`)
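A label-aware mirror has to change the label together with the features, otherwise a mirrored left swipe would still be labelled as a left swipe. The sketch below is an assumption-heavy illustration, not the training code: it assumes (x, y, z) components are interleaved inside each vector block of the feature schema, and that mirroring amounts to negating every x component. `X_INDICES` and `mirror_window` are hypothetical names.

```python
# Labels that swap under a horizontal mirror; all other labels are unchanged.
SWAP = {"swiping_left": "swiping_right", "swiping_right": "swiping_left"}

# Assumed x-component indices: position (0-62), wrist_trajectory (68-70),
# velocity (71-133) and wrist_vel_raw (144-146), each laid out as x, y, z.
X_INDICES = list(range(0, 63, 3)) + [68] + list(range(71, 134, 3)) + [144]


def mirror_window(window, label):
    """Return a mirrored copy of the window and the matching label."""
    mirrored = [list(frame) for frame in window]
    for frame in mirrored:
        for i in X_INDICES:
            frame[i] = -frame[i]
    return mirrored, SWAP.get(label, label)


window = [[1.0] * 147 for _ in range(32)]
mirrored, new_label = mirror_window(window, "swiping_left")
print(new_label)  # swiping_right
```

Distances and joint angles (`fingertip_spread`, `joint_angles`) are left untouched because a reflection does not change them.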
+## Training Strategy
+Two-phase transfer learning pipeline:
+- **Phase 1 (pretraining):** backbone weights loaded from the pretrained checkpoint `two_stream_attn_v1_20260512T041219Z.pt`, which captures generic gesture dynamics.
+- **Phase 2 (fine-tuning):** head replaced and model adapted on Hybrid Jester+IPN 10-gesture vocabulary.
+- **Stage A (frozen backbone):** 10 epoch(s) head-only warmup.
+- **Stage B (full model):** up to 60 epoch(s) joint fine-tuning with scheduler/early stopping.
+- **Stage B retention defences:** replay_max_samples_per_class=500, distillation_weight=0.5, replay_ce_weight=0.3, backbone_lr_multiplier=0.1, ewc_weight=100.0, gpm_components=20, forgetting_penalty_weight=0.5.
+## Training Configuration
+| Parameter | Value |
+|-----------|-------|
+| Architecture | EnhancedTwoStreamLSTM (BiLSTM h=96×2, MHA 8 heads, proj=96, mean+max pool, MLP gate) |
+| Input size | 147 |
+| Hidden size | 96/stream (BiLSTM output: 192) |
+| Projection dim | 96 |
+| Num layers | 4 |
+| MHA heads | 8 (head dim: 24) |
+| Dropout | 0.35 |
+| Learning rate | 3e-05 |
+| Weight decay | 0.0005 |
+| Batch size | 128 |
+| Max epochs | 60 |
+| Early stopping patience | 20 |
+| Label smoothing | 0.05 |
+| Class weighting | disabled |
+| Max samples per class | 5000 |
+| LR scheduler | ReduceLROnPlateau (factor=0.5, patience=8) |
+## Evaluation Results (Test Set)
+| Metric | Value |
+|--------|-------|
+| Accuracy | 99.0% |
+| Macro F1 | 99.2% |
+### Per-Class Recall
+| Class | Recall |
+|-------|--------|
+| `no_gesture` | 100.0% |
+| `point_one` | 98.9% |
+| `point_two` | 98.5% |
+| `stop_sign` | 99.5% |
+| `swiping_down` | 99.0% |
+| `swiping_left` | 100.0% |
+| `swiping_right` | 99.1% |
+| `swiping_up` | 98.1% |
+| `zooming_in_full_hand` | 99.2% |
+| `zooming_out_full_hand` | 99.0% |
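The macro F1 reported above can be cross-checked from the per-class precision and recall values stored in config.json. The sketch below computes the per-class F1 as the harmonic mean of precision and recall and then averages over classes without weighting; that this is how the pipeline computed its figure is an assumption.

```python
# Per-class values copied from the "evaluation" block of config.json.
RECALL = {
    "no_gesture": 1.0,
    "point_one": 0.9890560875512996,
    "point_two": 0.9850746268656716,
    "stop_sign": 0.9947460595446584,
    "swiping_down": 0.9903846153846154,
    "swiping_left": 1.0,
    "swiping_right": 0.990909090909091,
    "swiping_up": 0.9810126582278481,
    "zooming_in_full_hand": 0.9919484702093397,
    "zooming_out_full_hand": 0.9897959183673469,
}
PRECISION = {
    "no_gesture": 1.0,
    "point_one": 0.9836734693877551,
    "point_two": 0.9864130434782609,
    "stop_sign": 0.9964912280701754,
    "swiping_down": 1.0,
    "swiping_left": 0.9818181818181818,
    "swiping_right": 0.990909090909091,
    "swiping_up": 1.0,
    "zooming_in_full_hand": 0.9919484702093397,
    "zooming_out_full_hand": 0.9897959183673469,
}


def f1(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


per_class_f1 = {name: f1(PRECISION[name], RECALL[name]) for name in RECALL}
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(round(macro_f1, 4))  # 0.9917
```

The result agrees with the `test_macro_f1` value in config.json to four decimal places.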
+## Comparison with Previous Architecture
+| Feature | TwoStreamGestureLSTM | EnhancedTwoStreamLSTM |
+|---------|---------------------|-----------------------|
+| LSTM direction | Unidirectional | **Bidirectional** |
+| Attention | Bahdanau (scalar) | **MHA Q/K/V (8 heads)** |
+| Feature projection | No | **Yes (→96)** |
+| Temporal pooling | Mean only | **Mean + Max** |
+| Cross-stream fusion | Concat only | **2-layer MLP gate** |
+| Parameters | ~182 K | 2,099,434 (~2.1 M) |
+## Limitations and Risks
+- Trained on Jester and IPN Hand subjects only. Performance may degrade with unusual hand sizes,
+  skin tones, or lighting conditions not represented in training data.
+- The `no_gesture` class represents background/transition frames. At runtime, predictions
+  are filtered through per-class confidence thresholds defined in `production_ipn.yaml`.
+- Requires **mediapipe>=0.10.14** for landmark extraction at inference time.
+- Not intended for safety-critical or accessibility-critical applications.
+- Performance was measured on a held-out test split from the same dataset; real-world
+  generalisation may differ.
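Per-class confidence filtering, as mentioned in the list above, can be sketched as a softmax followed by a threshold check. The threshold values below are invented for illustration and are not taken from `production_ipn.yaml`; the function names and the fallback to `no_gesture` are likewise assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def filter_prediction(logits, class_labels, thresholds,
                      default_threshold=0.5, fallback="no_gesture"):
    """Return (label, confidence), falling back when confidence is too low."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    label = class_labels[best]
    if probs[best] < thresholds.get(label, default_threshold):
        return fallback, probs[best]
    return label, probs[best]


CLASS_LABELS = [
    "no_gesture", "point_one", "point_two", "stop_sign",
    "swiping_down", "swiping_left", "swiping_right", "swiping_up",
    "zooming_in_full_hand", "zooming_out_full_hand",
]
THRESHOLDS = {"point_one": 0.8, "point_two": 0.8}  # hypothetical values

confident = [0.0, 5.0] + [0.0] * 8  # strong evidence for point_one
uncertain = [0.0, 1.0] + [0.0] * 8  # weak evidence for point_one
print(filter_prediction(confident, CLASS_LABELS, THRESHOLDS)[0])  # point_one
print(filter_prediction(uncertain, CLASS_LABELS, THRESHOLDS)[0])  # no_gesture
```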
+## Environmental Impact
+Training was performed on CPU/MPS. Estimated training time: ~10 minutes.
+Estimated CO₂ equivalent: negligible (<0.001 kg CO₂eq).
+---
+*Generated by the Maestro training pipeline on 2026-05-12.*

config.json ADDED

	@@ -0,0 +1,100 @@

+{
+  "model_version": "two_stream_attn_v1_finetune_20260512T041947Z",
+  "model_config": {
+    "model_name": "two_stream_attn_v1",
+    "input_size": 147,
+    "hidden_size": 96,
+    "num_layers": 4,
+    "dropout": 0.35,
+    "num_classes": 10
+  },
+  "feature_schema": {
+    "feature_schema_version": "feature-schema-v5",
+    "feature_dim": 147,
+    "orientation_normalization": false,
+    "window_length": 32,
+    "window_step": null
+  },
+  "training_config": {
+    "epochs": 60,
+    "batch_size": 128,
+    "learning_rate": 3e-05,
+    "weight_decay": 0.0005,
+    "grad_clip_norm": 1.0,
+    "seed": 42,
+    "label_smoothing": 0.05,
+    "class_weighting": false,
+    "max_samples_per_class": 5000,
+    "scheduler": {
+      "factor": 0.5,
+      "patience": 8,
+      "min_lr": 1e-06
+    }
+  },
+  "evaluation": {
+    "test_accuracy": 0.9898119122257053,
+    "test_macro_f1": 0.9916782280254713,
+    "test_loss": 0.3169419604159946,
+    "calibration_ece": 0.04126546900162752,
+    "per_class_recall": {
+      "no_gesture": 1.0,
+      "point_one": 0.9890560875512996,
+      "point_two": 0.9850746268656716,
+      "stop_sign": 0.9947460595446584,
+      "swiping_down": 0.9903846153846154,
+      "swiping_left": 1.0,
+      "swiping_right": 0.990909090909091,
+      "swiping_up": 0.9810126582278481,
+      "zooming_in_full_hand": 0.9919484702093397,
+      "zooming_out_full_hand": 0.9897959183673469
+    },
+    "per_class_precision": {
+      "no_gesture": 1.0,
+      "point_one": 0.9836734693877551,
+      "point_two": 0.9864130434782609,
+      "stop_sign": 0.9964912280701754,
+      "swiping_down": 1.0,
+      "swiping_left": 0.9818181818181818,
+      "swiping_right": 0.990909090909091,
+      "swiping_up": 1.0,
+      "zooming_in_full_hand": 0.9919484702093397,
+      "zooming_out_full_hand": 0.9897959183673469
+    }
+  },
+  "class_labels": [
+    "no_gesture",
+    "point_one",
+    "point_two",
+    "stop_sign",
+    "swiping_down",
+    "swiping_left",
+    "swiping_right",
+    "swiping_up",
+    "zooming_in_full_hand",
+    "zooming_out_full_hand"
+  ],
+  "created_at": "2026-05-12T04:25:36.916751+00:00",
+  "gesture_command_mapping": {
+    "commands": {
+      "swipe_up": "start_presentation",
+      "swipe_down": "stop_presentation",
+      "swipe_right": "next_slide",
+      "swipe_left": "previous_slide",
+      "zoom_in": "zoom_in_view",
+      "zoom_out": "zoom_out_view",
+      "open_palm_hold": "erase_annotations",
+      "unknown": "no_action"
+    },
+    "modes": {
+      "swipe_up": "discrete",
+      "swipe_down": "discrete",
+      "swipe_right": "discrete",
+      "swipe_left": "discrete",
+      "zoom_in": "discrete",
+      "zoom_out": "discrete",
+      "open_palm_hold": "discrete",
+      "point_one": "continuous",
+      "point_two": "continuous"
+    }
+  }
+}