---
language:
- en
license: mit
tags:
- gesture-recognition
- hand-gesture
- pytorch
- mediapipe
- temporal-model
- lstm
- attention
- bidirectional
datasets:
- IPN-Hand
metrics:
- accuracy
- f1
model-index:
- name: two_stream_attn_v1_2layer_ld_cong_finetune_20260515T134706Z
  results:
  - task:
      type: gesture-recognition
    dataset:
      name: IPN Hand
      type: IPN-Hand
    metrics:
    - type: accuracy
      value: 0.9606
    - type: f1
      value: 0.9587
---
# two_stream_attn_v1_2layer_ld_cong_finetune_20260515T134706Z
A real-time hand gesture classifier trained on
a Hybrid Jester+IPN gesture dataset (Jester dynamic classes + IPN pointing classes).
This model is part of the **Maestro** pipeline that enables touchless
control of presentation and meeting software through hand gestures captured from a
standard webcam using MediaPipe for landmark extraction.
## Model Description
- **Architecture**: EnhancedTwoStreamLSTM (BiLSTM h=96×2, MHA 8 heads, proj=96, mean+max pool, MLP gate)
- **Parameters**: 1,208,554
- **Input**: `(batch, 16, 147)`, a 16-frame sliding window at 30 FPS (≈ 533 ms)
- **Output**: Logits over 10 gesture classes (apply softmax to obtain probabilities)
- **Inference latency**: < 1 ms per call (CPU, single sample)
- **Feature schema**: `feature-schema-v1`
## Architecture
`EnhancedTwoStreamLSTM` splits the 147-dim feature vector into two parallel streams and
processes them through a BiLSTM + self-attention + MLP-gate pipeline:
```
Input (B, T=32, 147)
│
├─ Stream A: Pose/Shape (73 dims)
│    Linear+LN+GELU → 96
│    2-layer BiLSTM (h=96) → (B, T, 192)
│    LayerNorm → Self-MHA (8 heads) + residual + post-LN
│    mean+max pool → pool_LN → ctx_a (B, 192)
│
├─ Stream B: Motion/Dynamics (74 dims)
│    (identical structure) → ctx_b (B, 192)
│
├─ MLP cross-stream gate
│    gate_a = Sigmoid(
│               Linear(96→192)(
│                 Tanh(Linear(192→96)(ctx_b))))
│    ctx_a  = LN(ctx_a × gate_a + ctx_a)
│    gate_b = Sigmoid(
│               Linear(96→192)(
│                 Tanh(Linear(192→96)(ctx_a))))
│    ctx_b  = LN(ctx_b × gate_b + ctx_b)
│
└─ cat(ctx_a, ctx_b) → (384,)
     LN → Linear(384→192) → GELU → Dropout → Linear(192→10)
```
**Design rationale:**
- BiLSTMs encode temporal order via their recurrent cell state, so no positional encoding is needed.
- Mean+Max pooling captures both sustained gesture shape (mean) and transient click events (max).
- The 2-layer MLP gate provides non-linear cross-modal recalibration at ~37 K params
(vs ~263 K for full MHA cross-attention with a degenerate mean-pooled query).
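The gate and pooling described above can be sketched in PyTorch as follows. This is a minimal illustration, not the Maestro implementation: the class name `CrossStreamGate`, the helper `mean_max_pool`, and the choice to sum the mean and max (which keeps the 192-dim context shown in the diagram) are assumptions. Whether the two gates share weights is not stated in this card; the sketch builds a single gate, whose MLP happens to hold about 37 K parameters.

```python
import torch
from torch import nn


class CrossStreamGate(nn.Module):
    """Hypothetical sketch of the 2-layer MLP gate (not the Maestro source)."""

    def __init__(self, ctx_dim: int = 192, bottleneck: int = 96) -> None:
        super().__init__()
        # Linear(192->96) -> Tanh -> Linear(96->192) -> Sigmoid, as in the diagram.
        self.mlp = nn.Sequential(
            nn.Linear(ctx_dim, bottleneck),
            nn.Tanh(),
            nn.Linear(bottleneck, ctx_dim),
            nn.Sigmoid(),
        )
        self.norm = nn.LayerNorm(ctx_dim)

    def forward(self, ctx_self: torch.Tensor, ctx_other: torch.Tensor) -> torch.Tensor:
        gate = self.mlp(ctx_other)  # (B, ctx_dim), values in (0, 1)
        # Gated residual followed by LayerNorm: LN(ctx * gate + ctx).
        return self.norm(ctx_self * gate + ctx_self)


def mean_max_pool(seq: torch.Tensor) -> torch.Tensor:
    """Pool (B, T, D) over time. ASSUMPTION: mean and max are summed, keeping D."""
    return seq.mean(dim=1) + seq.max(dim=1).values


gate = CrossStreamGate()
mlp_params = sum(p.numel() for p in gate.mlp.parameters())

ctx_a = mean_max_pool(torch.randn(4, 16, 192))  # stand-in for the Stream A output
ctx_b = mean_max_pool(torch.randn(4, 16, 192))  # stand-in for the Stream B output
gated_a = gate(ctx_a, ctx_b)  # Stream A context recalibrated by Stream B
```

Because the gate only sees the pooled context of the other stream, its cost is independent of the window length `T`.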
## Gesture Classes
| Class | Description |
|-------|-------------|
| `fist` | Closed fist (all fingers curled, thumb tucked) |
| `swiping_right` | Horizontal swipe from left to right |
| `swiping_left` | Horizontal swipe from right to left |
| `swiping_down` | Vertical swipe downward |
| `swiping_up` | Vertical swipe upward |
| `zooming_in_full_hand` | Pinch-open / spread fingers away from each other |
| `zooming_out_full_hand` | Pinch-close / bring fingers together |
| `point_one` | Single-finger pointing gesture (continuous laser-pointer control) |
| `point_two` | Two-finger pointing gesture (continuous annotation-pen control) |
| `unknown` | Background / transition / no gesture |
## Gesture Usage In Presentation System
| Class | Mode | Command | Runtime handling |
|-------|------|---------|------------------|
| `fist` | `discrete` | `erase_annotations` | Discrete command via GestureActivationController → CommandDispatcher |
| `swiping_right` | `discrete` | `next_slide` | Discrete command via GestureActivationController → CommandDispatcher |
| `swiping_left` | `discrete` | `previous_slide` | Discrete command via GestureActivationController → CommandDispatcher |
| `swiping_down` | `discrete` | `stop_presentation` | Discrete command via GestureActivationController → CommandDispatcher |
| `swiping_up` | `discrete` | `start_presentation` | Discrete command via GestureActivationController → CommandDispatcher |
| `zooming_in_full_hand` | `discrete` | `zoom_in_view` | Discrete command via GestureActivationController → CommandDispatcher |
| `zooming_out_full_hand` | `discrete` | `zoom_out_view` | Discrete command via GestureActivationController → CommandDispatcher |
| `point_one` | `continuous` | `-` | Continuous tracker: LaserPointerTracker (bypasses discrete dispatcher) |
| `point_two` | `continuous` | `-` | Continuous tracker: AnnotationPenTracker (bypasses discrete dispatcher) |
| `unknown` | `discrete` | `no_action` | No-op background class |
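The routing in the table can be expressed as a small lookup. The sketch below is illustrative only: `GESTURE_ROUTES` and `route_prediction` are hypothetical names, and the returned strings merely name the component from the table that would handle the gesture. The real `GestureActivationController`, `CommandDispatcher`, and tracker APIs are not documented in this card.

```python
from typing import Dict, Optional, Tuple

# Class -> (mode, command), copied from the table above.
GESTURE_ROUTES: Dict[str, Tuple[str, Optional[str]]] = {
    "fist": ("discrete", "erase_annotations"),
    "swiping_right": ("discrete", "next_slide"),
    "swiping_left": ("discrete", "previous_slide"),
    "swiping_down": ("discrete", "stop_presentation"),
    "swiping_up": ("discrete", "start_presentation"),
    "zooming_in_full_hand": ("discrete", "zoom_in_view"),
    "zooming_out_full_hand": ("discrete", "zoom_out_view"),
    "point_one": ("continuous", None),
    "point_two": ("continuous", None),
    "unknown": ("discrete", "no_action"),
}

# Continuous classes bypass the discrete dispatcher and go to a tracker.
CONTINUOUS_TRACKERS: Dict[str, str] = {
    "point_one": "LaserPointerTracker",
    "point_two": "AnnotationPenTracker",
}


def route_prediction(label: str) -> Tuple[str, Optional[str]]:
    """Return (handler name, command) for a predicted class label."""
    mode, command = GESTURE_ROUTES[label]
    if mode == "continuous":
        return CONTINUOUS_TRACKERS[label], None
    if command == "no_action":
        return "noop", None  # background class: nothing is dispatched
    return "CommandDispatcher", command
```

For example, `route_prediction("swiping_right")` names the dispatcher together with `next_slide`, while `route_prediction("point_one")` names the laser-pointer tracker and carries no command.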
## Feature Schema (`feature-schema-v1`)
| Block | Dims | Description |
|-------|------|-------------|
| `position` | 0–62 | 21 wrist-relative, scale-normalised landmark positions (x, y, z) |
| `fingertip_spread` | 63–67 | 5 inter-fingertip Euclidean distances |
| `wrist_trajectory` | 68–70 | Net wrist displacement from oldest frame in the window |
| `velocity` | 71–133 | 21 per-landmark wrist-relative velocity vectors (Δposition per unit time) |
| `joint_angles` | 134–143 | 10 MCP + PIP joint angles in radians |
| `wrist_vel_raw` | 144–146 | Camera-normalised wrist velocity (x, y, z); key directional signal |
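As a sanity check on the schema, the inclusive index ranges in the table can be turned into Python slices and summed. The helper names below are hypothetical. The assignment of blocks to streams is also an assumption: this card does not say which blocks feed which stream, and the grouping shown is just one split that reproduces the 73/74 dimension counts from the architecture diagram.

```python
from typing import Dict, Tuple

# Inclusive (start, end) indices, copied from the table above.
FEATURE_BLOCKS: Dict[str, Tuple[int, int]] = {
    "position": (0, 62),
    "fingertip_spread": (63, 67),
    "wrist_trajectory": (68, 70),
    "velocity": (71, 133),
    "joint_angles": (134, 143),
    "wrist_vel_raw": (144, 146),
}


def block_width(name: str) -> int:
    """Number of dimensions in a block (the table ranges are inclusive)."""
    start, end = FEATURE_BLOCKS[name]
    return end - start + 1


def block_slice(name: str) -> slice:
    """Half-open slice usable on the last axis of a (..., 147) array or tensor."""
    start, end = FEATURE_BLOCKS[name]
    return slice(start, end + 1)


total_dims = sum(block_width(name) for name in FEATURE_BLOCKS)

# ASSUMPTION: inferred from the dimension counts only, not from the Maestro source.
POSE_SHAPE_BLOCKS = ("position", "joint_angles")
MOTION_BLOCKS = ("fingertip_spread", "wrist_trajectory", "velocity", "wrist_vel_raw")

pose_dims = sum(block_width(name) for name in POSE_SHAPE_BLOCKS)
motion_dims = sum(block_width(name) for name in MOTION_BLOCKS)
```

With these ranges the blocks cover all 147 dimensions without gaps or overlap.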
## How to Use
```python
import torch
from huggingface_hub import hf_hub_download
from maestro.infrastructure.model.checkpoint_loader import load_inference_artifact

# Download the artifact (cached after the first call)
local_path = hf_hub_download(
    repo_id="ntsrigaud/maestro-lstm-hybrid",
    filename="two_stream_attn_v1_2layer_ld_cong_finetune_20260515T134706Z_inference.pt",
)

# Load the artifact (includes model, class labels, and feature schema)
artifact = load_inference_artifact(
    artifact_path=local_path,
    device=torch.device("cpu"),
)
artifact.model.eval()

# Build a 147-dim feature vector per frame using LandmarkFeatureTransformer
# and fill a 32-frame SlidingWindowSequenceBuffer to obtain `window_np`
# (a NumPy array of shape (32, 147)), then:
with torch.no_grad():
    # tensor shape: (batch=1, T=32, F=147)
    window_tensor = torch.tensor(window_np, dtype=torch.float32).unsqueeze(0)
    logits = artifact.model(window_tensor)
    pred_class = artifact.class_labels[logits.argmax(dim=1).item()]
```
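The snippet above expects a `window_np` array that already holds one full window of feature vectors. The real `SlidingWindowSequenceBuffer` API is not shown in this card, so the class below is a hypothetical stand-in that illustrates the idea with a fixed-length `deque`; the window size is a constructor argument and should match the window length the checkpoint was trained with.

```python
from collections import deque

import numpy as np


class WindowBuffer:
    """Hypothetical stand-in for SlidingWindowSequenceBuffer (not the Maestro class)."""

    def __init__(self, window_size: int, feature_dim: int = 147) -> None:
        self.window_size = window_size
        self.feature_dim = feature_dim
        # A bounded deque drops the oldest frame automatically once it is full.
        self._frames: deque = deque(maxlen=window_size)

    def push(self, features: np.ndarray) -> None:
        features = np.asarray(features, dtype=np.float32)
        if features.shape != (self.feature_dim,):
            raise ValueError(f"expected shape ({self.feature_dim},), got {features.shape}")
        self._frames.append(features)

    @property
    def is_full(self) -> bool:
        return len(self._frames) == self.window_size

    def as_array(self) -> np.ndarray:
        """Return the current window as a (window_size, feature_dim) array, oldest frame first."""
        if not self.is_full:
            raise RuntimeError("window is not full yet")
        return np.stack(list(self._frames))


buffer = WindowBuffer(window_size=32)
for frame_idx in range(40):  # 40 synthetic frames; only the last 32 are kept
    buffer.push(np.full(147, frame_idx, dtype=np.float32))
window_np = buffer.as_array()
```

In a live loop, inference would run only once `buffer.is_full` is true, and then again after each newly pushed frame.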
## Training Dataset
- **Source**: Hybrid merge of Jester and IPN-Hand windows: Jester provides unknown/swiping/zoom/stop_sign classes; IPN-Hand provides point_one and point_two
- **Used classes**: 10 (9 active gestures + `unknown` background)
- **Dataset split**: 70% train / 15% val / 15% test (stratified by class)
- **Augmentation**: temporal scale ±20%, spatial jitter σ=0.005
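The two augmentations can be sketched with NumPy as below. This is an illustration under assumptions, not the training pipeline's code: the function names are hypothetical, the temporal scaling is implemented as linear resampling at a random speed factor in [0.8, 1.2], and the jitter is applied to every feature dimension, whereas the actual pipeline may jitter raw landmarks before feature extraction.

```python
import numpy as np


def temporal_scale(window: np.ndarray, rng: np.random.Generator,
                   max_scale: float = 0.2) -> np.ndarray:
    """Resample a (T, F) window at a random speed factor in [1 - max_scale, 1 + max_scale]."""
    num_frames = window.shape[0]
    factor = rng.uniform(1.0 - max_scale, 1.0 + max_scale)
    # Fractional source positions; clipping holds the last frame when playback is faster.
    src = np.clip(np.arange(num_frames) * factor, 0, num_frames - 1)
    lower = np.floor(src).astype(int)
    upper = np.minimum(lower + 1, num_frames - 1)
    frac = (src - lower)[:, None]
    # Linear interpolation between neighbouring frames.
    return ((1.0 - frac) * window[lower] + frac * window[upper]).astype(window.dtype)


def spatial_jitter(window: np.ndarray, rng: np.random.Generator,
                   sigma: float = 0.005) -> np.ndarray:
    """Add zero-mean Gaussian noise with standard deviation sigma to every value."""
    noise = rng.normal(0.0, sigma, size=window.shape)
    return (window + noise).astype(window.dtype)


rng = np.random.default_rng(0)
window = np.linspace(0.0, 1.0, 16 * 147, dtype=np.float32).reshape(16, 147)
scaled = temporal_scale(window, rng)
augmented = spatial_jitter(scaled, rng)
```

Both functions preserve the window shape, so augmented samples can be batched with unaugmented ones.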
## Training Strategy
Two-phase transfer learning pipeline:
- **Phase 1 (pretraining):** backbone initialised from the external checkpoint `two_stream_attn_v1_2layer_pretrain_20260515T125437Z.pt`, which was pretrained to learn generic gesture dynamics.
- **Phase 2 (fine-tuning):** head replaced and model adapted to the Hybrid Jester+IPN 10-gesture vocabulary.
  - **Stage A (frozen backbone):** 10 epochs of head-only warmup.
  - **Stage B (full model):** up to 66 epochs of joint fine-tuning with scheduler/early stopping.
  - **Stage B retention defences:** replay_max_samples_per_class=500, distillation_weight=0.0, replay_ce_weight=0.0, backbone_lr_multiplier=0.1, ewc_weight=N/A, gpm_components=0, forgetting_penalty_weight=0.5.
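The two fine-tuning stages can be sketched in PyTorch as follows. Everything here is illustrative: `TinyGestureNet` is a toy stand-in with `backbone` and `head` attributes rather than the Maestro model, the optimizer type (AdamW) is an assumption because this card does not name it, and using 3e-5 as the base learning rate for both stages is likewise assumed. Only the freeze-then-unfreeze pattern and `backbone_lr_multiplier=0.1` come from the description above.

```python
import torch
from torch import nn


class TinyGestureNet(nn.Module):
    """Toy stand-in exposing `backbone` and `head`; NOT the Maestro architecture."""

    def __init__(self, num_classes: int = 10) -> None:
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(147, 32), nn.GELU())
        self.head = nn.Linear(32, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.backbone(x).mean(dim=1))  # (B, T, 147) -> (B, classes)


def stage_a_optimizer(model: TinyGestureNet, lr: float) -> torch.optim.Optimizer:
    """Stage A: freeze the backbone and train only the new head."""
    for param in model.backbone.parameters():
        param.requires_grad = False
    return torch.optim.AdamW(model.head.parameters(), lr=lr)


def stage_b_optimizer(model: TinyGestureNet, lr: float,
                      backbone_lr_multiplier: float = 0.1) -> torch.optim.Optimizer:
    """Stage B: unfreeze everything, with a smaller learning rate for the backbone."""
    for param in model.backbone.parameters():
        param.requires_grad = True
    return torch.optim.AdamW([
        {"params": model.backbone.parameters(), "lr": lr * backbone_lr_multiplier},
        {"params": model.head.parameters(), "lr": lr},
    ])


model = TinyGestureNet()
opt_a = stage_a_optimizer(model, lr=3e-5)
frozen_during_a = all(not p.requires_grad for p in model.backbone.parameters())
opt_b = stage_b_optimizer(model, lr=3e-5)
```

The reduced backbone learning rate lets the pretrained features drift more slowly than the freshly initialised head, which is one common way to limit forgetting.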
## Training Configuration
| Parameter | Value |
|-----------|-------|
| Architecture | EnhancedTwoStreamLSTM (BiLSTM h=96×2, MHA 8 heads, proj=96, mean+max pool, MLP gate) |
| Input size | 147 |
| Hidden size | 96/stream (BiLSTM output: 192) |
| Projection dim | 96 |
| Num layers | 2 |
| MHA heads | 8 (head dim: 24) |
| Dropout | 0.4 |
| Learning rate | 3e-05 |
| Weight decay | 0.001 |
| Batch size | 128 |
| Max epochs | 80 |
| Early stopping patience | 20 |
| Label smoothing | 0.05 |
| Class weighting | disabled |
| Max samples per class | 3000 |
| LR scheduler | ReduceLROnPlateau (factor=0.5, patience=10) |
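A minimal sketch of how the table's loss, scheduler, and early-stopping settings fit together in PyTorch is shown below. The linear `model` is a placeholder, the optimizer type (AdamW) is an assumption since the table does not name it, and `run_schedule` is a hypothetical helper driven by simulated validation losses rather than real training.

```python
import math
from typing import Sequence

import torch
from torch import nn

model = nn.Linear(147, 10)  # placeholder model, not EnhancedTwoStreamLSTM
criterion = nn.CrossEntropyLoss(label_smoothing=0.05)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, weight_decay=0.001)  # AdamW is assumed
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=10
)

# One loss evaluation on random data, just to show the criterion in use.
loss = criterion(model(torch.randn(8, 147)), torch.randint(0, 10, (8,)))


def run_schedule(val_losses: Sequence[float], patience: int = 20) -> int:
    """Step the LR scheduler per epoch and return the epoch at which training stops."""
    best, stale = math.inf, 0
    for epoch, val_loss in enumerate(val_losses, start=1):
        scheduler.step(val_loss)  # halves the LR after more than 10 epochs without improvement
        if val_loss < best:
            best, stale = val_loss, 0
        else:
            stale += 1
        if stale >= patience:  # early stopping after 20 stale epochs
            return epoch
    return len(val_losses)


# Simulated plateau: the validation loss never improves after the first epoch.
stopped_epoch = run_schedule([1.0] * 80)
final_lr = optimizer.param_groups[0]["lr"]
```

With a scheduler patience of 10 and an early-stopping patience of 20, a plateau triggers one learning-rate reduction before training is stopped.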
## Evaluation Results (Test Set)
| Metric | Value |
|--------|-------|
| Accuracy | 96.1% |
| Macro F1 | 95.9% |
### Per-Class Recall
| Class | Recall |
|-------|--------|
| `fist` | 97.3% |
| `swiping_right` | 97.1% |
| `swiping_left` | 98.3% |
| `swiping_down` | 98.0% |
| `swiping_up` | 98.2% |
| `zooming_in_full_hand` | 97.0% |
| `zooming_out_full_hand` | 95.1% |
| `point_one` | 97.4% |
| `point_two` | 95.1% |
| `unknown` | 85.7% |
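For reference, per-class recall is the diagonal of the confusion matrix divided by each row's total. The sketch below uses hypothetical helper names, with the recall values copied from the table above. It also averages the reported recalls: the unweighted mean is about 95.9%, which is macro recall and not the same quantity as the macro F1 reported earlier.

```python
from typing import Dict, List, Sequence

# Per-class recall in percent, copied from the table above.
REPORTED_RECALL: Dict[str, float] = {
    "fist": 97.3,
    "swiping_right": 97.1,
    "swiping_left": 98.3,
    "swiping_down": 98.0,
    "swiping_up": 98.2,
    "zooming_in_full_hand": 97.0,
    "zooming_out_full_hand": 95.1,
    "point_one": 97.4,
    "point_two": 95.1,
    "unknown": 85.7,
}


def per_class_recall(confusion: Sequence[Sequence[int]]) -> List[float]:
    """Recall per class from a confusion matrix (rows = true class, columns = predicted)."""
    recalls = []
    for idx, row in enumerate(confusion):
        support = sum(row)
        recalls.append(row[idx] / support if support else 0.0)
    return recalls


macro_recall = sum(REPORTED_RECALL.values()) / len(REPORTED_RECALL)
weakest_class = min(REPORTED_RECALL, key=REPORTED_RECALL.get)
toy_recall = per_class_recall([[8, 2], [1, 9]])  # toy 2-class matrix, not real results
```

The `unknown` background class has the lowest recall in the table.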
## Comparison with Previous Architecture
| Feature | TwoStreamGestureLSTM | EnhancedTwoStreamLSTM |
|---------|---------------------|-----------------------|
| LSTM direction | Unidirectional | **Bidirectional** |
| Attention | Bahdanau (scalar) | **MHA Q/K/V (8 heads)** |
| Feature projection | No | **Yes (→96)** |
| Temporal pooling | Mean only | **Mean + Max** |
| Cross-stream fusion | Concat only | **2-layer MLP gate** |
| Parameters | ~182 K | 1,208,554 |
## Limitations and Risks
- Trained only on subjects from the Jester and IPN Hand datasets. Performance may degrade with hand sizes,
skin tones, or lighting conditions not represented in the training data.
- The `unknown` class represents background/transition frames. At runtime, predictions
are filtered through per-class confidence thresholds defined in `production_hybrid.yaml`.
- Requires **mediapipe>=0.10.14** for landmark extraction at inference time.
- Not intended for safety-critical or accessibility-critical applications.
- Performance was measured on a held-out test split from the same dataset; real-world
generalisation may differ.
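The per-class confidence filtering mentioned above can be sketched as follows. The layout of `production_hybrid.yaml` is not shown in this card, so the threshold values, the default threshold, and the function names are all hypothetical; the sketch only illustrates falling back to `unknown` when the top class is not confident enough.

```python
import math
from typing import Dict, List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def filter_prediction(
    logits: Sequence[float],
    class_labels: Sequence[str],
    thresholds: Dict[str, float],
    default_threshold: float = 0.5,  # hypothetical default
    fallback: str = "unknown",
) -> Tuple[str, float]:
    """Return (label, confidence), replacing low-confidence predictions with the fallback."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    label = class_labels[best]
    if probs[best] < thresholds.get(label, default_threshold):
        return fallback, probs[best]
    return label, probs[best]


labels = ["fist", "swiping_right", "unknown"]  # shortened label list for the example
thresholds = {"fist": 0.9}                     # hypothetical threshold value
confident = filter_prediction([4.0, 0.0, 0.0], labels, thresholds)
rejected = filter_prediction([1.0, 0.5, 0.0], labels, thresholds)
```

Here the first call keeps `fist` because its probability clears the threshold, while the second falls back to `unknown`.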
## Environmental Impact
Training was performed on CPU/MPS. Estimated training time: ~10 minutes.
Estimated CO₂ equivalent: negligible (<0.001 kg CO₂eq).
---
*Generated by the Maestro training pipeline on 2026-05-15.*