---
language:
- en
license: mit
tags:
- gesture-recognition
- hand-gesture
- pytorch
- mediapipe
- temporal-model
- lstm
- attention
- bidirectional
datasets:
- IPN-Hand
metrics:
- accuracy
- f1
model-index:
- name: two_stream_attn_v1_2layer_ld_cong_finetune_20260515T134706Z
  results:
  - task:
      type: gesture-recognition
    dataset:
      name: IPN Hand
      type: IPN-Hand
    metrics:
    - type: accuracy
      value: 0.9606
    - type: f1
      value: 0.9587
---

# two_stream_attn_v1_2layer_ld_cong_finetune_20260515T134706Z

A real-time hand gesture classifier trained on a hybrid Jester+IPN gesture dataset
(Jester dynamic classes + IPN pointing classes).

This model is part of the **Maestro** pipeline that enables touchless
control of presentation and meeting software through hand gestures captured from a
standard webcam using MediaPipe for landmark extraction.

## Model Description

- **Architecture**: EnhancedTwoStreamLSTM (BiLSTM h=96×2, MHA 8 heads, proj=96, mean+max pool, MLP gate)
- **Parameters**: 1,208,554
- **Input**: `(batch, 16, 147)`, a 16-frame sliding window at 30 FPS (≈ 533 ms)
- **Output**: Logits over 10 gesture classes (apply softmax to obtain probabilities)
- **Inference latency**: < 1 ms per call (CPU, single sample)
- **Feature schema**: `feature-schema-v1`

## Architecture

`EnhancedTwoStreamLSTM` splits the 147-dim feature vector into two parallel streams and
processes them through a BiLSTM + self-attention + MLP-gate pipeline:

```
Input (B, T=32, 147)
    │
    ├─ Stream A: Pose/Shape (73 dims)
    │   Linear+LN+GELU → 96
    │   2-layer BiLSTM (h=96) → (B, T, 192)
    │   LayerNorm → Self-MHA (8 heads) + residual + post-LN
    │   mean+max pool → pool_LN → ctx_a (B, 192)
    │
    ├─ Stream B: Motion/Dynamics (74 dims)
    │   (identical structure) → ctx_b (B, 192)
    │
    ├─ MLP cross-stream gate
    │   gate_a = Sigmoid(
    │     Linear(96→192)(
    │       Tanh(Linear(192→96)(ctx_b))))
    │   ctx_a  = LN(ctx_a × gate_a + ctx_a)
    │   gate_b = Sigmoid(
    │     Linear(96→192)(
    │       Tanh(Linear(192→96)(ctx_a))))
    │   ctx_b  = LN(ctx_b × gate_b + ctx_b)
    │
    └─ cat(ctx_a, ctx_b) → (384,)
       LN → Linear(384→192) → GELU → Dropout → Linear(192→10)
```
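
The cross-stream gate in the diagram can be sketched in PyTorch as follows. This is a minimal illustration, not the Maestro source: the class name `CrossStreamGate` is hypothetical, and it assumes `gate_b` is computed from the already-updated `ctx_a`, which is one reading of the diagram.

```python
import torch
from torch import nn


class CrossStreamGate(nn.Module):
    """Hypothetical sketch: gate one stream's context with the other's."""

    def __init__(self, dim: int = 192, bottleneck: int = 96) -> None:
        super().__init__()
        # Sigmoid(Linear(96->192)(Tanh(Linear(192->96)(other))))
        self.mlp = nn.Sequential(
            nn.Linear(dim, bottleneck),
            nn.Tanh(),
            nn.Linear(bottleneck, dim),
            nn.Sigmoid(),
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, ctx: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        gate = self.mlp(other)              # (B, dim), values in (0, 1)
        return self.norm(ctx * gate + ctx)  # gated residual, then LayerNorm


gate_a, gate_b = CrossStreamGate(), CrossStreamGate()
ctx_a, ctx_b = torch.randn(4, 192), torch.randn(4, 192)

new_a = gate_a(ctx_a, ctx_b)
new_b = gate_b(ctx_b, new_a)  # assumption: uses the updated ctx_a
fused = torch.cat([new_a, new_b], dim=-1)  # (B, 384), input to the classifier head
```

Because the gate multiplies and then adds the original context back, each stream is scaled by a factor between 1 and 2 per dimension before normalisation, so a stream is never fully suppressed.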

**Design rationale:**
- BiLSTMs encode temporal order via their recurrent cell state, so no positional encoding is needed.
- Mean+Max pooling captures both sustained gesture shape (mean) and transient click events (max).
- The 2-layer MLP gate provides non-linear cross-modal recalibration at ~37 K params
  (vs ~263 K for full MHA cross-attention with a degenerate mean-pooled query).
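
The ~37 K figure for the gate can be checked with a short calculation. The sketch below assumes both linear layers carry bias terms (the PyTorch default) and counts a single gate; `linear_params` is a helper introduced here for illustration.

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


# One gate: Linear(192 -> 96) followed by Linear(96 -> 192)
gate_params = linear_params(192, 96) + linear_params(96, 192)
print(gate_params)  # 37152
```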

## Gesture Classes

| Class | Description |
|-------|-------------|
| `fist` | Closed fist (all fingers curled, thumb tucked) |
| `swiping_right` | Horizontal swipe from left to right |
| `swiping_left` | Horizontal swipe from right to left |
| `swiping_down` | Vertical swipe downward |
| `swiping_up` | Vertical swipe upward |
| `zooming_in_full_hand` | Pinch-open / spread fingers away from each other |
| `zooming_out_full_hand` | Pinch-close / bring fingers together |
| `point_one` | Single-finger pointing gesture (continuous laser-pointer control) |
| `point_two` | Two-finger pointing gesture (continuous annotation-pen control) |
| `unknown` | Background / transition / no gesture |

## Gesture Usage In Presentation System

| Class | Mode | Command | Runtime handling |
|-------|------|---------|------------------|
| `fist` | `discrete` | `erase_annotations` | Discrete command via GestureActivationController → CommandDispatcher |
| `swiping_right` | `discrete` | `next_slide` | Discrete command via GestureActivationController → CommandDispatcher |
| `swiping_left` | `discrete` | `previous_slide` | Discrete command via GestureActivationController → CommandDispatcher |
| `swiping_down` | `discrete` | `stop_presentation` | Discrete command via GestureActivationController → CommandDispatcher |
| `swiping_up` | `discrete` | `start_presentation` | Discrete command via GestureActivationController → CommandDispatcher |
| `zooming_in_full_hand` | `discrete` | `zoom_in_view` | Discrete command via GestureActivationController → CommandDispatcher |
| `zooming_out_full_hand` | `discrete` | `zoom_out_view` | Discrete command via GestureActivationController → CommandDispatcher |
| `point_one` | `continuous` | `-` | Continuous tracker: LaserPointerTracker (bypasses discrete dispatcher) |
| `point_two` | `continuous` | `-` | Continuous tracker: AnnotationPenTracker (bypasses discrete dispatcher) |
| `unknown` | `discrete` | `no_action` | No-op background class |
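
The routing in the table can be expressed as two lookup tables. This is only a sketch of the mapping: the `route` function and the dictionary names are hypothetical and do not reflect how `GestureActivationController` or `CommandDispatcher` are actually implemented.

```python
from typing import Tuple

# Class label -> discrete command, copied from the table above
DISCRETE_COMMANDS = {
    "fist": "erase_annotations",
    "swiping_right": "next_slide",
    "swiping_left": "previous_slide",
    "swiping_down": "stop_presentation",
    "swiping_up": "start_presentation",
    "zooming_in_full_hand": "zoom_in_view",
    "zooming_out_full_hand": "zoom_out_view",
    "unknown": "no_action",
}

# Class label -> continuous tracker name, copied from the table above
CONTINUOUS_TRACKERS = {
    "point_one": "LaserPointerTracker",
    "point_two": "AnnotationPenTracker",
}


def route(label: str) -> Tuple[str, str]:
    """Return (mode, target) for a predicted class label."""
    if label in CONTINUOUS_TRACKERS:
        return ("continuous", CONTINUOUS_TRACKERS[label])
    return ("discrete", DISCRETE_COMMANDS[label])


print(route("swiping_right"))
print(route("point_one"))
```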

## Feature Schema (`feature-schema-v1`)

| Block | Dims | Description |
|-------|------|-------------|
| `position` | 0–62 | 21 wrist-relative, scale-normalised landmark positions (x, y, z) |
| `fingertip_spread` | 63–67 | 5 inter-fingertip Euclidean distances |
| `wrist_trajectory` | 68–70 | Net wrist displacement from oldest frame in the window |
| `velocity` | 71–133 | 21 per-landmark wrist-relative velocity vectors (Δposition per unit time) |
| `joint_angles` | 134–143 | 10 MCP + PIP joint angles in radians |
| `wrist_vel_raw` | 144–146 | Camera-normalised wrist velocity (x, y, z); key directional signal |
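
The index ranges in the table translate directly into half-open slices. The snippet below is an illustrative helper, not part of `LandmarkFeatureTransformer`; the names `FEATURE_BLOCKS` and `split_features` are assumptions made for this example.

```python
# (start, stop) pairs are half-open, so "0–62" in the table becomes (0, 63)
FEATURE_BLOCKS = {
    "position": (0, 63),
    "fingertip_spread": (63, 68),
    "wrist_trajectory": (68, 71),
    "velocity": (71, 134),
    "joint_angles": (134, 144),
    "wrist_vel_raw": (144, 147),
}


def split_features(vector):
    """Split one 147-dim feature vector into its named blocks."""
    if len(vector) != 147:
        raise ValueError(f"expected 147 features, got {len(vector)}")
    return {name: vector[start:stop] for name, (start, stop) in FEATURE_BLOCKS.items()}


blocks = split_features(list(range(147)))
block_sizes = {name: len(values) for name, values in blocks.items()}
print(block_sizes)
```

The block sizes sum to 147, which matches the model's input width.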


## How to Use

```python
import numpy as np
import torch
from huggingface_hub import hf_hub_download
from maestro.infrastructure.model.checkpoint_loader import load_inference_artifact

# Download the artifact (cached after first call)
local_path = hf_hub_download(
    repo_id="ntsrigaud/maestro-lstm-hybrid",
    filename="two_stream_attn_v1_2layer_ld_cong_finetune_20260515T134706Z_inference.pt",
)

# Load the artifact (includes model, class labels, and feature schema)
artifact = load_inference_artifact(
    artifact_path=local_path,
    device=torch.device("cpu"),
)
artifact.model.eval()

# Build 147-dim feature vectors using LandmarkFeatureTransformer and fill a
# 32-frame SlidingWindowSequenceBuffer. The zero array below is only a
# placeholder for that window so the snippet runs end to end.
window_np = np.zeros((32, 147), dtype=np.float32)

with torch.no_grad():
    # tensor shape: (batch=1, T=32, F=147)
    window_tensor = torch.tensor(window_np, dtype=torch.float32).unsqueeze(0)
    logits = artifact.model(window_tensor)
    pred_class = artifact.class_labels[logits.argmax(dim=1).item()]
```
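
For readers without access to the Maestro package, the sliding-window idea can be sketched with a bounded deque. The `FrameWindow` class below is a hypothetical stand-in written for this example; it is not the `SlidingWindowSequenceBuffer` API. The demo uses a tiny window and feature size for brevity.

```python
from collections import deque
from typing import Deque, List, Optional, Sequence


class FrameWindow:
    """Hypothetical fixed-length buffer of per-frame feature vectors."""

    def __init__(self, window_size: int, feature_dim: int = 147) -> None:
        self.window_size = window_size
        self.feature_dim = feature_dim
        # maxlen makes the deque drop the oldest frame automatically
        self._frames: Deque[List[float]] = deque(maxlen=window_size)

    def push(self, features: Sequence[float]) -> None:
        if len(features) != self.feature_dim:
            raise ValueError("unexpected feature vector length")
        self._frames.append(list(features))

    def window(self) -> Optional[List[List[float]]]:
        """Return the (T, F) window, or None until enough frames have arrived."""
        if len(self._frames) < self.window_size:
            return None
        return list(self._frames)


buffer = FrameWindow(window_size=3, feature_dim=2)
buffer.push([0.0, 0.0])
not_ready = buffer.window()          # still filling
for i in range(1, 4):
    buffer.push([float(i), float(i)])
ready = buffer.window()              # the three most recent frames
```

Returning `None` until the buffer is full keeps the caller from running inference on a partially filled window.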

## Training Dataset

- **Source**: Hybrid merge of Jester and IPN-Hand windows. Jester provides the unknown/swiping/zoom/stop_sign classes; IPN-Hand provides `point_one` and `point_two`.
- **Used classes**: 10 (9 active gestures + `unknown` background)
- **Dataset split**: 70% train / 15% val / 15% test (stratified by class)
- **Augmentation**: temporal scale ±20%, spatial jitter σ=0.005
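
One plausible way to implement the two augmentations is shown below. This is a sketch under assumptions: the card does not say how temporal scaling is resampled or which feature blocks receive jitter, so the example uses linear interpolation in time and adds Gaussian noise to every feature. The function names are hypothetical.

```python
import numpy as np


def temporal_scale(window: np.ndarray, factor: float) -> np.ndarray:
    """Resample a (T, F) window at `factor` times the original speed, keeping T frames."""
    t = window.shape[0]
    frames = np.arange(t)
    # Positions to sample from; clipped so that fast playback repeats the last frame
    src = np.clip(frames * factor, 0, t - 1)
    columns = [np.interp(src, frames, window[:, j]) for j in range(window.shape[1])]
    return np.stack(columns, axis=1).astype(window.dtype)


def augment(window: np.ndarray, rng: np.random.Generator,
            max_scale: float = 0.2, sigma: float = 0.005) -> np.ndarray:
    """Apply a random temporal scale of ±max_scale, then Gaussian jitter with std sigma."""
    factor = rng.uniform(1.0 - max_scale, 1.0 + max_scale)
    scaled = temporal_scale(window, factor)
    noise = rng.normal(0.0, sigma, size=scaled.shape)
    return (scaled + noise).astype(window.dtype)


rng = np.random.default_rng(0)
clean = np.zeros((16, 147), dtype=np.float32)
augmented = augment(clean, rng)
```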

## Training Strategy

Two-phase transfer learning pipeline:
- **Phase 1 (pretraining):** backbone initialised from the external pretraining checkpoint `two_stream_attn_v1_2layer_pretrain_20260515T125437Z.pt`, which was trained to learn generic gesture dynamics.
- **Phase 2 (fine-tuning):** head replaced and model adapted on Hybrid Jester+IPN 10-gesture vocabulary.
- **Stage A (frozen backbone):** 10 epochs of head-only warmup.
- **Stage B (full model):** up to 66 epochs of joint fine-tuning with scheduler/early stopping.
- **Stage B retention defences:** replay_max_samples_per_class=500, distillation_weight=0.0, replay_ce_weight=0.0, backbone_lr_multiplier=0.1, ewc_weight=N/A, gpm_components=0, forgetting_penalty_weight=0.5.
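
The two fine-tuning stages can be sketched with parameter freezing and per-group learning rates. The toy `backbone`/`head` model and the choice of AdamW are assumptions made for illustration; the card does not name the optimizer or the module layout.

```python
import torch
from torch import nn

# Toy stand-in for the real network (assumption: a backbone plus a classifier head)
model = nn.ModuleDict({
    "backbone": nn.LSTM(147, 96, batch_first=True),
    "head": nn.Linear(96, 10),
})

BASE_LR = 3e-5
BACKBONE_LR_MULTIPLIER = 0.1


def set_backbone_trainable(net: nn.ModuleDict, trainable: bool) -> None:
    for param in net["backbone"].parameters():
        param.requires_grad = trainable


# Stage A: freeze the backbone and train only the head
set_backbone_trainable(model, False)
stage_a_opt = torch.optim.AdamW(model["head"].parameters(), lr=BASE_LR, weight_decay=0.001)
frozen_during_a = all(not p.requires_grad for p in model["backbone"].parameters())

# Stage B: unfreeze everything, with a smaller learning rate on the backbone
set_backbone_trainable(model, True)
stage_b_opt = torch.optim.AdamW(
    [
        {"params": model["backbone"].parameters(), "lr": BASE_LR * BACKBONE_LR_MULTIPLIER},
        {"params": model["head"].parameters(), "lr": BASE_LR},
    ],
    weight_decay=0.001,
)
```

A lower backbone learning rate is a common way to limit forgetting of pretrained features while the new head adapts.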

## Training Configuration

| Parameter | Value |
|-----------|-------|
| Architecture | EnhancedTwoStreamLSTM (BiLSTM h=96×2, MHA 8 heads, proj=96, mean+max pool, MLP gate) |
| Input size | 147 |
| Hidden size | 96/stream (BiLSTM output: 192) |
| Projection dim | 96 |
| Num layers | 2 |
| MHA heads | 8 (head dim: 24) |
| Dropout | 0.4 |
| Learning rate | 3e-05 |
| Weight decay | 0.001 |
| Batch size | 128 |
| Max epochs | 80 |
| Early stopping patience | 20 |
| Label smoothing | 0.05 |
| Class weighting | disabled |
| Max samples per class | 3000 |
| LR scheduler | ReduceLROnPlateau (factor=0.5, patience=10) |
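
The loss, scheduler, and early-stopping values in the table map onto standard PyTorch components roughly as follows. This is a sketch, not the Maestro training loop: the optimizer type (AdamW) is an assumption, the model is a stand-in, and the validation loss is a constant placeholder that never improves so that the patience logic is exercised.

```python
import torch
from torch import nn

model = nn.Linear(147, 10)  # stand-in for the real network

criterion = nn.CrossEntropyLoss(label_smoothing=0.05)  # class weighting disabled
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, weight_decay=0.001)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=10
)

MAX_EPOCHS = 80
EARLY_STOPPING_PATIENCE = 20

best_val_loss = float("inf")
epochs_without_improvement = 0
stopped_epoch = None

for epoch in range(MAX_EPOCHS):
    # ... one epoch of training with `criterion` and `optimizer` would go here ...
    val_loss = 1.0  # placeholder validation loss

    scheduler.step(val_loss)  # halves the LR after a plateau longer than 10 epochs

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1

    if epochs_without_improvement >= EARLY_STOPPING_PATIENCE:
        stopped_epoch = epoch
        break

final_lr = optimizer.param_groups[0]["lr"]
```

With scheduler patience at half the early-stopping patience, the learning rate is reduced at least once before training is abandoned.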

## Evaluation Results (Test Set)

| Metric | Value |
|--------|-------|
| Accuracy | 96.1% |
| Macro F1 | 95.9% |

### Per-Class Recall

| Class | Recall |
|-------|--------|
| `fist` | 97.3% |
| `swiping_right` | 97.1% |
| `swiping_left` | 98.3% |
| `swiping_down` | 98.0% |
| `swiping_up` | 98.2% |
| `zooming_in_full_hand` | 97.0% |
| `zooming_out_full_hand` | 95.1% |
| `point_one` | 97.4% |
| `point_two` | 95.1% |
| `unknown` | 85.7% |
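
Per-class recall and macro F1 can be computed from predictions as sketched below. The helper name and the four-sample toy data are illustrative assumptions and are unrelated to the reported test-set numbers.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def recall_and_macro_f1(
    y_true: Sequence[str], y_pred: Sequence[str], labels: Sequence[str]
) -> Tuple[Dict[str, float], float]:
    """Return per-class recall and the unweighted mean of per-class F1."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fn[truth] += 1
            fp[pred] += 1

    recall: Dict[str, float] = {}
    f1_scores: List[float] = []
    for label in labels:
        support = tp[label] + fn[label]
        predicted = tp[label] + fp[label]
        r = tp[label] / support if support else 0.0
        p = tp[label] / predicted if predicted else 0.0
        recall[label] = r
        f1_scores.append(2 * p * r / (p + r) if (p + r) else 0.0)
    return recall, sum(f1_scores) / len(f1_scores)


y_true = ["fist", "fist", "unknown", "unknown"]
y_pred = ["fist", "unknown", "unknown", "unknown"]
recall, macro_f1 = recall_and_macro_f1(y_true, y_pred, ["fist", "unknown"])
```

Macro F1 weights every class equally, so a weaker class such as `unknown` pulls the score down regardless of how many samples it has.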

## Comparison with Previous Architecture

| Feature | TwoStreamGestureLSTM | EnhancedTwoStreamLSTM |
|---------|---------------------|-----------------------|
| LSTM direction | Unidirectional | **Bidirectional** |
| Attention | Bahdanau (scalar) | **MHA Q/K/V (8 heads)** |
| Feature projection | No | **Yes (→96)** |
| Temporal pooling | Mean only | **Mean + Max** |
| Cross-stream fusion | Concat only | **2-layer MLP gate** |
| Parameters | ~182 K | 1,208,554 (~1.2 M) |

## Limitations and Risks

- Trained on Jester and IPN Hand subjects only. Performance may degrade with unusual hand sizes,
  skin tones, or lighting conditions not represented in training data.
- The `unknown` class represents background/transition frames. At runtime, predictions
  are filtered through per-class confidence thresholds defined in `production_hybrid.yaml`.
- Requires **mediapipe>=0.10.14** for landmark extraction at inference time.
- Not intended for safety-critical or accessibility-critical applications.
- Performance was measured on a held-out test split from the same dataset; real-world
  generalisation may differ.
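
Per-class confidence filtering of the kind mentioned above can be sketched as follows. The threshold values, the default of 0.5, and the function names are hypothetical; the real values live in `production_hybrid.yaml` and are not reproduced here.

```python
import math
from typing import Dict, List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def filter_prediction(
    logits: Sequence[float],
    labels: Sequence[str],
    thresholds: Dict[str, float],
    default_threshold: float = 0.5,  # hypothetical fallback
) -> Tuple[str, float]:
    """Return (label, confidence), falling back to 'unknown' below the class threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    label, confidence = labels[best], probs[best]
    if confidence < thresholds.get(label, default_threshold):
        return "unknown", confidence
    return label, confidence


labels = ["fist", "unknown"]
strict = filter_prediction([2.0, 0.0], labels, {"fist": 0.9})   # hypothetical threshold
lenient = filter_prediction([2.0, 0.0], labels, {"fist": 0.8})  # hypothetical threshold
```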

## Environmental Impact

Training was performed on CPU/MPS. Estimated training time: ~10 minutes.
Estimated CO₂ equivalent: negligible (<0.001 kg CO₂eq).

---

*Generated by the Maestro training pipeline on 2026-05-15.*