ntsrigaud committed on
Commit
0b50302
·
verified ·
1 Parent(s): 4eb309e

Upload two_stream_attn_v1_finetune_20260512T041947Z

Files changed (2)
  1. README.md +250 -0
  2. config.json +100 -0
README.md ADDED
@@ -0,0 +1,250 @@
+ ---
+ language:
+ - en
+ license: mit
+ tags:
+ - gesture-recognition
+ - hand-gesture
+ - pytorch
+ - mediapipe
+ - temporal-model
+ - lstm
+ - attention
+ - bidirectional
+ datasets:
+ - IPN-Hand
+ metrics:
+ - accuracy
+ - f1
+ model-index:
+ - name: two_stream_attn_v1_finetune_20260512T041947Z
+   results:
+   - task:
+       type: gesture-recognition
+     dataset:
+       name: IPN Hand
+       type: IPN-Hand
+     metrics:
+     - type: accuracy
+       value: 0.9898
+     - type: f1
+       value: 0.9917
+ ---
+
+ # two_stream_attn_v1_finetune_20260512T041947Z
+
+ A real-time hand gesture classifier trained on
+ a hybrid Jester+IPN gesture dataset (Jester dynamic classes + IPN pointing classes).
+
+ This model is part of the **Maestro** pipeline, which enables touchless
+ control of presentation and meeting software through hand gestures captured from a
+ standard webcam using MediaPipe for landmark extraction.
+
+ ## Model Description
+
+ - **Architecture**: EnhancedTwoStreamLSTM (BiLSTM h=96×2, MHA 8 heads, proj=96, mean+max pool, MLP gate)
+ - **Parameters**: 2,099,434
+ - **Input**: `(batch, 32, 147)`,
+   a 32-frame sliding window at 30 FPS ≈ 1067 ms
+ - **Output**: Logits over 10 gesture classes (apply softmax to obtain probabilities)
+ - **Inference latency**: < 1 ms per call (CPU, single sample)
+ - **Feature schema**: `feature-schema-v5`
+
+ ## Architecture
+
+ `EnhancedTwoStreamLSTM` splits the 147-dim feature vector into two parallel streams and
+ processes them through a BiLSTM + self-attention + MLP-gate pipeline:
+
+ ```
+ Input (B, T=32, 147)
+
+ ├─ Stream A: Pose/Shape (73 dims)
+ │    Linear+LN+GELU → 96
+ │    2-layer BiLSTM (h=96) → (B, T, 192)
+ │    LayerNorm → Self-MHA (8 heads) + residual + post-LN
+ │    mean+max pool → pool_LN → ctx_a (B, 192)
+
+ ├─ Stream B: Motion/Dynamics (74 dims)
+ │    (identical structure) → ctx_b (B, 192)
+
+ ├─ MLP cross-stream gate
+ │    gate_a = Sigmoid(
+ │        Linear(96→192)(
+ │            Tanh(Linear(192→96)(ctx_b))))
+ │    ctx_a = LN(ctx_a × gate_a + ctx_a)
+ │    gate_b = Sigmoid(
+ │        Linear(96→192)(
+ │            Tanh(Linear(192→96)(ctx_a))))
+ │    ctx_b = LN(ctx_b × gate_b + ctx_b)
+
+ └─ cat(ctx_a, ctx_b) → (384,)
+      LN → Linear(384→192) → GELU → Dropout → Linear(192→10)
+ ```
+
+ **Design rationale:**
+ - BiLSTMs encode temporal order via their recurrent cell state, so no positional encoding is needed.
+ - Mean+Max pooling captures both sustained gesture shape (mean) and transient click events (max).
+ - The 2-layer MLP gate provides non-linear cross-modal recalibration at ~37 K params
+   (vs ~263 K for full MHA cross-attention with a degenerate mean-pooled query).
+
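The ~37 K figure for the gate can be checked from the layer sizes in the diagram. The sketch below is plain arithmetic, not the Maestro code; it assumes both `Linear` layers carry bias terms (the PyTorch default), and `linear_params` is a hypothetical helper introduced only for this illustration. It also reproduces the 24-dim head size and the ≈ 1067 ms window length quoted elsewhere in the card.

```python
def linear_params(in_features: int, out_features: int, bias: bool = True) -> int:
    """Parameter count of a dense layer: weight matrix plus optional bias."""
    return in_features * out_features + (out_features if bias else 0)


CTX_DIM = 192     # pooled context size per stream, taken from the diagram
BOTTLENECK = 96   # hidden width of the gate MLP, taken from the diagram

# One gate = Linear(192 -> 96) followed by Linear(96 -> 192).
gate_params = linear_params(CTX_DIM, BOTTLENECK) + linear_params(BOTTLENECK, CTX_DIM)

# 8 attention heads over the 192-dim BiLSTM output.
head_dim = CTX_DIM // 8

# 32 frames at 30 FPS, expressed in milliseconds.
window_ms = 32 / 30 * 1000

print(gate_params, head_dim, round(window_ms))
```

Under these assumptions one gate MLP comes to 37,152 parameters, which matches the "~37 K" quoted in the design rationale.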
+ ## Gesture Classes
+
+ | Class | Description |
+ |-------|-------------|
+ | `no_gesture` | - |
+ | `point_one` | Single-finger pointing gesture (continuous laser-pointer control) |
+ | `point_two` | Two-finger pointing gesture (continuous annotation-pen control) |
+ | `stop_sign` | - |
+ | `swiping_down` | - |
+ | `swiping_left` | - |
+ | `swiping_right` | - |
+ | `swiping_up` | - |
+ | `zooming_in_full_hand` | - |
+ | `zooming_out_full_hand` | - |
+
+ ## Gesture Usage In Presentation System
+
+ | Class | Mode | Command | Runtime handling |
+ |-------|------|---------|------------------|
+ | `no_gesture` | `unmapped` | - | Not mapped in command_map_presentation.yaml |
+ | `point_one` | `continuous` | - | Continuous tracker: LaserPointerTracker (bypasses discrete dispatcher) |
+ | `point_two` | `continuous` | - | Continuous tracker: AnnotationPenTracker (bypasses discrete dispatcher) |
+ | `stop_sign` | `unmapped` | - | Not mapped in command_map_presentation.yaml |
+ | `swiping_down` | `unmapped` | - | Not mapped in command_map_presentation.yaml |
+ | `swiping_left` | `unmapped` | - | Not mapped in command_map_presentation.yaml |
+ | `swiping_right` | `unmapped` | - | Not mapped in command_map_presentation.yaml |
+ | `swiping_up` | `unmapped` | - | Not mapped in command_map_presentation.yaml |
+ | `zooming_in_full_hand` | `unmapped` | - | Not mapped in command_map_presentation.yaml |
+ | `zooming_out_full_hand` | `unmapped` | - | Not mapped in command_map_presentation.yaml |
+
+ ## Feature Schema (`feature-schema-v5`)
+
+ | Block | Indices | Description |
+ |-------|---------|-------------|
+ | `position` | 0–62 | 21 wrist-relative, scale-normalised landmark positions (x, y, z) |
+ | `fingertip_spread` | 63–67 | 5 inter-fingertip Euclidean distances |
+ | `wrist_trajectory` | 68–70 | Net wrist displacement from oldest frame in the window |
+ | `velocity` | 71–133 | 21 per-landmark wrist-relative velocity vectors (Δposition per unit time) |
+ | `joint_angles` | 134–143 | 10 MCP + PIP joint angles in radians |
+ | `wrist_vel_raw` | 144–146 | Camera-normalised wrist velocity (x, y, z); key directional signal |
+
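The index ranges in the schema table can be encoded as a small lookup and checked for gaps or overlaps. This is a minimal sketch built only from the table; `FEATURE_BLOCKS`, `block_sizes`, and `slice_block` are hypothetical names for illustration and are not part of the Maestro package.

```python
# Inclusive (start, end) index ranges copied from the feature-schema-v5 table.
FEATURE_BLOCKS = {
    "position": (0, 62),
    "fingertip_spread": (63, 67),
    "wrist_trajectory": (68, 70),
    "velocity": (71, 133),
    "joint_angles": (134, 143),
    "wrist_vel_raw": (144, 146),
}


def block_sizes(blocks):
    """Return {name: width}; raise if the blocks overlap or leave a gap."""
    sizes = {}
    expected_start = 0
    for name, (start, end) in blocks.items():
        if start != expected_start:
            raise ValueError(f"{name} starts at {start}, expected {expected_start}")
        sizes[name] = end - start + 1
        expected_start = end + 1
    return sizes


def slice_block(frame, name):
    """Extract one named block from a single 147-dim frame (any sequence)."""
    start, end = FEATURE_BLOCKS[name]
    return frame[start:end + 1]


sizes = block_sizes(FEATURE_BLOCKS)
total = sum(sizes.values())
print(sizes, total)
```

The six blocks tile indices 0 to 146 exactly, giving the 147-dim input. The card does not say which blocks feed which stream, so the sketch makes no assumption about the 73/74 split.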
+ ## How to Use
+
+ ```python
+ import torch
+ from huggingface_hub import hf_hub_download
+ from maestro.infrastructure.model.checkpoint_loader import load_inference_artifact
+
+ # Download the artifact (cached after first call)
+ local_path = hf_hub_download(
+     repo_id="ntsrigaud/maestro-lstm-hybrid",
+     filename="two_stream_attn_v1_finetune_20260512T041947Z_inference.pt",
+ )
+
+ # Load the artifact (includes model, class labels, and feature schema)
+ artifact = load_inference_artifact(
+     artifact_path=local_path,
+     device=torch.device("cpu"),
+ )
+ artifact.model.eval()
+
+ # Build a 147-dim feature vector per frame using LandmarkFeatureTransformer
+ # and fill a 32-frame SlidingWindowSequenceBuffer; `window_np` below is the
+ # resulting float array of shape (32, 147).
+ with torch.no_grad():
+     # tensor shape: (batch=1, T=32, F=147)
+     window_tensor = torch.tensor(window_np, dtype=torch.float32).unsqueeze(0)
+     logits = artifact.model(window_tensor)
+     pred_class = artifact.class_labels[logits.argmax(dim=1).item()]
+ ```
+
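The usage snippet expects a full 32-frame window of 147-dim feature vectors. The sketch below shows one way such a window could be maintained frame by frame. `WindowBuffer` is a hypothetical stand-in written for this illustration; it is not Maestro's `SlidingWindowSequenceBuffer`, whose interface is not shown in this commit.

```python
from collections import deque


class WindowBuffer:
    """Hypothetical fixed-length sliding window over per-frame feature vectors."""

    def __init__(self, window_length: int = 32, feature_dim: int = 147):
        self.window_length = window_length
        self.feature_dim = feature_dim
        # deque with maxlen drops the oldest frame automatically when full.
        self._frames = deque(maxlen=window_length)

    def push(self, features) -> None:
        """Append one frame, rejecting vectors of the wrong width."""
        if len(features) != self.feature_dim:
            raise ValueError(f"expected {self.feature_dim} features, got {len(features)}")
        self._frames.append(list(features))

    def ready(self) -> bool:
        """True once the buffer holds a complete window."""
        return len(self._frames) == self.window_length

    def window(self):
        """Return the frames oldest-first as a (window_length, feature_dim) nested list."""
        if not self.ready():
            raise RuntimeError("window is not full yet")
        return [list(frame) for frame in self._frames]


buf = WindowBuffer()
for i in range(40):                 # simulate 40 incoming frames
    buf.push([float(i)] * 147)
window = buf.window()               # holds the 32 most recent frames
print(len(window), len(window[0]))
```

The nested list returned by `window()` has the `(32, 147)` shape that the usage snippet converts to a tensor.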
+ ## Training Dataset
+
+ - **Source**: Hybrid merge of Jester and IPN-Hand windows. Jester provides the `no_gesture`, swiping, zoom, and `stop_sign` classes; IPN-Hand provides `point_one` and `point_two`.
+ - **Used classes**: 10 (9 active gestures + `no_gesture` background)
+ - **Dataset split**: 70% train / 15% val / 15% test (stratified by class)
+ - **Augmentation**: temporal scale ±20%, spatial jitter σ=0.005;
+   label-aware horizontal mirror (`swiping_left` ↔ `swiping_right`)
+
+ ## Training Strategy
+
+ Two-phase transfer learning pipeline:
+ - **Phase 1 (pretraining):** backbone initialised from the external checkpoint `two_stream_attn_v1_20260512T041219Z.pt`, pretrained to learn generic gesture dynamics.
+ - **Phase 2 (fine-tuning):** head replaced and model adapted to the hybrid Jester+IPN 10-gesture vocabulary.
+   - **Stage A (frozen backbone):** 10 epochs of head-only warmup.
+   - **Stage B (full model):** up to 60 epochs of joint fine-tuning with scheduler/early stopping.
+   - **Stage B retention defences:** replay_max_samples_per_class=500, distillation_weight=0.5, replay_ce_weight=0.3, backbone_lr_multiplier=0.1, ewc_weight=100.0, gpm_components=20, forgetting_penalty_weight=0.5.
+
+ ## Training Configuration
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Architecture | EnhancedTwoStreamLSTM (BiLSTM h=96×2, MHA 8 heads, proj=96, mean+max pool, MLP gate) |
+ | Input size | 147 |
+ | Hidden size | 96/stream (BiLSTM output: 192) |
+ | Projection dim | 96 |
+ | Num layers | 4 |
+ | MHA heads | 8 (head dim: 24) |
+ | Dropout | 0.35 |
+ | Learning rate | 3e-05 |
+ | Weight decay | 0.0005 |
+ | Batch size | 128 |
+ | Max epochs | 60 |
+ | Early stopping patience | 20 |
+ | Label smoothing | 0.05 |
+ | Class weighting | disabled |
+ | Max samples per class | 5000 |
+ | LR scheduler | ReduceLROnPlateau (factor=0.5, patience=8) |
+
+ ## Evaluation Results (Test Set)
+
+ | Metric | Value |
+ |--------|-------|
+ | Accuracy | 99.0% |
+ | Macro F1 | 99.2% |
+
+ ### Per-Class Recall
+
+ | Class | Recall |
+ |-------|--------|
+ | `no_gesture` | 100.0% |
+ | `point_one` | 98.9% |
+ | `point_two` | 98.5% |
+ | `stop_sign` | 99.5% |
+ | `swiping_down` | 99.0% |
+ | `swiping_left` | 100.0% |
+ | `swiping_right` | 99.1% |
+ | `swiping_up` | 98.1% |
+ | `zooming_in_full_hand` | 99.2% |
+ | `zooming_out_full_hand` | 99.0% |
+
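The reported macro F1 can be reproduced from the per-class precision and recall stored in `config.json` in this commit. The sketch below assumes the usual definitions: per-class F1 as the harmonic mean of precision and recall, and macro F1 as the unweighted mean over classes. The variable names are chosen for this illustration only.

```python
# Per-class values copied from the "evaluation" section of config.json.
RECALL = {
    "no_gesture": 1.0,
    "point_one": 0.9890560875512996,
    "point_two": 0.9850746268656716,
    "stop_sign": 0.9947460595446584,
    "swiping_down": 0.9903846153846154,
    "swiping_left": 1.0,
    "swiping_right": 0.990909090909091,
    "swiping_up": 0.9810126582278481,
    "zooming_in_full_hand": 0.9919484702093397,
    "zooming_out_full_hand": 0.9897959183673469,
}
PRECISION = {
    "no_gesture": 1.0,
    "point_one": 0.9836734693877551,
    "point_two": 0.9864130434782609,
    "stop_sign": 0.9964912280701754,
    "swiping_down": 1.0,
    "swiping_left": 0.9818181818181818,
    "swiping_right": 0.990909090909091,
    "swiping_up": 1.0,
    "zooming_in_full_hand": 0.9919484702093397,
    "zooming_out_full_hand": 0.9897959183673469,
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


per_class_f1 = {name: f1(PRECISION[name], RECALL[name]) for name in RECALL}
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(f"macro F1 = {macro_f1:.4f}")
```

The result agrees with the `test_macro_f1` value in `config.json` and rounds to the 99.2% shown in the table.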
+ ## Comparison with Previous Architecture
+
+ | Feature | TwoStreamGestureLSTM | EnhancedTwoStreamLSTM |
+ |---------|---------------------|-----------------------|
+ | LSTM direction | Unidirectional | **Bidirectional** |
+ | Attention | Bahdanau (scalar) | **MHA Q/K/V (8 heads)** |
+ | Feature projection | No | **Yes (→96)** |
+ | Temporal pooling | Mean only | **Mean + Max** |
+ | Cross-stream fusion | Concat only | **2-layer MLP gate** |
+ | Parameters | ~182 K | 2,099,434 |
+
+ ## Limitations and Risks
+
+ - Trained on Jester and IPN Hand subjects only. Performance may degrade with unusual hand sizes,
+   skin tones, or lighting conditions not represented in training data.
+ - The `no_gesture` class represents background/transition frames. At runtime, predictions
+   are filtered through per-class confidence thresholds defined in `production_ipn.yaml`.
+ - Requires **mediapipe>=0.10.14** for landmark extraction at inference time.
+ - Not intended for safety-critical or accessibility-critical applications.
+ - Performance was measured on a held-out test split from the same dataset; real-world
+   generalisation may differ.
+
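The runtime filtering mentioned above can be pictured as a softmax over the logits followed by a per-class threshold check. The sketch below is an illustration only: the threshold values, the default of 0.5, and the fallback to `no_gesture` are assumptions, and the actual contents of `production_ipn.yaml` are not shown in this commit.

```python
import math

CLASS_LABELS = [
    "no_gesture", "point_one", "point_two", "stop_sign", "swiping_down",
    "swiping_left", "swiping_right", "swiping_up",
    "zooming_in_full_hand", "zooming_out_full_hand",
]

# Hypothetical thresholds for illustration; real values live in production_ipn.yaml.
EXAMPLE_THRESHOLDS = {"point_one": 0.8, "swiping_left": 0.6}


def softmax(logits):
    """Numerically stable softmax over a flat list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def filter_prediction(logits, thresholds, default_threshold=0.5, fallback="no_gesture"):
    """Return (label, confidence); fall back when confidence is below the class threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    label = CLASS_LABELS[best]
    if probs[best] >= thresholds.get(label, default_threshold):
        return label, probs[best]
    return fallback, probs[best]


# A confident swiping_left logit passes; a weak point_one logit is suppressed.
confident = filter_prediction([0, 0, 0, 0, 0, 6.0, 0, 0, 0, 0], EXAMPLE_THRESHOLDS)
uncertain = filter_prediction([0, 1.0, 0, 0, 0, 0, 0, 0, 0, 0], EXAMPLE_THRESHOLDS)
print(confident[0], uncertain[0])
```

Per-class thresholds let the continuous pointing classes demand higher confidence than discrete swipes, which is one plausible reason to keep them configurable.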
+ ## Environmental Impact
+
+ Training was performed on CPU/MPS. Estimated training time: ~10 minutes.
+ Estimated CO₂ equivalent: negligible (<0.001 kg CO₂eq).
+
+ ---
+
+ *Generated by the Maestro training pipeline on 2026-05-12.*
config.json ADDED
@@ -0,0 +1,100 @@
+ {
+   "model_version": "two_stream_attn_v1_finetune_20260512T041947Z",
+   "model_config": {
+     "model_name": "two_stream_attn_v1",
+     "input_size": 147,
+     "hidden_size": 96,
+     "num_layers": 4,
+     "dropout": 0.35,
+     "num_classes": 10
+   },
+   "feature_schema": {
+     "feature_schema_version": "feature-schema-v5",
+     "feature_dim": 147,
+     "orientation_normalization": false,
+     "window_length": 32,
+     "window_step": null
+   },
+   "training_config": {
+     "epochs": 60,
+     "batch_size": 128,
+     "learning_rate": 3e-05,
+     "weight_decay": 0.0005,
+     "grad_clip_norm": 1.0,
+     "seed": 42,
+     "label_smoothing": 0.05,
+     "class_weighting": false,
+     "max_samples_per_class": 5000,
+     "scheduler": {
+       "factor": 0.5,
+       "patience": 8,
+       "min_lr": 1e-06
+     }
+   },
+   "evaluation": {
+     "test_accuracy": 0.9898119122257053,
+     "test_macro_f1": 0.9916782280254713,
+     "test_loss": 0.3169419604159946,
+     "calibration_ece": 0.04126546900162752,
+     "per_class_recall": {
+       "no_gesture": 1.0,
+       "point_one": 0.9890560875512996,
+       "point_two": 0.9850746268656716,
+       "stop_sign": 0.9947460595446584,
+       "swiping_down": 0.9903846153846154,
+       "swiping_left": 1.0,
+       "swiping_right": 0.990909090909091,
+       "swiping_up": 0.9810126582278481,
+       "zooming_in_full_hand": 0.9919484702093397,
+       "zooming_out_full_hand": 0.9897959183673469
+     },
+     "per_class_precision": {
+       "no_gesture": 1.0,
+       "point_one": 0.9836734693877551,
+       "point_two": 0.9864130434782609,
+       "stop_sign": 0.9964912280701754,
+       "swiping_down": 1.0,
+       "swiping_left": 0.9818181818181818,
+       "swiping_right": 0.990909090909091,
+       "swiping_up": 1.0,
+       "zooming_in_full_hand": 0.9919484702093397,
+       "zooming_out_full_hand": 0.9897959183673469
+     }
+   },
+   "class_labels": [
+     "no_gesture",
+     "point_one",
+     "point_two",
+     "stop_sign",
+     "swiping_down",
+     "swiping_left",
+     "swiping_right",
+     "swiping_up",
+     "zooming_in_full_hand",
+     "zooming_out_full_hand"
+   ],
+   "created_at": "2026-05-12T04:25:36.916751+00:00",
+   "gesture_command_mapping": {
+     "commands": {
+       "swipe_up": "start_presentation",
+       "swipe_down": "stop_presentation",
+       "swipe_right": "next_slide",
+       "swipe_left": "previous_slide",
+       "zoom_in": "zoom_in_view",
+       "zoom_out": "zoom_out_view",
+       "open_palm_hold": "erase_annotations",
+       "unknown": "no_action"
+     },
+     "modes": {
+       "swipe_up": "discrete",
+       "swipe_down": "discrete",
+       "swipe_right": "discrete",
+       "swipe_left": "discrete",
+       "zoom_in": "discrete",
+       "zoom_out": "discrete",
+       "open_palm_hold": "discrete",
+       "point_one": "continuous",
+       "point_two": "continuous"
+     }
+   }
+ }