ntsrigaud committed on
Commit 3132009 · verified · 1 Parent(s): 5ce6565

Upload two_stream_attn_v1_finetune_20260513T160347Z

Files changed (2)
  1. README.md +59 -47
  2. config.json +60 -40
README.md CHANGED
@@ -17,7 +17,7 @@ metrics:
  - accuracy
  - f1
  model-index:
- - name: two_stream_attn_v1_finetune_20260513T093937Z
+ - name: two_stream_attn_v1_finetune_20260513T160347Z
  results:
  - task:
  type: gesture-recognition
@@ -26,15 +26,15 @@ model-index:
  type: IPN-Hand
  metrics:
  - type: accuracy
- value: 0.9648
+ value: 0.9618
  - type: f1
- value: 0.9638
+ value: 0.9632
  ---
 
- # two_stream_attn_v1_finetune_20260513T093937Z
+ # two_stream_attn_v1_finetune_20260513T160347Z
 
  A real-time hand gesture classifier trained on
- a Hybrid Jester+IPN gesture dataset (Jester dynamic classes + IPN pointing classes).
+ an LD-CONG Hybrid gesture dataset (LD-CONG unique gestures + Jester zooms + IPN pointing, W=16 frames).
 
  This model is part of the **Maestro** pipeline that enables touchless
  control of presentation and meeting software through hand gestures captured from a
@@ -43,12 +43,12 @@ standard webcam using MediaPipe for landmark extraction.
  ## Model Description
 
  - **Architecture**: EnhancedTwoStreamLSTM (BiLSTM h=96×2, MHA 8 heads, proj=96, mean+max pool, MLP gate)
- - **Parameters**: 2,099,434
- - **Input**: `(batch, 32, 147)`
- 32-frame sliding window at 30 FPS ≈ 1067 ms
- - **Output**: Softmax logits over 10 gesture classes
+ - **Parameters**: 2,100,206
+ - **Input**: `(batch, 16, 147)`
+ 16-frame sliding window at 30 FPS ≈ 533 ms
+ - **Output**: Softmax logits over 14 gesture classes
  - **Inference latency**: < 1 ms per call (CPU, single sample)
- - **Feature schema**: `feature-schema-v5`
+ - **Feature schema**: `feature-schema-v1`
 
  ## Architecture
 
@@ -78,7 +78,7 @@ Input (B, T=32, 147)
  │ ctx_b = LN(ctx_b × gate_b + ctx_b)
 
  └─ cat(ctx_a, ctx_b) → (384,)
- LN → Linear(384→192) → GELU → Dropout → Linear(192→10)
+ LN → Linear(384→192) → GELU → Dropout → Linear(192→14)
  ```
 
  **Design rationale:**
@@ -91,33 +91,41 @@ Input (B, T=32, 147)
 
  | Class | Description |
  |-------|-------------|
- | `unknown` | Background / transition / no gesture |
+ | `palm` | Open palm held flat toward camera (static hand shape) |
+ | `fist` | Closed fist (all fingers curled, thumb tucked) |
+ | `thumb_up` | Thumbs-up — thumb extended upward, other fingers closed |
+ | `pinch` | Thumb and index finger brought together (pinch grip) |
+ | `click` | Brief tap / click gesture — index finger snaps toward thumb |
+ | `swiping_right` | Horizontal swipe from left to right |
+ | `swiping_left` | Horizontal swipe from right to left |
+ | `swiping_down` | Vertical swipe downward |
+ | `swiping_up` | Vertical swipe upward |
+ | `zooming_in_full_hand` | Pinch-open / spread fingers away from each other |
+ | `zooming_out_full_hand` | Pinch-close / bring fingers together |
  | `point_one` | Single-finger pointing gesture (continuous laser-pointer control) |
  | `point_two` | Two-finger pointing gesture (continuous annotation-pen control) |
- | `stop_sign` | Static open palm facing camera (Jester class) |
- | `swiping_down` | Vertical swipe downward (Jester class) |
- | `swiping_left` | Horizontal swipe from right to left (Jester class) |
- | `swiping_right` | Horizontal swipe from left to right (Jester class) |
- | `swiping_up` | Vertical swipe upward (Jester class) |
- | `zooming_in_full_hand` | Pinch-open / spread fingers away from each other (Jester class) |
- | `zooming_out_full_hand` | Pinch-close / bring fingers together (Jester class) |
+ | `unknown` | Background / transition / no gesture |
 
  ## Gesture Usage In Presentation System
 
  | Class | Mode | Command | Runtime handling |
  |-------|------|---------|------------------|
- | `unknown` | `discrete` | `no_action` | No-op background class |
- | `point_one` | `continuous` | `` | Continuous tracker: LaserPointerTracker (bypasses discrete dispatcher) |
- | `point_two` | `continuous` | `` | Continuous tracker: AnnotationPenTracker (bypasses discrete dispatcher) |
- | `stop_sign` | `discrete` | `erase_annotations` | Discrete command via GestureActivationController → CommandDispatcher |
- | `swiping_down` | `discrete` | `stop_presentation` | Discrete command via GestureActivationController → CommandDispatcher |
- | `swiping_left` | `discrete` | `previous_slide` | Discrete command via GestureActivationController → CommandDispatcher |
+ | `palm` | `discrete` | `erase_annotations` | Discrete command via GestureActivationController → CommandDispatcher |
+ | `fist` | `discrete` | `no_action` | No-op background class |
+ | `thumb_up` | `discrete` | `no_action` | No-op background class |
+ | `pinch` | `discrete` | `no_action` | No-op background class |
+ | `click` | `discrete` | `no_action` | No-op background class |
  | `swiping_right` | `discrete` | `next_slide` | Discrete command via GestureActivationController → CommandDispatcher |
+ | `swiping_left` | `discrete` | `previous_slide` | Discrete command via GestureActivationController → CommandDispatcher |
+ | `swiping_down` | `discrete` | `stop_presentation` | Discrete command via GestureActivationController → CommandDispatcher |
  | `swiping_up` | `discrete` | `start_presentation` | Discrete command via GestureActivationController → CommandDispatcher |
  | `zooming_in_full_hand` | `discrete` | `zoom_in_view` | Discrete command via GestureActivationController → CommandDispatcher |
  | `zooming_out_full_hand` | `discrete` | `zoom_out_view` | Discrete command via GestureActivationController → CommandDispatcher |
+ | `point_one` | `continuous` | `—` | Continuous tracker: LaserPointerTracker (bypasses discrete dispatcher) |
+ | `point_two` | `continuous` | `—` | Continuous tracker: AnnotationPenTracker (bypasses discrete dispatcher) |
+ | `unknown` | `discrete` | `no_action` | No-op background class |
 
- ## Feature Schema (`feature-schema-v5`)
+ ## Feature Schema (`feature-schema-v1`)
 
  | Block | Dims | Description |
  |-------|------|-------------|
@@ -139,7 +147,7 @@ from maestro.infrastructure.model.checkpoint_loader import load_inference_artifa
  # Download the artifact (cached after first call)
  local_path = hf_hub_download(
  repo_id="ntsrigaud/maestro-lstm-hybrid",
- filename="two_stream_attn_v1_finetune_20260513T093937Z_inference.pt",
+ filename="two_stream_attn_v1_finetune_20260513T160347Z_inference.pt",
  )
 
  # Load the artifact (includes model, class labels, and feature schema)
@@ -160,19 +168,19 @@ with torch.no_grad():
 
  ## Training Dataset
 
- - **Source**: Hybrid merge of Jester and IPN-Hand windows: Jester provides unknown/swiping/zoom/stop_sign classes; IPN-Hand provides point_one and point_two
- - **Used classes**: 10 (9 active gestures + `unknown` background)
+ - **Source**: Three-source hybrid: **LD-CONG** provides palm/fist/thumb_up/pinch/click and directional swipes (right/left/upward/downward → renamed); **Jester** provides zooming_in/out_full_hand and additional swipe data (center-cropped W32→16); **IPN-Hand** provides point_one/point_two/unknown (center-cropped W32→16). Up to 3,000 windows per class, seed=42.
+ - **Used classes**: 14 (13 active gestures + `unknown` background)
  - **Dataset split**: 70% train / 15% val / 15% test (stratified by class)
  - **Augmentation**: temporal scale ±20%, spatial jitter σ=0.005
 
  ## Training Strategy
 
  Two-phase transfer learning pipeline:
- - **Phase 1 (pretraining):** backbone pretrained on external checkpoint `two_stream_attn_v1_20260513T093445Z.pt` to learn generic gesture dynamics.
- - **Phase 2 (fine-tuning):** head replaced and model adapted on Hybrid Jester+IPN 10-gesture vocabulary.
+ - **Phase 1 (pretraining):** backbone pretrained on external checkpoint `two_stream_attn_v1_20260513T155730Z.pt` to learn generic gesture dynamics.
+ - **Phase 2 (fine-tuning):** head replaced and model adapted on LD-CONG Hybrid 14-class vocabulary (LD-CONG unique + Jester zooms + pooled swipes + IPN pointing, W=16).
  - **Stage A (frozen backbone):** 10 epoch(s) head-only warmup.
- - **Stage B (full model):** up to 60 epoch(s) joint fine-tuning with scheduler/early stopping.
- - **Stage B retention defences:** replay_max_samples_per_class=500, distillation_weight=0.5, replay_ce_weight=0.3, backbone_lr_multiplier=0.1, ewc_weight=100.0, gpm_components=20, forgetting_penalty_weight=0.5.
+ - **Stage B (full model):** up to 28 epoch(s) joint fine-tuning with scheduler/early stopping.
+ - **Stage B retention defences:** replay_max_samples_per_class=500, distillation_weight=0.0, replay_ce_weight=0.0, backbone_lr_multiplier=0.1, ewc_weight=2000.0, gpm_components=0, forgetting_penalty_weight=0.5.
 
  ## Training Configuration
 
@@ -192,30 +200,34 @@ Two-phase transfer learning pipeline:
  | Early stopping patience | 12 |
  | Label smoothing | 0.05 |
  | Class weighting | disabled |
- | Max samples per class | 5000 |
+ | Max samples per class | 3000 |
  | LR scheduler | ReduceLROnPlateau (factor=0.5, patience=8) |
 
  ## Evaluation Results (Test Set)
 
  | Metric | Value |
  |--------|-------|
- | Accuracy | 96.5% |
- | Macro F1 | 96.4% |
+ | Accuracy | 96.2% |
+ | Macro F1 | 96.3% |
 
  ### Per-Class Recall
 
  | Class | Recall |
  |-------|--------|
- | `unknown` | 91.4% |
- | `point_one` | 98.1% |
- | `point_two` | 98.0% |
- | `stop_sign` | 98.7% |
- | `swiping_down` | 93.1% |
- | `swiping_left` | 96.7% |
+ | `palm` | 99.6% |
+ | `fist` | 99.1% |
+ | `thumb_up` | 99.0% |
+ | `pinch` | 99.5% |
+ | `click` | 97.3% |
  | `swiping_right` | 95.5% |
- | `swiping_up` | 98.7% |
- | `zooming_in_full_hand` | 96.7% |
- | `zooming_out_full_hand` | 95.3% |
+ | `swiping_left` | 97.5% |
+ | `swiping_down` | 98.4% |
+ | `swiping_up` | 96.8% |
+ | `zooming_in_full_hand` | 95.9% |
+ | `zooming_out_full_hand` | 93.9% |
+ | `point_one` | 97.1% |
+ | `point_two` | 93.5% |
+ | `unknown` | 87.7% |
 
  ## Comparison with Previous Architecture
 
@@ -226,7 +238,7 @@ Two-phase transfer learning pipeline:
  | Feature projection | No | **Yes (→96)** |
  | Temporal pooling | Mean only | **Mean + Max** |
  | Cross-stream fusion | Concat only | **2-layer MLP gate** |
- | Parameters | ~182 K | ~2,099,434 |
+ | Parameters | ~182 K | ~2,100,206 |
 
  ## Limitations and Risks
 
@@ -246,4 +258,4 @@ Estimated CO₂ equivalent: negligible (<0.001 kg CO₂eq).
 
  ---
 
- *Generated by the Maestro training pipeline on 2026-05-13.*
+ *Generated by the Maestro training pipeline on 2026-05-14.*
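
Review note on the README.md changes: the "Gesture Usage In Presentation System" table in the new revision splits the 14 classes into two continuous classes that bypass the discrete dispatcher and twelve discrete classes that map to a command. The routing it describes could be sketched roughly as follows. This is only an illustration built from the table rows: `route_prediction`, `COMMANDS`, and `CONTINUOUS_TRACKERS` are hypothetical names, and the real interfaces of `GestureActivationController`, `CommandDispatcher`, and the tracker classes are not shown in this commit.

```python
# Hypothetical routing sketch derived from the README table in this commit.
# None of these names are Maestro's actual API; only the class labels,
# commands, and tracker names are taken from the table.

# Discrete classes and the command each one maps to.
COMMANDS = {
    "palm": "erase_annotations",
    "fist": "no_action",
    "thumb_up": "no_action",
    "pinch": "no_action",
    "click": "no_action",
    "swiping_right": "next_slide",
    "swiping_left": "previous_slide",
    "swiping_down": "stop_presentation",
    "swiping_up": "start_presentation",
    "zooming_in_full_hand": "zoom_in_view",
    "zooming_out_full_hand": "zoom_out_view",
    "unknown": "no_action",
}

# Continuous classes and the tracker named in the table for each.
CONTINUOUS_TRACKERS = {
    "point_one": "LaserPointerTracker",
    "point_two": "AnnotationPenTracker",
}


def route_prediction(label: str) -> tuple[str, str]:
    """Return (mode, target) for a predicted class label.

    Continuous classes resolve to a tracker name; discrete classes
    resolve to a command string. Unknown labels raise KeyError.
    """
    if label in CONTINUOUS_TRACKERS:
        return ("continuous", CONTINUOUS_TRACKERS[label])
    if label in COMMANDS:
        return ("discrete", COMMANDS[label])
    raise KeyError(f"label not in the 14-class vocabulary: {label!r}")


if __name__ == "__main__":
    for name in ("palm", "point_one", "fist"):
        print(name, "->", route_prediction(name))
```

Four of the five new LD-CONG classes (`fist`, `thumb_up`, `pinch`, `click`) map to `no_action`, so in this revision they behave like `unknown` at runtime even though the classifier distinguishes them.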
config.json CHANGED
@@ -1,19 +1,19 @@
  {
- "model_version": "two_stream_attn_v1_finetune_20260513T093937Z",
+ "model_version": "two_stream_attn_v1_finetune_20260513T160347Z",
  "model_config": {
  "model_name": "two_stream_attn_v1",
  "input_size": 147,
  "hidden_size": 96,
  "num_layers": 4,
  "dropout": 0.35,
- "num_classes": 10
+ "num_classes": 14
  },
  "feature_schema": {
- "feature_schema_version": "feature-schema-v5",
+ "feature_schema_version": "feature-schema-v1",
  "feature_dim": 147,
  "orientation_normalization": false,
- "window_length": 32,
- "window_step": null
+ "window_length": 16,
+ "window_step": 3
  },
  "training_config": {
  "epochs": 60,
@@ -24,7 +24,7 @@
  "seed": 42,
  "label_smoothing": 0.05,
  "class_weighting": false,
- "max_samples_per_class": 5000,
+ "max_samples_per_class": 3000,
  "scheduler": {
  "factor": 0.5,
  "patience": 8,
@@ -32,48 +32,60 @@
  }
  },
  "evaluation": {
- "test_accuracy": 0.964765601672141,
- "test_macro_f1": 0.9638047676525682,
- "test_loss": 0.3919019465946233,
- "calibration_ece": 0.041047210158003236,
+ "test_accuracy": 0.9617670364500792,
+ "test_macro_f1": 0.9632447186344674,
+ "test_loss": 0.43081179908262757,
+ "calibration_ece": 0.032766067943935724,
  "per_class_recall": {
- "unknown": 0.9143576826196473,
- "point_one": 0.9810964083175804,
- "point_two": 0.98,
- "stop_sign": 0.9873684210526316,
- "swiping_down": 0.9310344827586207,
- "swiping_left": 0.9666666666666667,
- "swiping_right": 0.9553571428571429,
- "swiping_up": 0.9871794871794872,
- "zooming_in_full_hand": 0.9672131147540983,
- "zooming_out_full_hand": 0.9528688524590164
+ "palm": 0.996415770609319,
+ "fist": 0.9912280701754386,
+ "thumb_up": 0.98989898989899,
+ "pinch": 0.9951923076923077,
+ "click": 0.972568578553616,
+ "swiping_right": 0.9546925566343042,
+ "swiping_left": 0.9748603351955307,
+ "swiping_down": 0.9840425531914894,
+ "swiping_up": 0.9682539682539683,
+ "zooming_in_full_hand": 0.958997722095672,
+ "zooming_out_full_hand": 0.9389978213507625,
+ "point_one": 0.9711286089238845,
+ "point_two": 0.9347826086956522,
+ "unknown": 0.8765432098765432
  },
  "per_class_precision": {
- "unknown": 0.9477806788511749,
- "point_one": 0.941923774954628,
- "point_two": 0.974155069582505,
- "stop_sign": 0.9770833333333333,
- "swiping_down": 0.9926470588235294,
- "swiping_left": 0.9354838709677419,
- "swiping_right": 0.981651376146789,
- "swiping_up": 0.9625,
- "zooming_in_full_hand": 0.971764705882353,
- "zooming_out_full_hand": 0.9728033472803347
+ "palm": 0.996415770609319,
+ "fist": 0.9576271186440678,
+ "thumb_up": 1.0,
+ "pinch": 0.9495412844036697,
+ "click": 0.9948979591836735,
+ "swiping_right": 0.9546925566343042,
+ "swiping_left": 0.9721448467966574,
+ "swiping_down": 0.961038961038961,
+ "swiping_up": 0.977116704805492,
+ "zooming_in_full_hand": 0.9376391982182628,
+ "zooming_out_full_hand": 0.9620535714285714,
+ "point_one": 0.891566265060241,
+ "point_two": 0.9690140845070423,
+ "unknown": 0.9491978609625669
  }
  },
  "class_labels": [
- "unknown",
- "point_one",
- "point_two",
- "stop_sign",
- "swiping_down",
- "swiping_left",
+ "palm",
+ "fist",
+ "thumb_up",
+ "pinch",
+ "click",
  "swiping_right",
+ "swiping_left",
+ "swiping_down",
  "swiping_up",
  "zooming_in_full_hand",
- "zooming_out_full_hand"
+ "zooming_out_full_hand",
+ "point_one",
+ "point_two",
+ "unknown"
  ],
- "created_at": "2026-05-13T09:50:38.473285+00:00",
+ "created_at": "2026-05-14T00:26:14.536961+00:00",
  "gesture_command_mapping": {
  "commands": {
  "swiping_up": "start_presentation",
@@ -82,7 +94,11 @@
  "swiping_left": "previous_slide",
  "zooming_in_full_hand": "zoom_in_view",
  "zooming_out_full_hand": "zoom_out_view",
- "stop_sign": "erase_annotations",
+ "palm": "erase_annotations",
+ "fist": "no_action",
+ "thumb_up": "no_action",
+ "pinch": "no_action",
+ "click": "no_action",
  "unknown": "no_action"
  },
  "modes": {
@@ -92,7 +108,11 @@
  "swiping_left": "discrete",
  "zooming_in_full_hand": "discrete",
  "zooming_out_full_hand": "discrete",
- "stop_sign": "discrete",
+ "palm": "discrete",
+ "fist": "discrete",
+ "thumb_up": "discrete",
+ "pinch": "discrete",
+ "click": "discrete",
  "point_one": "continuous",
  "point_two": "continuous"
  }
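
Review note on the config.json changes: the new `evaluation` block reports `test_macro_f1` alongside per-class precision and recall, so the headline figure can be cross-checked from the per-class values. The sketch below assumes macro F1 is the unweighted mean of per-class F1 scores, which is the usual definition; the commit does not say how the Maestro pipeline computes it. The dictionaries are copied from the added lines of the diff, and the helper names are made up for this note.

```python
# Cross-check sketch: recompute macro F1 from the per-class precision and
# recall in the new config.json. Assumes macro F1 = unweighted mean of
# per-class F1 (an assumption; the pipeline's definition is not shown).

RECALL = {
    "palm": 0.996415770609319,
    "fist": 0.9912280701754386,
    "thumb_up": 0.98989898989899,
    "pinch": 0.9951923076923077,
    "click": 0.972568578553616,
    "swiping_right": 0.9546925566343042,
    "swiping_left": 0.9748603351955307,
    "swiping_down": 0.9840425531914894,
    "swiping_up": 0.9682539682539683,
    "zooming_in_full_hand": 0.958997722095672,
    "zooming_out_full_hand": 0.9389978213507625,
    "point_one": 0.9711286089238845,
    "point_two": 0.9347826086956522,
    "unknown": 0.8765432098765432,
}

PRECISION = {
    "palm": 0.996415770609319,
    "fist": 0.9576271186440678,
    "thumb_up": 1.0,
    "pinch": 0.9495412844036697,
    "click": 0.9948979591836735,
    "swiping_right": 0.9546925566343042,
    "swiping_left": 0.9721448467966574,
    "swiping_down": 0.961038961038961,
    "swiping_up": 0.977116704805492,
    "zooming_in_full_hand": 0.9376391982182628,
    "zooming_out_full_hand": 0.9620535714285714,
    "point_one": 0.891566265060241,
    "point_two": 0.9690140845070423,
    "unknown": 0.9491978609625669,
}

REPORTED_MACRO_F1 = 0.9632447186344674  # "test_macro_f1" in the new config


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 0.0 if total == 0 else 2 * precision * recall / total


def macro_f1(precision: dict, recall: dict) -> float:
    """Unweighted mean of per-class F1 over the shared class labels."""
    if precision.keys() != recall.keys():
        raise ValueError("precision and recall must cover the same classes")
    scores = [f1_score(precision[c], recall[c]) for c in recall]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    computed = macro_f1(PRECISION, RECALL)
    print(f"computed macro F1: {computed:.4f}")
    print(f"reported macro F1: {REPORTED_MACRO_F1:.4f}")
```

Under that assumption the recomputed value should agree closely with the reported `test_macro_f1`. The per-class F1 scores also show where the 14-class model is weakest: `unknown` (recall about 0.877) and `point_one` (precision about 0.892) sit well below the other classes.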