ntsrigaud committed on
Commit 9768190 · verified · 1 Parent(s): 192b721

Upload two_stream_attn_v1_finetune_20260513T050407Z

Files changed (2)
  1. README.md +30 -26
  2. config.json +27 -27
README.md CHANGED
@@ -17,7 +17,7 @@ metrics:
17
  - accuracy
18
  - f1
19
  model-index:
20
- - name: two_stream_attn_v1_20260512T145906Z
21
  results:
22
  - task:
23
  type: gesture-recognition
@@ -26,12 +26,12 @@ model-index:
26
  type: IPN-Hand
27
  metrics:
28
  - type: accuracy
29
- value: 0.9675
30
  - type: f1
31
- value: 0.9641
32
  ---
33
 
34
- # two_stream_attn_v1_20260512T145906Z
35
 
36
  A real-time hand gesture classifier trained on
37
  a Hybrid Jester+IPN gesture dataset (Jester dynamic classes + IPN pointing classes).
@@ -91,7 +91,7 @@ Input (B, T=32, 147)
91
 
92
  | Class | Description |
93
  |-------|-------------|
94
- | `unknown` | |
95
  | `point_one` | Single-finger pointing gesture (continuous laser-pointer control) |
96
  | `point_two` | Two-finger pointing gesture (continuous annotation-pen control) |
97
  | `stop_sign` | Static open palm facing camera (Jester class) |
@@ -139,7 +139,7 @@ from maestro.infrastructure.model.checkpoint_loader import load_inference_artifa
139
  # Download the artifact (cached after first call)
140
  local_path = hf_hub_download(
141
  repo_id="ntsrigaud/maestro-lstm-hybrid",
142
- filename="two_stream_attn_v1_20260512T145906Z_inference.pt",
143
  )
144
 
145
  # Load the artifact (includes model, class labels, and feature schema)
@@ -160,15 +160,19 @@ with torch.no_grad():
160
 
161
  ## Training Dataset
162
 
163
- - **Source**: Hybrid merge of Jester and IPN-Hand windows: Jester provides no_gesture/swiping/zoom/stop_sign classes; IPN-Hand provides point_one and point_two
164
- - **Used classes**: 10 (9 active gestures + `no_gesture` background)
165
  - **Dataset split**: 70% train / 15% val / 15% test (stratified by class)
166
  - **Augmentation**: temporal scale ±20%, spatial jitter σ=0.005
167
 
168
  ## Training Strategy
169
 
170
- Single-stage supervised training on IPN-Hand only.
171
- The model is initialized from scratch and optimized end-to-end on the target gesture set.
 
 
 
 
172
 
173
  ## Training Configuration
174
 
@@ -181,11 +185,11 @@ The model is initialized from scratch and optimized end-to-end on the target ges
181
  | Num layers | 4 |
182
  | MHA heads | 8 (head dim: 24) |
183
  | Dropout | 0.35 |
184
- | Learning rate | 0.001 |
185
  | Weight decay | 0.0005 |
186
  | Batch size | 128 |
187
- | Max epochs | 80 |
188
- | Early stopping patience | 20 |
189
  | Label smoothing | 0.05 |
190
  | Class weighting | disabled |
191
  | Max samples per class | 5000 |
@@ -195,22 +199,22 @@ The model is initialized from scratch and optimized end-to-end on the target ges
195
 
196
  | Metric | Value |
197
  |--------|-------|
198
- | Accuracy | 96.7% |
199
- | Macro F1 | 96.4% |
200
 
201
  ### Per-Class Recall
202
 
203
  | Class | Recall |
204
  |-------|--------|
205
- | `unknown` | 86.6% |
206
- | `point_one` | 98.4% |
207
- | `point_two` | 98.6% |
208
- | `stop_sign` | 98.4% |
209
- | `swiping_down` | 95.7% |
210
- | `swiping_left` | 99.1% |
211
- | `swiping_right` | 94.3% |
212
- | `swiping_up` | 94.2% |
213
- | `zooming_in_full_hand` | 98.1% |
214
  | `zooming_out_full_hand` | 97.1% |
215
 
216
  ## Comparison with Previous Architecture
@@ -228,7 +232,7 @@ The model is initialized from scratch and optimized end-to-end on the target ges
228
 
229
  - Trained on IPN Hand subjects only. Performance may degrade with unusual hand sizes,
230
  skin tones, or lighting conditions not represented in training data.
231
- - The `no_gesture` class represents background/transition frames. At runtime, predictions
232
  are filtered through per-class confidence thresholds defined in `production_hybrid.yaml`.
233
  - Requires **mediapipe>=0.10.14** for landmark extraction at inference time.
234
  - Not intended for safety-critical or accessibility-critical applications.
@@ -242,4 +246,4 @@ Estimated CO₂ equivalent: negligible (<0.001 kg CO₂eq).
242
 
243
  ---
244
 
245
- *Generated by the Maestro training pipeline on 2026-05-12.*
 
17
  - accuracy
18
  - f1
19
  model-index:
20
+ - name: two_stream_attn_v1_finetune_20260513T050407Z
21
  results:
22
  - task:
23
  type: gesture-recognition
 
26
  type: IPN-Hand
27
  metrics:
28
  - type: accuracy
29
+ value: 0.9551
30
  - type: f1
31
+ value: 0.9481
32
  ---
33
 
34
+ # two_stream_attn_v1_finetune_20260513T050407Z
35
 
36
  A real-time hand gesture classifier trained on
37
  a Hybrid Jester+IPN gesture dataset (Jester dynamic classes + IPN pointing classes).
 
91
 
92
  | Class | Description |
93
  |-------|-------------|
94
+ | `unknown` | Background / transition / no gesture |
95
  | `point_one` | Single-finger pointing gesture (continuous laser-pointer control) |
96
  | `point_two` | Two-finger pointing gesture (continuous annotation-pen control) |
97
  | `stop_sign` | Static open palm facing camera (Jester class) |
 
139
  # Download the artifact (cached after first call)
140
  local_path = hf_hub_download(
141
  repo_id="ntsrigaud/maestro-lstm-hybrid",
142
+ filename="two_stream_attn_v1_finetune_20260513T050407Z_inference.pt",
143
  )
144
 
145
  # Load the artifact (includes model, class labels, and feature schema)
 
160
 
161
  ## Training Dataset
162
 
163
+ - **Source**: Hybrid merge of Jester and IPN-Hand windows: Jester provides unknown/swiping/zoom/stop_sign classes; IPN-Hand provides point_one and point_two
164
+ - **Used classes**: 10 (9 active gestures + `unknown` background)
165
  - **Dataset split**: 70% train / 15% val / 15% test (stratified by class)
166
  - **Augmentation**: temporal scale ±20%, spatial jitter σ=0.005
167
 
168
  ## Training Strategy
169
 
170
+ Two-phase transfer learning pipeline:
171
+ - **Phase 1 (pretraining):** backbone pretrained on external checkpoint `two_stream_attn_v1_20260513T045733Z.pt` to learn generic gesture dynamics.
172
+ - **Phase 2 (fine-tuning):** head replaced and model adapted on Hybrid Jester+IPN 10-gesture vocabulary.
173
+ - **Stage A (frozen backbone):** 10 epoch(s) head-only warmup.
174
+ - **Stage B (full model):** up to 58 epoch(s) joint fine-tuning with scheduler/early stopping.
175
+ - **Stage B retention defences:** replay_max_samples_per_class=500, distillation_weight=0.5, replay_ce_weight=0.3, backbone_lr_multiplier=0.1, ewc_weight=100.0, gpm_components=20, forgetting_penalty_weight=0.5.
176
 
177
  ## Training Configuration
178
 
 
185
  | Num layers | 4 |
186
  | MHA heads | 8 (head dim: 24) |
187
  | Dropout | 0.35 |
188
+ | Learning rate | 3e-05 |
189
  | Weight decay | 0.0005 |
190
  | Batch size | 128 |
191
+ | Max epochs | 60 |
192
+ | Early stopping patience | 12 |
193
  | Label smoothing | 0.05 |
194
  | Class weighting | disabled |
195
  | Max samples per class | 5000 |
 
199
 
200
  | Metric | Value |
201
  |--------|-------|
202
+ | Accuracy | 95.5% |
203
+ | Macro F1 | 94.8% |
204
 
205
  ### Per-Class Recall
206
 
207
  | Class | Recall |
208
  |-------|--------|
209
+ | `unknown` | 82.8% |
210
+ | `point_one` | 98.1% |
211
+ | `point_two` | 97.2% |
212
+ | `stop_sign` | 98.5% |
213
+ | `swiping_down` | 92.2% |
214
+ | `swiping_left` | 93.6% |
215
+ | `swiping_right` | 88.5% |
216
+ | `swiping_up` | 92.3% |
217
+ | `zooming_in_full_hand` | 97.0% |
218
  | `zooming_out_full_hand` | 97.1% |
219
 
220
  ## Comparison with Previous Architecture
 
232
 
233
  - Trained on IPN Hand subjects only. Performance may degrade with unusual hand sizes,
234
  skin tones, or lighting conditions not represented in training data.
235
+ - The `unknown` class represents background/transition frames. At runtime, predictions
236
  are filtered through per-class confidence thresholds defined in `production_hybrid.yaml`.
237
  - Requires **mediapipe>=0.10.14** for landmark extraction at inference time.
238
  - Not intended for safety-critical or accessibility-critical applications.
 
246
 
247
  ---
248
 
249
+ *Generated by the Maestro training pipeline on 2026-05-13.*
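
Review note: the README diff above replaces every row of the Per-Class Recall table except `zooming_out_full_hand`, so the net effect of the fine-tuned checkpoint is easier to judge as per-class differences. The following is a minimal sketch that computes those differences in percentage points. The recall values are copied from the removed and added table rows in this diff; the names `previous`, `finetuned`, and `recall_deltas` are hypothetical and not part of the Maestro codebase.

```python
# Per-class recall (%) copied from the two README revisions in this diff.
# `previous` = removed rows (two_stream_attn_v1_20260512T145906Z),
# `finetuned` = added rows (two_stream_attn_v1_finetune_20260513T050407Z).
previous = {
    "unknown": 86.6,
    "point_one": 98.4,
    "point_two": 98.6,
    "stop_sign": 98.4,
    "swiping_down": 95.7,
    "swiping_left": 99.1,
    "swiping_right": 94.3,
    "swiping_up": 94.2,
    "zooming_in_full_hand": 98.1,
    "zooming_out_full_hand": 97.1,
}
finetuned = {
    "unknown": 82.8,
    "point_one": 98.1,
    "point_two": 97.2,
    "stop_sign": 98.5,
    "swiping_down": 92.2,
    "swiping_left": 93.6,
    "swiping_right": 88.5,
    "swiping_up": 92.3,
    "zooming_in_full_hand": 97.0,
    "zooming_out_full_hand": 97.1,
}


def recall_deltas(before: dict[str, float], after: dict[str, float]) -> dict[str, float]:
    """Return after - before for each class, in percentage points (1 decimal)."""
    return {name: round(after[name] - before[name], 1) for name in before}


deltas = recall_deltas(previous, finetuned)

# Print classes from largest regression to largest improvement.
for name, delta in sorted(deltas.items(), key=lambda item: item[1]):
    print(f"{name:<24}{delta:+.1f} pp")
```

With the values in this diff, the largest regression is `swiping_right` (-5.8 points), followed by `swiping_left` (-5.5) and `unknown` (-3.8); `stop_sign` is the only class that improves (+0.1), and `zooming_out_full_hand` is unchanged.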
config.json CHANGED
@@ -1,5 +1,5 @@
1
  {
2
- "model_version": "two_stream_attn_v1_20260512T145906Z",
3
  "model_config": {
4
  "model_name": "two_stream_attn_v1",
5
  "input_size": 147,
@@ -16,9 +16,9 @@
16
  "window_step": null
17
  },
18
  "training_config": {
19
- "epochs": 80,
20
  "batch_size": 128,
21
- "learning_rate": 0.001,
22
  "weight_decay": 0.0005,
23
  "grad_clip_norm": 1.0,
24
  "seed": 42,
@@ -32,33 +32,33 @@
32
  }
33
  },
34
  "evaluation": {
35
- "test_accuracy": 0.9674507008790687,
36
- "test_macro_f1": 0.9640774183053038,
37
- "test_loss": 0.3977402173430542,
38
- "calibration_ece": 0.035556920550729974,
39
  "per_class_recall": {
40
- "unknown": 0.8656716417910447,
41
- "point_one": 0.9835841313269493,
42
- "point_two": 0.9864314789687924,
43
- "stop_sign": 0.9835164835164835,
44
- "swiping_down": 0.9568965517241379,
45
- "swiping_left": 0.990909090909091,
46
- "swiping_right": 0.9425287356321839,
47
- "swiping_up": 0.9423076923076923,
48
- "zooming_in_full_hand": 0.9808917197452229,
49
  "zooming_out_full_hand": 0.9712643678160919
50
  },
51
  "per_class_precision": {
52
- "unknown": 0.9613259668508287,
53
- "point_one": 0.9663978494623656,
54
- "point_two": 0.9404915912031048,
55
- "stop_sign": 0.9728260869565217,
56
- "swiping_down": 0.9327731092436975,
57
- "swiping_left": 0.990909090909091,
58
- "swiping_right": 0.9647058823529412,
59
- "swiping_up": 0.9932432432432432,
60
- "zooming_in_full_hand": 0.9777777777777777,
61
- "zooming_out_full_hand": 0.9854227405247813
62
  }
63
  },
64
  "class_labels": [
@@ -73,7 +73,7 @@
73
  "zooming_in_full_hand",
74
  "zooming_out_full_hand"
75
  ],
76
- "created_at": "2026-05-12T15:07:23.016730+00:00",
77
  "gesture_command_mapping": {
78
  "commands": {
79
  "swiping_up": "start_presentation",
 
1
  {
2
+ "model_version": "two_stream_attn_v1_finetune_20260513T050407Z",
3
  "model_config": {
4
  "model_name": "two_stream_attn_v1",
5
  "input_size": 147,
 
16
  "window_step": null
17
  },
18
  "training_config": {
19
+ "epochs": 60,
20
  "batch_size": 128,
21
+ "learning_rate": 3e-05,
22
  "weight_decay": 0.0005,
23
  "grad_clip_norm": 1.0,
24
  "seed": 42,
 
32
  }
33
  },
34
  "evaluation": {
35
+ "test_accuracy": 0.955096222380613,
36
+ "test_macro_f1": 0.9481389392072146,
37
+ "test_loss": 0.41714808393697267,
38
+ "calibration_ece": 0.026080811808244665,
39
  "per_class_recall": {
40
+ "unknown": 0.8283582089552238,
41
+ "point_one": 0.9808481532147743,
42
+ "point_two": 0.9715061058344641,
43
+ "stop_sign": 0.9853479853479854,
44
+ "swiping_down": 0.9224137931034483,
45
+ "swiping_left": 0.9363636363636364,
46
+ "swiping_right": 0.8850574712643678,
47
+ "swiping_up": 0.9230769230769231,
48
+ "zooming_in_full_hand": 0.9697452229299363,
49
  "zooming_out_full_hand": 0.9712643678160919
50
  },
51
  "per_class_precision": {
52
+ "unknown": 0.9380281690140845,
53
+ "point_one": 0.9409448818897638,
54
+ "point_two": 0.9250645994832042,
55
+ "stop_sign": 0.972875226039783,
56
+ "swiping_down": 0.9385964912280702,
57
+ "swiping_left": 0.9716981132075472,
58
+ "swiping_right": 0.9746835443037974,
59
+ "swiping_up": 1.0,
60
+ "zooming_in_full_hand": 0.9712918660287081,
61
+ "zooming_out_full_hand": 0.9726618705035971
62
  }
63
  },
64
  "class_labels": [
 
73
  "zooming_in_full_hand",
74
  "zooming_out_full_hand"
75
  ],
76
+ "created_at": "2026-05-13T05:12:19.988870+00:00",
77
  "gesture_command_mapping": {
78
  "commands": {
79
  "swiping_up": "start_presentation",