ntsrigaud committed · Commit f7e0e62 · verified · 1 Parent(s): 55a59ab

Upload two_stream_attn_v1_2layer_ld_cong_finetune_20260515T134706Z

Files changed (2)
  1. README.md +20 -20
  2. config.json +26 -28
README.md CHANGED
@@ -17,7 +17,7 @@ metrics:
  - accuracy
  - f1
  model-index:
- - name: two_stream_attn_v1_finetune_20260515T104743Z
+ - name: two_stream_attn_v1_2layer_ld_cong_finetune_20260515T134706Z
  results:
  - task:
  type: gesture-recognition
@@ -26,12 +26,12 @@ model-index:
  type: IPN-Hand
  metrics:
  - type: accuracy
- value: 0.9566
+ value: 0.9606
  - type: f1
- value: 0.9556
+ value: 0.9587
  ---
 
- # two_stream_attn_v1_finetune_20260515T104743Z
+ # two_stream_attn_v1_2layer_ld_cong_finetune_20260515T134706Z
 
  A real-time hand gesture classifier trained on
  a Hybrid Jester+IPN gesture dataset (Jester dynamic classes + IPN pointing classes).
@@ -43,7 +43,7 @@ standard webcam using MediaPipe for landmark extraction.
  ## Model Description
 
  - **Architecture**: EnhancedTwoStreamLSTM (BiLSTM h=96×2, MHA 8 heads, proj=96, mean+max pool, MLP gate)
- - **Parameters**: 2,099,434
+ - **Parameters**: 1,208,554
  - **Input**: `(batch, 16, 147)`
  — 16-frame sliding window at 30 FPS ≈ 533 ms
  - **Output**: Softmax logits over 10 gesture classes
@@ -139,7 +139,7 @@ from maestro.infrastructure.model.checkpoint_loader import load_inference_artifa
  # Download the artifact (cached after first call)
  local_path = hf_hub_download(
  repo_id="ntsrigaud/maestro-lstm-hybrid",
- filename="two_stream_attn_v1_finetune_20260515T104743Z_inference.pt",
+ filename="two_stream_attn_v1_2layer_ld_cong_finetune_20260515T134706Z_inference.pt",
  )
 
  # Load the artifact (includes model, class labels, and feature schema)
@@ -168,10 +168,10 @@ with torch.no_grad():
  ## Training Strategy
 
  Two-phase transfer learning pipeline:
- - **Phase 1 (pretraining):** backbone pretrained on external checkpoint `two_stream_attn_v1_20260513T155730Z.pt` to learn generic gesture dynamics.
+ - **Phase 1 (pretraining):** backbone pretrained on external checkpoint `two_stream_attn_v1_2layer_pretrain_20260515T125437Z.pt` to learn generic gesture dynamics.
  - **Phase 2 (fine-tuning):** head replaced and model adapted on Hybrid Jester+IPN 10-gesture vocabulary.
  - **Stage A (frozen backbone):** 10 epoch(s) head-only warmup.
- - **Stage B (full model):** up to 80 epoch(s) joint fine-tuning with scheduler/early stopping.
+ - **Stage B (full model):** up to 66 epoch(s) joint fine-tuning with scheduler/early stopping.
  - **Stage B retention defences:** replay_max_samples_per_class=500, distillation_weight=0.0, replay_ce_weight=0.0, backbone_lr_multiplier=0.1, ewc_weight=N/A, gpm_components=0, forgetting_penalty_weight=0.5.
 
  ## Training Configuration
@@ -199,23 +199,23 @@ Two-phase transfer learning pipeline:
 
  | Metric | Value |
  |--------|-------|
- | Accuracy | 95.7% |
- | Macro F1 | 95.6% |
+ | Accuracy | 96.1% |
+ | Macro F1 | 95.9% |
 
  ### Per-Class Recall
 
  | Class | Recall |
  |-------|--------|
- | `fist` | 96.2% |
- | `swiping_right` | 95.5% |
- | `swiping_left` | 98.7% |
- | `swiping_down` | 96.6% |
- | `swiping_up` | 97.6% |
- | `zooming_in_full_hand` | 97.5% |
- | `zooming_out_full_hand` | 93.9% |
+ | `fist` | 97.3% |
+ | `swiping_right` | 97.1% |
+ | `swiping_left` | 98.3% |
+ | `swiping_down` | 98.0% |
+ | `swiping_up` | 98.2% |
+ | `zooming_in_full_hand` | 97.0% |
+ | `zooming_out_full_hand` | 95.1% |
  | `point_one` | 97.4% |
- | `point_two` | 94.3% |
- | `unknown` | 87.7% |
+ | `point_two` | 95.1% |
+ | `unknown` | 85.7% |
 
  ## Comparison with Previous Architecture
 
@@ -226,7 +226,7 @@ Two-phase transfer learning pipeline:
  | Feature projection | No | **Yes (→96)** |
  | Temporal pooling | Mean only | **Mean + Max** |
  | Cross-stream fusion | Concat only | **2-layer MLP gate** |
- | Parameters | ~182 K | ~1,208,554 |
+ | Parameters | ~182 K | ~1,208,554 |
 
  ## Limitations and Risks
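The percentages in the README tables appear to be the raw evaluation values recorded in `config.json` for this commit, rounded to one decimal place, and the parameter count drops between the two revisions. A minimal sketch of that arithmetic follows. The dictionary keys, the `as_percent` helper, and the variable names are illustrative and not part of the repository.

```python
# Raw values copied from the new side of config.json in this commit.
# The key "unknown_recall" is an illustrative label for per_class_recall["unknown"].
new_eval = {
    "test_accuracy": 0.960553633217993,
    "test_macro_f1": 0.9587371203998121,
    "unknown_recall": 0.8567901234567902,
}


def as_percent(value: float) -> str:
    """Format a fraction as a one-decimal percentage, as the README tables do."""
    return f"{value:.1%}"


readme_cells = {name: as_percent(v) for name, v in new_eval.items()}

# Parameter counts from the removed and added "Parameters" lines.
old_params, new_params = 2_099_434, 1_208_554
removed = old_params - new_params
reduction = removed / old_params

print(readme_cells)
print(f"{removed:,} fewer parameters ({reduction:.1%} smaller)")
```

Under these assumptions the formatted cells agree with the added README lines (96.1%, 95.9%, 85.7%), and the new model has 890,880 fewer parameters than the previous revision.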
 
config.json CHANGED
@@ -1,7 +1,7 @@
  {
- "model_version": "two_stream_attn_v1_finetune_20260515T104743Z",
+ "model_version": "two_stream_attn_v1_2layer_ld_cong_finetune_20260515T134706Z",
  "model_config": {
- "model_name": "two_stream_attn_v1_2layer_finetune",
+ "model_name": "two_stream_attn_v1_2layer_ld_cong",
  "input_size": 147,
  "hidden_size": 96,
  "num_layers": 2,
@@ -32,33 +32,33 @@
  }
  },
  "evaluation": {
- "test_accuracy": 0.9566320645905421,
- "test_macro_f1": 0.9555633825064029,
- "test_loss": 0.4212703935896512,
- "calibration_ece": 0.030699590615060227,
+ "test_accuracy": 0.960553633217993,
+ "test_macro_f1": 0.9587371203998121,
+ "test_loss": 0.404887427927576,
+ "calibration_ece": 0.033548892410568715,
  "per_class_recall": {
- "fist": 0.9621993127147767,
- "swiping_right": 0.9548532731376975,
- "swiping_left": 0.9869158878504672,
- "swiping_down": 0.9664694280078896,
- "swiping_up": 0.9755600814663951,
- "zooming_in_full_hand": 0.975,
- "zooming_out_full_hand": 0.9388185654008439,
+ "fist": 0.9725085910652921,
+ "swiping_right": 0.9706546275395034,
+ "swiping_left": 0.983177570093458,
+ "swiping_down": 0.980276134122288,
+ "swiping_up": 0.9816700610997964,
+ "zooming_in_full_hand": 0.9704545454545455,
+ "zooming_out_full_hand": 0.9514767932489452,
  "point_one": 0.973753280839895,
- "point_two": 0.9429347826086957,
- "unknown": 0.8765432098765432
+ "point_two": 0.9510869565217391,
+ "unknown": 0.8567901234567902
  },
  "per_class_precision": {
- "fist": 0.9790209790209791,
- "swiping_right": 0.9701834862385321,
- "swiping_left": 0.9777777777777777,
- "swiping_down": 0.9551656920077972,
- "swiping_up": 0.9618473895582329,
- "zooming_in_full_hand": 0.9407894736842105,
- "zooming_out_full_hand": 0.973741794310722,
- "point_one": 0.8918269230769231,
- "point_two": 0.9719887955182073,
- "unknown": 0.9441489361702128
+ "fist": 0.9433333333333334,
+ "swiping_right": 0.9728506787330317,
+ "swiping_left": 0.9813432835820896,
+ "swiping_down": 0.9613152804642167,
+ "swiping_up": 0.9620758483033932,
+ "zooming_in_full_hand": 0.9510022271714922,
+ "zooming_out_full_hand": 0.9740820734341252,
+ "point_one": 0.9298245614035088,
+ "point_two": 0.958904109589041,
+ "unknown": 0.9559228650137741
  }
  },
  "class_labels": [
@@ -73,7 +73,7 @@
  "point_two",
  "unknown"
  ],
- "created_at": "2026-05-15T13:15:40.671167+00:00",
+ "created_at": "2026-05-15T13:52:14.109098+00:00",
  "gesture_command_mapping": {
  "commands": {
  "swiping_up": "start_presentation",
@@ -83,8 +83,6 @@
  "zooming_in_full_hand": "zoom_in_view",
  "zooming_out_full_hand": "zoom_out_view",
  "fist": "erase_annotations",
- "pinch": "activate_laser_pointer",
- "click": "mouse_click",
  "unknown": "no_action"
  },
  "modes": {
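The new `test_macro_f1` can be cross-checked against the per-class precision and recall added in this commit. The sketch below assumes macro F1 is the unweighted mean of per-class F1 scores; the training code is not part of this diff, so that definition is an assumption, and the helper name `f1` is illustrative.

```python
from statistics import mean

# Per-class values copied from the added lines of config.json.
recall = {
    "fist": 0.9725085910652921,
    "swiping_right": 0.9706546275395034,
    "swiping_left": 0.983177570093458,
    "swiping_down": 0.980276134122288,
    "swiping_up": 0.9816700610997964,
    "zooming_in_full_hand": 0.9704545454545455,
    "zooming_out_full_hand": 0.9514767932489452,
    "point_one": 0.973753280839895,
    "point_two": 0.9510869565217391,
    "unknown": 0.8567901234567902,
}
precision = {
    "fist": 0.9433333333333334,
    "swiping_right": 0.9728506787330317,
    "swiping_left": 0.9813432835820896,
    "swiping_down": 0.9613152804642167,
    "swiping_up": 0.9620758483033932,
    "zooming_in_full_hand": 0.9510022271714922,
    "zooming_out_full_hand": 0.9740820734341252,
    "point_one": 0.9298245614035088,
    "point_two": 0.958904109589041,
    "unknown": 0.9559228650137741,
}


def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


per_class_f1 = {label: f1(precision[label], recall[label]) for label in recall}
macro_f1 = mean(per_class_f1.values())  # unweighted average over the 10 classes

weakest = min(per_class_f1, key=per_class_f1.get)
print(f"macro F1 = {macro_f1:.4f}, weakest class = {weakest}")
```

If the assumption holds, the result should land close to the reported 0.9587, with `unknown` as the weakest class because its recall fell to 85.7% in this revision.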