ntsrigaud committed on
Commit 2df62ef · verified · 1 Parent(s): bc4e671

Upload two_stream_attn_v1_finetune_20260514T122537Z

Files changed (2)
  1. README.md +21 -27
  2. config.json +27 -37
README.md CHANGED
@@ -17,7 +17,7 @@ metrics:
  - accuracy
  - f1
  model-index:
- - name: two_stream_attn_v1_finetune_20260514T013537Z
  results:
  - task:
  type: gesture-recognition
@@ -26,12 +26,12 @@ model-index:
  type: IPN-Hand
  metrics:
  - type: accuracy
- value: 0.9566
  - type: f1
- value: 0.9561
  ---
 
- # two_stream_attn_v1_finetune_20260514T013537Z
 
  A real-time hand gesture classifier trained on
  a Hybrid Jester+IPN gesture dataset (Jester dynamic classes + IPN pointing classes).
@@ -43,10 +43,10 @@ standard webcam using MediaPipe for landmark extraction.
  ## Model Description
 
  - **Architecture**: EnhancedTwoStreamLSTM (BiLSTM h=96×2, MHA 8 heads, proj=96, mean+max pool, MLP gate)
- - **Parameters**: 2,099,820
  - **Input**: `(batch, 16, 147)`
  — 16-frame sliding window at 30 FPS ≈ 533 ms
- - **Output**: Softmax logits over 12 gesture classes
  - **Inference latency**: < 1 ms per call (CPU, single sample)
  - **Feature schema**: `feature-schema-v1`
 
@@ -78,7 +78,7 @@ Input (B, T=32, 147)
  │ ctx_b = LN(ctx_b × gate_b + ctx_b)
 
  └─ cat(ctx_a, ctx_b) → (384,)
- LN → Linear(384→192) → GELU → Dropout → Linear(192→12)
  ```
 
  **Design rationale:**
@@ -92,8 +92,6 @@ Input (B, T=32, 147)
  | Class | Description |
  |-------|-------------|
  | `palm` | Open palm held flat toward camera (static hand shape) |
- | `pinch` | Thumb and index finger brought together (pinch grip) |
- | `click` | Brief tap / click gesture — index finger snaps toward thumb |
  | `swiping_right` | Horizontal swipe from left to right |
  | `swiping_left` | Horizontal swipe from right to left |
  | `swiping_down` | Vertical swipe downward |
@@ -109,8 +107,6 @@ Input (B, T=32, 147)
  | Class | Mode | Command | Runtime handling |
  |-------|------|---------|------------------|
  | `palm` | `discrete` | `erase_annotations` | Discrete command via GestureActivationController → CommandDispatcher |
- | `pinch` | `discrete` | `no_action` | No-op background class |
- | `click` | `discrete` | `no_action` | No-op background class |
  | `swiping_right` | `discrete` | `next_slide` | Discrete command via GestureActivationController → CommandDispatcher |
  | `swiping_left` | `discrete` | `previous_slide` | Discrete command via GestureActivationController → CommandDispatcher |
  | `swiping_down` | `discrete` | `stop_presentation` | Discrete command via GestureActivationController → CommandDispatcher |
@@ -143,7 +139,7 @@ from maestro.infrastructure.model.checkpoint_loader import load_inference_artifa
  # Download the artifact (cached after first call)
  local_path = hf_hub_download(
  repo_id="ntsrigaud/maestro-lstm-hybrid",
- filename="two_stream_attn_v1_finetune_20260514T013537Z_inference.pt",
  )
 
  # Load the artifact (includes model, class labels, and feature schema)
@@ -165,7 +161,7 @@ with torch.no_grad():
  ## Training Dataset
 
  - **Source**: Hybrid merge of Jester and IPN-Hand windows: Jester provides unknown/swiping/zoom/stop_sign classes; IPN-Hand provides point_one and point_two
- - **Used classes**: 12 (11 active gestures + `unknown` background)
  - **Dataset split**: 70% train / 15% val / 15% test (stratified by class)
  - **Augmentation**: temporal scale ±20%, spatial jitter σ=0.005
 
@@ -175,7 +171,7 @@ Two-phase transfer learning pipeline:
  - **Phase 1 (pretraining):** backbone pretrained on external checkpoint `two_stream_attn_v1_20260513T155730Z.pt` to learn generic gesture dynamics.
  - **Phase 2 (fine-tuning):** head replaced and model adapted on Hybrid Jester+IPN 10-gesture vocabulary.
  - **Stage A (frozen backbone):** 10 epoch(s) head-only warmup.
- - **Stage B (full model):** up to 51 epoch(s) joint fine-tuning with scheduler/early stopping.
  - **Stage B retention defences:** replay_max_samples_per_class=500, distillation_weight=0.0, replay_ce_weight=0.0, backbone_lr_multiplier=0.1, ewc_weight=2000.0, gpm_components=0, forgetting_penalty_weight=0.5.
 
  ## Training Configuration
@@ -203,25 +199,23 @@ Two-phase transfer learning pipeline:
 
  | Metric | Value |
  |--------|-------|
- | Accuracy | 95.7% |
- | Macro F1 | 95.6% |
 
  ### Per-Class Recall
 
  | Class | Recall |
  |-------|--------|
- | `palm` | 98.7% |
- | `pinch` | 98.4% |
- | `click` | 94.0% |
  | `swiping_right` | 95.3% |
- | `swiping_left` | 98.3% |
- | `swiping_down` | 96.3% |
- | `swiping_up` | 97.6% |
- | `zooming_in_full_hand` | 96.8% |
- | `zooming_out_full_hand` | 92.4% |
  | `point_one` | 97.1% |
- | `point_two` | 93.2% |
- | `unknown` | 89.4% |
 
  ## Comparison with Previous Architecture
 
@@ -232,7 +226,7 @@ Two-phase transfer learning pipeline:
  | Feature projection | No | **Yes (→96)** |
  | Temporal pooling | Mean only | **Mean + Max** |
  | Cross-stream fusion | Concat only | **2-layer MLP gate** |
- | Parameters | ~182 K | ~2,099,820 |
 
  ## Limitations and Risks
 
 
  - accuracy
  - f1
  model-index:
+ - name: two_stream_attn_v1_finetune_20260514T122537Z
  results:
  - task:
  type: gesture-recognition
 
  type: IPN-Hand
  metrics:
  - type: accuracy
+ value: 0.9584
  - type: f1
+ value: 0.9573
  ---
 
+ # two_stream_attn_v1_finetune_20260514T122537Z
 
  A real-time hand gesture classifier trained on
  a Hybrid Jester+IPN gesture dataset (Jester dynamic classes + IPN pointing classes).
 
  ## Model Description
 
  - **Architecture**: EnhancedTwoStreamLSTM (BiLSTM h=96×2, MHA 8 heads, proj=96, mean+max pool, MLP gate)
+ - **Parameters**: 2,099,434
  - **Input**: `(batch, 16, 147)`
  — 16-frame sliding window at 30 FPS ≈ 533 ms
+ - **Output**: Softmax logits over 10 gesture classes
  - **Inference latency**: < 1 ms per call (CPU, single sample)
  - **Feature schema**: `feature-schema-v1`
 
 
  │ ctx_b = LN(ctx_b × gate_b + ctx_b)
 
  └─ cat(ctx_a, ctx_b) → (384,)
+ LN → Linear(384→192) → GELU → Dropout → Linear(192→10)
  ```
 
  **Design rationale:**
 
  | Class | Description |
  |-------|-------------|
  | `palm` | Open palm held flat toward camera (static hand shape) |
  | `swiping_right` | Horizontal swipe from left to right |
  | `swiping_left` | Horizontal swipe from right to left |
  | `swiping_down` | Vertical swipe downward |
 
  | Class | Mode | Command | Runtime handling |
  |-------|------|---------|------------------|
  | `palm` | `discrete` | `erase_annotations` | Discrete command via GestureActivationController → CommandDispatcher |
  | `swiping_right` | `discrete` | `next_slide` | Discrete command via GestureActivationController → CommandDispatcher |
  | `swiping_left` | `discrete` | `previous_slide` | Discrete command via GestureActivationController → CommandDispatcher |
  | `swiping_down` | `discrete` | `stop_presentation` | Discrete command via GestureActivationController → CommandDispatcher |
 
  # Download the artifact (cached after first call)
  local_path = hf_hub_download(
  repo_id="ntsrigaud/maestro-lstm-hybrid",
+ filename="two_stream_attn_v1_finetune_20260514T122537Z_inference.pt",
  )
 
  # Load the artifact (includes model, class labels, and feature schema)
 
  ## Training Dataset
 
  - **Source**: Hybrid merge of Jester and IPN-Hand windows: Jester provides unknown/swiping/zoom/stop_sign classes; IPN-Hand provides point_one and point_two
+ - **Used classes**: 10 (9 active gestures + `unknown` background)
  - **Dataset split**: 70% train / 15% val / 15% test (stratified by class)
  - **Augmentation**: temporal scale ±20%, spatial jitter σ=0.005
 
 
  - **Phase 1 (pretraining):** backbone pretrained on external checkpoint `two_stream_attn_v1_20260513T155730Z.pt` to learn generic gesture dynamics.
  - **Phase 2 (fine-tuning):** head replaced and model adapted on Hybrid Jester+IPN 10-gesture vocabulary.
  - **Stage A (frozen backbone):** 10 epoch(s) head-only warmup.
+ - **Stage B (full model):** up to 60 epoch(s) joint fine-tuning with scheduler/early stopping.
  - **Stage B retention defences:** replay_max_samples_per_class=500, distillation_weight=0.0, replay_ce_weight=0.0, backbone_lr_multiplier=0.1, ewc_weight=2000.0, gpm_components=0, forgetting_penalty_weight=0.5.
 
  ## Training Configuration
 
 
  | Metric | Value |
  |--------|-------|
+ | Accuracy | 95.8% |
+ | Macro F1 | 95.7% |
 
  ### Per-Class Recall
 
  | Class | Recall |
  |-------|--------|
+ | `palm` | 98.4% |
  | `swiping_right` | 95.3% |
+ | `swiping_left` | 98.7% |
+ | `swiping_down` | 96.8% |
+ | `swiping_up` | 97.8% |
+ | `zooming_in_full_hand` | 97.3% |
+ | `zooming_out_full_hand` | 92.8% |
  | `point_one` | 97.1% |
+ | `point_two` | 94.3% |
+ | `unknown` | 88.9% |
 
  ## Comparison with Previous Architecture
 
 
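The percentages in the updated README tables appear to be the `config.json` evaluation values from this commit rounded to one decimal place. A minimal sketch of that cross-check, assuming one-decimal rounding; the helper name `as_percent` is hypothetical and not part of the repository:

```python
# Values copied from the "evaluation" block of the new config.json in this commit.
evaluation = {
    "test_accuracy": 0.9584274740171712,
    "test_macro_f1": 0.9573115163058359,
}
per_class_recall = {
    "palm": 0.9842931937172775,
    "swiping_right": 0.9525959367945824,
    "swiping_left": 0.9869158878504672,
    "swiping_down": 0.9684418145956607,
    "swiping_up": 0.9775967413441955,
    "zooming_in_full_hand": 0.9727272727272728,
    "zooming_out_full_hand": 0.9282700421940928,
    "point_one": 0.9711286089238845,
    "point_two": 0.9429347826086957,
    "unknown": 0.8888888888888888,
}


def as_percent(value: float, digits: int = 1) -> str:
    """Format a fraction in [0, 1] as a percentage string (hypothetical helper)."""
    return f"{value * 100:.{digits}f}%"


# Rebuild the README rows from the raw config values.
summary_rows = {name: as_percent(v) for name, v in evaluation.items()}
recall_rows = {name: as_percent(v) for name, v in per_class_recall.items()}

for name, text in {**summary_rows, **recall_rows}.items():
    print(f"{name}: {text}")
```

Running this against both files after each upload would catch a README table that was not regenerated from the config.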
  | Feature projection | No | **Yes (→96)** |
  | Temporal pooling | Mean only | **Mean + Max** |
  | Cross-stream fusion | Concat only | **2-layer MLP gate** |
+ | Parameters | ~182 K | ~2,099,434 |
 
  ## Limitations and Risks
 
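The README diff lowers the parameter count from 2,099,820 to 2,099,434 while the final layer changes from `Linear(192→12)` to `Linear(192→10)`. The short check below suggests the two changes are consistent, assuming the output layer carries a bias term (an assumption; the diff does not show the layer definition). The function `linear_params` is a hypothetical helper.

```python
def linear_params(n_in: int, n_out: int, bias: bool = True) -> int:
    """Number of trainable parameters in a dense layer (weights plus optional bias)."""
    return n_in * n_out + (n_out if bias else 0)


# Output layer before and after the commit (12 classes -> 10 classes).
old_head = linear_params(192, 12)
new_head = linear_params(192, 10)
head_delta = old_head - new_head

# Totals as stated in the old and new README.
reported_delta = 2_099_820 - 2_099_434

print(old_head, new_head, head_delta, reported_delta)
```

Under that assumption, removing two output classes accounts for the whole difference, so the rest of the network is unchanged in size.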
config.json CHANGED
@@ -1,12 +1,12 @@
  {
- "model_version": "two_stream_attn_v1_finetune_20260514T013537Z",
  "model_config": {
  "model_name": "two_stream_attn_v1",
  "input_size": 147,
  "hidden_size": 96,
  "num_layers": 4,
  "dropout": 0.35,
- "num_classes": 12
  },
  "feature_schema": {
  "feature_schema_version": "feature-schema-v1",
@@ -32,43 +32,37 @@
  }
  },
  "evaluation": {
- "test_accuracy": 0.9566010951125532,
- "test_macro_f1": 0.9560781251156708,
- "test_loss": 0.4304363657823322,
- "calibration_ece": 0.029200464802868007,
  "per_class_recall": {
- "palm": 0.9869109947643979,
- "pinch": 0.984313725490196,
- "click": 0.94,
  "swiping_right": 0.9525959367945824,
- "swiping_left": 0.983177570093458,
- "swiping_down": 0.9625246548323472,
- "swiping_up": 0.9755600814663951,
- "zooming_in_full_hand": 0.9681818181818181,
- "zooming_out_full_hand": 0.9240506329113924,
  "point_one": 0.9711286089238845,
- "point_two": 0.9320652173913043,
- "unknown": 0.8938271604938272
  },
  "per_class_precision": {
- "palm": 0.9792207792207792,
- "pinch": 0.9471698113207547,
- "click": 0.9832635983263598,
- "swiping_right": 0.9634703196347032,
- "swiping_left": 0.9813432835820896,
- "swiping_down": 0.9606299212598425,
- "swiping_up": 0.9676767676767677,
- "zooming_in_full_hand": 0.9240780911062907,
- "zooming_out_full_hand": 0.9668874172185431,
- "point_one": 0.8958837772397095,
- "point_two": 0.9607843137254902,
- "unknown": 0.9501312335958005
  }
  },
  "class_labels": [
  "palm",
- "pinch",
- "click",
  "swiping_right",
  "swiping_left",
  "swiping_down",
@@ -79,7 +73,7 @@
  "point_two",
  "unknown"
  ],
- "created_at": "2026-05-14T01:38:18.988584+00:00",
  "gesture_command_mapping": {
  "commands": {
  "swiping_up": "start_presentation",
@@ -89,10 +83,8 @@
  "zooming_in_full_hand": "zoom_in_view",
  "zooming_out_full_hand": "zoom_out_view",
  "palm": "erase_annotations",
- "fist": "no_action",
- "thumb_up": "no_action",
- "pinch": "no_action",
- "click": "no_action",
  "unknown": "no_action"
  },
  "modes": {
@@ -103,8 +95,6 @@
  "zooming_in_full_hand": "discrete",
  "zooming_out_full_hand": "discrete",
  "palm": "discrete",
- "fist": "discrete",
- "thumb_up": "discrete",
  "pinch": "discrete",
  "click": "discrete",
  "point_one": "continuous",
 
  {
+ "model_version": "two_stream_attn_v1_finetune_20260514T122537Z",
  "model_config": {
  "model_name": "two_stream_attn_v1",
  "input_size": 147,
  "hidden_size": 96,
  "num_layers": 4,
  "dropout": 0.35,
+ "num_classes": 10
  },
  "feature_schema": {
  "feature_schema_version": "feature-schema-v1",
 
  }
  },
  "evaluation": {
+ "test_accuracy": 0.9584274740171712,
+ "test_macro_f1": 0.9573115163058359,
+ "test_loss": 0.41317880455315303,
+ "calibration_ece": 0.03046416184897343,
  "per_class_recall": {
+ "palm": 0.9842931937172775,
  "swiping_right": 0.9525959367945824,
+ "swiping_left": 0.9869158878504672,
+ "swiping_down": 0.9684418145956607,
+ "swiping_up": 0.9775967413441955,
+ "zooming_in_full_hand": 0.9727272727272728,
+ "zooming_out_full_hand": 0.9282700421940928,
  "point_one": 0.9711286089238845,
+ "point_two": 0.9429347826086957,
+ "unknown": 0.8888888888888888
  },
  "per_class_precision": {
+ "palm": 0.9868766404199475,
+ "swiping_right": 0.9723502304147466,
+ "swiping_left": 0.9795918367346939,
+ "swiping_down": 0.958984375,
+ "swiping_up": 0.97165991902834,
+ "zooming_in_full_hand": 0.928416485900217,
+ "zooming_out_full_hand": 0.9649122807017544,
+ "point_one": 0.9002433090024331,
+ "point_two": 0.9665738161559888,
+ "unknown": 0.9498680738786279
  }
  },
  "class_labels": [
  "palm",
  "swiping_right",
  "swiping_left",
  "swiping_down",
 
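The new `test_macro_f1` can be related to the per-class precision and recall in the same `evaluation` block. The sketch below assumes macro F1 is the unweighted mean of per-class F1 scores, which is the usual definition; the training code that produced the value is not shown in this diff.

```python
# Per-class values copied from the new config.json "evaluation" block.
recall = {
    "palm": 0.9842931937172775,
    "swiping_right": 0.9525959367945824,
    "swiping_left": 0.9869158878504672,
    "swiping_down": 0.9684418145956607,
    "swiping_up": 0.9775967413441955,
    "zooming_in_full_hand": 0.9727272727272728,
    "zooming_out_full_hand": 0.9282700421940928,
    "point_one": 0.9711286089238845,
    "point_two": 0.9429347826086957,
    "unknown": 0.8888888888888888,
}
precision = {
    "palm": 0.9868766404199475,
    "swiping_right": 0.9723502304147466,
    "swiping_left": 0.9795918367346939,
    "swiping_down": 0.958984375,
    "swiping_up": 0.97165991902834,
    "zooming_in_full_hand": 0.928416485900217,
    "zooming_out_full_hand": 0.9649122807017544,
    "point_one": 0.9002433090024331,
    "point_two": 0.9665738161559888,
    "unknown": 0.9498680738786279,
}


def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0


# Macro F1: every class contributes equally, regardless of its sample count.
per_class_f1 = {name: f1(precision[name], recall[name]) for name in recall}
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

print(f"macro F1 = {macro_f1:.4f}")
```

The result lands on the reported `test_macro_f1` to within rounding, which suggests the three fields in the config were computed from the same confusion matrix.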
  "point_two",
  "unknown"
  ],
+ "created_at": "2026-05-14T12:30:37.922314+00:00",
  "gesture_command_mapping": {
  "commands": {
  "swiping_up": "start_presentation",
 
  "zooming_in_full_hand": "zoom_in_view",
  "zooming_out_full_hand": "zoom_out_view",
  "palm": "erase_annotations",
+ "pinch": "activate_laser_pointer",
+ "click": "mouse_click",
  "unknown": "no_action"
  },
  "modes": {
 
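The hunk above maps `pinch` and `click` to commands even though this commit removes both from the class lists and from `per_class_recall`. Whether that is intentional is not stated in the diff. A minimal sketch of a consistency check follows, assuming `class_labels` matches the ten keys of `per_class_recall` (the `class_labels` hunk is partly elided) and using only the command entries visible in this diff; `unmapped_to_model` is a hypothetical helper.

```python
# Assumed class list: the ten keys of per_class_recall in the new config.json.
class_labels = [
    "palm", "swiping_right", "swiping_left", "swiping_down", "swiping_up",
    "zooming_in_full_hand", "zooming_out_full_hand",
    "point_one", "point_two", "unknown",
]
num_classes = 10  # model_config.num_classes in the new config.json

# Only the command entries visible in this diff; elided entries are omitted.
commands = {
    "swiping_up": "start_presentation",
    "zooming_in_full_hand": "zoom_in_view",
    "zooming_out_full_hand": "zoom_out_view",
    "palm": "erase_annotations",
    "pinch": "activate_laser_pointer",
    "click": "mouse_click",
    "unknown": "no_action",
}


def unmapped_to_model(mapping: dict, labels: list) -> list:
    """Return mapped gesture names that the model cannot emit (hypothetical check)."""
    known = set(labels)
    return sorted(name for name in mapping if name not in known)


orphans = unmapped_to_model(commands, class_labels)
print("classes match num_classes:", len(class_labels) == num_classes)
print("commands without a model class:", orphans)
```

Entries reported here are not necessarily errors, since they may be triggered by a separate detector, but a classifier restricted to these ten labels would never select them.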
  "zooming_in_full_hand": "discrete",
  "zooming_out_full_hand": "discrete",
  "palm": "discrete",
  "pinch": "discrete",
  "click": "discrete",
  "point_one": "continuous",