ntsrigaud committed · Commit d7a5fcf · verified · 1 Parent(s): 6232e0b

Upload two_stream_attn_v1_finetune_20260514T013537Z

Files changed (2)
  1. README.md +27 -33
  2. config.json +30 -36
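
The figures this commit changes can be cross-checked against one another. The sketch below is a reviewer-style sanity check, not part of the Maestro codebase: it assumes the final `Linear(192→N)` layer carries a bias term, and it assumes macro F1 is the unweighted mean of per-class F1 scores. The helper names `linear_params` and `f1` are hypothetical. The precision and recall values are copied from the updated `config.json`.

```python
# Sanity checks on the numbers reported in this commit (illustrative only).

def linear_params(n_in: int, n_out: int, bias: bool = True) -> int:
    """Parameter count of a dense layer; bias=True is an assumption."""
    return n_in * n_out + (n_out if bias else 0)

# 1) Shrinking the output layer from 14 to 12 classes should account for
#    the whole change in the reported parameter count.
old_total, new_total = 2_100_206, 2_099_820
head_delta = linear_params(192, 14) - linear_params(192, 12)

# 2) Macro F1 recomputed from the per-class values in the new config.json.
recall = {
    "palm": 0.9869109947643979, "pinch": 0.984313725490196,
    "click": 0.94, "swiping_right": 0.9525959367945824,
    "swiping_left": 0.983177570093458, "swiping_down": 0.9625246548323472,
    "swiping_up": 0.9755600814663951,
    "zooming_in_full_hand": 0.9681818181818181,
    "zooming_out_full_hand": 0.9240506329113924,
    "point_one": 0.9711286089238845, "point_two": 0.9320652173913043,
    "unknown": 0.8938271604938272,
}
precision = {
    "palm": 0.9792207792207792, "pinch": 0.9471698113207547,
    "click": 0.9832635983263598, "swiping_right": 0.9634703196347032,
    "swiping_left": 0.9813432835820896, "swiping_down": 0.9606299212598425,
    "swiping_up": 0.9676767676767677,
    "zooming_in_full_hand": 0.9240780911062907,
    "zooming_out_full_hand": 0.9668874172185431,
    "point_one": 0.8958837772397095, "point_two": 0.9607843137254902,
    "unknown": 0.9501312335958005,
}

def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

macro_f1 = sum(f1(precision[c], recall[c]) for c in recall) / len(recall)

print("head delta:", head_delta, "| reported delta:", old_total - new_total)
print("recomputed macro F1:", round(macro_f1, 4))
```

Under those assumptions, dropping `fist` and `thumb_up` removes 2 × (192 + 1) = 386 parameters, which equals the difference between the two reported totals, and the recomputed macro F1 agrees with the `test_macro_f1` value in `config.json` to about four decimal places.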
README.md CHANGED
@@ -17,7 +17,7 @@ metrics:
  - accuracy
  - f1
  model-index:
- - name: two_stream_attn_v1_finetune_20260513T160347Z
+ - name: two_stream_attn_v1_finetune_20260514T013537Z
  results:
  - task:
  type: gesture-recognition
@@ -26,15 +26,15 @@ model-index:
  type: IPN-Hand
  metrics:
  - type: accuracy
- value: 0.9618
+ value: 0.9566
  - type: f1
- value: 0.9632
+ value: 0.9561
  ---

- # two_stream_attn_v1_finetune_20260513T160347Z
+ # two_stream_attn_v1_finetune_20260514T013537Z

  A real-time hand gesture classifier trained on
- an LD-CONG Hybrid gesture dataset (LD-CONG unique gestures + Jester zooms + IPN pointing, W=16 frames).
+ a Hybrid Jester+IPN gesture dataset (Jester dynamic classes + IPN pointing classes).

  This model is part of the **Maestro** pipeline that enables touchless
  control of presentation and meeting software through hand gestures captured from a
@@ -43,10 +43,10 @@ standard webcam using MediaPipe for landmark extraction.
  ## Model Description

  - **Architecture**: EnhancedTwoStreamLSTM (BiLSTM h=96×2, MHA 8 heads, proj=96, mean+max pool, MLP gate)
- - **Parameters**: 2,100,206
+ - **Parameters**: 2,099,820
  - **Input**: `(batch, 16, 147)`
  — 16-frame sliding window at 30 FPS ≈ 533 ms
- - **Output**: Softmax logits over 14 gesture classes
+ - **Output**: Softmax logits over 12 gesture classes
  - **Inference latency**: < 1 ms per call (CPU, single sample)
  - **Feature schema**: `feature-schema-v1`

@@ -78,7 +78,7 @@ Input (B, T=32, 147)
  │ ctx_b = LN(ctx_b × gate_b + ctx_b)

  └─ cat(ctx_a, ctx_b) → (384,)
- LN → Linear(384→192) → GELU → Dropout → Linear(192→14)
+ LN → Linear(384→192) → GELU → Dropout → Linear(192→12)
  ```

  **Design rationale:**
@@ -92,8 +92,6 @@ Input (B, T=32, 147)
  | Class | Description |
  |-------|-------------|
  | `palm` | Open palm held flat toward camera (static hand shape) |
- | `fist` | Closed fist (all fingers curled, thumb tucked) |
- | `thumb_up` | Thumbs-up — thumb extended upward, other fingers closed |
  | `pinch` | Thumb and index finger brought together (pinch grip) |
  | `click` | Brief tap / click gesture — index finger snaps toward thumb |
  | `swiping_right` | Horizontal swipe from left to right |
@@ -111,8 +109,6 @@ Input (B, T=32, 147)
  | Class | Mode | Command | Runtime handling |
  |-------|------|---------|------------------|
  | `palm` | `discrete` | `erase_annotations` | Discrete command via GestureActivationController → CommandDispatcher |
- | `fist` | `discrete` | `no_action` | No-op background class |
- | `thumb_up` | `discrete` | `no_action` | No-op background class |
  | `pinch` | `discrete` | `no_action` | No-op background class |
  | `click` | `discrete` | `no_action` | No-op background class |
  | `swiping_right` | `discrete` | `next_slide` | Discrete command via GestureActivationController → CommandDispatcher |
@@ -147,7 +143,7 @@ from maestro.infrastructure.model.checkpoint_loader import load_inference_artifa
  # Download the artifact (cached after first call)
  local_path = hf_hub_download(
  repo_id="ntsrigaud/maestro-lstm-hybrid",
- filename="two_stream_attn_v1_finetune_20260513T160347Z_inference.pt",
+ filename="two_stream_attn_v1_finetune_20260514T013537Z_inference.pt",
  )

  # Load the artifact (includes model, class labels, and feature schema)
@@ -168,8 +164,8 @@ with torch.no_grad():

  ## Training Dataset

- - **Source**: Three-source hybrid: **LD-CONG** provides palm/fist/thumb_up/pinch/click and directional swipes (right/left/upward/downward → renamed); **Jester** provides zooming_in/out_full_hand and additional swipe data (center-cropped W32→16); **IPN-Hand** provides point_one/point_two/unknown (center-cropped W32→16). Up to 3,000 windows per class, seed=42.
- - **Used classes**: 14 (13 active gestures + `unknown` background)
+ - **Source**: Hybrid merge of Jester and IPN-Hand windows: Jester provides unknown/swiping/zoom/stop_sign classes; IPN-Hand provides point_one and point_two
+ - **Used classes**: 12 (11 active gestures + `unknown` background)
  - **Dataset split**: 70% train / 15% val / 15% test (stratified by class)
  - **Augmentation**: temporal scale ±20%, spatial jitter σ=0.005

@@ -177,9 +173,9 @@ with torch.no_grad():

  Two-phase transfer learning pipeline:
  - **Phase 1 (pretraining):** backbone pretrained on external checkpoint `two_stream_attn_v1_20260513T155730Z.pt` to learn generic gesture dynamics.
- - **Phase 2 (fine-tuning):** head replaced and model adapted on LD-CONG Hybrid 14-class vocabulary (LD-CONG unique + Jester zooms + pooled swipes + IPN pointing, W=16).
+ - **Phase 2 (fine-tuning):** head replaced and model adapted on Hybrid Jester+IPN 10-gesture vocabulary.
  - **Stage A (frozen backbone):** 10 epoch(s) head-only warmup.
- - **Stage B (full model):** up to 28 epoch(s) joint fine-tuning with scheduler/early stopping.
+ - **Stage B (full model):** up to 51 epoch(s) joint fine-tuning with scheduler/early stopping.
  - **Stage B retention defences:** replay_max_samples_per_class=500, distillation_weight=0.0, replay_ce_weight=0.0, backbone_lr_multiplier=0.1, ewc_weight=2000.0, gpm_components=0, forgetting_penalty_weight=0.5.

  ## Training Configuration
@@ -207,27 +203,25 @@ Two-phase transfer learning pipeline:

  | Metric | Value |
  |--------|-------|
- | Accuracy | 96.2% |
- | Macro F1 | 96.3% |
+ | Accuracy | 95.7% |
+ | Macro F1 | 95.6% |

  ### Per-Class Recall

  | Class | Recall |
  |-------|--------|
- | `palm` | 99.6% |
- | `fist` | 99.1% |
- | `thumb_up` | 99.0% |
- | `pinch` | 99.5% |
- | `click` | 97.3% |
- | `swiping_right` | 95.5% |
- | `swiping_left` | 97.5% |
- | `swiping_down` | 98.4% |
- | `swiping_up` | 96.8% |
- | `zooming_in_full_hand` | 95.9% |
- | `zooming_out_full_hand` | 93.9% |
+ | `palm` | 98.7% |
+ | `pinch` | 98.4% |
+ | `click` | 94.0% |
+ | `swiping_right` | 95.3% |
+ | `swiping_left` | 98.3% |
+ | `swiping_down` | 96.3% |
+ | `swiping_up` | 97.6% |
+ | `zooming_in_full_hand` | 96.8% |
+ | `zooming_out_full_hand` | 92.4% |
  | `point_one` | 97.1% |
- | `point_two` | 93.5% |
- | `unknown` | 87.7% |
+ | `point_two` | 93.2% |
+ | `unknown` | 89.4% |

  ## Comparison with Previous Architecture

@@ -238,7 +232,7 @@ Two-phase transfer learning pipeline:
  | Feature projection | No | **Yes (→96)** |
  | Temporal pooling | Mean only | **Mean + Max** |
  | Cross-stream fusion | Concat only | **2-layer MLP gate** |
- | Parameters | ~182 K | ~2,100,206 |
+ | Parameters | ~182 K | ~2,099,820 |

  ## Limitations and Risks
 
config.json CHANGED
@@ -1,12 +1,12 @@
  {
- "model_version": "two_stream_attn_v1_finetune_20260513T160347Z",
+ "model_version": "two_stream_attn_v1_finetune_20260514T013537Z",
  "model_config": {
  "model_name": "two_stream_attn_v1",
  "input_size": 147,
  "hidden_size": 96,
  "num_layers": 4,
  "dropout": 0.35,
- "num_classes": 14
+ "num_classes": 12
  },
  "feature_schema": {
  "feature_schema_version": "feature-schema-v1",
@@ -32,47 +32,41 @@
  }
  },
  "evaluation": {
- "test_accuracy": 0.9617670364500792,
- "test_macro_f1": 0.9632447186344674,
- "test_loss": 0.43081179908262757,
- "calibration_ece": 0.032766067943935724,
+ "test_accuracy": 0.9566010951125532,
+ "test_macro_f1": 0.9560781251156708,
+ "test_loss": 0.4304363657823322,
+ "calibration_ece": 0.029200464802868007,
  "per_class_recall": {
- "palm": 0.996415770609319,
- "fist": 0.9912280701754386,
- "thumb_up": 0.98989898989899,
- "pinch": 0.9951923076923077,
- "click": 0.972568578553616,
- "swiping_right": 0.9546925566343042,
- "swiping_left": 0.9748603351955307,
- "swiping_down": 0.9840425531914894,
- "swiping_up": 0.9682539682539683,
- "zooming_in_full_hand": 0.958997722095672,
- "zooming_out_full_hand": 0.9389978213507625,
+ "palm": 0.9869109947643979,
+ "pinch": 0.984313725490196,
+ "click": 0.94,
+ "swiping_right": 0.9525959367945824,
+ "swiping_left": 0.983177570093458,
+ "swiping_down": 0.9625246548323472,
+ "swiping_up": 0.9755600814663951,
+ "zooming_in_full_hand": 0.9681818181818181,
+ "zooming_out_full_hand": 0.9240506329113924,
  "point_one": 0.9711286089238845,
- "point_two": 0.9347826086956522,
- "unknown": 0.8765432098765432
+ "point_two": 0.9320652173913043,
+ "unknown": 0.8938271604938272
  },
  "per_class_precision": {
- "palm": 0.996415770609319,
- "fist": 0.9576271186440678,
- "thumb_up": 1.0,
- "pinch": 0.9495412844036697,
- "click": 0.9948979591836735,
- "swiping_right": 0.9546925566343042,
- "swiping_left": 0.9721448467966574,
- "swiping_down": 0.961038961038961,
- "swiping_up": 0.977116704805492,
- "zooming_in_full_hand": 0.9376391982182628,
- "zooming_out_full_hand": 0.9620535714285714,
- "point_one": 0.891566265060241,
- "point_two": 0.9690140845070423,
- "unknown": 0.9491978609625669
+ "palm": 0.9792207792207792,
+ "pinch": 0.9471698113207547,
+ "click": 0.9832635983263598,
+ "swiping_right": 0.9634703196347032,
+ "swiping_left": 0.9813432835820896,
+ "swiping_down": 0.9606299212598425,
+ "swiping_up": 0.9676767676767677,
+ "zooming_in_full_hand": 0.9240780911062907,
+ "zooming_out_full_hand": 0.9668874172185431,
+ "point_one": 0.8958837772397095,
+ "point_two": 0.9607843137254902,
+ "unknown": 0.9501312335958005
  }
  },
  "class_labels": [
  "palm",
- "fist",
- "thumb_up",
  "pinch",
  "click",
  "swiping_right",
@@ -85,7 +79,7 @@
  "point_two",
  "unknown"
  ],
- "created_at": "2026-05-14T00:26:14.536961+00:00",
+ "created_at": "2026-05-14T01:38:18.988584+00:00",
  "gesture_command_mapping": {
  "commands": {
  "swiping_up": "start_presentation",