	---
	language:
	- en
	license: mit
	tags:
	- gesture-recognition
	- hand-gesture
	- pytorch
	- mediapipe
	- temporal-model
	- lstm
	- attention
	- bidirectional
	datasets:
	- IPN-Hand
	metrics:
	- accuracy
	- f1
	model-index:
	- name: two_stream_attn_v1_2layer_ld_cong_finetune_20260515T134706Z
	  results:
	  - task:
	      type: gesture-recognition
	    dataset:
	      name: IPN Hand
	      type: IPN-Hand
	    metrics:
	    - type: accuracy
	      value: 0.9606
	    - type: f1
	      value: 0.9587
	---

	# two_stream_attn_v1_2layer_ld_cong_finetune_20260515T134706Z

	A real-time hand gesture classifier trained on
	a Hybrid Jester+IPN gesture dataset (Jester dynamic classes + IPN pointing classes).

	This model is part of the Maestro pipeline that enables touchless
	control of presentation and meeting software through hand gestures captured from a
	standard webcam using MediaPipe for landmark extraction.

	## Model Description

	- Architecture: EnhancedTwoStreamLSTM (BiLSTM h=96×2, MHA 8 heads, proj=96, mean+max pool, MLP gate)
	- Parameters: 1,208,554
	- Input: `(batch, 16, 147)`, a 16-frame sliding window at 30 FPS (≈ 533 ms)
	- Output: raw logits over 10 gesture classes (apply softmax to obtain probabilities)
	- Inference latency: < 1 ms per call (CPU, single sample)
	- Feature schema: `feature-schema-v1`

	## Architecture

	`EnhancedTwoStreamLSTM` splits the 147-dim feature vector into two parallel streams and
	processes them through a BiLSTM + self-attention + MLP-gate pipeline:

	```
	Input (B, T=32, 147)
	│
	├─ Stream A: Pose/Shape (73 dims)
	│    Linear+LN+GELU → 96
	│    2-layer BiLSTM (h=96) → (B, T, 192)
	│    LayerNorm → Self-MHA (8 heads) + residual + post-LN
	│    mean+max pool → pool_LN → ctx_a (B, 192)
	│
	├─ Stream B: Motion/Dynamics (74 dims)
	│    (identical structure) → ctx_b (B, 192)
	│
	├─ MLP cross-stream gate
	│    gate_a = Sigmoid(Linear(96→192)(Tanh(Linear(192→96)(ctx_b))))
	│    ctx_a  = LN(ctx_a × gate_a + ctx_a)
	│    gate_b = Sigmoid(Linear(96→192)(Tanh(Linear(192→96)(ctx_a))))
	│    ctx_b  = LN(ctx_b × gate_b + ctx_b)
	│
	└─ cat(ctx_a, ctx_b) → (384,)
	     LN → Linear(384→192) → GELU → Dropout → Linear(192→10)
	```

	Design rationale:
	- BiLSTMs encode temporal order via their recurrent cell state — no positional encoding needed.
	- Mean+Max pooling captures both sustained gesture shape (mean) and transient click events (max).
	- The 2-layer MLP gate provides non-linear cross-modal recalibration at ~37 K params
	(vs ~263 K for full MHA cross-attention with a degenerate mean-pooled query).
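
The ~37 K figure for the gate can be checked by hand. The sketch below counts weights and biases for one gate, assuming both `Linear` layers carry a bias term (the PyTorch default) and that the figure refers to a single gate direction. The helper name `linear_params` is illustrative and not part of Maestro.

```python
def linear_params(in_features: int, out_features: int, bias: bool = True) -> int:
    """Number of trainable parameters in one fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


# One gate: Linear(192 -> 96) -> Tanh -> Linear(96 -> 192) -> Sigmoid
gate_params = linear_params(192, 96) + linear_params(96, 192)
print(gate_params)  # 37152
```

Tanh and Sigmoid have no trainable parameters, so the two linear layers account for the whole count.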

	## Gesture Classes

	| Class | Description |
	|-------|-------------|
	| `fist` | Closed fist (all fingers curled, thumb tucked) |
	| `swiping_right` | Horizontal swipe from left to right |
	| `swiping_left` | Horizontal swipe from right to left |
	| `swiping_down` | Vertical swipe downward |
	| `swiping_up` | Vertical swipe upward |
	| `zooming_in_full_hand` | Pinch-open / spread fingers away from each other |
	| `zooming_out_full_hand` | Pinch-close / bring fingers together |
	| `point_one` | Single-finger pointing gesture (continuous laser-pointer control) |
	| `point_two` | Two-finger pointing gesture (continuous annotation-pen control) |
	| `unknown` | Background / transition / no gesture |

	## Gesture Usage In Presentation System

	| Class | Mode | Command | Runtime handling |
	|-------|------|---------|------------------|
	| `fist` | `discrete` | `erase_annotations` | Discrete command via GestureActivationController → CommandDispatcher |
	| `swiping_right` | `discrete` | `next_slide` | Discrete command via GestureActivationController → CommandDispatcher |
	| `swiping_left` | `discrete` | `previous_slide` | Discrete command via GestureActivationController → CommandDispatcher |
	| `swiping_down` | `discrete` | `stop_presentation` | Discrete command via GestureActivationController → CommandDispatcher |
	| `swiping_up` | `discrete` | `start_presentation` | Discrete command via GestureActivationController → CommandDispatcher |
	| `zooming_in_full_hand` | `discrete` | `zoom_in_view` | Discrete command via GestureActivationController → CommandDispatcher |
	| `zooming_out_full_hand` | `discrete` | `zoom_out_view` | Discrete command via GestureActivationController → CommandDispatcher |
	| `point_one` | `continuous` | `—` | Continuous tracker: LaserPointerTracker (bypasses discrete dispatcher) |
	| `point_two` | `continuous` | `—` | Continuous tracker: AnnotationPenTracker (bypasses discrete dispatcher) |
	| `unknown` | `discrete` | `no_action` | No-op background class |
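
The routing in the table can be sketched as a pair of lookup tables. This is a minimal illustration only: `route_prediction` and the two dictionaries are hypothetical names, and the real `GestureActivationController`, `CommandDispatcher`, and tracker classes are not reproduced here.

```python
# Class -> command mapping copied from the table above.
DISCRETE_COMMANDS = {
    "fist": "erase_annotations",
    "swiping_right": "next_slide",
    "swiping_left": "previous_slide",
    "swiping_down": "stop_presentation",
    "swiping_up": "start_presentation",
    "zooming_in_full_hand": "zoom_in_view",
    "zooming_out_full_hand": "zoom_out_view",
    "unknown": "no_action",
}

# Continuous classes bypass the discrete dispatcher and feed a tracker instead.
CONTINUOUS_TRACKERS = {
    "point_one": "LaserPointerTracker",
    "point_two": "AnnotationPenTracker",
}


def route_prediction(class_name: str) -> tuple[str, str]:
    """Return (mode, target) for a predicted class (hypothetical helper)."""
    if class_name in CONTINUOUS_TRACKERS:
        return "continuous", CONTINUOUS_TRACKERS[class_name]
    return "discrete", DISCRETE_COMMANDS[class_name]


print(route_prediction("swiping_right"))  # ('discrete', 'next_slide')
print(route_prediction("point_one"))      # ('continuous', 'LaserPointerTracker')
```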

	## Feature Schema (`feature-schema-v1`)

	| Block | Dims | Description |
	|-------|------|-------------|
	| `position` | 0–62 | 21 wrist-relative, scale-normalised landmark positions (x, y, z) |
	| `fingertip_spread` | 63–67 | 5 inter-fingertip Euclidean distances |
	| `wrist_trajectory` | 68–70 | Net wrist displacement from oldest frame in the window |
	| `velocity` | 71–133 | 21 per-landmark wrist-relative velocity vectors (Δposition per unit time) |
	| `joint_angles` | 134–143 | 10 MCP + PIP joint angles in radians |
	| `wrist_vel_raw` | 144–146 | Camera-normalised wrist velocity (x, y, z); key directional signal |
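
As a quick check that the blocks tile the full 147-dim vector, the sketch below slices one frame by the index ranges in the table. `FEATURE_BLOCKS` and `split_features` are illustrative names, not part of the Maestro codebase.

```python
# Inclusive index ranges copied from the feature-schema-v1 table.
FEATURE_BLOCKS = {
    "position": (0, 62),
    "fingertip_spread": (63, 67),
    "wrist_trajectory": (68, 70),
    "velocity": (71, 133),
    "joint_angles": (134, 143),
    "wrist_vel_raw": (144, 146),
}


def split_features(frame):
    """Split one 147-value frame into named blocks (hypothetical helper)."""
    if len(frame) != 147:
        raise ValueError(f"expected 147 features, got {len(frame)}")
    return {name: frame[lo:hi + 1] for name, (lo, hi) in FEATURE_BLOCKS.items()}


blocks = split_features(list(range(147)))
sizes = {name: len(values) for name, values in blocks.items()}
print(sizes)
print(sum(sizes.values()))  # 147
```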


	## How to Use

	```python
	import torch
	from huggingface_hub import hf_hub_download
	from maestro.infrastructure.model.checkpoint_loader import load_inference_artifact

	# Download the artifact (cached after first call)
	local_path = hf_hub_download(
	    repo_id="ntsrigaud/maestro-lstm-hybrid",
	    filename="two_stream_attn_v1_2layer_ld_cong_finetune_20260515T134706Z_inference.pt",
	)

	# Load the artifact (includes model, class labels, and feature schema)
	artifact = load_inference_artifact(
	    artifact_path=local_path,
	    device=torch.device("cpu"),
	)
	artifact.model.eval()

	# Build a 147-dim feature vector per frame using LandmarkFeatureTransformer
	# and fill a 32-frame SlidingWindowSequenceBuffer. `window_np` below is the
	# resulting NumPy array of shape (32, 147); it is not defined in this snippet.
	with torch.no_grad():
	    # tensor shape: (batch=1, T=32, F=147)
	    window_tensor = torch.tensor(window_np, dtype=torch.float32).unsqueeze(0)
	    logits = artifact.model(window_tensor)
	    pred_class = artifact.class_labels[logits.argmax(dim=1).item()]
	```
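
The snippet above relies on a sliding window of per-frame features. The card does not show how Maestro's `SlidingWindowSequenceBuffer` is implemented, so the following is only a stand-in built on `collections.deque`, with the hypothetical name `WindowBuffer`, to illustrate the idea of a fixed-length window that drops its oldest frame.

```python
from collections import deque


class WindowBuffer:
    """Fixed-length FIFO of per-frame feature vectors (illustrative only)."""

    def __init__(self, window_size: int, feature_dim: int = 147):
        self.window_size = window_size
        self.feature_dim = feature_dim
        self._frames = deque(maxlen=window_size)  # oldest frame drops automatically

    def push(self, features) -> None:
        if len(features) != self.feature_dim:
            raise ValueError(f"expected {self.feature_dim} features, got {len(features)}")
        self._frames.append(list(features))

    def is_full(self) -> bool:
        return len(self._frames) == self.window_size

    def window(self):
        """Return the frames oldest-first as a (window_size, feature_dim) nested list."""
        if not self.is_full():
            raise RuntimeError("window not full yet")
        return list(self._frames)


buffer = WindowBuffer(window_size=4, feature_dim=3)  # tiny sizes for the demo
for t in range(6):
    buffer.push([t, t, t])
print([frame[0] for frame in buffer.window()])  # [2, 3, 4, 5]
```

In real use, `window_size` would match the sequence length the model expects, and the nested list returned by `window()` can be converted with `torch.tensor(..., dtype=torch.float32).unsqueeze(0)`.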

	## Training Dataset

	- Source: hybrid merge of Jester and IPN-Hand windows. Jester provides the unknown, swiping, zoom, and stop_sign classes; IPN-Hand provides `point_one` and `point_two`.
	- Used classes: 10 (9 active gestures + `unknown` background)
	- Dataset split: 70% train / 15% val / 15% test (stratified by class)
	- Augmentation: temporal scale ±20%, spatial jitter σ=0.005
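
A 70/15/15 split stratified by class can be sketched as follows. This is a minimal pure-Python illustration, not the Maestro pipeline's splitter; the function name, the seed, and the rounding rule (remainder goes to the test set) are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, train=0.70, val=0.15, seed=0):
    """Return (train_idx, val_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, val_idx, test_idx = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = int(round(train * len(indices)))
        n_val = int(round(val * len(indices)))
        train_idx += indices[:n_train]
        val_idx += indices[n_train:n_train + n_val]
        test_idx += indices[n_train + n_val:]  # remainder goes to test
    return train_idx, val_idx, test_idx


labels = ["fist"] * 100 + ["unknown"] * 200  # toy label list
tr, va, te = stratified_split(labels)
print(len(tr), len(va), len(te))  # 210 45 45
```

Splitting per class keeps the class proportions the same in all three subsets, which matters when one class (here `unknown`) is larger than the others.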

	## Training Strategy

	Two-phase transfer learning pipeline:
	- Phase 1 (pretraining): the backbone is initialised from the external checkpoint `two_stream_attn_v1_2layer_pretrain_20260515T125437Z.pt`, which was pretrained to learn generic gesture dynamics.
	- Phase 2 (fine-tuning): the head is replaced and the model is adapted to the Hybrid Jester+IPN 10-gesture vocabulary in two stages:
	  - Stage A (frozen backbone): 10 epochs of head-only warmup.
	  - Stage B (full model): up to 66 epochs of joint fine-tuning with the LR scheduler and early stopping.
	  - Stage B retention defences: replay_max_samples_per_class=500, distillation_weight=0.0, replay_ce_weight=0.0, backbone_lr_multiplier=0.1, ewc_weight=N/A, gpm_components=0, forgetting_penalty_weight=0.5.

	## Training Configuration

	| Parameter | Value |
	|-----------|-------|
	| Architecture | EnhancedTwoStreamLSTM (BiLSTM h=96×2, MHA 8 heads, proj=96, mean+max pool, MLP gate) |
	| Input size | 147 |
	| Hidden size | 96/stream (BiLSTM output: 192) |
	| Projection dim | 96 |
	| Num layers | 2 |
	| MHA heads | 8 (head dim: 24) |
	| Dropout | 0.4 |
	| Learning rate | 3e-05 |
	| Weight decay | 0.001 |
	| Batch size | 128 |
	| Max epochs | 80 |
	| Early stopping patience | 20 |
	| Label smoothing | 0.05 |
	| Class weighting | disabled |
	| Max samples per class | 3000 |
	| LR scheduler | ReduceLROnPlateau (factor=0.5, patience=10) |
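
To make the label-smoothing value concrete, the sketch below builds the smoothed target distribution for 10 classes and evaluates the matching cross-entropy. It assumes the convention used by PyTorch's `CrossEntropyLoss(label_smoothing=...)`, where the target is `(1 - eps)` times the one-hot vector plus `eps / K` on every class; the card does not state which loss implementation the pipeline uses, and the function names are illustrative.

```python
import math


def smoothed_targets(true_class: int, num_classes: int, smoothing: float):
    """Target distribution: (1 - eps) on the one-hot vector plus eps / K everywhere."""
    uniform = smoothing / num_classes
    targets = [uniform] * num_classes
    targets[true_class] += 1.0 - smoothing
    return targets


def cross_entropy(logits, targets):
    """Cross-entropy between softmax(logits) and a soft target distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return -sum(t * (x - log_norm) for t, x in zip(targets, logits))


targets = smoothed_targets(true_class=0, num_classes=10, smoothing=0.05)
print(round(targets[0], 3), round(targets[1], 3))  # 0.955 0.005
```

With smoothing 0.05 the true class keeps 0.955 of the probability mass and each other class receives 0.005, which discourages the model from producing extremely peaked outputs.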

	## Evaluation Results (Test Set)

	| Metric | Value |
	|--------|-------|
	| Accuracy | 96.1% |
	| Macro F1 | 95.9% |

	### Per-Class Recall

	| Class | Recall |
	|-------|--------|
	| `fist` | 97.3% |
	| `swiping_right` | 97.1% |
	| `swiping_left` | 98.3% |
	| `swiping_down` | 98.0% |
	| `swiping_up` | 98.2% |
	| `zooming_in_full_hand` | 97.0% |
	| `zooming_out_full_hand` | 95.1% |
	| `point_one` | 97.4% |
	| `point_two` | 95.1% |
	| `unknown` | 85.7% |
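
The per-class figures can be summarised as a macro-averaged recall, that is, the unweighted mean over classes. This is a different quantity from the macro F1 reported above, which also depends on per-class precision. The short calculation below uses only the values from the table.

```python
# Per-class recall (%) copied from the table above.
RECALL = {
    "fist": 97.3,
    "swiping_right": 97.1,
    "swiping_left": 98.3,
    "swiping_down": 98.0,
    "swiping_up": 98.2,
    "zooming_in_full_hand": 97.0,
    "zooming_out_full_hand": 95.1,
    "point_one": 97.4,
    "point_two": 95.1,
    "unknown": 85.7,
}

macro_recall = sum(RECALL.values()) / len(RECALL)
worst_class = min(RECALL, key=RECALL.get)
print(f"{macro_recall:.2f}", worst_class)  # 95.92 unknown
```

The `unknown` background class is the clear outlier, roughly 9 points below the next-lowest class.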

	## Comparison with Previous Architecture

	| Feature | TwoStreamGestureLSTM | EnhancedTwoStreamLSTM |
	|---------|---------------------|-----------------------|
	| LSTM direction | Unidirectional | Bidirectional |
	| Attention | Bahdanau (scalar) | MHA Q/K/V (8 heads) |
	| Feature projection | No | Yes (→96) |
	| Temporal pooling | Mean only | Mean + Max |
	| Cross-stream fusion | Concat only | 2-layer MLP gate |
	| Parameters | ~182 K | 1,208,554 |

	## Limitations and Risks

	- Trained on Jester and IPN Hand subjects only. Performance may degrade with unusual hand sizes,
	skin tones, or lighting conditions not represented in training data.
	- The `unknown` class represents background/transition frames. At runtime, predictions
	are filtered through per-class confidence thresholds defined in `production_hybrid.yaml`.
	- Requires mediapipe>=0.10.14 for landmark extraction at inference time.
	- Not intended for safety-critical or accessibility-critical applications.
	- Performance was measured on a held-out test split from the same dataset; real-world
	generalisation may differ.
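
The per-class confidence filtering mentioned above can be sketched as follows. The format and values of `production_hybrid.yaml` are not shown in this card, so the thresholds here are made up, and falling back to `unknown` when a threshold is missed is an assumption about the runtime behaviour. `softmax` and `filter_prediction` are illustrative names.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (max-subtracted for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def filter_prediction(logits, class_labels, thresholds, default_threshold=0.5):
    """Return the top class if its probability clears its threshold, else 'unknown'."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    label = class_labels[best]
    if probs[best] >= thresholds.get(label, default_threshold):
        return label, probs[best]
    return "unknown", probs[best]


labels = ["fist", "swiping_right", "unknown"]       # shortened list for the demo
thresholds = {"fist": 0.90, "swiping_right": 0.60}  # made-up values
print(filter_prediction([2.0, 1.0, 0.0], labels, thresholds)[0])  # unknown
print(filter_prediction([0.0, 3.0, 0.0], labels, thresholds)[0])  # swiping_right
```

In the first call `fist` wins with probability about 0.67, below its 0.90 threshold, so the prediction is suppressed; in the second, `swiping_right` reaches about 0.91 and passes.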

	## Environmental Impact

	Training was performed on CPU/MPS. Estimated training time: ~10 minutes.
	Estimated CO₂ equivalent: negligible (<0.001 kg CO₂eq).

	---

	Generated by the Maestro training pipeline on 2026-05-15.