# MedVidBench Leaderboard Supported Formats

## Overview

The leaderboard web app now accepts **two submission formats**:

1. **Prediction-only** (preferred for users)
2. **Merged format** (for testing/debugging)

Both formats are automatically detected and handled by the evaluation system.

## Format 1: Prediction-Only (User Submission)

**Recommended for**: Public user submissions

**Structure**:

```json
[
  {
    "id": "video_id&&start_frame&&end_frame&&fps",
    "qa_type": "tal",
    "prediction": "0.0-10.0 seconds."
  },
  {
    "id": "another_video&&0&&100&&1.0",
    "qa_type": "video_summary",
    "prediction": "The surgeon performs cholecystectomy..."
  }
]
```

**Required fields**:

- `id`: Sample identifier (`video_id&&start_frame&&end_frame&&fps`)
- `qa_type`: Task type
- `prediction`: Model's answer text

**What happens**:

1. Server validates the format
2. Server merges the predictions with the private ground-truth
3. Server runs the evaluation
4. Server adds the result to the leaderboard

## Format 2: Merged (Internal/Testing)

**Recommended for**: Internal testing, debugging

**Structure**:

```json
{
  "0": {
    "metadata": {
      "video_id": "kcOqlifSukA",
      "fps": "1.0",
      "input_video_start_frame": "22425",
      "input_video_end_frame": "25124"
    },
    "qa_type": "tal",
    "struc_info": [
      {
        "action": "cutting",
        "spans": [{"start": 0.0, "end": 10.0}]
      }
    ],
    "question": "When does cutting happen?",
    "gnd": "0.0-10.0 seconds.",
    "answer": "0.0-10.0 seconds.",
    "data_source": "AVOS"
  }
}
```

**Required fields**:

- `metadata`: Video metadata
- `qa_type`: Task type
- `struc_info`: Structured ground-truth
- `question`: Question text
- `gnd`: Ground-truth answer
- `answer`: Model prediction
- `data_source`: Dataset name

**What happens**:

1. Server validates the format
2. Server skips ground-truth merging (the file already contains it)
3. Server runs the evaluation directly
4. Server adds the result to the leaderboard

## How It Works

### Validation (`app.py::validate_results_file`)

The validator auto-detects the format by checking which fields a sample contains:

```python
# Format 1: Prediction-only
is_prediction_only = "id" in sample and "prediction" in sample

# Format 2: Merged
is_merged = "metadata" in sample and "question" in sample and "answer" in sample
```

Both formats pass validation if they have:

- Valid structure
- Required fields
- At least 5,000 samples
- Valid `qa_type` values

### Evaluation (`app.py::run_evaluation`)

Uses the `evaluation/evaluate_predictions.py` wrapper, which:

1. **Auto-detects the format**:
   - Checks for `id` + `prediction` → Prediction-only
   - Checks for `question` + `gnd` + `struc_info` → Merged
2. **Handles it accordingly**:
   - Prediction-only → Merge with ground-truth first
   - Merged → Use directly
3. **Runs the evaluation**: Calls `evaluate_all_pai.py`
4. **Returns metrics**: 10 metrics across 8 tasks

## Examples

### Example 1: User Submits Predictions

```bash
# User downloads test set from HuggingFace
# User runs inference on their model
# User formats predictions as prediction-only JSON
# User uploads to leaderboard
# Result: Server merges with private GT → evaluates → adds to board
```

### Example 2: Internal Testing with Merged File

```bash
# Developer has complete results.json (with GT)
# Developer uploads to leaderboard for testing
# Result: Server detects merged format → skips merging → evaluates → adds to board
```

## File Size Requirements

- **Minimum samples**: 5,000
- **Full test set**: 6,245 samples
- Files with fewer than 5,000 samples are rejected

## Valid QA Types

- `tal` - Temporal Action Localization
- `stg` - Spatiotemporal Grounding
- `next_action` - Next Action Prediction
- `dense_captioning` / `dense_captioning_gpt` / `dense_captioning_gemini`
- `video_summary` / `video_summary_gpt` / `video_summary_gemini`
- `region_caption` / `region_caption_gpt` / `region_caption_gemini`
- `skill_assessment` - Skill Assessment
- `cvs_assessment` - CVS Assessment

## Testing

### Test with Prediction-Only Format

```bash
# Inspect sample predictions
python -c "
import json
with open('data/sample_predictions.json') as f:
    data = json.load(f)
print('Format: prediction-only')
print(f'Samples: {len(data)}')
print(f'Fields: {list(data[0].keys())}')
"

# Upload to leaderboard (web interface)
# Should show: "✓ Valid predictions file (prediction-only format) with 100 samples"
```

### Test with Merged Format

```bash
# Inspect merged results
python -c "
import json
with open('data/results.json') as f:
    data = json.load(f)
records = list(data.values())
print('Format: merged')
print(f'Samples: {len(records)}')
print(f'Fields: {list(records[0].keys())}')
"

# Upload to leaderboard (web interface)
# Should show: "✓ Valid predictions file (merged format) with 6245 samples"
```

## Error Messages

| Error | Cause | Solution |
|-------|-------|----------|
| Missing required field: 'id' | Wrong format | Prediction-only files need an `id` field; merged files do not and are now accepted |
| Missing required field: 'prediction' | Wrong format | Ensure every prediction-only entry has a `prediction` field |
| Invalid format: Must be either... | Unrecognized structure | Check that the file structure matches one of the two formats |
| Too few samples (X) | Incomplete file | Submit at least 5,000 samples; the full test set has 6,245 |
| Invalid qa_type | Wrong task name | Use one of the valid QA types listed above |

## Implementation Files

- `app.py::validate_results_file()` - Format detection and validation
- `app.py::run_evaluation()` - Uses wrapper for evaluation
- `evaluation/evaluate_predictions.py` - Auto-detection wrapper
- `evaluation/evaluate_all_pai.py` - Core evaluation engine

## Security Notes

- Ground-truth data is stored privately in `data/ground_truth.json`
- It is never exposed to users
- Server-side merging ensures GT integrity
- Users only submit predictions

## Updates (2026-01-13)

- ✅ Added support for merged format in leaderboard
- ✅ Auto-detection for both formats
- ✅ Unified validation and evaluation
- ✅ Both formats now accepted on web interface
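
## Appendix: Format Detection Sketch

The field checks described under Validation, together with the `id` layout from Format 1, can be sketched as a small local pre-check to run before uploading. This is only an illustration and not the code in `app.py`: the helper names `make_sample_id`, `first_sample`, and `detect_format` are hypothetical, and the assumption that a prediction-only file is a JSON list while a merged file is a JSON object keyed by index is taken from the two structure examples.

```python
import json


def make_sample_id(video_id, start_frame, end_frame, fps):
    """Build an id string in the video_id&&start_frame&&end_frame&&fps layout."""
    return f"{video_id}&&{start_frame}&&{end_frame}&&{fps}"


def first_sample(data):
    """Return the first record of a list or dict container, or None if empty."""
    if isinstance(data, list):
        return data[0] if data else None
    if isinstance(data, dict):
        return next(iter(data.values()), None)
    return None


def detect_format(data):
    """Return 'prediction-only', 'merged', or None for unrecognized input.

    Mirrors the field checks quoted in the Validation section; the real
    validator may apply further checks (sample count, qa_type values).
    """
    sample = first_sample(data)
    if not isinstance(sample, dict):
        return None
    if "id" in sample and "prediction" in sample:
        return "prediction-only"
    if "metadata" in sample and "question" in sample and "answer" in sample:
        return "merged"
    return None


if __name__ == "__main__":
    # Example values reuse the identifiers shown in the Format 2 sample.
    predictions = [
        {
            "id": make_sample_id("kcOqlifSukA", 22425, 25124, 1.0),
            "qa_type": "tal",
            "prediction": "0.0-10.0 seconds.",
        }
    ]
    print(json.dumps(predictions, indent=2))
    print(detect_format(predictions))
```

Running the pre-check locally only catches structural problems; sample-count and `qa_type` checks still happen on the server.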