# MedVidBench Leaderboard Supported Formats

## Overview

The leaderboard web app now accepts **two submission formats**:

1. **Prediction-only** (preferred for users)
2. **Merged format** (for testing/debugging)

Both formats are automatically detected and handled by the evaluation system.

## Format 1: Prediction-Only (User Submission)

**Recommended for**: Public user submissions

**Structure**:
```json
[
  {
    "id": "video_id&&start_frame&&end_frame&&fps",
    "qa_type": "tal",
    "prediction": "0.0-10.0 seconds."
  },
  {
    "id": "another_video&&0&&100&&1.0",
    "qa_type": "video_summary",
    "prediction": "The surgeon performs cholecystectomy..."
  }
]
```

**Required fields**:
- `id`: Sample identifier in the form `video_id&&start_frame&&end_frame&&fps`
- `qa_type`: Task type
- `prediction`: Model's answer text
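
As a minimal sketch of how the `id` field can be built and split, assuming the four components are joined with `&&` in the order shown in the structure above. The helper names `make_sample_id` and `parse_sample_id` are hypothetical and not part of the leaderboard code:

```python
def make_sample_id(video_id, start_frame, end_frame, fps) -> str:
    """Join the four components with '&&' (order assumed from the example above)."""
    return "&&".join(str(part) for part in (video_id, start_frame, end_frame, fps))


def parse_sample_id(sample_id: str) -> dict:
    """Split an id back into its components; all values stay strings."""
    parts = sample_id.split("&&")
    if len(parts) != 4:
        raise ValueError(
            f"expected 4 '&&'-separated parts, got {len(parts)}: {sample_id!r}"
        )
    video_id, start_frame, end_frame, fps = parts
    return {
        "video_id": video_id,
        "start_frame": start_frame,
        "end_frame": end_frame,
        "fps": fps,
    }


# One prediction-only record, mirroring the second sample in the structure above.
record = {
    "id": make_sample_id("another_video", 0, 100, 1.0),
    "qa_type": "video_summary",
    "prediction": "The surgeon performs cholecystectomy...",
}
```

Keeping the parsed values as strings avoids changing `1.0` into `1` or similar when an id is rebuilt for comparison.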

**What happens**:
1. The server validates the format.
2. The server merges the predictions with the private ground truth.
3. The server runs the evaluation.
4. The result is added to the leaderboard.

## Format 2: Merged (Internal/Testing)

**Recommended for**: Internal testing, debugging

**Structure**:
```json
{
  "0": {
    "metadata": {
      "video_id": "kcOqlifSukA",
      "fps": "1.0",
      "input_video_start_frame": "22425",
      "input_video_end_frame": "25124"
    },
    "qa_type": "tal",
    "struc_info": [
      {
        "action": "cutting",
        "spans": [{"start": 0.0, "end": 10.0}]
      }
    ],
    "question": "When does cutting happen?",
    "gnd": "0.0-10.0 seconds.",
    "answer": "0.0-10.0 seconds.",
    "data_source": "AVOS"
  }
}
```

**Required fields**:
- `metadata`: Video metadata
- `qa_type`: Task type
- `struc_info`: Structured ground-truth
- `question`: Question text
- `gnd`: Ground-truth answer
- `answer`: Model prediction
- `data_source`: Dataset name

**What happens**:
1. The server validates the format.
2. The server skips ground-truth merging, because the file already contains it.
3. The server runs the evaluation directly.
4. The result is added to the leaderboard.

## How It Works

### Validation (`app.py::validate_results_file`)

The validator auto-detects format by checking fields:

```python
# Format 1: Prediction-only
is_prediction_only = "id" in sample and "prediction" in sample

# Format 2: Merged
is_merged = "metadata" in sample and "question" in sample and "answer" in sample
```

Both formats pass validation if they have:
- Valid structure
- Required fields
- ≥5000 samples
- Valid qa_types
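
The field checks above can be wrapped into a small helper. This is only a sketch under stated assumptions, not the actual body of `validate_results_file`: the names `detect_format` and `load_samples` are hypothetical, and the container types (a JSON list for prediction-only, a JSON object keyed by index for merged) are taken from the structure examples in this document.

```python
from typing import Any


def detect_format(sample: dict) -> str:
    """Classify one sample using the same field checks as the snippet above."""
    if "id" in sample and "prediction" in sample:
        return "prediction-only"
    if "metadata" in sample and "question" in sample and "answer" in sample:
        return "merged"
    # The wording of this message is illustrative, not the server's real error text.
    raise ValueError("unrecognized sample structure")


def load_samples(data: Any) -> list:
    """Return the samples as a list, whichever container the file used."""
    if isinstance(data, dict):   # merged: {"0": {...}, "1": {...}}
        return list(data.values())
    if isinstance(data, list):   # prediction-only: [{...}, {...}]
        return data
    raise ValueError("top-level JSON value must be a list or an object")
```

A caller would typically run `detect_format` on every sample from `load_samples(json.load(f))` and reject files that mix the two formats.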

### Evaluation (`app.py::run_evaluation`)

Evaluation goes through the `evaluation/evaluate_predictions.py` wrapper, which:

1. **Auto-detects format**:
   - Checks for `id` + `prediction` → Prediction-only
   - Checks for `question` + `gnd` + `struc_info` → Merged

2. **Handles accordingly**:
   - Prediction-only → Merge with ground-truth first
   - Merged → Use directly

3. **Runs evaluation**: Calls `evaluate_all_pai.py`

4. **Returns metrics**: 10 metrics across 8 tasks
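
The merge step for prediction-only files can be sketched as follows. This is an illustration only, not the wrapper's actual code: it assumes the private ground truth is stored in the merged layout without `answer`, that a sample's `id` is built from its `metadata` fields in the order `video_id`, `input_video_start_frame`, `input_video_end_frame`, `fps`, and that `(id, qa_type)` identifies a sample uniquely. The function name `merge_with_ground_truth` is hypothetical.

```python
ID_FIELDS = ("video_id", "input_video_start_frame", "input_video_end_frame", "fps")


def merge_with_ground_truth(predictions: list, ground_truth: dict) -> dict:
    """Attach each prediction to its ground-truth record as the 'answer' field."""

    def gt_key(record: dict) -> tuple:
        # Rebuild the '&&'-joined id from metadata (assumed field order).
        meta = record["metadata"]
        sample_id = "&&".join(str(meta[name]) for name in ID_FIELDS)
        return sample_id, record["qa_type"]

    lookup = {gt_key(record): record for record in ground_truth.values()}

    merged = {}
    for index, pred in enumerate(predictions):
        key = (pred["id"], pred["qa_type"])
        if key not in lookup:
            raise ValueError(f"no ground-truth record for {key}")
        # Copy the ground-truth record and add the model's answer.
        merged[str(index)] = {**lookup[key], "answer": pred["prediction"]}
    return merged
```

The result has the same shape as a merged-format file, so both submission formats can be passed to the same evaluation code afterwards.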

## Examples

### Example 1: User Submits Predictions

```bash
# User downloads test set from HuggingFace
# User runs inference on their model
# User formats predictions as prediction-only JSON
# User uploads to leaderboard

# Result: Server merges with private GT → evaluates → adds to board
```

### Example 2: Internal Testing with Merged File

```bash
# Developer has complete results.json (with GT)
# Developer uploads to leaderboard for testing

# Result: Server detects merged format → skips merging → evaluates → adds to board
```

## File Size Requirements

- **Minimum samples**: 5,000
- **Full test set**: 6,245 samples
- Files with fewer than 5,000 samples are rejected

## Valid QA Types

- `tal` - Temporal Action Localization
- `stg` - Spatiotemporal Grounding
- `next_action` - Next Action Prediction
- `dense_captioning` / `dense_captioning_gpt` / `dense_captioning_gemini`
- `video_summary` / `video_summary_gpt` / `video_summary_gemini`
- `region_caption` / `region_caption_gpt` / `region_caption_gemini`
- `skill_assessment` - Skill Assessment
- `cvs_assessment` - CVS Assessment
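
A pre-upload check for the sample count and the `qa_type` values might look like the sketch below. The names `MIN_SAMPLES`, `VALID_QA_TYPES`, and `check_submission` are hypothetical, and the message strings only imitate the server's errors. The set is built from the list above, with the `_gpt` and `_gemini` variants added for the three captioning tasks.

```python
MIN_SAMPLES = 5000  # minimum stated under "File Size Requirements"

VALID_QA_TYPES = {"tal", "stg", "next_action", "skill_assessment", "cvs_assessment"}
for task in ("dense_captioning", "video_summary", "region_caption"):
    VALID_QA_TYPES.update({task, f"{task}_gpt", f"{task}_gemini"})


def check_submission(samples: list, min_samples: int = MIN_SAMPLES) -> list:
    """Return a list of problems found; an empty list means the checks passed."""
    errors = []
    if len(samples) < min_samples:
        errors.append(f"Too few samples ({len(samples)})")
    # Collect every qa_type that is not in the valid set (missing ones show as None).
    found = {sample.get("qa_type") for sample in samples}
    invalid = sorted(found - VALID_QA_TYPES, key=str)
    if invalid:
        errors.append(f"Invalid qa_type: {invalid}")
    return errors
```

Running this locally before uploading catches the two most common rejections without a round trip to the server.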

## Testing

### Test with Prediction-Only Format

```bash
# Inspect the sample predictions file
python -c "
import json
with open('data/sample_predictions.json') as f:
    data = json.load(f)
print('Format: prediction-only')
print(f'Samples: {len(data)}')
print(f'Fields: {list(data[0].keys())}')
"

# Upload to leaderboard (web interface)
# Should show: "✓ Valid predictions file (prediction-only format) with 100 samples"
```

### Test with Merged Format

```bash
# Check merged format
python -c "
import json
with open('data/results.json') as f:
    data = json.load(f)
records = list(data.values())
print('Format: merged')
print(f'Samples: {len(records)}')
print(f'Fields: {list(records[0].keys())}')
"

# Upload to leaderboard (web interface)
# Should show: "✓ Valid predictions file (merged format) with 6245 samples"
```

## Error Messages

| Error | Cause | Solution |
|-------|-------|----------|
| Missing required field: 'id' | Wrong format | Merged files are now accepted; for prediction-only files, add the `id` field to every sample |
| Missing required field: 'prediction' | Wrong format | Ensure every prediction-only sample has a `prediction` field |
| Invalid format: Must be either... | Unrecognized structure | Check that the file structure matches one of the two formats |
| Too few samples (X) | Incomplete file | Submit at least 5,000 samples; the full test set has 6,245 |
| Invalid qa_type | Wrong task name | Use one of the valid `qa_type` values listed above |

## Implementation Files

- `app.py::validate_results_file()` - Format detection and validation
- `app.py::run_evaluation()` - Uses wrapper for evaluation
- `evaluation/evaluate_predictions.py` - Auto-detection wrapper
- `evaluation/evaluate_all_pai.py` - Core evaluation engine

## Security Notes

- Ground-truth data is stored privately in `data/ground_truth.json`.
- It is never exposed to users.
- Merging happens on the server, which protects ground-truth integrity.
- Users submit only predictions.

## Updates (2026-01-13)

- ✅ Added support for merged format in leaderboard
- ✅ Auto-detection for both formats
- ✅ Unified validation and evaluation
- ✅ Both formats now accepted on web interface