MedVidBench-Leaderboard

Commit History

Fix evaluate_all_pai to pass --skip-llm-judge to task main() functions

18339c0

MedGRPO Team committed on Jan 13

Fix eval_dvc.py main() to support --skip-llm-judge flag

5f41159

MedGRPO Team committed on Jan 13

Fix DVC evaluation to compute temporal F1 when --skip-llm-judge is set

dd1b9c6

MedGRPO Team and Claude Sonnet 4.5 committed on Jan 13

Add semantic similarity matching for Next Action evaluation

a66b9a4

MedGRPO Team and Claude Sonnet 4.5 committed on Jan 13

Fix CVS_acc to use raw accuracy instead of component_balanced_accuracy

6d8dbb2

MedGRPO Team committed on Jan 13

Fix TAL overall metrics computation and extraction

c8f4cad

MedGRPO Team committed on Jan 13

Use STG overall metrics instead of placeholder

d1337a3

MedGRPO Team committed on Jan 13

Fix STG evaluation to extract bbox_dict from struc_info

0b29eca

MedGRPO Team committed on Jan 13

Add placeholder for STG metrics when evaluation returns empty

6d00cf8

MedGRPO Team committed on Jan 13

Fix STG and next_action metric printing in overall mode

049c07c

MedGRPO Team committed on Jan 13

Show evaluation metrics even in silent mode (--grouping overall)

2323031

MedGRPO Team committed on Jan 13

Keep --skip-llm-judge flag, set caption metrics to 0 when skipped

5310708

MedGRPO Team committed on Jan 13

Remove --skip-llm-judge flag for production

3a69282

MedGRPO Team committed on Jan 13

Add process.wait() to ensure returncode is set

77d73db

MedGRPO Team committed on Jan 13

Add matplotlib to requirements

da11adf

MedGRPO Team committed on Jan 13

Filter DVC/VS/RC from tasks list when skip-llm-judge is set

2bd924c

MedGRPO Team committed on Jan 13

Add back --skip-llm-judge for faster testing

ebf8102

MedGRPO Team committed on Jan 13

Remove skip-llm-judge - use semantic similarity fallback

a3fa530

MedGRPO Team committed on Jan 13

Add --skip-llm-judge flag for faster evaluation

3487a07

MedGRPO Team committed on Jan 13

Remove debug messages - system working correctly

80e3e7d

MedGRPO Team committed on Jan 13

Add debug inside loop to verify entry

d53d0f7

MedGRPO Team committed on Jan 13

Add debug at line 741

d878ea1

MedGRPO Team committed on Jan 13

Add debug before task list print

c886e23

MedGRPO Team committed on Jan 13

Remove DEBUG filter - allow debug messages through

2497030

MedGRPO Team committed on Jan 13

Debug before/after imports

4e448a4

MedGRPO Team committed on Jan 13

Add debug at function entry and after analyze

c736ac8

MedGRPO Team committed on Jan 13

Add explicit flush debug

1c45d6c

MedGRPO Team committed on Jan 13

update

f0846a5

MedGRPO Team committed on Jan 13

update

1af117c

MedGRPO Team committed on Jan 13

update

af13c42

MedGRPO Team committed on Jan 13

update

ebac2de

MedGRPO Team committed on Jan 13

update

807bf44

MedGRPO Team committed on Jan 13

add merged format

e1e2d25

MedGRPO Team committed on Jan 13

update

a605ebb

MedGRPO Team committed on Jan 13

update name

04f5f37

MedGRPO Team committed on Jan 9

update

73fd321

MedGRPO Team committed on Jan 8

Add cache/ to gitignore and document test results

74ab719

MedGRPO Team committed on Jan 8

Implement secure ground truth with prediction-only submission format

fe743f5

MedGRPO Team committed on Jan 8

Remove ground_truth.json

d42216c

MedGRPO Team committed on Jan 8

Upload ground_truth.json

59e6091
verified

gaozhongpai committed on Jan 8

Add support for pre-computed LLM judge scores

31817d3

MedGRPO Team and Claude Sonnet 4.5 committed on Jan 7

Make validation flexible: accept both 'answer'/'response' and 'gnd'/'ground_truth' field names

83aad2b

MedGRPO Team committed on Jan 7

Add contact column back to leaderboard display

4510cf8

MedGRPO Team committed on Jan 7

Fix leaderboard display: remove average column, show 10 metrics, fix Tasks tab

45e64f5

MedGRPO Team committed on Jan 7

Update evaluation metrics and leaderboard display

a36b7fe

MedGRPO Team and Claude Sonnet 4.5 committed on Jan 7

Complete evaluator fixes for all 8 tasks

331979f

MedGRPO Team committed on Jan 7

Fix syntax errors and add TAL wrapper functions

8c805bc

MedGRPO Team committed on Jan 7

Fix evaluate_all_pai.py to use eval_caption_llm_judge

58dd6d7

MedGRPO Team committed on Jan 7

Consolidate evaluation scripts and remove hardcoded paths

3ea8a3a

MedGRPO Team committed on Jan 7

Remove all captioning_metrics dependencies

5a5d9ce

MedGRPO Team committed on Jan 7
