Commit History

Fix evaluate_all_pai to pass --skip-llm-judge to task main() functions
18339c0

MedGRPO Team committed on

Fix eval_dvc.py main() to support --skip-llm-judge flag
5f41159

MedGRPO Team committed on

Fix DVC evaluation to compute temporal F1 when --skip-llm-judge is set
dd1b9c6

MedGRPO Team, Claude Sonnet 4.5 committed on

Add semantic similarity matching for Next Action evaluation
a66b9a4

MedGRPO Team, Claude Sonnet 4.5 committed on

Fix CVS_acc to use raw accuracy instead of component_balanced_accuracy
6d8dbb2

MedGRPO Team committed on

Fix TAL overall metrics computation and extraction
c8f4cad

MedGRPO Team committed on

Use STG overall metrics instead of placeholder
d1337a3

MedGRPO Team committed on

Fix STG evaluation to extract bbox_dict from struc_info
0b29eca

MedGRPO Team committed on

Add placeholder for STG metrics when evaluation returns empty
6d00cf8

MedGRPO Team committed on

Fix STG and next_action metric printing in overall mode
049c07c

MedGRPO Team committed on

Show evaluation metrics even in silent mode (--grouping overall)
2323031

MedGRPO Team committed on

Keep --skip-llm-judge flag, set caption metrics to 0 when skipped
5310708

MedGRPO Team committed on

Remove --skip-llm-judge flag for production
3a69282

MedGRPO Team committed on

Add process.wait() to ensure returncode is set
77d73db

MedGRPO Team committed on

Add matplotlib to requirements
da11adf

MedGRPO Team committed on

Filter DVC/VS/RC from tasks list when skip-llm-judge is set
2bd924c

MedGRPO Team committed on

Add back --skip-llm-judge for faster testing
ebf8102

MedGRPO Team committed on

Remove skip-llm-judge - use semantic similarity fallback
a3fa530

MedGRPO Team committed on

Add --skip-llm-judge flag for faster evaluation
3487a07

MedGRPO Team committed on

Remove debug messages - system working correctly
80e3e7d

MedGRPO Team committed on

Add debug inside loop to verify entry
d53d0f7

MedGRPO Team committed on

Add debug at line 741
d878ea1

MedGRPO Team committed on

Add debug before task list print
c886e23

MedGRPO Team committed on

Remove DEBUG filter - allow debug messages through
2497030

MedGRPO Team committed on

Debug before/after imports
4e448a4

MedGRPO Team committed on

Add debug at function entry and after analyze
c736ac8

MedGRPO Team committed on

Add explicit flush debug
1c45d6c

MedGRPO Team committed on

update
f0846a5

MedGRPO Team committed on

update
1af117c

MedGRPO Team committed on

update
af13c42

MedGRPO Team committed on

update
ebac2de

MedGRPO Team committed on

update
807bf44

MedGRPO Team committed on

add merged format
e1e2d25

MedGRPO Team committed on

update
a605ebb

MedGRPO Team committed on

update name
04f5f37

MedGRPO Team committed on

update
73fd321

MedGRPO Team committed on

Add cache/ to gitignore and document test results
74ab719

MedGRPO Team committed on

Implement secure ground truth with prediction-only submission format
fe743f5

MedGRPO Team committed on

Remove ground_truth.json
d42216c

MedGRPO Team committed on

Upload ground_truth.json
59e6091
verified

gaozhongpai committed on

Add support for pre-computed LLM judge scores
31817d3

MedGRPO Team, Claude Sonnet 4.5 committed on

Make validation flexible: accept both 'answer'/'response' and 'gnd'/'ground_truth' field names
83aad2b

MedGRPO Team committed on

Add contact column back to leaderboard display
4510cf8

MedGRPO Team committed on

Fix leaderboard display: remove average column, show 10 metrics, fix Tasks tab
45e64f5

MedGRPO Team committed on

Update evaluation metrics and leaderboard display
a36b7fe

MedGRPO Team, Claude Sonnet 4.5 committed on

Complete evaluator fixes for all 8 tasks
331979f

MedGRPO Team committed on

Fix syntax errors and add TAL wrapper functions
8c805bc

MedGRPO Team committed on

Fix evaluate_all_pai.py to use eval_caption_llm_judge
58dd6d7

MedGRPO Team committed on

Consolidate evaluation scripts and remove hardcoded paths
3ea8a3a

MedGRPO Team committed on

Remove all captioning_metrics dependencies
5a5d9ce

MedGRPO Team committed on