Evaluation
This folder contains code and resources to run experiments and evaluations.
Logistics
To better organize the evaluation folder, we should follow the rules below:
- Each subfolder contains a specific benchmark or experiment. For example, `evaluation/swe_bench` should contain all the preprocessing/evaluation/analysis scripts.
- Raw data and experimental records should not be stored within this repo.
- Model outputs should be stored in this huggingface space for visualization.
- Important data files of manageable size and analysis scripts (e.g., Jupyter notebooks) can be uploaded directly to this repo.
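Since model outputs live outside this repo, a common first step for local analysis is to pull them down from the hosted huggingface space. Below is a minimal sketch using `huggingface_hub`; the repo id and repo type are placeholders for illustration, not the actual hosted location referenced in this README.

```python
# Minimal sketch: download hosted model outputs for local analysis.
# NOTE: the repo id and repo type are hypothetical placeholders --
# substitute the actual hosted huggingface space/repo.
from huggingface_hub import snapshot_download

outputs_dir = snapshot_download(
    repo_id="your-org/evaluation-outputs",  # hypothetical repo id
    repo_type="space",                      # assumption: outputs live in the visualization space
    local_dir="./evaluation_outputs",       # local destination for the snapshot
)
print(f"Model outputs downloaded to: {outputs_dir}")
```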
Supported Benchmarks
- SWE-Bench: `evaluation/swe_bench`
- ML-Bench: `evaluation/ml_bench`
- HumanEvalFix: `evaluation/humanevalfix`
- GAIA: `evaluation/gaia`
- Entity Deduction Arena (EDA): `evaluation/EDA`
- MINT: `evaluation/mint`
- AgentBench: `evaluation/agent_bench`
- BIRD: `evaluation/bird`
- LogicReasoning: `evaluation/logic_reasoning`
Result Visualization
Check this huggingface space for visualization of existing experimental results.
Upload your results
You can fork our huggingface evaluation outputs repo and submit your evaluation results to our hosted huggingface repo via a PR, following the guide here.
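If you prefer to open the PR programmatically instead of through a manual fork, something like the following sketch with `huggingface_hub` can work. The repo id, repo type, and paths are placeholders; follow the linked guide for the actual target repo and expected layout.

```python
# Hedged sketch: submit local evaluation results as a PR against the hosted repo.
# NOTE: repo_id, repo_type, and the paths below are hypothetical placeholders.
from huggingface_hub import HfApi

api = HfApi()  # assumes you are logged in (e.g. via `huggingface-cli login`)
api.upload_folder(
    folder_path="./outputs/swe_bench",      # your local evaluation outputs
    path_in_repo="outputs/swe_bench",       # assumed layout in the hosted repo
    repo_id="your-org/evaluation-outputs",  # hypothetical repo id
    repo_type="space",
    commit_message="Add SWE-Bench evaluation results",
    create_pr=True,                         # open a pull request instead of pushing to main
)
```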