mirror of https://github.com/OpenHands/OpenHands.git synced 2025-12-26 05:48:36 +08:00

README.md

Evaluation

This folder contains code and resources to run experiments and evaluations.

Logistics

To better organize the evaluation folder, we should follow the rules below:

Each subfolder contains a specific benchmark or experiment. For example, evaluation/swe_bench should contain all the preprocessing, evaluation, and analysis scripts for SWE-Bench.
Raw data and experimental records should not be stored in this repo.
- Model outputs should be stored in this Hugging Face space for visualization.
Important data files of manageable size and analysis scripts (e.g., Jupyter notebooks) can be uploaded directly to this repo.

Supported Benchmarks

SWE-Bench: evaluation/swe_bench
ML-Bench: evaluation/ml_bench
HumanEvalFix: evaluation/humanevalfix
GAIA: evaluation/gaia
Entity Deduction Arena (EDA): evaluation/EDA
MINT: evaluation/mint
AgentBench: evaluation/agent_bench
BIRD: evaluation/bird
LogicReasoning: evaluation/logic_reasoning
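
As an illustration of the one-folder-per-benchmark layout, the list above can be written as a small lookup table. This is only a sketch: the names `SUPPORTED_BENCHMARKS`, `benchmark_dir`, and `missing_benchmark_dirs` are hypothetical helpers and are not part of this repo; the paths are copied from the list above.

```python
from pathlib import Path

# Benchmark name -> folder, copied from the "Supported Benchmarks" list.
# The dictionary itself is a hypothetical helper, not something the repo ships.
SUPPORTED_BENCHMARKS = {
    "SWE-Bench": "evaluation/swe_bench",
    "ML-Bench": "evaluation/ml_bench",
    "HumanEvalFix": "evaluation/humanevalfix",
    "GAIA": "evaluation/gaia",
    "EDA": "evaluation/EDA",
    "MINT": "evaluation/mint",
    "AgentBench": "evaluation/agent_bench",
    "BIRD": "evaluation/bird",
    "LogicReasoning": "evaluation/logic_reasoning",
}


def benchmark_dir(name: str, repo_root: str = ".") -> Path:
    """Return the folder for a benchmark, matching the name case-insensitively."""
    lookup = {key.lower(): rel for key, rel in SUPPORTED_BENCHMARKS.items()}
    try:
        return Path(repo_root) / lookup[name.lower()]
    except KeyError:
        raise ValueError(f"Unknown benchmark: {name!r}") from None


def missing_benchmark_dirs(repo_root: str = ".") -> list[str]:
    """List benchmarks whose folder does not exist under repo_root."""
    root = Path(repo_root)
    return [
        name
        for name, rel in SUPPORTED_BENCHMARKS.items()
        if not (root / rel).is_dir()
    ]
```

Calling `missing_benchmark_dirs()` from the repository root would report any listed benchmark whose folder is absent, which is one way to keep this list and the directory tree in sync.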

Result Visualization

See this Hugging Face space for visualizations of existing experimental results.

Upload your results

To share your results, fork our Hugging Face evaluation outputs and submit your evaluation results as a PR to our hosted Hugging Face repo, following the guide here.