mirror of https://github.com/OpenHands/OpenHands.git synced 2026-03-22 13:47:19 +08:00


GAIA Evaluation

This folder contains the evaluation harness for running agents on the GAIA benchmark.

Configure OpenDevin and your LLM

Create a config.toml file if it does not exist at the root of the workspace. Please check README.md for how to set this up.

Run the evaluation

We use the GAIA dataset hosted on Hugging Face. Please accept the dataset's terms and make sure you are logged in on your machine (run huggingface-cli login) before running the evaluation.

The following is the basic command to start the evaluation. Here we are evaluating on the validation set for the 2023_all split. You can adjust ./evaluation/gaia/scripts/run_infer.sh to change the subset you want to evaluate on.

./evaluation/gaia/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [gaia_subset]
# e.g., ./evaluation/gaia/scripts/run_infer.sh eval_gpt4_1106_preview 0.6.2 CodeActAgent 300

where model_config is mandatory, while git-version, agent, eval_limit and gaia_subset are optional.

model_config, e.g. eval_gpt4_1106_preview, is the config group name for your LLM settings, as defined in your config.toml, defaulting to gpt-3.5-turbo.
git-version, e.g. HEAD, is the git commit hash of the OpenDevin version you would like to evaluate. It can also be a release tag such as 0.6.2.
agent, e.g. CodeActAgent, is the name of the agent to benchmark, defaulting to CodeActAgent.
eval_limit, e.g. 10, limits the evaluation to the first eval_limit instances, defaulting to all instances.
gaia_subset, e.g. 2023_level1, is the GAIA subset to evaluate on. The benchmark has multiple subsets: 2023_level1, 2023_level2, 2023_level3 and 2023_all. It defaults to 2023_level1.

For example,

./evaluation/gaia/scripts/run_infer.sh eval_gpt4_1106_preview 0.6.2 CodeActAgent 10
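To make the positional-argument defaults described above concrete, here is a minimal Python sketch of how they could be resolved. It is not the code in run_infer.sh: the function name resolve_args, the use of None for "not specified", and the validation of the subset name are assumptions made for illustration. The README does not state a default for git-version, so the sketch leaves it unset.

```python
from typing import Dict, List, Optional

# Positional order as listed in this README.
ARG_ORDER = ["model_config", "git_version", "agent", "eval_limit", "gaia_subset"]

# Subset names as listed in this README.
SUBSETS = ("2023_level1", "2023_level2", "2023_level3", "2023_all")

# Defaults as described in this README. None means "not specified":
# for eval_limit that is "all instances"; for git_version the README
# states no default, so None here is an assumption of this sketch.
DEFAULTS: Dict[str, Optional[str]] = {
    "model_config": None,
    "git_version": None,
    "agent": "CodeActAgent",
    "eval_limit": None,
    "gaia_subset": "2023_level1",
}


def resolve_args(argv: List[str]) -> Dict[str, Optional[str]]:
    """Map positional arguments to names, filling in documented defaults.

    Hypothetical helper: it mirrors the README's description only.
    """
    if not argv or not argv[0]:
        raise ValueError("model_config is mandatory")
    if len(argv) > len(ARG_ORDER):
        raise ValueError("too many arguments")

    resolved = dict(DEFAULTS)
    for name, value in zip(ARG_ORDER, argv):
        if value:  # an empty string is treated as "not supplied"
            resolved[name] = value

    if resolved["gaia_subset"] not in SUBSETS:
        raise ValueError(f"unknown GAIA subset: {resolved['gaia_subset']}")
    return resolved


if __name__ == "__main__":
    # Same arguments as the example command above; gaia_subset is omitted.
    args = resolve_args(["eval_gpt4_1106_preview", "0.6.2", "CodeActAgent", "10"])
    for name in ARG_ORDER:
        print(f"{name} = {args[name]}")
```

With the four arguments from the example command, gaia_subset is not supplied and so falls back to 2023_level1 in this sketch.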

Get score

Then you can get stats by running the following command:

python ./evaluation/gaia/get_score.py \
--file <path_to/output.json>
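As a rough illustration of the kind of statistic such a script can report, the sketch below tallies a success rate from result records. It is not the implementation of get_score.py. The record layout is an assumption: one JSON object per line, with a hypothetical boolean at test_result.score. Check the actual output file before relying on these field names.

```python
import json
from typing import Iterable, Tuple


def success_rate(lines: Iterable[str]) -> Tuple[int, int, float]:
    """Count passed and total records and return (passed, total, rate).

    Assumed layout (hypothetical): each non-empty line is a JSON object
    with a truthy or falsy value at record["test_result"]["score"].
    """
    passed = 0
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        total += 1
        if record.get("test_result", {}).get("score"):
            passed += 1
    rate = passed / total if total else 0.0
    return passed, total, rate


if __name__ == "__main__":
    # In-memory sample records standing in for an output file.
    sample = [
        '{"instance_id": "a", "test_result": {"score": true}}',
        '{"instance_id": "b", "test_result": {"score": false}}',
        '{"instance_id": "c", "test_result": {"score": true}}',
    ]
    passed, total, rate = success_rate(sample)
    print(f"passed {passed} of {total} ({rate:.2%})")
```

To run it against a real file, pass an open file handle instead of the sample list, since iterating a text file yields one line at a time.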