Fix issue #5222: [Refactor]: Refactor the evaluation directory (#5223)

Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Authored by OpenHands on 2024-11-25 08:35:52 -05:00, committed by GitHub
parent 1725627c7d
commit 678436da30
152 changed files with 147 additions and 143 deletions


@ -84,12 +84,12 @@ jobs:
EVAL_DOCKER_IMAGE_PREFIX: us-central1-docker.pkg.dev/evaluation-092424/swe-bench-images
run: |
poetry run ./evaluation/swe_bench/scripts/run_infer.sh llm.eval HEAD CodeActAgent 300 30 $N_PROCESSES "princeton-nlp/SWE-bench_Lite" test
poetry run ./evaluation/benchmarks/swe_bench/scripts/run_infer.sh llm.eval HEAD CodeActAgent 300 30 $N_PROCESSES "princeton-nlp/SWE-bench_Lite" test
OUTPUT_FOLDER=$(find evaluation/evaluation_outputs/outputs/princeton-nlp__SWE-bench_Lite-test/CodeActAgent -name "deepseek-chat_maxiter_50_N_*-no-hint-run_1" -type d | head -n 1)
echo "OUTPUT_FOLDER for SWE-bench evaluation: $OUTPUT_FOLDER"
poetry run ./evaluation/swe_bench/scripts/eval_infer_remote.sh $OUTPUT_FOLDER/output.jsonl $N_PROCESSES "princeton-nlp/SWE-bench_Lite" test
poetry run ./evaluation/benchmarks/swe_bench/scripts/eval_infer_remote.sh $OUTPUT_FOLDER/output.jsonl $N_PROCESSES "princeton-nlp/SWE-bench_Lite" test
poetry run ./evaluation/swe_bench/scripts/eval/summarize_outputs.py $OUTPUT_FOLDER/output.jsonl > summarize_outputs.log 2>&1
poetry run ./evaluation/benchmarks/swe_bench/scripts/eval/summarize_outputs.py $OUTPUT_FOLDER/output.jsonl > summarize_outputs.log 2>&1
echo "SWEBENCH_REPORT<<EOF" >> $GITHUB_ENV
cat summarize_outputs.log >> $GITHUB_ENV
echo "EOF" >> $GITHUB_ENV


@ -76,7 +76,7 @@ The `run_controller()` function is the core of OpenHands's execution. It manages
## The easiest way to get started: Exploring existing benchmarks
We encourage you to review the various evaluation benchmarks available in the [`evaluation/` directory](https://github.com/All-Hands-AI/OpenHands/blob/main/evaluation) of our repository.
We encourage you to review the various evaluation benchmarks available in the [`evaluation/benchmarks/` directory](https://github.com/All-Hands-AI/OpenHands/blob/main/evaluation/benchmarks) of our repository.
To integrate your own benchmark, we suggest starting with the one that most closely resembles your needs. This approach can significantly streamline your integration process, allowing you to build upon existing structures and adapt them to your specific requirements.


@ -73,7 +73,7 @@ The main entry point for OpenHands is in `openhands/core/main.py`. Here is how it works
## The easiest way to get started: Exploring existing benchmarks
We encourage you to review the various evaluation benchmarks available in the [`evaluation/` directory](https://github.com/All-Hands-AI/OpenHands/blob/main/evaluation) of our repository.
We encourage you to review the various evaluation benchmarks available in the [`evaluation/benchmarks/` directory](https://github.com/All-Hands-AI/OpenHands/blob/main/evaluation/benchmarks) of our repository.
To integrate your own benchmark, we suggest starting with the one that most closely resembles your needs. This approach can significantly streamline your integration process, allowing you to build upon existing structures and adapt them to your specific requirements.


@ -73,7 +73,7 @@ The `run_controller()` function is the core of OpenHands's execution. It manages
## Easiest way to get started: Exploring Existing Benchmarks
We encourage you to review the various evaluation benchmarks available in the [`evaluation/` directory](https://github.com/All-Hands-AI/OpenHands/blob/main/evaluation) of our repository.
We encourage you to review the various evaluation benchmarks available in the [`evaluation/benchmarks/` directory](https://github.com/All-Hands-AI/OpenHands/blob/main/evaluation/benchmarks) of our repository.
To integrate your own benchmark, we suggest starting with the one that most closely resembles your needs. This approach can significantly streamline your integration process, allowing you to build upon existing structures and adapt them to your specific requirements.


@ -46,28 +46,32 @@ The OpenHands evaluation harness supports a wide variety of benchmarks across so
### Software Engineering
- SWE-Bench: [`evaluation/swe_bench`](./swe_bench)
- HumanEvalFix: [`evaluation/humanevalfix`](./humanevalfix)
- BIRD: [`evaluation/bird`](./bird)
- BioCoder: [`evaluation/biocoder`](./biocoder)
- ML-Bench: [`evaluation/ml_bench`](./ml_bench)
- APIBench: [`evaluation/gorilla`](./gorilla/)
- ToolQA: [`evaluation/toolqa`](./toolqa/)
- AiderBench: [`evaluation/aider_bench`](./aider_bench/)
- SWE-Bench: [`evaluation/benchmarks/swe_bench`](./benchmarks/swe_bench)
- HumanEvalFix: [`evaluation/benchmarks/humanevalfix`](./benchmarks/humanevalfix)
- BIRD: [`evaluation/benchmarks/bird`](./benchmarks/bird)
- BioCoder: [`evaluation/benchmarks/biocoder`](./benchmarks/biocoder)
- ML-Bench: [`evaluation/benchmarks/ml_bench`](./benchmarks/ml_bench)
- APIBench: [`evaluation/benchmarks/gorilla`](./benchmarks/gorilla/)
- ToolQA: [`evaluation/benchmarks/toolqa`](./benchmarks/toolqa/)
- AiderBench: [`evaluation/benchmarks/aider_bench`](./benchmarks/aider_bench/)
- Commit0: [`evaluation/benchmarks/commit0_bench`](./benchmarks/commit0_bench/)
- DiscoveryBench: [`evaluation/benchmarks/discoverybench`](./benchmarks/discoverybench/)
### Web Browsing
- WebArena: [`evaluation/webarena`](./webarena/)
- MiniWob++: [`evaluation/miniwob`](./miniwob/)
- WebArena: [`evaluation/benchmarks/webarena`](./benchmarks/webarena/)
- MiniWob++: [`evaluation/benchmarks/miniwob`](./benchmarks/miniwob/)
- Browsing Delegation: [`evaluation/benchmarks/browsing_delegation`](./benchmarks/browsing_delegation/)
### Misc. Assistance
- GAIA: [`evaluation/gaia`](./gaia)
- GPQA: [`evaluation/gpqa`](./gpqa)
- AgentBench: [`evaluation/agent_bench`](./agent_bench)
- MINT: [`evaluation/mint`](./mint)
- Entity deduction Arena (EDA): [`evaluation/EDA`](./EDA)
- ProofWriter: [`evaluation/logic_reasoning`](./logic_reasoning)
- GAIA: [`evaluation/benchmarks/gaia`](./benchmarks/gaia)
- GPQA: [`evaluation/benchmarks/gpqa`](./benchmarks/gpqa)
- AgentBench: [`evaluation/benchmarks/agent_bench`](./benchmarks/agent_bench)
- MINT: [`evaluation/benchmarks/mint`](./benchmarks/mint)
- Entity deduction Arena (EDA): [`evaluation/benchmarks/EDA`](./benchmarks/EDA)
- ProofWriter: [`evaluation/benchmarks/logic_reasoning`](./benchmarks/logic_reasoning)
- ScienceAgentBench: [`evaluation/benchmarks/scienceagentbench`](./benchmarks/scienceagentbench)
## Result Visualization
@ -79,7 +83,7 @@ You can start your own fork of [our huggingface evaluation outputs](https://hugg
To learn more about how to integrate your benchmark into OpenHands, check out the [tutorial here](https://docs.all-hands.dev/modules/usage/how-to/evaluation-harness). Briefly,
- Each subfolder contains a specific benchmark or experiment. For example, `evaluation/swe_bench` should contain
- Each subfolder contains a specific benchmark or experiment. For example, `evaluation/benchmarks/swe_bench` should contain
all the preprocessing/evaluation/analysis scripts.
- Raw data and experimental records should not be stored within this repo.
- Model outputs should be stored at [this huggingface space](https://huggingface.co/spaces/OpenHands/evaluation) for visualization.
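To illustrate the layout these notes imply, here is a minimal, hypothetical scaffold for a new benchmark under the refactored tree; the folder name `my_bench` and the exact file set are placeholders rather than anything this commit adds, and existing benchmarks such as `evaluation/benchmarks/swe_bench` are the authoritative reference.

```bash
# Hypothetical scaffold only -- `my_bench` does not exist in the repository.
BENCH=evaluation/benchmarks/my_bench
mkdir -p "$BENCH/scripts"
touch "$BENCH/README.md"            # setup and usage instructions
touch "$BENCH/run_infer.py"         # inference entry point
touch "$BENCH/scripts/run_infer.sh" # thin wrapper that runs run_infer.py via poetry
```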


@ -12,7 +12,7 @@ Please follow instruction [here](../README.md#setup) to setup your local develop
```bash
export OPENAI_API_KEY="sk-XXX"; # This is required for evaluation (to simulate the other party in the conversation)
./evaluation/EDA/scripts/run_infer.sh [model_config] [git-version] [agent] [dataset] [eval_limit]
./evaluation/benchmarks/EDA/scripts/run_infer.sh [model_config] [git-version] [agent] [dataset] [eval_limit]
```
where `model_config` is mandatory, while `git-version`, `agent`, `dataset` and `eval_limit` are optional.
@ -33,7 +33,7 @@ to `CodeActAgent`.
For example,
```bash
./evaluation/EDA/scripts/run_infer.sh eval_gpt4o_2024_05_13 0.6.2 CodeActAgent things
./evaluation/benchmarks/EDA/scripts/run_infer.sh eval_gpt4o_2024_05_13 0.6.2 CodeActAgent things
```
## Reference


@ -4,7 +4,7 @@ import os
import pandas as pd
from datasets import load_dataset
from evaluation.EDA.game import Q20Game, Q20GameCelebrity
from evaluation.benchmarks.EDA.game import Q20Game, Q20GameCelebrity
from evaluation.utils.shared import (
EvalMetadata,
EvalOutput,


@ -43,7 +43,7 @@ echo "AGENT_VERSION: $AGENT_VERSION"
echo "MODEL_CONFIG: $MODEL_CONFIG"
echo "DATASET: $DATASET"
COMMAND="poetry run python evaluation/EDA/run_infer.py \
COMMAND="poetry run python evaluation/benchmarks/EDA/run_infer.py \
--agent-cls $AGENT \
--llm-config $MODEL_CONFIG \
--dataset $DATASET \


@ -9,7 +9,7 @@ Please follow instruction [here](../README.md#setup) to setup your local develop
## Start the evaluation
```bash
./evaluation/agent_bench/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit]
./evaluation/benchmarks/agent_bench/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit]
```
- `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for your
@ -25,7 +25,7 @@ in order to use `eval_limit`, you must also set `agent`.
Following is the basic command to start the evaluation.
You can update the arguments in the script `evaluation/agent_bench/scripts/run_infer.sh`, such as `--max-iterations`, `--eval-num-workers` and so on.
You can update the arguments in the script `evaluation/benchmarks/agent_bench/scripts/run_infer.sh`, such as `--max-iterations`, `--eval-num-workers` and so on.
- `--agent-cls`, the agent to use. For example, `CodeActAgent`.
- `--llm-config`: the LLM configuration to use. For example, `eval_gpt4_1106_preview`.
@ -34,5 +34,5 @@ You can update the arguments in the script `evaluation/agent_bench/scripts/run_i
- `--eval-n-limit`: the number of examples to evaluate. For example, `100`.
```bash
./evaluation/agent_bench/scripts/run_infer.sh eval_gpt35_turbo HEAD CodeActAgent 1
./evaluation/benchmarks/agent_bench/scripts/run_infer.sh eval_gpt35_turbo HEAD CodeActAgent 1
```


@ -7,7 +7,7 @@ from typing import Any
import pandas as pd
from datasets import load_dataset
from evaluation.agent_bench.helper import (
from evaluation.benchmarks.agent_bench.helper import (
FAKE_RESPONSES,
INST_SUFFIXES,
compare_results,


@ -26,7 +26,7 @@ echo "AGENT: $AGENT"
echo "AGENT_VERSION: $AGENT_VERSION"
echo "MODEL_CONFIG: $MODEL_CONFIG"
COMMAND="export PYTHONPATH=evaluation/agent_bench:\$PYTHONPATH && poetry run python evaluation/agent_bench/run_infer.py \
COMMAND="export PYTHONPATH=evaluation/benchmarks/agent_bench:\$PYTHONPATH && poetry run python evaluation/benchmarks/agent_bench/run_infer.py \
--agent-cls $AGENT \
--llm-config $MODEL_CONFIG \
--max-iterations 30 \


@ -16,7 +16,7 @@ development environment and LLM.
## Start the evaluation
```bash
./evaluation/aider_bench/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [eval-num-workers] [eval_ids]
./evaluation/benchmarks/aider_bench/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [eval-num-workers] [eval_ids]
```
- `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for
@ -42,7 +42,7 @@ export SKIP_NUM=12 # skip the first 12 instances from the dataset
Following is the basic command to start the evaluation.
You can update the arguments in the script
`evaluation/aider_bench/scripts/run_infer.sh`, such as `--max-iterations`,
`evaluation/benchmarks/aider_bench/scripts/run_infer.sh`, such as `--max-iterations`,
`--eval-num-workers` and so on:
- `--agent-cls`, the agent to use. For example, `CodeActAgent`.
@ -53,7 +53,7 @@ You can update the arguments in the script
- `--eval-ids`: the IDs of the examples to evaluate (comma separated). For example, `"1,3,10"`.
```bash
./evaluation/aider_bench/scripts/run_infer.sh eval_gpt35_turbo HEAD CodeActAgent 100 1 "1,3,10"
./evaluation/benchmarks/aider_bench/scripts/run_infer.sh eval_gpt35_turbo HEAD CodeActAgent 100 1 "1,3,10"
```
### Run Inference on `RemoteRuntime` (experimental)
@ -61,25 +61,25 @@ You can update the arguments in the script
This is in limited beta. Contact Xingyao over slack if you want to try this out!
```bash
./evaluation/aider_bench/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [eval-num-workers] [eval_ids]
./evaluation/benchmarks/aider_bench/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [eval-num-workers] [eval_ids]
# Example - This runs evaluation on CodeActAgent for 133 instances on the aider_bench test set, with 2 workers running in parallel
export ALLHANDS_API_KEY="YOUR-API-KEY"
export RUNTIME=remote
export SANDBOX_REMOTE_RUNTIME_API_URL="https://runtime.eval.all-hands.dev"
./evaluation/aider_bench/scripts/run_infer.sh llm.eval HEAD CodeActAgent 133 2
./evaluation/benchmarks/aider_bench/scripts/run_infer.sh llm.eval HEAD CodeActAgent 133 2
```
## Summarize Results
```bash
poetry run python ./evaluation/aider_bench/scripts/summarize_results.py [path_to_output_jsonl_file]
poetry run python ./evaluation/benchmarks/aider_bench/scripts/summarize_results.py [path_to_output_jsonl_file]
```
Full example:
```bash
poetry run python ./evaluation/aider_bench/scripts/summarize_results.py evaluation/evaluation_outputs/outputs/AiderBench/CodeActAgent/claude-3-5-sonnet@20240620_maxiter_30_N_v1.9/output.jsonl
poetry run python ./evaluation/benchmarks/aider_bench/scripts/summarize_results.py evaluation/evaluation_outputs/outputs/AiderBench/CodeActAgent/claude-3-5-sonnet@20240620_maxiter_30_N_v1.9/output.jsonl
```
This will list the instances that passed and the instances that failed. For each


@ -7,7 +7,7 @@ from typing import Any
import pandas as pd
from datasets import load_dataset
from evaluation.aider_bench.helper import (
from evaluation.benchmarks.aider_bench.helper import (
FAKE_RESPONSES,
INST_SUFFIXES,
INSTRUCTIONS_ADDENDUM,


@ -39,7 +39,7 @@ if [ "$USE_UNIT_TESTS" = true ]; then
EVAL_NOTE=$EVAL_NOTE-w-test
fi
COMMAND="export PYTHONPATH=evaluation/aider_bench:\$PYTHONPATH && poetry run python evaluation/aider_bench/run_infer.py \
COMMAND="export PYTHONPATH=evaluation/benchmarks/aider_bench:\$PYTHONPATH && poetry run python evaluation/benchmarks/aider_bench/run_infer.py \
--agent-cls $AGENT \
--llm-config $MODEL_CONFIG \
--max-iterations 30 \


@ -21,7 +21,7 @@ To reproduce this image, please see the Dockerfile_Openopenhands in the `biocode
```bash
./evaluation/biocoder/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit]
./evaluation/benchmarks/biocoder/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit]
```
where `model_config` is mandatory, while `git-version`, `agent`, `dataset` and `eval_limit` are optional.
@ -43,7 +43,7 @@ with current OpenHands version, then your command would be:
## Examples
```bash
./evaluation/biocoder/scripts/run_infer.sh eval_gpt4o_2024_05_13 HEAD CodeActAgent 1
./evaluation/benchmarks/biocoder/scripts/run_infer.sh eval_gpt4o_2024_05_13 HEAD CodeActAgent 1
```
## Reference


@ -8,7 +8,7 @@ from typing import Any
import pandas as pd
from datasets import load_dataset
from evaluation.biocoder.utils import BiocoderData
from evaluation.benchmarks.biocoder.utils import BiocoderData
from evaluation.utils.shared import (
EvalMetadata,
EvalOutput,


@ -28,7 +28,7 @@ echo "AGENT_VERSION: $AGENT_VERSION"
echo "MODEL_CONFIG: $MODEL_CONFIG"
echo "DATASET: $DATASET"
COMMAND="poetry run python evaluation/biocoder/run_infer.py \
COMMAND="poetry run python evaluation/benchmarks/biocoder/run_infer.py \
--agent-cls $AGENT \
--llm-config $MODEL_CONFIG \
--max-iterations 10 \


@ -9,7 +9,7 @@ Please follow instruction [here](../README.md#setup) to setup your local develop
## Run Inference on Bird
```bash
./evaluation/bird/scripts/run_infer.sh [model_config] [git-version]
./evaluation/benchmarks/bird/scripts/run_infer.sh [model_config] [git-version]
```
- `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for your


@ -26,7 +26,7 @@ echo "AGENT: $AGENT"
echo "AGENT_VERSION: $AGENT_VERSION"
echo "MODEL_CONFIG: $MODEL_CONFIG"
COMMAND="poetry run python evaluation/bird/run_infer.py \
COMMAND="poetry run python evaluation/benchmarks/bird/run_infer.py \
--agent-cls $AGENT \
--llm-config $MODEL_CONFIG \
--max-iterations 5 \


@ -12,7 +12,7 @@ Please follow instruction [here](../README.md#setup) to setup your local develop
## Run Inference
```bash
./evaluation/browsing_delegation/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit]
./evaluation/benchmarks/browsing_delegation/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit]
# e.g., ./evaluation/benchmarks/browsing_delegation/scripts/run_infer.sh llm.eval_gpt4_1106_preview_llm HEAD CodeActAgent 300
```


@ -28,7 +28,7 @@ echo "MODEL_CONFIG: $MODEL_CONFIG"
EVAL_NOTE="$AGENT_VERSION"
COMMAND="poetry run python evaluation/browsing_delegation/run_infer.py \
COMMAND="poetry run python evaluation/benchmarks/browsing_delegation/run_infer.py \
--agent-cls $AGENT \
--llm-config $MODEL_CONFIG \
--max-iterations 1 \


@ -24,10 +24,10 @@ Make sure your Docker daemon is running, and you have ample disk space (at least
When the `run_infer.sh` script is started, it will automatically pull the `lite` split in Commit0. For example, for instance ID `commit-0/minitorch`, it will try to pull our pre-built Docker image `wentingzhao/minitorch` from DockerHub. This image will be used to create an OpenHands runtime image in which the agent operates.
```bash
./evaluation/commit0_bench/scripts/run_infer.sh [repo_split] [model_config] [git-version] [agent] [eval_limit] [max_iter] [num_workers] [dataset] [dataset_split]
./evaluation/benchmarks/commit0_bench/scripts/run_infer.sh [repo_split] [model_config] [git-version] [agent] [eval_limit] [max_iter] [num_workers] [dataset] [dataset_split]
# Example
./evaluation/commit0_bench/scripts/run_infer.sh lite llm.eval_sonnet HEAD CodeActAgent 16 100 8 wentingzhao/commit0_combined test
./evaluation/benchmarks/commit0_bench/scripts/run_infer.sh lite llm.eval_sonnet HEAD CodeActAgent 16 100 8 wentingzhao/commit0_combined test
```
where `model_config` is mandatory, and the rest are optional.
@ -56,7 +56,7 @@ Let's say you'd like to run 10 instances using `llm.eval_sonnet` and CodeActAgen
then your command would be:
```bash
./evaluation/commit0_bench/scripts/run_infer.sh lite llm.eval_sonnet HEAD CodeActAgent 10 30 1 wentingzhao/commit0_combined test
./evaluation/benchmarks/commit0_bench/scripts/run_infer.sh lite llm.eval_sonnet HEAD CodeActAgent 10 30 1 wentingzhao/commit0_combined test
```
### Run Inference on `RemoteRuntime` (experimental)
@ -64,17 +64,17 @@ then your command would be:
This is in limited beta. Contact Xingyao over slack if you want to try this out!
```bash
./evaluation/commit0_bench/scripts/run_infer.sh [repo_split] [model_config] [git-version] [agent] [eval_limit] [max_iter] [num_workers] [dataset] [dataset_split]
./evaluation/benchmarks/commit0_bench/scripts/run_infer.sh [repo_split] [model_config] [git-version] [agent] [eval_limit] [max_iter] [num_workers] [dataset] [dataset_split]
# Example - This runs evaluation on CodeActAgent for 10 instances of "wentingzhao/commit0_combined"'s test set, with a maximum of 30 iterations per instance and 1 worker running in parallel
ALLHANDS_API_KEY="YOUR-API-KEY" RUNTIME=remote SANDBOX_REMOTE_RUNTIME_API_URL="https://runtime.eval.all-hands.dev" EVAL_DOCKER_IMAGE_PREFIX="docker.io/wentingzhao" \
./evaluation/commit0_bench/scripts/run_infer.sh lite llm.eval_sonnet HEAD CodeActAgent 10 30 1 wentingzhao/commit0_combined test
./evaluation/benchmarks/commit0_bench/scripts/run_infer.sh lite llm.eval_sonnet HEAD CodeActAgent 10 30 1 wentingzhao/commit0_combined test
```
To clean up all existing runtimes you've already started, run:
```bash
ALLHANDS_API_KEY="YOUR-API-KEY" ./evaluation/commit0_bench/scripts/cleanup_remote_runtime.sh
ALLHANDS_API_KEY="YOUR-API-KEY" ./evaluation/benchmarks/commit0_bench/scripts/cleanup_remote_runtime.sh
```
### Specify a subset of tasks to run inference on


@ -91,7 +91,7 @@ fi
function run_eval() {
local eval_note=$1
COMMAND="poetry run python evaluation/commit0_bench/run_infer.py \
COMMAND="poetry run python evaluation/benchmarks/commit0_bench/run_infer.py \
--agent-cls $AGENT \
--llm-config $MODEL_CONFIG \
--max-iterations $MAX_ITER \


@ -16,7 +16,7 @@
2. Execute the bash script to start DiscoveryBench Evaluation
```
./evaluation/discoverybench/scripts/run_infer.sh [YOUR MODEL CONFIG]
./evaluation/benchmarks/discoverybench/scripts/run_infer.sh [YOUR MODEL CONFIG]
```
Replace `[YOUR MODEL CONFIG]` with any model that you have set up in `config.toml`.
@ -27,7 +27,7 @@ When the `run_infer.sh` script is started, it will automatically pull the latest
```
./evaluation/discoverybench/scripts/run_infer.sh [MODEL_CONFIG] [GIT_COMMIT] [AGENT] [EVAL_LIMIT] [NUM_WORKERS]
./evaluation/benchmarks/discoverybench/scripts/run_infer.sh [MODEL_CONFIG] [GIT_COMMIT] [AGENT] [EVAL_LIMIT] [NUM_WORKERS]
```
- `MODEL_CONFIG`: Name of the model you want to evaluate with


@ -5,10 +5,10 @@ import os
import git
import pandas as pd
from evaluation.discoverybench.eval_utils.eval_w_subhypo_gen import (
from evaluation.benchmarks.discoverybench.eval_utils.eval_w_subhypo_gen import (
run_eval_gold_vs_gen_NL_hypo_workflow,
)
from evaluation.discoverybench.eval_utils.response_parser import (
from evaluation.benchmarks.discoverybench.eval_utils.response_parser import (
extract_gen_hypo_from_logs,
)
from evaluation.utils.shared import (


@ -29,7 +29,7 @@ echo "AGENT: $AGENT"
echo "AGENT_VERSION: $AGENT_VERSION"
echo "MODEL_CONFIG: $MODEL_CONFIG"
COMMAND="poetry run python evaluation/discoverybench/run_infer.py \
COMMAND="poetry run python evaluation/benchmarks/discoverybench/run_infer.py \
--agent-cls $AGENT \
--llm-config $MODEL_CONFIG \
--max-iterations 10 \


@ -10,11 +10,11 @@ Please follow instruction [here](../README.md#setup) to setup your local develop
We are using the GAIA dataset hosted on [Hugging Face](https://huggingface.co/datasets/gaia-benchmark/GAIA).
Please accept the terms and make sure you have logged in on your computer with `huggingface-cli login` before running the evaluation.
Following is the basic command to start the evaluation. Here we are evaluating on the validation set for the `2023_all` split. You can adjust `./evaluation/gaia/scripts/run_infer.sh` to change the subset you want to evaluate on.
Following is the basic command to start the evaluation. Here we are evaluating on the validation set for the `2023_all` split. You can adjust `./evaluation/benchmarks/gaia/scripts/run_infer.sh` to change the subset you want to evaluate on.
```bash
./evaluation/gaia/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [gaia_subset]
# e.g., ./evaluation/gaia/scripts/run_infer.sh eval_gpt4_1106_preview 0.6.2 CodeActAgent 300
./evaluation/benchmarks/gaia/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [gaia_subset]
# e.g., ./evaluation/benchmarks/gaia/scripts/run_infer.sh eval_gpt4_1106_preview 0.6.2 CodeActAgent 300
```
where `model_config` is mandatory, while `git-version`, `agent`, `eval_limit` and `gaia_subset` are optional.
@ -35,13 +35,13 @@ to `CodeActAgent`.
For example,
```bash
./evaluation/gaia/scripts/run_infer.sh eval_gpt4_1106_preview 0.6.2 CodeActAgent 10
./evaluation/benchmarks/gaia/scripts/run_infer.sh eval_gpt4_1106_preview 0.6.2 CodeActAgent 10
```
## Get score
Then you can get stats by running the following command:
```bash
python ./evaluation/gaia/get_score.py \
python ./evaluation/benchmarks/gaia/get_score.py \
--file <path_to/output.json>
```


@ -7,7 +7,7 @@ import huggingface_hub
import pandas as pd
from datasets import load_dataset
from evaluation.gaia.scorer import question_scorer
from evaluation.benchmarks.gaia.scorer import question_scorer
from evaluation.utils.shared import (
EvalMetadata,
EvalOutput,


@ -35,7 +35,7 @@ echo "AGENT_VERSION: $AGENT_VERSION"
echo "MODEL_CONFIG: $MODEL_CONFIG"
echo "LEVELS: $LEVELS"
COMMAND="poetry run python ./evaluation/gaia/run_infer.py \
COMMAND="poetry run python ./evaluation/benchmarks/gaia/run_infer.py \
--agent-cls $AGENT \
--llm-config $MODEL_CONFIG \
--max-iterations 30 \


@ -11,7 +11,7 @@ Please follow instruction [here](../README.md#setup) to setup your local develop
Make sure your Docker daemon is running, then run this bash script:
```bash
./evaluation/gorilla/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [hubs]
./evaluation/benchmarks/gorilla/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [hubs]
```
where `model_config` is mandatory, while all other arguments are optional.
@ -35,5 +35,5 @@ Note: in order to use `eval_limit`, you must also set `agent`; in order to use `
For example,
```bash
./evaluation/gorilla/scripts/run_infer.sh llm 0.6.2 CodeActAgent 10 th
./evaluation/benchmarks/gorilla/scripts/run_infer.sh llm 0.6.2 CodeActAgent 10 th
```


@ -5,7 +5,7 @@ import os
import pandas as pd
import requests
from evaluation.gorilla.utils import encode_question, get_data_for_hub
from evaluation.benchmarks.gorilla.utils import encode_question, get_data_for_hub
from evaluation.utils.shared import (
EvalMetadata,
EvalOutput,


@ -33,7 +33,7 @@ echo "AGENT_VERSION: $AGENT_VERSION"
echo "MODEL_CONFIG: $MODEL_CONFIG"
echo "HUBS: $HUBS"
COMMAND="poetry run python evaluation/gorilla/run_infer.py \
COMMAND="poetry run python evaluation/benchmarks/gorilla/run_infer.py \
--agent-cls $AGENT \
--llm-config $MODEL_CONFIG \
--max-iterations 30 \


@ -23,7 +23,7 @@ Please follow instruction [here](../README.md#setup) to setup your local develop
'gpqa_main', 'gpqa_diamond', 'gpqa_experts', 'gpqa_extended' -- data split options
From the root of the OpenHands repo, run the following command:
```bash
./evaluation/gpqa/scripts/run_infer.sh [model_config_name] [git-version] [num_samples_eval] [data_split] [AgentClass]
./evaluation/benchmarks/gpqa/scripts/run_infer.sh [model_config_name] [git-version] [num_samples_eval] [data_split] [AgentClass]
```
You can replace `model_config_name` with any model you set up in `config.toml`.


@ -33,7 +33,7 @@ echo "AGENT: $AGENT"
echo "AGENT_VERSION: $AGENT_VERSION"
echo "MODEL_CONFIG: $MODEL_CONFIG"
COMMAND="poetry run python evaluation/gpqa/run_infer.py \
COMMAND="poetry run python evaluation/benchmarks/gpqa/run_infer.py \
--agent-cls $AGENT \
--llm-config $MODEL_CONFIG \
--max-iterations 10 \


@ -9,7 +9,7 @@ Please follow instruction [here](../README.md#setup) to setup your local develop
## Run Inference on HumanEvalFix
```bash
./evaluation/humanevalfix/scripts/run_infer.sh eval_gpt4_1106_preview
./evaluation/benchmarks/humanevalfix/scripts/run_infer.sh eval_gpt4_1106_preview
```
You can replace `eval_gpt4_1106_preview` with any model you set up in `config.toml`.


@ -64,7 +64,7 @@ echo "AGENT: $AGENT"
echo "AGENT_VERSION: $AGENT_VERSION"
echo "MODEL_CONFIG: $MODEL_CONFIG"
COMMAND="poetry run python evaluation/humanevalfix/run_infer.py \
COMMAND="poetry run python evaluation/benchmarks/humanevalfix/run_infer.py \
--agent-cls $AGENT \
--llm-config $MODEL_CONFIG \
--max-iterations 10 \


@ -10,5 +10,5 @@ Please follow instruction [here](../README.md#setup) to setup your local develop
The following code will run inference on the first example of the ProofWriter dataset,
```bash
./evaluation/logic_reasoning/scripts/run_infer.sh eval_gpt4_1106_preview_llm ProofWriter
./evaluation/benchmarks/logic_reasoning/scripts/run_infer.sh eval_gpt4_1106_preview_llm ProofWriter
```


@ -34,7 +34,7 @@ echo "AGENT: $AGENT"
echo "AGENT_VERSION: $AGENT_VERSION"
echo "MODEL_CONFIG: $MODEL_CONFIG"
COMMAND="poetry run python evaluation/logic_reasoning/run_infer.py \
COMMAND="poetry run python evaluation/benchmarks/logic_reasoning/run_infer.py \
--agent-cls $AGENT \
--llm-config $MODEL_CONFIG \
--dataset $DATASET \


@ -13,7 +13,7 @@ Access with browser the above MiniWoB URLs and see if they load correctly.
## Run Evaluation
```sh
./evaluation/miniwob/scripts/run_infer.sh llm.claude-35-sonnet-eval
./evaluation/benchmarks/miniwob/scripts/run_infer.sh llm.claude-35-sonnet-eval
```
### Run Inference on `RemoteRuntime` (experimental)
@ -21,13 +21,13 @@ Access with browser the above MiniWoB URLs and see if they load correctly.
This is in limited beta. Contact Xingyao over slack if you want to try this out!
```bash
./evaluation/miniwob/scripts/run_infer.sh [model_config] [git-version] [agent] [note] [eval_limit] [num_workers]
./evaluation/benchmarks/miniwob/scripts/run_infer.sh [model_config] [git-version] [agent] [note] [eval_limit] [num_workers]
# Example - This runs evaluation on BrowsingAgent for 125 instances on miniwob, with 2 workers running in parallel
export ALLHANDS_API_KEY="YOUR-API-KEY"
export RUNTIME=remote
export SANDBOX_REMOTE_RUNTIME_API_URL="https://runtime.eval.all-hands.dev"
./evaluation/miniwob/scripts/run_infer.sh llm.eval HEAD BrowsingAgent "" 125 2
./evaluation/benchmarks/miniwob/scripts/run_infer.sh llm.eval HEAD BrowsingAgent "" 125 2
```
Results will be in `evaluation/evaluation_outputs/outputs/miniwob/`
@ -35,7 +35,7 @@ Results will be in `evaluation/evaluation_outputs/outputs/miniwob/`
To calculate the average reward, run:
```sh
poetry run python evaluation/miniwob/get_success_rate.py evaluation/evaluation_outputs/outputs/miniwob/SOME_AGENT/EXP_NAME/output.jsonl
poetry run python evaluation/benchmarks/miniwob/get_success_rate.py evaluation/evaluation_outputs/outputs/miniwob/SOME_AGENT/EXP_NAME/output.jsonl
```
## Submit your evaluation results


@ -33,7 +33,7 @@ echo "MODEL_CONFIG: $MODEL_CONFIG"
EVAL_NOTE="${AGENT_VERSION}_${NOTE}"
COMMAND="export PYTHONPATH=evaluation/miniwob:\$PYTHONPATH && poetry run python evaluation/miniwob/run_infer.py \
COMMAND="export PYTHONPATH=evaluation/benchmarks/miniwob:\$PYTHONPATH && poetry run python evaluation/benchmarks/miniwob/run_infer.py \
--agent-cls $AGENT \
--llm-config $MODEL_CONFIG \
--max-iterations 10 \


@ -6,7 +6,7 @@ We support evaluation of the [Eurus subset focus on math and code reasoning](htt
## Setup Environment and LLM Configuration
Please follow instruction [here](../README.md#setup) to setup your local development environment and LLM.
Please follow instruction [here](../../README.md#setup) to setup your local development environment and LLM.
## Start the evaluation
@ -15,7 +15,7 @@ We are using the MINT dataset hosted on [Hugging Face](https://huggingface.co/da
Following is the basic command to start the evaluation. Currently, the only agent supported with MINT is `CodeActAgent`.
```bash
./evaluation/mint/scripts/run_infer.sh [model_config] [git-version] [subset] [eval_limit]
./evaluation/benchmarks/mint/scripts/run_infer.sh [model_config] [git-version] [subset] [eval_limit]
```
where `model_config` is mandatory, while others are optional.
@ -34,7 +34,7 @@ Note: in order to use `eval_limit`, you must also set `subset`.
For example,
```bash
./evaluation/swe_bench/scripts/run_infer.sh eval_gpt4_1106_preview 0.6.2 gsm8k 3
./evaluation/benchmarks/mint/scripts/run_infer.sh eval_gpt4_1106_preview 0.6.2 gsm8k 3
```
## Reference


@ -6,10 +6,10 @@ from typing import Any
import pandas as pd
from datasets import load_dataset
from evaluation.mint.datatypes import TaskState
from evaluation.mint.env import SimplifiedEnv
from evaluation.mint.prompts import ToolPromptTemplate
from evaluation.mint.tasks import Task
from evaluation.benchmarks.mint.datatypes import TaskState
from evaluation.benchmarks.mint.env import SimplifiedEnv
from evaluation.benchmarks.mint.prompts import ToolPromptTemplate
from evaluation.benchmarks.mint.tasks import Task
from evaluation.utils.shared import (
EvalMetadata,
EvalOutput,


@ -1,6 +1,6 @@
from evaluation.mint.tasks.base import Task
from evaluation.mint.tasks.codegen import HumanEvalTask, MBPPTask
from evaluation.mint.tasks.reasoning import (
from evaluation.benchmarks.mint.tasks.base import Task
from evaluation.benchmarks.mint.tasks.codegen import HumanEvalTask, MBPPTask
from evaluation.benchmarks.mint.tasks.reasoning import (
MultipleChoiceTask,
ReasoningTask,
TheoremqaTask,


@ -2,7 +2,7 @@ import logging
from utils import check_correctness
from evaluation.mint.tasks.base import Task
from evaluation.benchmarks.mint.tasks.base import Task
LOGGER = logging.getLogger('MINT')

Some files were not shown because too many files have changed in this diff.
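Since the rendered diff is truncated, one way to review the complete set of path changes is to inspect the commit in a local clone; a quick sketch, using the commit hash from the header above:

```bash
# Fetch the repository and show the full list of files touched by this commit.
git clone https://github.com/All-Hands-AI/OpenHands.git
cd OpenHands
git show --stat 678436da30
```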