# SWE-fficiency Evaluation
This folder contains the OpenHands inference generation for the SWE-fficiency benchmark (paper).
The evaluation consists of three steps:
- Environment setup: install the Python environment and configure the LLM config.
- Run inference: generate an edit patch for each GitHub issue.
- Evaluate the generated patches.
## Setup Environment and LLM Configuration

Please follow the instructions here to set up your local development environment and LLM.
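If you have not configured an LLM yet, one way to do it is to add a config group to `config.toml` at the repository root. The sketch below is only illustrative: the group name matches the example command later in this README, while the model name and API key are placeholders you should replace.

```bash
# Minimal sketch: append an LLM config group to config.toml.
# The group name matches the example in this README; model and key are placeholders.
cat >> config.toml <<'EOF'
[llm.eval_gpt4_1106_preview]
model = "gpt-4-1106-preview"
api_key = "sk-..."   # replace with your own key
temperature = 0.0
EOF
```

The `[llm.<name>]` group name is what you later pass as `model_config` to `run_infer.sh`.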
## Running Inference Locally with Docker
Make sure your Docker daemon is running and that you have ample disk space (at least 200-500 GB, depending on the SWE-Perf set you are running on) for the instance-level Docker images.

When the `run_infer.sh` script starts, it will automatically pull the relevant SWE-Perf images. For example, for instance ID `scikit-learn_scikit-learn-11674`, it will try to pull our pre-built Docker image `betty1202/sweb.eval.x86_64.scikit-learn_s_scikit-learn-11674` from DockerHub. This image is then used to create an OpenHands runtime image in which the agent operates.
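`run_infer.sh` pulls the required images automatically, but if you want to verify your Docker setup beforehand, a quick manual check might look like the following (the image tag is the scikit-learn example above; the exact checks are only suggestions):

```bash
# Optional sanity checks before launching a run.
docker info > /dev/null && echo "Docker daemon is running"
df -h .          # confirm you have enough free disk space
docker pull betty1202/sweb.eval.x86_64.scikit-learn_s_scikit-learn-11674
```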
```bash
./evaluation/benchmarks/swefficiency/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [max_iter] [num_workers] [dataset] [dataset_split] [n_runs] [mode]

# Example
./evaluation/benchmarks/swefficiency/scripts/run_infer.sh llm.eval_gpt4_1106_preview HEAD CodeActAgent 500 100 1 swefficiency/swefficiency test
```
where `model_config` is mandatory, and the rest are optional.

- `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for your LLM settings, as defined in your `config.toml`.
- `git-version`, e.g. `HEAD`, is the git commit hash of the OpenHands version you would like to evaluate. It could also be a release tag like `0.6.2`.
- `agent`, e.g. `CodeActAgent`, is the name of the agent for benchmarks, defaulting to `CodeActAgent`.
- `eval_limit`, e.g. `10`, limits the evaluation to the first `eval_limit` instances. By default, the script evaluates the entire SWE-Perf test set (140 issues). Note: in order to use `eval_limit`, you must also set `agent`.
- `max_iter`, e.g. `20`, is the maximum number of iterations for the agent to run. By default, it is set to 100.
- `num_workers`, e.g. `3`, is the number of parallel workers to run the evaluation. By default, it is set to 1.
- `dataset`, a Hugging Face dataset name, e.g. `SWE-Perf/SWE-Perf`, specifies which dataset to evaluate on.
- `dataset_split`, the split for the Hugging Face dataset, e.g. `test` or `dev`. Defaults to `test`.
- `n_runs`, e.g. `3`, is the number of times to run the evaluation. Defaults to 1.
- `mode`, e.g. `swt`, `swt-ci`, or `swe`, specifies the evaluation mode. Defaults to `swe`.
> [!CAUTION]
> Setting `num_workers` larger than 1 is not officially tested; YMMV.
Let's say you'd like to run 10 instances using `llm.eval_gpt4_1106_preview` and `CodeActAgent`; then your command would be:

```bash
./evaluation/benchmarks/swefficiency/scripts/run_infer.sh llm.eval_gpt4_1106_preview HEAD CodeActAgent 10
```
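Because the arguments are positional, overriding a later option such as `num_workers` or `mode` requires supplying every argument before it. An illustrative full invocation, with all values chosen purely for the example, might look like:

```bash
# Illustrative only: 20 instances, 100 iterations, 4 workers, 3 runs, "swe" mode.
./evaluation/benchmarks/swefficiency/scripts/run_infer.sh \
  llm.eval_gpt4_1106_preview HEAD CodeActAgent \
  20 100 4 \
  swefficiency/swefficiency test 3 swe
```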
## Run the SWE-fficiency Benchmark Official Evaluation
Once the output is converted, evaluate it with the official SWE-fficiency benchmark evaluation.
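The exact command for the official harness depends on the SWE-fficiency release you use, so consult its documentation for the expected input format. As a starting point, the inference step writes its predictions to an `output.jsonl` file, which you can locate with something like the sketch below (the `evaluation/evaluation_outputs/` path is the usual OpenHands output location and may differ in your setup):

```bash
# Locate the generated predictions to hand to the official evaluation harness.
find evaluation/evaluation_outputs -name "output.jsonl" | head
```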