AgentBench Evaluation
This folder contains the evaluation harness for evaluating agents on AgentBench: Evaluating LLMs as Agents. We currently only support running on the `osbench` subset.
Setup Environment and LLM Configuration
Please follow the instructions here to set up your local development environment and LLM.
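For reference, the LLM settings used below live in `config.toml` as a named config group. The snippet that follows is only an illustrative sketch: the exact key names, and whether the group is a top-level table or nested under `llm` (e.g. `[llm.eval_gpt4_1106_preview]`), depend on your OpenDevin version, so mirror whatever the development setup guide shows.

```toml
# Illustrative sketch only; adjust the group name, model, and keys to your setup.
[eval_gpt4_1106_preview]
model = "gpt-4-1106-preview"
api_key = "sk-..."       # your provider API key
temperature = 0.0
```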
Start the evaluation
./evaluation/agent_bench/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit]
- `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for your LLM settings, as defined in your `config.toml`.
- `git-version`, e.g. `HEAD`, is the git commit hash of the OpenDevin version you would like to evaluate. It could also be a release tag like `0.6.2`.
- `agent`, e.g. `CodeActAgent`, is the name of the agent for benchmarks, defaulting to `CodeActAgent`.
- `eval_limit`, e.g. `10`, limits the evaluation to the first `eval_limit` instances. By default, the script evaluates the entire test set. Note: in order to use `eval_limit`, you must also set `agent` (see the example below).
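For example, to evaluate only the first 10 instances with `CodeActAgent` against your current `HEAD`, you might run:

```bash
./evaluation/agent_bench/scripts/run_infer.sh eval_gpt4_1106_preview HEAD CodeActAgent 10
```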
The following is the basic command to start the evaluation.
You can update the arguments in the script `evaluation/agent_bench/scripts/run_infer.sh`, such as `--max-iterations`, `--eval-num-workers`, and so on (see the sketch after the list below):
- `--agent-cls`: the agent to use. For example, `CodeActAgent`.
- `--llm-config`: the LLM configuration to use. For example, `eval_gpt4_1106_preview`.
- `--max-iterations`: the number of iterations to run the evaluation. For example, `30`.
- `--eval-num-workers`: the number of workers to use for evaluation. For example, `5`.
- `--eval-n-limit`: the number of examples to evaluate. For example, `100`.
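For orientation, these flags are ultimately forwarded to the benchmark's Python entry point inside `run_infer.sh`. The sketch below is illustrative only: the `poetry run python evaluation/agent_bench/run_infer.py` invocation is an assumption about how the wrapper is wired, and the script in your checkout is authoritative.

```bash
# Illustrative sketch; see run_infer.sh in your checkout for the real invocation.
poetry run python evaluation/agent_bench/run_infer.py \
  --agent-cls CodeActAgent \
  --llm-config eval_gpt4_1106_preview \
  --max-iterations 30 \
  --eval-num-workers 5 \
  --eval-n-limit 100
```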
./evaluation/agent_bench/scripts/run_infer.sh eval_gpt35_turbo HEAD CodeActAgent 1
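This example runs `CodeActAgent` with the `eval_gpt35_turbo` LLM config group against the current `HEAD` checkout, limited to the first instance.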