# Browsing Delegation Evaluation
Some of OpenDevin's agents support the agent delegation action; for example, CodeActAgent can delegate browsing tasks to BrowsingAgent.

This evaluation tests whether CodeActAgent correctly delegates instructions from the WebArena and MiniWoB benchmarks to BrowsingAgent. If it does, BrowsingAgent's performance becomes the upper bound on CodeActAgent's browsing performance.
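To make the success criterion concrete, here is a minimal, self-contained sketch of the kind of check this evaluation performs. The names below (`DelegateAction`, `is_correct_delegation`, the `task` input key) are illustrative stand-ins, not OpenDevin's actual event classes, and the real scorer may use a looser match than exact substring containment.

```python
# Illustrative sketch only -- class and field names are placeholders,
# not OpenDevin's real event API.
from dataclasses import dataclass, field


@dataclass
class DelegateAction:
    """Stand-in for an agent-delegation event emitted by CodeActAgent."""
    agent: str                                   # name of the sub-agent, e.g. "BrowsingAgent"
    inputs: dict = field(default_factory=dict)   # task handed to the sub-agent


def is_correct_delegation(action: DelegateAction, instruction: str) -> bool:
    """Count an instance as solved if CodeActAgent delegates the browsing
    instruction to BrowsingAgent (roughly) verbatim."""
    return (
        action.agent == "BrowsingAgent"
        and instruction in action.inputs.get("task", "")
    )


# Example: a WebArena-style instruction delegated to the browsing sub-agent.
instruction = "Find the cheapest laptop on the shopping site and report its price."
action = DelegateAction(agent="BrowsingAgent", inputs={"task": instruction})
assert is_correct_delegation(action, instruction)
```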
## Setup Environment and LLM Configuration
Please follow the instructions here to set up your local development environment and LLM.
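As a rough sketch of what an LLM config group can look like, see below; the group name, keys, and values are placeholders, so follow the linked setup instructions for the authoritative format.

```toml
# config.toml -- placeholder values, adjust to your provider
[llm.eval_gpt4_1106_preview]
model = "gpt-4-1106-preview"
api_key = "sk-..."
temperature = 0.0
```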
## Run Inference
```bash
./evaluation/browsing_delegation/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit]
# e.g., ./evaluation/browsing_delegation/scripts/run_infer.sh llm.eval_gpt4_1106_preview_llm HEAD CodeActAgent 300
```
where `model_config` is mandatory, while `agent` and `eval_limit` are optional.

- `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for your LLM settings, as defined in your `config.toml`.
- `git-version`, e.g. `HEAD`, is the git commit hash of the OpenDevin version you would like to evaluate. It could also be a release tag like `0.6.2`.
- `agent`, e.g. `CodeActAgent`, is the name of the agent to evaluate, defaulting to `CodeActAgent`.
- `eval_limit`, e.g. `10`, limits the evaluation to the first `eval_limit` instances (see the example command after this list).
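For example, to evaluate only the first 10 instances with the config group used in the example above (substitute the group name you actually defined in your `config.toml`):

```bash
./evaluation/browsing_delegation/scripts/run_infer.sh llm.eval_gpt4_1106_preview_llm HEAD CodeActAgent 10
```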