mirror of https://github.com/OpenHands/OpenHands.git synced 2026-03-22 13:47:19 +08:00


Gorilla APIBench Evaluation with OpenDevin

This folder contains the evaluation harness we built on top of the original Gorilla APIBench (paper).

Setup Environment

Please follow this document to set up a local development environment for OpenDevin.

Configure OpenDevin and your LLM

Run make setup-config to set up the config.toml file if it does not exist at the root of the workspace.

Run Inference on APIBench Instances

Make sure your Docker daemon is running, then run this bash script:

bash evaluation/gorilla/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [hubs]

where model_config is mandatory, while all other arguments are optional.

model_config, e.g. llm, is the config group name for your LLM settings, as defined in your config.toml.

git-version, e.g. head, is the git commit hash of the OpenDevin version you would like to evaluate. It can also be a release tag such as 0.6.2.

agent, e.g. CodeActAgent, is the name of the agent for benchmarks, defaulting to CodeActAgent.

eval_limit, e.g. 10, limits the evaluation to the first eval_limit instances. By default, the script evaluates 1 instance.

hubs, e.g. hf,torch,tf, is the hub or hubs from APIBench to evaluate. You can choose one or more of torch or th (an abbreviation of torch), hf (an abbreviation of huggingface), and tf (an abbreviation of tensorflow). The default is hf,torch,tf.
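
As a rough illustration of the hub names described above, the following sketch maps a comma-separated hubs value onto one canonical name per hub. The function name normalize_hubs and the error handling are hypothetical and are not taken from run_infer.sh or run_infer.py; only the accepted names (torch, th, hf, tf) and the default come from this README.

```bash
# Hypothetical helper, not part of the OpenDevin scripts.
# Maps the "th" abbreviation to "torch" and rejects unknown hub names.
normalize_hubs() {
  local input="${1:-hf,torch,tf}"   # default taken from the README
  local result="" hub
  local IFS=','                     # split the list on commas
  for hub in $input; do             # unquoted on purpose so IFS splitting applies
    case "$hub" in
      th|torch) hub="torch" ;;
      hf|tf) ;;                     # already canonical
      *) echo "error: unknown hub '$hub'" >&2; return 1 ;;
    esac
    result="${result:+$result,}$hub"
  done
  echo "$result"
}
```

For instance, `normalize_hubs hf,th` prints `hf,torch`, while an unrecognized name makes the function return a non-zero status.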

Note: the arguments are positional, so to use eval_limit you must also set agent, and to use hubs you must also set eval_limit.

For example,

bash evaluation/gorilla/scripts/run_infer.sh llm 0.6.2 CodeActAgent 10 th
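
The ordering constraint in the note above is what one would expect from a script that reads its arguments by position. The sketch below shows that pattern under stated assumptions: resolve_args is a hypothetical function, not the contents of run_infer.sh, and the head default for git-version is assumed for illustration (the README does not state it). The defaults for agent, eval_limit, and hubs are the ones listed above.

```bash
# Hypothetical sketch of positional arguments with defaults.
# This is NOT the real run_infer.sh; it only illustrates the pattern.
resolve_args() {
  if [ "$#" -lt 1 ] || [ -z "$1" ]; then
    echo "error: model_config is required" >&2
    return 1
  fi
  local model_config="$1"
  local git_version="${2:-head}"      # assumed default, not stated in the README
  local agent="${3:-CodeActAgent}"    # README default
  local eval_limit="${4:-1}"          # README default
  local hubs="${5:-hf,torch,tf}"      # README default
  echo "model_config=$model_config git_version=$git_version agent=$agent eval_limit=$eval_limit hubs=$hubs"
}
```

Under this pattern, a call such as `resolve_args llm 10` would treat 10 as the git-version rather than as eval_limit, which is why every argument before the one you want to set has to be supplied.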