# ToolQA Evaluation with OpenDevin

This folder contains an evaluation harness built on top of the original ToolQA benchmark (paper).

## Setup Environment and LLM Configuration

Please follow the instructions here to set up your local development environment and LLM.

## Run Inference on ToolQA Instances

Make sure your Docker daemon is running, then run this bash script:

```bash
bash evaluation/toolqa/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [dataset] [hardness] [wolfram_alpha_appid]
```

where `model_config` is mandatory, while all other arguments are optional:

- `model_config`, e.g. `llm`, is the config group name for your LLM settings, as defined in your `config.toml`.
- `git-version`, e.g. `HEAD`, is the git commit hash of the OpenDevin version you would like to evaluate. It could also be a release tag like `0.6.2`.
- `agent`, e.g. `CodeActAgent`, is the name of the agent to benchmark, defaulting to `CodeActAgent`.
- `eval_limit`, e.g. `10`, limits the evaluation to the first `eval_limit` instances. By default, the script evaluates 1 instance.
- `dataset` is the ToolQA dataset to evaluate. Choose from `agenda`, `airbnb`, `coffee`, `dblp`, `flight`, `gsm8k`, `scirex`, and `yelp`. The default is `coffee`.
- `hardness` is the difficulty level to evaluate. Choose from `easy` and `hard`. The default is `easy`.
- `wolfram_alpha_appid` is an optional argument. When given, the agent will be able to access Wolfram Alpha's APIs.

Note: the arguments are positional, so to use `eval_limit` you must also set `agent`; to use `dataset` you must also set `eval_limit`; and to use `hardness` you must also set `dataset`.
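The positional cascade above can be pictured with a minimal sketch. Note this is an illustrative assumption of how such a script typically assigns defaults, not the actual contents of `run_infer.sh`; the variable names and default values here are hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of positional argument handling -- not the real run_infer.sh.
MODEL_CONFIG=$1                 # mandatory
GIT_VERSION=${2:-HEAD}          # assumed default for illustration
AGENT=${3:-CodeActAgent}        # defaults documented above
EVAL_LIMIT=${4:-1}
DATASET=${5:-coffee}
HARDNESS=${6:-easy}
WOLFRAM_ALPHA_APPID=${7:-}      # empty means no Wolfram Alpha access

echo "agent=$AGENT limit=$EVAL_LIMIT dataset=$DATASET hardness=$HARDNESS"
```

Because each `${N:-default}` expansion falls back only when that position is empty, skipping an earlier argument would shift every later one into the wrong slot, which is why the note requires setting them in order.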

For example, to run 10 instances with the `llm` config and `CodeActAgent` on the `coffee` `easy` split at `HEAD`, your command would be:

```bash
bash evaluation/toolqa/scripts/run_infer.sh llm HEAD CodeActAgent 10 coffee easy
```