* setup boilerplate and README
* setup test script and load dataset
* add temp intg that works
* refactor code
* add solution evaluation through 'fake_user_response_fn'
* finish integrating MATH subset
* Update evaluation/mint/run_infer.py
* Update evaluation/mint/run_infer.sh
* Update opendevin/core/main.py
* remove redudant templates, add eval_note, update README
* use <execute_ipython> tag instead of <execute>
* hardcode AGENT option for run_infer.sh
* Update evaluation/mint/task.py

  Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com>

* fix: bug no message returned when task's success
* change message to make the agent exit
* import bash abstractmethod
* install all required packages inside sandbox before the agent runs, adjust prompt
* add subset eval folder separation and test for gsm8k
* fix bug in Reasoning task result check, add requirements.txt
* Fix syntax error in evaluation/mint/run_infer.py
* update README, add default values for `SUBSET` and `EVAL_LIMIT`

---------

Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com>
Co-authored-by: yufansong <yufan@risingwave-labs.com>
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
MINT Benchmark
This folder contains the evaluation harness for the MINT benchmark on LLMs' ability to solve tasks with multi-turn interactions.
Configure OpenDevin and LM
Create a `config.toml` file at the root of the workspace if it does not already exist. Please check README.md for how to set this up.
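The snippet below is only a rough sketch of what such a setup could look like; the section and key names (`model`, `api_key`, `temperature`) and the group name `eval_gpt4_1106_preview` are illustrative assumptions, so defer to README.md for the authoritative format.

```bash
# Hypothetical sketch: append a minimal LLM config group to config.toml.
# Section and key names are assumptions -- check README.md for the exact
# format your OpenDevin version expects (e.g. it may use [llm.<group>]).
cat >> config.toml <<'EOF'
[eval_gpt4_1106_preview]
model = "gpt-4-1106-preview"
api_key = "sk-..."        # replace with your real API key
temperature = 0.0
EOF
```

Whatever group name you define here is what you later pass as the `model_config` argument to `run_infer.sh`.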
Start the evaluation
We are using the MINT dataset hosted on Hugging Face.
The following is the basic command to start the evaluation. Currently, `CodeActAgent` is the only agent supported for MINT.
./evaluation/mint/scripts/run_infer.sh [model_config] [subset] [eval_limit]
where `model_config` is mandatory, while `subset` and `eval_limit` are optional.

- `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for your LLM settings, as defined in your `config.toml`.
- `subset`, e.g. `math`, is the subset of the MINT benchmark to evaluate on, defaulting to `math`.
- `eval_limit`, e.g. `2`, limits the evaluation to the first `eval_limit` instances, defaulting to all instances.

Note: in order to use `eval_limit`, you must also set `subset`.
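For instance, a minimal invocation that relies on both defaults (assuming your `config.toml` defines an `eval_gpt4_1106_preview` config group) would be:

```bash
# With only `model_config` given, the script falls back to the `math` subset
# and evaluates all instances.
./evaluation/mint/scripts/run_infer.sh eval_gpt4_1106_preview
```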
For example, to run 3 instances of the `gsm8k` subset using `eval_gpt4_1106_preview`, your command would be:

./evaluation/mint/scripts/run_infer.sh eval_gpt4_1106_preview gsm8k 3
Reference
@misc{wang2024mint,
title={MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback},
author={Xingyao Wang and Zihan Wang and Jiateng Liu and Yangyi Chen and Lifan Yuan and Hao Peng and Heng Ji},
year={2024},
eprint={2309.10691},
archivePrefix={arXiv},
primaryClass={cs.CL}
}