54 Commits

Author SHA1 Message Date
Xingyao Wang
c2f46200c0
chore(lint): Apply comprehensive linting and formatting fixes (#10287)
Co-authored-by: openhands <openhands@all-hands.dev>
2025-08-13 21:13:19 +02:00
Ibragim Badertdinov
19a6b6b618
feat(eval): Support evaluation on SWE-rebench (#10251)
2025-08-12 14:05:43 +00:00
juanmichelini
ea50fe4e3c
Fix: Continue evaluation when an instance fails after max retries (#8868)
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Xingyao Wang <xingyaoww@gmail.com>
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
2025-07-16 22:42:44 +00:00
Ryan H. Tran
dfa54673d2
[OH-Versa] Add remaining browsing & GAIA eval improvement (#9015)
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2025-06-25 12:36:15 +07:00
Linghao Zhang
a93b0457c6
feat(eval): Support evaluation on SWE-bench-Live (#9137)
2025-06-15 12:30:47 +00:00
Graham Neubig
689d3c9046
Update pre-commit hook versions to most recent versions (#8343)
Co-authored-by: openhands <openhands@all-hands.dev>
2025-05-08 03:59:13 +00:00
Rohit Malhotra
9adfcede31
(Hotfix): Track reason for Error AgentState (#7584)
Co-authored-by: openhands <openhands@all-hands.dev>
2025-03-31 21:24:42 +00:00
Xingyao Wang
01e0e29a9f
Reduce bash SOFT timeout from 30 to 10 seconds (#7423)
Co-authored-by: openhands <openhands@all-hands.dev>
2025-03-22 22:42:24 +00:00
Xingyao Wang
33780f97d0
[eval] Upgrade SWE-Bench to use official image and latest harness (#6838)
Co-authored-by: Robert Brennan <accounts@rbren.io>
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Graham Neubig <neubig@gmail.com>
2025-02-27 08:15:05 -05:00
Xingyao Wang
1a7003a705
Add sysbox support to remote runtime for eval; Add memory monitor, stress tests to help debug memory issue (#6684)
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Graham Neubig <neubig@gmail.com>
2025-02-18 20:02:28 +00:00
Boxuan Li
ef12bc5381
Evaluation harness: Add agent config option (#6662)
2025-02-13 15:05:03 -05:00
Xingyao Wang
2b04ee2e62
feat(eval): reliability improvement for SWE-Bench eval_infer (#6347)
2025-01-18 14:02:59 -05:00
Calvin Smith
a12087243a
Pydantic-based configuration and setting objects (#6321)
Co-authored-by: Calvin Smith <calvin@all-hands.dev>
Co-authored-by: Graham Neubig <neubig@gmail.com>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2025-01-17 12:33:22 -07:00
Xingyao Wang
899c1f8360
fix(bash): also show timeout reminder when no_change_timeout is triggered (#6318)
Co-authored-by: Robert Brennan <accounts@rbren.io>
2025-01-18 03:31:23 +08:00
tofarr
23473070b9
Revert "Config objects as Pydantic BaseModels (#6176)" (#6214)
2025-01-13 07:36:25 -07:00
Calvin Smith
873dddb4e8
Config objects as Pydantic BaseModels (#6176)
Co-authored-by: Calvin Smith <calvin@all-hands.dev>
Co-authored-by: Graham Neubig <neubig@gmail.com>
2025-01-12 15:09:45 -05:00
Calvin Smith
6e4ff56934
feature: Condenser Interface and Defaults (#5306)
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Calvin Smith <calvin@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2025-01-08 04:36:30 +08:00
Xingyao Wang
f14f75b064
feat: runtime improvements for rate-limit and 502/503/404 error (#5975)
2025-01-03 08:36:19 -07:00
Xingyao Wang
581d5ec7a8
feat(eval): increase resource factor for remote runtime when previous run failed due to resource (#5709)
2024-12-21 01:47:06 +08:00
Xingyao Wang
e9cafb0372
chore: Cleanup runtime exception handling (#5696)
2024-12-19 17:28:29 +00:00
Xingyao Wang
a531413d86
fix(eval): support setting hard timeout per evaluation instance (#5110)
2024-11-18 21:22:55 -05:00
Xingyao Wang
07f0d1ccb3
feat(llm): convert function call request for non-funcall OSS model (#4711)
Co-authored-by: Calvin Smith <email@cjsmith.io>
2024-11-15 00:40:09 +08:00
Calvin Smith
50e7da9c3d
fix(evaluation): SWE-bench evaluation script supports multiprocessing (#4943)
2024-11-12 12:19:57 -07:00
Engel Nyst
eeb2342509
Refactor history/event stream (#3808)
2024-11-05 03:36:14 +01:00
Xingyao Wang
966da7b7c8
feat(agent, CodeAct 2.2): native CodeAct support for Browsing (#4667)
Co-authored-by: tofarr <tofarr@gmail.com>
2024-11-05 00:27:27 +08:00
Xingyao Wang
ae13171194
feat(agent): CodeAct with function calling (#4537)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: tobitege <10787084+tobitege@users.noreply.github.com>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: tofarr <tofarr@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-29 11:06:33 +08:00
Xingyao Wang
7340b78962
feat(eval): rewrite log_completions to save completions to directory (#4566)
2024-10-25 16:36:11 +00:00
mamoodi
6f2e678028
Fix eval output path in case of @ char (#4416)
2024-10-15 22:45:08 +00:00
Xingyao Wang
25f9413965
[Eval] Fix eval stuck when result is too large for pbar (#4361)
2024-10-14 22:08:34 +08:00
Xingyao Wang
9cc9b19958
eval: improve swebench infer error handling and retry (#4205)
2024-10-04 07:09:56 -05:00
Xingyao Wang
53a015f718
fix: make llm_completions optional to fix eval_infer.py (#4148)
2024-10-02 03:55:03 +08:00
tobitege
c3bbe604eb
(fix) Fix logging in shared eval file to prevent key disclosure (#4108)
2024-09-28 19:33:16 +00:00
Xingyao Wang
81b3cd71b3
[eval] log evaluating warnings directly to console (#4026)
2024-09-26 03:42:32 +08:00
Xingyao Wang
1b1d8f0b02
[eval] Use imap_unordered for parallelizing evaluation (#4040)
2024-09-24 20:47:27 +00:00
Xingyao Wang
a66e738957
[eval] use mp Pool instead of ProcessPoolExecutor (#4025)
2024-09-24 23:59:06 +08:00
Xingyao Wang
714e46f29a
[eval] save eventstream & llm completions for SWE-Bench run_infer (#3923)
2024-09-22 04:39:13 +00:00
Xingyao Wang
5d7f2fd4ae
[eval] Allow evaluation of SWE-Bench patches on RemoteRuntime (#3927)
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
Co-authored-by: Graham Neubig <neubig@gmail.com>
2024-09-18 16:07:34 -04:00
Xingyao Wang
f996b31d64
[eval] Fix multi-processing bug (again^3) & allow set EXP_NAME for each run_infer (#3907)
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
2024-09-17 14:07:58 +00:00
Xingyao Wang
2b3925278d
[eval] refactor process instance logic into update_progress (#3875)
2024-09-15 18:47:15 -04:00
Engel Nyst
379f2b6f23
Fix queue length on Macs (#3867)
2024-09-14 01:11:29 +00:00
Xingyao Wang
3a1b8c093b
[eval] yet another eval fixes on multi-processing (#3854)
Co-authored-by: Graham Neubig <neubig@gmail.com>
2024-09-13 15:51:22 +00:00
Xingyao Wang
78c5f58adc
refactor & improve retry for the reliability of RemoteRuntime & evaluation (#3846)
2024-09-13 07:37:07 -04:00
tobitege
dbb671a8a5
logname fix; improve test calling instruction (#3666)
2024-08-30 17:15:31 +02:00
Xingyao Wang
090c911a50
(refactor) Make Runtime class synchronous (#3661)
* change runtime to be synchronous

* fix test runtime with the new interface

* fix arg

* fix eval

* fix missing config attribute

* fix plugins

* fix on_event by revert it back to async

* update upload_file endpoint

* fix argument to upload file

* remove unnecessary async for eval;
fix evaluation run in parallel

* use asyncio to run controller for eval

* revert file upload

* truncate eval test result output
2024-08-30 01:37:03 +00:00
tobitege
9c39f07430
(enh) Aider-Bench: make resumable with skip_num arg (#3626)
* added optional START_ID env flag to resume from that instance id

* prepare_dataset: fix comparisons by using instance id's as int

* aider bench complete_runtime: close runtime to close container

* added matrix display of instance id for logging

* fix typo in summarize_results.py saying summarise_results

* changed start_id to skip_num to skip rows from dataset (start_id wasn't supportable)

* doc changes about huggingface spaces to temporarily point back to OD
2024-08-28 15:42:01 +00:00
Raj Maheshwari
e72dc96d13
[Fix] Stop API key from leaking in evaluation outputs. (#3603)
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu>
2024-08-26 23:38:37 +02:00
tobitege
8fcf0817d4
(eval) Aider_bench: add eval_ids arg to run specific instance id's (#3592)
* add eval_ids arg to run specific instance id's; fix/extend README

* fix description in parser for --eval-ids

* fix test_arg_parser.py to account for added arg

* fix typo in README to say "summarize" instead of "summarise" for script
2024-08-27 00:49:26 +08:00
Robert Brennan
01ae22ef57
Rename OpenDevin to OpenHands (#3472)
* Replace OpenDevin with OpenHands

* Update CONTRIBUTING.md

* Update README.md

* Update README.md

* update poetry lock; move opendevin folder to openhands

* fix env var

* revert image references in docs

* revert permissions

* revert permissions

---------

Co-authored-by: Xingyao Wang <xingyao6@illinois.edu>
2024-08-20 00:44:54 +08:00
Xingyao Wang
31b244f95e
[Refactor, Evaluation] Refactor and clean up evaluation harness to remove global config and use EventStreamRuntime (#3230)
* move multi-line bash tests to test_runtime;
support multi-line bash for esruntime;

* add testcase to handle PS2 prompt

* use bashlex for bash parsing to handle multi-line commands;
add testcases for multi-line commands

* revert ghcr runtime change

* Apply stash

* fix run as other user;
make test async;

* fix test runtime for run as od

* add run-as-devin to all the runtime tests

* handle the case when username is root

* move all run-as-devin tests from sandbox;
only tests a few cases on different user to save time;

* move over multi-line echo related tests to test_runtime

* fix user-specific jupyter by fixing the pypoetry virtualenv folder

* make plugin's init async;
chdir at initialization of jupyter plugin;
move ipy simple testcase to test runtime;

* support agentskills import in
move tests for jupyter pwd tests;
overload `add_env_vars` for EventStreamRuntime to update env var also in Jupyter;
make agentskills read env var lazily, in case env var is updated;

* fix ServerRuntime agentskills issue

* move agnostic image test to test_runtime

* merge runtime tests in CI

* fix enable auto lint as env var

* update warning message

* update warning message

* test for different container images

* change parsing output as debug

* add exception handling for update_pwd_decorator

* fix unit test indentation

* add plugins as default input to Runtime class;
remove init_sandbox_plugins;
implement add_env_var (include jupyter) in the base class;

* fix server runtime auto lint

* Revert "add exception handling for update_pwd_decorator"

This reverts commit 2b668b1506e02145cb8f87e321aad62febca3d50.

* tries to print debugging info for agentskills

* explicitly setting uid (try fix permission issue)

* Revert "tries to print debugging info for agentskills"

This reverts commit 8be4c86756f0e3fc62957b327ba2ac4999c419de.

* set sandbox user id during testing to hopefully fix the permission issue

* add browser tools for server runtime

* try to debug for old pwd

* update debug cmd

* only test agnostic runtime when TEST_RUNTIME is Server

* fix temp dir mkdir

* load TEST_RUNTIME at the beginning

* remove ipython tests

* only log to file when DEBUG

* default logging to project root

* temporarily remove log to file

* fix LLM logger dir

* fix logger

* make set pwd an optional aux action

* fix prev pwd

* fix infinite recursion

* simplify

* do not import the whole od library to avoid logger folder by jupyter

* fix browsing

* increase timeout

* attempt to fix agentskills yet again

* clean up in testcases, since CI maybe run as non-root

* add _cause attribute for event.id

* remove parent

* add a bunch of debugging statement again for CI :(

* fix temp_dir fixture

* change all temp dir to follow pytest's tmp_path_factory

* remove extra bracket

* clean up error printing a bit

* jupyter chdir to self.config.workspace_mount_path_in_sandbox on initialization

* jupyter chdir to self.config.workspace_mount_path_in_sandbox on initialization

* add typing for tmp dir fixture

* clear the directory before running the test to avoid weird CI temp dir

* remove agnostic test case for server runtime

* Revert "remove agnostic test case for server runtime"

This reverts commit 30e2181c3fc1410e69596c2dcd06be01f1d016b3.

* disable agnostic tests in CI

* fix test

* make sure plugin arg is not passed when no plugin is specified;
remove redundant on_event function;

* move mock prompt

* rename runtime

* remove extra logging

* refactor run_controller's interface;
support multiple runtime for integration test;
filter out hostname for prompt

* uncomment other tests

* pass the right runtime to controller

* log runtime when start

* uncomment tests

* improve symbol filters

* add integration test prompts that seemed ok

* add integration test workflow

* add python3 to default ubuntu image

* symlink python and fix permission to jupyter pip

* add retry for jupyter execute server

* fix jupyter pip install;
add post-process for jupyter pip install;
simplify init by add agent_skills path to PYTHONPATH;
add testcase to tests jupyter pip install;

* fix bug

* use ubuntu:22.04 for eventstream integration tests

* add todo

* update testcase

* remove redundant code

* fix unit test

* reduce dependency for runtime

* try making llama-index an optional dependency that's not installed by default

* remove pip install since it seemed not needed

* log ipython execution;
await write message since it returns a future

* update ipy testcase

* do not install llama-index in CI

* do not install llama-index in the app docker as well

* set sandbox container image in the integration test script

* log plugins & env var for runtime

* update conftest for sha256

* add git

* remove all non-alphanumeric characters

* add working ipy module tests!

* default to use host network

* remove is_async from browser to make things a little more reliable;
retry loading browser when error;

* add sleep to wait a bit for http server

* kill http server before regenerate browsing tests

* fix browsing

* only set sandbox container image if undefined

* skip empty config value

* update evaluation to use the latest run_controller

* revert logger in execute_server to be compatible with server runtime

* revert logging level to fix jupyter

* set logger level

* revert the logging

* chmod for workspace to fix permission

* support getting timeout from action

* update test for server runtime

* try to fix file permission

* fix test_cmd_run_action_serialization_deserialization test (added timeout)

* poetry: pip 24.2, torch 2.2.2

* revert adding pip to pyproject.toml

* add build to dependencies in pyproject.toml

* forgot poetry lock --no-update

* fix a DelegatorAgent prompt_002.log (timeout)

* fix a DelegatorAgent prompt_003.log (timeout)

* couple more timeout attribs in prompt files

* some more prompt files

* prompts galore

* add clarification comment for timeout

* default timeout to config

* add assert

* update integration tests for eventstream

* update integration tests

* fix timeout for action<->dict

* remove redundant on_event

* default to use instance image

* update run_controller interface

* add logging for copy

* refactor swe_bench for the new design

* fix action execution timeout

* update lock

* remove build sandbox locally

* fix runtime

* use plain for-loop for single process

* remove extra print

* get swebench inference working

* print whole `test_result` dict

* got swebench patch post-process working

* update swe-bench evaluation readme

* refactor using shared reset_logger function

* move messy swebench prompt to a different file

* support the ability to specify whether to keep prompt

* support the ability to specify whether to keep prompt

* fix dockerfile

* fix import and remove unnecessary strip logic

* fix action serialization

* get agentbench running

* remove extra ls for agent bench

* fix agentbench metric

* factor out common documentation for eval

* update biocoder doc

* remove swe_env_box since it is no longer needed

* get biocoder working

* add func timeout for bird

* fix jupyter pwd with ~ as user name

* fix jupyter pwd with ~ as user name

* get bird working

* get browsing evaluation working

* make eda runnable

* fix id column

* fix eda run_infer

* unify eval output using a structured format;
make swebench compatible with that format;
update client source code for every swebench run;
do not inject testcmd for swebench

* standardize existing benchs for the new eval output

* set update source code = true

* get gaia standardized

* fix gaia

* gorilla refactored but stuck at language.so to test

* refactor and make gpqa work

* refactor humanevalfix and get it working

* refactor logic reasoning and get it working

* refactor browser env so it works with eventstream runtime for eval

* add initial version of miniwob refactor

* fix browsergym environment

* get miniwob working!!

* allowing injecting additional dependency to OD runtime docker image

* allowing injecting additional dependency to OD runtime docker image

* support logic reasoning with pre-injected dependency

* get mint working

* update runtime build

* fix mint docker

* add test for keep_prompt;
add missing await close for some tests

* update integration tests for eventstream runtime

* fix integration tests for server runtime

* refactor ml bench and toolqa

* refactor webarena

* fix default factory

* Update run_infer.py

* add APIError to retry

* increase timeout for swebench

* make sure to hide api key when dump eval output

* update the behavior of put source code to put files instead of tarball

* add dirhash to dependency

* sendintr when timeout

* fix dockerfile copy

* reduce timeout

* use dirhash to avoid repeat building for update source

* fix runtime_build testcase

* add dir_hash to docker build pipeline

* revert api error

* update poetry lock

* add retries for swebench run infer

* fix git patch

* update poetry lock

* adjust config order

* fix mount volumes

* enforce all eval to use "instance_id"

* remove file store from runtime

* make file_store public inside eventstream

* move the runtime logic inside `main` out

* support using async function for process_instance_fn

* refactor run_infer with the create_time

* fix file store

* Update evaluation/toolqa/utils.py

Co-authored-by: Graham Neubig <neubig@gmail.com>

* fix typo

---------

Co-authored-by: tobitege <tobitege@gmx.de>
Co-authored-by: super-dainiu <78588128+super-dainiu@users.noreply.github.com>
Co-authored-by: Graham Neubig <neubig@gmail.com>
2024-08-06 17:21:45 +00:00
Graham Neubig
3a21198424
Remove monologue agent (#3036)
* Remove monologue agent

* Fixes
2024-07-19 19:25:05 +00:00