OpenHands

mirror of https://github.com/OpenHands/OpenHands.git synced 2025-12-26 05:48:36 +08:00

Author	SHA1	Message	Date
Xingyao Wang	84a578ad20	[test] remove integration tests from CI & move them into evaluation (#4447)	2024-10-17 05:38:23 +08:00
mamoodi	6f2e678028	Fix eval output path in case of @ char (#4416)	2024-10-15 22:45:08 +00:00
Abhijeetsingh Meena	173018eb58	fix: Resolves HumanEval Inference by replacing task_id with instance_id (#4364) Co-authored-by: Harshit Surana <surana.h@gmail.com>	2024-10-15 15:18:38 +00:00
Xingyao Wang	50c13aad98	[Eval] Improve SWE-Bench Eval harness: multi-run support & entry script simplification (#4396)	2024-10-15 21:34:52 +08:00
Xingyao Wang	25f9413965	[Eval] Fix eval stuck when `result` is too large for pbar (#4361)	2024-10-14 22:08:34 +08:00
Xingyao Wang	4dfc7a7ef0	[Eval] Add a more lightweight / easier-to-use SWE-Bench output visualizer (#4360) Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>	2024-10-14 02:09:01 +00:00
Xingyao Wang	b23c7aab5a	[eval] stop set sid in eval (#4311)	2024-10-10 11:47:27 +08:00
Robert Brennan	45fb4fb9bc	allow reconnecting to a runtime (#4223)	2024-10-09 16:37:52 +00:00
Engel Nyst	e6847e9e61	Move agenthub within openhands (#4130)	2024-10-08 00:34:18 +00:00
Alejandro Cuadron Lafuente	a3571ec510	[Fix] Error when trying to pull all docker evaluation containers (#4244)	2024-10-08 05:03:36 +08:00
Aditya Bharat Soni	0809d26f4d	fix: Allow evaluation benchmarks to pass image urls in run_controller() instead of simply passing strings (#4100) Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>	2024-10-07 15:37:08 -04:00
Xingyao Wang	01ae54a69d	fix swebench repo/version being string (#4241)	2024-10-07 22:01:42 +08:00
Xingyao Wang	245334e89d	[eval] improve update output script for swe-bench (#4180)	2024-10-04 15:10:03 +00:00
Xingyao Wang	80a631361b	eval: update aiderbench readme (#4209)	2024-10-04 09:26:12 -04:00
Xingyao Wang	9cc9b19958	eval: improve swebench infer error handling and retry (#4205)	2024-10-04 07:09:56 -05:00
Xingyao Wang	0c2a35b256	[eval] update aider bench scripts (#4203)	2024-10-04 02:23:06 +00:00
tofarr	152f99c64f	Chore Bump python version (#3545)	2024-10-03 13:40:55 -04:00
Xingyao Wang	53a015f718	fix: make llm_completions optional to fix `eval_infer.py` (#4148)	2024-10-02 03:55:03 +08:00
mamoodi	0144caaf1f	Update eval doc for remote runtime (#4145)	2024-10-01 13:14:36 -04:00
Xingyao Wang	1109637efb	Update instruction for new version of eval runtime-api (#4128)	2024-09-30 23:48:38 +00:00
Xingyao Wang	8d6eda3623	fix eval_infer.sh to correctly copy SWE-Bench logs (#4111)	2024-09-29 18:39:18 -05:00
tobitege	c3bbe604eb	(fix) Fix logging in shared eval file to prevent key disclosure (#4108)	2024-09-28 19:33:16 +00:00
Xingyao Wang	81b3cd71b3	[eval] log evaluating warnings directly to console (#4026)	2024-09-26 03:42:32 +08:00
Xingyao Wang	1b1d8f0b02	[eval] Use `imap_unorderd` for parallizing evaluation (#4040)	2024-09-24 20:47:27 +00:00
Xingyao Wang	a66e738957	[eval] use mp Pool instead ProcessPoolExecutor (#4025)	2024-09-24 23:59:06 +08:00
Ikko Eltociear Ashimine	c84495830e	[eval] update swe_bench/README.md (#3990)	2024-09-23 11:03:09 +02:00
Xingyao Wang	714e46f29a	[eval] save eventstream & llm completions for SWE-Bench run_infer (#3923)	2024-09-22 04:39:13 +00:00
Xingyao Wang	b13ed017d8	[eval] add git patch post-processing for SWE-Bench eval_infer (#3980)	2024-09-20 15:33:53 +00:00
Engel Nyst	8fdfece059	Refactor messages serialization (#3832) Co-authored-by: Robert Brennan <accounts@rbren.io>	2024-09-18 23:48:58 +02:00
tofarr	ad0b549d8b	Feat Tightening up Timeouts and interrupt conditions. (#3926)	2024-09-18 20:50:42 +00:00
Xingyao Wang	5d7f2fd4ae	[eval] Allow evaluation of SWE-Bench patches on `RemoteRuntime` (#3927) Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> Co-authored-by: Graham Neubig <neubig@gmail.com>	2024-09-18 16:07:34 -04:00
Engel Nyst	ef09f0fe37	Small fix in readme (#3912)	2024-09-17 14:33:25 +00:00
Xingyao Wang	f996b31d64	[eval] Fix multi-processing bug (again^3) & allow set EXP_NAME for each `run_infer` (#3907) Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>	2024-09-17 14:07:58 +00:00
tobitege	52c5abccbf	(enh) Dockerfile.j2: improve env vars for bash and activate in .bashrc (#3871)	2024-09-17 08:49:04 +02:00
Graham Neubig	243cb492aa	Run pre-commit on all files (#3884)	2024-09-16 11:07:08 -04:00
Xingyao Wang	2b3925278d	[eval] refactor process instance logic into `update_progress` (#3875)	2024-09-15 18:47:15 -04:00
Engel Nyst	379f2b6f23	Fix queue length on Macs (#3867)	2024-09-14 01:11:29 +00:00
Xingyao Wang	3a1b8c093b	[eval] yet another eval fixes on multi-processing (#3854) Co-authored-by: Graham Neubig <neubig@gmail.com>	2024-09-13 15:51:22 +00:00
Xingyao Wang	78c5f58adc	refactor & improve retry for the reliability of `RemoteRuntime` & evaluation (#3846)	2024-09-13 07:37:07 -04:00
Xingyao Wang	797f02ff6f	rename huggingface evaluation benchmark (#3845)	2024-09-12 18:50:26 +00:00
Xingyao Wang	47d9621742	[eval] SWE-Bench eval usability fixes (#3836) * [eval] increase timeout for swebench eval init/complete * allow CmdRunAction to optionally block when .timeout is setted * fix unit test for serialization * fix unit tests for security analyzer * fix integration tests * add more timeout * only check P2P when instances are non-empty; convert P2P and F2P columns to string instead of list --------- Co-authored-by: Graham Neubig <neubig@gmail.com>	2024-09-12 16:33:51 +00:00
Xingyao Wang	2fe2f4c530	[eval] increase timeout for SWEBench eval init/complete (#3829) * [eval] increase timeout for swebench eval init/complete * allow CmdRunAction to optionally block when .timeout is setted * fix unit test for serialization * fix unit tests for security analyzer * fix integration tests * add more timeout	2024-09-12 15:20:58 +00:00
Jiayi Pan	43c4a7fff4	Allow Generalized SWE-Bench format for evaluation (#3752) * allow generalized swe-bench format * Update run_infer.py * fix linter --------- Co-authored-by: Xingyao Wang <xingyao6@illinois.edu>	2024-09-06 13:05:00 +00:00
Xingyao Wang	688068a44e	Fix issues for running `RemoteRuntime` in parallel on SWE-Bench (#3716) * feat: add SWE-bench fullset support * fix instance image list * update eval script and documentation * increase timeout for remote runtime * add push script * handle the case when ret push is an generator * update pbar * set SWE-Bench default to run SWE-Bench lite * add script to cleanup remote runtime * fix the cases when tag is too long * update README * update readme for cleanup * rename od to oh * Update evaluation/swe_bench/README.md Co-authored-by: Graham Neubig <neubig@gmail.com> * Update evaluation/swe_bench/README.md Co-authored-by: Graham Neubig <neubig@gmail.com> * Update evaluation/swe_bench/scripts/cleanup_remote_runtime.sh Co-authored-by: Graham Neubig <neubig@gmail.com> * Update evaluation/swe_bench/scripts/cleanup_remote_runtime.sh Co-authored-by: Graham Neubig <neubig@gmail.com> * Update evaluation/swe_bench/scripts/cleanup_remote_runtime.sh Co-authored-by: Graham Neubig <neubig@gmail.com> * gets API key and Runtime from env var --------- Co-authored-by: Graham Neubig <neubig@gmail.com>	2024-09-05 10:34:31 +08:00
Xingyao Wang	d8a87d7ccb	[Eval] Make SWE-Bench run_infer.sh to default to run SWE-Bench Lite (#3704) * feat: add SWE-bench fullset support * fix instance image list * update eval script and documentation * increase timeout for remote runtime * add push script * handle the case when ret push is an generator * update pbar * set SWE-Bench default to run SWE-Bench lite	2024-09-04 00:58:14 +08:00
Xingyao Wang	d283420ac2	feat: add SWE-bench fullset support (#3477) * feat: add SWE-bench fullset support * fix instance image list * update eval script and documentation * add push script * handle the case when ret push is an generator * update pbar	2024-09-02 20:28:52 -04:00
mamoodi	6fcc4ca052	fix eval README link (#3692)	2024-09-02 09:29:42 +08:00
tobitege	dbb671a8a5	logname fix; improve test calling instruction (#3666)	2024-08-30 17:15:31 +02:00
Xingyao Wang	090c911a50	(refactor) Make `Runtime` class synchronous (#3661) * change runtime to be synchronous * fix test runtime with the new interface * fix arg * fix eval * fix missing config attribute * fix plugins * fix on_event by revert it back to async * update upload_file endpoint * fix argument to upload file * remove unncessary async for eval; fix evaluation run in parallel * use asyncio to run controller for eval * revert file upload * truncate eval test result output	2024-08-30 01:37:03 +00:00
Xingyao Wang	8b1f207d39	feat: support remote runtime (#3406) * feat: refactor building logic into runtime builder * return image name * fix testcases * use runtime builder for eventstream runtime * have runtime builder return str * add api_key to sandbox config * draft remote runtime * remove extra if clause * initialize runtime based on box class * add build logic * use base64 for file upload * get runtime image prefix from API * replace ___ with _s_ to make it a valid image name * use /build to start build and /build_status to check the build progress * update logging * fix exit code * always use port * add remote runtime * rename runtime * fix tests import * make dir first if work_dir does not exists; * update debug print to remote runtime * fix exit close_sync * update logging * add retry for stop * use all box class for test keep prompt * fix test browsing * add retry stop * merge init commands to save startup time * fix await * remove sandbox url * support execute through specific runtime url * fix file ops * simplify close * factor out runtime retry code * fix exception handling * fix content type error (e.g., bad gateway when runtime is not ready) * add retry for wait until alive; add retry for check image exists * Revert "add retry for wait until alive;" This reverts commit dd013cd2681a159cd07747497d8c95e145d01c32. * retry when wait until alive * clean up msg * directly save sdist to temp dir for _put_source_code_to_dir * support running testcases in parallel * tweak logging; try to close session * try to close session even on exception * update poetry lock * support remote to run integration tests * add warning for workspace base on remote runtime * set default runtime api * remove server runtime * update poetry lock * support running swe-bench (n=1) eval on remoteruntime * add a timeout of 30 min * add todo for docker namespace * update poetry loc	2024-08-29 15:53:37 +00:00

406 Commits