OpenHands

mirror of https://github.com/OpenHands/OpenHands.git synced 2025-12-26 05:48:36 +08:00

Author	SHA1	Message	Date
Xingyao Wang	c2f46200c0	chore(lint): Apply comprehensive linting and formatting fixes (#10287 ) Co-authored-by: openhands <openhands@all-hands.dev>	2025-08-13 21:13:19 +02:00
Ibragim Badertdinov	19a6b6b618	feat(eval): Support evaluation on SWE-rebench (#10251 )	2025-08-12 14:05:43 +00:00
juanmichelini	ea50fe4e3c	Fix: Continue evaluation when an instance fails after max retries (#8868 ) Co-authored-by: openhands <openhands@all-hands.dev> Co-authored-by: Xingyao Wang <xingyaoww@gmail.com> Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>	2025-07-16 22:42:44 +00:00
Ryan H. Tran	dfa54673d2	[OH-Versa] Add remaining browsing & GAIA eval improvement (#9015 ) Co-authored-by: openhands <openhands@all-hands.dev> Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>	2025-06-25 12:36:15 +07:00
Linghao Zhang	a93b0457c6	feat(eval): Support evaluation on SWE-bench-Live (#9137 )	2025-06-15 12:30:47 +00:00
Graham Neubig	689d3c9046	Update pre-commit hook versions to most recent versions (#8343 ) Co-authored-by: openhands <openhands@all-hands.dev>	2025-05-08 03:59:13 +00:00
Rohit Malhotra	9adfcede31	(Hotfix): Track reason for Error AgentState (#7584 ) Co-authored-by: openhands <openhands@all-hands.dev>	2025-03-31 21:24:42 +00:00
Xingyao Wang	01e0e29a9f	Reduce bash SOFT timeout from 30 to 10 seconds (#7423 ) Co-authored-by: openhands <openhands@all-hands.dev>	2025-03-22 22:42:24 +00:00
Xingyao Wang	33780f97d0	[eval] Upgrade SWE-Bench to use official image and latest harness (#6838 ) Co-authored-by: Robert Brennan <accounts@rbren.io> Co-authored-by: openhands <openhands@all-hands.dev> Co-authored-by: Engel Nyst <enyst@users.noreply.github.com> Co-authored-by: Graham Neubig <neubig@gmail.com>	2025-02-27 08:15:05 -05:00
Mateusz Kwiatkowski	6562297615	Replace shebang with /usr/bin/env bash for improved portability (#6876 ) Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>	2025-02-24 18:07:28 +00:00
Xingyao Wang	1a7003a705	Add `sysbox` support to remote runtime for eval; Add memory monitor, stress tests to help debug memory issue (#6684 ) Co-authored-by: openhands <openhands@all-hands.dev> Co-authored-by: Engel Nyst <enyst@users.noreply.github.com> Co-authored-by: Graham Neubig <neubig@gmail.com>	2025-02-18 20:02:28 +00:00
Boxuan Li	ef12bc5381	Evaluation harness: Add agent config option (#6662 )	2025-02-13 15:05:03 -05:00
Xingyao Wang	2b04ee2e62	feat(eval): reliability improvement for SWE-Bench eval_infer (#6347 )	2025-01-18 14:02:59 -05:00
Calvin Smith	a12087243a	Pydantic-based configuration and setting objects (#6321 ) Co-authored-by: Calvin Smith <calvin@all-hands.dev> Co-authored-by: Graham Neubig <neubig@gmail.com> Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>	2025-01-17 12:33:22 -07:00
Xingyao Wang	899c1f8360	fix(bash): also show timeout reminder when no_change_timeout is triggered (#6318 ) Co-authored-by: Robert Brennan <accounts@rbren.io>	2025-01-18 03:31:23 +08:00
tofarr	23473070b9	Revert "Config objects as Pydantic BaseModels (#6176 )" (#6214 )	2025-01-13 07:36:25 -07:00
Calvin Smith	873dddb4e8	Config objects as Pydantic BaseModels (#6176 ) Co-authored-by: Calvin Smith <calvin@all-hands.dev> Co-authored-by: Graham Neubig <neubig@gmail.com>	2025-01-12 15:09:45 -05:00
Calvin Smith	6e4ff56934	feature: Condenser Interface and Defaults (#5306 ) Co-authored-by: openhands <openhands@all-hands.dev> Co-authored-by: Calvin Smith <calvin@all-hands.dev> Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>	2025-01-08 04:36:30 +08:00
Xingyao Wang	f14f75b064	feat: runtime improvements for rate-limit and 502/503/404 error (#5975 )	2025-01-03 08:36:19 -07:00
OpenHands	bfb191b5c7	Fix issue #5739 : [Bug]: Move ./evaluation/swe_bench/scripts/cleanup_remote_runtime.sh to general eval utils (#5740 )	2024-12-25 17:17:06 -05:00
Xingyao Wang	581d5ec7a8	feat(eval): increase resource factor for remote runtime when previous run failed due to resource (#5709 )	2024-12-21 01:47:06 +08:00
Xingyao Wang	e9cafb0372	chore: Cleanup runtime exception handling (#5696 )	2024-12-19 17:28:29 +00:00
Xingyao Wang	9908e1b285	[Evaluation]: Log openhands version in eval output folder, instead of agent version (#5394 )	2024-12-04 03:33:43 +00:00
Xingyao Wang	a531413d86	fix(eval): support setting hard timeout per evaluation instance (#5110 )	2024-11-18 21:22:55 -05:00
Xingyao Wang	07f0d1ccb3	feat(llm): convert function call request for non-funcall OSS model (#4711 ) Co-authored-by: Calvin Smith <email@cjsmith.io>	2024-11-15 00:40:09 +08:00
Calvin Smith	50e7da9c3d	fix(evaluation): SWE-bench evaluation script supports multiprocessing (#4943 )	2024-11-12 12:19:57 -07:00
Engel Nyst	eeb2342509	Refactor history/event stream (#3808 )	2024-11-05 03:36:14 +01:00
Xingyao Wang	966da7b7c8	feat(agent, CodeAct 2.2): native CodeAct support for Browsing (#4667 ) Co-authored-by: tofarr <tofarr@gmail.com>	2024-11-05 00:27:27 +08:00
Xingyao Wang	ae13171194	feat(agent): CodeAct with function calling (#4537 ) Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: tobitege <10787084+tobitege@users.noreply.github.com> Co-authored-by: Engel Nyst <enyst@users.noreply.github.com> Co-authored-by: tofarr <tofarr@gmail.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2024-10-29 11:06:33 +08:00
Xingyao Wang	7340b78962	feat(eval): rewrite log_completions to save completions to directory (#4566 )	2024-10-25 16:36:11 +00:00
mamoodi	6f2e678028	Fix eval output path in case of @ char (#4416 )	2024-10-15 22:45:08 +00:00
Xingyao Wang	25f9413965	[Eval] Fix eval stuck when `result` is too large for pbar (#4361 )	2024-10-14 22:08:34 +08:00
Engel Nyst	e6847e9e61	Move agenthub within openhands (#4130 )	2024-10-08 00:34:18 +00:00
Xingyao Wang	9cc9b19958	eval: improve swebench infer error handling and retry (#4205 )	2024-10-04 07:09:56 -05:00
Xingyao Wang	53a015f718	fix: make llm_completions optional to fix `eval_infer.py` (#4148 )	2024-10-02 03:55:03 +08:00
tobitege	c3bbe604eb	(fix) Fix logging in shared eval file to prevent key disclosure (#4108 )	2024-09-28 19:33:16 +00:00
Xingyao Wang	81b3cd71b3	[eval] log evaluating warnings directly to console (#4026 )	2024-09-26 03:42:32 +08:00
Xingyao Wang	1b1d8f0b02	[eval] Use `imap_unorderd` for parallizing evaluation (#4040 )	2024-09-24 20:47:27 +00:00
Xingyao Wang	a66e738957	[eval] use mp Pool instead ProcessPoolExecutor (#4025 )	2024-09-24 23:59:06 +08:00
Xingyao Wang	714e46f29a	[eval] save eventstream & llm completions for SWE-Bench run_infer (#3923 )	2024-09-22 04:39:13 +00:00
Xingyao Wang	5d7f2fd4ae	[eval] Allow evaluation of SWE-Bench patches on `RemoteRuntime` (#3927 ) Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> Co-authored-by: Graham Neubig <neubig@gmail.com>	2024-09-18 16:07:34 -04:00
Xingyao Wang	f996b31d64	[eval] Fix multi-processing bug (again^3) & allow set EXP_NAME for each `run_infer` (#3907 ) Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>	2024-09-17 14:07:58 +00:00
Xingyao Wang	2b3925278d	[eval] refactor process instance logic into `update_progress` (#3875 )	2024-09-15 18:47:15 -04:00
Engel Nyst	379f2b6f23	Fix queue length on Macs (#3867 )	2024-09-14 01:11:29 +00:00
Xingyao Wang	3a1b8c093b	[eval] yet another eval fixes on multi-processing (#3854 ) Co-authored-by: Graham Neubig <neubig@gmail.com>	2024-09-13 15:51:22 +00:00
Xingyao Wang	78c5f58adc	refactor & improve retry for the reliability of `RemoteRuntime` & evaluation (#3846 )	2024-09-13 07:37:07 -04:00
tobitege	dbb671a8a5	logname fix; improve test calling instruction (#3666 )	2024-08-30 17:15:31 +02:00
Xingyao Wang	090c911a50	(refactor) Make `Runtime` class synchronous (#3661 ) * change runtime to be synchronous * fix test runtime with the new interface * fix arg * fix eval * fix missing config attribute * fix plugins * fix on_event by revert it back to async * update upload_file endpoint * fix argument to upload file * remove unncessary async for eval; fix evaluation run in parallel * use asyncio to run controller for eval * revert file upload * truncate eval test result output	2024-08-30 01:37:03 +00:00
tobitege	9c39f07430	(enh) Aider-Bench: make resumable with skip_num arg (#3626 ) * added optional START_ID env flag to resume from that instance id * prepare_dataset: fix comparisons by using instance id's as int * aider bench complete_runtime: close runtime to close container * added matrix display of instance id for logging * fix typo in summarize_results.py saying summarise_results * changed start_id to skip_num to skip rows from dataset (start_id wasn't supportable) * doc changes about huggingface spaces to temporarily point back to OD	2024-08-28 15:42:01 +00:00
Raj Maheshwari	e72dc96d13	[Fix] Stop API key from leaking in evaluation outputs. (#3603 ) Co-authored-by: Xingyao Wang <xingyao6@illinois.edu>	2024-08-26 23:38:37 +02:00

60 Commits