OpenHands

mirror of https://github.com/OpenHands/OpenHands.git synced 2026-03-22 13:47:19 +08:00

Author	SHA1	Message	Date
Ziru "Ron" Chen	db4e1dbbec	[eval] Add ScienceAgentBench. (#4645 ) Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>	2024-11-01 02:30:55 +08:00
Xingyao Wang	9c2b48ff5d	fix(eval): SWE-Bench instance with upper-case instance id (#4649 )	2024-10-30 21:24:18 +00:00
Xingyao Wang	6d19c93d19	[eval] add evaluation workflow (#4489 ) Co-authored-by: openhands <openhands@all-hands.dev> Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>	2024-10-29 13:52:25 +00:00
Xingyao Wang	ae13171194	feat(agent): CodeAct with function calling (#4537 ) Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: tobitege <10787084+tobitege@users.noreply.github.com> Co-authored-by: Engel Nyst <enyst@users.noreply.github.com> Co-authored-by: tofarr <tofarr@gmail.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2024-10-29 11:06:33 +08:00
Xingyao Wang	1f23dc89b6	fix(eval): add runtime.connect to all eval harness (#4565 )	2024-10-26 00:41:30 +08:00
Xingyao Wang	7340b78962	feat(eval): rewrite log_completions to save completions to directory (#4566 )	2024-10-25 16:36:11 +00:00
tofarr	c4f5c07be1	Refactor: shorter syntax (#4558 )	2024-10-25 06:45:28 -06:00
Graham Neubig	ce2430180f	Update README.md to fix miniwob name (#4534 )	2024-10-23 18:24:43 +00:00
Xingyao Wang	2d5b360505	refactor: re-organize different runtime implementations into an impl folder (#4346 ) Co-authored-by: Graham Neubig <neubig@gmail.com>	2024-10-23 10:10:03 +00:00
Graham Neubig	54250e3fe2	Update evaluation README.md structure (#4516 )	2024-10-22 14:42:22 +00:00
Xingyao Wang	da548d308c	[agent] LLM-based editing (#3985 ) Co-authored-by: Tim O'Farrell <tofarr@gmail.com> Co-authored-by: Engel Nyst <enyst@users.noreply.github.com> Co-authored-by: Robert Brennan <accounts@rbren.io> Co-authored-by: Graham Neubig <neubig@gmail.com>	2024-10-22 04:51:44 +08:00
Alejandro Cuadron Lafuente	a9a593bb21	[Fix] Added support to specify the platform on which the runtime image should be built. (#4402 ) Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Xingyao Wang <xingyao@all-hands.dev> Co-authored-by: mamoodi <mamoodiha@gmail.com> Co-authored-by: tofarr <tofarr@gmail.com> Co-authored-by: Robert Brennan <contact@rbren.io>	2024-10-20 09:19:05 +08:00
Xingyao Wang	91308ba4dc	feat: clean-up retries RemoteRuntime & add FatalErrorObservation (#4485 )	2024-10-18 17:23:13 +00:00
Jiayi Pan	c1b323a076	Show actual dataset name in swebench log directory (#4417 )	2024-10-17 10:32:38 +08:00
Xingyao Wang	84a578ad20	[test] remove integration tests from CI & move them into evaluation (#4447 )	2024-10-17 05:38:23 +08:00
mamoodi	6f2e678028	Fix eval output path in case of @ char (#4416 )	2024-10-15 22:45:08 +00:00
Abhijeetsingh Meena	173018eb58	fix: Resolves HumanEval Inference by replacing task_id with instance_id (#4364 ) Co-authored-by: Harshit Surana <surana.h@gmail.com>	2024-10-15 15:18:38 +00:00
Xingyao Wang	50c13aad98	[Eval] Improve SWE-Bench Eval harness: multi-run support & entry script simplification (#4396 )	2024-10-15 21:34:52 +08:00
Xingyao Wang	25f9413965	[Eval] Fix eval stuck when `result` is too large for pbar (#4361 )	2024-10-14 22:08:34 +08:00
Xingyao Wang	4dfc7a7ef0	[Eval] Add a more lightweight / easier-to-use SWE-Bench output visualizer (#4360 ) Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>	2024-10-14 02:09:01 +00:00
Xingyao Wang	b23c7aab5a	[eval] stop set sid in eval (#4311 )	2024-10-10 11:47:27 +08:00
Robert Brennan	45fb4fb9bc	allow reconnecting to a runtime (#4223 )	2024-10-09 16:37:52 +00:00
Engel Nyst	e6847e9e61	Move agenthub within openhands (#4130 )	2024-10-08 00:34:18 +00:00
Alejandro Cuadron Lafuente	a3571ec510	[Fix] Error when trying to pull all docker evaluation containers (#4244 )	2024-10-08 05:03:36 +08:00
Aditya Bharat Soni	0809d26f4d	fix: Allow evaluation benchmarks to pass image urls in run_controller() instead of simply passing strings (#4100 ) Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>	2024-10-07 15:37:08 -04:00
Xingyao Wang	01ae54a69d	fix swebench repo/version being string (#4241 )	2024-10-07 22:01:42 +08:00
Xingyao Wang	245334e89d	[eval] improve update output script for swe-bench (#4180 )	2024-10-04 15:10:03 +00:00
Xingyao Wang	80a631361b	eval: update aiderbench readme (#4209 )	2024-10-04 09:26:12 -04:00
Xingyao Wang	9cc9b19958	eval: improve swebench infer error handling and retry (#4205 )	2024-10-04 07:09:56 -05:00
Xingyao Wang	0c2a35b256	[eval] update aider bench scripts (#4203 )	2024-10-04 02:23:06 +00:00
tofarr	152f99c64f	Chore Bump python version (#3545 )	2024-10-03 13:40:55 -04:00
Xingyao Wang	53a015f718	fix: make llm_completions optional to fix `eval_infer.py` (#4148 )	2024-10-02 03:55:03 +08:00
mamoodi	0144caaf1f	Update eval doc for remote runtime (#4145 )	2024-10-01 13:14:36 -04:00
Xingyao Wang	1109637efb	Update instruction for new version of eval runtime-api (#4128 )	2024-09-30 23:48:38 +00:00
Xingyao Wang	8d6eda3623	fix eval_infer.sh to correctly copy SWE-Bench logs (#4111 )	2024-09-29 18:39:18 -05:00
tobitege	c3bbe604eb	(fix) Fix logging in shared eval file to prevent key disclosure (#4108 )	2024-09-28 19:33:16 +00:00
Xingyao Wang	81b3cd71b3	[eval] log evaluating warnings directly to console (#4026 )	2024-09-26 03:42:32 +08:00
Xingyao Wang	1b1d8f0b02	[eval] Use `imap_unorderd` for parallizing evaluation (#4040 )	2024-09-24 20:47:27 +00:00
Xingyao Wang	a66e738957	[eval] use mp Pool instead ProcessPoolExecutor (#4025 )	2024-09-24 23:59:06 +08:00
Ikko Eltociear Ashimine	c84495830e	[eval] update swe_bench/README.md (#3990 )	2024-09-23 11:03:09 +02:00
Xingyao Wang	714e46f29a	[eval] save eventstream & llm completions for SWE-Bench run_infer (#3923 )	2024-09-22 04:39:13 +00:00
Xingyao Wang	b13ed017d8	[eval] add git patch post-processing for SWE-Bench eval_infer (#3980 )	2024-09-20 15:33:53 +00:00
Engel Nyst	8fdfece059	Refactor messages serialization (#3832 ) Co-authored-by: Robert Brennan <accounts@rbren.io>	2024-09-18 23:48:58 +02:00
tofarr	ad0b549d8b	Feat Tightening up Timeouts and interrupt conditions. (#3926 )	2024-09-18 20:50:42 +00:00
Xingyao Wang	5d7f2fd4ae	[eval] Allow evaluation of SWE-Bench patches on `RemoteRuntime` (#3927 ) Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> Co-authored-by: Graham Neubig <neubig@gmail.com>	2024-09-18 16:07:34 -04:00
Engel Nyst	ef09f0fe37	Small fix in readme (#3912 )	2024-09-17 14:33:25 +00:00
Xingyao Wang	f996b31d64	[eval] Fix multi-processing bug (again^3) & allow set EXP_NAME for each `run_infer` (#3907 ) Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>	2024-09-17 14:07:58 +00:00
tobitege	52c5abccbf	(enh) Dockerfile.j2: improve env vars for bash and activate in .bashrc (#3871 )	2024-09-17 08:49:04 +02:00
Graham Neubig	243cb492aa	Run pre-commit on all files (#3884 )	2024-09-16 11:07:08 -04:00
Xingyao Wang	2b3925278d	[eval] refactor process instance logic into `update_progress` (#3875 )	2024-09-15 18:47:15 -04:00

220 Commits