54 Commits

Author SHA1 Message Date
juanmichelini
53c0c5a07b
SWE-bench_verified instruction baseline improvements to 60% (#7546)
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
2025-04-10 16:08:27 +00:00
Xingyao Wang
0087082643
Improve binary file handling and patch generation in SWE-bench evaluation (#7762)
Co-authored-by: openhands <openhands@all-hands.dev>
2025-04-08 22:57:33 +00:00
Xingyao Wang
ddda30d9b7
fix(eval): iterative evaluation improvements; SWE-Bench multimodal fixes (#7739)
Co-authored-by: Juan Michelini <juan@juan.com.uy>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: openhands <openhands@all-hands.dev>
2025-04-09 02:44:03 +08:00
Shixian Sheng
4fb073d1ea
Fixed a few hyperlinks. Translated some texts (#7652) 2025-04-02 22:10:19 +00:00
Xingyao Wang
648c8ffb21
(llm): Support OpenHands LM (#7598)
Co-authored-by: mamoodi <mamoodiha@gmail.com>
2025-03-31 17:29:31 +00:00
Xingyao Wang
54236f9617
[eval] Support SWE-Bench Multimodal (#7122)
Co-authored-by: openhands <openhands@all-hands.dev>
2025-03-31 07:42:44 -04:00
Xingyao Wang
9b9e728cf6
Iterative evaluation with rule-based critic (#7293) 2025-03-17 18:37:35 +00:00
Xingyao Wang
a4d632498c
SWE-Gym rollout stability fix & using a validated SWE-Gym set (#7182)
Co-authored-by: Robert Brennan <accounts@rbren.io>
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Graham Neubig <neubig@gmail.com>
2025-03-17 21:15:01 +08:00
Calvin Smith
303b7ab180
(fix): Conditional imports resolved in SWE-bench eval script while multiprocessing enabled (#7244)
Co-authored-by: Calvin Smith <calvin@all-hands.dev>
2025-03-13 13:29:11 -06:00
Elena Chistova
38e866cde4
Fix official SWE-Bench docker image prefix (#7214) 2025-03-12 18:23:19 +00:00
Xingyao Wang
a4908f9a75
[agent] system message + SWE-Bench instruction improvements (#7018) 2025-03-08 00:27:02 +08:00
Xingyao Wang
9f720a9d69
[eval] SWE-Gym Integration (#6651)
Co-authored-by: Robert Brennan <accounts@rbren.io>
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Graham Neubig <neubig@gmail.com>
2025-03-05 20:15:02 +00:00
Xingyao Wang
bbf40c6576
docs: cleanup and update SWE-Bench documentation; and remove the support of non-instance-level image (#7118)
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2025-03-06 03:18:40 +08:00
Xingyao Wang
4be33a079b
Update SWE-Bench README.md about RemoteRuntime (#7108) 2025-03-05 23:00:54 +08:00
Engel Nyst
395c1ea9e3
[Refactor] split runtime initialization (create, connect, init) in cli scripts (#7036) 2025-03-03 00:19:25 +01:00
Engel Nyst
660d1d1e64
Fix argument in swe-bench grading scripts (#7046) 2025-03-02 12:37:15 +08:00
Magic Mai
8a58e724c6
fix: Remove nested git repositories before adding files in SWE-bench (#6536)
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
2025-02-28 01:19:33 +00:00
Xingyao Wang
33780f97d0
[eval] Upgrade SWE-Bench to use official image and latest harness (#6838)
Co-authored-by: Robert Brennan <accounts@rbren.io>
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Graham Neubig <neubig@gmail.com>
2025-02-27 08:15:05 -05:00
Engel Nyst
4f98bce6df
Add selected_repo to command line (#6949) 2025-02-26 20:42:59 +01:00
Mateusz Kwiatkowski
6562297615
Replace shebang with /usr/bin/env bash for improved portability (#6876)
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
2025-02-24 18:07:28 +00:00
Xingyao Wang
1a7003a705
Add sysbox support to remote runtime for eval; Add memory monitor, stress tests to help debug memory issue (#6684)
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Graham Neubig <neubig@gmail.com>
2025-02-18 20:02:28 +00:00
Graham Neubig
e930cd0aef
Better error logging in posthog (#6346)
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Ray Myers <ray.myers@gmail.com>
2025-02-06 20:16:37 +00:00
Xingyao Wang
90bbd4edbe
fix: initialize default metadata with all required fields (#6583)
Co-authored-by: openhands <openhands@all-hands.dev>
2025-02-04 02:52:11 +08:00
tofarr
bbfdc62139
Fix for issue where retries continue on a closed runtime (#6564)
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu>
2025-02-03 08:44:09 -07:00
Xingyao Wang
1a9971b1bf
misc: make RemoteRuntime API timeout configurable (#6518)
Co-authored-by: Robert Brennan <accounts@rbren.io>
2025-01-30 06:30:18 +08:00
Xingyao Wang
391200510c
fix: revert #5506 for SWE-Bench performance regression (#6491)
Co-authored-by: Robert Brennan <accounts@rbren.io>
2025-01-28 22:52:57 +08:00
Engel Nyst
b9a3f1c753
Fix eval on remote runtime (#6398) 2025-01-21 20:49:30 +00:00
Engel Nyst
5b7fcfbe1a
Disable prompt extensions in SWE-bench (#6391) 2025-01-21 17:18:30 +00:00
Xingyao Wang
899c1f8360
fix(bash): also show timeout reminder when no_change_timeout is triggered (#6318)
Co-authored-by: Robert Brennan <accounts@rbren.io>
2025-01-18 03:31:23 +08:00
Xingyao Wang
72af7bbba2
feat(eval): misc SWE-Bench improvement - use different resources for different instances (#6313)
Co-authored-by: openhands <openhands@all-hands.dev>
2025-01-17 02:48:41 +08:00
Xingyao Wang
0bed17758f
fix: incorrect soft-timeout implementation & fix hard-timeout follow-up command (#6280) 2025-01-17 01:27:00 +08:00
Calvin Smith
6e4ff56934
feature: Condenser Interface and Defaults (#5306)
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Calvin Smith <calvin@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2025-01-08 04:36:30 +08:00
Dmitry Kozlov
17d722f3b3
Update README.md (#6076)
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
2025-01-06 17:31:19 +00:00
Xingyao Wang
ec70af9412
refactor: Replace pexpect with libtmux in BashSession (#4881)
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Robert Brennan <accounts@rbren.io>
2025-01-04 05:22:13 +08:00
Xingyao Wang
61ebec9ff7
feat(eval): better visualization for comparing two swe-bench runs (#5993) 2025-01-03 02:36:51 +00:00
Robert Brennan
0e4e1b3316
Factor out ActionExecutionClient (#5796) 2024-12-30 15:32:13 +00:00
OpenHands
8975fcd714
Fix issue #5748: Rename "Ran a Jupyter Command" to "Ran a Python Command" in UI (#5749)
Co-authored-by: Graham Neubig <neubig@gmail.com>
2024-12-26 23:30:19 +08:00
OpenHands
bfb191b5c7
Fix issue #5739: [Bug]: Move ./evaluation/swe_bench/scripts/cleanup_remote_runtime.sh to general eval utils (#5740) 2024-12-25 17:17:06 -05:00
Xingyao Wang
581d5ec7a8
feat(eval): increase resource factor for remote runtime when previous run failed due to resource (#5709) 2024-12-21 01:47:06 +08:00
Xingyao Wang
c333938384
feat(eval): add standard error to swebench summarize outputs (#5700)
Co-authored-by: openhands <openhands@all-hands.dev>
2024-12-20 08:39:43 +08:00
Xingyao Wang
e9cafb0372
chore: Cleanup runtime exception handling (#5696) 2024-12-19 17:28:29 +00:00
Xingyao Wang
9cdb8d06c0
fix(eval): Use cp -r instead of mv for SWE-Bench Initialization (#5659) 2024-12-17 21:21:27 +00:00
Engel Nyst
3297e4d5a8
Use litellm's modify params (#5636) 2024-12-17 21:32:49 +01:00
OpenHands
4998b5de32
Fix issue #5559: The turn limit should be measured from the last user interaction (#5560)
Co-authored-by: Graham Neubig <neubig@gmail.com>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2024-12-16 16:28:23 -05:00
Engel Nyst
b295f5775c
Revert "Fix issue #5609: Use litellm's modify_params with default True" (#5631) 2024-12-16 20:39:57 +00:00
OpenHands
09735c7869
Fix issue #5609: Use litellm's modify_params with default True (#5611)
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2024-12-16 20:18:45 +01:00
Engel Nyst
4716955960
Remove unused codeact-SWE agent (#5600)
Co-authored-by: openhands <openhands@all-hands.dev>
2024-12-14 20:49:44 +01:00
Ryan H. Tran
8ae2fb636e
Remove symlink use for swebench setup (#5549) 2024-12-13 22:18:14 +08:00
Engel Nyst
b11e905988
Verify costs script (#5469) 2024-12-10 14:20:53 +01:00
Engel Nyst
455e667739
add cost to summary (#5473) 2024-12-10 03:14:03 +08:00