Xingyao Wang
648c8ffb21
(llm): Support OpenHands LM (#7598)
...
Co-authored-by: mamoodi <mamoodiha@gmail.com>
2025-03-31 17:29:31 +00:00
Xingyao Wang
54236f9617
[eval] Support SWE-Bench Multimodal (#7122)
...
Co-authored-by: openhands <openhands@all-hands.dev>
2025-03-31 07:42:44 -04:00
tofarr
1230b229b5
Replace use of requests with httpx (#7354)
2025-03-26 13:37:10 +00:00
Zach
0b3d15a4d7
Fix missing 'fi' statement in GAIA benchmark scripts/run_infer.sh (#7465)
2025-03-24 16:04:25 +00:00
Xingyao Wang
01e0e29a9f
Reduce bash SOFT timeout from 30 to 10 seconds (#7423)
...
Co-authored-by: openhands <openhands@all-hands.dev>
2025-03-22 22:42:24 +00:00
Engel Nyst
83458f5146
Fix style issues with pre-commit (#7318)
...
Co-authored-by: openhands <openhands@all-hands.dev>
2025-03-18 01:34:27 +00:00
kjain14
507afd7f06
Add TestGenEval benchmark (#5534)
...
Co-authored-by: Kush Dave Jain <kdjain@pit.isri.cmu.edu>
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Graham Neubig <neubig@gmail.com>
2025-03-17 20:16:45 +00:00
Xingyao Wang
9b9e728cf6
Iterative evaluation with rule-based critic (#7293)
2025-03-17 18:37:35 +00:00
Xingyao Wang
a4d632498c
SWE-Gym rollout stability fix & using a validated SWE-Gym set (#7182)
...
Co-authored-by: Robert Brennan <accounts@rbren.io>
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Graham Neubig <neubig@gmail.com>
2025-03-17 21:15:01 +08:00
Engel Nyst
dd09d46ccb
Remove DelegatorAgent (fix #7280)
...
Co-authored-by: openhands <openhands@all-hands.dev>
2025-03-16 16:49:28 +01:00
Calvin Smith
303b7ab180
(fix): Conditional imports resolved in SWE-bench eval script while multiprocessing enabled (#7244)
...
Co-authored-by: Calvin Smith <calvin@all-hands.dev>
2025-03-13 13:29:11 -06:00
Elena Chistova
38e866cde4
Fix official SWE-Bench docker image prefix (#7214)
2025-03-12 18:23:19 +00:00
juanmichelini
b36deca265
Added link to paper in commit0 README (#7221)
2025-03-12 17:17:22 +00:00
Xingyao Wang
a4908f9a75
[agent] system message + SWE-Bench instruction improvements (#7018)
2025-03-08 00:27:02 +08:00
Nan Jiang
ec087993f1
rename commit0_bench to commit0 (#7124)
2025-03-06 02:55:39 +00:00
Xingyao Wang
9f720a9d69
[eval] SWE-Gym Integration (#6651)
...
Co-authored-by: Robert Brennan <accounts@rbren.io>
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Graham Neubig <neubig@gmail.com>
2025-03-05 20:15:02 +00:00
Xingyao Wang
bbf40c6576
docs: cleanup and update SWE-Bench documentation; and remove the support of non-instance-level image (#7118)
...
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2025-03-06 03:18:40 +08:00
Xingyao Wang
4be33a079b
Update SWE-Bench README.md about RemoteRuntime (#7108)
2025-03-05 23:00:54 +08:00
He Du
896d7b8b96
Openhands fix issue 7091 (#7092)
...
Co-authored-by: 杜贺 <duhe@duhedeMacBook-Pro-2.local>
2025-03-04 18:39:28 +01:00
Rohit Malhotra
5ffb1ef704
Fix typing (#7083)
...
Co-authored-by: openhands <openhands@all-hands.dev>
2025-03-03 20:41:11 +00:00
Engel Nyst
395c1ea9e3
[Refactor] split runtime initialization (create, connect, init) in cli scripts (#7036)
2025-03-03 00:19:25 +01:00
Engel Nyst
660d1d1e64
Fix argument in swe-bench grading scripts (#7046)
2025-03-02 12:37:15 +08:00
Magic Mai
8a58e724c6
fix: Remove nested git repositories before adding files in SWE-bench (#6536)
...
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
2025-02-28 01:19:33 +00:00
Xingyao Wang
33780f97d0
[eval] Upgrade SWE-Bench to use official image and latest harness (#6838)
...
Co-authored-by: Robert Brennan <accounts@rbren.io>
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Graham Neubig <neubig@gmail.com>
2025-02-27 08:15:05 -05:00
Engel Nyst
4f98bce6df
Add selected_repo to command line (#6949)
2025-02-26 20:42:59 +01:00
Mateusz Kwiatkowski
6562297615
Replace shebang with /usr/bin/env bash for improved portability (#6876)
...
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
2025-02-24 18:07:28 +00:00
Xingyao Wang
e52aee168e
Docs: Clarify config.toml usage in evaluation harness (#6828)
...
Co-authored-by: openhands <openhands@all-hands.dev>
2025-02-20 22:16:17 -08:00
Xingyao Wang
1a7003a705
Add sysbox support to remote runtime for eval; Add memory monitor, stress tests to help debug memory issue (#6684)
...
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Graham Neubig <neubig@gmail.com>
2025-02-18 20:02:28 +00:00
Boxuan Li
4443417c75
A few fixes for TAC evaluation harness (#6586)
2025-02-14 21:01:57 -08:00
Boxuan Li
ef12bc5381
Evaluation harness: Add agent config option (#6662)
2025-02-13 15:05:03 -05:00
Graham Neubig
e930cd0aef
Better error logging in posthog (#6346)
...
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Ray Myers <ray.myers@gmail.com>
2025-02-06 20:16:37 +00:00
Xingyao Wang
90bbd4edbe
fix: initialize default metadata with all required fields (#6583)
...
Co-authored-by: openhands <openhands@all-hands.dev>
2025-02-04 02:52:11 +08:00
tofarr
bbfdc62139
Fix for issue where retries continue on a closed runtime (#6564)
...
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu>
2025-02-03 08:44:09 -07:00
Boxuan Li
62402cd617
The-Agent-Company evaluation harness: Support splits (#6577)
2025-02-02 13:12:01 +08:00
Xingyao Wang
1a9971b1bf
misc: make RemoteRuntime API timeout configurable (#6518)
...
Co-authored-by: Robert Brennan <accounts@rbren.io>
2025-01-30 06:30:18 +08:00
Xingyao Wang
391200510c
fix: revert #5506 for SWE-Bench performance regression (#6491)
...
Co-authored-by: Robert Brennan <accounts@rbren.io>
2025-01-28 22:52:57 +08:00
Aditya Bharat Soni
aebb583779
Support for VisualWebArena evaluation in OpenHands (#4773)
...
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Graham Neubig <neubig@gmail.com>
2025-01-23 20:18:30 +00:00
Engel Nyst
b9a3f1c753
Fix eval on remote runtime (#6398)
2025-01-21 20:49:30 +00:00
Engel Nyst
5b7fcfbe1a
Disable prompt extensions in SWE-bench (#6391)
2025-01-21 17:18:30 +00:00
louria
7f57dbebda
Update MiniWoB README (#6385)
2025-01-21 16:26:47 +01:00
Xingyao Wang
2b04ee2e62
feat(eval): reliability improvement for SWE-Bench eval_infer (#6347)
2025-01-18 14:02:59 -05:00
Calvin Smith
a12087243a
Pydantic-based configuration and setting objects (#6321)
...
Co-authored-by: Calvin Smith <calvin@all-hands.dev>
Co-authored-by: Graham Neubig <neubig@gmail.com>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2025-01-17 12:33:22 -07:00
Xingyao Wang
899c1f8360
fix(bash): also show timeout reminder when no_change_timeout is triggered (#6318)
...
Co-authored-by: Robert Brennan <accounts@rbren.io>
2025-01-18 03:31:23 +08:00
Xingyao Wang
72af7bbba2
feat(eval): misc SWE-Bench improvement - use different resources for different instances (#6313)
...
Co-authored-by: openhands <openhands@all-hands.dev>
2025-01-17 02:48:41 +08:00
Xingyao Wang
0c961bfd8b
refactor(prompt): move runtime/repo info to user message and disable them in eval (#6291)
2025-01-16 17:53:10 +00:00
Xingyao Wang
0bed17758f
fix: incorrect soft-timeout implementation & fix hard-timeout follow-up command (#6280)
2025-01-17 01:27:00 +08:00
Engel Nyst
b9a70c8d5c
Delegation fixes (#6165)
2025-01-15 03:24:39 +00:00
Boxuan Li
92b8d55c2d
Rename trajectories_path config to save_trajectory_path (#6216)
...
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2025-01-14 04:32:45 +00:00
tofarr
23473070b9
Revert "Config objects as Pydantic BaseModels (#6176)" (#6214)
2025-01-13 07:36:25 -07:00
Calvin Smith
873dddb4e8
Config objects as Pydantic BaseModels (#6176)
...
Co-authored-by: Calvin Smith <calvin@all-hands.dev>
Co-authored-by: Graham Neubig <neubig@gmail.com>
2025-01-12 15:09:45 -05:00