9 Commits

Author SHA1 Message Date
Engel Nyst
a2c55cfdef
Refactor to clean up and move utility/legacy out of the agent (#7917) 2025-04-19 01:53:33 +08:00
Xingyao Wang
7c23993344
fix(eval): typo in SWE_Bench evaluation (#7930) 2025-04-19 00:31:08 +08:00
Engel Nyst
9b9b1291fc
[chore] Just linting on swe-bench files (#7918) 2025-04-18 22:12:01 +08:00
Niels Mündler
4b124d5906
Add inference for SWT-Bench (#7201)
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
Co-authored-by: Graham Neubig <neubig@gmail.com>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Calvin Smith <email@cjsmith.io>
2025-04-17 14:49:42 -06:00
Xingyao Wang
ddda30d9b7
fix(eval): iterative evaluation improvements; SWE-Bench multimodal fixes (#7739)
Co-authored-by: Juan Michelini <juan@juan.com.uy>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: openhands <openhands@all-hands.dev>
2025-04-09 02:44:03 +08:00
Xingyao Wang
a4d632498c
SWE-Gym rollout stability fix & using a validated SWE-Gym set (#7182)
Co-authored-by: Robert Brennan <accounts@rbren.io>
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Graham Neubig <neubig@gmail.com>
2025-03-17 21:15:01 +08:00
Xingyao Wang
1a7003a705
Add sysbox support to remote runtime for eval; Add memory monitor, stress tests to help debug memory issue (#6684)
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Graham Neubig <neubig@gmail.com>
2025-02-18 20:02:28 +00:00
Graham Neubig
e930cd0aef
Better error logging in posthog (#6346)
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Ray Myers <ray.myers@gmail.com>
2025-02-06 20:16:37 +00:00
Xingyao Wang
72af7bbba2
feat(eval): misc SWE-Bench improvement - use different resources for different instances (#6313)
Co-authored-by: openhands <openhands@all-hands.dev>
2025-01-17 02:48:41 +08:00