Robert Brennan
b5e00f577c
Replace All-Hands-AI references with OpenHands (#11287)
...
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <engel.nyst@gmail.com>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2025-10-26 01:52:45 +02:00
Tim O'Farrell
4b303ec9b4
Fixes to unblock frontend (#11488)
...
Co-authored-by: Ray Myers <ray.myers@gmail.com>
2025-10-23 14:43:45 -06:00
Xingyao Wang
b082ccc0fb
feat(llm): add support for deepseek and gpt-5-mini, util for token count (#10626)
...
Co-authored-by: openhands <openhands@all-hands.dev>
2025-08-27 11:03:35 +08:00
Xingyao Wang
c2f46200c0
chore(lint): Apply comprehensive linting and formatting fixes (#10287)
...
Co-authored-by: openhands <openhands@all-hands.dev>
2025-08-13 21:13:19 +02:00
Ibragim Badertdinov
19a6b6b618
feat(eval): Support evaluation on SWE-rebench (#10251)
2025-08-12 14:05:43 +00:00
Xingyao Wang
c4f303a07b
chore(eval): Remove eval_infer_remote.sh script and related references (#10157)
...
Co-authored-by: openhands <openhands@all-hands.dev>
2025-08-07 20:46:59 +00:00
xhguo7
9388fef0ef
feat(eval): loc acc evaluation (#8515)
...
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
Co-authored-by: mamoodi <mamoodiha@gmail.com>
2025-07-11 03:22:35 +08:00
Linghao Zhang
a93b0457c6
feat(eval): Support evaluation on SWE-bench-Live (#9137)
2025-06-15 12:30:47 +00:00
Xuhui Zhou
14498c5e25
Feature/swe run interact (#8714)
...
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
2025-05-27 19:35:21 +00:00
Zhaoling Chen
efe287ce34
integrate LocAgent into OpenHands (#7371)
...
Co-authored-by: czlll <gangda@huaihe.usc.edu>
Co-authored-by: Hoang Tran <descience.thh10@gmail.com>
2025-05-23 22:42:58 +07:00
Ryan H. Tran
3980ba53c9
Add option to run patch evaluation on Modal (#8607)
...
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2025-05-23 00:45:45 +07:00
Engel Nyst
637cb0726a
specify condenser config for evals (#8177)
...
Co-authored-by: openhands <openhands@all-hands.dev>
2025-05-21 22:08:57 +02:00
Graham Neubig
f317c03b1b
Fix inconsistent max_iterations in SWE-bench evaluation (#8467)
...
Co-authored-by: openhands <openhands@all-hands.dev>
2025-05-13 02:07:57 +00:00
Graham Neubig
689d3c9046
Update pre-commit hook versions to most recent versions (#8343)
...
Co-authored-by: openhands <openhands@all-hands.dev>
2025-05-08 03:59:13 +00:00
Michael Panchenko
14564b25d6
Fix linting (#7965)
2025-04-21 06:34:40 +08:00
Engel Nyst
9b9b1291fc
[chore] Just linting on swe-bench files (#7918)
2025-04-18 22:12:01 +08:00
Niels Mündler
4b124d5906
Add inference for SWT-Bench (#7201)
...
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
Co-authored-by: Graham Neubig <neubig@gmail.com>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Calvin Smith <email@cjsmith.io>
2025-04-17 14:49:42 -06:00
Xingyao Wang
ddda30d9b7
fix(eval): iterative evaluation improvements; SWE-Bench multimodal fixes (#7739)
...
Co-authored-by: Juan Michelini <juan@juan.com.uy>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: openhands <openhands@all-hands.dev>
2025-04-09 02:44:03 +08:00
Xingyao Wang
9b9e728cf6
Iterative evaluation with rule-based critic (#7293)
2025-03-17 18:37:35 +00:00
Xingyao Wang
a4d632498c
SWE-Gym rollout stability fix & using a validated SWE-Gym set (#7182)
...
Co-authored-by: Robert Brennan <accounts@rbren.io>
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Graham Neubig <neubig@gmail.com>
2025-03-17 21:15:01 +08:00
Xingyao Wang
9f720a9d69
[eval] SWE-Gym Integration (#6651)
...
Co-authored-by: Robert Brennan <accounts@rbren.io>
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Graham Neubig <neubig@gmail.com>
2025-03-05 20:15:02 +00:00
Xingyao Wang
bbf40c6576
docs: cleanup and update SWE-Bench documentation; and remove the support of non-instance-level image (#7118)
...
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2025-03-06 03:18:40 +08:00
Xingyao Wang
33780f97d0
[eval] Upgrade SWE-Bench to use official image and latest harness (#6838)
...
Co-authored-by: Robert Brennan <accounts@rbren.io>
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Graham Neubig <neubig@gmail.com>
2025-02-27 08:15:05 -05:00
Mateusz Kwiatkowski
6562297615
Replace shebang with /usr/bin/env bash for improved portability (#6876)
...
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
2025-02-24 18:07:28 +00:00
Xingyao Wang
391200510c
fix: revert #5506 for SWE-Bench performance regression (#6491)
...
Co-authored-by: Robert Brennan <accounts@rbren.io>
2025-01-28 22:52:57 +08:00
Xingyao Wang
72af7bbba2
feat(eval): misc SWE-Bench improvement - use different resources for different instances (#6313)
...
Co-authored-by: openhands <openhands@all-hands.dev>
2025-01-17 02:48:41 +08:00
Xingyao Wang
ec70af9412
refactor: Replace pexpect with libtmux in BashSession (#4881)
...
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Robert Brennan <accounts@rbren.io>
2025-01-04 05:22:13 +08:00
Xingyao Wang
61ebec9ff7
feat(eval): better visualization for comparing two swe-bench runs (#5993)
2025-01-03 02:36:51 +00:00
OpenHands
8975fcd714
Fix issue #5748: Rename "Ran a Jupyter Command" to "Ran a Python Command" in UI (#5749)
...
Co-authored-by: Graham Neubig <neubig@gmail.com>
2024-12-26 23:30:19 +08:00
OpenHands
bfb191b5c7
Fix issue #5739: [Bug]: Move ./evaluation/swe_bench/scripts/cleanup_remote_runtime.sh to general eval utils (#5740)
2024-12-25 17:17:06 -05:00
Xingyao Wang
c333938384
feat(eval): add standard error to swebench summarize outputs (#5700)
...
Co-authored-by: openhands <openhands@all-hands.dev>
2024-12-20 08:39:43 +08:00
Xingyao Wang
9cdb8d06c0
fix(eval): Use cp -r instead of mv for SWE-Bench Initialization (#5659)
2024-12-17 21:21:27 +00:00
Ryan H. Tran
8ae2fb636e
Remove symlink use for swebench setup (#5549)
2024-12-13 22:18:14 +08:00
Engel Nyst
b11e905988
Verify costs script (#5469)
2024-12-10 14:20:53 +01:00
Engel Nyst
455e667739
add cost to summary (#5473)
2024-12-10 03:14:03 +08:00
Xingyao Wang
9908e1b285
[Evaluation]: Log openhands version in eval output folder, instead of agent version (#5394)
2024-12-04 03:33:43 +00:00
Xingyao Wang
990f277132
misc: Support folder-level exp analysis for SWE-Bench summarize_outputs.py; Handle CrashLoopBackoff for RemoteRuntime (#5385)
2024-12-03 15:37:21 +00:00
OpenHands
678436da30
Fix issue #5222: [Refactor]: Refactor the evaluation directory (#5223)
...
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2024-11-25 08:35:52 -05:00