Kevin Musgrave
74ba21bad0
feat(evaluation): Added INSTRUCTION_TEMPLATE_NAME to run_infer.py in swe_bench (#10270)
...
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
Co-authored-by: mamoodi <mamoodiha@gmail.com>
2025-08-18 14:18:08 +00:00
Ibragim Badertdinov
19a6b6b618
feat(eval): Support evaluation on SWE-rebench (#10251)
2025-08-12 14:05:43 +00:00
Xingyao Wang
c4f303a07b
chore(eval): Remove eval_infer_remote.sh script and related references (#10157)
...
Co-authored-by: openhands <openhands@all-hands.dev>
2025-08-07 20:46:59 +00:00
juanmichelini
ea50fe4e3c
Fix: Continue evaluation when an instance fails after max retries (#8868)
...
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Xingyao Wang <xingyaoww@gmail.com>
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
2025-07-16 22:42:44 +00:00
Linghao Zhang
a93b0457c6
feat(eval): Support evaluation on SWE-bench-Live (#9137)
2025-06-15 12:30:47 +00:00
Xuhui Zhou
14498c5e25
Feature/swe run interact (#8714)
...
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
2025-05-27 19:35:21 +00:00
Ryan H. Tran
3980ba53c9
Add option to run patch evaluation on Modal (#8607)
...
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2025-05-23 00:45:45 +07:00
Engel Nyst
637cb0726a
specify condenser config for evals (#8177)
...
Co-authored-by: openhands <openhands@all-hands.dev>
2025-05-21 22:08:57 +02:00
Graham Neubig
f317c03b1b
Fix inconsistent max_iterations in SWE-bench evaluation (#8467)
...
Co-authored-by: openhands <openhands@all-hands.dev>
2025-05-13 02:07:57 +00:00
Engel Nyst
9b9b1291fc
[chore] Just linting on swe-bench files (#7918)
2025-04-18 22:12:01 +08:00
Niels Mündler
4b124d5906
Add inference for SWT-Bench (#7201)
...
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
Co-authored-by: Graham Neubig <neubig@gmail.com>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Calvin Smith <email@cjsmith.io>
2025-04-17 14:49:42 -06:00
Shixian Sheng
4fb073d1ea
Fixed a few hyperlinks. Translated some texts (#7652)
2025-04-02 22:10:19 +00:00
Xingyao Wang
54236f9617
[eval] Support SWE-Bench Multimodal (#7122)
...
Co-authored-by: openhands <openhands@all-hands.dev>
2025-03-31 07:42:44 -04:00
Xingyao Wang
9b9e728cf6
Iterative evaluation with rule-based critic (#7293)
2025-03-17 18:37:35 +00:00
Xingyao Wang
9f720a9d69
[eval] SWE-Gym Integration (#6651)
...
Co-authored-by: Robert Brennan <accounts@rbren.io>
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Graham Neubig <neubig@gmail.com>
2025-03-05 20:15:02 +00:00
Xingyao Wang
bbf40c6576
docs: cleanup and update SWE-Bench documentation; and remove the support of non-instance-level image (#7118)
...
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2025-03-06 03:18:40 +08:00
Xingyao Wang
4be33a079b
Update SWE-Bench README.md about RemoteRuntime (#7108)
2025-03-05 23:00:54 +08:00
Dmitry Kozlov
17d722f3b3
Update README.md (#6076)
...
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
2025-01-06 17:31:19 +00:00
OpenHands
bfb191b5c7
Fix issue #5739: [Bug]: Move ./evaluation/swe_bench/scripts/cleanup_remote_runtime.sh to general eval utils (#5740)
2024-12-25 17:17:06 -05:00
Cheng Yang
8f47547b08
docs: fix markdown linting and broken links (#5401)
2024-12-05 01:28:04 +08:00
OpenHands
678436da30
Fix issue #5222: [Refactor]: Refactor the evaluation directory (#5223)
...
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2024-11-25 08:35:52 -05:00