Ryan H. Tran
3980ba53c9
Add option to run patch evaluation on Modal (#8607)
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2025-05-23 00:45:45 +07:00
Engel Nyst
637cb0726a
specify condenser config for evals (#8177)
Co-authored-by: openhands <openhands@all-hands.dev>
2025-05-21 22:08:57 +02:00
luolin101
1a3cb16ba6
add Visual SWE-bench benchmark (#7131)
Co-authored-by: tsukimi <yuailun@pku.edu.cn>
Co-authored-by: Ryan H. Tran <descience.thh10@gmail.com>
2025-05-19 12:08:46 +07:00
Xingyao Wang
2ecc39ffcc
[eval]: disable MCP for SWE-Bench evaluation (#8574)
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Engel Nyst <engel.nyst@gmail.com>
2025-05-19 01:32:46 +00:00
Yueqi Song
3ca585b79f
Update run_infer.py to incorporate selection of task based on repo (#8509)
2025-05-15 12:27:28 +08:00
omahs
4bb6ec2ee5
Fix typos (#8469)
2025-05-13 09:34:21 +00:00
Graham Neubig
f317c03b1b
Fix inconsistent max_iterations in SWE-bench evaluation (#8467)
Co-authored-by: openhands <openhands@all-hands.dev>
2025-05-13 02:07:57 +00:00
Graham Neubig
689d3c9046
Update pre-commit hook versions to most recent versions (#8343)
Co-authored-by: openhands <openhands@all-hands.dev>
2025-05-08 03:59:13 +00:00
Engel Nyst
985e20d529
[chore] Run full agent pre-commit (#8235)
2025-05-03 11:24:03 -04:00
Qi Liu
3d22520992
[Feat] add multi-swe-bench (#8174)
Co-authored-by: ByteDance User <tiger@bytedance.localdomain>
2025-05-01 00:23:19 +00:00
Michael Panchenko
14564b25d6
Fix linting (#7965)
2025-04-21 06:34:40 +08:00
Engel Nyst
a2c55cfdef
Refactor to clean up and move utility/legacy out of the agent (#7917)
2025-04-19 01:53:33 +08:00
Xingyao Wang
7c23993344
fix(eval): typo in SWE_Bench evaluation (#7930)
2025-04-19 00:31:08 +08:00
Engel Nyst
9b9b1291fc
[chore] Just linting on swe-bench files (#7918)
2025-04-18 22:12:01 +08:00
Niels Mündler
4b124d5906
Add inference for SWT-Bench (#7201)
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
Co-authored-by: Graham Neubig <neubig@gmail.com>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Calvin Smith <email@cjsmith.io>
2025-04-17 14:49:42 -06:00
juanmichelini
6bcebd4b9d
Jetbrains CI Benchmark (#7811)
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
2025-04-17 15:10:20 +00:00
Engel Nyst
5e5bf23f9c
[Evaluation] Fix KeyError when the instance failed prematurely (#7864)
2025-04-15 15:19:31 +00:00
Engel Nyst
d05a6f30e1
[Refactor] Rename codeact_* agent options to simple name (#7853)
2025-04-15 00:14:13 +02:00
sp.wack
72b5e18898
fix(backend): Return 400 if trying to open a binary file (#7825)
2025-04-11 22:47:57 +00:00
Engel Nyst
bb98d94b35
[evaluation] fix missing metadata (#7819)
2025-04-11 16:58:59 +00:00
juanmichelini
53c0c5a07b
SWE-bench_verified instruction baseline improvements to 60% (#7546)
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
2025-04-10 16:08:27 +00:00
Xingyao Wang
0087082643
Improve binary file handling and patch generation in SWE-bench evaluation (#7762)
Co-authored-by: openhands <openhands@all-hands.dev>
2025-04-08 22:57:33 +00:00
Xingyao Wang
ddda30d9b7
fix(eval): iterative evaluation improvements; SWE-Bench multimodal fixes (#7739)
Co-authored-by: Juan Michelini <juan@juan.com.uy>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: openhands <openhands@all-hands.dev>
2025-04-09 02:44:03 +08:00
Engel Nyst
22cf5144cc
Fix integration test (#7747)
2025-04-07 22:31:50 -04:00
Boxuan Li
d7c49a0656
[Evaluation] Fix sandbox config in TAC (#7684)
2025-04-03 08:19:10 +00:00
Boxuan Li
34bf6a6402
[Evaluation] Fix run_infer.py path in TAC (#7683)
2025-04-03 04:34:02 +00:00
Shixian Sheng
4fb073d1ea
Fixed a few hyperlinks. Translated some texts (#7652)
2025-04-02 22:10:19 +00:00
Rohit Malhotra
9adfcede31
(Hotfix): Track reason for Error AgentState (#7584)
Co-authored-by: openhands <openhands@all-hands.dev>
2025-03-31 21:24:42 +00:00
Xingyao Wang
648c8ffb21
(llm): Support OpenHands LM (#7598)
Co-authored-by: mamoodi <mamoodiha@gmail.com>
2025-03-31 17:29:31 +00:00
Xingyao Wang
54236f9617
[eval] Support SWE-Bench Multimodal (#7122)
Co-authored-by: openhands <openhands@all-hands.dev>
2025-03-31 07:42:44 -04:00
tofarr
1230b229b5
Replace use of requests with httpx (#7354)
2025-03-26 13:37:10 +00:00
Zach
0b3d15a4d7
Fix missing 'fi' statement in GAIA benchmark scripts/run_infer.sh (#7465)
2025-03-24 16:04:25 +00:00
Xingyao Wang
01e0e29a9f
Reduce bash SOFT timeout from 30 to 10 seconds (#7423)
Co-authored-by: openhands <openhands@all-hands.dev>
2025-03-22 22:42:24 +00:00
Engel Nyst
83458f5146
Fix style issues with pre-commit (#7318)
Co-authored-by: openhands <openhands@all-hands.dev>
2025-03-18 01:34:27 +00:00
kjain14
507afd7f06
Add TestGenEval benchmark (#5534)
Co-authored-by: Kush Dave Jain <kdjain@pit.isri.cmu.edu>
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Graham Neubig <neubig@gmail.com>
2025-03-17 20:16:45 +00:00
Xingyao Wang
9b9e728cf6
Iterative evaluation with rule-based critic (#7293)
2025-03-17 18:37:35 +00:00
Xingyao Wang
a4d632498c
SWE-Gym rollout stability fix & using a validated SWE-Gym set (#7182)
Co-authored-by: Robert Brennan <accounts@rbren.io>
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Graham Neubig <neubig@gmail.com>
2025-03-17 21:15:01 +08:00
Engel Nyst
dd09d46ccb
Remove DelegatorAgent (fix #7280)
Co-authored-by: openhands <openhands@all-hands.dev>
2025-03-16 16:49:28 +01:00
Calvin Smith
303b7ab180
(fix): Conditional imports resolved in SWE-bench eval script while multiprocessing enabled (#7244)
Co-authored-by: Calvin Smith <calvin@all-hands.dev>
2025-03-13 13:29:11 -06:00
Elena Chistova
38e866cde4
Fix official SWE-Bench docker image prefix (#7214)
2025-03-12 18:23:19 +00:00
juanmichelini
b36deca265
Added link to paper in commit0 README (#7221)
2025-03-12 17:17:22 +00:00
Xingyao Wang
a4908f9a75
[agent] system message + SWE-Bench instruction improvements (#7018)
2025-03-08 00:27:02 +08:00
Nan Jiang
ec087993f1
rename commit0_bench to commit0 (#7124)
2025-03-06 02:55:39 +00:00
Xingyao Wang
9f720a9d69
[eval] SWE-Gym Integration (#6651)
Co-authored-by: Robert Brennan <accounts@rbren.io>
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Graham Neubig <neubig@gmail.com>
2025-03-05 20:15:02 +00:00
Xingyao Wang
bbf40c6576
docs: cleanup and update SWE-Bench documentation; and remove the support of non-instance-level image (#7118)
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2025-03-06 03:18:40 +08:00
Xingyao Wang
4be33a079b
Update SWE-Bench README.md about RemoteRuntime (#7108)
2025-03-05 23:00:54 +08:00
He Du
896d7b8b96
Openhands fix issue 7091 (#7092)
Co-authored-by: 杜贺 <duhe@duhedeMacBook-Pro-2.local>
2025-03-04 18:39:28 +01:00
Rohit Malhotra
5ffb1ef704
Fix typing (#7083)
Co-authored-by: openhands <openhands@all-hands.dev>
2025-03-03 20:41:11 +00:00
Engel Nyst
395c1ea9e3
[Refactor] split runtime initialization (create, connect, init) in cli scripts (#7036)
2025-03-03 00:19:25 +01:00
Engel Nyst
660d1d1e64
Fix argument in swe-bench grading scripts (#7046)
2025-03-02 12:37:15 +08:00