Author | Commit | Message | Date
Xingyao Wang | 2ecc39ffcc | [eval]: disable MCP for SWE-Bench evaluation (#8574) | 2025-05-19 01:32:46 +00:00
    Co-authored-by: openhands <openhands@all-hands.dev>
    Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
    Co-authored-by: Engel Nyst <engel.nyst@gmail.com>
Yueqi Song | 3ca585b79f | Update run_infer.py to incorporate selection of task based on repo (#8509) | 2025-05-15 12:27:28 +08:00
omahs | 4bb6ec2ee5 | Fix typos (#8469) | 2025-05-13 09:34:21 +00:00
Graham Neubig | f317c03b1b | Fix inconsistent max_iterations in SWE-bench evaluation (#8467) | 2025-05-13 02:07:57 +00:00
    Co-authored-by: openhands <openhands@all-hands.dev>
Graham Neubig | 689d3c9046 | Update pre-commit hook versions to most recent versions (#8343) | 2025-05-08 03:59:13 +00:00
    Co-authored-by: openhands <openhands@all-hands.dev>
Engel Nyst | 985e20d529 | [chore] Run full agent pre-commit (#8235) | 2025-05-03 11:24:03 -04:00
Qi Liu | 3d22520992 | [Feat] add multi-swe-bench (#8174) | 2025-05-01 00:23:19 +00:00
    Co-authored-by: ByteDance User <tiger@bytedance.localdomain>
Michael Panchenko | 14564b25d6 | Fix linting (#7965) | 2025-04-21 06:34:40 +08:00
Engel Nyst | a2c55cfdef | Refactor to clean up and move utility/legacy out of the agent (#7917) | 2025-04-19 01:53:33 +08:00
Xingyao Wang | 7c23993344 | fix(eval): typo in SWE_Bench evaluation (#7930) | 2025-04-19 00:31:08 +08:00
Engel Nyst | 9b9b1291fc | [chore] Just linting on swe-bench files (#7918) | 2025-04-18 22:12:01 +08:00
Niels Mündler | 4b124d5906 | Add inference for SWT-Bench (#7201) | 2025-04-17 14:49:42 -06:00
    Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
    Co-authored-by: Graham Neubig <neubig@gmail.com>
    Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
    Co-authored-by: Calvin Smith <email@cjsmith.io>
juanmichelini | 6bcebd4b9d | Jetbrains CI Benchmark (#7811) | 2025-04-17 15:10:20 +00:00
    Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
Engel Nyst | 5e5bf23f9c | [Evaluation] Fix KeyError when the instance failed prematurely (#7864) | 2025-04-15 15:19:31 +00:00
Engel Nyst | d05a6f30e1 | [Refactor] Rename codeact_* agent options to simple name (#7853) | 2025-04-15 00:14:13 +02:00
sp.wack | 72b5e18898 | fix(backend): Return 400 if trying to open a binary file (#7825) | 2025-04-11 22:47:57 +00:00
Engel Nyst | bb98d94b35 | [evaluation] fix missing metadata (#7819) | 2025-04-11 16:58:59 +00:00
juanmichelini | 53c0c5a07b | SWE-bench_verified instruction baseline improvements to 60% (#7546) | 2025-04-10 16:08:27 +00:00
    Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
Xingyao Wang | 0087082643 | Improve binary file handling and patch generation in SWE-bench evaluation (#7762) | 2025-04-08 22:57:33 +00:00
    Co-authored-by: openhands <openhands@all-hands.dev>
Xingyao Wang | ddda30d9b7 | fix(eval): iterative evaluation improvements; SWE-Bench multimodal fixes (#7739) | 2025-04-09 02:44:03 +08:00
    Co-authored-by: Juan Michelini <juan@juan.com.uy>
    Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
    Co-authored-by: openhands <openhands@all-hands.dev>
Boxuan Li | d7c49a0656 | [Evaluation] Fix sandbox config in TAC (#7684) | 2025-04-03 08:19:10 +00:00
Boxuan Li | 34bf6a6402 | [Evaluation] Fix run_infer.py path in TAC (#7683) | 2025-04-03 04:34:02 +00:00
Shixian Sheng | 4fb073d1ea | Fixed a few hyperlinks. Translated some texts (#7652) | 2025-04-02 22:10:19 +00:00
Xingyao Wang | 648c8ffb21 | (llm): Support OpenHands LM (#7598) | 2025-03-31 17:29:31 +00:00
    Co-authored-by: mamoodi <mamoodiha@gmail.com>
Xingyao Wang | 54236f9617 | [eval] Support SWE-Bench Multimodal (#7122) | 2025-03-31 07:42:44 -04:00
    Co-authored-by: openhands <openhands@all-hands.dev>
tofarr | 1230b229b5 | Replace use of requests with httpx (#7354) | 2025-03-26 13:37:10 +00:00
Zach | 0b3d15a4d7 | Fix missing 'fi' statement in GAIA benchmark scripts/run_infer.sh (#7465) | 2025-03-24 16:04:25 +00:00
Engel Nyst | 83458f5146 | Fix style issues with pre-commit (#7318) | 2025-03-18 01:34:27 +00:00
    Co-authored-by: openhands <openhands@all-hands.dev>
kjain14 | 507afd7f06 | Add TestGenEval benchmark (#5534) | 2025-03-17 20:16:45 +00:00
    Co-authored-by: Kush Dave Jain <kdjain@pit.isri.cmu.edu>
    Co-authored-by: openhands <openhands@all-hands.dev>
    Co-authored-by: Graham Neubig <neubig@gmail.com>
Xingyao Wang | 9b9e728cf6 | Iterative evaluation with rule-based critic (#7293) | 2025-03-17 18:37:35 +00:00
Xingyao Wang | a4d632498c | SWE-Gym rollout stability fix & using a validated SWE-Gym set (#7182) | 2025-03-17 21:15:01 +08:00
    Co-authored-by: Robert Brennan <accounts@rbren.io>
    Co-authored-by: openhands <openhands@all-hands.dev>
    Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
    Co-authored-by: Graham Neubig <neubig@gmail.com>
Calvin Smith | 303b7ab180 | (fix): Conditional imports resolved in SWE-bench eval script while multiprocessing enabled (#7244) | 2025-03-13 13:29:11 -06:00
    Co-authored-by: Calvin Smith <calvin@all-hands.dev>
Elena Chistova | 38e866cde4 | Fix official SWE-Bench docker image prefix (#7214) | 2025-03-12 18:23:19 +00:00
juanmichelini | b36deca265 | Added link to paper in commit0 README (#7221) | 2025-03-12 17:17:22 +00:00
Xingyao Wang | a4908f9a75 | [agent] system message + SWE-Bench instruction improvements (#7018) | 2025-03-08 00:27:02 +08:00
Nan Jiang | ec087993f1 | rename commit0_bench to commit0 (#7124) | 2025-03-06 02:55:39 +00:00
Xingyao Wang | 9f720a9d69 | [eval] SWE-Gym Integration (#6651) | 2025-03-05 20:15:02 +00:00
    Co-authored-by: Robert Brennan <accounts@rbren.io>
    Co-authored-by: openhands <openhands@all-hands.dev>
    Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
    Co-authored-by: Graham Neubig <neubig@gmail.com>
Xingyao Wang | bbf40c6576 | docs: cleanup and update SWE-Bench documentation; and remove the support of non-instance-level image (#7118) | 2025-03-06 03:18:40 +08:00
    Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Xingyao Wang | 4be33a079b | Update SWE-Bench README.md about RemoteRuntime (#7108) | 2025-03-05 23:00:54 +08:00
He Du | 896d7b8b96 | Openhands fix issue 7091 (#7092) | 2025-03-04 18:39:28 +01:00
    Co-authored-by: 杜贺 <duhe@duhedeMacBook-Pro-2.local>
Rohit Malhotra | 5ffb1ef704 | Fix typing (#7083) | 2025-03-03 20:41:11 +00:00
    Co-authored-by: openhands <openhands@all-hands.dev>
Engel Nyst | 395c1ea9e3 | [Refactor] split runtime initialization (create, connect, init) in cli scripts (#7036) | 2025-03-03 00:19:25 +01:00
Engel Nyst | 660d1d1e64 | Fix argument in swe-bench grading scripts (#7046) | 2025-03-02 12:37:15 +08:00
Magic Mai | 8a58e724c6 | fix: Remove nested git repositories before adding files in SWE-bench (#6536) | 2025-02-28 01:19:33 +00:00
    Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
Xingyao Wang | 33780f97d0 | [eval] Upgrade SWE-Bench to use official image and latest harness (#6838) | 2025-02-27 08:15:05 -05:00
    Co-authored-by: Robert Brennan <accounts@rbren.io>
    Co-authored-by: openhands <openhands@all-hands.dev>
    Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
    Co-authored-by: Graham Neubig <neubig@gmail.com>
Engel Nyst | 4f98bce6df | Add selected_repo to command line (#6949) | 2025-02-26 20:42:59 +01:00
Mateusz Kwiatkowski | 6562297615 | Replace shebang with /usr/bin/env bash for improved portability (#6876) | 2025-02-24 18:07:28 +00:00
    Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
Xingyao Wang | 1a7003a705 | Add sysbox support to remote runtime for eval; Add memory monitor, stress tests to help debug memory issue (#6684) | 2025-02-18 20:02:28 +00:00
    Co-authored-by: openhands <openhands@all-hands.dev>
    Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
    Co-authored-by: Graham Neubig <neubig@gmail.com>
Boxuan Li | 4443417c75 | A few fixes for TAC evaluation harness (#6586) | 2025-02-14 21:01:57 -08:00
Boxuan Li | ef12bc5381 | Evaluation harness: Add agent config option (#6662) | 2025-02-13 15:05:03 -05:00