Mateusz Kwiatkowski | 6562297615 | Replace shebang with /usr/bin/env bash for improved portability (#6876); Co-authored-by: Xingyao Wang <xingyao@all-hands.dev> | 2025-02-24 18:07:28 +00:00
Xingyao Wang | 391200510c | fix: revert #5506 for SWE-Bench performance regression (#6491); Co-authored-by: Robert Brennan <accounts@rbren.io> | 2025-01-28 22:52:57 +08:00
Xingyao Wang | 72af7bbba2 | feat(eval): misc SWE-Bench improvement - use different resources for different instances (#6313); Co-authored-by: openhands <openhands@all-hands.dev> | 2025-01-17 02:48:41 +08:00
Xingyao Wang | ec70af9412 | refactor: Replace pexpect with libtmux in BashSession (#4881); Co-authored-by: openhands <openhands@all-hands.dev>; Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>; Co-authored-by: Robert Brennan <accounts@rbren.io> | 2025-01-04 05:22:13 +08:00
Xingyao Wang | 61ebec9ff7 | feat(eval): better visualization for comparing two swe-bench runs (#5993) | 2025-01-03 02:36:51 +00:00
OpenHands | 8975fcd714 | Fix issue #5748: Rename "Ran a Jupyter Command" to "Ran a Python Command" in UI (#5749); Co-authored-by: Graham Neubig <neubig@gmail.com> | 2024-12-26 23:30:19 +08:00
OpenHands | bfb191b5c7 | Fix issue #5739: [Bug]: Move ./evaluation/swe_bench/scripts/cleanup_remote_runtime.sh to general eval utils (#5740) | 2024-12-25 17:17:06 -05:00
Xingyao Wang | c333938384 | feat(eval): add standard error to swebench summarize outputs (#5700); Co-authored-by: openhands <openhands@all-hands.dev> | 2024-12-20 08:39:43 +08:00
Xingyao Wang | 9cdb8d06c0 | fix(eval): Use cp -r instead of mv for SWE-Bench Initialization (#5659) | 2024-12-17 21:21:27 +00:00
Ryan H. Tran | 8ae2fb636e | Remove symlink use for swebench setup (#5549) | 2024-12-13 22:18:14 +08:00
Engel Nyst | b11e905988 | Verify costs script (#5469) | 2024-12-10 14:20:53 +01:00
Engel Nyst | 455e667739 | add cost to summary (#5473) | 2024-12-10 03:14:03 +08:00
Xingyao Wang | 9908e1b285 | [Evaluation]: Log openhands version in eval output folder, instead of agent version (#5394) | 2024-12-04 03:33:43 +00:00
Xingyao Wang | 990f277132 | misc: Support folder-level exp analysis for SWE-Bench summarize_outputs.py; Handle CrashLoopBackoff for RemoteRuntime (#5385) | 2024-12-03 15:37:21 +00:00
OpenHands | 678436da30 | Fix issue #5222: [Refactor]: Refactor the evaluation directory (#5223); Co-authored-by: Engel Nyst <enyst@users.noreply.github.com> | 2024-11-25 08:35:52 -05:00