Mateusz Kwiatkowski
6562297615
Replace shebang with /usr/bin/env bash for improved portability ( #6876 )
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
2025-02-24 18:07:28 +00:00
Xingyao Wang
1a7003a705
Add sysbox support to remote runtime for eval; Add memory monitor, stress tests to help debug memory issue ( #6684 )
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Graham Neubig <neubig@gmail.com>
2025-02-18 20:02:28 +00:00
Boxuan Li
4443417c75
A few fixes for TAC evaluation harness ( #6586 )
2025-02-14 21:01:57 -08:00
Boxuan Li
ef12bc5381
Evaluation harness: Add agent config option ( #6662 )
2025-02-13 15:05:03 -05:00
Graham Neubig
e930cd0aef
Better error logging in posthog ( #6346 )
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Ray Myers <ray.myers@gmail.com>
2025-02-06 20:16:37 +00:00
Xingyao Wang
90bbd4edbe
fix: initialize default metadata with all required fields ( #6583 )
Co-authored-by: openhands <openhands@all-hands.dev>
2025-02-04 02:52:11 +08:00
tofarr
bbfdc62139
Fix for issue where retries continue on a closed runtime ( #6564 )
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu>
2025-02-03 08:44:09 -07:00
Boxuan Li
62402cd617
The-Agent-Company evaluation harness: Support splits ( #6577 )
2025-02-02 13:12:01 +08:00
Xingyao Wang
1a9971b1bf
misc: make RemoteRuntime API timeout configurable ( #6518 )
Co-authored-by: Robert Brennan <accounts@rbren.io>
2025-01-30 06:30:18 +08:00
Xingyao Wang
391200510c
fix: revert #5506 for SWE-Bench performance regression ( #6491 )
Co-authored-by: Robert Brennan <accounts@rbren.io>
2025-01-28 22:52:57 +08:00
Aditya Bharat Soni
aebb583779
Support for VisualWebArena evaluation in OpenHands ( #4773 )
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Graham Neubig <neubig@gmail.com>
2025-01-23 20:18:30 +00:00
Engel Nyst
b9a3f1c753
Fix eval on remote runtime ( #6398 )
2025-01-21 20:49:30 +00:00
Engel Nyst
5b7fcfbe1a
Disable prompt extensions in SWE-bench ( #6391 )
2025-01-21 17:18:30 +00:00
louria
7f57dbebda
Update MiniWoB README ( #6385 )
2025-01-21 16:26:47 +01:00
Calvin Smith
a12087243a
Pydantic-based configuration and setting objects ( #6321 )
Co-authored-by: Calvin Smith <calvin@all-hands.dev>
Co-authored-by: Graham Neubig <neubig@gmail.com>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2025-01-17 12:33:22 -07:00
Xingyao Wang
899c1f8360
fix(bash): also show timeout reminder when no_change_timeout is triggered ( #6318 )
Co-authored-by: Robert Brennan <accounts@rbren.io>
2025-01-18 03:31:23 +08:00
Xingyao Wang
72af7bbba2
feat(eval): misc SWE-Bench improvement - use different resources for different instances ( #6313 )
Co-authored-by: openhands <openhands@all-hands.dev>
2025-01-17 02:48:41 +08:00
Xingyao Wang
0c961bfd8b
refactor(prompt): move runtime/repo info to user message and disable them in eval ( #6291 )
2025-01-16 17:53:10 +00:00
Xingyao Wang
0bed17758f
fix: incorrect soft-timeout implementation & fix hard-timeout follow-up command ( #6280 )
2025-01-17 01:27:00 +08:00
Boxuan Li
92b8d55c2d
Rename trajectories_path config to save_trajectory_path ( #6216 )
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2025-01-14 04:32:45 +00:00
tofarr
23473070b9
Revert "Config objects as Pydantic BaseModels ( #6176 )" ( #6214 )
2025-01-13 07:36:25 -07:00
Calvin Smith
873dddb4e8
Config objects as Pydantic BaseModels ( #6176 )
Co-authored-by: Calvin Smith <calvin@all-hands.dev>
Co-authored-by: Graham Neubig <neubig@gmail.com>
2025-01-12 15:09:45 -05:00
Calvin Smith
6e4ff56934
feature: Condenser Interface and Defaults ( #5306 )
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Calvin Smith <calvin@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2025-01-08 04:36:30 +08:00
Dmitry Kozlov
17d722f3b3
Update README.md ( #6076 )
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
2025-01-06 17:31:19 +00:00
Xingyao Wang
ec70af9412
refactor: Replace pexpect with libtmux in BashSession ( #4881 )
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Robert Brennan <accounts@rbren.io>
2025-01-04 05:22:13 +08:00
Xingyao Wang
61ebec9ff7
feat(eval): better visualization for comparing two swe-bench runs ( #5993 )
2025-01-03 02:36:51 +00:00
Xingyao Wang
9dd5463e06
Set default value of use_microagents to False to prevent breaking eval ( #5976 )
Co-authored-by: openhands <openhands@all-hands.dev>
2025-01-03 05:39:17 +08:00
Robert Brennan
0e4e1b3316
Factor out ActionExecutionClient ( #5796 )
2024-12-30 15:32:13 +00:00
Boxuan Li
6a4442e590
[Evaluation] Add summarise_results script for TheAgentCompany benchmark ( #5811 )
2024-12-27 20:33:41 -08:00
Boxuan Li
5ed80b5c32
[doc] Fix link in TheAgentCompany benchmark's README.md ( #5848 )
2024-12-27 22:21:02 +08:00
OpenHands
8975fcd714
Fix issue #5748 : Rename "Ran a Jupyter Command" to "Ran a Python Command" in UI ( #5749 )
Co-authored-by: Graham Neubig <neubig@gmail.com>
2024-12-26 23:30:19 +08:00
OpenHands
bfb191b5c7
Fix issue #5739 : [Bug]: Move ./evaluation/swe_bench/scripts/cleanup_remote_runtime.sh to general eval utils ( #5740 )
2024-12-25 17:17:06 -05:00
Boxuan Li
b1719bb3db
Add TheAgentCompany evaluation harness ( #5731 )
2024-12-22 14:12:30 -05:00
OpenHands
21948fa81b
Fix issue #5735 : [Bug]: Inconsistent command line arguments in evaluation directory ( #5736 )
2024-12-22 04:41:39 +08:00
Xingyao Wang
581d5ec7a8
feat(eval): increase resource factor for remote runtime when previous run failed due to resource ( #5709 )
2024-12-21 01:47:06 +08:00
Xingyao Wang
c333938384
feat(eval): add standard error to swebench summarize outputs ( #5700 )
Co-authored-by: openhands <openhands@all-hands.dev>
2024-12-20 08:39:43 +08:00
Xingyao Wang
e9cafb0372
chore: Cleanup runtime exception handling ( #5696 )
2024-12-19 17:28:29 +00:00
Xingyao Wang
9cdb8d06c0
fix(eval): Use cp -r instead of mv for SWE-Bench Initialization ( #5659 )
2024-12-17 21:21:27 +00:00
Engel Nyst
3297e4d5a8
Use litellm's modify params ( #5636 )
2024-12-17 21:32:49 +01:00
OpenHands
4998b5de32
Fix issue #5559 : The turn limit should be measured from the last user interaction ( #5560 )
Co-authored-by: Graham Neubig <neubig@gmail.com>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2024-12-16 16:28:23 -05:00
Engel Nyst
b295f5775c
Revert "Fix issue #5609 : Use litellm's modify_params with default True" ( #5631 )
2024-12-16 20:39:57 +00:00
OpenHands
09735c7869
Fix issue #5609 : Use litellm's modify_params with default True ( #5611 )
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2024-12-16 20:18:45 +01:00
Engel Nyst
4716955960
Remove unused codeact-SWE agent ( #5600 )
Co-authored-by: openhands <openhands@all-hands.dev>
2024-12-14 20:49:44 +01:00
Ryan H. Tran
8ae2fb636e
Remove symlink use for swebench setup ( #5549 )
2024-12-13 22:18:14 +08:00
Engel Nyst
b11e905988
Verify costs script ( #5469 )
2024-12-10 14:20:53 +01:00
Engel Nyst
455e667739
add cost to summary ( #5473 )
2024-12-10 03:14:03 +08:00
Cheng Yang
8f47547b08
docs: fix markdown linting and broken links ( #5401 )
2024-12-05 01:28:04 +08:00
Xingyao Wang
9908e1b285
[Evaluation]: Log openhands version in eval output folder, instead of agent version ( #5394 )
2024-12-04 03:33:43 +00:00
Xingyao Wang
990f277132
misc: Support folder-level exp analysis for SWE-Bench summarize_outputs.py; Handle CrashLoopBackoff for RemoteRuntime ( #5385 )
2024-12-03 15:37:21 +00:00
Graham Neubig
12dd3352c5
Add remote runtime support to agent_bench ( #5280 )
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2024-11-26 13:45:49 +00:00