Engel Nyst
b295f5775c
Revert "Fix issue #5609 : Use litellm's modify_params with default True" ( #5631 )
2024-12-16 20:39:57 +00:00
OpenHands
09735c7869
Fix issue #5609 : Use litellm's modify_params with default True ( #5611 )
...
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2024-12-16 20:18:45 +01:00
Engel Nyst
4716955960
Remove unused codeact-SWE agent ( #5600 )
...
Co-authored-by: openhands <openhands@all-hands.dev>
2024-12-14 20:49:44 +01:00
Ryan H. Tran
8ae2fb636e
Remove symlink use for swebench setup ( #5549 )
2024-12-13 22:18:14 +08:00
Engel Nyst
b11e905988
Verify costs script ( #5469 )
2024-12-10 14:20:53 +01:00
Engel Nyst
455e667739
add cost to summary ( #5473 )
2024-12-10 03:14:03 +08:00
Cheng Yang
8f47547b08
docs: fix markdown linting and broken links ( #5401 )
2024-12-05 01:28:04 +08:00
Xingyao Wang
9908e1b285
[Evaluation]: Log openhands version in eval output folder, instead of agent version ( #5394 )
2024-12-04 03:33:43 +00:00
Xingyao Wang
990f277132
misc: Support folder-level exp analysis for SWE-Bench summarize_outputs.py; Handle CrashLoopBackoff for RemoteRuntime ( #5385 )
2024-12-03 15:37:21 +00:00
Engel Nyst
ea994b6209
More integration tests info ( #5319 )
2024-11-29 16:39:03 +01:00
Cheng Yang
b808a639d9
docs: improve evaluation README with proper links and formatting ( #5221 )
2024-11-27 18:27:36 -05:00
Xingyao Wang
4d3b035e00
feat(agent): add BrowseURLAction to CodeAct (produce markdown from URL) ( #5285 )
2024-11-27 21:55:57 +00:00
OpenHands
f0ca2239f3
Fix issue #5076 : Integration test github action ( #5077 )
...
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2024-11-27 21:31:48 +01:00
Graham Neubig
12dd3352c5
Add remote runtime support to agent_bench ( #5280 )
...
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2024-11-26 13:45:49 +00:00
OpenHands
678436da30
Fix issue #5222 : [Refactor]: Refactor the evaluation directory ( #5223 )
...
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2024-11-25 08:35:52 -05:00
Nan Jiang
463d4e9a46
eval: add commit0 benchmark ( #5153 )
...
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu>
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
2024-11-22 19:49:45 +00:00
Xingyao Wang
ff84a3eede
chore: remove specified sid ( #5127 )
2024-11-19 16:41:27 +00:00
Xingyao Wang
a531413d86
fix(eval): support setting hard timeout per evaluation instance ( #5110 )
2024-11-18 21:22:55 -05:00
Xingyao Wang
bdc4513937
fix(swebench): handle error in eval_infer and run_infer ( #5017 )
2024-11-15 23:04:56 +08:00
Graham Neubig
ce6f99d80e
Add GITHUB_USERNAME env var to resolver step ( #4999 )
...
Co-authored-by: openhands <openhands@all-hands.dev>
2024-11-14 18:42:59 +00:00
Ketan Ramaneti
852c90f64a
[fix eval] Fix issues with miniwob remote runtime evaluation ( #5001 )
2024-11-14 18:00:48 +00:00
Ketan Ramaneti
42b49e6c43
[fix eval] Fix issues with aider_bench remote runtime evaluation ( #5000 )
2024-11-14 17:58:45 +00:00
Xingyao Wang
07f0d1ccb3
feat(llm): convert function call request for non-funcall OSS model ( #4711 )
...
Co-authored-by: Calvin Smith <email@cjsmith.io>
2024-11-15 00:40:09 +08:00
Robert Brennan
bc3f0ac24a
fix imports ( #4974 )
2024-11-13 17:04:16 +00:00
Calvin Smith
50e7da9c3d
fix(evaluation): SWE-bench evaluation script supports multiprocessing ( #4943 )
2024-11-12 12:19:57 -07:00
Robert Brennan
17f4c6e1a9
Refactor sessions a bit, and fix issue where runtimes get killed ( #4900 )
2024-11-12 16:20:36 +00:00
Xingyao Wang
a07e8272da
fix: improve remote runtime reliability on large-scale evaluation ( #4869 )
2024-11-09 20:17:10 +00:00
Robert Brennan
be82832eb1
Use keyword matching for CodeAct microagents ( #4568 )
...
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
2024-11-09 11:25:02 -05:00
Xingyao Wang
4ce3b9094a
Revert "(feat): Prompt engineering to remind o1 to generate a patch" ( #4846 )
2024-11-08 16:12:57 +00:00
Alejandro Cuadron Lafuente
a6810fa6ad
(feat): Prompt engineering to remind o1 to generate a patch ( #4807 )
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
Co-authored-by: mamoodi <mamoodiha@gmail.com>
Co-authored-by: tofarr <tofarr@gmail.com>
Co-authored-by: Robert Brennan <contact@rbren.io>
2024-11-08 03:10:18 +00:00
Xingyao Wang
53390d9885
Fix issue #4583 : [Bug]: Unable to pull the full SWE-Bench test set ( #4813 )
...
Co-authored-by: openhands <openhands@all-hands.dev>
2024-11-07 22:35:20 +08:00
OpenHands
025dac5d8f
Fix issue #4776 : [Bug]: Files are not uploaded to the environment (SWE-Bench) ( #4795 )
2024-11-06 19:05:06 +00:00
Engel Nyst
eeb2342509
Refactor history/event stream ( #3808 )
2024-11-05 03:36:14 +01:00
Xingyao Wang
1d2a616be7
Fix issue #4739 : '[Bug]: The agent doesn't know its name' ( #4740 )
...
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Graham Neubig <neubig@gmail.com>
2024-11-04 21:24:35 +00:00
Xingyao Wang
966da7b7c8
feat(agent, CodeAct 2.2): native CodeAct support for Browsing ( #4667 )
...
Co-authored-by: tofarr <tofarr@gmail.com>
2024-11-05 00:27:27 +08:00
Abhijeetsingh Meena
8857f02083
[Eval] DiscoveryBench OpenHands Integration ( #4627 )
...
Signed-off-by: Abhijeetsingh Meena <abhijeet040403@gmail.com>
Co-authored-by: Harshit Surana <surana.h@gmail.com>
2024-11-02 07:24:34 -04:00
Ziru "Ron" Chen
db4e1dbbec
[eval] Add ScienceAgentBench. ( #4645 )
...
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
2024-11-01 02:30:55 +08:00
Xingyao Wang
9c2b48ff5d
fix(eval): SWE-Bench instance with upper-case instance id ( #4649 )
2024-10-30 21:24:18 +00:00
Xingyao Wang
6d19c93d19
[eval] add evaluation workflow ( #4489 )
...
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2024-10-29 13:52:25 +00:00
Xingyao Wang
ae13171194
feat(agent): CodeAct with function calling ( #4537 )
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: tobitege <10787084+tobitege@users.noreply.github.com>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: tofarr <tofarr@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-29 11:06:33 +08:00
Xingyao Wang
1f23dc89b6
fix(eval): add runtime.connect to all eval harness ( #4565 )
2024-10-26 00:41:30 +08:00
Xingyao Wang
7340b78962
feat(eval): rewrite log_completions to save completions to directory ( #4566 )
2024-10-25 16:36:11 +00:00
tofarr
c4f5c07be1
Refactor: shorter syntax ( #4558 )
2024-10-25 06:45:28 -06:00
Graham Neubig
ce2430180f
Update README.md to fix miniwob name ( #4534 )
2024-10-23 18:24:43 +00:00
Xingyao Wang
2d5b360505
refactor: re-organize different runtime implementations into an impl folder ( #4346 )
...
Co-authored-by: Graham Neubig <neubig@gmail.com>
2024-10-23 10:10:03 +00:00
Graham Neubig
54250e3fe2
Update evaluation README.md structure ( #4516 )
2024-10-22 14:42:22 +00:00
Xingyao Wang
da548d308c
[agent] LLM-based editing ( #3985 )
...
Co-authored-by: Tim O'Farrell <tofarr@gmail.com>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Robert Brennan <accounts@rbren.io>
Co-authored-by: Graham Neubig <neubig@gmail.com>
2024-10-22 04:51:44 +08:00
Alejandro Cuadron Lafuente
a9a593bb21
[Fix] Added support to specify the platform on which the runtime image should be built. ( #4402 )
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
Co-authored-by: mamoodi <mamoodiha@gmail.com>
Co-authored-by: tofarr <tofarr@gmail.com>
Co-authored-by: Robert Brennan <contact@rbren.io>
2024-10-20 09:19:05 +08:00
Xingyao Wang
91308ba4dc
feat: clean-up retries RemoteRuntime & add FatalErrorObservation ( #4485 )
2024-10-18 17:23:13 +00:00
Jiayi Pan
c1b323a076
Show actual dataset name in swebench log directory ( #4417 )
2024-10-17 10:32:38 +08:00