Xingyao Wang
c2f46200c0
chore(lint): Apply comprehensive linting and formatting fixes ( #10287 )
...
Co-authored-by: openhands <openhands@all-hands.dev>
2025-08-13 21:13:19 +02:00
Ibragim Badertdinov
19a6b6b618
feat(eval): Support evaluation on SWE-rebench ( #10251 )
2025-08-12 14:05:43 +00:00
juanmichelini
ea50fe4e3c
Fix: Continue evaluation when an instance fails after max retries ( #8868 )
...
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Xingyao Wang <xingyaoww@gmail.com>
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
2025-07-16 22:42:44 +00:00
Ryan H. Tran
dfa54673d2
[OH-Versa] Add remaining browsing & GAIA eval improvement ( #9015 )
...
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2025-06-25 12:36:15 +07:00
Linghao Zhang
a93b0457c6
feat(eval): Support evaluation on SWE-bench-Live ( #9137 )
2025-06-15 12:30:47 +00:00
Graham Neubig
689d3c9046
Update pre-commit hook versions to most recent versions ( #8343 )
...
Co-authored-by: openhands <openhands@all-hands.dev>
2025-05-08 03:59:13 +00:00
Rohit Malhotra
9adfcede31
(Hotfix): Track reason for Error AgentState ( #7584 )
...
Co-authored-by: openhands <openhands@all-hands.dev>
2025-03-31 21:24:42 +00:00
Xingyao Wang
01e0e29a9f
Reduce bash SOFT timeout from 30 to 10 seconds ( #7423 )
...
Co-authored-by: openhands <openhands@all-hands.dev>
2025-03-22 22:42:24 +00:00
Xingyao Wang
33780f97d0
[eval] Upgrade SWE-Bench to use official image and latest harness ( #6838 )
...
Co-authored-by: Robert Brennan <accounts@rbren.io>
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Graham Neubig <neubig@gmail.com>
2025-02-27 08:15:05 -05:00
Mateusz Kwiatkowski
6562297615
Replace shebang with /usr/bin/env bash for improved portability ( #6876 )
...
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
2025-02-24 18:07:28 +00:00
Xingyao Wang
1a7003a705
Add sysbox support to remote runtime for eval; Add memory monitor, stress tests to help debug memory issue ( #6684 )
...
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Graham Neubig <neubig@gmail.com>
2025-02-18 20:02:28 +00:00
Boxuan Li
ef12bc5381
Evaluation harness: Add agent config option ( #6662 )
2025-02-13 15:05:03 -05:00
Xingyao Wang
2b04ee2e62
feat(eval): reliability improvement for SWE-Bench eval_infer ( #6347 )
2025-01-18 14:02:59 -05:00
Calvin Smith
a12087243a
Pydantic-based configuration and setting objects ( #6321 )
...
Co-authored-by: Calvin Smith <calvin@all-hands.dev>
Co-authored-by: Graham Neubig <neubig@gmail.com>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2025-01-17 12:33:22 -07:00
Xingyao Wang
899c1f8360
fix(bash): also show timeout reminder when no_change_timeout is triggered ( #6318 )
...
Co-authored-by: Robert Brennan <accounts@rbren.io>
2025-01-18 03:31:23 +08:00
tofarr
23473070b9
Revert "Config objects as Pydantic BaseModels ( #6176 )" ( #6214 )
2025-01-13 07:36:25 -07:00
Calvin Smith
873dddb4e8
Config objects as Pydantic BaseModels ( #6176 )
...
Co-authored-by: Calvin Smith <calvin@all-hands.dev>
Co-authored-by: Graham Neubig <neubig@gmail.com>
2025-01-12 15:09:45 -05:00
Calvin Smith
6e4ff56934
feature: Condenser Interface and Defaults ( #5306 )
...
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Calvin Smith <calvin@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2025-01-08 04:36:30 +08:00
Xingyao Wang
f14f75b064
feat: runtime improvements for rate-limit and 502/503/404 error ( #5975 )
2025-01-03 08:36:19 -07:00
OpenHands
bfb191b5c7
Fix issue #5739 : [Bug]: Move ./evaluation/swe_bench/scripts/cleanup_remote_runtime.sh to general eval utils ( #5740 )
2024-12-25 17:17:06 -05:00
Xingyao Wang
581d5ec7a8
feat(eval): increase resource factor for remote runtime when previous run failed due to resource ( #5709 )
2024-12-21 01:47:06 +08:00
Xingyao Wang
e9cafb0372
chore: Cleanup runtime exception handling ( #5696 )
2024-12-19 17:28:29 +00:00
Xingyao Wang
9908e1b285
[Evaluation]: Log openhands version in eval output folder, instead of agent version ( #5394 )
2024-12-04 03:33:43 +00:00
Xingyao Wang
a531413d86
fix(eval): support setting hard timeout per evaluation instance ( #5110 )
2024-11-18 21:22:55 -05:00
Xingyao Wang
07f0d1ccb3
feat(llm): convert function call request for non-funcall OSS model ( #4711 )
...
Co-authored-by: Calvin Smith <email@cjsmith.io>
2024-11-15 00:40:09 +08:00
Calvin Smith
50e7da9c3d
fix(evaluation): SWE-bench evaluation script supports multiprocessing ( #4943 )
2024-11-12 12:19:57 -07:00
Engel Nyst
eeb2342509
Refactor history/event stream ( #3808 )
2024-11-05 03:36:14 +01:00
Xingyao Wang
966da7b7c8
feat(agent, CodeAct 2.2): native CodeAct support for Browsing ( #4667 )
...
Co-authored-by: tofarr <tofarr@gmail.com>
2024-11-05 00:27:27 +08:00
Xingyao Wang
ae13171194
feat(agent): CodeAct with function calling ( #4537 )
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: tobitege <10787084+tobitege@users.noreply.github.com>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: tofarr <tofarr@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-29 11:06:33 +08:00
Xingyao Wang
7340b78962
feat(eval): rewrite log_completions to save completions to directory ( #4566 )
2024-10-25 16:36:11 +00:00
mamoodi
6f2e678028
Fix eval output path in case of @ char ( #4416 )
2024-10-15 22:45:08 +00:00
Xingyao Wang
25f9413965
[Eval] Fix eval stuck when result is too large for pbar ( #4361 )
2024-10-14 22:08:34 +08:00
Engel Nyst
e6847e9e61
Move agenthub within openhands ( #4130 )
2024-10-08 00:34:18 +00:00
Xingyao Wang
9cc9b19958
eval: improve swebench infer error handling and retry ( #4205 )
2024-10-04 07:09:56 -05:00
Xingyao Wang
53a015f718
fix: make llm_completions optional to fix eval_infer.py ( #4148 )
2024-10-02 03:55:03 +08:00
tobitege
c3bbe604eb
(fix) Fix logging in shared eval file to prevent key disclosure ( #4108 )
2024-09-28 19:33:16 +00:00
Xingyao Wang
81b3cd71b3
[eval] log evaluating warnings directly to console ( #4026 )
2024-09-26 03:42:32 +08:00
Xingyao Wang
1b1d8f0b02
[eval] Use imap_unordered for parallelizing evaluation ( #4040 )
2024-09-24 20:47:27 +00:00
Xingyao Wang
a66e738957
[eval] use mp Pool instead of ProcessPoolExecutor ( #4025 )
2024-09-24 23:59:06 +08:00
Xingyao Wang
714e46f29a
[eval] save eventstream & llm completions for SWE-Bench run_infer ( #3923 )
2024-09-22 04:39:13 +00:00
Xingyao Wang
5d7f2fd4ae
[eval] Allow evaluation of SWE-Bench patches on RemoteRuntime ( #3927 )
...
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
Co-authored-by: Graham Neubig <neubig@gmail.com>
2024-09-18 16:07:34 -04:00
Xingyao Wang
f996b31d64
[eval] Fix multi-processing bug (again^3) & allow set EXP_NAME for each run_infer ( #3907 )
...
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
2024-09-17 14:07:58 +00:00
Xingyao Wang
2b3925278d
[eval] refactor process instance logic into update_progress ( #3875 )
2024-09-15 18:47:15 -04:00
Engel Nyst
379f2b6f23
Fix queue length on Macs ( #3867 )
2024-09-14 01:11:29 +00:00
Xingyao Wang
3a1b8c093b
[eval] yet another eval fixes on multi-processing ( #3854 )
...
Co-authored-by: Graham Neubig <neubig@gmail.com>
2024-09-13 15:51:22 +00:00
Xingyao Wang
78c5f58adc
refactor & improve retry for the reliability of RemoteRuntime & evaluation ( #3846 )
2024-09-13 07:37:07 -04:00
tobitege
dbb671a8a5
logname fix; improve test calling instruction ( #3666 )
2024-08-30 17:15:31 +02:00
Xingyao Wang
090c911a50
(refactor) Make Runtime class synchronous ( #3661 )
...
* change runtime to be synchronous
* fix test runtime with the new interface
* fix arg
* fix eval
* fix missing config attribute
* fix plugins
* fix on_event by reverting it back to async
* update upload_file endpoint
* fix argument to upload file
* remove unnecessary async for eval;
fix evaluation run in parallel
* use asyncio to run controller for eval
* revert file upload
* truncate eval test result output
2024-08-30 01:37:03 +00:00
tobitege
9c39f07430
(enh) Aider-Bench: make resumable with skip_num arg ( #3626 )
...
* added optional START_ID env flag to resume from that instance id
* prepare_dataset: fix comparisons by using instance id's as int
* aider bench complete_runtime: close runtime to close container
* added matrix display of instance id for logging
* fix typo in summarize_results.py saying summarise_results
* changed start_id to skip_num to skip rows from dataset (start_id wasn't supportable)
* doc changes about huggingface spaces to temporarily point back to OD
2024-08-28 15:42:01 +00:00
Raj Maheshwari
e72dc96d13
[Fix] Stop API key from leaking in evaluation outputs. ( #3603 )
...
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu>
2024-08-26 23:38:37 +02:00