Graham Neubig
089d9c1ee5
Add deprecation warning to evaluation README ( #11997 )
...
Co-authored-by: openhands <openhands@all-hands.dev>
2025-12-16 00:21:13 +08:00
Jeffrey Ma
974bcdfd0b
SWE-fficiency benchmark implementation ( #11716 )
...
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
Co-authored-by: enyst <engel.nyst@gmail.com>
2025-11-27 09:13:15 +01:00
John Eismeier
967e9e1891
Propose fix some typos and ignore emacs backup files ( #11701 )
...
Signed-off-by: John E <jeis4wpi@outlook.com>
2025-11-11 09:20:42 -05:00
Engel Nyst
14807ed273
ci: remove outdated integration runner ( #11653 )
2025-11-10 15:51:40 +01:00
Kevin Musgrave
12d6da8130
feat(evaluation): Filter task ids by difficulty for SWE Gym rollouts ( #11490 )
...
Co-authored-by: Graham Neubig <neubig@gmail.com>
Co-authored-by: openhands <openhands@all-hands.dev>
2025-10-30 02:30:19 +00:00
Zacharias Fisches
818f743dc7
Bugfix: respect config.toml system_prompt_filename when running swe-bench ( #11091 )
...
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Graham Neubig <neubig@gmail.com>
2025-10-27 21:55:05 +00:00
Robert Brennan
b5e00f577c
Replace All-Hands-AI references with OpenHands ( #11287 )
...
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <engel.nyst@gmail.com>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2025-10-26 01:52:45 +02:00
Tim O'Farrell
4b303ec9b4
Fixes to unblock frontend ( #11488 )
...
Co-authored-by: Ray Myers <ray.myers@gmail.com>
2025-10-23 14:43:45 -06:00
Kevin Musgrave
a237b578c0
feat(evaluation): Add multi-swe-bench dependency and fix rollout script ( #11326 )
...
Co-authored-by: Graham Neubig <neubig@gmail.com>
2025-10-16 14:35:19 +00:00
Engel Nyst
3e645f8649
fix(integration-tests): accept --eval-num-workers and --eval-note in integration test runner ( #11387 )
2025-10-16 09:50:24 -04:00
juanmichelini
471d272c7c
Mint security eval fix ( #11273 )
...
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
2025-10-16 01:42:05 +00:00
Kevin Musgrave
19bae5ac0f
feat(evaluation): Add placeholders to swe_gpt4.j2 ( #11228 )
...
Co-authored-by: Graham Neubig <neubig@gmail.com>
2025-10-13 22:15:05 +08:00
Xinyi He
7906eab6b1
Add inference generation of SWE-Perf Benchmark ( #10246 )
...
Co-authored-by: mamoodi <mamoodiha@gmail.com>
Co-authored-by: Graham Neubig <neubig@gmail.com>
Co-authored-by: openhands <openhands@all-hands.dev>
2025-09-22 20:35:30 +00:00
juanmichelini
547e1049f1
Multi swe gym ( #10605 )
...
Co-authored-by: openhands <openhands@all-hands.dev>
2025-09-22 15:56:26 -04:00
Ryan H. Tran
df9320f8ab
Implement model routing support ( #9738 )
...
Co-authored-by: openhands <openhands@all-hands.dev>
2025-09-08 16:19:34 +07:00
Haowei Lin
bd8b1bfa25
Add a new benchmark: AlgoTune ( #10724 )
...
Co-authored-by: linhaowei <linhaowei@wizardquant.com>
Co-authored-by: Graham Neubig <neubig@gmail.com>
2025-09-04 18:08:50 +00:00
Zacharias Fisches
20e5c40969
Fix swe-bench run_infer.py config parsing from config.toml ( #10792 )
2025-09-04 20:10:08 +08:00
Xingyao Wang
b082ccc0fb
feat(llm): add support for deepseek and gpt-5-mini, util for token count ( #10626 )
...
Co-authored-by: openhands <openhands@all-hands.dev>
2025-08-27 11:03:35 +08:00
Xingyao Wang
4507a25b85
Evaluation: redirect sessions to repo-local .eval_sessions via helper; apply across entrypoints; add tests ( #10540 )
...
Co-authored-by: openhands <openhands@all-hands.dev>
2025-08-22 13:34:02 +00:00
Engel Nyst
91d3d1d20a
Fix: expose aggregated LLM metrics in State for evaluation scripts ( #10537 )
...
Co-authored-by: openhands <openhands@all-hands.dev>
2025-08-21 17:43:09 +02:00
Kevin Musgrave
74ba21bad0
feat(evaluation): Added INSTRUCTION_TEMPLATE_NAME to run_infer.py in swe_bench ( #10270 )
...
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
Co-authored-by: mamoodi <mamoodiha@gmail.com>
2025-08-18 14:18:08 +00:00
Zhonghao Jiang
7229a16b45
feat(evaluation): Add NoCode-bench evaluation script ( #10229 )
2025-08-16 16:41:22 +00:00
Engel Nyst
f7f4fcf98f
chore(eval): remove old, unused regression test framework under evaluation/regression ( #10419 )
2025-08-16 01:08:23 +02:00
Xingyao Wang
c2f46200c0
chore(lint): Apply comprehensive linting and formatting fixes ( #10287 )
...
Co-authored-by: openhands <openhands@all-hands.dev>
2025-08-13 21:13:19 +02:00
Ibragim Badertdinov
19a6b6b618
feat(eval): Support evaluation on SWE-rebench ( #10251 )
2025-08-12 14:05:43 +00:00
Insop
1d0d88d491
Readability improvement & remove duplicated and unused prompts ( #10241 )
2025-08-12 12:42:17 +08:00
Ryan H. Tran
758e30c9a8
Remove SecretStr conversion in GAIA eval ( #10204 )
2025-08-11 21:30:18 +08:00
Xingyao Wang
04ff4a025b
feat(cli): Use CLI to launch OpenHands UI server via Docker ( #9783 )
...
Co-authored-by: openhands <openhands@all-hands.dev>
2025-08-09 02:04:07 +08:00
Xingyao Wang
c4f303a07b
chore(eval): Remove eval_infer_remote.sh script and related references ( #10157 )
...
Co-authored-by: openhands <openhands@all-hands.dev>
2025-08-07 20:46:59 +00:00
Boxuan Li
7af35ab827
Evaluation: disable browser when NOT run_with_browsing ( #9837 )
2025-07-22 01:45:52 +00:00
juanmichelini
ea50fe4e3c
Fix: Continue evaluation when an instance fails after max retries ( #8868 )
...
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Xingyao Wang <xingyaoww@gmail.com>
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
2025-07-16 22:42:44 +00:00
Engel Nyst
fba2218760
Fix integration tests ( #9746 )
2025-07-16 22:16:40 +02:00
Boxuan Li
5c3619bc48
Add README for terminal_bench evaluation harness ( #9700 )
2025-07-15 09:48:34 -04:00
xhguo7
9388fef0ef
feat(eval): loc acc evaluation ( #8515 )
...
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
Co-authored-by: mamoodi <mamoodiha@gmail.com>
2025-07-11 03:22:35 +08:00
Xingyao Wang
cff5697456
eval: remove gemini-specific swebench template ( #9623 )
2025-07-08 18:34:23 +00:00
Ryan H. Tran
dfa54673d2
[OH-Versa] Add remaining browsing & GAIA eval improvement ( #9015 )
...
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2025-06-25 12:36:15 +07:00
Maxim Evtush
653a8a7ce2
Refactor: Improve Consistency in Function Signatures and Regex Usage in compute_ism_pm_score.py ( #9145 )
2025-06-18 04:22:16 +08:00
Ryan H. Tran
ddaa186971
[GAIA] Add prompt improvement to alleviate solution parsing issue & support Tavily search tools ( #9057 )
2025-06-17 13:16:50 +07:00
better629
432d8829dc
disable mcp in run_localize and install oh-aci[llama] for issue 9150 ( #9151 )
2025-06-16 11:03:17 +00:00
FT
e5bff91e8e
Fix Typo: Change "accurancy" to "accuracy" in Evaluation Benchmark Comments ( #9139 )
2025-06-15 12:48:26 +00:00
Linghao Zhang
a93b0457c6
feat(eval): Support evaluation on SWE-bench-Live ( #9137 )
2025-06-15 12:30:47 +00:00
kilavvy
4e99aabcb2
Minor Code Comment Corrections and Clarifications ( #9129 )
2025-06-14 18:57:14 +00:00
Graham Neubig
0c307ea12e
Lint all files in the repo ( #9131 )
...
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2025-06-14 16:25:59 +00:00
ASTONE
be62ba6b35
add_versicode ( #8221 )
2025-06-14 13:17:18 +00:00
leopardracer
13c298d35f
Minor Typo Fixes in Comments and Documentation ( #9058 )
2025-06-14 12:51:38 +00:00
Engel Nyst
fd3b4ac8e6
Refactor SWE-bench instruction ( #8010 )
2025-06-13 23:27:52 +02:00
Leander Maben
d84befe28f
Adding LLM Based Editing capability ( #8677 )
...
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Engel Nyst <engel.nyst@gmail.com>
2025-06-09 21:57:20 +08:00
Sergey
49939c1f02
Fix typo in evaluation README.md ( #8987 )
2025-06-08 14:14:07 +00:00
llamantino
880c05ed94
Fix all broken docs links across the project ( #8830 )
...
Co-authored-by: llamantino <12345678+yourusername@users.noreply.github.com>
2025-05-31 21:24:59 -04:00
Robert Brennan
205f0234e8
Rename Conversation to ServerConversation and AppConfig to OpenHandsConfig ( #8754 )
...
Co-authored-by: openhands <openhands@all-hands.dev>
2025-05-28 21:48:34 +02:00