Add README for terminal_bench evaluation harness (#9700)

2025-12-26 05:48:36 +08:00 · 2025-07-15 06:48:34 -07:00 · 2025-07-15 06:48:34 -07:00 · 5c3619bc48
commit 5c3619bc48
parent 641d0a0bcb
2 changed files with 33 additions and 1 deletions
--- a/evaluation/README.md
+++ b/evaluation/README.md
@@ -101,13 +101,14 @@ The OpenHands evaluation harness supports a wide variety of benchmarks across [s
 - SWE-Bench: [`evaluation/benchmarks/swe_bench`](./benchmarks/swe_bench)
 - HumanEvalFix: [`evaluation/benchmarks/humanevalfix`](./benchmarks/humanevalfix)
 - BIRD: [`evaluation/benchmarks/bird`](./benchmarks/bird)
-- BioCoder: [`evaluation/benchmarks/ml_bench`](./benchmarks/ml_bench)
+- BioCoder: [`evaluation/benchmarks/biocoder`](./benchmarks/biocoder)
 - ML-Bench: [`evaluation/benchmarks/ml_bench`](./benchmarks/ml_bench)
 - APIBench: [`evaluation/benchmarks/gorilla`](./benchmarks/gorilla/)
 - ToolQA: [`evaluation/benchmarks/toolqa`](./benchmarks/toolqa/)
 - AiderBench: [`evaluation/benchmarks/aider_bench`](./benchmarks/aider_bench/)
 - Commit0: [`evaluation/benchmarks/commit0_bench`](./benchmarks/commit0_bench/)
 - DiscoveryBench: [`evaluation/benchmarks/discoverybench`](./benchmarks/discoverybench/)
+- TerminalBench: [`evaluation/benchmarks/terminal_bench`](./benchmarks/terminal_bench)

 ### Web Browsing

--- a/evaluation/benchmarks/terminal_bench/README.md
+++ b/evaluation/benchmarks/terminal_bench/README.md
@@ -0,0 +1,31 @@
+# Terminal-Bench Evaluation on OpenHands
+
+Terminal-Bench has its own evaluation harness, which differs substantially from OpenHands'. We
+implemented an [OpenHands agent](https://github.com/laude-institute/terminal-bench/tree/main/terminal_bench/agents/installed_agents/openhands) that uses the OpenHands local runtime
+inside the Terminal-Bench framework. This document describes how to use the Terminal-Bench
+harness to evaluate OpenHands.
+
+## Installation
+
+Terminal-Bench ships a CLI tool for managing tasks and running evaluations.
+Please follow the official [installation guide](https://www.tbench.ai/docs/installation). Alternatively, clone the Terminal-Bench [source code](https://github.com/laude-institute/terminal-bench) and invoke the CLI with `uv run tb`.
+
+## Evaluation
+
+See the [Terminal-Bench Leaderboard](https://www.tbench.ai/leaderboard) for the latest
+benchmarking instructions. The dataset may change over time.
+
+Sample command:
+
+```bash
+export LLM_BASE_URL=<optional base url>
+export LLM_API_KEY=<llm key>
+tb run \
+    --dataset-name terminal-bench-core \
+    --dataset-version 0.1.1 \
+    --agent openhands \
+    --model <model> \
+    --cleanup
+```
+
+Run `tb --help` or `tb run --help` to learn more about the Terminal-Bench CLI.
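
The sample command in the new README could be wrapped in a small shell function that fails early when `LLM_API_KEY` is missing. The following is a minimal sketch and is not part of this commit: the function name `run_tb` and the `DRY_RUN` variable are hypothetical, and the `tb` flags are exactly those shown in the README's sample command, with no others assumed.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper around the README's sample command (not part of the commit).

# run_tb MODEL
#   Runs the README's `tb run` invocation for MODEL.
#   With DRY_RUN=1 (a hypothetical switch), prints the command instead of running it.
run_tb() {
    model="$1"

    # The README exports LLM_API_KEY before calling tb; refuse to continue without it.
    if [ -z "${LLM_API_KEY:-}" ]; then
        echo "LLM_API_KEY is not set" >&2
        return 1
    fi

    # Build the argument list from the flags in the README's sample command.
    set -- tb run \
        --dataset-name terminal-bench-core \
        --dataset-version 0.1.1 \
        --agent openhands \
        --model "$model" \
        --cleanup

    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Print the command as a single space-separated line.
        printf '%s\n' "$*"
    else
        "$@"
    fi
}
```

After sourcing the script, `DRY_RUN=1 run_tb <model>` prints the command for inspection, and `run_tb <model>` executes it. `LLM_BASE_URL` is left unchecked because the README marks it as optional.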