mirror of
https://github.com/OpenHands/OpenHands.git
synced 2025-12-26 05:48:36 +08:00
Add README for terminal_bench evaluation harness (#9700)
This commit is contained in:
parent
641d0a0bcb
commit
5c3619bc48
@ -101,13 +101,14 @@ The OpenHands evaluation harness supports a wide variety of benchmarks across [s
|
||||
- SWE-Bench: [`evaluation/benchmarks/swe_bench`](./benchmarks/swe_bench)
|
||||
- HumanEvalFix: [`evaluation/benchmarks/humanevalfix`](./benchmarks/humanevalfix)
|
||||
- BIRD: [`evaluation/benchmarks/bird`](./benchmarks/bird)
|
||||
- BioCoder: [`evaluation/benchmarks/ml_bench`](./benchmarks/ml_bench)
|
||||
- BioCoder: [`evaluation/benchmarks/biocoder`](./benchmarks/biocoder)
|
||||
- ML-Bench: [`evaluation/benchmarks/ml_bench`](./benchmarks/ml_bench)
|
||||
- APIBench: [`evaluation/benchmarks/gorilla`](./benchmarks/gorilla/)
|
||||
- ToolQA: [`evaluation/benchmarks/toolqa`](./benchmarks/toolqa/)
|
||||
- AiderBench: [`evaluation/benchmarks/aider_bench`](./benchmarks/aider_bench/)
|
||||
- Commit0: [`evaluation/benchmarks/commit0_bench`](./benchmarks/commit0_bench/)
|
||||
- DiscoveryBench: [`evaluation/benchmarks/discoverybench`](./benchmarks/discoverybench/)
|
||||
- TerminalBench: [`evaluation/benchmarks/terminal_bench`](./benchmarks/terminal_bench)
|
||||
|
||||
### Web Browsing
|
||||
|
||||
|
||||
31
evaluation/benchmarks/terminal_bench/README.md
Normal file
31
evaluation/benchmarks/terminal_bench/README.md
Normal file
@ -0,0 +1,31 @@
|
||||
# Terminal-Bench Evaluation on OpenHands
|
||||
|
||||
Terminal-Bench has its own evaluation harness that is very different from OpenHands'. We
|
||||
implemented [OpenHands agent](https://github.com/laude-institute/terminal-bench/tree/main/terminal_bench/agents/installed_agents/openhands) using OpenHands local runtime
|
||||
inside terminal-bench framework. Hereby we introduce how to use the terminal-bench
|
||||
harness to evaluate OpenHands.
|
||||
|
||||
## Installation
|
||||
|
||||
Terminal-bench ships a CLI tool to manage tasks and run evaluation.
|
||||
Please follow official [Installation Doc](https://www.tbench.ai/docs/installation). You could also clone terminal-bench [source code](https://github.com/laude-institute/terminal-bench) and use `uv run tb` CLI.
|
||||
|
||||
## Evaluation
|
||||
|
||||
Please see [Terminal-Bench Leaderboard](https://www.tbench.ai/leaderboard) for the latest
|
||||
instruction on benchmarking guidance. The dataset might evolve.
|
||||
|
||||
Sample command:
|
||||
|
||||
```bash
|
||||
export LLM_BASE_URL=<optional base url>
|
||||
export LLM_API_KEY=<llm key>
|
||||
tb run \
|
||||
--dataset-name terminal-bench-core \
|
||||
--dataset-version 0.1.1 \
|
||||
--agent openhands \
|
||||
--model <model> \
|
||||
--cleanup
|
||||
```
|
||||
|
||||
You could run `tb --help` or `tb run --help` to learn more about their CLI.
|
||||
Loading…
x
Reference in New Issue
Block a user