mirror of
https://github.com/OpenHands/OpenHands.git
synced 2025-12-26 05:48:36 +08:00
Terminal-Bench Evaluation on OpenHands
Terminal-Bench has its own evaluation harness, which differs substantially from the one OpenHands uses. We implemented an OpenHands agent inside the Terminal-Bench framework using the OpenHands local runtime. This document describes how to use the Terminal-Bench harness to evaluate OpenHands.
Installation
Terminal-Bench ships a CLI tool, tb, for managing tasks and running evaluations.
Please follow the official Installation Doc. Alternatively, clone the Terminal-Bench source code and invoke the CLI with uv run tb.
Evaluation
Please see the Terminal-Bench Leaderboard for the latest benchmarking instructions. The dataset may change over time, so the dataset name and version in the sample command below may become outdated.
Sample command:
export LLM_BASE_URL=<optional base url>
export LLM_API_KEY=<llm key>
tb run \
    --dataset-name terminal-bench-core \
    --dataset-version 0.1.1 \
    --agent openhands \
    --model <model> \
    --cleanup
Run tb --help or tb run --help to learn more about the CLI.
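The sample command above fails late if LLM_API_KEY was never exported, so it can help to check the environment before launching a run. Below is a minimal sketch of such a check. The function name tb_openhands_cmd is hypothetical and not part of Terminal-Bench or OpenHands. It only prints the command, using the same flags as the sample, so nothing is executed until you have reviewed it. The default dataset version is copied from the sample and may be outdated.

```shell
# Hypothetical helper, not shipped with terminal-bench or OpenHands.
# Prints the tb invocation from the sample command instead of running it.
# Usage: tb_openhands_cmd <model> [dataset-version]
tb_openhands_cmd() {
  tb_model="${1:-}"
  tb_version="${2:-0.1.1}"   # default copied from the sample command

  # The agent reads the key from the environment, so fail early if it is absent.
  if [ -z "${LLM_API_KEY:-}" ]; then
    echo "error: LLM_API_KEY is not set" >&2
    return 1
  fi

  if [ -z "$tb_model" ]; then
    echo "usage: tb_openhands_cmd <model> [dataset-version]" >&2
    return 2
  fi

  # Same flags as the sample command; only model and version vary.
  echo "tb run --dataset-name terminal-bench-core --dataset-version $tb_version --agent openhands --model $tb_model --cleanup"
}
```

After sourcing the function, calling tb_openhands_cmd <model> prints the full command line, which can then be pasted into the shell. LLM_BASE_URL is left unchecked because the sample marks it as optional.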