Terminal-Bench Evaluation on OpenHands

Terminal-Bench has its own evaluation harness, which differs substantially from the one OpenHands uses. We implemented an OpenHands agent inside the Terminal-Bench framework using the OpenHands local runtime. This document describes how to use the Terminal-Bench harness to evaluate OpenHands.

Installation

Terminal-Bench ships a CLI tool for managing tasks and running evaluations. To install it, please follow the official Installation Doc. Alternatively, clone the terminal-bench source code and invoke the CLI with uv run tb.

Evaluation

Please see the Terminal-Bench Leaderboard for the latest benchmarking instructions. The dataset may evolve, so check there for the current dataset name and version before running.

Sample command:

export LLM_BASE_URL=<optional base url>
export LLM_API_KEY=<llm key>
tb run \
    --dataset-name terminal-bench-core \
    --dataset-version 0.1.1 \
    --agent openhands \
    --model <model> \
    --cleanup

Run tb --help or tb run --help to learn more about the CLI.
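
As a convenience, the sample command can be wrapped in a small script that checks the environment before anything is launched. The following is a minimal sketch, not part of Terminal-Bench or OpenHands: the helper names check_llm_env and build_tb_command are hypothetical, and it assumes that LLM_API_KEY is required while LLM_BASE_URL is optional, as the placeholders in the sample command suggest. It only prints the tb invocation so that it can be reviewed before being run.

```shell
#!/usr/bin/env bash
# Sketch only: check_llm_env and build_tb_command are hypothetical helpers,
# not part of the tb CLI.

check_llm_env() {
    # Assumption: LLM_API_KEY must be set; LLM_BASE_URL may be left unset.
    if [ -z "${LLM_API_KEY:-}" ]; then
        echo "LLM_API_KEY is not set" >&2
        return 1
    fi
    return 0
}

build_tb_command() {
    # Print the tb invocation for the given model instead of executing it.
    # The dataset name and version are copied from the sample command and
    # may need updating as the dataset evolves.
    local model="$1"
    printf 'tb run --dataset-name terminal-bench-core --dataset-version 0.1.1 --agent openhands --model %s --cleanup\n' "$model"
}
```

For example, after exporting LLM_API_KEY, calling check_llm_env followed by build_tb_command "<model>" prints the full command for inspection; it can then be copied into the shell and run.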