mirror of https://github.com/OpenHands/OpenHands.git synced 2025-12-26 05:48:36 +08:00

CI Builds Repair Benchmark Integration

This module integrates the CI Builds Repair benchmark developed by JetBrains-Research.

For more information, refer to the GitHub repository and the associated research paper. See the notice below for details.

Setup

Before running any scripts, configure the benchmark by setting up config.yaml. This benchmark pushes to JetBrains' private GitHub repository, so you will need to request a token_gh from their team to run it.
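A quick pre-flight check can catch a missing token before any job is pushed. The sketch below is a minimal illustration and not part of the benchmark: it assumes token_gh appears as a top-level `key: value` line in config.yaml, and the helper name missing_keys is hypothetical.

```python
def missing_keys(config_text, required=("token_gh",)):
    """Return the required keys that are absent or empty in the config text.

    Assumption: each required key is a top-level `key: value` line.
    This is a plain line scan, not a full YAML parser.
    """
    found = {}
    for line in config_text.splitlines():
        stripped = line.strip()
        # Skip blank lines, comments, and indented (nested) entries.
        if not stripped or stripped.startswith("#") or line[0].isspace():
            continue
        key, sep, value = stripped.partition(":")
        if sep:
            # Drop a trailing inline comment, then whitespace and quotes.
            value = value.split(" #", 1)[0].strip().strip("'\"")
            found[key.strip()] = value
    return [key for key in required if not found.get(key)]


# A config whose token_gh line exists but has no value is reported as missing.
print(missing_keys("token_gh:\n"))  # ['token_gh']
```

To use it against a real file, read config.yaml into a string and stop early if the returned list is non-empty.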

Inference

To run inference with your model:

./evaluation/benchmarks/lca_ci_build_repair/scripts/run_infer.sh llm.yourmodel

Evaluation

To evaluate the predictions:

./evaluation/benchmarks/lca_ci_build_repair/scripts/eval_infer.sh predictions_path_containing_output

Results

The benchmark contains 68 instances. We skip instances #126 and #145 due to dockerization errors and run only the remaining 66.

Because the benchmark runs on live GitHub machines, its results depend on the date it is run. Even the golden patches in the dataset may fail because of upstream updates. For example, on 2025-04-09, running the benchmark against the golden patches gave 57/67 successes, with 1 job left in the waiting list.

On 2025-04-10, running the full benchmark with OpenHands (OH) and no oracle, 37 instances succeeded. That is about 54% of the complete set of 68 instances and about 65% of the 57 that succeed with golden patches.
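The percentages can be recomputed from the reported counts. A minimal sketch of the arithmetic follows; the function name success_rate is illustrative and not taken from the benchmark code.

```python
def success_rate(successes, total):
    """Return the success rate as a percentage, rounded to one decimal."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * successes / total, 1)


oh_successes = 37      # OH run without oracle on 2025-04-10
all_instances = 68     # complete benchmark
golden_successes = 57  # golden-patch successes on 2025-04-09

print(success_rate(oh_successes, all_instances))     # 54.4
print(success_rate(oh_successes, golden_successes))  # 64.9
```

Reporting against the golden-patch successes as well as the full set separates failures of the agent from instances that currently fail even with the reference fix.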