mirror of
https://github.com/OpenHands/OpenHands.git
synced 2026-03-22 05:37:20 +08:00
36 lines
1.6 KiB
Markdown
# CI Builds Repair Benchmark Integration

This module integrates the CI Builds Repair benchmark developed by [JetBrains-Research](https://github.com/JetBrains-Research/lca-baselines/tree/main/ci-builds-repair/ci-builds-repair-benchmark).

For more information, refer to the [GitHub repository](https://github.com/JetBrains-Research/lca-baselines/tree/main/ci-builds-repair/ci-builds-repair-benchmark) and the associated [research paper](https://arxiv.org/abs/2406.11612).
See the notice below for details.

## Setup

Before running any scripts, configure the benchmark by setting up `config.yaml`.
This benchmark pushes to JetBrains' private GitHub repository, so you will need to request a `token_gh` from their team before you can run it.
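Because a missing token only shows up once the benchmark tries to push, it can help to check the configuration first. The following is a minimal sketch, assuming `config.yaml` stores `token_gh` as a top-level `key: value` entry; the helper names `read_flat_yaml` and `has_token` are hypothetical and not part of the benchmark.

```python
def read_flat_yaml(text: str) -> dict[str, str]:
    """Collect top-level `key: value` lines. This is NOT a general YAML parser."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        stripped = line.strip()
        # Skip blank lines, full-line comments, and indented (nested) entries.
        if not stripped or stripped.startswith("#") or line[0].isspace():
            continue
        if ":" not in stripped:
            continue
        key, _, value = stripped.partition(":")
        # Drop surrounding whitespace and simple quoting around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def has_token(text: str, key: str = "token_gh") -> bool:
    """Return True if `key` is present with a non-empty value (assumed layout)."""
    return bool(read_flat_yaml(text).get(key))
```

To use it, read the file yourself, for example `has_token(open("config.yaml").read())`, adjusting the path to wherever your `config.yaml` lives. If the real file nests `token_gh` under another key, use a proper YAML library instead.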

## Inference

To run inference with your model:

```bash
./evaluation/benchmarks/lca_ci_build_repair/scripts/run_infer.sh llm.yourmodel
```

## Evaluation

To evaluate the predictions:

```bash
./evaluation/benchmarks/lca_ci_build_repair/scripts/eval_infer.sh predictions_path_containing_output
```

## Results

The benchmark contains 68 instances. We skip instances #126 and #145 because of dockerization errors, so only 66 instances are run.

Because the benchmark runs on live GitHub machines, its results depend on the date on which it is run. Even the golden patches in the dataset may fail because of upstream updates.
For example, on 2025-04-09, running the benchmark against the golden patches gave 57/67 successes, with 1 job left in the waiting list.

On 2025-04-10, running the full benchmark with OpenHands (OH) and no oracle, 37 instances succeeded. That is 54% of the complete set of 68 instances (37/68) and about 65% of the 57 that succeed with golden patches (37/57).
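The two percentages use different denominators, so it can be useful to recompute them when the benchmark is rerun on another date. A minimal sketch, using only the counts reported above; the function name `success_rate` and the constant names are illustrative, not part of the benchmark:

```python
def success_rate(successes: int, total: int) -> float:
    """Return the success rate as a percentage of `total`."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * successes / total


# Counts taken from the results reported in this README.
TOTAL_INSTANCES = 68    # complete benchmark set
GOLDEN_SUCCESSES = 57   # instances that passed with golden patches (2025-04-09)
OH_SUCCESSES = 37       # instances that passed with OpenHands, no oracle (2025-04-10)

print(f"{success_rate(OH_SUCCESSES, TOTAL_INSTANCES):.1f}% of all instances")
print(f"{success_rate(OH_SUCCESSES, GOLDEN_SUCCESSES):.1f}% of golden-patch successes")
```

With these counts the rates come to roughly 54.4% and 64.9%. Because golden-patch results drift over time, the second denominator should be measured close to the date of the model run.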