Localization Evaluation for SWE-Bench

This folder implements localization evaluation at both the file and function levels, complementing the assessment of agent inference on SWE-Bench.

1. Environment Setup

2. Inference & Evaluation

  • Inference and evaluation follow the original run_infer.sh and run_eval.sh implementations
    • Refer to the instructions in README.md for running inference and evaluation on SWE-Bench

3. Localization Evaluation

  • Localization evaluation computes localization accuracy at two levels, and also considers task success as an additional metric for overall evaluation:

    • File Localization Accuracy: Accuracy of correctly localizing the target file
    • Function Localization Accuracy: Accuracy of correctly localizing the target function
    • Resolve Rate (auto-skipped if missing): Fraction of tasks that are successfully resolved
    • File Localization Efficiency: Average number of iterations taken to successfully localize the target file
    • Function Localization Efficiency: Average number of iterations taken to successfully localize the target function
    • Task Success Efficiency: Average number of iterations taken to resolve the task
    • Resource Efficiency: API expenditure of the agent while running inference on SWE-Bench instances
  • Run localization evaluation

    • Format:

      ./evaluation/benchmarks/swe_bench/scripts/eval_localization.sh \
          --infer-dir <infer-dir> \
          --split <split> \
          --dataset <dataset> \
          --max-infer-turn <max-infer-turn> \
          --align-with-max <align-with-max>
      
      • infer-dir: inference directory containing inference outputs
      • split: SWE-Bench dataset split to use
      • dataset: SWE-Bench dataset name
      • max-infer-turn: the maximum number of iterations the agent was allowed during inference
      • align-with-max: whether to align failure indices (e.g., incorrect localization, unresolved tasks) with max_iter
    • Example:

      # Example
      ./evaluation/benchmarks/swe_bench/scripts/eval_localization.sh \
          --infer-dir ./evaluation/evaluation_outputs/outputs/princeton-nlp__SWE-bench_Verified-test/CodeActAgent/gpt_4o_100_N \
          --split test \
          --dataset princeton-nlp/SWE-bench_Verified \
          --max-infer-turn 100 \
          --align-with-max true
      
  • Localization evaluation results will be automatically saved to [infer-dir]/loc_eval
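The accuracy and efficiency metrics in step 3 can be sketched in Python as follows. This is a minimal illustration, not the actual implementation behind eval_localization.sh: the record fields (pred_file, gold_file, file_loc_iter, etc.) and the function name are hypothetical, and the real output schema under [infer-dir]/loc_eval may differ. It shows how accuracy is a simple hit rate, and how align-with-max counts failed localizations as max-infer-turn iterations when averaging efficiency.

```python
"""Hedged sketch of two-level localization metrics (hypothetical schema)."""
from statistics import mean


def localization_metrics(records, max_iter, align_with_max=True):
    """Compute file/function localization accuracy and mean-iteration efficiency.

    Each record is assumed to carry predicted vs. gold targets and the
    iteration at which the agent first localized each target.
    """
    def level(pred_key, gold_key, iter_key):
        hits = [r[pred_key] == r[gold_key] for r in records]
        accuracy = mean(hits)
        # Efficiency: iterations to localize. With align_with_max, failures
        # are included and charged the full iteration budget (max_iter);
        # otherwise only successful localizations are averaged.
        iters = [
            r[iter_key] if hit else max_iter
            for r, hit in zip(records, hits)
            if hit or align_with_max
        ]
        efficiency = mean(iters) if iters else float("nan")
        return accuracy, efficiency

    file_acc, file_eff = level("pred_file", "gold_file", "file_loc_iter")
    func_acc, func_eff = level("pred_func", "gold_func", "func_loc_iter")
    return {
        "file_loc_acc": file_acc,
        "file_loc_eff": file_eff,
        "func_loc_acc": func_acc,
        "func_loc_eff": func_eff,
    }


if __name__ == "__main__":
    # Two toy instances: one fully correct at iteration 3/5, one fully wrong.
    records = [
        {"pred_file": "a.py", "gold_file": "a.py", "file_loc_iter": 3,
         "pred_func": "f", "gold_func": "f", "func_loc_iter": 5},
        {"pred_file": "b.py", "gold_file": "c.py", "file_loc_iter": 10,
         "pred_func": "g", "gold_func": "h", "func_loc_iter": 10},
    ]
    print(localization_metrics(records, max_iter=100, align_with_max=True))
```

With align-with-max enabled, the failed instance contributes 100 iterations to each efficiency average (file efficiency (3 + 100) / 2 = 51.5); with it disabled, only the successful instance is averaged (file efficiency 3.0).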