Localization Evaluation for SWE-Bench
This folder implements localization evaluation at both the file and function levels to complement the assessment of agent inference on SWE-Bench.
1. Environment Setup
- Python env: Install the Python environment
- LLM config: Configure the LLM settings
2. Inference & Evaluation
- Inference and evaluation follow the original `run_infer.sh` and `run_eval.sh` implementation
- You may refer to the instructions in README.md for running inference and evaluation on SWE-Bench
3. Localization Evaluation
- Localization evaluation computes two-level localization accuracy, and also considers task success as an additional metric for overall evaluation (see the sketch after this list):
  - File Localization Accuracy: Accuracy of correctly localizing the target file
  - Function Localization Accuracy: Accuracy of correctly localizing the target function
  - Resolve Rate (auto-skipped if resolution results are missing): Success rate of tasks being resolved
  - File Localization Efficiency: Average number of iterations taken to successfully localize the target file
  - Function Localization Efficiency: Average number of iterations taken to successfully localize the target function
  - Task Success Efficiency: Average number of iterations taken to resolve the task
  - Resource Efficiency: The API expenditure of the agent running inference on SWE-Bench instances
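To make the aggregation concrete, here is a minimal sketch of how the accuracy and efficiency numbers above could be computed from per-instance records, including the align-with-max behavior described under Format below. It is an illustration under stated assumptions, not the evaluator's actual implementation; the field names `file_hit_iter` and `func_hit_iter` are hypothetical.

```python
# Minimal sketch (assumptions, not the repo's actual code) of aggregating
# the localization metrics. `file_hit_iter` / `func_hit_iter` are
# hypothetical fields: the 1-based iteration at which the target file /
# function was first correctly localized, or None if localization failed.

def summarize(records, max_iter, align_with_max=True):
    n = len(records)

    def accuracy(key):
        # Fraction of instances where localization succeeded at all.
        return sum(r[key] is not None for r in records) / n

    def efficiency(key):
        iters = []
        for r in records:
            if r[key] is not None:
                iters.append(r[key])
            elif align_with_max:
                # Align failure indices with the maximum iteration budget,
                # as --align-with-max true does, so failures still count.
                iters.append(max_iter)
        return sum(iters) / len(iters) if iters else None

    return {
        "file_loc_accuracy": accuracy("file_hit_iter"),
        "func_loc_accuracy": accuracy("func_hit_iter"),
        "file_loc_efficiency": efficiency("file_hit_iter"),
        "func_loc_efficiency": efficiency("func_hit_iter"),
    }


# Example: two instances, max_iter=100; the second never finds the function.
records = [
    {"file_hit_iter": 3, "func_hit_iter": 7},
    {"file_hit_iter": 12, "func_hit_iter": None},
]
print(summarize(records, max_iter=100))
```

With `align_with_max=True`, failed instances contribute `max_iter` iterations to the efficiency averages, so failures penalize the score instead of being silently dropped.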
- Run localization evaluation

  - Format:

    ```bash
    ./evaluation/benchmarks/swe_bench/scripts/eval_localization.sh [infer-dir] [split] [dataset] [max-infer-turn] [align-with-max]
    ```

    - infer-dir: inference directory containing inference outputs
    - split: SWE-Bench dataset split to use
    - dataset: SWE-Bench dataset name
    - max-infer-turn: the maximum number of iterations the agent took to run inference
    - align-with-max: whether to align failure indices (e.g., incorrect localization, unresolved tasks) with `max_iter`
  - Example:

    ```bash
    ./evaluation/benchmarks/swe_bench/scripts/eval_localization.sh \
      --infer-dir ./evaluation/evaluation_outputs/outputs/princeton-nlp__SWE-bench_Verified-test/CodeActAgent/gpt_4o_100_N \
      --split test \
      --dataset princeton-nlp/SWE-bench_Verified \
      --max-infer-turn 100 \
      --align-with-max true
    ```
- Localization evaluation results will be automatically saved to `[infer-dir]/loc_eval`
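To look at the saved results programmatically, the following minimal sketch lists and prints whatever JSON files the evaluator wrote under `[infer-dir]/loc_eval`; treating the outputs as JSON is an assumption, since this README does not specify the file format or names.

```python
# Sketch: inspect localization evaluation results saved under
# [infer-dir]/loc_eval. Assumes results are JSON files; the actual
# filenames are not specified in this README.
import json
from pathlib import Path

infer_dir = Path(
    "./evaluation/evaluation_outputs/outputs/"
    "princeton-nlp__SWE-bench_Verified-test/CodeActAgent/gpt_4o_100_N"
)

for path in sorted((infer_dir / "loc_eval").glob("*.json")):
    print(path.name, json.loads(path.read_text()))
```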