Localization Evaluation for SWE-Bench

This folder implements localization evaluation at both the file and function levels, complementing the assessment of agent inference on SWE-Bench.

1. Environment Setup

2. Inference & Evaluation

  • Inference and evaluation follow the original run_infer.sh and run_eval.sh implementations
    • Refer to the instructions in README.md for running inference and evaluation on SWE-Bench

3. Localization Evaluation

  • Localization evaluation computes localization accuracy at two levels, and also considers task success as an additional metric for overall evaluation:

    • File Localization Accuracy: Accuracy of correctly localizing the target file
    • Function Localization Accuracy: Accuracy of correctly localizing the target function
    • Resolve Rate (auto-skipped if missing): Fraction of tasks that are successfully resolved
    • File Localization Efficiency: Average number of iterations taken to successfully localize the target file
    • Function Localization Efficiency: Average number of iterations taken to successfully localize the target function
    • Task Success Efficiency: Average number of iterations taken to resolve the task
    • Resource Efficiency: API expenditure of the agent while running inference on SWE-Bench instances
  • Run localization evaluation

    • Format:

      ./evaluation/benchmarks/swe_bench/scripts/eval_localization.sh \
          --infer-dir <infer-dir> \
          --split <split> \
          --dataset <dataset> \
          --max-infer-turn <max-infer-turn> \
          --align-with-max <align-with-max>
      
      • infer-dir: inference directory containing inference outputs
      • split: SWE-Bench dataset split to use
      • dataset: SWE-Bench dataset name
      • max-infer-turn: the maximum number of iterations the agent was allowed during inference
      • align-with-max: whether to align failure indices (e.g., incorrect localization, unresolved tasks) with max_iter
    • Example:

      # Example
      ./evaluation/benchmarks/swe_bench/scripts/eval_localization.sh \
          --infer-dir ./evaluation/evaluation_outputs/outputs/princeton-nlp__SWE-bench_Verified-test/CodeActAgent/gpt_4o_100_N \
          --split test \
          --dataset princeton-nlp/SWE-bench_Verified \
          --max-infer-turn 100 \
          --align-with-max true
      
  • Localization evaluation results will be automatically saved to [infer-dir]/loc_eval
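The accuracy and efficiency metrics in step 3 can be sketched in Python as follows. This is a minimal illustration, not the actual implementation behind eval_localization.sh: the record fields (pred_file, gold_file, file_loc_iter, etc.) and the function name are hypothetical, and the real output schema under [infer-dir]/loc_eval may differ. It shows how accuracy is a simple hit rate, and how align-with-max counts failed localizations as max-infer-turn iterations when averaging efficiency.

```python
"""Hedged sketch of two-level localization metrics (hypothetical schema)."""
from statistics import mean


def localization_metrics(records, max_iter, align_with_max=True):
    """Compute file/function localization accuracy and mean-iteration efficiency.

    Each record is assumed to carry predicted vs. gold targets and the
    iteration at which the agent first localized each target.
    """
    def level(pred_key, gold_key, iter_key):
        hits = [r[pred_key] == r[gold_key] for r in records]
        accuracy = mean(hits)
        # Efficiency: iterations to localize. With align_with_max, failures
        # are included and charged the full iteration budget (max_iter);
        # otherwise only successful localizations are averaged.
        iters = [
            r[iter_key] if hit else max_iter
            for r, hit in zip(records, hits)
            if hit or align_with_max
        ]
        efficiency = mean(iters) if iters else float("nan")
        return accuracy, efficiency

    file_acc, file_eff = level("pred_file", "gold_file", "file_loc_iter")
    func_acc, func_eff = level("pred_func", "gold_func", "func_loc_iter")
    return {
        "file_loc_acc": file_acc,
        "file_loc_eff": file_eff,
        "func_loc_acc": func_acc,
        "func_loc_eff": func_eff,
    }


if __name__ == "__main__":
    # Two toy instances: one fully correct at iteration 3/5, one fully wrong.
    records = [
        {"pred_file": "a.py", "gold_file": "a.py", "file_loc_iter": 3,
         "pred_func": "f", "gold_func": "f", "func_loc_iter": 5},
        {"pred_file": "b.py", "gold_file": "c.py", "file_loc_iter": 10,
         "pred_func": "g", "gold_func": "h", "func_loc_iter": 10},
    ]
    print(localization_metrics(records, max_iter=100, align_with_max=True))
```

With align-with-max enabled, the failed instance contributes 100 iterations to each efficiency average (file efficiency (3 + 100) / 2 = 51.5); with it disabled, only the successful instance is averaged (file efficiency 3.0).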