SWE-bench_verified instruction baseline improvements to 60% (#7546)

Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
2025-12-26 05:48:36 +08:00 · 2025-04-10 18:08:27 +02:00 · 2025-04-10 18:08:27 +02:00 · 53c0c5a07b
commit 53c0c5a07b
parent d924e7cea5
1 changed files with 42 additions and 27 deletions
--- a/evaluation/benchmarks/swe_bench/run_infer.py
+++ b/evaluation/benchmarks/swe_bench/run_infer.py
@ -84,39 +84,54 @@ I've already taken care of all changes to any of the test files described in the
 Also the development Python environment is already set up for you (i.e., all dependencies already installed), so you don't need to install other packages.
 Your task is to make the minimal changes to non-test files in the /workspace/{workspace_dir_name} directory to ensure the <issue_description> is satisfied.

-Follow these steps to resolve the issue:
+Follow these phases to resolve the issue:

-1. EXPLORATION: First, thoroughly explore the repository structure using tools like `find` and `grep`.
-   - Identify all files mentioned in the problem statement
-   - Locate where the issue occurs in the codebase
-   - Understand the surrounding context and dependencies
-   - Use `grep` to search for relevant functions, classes, or error messages
+Phase 1. READING: read the problem and reword it in clearer terms
+   1.1 If there are code or config snippets. Express in words any best practices or conventions in them.
+   1.2 Hightlight message errors, method names, variables, file names, stack traces, and technical details.
+   1.3 Explain the problem in clear terms.
+   1.4 Enumerate the steps to reproduce the problem.
+   1.5 Hightlight any best practices to take into account when testing and fixing the issue

-2. ANALYSIS: Based on your exploration, think carefully about the problem and propose 2-5 possible approaches to fix the issue.
-   - Analyze the root cause of the problem
-   - Consider trade-offs between different solutions
-   - Select the most promising approach and explain your reasoning
+Phase 2. RUNNING: install and run the tests on the repository
+   2.1 Follow the readme
+   2.2 Install the environment and anything needed
+   2.2 Iterate and figure out how to run the tests 

-3. TEST CREATION: Before implementing any fix, create a script to reproduce and verify the issue.
-   - Look at existing test files in the repository to understand the test format/structure
-   - Create a minimal reproduction script that demonstrates the issue
-   - Run your script to confirm the error exists
+Phase 3. EXPLORATION: find the files that are related to the problem and possible solutions
+   3.1 Use `grep` to search for relevant methods, classes, keywords and error messages.
+   3.2 Identify all files related to the problem statement.
+   3.3 Propose the methods and files to fix the issue and explain why.
+   3.4 From the possible file locations, select the most likely location to fix the issue.

-4. IMPLEMENTATION: Edit the source code to implement your chosen solution.
-   - Make minimal, focused changes to fix the issue
+Phase 4. TEST CREATION: before implementing any fix, create a script to reproduce and verify the issue.
+   4.1 Look at existing test files in the repository to understand the test format/structure.
+   4.2 Create a minimal reproduction script that reproduces the located issue.
+   4.3 Run the reproduction script to confirm you are reproducing the issue.
+   4.4 Adjust the reproduction script as necessary.

-5. VERIFICATION: Test your implementation thoroughly.
-   - Run your reproduction script to verify the fix works
-   - Add edge cases to your test script to ensure comprehensive coverage
-   - Run existing tests related to the modified code to ensure you haven't broken anything
+Phase 5. FIX ANALYSIS: state clearly the problem and how to fix it
+   5.1 State clearly what the problem is.
+   5.2 State clearly where the problem is located.
+   5.3 State clearly how the test reproduces the issue.
+   5.4 State clearly the best practices to take into account in the fix.
+   5.5 State clearly how to fix the problem.

-6. FINAL REVIEW: Carefully re-read the problem description and compare your changes with the base commit {instance["base_commit"]}.
-   - Ensure you've fully addressed all requirements
-   - Run any tests in the repository related to:
-     * The issue you are fixing
-     * The files you modified
-     * The functions you changed
-   - If any tests fail, revise your implementation until all tests pass
+Phase 6. FIX IMPLEMENTATION: Edit the source code to implement your chosen solution.
+   6.1 Make minimal, focused changes to fix the issue.
+
+Phase 7. VERIFICATION: Test your implementation thoroughly.
+   7.1 Run your reproduction script to verify the fix works.
+   7.2 Add edge cases to your test script to ensure comprehensive coverage.
+   7.3 Run existing tests related to the modified code to ensure you haven't broken anything.
+
+8. FINAL REVIEW: Carefully re-read the problem description and compare your changes with the base commit {instance["base_commit"]}.
+   8.1 Ensure you've fully addressed all requirements.
+   8.2 Run any tests in the repository related to:
+     8.2.1 The issue you are fixing
+     8.2.2 The files you modified
+     8.2.3 The functions you changed
+   8.3 If any tests fail, revise your implementation until all tests pass

 Be thorough in your exploration, testing, and reasoning. It's fine if your thinking process is lengthy - quality and completeness are more important than brevity.
 """