mirror of
https://github.com/OpenHands/OpenHands.git
synced 2025-12-25 21:36:52 +08:00
feat(SWE-Bench environment) integrate SWE-Bench sandbox (#1468)
* add draft dockerfile for build all * add rsync for build * add all-in-one docker * update prepare scripts * Update swe_env_box.py * Add swe_entry.sh (buggy now) * Parse the test command in swe_entry.sh * Update README for instance eval in sandbox * revert specialized config * replace run_as_devin as an init arg * set container & run_as_root via args * update swe entry script * update env * remove mounting * allow error after swe_entry * update swe_env_box * move file * update gitignore * get swe_env_box a working demo * support faking user response & provide sandox ahead of time; also return state for controller * tweak main to support adding controller kwargs * add module * initialize plugin for provided sandbox * add pip cache to plugin & fix jupyter kernel waiting * better print Observation output * add run infer scripts * update readme * add utility for getting diff patch * use get_diff_patch in infer * update readme * support cost tracking for codeact * add swe agent edit hack * disable color in git diff * fix git diff cmd * fix state return * support limit eval * increase t imeout and export pip cache * add eval limit config * return state when hit turn limit * save log to file; allow agent to give up * run eval with max 50 turns * add outputs to gitignore * save swe_instance & instruction * add uuid to swebench * add streamlit dep * fix save series * fix the issue where session id might be duplicated * allow setting temperature for llm (use 0 for eval) * Get report from agent running log * support evaluating task success right after inference. * remove extra log * comment out prompt for baseline * add visualizer for eval * use plaintext for instruction * reduce timeout for all; only increase timeout for init * reduce timeout for all; only increase timeout for init * ignore sid for swe env * close sandbox in each eval loop * update visualizer instruction * increase max chars * add finish action to history too * show test result in metrics * add sidebars for visualizer * also visualize swe_instance * cleanup browser when agent controller finish runinng * do not mount workspace for swe-eval to avoid accidentally overwrite files * Revert "do not mount workspace for swe-eval to avoid accidentally overwrite files" This reverts commit 8ef77390543e562e6f0a5a9992418014d8b3010c. * Revert "Revert "do not mount workspace for swe-eval to avoid accidentally overwrite files"" This reverts commit 016cfbb9f0475f32bacbad5822996b4eaff24a5e. 
* run jupyter command via copy to, instead of cp to mount * only print mixin output when failed * change ssh box logging * add visualizer for pass rate * add instance id to sandbox name * only remove container we created * use opendevin logger in main * support multi-processing infer * add back metadata, support keyboard interrupt * remove container with startswith * make pbar behave correctly * update instruction w/ multi-processing * show resolved rate by repo * rename tmp dir name * attempt to fix racing for copy to ssh_box * fix script * bump swe-bench-all version * fix ipython with self-contained commands * add jupyter demo to swe_env_box * make resolved count two column * increase height * do not add glob to url params * analyze obs length * print instance id prior to removal handler * add gold patch in visualizer * fix interactive git by adding a git --no-pager as alias * increase max_char to 10k to cover 98% of swe-bench obs cases * allow parsing note * prompt v2 * add iteration reminder * adjust user response * adjust order * fix return eval * fix typo * add reminder before logging * remove other resolve rate * re adjust to new folder structure * support adding eval note * fix eval note path * make sure first log of each instance is printed * add eval note * fix the display for visualizer * tweak visualizer for better git patch reading * exclude empty patch * add retry mechanism for swe_env_box start * fix ssh timeout issue * add stat field for apply test patch success * add visualization for fine-grained report * attempt to support monologue agent by constraining it to single thread * also log error msg when stopeed * save error as well * override WORKSPACE_MOUNT_PATH and WORKSPACE_BASE for monologue to work in mp * add retry mechanism for sshbox * remove retry for swe env box * try to handle loop state stopped * Add get report scripts * Add script to convert agent output to swe-bench format * Merge fine grained report for visualizer * Update eval readme * Update README.md * Add CodeAct gpt4-1106 output and eval logs on swe-bench-lite * Update the script to get model report * Update get_model_report.sh * Update get_agent_report.sh * Update report merge script * Add agent output conversion script * Update swe_lite_env_setup.sh * Add example swe-bench output files * Update eval readme * Remove redundant scripts * set iteration count down to false by default * fix: Issue where CodeAct agent was trying to log cost on local llm and throwing Undefined Model execption out of litellm (#1666) * fix: Issue where CodeAct agent was trying to log cost on local llm and throwing Undefined Model execption out of litellm * Review Feedback * Missing None Check * Review feedback and improved error handling --------- Co-authored-by: Robert Brennan <accounts@rbren.io> * fix prepare_swe_util scripts * update builder images * update setup script * remove swe-bench build workflow * update lock * remove experiments since they are moved to hf * remove visualizer (since it is moved to hf repo) * simply jupyter execution via heredoc * update ssh_box * add initial docker readme * add pkg-config as dependency * add script for swe_bench all-in-one docker * add rsync to builder * rename var * update commit * update readme * update lock * support specify timeout for long running tasks * fix path * separate building of all deps and files * support returning states at the end of controller * remove return None * support specify timeout for long running tasks * add timeout for all existing sandbox impl * fix 
swe_env_box for new codebase * update llm config in config.py * support pass sandbox in * remove force set * update eval script * fix issue of overriding final state * change default eval output to hf demo * change default eval output to hf demo * fix config * only close it when it is NOT external sandbox * add scripts * tweak config * only put in hostory when state has history attr * fix agent controller on the case of run out interaction budget * always assume state is always not none * remove print of final state * catch all exception when cannot compute completion cost * Update README.md * save source into json * fix path * update docker path * return the final state on close * merge AgentState with State * fix integration test * merge AgentState with State * fix integration test * add ChangeAgentStateAction to history in attempt to fix integration * add back set agent state * update tests * update tests * move scripts for setup * update script and readme for infer * do not reset logger when n processes == 1 * update eval_infer scripts and readme * simplify readme * copy over dir after eval * copy over dir after eval * directly return get state * update lock * fix output saving of infer * replace print with logger * update eval_infer script * add back the missing .close * increase timeout * copy all swe_bench_format file * attempt to fix output parsing * log git commit id as metadata * fix eval script * update lock * update unit tests * fix argparser unit test * fix lock * the deps are now lightweight enough to be incude in make build * add spaces for tests * add eval outputs to gitignore * remove git submodule * readme * tweak git email * update upload instruction * bump codeact version for eval --------- Co-authored-by: Bowen Li <libowen.ne@gmail.com> Co-authored-by: huybery <huybery@gmail.com> Co-authored-by: Bart Shappee <bshappee@gmail.com> Co-authored-by: Robert Brennan <accounts@rbren.io>
This commit is contained in:
parent
a84d19f03c
commit
2406b901df
4
.gitignore
vendored
@ -202,6 +202,8 @@ cache
|
||||
|
||||
# configuration
|
||||
config.toml
|
||||
|
||||
evaluation/swe_bench/eval_workspace
|
||||
evaluation/outputs
|
||||
evaluation/evaluation_outputs
|
||||
test_results*
|
||||
/_test_files_tmp/
|
||||
|
||||
2
Makefile
@ -135,7 +135,7 @@ install-python-dependencies:
|
||||
export HNSWLIB_NO_NATIVE=1; \
|
||||
poetry run pip install chroma-hnswlib; \
|
||||
fi
|
||||
@poetry install --without evaluation
|
||||
@poetry install
|
||||
@if [ -f "/etc/manjaro-release" ]; then \
|
||||
echo "$(BLUE)Detected Manjaro Linux. Installing Playwright dependencies...$(RESET)"; \
|
||||
poetry run pip install playwright; \
|
||||
|
||||
@ -276,7 +276,10 @@ class CodeActAgent(Agent):
|
||||
raise NotImplementedError('Implement this abstract method')
|
||||
|
||||
def log_cost(self, response):
|
||||
cur_cost = self.llm.completion_cost(response)
|
||||
try:
|
||||
cur_cost = self.llm.completion_cost(response)
|
||||
except Exception:
|
||||
cur_cost = 0
|
||||
self.cost_accumulator += cur_cost
|
||||
logger.info(
|
||||
'Cost: %.2f USD | Accumulated Cost: %.2f USD',
|
||||
|
||||
@ -4,76 +4,21 @@ This folder contains code and resources to run experiments and evaluations.
|
||||
|
||||
## Logistics
|
||||
To better organize the evaluation folder, we should follow the rules below:
|
||||
- Each subfolder contains a specific benchmark or experiment. For example, `evaluation/SWE-bench` should contain
|
||||
- Each subfolder contains a specific benchmark or experiment. For example, `evaluation/swe_bench` should contain
|
||||
all the preprocessing/evaluation/analysis scripts.
|
||||
- Raw data and experimental records should not be stored within this repo (e.g. Google Drive or Hugging Face Datasets).
|
||||
- Raw data and experimental records should not be stored within this repo.
|
||||
- For model outputs, they should be stored at [this huggingface space](https://huggingface.co/spaces/OpenDevin/evaluation) for visualization.
|
||||
- Important data files of manageable size and analysis scripts (e.g., jupyter notebooks) can be directly uploaded to this repo.
|
||||
|
||||
## Roadmap
|
||||
## Supported Benchmarks
|
||||
|
||||
- Sanity check. Reproduce Devin's scores on SWE-bench using the released outputs to make sure that our harness pipeline works.
|
||||
- Open source model support.
|
||||
- Contributors are encouraged to submit their commits to our [forked SWE-bench repo](https://github.com/OpenDevin/SWE-bench).
|
||||
- Ensure compatibility with OpenAI interface for inference.
|
||||
- Serve open source models, prioritizing high concurrency and throughput.
|
||||
- SWE-Bench: [`evaluation/swe_bench`](./swe_bench)
|
||||
|
||||
## SWE-bench
|
||||
- notebooks
|
||||
- `devin_eval_analysis.ipynb`: notebook analyzing devin's outputs
|
||||
- scripts
|
||||
- `prepare_devin_outputs_for_evaluation.py`: script fetching and converting [devin's output](https://github.com/CognitionAI/devin-swebench-results/tree/main) into the desired json file for evaluation.
|
||||
- usage: `python prepare_devin_outputs_for_evaluation.py <setting>` where setting can be `passed`, `failed` or `all`
|
||||
- resources
|
||||
- Devin related SWE-bench test subsets
|
||||
- [🤗 OpenDevin/SWE-bench-devin-passed](https://huggingface.co/datasets/OpenDevin/SWE-bench-devin-passed)
|
||||
- [🤗 OpenDevin/SWE-bench-devin-full-filtered](https://huggingface.co/datasets/OpenDevin/SWE-bench-devin-full-filtered)
|
||||
- Devin's outputs, processed for evaluation, are available on [Huggingface](https://huggingface.co/datasets/OpenDevin/Devin-SWE-bench-output)
|
||||
- get predictions that passed the test: `wget https://huggingface.co/datasets/OpenDevin/Devin-SWE-bench-output/raw/main/devin_swe_passed.json`
|
||||
- get all predictions `wget https://huggingface.co/datasets/OpenDevin/Devin-SWE-bench-output/raw/main/devin_swe_outputs.json`
|
||||
### Result Visualization
|
||||
|
||||
See [`SWE-bench/README.md`](./SWE-bench/README.md) for more details on how to run SWE-Bench for evaluation.
|
||||
Check [this huggingface space](https://huggingface.co/spaces/OpenDevin/evaluation) for visualization of existing experimental results.
|
||||
|
||||
### Results
|
||||
|
||||
We have refined the original SWE-bench evaluation pipeline to enhance its efficiency and reliability. The updates are as follows:
|
||||
- Reuse testbeds and Conda environments.
|
||||
- Additionally try the `patch` command for patch application if `git apply` fails (see the sketch below).
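A minimal sketch of this fallback, assuming a hypothetical patch file named `model.patch` (not the exact harness code):

```bash
# Try `git apply` first; if it fails, fall back to the `patch` command.
if ! git apply model.patch; then
    echo "git apply failed; retrying with patch"
    patch -p1 < model.patch
fi
```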
|
||||
### Upload your results
|
||||
|
||||
#### Results on SWE-bench-devin-passed
|
||||
|
||||
[🤗 OpenDevin/SWE-bench-devin-passed](https://huggingface.co/datasets/OpenDevin/SWE-bench-devin-passed)
|
||||
|
||||
| Model/Agent | #instances | #init | #apply | #resolve |
|
||||
|------------------------|------------|-------|--------|----------|
|
||||
| Gold | 79 | 79 | 79 | 79 |
|
||||
| Devin | 79 | 79 | 76 | 76 |
|
||||
|
||||
#init: number of instances where testbeds have been successfully initialized.
|
||||
|
||||
In the 3 Devin-failed instances (see below), Devin has made changes to the tests, which are incompatible with the provided test patch and cause failures during patch application. The evaluation adopted by Devin does not seem to align with the original SWE-bench evaluation.
|
||||
|
||||
```shell
|
||||
django__django-11244
|
||||
scikit-learn__scikit-learn-10870
|
||||
sphinx-doc__sphinx-9367
|
||||
```
|
||||
|
||||
#### Results on SWE-bench-devin-failed
|
||||
|
||||
| Model/Agent | #instances | #init | #apply | #resolve |
|
||||
|------------------------|------------|-------|--------|----------|
|
||||
| Gold | 491 | 491 | 491 | 371 |
|
||||
| Devin | 491 | 491 | 463 | 7 |
|
||||
|
||||
Devin **passes** 7 instances on the `SWE-bench-devin-failed` subset. The SWE-bench dataset appears to be noisy, as evidenced by 120 instances where even the gold patches do not pass.
|
||||
|
||||
We have filtered out the problematic 120 instances, resulting in the creation of the `SWE-bench-devin-full-filtered` subset.
|
||||
|
||||
## Results on SWE-bench-devin-full-filtered
|
||||
|
||||
[🤗 OpenDevin/SWE-bench-devin-full-filtered](https://huggingface.co/datasets/OpenDevin/SWE-bench-devin-full-filtered)
|
||||
|
||||
| Model/Agent | #instances | #init | #apply | #resolve |
|
||||
|------------------------|------------|-------|--------|----------|
|
||||
| Gold | 450 | 450 | 450 | 450 |
|
||||
| Devin | 450 | 450 | 426 | 83 |
|
||||
You can start your own fork of [our huggingface evaluation outputs](https://huggingface.co/spaces/OpenDevin/evaluation) and submit a PR with your evaluation results to our hosted huggingface repo, following the guide [here](https://huggingface.co/docs/hub/en/repositories-pull-requests-discussions#pull-requests-and-discussions).
|
||||
|
||||
@ -1,80 +0,0 @@
|
||||
# SWE-Bench Evaluation
|
||||
|
||||
Work in-progress.
|
||||
|
||||
**TODOs**:
|
||||
|
||||
- [ ] Generate `predictions` files given an OpenDevin `Agent` implementation. We could borrow something from [devin's eval-harness implementation](https://github.com/CognitionAI/devin-swebench-results/tree/main/harness), for example, [how to generate `TestSpec`](https://github.com/CognitionAI/devin-swebench-results/blob/main/harness/scripts.py#L150-L160).
|
||||
- [ ] Make sure the evaluation suite runs on all repos. I only tested on `matplotlib` so far; `scikit-learn` does not work for now (see [this issue](https://github.com/princeton-nlp/SWE-bench/issues/57)).
|
||||
|
||||
|
||||
## Run tests for a prediction file inside a docker container
|
||||
|
||||
Currently, the docker container should be able to run SWE-Bench. It was tested on `matplotlib`, but requires further testing to make sure it works on other repositories. Currently, [it does not work for `scikit-learn`](https://github.com/princeton-nlp/SWE-bench/issues/57).
|
||||
|
||||
### Setup example data
|
||||
|
||||
```bash
|
||||
cd evaluation/SWE-bench
|
||||
./scripts/prepare_devin_swe_bench_data.sh
|
||||
|
||||
# Clone the repo
|
||||
# This is a fork that fixes some issues that stop matplotlib from running (see https://github.com/princeton-nlp/SWE-bench/pull/56)
|
||||
git clone https://github.com/OpenDevin/SWE-bench.git
|
||||
|
||||
# Enter the docker container
|
||||
./scripts/run_docker_interactive.sh
|
||||
```
|
||||
|
||||
### Run evaluation
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
rm -rf data/logs/ data/testbeds/ # (Optional) remove previous outputs
|
||||
mkdir -p data/logs
|
||||
mkdir -p data/testbeds
|
||||
|
||||
python SWE-bench/harness/run_evaluation.py \
|
||||
--predictions_path data/predictions/devin_swe_outputs.json \
|
||||
--swe_bench_tasks data/processed/swe-bench-test.json \
|
||||
--log_dir data/logs \
|
||||
--testbed data/testbeds \
|
||||
--skip_existing \
|
||||
--timeout 900 \
|
||||
--verbose
|
||||
```
|
||||
|
||||
You will see command-line output similar to this (if successful):
|
||||
|
||||
```log
|
||||
swe-bench@2f3a6b9fcab2:/swe-bench$ ./harness/run_evaluation.sh
|
||||
/swe-bench/harness/run_evaluation.py:101: SyntaxWarning: assertion is always true, perhaps remove parentheses?
|
||||
assert(temp, datasets.arrow_dataset.Dataset)
|
||||
2024-03-20 09:21:18,796 - INFO - Found 1 predictions across 1 model(s) in predictions file
|
||||
2024-03-20 09:21:18,796 - INFO - [claude-2/matplotlib__matplotlib/3.6] # of predictions to evaluate: 1 (0 already evaluated)
|
||||
2024-03-20 09:21:18,797 - INFO - [Testbed] Creating log directory /swe-bench/data/logs/claude-2
|
||||
2024-03-20 09:21:18,797 - INFO - [Testbed] Using conda path /swe-bench/data/testbeds/claude-2/matplotlib__matplotlib/3.6/tmp09wrm708
|
||||
2024-03-20 09:21:18,797 - INFO - [Testbed] Using working directory /swe-bench/data/testbeds/claude-2/matplotlib__matplotlib/3.6/tmpfy1qth23 for testbed
|
||||
2024-03-20 09:21:18,797 - INFO - [Testbed] Repo matplotlib/matplotlib: 1 versions
|
||||
2024-03-20 09:21:18,797 - INFO - [Testbed] Version 3.6: 1 instances
|
||||
2024-03-20 09:21:18,797 - INFO - No conda path provided, creating temporary install in /swe-bench/data/testbeds/claude-2/matplotlib__matplotlib/3.6/tmp09wrm708/miniconda3...
|
||||
2024-03-20 09:21:27,482 - INFO - [Testbed] Using conda path /swe-bench/data/testbeds/claude-2/matplotlib__matplotlib/3.6/tmp09wrm708/miniconda3
|
||||
2024-03-20 09:21:27,942 - INFO - [Testbed] Setting up testbed for matplotlib__matplotlib__3.6
|
||||
2024-03-20 09:21:44,257 - INFO - [Testbed] Cloned matplotlib/matplotlib to /swe-bench/data/testbeds/claude-2/matplotlib__matplotlib/3.6/tmpfy1qth23/matplotlib__matplotlib__3.6
|
||||
2024-03-20 09:21:44,415 - INFO - [Testbed] Creating environment matplotlib__matplotlib__3.6; Command: /swe-bench/data/testbeds/claude-2/matplotlib__matplotlib/3.6/tmp09wrm708/miniconda3/bin/conda env create --file /swe-bench/data/testbeds/claude-2/matplotlib__matplotlib/3.6/tmpfy1qth23/environment.yml
|
||||
2024-03-20 09:23:39,781 - INFO - [Testbed] Installing pip packages for matplotlib__matplotlib__3.6; Command: . /swe-bench/data/testbeds/claude-2/matplotlib__matplotlib/3.6/tmp09wrm708/miniconda3/bin/activate matplotlib__matplotlib__3.6 && pip install pytest
|
||||
/swe-bench/data/testbeds/claude-2/matplotlib__matplotlib/3.6/tmpfy1qth23/matplotlib__matplotlib__3.6: 1 instances
|
||||
2024-03-20 09:23:42,309 - INFO - [matplotlib__matplotlib__3.6] [matplotlib__matplotlib-24362] Reset task environment to aca6e9d5e98811ca37c442217914b15e78127c89
|
||||
2024-03-20 09:23:42,314 - INFO - [matplotlib__matplotlib__3.6] [matplotlib__matplotlib-24362] Apply patch successful (pred_try)
|
||||
2024-03-20 09:23:42,318 - INFO - [matplotlib__matplotlib__3.6] [matplotlib__matplotlib-24362] Revert patch successful (pred_try)
|
||||
2024-03-20 09:23:42,318 - INFO - [matplotlib__matplotlib__3.6] [matplotlib__matplotlib-24362] Installing with command: . /swe-bench/data/testbeds/claude-2/matplotlib__matplotlib/3.6/tmp09wrm708/miniconda3/bin/activate matplotlib__matplotlib__3.6 && echo 'activate successful' && python -m pip install -e .
|
||||
2024-03-20 09:24:54,966 - INFO - [matplotlib__matplotlib__3.6] [matplotlib__matplotlib-24362] Installation successful
|
||||
2024-03-20 09:24:54,970 - INFO - [matplotlib__matplotlib__3.6] [matplotlib__matplotlib-24362] Apply patch successful (test)
|
||||
2024-03-20 09:24:54,974 - INFO - [matplotlib__matplotlib__3.6] [matplotlib__matplotlib-24362] Apply patch successful (pred)
|
||||
2024-03-20 09:25:04,775 - INFO - [matplotlib__matplotlib__3.6] [matplotlib__matplotlib-24362] Test script run successful
|
||||
swe-bench@2f3a6b9fcab2:/swe-bench$
|
||||
```
|
||||
|
||||
### Interpret Results
|
||||
|
||||
Then you can find the results under `data/logs` and interpret them following [this guide](https://github.com/princeton-nlp/SWE-bench/blob/main/tutorials/evaluation.md#-metrics).
|
||||
@ -1,155 +0,0 @@
|
||||
# @yaml
|
||||
# signature: search_dir <search_term> [<dir>]
|
||||
# docstring: searches for search_term in all files in dir. If dir is not provided, searches in the current directory
|
||||
# arguments:
|
||||
# search_term:
|
||||
# type: string
|
||||
# description: the term to search for
|
||||
# required: true
|
||||
# dir:
|
||||
# type: string
|
||||
# description: the directory to search in (if not provided, searches in the current directory)
|
||||
# required: false
|
||||
search_dir() {
|
||||
if [ $# -eq 1 ]; then
|
||||
local search_term="$1"
|
||||
local dir="./"
|
||||
elif [ $# -eq 2 ]; then
|
||||
local search_term="$1"
|
||||
if [ -d "$2" ]; then
|
||||
local dir="$2"
|
||||
else
|
||||
echo "Directory $2 not found"
|
||||
return
|
||||
fi
|
||||
else
|
||||
echo "Usage: search_dir <search_term> [<dir>]"
|
||||
return
|
||||
fi
|
||||
dir=$(realpath "$dir")
|
||||
local matches=$(find "$dir" -type f ! -path '*/.*' -exec grep -nIH "$search_term" {} + | cut -d: -f1 | sort | uniq -c)
|
||||
# if no matches, return
|
||||
if [ -z "$matches" ]; then
|
||||
echo "No matches found for \"$search_term\" in $dir"
|
||||
return
|
||||
fi
|
||||
# Calculate total number of matches
|
||||
local num_matches=$(echo "$matches" | awk '{sum+=$1} END {print sum}')
|
||||
# calculate total number of files matched
|
||||
local num_files=$(echo "$matches" | wc -l | awk '{$1=$1; print $0}')
|
||||
# if num_files is > 100, print an error
|
||||
if [ $num_files -gt 100 ]; then
|
||||
echo "More than $num_files files matched for \"$search_term\" in $dir. Please narrow your search."
|
||||
return
|
||||
fi
|
||||
|
||||
echo "Found $num_matches matches for \"$search_term\" in $dir:"
|
||||
echo "$matches" | awk '{$2=$2; gsub(/^\.+\/+/, "./", $2); print $2 " ("$1" matches)"}'
|
||||
echo "End of matches for \"$search_term\" in $dir"
|
||||
}
|
||||
|
||||
# @yaml
|
||||
# signature: search_file <search_term> [<file>]
|
||||
# docstring: searches for search_term in file. If file is not provided, searches in the current open file
|
||||
# arguments:
|
||||
# search_term:
|
||||
# type: string
|
||||
# description: the term to search for
|
||||
# required: true
|
||||
# file:
|
||||
# type: string
|
||||
# description: the file to search in (if not provided, searches in the current open file)
|
||||
# required: false
|
||||
search_file() {
|
||||
# Check if the first argument is provided
|
||||
if [ -z "$1" ]; then
|
||||
echo "Usage: search_file <search_term> [<file>]"
|
||||
return
|
||||
fi
|
||||
# Check if the second argument is provided
|
||||
if [ -n "$2" ]; then
|
||||
# Check if the provided argument is a valid file
|
||||
if [ -f "$2" ]; then
|
||||
local file="$2" # Set file if valid
|
||||
else
|
||||
echo "Usage: search_file <search_term> [<file>]"
|
||||
echo "Error: File name $2 not found. Please provide a valid file name."
|
||||
return # Exit if the file is not valid
|
||||
fi
|
||||
else
|
||||
# Check if a file is open
|
||||
if [ -z "$CURRENT_FILE" ]; then
|
||||
echo "No file open. Use the open command first."
|
||||
return # Exit if no file is open
|
||||
fi
|
||||
local file="$CURRENT_FILE" # Set file to the current open file
|
||||
fi
|
||||
local search_term="$1"
|
||||
file=$(realpath "$file")
|
||||
# Use grep to directly get the desired formatted output
|
||||
local matches=$(grep -nH "$search_term" "$file")
|
||||
# Check if no matches were found
|
||||
if [ -z "$matches" ]; then
|
||||
echo "No matches found for \"$search_term\" in $file"
|
||||
return
|
||||
fi
|
||||
# Calculate total number of matches
|
||||
local num_matches=$(echo "$matches" | wc -l | awk '{$1=$1; print $0}')
|
||||
|
||||
# calculate total number of lines matched
|
||||
local num_lines=$(echo "$matches" | cut -d: -f1 | sort | uniq | wc -l | awk '{$1=$1; print $0}')
|
||||
# if num_lines is > 100, print an error
|
||||
if [ $num_lines -gt 100 ]; then
|
||||
echo "More than $num_lines lines matched for \"$search_term\" in $file. Please narrow your search."
|
||||
return
|
||||
fi
|
||||
|
||||
# Print the total number of matches and the matches themselves
|
||||
echo "Found $num_matches matches for \"$search_term\" in $file:"
|
||||
echo "$matches" | cut -d: -f1-2 | sort -u -t: -k2,2n | while IFS=: read -r filename line_number; do
|
||||
echo "Line $line_number:$(sed -n "${line_number}p" "$file")"
|
||||
done
|
||||
echo "End of matches for \"$search_term\" in $file"
|
||||
}
|
||||
|
||||
# @yaml
|
||||
# signature: find_file <file_name> [<dir>]
|
||||
# docstring: finds all files with the given name in dir. If dir is not provided, searches in the current directory
|
||||
# arguments:
|
||||
# file_name:
|
||||
# type: string
|
||||
# description: the name of the file to search for
|
||||
# required: true
|
||||
# dir:
|
||||
# type: string
|
||||
# description: the directory to search in (if not provided, searches in the current directory)
|
||||
# required: false
|
||||
find_file() {
|
||||
if [ $# -eq 1 ]; then
|
||||
local file_name="$1"
|
||||
local dir="./"
|
||||
elif [ $# -eq 2 ]; then
|
||||
local file_name="$1"
|
||||
if [ -d "$2" ]; then
|
||||
local dir="$2"
|
||||
else
|
||||
echo "Directory $2 not found"
|
||||
return
|
||||
fi
|
||||
else
|
||||
echo "Usage: find_file <file_name> [<dir>]"
|
||||
return
|
||||
fi
|
||||
|
||||
dir=$(realpath "$dir")
|
||||
local matches=$(find "$dir" -type f -name "$file_name")
|
||||
# if no matches, return
|
||||
if [ -z "$matches" ]; then
|
||||
echo "No matches found for \"$file_name\" in $dir"
|
||||
return
|
||||
fi
|
||||
# Calculate total number of matches
|
||||
local num_matches=$(echo "$matches" | wc -l | awk '{$1=$1; print $0}')
|
||||
echo "Found $num_matches matches for \"$file_name\" in $dir:"
|
||||
echo "$matches" | awk '{print $0}'
|
||||
}
|
||||
@ -1,15 +0,0 @@
|
||||
# FROM https://github.com/princeton-nlp/SWE-bench/blob/main/environment.yml
|
||||
name: swe-bench
|
||||
dependencies:
|
||||
- python=3.9
|
||||
- pip
|
||||
- pip:
|
||||
- beautifulsoup4
|
||||
- chardet
|
||||
- ghapi
|
||||
- GitPython
|
||||
- python-dotenv
|
||||
- requests
|
||||
- rich
|
||||
- transformers>=4.34.0
|
||||
- conda-forge::gh
|
||||
File diff suppressed because one or more lines are too long
@ -1,5 +0,0 @@
|
||||
from datasets import load_dataset
|
||||
|
||||
dataset = load_dataset('princeton-nlp/SWE-bench')
|
||||
test = dataset['test'].to_pandas()
|
||||
test.to_json('data/processed/swe-bench-test.json', orient='records')
|
||||
@ -1,81 +0,0 @@
|
||||
'''
|
||||
Script used to convert devin's output into the desired json format for evaluation on SWE-bench
|
||||
|
||||
Usage:
|
||||
python prepare_devin_outputs_for_evaluation.py <setting>
|
||||
<setting> can be "passed", "failed", "all"
|
||||
|
||||
Outputs:
|
||||
two json files under evaluation/SWE-bench/data/
|
||||
|
||||
'''
|
||||
|
||||
# fetch devin's outputs into a json file for evaluation
|
||||
import json
|
||||
import os
|
||||
import sys
|
||||
|
||||
import requests
|
||||
from tqdm import tqdm
|
||||
|
||||
|
||||
def get_devin_eval_output(setting):
|
||||
repo_url = 'CognitionAI/devin-swebench-results'
|
||||
folder_path = 'output_diffs'
|
||||
|
||||
base_url = 'https://api.github.com/repos/'
|
||||
pass_api_url = f'{base_url}{repo_url}/contents/{folder_path}/pass'
|
||||
failed_api_url = f'{base_url}{repo_url}/contents/{folder_path}/fail'
|
||||
|
||||
pass_files_info = []
|
||||
failed_files_info = []
|
||||
|
||||
def get_files(api_url, subfolder_name, files_info):
|
||||
response = requests.get(api_url)
|
||||
if response.status_code == 200:
|
||||
contents = response.json()
|
||||
for item in tqdm(contents):
|
||||
if item['type'] == 'file':
|
||||
file_url = f"https://raw.githubusercontent.com/{repo_url}/main/{folder_path}/{subfolder_name}/{item['name']}"
|
||||
file_content = requests.get(file_url).text
|
||||
instance_id = item['name'][:-9]
|
||||
model_name = 'Devin' # Update with actual model name
|
||||
files_info.append({
|
||||
'instance_id': instance_id,
|
||||
'model_patch': file_content,
|
||||
'model_name_or_path': model_name,
|
||||
'pass_or_fail': subfolder_name
|
||||
})
|
||||
|
||||
if setting == 'passed' or setting == 'all':
|
||||
get_files(pass_api_url, 'pass', pass_files_info)
|
||||
if setting == 'failed' or setting == 'all':
|
||||
get_files(failed_api_url, 'fail', failed_files_info)
|
||||
|
||||
script_dir = os.path.dirname(os.path.abspath(__file__))
|
||||
output_dir = os.path.join(script_dir, '../data/devin/')
|
||||
|
||||
if not os.path.exists(output_dir):
|
||||
os.makedirs(output_dir)
|
||||
|
||||
if setting == 'passed' or setting == 'all':
|
||||
with open(os.path.join(output_dir, 'devin_swe_passed.json'), 'w') as pass_file:
|
||||
json.dump(pass_files_info, pass_file, indent=4)
|
||||
|
||||
if setting == 'failed' or setting == 'all':
|
||||
with open(os.path.join(output_dir, 'devin_swe_failed.json'), 'w') as fail_file:
|
||||
json.dump(failed_files_info, fail_file, indent=4)
|
||||
|
||||
if setting == 'all':
|
||||
merged_output = pass_files_info + failed_files_info
|
||||
with open(os.path.join(output_dir, 'devin_swe_outputs.json'), 'w') as merge_file:
|
||||
json.dump(merged_output, merge_file, indent=4)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
if len(sys.argv) != 2:
|
||||
print('Usage: python script_name.py <setting>')
|
||||
sys.exit(1)
|
||||
|
||||
setting = sys.argv[1]
|
||||
get_devin_eval_output(setting)
|
||||
@ -1,10 +0,0 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -xeo pipefail
|
||||
mkdir -p data/processed
|
||||
python3 scripts/download_test_data.py
|
||||
|
||||
# Download an example output file (FROM claude-2)
|
||||
# https://gist.github.com/sorendunn/9f1f1fade59f986b4925b6633f9ff165
|
||||
mkdir -p data/predictions
|
||||
wget https://huggingface.co/datasets/OpenDevin/Devin-SWE-bench-output/raw/main/devin_swe_outputs.json -O data/predictions/devin_swe_outputs.json
|
||||
@ -1,14 +0,0 @@
|
||||
#!/bin/bash
|
||||
|
||||
DOCKER_IMAGE=ghcr.io/opendevin/eval-swe-bench
|
||||
WORK_DIR=`pwd`
|
||||
|
||||
docker run \
|
||||
-it \
|
||||
--rm \
|
||||
--user root \
|
||||
--ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
|
||||
-v $WORK_DIR:/swe-bench \
|
||||
-w /swe-bench \
|
||||
$DOCKER_IMAGE \
|
||||
/bin/bash -c "usermod -u $(id -u) swe-bench && su swe-bench"
|
||||
0
evaluation/__init__.py
Normal file
39
evaluation/swe_bench/BUILD_TESTBED_AND_ENV.md
Normal file
@ -0,0 +1,39 @@
|
||||
# Pre-build Testbed and Env
|
||||
|
||||
In the original SWE-Bench implementation, the conda environment for evaluation is typically installed from scratch while evaluating a particular instance. This poses several challenges:
|
||||
|
||||
- Efficiency: most of the evaluation time is wasted on downloading packages
|
||||
- Stability: setup could fail due to bad internet connectivity
|
||||
- Reliability: it is possible that an instance is considered failed not because the agent did badly, but because the environment setup failed.
|
||||
|
||||
In the OpenDevin-SWE-Bench fork, we pre-build the **testbed** (i.e., the code of the repository we want the agent to edit) AND the **conda environment**, so that at evaluation (inference) time, we can directly reuse existing environments for efficient evaluation.
|
||||
|
||||
NOTE: We only support SWE-Bench Lite for now, but modifying our existing scripts for full SWE-Bench should be quite straightforward.
|
||||
|
||||
## How to pre-build your testbed
|
||||
|
||||
### Setup Eval Workspace (Util + Data)
|
||||
|
||||
Set up your eval workspace by:
|
||||
1. Clone OpenDevin SWE-Bench [fork](https://github.com/OpenDevin/OD-SWE-bench.git)
|
||||
2. Prepare SWE-Bench data
|
||||
|
||||
Run the following command to perform the above two steps. The results will be saved to `evaluation/swe_bench/eval_workspace`.
|
||||
|
||||
```bash
|
||||
./evaluation/swe_bench/scripts/setup/prepare_swe_utils.sh
|
||||
```
|
||||
|
||||
### Pre-build Conda Env and Test Bed
|
||||
|
||||
```bash
|
||||
./evaluation/swe_bench/scripts/setup/swe_env_setup.sh
|
||||
```
|
||||
|
||||
### Build the pre-built conda env and testbed into ONE docker image
|
||||
|
||||
```bash
|
||||
pushd evaluation/swe_bench
|
||||
docker build -t ghcr.io/opendevin/eval-swe-bench:full-v1.0 -f ./scripts/docker/Dockerfile.full.v1.0 .
|
||||
docker push ghcr.io/opendevin/eval-swe-bench:full-v1.0
|
||||
```
|
||||
256
evaluation/swe_bench/EVAL_PATCH.md
Normal file
@ -0,0 +1,256 @@
|
||||
# Evaluate Generated Patches
|
||||
|
||||
## Evaluate patches generated by OpenDevin
|
||||
|
||||
This section explains in detail how `evaluation/swe_bench/scripts/eval_infer.sh` described in [SWE-Bench README](./README.md) works.
|
||||
|
||||
Use `scripts/setup/get_agent_report.sh` to evaluate patches generated by an OpenDevin agent. This script is available in the container at `/swe_util/get_agent_report.sh`.
|
||||
|
||||
- `output-file` (*required*): specify the path to your patch file inside the container
|
||||
- `agent-name` (*required*): your agent name
|
||||
- `dataset` (*required*): `swe-bench-test-lite` or `swe-bench-test`
|
||||
- `num-processes`: defaults to 15.
|
||||
- `experiment-name`: set to `${parent_folder_of_output_file}_${current_folder_of_output_file}` if not given. E.g., `xxx/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v2_cd/output.jsonl` -> `CodeActAgent_gpt-4-1106-preview_maxiter_50_N_v2_cd` as the experiment name (see the sketch after this list).
|
||||
- `merge-report`: if set, merges the evaluation report into the original output jsonl file and saves it as a `.merged.jsonl` file.
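A minimal sketch of how the default experiment name is derived, assuming a hypothetical output path (not the exact script code):

```bash
# Derive the default experiment name from the two folders containing the output file.
output_file="xxx/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v2_cd/output.jsonl"  # hypothetical path
current_folder=$(basename "$(dirname "$output_file")")
parent_folder=$(basename "$(dirname "$(dirname "$output_file")")")
echo "${parent_folder}_${current_folder}"  # CodeActAgent_gpt-4-1106-preview_maxiter_50_N_v2_cd
```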
|
||||
|
||||
An example of running evaluation on the provided example agent output (`./examples/example_agent_output.jsonl`):
|
||||
|
||||
```shell
|
||||
export MINICONDA3=/swe_util/miniforge3
|
||||
export OD_SWE_BENCH=/OD-SWE-bench
|
||||
export EVAL_DATA_DIR=/swe_util/eval_data
|
||||
cd /swe_util && ./get_agent_report.sh --output-file /swe_bench_output/example_agent_output.jsonl \
|
||||
--agent-name CodeActAgent \
|
||||
--dataset swe-bench-test-lite \
|
||||
--experiment-name test_experiment \
|
||||
--merge-report
|
||||
```
|
||||
|
||||
You should get the following report:
|
||||
```shell
|
||||
- no_generation: 4
|
||||
- generated: 26
|
||||
- with_logs: 26
|
||||
- install_fail: 0
|
||||
- reset_failed: 0
|
||||
- no_apply: 0
|
||||
- applied: 24
|
||||
- test_errored: 0
|
||||
- test_timeout: 0
|
||||
- resolved: 6
|
||||
['sphinx-doc__sphinx-8721', 'sympy__sympy-14774', 'django__django-17087', 'sympy__sympy-20590', 'django__django-11583', 'sympy__sympy-21612']
|
||||
Report saved at /swe_util/eval_data/eval_logs/test_experiment/test_experiment_swe-bench-test-lite.report.json
|
||||
Agent output with report merged created at /swe_bench_output/example_agent_output.merged.jsonl
|
||||
```
|
||||
|
||||
An additional `fine_grained_report` field will be added to each instance in the `example_agent_output.merged.jsonl`.
|
||||
|
||||
```json
|
||||
"fine_grained_report": {
|
||||
"gold_tests": {
|
||||
"FAIL_TO_PASS": "[\"tests/test_ext_viewcode.py::test_viewcode_epub_default\"]",
|
||||
"PASS_TO_PASS": "[\"tests/test_ext_viewcode.py::test_viewcode_epub_enabled\", \"tests/test_ext_viewcode.py::test_linkcode\", \"tests/test_ext_viewcode.py::test_local_source_files\"]"
|
||||
},
|
||||
"generated": true,
|
||||
"with_logs": true,
|
||||
"applied": true,
|
||||
"test_errored": false,
|
||||
"test_timeout": false,
|
||||
"resolved": true,
|
||||
"log_parse": {
|
||||
"tests/test_ext_viewcode.py::test_viewcode_epub_default": "PASSED",
|
||||
"tests/test_ext_viewcode.py::test_viewcode_epub_enabled": "PASSED",
|
||||
"tests/test_ext_viewcode.py::test_linkcode": "PASSED",
|
||||
"tests/test_ext_viewcode.py::test_local_source_files": "PASSED",
|
||||
"tests/test_ext_viewcode.py::test_viewcode": "FAILED"
|
||||
},
|
||||
"eval_report": {
|
||||
"FAIL_TO_PASS": {
|
||||
"success": [
|
||||
"tests/test_ext_viewcode.py::test_viewcode_epub_default"
|
||||
],
|
||||
"failure": []
|
||||
},
|
||||
"PASS_TO_PASS": {
|
||||
"success": [
|
||||
"tests/test_ext_viewcode.py::test_viewcode_epub_enabled",
|
||||
"tests/test_ext_viewcode.py::test_linkcode",
|
||||
"tests/test_ext_viewcode.py::test_local_source_files"
|
||||
],
|
||||
"failure": []
|
||||
},
|
||||
"FAIL_TO_FAIL": {
|
||||
"success": [],
|
||||
"failure": []
|
||||
},
|
||||
"PASS_TO_FAIL": {
|
||||
"success": [],
|
||||
"failure": []
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## If you already have patches not generated by OpenDevin
|
||||
|
||||
### Prepare Output Files
|
||||
|
||||
Ensure that model outputs are formatted correctly as below:
|
||||
```json
|
||||
[
|
||||
{
|
||||
"instance_id": "",
|
||||
"model_patch": "",
|
||||
"model_name_or_path": ""
|
||||
},
|
||||
...
|
||||
]
|
||||
```
|
||||
An example can be found [here](./examples/example_model_output.json).
|
||||
|
||||
Agent output should adhere to the OpenDevin format. An example can be found [here](./examples/example_agent_output.jsonl).
|
||||
|
||||
### Set Up the Environment
|
||||
|
||||
Before evaluating generated patches, you need to set up the Docker environment. Run the following command to instantiate the Docker container and mount the directory containing your output files on the host:
|
||||
|
||||
```shell
|
||||
docker run -it \
|
||||
-v DIR_TO_YOUR_PATCH_FILES_ON_HOST:/swe_bench_output \
|
||||
ghcr.io/opendevin/eval-swe-bench:full-v1.0 /bin/bash
|
||||
```
|
||||
|
||||
### Evaluate Model Generated Patches
|
||||
|
||||
Use `scripts/get_model_report.sh` to evaluate patches generated by a model. This script is located in the container at `/swe_util/get_model_report.sh`.
|
||||
|
||||
- `output-file` (*required*): specify the path to your patch file inside the container
|
||||
- `model-name` (*required*): this must match the `model_name_or_path` in your patch file
|
||||
- `dataset` (*required*): `swe-bench-test-lite` or `swe-bench-test`
|
||||
- `num-processes`: defaults to 15.
|
||||
- `experiment-name`: set to `{model-name}__{dataset}` unless specified
|
||||
|
||||
An example of running evaluation on the provided example model output (`./examples/example_model_output.json`):
|
||||
|
||||
```shell
|
||||
export MINICONDA3=/swe_util/miniforge3
|
||||
export OD_SWE_BENCH=/swe_util/OD-SWE-bench
|
||||
export EVAL_DATA_DIR=/swe_util/eval_data
|
||||
cd /swe_util && ./get_model_report.sh --output-file /swe_bench_output/example_model_output.json \
|
||||
--model-name opendevin \
|
||||
--dataset swe-bench-test-lite
|
||||
```
|
||||
|
||||
You should get the following report:
|
||||
```shell
|
||||
- no_generation: 4
|
||||
- generated: 26
|
||||
- with_logs: 26
|
||||
- install_fail: 0
|
||||
- reset_failed: 0
|
||||
- no_apply: 0
|
||||
- applied: 24
|
||||
- test_errored: 0
|
||||
- test_timeout: 0
|
||||
- resolved: 6
|
||||
['sphinx-doc__sphinx-8721', 'sympy__sympy-14774', 'django__django-17087', 'sympy__sympy-20590', 'django__django-11583', 'sympy__sympy-21612']
|
||||
Report saved at /swe_util/eval_data/eval_logs/opendevin__swe-bench-test-lite/example_model_output.report.json
|
||||
```
|
||||
Note: please ignore the `no_apply` in the report for now.
|
||||
|
||||
The script will generate a `{experiment_name}` folder under `$EVAL_DATA_DIR/eval_logs`
|
||||
```shell
|
||||
├── $EVAL_DATA_DIR/eval_logs/$experiment_name
|
||||
│ ├── $experiment_name.json
|
||||
│ ├── $experiment_name.report.json
|
||||
│ ├── $model_name # eval log dir
|
||||
```
|
||||
|
||||
### Evaluate Agent Generated Patches
|
||||
|
||||
Use `scripts/setup/get_agent_report.sh` to evaluate patches generated by an agent. This script is available in the container at `/swe_util/get_agent_report.sh`.
|
||||
|
||||
- `output-file` (*required*): specify the path to your patch file inside the container
|
||||
- `agent-name` (*required*): your agent name
|
||||
- `dataset` (*required*): `swe-bench-test-lite` or `swe-bench-test`
|
||||
- `num-processes`: defaults to 15.
|
||||
- `experiment-name`: set to `${parent_folder_of_output_file}_${current_folder_of_output_file}` if not given. E.g., `xxx/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v2_cd/output.jsonl` -> `CodeActAgent_gpt-4-1106-preview_maxiter_50_N_v2_cd` as the experiment name.
|
||||
- `merge-report`: if set, merges the evaluation report into the original output jsonl file and saves it as a `.merged.jsonl` file.
|
||||
|
||||
An example of running evaluation on the provided example agent output (`./examples/example_agent_output.jsonl`):
|
||||
|
||||
```shell
|
||||
export MINICONDA3=/swe_util/miniforge3
|
||||
export OD_SWE_BENCH=/OD-SWE-bench
|
||||
export EVAL_DATA_DIR=/swe_util/eval_data
|
||||
cd /swe_util && ./get_agent_report.sh --output-file /swe_bench_output/example_agent_output.jsonl \
|
||||
--agent-name CodeActAgent \
|
||||
--dataset swe-bench-test-lite \
|
||||
--experiment-name test_experiment \
|
||||
--merge-report
|
||||
```
|
||||
|
||||
You should get the following report:
|
||||
```shell
|
||||
- no_generation: 4
|
||||
- generated: 26
|
||||
- with_logs: 26
|
||||
- install_fail: 0
|
||||
- reset_failed: 0
|
||||
- no_apply: 0
|
||||
- applied: 24
|
||||
- test_errored: 0
|
||||
- test_timeout: 0
|
||||
- resolved: 6
|
||||
['sphinx-doc__sphinx-8721', 'sympy__sympy-14774', 'django__django-17087', 'sympy__sympy-20590', 'django__django-11583', 'sympy__sympy-21612']
|
||||
Report saved at /swe_util/eval_data/eval_logs/test_experiment/test_experiment_swe-bench-test-lite.report.json
|
||||
Agent output with report merged created at /swe_bench_output/example_agent_output.merged.jsonl
|
||||
```
|
||||
|
||||
An additional `fine_grained_report` field will be added to each instance in the `example_agent_output.merged.jsonl`.
|
||||
|
||||
```json
|
||||
"fine_grained_report": {
|
||||
"gold_tests": {
|
||||
"FAIL_TO_PASS": "[\"tests/test_ext_viewcode.py::test_viewcode_epub_default\"]",
|
||||
"PASS_TO_PASS": "[\"tests/test_ext_viewcode.py::test_viewcode_epub_enabled\", \"tests/test_ext_viewcode.py::test_linkcode\", \"tests/test_ext_viewcode.py::test_local_source_files\"]"
|
||||
},
|
||||
"generated": true,
|
||||
"with_logs": true,
|
||||
"applied": true,
|
||||
"test_errored": false,
|
||||
"test_timeout": false,
|
||||
"resolved": true,
|
||||
"log_parse": {
|
||||
"tests/test_ext_viewcode.py::test_viewcode_epub_default": "PASSED",
|
||||
"tests/test_ext_viewcode.py::test_viewcode_epub_enabled": "PASSED",
|
||||
"tests/test_ext_viewcode.py::test_linkcode": "PASSED",
|
||||
"tests/test_ext_viewcode.py::test_local_source_files": "PASSED",
|
||||
"tests/test_ext_viewcode.py::test_viewcode": "FAILED"
|
||||
},
|
||||
"eval_report": {
|
||||
"FAIL_TO_PASS": {
|
||||
"success": [
|
||||
"tests/test_ext_viewcode.py::test_viewcode_epub_default"
|
||||
],
|
||||
"failure": []
|
||||
},
|
||||
"PASS_TO_PASS": {
|
||||
"success": [
|
||||
"tests/test_ext_viewcode.py::test_viewcode_epub_enabled",
|
||||
"tests/test_ext_viewcode.py::test_linkcode",
|
||||
"tests/test_ext_viewcode.py::test_local_source_files"
|
||||
],
|
||||
"failure": []
|
||||
},
|
||||
"FAIL_TO_FAIL": {
|
||||
"success": [],
|
||||
"failure": []
|
||||
},
|
||||
"PASS_TO_FAIL": {
|
||||
"success": [],
|
||||
"failure": []
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
150
evaluation/swe_bench/README.md
Normal file
@ -0,0 +1,150 @@
|
||||
# SWE-Bench Evaluation with OpenDevin SWE-Bench Docker Image
|
||||
|
||||
|
||||
This folder contains the evaluation harness we built on top of the original [SWE-Bench benchmark](https://www.swebench.com/) ([paper](https://arxiv.org/abs/2310.06770)). We created [a fork of SWE-Bench](https://github.com/OpenDevin/OD-SWE-bench.git), mostly built on top of [the original repo](https://github.com/princeton-nlp/SWE-bench), and [containerized](#opendevin-swe-bench-docker-image) it for easy evaluation.
|
||||
|
||||
## OpenDevin SWE-Bench Docker Image
|
||||
|
||||
In the [OpenDevin-SWE-Bench fork](https://github.com/OpenDevin/OD-SWE-bench.git) (mostly from the [original repo](https://github.com/princeton-nlp/SWE-bench) with some fixes), we pre-build the **testbed** (i.e., the code of the repository we want the agent to edit) AND the **conda environment**, so that at evaluation (inference) time, we can directly reuse existing environments for efficient evaluation.
|
||||
|
||||
**We pack everything you need for SWE-Bench evaluation into one gigantic Docker image.** To use it:
|
||||
|
||||
```bash
|
||||
docker pull ghcr.io/opendevin/eval-swe-bench:full-v1.0
|
||||
```
|
||||
|
||||
The Docker image contains several important directories:
|
||||
- `/swe_util/OD-SWE-bench`: root directory for the OD-SWE-bench repository
|
||||
- `/swe_util/eval_data`: directory for eval data
|
||||
- `/swe_util/eval_data/eval_logs/`: evaluation logs
|
||||
- `/swe_util/eval_data/eval_temp/`: temporary folder for the evaluation process
|
||||
- `/swe_util/eval_data/instances/`: swe-bench raw instances
|
||||
- `/swe_util/eval_data/outputs/`: model or agent outputs
|
||||
- `/swe_util/eval_data/testbed_logs/`: logs for testbed building
|
||||
- `/swe_util/eval_data/testbeds/`: directory for all testbeds
|
||||
- `/swe_util/miniforge3/`: directory for miniforge3
|
||||
|
||||
To reproduce how we pack the image, check [this doc](./BUILD_TESTBED_AND_ENV.md).
|
||||
|
||||
NOTE: We only support SWE-Bench Lite for now, but modifying our existing scripts for full SWE-Bench should be quite straightforward.
|
||||
|
||||
## Test if your environment works
|
||||
|
||||
```bash
|
||||
python3 evaluation/swe_bench/swe_env_box.py
|
||||
```
|
||||
|
||||
If you reach the interactive shell successfully, your environment works!
|
||||
|
||||
## Configure your LLM
|
||||
|
||||
Create a `config.toml` file at the root of the workspace if it does not already exist.
|
||||
|
||||
Add the following configurations:
|
||||
|
||||
```toml
|
||||
[core]
|
||||
max_iterations = 100
|
||||
cache_dir = "/tmp/cache"
|
||||
sandbox_container_image = "ghcr.io/opendevin/sandbox:latest"
|
||||
sandbox_type = "ssh"
|
||||
use_host_network = true
|
||||
ssh_hostname = "localhost"
|
||||
sandbox_timeout = 120
|
||||
# eval specific
|
||||
run_as_devin = false
|
||||
|
||||
# TODO: Change these to the model you want to evaluate
|
||||
[eval_gpt4_1106_preview]
|
||||
model = "gpt-4-1106-preview"
|
||||
api_key = "XXX"
|
||||
temperature = 0.0
|
||||
|
||||
[eval_some_openai_compatible_model]
|
||||
model = "openai/MODEL_NAME"
|
||||
base_url = "https://OPENAI_COMPATIBLE_URL/v1"
|
||||
api_key = "XXX"
|
||||
temperature = 0.0
|
||||
```
|
||||
|
||||
## Run Inference on SWE-Bench Instances
|
||||
|
||||
```bash
|
||||
./evaluation/swe_bench/scripts/run_infer.sh eval_gpt4_1106_preview
|
||||
```
|
||||
|
||||
You can replace `eval_gpt4_1106_preview` with any model you set up in `config.toml`.
|
||||
|
||||
|
||||
## Evaluate Generated Patches
|
||||
|
||||
After running the inference described in the previous section, you will obtain an `output.jsonl` (by default it is saved to `evaluation/evaluation_outputs`). You can then run a one-line script to evaluate the generated patches and produce a fine-grained report.
|
||||
|
||||
If you want to evaluate existing results, first clone the existing outputs:
|
||||
|
||||
```bash
|
||||
git clone https://huggingface.co/spaces/OpenDevin/evaluation evaluation/evaluation_outputs
|
||||
```
|
||||
|
||||
Then you can run the following:
|
||||
```bash
|
||||
# ./evaluation/swe_bench/scripts/eval_infer.sh $YOUR_OUTPUT_JSONL
|
||||
# For example:
|
||||
./evaluation/swe_bench/scripts/eval_infer.sh evaluation/evaluation_outputs/outputs/swe_bench/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/output.jsonl
|
||||
```
|
||||
|
||||
The final results will be saved to `evaluation/evaluation_outputs/outputs/swe_bench/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/output.merged.jsonl`.
|
||||
|
||||
It will contain an additional field `fine_grained_report` (see the example below) compared to the `output.jsonl` from the previous inference stage.
|
||||
|
||||
```json
|
||||
"fine_grained_report": {
|
||||
"gold_tests": {
|
||||
"FAIL_TO_PASS": "[\"tests/test_ext_viewcode.py::test_viewcode_epub_default\"]",
|
||||
"PASS_TO_PASS": "[\"tests/test_ext_viewcode.py::test_viewcode_epub_enabled\", \"tests/test_ext_viewcode.py::test_linkcode\", \"tests/test_ext_viewcode.py::test_local_source_files\"]"
|
||||
},
|
||||
"generated": true,
|
||||
"with_logs": true,
|
||||
"applied": true,
|
||||
"test_errored": false,
|
||||
"test_timeout": false,
|
||||
"resolved": true,
|
||||
"log_parse": {
|
||||
"tests/test_ext_viewcode.py::test_viewcode_epub_default": "PASSED",
|
||||
"tests/test_ext_viewcode.py::test_viewcode_epub_enabled": "PASSED",
|
||||
"tests/test_ext_viewcode.py::test_linkcode": "PASSED",
|
||||
"tests/test_ext_viewcode.py::test_local_source_files": "PASSED",
|
||||
"tests/test_ext_viewcode.py::test_viewcode": "FAILED"
|
||||
},
|
||||
"eval_report": {
|
||||
"FAIL_TO_PASS": {
|
||||
"success": [
|
||||
"tests/test_ext_viewcode.py::test_viewcode_epub_default"
|
||||
],
|
||||
"failure": []
|
||||
},
|
||||
"PASS_TO_PASS": {
|
||||
"success": [
|
||||
"tests/test_ext_viewcode.py::test_viewcode_epub_enabled",
|
||||
"tests/test_ext_viewcode.py::test_linkcode",
|
||||
"tests/test_ext_viewcode.py::test_local_source_files"
|
||||
],
|
||||
"failure": []
|
||||
},
|
||||
"FAIL_TO_FAIL": {
|
||||
"success": [],
|
||||
"failure": []
|
||||
},
|
||||
"PASS_TO_FAIL": {
|
||||
"success": [],
|
||||
"failure": []
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Please refer to [EVAL_PATCH.md](./EVAL_PATCH.md) if you want to learn more about how to evaluate patches that are already generated (e.g., not by OpenDevin).
|
||||
|
||||
## Submit your evaluation results
|
||||
|
||||
You can start your own fork of [our huggingface evaluation outputs](https://huggingface.co/spaces/OpenDevin/evaluation) and submit a PR of your evaluation results following the guide [here](https://huggingface.co/docs/hub/en/repositories-pull-requests-discussions#pull-requests-and-discussions).
|
||||
0
evaluation/swe_bench/__init__.py
Normal file
30
evaluation/swe_bench/examples/example_agent_output.jsonl
Normal file
File diff suppressed because one or more lines are too long
152
evaluation/swe_bench/examples/example_model_output.json
Normal file
File diff suppressed because one or more lines are too long
411
evaluation/swe_bench/run_infer.py
Normal file
@ -0,0 +1,411 @@
|
||||
import asyncio
|
||||
import json
|
||||
import logging
|
||||
import multiprocessing as mp
|
||||
import os
|
||||
import pathlib
|
||||
import subprocess
|
||||
import time
|
||||
from concurrent.futures import ProcessPoolExecutor
|
||||
|
||||
import pandas as pd
|
||||
import whatthepatch
|
||||
from datasets import load_dataset
|
||||
from tqdm import tqdm
|
||||
|
||||
from evaluation.swe_bench.swe_env_box import SWEBenchSSHBox
|
||||
from opendevin.controller.state.state import State
|
||||
from opendevin.core.config import args, config, get_llm_config_arg
|
||||
from opendevin.core.logger import get_console_handler
|
||||
from opendevin.core.logger import opendevin_logger as logger
|
||||
from opendevin.core.main import main
|
||||
from opendevin.events.action import MessageAction
|
||||
from opendevin.events.serialization.event import event_to_dict
|
||||
|
||||
|
||||
def cleanup():
|
||||
print('Cleaning up child processes...')
|
||||
for process in mp.active_children():
|
||||
print(f'Terminating child process: {process.name}')
|
||||
process.terminate()
|
||||
process.join()
|
||||
|
||||
|
||||
def codeact_user_response(state: State) -> str:
|
||||
msg = (
|
||||
'Please continue working on the task on whatever approach you think is suitable.\n'
|
||||
'If you think you have modified the code in a way that fixes the issue, please run the following command: <execute_bash> exit </execute_bash>.\n'
|
||||
'IMPORTANT: YOU SHOULD NEVER ASK FOR HUMAN HELP OR USE THE INTERNET TO SOLVE THIS TASK.\n'
|
||||
)
|
||||
if state.history:
|
||||
user_msgs = [
|
||||
action
|
||||
for action, _ in state.history
|
||||
if isinstance(action, MessageAction) and action.source == 'user'
|
||||
]
|
||||
if len(user_msgs) >= 2:
|
||||
# let the agent know that it can give up when it has tried 3 times
|
||||
return (
|
||||
msg
|
||||
+ 'If you want to give up, run: <execute_bash> exit </execute_bash>.\n'
|
||||
)
|
||||
return msg
|
||||
|
||||
|
||||
def monologue_user_response(state: State) -> str:
|
||||
raise NotImplementedError('MonologueAgent should never ask for user responses.')
|
||||
|
||||
|
||||
AGENT_CLS_TO_FAKE_USER_RESPONSE_FN = {
|
||||
'CodeActAgent': codeact_user_response,
|
||||
'MonologueAgent': monologue_user_response,
|
||||
}
|
||||
|
||||
AGENT_CLS_TO_INST_SUFFIX = {
|
||||
'CodeActAgent': 'When you think you have fixed the issue through code changes, please run the following command: <execute_bash> exit </execute_bash>.\n'
|
||||
}
|
||||
|
||||
|
||||
def get_test_result(instance, sandbox, workspace_dir_name):
|
||||
test_result = {'result': {}, 'metadata': {}}
|
||||
try:
|
||||
test_patch_parsed = whatthepatch.parse_patch(instance.test_patch)
|
||||
# get a list of filepaths that are involved in the patch
|
||||
involved_filepaths = set()
|
||||
for patch in test_patch_parsed:
|
||||
involved_filepaths.add(patch.header.old_path.removeprefix('a/'))
|
||||
involved_filepaths.add(patch.header.new_path.removeprefix('b/'))
|
||||
involved_filepaths = list(involved_filepaths)
|
||||
test_result['metadata']['1_test_patch_parse_success'] = True
|
||||
test_result['metadata']['1_test_involved_filepaths'] = involved_filepaths
|
||||
except Exception as e:
|
||||
logger.error(
|
||||
f'Error parsing test patch for instance {instance.instance_id}: {e}'
|
||||
)
|
||||
test_result['metadata']['1_test_patch_parse_success'] = False
|
||||
test_result['metadata']['1_test_patch_parse_error'] = str(e)
|
||||
test_result['metadata']['1_test_involved_filepaths'] = None
|
||||
involved_filepaths = []
|
||||
|
||||
# Try to revert the changes for involved filepaths
|
||||
err_code, output = sandbox.execute(f'cd /workspace/{workspace_dir_name}')
|
||||
test_result['metadata']['2_revert_test_involved_filepaths_success'] = []
|
||||
for filepath in involved_filepaths:
|
||||
err_code, output = sandbox.execute(
|
||||
f'git checkout {instance["base_commit"]} -- {filepath}'
|
||||
)
|
||||
if err_code != 0:
|
||||
logger.error(f'Error reverting changes for {filepath}: {output}')
|
||||
test_result['metadata']['2_revert_test_involved_filepaths_success'].append(
|
||||
False
|
||||
)
|
||||
else:
|
||||
test_result['metadata']['2_revert_test_involved_filepaths_success'].append(
|
||||
True
|
||||
)
|
||||
|
||||
# Apply the testcase
|
||||
err_code, output = sandbox.execute('git apply $SWE_TASK_DIR/test.patch')
|
||||
if err_code != 0:
|
||||
logger.error(f'Error applying test patch: {output}')
|
||||
test_result['metadata']['3_apply_test_patch_success'] = False
|
||||
test_result['metadata']['3_apply_test_patch_error'] = output
|
||||
else:
|
||||
test_result['metadata']['3_apply_test_patch_success'] = True
|
||||
|
||||
# Run the test command
|
||||
err_code, output = sandbox.execute(
|
||||
'$TEST_CMD > /workspace/$SWE_INSTANCE_ID.log 2>&1'
|
||||
)
|
||||
if err_code != 0:
|
||||
logger.error(f'Error running test command: {output}')
|
||||
test_result['metadata']['4_run_test_command_success'] = False
|
||||
test_result['metadata']['4_run_test_command_error'] = output
|
||||
else:
|
||||
test_result['metadata']['4_run_test_command_success'] = True
|
||||
|
||||
# Get the test output
|
||||
err_code, output = sandbox.execute('cat /workspace/$SWE_INSTANCE_ID.log')
|
||||
if err_code != 0:
|
||||
logger.error(f'Error getting test output: {output}')
|
||||
test_result['metadata']['4_get_test_output_success'] = False
|
||||
test_result['metadata']['4_get_test_output_error'] = output
|
||||
else:
|
||||
test_result['metadata']['4_get_test_output_success'] = True
|
||||
test_result['test_output'] = output
|
||||
|
||||
# Reformat instance.json
|
||||
# $SWE_TASK_DIR/instance.json is a single JSON dict; wrap it in [ ] so it becomes a one-element list
|
||||
err_code, output = sandbox.execute(
|
||||
(
|
||||
'cat $SWE_TASK_DIR/instance.json | sed "s/^{/[{/" | sed "s/}$/}]/" > /workspace/instance.json'
|
||||
)
|
||||
)
|
||||
if err_code != 0:
|
||||
logger.error(f'Error creating instance.json: {output}')
|
||||
test_result['metadata']['5_reformat_instance_json_success'] = False
|
||||
test_result['metadata']['5_reformat_instance_json_error'] = output
|
||||
else:
|
||||
test_result['metadata']['5_reformat_instance_json_success'] = True
|
||||
|
||||
# Get the instance report
|
||||
err_code, output = sandbox.execute(
|
||||
(
|
||||
'cd /swe_util/OD-SWE-bench '
|
||||
'&& export PYTHONPATH=$(pwd):$PYTHONPATH '
|
||||
'&& conda run -n swe-bench-eval python swebench/metrics/get_instance_report.py --swe_bench_task /workspace/instance.json --log_path /workspace/$SWE_INSTANCE_ID.log'
|
||||
)
|
||||
)
|
||||
if err_code != 0:
|
||||
logger.error(f'Error getting instance report: {output}')
|
||||
test_result['metadata']['6_get_instance_report_success'] = False
|
||||
test_result['metadata']['6_get_instance_report_error'] = output
|
||||
else:
|
||||
test_result['metadata']['6_get_instance_report_success'] = True
|
||||
test_result['result_raw'] = output
|
||||
|
||||
# try to parse output
|
||||
for line in output.strip().split('\n'):
|
||||
line = line.strip('-')
|
||||
try:
|
||||
key, value = line.split(':')
|
||||
except ValueError:
|
||||
# skip this line
|
||||
print(f'Error parsing result line: {line}')
|
||||
continue
|
||||
value = value.strip()
|
||||
try:
|
||||
value = int(value)
|
||||
except ValueError:
|
||||
pass
|
||||
test_result['result'][key.strip()] = value
|
||||
return test_result
|
||||
|
||||
|
||||
def process_instance(
|
||||
instance, agent_class, metadata, skip_workspace_mount, reset_logger: bool = True
|
||||
):
|
||||
workspace_mount_path = os.path.join(config.workspace_mount_path, '_eval_workspace')
|
||||
# create process-specific workspace dir
|
||||
if not skip_workspace_mount:
|
||||
workspace_mount_path = os.path.join(workspace_mount_path, str(os.getpid()))
|
||||
pathlib.Path(workspace_mount_path).mkdir(parents=True, exist_ok=True)
|
||||
|
||||
if reset_logger:
|
||||
# Set up logger
|
||||
log_file = os.path.join(
|
||||
eval_output_dir, 'logs', f'instance_{instance.instance_id}.log'
|
||||
)
|
||||
# Remove all existing handlers from logger
|
||||
for handler in logger.handlers[:]:
|
||||
logger.removeHandler(handler)
|
||||
# add back the console handler to print ONE line
|
||||
logger.addHandler(get_console_handler())
|
||||
logger.info(
|
||||
f'Starting evaluation for instance {instance.instance_id}.\nLOG: tail -f {log_file}'
|
||||
)
|
||||
# Remove all existing handlers from logger
|
||||
for handler in logger.handlers[:]:
|
||||
logger.removeHandler(handler)
|
||||
file_handler = logging.FileHandler(log_file)
|
||||
file_handler.setFormatter(
|
||||
logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
|
||||
)
|
||||
logger.addHandler(file_handler)
|
||||
|
||||
if not skip_workspace_mount:
|
||||
logger.info(f'Process-specific workspace mounted at {workspace_mount_path}')
|
||||
|
||||
workspace_dir_name = f'{instance.repo}__{instance.version}'.replace('/', '__')
|
||||
sandbox = SWEBenchSSHBox.get_box_for_instance(
|
||||
instance,
|
||||
workspace_dir_name,
|
||||
skip_workspace_mount=skip_workspace_mount,
|
||||
workspace_mount_path=workspace_mount_path,
|
||||
)
|
||||
|
||||
# Prepare instruction
|
||||
instruction = (
|
||||
f'Please fix the following issue for the repository in /workspace/{workspace_dir_name}.\n'
|
||||
'Environment has been set up for you to start working. You may assume all necessary tools are installed.\n\n'
|
||||
'# Problem Statement\n'
|
||||
f'{instance.problem_statement}\n\n'
|
||||
)
|
||||
if instance.hints_text:
|
||||
instruction += f'# Hints\n{instance.hints_text}\n\n'
|
||||
instruction += (
|
||||
'IMPORTANT: You should ONLY interact with the environment provided to you AND NEVER ASK FOR HUMAN HELP.\n'
|
||||
'You should NOT modify any existing test case files. If needed, you can add new test cases in a NEW file to reproduce the issue.\n'
|
||||
'You SHOULD INCLUDE PROPER INDENTATION in your edit commands.\n'
|
||||
)
|
||||
instruction += AGENT_CLS_TO_INST_SUFFIX.get(agent_class, '')
|
||||
|
||||
# Run the agent
|
||||
state: State = asyncio.run(
|
||||
main(
|
||||
instruction,
|
||||
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN.get(agent_class),
|
||||
sandbox=sandbox,
|
||||
)
|
||||
)
|
||||
|
||||
# Get git patch
|
||||
git_patch = sandbox.get_diff_patch()
|
||||
logger.info(f'Got git diff for instance {instance.instance_id}')
|
||||
|
||||
# ======= Attempt to evaluate the agent's edits =======
|
||||
# Attempt to analyze the test patch to get involved filepaths
|
||||
test_result = get_test_result(instance, sandbox, workspace_dir_name)
|
||||
|
||||
if state is None:
|
||||
raise ValueError('State should not be None.')
|
||||
|
||||
# Save the output
|
||||
output = {
|
||||
'instance_id': instance.instance_id,
|
||||
'swe_instance': instance.to_dict(),
|
||||
'instruction': instruction,
|
||||
'git_patch': git_patch,
|
||||
'metadata': metadata,
|
||||
'history': [
|
||||
(event_to_dict(action), event_to_dict(obs)) for action, obs in state.history
|
||||
],
|
||||
'error': state.error if state and state.error else None,
|
||||
'test_result': test_result,
|
||||
}
|
||||
|
||||
# Close the sandbox
|
||||
sandbox.close()
|
||||
return output
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
# Load the dataset
|
||||
dataset = load_dataset('princeton-nlp/SWE-bench_Lite')
|
||||
swe_bench_tests = dataset['test'].to_pandas()
|
||||
|
||||
if args.llm_config:
|
||||
specified_llm_config = get_llm_config_arg(args.llm_config)
|
||||
if specified_llm_config:
|
||||
config.llm = specified_llm_config
|
||||
logger.info(f'Config for evaluation: {config}')
|
||||
|
||||
# TEST METADATA
|
||||
agent_class = args.agent_cls
|
||||
assert (
|
||||
agent_class in AGENT_CLS_TO_FAKE_USER_RESPONSE_FN
|
||||
), f'Unsupported agent class: {agent_class}'
|
||||
model_name = config.llm.model.split('/')[-1]
|
||||
max_iterations = args.max_iterations
|
||||
eval_note = ''
|
||||
if args.eval_note is not None:
|
||||
eval_note += '_N_' + args.eval_note
|
||||
eval_output_dir = os.path.join(
|
||||
args.eval_output_dir,
|
||||
'swe_bench',
|
||||
agent_class,
|
||||
model_name + '_maxiter_' + str(max_iterations) + eval_note,
|
||||
)
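# e.g. evaluation/evaluation_outputs/outputs/swe_bench/CodeActAgent/gpt-4_maxiter_50_N_v1.3
# (the model name above is illustrative; it is derived from config.llm.model at runtime)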
|
||||
|
||||
pathlib.Path(eval_output_dir).mkdir(parents=True, exist_ok=True)
|
||||
pathlib.Path(os.path.join(eval_output_dir, 'logs')).mkdir(
|
||||
parents=True, exist_ok=True
|
||||
)
|
||||
logger.info(f'Using evaluation output directory: {eval_output_dir}')
|
||||
|
||||
metadata = {
|
||||
'agent_class': agent_class,
|
||||
'model_name': model_name,
|
||||
'max_iterations': max_iterations,
|
||||
'eval_output_dir': eval_output_dir,
|
||||
'start_time': time.strftime('%Y-%m-%d %H:%M:%S'),
|
||||
# get the commit id of current repo
|
||||
'git_commit': subprocess.check_output(['git', 'rev-parse', 'HEAD'])
|
||||
.decode('utf-8')
|
||||
.strip(),
|
||||
}
|
||||
logger.info(f'Metadata: {metadata}')
|
||||
with open(os.path.join(eval_output_dir, 'metadata.json'), 'w') as f:
|
||||
json.dump(metadata, f)
|
||||
|
||||
# LIMIT EVALUATION
|
||||
eval_n_limit = args.eval_n_limit
|
||||
if eval_n_limit:
|
||||
swe_bench_tests = swe_bench_tests.head(eval_n_limit)
|
||||
logger.info(f'Limiting evaluation to first {eval_n_limit} instances.')
|
||||
|
||||
# OUTPUT FILE
|
||||
output_file = os.path.join(eval_output_dir, 'output.jsonl')
|
||||
logger.info(f'Writing evaluation output to {output_file}')
|
||||
finished_instance_ids = set()
|
||||
if os.path.exists(output_file):
|
||||
with open(output_file, 'r') as f:
|
||||
for line in f:
|
||||
data = json.loads(line)
|
||||
finished_instance_ids.add(data['instance_id'])
|
||||
logger.warning(
|
||||
f'Output file {output_file} already exists. Loaded {len(finished_instance_ids)} finished instances.'
|
||||
)
|
||||
output_fp = open(output_file, 'a')
|
||||
|
||||
logger.info(
|
||||
f'Evaluation started with Agent {agent_class}, model {model_name}, max iterations {max_iterations}.'
|
||||
)
|
||||
|
||||
# filter out finished instances
|
||||
new_swe_bench_tests = []
|
||||
for idx, instance in swe_bench_tests.iterrows():
|
||||
if instance.instance_id in finished_instance_ids:
|
||||
logger.info(
|
||||
f'Skipping instance {instance.instance_id} as it is already finished.'
|
||||
)
|
||||
continue
|
||||
new_swe_bench_tests.append(instance)
|
||||
|
||||
swe_bench_tests = pd.DataFrame(new_swe_bench_tests)
|
||||
logger.info(
|
||||
f'Finished instances: {len(finished_instance_ids)}, Remaining instances: {len(swe_bench_tests)}'
|
||||
)
|
||||
|
||||
pbar = tqdm(total=len(swe_bench_tests))
|
||||
|
||||
def update_progress(future):
|
||||
pbar.update(1)
|
||||
output = future.result()
|
||||
pbar.set_description(f'Instance {output["instance_id"]}')
|
||||
pbar.set_postfix_str(f'Test Result: {output["test_result"]["result"]}')
|
||||
logger.info(
|
||||
f'Finished evaluation for instance {output["instance_id"]}: {output["test_result"]["result"]}'
|
||||
)
|
||||
output_fp.write(json.dumps(output) + '\n')
|
||||
output_fp.flush()
|
||||
|
||||
num_workers = args.eval_num_workers
|
||||
logger.info(f'Using {num_workers} workers for evaluation.')
|
||||
|
||||
skip_workspace_mount = agent_class == 'CodeActAgent'
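# CodeActAgent works entirely inside the sandbox, so the host workspace mount is skipped
# (presumably to avoid accidentally overwriting files on the host)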
|
||||
logger.info(f'Skipping workspace mount: {skip_workspace_mount}')
|
||||
try:
|
||||
with ProcessPoolExecutor(num_workers) as executor:
|
||||
futures = []
|
||||
for row_idx, instance in swe_bench_tests.iterrows():
|
||||
future = executor.submit(
|
||||
process_instance,
|
||||
instance,
|
||||
agent_class,
|
||||
metadata,
|
||||
skip_workspace_mount,
|
||||
reset_logger=bool(num_workers > 1),
|
||||
)
|
||||
future.add_done_callback(update_progress)
|
||||
futures.append(future)
|
||||
|
||||
# Wait for all futures to complete
|
||||
for future in futures:
|
||||
future.result()
|
||||
except KeyboardInterrupt:
|
||||
print('KeyboardInterrupt received. Cleaning up...')
|
||||
cleanup()
|
||||
|
||||
output_fp.close()
|
||||
logger.info('Evaluation finished.')
|
||||
evaluation/swe_bench/scripts/docker/Dockerfile.builder (new file, 17 lines)
@@ -0,0 +1,17 @@
|
||||
FROM ghcr.io/opendevin/sandbox:latest
|
||||
|
||||
RUN apt-get update && \
|
||||
apt-get install -y libffi-dev bash gcc git jq wget pkg-config libfreetype-dev libfreetype6 libfreetype6-dev rsync && \
|
||||
apt-get clean && \
|
||||
rm -rf /var/lib/apt/lists/*
|
||||
|
||||
RUN ln -sfn /bin/bash /bin/sh
|
||||
RUN mkdir -p /opendevin/logs && chmod 777 /opendevin/logs
|
||||
|
||||
# Setup Git
|
||||
RUN git config --global user.email "swebench@swebench.ai"
|
||||
RUN git config --global user.name "swebench"
|
||||
|
||||
CMD ["/bin/bash"]
|
||||
# pushd evaluation/swe_bench
|
||||
# docker build -t ghcr.io/opendevin/eval-swe-bench:builder -f ./scripts/docker/Dockerfile.builder .
|
||||
evaluation/swe_bench/scripts/docker/Dockerfile.builder_with_conda (new file, 19 lines)
@@ -0,0 +1,19 @@
|
||||
FROM ghcr.io/opendevin/eval-swe-bench:builder
|
||||
|
||||
# # Install Mamba/Conda
|
||||
RUN wget "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
|
||||
# install to /opt/miniforge3
|
||||
RUN mkdir /swe_util
|
||||
RUN bash Miniforge3-$(uname)-$(uname -m).sh -b -p /swe_util/miniforge3
|
||||
RUN export PATH=/swe_util/miniforge3/bin:$PATH
|
||||
RUN /swe_util/miniforge3/bin/mamba init bash
|
||||
|
||||
# Setup SWE-Bench Eval Env
|
||||
RUN /bin/bash -c "/swe_util/miniforge3/bin/mamba create -n swe-bench-eval python==3.11.5 -y"
|
||||
RUN /bin/bash -c ". /swe_util/miniforge3/etc/profile.d/conda.sh && conda activate swe-bench-eval && \
|
||||
pip install requests python-dotenv GitPython datasets pandas beautifulsoup4 ghapi"
|
||||
RUN /bin/bash -c ". /swe_util/miniforge3/etc/profile.d/conda.sh && conda config --set changeps1 False && conda config --append channels conda-forge"
|
||||
|
||||
CMD ["/bin/bash"]
|
||||
# pushd evaluation/swe_bench
|
||||
# docker build -t ghcr.io/opendevin/eval-swe-bench:builder_with_conda -f ./scripts/docker/Dockerfile.builder_with_conda .
|
||||
evaluation/swe_bench/scripts/docker/Dockerfile.full.v1.0 (new file, 13 lines)
@@ -0,0 +1,13 @@
|
||||
FROM ghcr.io/opendevin/eval-swe-bench:full_deps
|
||||
|
||||
# ================== COPY Smaller things ==================
|
||||
# copy everything except the `eval_data` and `miniforge3` folders
|
||||
# typically, this should be the OD codebase
|
||||
RUN --mount=type=bind,source=./eval_workspace,target=/eval_workspace \
|
||||
rsync -ar --progress \
|
||||
--exclude='eval_data' \
|
||||
--exclude='miniforge3' \
|
||||
/eval_workspace/ /swe_util/
|
||||
|
||||
# pushd evaluation/swe_bench
|
||||
# docker build -t ghcr.io/opendevin/eval-swe-bench:full-v1.0 -f ./scripts/docker/Dockerfile.full.v1.0 .
|
||||
evaluation/swe_bench/scripts/docker/Dockerfile.full_deps (new file, 72 lines)
@@ -0,0 +1,72 @@
|
||||
FROM ghcr.io/opendevin/eval-swe-bench:builder
|
||||
|
||||
# This Dockerfile is used to build the Docker image for SWE-Bench evaluation.
|
||||
# YOU SHOULD ENSURE ./eval_workspace CONTAINS THE EVALUATION WORKSPACE (testbed, conda)
|
||||
# Check BUILD_TESTBED_AND_ENV.md for more details.
|
||||
|
||||
RUN mkdir -p /swe_util
|
||||
|
||||
# Use https://github.com/moby/moby/issues/15771#issuecomment-1762893340
|
||||
# to copy files from host to container with --exclude
|
||||
|
||||
# # ================== Prepare Eval Data ==================
|
||||
# Copy everything in eval_data except the "testbeds"
|
||||
RUN --mount=type=bind,source=./eval_workspace,target=/eval_workspace \
|
||||
rsync -ar --progress \
|
||||
--exclude='testbeds' \
|
||||
/eval_workspace/eval_data /swe_util/
|
||||
|
||||
RUN --mount=type=bind,source=./eval_workspace,target=/eval_workspace \
|
||||
rsync -ar --progress \
|
||||
--exclude='matplotlib*' \
|
||||
--exclude='scikit-learn*' \
|
||||
/eval_workspace/eval_data/testbeds /swe_util/eval_data/
|
||||
|
||||
# # copy the larger ones in separate layers
|
||||
# COPY ./eval_workspace/eval_data/testbeds/matplotlib* /swe_util/eval_data/testbeds/
|
||||
RUN --mount=type=bind,source=./eval_workspace,target=/eval_workspace \
|
||||
rsync -ar --progress \
|
||||
/eval_workspace/eval_data/testbeds/matplotlib* /swe_util/eval_data/testbeds/
|
||||
|
||||
# COPY ./eval_workspace/eval_data/testbeds/scikit-learn* /swe_util/eval_data/testbeds/
|
||||
RUN --mount=type=bind,source=./eval_workspace,target=/eval_workspace \
|
||||
rsync -ar --progress \
|
||||
/eval_workspace/eval_data/testbeds/scikit-learn* /swe_util/eval_data/testbeds/
|
||||
|
||||
# ================== Prepare Miniconda3 ==================
|
||||
# Copy the Miniconda3 environment
|
||||
# copy everything except the folder of `envs` & `pkgs` (two large folders)
|
||||
RUN --mount=type=bind,source=./eval_workspace,target=/eval_workspace \
|
||||
rsync -ar --progress \
|
||||
--exclude='envs' \
|
||||
--exclude='pkgs' \
|
||||
/eval_workspace/miniforge3 /swe_util/
|
||||
|
||||
# copy pkgs in separate layers (~9.4GB)
|
||||
RUN --mount=type=bind,source=./eval_workspace,target=/eval_workspace \
|
||||
rsync -ar --progress \
|
||||
/eval_workspace/miniforge3/pkgs /swe_util/miniforge3/
|
||||
|
||||
# copy envs in separate layers (except matplotlib & scikit-learn - larger ones)
|
||||
RUN --mount=type=bind,source=./eval_workspace,target=/eval_workspace \
|
||||
rsync -ar --progress \
|
||||
--exclude='matplotlib*' \
|
||||
--exclude='scikit-learn*' \
|
||||
--exclude='pydata*' \
|
||||
/eval_workspace/miniforge3/envs /swe_util/miniforge3/
|
||||
|
||||
# copy the larger ones in separate layers
|
||||
RUN --mount=type=bind,source=./eval_workspace,target=/eval_workspace \
|
||||
rsync -ar --progress \
|
||||
/eval_workspace/miniforge3/envs/matplotlib* /swe_util/miniforge3/envs/
|
||||
|
||||
RUN --mount=type=bind,source=./eval_workspace,target=/eval_workspace \
|
||||
rsync -ar --progress \
|
||||
/eval_workspace/miniforge3/envs/scikit-learn* /swe_util/miniforge3/envs/
|
||||
|
||||
RUN --mount=type=bind,source=./eval_workspace,target=/eval_workspace \
|
||||
rsync -ar --progress \
|
||||
/eval_workspace/miniforge3/envs/pydata* /swe_util/miniforge3/envs/
|
||||
|
||||
# pushd evaluation/swe_bench
|
||||
# docker build -t ghcr.io/opendevin/eval-swe-bench:full_deps -f ./scripts/docker/Dockerfile.full_deps .
|
||||
evaluation/swe_bench/scripts/docker/README.md (new file, 13 lines)
@@ -0,0 +1,13 @@
|
||||
# Docker Build Guide
|
||||
|
||||
## Builder
|
||||
|
||||
This builds the Docker images used by `evaluation/swe_bench/scripts/prepare_swe_utils.sh` to download and prepare the datasets.
|
||||
|
||||
```bash
|
||||
pushd evaluation/swe_bench
|
||||
# This builds the base image with basic dependencies
|
||||
docker build -t ghcr.io/opendevin/eval-swe-bench:builder -f ./scripts/docker/Dockerfile.builder .
|
||||
# This builds the image with the SWE-Bench conda environment pre-installed
|
||||
docker build -t ghcr.io/opendevin/eval-swe-bench:builder_with_conda -f ./scripts/docker/Dockerfile.builder_with_conda .
|
||||
```
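
After the images are built, they are consumed by the data-preparation step, which clones OD-SWE-bench and downloads the evaluation data into `eval_workspace`. As a rough sketch (the path below assumes the setup-script location used elsewhere in this PR), run from the repository root:

```bash
# uses the builder_with_conda image to populate eval_workspace/eval_data
bash evaluation/swe_bench/scripts/setup/prepare_swe_utils.sh
```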
|
||||
evaluation/swe_bench/scripts/eval_infer.sh (new executable file, 35 lines)
@@ -0,0 +1,35 @@
|
||||
#!/bin/bash
|
||||
|
||||
PROCESS_FILEPATH=$1
|
||||
if [ -z "$PROCESS_FILEPATH" ]; then
|
||||
echo "Error: PROCESS_FILEPATH is empty. Usage: ./eval_infer.sh <output_file>"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
if [ ! -f $PROCESS_FILEPATH ]; then
|
||||
echo "Error: $PROCESS_FILEPATH is not a file"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
PROCESS_FILEPATH=$(realpath $PROCESS_FILEPATH)
|
||||
FILE_DIR=$(dirname $PROCESS_FILEPATH)
|
||||
FILE_NAME=$(basename $PROCESS_FILEPATH)
|
||||
mkdir -p $FILE_DIR/eval_logs
|
||||
mkdir -p $FILE_DIR/swe_bench_format
|
||||
|
||||
echo "Evaluating $FILE_NAME @ $FILE_DIR"
|
||||
echo "Merged output file with fine-grained report will be saved to $FILE_DIR"
|
||||
|
||||
docker run --rm \
|
||||
-v $FILE_DIR:/swe_bench_output \
|
||||
-e MINICONDA3=/swe_util/miniforge3 \
|
||||
-e OD_SWE_BENCH=/swe_util/OD-SWE-bench \
|
||||
-e EVAL_DATA_DIR=/swe_util/eval_data \
|
||||
-w /swe_util \
|
||||
ghcr.io/opendevin/eval-swe-bench:full-v1.0 \
|
||||
bash -c "./get_agent_report.sh --output-file /swe_bench_output/$FILE_NAME \
|
||||
--agent-name CodeActAgent \
|
||||
--dataset swe-bench-test-lite \
|
||||
--experiment-name test_experiment \
|
||||
--merge-report && cp -r /swe_util/eval_data/eval_logs/test_experiment/* /swe_bench_output/eval_logs \
|
||||
&& cp -r /swe_util/eval_data/outputs/* /swe_bench_output/swe_bench_format/"
|
||||
evaluation/swe_bench/scripts/run_infer.sh (new executable file, 15 lines)
@@ -0,0 +1,15 @@
|
||||
#!/bin/bash
|
||||
|
||||
AGENT=CodeActAgent
|
||||
AGENT_VERSION=v1.3
|
||||
MODEL_CONFIG=$1
|
||||
|
||||
# You should add $MODEL_CONFIG in your `config.toml`
|
||||
|
||||
poetry run python3 evaluation/swe_bench/run_infer.py \
|
||||
--agent-cls $AGENT \
|
||||
--llm-config $MODEL_CONFIG \
|
||||
--max-iterations 50 \
|
||||
--max-chars 10000000 \
|
||||
--eval-num-workers 8 \
|
||||
--eval-note $AGENT_VERSION
|
||||
evaluation/swe_bench/scripts/setup/_swe_env_setup.sh (new executable file, 81 lines)
@@ -0,0 +1,81 @@
|
||||
#!/bin/bash
|
||||
# THIS SCRIPT ONLY NEEDS TO BE RUN ONCE BEFORE EVALUATION
|
||||
set -e
|
||||
|
||||
function setup_environment_and_testbed {
|
||||
local instance_file_name=$1
|
||||
|
||||
# throw error if user name is not opendevin
|
||||
if [ "$USER" != "opendevin" ]; then
|
||||
echo "Error: This script is intended to be run by the 'opendevin' user only." >&2
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# =======================================================
|
||||
# Install & Setup Conda
|
||||
|
||||
# Install Miniforge3 into /swe_util/miniforge3 if it is not already present
|
||||
if [ ! -d /swe_util/miniforge3 ]; then
|
||||
pushd /swe_util
|
||||
echo "Downloading and installing Miniforge3"
|
||||
wget "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
|
||||
bash Miniforge3-$(uname)-$(uname -m).sh -b -p /swe_util/miniforge3
|
||||
fi
|
||||
|
||||
echo 'export PATH=/swe_util/miniforge3/bin:$PATH' >> ~/.bashrc
|
||||
eval "$(/swe_util/miniforge3/bin/conda shell.bash hook)"
|
||||
conda init bash
|
||||
source ~/.bashrc
|
||||
conda config --set changeps1 False
|
||||
conda config --append channels conda-forge
|
||||
|
||||
# =======================================================
|
||||
# Install swe-bench-eval environment if it does not exist
|
||||
ENV_EXISTS=$(conda info --envs | awk '/swe-bench-eval/ {print $1}')
|
||||
echo "ENV_EXISTS: $ENV_EXISTS"
|
||||
if [ -z "$ENV_EXISTS" ]; then
|
||||
echo "Environment swe-bench-eval does not exist. Creating the environment."
|
||||
conda create -n swe-bench-eval python==3.11.5 -y
|
||||
conda activate swe-bench-eval
|
||||
pip install requests python-dotenv GitPython datasets pandas beautifulsoup4 ghapi
|
||||
fi
|
||||
conda activate swe-bench-eval
|
||||
echo 'swe-bench-eval environment is ready.'
|
||||
|
||||
# =======================================================
|
||||
# Locate the swe-bench-test-lite.json / swe-bench-test.json instance data file
|
||||
INSTANCE_DATA_FILE=/swe_util/eval_data/instances/$instance_file_name
|
||||
echo "Instance data file loaded: $INSTANCE_DATA_FILE"
|
||||
|
||||
# =======================================================
|
||||
# generate testbed & conda environment for ALL instances in the test file
|
||||
echo "Generating testbed & conda environment for all instances in the test file"
|
||||
export PYTHONPATH=/swe_util/OD-SWE-bench:$PYTHONPATH
|
||||
python3 /swe_util/OD-SWE-bench/swebench/harness/engine_testbed.py \
|
||||
--instances_path $INSTANCE_DATA_FILE \
|
||||
--log_dir /swe_util/eval_data/testbed_logs \
|
||||
--conda_path /swe_util/miniforge3 \
|
||||
--testbed /swe_util/eval_data/testbeds \
|
||||
--timeout 1000
|
||||
|
||||
# Check every log in /swe_util/eval_data/testbed_logs to see if they contains "Init Succeeded"
|
||||
# If not, print the log file name and exit
|
||||
for log_file in /swe_util/eval_data/testbed_logs/*; do
|
||||
if ! grep -q "Init Succeeded" $log_file; then
|
||||
echo "Error: $log_file does not contain 'Init Succeeded'"
|
||||
exit 1
|
||||
fi
|
||||
done
|
||||
echo "All logs contain 'Init Succeeded'. Testbed & conda environment setup is successful."
|
||||
}
|
||||
|
||||
# check if $1 is either swe-bench-test-lite.json or swe-bench-test.json
|
||||
if [ "$1" != "swe-bench-test-lite.json" ] && [ "$1" != "swe-bench-test.json" ]; then
|
||||
echo "Error: Invalid input file name. Please provide either swe-bench-test-lite.json or swe-bench-test.json"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# call the function
|
||||
echo "Calling setup_environment_and_testbed with $1"
|
||||
setup_environment_and_testbed $1
|
||||
evaluation/swe_bench/scripts/setup/get_agent_report.sh (new executable file, 86 lines)
@@ -0,0 +1,86 @@
|
||||
#!/bin/bash
|
||||
|
||||
# Initialize variables
|
||||
output_file=""
|
||||
agent_name=""
|
||||
dataset=""
|
||||
num_processes=15
|
||||
experiment_name=""
|
||||
merge_report=false
|
||||
|
||||
# Parse command-line arguments
|
||||
while [[ "$#" -gt 0 ]]; do
|
||||
case $1 in
|
||||
--output-file) output_file="$2"; shift ;;
|
||||
--agent-name) agent_name="$2"; shift ;;
|
||||
--dataset) dataset="$2"; shift ;;
|
||||
--num-processes) num_processes="$2"; shift ;;
|
||||
--experiment-name) experiment_name="$2"; shift ;;
|
||||
--merge-report) merge_report=true ;;
|
||||
*) echo "Unknown parameter passed: $1"; exit 1 ;;
|
||||
esac
|
||||
shift
|
||||
done
|
||||
|
||||
# Check if arguments are provided
|
||||
if [[ -z "$output_file" || -z "$agent_name" || -z "$dataset" ]]; then
|
||||
echo "output-file, agent-name and dataset are required!"
|
||||
exit 1
|
||||
fi
|
||||
echo "output file: $output_file"
|
||||
echo "agent name: $agent_name"
|
||||
echo "dataset: $dataset"
|
||||
echo "num processes: $num_processes"
|
||||
if [ ! -z "$experiment_name" ]
|
||||
then
|
||||
echo "use provided experiment name: $experiment_name"
|
||||
else
|
||||
current_folder=$(basename $(dirname $output_file))
|
||||
parent_folder=$(basename $(dirname $(dirname $output_file)))
|
||||
experiment_name="${parent_foler}_${current_folder}"
|
||||
echo "use generated experiment name: $experiment_name"
|
||||
fi
|
||||
|
||||
# Convert the agent output to the SWE-Bench format
|
||||
if [ -z "$EVAL_DATA_DIR" ]; then
|
||||
echo "EVAL_DATA_DIR is not set."
|
||||
exit 1
|
||||
fi
|
||||
target_file="${EVAL_DATA_DIR}/outputs/${experiment_name}_${dataset}.json"
|
||||
python process_output_json_file.py $output_file $agent_name $target_file
|
||||
|
||||
# Run the evaluation script
|
||||
if [ -z "$OD_SWE_BENCH" ]; then
|
||||
echo "OD_SWE_BENCH is not set."
|
||||
exit 1
|
||||
fi
|
||||
if [ -z "$MINICONDA3" ]; then
|
||||
echo "MINICONDA3 is not set."
|
||||
exit 1
|
||||
fi
|
||||
mkdir -p $EVAL_DATA_DIR/eval_logs/$experiment_name
|
||||
export PYTHONPATH=$OD_SWE_BENCH && cd $OD_SWE_BENCH && . $MINICONDA3/etc/profile.d/conda.sh && conda activate $MINICONDA3/envs/swe-bench-eval && python swebench/harness/run_evaluation.py \
|
||||
--swe_bench_tasks $EVAL_DATA_DIR/instances/$dataset.json \
|
||||
--temp_dir $EVAL_DATA_DIR/eval_temp \
|
||||
--testbed $EVAL_DATA_DIR/testbeds \
|
||||
--conda_path $MINICONDA3 \
|
||||
--predictions_path $target_file \
|
||||
--log_dir $EVAL_DATA_DIR/eval_logs/$experiment_name \
|
||||
--num_processes $num_processes \
|
||||
--skip_existing \
|
||||
--timeout 1600 \
|
||||
--verbose
|
||||
|
||||
# Get the report
|
||||
cp $target_file $EVAL_DATA_DIR/eval_logs/$experiment_name
|
||||
export PYTHONPATH=$OD_SWE_BENCH && cd $OD_SWE_BENCH && . $MINICONDA3/etc/profile.d/conda.sh && conda activate $MINICONDA3/envs/swe-bench-eval && python swebench/metrics/get_model_report.py \
|
||||
--model $agent_name \
|
||||
--swe_bench_tasks $EVAL_DATA_DIR/instances/$dataset.json \
|
||||
--predictions_path $EVAL_DATA_DIR/eval_logs/$experiment_name/${experiment_name}_${dataset}.json \
|
||||
--log_dir $EVAL_DATA_DIR/eval_logs/$experiment_name/$agent_name
|
||||
|
||||
# Merge report to the agent output
|
||||
if [ "$merge_report" = true ]; then
|
||||
cd /swe_util && python merge_fine_grained_report.py --od_output_file $output_file \
|
||||
--fine_grained_report_file $EVAL_DATA_DIR/eval_logs/$experiment_name/${experiment_name}_${dataset}.report.json
|
||||
fi
|
||||
evaluation/swe_bench/scripts/setup/get_model_report.sh (new executable file, 61 lines)
@@ -0,0 +1,61 @@
|
||||
#!/bin/bash
|
||||
|
||||
# Input arguments
|
||||
output_file=""
|
||||
model_name=""
|
||||
dataset=""
|
||||
num_processes=15
|
||||
experiment_name=""
|
||||
|
||||
# Parse command-line arguments
|
||||
while [[ "$#" -gt 0 ]]; do
|
||||
case $1 in
|
||||
--output-file) output_file="$2"; shift ;;
|
||||
--model-name) model_name="$2"; shift ;;
|
||||
--dataset) dataset="$2"; shift ;;
|
||||
--num-processes) num_processes="$2"; shift ;;
|
||||
--experiment-name) experiment_name="$2"; shift ;;
|
||||
*) echo "Unknown parameter passed: $1"; exit 1 ;;
|
||||
esac
|
||||
shift
|
||||
done
|
||||
|
||||
# Check if arguments are provided
|
||||
if [[ -z "$output_file" || -z "$model_name" || -z "$dataset" ]]; then
|
||||
echo "output-file, model-name and dataset are required!"
|
||||
exit 1
|
||||
fi
|
||||
echo "output file: $output_file"
|
||||
echo "model name: $model_name"
|
||||
echo "dataset: $dataset"
|
||||
echo "num processes: $num_processes"
|
||||
if [ ! -z "$experiment_name" ]
|
||||
then
|
||||
echo "use provided experiment name: $experiment_name"
|
||||
else
|
||||
experiment_name=${model_name}__${dataset}
|
||||
echo "use generated experiment name: $experiment_name"
|
||||
fi
|
||||
|
||||
# Run the evaluation script
|
||||
mkdir -p $EVAL_DATA_DIR/eval_logs/$experiment_name
|
||||
export PYTHONPATH=$OD_SWE_BENCH && cd $OD_SWE_BENCH && . $MINICONDA3/etc/profile.d/conda.sh && conda activate $MINICONDA3/envs/swe-bench-eval && python swebench/harness/run_evaluation.py \
|
||||
--swe_bench_tasks $EVAL_DATA_DIR/instances/$dataset.json \
|
||||
--temp_dir $EVAL_DATA_DIR/eval_temp \
|
||||
--testbed $EVAL_DATA_DIR/testbeds \
|
||||
--conda_path $MINICONDA3 \
|
||||
--predictions_path $output_file \
|
||||
--log_dir $EVAL_DATA_DIR/eval_logs/$experiment_name \
|
||||
--num_processes $num_processes \
|
||||
--skip_existing \
|
||||
--timeout 1600 \
|
||||
--verbose
|
||||
|
||||
# Get the report
|
||||
predictions_fname=$(basename $output_file)
|
||||
cp $output_file $EVAL_DATA_DIR/eval_logs/$experiment_name
|
||||
export PYTHONPATH=$OD_SWE_BENCH && cd $OD_SWE_BENCH && . $MINICONDA3/etc/profile.d/conda.sh && conda activate $MINICONDA3/envs/swe-bench-eval && python swebench/metrics/get_model_report.py \
|
||||
--model $model_name \
|
||||
--swe_bench_tasks $EVAL_DATA_DIR/instances/$dataset.json \
|
||||
--predictions_path $EVAL_DATA_DIR/eval_logs/$experiment_name/$predictions_fname \
|
||||
--log_dir $EVAL_DATA_DIR/eval_logs/$experiment_name/$model_name
|
||||
evaluation/swe_bench/scripts/setup/merge_fine_grained_report.py (new file, 29 lines)
@@ -0,0 +1,29 @@
|
||||
import argparse
|
||||
import json
|
||||
|
||||
|
||||
def merge_fine_grained_report(od_output_file, fine_grained_report_file):
|
||||
merged_od_output_file = od_output_file.replace('.jsonl', '.merged.jsonl')
|
||||
merged_report = []
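# fine_grained_report is a dict keyed by instance_id; each agent output line gets
# the matching per-instance report attached under 'fine_grained_report'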
|
||||
fine_grained_report = json.load(open(fine_grained_report_file))
|
||||
for line in open(od_output_file):
|
||||
line = json.loads(line)
|
||||
instance_id = line['instance_id']
|
||||
line['fine_grained_report'] = fine_grained_report[instance_id]
|
||||
merged_report.append(line)
|
||||
# dump the merged report as a jsonl file
|
||||
with open(merged_od_output_file, 'w') as f:
|
||||
for line in merged_report:
|
||||
f.write(json.dumps(line) + '\n')
|
||||
print(f'Agent output with report merged created at {merged_od_output_file}')
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument('--od_output_file', help='Path to the OD output file')
|
||||
parser.add_argument(
|
||||
'--fine_grained_report_file', help='Path to the fine grained report file'
|
||||
)
|
||||
args = parser.parse_args()
|
||||
|
||||
merge_fine_grained_report(args.od_output_file, args.fine_grained_report_file)
|
||||
evaluation/swe_bench/scripts/setup/prepare_swe_utils.sh (new executable file, 27 lines)
@@ -0,0 +1,27 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -e
|
||||
EVAL_WORKSPACE="evaluation/swe_bench/eval_workspace"
|
||||
mkdir -p $EVAL_WORKSPACE
|
||||
|
||||
# 1. Prepare REPO
|
||||
echo "==== Prepare SWE-bench repo ===="
|
||||
OD_SWE_BENCH_REPO_PATH="https://github.com/OpenDevin/OD-SWE-bench.git"
|
||||
OD_SWE_BENCH_REPO_BRANCH="eval"
|
||||
git clone -b $OD_SWE_BENCH_REPO_BRANCH $OD_SWE_BENCH_REPO_PATH $EVAL_WORKSPACE/OD-SWE-bench
|
||||
|
||||
# 2. Prepare DATA
|
||||
echo "==== Prepare SWE-bench data ===="
|
||||
EVAL_IMAGE=ghcr.io/opendevin/eval-swe-bench:builder_with_conda
|
||||
EVAL_WORKSPACE=$(realpath $EVAL_WORKSPACE)
|
||||
chmod +x $EVAL_WORKSPACE/OD-SWE-bench/swebench/harness/prepare_data.sh
|
||||
if [ -d $EVAL_WORKSPACE/eval_data ]; then
|
||||
rm -r $EVAL_WORKSPACE/eval_data
|
||||
fi
|
||||
docker run \
|
||||
-v $EVAL_WORKSPACE:/workspace \
|
||||
-w /workspace \
|
||||
-u $(id -u):$(id -g) \
|
||||
-e HF_DATASETS_CACHE="/tmp" \
|
||||
--rm -it $EVAL_IMAGE \
|
||||
bash -c "cd OD-SWE-bench/swebench/harness && /swe_util/miniforge3/bin/conda run -n swe-bench-eval ./prepare_data.sh && mv eval_data /workspace/"
|
||||
evaluation/swe_bench/scripts/setup/process_output_json_file.py (new file, 35 lines)
@@ -0,0 +1,35 @@
|
||||
import json
|
||||
import sys
|
||||
|
||||
|
||||
def process_jsonl(input_file, model_name, output_file):
|
||||
try:
|
||||
with open(input_file, 'r') as infile, open(output_file, 'w') as outfile:
|
||||
data = []
|
||||
for line in infile:
|
||||
if line.strip(): # Ensure the line is not empty
|
||||
json_obj = json.loads(line)
|
||||
# Create new object with required fields and new model_name
|
||||
new_obj = {
|
||||
'instance_id': json_obj['instance_id'],
|
||||
'model_patch': json_obj['git_patch'],
|
||||
'model_name_or_path': model_name,
|
||||
}
|
||||
data.append(new_obj)
|
||||
json.dump(
|
||||
data, outfile, indent=2
|
||||
) # Write the list of JSON objects to a file
|
||||
print(f'Output JSON list created at {output_file}')
|
||||
except Exception as e:
|
||||
print(f'Error: {str(e)}')
|
||||
|
||||
|
||||
# Usage: python script.py input.jsonl model_name output.json
|
||||
if __name__ == '__main__':
|
||||
if len(sys.argv) != 4:
|
||||
print('Usage: python script.py <input_file> <model_name> <output_file>')
|
||||
else:
|
||||
input_file = sys.argv[1]
|
||||
model_name = sys.argv[2]
|
||||
output_file = sys.argv[3]
|
||||
process_jsonl(input_file, model_name, output_file)
|
||||
evaluation/swe_bench/scripts/setup/swe_entry.sh (new executable file, 96 lines)
@@ -0,0 +1,96 @@
|
||||
#!/bin/bash
|
||||
|
||||
set -e
|
||||
|
||||
# assert user name is `root`
|
||||
if [ "$USER" != "root" ]; then
|
||||
echo "Error: This script is intended to be run by the 'root' user only." >&2
|
||||
exit 1
|
||||
fi
|
||||
|
||||
source ~/.bashrc
|
||||
|
||||
SWEUTIL_DIR=/swe_util
|
||||
|
||||
# Create logs directory
|
||||
LOG_DIR=/opendevin/logs
|
||||
mkdir -p $LOG_DIR && chmod 777 $LOG_DIR
|
||||
|
||||
# FIXME: Cannot read SWE_INSTANCE_ID from the environment variable
|
||||
# SWE_INSTANCE_ID=django__django-11099
|
||||
if [ -z "$SWE_INSTANCE_ID" ]; then
|
||||
echo "Error: SWE_INSTANCE_ID is not set." >&2
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Read the swe-bench-test-lite.json file and extract the required item based on instance_id
|
||||
item=$(jq --arg INSTANCE_ID "$SWE_INSTANCE_ID" '.[] | select(.instance_id == $INSTANCE_ID)' $SWEUTIL_DIR/eval_data/instances/swe-bench-test-lite.json)
|
||||
|
||||
if [[ -z "$item" ]]; then
|
||||
echo "No item found for the provided instance ID."
|
||||
exit 1
|
||||
fi
|
||||
|
||||
CONDA_ENV_NAME=$(echo "$item" | jq -r '.repo + "__" + .version | gsub("/"; "__")')
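# e.g. repo "django/django" + version "3.0" -> "django__django__3.0", which matches
# both the testbed folder name and the per-instance conda environment name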
|
||||
|
||||
echo "CONDA_ENV_NAME: $CONDA_ENV_NAME"
|
||||
|
||||
SWE_TASK_DIR=/opendevin/swe_tasks
|
||||
mkdir -p $SWE_TASK_DIR
|
||||
# Dump test_patch to $SWE_TASK_DIR/test.patch
|
||||
echo "$item" | jq -r '.test_patch' > $SWE_TASK_DIR/test.patch
|
||||
# Dump patch to $SWE_TASK_DIR/gold.patch
|
||||
echo "$item" | jq -r '.patch' > $SWE_TASK_DIR/gold.patch
|
||||
# Dump the item to $SWE_TASK_DIR/instance.json except for the "test_patch" and "patch" fields
|
||||
echo "$item" | jq 'del(.test_patch, .patch)' > $SWE_TASK_DIR/instance.json
|
||||
|
||||
# Clear the workspace
|
||||
rm -rf /workspace/*
|
||||
# Copy repo to workspace
|
||||
if [ -d /workspace/$CONDA_ENV_NAME ]; then
|
||||
rm -rf /workspace/$CONDA_ENV_NAME
|
||||
fi
|
||||
cp -r $SWEUTIL_DIR/eval_data/testbeds/$CONDA_ENV_NAME /workspace
|
||||
|
||||
# Reset swe-bench testbed and install the repo
|
||||
. $SWEUTIL_DIR/miniforge3/etc/profile.d/conda.sh
|
||||
conda config --set changeps1 False
|
||||
conda config --append channels conda-forge
|
||||
conda activate swe-bench-eval
|
||||
|
||||
mkdir -p $SWE_TASK_DIR/reset_testbed_temp
|
||||
mkdir -p $SWE_TASK_DIR/reset_testbed_log_dir
|
||||
SWE_BENCH_DIR=/swe_util/OD-SWE-bench
|
||||
output=$(
|
||||
export PYTHONPATH=$SWE_BENCH_DIR && \
|
||||
cd $SWE_BENCH_DIR && \
|
||||
python swebench/harness/reset_swe_env.py \
|
||||
--swe_bench_tasks $SWEUTIL_DIR/eval_data/instances/swe-bench-test.json \
|
||||
--temp_dir $SWE_TASK_DIR/reset_testbed_temp \
|
||||
--testbed /workspace \
|
||||
--conda_path $SWEUTIL_DIR/miniforge3 \
|
||||
--instance_id $SWE_INSTANCE_ID \
|
||||
--log_dir $SWE_TASK_DIR/reset_testbed_log_dir \
|
||||
--timeout 900 \
|
||||
--verbose
|
||||
)
|
||||
|
||||
REPO_PATH=$(echo "$output" | awk -F': ' '/repo_path:/ {print $2}')
|
||||
TEST_CMD=$(echo "$output" | awk -F': ' '/test_cmd:/ {print $2}')
|
||||
echo "Repo Path: $REPO_PATH"
|
||||
echo "Test Command: $TEST_CMD"
|
||||
|
||||
echo "export SWE_BENCH_DIR=\"$SWE_BENCH_DIR\"" >> ~/.bashrc
|
||||
echo "export REPO_PATH=\"$REPO_PATH\"" >> ~/.bashrc
|
||||
echo "export TEST_CMD=\"$TEST_CMD\"" >> ~/.bashrc
|
||||
|
||||
if [[ "$REPO_PATH" == "None" ]]; then
|
||||
echo "Error: Failed to retrieve repository path. Tests may not have passed or output was not as expected." >&2
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Activate instance-specific environment
|
||||
. $SWEUTIL_DIR/miniforge3/etc/profile.d/conda.sh
|
||||
conda activate $CONDA_ENV_NAME
|
||||
|
||||
set +e
|
||||
evaluation/swe_bench/scripts/setup/swe_env_setup.sh (new executable file, 31 lines)
@@ -0,0 +1,31 @@
|
||||
#!/bin/bash
|
||||
# THIS SCRIPT ONLY NEEDS TO BE RUN ONCE BEFORE EVALUATION
|
||||
|
||||
EVAL_DOCKER_IMAGE=ghcr.io/opendevin/eval-swe-bench:builder
|
||||
EVAL_WORKSPACE="evaluation/swe_bench/eval_workspace"
|
||||
EVAL_WORKSPACE=$(realpath $EVAL_WORKSPACE)
|
||||
|
||||
SETUP_INSTANCE_FILENAME=swe-bench-test.json # OR swe-bench-test-lite.json
|
||||
|
||||
if [ ! -d $EVAL_WORKSPACE ]; then
|
||||
mkdir -p $EVAL_WORKSPACE
|
||||
fi
|
||||
|
||||
if [ -f $EVAL_WORKSPACE/swe_env_setup.sh ]; then
|
||||
rm $EVAL_WORKSPACE/swe_env_setup.sh
|
||||
fi
|
||||
SCRIPT_DIR=evaluation/swe_bench/scripts/setup
|
||||
|
||||
cp $SCRIPT_DIR/_swe_env_setup.sh $EVAL_WORKSPACE/swe_env_setup.sh
|
||||
cp $SCRIPT_DIR/swe_entry.sh $EVAL_WORKSPACE/swe_entry.sh
|
||||
cp $SCRIPT_DIR/get_model_report.sh $EVAL_WORKSPACE/get_model_report.sh
|
||||
cp $SCRIPT_DIR/get_agent_report.sh $EVAL_WORKSPACE/get_agent_report.sh
|
||||
cp $SCRIPT_DIR/process_output_json_file.py $EVAL_WORKSPACE/process_output_json_file.py
|
||||
cp $SCRIPT_DIR/merge_fine_grained_report.py $EVAL_WORKSPACE/merge_fine_grained_report.py
|
||||
|
||||
docker run \
|
||||
-v $EVAL_WORKSPACE:/swe_util \
|
||||
-e UID=$(id -u) \
|
||||
--rm -it $EVAL_DOCKER_IMAGE \
|
||||
bash -c "useradd -rm -d /home/opendevin -s /bin/bash -u $(id -u) opendevin && su opendevin -c 'bash /swe_util/swe_env_setup.sh $SETUP_INSTANCE_FILENAME'"
|
||||
#
|
||||
evaluation/swe_bench/swe_env_box.py (new file, 204 lines)
@@ -0,0 +1,204 @@
|
||||
import sys
|
||||
import uuid
|
||||
|
||||
from opendevin.core.config import config
|
||||
from opendevin.core.logger import opendevin_logger as logger
|
||||
from opendevin.runtime.docker.ssh_box import DockerSSHBox
|
||||
from opendevin.runtime.plugins import JupyterRequirement, SWEAgentCommandsRequirement
|
||||
|
||||
SWE_BENCH_CONTAINER_IMAGE = 'ghcr.io/opendevin/eval-swe-bench:full-v1.0'
|
||||
|
||||
|
||||
class SWEBenchSSHBox(DockerSSHBox):
|
||||
def __init__(
|
||||
self,
|
||||
container_image: str,
|
||||
timeout: int = 120,
|
||||
sid: str | None = None,
|
||||
swe_instance_id: str | None = None,
|
||||
swe_instance: dict | None = None,
|
||||
skip_workspace_mount: bool = True,
|
||||
):
|
||||
if swe_instance_id is None:
|
||||
raise ValueError('swe_instance_id must be provided!')
|
||||
self.swe_instance_id = swe_instance_id
|
||||
self.swe_instance = swe_instance
|
||||
self.skip_workspace_mount = skip_workspace_mount
|
||||
|
||||
assert (
|
||||
container_image is not None
|
||||
), 'container_image is required for SWEBenchSSHBox!'
|
||||
# Need to run as root to use SWEBench container
|
||||
sid = f'swe_bench_{swe_instance_id}' + str(uuid.uuid4())
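# the session id combines the instance id with a random uuid so repeated or
# concurrent runs of the same instance get distinct containers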
|
||||
super().__init__(container_image, timeout, sid)
|
||||
|
||||
exit_code, output = self.execute('mv ~/.bashrc ~/.bashrc.bak')
|
||||
assert exit_code == 0, f'Failed to backup ~/.bashrc: {output}'
|
||||
|
||||
exit_code, output = self.execute(
|
||||
f"echo 'export SWE_INSTANCE_ID={self.swe_instance_id}' >> ~/.bashrc && echo 'export PIP_CACHE_DIR=~/.cache/pip' >> ~/.bashrc && echo \"alias git='git --no-pager'\" >> ~/.bashrc"
|
||||
)
|
||||
assert exit_code == 0, f'Failed to set SWE_INSTANCE_ID in ~/.bashrc: {output}'
|
||||
|
||||
logger.info('Sourcing swe_entry.sh to set up environment variables')
|
||||
# larger timeout for SWEBench init to account for long-running installations (e.g., those that require compilation)
|
||||
exit_code, output = self.execute('source /swe_util/swe_entry.sh', timeout=600)
|
||||
logger.info('exit code: %d', exit_code)
|
||||
logger.info(output)
|
||||
assert exit_code == 0, f'Failed to source swe_entry.sh: {output}'
|
||||
logger.info('Sourced swe_entry.sh successfully')
|
||||
|
||||
@property
|
||||
def volumes(self):
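# When skip_workspace_mount is set, drop the host workspace bind mount so the
# repository under /workspace exists only inside the container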
|
||||
if self.skip_workspace_mount:
|
||||
return {
|
||||
k: v
|
||||
for k, v in super().volumes.items()
|
||||
if not v['bind'] == self.sandbox_workspace_dir
|
||||
}
|
||||
return super().volumes
|
||||
|
||||
@classmethod
|
||||
def get_box_for_instance(
|
||||
cls,
|
||||
instance,
|
||||
workspace_dir_name=None,
|
||||
n_tries=5,
|
||||
skip_workspace_mount: bool = True,
|
||||
workspace_mount_path: str | None = None,
|
||||
) -> 'SWEBenchSSHBox':
|
||||
if workspace_dir_name is None:
|
||||
workspace_dir_name = f"{instance['repo']}__{instance['version']}".replace(
|
||||
'/', '__'
|
||||
)
|
||||
config.workspace_base = workspace_mount_path
|
||||
config.workspace_mount_path = workspace_mount_path
|
||||
sandbox = cls(
|
||||
container_image=SWE_BENCH_CONTAINER_IMAGE,
|
||||
swe_instance_id=instance['instance_id'],
|
||||
swe_instance=instance,
|
||||
skip_workspace_mount=skip_workspace_mount,
|
||||
)
|
||||
logger.info(f"SSH box started for instance {instance['instance_id']}.")
|
||||
|
||||
# cd to the repo
|
||||
exit_code, output = sandbox.execute(f'cd /workspace/{workspace_dir_name}')
|
||||
if exit_code != 0:
|
||||
logger.error(f'Failed to cd to the repo: {output}')
|
||||
sys.exit(1)
|
||||
|
||||
# remove all future commits & remote following Devin
|
||||
# https://www.cognition-labs.com/post/swe-bench-technical-report
|
||||
exit_code, output = sandbox.execute('git reset --hard')
|
||||
if exit_code != 0:
|
||||
logger.error(f'Failed to reset the repo: {output}')
|
||||
sys.exit(1)
|
||||
exit_code, output = sandbox.execute(
|
||||
'for remote_name in $(git remote); do git remote remove "${remote_name}"; done'
|
||||
)
|
||||
if exit_code != 0:
|
||||
logger.error(f'Failed to remove remote: {output}')
|
||||
sys.exit(1)
|
||||
return sandbox
|
||||
|
||||
def get_diff_patch(self):
|
||||
# add everything to the index
|
||||
exit_code, output = self.execute('git add --all')
|
||||
if exit_code != 0:
|
||||
logger.error('Failed to add everything to the index')
|
||||
return ''
|
||||
|
||||
# get the git diff
|
||||
exit_code, git_patch = self.execute(
|
||||
f'git diff --no-color --cached {self.swe_instance["base_commit"]}'
|
||||
)
|
||||
if exit_code != 0:
|
||||
logger.error('Failed to get git diff')
|
||||
return ''
|
||||
return git_patch
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
EXAMPLE_INSTANCE = {
|
||||
'repo': 'django/django',
|
||||
'instance_id': 'django__django-11099',
|
||||
'base_commit': 'd26b2424437dabeeca94d7900b37d2df4410da0c',
|
||||
'patch': "diff --git a/django/contrib/auth/validators.py b/django/contrib/auth/validators.py\n--- a/django/contrib/auth/validators.py\n+++ b/django/contrib/auth/validators.py\n@@ -7,7 +7,7 @@\n \n @deconstructible\n class ASCIIUsernameValidator(validators.RegexValidator):\n- regex = r'^[\\w.@+-]+$'\n+ regex = r'^[\\w.@+-]+\\Z'\n message = _(\n 'Enter a valid username. This value may contain only English letters, '\n 'numbers, and @/./+/-/_ characters.'\n@@ -17,7 +17,7 @@ class ASCIIUsernameValidator(validators.RegexValidator):\n \n @deconstructible\n class UnicodeUsernameValidator(validators.RegexValidator):\n- regex = r'^[\\w.@+-]+$'\n+ regex = r'^[\\w.@+-]+\\Z'\n message = _(\n 'Enter a valid username. This value may contain only letters, '\n 'numbers, and @/./+/-/_ characters.'\n",
|
||||
'test_patch': "diff --git a/tests/auth_tests/test_validators.py b/tests/auth_tests/test_validators.py\n--- a/tests/auth_tests/test_validators.py\n+++ b/tests/auth_tests/test_validators.py\n@@ -237,7 +237,7 @@ def test_unicode_validator(self):\n invalid_usernames = [\n \"o'connell\", \"عبد ال\",\n \"zerowidth\\u200Bspace\", \"nonbreaking\\u00A0space\",\n- \"en\\u2013dash\",\n+ \"en\\u2013dash\", 'trailingnewline\\u000A',\n ]\n v = validators.UnicodeUsernameValidator()\n for valid in valid_usernames:\n@@ -250,7 +250,7 @@ def test_unicode_validator(self):\n \n def test_ascii_validator(self):\n valid_usernames = ['glenn', 'GLEnN', 'jean-marc']\n- invalid_usernames = [\"o'connell\", 'Éric', 'jean marc', \"أحمد\"]\n+ invalid_usernames = [\"o'connell\", 'Éric', 'jean marc', \"أحمد\", 'trailingnewline\\n']\n v = validators.ASCIIUsernameValidator()\n for valid in valid_usernames:\n with self.subTest(valid=valid):\n",
|
||||
'problem_statement': "UsernameValidator allows trailing newline in usernames\nDescription\n\t\nASCIIUsernameValidator and UnicodeUsernameValidator use the regex \nr'^[\\w.@+-]+$'\nThe intent is to only allow alphanumeric characters as well as ., @, +, and -. However, a little known quirk of Python regexes is that $ will also match a trailing newline. Therefore, the user name validators will accept usernames which end with a newline. You can avoid this behavior by instead using \\A and \\Z to terminate regexes. For example, the validator regex could be changed to\nr'\\A[\\w.@+-]+\\Z'\nin order to reject usernames that end with a newline.\nI am not sure how to officially post a patch, but the required change is trivial - using the regex above in the two validators in contrib.auth.validators.\n",
|
||||
'hints_text': '',
|
||||
'created_at': '2019-03-20T03:46:18Z',
|
||||
'version': '3.0',
|
||||
'FAIL_TO_PASS': '["test_ascii_validator (auth_tests.test_validators.UsernameValidatorsTests)", "test_unicode_validator (auth_tests.test_validators.UsernameValidatorsTests)", "test_help_text (auth_tests.test_validators.UserAttributeSimilarityValidatorTest)"]',
|
||||
'PASS_TO_PASS': '["test_help_text (auth_tests.test_validators.MinimumLengthValidatorTest)", "test_validate (auth_tests.test_validators.MinimumLengthValidatorTest)", "test_help_text (auth_tests.test_validators.NumericPasswordValidatorTest)", "test_validate (auth_tests.test_validators.NumericPasswordValidatorTest)", "test_validate (auth_tests.test_validators.UserAttributeSimilarityValidatorTest)", "test_validate_property (auth_tests.test_validators.UserAttributeSimilarityValidatorTest)", "test_empty_password_validator_help_text_html (auth_tests.test_validators.PasswordValidationTest)", "test_get_default_password_validators (auth_tests.test_validators.PasswordValidationTest)", "test_get_password_validators_custom (auth_tests.test_validators.PasswordValidationTest)", "test_password_changed (auth_tests.test_validators.PasswordValidationTest)", "test_password_changed_with_custom_validator (auth_tests.test_validators.PasswordValidationTest)", "test_password_validators_help_text_html (auth_tests.test_validators.PasswordValidationTest)", "test_password_validators_help_text_html_escaping (auth_tests.test_validators.PasswordValidationTest)", "test_password_validators_help_texts (auth_tests.test_validators.PasswordValidationTest)", "test_validate_password (auth_tests.test_validators.PasswordValidationTest)", "test_help_text (auth_tests.test_validators.CommonPasswordValidatorTest)", "test_validate (auth_tests.test_validators.CommonPasswordValidatorTest)", "test_validate_custom_list (auth_tests.test_validators.CommonPasswordValidatorTest)", "test_validate_django_supplied_file (auth_tests.test_validators.CommonPasswordValidatorTest)"]',
|
||||
'environment_setup_commit': '419a78300f7cd27611196e1e464d50fd0385ff27',
|
||||
}
|
||||
|
||||
sandbox = SWEBenchSSHBox.get_box_for_instance(instance=EXAMPLE_INSTANCE)
|
||||
|
||||
# in actual eval, this will be initialized by the controller
|
||||
sandbox.init_plugins([JupyterRequirement(), SWEAgentCommandsRequirement()])
|
||||
|
||||
# PRE TEST
|
||||
exit_code, output = sandbox.execute('cd $REPO_PATH')
|
||||
assert exit_code == 0, 'Failed to cd $REPO_PATH'
|
||||
logger.info(f'cd $REPO_PATH: {output}')
|
||||
|
||||
# apply test patch
|
||||
exit_code, output = sandbox.execute('git apply $SWE_TASK_DIR/test.patch')
|
||||
assert exit_code == 0, 'Failed to apply test patch'
|
||||
logger.info(f'git apply $SWE_TASK_DIR/test.patch: {output}')
|
||||
|
||||
# TEST
|
||||
exit_code, output = sandbox.execute(
|
||||
'./tests/runtests.py --verbosity 2 auth_tests.test_validators'
|
||||
)
|
||||
assert exit_code == 1, 'Expected exit code 1 (since this is a FAIL_TO_PASS)'
|
||||
logger.info(f'$TEST_CMD:\n{output}')
|
||||
|
||||
# apply gold patch
|
||||
exit_code, output = sandbox.execute('git apply $SWE_TASK_DIR/gold.patch')
|
||||
logger.info('exit code: %d', exit_code)
|
||||
logger.info(f'git apply $SWE_TASK_DIR/gold.patch: {output}')
|
||||
|
||||
# TEST
|
||||
exit_code, output = sandbox.execute(
|
||||
'./tests/runtests.py --verbosity 2 auth_tests.test_validators'
|
||||
)
|
||||
assert exit_code == 0, 'Expected exit code 0 (since we applied the gold patch)'
|
||||
logger.info(f'$TEST_CMD:\n{output}')
|
||||
|
||||
# Reset the repo
|
||||
exit_code, output = sandbox.execute('git reset --hard')
|
||||
assert exit_code == 0, 'Failed to reset the repo'
|
||||
logger.info(f'git reset --hard: {output}')
|
||||
|
||||
bg_cmd = sandbox.execute_in_background(
|
||||
"while true; do echo 'dot ' && sleep 10; done"
|
||||
)
|
||||
|
||||
sys.stdout.flush()
|
||||
try:
|
||||
while True:
|
||||
try:
|
||||
user_input = input('>>> ')
|
||||
except EOFError:
|
||||
logger.info('Exiting...')
|
||||
break
|
||||
if user_input.lower() == 'exit':
|
||||
logger.info('Exiting...')
|
||||
break
|
||||
if user_input.lower() == 'kill':
|
||||
sandbox.kill_background(bg_cmd.pid)
|
||||
logger.info('Background process killed')
|
||||
continue
|
||||
exit_code, output = sandbox.execute(user_input)
|
||||
logger.info('exit code: %d', exit_code)
|
||||
logger.info(output)
|
||||
if bg_cmd.pid in sandbox.background_commands:
|
||||
logs = sandbox.read_logs(bg_cmd.pid)
|
||||
logger.info('background logs: %s', logs)
|
||||
sys.stdout.flush()
|
||||
except KeyboardInterrupt:
|
||||
logger.info('Exiting...')
|
||||
sandbox.close()
|
||||
@@ -370,6 +370,30 @@ def get_parser():
|
||||
type=int,
|
||||
help='The maximum number of characters to send to and receive from LLM per task',
|
||||
)
|
||||
parser.add_argument(
|
||||
'--eval-output-dir',
|
||||
default='evaluation/evaluation_outputs/outputs',
|
||||
type=str,
|
||||
help='The directory to save evaluation output',
|
||||
)
|
||||
parser.add_argument(
|
||||
'--eval-n-limit',
|
||||
default=None,
|
||||
type=int,
|
||||
help='The number of instances to evaluate',
|
||||
)
|
||||
parser.add_argument(
|
||||
'--eval-num-workers',
|
||||
default=4,
|
||||
type=int,
|
||||
help='The number of workers to use for evaluation',
|
||||
)
|
||||
parser.add_argument(
|
||||
'--eval-note',
|
||||
default=None,
|
||||
type=str,
|
||||
help='The note to add to the evaluation directory',
|
||||
)
|
||||
parser.add_argument(
|
||||
'-l',
|
||||
'--llm-config',
|
||||
|
||||
@@ -81,11 +81,10 @@ def get_console_handler():
|
||||
return console_handler
|
||||
|
||||
|
||||
def get_file_handler():
|
||||
def get_file_handler(log_dir=os.path.join(os.getcwd(), 'logs')):
|
||||
"""
|
||||
Returns a file handler for logging.
|
||||
"""
|
||||
log_dir = os.path.join(os.getcwd(), 'logs')
|
||||
os.makedirs(log_dir, exist_ok=True)
|
||||
timestamp = datetime.now().strftime('%Y-%m-%d')
|
||||
file_name = f'opendevin_{timestamp}.log'
|
||||
|
||||
@@ -7,12 +7,14 @@ from opendevin.controller import AgentController
|
||||
from opendevin.controller.agent import Agent
|
||||
from opendevin.controller.state.state import State
|
||||
from opendevin.core.config import args, get_llm_config_arg
|
||||
from opendevin.core.logger import opendevin_logger as logger
|
||||
from opendevin.core.schema import AgentState
|
||||
from opendevin.events.action import ChangeAgentStateAction, MessageAction
|
||||
from opendevin.events.event import Event
|
||||
from opendevin.events.observation import AgentStateChangedObservation
|
||||
from opendevin.events.stream import EventSource, EventStream, EventStreamSubscriber
|
||||
from opendevin.llm.llm import LLM
|
||||
from opendevin.runtime.sandbox import Sandbox
|
||||
from opendevin.runtime.server.runtime import ServerRuntime
|
||||
|
||||
|
||||
@@ -31,6 +33,7 @@ async def main(
|
||||
task_str: str = '',
|
||||
exit_on_message: bool = False,
|
||||
fake_user_response_fn: Optional[Callable[[Optional[State]], str]] = None,
|
||||
sandbox: Optional[Sandbox] = None,
|
||||
) -> Optional[State]:
|
||||
"""Main coroutine to run the agent controller with task input flexibility.
|
||||
It's only used when you launch opendevin backend directly via cmdline.
|
||||
@@ -39,6 +42,7 @@ async def main(
|
||||
task_str: The task to run.
|
||||
exit_on_message: quit if agent asks for a message from user (optional)
|
||||
fake_user_response_fn: An optional function that receives the current state (could be None) and returns a fake user response.
|
||||
sandbox: An optional sandbox to run the agent in.
|
||||
"""
|
||||
|
||||
# Determine the task source
|
||||
@@ -62,7 +66,7 @@ async def main(
|
||||
if llm_config is None:
|
||||
raise ValueError(f'Invalid toml file, cannot read {args.llm_config}')
|
||||
|
||||
print(
|
||||
logger.info(
|
||||
f'Running agent {args.agent_cls} (model: {llm_config.model}, llm_config: {llm_config}) with task: "{task}"'
|
||||
)
|
||||
|
||||
@@ -70,7 +74,7 @@ async def main(
|
||||
llm = LLM(llm_config=llm_config)
|
||||
else:
|
||||
# --model-name model_name
|
||||
print(
|
||||
logger.info(
|
||||
f'Running agent {args.agent_cls} (model: {args.model_name}), with task: "{task}"'
|
||||
)
|
||||
llm = LLM(args.model_name)
|
||||
@@ -85,7 +89,7 @@ async def main(
|
||||
max_chars=args.max_chars,
|
||||
event_stream=event_stream,
|
||||
)
|
||||
runtime = ServerRuntime(event_stream=event_stream)
|
||||
runtime = ServerRuntime(event_stream=event_stream, sandbox=sandbox)
|
||||
runtime.init_sandbox_plugins(controller.agent.sandbox_plugins)
|
||||
|
||||
await event_stream.add_event(MessageAction(content=task), EventSource.USER)
|
||||
|
||||
@@ -24,6 +24,9 @@ class CmdOutputObservation(Observation):
|
||||
def message(self) -> str:
|
||||
return f'Command `{self.command}` executed with exit code {self.exit_code}.'
|
||||
|
||||
def __str__(self) -> str:
|
||||
return f'**CmdOutputObservation (exit code={self.exit_code})**\n{self.content}'
|
||||
|
||||
|
||||
@dataclass
|
||||
class IPythonRunCellObservation(Observation):
|
||||
@@ -41,3 +44,6 @@ class IPythonRunCellObservation(Observation):
|
||||
@property
|
||||
def message(self) -> str:
|
||||
return 'Code executed in IPython cell.'
|
||||
|
||||
def __str__(self) -> str:
|
||||
return f'**IPythonRunCellObservation**\n{self.content}'
|
||||
|
||||
@@ -1,9 +1,8 @@
|
||||
import warnings
|
||||
from functools import partial
|
||||
|
||||
import warnings
|
||||
|
||||
with warnings.catch_warnings():
|
||||
warnings.simplefilter("ignore")
|
||||
warnings.simplefilter('ignore')
|
||||
import litellm
|
||||
from litellm import completion as litellm_completion
|
||||
from litellm import completion_cost as litellm_completion_cost
|
||||
|
||||
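Suppressing warnings around the litellm import keeps startup logs quiet, and `functools.partial` is the usual way to pre-bind model settings for repeated completion calls. A hedged sketch of that pattern (the model name and temperature are placeholders; this is not the project's LLM wrapper itself):

```python
import warnings
from functools import partial

# Import litellm with its import-time warnings silenced, as in the diff above.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    from litellm import completion as litellm_completion

# Bind the model and sampling settings once; call sites then only pass messages.
completion = partial(litellm_completion, model='gpt-4-1106-preview', temperature=0.0)
# response = completion(messages=[{'role': 'user', 'content': 'Say hello.'}])
```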
@@ -2,6 +2,7 @@

set -e

# ADD /opendevin/plugins to PATH to make `jupyter_cli` available
echo 'export PATH=$PATH:/opendevin/plugins/jupyter' >> ~/.bashrc
export PATH=/opendevin/plugins/jupyter:$PATH
@@ -10,11 +11,14 @@ export PATH=/opendevin/plugins/jupyter:$PATH
if [ "$USER" = "opendevin" ]; then
echo 'export PATH=$PATH:/home/opendevin/.local/bin' >> ~/.bashrc
export PATH=$PATH:/home/opendevin/.local/bin
+ export PIP_CACHE_DIR=$HOME/.cache/pip
fi
# if user name is `root`, add '/root/.local/bin' to PATH
if [ "$USER" = "root" ]; then
echo 'export PATH=$PATH:/root/.local/bin' >> ~/.bashrc
export PATH=$PATH:/root/.local/bin
+ export PIP_CACHE_DIR=$HOME/.cache/pip

fi

# Install dependencies
@@ -57,11 +61,13 @@ echo "export JUPYTER_EXEC_SERVER_PID=$JUPYTER_EXEC_SERVER_PID" >> ~/.bashrc
echo "Execution server started with PID: $JUPYTER_EXEC_SERVER_PID"

# Wait until /opendevin/logs/jupyter_kernel_gateway.log contains "is available"
- while ! grep -q "is available" /opendevin/logs/jupyter_kernel_gateway.log; do
+ while ! grep -q "at" /opendevin/logs/jupyter_kernel_gateway.log; do
echo "Waiting for Jupyter kernel gateway to be available..."
sleep 1
done
# Wait until /opendevin/logs/jupyter_execute_server.log contains "Jupyter kernel created for conversation"
- while ! grep -q "Jupyter kernel created for conversation" /opendevin/logs/jupyter_execute_server.log; do
+ while ! grep -q "kernel created" /opendevin/logs/jupyter_execute_server.log; do
echo "Waiting for Jupyter kernel to be created..."
sleep 1
done
echo "Jupyter kernel ready."
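The entry script's readiness check is just polling a log file for a marker string. For reference, the same idea expressed in Python (a sketch with an added timeout; the log paths and markers are copied from the script above, everything else is illustrative):

```python
import time
from pathlib import Path


def wait_for_marker(log_path: str, marker: str, poll: float = 1.0, timeout: float = 120.0) -> None:
    """Block until `marker` appears in `log_path`, mirroring the grep loops above."""
    deadline = time.time() + timeout
    path = Path(log_path)
    while time.time() < deadline:
        if path.exists() and marker in path.read_text(errors='ignore'):
            return
        time.sleep(poll)
    raise TimeoutError(f'{marker!r} never appeared in {log_path}')


# wait_for_marker('/opendevin/logs/jupyter_kernel_gateway.log', 'at')
# wait_for_marker('/opendevin/logs/jupyter_execute_server.log', 'kernel created')
```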
@@ -39,7 +39,7 @@ class PluginMixin:
exit_code, output = self.execute(abs_path_to_bash_script)
if exit_code != 0:
raise RuntimeError(
- f'Failed to initialize plugin {requirement.name} with exit code {exit_code} and output {output}'
+ f'Failed to initialize plugin {requirement.name} with exit code {exit_code} and output: {output}'
)
logger.info(f'Plugin {requirement.name} initialized successfully.')

@@ -47,6 +47,6 @@ class PluginMixin:
exit_code, output = self.execute('source ~/.bashrc')
if exit_code != 0:
raise RuntimeError(
- f'Failed to source ~/.bashrc with exit code {exit_code} and output {output}'
+ f'Failed to source ~/.bashrc with exit code {exit_code} and output: {output}'
)
logger.info('Sourced ~/.bashrc successfully')
@@ -1,5 +1,6 @@
#!/bin/bash

+ export PIP_CACHE_DIR=$HOME/.cache/pip
pip install flake8

# Cursor Mode from SWE-Bench

@@ -1,5 +1,6 @@
#!/bin/bash

+ export PIP_CACHE_DIR=$HOME/.cache/pip
pip install flake8

# Default Mode from SWE-Bench
poetry.lock (generated file, 364 changed lines)
@ -132,6 +132,30 @@ files = [
|
||||
[package.dependencies]
|
||||
frozenlist = ">=1.1.0"
|
||||
|
||||
[[package]]
|
||||
name = "altair"
|
||||
version = "5.3.0"
|
||||
description = "Vega-Altair: A declarative statistical visualization library for Python."
|
||||
optional = false
|
||||
python-versions = ">=3.8"
|
||||
files = [
|
||||
{file = "altair-5.3.0-py3-none-any.whl", hash = "sha256:7084a1dab4d83c5e7e5246b92dc1b4451a6c68fd057f3716ee9d315c8980e59a"},
|
||||
{file = "altair-5.3.0.tar.gz", hash = "sha256:5a268b1a0983b23d8f9129f819f956174aa7aea2719ed55a52eba9979b9f6675"},
|
||||
]
|
||||
|
||||
[package.dependencies]
|
||||
jinja2 = "*"
|
||||
jsonschema = ">=3.0"
|
||||
numpy = "*"
|
||||
packaging = "*"
|
||||
pandas = ">=0.25"
|
||||
toolz = "*"
|
||||
|
||||
[package.extras]
|
||||
all = ["altair-tiles (>=0.3.0)", "anywidget (>=0.9.0)", "pyarrow (>=11)", "vega-datasets (>=0.9.0)", "vegafusion[embed] (>=1.6.6)", "vl-convert-python (>=1.3.0)"]
|
||||
dev = ["geopandas", "hatch", "ipython", "m2r", "mypy", "pandas-stubs", "pytest", "pytest-cov", "ruff (>=0.3.0)", "types-jsonschema", "types-setuptools"]
|
||||
doc = ["docutils", "jinja2", "myst-parser", "numpydoc", "pillow (>=9,<10)", "pydata-sphinx-theme (>=0.14.1)", "scipy", "sphinx", "sphinx-copybutton", "sphinx-design", "sphinxext-altair"]
|
||||
|
||||
[[package]]
|
||||
name = "annotated-types"
|
||||
version = "0.6.0"
|
||||
@ -1659,6 +1683,38 @@ monitor = ["psutil (>=5.7.0)"]
|
||||
recommended = ["cffi (>=1.12.2)", "dnspython (>=1.16.0,<2.0)", "idna", "psutil (>=5.7.0)"]
|
||||
test = ["cffi (>=1.12.2)", "coverage (>=5.0)", "dnspython (>=1.16.0,<2.0)", "idna", "objgraph", "psutil (>=5.7.0)", "requests"]
|
||||
|
||||
[[package]]
|
||||
name = "gitdb"
|
||||
version = "4.0.11"
|
||||
description = "Git Object Database"
|
||||
optional = false
|
||||
python-versions = ">=3.7"
|
||||
files = [
|
||||
{file = "gitdb-4.0.11-py3-none-any.whl", hash = "sha256:81a3407ddd2ee8df444cbacea00e2d038e40150acfa3001696fe0dcf1d3adfa4"},
|
||||
{file = "gitdb-4.0.11.tar.gz", hash = "sha256:bf5421126136d6d0af55bc1e7c1af1c397a34f5b7bd79e776cd3e89785c2b04b"},
|
||||
]
|
||||
|
||||
[package.dependencies]
|
||||
smmap = ">=3.0.1,<6"
|
||||
|
||||
[[package]]
|
||||
name = "gitpython"
|
||||
version = "3.1.43"
|
||||
description = "GitPython is a Python library used to interact with Git repositories"
|
||||
optional = false
|
||||
python-versions = ">=3.7"
|
||||
files = [
|
||||
{file = "GitPython-3.1.43-py3-none-any.whl", hash = "sha256:eec7ec56b92aad751f9912a73404bc02ba212a23adb2c7098ee668417051a1ff"},
|
||||
{file = "GitPython-3.1.43.tar.gz", hash = "sha256:35f314a9f878467f5453cc1fee295c3e18e52f1b99f10f6cf5b1682e968a9e7c"},
|
||||
]
|
||||
|
||||
[package.dependencies]
|
||||
gitdb = ">=4.0.1,<5"
|
||||
|
||||
[package.extras]
|
||||
doc = ["sphinx (==4.3.2)", "sphinx-autodoc-typehints", "sphinx-rtd-theme", "sphinxcontrib-applehelp (>=1.0.2,<=1.0.4)", "sphinxcontrib-devhelp (==1.0.2)", "sphinxcontrib-htmlhelp (>=2.0.0,<=2.0.1)", "sphinxcontrib-qthelp (==1.0.3)", "sphinxcontrib-serializinghtml (==1.1.5)"]
|
||||
test = ["coverage[toml]", "ddt (>=1.1.1,!=1.4.3)", "mock", "mypy", "pre-commit", "pytest (>=7.3.1)", "pytest-cov", "pytest-instafail", "pytest-mock", "pytest-sugar", "typing-extensions"]
|
||||
|
||||
[[package]]
|
||||
name = "google-ai-generativelanguage"
|
||||
version = "0.6.3"
|
||||
@ -2298,6 +2354,41 @@ files = [
|
||||
[package.extras]
|
||||
qa = ["pytest", "pytest-cov", "tox"]
|
||||
|
||||
[[package]]
|
||||
name = "jsonschema"
|
||||
version = "4.22.0"
|
||||
description = "An implementation of JSON Schema validation for Python"
|
||||
optional = false
|
||||
python-versions = ">=3.8"
|
||||
files = [
|
||||
{file = "jsonschema-4.22.0-py3-none-any.whl", hash = "sha256:ff4cfd6b1367a40e7bc6411caec72effadd3db0bbe5017de188f2d6108335802"},
|
||||
{file = "jsonschema-4.22.0.tar.gz", hash = "sha256:5b22d434a45935119af990552c862e5d6d564e8f6601206b305a61fdf661a2b7"},
|
||||
]
|
||||
|
||||
[package.dependencies]
|
||||
attrs = ">=22.2.0"
|
||||
jsonschema-specifications = ">=2023.03.6"
|
||||
referencing = ">=0.28.4"
|
||||
rpds-py = ">=0.7.1"
|
||||
|
||||
[package.extras]
|
||||
format = ["fqdn", "idna", "isoduration", "jsonpointer (>1.13)", "rfc3339-validator", "rfc3987", "uri-template", "webcolors (>=1.11)"]
|
||||
format-nongpl = ["fqdn", "idna", "isoduration", "jsonpointer (>1.13)", "rfc3339-validator", "rfc3986-validator (>0.1.0)", "uri-template", "webcolors (>=1.11)"]
|
||||
|
||||
[[package]]
|
||||
name = "jsonschema-specifications"
|
||||
version = "2023.12.1"
|
||||
description = "The JSON Schema meta-schemas and vocabularies, exposed as a Registry"
|
||||
optional = false
|
||||
python-versions = ">=3.8"
|
||||
files = [
|
||||
{file = "jsonschema_specifications-2023.12.1-py3-none-any.whl", hash = "sha256:87e4fdf3a94858b8a2ba2778d9ba57d8a9cafca7c7489c46ba0d30a8bc6a9c3c"},
|
||||
{file = "jsonschema_specifications-2023.12.1.tar.gz", hash = "sha256:48a76787b3e70f5ed53f1160d2b81f586e4ca6d1548c5de7085d1682674764cc"},
|
||||
]
|
||||
|
||||
[package.dependencies]
|
||||
referencing = ">=0.31.0"
|
||||
|
||||
[[package]]
|
||||
name = "kiwisolver"
|
||||
version = "1.4.5"
|
||||
@ -4710,6 +4801,25 @@ files = [
|
||||
[package.dependencies]
|
||||
typing-extensions = ">=4.6.0,<4.7.0 || >4.7.0"
|
||||
|
||||
[[package]]
|
||||
name = "pydeck"
|
||||
version = "0.9.1"
|
||||
description = "Widget for deck.gl maps"
|
||||
optional = false
|
||||
python-versions = ">=3.8"
|
||||
files = [
|
||||
{file = "pydeck-0.9.1-py2.py3-none-any.whl", hash = "sha256:b3f75ba0d273fc917094fa61224f3f6076ca8752b93d46faf3bcfd9f9d59b038"},
|
||||
{file = "pydeck-0.9.1.tar.gz", hash = "sha256:f74475ae637951d63f2ee58326757f8d4f9cd9f2a457cf42950715003e2cb605"},
|
||||
]
|
||||
|
||||
[package.dependencies]
|
||||
jinja2 = ">=2.10.1"
|
||||
numpy = ">=1.16.4"
|
||||
|
||||
[package.extras]
|
||||
carto = ["pydeck-carto"]
|
||||
jupyter = ["ipykernel (>=5.1.2)", "ipython (>=5.8.0)", "ipywidgets (>=7,<8)", "traitlets (>=4.3.2)"]
|
||||
|
||||
[[package]]
|
||||
name = "pyee"
|
||||
version = "11.0.1"
|
||||
@ -5017,6 +5127,21 @@ files = [
|
||||
{file = "PyYAML-6.0.1.tar.gz", hash = "sha256:bfdf460b1736c775f2ba9f6a92bca30bc2095067b8a9d77876d1fad6cc3b4a43"},
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "referencing"
|
||||
version = "0.35.1"
|
||||
description = "JSON Referencing + Python"
|
||||
optional = false
|
||||
python-versions = ">=3.8"
|
||||
files = [
|
||||
{file = "referencing-0.35.1-py3-none-any.whl", hash = "sha256:eda6d3234d62814d1c64e305c1331c9a3a6132da475ab6382eaa997b21ee75de"},
|
||||
{file = "referencing-0.35.1.tar.gz", hash = "sha256:25b42124a6c8b632a425174f24087783efb348a6f1e0008e63cd4466fedf703c"},
|
||||
]
|
||||
|
||||
[package.dependencies]
|
||||
attrs = ">=22.2.0"
|
||||
rpds-py = ">=0.7.0"
|
||||
|
||||
[[package]]
|
||||
name = "regex"
|
||||
version = "2024.5.10"
|
||||
@ -5162,6 +5287,114 @@ pygments = ">=2.13.0,<3.0.0"
|
||||
[package.extras]
|
||||
jupyter = ["ipywidgets (>=7.5.1,<9)"]
|
||||
|
||||
[[package]]
|
||||
name = "rpds-py"
|
||||
version = "0.18.1"
|
||||
description = "Python bindings to Rust's persistent data structures (rpds)"
|
||||
optional = false
|
||||
python-versions = ">=3.8"
|
||||
files = [
|
||||
{file = "rpds_py-0.18.1-cp310-cp310-macosx_10_12_x86_64.whl", hash = "sha256:d31dea506d718693b6b2cffc0648a8929bdc51c70a311b2770f09611caa10d53"},
|
||||
{file = "rpds_py-0.18.1-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:732672fbc449bab754e0b15356c077cc31566df874964d4801ab14f71951ea80"},
|
||||
{file = "rpds_py-0.18.1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:4a98a1f0552b5f227a3d6422dbd61bc6f30db170939bd87ed14f3c339aa6c7c9"},
|
||||
{file = "rpds_py-0.18.1-cp310-cp310-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:7f1944ce16401aad1e3f7d312247b3d5de7981f634dc9dfe90da72b87d37887d"},
|
||||
{file = "rpds_py-0.18.1-cp310-cp310-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:38e14fb4e370885c4ecd734f093a2225ee52dc384b86fa55fe3f74638b2cfb09"},
|
||||
{file = "rpds_py-0.18.1-cp310-cp310-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:08d74b184f9ab6289b87b19fe6a6d1a97fbfea84b8a3e745e87a5de3029bf944"},
|
||||
{file = "rpds_py-0.18.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:d70129cef4a8d979caa37e7fe957202e7eee8ea02c5e16455bc9808a59c6b2f0"},
|
||||
{file = "rpds_py-0.18.1-cp310-cp310-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:ce0bb20e3a11bd04461324a6a798af34d503f8d6f1aa3d2aa8901ceaf039176d"},
|
||||
{file = "rpds_py-0.18.1-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:81c5196a790032e0fc2464c0b4ab95f8610f96f1f2fa3d4deacce6a79852da60"},
|
||||
{file = "rpds_py-0.18.1-cp310-cp310-musllinux_1_2_i686.whl", hash = "sha256:f3027be483868c99b4985fda802a57a67fdf30c5d9a50338d9db646d590198da"},
|
||||
{file = "rpds_py-0.18.1-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:d44607f98caa2961bab4fa3c4309724b185b464cdc3ba6f3d7340bac3ec97cc1"},
|
||||
{file = "rpds_py-0.18.1-cp310-none-win32.whl", hash = "sha256:c273e795e7a0f1fddd46e1e3cb8be15634c29ae8ff31c196debb620e1edb9333"},
|
||||
{file = "rpds_py-0.18.1-cp310-none-win_amd64.whl", hash = "sha256:8352f48d511de5f973e4f2f9412736d7dea76c69faa6d36bcf885b50c758ab9a"},
|
||||
{file = "rpds_py-0.18.1-cp311-cp311-macosx_10_12_x86_64.whl", hash = "sha256:6b5ff7e1d63a8281654b5e2896d7f08799378e594f09cf3674e832ecaf396ce8"},
|
||||
{file = "rpds_py-0.18.1-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:8927638a4d4137a289e41d0fd631551e89fa346d6dbcfc31ad627557d03ceb6d"},
|
||||
{file = "rpds_py-0.18.1-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:154bf5c93d79558b44e5b50cc354aa0459e518e83677791e6adb0b039b7aa6a7"},
|
||||
{file = "rpds_py-0.18.1-cp311-cp311-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:07f2139741e5deb2c5154a7b9629bc5aa48c766b643c1a6750d16f865a82c5fc"},
|
||||
{file = "rpds_py-0.18.1-cp311-cp311-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:8c7672e9fba7425f79019db9945b16e308ed8bc89348c23d955c8c0540da0a07"},
|
||||
{file = "rpds_py-0.18.1-cp311-cp311-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:489bdfe1abd0406eba6b3bb4fdc87c7fa40f1031de073d0cfb744634cc8fa261"},
|
||||
{file = "rpds_py-0.18.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:3c20f05e8e3d4fc76875fc9cb8cf24b90a63f5a1b4c5b9273f0e8225e169b100"},
|
||||
{file = "rpds_py-0.18.1-cp311-cp311-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:967342e045564cef76dfcf1edb700b1e20838d83b1aa02ab313e6a497cf923b8"},
|
||||
{file = "rpds_py-0.18.1-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:2cc7c1a47f3a63282ab0f422d90ddac4aa3034e39fc66a559ab93041e6505da7"},
|
||||
{file = "rpds_py-0.18.1-cp311-cp311-musllinux_1_2_i686.whl", hash = "sha256:f7afbfee1157e0f9376c00bb232e80a60e59ed716e3211a80cb8506550671e6e"},
|
||||
{file = "rpds_py-0.18.1-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:9e6934d70dc50f9f8ea47081ceafdec09245fd9f6032669c3b45705dea096b88"},
|
||||
{file = "rpds_py-0.18.1-cp311-none-win32.whl", hash = "sha256:c69882964516dc143083d3795cb508e806b09fc3800fd0d4cddc1df6c36e76bb"},
|
||||
{file = "rpds_py-0.18.1-cp311-none-win_amd64.whl", hash = "sha256:70a838f7754483bcdc830444952fd89645569e7452e3226de4a613a4c1793fb2"},
|
||||
{file = "rpds_py-0.18.1-cp312-cp312-macosx_10_12_x86_64.whl", hash = "sha256:3dd3cd86e1db5aadd334e011eba4e29d37a104b403e8ca24dcd6703c68ca55b3"},
|
||||
{file = "rpds_py-0.18.1-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:05f3d615099bd9b13ecf2fc9cf2d839ad3f20239c678f461c753e93755d629ee"},
|
||||
{file = "rpds_py-0.18.1-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:35b2b771b13eee8729a5049c976197ff58a27a3829c018a04341bcf1ae409b2b"},
|
||||
{file = "rpds_py-0.18.1-cp312-cp312-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:ee17cd26b97d537af8f33635ef38be873073d516fd425e80559f4585a7b90c43"},
|
||||
{file = "rpds_py-0.18.1-cp312-cp312-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:b646bf655b135ccf4522ed43d6902af37d3f5dbcf0da66c769a2b3938b9d8184"},
|
||||
{file = "rpds_py-0.18.1-cp312-cp312-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:19ba472b9606c36716062c023afa2484d1e4220548751bda14f725a7de17b4f6"},
|
||||
{file = "rpds_py-0.18.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:6e30ac5e329098903262dc5bdd7e2086e0256aa762cc8b744f9e7bf2a427d3f8"},
|
||||
{file = "rpds_py-0.18.1-cp312-cp312-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:d58ad6317d188c43750cb76e9deacf6051d0f884d87dc6518e0280438648a9ac"},
|
||||
{file = "rpds_py-0.18.1-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:e1735502458621921cee039c47318cb90b51d532c2766593be6207eec53e5c4c"},
|
||||
{file = "rpds_py-0.18.1-cp312-cp312-musllinux_1_2_i686.whl", hash = "sha256:f5bab211605d91db0e2995a17b5c6ee5edec1270e46223e513eaa20da20076ac"},
|
||||
{file = "rpds_py-0.18.1-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:2fc24a329a717f9e2448f8cd1f960f9dac4e45b6224d60734edeb67499bab03a"},
|
||||
{file = "rpds_py-0.18.1-cp312-none-win32.whl", hash = "sha256:1805d5901779662d599d0e2e4159d8a82c0b05faa86ef9222bf974572286b2b6"},
|
||||
{file = "rpds_py-0.18.1-cp312-none-win_amd64.whl", hash = "sha256:720edcb916df872d80f80a1cc5ea9058300b97721efda8651efcd938a9c70a72"},
|
||||
{file = "rpds_py-0.18.1-cp38-cp38-macosx_10_12_x86_64.whl", hash = "sha256:c827576e2fa017a081346dce87d532a5310241648eb3700af9a571a6e9fc7e74"},
|
||||
{file = "rpds_py-0.18.1-cp38-cp38-macosx_11_0_arm64.whl", hash = "sha256:aa3679e751408d75a0b4d8d26d6647b6d9326f5e35c00a7ccd82b78ef64f65f8"},
|
||||
{file = "rpds_py-0.18.1-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:0abeee75434e2ee2d142d650d1e54ac1f8b01e6e6abdde8ffd6eeac6e9c38e20"},
|
||||
{file = "rpds_py-0.18.1-cp38-cp38-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:ed402d6153c5d519a0faf1bb69898e97fb31613b49da27a84a13935ea9164dfc"},
|
||||
{file = "rpds_py-0.18.1-cp38-cp38-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:338dee44b0cef8b70fd2ef54b4e09bb1b97fc6c3a58fea5db6cc083fd9fc2724"},
|
||||
{file = "rpds_py-0.18.1-cp38-cp38-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:7750569d9526199c5b97e5a9f8d96a13300950d910cf04a861d96f4273d5b104"},
|
||||
{file = "rpds_py-0.18.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:607345bd5912aacc0c5a63d45a1f73fef29e697884f7e861094e443187c02be5"},
|
||||
{file = "rpds_py-0.18.1-cp38-cp38-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:207c82978115baa1fd8d706d720b4a4d2b0913df1c78c85ba73fe6c5804505f0"},
|
||||
{file = "rpds_py-0.18.1-cp38-cp38-musllinux_1_2_aarch64.whl", hash = "sha256:6d1e42d2735d437e7e80bab4d78eb2e459af48c0a46e686ea35f690b93db792d"},
|
||||
{file = "rpds_py-0.18.1-cp38-cp38-musllinux_1_2_i686.whl", hash = "sha256:5463c47c08630007dc0fe99fb480ea4f34a89712410592380425a9b4e1611d8e"},
|
||||
{file = "rpds_py-0.18.1-cp38-cp38-musllinux_1_2_x86_64.whl", hash = "sha256:06d218939e1bf2ca50e6b0ec700ffe755e5216a8230ab3e87c059ebb4ea06afc"},
|
||||
{file = "rpds_py-0.18.1-cp38-none-win32.whl", hash = "sha256:312fe69b4fe1ffbe76520a7676b1e5ac06ddf7826d764cc10265c3b53f96dbe9"},
|
||||
{file = "rpds_py-0.18.1-cp38-none-win_amd64.whl", hash = "sha256:9437ca26784120a279f3137ee080b0e717012c42921eb07861b412340f85bae2"},
|
||||
{file = "rpds_py-0.18.1-cp39-cp39-macosx_10_12_x86_64.whl", hash = "sha256:19e515b78c3fc1039dd7da0a33c28c3154458f947f4dc198d3c72db2b6b5dc93"},
|
||||
{file = "rpds_py-0.18.1-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:a7b28c5b066bca9a4eb4e2f2663012debe680f097979d880657f00e1c30875a0"},
|
||||
{file = "rpds_py-0.18.1-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:673fdbbf668dd958eff750e500495ef3f611e2ecc209464f661bc82e9838991e"},
|
||||
{file = "rpds_py-0.18.1-cp39-cp39-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:d960de62227635d2e61068f42a6cb6aae91a7fe00fca0e3aeed17667c8a34611"},
|
||||
{file = "rpds_py-0.18.1-cp39-cp39-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:352a88dc7892f1da66b6027af06a2e7e5d53fe05924cc2cfc56495b586a10b72"},
|
||||
{file = "rpds_py-0.18.1-cp39-cp39-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:4e0ee01ad8260184db21468a6e1c37afa0529acc12c3a697ee498d3c2c4dcaf3"},
|
||||
{file = "rpds_py-0.18.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:e4c39ad2f512b4041343ea3c7894339e4ca7839ac38ca83d68a832fc8b3748ab"},
|
||||
{file = "rpds_py-0.18.1-cp39-cp39-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:aaa71ee43a703c321906813bb252f69524f02aa05bf4eec85f0c41d5d62d0f4c"},
|
||||
{file = "rpds_py-0.18.1-cp39-cp39-musllinux_1_2_aarch64.whl", hash = "sha256:6cd8098517c64a85e790657e7b1e509b9fe07487fd358e19431cb120f7d96338"},
|
||||
{file = "rpds_py-0.18.1-cp39-cp39-musllinux_1_2_i686.whl", hash = "sha256:4adec039b8e2928983f885c53b7cc4cda8965b62b6596501a0308d2703f8af1b"},
|
||||
{file = "rpds_py-0.18.1-cp39-cp39-musllinux_1_2_x86_64.whl", hash = "sha256:32b7daaa3e9389db3695964ce8e566e3413b0c43e3394c05e4b243a4cd7bef26"},
|
||||
{file = "rpds_py-0.18.1-cp39-none-win32.whl", hash = "sha256:2625f03b105328729f9450c8badda34d5243231eef6535f80064d57035738360"},
|
||||
{file = "rpds_py-0.18.1-cp39-none-win_amd64.whl", hash = "sha256:bf18932d0003c8c4d51a39f244231986ab23ee057d235a12b2684ea26a353590"},
|
||||
{file = "rpds_py-0.18.1-pp310-pypy310_pp73-macosx_10_12_x86_64.whl", hash = "sha256:cbfbea39ba64f5e53ae2915de36f130588bba71245b418060ec3330ebf85678e"},
|
||||
{file = "rpds_py-0.18.1-pp310-pypy310_pp73-macosx_11_0_arm64.whl", hash = "sha256:a3d456ff2a6a4d2adcdf3c1c960a36f4fd2fec6e3b4902a42a384d17cf4e7a65"},
|
||||
{file = "rpds_py-0.18.1-pp310-pypy310_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:7700936ef9d006b7ef605dc53aa364da2de5a3aa65516a1f3ce73bf82ecfc7ae"},
|
||||
{file = "rpds_py-0.18.1-pp310-pypy310_pp73-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:51584acc5916212e1bf45edd17f3a6b05fe0cbb40482d25e619f824dccb679de"},
|
||||
{file = "rpds_py-0.18.1-pp310-pypy310_pp73-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:942695a206a58d2575033ff1e42b12b2aece98d6003c6bc739fbf33d1773b12f"},
|
||||
{file = "rpds_py-0.18.1-pp310-pypy310_pp73-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:b906b5f58892813e5ba5c6056d6a5ad08f358ba49f046d910ad992196ea61397"},
|
||||
{file = "rpds_py-0.18.1-pp310-pypy310_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:f6f8e3fecca256fefc91bb6765a693d96692459d7d4c644660a9fff32e517843"},
|
||||
{file = "rpds_py-0.18.1-pp310-pypy310_pp73-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:7732770412bab81c5a9f6d20aeb60ae943a9b36dcd990d876a773526468e7163"},
|
||||
{file = "rpds_py-0.18.1-pp310-pypy310_pp73-musllinux_1_2_aarch64.whl", hash = "sha256:bd1105b50ede37461c1d51b9698c4f4be6e13e69a908ab7751e3807985fc0346"},
|
||||
{file = "rpds_py-0.18.1-pp310-pypy310_pp73-musllinux_1_2_i686.whl", hash = "sha256:618916f5535784960f3ecf8111581f4ad31d347c3de66d02e728de460a46303c"},
|
||||
{file = "rpds_py-0.18.1-pp310-pypy310_pp73-musllinux_1_2_x86_64.whl", hash = "sha256:17c6d2155e2423f7e79e3bb18151c686d40db42d8645e7977442170c360194d4"},
|
||||
{file = "rpds_py-0.18.1-pp38-pypy38_pp73-macosx_10_12_x86_64.whl", hash = "sha256:6c4c4c3f878df21faf5fac86eda32671c27889e13570645a9eea0a1abdd50922"},
|
||||
{file = "rpds_py-0.18.1-pp38-pypy38_pp73-macosx_11_0_arm64.whl", hash = "sha256:fab6ce90574645a0d6c58890e9bcaac8d94dff54fb51c69e5522a7358b80ab64"},
|
||||
{file = "rpds_py-0.18.1-pp38-pypy38_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:531796fb842b53f2695e94dc338929e9f9dbf473b64710c28af5a160b2a8927d"},
|
||||
{file = "rpds_py-0.18.1-pp38-pypy38_pp73-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:740884bc62a5e2bbb31e584f5d23b32320fd75d79f916f15a788d527a5e83644"},
|
||||
{file = "rpds_py-0.18.1-pp38-pypy38_pp73-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:998125738de0158f088aef3cb264a34251908dd2e5d9966774fdab7402edfab7"},
|
||||
{file = "rpds_py-0.18.1-pp38-pypy38_pp73-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:e2be6e9dd4111d5b31ba3b74d17da54a8319d8168890fbaea4b9e5c3de630ae5"},
|
||||
{file = "rpds_py-0.18.1-pp38-pypy38_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:d0cee71bc618cd93716f3c1bf56653740d2d13ddbd47673efa8bf41435a60daa"},
|
||||
{file = "rpds_py-0.18.1-pp38-pypy38_pp73-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:2c3caec4ec5cd1d18e5dd6ae5194d24ed12785212a90b37f5f7f06b8bedd7139"},
|
||||
{file = "rpds_py-0.18.1-pp38-pypy38_pp73-musllinux_1_2_aarch64.whl", hash = "sha256:27bba383e8c5231cd559affe169ca0b96ec78d39909ffd817f28b166d7ddd4d8"},
|
||||
{file = "rpds_py-0.18.1-pp38-pypy38_pp73-musllinux_1_2_i686.whl", hash = "sha256:a888e8bdb45916234b99da2d859566f1e8a1d2275a801bb8e4a9644e3c7e7909"},
|
||||
{file = "rpds_py-0.18.1-pp38-pypy38_pp73-musllinux_1_2_x86_64.whl", hash = "sha256:6031b25fb1b06327b43d841f33842b383beba399884f8228a6bb3df3088485ff"},
|
||||
{file = "rpds_py-0.18.1-pp39-pypy39_pp73-macosx_10_12_x86_64.whl", hash = "sha256:48c2faaa8adfacefcbfdb5f2e2e7bdad081e5ace8d182e5f4ade971f128e6bb3"},
|
||||
{file = "rpds_py-0.18.1-pp39-pypy39_pp73-macosx_11_0_arm64.whl", hash = "sha256:d85164315bd68c0806768dc6bb0429c6f95c354f87485ee3593c4f6b14def2bd"},
|
||||
{file = "rpds_py-0.18.1-pp39-pypy39_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:6afd80f6c79893cfc0574956f78a0add8c76e3696f2d6a15bca2c66c415cf2d4"},
|
||||
{file = "rpds_py-0.18.1-pp39-pypy39_pp73-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:fa242ac1ff583e4ec7771141606aafc92b361cd90a05c30d93e343a0c2d82a89"},
|
||||
{file = "rpds_py-0.18.1-pp39-pypy39_pp73-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:d21be4770ff4e08698e1e8e0bce06edb6ea0626e7c8f560bc08222880aca6a6f"},
|
||||
{file = "rpds_py-0.18.1-pp39-pypy39_pp73-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:5c45a639e93a0c5d4b788b2613bd637468edd62f8f95ebc6fcc303d58ab3f0a8"},
|
||||
{file = "rpds_py-0.18.1-pp39-pypy39_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:910e71711d1055b2768181efa0a17537b2622afeb0424116619817007f8a2b10"},
|
||||
{file = "rpds_py-0.18.1-pp39-pypy39_pp73-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:b9bb1f182a97880f6078283b3505a707057c42bf55d8fca604f70dedfdc0772a"},
|
||||
{file = "rpds_py-0.18.1-pp39-pypy39_pp73-musllinux_1_2_aarch64.whl", hash = "sha256:1d54f74f40b1f7aaa595a02ff42ef38ca654b1469bef7d52867da474243cc633"},
|
||||
{file = "rpds_py-0.18.1-pp39-pypy39_pp73-musllinux_1_2_i686.whl", hash = "sha256:8d2e182c9ee01135e11e9676e9a62dfad791a7a467738f06726872374a83db49"},
|
||||
{file = "rpds_py-0.18.1-pp39-pypy39_pp73-musllinux_1_2_x86_64.whl", hash = "sha256:636a15acc588f70fda1661234761f9ed9ad79ebed3f2125d44be0862708b666e"},
|
||||
{file = "rpds_py-0.18.1.tar.gz", hash = "sha256:dc48b479d540770c811fbd1eb9ba2bb66951863e448efec2e2c102625328e92f"},
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "rsa"
|
||||
version = "4.9"
|
||||
@ -5508,6 +5741,17 @@ files = [
|
||||
{file = "six-1.16.0.tar.gz", hash = "sha256:1e61c37477a1626458e36f7b1d82aa5c9b094fa4802892072e49de9c60c4c926"},
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "smmap"
|
||||
version = "5.0.1"
|
||||
description = "A pure Python implementation of a sliding window memory map manager"
|
||||
optional = false
|
||||
python-versions = ">=3.7"
|
||||
files = [
|
||||
{file = "smmap-5.0.1-py3-none-any.whl", hash = "sha256:e6d8668fa5f93e706934a62d7b4db19c8d9eb8cf2adbb75ef1b675aa332b69da"},
|
||||
{file = "smmap-5.0.1.tar.gz", hash = "sha256:dceeb6c0028fdb6734471eb07c0cd2aae706ccaecab45965ee83f11c8d3b1f62"},
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "sniffio"
|
||||
version = "1.3.1"
|
||||
@ -5634,6 +5878,41 @@ anyio = ">=3.4.0,<5"
|
||||
[package.extras]
|
||||
full = ["httpx (>=0.22.0)", "itsdangerous", "jinja2", "python-multipart (>=0.0.7)", "pyyaml"]
|
||||
|
||||
[[package]]
|
||||
name = "streamlit"
|
||||
version = "1.34.0"
|
||||
description = "A faster way to build and share data apps"
|
||||
optional = false
|
||||
python-versions = "!=3.9.7,>=3.8"
|
||||
files = [
|
||||
{file = "streamlit-1.34.0-py2.py3-none-any.whl", hash = "sha256:411183cf7f525e468eb256b343c914782d11cd894b5e30a4ab3bb1d54e3ae339"},
|
||||
{file = "streamlit-1.34.0.tar.gz", hash = "sha256:135a3b79a686b3132b73f204450ad6e889de04f3349d692925e09f0e21e74b52"},
|
||||
]
|
||||
|
||||
[package.dependencies]
|
||||
altair = ">=4.0,<6"
|
||||
blinker = ">=1.0.0,<2"
|
||||
cachetools = ">=4.0,<6"
|
||||
click = ">=7.0,<9"
|
||||
gitpython = ">=3.0.7,<3.1.19 || >3.1.19,<4"
|
||||
numpy = ">=1.19.3,<2"
|
||||
packaging = ">=16.8,<25"
|
||||
pandas = ">=1.3.0,<3"
|
||||
pillow = ">=7.1.0,<11"
|
||||
protobuf = ">=3.20,<5"
|
||||
pyarrow = ">=7.0"
|
||||
pydeck = ">=0.8.0b4,<1"
|
||||
requests = ">=2.27,<3"
|
||||
rich = ">=10.14.0,<14"
|
||||
tenacity = ">=8.1.0,<9"
|
||||
toml = ">=0.10.1,<2"
|
||||
tornado = ">=6.0.3,<7"
|
||||
typing-extensions = ">=4.3.0,<5"
|
||||
watchdog = {version = ">=2.1.5", markers = "platform_system != \"Darwin\""}
|
||||
|
||||
[package.extras]
|
||||
snowflake = ["snowflake-connector-python (>=2.8.0)", "snowflake-snowpark-python (>=0.9.0)"]
|
||||
|
||||
[[package]]
|
||||
name = "striprtf"
|
||||
version = "0.0.26"
|
||||
@ -5895,6 +6174,17 @@ files = [
|
||||
{file = "toml-0.10.2.tar.gz", hash = "sha256:b3bda1d108d5dd99f4a20d24d9c348e91c4db7ab1b749200bded2f839ccbe68f"},
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "toolz"
|
||||
version = "0.12.1"
|
||||
description = "List processing tools and functional utilities"
|
||||
optional = false
|
||||
python-versions = ">=3.7"
|
||||
files = [
|
||||
{file = "toolz-0.12.1-py3-none-any.whl", hash = "sha256:d22731364c07d72eea0a0ad45bafb2c2937ab6fd38a3507bf55eae8744aa7d85"},
|
||||
{file = "toolz-0.12.1.tar.gz", hash = "sha256:ecca342664893f177a13dac0e6b41cbd8ac25a358e5f215316d43e2100224f4d"},
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "torch"
|
||||
version = "2.2.2"
|
||||
@ -5953,6 +6243,26 @@ typing-extensions = ">=4.8.0"
|
||||
opt-einsum = ["opt-einsum (>=3.3)"]
|
||||
optree = ["optree (>=0.9.1)"]
|
||||
|
||||
[[package]]
|
||||
name = "tornado"
|
||||
version = "6.4"
|
||||
description = "Tornado is a Python web framework and asynchronous networking library, originally developed at FriendFeed."
|
||||
optional = false
|
||||
python-versions = ">= 3.8"
|
||||
files = [
|
||||
{file = "tornado-6.4-cp38-abi3-macosx_10_9_universal2.whl", hash = "sha256:02ccefc7d8211e5a7f9e8bc3f9e5b0ad6262ba2fbb683a6443ecc804e5224ce0"},
|
||||
{file = "tornado-6.4-cp38-abi3-macosx_10_9_x86_64.whl", hash = "sha256:27787de946a9cffd63ce5814c33f734c627a87072ec7eed71f7fc4417bb16263"},
|
||||
{file = "tornado-6.4-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:f7894c581ecdcf91666a0912f18ce5e757213999e183ebfc2c3fdbf4d5bd764e"},
|
||||
{file = "tornado-6.4-cp38-abi3-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:e43bc2e5370a6a8e413e1e1cd0c91bedc5bd62a74a532371042a18ef19e10579"},
|
||||
{file = "tornado-6.4-cp38-abi3-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:f0251554cdd50b4b44362f73ad5ba7126fc5b2c2895cc62b14a1c2d7ea32f212"},
|
||||
{file = "tornado-6.4-cp38-abi3-musllinux_1_1_aarch64.whl", hash = "sha256:fd03192e287fbd0899dd8f81c6fb9cbbc69194d2074b38f384cb6fa72b80e9c2"},
|
||||
{file = "tornado-6.4-cp38-abi3-musllinux_1_1_i686.whl", hash = "sha256:88b84956273fbd73420e6d4b8d5ccbe913c65d31351b4c004ae362eba06e1f78"},
|
||||
{file = "tornado-6.4-cp38-abi3-musllinux_1_1_x86_64.whl", hash = "sha256:71ddfc23a0e03ef2df1c1397d859868d158c8276a0603b96cf86892bff58149f"},
|
||||
{file = "tornado-6.4-cp38-abi3-win32.whl", hash = "sha256:6f8a6c77900f5ae93d8b4ae1196472d0ccc2775cc1dfdc9e7727889145c45052"},
|
||||
{file = "tornado-6.4-cp38-abi3-win_amd64.whl", hash = "sha256:10aeaa8006333433da48dec9fe417877f8bcc21f48dda8d661ae79da357b2a63"},
|
||||
{file = "tornado-6.4.tar.gz", hash = "sha256:72291fa6e6bc84e626589f1c29d90a5a6d593ef5ae68052ee2ef000dfd273dee"},
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "tqdm"
|
||||
version = "4.66.4"
|
||||
@ -6344,6 +6654,47 @@ platformdirs = ">=3.9.1,<5"
|
||||
docs = ["furo (>=2023.7.26)", "proselint (>=0.13)", "sphinx (>=7.1.2,!=7.3)", "sphinx-argparse (>=0.4)", "sphinxcontrib-towncrier (>=0.2.1a0)", "towncrier (>=23.6)"]
|
||||
test = ["covdefaults (>=2.3)", "coverage (>=7.2.7)", "coverage-enable-subprocess (>=1)", "flaky (>=3.7)", "packaging (>=23.1)", "pytest (>=7.4)", "pytest-env (>=0.8.2)", "pytest-freezer (>=0.4.8)", "pytest-mock (>=3.11.1)", "pytest-randomly (>=3.12)", "pytest-timeout (>=2.1)", "setuptools (>=68)", "time-machine (>=2.10)"]
|
||||
|
||||
[[package]]
|
||||
name = "watchdog"
|
||||
version = "4.0.0"
|
||||
description = "Filesystem events monitoring"
|
||||
optional = false
|
||||
python-versions = ">=3.8"
|
||||
files = [
|
||||
{file = "watchdog-4.0.0-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:39cb34b1f1afbf23e9562501673e7146777efe95da24fab5707b88f7fb11649b"},
|
||||
{file = "watchdog-4.0.0-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:c522392acc5e962bcac3b22b9592493ffd06d1fc5d755954e6be9f4990de932b"},
|
||||
{file = "watchdog-4.0.0-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:6c47bdd680009b11c9ac382163e05ca43baf4127954c5f6d0250e7d772d2b80c"},
|
||||
{file = "watchdog-4.0.0-cp311-cp311-macosx_10_9_universal2.whl", hash = "sha256:8350d4055505412a426b6ad8c521bc7d367d1637a762c70fdd93a3a0d595990b"},
|
||||
{file = "watchdog-4.0.0-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:c17d98799f32e3f55f181f19dd2021d762eb38fdd381b4a748b9f5a36738e935"},
|
||||
{file = "watchdog-4.0.0-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:4986db5e8880b0e6b7cd52ba36255d4793bf5cdc95bd6264806c233173b1ec0b"},
|
||||
{file = "watchdog-4.0.0-cp312-cp312-macosx_10_9_universal2.whl", hash = "sha256:11e12fafb13372e18ca1bbf12d50f593e7280646687463dd47730fd4f4d5d257"},
|
||||
{file = "watchdog-4.0.0-cp312-cp312-macosx_10_9_x86_64.whl", hash = "sha256:5369136a6474678e02426bd984466343924d1df8e2fd94a9b443cb7e3aa20d19"},
|
||||
{file = "watchdog-4.0.0-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:76ad8484379695f3fe46228962017a7e1337e9acadafed67eb20aabb175df98b"},
|
||||
{file = "watchdog-4.0.0-cp38-cp38-macosx_10_9_universal2.whl", hash = "sha256:45cc09cc4c3b43fb10b59ef4d07318d9a3ecdbff03abd2e36e77b6dd9f9a5c85"},
|
||||
{file = "watchdog-4.0.0-cp38-cp38-macosx_10_9_x86_64.whl", hash = "sha256:eed82cdf79cd7f0232e2fdc1ad05b06a5e102a43e331f7d041e5f0e0a34a51c4"},
|
||||
{file = "watchdog-4.0.0-cp38-cp38-macosx_11_0_arm64.whl", hash = "sha256:ba30a896166f0fee83183cec913298151b73164160d965af2e93a20bbd2ab605"},
|
||||
{file = "watchdog-4.0.0-cp39-cp39-macosx_10_9_universal2.whl", hash = "sha256:d18d7f18a47de6863cd480734613502904611730f8def45fc52a5d97503e5101"},
|
||||
{file = "watchdog-4.0.0-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:2895bf0518361a9728773083908801a376743bcc37dfa252b801af8fd281b1ca"},
|
||||
{file = "watchdog-4.0.0-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:87e9df830022488e235dd601478c15ad73a0389628588ba0b028cb74eb72fed8"},
|
||||
{file = "watchdog-4.0.0-pp310-pypy310_pp73-macosx_10_9_x86_64.whl", hash = "sha256:6e949a8a94186bced05b6508faa61b7adacc911115664ccb1923b9ad1f1ccf7b"},
|
||||
{file = "watchdog-4.0.0-pp38-pypy38_pp73-macosx_10_9_x86_64.whl", hash = "sha256:6a4db54edea37d1058b08947c789a2354ee02972ed5d1e0dca9b0b820f4c7f92"},
|
||||
{file = "watchdog-4.0.0-pp39-pypy39_pp73-macosx_10_9_x86_64.whl", hash = "sha256:d31481ccf4694a8416b681544c23bd271f5a123162ab603c7d7d2dd7dd901a07"},
|
||||
{file = "watchdog-4.0.0-py3-none-manylinux2014_aarch64.whl", hash = "sha256:8fec441f5adcf81dd240a5fe78e3d83767999771630b5ddfc5867827a34fa3d3"},
|
||||
{file = "watchdog-4.0.0-py3-none-manylinux2014_armv7l.whl", hash = "sha256:6a9c71a0b02985b4b0b6d14b875a6c86ddea2fdbebd0c9a720a806a8bbffc69f"},
|
||||
{file = "watchdog-4.0.0-py3-none-manylinux2014_i686.whl", hash = "sha256:557ba04c816d23ce98a06e70af6abaa0485f6d94994ec78a42b05d1c03dcbd50"},
|
||||
{file = "watchdog-4.0.0-py3-none-manylinux2014_ppc64.whl", hash = "sha256:d0f9bd1fd919134d459d8abf954f63886745f4660ef66480b9d753a7c9d40927"},
|
||||
{file = "watchdog-4.0.0-py3-none-manylinux2014_ppc64le.whl", hash = "sha256:f9b2fdca47dc855516b2d66eef3c39f2672cbf7e7a42e7e67ad2cbfcd6ba107d"},
|
||||
{file = "watchdog-4.0.0-py3-none-manylinux2014_s390x.whl", hash = "sha256:73c7a935e62033bd5e8f0da33a4dcb763da2361921a69a5a95aaf6c93aa03a87"},
|
||||
{file = "watchdog-4.0.0-py3-none-manylinux2014_x86_64.whl", hash = "sha256:6a80d5cae8c265842c7419c560b9961561556c4361b297b4c431903f8c33b269"},
|
||||
{file = "watchdog-4.0.0-py3-none-win32.whl", hash = "sha256:8f9a542c979df62098ae9c58b19e03ad3df1c9d8c6895d96c0d51da17b243b1c"},
|
||||
{file = "watchdog-4.0.0-py3-none-win_amd64.whl", hash = "sha256:f970663fa4f7e80401a7b0cbeec00fa801bf0287d93d48368fc3e6fa32716245"},
|
||||
{file = "watchdog-4.0.0-py3-none-win_ia64.whl", hash = "sha256:9a03e16e55465177d416699331b0f3564138f1807ecc5f2de9d55d8f188d08c7"},
|
||||
{file = "watchdog-4.0.0.tar.gz", hash = "sha256:e3e7065cbdabe6183ab82199d7a4f6b3ba0a438c5a512a68559846ccb76a78ec"},
|
||||
]
|
||||
|
||||
[package.extras]
|
||||
watchmedo = ["PyYAML (>=3.10)"]
|
||||
|
||||
[[package]]
|
||||
name = "watchfiles"
|
||||
version = "0.21.0"
|
||||
@ -6545,6 +6896,17 @@ MarkupSafe = ">=2.1.1"
|
||||
[package.extras]
|
||||
watchdog = ["watchdog (>=2.3)"]
|
||||
|
||||
[[package]]
|
||||
name = "whatthepatch"
|
||||
version = "1.0.5"
|
||||
description = "A patch parsing and application library."
|
||||
optional = false
|
||||
python-versions = ">=3.7"
|
||||
files = [
|
||||
{file = "whatthepatch-1.0.5-py3-none-any.whl", hash = "sha256:6bc41f9f48a63384be4478d8b2d5b22185aac75be853cdcb150a2dc174ede7e1"},
|
||||
{file = "whatthepatch-1.0.5.tar.gz", hash = "sha256:7f374c172812581bc3763587525d14a143aac7fe4220bc4676ecce0d86cb8f08"},
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "wrapt"
|
||||
version = "1.16.0"
|
||||
@ -6933,4 +7295,4 @@ testing = ["coverage (>=5.0.3)", "zope.event", "zope.testing"]
|
||||
[metadata]
|
||||
lock-version = "2.0"
|
||||
python-versions = "^3.11"
|
||||
content-hash = "b531f9f1630faffd0f97de2f24085694bc1cb6d0eb51b37e3ab78de0e858f2fe"
|
||||
content-hash = "e829de805da2becffc84fe38136b06ec5ef457b4f2f3b0ef9aef89002d2b552a"
|
||||
|
||||
@@ -54,8 +54,10 @@ pytest-asyncio = "*"
[tool.coverage.run]
concurrency = ["gevent"]


[tool.poetry.group.evaluation.dependencies]
torch = "2.2.2"
+ streamlit = "*"
+ whatthepatch = "*"

[build-system]
build-backend = "poetry.core.masonry.api"
@@ -10,8 +10,11 @@ def test_help_message(capsys):
captured = capsys.readouterr()
expected_help_message = """
usage: pytest [-h] [-d DIRECTORY] [-t TASK] [-f FILE] [-c AGENT_CLS]
- [-m MODEL_NAME] [-i MAX_ITERATIONS] [-n MAX_CHARS]
- [-l LLM_CONFIG]
+ [-m MODEL_NAME] [-i MAX_ITERATIONS] [-n MAX_CHARS]
+ [--eval-output-dir EVAL_OUTPUT_DIR]
+ [--eval-n-limit EVAL_N_LIMIT]
+ [--eval-num-workers EVAL_NUM_WORKERS] [--eval-note EVAL_NOTE]
+ [-l LLM_CONFIG]

Run an agent with a specific task

@@ -31,12 +34,21 @@ options:
-n MAX_CHARS, --max-chars MAX_CHARS
The maximum number of characters to send to and
receive from LLM per task
+ --eval-output-dir EVAL_OUTPUT_DIR
+ The directory to save evaluation output
+ --eval-n-limit EVAL_N_LIMIT
+ The number of instances to evaluate
+ --eval-num-workers EVAL_NUM_WORKERS
+ The number of workers to use for evaluation
+ --eval-note EVAL_NOTE
+ The note to add to the evaluation directory
-l LLM_CONFIG, --llm-config LLM_CONFIG
The group of llm settings, e.g. a [llama3] section in
the toml file. Overrides model if both are provided.
"""

actual_lines = captured.out.strip().split('\n')
print('\n'.join(actual_lines))
expected_lines = expected_help_message.strip().split('\n')

# Ensure both outputs have the same number of lines
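The updated help-message test reflects the new evaluation flags added to the CLI parser. A minimal argparse sketch that would produce help text of this shape (defaults and the parser wiring are assumptions for illustration, not the project's actual config module):

```python
import argparse

parser = argparse.ArgumentParser(description='Run an agent with a specific task')
parser.add_argument('--eval-output-dir', type=str,
                    help='The directory to save evaluation output')
parser.add_argument('--eval-n-limit', type=int,
                    help='The number of instances to evaluate')
parser.add_argument('--eval-num-workers', type=int, default=1,
                    help='The number of workers to use for evaluation')
parser.add_argument('--eval-note', type=str,
                    help='The note to add to the evaluation directory')

args = parser.parse_args(['--eval-n-limit', '50', '--eval-num-workers', '4'])
print(args.eval_n_limit, args.eval_num_workers)  # 50 4
```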