feat(SWE-Bench environment) integrate SWE-Bench sandbox (#1468)

* add draft dockerfile for build all

* add rsync for build

* add all-in-one docker

* update prepare scripts

* Update swe_env_box.py

* Add swe_entry.sh (buggy now)

* Parse the test command in swe_entry.sh

* Update README for instance eval in sandbox

* revert specialized config

* replace run_as_devin as an init arg

* set container & run_as_root via args

* update swe entry script

* update env

* remove mounting

* allow error after swe_entry

* update swe_env_box

* move file

* update gitignore

* get swe_env_box a working demo

* support faking user response & provide sandbox ahead of time;
also return state for controller

* tweak main to support adding controller kwargs

* add module

* initialize plugin for provided sandbox

* add pip cache to plugin & fix jupyter kernel waiting

* better print Observation output

* add run infer scripts

* update readme

* add utility for getting diff patch

* use get_diff_patch in infer

* update readme

* support cost tracking for codeact

* add swe agent edit hack

* disable color in git diff

* fix git diff cmd

* fix state return

* support limit eval

* increase timeout and export pip cache

* add eval limit config

* return state when hit turn limit

* save log to file; allow agent to give up

* run eval with max 50 turns

* add outputs to gitignore

* save swe_instance & instruction

* add uuid to swebench

* add streamlit dep

* fix save series

* fix the issue where session id might be duplicated

* allow setting temperature for llm (use 0 for eval)

* Get report from agent running log

* support evaluating task success right after inference.

* remove extra log

* comment out prompt for baseline

* add visualizer for eval

* use plaintext for instruction

* reduce timeout for all; only increase timeout for init

* reduce timeout for all; only increase timeout for init

* ignore sid for swe env

* close sandbox in each eval loop

* update visualizer instruction

* increase max chars

* add finish action to history too

* show test result in metrics

* add sidebars for visualizer

* also visualize swe_instance

* clean up browser when agent controller finishes running

* do not mount workspace for swe-eval to avoid accidentally overwriting files

* Revert "do not mount workspace for swe-eval to avoid accidentally overwrite files"

This reverts commit 8ef77390543e562e6f0a5a9992418014d8b3010c.

* Revert "Revert "do not mount workspace for swe-eval to avoid accidentally overwrite files""

This reverts commit 016cfbb9f0475f32bacbad5822996b4eaff24a5e.

* run jupyter command via copy to, instead of cp to mount

* only print mixin output when failed

* change ssh box logging

* add visualizer for pass rate

* add instance id to sandbox name

* only remove container we created

* use opendevin logger in main

* support multi-processing infer

* add back metadata, support keyboard interrupt

* remove container with startswith

* make pbar behave correctly

* update instruction w/ multi-processing

* show resolved rate by repo

* rename tmp dir name

* attempt to fix racing for copy to ssh_box

* fix script

* bump swe-bench-all version

* fix ipython with self-contained commands

* add jupyter demo to swe_env_box

* make resolved count two column

* increase height

* do not add glob to url params

* analyze obs length

* print instance id prior to removal handler

* add gold patch in visualizer

* fix interactive git by adding a git --no-pager as alias

* increase max_char to 10k to cover 98% of swe-bench obs cases

* allow parsing note

* prompt v2

* add iteration reminder

* adjust user response

* adjust order

* fix return eval

* fix typo

* add reminder before logging

* remove other resolve rate

* re-adjust to new folder structure

* support adding eval note

* fix eval note path

* make sure first log of each instance is printed

* add eval note

* fix the display for visualizer

* tweak visualizer for better git patch reading

* exclude empty patch

* add retry mechanism for swe_env_box start

* fix ssh timeout issue

* add stat field for apply test patch success

* add visualization for fine-grained report

* attempt to support monologue agent by constraining it to single thread

* also log error msg when stopped

* save error as well

* override WORKSPACE_MOUNT_PATH and WORKSPACE_BASE for monologue to work in mp

* add retry mechanism for sshbox

* remove retry for swe env box

* try to handle loop state stopped

* Add get report scripts

* Add script to convert agent output to swe-bench format

* Merge fine grained report for visualizer

* Update eval readme

* Update README.md

* Add CodeAct gpt4-1106 output and eval logs on swe-bench-lite

* Update the script to get model report

* Update get_model_report.sh

* Update get_agent_report.sh

* Update report merge script

* Add agent output conversion script

* Update swe_lite_env_setup.sh

* Add example swe-bench output files

* Update eval readme

* Remove redundant scripts

* set iteration count down to false by default

* fix: Issue where CodeAct agent was trying to log cost on local llm and throwing Undefined Model exception out of litellm (#1666)

* fix: Issue where CodeAct agent was trying to log cost on local llm and throwing Undefined Model exception out of litellm

* Review Feedback

* Missing None Check

* Review feedback and improved error handling

---------

Co-authored-by: Robert Brennan <accounts@rbren.io>

* fix prepare_swe_util scripts

* update builder images

* update setup script

* remove swe-bench build workflow

* update lock

* remove experiments since they are moved to hf

* remove visualizer (since it is moved to hf repo)

* simplify jupyter execution via heredoc

* update ssh_box

* add initial docker readme

* add pkg-config as dependency

* add script for swe_bench all-in-one docker

* add rsync to builder

* rename var

* update commit

* update readme

* update lock

* support specifying timeout for long-running tasks

* fix path

* separate building of all deps and files

* support returning states at the end of controller

* remove return None

* support specifying timeout for long-running tasks

* add timeout for all existing sandbox impl

* fix swe_env_box for new codebase

* update llm config in config.py

* support passing sandbox in

* remove force set

* update eval script

* fix issue of overriding final state

* change default eval output to hf demo

* change default eval output to hf demo

* fix config

* only close it when it is NOT external sandbox

* add scripts

* tweak config

* only put in history when state has history attr

* fix agent controller for the case of running out of interaction budget

* assume state is always not None

* remove print of final state

* catch all exception when cannot compute completion cost

* Update README.md

* save source into json

* fix path

* update docker path

* return the final state on close

* merge AgentState with State

* fix integration test

* merge AgentState with State

* fix integration test

* add ChangeAgentStateAction to history in attempt to fix integration

* add back set agent state

* update tests

* update tests

* move scripts for setup

* update script and readme for infer

* do not reset logger when n processes == 1

* update eval_infer scripts and readme

* simplify readme

* copy over dir after eval

* copy over dir after eval

* directly return get state

* update lock

* fix output saving of infer

* replace print with logger

* update eval_infer script

* add back the missing .close

* increase timeout

* copy all swe_bench_format file

* attempt to fix output parsing

* log git commit id as metadata

* fix eval script

* update lock

* update unit tests

* fix argparser unit test

* fix lock

* the deps are now lightweight enough to be included in make build

* add spaces for tests

* add eval outputs to gitignore

* remove git submodule

* readme

* tweak git email

* update upload instruction

* bump codeact version for eval

---------

Co-authored-by: Bowen Li <libowen.ne@gmail.com>
Co-authored-by: huybery <huybery@gmail.com>
Co-authored-by: Bart Shappee <bshappee@gmail.com>
Co-authored-by: Robert Brennan <accounts@rbren.io>
This commit is contained in:
Xingyao Wang 2024-05-16 00:15:55 +08:00 committed by GitHub
parent a84d19f03c
commit 2406b901df
48 changed files with 2321 additions and 708 deletions

4
.gitignore vendored
View File

@@ -202,6 +202,8 @@ cache
# configuration
config.toml
evaluation/swe_bench/eval_workspace
evaluation/outputs
evaluation/evaluation_outputs
test_results*
/_test_files_tmp/

View File

@@ -135,7 +135,7 @@ install-python-dependencies:
export HNSWLIB_NO_NATIVE=1; \
poetry run pip install chroma-hnswlib; \
fi
@poetry install --without evaluation
@poetry install
@if [ -f "/etc/manjaro-release" ]; then \
echo "$(BLUE)Detected Manjaro Linux. Installing Playwright dependencies...$(RESET)"; \
poetry run pip install playwright; \

View File

@@ -276,7 +276,10 @@ class CodeActAgent(Agent):
raise NotImplementedError('Implement this abstract method')
def log_cost(self, response):
try:
cur_cost = self.llm.completion_cost(response)
except Exception:
cur_cost = 0
self.cost_accumulator += cur_cost
logger.info(
'Cost: %.2f USD | Accumulated Cost: %.2f USD',

View File

@@ -4,76 +4,21 @@ This folder contains code and resources to run experiments and evaluations.
## Logistics
To better organize the evaluation folder, we should follow the rules below:
- Each subfolder contains a specific benchmark or experiment. For example, `evaluation/SWE-bench` should contain
- Each subfolder contains a specific benchmark or experiment. For example, `evaluation/swe_bench` should contain
all the preprocessing/evaluation/analysis scripts.
- Raw data and experimental records should not be stored within this repo (e.g. Google Drive or Hugging Face Datasets).
- Raw data and experimental records should not be stored within this repo.
- For model outputs, they should be stored at [this huggingface space](https://huggingface.co/spaces/OpenDevin/evaluation) for visualization.
- Important data files of manageable size and analysis scripts (e.g., jupyter notebooks) can be directly uploaded to this repo.
## Roadmap
## Supported Benchmarks
- Sanity check. Reproduce Devin's scores on SWE-bench using the released outputs to make sure that our harness pipeline works.
- Open source model support.
- Contributors are encouraged to submit their commits to our [forked SWE-bench repo](https://github.com/OpenDevin/SWE-bench).
- Ensure compatibility with OpenAI interface for inference.
- Serve open source models, prioritizing high concurrency and throughput.
- SWE-Bench: [`evaluation/swe_bench`](./swe_bench)
## SWE-bench
- notebooks
- `devin_eval_analysis.ipynb`: notebook analyzing devin's outputs
- scripts
- `prepare_devin_outputs_for_evaluation.py`: script fetching and converting [devin's output](https://github.com/CognitionAI/devin-swebench-results/tree/main) into the desired json file for evaluation.
- usage: `python prepare_devin_outputs_for_evaluation.py <setting>` where setting can be `passed`, `failed` or `all`
- resources
- Devin related SWE-bench test subsets
- [🤗 OpenDevin/SWE-bench-devin-passed](https://huggingface.co/datasets/OpenDevin/SWE-bench-devin-passed)
- [🤗 OpenDevin/SWE-bench-devin-full-filtered](https://huggingface.co/datasets/OpenDevin/SWE-bench-devin-full-filtered)
- Devin's outputs processed for evaluation are available on [Huggingface](https://huggingface.co/datasets/OpenDevin/Devin-SWE-bench-output)
- get predictions that passed the test: `wget https://huggingface.co/datasets/OpenDevin/Devin-SWE-bench-output/raw/main/devin_swe_passed.json`
- get all predictions `wget https://huggingface.co/datasets/OpenDevin/Devin-SWE-bench-output/raw/main/devin_swe_outputs.json`
### Result Visualization
See [`SWE-bench/README.md`](./SWE-bench/README.md) for more details on how to run SWE-Bench for evaluation.
Check [this huggingface space](https://huggingface.co/spaces/OpenDevin/evaluation) for visualization of existing experimental results.
### Results
We have refined the original SWE-bench evaluation pipeline to enhance its efficiency and reliability. The updates are as follows:
- Reuse testbeds and Conda environments.
- Additionally try `patch` command for patch application if `git apply` command fails.
### Upload your results
#### Results on SWE-bench-devin-passed
[🤗 OpenDevin/SWE-bench-devin-passed](https://huggingface.co/datasets/OpenDevin/SWE-bench-devin-passed)
| Model/Agent | #instances | #init | #apply | #resolve |
|------------------------|------------|-------|--------|----------|
| Gold | 79 | 79 | 79 | 79 |
| Devin | 79 | 79 | 76 | 76 |
#init: number of instances where testbeds have been successfully initialized.
In the 3 Devin-failed instances (see below), Devin has made changes to the tests, which are incompatible with the provided test patch and cause failures during patch application. The evaluation adopted by Devin does not seem to align with the original SWE-bench evaluation.
```shell
django__django-11244
scikit-learn__scikit-learn-10870
sphinx-doc__sphinx-9367
```
#### Results on SWE-bench-devin-failed
| Model/Agent | #instances | #init | #apply | #resolve |
|------------------------|------------|-------|--------|----------|
| Gold | 491 | 491 | 491 | 371 |
| Devin | 491 | 491 | 463 | 7 |
Devin **passes** 7 instances on the `SWE-bench-devin-failed` subset. SWE-bench dataset appears to be noisy, evidenced by 120 instances where gold patches do not pass.
We have filtered out the problematic 120 instances, resulting in the creation of the `SWE-bench-devin-full-filtered` subset.
## Results on SWE-bench-devin-full-filtered
[🤗 OpenDevin/SWE-bench-devin-full-filtered](https://huggingface.co/datasets/OpenDevin/SWE-bench-devin-full-filtered)
| Model/Agent | #instances | #init | #apply | #resolve |
|------------------------|------------|-------|--------|----------|
| Gold | 450 | 450 | 450 | 450 |
| Devin | 450 | 450 | 426 | 83 |
You can start your own fork of [our huggingface evaluation outputs](https://huggingface.co/spaces/OpenDevin/evaluation) and submit a PR of your evaluation results to our hosted huggingface repo following the guide [here](https://huggingface.co/docs/hub/en/repositories-pull-requests-discussions#pull-requests-and-discussions).

View File

@@ -1,80 +0,0 @@
# SWE-Bench Evaluation
Work in-progress.
**TODOs**:
- [ ] Generate `predictions` files given an OpenDevin `Agent` implementation. We could borrow something from [devin's eval-harness implementation](https://github.com/CognitionAI/devin-swebench-results/tree/main/harness), for example, [how to generate `TestSpec`](https://github.com/CognitionAI/devin-swebench-results/blob/main/harness/scripts.py#L150-L160).
- [ ] Make sure the evaluation suite runs on all repos. I only tested on `matplotlib` so far; `scikit-learn` does not work for now (see [this issue](https://github.com/princeton-nlp/SWE-bench/issues/57)).
## Run tests for a prediction file inside a docker container
Currently, the docker container should be able to run SWE-Bench. It was tested on `matplotlib`, but it requires further testing to make sure it works on other repositories. Currently, [it does not work for `scikit-learn`](https://github.com/princeton-nlp/SWE-bench/issues/57).
### Setup example data
```bash
cd evaluation/SWE-bench
./scripts/prepare_devin_swe_bench_data.sh
# Clone the repo
# This is a fork that fixes some issues that stops matplotlib from running (see https://github.com/princeton-nlp/SWE-bench/pull/56)
git clone https://github.com/OpenDevin/SWE-bench.git
# Enter the docker container
./scripts/run_docker_interactive.sh
```
### Run evaluation
```bash
#!/bin/bash
rm -rf data/logs/ data/testbeds/ # (Optional) remove previous outputs
mkdir -p data/logs
mkdir -p data/testbeds
python SWE-bench/harness/run_evaluation.py \
--predictions_path data/predictions/devin_swe_outputs.json \
--swe_bench_tasks data/processed/swe-bench-test.json \
--log_dir data/logs \
--testbed data/testbeds \
--skip_existing \
--timeout 900 \
--verbose
```
You will see the command line outputs similar to this (if success):
```log
swe-bench@2f3a6b9fcab2:/swe-bench$ ./harness/run_evaluation.sh
/swe-bench/harness/run_evaluation.py:101: SyntaxWarning: assertion is always true, perhaps remove parentheses?
assert(temp, datasets.arrow_dataset.Dataset)
2024-03-20 09:21:18,796 - INFO - Found 1 predictions across 1 model(s) in predictions file
2024-03-20 09:21:18,796 - INFO - [claude-2/matplotlib__matplotlib/3.6] # of predictions to evaluate: 1 (0 already evaluated)
2024-03-20 09:21:18,797 - INFO - [Testbed] Creating log directory /swe-bench/data/logs/claude-2
2024-03-20 09:21:18,797 - INFO - [Testbed] Using conda path /swe-bench/data/testbeds/claude-2/matplotlib__matplotlib/3.6/tmp09wrm708
2024-03-20 09:21:18,797 - INFO - [Testbed] Using working directory /swe-bench/data/testbeds/claude-2/matplotlib__matplotlib/3.6/tmpfy1qth23 for testbed
2024-03-20 09:21:18,797 - INFO - [Testbed] Repo matplotlib/matplotlib: 1 versions
2024-03-20 09:21:18,797 - INFO - [Testbed] Version 3.6: 1 instances
2024-03-20 09:21:18,797 - INFO - No conda path provided, creating temporary install in /swe-bench/data/testbeds/claude-2/matplotlib__matplotlib/3.6/tmp09wrm708/miniconda3...
2024-03-20 09:21:27,482 - INFO - [Testbed] Using conda path /swe-bench/data/testbeds/claude-2/matplotlib__matplotlib/3.6/tmp09wrm708/miniconda3
2024-03-20 09:21:27,942 - INFO - [Testbed] Setting up testbed for matplotlib__matplotlib__3.6
2024-03-20 09:21:44,257 - INFO - [Testbed] Cloned matplotlib/matplotlib to /swe-bench/data/testbeds/claude-2/matplotlib__matplotlib/3.6/tmpfy1qth23/matplotlib__matplotlib__3.6
2024-03-20 09:21:44,415 - INFO - [Testbed] Creating environment matplotlib__matplotlib__3.6; Command: /swe-bench/data/testbeds/claude-2/matplotlib__matplotlib/3.6/tmp09wrm708/miniconda3/bin/conda env create --file /swe-bench/data/testbeds/claude-2/matplotlib__matplotlib/3.6/tmpfy1qth23/environment.yml
2024-03-20 09:23:39,781 - INFO - [Testbed] Installing pip packages for matplotlib__matplotlib__3.6; Command: . /swe-bench/data/testbeds/claude-2/matplotlib__matplotlib/3.6/tmp09wrm708/miniconda3/bin/activate matplotlib__matplotlib__3.6 && pip install pytest
/swe-bench/data/testbeds/claude-2/matplotlib__matplotlib/3.6/tmpfy1qth23/matplotlib__matplotlib__3.6: 1 instances
2024-03-20 09:23:42,309 - INFO - [matplotlib__matplotlib__3.6] [matplotlib__matplotlib-24362] Reset task environment to aca6e9d5e98811ca37c442217914b15e78127c89
2024-03-20 09:23:42,314 - INFO - [matplotlib__matplotlib__3.6] [matplotlib__matplotlib-24362] Apply patch successful (pred_try)
2024-03-20 09:23:42,318 - INFO - [matplotlib__matplotlib__3.6] [matplotlib__matplotlib-24362] Revert patch successful (pred_try)
2024-03-20 09:23:42,318 - INFO - [matplotlib__matplotlib__3.6] [matplotlib__matplotlib-24362] Installing with command: . /swe-bench/data/testbeds/claude-2/matplotlib__matplotlib/3.6/tmp09wrm708/miniconda3/bin/activate matplotlib__matplotlib__3.6 && echo 'activate successful' && python -m pip install -e .
2024-03-20 09:24:54,966 - INFO - [matplotlib__matplotlib__3.6] [matplotlib__matplotlib-24362] Installation successful
2024-03-20 09:24:54,970 - INFO - [matplotlib__matplotlib__3.6] [matplotlib__matplotlib-24362] Apply patch successful (test)
2024-03-20 09:24:54,974 - INFO - [matplotlib__matplotlib__3.6] [matplotlib__matplotlib-24362] Apply patch successful (pred)
2024-03-20 09:25:04,775 - INFO - [matplotlib__matplotlib__3.6] [matplotlib__matplotlib-24362] Test script run successful
swe-bench@2f3a6b9fcab2:/swe-bench$
```
### Interpret Results
Then you may interpret the results under `data/logs`, and interpret it following [this guide](https://github.com/princeton-nlp/SWE-bench/blob/main/tutorials/evaluation.md#-metrics).

View File

@@ -1,155 +0,0 @@
# @yaml
# signature: search_dir <search_term> [<dir>]
# docstring: searches for search_term in all files in dir. If dir is not provided, searches in the current directory
# arguments:
# search_term:
# type: string
# description: the term to search for
# required: true
# dir:
# type: string
# description: the directory to search in (if not provided, searches in the current directory)
# required: false
search_dir() {
if [ $# -eq 1 ]; then
local search_term="$1"
local dir="./"
elif [ $# -eq 2 ]; then
local search_term="$1"
if [ -d "$2" ]; then
local dir="$2"
else
echo "Directory $2 not found"
return
fi
else
echo "Usage: search_dir <search_term> [<dir>]"
return
fi
dir=$(realpath "$dir")
local matches=$(find "$dir" -type f ! -path '*/.*' -exec grep -nIH "$search_term" {} + | cut -d: -f1 | sort | uniq -c)
# if no matches, return
if [ -z "$matches" ]; then
echo "No matches found for \"$search_term\" in $dir"
return
fi
# Calculate total number of matches
local num_matches=$(echo "$matches" | awk '{sum+=$1} END {print sum}')
# calculate total number of files matched
local num_files=$(echo "$matches" | wc -l | awk '{$1=$1; print $0}')
# if num_files is > 100, print an error
if [ $num_files -gt 100 ]; then
echo "More than $num_files files matched for \"$search_term\" in $dir. Please narrow your search."
return
fi
echo "Found $num_matches matches for \"$search_term\" in $dir:"
echo "$matches" | awk '{$2=$2; gsub(/^\.+\/+/, "./", $2); print $2 " ("$1" matches)"}'
echo "End of matches for \"$search_term\" in $dir"
}
# @yaml
# signature: search_file <search_term> [<file>]
# docstring: searches for search_term in file. If file is not provided, searches in the current open file
# arguments:
# search_term:
# type: string
# description: the term to search for
# required: true
# file:
# type: string
# description: the file to search in (if not provided, searches in the current open file)
# required: false
search_file() {
# Check if the first argument is provided
if [ -z "$1" ]; then
echo "Usage: search_file <search_term> [<file>]"
return
fi
# Check if the second argument is provided
if [ -n "$2" ]; then
# Check if the provided argument is a valid file
if [ -f "$2" ]; then
local file="$2" # Set file if valid
else
echo "Usage: search_file <search_term> [<file>]"
echo "Error: File name $2 not found. Please provide a valid file name."
return # Exit if the file is not valid
fi
else
# Check if a file is open
if [ -z "$CURRENT_FILE" ]; then
echo "No file open. Use the open command first."
return # Exit if no file is open
fi
local file="$CURRENT_FILE" # Set file to the current open file
fi
local search_term="$1"
file=$(realpath "$file")
# Use grep to directly get the desired formatted output
local matches=$(grep -nH "$search_term" "$file")
# Check if no matches were found
if [ -z "$matches" ]; then
echo "No matches found for \"$search_term\" in $file"
return
fi
# Calculate total number of matches
local num_matches=$(echo "$matches" | wc -l | awk '{$1=$1; print $0}')
# calculate total number of lines matched
local num_lines=$(echo "$matches" | cut -d: -f1 | sort | uniq | wc -l | awk '{$1=$1; print $0}')
# if num_lines is > 100, print an error
if [ $num_lines -gt 100 ]; then
echo "More than $num_lines lines matched for \"$search_term\" in $file. Please narrow your search."
return
fi
# Print the total number of matches and the matches themselves
echo "Found $num_matches matches for \"$search_term\" in $file:"
echo "$matches" | cut -d: -f1-2 | sort -u -t: -k2,2n | while IFS=: read -r filename line_number; do
echo "Line $line_number:$(sed -n "${line_number}p" "$file")"
done
echo "End of matches for \"$search_term\" in $file"
}
# @yaml
# signature: find_file <file_name> [<dir>]
# docstring: finds all files with the given name in dir. If dir is not provided, searches in the current directory
# arguments:
# file_name:
# type: string
# description: the name of the file to search for
# required: true
# dir:
# type: string
# description: the directory to search in (if not provided, searches in the current directory)
# required: false
find_file() {
if [ $# -eq 1 ]; then
local file_name="$1"
local dir="./"
elif [ $# -eq 2 ]; then
local file_name="$1"
if [ -d "$2" ]; then
local dir="$2"
else
echo "Directory $2 not found"
return
fi
else
echo "Usage: find_file <file_name> [<dir>]"
return
fi
dir=$(realpath "$dir")
local matches=$(find "$dir" -type f -name "$file_name")
# if no matches, return
if [ -z "$matches" ]; then
echo "No matches found for \"$file_name\" in $dir"
return
fi
# Calculate total number of matches
local num_matches=$(echo "$matches" | wc -l | awk '{$1=$1; print $0}')
echo "Found $num_matches matches for \"$file_name\" in $dir:"
echo "$matches" | awk '{print $0}'
}

View File

@@ -1,15 +0,0 @@
# FROM https://github.com/princeton-nlp/SWE-bench/blob/main/environment.yml
name: swe-bench
dependencies:
- python=3.9
- pip
- pip:
- beautifulsoup4
- chardet
- ghapi
- GitPython
- python-dotenv
- requests
- rich
- transformers>=4.34.0
- conda-forge::gh

File diff suppressed because one or more lines are too long

View File

@@ -1,5 +0,0 @@
from datasets import load_dataset
dataset = load_dataset('princeton-nlp/SWE-bench')
test = dataset['test'].to_pandas()
test.to_json('data/processed/swe-bench-test.json', orient='records')

View File

@@ -1,81 +0,0 @@
'''
Script used to convert devin's output into the desired json format for evaluation on SWE-bench
Usage:
python prepare_devin_outputs_for_evaluation.py <setting>
<setting> can be "passed", "failed", "all"
Outputs:
two json files under evaluation/SWE-bench/data/
'''
# fetch devin's outputs into a json file for evaluation
import json
import os
import sys
import requests
from tqdm import tqdm
def get_devin_eval_output(setting):
repo_url = 'CognitionAI/devin-swebench-results'
folder_path = 'output_diffs'
base_url = 'https://api.github.com/repos/'
pass_api_url = f'{base_url}{repo_url}/contents/{folder_path}/pass'
failed_api_url = f'{base_url}{repo_url}/contents/{folder_path}/fail'
pass_files_info = []
failed_files_info = []
def get_files(api_url, subfolder_name, files_info):
response = requests.get(api_url)
if response.status_code == 200:
contents = response.json()
for item in tqdm(contents):
if item['type'] == 'file':
file_url = f"https://raw.githubusercontent.com/{repo_url}/main/{folder_path}/{subfolder_name}/{item['name']}"
file_content = requests.get(file_url).text
instance_id = item['name'][:-9]
model_name = 'Devin' # Update with actual model name
files_info.append({
'instance_id': instance_id,
'model_patch': file_content,
'model_name_or_path': model_name,
'pass_or_fail': subfolder_name
})
if setting == 'passed' or setting == 'all':
get_files(pass_api_url, 'pass', pass_files_info)
if setting == 'failed' or setting == 'all':
get_files(failed_api_url, 'fail', failed_files_info)
script_dir = os.path.dirname(os.path.abspath(__file__))
output_dir = os.path.join(script_dir, '../data/devin/')
if not os.path.exists(output_dir):
os.makedirs(output_dir)
if setting == 'passed' or setting == 'all':
with open(os.path.join(output_dir, 'devin_swe_passed.json'), 'w') as pass_file:
json.dump(pass_files_info, pass_file, indent=4)
if setting == 'failed' or setting == 'all':
with open(os.path.join(output_dir, 'devin_swe_failed.json'), 'w') as fail_file:
json.dump(failed_files_info, fail_file, indent=4)
if setting == 'all':
merged_output = pass_files_info + failed_files_info
with open(os.path.join(output_dir, 'devin_swe_outputs.json'), 'w') as merge_file:
json.dump(merged_output, merge_file, indent=4)
if __name__ == '__main__':
if len(sys.argv) != 2:
print('Usage: python script_name.py <setting>')
sys.exit(1)
setting = sys.argv[1]
get_devin_eval_output(setting)

View File

@@ -1,10 +0,0 @@
#!/bin/bash
set -xeo pipefail
mkdir -p data/processed
python3 scripts/download_test_data.py
# Download an example output file (FROM claude-2)
# https://gist.github.com/sorendunn/9f1f1fade59f986b4925b6633f9ff165
mkdir -p data/predictions
wget https://huggingface.co/datasets/OpenDevin/Devin-SWE-bench-output/raw/main/devin_swe_outputs.json -O data/predictions/devin_swe_outputs.json

View File

@@ -1,14 +0,0 @@
#!/bin/bash
DOCKER_IMAGE=ghcr.io/opendevin/eval-swe-bench
WORK_DIR=`pwd`
docker run \
-it \
--rm \
--user root \
--ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
-v $WORK_DIR:/swe-bench \
-w /swe-bench \
$DOCKER_IMAGE \
/bin/bash -c "usermod -u $(id -u) swe-bench && su swe-bench"

0
evaluation/__init__.py Normal file
View File

View File

@@ -0,0 +1,39 @@
# Pre-build Testbed and Env
In the original SWE-Bench implementation, the conda environment for evaluation is typically installed from scratch while evaluating a particular instance. This poses several challenges:
- Efficiency: most of the evaluation time is wasted on downloading packages
- Stability: setup could fail due to bad internet connectivity
- Reliability: an instance may be considered failed not because the agent did badly, but because the environment setup failed.
In the OpenDevin-SWE-Bench fork, we pre-build the **testbed** (i.e., the code of the repository we want the agent to edit) AND the **conda environment**, so that at evaluation (inference) time we can directly leverage the existing environments for efficient evaluation.
NOTE: We only support SWE-Bench lite for now. But modifying our existing scripts for full SWE-Bench should be quite straightforward.
## How to pre-build your testbed
### Setup Eval Workspace (Util + Data)
Set up your eval workspace by:
1. Clone OpenDevin SWE-Bench [fork](https://github.com/OpenDevin/OD-SWE-bench.git)
2. Prepare SWE-Bench data
Run the following command to do the above two steps. The results will be saved to `evaluation/SWE-bench/eval_workspace`.
```bash
./evaluation/swe_bench/scripts/setup/prepare_swe_utils.sh
```
### Pre-build Conda Env and Test Bed
```bash
./evaluation/swe_bench/scripts/setup/swe_env_setup.sh
```
### Build the pre-built conda env and testbed into ONE docker image
```bash
pushd evaluation/swe_bench
docker build -t ghcr.io/opendevin/eval-swe-bench:full-v1.0 -f ./scripts/docker/Dockerfile.full.v1.0 .
docker push ghcr.io/opendevin/eval-swe-bench:full-v1.0
```

View File

@@ -0,0 +1,256 @@
# Evaluate Generated Patches
## Evaluate patches generated by OpenDevin
This section explains in detail how `evaluation/swe_bench/scripts/eval_infer.sh` described in [SWE-Bench README](./README.md) works.
Use `scripts/setup/get_agent_report.sh` to evaluate patches generated by an OpenDevin agent. This script is available in the container at `/swe_util/get_agent_report.sh`.
- `output-file` (*required*): specify the path to your patch file inside the container
- `agent-name` (*required*): your agent name
- `dataset` (*required*): `swe-bench-test-lite` or `swe-bench-test`
- `num-processes`: defaults to 15.
- `experiment-name`: set to `${parent_folder_of_output_file}_${current_folder_of_output_file}` if not given. E.g., `xxx/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v2_cd/output.jsonl` -> `CodeActAgent_gpt-4-1106-preview_maxiter_50_N_v2_cd` as experiment name.
- `merge-report`: if set, merges the evaluation report into the original output jsonl file and saves it as a `.merged.jsonl` file.
An example of running evaluation on the given example agent output (`./examples/example_agent_output.json`):
```shell
export MINICONDA3=/swe_util/miniforge3
export OD_SWE_BENCH=/OD-SWE-bench
export EVAL_DATA_DIR=/swe_util/eval_data
cd /swe_util && ./get_agent_report.sh --output-file /swe_bench_output/example_agent_output.jsonl \
--agent-name CodeActAgent \
--dataset swe-bench-test-lite \
--experiment-name test_experiment \
--merge-report
```
You should get the following report:
```shell
- no_generation: 4
- generated: 26
- with_logs: 26
- install_fail: 0
- reset_failed: 0
- no_apply: 0
- applied: 24
- test_errored: 0
- test_timeout: 0
- resolved: 6
['sphinx-doc__sphinx-8721', 'sympy__sympy-14774', 'django__django-17087', 'sympy__sympy-20590', 'django__django-11583', 'sympy__sympy-21612']
Report saved at /swe_util/eval_data/eval_logs/test_experiment/test_experiment_swe-bench-test-lite.report.json
Agent output with report merged created at /swe_bench_output/example_agent_output.merged.jsonl
```
An additional `fine_grained_report` field will be added to each instance in the `example_agent_output.merged.jsonl`.
```json
"fine_grained_report": {
"gold_tests": {
"FAIL_TO_PASS": "[\"tests/test_ext_viewcode.py::test_viewcode_epub_default\"]",
"PASS_TO_PASS": "[\"tests/test_ext_viewcode.py::test_viewcode_epub_enabled\", \"tests/test_ext_viewcode.py::test_linkcode\", \"tests/test_ext_viewcode.py::test_local_source_files\"]"
},
"generated": true,
"with_logs": true,
"applied": true,
"test_errored": false,
"test_timeout": false,
"resolved": true,
"log_parse": {
"tests/test_ext_viewcode.py::test_viewcode_epub_default": "PASSED",
"tests/test_ext_viewcode.py::test_viewcode_epub_enabled": "PASSED",
"tests/test_ext_viewcode.py::test_linkcode": "PASSED",
"tests/test_ext_viewcode.py::test_local_source_files": "PASSED",
"tests/test_ext_viewcode.py::test_viewcode": "FAILED"
},
"eval_report": {
"FAIL_TO_PASS": {
"success": [
"tests/test_ext_viewcode.py::test_viewcode_epub_default"
],
"failure": []
},
"PASS_TO_PASS": {
"success": [
"tests/test_ext_viewcode.py::test_viewcode_epub_enabled",
"tests/test_ext_viewcode.py::test_linkcode",
"tests/test_ext_viewcode.py::test_local_source_files"
],
"failure": []
},
"FAIL_TO_FAIL": {
"success": [],
"failure": []
},
"PASS_TO_FAIL": {
"success": [],
"failure": []
}
}
}
```
## If you already have patches not generated by OpenDevin
### Prepare Output Files
Ensure that model outputs are formatted correctly as below:
```json
[
{
"instance_id": "",
"model_patch": "",
"model_name_or_path": ""
},
...
]
```
An example can be found [here](./examples/example_model_output.json).
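As a quick, optional sanity check (not part of the provided scripts), you can verify that every entry in a model output file carries the three required keys with `jq`; the file name here is hypothetical:
```shell
# check that all entries have the required fields (exits non-zero otherwise)
jq -e 'all(.[]; has("instance_id") and has("model_patch") and has("model_name_or_path"))' \
  my_model_output.json && echo "format looks OK"
```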
Agent output should adhere to the OpenDevin format. An example can be found [here](./examples/example_agent_output.json).
### Set Up the Environment
Before evaluating generated patches, you need to set up the Docker environment. Run the following command to instantiate the Docker container and mount the directory containing your output files on the host:
```shell
docker run -it \
-v DIR_TO_YOUR_PATCH_FILES_ON_HOST:/swe_bench_output \
ghcr.io/opendevin/eval-swe-bench:full-v1.0 /bin/bash
```
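Once inside the container, you can optionally confirm that the mount worked; `/swe_bench_output` should list the patch files you mounted from the host:
```shell
# optional sanity check from inside the container
ls /swe_bench_output
```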
### Evaluate Model Generated Patches
Use `scripts/get_model_report.sh` to evaluate patches generated by a model. This script is located in the container at `/swe_util/get_model_report.sh`.
- `output-file` (*required*): specify the path to your patch file inside the container
- `model-name` (*required*): this must match the `model_name_or_path` in your patch file
- `dataset` (*required*): `swe-bench-test-lite` or `swe-bench-test`
- `num-processes`: defaults to 15.
- `experiment-name`: set to `{model-name}__{dataset}` unless specified
An example of running evaluation on the given example model output (`./examples/example_model_output.json`):
```shell
export MINICONDA3=/swe_util/miniforge3
export OD_SWE_BENCH=/swe_util/OD-SWE-bench
export EVAL_DATA_DIR=/swe_util/eval_data
cd /swe_util && ./get_model_report.sh --output-file /swe_bench_output/example_model_output.json \
--model-name opendevin \
--dataset swe-bench-test-lite
```
You should get the following report:
```shell
- no_generation: 4
- generated: 26
- with_logs: 26
- install_fail: 0
- reset_failed: 0
- no_apply: 0
- applied: 24
- test_errored: 0
- test_timeout: 0
- resolved: 6
['sphinx-doc__sphinx-8721', 'sympy__sympy-14774', 'django__django-17087', 'sympy__sympy-20590', 'django__django-11583', 'sympy__sympy-21612']
Report saved at /swe_util/eval_data/eval_logs/opendevin__swe-bench-test-lite/example_model_output.report.json
```
Note: please ignore the `no_apply` in the report for now.
The script will generate a `{experiment_name}` folder under `$EVAL_DATA_DIR/eval_logs`
```shell
├── $EVAL_DATA_DIR/eval_logs/$experiment_name
│ ├── $experiment_name.json
│ ├── $experiment_name.report.json
│ ├── $model_name # eval log dir
```
### Evaluate Agent Generated Patches
Use `scripts/setup/get_agent_report.sh` to evaluate patches generated by an agent. This script is available in the container at `/swe_util/get_agent_report.sh`.
- `output-file` (*required*): specify the path to your patch file inside the container
- `agent-name` (*required*): your agent name
- `dataset` (*required*): `swe-bench-test-lite` or `swe-bench-test`
- `num-processes`: defaults to 15.
- `experiment-name`: set to `${parent_folder_of_output_file}_${current_folder_of_output_file}` if not given. E.g., `xxx/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v2_cd/output.jsonl` -> `CodeActAgent_gpt-4-1106-preview_maxiter_50_N_v2_cd` as experiment name.
- `merge-report`: if set, merges the evaluation report into the original output jsonl file and saves it as a `.merged.jsonl` file.
An example of running evaluation on the given example agent output (`./examples/example_agent_output.json`):
```shell
export MINICONDA3=/swe_util/miniforge3
export OD_SWE_BENCH=/OD-SWE-bench
export EVAL_DATA_DIR=/swe_util/eval_data
cd /swe_util && ./get_agent_report.sh --output-file /swe_bench_output/example_agent_output.jsonl \
--agent-name CodeActAgent \
--dataset swe-bench-test-lite \
--experiment-name test_experiment \
--merge-report
```
You should get the following report:
```shell
- no_generation: 4
- generated: 26
- with_logs: 26
- install_fail: 0
- reset_failed: 0
- no_apply: 0
- applied: 24
- test_errored: 0
- test_timeout: 0
- resolved: 6
['sphinx-doc__sphinx-8721', 'sympy__sympy-14774', 'django__django-17087', 'sympy__sympy-20590', 'django__django-11583', 'sympy__sympy-21612']
Report saved at /swe_util/eval_data/eval_logs/test_experiment/test_experiment_swe-bench-test-lite.report.json
Agent output with report merged created at /swe_bench_output/example_agent_output.merged.jsonl
```
An additional `fine_grained_report` field will be added to each instance in the `example_agent_output.merged.jsonl`.
```json
"fine_grained_report": {
"gold_tests": {
"FAIL_TO_PASS": "[\"tests/test_ext_viewcode.py::test_viewcode_epub_default\"]",
"PASS_TO_PASS": "[\"tests/test_ext_viewcode.py::test_viewcode_epub_enabled\", \"tests/test_ext_viewcode.py::test_linkcode\", \"tests/test_ext_viewcode.py::test_local_source_files\"]"
},
"generated": true,
"with_logs": true,
"applied": true,
"test_errored": false,
"test_timeout": false,
"resolved": true,
"log_parse": {
"tests/test_ext_viewcode.py::test_viewcode_epub_default": "PASSED",
"tests/test_ext_viewcode.py::test_viewcode_epub_enabled": "PASSED",
"tests/test_ext_viewcode.py::test_linkcode": "PASSED",
"tests/test_ext_viewcode.py::test_local_source_files": "PASSED",
"tests/test_ext_viewcode.py::test_viewcode": "FAILED"
},
"eval_report": {
"FAIL_TO_PASS": {
"success": [
"tests/test_ext_viewcode.py::test_viewcode_epub_default"
],
"failure": []
},
"PASS_TO_PASS": {
"success": [
"tests/test_ext_viewcode.py::test_viewcode_epub_enabled",
"tests/test_ext_viewcode.py::test_linkcode",
"tests/test_ext_viewcode.py::test_local_source_files"
],
"failure": []
},
"FAIL_TO_FAIL": {
"success": [],
"failure": []
},
"PASS_TO_FAIL": {
"success": [],
"failure": []
}
}
}
```

View File

@@ -0,0 +1,150 @@
# SWE-Bench Evaluation with OpenDevin SWE-Bench Docker Image
This folder contains the evaluation harness we built on top of the original [SWE-Bench benchmark](https://www.swebench.com/) ([paper](https://arxiv.org/abs/2310.06770)). We created [a fork of SWE-Bench](https://github.com/OpenDevin/OD-SWE-bench.git), mostly built on top of [the original repo](https://github.com/princeton-nlp/SWE-bench), and [containerized](#opendevin-swe-bench-docker-image) it for easy evaluation.
## OpenDevin SWE-Bench Docker Image
In the [OpenDevin-SWE-Bench fork](https://github.com/OpenDevin/OD-SWE-bench.git) (mostly from the [original repo](https://github.com/princeton-nlp/SWE-bench) with some fixes), we pre-build the **testbed** (i.e., the code of the repository we want the agent to edit) AND the **conda environment**, so that at evaluation (inference) time we can directly leverage the existing environments for efficient evaluation.
**We pack everything you need for SWE-Bench evaluation into one, gigantic, docker image.** To use it:
```bash
docker pull ghcr.io/opendevin/eval-swe-bench:full-v1.0
```
The Docker image contains several important directories:
- `/swe_util/OD-SWE-bench`: root directory for the OD-SWE-bench repository
- `/swe_util/eval_data`: directory for eval data
- `/swe_util/eval_data/eval_logs/`: evaluation logs
- `/swe_util/eval_data/eval_temp/`: temporary folder for the evaluation process
- `/swe_util/eval_data/instances/`: swe-bench raw instances
- `/swe_util/eval_data/outputs/`: model or agent outputs
- `/swe_util/eval_data/testbed_logs/`: logs for testbed building
- `/swe_util/eval_data/testbeds/`: directory for all testbeds
- `/swe_util/miniforge3/`: directory for miniforge3
To reproduce how we pack the image, check [this doc](./BUILD_TESTBED_AND_ENV.md).
NOTE: We only support SWE-Bench lite for now. But modifying our existing scripts for full SWE-Bench should be quite straightforward.
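If you want a quick, optional look inside the image before running anything (this is not required for evaluation), something like the following should work:
```bash
# optional: list the bundled utilities and eval data inside the image
docker run --rm ghcr.io/opendevin/eval-swe-bench:full-v1.0 ls /swe_util /swe_util/eval_data
```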
## Test if your environment works
```bash
python3 evaluation/swe_bench/swe_env_box.py
```
If you reach the interactive shell successfully, your environment works.
## Configure your LLM
Create a `config.toml` file at the root of the workspace if it does not already exist.
Add the following configurations:
```toml
[core]
max_iterations = 100
cache_dir = "/tmp/cache"
sandbox_container_image = "ghcr.io/opendevin/sandbox:latest"
sandbox_type = "ssh"
use_host_network = true
ssh_hostname = "localhost"
sandbox_timeout = 120
# eval specific
run_as_devin = false
# TODO: Change these to the model you want to evaluate
[eval_gpt4_1106_preview]
model = "gpt-4-1106-preview"
api_key = "XXX"
temperature = 0.0
[eval_some_openai_compatible_model]
model = "openai/MODEL_NAME"
base_url = "https://OPENAI_COMPATIBLE_URL/v1"
api_key = "XXX"
temperature = 0.0
```
## Run Inference on SWE-Bench Instances
```bash
./evaluation/swe_bench/scripts/run_infer.sh eval_gpt4_1106_preview
```
You can replace `eval_gpt4_1106_preview` with any model you set up in `config.toml`.
## Evaluate Generated Patches
After running the inference described in the previous section, you will obtain an `output.jsonl` (by default it is saved to `evaluation/evaluation_outputs`). Then you can run this one-line script to evaluate the generated patches and produce a fine-grained report:
If you want to evaluate existing results, you should first run this to clone existing outputs
```bash
git clone https://huggingface.co/spaces/OpenDevin/evaluation evaluation/evaluation_outputs
```
Then you can run the following:
```bash
# ./evaluation/swe_bench/scripts/eval_infer.sh $YOUR_OUTPUT_JSONL
# For example:
./evaluation/swe_bench/scripts/eval_infer.sh evaluation/evaluation_outputs/outputs/swe_bench/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/output.jsonl
```
The final results will be saved to `evaluation/evaluation_outputs/outputs/swe_bench/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/output.merged.jsonl`.
It will contain an additional field `fine_grained_report` (see example below) compared to the `output.jsonl` from the previous inference stage.
```json
"fine_grained_report": {
"gold_tests": {
"FAIL_TO_PASS": "[\"tests/test_ext_viewcode.py::test_viewcode_epub_default\"]",
"PASS_TO_PASS": "[\"tests/test_ext_viewcode.py::test_viewcode_epub_enabled\", \"tests/test_ext_viewcode.py::test_linkcode\", \"tests/test_ext_viewcode.py::test_local_source_files\"]"
},
"generated": true,
"with_logs": true,
"applied": true,
"test_errored": false,
"test_timeout": false,
"resolved": true,
"log_parse": {
"tests/test_ext_viewcode.py::test_viewcode_epub_default": "PASSED",
"tests/test_ext_viewcode.py::test_viewcode_epub_enabled": "PASSED",
"tests/test_ext_viewcode.py::test_linkcode": "PASSED",
"tests/test_ext_viewcode.py::test_local_source_files": "PASSED",
"tests/test_ext_viewcode.py::test_viewcode": "FAILED"
},
"eval_report": {
"FAIL_TO_PASS": {
"success": [
"tests/test_ext_viewcode.py::test_viewcode_epub_default"
],
"failure": []
},
"PASS_TO_PASS": {
"success": [
"tests/test_ext_viewcode.py::test_viewcode_epub_enabled",
"tests/test_ext_viewcode.py::test_linkcode",
"tests/test_ext_viewcode.py::test_local_source_files"
],
"failure": []
},
"FAIL_TO_FAIL": {
"success": [],
"failure": []
},
"PASS_TO_FAIL": {
"success": [],
"failure": []
}
}
}
```
Please refer to [EVAL_PATCH.md](./EVAL_PATCH.md) if you want to learn more about how to evaluate patches that are already generated (e.g., not by OpenDevin).
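As an optional sanity check (not part of the provided scripts), you can tally the resolved instances in the merged file with `jq`:
```bash
# count resolved instances in the merged output (path from the example above)
jq -r 'select(.fine_grained_report.resolved == true) | .instance_id' \
  evaluation/evaluation_outputs/outputs/swe_bench/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/output.merged.jsonl | wc -l
```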
## Submit your evaluation results
You can start your own fork of [our huggingface evaluation outputs](https://huggingface.co/spaces/OpenDevin/evaluation) and submit a PR of your evaluation results following the guide [here](https://huggingface.co/docs/hub/en/repositories-pull-requests-discussions#pull-requests-and-discussions).

View File

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

View File

@@ -0,0 +1,411 @@
import asyncio
import json
import logging
import multiprocessing as mp
import os
import pathlib
import subprocess
import time
from concurrent.futures import ProcessPoolExecutor
import pandas as pd
import whatthepatch
from datasets import load_dataset
from tqdm import tqdm
from evaluation.swe_bench.swe_env_box import SWEBenchSSHBox
from opendevin.controller.state.state import State
from opendevin.core.config import args, config, get_llm_config_arg
from opendevin.core.logger import get_console_handler
from opendevin.core.logger import opendevin_logger as logger
from opendevin.core.main import main
from opendevin.events.action import MessageAction
from opendevin.events.serialization.event import event_to_dict
def cleanup():
print('Cleaning up child processes...')
for process in mp.active_children():
print(f'Terminating child process: {process.name}')
process.terminate()
process.join()
def codeact_user_response(state: State) -> str:
msg = (
'Please continue working on the task on whatever approach you think is suitable.\n'
'If you think you have modified the code in a way that fixes the issue, please run the following command: <execute_bash> exit </execute_bash>.\n'
'IMPORTANT: YOU SHOULD NEVER ASK FOR HUMAN HELP OR USE THE INTERNET TO SOLVE THIS TASK.\n'
)
if state.history:
user_msgs = [
action
for action, _ in state.history
if isinstance(action, MessageAction) and action.source == 'user'
]
if len(user_msgs) >= 2:
# let the agent know that it can give up when it has tried 3 times
return (
msg
+ 'If you want to give up, run: <execute_bash> exit </execute_bash>.\n'
)
return msg
def monologue_user_response(state: State) -> str:
raise NotImplementedError('MonologueAgent should never ask for user responses.')
AGENT_CLS_TO_FAKE_USER_RESPONSE_FN = {
'CodeActAgent': codeact_user_response,
'MonologueAgent': monologue_user_response,
}
AGENT_CLS_TO_INST_SUFFIX = {
'CodeActAgent': 'When you think you have fixed the issue through code changes, please run the following command: <execute_bash> exit </execute_bash>.\n'
}
def get_test_result(instance, sandbox, workspace_dir_name):
test_result = {'result': {}, 'metadata': {}}
try:
test_patch_parsed = whatthepatch.parse_patch(instance.test_patch)
# get a list of filepaths that are involved in the patch
involved_filepaths = set()
for patch in test_patch_parsed:
involved_filepaths.add(patch.header.old_path.removeprefix('a/'))
involved_filepaths.add(patch.header.new_path.removeprefix('b/'))
involved_filepaths = list(involved_filepaths)
test_result['metadata']['1_test_patch_parse_success'] = True
test_result['metadata']['1_test_involved_filepaths'] = involved_filepaths
except Exception as e:
logger.error(
f'Error parsing test patch for instance {instance.instance_id}: {e}'
)
test_result['metadata']['1_test_patch_parse_success'] = False
test_result['metadata']['1_test_patch_parse_error'] = str(e)
test_result['metadata']['1_test_involved_filepaths'] = None
involved_filepaths = []
# Try to revert the changes for involved filepaths
err_code, output = sandbox.execute(f'cd /workspace/{workspace_dir_name}')
test_result['metadata']['2_revert_test_involved_filepaths_success'] = []
for filepath in involved_filepaths:
err_code, output = sandbox.execute(
f'git checkout {instance["base_commit"]} -- {filepath}'
)
if err_code != 0:
logger.error(f'Error reverting changes for {filepath}: {output}')
test_result['metadata']['2_revert_test_involved_filepaths_success'].append(
False
)
else:
test_result['metadata']['2_revert_test_involved_filepaths_success'].append(
True
)
# Apply the testcase
err_code, output = sandbox.execute('git apply $SWE_TASK_DIR/test.patch')
if err_code != 0:
logger.error(f'Error applying test patch: {output}')
test_result['metadata']['3_apply_test_patch_success'] = False
test_result['metadata']['3_apply_test_patch_error'] = output
else:
test_result['metadata']['3_apply_test_patch_success'] = True
# Run the test command
err_code, output = sandbox.execute(
'$TEST_CMD > /workspace/$SWE_INSTANCE_ID.log 2>&1'
)
if err_code != 0:
logger.error(f'Error running test command: {output}')
test_result['metadata']['4_run_test_command_success'] = False
test_result['metadata']['4_run_test_command_error'] = output
else:
test_result['metadata']['4_run_test_command_success'] = True
# Get the test output
err_code, output = sandbox.execute('cat /workspace/$SWE_INSTANCE_ID.log')
if err_code != 0:
logger.error(f'Error getting test output: {output}')
test_result['metadata']['4_get_test_output_success'] = False
test_result['metadata']['4_get_test_output_error'] = output
else:
test_result['metadata']['4_get_test_output_success'] = True
test_result['test_output'] = output
# Reformat instance.json
# $SWE_TASK_DIR/instance.json is a dict {"XXX": "YYY"}, add a [ before and a ] after
err_code, output = sandbox.execute(
(
'cat $SWE_TASK_DIR/instance.json | sed "s/^{/[{/" | sed "s/}$/}]/" > /workspace/instance.json'
)
)
if err_code != 0:
logger.error(f'Error creating instance.json: {output}')
test_result['metadata']['5_reformat_instance_json_success'] = False
test_result['metadata']['5_reformat_instance_json_error'] = output
else:
test_result['metadata']['5_reformat_instance_json_success'] = True
# Get the instance report
err_code, output = sandbox.execute(
(
'cd /swe_util/OD-SWE-bench '
'&& export PYTHONPATH=$(pwd):$PYTHONPATH '
'&& conda run -n swe-bench-eval python swebench/metrics/get_instance_report.py --swe_bench_task /workspace/instance.json --log_path /workspace/$SWE_INSTANCE_ID.log'
)
)
if err_code != 0:
logger.error(f'Error getting instance report: {output}')
test_result['metadata']['6_get_instance_report_success'] = False
test_result['metadata']['6_get_instance_report_error'] = output
else:
test_result['metadata']['6_get_instance_report_success'] = True
test_result['result_raw'] = output
# try to parse output
for line in output.strip().split('\n'):
line = line.strip('-')
try:
key, value = line.split(':')
except ValueError:
# skip this line
print(f'Error parsing result line: {line}')
continue
value = value.strip()
try:
value = int(value)
except ValueError:
pass
test_result['result'][key.strip()] = value
return test_result
def process_instance(
instance, agent_class, metadata, skip_workspace_mount, reset_logger: bool = True
):
workspace_mount_path = os.path.join(config.workspace_mount_path, '_eval_workspace')
# create process-specific workspace dir
if not skip_workspace_mount:
workspace_mount_path = os.path.join(workspace_mount_path, str(os.getpid()))
pathlib.Path(workspace_mount_path).mkdir(parents=True, exist_ok=True)
if reset_logger:
# Set up logger
log_file = os.path.join(
eval_output_dir, 'logs', f'instance_{instance.instance_id}.log'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
# add back the console handler to print ONE line
logger.addHandler(get_console_handler())
logger.info(
f'Starting evaluation for instance {instance.instance_id}.\nLOG: tail -f {log_file}'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
file_handler = logging.FileHandler(log_file)
file_handler.setFormatter(
logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
)
logger.addHandler(file_handler)
if not skip_workspace_mount:
logger.info(f'Process-specific workspace mounted at {workspace_mount_path}')
workspace_dir_name = f'{instance.repo}__{instance.version}'.replace('/', '__')
sandbox = SWEBenchSSHBox.get_box_for_instance(
instance,
workspace_dir_name,
skip_workspace_mount=skip_workspace_mount,
workspace_mount_path=workspace_mount_path,
)
# Prepare instruction
instruction = (
f'Please fix the following issue for the repository in /workspace/{workspace_dir_name}.\n'
'Environment has been set up for you to start working. You may assume all necessary tools are installed.\n\n'
'# Problem Statement\n'
f'{instance.problem_statement}\n\n'
)
if instance.hints_text:
instruction += f'# Hints\n{instance.hints_text}\n\n'
instruction += (
'IMPORTANT: You should ONLY interact with the environment provided to you AND NEVER ASK FOR HUMAN HELP.\n'
'You should NOT modify any existing test case files. If needed, you can add new test cases in a NEW file to reproduce the issue.\n'
'You SHOULD INCLUDE PROPER INDENTATION in your edit commands.\n'
)
instruction += AGENT_CLS_TO_INST_SUFFIX.get(agent_class, '')
# Run the agent
state: State = asyncio.run(
main(
instruction,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN.get(agent_class),
sandbox=sandbox,
)
)
# Get git patch
git_patch = sandbox.get_diff_patch()
logger.info(f'Got git diff for instance {instance.instance_id}')
# ======= Attempt to evaluate the agent's edits =======
# Attempt to analyze the test patch to get involved filepaths
test_result = get_test_result(instance, sandbox, workspace_dir_name)
if state is None:
raise ValueError('State should not be None.')
# Save the output
output = {
'instance_id': instance.instance_id,
'swe_instance': instance.to_dict(),
'instruction': instruction,
'git_patch': git_patch,
'metadata': metadata,
'history': [
(event_to_dict(action), event_to_dict(obs)) for action, obs in state.history
],
'error': state.error if state and state.error else None,
'test_result': test_result,
}
# Close the sandbox
sandbox.close()
return output
if __name__ == '__main__':
# Load the dataset
dataset = load_dataset('princeton-nlp/SWE-bench_Lite')
swe_bench_tests = dataset['test'].to_pandas()
if args.llm_config:
specified_llm_config = get_llm_config_arg(args.llm_config)
if specified_llm_config:
config.llm = specified_llm_config
logger.info(f'Config for evaluation: {config}')
# TEST METADATA
agent_class = args.agent_cls
assert (
agent_class in AGENT_CLS_TO_FAKE_USER_RESPONSE_FN
), f'Unsupported agent class: {agent_class}'
model_name = config.llm.model.split('/')[-1]
max_iterations = args.max_iterations
eval_note = ''
if args.eval_note is not None:
eval_note += '_N_' + args.eval_note
eval_output_dir = os.path.join(
args.eval_output_dir,
'swe_bench',
agent_class,
model_name + '_maxiter_' + str(max_iterations) + eval_note,
)
pathlib.Path(eval_output_dir).mkdir(parents=True, exist_ok=True)
pathlib.Path(os.path.join(eval_output_dir, 'logs')).mkdir(
parents=True, exist_ok=True
)
logger.info(f'Using evaluation output directory: {eval_output_dir}')
metadata = {
'agent_class': agent_class,
'model_name': model_name,
'max_iterations': max_iterations,
'eval_output_dir': eval_output_dir,
'start_time': time.strftime('%Y-%m-%d %H:%M:%S'),
# get the commit id of current repo
'git_commit': subprocess.check_output(['git', 'rev-parse', 'HEAD'])
.decode('utf-8')
.strip(),
}
logger.info(f'Metadata: {metadata}')
with open(os.path.join(eval_output_dir, 'metadata.json'), 'w') as f:
json.dump(metadata, f)
# LIMIT EVALUATION
eval_n_limit = args.eval_n_limit
if eval_n_limit:
swe_bench_tests = swe_bench_tests.head(eval_n_limit)
logger.info(f'Limiting evaluation to first {eval_n_limit} instances.')
# OUTPUT FILE
output_file = os.path.join(eval_output_dir, 'output.jsonl')
logger.info(f'Writing evaluation output to {output_file}')
finished_instance_ids = set()
if os.path.exists(output_file):
with open(output_file, 'r') as f:
for line in f:
data = json.loads(line)
finished_instance_ids.add(data['instance_id'])
logger.warning(
f'Output file {output_file} already exists. Loaded {len(finished_instance_ids)} finished instances.'
)
output_fp = open(output_file, 'a')
logger.info(
f'Evaluation started with Agent {agent_class}, model {model_name}, max iterations {max_iterations}.'
)
# filter out finished instances
new_swe_bench_tests = []
for idx, instance in swe_bench_tests.iterrows():
if instance.instance_id in finished_instance_ids:
logger.info(
f'Skipping instance {instance.instance_id} as it is already finished.'
)
continue
new_swe_bench_tests.append(instance)
swe_bench_tests = pd.DataFrame(new_swe_bench_tests)
logger.info(
f'Finished instances: {len(finished_instance_ids)}, Remaining instances: {len(swe_bench_tests)}'
)
pbar = tqdm(total=len(swe_bench_tests))
def update_progress(future):
pbar.update(1)
output = future.result()
pbar.set_description(f'Instance {output["instance_id"]}')
pbar.set_postfix_str(f'Test Result: {output["test_result"]["result"]}')
logger.info(
f'Finished evaluation for instance {output["instance_id"]}: {output["test_result"]["result"]}'
)
output_fp.write(json.dumps(output) + '\n')
output_fp.flush()
num_workers = args.eval_num_workers
logger.info(f'Using {num_workers} workers for evaluation.')
skip_workspace_mount = agent_class == 'CodeActAgent'
logger.info(f'Skipping workspace mount: {skip_workspace_mount}')
try:
with ProcessPoolExecutor(num_workers) as executor:
futures = []
for row_idx, instance in swe_bench_tests.iterrows():
future = executor.submit(
process_instance,
instance,
agent_class,
metadata,
skip_workspace_mount,
reset_logger=bool(num_workers > 1),
)
future.add_done_callback(update_progress)
futures.append(future)
# Wait for all futures to complete
for future in futures:
future.result()
except KeyboardInterrupt:
print('KeyboardInterrupt received. Cleaning up...')
cleanup()
output_fp.close()
logger.info('Evaluation finished.')

View File

@ -0,0 +1,17 @@
FROM ghcr.io/opendevin/sandbox:latest
RUN apt-get update && \
apt-get install -y libffi-dev bash gcc git jq wget pkg-config libfreetype-dev libfreetype6 libfreetype6-dev rsync && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
RUN ln -sfn /bin/bash /bin/sh
RUN mkdir -p /opendevin/logs && chmod 777 /opendevin/logs
# Setup Git
RUN git config --global user.email "swebench@swebench.ai"
RUN git config --global user.name "swebench"
CMD ["/bin/bash"]
# pushd evaluation/swe_bench
# docker build -t ghcr.io/opendevin/eval-swe-bench:builder -f ./scripts/docker/Dockerfile.builder .

View File

@ -0,0 +1,19 @@
FROM ghcr.io/opendevin/eval-swe-bench:builder
# # Install Mamba/Conda
RUN wget "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
# install to /swe_util/miniforge3
RUN mkdir /swe_util
RUN bash Miniforge3-$(uname)-$(uname -m).sh -b -p /swe_util/miniforge3
ENV PATH=/swe_util/miniforge3/bin:$PATH
RUN /swe_util/miniforge3/bin/mamba init bash
# Setup SWE-Bench Eval Env
RUN /bin/bash -c "/swe_util/miniforge3/bin/mamba create -n swe-bench-eval python==3.11.5 -y"
RUN /bin/bash -c ". /swe_util/miniforge3/etc/profile.d/conda.sh && conda activate swe-bench-eval && \
pip install requests python-dotenv GitPython datasets pandas beautifulsoup4 ghapi"
RUN /bin/bash -c ". /swe_util/miniforge3/etc/profile.d/conda.sh && conda config --set changeps1 False && conda config --append channels conda-forge"
CMD ["/bin/bash"]
# pushd evaluation/swe_bench
# docker build -t ghcr.io/opendevin/eval-swe-bench:builder_with_conda -f ./scripts/docker/Dockerfile.builder_with_conda .

View File

@ -0,0 +1,13 @@
FROM ghcr.io/opendevin/eval-swe-bench:full_deps
# ================== COPY Smaller things ==================
# copy everything except the folder of `eval_data` or `miniforge3`
# typically, this should be the OD codebase
RUN --mount=type=bind,source=./eval_workspace,target=/eval_workspace \
rsync -ar --progress \
--exclude='eval_data' \
--exclude='miniforge3' \
/eval_workspace/ /swe_util/
# pushd evaluation/SWE-bench
# docker build -t ghcr.io/opendevin/eval-swe-bench:full-v1.0 -f ./scripts/docker/Dockerfile.full.v1.0 .

View File

@ -0,0 +1,72 @@
FROM ghcr.io/opendevin/eval-swe-bench:builder
# This Dockerfile is used to build the Docker image for SWE-Bench evaluation.
# YOU SHOULD ENSURE ./eval_workspace CONTAINS THE EVALUATION WORKSPACE (testbed, conda)
# Check BUILD_TESTBED_AND_ENV.md for more details.
RUN mkdir -p /swe_util
# Use https://github.com/moby/moby/issues/15771#issuecomment-1762893340
# to copy files from host to container with --exclude
# # ================== Prepare Eval Data ==================
# Copy everything in eval_data except the "testbeds"
RUN --mount=type=bind,source=./eval_workspace,target=/eval_workspace \
rsync -ar --progress \
--exclude='testbeds' \
/eval_workspace/eval_data /swe_util/
RUN --mount=type=bind,source=./eval_workspace,target=/eval_workspace \
rsync -ar --progress \
--exclude='matplotlib*' \
--exclude='scikit-learn*' \
/eval_workspace/eval_data/testbeds /swe_util/eval_data/
# # copy the larger ones in separate layers
# COPY ./eval_workspace/eval_data/testbeds/matplotlib* /swe_util/eval_data/testbeds/
RUN --mount=type=bind,source=./eval_workspace,target=/eval_workspace \
rsync -ar --progress \
/eval_workspace/eval_data/testbeds/matplotlib* /swe_util/eval_data/testbeds/
# COPY ./eval_workspace/eval_data/testbeds/scikit-learn* /swe_util/eval_data/testbeds/
RUN --mount=type=bind,source=./eval_workspace,target=/eval_workspace \
rsync -ar --progress \
/eval_workspace/eval_data/testbeds/scikit-learn* /swe_util/eval_data/testbeds/
# ================== Prepare Miniconda3 ==================
# Copy the Miniconda3 environment
# copy everything except the folder of `envs` & `pkgs` (two large folders)
RUN --mount=type=bind,source=./eval_workspace,target=/eval_workspace \
rsync -ar --progress \
--exclude='envs' \
--exclude='pkgs' \
/eval_workspace/miniforge3 /swe_util/
# copy pkgs in separate layers (~9.4GB)
RUN --mount=type=bind,source=./eval_workspace,target=/eval_workspace \
rsync -ar --progress \
/eval_workspace/miniforge3/pkgs /swe_util/miniforge3/
# copy envs in separate layers (except matplotlib & scikit-learn - larger ones)
RUN --mount=type=bind,source=./eval_workspace,target=/eval_workspace \
rsync -ar --progress \
--exclude='matplotlib*' \
--exclude='scikit-learn*' \
--exclude='pydata*' \
/eval_workspace/miniforge3/envs /swe_util/miniforge3/
# copy the larger ones in separate layers
RUN --mount=type=bind,source=./eval_workspace,target=/eval_workspace \
rsync -ar --progress \
/eval_workspace/miniforge3/envs/matplotlib* /swe_util/miniforge3/envs/
RUN --mount=type=bind,source=./eval_workspace,target=/eval_workspace \
rsync -ar --progress \
/eval_workspace/miniforge3/envs/scikit-learn* /swe_util/miniforge3/envs/
RUN --mount=type=bind,source=./eval_workspace,target=/eval_workspace \
rsync -ar --progress \
/eval_workspace/miniforge3/envs/pydata* /swe_util/miniforge3/envs/
# pushd evaluation/SWE-bench
# docker build -t ghcr.io/opendevin/eval-swe-bench:full_deps -f ./scripts/docker/Dockerfile.full_deps .

View File

@ -0,0 +1,13 @@
# Docker Build Guide
## Builder
This constructs the Docker images used by `evaluation/swe_bench/scripts/prepare_swe_utils.sh`, which downloads the datasets.
```bash
pushd evaluation/swe_bench
# This builds base image with basic dependencies
docker build -t ghcr.io/opendevin/eval-swe-bench:builder -f ./scripts/docker/Dockerfile.builder .
# This builds image with SWE-Bench conda environment pre-installed
docker build -t ghcr.io/opendevin/eval-swe-bench:builder_with_conda -f ./scripts/docker/Dockerfile.builder_with_conda .
```
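Once both images are built, the data-preparation script uses them to clone OD-SWE-bench and download the evaluation data. A minimal sketch of the intended invocation, assuming it is run from the repository root (the relative `eval_workspace` path inside the script implies this):
```bash
# Clones OD-SWE-bench and prepares eval_data under
# evaluation/swe_bench/eval_workspace using the builder_with_conda image.
bash evaluation/swe_bench/scripts/prepare_swe_utils.sh
```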

View File

@ -0,0 +1,35 @@
#!/bin/bash
PROCESS_FILEPATH=$1
if [ -z "$PROCESS_FILEPATH" ]; then
echo "Error: PROCESS_FILEPATH is empty. Usage: ./eval_infer.sh <output_file>"
exit 1
fi
if [ ! -f $PROCESS_FILEPATH ]; then
echo "Error: $PROCESS_FILEPATH is not a file"
exit 1
fi
PROCESS_FILEPATH=$(realpath $PROCESS_FILEPATH)
FILE_DIR=$(dirname $PROCESS_FILEPATH)
FILE_NAME=$(basename $PROCESS_FILEPATH)
mkdir -p $FILE_DIR/eval_logs
mkdir -p $FILE_DIR/swe_bench_format
echo "Evaluating $FILE_NAME @ $FILE_DIR"
echo "Merged output file with fine-grained report will be saved to $FILE_DIR"
docker run --rm \
-v $FILE_DIR:/swe_bench_output \
-e MINICONDA3=/swe_util/miniforge3 \
-e OD_SWE_BENCH=/swe_util/OD-SWE-bench \
-e EVAL_DATA_DIR=/swe_util/eval_data \
-w /swe_util \
ghcr.io/opendevin/eval-swe-bench:full-v1.0 \
bash -c "./get_agent_report.sh --output-file /swe_bench_output/$FILE_NAME \
--agent-name CodeActAgent \
--dataset swe-bench-test-lite \
--experiment-name test_experiment \
--merge-report && cp -r /swe_util/eval_data/eval_logs/test_experiment/* /swe_bench_output/eval_logs \
&& cp -r /swe_util/eval_data/outputs/* /swe_bench_output/swe_bench_format/"
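A hedged invocation example for the script above; the output path follows the directory layout produced by `run_infer.py` and is illustrative:
```bash
# Evaluates a finished inference run; the merged report and SWE-Bench-format
# predictions are written next to the output file (path is illustrative).
./evaluation/swe_bench/scripts/eval_infer.sh \
  evaluation/evaluation_outputs/outputs/swe_bench/CodeActAgent/gpt-4_maxiter_50/output.jsonl
```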

View File

@ -0,0 +1,15 @@
#!/bin/bash
AGENT=CodeActAgent
AGENT_VERSION=v1.3
MODEL_CONFIG=$1
# You should add $MODEL_CONFIG in your `config.toml`
poetry run python3 evaluation/swe_bench/run_infer.py \
--agent-cls $AGENT \
--llm-config $MODEL_CONFIG \
--max-iterations 50 \
--max-chars 10000000 \
--eval-num-workers 8 \
--eval-note $AGENT_VERSION
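For reference, a typical invocation of the wrapper above might look like the following; the script path is assumed from this PR's layout, and `eval_gpt4` is a hypothetical LLM config name that must exist as a section in `config.toml`:
```bash
# Runs SWE-Bench inference with CodeActAgent using the LLM config named "eval_gpt4"
# (hypothetical name; replace with an llm config section defined in config.toml).
bash evaluation/swe_bench/scripts/run_infer.sh eval_gpt4
```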

View File

@ -0,0 +1,81 @@
#!/bin/bash
# THIS SCRIPT ONLY NEEDS TO BE RUN ONCE BEFORE EVALUATION
set -e
function setup_environment_and_testbed {
local instance_file_name=$1
# throw error if user name is not opendevin
if [ "$USER" != "opendevin" ]; then
echo "Error: This script is intended to be run by the 'opendevin' user only." >&2
exit 1
fi
# =======================================================
# Install & Setup Conda
# install Miniforge3 into /swe_util/miniforge3 if it does not already exist
if [ ! -d /swe_util/miniforge3 ]; then
pushd /swe_util
echo "Downloading and installing Miniforge3"
wget "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash Miniforge3-$(uname)-$(uname -m).sh -b -p /swe_util/miniforge3
fi
echo 'export PATH=/swe_util/miniforge3/bin:$PATH' >> ~/.bashrc
eval "$(/swe_util/miniforge3/bin/conda shell.bash hook)"
conda init bash
source ~/.bashrc
conda config --set changeps1 False
conda config --append channels conda-forge
# =======================================================
# Install swe-bench-eval environment if it does not exist
ENV_EXISTS=$(conda info --envs | awk '/swe-bench-eval/ {print $1}')
echo "ENV_EXISTS: $ENV_EXISTS"
if [ -z "$ENV_EXISTS" ]; then
echo "Environment swe-bench-eval does not exist. Creating the environment."
conda create -n swe-bench-eval python==3.11.5 -y
conda activate swe-bench-eval
pip install requests python-dotenv GitPython datasets pandas beautifulsoup4 ghapi
fi
conda activate swe-bench-eval
echo 'swe-bench-eval environment is ready.'
# =======================================================
# Locate the swe-bench-test-lite.json / swe-bench-test.json file containing all instances
INSTANCE_DATA_FILE=/swe_util/eval_data/instances/$instance_file_name
echo "Instance data file loaded: $INSTANCE_DATA_FILE"
# =======================================================
# generate testbed & conda environment for ALL instances in the test file
echo "Generating testbed & conda environment for all instances in the test file"
export PYTHONPATH=/swe_util/OD-SWE-bench:$PYTHONPATH
python3 /swe_util/OD-SWE-bench/swebench/harness/engine_testbed.py \
--instances_path $INSTANCE_DATA_FILE \
--log_dir /swe_util/eval_data/testbed_logs \
--conda_path /swe_util/miniforge3 \
--testbed /swe_util/eval_data/testbeds \
--timeout 1000
# Check every log in /swe_util/eval_data/testbed_logs to see if it contains "Init Succeeded"
# If not, print the log file name and exit
for log_file in /swe_util/eval_data/testbed_logs/*; do
if ! grep -q "Init Succeeded" $log_file; then
echo "Error: $log_file does not contain 'Init Succeeded'"
exit 1
fi
done
echo "All logs contain 'Init Succeeded'. Testbed & conda environment setup is successful."
}
# check if $1 is either swe-bench-test-lite.json or swe-bench-test.json
if [ "$1" != "swe-bench-test-lite.json" ] && [ "$1" != "swe-bench-test.json" ]; then
echo "Error: Invalid input file name. Please provide either swe-bench-test-lite.json or swe-bench-test.json"
exit 1
fi
# call the function
echo "Calling setup_environment_and_testbed with $1"
setup_environment_and_testbed $1

View File

@ -0,0 +1,86 @@
#!/bin/bash
# Initialize variables
output_file=""
agent_name=""
dataset=""
num_processes=15
experiment_name=""
merge_report=false
# Parse command-line arguments
while [[ "$#" -gt 0 ]]; do
case $1 in
--output-file) output_file="$2"; shift ;;
--agent-name) agent_name="$2"; shift ;;
--dataset) dataset="$2"; shift ;;
--num-processes) num_processes="$2"; shift ;;
--experiment-name) experiment_name="$2"; shift ;;
--merge-report) merge_report=true ;;
*) echo "Unknown parameter passed: $1"; exit 1 ;;
esac
shift
done
# Check if arguments are provided
if [[ -z "$output_file" || -z "$agent_name" || -z "$dataset" ]]; then
echo "output-file, agent-name and dataset are required!"
exit 1
fi
echo "output file: $output_file"
echo "agent name: $agent_name"
echo "dataset: $dataset"
echo "num processes: $num_processes"
if [ ! -z "$experiment_name" ]
then
echo "use provided experiment name: $experiment_name"
else
current_folder=$(basename $(dirname $output_file))
parent_folder=$(basename $(dirname $(dirname $output_file)))
experiment_name="${parent_folder}_${current_folder}"
echo "use generated experiment name: $experiment_name"
fi
# Convert the agent output to the SWE-Bench format
if [ -z "$EVAL_DATA_DIR" ]; then
echo "EVAL_DATA_DIR is not set."
exit 1
fi
target_file="${EVAL_DATA_DIR}/outputs/${experiment_name}_${dataset}.json"
python process_output_json_file.py $output_file $agent_name $target_file
# Run the evaluation script
if [ -z "$OD_SWE_BENCH" ]; then
echo "OD_SWE_BENCH is not set."
exit 1
fi
if [ -z "$MINICONDA3" ]; then
echo "MINICONDA3 is not set."
exit 1
fi
mkdir -p $EVAL_DATA_DIR/eval_logs/$experiment_name
export PYTHONPATH=$OD_SWE_BENCH && cd $OD_SWE_BENCH && . $MINICONDA3/etc/profile.d/conda.sh && conda activate $MINICONDA3/envs/swe-bench-eval && python swebench/harness/run_evaluation.py \
--swe_bench_tasks $EVAL_DATA_DIR/instances/$dataset.json \
--temp_dir $EVAL_DATA_DIR/eval_temp \
--testbed $EVAL_DATA_DIR/testbeds \
--conda_path $MINICONDA3 \
--predictions_path $target_file \
--log_dir $EVAL_DATA_DIR/eval_logs/$experiment_name \
--num_processes $num_processes \
--skip_existing \
--timeout 1600 \
--verbose
# Get the report
cp $target_file $EVAL_DATA_DIR/eval_logs/$experiment_name
export PYTHONPATH=$OD_SWE_BENCH && cd $OD_SWE_BENCH && . $MINICONDA3/etc/profile.d/conda.sh && conda activate $MINICONDA3/envs/swe-bench-eval && python swebench/metrics/get_model_report.py \
--model $agent_name \
--swe_bench_tasks $EVAL_DATA_DIR/instances/$dataset.json \
--predictions_path $EVAL_DATA_DIR/eval_logs/$experiment_name/${experiment_name}_${dataset}.json \
--log_dir $EVAL_DATA_DIR/eval_logs/$experiment_name/$agent_name
# Merge report to the agent output
if [ "$merge_report" = true ]; then
cd /swe_util && python merge_fine_grained_report.py --od_output_file $output_file \
--fine_grained_report_file $EVAL_DATA_DIR/eval_logs/$experiment_name/${experiment_name}_${dataset}.report.json
fi

View File

@ -0,0 +1,61 @@
#!/bin/bash
# Input arguments
output_file=""
model_name=""
dataset=""
num_processes=15
experiment_name=""
# Parse command-line arguments
while [[ "$#" -gt 0 ]]; do
case $1 in
--output-file) output_file="$2"; shift ;;
--model-name) model_name="$2"; shift ;;
--dataset) dataset="$2"; shift ;;
--num-processes) num_processes="$2"; shift ;;
--experiment-name) experiment_name="$2"; shift ;;
*) echo "Unknown parameter passed: $1"; exit 1 ;;
esac
shift
done
# Check if arguments are provided
if [[ -z "$output_file" || -z "$model_name" || -z "$dataset" ]]; then
echo "output-file, model-name and dataset are required!"
exit 1
fi
echo "output file: $output_file"
echo "model name: $model_name"
echo "dataset: $dataset"
echo "num processes: $num_processes"
if [ ! -z "$experiment_name" ]
then
echo "use provided experiment name: $experiment_name"
else
experiment_name=${model_name}__${dataset}
echo "use generated experiment name: $experiment_name"
fi
# Run the evaluation script
mkdir -p $EVAL_DATA_DIR/eval_logs/$experiment_name
export PYTHONPATH=$OD_SWE_BENCH && cd $OD_SWE_BENCH && . $MINICONDA3/etc/profile.d/conda.sh && conda activate $MINICONDA3/envs/swe-bench-eval && python swebench/harness/run_evaluation.py \
--swe_bench_tasks $EVAL_DATA_DIR/instances/$dataset.json \
--temp_dir $EVAL_DATA_DIR/eval_temp \
--testbed $EVAL_DATA_DIR/testbeds \
--conda_path $MINICONDA3 \
--predictions_path $output_file \
--log_dir $EVAL_DATA_DIR/eval_logs/$experiment_name \
--num_processes $num_processes \
--skip_existing \
--timeout 1600 \
--verbose
# Get the report
predictions_fname=$(basename $output_file)
cp $output_file $EVAL_DATA_DIR/eval_logs/$experiment_name
export PYTHONPATH=$OD_SWE_BENCH && cd $OD_SWE_BENCH && . $MINICONDA3/etc/profile.d/conda.sh && conda activate $MINICONDA3/envs/swe-bench-eval && python swebench/metrics/get_model_report.py \
--model $model_name \
--swe_bench_tasks $EVAL_DATA_DIR/instances/$dataset.json \
--predictions_path $EVAL_DATA_DIR/eval_logs/$experiment_name/$predictions_fname \
--log_dir $EVAL_DATA_DIR/eval_logs/$experiment_name/$model_name

View File

@ -0,0 +1,29 @@
import argparse
import json
def merge_fine_grained_report(od_output_file, fine_grained_report_file):
merged_od_output_file = od_output_file.replace('.jsonl', '.merged.jsonl')
merged_report = []
fine_grained_report = json.load(open(fine_grained_report_file))
for line in open(od_output_file):
line = json.loads(line)
instance_id = line['instance_id']
line['fine_grained_report'] = fine_grained_report[instance_id]
merged_report.append(line)
# dump the merged report as a jsonl file
with open(merged_od_output_file, 'w') as f:
for line in merged_report:
f.write(json.dumps(line) + '\n')
print(f'Agent output with report merged created at {merged_od_output_file}')
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('--od_output_file', help='Path to the OD output file')
parser.add_argument(
'--fine_grained_report_file', help='Path to the fine grained report file'
)
args = parser.parse_args()
merge_fine_grained_report(args.od_output_file, args.fine_grained_report_file)
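A hedged usage sketch for the helper above; the paths mirror the ones `get_agent_report.sh` passes in and are illustrative:
```bash
# Merges the fine-grained per-instance report back into the agent's output.jsonl,
# producing output.merged.jsonl alongside it (paths are illustrative).
python merge_fine_grained_report.py \
  --od_output_file /swe_bench_output/output.jsonl \
  --fine_grained_report_file /swe_util/eval_data/eval_logs/test_experiment/test_experiment_swe-bench-test-lite.report.json
```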

View File

@ -0,0 +1,27 @@
#!/bin/bash
set -e
EVAL_WORKSPACE="evaluation/swe_bench/eval_workspace"
mkdir -p $EVAL_WORKSPACE
# 1. Prepare REPO
echo "==== Prepare SWE-bench repo ===="
OD_SWE_BENCH_REPO_PATH="https://github.com/OpenDevin/OD-SWE-bench.git"
OD_SWE_BENCH_REPO_BRANCH="eval"
git clone -b $OD_SWE_BENCH_REPO_BRANCH $OD_SWE_BENCH_REPO_PATH $EVAL_WORKSPACE/OD-SWE-bench
# 2. Prepare DATA
echo "==== Prepare SWE-bench data ===="
EVAL_IMAGE=ghcr.io/opendevin/eval-swe-bench:builder_with_conda
EVAL_WORKSPACE=$(realpath $EVAL_WORKSPACE)
chmod +x $EVAL_WORKSPACE/OD-SWE-bench/swebench/harness/prepare_data.sh
if [ -d $EVAL_WORKSPACE/eval_data ]; then
rm -r $EVAL_WORKSPACE/eval_data
fi
docker run \
-v $EVAL_WORKSPACE:/workspace \
-w /workspace \
-u $(id -u):$(id -g) \
-e HF_DATASETS_CACHE="/tmp" \
--rm -it $EVAL_IMAGE \
bash -c "cd OD-SWE-bench/swebench/harness && /swe_util/miniforge3/bin/conda run -n swe-bench-eval ./prepare_data.sh && mv eval_data /workspace/"

View File

@ -0,0 +1,35 @@
import json
import sys
def process_jsonl(input_file, model_name, output_file):
try:
with open(input_file, 'r') as infile, open(output_file, 'w') as outfile:
data = []
for line in infile:
if line.strip(): # Ensure the line is not empty
json_obj = json.loads(line)
# Create new object with required fields and new model_name
new_obj = {
'instance_id': json_obj['instance_id'],
'model_patch': json_obj['git_patch'],
'model_name_or_path': model_name,
}
data.append(new_obj)
json.dump(
data, outfile, indent=2
) # Write the list of JSON objects to a file
print(f'Output JSON list created at {output_file}')
except Exception as e:
print(f'Error: {str(e)}')
# Usage: python script.py input.jsonl model_name output.json
if __name__ == '__main__':
if len(sys.argv) != 4:
print('Usage: python script.py <input_file> <model_name> <output_file>')
else:
input_file = sys.argv[1]
model_name = sys.argv[2]
output_file = sys.argv[3]
process_jsonl(input_file, model_name, output_file)
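A hedged example of converting an agent output file into the SWE-Bench prediction format; file names and the model name are illustrative:
```bash
# Converts OpenDevin's output.jsonl into a JSON list of
# {instance_id, model_patch, model_name_or_path} entries accepted by the SWE-Bench harness.
python process_output_json_file.py output.jsonl CodeActAgent predictions.json
```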

View File

@ -0,0 +1,96 @@
#!/bin/bash
set -e
# assert user name is `root`
if [ "$USER" != "root" ]; then
echo "Error: This script is intended to be run by the 'root' user only." >&2
exit 1
fi
source ~/.bashrc
SWEUTIL_DIR=/swe_util
# Create logs directory
LOG_DIR=/opendevin/logs
mkdir -p $LOG_DIR && chmod 777 $LOG_DIR
# FIXME: Cannot read SWE_INSTANCE_ID from the environment variable
# SWE_INSTANCE_ID=django__django-11099
if [ -z "$SWE_INSTANCE_ID" ]; then
echo "Error: SWE_INSTANCE_ID is not set." >&2
exit 1
fi
# Read the swe-bench-test-lite.json file and extract the required item based on instance_id
item=$(jq --arg INSTANCE_ID "$SWE_INSTANCE_ID" '.[] | select(.instance_id == $INSTANCE_ID)' $SWEUTIL_DIR/eval_data/instances/swe-bench-test-lite.json)
if [[ -z "$item" ]]; then
echo "No item found for the provided instance ID."
exit 1
fi
CONDA_ENV_NAME=$(echo "$item" | jq -r '.repo + "__" + .version | gsub("/"; "__")')
echo "CONDA_ENV_NAME: $CONDA_ENV_NAME"
SWE_TASK_DIR=/opendevin/swe_tasks
mkdir -p $SWE_TASK_DIR
# Dump test_patch to $SWE_TASK_DIR/test.patch
echo "$item" | jq -r '.test_patch' > $SWE_TASK_DIR/test.patch
# Dump patch to $SWE_TASK_DIR/gold.patch
echo "$item" | jq -r '.patch' > $SWE_TASK_DIR/gold.patch
# Dump the item to $SWE_TASK_DIR/instance.json, excluding the "test_patch" and "patch" fields
echo "$item" | jq 'del(.test_patch, .patch)' > $SWE_TASK_DIR/instance.json
# Clear the workspace
rm -rf /workspace/*
# Copy repo to workspace
if [ -d /workspace/$CONDA_ENV_NAME ]; then
rm -rf /workspace/$CONDA_ENV_NAME
fi
cp -r $SWEUTIL_DIR/eval_data/testbeds/$CONDA_ENV_NAME /workspace
# Reset swe-bench testbed and install the repo
. $SWEUTIL_DIR/miniforge3/etc/profile.d/conda.sh
conda config --set changeps1 False
conda config --append channels conda-forge
conda activate swe-bench-eval
mkdir -p $SWE_TASK_DIR/reset_testbed_temp
mkdir -p $SWE_TASK_DIR/reset_testbed_log_dir
SWE_BENCH_DIR=/swe_util/OD-SWE-bench
output=$(
export PYTHONPATH=$SWE_BENCH_DIR && \
cd $SWE_BENCH_DIR && \
python swebench/harness/reset_swe_env.py \
--swe_bench_tasks $SWEUTIL_DIR/eval_data/instances/swe-bench-test.json \
--temp_dir $SWE_TASK_DIR/reset_testbed_temp \
--testbed /workspace \
--conda_path $SWEUTIL_DIR/miniforge3 \
--instance_id $SWE_INSTANCE_ID \
--log_dir $SWE_TASK_DIR/reset_testbed_log_dir \
--timeout 900 \
--verbose
)
REPO_PATH=$(echo "$output" | awk -F': ' '/repo_path:/ {print $2}')
TEST_CMD=$(echo "$output" | awk -F': ' '/test_cmd:/ {print $2}')
echo "Repo Path: $REPO_PATH"
echo "Test Command: $TEST_CMD"
echo "export SWE_BENCH_DIR=\"$SWE_BENCH_DIR\"" >> ~/.bashrc
echo "export REPO_PATH=\"$REPO_PATH\"" >> ~/.bashrc
echo "export TEST_CMD=\"$TEST_CMD\"" >> ~/.bashrc
if [[ "$REPO_PATH" == "None" ]]; then
echo "Error: Failed to retrieve repository path. Tests may not have passed or output was not as expected." >&2
exit 1
fi
# Activate instance-specific environment
. $SWEUTIL_DIR/miniforge3/etc/profile.d/conda.sh
conda activate $CONDA_ENV_NAME
set +e

View File

@ -0,0 +1,31 @@
#!/bin/bash
# THIS SCRIPT ONLY NEEDS TO BE RUN ONCE BEFORE EVALUATION
EVAL_DOCKER_IMAGE=ghcr.io/opendevin/eval-swe-bench:builder
EVAL_WORKSPACE="evaluation/swe_bench/eval_workspace"
EVAL_WORKSPACE=$(realpath $EVAL_WORKSPACE)
SETUP_INSTANCE_FILENAME=swe-bench-test.json # OR swe-bench-test-lite.json
if [ ! -d $EVAL_WORKSPACE ]; then
mkdir -p $EVAL_WORKSPACE
fi
if [ -f $EVAL_WORKSPACE/swe_env_setup.sh ]; then
rm $EVAL_WORKSPACE/swe_env_setup.sh
fi
SCRIPT_DIR=evaluation/swe_bench/scripts/setup
cp $SCRIPT_DIR/_swe_env_setup.sh $EVAL_WORKSPACE/swe_env_setup.sh
cp $SCRIPT_DIR/swe_entry.sh $EVAL_WORKSPACE/swe_entry.sh
cp $SCRIPT_DIR/get_model_report.sh $EVAL_WORKSPACE/get_model_report.sh
cp $SCRIPT_DIR/get_agent_report.sh $EVAL_WORKSPACE/get_agent_report.sh
cp $SCRIPT_DIR/process_output_json_file.py $EVAL_WORKSPACE/process_output_json_file.py
cp $SCRIPT_DIR/merge_fine_grained_report.py $EVAL_WORKSPACE/merge_fine_grained_report.py
docker run \
-v $EVAL_WORKSPACE:/swe_util \
-e UID=$(id -u) \
--rm -it $EVAL_DOCKER_IMAGE \
bash -c "useradd -rm -d /home/opendevin -s /bin/bash -u $(id -u) opendevin && su opendevin -c 'bash /swe_util/swe_env_setup.sh $SETUP_INSTANCE_FILENAME'"
#

View File

@ -0,0 +1,204 @@
import sys
import uuid
from opendevin.core.config import config
from opendevin.core.logger import opendevin_logger as logger
from opendevin.runtime.docker.ssh_box import DockerSSHBox
from opendevin.runtime.plugins import JupyterRequirement, SWEAgentCommandsRequirement
SWE_BENCH_CONTAINER_IMAGE = 'ghcr.io/opendevin/eval-swe-bench:full-v1.0'
class SWEBenchSSHBox(DockerSSHBox):
def __init__(
self,
container_image: str,
timeout: int = 120,
sid: str | None = None,
swe_instance_id: str | None = None,
swe_instance: dict | None = None,
skip_workspace_mount: bool = True,
):
if swe_instance_id is None:
raise ValueError('swe_instance_id must be provided!')
self.swe_instance_id = swe_instance_id
self.swe_instance = swe_instance
self.skip_workspace_mount = skip_workspace_mount
assert (
container_image is not None
), 'container_image is required for SWEBenchSSHBox!'
# Need to run as root to use SWEBench container
sid = f'swe_bench_{swe_instance_id}' + str(uuid.uuid4())
super().__init__(container_image, timeout, sid)
exit_code, output = self.execute('mv ~/.bashrc ~/.bashrc.bak')
assert exit_code == 0, f'Failed to backup ~/.bashrc: {output}'
exit_code, output = self.execute(
f"echo 'export SWE_INSTANCE_ID={self.swe_instance_id}' >> ~/.bashrc && echo 'export PIP_CACHE_DIR=~/.cache/pip' >> ~/.bashrc && echo \"alias git='git --no-pager'\" >> ~/.bashrc"
)
assert exit_code == 0, f'Failed to set SWE_INSTANCE_ID in ~/.bashrc: {output}'
logger.info('Sourcing swe_entry.sh to set up environment variables')
# larger timeout for SWE-Bench init to account for long-running installations (e.g., those requiring compilation)
exit_code, output = self.execute('source /swe_util/swe_entry.sh', timeout=600)
logger.info('exit code: %d', exit_code)
logger.info(output)
assert exit_code == 0, f'Failed to source swe_entry.sh: {output}'
logger.info('Sourced swe_entry.sh successfully')
@property
def volumes(self):
if self.skip_workspace_mount:
return {
k: v
for k, v in super().volumes.items()
if not v['bind'] == self.sandbox_workspace_dir
}
return super().volumes
@classmethod
def get_box_for_instance(
cls,
instance,
workspace_dir_name=None,
n_tries=5,
skip_workspace_mount: bool = True,
workspace_mount_path: str | None = None,
) -> 'SWEBenchSSHBox':
if workspace_dir_name is None:
workspace_dir_name = f"{instance['repo']}__{instance['version']}".replace(
'/', '__'
)
config.workspace_base = workspace_mount_path
config.workspace_mount_path = workspace_mount_path
sandbox = cls(
container_image=SWE_BENCH_CONTAINER_IMAGE,
swe_instance_id=instance['instance_id'],
swe_instance=instance,
skip_workspace_mount=skip_workspace_mount,
)
logger.info(f"SSH box started for instance {instance['instance_id']}.")
# cd to the repo
exit_code, output = sandbox.execute(f'cd /workspace/{workspace_dir_name}')
if exit_code != 0:
logger.error(f'Failed to cd to the repo: {output}')
sys.exit(1)
# remove all future commits & remotes, following Devin
# https://www.cognition-labs.com/post/swe-bench-technical-report
exit_code, output = sandbox.execute('git reset --hard')
if exit_code != 0:
logger.error(f'Failed to reset the repo: {output}')
sys.exit(1)
exit_code, output = sandbox.execute(
'for remote_name in $(git remote); do git remote remove "${remote_name}"; done'
)
if exit_code != 0:
logger.error(f'Failed to remove remote: {output}')
sys.exit(1)
return sandbox
def get_diff_patch(self):
# add everything to the index
exit_code, output = self.execute('git add --all')
if exit_code != 0:
logger.error('Failed to add everything to the index')
return ''
# get the git diff
exit_code, git_patch = self.execute(
f'git diff --no-color --cached {self.swe_instance["base_commit"]}'
)
if exit_code != 0:
logger.error('Failed to get git diff')
return ''
return git_patch
if __name__ == '__main__':
EXAMPLE_INSTANCE = {
'repo': 'django/django',
'instance_id': 'django__django-11099',
'base_commit': 'd26b2424437dabeeca94d7900b37d2df4410da0c',
'patch': "diff --git a/django/contrib/auth/validators.py b/django/contrib/auth/validators.py\n--- a/django/contrib/auth/validators.py\n+++ b/django/contrib/auth/validators.py\n@@ -7,7 +7,7 @@\n \n @deconstructible\n class ASCIIUsernameValidator(validators.RegexValidator):\n- regex = r'^[\\w.@+-]+$'\n+ regex = r'^[\\w.@+-]+\\Z'\n message = _(\n 'Enter a valid username. This value may contain only English letters, '\n 'numbers, and @/./+/-/_ characters.'\n@@ -17,7 +17,7 @@ class ASCIIUsernameValidator(validators.RegexValidator):\n \n @deconstructible\n class UnicodeUsernameValidator(validators.RegexValidator):\n- regex = r'^[\\w.@+-]+$'\n+ regex = r'^[\\w.@+-]+\\Z'\n message = _(\n 'Enter a valid username. This value may contain only letters, '\n 'numbers, and @/./+/-/_ characters.'\n",
'test_patch': "diff --git a/tests/auth_tests/test_validators.py b/tests/auth_tests/test_validators.py\n--- a/tests/auth_tests/test_validators.py\n+++ b/tests/auth_tests/test_validators.py\n@@ -237,7 +237,7 @@ def test_unicode_validator(self):\n invalid_usernames = [\n \"o'connell\", \"عبد ال\",\n \"zerowidth\\u200Bspace\", \"nonbreaking\\u00A0space\",\n- \"en\\u2013dash\",\n+ \"en\\u2013dash\", 'trailingnewline\\u000A',\n ]\n v = validators.UnicodeUsernameValidator()\n for valid in valid_usernames:\n@@ -250,7 +250,7 @@ def test_unicode_validator(self):\n \n def test_ascii_validator(self):\n valid_usernames = ['glenn', 'GLEnN', 'jean-marc']\n- invalid_usernames = [\"o'connell\", 'Éric', 'jean marc', \"أحمد\"]\n+ invalid_usernames = [\"o'connell\", 'Éric', 'jean marc', \"أحمد\", 'trailingnewline\\n']\n v = validators.ASCIIUsernameValidator()\n for valid in valid_usernames:\n with self.subTest(valid=valid):\n",
'problem_statement': "UsernameValidator allows trailing newline in usernames\nDescription\n\t\nASCIIUsernameValidator and UnicodeUsernameValidator use the regex \nr'^[\\w.@+-]+$'\nThe intent is to only allow alphanumeric characters as well as ., @, +, and -. However, a little known quirk of Python regexes is that $ will also match a trailing newline. Therefore, the user name validators will accept usernames which end with a newline. You can avoid this behavior by instead using \\A and \\Z to terminate regexes. For example, the validator regex could be changed to\nr'\\A[\\w.@+-]+\\Z'\nin order to reject usernames that end with a newline.\nI am not sure how to officially post a patch, but the required change is trivial - using the regex above in the two validators in contrib.auth.validators.\n",
'hints_text': '',
'created_at': '2019-03-20T03:46:18Z',
'version': '3.0',
'FAIL_TO_PASS': '["test_ascii_validator (auth_tests.test_validators.UsernameValidatorsTests)", "test_unicode_validator (auth_tests.test_validators.UsernameValidatorsTests)", "test_help_text (auth_tests.test_validators.UserAttributeSimilarityValidatorTest)"]',
'PASS_TO_PASS': '["test_help_text (auth_tests.test_validators.MinimumLengthValidatorTest)", "test_validate (auth_tests.test_validators.MinimumLengthValidatorTest)", "test_help_text (auth_tests.test_validators.NumericPasswordValidatorTest)", "test_validate (auth_tests.test_validators.NumericPasswordValidatorTest)", "test_validate (auth_tests.test_validators.UserAttributeSimilarityValidatorTest)", "test_validate_property (auth_tests.test_validators.UserAttributeSimilarityValidatorTest)", "test_empty_password_validator_help_text_html (auth_tests.test_validators.PasswordValidationTest)", "test_get_default_password_validators (auth_tests.test_validators.PasswordValidationTest)", "test_get_password_validators_custom (auth_tests.test_validators.PasswordValidationTest)", "test_password_changed (auth_tests.test_validators.PasswordValidationTest)", "test_password_changed_with_custom_validator (auth_tests.test_validators.PasswordValidationTest)", "test_password_validators_help_text_html (auth_tests.test_validators.PasswordValidationTest)", "test_password_validators_help_text_html_escaping (auth_tests.test_validators.PasswordValidationTest)", "test_password_validators_help_texts (auth_tests.test_validators.PasswordValidationTest)", "test_validate_password (auth_tests.test_validators.PasswordValidationTest)", "test_help_text (auth_tests.test_validators.CommonPasswordValidatorTest)", "test_validate (auth_tests.test_validators.CommonPasswordValidatorTest)", "test_validate_custom_list (auth_tests.test_validators.CommonPasswordValidatorTest)", "test_validate_django_supplied_file (auth_tests.test_validators.CommonPasswordValidatorTest)"]',
'environment_setup_commit': '419a78300f7cd27611196e1e464d50fd0385ff27',
}
sandbox = SWEBenchSSHBox.get_box_for_instance(instance=EXAMPLE_INSTANCE)
# in actual eval, this will be initialized by the controller
sandbox.init_plugins([JupyterRequirement(), SWEAgentCommandsRequirement()])
# PRE TEST
exit_code, output = sandbox.execute('cd $REPO_PATH')
assert exit_code == 0, 'Failed to cd $REPO_PATH'
logger.info(f'cd $REPO_PATH: {output}')
# apply test patch
exit_code, output = sandbox.execute('git apply $SWE_TASK_DIR/test.patch')
assert exit_code == 0, 'Failed to apply test patch'
logger.info(f'git apply $SWE_TASK_DIR/test.patch: {output}')
# TEST
exit_code, output = sandbox.execute(
'./tests/runtests.py --verbosity 2 auth_tests.test_validators'
)
assert exit_code == 1, 'Expected exit code 1 (since this is a FAIL_TO_PASS)'
logger.info(f'$TEST_CMD:\n{output}')
# apply gold patch
exit_code, output = sandbox.execute('git apply $SWE_TASK_DIR/gold.patch')
logger.info('exit code: %d', exit_code)
logger.info(f'git apply $SWE_TASK_DIR/gold.patch: {output}')
# TEST
exit_code, output = sandbox.execute(
'./tests/runtests.py --verbosity 2 auth_tests.test_validators'
)
assert exit_code == 0, 'Expected exit code 0 (since we applied the gold patch)'
logger.info(f'$TEST_CMD:\n{output}')
# Reset the repo
exit_code, output = sandbox.execute('git reset --hard')
assert exit_code == 0, 'Failed to reset the repo'
logger.info(f'git reset --hard: {output}')
bg_cmd = sandbox.execute_in_background(
"while true; do echo 'dot ' && sleep 10; done"
)
sys.stdout.flush()
try:
while True:
try:
user_input = input('>>> ')
except EOFError:
logger.info('Exiting...')
break
if user_input.lower() == 'exit':
logger.info('Exiting...')
break
if user_input.lower() == 'kill':
sandbox.kill_background(bg_cmd.pid)
logger.info('Background process killed')
continue
exit_code, output = sandbox.execute(user_input)
logger.info('exit code: %d', exit_code)
logger.info(output)
if bg_cmd.pid in sandbox.background_commands:
logs = sandbox.read_logs(bg_cmd.pid)
logger.info('background logs: %s', logs)
sys.stdout.flush()
except KeyboardInterrupt:
logger.info('Exiting...')
sandbox.close()

View File

@ -370,6 +370,30 @@ def get_parser():
type=int,
help='The maximum number of characters to send to and receive from LLM per task',
)
parser.add_argument(
'--eval-output-dir',
default='evaluation/evaluation_outputs/outputs',
type=str,
help='The directory to save evaluation output',
)
parser.add_argument(
'--eval-n-limit',
default=None,
type=int,
help='The number of instances to evaluate',
)
parser.add_argument(
'--eval-num-workers',
default=4,
type=int,
help='The number of workers to use for evaluation',
)
parser.add_argument(
'--eval-note',
default=None,
type=str,
help='The note to add to the evaluation directory',
)
parser.add_argument(
'-l',
'--llm-config',

View File

@ -81,11 +81,10 @@ def get_console_handler():
return console_handler
def get_file_handler():
def get_file_handler(log_dir=os.path.join(os.getcwd(), 'logs')):
"""
Returns a file handler for logging.
"""
log_dir = os.path.join(os.getcwd(), 'logs')
os.makedirs(log_dir, exist_ok=True)
timestamp = datetime.now().strftime('%Y-%m-%d')
file_name = f'opendevin_{timestamp}.log'

View File

@ -7,12 +7,14 @@ from opendevin.controller import AgentController
from opendevin.controller.agent import Agent
from opendevin.controller.state.state import State
from opendevin.core.config import args, get_llm_config_arg
from opendevin.core.logger import opendevin_logger as logger
from opendevin.core.schema import AgentState
from opendevin.events.action import ChangeAgentStateAction, MessageAction
from opendevin.events.event import Event
from opendevin.events.observation import AgentStateChangedObservation
from opendevin.events.stream import EventSource, EventStream, EventStreamSubscriber
from opendevin.llm.llm import LLM
from opendevin.runtime.sandbox import Sandbox
from opendevin.runtime.server.runtime import ServerRuntime
@ -31,6 +33,7 @@ async def main(
task_str: str = '',
exit_on_message: bool = False,
fake_user_response_fn: Optional[Callable[[Optional[State]], str]] = None,
sandbox: Optional[Sandbox] = None,
) -> Optional[State]:
"""Main coroutine to run the agent controller with task input flexibility.
It's only used when you launch opendevin backend directly via cmdline.
@ -39,6 +42,7 @@ async def main(
task_str: The task to run.
exit_on_message: quit if agent asks for a message from user (optional)
fake_user_response_fn: An optional function that receives the current state (could be None) and returns a fake user response.
sandbox: An optional sandbox to run the agent in.
"""
# Determine the task source
@ -62,7 +66,7 @@ async def main(
if llm_config is None:
raise ValueError(f'Invalid toml file, cannot read {args.llm_config}')
print(
logger.info(
f'Running agent {args.agent_cls} (model: {llm_config.model}, llm_config: {llm_config}) with task: "{task}"'
)
@ -70,7 +74,7 @@ async def main(
llm = LLM(llm_config=llm_config)
else:
# --model-name model_name
print(
logger.info(
f'Running agent {args.agent_cls} (model: {args.model_name}), with task: "{task}"'
)
llm = LLM(args.model_name)
@ -85,7 +89,7 @@ async def main(
max_chars=args.max_chars,
event_stream=event_stream,
)
runtime = ServerRuntime(event_stream=event_stream)
runtime = ServerRuntime(event_stream=event_stream, sandbox=sandbox)
runtime.init_sandbox_plugins(controller.agent.sandbox_plugins)
await event_stream.add_event(MessageAction(content=task), EventSource.USER)

View File

@ -24,6 +24,9 @@ class CmdOutputObservation(Observation):
def message(self) -> str:
return f'Command `{self.command}` executed with exit code {self.exit_code}.'
def __str__(self) -> str:
return f'**CmdOutputObservation (exit code={self.exit_code})**\n{self.content}'
@dataclass
class IPythonRunCellObservation(Observation):
@ -41,3 +44,6 @@ class IPythonRunCellObservation(Observation):
@property
def message(self) -> str:
return 'Coded executed in IPython cell.'
def __str__(self) -> str:
return f'**IPythonRunCellObservation**\n{self.content}'

View File

@ -1,9 +1,8 @@
import warnings
from functools import partial
import warnings
with warnings.catch_warnings():
warnings.simplefilter("ignore")
warnings.simplefilter('ignore')
import litellm
from litellm import completion as litellm_completion
from litellm import completion_cost as litellm_completion_cost

View File

@ -2,6 +2,7 @@
set -e
# ADD /opendevin/plugins to PATH to make `jupyter_cli` available
echo 'export PATH=$PATH:/opendevin/plugins/jupyter' >> ~/.bashrc
export PATH=/opendevin/plugins/jupyter:$PATH
@ -10,11 +11,14 @@ export PATH=/opendevin/plugins/jupyter:$PATH
if [ "$USER" = "opendevin" ]; then
echo 'export PATH=$PATH:/home/opendevin/.local/bin' >> ~/.bashrc
export PATH=$PATH:/home/opendevin/.local/bin
export PIP_CACHE_DIR=$HOME/.cache/pip
fi
# if user name is `root`, add '/root/.local/bin' to PATH
if [ "$USER" = "root" ]; then
echo 'export PATH=$PATH:/root/.local/bin' >> ~/.bashrc
export PATH=$PATH:/root/.local/bin
export PIP_CACHE_DIR=$HOME/.cache/pip
fi
# Install dependencies
@ -57,11 +61,13 @@ echo "export JUPYTER_EXEC_SERVER_PID=$JUPYTER_EXEC_SERVER_PID" >> ~/.bashrc
echo "Execution server started with PID: $JUPYTER_EXEC_SERVER_PID"
# Wait until /opendevin/logs/jupyter_kernel_gateway.log contains "is available"
while ! grep -q "is available" /opendevin/logs/jupyter_kernel_gateway.log; do
while ! grep -q "at" /opendevin/logs/jupyter_kernel_gateway.log; do
echo "Waiting for Jupyter kernel gateway to be available..."
sleep 1
done
# Wait until /opendevin/logs/jupyter_execute_server.log contains "Jupyter kernel created for conversation"
while ! grep -q "Jupyter kernel created for conversation" /opendevin/logs/jupyter_execute_server.log; do
while ! grep -q "kernel created" /opendevin/logs/jupyter_execute_server.log; do
echo "Waiting for Jupyter kernel to be created..."
sleep 1
done
echo "Jupyter kernel ready."

View File

@ -39,7 +39,7 @@ class PluginMixin:
exit_code, output = self.execute(abs_path_to_bash_script)
if exit_code != 0:
raise RuntimeError(
f'Failed to initialize plugin {requirement.name} with exit code {exit_code} and output {output}'
f'Failed to initialize plugin {requirement.name} with exit code {exit_code} and output: {output}'
)
logger.info(f'Plugin {requirement.name} initialized successfully.')
@ -47,6 +47,6 @@ class PluginMixin:
exit_code, output = self.execute('source ~/.bashrc')
if exit_code != 0:
raise RuntimeError(
f'Failed to source ~/.bashrc with exit code {exit_code} and output {output}'
f'Failed to source ~/.bashrc with exit code {exit_code} and output: {output}'
)
logger.info('Sourced ~/.bashrc successfully')

View File

@ -1,5 +1,6 @@
#!/bin/bash
export PIP_CACHE_DIR=$HOME/.cache/pip
pip install flake8
# Cursor Mode from SWE-Bench

View File

@ -1,5 +1,6 @@
#!/bin/bash
export PIP_CACHE_DIR=$HOME/.cache/pip
pip install flake8
# Default Mode from SWE-Bench

poetry.lock generated
View File

@ -132,6 +132,30 @@ files = [
[package.dependencies]
frozenlist = ">=1.1.0"
[[package]]
name = "altair"
version = "5.3.0"
description = "Vega-Altair: A declarative statistical visualization library for Python."
optional = false
python-versions = ">=3.8"
files = [
{file = "altair-5.3.0-py3-none-any.whl", hash = "sha256:7084a1dab4d83c5e7e5246b92dc1b4451a6c68fd057f3716ee9d315c8980e59a"},
{file = "altair-5.3.0.tar.gz", hash = "sha256:5a268b1a0983b23d8f9129f819f956174aa7aea2719ed55a52eba9979b9f6675"},
]
[package.dependencies]
jinja2 = "*"
jsonschema = ">=3.0"
numpy = "*"
packaging = "*"
pandas = ">=0.25"
toolz = "*"
[package.extras]
all = ["altair-tiles (>=0.3.0)", "anywidget (>=0.9.0)", "pyarrow (>=11)", "vega-datasets (>=0.9.0)", "vegafusion[embed] (>=1.6.6)", "vl-convert-python (>=1.3.0)"]
dev = ["geopandas", "hatch", "ipython", "m2r", "mypy", "pandas-stubs", "pytest", "pytest-cov", "ruff (>=0.3.0)", "types-jsonschema", "types-setuptools"]
doc = ["docutils", "jinja2", "myst-parser", "numpydoc", "pillow (>=9,<10)", "pydata-sphinx-theme (>=0.14.1)", "scipy", "sphinx", "sphinx-copybutton", "sphinx-design", "sphinxext-altair"]
[[package]]
name = "annotated-types"
version = "0.6.0"
@ -1659,6 +1683,38 @@ monitor = ["psutil (>=5.7.0)"]
recommended = ["cffi (>=1.12.2)", "dnspython (>=1.16.0,<2.0)", "idna", "psutil (>=5.7.0)"]
test = ["cffi (>=1.12.2)", "coverage (>=5.0)", "dnspython (>=1.16.0,<2.0)", "idna", "objgraph", "psutil (>=5.7.0)", "requests"]
[[package]]
name = "gitdb"
version = "4.0.11"
description = "Git Object Database"
optional = false
python-versions = ">=3.7"
files = [
{file = "gitdb-4.0.11-py3-none-any.whl", hash = "sha256:81a3407ddd2ee8df444cbacea00e2d038e40150acfa3001696fe0dcf1d3adfa4"},
{file = "gitdb-4.0.11.tar.gz", hash = "sha256:bf5421126136d6d0af55bc1e7c1af1c397a34f5b7bd79e776cd3e89785c2b04b"},
]
[package.dependencies]
smmap = ">=3.0.1,<6"
[[package]]
name = "gitpython"
version = "3.1.43"
description = "GitPython is a Python library used to interact with Git repositories"
optional = false
python-versions = ">=3.7"
files = [
{file = "GitPython-3.1.43-py3-none-any.whl", hash = "sha256:eec7ec56b92aad751f9912a73404bc02ba212a23adb2c7098ee668417051a1ff"},
{file = "GitPython-3.1.43.tar.gz", hash = "sha256:35f314a9f878467f5453cc1fee295c3e18e52f1b99f10f6cf5b1682e968a9e7c"},
]
[package.dependencies]
gitdb = ">=4.0.1,<5"
[package.extras]
doc = ["sphinx (==4.3.2)", "sphinx-autodoc-typehints", "sphinx-rtd-theme", "sphinxcontrib-applehelp (>=1.0.2,<=1.0.4)", "sphinxcontrib-devhelp (==1.0.2)", "sphinxcontrib-htmlhelp (>=2.0.0,<=2.0.1)", "sphinxcontrib-qthelp (==1.0.3)", "sphinxcontrib-serializinghtml (==1.1.5)"]
test = ["coverage[toml]", "ddt (>=1.1.1,!=1.4.3)", "mock", "mypy", "pre-commit", "pytest (>=7.3.1)", "pytest-cov", "pytest-instafail", "pytest-mock", "pytest-sugar", "typing-extensions"]
[[package]]
name = "google-ai-generativelanguage"
version = "0.6.3"
@ -2298,6 +2354,41 @@ files = [
[package.extras]
qa = ["pytest", "pytest-cov", "tox"]
[[package]]
name = "jsonschema"
version = "4.22.0"
description = "An implementation of JSON Schema validation for Python"
optional = false
python-versions = ">=3.8"
files = [
{file = "jsonschema-4.22.0-py3-none-any.whl", hash = "sha256:ff4cfd6b1367a40e7bc6411caec72effadd3db0bbe5017de188f2d6108335802"},
{file = "jsonschema-4.22.0.tar.gz", hash = "sha256:5b22d434a45935119af990552c862e5d6d564e8f6601206b305a61fdf661a2b7"},
]
[package.dependencies]
attrs = ">=22.2.0"
jsonschema-specifications = ">=2023.03.6"
referencing = ">=0.28.4"
rpds-py = ">=0.7.1"
[package.extras]
format = ["fqdn", "idna", "isoduration", "jsonpointer (>1.13)", "rfc3339-validator", "rfc3987", "uri-template", "webcolors (>=1.11)"]
format-nongpl = ["fqdn", "idna", "isoduration", "jsonpointer (>1.13)", "rfc3339-validator", "rfc3986-validator (>0.1.0)", "uri-template", "webcolors (>=1.11)"]
[[package]]
name = "jsonschema-specifications"
version = "2023.12.1"
description = "The JSON Schema meta-schemas and vocabularies, exposed as a Registry"
optional = false
python-versions = ">=3.8"
files = [
{file = "jsonschema_specifications-2023.12.1-py3-none-any.whl", hash = "sha256:87e4fdf3a94858b8a2ba2778d9ba57d8a9cafca7c7489c46ba0d30a8bc6a9c3c"},
{file = "jsonschema_specifications-2023.12.1.tar.gz", hash = "sha256:48a76787b3e70f5ed53f1160d2b81f586e4ca6d1548c5de7085d1682674764cc"},
]
[package.dependencies]
referencing = ">=0.31.0"
[[package]]
name = "kiwisolver"
version = "1.4.5"
@ -4710,6 +4801,25 @@ files = [
[package.dependencies]
typing-extensions = ">=4.6.0,<4.7.0 || >4.7.0"
[[package]]
name = "pydeck"
version = "0.9.1"
description = "Widget for deck.gl maps"
optional = false
python-versions = ">=3.8"
files = [
{file = "pydeck-0.9.1-py2.py3-none-any.whl", hash = "sha256:b3f75ba0d273fc917094fa61224f3f6076ca8752b93d46faf3bcfd9f9d59b038"},
{file = "pydeck-0.9.1.tar.gz", hash = "sha256:f74475ae637951d63f2ee58326757f8d4f9cd9f2a457cf42950715003e2cb605"},
]
[package.dependencies]
jinja2 = ">=2.10.1"
numpy = ">=1.16.4"
[package.extras]
carto = ["pydeck-carto"]
jupyter = ["ipykernel (>=5.1.2)", "ipython (>=5.8.0)", "ipywidgets (>=7,<8)", "traitlets (>=4.3.2)"]
[[package]]
name = "pyee"
version = "11.0.1"
@ -5017,6 +5127,21 @@ files = [
{file = "PyYAML-6.0.1.tar.gz", hash = "sha256:bfdf460b1736c775f2ba9f6a92bca30bc2095067b8a9d77876d1fad6cc3b4a43"},
]
[[package]]
name = "referencing"
version = "0.35.1"
description = "JSON Referencing + Python"
optional = false
python-versions = ">=3.8"
files = [
{file = "referencing-0.35.1-py3-none-any.whl", hash = "sha256:eda6d3234d62814d1c64e305c1331c9a3a6132da475ab6382eaa997b21ee75de"},
{file = "referencing-0.35.1.tar.gz", hash = "sha256:25b42124a6c8b632a425174f24087783efb348a6f1e0008e63cd4466fedf703c"},
]
[package.dependencies]
attrs = ">=22.2.0"
rpds-py = ">=0.7.0"
[[package]]
name = "regex"
version = "2024.5.10"
@ -5162,6 +5287,114 @@ pygments = ">=2.13.0,<3.0.0"
[package.extras]
jupyter = ["ipywidgets (>=7.5.1,<9)"]
[[package]]
name = "rpds-py"
version = "0.18.1"
description = "Python bindings to Rust's persistent data structures (rpds)"
optional = false
python-versions = ">=3.8"
files = [
{file = "rpds_py-0.18.1-cp310-cp310-macosx_10_12_x86_64.whl", hash = "sha256:d31dea506d718693b6b2cffc0648a8929bdc51c70a311b2770f09611caa10d53"},
{file = "rpds_py-0.18.1-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:732672fbc449bab754e0b15356c077cc31566df874964d4801ab14f71951ea80"},
{file = "rpds_py-0.18.1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:4a98a1f0552b5f227a3d6422dbd61bc6f30db170939bd87ed14f3c339aa6c7c9"},
{file = "rpds_py-0.18.1-cp310-cp310-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:7f1944ce16401aad1e3f7d312247b3d5de7981f634dc9dfe90da72b87d37887d"},
{file = "rpds_py-0.18.1-cp310-cp310-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:38e14fb4e370885c4ecd734f093a2225ee52dc384b86fa55fe3f74638b2cfb09"},
{file = "rpds_py-0.18.1-cp310-cp310-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:08d74b184f9ab6289b87b19fe6a6d1a97fbfea84b8a3e745e87a5de3029bf944"},
{file = "rpds_py-0.18.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:d70129cef4a8d979caa37e7fe957202e7eee8ea02c5e16455bc9808a59c6b2f0"},
{file = "rpds_py-0.18.1-cp310-cp310-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:ce0bb20e3a11bd04461324a6a798af34d503f8d6f1aa3d2aa8901ceaf039176d"},
{file = "rpds_py-0.18.1-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:81c5196a790032e0fc2464c0b4ab95f8610f96f1f2fa3d4deacce6a79852da60"},
{file = "rpds_py-0.18.1-cp310-cp310-musllinux_1_2_i686.whl", hash = "sha256:f3027be483868c99b4985fda802a57a67fdf30c5d9a50338d9db646d590198da"},
{file = "rpds_py-0.18.1-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:d44607f98caa2961bab4fa3c4309724b185b464cdc3ba6f3d7340bac3ec97cc1"},
{file = "rpds_py-0.18.1-cp310-none-win32.whl", hash = "sha256:c273e795e7a0f1fddd46e1e3cb8be15634c29ae8ff31c196debb620e1edb9333"},
{file = "rpds_py-0.18.1-cp310-none-win_amd64.whl", hash = "sha256:8352f48d511de5f973e4f2f9412736d7dea76c69faa6d36bcf885b50c758ab9a"},
{file = "rpds_py-0.18.1-cp311-cp311-macosx_10_12_x86_64.whl", hash = "sha256:6b5ff7e1d63a8281654b5e2896d7f08799378e594f09cf3674e832ecaf396ce8"},
{file = "rpds_py-0.18.1-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:8927638a4d4137a289e41d0fd631551e89fa346d6dbcfc31ad627557d03ceb6d"},
{file = "rpds_py-0.18.1-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:154bf5c93d79558b44e5b50cc354aa0459e518e83677791e6adb0b039b7aa6a7"},
{file = "rpds_py-0.18.1-cp311-cp311-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:07f2139741e5deb2c5154a7b9629bc5aa48c766b643c1a6750d16f865a82c5fc"},
{file = "rpds_py-0.18.1-cp311-cp311-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:8c7672e9fba7425f79019db9945b16e308ed8bc89348c23d955c8c0540da0a07"},
{file = "rpds_py-0.18.1-cp311-cp311-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:489bdfe1abd0406eba6b3bb4fdc87c7fa40f1031de073d0cfb744634cc8fa261"},
{file = "rpds_py-0.18.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:3c20f05e8e3d4fc76875fc9cb8cf24b90a63f5a1b4c5b9273f0e8225e169b100"},
{file = "rpds_py-0.18.1-cp311-cp311-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:967342e045564cef76dfcf1edb700b1e20838d83b1aa02ab313e6a497cf923b8"},
{file = "rpds_py-0.18.1-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:2cc7c1a47f3a63282ab0f422d90ddac4aa3034e39fc66a559ab93041e6505da7"},
{file = "rpds_py-0.18.1-cp311-cp311-musllinux_1_2_i686.whl", hash = "sha256:f7afbfee1157e0f9376c00bb232e80a60e59ed716e3211a80cb8506550671e6e"},
{file = "rpds_py-0.18.1-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:9e6934d70dc50f9f8ea47081ceafdec09245fd9f6032669c3b45705dea096b88"},
{file = "rpds_py-0.18.1-cp311-none-win32.whl", hash = "sha256:c69882964516dc143083d3795cb508e806b09fc3800fd0d4cddc1df6c36e76bb"},
{file = "rpds_py-0.18.1-cp311-none-win_amd64.whl", hash = "sha256:70a838f7754483bcdc830444952fd89645569e7452e3226de4a613a4c1793fb2"},
{file = "rpds_py-0.18.1-cp312-cp312-macosx_10_12_x86_64.whl", hash = "sha256:3dd3cd86e1db5aadd334e011eba4e29d37a104b403e8ca24dcd6703c68ca55b3"},
{file = "rpds_py-0.18.1-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:05f3d615099bd9b13ecf2fc9cf2d839ad3f20239c678f461c753e93755d629ee"},
{file = "rpds_py-0.18.1-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:35b2b771b13eee8729a5049c976197ff58a27a3829c018a04341bcf1ae409b2b"},
{file = "rpds_py-0.18.1-cp312-cp312-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:ee17cd26b97d537af8f33635ef38be873073d516fd425e80559f4585a7b90c43"},
{file = "rpds_py-0.18.1-cp312-cp312-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:b646bf655b135ccf4522ed43d6902af37d3f5dbcf0da66c769a2b3938b9d8184"},
{file = "rpds_py-0.18.1-cp312-cp312-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:19ba472b9606c36716062c023afa2484d1e4220548751bda14f725a7de17b4f6"},
{file = "rpds_py-0.18.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:6e30ac5e329098903262dc5bdd7e2086e0256aa762cc8b744f9e7bf2a427d3f8"},
{file = "rpds_py-0.18.1-cp312-cp312-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:d58ad6317d188c43750cb76e9deacf6051d0f884d87dc6518e0280438648a9ac"},
{file = "rpds_py-0.18.1-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:e1735502458621921cee039c47318cb90b51d532c2766593be6207eec53e5c4c"},
{file = "rpds_py-0.18.1-cp312-cp312-musllinux_1_2_i686.whl", hash = "sha256:f5bab211605d91db0e2995a17b5c6ee5edec1270e46223e513eaa20da20076ac"},
{file = "rpds_py-0.18.1-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:2fc24a329a717f9e2448f8cd1f960f9dac4e45b6224d60734edeb67499bab03a"},
{file = "rpds_py-0.18.1-cp312-none-win32.whl", hash = "sha256:1805d5901779662d599d0e2e4159d8a82c0b05faa86ef9222bf974572286b2b6"},
{file = "rpds_py-0.18.1-cp312-none-win_amd64.whl", hash = "sha256:720edcb916df872d80f80a1cc5ea9058300b97721efda8651efcd938a9c70a72"},
{file = "rpds_py-0.18.1-cp38-cp38-macosx_10_12_x86_64.whl", hash = "sha256:c827576e2fa017a081346dce87d532a5310241648eb3700af9a571a6e9fc7e74"},
{file = "rpds_py-0.18.1-cp38-cp38-macosx_11_0_arm64.whl", hash = "sha256:aa3679e751408d75a0b4d8d26d6647b6d9326f5e35c00a7ccd82b78ef64f65f8"},
{file = "rpds_py-0.18.1-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:0abeee75434e2ee2d142d650d1e54ac1f8b01e6e6abdde8ffd6eeac6e9c38e20"},
{file = "rpds_py-0.18.1-cp38-cp38-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:ed402d6153c5d519a0faf1bb69898e97fb31613b49da27a84a13935ea9164dfc"},
{file = "rpds_py-0.18.1-cp38-cp38-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:338dee44b0cef8b70fd2ef54b4e09bb1b97fc6c3a58fea5db6cc083fd9fc2724"},
{file = "rpds_py-0.18.1-cp38-cp38-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:7750569d9526199c5b97e5a9f8d96a13300950d910cf04a861d96f4273d5b104"},
{file = "rpds_py-0.18.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:607345bd5912aacc0c5a63d45a1f73fef29e697884f7e861094e443187c02be5"},
{file = "rpds_py-0.18.1-cp38-cp38-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:207c82978115baa1fd8d706d720b4a4d2b0913df1c78c85ba73fe6c5804505f0"},
{file = "rpds_py-0.18.1-cp38-cp38-musllinux_1_2_aarch64.whl", hash = "sha256:6d1e42d2735d437e7e80bab4d78eb2e459af48c0a46e686ea35f690b93db792d"},
{file = "rpds_py-0.18.1-cp38-cp38-musllinux_1_2_i686.whl", hash = "sha256:5463c47c08630007dc0fe99fb480ea4f34a89712410592380425a9b4e1611d8e"},
{file = "rpds_py-0.18.1-cp38-cp38-musllinux_1_2_x86_64.whl", hash = "sha256:06d218939e1bf2ca50e6b0ec700ffe755e5216a8230ab3e87c059ebb4ea06afc"},
{file = "rpds_py-0.18.1-cp38-none-win32.whl", hash = "sha256:312fe69b4fe1ffbe76520a7676b1e5ac06ddf7826d764cc10265c3b53f96dbe9"},
{file = "rpds_py-0.18.1-cp38-none-win_amd64.whl", hash = "sha256:9437ca26784120a279f3137ee080b0e717012c42921eb07861b412340f85bae2"},
{file = "rpds_py-0.18.1-cp39-cp39-macosx_10_12_x86_64.whl", hash = "sha256:19e515b78c3fc1039dd7da0a33c28c3154458f947f4dc198d3c72db2b6b5dc93"},
{file = "rpds_py-0.18.1-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:a7b28c5b066bca9a4eb4e2f2663012debe680f097979d880657f00e1c30875a0"},
{file = "rpds_py-0.18.1-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:673fdbbf668dd958eff750e500495ef3f611e2ecc209464f661bc82e9838991e"},
{file = "rpds_py-0.18.1-cp39-cp39-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:d960de62227635d2e61068f42a6cb6aae91a7fe00fca0e3aeed17667c8a34611"},
{file = "rpds_py-0.18.1-cp39-cp39-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:352a88dc7892f1da66b6027af06a2e7e5d53fe05924cc2cfc56495b586a10b72"},
{file = "rpds_py-0.18.1-cp39-cp39-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:4e0ee01ad8260184db21468a6e1c37afa0529acc12c3a697ee498d3c2c4dcaf3"},
{file = "rpds_py-0.18.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:e4c39ad2f512b4041343ea3c7894339e4ca7839ac38ca83d68a832fc8b3748ab"},
{file = "rpds_py-0.18.1-cp39-cp39-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:aaa71ee43a703c321906813bb252f69524f02aa05bf4eec85f0c41d5d62d0f4c"},
{file = "rpds_py-0.18.1-cp39-cp39-musllinux_1_2_aarch64.whl", hash = "sha256:6cd8098517c64a85e790657e7b1e509b9fe07487fd358e19431cb120f7d96338"},
{file = "rpds_py-0.18.1-cp39-cp39-musllinux_1_2_i686.whl", hash = "sha256:4adec039b8e2928983f885c53b7cc4cda8965b62b6596501a0308d2703f8af1b"},
{file = "rpds_py-0.18.1-cp39-cp39-musllinux_1_2_x86_64.whl", hash = "sha256:32b7daaa3e9389db3695964ce8e566e3413b0c43e3394c05e4b243a4cd7bef26"},
{file = "rpds_py-0.18.1-cp39-none-win32.whl", hash = "sha256:2625f03b105328729f9450c8badda34d5243231eef6535f80064d57035738360"},
{file = "rpds_py-0.18.1-cp39-none-win_amd64.whl", hash = "sha256:bf18932d0003c8c4d51a39f244231986ab23ee057d235a12b2684ea26a353590"},
{file = "rpds_py-0.18.1-pp310-pypy310_pp73-macosx_10_12_x86_64.whl", hash = "sha256:cbfbea39ba64f5e53ae2915de36f130588bba71245b418060ec3330ebf85678e"},
{file = "rpds_py-0.18.1-pp310-pypy310_pp73-macosx_11_0_arm64.whl", hash = "sha256:a3d456ff2a6a4d2adcdf3c1c960a36f4fd2fec6e3b4902a42a384d17cf4e7a65"},
{file = "rpds_py-0.18.1-pp310-pypy310_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:7700936ef9d006b7ef605dc53aa364da2de5a3aa65516a1f3ce73bf82ecfc7ae"},
{file = "rpds_py-0.18.1-pp310-pypy310_pp73-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:51584acc5916212e1bf45edd17f3a6b05fe0cbb40482d25e619f824dccb679de"},
{file = "rpds_py-0.18.1-pp310-pypy310_pp73-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:942695a206a58d2575033ff1e42b12b2aece98d6003c6bc739fbf33d1773b12f"},
{file = "rpds_py-0.18.1-pp310-pypy310_pp73-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:b906b5f58892813e5ba5c6056d6a5ad08f358ba49f046d910ad992196ea61397"},
{file = "rpds_py-0.18.1-pp310-pypy310_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:f6f8e3fecca256fefc91bb6765a693d96692459d7d4c644660a9fff32e517843"},
{file = "rpds_py-0.18.1-pp310-pypy310_pp73-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:7732770412bab81c5a9f6d20aeb60ae943a9b36dcd990d876a773526468e7163"},
{file = "rpds_py-0.18.1-pp310-pypy310_pp73-musllinux_1_2_aarch64.whl", hash = "sha256:bd1105b50ede37461c1d51b9698c4f4be6e13e69a908ab7751e3807985fc0346"},
{file = "rpds_py-0.18.1-pp310-pypy310_pp73-musllinux_1_2_i686.whl", hash = "sha256:618916f5535784960f3ecf8111581f4ad31d347c3de66d02e728de460a46303c"},
{file = "rpds_py-0.18.1-pp310-pypy310_pp73-musllinux_1_2_x86_64.whl", hash = "sha256:17c6d2155e2423f7e79e3bb18151c686d40db42d8645e7977442170c360194d4"},
{file = "rpds_py-0.18.1-pp38-pypy38_pp73-macosx_10_12_x86_64.whl", hash = "sha256:6c4c4c3f878df21faf5fac86eda32671c27889e13570645a9eea0a1abdd50922"},
{file = "rpds_py-0.18.1-pp38-pypy38_pp73-macosx_11_0_arm64.whl", hash = "sha256:fab6ce90574645a0d6c58890e9bcaac8d94dff54fb51c69e5522a7358b80ab64"},
{file = "rpds_py-0.18.1-pp38-pypy38_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:531796fb842b53f2695e94dc338929e9f9dbf473b64710c28af5a160b2a8927d"},
{file = "rpds_py-0.18.1-pp38-pypy38_pp73-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:740884bc62a5e2bbb31e584f5d23b32320fd75d79f916f15a788d527a5e83644"},
{file = "rpds_py-0.18.1-pp38-pypy38_pp73-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:998125738de0158f088aef3cb264a34251908dd2e5d9966774fdab7402edfab7"},
{file = "rpds_py-0.18.1-pp38-pypy38_pp73-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:e2be6e9dd4111d5b31ba3b74d17da54a8319d8168890fbaea4b9e5c3de630ae5"},
{file = "rpds_py-0.18.1-pp38-pypy38_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:d0cee71bc618cd93716f3c1bf56653740d2d13ddbd47673efa8bf41435a60daa"},
{file = "rpds_py-0.18.1-pp38-pypy38_pp73-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:2c3caec4ec5cd1d18e5dd6ae5194d24ed12785212a90b37f5f7f06b8bedd7139"},
{file = "rpds_py-0.18.1-pp38-pypy38_pp73-musllinux_1_2_aarch64.whl", hash = "sha256:27bba383e8c5231cd559affe169ca0b96ec78d39909ffd817f28b166d7ddd4d8"},
{file = "rpds_py-0.18.1-pp38-pypy38_pp73-musllinux_1_2_i686.whl", hash = "sha256:a888e8bdb45916234b99da2d859566f1e8a1d2275a801bb8e4a9644e3c7e7909"},
{file = "rpds_py-0.18.1-pp38-pypy38_pp73-musllinux_1_2_x86_64.whl", hash = "sha256:6031b25fb1b06327b43d841f33842b383beba399884f8228a6bb3df3088485ff"},
{file = "rpds_py-0.18.1-pp39-pypy39_pp73-macosx_10_12_x86_64.whl", hash = "sha256:48c2faaa8adfacefcbfdb5f2e2e7bdad081e5ace8d182e5f4ade971f128e6bb3"},
{file = "rpds_py-0.18.1-pp39-pypy39_pp73-macosx_11_0_arm64.whl", hash = "sha256:d85164315bd68c0806768dc6bb0429c6f95c354f87485ee3593c4f6b14def2bd"},
{file = "rpds_py-0.18.1-pp39-pypy39_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:6afd80f6c79893cfc0574956f78a0add8c76e3696f2d6a15bca2c66c415cf2d4"},
{file = "rpds_py-0.18.1-pp39-pypy39_pp73-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:fa242ac1ff583e4ec7771141606aafc92b361cd90a05c30d93e343a0c2d82a89"},
{file = "rpds_py-0.18.1-pp39-pypy39_pp73-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:d21be4770ff4e08698e1e8e0bce06edb6ea0626e7c8f560bc08222880aca6a6f"},
{file = "rpds_py-0.18.1-pp39-pypy39_pp73-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:5c45a639e93a0c5d4b788b2613bd637468edd62f8f95ebc6fcc303d58ab3f0a8"},
{file = "rpds_py-0.18.1-pp39-pypy39_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:910e71711d1055b2768181efa0a17537b2622afeb0424116619817007f8a2b10"},
{file = "rpds_py-0.18.1-pp39-pypy39_pp73-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:b9bb1f182a97880f6078283b3505a707057c42bf55d8fca604f70dedfdc0772a"},
{file = "rpds_py-0.18.1-pp39-pypy39_pp73-musllinux_1_2_aarch64.whl", hash = "sha256:1d54f74f40b1f7aaa595a02ff42ef38ca654b1469bef7d52867da474243cc633"},
{file = "rpds_py-0.18.1-pp39-pypy39_pp73-musllinux_1_2_i686.whl", hash = "sha256:8d2e182c9ee01135e11e9676e9a62dfad791a7a467738f06726872374a83db49"},
{file = "rpds_py-0.18.1-pp39-pypy39_pp73-musllinux_1_2_x86_64.whl", hash = "sha256:636a15acc588f70fda1661234761f9ed9ad79ebed3f2125d44be0862708b666e"},
{file = "rpds_py-0.18.1.tar.gz", hash = "sha256:dc48b479d540770c811fbd1eb9ba2bb66951863e448efec2e2c102625328e92f"},
]
[[package]]
name = "rsa"
version = "4.9"
@@ -5508,6 +5741,17 @@ files = [
{file = "six-1.16.0.tar.gz", hash = "sha256:1e61c37477a1626458e36f7b1d82aa5c9b094fa4802892072e49de9c60c4c926"},
]
[[package]]
name = "smmap"
version = "5.0.1"
description = "A pure Python implementation of a sliding window memory map manager"
optional = false
python-versions = ">=3.7"
files = [
{file = "smmap-5.0.1-py3-none-any.whl", hash = "sha256:e6d8668fa5f93e706934a62d7b4db19c8d9eb8cf2adbb75ef1b675aa332b69da"},
{file = "smmap-5.0.1.tar.gz", hash = "sha256:dceeb6c0028fdb6734471eb07c0cd2aae706ccaecab45965ee83f11c8d3b1f62"},
]
[[package]]
name = "sniffio"
version = "1.3.1"
@@ -5634,6 +5878,41 @@ anyio = ">=3.4.0,<5"
[package.extras]
full = ["httpx (>=0.22.0)", "itsdangerous", "jinja2", "python-multipart (>=0.0.7)", "pyyaml"]
[[package]]
name = "streamlit"
version = "1.34.0"
description = "A faster way to build and share data apps"
optional = false
python-versions = "!=3.9.7,>=3.8"
files = [
{file = "streamlit-1.34.0-py2.py3-none-any.whl", hash = "sha256:411183cf7f525e468eb256b343c914782d11cd894b5e30a4ab3bb1d54e3ae339"},
{file = "streamlit-1.34.0.tar.gz", hash = "sha256:135a3b79a686b3132b73f204450ad6e889de04f3349d692925e09f0e21e74b52"},
]
[package.dependencies]
altair = ">=4.0,<6"
blinker = ">=1.0.0,<2"
cachetools = ">=4.0,<6"
click = ">=7.0,<9"
gitpython = ">=3.0.7,<3.1.19 || >3.1.19,<4"
numpy = ">=1.19.3,<2"
packaging = ">=16.8,<25"
pandas = ">=1.3.0,<3"
pillow = ">=7.1.0,<11"
protobuf = ">=3.20,<5"
pyarrow = ">=7.0"
pydeck = ">=0.8.0b4,<1"
requests = ">=2.27,<3"
rich = ">=10.14.0,<14"
tenacity = ">=8.1.0,<9"
toml = ">=0.10.1,<2"
tornado = ">=6.0.3,<7"
typing-extensions = ">=4.3.0,<5"
watchdog = {version = ">=2.1.5", markers = "platform_system != \"Darwin\""}
[package.extras]
snowflake = ["snowflake-connector-python (>=2.8.0)", "snowflake-snowpark-python (>=0.9.0)"]
[[package]]
name = "striprtf"
version = "0.0.26"
@@ -5895,6 +6174,17 @@ files = [
{file = "toml-0.10.2.tar.gz", hash = "sha256:b3bda1d108d5dd99f4a20d24d9c348e91c4db7ab1b749200bded2f839ccbe68f"},
]
[[package]]
name = "toolz"
version = "0.12.1"
description = "List processing tools and functional utilities"
optional = false
python-versions = ">=3.7"
files = [
{file = "toolz-0.12.1-py3-none-any.whl", hash = "sha256:d22731364c07d72eea0a0ad45bafb2c2937ab6fd38a3507bf55eae8744aa7d85"},
{file = "toolz-0.12.1.tar.gz", hash = "sha256:ecca342664893f177a13dac0e6b41cbd8ac25a358e5f215316d43e2100224f4d"},
]
[[package]]
name = "torch"
version = "2.2.2"
@@ -5953,6 +6243,26 @@ typing-extensions = ">=4.8.0"
opt-einsum = ["opt-einsum (>=3.3)"]
optree = ["optree (>=0.9.1)"]
[[package]]
name = "tornado"
version = "6.4"
description = "Tornado is a Python web framework and asynchronous networking library, originally developed at FriendFeed."
optional = false
python-versions = ">= 3.8"
files = [
{file = "tornado-6.4-cp38-abi3-macosx_10_9_universal2.whl", hash = "sha256:02ccefc7d8211e5a7f9e8bc3f9e5b0ad6262ba2fbb683a6443ecc804e5224ce0"},
{file = "tornado-6.4-cp38-abi3-macosx_10_9_x86_64.whl", hash = "sha256:27787de946a9cffd63ce5814c33f734c627a87072ec7eed71f7fc4417bb16263"},
{file = "tornado-6.4-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:f7894c581ecdcf91666a0912f18ce5e757213999e183ebfc2c3fdbf4d5bd764e"},
{file = "tornado-6.4-cp38-abi3-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:e43bc2e5370a6a8e413e1e1cd0c91bedc5bd62a74a532371042a18ef19e10579"},
{file = "tornado-6.4-cp38-abi3-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:f0251554cdd50b4b44362f73ad5ba7126fc5b2c2895cc62b14a1c2d7ea32f212"},
{file = "tornado-6.4-cp38-abi3-musllinux_1_1_aarch64.whl", hash = "sha256:fd03192e287fbd0899dd8f81c6fb9cbbc69194d2074b38f384cb6fa72b80e9c2"},
{file = "tornado-6.4-cp38-abi3-musllinux_1_1_i686.whl", hash = "sha256:88b84956273fbd73420e6d4b8d5ccbe913c65d31351b4c004ae362eba06e1f78"},
{file = "tornado-6.4-cp38-abi3-musllinux_1_1_x86_64.whl", hash = "sha256:71ddfc23a0e03ef2df1c1397d859868d158c8276a0603b96cf86892bff58149f"},
{file = "tornado-6.4-cp38-abi3-win32.whl", hash = "sha256:6f8a6c77900f5ae93d8b4ae1196472d0ccc2775cc1dfdc9e7727889145c45052"},
{file = "tornado-6.4-cp38-abi3-win_amd64.whl", hash = "sha256:10aeaa8006333433da48dec9fe417877f8bcc21f48dda8d661ae79da357b2a63"},
{file = "tornado-6.4.tar.gz", hash = "sha256:72291fa6e6bc84e626589f1c29d90a5a6d593ef5ae68052ee2ef000dfd273dee"},
]
[[package]]
name = "tqdm"
version = "4.66.4"
@@ -6344,6 +6654,47 @@ platformdirs = ">=3.9.1,<5"
docs = ["furo (>=2023.7.26)", "proselint (>=0.13)", "sphinx (>=7.1.2,!=7.3)", "sphinx-argparse (>=0.4)", "sphinxcontrib-towncrier (>=0.2.1a0)", "towncrier (>=23.6)"]
test = ["covdefaults (>=2.3)", "coverage (>=7.2.7)", "coverage-enable-subprocess (>=1)", "flaky (>=3.7)", "packaging (>=23.1)", "pytest (>=7.4)", "pytest-env (>=0.8.2)", "pytest-freezer (>=0.4.8)", "pytest-mock (>=3.11.1)", "pytest-randomly (>=3.12)", "pytest-timeout (>=2.1)", "setuptools (>=68)", "time-machine (>=2.10)"]
[[package]]
name = "watchdog"
version = "4.0.0"
description = "Filesystem events monitoring"
optional = false
python-versions = ">=3.8"
files = [
{file = "watchdog-4.0.0-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:39cb34b1f1afbf23e9562501673e7146777efe95da24fab5707b88f7fb11649b"},
{file = "watchdog-4.0.0-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:c522392acc5e962bcac3b22b9592493ffd06d1fc5d755954e6be9f4990de932b"},
{file = "watchdog-4.0.0-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:6c47bdd680009b11c9ac382163e05ca43baf4127954c5f6d0250e7d772d2b80c"},
{file = "watchdog-4.0.0-cp311-cp311-macosx_10_9_universal2.whl", hash = "sha256:8350d4055505412a426b6ad8c521bc7d367d1637a762c70fdd93a3a0d595990b"},
{file = "watchdog-4.0.0-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:c17d98799f32e3f55f181f19dd2021d762eb38fdd381b4a748b9f5a36738e935"},
{file = "watchdog-4.0.0-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:4986db5e8880b0e6b7cd52ba36255d4793bf5cdc95bd6264806c233173b1ec0b"},
{file = "watchdog-4.0.0-cp312-cp312-macosx_10_9_universal2.whl", hash = "sha256:11e12fafb13372e18ca1bbf12d50f593e7280646687463dd47730fd4f4d5d257"},
{file = "watchdog-4.0.0-cp312-cp312-macosx_10_9_x86_64.whl", hash = "sha256:5369136a6474678e02426bd984466343924d1df8e2fd94a9b443cb7e3aa20d19"},
{file = "watchdog-4.0.0-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:76ad8484379695f3fe46228962017a7e1337e9acadafed67eb20aabb175df98b"},
{file = "watchdog-4.0.0-cp38-cp38-macosx_10_9_universal2.whl", hash = "sha256:45cc09cc4c3b43fb10b59ef4d07318d9a3ecdbff03abd2e36e77b6dd9f9a5c85"},
{file = "watchdog-4.0.0-cp38-cp38-macosx_10_9_x86_64.whl", hash = "sha256:eed82cdf79cd7f0232e2fdc1ad05b06a5e102a43e331f7d041e5f0e0a34a51c4"},
{file = "watchdog-4.0.0-cp38-cp38-macosx_11_0_arm64.whl", hash = "sha256:ba30a896166f0fee83183cec913298151b73164160d965af2e93a20bbd2ab605"},
{file = "watchdog-4.0.0-cp39-cp39-macosx_10_9_universal2.whl", hash = "sha256:d18d7f18a47de6863cd480734613502904611730f8def45fc52a5d97503e5101"},
{file = "watchdog-4.0.0-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:2895bf0518361a9728773083908801a376743bcc37dfa252b801af8fd281b1ca"},
{file = "watchdog-4.0.0-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:87e9df830022488e235dd601478c15ad73a0389628588ba0b028cb74eb72fed8"},
{file = "watchdog-4.0.0-pp310-pypy310_pp73-macosx_10_9_x86_64.whl", hash = "sha256:6e949a8a94186bced05b6508faa61b7adacc911115664ccb1923b9ad1f1ccf7b"},
{file = "watchdog-4.0.0-pp38-pypy38_pp73-macosx_10_9_x86_64.whl", hash = "sha256:6a4db54edea37d1058b08947c789a2354ee02972ed5d1e0dca9b0b820f4c7f92"},
{file = "watchdog-4.0.0-pp39-pypy39_pp73-macosx_10_9_x86_64.whl", hash = "sha256:d31481ccf4694a8416b681544c23bd271f5a123162ab603c7d7d2dd7dd901a07"},
{file = "watchdog-4.0.0-py3-none-manylinux2014_aarch64.whl", hash = "sha256:8fec441f5adcf81dd240a5fe78e3d83767999771630b5ddfc5867827a34fa3d3"},
{file = "watchdog-4.0.0-py3-none-manylinux2014_armv7l.whl", hash = "sha256:6a9c71a0b02985b4b0b6d14b875a6c86ddea2fdbebd0c9a720a806a8bbffc69f"},
{file = "watchdog-4.0.0-py3-none-manylinux2014_i686.whl", hash = "sha256:557ba04c816d23ce98a06e70af6abaa0485f6d94994ec78a42b05d1c03dcbd50"},
{file = "watchdog-4.0.0-py3-none-manylinux2014_ppc64.whl", hash = "sha256:d0f9bd1fd919134d459d8abf954f63886745f4660ef66480b9d753a7c9d40927"},
{file = "watchdog-4.0.0-py3-none-manylinux2014_ppc64le.whl", hash = "sha256:f9b2fdca47dc855516b2d66eef3c39f2672cbf7e7a42e7e67ad2cbfcd6ba107d"},
{file = "watchdog-4.0.0-py3-none-manylinux2014_s390x.whl", hash = "sha256:73c7a935e62033bd5e8f0da33a4dcb763da2361921a69a5a95aaf6c93aa03a87"},
{file = "watchdog-4.0.0-py3-none-manylinux2014_x86_64.whl", hash = "sha256:6a80d5cae8c265842c7419c560b9961561556c4361b297b4c431903f8c33b269"},
{file = "watchdog-4.0.0-py3-none-win32.whl", hash = "sha256:8f9a542c979df62098ae9c58b19e03ad3df1c9d8c6895d96c0d51da17b243b1c"},
{file = "watchdog-4.0.0-py3-none-win_amd64.whl", hash = "sha256:f970663fa4f7e80401a7b0cbeec00fa801bf0287d93d48368fc3e6fa32716245"},
{file = "watchdog-4.0.0-py3-none-win_ia64.whl", hash = "sha256:9a03e16e55465177d416699331b0f3564138f1807ecc5f2de9d55d8f188d08c7"},
{file = "watchdog-4.0.0.tar.gz", hash = "sha256:e3e7065cbdabe6183ab82199d7a4f6b3ba0a438c5a512a68559846ccb76a78ec"},
]
[package.extras]
watchmedo = ["PyYAML (>=3.10)"]
[[package]]
name = "watchfiles"
version = "0.21.0"
@@ -6545,6 +6896,17 @@ MarkupSafe = ">=2.1.1"
[package.extras]
watchdog = ["watchdog (>=2.3)"]
[[package]]
name = "whatthepatch"
version = "1.0.5"
description = "A patch parsing and application library."
optional = false
python-versions = ">=3.7"
files = [
{file = "whatthepatch-1.0.5-py3-none-any.whl", hash = "sha256:6bc41f9f48a63384be4478d8b2d5b22185aac75be853cdcb150a2dc174ede7e1"},
{file = "whatthepatch-1.0.5.tar.gz", hash = "sha256:7f374c172812581bc3763587525d14a143aac7fe4220bc4676ecce0d86cb8f08"},
]
[[package]]
name = "wrapt"
version = "1.16.0"
@@ -6933,4 +7295,4 @@ testing = ["coverage (>=5.0.3)", "zope.event", "zope.testing"]
[metadata]
lock-version = "2.0"
python-versions = "^3.11"
content-hash = "b531f9f1630faffd0f97de2f24085694bc1cb6d0eb51b37e3ab78de0e858f2fe"
content-hash = "e829de805da2becffc84fe38136b06ec5ef457b4f2f3b0ef9aef89002d2b552a"

@@ -54,8 +54,10 @@ pytest-asyncio = "*"
[tool.coverage.run]
concurrency = ["gevent"]
[tool.poetry.group.evaluation.dependencies]
torch = "2.2.2"
streamlit = "*"
whatthepatch = "*"
[build-system]
build-backend = "poetry.core.masonry.api"

@@ -11,6 +11,9 @@ def test_help_message(capsys):
expected_help_message = """
usage: pytest [-h] [-d DIRECTORY] [-t TASK] [-f FILE] [-c AGENT_CLS]
[-m MODEL_NAME] [-i MAX_ITERATIONS] [-n MAX_CHARS]
[--eval-output-dir EVAL_OUTPUT_DIR]
[--eval-n-limit EVAL_N_LIMIT]
[--eval-num-workers EVAL_NUM_WORKERS] [--eval-note EVAL_NOTE]
[-l LLM_CONFIG]
Run an agent with a specific task
@@ -31,12 +34,21 @@ options:
-n MAX_CHARS, --max-chars MAX_CHARS
The maximum number of characters to send to and
receive from LLM per task
--eval-output-dir EVAL_OUTPUT_DIR
The directory to save evaluation output
--eval-n-limit EVAL_N_LIMIT
The number of instances to evaluate
--eval-num-workers EVAL_NUM_WORKERS
The number of workers to use for evaluation
--eval-note EVAL_NOTE
The note to add to the evaluation directory
-l LLM_CONFIG, --llm-config LLM_CONFIG
The group of llm settings, e.g. a [llama3] section in
the toml file. Overrides model if both are provided.
"""
actual_lines = captured.out.strip().split('\n')
print('\n'.join(actual_lines))
expected_lines = expected_help_message.strip().split('\n')
# Ensure both outputs have the same number of lines