Xingyao Wang 2406b901df
feat(SWE-Bench environment) integrate SWE-Bench sandbox (#1468)
* add draft dockerfile for build all

* add rsync for build

* add all-in-one docker

* update prepare scripts

* Update swe_env_box.py

* Add swe_entry.sh (buggy now)

* Parse the test command in swe_entry.sh

* Update README for instance eval in sandbox

* revert specialized config

* replace run_as_devin as an init arg

* set container & run_as_root via args

* update swe entry script

* update env

* remove mounting

* allow error after swe_entry

* update swe_env_box

* move file

* update gitignore

* get swe_env_box a working demo

* support faking user response & provide sandox ahead of time;
also return state for controller

* tweak main to support adding controller kwargs

* add module

* initialize plugin for provided sandbox

* add pip cache to plugin & fix jupyter kernel waiting

* better print Observation output

* add run infer scripts

* update readme

* add utility for getting diff patch

* use get_diff_patch in infer

* update readme

* support cost tracking for codeact

* add swe agent edit hack

* disable color in git diff

* fix git diff cmd

* fix state return

* support limit eval

* increase t imeout and export pip cache

* add eval limit config

* return state when hit turn limit

* save log to file; allow agent to give up

* run eval with max 50 turns

* add outputs to gitignore

* save swe_instance & instruction

* add uuid to swebench

* add streamlit dep

* fix save series

* fix the issue where session id might be duplicated

* allow setting temperature for llm (use 0 for eval)

* Get report from agent running log

* support evaluating task success right after inference.

* remove extra log

* comment out prompt for baseline

* add visualizer for eval

* use plaintext for instruction

* reduce timeout for all; only increase timeout for init

* reduce timeout for all; only increase timeout for init

* ignore sid for swe env

* close sandbox in each eval loop

* update visualizer instruction

* increase max chars

* add finish action to history too

* show test result in metrics

* add sidebars for visualizer

* also visualize swe_instance

* cleanup browser when agent controller finish runinng

* do not mount workspace for swe-eval to avoid accidentally overwrite files

* Revert "do not mount workspace for swe-eval to avoid accidentally overwrite files"

This reverts commit 8ef77390543e562e6f0a5a9992418014d8b3010c.

* Revert "Revert "do not mount workspace for swe-eval to avoid accidentally overwrite files""

This reverts commit 016cfbb9f0475f32bacbad5822996b4eaff24a5e.

* run jupyter command via copy to, instead of cp to mount

* only print mixin output when failed

* change ssh box logging

* add visualizer for pass rate

* add instance id to sandbox name

* only remove container we created

* use opendevin logger in main

* support multi-processing infer

* add back metadata, support keyboard interrupt

* remove container with startswith

* make pbar behave correctly

* update instruction w/ multi-processing

* show resolved rate by repo

* rename tmp dir name

* attempt to fix racing for copy to ssh_box

* fix script

* bump swe-bench-all version

* fix ipython with self-contained commands

* add jupyter demo to swe_env_box

* make resolved count two column

* increase height

* do not add glob to url params

* analyze obs length

* print instance id prior to removal handler

* add gold patch in visualizer

* fix interactive git by adding a git --no-pager as alias

* increase max_char to 10k to cover 98% of swe-bench obs cases

* allow parsing note

* prompt v2

* add iteration reminder

* adjust user response

* adjust order

* fix return eval

* fix typo

* add reminder before logging

* remove other resolve rate

* re adjust to new folder structure

* support adding eval note

* fix eval note path

* make sure first log of each instance is printed

* add eval note

* fix the display for visualizer

* tweak visualizer for better git patch reading

* exclude empty patch

* add retry mechanism for swe_env_box start

* fix ssh timeout issue

* add stat field for apply test patch success

* add visualization for fine-grained report

* attempt to support monologue agent by constraining it to single thread

* also log error msg when stopeed

* save error as well

* override WORKSPACE_MOUNT_PATH and WORKSPACE_BASE for monologue to work in mp

* add retry mechanism for sshbox

* remove retry for swe env box

* try to handle loop state stopped

* Add get report scripts

* Add script to convert agent output to swe-bench format

* Merge fine grained report for visualizer

* Update eval readme

* Update README.md

* Add CodeAct gpt4-1106 output and eval logs on swe-bench-lite

* Update the script to get model report

* Update get_model_report.sh

* Update get_agent_report.sh

* Update report merge script

* Add agent output conversion script

* Update swe_lite_env_setup.sh

* Add example swe-bench output files

* Update eval readme

* Remove redundant scripts

* set iteration count down to false by default

* fix: Issue where CodeAct agent was trying to log cost on local llm and throwing Undefined Model execption out of litellm (#1666)

* fix: Issue where CodeAct agent was trying to log cost on local llm and throwing Undefined Model execption out of litellm

* Review Feedback

* Missing None Check

* Review feedback and improved error handling

---------

Co-authored-by: Robert Brennan <accounts@rbren.io>

* fix prepare_swe_util scripts

* update builder images

* update setup script

* remove swe-bench build workflow

* update lock

* remove experiments since they are moved to hf

* remove visualizer (since it is moved to hf repo)

* simply jupyter execution via heredoc

* update ssh_box

* add initial docker readme

* add pkg-config as dependency

* add script for swe_bench all-in-one docker

* add rsync to builder

* rename var

* update commit

* update readme

* update lock

* support specify timeout for long running tasks

* fix path

* separate building of all deps and files

* support returning states at the end of controller

* remove return None

* support specify timeout for long running tasks

* add timeout for all existing sandbox impl

* fix swe_env_box for new codebase

* update llm config in config.py

* support pass sandbox in

* remove force set

* update eval script

* fix issue of overriding final state

* change default eval output to hf demo

* change default eval output to hf demo

* fix config

* only close it when it is NOT external sandbox

* add scripts

* tweak config

* only put in hostory when state has history attr

* fix agent controller on the case of run out interaction budget

* always assume state is always not none

* remove print of final state

* catch all exception when cannot compute completion cost

* Update README.md

* save source into json

* fix path

* update docker path

* return the final state on close

* merge AgentState with State

* fix integration test

* merge AgentState with State

* fix integration test

* add ChangeAgentStateAction to history in attempt to fix integration

* add back set agent state

* update tests

* update tests

* move scripts for setup

* update script and readme for infer

* do not reset logger when n processes == 1

* update eval_infer scripts and readme

* simplify readme

* copy over dir after eval

* copy over dir after eval

* directly return get state

* update lock

* fix output saving of infer

* replace print with logger

* update eval_infer script

* add back the missing .close

* increase timeout

* copy all swe_bench_format file

* attempt to fix output parsing

* log git commit id as metadata

* fix eval script

* update lock

* update unit tests

* fix argparser unit test

* fix lock

* the deps are now lightweight enough to be incude in make build

* add spaces for tests

* add eval outputs to gitignore

* remove git submodule

* readme

* tweak git email

* update upload instruction

* bump codeact version for eval

---------

Co-authored-by: Bowen Li <libowen.ne@gmail.com>
Co-authored-by: huybery <huybery@gmail.com>
Co-authored-by: Bart Shappee <bshappee@gmail.com>
Co-authored-by: Robert Brennan <accounts@rbren.io>
2024-05-15 16:15:55 +00:00

420 lines
14 KiB
Python

import argparse
import logging
import os
import pathlib
import platform
from dataclasses import dataclass, field, fields, is_dataclass
from types import UnionType
from typing import Any, ClassVar, get_args, get_origin
import toml
from dotenv import load_dotenv
from opendevin.core.utils import Singleton
logger = logging.getLogger(__name__)
load_dotenv()
@dataclass
class LLMConfig(metaclass=Singleton):
model: str = 'gpt-3.5-turbo'
api_key: str | None = None
base_url: str | None = None
api_version: str | None = None
embedding_model: str = 'local'
embedding_base_url: str | None = None
embedding_deployment_name: str | None = None
aws_access_key_id: str | None = None
aws_secret_access_key: str | None = None
aws_region_name: str | None = None
num_retries: int = 5
retry_min_wait: int = 3
retry_max_wait: int = 60
timeout: int | None = None
max_chars: int = 5_000_000 # fallback for token counting
temperature: float = 0
top_p: float = 0.5
custom_llm_provider: str | None = None
max_input_tokens: int | None = None
max_output_tokens: int | None = None
def defaults_to_dict(self) -> dict:
"""
Serialize fields to a dict for the frontend, including type hints, defaults, and whether it's optional.
"""
dict = {}
for f in fields(self):
dict[f.name] = get_field_info(f)
return dict
@dataclass
class AgentConfig(metaclass=Singleton):
name: str = 'CodeActAgent'
memory_enabled: bool = False
memory_max_threads: int = 2
def defaults_to_dict(self) -> dict:
"""
Serialize fields to a dict for the frontend, including type hints, defaults, and whether it's optional.
"""
dict = {}
for f in fields(self):
dict[f.name] = get_field_info(f)
return dict
@dataclass
class AppConfig(metaclass=Singleton):
llm: LLMConfig = field(default_factory=LLMConfig)
agent: AgentConfig = field(default_factory=AgentConfig)
runtime: str = 'server'
file_store: str = 'memory'
file_store_path: str = '/tmp/file_store'
workspace_base: str = os.getcwd()
workspace_mount_path: str = os.getcwd()
workspace_mount_path_in_sandbox: str = '/workspace'
workspace_mount_rewrite: str | None = None
cache_dir: str = '/tmp/cache'
sandbox_container_image: str = 'ghcr.io/opendevin/sandbox' + (
f':{os.getenv("OPEN_DEVIN_BUILD_VERSION")}'
if os.getenv('OPEN_DEVIN_BUILD_VERSION')
else ':main'
)
run_as_devin: bool = True
max_iterations: int = 100
e2b_api_key: str = ''
sandbox_type: str = 'ssh' # Can be 'ssh', 'exec', or 'e2b'
use_host_network: bool = False
ssh_hostname: str = 'localhost'
disable_color: bool = False
sandbox_user_id: int = os.getuid() if hasattr(os, 'getuid') else 1000
sandbox_timeout: int = 120
github_token: str | None = None
debug: bool = False
defaults_dict: ClassVar[dict] = {}
def __post_init__(self):
"""
Post-initialization hook, called when the instance is created with only default values.
"""
AppConfig.defaults_dict = self.defaults_to_dict()
def defaults_to_dict(self) -> dict:
"""
Serialize fields to a dict for the frontend, including type hints, defaults, and whether it's optional.
"""
dict = {}
for f in fields(self):
field_value = getattr(self, f.name)
# dataclasses compute their defaults themselves
if is_dataclass(type(field_value)):
dict[f.name] = field_value.defaults_to_dict()
else:
dict[f.name] = get_field_info(f)
return dict
def get_field_info(field):
"""
Extract information about a dataclass field: type, optional, and default.
Args:
field: The field to extract information from.
Returns: A dict with the field's type, whether it's optional, and its default value.
"""
field_type = field.type
optional = False
# for types like str | None, find the non-None type and set optional to True
# this is useful for the frontend to know if a field is optional
# and to show the correct type in the UI
# Note: this only works for UnionTypes with None as one of the types
if get_origin(field_type) is UnionType:
types = get_args(field_type)
non_none_arg = next((t for t in types if t is not type(None)), None)
if non_none_arg is not None:
field_type = non_none_arg
optional = True
# type name in a pretty format
type_name = (
field_type.__name__ if hasattr(field_type, '__name__') else str(field_type)
)
# default is always present
default = field.default
# return a schema with the useful info for frontend
return {'type': type_name.lower(), 'optional': optional, 'default': default}
def load_from_env(config: AppConfig, env_or_toml_dict: dict | os._Environ):
"""Reads the env-style vars and sets config attributes based on env vars or a config.toml dict.
Compatibility with vars like LLM_BASE_URL, AGENT_MEMORY_ENABLED and others.
Args:
config: The AppConfig object to set attributes on.
env_or_toml_dict: The environment variables or a config.toml dict.
"""
def get_optional_type(union_type: UnionType) -> Any:
"""Returns the non-None type from an Union."""
types = get_args(union_type)
return next((t for t in types if t is not type(None)), None)
# helper function to set attributes based on env vars
def set_attr_from_env(sub_config: Any, prefix=''):
"""Set attributes of a config dataclass based on environment variables."""
for field_name, field_type in sub_config.__annotations__.items():
# compute the expected env var name from the prefix and field name
# e.g. LLM_BASE_URL
env_var_name = (prefix + field_name).upper()
if is_dataclass(field_type):
# nested dataclass
nested_sub_config = getattr(sub_config, field_name)
# the agent field: the env var for agent.name is just 'AGENT'
if field_name == 'agent' and 'AGENT' in env_or_toml_dict:
setattr(nested_sub_config, 'name', env_or_toml_dict[env_var_name])
set_attr_from_env(nested_sub_config, prefix=field_name + '_')
elif env_var_name in env_or_toml_dict:
# convert the env var to the correct type and set it
value = env_or_toml_dict[env_var_name]
try:
# if it's an optional type, get the non-None type
if get_origin(field_type) is UnionType:
field_type = get_optional_type(field_type)
# Attempt to cast the env var to type hinted in the dataclass
if field_type is bool:
cast_value = str(value).lower() in ['true', '1']
else:
cast_value = field_type(value)
setattr(sub_config, field_name, cast_value)
except (ValueError, TypeError):
logger.error(
f'Error setting env var {env_var_name}={value}: check that the value is of the right type'
)
# Start processing from the root of the config object
set_attr_from_env(config)
def load_from_toml(config: AppConfig, toml_file: str = 'config.toml'):
"""Load the config from the toml file. Supports both styles of config vars.
Args:
config: The AppConfig object to update attributes of.
"""
# try to read the config.toml file into the config object
toml_config = {}
try:
with open(toml_file, 'r', encoding='utf-8') as toml_contents:
toml_config = toml.load(toml_contents)
except FileNotFoundError:
# the file is optional, we don't need to do anything
return
except toml.TomlDecodeError:
logger.warning(
'Cannot parse config from toml, toml values have not been applied.',
exc_info=False,
)
return
# if there was an exception or core is not in the toml, try to use the old-style toml
if 'core' not in toml_config:
# re-use the env loader to set the config from env-style vars
load_from_env(config, toml_config)
return
core_config = toml_config['core']
try:
# set llm config from the toml file
llm_config = config.llm
if 'llm' in toml_config:
llm_config = LLMConfig(**toml_config['llm'])
# set agent config from the toml file
agent_config = config.agent
if 'agent' in toml_config:
agent_config = AgentConfig(**toml_config['agent'])
# update the config object with the new values
config = AppConfig(llm=llm_config, agent=agent_config, **core_config)
except (TypeError, KeyError):
logger.warning(
'Cannot parse config from toml, toml values have not been applied.',
exc_info=False,
)
def finalize_config(config: AppConfig):
"""
More tweaks to the config after it's been loaded.
"""
# In local there is no sandbox, the workspace will have the same pwd as the host
if config.sandbox_type == 'local':
config.workspace_mount_path_in_sandbox = config.workspace_mount_path
if config.workspace_mount_rewrite: # and not config.workspace_mount_path:
# TODO why do we need to check if workspace_mount_path is None?
base = config.workspace_base or os.getcwd()
parts = config.workspace_mount_rewrite.split(':')
config.workspace_mount_path = base.replace(parts[0], parts[1])
if config.llm.embedding_base_url is None:
config.llm.embedding_base_url = config.llm.base_url
if config.use_host_network and platform.system() == 'Darwin':
logger.warning(
'Please upgrade to Docker Desktop 4.29.0 or later to use host network mode on macOS. '
'See https://github.com/docker/roadmap/issues/238#issuecomment-2044688144 for more information.'
)
# make sure cache dir exists
if config.cache_dir:
pathlib.Path(config.cache_dir).mkdir(parents=True, exist_ok=True)
config = AppConfig()
load_from_toml(config)
load_from_env(config, os.environ)
finalize_config(config)
# Utility function for command line --group argument
def get_llm_config_arg(llm_config_arg: str):
"""
Get a group of llm settings from the config file.
"""
# keep only the name, just in case
llm_config_arg = llm_config_arg.strip('[]')
logger.info(f'Loading llm config from {llm_config_arg}')
# load the toml file
try:
with open('config.toml', 'r', encoding='utf-8') as toml_file:
toml_config = toml.load(toml_file)
except FileNotFoundError:
return None
except toml.TomlDecodeError as e:
logger.error(f'Cannot parse llm group from {llm_config_arg}. Exception: {e}')
return None
# update the llm config with the specified section
if llm_config_arg in toml_config:
return LLMConfig(**toml_config[llm_config_arg])
logger.debug(f'Loading from toml failed for {llm_config_arg}')
return None
# Command line arguments
def get_parser():
"""
Get the parser for the command line arguments.
"""
parser = argparse.ArgumentParser(description='Run an agent with a specific task')
parser.add_argument(
'-d',
'--directory',
type=str,
help='The working directory for the agent',
)
parser.add_argument(
'-t', '--task', type=str, default='', help='The task for the agent to perform'
)
parser.add_argument(
'-f',
'--file',
type=str,
help='Path to a file containing the task. Overrides -t if both are provided.',
)
parser.add_argument(
'-c',
'--agent-cls',
default=config.agent.name,
type=str,
help='The agent class to use',
)
parser.add_argument(
'-m',
'--model-name',
default=config.llm.model,
type=str,
help='The (litellm) model name to use',
)
parser.add_argument(
'-i',
'--max-iterations',
default=config.max_iterations,
type=int,
help='The maximum number of iterations to run the agent',
)
parser.add_argument(
'-n',
'--max-chars',
default=config.llm.max_chars,
type=int,
help='The maximum number of characters to send to and receive from LLM per task',
)
parser.add_argument(
'--eval-output-dir',
default='evaluation/evaluation_outputs/outputs',
type=str,
help='The directory to save evaluation output',
)
parser.add_argument(
'--eval-n-limit',
default=None,
type=int,
help='The number of instances to evaluate',
)
parser.add_argument(
'--eval-num-workers',
default=4,
type=int,
help='The number of workers to use for evaluation',
)
parser.add_argument(
'--eval-note',
default=None,
type=str,
help='The note to add to the evaluation directory',
)
parser.add_argument(
'-l',
'--llm-config',
default=None,
type=str,
help='The group of llm settings, e.g. a [llama3] section in the toml file. Overrides model if both are provided.',
)
return parser
def parse_arguments():
"""
Parse the command line arguments.
"""
parser = get_parser()
args, _ = parser.parse_known_args()
if args.directory:
config.workspace_base = os.path.abspath(args.directory)
print(f'Setting workspace base to {config.workspace_base}')
return args
args = parse_arguments()