[Refactor, Evaluation] Refactor and clean up evaluation harness to remove global config and use EventStreamRuntime (#3230)

* move multi-line bash tests to test_runtime;
support multi-line bash for esruntime;

* add testcase to handle PS2 prompt

* use bashlex for bash parsing to handle multi-line commands;
add testcases for multi-line commands

* revert ghcr runtime change

* Apply stash

* fix run as other user;
make test async;

* fix test runtime for run as od

* add run-as-devin to all the runtime tests

* handle the case when username is root

* move all run-as-devin tests from sandbox;
only test a few cases with a different user to save time;

* move over multi-line echo related tests to test_runtime

* fix user-specific jupyter by fixing the pypoetry virtualenv folder

* make plugin's init async;
chdir at initialization of jupyter plugin;
move ipy simple testcase to test runtime;

* support agentskills import in
move jupyter pwd tests;
overload `add_env_vars` for EventStreamRuntime to update env var also in Jupyter;
make agentskills read env var lazily, in case env var is updated;

* fix ServerRuntime agentskills issue

* move agnostic image test to test_runtime

* merge runtime tests in CI

* fix enable auto lint as env var

* update warning message

* update warning message

* test for different container images

* change parsing output logging to debug

* add exception handling for update_pwd_decorator

* fix unit test indentation

* add plugins as default input to Runtime class;
remove init_sandbox_plugins;
implement add_env_var (include jupyter) in the base class;

* fix server runtime auto lint

* Revert "add exception handling for update_pwd_decorator"

This reverts commit 2b668b1506e02145cb8f87e321aad62febca3d50.

* tries to print debugging info for agentskills

* explicitly setting uid (try to fix permission issue)

* Revert "tries to print debugging info for agentskills"

This reverts commit 8be4c86756f0e3fc62957b327ba2ac4999c419de.

* set sandbox user id during testing to hopefully fix the permission issue

* add browser tools for server runtime

* try to debug for old pwd

* update debug cmd

* only test agnostic runtime when TEST_RUNTIME is Server

* fix temp dir mkdir

* load TEST_RUNTIME at the beginning

* remove ipython tests

* only log to file when DEBUG

* default logging to project root

* temporarily remove log to file

* fix LLM logger dir

* fix logger

* make set pwd an optional aux action

* fix prev pwd

* fix infinite recursion

* simplify

* do not import the whole od library, to avoid jupyter creating the logger folder

* fix browsing

* increase timeout

* attempt to fix agentskills yet again

* clean up in testcases, since CI may run as non-root

* add _cause attribute for event.id

* remove parent

* add a bunch of debugging statements again for CI :(

* fix temp_dir fixture

* change all temp dir to follow pytest's tmp_path_factory

* remove extra bracket

* clean up error printing a bit

* jupyter chdir to self.config.workspace_mount_path_in_sandbox on initialization

* jupyter chdir to self.config.workspace_mount_path_in_sandbox on initialization

* add typing for tmp dir fixture

* clear the directory before running the test to avoid weird CI temp dir

* remove agnostic test case for server runtime

* Revert "remove agnostic test case for server runtime"

This reverts commit 30e2181c3fc1410e69596c2dcd06be01f1d016b3.

* disable agnostic tests in CI

* fix test

* make sure plugin arg is not passed when no plugin is specified;
remove redundant on_event function;

* move mock prompt

* rename runtime

* remove extra logging

* refactor run_controller's interface;
support multiple runtimes for integration tests;
filter out hostname for prompt

* uncomment other tests

* pass the right runtime to controller

* log runtime when start

* uncomment tests

* improve symbol filters

* add integration test prompts that seemed ok

* add integration test workflow

* add python3 to default ubuntu image

* symlink python and fix permission to jupyter pip

* add retry for jupyter execute server

* fix jupyter pip install;
add post-process for jupyter pip install;
simplify init by adding agent_skills path to PYTHONPATH;
add testcase to test jupyter pip install;

* fix bug

* use ubuntu:22.04 for eventstream integration tests

* add todo

* update testcase

* remove redundant code

* fix unit test

* reduce dependency for runtime

* try making llama-index an optional dependency that's not installed by default

* remove pip install since it seemed not needed

* log ipython execution;
await write message since it returns a future

* update ipy testcase

* do not install llama-index in CI

* do not install llama-index in the app docker as well

* set sandbox container image in the integration test script

* log plugins & env var for runtime

* update conftest for sha256

* add git

* remove all non-alphanumeric characters

* add working ipy module tests!

* default to use host network

* remove is_async from browser to make things a little more reliable;
retry loading browser on error;

* add sleep to wait a bit for http server

* kill http server before regenerating browsing tests

* fix browsing

* only set sandbox container image if undefined

* skip empty config value

* update evaluation to use the latest run_controller

* revert logger in execute_server to be compatible with server runtime

* revert logging level to fix jupyter

* set logger level

* revert the logging

* chmod for workspace to fix permission

* support getting timeout from action

* update test for server runtime

* try to fix file permission

* fix test_cmd_run_action_serialization_deserialization test (added timeout)

* poetry: pip 24.2, torch 2.2.2

* revert adding pip to pyproject.toml

* add build to dependencies in pyproject.toml

* forgot poetry lock --no-update

* fix a DelegatorAgent prompt_002.log (timeout)

* fix a DelegatorAgent prompt_003.log (timeout)

* couple more timeout attribs in prompt files

* some more prompt files

* prompts galore

* add clarification comment for timeout

* default timeout to config

* add assert

* update integration tests for eventstream

* update integration tests

* fix timeout for action<->dict

* remove redundant on_event

* default to use instance image

* update run_controller interface

* add logging for copy

* refactor swe_bench for the new design

* fix action execution timeout

* update lock

* remove build sandbox locally

* fix runtime

* use plain for-loop for single process

* remove extra print

* get swebench inference working

* print whole `test_result` dict

* got swebench patch post-process working

* update swe-bench evaluation readme

* refactor using shared reset_logger function

* move messy swebench prompt to a different file

* support the ability to specify whether to keep prompt

* support the ability to specify whether to keep prompt

* fix dockerfile

* fix import and remove unnecessary strip logic

* fix action serialization

* get agentbench running

* remove extra ls for agent bench

* fix agentbench metric

* factor out common documentation for eval

* update biocoder doc

* remove swe_env_box since it is no longer needed

* get biocoder working

* add func timeout for bird

* fix jupyter pwd with ~ as user name

* fix jupyter pwd with ~ as user name

* get bird working

* get browsing evaluation working

* make eda runnable

* fix id column

* fix eda run_infer

* unify eval output using a structured format;
make swebench compatible with that format;
update client source code for every swebench run;
do not inject testcmd for swebench

* standardize existing benchmarks for the new eval output

* set update source code = true

* get gaia standardized

* fix gaia

* gorilla refactored but stuck at language.so to test

* refactor and make gpqa work

* refactor humanevalfix and get it working

* refactor logic reasoning and get it working

* refactor browser env so it works with eventstream runtime for eval

* add initial version of miniwob refactor

* fix browsergym environment

* get miniwob working!!

* allow injecting additional dependencies into the OD runtime docker image

* allow injecting additional dependencies into the OD runtime docker image

* support logic reasoning with pre-injected dependency

* get mint working

* update runtime build

* fix mint docker

* add test for keep_prompt;
add missing await close for some tests

* update integration tests for eventstream runtime

* fix integration tests for server runtime

* refactor ml bench and toolqa

* refactor webarena

* fix default factory

* Update run_infer.py

* add APIError to retry

* increase timeout for swebench

* make sure to hide the api key when dumping eval output

* update the put-source-code behavior to put files instead of a tarball

* add dirhash to dependencies

* sendintr when timeout

* fix dockerfile copy

* reduce timeout

* use dirhash to avoid repeat building for update source

* fix runtime_build testcase

* add dir_hash to docker build pipeline

* revert api error

* update poetry lock

* add retries for swebench run infer

* fix git patch

* update poetry lock

* adjust config order

* fix mount volumes

* enforce all eval to use "instance_id"

* remove file store from runtime

* make file_store public inside eventstream

* move the runtime logic out of `main`

* support using async function for process_instance_fn

* refactor run_infer with the create_time

* fix file store

* Update evaluation/toolqa/utils.py

Co-authored-by: Graham Neubig <neubig@gmail.com>

* fix typo

---------

Co-authored-by: tobitege <tobitege@gmx.de>
Co-authored-by: super-dainiu <78588128+super-dainiu@users.noreply.github.com>
Co-authored-by: Graham Neubig <neubig@gmail.com>
Xingyao Wang 2024-08-07 01:21:45 +08:00 committed by GitHub
parent 9029cd77d3
commit 31b244f95e
78 changed files with 3565 additions and 3784 deletions
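Several of the commits above replace ad-hoc handling of multi-line bash input with bashlex-based parsing before commands are sent to the runtime. A minimal sketch of that idea follows; `split_bash_commands` is an illustrative name rather than the project's actual helper, and real input needs extra care for incomplete commands (the PS2-prompt case mentioned in the commits):

```python
import bashlex


def split_bash_commands(commands: str) -> list[str]:
    """Split a possibly multi-line bash string into top-level commands.

    bashlex.parse returns one AST node per top-level command; each node's
    .pos attribute is the (start, end) span of that command in the input.
    e.g. split_bash_commands('echo 1 && echo 2\npwd') -> ['echo 1 && echo 2', 'pwd']
    """
    parsed = bashlex.parse(commands)
    return [commands[node.pos[0] : node.pos[1]] for node in parsed]
```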

.gitignore vendored
View File

@@ -169,6 +169,10 @@ evaluation/outputs
evaluation/swe_bench/eval_workspace*
evaluation/SWE-bench/data
evaluation/webarena/scripts/webarena_env.sh
evaluation/bird/data
evaluation/gaia/data
evaluation/gorilla/data
evaluation/toolqa/data
# frontend

View File

@@ -2,9 +2,10 @@
This folder contains evaluation harness for evaluating agents on the Entity-deduction-Arena Benchmark, from the paper [Probing the Multi-turn Planning Capabilities of LLMs via 20 Question Games](https://arxiv.org/abs/2310.01468), presented in ACL 2024 main conference.
## Configure OpenDevin and your LLM
## Setup Environment and LLM Configuration
Please follow the instructions [here](../README.md#setup) to set up your local development environment and LLM.
Create a `config.toml` file if it does not exist at the root of the workspace. Please check [README.md](../../README.md) for how to set this up.
## Start the evaluation

View File

@@ -1,30 +1,27 @@
import asyncio
import logging
import os
import pandas as pd
# import huggingface_hub
from datasets import load_dataset
from evaluation.EDA.game import Q20Game, Q20GameCelebrity
from evaluation.utils.shared import (
EvalMetadata,
EvalOutput,
make_metadata,
prepare_dataset,
reset_logger_for_multiprocessing,
run_evaluation,
)
from opendevin.controller.agent import Agent
# from evaluation.EDA.scorer import question_scorer
from opendevin.controller.state.state import State
from opendevin.core.config import get_llm_config_arg, get_parser, load_app_config
from opendevin.core.logger import get_console_handler
from opendevin.core.config import (
AppConfig,
SandboxConfig,
get_llm_config_arg,
get_parser,
)
from opendevin.core.logger import opendevin_logger as logger
from opendevin.core.main import run_controller
from opendevin.llm.llm import LLM
config = load_app_config()
from opendevin.core.main import create_runtime, run_controller
game = None
@@ -56,39 +53,45 @@ AGENT_CLS_TO_INST_SUFFIX = {
}
def process_instance(
def get_config(
metadata: EvalMetadata,
) -> AppConfig:
config = AppConfig(
default_agent=metadata.agent_class,
run_as_devin=False,
runtime='eventstream',
max_iterations=metadata.max_iterations,
sandbox=SandboxConfig(
container_image='ubuntu:22.04',
enable_auto_lint=False,
use_host_network=False,
update_source_code=True,
),
# do not mount workspace
workspace_base=None,
workspace_mount_path=None,
)
config.set_llm_config(metadata.llm_config)
return config
async def process_instance(
instance: pd.Series,
metadata: EvalMetadata,
reset_logger: bool = True,
):
# Create the agent
agent = Agent.get_cls(metadata.agent_class)(llm=LLM(config=metadata.llm_config))
) -> EvalOutput:
config = get_config(metadata)
instance_id = instance['text'].strip()
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
eval_output_dir = metadata.eval_output_dir
if reset_logger:
# Set up logger
log_file = os.path.join(
eval_output_dir, 'logs', f'instance_{instance["text"].strip()}.log'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
# add back the console handler to print ONE line
logger.addHandler(get_console_handler())
logger.info(
f'Starting evaluation for instance {instance["text"].strip()}.\nLOG: tail -f {log_file}'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
file_handler = logging.FileHandler(log_file)
file_handler.setFormatter(
logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
)
logger.addHandler(file_handler)
log_dir = os.path.join(metadata.eval_output_dir, 'infer_logs')
reset_logger_for_multiprocessing(logger, instance_id, log_dir)
else:
logger.info(f'Starting evaluation for instance {instance_id}.')
# Prepare instruction
_game_class = {'things': Q20Game, 'celebs': Q20GameCelebrity}
_game_class = {'eda-things': Q20Game, 'eda-celebs': Q20GameCelebrity}
guesser_kargs = {
'max_new_tokens': 64,
@@ -112,24 +115,16 @@ def process_instance(
instruction = f'{game.first_user_utterance}'
logger.info(f'Instruction: {instruction}')
# instruction += 'IMPORTANT: You should ONLY interact with the environment provided to you AND NEVER ASK FOR HUMAN HELP.\n'
# NOTE: You can actually set slightly different instruction for different agents
instruction += AGENT_CLS_TO_INST_SUFFIX[agent.__class__.__name__]
instruction += AGENT_CLS_TO_INST_SUFFIX[metadata.agent_class]
# Here's how you can run the agent (similar to the `main` function) and get the final task state
config.max_iterations = metadata.max_iterations
runtime = await create_runtime(config, sid=instance['text'].strip())
state: State | None = asyncio.run(
run_controller(
config=config,
task_str=instruction,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN[
agent.__class__.__name__
],
agent=agent,
sid=instance['text'].strip(),
)
state: State | None = await run_controller(
config=config,
task_str=instruction,
runtime=runtime,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN[metadata.agent_class],
)
# ======= Attempt to evaluate the agent's edits =======
# If you are working on simpler benchmark that only evaluates the final model output (e.g., in a MessageAction)
@@ -150,21 +145,20 @@ def process_instance(
histories = state.history.compatibility_for_eval_history_pairs()
# Save the output
output = {
'instance_id': instance['text'].strip(),
'instance': instance,
'instruction': instruction,
'metadata': metadata.model_dump(),
'history': histories,
'metrics': metrics,
'error': state.last_error if state and state.last_error else None,
'test_result': {
output = EvalOutput(
instance_id=instance_id,
instance=instance.to_dict(),
instruction=instruction,
metadata=metadata,
history=histories,
metrics=metrics,
error=state.last_error if state and state.last_error else None,
test_result={
'success': test_result,
'final_message': final_message,
'ground_truth': instance['text'],
},
}
)
return output
@@ -191,12 +185,16 @@ if __name__ == '__main__':
)
args, _ = parser.parse_known_args()
llm_config = get_llm_config_arg(args.llm_config) if args.llm_config else config.llm
logger.info(f'Config for evaluation: {config}')
eda_dataset = load_dataset(
'yizheapple/entity-deduction-arena', name=args.dataset, split=args.data_split
)
eda_dataset.rename(columns={'text': 'instance_id'}, inplace=True)
llm_config = None
if args.llm_config:
llm_config = get_llm_config_arg(args.llm_config)
if llm_config is None:
raise ValueError(f'Could not find LLM config: --llm_config {args.llm_config}')
metadata = make_metadata(
llm_config,
@@ -214,16 +212,15 @@ if __name__ == '__main__':
output_file = os.path.join(metadata.eval_output_dir, 'output.jsonl')
prepared_dataset = prepare_dataset(
eda_dataset.to_pandas(), output_file, args.eval_n_limit, 'text'
eda_dataset.to_pandas(), output_file, args.eval_n_limit
)
agent = Agent.get_cls(args.agent_cls)(llm=LLM(config.llm))
run_evaluation(
prepared_dataset,
metadata,
output_file,
args.eval_num_workers,
process_instance,
'text',
asyncio.run(
run_evaluation(
prepared_dataset,
metadata,
output_file,
args.eval_num_workers,
process_instance,
)
)
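The EDA diff above shows the shape every refactored `run_infer.py` now follows: build a per-instance `AppConfig` instead of mutating a global one, create an eventstream runtime explicitly, run the controller against it, and return a structured `EvalOutput`. A condensed sketch of that flow is below; the concrete `SandboxConfig` values are illustrative, several keyword arguments the real scripts pass (e.g. `fake_user_response_fn`, `enable_auto_lint`, `use_host_network`) are omitted and assumed to have defaults, and the plain `instance_id` column follows the "enforce all eval to use instance_id" commit:

```python
import pandas as pd

from evaluation.utils.shared import EvalMetadata, EvalOutput
from opendevin.core.config import AppConfig, SandboxConfig
from opendevin.core.main import create_runtime, run_controller


def get_config(metadata: EvalMetadata) -> AppConfig:
    config = AppConfig(
        default_agent=metadata.agent_class,
        run_as_devin=False,
        runtime='eventstream',
        max_iterations=metadata.max_iterations,
        sandbox=SandboxConfig(
            container_image='ubuntu:22.04',  # benchmark-specific images are also used
            update_source_code=True,
        ),
        # do not mount a host workspace; the sandbox is self-contained
        workspace_base=None,
        workspace_mount_path=None,
    )
    config.set_llm_config(metadata.llm_config)
    return config


async def process_instance(instance: pd.Series, metadata: EvalMetadata) -> EvalOutput:
    config = get_config(metadata)
    # one runtime per instance, keyed by instance_id
    runtime = await create_runtime(config, sid=str(instance['instance_id']))
    instruction = '...built from the instance...'
    state = await run_controller(config=config, task_str=instruction, runtime=runtime)
    if state is None:
        raise ValueError('State should not be None.')
    return EvalOutput(
        instance_id=str(instance['instance_id']),
        instance=instance.to_dict(),
        instruction=instruction,
        metadata=metadata,
        history=state.history.compatibility_for_eval_history_pairs(),
        metrics=state.metrics.get() if state.metrics else None,
        error=state.last_error if state.last_error else None,
        test_result={},
    )
```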

evaluation/EDA/scripts/run_infer.sh Normal file → Executable file
View File

View File

@@ -22,6 +22,32 @@ all the preprocessing/evaluation/analysis scripts.
- BIRD: [`evaluation/bird`](./bird)
- LogicReasoning: [`evaluation/logic_reasoning`](./logic_reasoning)
## Setup
### Development environment
Please follow [this document](https://github.com/OpenDevin/OpenDevin/blob/main/Development.md) to set up a local development environment for OpenDevin.
### Configure OpenDevin and your LLM
Create a `config.toml` file if it does not exist at the root of the workspace. You can copy from `config.template.toml` if it is easier for you.
Add the configuration for your LLM:
```toml
# TODO: Change these to the model you want to evaluate
[llm.eval_gpt4_1106_preview_llm]
model = "gpt-4-1106-preview"
api_key = "XXX"
temperature = 0.0
[llm.eval_some_openai_compatible_model_llm]
model = "openai/MODEL_NAME"
base_url = "https://OPENAI_COMPATIBLE_URL/v1"
api_key = "XXX"
temperature = 0.0
```
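On the Python side, each refactored `run_infer.py` resolves the `[llm.<name>]` group by the name passed via `--llm_config`; the snippet below mirrors the EDA and AgentBench diffs, and the mapping of the CLI value to the `[llm.<name>]` table is inferred from context rather than stated in the diff:

```python
from opendevin.core.config import get_llm_config_arg, get_parser

parser = get_parser()
args, _ = parser.parse_known_args()

# e.g. --llm_config eval_gpt4_1106_preview_llm -> [llm.eval_gpt4_1106_preview_llm] in config.toml
llm_config = None
if args.llm_config:
    llm_config = get_llm_config_arg(args.llm_config)
if llm_config is None:
    raise ValueError(f'Could not find LLM config: --llm_config {args.llm_config}')
```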
### Result Visualization
Check [this huggingface space](https://huggingface.co/spaces/OpenDevin/evaluation) for visualization of existing experimental results.

View File

@@ -1,44 +1,10 @@
# AgentBench Evaluation
This folder contains evaluation harness for evaluating agents on
the [AgentBench: Evaluating LLMs as Agents](https://arxiv.org/abs/2308.03688).
This folder contains evaluation harness for evaluating agents on the [AgentBench: Evaluating LLMs as Agents](https://arxiv.org/abs/2308.03688). We currently only support running on the `osbench` subset.
## Configure OpenDevin and your LLM
## Setup Environment and LLM Configuration
Create a `config.toml` file if it does not exist at the root of the workspace. Please check [README.md](../../README.md)
for how to set this up.
Here is an example `config.toml` file:
```toml
[core]
max_iterations = 100
cache_dir = "/path/to/cache"
workspace_base = "/path/to/workspace"
workspace_mount_path = "/path/to/workspace"
ssh_hostname = "localhost"
# AgentBench specific
run_as_devin = true
[sandbox]
use_host_network = false
enable_auto_lint = true
box_type = "ssh"
timeout = 120
[llm.eval_gpt35_turbo]
model = "gpt-3.5-turbo"
api_key = "sk-123"
temperature = 0.0
[llm.eval_gpt4o]
model = "gpt-4o"
api_key = "sk-123"
temperature = 0.0
```
Please follow the instructions [here](../README.md#setup) to set up your local development environment and LLM.
## Start the evaluation
@@ -46,7 +12,18 @@ temperature = 0.0
./evaluation/agent_bench/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit]
```
Following is the basic command to start the evaluation. Here we are only evaluating the `osbench` for now.
- `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for your
LLM settings, as defined in your `config.toml`.
- `git-version`, e.g. `HEAD`, is the git commit hash of the OpenDevin version you would
like to evaluate. It could also be a release tag like `0.6.2`.
- `agent`, e.g. `CodeActAgent`, is the name of the agent for benchmarks, defaulting
to `CodeActAgent`.
- `eval_limit`, e.g. `10`, limits the evaluation to the first `eval_limit` instances. By
default, the script evaluates the entire `osbench` subset. Note:
in order to use `eval_limit`, you must also set `agent`.
Following is the basic command to start the evaluation.
You can update the arguments in the script `evaluation/agent_bench/scripts/run_infer.sh`, such as `--max-iterations`, `--eval-num-workers` and so on.
@@ -57,5 +34,5 @@ You can update the arguments in the script `evaluation/agent_bench/scripts/run_i
- `--eval-n-limit`: the number of examples to evaluate. For example, `100`.
```bash
./evaluation/agent_bench/scripts/run_infer.sh eval_gpt35_turbo 0.6.2 CodeActAgent 1
./evaluation/agent_bench/scripts/run_infer.sh eval_gpt35_turbo HEAD CodeActAgent 1
```

View File

@@ -14,7 +14,7 @@ def try_parse_answer(act) -> str | None:
raw_ans = act.thought
else:
return None
agent_answer = re.findall(r'<solution>(.*?)</solution>', raw_ans)
agent_answer = re.findall(r'<solution>(.*?)</solution>', raw_ans, re.DOTALL)
if not agent_answer:
return None
return agent_answer[0].strip()
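For context on the one-character change above: without `re.DOTALL`, `.` does not match newlines, so a `<solution>` block that spans several lines is silently dropped. A quick illustration:

```python
import re

raw_ans = 'The answer is <solution>\nSELECT *\nFROM t\n</solution>'

print(re.findall(r'<solution>(.*?)</solution>', raw_ans))             # []
print(re.findall(r'<solution>(.*?)</solution>', raw_ans, re.DOTALL))  # ['\nSELECT *\nFROM t\n']
```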

View File

@@ -1,10 +1,9 @@
import asyncio
import logging
import os
import re
import shutil
import tempfile
from typing import Any
import docker
import pandas as pd
from datasets import load_dataset
@@ -16,64 +15,176 @@ from evaluation.agent_bench.helper import (
)
from evaluation.utils.shared import (
EvalMetadata,
EvalOutput,
make_metadata,
prepare_dataset,
reset_logger_for_multiprocessing,
run_evaluation,
)
from opendevin.controller.agent import Agent
from opendevin.controller.state.state import State
from opendevin.core.config import get_llm_config_arg, load_app_config, parse_arguments
from opendevin.core.logger import get_console_handler
from opendevin.core.config import (
AppConfig,
SandboxConfig,
get_llm_config_arg,
parse_arguments,
)
from opendevin.core.logger import opendevin_logger as logger
from opendevin.core.main import run_controller
from opendevin.events.action import CmdRunAction, MessageAction
from opendevin.llm.llm import LLM
from opendevin.runtime.docker.ssh_box import DockerSSHBox
config = load_app_config()
from opendevin.core.main import create_runtime, run_controller
from opendevin.events.action import AgentFinishAction, CmdRunAction, MessageAction
from opendevin.events.observation import CmdOutputObservation
from opendevin.runtime.runtime import Runtime
def process_instance(
def get_config(
metadata: EvalMetadata,
) -> AppConfig:
config = AppConfig(
default_agent=metadata.agent_class,
run_as_devin=False,
runtime='eventstream',
max_iterations=metadata.max_iterations,
sandbox=SandboxConfig(
container_image='ubuntu:22.04',
enable_auto_lint=True,
use_host_network=False,
update_source_code=True,
),
# do not mount workspace
workspace_base=None,
workspace_mount_path=None,
)
config.set_llm_config(metadata.llm_config)
return config
async def initialize_runtime(
runtime: Runtime,
instance: pd.Series, # this argument is not required
):
"""Initialize the runtime for the agent.
This function is called before the runtime is used to run the agent.
"""
logger.info(f"{'-' * 50} BEGIN Runtime Initialization Fn {'-' * 50}")
obs: CmdOutputObservation
# Set instance id
action = CmdRunAction(command='mkdir -p /workspace')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
action = CmdRunAction(command='cd /workspace')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
init_cmd = instance.init
if init_cmd is not None:
script_name = f'{instance.instance_id}_init.sh'
with tempfile.TemporaryDirectory() as tmpdir:
host_script_path = os.path.join(tmpdir, script_name)
create_sh_file(host_script_path, init_cmd)
await runtime.copy_to(
host_script_path,
'/workspace',
)
logger.info(f'Running init script: {script_name}')
action = CmdRunAction(command=f'chmod +x ./{script_name} && ./{script_name}')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
assert obs.exit_code == 0
logger.info(f"{'-' * 50} END Runtime Initialization Fn {'-' * 50}")
async def complete_runtime(
runtime: Runtime,
instance: pd.Series, # this argument is not required, but it is used to get the workspace_dir_name
) -> dict[str, Any]:
"""Complete the runtime for the agent.
This function is called after the agent has finished running.
If you need to do something in the sandbox to get the correctness metric after
the agent has run, modify this function.
"""
logger.info(f"{'-' * 50} BEGIN Runtime Completion Fn {'-' * 50}")
obs: CmdOutputObservation
agent_answer = None
get_agent_result_cmd = instance.get_agent_result
if get_agent_result_cmd is not None:
script_name = 'get_agent_result.sh'
with tempfile.TemporaryDirectory() as tmpdir:
host_script_path = os.path.join(tmpdir, script_name)
create_sh_file(host_script_path, get_agent_result_cmd)
await runtime.copy_to(
host_script_path,
'/workspace',
)
logger.info(f'Running get agent result cmd: {script_name}')
action = CmdRunAction(
command=f'chmod +x ./{script_name} && ./{script_name}',
keep_prompt=False,
)
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
assert obs.exit_code == 0
agent_answer = obs.content
# IF the agent answer is not found, retrieve it from the history
# We wait until the controller finishes
final_ans = None
if instance.ground_truth is not None:
final_ans = instance.ground_truth
else:
get_ground_truth_cmd = instance.get_ground_truth
if get_ground_truth_cmd is not None:
script_name = 'get_ground_truth.sh'
with tempfile.TemporaryDirectory() as tmpdir:
host_script_path = os.path.join(tmpdir, script_name)
create_sh_file(host_script_path, get_ground_truth_cmd)
await runtime.copy_to(
host_script_path,
'/workspace',
)
logger.info(f'Running get ground truth cmd: {script_name}')
action = CmdRunAction(
command=f'chmod +x ./{script_name} && ./{script_name}',
keep_prompt=False,
)
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
final_ans = obs.content
logger.info(f"{'-' * 50} END Runtime Completion Fn {'-' * 50}")
return {
'final_ans': final_ans,
'agent_answer': agent_answer,
}
async def process_instance(
instance: pd.Series,
metadata: EvalMetadata,
reset_logger: bool = True,
):
# Create the agent
agent = Agent.get_cls(metadata.agent_class)(llm=LLM(config=metadata.llm_config))
) -> EvalOutput:
config = get_config(metadata)
inst_id = instance.instance_id
question = instance.description
# create a directory for the instance's workspace
instance_workspace = str(os.path.join(config.workspace_base, inst_id))
container_inst_workspace = str(
os.path.join(config.workspace_mount_path_in_sandbox, inst_id)
)
if os.path.exists(instance_workspace):
shutil.rmtree(instance_workspace)
os.makedirs(instance_workspace, exist_ok=True)
# Set up the logger properly, so you can run multiprocessing to parallel the evaluation
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
if reset_logger:
# Set up logger
log_file = os.path.join(
metadata.eval_output_dir, 'logs', f'instance_{inst_id}.log'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
# add back the console handler to print ONE line
logger.addHandler(get_console_handler())
logger.info(
f'Starting evaluation for instance {inst_id}.\nHint: run "tail -f {log_file}" to see live logs in a separate shell'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
file_handler = logging.FileHandler(log_file)
file_handler.setFormatter(
logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
)
logger.addHandler(file_handler)
log_dir = os.path.join(metadata.eval_output_dir, 'infer_logs')
reset_logger_for_multiprocessing(logger, instance.instance_id, log_dir)
else:
logger.info(f'Starting evaluation for instance {instance.instance_id}.')
# =============================================
# build instruction
@@ -86,104 +197,68 @@ def process_instance(
'Please encapsulate your final answer (answer ONLY) within <solution> and </solution>.\n'
'For example: The answer to the question is <solution> 42 </solution>.\n'
'# Problem \n'
f'{question}\n\n'
f'{instance.description}\n\n'
)
instruction += (
'IMPORTANT: You should ONLY interact with the environment provided '
'to you AND NEVER ASK FOR HUMAN HELP.\n'
)
# NOTE: You can actually set slightly different instruction for different agents
instruction += INST_SUFFIXES[agent.__class__.__name__]
instruction += INST_SUFFIXES[metadata.agent_class]
# =============================================
# create sandbox and run the agent
# =============================================
sandbox = DockerSSHBox(
config=config.sandbox,
persist_sandbox=False,
workspace_mount_path=config.workspace_mount_path,
sandbox_workspace_dir=config.workspace_mount_path_in_sandbox,
cache_dir=config.cache_dir,
run_as_devin=config.run_as_devin,
)
sandbox.execute(f'cd {inst_id}')
runtime: Runtime = await create_runtime(config, sid=instance.instance_id)
init_cmd = instance.init
if init_cmd is not None:
scpt_name = f'{instance.instance_id}_init.sh'
scpt_path = os.path.join(container_inst_workspace, scpt_name)
host_scpt_path = os.path.join(instance_workspace, scpt_name)
create_sh_file(host_scpt_path, init_cmd)
logger.info(f'Running init script: {scpt_path}')
_, init_res = sandbox.execute(scpt_path)
logger.info(f'Init script result: {init_res}')
await initialize_runtime(runtime, instance=instance)
# Here's how you can run the agent (similar to the `main` function) and get the final task state
config.max_iterations = metadata.max_iterations
state: State | None = asyncio.run(
run_controller(
config=config,
task_str=instruction,
fake_user_response_fn=FAKE_RESPONSES[agent.__class__.__name__],
agent=agent,
sandbox=sandbox,
sid=inst_id,
)
state: State | None = await run_controller(
config=config,
task_str=instruction,
runtime=runtime,
fake_user_response_fn=FAKE_RESPONSES[metadata.agent_class],
)
if state is None:
raise ValueError('State should not be None.')
# get the ground truth
# OSBenchSSHBox.get_ground_truth(instance, state)
# =============================================
# result evaluation
# =============================================
agent_answer = ''
get_agent_result_cmd = instance.get_agent_result
if get_agent_result_cmd is not None:
scpt_name = f'{instance.instance_id}_get_agent_result.sh'
scpt_path = os.path.join(container_inst_workspace, scpt_name)
host_scpt_path = os.path.join(instance_workspace, scpt_name)
create_sh_file(host_scpt_path, get_agent_result_cmd)
logger.info(f'Running get agent result cmd: {scpt_path}')
_, agent_answer = sandbox.execute(scpt_path)
else:
return_val = await complete_runtime(runtime, instance)
agent_answer = return_val['agent_answer']
final_ans = return_val['final_ans']
# If the agent answer is not found, retrieve it from the history
if agent_answer is None:
agent_answer = ''
logger.info('Retrieving agent answer from history.')
raw_ans = ''
# retrieve the last agent message or thought
for event in state.history.get_events(reverse=True):
if isinstance(event, MessageAction) and event.source == 'agent':
raw_ans = event.content
elif isinstance(event, CmdRunAction) and event.source == 'agent':
raw_ans = event.thought
if event.source == 'agent':
if isinstance(event, AgentFinishAction):
raw_ans = event.thought
break
elif isinstance(event, MessageAction):
raw_ans = event.content
break
elif isinstance(event, CmdRunAction):
raw_ans = event.thought
break
# parse the answer for a solution tag
agent_answer = re.findall(r'<solution>(.*?)</solution>', raw_ans)
agent_answer = re.findall(r'<solution>(.*?)</solution>', raw_ans, re.DOTALL)
if len(agent_answer) == 0:
logger.warning(f'Failed to parse model answer: {raw_ans}')
agent_answer = raw_ans
else:
agent_answer = agent_answer[0]
final_ans = ''
if instance.ground_truth is not None:
final_ans = instance.ground_truth
else:
get_ground_truth_cmd = instance.get_ground_truth
if get_ground_truth_cmd is not None:
scpt_name = f'{instance.instance_id}_get_ground_truth.sh'
scpt_path = os.path.join(container_inst_workspace, scpt_name)
host_scpt_path = os.path.join(instance_workspace, scpt_name)
create_sh_file(host_scpt_path, get_ground_truth_cmd)
logger.info(f'Running get ground truth cmd: {scpt_path}')
sandbox.execute(f'cd {container_inst_workspace}')
_, final_ans = sandbox.execute(scpt_path)
comparison_method = instance.comparison_method
logger.info(
f'Final message: {agent_answer} | Ground truth: {final_ans} | Comparison method: {comparison_method}'
@@ -198,58 +273,49 @@ def process_instance(
metrics = state.metrics.get() if state.metrics else None
# Save the output
output = {
'instance_id': inst_id,
'instance': instance.to_dict(),
'instruction': instruction,
'metadata': metadata.model_dump(),
'history': histories,
'metrics': metrics,
'error': state.last_error if state and state.last_error else None,
'test_result': {
output = EvalOutput(
instance_id=instance.instance_id,
instance=instance.to_dict(),
instruction=instruction,
metadata=metadata,
history=histories,
metrics=metrics,
error=state.last_error if state and state.last_error else None,
test_result={
'agent_answer': agent_answer,
'final_answer': final_ans,
'check_method': comparison_method,
'result': test_result,
},
}
# clean up
if os.path.exists(instance_workspace):
shutil.rmtree(instance_workspace)
# Close the sandbox
try:
sandbox.close()
except docker.errors.NotFound as e:
logger.error(f'Failed to close sandbox: {e}')
)
return output
if __name__ == '__main__':
id_column = 'instance_id'
args = parse_arguments()
dataset = load_dataset('iFurySt/AgentBench')
agent_bench_tests = dataset['osbench'].to_pandas()
llm_config = get_llm_config_arg(args.llm_config) if args.llm_config else config.llm
logger.info(f'Config for evaluation: {config}')
llm_config = None
if args.llm_config:
llm_config = get_llm_config_arg(args.llm_config)
if llm_config is None:
raise ValueError(f'Could not find LLM config: --llm_config {args.llm_config}')
metadata = make_metadata(
llm_config,
args.dataset_name,
'AgentBench-OS',
args.agent_cls,
args.max_iterations,
args.eval_note,
args.eval_output_dir,
)
output_file = os.path.join(metadata.eval_output_dir, 'output.jsonl')
instances = prepare_dataset(dataset, output_file, args.eval_n_limit, id_column)
instances = prepare_dataset(agent_bench_tests, output_file, args.eval_n_limit)
run_evaluation(
instances,
metadata,
output_file,
args.eval_num_workers,
process_instance,
id_column,
asyncio.run(
run_evaluation(
instances, metadata, output_file, args.eval_num_workers, process_instance
)
)
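For orientation, a rough mental model of what the shared `run_evaluation` driver does with the pieces above. This is an assumption-laden sketch, not the code in `evaluation/utils/shared.py`: the real driver also handles the `--eval-num-workers` pool, progress reporting and resumption, and the `model_dump_json()` call assumes `EvalOutput` is a pydantic model like `EvalMetadata`:

```python
import pandas as pd


async def run_evaluation(dataset: pd.DataFrame, metadata, output_file: str,
                         num_workers: int, process_instance_fn) -> None:
    # Sketch of the single-process path ("use plain for-loop for single process").
    with open(output_file, 'a') as f:
        for _, instance in dataset.iterrows():
            output = await process_instance_fn(instance, metadata, reset_logger=True)
            f.write(output.model_dump_json() + '\n')  # one EvalOutput per line of output.jsonl
```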

evaluation/agent_bench/scripts/run_infer.sh Normal file → Executable file
View File

View File

@@ -2,15 +2,12 @@
Implements evaluation of agents on BioCoder from the BioCoder benchmark introduced in [BioCoder: A Benchmark for Bioinformatics Code Generation with Large Language Models](https://arxiv.org/abs/2308.16458). Please see [here](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/bigcode_eval/tasks/humanevalpack.py) for the reference implementation used in the paper.
## Setup Environment
## Setup Environment and LLM Configuration
Please follow [this document](https://github.com/OpenDevin/OpenDevin/blob/main/Development.md) to setup local develop environment for OpenDevin.
## Configure OpenDevin and your LLM
Create a `config.toml` file if it does not exist at the root of the workspace. Please check [README.md](../../README.md) for how to set this up.
Please follow the instructions [here](../README.md#setup) to set up your local development environment and LLM.
## BioCoder Docker Image
In the opendevin branch of the Biocoder repository, we have slightly modified our original Docker image to work with the OpenDevin environment. In the Docker image are testing scripts (`/testing/start_test_opendevin.py` and aux files in `/testing_files/`) to assist with evaluation. Additionally, we have installed all dependencies, including OpenJDK, mamba (with Python 3.6), and many system libraries. Notably, we have **not** packaged all repositories into the image, so they are downloaded at runtime.
**Before first execution, pull our Docker image with the following command**
@@ -41,12 +38,12 @@ to `CodeActAgent`.
- `eval_limit`, e.g. `10`, limits the evaluation to the first `eval_limit` instances. By default it infers all instances.
Let's say you'd like to run 1 instance using `eval_gpt4o_2024_05_13` and CodeActAgent
with OpenDevin version 0.6.2, then your command would be:
with current OpenDevin version, then your command would be:
## Examples
```bash
./evaluation/biocoder/scripts/run_infer.sh eval_gpt4o_2024_05_13 0.6.2 CodeActAgent 1
./evaluation/biocoder/scripts/run_infer.sh eval_gpt4o_2024_05_13 HEAD CodeActAgent 1
```
## Reference

View File

@@ -1,387 +0,0 @@
import json
import os
import re
import sys
from collections import defaultdict
from dataclasses import dataclass
from datasets import load_dataset
from opendevin.core.config import load_app_config
from opendevin.core.logger import opendevin_logger as logger
from opendevin.runtime.docker.ssh_box import DockerSSHBox
from opendevin.runtime.plugins import (
JupyterRequirement,
PluginRequirement,
SWEAgentCommandsRequirement,
)
config = load_app_config()
BIOCODER_BENCH_CONTAINER_IMAGE = 'public.ecr.aws/i5g0m1f6/eval_biocoder:v1.0'
@dataclass
class BiocoderData:
filePath: str
numLines: int
lineStart: int
lineEnd: int
signature: str
comment: str
content: str
repository: str
promptSummaryOnly: str
contextCode: str
goldenCode: str
test_case_id: str
language: str
def to_dict(self):
return {
'filePath': self.filePath,
'numLines': self.numLines,
'lineStart': self.lineStart,
'lineEnd': self.lineEnd,
'signature': self.signature,
'comment': self.comment,
'content': self.content,
'repository': self.repository,
'promptSummaryOnly': self.promptSummaryOnly,
'contextCode': self.contextCode,
'goldenCode': self.goldenCode,
'test_case_id': self.test_case_id,
'language': self.language,
}
def get_likely_indent_size(array_of_tabs) -> int:
sizes = defaultdict(int)
for i in range(len(array_of_tabs) - 1):
diff = array_of_tabs[i + 1] - array_of_tabs[i]
if diff > 0:
sizes[diff] += 1
if len(sizes) == 0:
return 4
return int(max(sizes, key=sizes.get))
class BiocoderSSHBox(DockerSSHBox):
def __init__(
self,
container_image: str,
timeout: int = 120,
sid: str | None = None,
biocoder_instance_id: str | None = None,
biocoder_instance: BiocoderData | None = None,
skip_workspace_mount: bool = True,
sandbox_plugins: list[PluginRequirement] = [], # noqa: B006
biocoder_cache_folder: str = 'biocoder_cache',
workspace_dir_name: str | None = None,
):
if biocoder_instance_id is None:
raise ValueError('biocoder_instance_id must be provided')
self.biocoder_instance_id = biocoder_instance_id
self.biocoder_instance = biocoder_instance
self.skip_workspace_mount = skip_workspace_mount
self.biocoder_cache_folder = biocoder_cache_folder
self.first_line_after_removed = None
self.workspace_dir_name = workspace_dir_name
self.workspace_base = config.workspace_base
self.workspace_mount_path = config.workspace_mount_path
# self.workspace_dir_name_host = os.path.join(config.workspace_base, workspace_dir_name)
self.context_path = None
self.generated_path = None
self.golden_path = None
assert (
container_image is not None
), 'container_image is required for BiocoderBenchSSHBox!'
super().__init__(container_image, timeout, sid)
self.init_plugins(sandbox_plugins)
@property
def volumes(self):
if self.skip_workspace_mount:
return {
k: v
for k, v in super().volumes.items()
if not v['bind'] == self.sandbox_workspace_dir
}
return super().volumes
def get_target_filepath(self):
target_filepath = os.path.join(
self.workspace_mount_path,
self.biocoder_instance.repository.split('/')[1],
self.biocoder_instance.filePath,
)
return target_filepath
def get_changed_code(self, include_signature=False):
# copies changed code into /testing_files/
# Note that this does NOT copy the function signature
target_filepath = self.get_target_filepath()
selected_lines = []
offset = 1 if include_signature else 0
if self.first_line_after_removed is None:
logger.warning('First line after removed is None')
with open(target_filepath, 'r') as f:
lines = f.read().split('\n')
for i in range(self.biocoder_instance.lineStart - offset, len(lines)):
if lines[i].strip() == self.first_line_after_removed.strip():
break
selected_lines.append(lines[i])
text = '\n'.join(selected_lines)
return text
def copy_changed_code(self):
changed_code = self.get_changed_code(include_signature=True)
with open(self.generated_path, 'w') as f:
f.write(changed_code)
exit_code, output = self.execute_and_check(
f'cp -r /workspace/{self.biocoder_cache_folder}/* /testing_files',
'Failed to copy the files',
)
def remove_code(self):
comment_prefix = {'python': '#', 'java': '//'}
target_filepath = self.get_target_filepath()
line_start = self.biocoder_instance.lineStart
line_end = self.biocoder_instance.lineEnd
with open(target_filepath, 'r') as f:
lines = f.read().split('\n')
# print("="*10+"ORIGINAL"+"="*10)
# print("\n".join(lines))
signature_line = lines[line_start - 1]
# get the number of tabs
def get_indent_size(s: str):
return len(re.match(r'\s*', s).group())
indent_sizes = list(map(get_indent_size, lines))
indent_size = get_likely_indent_size(indent_sizes)
comment_indent_size = get_indent_size(signature_line) + indent_size
lines = (
lines[:line_start]
+ [
f"{' '*comment_indent_size+comment_prefix[self.biocoder_instance.language.lower()]}TODO: replace with your code here"
]
+ ([''] * 2)
+ lines[line_end:]
)
first_line_after_removed_index = line_start
while len(
lines[first_line_after_removed_index].strip()
) == 0 and first_line_after_removed_index < len(lines):
first_line_after_removed_index += 1
self.first_line_after_removed = lines[first_line_after_removed_index]
# print("FIRST LINE AFTER REMOVED: ", self.first_line_after_removed)
with open(target_filepath, 'w') as f:
f.write('\n'.join(lines))
# with open(target_filepath, 'r') as f:
# print("="*10+"MODIFIED"+"="*10)
# print(f.read())
def execute_and_check(self, cmd: str, error_msg: str) -> tuple[int, str]:
exit_code, output = self.execute(cmd)
if exit_code != 0:
logger.error(error_msg)
sys.exit(1)
return exit_code, output
@classmethod
def get_box_for_instance(
cls,
instance,
workspace_dir_name=None,
skip_workspace_mount: bool = False,
workspace_mount_path: str | None = None,
sandbox_plugins: list[PluginRequirement] = [], # noqa: B006
) -> 'BiocoderSSHBox':
"""This method initializes a container image, then runs some initialization commands"""
if workspace_dir_name is None:
workspace_dir_name = f'{instance.repository}__{instance.test_case_id[:10]}__{os.getpid()}'.replace(
'/', '__'
)
workspace_base = str(os.path.join(config.workspace_base, workspace_dir_name))
old_workspace_base = config.workspace_base
old_workspace_mount_path = config.workspace_mount_path
try:
config.workspace_base = workspace_base
config.workspace_mount_path = workspace_base
# linting python after editing helps LLM fix indentations
config.sandbox.enable_auto_lint = True
# create folder for transferring files back/forth
biocoder_cache_folder = 'biocoder_cache'
if not os.path.exists(os.path.join(workspace_base, biocoder_cache_folder)):
os.makedirs(
os.path.join(workspace_base, biocoder_cache_folder), exist_ok=True
)
file_ext = {
'python': 'py',
'java': 'java',
'c': 'c',
'cpp': 'cpp',
'javascript': 'js',
'typescript': 'ts',
}[instance.language.lower()]
context_path = os.path.join(
workspace_base, biocoder_cache_folder, 'context.' + file_ext
)
generated_path = os.path.join(
workspace_base, biocoder_cache_folder, 'generated.' + file_ext
)
golden_path = os.path.join(
workspace_base, biocoder_cache_folder, 'golden.' + file_ext
)
# print(instance.contextCode)
with open(context_path, 'w') as f:
f.write(instance.contextCode)
with open(generated_path, 'w') as f:
f.write(instance.goldenCode)
with open(golden_path, 'w') as f:
f.write(instance.goldenCode)
testcase_json = {
'test_case_id': instance.test_case_id,
'num_cases': 1000,
'language': instance.language.lower(),
}
with open(
os.path.join(
workspace_base, biocoder_cache_folder, 'testcase_biocoder.json'
),
'w',
) as f:
f.write(json.dumps(testcase_json, indent=4))
# linting python after editing helps LLM fix indentations
config.sandbox.enable_auto_lint = True
sandbox = cls(
container_image=BIOCODER_BENCH_CONTAINER_IMAGE,
biocoder_instance_id=instance.test_case_id,
biocoder_instance=instance,
skip_workspace_mount=skip_workspace_mount,
sandbox_plugins=sandbox_plugins,
biocoder_cache_folder=biocoder_cache_folder,
workspace_dir_name=workspace_dir_name,
)
except Exception:
raise
finally:
config.workspace_base = old_workspace_base
config.workspace_mount_path = old_workspace_mount_path
sandbox.context_path = context_path
sandbox.generated_path = generated_path
sandbox.golden_path = golden_path
logger.info(f'SSH box started for instance {instance.test_case_id}.')
# cd to the workspace
exit_code, output = sandbox.execute_and_check(
'cd /workspace', 'Failed to cd to workspace'
)
logger.info(f'cd to workspace: {output}')
# download repository archive
repository_url = f"https://biocoder.lilbillbiscuit.com/repos/{instance.repository.split('/')[1]}.zip"
exit_code, output = sandbox.execute_and_check(
'wget -O repo.zip ' + repository_url, 'Failed to download the repository'
)
logger.info(f'Downloaded the repository: {output}')
exit_code, output = sandbox.execute_and_check(
'unzip -o -q repo.zip', 'Failed to unzip the repository'
)
logger.info(f'Unzipped the repository: {output}')
# copy the context, generated and golden files to the /testing_files folder
exit_code, output = sandbox.execute_and_check(
f'cp -r /workspace/{biocoder_cache_folder}/* /testing_files',
'Failed to copy the files',
)
# chmod 777
exit_code, output = sandbox.execute_and_check(
'chmod -R 777 /workspace',
'Failed to chmod the files',
)
return sandbox
if __name__ == '__main__':
biocoder_dataset = load_dataset('Lilbillbiscuit/biocoder_public')
EXAMPLE_INSTANCE = biocoder_dataset['test'][0]
EXAMPLE_INSTANCE = BiocoderData(**EXAMPLE_INSTANCE)
sandbox = BiocoderSSHBox.get_box_for_instance(
instance=EXAMPLE_INSTANCE,
workspace_mount_path='/home/ubuntu/OpenDevinBioCoder/workspace',
skip_workspace_mount=False,
sandbox_plugins=[JupyterRequirement(), SWEAgentCommandsRequirement()],
)
# PRE TEST
exit_code, output = sandbox.execute_and_check(
'cd /testing',
'Failed to cd /testing',
)
logger.info(f'cd $REPO_PATH: {output}')
exit_code, output = sandbox.execute_and_check(
'whoami',
'Failed to run whoami',
)
logger.info(f'whoami: {output}')
# TEST
exit_code, output = sandbox.execute(
'/home/devin/mambaforge/bin/mamba run -n test python3 /testing/start_test_opendevin.py'
)
assert exit_code == 0, 'Expected exit code 0 (this should have passed)'
logger.info(f'$TEST_CMD:\n{output}')
exit_code, output = sandbox.execute_and_check(
'cat /testing_files/results_biocoder.json', 'Failed to read the result file'
)
print(output)
json_obj = json.loads(output)
if json_obj['result'] == 'pass':
print('PASS')
else:
print('FAIL')
sys.stdout.flush()
try:
while True:
try:
user_input = input('>>> ')
except EOFError:
logger.info('Exiting...')
break
if user_input.lower() == 'exit':
logger.info('Exiting...')
break
exit_code, output = sandbox.execute(user_input)
logger.info('exit code: %d', exit_code)
logger.info(output)
sys.stdout.flush()
except KeyboardInterrupt:
logger.info('Exiting...')
sandbox.close()
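The deleted `BiocoderSSHBox` above is superseded by staging helper files through the runtime API inside `initialize_runtime`, as the AgentBench and BioCoder diffs show. The recurring idiom, condensed into a sketch (`stage_script` is a hypothetical helper; the real code goes through `create_sh_file` and benchmark-specific paths):

```python
import os
import tempfile

from opendevin.events.action import CmdRunAction


async def stage_script(runtime, script_name: str, script_body: str) -> None:
    """Write a helper script locally, copy it into the sandbox, and run it."""
    with tempfile.TemporaryDirectory() as tmpdir:
        host_path = os.path.join(tmpdir, script_name)
        with open(host_path, 'w') as f:
            f.write(script_body)
        await runtime.copy_to(host_path, '/workspace')

    action = CmdRunAction(command=f'cd /workspace && chmod +x ./{script_name} && ./{script_name}')
    obs = await runtime.run_action(action)
    assert obs.exit_code == 0, f'Script failed: {obs.content}'
```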

View File

@@ -1,33 +1,38 @@
import asyncio
import functools
import json
import logging
import os
import pathlib
from functools import partial
import tempfile
from typing import Any
import pandas as pd
from datasets import load_dataset
from evaluation.biocoder.biocoder_env_box import BiocoderData, BiocoderSSHBox
from evaluation.biocoder.utils import BiocoderData
from evaluation.utils.shared import (
EvalMetadata,
EvalOutput,
codeact_user_response,
make_metadata,
prepare_dataset,
reset_logger_for_multiprocessing,
run_evaluation,
)
from opendevin.controller.agent import Agent
from opendevin.controller.state.state import State
from opendevin.core.config import get_llm_config_arg, load_app_config, parse_arguments
from opendevin.core.logger import get_console_handler
from opendevin.core.config import (
AppConfig,
SandboxConfig,
get_llm_config_arg,
parse_arguments,
)
from opendevin.core.logger import opendevin_logger as logger
from opendevin.core.main import run_controller
from opendevin.llm.llm import LLM
config = load_app_config()
from opendevin.core.main import create_runtime, run_controller
from opendevin.events.action import CmdRunAction
from opendevin.events.observation import CmdOutputObservation
from opendevin.runtime.runtime import Runtime
AGENT_CLS_TO_FAKE_USER_RESPONSE_FN = {
'CodeActAgent': partial(
'CodeActAgent': functools.partial(
codeact_user_response, encapsulate_solution=True, try_parse=None
),
}
@@ -36,111 +41,219 @@ AGENT_CLS_TO_INST_SUFFIX = {
'CodeActAgent': 'When you think you have fixed the issue through code changes, please run the following command: <execute_bash> exit </execute_bash>.\n'
}
FILE_EXT_MAP = {
'python': 'py',
'java': 'java',
'c': 'c',
'cpp': 'cpp',
'javascript': 'js',
'typescript': 'ts',
}
def get_config(
metadata: EvalMetadata,
) -> AppConfig:
BIOCODER_BENCH_CONTAINER_IMAGE = 'public.ecr.aws/i5g0m1f6/eval_biocoder:v1.0'
config = AppConfig(
default_agent=metadata.agent_class,
run_as_devin=False,
runtime='eventstream',
max_iterations=metadata.max_iterations,
sandbox=SandboxConfig(
container_image=BIOCODER_BENCH_CONTAINER_IMAGE,
enable_auto_lint=True,
use_host_network=False,
update_source_code=True,
),
# do not mount workspace
workspace_base=None,
workspace_mount_path=None,
)
config.set_llm_config(metadata.llm_config)
return config
async def initialize_runtime(
runtime: Runtime,
instance: BiocoderData, # this argument is not required
):
"""Initialize the runtime for the agent.
This function is called before the runtime is used to run the agent.
"""
logger.info(f"{'-' * 50} BEGIN Runtime Initialization Fn {'-' * 50}")
obs: CmdOutputObservation
file_ext = FILE_EXT_MAP[instance.language.lower()]
action = CmdRunAction(command='mkdir -p /workspace && mkdir -p /testing_files')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
with tempfile.TemporaryDirectory() as tmpdir:
context_path = os.path.join(tmpdir, 'context.' + file_ext)
with open(context_path, 'w') as f:
f.write(instance.contextCode)
await runtime.copy_to(context_path, '/testing_files')
golden_path = os.path.join(tmpdir, 'golden.' + file_ext)
with open(golden_path, 'w') as f:
f.write(instance.goldenCode)
await runtime.copy_to(golden_path, '/testing_files')
testcase_json = {
'test_case_id': instance.test_case_id,
'num_cases': 1000,
'language': instance.language.lower(),
}
testcase_path = os.path.join(tmpdir, 'testcase_biocoder.json')
with open(testcase_path, 'w') as f:
f.write(json.dumps(testcase_json, indent=4))
await runtime.copy_to(testcase_path, '/testing_files')
# setup paths
remove_code_script = os.path.join(
os.path.dirname(__file__), 'scripts', 'setup', 'remove_code.py'
)
await runtime.copy_to(remove_code_script, '/testing_files')
action = CmdRunAction(command='cd /workspace')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
# download repository archive
repository_url = f"https://biocoder.lilbillbiscuit.com/repos/{instance.repository.split('/')[1]}.zip"
action = CmdRunAction(command='wget -O repo.zip ' + repository_url)
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0, f'Failed to download the repository: {obs.content}'
# unzip the repository
action = CmdRunAction(command='unzip -o -q repo.zip && rm repo.zip')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0, f'Failed to unzip the repository: {obs.content}'
# chmod 777
action = CmdRunAction(command='chmod -R 777 /workspace')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0, f'Failed to chmod the files: {obs.content}'
# remove code for evaluation instance
target_filepath = os.path.join(
'/workspace', instance.repository.split('/')[1], instance.filePath
)
line_start = instance.lineStart
line_end = instance.lineEnd
language = instance.language.lower()
action = CmdRunAction(
command=f'python3 /testing_files/remove_code.py --target_filepath {target_filepath} --line_start {line_start} --line_end {line_end} --language {language}'
)
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0, f'Failed to remove the code: {obs.content}'
logger.info(f"{'-' * 50} END Runtime Initialization Fn {'-' * 50}")
async def complete_runtime(
runtime: Runtime,
instance: pd.Series, # this argument is not required, but it is used to get the workspace_dir_name
) -> dict[str, Any]:
"""Complete the runtime for the agent.
This function is called after the agent has finished running.
If you need to do something in the sandbox to get the correctness metric after
the agent has run, modify this function.
"""
logger.info(f"{'-' * 50} BEGIN Runtime Completion Fn {'-' * 50}")
obs: CmdOutputObservation
def get_test_result(instance, sandbox, workspace_dir_name):
test_result = {'result': {}, 'metadata': {}}
try:
code = sandbox.get_changed_code(include_signature=True)
sandbox.copy_changed_code()
copy_changed_code_script = os.path.join(
os.path.dirname(__file__), 'scripts', 'setup', 'copy_changed_code.py'
)
await runtime.copy_to(copy_changed_code_script, '/testing_files')
file_ext = FILE_EXT_MAP[instance.language.lower()]
target_filepath = os.path.join(
'/workspace', instance.repository.split('/')[1], instance.filePath
)
generated_path = os.path.join('/testing_files', 'generated.' + file_ext)
action = CmdRunAction(
command=f'python3 /testing_files/copy_changed_code.py --target_filepath {target_filepath} --generated_code_filepath {generated_path} --line_start {instance.lineStart} --include_signature'
)
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
if obs.exit_code == 0:
test_result['metadata']['1_copy_change_success'] = True
action = CmdRunAction(command=f'cat {generated_path}', keep_prompt=False)
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
code = obs.content
test_result['metadata']['1_copy_change_code'] = code
except Exception:
logger.error('Error fetching changed code for this instance')
else:
test_result['metadata']['1_copy_change_success'] = False
test_result['metadata']['1_copy_change_code'] = None
exit_code, output = sandbox.execute_and_check(
'cd /testing',
'Failed to cd /testing',
)
logger.info(f'cd $REPO_PATH: {output}')
action = CmdRunAction(command='cd /testing_files')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
exit_code, output = sandbox.execute_and_check(
'whoami',
'Failed to run whoami',
action = CmdRunAction(
command='/home/devin/mambaforge/bin/mamba run -n test python3 /testing/start_test_opendevin.py'
)
logger.info(f'whoami: {output}')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
assert obs.exit_code == 0
exit_code, output = sandbox.execute(
'/home/devin/mambaforge/bin/mamba run -n test python3 /testing/start_test_opendevin.py'
action = CmdRunAction(
command='cat /testing_files/results_biocoder.json', keep_prompt=False
)
logger.info(f'$TEST_CMD:\n{output}')
exit_code, output = sandbox.execute_and_check(
'cat /testing_files/results_biocoder.json', 'Failed to read the result file'
)
if exit_code == 0:
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
if obs.exit_code == 0:
test_result['metadata']['2_run_test_success'] = True
test_result['metadata']['2_run_test_result'] = str(output)
test_result['metadata']['2_run_test_result'] = str(obs.content)
json_obj = json.loads(obs.content)
test_result['result'] = json_obj['result']
else:
test_result['metadata']['2_run_test_success'] = False
test_result['metadata']['2_run_test_result'] = str(output)
json_obj = json.loads(output)
test_result['result'] = json_obj['result']
test_result['metadata']['2_run_test_result'] = str(obs.content)
logger.info(f"{'-' * 50} END Runtime Completion Fn {'-' * 50}")
return test_result
def process_instance(
async def process_instance(
instance: pd.Series,
metadata: EvalMetadata,
reset_logger: bool = True,
):
# Create the agent
agent = Agent.get_cls(metadata.agent_class)(llm=LLM(config=metadata.llm_config))
) -> EvalOutput:
config = get_config(metadata)
instance = BiocoderData(**instance)
print(instance)
workspace_dir_name = (
f'{instance.repository}__{instance.test_case_id[:10]}__{os.getpid()}'.replace(
'/', '__'
)
)
workspace_mount_path = os.path.join(config.workspace_base, workspace_dir_name)
# create process-specific workspace dir
# if `not skip_workspace_mount` - we will create a workspace directory for EACH process
# so that different agent don't interfere with each other.
workspace_mount_path = os.path.join(workspace_mount_path, str(os.getpid()))
pathlib.Path(workspace_mount_path).mkdir(parents=True, exist_ok=True)
instance_id = f'{instance.repository}__{instance.instance_id[:10]}'
# Setup the logger properly, so you can run multi-processing to parallize the evaluation
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
if reset_logger:
# Set up logger
log_file = os.path.join(
metadata.eval_output_dir, 'logs', f'instance_{instance.test_case_id}.log'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
# add back the console handler to print ONE line
logger.addHandler(get_console_handler())
logger.info(
f'Starting evaluation for instance {instance.test_case_id}.\nHint: run "tail -f {log_file}" to see live logs in a seperate shell'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
file_handler = logging.FileHandler(log_file)
file_handler.setFormatter(
logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
)
logger.addHandler(file_handler)
logger.info(f'Process-specific workspace mounted at {workspace_mount_path}')
# NOTE: this is something special we do for SWE-Bench due to the reason described in the previous section
# You can omit this if you don't need to setup specialized sandbox
workspace_dir_name = f'{instance.repository}__{instance.test_case_id[:10]}'.replace(
'/', '__'
)
sandbox = BiocoderSSHBox.get_box_for_instance(
instance,
workspace_dir_name,
skip_workspace_mount=False,
workspace_mount_path=workspace_mount_path,
sandbox_plugins=agent.sandbox_plugins,
)
sandbox.remove_code()
log_dir = os.path.join(metadata.eval_output_dir, 'infer_logs')
reset_logger_for_multiprocessing(logger, instance_id, log_dir)
else:
logger.info(f'Starting evaluation for instance {instance_id}.')
# Prepare instruction
instruction = (
@ -160,80 +273,76 @@ def process_instance(
'Make sure to include proper formatting in Java and Python, including correct braces and/or indentation.\n'
)
# NOTE: You can actually set slightly different instruction for different agents
instruction += AGENT_CLS_TO_INST_SUFFIX[agent.__class__.__name__]
instruction += AGENT_CLS_TO_INST_SUFFIX[metadata.agent_class]
# use a session id for concurrent evaluation
sid = instance.test_case_id.replace('/', '__')
sid = instance.instance_id.replace('/', '__')
runtime = await create_runtime(config, sid=sid)
await initialize_runtime(runtime, instance)
# Here's how you can run the agent (similar to the `main` function) and get the final task state
config.max_iterations = metadata.max_iterations
state: State | None = asyncio.run(
run_controller(
config=config,
task_str=instruction,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN[
agent.__class__.__name__
],
agent=agent,
sandbox=sandbox,
sid=sid,
)
state: State | None = await run_controller(
config=config,
task_str=instruction,
runtime=runtime,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN[metadata.agent_class],
)
test_result = get_test_result(instance, sandbox, workspace_dir_name)
if state is None:
raise ValueError('State should not be None.')
metrics = state.metrics.get() if state.metrics else None
test_result = await complete_runtime(runtime, instance)
metrics = state.metrics.get() if state.metrics else None
# history is now available as a stream of events, rather than list of pairs of (Action, Observation)
# for compatibility with the existing output format, we can remake the pairs here
# remove when it becomes unnecessary
histories = state.history.compatibility_for_eval_history_pairs()
# Save the output
output = {
'test_case_id': instance.test_case_id,
'biocoder_instance': instance.to_dict(),
'instruction': instruction,
'generated': test_result['metadata']['1_copy_change_code'],
'metadata': metadata.model_dump(),
'history': histories,
'metrics': metrics,
'error': state.last_error if state and state.last_error else None,
'test_result': test_result,
}
test_result['generated'] = test_result['metadata']['1_copy_change_code']
# Close the sandbox
sandbox.close()
# Save the output
output = EvalOutput(
instance_id=instance.instance_id,
instance=instance.to_dict(),
instruction=instruction,
metadata=metadata,
history=histories,
metrics=metrics,
error=state.last_error if state and state.last_error else None,
test_result=test_result,
)
return output
if __name__ == '__main__':
id_column = 'test_case_id'
args = parse_arguments()
dataset = load_dataset('lilbillbiscuit/biocoder_public')
biocoder_tests = dataset['test'].to_pandas()
llm_config = get_llm_config_arg(args.llm_config) if args.llm_config else config.llm
logger.info(f'Config for evaluation: {config}')
dataset = load_dataset('lilbillbiscuit/biocoder_public')
biocoder_tests = dataset['train'].to_pandas()
biocoder_tests['instance_id'] = biocoder_tests['test_case_id']
llm_config = None
if args.llm_config:
llm_config = get_llm_config_arg(args.llm_config)
if llm_config is None:
raise ValueError(f'Could not find LLM config: --llm_config {args.llm_config}')
metadata = make_metadata(
llm_config,
args.dataset_name,
'biocoder',
args.agent_cls,
args.max_iterations,
args.eval_note,
args.eval_output_dir,
)
output_file = os.path.join(metadata.eval_output_dir, 'output.jsonl')
instances = prepare_dataset(dataset, output_file, args.eval_n_limit, id_column)
instances = prepare_dataset(biocoder_tests, output_file, args.eval_n_limit)
run_evaluation(
instances,
metadata,
output_file,
args.eval_num_workers,
process_instance,
id_column,
asyncio.run(
run_evaluation(
instances, metadata, output_file, args.eval_num_workers, process_instance
)
)
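The biocoder harness above now follows the same per-instance pattern as the other refactored benchmarks. As a reading aid only, here is a minimal sketch of that flow using the helpers defined in this file; the instruction string is a placeholder and this snippet is not part of the diff:

```python
# Sketch of the refactored per-instance flow (not part of the PR); assumes the
# imports and helpers from run_infer.py above, and a made-up instruction string.
async def run_single_instance(instance, metadata) -> dict:
    config = get_config(metadata)                    # fresh AppConfig, no global state
    sid = instance.instance_id.replace('/', '__')    # per-instance session id
    runtime = await create_runtime(config, sid=sid)  # EventStreamRuntime container
    await initialize_runtime(runtime, instance)      # copy code/data into the sandbox
    state = await run_controller(
        config=config,
        task_str='<instruction built from the instance>',
        runtime=runtime,
        fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN[metadata.agent_class],
    )
    # scoring happens inside the sandbox after the agent is done
    return await complete_runtime(runtime, instance)
```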

evaluation/biocoder/scripts/run_infer.sh Normal file → Executable file

@ -0,0 +1,45 @@
import argparse
def get_changed_code(target_filepath, line_start, include_signature=False):
# copies changed code into /testing_files/
# Note that this does NOT copy the function signature
selected_lines = []
offset = 1 if include_signature else 0
with open('/testing_files/first_line_after_removed.txt', 'r') as f:
first_line_after_removed = f.read()
if first_line_after_removed is None:
print('First line after removed is None')
with open(target_filepath, 'r') as f:
lines = f.read().split('\n')
for i in range(line_start - offset, len(lines)):
if lines[i].strip() == first_line_after_removed.strip():
break
selected_lines.append(lines[i])
text = '\n'.join(selected_lines)
return text
def copy_changed_code(
target_filepath, generated_code_filepath, line_start, include_signature=False
):
changed_code = get_changed_code(target_filepath, line_start, include_signature)
with open(generated_code_filepath, 'w') as f:
f.write(changed_code)
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('--target_filepath', type=str, required=True)
parser.add_argument('--generated_code_filepath', type=str, required=True)
parser.add_argument('--line_start', type=int, required=True)
parser.add_argument('--include_signature', action='store_true')
args = parser.parse_args()
copy_changed_code(
args.target_filepath,
args.generated_code_filepath,
args.line_start,
args.include_signature,
)
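To make the selection logic in `get_changed_code` concrete, the following self-contained snippet re-runs the same loop on invented file contents (no sandbox files are read; it is an illustration, not part of the script):

```python
# Invented example: the agent wrote add_one(), and the setup script previously
# recorded 'def unrelated():' as the first line after the removed region.
lines = ['def add_one(x):', '    return x + 1', '', 'def unrelated():']
first_line_after_removed = 'def unrelated():'
line_start, include_signature = 1, True  # signature is on line 1 (1-indexed)

selected_lines = []
offset = 1 if include_signature else 0
for i in range(line_start - offset, len(lines)):
    if lines[i].strip() == first_line_after_removed.strip():
        break
    selected_lines.append(lines[i])

print('\n'.join(selected_lines))  # the generated function plus its trailing blank line
```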


@ -0,0 +1,74 @@
import argparse
import os
import re
from collections import defaultdict
def get_likely_indent_size(array_of_tabs) -> int:
sizes = defaultdict(int)
for i in range(len(array_of_tabs) - 1):
diff = array_of_tabs[i + 1] - array_of_tabs[i]
if diff > 0:
sizes[diff] += 1
if len(sizes) == 0:
return 4
return int(max(sizes, key=sizes.get))
def get_target_filepath(self):
target_filepath = os.path.join(
self.workspace_mount_path,
self.biocoder_instance.repository.split('/')[1],
self.biocoder_instance.filePath,
)
return target_filepath
def remove_code(target_filepath: str, line_start: int, line_end: int, language: str):
comment_prefix = {'python': '#', 'java': '//'}
with open(target_filepath, 'r') as f:
lines = f.read().split('\n')
# print("="*10+"ORIGINAL"+"="*10)
# print("\n".join(lines))
signature_line = lines[line_start - 1]
# get the number of tabs
def get_indent_size(s: str):
return len(re.match(r'\s*', s).group())
indent_sizes = list(map(get_indent_size, lines))
indent_size = get_likely_indent_size(indent_sizes)
comment_indent_size = get_indent_size(signature_line) + indent_size
lines = (
lines[:line_start]
+ [
f"{' '*comment_indent_size+comment_prefix[language.lower()]}TODO: replace with your code here"
]
+ ([''] * 2)
+ lines[line_end:]
)
first_line_after_removed_index = line_start
while len(
lines[first_line_after_removed_index].strip()
) == 0 and first_line_after_removed_index < len(lines):
first_line_after_removed_index += 1
first_line_after_removed = lines[first_line_after_removed_index]
print('FIRST LINE AFTER REMOVED: ', first_line_after_removed)
with open('/testing_files/first_line_after_removed.txt', 'w') as f:
f.write(first_line_after_removed)
with open(target_filepath, 'w') as f:
f.write('\n'.join(lines))
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('--target_filepath', type=str, required=True)
parser.add_argument('--line_start', type=int, required=True)
parser.add_argument('--line_end', type=int, required=True)
parser.add_argument('--language', type=str, required=True)
args = parser.parse_args()
remove_code(args.target_filepath, args.line_start, args.line_end, args.language)
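A quick, invented example of the indentation inference above (assuming `get_likely_indent_size` from this script is importable):

```python
# Leading-whitespace widths for a file indented with 4 spaces (values invented).
indents = [0, 4, 8, 8, 4, 0]
# Positive diffs between consecutive lines are 4 (0->4) and 4 (4->8),
# so sizes == {4: 2} and the most common positive jump wins.
print(get_likely_indent_size(indents))  # -> 4
```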


@ -0,0 +1,36 @@
from dataclasses import dataclass
@dataclass
class BiocoderData:
instance_id: str
filePath: str
numLines: int
lineStart: int
lineEnd: int
signature: str
comment: str
content: str
repository: str
promptSummaryOnly: str
contextCode: str
goldenCode: str
test_case_id: str
language: str
def to_dict(self):
return {
'filePath': self.filePath,
'numLines': self.numLines,
'lineStart': self.lineStart,
'lineEnd': self.lineEnd,
'signature': self.signature,
'comment': self.comment,
'content': self.content,
'repository': self.repository,
'promptSummaryOnly': self.promptSummaryOnly,
'contextCode': self.contextCode,
'goldenCode': self.goldenCode,
'test_case_id': self.test_case_id,
'language': self.language,
}
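For reference, this mirrors how run_infer.py consumes a dataset row via `BiocoderData(**instance)`; the field values below are invented:

```python
# Invented row purely for illustration; real values come from the dataset.
row = {
    'instance_id': 'org__repoA__abc123', 'filePath': 'src/module.py', 'numLines': 12,
    'lineStart': 10, 'lineEnd': 22, 'signature': 'def add_one(x):', 'comment': '',
    'content': '', 'repository': 'org/repoA', 'promptSummaryOnly': '',
    'contextCode': '', 'goldenCode': '', 'test_case_id': 'abc123', 'language': 'python',
}
instance = BiocoderData(**row)
print(instance.to_dict()['filePath'])  # -> 'src/module.py'
```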


@ -2,43 +2,14 @@
Implements evaluation of agents on BIRD introduced in [Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs](https://arxiv.org/abs/2305.03111). Please see [here](https://bird-bench.github.io/) for the reference implementation used in the paper.
## Setup Environment
## Setup Environment and LLM Configuration
Please follow [this document](https://github.com/OpenDevin/OpenDevin/blob/main/Development.md) to setup local develop environment for OpenDevin.
## Configure OpenDevin and your LLM
Create a `config.toml` file if it does not exist at the root of the workspace.
Add the following configurations:
```toml
[core]
max_iterations = 100
cache_dir = "/tmp/cache"
ssh_hostname = "localhost"
[sandbox]
enable_auto_lint = true
# TODO: Change these to the model you want to evaluate
[llm.eval_gpt4_1106_preview]
model = "gpt-4-1106-preview"
api_key = "XXX"
temperature = 0.0
[llm.eval_some_openai_compatible_model]
model = "openai/MODEL_NAME"
base_url = "https://OPENAI_COMPATIBLE_URL/v1"
api_key = "XXX"
temperature = 0.0
```
Please follow the instructions [here](../README.md#setup) to set up your local development environment and LLM.
## Run Inference on Bird
```bash
./evaluation/bird/scripts/run_infer.sh eval_gpt4_1106_preview [model_config] [git-version]
./evaluation/bird/scripts/run_infer.sh [model_config] [git-version]
```
- `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for your


@ -1,12 +1,12 @@
import asyncio
import json
import logging
import os
import pathlib
import re
import shutil
import sqlite3
import subprocess
import zipfile
from typing import Any
import pandas as pd
from datasets import load_dataset
@ -15,20 +15,24 @@ from tqdm import tqdm
from evaluation.utils.shared import (
EvalMetadata,
EvalOutput,
make_metadata,
prepare_dataset,
reset_logger_for_multiprocessing,
run_evaluation,
)
from opendevin.controller.agent import Agent
from opendevin.controller.state.state import State
from opendevin.core.config import get_llm_config_arg, load_app_config, parse_arguments
from opendevin.core.logger import get_console_handler
from opendevin.core.config import (
AppConfig,
SandboxConfig,
get_llm_config_arg,
parse_arguments,
)
from opendevin.core.logger import opendevin_logger as logger
from opendevin.core.main import run_controller
from opendevin.events.action import MessageAction
from opendevin.llm.llm import LLM
config = load_app_config()
from opendevin.core.main import create_runtime, run_controller
from opendevin.events.action import CmdRunAction, MessageAction
from opendevin.events.observation import CmdOutputObservation
from opendevin.runtime.runtime import Runtime
def codeact_user_response(state: State) -> str:
@ -62,6 +66,28 @@ AGENT_CLS_TO_INST_SUFFIX = {
}
def get_config(
metadata: EvalMetadata,
) -> AppConfig:
config = AppConfig(
default_agent=metadata.agent_class,
run_as_devin=False,
runtime='eventstream',
max_iterations=metadata.max_iterations,
sandbox=SandboxConfig(
container_image='ubuntu:22.04',
enable_auto_lint=True,
use_host_network=False,
update_source_code=True,
),
# do not mount workspace
workspace_base=None,
workspace_mount_path=None,
)
config.set_llm_config(metadata.llm_config)
return config
def execute_sql(db_path, gen_sql, gold_sql):
"""Execute the generated SQL and the ground truth SQL and compare the results."""
with sqlite3.connect(db_path) as conn:
@ -76,12 +102,213 @@ def execute_sql(db_path, gen_sql, gold_sql):
return res
def get_test_result(instance, path, timeout=30):
LOCAL_DATASET_PATH = os.path.join(os.path.dirname(__file__), 'data')
def load_bird():
"""Main function to handle the flow of downloading, processing, and loading the bird dataset."""
def _download_bird():
"""Downloads and extracts the bird dataset from a specified URL into a local directory."""
devset_path = os.path.join(LOCAL_DATASET_PATH, 'dev')
if not os.path.exists(devset_path):
logger.info(
f'{LOCAL_DATASET_PATH} folder does not exist, starting download and extraction...'
)
os.makedirs(LOCAL_DATASET_PATH, exist_ok=True)
download_url = 'https://bird-bench.oss-cn-beijing.aliyuncs.com/dev.zip'
download_path = os.path.join(LOCAL_DATASET_PATH, 'dev.zip')
if not os.path.exists(download_path):
logger.info('Start Downloading...')
subprocess.run(['wget', download_url, '-O', download_path])
logger.info('Download completed.')
devset_path = os.path.join(LOCAL_DATASET_PATH, 'dev')
if not os.path.exists(devset_path):
logger.info('Start Extracting...')
os.makedirs(devset_path, exist_ok=True)
with zipfile.ZipFile(download_path, 'r') as zip_ref:
zip_ref.extractall(devset_path)
# move everything in 'dev_20240627' to the root folder
for file in os.listdir(os.path.join(devset_path, 'dev_20240627')):
os.rename(
os.path.join(devset_path, 'dev_20240627', file),
os.path.join(devset_path, file),
)
os.rmdir(os.path.join(devset_path, 'dev_20240627'))
logger.info('Extraction completed.')
# extract databases
database_path = os.path.join(devset_path, 'dev_databases.zip')
assert os.path.exists(database_path)
logger.info('Start Extracting...')
with zipfile.ZipFile(database_path, 'r') as zip_ref:
zip_ref.extractall(devset_path)
logger.info('Extraction completed.')
else:
logger.info(f'{LOCAL_DATASET_PATH} folder already exists.')
return devset_path
def _extract_create_table_prompt(db_path, limit_value=0):
"""Generates a SQL prompt with CREATE TABLE statements and sample data from the database."""
table_query = "SELECT * FROM sqlite_master WHERE type='table';"
tables = sqlite3.connect(db_path).cursor().execute(table_query).fetchall()
prompt = ''
for table in tables:
table_name = table[1]
create_table_statement = table[-1]
table_info_query = f'PRAGMA table_info(`{table_name}`);'
top_k_row_query = f'SELECT * FROM {table_name} LIMIT {limit_value};'
try:
headers = [
x[1]
for x in sqlite3.connect(db_path)
.cursor()
.execute(table_info_query)
.fetchall()
]
except Exception:
logger.error(f'Error Connection: {table_info_query}, {top_k_row_query}')
exit(0)
prompt += create_table_statement + ';\n'
if limit_value > 0:
top_k_rows = (
sqlite3.connect(db_path)
.cursor()
.execute(top_k_row_query)
.fetchall()
)
prompt += (
f"/*\n3 example rows:\n{top_k_row_query}\n{' '.join(headers)}\n"
)
for row in top_k_rows:
row = [str(x) for x in row]
row = [x if x is not None else '' for x in row]
prompt += ' '.join(row) + '\n'
prompt += '*/\n'
prompt += '\n'
return prompt
def _create_prompt(e, database_path):
"""Create a prompt for the given example"""
db_id = e['db_id']
db_path = pathlib.Path(database_path) / db_id / f'{db_id}.sqlite'
# Extract the CREATE TABLE statements and sample data from the database
prompt = _extract_create_table_prompt(db_path)
prompt += f"-- External Knowledge: {e['evidence']}\n\n"
prompt += '-- Using valid SQLite and understanding External Knowledge, answer the following questions for the tables provided above.\n\n'
prompt += '-- Using valid SQLite, answer the following questions for the tables provided above.\n'
prompt += f"Question: {e['question']}\n"
return prompt
def _process_bird(dataset_path):
"""Processes the raw bird dataset into a structured format and saves it as JSON."""
processed_path = os.path.join(LOCAL_DATASET_PATH, 'dev', 'processed_dev.json')
if not os.path.exists(processed_path):
logger.info(
f'{processed_path} folder does not exist, starting processing...'
)
raw_data_path = os.path.join(LOCAL_DATASET_PATH, 'dev', 'dev.json')
database_path = os.path.join(LOCAL_DATASET_PATH, 'dev', 'dev_databases')
processed_data = []
with pathlib.Path(raw_data_path).open('r') as f:
data = json.load(f)
for e in tqdm(data):
item = {
'instance_id': f'{len(processed_data)}',
'db_path': os.path.join(
database_path, e['db_id'], f"{e['db_id']}.sqlite"
),
'db_id': e['db_id'],
'instruction': _create_prompt(e, database_path),
'SQL': e['SQL'],
}
processed_data.append(item)
with pathlib.Path(processed_path).open('w') as f:
json.dump(processed_data, f, indent=2)
logger.info(f'Processed data saved to {processed_path}')
else:
logger.info(f'{processed_path} folder already exists.')
bird_dataset = load_dataset('json', data_files={'test': processed_path})
return bird_dataset
raw_dataset_path = _download_bird()
bird_dataset = _process_bird(raw_dataset_path)
return bird_dataset
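For orientation, each processed BIRD example produced by `_process_bird` above is a flat dict; a schematic (and invented) instance looks like this:

```python
# Schematic of one processed item (all values invented/abridged).
example_item = {
    'instance_id': '0',
    'db_path': 'evaluation/bird/data/dev/dev_databases/some_db/some_db.sqlite',
    'db_id': 'some_db',
    # instruction = CREATE TABLE statements + external knowledge + question,
    # as assembled by _create_prompt above
    'instruction': 'CREATE TABLE t (...);\n-- External Knowledge: ...\nQuestion: ...',
    'SQL': 'SELECT ...',  # gold SQL used later for scoring
}
```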
async def initialize_runtime(
runtime: Runtime,
instance: pd.Series, # this argument is not required
):
"""Initialize the runtime for the agent.
This function is called before the runtime is used to run the agent.
"""
logger.info(f"{'-' * 50} BEGIN Runtime Initialization Fn {'-' * 50}")
obs: CmdOutputObservation
# Copy the database to the workspace
db_file = os.path.join(
LOCAL_DATASET_PATH,
'dev',
'dev_databases',
instance.db_id,
f'{instance.db_id}.sqlite',
)
await runtime.copy_to(db_file, '/workspace')
# Check the database is copied
action = CmdRunAction(
command='cd /workspace && ls -l',
keep_prompt=False,
)
obs = await runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
assert obs.exit_code == 0
assert f'{instance.db_id}.sqlite' in obs.content
logger.info(f"{'-' * 50} END Runtime Initialization Fn {'-' * 50}")
async def complete_runtime(
runtime: Runtime,
instance: pd.Series, # this argument is not required, but it is used to get the workspace_dir_name
) -> dict[str, Any]:
"""Complete the runtime for the agent.
This function is called after the agent has finished running, just before the runtime is closed.
If you need to do something in the sandbox to get the correctness metric after
the agent has run, modify this function.
"""
logger.info(f"{'-' * 50} BEGIN Runtime Completion Fn {'-' * 50}")
obs: CmdOutputObservation
timeout = 30
test_result = {'result': {}, 'metadata': {}}
# Read the generated python file
with open(path, 'r') as f:
gen_file = f.read()
instance_id = instance.instance_id.replace('/', '__')
path = os.path.join('/workspace', f'{instance_id}.py')
action = CmdRunAction(
command=f'cat {path}',
keep_prompt=False,
)
obs = await runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
if obs.exit_code != 0:
test_result['result'] = {'passed': 0, 'status': 'error'}
return test_result
gen_file = obs.content.strip().replace('\r\n', '\n')
# Extract the SQL from the python file
gen_sql = ''
@ -96,7 +323,13 @@ def get_test_result(instance, path, timeout=30):
# Execute the SQL
try:
res = func_timeout(
timeout, execute_sql, args=(instance.db_path, gen_sql, gold_sql)
timeout,
execute_sql,
args=(
instance.db_path,
gen_sql,
gold_sql,
),
)
status = 'success'
except FunctionTimedOut:
@ -114,68 +347,28 @@ def get_test_result(instance, path, timeout=30):
'gen_sql': gen_sql,
'gold_sql': gold_sql,
}
logger.info(f"{'-' * 50} END Runtime Completion Fn {'-' * 50}")
return test_result
def process_instance(
async def process_instance(
instance: pd.Series,
metadata: EvalMetadata,
reset_logger: bool = True,
):
# Create the agent
agent = Agent.get_cls(metadata.agent_class)(llm=LLM(config=metadata.llm_config))
workspace_mount_path = os.path.join(
config.workspace_mount_path, 'bird_eval_workspace'
)
workspace_mount_path = os.path.join(workspace_mount_path, str(os.getpid()))
pathlib.Path(workspace_mount_path).mkdir(parents=True, exist_ok=True)
# reset workspace to config
config.workspace_mount_path = workspace_mount_path
# Copy the database to the workspace
db_root = os.path.join(
config.workspace_base, 'evaluation_bird/dev/dev_databases', instance.db_id
)
target_path = os.path.join(workspace_mount_path, f'{instance.db_id}')
if not os.path.exists(target_path):
logger.info(f'Copying database from {db_root} to {target_path}...')
shutil.copytree(db_root, target_path)
# Set up the database path
database_path = os.path.join(instance.db_id, f'{instance.db_id}.sqlite')
) -> EvalOutput:
config = get_config(metadata)
# use session id for concurrent evaluation
sid = instance.task_id.replace('/', '__')
instance_id = instance.instance_id.replace('/', '__')
# Set up the logger properly, so you can run multi-processing to parallelize the evaluation
if reset_logger:
# Set up logger
log_file = os.path.join(
metadata.eval_output_dir,
'logs',
f'instance_{sid}.log',
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
# add back the console handler to print ONE line
logger.addHandler(get_console_handler())
logger.info(
f'Starting evaluation for instance {instance.task_id}.\nLOG: tail -f {log_file}'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
file_handler = logging.FileHandler(log_file)
file_handler.setFormatter(
logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
)
logger.addHandler(file_handler)
logger.info(f'Process-specific workspace mounted at {workspace_mount_path}')
log_dir = os.path.join(metadata.eval_output_dir, 'infer_logs')
reset_logger_for_multiprocessing(logger, instance_id, log_dir)
else:
logger.info(f'Starting evaluation for instance {instance_id}.')
# Create file with BIRD instance
database_path = os.path.join('/workspace', f'{instance.db_id}.sqlite')
statements = f"""
import sqlite3
def execute_sql(db_path, sql):
@ -192,12 +385,12 @@ def process_instance(
result = execute_sql(db_path, sql)
print(result)
"""
path = os.path.join(config.workspace_mount_path, f'{sid}.py')
instruction = (
f'You are a SQL expert and need to complete the following text-to-SQL tasks.'
f'\n\n{instance.instruction}\n\n'
'Please write the SQL in one line without line breaks.'
f'And write a new python file named {sid}.py to call the SQL you wrote.'
f'And write a new python file named {instance_id}.py to call the SQL you wrote.'
'You need to follow the code template below:'
f'\n\n{statements}\n\n'
'Environment has been set up for you to start working.'
@ -208,24 +401,21 @@ def process_instance(
'You SHOULD INCLUDE PROPER INDENTATION in your edit commands.\n'
)
# NOTE: You can actually set slightly different instruction for different agents
instruction += AGENT_CLS_TO_INST_SUFFIX[agent.__class__.__name__]
instruction += AGENT_CLS_TO_INST_SUFFIX[metadata.agent_class]
runtime = await create_runtime(config, sid=instance_id)
await initialize_runtime(runtime, instance)
# Here's how you can run the agent (similar to the `main` function) and get the final task state
config.max_iterations = metadata.max_iterations
state: State | None = asyncio.run(
run_controller(
config=config,
task_str=instruction,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN[
agent.__class__.__name__
],
agent=agent,
sid=sid,
)
state: State | None = await run_controller(
config=config,
task_str=instruction,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN[metadata.agent_class],
runtime=runtime,
)
# ======= Attempt to evaluate the agent's edits =======
test_result = get_test_result(instance, path)
test_result = await complete_runtime(runtime, instance)
# If you are working on some simpler benchmark that only evaluates the final model output (e.g., in a MessageAction)
# You can simply get the LAST `MessageAction` from the returned `state.history` and parse it for evaluation.
@ -239,162 +429,43 @@ def process_instance(
histories = state.history.compatibility_for_eval_history_pairs()
# Save the output
output = {
'task_id': instance.task_id,
'instruction': instruction,
'metadata': metadata.model_dump(),
'history': histories,
'metrics': metrics,
'error': state.last_error if state and state.last_error else None,
'test_result': test_result,
}
output = EvalOutput(
instance_id=instance.instance_id,
instruction=instruction,
metadata=metadata,
history=histories,
metrics=metrics,
error=state.last_error if state and state.last_error else None,
test_result=test_result,
)
return output
def load_bird():
"""Main function to handle the flow of downloading, processing, and loading the bird dataset."""
raw_dataset_path = download_bird()
bird_dataset = process_bird(raw_dataset_path)
return bird_dataset
def download_bird():
"""Downloads and extracts the bird dataset from a specified URL into a local directory."""
dataset_path = os.path.join(config.workspace_base, 'evaluation_bird')
devset_path = os.path.join(dataset_path, 'dev')
if not os.path.exists(dataset_path):
logger.info(
f'{dataset_path} folder does not exist, starting download and extraction...'
)
os.makedirs(dataset_path, exist_ok=True)
download_url = 'https://bird-bench.oss-cn-beijing.aliyuncs.com/dev.zip'
download_path = os.path.join(dataset_path, 'dev.zip')
logger.info('Start Downloading...')
subprocess.run(['wget', download_url, '-O', download_path])
logger.info('Download completed.')
logger.info('Start Extracting...')
subprocess.run(['unzip', download_path, '-d', dataset_path])
# extract databases
devset_path = os.path.join(dataset_path, 'dev')
database_path = os.path.join(devset_path, 'dev_databases.zip')
subprocess.run(['unzip', database_path, '-d', devset_path])
logger.info('Extraction completed.')
else:
logger.info(f'{dataset_path} folder already exists.')
return devset_path
def process_bird(dataset_path):
"""Processes the raw bird dataset into a structured format and saves it as JSON."""
processed_path = os.path.join(dataset_path, 'processed_dev.json')
if not os.path.exists(processed_path):
logger.info(f'{processed_path} folder does not exist, starting processing...')
raw_data_path = os.path.join(dataset_path, 'dev.json')
database_path = os.path.join(dataset_path, 'dev_databases')
processed_data = []
with pathlib.Path(raw_data_path).open('r') as f:
data = json.load(f)
for e in tqdm(data):
item = {
'task_id': f'{len(processed_data)}',
'db_path': os.path.join(
database_path, e['db_id'], f"{e['db_id']}.sqlite"
),
'db_id': e['db_id'],
'instruction': create_prompt(e, database_path),
'SQL': e['SQL'],
}
processed_data.append(item)
with pathlib.Path(processed_path).open('w') as f:
json.dump(processed_data, f, indent=2)
logger.info(f'Processed data saved to {processed_path}')
else:
logger.info(f'{processed_path} folder already exists.')
bird_dataset = load_dataset('json', data_files={'test': processed_path})
return bird_dataset
def extract_create_table_prompt(db_path, limit_value=0):
"""Generates a SQL prompt with CREATE TABLE statements and sample data from the database."""
table_query = "SELECT * FROM sqlite_master WHERE type='table';"
tables = sqlite3.connect(db_path).cursor().execute(table_query).fetchall()
prompt = ''
for table in tables:
table_name = table[1]
create_table_statement = table[-1]
table_info_query = f'PRAGMA table_info(`{table_name}`);'
top_k_row_query = f'SELECT * FROM {table_name} LIMIT {limit_value};'
try:
headers = [
x[1]
for x in sqlite3.connect(db_path)
.cursor()
.execute(table_info_query)
.fetchall()
]
except Exception:
logger.error(f'Error Connection: {table_info_query}, {top_k_row_query}')
exit(0)
prompt += create_table_statement + ';\n'
if limit_value > 0:
top_k_rows = (
sqlite3.connect(db_path).cursor().execute(top_k_row_query).fetchall()
)
prompt += (
f"/*\n3 example rows:\n{top_k_row_query}\n{' '.join(headers)}\n"
)
for row in top_k_rows:
row = [str(x) for x in row]
row = [x if x is not None else '' for x in row]
prompt += ' '.join(row) + '\n'
prompt += '*/\n'
prompt += '\n'
return prompt
def create_prompt(e, database_path):
"""Create a prompt for the given example"""
db_id = e['db_id']
db_path = pathlib.Path(database_path) / db_id / f'{db_id}.sqlite'
# Extract the CREATE TABLE statements and sample data from the database
prompt = extract_create_table_prompt(db_path)
prompt += f"-- External Knowledge: {e['evidence']}\n\n"
prompt += '-- Using valid SQLite and understanding External Knowledge, answer the following questions for the tables provided above.\n\n'
prompt += '-- Using valid SQLite, answer the following questions for the tables provided above.\n'
prompt += f"Question: {e['question']}\n"
return prompt
if __name__ == '__main__':
id_column = 'task_id'
args = parse_arguments()
bird_dataset = load_bird()
dataset = bird_dataset['test'].to_pandas()
dataset.rename(columns={'task_id': 'instance_id'}, inplace=True)
llm_config = get_llm_config_arg(args.llm_config) if args.llm_config else config.llm
logger.info(f'Config for evaluation: {config}')
llm_config = None
if args.llm_config:
llm_config = get_llm_config_arg(args.llm_config)
if llm_config is None:
raise ValueError(f'Could not find LLM config: --llm_config {args.llm_config}')
metadata = make_metadata(
llm_config,
args.dataset_name,
'BIRD',
args.agent_cls,
args.max_iterations,
args.eval_note,
args.eval_output_dir,
)
output_file = os.path.join(metadata.eval_output_dir, 'output.jsonl')
instances = prepare_dataset(dataset, output_file, args.eval_n_limit, id_column)
instances = prepare_dataset(dataset, output_file, args.eval_n_limit)
run_evaluation(
instances,
metadata,
output_file,
args.eval_num_workers,
process_instance,
id_column,
asyncio.run(
run_evaluation(
instances, metadata, output_file, args.eval_num_workers, process_instance
)
)
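The instruction above asks the agent to write `{instance_id}.py` following the embedded template, and `complete_runtime` later `cat`s that file and extracts the SQL from it. A hypothetical example of such a generated file is shown below; the body of the `execute_sql` helper is elided in the diff, so the version here is a generic sqlite3 query runner, not the repository's exact template:

```python
# Hypothetical /workspace/3.py as the agent might produce it (SQL invented).
import sqlite3

def execute_sql(db_path, sql):
    # generic stand-in for the elided template helper
    with sqlite3.connect(db_path) as conn:
        cursor = conn.cursor()
        cursor.execute(sql)
        return cursor.fetchall()

db_path = '/workspace/some_db.sqlite'    # the DB copied in by initialize_runtime
sql = 'SELECT COUNT(*) FROM some_table'  # written as a single line, per the instruction
result = execute_sql(db_path, sql)
print(result)
```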

evaluation/bird/scripts/run_infer.sh Normal file → Executable file

@ -5,30 +5,9 @@ Some of OpenDevin's agent supports agent delegation action, for example, CodeAct
This evaluation tests whether CodeActAgent can correctly delegate the instruction from WebArena and MiniWob benchmark to the BrowsingAgent.
If so, the browsing performance upper-bound of CodeActAgent will be the performance of BrowsingAgent.
## Setup Environment and LLM Configuration
## Setup Environment
Please follow [this document](https://github.com/OpenDevin/OpenDevin/blob/main/Development.md) to set up a local development environment for OpenDevin.
## Configure OpenDevin and your LLM
Create a `config.toml` file if it does not exist at the root of the workspace.
Add the following configurations:
```toml
# TODO: Change these to the model you want to evaluate
[llm.eval_gpt4_1106_preview_llm]
model = "gpt-4-1106-preview"
api_key = "XXX"
temperature = 0.0
[llm.eval_some_openai_compatible_model_llm]
model = "openai/MODEL_NAME"
base_url = "https://OPENAI_COMPATIBLE_URL/v1"
api_key = "XXX"
temperature = 0.0
```
Please follow the instructions [here](../README.md#setup) to set up your local development environment and LLM.
## Run Inference


@ -1,5 +1,4 @@
import asyncio
import logging
import os
import re
@ -9,56 +8,62 @@ from datasets import load_dataset
from evaluation.utils.shared import (
EvalMetadata,
EvalOutput,
make_metadata,
prepare_dataset,
reset_logger_for_multiprocessing,
run_evaluation,
)
from opendevin.controller.agent import Agent
from opendevin.controller.state.state import State
from opendevin.core.config import get_llm_config_arg, load_app_config, parse_arguments
from opendevin.core.logger import get_console_handler
from opendevin.core.config import (
AppConfig,
SandboxConfig,
get_llm_config_arg,
parse_arguments,
)
from opendevin.core.logger import opendevin_logger as logger
from opendevin.core.main import run_controller
from opendevin.llm.llm import LLM
config = load_app_config()
from opendevin.core.main import create_runtime, run_controller
# Only CodeActAgent can delegate to BrowsingAgent
SUPPORTED_AGENT_CLS = {'CodeActAgent'}
def process_instance(
def get_config(
metadata: EvalMetadata,
) -> AppConfig:
assert (
metadata.max_iterations == 1
), 'max_iterations must be 1 for browsing delegation evaluation.'
config = AppConfig(
default_agent=metadata.agent_class,
run_as_devin=False,
runtime='eventstream',
max_iterations=metadata.max_iterations,
sandbox=SandboxConfig(
container_image='ubuntu:22.04',
enable_auto_lint=False,
use_host_network=False,
update_source_code=True,
),
workspace_base=None,
workspace_mount_path=None,
)
config.set_llm_config(metadata.llm_config)
return config
async def process_instance(
instance: pd.Series,
metadata: EvalMetadata,
reset_logger: bool = True,
):
# Create the agent
agent = Agent.get_cls(metadata.agent_class)(llm=LLM(config=metadata.llm_config))
env_id = instance.instance_id
) -> EvalOutput:
config = get_config(metadata)
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
if reset_logger:
# Set up logger
log_file = os.path.join(
metadata.eval_output_dir, 'logs', f'instance_{env_id}.log'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
# add back the console handler to print ONE line
logger.addHandler(get_console_handler())
logger.info(
f'Starting evaluation for instance {env_id}.\nHint: run "tail -f {log_file}" to see live logs in a separate shell'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
file_handler = logging.FileHandler(log_file)
file_handler.setFormatter(
logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
)
logger.addHandler(file_handler)
log_dir = os.path.join(metadata.eval_output_dir, 'infer_logs')
reset_logger_for_multiprocessing(logger, instance.instance_id, log_dir)
else:
logger.info(f'Starting evaluation for instance {env_id}.')
logger.info(f'Starting evaluation for instance {instance.instance_id}.')
instruction = (
f'You can delegate browsing tasks to a browser agent. '
@ -67,21 +72,14 @@ def process_instance(
f'NOTE: You should copy the "query" as is into the <execute_browse> tag. DO NOT change ANYTHING in the query.'
)
config.max_iterations = metadata.max_iterations
state: State | None = asyncio.run(
run_controller(
config=config,
task_str=instruction,
agent=agent,
sid=env_id,
)
runtime = await create_runtime(config, sid=instance.instance_id)
state: State | None = await run_controller(
config=config,
task_str=instruction,
runtime=runtime,
)
# ======= Attempt to evaluate the agent's environment impact =======
# If you are working on some simpler benchmark that only evaluates the final model output (e.g., in a MessageAction)
# You can simply get the LAST `MessageAction` from the returned `state.history` and parse it for evaluation.
if state is None:
raise ValueError('State should not be None.')
@ -115,20 +113,19 @@ def process_instance(
result['is_exact_match'] = is_exact_match
# Save the output
output = {
'instance_id': env_id,
'instruction': instruction,
'metadata': metadata.model_dump(),
'history': histories,
'metrics': metrics,
'error': state.last_error if state and state.last_error else None,
'test_result': {
output = EvalOutput(
instance_id=instance.instance_id,
instruction=instruction,
metadata=metadata,
history=histories,
metrics=metrics,
error=state.last_error if state and state.last_error else None,
test_result={
'query': instance.instruction,
'action': last_delegate_action,
'result': result,
},
}
)
return output
@ -138,9 +135,13 @@ if __name__ == '__main__':
dataset = load_dataset('OpenDevin/eval-browsing-instructions')
dataset = dataset['train'].to_pandas()
assert dataset.columns.tolist() == ['instance_id', 'instruction']
id_column = 'instance_id'
llm_config = get_llm_config_arg(args.llm_config) if args.llm_config else config.llm
logger.info(f'Config for evaluation: {config}')
llm_config = None
if args.llm_config:
llm_config = get_llm_config_arg(args.llm_config)
if llm_config is None:
raise ValueError(f'Could not find LLM config: --llm_config {args.llm_config}')
metadata = make_metadata(
llm_config,
@ -150,18 +151,20 @@ if __name__ == '__main__':
args.eval_note,
args.eval_output_dir,
)
if metadata.agent_class not in SUPPORTED_AGENT_CLS:
raise ValueError(
f'Agent class {metadata.agent_class} not supported with AgentDelegation.'
)
output_file = os.path.join(metadata.eval_output_dir, 'output.jsonl')
instances = prepare_dataset(dataset, output_file, args.eval_n_limit, id_column)
run_evaluation(
instances,
metadata,
output_file,
args.eval_num_workers,
process_instance,
id_column,
instances = prepare_dataset(dataset, output_file, args.eval_n_limit)
asyncio.run(
run_evaluation(
instances,
metadata,
output_file,
args.eval_num_workers,
process_instance,
)
)


@ -2,9 +2,9 @@
This folder contains the evaluation harness for evaluating agents on the [GAIA benchmark](https://arxiv.org/abs/2311.12983).
## Configure OpenDevin and your LLM
## Setup Environment and LLM Configuration
Create a `config.toml` file if it does not exist at the root of the workspace. Please check [README.md](../../README.md) for how to set this up.
Please follow the instructions [here](../README.md#setup) to set up your local development environment and LLM.
## Run the evaluation
We are using the GAIA dataset hosted on [Hugging Face](https://huggingface.co/datasets/gaia-benchmark/GAIA).


@ -1,10 +1,7 @@
import asyncio
import logging
import functools
import os
import pathlib
import re
import shutil
from functools import partial
import huggingface_hub
import pandas as pd
@ -13,28 +10,31 @@ from datasets import load_dataset
from evaluation.gaia.scorer import question_scorer
from evaluation.utils.shared import (
EvalMetadata,
EvalOutput,
codeact_user_response,
make_metadata,
prepare_dataset,
reset_logger_for_multiprocessing,
run_evaluation,
)
from opendevin.controller.agent import Agent
from opendevin.controller.state.state import State
from opendevin.core.config import get_llm_config_arg, get_parser, load_app_config
from opendevin.core.logger import get_console_handler
from opendevin.core.config import (
AppConfig,
SandboxConfig,
get_llm_config_arg,
get_parser,
)
from opendevin.core.logger import opendevin_logger as logger
from opendevin.core.main import run_controller
from opendevin.events.action import CmdRunAction, MessageAction
from opendevin.llm.llm import LLM
from opendevin.core.main import create_runtime, run_controller
from opendevin.events.action import AgentFinishAction, CmdRunAction, MessageAction
from opendevin.events.observation import CmdOutputObservation
from opendevin.runtime.runtime import Runtime
config = load_app_config()
DATASET_CACHE_DIR = '~/.cache/open-devin/evals/gaia'
DATASET_CACHE_DIR = os.path.expanduser(DATASET_CACHE_DIR)
DATASET_CACHE_DIR = os.path.join(os.path.dirname(__file__), 'data')
AGENT_CLS_TO_FAKE_USER_RESPONSE_FN = {
'CodeActAgent': partial(codeact_user_response, encapsulate_solution=True),
'CodeActAgent': functools.partial(codeact_user_response, encapsulate_solution=True),
}
AGENT_CLS_TO_INST_SUFFIX = {
@ -42,151 +42,175 @@ AGENT_CLS_TO_INST_SUFFIX = {
}
def process_instance(
def get_config(
metadata: EvalMetadata,
) -> AppConfig:
config = AppConfig(
default_agent=metadata.agent_class,
run_as_devin=False,
runtime='eventstream',
max_iterations=metadata.max_iterations,
sandbox=SandboxConfig(
container_image='ubuntu:22.04',
enable_auto_lint=True,
use_host_network=False,
update_source_code=True,
),
# do not mount workspace
workspace_base=None,
workspace_mount_path=None,
)
config.set_llm_config(metadata.llm_config)
return config
async def initialize_runtime(
runtime: Runtime,
instance: pd.Series, # this argument is not required
):
"""Initialize the runtime for the agent.
This function is called before the runtime is used to run the agent.
"""
logger.info(f"{'-' * 50} BEGIN Runtime Initialization Fn {'-' * 50}")
obs: CmdOutputObservation
action = CmdRunAction(command='mkdir -p /workspace')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
if instance['file_name'] != '':
# if this question comes with a file, we need to save it to the workspace
assert metadata.data_split is not None
src_file = os.path.join(
DATASET_CACHE_DIR, '2023', metadata.data_split, instance['file_name']
)
assert os.path.exists(src_file)
dest_file = os.path.join('/workspace', instance['file_name'])
await runtime.copy_to(src_file, dest_file)
# rename to file.extension_name
extension_name = instance['file_name'].split('.')[-1]
action = CmdRunAction(
command=f'mv /workspace/{instance["file_name"]} /workspace/file.{extension_name}'
)
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
action = CmdRunAction(command='cd /workspace')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
logger.info(f"{'-' * 50} END Runtime Initialization Fn {'-' * 50}")
async def process_instance(
instance: pd.Series,
metadata: EvalMetadata,
reset_logger: bool = True,
):
# Create the agent
agent = Agent.get_cls(metadata.agent_class)(llm=LLM(config=metadata.llm_config))
# create process-specific workspace dir
# we will create a workspace directory for EACH process
# so that different agent don't interfere with each other.
old_workspace_mount_path = config.workspace_mount_path
) -> EvalOutput:
config = get_config(metadata)
try:
workspace_mount_path = os.path.join(
config.workspace_mount_path, '_eval_workspace'
)
workspace_mount_path = os.path.join(workspace_mount_path, str(os.getpid()))
pathlib.Path(workspace_mount_path).mkdir(parents=True, exist_ok=True)
config.workspace_mount_path = workspace_mount_path
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
if reset_logger:
log_dir = os.path.join(metadata.eval_output_dir, 'infer_logs')
reset_logger_for_multiprocessing(logger, instance['instance_id'], log_dir)
else:
logger.info(f'Starting evaluation for instance {instance["instance_id"]}.')
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
eval_output_dir = metadata.eval_output_dir
if reset_logger:
# Set up logger
log_file = os.path.join(
eval_output_dir, 'logs', f'instance_{instance["task_id"]}.log'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
# add back the console handler to print ONE line
logger.addHandler(get_console_handler())
logger.info(
f'Starting evaluation for instance {instance["task_id"]}.\nLOG: tail -f {log_file}'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
file_handler = logging.FileHandler(log_file)
file_handler.setFormatter(
logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
)
logger.addHandler(file_handler)
if instance['file_name'] != '':
extension_name = instance['file_name'].split('.')[-1]
dest_file = os.path.join('/workspace', f'file.{extension_name}')
else:
dest_file = None
logger.info(f'Process-specific workspace mounted at {workspace_mount_path}')
if instance['file_name'] != '':
# if this question comes with a file, we need to save it to the workspace
assert metadata.data_split is not None
src_file = os.path.join(
DATASET_CACHE_DIR, '2023', metadata.data_split, instance['file_name']
)
extension_name = instance['file_name'].split('.')[-1]
dest_file = os.path.join(workspace_mount_path, f'file.{extension_name}')
shutil.copyfile(src_file, dest_file)
logger.info(f'File copied to {dest_file}')
else:
dest_file = None
# Prepare instruction
instruction = f"{instance['Question']}\n"
logger.info(f'Instruction: {instruction}')
if dest_file:
instruction += f"\n\nThe mentioned file is provided in the workspace at: {dest_file.split('/')[-1]}"
# Prepare instruction
instruction = f"{instance['Question']}\n"
logger.info(f'Instruction: {instruction}')
if dest_file:
instruction += f"\n\nThe mentioned file is provided in the workspace at: {dest_file.split('/')[-1]}"
instruction += 'IMPORTANT: You should ONLY interact with the environment provided to you AND NEVER ASK FOR HUMAN HELP.\n'
instruction += 'Please encapsulate your final answer (answer ONLY) within <solution> and </solution>.\n'
instruction += (
'For example: The answer to the question is <solution> 42 </solution>.\n'
)
# NOTE: You can actually set slightly different instruction for different agents
instruction += AGENT_CLS_TO_INST_SUFFIX.get(metadata.agent_class, '')
logger.info(f'Instruction:\n{instruction}', extra={'msg_type': 'OBSERVATION'})
instruction += 'IMPORTANT: You should ONLY interact with the environment provided to you AND NEVER ASK FOR HUMAN HELP.\n'
instruction += 'Please encapsulate your final answer (answer ONLY) within <solution> and </solution>.\n'
instruction += (
'For example: The answer to the question is <solution> 42 </solution>.\n'
)
# NOTE: You can actually set slightly different instruction for different agents
instruction += AGENT_CLS_TO_INST_SUFFIX.get(agent.__class__.__name__, '')
logger.info(f'Instruction:\n{instruction}', extra={'msg_type': 'OBSERVATION'})
runtime = await create_runtime(config, sid=instance['instance_id'])
await initialize_runtime(runtime, instance)
# Here's how you can run the agent (similar to the `main` function) and get the final task state
config.max_iterations = metadata.max_iterations
state: State | None = asyncio.run(
run_controller(
config=config,
task_str=instruction,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN[
agent.__class__.__name__
],
agent=agent,
sid=instance['task_id'],
)
)
# ======= Attempt to evaluate the agent's edits =======
# If you are working on simpler benchmark that only evaluates the final model output (e.g., in a MessageAction)
# You can simply get the LAST `MessageAction` from the returned `state.history` and parse it for evaluation.
# Here's how you can run the agent (similar to the `main` function) and get the final task state
state: State | None = await run_controller(
config=config,
task_str=instruction,
runtime=runtime,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN[metadata.agent_class],
)
# ======= Attempt to evaluate the agent's edits =======
# If you are working on simpler benchmark that only evaluates the final model output (e.g., in a MessageAction)
# You can simply get the LAST `MessageAction` from the returned `state.history` and parse it for evaluation.
if state is None:
raise ValueError('State should not be None.')
if state is None:
raise ValueError('State should not be None.')
model_answer_raw = ''
# get the last message or thought from the agent
for event in state.history.get_events(reverse=True):
if isinstance(event, CmdRunAction) and event.source == 'agent':
model_answer_raw = ''
# get the last message or thought from the agent
for event in state.history.get_events(reverse=True):
if event.source == 'agent':
if isinstance(event, AgentFinishAction):
model_answer_raw = event.thought
elif isinstance(event, MessageAction) and event.source == 'agent':
break
elif isinstance(event, CmdRunAction):
model_answer_raw = event.thought
break
elif isinstance(event, MessageAction):
model_answer_raw = event.content
break
# attempt to parse model_answer
model_answer = re.findall(r'<solution>(.*?)</solution>', model_answer_raw)
if len(model_answer) == 0:
logger.warning(f'Failed to parse model answer: {model_answer_raw}')
model_answer = model_answer_raw
else:
model_answer = model_answer[0]
# attempt to parse model_answer
model_answer = re.findall(r'<solution>(.*?)</solution>', model_answer_raw)
if len(model_answer) == 0:
logger.warning(f'Failed to parse model answer: {model_answer_raw}')
model_answer = model_answer_raw
else:
model_answer = model_answer[0]
logger.info(
f'Final message: {model_answer} | Ground truth: {instance["Final answer"]}'
)
score = question_scorer(
model_answer=model_answer, ground_truth=instance['Final answer']
)
test_result = {
'score': score,
'model_answer_raw': model_answer_raw,
'model_answer': model_answer,
'ground_truth': instance['Final answer'],
}
metrics = state.metrics.get() if state.metrics else None
logger.info(
f'Final message: {model_answer} | Ground truth: {instance["Final answer"]}'
)
score = question_scorer(
model_answer=model_answer, ground_truth=instance['Final answer']
)
test_result = {
'score': score,
'model_answer_raw': model_answer_raw,
'model_answer': model_answer,
'ground_truth': instance['Final answer'],
}
metrics = state.metrics.get() if state.metrics else None
# history is now available as a stream of events, rather than list of pairs of (Action, Observation)
# for compatibility with the existing output format, we can remake the pairs here
# remove when it becomes unnecessary
histories = state.history.compatibility_for_eval_history_pairs()
# history is now available as a stream of events, rather than list of pairs of (Action, Observation)
# for compatibility with the existing output format, we can remake the pairs here
# remove when it becomes unnecessary
histories = state.history.compatibility_for_eval_history_pairs()
# Save the output
output = {
'instance_id': instance['task_id'],
'instance': instance,
'instruction': instance['Question'],
'metadata': metadata.model_dump(),
'history': histories,
'metrics': metrics,
'error': state.last_error if state and state.last_error else None,
'test_result': test_result,
}
except Exception:
logger.error('Process instance failed')
raise
finally:
config.workspace_mount_path = old_workspace_mount_path
# Save the output
output = EvalOutput(
instance_id=instance['instance_id'],
instance=instance.to_dict(),
instruction=instance['Question'],
metadata=metadata,
history=histories,
metrics=metrics,
error=state.last_error if state and state.last_error else None,
test_result=test_result,
)
return output
@ -197,13 +221,19 @@ if __name__ == '__main__':
type=str,
help='gaia level to evaluate, eg. 2023_level1',
)
parser.add_argument(
'--data-split',
type=str,
help='data split to evaluate, eg. test',
default='validation',
)
args, _ = parser.parse_known_args()
if args.directory:
config.workspace_base = os.path.abspath(args.directory)
logger.info(f'Setting workspace base to {config.workspace_base}')
llm_config = get_llm_config_arg(args.llm_config) if args.llm_config else config.llm
logger.info(f'Config for evaluation: {config}')
llm_config = None
if args.llm_config:
llm_config = get_llm_config_arg(args.llm_config)
if llm_config is None:
raise ValueError(f'Could not find LLM config: --llm_config {args.llm_config}')
metadata = make_metadata(
llm_config=llm_config,
@ -222,20 +252,18 @@ if __name__ == '__main__':
repo_type='dataset',
local_dir=DATASET_CACHE_DIR,
)
gaia_tests = dataset[metadata.data_split]
gaia_tests = dataset[metadata.data_split].to_pandas()
gaia_tests.rename(columns={'task_id': 'instance_id'}, inplace=True)
output_file = os.path.join(metadata.eval_output_dir, 'output.jsonl')
prepared_dataset = prepare_dataset(
gaia_tests.to_pandas(), output_file, args.eval_n_limit, 'task_id'
)
prepared_dataset = prepare_dataset(gaia_tests, output_file, args.eval_n_limit)
agent = Agent.get_cls(args.agent_cls)(llm=LLM(config.llm))
run_evaluation(
dataset=prepared_dataset,
metadata=metadata,
output_file=output_file,
num_workers=args.eval_num_workers,
process_instance_func=process_instance,
id_column='task_id',
asyncio.run(
run_evaluation(
dataset=prepared_dataset,
metadata=metadata,
output_file=output_file,
num_workers=args.eval_num_workers,
process_instance_func=process_instance,
)
)
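As a small aside, the `<solution>` parsing used in the GAIA harness above behaves like this (sample text invented):

```python
# Invented sample; mirrors the regex used in process_instance above.
import re

model_answer_raw = 'The answer to the question is <solution> 42 </solution>.'
matches = re.findall(r'<solution>(.*?)</solution>', model_answer_raw)
model_answer = matches[0] if matches else model_answer_raw
print(repr(model_answer))  # -> ' 42 ' (surrounding whitespace is preserved)
```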

evaluation/gaia/scripts/run_infer.sh Normal file → Executable file

@ -2,20 +2,16 @@
This folder contains the evaluation harness we built on top of the original [Gorilla APIBench](https://github.com/ShishirPatil/gorilla) ([paper](https://arxiv.org/pdf/2305.15334)).
## Setup Environment
## Setup Environment and LLM Configuration
Please follow [this document](https://github.com/OpenDevin/OpenDevin/blob/main/Development.md) to setup local development environment for OpenDevin.
## Configure OpenDevin and your LLM
Run `make setup-config` to set up the `config.toml` file if it does not exist at the root of the workspace.
Please follow the instructions [here](../README.md#setup) to set up your local development environment and LLM.
## Run Inference on APIBench Instances
Make sure your Docker daemon is running, then run this bash script:
```bash
bash evaluation/gorilla/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [hubs]
./evaluation/gorilla/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [hubs]
```
where `model_config` is mandatory, while all other arguments are optional.
@ -39,5 +35,5 @@ Note: in order to use `eval_limit`, you must also set `agent`; in order to use `
For example,
```bash
bash evaluation/gorilla/scripts/run_infer.sh llm 0.6.2 CodeActAgent 10 th
./evaluation/gorilla/scripts/run_infer.sh llm 0.6.2 CodeActAgent 10 th
```


@ -1,59 +1,28 @@
import asyncio
import json
import logging
import multiprocessing as mp
import os
import pathlib
import subprocess
import time
from concurrent.futures import ProcessPoolExecutor
from tqdm import tqdm
import pandas as pd
from opendevin.controller.agent import Agent
from evaluation.gorilla.utils import encode_question, get_data_for_hub
from evaluation.utils.shared import (
EvalMetadata,
EvalOutput,
codeact_user_response,
make_metadata,
prepare_dataset,
reset_logger_for_multiprocessing,
run_evaluation,
)
from opendevin.controller.state.state import State
from opendevin.core.config import get_llm_config_arg, get_parser, load_app_config
from opendevin.core.logger import get_console_handler
from opendevin.core.config import (
AppConfig,
SandboxConfig,
get_llm_config_arg,
get_parser,
)
from opendevin.core.logger import opendevin_logger as logger
from opendevin.core.main import run_controller
from opendevin.events.action import MessageAction
from opendevin.llm.llm import LLM
from .utils import encode_question, get_data
config = load_app_config()
def cleanup():
print('Cleaning up child processes...')
for process in mp.active_children():
print(f'Terminating child process: {process.name}')
process.terminate()
process.join()
def codeact_user_response(state: State) -> str:
msg = (
#'Please continue working on the task on whatever approach you think is suitable.\n'
'Please run the following command: <execute_bash> exit </execute_bash>.\n'
#'IMPORTANT: YOU SHOULD NEVER ASK FOR HUMAN HELP OR USE THE INTERNET TO SOLVE THIS TASK.\n'
)
# check if the agent has tried to talk to the user 3 times, if so, let the agent know it can give up
if state.history:
user_msgs = [
event
for event in state.history.get_events()
if isinstance(event, MessageAction) and event.source == 'user'
]
if len(user_msgs) > 2:
# let the agent know that it can give up when it has tried 3 times
return (
msg
+ 'If you want to give up, run: <execute_bash> exit </execute_bash>.\n'
)
return msg
from opendevin.core.main import create_runtime, run_controller
AGENT_CLS_TO_FAKE_USER_RESPONSE_FN = {
'CodeActAgent': codeact_user_response,
@ -64,105 +33,96 @@ AGENT_CLS_TO_INST_SUFFIX = {
}
def process_instance(agent, question_id, question, metadata, reset_logger: bool = True):
# create process-specific workspace dir
# we will create a workspace directory for EACH process
# so that different agent don't interfere with each other.
old_workspace_mount_path = config.workspace_mount_path
try:
workspace_mount_path = os.path.join(
config.workspace_mount_path, '_eval_workspace'
)
workspace_mount_path = os.path.join(workspace_mount_path, str(os.getpid()))
pathlib.Path(workspace_mount_path).mkdir(parents=True, exist_ok=True)
config.workspace_mount_path = workspace_mount_path
def get_config(
metadata: EvalMetadata,
) -> AppConfig:
config = AppConfig(
default_agent=metadata.agent_class,
run_as_devin=False,
runtime='eventstream',
max_iterations=metadata.max_iterations,
sandbox=SandboxConfig(
container_image='ubuntu:22.04',
enable_auto_lint=True,
use_host_network=False,
update_source_code=True,
),
# do not mount workspace
workspace_base=None,
workspace_mount_path=None,
)
config.set_llm_config(metadata.llm_config)
return config
# Setup the logger properly, so you can run multi-processing to parallize the evaluation
eval_output_dir = metadata['eval_output_dir']
if reset_logger:
# Set up logger
log_file = os.path.join(
eval_output_dir, 'logs', f'instance_{question_id}.log'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
# add back the console handler to print ONE line
logger.addHandler(get_console_handler())
logger.info(
f'Starting evaluation for instance {question_id}.\nLOG: tail -f {log_file}'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
file_handler = logging.FileHandler(log_file)
file_handler.setFormatter(
logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
)
logger.addHandler(file_handler)
logger.info(f'Process-specific workspace mounted at {workspace_mount_path}')
# Prepare instruction
instruction = encode_question(question, metadata['hub'])
instruction += 'IMPORTANT: You should ONLY interact with the environment provided to you AND NEVER ASK FOR HUMAN HELP.\n'
# NOTE: You can actually set slightly different instruction for different agents
instruction += AGENT_CLS_TO_INST_SUFFIX[agent.__class__.__name__]
# logger.info(f'Instruction:\n{instruction}', extra={'msg_type': 'OBSERVATION'})
async def process_instance(
instance: pd.Series,
metadata: EvalMetadata,
reset_logger: bool = True,
) -> EvalOutput:
config = get_config(metadata)
instance_id = instance['question_id']
question = instance['question']
# Here's how you can run the agent (similar to the `main` function) and get the final task state
config.max_iterations = metadata.max_iterations
state: State | None = asyncio.run(
run_controller(
config=config,
task_str=instruction,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN.get(
agent.__class__.__name__
),
agent=agent,
sid=question_id,
)
)
# ======= Attempt to evaluate the agent's edits =======
# If you are working on simpler benchmark that only evaluates the final model output (e.g., in a MessageAction)
# You can simply get the LAST `MessageAction` from the returned `state.history` and parse it for evaluation.
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
if reset_logger:
log_dir = os.path.join(metadata.eval_output_dir, 'infer_logs')
reset_logger_for_multiprocessing(logger, instance_id, log_dir)
else:
logger.info(f'Starting evaluation for instance {instance_id}.')
if state is None:
raise ValueError('State should not be None.')
# Prepare instruction
instruction = encode_question(question, instance['hub'])
instruction += 'IMPORTANT: You should ONLY interact with the environment provided to you AND NEVER ASK FOR HUMAN HELP.\n'
# NOTE: You can actually set slightly different instruction for different agents
instruction += AGENT_CLS_TO_INST_SUFFIX[metadata.agent_class]
# logger.info(f'Instruction:\n{instruction}', extra={'msg_type': 'OBSERVATION'})
# retrieve the last message from the agent
model_answer_raw = state.history.get_last_agent_message()
# Here's how you can run the agent (similar to the `main` function) and get the final task state
runtime = await create_runtime(config, sid=instance_id)
state: State | None = await run_controller(
config=config,
task_str=instruction,
runtime=runtime,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN.get(
metadata.agent_class
),
)
# ======= Attempt to evaluate the agent's edits =======
# If you are working on simpler benchmark that only evaluates the final model output (e.g., in a MessageAction)
# You can simply get the LAST `MessageAction` from the returned `state.history` and parse it for evaluation.
# attempt to parse model_answer
_, _, ast_eval = get_data(metadata['hub'])
correct, hallucination = ast_eval(question_id, model_answer_raw)
metrics = state.metrics.get() if state.metrics else None
logger.info(
f'Final message: {model_answer_raw} | Correctness: {correct} | Hallucination: {hallucination}'
)
if state is None:
raise ValueError('State should not be None.')
# history is now available as a stream of events, rather than list of pairs of (Action, Observation)
# for compatibility with the existing output format, we can remake the pairs here
# remove when it becomes unnecessary
histories = state.history.compatibility_for_eval_history_pairs()
# retrieve the last message from the agent
model_answer_raw = state.history.get_last_agent_message()
# Save the output
output = {
'question_id': question_id,
# attempt to parse model_answer
ast_eval_fn = instance['ast_eval']
correct, hallucination = ast_eval_fn(instance_id, model_answer_raw)
metrics = state.metrics.get() if state.metrics else None
logger.info(
f'Final message: {model_answer_raw} | Correctness: {correct} | Hallucination: {hallucination}'
)
# history is now available as a stream of events, rather than list of pairs of (Action, Observation)
# for compatibility with the existing output format, we can remake the pairs here
# remove when it becomes unnecessary
histories = state.history.compatibility_for_eval_history_pairs()
output = EvalOutput(
instance_id=instance_id,
metadata=metadata,
history=histories,
metrics=metrics,
error=state.last_error if state and state.last_error else None,
test_result={
'text': model_answer_raw,
'correct': correct,
'hallucination': hallucination,
'answer_id': 'None',
'model_id': metadata['model_name'],
'metadata': metadata.model_dump(),
'history': histories,
'metrics': metrics,
'error': state.last_error if state and state.last_error else None,
}
except Exception:
logger.error('Process instance failed')
raise
finally:
config.workspace_mount_path = old_workspace_mount_path
},
)
return output
@ -175,188 +135,62 @@ if __name__ == '__main__':
default='hf,torch,tf',
)
args, _ = parser.parse_known_args()
if args.directory:
config.workspace_base = os.path.abspath(args.directory)
print(f'Setting workspace base to {config.workspace_base}')
# Check https://github.com/OpenDevin/OpenDevin/blob/main/evaluation/swe_bench/README.md#configure-opendevin-and-your-llm
# for details of how to set `llm_config`
llm_config = None
if args.llm_config:
specified_llm_config = get_llm_config_arg(args.llm_config)
if specified_llm_config:
config.llm = specified_llm_config
logger.info(f'Config for evaluation: {config}')
agent_class = args.agent_cls
assert (
agent_class in AGENT_CLS_TO_FAKE_USER_RESPONSE_FN
), f'Unsupported agent class: {agent_class}'
model_name = config.llm.model.split('/')[-1]
max_iterations = args.max_iterations
eval_note = ''
if args.eval_note is not None:
eval_note += '_N_' + args.eval_note
eval_output_dir = os.path.join(
args.eval_output_dir,
'gorilla',
agent_class,
model_name + '_maxiter_' + str(max_iterations) + eval_note,
)
pathlib.Path(eval_output_dir).mkdir(parents=True, exist_ok=True)
pathlib.Path(os.path.join(eval_output_dir, 'logs')).mkdir(
parents=True, exist_ok=True
)
logger.info(f'Using evaluation output directory: {eval_output_dir}')
llm_config = get_llm_config_arg(args.llm_config)
if llm_config is None:
raise ValueError(f'Could not find LLM config: --llm_config {args.llm_config}')
hubs = []
if 'hf' in args.hubs:
hubs.append('hf')
if 'torch' in args.hubs or 'th' in args.hubs:
hubs.append('torch')
if 'tf' in args.hubs:
hubs.append('tf')
if hubs == []:
hubs = args.hubs.split(',')
if len(hubs) == 0:
raise ValueError('Please choose at least one from hf, torch, and tf for hubs.')
dfs = []
for hub in hubs:
logger.info(f'Evaluating APIBench {hub} test')
questions, question_ids, ast_eval = get_data(hub)
df = get_data_for_hub(hub)
dfs.append(df)
dataset_df = pd.concat(dfs)
dataset_df.rename(columns={'question_id': 'instance_id'}, inplace=True)
# TEST METADATA
metadata = {
'hub': hub,
'agent_class': agent_class,
'model_name': model_name,
'max_iterations': max_iterations,
'eval_output_dir': eval_output_dir,
'start_time': time.strftime('%Y-%m-%d %H:%M:%S'),
# get the commit id of current repo for reproduciblity
'git_commit': subprocess.check_output(['git', 'rev-parse', 'HEAD'])
.decode('utf-8')
.strip(),
}
logger.info(f'Metadata: {metadata}')
with open(os.path.join(eval_output_dir, f'metadata_{hub}.json'), 'w') as f:
json.dump(metadata, f)
metadata = make_metadata(
llm_config=llm_config,
dataset_name=f'gorilla-{hub}',
agent_class=args.agent_cls,
max_iterations=args.max_iterations,
eval_note=args.eval_note,
eval_output_dir=args.eval_output_dir,
data_split=args.data_split,
)
output_file = os.path.join(metadata.eval_output_dir, 'output.jsonl')
# LIMIT EVALUATION
eval_n_limit = args.eval_n_limit
if eval_n_limit:
questions = questions[: (eval_n_limit // len(hubs))]
question_ids = question_ids[: (eval_n_limit // len(hubs))]
logger.info(
f'Limiting evaluation to a total of first {eval_n_limit} instances -> first {eval_n_limit//len(hubs)} instances per hub.'
)
output_file = os.path.join(eval_output_dir, f'output_{model_name}_{hub}.jsonl')
logger.info(f'Writing evaluation output to {output_file}')
finished_task_ids = set()
if os.path.exists(output_file):
with open(output_file, 'r') as f:
for line in f:
data = json.loads(line)
for i in range(len(question_ids)):
if question_ids[i] == int(data['question_id']):
finished_task_ids.add(data['question_id'])
logger.warning(
f'Output file {output_file} already exists. Loaded {len(finished_task_ids)} finished instances.'
)
output_fp = open(output_file, 'a')
logger.info(
f'Evaluation started with Agent {agent_class}, model {model_name}, max iterations {max_iterations}.'
dataset = prepare_dataset(
dataset_df, output_file=output_file, eval_n_limit=args.eval_n_limit
)
asyncio.run(
run_evaluation(
dataset=dataset,
metadata=metadata,
output_file=output_file,
num_workers=args.eval_num_workers,
process_instance_func=process_instance,
)
# =============================================
# filter out finished instances
new_questions = []
new_question_ids = []
for i in range(len(question_ids)):
if question_ids[i] in finished_task_ids:
logger.info(
f'Skipping instance {question_ids[i]} as it is already finished.'
)
continue
new_questions.append(questions[i])
new_question_ids.append(question_ids[i])
)
finished_task_number = len(finished_task_ids)
questions = new_questions
question_ids = new_question_ids
logger.info(
f'Finished instances: {finished_task_number}, Remaining instances: {len(question_ids)}'
)
# =============================================
pbar = tqdm(total=len(question_ids))
# This function tracks the progress AND write the output to a JSONL file
def update_progress(future, pbar, output_fp, finished_task_ids):
pbar.update(1)
output = future.result()
pbar.set_description(f'Instance {output["question_id"]}')
pbar.set_postfix_str(f'Test Result: {output["correct"]}')
logger.info(
f'Finished evaluation for instance {output["question_id"]}: {output["correct"]}'
)
output_fp.write(json.dumps(output) + '\n')
output_fp.flush()
finished_task_ids.add(output['question_id'])
# Create the agent
agent = Agent.get_cls(agent_class)(llm=LLM(config.llm))
# This sets the multi-processing
num_workers = args.eval_num_workers
logger.info(f'Using {num_workers} workers for evaluation.')
try:
with ProcessPoolExecutor(num_workers) as executor:
futures = []
# This is how we perform multi-processing
for i in range(len(question_ids)):
try:
question_id = question_ids[i]
question = questions[i]
future = executor.submit(
process_instance,
agent,
question_id,
question,
metadata,
reset_logger=bool(num_workers > 1),
)
future.add_done_callback(
update_progress, pbar, output_fp, finished_task_ids
)
futures.append(future)
except Exception:
continue
# Wait for all futures to complete
for future in futures:
try:
future.result()
except Exception:
continue
except KeyboardInterrupt:
logger.info('KeyboardInterrupt received. Cleaning up...')
cleanup()
output_fp.close()
total_correct = 0
total_hallucination = 0
output = []
with open(output_file, 'r') as f:
for line in f:
data = json.loads(line)
output.append(data)
if int(data['question_id']) in finished_task_ids:
if str(data['correct']).lower() == 'true':
total_correct += 1
if str(data['hallucination']).lower() == 'true':
total_hallucination += 1
# sort all output by question_id
output = sorted(output, key=lambda x: x['question_id'])
with open(output_file, 'w') as f:
for dat in output:
f.write(json.dumps(dat) + '\n')
f.flush()
logger.info(
f'Evaluation finished for {hub}. Total: {len(question_ids)+finished_task_number}; Correct: {total_correct}; Hallucination: {total_hallucination}. Accuracy: {total_correct / (len(question_ids)+finished_task_number)}'
)
# Read the output file and calculate the accuracy
total_correct = 0
total_hallucination = 0
output = []
with open(output_file, 'r') as f:
for line in f:
data = json.loads(line)
if data['test_result']['correct']:
total_correct += 1
if data['test_result']['hallucination']:
total_hallucination += 1
output.append(data)
logger.info(
f'Evaluation finished for {hub}. Total: {len(output)}; Correct: {total_correct}; Hallucination: {total_hallucination}. Accuracy: {total_correct / len(output)}'
)

evaluation/gorilla/scripts/run_infer.sh Normal file → Executable file
View File

@ -1,6 +1,8 @@
import json
import os
from functools import partial
import pandas as pd
import requests
from ast_eval_hf import ast_eval_hf, ast_parse
from ast_eval_tf import ast_eval_tf
@ -48,48 +50,59 @@ def encode_question(question, api_name):
return prompts
def get_data(hub):
DATA_DIR = os.path.join(os.path.dirname(__file__), 'data')
os.makedirs(DATA_DIR, exist_ok=True)
def fetch_data(url, filename):
cache_path = os.path.join(DATA_DIR, filename)
if os.path.exists(cache_path):
with open(cache_path, 'r') as f:
return f.read()
else:
response = requests.get(url)
if response.status_code == 200:
with open(cache_path, 'w') as f:
f.write(response.text)
return response.text
else:
raise Exception(f'Failed to fetch data from {url}')
def get_data_for_hub(hub: str):
if hub == 'hf':
question_data = 'https://raw.githubusercontent.com/ShishirPatil/gorilla/main/eval/eval-data/questions/huggingface/questions_huggingface_0_shot.jsonl'
api_dataset = 'https://raw.githubusercontent.com/ShishirPatil/gorilla/main/data/api/huggingface_api.jsonl'
apibench = 'https://raw.githubusercontent.com/ShishirPatil/gorilla/main/data/apibench/huggingface_eval.json'
ast_eval = ast_eval_hf
if hub == 'torch':
elif hub == 'torch':
question_data = 'https://raw.githubusercontent.com/ShishirPatil/gorilla/main/eval/eval-data/questions/torchhub/questions_torchhub_0_shot.jsonl'
api_dataset = 'https://raw.githubusercontent.com/ShishirPatil/gorilla/main/data/api/torchhub_api.jsonl'
apibench = 'https://raw.githubusercontent.com/ShishirPatil/gorilla/main/data/apibench/torchhub_eval.json'
ast_eval = ast_eval_th
if hub == 'tf':
elif hub == 'tf':
question_data = 'https://raw.githubusercontent.com/ShishirPatil/gorilla/main/eval/eval-data/questions/tensorflowhub/questions_tensorflowhub_0_shot.jsonl'
api_dataset = 'https://raw.githubusercontent.com/ShishirPatil/gorilla/main/data/api/tensorflowhub_api.jsonl'
apibench = 'https://raw.githubusercontent.com/ShishirPatil/gorilla/main/data/apibench/tensorflow_eval.json'
ast_eval = ast_eval_tf
# get questions and question_ids
question_data = fetch_data(question_data, 'question_data.jsonl')
api_dataset = fetch_data(api_dataset, 'api_dataset.jsonl')
apibench = fetch_data(apibench, 'apibench.json')
# Parse question data
questions = []
question_ids = []
question_data = requests.get(question_data)
if question_data.status_code == 200:
lines = question_data.text.splitlines()
for line in lines:
questions.append(json.loads(line)['text'])
question_ids.append(json.loads(line)['question_id'])
for line in question_data.splitlines():
data = json.loads(line)
questions.append(data['text'])
question_ids.append(data['question_id'])
# get the api datasest
api_database = []
api_dataset = requests.get(api_dataset)
if api_dataset.status_code == 200:
lines = api_dataset.text.splitlines()
for line in lines:
api_database.append(json.loads(line))
# Parse API dataset
api_database = [json.loads(line) for line in api_dataset.splitlines()]
# get the question answer pair datasest
qa_pairs = []
apibench = requests.get(apibench)
if apibench.status_code == 200:
lines = apibench.text.splitlines()
for line in lines:
qa_pairs.append(json.loads(line)['api_data'])
# Parse question-answer pairs
qa_pairs = [json.loads(line)['api_data'] for line in apibench.splitlines()]
# Parse all apis to ast trees
ast_database = []
@ -97,4 +110,15 @@ def get_data(hub):
ast_tree = ast_parse(data['api_call'])
ast_database.append(ast_tree)
ast_eval = partial(ast_eval, api_database, qa_pairs, ast_database)
return questions, question_ids, ast_eval
return pd.DataFrame(
{
'question_id': question_ids,
'question': questions,
'api_database': [api_database] * len(questions),
'qa_pairs': [qa_pairs] * len(questions),
'ast_database': [ast_database] * len(questions),
'ast_eval': [ast_eval] * len(questions),
'hub': [hub] * len(questions),
}
)
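
Since the refactor moves all per-hub data loading into `get_data_for_hub` and bakes the bound `ast_eval` callable into each DataFrame row, a minimal usage sketch may help. It is not part of the diff: it assumes network access on first run (downloads are cached under `evaluation/gorilla/data/`), and the model answer below is a made-up placeholder standing in for the agent's last message.

```python
# Hedged sketch: consuming one row produced by the new get_data_for_hub().
from evaluation.gorilla.utils import encode_question, get_data_for_hub

df = get_data_for_hub('hf')   # downloads (or re-reads cached) APIBench 'hf' data
row = df.iloc[0]

# Build the prompt shown to the agent, then score a hypothetical answer
# with the ast_eval partial that is bound per row.
instruction = encode_question(row['question'], row['hub'])
model_answer = 'I would use the HuggingFace api call ...'  # hypothetical agent output
correct, hallucination = row['ast_eval'](row['question_id'], model_answer)
print(f'correct={correct}, hallucination={hallucination}')
```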

View File

@ -15,31 +15,9 @@ Further references:
- https://paperswithcode.com/dataset/gpqa
- https://github.com/idavidrein/gpqa
## Setup Environment and LLM Configuration
## Setup Environment
Please follow [this document](https://github.com/OpenDevin/OpenDevin/blob/main/Development.md) to setup local develop environment for OpenDevin.
## Configure OpenDevin and your LLM
Create a `config.toml` file (you can copy from `config.template.toml`) if it does not exist at the root of the workspace.
Add the following configurations:
```toml
# TODO: Change these to the model you want to evaluate
[llm.eval_gpt4_1106_preview]
model = "gpt-4-1106-preview"
api_key = "XXX"
temperature = 0.0
[llm.eval_azure_openai_compatible_model]
model = "AZURE_OPENAI_EXACT_DEPLOYMENT_MODEL_NAME"
base_url = "AZURE_OPENAI_ENDPOINT"
api_key = "AZURE_ENDPOINT_API_KEY"
temperature = 0.0
```
Please follow the instructions [here](../README.md#setup) to set up your local development environment and LLM.
## Run Inference on GPQA Benchmark
'gpqa_main', 'gpqa_diamond', 'gpqa_experts', 'gpqa_extended' -- data split options
@ -55,8 +33,3 @@ like to evaluate. It could also be a release tag like `0.6.2`.
- `num_samples_eval`: Number of samples to evaluate (useful for testing and debugging).
- `data_split`: The data split to evaluate on. Must be one of `gpqa_main`, `gpqa_diamond`, `gpqa_experts`, `gpqa_extended`. Defaults to `gpqa_diamond` as done in the paper.
- `AgentClass`: The agent class to use for evaluation. Currently only `CodeActAgent` is supported.
## Benchmark Evaluation Results
- [] TODO: Finish the evaluation run across the entire benchmark and compile results

View File

@ -17,9 +17,7 @@ TODOs:
"""
import asyncio
import logging
import os
import pathlib
import random
import re
from typing import Callable
@ -29,22 +27,27 @@ from datasets import load_dataset
from evaluation.utils.shared import (
EvalMetadata,
codeact_user_response,
EvalOutput,
make_metadata,
prepare_dataset,
reset_logger_for_multiprocessing,
run_evaluation,
)
from opendevin.controller.agent import Agent
from opendevin.controller.state.state import State
from opendevin.core.config import get_llm_config_arg, get_parser, load_app_config
from opendevin.core.logger import get_console_handler
from opendevin.core.config import (
AppConfig,
SandboxConfig,
get_llm_config_arg,
get_parser,
)
from opendevin.core.logger import opendevin_logger as logger
from opendevin.core.main import run_controller
from opendevin.events.action import Action, AgentFinishAction, MessageAction
from opendevin.core.main import create_runtime, run_controller
from opendevin.events.action import (
Action,
AgentFinishAction,
MessageAction,
)
from opendevin.events.observation import Observation
from opendevin.llm.llm import LLM
config = load_app_config()
ACTION_FORMAT = """
<<FINAL_ANSWER||
@ -53,6 +56,28 @@ ACTION_FORMAT = """
""".strip()
def get_config(
metadata: EvalMetadata,
) -> AppConfig:
config = AppConfig(
default_agent=metadata.agent_class,
run_as_devin=False,
runtime='eventstream',
max_iterations=metadata.max_iterations,
sandbox=SandboxConfig(
container_image='ubuntu:22.04',
enable_auto_lint=True,
use_host_network=False,
update_source_code=True,
),
# do not mount workspace
workspace_base=None,
workspace_mount_path=None,
)
config.set_llm_config(metadata.llm_config)
return config
def gpqa_codeact_user_response(
state: State,
encapsulate_solution: bool = False,
@ -68,11 +93,10 @@ def gpqa_codeact_user_response(
'<execute_bash> exit </execute_bash>\n'
'IMPORTANT: YOU SHOULD NEVER ASK FOR HUMAN HELP TO SOLVE THIS TASK.\n'
)
return msg
AGENT_CLS_TO_FAKE_USER_RESPONSE_FN = {'CodeActAgent': codeact_user_response}
AGENT_CLS_TO_FAKE_USER_RESPONSE_FN = {'CodeActAgent': gpqa_codeact_user_response}
AGENT_CLS_TO_INST_SUFFIX = {
'CodeActAgent': '\n\n SUPER IMPORTANT: When you think you have solved the question, first report it back to the user in the requested format. Only once that is done, in the next turn, please run the following command: <execute_bash> exit </execute_bash>.\n'
@ -146,57 +170,23 @@ def convert_instance_dict(instance):
return out_instance_dict
def process_instance(
async def process_instance(
instance: pd.Series,
metadata: EvalMetadata,
reset_logger: bool = True,
):
# Create the agent
agent = Agent.get_cls(metadata.agent_class)(llm=LLM(config=metadata.llm_config))
old_workspace_mount_path = config.workspace_mount_path
old_workspace_base = config.workspace_base
try:
workspace_mount_path = os.path.join(
config.workspace_mount_path, '_eval_workspace'
)
# create process-specific workspace dir
workspace_mount_path = os.path.join(workspace_mount_path, str(os.getpid()))
pathlib.Path(workspace_mount_path).mkdir(parents=True, exist_ok=True)
config = get_config(metadata)
# reset workspace to config
config.workspace_base = workspace_mount_path
config.workspace_mount_path = workspace_mount_path
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
if reset_logger:
log_dir = os.path.join(metadata.eval_output_dir, 'infer_logs')
reset_logger_for_multiprocessing(logger, instance['instance_id'], log_dir)
else:
logger.info(f'Starting evaluation for instance {instance["instance_id"]}.')
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
if reset_logger:
# Set up logger
log_file = os.path.join(
metadata.eval_output_dir, 'logs', f'instance_{instance.instance_id}.log'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
# add back the console handler to print ONE line
logger.addHandler(get_console_handler())
logger.info(
f'Starting evaluation for instance {instance.instance_id}.\nHint: run "tail -f {log_file}" to see live logs in a separate shell'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
file_handler = logging.FileHandler(log_file)
file_handler.setFormatter(
logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
)
logger.addHandler(file_handler)
else:
logger.info(f'Starting evaluation for instance {instance.instance_id}.')
logger.info(f'Process-specific workspace mounted at {workspace_mount_path}')
# ======= Run the agent on the instance =======
# Prepare instruction for the agent using suggested format in gpqa codebase
instruction = f"""
# ======= Run the agent on the instance =======
# Prepare instruction for the agent using suggested format in gpqa codebase
instruction = f"""
What is the correct answer to this question:\n
{instance['question']}\n
@ -225,109 +215,98 @@ Again do not quit without reporting the answer first.
Ok now it's time to start solving the question. Good luck!
"""
# Here's how you can run the agent (similar to the `main` function) and get the final task state
config.max_iterations = metadata.max_iterations
state: State | None = asyncio.run(
run_controller(
config=config,
task_str=instruction,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN.get(
agent.__class__.__name__
),
agent=agent,
sid=f'gptq_{str(instance.instance_id)}',
)
)
assert state is not None, 'State should not be None.'
runtime = await create_runtime(config, sid=f'gptq_{str(instance.instance_id)}')
# ======= Attempt to evaluate the agent's edits =======
state: State | None = await run_controller(
config=config,
task_str=instruction,
runtime=runtime,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN.get(
metadata.agent_class
),
)
assert state is not None, 'State should not be None.'
question_choices = {
'A': instance['choices'][0],
'B': instance['choices'][1],
'C': instance['choices'][2],
'D': instance['choices'][3],
}
# get the final message from the state history (default to empty if not found)
found_answers = {
'A': False,
'B': False,
'C': False,
'D': False,
}
for event in state.history.get_events(reverse=True):
if (
isinstance(event, AgentFinishAction)
and event.source != 'user'
and '<<FINAL_ANSWER||' in event.thought
):
final_message = event.thought
break
elif (
isinstance(event, MessageAction)
and event.source != 'user'
and '<<FINAL_ANSWER||' in event.content
):
final_message = event.content
break
elif isinstance(event, Observation):
for option, option_text in question_choices.items():
if option_text in event.content:
found_answers[option] = True
else:
final_message = None
# ======= Attempt to evaluate the agent's edits =======
found_options = [option for option, found in found_answers.items() if found]
question_choices = {
'A': instance['choices'][0],
'B': instance['choices'][1],
'C': instance['choices'][2],
'D': instance['choices'][3],
}
# get the final message from the state history (default to empty if not found)
found_answers = {
'A': False,
'B': False,
'C': False,
'D': False,
}
for event in state.history.get_events(reverse=True):
if (
isinstance(event, AgentFinishAction)
and event.source != 'user'
and '<<FINAL_ANSWER||' in event.thought
):
final_message = event.thought
break
elif (
isinstance(event, MessageAction)
and event.source != 'user'
and '<<FINAL_ANSWER||' in event.content
):
final_message = event.content
break
elif isinstance(event, Observation):
for option, option_text in question_choices.items():
if option_text in event.content:
found_answers[option] = True
else:
final_message = None
found_options = [option for option, found in found_answers.items() if found]
logger.info('#############################################')
logger.info(f'Final message generated by the agent: {final_message}')
logger.info('#############################################')
# check if the model output matches the ground truth
test_result = compare_answers(final_message, instance.correct_solution)
if final_message is None and len(found_options) > 0:
_selected = random.choice(found_options)
# if the final message is None, then the agent did not report the answer in the correct format
# so we randomly select one of the found options and compare it with the correct solution
test_result = _selected == instance.correct_solution
logger.info('#############################################')
logger.info(f'Final message generated by the agent: {final_message}')
logger.info('Agent did not report the answer in the correct format.')
logger.info(f'Found options: {found_options}')
logger.info(f'Selected option: {_selected}')
logger.info('#############################################')
# check if the model output matches the ground truth
test_result = compare_answers(final_message, instance.correct_solution)
if final_message is None and len(found_options) > 0:
_selected = random.choice(found_options)
# if the final message is None, then the agent did not report the answer in the correct format
# so we randomly select one of the found options and compare it with the correct solution
test_result = _selected == instance.correct_solution
logger.info('#############################################')
logger.info('Agent did not report the answer in the correct format.')
logger.info(f'Found options: {found_options}')
logger.info(f'Selected option: {_selected}')
logger.info('#############################################')
logger.info('#############################################')
logger.info(f'Test result: {test_result}')
logger.info('#############################################')
logger.info('#############################################')
logger.info(f'Test result: {test_result}')
logger.info('#############################################')
# If you are working on some simpler benchmark that only evaluates the final model output (e.g., in a MessageAction)
# You can simply get the LAST `MessageAction` from the returned `state.history` and parse it for evaluation.
if state is None:
raise ValueError('State should not be None.')
# If you are working on some simpler benchmark that only evaluates the final model output (e.g., in a MessageAction)
# You can simply get the LAST `MessageAction` from the returned `state.history` and parse it for evaluation.
if state is None:
raise ValueError('State should not be None.')
metrics = state.metrics.get() if state.metrics else None
metrics = state.metrics.get() if state.metrics else None
# Save the output
output = {
'task_id': instance.task_id,
'instance_id': instance.instance_id,
'instruction': instruction,
'metadata': metadata.model_dump(),
'history': state.history.compatibility_for_eval_history_pairs(),
'metrics': metrics,
'error': state.last_error if state and state.last_error else None,
'test_result': {
'result': test_result,
'found_answers': found_answers,
'last_message': final_message,
},
}
except Exception:
logger.error('Process instance failed')
raise
finally:
config.workspace_mount_path = old_workspace_mount_path
config.workspace_base = old_workspace_base
# Save the output
output = EvalOutput(
instance_id=str(instance.instance_id),
instruction=instruction,
metadata=metadata,
history=state.history.compatibility_for_eval_history_pairs(),
metrics=metrics,
error=state.last_error if state and state.last_error else None,
test_result={
'result': test_result,
'found_answers': found_answers,
'last_message': final_message,
},
)
return output
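
Both the tail of `ACTION_FORMAT` and the body of `compare_answers` are cut off in the hunks above, so the following is only a hedged sketch of how a marker-delimited answer could be extracted from `final_message`. The closing `||FINAL_ANSWER>>` marker and the helper name are assumptions for illustration, not the repo's actual code.

```python
# Hypothetical helper: pull out the letter wrapped in the
# <<FINAL_ANSWER|| ... ||FINAL_ANSWER>> markers that ACTION_FORMAT asks for.
import re

def extract_final_answer(message: str) -> str | None:
    match = re.search(r'<<FINAL_ANSWER\|\|(.*?)\|\|FINAL_ANSWER>>', message, re.DOTALL)
    return match.group(1).strip() if match else None

assert extract_final_answer('<<FINAL_ANSWER||\nC\n||FINAL_ANSWER>>') == 'C'
```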
@ -343,8 +322,11 @@ if __name__ == '__main__':
)
args, _ = parser.parse_known_args()
llm_config = get_llm_config_arg(args.llm_config) if args.llm_config else config.llm
logger.info(f'Config for evaluation: {config}')
llm_config = None
if args.llm_config:
llm_config = get_llm_config_arg(args.llm_config)
if llm_config is None:
raise ValueError(f'Could not find LLM config: --llm_config {args.llm_config}')
# NOTE: It is preferable to load datasets from huggingface datasets and perform post-processing
# so we don't need to manage file uploading to OpenDevin's repo
@ -355,8 +337,6 @@ if __name__ == '__main__':
gpqa_dataset = gpqa_dataset.to_pandas()
# Add a new column 'instance_id' with the index
gpqa_dataset['instance_id'] = gpqa_dataset.index
gpqa_dataset['task_id'] = gpqa_dataset.index
# gpqa_dataset = dataset['train'].to_pandas().sort_values(by='id').reset_index(drop=True)
if args.agent_cls != 'CodeActAgent':
raise ValueError(
@ -374,15 +354,14 @@ if __name__ == '__main__':
)
output_file = os.path.join(metadata.eval_output_dir, 'output.jsonl')
prepared_dataset = prepare_dataset(
gpqa_dataset, output_file, args.eval_n_limit, 'task_id'
)
prepared_dataset = prepare_dataset(gpqa_dataset, output_file, args.eval_n_limit)
run_evaluation(
dataset=prepared_dataset,
metadata=metadata,
output_file=output_file,
num_workers=args.eval_num_workers,
process_instance_func=process_instance,
id_column='task_id',
asyncio.run(
run_evaluation(
dataset=prepared_dataset,
metadata=metadata,
output_file=output_file,
num_workers=args.eval_num_workers,
process_instance_func=process_instance,
)
)

View File

@ -1,39 +1,10 @@
# HumanEvalFix Evaluation with OpenDevin
Implements evaluation of agents on HumanEvalFix from the HumanEvalPack benchmark introduced in [OctoPack: Instruction Tuning Code Large Language Models](https://arxiv.org/abs/2308.07124). Please see [here](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/bigcode_eval/tasks/humanevalpack.py) for the reference implementation used in the paper.
Implements evaluation of agents on HumanEvalFix from the HumanEvalPack benchmark introduced in [OctoPack: Instruction Tuning Code Large Language Models](https://arxiv.org/abs/2308.07124). Please see [here](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/bigcode_eval/tasks/humanevalpack.py) for the reference implementation used in the paper. Currently only `python` evaluation is supported.
## Setup Environment
## Setup Environment and LLM Configuration
Please follow [this document](https://github.com/OpenDevin/OpenDevin/blob/main/Development.md) to setup local develop environment for OpenDevin.
## Configure OpenDevin and your LLM
Create a `config.toml` file if it does not exist at the root of the workspace.
Add the following configurations:
```toml
[core]
max_iterations = 100
cache_dir = "/tmp/cache"
ssh_hostname = "localhost"
[sandbox]
enable_auto_lint = true
# TODO: Change these to the model you want to evaluate
[llm.eval_gpt4_1106_preview]
model = "gpt-4-1106-preview"
api_key = "XXX"
temperature = 0.0
[llm.eval_some_openai_compatible_model]
model = "openai/MODEL_NAME"
base_url = "https://OPENAI_COMPATIBLE_URL/v1"
api_key = "XXX"
temperature = 0.0
```
Please follow the instructions [here](../README.md#setup) to set up your local development environment and LLM.
## Run Inference on HumanEvalFix

View File

@ -9,9 +9,9 @@ TODOs:
"""
import asyncio
import logging
import os
import pathlib
import tempfile
from typing import Any
import pandas as pd
from datasets import load_dataset
@ -19,20 +19,25 @@ from evaluate import load
from evaluation.utils.shared import (
EvalMetadata,
EvalOutput,
codeact_user_response,
make_metadata,
prepare_dataset,
reset_logger_for_multiprocessing,
run_evaluation,
)
from opendevin.controller.agent import Agent
from opendevin.controller.state.state import State
from opendevin.core.config import get_llm_config_arg, load_app_config, parse_arguments
from opendevin.core.logger import get_console_handler
from opendevin.core.config import (
AppConfig,
SandboxConfig,
get_llm_config_arg,
parse_arguments,
)
from opendevin.core.logger import opendevin_logger as logger
from opendevin.core.main import run_controller
from opendevin.llm.llm import LLM
config = load_app_config()
from opendevin.core.main import create_runtime, run_controller
from opendevin.events.action import CmdRunAction
from opendevin.events.observation import CmdOutputObservation
from opendevin.runtime.runtime import Runtime
IMPORT_HELPER = {
'python': [
@ -72,19 +77,106 @@ AGENT_CLS_TO_INST_SUFFIX = {
}
def get_test_result(instance, path, language='python', timeout=10):
# Evaluation reference: https://github.com/bigcode-project/bigcode-evaluation-harness/blob/84b96da31b7f840b55c5733325346176140cdb6b/bigcode_eval/tasks/humanevalpack.py#L347
def get_config(
metadata: EvalMetadata,
) -> AppConfig:
config = AppConfig(
default_agent=metadata.agent_class,
run_as_devin=False,
runtime='eventstream',
max_iterations=metadata.max_iterations,
sandbox=SandboxConfig(
container_image='ubuntu:22.04',
enable_auto_lint=True,
use_host_network=False,
update_source_code=True,
),
# do not mount workspace
workspace_base=None,
workspace_mount_path=None,
)
config.set_llm_config(metadata.llm_config)
return config
def _get_instance_id(instance: pd.Series) -> str:
return instance.task_id.replace('/', '__')
async def initialize_runtime(
runtime: Runtime,
instance: pd.Series,  # used to build the problem file that is copied into the sandbox
):
"""Initialize the runtime for the agent.
This function is called before the runtime is used to run the agent.
"""
logger.info(f"{'-' * 50} BEGIN Runtime Initialization Fn {'-' * 50}")
obs: CmdOutputObservation
action = CmdRunAction(command='mkdir -p /workspace')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
action = CmdRunAction(command='cd /workspace')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
problem_statement = (
instance.declaration + instance.buggy_solution + '\n' + instance.test
)
filename = f'{_get_instance_id(instance)}.py'
with tempfile.TemporaryDirectory() as tmpdir:
host_script_path = os.path.join(tmpdir, filename)
with open(host_script_path, 'w') as f:
f.write(problem_statement)
await runtime.copy_to(
host_script_path,
'/workspace',
)
# check file exists
action = CmdRunAction(command=f'ls /workspace/{_get_instance_id(instance)}.py')
obs = await runtime.run_action(action)
assert obs.exit_code == 0
logger.info(f"{'-' * 50} END Runtime Initialization Fn {'-' * 50}")
async def complete_runtime(
runtime: Runtime,
instance: pd.Series,  # used to locate the instance's script inside the sandbox
) -> dict[str, Any]:
"""Complete the runtime for the agent.
This function is called after the agent has finished running.
If you need to do something in the sandbox to get the correctness metric after
the agent has run, modify this function.
"""
logger.info(f"{'-' * 50} BEGIN Runtime Completion Fn {'-' * 50}")
obs: CmdOutputObservation
# default value
language = 'python'
timeout = 10
test_result = {'result': {}, 'metadata': {}}
code_metric = load('Muennighoff/code_eval_octopack')
timeout = LANGUAGE_TO_TIMEOUT[language]
num_workers = LANGUAGE_TO_NUM_WORKERS[language]
python_imports = '\n'.join(IMPORT_HELPER[language])
# Load function from path
with open(path, 'r') as f:
function = f.read()
action = CmdRunAction(
command=f'cat /workspace/{_get_instance_id(instance)}.py', keep_prompt=False
)
obs = await runtime.run_action(action)
assert obs.exit_code == 0
function = [[python_imports + '\n' + function.strip()]]
function = obs.content.replace('\r\n', '\n')
logger.info(f'Function: {function}')
function = [[python_imports + '\n' + function]]
results, logs = code_metric.compute(
references=[instance.test],
@ -99,129 +191,79 @@ def get_test_result(instance, path, language='python', timeout=10):
'timeout': timeout,
'num_workers': num_workers,
}
logger.info(f"{'-' * 50} END Runtime Completion Fn {'-' * 50}")
return test_result
def process_instance(
async def process_instance(
instance: pd.Series,
metadata: EvalMetadata,
reset_logger: bool = True,
):
# Create the agent
agent = Agent.get_cls(metadata.agent_class)(llm=LLM(config=metadata.llm_config))
old_workspace_mount_path = config.workspace_mount_path
old_workspace_base = config.workspace_base
) -> EvalOutput:
config = get_config(metadata)
# use a session id for concurrent evaluation
sid = _get_instance_id(instance)
try:
workspace_mount_path = os.path.join(
config.workspace_mount_path, '_eval_workspace'
)
# create process-specific workspace dir
workspace_mount_path = os.path.join(workspace_mount_path, str(os.getpid()))
pathlib.Path(workspace_mount_path).mkdir(parents=True, exist_ok=True)
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
if reset_logger:
log_dir = os.path.join(metadata.eval_output_dir, 'infer_logs')
reset_logger_for_multiprocessing(logger, instance.task_id, log_dir)
else:
logger.info(f'Starting evaluation for instance {instance.task_id}.')
# reset workspace to config
config.workspace_base = workspace_mount_path
config.workspace_mount_path = workspace_mount_path
# Create file with HumanEvalFix problem
# Prompt reference: https://github.com/bigcode-project/bigcode-evaluation-harness/blob/84b96da31b7f840b55c5733325346176140cdb6b/bigcode_eval/tasks/humanevalpack.py#L509
problem_statement = (
instance.declaration + instance.buggy_solution + '\n' + instance.test
)
# use a session id for concurrent evaluation
sid = instance.task_id.replace('/', '__')
# Prepare instruction
instruction = (
f'Please fix the function in {sid}.py such that all test cases pass.\n'
'Environment has been set up for you to start working. You may assume all necessary tools are installed.\n\n'
'# Problem Statement\n'
f'{problem_statement}\n\n'
)
instruction += (
'IMPORTANT: You should ONLY interact with the environment provided to you AND NEVER ASK FOR HUMAN HELP.\n'
'You should NOT modify any existing test case files. If needed, you can add new test cases in a NEW file to reproduce the issue.\n'
'You SHOULD INCLUDE PROPER INDENTATION in your edit commands.\n'
)
# NOTE: You can actually set slightly different instruction for different agents
instruction += AGENT_CLS_TO_INST_SUFFIX[metadata.agent_class]
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
if reset_logger:
# Set up logger
log_file = os.path.join(
metadata.eval_output_dir,
'logs',
f'instance_{sid}.log',
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
# add back the console handler to print ONE line
logger.addHandler(get_console_handler())
logger.info(
f'Starting evaluation for instance {instance.task_id}.\nLOG: tail -f {log_file}'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
file_handler = logging.FileHandler(log_file)
file_handler.setFormatter(
logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
)
logger.addHandler(file_handler)
# Here's how you can run the agent (similar to the `main` function) and get the final task state
runtime = await create_runtime(config, sid=sid)
await initialize_runtime(runtime, instance)
state: State | None = await run_controller(
config=config,
task_str=instruction,
runtime=runtime,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN.get(
metadata.agent_class
),
)
logger.info(f'Process-specific workspace mounted at {workspace_mount_path}')
if state is None:
raise ValueError('State should not be None.')
metrics = state.metrics.get() if state.metrics else None
test_result = await complete_runtime(runtime, instance)
# Create file with HumanEvalFix problem
# Prompt reference: https://github.com/bigcode-project/bigcode-evaluation-harness/blob/84b96da31b7f840b55c5733325346176140cdb6b/bigcode_eval/tasks/humanevalpack.py#L509
problem_statement = (
instance.declaration + instance.buggy_solution + '\n' + instance.test
)
path = os.path.join(workspace_mount_path, f'{sid}.py')
with open(path, 'w') as f:
f.write(problem_statement)
# history is now available as a stream of events, rather than list of pairs of (Action, Observation)
# for compatibility with the existing output format, we can remake the pairs here
# remove when it becomes unnecessary
histories = state.history.compatibility_for_eval_history_pairs()
# Prepare instruction
instruction = (
f'Please fix the function in {instance.task_id.replace("/", "__")}.py such that all test cases pass.\n'
'Environment has been set up for you to start working. You may assume all necessary tools are installed.\n\n'
'# Problem Statement\n'
f'{problem_statement}\n\n'
)
instruction += (
'IMPORTANT: You should ONLY interact with the environment provided to you AND NEVER ASK FOR HUMAN HELP.\n'
'You should NOT modify any existing test case files. If needed, you can add new test cases in a NEW file to reproduce the issue.\n'
'You SHOULD INCLUDE PROPER INDENTATION in your edit commands.\n'
)
# NOTE: You can actually set slightly different instruction for different agents
instruction += AGENT_CLS_TO_INST_SUFFIX[agent.__class__.__name__]
# Here's how you can run the agent (similar to the `main` function) and get the final task state
config.max_iterations = metadata.max_iterations
state: State | None = asyncio.run(
run_controller(
config=config,
task_str=instruction,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN.get(
agent.__class__.__name__
),
agent=agent,
sid=sid,
)
)
# ======= Attempt to evaluate the agent's edits =======
test_result = get_test_result(instance, path)
# If you are working on some simpler benchmark that only evaluates the final model output (e.g., in a MessageAction)
# You can simply get the LAST `MessageAction` from the returned `state.history` and parse it for evaluation.
if state is None:
raise ValueError('State should not be None.')
metrics = state.metrics.get() if state.metrics else None
# history is now available as a stream of events, rather than list of pairs of (Action, Observation)
# for compatibility with the existing output format, we can remake the pairs here
# remove when it becomes unnecessary
histories = state.history.compatibility_for_eval_history_pairs()
# Save the output
output = {
'task_id': instance.task_id,
'instruction': instruction,
'metadata': metadata.model_dump(),
'history': histories,
'metrics': metrics,
'error': state.last_error if state and state.last_error else None,
'test_result': test_result,
}
except Exception:
logger.error('Process instance failed')
raise
finally:
config.workspace_mount_path = old_workspace_mount_path
config.workspace_base = old_workspace_base
# Save the output
output = EvalOutput(
instance_id=instance.task_id,
instruction=instruction,
metadata=metadata,
history=histories,
metrics=metrics,
error=state.last_error if state and state.last_error else None,
test_result=test_result,
)
return output
@ -234,28 +276,31 @@ if __name__ == '__main__':
'bigcode/humanevalpack', 'python'
) # TODO: Support other languages
hefix_tests = dataset['test'].to_pandas()
hefix_tests.rename(columns={'task_id': 'instance_id'}, inplace=True)
id_column = 'task_id'
llm_config = get_llm_config_arg(args.llm_config) if args.llm_config else config.llm
logger.info(f'Config for evaluation: {config}')
llm_config = None
if args.llm_config:
llm_config = get_llm_config_arg(args.llm_config)
if llm_config is None:
raise ValueError(f'Could not find LLM config: --llm_config {args.llm_config}')
metadata = make_metadata(
llm_config,
args.dataset_name,
'humanevalfix-python',
args.agent_cls,
args.max_iterations,
args.eval_note,
args.eval_output_dir,
)
output_file = os.path.join(metadata.eval_output_dir, 'output.jsonl')
instances = prepare_dataset(dataset, output_file, args.eval_n_limit, id_column)
instances = prepare_dataset(hefix_tests, output_file, args.eval_n_limit)
run_evaluation(
instances,
metadata,
output_file,
args.eval_num_workers,
process_instance,
id_column,
asyncio.run(
run_evaluation(
instances,
metadata,
output_file,
args.eval_num_workers,
process_instance,
)
)

evaluation/humanevalfix/scripts/run_infer.sh Normal file → Executable file
View File

@ -0,0 +1,7 @@
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y python3 python3-pip
RUN pip install scitools-pyke
# docker build -t xingyaoww/od_logic_reasoning .

View File

@ -2,38 +2,13 @@
This folder contains the evaluation harness for evaluating agents on the logic reasoning benchmarks [ProntoQA](https://github.com/asaparov/prontoqa) and [ProofWriter](https://allenai.org/data/proofwriter).
## Configure OpenDevin and your LLM
## Setup Environment and LLM Configuration
Create a `config.toml` file if it does not exist at the root of the workspace.
Add the following configurations:
```toml
[core]
max_iterations = 100
cache_dir = "/tmp/cache"
ssh_hostname = "localhost"
[sandbox]
enable_auto_lint = true
# TODO: Change these to the model you want to evaluate
[llm.eval_gpt4_1106_preview_llm]
model = "gpt-4-1106-preview"
api_key = "XXX"
temperature = 0.0
[llm.eval_some_openai_compatible_model_llm]
model = "openai/MODEL_NAME"
base_url = "https://OPENAI_COMPATIBLE_URL/v1"
api_key = "XXX"
temperature = 0.0
```
Please follow the instructions [here](../README.md#setup) to set up your local development environment and LLM.
## Run Inference on logic_reasoning
The following code will run inference on the first example of the ProntoQA dataset,
using OpenDevin 0.6.2 version.
The following code will run inference on the first example of the ProofWriter dataset,
```bash
./evaluation/logic_reasoning/scripts/run_infer.sh ProntoQA eval_gpt4_1106_preview_llm 0.6.2 1
./evaluation/logic_reasoning/scripts/run_infer.sh eval_gpt4_1106_preview_llm ProofWriter
```

View File

@ -3,12 +3,12 @@ you can interact with an interactive Python (Jupyter Notebook) environment and r
In this task, you need to use the code in [[logic_inference_path.py]] to help you. Specifically, you first need to instantiate a **LogicInferenceEngine** class and use the **safe_execute_program** method to prove the **logic programs**. You should receive *answer*, *flag*, *error_message* from the output.
An example would look like this:
<execute_ipython>
import sys
sys.path.append(workspace_mount_path)
engine = LogicInferenceEngine(dataset_name, workspace_mount_path)
answer, flag, error_message = engine.safe_execute_program(logic_programs)
</execute_ipython>
<execute_ipython>
import sys
sys.path.append('/workspace')
engine = LogicInferenceEngine()
answer, flag, error_message = engine.safe_execute_program(logic_programs)
</execute_ipython>
Please send the *answer* variable through message.

View File

@ -191,9 +191,9 @@ class PykeProgram:
class LogicInferenceEngine:
def __init__(self, dataset_name, workspace_mount_path):
self.dataset_name = dataset_name
self.workspace_mount_path = workspace_mount_path
def __init__(self):
self.dataset_name = os.environ.get('DATASET_NAME', 'ProofWriter')
self.workspace_mount_path = '/workspace'
def random_backup(self):
if self.dataset_name == 'ProntoQA':
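
A minimal sketch of the new zero-argument `LogicInferenceEngine` interface follows, assuming the module is importable as `logic_inference` once `initialize_runtime` has copied it to `/workspace` and exported `DATASET_NAME`; the import path and the logic-programs string are placeholders/assumptions, not code from the diff.

```python
# Hedged sketch of the env-driven engine (mirrors the updated instruction.txt):
# DATASET_NAME is normally injected by the harness via runtime.add_env_vars,
# and logic_inference.py is copied into /workspace by initialize_runtime.
import os
import sys

os.environ.setdefault('DATASET_NAME', 'ProofWriter')  # normally set by the harness
sys.path.append('/workspace')

from logic_inference import LogicInferenceEngine  # assumed module name in the sandbox

engine = LogicInferenceEngine()
answer, flag, error_message = engine.safe_execute_program('...logic programs here...')
print(answer, flag, error_message)
```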

View File

@ -1,29 +1,35 @@
import asyncio
import logging
import os
import pathlib
import shutil
import pandas as pd
from datasets import load_dataset
from evaluation.swe_bench.swe_env_box import DockerSSHBox
from evaluation.utils.shared import (
EvalMetadata,
EvalOutput,
codeact_user_response,
make_metadata,
prepare_dataset,
reset_logger_for_multiprocessing,
run_evaluation,
)
from opendevin.controller.agent import Agent
from opendevin.controller.state.state import State
from opendevin.core.config import get_llm_config_arg, get_parser, load_app_config
from opendevin.core.logger import get_console_handler
from opendevin.core.config import (
AppConfig,
SandboxConfig,
get_llm_config_arg,
get_parser,
)
from opendevin.core.logger import opendevin_logger as logger
from opendevin.core.main import run_controller
from opendevin.llm.llm import LLM
config = load_app_config()
from opendevin.core.main import create_runtime, run_controller
from opendevin.events.action import (
AgentFinishAction,
CmdRunAction,
IPythonRunCellAction,
MessageAction,
)
from opendevin.events.observation import CmdOutputObservation
from opendevin.runtime.runtime import Runtime
AGENT_CLS_TO_FAKE_USER_RESPONSE_FN = {
'CodeActAgent': codeact_user_response,
@ -34,6 +40,29 @@ AGENT_CLS_TO_INST_SUFFIX = {
}
def get_config(
metadata: EvalMetadata,
) -> AppConfig:
config = AppConfig(
default_agent=metadata.agent_class,
run_as_devin=False,
runtime='eventstream',
max_iterations=metadata.max_iterations,
sandbox=SandboxConfig(
container_image='xingyaoww/od-eval-logic-reasoning:v1.0',
enable_auto_lint=True,
use_host_network=False,
update_source_code=True,
od_runtime_extra_deps='$OD_INTERPRETER_PATH -m pip install scitools-pyke',
),
# do not mount workspace
workspace_base=None,
workspace_mount_path=None,
)
config.set_llm_config(metadata.llm_config)
return config
def get_choice(answer_str):
choices = [
'A',
@ -83,7 +112,7 @@ def get_test_result(
'the correct answer is',
'The correct answer is',
'The correct option is',
'Thus, the answer is',
'the answer is',
]
if prediction is None:
for indicator in indicators:
@ -97,162 +126,143 @@ def get_test_result(
return test_result
def process_instance(
CUR_EVAL_DIR = os.path.dirname(__file__)
async def initialize_runtime(
runtime: Runtime,
instance: pd.Series, # this argument is not required
):
"""Initialize the runtime for the agent.
This function is called before the runtime is used to run the agent.
"""
logger.info(f"{'-' * 50} BEGIN Runtime Initialization Fn {'-' * 50}")
obs: CmdOutputObservation
# Prepare the /workspace directory
action = CmdRunAction(command='mkdir -p /workspace')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
action = CmdRunAction(command='cd /workspace')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
# copy logic_inference.py to /workspace
await runtime.copy_to(
os.path.join(CUR_EVAL_DIR, 'logic_inference.py'), '/workspace'
)
# check if the file exists
obs = await runtime.run_action(CmdRunAction(command='ls /workspace'))
assert obs.exit_code == 0
assert 'logic_inference.py' in obs.content
await runtime.add_env_vars({'DATASET_NAME': metadata.dataset})
action = CmdRunAction(command='mkdir -p /workspace/.cache_program')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
action = IPythonRunCellAction(code='%pip install scitools-pyke')
logger.info(action, extra={'msg_type': 'ACTION'})
ipynb_obs = await runtime.run_action(action)
logger.info(ipynb_obs, extra={'msg_type': 'OBSERVATION'})
logger.info(f"{'-' * 50} END Runtime Initialization Fn {'-' * 50}")
# Prepare instruction
with open(os.path.join(CUR_EVAL_DIR, 'instruction.txt'), 'r') as f:
INSTRUCTION_TEMPLATE = f.read()
async def process_instance(
instance: pd.Series,
metadata: EvalMetadata,
reset_logger: bool = True,
):
# Create the agent
agent = Agent.get_cls(metadata.agent_class)(llm=LLM(config=metadata.llm_config))
old_workspace_mount_path = config.workspace_mount_path
old_workspace_base = config.workspace_base
config = get_config(metadata)
try:
workspace_mount_path = os.path.join(
config.workspace_mount_path, '_eval_workspace'
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
if reset_logger:
log_dir = os.path.join(metadata.eval_output_dir, 'infer_logs')
reset_logger_for_multiprocessing(logger, instance['instance_id'], log_dir)
else:
logger.info(f'Starting evaluation for instance {instance["instance_id"]}.')
instance_logic_programs = instance['raw_logic_programs'][0].strip()
instruction = (
INSTRUCTION_TEMPLATE.replace('[[dataset_name]]', dataset_name)
.replace('[[logic_programs]]', instance_logic_programs)
.replace('[[logic_inference_path.py]]', '/workspace/logic_inference.py')
)
# NOTE: You can actually set slightly different instruction for different agents
instruction += AGENT_CLS_TO_INST_SUFFIX[metadata.agent_class]
# use a session id for concurrent evaluation
sid = instance['instance_id']
runtime = await create_runtime(config, sid=sid)
await initialize_runtime(runtime, instance)
# Here's how you can run the agent (similar to the `main` function) and get the final task state
state: State | None = asyncio.run(
run_controller(
config=config,
task_str=instruction,
runtime=runtime,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN.get(
metadata.agent_class
),
)
# create process-specific workspace dir
workspace_mount_path = os.path.join(workspace_mount_path, str(os.getpid()))
pathlib.Path(workspace_mount_path).mkdir(parents=True, exist_ok=True)
)
# ======= Attempt to evaluate the agent's edits =======
# If you are working on simpler benchmark that only evaluates the final model output (e.g., in a MessageAction)
# You can simply get the LAST `MessageAction` from the returned `state.history` and parse it for evaluation.
# reset workspace to config
config.workspace_base = workspace_mount_path
config.workspace_mount_path = workspace_mount_path
if state is None:
raise ValueError('State should not be None.')
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
if reset_logger:
# Set up logger
log_file = os.path.join(
metadata.eval_output_dir, 'logs', f'instance_{instance["id"]}.log'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
# add back the console handler to print ONE line
logger.addHandler(get_console_handler())
logger.info(
f'Starting evaluation for instance {instance["id"]}.\nLOG: tail -f {log_file}'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
file_handler = logging.FileHandler(log_file)
file_handler.setFormatter(
logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
)
logger.addHandler(file_handler)
final_message = ''
for event in state.history.get_events(reverse=True):
if isinstance(event, AgentFinishAction):
final_message = event.thought
break
elif isinstance(event, MessageAction):
final_message = event.content
break
logger.info(f'Process-specific workspace mounted at {workspace_mount_path}')
final_message = final_message.strip("'")
logger.info(
f'Predicted answer: {final_message}, Ground truth: {instance["answer"]}'
)
# sandbox = DockerSSHBox()
logic_inference_path = os.path.join(workspace_mount_path, 'logic_inference.py')
if not os.path.exists(logic_inference_path):
shutil.copyfile(
'./evaluation/logic_reasoning/logic_inference.py', logic_inference_path
)
logger.info(f'logic_inference.py copied to {workspace_mount_path}')
test_result = get_test_result(
model_answer=final_message, ground_truth=instance['answer']
)
test_result['final_message'] = final_message
cache_dir = os.path.join(workspace_mount_path, '.cache_program')
if not os.path.exists(cache_dir):
os.makedirs(cache_dir)
# Prepare instruction
with open('./evaluation/logic_reasoning/instruction.txt', 'r') as f:
instruction = f.read()
instance_logic_programs = instance['raw_logic_programs'][0].strip()
instruction = instruction.replace('[[dataset_name]]', dataset_name)
instruction = instruction.replace('[[logic_programs]]', instance_logic_programs)
instruction = instruction.replace(
'[[logic_inference_path.py]]', logic_inference_path
)
# NOTE: You can actually set slightly different instruction for different agents
instruction += AGENT_CLS_TO_INST_SUFFIX[agent.__class__.__name__]
# use a session id for concurrent evaluation
sid = instance['id'] + '_' + str(os.getpid())
sandbox = DockerSSHBox(
config=config.sandbox,
persist_sandbox=False,
workspace_mount_path=config.workspace_mount_path,
sandbox_workspace_dir=config.workspace_mount_path_in_sandbox,
cache_dir=config.cache_dir,
run_as_devin=config.run_as_devin,
sid=sid,
)
exit_code, command_output = sandbox.execute('pip install scitools-pyke')
# Here's how you can run the agent (similar to the `main` function) and get the final task state
config.max_iterations = metadata.max_iterations
state: State | None = asyncio.run(
run_controller(
config=config,
task_str=instruction,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN.get(
agent.__class__.__name__
),
agent=agent,
sandbox=sandbox,
sid=sid,
)
)
# ======= Attempt to evaluate the agent's edits =======
# If you are working on simpler benchmark that only evaluates the final model output (e.g., in a MessageAction)
# You can simply get the LAST `MessageAction` from the returned `state.history` and parse it for evaluation.
if state is None:
raise ValueError('State should not be None.')
final_message = ''
messages = []
for event in state.history.get_events(reverse=True):
# will this be a MessageAction?
# TODO we can filter for types of events if we know what to expect
messages.append(event.content)
if str(event.content) in ["'A'", "'B'", "'C'"]:
final_message = event.content
break
final_message = final_message.strip("'")
logger.info(
f'Predicted answer: {final_message}, Ground truth: {instance["answer"]}'
)
test_result = get_test_result(
model_answer=final_message, ground_truth=instance['answer']
)
metrics = state.metrics.get() if state.metrics else None
# history is now available as a stream of events, rather than list of pairs of (Action, Observation)
# for compatibility with the existing output format, we can remake the pairs here
# remove when it becomes unnecessary
histories = state.history.compatibility_for_eval_history_pairs()
# Save the output
output = {
'id': instance['id'],
'instance': instance,
'instruction': instruction,
# 'metadata': metadata.model_dump(),
'history': histories,
'metrics': metrics,
'final_message': final_message,
'messages': messages,
'error': state.last_error if state and state.last_error else None,
'test_result': test_result,
}
except Exception:
logger.error('Process instance failed')
raise
finally:
config.workspace_mount_path = old_workspace_mount_path
config.workspace_base = old_workspace_base
# Close the sandbox
sandbox.close()
metrics = state.metrics.get() if state.metrics else None
# history is now available as a stream of events, rather than list of pairs of (Action, Observation)
# for compatibility with the existing output format, we can remake the pairs here
# remove when it becomes unnecessary
histories = state.history.compatibility_for_eval_history_pairs()
# Save the output
output = EvalOutput(
instance_id=instance['instance_id'],
instruction=instruction,
metadata=metadata,
history=histories,
metrics=metrics,
error=state.last_error if state and state.last_error else None,
test_result=test_result,
)
return output
@ -262,7 +272,7 @@ if __name__ == '__main__':
'--dataset',
type=str,
help='the logic reasoning dataset to evaluate on {ProntoQA, ProofWriter}',
default='ProntoQA',
default='ProofWriter',
)
parser.add_argument(
'--data_split',
@ -270,36 +280,32 @@ if __name__ == '__main__':
help='data split to evaluate on {validation}', # right now we only support validation split
default='validation',
)
args, _ = parser.parse_known_args()
if args.directory:
config.workspace_base = os.path.abspath(args.directory)
print(f'Setting workspace base to {config.workspace_base}')
dataset_name = args.dataset
data_split = args.data_split
dataset = load_dataset(f'renma/{dataset_name}')
logic_reasoning_tests = dataset[data_split]
dataset_df = dataset[data_split].to_pandas()
dataset_df.rename(columns={'id': 'instance_id'}, inplace=True)
id_column = 'id'
llm_config = get_llm_config_arg(args.llm_config) if args.llm_config else config.llm
logger.info(f'Config for evaluation: {config}')
llm_config = None
if args.llm_config:
llm_config = get_llm_config_arg(args.llm_config)
if llm_config is None:
raise ValueError(f'Could not find LLM config: --llm_config {args.llm_config}')
metadata = make_metadata(
llm_config,
args.dataset_name,
dataset_name,
args.agent_cls,
args.max_iterations,
args.eval_note,
args.eval_output_dir,
)
output_file = os.path.join(metadata.eval_output_dir, 'output.jsonl')
instances = prepare_dataset(dataset, output_file, args.eval_n_limit, id_column)
run_evaluation(
instances,
metadata,
output_file,
args.eval_num_workers,
process_instance,
id_column,
instances = prepare_dataset(dataset_df, output_file, args.eval_n_limit)
asyncio.run(
run_evaluation(
instances, metadata, output_file, args.eval_num_workers, process_instance
)
)

evaluation/logic_reasoning/scripts/run_infer.sh Normal file → Executable file

@ -3,8 +3,8 @@ set -eo pipefail
source "evaluation/utils/version_control.sh"
DATASET=$1
MODEL_CONFIG=$2
MODEL_CONFIG=$1
DATASET=$2
COMMIT_HASH=$3
EVAL_LIMIT=$4
AGENT=$5
@ -23,6 +23,11 @@ if [ -z "$AGENT" ]; then
AGENT="CodeActAgent"
fi
if [ -z "$DATASET" ]; then
echo "Dataset not specified, use default ProofWriter"
DATASET="ProofWriter"
fi
get_agent_version
echo "AGENT: $AGENT"


@ -0,0 +1,10 @@
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y python3 python3-pip git
RUN git clone https://github.com/Farama-Foundation/miniwob-plusplus.git /miniwob-plusplus && \
git -C "/miniwob-plusplus" reset --hard 7fd85d71a4b60325c6585396ec4f48377d049838
ENV MINIWOB_URL="file:///miniwob-plusplus/miniwob/html/miniwob/"
# docker build -t xingyaoww/od-eval-miniwob .


@ -2,52 +2,9 @@
This folder contains the evaluation harness for the [MiniWoB++](https://miniwob.farama.org/) benchmark, powered by [BrowserGym](https://github.com/ServiceNow/BrowserGym), which makes it easy to evaluate how well a browsing-capable agent performs on synthetic web browsing tasks.
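As a quick orientation, the sketch below (mirroring how `run_infer.py` further down builds its task list) enumerates the MiniWoB++ tasks that BrowserGym registers as gym environments; it assumes `browsergym` and its miniwob extra are installed in your environment.
```python
# Minimal sketch: list the MiniWoB++ tasks BrowserGym registers as gym environments.
import browsergym.miniwob  # noqa: F401  # importing registers the miniwob tasks
import gymnasium as gym

miniwob_task_ids = [
    env_id
    for env_id in gym.envs.registry.keys()
    if env_id.startswith('browsergym/miniwob')
]
print(f'{len(miniwob_task_ids)} MiniWoB++ tasks registered, e.g. {miniwob_task_ids[:3]}')
```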
## Setup OpenDevin Environment
## Setup Environment and LLM Configuration
Please follow [this document](https://github.com/OpenDevin/OpenDevin/blob/main/Development.md) to set up a local development environment for OpenDevin.
## Configure OpenDevin and your LLM
Create a `config.toml` file if it does not exist at the root of the workspace.
Add the following configurations:
```toml
[core]
max_iterations = 100
cache_dir = "/tmp/cache"
ssh_hostname = "localhost"
[sandbox]
box_type = "ssh"
timeout = 120
# TODO: Change these to the model you want to evaluate
[llm.eval_gpt4_1106_preview]
model = "gpt-4-1106-preview"
api_key = "XXX"
temperature = 0.0
[llm.eval_some_openai_compatible_model]
model = "openai/MODEL_NAME"
base_url = "https://OPENAI_COMPATIBLE_URL/v1"
api_key = "XXX"
temperature = 0.0
```
## Setup MiniWoB++ Environment and Environment Variables of MiniWoB++
MiniWoB++ requires you to serve a static copy of its website that is accessible via URL to the machine running the OpenDevin agents.
- Clone miniwob (use a specific frozen commit for reproducibility)
```sh
git clone git@github.com:Farama-Foundation/miniwob-plusplus.git
git -C "./miniwob-plusplus" reset --hard 7fd85d71a4b60325c6585396ec4f48377d049838
```
- Setup Miniwob URL (change `PATH_TO_MINIWOB_CLONED_REPO` here to the absolute path to your `miniwob-plusplus` folder) in `evaluation/miniwob/scripts/run_infer.sh`
```sh
export MINIWOB_URL="file://<PATH_TO_MINIWOB_CLONED_REPO>/miniwob/html/miniwob/"
```
Please follow the instructions [here](../README.md#setup) to set up your local development environment and LLM.
## Test if your environment works
@ -56,7 +13,7 @@ Access with browser the above MiniWoB URLs and see if they load correctly.
## Run Evaluation
```sh
bash evaluation/miniwob/scripts/run_infer.sh
./evaluation/miniwob/scripts/run_infer.sh llm.claude-35-sonnet-eval
```
Results will be in `evaluation/evaluation_outputs/outputs/miniwob/`
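Each line of `output.jsonl` is one serialized `EvalOutput`. A minimal sketch for aggregating rewards, assuming the `test_result = {'reward': ...}` shape written by `run_infer.py` below (adjust the path to your actual run directory):
```python
import json

# Hypothetical path; point this at your run's output file.
output_file = 'evaluation/evaluation_outputs/outputs/miniwob/output.jsonl'

rewards = []
with open(output_file) as f:
    for line in f:
        record = json.loads(line)
        rewards.append(record['test_result']['reward'])

print(f'{sum(r > 0 for r in rewards)}/{len(rewards)} tasks solved (positive reward)')
```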


@ -1,7 +1,7 @@
import asyncio
import json
import logging
import os
from typing import Any
import browsergym.miniwob # noqa F401 register miniwob tasks as gym environments
import gymnasium as gym
@ -9,91 +9,132 @@ import pandas as pd
from evaluation.utils.shared import (
EvalMetadata,
EvalOutput,
make_metadata,
prepare_dataset,
reset_logger_for_multiprocessing,
run_evaluation,
)
from opendevin.controller.agent import Agent
from opendevin.controller.state.state import State
from opendevin.core.config import get_llm_config_arg, load_app_config, parse_arguments
from opendevin.core.logger import get_console_handler
from opendevin.core.config import (
AppConfig,
SandboxConfig,
get_llm_config_arg,
parse_arguments,
)
from opendevin.core.logger import opendevin_logger as logger
from opendevin.core.main import run_controller
from opendevin.llm.llm import LLM
from opendevin.runtime.docker.ssh_box import DockerSSHBox
from opendevin.runtime.tools import RuntimeTool
config = load_app_config()
from opendevin.core.main import create_runtime, run_controller
from opendevin.events.action import (
BrowseInteractiveAction,
CmdRunAction,
MessageAction,
)
from opendevin.events.observation import CmdOutputObservation
from opendevin.runtime.browser.browser_env import (
BROWSER_EVAL_GET_GOAL_ACTION,
BROWSER_EVAL_GET_REWARDS_ACTION,
)
from opendevin.runtime.runtime import Runtime
SUPPORTED_AGENT_CLS = {'BrowsingAgent'}
docker_ssh_box: DockerSSHBox | None = None
def get_config(
metadata: EvalMetadata,
env_id: str,
) -> AppConfig:
config = AppConfig(
default_agent=metadata.agent_class,
run_as_devin=False,
runtime='eventstream',
max_iterations=metadata.max_iterations,
sandbox=SandboxConfig(
container_image='xingyaoww/od-eval-miniwob:v1.0',
enable_auto_lint=True,
use_host_network=False,
update_source_code=True,
browsergym_eval_env=env_id,
),
# do not mount workspace
workspace_base=None,
workspace_mount_path=None,
)
config.set_llm_config(metadata.llm_config)
return config
def get_sandbox():
global docker_ssh_box
if docker_ssh_box is None:
docker_ssh_box = DockerSSHBox(
config=config.sandbox,
persist_sandbox=False,
workspace_mount_path=config.workspace_mount_path,
sandbox_workspace_dir=config.workspace_mount_path_in_sandbox,
cache_dir=config.cache_dir,
run_as_devin=config.run_as_devin,
)
return docker_ssh_box
async def initialize_runtime(
runtime: Runtime,
) -> str:
"""Initialize the runtime for the agent.
This function is called before the runtime is used to run the agent.
"""
logger.info(f"{'-' * 50} BEGIN Runtime Initialization Fn {'-' * 50}")
obs: CmdOutputObservation
# Create the workspace directory
action = CmdRunAction(command='mkdir -p /workspace')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
action = BrowseInteractiveAction(browser_actions=BROWSER_EVAL_GET_GOAL_ACTION)
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
goal = obs.content
logger.info(f"{'-' * 50} END Runtime Initialization Fn {'-' * 50}")
return goal
def process_instance(
async def complete_runtime(
runtime: Runtime,
) -> dict[str, Any]:
"""Complete the runtime for the agent.
This function is called after the agent has finished running.
If you need to do something in the sandbox to get the correctness metric after
the agent has run, modify this function.
"""
logger.info(f"{'-' * 50} BEGIN Runtime Completion Fn {'-' * 50}")
obs: CmdOutputObservation
action = BrowseInteractiveAction(browser_actions=BROWSER_EVAL_GET_REWARDS_ACTION)
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
logger.info(f"{'-' * 50} END Runtime Completion Fn {'-' * 50}")
return {
'rewards': json.loads(obs.content),
}
async def process_instance(
instance: pd.Series,
metadata: EvalMetadata,
reset_logger: bool = True,
):
# Create the agent
agent = Agent.get_cls(metadata.agent_class)(llm=LLM(config=metadata.llm_config))
) -> EvalOutput:
env_id = instance.id
config = get_config(metadata, env_id)
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
if reset_logger:
# Set up logger
log_file = os.path.join(
metadata.eval_output_dir, 'logs', f'instance_{env_id}.log'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
# add back the console handler to print ONE line
logger.addHandler(get_console_handler())
logger.info(
f'Starting evaluation for instance {env_id}.\nHint: run "tail -f {log_file}" to see live logs in a separate shell'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
file_handler = logging.FileHandler(log_file)
file_handler.setFormatter(
logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
)
logger.addHandler(file_handler)
log_dir = os.path.join(metadata.eval_output_dir, 'infer_logs')
reset_logger_for_multiprocessing(logger, env_id, log_dir)
else:
logger.info(f'Starting evaluation for instance {env_id}.')
# Here's how you can run the agent (similar to the `main` function) and get the final task state
runtime_tools_config = {
RuntimeTool.BROWSER: {
'browsergym_eval': env_id,
'browsergym_eval_save_dir': metadata.eval_output_dir,
}
}
runtime = await create_runtime(config, sid=env_id)
task_str = await initialize_runtime(runtime)
config.max_iterations = metadata.max_iterations
state: State | None = asyncio.run(
run_controller(
config=config,
task_str='PLACEHOLDER_GOAL',
runtime_tools_config=runtime_tools_config,
agent=agent,
sandbox=get_sandbox(),
sid=env_id,
task_str=task_str, # take output from initialize_runtime
runtime=runtime,
)
)
@ -106,18 +147,17 @@ def process_instance(
raise ValueError('State should not be None.')
metrics = state.metrics.get() if state.metrics else None
browsergym_eval_dir = os.path.join(metadata.eval_output_dir, env_id.split('/')[1])
# read goal
with open(
os.path.join(browsergym_eval_dir, 'goal.txt'), 'r', encoding='utf-8'
) as f:
instruction = f.read()
# read reward
with open(
os.path.join(browsergym_eval_dir, 'rewards.json'), 'r', encoding='utf-8'
) as f:
rewards = json.load(f)
reward = max(rewards)
# Instruction is the first message from the USER
instruction = ''
for event in state.history.get_events():
if isinstance(event, MessageAction):
instruction = event.content
break
return_val = await complete_runtime(runtime)
logger.info(f'Return value from complete_runtime: {return_val}')
reward = max(return_val['rewards'])
# history is now available as a stream of events, rather than list of pairs of (Action, Observation)
# for compatibility with the existing output format, we can remake the pairs here
@ -125,16 +165,17 @@ def process_instance(
histories = state.history.compatibility_for_eval_history_pairs()
# Save the output
output = {
'instance_id': env_id,
'instruction': instruction,
'metadata': metadata.model_dump(),
'history': histories,
'metrics': metrics,
'error': state.last_error if state and state.last_error else None,
'test_result': reward,
}
output = EvalOutput(
instance_id=env_id,
instruction=instruction,
metadata=metadata,
history=histories,
metrics=metrics,
error=state.last_error if state and state.last_error else None,
test_result={
'reward': reward,
},
)
return output
@ -143,7 +184,7 @@ if __name__ == '__main__':
dataset = pd.DataFrame(
{
'id': [
'instance_id': [
id
for id in gym.envs.registry.keys()
if id.startswith('browsergym/miniwob')
@ -151,26 +192,25 @@ if __name__ == '__main__':
}
)
id_column = 'id'
llm_config = get_llm_config_arg(args.llm_config) if args.llm_config else config.llm
logger.info(f'Config for evaluation: {config}')
llm_config = None
if args.llm_config:
llm_config = get_llm_config_arg(args.llm_config)
if llm_config is None:
raise ValueError(f'Could not find LLM config: --llm_config {args.llm_config}')
metadata = make_metadata(
llm_config,
args.dataset_name,
'miniwob',
args.agent_cls,
args.max_iterations,
args.eval_note,
args.eval_output_dir,
)
output_file = os.path.join(metadata.eval_output_dir, 'output.jsonl')
instances = prepare_dataset(dataset, output_file, args.eval_n_limit, id_column)
_ = get_sandbox() # Initialize the sandbox
run_evaluation(
instances,
metadata,
output_file,
args.eval_num_workers,
process_instance,
id_column,
instances = prepare_dataset(dataset, output_file, args.eval_n_limit)
asyncio.run(
run_evaluation(
instances, metadata, output_file, args.eval_num_workers, process_instance
)
)

evaluation/miniwob/scripts/run_infer.sh Normal file → Executable file

@ -3,14 +3,10 @@ set -eo pipefail
source "evaluation/utils/version_control.sh"
# configure miniwob website, change URL to yours
export MINIWOB_URL="file:///home/fangzhex/miniwob-plusplus/miniwob/html/miniwob/"
# configure browsing agent
export USE_NAV="false"
export USE_CONCISE_ANSWER="true"
MODEL_CONFIG=$1
COMMIT_HASH=$2
AGENT=$3
@ -42,7 +38,7 @@ COMMAND="poetry run python evaluation/miniwob/run_infer.py \
--llm-config $MODEL_CONFIG \
--max-iterations 10 \
--max-chars 10000000 \
--eval-num-workers $NUM_WORKERS \
--eval-num-workers $NUM_WORKERS"
if [ -n "$EVAL_LIMIT" ]; then
echo "EVAL_LIMIT: $EVAL_LIMIT"


@ -0,0 +1,10 @@
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y python3 python3-pip git gcc
WORKDIR /root
COPY requirements.txt .
RUN pip install -r requirements.txt
# docker build -t xingyaoww/od-eval-mint:v1.0 .


@ -2,9 +2,11 @@
This folder contains the evaluation harness for the [MINT benchmark](https://arxiv.org/abs/2309.10691), which evaluates LLMs' ability to solve tasks through multi-turn interactions.
## Configure OpenDevin and LM
We support evaluation of the [Eurus subset focused on math and code reasoning](https://arxiv.org/abs/2404.02078), including MATH, MMLU, TheoremQA, HumanEval, and MBPP.
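As a rough sketch of how these subsets are pulled in (this mirrors the subset handling in `run_infer.py` further down; the HuggingFace dataset name is taken from that file and may change):
```python
import pandas as pd
from datasets import load_dataset

# Eurus subsets evaluated by evaluation/mint/run_infer.py
SUBSETS = ['math', 'mmlu', 'theoremqa', 'mbpp', 'humaneval']

frames = []
for subset in SUBSETS:
    ds = load_dataset('ryanhoangt/xingyaoww-mint-bench', name=subset, split='test')
    df = ds.to_pandas().rename(columns={'id': 'instance_id'})
    # Prefix ids with the subset name, as the harness does.
    df['instance_id'] = df['instance_id'].apply(lambda x, s=subset: f'{s}/{x}')
    frames.append(df)

dataset_df = pd.concat(frames)
print(f'Loaded {len(dataset_df)} MINT instances across {len(SUBSETS)} subsets')
```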
Create a `config.toml` file if it does not exist at the root of the workspace. Please check [README.md](../../README.md) for how to set this up.
## Setup Environment and LLM Configuration
Please follow the instructions [here](../README.md#setup) to set up your local development environment and LLM.
## Start the evaluation


@ -1,33 +1,36 @@
import asyncio
import functools
import logging
import os
import pathlib
from typing import Any, Dict
import pandas as pd
from datasets import load_dataset
from evaluation.swe_bench.swe_env_box import DockerSSHBox
from evaluation.mint.datatypes import TaskState
from evaluation.mint.env import SimplifiedEnv
from evaluation.mint.prompts import ToolPromptTemplate
from evaluation.mint.tasks import Task
from evaluation.utils.shared import (
EvalMetadata,
EvalOutput,
make_metadata,
prepare_dataset,
reset_logger_for_multiprocessing,
run_evaluation,
)
from opendevin.controller.agent import Agent
from opendevin.controller.state.state import State
from opendevin.core.config import get_llm_config_arg, get_parser, load_app_config
from opendevin.core.logger import get_console_handler
from opendevin.core.config import (
AppConfig,
SandboxConfig,
get_llm_config_arg,
get_parser,
)
from opendevin.core.logger import opendevin_logger as logger
from opendevin.core.main import run_controller
from opendevin.llm.llm import LLM
from .datatypes import TaskState
from .env import SimplifiedEnv
from .prompts import ToolPromptTemplate
from .tasks import Task
config = load_app_config()
from opendevin.core.main import create_runtime, run_controller
from opendevin.events.action import (
CmdRunAction,
)
from opendevin.events.observation import CmdOutputObservation
from opendevin.runtime.runtime import Runtime
def codeact_user_response_mint(state: State, task: Task, task_config: Dict[str, int]):
@ -42,7 +45,7 @@ def codeact_user_response_mint(state: State, task: Task, task_config: Dict[str,
last_action = state.history.get_last_action()
result_state: TaskState = env.step(last_action.message or '')
state.task_state = result_state
state.extra_data['task_state'] = result_state
if not result_state.latest_output:
# Task is finished
@ -62,85 +65,108 @@ AGENT_CLS_TO_INST_SUFFIX = {
'CodeActAgent': '\nIMPORTANT: When your answer is confirmed by the user to be correct, you can exit using the following command: <execute_bash> exit </execute_bash>.\n'
}
with open(os.path.join(os.path.dirname(__file__), 'requirements.txt'), 'r') as f:
MINT_DEPENDENCIES = f.read().splitlines()
def process_instance(
def load_incontext_example(task_name: str, with_tool: bool = True):
assert with_tool, 'NOT with_tool is not supported yet'
subset = {
'gsm8k': 'reasoning',
'math': 'reasoning',
'mmlu': 'reasoning',
'theoremqa': 'reasoning',
'mbpp': 'mbpp',
'humaneval': 'humaneval',
}[task_name]
with open(
os.path.join(
os.path.dirname(__file__),
'tasks',
'in_context_examples',
subset,
'with_tool.txt',
),
'r',
) as f:
return f.read()
def get_config(
metadata: EvalMetadata,
) -> AppConfig:
config = AppConfig(
default_agent=metadata.agent_class,
run_as_devin=False,
runtime='eventstream',
max_iterations=metadata.max_iterations,
sandbox=SandboxConfig(
container_image='xingyaoww/od-eval-mint:v1.0',
enable_auto_lint=True,
use_host_network=False,
update_source_code=True,
od_runtime_extra_deps=f'$OD_INTERPRETER_PATH -m pip install {" ".join(MINT_DEPENDENCIES)}',
),
# do not mount workspace
workspace_base=None,
workspace_mount_path=None,
)
config.set_llm_config(metadata.llm_config)
return config
async def initialize_runtime(runtime: Runtime):
"""Initialize the runtime for the agent.
This function is called before the runtime is used to run the agent.
"""
logger.info(f"{'-' * 50} BEGIN Runtime Initialization Fn {'-' * 50}")
obs: CmdOutputObservation
# Create the workspace directory
action = CmdRunAction(command='mkdir -p /workspace')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
action = CmdRunAction(command='cd /workspace')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
logger.info(f"{'-' * 50} END Runtime Initialization Fn {'-' * 50}")
async def process_instance(
instance: Any,
metadata: EvalMetadata,
reset_logger: bool = True,
):
agent = Agent.get_cls(metadata.agent_class)(llm=LLM(metadata.llm_config))
workspace_mount_path = os.path.join(config.workspace_mount_path, '_eval_workspace')
# create process-specific workspace dir
workspace_mount_path = os.path.join(workspace_mount_path, str(os.getpid()))
pathlib.Path(workspace_mount_path).mkdir(parents=True, exist_ok=True)
config = get_config(metadata)
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
if reset_logger:
# Set up logger
log_file = os.path.join(
metadata.eval_output_dir, 'logs', f'instance_{instance.task_id}.log'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
# add back the console handler to print ONE line
logger.addHandler(get_console_handler())
logger.info(
f'Starting evaluation for instance {instance.task_id}.\nHint: run "tail -f {log_file}" to see live logs in a separate shell'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
file_handler = logging.FileHandler(log_file)
file_handler.setFormatter(
logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
)
logger.addHandler(file_handler)
logger.info(f'Process-specific workspace mounted at {workspace_mount_path}')
# use a session id for concurrent processing
sid = instance.task_id + '_' + str(os.getpid())
sandbox = DockerSSHBox(
config=config.sandbox,
persist_sandbox=False,
workspace_mount_path=config.workspace_mount_path,
sandbox_workspace_dir=config.workspace_mount_path_in_sandbox,
cache_dir=config.cache_dir,
run_as_devin=config.run_as_devin,
sid=sid,
)
requirements_host_src = 'evaluation/mint/requirements.txt'
requirements_sandbox_dest = '/opendevin/plugins/mint/'
sandbox.copy_to(
host_src=requirements_host_src,
sandbox_dest=requirements_sandbox_dest,
recursive=False,
)
logger.info(
f'Copied files from [{requirements_host_src}] to [{requirements_sandbox_dest}] inside sandbox.'
)
exit_code, output = sandbox.execute(f'pip install -r {requirements_sandbox_dest}')
log_dir = os.path.join(metadata.eval_output_dir, 'infer_logs')
reset_logger_for_multiprocessing(logger, instance.instance_id, log_dir)
else:
logger.info(f'Starting evaluation for instance {instance.instance_id}.')
# Prepare instruction
assert metadata.details is not None
instruction = ToolPromptTemplate(use_tool=True)(
max_total_steps=metadata.max_iterations,
max_propose_solution=metadata.details['max_propose_solution'],
in_context_example=instance.in_context_example(
use_tool=True, with_feedback=False
),
in_context_example=instance.in_context_example,
task_prompt='Task:\n' + instance.prompt,
)
instruction += 'IMPORTANT: You should ONLY interact with the environment provided to you or provide the concise RESULT inside <solution> tag AND NEVER ASK FOR HUMAN HELP.\n'
# NOTE: You can actually set slightly different instruction for different agents
instruction += AGENT_CLS_TO_INST_SUFFIX[agent.__class__.__name__]
instruction += AGENT_CLS_TO_INST_SUFFIX[metadata.agent_class]
# Here's how you can run the agent (similar to the `main` function) and get the final task state
fake_user_response_fn = functools.partial(
AGENT_CLS_TO_FAKE_USER_RESPONSE_FN[agent.__class__.__name__],
AGENT_CLS_TO_FAKE_USER_RESPONSE_FN[metadata.agent_class],
task=instance,
task_config={
'max_iterations': metadata.max_iterations,
@ -148,24 +174,22 @@ def process_instance(
},
)
config.max_iterations = metadata.max_iterations
state: State | None = asyncio.run(
run_controller(
config=config,
task_str=instruction,
fake_user_response_fn=fake_user_response_fn,
agent=agent,
sandbox=sandbox,
sid=sid,
)
runtime = await create_runtime(config, sid=instance.instance_id)
await initialize_runtime(runtime)
state: State | None = await run_controller(
config=config,
task_str=instruction,
runtime=runtime,
fake_user_response_fn=fake_user_response_fn,
)
if state is None:
raise ValueError('State should not be None.')
task_state = None
if hasattr(state, 'task_state'):
task_state = state.task_state
if 'task_state' in state.extra_data:
task_state = state.extra_data['task_state']
logger.info('Task state: ' + str(task_state.to_dict()))
metrics = state.metrics.get() if state.metrics else None
@ -176,30 +200,37 @@ def process_instance(
histories = state.history.compatibility_for_eval_history_pairs()
# Save the output
output = {
'id': instance.task_id,
'instance': instance.to_dict(),
'instruction': instruction,
'metadata': metadata.model_dump(),
'history': histories,
'metrics': metrics,
'error': state.last_error if state and state.last_error else None,
'test_result': task_state.success if task_state else False,
}
# Close the sandbox
sandbox.close()
output = EvalOutput(
instance_id=instance.instance_id,
instance=instance.to_dict(),
instruction=instruction,
metadata=metadata,
history=histories,
metrics=metrics,
error=state.last_error if state and state.last_error else None,
test_result={
'success': task_state.success if task_state else False,
},
)
return output
if __name__ == '__main__':
parser = get_parser()
SUBSETS = [
# Eurus subset: https://arxiv.org/abs/2404.02078
'math',
# 'gsm8k',
'mmlu',
'theoremqa',
'mbpp',
'humaneval',
]
parser.add_argument(
'--subset',
default='math',
choices=['math', 'gsm8k', 'mmlu', 'theoremqa', 'mbpp', 'humaneval'],
default='all',
choices=SUBSETS + ['all'],
type=str,
help='subset of the dataset to be used',
)
@ -214,19 +245,36 @@ if __name__ == '__main__':
# NOTE: It is preferable to load datasets from huggingface datasets and perform post-processing
# so we don't need to manage file uploading to OpenDevin's repo
mint_dataset = load_dataset(
'ryanhoangt/xingyaoww-mint-bench', name=args.subset, split='test'
)
logger.info(f'Evaluating MINT - {args.subset} subset')
mint_tests = mint_dataset.to_pandas()
if args.subset == 'all':
subsets = SUBSETS
else:
subsets = [args.subset]
id_column = 'id'
llm_config = get_llm_config_arg(args.llm_config) if args.llm_config else config.llm
logger.info(f'Config for evaluation: {config}')
dataset_dfs = []
for subset in subsets:
in_context_example = load_incontext_example(subset)
_cur_dataset = load_dataset(
'ryanhoangt/xingyaoww-mint-bench', name=subset, split='test'
)
logger.info(f'Loaded MINT - {subset} subset')
_df = _cur_dataset.to_pandas().rename(columns={'id': 'instance_id'})
_df['instance_id'] = _df['instance_id'].apply(lambda x: f'{subset}/{x}') # noqa
_df['in_context_example'] = in_context_example
dataset_dfs.append(_df)
logger.info(f'Loaded {len(_df)} instances for subset: {subset}')
dataset_df = pd.concat(dataset_dfs)
logger.info(f'Loaded {len(dataset_df)} instances for subset: {subsets}')
llm_config = None
if args.llm_config:
llm_config = get_llm_config_arg(args.llm_config)
if llm_config is None:
raise ValueError(f'Could not find LLM config: --llm_config {args.llm_config}')
metadata = make_metadata(
llm_config,
args.dataset_name,
f'MINT-{args.subset}',
args.agent_cls,
args.max_iterations,
args.eval_note,
@ -234,12 +282,7 @@ if __name__ == '__main__':
details={'max_propose_solution': args.max_propose_solution},
)
output_file = os.path.join(metadata.eval_output_dir, 'output.jsonl')
instances = prepare_dataset(mint_dataset, output_file, args.eval_n_limit, id_column)
instances = prepare_dataset(dataset_df, output_file, args.eval_n_limit)
run_evaluation(
instances,
metadata,
output_file,
args.eval_num_workers,
process_instance,
id_column,
instances, metadata, output_file, args.eval_num_workers, process_instance
)

evaluation/mint/scripts/run_infer.sh Normal file → Executable file

@ -29,15 +29,16 @@ COMMAND="poetry run python ./evaluation/mint/run_infer.py \
--llm-config $MODEL_CONFIG \
--max-iterations 5 \
--max-propose-solution 2 \
--eval-num-workers $NUM_WORKERS \
--eval-num-workers $NUM_WORKERS
"
if [ -n "$SUBSET" ]; then
echo "SUBSET: $SUBSET"
COMMAND="$COMMAND --subset $SUBSET"
# otherwise default to use the math subset
else
echo "SUBSET: math"
COMMAND="$COMMAND --subset math"
echo "SUBSET: all"
COMMAND="$COMMAND --subset all"
fi
if [ -n "$EVAL_LIMIT" ]; then


@ -10,40 +10,9 @@ The task introduces new challenges for LLMs, such as comprehending long and lang
For more details on the ML-Bench task and dataset, please refer to the paper: [ML-Bench: Evaluating Large Language Models for Code Generation in Repository-Level Machine Learning Tasks](https://arxiv.org/abs/2311.09835).
## Setup Environment
## Setup Environment and LLM Configuration
Please follow the [OpenDevin setup guide](https://github.com/OpenDevin/OpenDevin/blob/main/docs/setup.md) to set up the local development environment for OpenDevin.
## Configure OpenDevin and your LLM
Create a `config.toml` file if it does not exist at the root of the workspace.
Add the following configurations:
```toml
[core]
max_iterations = 100
cache_dir = "/tmp/cache"
ssh_hostname = "localhost"
run_as_devin = false
sandbox_container_image = "public.ecr.aws/i5g0m1f6/ml-bench" # Use the latest image from the ML-Bench repository
[sandbox]
enable_auto_lint = true
# TODO: Change these to the model you want to evaluate
[llm.eval_gpt4_1106_preview]
model = "gpt-4-1106-preview"
api_key = "XXX"
temperature = 0.0
[llm.eval_some_openai_compatible_model]
model = "openai/MODEL_NAME"
base_url = "https://OPENAI_COMPATIBLE_URL/v1"
api_key = "XXX"
temperature = 0.0
```
Please follow the instructions [here](../README.md#setup) to set up your local development environment and LLM.
## Run Inference on ML-Bench


@ -13,29 +13,34 @@ TODOs:
- Clean up the code and docker image used for evaluation.
"""
import asyncio
import logging
import os
import pathlib
from typing import Any
import pandas as pd
from datasets import load_dataset
from evaluation.utils.shared import (
EvalMetadata,
EvalOutput,
codeact_user_response,
make_metadata,
prepare_dataset,
reset_logger_for_multiprocessing,
run_evaluation,
)
from opendevin.controller.agent import Agent
from opendevin.controller.state.state import State
from opendevin.core.config import get_llm_config_arg, get_parser, load_app_config
from opendevin.core.logger import get_console_handler
from opendevin.core.config import (
AppConfig,
SandboxConfig,
get_llm_config_arg,
get_parser,
load_app_config,
)
from opendevin.core.logger import opendevin_logger as logger
from opendevin.core.main import run_controller
from opendevin.llm.llm import LLM
from opendevin.runtime.docker.ssh_box import DockerSSHBox
from opendevin.core.main import create_runtime, run_controller
from opendevin.events.action import CmdRunAction
from opendevin.events.observation import CmdOutputObservation
from opendevin.runtime.runtime import Runtime
config = load_app_config()
@ -66,169 +71,204 @@ ID2CONDA = {
}
def process_instance(instance: Any, metadata: EvalMetadata, reset_logger: bool = True):
agent = Agent.get_cls(metadata.agent_class)(llm=LLM(config=metadata.llm_config))
old_workspace_mount_path = config.workspace_mount_path
old_workspace_base = config.workspace_base
try:
workspace_mount_path = os.path.join(
config.workspace_mount_path, '_eval_workspace'
)
def get_config(
metadata: EvalMetadata,
) -> AppConfig:
config = AppConfig(
default_agent=metadata.agent_class,
run_as_devin=False,
runtime='eventstream',
max_iterations=metadata.max_iterations,
sandbox=SandboxConfig(
container_image='public.ecr.aws/i5g0m1f6/ml-bench',
enable_auto_lint=True,
use_host_network=False,
update_source_code=True,
),
# do not mount workspace
workspace_base=None,
workspace_mount_path=None,
)
config.set_llm_config(metadata.llm_config)
return config
async def initialize_runtime(
runtime: Runtime,
instance: pd.Series,  # used to pick the conda environment and task repo for this instance
):
"""Initialize the runtime for the agent.
This function is called before the runtime is used to run the agent.
"""
logger.info(f"{'-' * 50} BEGIN Runtime Initialization Fn {'-' * 50}")
obs: CmdOutputObservation
# Create the workspace directory
action = CmdRunAction(command='mkdir -p /workspace')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
# Set up the task environment
action = CmdRunAction(command=f'conda activate {ID2CONDA[instance["github_id"]]}')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
repo_url = instance['github']
repo_name = repo_url.split('/')[-1]
action = CmdRunAction(command=f'git clone {repo_url} /workspace/{repo_name}')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
action = CmdRunAction(command=f'chmod -R 777 /workspace/{repo_name}')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
# Navigate to the task's code path
task_path = os.path.join('/workspace', repo_name, instance['path'][2:])
action = CmdRunAction(command=f'cd {task_path}')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
logger.info(f"{'-' * 50} END Runtime Initialization Fn {'-' * 50}")
async def complete_runtime(
runtime: Runtime,
instance: pd.Series,  # used to locate the task repo and its conda environment
) -> dict[str, Any]:
"""Complete the runtime for the agent.
This function is called after the agent has finished running.
If you need to do something in the sandbox to get the correctness metric after
the agent has run, modify this function.
"""
logger.info(f"{'-' * 50} BEGIN Runtime Completion Fn {'-' * 50}")
obs: CmdOutputObservation
repo_url = instance['github']
repo_name = repo_url.split('/')[-1]
task_path = os.path.join('/workspace', repo_name, instance['path'][2:])
# Evaluate the agent's script
eval_script = os.path.join(task_path, 'run.sh')
logger.info(f'Running evaluation script: {eval_script}')
action = CmdRunAction(command=f'cat {eval_script}', keep_prompt=False)
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
if obs.exit_code == 0:
eval_script_content = obs.content
else:
logger.error(f'Error reading evaluation script: {obs.content}')
eval_script_content = ''
action = CmdRunAction(
command=f'timeout 120s conda run -n {ID2CONDA[instance["github_id"]]} bash {eval_script}',
timeout=600,
)
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
if obs.exit_code == 0:
eval_output = obs.content
else:
logger.error(f'Error running evaluation script: {obs.content}')
eval_output = ''
outputs = {
'eval_script_content': eval_script_content,
'eval_output': eval_output,
}
if obs.exit_code != 0 and obs.exit_code != 124:
logger.warning(f'Evaluation script failed with exit code {obs.exit_code}')
logger.warning(f'Output: {eval_output}')
outputs['success'] = int(
'KeyboardInterrupt' in eval_output
) # super-dainiu: assume ``KeyboardInterrupt`` is a success as is done in ML-Bench
else:
logger.info(f'Evaluation script succeeded with exit code {obs.exit_code}')
logger.info(f'Output: {eval_output}')
outputs['success'] = 1
outputs['eval_exit_code'] = obs.exit_code
logger.info(f"{'-' * 50} END Runtime Completion Fn {'-' * 50}")
return outputs
async def process_instance(
instance: Any, metadata: EvalMetadata, reset_logger: bool = True
):
config = get_config(metadata)
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
if reset_logger:
log_dir = os.path.join(metadata.eval_output_dir, 'infer_logs')
reset_logger_for_multiprocessing(logger, instance['instance_id'], log_dir)
else:
logger.info(f'Starting evaluation for instance {instance["instance_id"]}.')
# Create a sandbox, using the instance ID and PID as the session ID to avoid conflicts
sid = str(instance['instance_id'])
repo_url = instance['github']
repo_name = repo_url.split('/')[-1]
task_path = os.path.join('/workspace', repo_name, instance['path'][2:])
# Prepare the task instruction
instruction = (
f'Please complete the Machine Learning task in the following repository: {repo_name}\n\n'
f'{instance["instruction"]}\n\n'
'You should create a script named `run.sh` under the specified path in the repo to run the task.\n\n'
f'You can find the task repo at: {task_path}\n\n'
+ (
'Here is the prefix code for the task:\n'
'```bash\n'
f'{instance["prefix_code"]}\n'
'```\n\n'
if instance['prefix_code']
else ''
)
+ 'You should terminate the subprocess after running the task (e.g., call subprocess.Popen(args).wait()).'
)
# create process-specific workspace dir
# so that different agents don't interfere with each other.
workspace_mount_path = os.path.join(workspace_mount_path, str(os.getpid()))
pathlib.Path(workspace_mount_path).mkdir(parents=True, exist_ok=True)
instruction += AGENT_CLS_TO_INST_SUFFIX[metadata.agent_class]
# reset workspace to config
config.workspace_base = workspace_mount_path
config.workspace_mount_path = workspace_mount_path
runtime = await create_runtime(config, sid=sid)
await initialize_runtime(runtime, instance)
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
if reset_logger:
# Set up logger
log_file = os.path.join(
metadata.eval_output_dir,
'logs',
f"instance_{instance['id']}_pid_{os.getpid()}.log",
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
# add back the console handler to print ONE line
logger.addHandler(get_console_handler())
logger.info(
f"Starting evaluation for instance {instance['id']}.\nLOG: tail -f {log_file}"
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
file_handler = logging.FileHandler(log_file)
file_handler.setFormatter(
logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
)
logger.addHandler(file_handler)
# Run the agent
state: State | None = await run_controller(
config=config,
task_str=instruction,
runtime=runtime,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN.get(
metadata.agent_class
),
)
assert state is not None
metrics = state.metrics.get() if state.metrics else {}
logger.info(f'Process-specific workspace mounted at {workspace_mount_path}')
test_result = await complete_runtime(runtime)
# Create a sandbox, using the instance ID and PID as the session ID to avoid conflicts
sid = str(instance['id']) + '_' + str(os.getpid())
sandbox = DockerSSHBox(
config=config.sandbox,
persist_sandbox=False,
workspace_mount_path=config.workspace_mount_path,
sandbox_workspace_dir=config.workspace_mount_path_in_sandbox,
cache_dir=config.cache_dir,
run_as_devin=config.run_as_devin,
sid=sid,
)
# history is now available as a stream of events, rather than list of pairs of (Action, Observation)
# for compatibility with the existing output format, we can remake the pairs here
# remove when it becomes unnecessary
histories = state.history.compatibility_for_eval_history_pairs()
# Set up the task environment
sandbox.execute(f'conda activate {ID2CONDA[instance["github_id"]]}')
# Clone the task repo into the sandbox
repo_url = instance['github']
repo_name = repo_url.split('/')[-1]
sandbox.execute(f'git clone {repo_url} /workspace/{repo_name}')
sandbox.execute(f'chmod -R 777 /workspace/{repo_name}')
# Navigate to the task's code path
task_path = os.path.join('/workspace', repo_name, instance['path'][2:])
sandbox.execute(f'cd {task_path}')
# Prepare the task instruction
instruction = (
f'Please complete the Machine Learning task in the following repository: {repo_name}\n\n'
f'The task is: {instance["task"]}\n\n'
f'{instance["instruction"]}\n\n'
'You should create a script named `run.sh` under the specified path in the repo to run the task.\n\n'
f'You can find the task repo at: {task_path}\n\n'
+ (
'Here is the prefix code for the task:\n'
'```bash\n'
f'{instance["prefix_code"]}\n'
'```\n\n'
if instance['prefix_code']
else ''
)
+ 'You should terminate the subprocess after running the task (e.g., call subprocess.Popen(args).wait()).'
)
instruction += AGENT_CLS_TO_INST_SUFFIX[agent.__class__.__name__]
# Run the agent
config.max_iterations = metadata.max_iterations
state: State | None = asyncio.run(
run_controller(
config=config,
task_str=instruction,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN.get(
agent.__class__.__name__
),
agent=agent,
sandbox=sandbox,
sid=sid,
)
)
assert state is not None
metrics = state.metrics.get() if state.metrics else {}
# Evaluate the agent's script
eval_script = os.path.join(task_path, 'run.sh')
logger.info(f'Running evaluation script: {eval_script}')
try:
_, eval_script_content = sandbox.execute(f'cat {eval_script}')
except Exception as e:
logger.error(f'Error reading evaluation script: {e}')
eval_script_content = ''
try:
exit_code, eval_output = sandbox.execute(
f'timeout 120s conda run -n {ID2CONDA[instance["github_id"]]} bash {eval_script}',
timeout=600,
)
except Exception as e:
logger.error(f'Error running evaluation script: {e}')
exit_code = -1
eval_output = ''
if exit_code != 0 and exit_code != 124:
logger.warning(f'Evaluation script failed with exit code {exit_code}')
logger.warning(f'Output: {eval_output}')
metrics['success'] = int(
'KeyboardInterrupt' in eval_output
) # super-dainiu: assume ``KeyboardInterrupt`` is a success as is done in ML-Bench
else:
logger.info(f'Evaluation script succeeded with exit code {exit_code}')
logger.info(f'Output: {eval_output}')
metrics['success'] = 1
# history is now available as a stream of events, rather than list of pairs of (Action, Observation)
# for compatibility with the existing output format, we can remake the pairs here
# remove when it becomes unnecessary
histories = state.history.compatibility_for_eval_history_pairs()
# Save the output
output = {
'instance_id': instance['id'],
'repo': repo_url,
'instruction': instruction,
'metadata': metadata.model_dump(),
'history': histories,
'eval_script': eval_script_content,
'eval_exit_code': exit_code,
'eval_output': eval_output,
'metrics': metrics,
}
except Exception as e:
logger.error(f'Error processing instance {instance["id"]}: {e}')
raise
finally:
config.workspace_mount_path = old_workspace_mount_path
config.workspace_base = old_workspace_base
# Shutdown the sandbox
sandbox.close()
# Save the output
output = EvalOutput(
instance_id=instance['instance_id'],
instance=instance.to_dict(),
instruction=instruction,
metadata=metadata,
history=histories,
test_result=test_result,
metrics=metrics,
)
return output
@ -246,30 +286,26 @@ if __name__ == '__main__':
data_split = args.eval_split
# NOTE: It is preferable to load datasets from huggingface datasets and perform post-processing
# so we don't need to manage file uploading to OpenDevin's repo
ml_bench = load_dataset('super-dainiu/ml-bench', split=data_split).to_pandas()
ml_bench.rename(columns={'id': 'instance_id'}, inplace=True)
id_column = 'instance_id'
llm_config = get_llm_config_arg(args.llm_config) if args.llm_config else config.llm
logger.info(f'Config for evaluation: {config}')
llm_config = None
if args.llm_config:
llm_config = get_llm_config_arg(args.llm_config)
if llm_config is None:
raise ValueError(f'Could not find LLM config: --llm_config {args.llm_config}')
metadata = make_metadata(
llm_config,
args.dataset_name,
f'ml-bench-{data_split}',
args.agent_cls,
args.max_iterations,
args.eval_note,
args.eval_output_dir,
)
output_file = os.path.join(metadata.eval_output_dir, 'output.jsonl')
instances = prepare_dataset(ml_bench, output_file, args.eval_n_limit, id_column)
instances = prepare_dataset(ml_bench, output_file, args.eval_n_limit)
run_evaluation(
instances,
metadata,
output_file,
args.eval_num_workers,
process_instance,
id_column,
instances, metadata, output_file, args.eval_num_workers, process_instance
)

evaluation/ml_bench/scripts/run_infer.sh Normal file → Executable file


@ -1,132 +1,79 @@
# SWE-Bench Evaluation with OpenDevin SWE-Bench Docker Image
This folder contains the evaluation harness that we built on top of the original [SWE-Bench benchmark](https://www.swebench.com/) ([paper](https://arxiv.org/abs/2310.06770)). We created [a fork of SWE-Bench](https://github.com/OpenDevin/OD-SWE-bench.git) mostly built on top of [the original repo](https://github.com/princeton-nlp/SWE-bench) and [containerized](#opendevin-swe-bench-docker-image) it for easy evaluation.
This folder contains the evaluation harness that we built on top of the original [SWE-Bench benchmark](https://www.swebench.com/) ([paper](https://arxiv.org/abs/2310.06770)).
**UPDATE (7/1/2024): We now support the official SWE-Bench dockerized evaluation as announced [here](https://github.com/princeton-nlp/SWE-bench/blob/main/docs/20240627_docker/README.md).**
## Setup Environment
The evaluation consists of three steps:
Please follow [this document](https://github.com/OpenDevin/OpenDevin/blob/main/Development.md) to set up a local development environment for OpenDevin.
1. Environment setup: [install python environment](../README.md#development-environment), [configure LLM config](../README.md#configure-opendevin-and-your-llm), and [pull docker](#opendevin-swe-bench-instance-level-docker-support).
2. [Run inference](#run-inference-on-swe-bench-instances): Generate an edit patch for each GitHub issue
3. [Evaluate patches using SWE-Bench docker](#evaluate-generated-patches)
## OpenDevin SWE-Bench Docker Image
## Setup Environment and LLM Configuration
In the [OpenDevin-SWE-Bench fork](https://github.com/OpenDevin/OD-SWE-bench.git) (mostly from the [original repo](https://github.com/princeton-nlp/SWE-bench) with some fixes), we try to pre-build the **testbed** (i.e., the code of the repository we want the agent to edit) AND the **conda environment**, so that at evaluation (inference) time we can directly leverage existing environments for efficient evaluation.
Please follow the instructions [here](../README.md#setup) to set up your local development environment and LLM.
**We pack everything you need for SWE-Bench inference into one, gigantic, docker image.** To use it:
## OpenDevin SWE-Bench Instance-level Docker Support
OpenDevin now supports using the [official evaluation docker](https://github.com/princeton-nlp/SWE-bench/blob/main/docs/20240627_docker/README.md) for both **[inference](#run-inference-on-swe-bench-instances) and [evaluation](#evaluate-generated-patches)**.
This is now the default behavior.
### Download Docker Images
**(Recommended for reproducibility)** If you have extra local space (e.g., 100GB), you can try pulling the [instance-level docker images](https://github.com/princeton-nlp/SWE-bench/blob/main/docs/20240627_docker/README.md#choosing-the-right-cache_level) we've prepared by running:
```bash
docker pull ghcr.io/opendevin/eval-swe-bench:full-v1.2.1
evaluation/swe_bench/scripts/docker/pull_all_eval_docker.sh instance
```
The Docker image contains several important directories:
- `/swe_util/OD-SWE-bench`: root directory for the OD-SWE-bench repository
- `/swe_util/eval_data`: directory to eval data
- `/swe_util/eval_data/eval_logs/`: evaluation logs
- `/swe_util/eval_data/eval_temp/`: temporary folder for the evaluation process
- `/swe_util/eval_data/instances/`: swe-bench raw instances
- `/swe_util/eval_data/outputs/`: model or agent outputs
- `/swe_util/eval_data/testbed_logs/`: logs for testbed building
- `/swe_util/eval_data/testbeds/`: directory for all testbeds
- `/swe_util/miniforge3/`: directory for miniforge3
To reproduce how we pack the image, check [this doc](./BUILD_TESTBED_AND_ENV.md).
NOTE: We only support SWE-Bench lite for now. But modifying our existing scripts for full SWE-Bench should be quite straightforward.
## Configure OpenDevin and your LLM
Create a `config.toml` file if it does not exist at the root of the workspace.
Add the following configurations:
```toml
[core]
max_iterations = 100
cache_dir = "/tmp/cache"
ssh_hostname = "localhost"
[sandbox]
box_type = "ssh"
timeout = 120
run_as_devin = false
max_budget_per_task = 4 # 4 USD
[sandbox]
# SWEBench eval specific
use_host_network = false
enable_auto_lint = true
# TODO: Change these to the model you want to evaluate
[llm.eval_gpt4_1106_preview_llm]
model = "gpt-4-1106-preview"
api_key = "XXX"
temperature = 0.0
[llm.eval_some_openai_compatible_model_llm]
model = "openai/MODEL_NAME"
base_url = "https://OPENAI_COMPATIBLE_URL/v1"
api_key = "XXX"
temperature = 0.0
```
## Test if your environment works
Make sure your Docker daemon is running, and you have pulled the `eval-swe-bench:full-v1.2`
docker image. Then run this python script:
If you want to save a bit of disk space (e.g., with ~50GB free) while speeding up the image pre-build process, you can pull the environment-level docker images:
```bash
# export USE_INSTANCE_IMAGE=true # if you want to test support for instance-level docker images
poetry run python evaluation/swe_bench/swe_env_box.py
evaluation/swe_bench/scripts/docker/pull_all_eval_docker.sh env
```
If you get to the interactive shell successfully, it means your environment works!
If you see an error, please make sure your `config.toml` contains all
`SWEBench eval specific` settings as shown in the previous section.
## Run Inference on SWE-Bench Instances
Make sure your Docker daemon is running, and you have pulled the [instance-level docker image](#opendevin-swe-bench-instance-level-docker-support).
```bash
./evaluation/swe_bench/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [max_iter] [num_workers]
# e.g., ./evaluation/swe_bench/scripts/run_infer.sh eval_gpt4_1106_preview_llm HEAD CodeActAgent 300
# e.g., ./evaluation/swe_bench/scripts/run_infer.sh llm.eval_gpt4_1106_preview HEAD CodeActAgent 300
```
where `model_config` is mandatory, while `agent` and `eval_limit` are optional.
where `model_config` is mandatory, and the rest are optional.
`model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for your
- `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for your
LLM settings, as defined in your `config.toml`.
`git-version`, e.g. `HEAD`, is the git commit hash of the OpenDevin version you would
- `git-version`, e.g. `HEAD`, is the git commit hash of the OpenDevin version you would
like to evaluate. It could also be a release tag like `0.6.2`.
`agent`, e.g. `CodeActAgent`, is the name of the agent for benchmarks, defaulting
- `agent`, e.g. `CodeActAgent`, is the name of the agent for benchmarks, defaulting
to `CodeActAgent`.
`eval_limit`, e.g. `10`, limits the evaluation to the first `eval_limit` instances. By
- `eval_limit`, e.g. `10`, limits the evaluation to the first `eval_limit` instances. By
default, the script evaluates the entire SWE-bench_Lite test set (300 issues). Note:
in order to use `eval_limit`, you must also set `agent`.
`max_iter`, e.g. `20`, is the maximum number of iterations for the agent to run. By
- `max_iter`, e.g. `20`, is the maximum number of iterations for the agent to run. By
default, it is set to 30.
`num_workers`, e.g. `3`, is the number of parallel workers to run the evaluation. By
- `num_workers`, e.g. `3`, is the number of parallel workers to run the evaluation. By
default, it is set to 1.
There are also two optional environment variables you can set.
```
export USE_HINT_TEXT=true # if you want to use hint text in the evaluation. Ignore this if you are not sure.
export USE_INSTANCE_IMAGE=true # if you want to use instance-level docker images
export USE_HINT_TEXT=true # if you want to use hint text in the evaluation. Default to false. Ignore this if you are not sure.
export USE_INSTANCE_IMAGE=true # if you want to use instance-level docker images. Default to true
```
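The exact way `run_infer.py` consumes these flags is defined in the script itself; the snippet below is only an illustration (the parsing shown here is an assumption, not a copy of the harness code), with defaults matching the comments above:
```python
import os

# Illustration only: read the optional flags as boolean environment variables.
USE_HINT_TEXT = os.environ.get('USE_HINT_TEXT', 'false').lower() == 'true'
USE_INSTANCE_IMAGE = os.environ.get('USE_INSTANCE_IMAGE', 'true').lower() == 'true'

print(f'USE_HINT_TEXT={USE_HINT_TEXT}, USE_INSTANCE_IMAGE={USE_INSTANCE_IMAGE}')
```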
Let's say you'd like to run 10 instances using `eval_gpt4_1106_preview_llm` and CodeActAgent,
Let's say you'd like to run 10 instances using `llm.eval_gpt4_1106_preview` and CodeActAgent,
then your command would be:
```bash
./evaluation/swe_bench/scripts/run_infer.sh eval_gpt4_1106_preview_llm HEAD CodeActAgent 10
./evaluation/swe_bench/scripts/run_infer.sh llm.eval_gpt4_1106_preview HEAD CodeActAgent 10
```
### Specify a subset of tasks to run infer
If you would like to specify a list of tasks you'd like to benchmark on, you could
create a `config.toml` under the `./evaluation/swe_bench/` folder, and put a list
attribute named `selected_ids`, e.g.
@ -146,22 +93,12 @@ With `output.jsonl` file, you can run `eval_infer.sh` to evaluate generated patc
**This evaluation is performed using the official dockerized evaluation announced [here](https://github.com/princeton-nlp/SWE-bench/blob/main/docs/20240627_docker/README.md).**
If you want to evaluate existing results, you should first run this to clone existing outputs
> If you want to evaluate existing results, you should first run this to clone existing outputs
>```bash
>git clone https://huggingface.co/spaces/OpenDevin/evaluation evaluation/evaluation_outputs
>```
```bash
git clone https://huggingface.co/spaces/OpenDevin/evaluation evaluation/evaluation_outputs
```
If you have extra local space (e.g., 500GB), you can try pulling the [instance-level docker images](https://github.com/princeton-nlp/SWE-bench/blob/main/docs/20240627_docker/README.md#choosing-the-right-cache_level) we've prepared to speed up the evaluation by running:
```bash
evaluation/swe_bench/scripts/docker/pull_all_eval_docker.sh instance
```
If you want to save some disk space (e.g., with only ~50GB free) while still speeding up the image pre-build process, you can pull the environment-level docker images instead:
```bash
evaluation/swe_bench/scripts/docker/pull_all_eval_docker.sh env
```
Note: you should have already pulled the instance-level OR env-level docker images following [this section](#opendevin-swe-bench-instance-level-docker-support).
Then you can run the following:
@ -171,13 +108,13 @@ Then you can run the following:
./evaluation/swe_bench/scripts/eval_infer.sh evaluation/evaluation_outputs/outputs/swe_bench/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/output.jsonl
```
PS: You can also pass in a JSONL with [SWE-Bench format](https://github.com/princeton-nlp/SWE-bench/blob/main/tutorials/evaluation.md#-creating-predictions) to `./evaluation/swe_bench/scripts/eval_infer.sh`, where each line is a JSON of `{"model_patch": "XXX", "model_name_or_path": "YYY", "instance_id": "ZZZ"}`.
> You can also pass in a JSONL with [SWE-Bench format](https://github.com/princeton-nlp/SWE-bench/blob/main/tutorials/evaluation.md#-creating-predictions) to `./evaluation/swe_bench/scripts/eval_infer.sh`, where each line is a JSON of `{"model_patch": "XXX", "model_name_or_path": "YYY", "instance_id": "ZZZ"}`.
The final results will be saved to `evaluation/evaluation_outputs/outputs/swe_bench/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/` with the following files/directories:
- `README.md`: a report showing which instances passed, failed, etc.
- `report.json`: a JSON file that contains keys like `"resolved_ids"` pointing to instance IDs that are resolved by the agent.
- `eval_outputs/`: a directory of test logs
- `logs/`: a directory of test logs
## Visualize Results
@ -189,9 +126,10 @@ git clone https://huggingface.co/spaces/OpenDevin/evaluation
**(optional) setup streamlit environment with conda**:
```bash
cd evaluation
conda create -n streamlit python=3.10
conda activate streamlit
pip install streamlit altair st_pages
pip install -r requirements.txt
```
**run the visualizer**:

View File

@ -0,0 +1,28 @@
CODEACT_SWE_PROMPT = """Now, you're going to solve this issue on your own. Your terminal session has started and you're in the repository's root directory. You can use any bash commands or the special interface to help you. Edit all the files you need to and run any checks or tests that you want.
Remember, YOU CAN ONLY ENTER ONE COMMAND AT A TIME. You should always wait for feedback after every command.
When you're satisfied with all of the changes you've made, you can run the following command: <execute_bash> exit </execute_bash>.
Note however that you cannot use any interactive session commands (e.g. vim) in this environment, but you can write scripts and run them. E.g. you can write a python script and then run it with `python <script_name>.py`.
NOTE ABOUT THE EDIT COMMAND: Indentation really matters! When editing a file, make sure to insert appropriate indentation before each line!
IMPORTANT TIPS:
1. Always start by trying to replicate the bug that the issue discusses.
If the issue includes code for reproducing the bug, we recommend that you re-implement that in your environment, and run it to make sure you can reproduce the bug.
Then start trying to fix it.
When you think you've fixed the bug, re-run the bug reproduction script to make sure that the bug has indeed been fixed.
If the bug reproduction script does not print anything when it successfully runs, we recommend adding a print("Script completed successfully, no errors.") command at the end of the file,
so that you can be sure that the script indeed ran fine all the way through.
2. If you run a command and it doesn't work, try running a different command. A command that did not work once will not work the second time unless you modify it!
3. If you open a file and need to get to an area around a specific line that is not in the first 100 lines, say line 583, don't just use the scroll_down command multiple times. Instead, use the goto 583 command. It's much quicker.
4. If the bug reproduction script requires inputting/reading a specific file, such as buggy-input.png, and you'd like to understand how to input that file, conduct a search in the existing repo code, to see whether someone else has already done that. Do this by running the command: find_file("buggy-input.png") If that doesn't work, use the linux 'find' command.
5. Always make sure to look at the currently open file and the current working directory (which appears right after the currently open file). The currently open file might be in a different directory than the working directory! Note that some commands, such as 'create', open files, so they might change the current open file.
6. When editing files, it is easy to accidentally specify a wrong line number or to write code with incorrect indentation. Always check the code after you issue an edit to make sure that it reflects what you wanted to accomplish. If it didn't, issue another command to fix it.
[Current directory: /workspace/{workspace_dir_name}]
"""

View File

@ -1,34 +1,39 @@
import asyncio
import logging
import json
import os
import pathlib
import tempfile
from typing import Any
import pandas as pd
import toml
import whatthepatch
from datasets import load_dataset
import agenthub
from evaluation.swe_bench.swe_env_box import SWEBenchSSHBox
from evaluation.swe_bench.prompt import CODEACT_SWE_PROMPT
from evaluation.utils.shared import (
EvalMetadata,
EvalOutput,
codeact_user_response,
make_metadata,
prepare_dataset,
reset_logger_for_multiprocessing,
run_evaluation,
)
from opendevin.controller.agent import Agent
from opendevin.controller.state.state import State
from opendevin.core.config import get_llm_config_arg, load_app_config, parse_arguments
from opendevin.core.logger import get_console_handler
from opendevin.core.config import (
AppConfig,
SandboxConfig,
get_llm_config_arg,
parse_arguments,
)
from opendevin.core.logger import opendevin_logger as logger
from opendevin.core.main import run_controller
from opendevin.llm.llm import LLM
from opendevin.core.main import create_runtime, run_controller
from opendevin.events.action import CmdRunAction
from opendevin.events.observation import CmdOutputObservation, ErrorObservation
from opendevin.runtime.runtime import Runtime
config = load_app_config()
USE_HINT_TEXT = os.environ.get('USE_HINT_TEXT', 'false') == 'true'
USE_INSTANCE_IMAGE = os.environ.get('USE_INSTANCE_IMAGE', 'false') == 'true'
USE_HINT_TEXT = os.environ.get('USE_HINT_TEXT', 'false').lower() == 'true'
USE_INSTANCE_IMAGE = os.environ.get('USE_INSTANCE_IMAGE', 'false').lower() == 'true'
AGENT_CLS_TO_FAKE_USER_RESPONSE_FN = {
'CodeActAgent': codeact_user_response,
@ -41,184 +46,12 @@ AGENT_CLS_TO_INST_SUFFIX = {
}
def get_test_result(instance, sandbox, workspace_dir_name):
test_result = {'result': {}, 'metadata': {}}
# NOTE: if you need to do something in the sandbox to get the correctness metric, modify this function
try:
test_patch_parsed = whatthepatch.parse_patch(instance.test_patch)
# get a list of filepaths that are involved in the patch
involved_filepaths = set()
for patch in test_patch_parsed:
involved_filepaths.add(patch.header.old_path.removeprefix('a/'))
involved_filepaths.add(patch.header.new_path.removeprefix('b/'))
involved_filepaths = list(involved_filepaths)
test_result['metadata']['1_test_patch_parse_success'] = True
test_result['metadata']['1_test_involved_filepaths'] = involved_filepaths
except Exception as e:
logger.error(
f'Error parsing test patch for instance {instance.instance_id}: {e}'
)
test_result['metadata']['1_test_patch_parse_success'] = False
test_result['metadata']['1_test_patch_parse_error'] = str(e)
test_result['metadata']['1_test_involved_filepaths'] = None
involved_filepaths = []
# Try to revert the changes for involved filepaths
err_code, output = sandbox.execute(f'cd /workspace/{workspace_dir_name}')
test_result['metadata']['2_revert_test_involved_filepaths_success'] = []
for filepath in involved_filepaths:
err_code, output = sandbox.execute(
f'git checkout {instance["base_commit"]} -- {filepath}'
)
if err_code != 0:
logger.error(f'Error reverting changes for {filepath}: {output}')
test_result['metadata']['2_revert_test_involved_filepaths_success'].append(
False
)
else:
test_result['metadata']['2_revert_test_involved_filepaths_success'].append(
True
)
# Apply the testcase
err_code, output = sandbox.execute('git apply $SWE_TASK_DIR/test.patch')
if err_code != 0:
logger.error(f'Error applying test patch: {output}')
test_result['metadata']['3_apply_test_patch_success'] = False
test_result['metadata']['3_apply_test_patch_error'] = output
else:
test_result['metadata']['3_apply_test_patch_success'] = True
# Run the test command
err_code, output = sandbox.execute(
'$TEST_CMD > /workspace/$SWE_INSTANCE_ID.log 2>&1'
)
if err_code != 0:
logger.error(f'Error running test command: {output}')
test_result['metadata']['4_run_test_command_success'] = False
test_result['metadata']['4_run_test_command_error'] = output
else:
test_result['metadata']['4_run_test_command_success'] = True
# Get the test output
err_code, output = sandbox.execute('cat /workspace/$SWE_INSTANCE_ID.log')
if err_code != 0:
logger.error(f'Error getting test output: {output}')
test_result['metadata']['4_get_test_output_success'] = False
test_result['metadata']['4_get_test_output_error'] = output
else:
test_result['metadata']['4_get_test_output_success'] = True
test_result['test_output'] = output
# Reformat instance.json
# $SWE_TASK_DIR/instance.json is a dict {"XXX": "YYY"}, add a [ before and a ] after
err_code, output = sandbox.execute(
(
'cat $SWE_TASK_DIR/instance.json | sed "s/^{/[{/" | sed "s/}$/}]/" > /workspace/instance.json'
)
)
if err_code != 0:
logger.error(f'Error creating instance.json: {output}')
test_result['metadata']['5_reformat_instance_json_success'] = False
test_result['metadata']['5_reformat_instance_json_error'] = output
else:
test_result['metadata']['5_reformat_instance_json_success'] = True
if USE_INSTANCE_IMAGE:
# instance report is not supported in instance image mode
test_result['metadata']['6_get_instance_report_success'] = False
test_result['metadata']['6_get_instance_report_error'] = (
'Instance report is not supported in instance image mode.'
)
else:
# Get the instance report
err_code, output = sandbox.execute(
(
'cd /swe_util/OD-SWE-bench '
'&& export PYTHONPATH=$(pwd):$PYTHONPATH '
'&& conda run -n swe-bench-eval python swebench/metrics/get_instance_report.py --swe_bench_task /workspace/instance.json --log_path /workspace/$SWE_INSTANCE_ID.log'
)
)
if err_code != 0:
logger.error(f'Error getting instance report: {output}')
test_result['metadata']['6_get_instance_report_success'] = False
test_result['metadata']['6_get_instance_report_error'] = output
else:
test_result['metadata']['6_get_instance_report_success'] = True
test_result['result_raw'] = output
# try to parse output
for line in output.strip().split('\n'):
line = line.strip('-')
try:
key, value = line.split(':')
except ValueError:
# skip this line
print(f'Error parsing result line: {line}')
continue
value = value.strip()
try:
value = int(value)
except ValueError:
pass
test_result['result'][key.strip()] = value
return test_result
def _get_swebench_workspace_dir_name(instance: pd.Series) -> str:
return f'{instance.repo}__{instance.version}'.replace('/', '__')
def process_instance(
instance: pd.Series,
metadata: EvalMetadata,
reset_logger: bool = True,
):
# Create the agent
agent = Agent.get_cls(metadata.agent_class)(llm=LLM(config=metadata.llm_config))
workspace_mount_path = os.path.join(config.workspace_mount_path, '_eval_workspace')
# create process-specific workspace dir
workspace_mount_path = os.path.join(workspace_mount_path, str(os.getpid()))
pathlib.Path(workspace_mount_path).mkdir(parents=True, exist_ok=True)
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
if reset_logger:
# Set up logger
log_file = os.path.join(
metadata.eval_output_dir,
'infer_logs',
f'instance_{instance.instance_id}.log',
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
# add back the console handler to print ONE line
logger.addHandler(get_console_handler())
logger.info(
f'Starting evaluation for instance {instance.instance_id}.\nHint: run "tail -f {log_file}" to see live logs in a separate shell'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
os.makedirs(os.path.dirname(log_file), exist_ok=True)
file_handler = logging.FileHandler(log_file)
file_handler.setFormatter(
logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
)
logger.addHandler(file_handler)
else:
logger.info(f'Starting evaluation for instance {instance.instance_id}.')
# NOTE: this is something special we do for SWE-Bench due to the reason described in the previous section
# You can omit this if you don't need to setup specialized sandbox
workspace_dir_name = f'{instance.repo}__{instance.version}'.replace('/', '__')
sandbox = SWEBenchSSHBox.get_box_for_instance(
config,
instance,
workspace_dir_name,
workspace_mount_path=workspace_mount_path,
sandbox_plugins=agenthub.Agent.get_cls(metadata.agent_class).sandbox_plugins,
use_instance_image=USE_INSTANCE_IMAGE,
)
def get_instruction(instance: pd.Series, metadata: EvalMetadata):
workspace_dir_name = _get_swebench_workspace_dir_name(instance)
# Prepare instruction
if metadata.agent_class == 'CodeActSWEAgent':
instruction = (
@ -227,39 +60,11 @@ def process_instance(
f'{instance.problem_statement}\n'
'--- END ISSUE ---\n\n'
)
if USE_HINT_TEXT and instance.hints_text:
instruction += (
f'--- BEGIN HINTS ---\n{instance.hints_text}\n--- END HINTS ---\n'
)
instruction += f"""Now, you're going to solve this issue on your own. Your terminal session has started and you're in the repository's root directory. You can use any bash commands or the special interface to help you. Edit all the files you need to and run any checks or tests that you want.
Remember, YOU CAN ONLY ENTER ONE COMMAND AT A TIME. You should always wait for feedback after every command.
When you're satisfied with all of the changes you've made, you can run the following command: <execute_bash> exit </execute_bash>.
Note however that you cannot use any interactive session commands (e.g. vim) in this environment, but you can write scripts and run them. E.g. you can write a python script and then run it with `python <script_name>.py`.
NOTE ABOUT THE EDIT COMMAND: Indentation really matters! When editing a file, make sure to insert appropriate indentation before each line!
IMPORTANT TIPS:
1. Always start by trying to replicate the bug that the issue discusses.
If the issue includes code for reproducing the bug, we recommend that you re-implement that in your environment, and run it to make sure you can reproduce the bug.
Then start trying to fix it.
When you think you've fixed the bug, re-run the bug reproduction script to make sure that the bug has indeed been fixed.
If the bug reproduction script does not print anything when it successfully runs, we recommend adding a print("Script completed successfully, no errors.") command at the end of the file,
so that you can be sure that the script indeed ran fine all the way through.
2. If you run a command and it doesn't work, try running a different command. A command that did not work once will not work the second time unless you modify it!
3. If you open a file and need to get to an area around a specific line that is not in the first 100 lines, say line 583, don't just use the scroll_down command multiple times. Instead, use the goto 583 command. It's much quicker.
4. If the bug reproduction script requires inputting/reading a specific file, such as buggy-input.png, and you'd like to understand how to input that file, conduct a search in the existing repo code, to see whether someone else has already done that. Do this by running the command: find_file("buggy-input.png") If that doesn't work, use the linux 'find' command.
5. Always make sure to look at the currently open file and the current working directory (which appears right after the currently open file). The currently open file might be in a different directory than the working directory! Note that some commands, such as 'create', open files, so they might change the current open file.
6. When editing files, it is easy to accidentally specify a wrong line number or to write code with incorrect indentation. Always check the code after you issue an edit to make sure that it reflects what you wanted to accomplish. If it didn't, issue another command to fix it.
[Current directory: /workspace/{workspace_dir_name}]
"""
instruction += CODEACT_SWE_PROMPT.format(workspace_dir_name=workspace_dir_name)
else:
# Testing general agents
instruction = (
@ -277,61 +82,280 @@ IMPORTANT TIPS:
)
# NOTE: You can actually set slightly different instruction for different agents
instruction += AGENT_CLS_TO_INST_SUFFIX[agent.__class__.__name__]
instruction += AGENT_CLS_TO_INST_SUFFIX[metadata.agent_class]
return instruction
def get_config(
instance: pd.Series,
metadata: EvalMetadata,
) -> AppConfig:
SWE_BENCH_CONTAINER_IMAGE = 'ghcr.io/opendevin/eval-swe-bench:full-v1.2.1'
if USE_INSTANCE_IMAGE:
# We use a different instance image for each instance of SWE-bench eval
container_image = 'sweb.eval.x86_64.' + instance['instance_id']
else:
container_image = SWE_BENCH_CONTAINER_IMAGE
config = AppConfig(
default_agent=metadata.agent_class,
run_as_devin=False,
runtime='eventstream',
max_budget_per_task=4,
max_iterations=metadata.max_iterations,
sandbox=SandboxConfig(
container_image=container_image,
enable_auto_lint=True,
use_host_network=False,
# always make sure we are using the latest source code
update_source_code=True,
# large enough timeout, since some testcases take very long to run
timeout=300,
),
# do not mount workspace
workspace_base=None,
workspace_mount_path=None,
)
config.set_llm_config(metadata.llm_config)
return config
async def initialize_runtime(
runtime: Runtime,
instance: pd.Series, # this argument is not required
):
"""Initialize the runtime for the agent.
This function is called before the runtime is used to run the agent.
"""
logger.info('-' * 30)
logger.info('BEGIN Runtime Initialization Fn')
logger.info('-' * 30)
workspace_dir_name = _get_swebench_workspace_dir_name(instance)
obs: CmdOutputObservation
# Set instance id
action = CmdRunAction(
command=f"""echo 'export SWE_INSTANCE_ID={instance['instance_id']}' >> ~/.bashrc && echo 'export PIP_CACHE_DIR=~/.cache/pip' >> ~/.bashrc && echo "alias git='git --no-pager'" >> ~/.bashrc"""
)
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
assert obs.exit_code == 0
if USE_INSTANCE_IMAGE:
# inject the init script
script_dir = os.path.dirname(__file__)
# inject the instance info
action = CmdRunAction(command='mkdir -p /swe_util/eval_data/instances')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
assert (
obs.exit_code == 0
), f'Failed to create /swe_util/eval_data/instances: {obs.content}'
swe_instance_json_name = 'swe-bench-instance.json'
with tempfile.TemporaryDirectory() as temp_dir:
# Construct the full path for the desired file name within the temporary directory
temp_file_path = os.path.join(temp_dir, swe_instance_json_name)
# Write to the file with the desired name within the temporary directory
with open(temp_file_path, 'w') as f:
if not isinstance(instance, dict):
json.dump([instance.to_dict()], f)
else:
json.dump([instance], f)
# Copy the file to the desired location
await runtime.copy_to(temp_file_path, '/swe_util/eval_data/instances/')
# inject the instance swe entry
await runtime.copy_to(
str(os.path.join(script_dir, 'scripts/setup/instance_swe_entry.sh')),
'/swe_util/',
)
action = CmdRunAction(command='cat ~/.bashrc')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
assert obs.exit_code == 0
action = CmdRunAction(command='source ~/.bashrc')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
assert obs.exit_code == 0
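# instance_swe_entry.sh sets up the task environment (e.g., copies the instance repo into /workspace); see scripts/setup/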
action = CmdRunAction(command='source /swe_util/instance_swe_entry.sh')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
assert obs.exit_code == 0
else:
action = CmdRunAction(command='source /swe_util/swe_entry.sh')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
assert (
obs.exit_code == 0
), f'Failed to source /swe_util/swe_entry.sh: {obs.content}'
action = CmdRunAction(command=f'cd /workspace/{workspace_dir_name}')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
assert obs.exit_code == 0
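# Remove all future commits and remotes, following Devin's SWE-bench setup
# https://www.cognition-labs.com/post/swe-bench-technical-report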
action = CmdRunAction(command='git reset --hard')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
assert obs.exit_code == 0
action = CmdRunAction(
command='for remote_name in $(git remote); do git remote remove "${remote_name}"; done'
)
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
assert obs.exit_code == 0
logger.info('-' * 30)
logger.info('END Runtime Initialization Fn')
logger.info('-' * 30)
async def complete_runtime(
runtime: Runtime,
instance: pd.Series, # this argument is not required, but it is used to get the workspace_dir_name
) -> dict[str, Any]:
"""Complete the runtime for the agent.
This function is called after the agent has finished running.
If you need to do something in the sandbox to get the correctness metric after
the agent has run, modify this function.
"""
logger.info('-' * 30)
logger.info('BEGIN Runtime Completion Fn')
logger.info('-' * 30)
obs: CmdOutputObservation
workspace_dir_name = _get_swebench_workspace_dir_name(instance)
action = CmdRunAction(command=f'cd /workspace/{workspace_dir_name}')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
assert obs.exit_code == 0
action = CmdRunAction(command='git config --global core.pager ""')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
assert obs.exit_code == 0
action = CmdRunAction(command='git add -A')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
assert obs.exit_code == 0
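# Retry the git diff a few times with an increasing timeout, since producing a large diff can be slow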
n_retries = 0
git_patch = None
while n_retries < 5:
action = CmdRunAction(
command=f'git diff --no-color --cached {instance["base_commit"]}',
keep_prompt=False,
)
action.timeout = 600 + 100 * n_retries
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
n_retries += 1
if isinstance(obs, CmdOutputObservation):
if obs.exit_code == 0:
git_patch = obs.content.strip()
break
else:
logger.info('Failed to get git diff, retrying...')
await asyncio.sleep(10)
elif isinstance(obs, ErrorObservation):
logger.error(f'Error occurred: {obs.content}. Retrying...')
await asyncio.sleep(10)
else:
raise ValueError(f'Unexpected observation type: {type(obs)}')
logger.info('-' * 30)
logger.info('END Runtime Completion Fn')
logger.info('-' * 30)
return {'git_patch': git_patch}
async def process_instance(
instance: pd.Series,
metadata: EvalMetadata,
reset_logger: bool = True,
) -> EvalOutput:
config = get_config(instance, metadata)
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
if reset_logger:
log_dir = os.path.join(metadata.eval_output_dir, 'infer_logs')
reset_logger_for_multiprocessing(logger, instance.instance_id, log_dir)
else:
logger.info(f'Starting evaluation for instance {instance.instance_id}.')
runtime = await create_runtime(config, sid=instance.instance_id)
await initialize_runtime(runtime, instance)
instruction = get_instruction(instance, metadata)
# Here's how you can run the agent (similar to the `main` function) and get the final task state
config.max_iterations = metadata.max_iterations
state: State | None = asyncio.run(
run_controller(
config=config,
task_str=instruction,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN[
agent.__class__.__name__
],
agent=agent,
sandbox=sandbox,
sid=instance.instance_id,
)
state: State | None = await run_controller(
config=config,
task_str=instruction,
runtime=runtime,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN[metadata.agent_class],
)
# ======= THIS IS SWE-Bench specific =======
# Get git patch
git_patch = sandbox.get_diff_patch()
logger.info(f'Got git diff for instance {instance.instance_id}')
return_val = await complete_runtime(runtime, instance)
git_patch = return_val['git_patch']
logger.info(
f'Got git diff for instance {instance.instance_id}:\n--------\n{git_patch}\n--------'
)
# ==========================================
# ======= Attempt to evaluate the agent's edits =======
# TODO: if you need to do something in the sandbox to get the correctness metric, modify this function
test_result = get_test_result(instance, sandbox, workspace_dir_name)
# we use eval_infer.sh to evaluate the agent's edits, not here
# because the agent may alter the environment / testcases
test_result = {
'git_patch': git_patch,
}
# If you are working on some simpler benchmark that only evaluates the final model output (e.g., in a MessageAction)
# You can simply get the LAST `MessageAction` from the returned `state.history` and parse it for evaluation.
if state is None:
raise ValueError('State should not be None.')
metrics = state.metrics.get() if state.metrics else None
# history is now available as a stream of events, rather than list of pairs of (Action, Observation)
# for compatibility with the existing output format, we can remake the pairs here
# remove when it becomes unnecessary
histories = state.history.compatibility_for_eval_history_pairs()
metrics = state.metrics.get() if state.metrics else None
# Save the output
output = {
'instance_id': instance.instance_id,
'swe_instance': instance.to_dict(), # SWE Bench specific
'instruction': instruction,
'git_patch': git_patch, # SWE Bench specific
'metadata': metadata.model_dump(),
'history': histories,
'metrics': metrics,
'error': state.last_error if state and state.last_error else None,
'test_result': test_result,
}
# Close the sandbox
sandbox.close()
output = EvalOutput(
instance_id=instance.instance_id,
instruction=instruction,
instance=instance.to_dict(), # SWE Bench specific
test_result=test_result,
metadata=metadata,
history=histories,
metrics=metrics,
error=state.last_error if state and state.last_error else None,
)
return output
@ -359,11 +383,12 @@ if __name__ == '__main__':
dataset = load_dataset('princeton-nlp/SWE-bench_Lite')
swe_bench_tests = filter_dataset(dataset['test'].to_pandas(), 'instance_id')
id_column = 'instance_id'
llm_config = get_llm_config_arg(args.llm_config) if args.llm_config else config.llm
if args.llm_config and llm_config is None:
raise ValueError(f'Could not find LLM config {args.llm_config}')
logger.info(f'Config for evaluation: {config}')
llm_config = None
if args.llm_config:
llm_config = get_llm_config_arg(args.llm_config)
if llm_config is None:
raise ValueError(f'Could not find LLM config: --llm_config {args.llm_config}')
details = {}
_agent_cls = agenthub.Agent.get_cls(args.agent_cls)
@ -383,14 +408,10 @@ if __name__ == '__main__':
)
output_file = os.path.join(metadata.eval_output_dir, 'output.jsonl')
instances = prepare_dataset(
swe_bench_tests, output_file, args.eval_n_limit, id_column
)
run_evaluation(
instances,
metadata,
output_file,
args.eval_num_workers,
process_instance,
id_column,
instances = prepare_dataset(swe_bench_tests, output_file, args.eval_n_limit)
asyncio.run(
run_evaluation(
instances, metadata, output_file, args.eval_num_workers, process_instance
)
)

View File

@ -45,9 +45,16 @@ def process_git_patch(patch):
def convert_row_to_swebench_format(row):
if 'git_patch' in row:
model_patch = row['git_patch']
elif 'test_result' in row and 'git_patch' in row['test_result']:
model_patch = row['test_result']['git_patch']
else:
raise ValueError(f'Row {row} does not have a git_patch')
return {
'instance_id': row['instance_id'],
'model_patch': process_git_patch(row['git_patch']),
'model_patch': process_git_patch(model_patch),
'model_name_or_path': model_name,
}

View File

@ -27,8 +27,8 @@ if [ -z "$MAX_ITER" ]; then
fi
if [ -z "$USE_INSTANCE_IMAGE" ]; then
echo "USE_INSTANCE_IMAGE not specified, use default false"
USE_INSTANCE_IMAGE=false
echo "USE_INSTANCE_IMAGE not specified, use default true"
USE_INSTANCE_IMAGE=true
fi
export USE_INSTANCE_IMAGE=$USE_INSTANCE_IMAGE

View File

@ -45,7 +45,11 @@ echo "$item" | jq -r '.patch' > $SWE_TASK_DIR/gold.patch
echo "$item" | jq 'del(.test_patch, .patch)' > $SWE_TASK_DIR/instance.json
# Clear the workspace
rm -rf /workspace/*
if [ -d /workspace ]; then
rm -rf /workspace/*
else
mkdir /workspace
fi
# Copy repo to workspace
if [ -d /workspace/$WORKSPACE_NAME ]; then
rm -rf /workspace/$WORKSPACE_NAME
@ -61,7 +65,7 @@ mkdir -p $SWE_TASK_DIR/reset_testbed_log_dir
REPO_PATH=/workspace/$WORKSPACE_NAME
echo "Repo Path: $REPO_PATH"
echo "Test Command: $TEST_CMD"
# echo "Test Command: $TEST_CMD"
echo "export REPO_PATH=\"$REPO_PATH\"" >> ~/.bashrc
# echo "export TEST_CMD=\"$TEST_CMD\"" >> ~/.bashrc

View File

@ -1,313 +0,0 @@
import json
import os
import sys
import tempfile
import uuid
from datasets import load_dataset
from swebench.harness.constants import MAP_REPO_TO_TEST_FRAMEWORK
from swebench.harness.utils import get_test_directives
from opendevin.core.config import AppConfig, SandboxConfig, load_app_config
from opendevin.core.logger import opendevin_logger as logger
from opendevin.runtime.docker.ssh_box import DockerSSHBox
from opendevin.runtime.plugins import (
AgentSkillsRequirement,
JupyterRequirement,
PluginRequirement,
)
SWE_BENCH_CONTAINER_IMAGE = 'ghcr.io/opendevin/eval-swe-bench:full-v1.2.1'
def get_image_name_from_instance_id(instance_id: str) -> str:
return 'sweb.eval.x86_64.' + instance_id
class SWEBenchSSHBox(DockerSSHBox):
def __init__(
self,
config: AppConfig,
container_image: str,
timeout: int = 120,
sid: str | None = None,
swe_instance_id: str | None = None,
swe_instance: dict | None = None,
skip_workspace_mount: bool = True,
sandbox_plugins: list[PluginRequirement] = [], # noqa: B006
workspace_dir_name: str | None = None,
use_instance_image: bool = False,
):
if swe_instance_id is None:
raise ValueError('swe_instance_id must be provided!')
self.swe_instance_id = swe_instance_id
self.swe_instance = swe_instance
self.skip_workspace_mount = skip_workspace_mount
self.workspace_dir_name = workspace_dir_name
assert (
container_image is not None
), 'container_image is required for SWEBenchSSHBox!'
# Need to run as root to use SWEBench container
sid = f'swe_bench_{swe_instance_id}_' + str(uuid.uuid4())
logger.info(f'===Using container image: {container_image}')
super().__init__(
config=SandboxConfig(container_image=container_image, timeout=timeout),
persist_sandbox=config.persist_sandbox,
workspace_mount_path=config.workspace_mount_path,
sandbox_workspace_dir=config.workspace_mount_path_in_sandbox,
cache_dir=config.cache_dir,
run_as_devin=config.run_as_devin,
ssh_hostname=config.ssh_hostname,
ssh_password=config.ssh_password,
ssh_port=config.ssh_port,
sid=sid,
)
self.init_plugins(sandbox_plugins)
exit_code, output = self.execute('mv ~/.bashrc ~/.bashrc.bak')
assert exit_code == 0, f'Failed to backup ~/.bashrc: {output}'
exit_code, output = self.execute(
f"echo 'export SWE_INSTANCE_ID={self.swe_instance_id}' >> ~/.bashrc && echo 'export PIP_CACHE_DIR=~/.cache/pip' >> ~/.bashrc && echo \"alias git='git --no-pager'\" >> ~/.bashrc"
)
assert exit_code == 0, f'Failed to set SWE_INSTANCE_ID in ~/.bashrc: {output}'
logger.info('Sourcing swe_entry.sh to set up environment variables')
logger.info(
'Initialization of SWEBench may take approximately 10 minutes due to long-running installations, such as those requiring compilation.'
)
logger.info(f'Use instance image: {use_instance_image}')
if use_instance_image:
# we directly inject the instance info into the container and the init script
script_dir = os.path.dirname(__file__)
# inject test command
test_type = MAP_REPO_TO_TEST_FRAMEWORK[swe_instance['repo']][
swe_instance['version']
]
swe_instance['test_directives'] = get_test_directives(swe_instance)
swe_instance['test_cmd'] = (
f"{test_type} {' '.join(swe_instance['test_directives'])}"
)
exit_code, output = self.execute(
f"""echo "export TEST_CMD='{swe_instance["test_cmd"]}'" >> ~/.bashrc"""
)
# assert exit_code == 0, f'Failed to set TEST_CMD in ~/.bashrc: {output}'
# inject the instance info
self.execute('mkdir -p /swe_util/eval_data/instances')
swe_instance_json_name = 'swe-bench-instance.json'
with tempfile.TemporaryDirectory() as temp_dir:
# Construct the full path for the desired file name within the temporary directory
temp_file_path = os.path.join(temp_dir, swe_instance_json_name)
# Write to the file with the desired name within the temporary directory
with open(temp_file_path, 'w') as f:
if not isinstance(swe_instance, dict):
json.dump([swe_instance.to_dict()], f)
else:
json.dump([swe_instance], f)
# Copy the file to the desired location
self.copy_to(temp_file_path, '/swe_util/eval_data/instances/')
# inject the init script
self.copy_to(
str(os.path.join(script_dir, 'scripts/setup/instance_swe_entry.sh')),
'/swe_util/',
)
self.execute('cat ~/.bashrc')
self.execute('source ~/.bashrc')
self.execute('source /swe_util/instance_swe_entry.sh', timeout=600)
logger.info('exit code: %d', exit_code)
logger.info(output)
assert exit_code == 0, f'Failed to source swe_entry.sh: {output}'
logger.info('Sourced swe_entry.sh successfully')
else:
exit_code, output = self.execute(
'source /swe_util/swe_entry.sh', timeout=600
)
logger.info('exit code: %d', exit_code)
logger.info(output)
assert exit_code == 0, f'Failed to source swe_entry.sh: {output}'
logger.info('Sourced swe_entry.sh successfully')
@property
def volumes(self):
if self.skip_workspace_mount:
return {
k: v
for k, v in super().volumes.items()
if not v['bind'] == self.sandbox_workspace_dir
}
return super().volumes
@classmethod
def get_box_for_instance(
cls,
config: AppConfig,
instance,
workspace_dir_name=None,
skip_workspace_mount: bool = True,
workspace_mount_path: str | None = None,
sandbox_plugins: list[PluginRequirement] = [], # noqa: B006
use_instance_image: bool = False,
) -> 'SWEBenchSSHBox':
if workspace_dir_name is None:
workspace_dir_name = f"{instance['repo']}__{instance['version']}".replace(
'/', '__'
)
old_workspace_base = config.workspace_base
old_workspace_mount_path = config.workspace_mount_path
try:
config.workspace_base = workspace_mount_path
config.workspace_mount_path = workspace_mount_path
# linting python after editing helps LLM fix indentations
config.sandbox.enable_auto_lint = True
# Need to run as root to use SWEBench container
config.run_as_devin = False
if use_instance_image:
container_image = get_image_name_from_instance_id(
instance['instance_id']
)
else:
container_image = SWE_BENCH_CONTAINER_IMAGE
sandbox = cls(
container_image=container_image,
config=config,
swe_instance_id=instance['instance_id'],
swe_instance=instance,
skip_workspace_mount=skip_workspace_mount,
sandbox_plugins=sandbox_plugins,
workspace_dir_name=workspace_dir_name,
use_instance_image=use_instance_image,
)
logger.info(f"SSH box started for instance {instance['instance_id']}.")
# cd to the repo
exit_code, output = sandbox.execute(f'cd /workspace/{workspace_dir_name}')
if exit_code != 0:
logger.error(f'Failed to cd to the repo: {output}')
sys.exit(1)
# remove all future commits & remote following Devin
# https://www.cognition-labs.com/post/swe-bench-technical-report
exit_code, output = sandbox.execute('git reset --hard')
if exit_code != 0:
logger.error(f'Failed to reset the repo: {output}')
sys.exit(1)
exit_code, output = sandbox.execute(
'for remote_name in $(git remote); do git remote remove "${remote_name}"; done'
)
if exit_code != 0:
logger.error(f'Failed to remove remote: {output}')
sys.exit(1)
except Exception:
raise
finally:
# restore workspace_base and workspace_mount_path
config.workspace_base = old_workspace_base
config.workspace_mount_path = old_workspace_mount_path
return sandbox
def get_diff_patch(self):
# add everything to the index
exit_code, output = self.execute(f'cd /workspace/{self.workspace_dir_name}')
if exit_code != 0:
logger.error('Failed to cd to the repo')
return ''
exit_code, _output = self.execute('git config --global core.pager ""')
if exit_code != 0:
logger.error('Failed to change git config')
return ''
# add everything to the index
exit_code, output = self.execute('git add -A')
if exit_code != 0:
logger.error('Failed to add everything to the index')
return ''
# get the git diff
exit_code, git_patch = self.execute(
f'git diff --no-color --cached {self.swe_instance["base_commit"]}'
)
if exit_code != 0:
logger.error('Failed to get git diff')
return ''
return git_patch
if __name__ == '__main__':
config = load_app_config()
# NOTE: It is preferable to load datasets from huggingface datasets and perform post-processing
# so we don't need to manage file uploading to OpenDevin's repo
dataset = load_dataset('princeton-nlp/SWE-bench_Lite')
swe_bench_tests = dataset['test'].to_pandas()
USE_INSTANCE_IMAGE = os.environ.get('USE_INSTANCE_IMAGE', 'false') == 'true'
logger.info(f'USE_INSTANCE_IMAGE: {USE_INSTANCE_IMAGE}')
# INSTANCE_ID = 'django__django-11099'
INSTANCE_ID = 'astropy__astropy-12907'
swe_bench_tests = swe_bench_tests[swe_bench_tests['instance_id'] == INSTANCE_ID]
EXAMPLE_INSTANCE = swe_bench_tests.iloc[0].to_dict()
sandbox = SWEBenchSSHBox.get_box_for_instance(
config=config,
instance=EXAMPLE_INSTANCE,
sandbox_plugins=[AgentSkillsRequirement(), JupyterRequirement()],
use_instance_image=USE_INSTANCE_IMAGE,
)
# PRE TEST
exit_code, output = sandbox.execute('cd $REPO_PATH')
assert exit_code == 0, 'Failed to cd $REPO_PATH'
logger.info(f'cd $REPO_PATH: {output}')
# apply test patch
exit_code, output = sandbox.execute('git apply $SWE_TASK_DIR/test.patch')
assert exit_code == 0, 'Failed to apply test patch'
logger.info(f'git apply $SWE_TASK_DIR/test.patch: {output}')
# TEST
exit_code, output = sandbox.execute('$TEST_CMD')
assert exit_code == 1, 'Expected exit code 1 (since this is a FAIL_TO_PASS)'
logger.info(f'$TEST_CMD:\n{output}')
# apply gold patch
exit_code, output = sandbox.execute('git apply $SWE_TASK_DIR/gold.patch')
logger.info('exit code: %d', exit_code)
logger.info(f'git apply $SWE_TASK_DIR/gold.patch: {output}')
# TEST
exit_code, output = sandbox.execute('$TEST_CMD')
assert exit_code == 0, 'Expected exit code 0 (since we applied the gold patch)'
logger.info(f'$TEST_CMD:\n{output}')
# Reset the repo
exit_code, output = sandbox.execute('git reset --hard')
assert exit_code == 0, 'Failed to reset the repo'
logger.info(f'git reset --hard: {output}')
sys.stdout.flush()
try:
while True:
try:
user_input = input('>>> ')
except EOFError:
logger.info('Exiting...')
break
if user_input.lower() == 'exit':
logger.info('Exiting...')
break
exit_code, output = sandbox.execute(user_input)
logger.info('exit code: %d', exit_code)
logger.info(output)
sys.stdout.flush()
except KeyboardInterrupt:
logger.info('Exiting...')
sandbox.close()

View File

@ -0,0 +1,17 @@
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y python3 python3-pip
RUN mkdir /workspace
WORKDIR /workspace
COPY data/ /workspace/data/
COPY tools/ /workspace/tools/
# TODO: NEED TO FIGURE OUT DEPENDENCIES FOR THESE TOOLS
# pushd evaluation/toolqa
# mkdir data
# python3 -c "from utils import download_data, download_tools; download_data('/workspace'); download_tools('/workspace')"
# docker build --network host -t xingyaoww/od-eval-toolqa .

View File

@ -2,13 +2,9 @@
This folder contains an evaluation harness we built on top of the original [ToolQA](https://github.com/night-chen/ToolQA) ([paper](https://arxiv.org/pdf/2306.13304)).
## Setup Environment
## Setup Environment and LLM Configuration
Please follow [this document](https://github.com/OpenDevin/OpenDevin/blob/main/Development.md) to setup local development environment for OpenDevin.
## Configure OpenDevin and your LLM
Run `make setup-config` to set up the `config.toml` file if it does not exist at the root of the workspace.
Please follow the instructions [here](../README.md#setup) to set up your local development environment and LLM.
## Run Inference on ToolQA Instances

View File

@ -1,29 +1,31 @@
import asyncio
import logging
import os
import pathlib
from typing import Any
import pandas as pd
from evaluation.toolqa.utils import encode_question, eval_answer, get_data
from evaluation.utils.shared import (
EvalMetadata,
EvalOutput,
codeact_user_response,
make_metadata,
prepare_dataset,
reset_logger_for_multiprocessing,
run_evaluation,
)
from opendevin.controller.agent import Agent
from opendevin.controller.state.state import State
from opendevin.core.config import get_llm_config_arg, get_parser, load_app_config
from opendevin.core.logger import get_console_handler
from opendevin.core.config import (
AppConfig,
SandboxConfig,
get_llm_config_arg,
get_parser,
)
from opendevin.core.logger import opendevin_logger as logger
from opendevin.core.main import run_controller
from opendevin.llm.llm import LLM
from .utils import download_data, download_tools, encode_question, eval_answer, get_data
config = load_app_config()
from opendevin.core.main import create_runtime, run_controller
from opendevin.events.action import CmdRunAction
from opendevin.events.observation import CmdOutputObservation
from opendevin.runtime.runtime import Runtime
AGENT_CLS_TO_FAKE_USER_RESPONSE_FN = {
'CodeActAgent': codeact_user_response,
@ -34,59 +36,84 @@ AGENT_CLS_TO_INST_SUFFIX = {
}
def process_instance(instance: Any, metadata: EvalMetadata, reset_logger: bool = True):
agent = Agent.get_cls(metadata.agent_class)(llm=LLM(config=metadata.llm_config))
# create process-specific workspace dir
# we will create a workspace directory for EACH process
# so that different agent don't interfere with each other.
workspace_mount_path = config.workspace_mount_path
pathlib.Path(workspace_mount_path).mkdir(parents=True, exist_ok=True)
def get_config(
metadata: EvalMetadata,
) -> AppConfig:
config = AppConfig(
default_agent=metadata.agent_class,
run_as_devin=False,
runtime='eventstream',
max_iterations=metadata.max_iterations,
sandbox=SandboxConfig(
container_image='ubuntu:22.04',
enable_auto_lint=True,
use_host_network=False,
update_source_code=True,
),
# do not mount workspace
workspace_base=None,
workspace_mount_path=None,
)
config.set_llm_config(metadata.llm_config)
return config
async def initialize_runtime(runtime: Runtime):
"""Initialize the runtime for the agent.
This function is called before the runtime is used to run the agent.
"""
logger.info(f"{'-' * 50} BEGIN Runtime Initialization Fn {'-' * 50}")
obs: CmdOutputObservation
# Create and enter the workspace directory
action = CmdRunAction(command='mkdir -p /workspace')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
action = CmdRunAction(command='cd /workspace')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
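# Make the Wolfram Alpha app id available as an environment variable inside the runtime sandbox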
await runtime.add_env_vars({'WOLFRAM_ALPHA_APPID': args.wolfram_alpha_appid})
logger.info(f"{'-' * 50} END Runtime Initialization Fn {'-' * 50}")
async def process_instance(
instance: Any, metadata: EvalMetadata, reset_logger: bool = True
):
config = get_config(metadata)
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
eval_output_dir = metadata.eval_output_dir
qid = instance.qid
question = instance.question
answer = instance.answer
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
if reset_logger:
# Set up logger
log_file = os.path.join(eval_output_dir, 'logs', f'instance_{qid}.log')
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
# add back the console handler to print ONE line
logger.addHandler(get_console_handler())
logger.info(
f'Starting evaluation for instance {qid}.\nHint: run "tail -f {log_file}" to see live logs in a separate shell'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
file_handler = logging.FileHandler(log_file)
file_handler.setFormatter(
logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
)
logger.addHandler(file_handler)
logger.info(f'Process-specific workspace mounted at {workspace_mount_path}')
log_dir = os.path.join(metadata.eval_output_dir, 'infer_logs')
reset_logger_for_multiprocessing(logger, qid, log_dir)
else:
logger.info(f'Starting evaluation for instance {qid}.')
# Prepare instruction
instruction = encode_question(question)
instruction += 'IMPORTANT: You should ONLY interact with the environment provided to you AND NEVER ASK FOR HUMAN HELP.\n'
# NOTE: You can actually set slightly different instruction for different agents
instruction += AGENT_CLS_TO_INST_SUFFIX[agent.__class__.__name__]
# logger.info(f'Instruction:\n{instruction}', extra={'msg_type': 'OBSERVATION'})
instruction += AGENT_CLS_TO_INST_SUFFIX[metadata.agent_class]
logger.info(f'Instruction:\n{instruction}', extra={'msg_type': 'OBSERVATION'})
runtime = await create_runtime(config, sid=qid)
await initialize_runtime(runtime)
# Here's how you can run the agent (similar to the `main` function) and get the final task state
config.max_iterations = metadata.max_iterations
state: State | None = asyncio.run(
run_controller(
config=config,
task_str=instruction,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN[
agent.__class__.__name__
],
agent=agent,
sid=qid,
)
state: State | None = await run_controller(
config=config,
task_str=instruction,
runtime=runtime,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN[metadata.agent_class],
)
# ======= Attempt to evaluate the agent's edits =======
# If you are working on simpler benchmark that only evaluates the final model output (e.g., in a MessageAction)
@ -110,17 +137,17 @@ def process_instance(instance: Any, metadata: EvalMetadata, reset_logger: bool =
histories = state.history.compatibility_for_eval_history_pairs()
# Save the output
output = {
'qid': qid,
'text': model_answer_raw,
'correct': correct,
'answer_id': 'None',
'model_id': metadata.model_name,
'metadata': metadata,
'history': histories,
'metrics': metrics,
'error': state.last_error if state and state.last_error else None,
}
output = EvalOutput(
instance_id=qid,
test_result={
'model_answer_raw': model_answer_raw,
'correct': correct,
},
metadata=metadata,
history=histories,
metrics=metrics,
error=state.last_error if state and state.last_error else None,
)
return output
@ -145,8 +172,12 @@ if __name__ == '__main__':
default='YOUR_WOLFRAMALPHA_APPID',
)
args, _ = parser.parse_known_args()
llm_config = get_llm_config_arg(args.llm_config) if args.llm_config else config.llm
logger.info(f'Config for evaluation: {config}')
llm_config = None
if args.llm_config:
llm_config = get_llm_config_arg(args.llm_config)
if llm_config is None:
raise ValueError(f'Could not find LLM config: --llm_config {args.llm_config}')
dataset = ''
hardness = ''
@ -168,14 +199,9 @@ if __name__ == '__main__':
if args.hardness not in ['easy', 'hard']:
raise ValueError('Please choose from easy and hard for hardness.')
# workspace_mount_path = os.path.join(config.workspace_mount_path, '_eval_workspace')
workspace_mount_path = config.workspace_mount_path
pathlib.Path(workspace_mount_path).mkdir(parents=True, exist_ok=True)
toolqa_test = pd.DataFrame(get_data(dataset, hardness))
toolqa_data_path = download_data(workspace_mount_path)
toolqa_tool_path = download_tools(workspace_mount_path, args.wolfram_alpha_appid)
toolqa_test.rename(columns={'qid': 'instance_id'}, inplace=True)
id_column = 'qid'
metadata = make_metadata(
llm_config,
f'toolqa-{args.dataset}-{args.hardness}',
@ -184,12 +210,9 @@ if __name__ == '__main__':
args.eval_output_dir,
)
output_file = os.path.join(metadata.eval_output_dir, 'output.jsonl')
instances = prepare_dataset(toolqa_test, output_file, args.eval_n_limit, id_column)
run_evaluation(
instances,
metadata,
output_file,
args.eval_num_workers,
process_instance,
id_column,
instances = prepare_dataset(toolqa_test, output_file, args.eval_n_limit)
asyncio.run(
run_evaluation(
instances, metadata, output_file, args.eval_num_workers, process_instance
)
)

View File

@ -4,11 +4,12 @@ import re
import string
import zipfile
import gdown
import requests
def download_data(dir):
import gdown
data_path = os.path.join(dir, 'data/external_corpus')
if os.path.exists(data_path):
return data_path
@ -19,6 +20,7 @@ def download_data(dir):
zip_ref.extractall(os.path.join(dir, 'data'))
if os.path.exists(zip_path):
os.remove(zip_path)
print(f'Data saved to {data_path}')
return data_path
@ -42,6 +44,7 @@ def download_tools(dir, wolfram_alpha_appid='YOUR_WOLFRAMALPHA_APPID'):
output_file = os.path.join(tool_path, tool.split('/')[1])
with open(output_file, 'wb') as f:
f.write(response.content)
print(f'Tool saved to {output_file}')
with open(os.path.join(tool_path, 'calculator.py'), 'r') as f:
content = f.read()
new_content = content.replace('YOUR_WOLFRAMALPHA_APPID', wolfram_alpha_appid)
@ -64,14 +67,29 @@ def download_tools(dir, wolfram_alpha_appid='YOUR_WOLFRAMALPHA_APPID'):
f.write(new_content)
LOCAL_DATA_DIR = os.path.join(os.path.dirname(__file__), 'data')
def get_data(dataset, hardness):
data = []
url = f'https://raw.githubusercontent.com/night-chen/ToolQA/main/data/questions/{hardness}/{dataset}-{hardness}.jsonl'
url = requests.get(url)
if url.status_code == 200:
lines = url.text.splitlines()
for line in lines:
data.append(json.loads(line))
data_path = os.path.join(LOCAL_DATA_DIR, f'{dataset}-{hardness}.jsonl')
if os.path.exists(data_path):
print(f'Loading data from {data_path}')
with open(data_path, 'r') as f:
return json.load(f)
else:
print(
f'Downloading data from https://raw.githubusercontent.com/night-chen/ToolQA/main/data/questions/{hardness}/{dataset}-{hardness}.jsonl'
)
data = []
url = f'https://raw.githubusercontent.com/night-chen/ToolQA/main/data/questions/{hardness}/{dataset}-{hardness}.jsonl'
url = requests.get(url)
if url.status_code == 200:
lines = url.text.splitlines()
for line in lines:
data.append(json.loads(line))
with open(data_path, 'w') as f:
json.dump(data, f)
print(f'Data saved to {data_path}')
return data

View File

@ -1,12 +1,13 @@
import asyncio
import json
import logging
import multiprocessing as mp
import os
import pathlib
import subprocess
import time
from asyncio.log import logger
from concurrent.futures import ProcessPoolExecutor
from typing import Any, Callable
from typing import Any, Awaitable, Callable
import pandas as pd
from pydantic import BaseModel
@ -14,6 +15,8 @@ from tqdm import tqdm
from opendevin.controller.state.state import State
from opendevin.core.config import LLMConfig
from opendevin.core.logger import get_console_handler
from opendevin.core.logger import opendevin_logger as logger
from opendevin.events.action import Action
from opendevin.events.action.message import MessageAction
@ -38,6 +41,31 @@ class EvalMetadata(BaseModel):
return json.dumps(dumped_dict)
class EvalOutput(BaseModel):
# NOTE: User-specified
instance_id: str
instruction: str
# output of the evaluation
# store anything that is needed for the score calculation
test_result: dict[str, Any]
# Interaction info
metadata: EvalMetadata
history: list[tuple[dict[str, Any], dict[str, Any]]]
metrics: dict[str, Any]
error: str | None = None
# Optionally save the input test instance
instance: dict[str, Any] | None = None
def model_dump_json(self, *args, **kwargs):
dumped = super().model_dump_json(*args, **kwargs)
dumped_dict = json.loads(dumped)
# Apply custom serialization for metadata (to avoid leaking sensitive information)
dumped_dict['metadata'] = json.loads(self.metadata.model_dump_json())
return json.dumps(dumped_dict)
def codeact_user_response(
state: State,
encapsulate_solution: bool = False,
@ -136,7 +164,11 @@ def make_metadata(
return metadata
def prepare_dataset(dataset: pd.DataFrame, output_file, eval_n_limit, id_column):
def prepare_dataset(dataset: pd.DataFrame, output_file: str, eval_n_limit: int):
assert (
'instance_id' in dataset.columns
), "Expected 'instance_id' column in the dataset. You should define your own unique identifier for each instance and use it as the 'instance_id' column."
id_column = 'instance_id'
logger.info(f'Writing evaluation output to {output_file}')
finished_ids = set()
if os.path.exists(output_file):
@ -164,14 +196,16 @@ def prepare_dataset(dataset: pd.DataFrame, output_file, eval_n_limit, id_column)
return pd.DataFrame(new_dataset)
def run_evaluation(
async def run_evaluation(
dataset: pd.DataFrame,
metadata: EvalMetadata,
output_file: str,
num_workers: int,
process_instance_func: Callable[[pd.Series, EvalMetadata, bool], Any],
id_column: str,
process_instance_func: Callable[
[pd.Series, EvalMetadata, bool], Awaitable[EvalOutput]
],
):
use_multiprocessing = num_workers > 1
logger.info(
f'Evaluation started with Agent {metadata.agent_class}, '
f'model {metadata.llm_config.model}, max iterations {metadata.max_iterations}.'
@ -179,35 +213,77 @@ def run_evaluation(
pbar = tqdm(total=len(dataset))
output_fp = open(output_file, 'a')
def update_progress(future):
async def update_progress(future):
pbar.update(1)
output = future.result()
pbar.set_description(f'Instance {output[id_column]}')
pbar.set_postfix_str(f'Test Result: {output["test_result"]["result"]}')
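# In the multiprocessing path `future` is an awaitable from run_in_executor; in the single-process path it is already an EvalOutput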
output: EvalOutput = await future if use_multiprocessing else future
pbar.set_description(f'Instance {output.instance_id}')
pbar.set_postfix_str(f'Test Result: {output.test_result}')
logger.info(
f'Finished evaluation for instance {output[id_column]}: {output["test_result"]["result"]}'
f'Finished evaluation for instance {output.instance_id}: {output.test_result}'
)
output_fp.write(json.dumps(output) + '\n')
output_fp.write(json.dumps(output.model_dump()) + '\n')
output_fp.flush()
try:
with ProcessPoolExecutor(num_workers) as executor:
futures = []
for _, instance in dataset.iterrows():
future = executor.submit(
process_instance_func,
instance,
metadata,
bool(num_workers > 1),
)
future.add_done_callback(update_progress)
futures.append(future)
if use_multiprocessing:
with ProcessPoolExecutor(num_workers) as executor:
loop = asyncio.get_event_loop()
futures = []
for _, instance in dataset.iterrows():
future = loop.run_in_executor(
executor,
process_instance_func,
instance,
metadata,
bool(num_workers > 1),
)
futures.append(update_progress(future))
await asyncio.gather(*futures)
# Use plain for loop for single process for easier debugging
else:
assert num_workers == 1
for _, instance in dataset.iterrows():
output = await process_instance_func(instance, metadata, False)
await update_progress(output)
for future in futures:
future.result()
except KeyboardInterrupt:
print('KeyboardInterrupt received. Cleaning up...')
cleanup()
output_fp.close()
logger.info('Evaluation finished.')
def reset_logger_for_multiprocessing(
logger: logging.Logger, instance_id: str, log_dir: str
):
"""Reset the logger for multiprocessing.
Save logs to a separate file for each process, instead of trying to write to the
same file/console from multiple processes.
"""
# Set up logger
log_file = os.path.join(
log_dir,
f'instance_{instance_id}.log',
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
# add back the console handler to print ONE line
logger.addHandler(get_console_handler())
logger.info(
f'Starting evaluation for instance {instance_id}.\n'
f'Hint: run "tail -f {log_file}" to see live logs in a separate shell'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
os.makedirs(os.path.dirname(log_file), exist_ok=True)
file_handler = logging.FileHandler(log_file)
file_handler.setFormatter(
logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
)
logger.addHandler(file_handler)
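For orientation, here is a minimal, hedged sketch of how a benchmark's `process_instance` might call this helper; the `infer_logs` directory name mirrors the webarena change later in this diff, and the `setup_instance_logging` wrapper is purely illustrative:

```python
# Sketch only: per-instance logging setup inside an eval's process_instance.
import os

from evaluation.utils.shared import EvalMetadata, reset_logger_for_multiprocessing
from opendevin.core.logger import opendevin_logger as logger


def setup_instance_logging(metadata: EvalMetadata, instance_id: str, reset_logger: bool) -> None:
    if reset_logger:
        # Each worker writes to <eval_output_dir>/infer_logs/instance_<id>.log
        # instead of interleaving output on the shared console.
        log_dir = os.path.join(metadata.eval_output_dir, 'infer_logs')
        reset_logger_for_multiprocessing(logger, instance_id, log_dir)
    else:
        logger.info(f'Starting evaluation for instance {instance_id}.')
```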

View File

@ -1,13 +1,11 @@
checkout_eval_branch() {
if [ -z "$COMMIT_HASH" ]; then
echo "Commit hash not specified, use current git commit"
build_sandbox
return 0
fi
if git diff --quiet $COMMIT_HASH HEAD; then
echo "The given hash is equivalent to the current HEAD"
build_sandbox
return 0
fi
@ -30,14 +28,8 @@ checkout_eval_branch() {
# Trap the EXIT signal to checkout original branch
trap checkout_original_branch EXIT
build_sandbox
}
build_sandbox() {
echo "Build sandbox locally"
docker build -t eval-sandbox -f containers/sandbox/Dockerfile /tmp
export SANDBOX_CONTAINER_IMAGE="eval-sandbox"
}
checkout_original_branch() {
if [ -z "$current_branch" ]; then

View File

@ -2,59 +2,14 @@
This folder contains the evaluation harness for the [WebArena](https://github.com/web-arena-x/webarena) benchmark, powered by [BrowserGym](https://github.com/ServiceNow/BrowserGym), which makes it easy to evaluate how well a browsing-capable agent performs on realistic web browsing tasks.
## Setup OpenDevin Environment
## Setup Environment and LLM Configuration
Please follow [this document](https://github.com/OpenDevin/OpenDevin/blob/main/Development.md) to setup local develop environment for OpenDevin.
## Configure OpenDevin and your LLM
Create a `config.toml` file if it does not exist at the root of the workspace.
Add the following configurations:
```toml
[core]
max_iterations = 100
cache_dir = "/tmp/cache"
ssh_hostname = "localhost"
[sandbox]
box_type = "ssh"
timeout = 120
# TODO: Change these to the model you want to evaluate
[eval_gpt4_1106_preview]
model = "gpt-4-1106-preview"
api_key = "XXX"
temperature = 0.0
[eval_some_openai_compatible_model]
model = "openai/MODEL_NAME"
base_url = "https://OPENAI_COMPATIBLE_URL/v1"
api_key = "XXX"
temperature = 0.0
```
Please follow the instructions [here](../README.md#setup) to set up your local development environment and LLM.
## Setup WebArena Environment
WebArena requires you to set up websites with pre-populated content that are reachable via URL from the machine running the OpenDevin agents.
Follow [this document](https://github.com/web-arena-x/webarena/blob/main/environment_docker/README.md) to set up your own WebArena environment through local servers or AWS EC2 instances.
Take note of the base URL of the machine where the environment is installed.
## Setup Environment Variables of WebArena Websites
Create a script `webarena_env.sh` under `evaluation/webarena/scripts` with the following:
```bash
export BASE_URL=<YOUR_SERVER_URL_HERE>
export SHOPPING="$BASE_URL:7770/"
export SHOPPING_ADMIN="$BASE_URL:7780/admin"
export REDDIT="$BASE_URL:9999"
export GITLAB="$BASE_URL:8023"
export WIKIPEDIA="$BASE_URL:8888/wikipedia_en_all_maxi_2022-05/A/User:The_other_Kiwix_guy/Landing"
export MAP="$BASE_URL:3000"
export HOMEPAGE="$BASE_URL:4399"
export OPENAI_API_KEY="yourkey" # this key is required for some WebArena validators that utilize LLMs
```
Take note of the base URL (`$WEBARENA_BASE_URL`) of the machine where the environment is installed.
## Test if your environment works
@ -65,7 +20,9 @@ Follow the WebArena environment setup guide carefully, and make sure the URL fie
## Run Evaluation
```sh
```bash
export WEBARENA_BASE_URL=<YOUR_SERVER_URL_HERE>
export OPENAI_API_KEY="yourkey" # this key is required for some WebArena validators that utilize LLMs
bash evaluation/webarena/scripts/run_infer.sh
```

View File

@ -1,7 +1,7 @@
import asyncio
import json
import logging
import os
from typing import Any
import browsergym.webarena # noqa F401 register webarena tasks as gym environments
import gymnasium as gym
@ -9,93 +9,147 @@ import pandas as pd
from evaluation.utils.shared import (
EvalMetadata,
EvalOutput,
make_metadata,
prepare_dataset,
reset_logger_for_multiprocessing,
run_evaluation,
)
from opendevin.controller.agent import Agent
from opendevin.controller.state.state import State
from opendevin.core.config import get_llm_config_arg, load_app_config, parse_arguments
from opendevin.core.logger import get_console_handler
from opendevin.core.config import (
AppConfig,
SandboxConfig,
get_llm_config_arg,
parse_arguments,
)
from opendevin.core.logger import opendevin_logger as logger
from opendevin.core.main import run_controller
from opendevin.llm.llm import LLM
from opendevin.runtime.docker.ssh_box import DockerSSHBox
from opendevin.runtime.tools import RuntimeTool
config = load_app_config()
from opendevin.core.main import create_runtime, run_controller
from opendevin.events.action import (
BrowseInteractiveAction,
CmdRunAction,
MessageAction,
)
from opendevin.events.observation import CmdOutputObservation
from opendevin.runtime.browser.browser_env import (
BROWSER_EVAL_GET_GOAL_ACTION,
BROWSER_EVAL_GET_REWARDS_ACTION,
)
from opendevin.runtime.runtime import Runtime
SUPPORTED_AGENT_CLS = {'BrowsingAgent'}
docker_ssh_box: DockerSSHBox | None = None
def get_config(
metadata: EvalMetadata,
env_id: str,
) -> AppConfig:
base_url = os.environ.get('WEBARENA_BASE_URL', None)
openai_api_key = os.environ.get('OPENAI_API_KEY', None)
assert base_url is not None, 'WEBARENA_BASE_URL must be set'
assert openai_api_key is not None, 'OPENAI_API_KEY must be set'
config = AppConfig(
default_agent=metadata.agent_class,
run_as_devin=False,
runtime='eventstream',
max_iterations=metadata.max_iterations,
sandbox=SandboxConfig(
container_image='ubuntu:22.04',
enable_auto_lint=True,
use_host_network=False,
update_source_code=True,
browsergym_eval_env=env_id,
od_runtime_startup_env_vars={
'BASE_URL': base_url,
'OPENAI_API_KEY': openai_api_key,
'SHOPPING': f'{base_url}:7770/',
'SHOPPING_ADMIN': f'{base_url}:7780/admin',
'REDDIT': f'{base_url}:9999',
'GITLAB': f'{base_url}:8023',
'WIKIPEDIA': f'{base_url}:8888/wikipedia_en_all_maxi_2022-05/A/User:The_other_Kiwix_guy/Landing',
'MAP': f'{base_url}:3000',
'HOMEPAGE': f'{base_url}:4399',
},
),
# do not mount workspace
workspace_base=None,
workspace_mount_path=None,
)
config.set_llm_config(metadata.llm_config)
return config
def get_sandbox():
global docker_ssh_box
if docker_ssh_box is None:
docker_ssh_box = DockerSSHBox(
config=config.sandbox,
persist_sandbox=False,
workspace_mount_path=config.workspace_mount_path,
sandbox_workspace_dir=config.workspace_mount_path_in_sandbox,
cache_dir=config.cache_dir,
run_as_devin=config.run_as_devin,
)
return docker_ssh_box
async def initialize_runtime(
runtime: Runtime,
) -> str:
"""Initialize the runtime for the agent.
This function is called before the runtime is used to run the agent.
"""
logger.info(f"{'-' * 50} BEGIN Runtime Initialization Fn {'-' * 50}")
obs: CmdOutputObservation
# Create the workspace directory
action = CmdRunAction(command='mkdir -p /workspace')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
action = BrowseInteractiveAction(browser_actions=BROWSER_EVAL_GET_GOAL_ACTION)
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
goal = obs.content
logger.info(f"{'-' * 50} END Runtime Initialization Fn {'-' * 50}")
return goal
def process_instance(
async def complete_runtime(
runtime: Runtime,
) -> dict[str, Any]:
"""Complete the runtime for the agent.
This function is called after the agent has finished running.
If you need to do something in the sandbox to get the correctness metric after
the agent has run, modify this function.
"""
logger.info(f"{'-' * 50} BEGIN Runtime Completion Fn {'-' * 50}")
obs: CmdOutputObservation
action = BrowseInteractiveAction(browser_actions=BROWSER_EVAL_GET_REWARDS_ACTION)
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
logger.info(f"{'-' * 50} END Runtime Completion Fn {'-' * 50}")
return {
'rewards': json.loads(obs.content),
}
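The docstring above invites benchmark authors to extend this step; below is a hedged sketch of such a variant. The extra `cat` of `/workspace/results.json` is purely hypothetical, while `CmdRunAction`, `BrowseInteractiveAction`, and `runtime.run_action` are the same calls used elsewhere in this file:

```python
# Hypothetical variant of complete_runtime (sketch): collect an extra artifact
# from the sandbox before reading the BrowserGym rewards.
async def complete_runtime_with_artifacts(runtime: Runtime) -> dict[str, Any]:
    action = CmdRunAction(command='cat /workspace/results.json || true')  # hypothetical file
    logger.info(action, extra={'msg_type': 'ACTION'})
    artifact_obs = await runtime.run_action(action)

    action = BrowseInteractiveAction(browser_actions=BROWSER_EVAL_GET_REWARDS_ACTION)
    reward_obs = await runtime.run_action(action)
    return {
        'rewards': json.loads(reward_obs.content),
        'artifacts': artifact_obs.content,
    }
```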
async def process_instance(
instance: pd.Series,
metadata: EvalMetadata,
reset_logger: bool = True,
):
# Create the agent
agent = Agent.get_cls(metadata.agent_class)(llm=LLM(config=metadata.llm_config))
env_id = instance.id
env_id = instance.instance_id
config = get_config(metadata, env_id)
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
if reset_logger:
# Set up logger
log_file = os.path.join(
metadata.eval_output_dir, 'logs', f'instance_{env_id}.log'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
# add back the console handler to print ONE line
logger.addHandler(get_console_handler())
logger.info(
f'Starting evaluation for instance {env_id}.\nHint: run "tail -f {log_file}" to see live logs in a separate shell'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
file_handler = logging.FileHandler(log_file)
file_handler.setFormatter(
logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
)
logger.addHandler(file_handler)
log_dir = os.path.join(metadata.eval_output_dir, 'infer_logs')
reset_logger_for_multiprocessing(logger, env_id, log_dir)
else:
logger.info(f'Starting evaluation for instance {env_id}.')
# Here's how you can run the agent (similar to the `main` function) and get the final task state
runtime_tools_config = {
RuntimeTool.BROWSER: {
'browsergym_eval': env_id,
'browsergym_eval_save_dir': metadata.eval_output_dir,
}
}
runtime = await create_runtime(config, sid=env_id)
task_str = await initialize_runtime(runtime)
config.max_iterations = metadata.max_iterations
state: State | None = asyncio.run(
run_controller(
config=config,
task_str='PLACEHOLDER_GOAL',
runtime_tools_config=runtime_tools_config,
agent=agent,
sandbox=get_sandbox(),
sid=env_id,
)
state: State | None = await run_controller(
config=config,
task_str=task_str,
runtime=runtime,
)
# ======= Attempt to evaluate the agent's environment impact =======
@ -107,18 +161,17 @@ def process_instance(
raise ValueError('State should not be None.')
metrics = state.metrics.get() if state.metrics else None
browsergym_eval_dir = os.path.join(metadata.eval_output_dir, env_id.split('/')[1])
# read goal
with open(
os.path.join(browsergym_eval_dir, 'goal.txt'), 'r', encoding='utf-8'
) as f:
instruction = f.read()
# read reward
with open(
os.path.join(browsergym_eval_dir, 'rewards.json'), 'r', encoding='utf-8'
) as f:
rewards = json.load(f)
reward = max(rewards)
# Instruction is the first message from the USER
instruction = ''
for event in state.history.get_events():
if isinstance(event, MessageAction):
instruction = event.content
break
return_val = await complete_runtime(runtime)
logger.info(f'Return value from complete_runtime: {return_val}')
reward = max(return_val['rewards'])
# history is now available as a stream of events, rather than list of pairs of (Action, Observation)
# for compatibility with the existing output format, we can remake the pairs here
@ -126,39 +179,38 @@ def process_instance(
histories = state.history.compatibility_for_eval_history_pairs()
# Save the output
output = {
'instance_id': env_id,
'instruction': instruction,
'metadata': metadata.model_dump(),
'history': histories,
'metrics': metrics,
'error': state.last_error if state and state.last_error else None,
'test_result': reward,
}
output = EvalOutput(
instance_id=env_id,
instruction=instruction,
metadata=metadata,
history=histories,
metrics=metrics,
error=state.last_error if state and state.last_error else None,
test_result={
'reward': reward,
},
)
return output
if __name__ == '__main__':
args = parse_arguments()
env_ids = [
id for id in gym.envs.registry.keys() if id.startswith('browsergym/webarena')
]
dataset = pd.DataFrame(
{
'id': [
'instance_id': [
id
for id in gym.envs.registry.keys()
if id.startswith('browsergym/miniwob')
if id.startswith('browsergym/webarena')
]
}
)
id_column = 'id'
llm_config = get_llm_config_arg(args.llm_config) if args.llm_config else config.llm
logger.info(f'Config for evaluation: {config}')
llm_config = None
if args.llm_config:
llm_config = get_llm_config_arg(args.llm_config)
if llm_config is None:
raise ValueError(f'Could not find LLM config: --llm_config {args.llm_config}')
metadata = make_metadata(
llm_config,
@ -169,13 +221,14 @@ if __name__ == '__main__':
args.eval_output_dir,
)
output_file = os.path.join(metadata.eval_output_dir, 'output.jsonl')
instances = prepare_dataset(dataset, output_file, args.eval_n_limit, id_column)
_ = get_sandbox() # Initialize the sandbox
run_evaluation(
instances,
metadata,
output_file,
args.eval_num_workers,
process_instance,
id_column,
instances = prepare_dataset(dataset, output_file, args.eval_n_limit)
asyncio.run(
run_evaluation(
instances,
metadata,
output_file,
args.eval_num_workers,
process_instance,
)
)

View File

@ -2,6 +2,7 @@ import base64
import pickle
from dataclasses import dataclass, field
from enum import Enum
from typing import Any
from opendevin.controller.state.task import RootTask
from opendevin.core.logger import opendevin_logger as logger
@ -106,6 +107,9 @@ class State:
start_id: int = -1
end_id: int = -1
almost_stuck: int = 0
# NOTE: This will never be used by the controller, but it can be used by different
# evaluation tasks to store extra data needed to track the progress/state of the task.
extra_data: dict[str, Any] = field(default_factory=dict)
def save_to_session(self, sid: str, file_store: FileStore):
pickled = pickle.dumps(self)
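A short, hedged sketch of how an evaluation task might use the new `extra_data` field; the key names are illustrative:

```python
# Sketch: evaluation code stashing progress in State.extra_data.
# The controller never reads this field, so keys are free-form.
def record_eval_progress(state: State, attempt: int, partial_score: float) -> None:
    state.extra_data['attempt'] = attempt
    state.extra_data['partial_score'] = partial_score
```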

View File

@ -159,9 +159,12 @@ class SandboxConfig(metaclass=Singleton):
It can contain any valid shell commands (e.g., pip install numpy).
The path to the interpreter is available as $OD_INTERPRETER_PATH,
which can be used to install dependencies for the OD-specific Python interpreter.
od_runtime_startup_env_vars: The environment variables to set at the launch of the runtime.
This is a dictionary of key-value pairs.
This is useful for setting environment variables that the runtime needs at startup,
for example the base URL of the website used for browsergym evaluation.
browsergym_eval_env: The BrowserGym environment to use for evaluation.
Default is None for general purpose browsing. Check evaluation/miniwob and evaluation/webarena for examples.
"""
box_type: str = 'ssh'
@ -179,6 +182,7 @@ class SandboxConfig(metaclass=Singleton):
initialize_plugins: bool = True
update_source_code: bool = False
od_runtime_extra_deps: str | None = None
od_runtime_startup_env_vars: dict[str, str] = field(default_factory=dict)
browsergym_eval_env: str | None = None
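As a hedged illustration of these two options, an evaluation script could configure them roughly as the webarena harness above does; the URLs and task id here are placeholders:

```python
# Sketch: SandboxConfig options used by browser-based evaluations.
sandbox = SandboxConfig(
    container_image='ubuntu:22.04',
    browsergym_eval_env='browsergym/webarena.0',  # illustrative task id
    od_runtime_startup_env_vars={
        'BASE_URL': 'http://example-webarena-host',  # placeholder
        'SHOPPING': 'http://example-webarena-host:7770/',
    },
)
```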
def defaults_to_dict(self) -> dict:
@ -243,10 +247,11 @@ class AppConfig(metaclass=Singleton):
runtime: str = 'server'
file_store: str = 'memory'
file_store_path: str = '/tmp/file_store'
# TODO: clean up workspace path after the removal of ServerRuntime
workspace_base: str = os.path.join(os.getcwd(), 'workspace')
workspace_mount_path: str = (
workspace_mount_path: str | None = (
UndefinedString.UNDEFINED # this path should always be set when config is fully loaded
)
) # when set to None, do not mount the workspace
workspace_mount_path_in_sandbox: str = '/workspace'
workspace_mount_rewrite: str | None = None
cache_dir: str = '/tmp/cache'
@ -550,7 +555,7 @@ def finalize_config(cfg: AppConfig):
cfg.workspace_base = os.path.abspath(cfg.workspace_base)
# In local there is no sandbox, the workspace will have the same pwd as the host
if cfg.sandbox.box_type == 'local':
if cfg.sandbox.box_type == 'local' and cfg.workspace_mount_path is not None:
cfg.workspace_mount_path_in_sandbox = cfg.workspace_mount_path
if cfg.workspace_mount_rewrite: # and not config.workspace_mount_path:

View File

@ -1,6 +1,7 @@
import asyncio
import os
import sys
import uuid
from typing import Callable, Type
import agenthub # noqa F401 (we import this to get the agents registered)
@ -21,7 +22,7 @@ from opendevin.events.event import Event
from opendevin.events.observation import AgentStateChangedObservation
from opendevin.llm.llm import LLM
from opendevin.runtime import get_runtime_cls
from opendevin.runtime.sandbox import Sandbox
from opendevin.runtime.runtime import Runtime
from opendevin.runtime.server.runtime import ServerRuntime
from opendevin.storage import get_file_store
@ -37,86 +38,39 @@ def read_task_from_stdin() -> str:
return sys.stdin.read()
async def run_controller(
async def create_runtime(
config: AppConfig,
task_str: str,
exit_on_message: bool = False,
fake_user_response_fn: Callable[[State | None], str] | None = None,
sandbox: Sandbox | None = None,
agent: Agent | None = None,
runtime_tools_config: dict | None = None,
sid: str | None = None,
headless_mode: bool = True,
) -> State | None:
"""Main coroutine to run the agent controller with task input flexibility.
It's only used when you launch opendevin backend directly via cmdline.
runtime_tools_config: dict | None = None,
) -> Runtime:
"""Create a runtime for the agent to run on.
Args:
config: The app config.
task_str: The task to run.
exit_on_message: quit if agent asks for a message from user (optional)
fake_user_response_fn: An optional function that receives the current state (could be None) and returns a fake user response.
sandbox: (will be deprecated) An optional sandbox to run the agent in.
agent: An optional agent to run.
runtime_tools_config: (will be deprecated) The runtime tools config.
sid: The session id.
headless_mode: Whether the agent is run in headless mode.
config: The app config.
sid: The session id.
runtime_tools_config: (will be deprecated) The runtime tools config.
"""
# Create the agent
if agent is None:
agent_cls: Type[Agent] = Agent.get_cls(config.default_agent)
agent = agent_cls(
llm=LLM(config=config.get_llm_config_from_agent(config.default_agent))
)
# Logging
logger.info(
f'Running agent {agent.name}, model {agent.llm.config.model}, with task: "{task_str}"'
)
# set up the event stream
file_store = get_file_store(config.file_store, config.file_store_path)
cli_session = 'main' + ('_' + sid if sid else '')
event_stream = EventStream(cli_session, file_store)
session_id = 'main' + ('_' + sid if sid else str(uuid.uuid4()))
event_stream = EventStream(session_id, file_store)
# restore cli session if enabled
initial_state = None
if config.enable_cli_session:
try:
logger.info('Restoring agent state from cli session')
initial_state = State.restore_from_session(cli_session, file_store)
except Exception as e:
print('Error restoring state', e)
# init controller with this initial state
controller = AgentController(
agent=agent,
max_iterations=config.max_iterations,
max_budget_per_task=config.max_budget_per_task,
agent_to_llm_config=config.get_agent_to_llm_config_map(),
event_stream=event_stream,
initial_state=initial_state,
headless_mode=headless_mode,
)
# agent class
agent_cls = agenthub.Agent.get_cls(config.default_agent)
# runtime and tools
runtime_cls = get_runtime_cls(config.runtime)
extra_kwargs = {}
if isinstance(runtime_cls, ServerRuntime):
extra_kwargs['sandbox'] = sandbox
# TODO: deprecate this and accept runtime as a parameter instead
logger.info(f'Initializing runtime: {runtime_cls}')
runtime = runtime_cls(
runtime: Runtime = runtime_cls(
config=config,
event_stream=event_stream,
plugins=controller.agent.sandbox_plugins,
**extra_kwargs,
sid=session_id,
plugins=agent_cls.sandbox_plugins,
)
await runtime.ainit()
if isinstance(runtime, ServerRuntime):
runtime.init_runtime_tools(
controller.agent.runtime_tools,
agent_cls.runtime_tools,
runtime_tools_config=runtime_tools_config,
)
# browser eval specific
@ -130,7 +84,68 @@ async def run_controller(
) as f:
task_str = f.read()
logger.info(f'Dynamic Eval task: {task_str}')
# TODO: Implement this for EventStream Runtime
return runtime
async def run_controller(
config: AppConfig,
task_str: str,
runtime: Runtime | None = None,
agent: Agent | None = None,
exit_on_message: bool = False,
fake_user_response_fn: Callable[[State | None], str] | None = None,
headless_mode: bool = True,
) -> State | None:
"""Main coroutine to run the agent controller with task input flexibility.
It's only used when you launch the opendevin backend directly from the command line.
Args:
config: The app config.
task_str: The task for the agent to run, as a string.
runtime: (optional) A runtime for the agent to run on.
agent: (optional) An agent to run.
exit_on_message: quit if the agent asks for a message from the user (optional)
fake_user_response_fn: An optional function that receives the current state (could be None) and returns a fake user response.
headless_mode: Whether the agent is run in headless mode.
"""
# Create the agent
if agent is None:
agent_cls: Type[Agent] = Agent.get_cls(config.default_agent)
agent = agent_cls(
llm=LLM(config=config.get_llm_config_from_agent(config.default_agent))
)
if runtime is None:
runtime = await create_runtime(config)
event_stream = runtime.event_stream
# restore cli session if enabled
initial_state = None
if config.enable_cli_session:
try:
logger.info('Restoring agent state from cli session')
initial_state = State.restore_from_session(
event_stream.sid, event_stream.file_store
)
except Exception as e:
logger.info(f'Error restoring state: {e}')
# init controller with this initial state
controller = AgentController(
agent=agent,
max_iterations=config.max_iterations,
max_budget_per_task=config.max_budget_per_task,
agent_to_llm_config=config.get_agent_to_llm_config_map(),
event_stream=event_stream,
initial_state=initial_state,
headless_mode=headless_mode,
)
assert isinstance(task_str, str), f'task_str must be a string, got {type(task_str)}'
# Logging
logger.info(
f'Agent Controller Initialized: Running agent {agent.name}, model {agent.llm.config.model}, with task: "{task_str}"'
)
# start event is a MessageAction with the task, either resumed or new
if config.enable_cli_session and initial_state is not None:
@ -170,12 +185,13 @@ async def run_controller(
# save session when we're about to close
if config.enable_cli_session:
end_state = controller.get_state()
end_state.save_to_session(cli_session, file_store)
end_state.save_to_session(event_stream.sid, event_stream.file_store)
# close when done
await controller.close()
await runtime.close()
return controller.get_state()
state = controller.get_state()
return state
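Taken together, the refactor means an evaluation script can wire these two entry points as in this hedged sketch; the task string and session id are placeholders:

```python
# Sketch: minimal headless flow using create_runtime + run_controller.
import asyncio

from opendevin.core.config import AppConfig
from opendevin.core.main import create_runtime, run_controller


async def run_one_task(config: AppConfig, task_str: str, sid: str):
    runtime = await create_runtime(config, sid=sid)
    # run_controller builds a default agent from config and closes the runtime when done.
    return await run_controller(config=config, task_str=task_str, runtime=runtime)


# e.g. state = asyncio.run(run_one_task(AppConfig(), 'write a hello-world script', 'demo'))
```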
if __name__ == '__main__':

View File

@ -22,16 +22,16 @@ class EventStreamSubscriber(str, Enum):
class EventStream:
sid: str
file_store: FileStore
# For each subscriber ID, there is a stack of callback functions - useful
# when there are agent delegates
_subscribers: dict[str, list[Callable]]
_cur_id: int
_lock: threading.Lock
_file_store: FileStore
def __init__(self, sid: str, file_store: FileStore):
self.sid = sid
self._file_store = file_store
self.file_store = file_store
self._subscribers = {}
self._cur_id = 0
self._lock = threading.Lock()
@ -39,7 +39,7 @@ class EventStream:
def _reinitialize_from_file_store(self) -> None:
try:
events = self._file_store.list(f'sessions/{self.sid}/events')
events = self.file_store.list(f'sessions/{self.sid}/events')
except FileNotFoundError:
logger.debug(f'No events found for session {self.sid}')
self._cur_id = 0
@ -100,7 +100,7 @@ class EventStream:
def get_event(self, id: int) -> Event:
filename = self._get_filename_for_id(id)
content = self._file_store.read(filename)
content = self.file_store.read(filename)
data = json.loads(content)
return event_from_dict(data)
@ -136,9 +136,7 @@ class EventStream:
event._source = source # type: ignore [attr-defined]
data = event_to_dict(event)
if event.id is not None:
self._file_store.write(
self._get_filename_for_id(event.id), json.dumps(data)
)
self.file_store.write(self._get_filename_for_id(event.id), json.dumps(data))
for stack in self._subscribers.values():
callback = stack[-1]
asyncio.create_task(callback(event))
@ -149,7 +147,7 @@ class EventStream:
yield event
def clear(self):
self._file_store.delete(f'sessions/{self.sid}')
self.file_store.delete(f'sessions/{self.sid}')
self._cur_id = 0
# self._subscribers = {}
self._reinitialize_from_file_store()

View File

@ -187,23 +187,37 @@ class RuntimeClient:
keep_prompt: bool = True,
) -> tuple[str, int]:
logger.debug(f'Executing command: {command}')
self.shell.sendline(command)
self.shell.expect(self.__bash_expect_regex, timeout=timeout)
try:
self.shell.sendline(command)
self.shell.expect(self.__bash_expect_regex, timeout=timeout)
output = self.shell.before
output = self.shell.before
bash_prompt = self._get_bash_prompt_and_update_pwd()
if keep_prompt:
output += '\r\n' + bash_prompt
logger.debug(f'Command output: {output}')
# Get exit code
self.shell.sendline('echo $?')
logger.debug(f'Executing command for exit code: {command}')
self.shell.expect(self.__bash_expect_regex, timeout=timeout)
_exit_code_output = self.shell.before
logger.debug(f'Exit code Output: {_exit_code_output}')
exit_code = int(_exit_code_output.strip().split()[0])
except pexpect.TIMEOUT as e:
self.shell.sendintr() # send SIGINT to the shell
self.shell.expect(self.__bash_expect_regex, timeout=timeout)
output = self.shell.before
output += (
'\r\n\r\n'
+ f'[Command timed out after {timeout} seconds. SIGINT was sent to interrupt it.]'
)
exit_code = 130 # SIGINT
logger.error(f'Failed to execute command: {command}. Error: {e}')
finally:
bash_prompt = self._get_bash_prompt_and_update_pwd()
if keep_prompt:
output += '\r\n' + bash_prompt
logger.debug(f'Command output: {output}')
# Get exit code
self.shell.sendline('echo $?')
logger.debug(f'Executing command for exit code: {command}')
self.shell.expect(self.__bash_expect_regex, timeout=timeout)
_exit_code_output = self.shell.before
logger.debug(f'Exit code Output: {_exit_code_output}')
exit_code = int(_exit_code_output.strip().split()[0])
return output, exit_code
async def run_action(self, action) -> Observation:

View File

@ -230,7 +230,7 @@ class EventStreamRuntime(Runtime):
async def copy_to(
self, host_src: str, sandbox_dest: str, recursive: bool = False
) -> dict[str, Any]:
) -> None:
if not os.path.exists(host_src):
raise FileNotFoundError(f'Source file {host_src} does not exist')
@ -264,7 +264,7 @@ class EventStreamRuntime(Runtime):
f'{self.api_url}/upload_file', data=upload_data, params=params
) as response:
if response.status == 200:
return await response.json()
return
else:
error_message = await response.text()
raise Exception(f'Copy operation failed: {error_message}')
@ -276,6 +276,7 @@ class EventStreamRuntime(Runtime):
finally:
if recursive:
os.unlink(temp_zip_path)
logger.info(f'Copy completed: host:{host_src} -> runtime:{sandbox_dest}')
async def run_action(self, action: Action) -> Observation:
# set timeout to default if not set

View File

@ -117,7 +117,7 @@ class DockerSSHBox(Sandbox):
self,
config: SandboxConfig,
persist_sandbox: bool,
workspace_mount_path: str,
workspace_mount_path: str | None,
sandbox_workspace_dir: str,
cache_dir: str,
run_as_devin: bool,
@ -554,9 +554,7 @@ class DockerSSHBox(Sandbox):
@property
def volumes(self):
mount_dir = self.workspace_mount_path
return {
mount_dir: {'bind': self.sandbox_workspace_dir, 'mode': 'rw'},
mount_volumes = {
# mount cache directory to /home/opendevin/.cache for pip cache reuse
self.cache_dir: {
'bind': (
@ -565,6 +563,12 @@ class DockerSSHBox(Sandbox):
'mode': 'rw',
},
}
if self.workspace_mount_path is not None:
mount_volumes[self.workspace_mount_path] = {
'bind': self.sandbox_workspace_dir,
'mode': 'rw',
}
return mount_volumes
def restart_docker_container(self):
try:

View File

@ -30,7 +30,6 @@ from opendevin.events.observation import (
from opendevin.events.serialization.action import ACTION_TYPE_TO_CLASS
from opendevin.runtime.plugins import JupyterRequirement, PluginRequirement
from opendevin.runtime.tools import RuntimeTool
from opendevin.storage import FileStore
def _default_env_vars(sandbox_config: SandboxConfig) -> dict[str, str]:
@ -52,7 +51,6 @@ class Runtime:
"""
sid: str
file_store: FileStore
DEFAULT_ENV_VARS: dict[str, str]
def __init__(

View File

@ -28,7 +28,6 @@ from opendevin.runtime.browser.browser_env import BrowserEnv
from opendevin.runtime.plugins import JupyterRequirement, PluginRequirement
from opendevin.runtime.runtime import Runtime
from opendevin.runtime.tools import RuntimeTool
from opendevin.storage.local import LocalFileStore
from ..browser import browse
from .files import read_file, write_file
@ -44,7 +43,6 @@ class ServerRuntime(Runtime):
sandbox: Sandbox | None = None,
):
super().__init__(config, event_stream, sid, plugins)
self.file_store = LocalFileStore(config.workspace_base)
if sandbox is None:
self.sandbox = self.create_sandbox(sid, config.sandbox.box_type)
self._is_external_sandbox = False
@ -187,7 +185,6 @@ class ServerRuntime(Runtime):
return IPythonRunCellObservation(content=output, code=action.code)
async def read(self, action: FileReadAction) -> Observation:
# TODO: use self.file_store
working_dir = self.sandbox.get_working_directory()
return await read_file(
action.path,
@ -199,7 +196,6 @@ class ServerRuntime(Runtime):
)
async def write(self, action: FileWriteAction) -> Observation:
# TODO: use self.file_store
working_dir = self.sandbox.get_working_directory()
return await write_file(
action.path,

View File

@ -14,10 +14,16 @@ FROM {{ base_image }}
# Install necessary packages and clean up in one layer
RUN apt-get update && \
apt-get install -y wget sudo apt-utils {{ LIBGL_MESA }} libasound2-plugins python3 git && \
apt-get clean \
&& ln -s /usr/bin/python3 /usr/bin/python \
&& rm -rf /var/lib/apt/lists/*
apt-get install -y wget sudo apt-utils {{ LIBGL_MESA }} libasound2-plugins git && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
# Install Python if not already installed
RUN if [ ! -e /usr/bin/python ]; then \
apt-get update && \
apt-get install -y python3 && \
ln -s /usr/bin/python3 /usr/bin/python; \
fi
# Create necessary directories
RUN mkdir -p /opendevin && \

poetry.lock
View File

@ -1883,6 +1883,36 @@ test-downstream = ["aiobotocore (>=2.5.4,<3.0.0)", "dask-expr", "dask[dataframe,
test-full = ["adlfs", "aiohttp (!=4.0.0a0,!=4.0.0a1)", "cloudpickle", "dask", "distributed", "dropbox", "dropboxdrivefs", "fastparquet", "fusepy", "gcsfs", "jinja2", "kerchunk", "libarchive-c", "lz4", "notebook", "numpy", "ocifs", "pandas", "panel", "paramiko", "pyarrow", "pyarrow (>=1)", "pyftpdlib", "pygit2", "pytest", "pytest-asyncio (!=0.22.0)", "pytest-benchmark", "pytest-cov", "pytest-mock", "pytest-recording", "pytest-rerunfailures", "python-snappy", "requests", "smbprotocol", "tqdm", "urllib3", "zarr", "zstandard"]
tqdm = ["tqdm"]
[[package]]
name = "func-timeout"
version = "4.3.5"
description = "Python module which allows you to specify timeouts when calling any existing function. Also provides support for stoppable-threads"
optional = false
python-versions = "*"
files = [
{file = "func_timeout-4.3.5.tar.gz", hash = "sha256:74cd3c428ec94f4edfba81f9b2f14904846d5ffccc27c92433b8b5939b5575dd"},
]
[[package]]
name = "gdown"
version = "5.2.0"
description = "Google Drive Public File/Folder Downloader"
optional = false
python-versions = ">=3.8"
files = [
{file = "gdown-5.2.0-py3-none-any.whl", hash = "sha256:33083832d82b1101bdd0e9df3edd0fbc0e1c5f14c9d8c38d2a35bf1683b526d6"},
{file = "gdown-5.2.0.tar.gz", hash = "sha256:2145165062d85520a3cd98b356c9ed522c5e7984d408535409fd46f94defc787"},
]
[package.dependencies]
beautifulsoup4 = "*"
filelock = "*"
requests = {version = "*", extras = ["socks"]}
tqdm = "*"
[package.extras]
test = ["build", "mypy", "pytest", "pytest-xdist", "ruff", "twine", "types-requests", "types-setuptools"]
[[package]]
name = "gevent"
version = "24.2.1"
@ -6202,6 +6232,18 @@ files = [
{file = "pyreadline3-3.4.1.tar.gz", hash = "sha256:6f3d1f7b8a31ba32b73917cefc1f28cc660562f39aea8646d30bd6eff21f7bae"},
]
[[package]]
name = "pysocks"
version = "1.7.1"
description = "A Python SOCKS client module. See https://github.com/Anorov/PySocks for more information."
optional = false
python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*"
files = [
{file = "PySocks-1.7.1-py27-none-any.whl", hash = "sha256:08e69f092cc6dbe92a0fdd16eeb9b9ffbc13cadfe5ca4c7bd92ffb078b293299"},
{file = "PySocks-1.7.1-py3-none-any.whl", hash = "sha256:2725bd0a9925919b9b51739eea5f9e2bae91e83288108a9ad338b2e3a4435ee5"},
{file = "PySocks-1.7.1.tar.gz", hash = "sha256:3f8804571ebe159c380ac6de37643bb4685970655d3bba243530d6558b799aa0"},
]
[[package]]
name = "pytest"
version = "8.3.2"
@ -6705,6 +6747,7 @@ files = [
certifi = ">=2017.4.17"
charset-normalizer = ">=2,<4"
idna = ">=2.5,<4"
PySocks = {version = ">=1.5.6,<1.5.7 || >1.5.7", optional = true, markers = "extra == \"socks\""}
urllib3 = ">=1.21.1,<3"
[package.extras]
@ -9106,4 +9149,4 @@ testing = ["coverage (>=5.0.3)", "zope.event", "zope.testing"]
[metadata]
lock-version = "2.0"
python-versions = "^3.11"
content-hash = "fa6657610c1f5cf14107f7c615e3e6c9d14367defbcb7582b2e72fe77bc939c9"
content-hash = "22b31a3f8d5b241552a798639668e43aabafff7f9611888f1d4239ea07d71a75"

View File

@ -118,3 +118,6 @@ whatthepatch = "*"
retry = "*"
evaluate = "*"
swebench = { git = "https://github.com/OpenDevin/SWE-bench.git" }
func_timeout = "*"
sympy = "*"
gdown = "*"

View File

@ -32,7 +32,7 @@ def test_stream_storage(temp_dir: str):
event_stream = EventStream('abc', file_store)
event_stream.add_event(NullObservation(''), EventSource.AGENT)
assert len(collect_events(event_stream)) == 1
content = event_stream._file_store.read('sessions/abc/events/0.json')
content = event_stream.file_store.read('sessions/abc/events/0.json')
assert content is not None
data = json.loads(content)
assert 'timestamp' in data