Xingyao Wang 31b244f95e
[Refactor, Evaluation] Refactor and clean up evaluation harness to remove global config and use EventStreamRuntime (#3230)
* move multi-line bash tests to test_runtime;
support multi-line bash for esruntime;

* add testcase to handle PS2 prompt

* use bashlex for bash parsing to handle multi-line commands;
add testcases for multi-line commands

* revert ghcr runtime change

* Apply stash

* fix run as other user;
make test async;

* fix test runtime for run as od

* add run-as-devin to all the runtime tests

* handle the case when username is root

* move all run-as-devin tests from sandbox;
only tests a few cases on different user to save time;

* move over multi-line echo related tests to test_runtime

* fix user-specific jupyter by fixing the pypoetry virtualenv folder

* make plugin's init async;
chdir at initialization of jupyter plugin;
move ipy simple testcase to test runtime;

* support agentskills import in
move tests for jupyter pwd tests;
overload `add_env_vars` for EventStreamRuntime to update env var also in Jupyter;
make agentskills read env var lazily, in case env var is updated;

* fix ServerRuntime agentskills issue

* move agnostic image test to test_runtime

* merge runtime tests in CI

* fix enable auto lint as env var

* update warning message

* update warning message

* test for different container images

* change parsing output as debug

* add exception handling for update_pwd_decorator

* fix unit test indentation

* add plugins as default input to Runtime class;
remove init_sandbox_plugins;
implement add_env_var (include jupyter) in the base class;

* fix server runtime auto lint

* Revert "add exception handling for update_pwd_decorator"

This reverts commit 2b668b1506e02145cb8f87e321aad62febca3d50.

* tries to print debugging info for agentskills

* explictly setting uid (try fix permission issue)

* Revert "tries to print debugging info for agentskills"

This reverts commit 8be4c86756f0e3fc62957b327ba2ac4999c419de.

* set sandbox user id during testing to hopefully fix the permission issue

* add browser tools for server runtime

* try to debug for old pwd

* update debug cmd

* only test agnostic runtime when TEST_RUNTIME is Server

* fix temp dir mkdir

* load TEST_RUNTIME at the beginning

* remove ipython tests

* only log to file when DEBUG

* default logging to project root

* temporarily remove log to file

* fix LLM logger dir

* fix logger

* make set pwd an optional aux action

* fix prev pwd

* fix infinity recursion

* simplify

* do not import the whole od library to avoid logger folder by jupyter

* fix browsing

* increase timeout

* attempt to fix agentskills yet again

* clean up in testcases, since CI maybe run as non-root

* add _cause attribute for event.id

* remove parent

* add a bunch of debugging statement again for CI :(

* fix temp_dir fixture

* change all temp dir to follow pytest's tmp_path_factory

* remove extra bracket

* clean up error printing a bit

* jupyter chdir to self.config.workspace_mount_path_in_sandbox on initialization

* jupyter chdir to self.config.workspace_mount_path_in_sandbox on initialization

* add typing for tmp dir fixture

* clear the directory before running the test to avoid weird CI temp dir

* remove agnostic test case for server runtime

* Revert "remove agnostic test case for server runtime"

This reverts commit 30e2181c3fc1410e69596c2dcd06be01f1d016b3.

* disable agnostic tests in CI

* fix test

* make sure plugin arg is not passed when no plugin is specified;
remove redundant on_event function;

* move mock prompt

* rename runtime

* remove extra logging

* refactor run_controller's interface;
support multiple runtime for integration test;
filter out hostname for prompt

* uncomment other tests

* pass the right runtime to controller

* log runtime when start

* uncomment tests

* improve symbol filters

* add intergration test prompts that seemd ok

* add integration test workflow

* add python3 to default ubuntu image

* symlink python and fix permission to jupyter pip

* add retry for jupyter execute server

* fix jupyter pip install;
add post-process for jupyter pip install;
simplify init by add agent_skills path to PYTHONPATH;
add testcase to tests jupyter pip install;

* fix bug

* use ubuntu:22.04 for eventstream integration tests

* add todo

* update testcase

* remove redundant code

* fix unit test

* reduce dependency for runtime

* try making llama-index an optional dependency that's not installed by default

* remove pip install since it seemd not needed

* log ipython execution;
await write message since it returns a future

* update ipy testcase

* do not install llama-index in CI

* do not install llama-index in the app docker as well

* set sandbox container image in the integration test script

* log plugins & env var for runtime

* update conftest for sha256

* add git

* remove all non-alphanumeric chalracters

* add working ipy module tests!

* default to use host network

* remove is_async from browser to make thing a little more reliable;
retry loading browser when error;

* add sleep to wait a bit for http server

* kill http server before regenerate browsing tests

* fix browsing

* only set sandbox container image if undefined

* skip empty config value

* update evaluation to use the latest run_controller

* revert logger in execute_server to be compatible with server runtime

* revert logging level to fix jupyter

* set logger level

* revert the logging

* chmod for workspace to fix permission

* support getting timeout from action

* update test for server runtime

* try to fix file permission

* fix test_cmd_run_action_serialization_deserialization test (added timeout)

* poetry: pip 24.2, torch 2.2.2

* revert adding pip to pyproject.toml

* add build to dependencies in pyproject.toml

* forgot poetry lock --no-update

* fix a DelegatorAgent prompt_002.log (timeout)

* fix a DelegatorAgent prompt_003.log (timeout)

* couple more timeout attribs in prompt files

* some more prompt files

* prompts galore

* add clarification comment for timeout

* default timeout to config

* add assert

* update integraton tests for eventstream

* update integration tests

* fix timeout for action<->dict

* remove redundant on_event

* default to use instance image

* update run_controller interface

* add logging for copy

* refactor swe_bench for the new design

* fix action execution timeout

* updatelock

* remove build sandbox locally

* fix runtime

* use plain for-loop for single process

* remove extra print

* get swebench inference working

* print whole `test_result` dict

* got swebench patch post-process working

* update swe-bench evaluation readme

* refactor using shared reset_logger function

* move messy swebench prompt to a different file

* support the ability to specify whether to keep prompt

* support the ability to specify whether to keep prompt

* fix dockerfile

* fix import and remove unnecessary strip logic

* fix action serialization

* get agentbench running

* remove extra ls for agent bench

* fix agentbench metric

* factor out common documentation for eval

* update biocoder doc

* remove swe_env_box since it is no longer needed

* get biocoder working

* add func timeout for bird

* fix jupyter pwd with ~ as user name

* fix jupyter pwd with ~ as user name

* get bird working

* get browsing evaluation working

* make eda runnable

* fix id column

* fix eda run_infer

* unify eval output using a structured format;
make swebench coompatible with that format;
update client source code for every swebench run;
do not inject testcmd for swebench

* standardize existing benchs for the new eval output

* set update source code = true

* get gaia standardized

* fix gaia

* gorilla refactored but stuck at language.so to test

* refactor and make gpqa work

* refactor humanevalfix and get it working

* refactor logic reasoning and get it working

* refactor browser env so it works with eventstream runtime for eval

* add initial version of miniwob refactor

* fix browsergym environment

* get miniwob working!!

* allowing injecting additional dependency to OD runtime docker image

* allowing injecting additional dependency to OD runtime docker image

* support logic reasoning with pre-injected dependency

* get mint working

* update runtime build

* fix mint docker

* add test for keep_prompt;
add missing await close for some tests

* update integration tests for eventstream runtime

* fix integration tests for server runtime

* refactor ml bench and toolqa

* refactor webarena

* fix default factory

* Update run_infer.py

* add APIError to retry

* increase timeout for swebench

* make sure to hide api key when dump eval output

* update the behavior of put source code to put files instead of tarball

* add dishash to dependency

* sendintr when timeout

* fix dockerfile copy

* reduce timeout

* use dirhash to avoid repeat building for update source

* fix runtime_build testcase

* add dir_hash to docker build pipeline

* revert api error

* update poetry lock

* add retries for swebench run infer

* fix git patch

* update poetry lock

* adjust config order

* fix mount volumns

* enforce all eval to use "instance_id"

* remove file store from runtime

* make file_store public inside eventstream

* move the runtime logic inside `main` out

* support using async function for process_instance_fn

* refactor run_infer with the create_time

* fix file store

* Update evaluation/toolqa/utils.py

Co-authored-by: Graham Neubig <neubig@gmail.com>

* fix typo

---------

Co-authored-by: tobitege <tobitege@gmx.de>
Co-authored-by: super-dainiu <78588128+super-dainiu@users.noreply.github.com>
Co-authored-by: Graham Neubig <neubig@gmail.com>
2024-08-06 17:21:45 +00:00

154 lines
5.2 KiB
Python

import asyncio
import threading
from datetime import datetime
from enum import Enum
from typing import Callable, Iterable
from opendevin.core.logger import opendevin_logger as logger
from opendevin.core.utils import json
from opendevin.events.serialization.event import event_from_dict, event_to_dict
from opendevin.storage import FileStore
from .event import Event, EventSource
class EventStreamSubscriber(str, Enum):
AGENT_CONTROLLER = 'agent_controller'
SERVER = 'server'
RUNTIME = 'runtime'
MAIN = 'main'
TEST = 'test'
class EventStream:
sid: str
file_store: FileStore
# For each subscriber ID, there is a stack of callback functions - useful
# when there are agent delegates
_subscribers: dict[str, list[Callable]]
_cur_id: int
_lock: threading.Lock
def __init__(self, sid: str, file_store: FileStore):
self.sid = sid
self.file_store = file_store
self._subscribers = {}
self._cur_id = 0
self._lock = threading.Lock()
self._reinitialize_from_file_store()
def _reinitialize_from_file_store(self) -> None:
try:
events = self.file_store.list(f'sessions/{self.sid}/events')
except FileNotFoundError:
logger.debug(f'No events found for session {self.sid}')
self._cur_id = 0
return
# if we have events, we need to find the highest id to prepare for new events
for event_str in events:
id = self._get_id_from_filename(event_str)
if id >= self._cur_id:
self._cur_id = id + 1
def _get_filename_for_id(self, id: int) -> str:
return f'sessions/{self.sid}/events/{id}.json'
@staticmethod
def _get_id_from_filename(filename: str) -> int:
try:
return int(filename.split('/')[-1].split('.')[0])
except ValueError:
logger.warning(f'get id from filename ({filename}) failed.')
return -1
def get_events(
self,
start_id=0,
end_id=None,
reverse=False,
filter_out_type: tuple[type[Event], ...] | None = None,
) -> Iterable[Event]:
if reverse:
if end_id is None:
end_id = self._cur_id - 1
event_id = end_id
while event_id >= start_id:
try:
event = self.get_event(event_id)
if filter_out_type is None or not isinstance(
event, filter_out_type
):
yield event
except FileNotFoundError:
logger.debug(f'No event found for ID {event_id}')
event_id -= 1
else:
event_id = start_id
while True:
if end_id is not None and event_id > end_id:
break
try:
event = self.get_event(event_id)
if filter_out_type is None or not isinstance(
event, filter_out_type
):
yield event
except FileNotFoundError:
break
event_id += 1
def get_event(self, id: int) -> Event:
filename = self._get_filename_for_id(id)
content = self.file_store.read(filename)
data = json.loads(content)
return event_from_dict(data)
def get_latest_event(self) -> Event:
return self.get_event(self._cur_id - 1)
def get_latest_event_id(self) -> int:
return self._cur_id - 1
def subscribe(self, id: EventStreamSubscriber, callback: Callable, append=False):
if id in self._subscribers:
if append:
self._subscribers[id].append(callback)
else:
raise ValueError('Subscriber already exists: ' + id)
else:
self._subscribers[id] = [callback]
def unsubscribe(self, id: EventStreamSubscriber):
if id not in self._subscribers:
logger.warning('Subscriber not found during unsubscribe: ' + id)
else:
self._subscribers[id].pop()
if len(self._subscribers[id]) == 0:
del self._subscribers[id]
def add_event(self, event: Event, source: EventSource):
with self._lock:
event._id = self._cur_id # type: ignore [attr-defined]
self._cur_id += 1
logger.debug(f'Adding {type(event).__name__} id={event.id} from {source.name}')
event._timestamp = datetime.now() # type: ignore [attr-defined]
event._source = source # type: ignore [attr-defined]
data = event_to_dict(event)
if event.id is not None:
self.file_store.write(self._get_filename_for_id(event.id), json.dumps(data))
for stack in self._subscribers.values():
callback = stack[-1]
asyncio.create_task(callback(event))
def filtered_events_by_source(self, source: EventSource):
for event in self.get_events():
if event.source == source:
yield event
def clear(self):
self.file_store.delete(f'sessions/{self.sid}')
self._cur_id = 0
# self._subscribers = {}
self._reinitialize_from_file_store()