[Refactor, Evaluation] Refactor and clean up evaluation harness to remove global config and use EventStreamRuntime (#3230)

* move multi-line bash tests to test_runtime;
support multi-line bash for esruntime;

* add testcase to handle PS2 prompt

* use bashlex for bash parsing to handle multi-line commands;
add testcases for multi-line commands

* revert ghcr runtime change

* Apply stash

* fix run as other user;
make test async;

* fix test runtime for run as od

* add run-as-devin to all the runtime tests

* handle the case when username is root

* move all run-as-devin tests from sandbox;
only test a few cases with a different user to save time;

* move over multi-line echo related tests to test_runtime

* fix user-specific jupyter by fixing the pypoetry virtualenv folder

* make plugin's init async;
chdir at initialization of jupyter plugin;
move ipy simple testcase to test runtime;

* support agentskills import in
move jupyter pwd tests;
overload `add_env_vars` for EventStreamRuntime to update env var also in Jupyter;
make agentskills read env var lazily, in case env var is updated;

* fix ServerRuntime agentskills issue

* move agnostic image test to test_runtime

* merge runtime tests in CI

* fix enable auto lint as env var

* update warning message

* update warning message

* test for different container images

* change parsing output logging to debug

* add exception handling for update_pwd_decorator

* fix unit test indentation

* add plugins as default input to Runtime class;
remove init_sandbox_plugins;
implement add_env_var (include jupyter) in the base class;

* fix server runtime auto lint

* Revert "add exception handling for update_pwd_decorator"

This reverts commit 2b668b1506e02145cb8f87e321aad62febca3d50.

* tries to print debugging info for agentskills

* explicitly setting uid (try to fix permission issue)

* Revert "tries to print debugging info for agentskills"

This reverts commit 8be4c86756f0e3fc62957b327ba2ac4999c419de.

* set sandbox user id during testing to hopefully fix the permission issue

* add browser tools for server runtime

* try to debug for old pwd

* update debug cmd

* only test agnostic runtime when TEST_RUNTIME is Server

* fix temp dir mkdir

* load TEST_RUNTIME at the beginning

* remove ipython tests

* only log to file when DEBUG

* default logging to project root

* temporarily remove log to file

* fix LLM logger dir

* fix logger

* make set pwd an optional aux action

* fix prev pwd

* fix infinite recursion

* simplify

* do not import the whole od library, to avoid jupyter creating the logger folder

* fix browsing

* increase timeout

* attempt to fix agentskills yet again

* clean up in testcases, since CI may run as non-root

* add _cause attribute for event.id

* remove parent

* add a bunch of debugging statements again for CI :(

* fix temp_dir fixture

* change all temp dir to follow pytest's tmp_path_factory

* remove extra bracket

* clean up error printing a bit

* jupyter chdir to self.config.workspace_mount_path_in_sandbox on initialization

* jupyter chdir to self.config.workspace_mount_path_in_sandbox on initialization

* add typing for tmp dir fixture

* clear the directory before running the test to avoid weird CI temp dir

* remove agnostic test case for server runtime

* Revert "remove agnostic test case for server runtime"

This reverts commit 30e2181c3fc1410e69596c2dcd06be01f1d016b3.

* disable agnostic tests in CI

* fix test

* make sure plugin arg is not passed when no plugin is specified;
remove redundant on_event function;

* move mock prompt

* rename runtime

* remove extra logging

* refactor run_controller's interface;
support multiple runtimes for integration tests;
filter out hostname for prompt

* uncomment other tests

* pass the right runtime to controller

* log runtime when start

* uncomment tests

* improve symbol filters

* add integration test prompts that seemed ok

* add integration test workflow

* add python3 to default ubuntu image

* symlink python and fix permission to jupyter pip

* add retry for jupyter execute server

* fix jupyter pip install;
add post-process for jupyter pip install;
simplify init by adding agent_skills path to PYTHONPATH;
add testcase to test jupyter pip install;

* fix bug

* use ubuntu:22.04 for eventstream integration tests

* add todo

* update testcase

* remove redundant code

* fix unit test

* reduce dependency for runtime

* try making llama-index an optional dependency that's not installed by default

* remove pip install since it seemed not needed

* log ipython execution;
await write message since it returns a future

* update ipy testcase

* do not install llama-index in CI

* do not install llama-index in the app docker as well

* set sandbox container image in the integration test script

* log plugins & env var for runtime

* update conftest for sha256

* add git

* remove all non-alphanumeric characters

* add working ipy module tests!

* default to use host network

* remove is_async from browser to make things a little more reliable;
retry loading browser on error;

* add sleep to wait a bit for http server

* kill http server before regenerating browsing tests

* fix browsing

* only set sandbox container image if undefined

* skip empty config value

* update evaluation to use the latest run_controller

* revert logger in execute_server to be compatible with server runtime

* revert logging level to fix jupyter

* set logger level

* revert the logging

* chmod for workspace to fix permission

* support getting timeout from action

* update test for server runtime

* try to fix file permission

* fix test_cmd_run_action_serialization_deserialization test (added timeout)

* poetry: pip 24.2, torch 2.2.2

* revert adding pip to pyproject.toml

* add build to dependencies in pyproject.toml

* forgot poetry lock --no-update

* fix a DelegatorAgent prompt_002.log (timeout)

* fix a DelegatorAgent prompt_003.log (timeout)

* couple more timeout attribs in prompt files

* some more prompt files

* prompts galore

* add clarification comment for timeout

* default timeout to config

* add assert

* update integration tests for eventstream

* update integration tests

* fix timeout for action<->dict

* remove redundant on_event

* default to use instance image

* update run_controller interface

* add logging for copy

* refactor swe_bench for the new design

* fix action execution timeout

* update lock

* remove build sandbox locally

* fix runtime

* use plain for-loop for single process

* remove extra print

* get swebench inference working

* print whole `test_result` dict

* got swebench patch post-process working

* update swe-bench evaluation readme

* refactor using shared reset_logger function

* move messy swebench prompt to a different file

* support the ability to specify whether to keep prompt

* support the ability to specify whether to keep prompt

* fix dockerfile

* fix import and remove unnecessary strip logic

* fix action serialization

* get agentbench running

* remove extra ls for agent bench

* fix agentbench metric

* factor out common documentation for eval

* update biocoder doc

* remove swe_env_box since it is no longer needed

* get biocoder working

* add func timeout for bird

* fix jupyter pwd with ~ as user name

* fix jupyter pwd with ~ as user name

* get bird working

* get browsing evaluation working

* make eda runnable

* fix id column

* fix eda run_infer

* unify eval output using a structured format;
make swebench compatible with that format;
update client source code for every swebench run;
do not inject testcmd for swebench

* standardize existing benchmarks for the new eval output

* set update source code = true

* get gaia standardized

* fix gaia

* gorilla refactored but stuck at language.so to test

* refactor and make gpqa work

* refactor humanevalfix and get it working

* refactor logic reasoning and get it working

* refactor browser env so it works with eventstream runtime for eval

* add initial version of miniwob refactor

* fix browsergym environment

* get miniwob working!!

* allow injecting additional dependencies into the OD runtime docker image

* allow injecting additional dependencies into the OD runtime docker image

* support logic reasoning with pre-injected dependency

* get mint working

* update runtime build

* fix mint docker

* add test for keep_prompt;
add missing await close for some tests

* update integration tests for eventstream runtime

* fix integration tests for server runtime

* refactor ml bench and toolqa

* refactor webarena

* fix default factory

* Update run_infer.py

* add APIError to retry

* increase timeout for swebench

* make sure to hide the api key when dumping eval output

* update the put-source-code behavior to put files instead of a tarball

* add dirhash to dependencies

* sendintr when timeout

* fix dockerfile copy

* reduce timeout

* use dirhash to avoid repeat building for update source

* fix runtime_build testcase

* add dir_hash to docker build pipeline

* revert api error

* update poetry lock

* add retries for swebench run infer

* fix git patch

* update poetry lock

* adjust config order

* fix mount volumes

* enforce all eval to use "instance_id"

* remove file store from runtime

* make file_store public inside eventstream

* move the runtime logic out of `main`

* support using async function for process_instance_fn

* refactor run_infer with the create_time

* fix file store

* Update evaluation/toolqa/utils.py

Co-authored-by: Graham Neubig <neubig@gmail.com>

* fix typo

---------

Co-authored-by: tobitege <tobitege@gmx.de>
Co-authored-by: super-dainiu <78588128+super-dainiu@users.noreply.github.com>
Co-authored-by: Graham Neubig <neubig@gmail.com>
Xingyao Wang 2024-08-07 01:21:45 +08:00 committed by GitHub
parent 9029cd77d3
commit 31b244f95e
78 changed files with 3565 additions and 3784 deletions
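Several of the commits above replace ad-hoc handling of multi-line bash input with bashlex-based parsing before commands are sent to the runtime. A minimal sketch of that idea follows; `split_bash_commands` is an illustrative name rather than the project's actual helper, and real input needs extra care for incomplete commands (the PS2-prompt case mentioned in the commits):

```python
import bashlex


def split_bash_commands(commands: str) -> list[str]:
    """Split a possibly multi-line bash string into top-level commands.

    bashlex.parse returns one AST node per top-level command; each node's
    .pos attribute is the (start, end) span of that command in the input.
    e.g. split_bash_commands('echo 1 && echo 2\npwd') -> ['echo 1 && echo 2', 'pwd']
    """
    parsed = bashlex.parse(commands)
    return [commands[node.pos[0] : node.pos[1]] for node in parsed]
```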

.gitignore vendored
View File

@@ -169,6 +169,10 @@ evaluation/outputs
evaluation/swe_bench/eval_workspace*
evaluation/SWE-bench/data
evaluation/webarena/scripts/webarena_env.sh
evaluation/bird/data
evaluation/gaia/data
evaluation/gorilla/data
evaluation/toolqa/data
# frontend

View File

@@ -2,9 +2,10 @@
This folder contains evaluation harness for evaluating agents on the Entity-deduction-Arena Benchmark, from the paper [Probing the Multi-turn Planning Capabilities of LLMs via 20 Question Games](https://arxiv.org/abs/2310.01468), presented in ACL 2024 main conference.
## Configure OpenDevin and your LLM
## Setup Environment and LLM Configuration
Please follow the instructions [here](../README.md#setup) to set up your local development environment and LLM.
Create a `config.toml` file if it does not exist at the root of the workspace. Please check [README.md](../../README.md) for how to set this up.
## Start the evaluation

View File

@@ -1,30 +1,27 @@
import asyncio
import logging
import os
import pandas as pd
# import huggingface_hub
from datasets import load_dataset
from evaluation.EDA.game import Q20Game, Q20GameCelebrity
from evaluation.utils.shared import (
EvalMetadata,
EvalOutput,
make_metadata,
prepare_dataset,
reset_logger_for_multiprocessing,
run_evaluation,
)
from opendevin.controller.agent import Agent
# from evaluation.EDA.scorer import question_scorer
from opendevin.controller.state.state import State
from opendevin.core.config import get_llm_config_arg, get_parser, load_app_config
from opendevin.core.logger import get_console_handler
from opendevin.core.config import (
AppConfig,
SandboxConfig,
get_llm_config_arg,
get_parser,
)
from opendevin.core.logger import opendevin_logger as logger
from opendevin.core.main import run_controller
from opendevin.llm.llm import LLM
config = load_app_config()
from opendevin.core.main import create_runtime, run_controller
game = None
@@ -56,39 +53,45 @@ AGENT_CLS_TO_INST_SUFFIX = {
}
def process_instance(
def get_config(
metadata: EvalMetadata,
) -> AppConfig:
config = AppConfig(
default_agent=metadata.agent_class,
run_as_devin=False,
runtime='eventstream',
max_iterations=metadata.max_iterations,
sandbox=SandboxConfig(
container_image='ubuntu:22.04',
enable_auto_lint=False,
use_host_network=False,
update_source_code=True,
),
# do not mount workspace
workspace_base=None,
workspace_mount_path=None,
)
config.set_llm_config(metadata.llm_config)
return config
async def process_instance(
instance: pd.Series,
metadata: EvalMetadata,
reset_logger: bool = True,
):
# Create the agent
agent = Agent.get_cls(metadata.agent_class)(llm=LLM(config=metadata.llm_config))
) -> EvalOutput:
config = get_config(metadata)
instance_id = instance['text'].strip()
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
eval_output_dir = metadata.eval_output_dir
if reset_logger:
# Set up logger
log_file = os.path.join(
eval_output_dir, 'logs', f'instance_{instance["text"].strip()}.log'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
# add back the console handler to print ONE line
logger.addHandler(get_console_handler())
logger.info(
f'Starting evaluation for instance {instance["text"].strip()}.\nLOG: tail -f {log_file}'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
file_handler = logging.FileHandler(log_file)
file_handler.setFormatter(
logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
)
logger.addHandler(file_handler)
log_dir = os.path.join(metadata.eval_output_dir, 'infer_logs')
reset_logger_for_multiprocessing(logger, instance_id, log_dir)
else:
logger.info(f'Starting evaluation for instance {instance_id}.')
# Prepare instruction
_game_class = {'things': Q20Game, 'celebs': Q20GameCelebrity}
_game_class = {'eda-things': Q20Game, 'eda-celebs': Q20GameCelebrity}
guesser_kargs = {
'max_new_tokens': 64,
@@ -112,24 +115,16 @@ def process_instance(
instruction = f'{game.first_user_utterance}'
logger.info(f'Instruction: {instruction}')
# instruction += 'IMPORTANT: You should ONLY interact with the environment provided to you AND NEVER ASK FOR HUMAN HELP.\n'
# NOTE: You can actually set slightly different instruction for different agents
instruction += AGENT_CLS_TO_INST_SUFFIX[agent.__class__.__name__]
instruction += AGENT_CLS_TO_INST_SUFFIX[metadata.agent_class]
# Here's how you can run the agent (similar to the `main` function) and get the final task state
config.max_iterations = metadata.max_iterations
runtime = await create_runtime(config, sid=instance['text'].strip())
state: State | None = asyncio.run(
run_controller(
config=config,
task_str=instruction,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN[
agent.__class__.__name__
],
agent=agent,
sid=instance['text'].strip(),
)
state: State | None = await run_controller(
config=config,
task_str=instruction,
runtime=runtime,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN[metadata.agent_class],
)
# ======= Attempt to evaluate the agent's edits =======
# If you are working on simpler benchmark that only evaluates the final model output (e.g., in a MessageAction)
@@ -150,21 +145,20 @@ def process_instance(
histories = state.history.compatibility_for_eval_history_pairs()
# Save the output
output = {
'instance_id': instance['text'].strip(),
'instance': instance,
'instruction': instruction,
'metadata': metadata.model_dump(),
'history': histories,
'metrics': metrics,
'error': state.last_error if state and state.last_error else None,
'test_result': {
output = EvalOutput(
instance_id=instance_id,
instance=instance.to_dict(),
instruction=instruction,
metadata=metadata,
history=histories,
metrics=metrics,
error=state.last_error if state and state.last_error else None,
test_result={
'success': test_result,
'final_message': final_message,
'ground_truth': instance['text'],
},
}
)
return output
@@ -191,12 +185,16 @@ if __name__ == '__main__':
)
args, _ = parser.parse_known_args()
llm_config = get_llm_config_arg(args.llm_config) if args.llm_config else config.llm
logger.info(f'Config for evaluation: {config}')
eda_dataset = load_dataset(
'yizheapple/entity-deduction-arena', name=args.dataset, split=args.data_split
)
eda_dataset.rename(columns={'text': 'instance_id'}, inplace=True)
llm_config = None
if args.llm_config:
llm_config = get_llm_config_arg(args.llm_config)
if llm_config is None:
raise ValueError(f'Could not find LLM config: --llm_config {args.llm_config}')
metadata = make_metadata(
llm_config,
@@ -214,16 +212,15 @@ if __name__ == '__main__':
output_file = os.path.join(metadata.eval_output_dir, 'output.jsonl')
prepared_dataset = prepare_dataset(
eda_dataset.to_pandas(), output_file, args.eval_n_limit, 'text'
eda_dataset.to_pandas(), output_file, args.eval_n_limit
)
agent = Agent.get_cls(args.agent_cls)(llm=LLM(config.llm))
run_evaluation(
prepared_dataset,
metadata,
output_file,
args.eval_num_workers,
process_instance,
'text',
asyncio.run(
run_evaluation(
prepared_dataset,
metadata,
output_file,
args.eval_num_workers,
process_instance,
)
)
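The EDA diff above shows the shape every refactored `run_infer.py` now follows: build a per-instance `AppConfig` instead of mutating a global one, create an eventstream runtime explicitly, run the controller against it, and return a structured `EvalOutput`. A condensed sketch of that flow is below; the concrete `SandboxConfig` values are illustrative, several keyword arguments the real scripts pass (e.g. `fake_user_response_fn`, `enable_auto_lint`, `use_host_network`) are omitted and assumed to have defaults, and the plain `instance_id` column follows the "enforce all eval to use instance_id" commit:

```python
import pandas as pd

from evaluation.utils.shared import EvalMetadata, EvalOutput
from opendevin.core.config import AppConfig, SandboxConfig
from opendevin.core.main import create_runtime, run_controller


def get_config(metadata: EvalMetadata) -> AppConfig:
    config = AppConfig(
        default_agent=metadata.agent_class,
        run_as_devin=False,
        runtime='eventstream',
        max_iterations=metadata.max_iterations,
        sandbox=SandboxConfig(
            container_image='ubuntu:22.04',  # benchmark-specific images are also used
            update_source_code=True,
        ),
        # do not mount a host workspace; the sandbox is self-contained
        workspace_base=None,
        workspace_mount_path=None,
    )
    config.set_llm_config(metadata.llm_config)
    return config


async def process_instance(instance: pd.Series, metadata: EvalMetadata) -> EvalOutput:
    config = get_config(metadata)
    # one runtime per instance, keyed by instance_id
    runtime = await create_runtime(config, sid=str(instance['instance_id']))
    instruction = '...built from the instance...'
    state = await run_controller(config=config, task_str=instruction, runtime=runtime)
    if state is None:
        raise ValueError('State should not be None.')
    return EvalOutput(
        instance_id=str(instance['instance_id']),
        instance=instance.to_dict(),
        instruction=instruction,
        metadata=metadata,
        history=state.history.compatibility_for_eval_history_pairs(),
        metrics=state.metrics.get() if state.metrics else None,
        error=state.last_error if state.last_error else None,
        test_result={},
    )
```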

evaluation/EDA/scripts/run_infer.sh Normal file → Executable file
View File

View File

@@ -22,6 +22,32 @@ all the preprocessing/evaluation/analysis scripts.
- BIRD: [`evaluation/bird`](./bird)
- LogicReasoning: [`evaluation/logic_reasoning`](./logic_reasoning)
## Setup
### Development environment
Please follow [this document](https://github.com/OpenDevin/OpenDevin/blob/main/Development.md) to set up a local development environment for OpenDevin.
### Configure OpenDevin and your LLM
Create a `config.toml` file if it does not exist at the root of the workspace. You can copy from `config.template.toml` if it is easier for you.
Add the configuration for your LLM:
```toml
# TODO: Change these to the model you want to evaluate
[llm.eval_gpt4_1106_preview_llm]
model = "gpt-4-1106-preview"
api_key = "XXX"
temperature = 0.0
[llm.eval_some_openai_compatible_model_llm]
model = "openai/MODEL_NAME"
base_url = "https://OPENAI_COMPATIBLE_URL/v1"
api_key = "XXX"
temperature = 0.0
```
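On the Python side, each refactored `run_infer.py` resolves the `[llm.<name>]` group by the name passed via `--llm_config`; the snippet below mirrors the EDA and AgentBench diffs, and the mapping of the CLI value to the `[llm.<name>]` table is inferred from context rather than stated in the diff:

```python
from opendevin.core.config import get_llm_config_arg, get_parser

parser = get_parser()
args, _ = parser.parse_known_args()

# e.g. --llm_config eval_gpt4_1106_preview_llm -> [llm.eval_gpt4_1106_preview_llm] in config.toml
llm_config = None
if args.llm_config:
    llm_config = get_llm_config_arg(args.llm_config)
if llm_config is None:
    raise ValueError(f'Could not find LLM config: --llm_config {args.llm_config}')
```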
### Result Visualization
Check [this huggingface space](https://huggingface.co/spaces/OpenDevin/evaluation) for visualization of existing experimental results.

View File

@@ -1,44 +1,10 @@
# AgentBench Evaluation
This folder contains evaluation harness for evaluating agents on
the [AgentBench: Evaluating LLMs as Agents](https://arxiv.org/abs/2308.03688).
This folder contains evaluation harness for evaluating agents on the [AgentBench: Evaluating LLMs as Agents](https://arxiv.org/abs/2308.03688). We currently only support running on the `osbench` subset.
## Configure OpenDevin and your LLM
## Setup Environment and LLM Configuration
Create a `config.toml` file if it does not exist at the root of the workspace. Please check [README.md](../../README.md)
for how to set this up.
Here is an example `config.toml` file:
```toml
[core]
max_iterations = 100
cache_dir = "/path/to/cache"
workspace_base = "/path/to/workspace"
workspace_mount_path = "/path/to/workspace"
ssh_hostname = "localhost"
# AgentBench specific
run_as_devin = true
[sandbox]
use_host_network = false
enable_auto_lint = true
box_type = "ssh"
timeout = 120
[llm.eval_gpt35_turbo]
model = "gpt-3.5-turbo"
api_key = "sk-123"
temperature = 0.0
[llm.eval_gpt4o]
model = "gpt-4o"
api_key = "sk-123"
temperature = 0.0
```
Please follow the instructions [here](../README.md#setup) to set up your local development environment and LLM.
## Start the evaluation
@@ -46,7 +12,18 @@ temperature = 0.0
./evaluation/agent_bench/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit]
```
Following is the basic command to start the evaluation. Here we are only evaluating the `osbench` for now.
- `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for your
LLM settings, as defined in your `config.toml`.
- `git-version`, e.g. `HEAD`, is the git commit hash of the OpenDevin version you would
like to evaluate. It could also be a release tag like `0.6.2`.
- `agent`, e.g. `CodeActAgent`, is the name of the agent for benchmarks, defaulting
to `CodeActAgent`.
- `eval_limit`, e.g. `10`, limits the evaluation to the first `eval_limit` instances. By
default, the script evaluates the entire `osbench` subset. Note:
in order to use `eval_limit`, you must also set `agent`.
Following is the basic command to start the evaluation.
You can update the arguments in the script `evaluation/agent_bench/scripts/run_infer.sh`, such as `--max-iterations`, `--eval-num-workers` and so on.
@@ -57,5 +34,5 @@ You can update the arguments in the script `evaluation/agent_bench/scripts/run_i
- `--eval-n-limit`: the number of examples to evaluate. For example, `100`.
```bash
./evaluation/agent_bench/scripts/run_infer.sh eval_gpt35_turbo 0.6.2 CodeActAgent 1
./evaluation/agent_bench/scripts/run_infer.sh eval_gpt35_turbo HEAD CodeActAgent 1
```

View File

@@ -14,7 +14,7 @@ def try_parse_answer(act) -> str | None:
raw_ans = act.thought
else:
return None
agent_answer = re.findall(r'<solution>(.*?)</solution>', raw_ans)
agent_answer = re.findall(r'<solution>(.*?)</solution>', raw_ans, re.DOTALL)
if not agent_answer:
return None
return agent_answer[0].strip()
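For context on the one-character change above: without `re.DOTALL`, `.` does not match newlines, so a `<solution>` block that spans several lines is silently dropped. A quick illustration:

```python
import re

raw_ans = 'The answer is <solution>\nSELECT *\nFROM t\n</solution>'

print(re.findall(r'<solution>(.*?)</solution>', raw_ans))             # []
print(re.findall(r'<solution>(.*?)</solution>', raw_ans, re.DOTALL))  # ['\nSELECT *\nFROM t\n']
```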

View File

@@ -1,10 +1,9 @@
import asyncio
import logging
import os
import re
import shutil
import tempfile
from typing import Any
import docker
import pandas as pd
from datasets import load_dataset
@@ -16,64 +15,176 @@ from evaluation.agent_bench.helper import (
)
from evaluation.utils.shared import (
EvalMetadata,
EvalOutput,
make_metadata,
prepare_dataset,
reset_logger_for_multiprocessing,
run_evaluation,
)
from opendevin.controller.agent import Agent
from opendevin.controller.state.state import State
from opendevin.core.config import get_llm_config_arg, load_app_config, parse_arguments
from opendevin.core.logger import get_console_handler
from opendevin.core.config import (
AppConfig,
SandboxConfig,
get_llm_config_arg,
parse_arguments,
)
from opendevin.core.logger import opendevin_logger as logger
from opendevin.core.main import run_controller
from opendevin.events.action import CmdRunAction, MessageAction
from opendevin.llm.llm import LLM
from opendevin.runtime.docker.ssh_box import DockerSSHBox
config = load_app_config()
from opendevin.core.main import create_runtime, run_controller
from opendevin.events.action import AgentFinishAction, CmdRunAction, MessageAction
from opendevin.events.observation import CmdOutputObservation
from opendevin.runtime.runtime import Runtime
def process_instance(
def get_config(
metadata: EvalMetadata,
) -> AppConfig:
config = AppConfig(
default_agent=metadata.agent_class,
run_as_devin=False,
runtime='eventstream',
max_iterations=metadata.max_iterations,
sandbox=SandboxConfig(
container_image='ubuntu:22.04',
enable_auto_lint=True,
use_host_network=False,
update_source_code=True,
),
# do not mount workspace
workspace_base=None,
workspace_mount_path=None,
)
config.set_llm_config(metadata.llm_config)
return config
async def initialize_runtime(
runtime: Runtime,
instance: pd.Series, # this argument is not required
):
"""Initialize the runtime for the agent.
This function is called before the runtime is used to run the agent.
"""
logger.info(f"{'-' * 50} BEGIN Runtime Initialization Fn {'-' * 50}")
obs: CmdOutputObservation
# Set instance id
action = CmdRunAction(command='mkdir -p /workspace')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
action = CmdRunAction(command='cd /workspace')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
init_cmd = instance.init
if init_cmd is not None:
script_name = f'{instance.instance_id}_init.sh'
with tempfile.TemporaryDirectory() as tmpdir:
host_script_path = os.path.join(tmpdir, script_name)
create_sh_file(host_script_path, init_cmd)
await runtime.copy_to(
host_script_path,
'/workspace',
)
logger.info(f'Running init script: {script_name}')
action = CmdRunAction(command=f'chmod +x ./{script_name} && ./{script_name}')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
assert obs.exit_code == 0
logger.info(f"{'-' * 50} END Runtime Initialization Fn {'-' * 50}")
async def complete_runtime(
runtime: Runtime,
instance: pd.Series, # this argument is not required, but it is used to get the workspace_dir_name
) -> dict[str, Any]:
"""Complete the runtime for the agent.
This function is called after the agent has finished running.
If you need to do something in the sandbox to get the correctness metric after
the agent has run, modify this function.
"""
logger.info(f"{'-' * 50} BEGIN Runtime Completion Fn {'-' * 50}")
obs: CmdOutputObservation
agent_answer = None
get_agent_result_cmd = instance.get_agent_result
if get_agent_result_cmd is not None:
script_name = 'get_agent_result.sh'
with tempfile.TemporaryDirectory() as tmpdir:
host_script_path = os.path.join(tmpdir, script_name)
create_sh_file(host_script_path, get_agent_result_cmd)
await runtime.copy_to(
host_script_path,
'/workspace',
)
logger.info(f'Running get agent result cmd: {script_name}')
action = CmdRunAction(
command=f'chmod +x ./{script_name} && ./{script_name}',
keep_prompt=False,
)
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
assert obs.exit_code == 0
agent_answer = obs.content
# IF the agent answer is not found, retrieve it from the history
# We wait until the controller finishes
final_ans = None
if instance.ground_truth is not None:
final_ans = instance.ground_truth
else:
get_ground_truth_cmd = instance.get_ground_truth
if get_ground_truth_cmd is not None:
script_name = 'get_ground_truth.sh'
with tempfile.TemporaryDirectory() as tmpdir:
host_script_path = os.path.join(tmpdir, script_name)
create_sh_file(host_script_path, get_ground_truth_cmd)
await runtime.copy_to(
host_script_path,
'/workspace',
)
logger.info(f'Running get ground truth cmd: {script_name}')
action = CmdRunAction(
command=f'chmod +x ./{script_name} && ./{script_name}',
keep_prompt=False,
)
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
final_ans = obs.content
logger.info(f"{'-' * 50} END Runtime Completion Fn {'-' * 50}")
return {
'final_ans': final_ans,
'agent_answer': agent_answer,
}
async def process_instance(
instance: pd.Series,
metadata: EvalMetadata,
reset_logger: bool = True,
):
# Create the agent
agent = Agent.get_cls(metadata.agent_class)(llm=LLM(config=metadata.llm_config))
) -> EvalOutput:
config = get_config(metadata)
inst_id = instance.instance_id
question = instance.description
# create a directory for the instance's workspace
instance_workspace = str(os.path.join(config.workspace_base, inst_id))
container_inst_workspace = str(
os.path.join(config.workspace_mount_path_in_sandbox, inst_id)
)
if os.path.exists(instance_workspace):
shutil.rmtree(instance_workspace)
os.makedirs(instance_workspace, exist_ok=True)
# Set up the logger properly, so you can run multiprocessing to parallel the evaluation
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
if reset_logger:
# Set up logger
log_file = os.path.join(
metadata.eval_output_dir, 'logs', f'instance_{inst_id}.log'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
# add back the console handler to print ONE line
logger.addHandler(get_console_handler())
logger.info(
f'Starting evaluation for instance {inst_id}.\nHint: run "tail -f {log_file}" to see live logs in a separate shell'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
file_handler = logging.FileHandler(log_file)
file_handler.setFormatter(
logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
)
logger.addHandler(file_handler)
log_dir = os.path.join(metadata.eval_output_dir, 'infer_logs')
reset_logger_for_multiprocessing(logger, instance.instance_id, log_dir)
else:
logger.info(f'Starting evaluation for instance {instance.instance_id}.')
# =============================================
# build instruction
@@ -86,104 +197,68 @@ def process_instance(
'Please encapsulate your final answer (answer ONLY) within <solution> and </solution>.\n'
'For example: The answer to the question is <solution> 42 </solution>.\n'
'# Problem \n'
f'{question}\n\n'
f'{instance.description}\n\n'
)
instruction += (
'IMPORTANT: You should ONLY interact with the environment provided '
'to you AND NEVER ASK FOR HUMAN HELP.\n'
)
# NOTE: You can actually set slightly different instruction for different agents
instruction += INST_SUFFIXES[agent.__class__.__name__]
instruction += INST_SUFFIXES[metadata.agent_class]
# =============================================
# create sandbox and run the agent
# =============================================
sandbox = DockerSSHBox(
config=config.sandbox,
persist_sandbox=False,
workspace_mount_path=config.workspace_mount_path,
sandbox_workspace_dir=config.workspace_mount_path_in_sandbox,
cache_dir=config.cache_dir,
run_as_devin=config.run_as_devin,
)
sandbox.execute(f'cd {inst_id}')
runtime: Runtime = await create_runtime(config, sid=instance.instance_id)
init_cmd = instance.init
if init_cmd is not None:
scpt_name = f'{instance.instance_id}_init.sh'
scpt_path = os.path.join(container_inst_workspace, scpt_name)
host_scpt_path = os.path.join(instance_workspace, scpt_name)
create_sh_file(host_scpt_path, init_cmd)
logger.info(f'Running init script: {scpt_path}')
_, init_res = sandbox.execute(scpt_path)
logger.info(f'Init script result: {init_res}')
await initialize_runtime(runtime, instance=instance)
# Here's how you can run the agent (similar to the `main` function) and get the final task state
config.max_iterations = metadata.max_iterations
state: State | None = asyncio.run(
run_controller(
config=config,
task_str=instruction,
fake_user_response_fn=FAKE_RESPONSES[agent.__class__.__name__],
agent=agent,
sandbox=sandbox,
sid=inst_id,
)
state: State | None = await run_controller(
config=config,
task_str=instruction,
runtime=runtime,
fake_user_response_fn=FAKE_RESPONSES[metadata.agent_class],
)
if state is None:
raise ValueError('State should not be None.')
# get the ground truth
# OSBenchSSHBox.get_ground_truth(instance, state)
# =============================================
# result evaluation
# =============================================
agent_answer = ''
get_agent_result_cmd = instance.get_agent_result
if get_agent_result_cmd is not None:
scpt_name = f'{instance.instance_id}_get_agent_result.sh'
scpt_path = os.path.join(container_inst_workspace, scpt_name)
host_scpt_path = os.path.join(instance_workspace, scpt_name)
create_sh_file(host_scpt_path, get_agent_result_cmd)
logger.info(f'Running get agent result cmd: {scpt_path}')
_, agent_answer = sandbox.execute(scpt_path)
else:
return_val = await complete_runtime(runtime, instance)
agent_answer = return_val['agent_answer']
final_ans = return_val['final_ans']
# If the agent answer is not found, retrieve it from the history
if agent_answer is None:
agent_answer = ''
logger.info('Retrieving agent answer from history.')
raw_ans = ''
# retrieve the last agent message or thought
for event in state.history.get_events(reverse=True):
if isinstance(event, MessageAction) and event.source == 'agent':
raw_ans = event.content
elif isinstance(event, CmdRunAction) and event.source == 'agent':
raw_ans = event.thought
if event.source == 'agent':
if isinstance(event, AgentFinishAction):
raw_ans = event.thought
break
elif isinstance(event, MessageAction):
raw_ans = event.content
break
elif isinstance(event, CmdRunAction):
raw_ans = event.thought
break
# parse the answer for a solution tag
agent_answer = re.findall(r'<solution>(.*?)</solution>', raw_ans)
agent_answer = re.findall(r'<solution>(.*?)</solution>', raw_ans, re.DOTALL)
if len(agent_answer) == 0:
logger.warning(f'Failed to parse model answer: {raw_ans}')
agent_answer = raw_ans
else:
agent_answer = agent_answer[0]
final_ans = ''
if instance.ground_truth is not None:
final_ans = instance.ground_truth
else:
get_ground_truth_cmd = instance.get_ground_truth
if get_ground_truth_cmd is not None:
scpt_name = f'{instance.instance_id}_get_ground_truth.sh'
scpt_path = os.path.join(container_inst_workspace, scpt_name)
host_scpt_path = os.path.join(instance_workspace, scpt_name)
create_sh_file(host_scpt_path, get_ground_truth_cmd)
logger.info(f'Running get ground truth cmd: {scpt_path}')
sandbox.execute(f'cd {container_inst_workspace}')
_, final_ans = sandbox.execute(scpt_path)
comparison_method = instance.comparison_method
logger.info(
f'Final message: {agent_answer} | Ground truth: {final_ans} | Comparison method: {comparison_method}'
@@ -198,58 +273,49 @@ def process_instance(
metrics = state.metrics.get() if state.metrics else None
# Save the output
output = {
'instance_id': inst_id,
'instance': instance.to_dict(),
'instruction': instruction,
'metadata': metadata.model_dump(),
'history': histories,
'metrics': metrics,
'error': state.last_error if state and state.last_error else None,
'test_result': {
output = EvalOutput(
instance_id=instance.instance_id,
instance=instance.to_dict(),
instruction=instruction,
metadata=metadata,
history=histories,
metrics=metrics,
error=state.last_error if state and state.last_error else None,
test_result={
'agent_answer': agent_answer,
'final_answer': final_ans,
'check_method': comparison_method,
'result': test_result,
},
}
# clean up
if os.path.exists(instance_workspace):
shutil.rmtree(instance_workspace)
# Close the sandbox
try:
sandbox.close()
except docker.errors.NotFound as e:
logger.error(f'Failed to close sandbox: {e}')
)
return output
if __name__ == '__main__':
id_column = 'instance_id'
args = parse_arguments()
dataset = load_dataset('iFurySt/AgentBench')
agent_bench_tests = dataset['osbench'].to_pandas()
llm_config = get_llm_config_arg(args.llm_config) if args.llm_config else config.llm
logger.info(f'Config for evaluation: {config}')
llm_config = None
if args.llm_config:
llm_config = get_llm_config_arg(args.llm_config)
if llm_config is None:
raise ValueError(f'Could not find LLM config: --llm_config {args.llm_config}')
metadata = make_metadata(
llm_config,
args.dataset_name,
'AgentBench-OS',
args.agent_cls,
args.max_iterations,
args.eval_note,
args.eval_output_dir,
)
output_file = os.path.join(metadata.eval_output_dir, 'output.jsonl')
instances = prepare_dataset(dataset, output_file, args.eval_n_limit, id_column)
instances = prepare_dataset(agent_bench_tests, output_file, args.eval_n_limit)
run_evaluation(
instances,
metadata,
output_file,
args.eval_num_workers,
process_instance,
id_column,
asyncio.run(
run_evaluation(
instances, metadata, output_file, args.eval_num_workers, process_instance
)
)
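For orientation, a rough mental model of what the shared `run_evaluation` driver does with the pieces above. This is an assumption-laden sketch, not the code in `evaluation/utils/shared.py`: the real driver also handles the `--eval-num-workers` pool, progress reporting and resumption, and the `model_dump_json()` call assumes `EvalOutput` is a pydantic model like `EvalMetadata`:

```python
import pandas as pd


async def run_evaluation(dataset: pd.DataFrame, metadata, output_file: str,
                         num_workers: int, process_instance_fn) -> None:
    # Sketch of the single-process path ("use plain for-loop for single process").
    with open(output_file, 'a') as f:
        for _, instance in dataset.iterrows():
            output = await process_instance_fn(instance, metadata, reset_logger=True)
            f.write(output.model_dump_json() + '\n')  # one EvalOutput per line of output.jsonl
```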

evaluation/agent_bench/scripts/run_infer.sh Normal file → Executable file
View File

View File

@@ -2,15 +2,12 @@
Implements evaluation of agents on BioCoder from the BioCoder benchmark introduced in [BioCoder: A Benchmark for Bioinformatics Code Generation with Large Language Models](https://arxiv.org/abs/2308.16458). Please see [here](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/bigcode_eval/tasks/humanevalpack.py) for the reference implementation used in the paper.
## Setup Environment
## Setup Environment and LLM Configuration
Please follow [this document](https://github.com/OpenDevin/OpenDevin/blob/main/Development.md) to setup local develop environment for OpenDevin.
## Configure OpenDevin and your LLM
Create a `config.toml` file if it does not exist at the root of the workspace. Please check [README.md](../../README.md) for how to set this up.
Please follow the instructions [here](../README.md#setup) to set up your local development environment and LLM.
## BioCoder Docker Image
In the opendevin branch of the Biocoder repository, we have slightly modified our original Docker image to work with the OpenDevin environment. In the Docker image are testing scripts (`/testing/start_test_opendevin.py` and aux files in `/testing_files/`) to assist with evaluation. Additionally, we have installed all dependencies, including OpenJDK, mamba (with Python 3.6), and many system libraries. Notably, we have **not** packaged all repositories into the image, so they are downloaded at runtime.
**Before first execution, pull our Docker image with the following command**
@@ -41,12 +38,12 @@ to `CodeActAgent`.
- `eval_limit`, e.g. `10`, limits the evaluation to the first `eval_limit` instances. By default it infers all instances.
Let's say you'd like to run 1 instance using `eval_gpt4o_2024_05_13` and CodeActAgent
with OpenDevin version 0.6.2, then your command would be:
with current OpenDevin version, then your command would be:
## Examples
```bash
./evaluation/biocoder/scripts/run_infer.sh eval_gpt4o_2024_05_13 0.6.2 CodeActAgent 1
./evaluation/biocoder/scripts/run_infer.sh eval_gpt4o_2024_05_13 HEAD CodeActAgent 1
```
## Reference

View File

@@ -1,387 +0,0 @@
import json
import os
import re
import sys
from collections import defaultdict
from dataclasses import dataclass
from datasets import load_dataset
from opendevin.core.config import load_app_config
from opendevin.core.logger import opendevin_logger as logger
from opendevin.runtime.docker.ssh_box import DockerSSHBox
from opendevin.runtime.plugins import (
JupyterRequirement,
PluginRequirement,
SWEAgentCommandsRequirement,
)
config = load_app_config()
BIOCODER_BENCH_CONTAINER_IMAGE = 'public.ecr.aws/i5g0m1f6/eval_biocoder:v1.0'
@dataclass
class BiocoderData:
filePath: str
numLines: int
lineStart: int
lineEnd: int
signature: str
comment: str
content: str
repository: str
promptSummaryOnly: str
contextCode: str
goldenCode: str
test_case_id: str
language: str
def to_dict(self):
return {
'filePath': self.filePath,
'numLines': self.numLines,
'lineStart': self.lineStart,
'lineEnd': self.lineEnd,
'signature': self.signature,
'comment': self.comment,
'content': self.content,
'repository': self.repository,
'promptSummaryOnly': self.promptSummaryOnly,
'contextCode': self.contextCode,
'goldenCode': self.goldenCode,
'test_case_id': self.test_case_id,
'language': self.language,
}
def get_likely_indent_size(array_of_tabs) -> int:
sizes = defaultdict(int)
for i in range(len(array_of_tabs) - 1):
diff = array_of_tabs[i + 1] - array_of_tabs[i]
if diff > 0:
sizes[diff] += 1
if len(sizes) == 0:
return 4
return int(max(sizes, key=sizes.get))
class BiocoderSSHBox(DockerSSHBox):
def __init__(
self,
container_image: str,
timeout: int = 120,
sid: str | None = None,
biocoder_instance_id: str | None = None,
biocoder_instance: BiocoderData | None = None,
skip_workspace_mount: bool = True,
sandbox_plugins: list[PluginRequirement] = [], # noqa: B006
biocoder_cache_folder: str = 'biocoder_cache',
workspace_dir_name: str | None = None,
):
if biocoder_instance_id is None:
raise ValueError('biocoder_instance_id must be provided')
self.biocoder_instance_id = biocoder_instance_id
self.biocoder_instance = biocoder_instance
self.skip_workspace_mount = skip_workspace_mount
self.biocoder_cache_folder = biocoder_cache_folder
self.first_line_after_removed = None
self.workspace_dir_name = workspace_dir_name
self.workspace_base = config.workspace_base
self.workspace_mount_path = config.workspace_mount_path
# self.workspace_dir_name_host = os.path.join(config.workspace_base, workspace_dir_name)
self.context_path = None
self.generated_path = None
self.golden_path = None
assert (
container_image is not None
), 'container_image is required for BiocoderBenchSSHBox!'
super().__init__(container_image, timeout, sid)
self.init_plugins(sandbox_plugins)
@property
def volumes(self):
if self.skip_workspace_mount:
return {
k: v
for k, v in super().volumes.items()
if not v['bind'] == self.sandbox_workspace_dir
}
return super().volumes
def get_target_filepath(self):
target_filepath = os.path.join(
self.workspace_mount_path,
self.biocoder_instance.repository.split('/')[1],
self.biocoder_instance.filePath,
)
return target_filepath
def get_changed_code(self, include_signature=False):
# copies changed code into /testing_files/
# Note that this does NOT copy the function signature
target_filepath = self.get_target_filepath()
selected_lines = []
offset = 1 if include_signature else 0
if self.first_line_after_removed is None:
logger.warning('First line after removed is None')
with open(target_filepath, 'r') as f:
lines = f.read().split('\n')
for i in range(self.biocoder_instance.lineStart - offset, len(lines)):
if lines[i].strip() == self.first_line_after_removed.strip():
break
selected_lines.append(lines[i])
text = '\n'.join(selected_lines)
return text
def copy_changed_code(self):
changed_code = self.get_changed_code(include_signature=True)
with open(self.generated_path, 'w') as f:
f.write(changed_code)
exit_code, output = self.execute_and_check(
f'cp -r /workspace/{self.biocoder_cache_folder}/* /testing_files',
'Failed to copy the files',
)
def remove_code(self):
comment_prefix = {'python': '#', 'java': '//'}
target_filepath = self.get_target_filepath()
line_start = self.biocoder_instance.lineStart
line_end = self.biocoder_instance.lineEnd
with open(target_filepath, 'r') as f:
lines = f.read().split('\n')
# print("="*10+"ORIGINAL"+"="*10)
# print("\n".join(lines))
signature_line = lines[line_start - 1]
# get the number of tabs
def get_indent_size(s: str):
return len(re.match(r'\s*', s).group())
indent_sizes = list(map(get_indent_size, lines))
indent_size = get_likely_indent_size(indent_sizes)
comment_indent_size = get_indent_size(signature_line) + indent_size
lines = (
lines[:line_start]
+ [
f"{' '*comment_indent_size+comment_prefix[self.biocoder_instance.language.lower()]}TODO: replace with your code here"
]
+ ([''] * 2)
+ lines[line_end:]
)
first_line_after_removed_index = line_start
while len(
lines[first_line_after_removed_index].strip()
) == 0 and first_line_after_removed_index < len(lines):
first_line_after_removed_index += 1
self.first_line_after_removed = lines[first_line_after_removed_index]
# print("FIRST LINE AFTER REMOVED: ", self.first_line_after_removed)
with open(target_filepath, 'w') as f:
f.write('\n'.join(lines))
# with open(target_filepath, 'r') as f:
# print("="*10+"MODIFIED"+"="*10)
# print(f.read())
def execute_and_check(self, cmd: str, error_msg: str) -> tuple[int, str]:
exit_code, output = self.execute(cmd)
if exit_code != 0:
logger.error(error_msg)
sys.exit(1)
return exit_code, output
@classmethod
def get_box_for_instance(
cls,
instance,
workspace_dir_name=None,
skip_workspace_mount: bool = False,
workspace_mount_path: str | None = None,
sandbox_plugins: list[PluginRequirement] = [], # noqa: B006
) -> 'BiocoderSSHBox':
"""This method initializes a container image, then runs some initialization commands"""
if workspace_dir_name is None:
workspace_dir_name = f'{instance.repository}__{instance.test_case_id[:10]}__{os.getpid()}'.replace(
'/', '__'
)
workspace_base = str(os.path.join(config.workspace_base, workspace_dir_name))
old_workspace_base = config.workspace_base
old_workspace_mount_path = config.workspace_mount_path
try:
config.workspace_base = workspace_base
config.workspace_mount_path = workspace_base
# linting python after editing helps LLM fix indentations
config.sandbox.enable_auto_lint = True
# create folder for transferring files back/forth
biocoder_cache_folder = 'biocoder_cache'
if not os.path.exists(os.path.join(workspace_base, biocoder_cache_folder)):
os.makedirs(
os.path.join(workspace_base, biocoder_cache_folder), exist_ok=True
)
file_ext = {
'python': 'py',
'java': 'java',
'c': 'c',
'cpp': 'cpp',
'javascript': 'js',
'typescript': 'ts',
}[instance.language.lower()]
context_path = os.path.join(
workspace_base, biocoder_cache_folder, 'context.' + file_ext
)
generated_path = os.path.join(
workspace_base, biocoder_cache_folder, 'generated.' + file_ext
)
golden_path = os.path.join(
workspace_base, biocoder_cache_folder, 'golden.' + file_ext
)
# print(instance.contextCode)
with open(context_path, 'w') as f:
f.write(instance.contextCode)
with open(generated_path, 'w') as f:
f.write(instance.goldenCode)
with open(golden_path, 'w') as f:
f.write(instance.goldenCode)
testcase_json = {
'test_case_id': instance.test_case_id,
'num_cases': 1000,
'language': instance.language.lower(),
}
with open(
os.path.join(
workspace_base, biocoder_cache_folder, 'testcase_biocoder.json'
),
'w',
) as f:
f.write(json.dumps(testcase_json, indent=4))
# linting python after editing helps LLM fix indentations
config.sandbox.enable_auto_lint = True
sandbox = cls(
container_image=BIOCODER_BENCH_CONTAINER_IMAGE,
biocoder_instance_id=instance.test_case_id,
biocoder_instance=instance,
skip_workspace_mount=skip_workspace_mount,
sandbox_plugins=sandbox_plugins,
biocoder_cache_folder=biocoder_cache_folder,
workspace_dir_name=workspace_dir_name,
)
except Exception:
raise
finally:
config.workspace_base = old_workspace_base
config.workspace_mount_path = old_workspace_mount_path
sandbox.context_path = context_path
sandbox.generated_path = generated_path
sandbox.golden_path = golden_path
logger.info(f'SSH box started for instance {instance.test_case_id}.')
# cd to the workspace
exit_code, output = sandbox.execute_and_check(
'cd /workspace', 'Failed to cd to workspace'
)
logger.info(f'cd to workspace: {output}')
# download repository archive
repository_url = f"https://biocoder.lilbillbiscuit.com/repos/{instance.repository.split('/')[1]}.zip"
exit_code, output = sandbox.execute_and_check(
'wget -O repo.zip ' + repository_url, 'Failed to download the repository'
)
logger.info(f'Downloaded the repository: {output}')
exit_code, output = sandbox.execute_and_check(
'unzip -o -q repo.zip', 'Failed to unzip the repository'
)
logger.info(f'Unzipped the repository: {output}')
# copy the context, generated and golden files to the /testing_files folder
exit_code, output = sandbox.execute_and_check(
f'cp -r /workspace/{biocoder_cache_folder}/* /testing_files',
'Failed to copy the files',
)
# chmod 777
exit_code, output = sandbox.execute_and_check(
'chmod -R 777 /workspace',
'Failed to chmod the files',
)
return sandbox
if __name__ == '__main__':
biocoder_dataset = load_dataset('Lilbillbiscuit/biocoder_public')
EXAMPLE_INSTANCE = biocoder_dataset['test'][0]
EXAMPLE_INSTANCE = BiocoderData(**EXAMPLE_INSTANCE)
sandbox = BiocoderSSHBox.get_box_for_instance(
instance=EXAMPLE_INSTANCE,
workspace_mount_path='/home/ubuntu/OpenDevinBioCoder/workspace',
skip_workspace_mount=False,
sandbox_plugins=[JupyterRequirement(), SWEAgentCommandsRequirement()],
)
# PRE TEST
exit_code, output = sandbox.execute_and_check(
'cd /testing',
'Failed to cd /testing',
)
logger.info(f'cd $REPO_PATH: {output}')
exit_code, output = sandbox.execute_and_check(
'whoami',
'Failed to run whoami',
)
logger.info(f'whoami: {output}')
# TEST
exit_code, output = sandbox.execute(
'/home/devin/mambaforge/bin/mamba run -n test python3 /testing/start_test_opendevin.py'
)
assert exit_code == 0, 'Expected exit code 0 (this should have passed)'
logger.info(f'$TEST_CMD:\n{output}')
exit_code, output = sandbox.execute_and_check(
'cat /testing_files/results_biocoder.json', 'Failed to read the result file'
)
print(output)
json_obj = json.loads(output)
if json_obj['result'] == 'pass':
print('PASS')
else:
print('FAIL')
sys.stdout.flush()
try:
while True:
try:
user_input = input('>>> ')
except EOFError:
logger.info('Exiting...')
break
if user_input.lower() == 'exit':
logger.info('Exiting...')
break
exit_code, output = sandbox.execute(user_input)
logger.info('exit code: %d', exit_code)
logger.info(output)
sys.stdout.flush()
except KeyboardInterrupt:
logger.info('Exiting...')
sandbox.close()
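The deleted `BiocoderSSHBox` above is superseded by staging helper files through the runtime API inside `initialize_runtime`, as the AgentBench and BioCoder diffs show. The recurring idiom, condensed into a sketch (`stage_script` is a hypothetical helper; the real code goes through `create_sh_file` and benchmark-specific paths):

```python
import os
import tempfile

from opendevin.events.action import CmdRunAction


async def stage_script(runtime, script_name: str, script_body: str) -> None:
    """Write a helper script locally, copy it into the sandbox, and run it."""
    with tempfile.TemporaryDirectory() as tmpdir:
        host_path = os.path.join(tmpdir, script_name)
        with open(host_path, 'w') as f:
            f.write(script_body)
        await runtime.copy_to(host_path, '/workspace')

    action = CmdRunAction(command=f'cd /workspace && chmod +x ./{script_name} && ./{script_name}')
    obs = await runtime.run_action(action)
    assert obs.exit_code == 0, f'Script failed: {obs.content}'
```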

View File

@@ -1,33 +1,38 @@
import asyncio
import functools
import json
import logging
import os
import pathlib
from functools import partial
import tempfile
from typing import Any
import pandas as pd
from datasets import load_dataset
from evaluation.biocoder.biocoder_env_box import BiocoderData, BiocoderSSHBox
from evaluation.biocoder.utils import BiocoderData
from evaluation.utils.shared import (
EvalMetadata,
EvalOutput,
codeact_user_response,
make_metadata,
prepare_dataset,
reset_logger_for_multiprocessing,
run_evaluation,
)
from opendevin.controller.agent import Agent
from opendevin.controller.state.state import State
from opendevin.core.config import get_llm_config_arg, load_app_config, parse_arguments
from opendevin.core.logger import get_console_handler
from opendevin.core.config import (
AppConfig,
SandboxConfig,
get_llm_config_arg,
parse_arguments,
)
from opendevin.core.logger import opendevin_logger as logger
from opendevin.core.main import run_controller
from opendevin.llm.llm import LLM
config = load_app_config()
from opendevin.core.main import create_runtime, run_controller
from opendevin.events.action import CmdRunAction
from opendevin.events.observation import CmdOutputObservation
from opendevin.runtime.runtime import Runtime
AGENT_CLS_TO_FAKE_USER_RESPONSE_FN = {
'CodeActAgent': partial(
'CodeActAgent': functools.partial(
codeact_user_response, encapsulate_solution=True, try_parse=None
),
}
@@ -36,111 +41,219 @@ AGENT_CLS_TO_INST_SUFFIX = {
'CodeActAgent': 'When you think you have fixed the issue through code changes, please run the following command: <execute_bash> exit </execute_bash>.\n'
}
FILE_EXT_MAP = {
'python': 'py',
'java': 'java',
'c': 'c',
'cpp': 'cpp',
'javascript': 'js',
'typescript': 'ts',
}
def get_config(
metadata: EvalMetadata,
) -> AppConfig:
BIOCODER_BENCH_CONTAINER_IMAGE = 'public.ecr.aws/i5g0m1f6/eval_biocoder:v1.0'
config = AppConfig(
default_agent=metadata.agent_class,
run_as_devin=False,
runtime='eventstream',
max_iterations=metadata.max_iterations,
sandbox=SandboxConfig(
container_image=BIOCODER_BENCH_CONTAINER_IMAGE,
enable_auto_lint=True,
use_host_network=False,
update_source_code=True,
),
# do not mount workspace
workspace_base=None,
workspace_mount_path=None,
)
config.set_llm_config(metadata.llm_config)
return config
async def initialize_runtime(
runtime: Runtime,
instance: BiocoderData, # this argument is not required
):
"""Initialize the runtime for the agent.
This function is called before the runtime is used to run the agent.
"""
logger.info(f"{'-' * 50} BEGIN Runtime Initialization Fn {'-' * 50}")
obs: CmdOutputObservation
file_ext = FILE_EXT_MAP[instance.language.lower()]
action = CmdRunAction(command='mkdir -p /workspace && mkdir -p /testing_files')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
with tempfile.TemporaryDirectory() as tmpdir:
context_path = os.path.join(tmpdir, 'context.' + file_ext)
with open(context_path, 'w') as f:
f.write(instance.contextCode)
await runtime.copy_to(context_path, '/testing_files')
golden_path = os.path.join(tmpdir, 'golden.' + file_ext)
with open(golden_path, 'w') as f:
f.write(instance.goldenCode)
await runtime.copy_to(golden_path, '/testing_files')
testcase_json = {
'test_case_id': instance.test_case_id,
'num_cases': 1000,
'language': instance.language.lower(),
}
testcase_path = os.path.join(tmpdir, 'testcase_biocoder.json')
with open(testcase_path, 'w') as f:
f.write(json.dumps(testcase_json, indent=4))
await runtime.copy_to(testcase_path, '/testing_files')
# setup paths
remove_code_script = os.path.join(
os.path.dirname(__file__), 'scripts', 'setup', 'remove_code.py'
)
await runtime.copy_to(remove_code_script, '/testing_files')
action = CmdRunAction(command='cd /workspace')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
# download repository archive
repository_url = f"https://biocoder.lilbillbiscuit.com/repos/{instance.repository.split('/')[1]}.zip"
action = CmdRunAction(command='wget -O repo.zip ' + repository_url)
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0, f'Failed to download the repository: {obs.content}'
# unzip the repository
action = CmdRunAction(command='unzip -o -q repo.zip && rm repo.zip')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0, f'Failed to unzip the repository: {obs.content}'
# chmod 777
action = CmdRunAction(command='chmod -R 777 /workspace')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0, f'Failed to chmod the files: {obs.content}'
# remove code for evaluation instance
target_filepath = os.path.join(
'/workspace', instance.repository.split('/')[1], instance.filePath
)
line_start = instance.lineStart
line_end = instance.lineEnd
language = instance.language.lower()
action = CmdRunAction(
command=f'python3 /testing_files/remove_code.py --target_filepath {target_filepath} --line_start {line_start} --line_end {line_end} --language {language}'
)
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0, f'Failed to remove the code: {obs.content}'
logger.info(f"{'-' * 50} END Runtime Initialization Fn {'-' * 50}")
async def complete_runtime(
runtime: Runtime,
instance: pd.Series, # this argument is not required, but it is used to get the workspace_dir_name
) -> dict[str, Any]:
"""Complete the runtime for the agent.
This function is called after the agent has finished running.
If you need to do something in the sandbox to get the correctness metric after
the agent has run, modify this function.
"""
logger.info(f"{'-' * 50} BEGIN Runtime Completion Fn {'-' * 50}")
obs: CmdOutputObservation
def get_test_result(instance, sandbox, workspace_dir_name):
test_result = {'result': {}, 'metadata': {}}
try:
code = sandbox.get_changed_code(include_signature=True)
sandbox.copy_changed_code()
copy_changed_code_script = os.path.join(
os.path.dirname(__file__), 'scripts', 'setup', 'copy_changed_code.py'
)
await runtime.copy_to(copy_changed_code_script, '/testing_files')
file_ext = FILE_EXT_MAP[instance.language.lower()]
target_filepath = os.path.join(
'/workspace', instance.repository.split('/')[1], instance.filePath
)
generated_path = os.path.join('/testing_files', 'generated.' + file_ext)
action = CmdRunAction(
command=f'python3 /testing_files/copy_changed_code.py --target_filepath {target_filepath} --generated_code_filepath {generated_path} --line_start {instance.lineStart} --include_signature'
)
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
if obs.exit_code == 0:
test_result['metadata']['1_copy_change_success'] = True
action = CmdRunAction(command=f'cat {generated_path}', keep_prompt=False)
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
code = obs.content
test_result['metadata']['1_copy_change_code'] = code
except Exception:
logger.error('Error fetching changed code for this instance')
else:
test_result['metadata']['1_copy_change_success'] = False
test_result['metadata']['1_copy_change_code'] = None
exit_code, output = sandbox.execute_and_check(
'cd /testing',
'Failed to cd /testing',
)
logger.info(f'cd $REPO_PATH: {output}')
action = CmdRunAction(command='cd /testing_files')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
exit_code, output = sandbox.execute_and_check(
'whoami',
'Failed to run whoami',
action = CmdRunAction(
command='/home/devin/mambaforge/bin/mamba run -n test python3 /testing/start_test_opendevin.py'
)
logger.info(f'whoami: {output}')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
assert obs.exit_code == 0
exit_code, output = sandbox.execute(
'/home/devin/mambaforge/bin/mamba run -n test python3 /testing/start_test_opendevin.py'
action = CmdRunAction(
command='cat /testing_files/results_biocoder.json', keep_prompt=False
)
logger.info(f'$TEST_CMD:\n{output}')
exit_code, output = sandbox.execute_and_check(
'cat /testing_files/results_biocoder.json', 'Failed to read the result file'
)
if exit_code == 0:
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
if obs.exit_code == 0:
test_result['metadata']['2_run_test_success'] = True
test_result['metadata']['2_run_test_result'] = str(output)
test_result['metadata']['2_run_test_result'] = str(obs.content)
json_obj = json.loads(obs.content)
test_result['result'] = json_obj['result']
else:
test_result['metadata']['2_run_test_success'] = False
test_result['metadata']['2_run_test_result'] = str(output)
json_obj = json.loads(output)
test_result['result'] = json_obj['result']
test_result['metadata']['2_run_test_result'] = str(obs.content)
logger.info(f"{'-' * 50} END Runtime Completion Fn {'-' * 50}")
return test_result
def process_instance(
async def process_instance(
instance: pd.Series,
metadata: EvalMetadata,
reset_logger: bool = True,
):
# Create the agent
agent = Agent.get_cls(metadata.agent_class)(llm=LLM(config=metadata.llm_config))
) -> EvalOutput:
config = get_config(metadata)
instance = BiocoderData(**instance)
print(instance)
workspace_dir_name = (
f'{instance.repository}__{instance.test_case_id[:10]}__{os.getpid()}'.replace(
'/', '__'
)
)
workspace_mount_path = os.path.join(config.workspace_base, workspace_dir_name)
# create process-specific workspace dir
# if `not skip_workspace_mount` - we will create a workspace directory for EACH process
# so that different agent don't interfere with each other.
workspace_mount_path = os.path.join(workspace_mount_path, str(os.getpid()))
pathlib.Path(workspace_mount_path).mkdir(parents=True, exist_ok=True)
instance_id = f'{instance.repository}__{instance.instance_id[:10]}'
# Setup the logger properly, so you can run multi-processing to parallize the evaluation
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
if reset_logger:
# Set up logger
log_file = os.path.join(
metadata.eval_output_dir, 'logs', f'instance_{instance.test_case_id}.log'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
# add back the console handler to print ONE line
logger.addHandler(get_console_handler())
logger.info(
f'Starting evaluation for instance {instance.test_case_id}.\nHint: run "tail -f {log_file}" to see live logs in a seperate shell'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
file_handler = logging.FileHandler(log_file)
file_handler.setFormatter(
logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
)
logger.addHandler(file_handler)
logger.info(f'Process-specific workspace mounted at {workspace_mount_path}')
# NOTE: this is something special we do for SWE-Bench due to the reason described in the previous section
# You can omit this if you don't need to setup specialized sandbox
workspace_dir_name = f'{instance.repository}__{instance.test_case_id[:10]}'.replace(
'/', '__'
)
sandbox = BiocoderSSHBox.get_box_for_instance(
instance,
workspace_dir_name,
skip_workspace_mount=False,
workspace_mount_path=workspace_mount_path,
sandbox_plugins=agent.sandbox_plugins,
)
sandbox.remove_code()
log_dir = os.path.join(metadata.eval_output_dir, 'infer_logs')
reset_logger_for_multiprocessing(logger, instance_id, log_dir)
else:
logger.info(f'Starting evaluation for instance {instance_id}.')
# Prepare instruction
instruction = (
@ -160,80 +273,76 @@ def process_instance(
'Make sure to include proper formatting in Java and Python, including correct braces and/or indentation.\n'
)
# NOTE: You can actually set slightly different instruction for different agents
instruction += AGENT_CLS_TO_INST_SUFFIX[agent.__class__.__name__]
instruction += AGENT_CLS_TO_INST_SUFFIX[metadata.agent_class]
# use a session id for concurrent evaluation
sid = instance.test_case_id.replace('/', '__')
sid = instance.instance_id.replace('/', '__')
runtime = await create_runtime(config, sid=sid)
await initialize_runtime(runtime, instance)
# Here's how you can run the agent (similar to the `main` function) and get the final task state
config.max_iterations = metadata.max_iterations
state: State | None = asyncio.run(
run_controller(
config=config,
task_str=instruction,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN[
agent.__class__.__name__
],
agent=agent,
sandbox=sandbox,
sid=sid,
)
state: State | None = await run_controller(
config=config,
task_str=instruction,
runtime=runtime,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN[metadata.agent_class],
)
test_result = get_test_result(instance, sandbox, workspace_dir_name)
if state is None:
raise ValueError('State should not be None.')
metrics = state.metrics.get() if state.metrics else None
test_result = await complete_runtime(runtime, instance)
metrics = state.metrics.get() if state.metrics else None
# history is now available as a stream of events, rather than list of pairs of (Action, Observation)
# for compatibility with the existing output format, we can remake the pairs here
# remove when it becomes unnecessary
histories = state.history.compatibility_for_eval_history_pairs()
# Save the output
output = {
'test_case_id': instance.test_case_id,
'biocoder_instance': instance.to_dict(),
'instruction': instruction,
'generated': test_result['metadata']['1_copy_change_code'],
'metadata': metadata.model_dump(),
'history': histories,
'metrics': metrics,
'error': state.last_error if state and state.last_error else None,
'test_result': test_result,
}
test_result['generated'] = test_result['metadata']['1_copy_change_code']
# Close the sandbox
sandbox.close()
# Save the output
output = EvalOutput(
instance_id=instance.instance_id,
instance=instance.to_dict(),
instruction=instruction,
metadata=metadata,
history=histories,
metrics=metrics,
error=state.last_error if state and state.last_error else None,
test_result=test_result,
)
return output
if __name__ == '__main__':
id_column = 'test_case_id'
args = parse_arguments()
dataset = load_dataset('lilbillbiscuit/biocoder_public')
biocoder_tests = dataset['test'].to_pandas()
llm_config = get_llm_config_arg(args.llm_config) if args.llm_config else config.llm
logger.info(f'Config for evaluation: {config}')
dataset = load_dataset('lilbillbiscuit/biocoder_public')
biocoder_tests = dataset['train'].to_pandas()
biocoder_tests['instance_id'] = biocoder_tests['test_case_id']
llm_config = None
if args.llm_config:
llm_config = get_llm_config_arg(args.llm_config)
if llm_config is None:
raise ValueError(f'Could not find LLM config: --llm_config {args.llm_config}')
metadata = make_metadata(
llm_config,
args.dataset_name,
'biocoder',
args.agent_cls,
args.max_iterations,
args.eval_note,
args.eval_output_dir,
)
output_file = os.path.join(metadata.eval_output_dir, 'output.jsonl')
instances = prepare_dataset(dataset, output_file, args.eval_n_limit, id_column)
instances = prepare_dataset(biocoder_tests, output_file, args.eval_n_limit)
run_evaluation(
instances,
metadata,
output_file,
args.eval_num_workers,
process_instance,
id_column,
asyncio.run(
run_evaluation(
instances, metadata, output_file, args.eval_num_workers, process_instance
)
)
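The biocoder harness above now follows the same per-instance pattern as the other refactored benchmarks. As a reading aid only, here is a minimal sketch of that flow using the helpers defined in this file; the instruction string is a placeholder and this snippet is not part of the diff:

```python
# Sketch of the refactored per-instance flow (not part of the PR); assumes the
# imports and helpers from run_infer.py above, and a made-up instruction string.
async def run_single_instance(instance, metadata) -> dict:
    config = get_config(metadata)                    # fresh AppConfig, no global state
    sid = instance.instance_id.replace('/', '__')    # per-instance session id
    runtime = await create_runtime(config, sid=sid)  # EventStreamRuntime container
    await initialize_runtime(runtime, instance)      # copy code/data into the sandbox
    state = await run_controller(
        config=config,
        task_str='<instruction built from the instance>',
        runtime=runtime,
        fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN[metadata.agent_class],
    )
    # scoring happens inside the sandbox after the agent is done
    return await complete_runtime(runtime, instance)
```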

evaluation/biocoder/scripts/run_infer.sh Normal file → Executable file

@ -0,0 +1,45 @@
import argparse
def get_changed_code(target_filepath, line_start, include_signature=False):
# copies changed code into /testing_files/
# Note that this does NOT copy the function signature
selected_lines = []
offset = 1 if include_signature else 0
with open('/testing_files/first_line_after_removed.txt', 'r') as f:
first_line_after_removed = f.read()
if first_line_after_removed is None:
print('First line after removed is None')
with open(target_filepath, 'r') as f:
lines = f.read().split('\n')
for i in range(line_start - offset, len(lines)):
if lines[i].strip() == first_line_after_removed.strip():
break
selected_lines.append(lines[i])
text = '\n'.join(selected_lines)
return text
def copy_changed_code(
target_filepath, generated_code_filepath, line_start, include_signature=False
):
changed_code = get_changed_code(target_filepath, line_start, include_signature)
with open(generated_code_filepath, 'w') as f:
f.write(changed_code)
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('--target_filepath', type=str, required=True)
parser.add_argument('--generated_code_filepath', type=str, required=True)
parser.add_argument('--line_start', type=int, required=True)
parser.add_argument('--include_signature', action='store_true')
args = parser.parse_args()
copy_changed_code(
args.target_filepath,
args.generated_code_filepath,
args.line_start,
args.include_signature,
)
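To make the selection logic in `get_changed_code` concrete, the following self-contained snippet re-runs the same loop on invented file contents (no sandbox files are read; it is an illustration, not part of the script):

```python
# Invented example: the agent wrote add_one(), and the setup script previously
# recorded 'def unrelated():' as the first line after the removed region.
lines = ['def add_one(x):', '    return x + 1', '', 'def unrelated():']
first_line_after_removed = 'def unrelated():'
line_start, include_signature = 1, True  # signature is on line 1 (1-indexed)

selected_lines = []
offset = 1 if include_signature else 0
for i in range(line_start - offset, len(lines)):
    if lines[i].strip() == first_line_after_removed.strip():
        break
    selected_lines.append(lines[i])

print('\n'.join(selected_lines))  # the generated function plus its trailing blank line
```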


@ -0,0 +1,74 @@
import argparse
import os
import re
from collections import defaultdict
def get_likely_indent_size(array_of_tabs) -> int:
sizes = defaultdict(int)
for i in range(len(array_of_tabs) - 1):
diff = array_of_tabs[i + 1] - array_of_tabs[i]
if diff > 0:
sizes[diff] += 1
if len(sizes) == 0:
return 4
return int(max(sizes, key=sizes.get))
def get_target_filepath(self):
target_filepath = os.path.join(
self.workspace_mount_path,
self.biocoder_instance.repository.split('/')[1],
self.biocoder_instance.filePath,
)
return target_filepath
def remove_code(target_filepath: str, line_start: int, line_end: int, language: str):
comment_prefix = {'python': '#', 'java': '//'}
with open(target_filepath, 'r') as f:
lines = f.read().split('\n')
# print("="*10+"ORIGINAL"+"="*10)
# print("\n".join(lines))
signature_line = lines[line_start - 1]
# get the number of tabs
def get_indent_size(s: str):
return len(re.match(r'\s*', s).group())
indent_sizes = list(map(get_indent_size, lines))
indent_size = get_likely_indent_size(indent_sizes)
comment_indent_size = get_indent_size(signature_line) + indent_size
lines = (
lines[:line_start]
+ [
f"{' '*comment_indent_size+comment_prefix[language.lower()]}TODO: replace with your code here"
]
+ ([''] * 2)
+ lines[line_end:]
)
first_line_after_removed_index = line_start
while len(
lines[first_line_after_removed_index].strip()
) == 0 and first_line_after_removed_index < len(lines):
first_line_after_removed_index += 1
first_line_after_removed = lines[first_line_after_removed_index]
print('FIRST LINE AFTER REMOVED: ', first_line_after_removed)
with open('/testing_files/first_line_after_removed.txt', 'w') as f:
f.write(first_line_after_removed)
with open(target_filepath, 'w') as f:
f.write('\n'.join(lines))
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('--target_filepath', type=str, required=True)
parser.add_argument('--line_start', type=int, required=True)
parser.add_argument('--line_end', type=int, required=True)
parser.add_argument('--language', type=str, required=True)
args = parser.parse_args()
remove_code(args.target_filepath, args.line_start, args.line_end, args.language)
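A quick, invented example of the indentation inference above (assuming `get_likely_indent_size` from this script is importable):

```python
# Leading-whitespace widths for a file indented with 4 spaces (values invented).
indents = [0, 4, 8, 8, 4, 0]
# Positive diffs between consecutive lines are 4 (0->4) and 4 (4->8),
# so sizes == {4: 2} and the most common positive jump wins.
print(get_likely_indent_size(indents))  # -> 4
```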


@ -0,0 +1,36 @@
from dataclasses import dataclass
@dataclass
class BiocoderData:
instance_id: str
filePath: str
numLines: int
lineStart: int
lineEnd: int
signature: str
comment: str
content: str
repository: str
promptSummaryOnly: str
contextCode: str
goldenCode: str
test_case_id: str
language: str
def to_dict(self):
return {
'filePath': self.filePath,
'numLines': self.numLines,
'lineStart': self.lineStart,
'lineEnd': self.lineEnd,
'signature': self.signature,
'comment': self.comment,
'content': self.content,
'repository': self.repository,
'promptSummaryOnly': self.promptSummaryOnly,
'contextCode': self.contextCode,
'goldenCode': self.goldenCode,
'test_case_id': self.test_case_id,
'language': self.language,
}
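For reference, this mirrors how run_infer.py consumes a dataset row via `BiocoderData(**instance)`; the field values below are invented:

```python
# Invented row purely for illustration; real values come from the dataset.
row = {
    'instance_id': 'org__repoA__abc123', 'filePath': 'src/module.py', 'numLines': 12,
    'lineStart': 10, 'lineEnd': 22, 'signature': 'def add_one(x):', 'comment': '',
    'content': '', 'repository': 'org/repoA', 'promptSummaryOnly': '',
    'contextCode': '', 'goldenCode': '', 'test_case_id': 'abc123', 'language': 'python',
}
instance = BiocoderData(**row)
print(instance.to_dict()['filePath'])  # -> 'src/module.py'
```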


@ -2,43 +2,14 @@
Implements evaluation of agents on BIRD introduced in [Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs](https://arxiv.org/abs/2305.03111). Please see [here](https://bird-bench.github.io/) for the reference implementation used in the paper.
## Setup Environment
## Setup Environment and LLM Configuration
Please follow [this document](https://github.com/OpenDevin/OpenDevin/blob/main/Development.md) to setup local develop environment for OpenDevin.
## Configure OpenDevin and your LLM
Create a `config.toml` file if it does not exist at the root of the workspace.
Add the following configurations:
```toml
[core]
max_iterations = 100
cache_dir = "/tmp/cache"
ssh_hostname = "localhost"
[sandbox]
enable_auto_lint = true
# TODO: Change these to the model you want to evaluate
[llm.eval_gpt4_1106_preview]
model = "gpt-4-1106-preview"
api_key = "XXX"
temperature = 0.0
[llm.eval_some_openai_compatible_model]
model = "openai/MODEL_NAME"
base_url = "https://OPENAI_COMPATIBLE_URL/v1"
api_key = "XXX"
temperature = 0.0
```
Please follow the instructions [here](../README.md#setup) to set up your local development environment and LLM.
## Run Inference on Bird
```bash
./evaluation/bird/scripts/run_infer.sh eval_gpt4_1106_preview [model_config] [git-version]
./evaluation/bird/scripts/run_infer.sh [model_config] [git-version]
```
- `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for your


@ -1,12 +1,12 @@
import asyncio
import json
import logging
import os
import pathlib
import re
import shutil
import sqlite3
import subprocess
import zipfile
from typing import Any
import pandas as pd
from datasets import load_dataset
@ -15,20 +15,24 @@ from tqdm import tqdm
from evaluation.utils.shared import (
EvalMetadata,
EvalOutput,
make_metadata,
prepare_dataset,
reset_logger_for_multiprocessing,
run_evaluation,
)
from opendevin.controller.agent import Agent
from opendevin.controller.state.state import State
from opendevin.core.config import get_llm_config_arg, load_app_config, parse_arguments
from opendevin.core.logger import get_console_handler
from opendevin.core.config import (
AppConfig,
SandboxConfig,
get_llm_config_arg,
parse_arguments,
)
from opendevin.core.logger import opendevin_logger as logger
from opendevin.core.main import run_controller
from opendevin.events.action import MessageAction
from opendevin.llm.llm import LLM
config = load_app_config()
from opendevin.core.main import create_runtime, run_controller
from opendevin.events.action import CmdRunAction, MessageAction
from opendevin.events.observation import CmdOutputObservation
from opendevin.runtime.runtime import Runtime
def codeact_user_response(state: State) -> str:
@ -62,6 +66,28 @@ AGENT_CLS_TO_INST_SUFFIX = {
}
def get_config(
metadata: EvalMetadata,
) -> AppConfig:
config = AppConfig(
default_agent=metadata.agent_class,
run_as_devin=False,
runtime='eventstream',
max_iterations=metadata.max_iterations,
sandbox=SandboxConfig(
container_image='ubuntu:22.04',
enable_auto_lint=True,
use_host_network=False,
update_source_code=True,
),
# do not mount workspace
workspace_base=None,
workspace_mount_path=None,
)
config.set_llm_config(metadata.llm_config)
return config
def execute_sql(db_path, gen_sql, gold_sql):
"""Execute the generated SQL and the ground truth SQL and compare the results."""
with sqlite3.connect(db_path) as conn:
@ -76,12 +102,213 @@ def execute_sql(db_path, gen_sql, gold_sql):
return res
def get_test_result(instance, path, timeout=30):
LOCAL_DATASET_PATH = os.path.join(os.path.dirname(__file__), 'data')
def load_bird():
"""Main function to handle the flow of downloading, processing, and loading the bird dataset."""
def _download_bird():
"""Downloads and extracts the bird dataset from a specified URL into a local directory."""
devset_path = os.path.join(LOCAL_DATASET_PATH, 'dev')
if not os.path.exists(devset_path):
logger.info(
f'{LOCAL_DATASET_PATH} folder does not exist, starting download and extraction...'
)
os.makedirs(LOCAL_DATASET_PATH, exist_ok=True)
download_url = 'https://bird-bench.oss-cn-beijing.aliyuncs.com/dev.zip'
download_path = os.path.join(LOCAL_DATASET_PATH, 'dev.zip')
if not os.path.exists(download_path):
logger.info('Start Downloading...')
subprocess.run(['wget', download_url, '-O', download_path])
logger.info('Download completed.')
devset_path = os.path.join(LOCAL_DATASET_PATH, 'dev')
if not os.path.exists(devset_path):
logger.info('Start Extracting...')
os.makedirs(devset_path, exist_ok=True)
with zipfile.ZipFile(download_path, 'r') as zip_ref:
zip_ref.extractall(devset_path)
# move everything in 'dev_20240627' to the root folder
for file in os.listdir(os.path.join(devset_path, 'dev_20240627')):
os.rename(
os.path.join(devset_path, 'dev_20240627', file),
os.path.join(devset_path, file),
)
os.rmdir(os.path.join(devset_path, 'dev_20240627'))
logger.info('Extraction completed.')
# extract databases
database_path = os.path.join(devset_path, 'dev_databases.zip')
assert os.path.exists(database_path)
logger.info('Start Extracting...')
with zipfile.ZipFile(database_path, 'r') as zip_ref:
zip_ref.extractall(devset_path)
logger.info('Extraction completed.')
else:
logger.info(f'{LOCAL_DATASET_PATH} folder already exists.')
return devset_path
def _extract_create_table_prompt(db_path, limit_value=0):
"""Generates a SQL prompt with CREATE TABLE statements and sample data from the database."""
table_query = "SELECT * FROM sqlite_master WHERE type='table';"
tables = sqlite3.connect(db_path).cursor().execute(table_query).fetchall()
prompt = ''
for table in tables:
table_name = table[1]
create_table_statement = table[-1]
table_info_query = f'PRAGMA table_info(`{table_name}`);'
top_k_row_query = f'SELECT * FROM {table_name} LIMIT {limit_value};'
try:
headers = [
x[1]
for x in sqlite3.connect(db_path)
.cursor()
.execute(table_info_query)
.fetchall()
]
except Exception:
logger.error(f'Error Connection: {table_info_query}, {top_k_row_query}')
exit(0)
prompt += create_table_statement + ';\n'
if limit_value > 0:
top_k_rows = (
sqlite3.connect(db_path)
.cursor()
.execute(top_k_row_query)
.fetchall()
)
prompt += (
f"/*\n3 example rows:\n{top_k_row_query}\n{' '.join(headers)}\n"
)
for row in top_k_rows:
row = [str(x) for x in row]
row = [x if x is not None else '' for x in row]
prompt += ' '.join(row) + '\n'
prompt += '*/\n'
prompt += '\n'
return prompt
def _create_prompt(e, database_path):
"""Create a prompt for the given example"""
db_id = e['db_id']
db_path = pathlib.Path(database_path) / db_id / f'{db_id}.sqlite'
# Extract the CREATE TABLE statements and sample data from the database
prompt = _extract_create_table_prompt(db_path)
prompt += f"-- External Knowledge: {e['evidence']}\n\n"
prompt += '-- Using valid SQLite and understanding External Knowledge, answer the following questions for the tables provided above.\n\n'
prompt += '-- Using valid SQLite, answer the following questions for the tables provided above.\n'
prompt += f"Question: {e['question']}\n"
return prompt
def _process_bird(dataset_path):
"""Processes the raw bird dataset into a structured format and saves it as JSON."""
processed_path = os.path.join(LOCAL_DATASET_PATH, 'dev', 'processed_dev.json')
if not os.path.exists(processed_path):
logger.info(
f'{processed_path} folder does not exist, starting processing...'
)
raw_data_path = os.path.join(LOCAL_DATASET_PATH, 'dev', 'dev.json')
database_path = os.path.join(LOCAL_DATASET_PATH, 'dev', 'dev_databases')
processed_data = []
with pathlib.Path(raw_data_path).open('r') as f:
data = json.load(f)
for e in tqdm(data):
item = {
'instance_id': f'{len(processed_data)}',
'db_path': os.path.join(
database_path, e['db_id'], f"{e['db_id']}.sqlite"
),
'db_id': e['db_id'],
'instruction': _create_prompt(e, database_path),
'SQL': e['SQL'],
}
processed_data.append(item)
with pathlib.Path(processed_path).open('w') as f:
json.dump(processed_data, f, indent=2)
logger.info(f'Processed data saved to {processed_path}')
else:
logger.info(f'{processed_path} folder already exists.')
bird_dataset = load_dataset('json', data_files={'test': processed_path})
return bird_dataset
raw_dataset_path = _download_bird()
bird_dataset = _process_bird(raw_dataset_path)
return bird_dataset
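For orientation, each processed BIRD example produced by `_process_bird` above is a flat dict; a schematic (and invented) instance looks like this:

```python
# Schematic of one processed item (all values invented/abridged).
example_item = {
    'instance_id': '0',
    'db_path': 'evaluation/bird/data/dev/dev_databases/some_db/some_db.sqlite',
    'db_id': 'some_db',
    # instruction = CREATE TABLE statements + external knowledge + question,
    # as assembled by _create_prompt above
    'instruction': 'CREATE TABLE t (...);\n-- External Knowledge: ...\nQuestion: ...',
    'SQL': 'SELECT ...',  # gold SQL used later for scoring
}
```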
async def initialize_runtime(
runtime: Runtime,
instance: pd.Series, # this argument is not required
):
"""Initialize the runtime for the agent.
This function is called before the runtime is used to run the agent.
"""
logger.info(f"{'-' * 50} BEGIN Runtime Initialization Fn {'-' * 50}")
obs: CmdOutputObservation
# Copy the database to the workspace
db_file = os.path.join(
LOCAL_DATASET_PATH,
'dev',
'dev_databases',
instance.db_id,
f'{instance.db_id}.sqlite',
)
await runtime.copy_to(db_file, '/workspace')
# Check the database is copied
action = CmdRunAction(
command='cd /workspace && ls -l',
keep_prompt=False,
)
obs = await runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
assert obs.exit_code == 0
assert f'{instance.db_id}.sqlite' in obs.content
logger.info(f"{'-' * 50} END Runtime Initialization Fn {'-' * 50}")
async def complete_runtime(
runtime: Runtime,
instance: pd.Series, # this argument is not required, but it is used to get the workspace_dir_name
) -> dict[str, Any]:
"""Complete the runtime for the agent.
This function is called after the agent has finished running, just before the runtime is closed.
If you need to do something in the sandbox to get the correctness metric after
the agent has run, modify this function.
"""
logger.info(f"{'-' * 50} BEGIN Runtime Completion Fn {'-' * 50}")
obs: CmdOutputObservation
timeout = 30
test_result = {'result': {}, 'metadata': {}}
# Read the generated python file
with open(path, 'r') as f:
gen_file = f.read()
instance_id = instance.instance_id.replace('/', '__')
path = os.path.join('/workspace', f'{instance_id}.py')
action = CmdRunAction(
command=f'cat {path}',
keep_prompt=False,
)
obs = await runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
if obs.exit_code != 0:
test_result['result'] = {'passed': 0, 'status': 'error'}
return test_result
gen_file = obs.content.strip().replace('\r\n', '\n')
# Extract the SQL from the python file
gen_sql = ''
@ -96,7 +323,13 @@ def get_test_result(instance, path, timeout=30):
# Execute the SQL
try:
res = func_timeout(
timeout, execute_sql, args=(instance.db_path, gen_sql, gold_sql)
timeout,
execute_sql,
args=(
instance.db_path,
gen_sql,
gold_sql,
),
)
status = 'success'
except FunctionTimedOut:
@ -114,68 +347,28 @@ def get_test_result(instance, path, timeout=30):
'gen_sql': gen_sql,
'gold_sql': gold_sql,
}
logger.info(f"{'-' * 50} END Runtime Completion Fn {'-' * 50}")
return test_result
def process_instance(
async def process_instance(
instance: pd.Series,
metadata: EvalMetadata,
reset_logger: bool = True,
):
# Create the agent
agent = Agent.get_cls(metadata.agent_class)(llm=LLM(config=metadata.llm_config))
workspace_mount_path = os.path.join(
config.workspace_mount_path, 'bird_eval_workspace'
)
workspace_mount_path = os.path.join(workspace_mount_path, str(os.getpid()))
pathlib.Path(workspace_mount_path).mkdir(parents=True, exist_ok=True)
# reset workspace to config
config.workspace_mount_path = workspace_mount_path
# Copy the database to the workspace
db_root = os.path.join(
config.workspace_base, 'evaluation_bird/dev/dev_databases', instance.db_id
)
target_path = os.path.join(workspace_mount_path, f'{instance.db_id}')
if not os.path.exists(target_path):
logger.info(f'Copying database from {db_root} to {target_path}...')
shutil.copytree(db_root, target_path)
# Set up the database path
database_path = os.path.join(instance.db_id, f'{instance.db_id}.sqlite')
) -> EvalOutput:
config = get_config(metadata)
# use session id for concurrent evaluation
sid = instance.task_id.replace('/', '__')
instance_id = instance.instance_id.replace('/', '__')
# Set up the logger properly, so you can run multi-processing to parallelize the evaluation
if reset_logger:
# Set up logger
log_file = os.path.join(
metadata.eval_output_dir,
'logs',
f'instance_{sid}.log',
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
# add back the console handler to print ONE line
logger.addHandler(get_console_handler())
logger.info(
f'Starting evaluation for instance {instance.task_id}.\nLOG: tail -f {log_file}'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
file_handler = logging.FileHandler(log_file)
file_handler.setFormatter(
logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
)
logger.addHandler(file_handler)
logger.info(f'Process-specific workspace mounted at {workspace_mount_path}')
log_dir = os.path.join(metadata.eval_output_dir, 'infer_logs')
reset_logger_for_multiprocessing(logger, instance_id, log_dir)
else:
logger.info(f'Starting evaluation for instance {instance_id}.')
# Create file with BIRD instance
database_path = os.path.join('/workspace', f'{instance.db_id}.sqlite')
statements = f"""
import sqlite3
def execute_sql(db_path, sql):
@ -192,12 +385,12 @@ def process_instance(
result = execute_sql(db_path, sql)
print(result)
"""
path = os.path.join(config.workspace_mount_path, f'{sid}.py')
instruction = (
f'You are a SQL expert and need to complete the following text-to-SQL tasks.'
f'\n\n{instance.instruction}\n\n'
'Please write the SQL in one line without line breaks.'
f'And write a new python file named {sid}.py to call the SQL you wrote.'
f'And write a new python file named {instance_id}.py to call the SQL you wrote.'
'You need to follow the code template below:'
f'\n\n{statements}\n\n'
'Environment has been set up for you to start working.'
@ -208,24 +401,21 @@ def process_instance(
'You SHOULD INCLUDE PROPER INDENTATION in your edit commands.\n'
)
# NOTE: You can actually set slightly different instruction for different agents
instruction += AGENT_CLS_TO_INST_SUFFIX[agent.__class__.__name__]
instruction += AGENT_CLS_TO_INST_SUFFIX[metadata.agent_class]
runtime = await create_runtime(config, sid=instance_id)
await initialize_runtime(runtime, instance)
# Here's how you can run the agent (similar to the `main` function) and get the final task state
config.max_iterations = metadata.max_iterations
state: State | None = asyncio.run(
run_controller(
config=config,
task_str=instruction,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN[
agent.__class__.__name__
],
agent=agent,
sid=sid,
)
state: State | None = await run_controller(
config=config,
task_str=instruction,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN[metadata.agent_class],
runtime=runtime,
)
# ======= Attempt to evaluate the agent's edits =======
test_result = get_test_result(instance, path)
test_result = await complete_runtime(runtime, instance)
# If you are working on some simpler benchmark that only evaluates the final model output (e.g., in a MessageAction)
# You can simply get the LAST `MessageAction` from the returned `state.history` and parse it for evaluation.
@ -239,162 +429,43 @@ def process_instance(
histories = state.history.compatibility_for_eval_history_pairs()
# Save the output
output = {
'task_id': instance.task_id,
'instruction': instruction,
'metadata': metadata.model_dump(),
'history': histories,
'metrics': metrics,
'error': state.last_error if state and state.last_error else None,
'test_result': test_result,
}
output = EvalOutput(
instance_id=instance.instance_id,
instruction=instruction,
metadata=metadata,
history=histories,
metrics=metrics,
error=state.last_error if state and state.last_error else None,
test_result=test_result,
)
return output
def load_bird():
"""Main function to handle the flow of downloading, processing, and loading the bird dataset."""
raw_dataset_path = download_bird()
bird_dataset = process_bird(raw_dataset_path)
return bird_dataset
def download_bird():
"""Downloads and extracts the bird dataset from a specified URL into a local directory."""
dataset_path = os.path.join(config.workspace_base, 'evaluation_bird')
devset_path = os.path.join(dataset_path, 'dev')
if not os.path.exists(dataset_path):
logger.info(
f'{dataset_path} folder does not exist, starting download and extraction...'
)
os.makedirs(dataset_path, exist_ok=True)
download_url = 'https://bird-bench.oss-cn-beijing.aliyuncs.com/dev.zip'
download_path = os.path.join(dataset_path, 'dev.zip')
logger.info('Start Downloading...')
subprocess.run(['wget', download_url, '-O', download_path])
logger.info('Download completed.')
logger.info('Start Extracting...')
subprocess.run(['unzip', download_path, '-d', dataset_path])
# extract databases
devset_path = os.path.join(dataset_path, 'dev')
database_path = os.path.join(devset_path, 'dev_databases.zip')
subprocess.run(['unzip', database_path, '-d', devset_path])
logger.info('Extraction completed.')
else:
logger.info(f'{dataset_path} folder already exists.')
return devset_path
def process_bird(dataset_path):
"""Processes the raw bird dataset into a structured format and saves it as JSON."""
processed_path = os.path.join(dataset_path, 'processed_dev.json')
if not os.path.exists(processed_path):
logger.info(f'{processed_path} folder does not exist, starting processing...')
raw_data_path = os.path.join(dataset_path, 'dev.json')
database_path = os.path.join(dataset_path, 'dev_databases')
processed_data = []
with pathlib.Path(raw_data_path).open('r') as f:
data = json.load(f)
for e in tqdm(data):
item = {
'task_id': f'{len(processed_data)}',
'db_path': os.path.join(
database_path, e['db_id'], f"{e['db_id']}.sqlite"
),
'db_id': e['db_id'],
'instruction': create_prompt(e, database_path),
'SQL': e['SQL'],
}
processed_data.append(item)
with pathlib.Path(processed_path).open('w') as f:
json.dump(processed_data, f, indent=2)
logger.info(f'Processed data saved to {processed_path}')
else:
logger.info(f'{processed_path} folder already exists.')
bird_dataset = load_dataset('json', data_files={'test': processed_path})
return bird_dataset
def extract_create_table_prompt(db_path, limit_value=0):
"""Generates a SQL prompt with CREATE TABLE statements and sample data from the database."""
table_query = "SELECT * FROM sqlite_master WHERE type='table';"
tables = sqlite3.connect(db_path).cursor().execute(table_query).fetchall()
prompt = ''
for table in tables:
table_name = table[1]
create_table_statement = table[-1]
table_info_query = f'PRAGMA table_info(`{table_name}`);'
top_k_row_query = f'SELECT * FROM {table_name} LIMIT {limit_value};'
try:
headers = [
x[1]
for x in sqlite3.connect(db_path)
.cursor()
.execute(table_info_query)
.fetchall()
]
except Exception:
logger.error(f'Error Connection: {table_info_query}, {top_k_row_query}')
exit(0)
prompt += create_table_statement + ';\n'
if limit_value > 0:
top_k_rows = (
sqlite3.connect(db_path).cursor().execute(top_k_row_query).fetchall()
)
prompt += (
f"/*\n3 example rows:\n{top_k_row_query}\n{' '.join(headers)}\n"
)
for row in top_k_rows:
row = [str(x) for x in row]
row = [x if x is not None else '' for x in row]
prompt += ' '.join(row) + '\n'
prompt += '*/\n'
prompt += '\n'
return prompt
def create_prompt(e, database_path):
"""Create a prompt for the given example"""
db_id = e['db_id']
db_path = pathlib.Path(database_path) / db_id / f'{db_id}.sqlite'
# Extract the CREATE TABLE statements and sample data from the database
prompt = extract_create_table_prompt(db_path)
prompt += f"-- External Knowledge: {e['evidence']}\n\n"
prompt += '-- Using valid SQLite and understanding External Knowledge, answer the following questions for the tables provided above.\n\n'
prompt += '-- Using valid SQLite, answer the following questions for the tables provided above.\n'
prompt += f"Question: {e['question']}\n"
return prompt
if __name__ == '__main__':
id_column = 'task_id'
args = parse_arguments()
bird_dataset = load_bird()
dataset = bird_dataset['test'].to_pandas()
dataset.rename(columns={'task_id': 'instance_id'}, inplace=True)
llm_config = get_llm_config_arg(args.llm_config) if args.llm_config else config.llm
logger.info(f'Config for evaluation: {config}')
llm_config = None
if args.llm_config:
llm_config = get_llm_config_arg(args.llm_config)
if llm_config is None:
raise ValueError(f'Could not find LLM config: --llm_config {args.llm_config}')
metadata = make_metadata(
llm_config,
args.dataset_name,
'BIRD',
args.agent_cls,
args.max_iterations,
args.eval_note,
args.eval_output_dir,
)
output_file = os.path.join(metadata.eval_output_dir, 'output.jsonl')
instances = prepare_dataset(dataset, output_file, args.eval_n_limit, id_column)
instances = prepare_dataset(dataset, output_file, args.eval_n_limit)
run_evaluation(
instances,
metadata,
output_file,
args.eval_num_workers,
process_instance,
id_column,
asyncio.run(
run_evaluation(
instances, metadata, output_file, args.eval_num_workers, process_instance
)
)
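The instruction above asks the agent to write `{instance_id}.py` following the embedded template, and `complete_runtime` later `cat`s that file and extracts the SQL from it. A hypothetical example of such a generated file is shown below; the body of the `execute_sql` helper is elided in the diff, so the version here is a generic sqlite3 query runner, not the repository's exact template:

```python
# Hypothetical /workspace/3.py as the agent might produce it (SQL invented).
import sqlite3

def execute_sql(db_path, sql):
    # generic stand-in for the elided template helper
    with sqlite3.connect(db_path) as conn:
        cursor = conn.cursor()
        cursor.execute(sql)
        return cursor.fetchall()

db_path = '/workspace/some_db.sqlite'    # the DB copied in by initialize_runtime
sql = 'SELECT COUNT(*) FROM some_table'  # written as a single line, per the instruction
result = execute_sql(db_path, sql)
print(result)
```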

evaluation/bird/scripts/run_infer.sh Normal file → Executable file

@ -5,30 +5,9 @@ Some of OpenDevin's agent supports agent delegation action, for example, CodeAct
This evaluation tests whether CodeActAgent can correctly delegate the instruction from WebArena and MiniWob benchmark to the BrowsingAgent.
If so, the browsing performance upper-bound of CodeActAgent will be the performance of BrowsingAgent.
## Setup Environment and LLM Configuration
## Setup Environment
Please follow [this document](https://github.com/OpenDevin/OpenDevin/blob/main/Development.md) to set up a local development environment for OpenDevin.
## Configure OpenDevin and your LLM
Create a `config.toml` file if it does not exist at the root of the workspace.
Add the following configurations:
```toml
# TODO: Change these to the model you want to evaluate
[llm.eval_gpt4_1106_preview_llm]
model = "gpt-4-1106-preview"
api_key = "XXX"
temperature = 0.0
[llm.eval_some_openai_compatible_model_llm]
model = "openai/MODEL_NAME"
base_url = "https://OPENAI_COMPATIBLE_URL/v1"
api_key = "XXX"
temperature = 0.0
```
Please follow the instructions [here](../README.md#setup) to set up your local development environment and LLM.
## Run Inference


@ -1,5 +1,4 @@
import asyncio
import logging
import os
import re
@ -9,56 +8,62 @@ from datasets import load_dataset
from evaluation.utils.shared import (
EvalMetadata,
EvalOutput,
make_metadata,
prepare_dataset,
reset_logger_for_multiprocessing,
run_evaluation,
)
from opendevin.controller.agent import Agent
from opendevin.controller.state.state import State
from opendevin.core.config import get_llm_config_arg, load_app_config, parse_arguments
from opendevin.core.logger import get_console_handler
from opendevin.core.config import (
AppConfig,
SandboxConfig,
get_llm_config_arg,
parse_arguments,
)
from opendevin.core.logger import opendevin_logger as logger
from opendevin.core.main import run_controller
from opendevin.llm.llm import LLM
config = load_app_config()
from opendevin.core.main import create_runtime, run_controller
# Only CodeActAgent can delegate to BrowsingAgent
SUPPORTED_AGENT_CLS = {'CodeActAgent'}
def process_instance(
def get_config(
metadata: EvalMetadata,
) -> AppConfig:
assert (
metadata.max_iterations == 1
), 'max_iterations must be 1 for browsing delegation evaluation.'
config = AppConfig(
default_agent=metadata.agent_class,
run_as_devin=False,
runtime='eventstream',
max_iterations=metadata.max_iterations,
sandbox=SandboxConfig(
container_image='ubuntu:22.04',
enable_auto_lint=False,
use_host_network=False,
update_source_code=True,
),
workspace_base=None,
workspace_mount_path=None,
)
config.set_llm_config(metadata.llm_config)
return config
async def process_instance(
instance: pd.Series,
metadata: EvalMetadata,
reset_logger: bool = True,
):
# Create the agent
agent = Agent.get_cls(metadata.agent_class)(llm=LLM(config=metadata.llm_config))
env_id = instance.instance_id
) -> EvalOutput:
config = get_config(metadata)
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
if reset_logger:
# Set up logger
log_file = os.path.join(
metadata.eval_output_dir, 'logs', f'instance_{env_id}.log'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
# add back the console handler to print ONE line
logger.addHandler(get_console_handler())
logger.info(
f'Starting evaluation for instance {env_id}.\nHint: run "tail -f {log_file}" to see live logs in a separate shell'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
file_handler = logging.FileHandler(log_file)
file_handler.setFormatter(
logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
)
logger.addHandler(file_handler)
log_dir = os.path.join(metadata.eval_output_dir, 'infer_logs')
reset_logger_for_multiprocessing(logger, instance.instance_id, log_dir)
else:
logger.info(f'Starting evaluation for instance {env_id}.')
logger.info(f'Starting evaluation for instance {instance.instance_id}.')
instruction = (
f'You can delegate browsing tasks to a browser agent. '
@ -67,21 +72,14 @@ def process_instance(
f'NOTE: You should copy the "query" as is into the <execute_browse> tag. DO NOT change ANYTHING in the query.'
)
config.max_iterations = metadata.max_iterations
state: State | None = asyncio.run(
run_controller(
config=config,
task_str=instruction,
agent=agent,
sid=env_id,
)
runtime = await create_runtime(config, sid=instance.instance_id)
state: State | None = await run_controller(
config=config,
task_str=instruction,
runtime=runtime,
)
# ======= Attempt to evaluate the agent's environment impact =======
# If you are working on some simpler benchmark that only evaluates the final model output (e.g., in a MessageAction)
# You can simply get the LAST `MessageAction` from the returned `state.history` and parse it for evaluation.
if state is None:
raise ValueError('State should not be None.')
@ -115,20 +113,19 @@ def process_instance(
result['is_exact_match'] = is_exact_match
# Save the output
output = {
'instance_id': env_id,
'instruction': instruction,
'metadata': metadata.model_dump(),
'history': histories,
'metrics': metrics,
'error': state.last_error if state and state.last_error else None,
'test_result': {
output = EvalOutput(
instance_id=instance.instance_id,
instruction=instruction,
metadata=metadata,
history=histories,
metrics=metrics,
error=state.last_error if state and state.last_error else None,
test_result={
'query': instance.instruction,
'action': last_delegate_action,
'result': result,
},
}
)
return output
@ -138,9 +135,13 @@ if __name__ == '__main__':
dataset = load_dataset('OpenDevin/eval-browsing-instructions')
dataset = dataset['train'].to_pandas()
assert dataset.columns.tolist() == ['instance_id', 'instruction']
id_column = 'instance_id'
llm_config = get_llm_config_arg(args.llm_config) if args.llm_config else config.llm
logger.info(f'Config for evaluation: {config}')
llm_config = None
if args.llm_config:
llm_config = get_llm_config_arg(args.llm_config)
if llm_config is None:
raise ValueError(f'Could not find LLM config: --llm_config {args.llm_config}')
metadata = make_metadata(
llm_config,
@ -150,18 +151,20 @@ if __name__ == '__main__':
args.eval_note,
args.eval_output_dir,
)
if metadata.agent_class not in SUPPORTED_AGENT_CLS:
raise ValueError(
f'Agent class {metadata.agent_class} not supported with AgentDelegation.'
)
output_file = os.path.join(metadata.eval_output_dir, 'output.jsonl')
instances = prepare_dataset(dataset, output_file, args.eval_n_limit, id_column)
run_evaluation(
instances,
metadata,
output_file,
args.eval_num_workers,
process_instance,
id_column,
instances = prepare_dataset(dataset, output_file, args.eval_n_limit)
asyncio.run(
run_evaluation(
instances,
metadata,
output_file,
args.eval_num_workers,
process_instance,
)
)


@ -2,9 +2,9 @@
This folder contains the evaluation harness for evaluating agents on the [GAIA benchmark](https://arxiv.org/abs/2311.12983).
## Configure OpenDevin and your LLM
## Setup Environment and LLM Configuration
Create a `config.toml` file if it does not exist at the root of the workspace. Please check [README.md](../../README.md) for how to set this up.
Please follow the instructions [here](../README.md#setup) to set up your local development environment and LLM.
## Run the evaluation
We are using the GAIA dataset hosted on [Hugging Face](https://huggingface.co/datasets/gaia-benchmark/GAIA).


@ -1,10 +1,7 @@
import asyncio
import logging
import functools
import os
import pathlib
import re
import shutil
from functools import partial
import huggingface_hub
import pandas as pd
@ -13,28 +10,31 @@ from datasets import load_dataset
from evaluation.gaia.scorer import question_scorer
from evaluation.utils.shared import (
EvalMetadata,
EvalOutput,
codeact_user_response,
make_metadata,
prepare_dataset,
reset_logger_for_multiprocessing,
run_evaluation,
)
from opendevin.controller.agent import Agent
from opendevin.controller.state.state import State
from opendevin.core.config import get_llm_config_arg, get_parser, load_app_config
from opendevin.core.logger import get_console_handler
from opendevin.core.config import (
AppConfig,
SandboxConfig,
get_llm_config_arg,
get_parser,
)
from opendevin.core.logger import opendevin_logger as logger
from opendevin.core.main import run_controller
from opendevin.events.action import CmdRunAction, MessageAction
from opendevin.llm.llm import LLM
from opendevin.core.main import create_runtime, run_controller
from opendevin.events.action import AgentFinishAction, CmdRunAction, MessageAction
from opendevin.events.observation import CmdOutputObservation
from opendevin.runtime.runtime import Runtime
config = load_app_config()
DATASET_CACHE_DIR = '~/.cache/open-devin/evals/gaia'
DATASET_CACHE_DIR = os.path.expanduser(DATASET_CACHE_DIR)
DATASET_CACHE_DIR = os.path.join(os.path.dirname(__file__), 'data')
AGENT_CLS_TO_FAKE_USER_RESPONSE_FN = {
'CodeActAgent': partial(codeact_user_response, encapsulate_solution=True),
'CodeActAgent': functools.partial(codeact_user_response, encapsulate_solution=True),
}
AGENT_CLS_TO_INST_SUFFIX = {
@ -42,151 +42,175 @@ AGENT_CLS_TO_INST_SUFFIX = {
}
def process_instance(
def get_config(
metadata: EvalMetadata,
) -> AppConfig:
config = AppConfig(
default_agent=metadata.agent_class,
run_as_devin=False,
runtime='eventstream',
max_iterations=metadata.max_iterations,
sandbox=SandboxConfig(
container_image='ubuntu:22.04',
enable_auto_lint=True,
use_host_network=False,
update_source_code=True,
),
# do not mount workspace
workspace_base=None,
workspace_mount_path=None,
)
config.set_llm_config(metadata.llm_config)
return config
async def initialize_runtime(
runtime: Runtime,
instance: pd.Series, # this argument is not required
):
"""Initialize the runtime for the agent.
This function is called before the runtime is used to run the agent.
"""
logger.info(f"{'-' * 50} BEGIN Runtime Initialization Fn {'-' * 50}")
obs: CmdOutputObservation
action = CmdRunAction(command='mkdir -p /workspace')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
if instance['file_name'] != '':
# if this question comes with a file, we need to save it to the workspace
assert metadata.data_split is not None
src_file = os.path.join(
DATASET_CACHE_DIR, '2023', metadata.data_split, instance['file_name']
)
assert os.path.exists(src_file)
dest_file = os.path.join('/workspace', instance['file_name'])
await runtime.copy_to(src_file, dest_file)
# rename to file.extension_name
extension_name = instance['file_name'].split('.')[-1]
action = CmdRunAction(
command=f'mv /workspace/{instance["file_name"]} /workspace/file.{extension_name}'
)
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
action = CmdRunAction(command='cd /workspace')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
logger.info(f"{'-' * 50} END Runtime Initialization Fn {'-' * 50}")
async def process_instance(
instance: pd.Series,
metadata: EvalMetadata,
reset_logger: bool = True,
):
# Create the agent
agent = Agent.get_cls(metadata.agent_class)(llm=LLM(config=metadata.llm_config))
# create process-specific workspace dir
# we will create a workspace directory for EACH process
# so that different agent don't interfere with each other.
old_workspace_mount_path = config.workspace_mount_path
) -> EvalOutput:
config = get_config(metadata)
try:
workspace_mount_path = os.path.join(
config.workspace_mount_path, '_eval_workspace'
)
workspace_mount_path = os.path.join(workspace_mount_path, str(os.getpid()))
pathlib.Path(workspace_mount_path).mkdir(parents=True, exist_ok=True)
config.workspace_mount_path = workspace_mount_path
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
if reset_logger:
log_dir = os.path.join(metadata.eval_output_dir, 'infer_logs')
reset_logger_for_multiprocessing(logger, instance['instance_id'], log_dir)
else:
logger.info(f'Starting evaluation for instance {instance["instance_id"]}.')
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
eval_output_dir = metadata.eval_output_dir
if reset_logger:
# Set up logger
log_file = os.path.join(
eval_output_dir, 'logs', f'instance_{instance["task_id"]}.log'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
# add back the console handler to print ONE line
logger.addHandler(get_console_handler())
logger.info(
f'Starting evaluation for instance {instance["task_id"]}.\nLOG: tail -f {log_file}'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
file_handler = logging.FileHandler(log_file)
file_handler.setFormatter(
logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
)
logger.addHandler(file_handler)
if instance['file_name'] != '':
extension_name = instance['file_name'].split('.')[-1]
dest_file = os.path.join('/workspace', f'file.{extension_name}')
else:
dest_file = None
logger.info(f'Process-specific workspace mounted at {workspace_mount_path}')
if instance['file_name'] != '':
# if this question comes with a file, we need to save it to the workspace
assert metadata.data_split is not None
src_file = os.path.join(
DATASET_CACHE_DIR, '2023', metadata.data_split, instance['file_name']
)
extension_name = instance['file_name'].split('.')[-1]
dest_file = os.path.join(workspace_mount_path, f'file.{extension_name}')
shutil.copyfile(src_file, dest_file)
logger.info(f'File copied to {dest_file}')
else:
dest_file = None
# Prepare instruction
instruction = f"{instance['Question']}\n"
logger.info(f'Instruction: {instruction}')
if dest_file:
instruction += f"\n\nThe mentioned file is provided in the workspace at: {dest_file.split('/')[-1]}"
# Prepare instruction
instruction = f"{instance['Question']}\n"
logger.info(f'Instruction: {instruction}')
if dest_file:
instruction += f"\n\nThe mentioned file is provided in the workspace at: {dest_file.split('/')[-1]}"
instruction += 'IMPORTANT: You should ONLY interact with the environment provided to you AND NEVER ASK FOR HUMAN HELP.\n'
instruction += 'Please encapsulate your final answer (answer ONLY) within <solution> and </solution>.\n'
instruction += (
'For example: The answer to the question is <solution> 42 </solution>.\n'
)
# NOTE: You can actually set slightly different instruction for different agents
instruction += AGENT_CLS_TO_INST_SUFFIX.get(metadata.agent_class, '')
logger.info(f'Instruction:\n{instruction}', extra={'msg_type': 'OBSERVATION'})
instruction += 'IMPORTANT: You should ONLY interact with the environment provided to you AND NEVER ASK FOR HUMAN HELP.\n'
instruction += 'Please encapsulate your final answer (answer ONLY) within <solution> and </solution>.\n'
instruction += (
'For example: The answer to the question is <solution> 42 </solution>.\n'
)
# NOTE: You can actually set slightly different instruction for different agents
instruction += AGENT_CLS_TO_INST_SUFFIX.get(agent.__class__.__name__, '')
logger.info(f'Instruction:\n{instruction}', extra={'msg_type': 'OBSERVATION'})
runtime = await create_runtime(config, sid=instance['instance_id'])
await initialize_runtime(runtime, instance)
# Here's how you can run the agent (similar to the `main` function) and get the final task state
config.max_iterations = metadata.max_iterations
state: State | None = asyncio.run(
run_controller(
config=config,
task_str=instruction,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN[
agent.__class__.__name__
],
agent=agent,
sid=instance['task_id'],
)
)
# ======= Attempt to evaluate the agent's edits =======
# If you are working on simpler benchmark that only evaluates the final model output (e.g., in a MessageAction)
# You can simply get the LAST `MessageAction` from the returned `state.history` and parse it for evaluation.
# Here's how you can run the agent (similar to the `main` function) and get the final task state
state: State | None = await run_controller(
config=config,
task_str=instruction,
runtime=runtime,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN[metadata.agent_class],
)
# ======= Attempt to evaluate the agent's edits =======
# If you are working on simpler benchmark that only evaluates the final model output (e.g., in a MessageAction)
# You can simply get the LAST `MessageAction` from the returned `state.history` and parse it for evaluation.
if state is None:
raise ValueError('State should not be None.')
if state is None:
raise ValueError('State should not be None.')
model_answer_raw = ''
# get the last message or thought from the agent
for event in state.history.get_events(reverse=True):
if isinstance(event, CmdRunAction) and event.source == 'agent':
model_answer_raw = ''
# get the last message or thought from the agent
for event in state.history.get_events(reverse=True):
if event.source == 'agent':
if isinstance(event, AgentFinishAction):
model_answer_raw = event.thought
elif isinstance(event, MessageAction) and event.source == 'agent':
break
elif isinstance(event, CmdRunAction):
model_answer_raw = event.thought
break
elif isinstance(event, MessageAction):
model_answer_raw = event.content
break
# attempt to parse model_answer
model_answer = re.findall(r'<solution>(.*?)</solution>', model_answer_raw)
if len(model_answer) == 0:
logger.warning(f'Failed to parse model answer: {model_answer_raw}')
model_answer = model_answer_raw
else:
model_answer = model_answer[0]
# attempt to parse model_answer
model_answer = re.findall(r'<solution>(.*?)</solution>', model_answer_raw)
if len(model_answer) == 0:
logger.warning(f'Failed to parse model answer: {model_answer_raw}')
model_answer = model_answer_raw
else:
model_answer = model_answer[0]
logger.info(
f'Final message: {model_answer} | Ground truth: {instance["Final answer"]}'
)
score = question_scorer(
model_answer=model_answer, ground_truth=instance['Final answer']
)
test_result = {
'score': score,
'model_answer_raw': model_answer_raw,
'model_answer': model_answer,
'ground_truth': instance['Final answer'],
}
metrics = state.metrics.get() if state.metrics else None
logger.info(
f'Final message: {model_answer} | Ground truth: {instance["Final answer"]}'
)
score = question_scorer(
model_answer=model_answer, ground_truth=instance['Final answer']
)
test_result = {
'score': score,
'model_answer_raw': model_answer_raw,
'model_answer': model_answer,
'ground_truth': instance['Final answer'],
}
metrics = state.metrics.get() if state.metrics else None
# history is now available as a stream of events, rather than list of pairs of (Action, Observation)
# for compatibility with the existing output format, we can remake the pairs here
# remove when it becomes unnecessary
histories = state.history.compatibility_for_eval_history_pairs()
# history is now available as a stream of events, rather than list of pairs of (Action, Observation)
# for compatibility with the existing output format, we can remake the pairs here
# remove when it becomes unnecessary
histories = state.history.compatibility_for_eval_history_pairs()
# Save the output
output = {
'instance_id': instance['task_id'],
'instance': instance,
'instruction': instance['Question'],
'metadata': metadata.model_dump(),
'history': histories,
'metrics': metrics,
'error': state.last_error if state and state.last_error else None,
'test_result': test_result,
}
except Exception:
logger.error('Process instance failed')
raise
finally:
config.workspace_mount_path = old_workspace_mount_path
# Save the output
output = EvalOutput(
instance_id=instance['instance_id'],
instance=instance.to_dict(),
instruction=instance['Question'],
metadata=metadata,
history=histories,
metrics=metrics,
error=state.last_error if state and state.last_error else None,
test_result=test_result,
)
return output
@ -197,13 +221,19 @@ if __name__ == '__main__':
type=str,
help='gaia level to evaluate, eg. 2023_level1',
)
parser.add_argument(
'--data-split',
type=str,
help='data split to evaluate, eg. test',
default='validation',
)
args, _ = parser.parse_known_args()
if args.directory:
config.workspace_base = os.path.abspath(args.directory)
logger.info(f'Setting workspace base to {config.workspace_base}')
llm_config = get_llm_config_arg(args.llm_config) if args.llm_config else config.llm
logger.info(f'Config for evaluation: {config}')
llm_config = None
if args.llm_config:
llm_config = get_llm_config_arg(args.llm_config)
if llm_config is None:
raise ValueError(f'Could not find LLM config: --llm_config {args.llm_config}')
metadata = make_metadata(
llm_config=llm_config,
@ -222,20 +252,18 @@ if __name__ == '__main__':
repo_type='dataset',
local_dir=DATASET_CACHE_DIR,
)
gaia_tests = dataset[metadata.data_split]
gaia_tests = dataset[metadata.data_split].to_pandas()
gaia_tests.rename(columns={'task_id': 'instance_id'}, inplace=True)
output_file = os.path.join(metadata.eval_output_dir, 'output.jsonl')
prepared_dataset = prepare_dataset(
gaia_tests.to_pandas(), output_file, args.eval_n_limit, 'task_id'
)
prepared_dataset = prepare_dataset(gaia_tests, output_file, args.eval_n_limit)
agent = Agent.get_cls(args.agent_cls)(llm=LLM(config.llm))
run_evaluation(
dataset=prepared_dataset,
metadata=metadata,
output_file=output_file,
num_workers=args.eval_num_workers,
process_instance_func=process_instance,
id_column='task_id',
asyncio.run(
run_evaluation(
dataset=prepared_dataset,
metadata=metadata,
output_file=output_file,
num_workers=args.eval_num_workers,
process_instance_func=process_instance,
)
)
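As a small aside, the `<solution>` parsing used in the GAIA harness above behaves like this (sample text invented):

```python
# Invented sample; mirrors the regex used in process_instance above.
import re

model_answer_raw = 'The answer to the question is <solution> 42 </solution>.'
matches = re.findall(r'<solution>(.*?)</solution>', model_answer_raw)
model_answer = matches[0] if matches else model_answer_raw
print(repr(model_answer))  # -> ' 42 ' (surrounding whitespace is preserved)
```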

evaluation/gaia/scripts/run_infer.sh Normal file → Executable file

@ -2,20 +2,16 @@
This folder contains the evaluation harness we built on top of the original [Gorilla APIBench](https://github.com/ShishirPatil/gorilla) ([paper](https://arxiv.org/pdf/2305.15334)).
## Setup Environment
## Setup Environment and LLM Configuration
Please follow [this document](https://github.com/OpenDevin/OpenDevin/blob/main/Development.md) to setup local development environment for OpenDevin.
## Configure OpenDevin and your LLM
Run `make setup-config` to set up the `config.toml` file if it does not exist at the root of the workspace.
Please follow the instructions [here](../README.md#setup) to set up your local development environment and LLM.
## Run Inference on APIBench Instances
Make sure your Docker daemon is running, then run this bash script:
```bash
bash evaluation/gorilla/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [hubs]
./evaluation/gorilla/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [hubs]
```
where `model_config` is mandatory, while all other arguments are optional.
@ -39,5 +35,5 @@ Note: in order to use `eval_limit`, you must also set `agent`; in order to use `
For example,
```bash
bash evaluation/gorilla/scripts/run_infer.sh llm 0.6.2 CodeActAgent 10 th
./evaluation/gorilla/scripts/run_infer.sh llm 0.6.2 CodeActAgent 10 th
```


@ -1,59 +1,28 @@
import asyncio
import json
import logging
import multiprocessing as mp
import os
import pathlib
import subprocess
import time
from concurrent.futures import ProcessPoolExecutor
from tqdm import tqdm
import pandas as pd
from opendevin.controller.agent import Agent
from evaluation.gorilla.utils import encode_question, get_data_for_hub
from evaluation.utils.shared import (
EvalMetadata,
EvalOutput,
codeact_user_response,
make_metadata,
prepare_dataset,
reset_logger_for_multiprocessing,
run_evaluation,
)
from opendevin.controller.state.state import State
from opendevin.core.config import get_llm_config_arg, get_parser, load_app_config
from opendevin.core.logger import get_console_handler
from opendevin.core.config import (
AppConfig,
SandboxConfig,
get_llm_config_arg,
get_parser,
)
from opendevin.core.logger import opendevin_logger as logger
from opendevin.core.main import run_controller
from opendevin.events.action import MessageAction
from opendevin.llm.llm import LLM
from .utils import encode_question, get_data
config = load_app_config()
def cleanup():
print('Cleaning up child processes...')
for process in mp.active_children():
print(f'Terminating child process: {process.name}')
process.terminate()
process.join()
def codeact_user_response(state: State) -> str:
msg = (
#'Please continue working on the task on whatever approach you think is suitable.\n'
'Please run the following command: <execute_bash> exit </execute_bash>.\n'
#'IMPORTANT: YOU SHOULD NEVER ASK FOR HUMAN HELP OR USE THE INTERNET TO SOLVE THIS TASK.\n'
)
# check if the agent has tried to talk to the user 3 times, if so, let the agent know it can give up
if state.history:
user_msgs = [
event
for event in state.history.get_events()
if isinstance(event, MessageAction) and event.source == 'user'
]
if len(user_msgs) > 2:
# let the agent know that it can give up when it has tried 3 times
return (
msg
+ 'If you want to give up, run: <execute_bash> exit </execute_bash>.\n'
)
return msg
from opendevin.core.main import create_runtime, run_controller
AGENT_CLS_TO_FAKE_USER_RESPONSE_FN = {
'CodeActAgent': codeact_user_response,
@ -64,105 +33,96 @@ AGENT_CLS_TO_INST_SUFFIX = {
}
def process_instance(agent, question_id, question, metadata, reset_logger: bool = True):
# create process-specific workspace dir
# we will create a workspace directory for EACH process
# so that different agent don't interfere with each other.
old_workspace_mount_path = config.workspace_mount_path
try:
workspace_mount_path = os.path.join(
config.workspace_mount_path, '_eval_workspace'
)
workspace_mount_path = os.path.join(workspace_mount_path, str(os.getpid()))
pathlib.Path(workspace_mount_path).mkdir(parents=True, exist_ok=True)
config.workspace_mount_path = workspace_mount_path
def get_config(
metadata: EvalMetadata,
) -> AppConfig:
config = AppConfig(
default_agent=metadata.agent_class,
run_as_devin=False,
runtime='eventstream',
max_iterations=metadata.max_iterations,
sandbox=SandboxConfig(
container_image='ubuntu:22.04',
enable_auto_lint=True,
use_host_network=False,
update_source_code=True,
),
# do not mount workspace
workspace_base=None,
workspace_mount_path=None,
)
config.set_llm_config(metadata.llm_config)
return config
# Setup the logger properly, so you can run multi-processing to parallize the evaluation
eval_output_dir = metadata['eval_output_dir']
if reset_logger:
# Set up logger
log_file = os.path.join(
eval_output_dir, 'logs', f'instance_{question_id}.log'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
# add back the console handler to print ONE line
logger.addHandler(get_console_handler())
logger.info(
f'Starting evaluation for instance {question_id}.\nLOG: tail -f {log_file}'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
file_handler = logging.FileHandler(log_file)
file_handler.setFormatter(
logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
)
logger.addHandler(file_handler)
logger.info(f'Process-specific workspace mounted at {workspace_mount_path}')
# Prepare instruction
instruction = encode_question(question, metadata['hub'])
instruction += 'IMPORTANT: You should ONLY interact with the environment provided to you AND NEVER ASK FOR HUMAN HELP.\n'
# NOTE: You can actually set slightly different instruction for different agents
instruction += AGENT_CLS_TO_INST_SUFFIX[agent.__class__.__name__]
# logger.info(f'Instruction:\n{instruction}', extra={'msg_type': 'OBSERVATION'})
async def process_instance(
instance: pd.Series,
metadata: EvalMetadata,
reset_logger: bool = True,
) -> EvalOutput:
config = get_config(metadata)
instance_id = instance['question_id']
question = instance['question']
# Here's how you can run the agent (similar to the `main` function) and get the final task state
config.max_iterations = metadata.max_iterations
state: State | None = asyncio.run(
run_controller(
config=config,
task_str=instruction,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN.get(
agent.__class__.__name__
),
agent=agent,
sid=question_id,
)
)
# ======= Attempt to evaluate the agent's edits =======
# If you are working on simpler benchmark that only evaluates the final model output (e.g., in a MessageAction)
# You can simply get the LAST `MessageAction` from the returned `state.history` and parse it for evaluation.
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
if reset_logger:
log_dir = os.path.join(metadata.eval_output_dir, 'infer_logs')
reset_logger_for_multiprocessing(logger, instance_id, log_dir)
else:
logger.info(f'Starting evaluation for instance {instance_id}.')
if state is None:
raise ValueError('State should not be None.')
# Prepare instruction
instruction = encode_question(question, instance['hub'])
instruction += 'IMPORTANT: You should ONLY interact with the environment provided to you AND NEVER ASK FOR HUMAN HELP.\n'
# NOTE: You can actually set slightly different instruction for different agents
instruction += AGENT_CLS_TO_INST_SUFFIX[metadata.agent_class]
# logger.info(f'Instruction:\n{instruction}', extra={'msg_type': 'OBSERVATION'})
# retrieve the last message from the agent
model_answer_raw = state.history.get_last_agent_message()
# Here's how you can run the agent (similar to the `main` function) and get the final task state
runtime = await create_runtime(config, sid=instance_id)
state: State | None = await run_controller(
config=config,
task_str=instruction,
runtime=runtime,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN.get(
metadata.agent_class
),
)
# ======= Attempt to evaluate the agent's edits =======
# If you are working on simpler benchmark that only evaluates the final model output (e.g., in a MessageAction)
# You can simply get the LAST `MessageAction` from the returned `state.history` and parse it for evaluation.
# attempt to parse model_answer
_, _, ast_eval = get_data(metadata['hub'])
correct, hallucination = ast_eval(question_id, model_answer_raw)
metrics = state.metrics.get() if state.metrics else None
logger.info(
f'Final message: {model_answer_raw} | Correctness: {correct} | Hallucination: {hallucination}'
)
if state is None:
raise ValueError('State should not be None.')
# history is now available as a stream of events, rather than list of pairs of (Action, Observation)
# for compatibility with the existing output format, we can remake the pairs here
# remove when it becomes unnecessary
histories = state.history.compatibility_for_eval_history_pairs()
# retrieve the last message from the agent
model_answer_raw = state.history.get_last_agent_message()
# Save the output
output = {
'question_id': question_id,
# attempt to parse model_answer
ast_eval_fn = instance['ast_eval']
correct, hallucination = ast_eval_fn(instance_id, model_answer_raw)
metrics = state.metrics.get() if state.metrics else None
logger.info(
f'Final message: {model_answer_raw} | Correctness: {correct} | Hallucination: {hallucination}'
)
# history is now available as a stream of events, rather than list of pairs of (Action, Observation)
# for compatibility with the existing output format, we can remake the pairs here
# remove when it becomes unnecessary
histories = state.history.compatibility_for_eval_history_pairs()
output = EvalOutput(
instance_id=instance_id,
metadata=metadata,
history=histories,
metrics=metrics,
error=state.last_error if state and state.last_error else None,
test_result={
'text': model_answer_raw,
'correct': correct,
'hallucination': hallucination,
'answer_id': 'None',
'model_id': metadata['model_name'],
'metadata': metadata.model_dump(),
'history': histories,
'metrics': metrics,
'error': state.last_error if state and state.last_error else None,
}
except Exception:
logger.error('Process instance failed')
raise
finally:
config.workspace_mount_path = old_workspace_mount_path
},
)
return output
@ -175,188 +135,62 @@ if __name__ == '__main__':
default='hf,torch,tf',
)
args, _ = parser.parse_known_args()
if args.directory:
config.workspace_base = os.path.abspath(args.directory)
print(f'Setting workspace base to {config.workspace_base}')
# Check https://github.com/OpenDevin/OpenDevin/blob/main/evaluation/swe_bench/README.md#configure-opendevin-and-your-llm
# for details of how to set `llm_config`
llm_config = None
if args.llm_config:
specified_llm_config = get_llm_config_arg(args.llm_config)
if specified_llm_config:
config.llm = specified_llm_config
logger.info(f'Config for evaluation: {config}')
agent_class = args.agent_cls
assert (
agent_class in AGENT_CLS_TO_FAKE_USER_RESPONSE_FN
), f'Unsupported agent class: {agent_class}'
model_name = config.llm.model.split('/')[-1]
max_iterations = args.max_iterations
eval_note = ''
if args.eval_note is not None:
eval_note += '_N_' + args.eval_note
eval_output_dir = os.path.join(
args.eval_output_dir,
'gorilla',
agent_class,
model_name + '_maxiter_' + str(max_iterations) + eval_note,
)
pathlib.Path(eval_output_dir).mkdir(parents=True, exist_ok=True)
pathlib.Path(os.path.join(eval_output_dir, 'logs')).mkdir(
parents=True, exist_ok=True
)
logger.info(f'Using evaluation output directory: {eval_output_dir}')
llm_config = get_llm_config_arg(args.llm_config)
if llm_config is None:
raise ValueError(f'Could not find LLM config: --llm_config {args.llm_config}')
hubs = []
if 'hf' in args.hubs:
hubs.append('hf')
if 'torch' in args.hubs or 'th' in args.hubs:
hubs.append('torch')
if 'tf' in args.hubs:
hubs.append('tf')
if hubs == []:
hubs = args.hubs.split(',')
if len(hubs) == 0:
raise ValueError('Please choose at least one from hf, torch, and tf for hubs.')
dfs = []
for hub in hubs:
logger.info(f'Evaluating APIBench {hub} test')
questions, question_ids, ast_eval = get_data(hub)
df = get_data_for_hub(hub)
dfs.append(df)
dataset_df = pd.concat(dfs)
dataset_df.rename(columns={'question_id': 'instance_id'}, inplace=True)
# TEST METADATA
metadata = {
'hub': hub,
'agent_class': agent_class,
'model_name': model_name,
'max_iterations': max_iterations,
'eval_output_dir': eval_output_dir,
'start_time': time.strftime('%Y-%m-%d %H:%M:%S'),
# get the commit id of current repo for reproduciblity
'git_commit': subprocess.check_output(['git', 'rev-parse', 'HEAD'])
.decode('utf-8')
.strip(),
}
logger.info(f'Metadata: {metadata}')
with open(os.path.join(eval_output_dir, f'metadata_{hub}.json'), 'w') as f:
json.dump(metadata, f)
metadata = make_metadata(
llm_config=llm_config,
dataset_name=f'gorilla-{hub}',
agent_class=args.agent_cls,
max_iterations=args.max_iterations,
eval_note=args.eval_note,
eval_output_dir=args.eval_output_dir,
data_split=args.data_split,
)
output_file = os.path.join(metadata.eval_output_dir, 'output.jsonl')
# LIMIT EVALUATION
eval_n_limit = args.eval_n_limit
if eval_n_limit:
questions = questions[: (eval_n_limit // len(hubs))]
question_ids = question_ids[: (eval_n_limit // len(hubs))]
logger.info(
f'Limiting evaluation to a total of first {eval_n_limit} instances -> first {eval_n_limit//len(hubs)} instances per hub.'
)
output_file = os.path.join(eval_output_dir, f'output_{model_name}_{hub}.jsonl')
logger.info(f'Writing evaluation output to {output_file}')
finished_task_ids = set()
if os.path.exists(output_file):
with open(output_file, 'r') as f:
for line in f:
data = json.loads(line)
for i in range(len(question_ids)):
if question_ids[i] == int(data['question_id']):
finished_task_ids.add(data['question_id'])
logger.warning(
f'Output file {output_file} already exists. Loaded {len(finished_task_ids)} finished instances.'
)
output_fp = open(output_file, 'a')
logger.info(
f'Evaluation started with Agent {agent_class}, model {model_name}, max iterations {max_iterations}.'
dataset = prepare_dataset(
dataset_df, output_file=output_file, eval_n_limit=args.eval_n_limit
)
asyncio.run(
run_evaluation(
dataset=dataset,
metadata=metadata,
output_file=output_file,
num_workers=args.eval_num_workers,
process_instance_func=process_instance,
)
# =============================================
# filter out finished instances
new_questions = []
new_question_ids = []
for i in range(len(question_ids)):
if question_ids[i] in finished_task_ids:
logger.info(
f'Skipping instance {question_ids[i]} as it is already finished.'
)
continue
new_questions.append(questions[i])
new_question_ids.append(question_ids[i])
)
finished_task_number = len(finished_task_ids)
questions = new_questions
question_ids = new_question_ids
logger.info(
f'Finished instances: {finished_task_number}, Remaining instances: {len(question_ids)}'
)
# =============================================
pbar = tqdm(total=len(question_ids))
# This function tracks the progress AND write the output to a JSONL file
def update_progress(future, pbar, output_fp, finished_task_ids):
pbar.update(1)
output = future.result()
pbar.set_description(f'Instance {output["question_id"]}')
pbar.set_postfix_str(f'Test Result: {output["correct"]}')
logger.info(
f'Finished evaluation for instance {output["question_id"]}: {output["correct"]}'
)
output_fp.write(json.dumps(output) + '\n')
output_fp.flush()
finished_task_ids.add(output['question_id'])
# Create the agent
agent = Agent.get_cls(agent_class)(llm=LLM(config.llm))
# This sets the multi-processing
num_workers = args.eval_num_workers
logger.info(f'Using {num_workers} workers for evaluation.')
try:
with ProcessPoolExecutor(num_workers) as executor:
futures = []
# This is how we perform multi-processing
for i in range(len(question_ids)):
try:
question_id = question_ids[i]
question = questions[i]
future = executor.submit(
process_instance,
agent,
question_id,
question,
metadata,
reset_logger=bool(num_workers > 1),
)
future.add_done_callback(
update_progress, pbar, output_fp, finished_task_ids
)
futures.append(future)
except Exception:
continue
# Wait for all futures to complete
for future in futures:
try:
future.result()
except Exception:
continue
except KeyboardInterrupt:
logger.info('KeyboardInterrupt received. Cleaning up...')
cleanup()
output_fp.close()
total_correct = 0
total_hallucination = 0
output = []
with open(output_file, 'r') as f:
for line in f:
data = json.loads(line)
output.append(data)
if int(data['question_id']) in finished_task_ids:
if str(data['correct']).lower() == 'true':
total_correct += 1
if str(data['hallucination']).lower() == 'true':
total_hallucination += 1
# sort all output by question_id
output = sorted(output, key=lambda x: x['question_id'])
with open(output_file, 'w') as f:
for dat in output:
f.write(json.dumps(dat) + '\n')
f.flush()
logger.info(
f'Evaluation finished for {hub}. Total: {len(question_ids)+finished_task_number}; Correct: {total_correct}; Hallucination: {total_hallucination}. Accuracy: {total_correct / (len(question_ids)+finished_task_number)}'
)
# Read the output file and calculate the accuracy
total_correct = 0
total_hallucination = 0
output = []
with open(output_file, 'r') as f:
for line in f:
data = json.loads(line)
if data['test_result']['correct']:
total_correct += 1
if data['test_result']['hallucination']:
total_hallucination += 1
output.append(data)
logger.info(
f'Evaluation finished for {hub}. Total: {len(output)}; Correct: {total_correct}; Hallucination: {total_hallucination}. Accuracy: {total_correct / len(output)}'
)

evaluation/gorilla/scripts/run_infer.sh Normal file → Executable file
View File

@ -1,6 +1,8 @@
import json
import os
from functools import partial
import pandas as pd
import requests
from ast_eval_hf import ast_eval_hf, ast_parse
from ast_eval_tf import ast_eval_tf
@ -48,48 +50,59 @@ def encode_question(question, api_name):
return prompts
def get_data(hub):
DATA_DIR = os.path.join(os.path.dirname(__file__), 'data')
os.makedirs(DATA_DIR, exist_ok=True)
def fetch_data(url, filename):
cache_path = os.path.join(DATA_DIR, filename)
if os.path.exists(cache_path):
with open(cache_path, 'r') as f:
return f.read()
else:
response = requests.get(url)
if response.status_code == 200:
with open(cache_path, 'w') as f:
f.write(response.text)
return response.text
else:
raise Exception(f'Failed to fetch data from {url}')
def get_data_for_hub(hub: str):
if hub == 'hf':
question_data = 'https://raw.githubusercontent.com/ShishirPatil/gorilla/main/eval/eval-data/questions/huggingface/questions_huggingface_0_shot.jsonl'
api_dataset = 'https://raw.githubusercontent.com/ShishirPatil/gorilla/main/data/api/huggingface_api.jsonl'
apibench = 'https://raw.githubusercontent.com/ShishirPatil/gorilla/main/data/apibench/huggingface_eval.json'
ast_eval = ast_eval_hf
if hub == 'torch':
elif hub == 'torch':
question_data = 'https://raw.githubusercontent.com/ShishirPatil/gorilla/main/eval/eval-data/questions/torchhub/questions_torchhub_0_shot.jsonl'
api_dataset = 'https://raw.githubusercontent.com/ShishirPatil/gorilla/main/data/api/torchhub_api.jsonl'
apibench = 'https://raw.githubusercontent.com/ShishirPatil/gorilla/main/data/apibench/torchhub_eval.json'
ast_eval = ast_eval_th
if hub == 'tf':
elif hub == 'tf':
question_data = 'https://raw.githubusercontent.com/ShishirPatil/gorilla/main/eval/eval-data/questions/tensorflowhub/questions_tensorflowhub_0_shot.jsonl'
api_dataset = 'https://raw.githubusercontent.com/ShishirPatil/gorilla/main/data/api/tensorflowhub_api.jsonl'
apibench = 'https://raw.githubusercontent.com/ShishirPatil/gorilla/main/data/apibench/tensorflow_eval.json'
ast_eval = ast_eval_tf
# get questions and question_ids
question_data = fetch_data(question_data, 'question_data.jsonl')
api_dataset = fetch_data(api_dataset, 'api_dataset.jsonl')
apibench = fetch_data(apibench, 'apibench.json')
# Parse question data
questions = []
question_ids = []
question_data = requests.get(question_data)
if question_data.status_code == 200:
lines = question_data.text.splitlines()
for line in lines:
questions.append(json.loads(line)['text'])
question_ids.append(json.loads(line)['question_id'])
for line in question_data.splitlines():
data = json.loads(line)
questions.append(data['text'])
question_ids.append(data['question_id'])
# get the api datasest
api_database = []
api_dataset = requests.get(api_dataset)
if api_dataset.status_code == 200:
lines = api_dataset.text.splitlines()
for line in lines:
api_database.append(json.loads(line))
# Parse API dataset
api_database = [json.loads(line) for line in api_dataset.splitlines()]
# get the question answer pair datasest
qa_pairs = []
apibench = requests.get(apibench)
if apibench.status_code == 200:
lines = apibench.text.splitlines()
for line in lines:
qa_pairs.append(json.loads(line)['api_data'])
# Parse question-answer pairs
qa_pairs = [json.loads(line)['api_data'] for line in apibench.splitlines()]
# Parse all apis to ast trees
ast_database = []
@ -97,4 +110,15 @@ def get_data(hub):
ast_tree = ast_parse(data['api_call'])
ast_database.append(ast_tree)
ast_eval = partial(ast_eval, api_database, qa_pairs, ast_database)
return questions, question_ids, ast_eval
return pd.DataFrame(
{
'question_id': question_ids,
'question': questions,
'api_database': [api_database] * len(questions),
'qa_pairs': [qa_pairs] * len(questions),
'ast_database': [ast_database] * len(questions),
'ast_eval': [ast_eval] * len(questions),
'hub': [hub] * len(questions),
}
)
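
Since the refactor moves all per-hub data loading into `get_data_for_hub` and bakes the bound `ast_eval` callable into each DataFrame row, a minimal usage sketch may help. It is not part of the diff: it assumes network access on first run (downloads are cached under `evaluation/gorilla/data/`), and the model answer below is a made-up placeholder standing in for the agent's last message.

```python
# Hedged sketch: consuming one row produced by the new get_data_for_hub().
from evaluation.gorilla.utils import encode_question, get_data_for_hub

df = get_data_for_hub('hf')   # downloads (or re-reads cached) APIBench 'hf' data
row = df.iloc[0]

# Build the prompt shown to the agent, then score a hypothetical answer
# with the ast_eval partial that is bound per row.
instruction = encode_question(row['question'], row['hub'])
model_answer = 'I would use the HuggingFace api call ...'  # hypothetical agent output
correct, hallucination = row['ast_eval'](row['question_id'], model_answer)
print(f'correct={correct}, hallucination={hallucination}')
```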

View File

@ -15,31 +15,9 @@ Further references:
- https://paperswithcode.com/dataset/gpqa
- https://github.com/idavidrein/gpqa
## Setup Environment and LLM Configuration
## Setup Environment
Please follow [this document](https://github.com/OpenDevin/OpenDevin/blob/main/Development.md) to setup local develop environment for OpenDevin.
## Configure OpenDevin and your LLM
Create a `config.toml` file (you can copy from `config.template.toml`) if it does not exist at the root of the workspace.
Add the following configurations:
```toml
# TODO: Change these to the model you want to evaluate
[llm.eval_gpt4_1106_preview]
model = "gpt-4-1106-preview"
api_key = "XXX"
temperature = 0.0
[llm.eval_azure_openai_compatible_model]
model = "AZURE_OPENAI_EXACT_DEPLOYMENT_MODEL_NAME"
base_url = "AZURE_OPENAI_ENDPOINT"
api_key = "AZURE_ENDPOINT_API_KEY"
temperature = 0.0
```
Please follow the instructions [here](../README.md#setup) to set up your local development environment and LLM.
## Run Inference on GPQA Benchmark
'gpqa_main', 'gpqa_diamond', 'gpqa_experts', 'gpqa_extended' -- data split options
@ -55,8 +33,3 @@ like to evaluate. It could also be a release tag like `0.6.2`.
- `num_samples_eval`: Number of samples to evaluate (useful for testing and debugging).
- `data_split`: The data split to evaluate on. Must be one of `gpqa_main`, `gpqa_diamond`, `gpqa_experts`, `gpqa_extended`. Defaults to `gpqa_diamond` as done in the paper.
- `AgentClass`: The agent class to use for evaluation. Currently only `CodeActAgent` is supported.
## Benchmark Evaluation Results
- [] TODO: Finish the evaluation run across the entire benchmark and compile results

View File

@ -17,9 +17,7 @@ TODOs:
"""
import asyncio
import logging
import os
import pathlib
import random
import re
from typing import Callable
@ -29,22 +27,27 @@ from datasets import load_dataset
from evaluation.utils.shared import (
EvalMetadata,
codeact_user_response,
EvalOutput,
make_metadata,
prepare_dataset,
reset_logger_for_multiprocessing,
run_evaluation,
)
from opendevin.controller.agent import Agent
from opendevin.controller.state.state import State
from opendevin.core.config import get_llm_config_arg, get_parser, load_app_config
from opendevin.core.logger import get_console_handler
from opendevin.core.config import (
AppConfig,
SandboxConfig,
get_llm_config_arg,
get_parser,
)
from opendevin.core.logger import opendevin_logger as logger
from opendevin.core.main import run_controller
from opendevin.events.action import Action, AgentFinishAction, MessageAction
from opendevin.core.main import create_runtime, run_controller
from opendevin.events.action import (
Action,
AgentFinishAction,
MessageAction,
)
from opendevin.events.observation import Observation
from opendevin.llm.llm import LLM
config = load_app_config()
ACTION_FORMAT = """
<<FINAL_ANSWER||
@ -53,6 +56,28 @@ ACTION_FORMAT = """
""".strip()
def get_config(
metadata: EvalMetadata,
) -> AppConfig:
config = AppConfig(
default_agent=metadata.agent_class,
run_as_devin=False,
runtime='eventstream',
max_iterations=metadata.max_iterations,
sandbox=SandboxConfig(
container_image='ubuntu:22.04',
enable_auto_lint=True,
use_host_network=False,
update_source_code=True,
),
# do not mount workspace
workspace_base=None,
workspace_mount_path=None,
)
config.set_llm_config(metadata.llm_config)
return config
def gpqa_codeact_user_response(
state: State,
encapsulate_solution: bool = False,
@ -68,11 +93,10 @@ def gpqa_codeact_user_response(
'<execute_bash> exit </execute_bash>\n'
'IMPORTANT: YOU SHOULD NEVER ASK FOR HUMAN HELP TO SOLVE THIS TASK.\n'
)
return msg
AGENT_CLS_TO_FAKE_USER_RESPONSE_FN = {'CodeActAgent': codeact_user_response}
AGENT_CLS_TO_FAKE_USER_RESPONSE_FN = {'CodeActAgent': gpqa_codeact_user_response}
AGENT_CLS_TO_INST_SUFFIX = {
'CodeActAgent': '\n\n SUPER IMPORTANT: When you think you have solved the question, first report it back to the user in the requested format. Only once that is done, in the next turn, please run the following command: <execute_bash> exit </execute_bash>.\n'
@ -146,57 +170,23 @@ def convert_instance_dict(instance):
return out_instance_dict
def process_instance(
async def process_instance(
instance: pd.Series,
metadata: EvalMetadata,
reset_logger: bool = True,
):
# Create the agent
agent = Agent.get_cls(metadata.agent_class)(llm=LLM(config=metadata.llm_config))
old_workspace_mount_path = config.workspace_mount_path
old_workspace_base = config.workspace_base
try:
workspace_mount_path = os.path.join(
config.workspace_mount_path, '_eval_workspace'
)
# create process-specific workspace dir
workspace_mount_path = os.path.join(workspace_mount_path, str(os.getpid()))
pathlib.Path(workspace_mount_path).mkdir(parents=True, exist_ok=True)
config = get_config(metadata)
# reset workspace to config
config.workspace_base = workspace_mount_path
config.workspace_mount_path = workspace_mount_path
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
if reset_logger:
log_dir = os.path.join(metadata.eval_output_dir, 'infer_logs')
reset_logger_for_multiprocessing(logger, instance['instance_id'], log_dir)
else:
logger.info(f'Starting evaluation for instance {instance["instance_id"]}.')
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
if reset_logger:
# Set up logger
log_file = os.path.join(
metadata.eval_output_dir, 'logs', f'instance_{instance.instance_id}.log'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
# add back the console handler to print ONE line
logger.addHandler(get_console_handler())
logger.info(
f'Starting evaluation for instance {instance.instance_id}.\nHint: run "tail -f {log_file}" to see live logs in a separate shell'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
file_handler = logging.FileHandler(log_file)
file_handler.setFormatter(
logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
)
logger.addHandler(file_handler)
else:
logger.info(f'Starting evaluation for instance {instance.instance_id}.')
logger.info(f'Process-specific workspace mounted at {workspace_mount_path}')
# ======= Run the agent on the instance =======
# Prepare instruction for the agent using suggested format in gpqa codebase
instruction = f"""
# ======= Run the agent on the instance =======
# Prepare instruction for the agent using suggested format in gpqa codebase
instruction = f"""
What is the correct answer to this question:\n
{instance['question']}\n
@ -225,109 +215,98 @@ Again do not quit without reporting the answer first.
Ok now it's time to start solving the question. Good luck!
"""
# Here's how you can run the agent (similar to the `main` function) and get the final task state
config.max_iterations = metadata.max_iterations
state: State | None = asyncio.run(
run_controller(
config=config,
task_str=instruction,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN.get(
agent.__class__.__name__
),
agent=agent,
sid=f'gptq_{str(instance.instance_id)}',
)
)
assert state is not None, 'State should not be None.'
runtime = await create_runtime(config, sid=f'gptq_{str(instance.instance_id)}')
# ======= Attempt to evaluate the agent's edits =======
state: State | None = await run_controller(
config=config,
task_str=instruction,
runtime=runtime,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN.get(
metadata.agent_class
),
)
assert state is not None, 'State should not be None.'
question_choices = {
'A': instance['choices'][0],
'B': instance['choices'][1],
'C': instance['choices'][2],
'D': instance['choices'][3],
}
# get the final message from the state history (default to empty if not found)
found_answers = {
'A': False,
'B': False,
'C': False,
'D': False,
}
for event in state.history.get_events(reverse=True):
if (
isinstance(event, AgentFinishAction)
and event.source != 'user'
and '<<FINAL_ANSWER||' in event.thought
):
final_message = event.thought
break
elif (
isinstance(event, MessageAction)
and event.source != 'user'
and '<<FINAL_ANSWER||' in event.content
):
final_message = event.content
break
elif isinstance(event, Observation):
for option, option_text in question_choices.items():
if option_text in event.content:
found_answers[option] = True
else:
final_message = None
# ======= Attempt to evaluate the agent's edits =======
found_options = [option for option, found in found_answers.items() if found]
question_choices = {
'A': instance['choices'][0],
'B': instance['choices'][1],
'C': instance['choices'][2],
'D': instance['choices'][3],
}
# get the final message from the state history (default to empty if not found)
found_answers = {
'A': False,
'B': False,
'C': False,
'D': False,
}
for event in state.history.get_events(reverse=True):
if (
isinstance(event, AgentFinishAction)
and event.source != 'user'
and '<<FINAL_ANSWER||' in event.thought
):
final_message = event.thought
break
elif (
isinstance(event, MessageAction)
and event.source != 'user'
and '<<FINAL_ANSWER||' in event.content
):
final_message = event.content
break
elif isinstance(event, Observation):
for option, option_text in question_choices.items():
if option_text in event.content:
found_answers[option] = True
else:
final_message = None
found_options = [option for option, found in found_answers.items() if found]
logger.info('#############################################')
logger.info(f'Final message generated by the agent: {final_message}')
logger.info('#############################################')
# check if the model output matches the ground truth
test_result = compare_answers(final_message, instance.correct_solution)
if final_message is None and len(found_options) > 0:
_selected = random.choice(found_options)
# if the final message is None, then the agent did not report the answer in the correct format
# so we randomly select one of the found options and compare it with the correct solution
test_result = _selected == instance.correct_solution
logger.info('#############################################')
logger.info(f'Final message generated by the agent: {final_message}')
logger.info('Agent did not report the answer in the correct format.')
logger.info(f'Found options: {found_options}')
logger.info(f'Selected option: {_selected}')
logger.info('#############################################')
# check if the model output matches the ground truth
test_result = compare_answers(final_message, instance.correct_solution)
if final_message is None and len(found_options) > 0:
_selected = random.choice(found_options)
# if the final message is None, then the agent did not report the answer in the correct format
# so we randomly select one of the found options and compare it with the correct solution
test_result = _selected == instance.correct_solution
logger.info('#############################################')
logger.info('Agent did not report the answer in the correct format.')
logger.info(f'Found options: {found_options}')
logger.info(f'Selected option: {_selected}')
logger.info('#############################################')
logger.info('#############################################')
logger.info(f'Test result: {test_result}')
logger.info('#############################################')
logger.info('#############################################')
logger.info(f'Test result: {test_result}')
logger.info('#############################################')
# If you are working on some simpler benchmark that only evaluates the final model output (e.g., in a MessageAction)
# You can simply get the LAST `MessageAction` from the returned `state.history` and parse it for evaluation.
if state is None:
raise ValueError('State should not be None.')
# If you are working on some simpler benchmark that only evaluates the final model output (e.g., in a MessageAction)
# You can simply get the LAST `MessageAction` from the returned `state.history` and parse it for evaluation.
if state is None:
raise ValueError('State should not be None.')
metrics = state.metrics.get() if state.metrics else None
metrics = state.metrics.get() if state.metrics else None
# Save the output
output = {
'task_id': instance.task_id,
'instance_id': instance.instance_id,
'instruction': instruction,
'metadata': metadata.model_dump(),
'history': state.history.compatibility_for_eval_history_pairs(),
'metrics': metrics,
'error': state.last_error if state and state.last_error else None,
'test_result': {
'result': test_result,
'found_answers': found_answers,
'last_message': final_message,
},
}
except Exception:
logger.error('Process instance failed')
raise
finally:
config.workspace_mount_path = old_workspace_mount_path
config.workspace_base = old_workspace_base
# Save the output
output = EvalOutput(
instance_id=str(instance.instance_id),
instruction=instruction,
metadata=metadata,
history=state.history.compatibility_for_eval_history_pairs(),
metrics=metrics,
error=state.last_error if state and state.last_error else None,
test_result={
'result': test_result,
'found_answers': found_answers,
'last_message': final_message,
},
)
return output
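
Both the tail of `ACTION_FORMAT` and the body of `compare_answers` are cut off in the hunks above, so the following is only a hedged sketch of how a marker-delimited answer could be extracted from `final_message`. The closing `||FINAL_ANSWER>>` marker and the helper name are assumptions for illustration, not the repo's actual code.

```python
# Hypothetical helper: pull out the letter wrapped in the
# <<FINAL_ANSWER|| ... ||FINAL_ANSWER>> markers that ACTION_FORMAT asks for.
import re

def extract_final_answer(message: str) -> str | None:
    match = re.search(r'<<FINAL_ANSWER\|\|(.*?)\|\|FINAL_ANSWER>>', message, re.DOTALL)
    return match.group(1).strip() if match else None

assert extract_final_answer('<<FINAL_ANSWER||\nC\n||FINAL_ANSWER>>') == 'C'
```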
@ -343,8 +322,11 @@ if __name__ == '__main__':
)
args, _ = parser.parse_known_args()
llm_config = get_llm_config_arg(args.llm_config) if args.llm_config else config.llm
logger.info(f'Config for evaluation: {config}')
llm_config = None
if args.llm_config:
llm_config = get_llm_config_arg(args.llm_config)
if llm_config is None:
raise ValueError(f'Could not find LLM config: --llm_config {args.llm_config}')
# NOTE: It is preferable to load datasets from huggingface datasets and perform post-processing
# so we don't need to manage file uploading to OpenDevin's repo
@ -355,8 +337,6 @@ if __name__ == '__main__':
gpqa_dataset = gpqa_dataset.to_pandas()
# Add a new column 'instance_id' with the index
gpqa_dataset['instance_id'] = gpqa_dataset.index
gpqa_dataset['task_id'] = gpqa_dataset.index
# gpqa_dataset = dataset['train'].to_pandas().sort_values(by='id').reset_index(drop=True)
if args.agent_cls != 'CodeActAgent':
raise ValueError(
@ -374,15 +354,14 @@ if __name__ == '__main__':
)
output_file = os.path.join(metadata.eval_output_dir, 'output.jsonl')
prepared_dataset = prepare_dataset(
gpqa_dataset, output_file, args.eval_n_limit, 'task_id'
)
prepared_dataset = prepare_dataset(gpqa_dataset, output_file, args.eval_n_limit)
run_evaluation(
dataset=prepared_dataset,
metadata=metadata,
output_file=output_file,
num_workers=args.eval_num_workers,
process_instance_func=process_instance,
id_column='task_id',
asyncio.run(
run_evaluation(
dataset=prepared_dataset,
metadata=metadata,
output_file=output_file,
num_workers=args.eval_num_workers,
process_instance_func=process_instance,
)
)

View File

@ -1,39 +1,10 @@
# HumanEvalFix Evaluation with OpenDevin
Implements evaluation of agents on HumanEvalFix from the HumanEvalPack benchmark introduced in [OctoPack: Instruction Tuning Code Large Language Models](https://arxiv.org/abs/2308.07124). Please see [here](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/bigcode_eval/tasks/humanevalpack.py) for the reference implementation used in the paper.
Implements evaluation of agents on HumanEvalFix from the HumanEvalPack benchmark introduced in [OctoPack: Instruction Tuning Code Large Language Models](https://arxiv.org/abs/2308.07124). Please see [here](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/bigcode_eval/tasks/humanevalpack.py) for the reference implementation used in the paper. Currently only `python` evaluation is supported.
## Setup Environment
## Setup Environment and LLM Configuration
Please follow [this document](https://github.com/OpenDevin/OpenDevin/blob/main/Development.md) to setup local develop environment for OpenDevin.
## Configure OpenDevin and your LLM
Create a `config.toml` file if it does not exist at the root of the workspace.
Add the following configurations:
```toml
[core]
max_iterations = 100
cache_dir = "/tmp/cache"
ssh_hostname = "localhost"
[sandbox]
enable_auto_lint = true
# TODO: Change these to the model you want to evaluate
[llm.eval_gpt4_1106_preview]
model = "gpt-4-1106-preview"
api_key = "XXX"
temperature = 0.0
[llm.eval_some_openai_compatible_model]
model = "openai/MODEL_NAME"
base_url = "https://OPENAI_COMPATIBLE_URL/v1"
api_key = "XXX"
temperature = 0.0
```
Please follow the instructions [here](../README.md#setup) to set up your local development environment and LLM.
## Run Inference on HumanEvalFix

View File

@ -9,9 +9,9 @@ TODOs:
"""
import asyncio
import logging
import os
import pathlib
import tempfile
from typing import Any
import pandas as pd
from datasets import load_dataset
@ -19,20 +19,25 @@ from evaluate import load
from evaluation.utils.shared import (
EvalMetadata,
EvalOutput,
codeact_user_response,
make_metadata,
prepare_dataset,
reset_logger_for_multiprocessing,
run_evaluation,
)
from opendevin.controller.agent import Agent
from opendevin.controller.state.state import State
from opendevin.core.config import get_llm_config_arg, load_app_config, parse_arguments
from opendevin.core.logger import get_console_handler
from opendevin.core.config import (
AppConfig,
SandboxConfig,
get_llm_config_arg,
parse_arguments,
)
from opendevin.core.logger import opendevin_logger as logger
from opendevin.core.main import run_controller
from opendevin.llm.llm import LLM
config = load_app_config()
from opendevin.core.main import create_runtime, run_controller
from opendevin.events.action import CmdRunAction
from opendevin.events.observation import CmdOutputObservation
from opendevin.runtime.runtime import Runtime
IMPORT_HELPER = {
'python': [
@ -72,19 +77,106 @@ AGENT_CLS_TO_INST_SUFFIX = {
}
def get_test_result(instance, path, language='python', timeout=10):
# Evaluation reference: https://github.com/bigcode-project/bigcode-evaluation-harness/blob/84b96da31b7f840b55c5733325346176140cdb6b/bigcode_eval/tasks/humanevalpack.py#L347
def get_config(
metadata: EvalMetadata,
) -> AppConfig:
config = AppConfig(
default_agent=metadata.agent_class,
run_as_devin=False,
runtime='eventstream',
max_iterations=metadata.max_iterations,
sandbox=SandboxConfig(
container_image='ubuntu:22.04',
enable_auto_lint=True,
use_host_network=False,
update_source_code=True,
),
# do not mount workspace
workspace_base=None,
workspace_mount_path=None,
)
config.set_llm_config(metadata.llm_config)
return config
def _get_instance_id(instance: pd.Series) -> str:
return instance.task_id.replace('/', '__')
async def initialize_runtime(
runtime: Runtime,
instance: pd.Series,  # used to build the problem file that is copied into the sandbox
):
"""Initialize the runtime for the agent.
This function is called before the runtime is used to run the agent.
"""
logger.info(f"{'-' * 50} BEGIN Runtime Initialization Fn {'-' * 50}")
obs: CmdOutputObservation
action = CmdRunAction(command='mkdir -p /workspace')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
action = CmdRunAction(command='cd /workspace')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
problem_statement = (
instance.declaration + instance.buggy_solution + '\n' + instance.test
)
filename = f'{_get_instance_id(instance)}.py'
with tempfile.TemporaryDirectory() as tmpdir:
host_script_path = os.path.join(tmpdir, filename)
with open(host_script_path, 'w') as f:
f.write(problem_statement)
await runtime.copy_to(
host_script_path,
'/workspace',
)
# check file exists
action = CmdRunAction(command=f'ls /workspace/{_get_instance_id(instance)}.py')
obs = await runtime.run_action(action)
assert obs.exit_code == 0
logger.info(f"{'-' * 50} END Runtime Initialization Fn {'-' * 50}")
async def complete_runtime(
runtime: Runtime,
instance: pd.Series,  # used to locate the instance's script inside the sandbox
) -> dict[str, Any]:
"""Complete the runtime for the agent.
This function is called after the agent has finished running.
If you need to do something in the sandbox to get the correctness metric after
the agent has run, modify this function.
"""
logger.info(f"{'-' * 50} BEGIN Runtime Completion Fn {'-' * 50}")
obs: CmdOutputObservation
# default value
language = 'python'
timeout = 10
test_result = {'result': {}, 'metadata': {}}
code_metric = load('Muennighoff/code_eval_octopack')
timeout = LANGUAGE_TO_TIMEOUT[language]
num_workers = LANGUAGE_TO_NUM_WORKERS[language]
python_imports = '\n'.join(IMPORT_HELPER[language])
# Load function from path
with open(path, 'r') as f:
function = f.read()
action = CmdRunAction(
command=f'cat /workspace/{_get_instance_id(instance)}.py', keep_prompt=False
)
obs = await runtime.run_action(action)
assert obs.exit_code == 0
function = [[python_imports + '\n' + function.strip()]]
function = obs.content.replace('\r\n', '\n')
logger.info(f'Function: {function}')
function = [[python_imports + '\n' + function]]
results, logs = code_metric.compute(
references=[instance.test],
@ -99,129 +191,79 @@ def get_test_result(instance, path, language='python', timeout=10):
'timeout': timeout,
'num_workers': num_workers,
}
logger.info(f"{'-' * 50} END Runtime Completion Fn {'-' * 50}")
return test_result
def process_instance(
async def process_instance(
instance: pd.Series,
metadata: EvalMetadata,
reset_logger: bool = True,
):
# Create the agent
agent = Agent.get_cls(metadata.agent_class)(llm=LLM(config=metadata.llm_config))
old_workspace_mount_path = config.workspace_mount_path
old_workspace_base = config.workspace_base
) -> EvalOutput:
config = get_config(metadata)
# use a session id for concurrent evaluation
sid = _get_instance_id(instance)
try:
workspace_mount_path = os.path.join(
config.workspace_mount_path, '_eval_workspace'
)
# create process-specific workspace dir
workspace_mount_path = os.path.join(workspace_mount_path, str(os.getpid()))
pathlib.Path(workspace_mount_path).mkdir(parents=True, exist_ok=True)
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
if reset_logger:
log_dir = os.path.join(metadata.eval_output_dir, 'infer_logs')
reset_logger_for_multiprocessing(logger, instance.task_id, log_dir)
else:
logger.info(f'Starting evaluation for instance {instance.task_id}.')
# reset workspace to config
config.workspace_base = workspace_mount_path
config.workspace_mount_path = workspace_mount_path
# Create file with HumanEvalFix problem
# Prompt reference: https://github.com/bigcode-project/bigcode-evaluation-harness/blob/84b96da31b7f840b55c5733325346176140cdb6b/bigcode_eval/tasks/humanevalpack.py#L509
problem_statement = (
instance.declaration + instance.buggy_solution + '\n' + instance.test
)
# use a session id for concurrent evaluation
sid = instance.task_id.replace('/', '__')
# Prepare instruction
instruction = (
f'Please fix the function in {sid}.py such that all test cases pass.\n'
'Environment has been set up for you to start working. You may assume all necessary tools are installed.\n\n'
'# Problem Statement\n'
f'{problem_statement}\n\n'
)
instruction += (
'IMPORTANT: You should ONLY interact with the environment provided to you AND NEVER ASK FOR HUMAN HELP.\n'
'You should NOT modify any existing test case files. If needed, you can add new test cases in a NEW file to reproduce the issue.\n'
'You SHOULD INCLUDE PROPER INDENTATION in your edit commands.\n'
)
# NOTE: You can actually set slightly different instruction for different agents
instruction += AGENT_CLS_TO_INST_SUFFIX[metadata.agent_class]
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
if reset_logger:
# Set up logger
log_file = os.path.join(
metadata.eval_output_dir,
'logs',
f'instance_{sid}.log',
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
# add back the console handler to print ONE line
logger.addHandler(get_console_handler())
logger.info(
f'Starting evaluation for instance {instance.task_id}.\nLOG: tail -f {log_file}'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
file_handler = logging.FileHandler(log_file)
file_handler.setFormatter(
logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
)
logger.addHandler(file_handler)
# Here's how you can run the agent (similar to the `main` function) and get the final task state
runtime = await create_runtime(config, sid=sid)
await initialize_runtime(runtime, instance)
state: State | None = await run_controller(
config=config,
task_str=instruction,
runtime=runtime,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN.get(
metadata.agent_class
),
)
logger.info(f'Process-specific workspace mounted at {workspace_mount_path}')
if state is None:
raise ValueError('State should not be None.')
metrics = state.metrics.get() if state.metrics else None
test_result = await complete_runtime(runtime, instance)
# Create file with HumanEvalFix problem
# Prompt reference: https://github.com/bigcode-project/bigcode-evaluation-harness/blob/84b96da31b7f840b55c5733325346176140cdb6b/bigcode_eval/tasks/humanevalpack.py#L509
problem_statement = (
instance.declaration + instance.buggy_solution + '\n' + instance.test
)
path = os.path.join(workspace_mount_path, f'{sid}.py')
with open(path, 'w') as f:
f.write(problem_statement)
# history is now available as a stream of events, rather than list of pairs of (Action, Observation)
# for compatibility with the existing output format, we can remake the pairs here
# remove when it becomes unnecessary
histories = state.history.compatibility_for_eval_history_pairs()
# Prepare instruction
instruction = (
f'Please fix the function in {instance.task_id.replace("/", "__")}.py such that all test cases pass.\n'
'Environment has been set up for you to start working. You may assume all necessary tools are installed.\n\n'
'# Problem Statement\n'
f'{problem_statement}\n\n'
)
instruction += (
'IMPORTANT: You should ONLY interact with the environment provided to you AND NEVER ASK FOR HUMAN HELP.\n'
'You should NOT modify any existing test case files. If needed, you can add new test cases in a NEW file to reproduce the issue.\n'
'You SHOULD INCLUDE PROPER INDENTATION in your edit commands.\n'
)
# NOTE: You can actually set slightly different instruction for different agents
instruction += AGENT_CLS_TO_INST_SUFFIX[agent.__class__.__name__]
# Here's how you can run the agent (similar to the `main` function) and get the final task state
config.max_iterations = metadata.max_iterations
state: State | None = asyncio.run(
run_controller(
config=config,
task_str=instruction,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN.get(
agent.__class__.__name__
),
agent=agent,
sid=sid,
)
)
# ======= Attempt to evaluate the agent's edits =======
test_result = get_test_result(instance, path)
# If you are working on some simpler benchmark that only evaluates the final model output (e.g., in a MessageAction)
# You can simply get the LAST `MessageAction` from the returned `state.history` and parse it for evaluation.
if state is None:
raise ValueError('State should not be None.')
metrics = state.metrics.get() if state.metrics else None
# history is now available as a stream of events, rather than list of pairs of (Action, Observation)
# for compatibility with the existing output format, we can remake the pairs here
# remove when it becomes unnecessary
histories = state.history.compatibility_for_eval_history_pairs()
# Save the output
output = {
'task_id': instance.task_id,
'instruction': instruction,
'metadata': metadata.model_dump(),
'history': histories,
'metrics': metrics,
'error': state.last_error if state and state.last_error else None,
'test_result': test_result,
}
except Exception:
logger.error('Process instance failed')
raise
finally:
config.workspace_mount_path = old_workspace_mount_path
config.workspace_base = old_workspace_base
# Save the output
output = EvalOutput(
instance_id=instance.task_id,
instruction=instruction,
metadata=metadata,
history=histories,
metrics=metrics,
error=state.last_error if state and state.last_error else None,
test_result=test_result,
)
return output
@ -234,28 +276,31 @@ if __name__ == '__main__':
'bigcode/humanevalpack', 'python'
) # TODO: Support other languages
hefix_tests = dataset['test'].to_pandas()
hefix_tests.rename(columns={'task_id': 'instance_id'}, inplace=True)
id_column = 'task_id'
llm_config = get_llm_config_arg(args.llm_config) if args.llm_config else config.llm
logger.info(f'Config for evaluation: {config}')
llm_config = None
if args.llm_config:
llm_config = get_llm_config_arg(args.llm_config)
if llm_config is None:
raise ValueError(f'Could not find LLM config: --llm_config {args.llm_config}')
metadata = make_metadata(
llm_config,
args.dataset_name,
'humanevalfix-python',
args.agent_cls,
args.max_iterations,
args.eval_note,
args.eval_output_dir,
)
output_file = os.path.join(metadata.eval_output_dir, 'output.jsonl')
instances = prepare_dataset(dataset, output_file, args.eval_n_limit, id_column)
instances = prepare_dataset(hefix_tests, output_file, args.eval_n_limit)
run_evaluation(
instances,
metadata,
output_file,
args.eval_num_workers,
process_instance,
id_column,
asyncio.run(
run_evaluation(
instances,
metadata,
output_file,
args.eval_num_workers,
process_instance,
)
)

evaluation/humanevalfix/scripts/run_infer.sh Normal file → Executable file
View File

@ -0,0 +1,7 @@
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y python3 python3-pip
RUN pip install scitools-pyke
# docker build -t xingyaoww/od_logic_reasoning .

View File

@ -2,38 +2,13 @@
This folder contains the evaluation harness for evaluating agents on the logic reasoning benchmarks [ProntoQA](https://github.com/asaparov/prontoqa) and [ProofWriter](https://allenai.org/data/proofwriter).
## Configure OpenDevin and your LLM
## Setup Environment and LLM Configuration
Create a `config.toml` file if it does not exist at the root of the workspace.
Add the following configurations:
```toml
[core]
max_iterations = 100
cache_dir = "/tmp/cache"
ssh_hostname = "localhost"
[sandbox]
enable_auto_lint = true
# TODO: Change these to the model you want to evaluate
[llm.eval_gpt4_1106_preview_llm]
model = "gpt-4-1106-preview"
api_key = "XXX"
temperature = 0.0
[llm.eval_some_openai_compatible_model_llm]
model = "openai/MODEL_NAME"
base_url = "https://OPENAI_COMPATIBLE_URL/v1"
api_key = "XXX"
temperature = 0.0
```
Please follow the instructions [here](../README.md#setup) to set up your local development environment and LLM.
## Run Inference on logic_reasoning
The following code will run inference on the first example of the ProntoQA dataset,
using OpenDevin 0.6.2 version.
The following code will run inference on the first example of the ProofWriter dataset,
```bash
./evaluation/logic_reasoning/scripts/run_infer.sh ProntoQA eval_gpt4_1106_preview_llm 0.6.2 1
./evaluation/logic_reasoning/scripts/run_infer.sh eval_gpt4_1106_preview_llm ProofWriter
```

View File

@ -3,12 +3,12 @@ you can interact with an interactive Python (Jupyter Notebook) environment and r
In this task, you need to use the code in [[logic_inference_path.py]] to help you. Specifically, you first need to instantiate a **LogicInferenceEngine** class and use the **safe_execute_program** method to prove the **logic programs**. You should receive *answer*, *flag*, *error_message* from the output.
An example would look like this:
<execute_ipython>
import sys
sys.path.append(workspace_mount_path)
engine = LogicInferenceEngine(dataset_name, workspace_mount_path)
answer, flag, error_message = engine.safe_execute_program(logic_programs)
</execute_ipython>
<execute_ipython>
import sys
sys.path.append('/workspace')
engine = LogicInferenceEngine()
answer, flag, error_message = engine.safe_execute_program(logic_programs)
</execute_ipython>
Please send the *answer* variable through message.

View File

@ -191,9 +191,9 @@ class PykeProgram:
class LogicInferenceEngine:
def __init__(self, dataset_name, workspace_mount_path):
self.dataset_name = dataset_name
self.workspace_mount_path = workspace_mount_path
def __init__(self):
self.dataset_name = os.environ.get('DATASET_NAME', 'ProofWriter')
self.workspace_mount_path = '/workspace'
def random_backup(self):
if self.dataset_name == 'ProntoQA':
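
A minimal sketch of the new zero-argument `LogicInferenceEngine` interface follows, assuming the module is importable as `logic_inference` once `initialize_runtime` has copied it to `/workspace` and exported `DATASET_NAME`; the import path and the logic-programs string are placeholders/assumptions, not code from the diff.

```python
# Hedged sketch of the env-driven engine (mirrors the updated instruction.txt):
# DATASET_NAME is normally injected by the harness via runtime.add_env_vars,
# and logic_inference.py is copied into /workspace by initialize_runtime.
import os
import sys

os.environ.setdefault('DATASET_NAME', 'ProofWriter')  # normally set by the harness
sys.path.append('/workspace')

from logic_inference import LogicInferenceEngine  # assumed module name in the sandbox

engine = LogicInferenceEngine()
answer, flag, error_message = engine.safe_execute_program('...logic programs here...')
print(answer, flag, error_message)
```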

View File

@ -1,29 +1,35 @@
import asyncio
import logging
import os
import pathlib
import shutil
import pandas as pd
from datasets import load_dataset
from evaluation.swe_bench.swe_env_box import DockerSSHBox
from evaluation.utils.shared import (
EvalMetadata,
EvalOutput,
codeact_user_response,
make_metadata,
prepare_dataset,
reset_logger_for_multiprocessing,
run_evaluation,
)
from opendevin.controller.agent import Agent
from opendevin.controller.state.state import State
from opendevin.core.config import get_llm_config_arg, get_parser, load_app_config
from opendevin.core.logger import get_console_handler
from opendevin.core.config import (
AppConfig,
SandboxConfig,
get_llm_config_arg,
get_parser,
)
from opendevin.core.logger import opendevin_logger as logger
from opendevin.core.main import run_controller
from opendevin.llm.llm import LLM
config = load_app_config()
from opendevin.core.main import create_runtime, run_controller
from opendevin.events.action import (
AgentFinishAction,
CmdRunAction,
IPythonRunCellAction,
MessageAction,
)
from opendevin.events.observation import CmdOutputObservation
from opendevin.runtime.runtime import Runtime
AGENT_CLS_TO_FAKE_USER_RESPONSE_FN = {
'CodeActAgent': codeact_user_response,
@ -34,6 +40,29 @@ AGENT_CLS_TO_INST_SUFFIX = {
}
def get_config(
metadata: EvalMetadata,
) -> AppConfig:
config = AppConfig(
default_agent=metadata.agent_class,
run_as_devin=False,
runtime='eventstream',
max_iterations=metadata.max_iterations,
sandbox=SandboxConfig(
container_image='xingyaoww/od-eval-logic-reasoning:v1.0',
enable_auto_lint=True,
use_host_network=False,
update_source_code=True,
od_runtime_extra_deps='$OD_INTERPRETER_PATH -m pip install scitools-pyke',
),
# do not mount workspace
workspace_base=None,
workspace_mount_path=None,
)
config.set_llm_config(metadata.llm_config)
return config
def get_choice(answer_str):
choices = [
'A',
@ -83,7 +112,7 @@ def get_test_result(
'the correct answer is',
'The correct answer is',
'The correct option is',
'Thus, the answer is',
'the answer is',
]
if prediction is None:
for indicator in indicators:
@ -97,162 +126,143 @@ def get_test_result(
return test_result
def process_instance(
CUR_EVAL_DIR = os.path.dirname(__file__)
async def initialize_runtime(
runtime: Runtime,
instance: pd.Series, # this argument is not required
):
"""Initialize the runtime for the agent.
This function is called before the runtime is used to run the agent.
"""
logger.info(f"{'-' * 50} BEGIN Runtime Initialization Fn {'-' * 50}")
obs: CmdOutputObservation
# Prepare the /workspace directory
action = CmdRunAction(command='mkdir -p /workspace')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
action = CmdRunAction(command='cd /workspace')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
# copy logic_inference.py to /workspace
await runtime.copy_to(
os.path.join(CUR_EVAL_DIR, 'logic_inference.py'), '/workspace'
)
# check if the file exists
obs = await runtime.run_action(CmdRunAction(command='ls /workspace'))
assert obs.exit_code == 0
assert 'logic_inference.py' in obs.content
await runtime.add_env_vars({'DATASET_NAME': metadata.dataset})
action = CmdRunAction(command='mkdir -p /workspace/.cache_program')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
action = IPythonRunCellAction(code='%pip install scitools-pyke')
logger.info(action, extra={'msg_type': 'ACTION'})
ipynb_obs = await runtime.run_action(action)
logger.info(ipynb_obs, extra={'msg_type': 'OBSERVATION'})
logger.info(f"{'-' * 50} END Runtime Initialization Fn {'-' * 50}")
# Prepare instruction
with open(os.path.join(CUR_EVAL_DIR, 'instruction.txt'), 'r') as f:
INSTRUCTION_TEMPLATE = f.read()
async def process_instance(
instance: pd.Series,
metadata: EvalMetadata,
reset_logger: bool = True,
):
# Create the agent
agent = Agent.get_cls(metadata.agent_class)(llm=LLM(config=metadata.llm_config))
old_workspace_mount_path = config.workspace_mount_path
old_workspace_base = config.workspace_base
config = get_config(metadata)
try:
workspace_mount_path = os.path.join(
config.workspace_mount_path, '_eval_workspace'
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
if reset_logger:
log_dir = os.path.join(metadata.eval_output_dir, 'infer_logs')
reset_logger_for_multiprocessing(logger, instance['instance_id'], log_dir)
else:
logger.info(f'Starting evaluation for instance {instance["instance_id"]}.')
instance_logic_programs = instance['raw_logic_programs'][0].strip()
instruction = (
INSTRUCTION_TEMPLATE.replace('[[dataset_name]]', dataset_name)
.replace('[[logic_programs]]', instance_logic_programs)
.replace('[[logic_inference_path.py]]', '/workspace/logic_inference.py')
)
# NOTE: You can actually set slightly different instruction for different agents
instruction += AGENT_CLS_TO_INST_SUFFIX[metadata.agent_class]
# use a session id for concurrent evaluation
sid = instance['instance_id']
runtime = await create_runtime(config, sid=sid)
await initialize_runtime(runtime, instance)
# Here's how you can run the agent (similar to the `main` function) and get the final task state
state: State | None = asyncio.run(
run_controller(
config=config,
task_str=instruction,
runtime=runtime,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN.get(
metadata.agent_class
),
)
# create process-specific workspace dir
workspace_mount_path = os.path.join(workspace_mount_path, str(os.getpid()))
pathlib.Path(workspace_mount_path).mkdir(parents=True, exist_ok=True)
)
# ======= Attempt to evaluate the agent's edits =======
# If you are working on simpler benchmark that only evaluates the final model output (e.g., in a MessageAction)
# You can simply get the LAST `MessageAction` from the returned `state.history` and parse it for evaluation.
# reset workspace to config
config.workspace_base = workspace_mount_path
config.workspace_mount_path = workspace_mount_path
if state is None:
raise ValueError('State should not be None.')
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
if reset_logger:
# Set up logger
log_file = os.path.join(
metadata.eval_output_dir, 'logs', f'instance_{instance["id"]}.log'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
# add back the console handler to print ONE line
logger.addHandler(get_console_handler())
logger.info(
f'Starting evaluation for instance {instance["id"]}.\nLOG: tail -f {log_file}'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
file_handler = logging.FileHandler(log_file)
file_handler.setFormatter(
logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
)
logger.addHandler(file_handler)
final_message = ''
for event in state.history.get_events(reverse=True):
if isinstance(event, AgentFinishAction):
final_message = event.thought
break
elif isinstance(event, MessageAction):
final_message = event.content
break
logger.info(f'Process-specific workspace mounted at {workspace_mount_path}')
final_message = final_message.strip("'")
logger.info(
f'Predicted answer: {final_message}, Ground truth: {instance["answer"]}'
)
# sandbox = DockerSSHBox()
logic_inference_path = os.path.join(workspace_mount_path, 'logic_inference.py')
if not os.path.exists(logic_inference_path):
shutil.copyfile(
'./evaluation/logic_reasoning/logic_inference.py', logic_inference_path
)
logger.info(f'logic_inference.py copied to {workspace_mount_path}')
test_result = get_test_result(
model_answer=final_message, ground_truth=instance['answer']
)
test_result['final_message'] = final_message
cache_dir = os.path.join(workspace_mount_path, '.cache_program')
if not os.path.exists(cache_dir):
os.makedirs(cache_dir)
# Prepare instruction
with open('./evaluation/logic_reasoning/instruction.txt', 'r') as f:
instruction = f.read()
instance_logic_programs = instance['raw_logic_programs'][0].strip()
instruction = instruction.replace('[[dataset_name]]', dataset_name)
instruction = instruction.replace('[[logic_programs]]', instance_logic_programs)
instruction = instruction.replace(
'[[logic_inference_path.py]]', logic_inference_path
)
# NOTE: You can actually set slightly different instruction for different agents
instruction += AGENT_CLS_TO_INST_SUFFIX[agent.__class__.__name__]
# use a session id for concurrent evaluation
sid = instance['id'] + '_' + str(os.getpid())
sandbox = DockerSSHBox(
config=config.sandbox,
persist_sandbox=False,
workspace_mount_path=config.workspace_mount_path,
sandbox_workspace_dir=config.workspace_mount_path_in_sandbox,
cache_dir=config.cache_dir,
run_as_devin=config.run_as_devin,
sid=sid,
)
exit_code, command_output = sandbox.execute('pip install scitools-pyke')
# Here's how you can run the agent (similar to the `main` function) and get the final task state
config.max_iterations = metadata.max_iterations
state: State | None = asyncio.run(
run_controller(
config=config,
task_str=instruction,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN.get(
agent.__class__.__name__
),
agent=agent,
sandbox=sandbox,
sid=sid,
)
)
# ======= Attempt to evaluate the agent's edits =======
# If you are working on simpler benchmark that only evaluates the final model output (e.g., in a MessageAction)
# You can simply get the LAST `MessageAction` from the returned `state.history` and parse it for evaluation.
if state is None:
raise ValueError('State should not be None.')
final_message = ''
messages = []
for event in state.history.get_events(reverse=True):
# will this be a MessageAction?
# TODO we can filter for types of events if we know what to expect
messages.append(event.content)
if str(event.content) in ["'A'", "'B'", "'C'"]:
final_message = event.content
break
final_message = final_message.strip("'")
logger.info(
f'Predicted answer: {final_message}, Ground truth: {instance["answer"]}'
)
test_result = get_test_result(
model_answer=final_message, ground_truth=instance['answer']
)
metrics = state.metrics.get() if state.metrics else None
# history is now available as a stream of events, rather than list of pairs of (Action, Observation)
# for compatibility with the existing output format, we can remake the pairs here
# remove when it becomes unnecessary
histories = state.history.compatibility_for_eval_history_pairs()
# Save the output
output = {
'id': instance['id'],
'instance': instance,
'instruction': instruction,
# 'metadata': metadata.model_dump(),
'history': histories,
'metrics': metrics,
'final_message': final_message,
'messages': messages,
'error': state.last_error if state and state.last_error else None,
'test_result': test_result,
}
except Exception:
logger.error('Process instance failed')
raise
finally:
config.workspace_mount_path = old_workspace_mount_path
config.workspace_base = old_workspace_base
# Close the sandbox
sandbox.close()
metrics = state.metrics.get() if state.metrics else None
# history is now available as a stream of events, rather than list of pairs of (Action, Observation)
# for compatibility with the existing output format, we can remake the pairs here
# remove when it becomes unnecessary
histories = state.history.compatibility_for_eval_history_pairs()
# Save the output
output = EvalOutput(
instance_id=instance['instance_id'],
instruction=instruction,
metadata=metadata,
history=histories,
metrics=metrics,
error=state.last_error if state and state.last_error else None,
test_result=test_result,
)
return output
@ -262,7 +272,7 @@ if __name__ == '__main__':
'--dataset',
type=str,
help='the logic reasoning dataset to evaluate on {ProntoQA, ProofWriter}',
default='ProntoQA',
default='ProofWriter',
)
parser.add_argument(
'--data_split',
@ -270,36 +280,32 @@ if __name__ == '__main__':
help='data split to evaluate on {validation}', # right now we only support validation split
default='validation',
)
args, _ = parser.parse_known_args()
if args.directory:
config.workspace_base = os.path.abspath(args.directory)
print(f'Setting workspace base to {config.workspace_base}')
dataset_name = args.dataset
data_split = args.data_split
dataset = load_dataset(f'renma/{dataset_name}')
logic_reasoning_tests = dataset[data_split]
dataset_df = dataset[data_split].to_pandas()
dataset_df.rename(columns={'id': 'instance_id'}, inplace=True)
id_column = 'id'
llm_config = get_llm_config_arg(args.llm_config) if args.llm_config else config.llm
logger.info(f'Config for evaluation: {config}')
llm_config = None
if args.llm_config:
llm_config = get_llm_config_arg(args.llm_config)
if llm_config is None:
raise ValueError(f'Could not find LLM config: --llm_config {args.llm_config}')
metadata = make_metadata(
llm_config,
args.dataset_name,
dataset_name,
args.agent_cls,
args.max_iterations,
args.eval_note,
args.eval_output_dir,
)
output_file = os.path.join(metadata.eval_output_dir, 'output.jsonl')
instances = prepare_dataset(dataset, output_file, args.eval_n_limit, id_column)
run_evaluation(
instances,
metadata,
output_file,
args.eval_num_workers,
process_instance,
id_column,
instances = prepare_dataset(dataset_df, output_file, args.eval_n_limit)
asyncio.run(
run_evaluation(
instances, metadata, output_file, args.eval_num_workers, process_instance
)
)

evaluation/logic_reasoning/scripts/run_infer.sh Normal file → Executable file

@ -3,8 +3,8 @@ set -eo pipefail
source "evaluation/utils/version_control.sh"
DATASET=$1
MODEL_CONFIG=$2
MODEL_CONFIG=$1
DATASET=$2
COMMIT_HASH=$3
EVAL_LIMIT=$4
AGENT=$5
@ -23,6 +23,11 @@ if [ -z "$AGENT" ]; then
AGENT="CodeActAgent"
fi
if [ -z "$DATASET" ]; then
echo "Dataset not specified, use default ProofWriter"
DATASET="ProofWriter"
fi
get_agent_version
echo "AGENT: $AGENT"


@ -0,0 +1,10 @@
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y python3 python3-pip git
RUN git clone https://github.com/Farama-Foundation/miniwob-plusplus.git /miniwob-plusplus && \
git -C "/miniwob-plusplus" reset --hard 7fd85d71a4b60325c6585396ec4f48377d049838
ENV MINIWOB_URL="file:///miniwob-plusplus/miniwob/html/miniwob/"
# docker build -t xingyaoww/od-eval-miniwob .


@ -2,52 +2,9 @@
This folder contains the evaluation harness for the [MiniWoB++](https://miniwob.farama.org/) benchmark, powered by [BrowserGym](https://github.com/ServiceNow/BrowserGym), which makes it easy to evaluate how well a browsing-capable agent performs on synthetic web browsing tasks.
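As a quick orientation, the sketch below (mirroring how `run_infer.py` further down builds its task list) enumerates the MiniWoB++ tasks that BrowserGym registers as gym environments; it assumes `browsergym` and its miniwob extra are installed in your environment.
```python
# Minimal sketch: list the MiniWoB++ tasks BrowserGym registers as gym environments.
import browsergym.miniwob  # noqa: F401  # importing registers the miniwob tasks
import gymnasium as gym

miniwob_task_ids = [
    env_id
    for env_id in gym.envs.registry.keys()
    if env_id.startswith('browsergym/miniwob')
]
print(f'{len(miniwob_task_ids)} MiniWoB++ tasks registered, e.g. {miniwob_task_ids[:3]}')
```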
## Setup OpenDevin Environment
## Setup Environment and LLM Configuration
Please follow [this document](https://github.com/OpenDevin/OpenDevin/blob/main/Development.md) to set up a local development environment for OpenDevin.
## Configure OpenDevin and your LLM
Create a `config.toml` file if it does not exist at the root of the workspace.
Add the following configurations:
```toml
[core]
max_iterations = 100
cache_dir = "/tmp/cache"
ssh_hostname = "localhost"
[sandbox]
box_type = "ssh"
timeout = 120
# TODO: Change these to the model you want to evaluate
[llm.eval_gpt4_1106_preview]
model = "gpt-4-1106-preview"
api_key = "XXX"
temperature = 0.0
[llm.eval_some_openai_compatible_model]
model = "openai/MODEL_NAME"
base_url = "https://OPENAI_COMPATIBLE_URL/v1"
api_key = "XXX"
temperature = 0.0
```
## Setup MiniWoB++ Environment and Environment Variables of MiniWoB++
MiniWoB++ requires you to serve a static copy of its website that is accessible via URL to the machine running the OpenDevin agents.
- Clone miniwob (use a specific frozen commit for reproducibility)
```sh
git clone git@github.com:Farama-Foundation/miniwob-plusplus.git
git -C "./miniwob-plusplus" reset --hard 7fd85d71a4b60325c6585396ec4f48377d049838
```
- Setup Miniwob URL (change `PATH_TO_MINIWOB_CLONED_REPO` here to the absolute path to your `miniwob-plusplus` folder) in `evaluation/miniwob/scripts/run_infer.sh`
```sh
export MINIWOB_URL="file://<PATH_TO_MINIWOB_CLONED_REPO>/miniwob/html/miniwob/"
```
Please follow the instructions [here](../README.md#setup) to set up your local development environment and LLM.
## Test if your environment works
@ -56,7 +13,7 @@ Access with browser the above MiniWoB URLs and see if they load correctly.
## Run Evaluation
```sh
bash evaluation/miniwob/scripts/run_infer.sh
./evaluation/miniwob/scripts/run_infer.sh llm.claude-35-sonnet-eval
```
Results will be in `evaluation/evaluation_outputs/outputs/miniwob/`
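Each line of `output.jsonl` is one serialized `EvalOutput`. A minimal sketch for aggregating rewards, assuming the `test_result = {'reward': ...}` shape written by `run_infer.py` below (adjust the path to your actual run directory):
```python
import json

# Hypothetical path; point this at your run's output file.
output_file = 'evaluation/evaluation_outputs/outputs/miniwob/output.jsonl'

rewards = []
with open(output_file) as f:
    for line in f:
        record = json.loads(line)
        rewards.append(record['test_result']['reward'])

print(f'{sum(r > 0 for r in rewards)}/{len(rewards)} tasks solved (positive reward)')
```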


@ -1,7 +1,7 @@
import asyncio
import json
import logging
import os
from typing import Any
import browsergym.miniwob # noqa F401 register miniwob tasks as gym environments
import gymnasium as gym
@ -9,91 +9,132 @@ import pandas as pd
from evaluation.utils.shared import (
EvalMetadata,
EvalOutput,
make_metadata,
prepare_dataset,
reset_logger_for_multiprocessing,
run_evaluation,
)
from opendevin.controller.agent import Agent
from opendevin.controller.state.state import State
from opendevin.core.config import get_llm_config_arg, load_app_config, parse_arguments
from opendevin.core.logger import get_console_handler
from opendevin.core.config import (
AppConfig,
SandboxConfig,
get_llm_config_arg,
parse_arguments,
)
from opendevin.core.logger import opendevin_logger as logger
from opendevin.core.main import run_controller
from opendevin.llm.llm import LLM
from opendevin.runtime.docker.ssh_box import DockerSSHBox
from opendevin.runtime.tools import RuntimeTool
config = load_app_config()
from opendevin.core.main import create_runtime, run_controller
from opendevin.events.action import (
BrowseInteractiveAction,
CmdRunAction,
MessageAction,
)
from opendevin.events.observation import CmdOutputObservation
from opendevin.runtime.browser.browser_env import (
BROWSER_EVAL_GET_GOAL_ACTION,
BROWSER_EVAL_GET_REWARDS_ACTION,
)
from opendevin.runtime.runtime import Runtime
SUPPORTED_AGENT_CLS = {'BrowsingAgent'}
docker_ssh_box: DockerSSHBox | None = None
def get_config(
metadata: EvalMetadata,
env_id: str,
) -> AppConfig:
config = AppConfig(
default_agent=metadata.agent_class,
run_as_devin=False,
runtime='eventstream',
max_iterations=metadata.max_iterations,
sandbox=SandboxConfig(
container_image='xingyaoww/od-eval-miniwob:v1.0',
enable_auto_lint=True,
use_host_network=False,
update_source_code=True,
browsergym_eval_env=env_id,
),
# do not mount workspace
workspace_base=None,
workspace_mount_path=None,
)
config.set_llm_config(metadata.llm_config)
return config
def get_sandbox():
global docker_ssh_box
if docker_ssh_box is None:
docker_ssh_box = DockerSSHBox(
config=config.sandbox,
persist_sandbox=False,
workspace_mount_path=config.workspace_mount_path,
sandbox_workspace_dir=config.workspace_mount_path_in_sandbox,
cache_dir=config.cache_dir,
run_as_devin=config.run_as_devin,
)
return docker_ssh_box
async def initialize_runtime(
runtime: Runtime,
) -> str:
"""Initialize the runtime for the agent.
This function is called before the runtime is used to run the agent.
"""
logger.info(f"{'-' * 50} BEGIN Runtime Initialization Fn {'-' * 50}")
obs: CmdOutputObservation
# Create the workspace directory
action = CmdRunAction(command='mkdir -p /workspace')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
action = BrowseInteractiveAction(browser_actions=BROWSER_EVAL_GET_GOAL_ACTION)
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
goal = obs.content
logger.info(f"{'-' * 50} END Runtime Initialization Fn {'-' * 50}")
return goal
def process_instance(
async def complete_runtime(
runtime: Runtime,
) -> dict[str, Any]:
"""Complete the runtime for the agent.
This function is called after the agent has finished running.
If you need to do something in the sandbox to get the correctness metric after
the agent has run, modify this function.
"""
logger.info(f"{'-' * 50} BEGIN Runtime Completion Fn {'-' * 50}")
obs: CmdOutputObservation
action = BrowseInteractiveAction(browser_actions=BROWSER_EVAL_GET_REWARDS_ACTION)
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
logger.info(f"{'-' * 50} END Runtime Completion Fn {'-' * 50}")
return {
'rewards': json.loads(obs.content),
}
async def process_instance(
instance: pd.Series,
metadata: EvalMetadata,
reset_logger: bool = True,
):
# Create the agent
agent = Agent.get_cls(metadata.agent_class)(llm=LLM(config=metadata.llm_config))
) -> EvalOutput:
env_id = instance.id
config = get_config(metadata, env_id)
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
if reset_logger:
# Set up logger
log_file = os.path.join(
metadata.eval_output_dir, 'logs', f'instance_{env_id}.log'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
# add back the console handler to print ONE line
logger.addHandler(get_console_handler())
logger.info(
f'Starting evaluation for instance {env_id}.\nHint: run "tail -f {log_file}" to see live logs in a separate shell'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
file_handler = logging.FileHandler(log_file)
file_handler.setFormatter(
logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
)
logger.addHandler(file_handler)
log_dir = os.path.join(metadata.eval_output_dir, 'infer_logs')
reset_logger_for_multiprocessing(logger, env_id, log_dir)
else:
logger.info(f'Starting evaluation for instance {env_id}.')
# Here's how you can run the agent (similar to the `main` function) and get the final task state
runtime_tools_config = {
RuntimeTool.BROWSER: {
'browsergym_eval': env_id,
'browsergym_eval_save_dir': metadata.eval_output_dir,
}
}
runtime = await create_runtime(config, sid=env_id)
task_str = await initialize_runtime(runtime)
config.max_iterations = metadata.max_iterations
state: State | None = asyncio.run(
run_controller(
config=config,
task_str='PLACEHOLDER_GOAL',
runtime_tools_config=runtime_tools_config,
agent=agent,
sandbox=get_sandbox(),
sid=env_id,
task_str=task_str, # take output from initialize_runtime
runtime=runtime,
)
)
@ -106,18 +147,17 @@ def process_instance(
raise ValueError('State should not be None.')
metrics = state.metrics.get() if state.metrics else None
browsergym_eval_dir = os.path.join(metadata.eval_output_dir, env_id.split('/')[1])
# read goal
with open(
os.path.join(browsergym_eval_dir, 'goal.txt'), 'r', encoding='utf-8'
) as f:
instruction = f.read()
# read reward
with open(
os.path.join(browsergym_eval_dir, 'rewards.json'), 'r', encoding='utf-8'
) as f:
rewards = json.load(f)
reward = max(rewards)
# Instruction is the first message from the USER
instruction = ''
for event in state.history.get_events():
if isinstance(event, MessageAction):
instruction = event.content
break
return_val = await complete_runtime(runtime)
logger.info(f'Return value from complete_runtime: {return_val}')
reward = max(return_val['rewards'])
# history is now available as a stream of events, rather than list of pairs of (Action, Observation)
# for compatibility with the existing output format, we can remake the pairs here
@ -125,16 +165,17 @@ def process_instance(
histories = state.history.compatibility_for_eval_history_pairs()
# Save the output
output = {
'instance_id': env_id,
'instruction': instruction,
'metadata': metadata.model_dump(),
'history': histories,
'metrics': metrics,
'error': state.last_error if state and state.last_error else None,
'test_result': reward,
}
output = EvalOutput(
instance_id=env_id,
instruction=instruction,
metadata=metadata,
history=histories,
metrics=metrics,
error=state.last_error if state and state.last_error else None,
test_result={
'reward': reward,
},
)
return output
@ -143,7 +184,7 @@ if __name__ == '__main__':
dataset = pd.DataFrame(
{
'id': [
'instance_id': [
id
for id in gym.envs.registry.keys()
if id.startswith('browsergym/miniwob')
@ -151,26 +192,25 @@ if __name__ == '__main__':
}
)
id_column = 'id'
llm_config = get_llm_config_arg(args.llm_config) if args.llm_config else config.llm
logger.info(f'Config for evaluation: {config}')
llm_config = None
if args.llm_config:
llm_config = get_llm_config_arg(args.llm_config)
if llm_config is None:
raise ValueError(f'Could not find LLM config: --llm_config {args.llm_config}')
metadata = make_metadata(
llm_config,
args.dataset_name,
'miniwob',
args.agent_cls,
args.max_iterations,
args.eval_note,
args.eval_output_dir,
)
output_file = os.path.join(metadata.eval_output_dir, 'output.jsonl')
instances = prepare_dataset(dataset, output_file, args.eval_n_limit, id_column)
_ = get_sandbox() # Initialize the sandbox
run_evaluation(
instances,
metadata,
output_file,
args.eval_num_workers,
process_instance,
id_column,
instances = prepare_dataset(dataset, output_file, args.eval_n_limit)
asyncio.run(
run_evaluation(
instances, metadata, output_file, args.eval_num_workers, process_instance
)
)

evaluation/miniwob/scripts/run_infer.sh Normal file → Executable file

@ -3,14 +3,10 @@ set -eo pipefail
source "evaluation/utils/version_control.sh"
# configure miniwob website, change URL to yours
export MINIWOB_URL="file:///home/fangzhex/miniwob-plusplus/miniwob/html/miniwob/"
# configure browsing agent
export USE_NAV="false"
export USE_CONCISE_ANSWER="true"
MODEL_CONFIG=$1
COMMIT_HASH=$2
AGENT=$3
@ -42,7 +38,7 @@ COMMAND="poetry run python evaluation/miniwob/run_infer.py \
--llm-config $MODEL_CONFIG \
--max-iterations 10 \
--max-chars 10000000 \
--eval-num-workers $NUM_WORKERS \
--eval-num-workers $NUM_WORKERS"
if [ -n "$EVAL_LIMIT" ]; then
echo "EVAL_LIMIT: $EVAL_LIMIT"


@ -0,0 +1,10 @@
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y python3 python3-pip git gcc
WORKDIR /root
COPY requirements.txt .
RUN pip install -r requirements.txt
# docker build -t xingyaoww/od-eval-mint:v1.0 .


@ -2,9 +2,11 @@
This folder contains the evaluation harness for the [MINT benchmark](https://arxiv.org/abs/2309.10691), which evaluates LLMs' ability to solve tasks through multi-turn interactions.
## Configure OpenDevin and LM
We support evaluation of the [Eurus subset focused on math and code reasoning](https://arxiv.org/abs/2404.02078), including MATH, MMLU, TheoremQA, HumanEval, and MBPP.
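As a rough sketch of how these subsets are pulled in (this mirrors the subset handling in `run_infer.py` further down; the HuggingFace dataset name is taken from that file and may change):
```python
import pandas as pd
from datasets import load_dataset

# Eurus subsets evaluated by evaluation/mint/run_infer.py
SUBSETS = ['math', 'mmlu', 'theoremqa', 'mbpp', 'humaneval']

frames = []
for subset in SUBSETS:
    ds = load_dataset('ryanhoangt/xingyaoww-mint-bench', name=subset, split='test')
    df = ds.to_pandas().rename(columns={'id': 'instance_id'})
    # Prefix ids with the subset name, as the harness does.
    df['instance_id'] = df['instance_id'].apply(lambda x, s=subset: f'{s}/{x}')
    frames.append(df)

dataset_df = pd.concat(frames)
print(f'Loaded {len(dataset_df)} MINT instances across {len(SUBSETS)} subsets')
```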
Create a `config.toml` file if it does not exist at the root of the workspace. Please check [README.md](../../README.md) for how to set this up.
## Setup Environment and LLM Configuration
Please follow the instructions [here](../README.md#setup) to set up your local development environment and LLM.
## Start the evaluation


@ -1,33 +1,36 @@
import asyncio
import functools
import logging
import os
import pathlib
from typing import Any, Dict
import pandas as pd
from datasets import load_dataset
from evaluation.swe_bench.swe_env_box import DockerSSHBox
from evaluation.mint.datatypes import TaskState
from evaluation.mint.env import SimplifiedEnv
from evaluation.mint.prompts import ToolPromptTemplate
from evaluation.mint.tasks import Task
from evaluation.utils.shared import (
EvalMetadata,
EvalOutput,
make_metadata,
prepare_dataset,
reset_logger_for_multiprocessing,
run_evaluation,
)
from opendevin.controller.agent import Agent
from opendevin.controller.state.state import State
from opendevin.core.config import get_llm_config_arg, get_parser, load_app_config
from opendevin.core.logger import get_console_handler
from opendevin.core.config import (
AppConfig,
SandboxConfig,
get_llm_config_arg,
get_parser,
)
from opendevin.core.logger import opendevin_logger as logger
from opendevin.core.main import run_controller
from opendevin.llm.llm import LLM
from .datatypes import TaskState
from .env import SimplifiedEnv
from .prompts import ToolPromptTemplate
from .tasks import Task
config = load_app_config()
from opendevin.core.main import create_runtime, run_controller
from opendevin.events.action import (
CmdRunAction,
)
from opendevin.events.observation import CmdOutputObservation
from opendevin.runtime.runtime import Runtime
def codeact_user_response_mint(state: State, task: Task, task_config: Dict[str, int]):
@ -42,7 +45,7 @@ def codeact_user_response_mint(state: State, task: Task, task_config: Dict[str,
last_action = state.history.get_last_action()
result_state: TaskState = env.step(last_action.message or '')
state.task_state = result_state
state.extra_data['task_state'] = result_state
if not result_state.latest_output:
# Task is finished
@ -62,85 +65,108 @@ AGENT_CLS_TO_INST_SUFFIX = {
'CodeActAgent': '\nIMPORTANT: When your answer is confirmed by the user to be correct, you can exit using the following command: <execute_bash> exit </execute_bash>.\n'
}
with open(os.path.join(os.path.dirname(__file__), 'requirements.txt'), 'r') as f:
MINT_DEPENDENCIES = f.read().splitlines()
def process_instance(
def load_incontext_example(task_name: str, with_tool: bool = True):
assert with_tool, 'NOT with_tool is not supported yet'
subset = {
'gsm8k': 'reasoning',
'math': 'reasoning',
'mmlu': 'reasoning',
'theoremqa': 'reasoning',
'mbpp': 'mbpp',
'humaneval': 'humaneval',
}[task_name]
with open(
os.path.join(
os.path.dirname(__file__),
'tasks',
'in_context_examples',
subset,
'with_tool.txt',
),
'r',
) as f:
return f.read()
def get_config(
metadata: EvalMetadata,
) -> AppConfig:
config = AppConfig(
default_agent=metadata.agent_class,
run_as_devin=False,
runtime='eventstream',
max_iterations=metadata.max_iterations,
sandbox=SandboxConfig(
container_image='xingyaoww/od-eval-mint:v1.0',
enable_auto_lint=True,
use_host_network=False,
update_source_code=True,
od_runtime_extra_deps=f'$OD_INTERPRETER_PATH -m pip install {" ".join(MINT_DEPENDENCIES)}',
),
# do not mount workspace
workspace_base=None,
workspace_mount_path=None,
)
config.set_llm_config(metadata.llm_config)
return config
async def initialize_runtime(runtime: Runtime):
"""Initialize the runtime for the agent.
This function is called before the runtime is used to run the agent.
"""
logger.info(f"{'-' * 50} BEGIN Runtime Initialization Fn {'-' * 50}")
obs: CmdOutputObservation
# Create the workspace directory
action = CmdRunAction(command='mkdir -p /workspace')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
action = CmdRunAction(command='cd /workspace')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
logger.info(f"{'-' * 50} END Runtime Initialization Fn {'-' * 50}")
async def process_instance(
instance: Any,
metadata: EvalMetadata,
reset_logger: bool = True,
):
agent = Agent.get_cls(metadata.agent_class)(llm=LLM(metadata.llm_config))
workspace_mount_path = os.path.join(config.workspace_mount_path, '_eval_workspace')
# create process-specific workspace dir
workspace_mount_path = os.path.join(workspace_mount_path, str(os.getpid()))
pathlib.Path(workspace_mount_path).mkdir(parents=True, exist_ok=True)
config = get_config(metadata)
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
if reset_logger:
# Set up logger
log_file = os.path.join(
metadata.eval_output_dir, 'logs', f'instance_{instance.task_id}.log'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
# add back the console handler to print ONE line
logger.addHandler(get_console_handler())
logger.info(
f'Starting evaluation for instance {instance.task_id}.\nHint: run "tail -f {log_file}" to see live logs in a separate shell'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
file_handler = logging.FileHandler(log_file)
file_handler.setFormatter(
logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
)
logger.addHandler(file_handler)
logger.info(f'Process-specific workspace mounted at {workspace_mount_path}')
# use a session id for concurrent processing
sid = instance.task_id + '_' + str(os.getpid())
sandbox = DockerSSHBox(
config=config.sandbox,
persist_sandbox=False,
workspace_mount_path=config.workspace_mount_path,
sandbox_workspace_dir=config.workspace_mount_path_in_sandbox,
cache_dir=config.cache_dir,
run_as_devin=config.run_as_devin,
sid=sid,
)
requirements_host_src = 'evaluation/mint/requirements.txt'
requirements_sandbox_dest = '/opendevin/plugins/mint/'
sandbox.copy_to(
host_src=requirements_host_src,
sandbox_dest=requirements_sandbox_dest,
recursive=False,
)
logger.info(
f'Copied files from [{requirements_host_src}] to [{requirements_sandbox_dest}] inside sandbox.'
)
exit_code, output = sandbox.execute(f'pip install -r {requirements_sandbox_dest}')
log_dir = os.path.join(metadata.eval_output_dir, 'infer_logs')
reset_logger_for_multiprocessing(logger, instance.instance_id, log_dir)
else:
logger.info(f'Starting evaluation for instance {instance.instance_id}.')
# Prepare instruction
assert metadata.details is not None
instruction = ToolPromptTemplate(use_tool=True)(
max_total_steps=metadata.max_iterations,
max_propose_solution=metadata.details['max_propose_solution'],
in_context_example=instance.in_context_example(
use_tool=True, with_feedback=False
),
in_context_example=instance.in_context_example,
task_prompt='Task:\n' + instance.prompt,
)
instruction += 'IMPORTANT: You should ONLY interact with the environment provided to you or provide the concise RESULT inside <solution> tag AND NEVER ASK FOR HUMAN HELP.\n'
# NOTE: You can actually set slightly different instruction for different agents
instruction += AGENT_CLS_TO_INST_SUFFIX[agent.__class__.__name__]
instruction += AGENT_CLS_TO_INST_SUFFIX[metadata.agent_class]
# Here's how you can run the agent (similar to the `main` function) and get the final task state
fake_user_response_fn = functools.partial(
AGENT_CLS_TO_FAKE_USER_RESPONSE_FN[agent.__class__.__name__],
AGENT_CLS_TO_FAKE_USER_RESPONSE_FN[metadata.agent_class],
task=instance,
task_config={
'max_iterations': metadata.max_iterations,
@ -148,24 +174,22 @@ def process_instance(
},
)
config.max_iterations = metadata.max_iterations
state: State | None = asyncio.run(
run_controller(
config=config,
task_str=instruction,
fake_user_response_fn=fake_user_response_fn,
agent=agent,
sandbox=sandbox,
sid=sid,
)
runtime = await create_runtime(config, sid=instance.instance_id)
await initialize_runtime(runtime)
state: State | None = await run_controller(
config=config,
task_str=instruction,
runtime=runtime,
fake_user_response_fn=fake_user_response_fn,
)
if state is None:
raise ValueError('State should not be None.')
task_state = None
if hasattr(state, 'task_state'):
task_state = state.task_state
if 'task_state' in state.extra_data:
task_state = state.extra_data['task_state']
logger.info('Task state: ' + str(task_state.to_dict()))
metrics = state.metrics.get() if state.metrics else None
@ -176,30 +200,37 @@ def process_instance(
histories = state.history.compatibility_for_eval_history_pairs()
# Save the output
output = {
'id': instance.task_id,
'instance': instance.to_dict(),
'instruction': instruction,
'metadata': metadata.model_dump(),
'history': histories,
'metrics': metrics,
'error': state.last_error if state and state.last_error else None,
'test_result': task_state.success if task_state else False,
}
# Close the sandbox
sandbox.close()
output = EvalOutput(
instance_id=instance.instance_id,
instance=instance.to_dict(),
instruction=instruction,
metadata=metadata,
history=histories,
metrics=metrics,
error=state.last_error if state and state.last_error else None,
test_result={
'success': task_state.success if task_state else False,
},
)
return output
if __name__ == '__main__':
parser = get_parser()
SUBSETS = [
# Eurus subset: https://arxiv.org/abs/2404.02078
'math',
# 'gsm8k',
'mmlu',
'theoremqa',
'mbpp',
'humaneval',
]
parser.add_argument(
'--subset',
default='math',
choices=['math', 'gsm8k', 'mmlu', 'theoremqa', 'mbpp', 'humaneval'],
default='all',
choices=SUBSETS + ['all'],
type=str,
help='subset of the dataset to be used',
)
@ -214,19 +245,36 @@ if __name__ == '__main__':
# NOTE: It is preferable to load datasets from huggingface datasets and perform post-processing
# so we don't need to manage file uploading to OpenDevin's repo
mint_dataset = load_dataset(
'ryanhoangt/xingyaoww-mint-bench', name=args.subset, split='test'
)
logger.info(f'Evaluating MINT - {args.subset} subset')
mint_tests = mint_dataset.to_pandas()
if args.subset == 'all':
subsets = SUBSETS
else:
subsets = [args.subset]
id_column = 'id'
llm_config = get_llm_config_arg(args.llm_config) if args.llm_config else config.llm
logger.info(f'Config for evaluation: {config}')
dataset_dfs = []
for subset in subsets:
in_context_example = load_incontext_example(subset)
_cur_dataset = load_dataset(
'ryanhoangt/xingyaoww-mint-bench', name=subset, split='test'
)
logger.info(f'Loaded MINT - {subset} subset')
_df = _cur_dataset.to_pandas().rename(columns={'id': 'instance_id'})
_df['instance_id'] = _df['instance_id'].apply(lambda x: f'{subset}/{x}') # noqa
_df['in_context_example'] = in_context_example
dataset_dfs.append(_df)
logger.info(f'Loaded {len(_df)} instances for subset: {subset}')
dataset_df = pd.concat(dataset_dfs)
logger.info(f'Loaded {len(dataset_df)} instances for subset: {subsets}')
llm_config = None
if args.llm_config:
llm_config = get_llm_config_arg(args.llm_config)
if llm_config is None:
raise ValueError(f'Could not find LLM config: --llm_config {args.llm_config}')
metadata = make_metadata(
llm_config,
args.dataset_name,
f'MINT-{args.subset}',
args.agent_cls,
args.max_iterations,
args.eval_note,
@ -234,12 +282,7 @@ if __name__ == '__main__':
details={'max_propose_solution': args.max_propose_solution},
)
output_file = os.path.join(metadata.eval_output_dir, 'output.jsonl')
instances = prepare_dataset(mint_dataset, output_file, args.eval_n_limit, id_column)
instances = prepare_dataset(dataset_df, output_file, args.eval_n_limit)
run_evaluation(
instances,
metadata,
output_file,
args.eval_num_workers,
process_instance,
id_column,
instances, metadata, output_file, args.eval_num_workers, process_instance
)

evaluation/mint/scripts/run_infer.sh Normal file → Executable file

@ -29,15 +29,16 @@ COMMAND="poetry run python ./evaluation/mint/run_infer.py \
--llm-config $MODEL_CONFIG \
--max-iterations 5 \
--max-propose-solution 2 \
--eval-num-workers $NUM_WORKERS \
--eval-num-workers $NUM_WORKERS
"
if [ -n "$SUBSET" ]; then
echo "SUBSET: $SUBSET"
COMMAND="$COMMAND --subset $SUBSET"
# otherwise default to use the math subset
else
echo "SUBSET: math"
COMMAND="$COMMAND --subset math"
echo "SUBSET: all"
COMMAND="$COMMAND --subset all"
fi
if [ -n "$EVAL_LIMIT" ]; then


@ -10,40 +10,9 @@ The task introduces new challenges for LLMs, such as comprehending long and lang
For more details on the ML-Bench task and dataset, please refer to the paper: [ML-Bench: Evaluating Large Language Models for Code Generation in Repository-Level Machine Learning Tasks](https://arxiv.org/abs/2311.09835).
## Setup Environment
## Setup Environment and LLM Configuration
Please follow the [OpenDevin setup guide](https://github.com/OpenDevin/OpenDevin/blob/main/docs/setup.md) to set up the local development environment for OpenDevin.
## Configure OpenDevin and your LLM
Create a `config.toml` file if it does not exist at the root of the workspace.
Add the following configurations:
```toml
[core]
max_iterations = 100
cache_dir = "/tmp/cache"
ssh_hostname = "localhost"
run_as_devin = false
sandbox_container_image = "public.ecr.aws/i5g0m1f6/ml-bench" # Use the latest image from the ML-Bench repository
[sandbox]
enable_auto_lint = true
# TODO: Change these to the model you want to evaluate
[llm.eval_gpt4_1106_preview]
model = "gpt-4-1106-preview"
api_key = "XXX"
temperature = 0.0
[llm.eval_some_openai_compatible_model]
model = "openai/MODEL_NAME"
base_url = "https://OPENAI_COMPATIBLE_URL/v1"
api_key = "XXX"
temperature = 0.0
```
Please follow the instructions [here](../README.md#setup) to set up your local development environment and LLM.
## Run Inference on ML-Bench


@ -13,29 +13,34 @@ TODOs:
- Clean up the code and docker image used for evaluation.
"""
import asyncio
import logging
import os
import pathlib
from typing import Any
import pandas as pd
from datasets import load_dataset
from evaluation.utils.shared import (
EvalMetadata,
EvalOutput,
codeact_user_response,
make_metadata,
prepare_dataset,
reset_logger_for_multiprocessing,
run_evaluation,
)
from opendevin.controller.agent import Agent
from opendevin.controller.state.state import State
from opendevin.core.config import get_llm_config_arg, get_parser, load_app_config
from opendevin.core.logger import get_console_handler
from opendevin.core.config import (
AppConfig,
SandboxConfig,
get_llm_config_arg,
get_parser,
load_app_config,
)
from opendevin.core.logger import opendevin_logger as logger
from opendevin.core.main import run_controller
from opendevin.llm.llm import LLM
from opendevin.runtime.docker.ssh_box import DockerSSHBox
from opendevin.core.main import create_runtime, run_controller
from opendevin.events.action import CmdRunAction
from opendevin.events.observation import CmdOutputObservation
from opendevin.runtime.runtime import Runtime
config = load_app_config()
@ -66,169 +71,204 @@ ID2CONDA = {
}
def process_instance(instance: Any, metadata: EvalMetadata, reset_logger: bool = True):
agent = Agent.get_cls(metadata.agent_class)(llm=LLM(config=metadata.llm_config))
old_workspace_mount_path = config.workspace_mount_path
old_workspace_base = config.workspace_base
try:
workspace_mount_path = os.path.join(
config.workspace_mount_path, '_eval_workspace'
)
def get_config(
metadata: EvalMetadata,
) -> AppConfig:
config = AppConfig(
default_agent=metadata.agent_class,
run_as_devin=False,
runtime='eventstream',
max_iterations=metadata.max_iterations,
sandbox=SandboxConfig(
container_image='public.ecr.aws/i5g0m1f6/ml-bench',
enable_auto_lint=True,
use_host_network=False,
update_source_code=True,
),
# do not mount workspace
workspace_base=None,
workspace_mount_path=None,
)
config.set_llm_config(metadata.llm_config)
return config
async def initialize_runtime(
runtime: Runtime,
instance: pd.Series,  # used to pick the conda environment and task repo for this instance
):
"""Initialize the runtime for the agent.
This function is called before the runtime is used to run the agent.
"""
logger.info(f"{'-' * 50} BEGIN Runtime Initialization Fn {'-' * 50}")
obs: CmdOutputObservation
# Create the workspace directory
action = CmdRunAction(command='mkdir -p /workspace')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
# Set up the task environment
action = CmdRunAction(command=f'conda activate {ID2CONDA[instance["github_id"]]}')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
repo_url = instance['github']
repo_name = repo_url.split('/')[-1]
action = CmdRunAction(command=f'git clone {repo_url} /workspace/{repo_name}')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
action = CmdRunAction(command=f'chmod -R 777 /workspace/{repo_name}')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
# Navigate to the task's code path
task_path = os.path.join('/workspace', repo_name, instance['path'][2:])
action = CmdRunAction(command=f'cd {task_path}')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
logger.info(f"{'-' * 50} END Runtime Initialization Fn {'-' * 50}")
async def complete_runtime(
runtime: Runtime,
instance: pd.Series,  # used to locate the task repo and its conda environment
) -> dict[str, Any]:
"""Complete the runtime for the agent.
This function is called after the agent has finished running.
If you need to do something in the sandbox to get the correctness metric after
the agent has run, modify this function.
"""
logger.info(f"{'-' * 50} BEGIN Runtime Completion Fn {'-' * 50}")
obs: CmdOutputObservation
repo_url = instance['github']
repo_name = repo_url.split('/')[-1]
task_path = os.path.join('/workspace', repo_name, instance['path'][2:])
# Evaluate the agent's script
eval_script = os.path.join(task_path, 'run.sh')
logger.info(f'Running evaluation script: {eval_script}')
action = CmdRunAction(command=f'cat {eval_script}', keep_prompt=False)
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
if obs.exit_code == 0:
eval_script_content = obs.content
else:
logger.error(f'Error reading evaluation script: {obs.content}')
eval_script_content = ''
action = CmdRunAction(
command=f'timeout 120s conda run -n {ID2CONDA[instance["github_id"]]} bash {eval_script}',
timeout=600,
)
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
if obs.exit_code == 0:
eval_output = obs.content
else:
logger.error(f'Error running evaluation script: {obs.content}')
eval_output = ''
outputs = {
'eval_script_content': eval_script_content,
'eval_output': eval_output,
}
if obs.exit_code != 0 and obs.exit_code != 124:
logger.warning(f'Evaluation script failed with exit code {obs.exit_code}')
logger.warning(f'Output: {eval_output}')
outputs['success'] = int(
'KeyboardInterrupt' in eval_output
) # super-dainiu: assume ``KeyboardInterrupt`` is a success as is done in ML-Bench
else:
logger.info(f'Evaluation script succeeded with exit code {obs.exit_code}')
logger.info(f'Output: {eval_output}')
outputs['success'] = 1
outputs['eval_exit_code'] = obs.exit_code
logger.info(f"{'-' * 50} END Runtime Completion Fn {'-' * 50}")
return outputs
async def process_instance(
instance: Any, metadata: EvalMetadata, reset_logger: bool = True
):
config = get_config(metadata)
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
if reset_logger:
log_dir = os.path.join(metadata.eval_output_dir, 'infer_logs')
reset_logger_for_multiprocessing(logger, instance['instance_id'], log_dir)
else:
logger.info(f'Starting evaluation for instance {instance["instance_id"]}.')
# Create a sandbox, using the instance ID and PID as the session ID to avoid conflicts
sid = str(instance['instance_id'])
repo_url = instance['github']
repo_name = repo_url.split('/')[-1]
task_path = os.path.join('/workspace', repo_name, instance['path'][2:])
# Prepare the task instruction
instruction = (
f'Please complete the Machine Learning task in the following repository: {repo_name}\n\n'
f'{instance["instruction"]}\n\n'
'You should create a script named `run.sh` under the specified path in the repo to run the task.\n\n'
f'You can find the task repo at: {task_path}\n\n'
+ (
'Here is the prefix code for the task:\n'
'```bash\n'
f'{instance["prefix_code"]}\n'
'```\n\n'
if instance['prefix_code']
else ''
)
+ 'You should terminate the subprocess after running the task (e.g., call subprocess.Popen(args).wait()).'
)
# create process-specific workspace dir
# so that different agents don't interfere with each other.
workspace_mount_path = os.path.join(workspace_mount_path, str(os.getpid()))
pathlib.Path(workspace_mount_path).mkdir(parents=True, exist_ok=True)
instruction += AGENT_CLS_TO_INST_SUFFIX[metadata.agent_class]
# reset workspace to config
config.workspace_base = workspace_mount_path
config.workspace_mount_path = workspace_mount_path
runtime = await create_runtime(config, sid=sid)
await initialize_runtime(runtime, instance)
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
if reset_logger:
# Set up logger
log_file = os.path.join(
metadata.eval_output_dir,
'logs',
f"instance_{instance['id']}_pid_{os.getpid()}.log",
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
# add back the console handler to print ONE line
logger.addHandler(get_console_handler())
logger.info(
f"Starting evaluation for instance {instance['id']}.\nLOG: tail -f {log_file}"
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
file_handler = logging.FileHandler(log_file)
file_handler.setFormatter(
logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
)
logger.addHandler(file_handler)
# Run the agent
state: State | None = await run_controller(
config=config,
task_str=instruction,
runtime=runtime,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN.get(
metadata.agent_class
),
)
assert state is not None
metrics = state.metrics.get() if state.metrics else {}
logger.info(f'Process-specific workspace mounted at {workspace_mount_path}')
test_result = await complete_runtime(runtime)
# Create a sandbox, using the instance ID and PID as the session ID to avoid conflicts
sid = str(instance['id']) + '_' + str(os.getpid())
sandbox = DockerSSHBox(
config=config.sandbox,
persist_sandbox=False,
workspace_mount_path=config.workspace_mount_path,
sandbox_workspace_dir=config.workspace_mount_path_in_sandbox,
cache_dir=config.cache_dir,
run_as_devin=config.run_as_devin,
sid=sid,
)
# history is now available as a stream of events, rather than list of pairs of (Action, Observation)
# for compatibility with the existing output format, we can remake the pairs here
# remove when it becomes unnecessary
histories = state.history.compatibility_for_eval_history_pairs()
# Set up the task environment
sandbox.execute(f'conda activate {ID2CONDA[instance["github_id"]]}')
# Clone the task repo into the sandbox
repo_url = instance['github']
repo_name = repo_url.split('/')[-1]
sandbox.execute(f'git clone {repo_url} /workspace/{repo_name}')
sandbox.execute(f'chmod -R 777 /workspace/{repo_name}')
# Navigate to the task's code path
task_path = os.path.join('/workspace', repo_name, instance['path'][2:])
sandbox.execute(f'cd {task_path}')
# Prepare the task instruction
instruction = (
f'Please complete the Machine Learning task in the following repository: {repo_name}\n\n'
f'The task is: {instance["task"]}\n\n'
f'{instance["instruction"]}\n\n'
'You should create a script named `run.sh` under the specified path in the repo to run the task.\n\n'
f'You can find the task repo at: {task_path}\n\n'
+ (
'Here is the prefix code for the task:\n'
'```bash\n'
f'{instance["prefix_code"]}\n'
'```\n\n'
if instance['prefix_code']
else ''
)
+ 'You should terminate the subprocess after running the task (e.g., call subprocess.Popen(args).wait()).'
)
instruction += AGENT_CLS_TO_INST_SUFFIX[agent.__class__.__name__]
# Run the agent
config.max_iterations = metadata.max_iterations
state: State | None = asyncio.run(
run_controller(
config=config,
task_str=instruction,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN.get(
agent.__class__.__name__
),
agent=agent,
sandbox=sandbox,
sid=sid,
)
)
assert state is not None
metrics = state.metrics.get() if state.metrics else {}
# Evaluate the agent's script
eval_script = os.path.join(task_path, 'run.sh')
logger.info(f'Running evaluation script: {eval_script}')
try:
_, eval_script_content = sandbox.execute(f'cat {eval_script}')
except Exception as e:
logger.error(f'Error reading evaluation script: {e}')
eval_script_content = ''
try:
exit_code, eval_output = sandbox.execute(
f'timeout 120s conda run -n {ID2CONDA[instance["github_id"]]} bash {eval_script}',
timeout=600,
)
except Exception as e:
logger.error(f'Error running evaluation script: {e}')
exit_code = -1
eval_output = ''
if exit_code != 0 and exit_code != 124:
logger.warning(f'Evaluation script failed with exit code {exit_code}')
logger.warning(f'Output: {eval_output}')
metrics['success'] = int(
'KeyboardInterrupt' in eval_output
) # super-dainiu: assume ``KeyboardInterrupt`` is a success as is done in ML-Bench
else:
logger.info(f'Evaluation script succeeded with exit code {exit_code}')
logger.info(f'Output: {eval_output}')
metrics['success'] = 1
# history is now available as a stream of events, rather than list of pairs of (Action, Observation)
# for compatibility with the existing output format, we can remake the pairs here
# remove when it becomes unnecessary
histories = state.history.compatibility_for_eval_history_pairs()
# Save the output
output = {
'instance_id': instance['id'],
'repo': repo_url,
'instruction': instruction,
'metadata': metadata.model_dump(),
'history': histories,
'eval_script': eval_script_content,
'eval_exit_code': exit_code,
'eval_output': eval_output,
'metrics': metrics,
}
except Exception as e:
logger.error(f'Error processing instance {instance["id"]}: {e}')
raise
finally:
config.workspace_mount_path = old_workspace_mount_path
config.workspace_base = old_workspace_base
# Shutdown the sandbox
sandbox.close()
# Save the output
output = EvalOutput(
instance_id=instance['instance_id'],
instance=instance.to_dict(),
instruction=instruction,
metadata=metadata,
history=histories,
test_result=test_result,
metrics=metrics,
)
return output
@ -246,30 +286,26 @@ if __name__ == '__main__':
data_split = args.eval_split
# NOTE: It is preferable to load datasets from huggingface datasets and perform post-processing
# so we don't need to manage file uploading to OpenDevin's repo
ml_bench = load_dataset('super-dainiu/ml-bench', split=data_split).to_pandas()
ml_bench.rename(columns={'id': 'instance_id'}, inplace=True)
id_column = 'instance_id'
llm_config = get_llm_config_arg(args.llm_config) if args.llm_config else config.llm
logger.info(f'Config for evaluation: {config}')
llm_config = None
if args.llm_config:
llm_config = get_llm_config_arg(args.llm_config)
if llm_config is None:
raise ValueError(f'Could not find LLM config: --llm_config {args.llm_config}')
metadata = make_metadata(
llm_config,
args.dataset_name,
f'ml-bench-{data_split}',
args.agent_cls,
args.max_iterations,
args.eval_note,
args.eval_output_dir,
)
output_file = os.path.join(metadata.eval_output_dir, 'output.jsonl')
instances = prepare_dataset(ml_bench, output_file, args.eval_n_limit, id_column)
instances = prepare_dataset(ml_bench, output_file, args.eval_n_limit)
run_evaluation(
instances,
metadata,
output_file,
args.eval_num_workers,
process_instance,
id_column,
instances, metadata, output_file, args.eval_num_workers, process_instance
)

evaluation/ml_bench/scripts/run_infer.sh Normal file → Executable file


@ -1,132 +1,79 @@
# SWE-Bench Evaluation with OpenDevin SWE-Bench Docker Image
This folder contains the evaluation harness that we built on top of the original [SWE-Bench benchmark](https://www.swebench.com/) ([paper](https://arxiv.org/abs/2310.06770)). We created [a fork of SWE-Bench](https://github.com/OpenDevin/OD-SWE-bench.git) mostly built on top of [the original repo](https://github.com/princeton-nlp/SWE-bench) and [containerized](#opendevin-swe-bench-docker-image) it for easy evaluation.
This folder contains the evaluation harness that we built on top of the original [SWE-Bench benchmark](https://www.swebench.com/) ([paper](https://arxiv.org/abs/2310.06770)).
**UPDATE (7/1/2024): We now support the official SWE-Bench dockerized evaluation as announced [here](https://github.com/princeton-nlp/SWE-bench/blob/main/docs/20240627_docker/README.md).**
## Setup Environment
The evaluation consists of three steps:
Please follow [this document](https://github.com/OpenDevin/OpenDevin/blob/main/Development.md) to set up a local development environment for OpenDevin.
1. Environment setup: [install python environment](../README.md#development-environment), [configure LLM config](../README.md#configure-opendevin-and-your-llm), and [pull docker](#opendevin-swe-bench-instance-level-docker-support).
2. [Run inference](#run-inference-on-swe-bench-instances): Generate an edit patch for each GitHub issue
3. [Evaluate patches using SWE-Bench docker](#evaluate-generated-patches)
## OpenDevin SWE-Bench Docker Image
## Setup Environment and LLM Configuration
In the [OpenDevin-SWE-Bench fork](https://github.com/OpenDevin/OD-SWE-bench.git) (mostly from the [original repo](https://github.com/princeton-nlp/SWE-bench) with some fixes), we try to pre-build the **testbed** (i.e., the code of the repository we want the agent to edit) AND the **conda environment**, so that at evaluation (inference) time we can directly leverage existing environments for efficient evaluation.
Please follow the instructions [here](../README.md#setup) to set up your local development environment and LLM.
**We pack everything you need for SWE-Bench inference into one, gigantic, docker image.** To use it:
## OpenDevin SWE-Bench Instance-level Docker Support
OpenDevin now supports using the [official evaluation docker](https://github.com/princeton-nlp/SWE-bench/blob/main/docs/20240627_docker/README.md) for both **[inference](#run-inference-on-swe-bench-instances) and [evaluation](#evaluate-generated-patches)**.
This is now the default behavior.
### Download Docker Images
**(Recommended for reproducibility)** If you have extra local space (e.g., 100GB), you can try pulling the [instance-level docker images](https://github.com/princeton-nlp/SWE-bench/blob/main/docs/20240627_docker/README.md#choosing-the-right-cache_level) we've prepared by running:
```bash
docker pull ghcr.io/opendevin/eval-swe-bench:full-v1.2.1
evaluation/swe_bench/scripts/docker/pull_all_eval_docker.sh instance
```
The Docker image contains several important directories:
- `/swe_util/OD-SWE-bench`: root directory for the OD-SWE-bench repository
- `/swe_util/eval_data`: directory to eval data
- `/swe_util/eval_data/eval_logs/`: evaluation logs
- `/swe_util/eval_data/eval_temp/`: temporary folder for the evaluation process
- `/swe_util/eval_data/instances/`: swe-bench raw instances
- `/swe_util/eval_data/outputs/`: model or agent outputs
- `/swe_util/eval_data/testbed_logs/`: logs for testbed building
- `/swe_util/eval_data/testbeds/`: directory for all testbeds
- `/swe_util/miniforge3/`: directory for miniforge3
To reproduce how we pack the image, check [this doc](./BUILD_TESTBED_AND_ENV.md).
NOTE: We only support SWE-Bench lite for now. But modifying our existing scripts for full SWE-Bench should be quite straightforward.
## Configure OpenDevin and your LLM
Create a `config.toml` file if it does not exist at the root of the workspace.
Add the following configurations:
```toml
[core]
max_iterations = 100
cache_dir = "/tmp/cache"
ssh_hostname = "localhost"
[sandbox]
box_type = "ssh"
timeout = 120
run_as_devin = false
max_budget_per_task = 4 # 4 USD
[sandbox]
# SWEBench eval specific
use_host_network = false
enable_auto_lint = true
# TODO: Change these to the model you want to evaluate
[llm.eval_gpt4_1106_preview_llm]
model = "gpt-4-1106-preview"
api_key = "XXX"
temperature = 0.0
[llm.eval_some_openai_compatible_model_llm]
model = "openai/MODEL_NAME"
base_url = "https://OPENAI_COMPATIBLE_URL/v1"
api_key = "XXX"
temperature = 0.0
```
## Test if your environment works
Make sure your Docker daemon is running, and you have pulled the `eval-swe-bench:full-v1.2`
docker image. Then run this python script:
If you want to save a bit of disk space (e.g., with ~50GB free) while speeding up the image pre-build process, you can pull the environment-level docker images:
```bash
# export USE_INSTANCE_IMAGE=true # if you want to test support for instance-level docker images
poetry run python evaluation/swe_bench/swe_env_box.py
evaluation/swe_bench/scripts/docker/pull_all_eval_docker.sh env
```
If you get to the interactive shell successfully, it means your environment works!
If you see an error, please make sure your `config.toml` contains all
`SWEBench eval specific` settings as shown in the previous section.
## Run Inference on SWE-Bench Instances
Make sure your Docker daemon is running, and you have pulled the [instance-level docker image](#opendevin-swe-bench-instance-level-docker-support).
```bash
./evaluation/swe_bench/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [max_iter] [num_workers]
# e.g., ./evaluation/swe_bench/scripts/run_infer.sh eval_gpt4_1106_preview_llm HEAD CodeActAgent 300
# e.g., ./evaluation/swe_bench/scripts/run_infer.sh llm.eval_gpt4_1106_preview HEAD CodeActAgent 300
```
where `model_config` is mandatory, while `agent` and `eval_limit` are optional.
where `model_config` is mandatory, and the rest are optional.
`model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for your
- `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for your
LLM settings, as defined in your `config.toml`.
`git-version`, e.g. `HEAD`, is the git commit hash of the OpenDevin version you would
- `git-version`, e.g. `HEAD`, is the git commit hash of the OpenDevin version you would
like to evaluate. It could also be a release tag like `0.6.2`.
`agent`, e.g. `CodeActAgent`, is the name of the agent for benchmarks, defaulting
- `agent`, e.g. `CodeActAgent`, is the name of the agent for benchmarks, defaulting
to `CodeActAgent`.
`eval_limit`, e.g. `10`, limits the evaluation to the first `eval_limit` instances. By
- `eval_limit`, e.g. `10`, limits the evaluation to the first `eval_limit` instances. By
default, the script evaluates the entire SWE-bench_Lite test set (300 issues). Note:
in order to use `eval_limit`, you must also set `agent`.
`max_iter`, e.g. `20`, is the maximum number of iterations for the agent to run. By
- `max_iter`, e.g. `20`, is the maximum number of iterations for the agent to run. By
default, it is set to 30.
`num_workers`, e.g. `3`, is the number of parallel workers to run the evaluation. By
- `num_workers`, e.g. `3`, is the number of parallel workers to run the evaluation. By
default, it is set to 1.
There are also two optional environment variables you can set.
```
export USE_HINT_TEXT=true # if you want to use hint text in the evaluation. Ignore this if you are not sure.
export USE_INSTANCE_IMAGE=true # if you want to use instance-level docker images
export USE_HINT_TEXT=true # if you want to use hint text in the evaluation. Default to false. Ignore this if you are not sure.
export USE_INSTANCE_IMAGE=true # if you want to use instance-level docker images. Default to true
```
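The exact way `run_infer.py` consumes these flags is defined in the script itself; the snippet below is only an illustration (the parsing shown here is an assumption, not a copy of the harness code), with defaults matching the comments above:
```python
import os

# Illustration only: read the optional flags as boolean environment variables.
USE_HINT_TEXT = os.environ.get('USE_HINT_TEXT', 'false').lower() == 'true'
USE_INSTANCE_IMAGE = os.environ.get('USE_INSTANCE_IMAGE', 'true').lower() == 'true'

print(f'USE_HINT_TEXT={USE_HINT_TEXT}, USE_INSTANCE_IMAGE={USE_INSTANCE_IMAGE}')
```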
Let's say you'd like to run 10 instances using `eval_gpt4_1106_preview_llm` and CodeActAgent,
Let's say you'd like to run 10 instances using `llm.eval_gpt4_1106_preview` and CodeActAgent,
then your command would be:
```bash
./evaluation/swe_bench/scripts/run_infer.sh eval_gpt4_1106_preview_llm HEAD CodeActAgent 10
./evaluation/swe_bench/scripts/run_infer.sh llm.eval_gpt4_1106_preview HEAD CodeActAgent 10
```
### Specify a subset of tasks to run infer
If you would like to specify a list of tasks you'd like to benchmark on, you could
create a `config.toml` under the `./evaluation/swe_bench/` folder, and put a list
attribute named `selected_ids`, e.g.
@ -146,22 +93,12 @@ With `output.jsonl` file, you can run `eval_infer.sh` to evaluate generated patc
**This evaluation is performed using the official dockerized evaluation announced [here](https://github.com/princeton-nlp/SWE-bench/blob/main/docs/20240627_docker/README.md).**
If you want to evaluate existing results, you should first run this to clone existing outputs
> If you want to evaluate existing results, you should first run this to clone existing outputs
>```bash
>git clone https://huggingface.co/spaces/OpenDevin/evaluation evaluation/evaluation_outputs
>```
```bash
git clone https://huggingface.co/spaces/OpenDevin/evaluation evaluation/evaluation_outputs
```
If you have extra local space (e.g., 500GB), you can try pulling the [instance-level docker images](https://github.com/princeton-nlp/SWE-bench/blob/main/docs/20240627_docker/README.md#choosing-the-right-cache_level) we've prepared to speed up the evaluation by running:
```bash
evaluation/swe_bench/scripts/docker/pull_all_eval_docker.sh instance
```
If you want to save some disk space (e.g., with only ~50GB free) while still speeding up the image pre-build process, you can pull the environment-level docker images instead:
```bash
evaluation/swe_bench/scripts/docker/pull_all_eval_docker.sh env
```
Note: you should have already pulled the instance-level OR env-level docker images following [this section](#opendevin-swe-bench-instance-level-docker-support).
Then you can run the following:
@ -171,13 +108,13 @@ Then you can run the following:
./evaluation/swe_bench/scripts/eval_infer.sh evaluation/evaluation_outputs/outputs/swe_bench/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/output.jsonl
```
PS: You can also pass in a JSONL with [SWE-Bench format](https://github.com/princeton-nlp/SWE-bench/blob/main/tutorials/evaluation.md#-creating-predictions) to `./evaluation/swe_bench/scripts/eval_infer.sh`, where each line is a JSON of `{"model_patch": "XXX", "model_name_or_path": "YYY", "instance_id": "ZZZ"}`.
> You can also pass in a JSONL with [SWE-Bench format](https://github.com/princeton-nlp/SWE-bench/blob/main/tutorials/evaluation.md#-creating-predictions) to `./evaluation/swe_bench/scripts/eval_infer.sh`, where each line is a JSON of `{"model_patch": "XXX", "model_name_or_path": "YYY", "instance_id": "ZZZ"}`.
The final results will be saved to `evaluation/evaluation_outputs/outputs/swe_bench/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/` with the following files/directories:
- `README.md`: a report showing which instances passed, failed, etc.
- `report.json`: a JSON file that contains keys like `"resolved_ids"` pointing to instance IDs that are resolved by the agent.
- `eval_outputs/`: a directory of test logs
- `logs/`: a directory of test logs
## Visualize Results
@ -189,9 +126,10 @@ git clone https://huggingface.co/spaces/OpenDevin/evaluation
**(optional) setup streamlit environment with conda**:
```bash
cd evaluation
conda create -n streamlit python=3.10
conda activate streamlit
pip install streamlit altair st_pages
pip install -r requirements.txt
```
**run the visualizer**:

View File

@ -0,0 +1,28 @@
CODEACT_SWE_PROMPT = """Now, you're going to solve this issue on your own. Your terminal session has started and you're in the repository's root directory. You can use any bash commands or the special interface to help you. Edit all the files you need to and run any checks or tests that you want.
Remember, YOU CAN ONLY ENTER ONE COMMAND AT A TIME. You should always wait for feedback after every command.
When you're satisfied with all of the changes you've made, you can run the following command: <execute_bash> exit </execute_bash>.
Note however that you cannot use any interactive session commands (e.g. vim) in this environment, but you can write scripts and run them. E.g. you can write a python script and then run it with `python <script_name>.py`.
NOTE ABOUT THE EDIT COMMAND: Indentation really matters! When editing a file, make sure to insert appropriate indentation before each line!
IMPORTANT TIPS:
1. Always start by trying to replicate the bug that the issue discusses.
If the issue includes code for reproducing the bug, we recommend that you re-implement that in your environment, and run it to make sure you can reproduce the bug.
Then start trying to fix it.
When you think you've fixed the bug, re-run the bug reproduction script to make sure that the bug has indeed been fixed.
If the bug reproduction script does not print anything when it successfully runs, we recommend adding a print("Script completed successfully, no errors.") command at the end of the file,
so that you can be sure that the script indeed ran fine all the way through.
2. If you run a command and it doesn't work, try running a different command. A command that did not work once will not work the second time unless you modify it!
3. If you open a file and need to get to an area around a specific line that is not in the first 100 lines, say line 583, don't just use the scroll_down command multiple times. Instead, use the goto 583 command. It's much quicker.
4. If the bug reproduction script requires inputting/reading a specific file, such as buggy-input.png, and you'd like to understand how to input that file, conduct a search in the existing repo code, to see whether someone else has already done that. Do this by running the command: find_file("buggy-input.png") If that doesn't work, use the linux 'find' command.
5. Always make sure to look at the currently open file and the current working directory (which appears right after the currently open file). The currently open file might be in a different directory than the working directory! Note that some commands, such as 'create', open files, so they might change the current open file.
6. When editing files, it is easy to accidentally specify a wrong line number or to write code with incorrect indentation. Always check the code after you issue an edit to make sure that it reflects what you wanted to accomplish. If it didn't, issue another command to fix it.
[Current directory: /workspace/{workspace_dir_name}]
"""

View File

@ -1,34 +1,39 @@
import asyncio
import logging
import json
import os
import pathlib
import tempfile
from typing import Any
import pandas as pd
import toml
import whatthepatch
from datasets import load_dataset
import agenthub
from evaluation.swe_bench.swe_env_box import SWEBenchSSHBox
from evaluation.swe_bench.prompt import CODEACT_SWE_PROMPT
from evaluation.utils.shared import (
EvalMetadata,
EvalOutput,
codeact_user_response,
make_metadata,
prepare_dataset,
reset_logger_for_multiprocessing,
run_evaluation,
)
from opendevin.controller.agent import Agent
from opendevin.controller.state.state import State
from opendevin.core.config import get_llm_config_arg, load_app_config, parse_arguments
from opendevin.core.logger import get_console_handler
from opendevin.core.config import (
AppConfig,
SandboxConfig,
get_llm_config_arg,
parse_arguments,
)
from opendevin.core.logger import opendevin_logger as logger
from opendevin.core.main import run_controller
from opendevin.llm.llm import LLM
from opendevin.core.main import create_runtime, run_controller
from opendevin.events.action import CmdRunAction
from opendevin.events.observation import CmdOutputObservation, ErrorObservation
from opendevin.runtime.runtime import Runtime
config = load_app_config()
USE_HINT_TEXT = os.environ.get('USE_HINT_TEXT', 'false') == 'true'
USE_INSTANCE_IMAGE = os.environ.get('USE_INSTANCE_IMAGE', 'false') == 'true'
USE_HINT_TEXT = os.environ.get('USE_HINT_TEXT', 'false').lower() == 'true'
USE_INSTANCE_IMAGE = os.environ.get('USE_INSTANCE_IMAGE', 'false').lower() == 'true'
AGENT_CLS_TO_FAKE_USER_RESPONSE_FN = {
'CodeActAgent': codeact_user_response,
@ -41,184 +46,12 @@ AGENT_CLS_TO_INST_SUFFIX = {
}
def get_test_result(instance, sandbox, workspace_dir_name):
test_result = {'result': {}, 'metadata': {}}
# NOTE: if you need to do something in the sandbox to get the correctness metric, modify this function
try:
test_patch_parsed = whatthepatch.parse_patch(instance.test_patch)
# get a list of filepaths that are involved in the patch
involved_filepaths = set()
for patch in test_patch_parsed:
involved_filepaths.add(patch.header.old_path.removeprefix('a/'))
involved_filepaths.add(patch.header.new_path.removeprefix('b/'))
involved_filepaths = list(involved_filepaths)
test_result['metadata']['1_test_patch_parse_success'] = True
test_result['metadata']['1_test_involved_filepaths'] = involved_filepaths
except Exception as e:
logger.error(
f'Error parsing test patch for instance {instance.instance_id}: {e}'
)
test_result['metadata']['1_test_patch_parse_success'] = False
test_result['metadata']['1_test_patch_parse_error'] = str(e)
test_result['metadata']['1_test_involved_filepaths'] = None
involved_filepaths = []
# Try to revert the changes for involved filepaths
err_code, output = sandbox.execute(f'cd /workspace/{workspace_dir_name}')
test_result['metadata']['2_revert_test_involved_filepaths_success'] = []
for filepath in involved_filepaths:
err_code, output = sandbox.execute(
f'git checkout {instance["base_commit"]} -- {filepath}'
)
if err_code != 0:
logger.error(f'Error reverting changes for {filepath}: {output}')
test_result['metadata']['2_revert_test_involved_filepaths_success'].append(
False
)
else:
test_result['metadata']['2_revert_test_involved_filepaths_success'].append(
True
)
# Apply the testcase
err_code, output = sandbox.execute('git apply $SWE_TASK_DIR/test.patch')
if err_code != 0:
logger.error(f'Error applying test patch: {output}')
test_result['metadata']['3_apply_test_patch_success'] = False
test_result['metadata']['3_apply_test_patch_error'] = output
else:
test_result['metadata']['3_apply_test_patch_success'] = True
# Run the test command
err_code, output = sandbox.execute(
'$TEST_CMD > /workspace/$SWE_INSTANCE_ID.log 2>&1'
)
if err_code != 0:
logger.error(f'Error running test command: {output}')
test_result['metadata']['4_run_test_command_success'] = False
test_result['metadata']['4_run_test_command_error'] = output
else:
test_result['metadata']['4_run_test_command_success'] = True
# Get the test output
err_code, output = sandbox.execute('cat /workspace/$SWE_INSTANCE_ID.log')
if err_code != 0:
logger.error(f'Error getting test output: {output}')
test_result['metadata']['4_get_test_output_success'] = False
test_result['metadata']['4_get_test_output_error'] = output
else:
test_result['metadata']['4_get_test_output_success'] = True
test_result['test_output'] = output
# Reformat instance.json
# $SWE_TASK_DIR/instance.json is a dict {"XXX": "YYY"}, add a [ before and a ] after
err_code, output = sandbox.execute(
(
'cat $SWE_TASK_DIR/instance.json | sed "s/^{/[{/" | sed "s/}$/}]/" > /workspace/instance.json'
)
)
if err_code != 0:
logger.error(f'Error creating instance.json: {output}')
test_result['metadata']['5_reformat_instance_json_success'] = False
test_result['metadata']['5_reformat_instance_json_error'] = output
else:
test_result['metadata']['5_reformat_instance_json_success'] = True
if USE_INSTANCE_IMAGE:
# instance report is not supported in instance image mode
test_result['metadata']['6_get_instance_report_success'] = False
test_result['metadata']['6_get_instance_report_error'] = (
'Instance report is not supported in instance image mode.'
)
else:
# Get the instance report
err_code, output = sandbox.execute(
(
'cd /swe_util/OD-SWE-bench '
'&& export PYTHONPATH=$(pwd):$PYTHONPATH '
'&& conda run -n swe-bench-eval python swebench/metrics/get_instance_report.py --swe_bench_task /workspace/instance.json --log_path /workspace/$SWE_INSTANCE_ID.log'
)
)
if err_code != 0:
logger.error(f'Error getting instance report: {output}')
test_result['metadata']['6_get_instance_report_success'] = False
test_result['metadata']['6_get_instance_report_error'] = output
else:
test_result['metadata']['6_get_instance_report_success'] = True
test_result['result_raw'] = output
# try to parse output
for line in output.strip().split('\n'):
line = line.strip('-')
try:
key, value = line.split(':')
except ValueError:
# skip this line
print(f'Error parsing result line: {line}')
continue
value = value.strip()
try:
value = int(value)
except ValueError:
pass
test_result['result'][key.strip()] = value
return test_result
def _get_swebench_workspace_dir_name(instance: pd.Series) -> str:
return f'{instance.repo}__{instance.version}'.replace('/', '__')
def process_instance(
instance: pd.Series,
metadata: EvalMetadata,
reset_logger: bool = True,
):
# Create the agent
agent = Agent.get_cls(metadata.agent_class)(llm=LLM(config=metadata.llm_config))
workspace_mount_path = os.path.join(config.workspace_mount_path, '_eval_workspace')
# create process-specific workspace dir
workspace_mount_path = os.path.join(workspace_mount_path, str(os.getpid()))
pathlib.Path(workspace_mount_path).mkdir(parents=True, exist_ok=True)
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
if reset_logger:
# Set up logger
log_file = os.path.join(
metadata.eval_output_dir,
'infer_logs',
f'instance_{instance.instance_id}.log',
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
# add back the console handler to print ONE line
logger.addHandler(get_console_handler())
logger.info(
f'Starting evaluation for instance {instance.instance_id}.\nHint: run "tail -f {log_file}" to see live logs in a separate shell'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
os.makedirs(os.path.dirname(log_file), exist_ok=True)
file_handler = logging.FileHandler(log_file)
file_handler.setFormatter(
logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
)
logger.addHandler(file_handler)
else:
logger.info(f'Starting evaluation for instance {instance.instance_id}.')
# NOTE: this is something special we do for SWE-Bench due to the reason described in the previous section
# You can omit this if you don't need to setup specialized sandbox
workspace_dir_name = f'{instance.repo}__{instance.version}'.replace('/', '__')
sandbox = SWEBenchSSHBox.get_box_for_instance(
config,
instance,
workspace_dir_name,
workspace_mount_path=workspace_mount_path,
sandbox_plugins=agenthub.Agent.get_cls(metadata.agent_class).sandbox_plugins,
use_instance_image=USE_INSTANCE_IMAGE,
)
def get_instruction(instance: pd.Series, metadata: EvalMetadata):
workspace_dir_name = _get_swebench_workspace_dir_name(instance)
# Prepare instruction
if metadata.agent_class == 'CodeActSWEAgent':
instruction = (
@ -227,39 +60,11 @@ def process_instance(
f'{instance.problem_statement}\n'
'--- END ISSUE ---\n\n'
)
if USE_HINT_TEXT and instance.hints_text:
instruction += (
f'--- BEGIN HINTS ---\n{instance.hints_text}\n--- END HINTS ---\n'
)
instruction += f"""Now, you're going to solve this issue on your own. Your terminal session has started and you're in the repository's root directory. You can use any bash commands or the special interface to help you. Edit all the files you need to and run any checks or tests that you want.
Remember, YOU CAN ONLY ENTER ONE COMMAND AT A TIME. You should always wait for feedback after every command.
When you're satisfied with all of the changes you've made, you can run the following command: <execute_bash> exit </execute_bash>.
Note however that you cannot use any interactive session commands (e.g. vim) in this environment, but you can write scripts and run them. E.g. you can write a python script and then run it with `python <script_name>.py`.
NOTE ABOUT THE EDIT COMMAND: Indentation really matters! When editing a file, make sure to insert appropriate indentation before each line!
IMPORTANT TIPS:
1. Always start by trying to replicate the bug that the issue discusses.
If the issue includes code for reproducing the bug, we recommend that you re-implement that in your environment, and run it to make sure you can reproduce the bug.
Then start trying to fix it.
When you think you've fixed the bug, re-run the bug reproduction script to make sure that the bug has indeed been fixed.
If the bug reproduction script does not print anything when it successfully runs, we recommend adding a print("Script completed successfully, no errors.") command at the end of the file,
so that you can be sure that the script indeed ran fine all the way through.
2. If you run a command and it doesn't work, try running a different command. A command that did not work once will not work the second time unless you modify it!
3. If you open a file and need to get to an area around a specific line that is not in the first 100 lines, say line 583, don't just use the scroll_down command multiple times. Instead, use the goto 583 command. It's much quicker.
4. If the bug reproduction script requires inputting/reading a specific file, such as buggy-input.png, and you'd like to understand how to input that file, conduct a search in the existing repo code, to see whether someone else has already done that. Do this by running the command: find_file("buggy-input.png") If that doesn't work, use the linux 'find' command.
5. Always make sure to look at the currently open file and the current working directory (which appears right after the currently open file). The currently open file might be in a different directory than the working directory! Note that some commands, such as 'create', open files, so they might change the current open file.
6. When editing files, it is easy to accidentally specify a wrong line number or to write code with incorrect indentation. Always check the code after you issue an edit to make sure that it reflects what you wanted to accomplish. If it didn't, issue another command to fix it.
[Current directory: /workspace/{workspace_dir_name}]
"""
instruction += CODEACT_SWE_PROMPT.format(workspace_dir_name=workspace_dir_name)
else:
# Testing general agents
instruction = (
@ -277,61 +82,280 @@ IMPORTANT TIPS:
)
# NOTE: You can actually set slightly different instruction for different agents
instruction += AGENT_CLS_TO_INST_SUFFIX[agent.__class__.__name__]
instruction += AGENT_CLS_TO_INST_SUFFIX[metadata.agent_class]
return instruction
def get_config(
instance: pd.Series,
metadata: EvalMetadata,
) -> AppConfig:
SWE_BENCH_CONTAINER_IMAGE = 'ghcr.io/opendevin/eval-swe-bench:full-v1.2.1'
if USE_INSTANCE_IMAGE:
# We use a different instance image for each instance of SWE-bench eval
container_image = 'sweb.eval.x86_64.' + instance['instance_id']
else:
container_image = SWE_BENCH_CONTAINER_IMAGE
config = AppConfig(
default_agent=metadata.agent_class,
run_as_devin=False,
runtime='eventstream',
max_budget_per_task=4,
max_iterations=metadata.max_iterations,
sandbox=SandboxConfig(
container_image=container_image,
enable_auto_lint=True,
use_host_network=False,
# always make sure we are using the latest source code
update_source_code=True,
# large enough timeout, since some testcases take very long to run
timeout=300,
),
# do not mount workspace
workspace_base=None,
workspace_mount_path=None,
)
config.set_llm_config(metadata.llm_config)
return config
async def initialize_runtime(
runtime: Runtime,
instance: pd.Series, # this argument is not required
):
"""Initialize the runtime for the agent.
This function is called before the runtime is used to run the agent.
"""
logger.info('-' * 30)
logger.info('BEGIN Runtime Initialization Fn')
logger.info('-' * 30)
workspace_dir_name = _get_swebench_workspace_dir_name(instance)
obs: CmdOutputObservation
# Set instance id
action = CmdRunAction(
command=f"""echo 'export SWE_INSTANCE_ID={instance['instance_id']}' >> ~/.bashrc && echo 'export PIP_CACHE_DIR=~/.cache/pip' >> ~/.bashrc && echo "alias git='git --no-pager'" >> ~/.bashrc"""
)
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
assert obs.exit_code == 0
if USE_INSTANCE_IMAGE:
# inject the init script
script_dir = os.path.dirname(__file__)
# inject the instance info
action = CmdRunAction(command='mkdir -p /swe_util/eval_data/instances')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
assert (
obs.exit_code == 0
), f'Failed to create /swe_util/eval_data/instances: {obs.content}'
swe_instance_json_name = 'swe-bench-instance.json'
with tempfile.TemporaryDirectory() as temp_dir:
# Construct the full path for the desired file name within the temporary directory
temp_file_path = os.path.join(temp_dir, swe_instance_json_name)
# Write to the file with the desired name within the temporary directory
with open(temp_file_path, 'w') as f:
if not isinstance(instance, dict):
json.dump([instance.to_dict()], f)
else:
json.dump([instance], f)
# Copy the file to the desired location
await runtime.copy_to(temp_file_path, '/swe_util/eval_data/instances/')
# inject the instance swe entry
await runtime.copy_to(
str(os.path.join(script_dir, 'scripts/setup/instance_swe_entry.sh')),
'/swe_util/',
)
action = CmdRunAction(command='cat ~/.bashrc')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
assert obs.exit_code == 0
action = CmdRunAction(command='source ~/.bashrc')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
assert obs.exit_code == 0
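# instance_swe_entry.sh sets up the task environment (e.g., copies the instance repo into /workspace); see scripts/setup/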
action = CmdRunAction(command='source /swe_util/instance_swe_entry.sh')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
assert obs.exit_code == 0
else:
action = CmdRunAction(command='source /swe_util/swe_entry.sh')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
assert (
obs.exit_code == 0
), f'Failed to source /swe_util/swe_entry.sh: {obs.content}'
action = CmdRunAction(command=f'cd /workspace/{workspace_dir_name}')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
assert obs.exit_code == 0
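# Remove all future commits and remotes, following Devin's SWE-bench setup
# https://www.cognition-labs.com/post/swe-bench-technical-report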
action = CmdRunAction(command='git reset --hard')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
assert obs.exit_code == 0
action = CmdRunAction(
command='for remote_name in $(git remote); do git remote remove "${remote_name}"; done'
)
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
assert obs.exit_code == 0
logger.info('-' * 30)
logger.info('END Runtime Initialization Fn')
logger.info('-' * 30)
async def complete_runtime(
runtime: Runtime,
instance: pd.Series, # this argument is not required, but it is used to get the workspace_dir_name
) -> dict[str, Any]:
"""Complete the runtime for the agent.
This function is called after the agent has finished running.
If you need to do something in the sandbox to get the correctness metric after
the agent has run, modify this function.
"""
logger.info('-' * 30)
logger.info('BEGIN Runtime Completion Fn')
logger.info('-' * 30)
obs: CmdOutputObservation
workspace_dir_name = _get_swebench_workspace_dir_name(instance)
action = CmdRunAction(command=f'cd /workspace/{workspace_dir_name}')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
assert obs.exit_code == 0
action = CmdRunAction(command='git config --global core.pager ""')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
assert obs.exit_code == 0
action = CmdRunAction(command='git add -A')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
assert obs.exit_code == 0
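# Retry the git diff a few times with an increasing timeout, since producing a large diff can be slow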
n_retries = 0
git_patch = None
while n_retries < 5:
action = CmdRunAction(
command=f'git diff --no-color --cached {instance["base_commit"]}',
keep_prompt=False,
)
action.timeout = 600 + 100 * n_retries
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
n_retries += 1
if isinstance(obs, CmdOutputObservation):
if obs.exit_code == 0:
git_patch = obs.content.strip()
break
else:
logger.info('Failed to get git diff, retrying...')
await asyncio.sleep(10)
elif isinstance(obs, ErrorObservation):
logger.error(f'Error occurred: {obs.content}. Retrying...')
await asyncio.sleep(10)
else:
raise ValueError(f'Unexpected observation type: {type(obs)}')
logger.info('-' * 30)
logger.info('END Runtime Completion Fn')
logger.info('-' * 30)
return {'git_patch': git_patch}
async def process_instance(
instance: pd.Series,
metadata: EvalMetadata,
reset_logger: bool = True,
) -> EvalOutput:
config = get_config(instance, metadata)
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
if reset_logger:
log_dir = os.path.join(metadata.eval_output_dir, 'infer_logs')
reset_logger_for_multiprocessing(logger, instance.instance_id, log_dir)
else:
logger.info(f'Starting evaluation for instance {instance.instance_id}.')
runtime = await create_runtime(config, sid=instance.instance_id)
await initialize_runtime(runtime, instance)
instruction = get_instruction(instance, metadata)
# Here's how you can run the agent (similar to the `main` function) and get the final task state
config.max_iterations = metadata.max_iterations
state: State | None = asyncio.run(
run_controller(
config=config,
task_str=instruction,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN[
agent.__class__.__name__
],
agent=agent,
sandbox=sandbox,
sid=instance.instance_id,
)
state: State | None = await run_controller(
config=config,
task_str=instruction,
runtime=runtime,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN[metadata.agent_class],
)
# ======= THIS IS SWE-Bench specific =======
# Get git patch
git_patch = sandbox.get_diff_patch()
logger.info(f'Got git diff for instance {instance.instance_id}')
return_val = await complete_runtime(runtime, instance)
git_patch = return_val['git_patch']
logger.info(
f'Got git diff for instance {instance.instance_id}:\n--------\n{git_patch}\n--------'
)
# ==========================================
# ======= Attempt to evaluate the agent's edits =======
# TODO: if you need to do something in the sandbox to get the correctness metric, modify this function
test_result = get_test_result(instance, sandbox, workspace_dir_name)
# we use eval_infer.sh to evaluate the agent's edits, not here
# because the agent may alter the environment / testcases
test_result = {
'git_patch': git_patch,
}
# If you are working on some simpler benchmark that only evaluates the final model output (e.g., in a MessageAction)
# You can simply get the LAST `MessageAction` from the returned `state.history` and parse it for evaluation.
if state is None:
raise ValueError('State should not be None.')
metrics = state.metrics.get() if state.metrics else None
# history is now available as a stream of events, rather than list of pairs of (Action, Observation)
# for compatibility with the existing output format, we can remake the pairs here
# remove when it becomes unnecessary
histories = state.history.compatibility_for_eval_history_pairs()
metrics = state.metrics.get() if state.metrics else None
# Save the output
output = {
'instance_id': instance.instance_id,
'swe_instance': instance.to_dict(), # SWE Bench specific
'instruction': instruction,
'git_patch': git_patch, # SWE Bench specific
'metadata': metadata.model_dump(),
'history': histories,
'metrics': metrics,
'error': state.last_error if state and state.last_error else None,
'test_result': test_result,
}
# Close the sandbox
sandbox.close()
output = EvalOutput(
instance_id=instance.instance_id,
instruction=instruction,
instance=instance.to_dict(), # SWE Bench specific
test_result=test_result,
metadata=metadata,
history=histories,
metrics=metrics,
error=state.last_error if state and state.last_error else None,
)
return output
@ -359,11 +383,12 @@ if __name__ == '__main__':
dataset = load_dataset('princeton-nlp/SWE-bench_Lite')
swe_bench_tests = filter_dataset(dataset['test'].to_pandas(), 'instance_id')
id_column = 'instance_id'
llm_config = get_llm_config_arg(args.llm_config) if args.llm_config else config.llm
if args.llm_config and llm_config is None:
raise ValueError(f'Could not find LLM config {args.llm_config}')
logger.info(f'Config for evaluation: {config}')
llm_config = None
if args.llm_config:
llm_config = get_llm_config_arg(args.llm_config)
if llm_config is None:
raise ValueError(f'Could not find LLM config: --llm_config {args.llm_config}')
details = {}
_agent_cls = agenthub.Agent.get_cls(args.agent_cls)
@ -383,14 +408,10 @@ if __name__ == '__main__':
)
output_file = os.path.join(metadata.eval_output_dir, 'output.jsonl')
instances = prepare_dataset(
swe_bench_tests, output_file, args.eval_n_limit, id_column
)
run_evaluation(
instances,
metadata,
output_file,
args.eval_num_workers,
process_instance,
id_column,
instances = prepare_dataset(swe_bench_tests, output_file, args.eval_n_limit)
asyncio.run(
run_evaluation(
instances, metadata, output_file, args.eval_num_workers, process_instance
)
)

View File

@ -45,9 +45,16 @@ def process_git_patch(patch):
def convert_row_to_swebench_format(row):
if 'git_patch' in row:
model_patch = row['git_patch']
elif 'test_result' in row and 'git_patch' in row['test_result']:
model_patch = row['test_result']['git_patch']
else:
raise ValueError(f'Row {row} does not have a git_patch')
return {
'instance_id': row['instance_id'],
'model_patch': process_git_patch(row['git_patch']),
'model_patch': process_git_patch(model_patch),
'model_name_or_path': model_name,
}

View File

@ -27,8 +27,8 @@ if [ -z "$MAX_ITER" ]; then
fi
if [ -z "$USE_INSTANCE_IMAGE" ]; then
echo "USE_INSTANCE_IMAGE not specified, use default false"
USE_INSTANCE_IMAGE=false
echo "USE_INSTANCE_IMAGE not specified, use default true"
USE_INSTANCE_IMAGE=true
fi
export USE_INSTANCE_IMAGE=$USE_INSTANCE_IMAGE

View File

@ -45,7 +45,11 @@ echo "$item" | jq -r '.patch' > $SWE_TASK_DIR/gold.patch
echo "$item" | jq 'del(.test_patch, .patch)' > $SWE_TASK_DIR/instance.json
# Clear the workspace
rm -rf /workspace/*
if [ -d /workspace ]; then
rm -rf /workspace/*
else
mkdir /workspace
fi
# Copy repo to workspace
if [ -d /workspace/$WORKSPACE_NAME ]; then
rm -rf /workspace/$WORKSPACE_NAME
@ -61,7 +65,7 @@ mkdir -p $SWE_TASK_DIR/reset_testbed_log_dir
REPO_PATH=/workspace/$WORKSPACE_NAME
echo "Repo Path: $REPO_PATH"
echo "Test Command: $TEST_CMD"
# echo "Test Command: $TEST_CMD"
echo "export REPO_PATH=\"$REPO_PATH\"" >> ~/.bashrc
# echo "export TEST_CMD=\"$TEST_CMD\"" >> ~/.bashrc

View File

@ -1,313 +0,0 @@
import json
import os
import sys
import tempfile
import uuid
from datasets import load_dataset
from swebench.harness.constants import MAP_REPO_TO_TEST_FRAMEWORK
from swebench.harness.utils import get_test_directives
from opendevin.core.config import AppConfig, SandboxConfig, load_app_config
from opendevin.core.logger import opendevin_logger as logger
from opendevin.runtime.docker.ssh_box import DockerSSHBox
from opendevin.runtime.plugins import (
AgentSkillsRequirement,
JupyterRequirement,
PluginRequirement,
)
SWE_BENCH_CONTAINER_IMAGE = 'ghcr.io/opendevin/eval-swe-bench:full-v1.2.1'
def get_image_name_from_instance_id(instance_id: str) -> str:
return 'sweb.eval.x86_64.' + instance_id
class SWEBenchSSHBox(DockerSSHBox):
def __init__(
self,
config: AppConfig,
container_image: str,
timeout: int = 120,
sid: str | None = None,
swe_instance_id: str | None = None,
swe_instance: dict | None = None,
skip_workspace_mount: bool = True,
sandbox_plugins: list[PluginRequirement] = [], # noqa: B006
workspace_dir_name: str | None = None,
use_instance_image: bool = False,
):
if swe_instance_id is None:
raise ValueError('swe_instance_id must be provided!')
self.swe_instance_id = swe_instance_id
self.swe_instance = swe_instance
self.skip_workspace_mount = skip_workspace_mount
self.workspace_dir_name = workspace_dir_name
assert (
container_image is not None
), 'container_image is required for SWEBenchSSHBox!'
# Need to run as root to use SWEBench container
sid = f'swe_bench_{swe_instance_id}_' + str(uuid.uuid4())
logger.info(f'===Using container image: {container_image}')
super().__init__(
config=SandboxConfig(container_image=container_image, timeout=timeout),
persist_sandbox=config.persist_sandbox,
workspace_mount_path=config.workspace_mount_path,
sandbox_workspace_dir=config.workspace_mount_path_in_sandbox,
cache_dir=config.cache_dir,
run_as_devin=config.run_as_devin,
ssh_hostname=config.ssh_hostname,
ssh_password=config.ssh_password,
ssh_port=config.ssh_port,
sid=sid,
)
self.init_plugins(sandbox_plugins)
exit_code, output = self.execute('mv ~/.bashrc ~/.bashrc.bak')
assert exit_code == 0, f'Failed to backup ~/.bashrc: {output}'
exit_code, output = self.execute(
f"echo 'export SWE_INSTANCE_ID={self.swe_instance_id}' >> ~/.bashrc && echo 'export PIP_CACHE_DIR=~/.cache/pip' >> ~/.bashrc && echo \"alias git='git --no-pager'\" >> ~/.bashrc"
)
assert exit_code == 0, f'Failed to set SWE_INSTANCE_ID in ~/.bashrc: {output}'
logger.info('Sourcing swe_entry.sh to set up environment variables')
logger.info(
'Initialization of SWEBench may take approximately 10 minutes due to long-running installations, such as those requiring compilation.'
)
logger.info(f'Use instance image: {use_instance_image}')
if use_instance_image:
# we directly inject the instance info into the container and the init script
script_dir = os.path.dirname(__file__)
# inject test command
test_type = MAP_REPO_TO_TEST_FRAMEWORK[swe_instance['repo']][
swe_instance['version']
]
swe_instance['test_directives'] = get_test_directives(swe_instance)
swe_instance['test_cmd'] = (
f"{test_type} {' '.join(swe_instance['test_directives'])}"
)
exit_code, output = self.execute(
f"""echo "export TEST_CMD='{swe_instance["test_cmd"]}'" >> ~/.bashrc"""
)
# assert exit_code == 0, f'Failed to set TEST_CMD in ~/.bashrc: {output}'
# inject the instance info
self.execute('mkdir -p /swe_util/eval_data/instances')
swe_instance_json_name = 'swe-bench-instance.json'
with tempfile.TemporaryDirectory() as temp_dir:
# Construct the full path for the desired file name within the temporary directory
temp_file_path = os.path.join(temp_dir, swe_instance_json_name)
# Write to the file with the desired name within the temporary directory
with open(temp_file_path, 'w') as f:
if not isinstance(swe_instance, dict):
json.dump([swe_instance.to_dict()], f)
else:
json.dump([swe_instance], f)
# Copy the file to the desired location
self.copy_to(temp_file_path, '/swe_util/eval_data/instances/')
# inject the init script
self.copy_to(
str(os.path.join(script_dir, 'scripts/setup/instance_swe_entry.sh')),
'/swe_util/',
)
self.execute('cat ~/.bashrc')
self.execute('source ~/.bashrc')
self.execute('source /swe_util/instance_swe_entry.sh', timeout=600)
logger.info('exit code: %d', exit_code)
logger.info(output)
assert exit_code == 0, f'Failed to source swe_entry.sh: {output}'
logger.info('Sourced swe_entry.sh successfully')
else:
exit_code, output = self.execute(
'source /swe_util/swe_entry.sh', timeout=600
)
logger.info('exit code: %d', exit_code)
logger.info(output)
assert exit_code == 0, f'Failed to source swe_entry.sh: {output}'
logger.info('Sourced swe_entry.sh successfully')
@property
def volumes(self):
if self.skip_workspace_mount:
return {
k: v
for k, v in super().volumes.items()
if not v['bind'] == self.sandbox_workspace_dir
}
return super().volumes
@classmethod
def get_box_for_instance(
cls,
config: AppConfig,
instance,
workspace_dir_name=None,
skip_workspace_mount: bool = True,
workspace_mount_path: str | None = None,
sandbox_plugins: list[PluginRequirement] = [], # noqa: B006
use_instance_image: bool = False,
) -> 'SWEBenchSSHBox':
if workspace_dir_name is None:
workspace_dir_name = f"{instance['repo']}__{instance['version']}".replace(
'/', '__'
)
old_workspace_base = config.workspace_base
old_workspace_mount_path = config.workspace_mount_path
try:
config.workspace_base = workspace_mount_path
config.workspace_mount_path = workspace_mount_path
# linting python after editing helps LLM fix indentations
config.sandbox.enable_auto_lint = True
# Need to run as root to use SWEBench container
config.run_as_devin = False
if use_instance_image:
container_image = get_image_name_from_instance_id(
instance['instance_id']
)
else:
container_image = SWE_BENCH_CONTAINER_IMAGE
sandbox = cls(
container_image=container_image,
config=config,
swe_instance_id=instance['instance_id'],
swe_instance=instance,
skip_workspace_mount=skip_workspace_mount,
sandbox_plugins=sandbox_plugins,
workspace_dir_name=workspace_dir_name,
use_instance_image=use_instance_image,
)
logger.info(f"SSH box started for instance {instance['instance_id']}.")
# cd to the repo
exit_code, output = sandbox.execute(f'cd /workspace/{workspace_dir_name}')
if exit_code != 0:
logger.error(f'Failed to cd to the repo: {output}')
sys.exit(1)
# remove all future commits & remote following Devin
# https://www.cognition-labs.com/post/swe-bench-technical-report
exit_code, output = sandbox.execute('git reset --hard')
if exit_code != 0:
logger.error(f'Failed to reset the repo: {output}')
sys.exit(1)
exit_code, output = sandbox.execute(
'for remote_name in $(git remote); do git remote remove "${remote_name}"; done'
)
if exit_code != 0:
logger.error(f'Failed to remove remote: {output}')
sys.exit(1)
except Exception:
raise
finally:
# restore workspace_base and workspace_mount_path
config.workspace_base = old_workspace_base
config.workspace_mount_path = old_workspace_mount_path
return sandbox
def get_diff_patch(self):
# add everything to the index
exit_code, output = self.execute(f'cd /workspace/{self.workspace_dir_name}')
if exit_code != 0:
logger.error('Failed to cd to the repo')
return ''
exit_code, _output = self.execute('git config --global core.pager ""')
if exit_code != 0:
logger.error('Failed to change git config')
return ''
# add everything to the index
exit_code, output = self.execute('git add -A')
if exit_code != 0:
logger.error('Failed to add everything to the index')
return ''
# get the git diff
exit_code, git_patch = self.execute(
f'git diff --no-color --cached {self.swe_instance["base_commit"]}'
)
if exit_code != 0:
logger.error('Failed to get git diff')
return ''
return git_patch
if __name__ == '__main__':
config = load_app_config()
# NOTE: It is preferable to load datasets from huggingface datasets and perform post-processing
# so we don't need to manage file uploading to OpenDevin's repo
dataset = load_dataset('princeton-nlp/SWE-bench_Lite')
swe_bench_tests = dataset['test'].to_pandas()
USE_INSTANCE_IMAGE = os.environ.get('USE_INSTANCE_IMAGE', 'false') == 'true'
logger.info(f'USE_INSTANCE_IMAGE: {USE_INSTANCE_IMAGE}')
# INSTANCE_ID = 'django__django-11099'
INSTANCE_ID = 'astropy__astropy-12907'
swe_bench_tests = swe_bench_tests[swe_bench_tests['instance_id'] == INSTANCE_ID]
EXAMPLE_INSTANCE = swe_bench_tests.iloc[0].to_dict()
sandbox = SWEBenchSSHBox.get_box_for_instance(
config=config,
instance=EXAMPLE_INSTANCE,
sandbox_plugins=[AgentSkillsRequirement(), JupyterRequirement()],
use_instance_image=USE_INSTANCE_IMAGE,
)
# PRE TEST
exit_code, output = sandbox.execute('cd $REPO_PATH')
assert exit_code == 0, 'Failed to cd $REPO_PATH'
logger.info(f'cd $REPO_PATH: {output}')
# apply test patch
exit_code, output = sandbox.execute('git apply $SWE_TASK_DIR/test.patch')
assert exit_code == 0, 'Failed to apply test patch'
logger.info(f'git apply $SWE_TASK_DIR/test.patch: {output}')
# TEST
exit_code, output = sandbox.execute('$TEST_CMD')
assert exit_code == 1, 'Expected exit code 1 (since this is a FAIL_TO_PASS)'
logger.info(f'$TEST_CMD:\n{output}')
# apply gold patch
exit_code, output = sandbox.execute('git apply $SWE_TASK_DIR/gold.patch')
logger.info('exit code: %d', exit_code)
logger.info(f'git apply $SWE_TASK_DIR/gold.patch: {output}')
# TEST
exit_code, output = sandbox.execute('$TEST_CMD')
assert exit_code == 0, 'Expected exit code 0 (since we applied the gold patch)'
logger.info(f'$TEST_CMD:\n{output}')
# Reset the repo
exit_code, output = sandbox.execute('git reset --hard')
assert exit_code == 0, 'Failed to reset the repo'
logger.info(f'git reset --hard: {output}')
sys.stdout.flush()
try:
while True:
try:
user_input = input('>>> ')
except EOFError:
logger.info('Exiting...')
break
if user_input.lower() == 'exit':
logger.info('Exiting...')
break
exit_code, output = sandbox.execute(user_input)
logger.info('exit code: %d', exit_code)
logger.info(output)
sys.stdout.flush()
except KeyboardInterrupt:
logger.info('Exiting...')
sandbox.close()

View File

@ -0,0 +1,17 @@
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y python3 python3-pip
RUN mkdir /workspace
WORKDIR /workspace
COPY data/ /workspace/data/
COPY tools/ /workspace/tools/
# TODO: NEED TO FIGURE OUT DEPENDENCIES FOR THESE TOOLS
# pushd evaluation/toolqa
# mkdir data
# python3 -c "from utils import download_data, download_tools; download_data('/workspace'); download_tools('/workspace')"
# docker build --network host -t xingyaoww/od-eval-toolqa .

View File

@ -2,13 +2,9 @@
This folder contains an evaluation harness we built on top of the original [ToolQA](https://github.com/night-chen/ToolQA) ([paper](https://arxiv.org/pdf/2306.13304)).
## Setup Environment
## Setup Environment and LLM Configuration
Please follow [this document](https://github.com/OpenDevin/OpenDevin/blob/main/Development.md) to setup local development environment for OpenDevin.
## Configure OpenDevin and your LLM
Run `make setup-config` to set up the `config.toml` file if it does not exist at the root of the workspace.
Please follow the instructions [here](../README.md#setup) to set up your local development environment and LLM.
## Run Inference on ToolQA Instances

View File

@ -1,29 +1,31 @@
import asyncio
import logging
import os
import pathlib
from typing import Any
import pandas as pd
from evaluation.toolqa.utils import encode_question, eval_answer, get_data
from evaluation.utils.shared import (
EvalMetadata,
EvalOutput,
codeact_user_response,
make_metadata,
prepare_dataset,
reset_logger_for_multiprocessing,
run_evaluation,
)
from opendevin.controller.agent import Agent
from opendevin.controller.state.state import State
from opendevin.core.config import get_llm_config_arg, get_parser, load_app_config
from opendevin.core.logger import get_console_handler
from opendevin.core.config import (
AppConfig,
SandboxConfig,
get_llm_config_arg,
get_parser,
)
from opendevin.core.logger import opendevin_logger as logger
from opendevin.core.main import run_controller
from opendevin.llm.llm import LLM
from .utils import download_data, download_tools, encode_question, eval_answer, get_data
config = load_app_config()
from opendevin.core.main import create_runtime, run_controller
from opendevin.events.action import CmdRunAction
from opendevin.events.observation import CmdOutputObservation
from opendevin.runtime.runtime import Runtime
AGENT_CLS_TO_FAKE_USER_RESPONSE_FN = {
'CodeActAgent': codeact_user_response,
@ -34,59 +36,84 @@ AGENT_CLS_TO_INST_SUFFIX = {
}
def process_instance(instance: Any, metadata: EvalMetadata, reset_logger: bool = True):
agent = Agent.get_cls(metadata.agent_class)(llm=LLM(config=metadata.llm_config))
# create process-specific workspace dir
# we will create a workspace directory for EACH process
# so that different agent don't interfere with each other.
workspace_mount_path = config.workspace_mount_path
pathlib.Path(workspace_mount_path).mkdir(parents=True, exist_ok=True)
def get_config(
metadata: EvalMetadata,
) -> AppConfig:
config = AppConfig(
default_agent=metadata.agent_class,
run_as_devin=False,
runtime='eventstream',
max_iterations=metadata.max_iterations,
sandbox=SandboxConfig(
container_image='ubuntu:22.04',
enable_auto_lint=True,
use_host_network=False,
update_source_code=True,
),
# do not mount workspace
workspace_base=None,
workspace_mount_path=None,
)
config.set_llm_config(metadata.llm_config)
return config
async def initialize_runtime(runtime: Runtime):
"""Initialize the runtime for the agent.
This function is called before the runtime is used to run the agent.
"""
logger.info(f"{'-' * 50} BEGIN Runtime Initialization Fn {'-' * 50}")
obs: CmdOutputObservation
# Create and enter the workspace directory
action = CmdRunAction(command='mkdir -p /workspace')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
action = CmdRunAction(command='cd /workspace')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
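# Make the Wolfram Alpha app id available as an environment variable inside the runtime sandbox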
await runtime.add_env_vars({'WOLFRAM_ALPHA_APPID': args.wolfram_alpha_appid})
logger.info(f"{'-' * 50} END Runtime Initialization Fn {'-' * 50}")
async def process_instance(
instance: Any, metadata: EvalMetadata, reset_logger: bool = True
):
config = get_config(metadata)
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
eval_output_dir = metadata.eval_output_dir
qid = instance.qid
question = instance.question
answer = instance.answer
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
if reset_logger:
# Set up logger
log_file = os.path.join(eval_output_dir, 'logs', f'instance_{qid}.log')
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
# add back the console handler to print ONE line
logger.addHandler(get_console_handler())
logger.info(
f'Starting evaluation for instance {qid}.\nHint: run "tail -f {log_file}" to see live logs in a separate shell'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
file_handler = logging.FileHandler(log_file)
file_handler.setFormatter(
logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
)
logger.addHandler(file_handler)
logger.info(f'Process-specific workspace mounted at {workspace_mount_path}')
log_dir = os.path.join(metadata.eval_output_dir, 'infer_logs')
reset_logger_for_multiprocessing(logger, qid, log_dir)
else:
logger.info(f'Starting evaluation for instance {qid}.')
# Prepare instruction
instruction = encode_question(question)
instruction += 'IMPORTANT: You should ONLY interact with the environment provided to you AND NEVER ASK FOR HUMAN HELP.\n'
# NOTE: You can actually set slightly different instruction for different agents
instruction += AGENT_CLS_TO_INST_SUFFIX[agent.__class__.__name__]
# logger.info(f'Instruction:\n{instruction}', extra={'msg_type': 'OBSERVATION'})
instruction += AGENT_CLS_TO_INST_SUFFIX[metadata.agent_class]
logger.info(f'Instruction:\n{instruction}', extra={'msg_type': 'OBSERVATION'})
runtime = await create_runtime(config, sid=qid)
await initialize_runtime(runtime)
# Here's how you can run the agent (similar to the `main` function) and get the final task state
config.max_iterations = metadata.max_iterations
state: State | None = asyncio.run(
run_controller(
config=config,
task_str=instruction,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN[
agent.__class__.__name__
],
agent=agent,
sid=qid,
)
state: State | None = await run_controller(
config=config,
task_str=instruction,
runtime=runtime,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN[metadata.agent_class],
)
# ======= Attempt to evaluate the agent's edits =======
# If you are working on simpler benchmark that only evaluates the final model output (e.g., in a MessageAction)
@ -110,17 +137,17 @@ def process_instance(instance: Any, metadata: EvalMetadata, reset_logger: bool =
histories = state.history.compatibility_for_eval_history_pairs()
# Save the output
output = {
'qid': qid,
'text': model_answer_raw,
'correct': correct,
'answer_id': 'None',
'model_id': metadata.model_name,
'metadata': metadata,
'history': histories,
'metrics': metrics,
'error': state.last_error if state and state.last_error else None,
}
output = EvalOutput(
instance_id=qid,
test_result={
'model_answer_raw': model_answer_raw,
'correct': correct,
},
metadata=metadata,
history=histories,
metrics=metrics,
error=state.last_error if state and state.last_error else None,
)
return output
@ -145,8 +172,12 @@ if __name__ == '__main__':
default='YOUR_WOLFRAMALPHA_APPID',
)
args, _ = parser.parse_known_args()
llm_config = get_llm_config_arg(args.llm_config) if args.llm_config else config.llm
logger.info(f'Config for evaluation: {config}')
llm_config = None
if args.llm_config:
llm_config = get_llm_config_arg(args.llm_config)
if llm_config is None:
raise ValueError(f'Could not find LLM config: --llm_config {args.llm_config}')
dataset = ''
hardness = ''
@ -168,14 +199,9 @@ if __name__ == '__main__':
if args.hardness not in ['easy', 'hard']:
raise ValueError('Please choose from easy and hard for hardness.')
# workspace_mount_path = os.path.join(config.workspace_mount_path, '_eval_workspace')
workspace_mount_path = config.workspace_mount_path
pathlib.Path(workspace_mount_path).mkdir(parents=True, exist_ok=True)
toolqa_test = pd.DataFrame(get_data(dataset, hardness))
toolqa_data_path = download_data(workspace_mount_path)
toolqa_tool_path = download_tools(workspace_mount_path, args.wolfram_alpha_appid)
toolqa_test.rename(columns={'qid': 'instance_id'}, inplace=True)
id_column = 'qid'
metadata = make_metadata(
llm_config,
f'toolqa-{args.dataset}-{args.hardness}',
@ -184,12 +210,9 @@ if __name__ == '__main__':
args.eval_output_dir,
)
output_file = os.path.join(metadata.eval_output_dir, 'output.jsonl')
instances = prepare_dataset(toolqa_test, output_file, args.eval_n_limit, id_column)
run_evaluation(
instances,
metadata,
output_file,
args.eval_num_workers,
process_instance,
id_column,
instances = prepare_dataset(toolqa_test, output_file, args.eval_n_limit)
asyncio.run(
run_evaluation(
instances, metadata, output_file, args.eval_num_workers, process_instance
)
)

View File

@ -4,11 +4,12 @@ import re
import string
import zipfile
import gdown
import requests
def download_data(dir):
import gdown
data_path = os.path.join(dir, 'data/external_corpus')
if os.path.exists(data_path):
return data_path
@ -19,6 +20,7 @@ def download_data(dir):
zip_ref.extractall(os.path.join(dir, 'data'))
if os.path.exists(zip_path):
os.remove(zip_path)
print(f'Data saved to {data_path}')
return data_path
@ -42,6 +44,7 @@ def download_tools(dir, wolfram_alpha_appid='YOUR_WOLFRAMALPHA_APPID'):
output_file = os.path.join(tool_path, tool.split('/')[1])
with open(output_file, 'wb') as f:
f.write(response.content)
print(f'Tool saved to {output_file}')
with open(os.path.join(tool_path, 'calculator.py'), 'r') as f:
content = f.read()
new_content = content.replace('YOUR_WOLFRAMALPHA_APPID', wolfram_alpha_appid)
@ -64,14 +67,29 @@ def download_tools(dir, wolfram_alpha_appid='YOUR_WOLFRAMALPHA_APPID'):
f.write(new_content)
LOCAL_DATA_DIR = os.path.join(os.path.dirname(__file__), 'data')
def get_data(dataset, hardness):
data = []
url = f'https://raw.githubusercontent.com/night-chen/ToolQA/main/data/questions/{hardness}/{dataset}-{hardness}.jsonl'
url = requests.get(url)
if url.status_code == 200:
lines = url.text.splitlines()
for line in lines:
data.append(json.loads(line))
data_path = os.path.join(LOCAL_DATA_DIR, f'{dataset}-{hardness}.jsonl')
if os.path.exists(data_path):
print(f'Loading data from {data_path}')
with open(data_path, 'r') as f:
return json.load(f)
else:
print(
f'Downloading data from https://raw.githubusercontent.com/night-chen/ToolQA/main/data/questions/{hardness}/{dataset}-{hardness}.jsonl'
)
data = []
url = f'https://raw.githubusercontent.com/night-chen/ToolQA/main/data/questions/{hardness}/{dataset}-{hardness}.jsonl'
url = requests.get(url)
if url.status_code == 200:
lines = url.text.splitlines()
for line in lines:
data.append(json.loads(line))
with open(data_path, 'w') as f:
json.dump(data, f)
print(f'Data saved to {data_path}')
return data

View File

@ -1,12 +1,13 @@
import asyncio
import json
import logging
import multiprocessing as mp
import os
import pathlib
import subprocess
import time
from asyncio.log import logger
from concurrent.futures import ProcessPoolExecutor
from typing import Any, Callable
from typing import Any, Awaitable, Callable
import pandas as pd
from pydantic import BaseModel
@ -14,6 +15,8 @@ from tqdm import tqdm
from opendevin.controller.state.state import State
from opendevin.core.config import LLMConfig
from opendevin.core.logger import get_console_handler
from opendevin.core.logger import opendevin_logger as logger
from opendevin.events.action import Action
from opendevin.events.action.message import MessageAction
@ -38,6 +41,31 @@ class EvalMetadata(BaseModel):
return json.dumps(dumped_dict)
class EvalOutput(BaseModel):
# NOTE: User-specified
instance_id: str
instruction: str
# output of the evaluation
# store anything that is needed for the score calculation
test_result: dict[str, Any]
# Interaction info
metadata: EvalMetadata
history: list[tuple[dict[str, Any], dict[str, Any]]]
metrics: dict[str, Any]
error: str | None = None
# Optionally save the input test instance
instance: dict[str, Any] | None = None
def model_dump_json(self, *args, **kwargs):
dumped = super().model_dump_json(*args, **kwargs)
dumped_dict = json.loads(dumped)
# Apply custom serialization for metadata (to avoid leaking sensitive information)
dumped_dict['metadata'] = json.loads(self.metadata.model_dump_json())
return json.dumps(dumped_dict)
def codeact_user_response(
state: State,
encapsulate_solution: bool = False,
@ -136,7 +164,11 @@ def make_metadata(
return metadata
def prepare_dataset(dataset: pd.DataFrame, output_file, eval_n_limit, id_column):
def prepare_dataset(dataset: pd.DataFrame, output_file: str, eval_n_limit: int):
assert (
'instance_id' in dataset.columns
), "Expected 'instance_id' column in the dataset. You should define your own unique identifier for each instance and use it as the 'instance_id' column."
id_column = 'instance_id'
logger.info(f'Writing evaluation output to {output_file}')
finished_ids = set()
if os.path.exists(output_file):
@ -164,14 +196,16 @@ def prepare_dataset(dataset: pd.DataFrame, output_file, eval_n_limit, id_column)
return pd.DataFrame(new_dataset)
def run_evaluation(
async def run_evaluation(
dataset: pd.DataFrame,
metadata: EvalMetadata,
output_file: str,
num_workers: int,
process_instance_func: Callable[[pd.Series, EvalMetadata, bool], Any],
id_column: str,
process_instance_func: Callable[
[pd.Series, EvalMetadata, bool], Awaitable[EvalOutput]
],
):
use_multiprocessing = num_workers > 1
logger.info(
f'Evaluation started with Agent {metadata.agent_class}, '
f'model {metadata.llm_config.model}, max iterations {metadata.max_iterations}.'
@ -179,35 +213,77 @@ def run_evaluation(
pbar = tqdm(total=len(dataset))
output_fp = open(output_file, 'a')
def update_progress(future):
async def update_progress(future):
pbar.update(1)
output = future.result()
pbar.set_description(f'Instance {output[id_column]}')
pbar.set_postfix_str(f'Test Result: {output["test_result"]["result"]}')
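# In the multiprocessing path `future` is an awaitable from run_in_executor; in the single-process path it is already an EvalOutput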
output: EvalOutput = await future if use_multiprocessing else future
pbar.set_description(f'Instance {output.instance_id}')
pbar.set_postfix_str(f'Test Result: {output.test_result}')
logger.info(
f'Finished evaluation for instance {output[id_column]}: {output["test_result"]["result"]}'
f'Finished evaluation for instance {output.instance_id}: {output.test_result}'
)
output_fp.write(json.dumps(output) + '\n')
output_fp.write(json.dumps(output.model_dump()) + '\n')
output_fp.flush()
try:
with ProcessPoolExecutor(num_workers) as executor:
futures = []
for _, instance in dataset.iterrows():
future = executor.submit(
process_instance_func,
instance,
metadata,
bool(num_workers > 1),
)
future.add_done_callback(update_progress)
futures.append(future)
if use_multiprocessing:
with ProcessPoolExecutor(num_workers) as executor:
loop = asyncio.get_event_loop()
futures = []
for _, instance in dataset.iterrows():
future = loop.run_in_executor(
executor,
process_instance_func,
instance,
metadata,
bool(num_workers > 1),
)
futures.append(update_progress(future))
await asyncio.gather(*futures)
# Use plain for loop for single process for easier debugging
else:
assert num_workers == 1
for _, instance in dataset.iterrows():
output = await process_instance_func(instance, metadata, False)
await update_progress(output)
for future in futures:
future.result()
except KeyboardInterrupt:
print('KeyboardInterrupt received. Cleaning up...')
cleanup()
output_fp.close()
logger.info('Evaluation finished.')
def reset_logger_for_multiprocessing(
logger: logging.Logger, instance_id: str, log_dir: str
):
"""Reset the logger for multiprocessing.
Save logs to a separate file for each process, instead of trying to write to the
same file/console from multiple processes.
"""
# Set up logger
log_file = os.path.join(
log_dir,
f'instance_{instance_id}.log',
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
# add back the console handler to print ONE line
logger.addHandler(get_console_handler())
logger.info(
f'Starting evaluation for instance {instance_id}.\n'
f'Hint: run "tail -f {log_file}" to see live logs in a separate shell'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
os.makedirs(os.path.dirname(log_file), exist_ok=True)
file_handler = logging.FileHandler(log_file)
file_handler.setFormatter(
logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
)
logger.addHandler(file_handler)
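For orientation, here is a minimal, hedged sketch of how a benchmark's `process_instance` might call this helper; the `infer_logs` directory name mirrors the webarena change later in this diff, and the `setup_instance_logging` wrapper is purely illustrative:

```python
# Sketch only: per-instance logging setup inside an eval's process_instance.
import os

from evaluation.utils.shared import EvalMetadata, reset_logger_for_multiprocessing
from opendevin.core.logger import opendevin_logger as logger


def setup_instance_logging(metadata: EvalMetadata, instance_id: str, reset_logger: bool) -> None:
    if reset_logger:
        # Each worker writes to <eval_output_dir>/infer_logs/instance_<id>.log
        # instead of interleaving output on the shared console.
        log_dir = os.path.join(metadata.eval_output_dir, 'infer_logs')
        reset_logger_for_multiprocessing(logger, instance_id, log_dir)
    else:
        logger.info(f'Starting evaluation for instance {instance_id}.')
```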

View File

@ -1,13 +1,11 @@
checkout_eval_branch() {
if [ -z "$COMMIT_HASH" ]; then
echo "Commit hash not specified, use current git commit"
build_sandbox
return 0
fi
if git diff --quiet $COMMIT_HASH HEAD; then
echo "The given hash is equivalent to the current HEAD"
build_sandbox
return 0
fi
@ -30,14 +28,8 @@ checkout_eval_branch() {
# Trap the EXIT signal to checkout original branch
trap checkout_original_branch EXIT
build_sandbox
}
build_sandbox() {
echo "Build sandbox locally"
docker build -t eval-sandbox -f containers/sandbox/Dockerfile /tmp
export SANDBOX_CONTAINER_IMAGE="eval-sandbox"
}
checkout_original_branch() {
if [ -z "$current_branch" ]; then

View File

@ -2,59 +2,14 @@
This folder contains the evaluation harness for the [WebArena](https://github.com/web-arena-x/webarena) benchmark, powered by [BrowserGym](https://github.com/ServiceNow/BrowserGym), which makes it easy to evaluate how well a browsing-capable agent performs on realistic web browsing tasks.
## Setup OpenDevin Environment
## Setup Environment and LLM Configuration
Please follow [this document](https://github.com/OpenDevin/OpenDevin/blob/main/Development.md) to setup local develop environment for OpenDevin.
## Configure OpenDevin and your LLM
Create a `config.toml` file if it does not exist at the root of the workspace.
Add the following configurations:
```toml
[core]
max_iterations = 100
cache_dir = "/tmp/cache"
ssh_hostname = "localhost"
[sandbox]
box_type = "ssh"
timeout = 120
# TODO: Change these to the model you want to evaluate
[eval_gpt4_1106_preview]
model = "gpt-4-1106-preview"
api_key = "XXX"
temperature = 0.0
[eval_some_openai_compatible_model]
model = "openai/MODEL_NAME"
base_url = "https://OPENAI_COMPATIBLE_URL/v1"
api_key = "XXX"
temperature = 0.0
```
Please follow the instructions [here](../README.md#setup) to set up your local development environment and LLM.
## Setup WebArena Environment
WebArena requires you to set up websites with pre-populated content that are reachable via URL from the machine running the OpenDevin agents.
Follow [this document](https://github.com/web-arena-x/webarena/blob/main/environment_docker/README.md) to set up your own WebArena environment through local servers or AWS EC2 instances.
Take note of the base URL of the machine where the environment is installed.
## Setup Environment Variables of WebArena Websites
Create a script `webarena_env.sh` under `evaluation/webarena/scripts` with the following:
```bash
export BASE_URL=<YOUR_SERVER_URL_HERE>
export SHOPPING="$BASE_URL:7770/"
export SHOPPING_ADMIN="$BASE_URL:7780/admin"
export REDDIT="$BASE_URL:9999"
export GITLAB="$BASE_URL:8023"
export WIKIPEDIA="$BASE_URL:8888/wikipedia_en_all_maxi_2022-05/A/User:The_other_Kiwix_guy/Landing"
export MAP="$BASE_URL:3000"
export HOMEPAGE="$BASE_URL:4399"
export OPENAI_API_KEY="yourkey" # this key is required for some WebArena validators that utilize LLMs
```
Take note of the base URL (`$WEBARENA_BASE_URL`) of the machine where the environment is installed.
## Test if your environment works
@ -65,7 +20,9 @@ Follow the WebArena environment setup guide carefully, and make sure the URL fie
## Run Evaluation
```sh
```bash
export WEBARENA_BASE_URL=<YOUR_SERVER_URL_HERE>
export OPENAI_API_KEY="yourkey" # this key is required for some WebArena validators that utilize LLMs
bash evaluation/webarena/scripts/run_infer.sh
```

View File

@ -1,7 +1,7 @@
import asyncio
import json
import logging
import os
from typing import Any
import browsergym.webarena # noqa F401 register webarena tasks as gym environments
import gymnasium as gym
@ -9,93 +9,147 @@ import pandas as pd
from evaluation.utils.shared import (
EvalMetadata,
EvalOutput,
make_metadata,
prepare_dataset,
reset_logger_for_multiprocessing,
run_evaluation,
)
from opendevin.controller.agent import Agent
from opendevin.controller.state.state import State
from opendevin.core.config import get_llm_config_arg, load_app_config, parse_arguments
from opendevin.core.logger import get_console_handler
from opendevin.core.config import (
AppConfig,
SandboxConfig,
get_llm_config_arg,
parse_arguments,
)
from opendevin.core.logger import opendevin_logger as logger
from opendevin.core.main import run_controller
from opendevin.llm.llm import LLM
from opendevin.runtime.docker.ssh_box import DockerSSHBox
from opendevin.runtime.tools import RuntimeTool
config = load_app_config()
from opendevin.core.main import create_runtime, run_controller
from opendevin.events.action import (
BrowseInteractiveAction,
CmdRunAction,
MessageAction,
)
from opendevin.events.observation import CmdOutputObservation
from opendevin.runtime.browser.browser_env import (
BROWSER_EVAL_GET_GOAL_ACTION,
BROWSER_EVAL_GET_REWARDS_ACTION,
)
from opendevin.runtime.runtime import Runtime
SUPPORTED_AGENT_CLS = {'BrowsingAgent'}
docker_ssh_box: DockerSSHBox | None = None
def get_config(
metadata: EvalMetadata,
env_id: str,
) -> AppConfig:
base_url = os.environ.get('WEBARENA_BASE_URL', None)
openai_api_key = os.environ.get('OPENAI_API_KEY', None)
assert base_url is not None, 'WEBARENA_BASE_URL must be set'
assert openai_api_key is not None, 'OPENAI_API_KEY must be set'
config = AppConfig(
default_agent=metadata.agent_class,
run_as_devin=False,
runtime='eventstream',
max_iterations=metadata.max_iterations,
sandbox=SandboxConfig(
container_image='ubuntu:22.04',
enable_auto_lint=True,
use_host_network=False,
update_source_code=True,
browsergym_eval_env=env_id,
od_runtime_startup_env_vars={
'BASE_URL': base_url,
'OPENAI_API_KEY': openai_api_key,
'SHOPPING': f'{base_url}:7770/',
'SHOPPING_ADMIN': f'{base_url}:7780/admin',
'REDDIT': f'{base_url}:9999',
'GITLAB': f'{base_url}:8023',
'WIKIPEDIA': f'{base_url}:8888/wikipedia_en_all_maxi_2022-05/A/User:The_other_Kiwix_guy/Landing',
'MAP': f'{base_url}:3000',
'HOMEPAGE': f'{base_url}:4399',
},
),
# do not mount workspace
workspace_base=None,
workspace_mount_path=None,
)
config.set_llm_config(metadata.llm_config)
return config
def get_sandbox():
global docker_ssh_box
if docker_ssh_box is None:
docker_ssh_box = DockerSSHBox(
config=config.sandbox,
persist_sandbox=False,
workspace_mount_path=config.workspace_mount_path,
sandbox_workspace_dir=config.workspace_mount_path_in_sandbox,
cache_dir=config.cache_dir,
run_as_devin=config.run_as_devin,
)
return docker_ssh_box
async def initialize_runtime(
runtime: Runtime,
) -> str:
"""Initialize the runtime for the agent.
This function is called before the runtime is used to run the agent.
"""
logger.info(f"{'-' * 50} BEGIN Runtime Initialization Fn {'-' * 50}")
obs: CmdOutputObservation
# Create the workspace directory
action = CmdRunAction(command='mkdir -p /workspace')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
assert obs.exit_code == 0
action = BrowseInteractiveAction(browser_actions=BROWSER_EVAL_GET_GOAL_ACTION)
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
goal = obs.content
logger.info(f"{'-' * 50} END Runtime Initialization Fn {'-' * 50}")
return goal
def process_instance(
async def complete_runtime(
runtime: Runtime,
) -> dict[str, Any]:
"""Complete the runtime for the agent.
This function is called after the agent has finished running.
If you need to do something in the sandbox to get the correctness metric after
the agent has run, modify this function.
"""
logger.info(f"{'-' * 50} BEGIN Runtime Completion Fn {'-' * 50}")
obs: CmdOutputObservation
action = BrowseInteractiveAction(browser_actions=BROWSER_EVAL_GET_REWARDS_ACTION)
logger.info(action, extra={'msg_type': 'ACTION'})
obs = await runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
logger.info(f"{'-' * 50} END Runtime Completion Fn {'-' * 50}")
return {
'rewards': json.loads(obs.content),
}
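The docstring above invites benchmark authors to extend this step; below is a hedged sketch of such a variant. The extra `cat` of `/workspace/results.json` is purely hypothetical, while `CmdRunAction`, `BrowseInteractiveAction`, and `runtime.run_action` are the same calls used elsewhere in this file:

```python
# Hypothetical variant of complete_runtime (sketch): collect an extra artifact
# from the sandbox before reading the BrowserGym rewards.
async def complete_runtime_with_artifacts(runtime: Runtime) -> dict[str, Any]:
    action = CmdRunAction(command='cat /workspace/results.json || true')  # hypothetical file
    logger.info(action, extra={'msg_type': 'ACTION'})
    artifact_obs = await runtime.run_action(action)

    action = BrowseInteractiveAction(browser_actions=BROWSER_EVAL_GET_REWARDS_ACTION)
    reward_obs = await runtime.run_action(action)
    return {
        'rewards': json.loads(reward_obs.content),
        'artifacts': artifact_obs.content,
    }
```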
async def process_instance(
instance: pd.Series,
metadata: EvalMetadata,
reset_logger: bool = True,
):
# Create the agent
agent = Agent.get_cls(metadata.agent_class)(llm=LLM(config=metadata.llm_config))
env_id = instance.id
env_id = instance.instance_id
config = get_config(metadata, env_id)
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
if reset_logger:
# Set up logger
log_file = os.path.join(
metadata.eval_output_dir, 'logs', f'instance_{env_id}.log'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
# add back the console handler to print ONE line
logger.addHandler(get_console_handler())
logger.info(
f'Starting evaluation for instance {env_id}.\nHint: run "tail -f {log_file}" to see live logs in a separate shell'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
file_handler = logging.FileHandler(log_file)
file_handler.setFormatter(
logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
)
logger.addHandler(file_handler)
log_dir = os.path.join(metadata.eval_output_dir, 'infer_logs')
reset_logger_for_multiprocessing(logger, env_id, log_dir)
else:
logger.info(f'Starting evaluation for instance {env_id}.')
# Here's how you can run the agent (similar to the `main` function) and get the final task state
runtime_tools_config = {
RuntimeTool.BROWSER: {
'browsergym_eval': env_id,
'browsergym_eval_save_dir': metadata.eval_output_dir,
}
}
runtime = await create_runtime(config, sid=env_id)
task_str = await initialize_runtime(runtime)
config.max_iterations = metadata.max_iterations
state: State | None = asyncio.run(
run_controller(
config=config,
task_str='PLACEHOLDER_GOAL',
runtime_tools_config=runtime_tools_config,
agent=agent,
sandbox=get_sandbox(),
sid=env_id,
)
state: State | None = await run_controller(
config=config,
task_str=task_str,
runtime=runtime,
)
# ======= Attempt to evaluate the agent's environment impact =======
@ -107,18 +161,17 @@ def process_instance(
raise ValueError('State should not be None.')
metrics = state.metrics.get() if state.metrics else None
browsergym_eval_dir = os.path.join(metadata.eval_output_dir, env_id.split('/')[1])
# read goal
with open(
os.path.join(browsergym_eval_dir, 'goal.txt'), 'r', encoding='utf-8'
) as f:
instruction = f.read()
# read reward
with open(
os.path.join(browsergym_eval_dir, 'rewards.json'), 'r', encoding='utf-8'
) as f:
rewards = json.load(f)
reward = max(rewards)
# Instruction is the first message from the USER
instruction = ''
for event in state.history.get_events():
if isinstance(event, MessageAction):
instruction = event.content
break
return_val = await complete_runtime(runtime)
logger.info(f'Return value from complete_runtime: {return_val}')
reward = max(return_val['rewards'])
# history is now available as a stream of events, rather than list of pairs of (Action, Observation)
# for compatibility with the existing output format, we can remake the pairs here
@ -126,39 +179,38 @@ def process_instance(
histories = state.history.compatibility_for_eval_history_pairs()
# Save the output
output = {
'instance_id': env_id,
'instruction': instruction,
'metadata': metadata.model_dump(),
'history': histories,
'metrics': metrics,
'error': state.last_error if state and state.last_error else None,
'test_result': reward,
}
output = EvalOutput(
instance_id=env_id,
instruction=instruction,
metadata=metadata,
history=histories,
metrics=metrics,
error=state.last_error if state and state.last_error else None,
test_result={
'reward': reward,
},
)
return output
if __name__ == '__main__':
args = parse_arguments()
env_ids = [
id for id in gym.envs.registry.keys() if id.startswith('browsergym/webarena')
]
dataset = pd.DataFrame(
{
'id': [
'instance_id': [
id
for id in gym.envs.registry.keys()
if id.startswith('browsergym/miniwob')
if id.startswith('browsergym/webarena')
]
}
)
id_column = 'id'
llm_config = get_llm_config_arg(args.llm_config) if args.llm_config else config.llm
logger.info(f'Config for evaluation: {config}')
llm_config = None
if args.llm_config:
llm_config = get_llm_config_arg(args.llm_config)
if llm_config is None:
raise ValueError(f'Could not find LLM config: --llm_config {args.llm_config}')
metadata = make_metadata(
llm_config,
@ -169,13 +221,14 @@ if __name__ == '__main__':
args.eval_output_dir,
)
output_file = os.path.join(metadata.eval_output_dir, 'output.jsonl')
instances = prepare_dataset(dataset, output_file, args.eval_n_limit, id_column)
_ = get_sandbox() # Initialize the sandbox
run_evaluation(
instances,
metadata,
output_file,
args.eval_num_workers,
process_instance,
id_column,
instances = prepare_dataset(dataset, output_file, args.eval_n_limit)
asyncio.run(
run_evaluation(
instances,
metadata,
output_file,
args.eval_num_workers,
process_instance,
)
)

View File

@ -2,6 +2,7 @@ import base64
import pickle
from dataclasses import dataclass, field
from enum import Enum
from typing import Any
from opendevin.controller.state.task import RootTask
from opendevin.core.logger import opendevin_logger as logger
@ -106,6 +107,9 @@ class State:
start_id: int = -1
end_id: int = -1
almost_stuck: int = 0
# NOTE: This will never be used by the controller, but it can be used by different
# evaluation tasks to store extra data needed to track the progress/state of the task.
extra_data: dict[str, Any] = field(default_factory=dict)
def save_to_session(self, sid: str, file_store: FileStore):
pickled = pickle.dumps(self)
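A short, hedged sketch of how an evaluation task might use the new `extra_data` field; the key names are illustrative:

```python
# Sketch: evaluation code stashing progress in State.extra_data.
# The controller never reads this field, so keys are free-form.
def record_eval_progress(state: State, attempt: int, partial_score: float) -> None:
    state.extra_data['attempt'] = attempt
    state.extra_data['partial_score'] = partial_score
```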

View File

@ -159,9 +159,12 @@ class SandboxConfig(metaclass=Singleton):
It can contain any valid shell commands (e.g., pip install numpy).
The path to the interpreter is available as $OD_INTERPRETER_PATH,
which can be used to install dependencies for the OD-specific Python interpreter.
od_runtime_startup_env_vars: The environment variables to set at the launch of the runtime.
This is a dictionary of key-value pairs.
This is useful for setting environment variables that the runtime needs at startup,
for example the base URL of the website used for browsergym evaluation.
browsergym_eval_env: The BrowserGym environment to use for evaluation.
Default is None for general purpose browsing. Check evaluation/miniwob and evaluation/webarena for examples.
"""
box_type: str = 'ssh'
@ -179,6 +182,7 @@ class SandboxConfig(metaclass=Singleton):
initialize_plugins: bool = True
update_source_code: bool = False
od_runtime_extra_deps: str | None = None
od_runtime_startup_env_vars: dict[str, str] = field(default_factory=dict)
browsergym_eval_env: str | None = None
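As a hedged illustration of these two options, an evaluation script could configure them roughly as the webarena harness above does; the URLs and task id here are placeholders:

```python
# Sketch: SandboxConfig options used by browser-based evaluations.
sandbox = SandboxConfig(
    container_image='ubuntu:22.04',
    browsergym_eval_env='browsergym/webarena.0',  # illustrative task id
    od_runtime_startup_env_vars={
        'BASE_URL': 'http://example-webarena-host',  # placeholder
        'SHOPPING': 'http://example-webarena-host:7770/',
    },
)
```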
def defaults_to_dict(self) -> dict:
@ -243,10 +247,11 @@ class AppConfig(metaclass=Singleton):
runtime: str = 'server'
file_store: str = 'memory'
file_store_path: str = '/tmp/file_store'
# TODO: clean up workspace path after the removal of ServerRuntime
workspace_base: str = os.path.join(os.getcwd(), 'workspace')
workspace_mount_path: str = (
workspace_mount_path: str | None = (
UndefinedString.UNDEFINED # this path should always be set when config is fully loaded
)
) # when set to None, do not mount the workspace
workspace_mount_path_in_sandbox: str = '/workspace'
workspace_mount_rewrite: str | None = None
cache_dir: str = '/tmp/cache'
@ -550,7 +555,7 @@ def finalize_config(cfg: AppConfig):
cfg.workspace_base = os.path.abspath(cfg.workspace_base)
# In local there is no sandbox, the workspace will have the same pwd as the host
if cfg.sandbox.box_type == 'local':
if cfg.sandbox.box_type == 'local' and cfg.workspace_mount_path is not None:
cfg.workspace_mount_path_in_sandbox = cfg.workspace_mount_path
if cfg.workspace_mount_rewrite: # and not config.workspace_mount_path:

View File

@ -1,6 +1,7 @@
import asyncio
import os
import sys
import uuid
from typing import Callable, Type
import agenthub # noqa F401 (we import this to get the agents registered)
@ -21,7 +22,7 @@ from opendevin.events.event import Event
from opendevin.events.observation import AgentStateChangedObservation
from opendevin.llm.llm import LLM
from opendevin.runtime import get_runtime_cls
from opendevin.runtime.sandbox import Sandbox
from opendevin.runtime.runtime import Runtime
from opendevin.runtime.server.runtime import ServerRuntime
from opendevin.storage import get_file_store
@ -37,86 +38,39 @@ def read_task_from_stdin() -> str:
return sys.stdin.read()
async def run_controller(
async def create_runtime(
config: AppConfig,
task_str: str,
exit_on_message: bool = False,
fake_user_response_fn: Callable[[State | None], str] | None = None,
sandbox: Sandbox | None = None,
agent: Agent | None = None,
runtime_tools_config: dict | None = None,
sid: str | None = None,
headless_mode: bool = True,
) -> State | None:
"""Main coroutine to run the agent controller with task input flexibility.
It's only used when you launch opendevin backend directly via cmdline.
runtime_tools_config: dict | None = None,
) -> Runtime:
"""Create a runtime for the agent to run on.
Args:
config: The app config.
task_str: The task to run.
exit_on_message: quit if agent asks for a message from user (optional)
fake_user_response_fn: An optional function that receives the current state (could be None) and returns a fake user response.
sandbox: (will be deprecated) An optional sandbox to run the agent in.
agent: An optional agent to run.
runtime_tools_config: (will be deprecated) The runtime tools config.
sid: The session id.
headless_mode: Whether the agent is run in headless mode.
config: The app config.
sid: The session id.
runtime_tools_config: (will be deprecated) The runtime tools config.
"""
# Create the agent
if agent is None:
agent_cls: Type[Agent] = Agent.get_cls(config.default_agent)
agent = agent_cls(
llm=LLM(config=config.get_llm_config_from_agent(config.default_agent))
)
# Logging
logger.info(
f'Running agent {agent.name}, model {agent.llm.config.model}, with task: "{task_str}"'
)
# set up the event stream
file_store = get_file_store(config.file_store, config.file_store_path)
cli_session = 'main' + ('_' + sid if sid else '')
event_stream = EventStream(cli_session, file_store)
session_id = 'main' + ('_' + sid if sid else str(uuid.uuid4()))
event_stream = EventStream(session_id, file_store)
# restore cli session if enabled
initial_state = None
if config.enable_cli_session:
try:
logger.info('Restoring agent state from cli session')
initial_state = State.restore_from_session(cli_session, file_store)
except Exception as e:
print('Error restoring state', e)
# init controller with this initial state
controller = AgentController(
agent=agent,
max_iterations=config.max_iterations,
max_budget_per_task=config.max_budget_per_task,
agent_to_llm_config=config.get_agent_to_llm_config_map(),
event_stream=event_stream,
initial_state=initial_state,
headless_mode=headless_mode,
)
# agent class
agent_cls = agenthub.Agent.get_cls(config.default_agent)
# runtime and tools
runtime_cls = get_runtime_cls(config.runtime)
extra_kwargs = {}
if isinstance(runtime_cls, ServerRuntime):
extra_kwargs['sandbox'] = sandbox
# TODO: deprecate this and accept runtime as a parameter instead
logger.info(f'Initializing runtime: {runtime_cls}')
runtime = runtime_cls(
runtime: Runtime = runtime_cls(
config=config,
event_stream=event_stream,
plugins=controller.agent.sandbox_plugins,
**extra_kwargs,
sid=session_id,
plugins=agent_cls.sandbox_plugins,
)
await runtime.ainit()
if isinstance(runtime, ServerRuntime):
runtime.init_runtime_tools(
controller.agent.runtime_tools,
agent_cls.runtime_tools,
runtime_tools_config=runtime_tools_config,
)
# browser eval specific
@ -130,7 +84,68 @@ async def run_controller(
) as f:
task_str = f.read()
logger.info(f'Dynamic Eval task: {task_str}')
# TODO: Implement this for EventStream Runtime
return runtime
async def run_controller(
config: AppConfig,
task_str: str,
runtime: Runtime | None = None,
agent: Agent | None = None,
exit_on_message: bool = False,
fake_user_response_fn: Callable[[State | None], str] | None = None,
headless_mode: bool = True,
) -> State | None:
"""Main coroutine to run the agent controller with task input flexibility.
It's only used when you launch the opendevin backend directly from the command line.
Args:
config: The app config.
task_str: The task for the agent to run, as a string.
runtime: (optional) A runtime for the agent to run on.
agent: (optional) An agent to run.
exit_on_message: quit if the agent asks for a message from the user (optional)
fake_user_response_fn: An optional function that receives the current state (could be None) and returns a fake user response.
headless_mode: Whether the agent is run in headless mode.
"""
# Create the agent
if agent is None:
agent_cls: Type[Agent] = Agent.get_cls(config.default_agent)
agent = agent_cls(
llm=LLM(config=config.get_llm_config_from_agent(config.default_agent))
)
if runtime is None:
runtime = await create_runtime(config)
event_stream = runtime.event_stream
# restore cli session if enabled
initial_state = None
if config.enable_cli_session:
try:
logger.info('Restoring agent state from cli session')
initial_state = State.restore_from_session(
event_stream.sid, event_stream.file_store
)
except Exception as e:
logger.info(f'Error restoring state: {e}')
# init controller with this initial state
controller = AgentController(
agent=agent,
max_iterations=config.max_iterations,
max_budget_per_task=config.max_budget_per_task,
agent_to_llm_config=config.get_agent_to_llm_config_map(),
event_stream=event_stream,
initial_state=initial_state,
headless_mode=headless_mode,
)
assert isinstance(task_str, str), f'task_str must be a string, got {type(task_str)}'
# Logging
logger.info(
f'Agent Controller Initialized: Running agent {agent.name}, model {agent.llm.config.model}, with task: "{task_str}"'
)
# start event is a MessageAction with the task, either resumed or new
if config.enable_cli_session and initial_state is not None:
@ -170,12 +185,13 @@ async def run_controller(
# save session when we're about to close
if config.enable_cli_session:
end_state = controller.get_state()
end_state.save_to_session(cli_session, file_store)
end_state.save_to_session(event_stream.sid, event_stream.file_store)
# close when done
await controller.close()
await runtime.close()
return controller.get_state()
state = controller.get_state()
return state
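Taken together, the refactor means an evaluation script can wire these two entry points as in this hedged sketch; the task string and session id are placeholders:

```python
# Sketch: minimal headless flow using create_runtime + run_controller.
import asyncio

from opendevin.core.config import AppConfig
from opendevin.core.main import create_runtime, run_controller


async def run_one_task(config: AppConfig, task_str: str, sid: str):
    runtime = await create_runtime(config, sid=sid)
    # run_controller builds a default agent from config and closes the runtime when done.
    return await run_controller(config=config, task_str=task_str, runtime=runtime)


# e.g. state = asyncio.run(run_one_task(AppConfig(), 'write a hello-world script', 'demo'))
```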
if __name__ == '__main__':

View File

@ -22,16 +22,16 @@ class EventStreamSubscriber(str, Enum):
class EventStream:
sid: str
file_store: FileStore
# For each subscriber ID, there is a stack of callback functions - useful
# when there are agent delegates
_subscribers: dict[str, list[Callable]]
_cur_id: int
_lock: threading.Lock
_file_store: FileStore
def __init__(self, sid: str, file_store: FileStore):
self.sid = sid
self._file_store = file_store
self.file_store = file_store
self._subscribers = {}
self._cur_id = 0
self._lock = threading.Lock()
@ -39,7 +39,7 @@ class EventStream:
def _reinitialize_from_file_store(self) -> None:
try:
events = self._file_store.list(f'sessions/{self.sid}/events')
events = self.file_store.list(f'sessions/{self.sid}/events')
except FileNotFoundError:
logger.debug(f'No events found for session {self.sid}')
self._cur_id = 0
@ -100,7 +100,7 @@ class EventStream:
def get_event(self, id: int) -> Event:
filename = self._get_filename_for_id(id)
content = self._file_store.read(filename)
content = self.file_store.read(filename)
data = json.loads(content)
return event_from_dict(data)
@ -136,9 +136,7 @@ class EventStream:
event._source = source # type: ignore [attr-defined]
data = event_to_dict(event)
if event.id is not None:
self._file_store.write(
self._get_filename_for_id(event.id), json.dumps(data)
)
self.file_store.write(self._get_filename_for_id(event.id), json.dumps(data))
for stack in self._subscribers.values():
callback = stack[-1]
asyncio.create_task(callback(event))
@ -149,7 +147,7 @@ class EventStream:
yield event
def clear(self):
self._file_store.delete(f'sessions/{self.sid}')
self.file_store.delete(f'sessions/{self.sid}')
self._cur_id = 0
# self._subscribers = {}
self._reinitialize_from_file_store()

View File

@ -187,23 +187,37 @@ class RuntimeClient:
keep_prompt: bool = True,
) -> tuple[str, int]:
logger.debug(f'Executing command: {command}')
self.shell.sendline(command)
self.shell.expect(self.__bash_expect_regex, timeout=timeout)
try:
self.shell.sendline(command)
self.shell.expect(self.__bash_expect_regex, timeout=timeout)
output = self.shell.before
output = self.shell.before
bash_prompt = self._get_bash_prompt_and_update_pwd()
if keep_prompt:
output += '\r\n' + bash_prompt
logger.debug(f'Command output: {output}')
# Get exit code
self.shell.sendline('echo $?')
logger.debug(f'Executing command for exit code: {command}')
self.shell.expect(self.__bash_expect_regex, timeout=timeout)
_exit_code_output = self.shell.before
logger.debug(f'Exit code Output: {_exit_code_output}')
exit_code = int(_exit_code_output.strip().split()[0])
except pexpect.TIMEOUT as e:
self.shell.sendintr() # send SIGINT to the shell
self.shell.expect(self.__bash_expect_regex, timeout=timeout)
output = self.shell.before
output += (
'\r\n\r\n'
+ f'[Command timed out after {timeout} seconds. SIGINT was sent to interrupt it.]'
)
exit_code = 130 # SIGINT
logger.error(f'Failed to execute command: {command}. Error: {e}')
finally:
bash_prompt = self._get_bash_prompt_and_update_pwd()
if keep_prompt:
output += '\r\n' + bash_prompt
logger.debug(f'Command output: {output}')
# Get exit code
self.shell.sendline('echo $?')
logger.debug(f'Executing command for exit code: {command}')
self.shell.expect(self.__bash_expect_regex, timeout=timeout)
_exit_code_output = self.shell.before
logger.debug(f'Exit code Output: {_exit_code_output}')
exit_code = int(_exit_code_output.strip().split()[0])
return output, exit_code
async def run_action(self, action) -> Observation:

View File

@ -230,7 +230,7 @@ class EventStreamRuntime(Runtime):
async def copy_to(
self, host_src: str, sandbox_dest: str, recursive: bool = False
) -> dict[str, Any]:
) -> None:
if not os.path.exists(host_src):
raise FileNotFoundError(f'Source file {host_src} does not exist')
@ -264,7 +264,7 @@ class EventStreamRuntime(Runtime):
f'{self.api_url}/upload_file', data=upload_data, params=params
) as response:
if response.status == 200:
return await response.json()
return
else:
error_message = await response.text()
raise Exception(f'Copy operation failed: {error_message}')
@ -276,6 +276,7 @@ class EventStreamRuntime(Runtime):
finally:
if recursive:
os.unlink(temp_zip_path)
logger.info(f'Copy completed: host:{host_src} -> runtime:{sandbox_dest}')
async def run_action(self, action: Action) -> Observation:
# set timeout to default if not set

View File

@ -117,7 +117,7 @@ class DockerSSHBox(Sandbox):
self,
config: SandboxConfig,
persist_sandbox: bool,
workspace_mount_path: str,
workspace_mount_path: str | None,
sandbox_workspace_dir: str,
cache_dir: str,
run_as_devin: bool,
@ -554,9 +554,7 @@ class DockerSSHBox(Sandbox):
@property
def volumes(self):
mount_dir = self.workspace_mount_path
return {
mount_dir: {'bind': self.sandbox_workspace_dir, 'mode': 'rw'},
mount_volumes = {
# mount cache directory to /home/opendevin/.cache for pip cache reuse
self.cache_dir: {
'bind': (
@ -565,6 +563,12 @@ class DockerSSHBox(Sandbox):
'mode': 'rw',
},
}
if self.workspace_mount_path is not None:
mount_volumes[self.workspace_mount_path] = {
'bind': self.sandbox_workspace_dir,
'mode': 'rw',
}
return mount_volumes
def restart_docker_container(self):
try:

View File

@ -30,7 +30,6 @@ from opendevin.events.observation import (
from opendevin.events.serialization.action import ACTION_TYPE_TO_CLASS
from opendevin.runtime.plugins import JupyterRequirement, PluginRequirement
from opendevin.runtime.tools import RuntimeTool
from opendevin.storage import FileStore
def _default_env_vars(sandbox_config: SandboxConfig) -> dict[str, str]:
@ -52,7 +51,6 @@ class Runtime:
"""
sid: str
file_store: FileStore
DEFAULT_ENV_VARS: dict[str, str]
def __init__(

View File

@ -28,7 +28,6 @@ from opendevin.runtime.browser.browser_env import BrowserEnv
from opendevin.runtime.plugins import JupyterRequirement, PluginRequirement
from opendevin.runtime.runtime import Runtime
from opendevin.runtime.tools import RuntimeTool
from opendevin.storage.local import LocalFileStore
from ..browser import browse
from .files import read_file, write_file
@ -44,7 +43,6 @@ class ServerRuntime(Runtime):
sandbox: Sandbox | None = None,
):
super().__init__(config, event_stream, sid, plugins)
self.file_store = LocalFileStore(config.workspace_base)
if sandbox is None:
self.sandbox = self.create_sandbox(sid, config.sandbox.box_type)
self._is_external_sandbox = False
@ -187,7 +185,6 @@ class ServerRuntime(Runtime):
return IPythonRunCellObservation(content=output, code=action.code)
async def read(self, action: FileReadAction) -> Observation:
# TODO: use self.file_store
working_dir = self.sandbox.get_working_directory()
return await read_file(
action.path,
@ -199,7 +196,6 @@ class ServerRuntime(Runtime):
)
async def write(self, action: FileWriteAction) -> Observation:
# TODO: use self.file_store
working_dir = self.sandbox.get_working_directory()
return await write_file(
action.path,

View File

@ -14,10 +14,16 @@ FROM {{ base_image }}
# Install necessary packages and clean up in one layer
RUN apt-get update && \
apt-get install -y wget sudo apt-utils {{ LIBGL_MESA }} libasound2-plugins python3 git && \
apt-get clean \
&& ln -s /usr/bin/python3 /usr/bin/python \
&& rm -rf /var/lib/apt/lists/*
apt-get install -y wget sudo apt-utils {{ LIBGL_MESA }} libasound2-plugins git && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
# Install Python if not already installed
RUN if [ ! -e /usr/bin/python ]; then \
apt-get update && \
apt-get install -y python3 && \
ln -s /usr/bin/python3 /usr/bin/python; \
fi
# Create necessary directories
RUN mkdir -p /opendevin && \

poetry.lock
View File

@ -1883,6 +1883,36 @@ test-downstream = ["aiobotocore (>=2.5.4,<3.0.0)", "dask-expr", "dask[dataframe,
test-full = ["adlfs", "aiohttp (!=4.0.0a0,!=4.0.0a1)", "cloudpickle", "dask", "distributed", "dropbox", "dropboxdrivefs", "fastparquet", "fusepy", "gcsfs", "jinja2", "kerchunk", "libarchive-c", "lz4", "notebook", "numpy", "ocifs", "pandas", "panel", "paramiko", "pyarrow", "pyarrow (>=1)", "pyftpdlib", "pygit2", "pytest", "pytest-asyncio (!=0.22.0)", "pytest-benchmark", "pytest-cov", "pytest-mock", "pytest-recording", "pytest-rerunfailures", "python-snappy", "requests", "smbprotocol", "tqdm", "urllib3", "zarr", "zstandard"]
tqdm = ["tqdm"]
[[package]]
name = "func-timeout"
version = "4.3.5"
description = "Python module which allows you to specify timeouts when calling any existing function. Also provides support for stoppable-threads"
optional = false
python-versions = "*"
files = [
{file = "func_timeout-4.3.5.tar.gz", hash = "sha256:74cd3c428ec94f4edfba81f9b2f14904846d5ffccc27c92433b8b5939b5575dd"},
]
[[package]]
name = "gdown"
version = "5.2.0"
description = "Google Drive Public File/Folder Downloader"
optional = false
python-versions = ">=3.8"
files = [
{file = "gdown-5.2.0-py3-none-any.whl", hash = "sha256:33083832d82b1101bdd0e9df3edd0fbc0e1c5f14c9d8c38d2a35bf1683b526d6"},
{file = "gdown-5.2.0.tar.gz", hash = "sha256:2145165062d85520a3cd98b356c9ed522c5e7984d408535409fd46f94defc787"},
]
[package.dependencies]
beautifulsoup4 = "*"
filelock = "*"
requests = {version = "*", extras = ["socks"]}
tqdm = "*"
[package.extras]
test = ["build", "mypy", "pytest", "pytest-xdist", "ruff", "twine", "types-requests", "types-setuptools"]
[[package]]
name = "gevent"
version = "24.2.1"
@ -6202,6 +6232,18 @@ files = [
{file = "pyreadline3-3.4.1.tar.gz", hash = "sha256:6f3d1f7b8a31ba32b73917cefc1f28cc660562f39aea8646d30bd6eff21f7bae"},
]
[[package]]
name = "pysocks"
version = "1.7.1"
description = "A Python SOCKS client module. See https://github.com/Anorov/PySocks for more information."
optional = false
python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*"
files = [
{file = "PySocks-1.7.1-py27-none-any.whl", hash = "sha256:08e69f092cc6dbe92a0fdd16eeb9b9ffbc13cadfe5ca4c7bd92ffb078b293299"},
{file = "PySocks-1.7.1-py3-none-any.whl", hash = "sha256:2725bd0a9925919b9b51739eea5f9e2bae91e83288108a9ad338b2e3a4435ee5"},
{file = "PySocks-1.7.1.tar.gz", hash = "sha256:3f8804571ebe159c380ac6de37643bb4685970655d3bba243530d6558b799aa0"},
]
[[package]]
name = "pytest"
version = "8.3.2"
@ -6705,6 +6747,7 @@ files = [
certifi = ">=2017.4.17"
charset-normalizer = ">=2,<4"
idna = ">=2.5,<4"
PySocks = {version = ">=1.5.6,<1.5.7 || >1.5.7", optional = true, markers = "extra == \"socks\""}
urllib3 = ">=1.21.1,<3"
[package.extras]
@ -9106,4 +9149,4 @@ testing = ["coverage (>=5.0.3)", "zope.event", "zope.testing"]
[metadata]
lock-version = "2.0"
python-versions = "^3.11"
content-hash = "fa6657610c1f5cf14107f7c615e3e6c9d14367defbcb7582b2e72fe77bc939c9"
content-hash = "22b31a3f8d5b241552a798639668e43aabafff7f9611888f1d4239ea07d71a75"

View File

@ -118,3 +118,6 @@ whatthepatch = "*"
retry = "*"
evaluate = "*"
swebench = { git = "https://github.com/OpenDevin/SWE-bench.git" }
func_timeout = "*"
sympy = "*"
gdown = "*"

View File

@ -32,7 +32,7 @@ def test_stream_storage(temp_dir: str):
event_stream = EventStream('abc', file_store)
event_stream.add_event(NullObservation(''), EventSource.AGENT)
assert len(collect_events(event_stream)) == 1
content = event_stream._file_store.read('sessions/abc/events/0.json')
content = event_stream.file_store.read('sessions/abc/events/0.json')
assert content is not None
data = json.loads(content)
assert 'timestamp' in data