24 Commits

Author SHA1 Message Date
tofarr
152f99c64f
Chore Bump python version (#3545) 2024-10-03 13:40:55 -04:00
Xingyao Wang
090c911a50
(refactor) Make Runtime class synchronous (#3661)
* change runtime to be synchronous

* fix test runtime with the new interface

* fix arg

* fix eval

* fix missing config attribute

* fix plugins

* fix on_event by revert it back to async

* update upload_file endpoint

* fix argument to upload file

* remove unncessary async for eval;
fix evaluation run in parallel

* use asyncio to run controller for eval

* revert file upload

* truncate eval test result output
2024-08-30 01:37:03 +00:00
Graham Neubig
f9088766e8
Allow setting of runtime container image (#3573)
* Add runtime container image setting

* Fix typo in test

* Fix sandbox base container image

* Update variables

* Update to base_container_image

* Update tests/unit/test_config.py

Co-authored-by: Xingyao Wang <xingyao6@illinois.edu>

* Fixed eval

* Fixed container_image

* Fix typo

---------

Co-authored-by: Xingyao Wang <xingyao6@illinois.edu>
2024-08-25 23:05:41 +00:00
Robert Brennan
01ae22ef57
Rename OpenDevin to OpenHands (#3472)
* Replace OpenDevin with OpenHands

* Update CONTRIBUTING.md

* Update README.md

* Update README.md

* update poetry lock; move opendevin folder to openhands

* fix env var

* revert image references in docs

* revert permissions

* revert permissions

---------

Co-authored-by: Xingyao Wang <xingyao6@illinois.edu>
2024-08-20 00:44:54 +08:00
Xingyao Wang
bdf6df12c3
fix: pip not available in runtime (#3306)
* try to fix pip unavailable

* update test case for pip

* force rebuild in CI

* remove extra symlink

* fix newline

* added semi-colon to line 31

* Dockerfile.j2: activate env at the end

* Revert "Dockerfile.j2: activate env at the end"

This reverts commit cf2f5651021fe80d4ab69a35a85f0a35b29dc3d7.

* cleanup Dockerfile

* switch default python image

* remove image agnostic (no longer used)

* fix tests

* switch to nikolaik/python-nodejs:python3.11-nodejs22

* fix test

* fix test

* revert docker

* update template

---------

Co-authored-by: tobitege <tobitege@gmx.de>
Co-authored-by: Graham Neubig <neubig@gmail.com>
2024-08-09 15:04:43 -04:00
Xingyao Wang
b30a2dd87a
completely remove update_source_code (#3280) 2024-08-07 16:57:11 +00:00
Xingyao Wang
31b244f95e
[Refactor, Evaluation] Refactor and clean up evaluation harness to remove global config and use EventStreamRuntime (#3230)
* move multi-line bash tests to test_runtime;
support multi-line bash for esruntime;

* add testcase to handle PS2 prompt

* use bashlex for bash parsing to handle multi-line commands;
add testcases for multi-line commands

* revert ghcr runtime change

* Apply stash

* fix run as other user;
make test async;

* fix test runtime for run as od

* add run-as-devin to all the runtime tests

* handle the case when username is root

* move all run-as-devin tests from sandbox;
only tests a few cases on different user to save time;

* move over multi-line echo related tests to test_runtime

* fix user-specific jupyter by fixing the pypoetry virtualenv folder

* make plugin's init async;
chdir at initialization of jupyter plugin;
move ipy simple testcase to test runtime;

* support agentskills import in
move tests for jupyter pwd tests;
overload `add_env_vars` for EventStreamRuntime to update env var also in Jupyter;
make agentskills read env var lazily, in case env var is updated;

* fix ServerRuntime agentskills issue

* move agnostic image test to test_runtime

* merge runtime tests in CI

* fix enable auto lint as env var

* update warning message

* update warning message

* test for different container images

* change parsing output as debug

* add exception handling for update_pwd_decorator

* fix unit test indentation

* add plugins as default input to Runtime class;
remove init_sandbox_plugins;
implement add_env_var (include jupyter) in the base class;

* fix server runtime auto lint

* Revert "add exception handling for update_pwd_decorator"

This reverts commit 2b668b1506e02145cb8f87e321aad62febca3d50.

* tries to print debugging info for agentskills

* explictly setting uid (try fix permission issue)

* Revert "tries to print debugging info for agentskills"

This reverts commit 8be4c86756f0e3fc62957b327ba2ac4999c419de.

* set sandbox user id during testing to hopefully fix the permission issue

* add browser tools for server runtime

* try to debug for old pwd

* update debug cmd

* only test agnostic runtime when TEST_RUNTIME is Server

* fix temp dir mkdir

* load TEST_RUNTIME at the beginning

* remove ipython tests

* only log to file when DEBUG

* default logging to project root

* temporarily remove log to file

* fix LLM logger dir

* fix logger

* make set pwd an optional aux action

* fix prev pwd

* fix infinity recursion

* simplify

* do not import the whole od library to avoid logger folder by jupyter

* fix browsing

* increase timeout

* attempt to fix agentskills yet again

* clean up in testcases, since CI maybe run as non-root

* add _cause attribute for event.id

* remove parent

* add a bunch of debugging statement again for CI :(

* fix temp_dir fixture

* change all temp dir to follow pytest's tmp_path_factory

* remove extra bracket

* clean up error printing a bit

* jupyter chdir to self.config.workspace_mount_path_in_sandbox on initialization

* jupyter chdir to self.config.workspace_mount_path_in_sandbox on initialization

* add typing for tmp dir fixture

* clear the directory before running the test to avoid weird CI temp dir

* remove agnostic test case for server runtime

* Revert "remove agnostic test case for server runtime"

This reverts commit 30e2181c3fc1410e69596c2dcd06be01f1d016b3.

* disable agnostic tests in CI

* fix test

* make sure plugin arg is not passed when no plugin is specified;
remove redundant on_event function;

* move mock prompt

* rename runtime

* remove extra logging

* refactor run_controller's interface;
support multiple runtime for integration test;
filter out hostname for prompt

* uncomment other tests

* pass the right runtime to controller

* log runtime when start

* uncomment tests

* improve symbol filters

* add intergration test prompts that seemd ok

* add integration test workflow

* add python3 to default ubuntu image

* symlink python and fix permission to jupyter pip

* add retry for jupyter execute server

* fix jupyter pip install;
add post-process for jupyter pip install;
simplify init by add agent_skills path to PYTHONPATH;
add testcase to tests jupyter pip install;

* fix bug

* use ubuntu:22.04 for eventstream integration tests

* add todo

* update testcase

* remove redundant code

* fix unit test

* reduce dependency for runtime

* try making llama-index an optional dependency that's not installed by default

* remove pip install since it seemd not needed

* log ipython execution;
await write message since it returns a future

* update ipy testcase

* do not install llama-index in CI

* do not install llama-index in the app docker as well

* set sandbox container image in the integration test script

* log plugins & env var for runtime

* update conftest for sha256

* add git

* remove all non-alphanumeric chalracters

* add working ipy module tests!

* default to use host network

* remove is_async from browser to make thing a little more reliable;
retry loading browser when error;

* add sleep to wait a bit for http server

* kill http server before regenerate browsing tests

* fix browsing

* only set sandbox container image if undefined

* skip empty config value

* update evaluation to use the latest run_controller

* revert logger in execute_server to be compatible with server runtime

* revert logging level to fix jupyter

* set logger level

* revert the logging

* chmod for workspace to fix permission

* support getting timeout from action

* update test for server runtime

* try to fix file permission

* fix test_cmd_run_action_serialization_deserialization test (added timeout)

* poetry: pip 24.2, torch 2.2.2

* revert adding pip to pyproject.toml

* add build to dependencies in pyproject.toml

* forgot poetry lock --no-update

* fix a DelegatorAgent prompt_002.log (timeout)

* fix a DelegatorAgent prompt_003.log (timeout)

* couple more timeout attribs in prompt files

* some more prompt files

* prompts galore

* add clarification comment for timeout

* default timeout to config

* add assert

* update integraton tests for eventstream

* update integration tests

* fix timeout for action<->dict

* remove redundant on_event

* default to use instance image

* update run_controller interface

* add logging for copy

* refactor swe_bench for the new design

* fix action execution timeout

* updatelock

* remove build sandbox locally

* fix runtime

* use plain for-loop for single process

* remove extra print

* get swebench inference working

* print whole `test_result` dict

* got swebench patch post-process working

* update swe-bench evaluation readme

* refactor using shared reset_logger function

* move messy swebench prompt to a different file

* support the ability to specify whether to keep prompt

* support the ability to specify whether to keep prompt

* fix dockerfile

* fix import and remove unnecessary strip logic

* fix action serialization

* get agentbench running

* remove extra ls for agent bench

* fix agentbench metric

* factor out common documentation for eval

* update biocoder doc

* remove swe_env_box since it is no longer needed

* get biocoder working

* add func timeout for bird

* fix jupyter pwd with ~ as user name

* fix jupyter pwd with ~ as user name

* get bird working

* get browsing evaluation working

* make eda runnable

* fix id column

* fix eda run_infer

* unify eval output using a structured format;
make swebench coompatible with that format;
update client source code for every swebench run;
do not inject testcmd for swebench

* standardize existing benchs for the new eval output

* set update source code = true

* get gaia standardized

* fix gaia

* gorilla refactored but stuck at language.so to test

* refactor and make gpqa work

* refactor humanevalfix and get it working

* refactor logic reasoning and get it working

* refactor browser env so it works with eventstream runtime for eval

* add initial version of miniwob refactor

* fix browsergym environment

* get miniwob working!!

* allowing injecting additional dependency to OD runtime docker image

* allowing injecting additional dependency to OD runtime docker image

* support logic reasoning with pre-injected dependency

* get mint working

* update runtime build

* fix mint docker

* add test for keep_prompt;
add missing await close for some tests

* update integration tests for eventstream runtime

* fix integration tests for server runtime

* refactor ml bench and toolqa

* refactor webarena

* fix default factory

* Update run_infer.py

* add APIError to retry

* increase timeout for swebench

* make sure to hide api key when dump eval output

* update the behavior of put source code to put files instead of tarball

* add dishash to dependency

* sendintr when timeout

* fix dockerfile copy

* reduce timeout

* use dirhash to avoid repeat building for update source

* fix runtime_build testcase

* add dir_hash to docker build pipeline

* revert api error

* update poetry lock

* add retries for swebench run infer

* fix git patch

* update poetry lock

* adjust config order

* fix mount volumns

* enforce all eval to use "instance_id"

* remove file store from runtime

* make file_store public inside eventstream

* move the runtime logic inside `main` out

* support using async function for process_instance_fn

* refactor run_infer with the create_time

* fix file store

* Update evaluation/toolqa/utils.py

Co-authored-by: Graham Neubig <neubig@gmail.com>

* fix typo

---------

Co-authored-by: tobitege <tobitege@gmx.de>
Co-authored-by: super-dainiu <78588128+super-dainiu@users.noreply.github.com>
Co-authored-by: Graham Neubig <neubig@gmail.com>
2024-08-06 17:21:45 +00:00
Xingyao Wang
001195a3ea
reduce the duplication in run_controller (#3217) 2024-08-02 10:12:34 +08:00
Xingyao Wang
4f0a454ed6
[Arch] Support integration tests using EventStream Runtime (#3184)
* Remove global config from memory

* Remove runtime global config

* Remove from storage

* Remove global config

* Fix event stream tests

* Fix sandbox issue

* Change config

* Removed transferred tests

* Add swe env box

* Fixes on testing

* Fixed some tests

* Merge with stashed changes

* Fix typing

* Fix ipython test

* Revive function

* Make temp_dir fixture

* Remove test to avoid circular import

* fix eventstream filestore for test_runtime

* fix parse arg issue that cause integration test to fail

* support swebench pull from custom namespace

* add back simple tests for runtime

* move multi-line bash tests to test_runtime;
support multi-line bash for esruntime;

* add testcase to handle PS2 prompt

* use bashlex for bash parsing to handle multi-line commands;
add testcases for multi-line commands

* revert ghcr runtime change

* Apply stash

* fix run as other user;
make test async;

* fix test runtime for run as od

* add run-as-devin to all the runtime tests

* handle the case when username is root

* move all run-as-devin tests from sandbox;
only tests a few cases on different user to save time;

* move over multi-line echo related tests to test_runtime

* fix user-specific jupyter by fixing the pypoetry virtualenv folder

* make plugin's init async;
chdir at initialization of jupyter plugin;
move ipy simple testcase to test runtime;

* support agentskills import in
move tests for jupyter pwd tests;
overload `add_env_vars` for EventStreamRuntime to update env var also in Jupyter;
make agentskills read env var lazily, in case env var is updated;

* fix ServerRuntime agentskills issue

* move agnostic image test to test_runtime

* merge runtime tests in CI

* fix enable auto lint as env var

* update warning message

* update warning message

* test for different container images

* change parsing output as debug

* add exception handling for update_pwd_decorator

* fix unit test indentation

* add plugins as default input to Runtime class;
remove init_sandbox_plugins;
implement add_env_var (include jupyter) in the base class;

* fix server runtime auto lint

* Revert "add exception handling for update_pwd_decorator"

This reverts commit 2b668b1506e02145cb8f87e321aad62febca3d50.

* tries to print debugging info for agentskills

* explictly setting uid (try fix permission issue)

* Revert "tries to print debugging info for agentskills"

This reverts commit 8be4c86756f0e3fc62957b327ba2ac4999c419de.

* set sandbox user id during testing to hopefully fix the permission issue

* add browser tools for server runtime

* try to debug for old pwd

* update debug cmd

* only test agnostic runtime when TEST_RUNTIME is Server

* fix temp dir mkdir

* load TEST_RUNTIME at the beginning

* remove ipython tests

* only log to file when DEBUG

* default logging to project root

* temporarily remove log to file

* fix LLM logger dir

* fix logger

* make set pwd an optional aux action

* fix prev pwd

* fix infinity recursion

* simplify

* do not import the whole od library to avoid logger folder by jupyter

* fix browsing

* increase timeout

* attempt to fix agentskills yet again

* clean up in testcases, since CI maybe run as non-root

* add _cause attribute for event.id

* remove parent

* add a bunch of debugging statement again for CI :(

* fix temp_dir fixture

* change all temp dir to follow pytest's tmp_path_factory

* remove extra bracket

* clean up error printing a bit

* jupyter chdir to self.config.workspace_mount_path_in_sandbox on initialization

* jupyter chdir to self.config.workspace_mount_path_in_sandbox on initialization

* add typing for tmp dir fixture

* clear the directory before running the test to avoid weird CI temp dir

* remove agnostic test case for server runtime

* Revert "remove agnostic test case for server runtime"

This reverts commit 30e2181c3fc1410e69596c2dcd06be01f1d016b3.

* disable agnostic tests in CI

* fix test

* make sure plugin arg is not passed when no plugin is specified;
remove redundant on_event function;

* move mock prompt

* rename runtime

* remove extra logging

* refactor run_controller's interface;
support multiple runtime for integration test;
filter out hostname for prompt

* uncomment other tests

* pass the right runtime to controller

* log runtime when start

* uncomment tests

* improve symbol filters

* add intergration test prompts that seemd ok

* add integration test workflow

* add python3 to default ubuntu image

* symlink python and fix permission to jupyter pip

* add retry for jupyter execute server

* fix jupyter pip install;
add post-process for jupyter pip install;
simplify init by add agent_skills path to PYTHONPATH;
add testcase to tests jupyter pip install;

* fix bug

* use ubuntu:22.04 for eventstream integration tests

* add todo

* update testcase

* remove redundant code

* fix unit test

* reduce dependency for runtime

* try making llama-index an optional dependency that's not installed by default

* remove pip install since it seemd not needed

* log ipython execution;
await write message since it returns a future

* update ipy testcase

* do not install llama-index in CI

* do not install llama-index in the app docker as well

* set sandbox container image in the integration test script

* log plugins & env var for runtime

* update conftest for sha256

* add git

* remove all non-alphanumeric chalracters

* add working ipy module tests!

* default to use host network

* remove is_async from browser to make thing a little more reliable;
retry loading browser when error;

* add sleep to wait a bit for http server

* kill http server before regenerate browsing tests

* fix browsing

* only set sandbox container image if undefined

* skip empty config value

* update evaluation to use the latest run_controller

* revert logger in execute_server to be compatible with server runtime

* revert logging level to fix jupyter

* set logger level

* revert the logging

* chmod for workspace to fix permission

* support getting timeout from action

* update test for server runtime

* try to fix file permission

* fix test_cmd_run_action_serialization_deserialization test (added timeout)

* poetry: pip 24.2, torch 2.2.2

* revert adding pip to pyproject.toml

* add build to dependencies in pyproject.toml

* forgot poetry lock --no-update

* fix a DelegatorAgent prompt_002.log (timeout)

* fix a DelegatorAgent prompt_003.log (timeout)

* couple more timeout attribs in prompt files

* some more prompt files

* prompts galore

* add clarification comment for timeout

* default timeout to config

* add assert

* update integraton tests for eventstream

* update integration tests

* fix timeout for action<->dict

* remove redundant on_event

* fix action execution timeout

* updatelock

---------

Co-authored-by: Graham Neubig <neubig@gmail.com>
Co-authored-by: tobitege <tobitege@gmx.de>
2024-08-01 22:07:39 +00:00
Graham Neubig
275ea706cf
Remove remaining global config (#3099)
* Remove global config from memory

* Remove runtime global config

* Remove from storage

* Remove global config

* Fix event stream tests

* Fix sandbox issue

* Change config

* Removed transferred tests

* Add swe env box

* Fixes on testing

* Fixed some tests

* Fix typing

* Fix ipython test

* Revive function

* Make temp_dir fixture

* Remove test to avoid circular import
2024-07-26 18:43:32 +00:00
Xingyao Wang
da17665cab
fix: make max_budget_per_task optional in run_agent_controller (#3071)
* fix: make max_budget_per_task optional in `run_agent_controller`

* update arg for each run infer
2024-07-22 21:47:00 -04:00
Graham Neubig
3a21198424
Remove monologue agent (#3036)
* Remove monologue agent

* Fixes
2024-07-19 19:25:05 +00:00
jigsawlabs-student
fa6c12473e
#2220, integrated aider style linting, currently passes related o… (#2489)
* WIP for integrate aider linter, see OpenDevin#2220

Updated aider linter to:
    * Always return text and line numbers
    * Moved extract line number more consistently
    * Changed pylint to stop after first linter detects errors
Updated agentskills
    * To get back a LintResult object and then use lines and text for error message and related line number
    * Moved code for extracting line number to aider linter
Tests:
* Added additional unit tests for aider to test for
* Return values from lint failures
* Confirm linter works for non-configured languages like Ruby

* move to agent_skills, fixes not seeing skills error

* format/lint to new code, fix failing tests, remove unused code from aider linter

* small changes (remove litellm, fix readme typo)

* fix failing sandbox test

* keep, change dumping of metadata

* WIP for integrate aider linter, see OpenDevin#2220

Updated aider linter to:
    * Always return text and line numbers
    * Moved extract line number more consistently
    * Changed pylint to stop after first linter detects errors
Updated agentskills
    * To get back a LintResult object and then use lines and text for error message and related line number
    * Moved code for extracting line number to aider linter
Tests:
* Added additional unit tests for aider to test for
* Return values from lint failures
* Confirm linter works for non-configured languages like Ruby

* move to agent_skills, fixes not seeing skills error

* format/lint to new code, fix failing tests, remove unused code from aider linter

* remove duplication of tree-sitter, grep-ast and update poetry.lock

* revert to main branch poetry.lock version

* only update necessary package

* fix jupyter kernel wrong interpreter issue (only for swebench)

* fix failing lint tests

* update syntax error checks for flake

* update poetry lock file

* update poetry.lock file, which update content-hash

* add grep ast

* remove extra stuff caused by merge

* update pyproject

* remove extra pytest fixture, ruff styling fixes

* lint files

* update poetry.lock file

---------

Co-authored-by: Jeff Katzy <jeffreyerickatz@gmail.com>
Co-authored-by: yufansong <yufan@risingwave-labs.com>
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu>
Co-authored-by: tobitege <tobitege@gmx.de>
2024-07-19 21:58:54 +08:00
Xingyao Wang
cf910dfa9d
fix eval api_key leak in metadata; fix llm config in run infer (#2998) 2024-07-18 15:46:59 +00:00
Engel Nyst
2df1d67007
History clean up (#2849)
* clean up add_history

* refactor last agent message
2024-07-08 05:10:21 +02:00
Engel Nyst
d37b2973b2
Refactoring: event stream based agent history (#2709)
* add to event stream sync

* remove async from tests

* small logging spam fix

* remove swe agent

* arch refactoring: use history from the event stream

* refactor agents

* monologue agent

* ruff

* planner agent

* micro-agents

* refactor history in evaluations

* evals history refactoring

* adapt evals and tests

* unit testing stuck

* testing micro agents, event stream

* fix planner agent

* fix tests

* fix stuck after rename

* fix test

* small clean up

* fix merge

* fix merge issue

* fix integration tests

* Update agenthub/dummy_agent/agent.py

* fix tests

* rename more clearly; add todo; clean up
2024-07-07 21:04:23 +00:00
Graham Neubig
d0384cafdd
Two fixes to swe bench eval (#2831)
* Two fixes to swe bench eval

* Add error message

* Change dumping of metadata
2024-07-07 07:21:50 +00:00
Xingyao Wang
f6dc89b41a
[Evaluation] Simplify eval & and multi-processing related fixes (#2810)
* initialize agent inside process_instance_fn;

* remove dependency on `config.max_iterations`

* switch back to only include llm config to metadata
2024-07-06 07:18:46 +08:00
Graham Neubig
a081935fd8
Simplify eval code (#2775)
* Start simplifying eval code

* Update

* Add EDA

* Updated GAIA

* Update gpqa

* Add humanevalfix

* Fix logic_reasoning

* Add miniwob

* Add mint and ml_bench

* toolqa

* Added swe-bench

* Fixed webarena

* Refactor parameters
2024-07-05 19:33:08 +09:00
Graham Neubig
ffd3c7144c
Remove global args (#2760)
* Remove global args

* Remove global args

* Update files

* Update main

* Bug fixes

* Fix logging
2024-07-03 20:07:52 +09:00
Engel Nyst
2d9bb56763
Add ability to restore the cli session (optional) (#2699)
* add ability to restore the main session

* add quick log

* rename to cli session
2024-06-30 06:56:55 +00:00
Engel Nyst
874b4c9075
CLI concurrency (#2695)
* add session id in cli, evals

* fix main sid
2024-06-30 04:04:30 +02:00
RainRat
745ae42a72
fix typos (#2352) 2024-06-09 12:57:58 -07:00
yueqis
82d4d25b09
feat: support ToolQA benchmark (#2263)
* Add files via upload

* Update README.md

* Update run_infer.py

* Update utils.py

* make lint

* Update evaluation/toolqa/run_infer.py

---------

Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: yufansong <yufan@risingwave-labs.com>
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
2024-06-08 07:54:01 -04:00