* move multi-line bash tests to test_runtime;
support multi-line bash for esruntime;
* add testcase to handle PS2 prompt
* use bashlex for bash parsing to handle multi-line commands;
add testcases for multi-line commands
* revert ghcr runtime change
* Apply stash
* fix run as other user;
make test async;
* fix test runtime for run as od
* add run-as-devin to all the runtime tests
* handle the case when username is root
* move all run-as-devin tests from sandbox;
only tests a few cases on different user to save time;
* move over multi-line echo related tests to test_runtime
* fix user-specific jupyter by fixing the pypoetry virtualenv folder
* make plugin's init async;
chdir at initialization of jupyter plugin;
move ipy simple testcase to test runtime;
* support agentskills import in
move tests for jupyter pwd tests;
overload `add_env_vars` for EventStreamRuntime to update env var also in Jupyter;
make agentskills read env var lazily, in case env var is updated;
* fix ServerRuntime agentskills issue
* move agnostic image test to test_runtime
* merge runtime tests in CI
* fix enable auto lint as env var
* update warning message
* update warning message
* test for different container images
* change parsing output as debug
* add exception handling for update_pwd_decorator
* fix unit test indentation
* add plugins as default input to Runtime class;
remove init_sandbox_plugins;
implement add_env_var (include jupyter) in the base class;
* fix server runtime auto lint
* Revert "add exception handling for update_pwd_decorator"
This reverts commit 2b668b1506e02145cb8f87e321aad62febca3d50.
* tries to print debugging info for agentskills
* explictly setting uid (try fix permission issue)
* Revert "tries to print debugging info for agentskills"
This reverts commit 8be4c86756f0e3fc62957b327ba2ac4999c419de.
* set sandbox user id during testing to hopefully fix the permission issue
* add browser tools for server runtime
* try to debug for old pwd
* update debug cmd
* only test agnostic runtime when TEST_RUNTIME is Server
* fix temp dir mkdir
* load TEST_RUNTIME at the beginning
* remove ipython tests
* only log to file when DEBUG
* default logging to project root
* temporarily remove log to file
* fix LLM logger dir
* fix logger
* make set pwd an optional aux action
* fix prev pwd
* fix infinity recursion
* simplify
* do not import the whole od library to avoid logger folder by jupyter
* fix browsing
* increase timeout
* attempt to fix agentskills yet again
* clean up in testcases, since CI maybe run as non-root
* add _cause attribute for event.id
* remove parent
* add a bunch of debugging statement again for CI :(
* fix temp_dir fixture
* change all temp dir to follow pytest's tmp_path_factory
* remove extra bracket
* clean up error printing a bit
* jupyter chdir to self.config.workspace_mount_path_in_sandbox on initialization
* jupyter chdir to self.config.workspace_mount_path_in_sandbox on initialization
* add typing for tmp dir fixture
* clear the directory before running the test to avoid weird CI temp dir
* remove agnostic test case for server runtime
* Revert "remove agnostic test case for server runtime"
This reverts commit 30e2181c3fc1410e69596c2dcd06be01f1d016b3.
* disable agnostic tests in CI
* fix test
* make sure plugin arg is not passed when no plugin is specified;
remove redundant on_event function;
* move mock prompt
* rename runtime
* remove extra logging
* refactor run_controller's interface;
support multiple runtime for integration test;
filter out hostname for prompt
* uncomment other tests
* pass the right runtime to controller
* log runtime when start
* uncomment tests
* improve symbol filters
* add intergration test prompts that seemd ok
* add integration test workflow
* add python3 to default ubuntu image
* symlink python and fix permission to jupyter pip
* add retry for jupyter execute server
* fix jupyter pip install;
add post-process for jupyter pip install;
simplify init by add agent_skills path to PYTHONPATH;
add testcase to tests jupyter pip install;
* fix bug
* use ubuntu:22.04 for eventstream integration tests
* add todo
* update testcase
* remove redundant code
* fix unit test
* reduce dependency for runtime
* try making llama-index an optional dependency that's not installed by default
* remove pip install since it seemd not needed
* log ipython execution;
await write message since it returns a future
* update ipy testcase
* do not install llama-index in CI
* do not install llama-index in the app docker as well
* set sandbox container image in the integration test script
* log plugins & env var for runtime
* update conftest for sha256
* add git
* remove all non-alphanumeric chalracters
* add working ipy module tests!
* default to use host network
* remove is_async from browser to make thing a little more reliable;
retry loading browser when error;
* add sleep to wait a bit for http server
* kill http server before regenerate browsing tests
* fix browsing
* only set sandbox container image if undefined
* skip empty config value
* update evaluation to use the latest run_controller
* revert logger in execute_server to be compatible with server runtime
* revert logging level to fix jupyter
* set logger level
* revert the logging
* chmod for workspace to fix permission
* support getting timeout from action
* update test for server runtime
* try to fix file permission
* fix test_cmd_run_action_serialization_deserialization test (added timeout)
* poetry: pip 24.2, torch 2.2.2
* revert adding pip to pyproject.toml
* add build to dependencies in pyproject.toml
* forgot poetry lock --no-update
* fix a DelegatorAgent prompt_002.log (timeout)
* fix a DelegatorAgent prompt_003.log (timeout)
* couple more timeout attribs in prompt files
* some more prompt files
* prompts galore
* add clarification comment for timeout
* default timeout to config
* add assert
* update integraton tests for eventstream
* update integration tests
* fix timeout for action<->dict
* remove redundant on_event
* default to use instance image
* update run_controller interface
* add logging for copy
* refactor swe_bench for the new design
* fix action execution timeout
* updatelock
* remove build sandbox locally
* fix runtime
* use plain for-loop for single process
* remove extra print
* get swebench inference working
* print whole `test_result` dict
* got swebench patch post-process working
* update swe-bench evaluation readme
* refactor using shared reset_logger function
* move messy swebench prompt to a different file
* support the ability to specify whether to keep prompt
* support the ability to specify whether to keep prompt
* fix dockerfile
* fix import and remove unnecessary strip logic
* fix action serialization
* get agentbench running
* remove extra ls for agent bench
* fix agentbench metric
* factor out common documentation for eval
* update biocoder doc
* remove swe_env_box since it is no longer needed
* get biocoder working
* add func timeout for bird
* fix jupyter pwd with ~ as user name
* fix jupyter pwd with ~ as user name
* get bird working
* get browsing evaluation working
* make eda runnable
* fix id column
* fix eda run_infer
* unify eval output using a structured format;
make swebench coompatible with that format;
update client source code for every swebench run;
do not inject testcmd for swebench
* standardize existing benchs for the new eval output
* set update source code = true
* get gaia standardized
* fix gaia
* gorilla refactored but stuck at language.so to test
* refactor and make gpqa work
* refactor humanevalfix and get it working
* refactor logic reasoning and get it working
* refactor browser env so it works with eventstream runtime for eval
* add initial version of miniwob refactor
* fix browsergym environment
* get miniwob working!!
* allowing injecting additional dependency to OD runtime docker image
* allowing injecting additional dependency to OD runtime docker image
* support logic reasoning with pre-injected dependency
* get mint working
* update runtime build
* fix mint docker
* add test for keep_prompt;
add missing await close for some tests
* update integration tests for eventstream runtime
* fix integration tests for server runtime
* refactor ml bench and toolqa
* refactor webarena
* fix default factory
* Update run_infer.py
* add APIError to retry
* increase timeout for swebench
* make sure to hide api key when dump eval output
* update the behavior of put source code to put files instead of tarball
* add dishash to dependency
* sendintr when timeout
* fix dockerfile copy
* reduce timeout
* use dirhash to avoid repeat building for update source
* fix runtime_build testcase
* add dir_hash to docker build pipeline
* revert api error
* update poetry lock
* add retries for swebench run infer
* fix git patch
* update poetry lock
* adjust config order
* fix mount volumns
* enforce all eval to use "instance_id"
* remove file store from runtime
* make file_store public inside eventstream
* move the runtime logic inside `main` out
* support using async function for process_instance_fn
* refactor run_infer with the create_time
* fix file store
* Update evaluation/toolqa/utils.py
Co-authored-by: Graham Neubig <neubig@gmail.com>
* fix typo
---------
Co-authored-by: tobitege <tobitege@gmx.de>
Co-authored-by: super-dainiu <78588128+super-dainiu@users.noreply.github.com>
Co-authored-by: Graham Neubig <neubig@gmail.com>
* fix: make max_budget_per_task optional in `run_agent_controller`
* update arg for each run infer
* fix: metrics logging carried along; reset llm metric with the agent;
---------
Co-authored-by: Graham Neubig <neubig@gmail.com>
* update and polish gptq eval
* fix typo
* Update evaluation/gpqa/README.md
Co-authored-by: Graham Neubig <neubig@gmail.com>
* Update evaluation/gpqa/run_infer.py
Co-authored-by: Graham Neubig <neubig@gmail.com>
* add headless mode to all appropriate agent controller call
* delegate set to error when in headless mode
* try to deduplicate a bit
* make headless_mode default to True and only change it to false for AgentSession
---------
Co-authored-by: Graham Neubig <neubig@gmail.com>
* Updated documentation using ruff's autofix feature
* Updated pyproject.toml to include docstring validations
* Updated documentation using ruff's autofix feature
* Updated pyproject.toml to include docstring validations
* Updated docstrings using ruff's autfix feature
* Deleted opendevin/runtime/utils/soource.py, Keeping in sync with main
---------
Co-authored-by: Graham Neubig <neubig@gmail.com>
Currently, OpenDevin uses a global singleton LLM config and a global singleton agent config. This PR allows customers to configure an LLM config for each agent. A hypothetically useful scenario is to use a cheaper LLM for repo exploration / code search, and a more powerful LLM to actually do the problem solving (CodeActAgent).
Partially solves #2075 (web GUI improvement is not the goal of this PR)
* add event to stream before budget check
* make the budget check before the step
* Update opendevin/controller/agent_controller.py
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
---------
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
1. Add support for rejection action on frontend
2. Show users the reason for rejection
3. Get rid of weird empty box after delegation
4. On web GUI, show customer when a delegation starts and ends
* Fix AgentRejectAction handling
* Add ManagerAgent to integration tests
* Fix regenerate.sh
* Fix merge
* Update README for micro-agents
* Add test reject to regenerate.sh
* regenerate.sh: Add support for running a specific test and/or agent
* Refine reject schema, and allow ManagerAgent to handle reject
* Add test artifacts for test_simple_task_rejection
* Fix manager agent tests
* Fix README
* test_simple_task_rejection: check final agent state
* Integration test: exit if mock prompt not found
* Update test_simple_task_rejection tests
* Fix test_edits test artifacts after prompt update
* Fix ManagerAgent test_edits
* WIP
* Fix tests
* update test_edits for ManagerAgent
* Skip local sandbox for reject test
* Fix test comparison
* feat: lazy launching browser; browser optional for diffrent agents.
* style: lint
* fix: integration test fail due to browser not started.
* fix: run by cli and integration test failed.
* fix: lint
* fix: lint
---------
Co-authored-by: Graham Neubig <neubig@gmail.com>
This PR fixes#1897. In addition, this PR fixes and tweaks a few micro-agents.
For the first time, I am able to use ManagerAgent to complete test_write_simple_script and test_edits tasks in integration tests, so this PR also adds ManagerAgent as part of integration tests. test_write_simple_script involves delegation to CoderAgent while test_edits involves delegation to TypoFixerAgent.
Also for the first time, I am able to use DelegateAgent to complete test_write_simple_script and test_edits tasks in integration tests, so this PR also adds DelegateAgent as part of integration tests. It involves delegation to StudyRepoForTaskAgent, CoderAgent and VerifierAgent.
This PR is a blocker for #1735 and likely #1945.
* feat: add max_budget_per_task configuration to control task cost
* Fix test_arg_parser.py
* Use the config.max_budget_per_task as default value
* Add max_budget_per_task to core/main.py as well
* Update opendevin/controller/agent_controller.py
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
---------
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
* properly log user messages;
format browser action/obs, summarize action, messages properly for logging
* add source to message
* add spaces for printing
* Refactor monologue to use the messages in state history
* add messages, clean up
* fix monologue
* update integration tests
* move private method
* update SWE agent to use the history from State
* integration tests for SWE agent
* rename monologue to initial_thoughts, since that is what it is
* Add MacOS to integration tests
* Switch back to python 3.11
* Install Docker for macos pipeline
* regenerate.sh: Use environmental variable for sandbox type
* Pack different agents' tests into a single check
* Fix CodeAct tests
* Reduce file match and extensive debug logs
* Add TEST_IN_CI mode that reports codecov
* Small fix: don't quit if reusing old responses failed
* Merge codecov results
* Fix typos
* Remove coverage merge step - codecov automatically does that
* Make mac integration tests as optional - too slow
* Fix codecov args
* Add comments in yaml
* Include sandbox type in codecov report name
* Fix codecov report merge
* Revert renaming of test_matrix_success
* Remove SWEAgent and PlannerAgent from tests
* Mark planner agent and SWE agent as deprecated
* CodeCov: Ignore planner and sweagent
* Revert "Remove SWEAgent and PlannerAgent from tests"
This reverts commit 040cb3bfb9496090d4516f7b45f38376f95de7be.
* Remove all tests for SWE Agent
* Only keep basic tests for MonologueAgent and PlannerAgent
* Mark SWE Agent as deprecated, and ignore code coverage for it
---------
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>