* update swe_bench prompt;
use minimal prompt for codeact;
* upgrade agentskills and update testcases
* update infer prompt
* fix cwd
* add icl for swebench
* also log in_context_example to run infer
* remove extra print
* change prompt to abs path
* update error message to include current file info
* change cwd for jupyter if needed
* update edit error message
* update prompt
* improve git get patch
* update hint string
* default to 50 turns
* revert changes from codeact agent and create new CodeActSWEAgent
* revert changes to codeact
* revert instructions for run infer
* revert instructions for run infer
* update README
* update max iter
* add codeact swe agent
* fix issue for CodeActSWEAgent
* allow specifying max iter in cmdline script
* stop printing
* Update agenthub/codeact_swe_agent/README.md
Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com>
* Fix prompt regression in jupyter plugin
---------
Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com>
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
* setup boilerplate and README
* setup test script and load dataset
* add temporary integration that works
* refactor code
* add solution evaluation through 'fake_user_response_fn'
* finish integrating MATH subset
* Update evaluation/mint/run_infer.py
* Update evaluation/mint/run_infer.sh
* Update opendevin/core/main.py
* remove redundant templates, add eval_note, update README
* use <execute_ipython> tag instead of <execute>
* hardcode AGENT option for run_infer.sh
* Update evaluation/mint/task.py
Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com>
* fix: no message returned when the task succeeds
* change message to make the agent exit
* import bash abstractmethod
* install all required packages inside sandbox before the agent runs, adjust prompt
* add subset eval folder separation and test for gsm8k
* fix bug in Reasoning task result check, add requirements.txt
* Fix syntax error in evaluation/mint/run_infer.py
* update README, add default values for `SUBSET` and `EVAL_LIMIT`
---------
Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com>
Co-authored-by: yufansong <yufan@risingwave-labs.com>
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
* reset workspace base properly
* support running without hint
* support running without hint
* bump swe-bench eval docker to v1.2 for latest agentskills
* only give hint when use hint text is true
* add swe-agent instructions for validation
* update dockerfile
* pin the python interpreter for execute_cli
* avoid initialize plugins twice
* default to use hint
* save results to swe_bench_lite
* unset gh token and increase max iter to 50
* remove printing of use hint status
* refactor ssh login into one function
* ok drop to 30 turns bc it is so expensive :(
* remove reproduce comments to avoid stuck
* add draft evaluation code for EDA, using chatgpt as the temporary agent for now
* Update README.md
* Delete frontend/package.json
* reverse the irrelevant changes
* reverse package.json
* use chatgpt as the codeactagent
* integrate with opendevin
* Update evaluation/EDA/README.md
* Update evaluation/EDA/README.md
* Use poetry to manage packages
* integrate with opendevin
* minor update
* minor update
* update poetry
* update README
* clean-up infer scripts
* add run_infer script and improve readme
* log final success and final message & ground truth
---------
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu>
Co-authored-by: yufansong <yufan@risingwave-labs.com>
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
* add draft for skills
* Implement and test agentskills functions: open_file, goto_line, scroll_down, scroll_up, create_file, search_dir, search_file, find_file
* Remove new_sample.txt file
* add some work from opendevin w/ fixes
* Add unit tests for agentskills module
* fix some issues and updated tests
* add more tests for open
* tweak and handle goto_line
* add tests for some edge cases
* add tests for scrolling
* add tests for edit
* add tests for search_dir
* update tests to use pytest
* use pytest --forked to keep file-op unit tests from interfering with each other via a global var
* update doc based on swe agent tool
* update and add tests for find_file and search_file
* move agent_skills to plugins
* add agentskills as plugin and docs
* add agentskill to ssh box and fix sandbox integration
* remove extra returns in doc
* add agentskills to initial tool for jupyter
* support re-init jupyter kernel (for agentskills) after restart
* fix print window's issue with indentation and add testcases
* add prompt for codeact with the newest edit primitives
* modify the way line number is presented (remove leading space)
* change prompt to the newest display format
* support tracking of costs via metrics
* Update opendevin/runtime/plugins/agent_skills/README.md
* Update opendevin/runtime/plugins/agent_skills/README.md
* implement and add tests for py linting
* remove extra text arg for incompatible subprocess ver
* remove sample.txt
* update test_edits integration tests
* fix all integration
* Update opendevin/runtime/plugins/agent_skills/README.md
* Update opendevin/runtime/plugins/agent_skills/README.md
* Update opendevin/runtime/plugins/agent_skills/README.md
* Update agenthub/codeact_agent/prompt.py
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
* Update agenthub/codeact_agent/prompt.py
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
* Update agenthub/codeact_agent/prompt.py
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
* Update opendevin/runtime/plugins/agent_skills/agentskills.py
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
* correctly setup plugins for swebench eval
* bump swe-bench version and add logging
* correctly setup plugins for swebench eval
* bump swe-bench version and add logging
* Revert "correctly setup plugins for swebench eval"
This reverts commit 2bd10556739e2af602ea85371b976390f7c48077.
* bump version
* remove _AGENT_SKILLS_DOCS
* move flake8 to test dep
* update poetry.lock
* remove extra arg
* reduce max iter for eval
* update poetry
* fix integration tests
---------
Co-authored-by: OpenDevin <opendevin@opendevin.ai>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
* correctly setup plugins for swebench eval
* bump swe-bench version and add logging
* Revert "correctly setup plugins for swebench eval"
This reverts commit 2bd10556739e2af602ea85371b976390f7c48077.
* bump version
* Preliminary HumanEvalFix integration
* Clean paths
* fix: set workspace path correctly for config
fix: task in that contains /
* add missing run_infer.sh
* update run_infer w/o hard coded agent
* fix typo
* change `instance_id` to `task_id`
* add the warning and env var setting to run_infer.sh
* reset back workspace mount at the end of each instance
* 10 max iter is probably enough for humanevalfix
* Remove unneeded section
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu>
* Fix link
Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com>
* Use logger
Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com>
* Update run_infer.py
fix a bug:
ERROR:concurrent.futures:exception calling callback for <Future at 0x309cbc470 state=finished raised NameError>
concurrent.futures.process._RemoteTraceback:
* Update README.md
* Update README.md
* Update README.md
* Update README.md
added an example
* Update README.md
added: enable_auto_lint = true
* Update pyproject.toml
add: evaluate package
* Delete poetry.lock
update poetry.lock
* update poetry.lock
update poetry.lock
* Update README.md
* Update README.md
---------
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu>
Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com>
Co-authored-by: Robert <871607149@qq.com>
I was able to run a few benchmark instances from SWE-Bench by myself following the documentation - it was great! In general the experience was smooth, thanks to @xingyaoww, @libowen2121 and the team! I made a few small enhancements and fixes to further improve the developer experience.
Always use poetry run python (using python from poetry's virtual environment) over python or python3 in scripts to make sure the behavior is consistent.
Make AGENT configurable. One can use an argument to control which agent they would like to benchmark. To facilitate this, I removed hardcoded CodeActAgent from run_infer.sh, and also added VERSION attribute to all agents, as the benchmark needs to record the agent version.
Make EVAL_LIMIT configurable. One can use an argument to control how many instances they'd like to benchmark. Useful for debugging & development purposes.
Fix 'eval_output_dir' not defined error in run_infer.py.
Other enhancements to the README file and logs.
I also noticed that a lot of the code in run_infer.py could be shared with other benchmarks, but since we only have one benchmark now, I think we can avoid over-engineering. A refactor and code dedup would be useful once we have more benchmarks, though.
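A minimal sketch of how the configurable agent and eval limit described above could be wired up. The flag names `--agent-cls` and `--eval-n-limit` and both helper functions are illustrative assumptions, not necessarily what run_infer.py uses:

```python
import argparse


def parse_eval_args(argv=None):
    """Parse benchmark options (flag names are hypothetical, for illustration)."""
    parser = argparse.ArgumentParser(description="Run a benchmark inference pass")
    # Which agent to benchmark; previously this was hardcoded in the shell script.
    parser.add_argument("--agent-cls", default="CodeActAgent")
    # Optional cap on the number of instances, handy for debugging.
    parser.add_argument("--eval-n-limit", type=int, default=None)
    return parser.parse_args(argv)


def select_instances(instances, limit):
    """Return the first `limit` instances, or all of them when limit is None."""
    return instances if limit is None else instances[:limit]
```

With a limit of 2, only the first two instances would be evaluated; omitting the limit keeps the full set.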
Disable Python linting by default, and turn it on for SWE Bench.
It is turned off by default since this behavior is weird and somewhat annoying to end users.
It is turned on for SWE Bench because linting Python files gives the LLM a chance to fix indentation errors.
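A rough sketch of the opt-in behavior, assuming a boolean `enable_auto_lint` setting. The function name is hypothetical, and `ast.parse` stands in here for a real linter purely to keep the example self-contained:

```python
import ast


def lint_python_source(source, enable_auto_lint=False):
    """Return an error message when linting is enabled and the source does not parse.

    Returns None when linting is disabled (the default) or the source is valid.
    """
    if not enable_auto_lint:
        return None
    try:
        ast.parse(source)
    except SyntaxError as exc:  # IndentationError is a subclass of SyntaxError
        return f"line {exc.lineno}: {exc.msg}"
    return None
```

Returning the message to the agent after an edit is what gives it the chance to correct a badly indented block on its next turn.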
* Feat: add stream output to exec_run
* Using command timeout to control the exec_box's timeout.
* add bash -c to the source command for compatibility with sh.
Signed-off-by: ifuryst <ifuryst@gmail.com>
* Feat: add stream output to SSHBox execute
Signed-off-by: ifuryst <ifuryst@gmail.com>
* fix the failing test case.
Signed-off-by: ifuryst <ifuryst@gmail.com>
* fix the test case importing the method from the wrong path.
Signed-off-by: ifuryst <ifuryst@gmail.com>
---------
Signed-off-by: ifuryst <ifuryst@gmail.com>
* add draft dockerfile for build all
* add rsync for build
* add all-in-one docker
* update prepare scripts
* Update swe_env_box.py
* Add swe_entry.sh (buggy now)
* Parse the test command in swe_entry.sh
* Update README for instance eval in sandbox
* revert specialized config
* replace run_as_devin as an init arg
* set container & run_as_root via args
* update swe entry script
* update env
* remove mounting
* allow error after swe_entry
* update swe_env_box
* move file
* update gitignore
* get swe_env_box a working demo
* support faking user response & provide sandbox ahead of time;
also return state for controller
* tweak main to support adding controller kwargs
* add module
* initialize plugin for provided sandbox
* add pip cache to plugin & fix jupyter kernel waiting
* better print Observation output
* add run infer scripts
* update readme
* add utility for getting diff patch
* use get_diff_patch in infer
* update readme
* support cost tracking for codeact
* add swe agent edit hack
* disable color in git diff
* fix git diff cmd
* fix state return
* support limit eval
* increase timeout and export pip cache
* add eval limit config
* return state when hit turn limit
* save log to file; allow agent to give up
* run eval with max 50 turns
* add outputs to gitignore
* save swe_instance & instruction
* add uuid to swebench
* add streamlit dep
* fix save series
* fix the issue where session id might be duplicated
* allow setting temperature for llm (use 0 for eval)
* Get report from agent running log
* support evaluating task success right after inference.
* remove extra log
* comment out prompt for baseline
* add visualizer for eval
* use plaintext for instruction
* reduce timeout for all; only increase timeout for init
* reduce timeout for all; only increase timeout for init
* ignore sid for swe env
* close sandbox in each eval loop
* update visualizer instruction
* increase max chars
* add finish action to history too
* show test result in metrics
* add sidebars for visualizer
* also visualize swe_instance
* cleanup browser when agent controller finishes running
* do not mount workspace for swe-eval to avoid accidentally overwrite files
* Revert "do not mount workspace for swe-eval to avoid accidentally overwrite files"
This reverts commit 8ef77390543e562e6f0a5a9992418014d8b3010c.
* Revert "Revert "do not mount workspace for swe-eval to avoid accidentally overwrite files""
This reverts commit 016cfbb9f0475f32bacbad5822996b4eaff24a5e.
* run jupyter command via copy to, instead of cp to mount
* only print mixin output when failed
* change ssh box logging
* add visualizer for pass rate
* add instance id to sandbox name
* only remove container we created
* use opendevin logger in main
* support multi-processing infer
* add back metadata, support keyboard interrupt
* remove container with startswith
* make pbar behave correctly
* update instruction w/ multi-processing
* show resolved rate by repo
* rename tmp dir name
* attempt to fix race condition for copy to ssh_box
* fix script
* bump swe-bench-all version
* fix ipython with self-contained commands
* add jupyter demo to swe_env_box
* make resolved count two column
* increase height
* do not add glob to url params
* analyze obs length
* print instance id prior to removal handler
* add gold patch in visualizer
* fix interactive git by adding a git --no-pager as alias
* increase max_char to 10k to cover 98% of swe-bench obs cases
* allow parsing note
* prompt v2
* add iteration reminder
* adjust user response
* adjust order
* fix return eval
* fix typo
* add reminder before logging
* remove other resolve rate
* re adjust to new folder structure
* support adding eval note
* fix eval note path
* make sure first log of each instance is printed
* add eval note
* fix the display for visualizer
* tweak visualizer for better git patch reading
* exclude empty patch
* add retry mechanism for swe_env_box start
* fix ssh timeout issue
* add stat field for apply test patch success
* add visualization for fine-grained report
* attempt to support monologue agent by constraining it to single thread
* also log error msg when stopped
* save error as well
* override WORKSPACE_MOUNT_PATH and WORKSPACE_BASE for monologue to work in mp
* add retry mechanism for sshbox
* remove retry for swe env box
* try to handle loop state stopped
* Add get report scripts
* Add script to convert agent output to swe-bench format
* Merge fine grained report for visualizer
* Update eval readme
* Update README.md
* Add CodeAct gpt4-1106 output and eval logs on swe-bench-lite
* Update the script to get model report
* Update get_model_report.sh
* Update get_agent_report.sh
* Update report merge script
* Add agent output conversion script
* Update swe_lite_env_setup.sh
* Add example swe-bench output files
* Update eval readme
* Remove redundant scripts
* set iteration count down to false by default
* fix: Issue where CodeAct agent was trying to log cost on local llm and throwing Undefined Model exception out of litellm (#1666)
* fix: Issue where CodeAct agent was trying to log cost on local llm and throwing Undefined Model exception out of litellm
* Review Feedback
* Missing None Check
* Review feedback and improved error handling
---------
Co-authored-by: Robert Brennan <accounts@rbren.io>
* fix prepare_swe_util scripts
* update builder images
* update setup script
* remove swe-bench build workflow
* update lock
* remove experiments since they are moved to hf
* remove visualizer (since it is moved to hf repo)
* simplify jupyter execution via heredoc
* update ssh_box
* add initial docker readme
* add pkg-config as dependency
* add script for swe_bench all-in-one docker
* add rsync to builder
* rename var
* update commit
* update readme
* update lock
* support specifying a timeout for long running tasks
* fix path
* separate building of all deps and files
* support returning states at the end of controller
* remove return None
* support specifying a timeout for long running tasks
* add timeout for all existing sandbox impl
* fix swe_env_box for new codebase
* update llm config in config.py
* support passing a sandbox in
* remove force set
* update eval script
* fix issue of overriding final state
* change default eval output to hf demo
* change default eval output to hf demo
* fix config
* only close it when it is NOT external sandbox
* add scripts
* tweak config
* only put in history when state has history attr
* fix agent controller for the case of running out of interaction budget
* always assume state is not None
* remove print of final state
* catch all exception when cannot compute completion cost
* Update README.md
* save source into json
* fix path
* update docker path
* return the final state on close
* merge AgentState with State
* fix integration test
* merge AgentState with State
* fix integration test
* add ChangeAgentStateAction to history in attempt to fix integration
* add back set agent state
* update tests
* update tests
* move scripts for setup
* update script and readme for infer
* do not reset logger when n processes == 1
* update eval_infer scripts and readme
* simplify readme
* copy over dir after eval
* copy over dir after eval
* directly return get state
* update lock
* fix output saving of infer
* replace print with logger
* update eval_infer script
* add back the missing .close
* increase timeout
* copy all swe_bench_format file
* attempt to fix output parsing
* log git commit id as metadata
* fix eval script
* update lock
* update unit tests
* fix argparser unit test
* fix lock
* the deps are now lightweight enough to be included in make build
* add spaces for tests
* add eval outputs to gitignore
* remove git submodule
* readme
* tweak git email
* update upload instruction
* bump codeact version for eval
---------
Co-authored-by: Bowen Li <libowen.ne@gmail.com>
Co-authored-by: huybery <huybery@gmail.com>
Co-authored-by: Bart Shappee <bshappee@gmail.com>
Co-authored-by: Robert Brennan <accounts@rbren.io>
* mypy is invaluable
* fix config, add test
* Add new-style toml support
* add singleton, small doc fixes
* fix some cases of loading toml, clean up, try to make it clearer
* Add defaults_dict for UI
* allow config to be mutable
error handling
fix toml parsing
* remove debug stuff
* Adapt Makefile
* Add defaults for temperature and top_p
* update to CodeActAgent
* comments
* fix unit tests
* implement groups of llm settings (CLI)
* fix merge issue
* small fix sandboxes, small refactoring
* adapt LLM init to accept overrides at runtime
* reading config is enough
* Encapsulate minimally embeddings initialization
* agent bug fix; fix tests
* fix sandboxes tests
* refactor globals in sandboxes to properties
* style: moved argument parsing into a separate function
* commito
* Update evaluation/regression/conftest.py
---------
Co-authored-by: Robert Brennan <accounts@rbren.io>
* Remove all the unnecessary files
* Finalize the regression testing framework and add hello world test case
* Update requirements.txt
* Update the test function to execute the generate script
* Move regression tests to evaluation/
* use python instead of docker in the script
* add model param
* change python to python3
* bug fix
* add python path
* add readme
* a starting point for SWE-Bench evaluation with docker
* fix the swe-bench uid issue
* typo fixed
* fix conda missing issue
* move files based on new PR
* Update doc and gitignore using devin prediction file from #81
* fix typo
* add a sentence
* fix typo in path
* fix path
---------
Co-authored-by: Binyuan Hui <binyuan.hby@alibaba-inc.com>
* adding code to fetch and convert devin's output for evaluation
* update README.md
* update code for fetching and processing devin's outputs
* update code for fetching and processing devin's outputs