OpenHands

mirror of https://github.com/OpenHands/OpenHands.git synced 2025-12-26 05:48:36 +08:00

Author	SHA1	Message	Date
Xingyao Wang	2406b901df	feat(SWE-Bench environment) integrate SWE-Bench sandbox (#1468) * add draft dockerfile for build all * add rsync for build * add all-in-one docker * update prepare scripts * Update swe_env_box.py * Add swe_entry.sh (buggy now) * Parse the test command in swe_entry.sh * Update README for instance eval in sandbox * revert specialized config * replace run_as_devin as an init arg * set container & run_as_root via args * update swe entry script * update env * remove mounting * allow error after swe_entry * update swe_env_box * move file * update gitignore * get swe_env_box a working demo * support faking user response & provide sandox ahead of time; also return state for controller * tweak main to support adding controller kwargs * add module * initialize plugin for provided sandbox * add pip cache to plugin & fix jupyter kernel waiting * better print Observation output * add run infer scripts * update readme * add utility for getting diff patch * use get_diff_patch in infer * update readme * support cost tracking for codeact * add swe agent edit hack * disable color in git diff * fix git diff cmd * fix state return * support limit eval * increase t imeout and export pip cache * add eval limit config * return state when hit turn limit * save log to file; allow agent to give up * run eval with max 50 turns * add outputs to gitignore * save swe_instance & instruction * add uuid to swebench * add streamlit dep * fix save series * fix the issue where session id might be duplicated * allow setting temperature for llm (use 0 for eval) * Get report from agent running log * support evaluating task success right after inference.
* remove extra log * comment out prompt for baseline * add visualizer for eval * use plaintext for instruction * reduce timeout for all; only increase timeout for init * reduce timeout for all; only increase timeout for init * ignore sid for swe env * close sandbox in each eval loop * update visualizer instruction * increase max chars * add finish action to history too * show test result in metrics * add sidebars for visualizer * also visualize swe_instance * cleanup browser when agent controller finish runinng * do not mount workspace for swe-eval to avoid accidentally overwrite files * Revert "do not mount workspace for swe-eval to avoid accidentally overwrite files" This reverts commit 8ef77390543e562e6f0a5a9992418014d8b3010c. * Revert "Revert "do not mount workspace for swe-eval to avoid accidentally overwrite files"" This reverts commit 016cfbb9f0475f32bacbad5822996b4eaff24a5e. * run jupyter command via copy to, instead of cp to mount * only print mixin output when failed * change ssh box logging * add visualizer for pass rate * add instance id to sandbox name * only remove container we created * use opendevin logger in main * support multi-processing infer * add back metadata, support keyboard interrupt * remove container with startswith * make pbar behave correctly * update instruction w/ multi-processing * show resolved rate by repo * rename tmp dir name * attempt to fix racing for copy to ssh_box * fix script * bump swe-bench-all version * fix ipython with self-contained commands * add jupyter demo to swe_env_box * make resolved count two column * increase height * do not add glob to url params * analyze obs length * print instance id prior to removal handler * add gold patch in visualizer * fix interactive git by adding a git --no-pager as alias * increase max_char to 10k to cover 98% of swe-bench obs cases * allow parsing note * prompt v2 * add iteration reminder * adjust user response * adjust order * fix return eval * fix typo * add reminder before 
logging * remove other resolve rate * re adjust to new folder structure * support adding eval note * fix eval note path * make sure first log of each instance is printed * add eval note * fix the display for visualizer * tweak visualizer for better git patch reading * exclude empty patch * add retry mechanism for swe_env_box start * fix ssh timeout issue * add stat field for apply test patch success * add visualization for fine-grained report * attempt to support monologue agent by constraining it to single thread * also log error msg when stopeed * save error as well * override WORKSPACE_MOUNT_PATH and WORKSPACE_BASE for monologue to work in mp * add retry mechanism for sshbox * remove retry for swe env box * try to handle loop state stopped * Add get report scripts * Add script to convert agent output to swe-bench format * Merge fine grained report for visualizer * Update eval readme * Update README.md * Add CodeAct gpt4-1106 output and eval logs on swe-bench-lite * Update the script to get model report * Update get_model_report.sh * Update get_agent_report.sh * Update report merge script * Add agent output conversion script * Update swe_lite_env_setup.sh * Add example swe-bench output files * Update eval readme * Remove redundant scripts * set iteration count down to false by default * fix: Issue where CodeAct agent was trying to log cost on local llm and throwing Undefined Model execption out of litellm (#1666) * fix: Issue where CodeAct agent was trying to log cost on local llm and throwing Undefined Model execption out of litellm * Review Feedback * Missing None Check * Review feedback and improved error handling --------- Co-authored-by: Robert Brennan <accounts@rbren.io> * fix prepare_swe_util scripts * update builder images * update setup script * remove swe-bench build workflow * update lock * remove experiments since they are moved to hf * remove visualizer (since it is moved to hf repo) * simply jupyter execution via heredoc * update ssh_box * add 
initial docker readme * add pkg-config as dependency * add script for swe_bench all-in-one docker * add rsync to builder * rename var * update commit * update readme * update lock * support specify timeout for long running tasks * fix path * separate building of all deps and files * support returning states at the end of controller * remove return None * support specify timeout for long running tasks * add timeout for all existing sandbox impl * fix swe_env_box for new codebase * update llm config in config.py * support pass sandbox in * remove force set * update eval script * fix issue of overriding final state * change default eval output to hf demo * change default eval output to hf demo * fix config * only close it when it is NOT external sandbox * add scripts * tweak config * only put in hostory when state has history attr * fix agent controller on the case of run out interaction budget * always assume state is always not none * remove print of final state * catch all exception when cannot compute completion cost * Update README.md * save source into json * fix path * update docker path * return the final state on close * merge AgentState with State * fix integration test * merge AgentState with State * fix integration test * add ChangeAgentStateAction to history in attempt to fix integration * add back set agent state * update tests * update tests * move scripts for setup * update script and readme for infer * do not reset logger when n processes == 1 * update eval_infer scripts and readme * simplify readme * copy over dir after eval * copy over dir after eval * directly return get state * update lock * fix output saving of infer * replace print with logger * update eval_infer script * add back the missing .close * increase timeout * copy all swe_bench_format file * attempt to fix output parsing * log git commit id as metadata * fix eval script * update lock * update unit tests * fix argparser unit test * fix lock * the deps are now lightweight enough to be 
incude in make build * add spaces for tests * add eval outputs to gitignore * remove git submodule * readme * tweak git email * update upload instruction * bump codeact version for eval --------- Co-authored-by: Bowen Li <libowen.ne@gmail.com> Co-authored-by: huybery <huybery@gmail.com> Co-authored-by: Bart Shappee <bshappee@gmail.com> Co-authored-by: Robert Brennan <accounts@rbren.io>	2024-05-15 16:15:55 +00:00
Aleksandar	657b177b4e	Default to less expensive gpt-3.5-turbo model (#1675)	2024-05-09 19:11:27 -04:00
Engel Nyst	446eaec1e6	Refactor config to dataclasses (#1552) * mypy is invaluable * fix config, add test * Add new-style toml support * add singleton, small doc fixes * fix some cases of loading toml, clean up, try to make it clearer * Add defaults_dict for UI * allow config to be mutable error handling fix toml parsing * remove debug stuff * Adapt Makefile * Add defaults for temperature and top_p * update to CodeActAgent * comments * fix unit tests * implement groups of llm settings (CLI) * fix merge issue * small fix sandboxes, small refactoring * adapt LLM init to accept overrides at runtime * reading config is enough * Encapsulate minimally embeddings initialization * agent bug fix; fix tests * fix sandboxes tests * refactor globals in sandboxes to properties	2024-05-09 22:48:29 +02:00
Jirka Borovec	0c2ebfd6e1	Ruff: use I rule for isort (#1410) Ruff: use I rule for isort	2024-04-29 15:41:58 -07:00
Alex Bäuerle	cd58194d2a	docs(docs): start implementing docs website (#1372) * docs(docs): start implementing docs website * update video url * add autogenerated codebase docs for backend * precommit * update links * fix config and video * gh actions * rename * workdirs * path * path * fix doc1 * redo markdown * docs * change main folder name * simplify readme * add back architecture * Fix lint errors * lint * update poetry lock --------- Co-authored-by: Jim Su <jimsu@protonmail.com>	2024-04-29 10:00:51 -07:00
Jirka Borovec	e32d95cb1a	lint: simplify hooks already covered by Ruff (#1204) * lint: simplify hooks already covered by Ruff * prune dev dependency * setting E, W, F * poetry? * autopep8 * quote-style = "single" * double-quote-string-fixer * --all-files * apply * Q * drop double-quote-string-fixer * --all-files * apply pre-commit * python3.11 -m poetry lock --no-update --------- Co-authored-by: Robert Brennan <accounts@rbren.io>	2024-04-27 11:32:14 +00:00
Xia Zhenhua	747ac23cd0	fix: conftest.py comment bug. (#1303) Co-authored-by: aaren.xzh <aaren.xzh@antfin.com>	2024-04-23 07:51:33 -04:00
Robert Brennan	516c9bf1e0	Revamp docker build process (#1121) * refactor docker building * change to buildx * disable branch filter * disable tags * matrix for building * fix branch filter * rename workflow * sanitize ref name * fix sanitization * fix source command * fix source command * add push arg * enable for all branches * logs * empty commit * try freeing disk space * try disk clean again * try alpine * Update ghcr.yml * Update ghcr.yml * move checkout * ignore .git * add disk space debug * add df h to build script * remove pull * try another failure bypass * remove maximize build space step * remove df -h debug * add no-root * multi-stage python build * add ssh * update readme * remove references to config.toml	2024-04-15 19:10:38 -04:00
hugehope	9cd4ad3298	chore: fix some typos in comments (#1013) Signed-off-by: hugehope <cmm7@sina.cn>	2024-04-11 23:21:46 +02:00
libowen2121	e256329e5e	Update SWE-bench eval results (#978)	2024-04-10 21:09:49 +08:00
Engel Nyst	99a8dc4ff9	Fallback to less expensive model (#475)	2024-04-07 05:45:37 +02:00
Alex Bäuerle	a82e065f56	feat: add commands for swebench (#682) * feat: add commands for swebench * restructure	2024-04-05 12:47:32 -05:00
Yufan Song	5e87c79838	refactor (#543)	2024-04-02 08:13:38 -04:00
Robert Brennan	511afa12fe	fix old references to langchains (#513)	2024-04-01 13:33:20 -04:00
Aravind Somaraj	26c9ce132b	style: Moved argument parsing statements into a separate function (#503) * style: moved argument parsing into a separate function * commito * Update evaluation/regression/conftest.py --------- Co-authored-by: Robert Brennan <accounts@rbren.io>	2024-04-01 10:47:58 -04:00
Tess	8796a690d5	doc - Added code documentation for clarity (#434) * doc - Added code documentaion to 'plan.py' * doc - Added code documentation to 'session.py' * doc - added code documentation for clarity * doc - added documentation to 'conftest.py' * doc - added code documentation to 'run_tests.pt' * Update evaluation/regression/conftest.py --------- Co-authored-by: Robert Brennan <accounts@rbren.io>	2024-04-01 10:22:09 -04:00
iFurySt	a08c82d35e	ci: check if the image exists in ghcr.io to avoid repeat building and pushing (#283) * ci: check if the image exists in ghcr.io to avoid repeat building and pushing. * feat: add push MAJOR and MINOR version to ghcr.io	2024-03-31 11:30:13 -04:00
iFurySt	2286e73912	fix: change to use the latest docker image. (#290) Co-authored-by: Robert Brennan <accounts@rbren.io>	2024-03-29 15:59:52 -04:00
Jim Su	b1b96df8a8	Replace environment variables with configuration file (#339) * Replace environment variables with configuration file * Add config.toml to .gitignore * Remove unused os imports * Update README.md * Update README.md * Update README.md * Fix merge conflict * Fallback to environment variables * Use template file for config.toml * Update config.toml.template * Update config.toml.template --------- Co-authored-by: Robert Brennan <accounts@rbren.io>	2024-03-29 15:26:20 -04:00
Anas DORBANI	7c27e59918	feat: Ad/regression tests using pytest (#329) * Remove all the unnecessary files * Create finalize the regression testing framework and add hello world test case * Update requirements.txt * Update the test function to execute the generate script	2024-03-28 23:40:30 -04:00
iFurySt	89abc5e253	fix: move the makefile to correct path. (#252)	2024-03-27 23:53:40 +08:00
iFurySt	8b9fc3df28	feat: add workflow to ghcr (#237) Co-authored-by: Xingyao Wang <xingyao6@illinois.edu>	2024-03-27 23:10:34 +08:00
zch-cc	e5a28cba2f	Evaluation: Fix bug on python path on run.sh (#98) * Move regression tests to evaluation/ * use pythnon instead of docker in the script * add model para * change python to python3 * bug fix * add python path * add readme	2024-03-23 00:01:48 +08:00
zch-cc	cfefc47439	Move regression tests to evaluation/ (#86) * Move regression tests to evaluation/ * use pythnon instead of docker in the script * add model para * change python to python3 * bug fix	2024-03-22 23:26:37 +08:00
libowen2121	40a3614e80	Add a roadmap for eval (#92)	2024-03-22 20:27:30 +08:00
Xingyao Wang	2d5c8f1060	change to OpenDevin fork (#89)	2024-03-22 18:30:12 +08:00
Xingyao Wang	5ff96111f0	A starting point for SWE-Bench Evaluation with docker (#60) * a starting point for SWE-Bench evaluation with docker * fix the swe-bench uid issue * typo fixed * fix conda missing issue * move files based on new PR * Update doc and gitignore using devin prediction file from #81 * fix typo * add a sentence * fix typo in path * fix path --------- Co-authored-by: Binyuan Hui <binyuan.hby@alibaba-inc.com>	2024-03-22 12:43:49 +08:00
Jiaxin Pei	dc88dac296	adding a script to fetch and convert devin's output for evaluation (#81) * adding code to fetch and convert devin's output for evaluation * update README.md * update code for fetching and processing devin's outputs * update code for fetching and processing devin's outputs	2024-03-22 01:33:01 +08:00
Binyuan Hui	f99f4ebdaa	fix: typo in the evaluation folder name. (#66)	2024-03-20 23:00:09 +08:00

29 Commits