29 Commits

Author SHA1 Message Date
Xingyao Wang
2406b901df
feat(SWE-Bench environment) integrate SWE-Bench sandbox (#1468)
* add draft dockerfile for build all

* add rsync for build

* add all-in-one docker

* update prepare scripts

* Update swe_env_box.py

* Add swe_entry.sh (buggy now)

* Parse the test command in swe_entry.sh

* Update README for instance eval in sandbox

* revert specialized config

* replace run_as_devin as an init arg

* set container & run_as_root via args

* update swe entry script

* update env

* remove mounting

* allow error after swe_entry

* update swe_env_box

* move file

* update gitignore

* get swe_env_box a working demo

* support faking user response & provide sandox ahead of time;
also return state for controller

* tweak main to support adding controller kwargs

* add module

* initialize plugin for provided sandbox

* add pip cache to plugin & fix jupyter kernel waiting

* better print Observation output

* add run infer scripts

* update readme

* add utility for getting diff patch

* use get_diff_patch in infer

* update readme

* support cost tracking for codeact

* add swe agent edit hack

* disable color in git diff

* fix git diff cmd

* fix state return

* support limit eval

* increase t imeout and export pip cache

* add eval limit config

* return state when hit turn limit

* save log to file; allow agent to give up

* run eval with max 50 turns

* add outputs to gitignore

* save swe_instance & instruction

* add uuid to swebench

* add streamlit dep

* fix save series

* fix the issue where session id might be duplicated

* allow setting temperature for llm (use 0 for eval)

* Get report from agent running log

* support evaluating task success right after inference.

* remove extra log

* comment out prompt for baseline

* add visualizer for eval

* use plaintext for instruction

* reduce timeout for all; only increase timeout for init

* reduce timeout for all; only increase timeout for init

* ignore sid for swe env

* close sandbox in each eval loop

* update visualizer instruction

* increase max chars

* add finish action to history too

* show test result in metrics

* add sidebars for visualizer

* also visualize swe_instance

* cleanup browser when agent controller finish runinng

* do not mount workspace for swe-eval to avoid accidentally overwrite files

* Revert "do not mount workspace for swe-eval to avoid accidentally overwrite files"

This reverts commit 8ef77390543e562e6f0a5a9992418014d8b3010c.

* Revert "Revert "do not mount workspace for swe-eval to avoid accidentally overwrite files""

This reverts commit 016cfbb9f0475f32bacbad5822996b4eaff24a5e.

* run jupyter command via copy to, instead of cp to mount

* only print mixin output when failed

* change ssh box logging

* add visualizer for pass rate

* add instance id to sandbox name

* only remove container we created

* use opendevin logger in main

* support multi-processing infer

* add back metadata, support keyboard interrupt

* remove container with startswith

* make pbar behave correctly

* update instruction w/ multi-processing

* show resolved rate by repo

* rename tmp dir name

* attempt to fix racing for copy to ssh_box

* fix script

* bump swe-bench-all version

* fix ipython with self-contained commands

* add jupyter demo to swe_env_box

* make resolved count two column

* increase height

* do not add glob to url params

* analyze obs length

* print instance id prior to removal handler

* add gold patch in visualizer

* fix interactive git by adding a git --no-pager as alias

* increase max_char to 10k to cover 98% of swe-bench obs cases

* allow parsing note

* prompt v2

* add iteration reminder

* adjust user response

* adjust order

* fix return eval

* fix typo

* add reminder before logging

* remove other resolve rate

* re adjust to new folder structure

* support adding eval note

* fix eval note path

* make sure first log of each instance is printed

* add eval note

* fix the display for visualizer

* tweak visualizer for better git patch reading

* exclude empty patch

* add retry mechanism for swe_env_box start

* fix ssh timeout issue

* add stat field for apply test patch success

* add visualization for fine-grained report

* attempt to support monologue agent by constraining it to single thread

* also log error msg when stopeed

* save error as well

* override WORKSPACE_MOUNT_PATH and WORKSPACE_BASE for monologue to work in mp

* add retry mechanism for sshbox

* remove retry for swe env box

* try to handle loop state stopped

* Add get report scripts

* Add script to convert agent output to swe-bench format

* Merge fine grained report for visualizer

* Update eval readme

* Update README.md

* Add CodeAct gpt4-1106 output and eval logs on swe-bench-lite

* Update the script to get model report

* Update get_model_report.sh

* Update get_agent_report.sh

* Update report merge script

* Add agent output conversion script

* Update swe_lite_env_setup.sh

* Add example swe-bench output files

* Update eval readme

* Remove redundant scripts

* set iteration count down to false by default

* fix: Issue where CodeAct agent was trying to log cost on local llm and throwing Undefined Model execption out of litellm (#1666)

* fix: Issue where CodeAct agent was trying to log cost on local llm and throwing Undefined Model execption out of litellm

* Review Feedback

* Missing None Check

* Review feedback and improved error handling

---------

Co-authored-by: Robert Brennan <accounts@rbren.io>

* fix prepare_swe_util scripts

* update builder images

* update setup script

* remove swe-bench build workflow

* update lock

* remove experiments since they are moved to hf

* remove visualizer (since it is moved to hf repo)

* simply jupyter execution via heredoc

* update ssh_box

* add initial docker readme

* add pkg-config as dependency

* add script for swe_bench all-in-one docker

* add rsync to builder

* rename var

* update commit

* update readme

* update lock

* support specify timeout for long running tasks

* fix path

* separate building of all deps and files

* support returning states at the end of controller

* remove return None

* support specify timeout for long running tasks

* add timeout for all existing sandbox impl

* fix swe_env_box for new codebase

* update llm config in config.py

* support pass sandbox in

* remove force set

* update eval script

* fix issue of overriding final state

* change default eval output to hf demo

* change default eval output to hf demo

* fix config

* only close it when it is NOT external sandbox

* add scripts

* tweak config

* only put in hostory when state has history attr

* fix agent controller on the case of run out interaction budget

* always assume state is always not none

* remove print of final state

* catch all exception when cannot compute completion cost

* Update README.md

* save source into json

* fix path

* update docker path

* return the final state on close

* merge AgentState with State

* fix integration test

* merge AgentState with State

* fix integration test

* add ChangeAgentStateAction to history in attempt to fix integration

* add back set agent state

* update tests

* update tests

* move scripts for setup

* update script and readme for infer

* do not reset logger when n processes == 1

* update eval_infer scripts and readme

* simplify readme

* copy over dir after eval

* copy over dir after eval

* directly return get state

* update lock

* fix output saving of infer

* replace print with logger

* update eval_infer script

* add back the missing .close

* increase timeout

* copy all swe_bench_format file

* attempt to fix output parsing

* log git commit id as metadata

* fix eval script

* update lock

* update unit tests

* fix argparser unit test

* fix lock

* the deps are now lightweight enough to be incude in make build

* add spaces for tests

* add eval outputs to gitignore

* remove git submodule

* readme

* tweak git email

* update upload instruction

* bump codeact version for eval

---------

Co-authored-by: Bowen Li <libowen.ne@gmail.com>
Co-authored-by: huybery <huybery@gmail.com>
Co-authored-by: Bart Shappee <bshappee@gmail.com>
Co-authored-by: Robert Brennan <accounts@rbren.io>
2024-05-15 16:15:55 +00:00
Aleksandar
657b177b4e
Default to less expensive gpt-3.5-turbo model (#1675) 2024-05-09 19:11:27 -04:00
Engel Nyst
446eaec1e6
Refactor config to dataclasses (#1552)
* mypy is invaluable

* fix config, add test

* Add new-style toml support

* add singleton, small doc fixes

* fix some cases of loading toml, clean up, try to make it clearer

* Add defaults_dict for UI

* allow config to be mutable
error handling
fix toml parsing

* remove debug stuff

* Adapt Makefile

* Add defaults for temperature and top_p

* update to CodeActAgent

* comments

* fix unit tests

* implement groups of llm settings (CLI)

* fix merge issue

* small fix sandboxes, small refactoring

* adapt LLM init to accept overrides at runtime

* reading config is enough

* Encapsulate minimally embeddings initialization

* agent bug fix; fix tests

* fix sandboxes tests

* refactor globals in sandboxes to properties
2024-05-09 22:48:29 +02:00
Jirka Borovec
0c2ebfd6e1
Ruff: use I rule for isort (#1410)
Ruff: use I rule for isort
2024-04-29 15:41:58 -07:00
Alex Bäuerle
cd58194d2a
docs(docs): start implementing docs website (#1372)
* docs(docs): start implementing docs website

* update video url

* add autogenerated codebase docs for backend

* precommit

* update links

* fix config and video

* gh actions

* rename

* workdirs

* path

* path

* fix doc1

* redo markdown

* docs

* change main folder name

* simplify readme

* add back architecture

* Fix lint errors

* lint

* update poetry lock

---------

Co-authored-by: Jim Su <jimsu@protonmail.com>
2024-04-29 10:00:51 -07:00
Jirka Borovec
e32d95cb1a
lint: simplify hooks already covered by Ruff (#1204)
* lint: simplify hooks already covered by Ruff

* prune dev dependency

* setting E, W, F

* poetry?

* autopep8

* quote-style = "single"

* double-quote-string-fixer

* --all-files

* apply

* Q

* drop double-quote-string-fixer

* --all-files

* apply pre-commit

* python3.11 -m poetry lock --no-update

---------

Co-authored-by: Robert Brennan <accounts@rbren.io>
2024-04-27 11:32:14 +00:00
Xia Zhenhua
747ac23cd0
fix: conftest.py comment bug. (#1303)
Co-authored-by: aaren.xzh <aaren.xzh@antfin.com>
2024-04-23 07:51:33 -04:00
Robert Brennan
516c9bf1e0
Revamp docker build process (#1121)
* refactor docker building

* change to buildx

* disable branch filter

* disable tags

* matrix for building

* fix branch filter

* rename workflow

* sanitize ref name

* fix sanitization

* fix source command

* fix source command

* add push arg

* enable for all branches

* logs

* empty commit

* try freeing disk space

* try disk clean again

* try alpine

* Update ghcr.yml

* Update ghcr.yml

* move checkout

* ignore .git

* add disk space debug

* add df h to build script

* remove pull

* try another failure bypass

* remove maximize build space step

* remove df -h debug

* add no-root

* multi-stage python build

* add ssh

* update readme

* remove references to config.toml
2024-04-15 19:10:38 -04:00
hugehope
9cd4ad3298
chore: fix some typos in comments (#1013)
Signed-off-by: hugehope <cmm7@sina.cn>
2024-04-11 23:21:46 +02:00
libowen2121
e256329e5e
Update SWE-bench eval results (#978) 2024-04-10 21:09:49 +08:00
Engel Nyst
99a8dc4ff9
Fallback to less expensive model (#475) 2024-04-07 05:45:37 +02:00
Alex Bäuerle
a82e065f56
feat: add commands for swebench (#682)
* feat: add commands for swebench

* restructure
2024-04-05 12:47:32 -05:00
Yufan Song
5e87c79838
refactor (#543) 2024-04-02 08:13:38 -04:00
Robert Brennan
511afa12fe
fix old references to langchains (#513) 2024-04-01 13:33:20 -04:00
Aravind Somaraj
26c9ce132b
style: Moved argument parsing statements into a separate function (#503)
* style: moved argument parsing into a separate function

* commito

* Update evaluation/regression/conftest.py

---------

Co-authored-by: Robert Brennan <accounts@rbren.io>
2024-04-01 10:47:58 -04:00
Tess
8796a690d5
doc - Added code documentation for clarity (#434)
* doc - Added code documentaion to 'plan.py'

* doc - Added code documentation to 'session.py'

* doc - added code documentation for clarity

* doc - added documentation to 'conftest.py'

* doc - added code documentation to 'run_tests.pt'

* Update evaluation/regression/conftest.py

---------

Co-authored-by: Robert Brennan <accounts@rbren.io>
2024-04-01 10:22:09 -04:00
iFurySt
a08c82d35e
ci: check if the image exists in ghcr.io to avoid repeat building and pushing (#283)
* ci: check if the image exists in ghcr.io to avoid repeat building and pushing.

* feat: add push MAJOR and MINOR version to ghcr.io
2024-03-31 11:30:13 -04:00
iFurySt
2286e73912
fix: change to use the latest docker image. (#290)
Co-authored-by: Robert Brennan <accounts@rbren.io>
2024-03-29 15:59:52 -04:00
Jim Su
b1b96df8a8
Replace environment variables with configuration file (#339)
* Replace environment variables with configuration file

* Add config.toml to .gitignore

* Remove unused os imports

* Update README.md

* Update README.md

* Update README.md

* Fix merge conflict

* Fallback to environment variables

* Use template file for config.toml

* Update config.toml.template

* Update config.toml.template

---------

Co-authored-by: Robert Brennan <accounts@rbren.io>
2024-03-29 15:26:20 -04:00
Anas DORBANI
7c27e59918
feat: Ad/regression tests using pytest (#329)
* Remove all the unnecessary files

* Create finalize the regression testing framework and add hello world test case

* Update requirements.txt

* Update the test function to execute the generate script
2024-03-28 23:40:30 -04:00
iFurySt
89abc5e253
fix: move the makefile to correct path. (#252) 2024-03-27 23:53:40 +08:00
iFurySt
8b9fc3df28
feat: add workflow to ghcr (#237)
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu>
2024-03-27 23:10:34 +08:00
zch-cc
e5a28cba2f
Evaluation: Fix bug on python path on run.sh (#98)
* Move regression tests to evaluation/

* use pythnon instead of docker in the script

* add model para

* change python to python3

* bug fix

* add python path

* add readme
2024-03-23 00:01:48 +08:00
zch-cc
cfefc47439
Move regression tests to evaluation/ (#86)
* Move regression tests to evaluation/

* use pythnon instead of docker in the script

* add model para

* change python to python3

* bug fix
2024-03-22 23:26:37 +08:00
libowen2121
40a3614e80
Add a roadmap for eval (#92) 2024-03-22 20:27:30 +08:00
Xingyao Wang
2d5c8f1060
change to OpenDevin fork (#89) 2024-03-22 18:30:12 +08:00
Xingyao Wang
5ff96111f0
A starting point for SWE-Bench Evaluation with docker (#60)
* a starting point for SWE-Bench evaluation with docker

* fix the swe-bench uid issue

* typo fixed

* fix conda missing issue

* move files based on new PR

* Update doc and gitignore using devin prediction file from #81

* fix typo

* add a sentence

* fix typo in path

* fix path

---------

Co-authored-by: Binyuan Hui <binyuan.hby@alibaba-inc.com>
2024-03-22 12:43:49 +08:00
Jiaxin Pei
dc88dac296
adding a script to fetch and convert devin's output for evaluation (#81)
* adding code to fetch and convert devin's output for evaluation

* update README.md

* update code for fetching and processing devin's outputs

* update code for fetching and processing devin's outputs
2024-03-22 01:33:01 +08:00
Binyuan Hui
f99f4ebdaa
fix: typo in the evaluation folder name. (#66) 2024-03-20 23:00:09 +08:00