OpenHands

mirror of https://github.com/OpenHands/OpenHands.git synced 2026-03-22 13:47:19 +08:00

Author	SHA1	Message	Date
Xavier Vergés	cd91d45b44	Allow SANDBOX_CONTAINER_IMAGEs built from opendevin/sandbox:main (#2622 )	2024-06-26 12:05:07 +08:00
Xingyao Wang	6de584d77d	update swe-bench output with eval results (#2606 )	2024-06-24 08:07:28 +09:00
Graham Neubig	cab7a288ca	Add NUM_WORKERS variable to run_infer.sh scripts for configurable worker settings (#2597 ) * Add NUM_WORKERS variable to run_infer.sh scripts for configurable worker settings * Update evaluation/webarena/scripts/run_infer.sh --------- Co-authored-by: OpenDevin <opendevin@all-hands.dev>	2024-06-23 03:43:43 +00:00
மனோஜ்குமார் பழனிச்சாமி	41564c2eac	Use :main instead of :latest (#2539 ) Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>	2024-06-21 03:57:50 +00:00
Boxuan Li	feabc97aba	Evaluation time travel: build sandbox on the fly (#2491 )	2024-06-20 20:22:02 -06:00
Xingyao Wang	b569ba710d	docs: Add visualizer instruction for SWE-Bench (#2529 ) * Update README.md for visualizer instruction * Polish the visualization guidance (#2531) * fix conda create error * fix and polish the readme for visualization * Update README.md --------- Co-authored-by: Haofei Yu <haofeiy@cs.cmu.edu>	2024-06-19 20:41:09 +00:00
Xingyao Wang	1f379bebc2	Update README.md (#2505 ) LGTM	2024-06-18 18:14:21 +02:00
Boxuan Li	6f235937cf	Evaluation time travel: allow evaluation on a specific version (#2356 ) * Time travel for evaluation * Fix source script path * Exit script if given version doesn't exist * Exit on failure * Update README * Change scripts of all other benchmarks * Modify README files * Fix logic_reasoning README	2024-06-16 10:25:14 -04:00
super-dainiu	563bc41fd3	Use LLM to analyze ML-Bench failure cases (#2399 ) * add ml-bench w/o exec env * fix typos (#1956) no functional change * Refactored Logs (#1939) * [Feat] A competitive Web Browsing agent (#1856) * initial attempt at a browsing only agent * add browsing agent * update * implement agent * update * fix comments * remove unnecessary things from memory extras * update image processing --------- Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com> * Update README.md SWE-bench score (#1959) * Update README.md SWE-bench score Our most recent results on swe-bench lite are 25%, so this updates the README accordingly. * Update * fix: llm is_local function logic error (#1961) Co-authored-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com> * doc: update documentation about poetry update (#1962) * add doc * Update Development.md --------- Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> * feat: add metrics related to cost for better observability (#1944) * add metrics for total_cost * make lint * refact codeact * change metrics into llm * add costs list, add into state * refactor log completion * refactor and test others * make lint * Update opendevin/core/metrics.py Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> * Update opendevin/llm/llm.py Co-authored-by: Xingyao Wang <xingyao6@illinois.edu> * refactor * add code --------- Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> Co-authored-by: Xingyao Wang <xingyao6@illinois.edu> * doc: add more cmd in unit test documentation (#1963) * --- (#1975) updated-dependencies: - dependency-name: boto3 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * --- (#1976) updated-dependencies: - dependency-name: litellm dependency-type: direct:production update-type: version-update:semver-patch ... 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Logging security (#1943) * update .gitignore * Rename the confusing 'INFO' style to 'DETAIL' * override str and repr * feat: api_key desensitize * feat: add SensitiveDataFilter in file handler * tweak regex, add tests * more tweaks, include other attrs * add env vars, those with equivalent config * fix tests * tests are invaluable --------- Co-authored-by: Shimada666 <649940882@qq.com> * --- (#1967) updated-dependencies: - dependency-name: react-dom dependency-type: direct:production update-type: version-update:semver-minor - dependency-name: "@types/react-dom" dependency-type: direct:development update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * --- (#1968) updated-dependencies: - dependency-name: "@reduxjs/toolkit" dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * --- (#1969) updated-dependencies: - dependency-name: husky dependency-type: direct:development update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * --- (#1970) updated-dependencies: - dependency-name: tailwind-merge dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * --- (#1971) updated-dependencies: - dependency-name: i18next dependency-type: direct:production update-type: version-update:semver-minor ... 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com> * Refactor session management (#1810) * refactor session mgmt * defer file handling to runtime * add todo * refactor sessions a bit more * remove messages logic from FE * fix up socket handshake * refactor frontend auth a bit * first pass at redoing file explorer * implement directory suffix * fix up file tree * close agent on websocket close * remove session saving * move file refresh * remove getWorkspace * plumb path/code differently * fix build issues * fix the tests * fix npm build * add session rehydration * fix event serialization * logspam * fix user message rehydration * add get_event fn * agent state restoration * change history tracking for codeact * fix responsiveness of init * fix lint * lint * delint * fix prop * update tests * logspam * lint * fix test * revert codeact * change fileService to use API * fix up session loading * delint * delint * fix integration tests * revert test * fix up access to options endpoints * fix initial files load * delint * fix file initialization * fix mock server * fixl int * fix auth for html * Update frontend/src/i18n/translation.json Co-authored-by: Xingyao Wang <xingyao6@illinois.edu> * refactor sessions and sockets * avoid reinitializing the same session * fix reconnect issue * change up intro message * more guards on reinit * rename agent_session * delint * fix a bunch of tests * delint * fix last test * remove code editor context * fix build * fix any * fix dot notation * Update frontend/src/services/api.ts Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> * fix up error handling * Update opendevin/server/session/agent.py Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> * Update opendevin/server/session/agent.py Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> * Update frontend/src/services/session.ts 
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> * fix build errs * fix else * add closed state * delint * Update opendevin/server/session/session.py Co-authored-by: Engel Nyst <enyst@users.noreply.github.com> --------- Co-authored-by: Xingyao Wang <xingyao6@illinois.edu> Co-authored-by: Graham Neubig <neubig@gmail.com> Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> Co-authored-by: Engel Nyst <enyst@users.noreply.github.com> * fix #1960 (#1964) * Add ruff for shared mutable defaults (B) (#1938) * Add ruff for shared mutable defaults (B) * Apply B006, B008 on current files, except fast API * Update agenthub/SWE_agent/prompts.py Co-authored-by: Graham Neubig <neubig@gmail.com> * fix unintended behavior change * this is correct, tell Ruff to leave it alone --------- Co-authored-by: Graham Neubig <neubig@gmail.com> Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> * Refactor integration testing CI, add optional Mac tests, and mark a few agents as deprecated (#1888) * Add MacOS to integration tests * Switch back to python 3.11 * Install Docker for macos pipeline * regenerate.sh: Use environmental variable for sandbox type * Pack different agents' tests into a single check * Fix CodeAct tests * Reduce file match and extensive debug logs * Add TEST_IN_CI mode that reports codecov * Small fix: don't quit if reusing old responses failed * Merge codecov results * Fix typos * Remove coverage merge step - codecov automatically does that * Make mac integration tests as optional - too slow * Fix codecov args * Add comments in yaml * Include sandbox type in codecov report name * Fix codecov report merge * Revert renaming of test_matrix_success * Remove SWEAgent and PlannerAgent from tests * Mark planner agent and SWE agent as deprecated * CodeCov: Ignore planner and sweagent * Revert "Remove SWEAgent and PlannerAgent from tests" This reverts commit `040cb3bfb9`. 
* Remove all tests for SWE Agent * Only keep basic tests for MonologueAgent and PlannerAgent * Mark SWE Agent as deprecated, and ignore code coverage for it --------- Co-authored-by: Engel Nyst <enyst@users.noreply.github.com> * Fix Repeated Responses in Chat by Adding IPythonRunCellObservation (#1987) Co-authored-by: jianghongwei <jianghongwei@58.com> Co-authored-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com> * Save CI cycles for backend tests (#1985) * Fix typo in prompt (#1992) * Refactor monologue and SWE agent to use the messages in state history (#1863) * Refactor monologue to use the messages in state history * add messages, clean up * fix monologue * update integration tests * move private method * update SWE agent to use the history from State * integration tests for SWE agent * rename monologue to initial_thoughts, since that is what it is * fix: catch session file not existed exception when init EventStream(maybe creating a new session with no session files stored). (#1994) * add ml-bench in readme * Bump boto3 from 1.34.110 to 1.34.111 (#2001) Bumps [boto3](https://github.com/boto/boto3) from 1.34.110 to 1.34.111. - [Release notes](https://github.com/boto/boto3/releases) - [Changelog](https://github.com/boto/boto3/blob/develop/CHANGELOG.rst) - [Commits](https://github.com/boto/boto3/compare/1.34.110...1.34.111) --- updated-dependencies: - dependency-name: boto3 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump docker from 7.0.0 to 7.1.0 (#2002) Bumps [docker](https://github.com/docker/docker-py) from 7.0.0 to 7.1.0. 
- [Release notes](https://github.com/docker/docker-py/releases) - [Commits](https://github.com/docker/docker-py/compare/7.0.0...7.1.0) --- updated-dependencies: - dependency-name: docker dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump litellm from 1.37.20 to 1.38.0 (#2005) Bumps [litellm](https://github.com/BerriAI/litellm) from 1.37.20 to 1.38.0. - [Release notes](https://github.com/BerriAI/litellm/releases) - [Commits](https://github.com/BerriAI/litellm/compare/v1.37.20...v1.38.0) --- updated-dependencies: - dependency-name: litellm dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Fix SWE-Bench evaluation due to setuptools version (#1995) * correctly setup plugins for swebench eval * bump swe-bench version and add logging * Revert "correctly setup plugins for swebench eval" This reverts commit `2bd1055673`. 
* bump version * fix session state after resuming (#1999) * fix state resuming * fix session reconnection * fix lint * Implement `agentskills` for OpenDevin to helpfully improve edit AND including more useful tools/skills (#1941) * add draft for skills * Implement and test agentskills functions: open_file, goto_line, scroll_down, scroll_up, create_file, search_dir, search_file, find_file * Remove new_sample.txt file * add some work from opendevin w/ fixes * Add unit tests for agentskills module * fix some issues and updated tests * add more tests for open * tweak and handle goto_line * add tests for some edge cases * add tests for scrolling * add tests for edit * add tests for search_dir * update tests to use pytest * use pytest --forked to avoid file op unit tests to interfere with each other via global var * update doc based on swe agent tool * update and add tests for find_file and search_file * move agent_skills to plugins * add agentskills as plugin and docs * add agentskill to ssh box and fix sandbox integration * remove extra returns in doc * add agentskills to initial tool for jupyter * support re-init jupyter kernel (for agentskills) after restart * fix print window's issue with indentation and add testcases * add prompt for codeact with the newest edit primitives * modify the way line number is presented (remove leading space) * change prompt to the newest display format * support tracking of costs via metrics * Update opendevin/runtime/plugins/agent_skills/README.md * Update opendevin/runtime/plugins/agent_skills/README.md * implement and add tests for py linting * remove extra text arg for incompatible subprocess ver * remove sample.txt * update test_edits integration tests * fix all integration * Update opendevin/runtime/plugins/agent_skills/README.md * Update opendevin/runtime/plugins/agent_skills/README.md * Update opendevin/runtime/plugins/agent_skills/README.md * Update agenthub/codeact_agent/prompt.py Co-authored-by: Boxuan Li 
<liboxuan@connect.hku.hk> * Update agenthub/codeact_agent/prompt.py Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> * Update agenthub/codeact_agent/prompt.py Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> * Update opendevin/runtime/plugins/agent_skills/agentskills.py Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> * correctly setup plugins for swebench eval * bump swe-bench version and add logging * correctly setup plugins for swebench eval * bump swe-bench version and add logging * Revert "correctly setup plugins for swebench eval" This reverts commit `2bd1055673`. * bump version * remove _AGENT_SKILLS_DOCS * move flake8 to test dep * update poetry.lock * remove extra arg * reduce max iter for eval * update poetry * fix integration tests --------- Co-authored-by: OpenDevin <opendevin@opendevin.ai> Co-authored-by: Engel Nyst <enyst@users.noreply.github.com> Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> * build: Add poetry command to use Python 3.11 for environment setup (#1972) * Bump @react-types/shared from 3.23.0 to 3.23.1 in /frontend (#2006) Bumps [@react-types/shared](https://github.com/adobe/react-spectrum) from 3.23.0 to 3.23.1. - [Release notes](https://github.com/adobe/react-spectrum/releases) - [Commits](https://github.com/adobe/react-spectrum/compare/@react-types/shared@3.23.0...@react-types/shared@3.23.1) --- updated-dependencies: - dependency-name: "@react-types/shared" dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump @types/react-syntax-highlighter in /frontend (#2007) Bumps [@types/react-syntax-highlighter](https://github.com/DefinitelyTyped/DefinitelyTyped/tree/HEAD/types/react-syntax-highlighter) from 15.5.11 to 15.5.13. 
- [Release notes](https://github.com/DefinitelyTyped/DefinitelyTyped/releases) - [Commits](https://github.com/DefinitelyTyped/DefinitelyTyped/commits/HEAD/types/react-syntax-highlighter) --- updated-dependencies: - dependency-name: "@types/react-syntax-highlighter" dependency-type: direct:development update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump @typescript-eslint/parser from 7.9.0 to 7.10.0 in /frontend (#2008) Bumps [@typescript-eslint/parser](https://github.com/typescript-eslint/typescript-eslint/tree/HEAD/packages/parser) from 7.9.0 to 7.10.0. - [Release notes](https://github.com/typescript-eslint/typescript-eslint/releases) - [Changelog](https://github.com/typescript-eslint/typescript-eslint/blob/main/packages/parser/CHANGELOG.md) - [Commits](https://github.com/typescript-eslint/typescript-eslint/commits/v7.10.0/packages/parser) --- updated-dependencies: - dependency-name: "@typescript-eslint/parser" dependency-type: direct:development update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump lint-staged from 15.2.2 to 15.2.4 in /frontend (#2009) Bumps [lint-staged](https://github.com/okonet/lint-staged) from 15.2.2 to 15.2.4. - [Release notes](https://github.com/okonet/lint-staged/releases) - [Changelog](https://github.com/lint-staged/lint-staged/blob/master/CHANGELOG.md) - [Commits](https://github.com/okonet/lint-staged/compare/v15.2.2...v15.2.4) --- updated-dependencies: - dependency-name: lint-staged dependency-type: direct:development update-type: version-update:semver-patch ... 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Update README.md * Update README.md * add run_infer.sh * fix input output * fix docker sandbox * fix run * update and clean run_infer.py * add script to clean up dockers * update repo uid * add description * new * Update README.md * use root for sandbox * update readme * update ml-bench conda env * update readme * update readme * use try except * modify raise exception * add int * update README * longer time * fix existing issues * fix existing issue * new docker image * add metrics of cost * add result parsing cost * fix * fix * update summarize * fix * add analyze * update readme * use 4o * add eval output --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: Ubuntu <ubuntu@ip-172-31-31-157.ec2.internal> Co-authored-by: RainRat <rainrat78@yahoo.ca> Co-authored-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com> Co-authored-by: Frank Xu <frankxu2004@gmail.com> Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com> Co-authored-by: Graham Neubig <neubig@gmail.com> Co-authored-by: Shimada666 <649940882@qq.com> Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> Co-authored-by: Xingyao Wang <xingyao6@illinois.edu> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Engel Nyst <enyst@users.noreply.github.com> Co-authored-by: Robert Brennan <accounts@rbren.io> Co-authored-by: Rahul Anand <62982824+zeul22@users.noreply.github.com> Co-authored-by: jiangleo <jiangleo@users.noreply.github.com> Co-authored-by: jianghongwei <jianghongwei@58.com> Co-authored-by: Jeremi Joslin <jeremi@newlogic.com> Co-authored-by: Aaron Xia <zhhuaxia@gmail.com> Co-authored-by: OpenDevin <opendevin@opendevin.ai> Co-authored-by: DaxServer <7479937+DaxServer@users.noreply.github.com> Co-authored-by: Robert <871607149@qq.com>	2024-06-13 09:30:55 +08:00
Xingyao Wang	b3bdc44292	mkdir `infer_logs` instead of `logs` (#2382 )	2024-06-11 07:18:19 +08:00
Xingyao Wang	11a2d1682d	Minor SWE-Bench inference config tweak (#2381 ) * save infer logs to infer_logs * set max budget for swebench eval	2024-06-10 20:14:22 +00:00
Xingyao Wang	a6ba6c5277	Add SWEBench-docker eval (#2085 ) * add initial version of swebench-docker eval * update the branch of git repo * add poetry run * download dev set too and pre-load f2p and p2p * update eval infer script * increase timeout * add poetry run * install swebench from our fork * update script * update loc * support single instance debug * replace \r\n from model patch * replace eval docker from namespace xingyaoww * update script to auto detect swe-bench format jsonl * support eval infer on single instance id * change log output dir to logs * update summarise result script * update README * update readme * tweak branch * Update evaluation/swe_bench/scripts/eval/prep_eval.sh Co-authored-by: Graham Neubig <neubig@gmail.com> --------- Co-authored-by: Graham Neubig <neubig@gmail.com>	2024-06-10 19:30:40 +00:00
Yufan Song	f4cb192ebe	Fix llm key leaks bug (#2376 ) * fix bug * fix bug * add	2024-06-10 15:55:33 +00:00
Robert	7fc57650f3	BioCoder integration (#2076 ) * prepare execution and inference * Create README.md * Update README.md * Update evaluation/biocoder/README.md * Update evaluation/swe_bench/swe_env_box.py * switch to biocoder docker container and test-specific code * code for copying and running test files into container * add metrics * add readme * Biocoder evaluation code finished (rewrite testing infrastructure, prompt tuning, and bug fixes) * Update README.md --------- Co-authored-by: lilbillybiscuit <qianbill2014@outlook.com> Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com> Co-authored-by: yufansong <yufan@risingwave-labs.com>	2024-06-10 11:11:40 +08:00
RainRat	745ae42a72	fix typos (#2352 )	2024-06-09 12:57:58 -07:00
yueqis	68d9ad61cf	Feat: Support Gorilla APIBench (#2081 ) * removed unused files from gorilla * Update run_infer.py, removed unused imports * Update utils.py * Update ast_eval_hf.py * Update ast_eval_tf.py * Update ast_eval_th.py * Create README.md * Update run_infer.py * make lint * Update run_infer.py * fix lint --------- Co-authored-by: yufansong <yufan@risingwave-labs.com>	2024-06-08 16:54:54 +00:00
Jaskirat Singh	e8307608c2	Support gpqa benchmark evaluation (#2080 ) * feat: add gpqa benchmark evaluation * add metrics * reset configs in final block * make lint --------- Co-authored-by: yufansong <yufan@risingwave-labs.com>	2024-06-08 16:24:24 +00:00
yueqis	82d4d25b09	feat: support ToolQA benchmark (#2263 ) * Add files via upload * Update README.md * Update run_infer.py * Update utils.py * make lint * Update evaluation/toolqa/run_infer.py --------- Co-authored-by: Engel Nyst <enyst@users.noreply.github.com> Co-authored-by: yufansong <yufan@risingwave-labs.com> Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>	2024-06-08 07:54:01 -04:00
super-dainiu	beabcce16d	[Hotfix] Fix ML-Bench continue ``run_inference.py`` (#2284 ) * add ml-bench w/o exec env * fix typos (#1956) no functional change * Refactored Logs (#1939) * [Feat] A competitive Web Browsing agent (#1856) * initial attempt at a browsing only agent * add browsing agent * update * implement agent * update * fix comments * remove unnecessary things from memory extras * update image processing --------- Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com> * Update README.md SWE-bench score (#1959) * Update README.md SWE-bench score Our most recent results on swe-bench lite are 25%, so this updates the README accordingly. * Update * fix: llm is_local function logic error (#1961) Co-authored-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com> * doc: update documentation about poetry update (#1962) * add doc * Update Development.md --------- Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> * feat: add metrics related to cost for better observability (#1944) * add metrics for total_cost * make lint * refact codeact * change metrics into llm * add costs list, add into state * refactor log completion * refactor and test others * make lint * Update opendevin/core/metrics.py Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> * Update opendevin/llm/llm.py Co-authored-by: Xingyao Wang <xingyao6@illinois.edu> * refactor * add code --------- Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> Co-authored-by: Xingyao Wang <xingyao6@illinois.edu> * doc: add more cmd in unit test documentation (#1963) * --- (#1975) updated-dependencies: - dependency-name: boto3 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * --- (#1976) updated-dependencies: - dependency-name: litellm dependency-type: direct:production update-type: version-update:semver-patch ... 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Logging security (#1943) * update .gitignore * Rename the confusing 'INFO' style to 'DETAIL' * override str and repr * feat: api_key desensitize * feat: add SensitiveDataFilter in file handler * tweak regex, add tests * more tweaks, include other attrs * add env vars, those with equivalent config * fix tests * tests are invaluable --------- Co-authored-by: Shimada666 <649940882@qq.com> * --- (#1967) updated-dependencies: - dependency-name: react-dom dependency-type: direct:production update-type: version-update:semver-minor - dependency-name: "@types/react-dom" dependency-type: direct:development update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * --- (#1968) updated-dependencies: - dependency-name: "@reduxjs/toolkit" dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * --- (#1969) updated-dependencies: - dependency-name: husky dependency-type: direct:development update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * --- (#1970) updated-dependencies: - dependency-name: tailwind-merge dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * --- (#1971) updated-dependencies: - dependency-name: i18next dependency-type: direct:production update-type: version-update:semver-minor ... 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com> * Refactor session management (#1810) * refactor session mgmt * defer file handling to runtime * add todo * refactor sessions a bit more * remove messages logic from FE * fix up socket handshake * refactor frontend auth a bit * first pass at redoing file explorer * implement directory suffix * fix up file tree * close agent on websocket close * remove session saving * move file refresh * remove getWorkspace * plumb path/code differently * fix build issues * fix the tests * fix npm build * add session rehydration * fix event serialization * logspam * fix user message rehydration * add get_event fn * agent state restoration * change history tracking for codeact * fix responsiveness of init * fix lint * lint * delint * fix prop * update tests * logspam * lint * fix test * revert codeact * change fileService to use API * fix up session loading * delint * delint * fix integration tests * revert test * fix up access to options endpoints * fix initial files load * delint * fix file initialization * fix mock server * fixl int * fix auth for html * Update frontend/src/i18n/translation.json Co-authored-by: Xingyao Wang <xingyao6@illinois.edu> * refactor sessions and sockets * avoid reinitializing the same session * fix reconnect issue * change up intro message * more guards on reinit * rename agent_session * delint * fix a bunch of tests * delint * fix last test * remove code editor context * fix build * fix any * fix dot notation * Update frontend/src/services/api.ts Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> * fix up error handling * Update opendevin/server/session/agent.py Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> * Update opendevin/server/session/agent.py Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> * Update frontend/src/services/session.ts 
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> * fix build errs * fix else * add closed state * delint * Update opendevin/server/session/session.py Co-authored-by: Engel Nyst <enyst@users.noreply.github.com> --------- Co-authored-by: Xingyao Wang <xingyao6@illinois.edu> Co-authored-by: Graham Neubig <neubig@gmail.com> Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> Co-authored-by: Engel Nyst <enyst@users.noreply.github.com> * fix #1960 (#1964) * Add ruff for shared mutable defaults (B) (#1938) * Add ruff for shared mutable defaults (B) * Apply B006, B008 on current files, except fast API * Update agenthub/SWE_agent/prompts.py Co-authored-by: Graham Neubig <neubig@gmail.com> * fix unintended behavior change * this is correct, tell Ruff to leave it alone --------- Co-authored-by: Graham Neubig <neubig@gmail.com> Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> * Refactor integration testing CI, add optional Mac tests, and mark a few agents as deprecated (#1888) * Add MacOS to integration tests * Switch back to python 3.11 * Install Docker for macos pipeline * regenerate.sh: Use environmental variable for sandbox type * Pack different agents' tests into a single check * Fix CodeAct tests * Reduce file match and extensive debug logs * Add TEST_IN_CI mode that reports codecov * Small fix: don't quit if reusing old responses failed * Merge codecov results * Fix typos * Remove coverage merge step - codecov automatically does that * Make mac integration tests as optional - too slow * Fix codecov args * Add comments in yaml * Include sandbox type in codecov report name * Fix codecov report merge * Revert renaming of test_matrix_success * Remove SWEAgent and PlannerAgent from tests * Mark planner agent and SWE agent as deprecated * CodeCov: Ignore planner and sweagent * Revert "Remove SWEAgent and PlannerAgent from tests" This reverts commit `040cb3bfb9`. 
* Remove all tests for SWE Agent * Only keep basic tests for MonologueAgent and PlannerAgent * Mark SWE Agent as deprecated, and ignore code coverage for it --------- Co-authored-by: Engel Nyst <enyst@users.noreply.github.com> * Fix Repeated Responses in Chat by Adding IPythonRunCellObservation (#1987) Co-authored-by: jianghongwei <jianghongwei@58.com> Co-authored-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com> * Save CI cycles for backend tests (#1985) * Fix typo in prompt (#1992) * Refactor monologue and SWE agent to use the messages in state history (#1863) * Refactor monologue to use the messages in state history * add messages, clean up * fix monologue * update integration tests * move private method * update SWE agent to use the history from State * integration tests for SWE agent * rename monologue to initial_thoughts, since that is what it is * fix: catch session file not existed exception when init EventStream(maybe creating a new session with no session files stored). (#1994) * add ml-bench in readme * Bump boto3 from 1.34.110 to 1.34.111 (#2001) Bumps [boto3](https://github.com/boto/boto3) from 1.34.110 to 1.34.111. - [Release notes](https://github.com/boto/boto3/releases) - [Changelog](https://github.com/boto/boto3/blob/develop/CHANGELOG.rst) - [Commits](https://github.com/boto/boto3/compare/1.34.110...1.34.111) --- updated-dependencies: - dependency-name: boto3 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump docker from 7.0.0 to 7.1.0 (#2002) Bumps [docker](https://github.com/docker/docker-py) from 7.0.0 to 7.1.0. 
- [Release notes](https://github.com/docker/docker-py/releases) - [Commits](https://github.com/docker/docker-py/compare/7.0.0...7.1.0) --- updated-dependencies: - dependency-name: docker dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump litellm from 1.37.20 to 1.38.0 (#2005) Bumps [litellm](https://github.com/BerriAI/litellm) from 1.37.20 to 1.38.0. - [Release notes](https://github.com/BerriAI/litellm/releases) - [Commits](https://github.com/BerriAI/litellm/compare/v1.37.20...v1.38.0) --- updated-dependencies: - dependency-name: litellm dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Fix SWE-Bench evaluation due to setuptools version (#1995) * correctly setup plugins for swebench eval * bump swe-bench version and add logging * Revert "correctly setup plugins for swebench eval" This reverts commit `2bd1055673`. 
* bump version * fix session state after resuming (#1999) * fix state resuming * fix session reconnection * fix lint * Implement `agentskills` for OpenDevin to helpfully improve edit AND including more useful tools/skills (#1941) * add draft for skills * Implement and test agentskills functions: open_file, goto_line, scroll_down, scroll_up, create_file, search_dir, search_file, find_file * Remove new_sample.txt file * add some work from opendevin w/ fixes * Add unit tests for agentskills module * fix some issues and updated tests * add more tests for open * tweak and handle goto_line * add tests for some edge cases * add tests for scrolling * add tests for edit * add tests for search_dir * update tests to use pytest * use pytest --forked to avoid file op unit tests to interfere with each other via global var * update doc based on swe agent tool * update and add tests for find_file and search_file * move agent_skills to plugins * add agentskills as plugin and docs * add agentskill to ssh box and fix sandbox integration * remove extra returns in doc * add agentskills to initial tool for jupyter * support re-init jupyter kernel (for agentskills) after restart * fix print window's issue with indentation and add testcases * add prompt for codeact with the newest edit primitives * modify the way line number is presented (remove leading space) * change prompt to the newest display format * support tracking of costs via metrics * Update opendevin/runtime/plugins/agent_skills/README.md * Update opendevin/runtime/plugins/agent_skills/README.md * implement and add tests for py linting * remove extra text arg for incompatible subprocess ver * remove sample.txt * update test_edits integration tests * fix all integration * Update opendevin/runtime/plugins/agent_skills/README.md * Update opendevin/runtime/plugins/agent_skills/README.md * Update opendevin/runtime/plugins/agent_skills/README.md * Update agenthub/codeact_agent/prompt.py Co-authored-by: Boxuan Li 
<liboxuan@connect.hku.hk> * Update agenthub/codeact_agent/prompt.py Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> * Update agenthub/codeact_agent/prompt.py Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> * Update opendevin/runtime/plugins/agent_skills/agentskills.py Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> * correctly setup plugins for swebench eval * bump swe-bench version and add logging * correctly setup plugins for swebench eval * bump swe-bench version and add logging * Revert "correctly setup plugins for swebench eval" This reverts commit `2bd1055673`. * bump version * remove _AGENT_SKILLS_DOCS * move flake8 to test dep * update poetry.lock * remove extra arg * reduce max iter for eval * update poetry * fix integration tests --------- Co-authored-by: OpenDevin <opendevin@opendevin.ai> Co-authored-by: Engel Nyst <enyst@users.noreply.github.com> Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> * build: Add poetry command to use Python 3.11 for environment setup (#1972) * Bump @react-types/shared from 3.23.0 to 3.23.1 in /frontend (#2006) Bumps [@react-types/shared](https://github.com/adobe/react-spectrum) from 3.23.0 to 3.23.1. - [Release notes](https://github.com/adobe/react-spectrum/releases) - [Commits](https://github.com/adobe/react-spectrum/compare/@react-types/shared@3.23.0...@react-types/shared@3.23.1) --- updated-dependencies: - dependency-name: "@react-types/shared" dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump @types/react-syntax-highlighter in /frontend (#2007) Bumps [@types/react-syntax-highlighter](https://github.com/DefinitelyTyped/DefinitelyTyped/tree/HEAD/types/react-syntax-highlighter) from 15.5.11 to 15.5.13. 
- [Release notes](https://github.com/DefinitelyTyped/DefinitelyTyped/releases) - [Commits](https://github.com/DefinitelyTyped/DefinitelyTyped/commits/HEAD/types/react-syntax-highlighter) --- updated-dependencies: - dependency-name: "@types/react-syntax-highlighter" dependency-type: direct:development update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump @typescript-eslint/parser from 7.9.0 to 7.10.0 in /frontend (#2008) Bumps [@typescript-eslint/parser](https://github.com/typescript-eslint/typescript-eslint/tree/HEAD/packages/parser) from 7.9.0 to 7.10.0. - [Release notes](https://github.com/typescript-eslint/typescript-eslint/releases) - [Changelog](https://github.com/typescript-eslint/typescript-eslint/blob/main/packages/parser/CHANGELOG.md) - [Commits](https://github.com/typescript-eslint/typescript-eslint/commits/v7.10.0/packages/parser) --- updated-dependencies: - dependency-name: "@typescript-eslint/parser" dependency-type: direct:development update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump lint-staged from 15.2.2 to 15.2.4 in /frontend (#2009) Bumps [lint-staged](https://github.com/okonet/lint-staged) from 15.2.2 to 15.2.4. - [Release notes](https://github.com/okonet/lint-staged/releases) - [Changelog](https://github.com/lint-staged/lint-staged/blob/master/CHANGELOG.md) - [Commits](https://github.com/okonet/lint-staged/compare/v15.2.2...v15.2.4) --- updated-dependencies: - dependency-name: lint-staged dependency-type: direct:development update-type: version-update:semver-patch ... 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Update README.md * Update README.md * add run_infer.sh * fix input output * fix docker sandbox * fix run * update and clean run_infer.py * add script to clean up dockers * update repo uid * add description * new * Update README.md * use root for sandbox * update readme * update ml-bench conda env * update readme * update readme * use try except * modify raise exception * add int * update README * longer time * fix existing issues * fix existing issue * new docker image * add metrics of cost * add result parsing cost * fix * fix * update summarize * fix --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: Ubuntu <ubuntu@ip-172-31-31-157.ec2.internal> Co-authored-by: RainRat <rainrat78@yahoo.ca> Co-authored-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com> Co-authored-by: Frank Xu <frankxu2004@gmail.com> Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com> Co-authored-by: Graham Neubig <neubig@gmail.com> Co-authored-by: Shimada666 <649940882@qq.com> Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> Co-authored-by: Xingyao Wang <xingyao6@illinois.edu> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Engel Nyst <enyst@users.noreply.github.com> Co-authored-by: Robert Brennan <accounts@rbren.io> Co-authored-by: Rahul Anand <62982824+zeul22@users.noreply.github.com> Co-authored-by: jiangleo <jiangleo@users.noreply.github.com> Co-authored-by: jianghongwei <jianghongwei@58.com> Co-authored-by: Jeremi Joslin <jeremi@newlogic.com> Co-authored-by: Aaron Xia <zhhuaxia@gmail.com> Co-authored-by: OpenDevin <opendevin@opendevin.ai> Co-authored-by: DaxServer <7479937+DaxServer@users.noreply.github.com> Co-authored-by: Robert <871607149@qq.com>	2024-06-06 03:53:21 +00:00
Frank Xu	48151bdbb0	[feat] WebArena benchmark, MiniWoB++ benchmark and related arch changes (#2170) * add webarena, and revamp messaging for webarena eval * add changes for browsergym * update infer script * fix unit tests * update * add multiple run for miniwob * update instruction, remove personal path * update * add code for getting final reward, fix integration, add results * add avg cost calculation	2024-06-06 09:01:20 +08:00
மனோஜ்குமார் பழனிச்சாமி	ae815b20d2	Improved logs (#2272)	2024-06-05 17:54:40 +05:30
Boxuan Li	208b1461ca	[AgentBench evaluation] set run_as_devin to true (#2269) Co-authored-by: Leo <ifuryst@gmail.com>	2024-06-05 07:53:33 +00:00
Ryan H. Tran	0584e428b2	[Mint evaluation] Fix bug in stopping when the agent reaches max steps or solution proposals (#2268) * fix: bug in stopping when the agent reaches max steps or solution proposals * remove --eval-num-workers * update env.py	2024-06-05 06:47:07 +00:00
super-dainiu	ebafb702e5	Add ML-Bench Evaluation with OpenDevin (#2015 ) * add ml-bench w/o exec env * fix typos (#1956) no functional change * Refactored Logs (#1939) * [Feat] A competitive Web Browsing agent (#1856) * initial attempt at a browsing only agent * add browsing agent * update * implement agent * update * fix comments * remove unnecessary things from memory extras * update image processing --------- Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com> * Update README.md SWE-bench score (#1959) * Update README.md SWE-bench score Our most recent results on swe-bench lite are 25%, so this updates the README accordingly. * Update * fix: llm is_local function logic error (#1961) Co-authored-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com> * doc: update documentation about poetry update (#1962) * add doc * Update Development.md --------- Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> * feat: add metrics related to cost for better observability (#1944) * add metrics for total_cost * make lint * refact codeact * change metrics into llm * add costs list, add into state * refactor log completion * refactor and test others * make lint * Update opendevin/core/metrics.py Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> * Update opendevin/llm/llm.py Co-authored-by: Xingyao Wang <xingyao6@illinois.edu> * refactor * add code --------- Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> Co-authored-by: Xingyao Wang <xingyao6@illinois.edu> * doc: add more cmd in unit test documentation (#1963) * --- (#1975) updated-dependencies: - dependency-name: boto3 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * --- (#1976) updated-dependencies: - dependency-name: litellm dependency-type: direct:production update-type: version-update:semver-patch ... 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Logging security (#1943) * update .gitignore * Rename the confusing 'INFO' style to 'DETAIL' * override str and repr * feat: api_key desensitize * feat: add SensitiveDataFilter in file handler * tweak regex, add tests * more tweaks, include other attrs * add env vars, those with equivalent config * fix tests * tests are invaluable --------- Co-authored-by: Shimada666 <649940882@qq.com> * --- (#1967) updated-dependencies: - dependency-name: react-dom dependency-type: direct:production update-type: version-update:semver-minor - dependency-name: "@types/react-dom" dependency-type: direct:development update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * --- (#1968) updated-dependencies: - dependency-name: "@reduxjs/toolkit" dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * --- (#1969) updated-dependencies: - dependency-name: husky dependency-type: direct:development update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * --- (#1970) updated-dependencies: - dependency-name: tailwind-merge dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * --- (#1971) updated-dependencies: - dependency-name: i18next dependency-type: direct:production update-type: version-update:semver-minor ... 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com> * Refactor session management (#1810) * refactor session mgmt * defer file handling to runtime * add todo * refactor sessions a bit more * remove messages logic from FE * fix up socket handshake * refactor frontend auth a bit * first pass at redoing file explorer * implement directory suffix * fix up file tree * close agent on websocket close * remove session saving * move file refresh * remove getWorkspace * plumb path/code differently * fix build issues * fix the tests * fix npm build * add session rehydration * fix event serialization * logspam * fix user message rehydration * add get_event fn * agent state restoration * change history tracking for codeact * fix responsiveness of init * fix lint * lint * delint * fix prop * update tests * logspam * lint * fix test * revert codeact * change fileService to use API * fix up session loading * delint * delint * fix integration tests * revert test * fix up access to options endpoints * fix initial files load * delint * fix file initialization * fix mock server * fixl int * fix auth for html * Update frontend/src/i18n/translation.json Co-authored-by: Xingyao Wang <xingyao6@illinois.edu> * refactor sessions and sockets * avoid reinitializing the same session * fix reconnect issue * change up intro message * more guards on reinit * rename agent_session * delint * fix a bunch of tests * delint * fix last test * remove code editor context * fix build * fix any * fix dot notation * Update frontend/src/services/api.ts Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> * fix up error handling * Update opendevin/server/session/agent.py Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> * Update opendevin/server/session/agent.py Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> * Update frontend/src/services/session.ts 
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> * fix build errs * fix else * add closed state * delint * Update opendevin/server/session/session.py Co-authored-by: Engel Nyst <enyst@users.noreply.github.com> --------- Co-authored-by: Xingyao Wang <xingyao6@illinois.edu> Co-authored-by: Graham Neubig <neubig@gmail.com> Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> Co-authored-by: Engel Nyst <enyst@users.noreply.github.com> * fix #1960 (#1964) * Add ruff for shared mutable defaults (B) (#1938) * Add ruff for shared mutable defaults (B) * Apply B006, B008 on current files, except fast API * Update agenthub/SWE_agent/prompts.py Co-authored-by: Graham Neubig <neubig@gmail.com> * fix unintended behavior change * this is correct, tell Ruff to leave it alone --------- Co-authored-by: Graham Neubig <neubig@gmail.com> Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> * Refactor integration testing CI, add optional Mac tests, and mark a few agents as deprecated (#1888) * Add MacOS to integration tests * Switch back to python 3.11 * Install Docker for macos pipeline * regenerate.sh: Use environmental variable for sandbox type * Pack different agents' tests into a single check * Fix CodeAct tests * Reduce file match and extensive debug logs * Add TEST_IN_CI mode that reports codecov * Small fix: don't quit if reusing old responses failed * Merge codecov results * Fix typos * Remove coverage merge step - codecov automatically does that * Make mac integration tests as optional - too slow * Fix codecov args * Add comments in yaml * Include sandbox type in codecov report name * Fix codecov report merge * Revert renaming of test_matrix_success * Remove SWEAgent and PlannerAgent from tests * Mark planner agent and SWE agent as deprecated * CodeCov: Ignore planner and sweagent * Revert "Remove SWEAgent and PlannerAgent from tests" This reverts commit `040cb3bfb9`. 
* Remove all tests for SWE Agent * Only keep basic tests for MonologueAgent and PlannerAgent * Mark SWE Agent as deprecated, and ignore code coverage for it --------- Co-authored-by: Engel Nyst <enyst@users.noreply.github.com> * Fix Repeated Responses in Chat by Adding IPythonRunCellObservation (#1987) Co-authored-by: jianghongwei <jianghongwei@58.com> Co-authored-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com> * Save CI cycles for backend tests (#1985) * Fix typo in prompt (#1992) * Refactor monologue and SWE agent to use the messages in state history (#1863) * Refactor monologue to use the messages in state history * add messages, clean up * fix monologue * update integration tests * move private method * update SWE agent to use the history from State * integration tests for SWE agent * rename monologue to initial_thoughts, since that is what it is * fix: catch session file not existed exception when init EventStream(maybe creating a new session with no session files stored). (#1994) * add ml-bench in readme * Bump boto3 from 1.34.110 to 1.34.111 (#2001) Bumps [boto3](https://github.com/boto/boto3) from 1.34.110 to 1.34.111. - [Release notes](https://github.com/boto/boto3/releases) - [Changelog](https://github.com/boto/boto3/blob/develop/CHANGELOG.rst) - [Commits](https://github.com/boto/boto3/compare/1.34.110...1.34.111) --- updated-dependencies: - dependency-name: boto3 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump docker from 7.0.0 to 7.1.0 (#2002) Bumps [docker](https://github.com/docker/docker-py) from 7.0.0 to 7.1.0. 
- [Release notes](https://github.com/docker/docker-py/releases) - [Commits](https://github.com/docker/docker-py/compare/7.0.0...7.1.0) --- updated-dependencies: - dependency-name: docker dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump litellm from 1.37.20 to 1.38.0 (#2005) Bumps [litellm](https://github.com/BerriAI/litellm) from 1.37.20 to 1.38.0. - [Release notes](https://github.com/BerriAI/litellm/releases) - [Commits](https://github.com/BerriAI/litellm/compare/v1.37.20...v1.38.0) --- updated-dependencies: - dependency-name: litellm dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Fix SWE-Bench evaluation due to setuptools version (#1995) * correctly setup plugins for swebench eval * bump swe-bench version and add logging * Revert "correctly setup plugins for swebench eval" This reverts commit `2bd1055673`. 
* bump version * fix session state after resuming (#1999) * fix state resuming * fix session reconnection * fix lint * Implement `agentskills` for OpenDevin to helpfully improve edit AND including more useful tools/skills (#1941) * add draft for skills * Implement and test agentskills functions: open_file, goto_line, scroll_down, scroll_up, create_file, search_dir, search_file, find_file * Remove new_sample.txt file * add some work from opendevin w/ fixes * Add unit tests for agentskills module * fix some issues and updated tests * add more tests for open * tweak and handle goto_line * add tests for some edge cases * add tests for scrolling * add tests for edit * add tests for search_dir * update tests to use pytest * use pytest --forked to avoid file op unit tests to interfere with each other via global var * update doc based on swe agent tool * update and add tests for find_file and search_file * move agent_skills to plugins * add agentskills as plugin and docs * add agentskill to ssh box and fix sandbox integration * remove extra returns in doc * add agentskills to initial tool for jupyter * support re-init jupyter kernel (for agentskills) after restart * fix print window's issue with indentation and add testcases * add prompt for codeact with the newest edit primitives * modify the way line number is presented (remove leading space) * change prompt to the newest display format * support tracking of costs via metrics * Update opendevin/runtime/plugins/agent_skills/README.md * Update opendevin/runtime/plugins/agent_skills/README.md * implement and add tests for py linting * remove extra text arg for incompatible subprocess ver * remove sample.txt * update test_edits integration tests * fix all integration * Update opendevin/runtime/plugins/agent_skills/README.md * Update opendevin/runtime/plugins/agent_skills/README.md * Update opendevin/runtime/plugins/agent_skills/README.md * Update agenthub/codeact_agent/prompt.py Co-authored-by: Boxuan Li 
<liboxuan@connect.hku.hk> * Update agenthub/codeact_agent/prompt.py Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> * Update agenthub/codeact_agent/prompt.py Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> * Update opendevin/runtime/plugins/agent_skills/agentskills.py Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> * correctly setup plugins for swebench eval * bump swe-bench version and add logging * correctly setup plugins for swebench eval * bump swe-bench version and add logging * Revert "correctly setup plugins for swebench eval" This reverts commit `2bd1055673`. * bump version * remove _AGENT_SKILLS_DOCS * move flake8 to test dep * update poetry.lock * remove extra arg * reduce max iter for eval * update poetry * fix integration tests --------- Co-authored-by: OpenDevin <opendevin@opendevin.ai> Co-authored-by: Engel Nyst <enyst@users.noreply.github.com> Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> * build: Add poetry command to use Python 3.11 for environment setup (#1972) * Bump @react-types/shared from 3.23.0 to 3.23.1 in /frontend (#2006) Bumps [@react-types/shared](https://github.com/adobe/react-spectrum) from 3.23.0 to 3.23.1. - [Release notes](https://github.com/adobe/react-spectrum/releases) - [Commits](https://github.com/adobe/react-spectrum/compare/@react-types/shared@3.23.0...@react-types/shared@3.23.1) --- updated-dependencies: - dependency-name: "@react-types/shared" dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump @types/react-syntax-highlighter in /frontend (#2007) Bumps [@types/react-syntax-highlighter](https://github.com/DefinitelyTyped/DefinitelyTyped/tree/HEAD/types/react-syntax-highlighter) from 15.5.11 to 15.5.13. 
- [Release notes](https://github.com/DefinitelyTyped/DefinitelyTyped/releases) - [Commits](https://github.com/DefinitelyTyped/DefinitelyTyped/commits/HEAD/types/react-syntax-highlighter) --- updated-dependencies: - dependency-name: "@types/react-syntax-highlighter" dependency-type: direct:development update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump @typescript-eslint/parser from 7.9.0 to 7.10.0 in /frontend (#2008) Bumps [@typescript-eslint/parser](https://github.com/typescript-eslint/typescript-eslint/tree/HEAD/packages/parser) from 7.9.0 to 7.10.0. - [Release notes](https://github.com/typescript-eslint/typescript-eslint/releases) - [Changelog](https://github.com/typescript-eslint/typescript-eslint/blob/main/packages/parser/CHANGELOG.md) - [Commits](https://github.com/typescript-eslint/typescript-eslint/commits/v7.10.0/packages/parser) --- updated-dependencies: - dependency-name: "@typescript-eslint/parser" dependency-type: direct:development update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump lint-staged from 15.2.2 to 15.2.4 in /frontend (#2009) Bumps [lint-staged](https://github.com/okonet/lint-staged) from 15.2.2 to 15.2.4. - [Release notes](https://github.com/okonet/lint-staged/releases) - [Changelog](https://github.com/lint-staged/lint-staged/blob/master/CHANGELOG.md) - [Commits](https://github.com/okonet/lint-staged/compare/v15.2.2...v15.2.4) --- updated-dependencies: - dependency-name: lint-staged dependency-type: direct:development update-type: version-update:semver-patch ... 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Update README.md * Update README.md * add run_infer.sh * fix input output * fix docker sandbox * fix run * update and clean run_infer.py * add script to clean up dockers * update repo uid * add description * new * Update README.md * use root for sandbox * update readme * update ml-bench conda env * update readme * update readme * use try except * modify raise exception * add int * update README * longer time * fix existing issues * fix existing issue * new docker image * add metrics of cost * add result parsing cost * fix --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: Ubuntu <ubuntu@ip-172-31-31-157.ec2.internal> Co-authored-by: RainRat <rainrat78@yahoo.ca> Co-authored-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com> Co-authored-by: Frank Xu <frankxu2004@gmail.com> Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com> Co-authored-by: Graham Neubig <neubig@gmail.com> Co-authored-by: Shimada666 <649940882@qq.com> Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> Co-authored-by: Xingyao Wang <xingyao6@illinois.edu> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Engel Nyst <enyst@users.noreply.github.com> Co-authored-by: Robert Brennan <accounts@rbren.io> Co-authored-by: Rahul Anand <62982824+zeul22@users.noreply.github.com> Co-authored-by: jiangleo <jiangleo@users.noreply.github.com> Co-authored-by: jianghongwei <jianghongwei@58.com> Co-authored-by: Jeremi Joslin <jeremi@newlogic.com> Co-authored-by: Aaron Xia <zhhuaxia@gmail.com> Co-authored-by: OpenDevin <opendevin@opendevin.ai> Co-authored-by: DaxServer <7479937+DaxServer@users.noreply.github.com> Co-authored-by: Robert <871607149@qq.com>	2024-06-05 01:56:39 +00:00
Leo	040d6bd806	fix: add an early exit check for agent answers in agent bench. (#2257) Signed-off-by: ifuryst <ifuryst@gmail.com>	2024-06-04 18:45:07 -07:00
tobitege	5776474dcf	Fix SWE-Bench README typos (#2250)	2024-06-05 01:18:02 +00:00
Leo	9ada36e30b	fix: restore python linting. (#2228) * fix: restore python linting. Signed-off-by: ifuryst <ifuryst@gmail.com> * update: extend the Python lint check to evaluation. Signed-off-by: ifuryst <ifuryst@gmail.com> * Update evaluation/logic_reasoning/instruction.txt --------- Signed-off-by: ifuryst <ifuryst@gmail.com> Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>	2024-06-04 06:36:19 +00:00
finaltrip	05b84df9cb	chore: fix some comments (#2234) Signed-off-by: finaltrip <finaltrip@qq.com>	2024-06-03 16:04:34 +00:00
Boxuan Li	538d1d85a2	evaluation: Reset configs in finally block (#2214)	2024-06-03 09:52:12 +08:00
Ryan H. Tran	22e8fb39b1	add cost metrics to evaluation outputs for all benchmarks (#2199)	2024-06-02 08:28:00 +00:00
Yizhe Zhang	8d79c3edbc	modify the exiting logic and reward calculation, delete unused function (#2198)	2024-06-02 06:38:09 +00:00
RainRat	ed6dcc8381	fix typos (#2187) * fix typos no functional change * fix typos	2024-06-01 20:40:30 +00:00
Leo	2c231c57c9	Add supported benchmarks to evaluation README (AgentBench, BIRD, LogicReasoning) (#2183) Signed-off-by: ifuryst <ifuryst@gmail.com>	2024-06-01 11:33:01 -04:00
Binyuan Hui	46dcf4bb3e	Support BIRD benchmark (#2117) * update: change timeout from 10 to 30 * update: readme for bird evaluation * Update evaluation/bird/run_infer.py Co-authored-by: Engel Nyst <enyst@users.noreply.github.com> * Update evaluation/bird/README.md Co-authored-by: Shimada666 <649940882@qq.com> * Update evaluation/bird/README.md Co-authored-by: Shimada666 <649940882@qq.com> * Update evaluation/bird/run_infer.py Co-authored-by: Engel Nyst <enyst@users.noreply.github.com> --------- Co-authored-by: Engel Nyst <enyst@users.noreply.github.com> Co-authored-by: Shimada666 <649940882@qq.com> Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com>	2024-06-01 11:34:36 +00:00
Leo	be251b11de	Add AgentBench. (#2012 ) * Add AgentBench. * Load the datasets from HF. Signed-off-by: ifuryst <ifuryst@gmail.com> * Add helper functions. * Add mock executor. Signed-off-by: ifuryst <ifuryst@gmail.com> * Add retriv agent answer cmd. * Adjust the dataset. * Refine test results. Signed-off-by: ifuryst <ifuryst@gmail.com> * Consolidate all AgentBench datasets and scripts into a single CSV dataset. * Refactor dataset source. * Update helper functions. Signed-off-by: ifuryst <ifuryst@gmail.com> * Fix the CRLF problem. Signed-off-by: ifuryst <ifuryst@gmail.com> * Separate the instance's workspace. Signed-off-by: ifuryst <ifuryst@gmail.com> * Add cleanup logic and error handling for sandbox closure. * Normalized dataset Signed-off-by: ifuryst <ifuryst@gmail.com> * Update README. Signed-off-by: ifuryst <ifuryst@gmail.com> * Update the prompt to capture the answer. Signed-off-by: ifuryst <ifuryst@gmail.com> * Refactor script execution paths to use absolute container workspace path. Signed-off-by: ifuryst <ifuryst@gmail.com> * Update AgentBench README. Signed-off-by: ifuryst <ifuryst@gmail.com> * Delete useless functions. Signed-off-by: ifuryst <ifuryst@gmail.com> * Update evaluation/agent_bench/README.md * Add script to summarize test results from JSONL file in AgentBench Signed-off-by: ifuryst <ifuryst@gmail.com> * Delete useless script and codes. Signed-off-by: ifuryst <ifuryst@gmail.com> * Update evaluation/agent_bench/scripts/summarise_results.py --------- Signed-off-by: ifuryst <ifuryst@gmail.com> Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>	2024-06-01 07:58:14 +00:00
Ryan H. Tran	01296ff79d	Add remaining subsets for MINT benchmark (#2142 ) * add MMLU subset * add theoremqa subset * remove redundant packages from requirements.txt, adjust prompts, handle gpt3.5 propose a wrong answer after a correct answer * add MBPP subset * add humaneval subset * update README * exit actively after the agent finishes the task	2024-05-31 20:04:13 +00:00
Boxuan Li	4d14b44a9a	SWE-bench: Add summarise utility script to view passed/failed task IDs (#2137 ) * SWE-bench: Add summarise utility script to view passed/failed task IDs * Fix typos * Move file * Prettify * Use merged jsonl file	2024-05-31 12:32:17 +08:00
Boxuan Li	f188abd7a3	Delete evaluation outputs files (#2152) * Delete evaluation outputs files * Fix README	2024-05-31 03:12:27 +00:00
Ren Ma	a9823491e6	Support Logic Reasoning Benchmark (#1973)	2024-05-30 16:35:15 +08:00
Xingyao Wang	01ef90205d	Add CodeActSWEAgent to remove browsing & github + improvements on agentskills (#2105 ) * update swe_bench prompt; use minimal prompt for codeact; * upgrade agentskills and update testcases * update infer prompt * fix cwd * add icl for swebench * also log in_context_example to run infer * remove extra print * change prompt to abs path * update error message to include current file info * change cwd for jupyter if needed * update edit error message * update prompt * improve git get patch * update hint string * default to 50 turns * revert changes from codeact agent and create new CodeActSWEAgent * revert changes to codeact * revert instructions for run infer * revert instructions for run infer * update README * update max iter * add codeact swe agent * fix issue for CodeActSWEAgent * allow specifying max iter in cmdline script * stop printing * Update agenthub/codeact_swe_agent/README.md Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com> * Fix prompt regression in jupyter plugin --------- Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com> Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>	2024-05-29 21:19:00 -07:00
Ryan H. Tran	9434bcce48	Support MINT benchmark (MATH, GSM8K subset) (#1955 ) * setup boilerplate and README * setup test script and load dataset * add temp intg that works * refactor code * add solution evaluation through 'fake_user_response_fn' * finish integrating MATH subset * Update evaluation/mint/run_infer.py * Update evaluation/mint/run_infer.sh * Update opendevin/core/main.py * remove redudant templates, add eval_note, update README * use <execute_ipython> tag instead of <execute> * hardcode AGENT option for run_infer.sh * Update evaluation/mint/task.py Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com> * fix: bug no message returned when task's success * change message to make the agent exit * import bash abstractmethod * install all required packages inside sandbox before the agent runs, adjust prompt * add subset eval folder separation and test for gsm8k * fix bug in Reasoning task result check, add requirements.txt * Fix syntax error in evaluation/mint/run_infer.py * update README, add default values for `SUBSET` and `EVAL_LIMIT` --------- Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com> Co-authored-by: yufansong <yufan@risingwave-labs.com> Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>	2024-05-28 07:42:52 +00:00
Xingyao Wang	2c0a2dbc61	fix yet another swe_bench issue (#2069)	2024-05-26 10:01:43 -07:00
Gant	f0271f9f91	need to run as root to use SWEBench container (#2068)	2024-05-26 14:21:33 +00:00
Xingyao Wang	5114230e53	Some SWE-Bench infer fixes and improvements (#2065 ) * reset workspace base properly * support running without hint * support running without hint * bump swe-bench eval docker to v1.2 for latest agentskills * only give hint when use hint text is trie * add swe-agent instructions for validation * update dockerfile * pin the python interpreter for execute_cli * avoid initialize plugins twice * default to use hint * save results to swe_bench_lite * unset gh token and increase max iter to 50 * remove printing of use hint status * refractor ssh login into one function * ok drop to 30 turns bc it is so expensive :( * remove reproduce comments to avoid stuck	2024-05-26 10:02:11 +00:00
Yizhe Zhang	0c829cd067	Support Entity-Deduction-Arena (EDA) Benchmark (#1931 ) * adding draft evaluation code for EDA, using chatgpt as the temporal agent for now * Update README.md * Delete frontend/package.json * reverse the irrelevant changes * reverse package.json * use chatgpt as the codeactagent * integrate with opendevin * Update evaluation/EDA/README.md * Update evaluation/EDA/README.md * Use poetry to manage packages * integrate with opendevin * minor update * minor update * update poetry * update README * clean-up infer scripts * add run_infer script and improve readme * log final success and final message & ground truth --------- Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Xingyao Wang <xingyao6@illinois.edu> Co-authored-by: yufansong <yufan@risingwave-labs.com> Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>	2024-05-25 23:17:04 +08:00
Xingyao Wang	28ab00946b	update README for GAIA (#2054 ) * update README for GAIA * Update evaluation/gaia/README.md * Update evaluation/gaia/README.md * Update evaluation/gaia/README.md --------- Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com>	2024-05-25 15:01:03 +00:00
Jiayi Pan	2d52298a1d	Support GAIA benchmark (#1911 ) * Add gaia test * Improve gaia prompts * Fix browser_env hang bug * Fix gaia bugs * add gaia to eval readme * Fix gaia bugs * minor fix * add run_infer.sh and update readme * set num eval worker to 1 * default to 2023 gaia level1 subset * default to level 1 * add prompt to instruct model enclose answer within <solution> tag * add missing break --------- Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> Co-authored-by: yufansong <yufan@risingwave-labs.com> Co-authored-by: Xingyao Wang <xingyao6@illinois.edu>	2024-05-24 11:22:28 +00:00
Xingyao Wang	602ffcdffb	Implement `agentskills` for OpenDevin to helpfully improve edit AND including more useful tools/skills (#1941 ) * add draft for skills * Implement and test agentskills functions: open_file, goto_line, scroll_down, scroll_up, create_file, search_dir, search_file, find_file * Remove new_sample.txt file * add some work from opendevin w/ fixes * Add unit tests for agentskills module * fix some issues and updated tests * add more tests for open * tweak and handle goto_line * add tests for some edge cases * add tests for scrolling * add tests for edit * add tests for search_dir * update tests to use pytest * use pytest --forked to avoid file op unit tests to interfere with each other via global var * update doc based on swe agent tool * update and add tests for find_file and search_file * move agent_skills to plugins * add agentskills as plugin and docs * add agentskill to ssh box and fix sandbox integration * remove extra returns in doc * add agentskills to initial tool for jupyter * support re-init jupyter kernel (for agentskills) after restart * fix print window's issue with indentation and add testcases * add prompt for codeact with the newest edit primitives * modify the way line number is presented (remove leading space) * change prompt to the newest display format * support tracking of costs via metrics * Update opendevin/runtime/plugins/agent_skills/README.md * Update opendevin/runtime/plugins/agent_skills/README.md * implement and add tests for py linting * remove extra text arg for incompatible subprocess ver * remove sample.txt * update test_edits integration tests * fix all integration * Update opendevin/runtime/plugins/agent_skills/README.md * Update opendevin/runtime/plugins/agent_skills/README.md * Update opendevin/runtime/plugins/agent_skills/README.md * Update agenthub/codeact_agent/prompt.py Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> * Update agenthub/codeact_agent/prompt.py Co-authored-by: Boxuan Li 
<liboxuan@connect.hku.hk> * Update agenthub/codeact_agent/prompt.py Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> * Update opendevin/runtime/plugins/agent_skills/agentskills.py Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk> * correctly setup plugins for swebench eval * bump swe-bench version and add logging * correctly setup plugins for swebench eval * bump swe-bench version and add logging * Revert "correctly setup plugins for swebench eval" This reverts commit `2bd1055673`. * bump version * remove _AGENT_SKILLS_DOCS * move flake8 to test dep * update poetry.lock * remove extra arg * reduce max iter for eval * update poetry * fix integration tests --------- Co-authored-by: OpenDevin <opendevin@opendevin.ai> Co-authored-by: Engel Nyst <enyst@users.noreply.github.com> Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>	2024-05-23 16:04:09 +00:00
Xingyao Wang	6ff50ed369	Fix SWE-Bench evaluation due to setuptools version (#1995 ) * correctly setup plugins for swebench eval * bump swe-bench version and add logging * Revert "correctly setup plugins for swebench eval" This reverts commit `2bd1055673`. * bump version	2024-05-23 23:17:42 +08:00
Niklas Muennighoff	ef6cdb7532	HumanEvalFix integration (#1908 ) * Preliminary HumanEvalFix integration * Clean paths * fix: set workspace path correctly for config fix: task in that contains / * add missing run_infer.sh * update run_infer w/o hard coded agent * fix typo * change `instance_id` to `task_id` * add the warning and env var setting to run_infer.sh * reset back workspace mount at the end of each instance * 10 max iter is probably enough for humanevalfix * Remove unneeded section Co-authored-by: Xingyao Wang <xingyao6@illinois.edu> * Fix link Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com> * Use logger Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com> * Update run_infer.py fix a bug: ERROR:concurrent.futures:exception calling callback for <Future at 0x309cbc470 state=finished raised NameError> concurrent.futures.process._RemoteTraceback: * Update README.md * Update README.md * Update README.md * Update README.md added an example * Update README.md added: enable_auto_lint = true * Update pyproject.toml add: evaluate package * Delete poetry.lock update poetry.lock * update poetry.lock update poetry.lock * Update README.md * Update README.md --------- Co-authored-by: Xingyao Wang <xingyao6@illinois.edu> Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com> Co-authored-by: Robert <871607149@qq.com>	2024-05-23 13:09:40 +00:00

88 Commits