Xingyao Wang b2fdb963b6
Add detailed tutorial for adding new evaluation benchmarks (#1827)
* Add detailed tutorial for adding new evaluation benchmarks

* update tutorial, fix typo, and log observation to the cmdline

* fix url

* Update evaluation/TUTORIAL.md

* simplify readme and add comments to the actual code

* Fix typo in evaluation/TUTORIAL.md

* Fix typo in evaluation/swe_bench/run_infer.py

* Fix another typo in evaluation/swe_bench/run_infer.py

* Update TUTORIAL.md

* Set host network to false for SWEBench

---------

Co-authored-by: OpenDevin <opendevin@opendevin.ai>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Graham Neubig <neubig@gmail.com>
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
2024-05-18 13:40:53 -04:00

#!/bin/bash
AGENT=CodeActAgent
# IMPORTANT: The agent's prompt changes often as the OpenDevin codebase evolves,
# so we record the agent version with every evaluation run to keep results comparable.
AGENT_VERSION=v$(python3 -c "from agenthub.codeact_agent import CodeActAgent; print(CodeActAgent.VERSION)")
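# Name of the LLM config to use (first argument); fail early if it is missing
MODEL_CONFIG="${1:?Usage: $0 <model_config>}"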
echo "AGENT: $AGENT"
echo "AGENT_VERSION: $AGENT_VERSION"
echo "MODEL_CONFIG: $MODEL_CONFIG"
# Add a config named $MODEL_CONFIG to your `config.toml` before running this script
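# A minimal sketch of such an entry, assuming this script is invoked with
# `eval_gpt4` as its first argument. The section name, keys, and values below
# are illustrative assumptions, not taken from this script; check your own
# `config.toml` for the exact fields it expects.
#
#   [eval_gpt4]
#   model = "gpt-4"
#   api_key = "XXX"
#   temperature = 0.0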
# Quote the variables so empty or space-containing values are passed as single arguments
poetry run python3 evaluation/swe_bench/run_infer.py \
  --agent-cls "$AGENT" \
  --llm-config "$MODEL_CONFIG" \
  --max-iterations 50 \
  --max-chars 10000000 \
  --eval-num-workers 8 \
  --eval-note "$AGENT_VERSION"