diff --git a/evaluation/benchmarks/swe_bench/README.md b/evaluation/benchmarks/swe_bench/README.md
index b1858bf70a..a1bc26fb7f 100644
--- a/evaluation/benchmarks/swe_bench/README.md
+++ b/evaluation/benchmarks/swe_bench/README.md
@@ -183,24 +183,7 @@ The final results will be saved to `evaluation/evaluation_outputs/outputs/swe_be
 - `report.json`: a JSON file that contains keys like `"resolved_ids"` pointing to instance IDs that are resolved by the agent.
 - `logs/`: a directory of test logs
 
-### Run evaluation with `RemoteRuntime`
-OpenHands Remote Runtime is currently in beta (read [here](https://runtime.all-hands.dev/) for more details), it allows you to run rollout in parallel in the cloud, so you don't need a powerful machine to run evaluation.
-Fill out [this form](https://docs.google.com/forms/d/e/1FAIpQLSckVz_JFwg2_mOxNZjCtr7aoBFI2Mwdan3f75J_TrdMS1JV2g/viewform) to apply if you want to try this out!
-
-```bash
-./evaluation/benchmarks/swe_bench/scripts/eval_infer_remote.sh [output.jsonl filepath] [num_workers]
-
-# Example - This evaluates patches generated by CodeActAgent on Llama-3.1-70B-Instruct-Turbo on "princeton-nlp/SWE-bench_Lite"'s test set, with 16 number of workers running in parallel
-ALLHANDS_API_KEY="YOUR-API-KEY" RUNTIME=remote SANDBOX_REMOTE_RUNTIME_API_URL="https://runtime.eval.all-hands.dev" EVAL_DOCKER_IMAGE_PREFIX="us-central1-docker.pkg.dev/evaluation-092424/swe-bench-images" \
-evaluation/benchmarks/swe_bench/scripts/eval_infer_remote.sh evaluation/evaluation_outputs/outputs/swe-bench-lite/CodeActAgent/Llama-3.1-70B-Instruct-Turbo_maxiter_100_N_v1.9-no-hint/output.jsonl 16 "princeton-nlp/SWE-bench_Lite" "test"
-```
-
-To clean-up all existing runtimes that you've already started, run:
-
-```bash
-ALLHANDS_API_KEY="YOUR-API-KEY" ./evaluation/utils/scripts/cleanup_remote_runtime.sh
-```
 
 ## SWT-Bench Evaluation
diff --git a/evaluation/benchmarks/swe_bench/scripts/eval_infer_remote.sh b/evaluation/benchmarks/swe_bench/scripts/eval_infer_remote.sh
deleted file mode 100755
index 1ec07182ac..0000000000
--- a/evaluation/benchmarks/swe_bench/scripts/eval_infer_remote.sh
+++ /dev/null
@@ -1,46 +0,0 @@
-#!/usr/bin/env bash
-set -eo pipefail
-
-INPUT_FILE=$1
-NUM_WORKERS=$2
-DATASET=$3
-SPLIT=$4
-
-if [ -z "$INPUT_FILE" ]; then
-  echo "INPUT_FILE not specified (should be a path to a jsonl file)"
-  exit 1
-fi
-
-if [ -z "$DATASET" ]; then
-  echo "DATASET not specified, use default princeton-nlp/SWE-bench_Lite"
-  DATASET="princeton-nlp/SWE-bench_Lite"
-fi
-
-if [ -z "$SPLIT" ]; then
-  echo "SPLIT not specified, use default test"
-  SPLIT="test"
-fi
-
-if [ -z "$NUM_WORKERS" ]; then
-  echo "NUM_WORKERS not specified, use default 1"
-  NUM_WORKERS=1
-fi
-
-echo "... Evaluating on $INPUT_FILE ..."
-
-COMMAND="poetry run python evaluation/benchmarks/swe_bench/eval_infer.py \
-  --eval-num-workers $NUM_WORKERS \
-  --input-file $INPUT_FILE \
-  --dataset $DATASET \
-  --split $SPLIT"
-
-if [ -n "$EVAL_LIMIT" ]; then
-  echo "EVAL_LIMIT: $EVAL_LIMIT"
-  COMMAND="$COMMAND --eval-n-limit $EVAL_LIMIT"
-fi
-
-# Run the command
-eval $COMMAND
-
-# update the output with evaluation results
-poetry run python evaluation/benchmarks/swe_bench/scripts/eval/update_output_with_eval.py $INPUT_FILE
diff --git a/evaluation/benchmarks/testgeneval/NOTES.md b/evaluation/benchmarks/testgeneval/NOTES.md
index 29d76c421c..5bfbc6643f 100644
--- a/evaluation/benchmarks/testgeneval/NOTES.md
+++ b/evaluation/benchmarks/testgeneval/NOTES.md
@@ -5,8 +5,7 @@ pynguin_ids = ['pydata__xarray-6548-16541', 'pydata__xarray-7003-16557', 'pydata
 ids = ['pydata__xarray-3114-16452', 'pydata__xarray-3151-16453', 'pydata__xarray-3156-16454', 'pydata__xarray-3239-16456', 'pydata__xarray-3239-16457', 'pydata__xarray-3239-16458', 'pydata__xarray-3302-16459', 'pydata__xarray-3364-16461', 'pydata__xarray-3677-16471', 'pydata__xarray-3905-16478', 'pydata__xarray-4182-16484', 'pydata__xarray-4248-16486', 'pydata__xarray-4339-16487', 'pydata__xarray-4419-16488', 'pydata__xarray-4629-16492', 'pydata__xarray-4750-16496', 'pydata__xarray-4802-16505', 'pydata__xarray-4966-16515', 'pydata__xarray-4994-16516', 'pydata__xarray-5033-16517', 'pydata__xarray-5126-16518', 'pydata__xarray-5126-16519', 'pydata__xarray-5131-16520', 'pydata__xarray-5365-16529', 'pydata__xarray-5455-16530', 'pydata__xarray-5662-16532', 'pydata__xarray-5731-16534', 'pydata__xarray-6135-16535', 'pydata__xarray-6135-16536', 'pydata__xarray-6386-16537', 'pydata__xarray-6394-16538', 'pydata__xarray-6400-16539', 'pydata__xarray-6461-16540', 'pydata__xarray-6548-16541', 'pydata__xarray-6599-16543', 'pydata__xarray-6601-16544', 'pydata__xarray-6882-16548', 'pydata__xarray-6889-16549', 'pydata__xarray-7003-16557', 'pydata__xarray-7147-16571', 'pydata__xarray-7150-16572', 'pydata__xarray-7203-16577', 'pydata__xarray-7229-16578', 'pydata__xarray-7393-16581', 'pydata__xarray-7400-16582']
 
-Command eval (our approach):
-poetry run ./evaluation/benchmarks/testgeneval/scripts/eval_infer_remote.sh evaluation/evaluation_outputs/outputs/kjain14__testgeneval-test/CodeActAgent/gpt-4o_maxiter_25_N_v0.20.0-no-hint-run_1/output.jsonl 10 kjain14/testgeneval test true
+
 Command run (our approach):
 ./evaluation/benchmarks/testgeneval/scripts/run_infer.sh llm.eval_gpt HEAD CodeActAgent -1 25 10 kjain14/testgeneval test 1 ../TestGenEval/results/testgeneval/preds/gpt-4o-2024-08-06__testgeneval__0.2__test.jsonl