diff --git a/evaluation/benchmarks/aider_bench/README.md b/evaluation/benchmarks/aider_bench/README.md
index 086cfe5816..1f2ef14a0f 100644
--- a/evaluation/benchmarks/aider_bench/README.md
+++ b/evaluation/benchmarks/aider_bench/README.md
@@ -56,9 +56,10 @@ You can update the arguments in the script
 ./evaluation/benchmarks/aider_bench/scripts/run_infer.sh eval_gpt35_turbo HEAD CodeActAgent 100 1 "1,3,10"
 ```
 
-### Run Inference on `RemoteRuntime` (experimental)
+### Run Inference on `RemoteRuntime`
+
+This is in beta. Fill out [this form](https://docs.google.com/forms/d/e/1FAIpQLSckVz_JFwg2_mOxNZjCtr7aoBFI2Mwdan3f75J_TrdMS1JV2g/viewform) to apply if you want to try this out!
 
-This is in limited beta. Contact Xingyao over slack if you want to try this out!
 
 ```bash
 ./evaluation/benchmarks/aider_bench/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [eval-num-workers] [eval_ids]
diff --git a/evaluation/benchmarks/commit0_bench/README.md b/evaluation/benchmarks/commit0_bench/README.md
index 734875b1f7..5e533ec120 100644
--- a/evaluation/benchmarks/commit0_bench/README.md
+++ b/evaluation/benchmarks/commit0_bench/README.md
@@ -58,9 +58,10 @@ then your command would be:
 ./evaluation/benchmarks/commit0_bench/scripts/run_infer.sh lite llm.eval_sonnet HEAD CodeActAgent 10 30 1 wentingzhao/commit0_combined test
 ```
 
-### Run Inference on `RemoteRuntime` (experimental)
+### Run Inference on `RemoteRuntime`
+
+This is in beta. Fill out [this form](https://docs.google.com/forms/d/e/1FAIpQLSckVz_JFwg2_mOxNZjCtr7aoBFI2Mwdan3f75J_TrdMS1JV2g/viewform) to apply if you want to try this out!
 
-This is in limited beta. Contact Xingyao over slack if you want to try this out!
 
 ```bash
 ./evaluation/benchmarks/commit0_bench/scripts/run_infer.sh [repo_split] [model_config] [git-version] [agent] [eval_limit] [max_iter] [num_workers] [dataset] [dataset_split]
diff --git a/evaluation/benchmarks/miniwob/README.md b/evaluation/benchmarks/miniwob/README.md
index 002489097b..8b457d3e88 100644
--- a/evaluation/benchmarks/miniwob/README.md
+++ b/evaluation/benchmarks/miniwob/README.md
@@ -19,9 +19,10 @@ Access with browser the above MiniWoB URLs and see if they load correctly.
 ./evaluation/benchmarks/miniwob/scripts/run_infer.sh llm.claude-35-sonnet-eval
 ```
 
-### Run Inference on `RemoteRuntime` (experimental)
+### Run Inference on `RemoteRuntime`
+
+This is in beta. Fill out [this form](https://docs.google.com/forms/d/e/1FAIpQLSckVz_JFwg2_mOxNZjCtr7aoBFI2Mwdan3f75J_TrdMS1JV2g/viewform) to apply if you want to try this out!
 
-This is in limited beta. Contact Xingyao over slack if you want to try this out!
 
 ```bash
 ./evaluation/benchmarks/miniwob/scripts/run_infer.sh [model_config] [git-version] [agent] [note] [eval_limit] [num_workers]
diff --git a/evaluation/benchmarks/swe_bench/README.md b/evaluation/benchmarks/swe_bench/README.md
index 4758a6616c..4746024882 100644
--- a/evaluation/benchmarks/swe_bench/README.md
+++ b/evaluation/benchmarks/swe_bench/README.md
@@ -65,9 +65,9 @@ then your command would be:
 ./evaluation/benchmarks/swe_bench/scripts/run_infer.sh llm.eval_gpt4_1106_preview HEAD CodeActAgent 10
 ```
 
-### Run Inference on `RemoteRuntime` (experimental)
+### Run Inference on `RemoteRuntime`
 
-This is in limited beta. Contact Xingyao over slack if you want to try this out!
+This is in beta. Fill out [this form](https://docs.google.com/forms/d/e/1FAIpQLSckVz_JFwg2_mOxNZjCtr7aoBFI2Mwdan3f75J_TrdMS1JV2g/viewform) to apply if you want to try this out!
 
 ```bash
 ./evaluation/benchmarks/swe_bench/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [max_iter] [num_workers] [dataset] [dataset_split]
@@ -163,9 +163,9 @@ The final results will be saved to `evaluation/evaluation_outputs/outputs/swe_be
 - `report.json`: a JSON file that contains keys like `"resolved_ids"` pointing to instance IDs that are resolved by the agent.
 - `logs/`: a directory of test logs
 
-### Run evaluation with `RemoteRuntime` (experimental)
+### Run evaluation with `RemoteRuntime`
 
-This is in limited beta. Contact Xingyao over slack if you want to try this out!
+This is in beta. Fill out [this form](https://docs.google.com/forms/d/e/1FAIpQLSckVz_JFwg2_mOxNZjCtr7aoBFI2Mwdan3f75J_TrdMS1JV2g/viewform) to apply if you want to try this out!
 
 ```bash
 ./evaluation/benchmarks/swe_bench/scripts/eval_infer_remote.sh [output.jsonl filepath] [num_workers]