diff --git a/evaluation/EDA/README.md b/evaluation/EDA/README.md
index b910a6c840..8ae5f7b843 100644
--- a/evaluation/EDA/README.md
+++ b/evaluation/EDA/README.md
@@ -19,7 +19,7 @@ where `model_config` is mandatory, while `git-version`, `agent`, `dataset` and `
 - `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for your
 LLM settings, as defined in your `config.toml`.
 
-- `git-version`, e.g. `head`, is the git commit hash of the OpenDevin version you would
+- `git-version`, e.g. `HEAD`, is the git commit hash of the OpenDevin version you would
 like to evaluate. It could also be a release tag like `0.6.2`.
 
 - `agent`, e.g. `CodeActAgent`, is the name of the agent for benchmarks, defaulting
diff --git a/evaluation/biocoder/README.md b/evaluation/biocoder/README.md
index b9262667c4..6829e1aaa5 100644
--- a/evaluation/biocoder/README.md
+++ b/evaluation/biocoder/README.md
@@ -32,7 +32,7 @@ where `model_config` is mandatory, while `git-version`, `agent`, `dataset` and `
 - `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for your
 LLM settings, as defined in your `config.toml`.
 
-- `git-version`, e.g. `head`, is the git commit hash of the OpenDevin version you would
+- `git-version`, e.g. `HEAD`, is the git commit hash of the OpenDevin version you would
 like to evaluate. It could also be a release tag like `0.6.2`.
 
 - `agent`, e.g. `CodeActAgent`, is the name of the agent for benchmarks, defaulting
diff --git a/evaluation/bird/README.md b/evaluation/bird/README.md
index e4710aa0b3..250aefb5e5 100644
--- a/evaluation/bird/README.md
+++ b/evaluation/bird/README.md
@@ -42,7 +42,7 @@ temperature = 0.0
 - `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for your
 LLM settings, as defined in your `config.toml`.
 
-- `git-version`, e.g. `head`, is the git commit hash of the OpenDevin version you would
+- `git-version`, e.g. `HEAD`, is the git commit hash of the OpenDevin version you would
 like to evaluate. It could also be a release tag like `0.6.2`.
 
 ## Examples
diff --git a/evaluation/gaia/README.md b/evaluation/gaia/README.md
index a85668dc61..6cf911c954 100644
--- a/evaluation/gaia/README.md
+++ b/evaluation/gaia/README.md
@@ -22,7 +22,7 @@ where `model_config` is mandatory, while `git-version`, `agent`, `eval_limit` an
 - `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for your
 LLM settings, as defined in your `config.toml`, defaulting to `gpt-3.5-turbo`
 
-- `git-version`, e.g. `head`, is the git commit hash of the OpenDevin version you would
+- `git-version`, e.g. `HEAD`, is the git commit hash of the OpenDevin version you would
 like to evaluate. It could also be a release tag like `0.6.2`.
 
 - `agent`, e.g. `CodeActAgent`, is the name of the agent for benchmarks, defaulting
diff --git a/evaluation/gorilla/README.md b/evaluation/gorilla/README.md
index d0df6617ba..c5da3ad453 100644
--- a/evaluation/gorilla/README.md
+++ b/evaluation/gorilla/README.md
@@ -23,7 +23,7 @@ where `model_config` is mandatory, while all other arguments are optional.
 `model_config`, e.g. `llm`, is the config group name for your
 LLM settings, as defined in your `config.toml`.
 
-`git-version`, e.g. `head`, is the git commit hash of the OpenDevin version you would
+`git-version`, e.g. `HEAD`, is the git commit hash of the OpenDevin version you would
 like to evaluate. It could also be a release tag like `0.6.2`.
 
 `agent`, e.g. `CodeActAgent`, is the name of the agent for benchmarks, defaulting
diff --git a/evaluation/gpqa/README.md b/evaluation/gpqa/README.md
index a0d0eab50a..538114c863 100644
--- a/evaluation/gpqa/README.md
+++ b/evaluation/gpqa/README.md
@@ -60,7 +60,7 @@ From the root of the OpenDevin repo, run the following command:
 You can replace `model_config_name` with any model you set up in `config.toml`.
 
 - `model_config_name`: The model configuration name from `config.toml` that you want to evaluate.
-- `git-version`, e.g. `head`, is the git commit hash of the OpenDevin version you would
+- `git-version`, e.g. `HEAD`, is the git commit hash of the OpenDevin version you would
 like to evaluate. It could also be a release tag like `0.6.2`.
 - `num_samples_eval`: Number of samples to evaluate (useful for testing and debugging).
 - `data_split`: The data split to evaluate on. Must be one of `gpqa_main`, `gqpa_diamond`, `gpqa_experts`, `gpqa_extended`. Defaults to `gpqa_diamond` as done in the paper.
diff --git a/evaluation/mint/README.md b/evaluation/mint/README.md
index a5815c7f61..1e07bd6431 100644
--- a/evaluation/mint/README.md
+++ b/evaluation/mint/README.md
@@ -20,7 +20,7 @@ where `model_config` is mandatory, while others are optional.
 - `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for your
 LLM settings, as defined in your `config.toml`.
 
-- `git-version`, e.g. `head`, is the git commit hash of the OpenDevin version you would
+- `git-version`, e.g. `HEAD`, is the git commit hash of the OpenDevin version you would
 like to evaluate. It could also be a release tag like `0.6.2`.
 
 - `subset`, e.g. `math`, is the subset of the MINT benchmark to evaluate on, defaulting to `math`. It can be either: `math`, `gsm8k`, `mmlu`, `theoremqa`, `mbpp`,`humaneval`.
diff --git a/evaluation/swe_bench/README.md b/evaluation/swe_bench/README.md
index 3b49c85198..966ed5c3c8 100644
--- a/evaluation/swe_bench/README.md
+++ b/evaluation/swe_bench/README.md
@@ -82,7 +82,7 @@ If you see an error, please make sure your `config.toml` contains all
 
 ```bash
 ./evaluation/swe_bench/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit]
-# e.g., ./evaluation/swe_bench/scripts/run_infer.sh eval_gpt4_1106_preview head CodeActAgent 300
+# e.g., ./evaluation/swe_bench/scripts/run_infer.sh eval_gpt4_1106_preview HEAD CodeActAgent 300
 ```
 
 where `model_config` is mandatory, while `agent` and `eval_limit` are optional.
@@ -90,7 +90,7 @@ where `model_config` is mandatory, while `agent` and `eval_limit` are optional.
 `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for your
 LLM settings, as defined in your `config.toml`.
 
-`git-version`, e.g. `head`, is the git commit hash of the OpenDevin version you would
+`git-version`, e.g. `HEAD`, is the git commit hash of the OpenDevin version you would
 like to evaluate. It could also be a release tag like `0.6.2`.
 
 `agent`, e.g. `CodeActAgent`, is the name of the agent for benchmarks, defaulting
@@ -104,7 +104,7 @@ Let's say you'd like to run 10 instances using `eval_gpt4_1106_preview` and Code
 then your command would be:
 
 ```bash
-./evaluation/swe_bench/scripts/run_infer.sh eval_gpt4_1106_preview head CodeActAgent 10
+./evaluation/swe_bench/scripts/run_infer.sh eval_gpt4_1106_preview HEAD CodeActAgent 10
 ```
 
 If you would like to specify a list of tasks you'd like to benchmark on, you could
diff --git a/evaluation/toolqa/README.md b/evaluation/toolqa/README.md
index e8d23df475..058ac96bee 100644
--- a/evaluation/toolqa/README.md
+++ b/evaluation/toolqa/README.md
@@ -23,7 +23,7 @@ where `model_config` is mandatory, while all other arguments are optional.
 `model_config`, e.g. `llm`, is the config group name for your
 LLM settings, as defined in your `config.toml`.
 
-`git-version`, e.g. `head`, is the git commit hash of the OpenDevin version you would
+`git-version`, e.g. `HEAD`, is the git commit hash of the OpenDevin version you would
 like to evaluate. It could also be a release tag like `0.6.2`.
 
 `agent`, e.g. `CodeActAgent`, is the name of the agent for benchmarks, defaulting