mirror of
https://github.com/OpenHands/OpenHands.git
synced 2026-03-22 05:37:20 +08:00
Fix doc error in evals (#2654)
@@ -19,7 +19,7 @@ where `model_config` is mandatory, while `git-version`, `agent`, `dataset` and `
 - `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for your
 LLM settings, as defined in your `config.toml`.

-- `git-version`, e.g. `head`, is the git commit hash of the OpenDevin version you would
+- `git-version`, e.g. `HEAD`, is the git commit hash of the OpenDevin version you would
 like to evaluate. It could also be a release tag like `0.6.2`.

 - `agent`, e.g. `CodeActAgent`, is the name of the agent for benchmarks, defaulting
@@ -32,7 +32,7 @@ where `model_config` is mandatory, while `git-version`, `agent`, `dataset` and `
 - `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for your
 LLM settings, as defined in your `config.toml`.

-- `git-version`, e.g. `head`, is the git commit hash of the OpenDevin version you would
+- `git-version`, e.g. `HEAD`, is the git commit hash of the OpenDevin version you would
 like to evaluate. It could also be a release tag like `0.6.2`.

 - `agent`, e.g. `CodeActAgent`, is the name of the agent for benchmarks, defaulting
@@ -42,7 +42,7 @@ temperature = 0.0
 - `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for your
 LLM settings, as defined in your `config.toml`.

-- `git-version`, e.g. `head`, is the git commit hash of the OpenDevin version you would
+- `git-version`, e.g. `HEAD`, is the git commit hash of the OpenDevin version you would
 like to evaluate. It could also be a release tag like `0.6.2`.

 ## Examples
@@ -22,7 +22,7 @@ where `model_config` is mandatory, while `git-version`, `agent`, `eval_limit` an
 - `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for your
 LLM settings, as defined in your `config.toml`, defaulting to `gpt-3.5-turbo`

-- `git-version`, e.g. `head`, is the git commit hash of the OpenDevin version you would
+- `git-version`, e.g. `HEAD`, is the git commit hash of the OpenDevin version you would
 like to evaluate. It could also be a release tag like `0.6.2`.

 - `agent`, e.g. `CodeActAgent`, is the name of the agent for benchmarks, defaulting
@@ -23,7 +23,7 @@ where `model_config` is mandatory, while all other arguments are optional.
 `model_config`, e.g. `llm`, is the config group name for your
 LLM settings, as defined in your `config.toml`.

-`git-version`, e.g. `head`, is the git commit hash of the OpenDevin version you would
+`git-version`, e.g. `HEAD`, is the git commit hash of the OpenDevin version you would
 like to evaluate. It could also be a release tag like `0.6.2`.

 `agent`, e.g. `CodeActAgent`, is the name of the agent for benchmarks, defaulting
@@ -60,7 +60,7 @@ From the root of the OpenDevin repo, run the following command:
 You can replace `model_config_name` with any model you set up in `config.toml`.

 - `model_config_name`: The model configuration name from `config.toml` that you want to evaluate.
-- `git-version`, e.g. `head`, is the git commit hash of the OpenDevin version you would
+- `git-version`, e.g. `HEAD`, is the git commit hash of the OpenDevin version you would
 like to evaluate. It could also be a release tag like `0.6.2`.
 - `num_samples_eval`: Number of samples to evaluate (useful for testing and debugging).
 - `data_split`: The data split to evaluate on. Must be one of `gpqa_main`, `gpqa_diamond`, `gpqa_experts`, `gpqa_extended`. Defaults to `gpqa_diamond` as done in the paper.
@@ -20,7 +20,7 @@ where `model_config` is mandatory, while others are optional.

 - `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for your LLM settings, as defined in your `config.toml`.

-- `git-version`, e.g. `head`, is the git commit hash of the OpenDevin version you would
+- `git-version`, e.g. `HEAD`, is the git commit hash of the OpenDevin version you would
 like to evaluate. It could also be a release tag like `0.6.2`.

 - `subset`, e.g. `math`, is the subset of the MINT benchmark to evaluate on, defaulting to `math`. It can be either: `math`, `gsm8k`, `mmlu`, `theoremqa`, `mbpp`, `humaneval`.
@@ -82,7 +82,7 @@ If you see an error, please make sure your `config.toml` contains all

 ```bash
 ./evaluation/swe_bench/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit]
-# e.g., ./evaluation/swe_bench/scripts/run_infer.sh eval_gpt4_1106_preview head CodeActAgent 300
+# e.g., ./evaluation/swe_bench/scripts/run_infer.sh eval_gpt4_1106_preview HEAD CodeActAgent 300
 ```

 where `model_config` is mandatory, while `agent` and `eval_limit` are optional.
@@ -90,7 +90,7 @@ where `model_config` is mandatory, while `agent` and `eval_limit` are optional.
 `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for your
 LLM settings, as defined in your `config.toml`.

-`git-version`, e.g. `head`, is the git commit hash of the OpenDevin version you would
+`git-version`, e.g. `HEAD`, is the git commit hash of the OpenDevin version you would
 like to evaluate. It could also be a release tag like `0.6.2`.

 `agent`, e.g. `CodeActAgent`, is the name of the agent for benchmarks, defaulting
@@ -104,7 +104,7 @@ Let's say you'd like to run 10 instances using `eval_gpt4_1106_preview` and Code
 then your command would be:

 ```bash
-./evaluation/swe_bench/scripts/run_infer.sh eval_gpt4_1106_preview head CodeActAgent 10
+./evaluation/swe_bench/scripts/run_infer.sh eval_gpt4_1106_preview HEAD CodeActAgent 10
 ```

 If you would like to specify a list of tasks you'd like to benchmark on, you could
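Note on the hunk above: the positional order `[model_config] [git-version] [agent] [eval_limit]` makes it easy to paste an old command that still passes a lowercase `head`. The following is only a sketch of how a caller might guard against that. `print_infer_cmd` is a hypothetical helper, not part of the repository; it only prints the command line and never runs the evaluation script, and it assumes all four arguments are supplied.

```bash
# Hypothetical helper (not part of OpenDevin): print the run_infer.sh
# command line, upper-casing a miscased "head" in the git-version slot.
# Assumes all four positional arguments are given.
print_infer_cmd() {
  model_config=$1
  git_version=$2
  agent=$3
  eval_limit=$4

  # Any casing of "head" is rewritten to the spelling the docs now use.
  case "$git_version" in
    [Hh][Ee][Aa][Dd]) git_version=HEAD ;;
  esac

  printf '%s\n' "./evaluation/swe_bench/scripts/run_infer.sh $model_config $git_version $agent $eval_limit"
}

print_infer_cmd eval_gpt4_1106_preview head CodeActAgent 10
# prints: ./evaluation/swe_bench/scripts/run_infer.sh eval_gpt4_1106_preview HEAD CodeActAgent 10
```

Commit hashes and release tags such as `0.6.2` pass through unchanged, since only the four-letter word is matched.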
@@ -23,7 +23,7 @@ where `model_config` is mandatory, while all other arguments are optional.
 `model_config`, e.g. `llm`, is the config group name for your
 LLM settings, as defined in your `config.toml`.

-`git-version`, e.g. `head`, is the git commit hash of the OpenDevin version you would
+`git-version`, e.g. `HEAD`, is the git commit hash of the OpenDevin version you would
 like to evaluate. It could also be a release tag like `0.6.2`.

 `agent`, e.g. `CodeActAgent`, is the name of the agent for benchmarks, defaulting
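Note on the change as a whole: every hunk replaces `head` with `HEAD` in the `git-version` example. The commit does not state its reason; a plausible one is that Git's symbolic ref is spelled `HEAD`, and a lowercase `head` typically resolves only on case-insensitive filesystems. The docs describe three accepted forms for `git-version`: `HEAD`, a commit hash, or a release tag like `0.6.2`. A rough sketch of telling them apart follows. `classify_git_version` and its labels are hypothetical, and the pattern checks are approximations that do not consult a repository.

```bash
# Hypothetical helper (not part of OpenDevin): label a git-version
# argument by shape alone. It does not check that the ref exists.
classify_git_version() {
  case "$1" in
    "")               echo "empty" ;;
    HEAD)             echo "symbolic-ref" ;;
    [Hh][Ee][Aa][Dd]) echo "miscased-head" ;;
    *[!0-9a-f]*)
      # Contains a non-hex character, so it cannot be a lowercase hash.
      case "$1" in
        [0-9]*.[0-9]*.[0-9]*) echo "release-tag" ;;
        *)                    echo "other-ref" ;;
      esac ;;
    *)                echo "commit-hash" ;;
  esac
}

for v in HEAD head 0.6.2 a1b2c3d; do
  printf '%s -> %s\n' "$v" "$(classify_git_version "$v")"
done
```

A script could refuse to run on `miscased-head` instead of silently correcting it, which makes the documentation error this commit fixes visible at the call site.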