docs: fix markdown linting and broken links (#5401)

Author: Cheng Yang (2024-12-05 01:28:04 +08:00), committed by GitHub
parent c5117bc48d
commit 8f47547b08
18 changed files with 45 additions and 39 deletions

---

@@ -4,12 +4,10 @@ This folder contains evaluation harness for evaluating agents on the Entity-dedu
 ## Setup Environment and LLM Configuration
-Please follow instruction [here](../README.md#setup) to setup your local development environment and LLM.
+Please follow instruction [here](../../README.md#setup) to setup your local development environment and LLM.
 ## Start the evaluation
 ```bash
 export OPENAI_API_KEY="sk-XXX"; # This is required for evaluation (to simulate another party of conversation)
 ./evaluation/benchmarks/EDA/scripts/run_infer.sh [model_config] [git-version] [agent] [dataset] [eval_limit]
@@ -37,7 +35,8 @@ For example,
 ```
 ## Reference
-```
+```bibtex
 @inproceedings{zhang2023entity,
  title={Probing the Multi-turn Planning Capabilities of LLMs via 20 Question Games},
  author={Zhang, Yizhe and Lu, Jiarui and Jaitly, Navdeep},
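```

For context, a filled-in version of the command in the hunk above might look like the sketch below. `eval_gpt4_1106_preview` is a hypothetical entry in your `config.toml`, and `things` and `5` are illustrative dataset and eval-limit values, not taken from this diff.

```bash
# Hypothetical EDA run; all argument values are illustrative.
export OPENAI_API_KEY="sk-XXX"   # required to simulate the other party of the conversation
./evaluation/benchmarks/EDA/scripts/run_infer.sh eval_gpt4_1106_preview HEAD CodeActAgent things 5
```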

---

@@ -4,7 +4,7 @@ This folder contains evaluation harness for evaluating agents on the [AgentBench
 ## Setup Environment and LLM Configuration
-Please follow instruction [here](../README.md#setup) to setup your local development environment and LLM.
+Please follow instruction [here](../../README.md#setup) to setup your local development environment and LLM.
 ## Start the evaluation

---

@@ -10,7 +10,7 @@ Hugging Face dataset based on the
 ## Setup Environment and LLM Configuration
-Please follow instruction [here](../README.md#setup) to setup your local
+Please follow instruction [here](../../README.md#setup) to setup your local
 development environment and LLM.
 ## Start the evaluation

---

@@ -4,13 +4,14 @@ Implements evaluation of agents on BioCoder from the BioCoder benchmark introduc
 ## Setup Environment and LLM Configuration
-Please follow instruction [here](../README.md#setup) to setup your local development environment and LLM.
+Please follow instruction [here](../../README.md#setup) to setup your local development environment and LLM.
 ## BioCoder Docker Image
 In the openhands branch of the Biocoder repository, we have slightly modified our original Docker image to work with the OpenHands environment. In the Docker image are testing scripts (`/testing/start_test_openhands.py` and aux files in `/testing_files/`) to assist with evaluation. Additionally, we have installed all dependencies, including OpenJDK, mamba (with Python 3.6), and many system libraries. Notably, we have **not** packaged all repositories into the image, so they are downloaded at runtime.
 **Before first execution, pull our Docker image with the following command**
 ```bash
 docker pull public.ecr.aws/i5g0m1f6/eval_biocoder:v1.0
 ```
@@ -19,7 +20,6 @@ To reproduce this image, please see the Dockerfile_Openopenhands in the `biocode
 ## Start the evaluation
 ```bash
 ./evaluation/benchmarks/biocoder/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit]
 ```
@@ -47,7 +47,8 @@ with current OpenHands version, then your command would be:
 ```
 ## Reference
-```
+```bibtex
 @misc{tang2024biocoder,
  title={BioCoder: A Benchmark for Bioinformatics Code Generation with Large Language Models},
  author={Xiangru Tang and Bill Qian and Rick Gao and Jiakang Chen and Xinyun Chen and Mark Gerstein},
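```

To sanity-check the setup from the hunks above, standard Docker CLI calls can confirm the image pulled; the `run_infer.sh` values below are illustrative placeholders, with `eval_gpt4_1106_preview` assumed to exist in your `config.toml`.

```bash
# Confirm the evaluation image is available locally after `docker pull`.
docker image inspect public.ecr.aws/i5g0m1f6/eval_biocoder:v1.0 --format 'image present: {{.Id}}'

# Hypothetical run; model config, git version, agent, and eval limit are illustrative.
./evaluation/benchmarks/biocoder/scripts/run_infer.sh eval_gpt4_1106_preview HEAD CodeActAgent 1
```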

File diff suppressed because one or more lines are too long

---

@@ -7,7 +7,7 @@ If so, the browsing performance upper-bound of CodeActAgent will be the performa
 ## Setup Environment and LLM Configuration
-Please follow instruction [here](../README.md#setup) to setup your local development environment and LLM.
+Please follow instruction [here](../../README.md#setup) to setup your local development environment and LLM.
 ## Run Inference

---

@@ -4,19 +4,18 @@ This folder contains the evaluation harness that we built on top of the original
 The evaluation consists of three steps:
-1. Environment setup: [install python environment](../README.md#development-environment), [configure LLM config](../README.md#configure-openhands-and-your-llm).
+1. Environment setup: [install python environment](../../README.md#development-environment), [configure LLM config](../../README.md#configure-openhands-and-your-llm).
 2. [Run Evaluation](#run-inference-on-commit0-instances): Generate a edit patch for each Commit0 Repo, and get the evaluation results
 ## Setup Environment and LLM Configuration
-Please follow instruction [here](../README.md#setup) to setup your local development environment and LLM.
+Please follow instruction [here](../../README.md#setup) to setup your local development environment and LLM.
 ## OpenHands Commit0 Instance-level Docker Support
 OpenHands supports using the Commit0 Docker for **[inference](#run-inference-on-commit0-instances).
 This is now the default behavior.
 ## Run Inference on Commit0 Instances
 Make sure your Docker daemon is running, and you have ample disk space (at least 200-500GB, depends on the Commit0 set you are running on) for the [instance-level docker image](#openhands-commit0-instance-level-docker-support).
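A minimal pre-flight check for the daemon and the 200-500GB disk requirement mentioned above, using only standard Docker and coreutils commands:

```bash
# Fail fast if the Docker daemon is unreachable.
docker info > /dev/null 2>&1 || { echo "Docker daemon is not running" >&2; exit 1; }

# Show free space on the filesystem backing Docker's data directory.
df -h "$(docker info --format '{{.DockerRootDir}}')"
```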

---

@@ -4,9 +4,10 @@ This folder contains evaluation harness for evaluating agents on the [GAIA bench
 ## Setup Environment and LLM Configuration
-Please follow instruction [here](../README.md#setup) to setup your local development environment and LLM.
+Please follow instruction [here](../../README.md#setup) to setup your local development environment and LLM.
 ## Run the evaluation
 We are using the GAIA dataset hosted on [Hugging Face](https://huggingface.co/datasets/gaia-benchmark/GAIA).
 Please accept the terms and make sure to have logged in on your computer by `huggingface-cli login` before running the evaluation.
@@ -41,6 +42,7 @@ For example,
 ## Get score
 Then you can get stats by running the following command:
 ```bash
 python ./evaluation/benchmarks/gaia/get_score.py \
 --file <path_to/output.json>
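```

Because the GAIA dataset is gated, it helps to verify Hugging Face access before launching a run; a minimal sketch with the standard `huggingface-cli`:

```bash
# Accept the GAIA terms on the dataset page first, then authenticate
# with a token from https://huggingface.co/settings/tokens.
huggingface-cli login

# Confirm the login took effect.
huggingface-cli whoami
```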

---

@@ -4,7 +4,7 @@ This folder contains evaluation harness we built on top of the original [Gorilla
 ## Setup Environment and LLM Configuration
-Please follow instruction [here](../README.md#setup) to setup your local development environment and LLM.
+Please follow instruction [here](../../README.md#setup) to setup your local development environment and LLM.
 ## Run Inference on APIBench Instances

---

@@ -3,6 +3,7 @@
 Implements the evaluation of agents on the GPQA benchmark introduced in [GPQA: A Graduate-Level Google-Proof Q&A Benchmark](https://arxiv.org/abs/2308.07124).
 This code implements the evaluation of agents on the GPQA Benchmark with Open Book setting.
 - The benchmark consists of 448 high-quality and extremely difficult multiple-choice questions in the domains of biology, physics, and chemistry. The questions are intentionally designed to be "Google-proof," meaning that even highly skilled non-expert validators achieve only 34% accuracy despite unrestricted access to the web.
 - Even experts in the corresponding domains achieve only 65% accuracy.
 - State-of-the-art AI systems achieve only 39% accuracy on this challenging dataset.
@@ -11,20 +12,24 @@ This code implements the evaluation of agents on the GPQA Benchmark with Open Bo
 Accurate solving of above graduate level questions would require both tool use (e.g., python for calculations) and web-search for finding related facts as information required for the questions might not be part of the LLM knowledge / training data.
 Further references:
-- https://arxiv.org/pdf/2311.12022
-- https://paperswithcode.com/dataset/gpqa
-- https://github.com/idavidrein/gpqa
+- <https://arxiv.org/pdf/2311.12022>
+- <https://paperswithcode.com/dataset/gpqa>
+- <https://github.com/idavidrein/gpqa>
 ## Setup Environment and LLM Configuration
-Please follow instruction [here](../README.md#setup) to setup your local development environment and LLM.
+Please follow instruction [here](../../README.md#setup) to setup your local development environment and LLM.
 ## Run Inference on GPQA Benchmark
 'gpqa_main', 'gqpa_diamond', 'gpqa_experts', 'gpqa_extended' -- data split options
 From the root of the OpenHands repo, run the following command:
 ```bash
 ./evaluation/benchmarks/gpqa/scripts/run_infer.sh [model_config_name] [git-version] [num_samples_eval] [data_split] [AgentClass]
 ```
-You can replace `model_config_name` with any model you set up in `config.toml`.
+- `model_config_name`: The model configuration name from `config.toml` that you want to evaluate.
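For example, a concrete invocation might look like the sketch below; `eval_gpt4_1106_preview` is assumed to be defined in your `config.toml`, and the sample count of 50 is illustrative.

```bash
# Hypothetical run: 50 samples from the gpqa_diamond split with CodeActAgent.
./evaluation/benchmarks/gpqa/scripts/run_infer.sh eval_gpt4_1106_preview HEAD 50 gpqa_diamond CodeActAgent
```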

---

@@ -4,7 +4,7 @@ Implements evaluation of agents on HumanEvalFix from the HumanEvalPack benchmark
 ## Setup Environment and LLM Configuration
-Please follow instruction [here](../README.md#setup) to setup your local development environment and LLM.
+Please follow instruction [here](../../README.md#setup) to setup your local development environment and LLM.
 ## Run Inference on HumanEvalFix
@@ -14,13 +14,11 @@ Please follow instruction [here](../README.md#setup) to setup your local develop
 You can replace `eval_gpt4_1106_preview` with any model you set up in `config.toml`.
 ## Examples
 For each problem, OpenHands is given a set number of iterations to fix the failing code. The history field shows each iteration's response to correct its code that fails any test case.
-```
+```json
 {
 "task_id": "Python/2",
 "instruction": "Please fix the function in Python__2.py such that all test cases pass.\nEnvironment has been set up for you to start working. You may assume all necessary tools are installed.\n\n# Problem Statement\ndef truncate_number(number: float) -> float:\n return number % 1.0 + 1.0\n\n\n\n\n\n\ndef check(truncate_number):\n assert truncate_number(3.5) == 0.5\n assert abs(truncate_number(1.33) - 0.33) < 1e-6\n assert abs(truncate_number(123.456) - 0.456) < 1e-6\n\ncheck(truncate_number)\n\nIMPORTANT: You should ONLY interact with the environment provided to you AND NEVER ASK FOR HUMAN HELP.\nYou should NOT modify any existing test case files. If needed, you can add new test cases in a NEW file to reproduce the issue.\nYou SHOULD INCLUDE PROPER INDENTATION in your edit commands.\nWhen you think you have fixed the issue through code changes, please finish the interaction using the "finish" tool.\n",

---

@@ -4,9 +4,10 @@ This folder contains evaluation harness for evaluating agents on the logic reaso
 ## Setup Environment and LLM Configuration
-Please follow instruction [here](../README.md#setup) to setup your local development environment and LLM.
+Please follow instruction [here](../../README.md#setup) to setup your local development environment and LLM.
 ## Run Inference on logic_reasoning
 The following code will run inference on the first example of the ProofWriter dataset,
 ```bash
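```

The command itself is cut off by the diff view; a hypothetical invocation following the `run_infer.sh` pattern used by the other benchmarks in this commit might be:

```bash
# Assumed argument order, mirroring the other benchmarks' scripts.
./evaluation/benchmarks/logic_reasoning/scripts/run_infer.sh eval_gpt4_1106_preview HEAD CodeActAgent ProofWriter 1
```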

---

@@ -4,7 +4,7 @@ This folder contains evaluation for [MiniWoB++](https://miniwob.farama.org/) ben
 ## Setup Environment and LLM Configuration
-Please follow instruction [here](../README.md#setup) to setup your local development environment and LLM.
+Please follow instruction [here](../../README.md#setup) to setup your local development environment and LLM.
 ## Test if your environment works
@@ -42,7 +42,6 @@ poetry run python evaluation/benchmarks/miniwob/get_success_rate.py evaluation/e
 You can start your own fork of [our huggingface evaluation outputs](https://huggingface.co/spaces/OpenHands/evaluation) and submit a PR of your evaluation results following the guide [here](https://huggingface.co/docs/hub/en/repositories-pull-requests-discussions#pull-requests-and-discussions).
 ## BrowsingAgent V1.0 result
 Tested on BrowsingAgent V1.0

---

@@ -12,7 +12,7 @@ For more details on the ML-Bench task and dataset, please refer to the paper: [M
 ## Setup Environment and LLM Configuration
-Please follow instruction [here](../README.md#setup) to setup your local development environment and LLM.
+Please follow instruction [here](../../README.md#setup) to setup your local development environment and LLM.
 ## Run Inference on ML-Bench

---

@@ -1,10 +1,10 @@
 # ScienceAgentBench Evaluation with OpenHands
-This folder contains the evaluation harness for [ScienceAgentBench](https://osu-nlp-group.github.io/ScienceAgentBench/) (paper: https://arxiv.org/abs/2410.05080).
+This folder contains the evaluation harness for [ScienceAgentBench](https://osu-nlp-group.github.io/ScienceAgentBench/) (paper: <https://arxiv.org/abs/2410.05080>).
 ## Setup Environment and LLM Configuration
-Please follow instruction [here](../README.md#setup) to setup your local development environment and LLM.
+Please follow instruction [here](../../README.md#setup) to setup your local development environment and LLM.
 ## Setup ScienceAgentBench
@@ -45,6 +45,7 @@ After the inference is completed, you may use the following command to extract n
 ```bash
 python post_proc.py [log_fname]
 ```
 - `log_fname`, e.g. `evaluation/.../output.jsonl`, is the automatically saved trajectory log of an OpenHands agent.
 Output will be write to e.g. `evaluation/.../output.converted.jsonl`
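As a usage sketch, with `LOG_FNAME` standing in for the real trajectory path (a hypothetical placeholder, since the actual path depends on your run):

```bash
# Hypothetical path; substitute your own run's output file.
LOG_FNAME="evaluation/evaluation_outputs/scienceagentbench/output.jsonl"
python post_proc.py "$LOG_FNAME"
# The converted file is written alongside the input, e.g. output.converted.jsonl.
```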

---

@@ -6,20 +6,19 @@ This folder contains the evaluation harness that we built on top of the original
 The evaluation consists of three steps:
-1. Environment setup: [install python environment](../README.md#development-environment), [configure LLM config](../README.md#configure-openhands-and-your-llm), and [pull docker](#openhands-swe-bench-instance-level-docker-support).
+1. Environment setup: [install python environment](../../README.md#development-environment), [configure LLM config](../../README.md#configure-openhands-and-your-llm), and [pull docker](#openhands-swe-bench-instance-level-docker-support).
 2. [Run inference](#run-inference-on-swe-bench-instances): Generate a edit patch for each Github issue
 3. [Evaluate patches using SWE-Bench docker](#evaluate-generated-patches)
 ## Setup Environment and LLM Configuration
-Please follow instruction [here](../README.md#setup) to setup your local development environment and LLM.
+Please follow instruction [here](../../README.md#setup) to setup your local development environment and LLM.
 ## OpenHands SWE-Bench Instance-level Docker Support
 OpenHands now support using the [official evaluation docker](https://github.com/princeton-nlp/SWE-bench/blob/main/docs/20240627_docker/README.md) for both **[inference](#run-inference-on-swe-bench-instances) and [evaluation](#evaluate-generated-patches)**.
 This is now the default behavior.
 ## Run Inference on SWE-Bench Instances
 Make sure your Docker daemon is running, and you have ample disk space (at least 200-500GB, depends on the SWE-Bench set you are running on) for the [instance-level docker image](#openhands-swe-bench-instance-level-docker-support).
@@ -52,7 +51,8 @@ default, it is set to 1.
 - `dataset_split`, split for the huggingface dataset. e.g., `test`, `dev`. Default to `test`.
 There are also two optional environment variables you can set.
-```
+```bash
 export USE_HINT_TEXT=true # if you want to use hint text in the evaluation. Default to false. Ignore this if you are not sure.
 export USE_INSTANCE_IMAGE=true # if you want to use instance-level docker images. Default to true
 ```
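As a usage sketch, the variables can also be set inline for a single run; the trailing arguments follow the bracketed-placeholder convention used elsewhere in this commit and are not taken from this hunk.

```bash
# One-off run that disables hint text and keeps instance-level images
# (matching the documented defaults of false and true respectively).
USE_HINT_TEXT=false USE_INSTANCE_IMAGE=true \
  ./evaluation/benchmarks/swe_bench/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit]
```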
@@ -127,6 +127,7 @@ With `output.jsonl` file, you can run `eval_infer.sh` to evaluate generated patc
 **This evaluation is performed using the official dockerized evaluation announced [here](https://github.com/princeton-nlp/SWE-bench/blob/main/docs/20240627_docker/README.md).**
 > If you want to evaluate existing results, you should first run this to clone existing outputs
 >
 >```bash
 >git clone https://huggingface.co/spaces/OpenHands/evaluation evaluation/evaluation_outputs
 >```
@@ -143,6 +144,7 @@ Then you can run the following:
 ```
 The script now accepts optional arguments:
 - `instance_id`: Specify a single instance to evaluate (optional)
 - `dataset_name`: The name of the dataset to use (default: `"princeton-nlp/SWE-bench_Lite"`)
 - `split`: The split of the dataset to use (default: `"test"`)
@@ -179,7 +181,6 @@ To clean-up all existing runtimes that you've already started, run:
 ALLHANDS_API_KEY="YOUR-API-KEY" ./evaluation/benchmarks/swe_bench/scripts/cleanup_remote_runtime.sh
 ```
 ## Visualize Results
 First you need to clone `https://huggingface.co/spaces/OpenHands/evaluation` and add your own running results from openhands into the `outputs` of the cloned repo.
@@ -189,6 +190,7 @@ git clone https://huggingface.co/spaces/OpenHands/evaluation
 ```
 **(optional) setup streamlit environment with conda**:
 ```bash
 cd evaluation
 conda create -n streamlit python=3.10
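```

The hunk ends mid-setup; a plausible completion, assuming the cloned Space follows the usual Streamlit layout with a `requirements.txt` and an `app.py` entry point (both file names are assumptions):

```bash
# Continue from the `conda create` above.
conda activate streamlit
pip install -r requirements.txt  # assumed dependency file in the cloned repo
streamlit run app.py             # assumed Streamlit entry point
```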

---

@@ -4,7 +4,7 @@ This folder contains an evaluation harness we built on top of the original [Tool
 ## Setup Environment and LLM Configuration
-Please follow instruction [here](../README.md#setup) to setup your local development environment and LLM.
+Please follow instruction [here](../../README.md#setup) to setup your local development environment and LLM.
 ## Run Inference on ToolQA Instances

---

@@ -4,7 +4,7 @@ This folder contains evaluation for [WebArena](https://github.com/web-arena-x/we
 ## Setup Environment and LLM Configuration
-Please follow instruction [here](../README.md#setup) to setup your local development environment and LLM.
+Please follow instruction [here](../../README.md#setup) to setup your local development environment and LLM.
 ## Setup WebArena Environment