Files
OpenHands/evaluation/logic_reasoning
Leo 9ada36e30b fix: restore python linting. (#2228)
* fix: restore python linting.

Signed-off-by: ifuryst <ifuryst@gmail.com>

* update: extend the Python lint check to evaluation.

Signed-off-by: ifuryst <ifuryst@gmail.com>

* Update evaluation/logic_reasoning/instruction.txt

---------

Signed-off-by: ifuryst <ifuryst@gmail.com>
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
2024-06-04 06:36:19 +00:00
..
2024-06-03 16:04:34 +00:00

Logic Reasoning Evaluation

This folder contains evaluation harness for evaluating agents on the logic reasoning benchmark ProntoQA and ProofWriter.

Configure OpenDevin and your LLM

Create a config.toml file if it does not exist at the root of the workspace.

Add the following configurations:

[core]
max_iterations = 100
cache_dir = "/tmp/cache"
ssh_hostname = "localhost"
enable_auto_lint = true

# TODO: Change these to the model you want to evaluate
[eval_gpt4_1106_preview]
model = "gpt-4-1106-preview"
api_key = "XXX"
temperature = 0.0

[eval_some_openai_compatible_model]
model = "openai/MODEL_NAME"
base_url = "https://OPENAI_COMPATIBLE_URL/v1"
api_key = "XXX"
temperature = 0.0

Run Inference on logic_reasoning

The following code will run inference on the first example of the ProntoQA dataset with model gpt-4o.

./evaluation/logic_reasoning/scripts/run_infer.sh ProntoQA gpt-4o 1