Evaluate OpenHands on NoCode-bench

LLM Setup

Please follow the LLM setup instructions here.
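As a minimal sketch, an LLM config might look like the following, assuming OpenHands reads LLM settings from a config.toml at the repository root and that the llm.claude group name matches the one passed to the run script below; the model name and key are placeholders.

cat > config.toml <<'EOF'
[llm.claude]
model = "anthropic/claude-3-5-sonnet-20241022"   # placeholder model name
api_key = "YOUR_API_KEY"
temperature = 0.0
EOF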

Docker image download

Evaluating OpenHands on NoCode-bench requires instance-level Docker images. Please follow the NoCode-bench image setup instructions to build or download all instance-level Docker images here.
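As an illustration only, pulling one of the pre-built images might look like the sketch below; the registry and tag scheme are hypothetical placeholders, so substitute the actual names from the NoCode-bench image setup guide.

docker pull <registry>/<nocode-bench-instance-image>:<instance_id>   # hypothetical naming scheme
docker images | grep -i nocode   # verify the instance images are available locally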

Generate patch

Please follow the instructions here. For example,

bash ./evaluation/benchmarks/nocode_bench/scripts/run_infer_nc.sh llm.claude HEAD CodeActAgent 114 100 10 NoCode-bench/NoCode-bench_Verified test
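The positional arguments follow the pattern used by other OpenHands benchmark run scripts; the mapping below is an assumption, so verify it against run_infer_nc.sh itself.

# Assumed meaning of each positional argument:
#   llm.claude                          LLM config group (from config.toml)
#   HEAD                                OpenHands git revision to run
#   CodeActAgent                        agent class
#   114                                 evaluation limit (number of instances)
#   100                                 max iterations per instance
#   10                                  number of parallel workers
#   NoCode-bench/NoCode-bench_Verified  dataset name
#   test                                dataset split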

The results will be written to evaluation/evaluation_outputs/outputs/XXX/CodeActAgent/YYY/output.jsonl.
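To spot-check a run, you can scan the JSONL output with jq; the field names instance_id and test_result.git_patch are assumptions based on other OpenHands benchmark outputs, so adjust them to the actual schema.

# Field names are assumptions; adjust to the actual output schema.
jq -r '.instance_id' evaluation/evaluation_outputs/outputs/XXX/CodeActAgent/YYY/output.jsonl | head
jq -r '.test_result.git_patch' evaluation/evaluation_outputs/outputs/XXX/CodeActAgent/YYY/output.jsonl | head -n 20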

Running evaluation

First, install NoCode-bench.
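A minimal sketch of the installation, assuming NoCode-bench is installed from source as an editable Python package; the repository URL here is an assumption, so take the actual URL and steps from the NoCode-bench documentation.

git clone https://github.com/NoCode-bench/NoCode-bench.git   # assumed URL; check the official docs
cd NoCode-bench
pip install -e .   # install the package and its dependencies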

Second, convert output.jsonl to patch.jsonl with the conversion script:

python evaluation/benchmarks/multi_swe_bench/scripts/eval/convert.py
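For orientation, the conversion boils down to extracting one model patch per instance from output.jsonl. The jq sketch below approximates this under assumed field names (instance_id, test_result.git_patch) and an assumed patch.jsonl schema (instance_id, model_patch); convert.py is the authoritative reference.

# Rough jq equivalent; field names on both sides are assumptions.
jq -c '{instance_id: .instance_id, model_patch: .test_result.git_patch}' \
    output.jsonl > patch.jsonl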

Finally, evaluate with NoCode-bench.

export PYTHONPATH=$PYTHONPATH:$(pwd)
# --predictions_path  <path_to_your_predictions>
# --log_dir           <path_to_your_log_dir>
# --bench_tasks       <dataset_name>
# --max_workers       <number_of_workers>
# --output_file       <path_to_your_output_file>
# --image_level       <cache_image_level>
# --timeout           <timeout_in_seconds>
# --proxy             <proxy_if_needed>
python ./evaluation/eval.py \
    --predictions_path ./all_preds.jsonl \
    --log_dir ./evaluation/logs \
    --bench_tasks NoCode-bench/NoCode-bench_Verified \
    --max_workers 110 \
    --output_file eval_result.txt \
    --image_level repo \
    --timeout 600 \
    --proxy None
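When the run completes, the aggregate results are written to eval_result.txt. A worker count of 110 assumes a large host; scale --max_workers to your CPU and memory, since each worker typically runs its own instance container, and leave --proxy as None unless the containers need one to reach the network.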