# Evaluation

This folder contains code and resources to run experiments and evaluations.

## Logistics

To keep the evaluation folder organized, please follow these rules:

- Each subfolder contains a specific benchmark or experiment. For example, `evaluation/SWE-bench` should contain all the preprocessing, evaluation, and analysis scripts for SWE-bench.
- Raw data and experimental records should not be stored in this repo; host them externally instead (e.g., on Google Drive or Hugging Face Datasets).
- Important data files of manageable size and analysis scripts (e.g., Jupyter notebooks) can be uploaded directly to this repo.

## Tasks

### SWE-bench

- `notebooks`
  - `devin_eval_analysis.ipynb`: a notebook analyzing Devin's outputs.
- `scripts`
  - `prepare_devin_outputs_for_evaluation.py`: fetches Devin's outputs and converts them into the JSON file expected by the evaluation harness (a conversion sketch follows this list).
    - Usage: `python prepare_devin_outputs_for_evaluation.py <setting>`, where `<setting>` can be `passed`, `failed`, or `all`.
- `resources`
  - Devin's outputs, processed for evaluation, are available on Hugging Face:
    - Get the predictions that passed the tests: `wget https://huggingface.co/datasets/OpenDevin/Devin-SWE-bench-output/raw/main/devin_swe_passed.json`
    - Get all predictions: `wget https://huggingface.co/datasets/OpenDevin/Devin-SWE-bench-output/raw/main/devin_swe_outputs.json`
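
The conversion step works roughly as follows. The snippet below is a minimal sketch, not the actual script: it assumes the raw Devin outputs are already on disk as a JSON list with hypothetical `instance_id`, `pass_or_fail`, and `patch` fields, and that the evaluation harness expects the usual SWE-bench predictions format (`instance_id`, `model_patch`, `model_name_or_path`).

```python
# Sketch only; field names of the raw output are assumptions.
import json


def convert(raw_path: str, out_path: str, setting: str = "all") -> None:
    """Convert raw Devin outputs into a SWE-bench style predictions file."""
    with open(raw_path) as f:
        records = json.load(f)

    predictions = []
    for rec in records:
        # Keep only the requested subset: "passed", "failed", or "all".
        if setting != "all" and rec.get("pass_or_fail") != setting:
            continue
        predictions.append({
            "instance_id": rec["instance_id"],
            "model_patch": rec["patch"],
            "model_name_or_path": "Devin",
        })

    with open(out_path, "w") as f:
        json.dump(predictions, f, indent=2)


if __name__ == "__main__":
    # Example: keep everything and write a single predictions file.
    convert("devin_raw_outputs.json", "devin_swe_all.json", setting="all")
```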

See `SWE-bench/README.md` for more details on how to run SWE-bench for evaluation.
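
As an alternative to the `wget` commands above, the processed predictions can also be fetched directly from Python. This is a small equivalent sketch; it assumes each entry in the JSON file carries an `instance_id` field.

```python
import json
import urllib.request

# Same file as the "passed" wget command above.
URL = ("https://huggingface.co/datasets/OpenDevin/Devin-SWE-bench-output"
       "/raw/main/devin_swe_passed.json")


def fetch_predictions(url: str = URL) -> list:
    # Download and parse the predictions JSON in one go.
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    preds = fetch_predictions()
    print(f"Loaded {len(preds)} predictions")
    # Print a few instance IDs as a sanity check (assumes this field exists).
    for pred in preds[:3]:
        print(pred.get("instance_id"))
```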