mirror of
https://github.com/OpenHands/OpenHands.git
synced 2025-12-26 05:48:36 +08:00
* style: moved argument parsing into a separate function * commito * Update evaluation/regression/conftest.py --------- Co-authored-by: Robert Brennan <accounts@rbren.io>
Evaluation
This folder contains code and resources to run experiments and evaluations.
Logistics
To better organize the evaluation folder, we should follow the rules below:
- Each subfolder contains a specific benchmark or experiment. For example,
evaluation/SWE-benchshould contain all the preprocessing/evaluation/analysis scripts. - Raw data and experimental records should not be stored within this repo (e.g. Google Drive or Hugging Face Datasets).
- Important data files of manageable size and analysis scripts (e.g., jupyter notebooks) can be directly uploaded to this repo.
Roadmap
- Sanity check. Reproduce Devin's scores on SWE-bench using the released outputs to make sure that our harness pipeline works.
- Open source model support.
- Contributors are encouraged to submit their commits to our forked SEW-bench repo.
- Ensure compatibility with OpenAI interface for inference.
- Serve open source models, prioritizing high concurrency and throughput.
Tasks
SWE-bench
- notebooks
devin_eval_analysis.ipynb: notebook analyzing devin's outputs
- scripts
prepare_devin_outputs_for_evaluation.py: script fetching and converting devin's output into the desired json file for evaluation.- usage:
python prepare_devin_outputs_for_evaluation.py <setting>where setting can bepassed,failedorall
- usage:
- resources
- Devin's outputs processed for evaluations is available on Huggingface
- get predictions that passed the test:
wget https://huggingface.co/datasets/OpenDevin/Devin-SWE-bench-output/raw/main/devin_swe_passed.json - get all predictions
wget https://huggingface.co/datasets/OpenDevin/Devin-SWE-bench-output/raw/main/devin_swe_outputs.json
- get predictions that passed the test:
- Devin's outputs processed for evaluations is available on Huggingface
See SWE-bench/README.md for more details on how to run SWE-Bench for evaluation.