BioCoder Evaluation with OpenDevin
Implements evaluation of agents on the BioCoder benchmark, introduced in BioCoder: A Benchmark for Bioinformatics Code Generation with Large Language Models. Please see the BioCoder repository for the reference implementation used in the paper.
Setup Environment
Please follow this document to set up a local development environment for OpenDevin.
Configure OpenDevin and your LLM
Create a config.toml file if it does not exist at the root of the workspace. Please check README.md for how to set this up.
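For reference, a minimal sketch of such an LLM config group is shown below. The group name (eval_gpt4o_2024_05_13 here) is what you will later pass as model_config to run_infer.sh. The section and key names are assumptions based on common OpenDevin examples and may differ between versions, so treat the top-level README as authoritative.

```bash
# Hypothetical sketch: append an LLM config group to config.toml.
# Section/key names may differ across OpenDevin versions; values are placeholders.
cat >> config.toml <<'EOF'
[eval_gpt4o_2024_05_13]
model = "gpt-4o-2024-05-13"
api_key = "sk-..."        # replace with your API key
temperature = 0.0
EOF
```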
BioCoder Docker Image
In the opendevin branch of the BioCoder repository, we have slightly modified our original Docker image to work with the OpenDevin environment. The image contains testing scripts (/testing/start_test_opendevin.py and auxiliary files in /testing_files/) to assist with evaluation. Additionally, we have installed all dependencies, including OpenJDK, mamba (with Python 3.6), and many system libraries. Note that we have not packaged all repositories into the image, so they are downloaded at runtime.
Before the first execution, pull our Docker image with the following command:
docker pull public.ecr.aws/i5g0m1f6/eval_biocoder:v1.0
To reproduce this image, please see Dockerfile_Opendevin in the BioCoder repository.
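As an optional sanity check (not required for the evaluation), you can list the testing scripts baked into the image. This assumes the image allows overriding its entrypoint and has standard coreutils available, which a hypothetical check like the following relies on:

```bash
# Optional sanity check: confirm the testing scripts described above are present
# in the pulled image (assumes ls exists in the image).
docker run --rm --entrypoint ls public.ecr.aws/i5g0m1f6/eval_biocoder:v1.0 \
  /testing /testing_files
```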
Start the evaluation
./evaluation/biocoder/scripts/run_infer.sh [model_config] [agent] [eval_limit]
where model_config is mandatory, while agent and eval_limit are optional.

- model_config, e.g. eval_gpt4_1106_preview, is the config group name for your LLM settings, as defined in your config.toml.
- agent, e.g. CodeActAgent, is the name of the agent for benchmarks, defaulting to CodeActAgent.
- eval_limit, e.g. 10, limits the evaluation to the first eval_limit instances. By default it infers all instances.
Examples
Let's say you'd like to run 10 instances using eval_gpt4o_2024_05_13 and CodeActAgent; then your command would be:
./evaluation/biocoder/scripts/run_infer.sh eval_gpt4o_2024_05_13 CodeActAgent 10
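Because agent and eval_limit are optional, you can also pass only the config group name; the agent then defaults to CodeActAgent and all instances are evaluated:

```bash
# Run every instance with the default CodeActAgent
./evaluation/biocoder/scripts/run_infer.sh eval_gpt4o_2024_05_13
```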
Reference
@misc{tang2024biocoder,
title={BioCoder: A Benchmark for Bioinformatics Code Generation with Large Language Models},
author={Xiangru Tang and Bill Qian and Rick Gao and Jiakang Chen and Xinyun Chen and Mark Gerstein},
year={2024},
eprint={2308.16458},
archivePrefix={arXiv},
primaryClass={cs.LG}
}