VersiCode benchmark
This project evaluates model performance on VersiCode. It contains:
- data: the required test data and the model outputs
- inference_utils: inference scripts for our tasks and models
- metric: scripts for computing the metrics
- output_processing: scripts that post-process the model output so that the metrics can be computed
Details
Prepare the environment
    # Create the conda environment
    conda create -n VersiCode python==3.12
    # Activate it
    conda activate VersiCode
    # Install the requirements
    pip install -r requirements.txt
Experiment Data
To obtain the experimental data, visit the Hugging Face dataset page: https://huggingface.co/datasets/AstoneNg/VersiCode. Locate the files VersiCode_block_completion.json and VersiCode_migration.json under the experiment_data directory, and place them in the /data/test_data directory of this project.
Model inference
    # Enter the inference_utils directory
    cd inference_utils
    # Scripts whose names start with "test" evaluate local models;
    # scripts whose names start with "api" evaluate models called through an API.
    # Block-level code completion
    # Edit lines 10 and 12 to set the base URL and the model name
    python api_test_block_completion.py
    # Edit line 30 to set the local model path
    python test_block.py
    # Code migration (the migration direction is 'old_to_new')
    # Edit lines 10 and 12 to set the base URL and the model name
    python api_code_migration.py
    # Edit line 30 to set the local model path
    python test_migration.py
Process output
Post-process the model output: remove redundant content and extract the relevant part so that the metrics can be computed.
    # Enter the output_processing directory
    cd output_processing
    # Extract the content between <start> and <end>
    # Edit lines 8 and 9 to set the model and the task granularity
    python clear_ans.py
    # For the block completion and migration tasks, cdc@k is computed on the core (key) lines only
    # Edit lines 76 and 79 to set the data path
    python choose_core_line_from_block_versicode.py
    python choose_core_line_from_migration_versicode.py
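The extraction step performed by clear_ans.py, pulling out whatever the model wrote between `<start>` and `<end>`, can be sketched with a regular expression. This is a hypothetical illustration, not the script's actual code: `extract_answers` is an assumed name, and the real script may handle edge cases (such as a missing `<end>` tag) differently.

```python
import re

# Hypothetical re-implementation; the real logic lives in clear_ans.py.
# The non-greedy ".*?" captures each <start> ... <end> pair separately,
# and re.DOTALL lets "." span the newlines inside multi-line code.
_SPAN = re.compile(r"<start>(.*?)<end>", re.DOTALL)


def extract_answers(model_output: str) -> list[str]:
    """Return every snippet enclosed by <start> and <end>, with surrounding whitespace stripped."""
    return [m.strip() for m in _SPAN.findall(model_output)]


raw = "Sure, here is the code:\n<start>\nimport numpy as np\n<end>\nHope this helps."
print(extract_answers(raw))  # prints ['import numpy as np']
```

Text outside the tags, such as the chatty preamble above, is discarded, and an output with no tags yields an empty list.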
Metric
We use three metrics: pass@k, em@k, and cdc@k. Because we cannot automatically build a dynamic evaluation environment, we do not provide pass@k.
    # Enter the metric directory
    cd metric
    # To set the data path and the k value of the metric, edit lines 137-140 for the migration task
    # (compute_migration_cdc_score.py) or lines 143-145 for the block and line completion tasks
    # (compute_versicode_cdc_score.py and compute_versicode_em_score.py)
    python compute_migration_cdc_score.py
    python compute_versicode_cdc_score.py
    python compute_versicode_em_score.py
    # Notes
    # We found limitations in the ISM@k and PM@k metrics for evaluating code generation,
    # so they are used only as a reference in our experiments.
    # For the block and line completion tasks, edit lines 261-265 to set the data path and the k value
    python compute_ism_pm_score.py
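To make em@k concrete, here is a minimal sketch under two stated assumptions: a sample counts as a hit when any of its first k generations exactly matches the reference, and "exact" is judged after collapsing whitespace. Both the normalization and the names `normalize` and `em_at_k` are hypothetical; compute_versicode_em_score.py may define the match differently.

```python
def normalize(code: str) -> str:
    """Assumed normalization: collapse every whitespace run into a single space."""
    return " ".join(code.split())


def em_at_k(predictions: list[list[str]], references: list[str], k: int) -> float:
    """Fraction of samples whose first k predictions contain an exact match (assumed definition)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = 0
    for preds, ref in zip(predictions, references):
        target = normalize(ref)
        # Only the first k generations of each sample are considered.
        if any(normalize(p) == target for p in preds[:k]):
            hits += 1
    return hits / len(references)


preds = [
    ["np.array([1, 2])", "np.asarray([1,2])"],
    ["pd.concat(frames)", "df.append(other)"],
]
refs = ["np.asarray([1,2])", "pd.concat(frames)"]
print(em_at_k(preds, refs, k=1))  # 0.5
print(em_at_k(preds, refs, k=2))  # 1.0
```

Raising k can only keep or increase the score, because each sample is given more chances to produce the reference.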
Citation
@article{versicode,
author={Tongtong Wu and Weigang Wu and Xingyu Wang and Kang Xu and Suyu Ma and Bo Jiang and Ping Yang and Zhenchang Xing and Yuan-Fang Li and Gholamreza Haffari},
title = {VersiCode: Towards Version-controllable Code Generation},
journal = {CoRR},
volume = {abs/2406.07411},
year = {2024},
url = {https://arxiv.org/abs/2406.07411},
}
GitHub URL: https://github.com/wutong8023/VersiCode
Contributors
Tongtong Wu, Weigang Wu, Xingyu Wang, Kang Xu, Suyu Ma, Bo Jiang, Ping Yang, Zhenchang Xing, Yuan-Fang Li, Gholamreza Haffari