mirror of https://github.com/OpenHands/OpenHands.git synced 2026-03-22 13:47:19 +08:00

Boxuan Li c68478f470 Customize LLM config per agent (#2756)

Currently, OpenDevin uses a global singleton LLM config and a global singleton agent config. This PR allows customers to configure an LLM config for each agent. A hypothetically useful scenario is to use a cheaper LLM for repo exploration / code search, and a more powerful LLM to actually do the problem solving (CodeActAgent).

Partially solves #2075 (web GUI improvement is not the goal of this PR)

2024-07-09 22:05:54 -07:00
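The idea in the commit message above, a global default LLM config with optional per-agent overrides, can be sketched roughly as follows. This is only an illustration of the lookup pattern, not the project's actual code: `LLMConfig`, `AppConfig`, `llm_for`, the model names, and `RepoExplorerAgent` are all hypothetical. Only `CodeActAgent` is named in the source.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class LLMConfig:
    """Hypothetical stand-in for an LLM configuration."""
    model: str
    temperature: float = 0.0


@dataclass
class AppConfig:
    """Hypothetical container: one default LLM config plus per-agent overrides."""
    default_llm: LLMConfig
    agent_llms: Dict[str, LLMConfig] = field(default_factory=dict)

    def llm_for(self, agent_name: str) -> LLMConfig:
        # Fall back to the global default when the agent has no override.
        return self.agent_llms.get(agent_name, self.default_llm)


# Placeholder model names: a cheaper default, a stronger model for CodeActAgent.
config = AppConfig(
    default_llm=LLMConfig(model="cheap-model"),
    agent_llms={"CodeActAgent": LLMConfig(model="strong-model")},
)

print(config.llm_for("CodeActAgent").model)
print(config.llm_for("RepoExplorerAgent").model)
```

With this shape, an agent that has no entry in `agent_llms` keeps the old behavior of using the single global config.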

Name             Last commit                                                         Last updated
agent_bench      Customize LLM config per agent (#2756)                              2024-07-09 22:05:54 -07:00
biocoder         Refactoring: event stream based agent history (#2709)               2024-07-07 21:04:23 +00:00
bird             Customize LLM config per agent (#2756)                              2024-07-09 22:05:54 -07:00
EDA              History clean up (#2849)                                            2024-07-08 05:10:21 +02:00
gaia             Refactoring: event stream based agent history (#2709)               2024-07-07 21:04:23 +00:00
gorilla          History clean up (#2849)                                            2024-07-08 05:10:21 +02:00
gpqa             Customize LLM config per agent (#2756)                              2024-07-09 22:05:54 -07:00
humanevalfix     Customize LLM config per agent (#2756)                              2024-07-09 22:05:54 -07:00
logic_reasoning  Customize LLM config per agent (#2756)                              2024-07-09 22:05:54 -07:00
miniwob          Customize LLM config per agent (#2756)                              2024-07-09 22:05:54 -07:00
mint             Refactoring: event stream based agent history (#2709)               2024-07-07 21:04:23 +00:00
ml_bench         Customize LLM config per agent (#2756)                              2024-07-09 22:05:54 -07:00
regression       Feat: add stream output to exec_run (#1625)                         2024-05-16 14:37:49 +00:00
static           Add detailed tutorial for adding new evaluation benchmarks (#1827)  2024-05-18 13:40:53 -04:00
swe_bench        Customize LLM config per agent (#2756)                              2024-07-09 22:05:54 -07:00
toolqa           History clean up (#2849)                                            2024-07-08 05:10:21 +02:00
utils            Refactoring: event stream based agent history (#2709)               2024-07-07 21:04:23 +00:00
webarena         Refactoring: event stream based agent history (#2709)               2024-07-07 21:04:23 +00:00
__init__.py      feat(SWE-Bench environment) integrate SWE-Bench sandbox (#1468)     2024-05-15 16:15:55 +00:00
README.md        Add ML-Bench Evaluation with OpenDevin (#2015)                      2024-06-05 01:56:39 +00:00
TUTORIAL.md      Customize LLM config per agent (#2756)                              2024-07-09 22:05:54 -07:00

README.md

Evaluation

This folder contains code and resources to run experiments and evaluations.

Logistics

To better organize the evaluation folder, we should follow the rules below:

- Each subfolder contains a specific benchmark or experiment. For example, evaluation/swe_bench should contain all the preprocessing, evaluation, and analysis scripts.
- Raw data and experimental records should not be stored in this repo.
  - Model outputs should be stored in this Hugging Face space for visualization.
- Important data files of manageable size and analysis scripts (e.g., Jupyter notebooks) can be uploaded directly to this repo.

Supported Benchmarks

SWE-Bench: evaluation/swe_bench
ML-Bench: evaluation/ml_bench
HumanEvalFix: evaluation/humanevalfix
GAIA: evaluation/gaia
Entity Deduction Arena (EDA): evaluation/EDA
MINT: evaluation/mint
AgentBench: evaluation/agent_bench
BIRD: evaluation/bird
LogicReasoning: evaluation/logic_reasoning
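
As a rough illustration, the mapping above can be captured in a small lookup table so that scripts resolve a benchmark's folder by name. This is a sketch only: `BENCHMARK_DIRS` and `benchmark_dir` are hypothetical names, not part of the repository, and the paths are copied from the list above.

```python
from pathlib import PurePosixPath

# Benchmark name -> folder, copied from the "Supported Benchmarks" list.
BENCHMARK_DIRS = {
    "SWE-Bench": "evaluation/swe_bench",
    "ML-Bench": "evaluation/ml_bench",
    "HumanEvalFix": "evaluation/humanevalfix",
    "GAIA": "evaluation/gaia",
    "EDA": "evaluation/EDA",
    "MINT": "evaluation/mint",
    "AgentBench": "evaluation/agent_bench",
    "BIRD": "evaluation/bird",
    "LogicReasoning": "evaluation/logic_reasoning",
}


def benchmark_dir(name: str) -> PurePosixPath:
    """Return the repo-relative folder for a listed benchmark (hypothetical helper)."""
    try:
        return PurePosixPath(BENCHMARK_DIRS[name])
    except KeyError:
        # Unknown names fail loudly rather than guessing a folder.
        raise ValueError(f"Unsupported benchmark: {name!r}") from None


print(benchmark_dir("GAIA"))
```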

Result Visualization

See this Hugging Face space for visualizations of existing experimental results.

Upload your results

You can fork our Hugging Face evaluation outputs and submit your evaluation results as a PR to our hosted Hugging Face repo, following the guide here.