We present Workforce, a hierarchical multi-agent framework that decouples planning from execution through a modular architecture with a domain-agnostic Planner, Coordinator, and specialized Workers. This enables cross-domain transfer by allowing worker modification without full system retraining. On the GAIA benchmark, Workforce achieves state-of-the-art 69.70% accuracy, outperforming commercial systems.

This repository contains the inference code for the OWL framework (Workforce).

Inference

We use camel version 0.2.46. To reproduce the Workforce inference results on the GAIA benchmark (69.70% accuracy at pass@1 with Claude-3.7, and 60.61% accuracy at pass@3 with GPT-4o), follow the steps below:

Installation and Setup

Create and activate a Python 3.11 Conda environment:

conda create -n owl python=3.11
conda activate owl

Install the required packages:

pip install -r requirements.txt

Set up environment variables:

Copy .env.example to .env, then set the required keys in the .env file.
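To catch a forgotten key before a long run, one option is to compare .env against .env.example. The sketch below is illustrative and not part of the repository. It assumes both files use simple KEY=value lines, and the function names parse_env and unset_keys are hypothetical. It reads key names from .env.example rather than assuming any specific ones.

```python
from pathlib import Path


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and quotes from the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def unset_keys(example_text: str, env_text: str) -> list[str]:
    """Keys listed in the example file that are missing or empty in .env."""
    expected = parse_env(example_text)
    actual = parse_env(env_text)
    return [key for key in expected if not actual.get(key)]


if __name__ == "__main__":
    example, env = Path(".env.example"), Path(".env")
    if example.exists() and env.exists():
        missing = unset_keys(example.read_text(), env.read_text())
        print("Unset keys:", ", ".join(missing) if missing else "none")
    else:
        print("Expected both .env.example and .env in the current directory")
```

The check only detects missing or empty values. A placeholder string copied unchanged from .env.example still counts as set.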

Run the inference:

To reproduce the results with GPT-4o, run:

python run_gaia_workforce.py

To reproduce the results with Claude-3.7, run:

python run_gaia_workforce_claude.py

You can modify the test_idx variable to specify which test case to run.
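When comparing your own runs with the pass@1 and pass@3 figures quoted in the Inference section, it helps to keep the two metrics apart. The sketch below assumes the common reading of pass@k, where a task counts as solved if at least one of its first k attempts is correct. The repository's scoring code may differ. The function name pass_at_k and the toy data are made up for illustration and are not GAIA results.

```python
def pass_at_k(attempts_per_task: list[list[bool]], k: int) -> float:
    """Percentage of tasks with at least one correct result in the first k attempts."""
    if not attempts_per_task:
        raise ValueError("need at least one task")
    solved = sum(any(attempts[:k]) for attempts in attempts_per_task)
    return 100.0 * solved / len(attempts_per_task)


# Toy data: three tasks, three attempts each (True means a correct answer).
toy_runs = [
    [True, False, False],   # solved on the first attempt
    [False, True, False],   # solved only on the second attempt
    [False, False, False],  # never solved
]

if __name__ == "__main__":
    print(f"pass@1: {pass_at_k(toy_runs, 1):.2f}%")
    print(f"pass@3: {pass_at_k(toy_runs, 3):.2f}%")
```

Under this reading, pass@3 can never be lower than pass@1 on the same runs, so the two reported numbers are not directly comparable across models.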