docs: Update CodeAct agent documentation (#5418)

Co-authored-by: openhands <openhands@all-hands.dev>
2025-12-26 05:48:36 +08:00 · 2024-12-05 09:25:54 -05:00 · 2024-12-05 09:25:54 -05:00 · 83b94786a3
commit 83b94786a3
parent 786cde39fd
1 changed files with 75 additions and 28 deletions
--- a/openhands/agenthub/codeact_agent/README.md
+++ b/openhands/agenthub/codeact_agent/README.md
@@ -1,28 +1,75 @@
 # CodeAct Agent Framework

-This folder implements the CodeAct idea ([paper](https://arxiv.org/abs/2402.01030), [tweet](https://twitter.com/xingyaow_/status/1754556835703751087)) that consolidates LLM agents’ **act**ions into a unified **code** action space for both *simplicity* and *performance* (see paper for more details).
+This folder implements OpenHands's main agent, the CodeAct Agent. It is based on CodeAct ([paper](https://arxiv.org/abs/2402.01030), [tweet](https://twitter.com/xingyaow_/status/1754556835703751087)), the idea of consolidating LLM agents' **act**ions into a unified **code** action space for both *simplicity* and *performance*.

-The conceptual idea is illustrated below. At each turn, the agent can:
+## Overview
+
+The CodeAct agent operates through a function calling interface. At each turn, the agent can:

 1. **Converse**: Communicate with humans in natural language to ask for clarification, confirmation, etc.
-2. **CodeAct**: Choose to perform the task by executing code
-   - Execute any valid Linux `bash` command
-   - Execute any valid `Python` code with [an interactive Python interpreter](https://ipython.org/). This is simulated through `bash` command, see plugin system below for more details.
+2. **CodeAct**: Execute actions through a set of well-defined tools:
+   - Execute Linux `bash` commands with `execute_bash`
+   - Run Python code in an [IPython](https://ipython.org/) environment with `execute_ipython_cell`
+   - Interact with web browsers using `browser` and `web_read`
+   - Edit files using `str_replace_editor` or `edit_file`

 ![image](https://github.com/All-Hands-AI/OpenHands/assets/38853559/92b622e3-72ad-4a61-8f41-8c040b6d5fb3)

+## Built-in Tools
+
+The agent provides several built-in tools:
+
+### 1. `execute_bash`
+- Execute any valid Linux bash command
+- Handles long-running commands by running them in the background with output redirection
+- Supports interactive processes with STDIN input and process interruption
+- Handles command timeouts with automatic retry in background mode
+
+### 2. `execute_ipython_cell`
+- Run Python code in an IPython environment
+- Supports magic commands like `%pip`
+- Variables are scoped to the IPython environment
+- Requires defining variables and importing packages before use
+
+### 3. `web_read` and `browser`
+- `web_read`: Read and convert webpage content to markdown
+- `browser`: Interact with webpages through Python code
+- Supports common browser actions like navigation, clicking, form filling, scrolling
+- Handles file uploads and drag-and-drop operations
+
+### 4. `str_replace_editor`
+- View, create and edit files through string replacement
+- Keeps state persistent across command calls
+- Shows files with line numbers
+- Replaces strings by exact match
+- Supports undo for edits
+
+### 5. `edit_file` (LLM-based)
+- Edit files using LLM-based content generation
+- Supports partial file edits with line ranges
+- Handles large files by editing specific sections
+- Provides an append mode for adding content to files
+
+## Configuration
+
+Tools can be enabled/disabled through configuration parameters:
+- `codeact_enable_browsing`: Enable browser interaction tools
+- `codeact_enable_jupyter`: Enable IPython code execution
+- `codeact_enable_llm_editor`: Enable LLM-based file editing (falls back to string replacement editor if disabled)
+
+## Micro-agents
+
+The agent includes specialized micro-agents for specific tasks:
+
+1. **npm**: Handles npm package installation with non-interactive shell workarounds
+2. **github**: Manages GitHub operations with API token support and PR creation guidelines
+3. **flarglebargle**: Easter egg response handler
+
 ## Adding New Tools

-The CodeAct agent uses a function calling interface to define tools that the agent can use. Tools are defined in `function_calling.py` using the `ChatCompletionToolParam` class from `litellm`. Each tool consists of:
-
-1. A description string that explains what the tool does and how to use it
-2. A tool definition using `ChatCompletionToolParam` that specifies:
-   - The tool's name
-   - The tool's parameters and their types
-   - Required vs optional parameters
-
-Here's an example of how a tool is defined:
+The CodeAct agent uses a function calling interface based on `litellm`'s `ChatCompletionToolParam`. To add a new tool:

+1. Define the tool in `function_calling.py`:
 ```python
 MyTool = ChatCompletionToolParam(
    type='function',
@@ -47,20 +94,20 @@ MyTool = ChatCompletionToolParam(
 )
 ```

-To add a new tool:
+2. Add the tool to `get_tools()` in `function_calling.py`
+3. Implement the corresponding action handler in the agent class

-1. Define your tool in `function_calling.py` following the pattern above
-2. Add your tool to the `get_tools()` function in `function_calling.py`
-3. Implement the corresponding action handler in the agent to process the tool's invocation
+## Implementation Details

-The agent currently supports several built-in tools:
- `execute_bash`: Execute bash commands
- `execute_ipython_cell`: Run Python code in IPython
- `browser`: Interact with a web browser
- `str_replace_editor`: Edit files using string replacement
- `edit_file`: Edit files using LLM-based editing
+The agent is implemented in two main files:

-Tools can be enabled/disabled through configuration parameters:
- `codeact_enable_browsing`: Enable browser interaction
- `codeact_enable_jupyter`: Enable IPython code execution
- `codeact_enable_llm_editor`: Enable LLM-based file editing (if disabled, uses string replacement editor instead)
+1. `codeact_agent.py`: Core agent implementation with:
+   - Message history management
+   - Tool execution handling
+   - State management
+   - Action/observation processing
+
+2. `function_calling.py`: Tool definitions and function calling interface with:
+   - Tool parameter specifications
+   - Tool descriptions and examples
+   - Function calling response parsing
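
Review note: the registration steps under "Adding New Tools" and the flags under "Configuration" can be sketched as below. This is a minimal, self-contained illustration, not the contents of `function_calling.py`. The `word_count` tool, `make_stub_tool`, `get_tools_sketch`, and `handle_tool_call` are all hypothetical names, and plain dicts in the OpenAI-style function-tool shape stand in for `litellm`'s `ChatCompletionToolParam` so the sketch runs without `litellm` installed. The real `get_tools()` signature and tool ordering may differ.

```python
from typing import Any

# Assumption: a tool is represented as a plain dict in the OpenAI-style
# function-tool shape, standing in for litellm's ChatCompletionToolParam.
Tool = dict[str, Any]


def make_stub_tool(name: str, description: str) -> Tool:
    """Build a placeholder tool definition with no parameters (illustrative only)."""
    return {
        'type': 'function',
        'function': {
            'name': name,
            'description': description,
            'parameters': {'type': 'object', 'properties': {}, 'required': []},
        },
    }


# Step 1: define the new tool. `word_count` is hypothetical, not an OpenHands tool.
WordCountTool: Tool = {
    'type': 'function',
    'function': {
        'name': 'word_count',
        'description': 'Count the whitespace-separated words in a piece of text.',
        'parameters': {
            'type': 'object',
            'properties': {
                'text': {'type': 'string', 'description': 'Text to count words in.'},
            },
            'required': ['text'],
        },
    },
}


def get_tools_sketch(
    codeact_enable_browsing: bool = False,
    codeact_enable_jupyter: bool = False,
    codeact_enable_llm_editor: bool = False,
) -> list[Tool]:
    """Select tools from the configuration flags named in the README."""
    tools = [make_stub_tool('execute_bash', 'Execute a bash command.')]
    if codeact_enable_browsing:
        tools.append(make_stub_tool('web_read', 'Read a webpage as markdown.'))
        tools.append(make_stub_tool('browser', 'Interact with a webpage.'))
    if codeact_enable_jupyter:
        tools.append(make_stub_tool('execute_ipython_cell', 'Run Python in IPython.'))
    if codeact_enable_llm_editor:
        tools.append(make_stub_tool('edit_file', 'LLM-based file editing.'))
    else:
        # The README states the string replacement editor is the fallback.
        tools.append(make_stub_tool('str_replace_editor', 'String replacement editing.'))
    # Step 2: register the new tool alongside the built-in ones.
    tools.append(WordCountTool)
    return tools


def handle_tool_call(name: str, arguments: dict[str, Any]) -> Any:
    """Step 3: a hypothetical handler that dispatches on the tool name."""
    if name == 'word_count':
        return len(arguments['text'].split())
    raise ValueError(f'Unknown tool: {name}')
```

With all flags left at their defaults, the sketch returns `execute_bash`, `str_replace_editor`, and `word_count`, in that order. Enabling `codeact_enable_llm_editor` swaps `str_replace_editor` for `edit_file` instead of adding to it, which mirrors the fallback behavior described in the Configuration section.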