Tim O'Farrell af49b615b1
Add BatchedWebHookFileStore for batching webhook updates (#10119)
Co-authored-by: openhands <openhands@all-hands.dev>
2025-08-11 12:51:08 -06:00

128 lines
4.9 KiB
Markdown

# OpenHands Storage Module
The storage module provides different storage options for file operations in OpenHands, used for storing events, settings and other metadata. This module implements a common interface (`FileStore`) that allows for interchangeable storage backends.
**Usage:**
```python
store = ...
# Write, read, list, and delete operations
store.write("example.txt", "Hello, world!")
content = store.read("example.txt")
files = store.list("/")
store.delete("example.txt")
```
## Available Storage Options
### 1. Local File Storage (`local`)
Local file storage saves files to the local filesystem.
**Environment Variables:**
- None specific to this storage option
- Files are stored at the path specified by `file_store_path` in the configuration
### 2. In-Memory Storage (`memory`)
In-memory storage keeps files in memory, which is useful for testing or temporary storage.
**Environment Variables:**
- None
### 3. Amazon S3 Storage (`s3`)
S3 storage uses Amazon S3 or compatible services for file storage.
**Environment Variables:**
- The bucket name is specified by `file_store_path` in the configuration with a fallback to the `AWS_S3_BUCKET` environment variable.
- `AWS_ACCESS_KEY_ID`: Your AWS access key
- `AWS_SECRET_ACCESS_KEY`: Your AWS secret key
- `AWS_S3_ENDPOINT`: Optional custom endpoint for S3-compatible services (Allows overriding the default)
- `AWS_S3_SECURE`: Whether to use HTTPS (default: "true")
### 4. Google Cloud Storage (`google_cloud`)
Google Cloud Storage uses Google Cloud Storage buckets for file storage.
**Environment Variables:**
- The bucket name is specified by `file_store_path` in the configuration with a fallback to the `GOOGLE_CLOUD_BUCKET_NAME` environment variable.
- `GOOGLE_APPLICATION_CREDENTIALS`: Path to Google Cloud credentials JSON file
## Webhook Protocol
The webhook protocol allows for integration with external systems by sending HTTP requests when files are written or deleted.
### Overview
The `WebHookFileStore` wraps another `FileStore` implementation and sends HTTP requests to a specified URL whenever files are written or deleted. This enables real-time notifications and synchronization with external systems.
**Configuration Options:**
- `file_store_web_hook_url`: The base URL for webhook requests
- `file_store_web_hook_headers`: HTTP headers to include in webhook requests
- `file_store_web_hook_batch`: Whether to use batched webhook requests (default: false)
### Protocol Details
#### Standard Webhook Protocol (Non-Batched)
1. **File Write Operation**:
- When a file is written, a POST request is sent to `{base_url}{path}`
- The request body contains the file contents
- The operation is retried up to 3 times with a 1-second delay between attempts
2. **File Delete Operation**:
- When a file is deleted, a DELETE request is sent to `{base_url}{path}`
- The operation is retried up to 3 times with a 1-second delay between attempts
#### Batched Webhook Protocol
The `BatchedWebHookFileStore` extends the webhook functionality by batching multiple file operations into a single request, which can significantly improve performance when many files are being modified in a short period of time.
1. **Batch Request**:
- A single POST request is sent to `{base_url}` with a JSON array in the body
- Each item in the array contains:
- `method`: "POST" for write operations, "DELETE" for delete operations
- `path`: The file path
- `content`: The file contents (for write operations only)
- `encoding`: "base64" if binary content was base64-encoded (optional)
2. **Batch Triggering**:
- Batches are sent when one of the following conditions is met:
- A timeout period has elapsed (defaults to 5 seconds, configurable via constructor parameter)
- The total size of batched content exceeds a size limit (defaults to 1MB, configurable via constructor parameter)
- The `flush()` method is explicitly called
3. **Error Handling**:
- The batch request is retried up to 3 times with a 1-second delay between attempts
## Configuration
To configure the storage module in OpenHands, use the following configuration options:
```toml
[core]
# File store type: "local", "memory", "s3", "google_cloud"
file_store = "local"
# Path for local file store
file_store_path = "/tmp/file_store"
# Optional webhook URL
file_store_web_hook_url = "https://example.com/api/files"
# Optional webhook headers (JSON string)
file_store_web_hook_headers = '{"Authorization": "Bearer token"}'
# Optional batched webhook mode (default: false)
file_store_web_hook_batch = true
```
**Batched Webhook Configuration:**
The batched webhook behavior uses predefined constants with the following default values:
- Batch timeout: 5 seconds
- Batch size limit: 1MB (1048576 bytes)
These values can be customized by passing `batch_timeout_seconds` and `batch_size_limit_bytes` parameters to the `BatchedWebHookFileStore` constructor.