OpenHands Enterprise Telemetry Service M1 (#11468)

Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Ray Myers <ray.myers@gmail.com>
This commit is contained in:
John-Mason P. Shackelford 2025-10-27 09:01:56 -04:00 committed by GitHub
parent 3ec8d70d04
commit 26c636d63e
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
7 changed files with 1515 additions and 0 deletions

View File

@ -0,0 +1,856 @@
# OpenHands Enterprise Usage Telemetry Service
## Table of Contents
1. [Introduction](#1-introduction)
- 1.1 [Problem Statement](#11-problem-statement)
- 1.2 [Proposed Solution](#12-proposed-solution)
2. [User Interface](#2-user-interface)
- 2.1 [License Warning Banner](#21-license-warning-banner)
- 2.2 [Administrator Experience](#22-administrator-experience)
3. [Other Context](#3-other-context)
- 3.1 [Replicated Platform Integration](#31-replicated-platform-integration)
- 3.2 [Administrator Email Detection Strategy](#32-administrator-email-detection-strategy)
- 3.3 [Metrics Collection Framework](#33-metrics-collection-framework)
4. [Technical Design](#4-technical-design)
- 4.1 [Database Schema](#41-database-schema)
- 4.1.1 [Telemetry Metrics Table](#411-telemetry-metrics-table)
- 4.1.2 [Telemetry Identity Table](#412-telemetry-identity-table)
- 4.2 [Metrics Collection Framework](#42-metrics-collection-framework)
- 4.2.1 [Base Collector Interface](#421-base-collector-interface)
- 4.2.2 [Collector Registry](#422-collector-registry)
- 4.2.3 [Example Collector Implementation](#423-example-collector-implementation)
- 4.3 [Collection and Upload System](#43-collection-and-upload-system)
- 4.3.1 [Metrics Collection Processor](#431-metrics-collection-processor)
- 4.3.2 [Replicated Upload Processor](#432-replicated-upload-processor)
- 4.4 [License Warning System](#44-license-warning-system)
- 4.4.1 [License Status Endpoint](#441-license-status-endpoint)
- 4.4.2 [UI Integration](#442-ui-integration)
- 4.5 [Cronjob Configuration](#45-cronjob-configuration)
- 4.5.1 [Collection Cronjob](#451-collection-cronjob)
- 4.5.2 [Upload Cronjob](#452-upload-cronjob)
5. [Implementation Plan](#5-implementation-plan)
- 5.1 [Database Schema and Models (M1)](#51-database-schema-and-models-m1)
- 5.1.1 [OpenHands - Database Migration](#511-openhands---database-migration)
- 5.1.2 [OpenHands - Model Tests](#512-openhands---model-tests)
- 5.2 [Metrics Collection Framework (M2)](#52-metrics-collection-framework-m2)
- 5.2.1 [OpenHands - Core Collection Framework](#521-openhands---core-collection-framework)
- 5.2.2 [OpenHands - Example Collectors](#522-openhands---example-collectors)
- 5.2.3 [OpenHands - Framework Tests](#523-openhands---framework-tests)
- 5.3 [Collection and Upload Processors (M3)](#53-collection-and-upload-processors-m3)
- 5.3.1 [OpenHands - Collection Processor](#531-openhands---collection-processor)
- 5.3.2 [OpenHands - Upload Processor](#532-openhands---upload-processor)
- 5.3.3 [OpenHands - Integration Tests](#533-openhands---integration-tests)
- 5.4 [License Warning API (M4)](#54-license-warning-api-m4)
- 5.4.1 [OpenHands - License Status API](#541-openhands---license-status-api)
- 5.4.2 [OpenHands - API Integration](#542-openhands---api-integration)
- 5.5 [UI Warning Banner (M5)](#55-ui-warning-banner-m5)
- 5.5.1 [OpenHands - UI Warning Banner](#551-openhands---ui-warning-banner)
- 5.5.2 [OpenHands - UI Integration](#552-openhands---ui-integration)
- 5.6 [Helm Chart Deployment Configuration (M6)](#56-helm-chart-deployment-configuration-m6)
- 5.6.1 [OpenHands-Cloud - Cronjob Manifests](#561-openhands-cloud---cronjob-manifests)
- 5.6.2 [OpenHands-Cloud - Configuration Management](#562-openhands-cloud---configuration-management)
- 5.7 [Documentation and Enhanced Collectors (M7)](#57-documentation-and-enhanced-collectors-m7)
- 5.7.1 [OpenHands - Advanced Collectors](#571-openhands---advanced-collectors)
- 5.7.2 [OpenHands - Monitoring and Testing](#572-openhands---monitoring-and-testing)
- 5.7.3 [OpenHands - Technical Documentation](#573-openhands---technical-documentation)
## 1. Introduction
### 1.1 Problem Statement
OpenHands Enterprise (OHE) helm charts are publicly available but not open source, creating a visibility gap for the sales team. Unknown users can install and use OHE without the vendor's knowledge, preventing proper customer engagement and sales pipeline management. Without usage telemetry, the vendor cannot identify potential customers, track installation health, or proactively support users who may need assistance.
### 1.2 Proposed Solution
We propose implementing a comprehensive telemetry service that leverages the Replicated metrics platform and Python SDK to track OHE installations and usage. The solution provides automatic customer discovery, instance monitoring, and usage metrics collection while maintaining a clear license compliance pathway.
The system consists of three main components: (1) a pluggable metrics collection framework that allows developers to easily define and register custom metrics collectors, (2) automated cronjobs that periodically collect metrics and upload them to Replicated's vendor portal, and (3) a license compliance warning system that displays UI notifications when telemetry uploads fail, indicating potential license expiration.
The design ensures that telemetry cannot be easily disabled without breaking core OHE functionality by tying the warning system to environment variables that are essential for OHE operation. This approach balances user transparency with business requirements for customer visibility.
## 2. User Interface
### 2.1 License Warning Banner
When telemetry uploads fail for more than 4 days, users will see a prominent warning banner in the OpenHands Enterprise UI:
```
⚠️ Your OpenHands Enterprise license will expire in 30 days. Please contact support if this issue persists.
```
The banner appears at the top of all pages and cannot be permanently dismissed while the condition persists. Users can temporarily dismiss it, but it will reappear on page refresh until telemetry uploads resume successfully.
### 2.2 Administrator Experience
System administrators will not need to configure the telemetry system manually. The service automatically:
1. **Detects OHE installations** using existing required environment variables (`GITHUB_APP_CLIENT_ID`, `KEYCLOAK_SERVER_URL`, etc.)
2. **Generates unique customer identifiers** using administrator contact information:
- Customer email: Determined by the following priority order:
1. `OPENHANDS_ADMIN_EMAIL` environment variable (if set in helm values)
2. Email of the first user who accepted Terms of Service (earliest `accepted_tos` timestamp)
- Instance ID: Automatically generated by Replicated SDK using machine fingerprinting (IOPlatformUUID on macOS, D-Bus machine ID on Linux, Machine GUID on Windows)
- **No Fallback**: If neither email source is available, telemetry collection is skipped until at least one user exists
3. **Collects and uploads metrics transparently** in the background via weekly collection and daily upload cronjobs
4. **Displays warnings only when necessary** for license compliance - no notifications appear during normal operation
## 3. Other Context
### 3.1 Replicated Platform Integration
The Replicated platform provides vendor-hosted infrastructure for collecting customer and instance telemetry. The Python SDK handles authentication, state management, and reliable metric delivery. Key concepts:
- **Customer**: Represents a unique OHE installation, identified by email or installation fingerprint
- **Instance**: Represents a specific deployment of OHE for a customer
- **Metrics**: Custom key-value data points collected from the installation
- **Status**: Instance health indicators (running, degraded, updating, etc.)
The SDK automatically handles machine fingerprinting, local state caching, and retry logic for failed uploads.
### 3.2 Administrator Email Detection Strategy
To identify the appropriate administrator contact for sales outreach, the system uses a three-tier approach that avoids performance penalties on user authentication:
**Tier 1: Explicit Configuration** - The `OPENHANDS_ADMIN_EMAIL` environment variable allows administrators to explicitly specify the contact email during deployment.
**Tier 2: First Active User Detection** - If no explicit email is configured, the system identifies the first user who accepted Terms of Service (earliest `accepted_tos` timestamp with a valid email). This represents the first person to actively engage with the system and is very likely the administrator or installer.
**No Fallback Needed** - If neither email source is available, telemetry collection is skipped entirely. This ensures we only report meaningful usage data when there are actual active users.
**Performance Optimization**: The admin email determination is performed only during telemetry upload attempts, ensuring zero performance impact on user login flows.
### 3.3 Metrics Collection Framework
The proposed collector framework allows developers to define metrics in a single file change:
```python
@register_collector("user_activity")
class UserActivityCollector(MetricsCollector):
def collect(self) -> Dict[str, Any]:
# Query database and return metrics
return {"active_users_7d": count, "conversations_created": total}
```
Collectors are automatically discovered and executed by the collection cronjob, making the system extensible without modifying core collection logic.
## 4. Technical Design
### 4.1 Database Schema
#### 4.1.1 Telemetry Metrics Table
Stores collected metrics with transmission status tracking:
```sql
CREATE TABLE telemetry_metrics (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
collected_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP,
metrics_data JSONB NOT NULL,
uploaded_at TIMESTAMP WITH TIME ZONE NULL,
upload_attempts INTEGER DEFAULT 0,
last_upload_error TEXT NULL,
created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_telemetry_metrics_collected_at ON telemetry_metrics(collected_at);
CREATE INDEX idx_telemetry_metrics_uploaded_at ON telemetry_metrics(uploaded_at);
```
#### 4.1.2 Telemetry Identity Table
Stores persistent identity information that must survive container restarts:
```sql
CREATE TABLE telemetry_identity (
id INTEGER PRIMARY KEY DEFAULT 1,
customer_id VARCHAR(255) NULL,
instance_id VARCHAR(255) NULL,
created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
CONSTRAINT single_identity_row CHECK (id = 1)
);
```
**Design Rationale:**
- **Separation of Concerns**: Identity data (customer_id, instance_id) is separated from operational data
- **Persistent vs Computed**: Only data that cannot be reliably recomputed is persisted
- **Upload Tracking**: Upload timestamps are tied directly to the metrics they represent
- **Simplified Queries**: System state can be derived from metrics table (e.g., `MAX(uploaded_at)` for last successful upload)
### 4.2 Metrics Collection Framework
#### 4.2.1 Base Collector Interface
```python
from abc import ABC, abstractmethod
from typing import Dict, Any, List
from dataclasses import dataclass
@dataclass
class MetricResult:
key: str
value: Any
class MetricsCollector(ABC):
"""Base class for metrics collectors."""
@abstractmethod
def collect(self) -> List[MetricResult]:
"""Collect metrics and return results."""
pass
@property
@abstractmethod
def collector_name(self) -> str:
"""Unique name for this collector."""
pass
def should_collect(self) -> bool:
"""Override to add collection conditions."""
return True
```
#### 4.2.2 Collector Registry
```python
from typing import Dict, Type, List
import importlib
import pkgutil
class CollectorRegistry:
"""Registry for metrics collectors."""
def __init__(self):
self._collectors: Dict[str, Type[MetricsCollector]] = {}
def register(self, collector_class: Type[MetricsCollector]) -> None:
"""Register a collector class."""
collector = collector_class()
self._collectors[collector.collector_name] = collector_class
def get_all_collectors(self) -> List[MetricsCollector]:
"""Get instances of all registered collectors."""
return [cls() for cls in self._collectors.values()]
def discover_collectors(self, package_path: str) -> None:
"""Auto-discover collectors in a package."""
# Implementation to scan for @register_collector decorators
pass
# Global registry instance
collector_registry = CollectorRegistry()
def register_collector(name: str):
"""Decorator to register a collector."""
def decorator(cls: Type[MetricsCollector]) -> Type[MetricsCollector]:
collector_registry.register(cls)
return cls
return decorator
```
#### 4.2.3 Example Collector Implementation
```python
@register_collector("system_metrics")
class SystemMetricsCollector(MetricsCollector):
"""Collects basic system and usage metrics."""
@property
def collector_name(self) -> str:
return "system_metrics"
def collect(self) -> List[MetricResult]:
results = []
# Collect user count
with session_maker() as session:
user_count = session.query(UserSettings).count()
results.append(MetricResult(
key="total_users",
value=user_count
))
# Collect conversation count (last 30 days)
thirty_days_ago = datetime.now(timezone.utc) - timedelta(days=30)
conversation_count = session.query(StoredConversationMetadata)\
.filter(StoredConversationMetadata.created_at >= thirty_days_ago)\
.count()
results.append(MetricResult(
key="conversations_30d",
value=conversation_count
))
return results
```
### 4.3 Collection and Upload System
#### 4.3.1 Metrics Collection Processor
```python
class TelemetryCollectionProcessor(MaintenanceTaskProcessor):
"""Maintenance task processor for collecting metrics."""
collection_interval_days: int = 7
async def __call__(self, task: MaintenanceTask) -> dict:
"""Collect metrics from all registered collectors."""
# Check if collection is needed
if not self._should_collect():
return {"status": "skipped", "reason": "too_recent"}
# Collect metrics from all registered collectors
all_metrics = {}
collector_results = {}
for collector in collector_registry.get_all_collectors():
try:
if collector.should_collect():
results = collector.collect()
for result in results:
all_metrics[result.key] = result.value
collector_results[collector.collector_name] = len(results)
except Exception as e:
logger.error(f"Collector {collector.collector_name} failed: {e}")
collector_results[collector.collector_name] = f"error: {e}"
# Store metrics in database
with session_maker() as session:
telemetry_record = TelemetryMetrics(
metrics_data=all_metrics,
collected_at=datetime.now(timezone.utc)
)
session.add(telemetry_record)
session.commit()
# Note: No need to track last_collection_at separately
# Can be derived from MAX(collected_at) in telemetry_metrics
return {
"status": "completed",
"metrics_collected": len(all_metrics),
"collectors_run": collector_results
}
def _should_collect(self) -> bool:
"""Check if collection is needed based on interval."""
with session_maker() as session:
# Get last collection time from metrics table
last_collected = session.query(func.max(TelemetryMetrics.collected_at)).scalar()
if not last_collected:
return True
time_since_last = datetime.now(timezone.utc) - last_collected
return time_since_last.days >= self.collection_interval_days
```
#### 4.3.2 Replicated Upload Processor
```python
from replicated import AsyncReplicatedClient, InstanceStatus
class TelemetryUploadProcessor(MaintenanceTaskProcessor):
"""Maintenance task processor for uploading metrics to Replicated."""
replicated_publishable_key: str
replicated_app_slug: str
async def __call__(self, task: MaintenanceTask) -> dict:
"""Upload pending metrics to Replicated."""
# Get pending metrics
with session_maker() as session:
pending_metrics = session.query(TelemetryMetrics)\
.filter(TelemetryMetrics.uploaded_at.is_(None))\
.order_by(TelemetryMetrics.collected_at)\
.all()
if not pending_metrics:
return {"status": "no_pending_metrics"}
# Get admin email - skip if not available
admin_email = self._get_admin_email()
if not admin_email:
logger.info("Skipping telemetry upload - no admin email available")
return {
"status": "skipped",
"reason": "no_admin_email",
"total_processed": 0
}
uploaded_count = 0
failed_count = 0
async with AsyncReplicatedClient(
publishable_key=self.replicated_publishable_key,
app_slug=self.replicated_app_slug
) as client:
# Get or create customer and instance
customer = await client.customer.get_or_create(
email_address=admin_email
)
instance = await customer.get_or_create_instance()
# Store customer/instance IDs for future use
await self._update_telemetry_identity(customer.customer_id, instance.instance_id)
# Upload each metric batch
for metric_record in pending_metrics:
try:
# Send individual metrics
for key, value in metric_record.metrics_data.items():
await instance.send_metric(key, value)
# Update instance status
await instance.set_status(InstanceStatus.RUNNING)
# Mark as uploaded
with session_maker() as session:
record = session.query(TelemetryMetrics)\
.filter(TelemetryMetrics.id == metric_record.id)\
.first()
if record:
record.uploaded_at = datetime.now(timezone.utc)
session.commit()
uploaded_count += 1
except Exception as e:
logger.error(f"Failed to upload metrics {metric_record.id}: {e}")
# Update error info
with session_maker() as session:
record = session.query(TelemetryMetrics)\
.filter(TelemetryMetrics.id == metric_record.id)\
.first()
if record:
record.upload_attempts += 1
record.last_upload_error = str(e)
session.commit()
failed_count += 1
# Note: No need to track last_successful_upload_at separately
# Can be derived from MAX(uploaded_at) in telemetry_metrics
return {
"status": "completed",
"uploaded": uploaded_count,
"failed": failed_count,
"total_processed": len(pending_metrics)
}
def _get_admin_email(self) -> str | None:
"""Get administrator email for customer identification."""
# 1. Check environment variable first
env_admin_email = os.getenv('OPENHANDS_ADMIN_EMAIL')
if env_admin_email:
logger.info("Using admin email from environment variable")
return env_admin_email
# 2. Use first active user's email (earliest accepted_tos)
with session_maker() as session:
first_user = session.query(UserSettings)\
.filter(UserSettings.email.isnot(None))\
.filter(UserSettings.accepted_tos.isnot(None))\
.order_by(UserSettings.accepted_tos.asc())\
.first()
if first_user and first_user.email:
logger.info(f"Using first active user email: {first_user.email}")
return first_user.email
# No admin email available - skip telemetry
logger.info("No admin email available - skipping telemetry collection")
return None
async def _update_telemetry_identity(self, customer_id: str, instance_id: str) -> None:
"""Update or create telemetry identity record."""
with session_maker() as session:
identity = session.query(TelemetryIdentity).first()
if not identity:
identity = TelemetryIdentity()
session.add(identity)
identity.customer_id = customer_id
identity.instance_id = instance_id
session.commit()
```
### 4.4 License Warning System
#### 4.4.1 License Status Endpoint
```python
from fastapi import APIRouter
from datetime import datetime, timezone, timedelta
license_router = APIRouter()
@license_router.get("/license-status")
async def get_license_status():
"""Get license warning status for UI display."""
# Only show warnings for OHE installations
if not _is_openhands_enterprise():
return {"warn": False, "message": ""}
with session_maker() as session:
# Get last successful upload time from metrics table
last_upload = session.query(func.max(TelemetryMetrics.uploaded_at))\
.filter(TelemetryMetrics.uploaded_at.isnot(None))\
.scalar()
if not last_upload:
# No successful uploads yet - show warning after 4 days
return {
"warn": True,
"message": "OpenHands Enterprise license verification pending. Please ensure network connectivity."
}
# Check if last successful upload was more than 4 days ago
days_since_upload = (datetime.now(timezone.utc) - last_upload).days
if days_since_upload > 4:
# Find oldest unsent batch
oldest_unsent = session.query(TelemetryMetrics)\
.filter(TelemetryMetrics.uploaded_at.is_(None))\
.order_by(TelemetryMetrics.collected_at)\
.first()
if oldest_unsent:
# Calculate expiration date (oldest unsent + 34 days)
expiration_date = oldest_unsent.collected_at + timedelta(days=34)
days_until_expiration = (expiration_date - datetime.now(timezone.utc)).days
if days_until_expiration <= 0:
message = "Your OpenHands Enterprise license has expired. Please contact support immediately."
else:
message = f"Your OpenHands Enterprise license will expire in {days_until_expiration} days. Please contact support if this issue persists."
return {"warn": True, "message": message}
return {"warn": False, "message": ""}
def _is_openhands_enterprise() -> bool:
"""Detect if this is an OHE installation."""
# Check for required OHE environment variables
required_vars = [
'GITHUB_APP_CLIENT_ID',
'KEYCLOAK_SERVER_URL',
'KEYCLOAK_REALM_NAME'
]
return all(os.getenv(var) for var in required_vars)
```
#### 4.4.2 UI Integration
The frontend will poll the license status endpoint and display warnings using the existing banner component pattern:
```typescript
// New component: LicenseWarningBanner.tsx
interface LicenseStatus {
warn: boolean;
message: string;
}
export function LicenseWarningBanner() {
const [licenseStatus, setLicenseStatus] = useState<LicenseStatus>({ warn: false, message: "" });
useEffect(() => {
const checkLicenseStatus = async () => {
try {
const response = await fetch('/api/license-status');
const status = await response.json();
setLicenseStatus(status);
} catch (error) {
console.error('Failed to check license status:', error);
}
};
// Check immediately and then every hour
checkLicenseStatus();
const interval = setInterval(checkLicenseStatus, 60 * 60 * 1000);
return () => clearInterval(interval);
}, []);
if (!licenseStatus.warn) {
return null;
}
return (
<div className="bg-red-600 text-white p-4 rounded flex items-center justify-between">
<div className="flex items-center">
<FaExclamationTriangle className="mr-3" />
<span>{licenseStatus.message}</span>
</div>
</div>
);
}
```
### 4.5 Cronjob Configuration
The cronjob configurations will be deployed via the OpenHands-Cloud helm charts.
#### 4.5.1 Collection Cronjob
The collection cronjob runs weekly to gather metrics:
```yaml
# charts/openhands/templates/telemetry-collection-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
name: {{ include "openhands.fullname" . }}-telemetry-collection
labels:
{{- include "openhands.labels" . | nindent 4 }}
spec:
schedule: "0 2 * * 0" # Weekly on Sunday at 2 AM
jobTemplate:
spec:
template:
spec:
containers:
- name: telemetry-collector
image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
env:
{{- include "openhands.env" . | nindent 12 }}
command:
- python
- -c
- |
from enterprise.storage.maintenance_task import MaintenanceTask, MaintenanceTaskStatus
from enterprise.storage.database import session_maker
from enterprise.server.telemetry.collection_processor import TelemetryCollectionProcessor
# Create collection task
processor = TelemetryCollectionProcessor()
task = MaintenanceTask()
task.set_processor(processor)
task.status = MaintenanceTaskStatus.PENDING
with session_maker() as session:
session.add(task)
session.commit()
restartPolicy: OnFailure
```
#### 4.5.2 Upload Cronjob
The upload cronjob runs daily to send metrics to Replicated:
```yaml
# charts/openhands/templates/telemetry-upload-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
name: {{ include "openhands.fullname" . }}-telemetry-upload
labels:
{{- include "openhands.labels" . | nindent 4 }}
spec:
schedule: "0 3 * * *" # Daily at 3 AM
jobTemplate:
spec:
template:
spec:
containers:
- name: telemetry-uploader
image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
env:
{{- include "openhands.env" . | nindent 12 }}
- name: REPLICATED_PUBLISHABLE_KEY
valueFrom:
secretKeyRef:
name: {{ include "openhands.fullname" . }}-replicated-config
key: publishable-key
- name: REPLICATED_APP_SLUG
value: {{ .Values.telemetry.replicatedAppSlug | default "openhands-enterprise" | quote }}
command:
- python
- -c
- |
from enterprise.storage.maintenance_task import MaintenanceTask, MaintenanceTaskStatus
from enterprise.storage.database import session_maker
from enterprise.server.telemetry.upload_processor import TelemetryUploadProcessor
import os
# Create upload task
processor = TelemetryUploadProcessor(
replicated_publishable_key=os.getenv('REPLICATED_PUBLISHABLE_KEY'),
replicated_app_slug=os.getenv('REPLICATED_APP_SLUG', 'openhands-enterprise')
)
task = MaintenanceTask()
task.set_processor(processor)
task.status = MaintenanceTaskStatus.PENDING
with session_maker() as session:
session.add(task)
session.commit()
restartPolicy: OnFailure
```
## 5. Implementation Plan
All implementation must pass existing lints and tests. New functionality requires comprehensive unit tests with >90% coverage. Integration tests should verify end-to-end telemetry flow including collection, storage, upload, and warning display.
### 5.1 Database Schema and Models (M1)
**Repository**: OpenHands
Establish the foundational database schema and SQLAlchemy models for telemetry data storage.
#### 5.1.1 OpenHands - Database Migration
- [ ] `enterprise/migrations/versions/077_create_telemetry_tables.py`
- [ ] `enterprise/storage/telemetry_metrics.py`
- [ ] `enterprise/storage/telemetry_config.py`
#### 5.1.2 OpenHands - Model Tests
- [ ] `enterprise/tests/unit/storage/test_telemetry_metrics.py`
- [ ] `enterprise/tests/unit/storage/test_telemetry_config.py`
**Demo**: Database tables created and models can store/retrieve telemetry data.
### 5.2 Metrics Collection Framework (M2)
**Repository**: OpenHands
Implement the pluggable metrics collection system with registry and base classes.
#### 5.2.1 OpenHands - Core Collection Framework
- [ ] `enterprise/server/telemetry/__init__.py`
- [ ] `enterprise/server/telemetry/collector_base.py`
- [ ] `enterprise/server/telemetry/collector_registry.py`
- [ ] `enterprise/server/telemetry/decorators.py`
#### 5.2.2 OpenHands - Example Collectors
- [ ] `enterprise/server/telemetry/collectors/__init__.py`
- [ ] `enterprise/server/telemetry/collectors/system_metrics.py`
- [ ] `enterprise/server/telemetry/collectors/user_activity.py`
#### 5.2.3 OpenHands - Framework Tests
- [ ] `enterprise/tests/unit/telemetry/test_collector_base.py`
- [ ] `enterprise/tests/unit/telemetry/test_collector_registry.py`
- [ ] `enterprise/tests/unit/telemetry/test_system_metrics.py`
**Demo**: Developers can create new collectors with a single file change using the @register_collector decorator.
### 5.3 Collection and Upload Processors (M3)
**Repository**: OpenHands
Implement maintenance task processors for collecting metrics and uploading to Replicated.
#### 5.3.1 OpenHands - Collection Processor
- [ ] `enterprise/server/telemetry/collection_processor.py`
- [ ] `enterprise/tests/unit/telemetry/test_collection_processor.py`
#### 5.3.2 OpenHands - Upload Processor
- [ ] `enterprise/server/telemetry/upload_processor.py`
- [ ] `enterprise/tests/unit/telemetry/test_upload_processor.py`
#### 5.3.3 OpenHands - Integration Tests
- [ ] `enterprise/tests/integration/test_telemetry_flow.py`
**Demo**: Metrics are automatically collected weekly and uploaded daily to Replicated vendor portal.
### 5.4 License Warning API (M4)
**Repository**: OpenHands
Implement the license status endpoint for the warning system.
#### 5.4.1 OpenHands - License Status API
- [ ] `enterprise/server/routes/license.py`
- [ ] `enterprise/tests/unit/routes/test_license.py`
#### 5.4.2 OpenHands - API Integration
- [ ] Update `enterprise/saas_server.py` to include license router
**Demo**: License status API returns warning status based on telemetry upload success.
### 5.5 UI Warning Banner (M5)
**Repository**: OpenHands
Implement the frontend warning banner component and integration.
#### 5.5.1 OpenHands - UI Warning Banner
- [ ] `frontend/src/components/features/license/license-warning-banner.tsx`
- [ ] `frontend/src/components/features/license/license-warning-banner.test.tsx`
#### 5.5.2 OpenHands - UI Integration
- [ ] Update main UI layout to include license warning banner
- [ ] Add license status polling service
**Demo**: License warnings appear in UI when telemetry uploads fail for >4 days, with accurate expiration countdown.
### 5.6 Helm Chart Deployment Configuration (M6)
**Repository**: OpenHands-Cloud
Create Kubernetes cronjob configurations and deployment scripts.
#### 5.6.1 OpenHands-Cloud - Cronjob Manifests
- [ ] `charts/openhands/templates/telemetry-collection-cronjob.yaml`
- [ ] `charts/openhands/templates/telemetry-upload-cronjob.yaml`
#### 5.6.2 OpenHands-Cloud - Configuration Management
- [ ] `charts/openhands/templates/replicated-secret.yaml`
- [ ] Update `charts/openhands/values.yaml` with telemetry configuration options:
```yaml
# Add to values.yaml
telemetry:
enabled: true
replicatedAppSlug: "openhands-enterprise"
adminEmail: "" # Optional: admin email for customer identification
# Add to deployment environment variables
env:
OPENHANDS_ADMIN_EMAIL: "{{ .Values.telemetry.adminEmail }}"
```
**Demo**: Complete telemetry system deployed via helm chart with configurable collection intervals and Replicated integration.
### 5.7 Documentation and Enhanced Collectors (M7)
**Repository**: OpenHands
Add comprehensive metrics collectors, monitoring capabilities, and documentation.
#### 5.7.1 OpenHands - Advanced Collectors
- [ ] `enterprise/server/telemetry/collectors/conversation_metrics.py`
- [ ] `enterprise/server/telemetry/collectors/integration_usage.py`
- [ ] `enterprise/server/telemetry/collectors/performance_metrics.py`
#### 5.7.2 OpenHands - Monitoring and Testing
- [ ] `enterprise/server/telemetry/monitoring.py`
- [ ] `enterprise/tests/e2e/test_telemetry_system.py`
- [ ] Performance tests for large-scale metric collection
#### 5.7.3 OpenHands - Technical Documentation
- [ ] `enterprise/server/telemetry/README.md`
- [ ] Update deployment documentation with telemetry configuration instructions
- [ ] Add troubleshooting guide for telemetry issues
**Demo**: Rich telemetry data flowing to vendor portal with comprehensive monitoring, alerting for system health, and complete documentation.

View File

@ -0,0 +1,129 @@
"""create telemetry tables
Revision ID: 078
Revises: 077
Create Date: 2025-10-21
"""
from typing import Sequence, Union
import sqlalchemy as sa
from alembic import op
# revision identifiers, used by Alembic.
revision: str = '078'
down_revision: Union[str, None] = '077'
branch_labels: Union[str, Sequence[str], None] = None
depends_on: Union[str, Sequence[str], None] = None
def upgrade() -> None:
"""Create telemetry tables for metrics collection and configuration."""
# Create telemetry_metrics table
op.create_table(
'telemetry_metrics',
sa.Column(
'id',
sa.String(), # UUID as string
nullable=False,
primary_key=True,
),
sa.Column(
'collected_at',
sa.DateTime(timezone=True),
nullable=False,
server_default=sa.text('CURRENT_TIMESTAMP'),
),
sa.Column(
'metrics_data',
sa.JSON(),
nullable=False,
),
sa.Column(
'uploaded_at',
sa.DateTime(timezone=True),
nullable=True,
),
sa.Column(
'upload_attempts',
sa.Integer(),
nullable=False,
server_default='0',
),
sa.Column(
'last_upload_error',
sa.Text(),
nullable=True,
),
sa.Column(
'created_at',
sa.DateTime(timezone=True),
nullable=False,
server_default=sa.text('CURRENT_TIMESTAMP'),
),
sa.Column(
'updated_at',
sa.DateTime(timezone=True),
nullable=False,
server_default=sa.text('CURRENT_TIMESTAMP'),
),
)
# Create indexes for telemetry_metrics
op.create_index(
'ix_telemetry_metrics_collected_at', 'telemetry_metrics', ['collected_at']
)
op.create_index(
'ix_telemetry_metrics_uploaded_at', 'telemetry_metrics', ['uploaded_at']
)
# Create telemetry_replicated_identity table (minimal persistent identity data)
op.create_table(
'telemetry_replicated_identity',
sa.Column(
'id',
sa.Integer(),
nullable=False,
primary_key=True,
server_default='1',
),
sa.Column(
'customer_id',
sa.String(255),
nullable=True,
),
sa.Column(
'instance_id',
sa.String(255),
nullable=True,
),
sa.Column(
'created_at',
sa.DateTime(timezone=True),
nullable=False,
server_default=sa.text('CURRENT_TIMESTAMP'),
),
sa.Column(
'updated_at',
sa.DateTime(timezone=True),
nullable=False,
server_default=sa.text('CURRENT_TIMESTAMP'),
),
)
# Add constraint to ensure single row in telemetry_replicated_identity
op.create_check_constraint(
'single_identity_row', 'telemetry_replicated_identity', 'id = 1'
)
def downgrade() -> None:
"""Drop telemetry tables."""
# Drop indexes first
op.drop_index('ix_telemetry_metrics_uploaded_at', 'telemetry_metrics')
op.drop_index('ix_telemetry_metrics_collected_at', 'telemetry_metrics')
# Drop tables
op.drop_table('telemetry_replicated_identity')
op.drop_table('telemetry_metrics')

View File

@ -0,0 +1,98 @@
"""SQLAlchemy model for telemetry identity.
This model stores persistent identity information that must survive container restarts
for the OpenHands Enterprise Telemetry Service.
"""
from datetime import UTC, datetime
from typing import Optional
from sqlalchemy import CheckConstraint, Column, DateTime, Integer, String
from storage.base import Base
class TelemetryIdentity(Base): # type: ignore
"""Stores persistent identity information for telemetry.
This table is designed to contain exactly one row (enforced by database constraint)
that maintains only the identity data that cannot be reliably recomputed:
- customer_id: Established relationship with Replicated
- instance_id: Generated once, must remain stable
Operational data like timestamps are derived from the telemetry_metrics table.
"""
__tablename__ = 'telemetry_replicated_identity'
__table_args__ = (CheckConstraint('id = 1', name='single_identity_row'),)
id = Column(Integer, primary_key=True, default=1)
customer_id = Column(String(255), nullable=True)
instance_id = Column(String(255), nullable=True)
created_at = Column(
DateTime(timezone=True),
default=lambda: datetime.now(UTC),
nullable=False,
)
updated_at = Column(
DateTime(timezone=True),
default=lambda: datetime.now(UTC),
onupdate=lambda: datetime.now(UTC),
nullable=False,
)
def __init__(
self,
customer_id: Optional[str] = None,
instance_id: Optional[str] = None,
**kwargs,
):
"""Initialize telemetry identity.
Args:
customer_id: Unique identifier for the customer
instance_id: Unique identifier for this OpenHands instance
**kwargs: Additional keyword arguments for SQLAlchemy
"""
super().__init__(**kwargs)
# Set defaults for fields that would normally be set by SQLAlchemy
now = datetime.now(UTC)
if not hasattr(self, 'created_at') or self.created_at is None:
self.created_at = now
if not hasattr(self, 'updated_at') or self.updated_at is None:
self.updated_at = now
# Force id to be 1 to maintain single-row constraint
self.id = 1
self.customer_id = customer_id
self.instance_id = instance_id
def set_customer_info(
self,
customer_id: Optional[str] = None,
instance_id: Optional[str] = None,
) -> None:
"""Update customer and instance identification information.
Args:
customer_id: Unique identifier for the customer
instance_id: Unique identifier for this OpenHands instance
"""
if customer_id is not None:
self.customer_id = customer_id
if instance_id is not None:
self.instance_id = instance_id
@property
def has_customer_info(self) -> bool:
"""Check if customer identification information is configured."""
return bool(self.customer_id and self.instance_id)
def __repr__(self) -> str:
return (
f"<TelemetryIdentity(customer_id='{self.customer_id}', "
f"instance_id='{self.instance_id}')>"
)
class Config:
from_attributes = True

View File

@ -0,0 +1,112 @@
"""SQLAlchemy model for telemetry metrics data.
This model stores individual metric collection records with upload tracking
and retry logic for the OpenHands Enterprise Telemetry Service.
"""
import uuid
from datetime import UTC, datetime
from typing import Any, Dict, Optional
from sqlalchemy import JSON, Column, DateTime, Integer, String, Text
from storage.base import Base
class TelemetryMetrics(Base): # type: ignore
"""Stores collected telemetry metrics with upload tracking.
Each record represents a single metrics collection event with associated
metadata for upload status and retry logic.
"""
__tablename__ = 'telemetry_metrics'
id = Column(String, primary_key=True, default=lambda: str(uuid.uuid4()))
collected_at = Column(
DateTime(timezone=True),
nullable=False,
default=lambda: datetime.now(UTC),
index=True,
)
metrics_data = Column(JSON, nullable=False)
uploaded_at = Column(DateTime(timezone=True), nullable=True, index=True)
upload_attempts = Column(Integer, nullable=False, default=0)
last_upload_error = Column(Text, nullable=True)
created_at = Column(
DateTime(timezone=True),
default=lambda: datetime.now(UTC),
nullable=False,
)
updated_at = Column(
DateTime(timezone=True),
default=lambda: datetime.now(UTC),
onupdate=lambda: datetime.now(UTC),
nullable=False,
)
def __init__(
self,
metrics_data: Dict[str, Any],
collected_at: Optional[datetime] = None,
**kwargs,
):
"""Initialize a new telemetry metrics record.
Args:
metrics_data: Dictionary containing the collected metrics
collected_at: Timestamp when metrics were collected (defaults to now)
**kwargs: Additional keyword arguments for SQLAlchemy
"""
super().__init__(**kwargs)
# Set defaults for fields that would normally be set by SQLAlchemy
now = datetime.now(UTC)
if not hasattr(self, 'id') or self.id is None:
self.id = str(uuid.uuid4())
if not hasattr(self, 'upload_attempts') or self.upload_attempts is None:
self.upload_attempts = 0
if not hasattr(self, 'created_at') or self.created_at is None:
self.created_at = now
if not hasattr(self, 'updated_at') or self.updated_at is None:
self.updated_at = now
self.metrics_data = metrics_data
if collected_at:
self.collected_at = collected_at
elif not hasattr(self, 'collected_at') or self.collected_at is None:
self.collected_at = now
def mark_uploaded(self) -> None:
"""Mark this metrics record as successfully uploaded."""
self.uploaded_at = datetime.now(UTC)
self.last_upload_error = None
def mark_upload_failed(self, error_message: str) -> None:
"""Mark this metrics record as having failed upload.
Args:
error_message: Description of the upload failure
"""
self.upload_attempts += 1
self.last_upload_error = error_message
self.uploaded_at = None
@property
def is_uploaded(self) -> bool:
"""Check if this metrics record has been successfully uploaded."""
return self.uploaded_at is not None
@property
def needs_retry(self) -> bool:
"""Check if this metrics record needs upload retry (failed but not too many attempts)."""
return not self.is_uploaded and self.upload_attempts < 3
def __repr__(self) -> str:
return (
f"<TelemetryMetrics(id='{self.id}', "
f"collected_at='{self.collected_at}', "
f'uploaded={self.is_uploaded})>'
)
class Config:
from_attributes = True

View File

@ -0,0 +1 @@
# Storage unit tests

View File

@ -0,0 +1,129 @@
"""Unit tests for TelemetryIdentity model.
Tests the persistent identity storage for the OpenHands Enterprise Telemetry Service.
"""
from datetime import datetime
from storage.telemetry_identity import TelemetryIdentity
class TestTelemetryIdentity:
"""Test cases for TelemetryIdentity model."""
def test_create_identity_with_defaults(self):
"""Test creating identity with default values."""
identity = TelemetryIdentity()
assert identity.id == 1
assert identity.customer_id is None
assert identity.instance_id is None
assert isinstance(identity.created_at, datetime)
assert isinstance(identity.updated_at, datetime)
def test_create_identity_with_values(self):
"""Test creating identity with specific values."""
customer_id = 'cust_123'
instance_id = 'inst_456'
identity = TelemetryIdentity(customer_id=customer_id, instance_id=instance_id)
assert identity.id == 1
assert identity.customer_id == customer_id
assert identity.instance_id == instance_id
def test_set_customer_info(self):
"""Test updating customer information."""
identity = TelemetryIdentity()
# Update customer info
identity.set_customer_info(
customer_id='new_customer', instance_id='new_instance'
)
assert identity.customer_id == 'new_customer'
assert identity.instance_id == 'new_instance'
def test_set_customer_info_partial(self):
"""Test partial updates of customer information."""
identity = TelemetryIdentity(
customer_id='original_customer', instance_id='original_instance'
)
# Update only customer_id
identity.set_customer_info(customer_id='updated_customer')
assert identity.customer_id == 'updated_customer'
assert identity.instance_id == 'original_instance'
# Update only instance_id
identity.set_customer_info(instance_id='updated_instance')
assert identity.customer_id == 'updated_customer'
assert identity.instance_id == 'updated_instance'
def test_set_customer_info_with_none(self):
"""Test that None values don't overwrite existing data."""
identity = TelemetryIdentity(
customer_id='existing_customer', instance_id='existing_instance'
)
# Call with None values - should not change existing data
identity.set_customer_info(customer_id=None, instance_id=None)
assert identity.customer_id == 'existing_customer'
assert identity.instance_id == 'existing_instance'
def test_has_customer_info_property(self):
"""Test has_customer_info property logic."""
identity = TelemetryIdentity()
# Initially false (both None)
assert not identity.has_customer_info
# Still false with only customer_id
identity.customer_id = 'customer_123'
assert not identity.has_customer_info
# Still false with only instance_id
identity.customer_id = None
identity.instance_id = 'instance_456'
assert not identity.has_customer_info
# True when both are set
identity.customer_id = 'customer_123'
identity.instance_id = 'instance_456'
assert identity.has_customer_info
def test_has_customer_info_with_empty_strings(self):
"""Test has_customer_info with empty strings."""
identity = TelemetryIdentity(customer_id='', instance_id='')
# Empty strings should be falsy
assert not identity.has_customer_info
def test_repr_method(self):
"""Test string representation of identity."""
identity = TelemetryIdentity(
customer_id='test_customer', instance_id='test_instance'
)
repr_str = repr(identity)
assert 'TelemetryIdentity' in repr_str
assert 'test_customer' in repr_str
assert 'test_instance' in repr_str
def test_id_forced_to_one(self):
"""Test that ID is always forced to 1."""
identity = TelemetryIdentity()
assert identity.id == 1
# Even if we try to set a different ID in constructor
identity2 = TelemetryIdentity(customer_id='test')
assert identity2.id == 1
def test_timestamps_are_set(self):
"""Test that timestamps are properly set."""
identity = TelemetryIdentity()
assert identity.created_at is not None
assert identity.updated_at is not None
assert isinstance(identity.created_at, datetime)
assert isinstance(identity.updated_at, datetime)

View File

@ -0,0 +1,190 @@
"""Unit tests for TelemetryMetrics model."""
import uuid
from datetime import UTC, datetime
from storage.telemetry_metrics import TelemetryMetrics
class TestTelemetryMetrics:
"""Test cases for TelemetryMetrics model."""
def test_init_with_metrics_data(self):
"""Test initialization with metrics data."""
metrics_data = {
'cpu_usage': 75.5,
'memory_usage': 1024,
'active_sessions': 5,
}
metrics = TelemetryMetrics(metrics_data=metrics_data)
assert metrics.metrics_data == metrics_data
assert metrics.upload_attempts == 0
assert metrics.uploaded_at is None
assert metrics.last_upload_error is None
assert metrics.collected_at is not None
assert metrics.created_at is not None
assert metrics.updated_at is not None
def test_init_with_custom_collected_at(self):
"""Test initialization with custom collected_at timestamp."""
metrics_data = {'test': 'value'}
custom_time = datetime(2023, 1, 1, 12, 0, 0, tzinfo=UTC)
metrics = TelemetryMetrics(metrics_data=metrics_data, collected_at=custom_time)
assert metrics.collected_at == custom_time
def test_mark_uploaded(self):
"""Test marking metrics as uploaded."""
metrics = TelemetryMetrics(metrics_data={'test': 'data'})
# Initially not uploaded
assert not metrics.is_uploaded
assert metrics.uploaded_at is None
# Mark as uploaded
metrics.mark_uploaded()
assert metrics.is_uploaded
def test_mark_upload_failed(self):
"""Test marking upload as failed."""
metrics = TelemetryMetrics(metrics_data={'test': 'data'})
error_message = 'Network timeout'
# Initially no failures
assert metrics.upload_attempts == 0
assert metrics.last_upload_error is None
# Mark as failed
metrics.mark_upload_failed(error_message)
assert metrics.upload_attempts == 1
assert metrics.last_upload_error == error_message
assert metrics.uploaded_at is None
assert not metrics.is_uploaded
def test_multiple_upload_failures(self):
"""Test multiple upload failures increment attempts."""
metrics = TelemetryMetrics(metrics_data={'test': 'data'})
metrics.mark_upload_failed('Error 1')
assert metrics.upload_attempts == 1
metrics.mark_upload_failed('Error 2')
assert metrics.upload_attempts == 2
assert metrics.last_upload_error == 'Error 2'
def test_is_uploaded_property(self):
"""Test is_uploaded property."""
metrics = TelemetryMetrics(metrics_data={'test': 'data'})
# Initially not uploaded
assert not metrics.is_uploaded
# After marking uploaded
metrics.mark_uploaded()
assert metrics.is_uploaded
def test_needs_retry_property(self):
"""Test needs_retry property logic."""
metrics = TelemetryMetrics(metrics_data={'test': 'data'})
# Initially needs retry (0 attempts, not uploaded)
assert metrics.needs_retry
# After 1 failure, still needs retry
metrics.mark_upload_failed('Error 1')
assert metrics.needs_retry
# After 2 failures, still needs retry
metrics.mark_upload_failed('Error 2')
assert metrics.needs_retry
# After 3 failures, no more retries
metrics.mark_upload_failed('Error 3')
assert not metrics.needs_retry
# Reset and test successful upload
metrics2 = TelemetryMetrics(metrics_data={'test': 'data'}) # type: ignore[unreachable]
metrics2.mark_uploaded()
# After upload, needs_retry should be False since is_uploaded is True
def test_upload_failure_clears_uploaded_at(self):
"""Test that upload failure clears uploaded_at timestamp."""
metrics = TelemetryMetrics(metrics_data={'test': 'data'})
# Mark as uploaded first
metrics.mark_uploaded()
assert metrics.uploaded_at is not None
# Mark as failed - should clear uploaded_at
metrics.mark_upload_failed('Network error')
assert metrics.uploaded_at is None
def test_successful_upload_clears_error(self):
"""Test that successful upload clears error message."""
metrics = TelemetryMetrics(metrics_data={'test': 'data'})
# Mark as failed first
metrics.mark_upload_failed('Network error')
assert metrics.last_upload_error == 'Network error'
# Mark as uploaded - should clear error
metrics.mark_uploaded()
assert metrics.last_upload_error is None
def test_uuid_generation(self):
"""Test that each instance gets a unique UUID."""
metrics1 = TelemetryMetrics(metrics_data={'test': 'data1'})
metrics2 = TelemetryMetrics(metrics_data={'test': 'data2'})
assert metrics1.id != metrics2.id
assert isinstance(uuid.UUID(metrics1.id), uuid.UUID)
assert isinstance(uuid.UUID(metrics2.id), uuid.UUID)
def test_repr(self):
"""Test string representation."""
metrics = TelemetryMetrics(metrics_data={'test': 'data'})
repr_str = repr(metrics)
assert 'TelemetryMetrics' in repr_str
assert metrics.id in repr_str
assert str(metrics.collected_at) in repr_str
assert 'uploaded=False' in repr_str
# Test after upload
metrics.mark_uploaded()
repr_str = repr(metrics)
assert 'uploaded=True' in repr_str
def test_complex_metrics_data(self):
"""Test with complex nested metrics data."""
complex_data = {
'system': {
'cpu': {'usage': 75.5, 'cores': 8},
'memory': {'total': 16384, 'used': 8192},
},
'sessions': [
{'id': 'session1', 'duration': 3600},
{'id': 'session2', 'duration': 1800},
],
'timestamp': '2023-01-01T12:00:00Z',
}
metrics = TelemetryMetrics(metrics_data=complex_data)
assert metrics.metrics_data == complex_data
def test_empty_metrics_data(self):
"""Test with empty metrics data."""
metrics = TelemetryMetrics(metrics_data={})
assert metrics.metrics_data == {}
def test_config_class(self):
"""Test that Config class is properly set."""
assert hasattr(TelemetryMetrics, 'Config')
assert TelemetryMetrics.Config.from_attributes is True