- Apply ruff formatting to test file
- Add type: ignore comment for mypy unreachable false positive in partial failure test
The mypy warning was a false positive: mypy cannot track that MagicMock
attributes are modified during test execution.
Co-authored-by: openhands <openhands@all-hands.dev>
- Fix mypy type error: collector_results should store int not str
- Remove unused variables in test files (F841 errors)
- All 99 tests still passing after fixes
Co-authored-by: openhands <openhands@all-hands.dev>
Run enterprise linter to fix trailing whitespace issues that were causing
CI lint checks to fail.
Changes:
- Removed trailing whitespace from empty lines throughout the document
- No content changes, only whitespace cleanup
Co-authored-by: openhands <openhands@all-hands.dev>
Enhance section 5.3 (Embedded Telemetry Service) implementation plan to explicitly
document the two-phase adaptive scheduling requirements that were added to the
technical design in the previous commit.
Changes to Implementation Checklist:
-------------------------------------
1. Updated Key Features section (5.3.1):
   - Added two-phase adaptive scheduling description
   - Documented bootstrap phase (3-minute checks)
   - Documented normal phase (1-hour checks, 7-day collection, 24-hour upload)
   - Added identity establishment detection requirement
   - Noted hardcoded publishable key (not environment variables)
2. Enhanced service.py checklist items:
   - Implement __init__() with hardcoded Replicated publishable key
   - Add two-phase interval constants (180s bootstrap, 3600s normal)
   - Implement _is_identity_established() method for phase detection
   - Implement _collection_loop() with adaptive intervals
   - Implement _upload_loop() with adaptive intervals and transition detection
   - Implement _get_admin_email() supporting bootstrap phase
   - Implement _get_or_create_identity() for Replicated integration
3. Added two-phase scheduling test requirements (5.3.3):
   - Test bootstrap phase: 3-minute check intervals before first user
   - Test phase transition: Immediate upload when first user authenticates
   - Test normal phase: 1-hour check intervals after identity established
   - Test identity detection: _is_identity_established() logic
   - Test error handling: Falls back to bootstrap interval on errors
4. Enhanced unit test checklist:
   - Test _is_identity_established() with no/partial/complete identity
   - Test interval selection logic (bootstrap vs normal)
   - Test phase transition detection in upload loop
5. Updated demo description:
   - Added: "New installations become visible within 3 minutes of first user login"
   - Clarified ongoing behavior after identity establishment
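The no/partial/complete identity cases from the unit test checklist might be covered along these lines. This is only a sketch: the `Identity` dataclass and the free function `is_identity_established` are hypothetical stand-ins for the TelemetryIdentity row and the service method, whose real signatures are not part of this message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Identity:
    """Hypothetical stand-in for a TelemetryIdentity row."""

    customer_id: Optional[str] = None
    instance_id: Optional[str] = None


def is_identity_established(identity: Optional[Identity]) -> bool:
    """Return True only when both customer_id and instance_id are set."""
    if identity is None:
        return False
    return bool(identity.customer_id) and bool(identity.instance_id)


def test_no_identity() -> None:
    assert is_identity_established(None) is False
    assert is_identity_established(Identity()) is False


def test_partial_identity() -> None:
    assert is_identity_established(Identity(customer_id="cust-1")) is False
    assert is_identity_established(Identity(instance_id="inst-1")) is False


def test_complete_identity() -> None:
    assert is_identity_established(Identity("cust-1", "inst-1")) is True
```

The functions follow the pytest naming convention, so a test runner would pick them up without extra wiring.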
Rationale:
----------
The previous commit added comprehensive two-phase scheduling to the technical
design (section 4.3), but the implementation checklist (section 5.3) still
described the original fixed-interval approach. This update ensures developers
implementing M3 have clear guidance on all the two-phase scheduling requirements.
The checklist now explicitly calls out:
- New methods to implement (_is_identity_established)
- New constants to define (bootstrap vs normal intervals)
- New logic to add (adaptive interval selection)
- New tests to write (phase detection and transition)
This aligns the implementation requirements with the technical design.
Co-authored-by: openhands <openhands@all-hands.dev>
Add adaptive scheduling to minimize time-to-visibility for new installations
while maintaining low overhead for established deployments.
Two-Phase Scheduling Strategy:
-------------------------------
Phase 1 (Bootstrap - No Identity):
- Triggered when no user has authenticated yet (no admin email available)
- Checks every 3 minutes for first user authentication
- Immediately collects and uploads metrics once first user authenticates
- Creates Replicated customer/instance identity on first successful upload
- Goal: Minimize time between installation and vendor visibility
Phase 2 (Normal Operations - Identity Established):
- Triggered after identity (customer_id + instance_id) exists in database
- Checks every hour (reduced from the 3-minute bootstrap interval)
- Collects metrics every 7 days
- Uploads metrics every 24 hours
- Goal: Maintain visibility with minimal resource overhead
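The relationship between the check interval and the collection/upload cadence can be sketched as follows. The constant names and the `is_due` helper are assumptions made for illustration; only the numeric values come from the strategy above.

```python
from typing import Optional

# Interval values taken from the two-phase strategy described above.
BOOTSTRAP_CHECK_INTERVAL_SECONDS = 180        # Phase 1: check every 3 minutes
NORMAL_CHECK_INTERVAL_SECONDS = 3600          # Phase 2: check every hour
COLLECTION_INTERVAL_SECONDS = 7 * 24 * 3600   # Phase 2: collect every 7 days
UPLOAD_INTERVAL_SECONDS = 24 * 3600           # Phase 2: upload every 24 hours


def select_check_interval(identity_established: bool) -> int:
    """Pick the wake-up interval for the current phase."""
    if identity_established:
        return NORMAL_CHECK_INTERVAL_SECONDS
    return BOOTSTRAP_CHECK_INTERVAL_SECONDS


def is_due(last_run: Optional[float], now: float, interval: int) -> bool:
    """Hypothetical helper: work is due if it never ran or the interval elapsed."""
    return last_run is None or now - last_run >= interval


# Wake-ups per day in each phase: 86400 / 180 = 480 and 86400 / 3600 = 24.
checks_per_day_bootstrap = 86400 // BOOTSTRAP_CHECK_INTERVAL_SECONDS
checks_per_day_normal = 86400 // NORMAL_CHECK_INTERVAL_SECONDS
```

In the normal phase most hourly wake-ups do nothing: a check only triggers work once `is_due` reports that the 24-hour or 7-day interval has elapsed.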
Implementation Details:
-----------------------
1. Added _is_identity_established() helper method
   - Checks if both customer_id and instance_id exist in TelemetryIdentity table
   - Returns True only when identity is fully established
2. Updated _collection_loop() with adaptive intervals
   - Uses 3-minute interval in bootstrap phase
   - Switches to 1-hour interval in normal phase
   - Logs debug messages during bootstrap phase
3. Updated _upload_loop() with adaptive intervals and immediate upload
   - Uses 3-minute interval in bootstrap phase
   - Switches to 1-hour interval in normal phase
   - Detects identity creation and logs first successful upload
   - Continues with short interval for one cycle after identity creation
4. Added configuration constants
   - bootstrap_check_interval_seconds = 180 (3 minutes)
   - normal_check_interval_seconds = 3600 (1 hour)
5. Enhanced error handling
   - Falls back to bootstrap interval on errors for faster retry
6. Updated class docstring with comprehensive two-phase explanation
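One way the adaptive loop behavior listed above could fit together is sketched below. This is not the service code: `run_cycle`, `run_loop`, and the injected `is_identity_established`, `do_work`, and `sleep` callables are hypothetical names chosen so the logic can run without a database or the Replicated SDK.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Optional

logger = logging.getLogger(__name__)

BOOTSTRAP_CHECK_INTERVAL_SECONDS = 180
NORMAL_CHECK_INTERVAL_SECONDS = 3600


async def run_cycle(
    is_identity_established: Callable[[], Awaitable[bool]],
    do_work: Callable[[], Awaitable[None]],
) -> int:
    """Run one check and return how long to sleep before the next one."""
    try:
        established_before = await is_identity_established()
        if not established_before:
            logger.debug("Bootstrap phase: no identity yet")
        await do_work()
        if not established_before and await is_identity_established():
            logger.info("Identity created; first successful upload")
        # The interval reflects the phase at the start of the cycle, so the
        # short interval is kept for one more cycle after identity creation.
        if established_before:
            return NORMAL_CHECK_INTERVAL_SECONDS
        return BOOTSTRAP_CHECK_INTERVAL_SECONDS
    except Exception:
        # Errors fall back to the bootstrap interval for a faster retry.
        logger.exception("Telemetry cycle failed; retrying on bootstrap interval")
        return BOOTSTRAP_CHECK_INTERVAL_SECONDS


async def run_loop(
    is_identity_established: Callable[[], Awaitable[bool]],
    do_work: Callable[[], Awaitable[None]],
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
    max_cycles: Optional[int] = None,
) -> None:
    """Repeat cycles forever, or max_cycles times when a limit is given."""
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        interval = await run_cycle(is_identity_established, do_work)
        await sleep(interval)
        cycles += 1
```

Passing `sleep` and `max_cycles` as parameters is a testing convenience in this sketch; it lets a test record the chosen intervals instead of actually waiting.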
Benefits:
---------
- New installations visible within 3 minutes of first user login (vs up to 1 hour)
- Established installations maintain low overhead (hourly checks vs constant polling)
- Graceful handling of installations that never get users
- Automatic phase transition without manual intervention
- Self-optimizing based on deployment state
Documentation Updates:
---------------------
- Section 4.3: Added two-phase strategy overview
- Class docstring: Added detailed phase explanation
- Method docstrings: Clarified phase-specific behavior
- Inline comments: Explained phase detection and interval selection
Co-authored-by: openhands <openhands@all-hands.dev>
Move Replicated publishable key from environment variables/Helm values directly
into source code to maximize difficulty of detection and tampering:
Service initialization:
- Hardcode publishable key directly in TelemetryService.__init__()
- Add a comprehensive comment explaining the security-through-obscurity approach
- Document why hardcoding is safe (vendor-wide key, write-only privileges)
- Emphasize this makes telemetry harder to detect, disable, and tamper with
Environment configuration (Section 4.5):
- Remove REPLICATED_PUBLISHABLE_KEY from environment variables
- Remove REPLICATED_APP_SLUG from environment variables (also hardcoded)
- Add note explaining keys are in source code, not configuration
- Keep only admin email and interval settings as configurable
Helm charts (Section 4.5.3-4.5.4):
- Remove telemetry-secret.yaml template (no longer needed)
- Remove publishable key from values.yaml
- Remove publishable key env var injection from deployment.yaml
- Add notes explaining hardcoded approach
Implementation plan (Section 5.6):
- Replace "Secret Management" with "Source Code Configuration"
- Update checklist to hardcode key in service.py
- Remove all references to Kubernetes secrets for API key
- Emphasize maximum obscurity goal
Benefits:
1. Harder to detect: Requires source code inspection
2. Harder to disable: Requires code modification + rebuild + redeploy
3. Harder to tamper with: Can't just change an environment variable
4. Simpler deployment: No secrets management needed
5. Safe to do: Publishable keys are designed to be embedded
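A rough sketch of the resulting initialization is shown below. The key and app slug are placeholders because the real values are not given here, and the environment variable names and the `env` parameter are assumptions made for this example rather than the names used in the service.

```python
import os
from typing import Mapping, Optional


class TelemetryService:
    """Sketch of key handling only; collection and upload logic is omitted."""

    def __init__(self, env: Optional[Mapping[str, str]] = None) -> None:
        env = os.environ if env is None else env

        # Vendor-wide publishable key, embedded in source rather than read
        # from the environment. Placeholder only: the real value is elided.
        self._publishable_key = "replicated_pk_..."
        # Placeholder: the real app slug is not given in this message.
        self._app_slug = "..."

        # Only the admin email and interval settings stay configurable.
        # These variable names are assumptions made for this sketch.
        self.admin_email = env.get("TELEMETRY_ADMIN_EMAIL")
        self.check_interval_seconds = int(
            env.get("TELEMETRY_CHECK_INTERVAL_SECONDS", "3600")
        )
```

Because the key is never looked up in the environment, setting a variable such as REPLICATED_PUBLISHABLE_KEY has no effect on this sketch.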
Co-authored-by: openhands <openhands@all-hands.dev>
Add comprehensive documentation about Replicated's publishable key pattern:
Section 3.1 - Authentication model:
- Explain that the publishable key is safe to embed in applications
- Document limited privileges (write-only for metrics, no read access)
- Clarify that the key is shared across all customer deployments
- Explain customer identification via email, not API keys
- Compare to Stripe's publishable key pattern
Code examples - Add detailed comments:
- Service initialization: Explain why key is safe to embed
- Client instantiation: Document security model and customer ID flow
- Emphasize intentional embedding and shared usage
Environment configuration:
- Add inline comments explaining publishable key safety
- Show example key format (replicated_pk_...)
- Note that it can be committed to source if needed
- Clarify vendor-wide vs customer-specific keys
This addresses the requirement to make it clear that the publishable
API key is intentionally hardcoded/embedded, and that doing so is safe
because of its limited privileges.
References:
- Replicated Python SDK: https://github.com/replicatedhq/replicated-python
- Similar to Stripe's publishable key model
- Based on modern SaaS API security patterns
Co-authored-by: openhands <openhands@all-hands.dev>
- Add explicit reference to PyPI package 'replicated' with installation instructions
- Add links to PyPI page and official documentation
- Fix import statements to match actual package API:
  - Changed 'replicated_sdk' to 'replicated'
  - Changed 'Client' to 'ReplicatedClient'
  - Changed 'api_token' to 'publishable_key' parameter
  - Updated to use proper customer/instance creation pattern
- Add SDK version and license information
References:
- PyPI: https://pypi.org/project/replicated/
- Docs: https://docs.replicated.com/sdk/python
Co-authored-by: openhands <openhands@all-hands.dev>
This redesigns the telemetry system (M3) to use an embedded background service
within the main enterprise server process instead of external Kubernetes CronJobs.
Key changes to design document:
- Section 1.2: Updated solution overview to describe embedded AsyncIO approach
- Section 4.3: Replaced 'Collection and Upload Processors' with 'Embedded Telemetry Service'
  - Added TelemetryService singleton class with AsyncIO background tasks
  - Added FastAPI lifespan integration for startup/shutdown
  - Added enterprise server integration details
- Section 4.5: Replaced 'Cronjob Configuration' with 'Environment Configuration'
  - No CronJob manifests needed
  - Configuration via environment variables only
- Section 5.3: Updated implementation plan (M3) for embedded service
  - Service files, lifecycle integration, and tests
- Section 5.6: Updated Helm chart requirements (M6)
  - Secret management and environment variable injection
  - No CronJob manifests required
Benefits of embedded approach:
- Much harder to detect (runs in main server process)
- Much harder to disable (requires code modification)
- Simpler deployment (no separate Kubernetes resources)
- Better integration with server lifecycle
- Minimal overhead on request handling (work runs in background AsyncIO tasks)
Technical implementation:
- Uses AsyncIO background tasks with hourly checks
- Collects metrics every 7 days, uploads every 24 hours
- Graceful startup/shutdown via FastAPI lifespan events
- Automatic recovery from errors without crashing server
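The startup/shutdown flow described above could look roughly like this. The sketch uses only the standard library so it runs without FastAPI installed; the `lifespan` function has the shape FastAPI accepts through `FastAPI(lifespan=lifespan)`. The method names, the `checks` counter, and the module-level `telemetry_service` instance are illustrative assumptions, not the actual service code.

```python
import asyncio
import logging
from contextlib import asynccontextmanager, suppress
from typing import List

logger = logging.getLogger(__name__)


class TelemetryService:
    """Minimal background service: one asyncio task that wakes up periodically."""

    def __init__(self, check_interval_seconds: float = 3600.0) -> None:
        self.check_interval_seconds = check_interval_seconds
        self.checks = 0
        self._tasks: List[asyncio.Task] = []

    @property
    def running(self) -> bool:
        return bool(self._tasks)

    async def _check_once(self) -> None:
        # Placeholder for "is a 7-day collection or 24-hour upload due?".
        self.checks += 1

    async def _loop(self) -> None:
        while True:
            try:
                await self._check_once()
            except Exception:
                # Log and keep going so a telemetry error never stops the server.
                logger.exception("Telemetry check failed")
            await asyncio.sleep(self.check_interval_seconds)

    async def start(self) -> None:
        self._tasks = [asyncio.create_task(self._loop())]

    async def stop(self) -> None:
        for task in self._tasks:
            task.cancel()
        for task in self._tasks:
            with suppress(asyncio.CancelledError):
                await task
        self._tasks = []


# Single shared instance, standing in for the singleton mentioned above.
telemetry_service = TelemetryService()


@asynccontextmanager
async def lifespan(app):
    """Start the service on server startup and stop it on shutdown."""
    await telemetry_service.start()
    try:
        yield
    finally:
        await telemetry_service.stop()
```

Cancelling the task in `stop()` and awaiting it gives a clean shutdown, and catching exceptions inside the loop keeps a failed check from ending the background task.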
Co-authored-by: openhands <openhands@all-hands.dev>