feat: Agent Fleet Management & Secure Self-Update System #805

Open
opened 2026-03-28 04:28:44 +00:00 by mfreeman451 · 0 comments

Imported from GitHub.

Original GitHub issue: #2406
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/2406
Original created: 2026-01-20T07:21:24Z


This needs to be framed as part of a much larger change, one that takes full advantage of a bi-directional, persistent gRPC/WebSocket connection between agents and the agent-gateway. Instead of agents polling every 5 minutes for config changes, we can signal agents when a change exists, which reduces traffic.

We might also be able to use the AshOban job scheduler to trigger collections on an agent, but that might get tricky and verbose. We may be better off leaving that part alone: generate the config for the agent on demand or on a change, signal down to the agent, and keep the polling interval in the config as we do today.

https://github.com/werf/trdl
https://github.com/jpillora/overseer

PRD: Agent Fleet Management & Secure Self-Update System

1. Executive Summary

To support large-scale deployments (2,400+ nodes), our platform requires a centralized way to manage, configure, and update agents without relying on external orchestration tools like Ansible. This project introduces a Bi-directional gRPC Control Plane and an Automated Self-Update Mechanism that allows the Core Engine to orchestrate fleet-wide upgrades securely and reliably.


2. Problem Statement

Current agent updates require manual intervention or third-party automation. At a scale of 2,400+ machines, manual updates are impractical. Furthermore, the current polling architecture (Agent -> Gateway) delays every management action until an agent's next poll. We need a push capability to trigger immediate actions (like emergency patches) across the fleet.


3. Goals & Objectives

  • Zero-Touch Updates: Agents should update themselves based on a "Desired State" defined in the Web UI.
  • Bi-directional Communication: Move from unary gRPC polling to persistent bi-directional streaming for real-time control.
  • Package Integrity: Updates must be compatible with existing RPM/DEB installations and avoid corrupting the system package manager’s database.
  • Security First: Prevent "Mass-Malware Injection" via cryptographic signing and verification (TUF-inspired).
  • Safety: Support staged rollouts (Canary deployments) to prevent a faulty update from bricking the entire fleet.

4. Technical Architecture

4.1. Communication: Persistent gRPC Stream

The Agent-Gateway communication will be upgraded to use gRPC Bi-directional Streaming.

  • The Stream: Upon startup, the agent opens a long-lived stream to the Gateway.
  • Heartbeat/Telemetry: Metrics continue to flow up.
  • Control Channel: The Gateway can push "Instructions" (JSON/Protobuf) down the stream at any time.
  • Event: When a user clicks "Update" in the UI, the Core Engine sends an UpdateInstruction message down the stream.
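
The control channel described above can be sketched as a small dispatch loop on the agent side. This is a minimal sketch, not the project's wire format: the Instruction envelope, its "kind" values, and the fields of UpdateInstruction are assumptions made for illustration (only the UpdateInstruction name comes from this PRD), and a Go channel stands in for the receive side of the gRPC stream so the example stays free of external dependencies.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Instruction is a hypothetical envelope for messages pushed down the stream.
type Instruction struct {
	Kind    string          `json:"kind"`    // assumed values: "update", "config_changed"
	Payload json.RawMessage `json:"payload"` // kind-specific body, decoded later
}

// UpdateInstruction is named in the PRD; the fields here are assumptions.
type UpdateInstruction struct {
	TargetVersion string `json:"target_version"`
	ManifestURL   string `json:"manifest_url"`
}

// Handle decodes one instruction and returns a description of the action
// the agent would take. Unknown or malformed instructions are rejected.
func Handle(raw []byte) (string, error) {
	var in Instruction
	if err := json.Unmarshal(raw, &in); err != nil {
		return "", fmt.Errorf("decode instruction: %w", err)
	}
	switch in.Kind {
	case "update":
		var u UpdateInstruction
		if err := json.Unmarshal(in.Payload, &u); err != nil {
			return "", fmt.Errorf("decode update payload: %w", err)
		}
		if u.TargetVersion == "" {
			return "", fmt.Errorf("update instruction missing target_version")
		}
		return "start update to " + u.TargetVersion, nil
	case "config_changed":
		return "refetch config", nil
	default:
		return "", fmt.Errorf("unknown instruction kind %q", in.Kind)
	}
}

// ReceiveLoop drains a channel that stands in for the stream's receive side
// and records what happened to each message.
func ReceiveLoop(stream <-chan []byte) []string {
	var actions []string
	for raw := range stream {
		action, err := Handle(raw)
		if err != nil {
			actions = append(actions, "rejected: "+err.Error())
			continue
		}
		actions = append(actions, action)
	}
	return actions
}

func main() {
	stream := make(chan []byte, 2)
	stream <- []byte(`{"kind":"config_changed","payload":{}}`)
	stream <- []byte(`{"kind":"update","payload":{"target_version":"v1.1.0","manifest_url":"https://example.invalid/manifest.json"}}`)
	close(stream)
	for _, a := range ReceiveLoop(stream) {
		fmt.Println(a)
	}
}
```

Rejecting unknown kinds instead of ignoring them makes version skew between Gateway and agent visible, which matters once older agents share a stream protocol with a newer Core Engine.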

4.2. The "Sidecar Updater" Pattern

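So that the running agent never has to overwrite or restart its own executable (on Linux, writing to a binary that is currently executing fails with ETXTBSY), the update process will involve two components: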

  1. The Agent (Go/Rust): The main process that handles telemetry and listens for the update command.
  2. The Updater Binary: A minimal, low-dependency binary responsible for the "Swap and Restart."
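
The handoff between the two components could look like the following. This is a sketch under stated assumptions: the updater path and the --install-dir and --version flags are hypothetical, since the PRD does not define the updater's command line.

```go
package main

import (
	"fmt"
	"os/exec"
)

// UpdaterCommand builds the command line for the updater binary.
// The flag names are hypothetical; the PRD does not specify them.
func UpdaterCommand(updaterPath, installDir, version string) *exec.Cmd {
	return exec.Command(updaterPath,
		"--install-dir", installDir,
		"--version", version,
	)
}

// HandOff starts the updater without waiting for it to finish, then
// releases the process handle so the agent is free to exit.
func HandOff(cmd *exec.Cmd) error {
	if err := cmd.Start(); err != nil {
		return fmt.Errorf("start updater: %w", err)
	}
	return cmd.Process.Release()
}

func main() {
	cmd := UpdaterCommand("/opt/our-platform/bin/updater", "/opt/our-platform", "v1.1.0")
	// Print the command instead of running it; a real agent would call
	// HandOff(cmd) and then shut down gracefully.
	fmt.Println(cmd.Args)
}
```

One design point to settle early: when the agent runs as a systemd service, a child started this way stays in the service's control group and is normally killed when the service stops, so the updater may need to run as its own unit.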

4.3. Directory & Package Strategy

To remain compatible with RPM/DEB:

  • Pathing: The RPM installs to /opt/our-platform/.
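  • Symlinking: The service runs /opt/our-platform/bin/current, which is a symlink to the active version's binary, for example /opt/our-platform/bin/v1.0.0/agent.
  • Atomic Swap: The Updater downloads v1.1.0, verifies it, and repoints the current symlink by creating a new symlink under a temporary name and renaming it over current, so the path always resolves to a complete binary.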
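
A minimal sketch of the symlink swap, assuming a POSIX filesystem where rename replaces the destination atomically. The temporary link name .current.tmp and the Demo helper are illustrative, and the demo uses a throwaway directory instead of /opt/our-platform/bin.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// SwapCurrent repoints binDir/current at target. It creates the new symlink
// under a temporary name and renames it over "current", so readers see either
// the old link or the new one, never a missing path.
func SwapCurrent(binDir, target string) error {
	current := filepath.Join(binDir, "current")
	tmp := filepath.Join(binDir, ".current.tmp") // hypothetical temp name
	_ = os.Remove(tmp)                           // clear leftovers from an interrupted swap
	if err := os.Symlink(target, tmp); err != nil {
		return fmt.Errorf("create temp symlink: %w", err)
	}
	if err := os.Rename(tmp, current); err != nil {
		_ = os.Remove(tmp)
		return fmt.Errorf("rename over current: %w", err)
	}
	return nil
}

// Demo installs two fake versions in a temp directory, swaps to each in turn,
// and returns where "current" points at the end.
func Demo() (string, error) {
	binDir, err := os.MkdirTemp("", "agent-bin-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(binDir)
	for _, v := range []string{"v1.0.0", "v1.1.0"} {
		if err := os.MkdirAll(filepath.Join(binDir, v), 0o755); err != nil {
			return "", err
		}
		if err := os.WriteFile(filepath.Join(binDir, v, "agent"), []byte(v), 0o755); err != nil {
			return "", err
		}
		// Relative target: resolved relative to the directory holding the link.
		if err := SwapCurrent(binDir, filepath.Join(v, "agent")); err != nil {
			return "", err
		}
	}
	return os.Readlink(filepath.Join(binDir, "current"))
}

func main() {
	target, err := Demo()
	if err != nil {
		fmt.Println("demo failed:", err)
		return
	}
	fmt.Println("current ->", target)
}
```

Keeping the previous version's directory on disk after the swap is what makes a later rollback a second call to the same function.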

5. Functional Requirements

5.1. Agent Capabilities

  • Identity: Agent must present a unique ID and current version string on stream connection.
  • Download & Verify: Agent must download the new binary and its signed manifest via HTTPS, verify the manifest's Ed25519 signature against a public key hardcoded in the agent, and check the binary's SHA256 hash against the manifest (see 6.1).
  • Handoff: Agent must be able to spawn the updater process and exit gracefully.

5.2. Core Engine / Gateway Capabilities

  • Version Registry: A database of available agent versions (S3 URLs, Hashes, Signatures).
  • Targeting: Ability to set "Desired Version" by:
    • Individual Agent ID
    • Group/Label (e.g., env:production or os:ubuntu)
    • Global (All agents)
  • Batching: Logic to throttle updates (e.g., "Update 50 agents every 10 minutes").
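
Targeting and batching can be sketched as two small functions: one that resolves a label selector to agent IDs, and one that splits the result into fixed-size batches. The Agent type and the rule that an empty selector means "Global" are assumptions for illustration, not the Core Engine's data model.

```go
package main

import "fmt"

// Agent is a hypothetical view of a registered agent.
type Agent struct {
	ID     string
	Labels map[string]string
}

// Select returns the IDs of agents whose labels contain every key/value pair
// in selector. An empty selector matches all agents (the "Global" target).
func Select(agents []Agent, selector map[string]string) []string {
	var ids []string
	for _, a := range agents {
		match := true
		for k, v := range selector {
			if a.Labels[k] != v {
				match = false
				break
			}
		}
		if match {
			ids = append(ids, a.ID)
		}
	}
	return ids
}

// Batches splits ids into consecutive groups of at most size elements.
func Batches(ids []string, size int) [][]string {
	if size <= 0 {
		return nil
	}
	var out [][]string
	for start := 0; start < len(ids); start += size {
		end := start + size
		if end > len(ids) {
			end = len(ids)
		}
		out = append(out, ids[start:end])
	}
	return out
}

func main() {
	agents := []Agent{
		{ID: "a1", Labels: map[string]string{"env": "production", "os": "ubuntu"}},
		{ID: "a2", Labels: map[string]string{"env": "staging", "os": "ubuntu"}},
		{ID: "a3", Labels: map[string]string{"env": "production", "os": "rhel"}},
		{ID: "a4", Labels: map[string]string{"env": "production", "os": "ubuntu"}},
	}
	prod := Select(agents, map[string]string{"env": "production"})
	for i, b := range Batches(prod, 2) {
		fmt.Printf("batch %d: %v\n", i+1, b)
	}
}
```

A scheduler would then release one batch per interval and stop releasing batches if the previous one reports failures, which is where the canary requirement from section 3 plugs in.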

5.3. Web UI Capabilities

  • Fleet Overview: A dashboard showing the distribution of agent versions.
  • Update Trigger: A button to "Upgrade to Latest" for selected nodes.
  • Status Tracking: Real-time progress (Downloading -> Verifying -> Restarting -> Back Online).
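
The status sequence above suggests a small state machine on the Core Engine side that accepts only forward transitions. The sketch below encodes the four states from this PRD; the extra Failed state and the Advance function are assumptions added so that an error during any step has somewhere to go.

```go
package main

import "fmt"

// Status is one step of an agent's update as shown in the UI.
type Status string

const (
	Downloading Status = "Downloading"
	Verifying   Status = "Verifying"
	Restarting  Status = "Restarting"
	BackOnline  Status = "Back Online"
	Failed      Status = "Failed" // assumption: not listed in the PRD
)

// next maps each in-progress status to the only status that may follow it.
var next = map[Status]Status{
	Downloading: Verifying,
	Verifying:   Restarting,
	Restarting:  BackOnline,
}

// Advance applies a status reported by an agent. It returns the new status,
// or the unchanged current status and an error if the transition is invalid.
func Advance(current, reported Status) (Status, error) {
	// Any in-progress step may fail; terminal states cannot change.
	if reported == Failed && current != BackOnline && current != Failed {
		return Failed, nil
	}
	if want, ok := next[current]; ok && want == reported {
		return reported, nil
	}
	return current, fmt.Errorf("invalid transition %q -> %q", current, reported)
}

func main() {
	s := Downloading
	for _, r := range []Status{Verifying, Restarting, BackOnline} {
		var err error
		if s, err = Advance(s, r); err != nil {
			fmt.Println(err)
			return
		}
		fmt.Println("status:", s)
	}
}
```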

6. Security Specification

6.1. Cryptographic Signing

All update binaries must be signed during the CI/CD build process.

  • Manifest: A JSON file containing the binary hash (SHA256) and version.
  • Signature: An Ed25519 signature of the manifest.
  • Verification: The Agent will reject any update where the signature does not match the platform's trusted Public Key.
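
The verification order matters: check the signature over the exact manifest bytes first, and only then trust the hash inside it. The sketch below uses Go's standard crypto/ed25519 and crypto/sha256 packages. The manifest field names ("version", "sha256"), the lowercase-hex encoding, and the Fixture helper that signs with a throwaway key are assumptions for illustration; in production the private key would stay offline in the release pipeline.

```go
package main

import (
	"crypto/ed25519"
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"errors"
	"fmt"
)

// Manifest mirrors the JSON manifest from 6.1; field names are assumptions.
type Manifest struct {
	Version string `json:"version"`
	SHA256  string `json:"sha256"` // lowercase hex digest of the binary
}

// VerifyUpdate checks the signature over the raw manifest bytes, then checks
// that the binary's SHA-256 digest matches the digest named in the manifest.
func VerifyUpdate(pub ed25519.PublicKey, manifestJSON, sig, binary []byte) (*Manifest, error) {
	if len(pub) != ed25519.PublicKeySize {
		return nil, errors.New("trusted public key has the wrong length")
	}
	if !ed25519.Verify(pub, manifestJSON, sig) {
		return nil, errors.New("manifest signature invalid")
	}
	var m Manifest
	if err := json.Unmarshal(manifestJSON, &m); err != nil {
		return nil, fmt.Errorf("parse manifest: %w", err)
	}
	sum := sha256.Sum256(binary)
	if hex.EncodeToString(sum[:]) != m.SHA256 {
		return nil, errors.New("binary hash does not match manifest")
	}
	return &m, nil
}

// Fixture plays the role of the CI/CD signer with a freshly generated key.
func Fixture() (ed25519.PublicKey, []byte, []byte, []byte, error) {
	pub, priv, err := ed25519.GenerateKey(nil) // nil selects crypto/rand
	if err != nil {
		return nil, nil, nil, nil, err
	}
	binary := []byte("pretend agent binary")
	sum := sha256.Sum256(binary)
	manifest, err := json.Marshal(Manifest{Version: "v1.1.0", SHA256: hex.EncodeToString(sum[:])})
	if err != nil {
		return nil, nil, nil, nil, err
	}
	return pub, manifest, ed25519.Sign(priv, manifest), binary, nil
}

func main() {
	pub, manifest, sig, binary, err := Fixture()
	if err != nil {
		fmt.Println("fixture failed:", err)
		return
	}
	m, err := VerifyUpdate(pub, manifest, sig, binary)
	if err != nil {
		fmt.Println("rejected:", err)
		return
	}
	fmt.Println("verified update to", m.Version)
}
```

Because the signature covers the version string as well as the hash, an agent can also refuse a correctly signed manifest whose version is older than the one it runs, which guards against downgrade attacks.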

6.2. Rollback Mechanism

  • If the new agent fails to connect to the Gateway within 3 minutes of an update, the updater (or the new agent itself) must revert the current symlink to the previous version and restart the agent.
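
A watchdog for this rule can be sketched as a poll with a deadline followed by a rollback callback. This is a sketch under assumptions: the connected and rollback callbacks are hypothetical hooks (for example a Gateway connectivity probe and the symlink revert), and the demo constants are shortened so the example finishes quickly; the PRD's value for the deadline is 3 minutes.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Shortened timings for the demo; the PRD specifies a 3 minute deadline.
const (
	DemoTimeout  = 200 * time.Millisecond
	DemoInterval = 10 * time.Millisecond
)

// AwaitHealthy polls connected until it returns true or timeout elapses.
func AwaitHealthy(ctx context.Context, timeout, interval time.Duration, connected func() bool) bool {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if connected() {
			return true
		}
		select {
		case <-ctx.Done():
			return false
		case <-ticker.C:
		}
	}
}

// Supervise runs rollback when the new agent does not become healthy in time.
// It reports whether a rollback was attempted.
func Supervise(timeout, interval time.Duration, connected func() bool, rollback func() error) (bool, error) {
	if AwaitHealthy(context.Background(), timeout, interval, connected) {
		return false, nil
	}
	if err := rollback(); err != nil {
		return true, fmt.Errorf("rollback failed: %w", err)
	}
	return true, nil
}

func main() {
	rolledBack, err := Supervise(DemoTimeout, DemoInterval,
		func() bool { return false }, // the new agent never connects
		func() error {
			fmt.Println("reverting current symlink and restarting")
			return nil
		},
	)
	fmt.Println("rolled back:", rolledBack, "error:", err)
}
```

Running this check in the updater is the safer of the two options named above, because a new agent that crashes on startup cannot revert itself.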

7. User Stories

  1. As a SysAdmin with 2,400 nodes, I want to select 10 "Canary" nodes and update them to the latest version via the UI to ensure stability before a global rollout.
  2. As a Security Engineer, I want to know that if our Gateway is compromised, the attacker cannot push a malicious binary, because a valid signature cannot be produced without the offline private key.
  3. As a Developer, I want to use gRPC streaming so I can eventually send "Real-time Debug" commands to an agent without waiting for its next poll interval.

8. Success Metrics

  • Update Success Rate: > 99.5% of agents successfully reaching the target version.
  • Time to Update: A fleet of 2,400 agents should be fully updated within 30 minutes (staged).
  • Stability: Zero "orphaned" agents (agents that update and never come back online).

9. Implementation Phases

| Phase | Description |
| :--- | :--- |
| Phase 1 | Implement Bi-directional gRPC streaming (Core + Agent). |
| Phase 2 | Build the Updater sidecar and symlink logic for Linux. |
| Phase 3 | Implement Ed25519 signature verification in Go/Rust agents. |
| Phase 4 | Build the "Fleet Management" UI and staged rollout logic in the Core Engine. |
mfreeman451 added this to the 1.1.2 milestone 2026-03-28 04:28:44 +00:00