Secure edge poller onboarding flow #628

Closed
opened 2026-03-28 04:26:37 +00:00 by mfreeman451 · 4 comments
Owner

Imported from GitHub.

Original GitHub issue: #1903
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1903
Original created: 2025-10-28T20:18:39Z


PRD: Secure Edge Poller Onboarding

Background

  • The edge poller bundle (docker-compose) now embeds a nested SPIRE server/agent pair and can connect back to the demo Core over LoadBalancer endpoints.
  • Bootstrap currently requires running docker/compose/edge-poller-restart.sh, which calls kubectl exec against the demo SPIRE server to mint a join token and downstream entry.
  • This workflow assumes the operator has cluster-admin kube credentials. Edge installers or partners cannot self-service onboarding without sharing privileged access.
  • We need a first-class, auditable, secure flow for provisioning edge pollers that does not expose kube creds and keeps SPIRE join tokens tightly controlled.

Problem Statement

Provide an operator-approved workflow—accessible via Core UI and API—that issues short-lived credentials to edge poller deployments, tracks their lifecycle, and prevents unauthorized enrollment. The solution must remove direct kube dependencies from the edge host while preserving SPIRE trust guarantees.

Goals

  1. Allow authorized operators to create edge poller onboarding packages (token + config) from Core without shelling into the cluster.
  2. Ensure each package is single-use, time-bound, and auditable (who created, when, consumed or revoked).
  3. Let headless edge installers apply the onboarding artifact with minimal manual steps (ideally just drop files + run restart script).
  4. Support revocation and monitoring of issued tokens/registrations.

Non-Goals

  • Replacing SPIRE with an alternative CA.
  • Implementing hardware-based attestation in this iteration (may be future work).
  • Shipping an OTA deployment agent for edge hosts (focus is on credential issuance and bootstrap).

Personas

  • Operator (Ops/SRE): Manages central Core/K8s, approves edge deployments.
  • Edge Installer (Field Tech/Automation): Has access to the edge host but not central kube credentials.
  • Security/Compliance Reviewer: Needs audit trail of issued credentials.

User Stories

  1. Operator logs into Core UI, requests a new edge poller package for site X, downloads the package, shares it securely with the installer.
  2. Operator views outstanding packages, sees status (not used / activated / expired) and revokes if compromised.
  3. Edge installer uploads package onto host, runs restart helper (or simplified script) to join using provided token/bundle.
  4. Security reviewer queries audit log to confirm who issued which package and when activation occurred.

Functional Requirements

  1. Package Issuance API/UI
    • Endpoint in Core (REST + UI) to create onboarding package.
    • Inputs: friendly name/site ID, optional selector overrides, expiry TTL (default 15 min).
    • Output: archive (tar/zip) containing join token, SPIRE bundle, pre-rendered edge-poller.env (or overrides), readme.
    • Request requires operator auth with config:write scope.
  2. SPIRE Integration
    • Core backend calls SPIRE controller/server to create downstream entry + join token.
    • Use SPIRE Admin API via existing credentials (no kubectl). Respect entry selectors and TTL.
    • Store metadata (entry ID, token ID, expiry) in Core DB.
  3. Delivery & Tracking
    • Package ID persisted with status: Created → Delivered → Activated → Expired/Revoked.
    • When the edge poller first connects, Core matches the poller_id/SPIFFE ID and flips the status to Activated; record timestamp and source IP.
  4. Revocation
    • Operators can revoke packages (delete downstream entry, mark revoked, optionally invalidate join token if unused).
    • Revoked/expired tokens should not authenticate.
  5. Installer Experience
    • Provide updated documentation and optional helper script that accepts package path (e.g., ./edge-poller-install.sh --package edge-site-x.tar.gz).
    • Script extracts files into docker/compose/spire/, updates .env, runs restart helper.
  6. Audit / Logging
    • All issuance, download, activation, revocation events logged with actor, timestamp, package ID.
    • Expose read-only UI/API listing packages with filters.
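
The status lifecycle in requirement 3 (Created → Delivered → Activated → Expired/Revoked) could be enforced with a small transition table. The following is a minimal sketch, not the Core implementation: the type, constant, and function names are hypothetical, and the exact set of allowed transitions (for example, whether an Activated package may still be revoked) is an assumption the PRD leaves open.

```go
package main

import (
	"errors"
	"fmt"
)

// Status mirrors the package lifecycle named in the requirements.
type Status string

const (
	StatusCreated   Status = "created"
	StatusDelivered Status = "delivered"
	StatusActivated Status = "activated"
	StatusExpired   Status = "expired"
	StatusRevoked   Status = "revoked"
)

// allowed lists the transitions this sketch permits (an assumption).
// Expired and Revoked have no outgoing transitions, so they are terminal.
var allowed = map[Status][]Status{
	StatusCreated:   {StatusDelivered, StatusExpired, StatusRevoked},
	StatusDelivered: {StatusActivated, StatusExpired, StatusRevoked},
	StatusActivated: {StatusRevoked},
}

// ErrInvalidTransition is returned when a move is not in the table.
var ErrInvalidTransition = errors.New("invalid status transition")

// Transition returns the new status, or the old status and an error
// when the requested move is not allowed.
func Transition(from, to Status) (Status, error) {
	for _, next := range allowed[from] {
		if next == to {
			return to, nil
		}
	}
	return from, fmt.Errorf("%w: %s -> %s", ErrInvalidTransition, from, to)
}

func main() {
	s := StatusCreated
	for _, next := range []Status{StatusDelivered, StatusActivated} {
		var err error
		if s, err = Transition(s, next); err != nil {
			fmt.Println("error:", err)
			return
		}
	}
	fmt.Println("final status:", s)

	// A revoked package must never become active again.
	_, err := Transition(StatusRevoked, StatusActivated)
	fmt.Println("reuse after revoke:", err)
}
```

Keeping the table in one place makes the "revoked/expired tokens should not authenticate" rule easy to audit, since terminal states simply have no entries.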

Non-Functional Requirements

  • Security: Tokens are single-use with a TTL of <= 15 min by default; package download requires operator auth; package contents are marked sensitive (no caching/logging).
  • Usability: Operator UI flow should complete in < 5 clicks; installer script reduces manual edits.
  • Scalability: Support at least 100 concurrent outstanding packages.
  • Reliability: Backend must handle SPIRE unavailability gracefully, surfacing errors to UI.
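
One way to read the token TTL requirement is sketched below: an omitted TTL falls back to the 15 minute default, and anything above a cap is rejected. Treating 15 minutes as both the default and the upper bound is an assumption, as are the names NormalizeTTL, DefaultTTL, and MaxTTL.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

const (
	// DefaultTTL comes from the PRD (default 15 min).
	DefaultTTL = 15 * time.Minute
	// MaxTTL is an assumption: this sketch treats the default as the cap.
	MaxTTL = 15 * time.Minute
)

// ErrTTLOutOfRange is returned for negative or too-long TTLs.
var ErrTTLOutOfRange = errors.New("ttl must be positive and at most MaxTTL")

// NormalizeTTL applies the default when no TTL was requested and
// rejects values outside (0, MaxTTL].
func NormalizeTTL(requested time.Duration) (time.Duration, error) {
	switch {
	case requested == 0:
		return DefaultTTL, nil
	case requested < 0 || requested > MaxTTL:
		return 0, fmt.Errorf("%w: got %s", ErrTTLOutOfRange, requested)
	default:
		return requested, nil
	}
}

func main() {
	for _, in := range []time.Duration{0, 5 * time.Minute, time.Hour} {
		ttl, err := NormalizeTTL(in)
		fmt.Println(in, "->", ttl, err)
	}
}
```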

UX / Flow

  1. Operator navigates to Core UI → Edge Pollers → “Create Onboarding Package”.
  2. Fill form (site name, optional TTL, optional IP notes) → submit.
  3. Backend creates SPIRE entry/token, stores metadata, returns package for download.
  4. Installer receives package, copies to edge host, runs edge-poller-install.sh --package pkg.tar.gz (script does extraction + restart helper).
  5. Poller joins; Core UI updates package status to Activated.
  6. Operator can revoke or reissue if needed.

API/Backend Tasks

  • Add new service in Core to talk to SPIRE (reuse existing SPIRE admin credentials if available, else add service account).
  • Extend Core config for SPIRE admin endpoint/token and default selector templates.
  • Create DB schema: edge_onboarding_packages with fields (id, name, spiffe_id, join_token_id, entry_id, ttl, status, created_by, created_at, activated_at, revoked_at, metadata JSON).
  • Add REST endpoints: POST /api/edge-packages, GET /api/edge-packages, POST /api/edge-packages/{id}/revoke, GET /api/edge-packages/{id}/download.
  • Update Core poller ingestion path to flip package status when the first status report arrives from the associated SPIFFE ID.
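
As a rough illustration of the schema and the ingestion-path change above, the record could be modeled like this in Go. The field list follows the edge_onboarding_packages columns named in the task; the Go type names, the JSON tags, the TTL unit (seconds), the lowercase status strings, and the choice to keep the source IP in metadata are all assumptions of this sketch.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// EdgePackage mirrors the proposed edge_onboarding_packages columns.
type EdgePackage struct {
	ID          string            `json:"id"`
	Name        string            `json:"name"`
	SpiffeID    string            `json:"spiffe_id"`
	JoinTokenID string            `json:"join_token_id"`
	EntryID     string            `json:"entry_id"`
	TTLSeconds  int64             `json:"ttl"` // unit is an assumption
	Status      string            `json:"status"`
	CreatedBy   string            `json:"created_by"`
	CreatedAt   time.Time         `json:"created_at"`
	ActivatedAt *time.Time        `json:"activated_at,omitempty"`
	RevokedAt   *time.Time        `json:"revoked_at,omitempty"`
	Metadata    map[string]string `json:"metadata,omitempty"`
}

// ActivateOnFirstReport flips the package to "activated" when the first
// status report arrives from the matching SPIFFE ID. It returns false for
// a mismatched ID or a package that is already activated, expired, or revoked.
func (p *EdgePackage) ActivateOnFirstReport(spiffeID, sourceIP string, now time.Time) bool {
	if p.SpiffeID != spiffeID {
		return false
	}
	if p.Status != "created" && p.Status != "delivered" {
		return false
	}
	p.Status = "activated"
	p.ActivatedAt = &now
	if p.Metadata == nil {
		p.Metadata = map[string]string{}
	}
	p.Metadata["source_ip"] = sourceIP
	return true
}

// newDemoPackage returns a record filled with placeholder values.
func newDemoPackage() *EdgePackage {
	return &EdgePackage{
		ID:         "pkg-1",
		Name:       "edge-site-x",
		SpiffeID:   "spiffe://example.org/edge/site-x",
		TTLSeconds: 900,
		Status:     "delivered",
		CreatedBy:  "operator@example.org",
		CreatedAt:  time.Date(2025, 10, 28, 20, 0, 0, 0, time.UTC),
	}
}

func main() {
	p := newDemoPackage()
	p.ActivateOnFirstReport(p.SpiffeID, "203.0.113.7", p.CreatedAt.Add(2*time.Minute))
	out, _ := json.MarshalIndent(p, "", "  ")
	fmt.Println(string(out))
}
```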

Frontend Tasks

  • New Edge Packages page (list + status filters).
  • Drawer/modal for creating package (fields + success download link).
  • Detail view showing history and actions (revoke, copy instructions).

Installer Tooling

  • New script docker/compose/edge-poller-install.sh (or update restart helper) to accept package archive, extract to appropriate directories, optionally prompt for CORE/KV override if different from package defaults.
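
The extraction step of such a helper is worth getting right, because the archive carries a join token and arrives over a channel the installer may not fully control. Below is a minimal sketch of the unpacking logic in Go, assuming a gzip-compressed tar archive. The function names and the demo file name are hypothetical, and the proposed helper itself is a shell script, so this only illustrates the checks it would need: reject entries that escape the target directory and write files with owner-only permissions.

```go
package main

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"strings"
)

// extractPackage unpacks a gzip-compressed tar stream into destDir and
// refuses entries that would land outside it.
func extractPackage(r io.Reader, destDir string) ([]string, error) {
	gz, err := gzip.NewReader(r)
	if err != nil {
		return nil, fmt.Errorf("open gzip: %w", err)
	}
	defer gz.Close()

	root := filepath.Clean(destDir) + string(os.PathSeparator)
	tr := tar.NewReader(gz)
	var written []string
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return written, nil
		}
		if err != nil {
			return written, fmt.Errorf("read tar: %w", err)
		}
		if hdr.Typeflag != tar.TypeReg {
			continue // sketch: regular files only, no links or devices
		}
		target := filepath.Join(destDir, hdr.Name)
		if !strings.HasPrefix(target, root) {
			return written, fmt.Errorf("entry %q escapes %s", hdr.Name, destDir)
		}
		if err := os.MkdirAll(filepath.Dir(target), 0o700); err != nil {
			return written, err
		}
		// 0600 because the package carries a join token.
		f, err := os.OpenFile(target, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o600)
		if err != nil {
			return written, err
		}
		_, copyErr := io.Copy(f, tr)
		if closeErr := f.Close(); copyErr == nil {
			copyErr = closeErr
		}
		if copyErr != nil {
			return written, copyErr
		}
		written = append(written, hdr.Name)
	}
}

// demoExtract builds an in-memory archive from files, extracts it into a
// temporary directory, and returns the extracted entry names.
func demoExtract(files map[string]string) ([]string, error) {
	var buf bytes.Buffer
	gz := gzip.NewWriter(&buf)
	tw := tar.NewWriter(gz)
	for name, body := range files {
		hdr := &tar.Header{Name: name, Typeflag: tar.TypeReg, Mode: 0o600, Size: int64(len(body))}
		if err := tw.WriteHeader(hdr); err != nil {
			return nil, err
		}
		if _, err := tw.Write([]byte(body)); err != nil {
			return nil, err
		}
	}
	if err := tw.Close(); err != nil {
		return nil, err
	}
	if err := gz.Close(); err != nil {
		return nil, err
	}
	dir, err := os.MkdirTemp("", "edge-pkg-")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	return extractPackage(&buf, dir)
}

func main() {
	names, err := demoExtract(map[string]string{"edge-poller.env": "PLACEHOLDER=1\n"})
	fmt.Println(names, err)
}
```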

Security Considerations

  • Package downloads should use signed URLs or require re-auth when fetching.
  • Token stored encrypted at rest; display token only during package download (afterwards masked).
  • Optionally integrate with audit log aggregator or send event to Slack/Webhook.
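
For the signed-URL option, one common approach is an HMAC over the package ID and an expiry timestamp. The sketch below assumes HMAC-SHA256 with a server-side secret and reuses the download path from the API tasks; the query parameter names (expires, sig) and the function names are hypothetical, and nothing here is confirmed as the Core design.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net/url"
	"strconv"
	"time"
)

// mac computes a hex HMAC-SHA256 over the package ID and expiry.
func mac(secret []byte, packageID, exp string) string {
	h := hmac.New(sha256.New, secret)
	h.Write([]byte(packageID + "\n" + exp))
	return hex.EncodeToString(h.Sum(nil))
}

// signDownload returns a relative download URL that is valid until
// expiresUnix (seconds since the Unix epoch).
func signDownload(secret []byte, packageID string, expiresUnix int64) string {
	exp := strconv.FormatInt(expiresUnix, 10)
	q := url.Values{}
	q.Set("expires", exp)
	q.Set("sig", mac(secret, packageID, exp))
	return "/api/edge-packages/" + url.PathEscape(packageID) + "/download?" + q.Encode()
}

// verifyDownload rejects expired links and compares signatures in
// constant time.
func verifyDownload(secret []byte, packageID, exp, sig string, nowUnix int64) bool {
	expires, err := strconv.ParseInt(exp, 10, 64)
	if err != nil || nowUnix > expires {
		return false
	}
	want := mac(secret, packageID, exp)
	return hmac.Equal([]byte(want), []byte(sig))
}

func main() {
	secret := []byte("demo-secret-do-not-use")
	now := time.Now().Unix()
	link := signDownload(secret, "pkg-1", now+300)
	fmt.Println(link)

	u, _ := url.Parse(link)
	ok := verifyDownload(secret, "pkg-1", u.Query().Get("expires"), u.Query().Get("sig"), now)
	fmt.Println("valid:", ok)
}
```

A signed link on its own is replayable until it expires, so a one-time download would additionally need server-side state, for example marking the package Delivered on first fetch.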

Metrics / Success Criteria

  • Time to onboard edge poller (goal: <5 minutes operator time, <5 minutes installer time).
  • Number of packages issued vs activated vs revoked per month.
  • Zero unauthorized poller registrations (monitored via alerting on unknown pollers).

Open Questions

  1. Do we want multi-use packages for fleet deployments or strictly one-per-edge?
  2. Should we allow custom selector inputs or enforce defaults for simplicity?
  3. How do we expose package download to automation (CLI?) without relaxing auth controls?
  4. Any compliance requirements for storing issued tokens (encrypt/rotate)?

Milestones

  1. Design sign-off (scope clarifications, security review).
  2. Backend SPIRE service + DB schema.
  3. API + UI issuance flows.
  4. Installer tooling updates.
  5. Audit + observability.
  6. Docs & runbooks.

Dependencies

  • SPIRE server admin credentials accessible by Core (may require new Kubernetes secret/service account).
  • Core authentication/authorization framework (reuse existing roles).
  • Frontend design bandwidth for new UI elements.

Risks / Mitigations

  • Risk: SPIRE controller deletions conflict with manual entries → ensure new entries use dedicated prefixes and annotate.
  • Risk: Package leak → short TTL + revoke endpoint + encrypted storage.
  • Risk: Operator forgets to hand package securely → update docs to recommend secure channel, optionally integrate one-time download links.

Acceptance Criteria

  • Operator can issue, download, and revoke packages exclusively via Core UI/API (no kubectl required).
  • Edge installer can bootstrap using provided package without editing repo files.
  • Core audit log shows issuance and activation with timestamps/users.
  • Attempted reuse of token after TTL or revocation fails (confirmed in logs).
Author
Owner

Imported GitHub comment.

Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1903#issuecomment-3458668536
Original created: 2025-10-28T21:54:34Z


I am wondering whether we can automate the bundle download. This is a bit tricky because the edge devices aren't necessarily supposed to be able to talk directly to the Core. If we put the bundle on the KV, the device wouldn't have the credentials to fetch it before onboarding, unless we exposed a plain TLS interface on a different port for gRPC API calls. We would also have to keep the request/token generation information in the KV so we could verify it somehow. This might be more complicated, but I would still like a good way to do easier, hands-off deployments. The way it is today can still easily be automated.

We could just hit the public Core API from the edge and make that a requirement (network access from edge agents to the Core API).

Another thought is to let the poller orchestrate a lot of this. Since the poller has to be onboarded first (we need a way to onboard pollers the same as edge agents, btw), the poller can talk to the Core gRPC API directly and get information about clients that are waiting to be onboarded. Part of the UI onboarding workflow should also be selecting the poller that we expect the new edge agent/checker to be associated with. The poller would gather the onboarding information from the Core and make it available to agents/checkers. Packages could be stored in the NATS JetStream object store; a reaper routine would watch the TTL/expiry on the token and clean them up.
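
The reaper idea can be sketched independently of where packages end up being stored. The example below uses an in-memory map as a stand-in for the object store (it does not touch NATS), and every name in it is hypothetical. A real reaper would call something like Reap from a ticker.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"time"
)

// packageStore is an in-memory stand-in for the package object store.
type packageStore struct {
	mu      sync.Mutex
	expires map[string]time.Time // package ID -> token expiry
}

func newPackageStore() *packageStore {
	return &packageStore{expires: make(map[string]time.Time)}
}

// Put records a package and the time its token expires.
func (s *packageStore) Put(id string, expiresAt time.Time) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.expires[id] = expiresAt
}

// Len reports how many packages are still stored.
func (s *packageStore) Len() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.expires)
}

// Reap deletes every package whose expiry is at or before now and
// returns the removed IDs in sorted order.
func (s *packageStore) Reap(now time.Time) []string {
	s.mu.Lock()
	defer s.mu.Unlock()
	var removed []string
	for id, exp := range s.expires {
		if !exp.After(now) {
			delete(s.expires, id)
			removed = append(removed, id)
		}
	}
	sort.Strings(removed)
	return removed
}

// demoClock returns a fixed time so the example is reproducible.
func demoClock() time.Time {
	return time.Date(2025, 10, 28, 22, 0, 0, 0, time.UTC)
}

func main() {
	now := demoClock()
	s := newPackageStore()
	s.Put("pkg-old", now.Add(-time.Minute))
	s.Put("pkg-new", now.Add(15*time.Minute))
	fmt.Println("reaped:", s.Reap(now), "remaining:", s.Len())
}
```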

Author
Owner

Imported GitHub comment.

Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1903#issuecomment-3458840163
Original created: 2025-10-28T22:38:10Z


We also need to make sure that we are generating internal events that get published to NATS and show up in the UI under the events dashboard/console, for all of these onboarding events.

Author
Owner

Imported GitHub comment.

Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1903#issuecomment-3459492711
Original created: 2025-10-29T03:38:27Z


Edge onboarding backend is now lifecycle-aware:

  • edgeOnboardingService implements Start/Stop, runs a bounded refresh loop (5s timeout, 5m interval), and exposes a callback so the API’s dynamic poller list stays in sync.
  • Core wires the service through pkg/lifecycle, broadcasts updates via SetDynamicPollers, and no longer blocks startup on a streaming query.
  • Proton query updated to SELECT ... FROM table(edge_onboarding_packages) FINAL to satisfy changelog_kv semantics.
  • Unit coverage updated (async start/stop, permission revocation); gofmt + go test ./pkg/... are clean.

Remaining work: trim cmd/core/main.go (backfill path) into lifecycle options, finish UI/CLI flows, and script proton token automation.

Author
Owner

Imported GitHub comment.

Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1903#issuecomment-3459493022
Original created: 2025-10-29T03:38:40Z


Edge onboarding backend is now lifecycle-aware:

  • edgeOnboardingService implements Start/Stop, runs a bounded refresh loop (5s timeout, 5m interval), and exposes a callback so the API’s dynamic poller list stays in sync.
  • Core wires the service through pkg/lifecycle, broadcasts updates via SetDynamicPollers, and no longer blocks startup on a streaming query.
  • Proton query updated to SELECT ... FROM table(edge_onboarding_packages) FINAL to satisfy changelog_kv semantics.
  • Unit coverage updated (async start/stop, permission revocation); gofmt + go test ./pkg/... are clean.

Remaining work: trim cmd/core/main.go (backfill path) into lifecycle options, finish UI/CLI flows, and script proton token automation.
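
For readers unfamiliar with the pattern, a bounded refresh loop of the kind described above might look roughly like this. It is an illustrative sketch only, not the actual edgeOnboardingService: the refresher type, its fields, and runDemo are hypothetical, while the 5s timeout and 5m interval are taken from the comment.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// refresher is a hypothetical stand-in for a lifecycle-aware service.
type refresher struct {
	interval time.Duration                               // time between refreshes
	timeout  time.Duration                               // upper bound for one refresh
	load     func(ctx context.Context) ([]string, error) // fetches poller IDs
	onUpdate func(pollers []string)                      // callback with the new list

	cancel context.CancelFunc
	done   chan struct{}
}

// Start launches the loop in the background and returns immediately,
// so startup never waits on the first query.
func (r *refresher) Start(ctx context.Context) {
	ctx, r.cancel = context.WithCancel(ctx)
	r.done = make(chan struct{})
	go func() {
		defer close(r.done)
		ticker := time.NewTicker(r.interval)
		defer ticker.Stop()
		for {
			r.refreshOnce(ctx)
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
			}
		}
	}()
}

// refreshOnce bounds a single load with the configured timeout.
func (r *refresher) refreshOnce(parent context.Context) {
	ctx, cancel := context.WithTimeout(parent, r.timeout)
	defer cancel()
	pollers, err := r.load(ctx)
	if err != nil {
		return // sketch: keep the previous list on failure
	}
	r.onUpdate(pollers)
}

// Stop cancels the loop and waits for the goroutine to exit.
func (r *refresher) Stop() {
	r.cancel()
	<-r.done
}

// runDemo starts the loop, waits for the first update, and stops it.
func runDemo() []string {
	updates := make(chan []string, 1)
	r := &refresher{
		interval: 5 * time.Minute,
		timeout:  5 * time.Second,
		load: func(ctx context.Context) ([]string, error) {
			return []string{"edge-site-x"}, nil // placeholder poller ID
		},
		onUpdate: func(p []string) {
			select {
			case updates <- p:
			default:
			}
		},
	}
	r.Start(context.Background())
	first := <-updates
	r.Stop()
	return first
}

func main() {
	fmt.Println("dynamic pollers:", runDemo())
}
```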
