Create common onboarding library to eliminate edge deployment friction #635

Closed
opened 2026-03-28 04:26:44 +00:00 by mfreeman451 · 1 comment
Owner

Imported from GitHub.

Original GitHub issue: #1915
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1915
Original created: 2025-10-31T20:55:09Z


Problem Statement

The current edge onboarding process requires extensive manual intervention through shell scripts, kubectl commands, and configuration manipulation. This creates significant friction for edge deployments and doesn't scale for production use.

Current Pain Points:

  • Multiple shell scripts required (setup-edge-e2e.sh, edge-poller-restart.sh, setup-edge-poller.sh, refresh-upstream-credentials.sh)
  • Manual poller registration via kubectl and KV updates
  • Manual DNS to IP address conversion for Docker deployments
  • No simple API/CLI for package management (authentication issues)
  • Join token expiration handling is manual
  • Complex multi-step process prone to errors

See detailed analysis in: docker/compose/edge-e2e/FRICTION_POINTS.md

IMPORTANT SCOPE: This is ONLY for edge service onboarding. The k8s and main docker-compose stacks already have working SPIFFE enrollment via the SPIFFE controller and CRDs; those are NOT being changed.

Vision

Create a common onboarding library (pkg/edgeonboarding) that edge ServiceRadar components (agent, poller, checkers) can use to automatically onboard using a simple token-based flow.

Desired User Experience

For Edge Poller/Agent:

# Create package in UI or via CLI
serviceradar-cli edge create-package --name "Remote Site A" --type poller

# Deploy with just the token - everything else is automatic
docker run -e ONBOARDING_TOKEN=<token> ghcr.io/carverauto/serviceradar-poller:latest

For Edge Checkers:

# Create checker package
serviceradar-cli edge create-package --name "Remote Checker" --type checker

# Deploy checker with token
docker run -e ONBOARDING_TOKEN=<token> ghcr.io/carverauto/serviceradar-checker:latest

The library handles everything:

  • Downloads package from Core
  • Extracts and configures SPIRE credentials (nested SPIRE for edge, not k8s SPIRE)
  • Auto-registers with Core via KV (datasvc)
  • Generates service configuration
  • Handles credential rotation
  • Starts service

Proposed Architecture

1. Common Onboarding Library

Package: pkg/edgeonboarding

Core Interface:

type OnboardingConfig struct {
    Token              string   // Onboarding token from package
    CoreEndpoint       string   // Optional: auto-discovered from package
    KVEndpoint         string   // Bootstrap KV address (sticky config)
    ServiceType        string   // "agent", "poller", "checker"
    ServiceID          string   // Optional: readable name override
    StoragePath        string   // Where to persist config/credentials
    DeploymentType     string   // "docker", "bare-metal" (auto-detect if not provided)
}

func Bootstrap(ctx context.Context, cfg OnboardingConfig) (*Config, error) {
    // 1. Download package from Core using token
    // 2. Validate and extract package contents
    // 3. Configure nested SPIRE for edge (NOT k8s SPIRE)
    // 4. Auto-register with Core via KV/database
    // 5. Generate service config based on deployment type
    // 6. Set up credential rotation
    // 7. Return ready-to-use config
}

func Rotate(ctx context.Context) error {
    // Handle SPIRE credential rotation
}
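
To make "the token is the only required input" concrete, the library could apply environment fallbacks and defaults before bootstrapping. A minimal sketch, assuming an applyDefaults helper; the helper name and the default storage path are illustrative, not part of the proposed API:

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// OnboardingConfig mirrors the proposed struct above, trimmed to the
// fields this sketch needs.
type OnboardingConfig struct {
	Token       string
	ServiceType string
	StoragePath string
}

// applyDefaults fills missing fields the way Bootstrap might: the token
// falls back to the ONBOARDING_TOKEN environment variable, and the
// storage path to a fixed default (an assumption for illustration).
func applyDefaults(cfg OnboardingConfig) (OnboardingConfig, error) {
	if cfg.Token == "" {
		cfg.Token = os.Getenv("ONBOARDING_TOKEN")
	}
	if cfg.Token == "" {
		return cfg, errors.New("onboarding token required (set ONBOARDING_TOKEN)")
	}
	if cfg.StoragePath == "" {
		cfg.StoragePath = "/var/lib/serviceradar"
	}
	return cfg, nil
}

func main() {
	cfg, err := applyDefaults(OnboardingConfig{Token: "demo-token", ServiceType: "poller"})
	if err != nil {
		panic(err)
	}
	fmt.Println(cfg.StoragePath)
}
```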

Deployment Type Detection:

  • Automatically detect if running in Docker or bare metal
  • Use appropriate addresses (LoadBalancer IPs for Docker accessing k8s services)
  • Configure nested SPIRE appropriately for edge environment
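
One way the auto-detection could work, as a heuristic sketch: the /.dockerenv sentinel file and cgroup markers are widely used container-detection conventions, not a confirmed part of this design:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// detectDeploymentType guesses "docker" vs "bare-metal". The /.dockerenv
// sentinel and the docker/containerd markers in /proc/1/cgroup are common
// heuristics; other container runtimes may need additional checks.
func detectDeploymentType() string {
	if _, err := os.Stat("/.dockerenv"); err == nil {
		return "docker"
	}
	if data, err := os.ReadFile("/proc/1/cgroup"); err == nil {
		cg := string(data)
		if strings.Contains(cg, "docker") || strings.Contains(cg, "containerd") {
			return "docker"
		}
	}
	return "bare-metal"
}

func main() {
	fmt.Println(detectDeploymentType())
}
```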

Sticky vs Dynamic Config:

  • Sticky (static config file): KV address, Core address (chicken-and-egg: these are needed before KV can be reached)
  • Dynamic (from KV): Everything else - checker configs, known pollers list, etc.
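
The sticky portion could be a tiny static file read once at startup, with everything else fetched from KV afterwards. A sketch of parsing it; the JSON keys and example endpoints are assumptions:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// BootstrapConfig holds only the sticky values that cannot come from KV,
// because they are needed before KV is reachable.
type BootstrapConfig struct {
	KVEndpoint   string `json:"kv_endpoint"`
	CoreEndpoint string `json:"core_endpoint"`
}

// parseBootstrap decodes the static bootstrap file contents.
func parseBootstrap(data []byte) (BootstrapConfig, error) {
	var cfg BootstrapConfig
	err := json.Unmarshal(data, &cfg)
	return cfg, err
}

func main() {
	raw := []byte(`{"kv_endpoint":"kv.example.com:50057","core_endpoint":"core.example.com:50052"}`)
	cfg, err := parseBootstrap(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(cfg.KVEndpoint)
}
```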

2. Automatic Poller Registration via KV

KV/Database-based approach (NOT ConfigMaps):

  • Modify isKnownPoller() in Core to check edge_packages table via KV
  • Add allowed_poller_id column to edge_packages table
  • Package creation sets: allowed_poller_id = poller_id_override OR component_id
  • Packages with status in [Issued, Delivered, Activated] are automatically allowed
  • Services read allowed poller list from KV, not ConfigMaps
  • No restarts needed

Key Point: ConfigMaps are legacy. Everything dynamic should be in KV (datasvc).
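
The allow-list rule above could look roughly like this. A self-contained sketch: the real check would query the edge_packages table through KV, and the struct and field names here are assumptions, not the actual schema:

```go
package main

import "fmt"

// allowedStatuses mirrors the proposed rule: packages in Issued,
// Delivered, or Activated state automatically allow their poller ID.
var allowedStatuses = map[string]bool{
	"issued":    true,
	"delivered": true,
	"activated": true,
}

// edgePackage is a minimal stand-in for a row in the proposed
// edge_packages table.
type edgePackage struct {
	AllowedPollerID string
	Status          string
}

// isKnownPoller reports whether pollerID matches a package in an allowed
// status. In the real Core this lookup would go through KV (datasvc);
// here the packages are passed in directly to keep the sketch runnable.
func isKnownPoller(pollerID string, pkgs []edgePackage) bool {
	for _, p := range pkgs {
		if p.AllowedPollerID == pollerID && allowedStatuses[p.Status] {
			return true
		}
	}
	return false
}

func main() {
	pkgs := []edgePackage{{AllowedPollerID: "site-a", Status: "issued"}}
	fmt.Println(isKnownPoller("site-a", pkgs))
}
```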

3. Separate Docker Compose Stacks

Strategy: Keep main stack clean, create dedicated edge stacks

edge-poller-stack.compose.yml (already exists as poller-stack.compose.yml):

  • Poller with nested SPIRE server (edge deployment)
  • Agent sharing network namespace with poller
  • Optional checkers
  • Uses onboarding library
  • For edge sites needing full agent + poller deployment
  • Uses nested SPIRE, NOT k8s SPIRE

edge-checker-stack.compose.yml (new):

  • Just checkers, no poller/agent
  • Lighter weight authentication
  • For edge sites that only need checker deployment
  • Can run in container or bare metal
  • Uses onboarding library

main docker-compose.yml:

  • UNCHANGED - no edge onboarding logic
  • Uses SPIFFE controller and CRDs for automatic enrollment
  • Used for local development and trusted environments
  • NOT related to this work

k8s deployments:

  • UNCHANGED - SPIFFE controller and CRDs handle enrollment automatically
  • NOT related to this work

4. Simplified Package Management

Add CLI commands (extends existing serviceradar-cli):

serviceradar-cli edge create-package --name "Remote Site A" --type poller --spiffe-id "spiffe://carverauto.dev/ns/edge/site-a-poller"
serviceradar-cli edge list-packages [--status issued|delivered|activated|revoked]
serviceradar-cli edge revoke-package <id>
serviceradar-cli edge delete-package <id>
serviceradar-cli edge download-package <id> --output ./package.tar.gz

Fix API authentication:

  • Document default admin credentials
  • Add token-based API access for automation
  • Consider service account tokens for CI/CD
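
For token-based automation, the CLI could authenticate with a bearer token instead of admin credentials. A sketch of request construction; the endpoint path is hypothetical (the real API route is not defined in this issue), and only the bearer-token pattern is the point:

```go
package main

import (
	"fmt"
	"net/http"
)

// newDownloadRequest builds an authenticated package-download request.
// The /api/edge/packages/... path is an assumption used for illustration.
func newDownloadRequest(coreURL, packageID, token string) (*http.Request, error) {
	url := fmt.Sprintf("%s/api/edge/packages/%s/download", coreURL, packageID)
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	return req, nil
}

func main() {
	req, err := newDownloadRequest("https://core.example.com", "pkg-123", "svc-token")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.URL.String())
}
```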

Implementation Plan

Phase 1: Core Library (Week 1-2)

  • Create pkg/edgeonboarding package
  • Implement package download and validation
  • Implement nested SPIRE credential handling for edge
  • Implement deployment type detection (docker/bare-metal)
  • Add configuration generation (sticky bootstrap + dynamic from KV)
  • Unit tests for all components

Phase 2: KV-backed Registration (Week 2)

  • Add allowed_poller_id column to edge_packages table
  • Update package creation to populate allowed_poller_id
  • Modify isKnownPoller() to check KV/database
  • Ensure Core reads known pollers from KV, not ConfigMaps
  • Add migration path for existing pollers
  • Integration tests

Phase 3: Service Integration (Week 3)

  • Update edge poller to use onboarding library
  • Update edge agent to use onboarding library
  • Update edge checkers to use onboarding library
  • Add ONBOARDING_TOKEN environment variable support
  • E2E tests for each service type
  • DO NOT modify k8s or main docker-compose deployments

Phase 4: Stack Reorganization (Week 3-4)

  • Create edge-checker-stack.compose.yml
  • Clean up edge-poller-stack.compose.yml to use library
  • Ensure main docker-compose.yml has no edge logic
  • Document differences between edge stacks and main stack
  • Update all edge deployment guides

Phase 5: CLI and API Improvements (Week 4)

  • Add serviceradar-cli edge commands
  • Fix API authentication issues
  • Document API endpoints with Swagger
  • Add service account token support
  • Update Web UI for edge package management

Phase 6: Cleanup and Documentation (Week 4-5)

  • Remove old shell scripts (setup-edge-e2e.sh, etc.)
  • Update all edge documentation
  • Create migration guide for existing edge deployments
  • Add troubleshooting guide
  • Record demo video

Success Criteria

Deployment Simplicity:

  • Edge deployment requires only onboarding token (no shell scripts)
  • No manual kubectl commands required
  • No manual ConfigMap or KV updates
  • Works across Docker and bare metal edge deployments

Automation:

  • Automatic poller registration via KV
  • Automatic deployment type detection
  • Automatic configuration generation (bootstrap sticky + dynamic from KV)
  • Automatic credential rotation

Scalability:

  • Can deploy hundreds of edge sites without manual intervention
  • Package management via CLI and API
  • No Core restarts required for new edge pollers

Code Quality:

  • Main docker-compose stack has zero edge onboarding code
  • k8s deployments unchanged (SPIFFE controller still works)
  • Clear separation between trusted (k8s/main) and untrusted (edge) environments
  • Comprehensive tests (unit, integration, e2e)
  • Well-documented with examples

Architecture:

  • All dynamic config in KV (datasvc), NOT ConfigMaps
  • Only bootstrap/sticky config in static files (KV address, Core address)
  • Nested SPIRE for edge, separate from k8s SPIRE

Out of Scope (Future Work)

  • Automatic SPIRE join token rotation (currently requires manual regeneration after 15 minutes)
  • Multi-region package distribution
  • Edge deployment health monitoring dashboard
  • Automatic version upgrades for edge deployments
  • Advanced networking (VPN, NAT traversal, etc.)
  • Changes to k8s SPIFFE enrollment (already works)
  • Changes to main docker-compose SPIFFE enrollment (already works)

Related Issues

  • Closes: #1911 (E2E validation - completed via bd issue serviceradar-56)
  • Addresses all friction points documented in docker/compose/edge-e2e/FRICTION_POINTS.md
  • Tracked in bd as: serviceradar-57

Files Impacted

New:

  • pkg/edgeonboarding/ - Core library
  • docker/compose/edge-checker-stack.compose.yml - New checker-only stack for edge

Modified:

  • pkg/db/edge_packages.go - Add allowed_poller_id column
  • pkg/core/pollers.go - Update isKnownPoller() to check KV/database
  • docker/compose/edge-poller-stack.compose.yml - Use onboarding library (edge only)
  • Edge services (poller, agent, checkers) - Integrate onboarding library

NOT Modified:

  • k8s deployments - SPIFFE controller continues to work
  • docker-compose.yml (main) - No edge logic added
  • k8s SPIRE configuration - Not touched

Removed (after migration):

  • docker/compose/setup-edge-e2e.sh
  • docker/compose/edge-poller-restart.sh
  • docker/compose/setup-edge-poller.sh
  • docker/compose/refresh-upstream-credentials.sh
  • docker/compose/edge-e2e/ directory (documentation moves to main docs)

🤖 Generated with Claude Code

Author
Owner

Imported GitHub comment.

Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1915#issuecomment-3477243000
Original created: 2025-11-02T02:40:02Z


Progress Update: Service Device Registration

Completed foundational work for service component visibility and lifecycle management - a prerequisite for the common onboarding library.

What Was Done

Implemented a comprehensive device registration system so that pollers, agents, and checkers appear as distinct devices in the inventory, even when running on the same IP address.

Key Features:

  • Service-aware device IDs: serviceradar:service_type:service_id
  • Automatic self-registration on status reports
  • Parent-child relationship tracking in metadata
  • Revocation cleanup (tombstone updates when packages are revoked)
  • High cardinality support (tested with 100+ checkers per agent)
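
The ID scheme above can be sketched as a small helper. The helper name is an assumption; only the serviceradar:service_type:service_id format comes from this comment:

```go
package main

import "fmt"

// serviceDeviceID builds the service-aware device ID described above, so
// co-located services on one IP still get distinct inventory records.
func serviceDeviceID(serviceType, serviceID string) string {
	return fmt.Sprintf("serviceradar:%s:%s", serviceType, serviceID)
}

func main() {
	fmt.Println(serviceDeviceID("checker", "remote-checker-01"))
}
```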

This Enables:

  1. Visibility into which pollers, agents, and checkers are deployed
  2. Tracking of service relationships (agent→poller, checker→agent)
  3. Proper cleanup when services are revoked/un-enrolled
  4. Distinct device records even when all services run on the same IP

Impact on Onboarding Library

This work addresses the "Auto-register with Core via KV (datasvc)" step mentioned in the Bootstrap() function design. When the common onboarding library is implemented, it can rely on these registration mechanisms:

  • Services automatically register themselves as devices upon first status report
  • No manual KV updates needed for device visibility
  • Revocation automatically marks services as unavailable
  • Parent-child relationships preserved for operational visibility

Files Created/Modified

See detailed comment on #1909 for full file list. Key additions:

  • pkg/models/service_device.go
  • pkg/models/service_registration.go
  • Registration hooks in pkg/core/pollers.go and pkg/core/services.go
  • Revocation cleanup in pkg/core/edge_onboarding.go

Related Work

  • Closes device registration requirements from #1909
  • Sets foundation for Phase 2 (KV-backed Registration) of this issue's implementation plan

🤖 Generated with Claude Code
