feat(agent): dispatch systemd-service/timer add-ons via agent-updater (delivery-models 3.1) (#3425) #3467

Merged
mfreeman451 merged 2 commits from feat/addon-systemd-dispatch into staging 2026-05-31 16:10:54 +00:00

What

Slice 2b of add-native-addon-delivery-models — wires the systemd supervision primitive (the agent-updater install/enable from the prior slice, merged in #3465) into the agent's assignment reconciliation, completing the agent side of task 3.1. This is what makes a systemd-service add-on (e.g. netprobe) or a systemd-timer add-on (e.g. Bumblebee) actually install and run from a pushed assignment.

Design

  • classifyAddonSupervision: systemd_service/systemd_timer now route to a new addonDispatchSystemd (ephemeral_helper stays unimplemented).
  • Shared stageAndCapability (stage → setcap → rollback) used by both the sidecar and systemd paths; sidecar handling moved to buildSidecarAddonSpec (behavior preserved).
  • applySystemdAddon: stage → setcap → discover the .service/.timer units shipped inside the signed staged bundle (the agent reads its own staging area, so no proto/manifest/DB unit plumbing is needed) → install + enable the primary (timer for systemd_timer, service for systemd_service) via the root-owned updater. Rolls current back on any failure.
  • reconcileSystemdAddons: tracks installed units per add-on and uninstalls those no longer desired (disabled/unassigned) — satisfies "disabling stops it".
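
The discovery and primary-unit selection steps can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names `discoverUnits` and `primaryUnit`, their signatures, and the flat staged-directory layout are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// discoverUnits lists the .service and .timer files directly inside a staged
// bundle directory. Hypothetical helper: the real layout may differ.
// os.ReadDir returns entries sorted by filename, so the result is ordered.
func discoverUnits(dir string) ([]string, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, err
	}
	var units []string
	for _, e := range entries {
		if e.IsDir() {
			continue
		}
		switch filepath.Ext(e.Name()) {
		case ".service", ".timer":
			units = append(units, e.Name())
		}
	}
	return units, nil
}

// primaryUnit picks the unit to enable: the .timer for systemd_timer and the
// .service for systemd_service. Any other supervision kind yields no unit.
func primaryUnit(supervision string, units []string) (string, bool) {
	var want string
	switch supervision {
	case "systemd_timer":
		want = ".timer"
	case "systemd_service":
		want = ".service"
	default:
		return "", false
	}
	for _, u := range units {
		if filepath.Ext(u) == want {
			return u, true
		}
	}
	return "", false
}

func main() {
	dir, err := os.MkdirTemp("", "staged-addon")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	// Simulate a staged bundle: one binary plus its two unit files.
	for _, name := range []string{"bumblebee", "bumblebee.service", "bumblebee.timer"} {
		if err := os.WriteFile(filepath.Join(dir, name), nil, 0o644); err != nil {
			panic(err)
		}
	}

	units, err := discoverUnits(dir)
	if err != nil {
		panic(err)
	}
	primary, _ := primaryUnit("systemd_timer", units)
	fmt.Println(units, primary) // [bumblebee.service bumblebee.timer] bumblebee.timer
}
```

For a systemd_timer add-on both unit files would be installed, but only the timer is enabled, which matches the real-host verification described under Verification.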

Adversarial review + hardening (2nd commit)

Ran a multi-lens adversarial review of the first commit; fixed the confirmed findings:

  • Concurrency (high): serialized the whole reconcile under a new addonReconcileMu (it runs from the poll/control-stream/enroll goroutines) so concurrent rounds can't race over the tracking map. Verified with -race.
  • Restart persistence (high): rehydrate installed-unit tracking from the staging root once per process, so a disabled add-on's units are still uninstalled after an agent restart.
  • Stale-unit leak (medium): an add-on update that renames its units now uninstalls the dropped ones.
  • Destructive re-deploy cleanup (medium): install-failure cleanup removes only newly-created unit files, never a running add-on's existing ones.
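
The stale-unit fix amounts to a set difference between the units installed previously and the units the new bundle ships. A minimal sketch follows; `stringsNotIn` is named in this PR, but the signature and body below are assumptions for illustration, and the unit names are made up.

```go
package main

import "fmt"

// stringsNotIn returns the elements of a that do not appear in b, in a's
// order. Assumed signature; the PR's actual helper may differ.
func stringsNotIn(a, b []string) []string {
	seen := make(map[string]struct{}, len(b))
	for _, s := range b {
		seen[s] = struct{}{}
	}
	var out []string
	for _, s := range a {
		if _, ok := seen[s]; !ok {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	// Units installed by the previous version of a hypothetical add-on.
	previous := []string{"netprobe.service", "netprobe-helper.service"}
	// Units shipped by the updated bundle: the helper unit was renamed.
	current := []string{"netprobe.service", "netprobe-agent.service"}

	// These are the dropped units that should be uninstalled.
	fmt.Println(stringsNotIn(previous, current)) // [netprobe-helper.service]
}
```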

Verification

  • go build/vet clean; golangci-lint 0 issues; addon test suite passes under -race.
  • Unit-tested: dispatch classification, unit discovery, primary-unit selection, reconcile decision, stringsNotIn, rehydration discovery.
  • Real host (Linux + root): both systemd_service (install → enabled/active → uninstall → removed) and systemd_timer (install .service+.timer, enable .timer → enabled/active/scheduled → uninstall) verified end-to-end against the built updater.

Scope / follow-ups

  • Builds on #3465 (rollback + setcap + the updater systemd primitive), now merged to staging.
  • 2c (build packaging the unit files into the signed bundle) and the per-add-on systemd-timer spool ingest are follow-ups; full agent-process e2e is the migration scratch-agent test.
  • Low/defense-in-depth: tightening the updater's uninstall to an add-on unit naming convention — noted, not done.

🤖 Generated with Claude Code

Wires the systemd supervision primitive (the agent-updater install/enable from the prior
slice) into the agent's assignment reconciliation, completing the systemd half of
delivery-models task 3.1.

- classifyAddonSupervision: systemd_service/systemd_timer now route to a new
  addonDispatchSystemd (ephemeral_helper stays unimplemented).
- Extract stageAndCapability (shared stage + setcap + rollback) so the sidecar and
  systemd paths reuse identical delivery logic; sidecar handling moves to
  buildSidecarAddonSpec (behavior preserved).
- applySystemdAddon: stage -> setcap -> discover the .service/.timer units shipped in the
  signed staged bundle (the agent reads its own staging area; no proto/manifest unit
  plumbing needed) -> install + enable the primary (timer for systemd_timer, service for
  systemd_service) via the root-owned updater. On any failure, roll `current` back.
- reconcileSystemdAddons: track installed units per add-on and uninstall those no longer
  desired (assignment disabled/removed), satisfying "disabling stops it".
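
The reconcile decision can be sketched as a comparison of the tracked installs against the currently desired assignments. The function name `addonsToUninstall`, the map shapes, and the add-on IDs below are hypothetical and chosen only for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// addonsToUninstall returns the IDs of add-ons that have tracked units but
// are no longer desired (assignment disabled or removed). Hypothetical
// helper: the PR's real tracking structures are not shown in the source.
func addonsToUninstall(installed map[string][]string, desired map[string]bool) []string {
	var stale []string
	for id := range installed {
		if !desired[id] {
			stale = append(stale, id)
		}
	}
	sort.Strings(stale) // map iteration order is random; sort for stable output
	return stale
}

func main() {
	installed := map[string][]string{
		"netprobe":  {"netprobe.service"},
		"bumblebee": {"bumblebee.service", "bumblebee.timer"},
	}
	desired := map[string]bool{"netprobe": true}

	fmt.Println(addonsToUninstall(installed, desired)) // [bumblebee]
}
```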

Unit-tested: discovery, primary-unit selection, reconcile decision, classify, plus the
existing stage/capability/rollback suite. Verified on a Linux+root host: both
systemd_service and systemd_timer install -> enabled/active and uninstall -> removed.
Full agent-process e2e is the migration scratch-agent test. Unit files are bundled by
the build (2c) follow-up; spool ingest for timers stays per-add-on.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
fix(agent): harden systemd add-on dispatch per adversarial review (#3425)
Some checks failed
lint / lint (push) Successful in 1m12s
Golang Tests / test-go (push) Successful in 1m18s
Secret Scan / gitleaks (pull_request) Successful in 48s
lint / lint (pull_request) Successful in 1m22s
CI / build (pull_request) Failing after 2m41s
ef681282fa
Addresses confirmed findings from an adversarial review of the systemd dispatch:

- Concurrency (high): applyAddonAssignments runs from the poll, control-stream, and
  enroll goroutines; systemdAddonsMu only guarded individual map cells, so concurrent
  rounds could race over installedSystemdAddons and spuriously uninstall a still-desired
  add-on or orphan units. Serialize the whole reconcile under a new addonReconcileMu
  (the sidecar path is already atomic inside manager.Apply). Verified with -race.
- Restart persistence (high): installedSystemdAddons was in-memory only, so after an
  agent restart a disabled/removed add-on's units were never uninstalled. Rehydrate the
  tracking once per process from the staging root (discoverInstalledSystemdAddons).
- Stale-unit leak on update (medium): when an add-on update renames/drops unit files,
  uninstall the previously-installed units the new bundle no longer ships.
- Destructive re-deploy cleanup (medium): InstallAddonSystemdUnits' failure cleanup now
  removes only unit files it newly created, never a pre-existing (running) add-on's units.
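
The concurrency fix can be illustrated with a single mutex held for the whole reconcile round rather than per map access. In this sketch only the field name `addonReconcileMu` comes from the commit; the `agent` type, the `installed` map, and the `reconcile` method are assumptions for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// agent is a hypothetical stand-in for the real agent type.
type agent struct {
	addonReconcileMu sync.Mutex          // serializes entire reconcile rounds
	installed        map[string][]string // add-on ID -> installed unit names
}

// reconcile brings the tracking map in line with desired while holding the
// mutex for the whole round, so no other round sees a half-updated map.
func (a *agent) reconcile(desired map[string][]string) {
	a.addonReconcileMu.Lock()
	defer a.addonReconcileMu.Unlock()

	for id := range a.installed {
		if _, ok := desired[id]; !ok {
			delete(a.installed, id) // the real code would uninstall units here
		}
	}
	for id, units := range desired {
		a.installed[id] = units // the real code would install and enable here
	}
}

func main() {
	a := &agent{installed: map[string][]string{}}
	desired := map[string][]string{"netprobe": {"netprobe.service"}}

	// Three callers stand in for the poll, control-stream, and enroll goroutines.
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			a.reconcile(desired)
		}()
	}
	wg.Wait()
	fmt.Println(len(a.installed)) // 1
}
```

Running a sketch like this with `go run -race` is one way to confirm that the coarse lock removes the data race on the map.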

New unit tests: stringsNotIn, discoverInstalledSystemdAddons. Build/vet/golangci-lint
clean; addon suite passes under -race. (Low: tightening the updater's uninstall to an
add-on naming convention is a noted defense-in-depth follow-up.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
mfreeman451 left a comment

lgtm


Reference
carverauto/serviceradar!3467