feat(agent): pushed-artifact add-on rollback + setcap via agent-updater (delivery-models 2.2) (#3425) #3465

Merged
mfreeman451 merged 7 commits from feat/addon-delivery-supervision into staging 2026-05-31 15:34:04 +00:00

What

Continues add-native-addon-delivery-models — completes task 2.2: pushed-artifact activation rollback + Linux file-capability application (setcap) via the root-owned agent-updater. This is the privileged-delivery primitive the netprobe migration needs (cap_net_raw/cap_bpf/cap_perfmon for eBPF/AF_XDP).

Pieces

  • Rollback primitives (addon_activation.go): readAddonCurrentTarget captures the active version before staging; rollbackAddonCurrent restores it (or removes a failed first-time current), and refuses to restore a missing target rather than leave current as a dangling symlink.
  • Proto: os_capabilities added to AddonAssignmentConfig (Go + Elixir bindings regenerated consistently, protoc-gen-go v1.36.11).
  • agent-updater: new --addon-id/--addon-binary/--addon-capabilities mode → ApplyAddonCapabilities: allowlist-bounded capabilities (fails closed on anything outside the set), staged-binary resolution under the controlled add-on root with safe-segment + symlink-escape guards, then setcap <caps>=+ep.
  • AgentConfigGenerator (Elixir): emits os_capabilities from the manifest's requires so the field actually reaches the agent.
  • applyAddonAssignments: capture prior version → stage → apply capabilities; on failure, roll current back and keep the last-known-good assignment.

Verification

  • Unit (CI/darwin): capability normalize/allowlist-reject, setcap arg string, staged-binary resolution + unsafe-segment + symlink-escape guards, and all rollback paths. golangci-lint clean; go build/vet clean; Elixir core compiles.
  • Real setcap (Linux + root host): end-to-end against the built agent-updater:
    • allowed cap_net_raw,cap_bpf → getcap shows cap_net_raw,cap_bpf=ep
    • disallowed cap_sys_admin → fails closed, no setcap
    • tampered current → escape refused, outside binary never setcap'd

Scope / follow-ups

  • Completes delivery-models 2.2. Remaining in that change: 3.1 systemd-service/timer supervision + 3.2 ephemeral-helper, and 1.1/1.2 base-agent packaging carve + os-package template. Those are the next slices.
  • Merges cleanly into staging (independent of the already-merged addon PRs).
  • The agent never gains capabilities itself — only the root-owned, package-owned updater applies them.

🤖 Generated with Claude Code

delivery-models task 2.2 (rollback half). Adds the building blocks the privileged
activation sequence (setcap / systemd unit install) uses to recover from a failed
activation without leaving `current` pointing at an unusable version:

- readAddonCurrentTarget: capture the active version BEFORE staging a new one.
- rollbackAddonCurrent: atomically restore `current` to the prior version (or remove
  it on a failed first-time activation); refuses to restore a missing version dir
  (ErrAddonRollbackTargetMissing) rather than publish a dangling symlink; validates
  addon_id as a safe path segment.

Unit-tested: restore-prior, no-prior-removes, target-missing, unsafe-id. The setcap
file-capability application that calls rollback-on-failure lands in the next commit
(needs the os_capabilities proto field + agent-updater subcommand).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
delivery-models task 2.2 needs the manifest's requires.os_capabilities to reach the
agent so it can apply Linux file capabilities (setcap) to a staged pushed-artifact
add-on via the root-owned agent-updater (e.g. netprobe's cap_net_raw/cap_bpf/
cap_perfmon). Add `repeated string os_capabilities = 15` to AddonAssignmentConfig and
regenerate the Go binding (protoc-gen-go v1.36.11, matching the committed format);
the Elixir binding gets the matching declarative field.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
delivery-models task 2.2 (setcap half). After staging a pushed-artifact add-on, the
non-root agent asks the root-owned agent-updater to apply the manifest's declared
Linux file capabilities (requires.os_capabilities, now on AddonAssignmentConfig) to
the staged binary, e.g. netprobe's cap_net_raw/cap_bpf/cap_perfmon. The agent never
gains those capabilities itself.

- addon_capabilities.go: an allowlist-bounded capability set; normalize/validate
  (fail closed on anything outside the allowlist); resolve the staged binary under the
  controlled add-on root with safe-segment + symlink-escape guards; ApplyAddonCapabilities
  (the privileged setcap, run inside the updater); applyStagedAddonCapabilitiesViaUpdater
  (the agent-side invoke of the validated, root-owned updater).
- agent-updater: new --addon-id/--addon-binary/--addon-capabilities mode dispatches to
  ApplyAddonCapabilities instead of release activation.
- push_loop applyAddonAssignments: capture the prior current version, stage, apply
  capabilities; on failure roll `current` back (rollback primitives) and keep the
  last-known-good assignment.

Unit-tested on this host: allowlist normalize/reject, setcap arg string, staged-binary
resolution + unsafe-segment + escape guards, rollback. The actual setcap exec is
Linux+root and is verified separately on a test host.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Completes the delivery-models task 2.2 data flow: the generator now reads the
AddonPackage manifest's requires.os_capabilities and emits it on AddonAssignmentConfig
(string or atom keys, blanks dropped). The agent applies these as file capabilities to
the staged pushed-artifact binary via the root-owned agent-updater. Without this the
agent would always receive an empty list and never invoke setcap.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
docs(openspec): mark delivery-models 2.2 (rollback + setcap) done (#3425)
Some checks failed
lint / lint (push) Successful in 1m32s
Secret Scan / gitleaks (pull_request) Successful in 48s
Golang Tests / test-go (push) Successful in 1m32s
lint / lint (pull_request) Successful in 1m26s
CI / build (pull_request) Failing after 2m50s
Elixir Quality / Elixir Quality (pull_request) Failing after 13m6s
8e362ffc15
Task 2.2 (pushed-artifact rollback + file-capability application via the root-owned
agent-updater) is implemented and verified (unit tests + real setcap on a Linux+root
host). Remaining in this change: systemd-service/timer + ephemeral-helper supervision
(3.1/3.2) and the base packaging carve (1.1/1.2).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
mfreeman451 left a comment

lgtm

The supervision-model primitive for systemd-service / systemd-timer add-ons. Unit files
ship in the signed add-on bundle and stage under current/; the root-owned agent-updater
installs exactly the units the control plane names, then enables (--now) the primary:

- addon_systemd.go: validate unit names (safe segment + .service/.timer), resolve under
  the staged current/ dir with symlink-escape guard, copy to /etc/systemd/system,
  daemon-reload, enable --now; self-cleans on failure. UninstallAddonSystemdUnits does
  idempotent disable --now + remove + reload (rollback/teardown).
- agent-updater: --addon-systemd-install/--addon-systemd-enable/--addon-systemd-uninstall
  modes (dispatch refactored to a switch over release / capabilities / systemd modes).

Unit-tested here (name/resolution/escape/validation). Verified on a Linux+root host:
install -> enabled+active, uninstall -> removed+inactive, tampered current -> escape
refused (nothing installed). Agent dispatch + proto unit plumbing + bundle packaging
are the next phases (2b/2c); systemd-timer spool ingest stays per-add-on.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
refactor(agent): split push_loop.go into cohesive files (<1k lines)
Some checks failed
lint / lint (push) Successful in 1m17s
Secret Scan / gitleaks (pull_request) Successful in 59s
lint / lint (pull_request) Successful in 2m28s
Golang Tests / test-go (push) Successful in 2m48s
CI / build (pull_request) Failing after 3m6s
Elixir Quality / Elixir Quality (pull_request) Failing after 13m12s
53950ca04a
push_loop.go had grown to 3427 lines. Split it into focused push_loop_*.go files in
the same package (extending the existing push_loop_plugins/flow_attribution/... split
convention). Pure mechanical move — whole declarations relocated with their doc
comments, zero logic/signature/behavior changes.

push_loop.go: 3427 -> 520 lines (core: PushLoop struct, NewPushLoop, Start/Stop,
pushStatus orchestration, desktop-media glue, JSON-limit helpers). New files:
push_loop_state (accessors), _enroll, _config (config apply), _addons (add-on
supervision + the setcap/rollback wiring), _plugin_config, _capabilities, _snmp,
_mapper_netprobe, _deployment, _status (status push/convert/chunk).

Behavior-preserving: go build/vet pass, the existing go/pkg/agent test suite passes,
golangci-lint 0 issues, gofmt clean. Largest new file 663 lines.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
mfreeman451 left a comment

lgtm

Reference
carverauto/serviceradar!3465