feat(agent): native add-on pushed-artifact delivery + resilience (#3425) #3451

Merged
mfreeman451 merged 7 commits from feat/native-addon-delivery-models into staging 2026-05-29 17:15:12 +00:00

Summary

Implements the agent-side pushed-artifact delivery model and its resilience for native add-ons — the OpenSpec change add-native-addon-delivery-models (issue #3425), the first follow-up to the framework PR #3447.

Until now the agent only ran agent_sidecar add-ons whose binary already existed on the host. This branch lets the control plane deliver a signed add-on artifact that the agent fetches, verifies, stages, and runs.

What's in this PR

  • Agent pushed-artifact activation (go/pkg/agent/addon_activation.go): fetch the artifact from object storage → verify sha256 (and ed25519 signature when present) → stage under <runtime-root>/addons/<id>/versions/<version> with an atomic current symlink → hand the resolved binary to the go-plugin supervisor.
  • Control-plane artifact reference (agent_config_generator.ex): AddonAssignmentConfig gains artifact_object_key / artifact_sha256 / artifact_signature / target_os / target_arch; the generator selects the per-arch artifact from the package's artifacts map by the agent's metadata.os/arch and emits it; the reference joins the config version hash.
  • Last-known-good fallback: the versioned staging dir doubles as the cache — a transient delivery/verification failure reuses the existing current binary instead of tearing down a running add-on (and survives reboots while the store is down).
  • Local override (addons.local.json): operator break-glass / dev pinning that takes precedence over pushed assignments by addon_id; a malformed file is ignored so it can't break pushed delivery.

Reuse over reinvention

Per review guidance, this leans on existing infrastructure rather than reimplementing it: ObjectStore.DownloadObject (the bumblebee catalog-staging interface), hashutil for digests, the agent release ed25519 trust root for signature verification, and resolveReleaseRuntimeRoot for the staging root.

Validation

  • go build / go vet / go test + golangci-lint (0 issues) + gofmt — green.
  • mix compile --warnings-as-errors — green.
  • AgentConfigGenerator suite (incl. the new per-arch artifact tests) — 36 tests, 0 failures against the srql-fixtures CNPG cluster.
  • openspec validate add-native-addon-delivery-models --strict — passes.

Deferred (need root / systemd / build-packaging / signing keys)

Tracked in the change's tasks.md: file-capability application via agent-updater (setcap, needs root), version rollback (health-feedback loop), systemd-service/timer/ephemeral-helper supervision, config-toggle/os-package dispatch (best landed with the remote-access migration), and the base-agent packaging carve. These are better implemented and verified in an environment with that infrastructure.

DCO: I can add Signed-off-by trailers if the DCO check requires them.

🤖 Generated with Claude Code

feat(agent): pushed-artifact add-on activation — fetch, verify, stage (#3425)
All checks were successful
lint / lint (push) Successful in 2m10s
Golang Tests / test-go (push) Successful in 3m3s
516b30babb
Implements the agent side of pushed_artifact delivery (add-native-addon-delivery-models §2.2). For a delivery=pushed_artifact assignment carrying an artifact reference, the agent now fetches the artifact from object storage, verifies its sha256 (and ed25519 signature when present), stages it under <runtime-root>/addons/<id>/versions/<version> with an atomic current symlink, and runs the resolved staged binary via the go-plugin supervisor, instead of requiring binary_path to already exist on the host.
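
The verify-then-stage sequence above can be sketched roughly as follows. This is a minimal illustration and not the code in addon_activation.go: `stageArtifact`, `errDigestMismatch`, the `current.tmp` name, and the relative link target are all hypothetical, the fetch step is replaced by an in-memory byte slice, and a Unix filesystem is assumed.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// errDigestMismatch is a hypothetical sentinel for this sketch.
var errDigestMismatch = errors.New("artifact sha256 mismatch")

// stageArtifact checks data against the expected hex digest, writes it to
// <root>/addons/<id>/versions/<version>/<name>, and repoints "current".
func stageArtifact(root, id, version, name string, data []byte, wantSHA256 string) (string, error) {
	sum := sha256.Sum256(data)
	if !strings.EqualFold(hex.EncodeToString(sum[:]), wantSHA256) {
		return "", errDigestMismatch // verify before touching the filesystem
	}
	addonDir := filepath.Join(root, "addons", id)
	versionDir := filepath.Join(addonDir, "versions", version)
	if err := os.MkdirAll(versionDir, 0o755); err != nil {
		return "", err
	}
	binPath := filepath.Join(versionDir, name)
	if err := os.WriteFile(binPath, data, 0o755); err != nil {
		return "", err
	}
	if err := os.Chmod(binPath, 0o755); err != nil { // keep the exec bit despite umask
		return "", err
	}
	// Create the new link under a temporary name, then rename it over
	// "current" so the old link is replaced in a single step.
	tmp := filepath.Join(addonDir, "current.tmp")
	_ = os.Remove(tmp)
	if err := os.Symlink(filepath.Join("versions", version), tmp); err != nil {
		return "", err
	}
	if err := os.Rename(tmp, filepath.Join(addonDir, "current")); err != nil {
		return "", err
	}
	return binPath, nil
}

// demo stages two versions in a temp dir and reports where "current" points.
func demo() (string, error) {
	root, err := os.MkdirTemp("", "addon-stage")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(root)
	for _, v := range []string{"1.0.0", "1.1.0"} {
		data := []byte("addon " + v)
		sum := sha256.Sum256(data)
		if _, err := stageArtifact(root, "demo", v, "demo-addon", data, hex.EncodeToString(sum[:])); err != nil {
			return "", err
		}
	}
	return os.Readlink(filepath.Join(root, "addons", "demo", "current"))
}

func main() {
	target, err := demo()
	fmt.Println(target, err)
}
```

Because earlier version directories are left in place, a failed later delivery never disturbs the binary that `current` already points at.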

Reuse over reinvention: ObjectStore.DownloadObject (same interface bumblebee catalog staging uses), hashutil.EqualSHA256 for the digest check, the agent release ed25519 trust root (releaseVerificationKey/decodeReleaseSignature) for signature verification, and resolveReleaseRuntimeRoot for the staging root. Activation lives in package agent (where those live) and hands the manager a resolved Spec.BinaryPath, keeping the add-on supervisor a pure process manager.

Proto: AddonAssignmentConfig gains artifact_object_key/artifact_sha256/artifact_signature/target_os/target_arch (regenerated monitoring.pb.go + hand-maintained monitoring.pb.ex). Unsigned artifacts are allowed through with a warning until the build/signing pipeline (add-native-addon-build-signing) makes signing mandatory.
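
The signature policy described here (unsigned passes with a warning, a supplied signature must verify) might look like the sketch below. It assumes the signature covers the raw artifact bytes, which the commit message does not state; `verifySignature` and both error values are hypothetical names, not the agent's releaseVerificationKey/decodeReleaseSignature helpers.

```go
package main

import (
	"crypto/ed25519"
	"errors"
	"fmt"
)

// Hypothetical sentinels for this sketch.
var (
	errNoVerificationKey = errors.New("signature supplied but no verification key configured")
	errBadSignature      = errors.New("ed25519 signature verification failed")
)

// verifySignature reports whether the artifact was signed and verified.
// An empty signature is allowed (the caller is expected to log a warning);
// a supplied signature fails closed if it cannot be checked.
func verifySignature(pub ed25519.PublicKey, data, sig []byte) (bool, error) {
	if len(sig) == 0 {
		return false, nil // unsigned artifact: permitted for now
	}
	if len(pub) != ed25519.PublicKeySize {
		return false, errNoVerificationKey // never skip a supplied signature
	}
	if !ed25519.Verify(pub, data, sig) {
		return false, errBadSignature
	}
	return true, nil
}

// roundTrip signs a payload with a throwaway key and verifies both the
// original and a tampered copy.
func roundTrip() (validErr, tamperedErr error) {
	pub, priv, err := ed25519.GenerateKey(nil) // nil selects crypto/rand
	if err != nil {
		return err, err
	}
	data := []byte("addon bytes")
	sig := ed25519.Sign(priv, data)
	_, validErr = verifySignature(pub, data, sig)
	_, tamperedErr = verifySignature(pub, []byte("addon bytez"), sig)
	return validErr, tamperedErr
}

func main() {
	validErr, tamperedErr := roundTrip()
	fmt.Println("valid:", validErr, "tampered:", tamperedErr)
}
```

Returning the `signed` flag separately from the error lets the caller distinguish "unsigned but accepted" from "signed and verified" when deciding whether to warn.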

Tests: stage success (executable + versioned copy + current symlink), sha256 mismatch, nil store, incomplete reference, ed25519 verify (valid + tampered). go build/vet/test + golangci-lint (0 issues) + gofmt green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
feat(core): emit per-arch add-on artifact reference for pushed-artifact delivery (#3425)
Some checks failed
lint / lint (push) Successful in 1m40s
Golang Tests / test-go (push) Failing after 2m6s
3475a31f0a
Completes the control-plane half of pushed_artifact delivery. The generator now resolves the target agent's os/arch from its registry metadata, selects the matching artifact from the AddonPackage artifacts map (keyed by os/arch), and emits artifact_object_key/artifact_sha256/artifact_signature/target_os/target_arch into AddonAssignmentConfig. The reference joins the config version hash (via the existing stable_addon_assignment passthrough), so publishing or rotating an artifact re-versions and reaches polling agents.
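
To illustrate why including the reference in the version hash makes a rotated artifact reach polling agents, here is a small Go sketch of the idea. The real generator is Elixir and its hashing scheme is not shown in this PR; the `assignment` struct, `configVersion`, and the JSON-plus-SHA-256 encoding are assumptions made only for this example.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"sort"
)

// assignment holds a hypothetical subset of the fields that feed the hash.
type assignment struct {
	AddonID           string `json:"addon_id"`
	Version           string `json:"version"`
	ArtifactObjectKey string `json:"artifact_object_key"`
	ArtifactSHA256    string `json:"artifact_sha256"`
	ArtifactSignature string `json:"artifact_signature"`
}

// configVersion hashes the assignments in a stable order, so the result
// changes only when a field (including the artifact reference) changes.
func configVersion(assignments []assignment) (string, error) {
	sorted := append([]assignment(nil), assignments...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].AddonID < sorted[j].AddonID })
	raw, err := json.Marshal(sorted)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(raw)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	before := []assignment{{AddonID: "demo", Version: "1.0.0", ArtifactSHA256: "aa"}}
	after := []assignment{{AddonID: "demo", Version: "1.0.0", ArtifactSHA256: "bb"}}
	v1, _ := configVersion(before)
	v2, _ := configVersion(after)
	fmt.Println("re-versioned after rotation:", v1 != v2)
}
```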

The platform lookup is skipped on the common no-add-on path. When no artifact matches the agent arch (e.g. before the signing pipeline populates artifacts, or an unsupported arch), the reference is empty and the agent falls back to binary_path.
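
The per-arch selection and the empty-reference fallback can be pictured with the Go sketch below. The generator itself is written in Elixir, so this only mirrors the logic; the `os/arch` key format ("linux/amd64"), the `artifactRef` type, and `selectArtifact` are assumptions for illustration.

```go
package main

import "fmt"

// artifactRef mirrors the reference fields named in this PR.
type artifactRef struct {
	ObjectKey  string
	SHA256     string
	Signature  string
	TargetOS   string
	TargetArch string
}

// selectArtifact picks the artifact for the agent's platform. A miss
// returns an empty reference, in which case the agent is expected to
// fall back to binary_path.
func selectArtifact(artifacts map[string]artifactRef, goos, arch string) (artifactRef, bool) {
	if goos == "" || arch == "" {
		return artifactRef{}, false // platform unknown: nothing to select
	}
	ref, ok := artifacts[goos+"/"+arch]
	if !ok {
		return artifactRef{}, false
	}
	ref.TargetOS, ref.TargetArch = goos, arch
	return ref, true
}

func main() {
	artifacts := map[string]artifactRef{
		"linux/amd64": {ObjectKey: "addons/demo/linux-amd64", SHA256: "abc"},
	}
	for _, arch := range []string{"amd64", "arm64"} {
		ref, ok := selectArtifact(artifacts, "linux", arch)
		fmt.Printf("linux/%s matched=%v key=%q\n", arch, ok, ref.ObjectKey)
	}
}
```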

Tests (agent_config_generator_test.exs, verified against the srql-fixtures CNPG cluster, 36 tests / 0 failures): a linux/amd64 agent receives the matching artifact reference; an arm64 agent with only an amd64 artifact gets an empty reference. mix compile --warnings-as-errors green; openspec validate --strict green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
feat(agent): last-known-good fallback for pushed-artifact add-ons (#3425)
All checks were successful
Golang Tests / test-go (push) Successful in 1m53s
lint / lint (push) Successful in 1m54s
9187875d82
Implements the resilience core of add-native-addon-delivery-models 4.1. The versioned staging directory doubles as the last-known-good cache: when a fresh pushed-artifact delivery fails (object store unreachable, sha256/signature mismatch), applyAddonAssignments now reuses the existing 'current' staged binary via lastKnownGoodAddonBinary instead of dropping the assignment from the desired set, so a transient failure no longer tears down a running add-on. With no prior staged version the assignment is skipped as before.

Because the last-good binary persists on disk, an agent reboot can relaunch the add-on from it even while the object store is unavailable. Remaining for 4.1: a local override file mirroring the agent config override.

Tests: lastKnownGoodAddonBinary returns nothing before staging, the staged binary after a success, and survives a subsequent failed delivery with bytes intact. go build/vet/test + golangci-lint (0 issues) + gofmt green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
feat(agent): local override file for add-on assignments (#3425)
Some checks failed
lint / lint (push) Successful in 1m50s
Golang Tests / test-go (push) Failing after 2m28s
Secret Scan / gitleaks (pull_request) Successful in 25s
lint / lint (pull_request) Successful in 1m36s
CI / build (pull_request) Failing after 19m27s
Elixir Quality / Elixir Quality (pull_request) Has been cancelled
2ec8f29c62
Completes add-native-addon-delivery-models 4.1. Adds an operator-managed local override read from addons.local.json in the agent config dir: entries take precedence over a pushed assignment with the same addon_id and local-only entries are appended (file order preserved). This is the break-glass / dev companion to the last-known-good fallback — an operator can pin a dev build or force-enable an add-on locally, independent of the control plane.

A missing file is a no-op; a malformed file is ignored (returns the pushed assignments unchanged + a logged warning) so a bad local edit can never break pushed delivery. applyAddonAssignments merges the override before reconciling.
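
A rough sketch of the load-and-merge behaviour described in this commit (whole-entry replacement, local-only entries appended, missing or malformed file leaves the pushed set untouched). The file schema, a JSON array of objects keyed by `addon_id`, and every type and function name here are assumptions; the PR does not show the actual format.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io/fs"
	"log"
	"os"
)

// assignment is a hypothetical, trimmed-down add-on assignment.
type assignment struct {
	AddonID    string
	BinaryPath string
	Enabled    bool
}

// overrideEntry is the assumed on-disk shape of one override.
type overrideEntry struct {
	AddonID    string `json:"addon_id"`
	BinaryPath string `json:"binary_path"`
	Enabled    *bool  `json:"enabled"` // nil means enabled by default
}

// mergeOverride applies the raw override file contents to pushed.
// Malformed input is ignored so a bad edit cannot break pushed delivery.
func mergeOverride(pushed []assignment, raw []byte) []assignment {
	var entries []overrideEntry
	if err := json.Unmarshal(raw, &entries); err != nil {
		log.Printf("ignoring malformed add-on override: %v", err)
		return pushed
	}
	out := append([]assignment(nil), pushed...)
	index := make(map[string]int, len(out))
	for i, a := range out {
		index[a.AddonID] = i
	}
	for _, e := range entries {
		if e.AddonID == "" {
			continue
		}
		a := assignment{AddonID: e.AddonID, BinaryPath: e.BinaryPath, Enabled: e.Enabled == nil || *e.Enabled}
		if i, ok := index[a.AddonID]; ok {
			out[i] = a // local entry wins over the pushed one
			continue
		}
		index[a.AddonID] = len(out)
		out = append(out, a) // local-only entry, file order preserved
	}
	return out
}

// applyOverrideFile treats a missing file as a no-op.
func applyOverrideFile(path string, pushed []assignment) []assignment {
	raw, err := os.ReadFile(path)
	if errors.Is(err, fs.ErrNotExist) {
		return pushed
	}
	if err != nil {
		log.Printf("ignoring unreadable add-on override: %v", err)
		return pushed
	}
	return mergeOverride(pushed, raw)
}

func main() {
	pushed := []assignment{{AddonID: "a", BinaryPath: "/pushed/a", Enabled: true}}
	raw := []byte(`[{"addon_id":"a","binary_path":"/dev/a"},{"addon_id":"b","binary_path":"/dev/b"}]`)
	fmt.Printf("%+v\n", mergeOverride(pushed, raw))
	fmt.Printf("%+v\n", applyOverrideFile("/nonexistent-dir/addons.local.json", pushed))
}
```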

Tests: missing-file no-op, replace-existing + append-local-only (enabled-by-default), malformed-returns-pushed. go build/vet/test + golangci-lint (0 issues) + gofmt green; openspec validate --strict green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
fix(addons): harden pushed-artifact delivery from adversarial review (#3425)
Some checks failed
Secret Scan / gitleaks (pull_request) Successful in 33s
lint / lint (push) Successful in 2m14s
lint / lint (pull_request) Successful in 2m36s
Golang Tests / test-go (push) Successful in 3m7s
CI / build (pull_request) Failing after 18m54s
Elixir Quality / Elixir Quality (pull_request) Failing after 32m13s
8fd085d023
Addresses the confirmed findings from a multi-agent review of this branch:

P1 path traversal: addon_id and version arrive from the control plane and become path segments under the staging root. They are now validated as safe single segments (no separators, no . or ..) via safeAddonSegment/ErrAddonUnsafePath BEFORE any fetch or filesystem write, in both stageAddonArtifact and lastKnownGoodAddonBinary, so a crafted addon_id=../../etc or version=.. can no longer escape the staging root.
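
A single-segment check of this kind might look like the sketch below. The names `safeSegment`, `errUnsafePath`, and `versionDir` are stand-ins, not the signatures of safeAddonSegment/ErrAddonUnsafePath, and rejecting the empty string is an extra assumption beyond what the commit message lists.

```go
package main

import (
	"errors"
	"fmt"
	"path/filepath"
	"strings"
)

// errUnsafePath is a hypothetical sentinel for this sketch.
var errUnsafePath = errors.New("unsafe add-on path segment")

// safeSegment accepts only a plain single path segment: no separators,
// and neither "." nor "..". Empty input is also rejected (assumption).
func safeSegment(s string) error {
	if s == "" || s == "." || s == ".." || strings.ContainsAny(s, `/\`) {
		return errUnsafePath
	}
	return nil
}

// versionDir validates both control-plane-supplied values before any
// path is built, so nothing is fetched or written for a crafted input.
func versionDir(root, id, version string) (string, error) {
	for _, seg := range []string{id, version} {
		if err := safeSegment(seg); err != nil {
			return "", fmt.Errorf("%w: %q", err, seg)
		}
	}
	return filepath.Join(root, "addons", id, "versions", version), nil
}

func main() {
	for _, id := range []string{"demo", "../../etc", ".."} {
		dir, err := versionDir("/var/lib/agent", id, "1.2.3")
		fmt.Printf("id=%q dir=%q err=%v\n", id, dir, err)
	}
}
```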

P1 last-known-good correctness: on a delivery/verification failure the agent previously paired the OLD staged binary with the NEW assignment config/args. It now caches the last fully-verified spec per add-on (rememberAddonSpec/lastGoodAddonSpec) and reuses it verbatim on fallback, so a transient failure keeps the add-on running exactly as before instead of reconfiguring an old binary with incompatible config.
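
The "reuse the last verified spec verbatim" rule can be sketched as a small per-add-on cache. The `specCache` type and its methods are hypothetical; they only illustrate the idea behind rememberAddonSpec/lastGoodAddonSpec, whose real signatures are not shown in this PR.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errStoreDown stands in for any delivery or verification failure.
var errStoreDown = errors.New("object store unreachable")

// addonSpec is a hypothetical, trimmed-down launch spec.
type addonSpec struct {
	AddonID    string
	Version    string
	BinaryPath string
	Args       []string
}

// specCache remembers the last fully verified spec per add-on.
type specCache struct {
	mu    sync.Mutex
	specs map[string]addonSpec
}

func (c *specCache) remember(s addonSpec) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.specs == nil {
		c.specs = make(map[string]addonSpec)
	}
	s.Args = append([]string(nil), s.Args...) // private copy, kept verbatim
	c.specs[s.AddonID] = s
}

func (c *specCache) lastGood(id string) (addonSpec, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	s, ok := c.specs[id]
	return s, ok
}

// resolve runs a fresh delivery; on failure it returns the cached spec
// unchanged rather than mixing the old binary with the new config.
func (c *specCache) resolve(id string, stage func() (addonSpec, error)) (addonSpec, bool) {
	spec, err := stage()
	if err == nil {
		c.remember(spec)
		return spec, true
	}
	return c.lastGood(id) // false when nothing was ever verified: skip
}

func main() {
	cache := &specCache{}
	good := addonSpec{AddonID: "demo", Version: "1.0.0", BinaryPath: "/staged/1.0.0/demo", Args: []string{"--v1"}}
	cache.resolve("demo", func() (addonSpec, error) { return good, nil })
	spec, ok := cache.resolve("demo", func() (addonSpec, error) { return addonSpec{}, errStoreDown })
	fmt.Println(spec.Version, spec.Args, ok)
}
```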

P2 incomplete artifact reference: the generator now emits an artifact reference only when both object_key and sha256 are present, so the agent is never told to fetch something it cannot verify.

Tests: path-traversal rejection (5 cases, asserts nothing created under root), fail-closed signature when a signature is supplied but no verification key is configured, local override enabled=false disables a pushed add-on, last-known-good spec cache, and the generator incomplete-artifact case. go build/vet/test + golangci-lint (0 issues) + gofmt green; generator suite 37 tests / 0 failures on srql-fixtures.

Reviewed-and-dismissed as false positives: the resolve_agent_platform nil-metadata guard (is_map/2 already routes nil to {nil,nil}), proto field numbering (per-message, no collision), reuse of the release ed25519 trust root (intentional and fails closed), and the symlink TOCTOU (atomic rename).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
fix(addons): address xhigh code-review findings on delivery (#3425)
Some checks failed
Secret Scan / gitleaks (pull_request) Successful in 35s
lint / lint (pull_request) Successful in 1m35s
Golang Tests / test-go (push) Successful in 1m59s
lint / lint (push) Successful in 2m0s
CI / build (pull_request) Failing after 4m12s
Elixir Quality / Elixir Quality (pull_request) Failing after 28m31s
6d3d504e87
Second review pass (xhigh recall) on the pushed-artifact delivery branch surfaced real issues beyond the first review; fixed:

P1 binary_path basename traversal: addonBinaryName accepted a basename of '..', which filepath.Join(versionDir, '..') would resolve outside the version dir. The derived name is now validated with safeAddonSegment and falls back to the synthesized name when unsafe.
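
The basename issue is easy to reproduce: `filepath.Base` does not clean its input, so a path ending in `..` yields `..`. The sketch below shows one way to validate the derived name and fall back. The synthesized name format (`addon-<id>`) and the boolean `safeSegment` helper are assumptions for illustration, and `addonID` is presumed to have been validated already.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// safeSegment reports whether s is a plain single path segment.
func safeSegment(s string) bool {
	return s != "" && s != "." && s != ".." && !strings.ContainsAny(s, `/\`)
}

// addonBinaryName derives the staged file name from binary_path and
// falls back to a synthesized name when the basename is unsafe.
func addonBinaryName(binaryPath, addonID string) string {
	base := filepath.Base(binaryPath) // "" -> ".", "/opt/x/.." -> ".."
	if !safeSegment(base) {
		return "addon-" + addonID // hypothetical synthesized name
	}
	return base
}

func main() {
	for _, p := range []string{"/usr/local/bin/demo-addon", "/opt/x/..", "", "/"} {
		fmt.Printf("%q -> %q\n", p, addonBinaryName(p, "demo"))
	}
}
```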

P1 cache-miss fallback paired old binary with new config: the lastKnownGoodAddonBinary path reused the on-disk binary but built the spec from the NEW assignment (new version/args/config, and a re-derived filename that could miss). Removed that brittle path entirely — on a delivery failure the agent reuses the cached fully-verified spec verbatim, or skips when there is no cached spec (a running add-on always has one). Single, consistent fallback semantic.

P2 local override blanked omitted fields: the override now patches only the fields it specifies onto the matching pushed assignment (config/capabilities/etc. inherited) instead of replacing the whole assignment with zeros.
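
Field-level patching is commonly done in Go with optional (pointer or nil-able) fields, so that "omitted" and "set to the zero value" stay distinguishable. The sketch below follows that pattern; the struct shapes, JSON keys, and function names are assumptions rather than the PR's actual types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// assignment is a hypothetical pushed assignment.
type assignment struct {
	AddonID      string
	Version      string
	Enabled      bool
	Config       map[string]string
	Capabilities []string
}

// override carries only optional fields; nil means "not specified".
type override struct {
	AddonID string            `json:"addon_id"`
	Version *string           `json:"version"`
	Enabled *bool             `json:"enabled"`
	Config  map[string]string `json:"config"`
}

// parseOverride decodes one override entry from JSON.
func parseOverride(raw []byte) (override, error) {
	var o override
	err := json.Unmarshal(raw, &o)
	return o, err
}

// patch copies only the fields the override specifies onto base;
// everything else (config, capabilities, ...) is inherited.
func patch(base assignment, o override) assignment {
	if o.Version != nil {
		base.Version = *o.Version
	}
	if o.Enabled != nil {
		base.Enabled = *o.Enabled
	}
	if o.Config != nil {
		base.Config = o.Config
	}
	return base
}

func main() {
	pushed := assignment{
		AddonID:      "demo",
		Version:      "1.0.0",
		Enabled:      true,
		Config:       map[string]string{"listen": ":9000"},
		Capabilities: []string{"net_raw"},
	}
	o, err := parseOverride([]byte(`{"addon_id":"demo","enabled":false}`))
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", patch(pushed, o))
}
```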

P2 cache eviction + empty-configDir guard: the last-good cache is pruned to currently-assigned ids so a removed/re-added add-on cannot reuse a stale spec; the local override is skipped when no config dir is known (avoids a relative-path read).
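
Both guards are small. As an illustration only (the cache is modelled as a plain map of id to an opaque value, and `pruneCache` and `overridePath` are hypothetical names), they could be written as:

```go
package main

import (
	"fmt"
	"path/filepath"
)

// pruneCache drops cached entries for add-ons that are no longer
// assigned, so a removed and re-added add-on cannot pick up a stale spec.
func pruneCache(cache map[string]string, assigned []string) {
	keep := make(map[string]struct{}, len(assigned))
	for _, id := range assigned {
		keep[id] = struct{}{}
	}
	for id := range cache {
		if _, ok := keep[id]; !ok {
			delete(cache, id) // deleting during range is safe in Go
		}
	}
}

// overridePath returns the override file location, or false when the
// config dir is unknown, which avoids reading a relative path.
func overridePath(configDir string) (string, bool) {
	if configDir == "" {
		return "", false
	}
	return filepath.Join(configDir, "addons.local.json"), true
}

func main() {
	cache := map[string]string{"a": "spec-a", "b": "spec-b"}
	pruneCache(cache, []string{"b"})
	fmt.Println(cache)

	path, ok := overridePath("")
	fmt.Printf("%q %v\n", path, ok)
}
```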

Dismissed as false positives (with reasoning): the order-of-validation claim (validation IS before download), the resolve_agent_platform nil guard (is_map/2 routes nil to {nil,nil}), and the reuse of the release ed25519 trust root (intentional, fails closed).

Tests: binary_path traversal fallback, addonBinaryName unsafe-base rejection, override merge preserves pushed config/capabilities, override enabled=false disables-but-preserves, cache prune. go build/vet/test + golangci-lint (0 issues) + gofmt green; openspec validate --strict green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
feat(agent): explicit per-model add-on supervision dispatch (#3425)
Some checks failed
lint / lint (push) Successful in 1m17s
Secret Scan / gitleaks (pull_request) Successful in 1m1s
Golang Tests / test-go (push) Successful in 1m50s
lint / lint (pull_request) Successful in 1m51s
CI / build (pull_request) Failing after 4m19s
Elixir Quality / Elixir Quality (pull_request) Failing after 23m23s
d8b95f1431
Completes the delivery-dispatch task (add-native-addon-delivery-models 2.1) and the config-toggle half of 3.2. classifyAddonSupervision routes each supervision model explicitly instead of treating everything other than agent_sidecar as unsupported:

- agent_sidecar: stage (pushed_artifact) or use the on-host binary (os_package) and run as a supervised go-plugin (unchanged).

- config_toggle: acknowledged as a compiled-in capability that self-configures from agent config; no subprocess is launched and it is no longer mislabeled unsupported. This is the model the remote-access reference consumer uses.

- systemd_service / systemd_timer / ephemeral_helper: recognized and logged as not-yet-implemented (their supervision is tasks 3.1/3.2) rather than a generic unsupported.

- unknown models: reported unsupported.
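
The routing listed above amounts to an explicit switch over the model name. In the sketch below the model strings are taken from this commit message, while the `action` type, its constants, and the function name `classifySupervision` are hypothetical and do not reflect the real return type of classifyAddonSupervision.

```go
package main

import "fmt"

// action is a hypothetical outcome type for this sketch.
type action int

const (
	actionRunSidecar     action = iota // launch as a supervised go-plugin
	actionBuiltIn                      // compiled-in capability, no subprocess
	actionNotImplemented               // recognized, supervision still pending
	actionUnsupported                  // unknown model
)

func (a action) String() string {
	switch a {
	case actionRunSidecar:
		return "run-sidecar"
	case actionBuiltIn:
		return "built-in"
	case actionNotImplemented:
		return "not-yet-implemented"
	default:
		return "unsupported"
	}
}

// classifySupervision routes each supervision model explicitly instead
// of lumping everything other than agent_sidecar together.
func classifySupervision(model string) action {
	switch model {
	case "agent_sidecar":
		return actionRunSidecar
	case "config_toggle":
		return actionBuiltIn
	case "systemd_service", "systemd_timer", "ephemeral_helper":
		return actionNotImplemented
	default:
		return actionUnsupported
	}
}

func main() {
	for _, m := range []string{"agent_sidecar", "config_toggle", "systemd_timer", "mystery"} {
		fmt.Printf("%s -> %s\n", m, classifySupervision(m))
	}
}
```

Keeping "not yet implemented" separate from "unsupported" lets logs distinguish a planned model from a typo or an unknown value.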

Test: classifyAddonSupervision covers every model + an unknown. go build/vet/test + golangci-lint (0 issues, incl. exhaustive) + gofmt green; openspec validate --strict green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Reference
carverauto/serviceradar!3451