feat: auto-updating agents #3094

Open
mfreeman451 wants to merge 5 commits from refs/pull/3094/head into staging
mfreeman451 commented 2026-03-27 23:52:34 +00:00 (Migrated from github.com)
Owner

Imported from GitHub pull request.

Original GitHub pull request: #3096
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/pull/3096
Original created: 2026-03-27T23:52:34Z
Original updated: 2026-03-28T05:02:55Z
Original head: carverauto/serviceradar:2406-feat-agent-fleet-management-secure-self-update-system
Original base: staging

IMPORTANT: Please sign the Developer Certificate of Origin

Thank you for your contribution to ServiceRadar. Please note, when contributing, the developer must include
a DCO sign-off statement indicating the DCO acceptance in one commit message. Here
is an example DCO Signed-off-by line in a commit message:

Signed-off-by: J. Doe <j.doe@domain.com>

Describe your changes

Code checklist before requesting a review

  • I have signed the DCO?
  • The build completes without errors?
  • All tests are passing when running make test?
Imported from GitHub pull request. Original GitHub pull request: #3096 Original author: @mfreeman451 Original URL: https://github.com/carverauto/serviceradar/pull/3096 Original created: 2026-03-27T23:52:34Z Original updated: 2026-03-28T05:02:55Z Original head: carverauto/serviceradar:2406-feat-agent-fleet-management-secure-self-update-system Original base: staging --- ## IMPORTANT: Please sign the Developer Certificate of Origin Thank you for your contribution to ServiceRadar. Please note, when contributing, the developer must include a [DCO sign-off statement]( https://developercertificate.org/) indicating the DCO acceptance in one commit message. Here is an example DCO Signed-off-by line in a commit message: ``` Signed-off-by: J. Doe <j.doe@domain.com> ``` ## Describe your changes ## Issue ticket number and link ## Code checklist before requesting a review - [ ] I have signed the DCO? - [ ] The build completes without errors? - [ ] All tests are passing when running make test?
Copilot commented 2026-03-27 23:59:24 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR review comment.

Original author: @Copilot
Original URL: https://github.com/carverauto/serviceradar/pull/3096#discussion_r3003717907
Original created: 2026-03-27T23:59:24Z
Original path: elixir/serviceradar_core/lib/serviceradar/edge/agent_release_manager.ex
Original line: 330

create_targets/4 raises on target creation failure. In a long-lived service this can crash the caller process and makes it harder to handle expected errors (e.g., constraint violations). Prefer returning {:error, reason} from create_rollout/2 so the API can respond cleanly and callers can decide how to retry/repair.

Imported GitHub PR review comment. Original author: @Copilot Original URL: https://github.com/carverauto/serviceradar/pull/3096#discussion_r3003717907 Original created: 2026-03-27T23:59:24Z Original path: elixir/serviceradar_core/lib/serviceradar/edge/agent_release_manager.ex Original line: 330 --- `create_targets/4` raises on target creation failure. In a long-lived service this can crash the caller process and makes it harder to handle expected errors (e.g., constraint violations). Prefer returning `{:error, reason}` from `create_rollout/2` so the API can respond cleanly and callers can decide how to retry/repair.
Copilot commented 2026-03-27 23:59:24 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR review comment.

Original author: @Copilot
Original URL: https://github.com/carverauto/serviceradar/pull/3096#discussion_r3003717925
Original created: 2026-03-27T23:59:24Z
Original path: elixir/serviceradar_core/lib/serviceradar/edge.ex
Original line: 62

This file switches the Ash.Domain DSL to parenthesized calls (resource(...), require_actor?(...), authorize(...)) while other domain modules (e.g. ServiceRadar.Infrastructure) use the no-parentheses DSL style. Consider keeping the domain DSL consistent across the project to reduce churn in future diffs.

Imported GitHub PR review comment. Original author: @Copilot Original URL: https://github.com/carverauto/serviceradar/pull/3096#discussion_r3003717925 Original created: 2026-03-27T23:59:24Z Original path: elixir/serviceradar_core/lib/serviceradar/edge.ex Original line: 62 --- This file switches the Ash.Domain DSL to parenthesized calls (`resource(...)`, `require_actor?(...)`, `authorize(...)`) while other domain modules (e.g. `ServiceRadar.Infrastructure`) use the no-parentheses DSL style. Consider keeping the domain DSL consistent across the project to reduce churn in future diffs.
Copilot commented 2026-03-27 23:59:24 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR review comment.

Original author: @Copilot
Original URL: https://github.com/carverauto/serviceradar/pull/3096#discussion_r3003717933
Original created: 2026-03-27T23:59:24Z
Original path: go/pkg/agent/release_update.go
Original line: 530

The design/spec in this PR states artifacts are downloaded over HTTPS, but downloadReleaseArtifact accepts any URL and will happily fetch over plain HTTP. Even with signature+digest verification, enforcing https (or an explicit allowlist for test/dev) avoids accidental insecure configurations and aligns code with the stated requirement.

Imported GitHub PR review comment. Original author: @Copilot Original URL: https://github.com/carverauto/serviceradar/pull/3096#discussion_r3003717933 Original created: 2026-03-27T23:59:24Z Original path: go/pkg/agent/release_update.go Original line: 530 --- The design/spec in this PR states artifacts are downloaded over HTTPS, but `downloadReleaseArtifact` accepts any URL and will happily fetch over plain HTTP. Even with signature+digest verification, enforcing `https` (or an explicit allowlist for test/dev) avoids accidental insecure configurations and aligns code with the stated requirement.
Copilot commented 2026-03-27 23:59:25 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR review comment.

Original author: @Copilot
Original URL: https://github.com/carverauto/serviceradar/pull/3096#discussion_r3003717940
Original created: 2026-03-27T23:59:25Z
Original path: elixir/serviceradar_core/lib/serviceradar/edge/agent_release_manager.ex
Original line: 277

batch_delay_seconds is stored on the rollout but is not used anywhere in dispatch decisions (maybe_dispatch_rollout/2 ignores it). As a result, staged rollouts will not honor inter-batch delays as modeled in the schema/spec. Either implement delay gating using last_dispatch_at + batch_delay_seconds, or remove the field until it’s supported to avoid misleading operators.

Imported GitHub PR review comment. Original author: @Copilot Original URL: https://github.com/carverauto/serviceradar/pull/3096#discussion_r3003717940 Original created: 2026-03-27T23:59:25Z Original path: elixir/serviceradar_core/lib/serviceradar/edge/agent_release_manager.ex Original line: 277 --- `batch_delay_seconds` is stored on the rollout but is not used anywhere in dispatch decisions (`maybe_dispatch_rollout/2` ignores it). As a result, staged rollouts will not honor inter-batch delays as modeled in the schema/spec. Either implement delay gating using `last_dispatch_at` + `batch_delay_seconds`, or remove the field until it’s supported to avoid misleading operators.
Copilot commented 2026-03-27 23:59:25 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR review comment.

Original author: @Copilot
Original URL: https://github.com/carverauto/serviceradar/pull/3096#discussion_r3003717942
Original created: 2026-03-27T23:59:25Z
Original path: elixir/serviceradar_core/lib/serviceradar/edge/agent_release.ex
Original line: 55

publish currently persists whatever manifest/signature values are provided without validating that the manifest is complete and that the signature is actually valid for the manifest content. Given the requirements in this PR, consider adding validations (e.g., required manifest keys/artifacts, signature verification) so the control plane rejects unsigned/incomplete releases at publish time.

Imported GitHub PR review comment. Original author: @Copilot Original URL: https://github.com/carverauto/serviceradar/pull/3096#discussion_r3003717942 Original created: 2026-03-27T23:59:25Z Original path: elixir/serviceradar_core/lib/serviceradar/edge/agent_release.ex Original line: 55 --- `publish` currently persists whatever `manifest`/`signature` values are provided without validating that the manifest is complete and that the signature is actually valid for the manifest content. Given the requirements in this PR, consider adding validations (e.g., required manifest keys/artifacts, signature verification) so the control plane rejects unsigned/incomplete releases at publish time.
Copilot commented 2026-03-27 23:59:25 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR review comment.

Original author: @Copilot
Original URL: https://github.com/carverauto/serviceradar/pull/3096#discussion_r3003717953
Original created: 2026-03-27T23:59:25Z
Original path: elixir/serviceradar_core/lib/serviceradar/infrastructure/agent.ex
Original line: 217

This change rewrites a large chunk of the Ash DSL to use function-call syntax (e.g., table("..."), type("..."), get?(:by_uid)), while most resources/domains in this repo use the no-parentheses DSL style. This makes the file stylistically inconsistent and harder to diff going forward; consider reverting to the prevailing DSL style or running a formatter consistently across the whole codebase.

Imported GitHub PR review comment. Original author: @Copilot Original URL: https://github.com/carverauto/serviceradar/pull/3096#discussion_r3003717953 Original created: 2026-03-27T23:59:25Z Original path: elixir/serviceradar_core/lib/serviceradar/infrastructure/agent.ex Original line: 217 --- This change rewrites a large chunk of the Ash DSL to use function-call syntax (e.g., `table("...")`, `type("...")`, `get?(:by_uid)`), while most resources/domains in this repo use the no-parentheses DSL style. This makes the file stylistically inconsistent and harder to diff going forward; consider reverting to the prevailing DSL style or running a formatter consistently across the whole codebase.
Copilot commented 2026-03-27 23:59:25 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR review comment.

Original author: @Copilot
Original URL: https://github.com/carverauto/serviceradar/pull/3096#discussion_r3003717962
Original created: 2026-03-27T23:59:25Z
Original path: go/pkg/agent/release_update.go
Original line: 140

The artifact URL/SHA256 used for download+hash verification is taken from the unsigned command payload, but the signed manifest is never cross-checked to ensure it authorizes that exact artifact (url/sha256/os/arch). This means an attacker who can tamper with the command payload could supply a different artifact that still passes signature verification (signature covers only payload.Manifest). Bind verification to the manifest by selecting the artifact from the signed manifest (or at least validating payload.Artifact matches a manifest entry) and use the manifest-provided digest/URL for verification.

Imported GitHub PR review comment. Original author: @Copilot Original URL: https://github.com/carverauto/serviceradar/pull/3096#discussion_r3003717962 Original created: 2026-03-27T23:59:25Z Original path: go/pkg/agent/release_update.go Original line: 140 --- The artifact URL/SHA256 used for download+hash verification is taken from the unsigned command payload, but the signed manifest is never cross-checked to ensure it authorizes that exact artifact (url/sha256/os/arch). This means an attacker who can tamper with the command payload could supply a different artifact that still passes signature verification (signature covers only `payload.Manifest`). Bind verification to the manifest by selecting the artifact from the signed manifest (or at least validating `payload.Artifact` matches a manifest entry) and use the manifest-provided digest/URL for verification.
Copilot commented 2026-03-27 23:59:25 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR review comment.

Original author: @Copilot
Original URL: https://github.com/carverauto/serviceradar/pull/3096#discussion_r3003717966
Original created: 2026-03-27T23:59:25Z
Original path: go/pkg/agent/release_update.go
Original line: 428

extractReleaseArchive limits the compressed download size, but does not cap the total extracted/uncompressed bytes. A crafted tar.gz (decompression bomb) could expand to very large output and fill disk under the runtime root. Track and enforce a maximum total extracted size (and optionally per-file limits) during extraction, aborting when exceeded.

Imported GitHub PR review comment. Original author: @Copilot Original URL: https://github.com/carverauto/serviceradar/pull/3096#discussion_r3003717966 Original created: 2026-03-27T23:59:25Z Original path: go/pkg/agent/release_update.go Original line: 428 --- `extractReleaseArchive` limits the *compressed* download size, but does not cap the total extracted/uncompressed bytes. A crafted tar.gz (decompression bomb) could expand to very large output and fill disk under the runtime root. Track and enforce a maximum total extracted size (and optionally per-file limits) during extraction, aborting when exceeded.
Copilot commented 2026-03-27 23:59:25 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR review comment.

Original author: @Copilot
Original URL: https://github.com/carverauto/serviceradar/pull/3096#discussion_r3003717972
Original created: 2026-03-27T23:59:25Z
Original path: go/pkg/agent/release_update.go
Original line: 361

inferReleaseArtifactFormat returns arbitrary artifact.Format values, but stageReleasePayload treats anything other than tar.gz as a raw executable binary. This will silently mishandle unsupported formats (e.g. zip) by writing the archive bytes as the entrypoint. Consider rejecting unknown formats explicitly (return a clear error) instead of defaulting to the binary path.

Imported GitHub PR review comment. Original author: @Copilot Original URL: https://github.com/carverauto/serviceradar/pull/3096#discussion_r3003717972 Original created: 2026-03-27T23:59:25Z Original path: go/pkg/agent/release_update.go Original line: 361 --- `inferReleaseArtifactFormat` returns arbitrary `artifact.Format` values, but `stageReleasePayload` treats anything other than `tar.gz` as a raw executable binary. This will silently mishandle unsupported formats (e.g. `zip`) by writing the archive bytes as the entrypoint. Consider rejecting unknown formats explicitly (return a clear error) instead of defaulting to the binary path.
Copilot commented 2026-03-27 23:59:26 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR review comment.

Original author: @Copilot
Original URL: https://github.com/carverauto/serviceradar/pull/3096#discussion_r3003717976
Original created: 2026-03-27T23:59:26Z
Original path: go/pkg/agent/control_stream.go
Original line: 545

This handler always reports a successful result with status: staged and never produces a terminal healthy / rolled_back result. With the current control-plane logic, rollouts will remain active (and count staged targets as inflight) indefinitely. Either trigger the activation/updater workflow and send a final terminal result, or (if staging is the terminal behavior for now) update the control-plane semantics so staged completes a target/rollout.

Imported GitHub PR review comment. Original author: @Copilot Original URL: https://github.com/carverauto/serviceradar/pull/3096#discussion_r3003717976 Original created: 2026-03-27T23:59:26Z Original path: go/pkg/agent/control_stream.go Original line: 545 --- This handler always reports a successful result with `status: staged` and never produces a terminal `healthy` / `rolled_back` result. With the current control-plane logic, rollouts will remain `active` (and count staged targets as inflight) indefinitely. Either trigger the activation/updater workflow and send a final terminal result, or (if staging is the terminal behavior for now) update the control-plane semantics so `staged` completes a target/rollout.
Copilot commented 2026-03-27 23:59:26 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR review comment.

Original author: @Copilot
Original URL: https://github.com/carverauto/serviceradar/pull/3096#discussion_r3003717980
Original created: 2026-03-27T23:59:26Z
Original path: elixir/serviceradar_core/lib/serviceradar/edge/agent_release_manager.ex
Original line: 74

create_rollout/2 allows agent_ids to be omitted or normalize down to an empty list, which creates a rollout with no targets. That rollout will never complete (completion requires targets != []). Validate agent_ids is non-empty and return an error early instead of creating an empty rollout.

    with {:ok, release} <- load_release(attrs, actor) do
      agent_ids =
        normalize_agent_ids(Map.get(attrs, :agent_ids) || Map.get(attrs, "agent_ids"))

      case agent_ids do
        [] ->
          {:error, :no_targets_specified}

        _ ->
          with {:ok, rollout} <- create_rollout_record(release, attrs, agent_ids, actor),
               {:ok, _targets} <- create_targets(rollout, release, agent_ids, actor) do
            maybe_dispatch_rollout(rollout.id, actor: actor)
            {:ok, rollout}
          end
      end
Imported GitHub PR review comment. Original author: @Copilot Original URL: https://github.com/carverauto/serviceradar/pull/3096#discussion_r3003717980 Original created: 2026-03-27T23:59:26Z Original path: elixir/serviceradar_core/lib/serviceradar/edge/agent_release_manager.ex Original line: 74 --- `create_rollout/2` allows `agent_ids` to be omitted or normalize down to an empty list, which creates a rollout with no targets. That rollout will never complete (completion requires `targets != []`). Validate `agent_ids` is non-empty and return an error early instead of creating an empty rollout. ```suggestion with {:ok, release} <- load_release(attrs, actor) do agent_ids = normalize_agent_ids(Map.get(attrs, :agent_ids) || Map.get(attrs, "agent_ids")) case agent_ids do [] -> {:error, :no_targets_specified} _ -> with {:ok, rollout} <- create_rollout_record(release, attrs, agent_ids, actor), {:ok, _targets} <- create_targets(rollout, release, agent_ids, actor) do maybe_dispatch_rollout(rollout.id, actor: actor) {:ok, rollout} end end ```
Copilot commented 2026-03-27 23:59:26 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR review comment.

Original author: @Copilot
Original URL: https://github.com/carverauto/serviceradar/pull/3096#discussion_r3003717994
Original created: 2026-03-27T23:59:26Z
Original path: elixir/serviceradar_core/lib/serviceradar/edge/agent_release_manager.ex
Original line: 207

On a status: staged command result you set current_version to the staged version (often the desired version) and sync_agent_release_state/6 writes that into Agent.version. That will make inventory report the agent is already running the new version even though activation hasn't happened yet. Only update Agent.version on a terminal/activated state (e.g. healthy), and keep staged version in target metadata instead.

Imported GitHub PR review comment. Original author: @Copilot Original URL: https://github.com/carverauto/serviceradar/pull/3096#discussion_r3003717994 Original created: 2026-03-27T23:59:26Z Original path: elixir/serviceradar_core/lib/serviceradar/edge/agent_release_manager.ex Original line: 207 --- On a `status: staged` command result you set `current_version` to the staged version (often the desired version) and `sync_agent_release_state/6` writes that into `Agent.version`. That will make inventory report the agent is already running the new version even though activation hasn't happened yet. Only update `Agent.version` on a terminal/activated state (e.g. `healthy`), and keep staged version in target metadata instead.
This pull request is broken due to missing fork information.
View command line instructions

Checkout

From your project repository, check out a new branch and test the changes.
git fetch -u origin refs/pull/3094/head:refs/pull/3094/head
git switch refs/pull/3094/head

Merge

Merge the changes and update on Forgejo.

Warning: The "Autodetect manual merge" setting is not enabled for this repository, you will have to mark this pull request as manually merged afterwards.

git switch staging
git merge --no-ff refs/pull/3094/head
git switch refs/pull/3094/head
git rebase staging
git switch staging
git merge --ff-only refs/pull/3094/head
git switch refs/pull/3094/head
git rebase staging
git switch staging
git merge --no-ff refs/pull/3094/head
git switch staging
git merge --squash refs/pull/3094/head
git switch staging
git merge --ff-only refs/pull/3094/head
git switch staging
git merge refs/pull/3094/head
git push origin staging
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
carverauto/serviceradar!3094
No description provided.