feat(agent): gate netprobe lifecycle on its AddonAssignment (#3425) #3483

Merged
mfreeman451 merged 2 commits from feat/netprobe-assignment-cutover into staging 2026-06-01 02:59:39 +00:00

Summary

migrate-netprobe-to-native-addon §2.2 cutover. Moves netprobe off the always-on agent-launched visibility path onto the systemd-service add-on lifecycle, gated on its AddonAssignment — building on the attach mode from #3482.

Routing (push_loop_config.go)

applyVisibilityConfig now takes systemdManaged bool, derived from netprobeSystemdAssignmentPresent(configResp.GetAddons()) (an enabled netprobe systemd_service assignment):

  • systemd-managed → applyVisibilityConfigSystemd: switches the supervisor to attach mode (Mode()-gated Stop+StartAttach, so a previously agent-launched netprobe is stopped before systemd owns the socket, avoiding a double-run) and hands the config to the sidecar. It always returns true so the apply never aborts before applyAddonAssignments installs the unit later in the same cycle; a synchronous ApplyConfig that failed at this point would abort the apply before the unit exists, deadlocking the cutover.
  • not assigned → applyVisibilityConfigLaunched: the original launch path, byte-for-byte for un-assigned fleets (zero regression), plus a Stop + SetDesiredConfig(nil) revert if coming from attach.
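
The routing described above can be illustrated with a minimal sketch. It is not the code in push_loop_config.go: the AddonAssignment fields, the Mode constants, fakeSupervisor, and StartLaunched are hypothetical stand-ins, and relaunching after a revert from attach mode is an assumption made for illustration.

```go
package main

import "fmt"

// AddonAssignment is a hypothetical stand-in for the entries returned by
// configResp.GetAddons(); the real message's fields may differ.
type AddonAssignment struct {
	AddonID     string
	RuntimeType string
	Enabled     bool
}

// netprobeSystemdAssignmentPresent reports whether an enabled netprobe
// systemd_service assignment is present (field names are assumptions).
func netprobeSystemdAssignmentPresent(addons []AddonAssignment) bool {
	for _, a := range addons {
		if a.Enabled && a.AddonID == "netprobe" && a.RuntimeType == "systemd_service" {
			return true
		}
	}
	return false
}

// Mode is a hypothetical supervision mode enum.
type Mode int

const (
	ModeStopped Mode = iota
	ModeLaunched
	ModeAttached
)

// fakeSupervisor records lifecycle calls so the routing can be observed.
type fakeSupervisor struct {
	mode  Mode
	calls []string
}

func (s *fakeSupervisor) Mode() Mode { return s.mode }

func (s *fakeSupervisor) Stop() {
	s.calls = append(s.calls, "stop")
	s.mode = ModeStopped
}

func (s *fakeSupervisor) StartAttach() {
	s.calls = append(s.calls, "attach")
	s.mode = ModeAttached
}

func (s *fakeSupervisor) StartLaunched() {
	s.calls = append(s.calls, "launch")
	s.mode = ModeLaunched
}

// applyVisibilityConfig routes on systemdManaged. The systemd branch always
// returns true so a later step in the same cycle can still install the unit.
func applyVisibilityConfig(s *fakeSupervisor, systemdManaged bool) bool {
	if systemdManaged {
		if s.Mode() != ModeAttached {
			s.Stop() // stop any agent-launched process before systemd owns the socket
			s.StartAttach()
		}
		return true
	}
	if s.Mode() == ModeAttached {
		s.Stop() // revert from attach mode
	}
	if s.Mode() != ModeLaunched {
		s.StartLaunched() // assumed: the launch path starts netprobe itself
	}
	return true
}

func main() {
	sup := &fakeSupervisor{mode: ModeLaunched}
	addons := []AddonAssignment{{AddonID: "netprobe", RuntimeType: "systemd_service", Enabled: true}}

	applyVisibilityConfig(sup, netprobeSystemdAssignmentPresent(addons))
	applyVisibilityConfig(sup, netprobeSystemdAssignmentPresent(nil))
	fmt.Println(sup.calls)
}
```

In this sketch the Mode() check makes repeated systemd-managed applies idempotent: once attached, later cycles do not stop and re-attach.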

Apply-on-connect (netprobe/sidecar.go)

SetDesiredConfig stores the latest VisibilityConfig atomically and asynchronously runs pushDesired (serialized via applyMu, last-write-wins); setClient re-triggers the push on every (re)connect. As a result, a systemd-managed netprobe receives its full config over IPC, including device bindings (which the bootstrap file does not carry), on startup and after any restart, independent of the gateway poll cadence and the NotModified short-circuit. The push primitive is injectable (applyFn) for tests.
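
A minimal sketch of the apply-on-connect pattern, under simplifying assumptions: the VisibilityConfig struct, the sidecar fields, and the WaitGroup are hypothetical, and the real pushDesired also waits for a connected client, which is omitted here. Only the names SetDesiredConfig, pushDesired, setClient, applyMu, and applyFn come from this PR.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// VisibilityConfig is a hypothetical stand-in for the real config message.
type VisibilityConfig struct {
	Version int
}

// sidecar holds the desired config and pushes it through applyFn.
type sidecar struct {
	desired atomic.Pointer[VisibilityConfig] // latest desired config; nil clears
	applyMu sync.Mutex                       // serializes pushes
	applyFn func(*VisibilityConfig) error    // injectable push primitive
	wg      sync.WaitGroup                   // sketch-only: lets callers wait for pushes
}

// SetDesiredConfig stores the newest config and pushes it asynchronously.
func (s *sidecar) SetDesiredConfig(cfg *VisibilityConfig) {
	s.desired.Store(cfg)
	s.triggerPush()
}

// setClient models a (re)connect: the stored config is pushed again.
func (s *sidecar) setClient() {
	s.triggerPush()
}

func (s *sidecar) triggerPush() {
	s.wg.Add(1)
	go func() {
		defer s.wg.Done()
		s.pushDesired()
	}()
}

// pushDesired loads the config only after taking applyMu, so whichever push
// runs last applies the newest stored value (last-write-wins).
func (s *sidecar) pushDesired() {
	s.applyMu.Lock()
	defer s.applyMu.Unlock()
	_ = s.applyFn(s.desired.Load())
}

func main() {
	var applied *VisibilityConfig
	sc := &sidecar{applyFn: func(c *VisibilityConfig) error {
		applied = c // only written while applyMu is held
		return nil
	}}

	for v := 1; v <= 3; v++ {
		sc.SetDesiredConfig(&VisibilityConfig{Version: v})
	}
	sc.wg.Wait()
	fmt.Println("applied version:", applied.Version)

	sc.SetDesiredConfig(nil) // nil clears the config on the sidecar
	sc.wg.Wait()
	fmt.Println("cleared:", applied == nil)
}
```

Loading the pointer inside the lock, rather than capturing the config when the goroutine is spawned, is what keeps a slow earlier push from overwriting a newer config.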

sidecarLifecycleManager gained Mode() + StartAttach(); the netprobe sidecar takes a Logger.

Why this is safe to land now

The cutover is inert until the control plane seeds a netprobe AddonAssignment (§3.1) — with no assignment, every fleet stays on the unchanged launch path.

Validation

  • go test -race green: routing → attach + revert → stop (TestApplyVisibilityConfigRoutesNetprobeBySupervision), assignment-detection table (TestNetprobeSystemdAssignmentPresent), apply-on-connect applies + nil-clears (TestSidecarSetDesiredConfigApplies). These run in CI via go test ./... (tests-golang.yml).
  • golangci-lint run clean; agent packages build.
  • An adversarial multi-agent review of the diff confirmed the routing/handoff and apply-on-connect (latest-wins, no data race) are correct — 0 confirmed defects of 8 candidates.

Next (§2.2 remainder + beyond)

  • The local-override/cache fallback across an agent restart while the control plane is unreachable (in-memory desiredConfig is lost on agent restart; relies on the agent's cached-config replay — verify in §4.x).
  • §2.3 status, §2.4 rollback.
  • §4.3 scratch-Linux-agent e2e is the full-flow gate (systemd installs+starts netprobe → agent attaches → config pushed → events ingested → restart re-applies); bundles can't build/run on darwin.

🤖 Generated with Claude Code

feat(agent): gate netprobe lifecycle on its AddonAssignment (#3425)
Some checks failed
Secret Scan / gitleaks (pull_request) Successful in 25s
lint / lint (pull_request) Successful in 1m12s
lint / lint (push) Successful in 1m33s
Golang Tests / test-go (push) Successful in 1m51s
CI / build (pull_request) Failing after 4m32s
feae1e4108
migrate-netprobe-to-native-addon §2.2 cutover. Moves netprobe off the always-on
agent-launched visibility path onto the systemd-service add-on lifecycle, gated on its
AddonAssignment — building on the attach mode from #3482.

Routing (push_loop_config.go): applyVisibilityConfig now takes systemdManaged bool,
derived from netprobeSystemdAssignmentPresent(configResp.GetAddons()) (an enabled netprobe
systemd_service assignment), and branches:
- systemd-managed -> applyVisibilityConfigSystemd: switch the supervisor to attach mode
  (Mode()-gated Stop+StartAttach so a previously agent-launched netprobe is stopped before
  systemd owns the socket — no double-run) and hand the config to the sidecar. It ALWAYS
  returns true so the apply never aborts before applyAddonAssignments installs the unit
  later in the same cycle (a synchronous failing ApplyConfig there would deadlock the
  cutover).
- not assigned -> applyVisibilityConfigLaunched: the original launch path, byte-for-byte
  for un-assigned fleets (zero regression), plus a Stop + SetDesiredConfig(nil) revert if
  coming from attach mode.

Apply-on-connect (netprobe/sidecar.go): SetDesiredConfig stores the latest VisibilityConfig
(atomic) and asynchronously pushDesired (serialized via applyMu, always applies the newest —
last-write-wins); setClient re-triggers it on every (re)connect. So a systemd-managed
netprobe gets its FULL config over IPC (incl. device bindings, which the bootstrap file does
not carry) on startup and after any restart, independent of the gateway poll cadence and the
NotModified short-circuit. The push primitive is injectable (applyFn) for tests.

sidecarLifecycleManager gained Mode() + StartAttach(); the netprobe sidecar takes a Logger.

The cutover is inert until the control plane seeds a netprobe AddonAssignment (§3.1): with no
assignment, every fleet stays on the unchanged launch path.

Validation: go test -race green (routing -> attach + revert -> stop; assignment-detection
table; apply-on-connect applies + nil-clears); golangci-lint clean; agent builds. An
adversarial multi-agent review of the diff confirmed the routing/handoff and apply-on-connect
(latest-wins, no data race) are correct with 0 confirmed defects. Full-flow validation
(systemd start -> attach -> ingest -> restart re-apply) is the §4.3 scratch-Linux-agent e2e.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
fix(agent): bound netprobe apply-on-connect to the agent run context (#3425)
Some checks failed
Secret Scan / gitleaks (pull_request) Successful in 25s
lint / lint (pull_request) Successful in 59s
lint / lint (push) Successful in 1m13s
Golang Tests / test-go (push) Successful in 2m39s
CI / build (pull_request) Failing after 3m18s
dca86fca8e
Address review: pushDesired used context.WithTimeout(context.Background(), ...),
detaching the fire-and-forget config push from the agent lifecycle — on shutdown those
goroutines were not cancelled (they lingered to the 30s timeout).

The push must outlive the config-apply call that triggers it (on the first cutover netprobe
isn't running yet, so it keeps polling for a client after applyConfigResponse returns), so a
per-request ctx can't be used. The right lifetime is the agent run/poll-loop context, which
is long-lived and cancelled on shutdown. SetDesiredConfig now takes that ctx and stores it as
the sidecar's base context; pushDesired derives its per-attempt timeout from it. The reconnect
trigger (setClient) has no ctx in scope, so the lifetime ctx is stored on the sidecar rather
than threaded through the generic OnHealthy/Sidecar interface.

The routing test now drives a cancellable context and cancels it on teardown, so the
apply-on-connect goroutine stops instead of polling out its timeout.

go test -race + golangci-lint clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
mfreeman451 left a comment

lgtm
mfreeman451 deleted branch feat/netprobe-assignment-cutover 2026-06-01 02:59:39 +00:00
Reference
carverauto/serviceradar!3483