feat(agent): sidecar manager attach mode for systemd-managed sidecars (#3425) #3482

Merged
mfreeman451 merged 2 commits from feat/netprobe-assignment-gated-lifecycle into staging 2026-06-01 02:10:37 +00:00
Owner

Summary

migrate-netprobe-to-native-addon §2.2 — building block. Teaches the agent sidecar supervisor to attach to an externally-(systemd-)managed sidecar socket without launching the process. This is the mechanism netprobe's systemd-service lifecycle connects through (config push + event ingest), and it lands first — additive, fully tested, zero production behavior change (no caller yet) — to de-risk the cutover, in the same spirit as #3481 shipping the inert unit.

Why attach mode

Under systemd-service, systemd owns the netprobe process (it binds the IPC socket). The agent must still get a connected client to push VisibilityConfig over IPC and ingest fingerprint/DPI/flow events. Ingest is already launch-decoupled (DrainFlowAttributionEvents drains a channel fed by any connected client), so the only thing the agent needs is a health/connect loop that wires the client in via OnHealthy — which is exactly what the supervisor's health loop already does. Attach mode reuses it, minus the exec/restart.

What changed (go/pkg/agent/sidecar/manager.go)

  • StartAttach(ctx) vs launch Start(ctx); both delegate to start(ctx, attach bool).
  • In attach mode supervise()superviseAttached(): runs the existing healthLoop against the well-known socket — dial, health-check, OnHealthy (wires the client → config push + ingest), reconnect on loss — with no exec, no process wait, no restart/backoff/circuit-breaker. PID stays 0; status is driven to Running/Unhealthy/Stopped by connectivity.
  • Mode() (started, attach bool) so the forthcoming push-loop cutover can decide when a launch↔attach switch is needed.
  • attach is passed to supervise() as a parameter (not read from the field), so no new unsynchronized shared state; Mode() reads the mode field under the existing RLock.

Tests (manager_attach_test.go, registered in the Bazel go_test srcs)

  • connect-without-launch: the sidecar's binary path doesn't exist, yet it reaches Running with PID 0 and fires OnHealthy with a non-nil client — impossible on the launch path, so it proves no exec.
  • genuine reconnect: connect → RunningPing starts failing (stale client Close()'d) → UnhealthyPing recovers → health loop re-dialsRunning, with RestartCount==0 and PID==0 (no process restart).
  • Mode() across StartAttach/Stop/Start.

Validation

  • go test -race ./go/pkg/agent/sidecar/ green (full package, incl. existing launch tests — the supervise signature change broke nothing)
  • go vet + gofmt clean; agent packages build
  • An adversarial multi-agent review of the diff found the concurrency/lifecycle correct (no race/deadlock/leak) and surfaced two fixes I applied: registering the test file in the Bazel go_test srcs (this repo enumerates test srcs, no globbing), and making the reconnect test actually exercise the drop/recover path.

Next (the cutover, §2.2 remainder)

push_loop_config.go: detect the netprobe AddonAssignment (enabled systemd_service) → StartAttach + push config; absent → the existing launch path (unchanged). Plus apply-on-connect for the netprobe sidecar — required because applyVisibilityConfig runs before applyAddonAssignments installs the unit (a synchronous attach ApplyConfig would return false and abort the apply at push_loop_config.go:203), and the NotModified short-circuit means poll-driven re-push won't recover after a systemd restart.

🤖 Generated with Claude Code

## Summary `migrate-netprobe-to-native-addon` §2.2 — **building block**. Teaches the agent sidecar supervisor to **attach** to an externally-(systemd-)managed sidecar socket *without launching the process*. This is the mechanism netprobe's `systemd-service` lifecycle connects through (config push + event ingest), and it lands first — additive, fully tested, **zero production behavior change** (no caller yet) — to de-risk the cutover, in the same spirit as #3481 shipping the inert unit. ## Why attach mode Under `systemd-service`, systemd owns the netprobe process (it binds the IPC socket). The agent must still get a **connected client** to push `VisibilityConfig` over IPC and ingest fingerprint/DPI/flow events. Ingest is already **launch-decoupled** (`DrainFlowAttributionEvents` drains a channel fed by any connected client), so the only thing the agent needs is a health/connect loop that wires the client in via `OnHealthy` — which is exactly what the supervisor's health loop already does. Attach mode reuses it, minus the `exec`/restart. ## What changed (`go/pkg/agent/sidecar/manager.go`) - **`StartAttach(ctx)`** vs launch **`Start(ctx)`**; both delegate to `start(ctx, attach bool)`. - In attach mode `supervise()` → `superviseAttached()`: runs the existing `healthLoop` against the well-known socket — dial, health-check, `OnHealthy` (wires the client → config push + ingest), reconnect on loss — with **no `exec`, no process wait, no restart/backoff/circuit-breaker**. `PID` stays 0; status is driven to `Running`/`Unhealthy`/`Stopped` by connectivity. - **`Mode() (started, attach bool)`** so the forthcoming push-loop cutover can decide when a launch↔attach switch is needed. - `attach` is passed to `supervise()` as a **parameter** (not read from the field), so no new unsynchronized shared state; `Mode()` reads the mode field under the existing `RLock`. ## Tests (`manager_attach_test.go`, registered in the Bazel `go_test` srcs) - **connect-without-launch**: the sidecar's binary path doesn't exist, yet it reaches `Running` with `PID 0` and fires `OnHealthy` with a non-nil client — impossible on the launch path, so it proves no `exec`. - **genuine reconnect**: connect → `Running` → `Ping` starts failing (stale client `Close()`'d) → `Unhealthy` → `Ping` recovers → health loop **re-dials** → `Running`, with `RestartCount==0` and `PID==0` (no process restart). - **`Mode()`** across `StartAttach`/`Stop`/`Start`. ## Validation - `go test -race ./go/pkg/agent/sidecar/` green (full package, incl. existing launch tests — the `supervise` signature change broke nothing) - `go vet` + `gofmt` clean; agent packages build - An adversarial multi-agent review of the diff found the concurrency/lifecycle correct (no race/deadlock/leak) and surfaced two fixes I applied: registering the test file in the Bazel `go_test` srcs (this repo enumerates test srcs, no globbing), and making the reconnect test actually exercise the drop/recover path. ## Next (the cutover, §2.2 remainder) `push_loop_config.go`: detect the netprobe `AddonAssignment` (enabled `systemd_service`) → `StartAttach` + push config; absent → the existing launch path (unchanged). Plus **apply-on-connect** for the netprobe sidecar — required because `applyVisibilityConfig` runs before `applyAddonAssignments` installs the unit (a synchronous attach `ApplyConfig` would `return false` and abort the apply at `push_loop_config.go:203`), and the `NotModified` short-circuit means poll-driven re-push won't recover after a systemd restart. 🤖 Generated with [Claude Code](https://claude.com/claude-code)
feat(agent): sidecar manager attach mode for systemd-managed sidecars (#3425)
Some checks failed
Secret Scan / gitleaks (pull_request) Successful in 21s
lint / lint (push) Failing after 54s
Golang Tests / test-go (push) Successful in 2m0s
lint / lint (pull_request) Failing after 1m51s
CI / build (pull_request) Has been cancelled
937c4623e1
migrate-netprobe-to-native-addon §2.2 (building block). Teaches the agent sidecar
supervisor to attach to an externally-managed sidecar socket without launching the
process — the mechanism netprobe's systemd-service lifecycle will connect through.

- sidecar.Manager.StartAttach(ctx) (vs launch Start(ctx)); both delegate to a private
  start(ctx, attach). In attach mode the manager does NOT exec the binary, wait on a
  process, or run the restart/backoff/circuit-breaker loop. supervise() dispatches to
  superviseAttached(), which runs the existing health loop against the well-known socket:
  it dials, health-checks, wires the client into the sidecar via OnHealthy (so config
  push + event ingest work identically — ingest is launch-decoupled), and reconnects on
  loss. PID stays 0 (the agent does not own the process); the health loop drives the
  status to Running on connect, Unhealthy when the external process is unreachable, and
  Stopped on context cancel.
- Manager.Mode() (started, attach bool) reports the current mode so the (forthcoming)
  push-loop cutover can decide when a launch<->attach switch is needed.
- attach is passed to supervise() as a parameter (not read from the field) so no new
  unsynchronized shared state; Mode() reads the mode field under the existing RLock.

Additive only — nothing calls StartAttach yet, so there is zero production behavior
change. The push-loop assignment-gated routing + apply-on-connect cutover is the next
focused PR (noted under §2.2 in tasks.md).

Validation: go test -race ./go/pkg/agent/sidecar/ green (3 new attach tests prove
connect-without-launch via a nonexistent binary that still reaches Running with PID 0, a
genuine Running->Unhealthy->Running reconnect with RestartCount==0, and Mode() across the
Start/StartAttach/Stop transitions); go vet + gofmt clean; agent packages build. The new
test file is registered in the Bazel go_test srcs so it runs in CI. Findings from an
adversarial multi-agent review of the diff were applied (Bazel srcs registration + a
reconnect test that actually exercises the drop/recover path).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
mfreeman451 left a comment

lgtm

lgtm
fix(agent): satisfy unparam lint on the waitForStatus test helper (#3425)
Some checks failed
lint / lint (pull_request) Successful in 56s
Secret Scan / gitleaks (pull_request) Successful in 56s
lint / lint (push) Successful in 1m23s
Golang Tests / test-go (push) Successful in 3m10s
CI / build (pull_request) Failing after 5m3s
769e797f5e
CI lint-go (golangci-lint unparam) flagged the pre-existing waitForStatus test
helper: its `name` param is always "netprobe". The new attach-mode tests added more
call sites, surfacing it. Suppress with //nolint:unparam (matching the repo's existing
convention, e.g. sweeper/aggregation_test.go) since these single-sidecar tests legitimately
always pass "netprobe" and the param reads clearly at the call sites. `golangci-lint run`
on the package now reports 0 issues; go test -race stays green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
mfreeman451 deleted branch feat/netprobe-assignment-gated-lifecycle 2026-06-01 02:10:42 +00:00
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
carverauto/serviceradar!3482
No description provided.