feat(agents): gateway-proxied netprobe add-on delivery + agent host-IP self-registration (#3425) #3526

Merged
mfreeman451 merged 2 commits from feat/netprobe-gateway-addon-delivery into staging 2026-06-04 18:34:38 +00:00
Owner

Two fixes that make worker-node agents fully functional (managed devices running netprobe via the gateway, never the KV).

1. Gateway-proxied native add-on delivery (475dbc4ab)

Agents fetch add-on artifacts THROUGH the agent-gateway over mTLS HTTPS (the same path WASM plugins use) instead of the direct-KV grpcRemoteStore — so external/NAT'd agents (no kv_address) can run netprobe.

  • proto: download_url/download_token on AddonAssignmentConfig.
  • control plane: build_addon_assignment_config mints a signed (HMAC) download request via StorageToken; stripped from the config-version hash so the rotating token doesn't churn it.
  • web-ng: POST /api/addon-packages/:id/blob/download (token-verified, entitlement-checked against the package's own artifact object_keys via constant-time compare, serves from storage via DataService).
  • agent: HTTPS fetch when download_url set (mTLS gateway client), else direct store; sha256 + ed25519 signature verified before any disk write.
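
The agent-side ordering in the last bullet (digest check, then signature check, then disk write) could look roughly like the Go sketch below. This is an illustration, not the code in `addon_activation.go`: the helper name `verifyAndStage`, its parameters, and the choice to sign the raw artifact bytes are all assumptions.

```go
package main

import (
	"crypto/ed25519"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// verifyAndStage is a hypothetical helper: it writes payload to destPath only
// after the sha256 digest and the ed25519 signature have both been verified.
// Assumption: the signature covers the raw artifact bytes.
func verifyAndStage(payload []byte, wantSHA256Hex string, pub ed25519.PublicKey, sig []byte, destPath string) error {
	want, err := hex.DecodeString(wantSHA256Hex)
	if err != nil {
		return fmt.Errorf("decode expected sha256: %w", err)
	}
	sum := sha256.Sum256(payload)
	if subtle.ConstantTimeCompare(sum[:], want) != 1 {
		return errors.New("sha256 mismatch")
	}
	// ed25519.Verify panics on a wrong-sized key, so reject it explicitly.
	if len(pub) != ed25519.PublicKeySize {
		return errors.New("invalid ed25519 public key length")
	}
	if !ed25519.Verify(pub, payload, sig) {
		return errors.New("ed25519 signature invalid")
	}
	// Nothing touches the disk until both checks have passed.
	return os.WriteFile(destPath, payload, 0o600)
}

func main() {
	pub, priv, err := ed25519.GenerateKey(nil) // nil selects crypto/rand
	if err != nil {
		panic(err)
	}
	payload := []byte("example add-on bytes")
	sum := sha256.Sum256(payload)
	digest := hex.EncodeToString(sum[:])
	sig := ed25519.Sign(priv, payload)

	dir, err := os.MkdirTemp("", "addon-stage")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	dest := filepath.Join(dir, "netprobe.tar.gz")

	fmt.Println("staged:", verifyAndStage(payload, digest, pub, sig, dest) == nil)

	payload[0] ^= 0xff // simulate tampering in transit
	fmt.Println("tampered rejected:", verifyAndStage(payload, digest, pub, sig, dest) != nil)
}
```

Verifying before writing means a tampered or truncated download never lands on disk, regardless of whether it arrived via the gateway or the direct store.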

2. Agent host-IP self-registration (d75ed2f44)

Fixes the regression where external/NAT'd agents never become managed devices. AgentHelloRequest/ControlStreamHello now carry host_ip; the gateway prefers it over the TCP peer IP so DIRE links the agent to the right device record (sets agent_id/is_managed/agent discovery source). In-cluster behavior is unchanged (the gateway falls back to the peer IP when host_ip is absent). Already-orphaned devices self-heal on the next Hello.
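
The preference rule can be sketched in a few lines. The real logic lives in the Elixir gateway (`device_attrs_from_request`); the Go function below is only an illustration, and both the name `effectiveDeviceIP` and the extra "must parse as an IP" check are assumptions rather than something the PR states.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// effectiveDeviceIP is a hypothetical illustration of the selection rule:
// use the agent-reported host_ip when one is present, otherwise keep the
// TCP peer IP. Rejecting values that do not parse as an IP is an added
// assumption, not confirmed behavior.
func effectiveDeviceIP(reportedHostIP, peerIP string) string {
	candidate := strings.TrimSpace(reportedHostIP)
	if candidate != "" && net.ParseIP(candidate) != nil {
		return candidate
	}
	return peerIP
}

func main() {
	// NAT'd agent: the peer IP is the NAT egress address, host_ip is the real host.
	fmt.Println(effectiveDeviceIP("10.0.0.5", "203.0.113.9"))
	// Older or in-cluster agent that sends no host_ip: peer IP is kept.
	fmt.Println(effectiveDeviceIP("", "10.42.0.7"))
}
```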

Verified: go build ./go/... + go test ./go/pkg/agent/... green; mix compile --warnings-as-errors (core + agent_gateway) clean.

Note: make generate-proto-elixir (0.16.0) emits paren-less style + a spurious proto/flow/ dir; the checked-in monitoring.pb.ex carries just the new field lines in the existing format — if verify-proto-elixir flags drift it's cosmetic, not a contract change.

🤖 Generated with Claude Code

Agents fetch native add-on (netprobe) artifacts THROUGH the agent-gateway
over HTTPS, exactly like WASM plugins, never touching the KV/object store
directly. Five pieces, mirroring the existing WASM-plugin delivery pattern:

1. Elixir proto regen: Monitoring.AddonAssignmentConfig now carries
   :download_url (16) and :download_token (17) (proto/monitoring.{proto,pb.go}
   already had them).

2. Control plane (agent_config_generator.ex): build_addon_assignment_config/3
   now mints a gateway download request for the selected per-arch artifact via
   StorageToken.download_addon_request/2 and sets download_url/download_token on
   the proto. StorageToken gains an addon URL variant
   (/api/addon-packages/:id/blob/download) reusing the identical signed-token
   mechanism; only the URL path differs. stable_addon_assignment/1 now strips
   the per-poll download fields so the rotating token does not churn the config
   version hash.

3. web-ng (addon_package_controller.ex + router): new
   POST /api/addon-packages/:id/blob/download in the same :api pipeline as the
   plugin blob route. Verifies the download token (Storage.verify_token), checks
   token.id == route :id, fetches the AddonPackage, and serves the blob only if
   the token's key is one of the package's mirrored artifact object_keys
   (constant-time), as application/gzip.

4. Agent (addon_activation.go, push_loop_addons.go): when download_url is set,
   stageAddonArtifactWithClient fetches the artifact over HTTPS via the gateway
   mTLS client (gatewayArtifactHTTPClient(GatewaySecurity)) with the token in
   X-ServiceRadar-Plugin-Token, then applies the SAME sha256 + ed25519 signature
   verification before staging. External agents (no kv_address) no longer hit
   ErrAddonObjectStoreUnavailable when a download_url is present; direct-store
   path preserved when it is empty.

5. Seeder (netprobe_addon_package_seeder.ex): when an existing package already
   has non-empty artifacts (imported/mirrored), the seeder no longer overwrites
   artifacts or downgrades status via restage; it at most refreshes config_schema.
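
For readers unfamiliar with the signed-token pattern in items 2 and 3, here is a rough Go sketch of the three checks the download route performs: token signature, token id against route id, and token key against the package's artifact keys. The actual implementation is Elixir (StorageToken / Storage.verify_token) and its token format is not shown in this PR, so the `claims` struct, the `payload.signature` encoding, and every function name below are assumptions.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// claims is a hypothetical token payload: package id, object key, expiry.
type claims struct {
	ID  string `json:"id"`
	Key string `json:"key"`
	Exp int64  `json:"exp"` // Unix seconds
}

func sign(secret []byte, payload string) []byte {
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(payload))
	return mac.Sum(nil)
}

// mintToken encodes the claims and appends an HMAC-SHA256 signature.
func mintToken(secret []byte, c claims) (string, error) {
	body, err := json.Marshal(c)
	if err != nil {
		return "", err
	}
	enc := base64.RawURLEncoding
	payload := enc.EncodeToString(body)
	return payload + "." + enc.EncodeToString(sign(secret, payload)), nil
}

// verifyToken checks the signature first, then decodes and checks expiry.
func verifyToken(secret []byte, token string, nowUnix int64) (claims, error) {
	payload, sig, ok := strings.Cut(token, ".")
	if !ok {
		return claims{}, errors.New("malformed token")
	}
	enc := base64.RawURLEncoding
	got, err := enc.DecodeString(sig)
	if err != nil || !hmac.Equal(got, sign(secret, payload)) {
		return claims{}, errors.New("bad signature")
	}
	body, err := enc.DecodeString(payload)
	if err != nil {
		return claims{}, errors.New("malformed payload")
	}
	var c claims
	if err := json.Unmarshal(body, &c); err != nil {
		return claims{}, errors.New("malformed claims")
	}
	if nowUnix >= c.Exp {
		return claims{}, errors.New("token expired")
	}
	return c, nil
}

// authorize requires the token id to match the route id and the token key to
// equal one of the package's artifact object keys (constant-time per key).
func authorize(c claims, routeID string, objectKeys []string) bool {
	if c.ID != routeID {
		return false
	}
	match := 0
	for _, k := range objectKeys {
		match |= subtle.ConstantTimeCompare([]byte(c.Key), []byte(k))
	}
	return match == 1
}

func main() {
	secret := []byte("demo-secret")
	tok, err := mintToken(secret, claims{ID: "pkg-1", Key: "addons/amd64.tar.gz", Exp: 200})
	if err != nil {
		panic(err)
	}
	c, err := verifyToken(secret, tok, 100)
	fmt.Println("verified:", err == nil)
	fmt.Println("authorized:", authorize(c, "pkg-1", []string{"addons/arm64.tar.gz", "addons/amd64.tar.gz"}))
}
```

Binding the token to both the route id and one of the package's own object keys means a valid token for one package cannot be replayed to fetch another package's blob.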

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
fix(agents): self-register host IP in Hello so external/NAT'd agents link to their device (DIRE)
Some checks failed
Secret Scan / gitleaks (pull_request) Successful in 38s
lint / lint (push) Successful in 1m48s
Golang Tests / test-go (push) Successful in 1m54s
lint / lint (pull_request) Successful in 1m33s
Elixir Quality / Elixir Quality (pull_request) Failing after 11m11s
CI / build (pull_request) Failing after 26m12s
d75ed2f44f
An agent should become a managed device on Hello: the gateway calls
AgentGatewaySync.ensure_device_for_agent, which resolves the device by IP and
stamps agent_id/is_managed/"agent" discovery source. But the device IP it used
was the TCP peer IP (get_peer_ip), because AgentHelloRequest had no host_ip
field. For external/NAT'd agents the peer IP != the agent's host IP, so DIRE's
IP lookup missed the real device and the agent never linked.

- proto: add string host_ip = 11 to AgentHelloRequest and ControlStreamHello;
  regenerate Go + Elixir bindings (field-only additions).
- go agent: set HostIp from getSourceIP() (prefers configured HostIP, falls back
  to interface enumeration) in both Hello builders, so Hello and PushStatus agree.
- gateway: device_attrs_from_request prefers the request's reported host_ip when
  present, otherwise keeps the peer source_ip (in-cluster behavior unchanged).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
mfreeman451 left a comment

lgtm

mfreeman451 deleted branch feat/netprobe-gateway-addon-delivery 2026-06-04 18:34:39 +00:00
Reference
carverauto/serviceradar!3526