feat(observability): correlate netprobe attribution with NetFlow into attributed flows (#3425) #3516

Merged
mfreeman451 merged 1 commit from feat/attributed-flow-correlation into staging 2026-06-02 07:49:03 +00:00
Owner

Summary

Part of #3425. Makes the demo's /observability/flows/attributed page show real attributed flows from the live system, with an architecture that scales to fleets — NetFlow stays the flow source, netprobe only supplies the process, and the correlation is a set-based DB join in core-elx (no agent-as-flow-collector, no host-slice routing, no per-host config).

The model

  1. NetFlow/sFlow keeps flowing from routers → flow-collector → ocsf_network_activity (unchanged — that's the 9.5k flow rows/24h already there).
  2. netprobe agents push process attribution (5-tuple → pid/comm/uid/container) up the existing agent-gateway → core-elx pipeline. netprobe already emits FlowAttributionEvent; no netprobe/agent/flow-collector changes.
  3. core-elx persists each pushed attribution to CNPG and a worker correlates it against the NetFlow already in the DB (direction-agnostic 5-tuple + partition + 15-min window), stamping matching flows event_type=attributed_flow with the process context.
  4. web-ng reads those attributed_flow rows (was hardcoded fixtures).

Changes

  • migration platform.flow_process_attributions — 5-tuple + process + partition + observed_at, indexed.
  • ServiceRadar.FlowAttributionpersist/3 (StatusHandler writes pushed attributions), correlate/0 (the join/UPDATE), prune/0 (60-min retention).
  • ServiceRadar.FlowAttribution.Correlator — GenServer, correlate+prune every 30s, supervised under the EventWriter supervisor.
  • status_handler.ex — persist pushed attributions (additive; the existing real-time joiner path is untouched).
  • web-ng attributed_live.ex — real Repo.query against ocsf_network_activity.

Why not the alternatives

  • Static host_slices config (flow-collector per-agent IP map): doesn't scale to 100k hosts — rejected.
  • Agent emits complete flows (netprobe as flow collector): out of scope; we want network-wide NetFlow correlated with on-host process, not per-agent flow accounting.

Deploy / validation

Needs a core-elx + web-ng image build and the migration to run (core-migrations-job). After deploy, a netprobe-equipped agent (e.g. agent-sr-test-pve04) capturing on its interface will push attributions that correlate against its NetFlow (its host IP already appears in 142 flows/24h) → rows appear on the attributed-flows page. ⚠️ Elixir not compiled locally in this environment — CI build is the first compile/credo gate.

🤖 Generated with Claude Code

## Summary Part of #3425. Makes the demo's `/observability/flows/attributed` page show **real** attributed flows from the live system, with an architecture that scales to fleets — **NetFlow stays the flow source, netprobe only supplies the process, and the correlation is a set-based DB join in core-elx** (no agent-as-flow-collector, no host-slice routing, no per-host config). ### The model 1. **NetFlow/sFlow** keeps flowing from routers → flow-collector → `ocsf_network_activity` (unchanged — that's the 9.5k flow rows/24h already there). 2. **netprobe agents push process attribution** (5-tuple → pid/comm/uid/container) up the **existing** agent-gateway → core-elx pipeline. netprobe already emits `FlowAttributionEvent`; **no netprobe/agent/flow-collector changes.** 3. **core-elx persists** each pushed attribution to CNPG and a **worker correlates** it against the NetFlow already in the DB (direction-agnostic 5-tuple + partition + 15-min window), stamping matching flows `event_type=attributed_flow` with the process context. 4. **web-ng** reads those `attributed_flow` rows (was hardcoded fixtures). ### Changes - `migration` `platform.flow_process_attributions` — 5-tuple + process + partition + observed_at, indexed. - `ServiceRadar.FlowAttribution` — `persist/3` (StatusHandler writes pushed attributions), `correlate/0` (the join/UPDATE), `prune/0` (60-min retention). - `ServiceRadar.FlowAttribution.Correlator` — GenServer, correlate+prune every 30s, supervised under the EventWriter supervisor. - `status_handler.ex` — persist pushed attributions (additive; the existing real-time joiner path is untouched). - web-ng `attributed_live.ex` — real `Repo.query` against `ocsf_network_activity`. ### Why not the alternatives - *Static `host_slices` config* (flow-collector per-agent IP map): doesn't scale to 100k hosts — rejected. - *Agent emits complete flows* (netprobe as flow collector): out of scope; we want network-wide NetFlow correlated with on-host process, not per-agent flow accounting. ### Deploy / validation Needs a **core-elx + web-ng image build** and the **migration to run** (core-migrations-job). After deploy, a netprobe-equipped agent (e.g. `agent-sr-test-pve04`) capturing on its interface will push attributions that correlate against its NetFlow (its host IP already appears in 142 flows/24h) → rows appear on the attributed-flows page. ⚠️ Elixir not compiled locally in this environment — CI build is the first compile/credo gate. 🤖 Generated with [Claude Code](https://claude.com/claude-code)
feat(observability): correlate netprobe attribution with NetFlow into attributed flows (#3425)
Some checks failed
Secret Scan / gitleaks (pull_request) Successful in 38s
lint / lint (push) Successful in 1m25s
Golang Tests / test-go (push) Successful in 1m35s
lint / lint (pull_request) Successful in 1m15s
CI / build (pull_request) Failing after 10m2s
Elixir Quality / Elixir Quality (pull_request) Failing after 13m22s
0bf554acea
NetFlow/sFlow stays the authoritative flow source (flow-collector ->
ocsf_network_activity, unchanged). netprobe agents push process attribution
(5-tuple -> pid/comm/uid/container) up the EXISTING agent-gateway -> core-elx
pipeline; core persists it and a worker correlates it against the NetFlow
already collected in the DB, stamping matching flows as attributed_flow with
process context. No agent-as-flow-collector, no host-slice routing, no per-host
config — scales to fleets because the correlation is a set-based DB join.

- migration: platform.flow_process_attributions (5-tuple + process + partition +
  observed_at, indexed for correlation).
- ServiceRadar.FlowAttribution: persist/3 (StatusHandler writes the pushed
  attributions), correlate/0 (direction-agnostic 5-tuple + partition + 15-min
  window UPDATE marking ocsf_network_activity rows event_type=attributed_flow +
  attribution), prune/0 (60-min retention).
- ServiceRadar.FlowAttribution.Correlator: GenServer running correlate+prune every
  30s, supervised under the EventWriter supervisor.
- web-ng /observability/flows/attributed: queries real attributed_flow rows from
  ocsf_network_activity (was hardcoded fixtures).

netprobe and the flow-collector are unchanged (netprobe already emits
FlowAttributionEvent). Needs a core-elx + web-ng image build/deploy and the
migration to run; then a netprobe-equipped agent's flows show up attributed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
carverauto/serviceradar!3516
No description provided.