feat(causal-engine): umbrella proposal + Phase-0 (state-change feed, Gap B, Gap G handoff) #3519
Summary

Introduces the `add-causal-engine` OpenSpec umbrella proposal for ServiceRadar's DeepCausality causal engine and lands Phase 0 of it.

The engine's value is automation (events → alerts → state enrichment → prediction), with the God-View topology renderer as one optional consumer. Today DeepCausality backs a misplaced 244-line reactive stub in `god_view_nif`; the proposal ships a real single-pod fused `rust/causal-engine` on the existing CNPG/SRQL/AGE substrate and closes the loop by publishing predictions that re-enter `StatefulAlertEngine`.

Proposal (`openspec/changes/add-causal-engine/`)

- Passes `openspec validate --strict` (49 requirements / 131 scenarios).
- 6 ADDED capabilities (`causal-engine`, `causal-reasoning`, `causal-prediction-signals`, `inventory-risk-feed`, `service-flow-bridge`, `device-components`), 5 MODIFIED (`observability-signals`, `age-graph`, `health-events`, `topology-causal-overlays`, `topology-god-view`).
- `design.md` records the four key decisions and the reversible `god_view_nif` cutover. The three authoritative design docs are vendored under `docs/docs/`.

Phase-0 implementation (ServiceRadar-owned slice)
- Gap B: `network_discovery/topology_graph.ex` now marks edges without a populated `capacity_bps` (min-of-both-ends `speed_bps`/`if_speed`) ineligible, so the saturation causaloid (C6) skips them rather than treating absent capacity as zero/infinite. Refines the existing `telemetry_eligible` field; no new schema column. Coverage audit SQL in `runbooks/gap-b-capacity-audit.sql`.
- State-change feed: `ServiceRadar.EventWriter.StateChangePublisher` emits app-level state transitions to `cdc.platform.<table>` over NATS. It is default-disabled (`STATE_CHANGE_EVENTS_ENABLED` / `:state_change_events_enabled`), fire-and-forget, never applied to hypertables, and not pgoutput CDC / logical replication. Hooked `health_events` (HealthTracker), `service_state` (ServiceStateRegistry, gated pre-fetch + transition diff), and `ocsf_devices` (sync_ingestor bulk path, COALESCE-aware diff). Pure unit test for the envelope + gating.
- Gap G handoff: `runbooks/gap-g-ultragraph-handoff.md` specs the `StructuralGraphAlgorithms` trait (articulation points / bridges / reachability / pathway centrality / unfreeze) and the six causaloids that gate on it. ServiceRadar does not fork `ultragraph`.

Deferred (tracked in `tasks.md`)

Virtualization + AGE-projection transition hooks, the `cdc.platform.>` JetStream stream + Phase-1 consumer (held back so app startup isn't coupled to a not-yet-existing processor), and an optional `risk_score` rollup hook.

Testing
All touched Elixir was syntax-parsed locally (`Code.string_to_quoted`); full compile + credo + the publisher unit test run in CI (no local deps fetched). `openspec validate --strict` passes. No Bazel BUILD change needed (`serviceradar_core` globs sources).

🤖 Generated with Claude Code
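The COALESCE-aware diff mentioned for the `ocsf_devices` hook can be illustrated with a small sketch. The actual hook lives in Elixir (`sync_ingestor`); the Rust below is only an illustration of the rule, and the names `Transition` and `coalesce_diff` are hypothetical. It assumes that a stored value of "unknown" followed by a concrete incoming value counts as a transition.

```rust
/// Hypothetical record of one flag change (for example `is_available`).
#[derive(Debug, PartialEq)]
pub struct Transition {
    pub from: Option<bool>,
    pub to: bool,
}

/// Mirrors the upsert's COALESCE(EXCLUDED.x, stored) semantics:
/// a missing (None) incoming value keeps the stored value, so it can
/// never produce a transition. Only a concrete incoming value that
/// differs from what is stored is reported.
pub fn coalesce_diff(stored: Option<bool>, incoming: Option<bool>) -> Option<Transition> {
    match incoming {
        // Incoming value present and different from the stored one.
        Some(new) if stored != Some(new) => Some(Transition { from: stored, to: new }),
        // Incoming value absent, or identical to the stored one.
        _ => None,
    }
}
```

Under this rule a sync record that simply omits `is_available` does not generate a spurious "became unknown" event, which matches what the database row would look like after the upsert.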
Completes the device-availability slice of the add-causal-engine Decision-1 state-change feed (task 0.3.1d):

- inventory/sync_ingestor.ex: before the bulk upsert_devices, gated pre-fetch of prior is_available/is_managed by uid (StateChangePublisher.enabled?/0 guards the extra read); after {:ok, remap}, diff against the records and publish a cdc.platform.ocsf_devices transition per changed device, keyed by the remapped final sr: uid. COALESCE-aware: only a non-nil incoming value that differs from the stored one is treated as a transition (matches the device upsert's COALESCE(EXCLUDED.x, stored)). risk_score transitions deferred (separate DeviceRiskReducer rollup hook). Best-effort + rescued; never affects ingestion.
- Gap G (task 0.1) routed to the DeepCausality author: handoff spec at runbooks/gap-g-ultragraph-handoff.md (StructuralGraphAlgorithms trait: articulation_points/bridges/biconnected_components/is_reachable/pathway_betweenness_centrality + unfreeze, ~200 LOC Tarjan; tests; the six causaloids that gate on it). ServiceRadar does not fork ultragraph.

Remaining Phase-0/1 deferred: virtualization + AGE-projection hooks, the cdc.platform.> JetStream stream + Phase-1 consumer. Elixir syntax-parsed locally; full compile/credo + the publisher unit test gated in CI.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

New top-level workspace crate for the DeepCausality engine: single binary, single pod, fused (hydrator + reasoner in-process).
Modules:
- context_hydrator: ContextStore trait (hydrator<->reasoner seam, preserves a future split) + ContextHydrator stub
- domain_model: V1 entity types (Context/Device/Service) keyed by canonical sr: ids
- reasoner: Verdict/Classification + Reasoner stub (CausaloidGraph TODO 1.3)
- emitter: signals.causal.predictions publisher stub (TODO 1.6)
- snapshot: restart-snapshot stub (TODO 1.7)
- config (CAUSAL_ENGINE_* via envy) + error (thiserror)

main.rs wires a fused reasoning tick loop (hydrate -> reason -> emit). Heavy integration deps are deferred to the increment that first uses them (srql + async-nats in 1.2; ultragraph 0.9 + deep_causality in 1.3) to keep each crate_universe change scoped. Added to workspace members + Cargo.lock; BUILD.bazel mirrors rust/srql (all_crate_deps).

Verified: cargo check, cargo clippy --all-targets -D warnings, cargo fmt --check, and bazel build //rust/causal-engine:{causal_engine_lib,causal_engine_bin} --config=ci (RBE) all green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

rust/causal-engine/src/emitter.rs: Emitter connects to NATS JetStream (async_nats 0.48) and publishes one message per verdict on signals.causal.predictions.<entity> with a DETERMINISTIC event_identity (pred:<entity>:<classification>) so re-emission is idempotent (CausalSignals dedupes on it). The OCSF-compatible envelope (signal_type "causal"; event_type mapped to the God-View 4 buckets; source_identity / routing_correlation) mirrors what the existing CausalSignals processor normalizes into ocsf_events; the signals.causal.* prefix already routes, so no new inbound plumbing is needed. main.rs connects the emitter at startup. Deps async-nats 0.48 + chrono added (both already in the workspace lock).

Verified: cargo clippy --all-targets -D warnings, cargo test (3 pass), cargo fmt --check, and bazel build //rust/causal-engine:{causal_engine_lib,causal_engine_bin,causal_engine_test} --config=ci (RBE).
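The deterministic event_identity scheme from the emitter commit can be sketched as follows. This is not the crate's code: the string spellings of the classifications and the `Deduper` type (a stand-in for the dedupe that CausalSignals performs) are assumptions made for illustration; only the pred:<entity>:<classification> shape comes from the commit message.

```rust
use std::collections::HashSet;

/// Two of the verdict classifications named in the commits.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Classification {
    RootCause,
    Affected,
}

impl Classification {
    /// Hypothetical wire spelling; the real crate may use different strings.
    fn as_str(self) -> &'static str {
        match self {
            Classification::RootCause => "root_cause",
            Classification::Affected => "affected",
        }
    }
}

/// Builds the identity from the verdict's content only (no timestamp,
/// no random id), so the same verdict always yields the same key.
pub fn event_identity(entity: &str, classification: Classification) -> String {
    format!("pred:{}:{}", entity, classification.as_str())
}

/// Hypothetical consumer-side dedupe keyed on event_identity.
#[derive(Default)]
pub struct Deduper {
    seen: HashSet<String>,
}

impl Deduper {
    /// Returns true the first time an identity is seen, false on repeats.
    pub fn accept(&mut self, identity: &str) -> bool {
        self.seen.insert(identity.to_string())
    }
}
```

Because the key carries no per-emission component, a reasoning tick that reaches the same verdict twice produces the same identity, and the second message is dropped by whatever dedupes on it.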
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

ContextHydrator::connect() opens a CNPG pool through EmbeddedSrql (srql path dep, srql AppConfig::from_env), and current_context() hydrates the Context from SRQL (in:devices, in:services) via QueryEngine::execute_query, mapping result rows to domain_model Device/Service. Single-point identity validation: device rows without a canonical sr: uid are skipped (the engine never forks the ID space). Service identity is the composite agent_id:service_type:service_name (Decision 2). Unit-tested row mappers; main.rs connects the hydrator at startup. Feeds 2/3 (JetStream signals.causal.> + cdc.platform.<table> live deltas), broader entity coverage, and endpoint-cluster-summary handling land in 1.2b. Dep: srql = { path = "../srql" }; BUILD.bazel deps //rust/srql:srql_lib.

Verified: cargo check, cargo clippy --all-targets -D warnings, cargo test (6 pass), cargo fmt --check, bazel build //rust/causal-engine:{causal_engine_lib,causal_engine_bin,causal_engine_test} --config=ci (RBE, 1097 actions).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

New `delta` module: parse_state_change parses the cdc.platform.<table> envelope (the Phase-0 StateChangePublisher format) into a typed StateChangeDelta, and apply_delta mutates the in-memory Context for ocsf_devices (is_available/is_managed) and service_state (available). Pure + unit-tested (11 crate tests total). The async JetStream subscriber that maintains a shared Context and applies these deltas between EmbeddedSrql snapshots is the next sub-step (1.2b). No new deps.

Verified: cargo clippy --all-targets -D warnings, cargo test (11 pass), cargo fmt --check, bazel build //rust/causal-engine:{lib,test} --config=ci (RBE).
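As a rough illustration of what applying a typed delta to the in-memory Context might look like, here is a minimal sketch. The field layout of `Context`, `Device`, `Service` and `StateChangeDelta` is assumed, not taken from the crate, and ignoring deltas for unknown entities is also an assumption (on the reasoning that a later snapshot refresh would pick them up).

```rust
use std::collections::HashMap;

#[derive(Debug, Default)]
pub struct Device {
    pub is_available: Option<bool>,
    pub is_managed: Option<bool>,
}

#[derive(Debug, Default)]
pub struct Service {
    pub available: Option<bool>,
}

/// Hypothetical in-memory context keyed by canonical ids.
#[derive(Debug, Default)]
pub struct Context {
    pub devices: HashMap<String, Device>,
    pub services: HashMap<String, Service>,
}

/// Hypothetical typed form of a parsed state-change envelope.
pub enum StateChangeDelta {
    Device { uid: String, is_available: Option<bool>, is_managed: Option<bool> },
    Service { id: String, available: bool },
}

/// Applies one delta in place. Returns true if an entity was updated and
/// false if the entity is not in the Context (assumed to be ignored).
pub fn apply_delta(ctx: &mut Context, delta: &StateChangeDelta) -> bool {
    match delta {
        StateChangeDelta::Device { uid, is_available, is_managed } => match ctx.devices.get_mut(uid) {
            Some(dev) => {
                // Only fields present in the delta overwrite the stored value.
                if let Some(v) = is_available {
                    dev.is_available = Some(*v);
                }
                if let Some(v) = is_managed {
                    dev.is_managed = Some(*v);
                }
                true
            }
            None => false,
        },
        StateChangeDelta::Service { id, available } => match ctx.services.get_mut(id) {
            Some(svc) => {
                svc.available = Some(*available);
                true
            }
            None => false,
        },
    }
}
```

Keeping the function pure over `&mut Context` is what makes it unit-testable without NATS or a database, which matches the "Pure + unit-tested" note in the commit.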
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Push-input wired for V1 (per the input-strategy decision):
- nats.rs: TLS-aware NATS connect helper (mTLS via CAUSAL_ENGINE_NATS_* cert paths) returning the core Client + a JetStream context; used by both the subscriber and the emitter.
- context_hydrator: now holds a shared Arc<RwLock<Context>>, seeded by an initial EmbeddedSrql snapshot on connect, refreshed periodically (refresh()), and read via current_context() (clone under read lock). shared_context() hands the Arc to the subscriber.
- subscriber.rs: best-effort core-NATS subscription to signals.state.> that parses StateChangePublisher envelopes (delta.rs) and applies deltas to the shared Context between refreshes. Durability/correctness comes from the periodic refresh, so no JetStream stream/consumer/ack is needed for V1 (Phase-2 reactivity/scale concern).
- emitter: refactored to Emitter::new(jetstream) over the shared connection.
- main: connect NATS once, spawn the subscriber, run dual cadences (reason tick + slower refresh tick) via tokio::select!.

Config gains nats_ca_file/nats_cert_file/nats_key_file + refresh_interval_ms. Dep: futures 0.3 (StreamExt; already in the workspace lock).

Verified: cargo check, cargo clippy --all-targets -D warnings, cargo test (11), cargo fmt --check, bazel build //rust/causal-engine:{causal_engine_lib,causal_engine_bin,causal_engine_test} --config=ci (RBE).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

snapshot.rs SnapshotStore: atomic JSON persistence of the Context (temp+rename). ContextHydrator::connect(snapshot_path) restores the snapshot to seed the shared Context instantly, then runs a best-effort initial refresh (serves the restored state if CNPG is momentarily down at boot; the refresh tick retries). refresh() persists the Context after each SRQL re-snapshot. Context/Device/Service gain serde derives.
main passes config.snapshot_path; the standalone SnapshotStore in main is removed (the hydrator owns it). CausaloidGraph frozen-state persistence is deferred to the reasoner (Marvin).

Verified: cargo clippy --all-targets -D warnings, cargo test (12, incl. snapshot round-trip), cargo fmt --check, bazel build //rust/causal-engine:{causal_engine_lib,causal_engine_bin,causal_engine_test} --config=ci (RBE).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

First real causaloid: c3_gateway_shared_fate groups unavailable devices by gateway_id and, when >= threshold share a gateway, emits a RootCause verdict for the gateway + Affected verdicts for the devices (the shared gateway is a likelier root cause than each device failing independently). Plain Rust over the Context; maps cleanly onto the God-View 4 buckets. domain_model.Device gains gateway_id (hydrator map_device reads it). Unit-tested (shared-gateway, below-threshold, gatewayless, evaluate-runs-c3). Remaining causaloids (C1/C2/C4/C5/C5b/C6/C7-C13) + risk composition build on this + the ultragraph 0.9 graph layer (next).

Verified: cargo clippy --all-targets -D warnings, cargo test (15), cargo fmt --check, bazel build //rust/causal-engine:{lib,test} --config=ci (RBE).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

refresh() now runs a read-only graph_cypher MATCH over the AGE platform graph and projects CONNECTS_TO/MANAGED_BY/CONTAINS/BACKED_BY/DEPENDS_ON relationships into Context.edges (canonical sr: ids via the Device `id` property), unlocking C1/C2/C4/C5/C5b/C7/C9/C10 against live topology. Best-effort: a projection failure is logged and that tick reasons without edges rather than aborting refresh. parse_topology_edges + edge_kind_from_label are pure and unit-tested against the documented graph_cypher {nodes,edges} wrapper shape (dedup + canonical-id + unknown-label filtering).
Live cypher still needs a CNPG smoke test; links/flows/bgp_routes/operator_rules feeds remain TODO(1.2b) (causaloids no-op when empty).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
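The grouping rule described for c3_gateway_shared_fate in the commits above can be sketched in a few lines. This is a reconstruction from the commit message only: the struct fields, the function signature, and the choice of a BTreeMap (used here so the output order is stable) are assumptions, not the crate's actual types.

```rust
use std::collections::BTreeMap;

/// Hypothetical slice of the domain model needed by the rule.
#[derive(Debug, Clone)]
pub struct Device {
    pub uid: String,
    pub gateway_id: Option<String>,
    pub is_available: bool,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Classification {
    RootCause,
    Affected,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Verdict {
    pub entity: String,
    pub classification: Classification,
}

/// Groups unavailable devices by gateway. When at least `threshold` of them
/// share a gateway, the gateway is reported as RootCause and each of those
/// devices as Affected. Devices without a gateway_id are ignored.
pub fn c3_gateway_shared_fate(devices: &[Device], threshold: usize) -> Vec<Verdict> {
    let mut down_by_gateway: BTreeMap<&str, Vec<&str>> = BTreeMap::new();
    for d in devices.iter().filter(|d| !d.is_available) {
        if let Some(gw) = d.gateway_id.as_deref() {
            down_by_gateway.entry(gw).or_default().push(d.uid.as_str());
        }
    }

    let mut verdicts = Vec::new();
    for (gw, uids) in down_by_gateway {
        if uids.len() >= threshold {
            verdicts.push(Verdict {
                entity: gw.to_string(),
                classification: Classification::RootCause,
            });
            for uid in uids {
                verdicts.push(Verdict {
                    entity: uid.to_string(),
                    classification: Classification::Affected,
                });
            }
        }
    }
    verdicts
}
```

With a threshold of 2, two unavailable devices behind the same gateway yield three verdicts (one RootCause, two Affected), while a single unavailable device behind a gateway, or any device with no gateway_id, yields none.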