Feat/stateful alert rule eng #2644
No reviewers
Labels
No labels
1week
2weeks
Failed compliance check
IP cameras
NATS
Possible security concern
Review effort 1/5
Review effort 2/5
Review effort 3/5
Review effort 4/5
Review effort 5/5
UI
aardvark
accessibility
amd64
api
arm64
auth
back-end
bgp
blog
bug
build
checkers
ci-cd
cleanup
cnpg
codex
core
dependencies
device-management
documentation
duplicate
dusk
ebpf
enhancement
eta 1d
eta 1hr
eta 3d
eta 3hr
feature
fieldsurvey
github_actions
go
good first issue
help wanted
invalid
javascript
k8s
log-collector
mapper
mtr
needs-triage
netflow
network-sweep
observability
oracle
otel
plug-in
proton
python
question
reddit
redhat
research
rperf
rperf-checker
rust
sdk
security
serviceradar-agent
serviceradar-agent-gateway
serviceradar-web
serviceradar-web-ng
siem
snmp
sysmon
topology
ubiquiti
wasm
wontfix
zen-engine
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
carverauto/serviceradar!2644
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "refs/pull/2644/head"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Imported from GitHub pull request.
Original GitHub pull request: #2241
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/pull/2241
Original created: 2026-01-10T07:48:06Z
Original updated: 2026-01-10T20:03:03Z
Original head: carverauto/serviceradar:feat/stateful_alert_rule_eng
Original base: testing
Original merged: 2026-01-10T20:02:59Z by @mfreeman451
User description
IMPORTANT: Please sign the Developer Certificate of Origin
Thank you for your contribution to ServiceRadar. Please note, when contributing, the developer must include
a DCO sign-off statement indicating the DCO acceptance in one commit message. Here
is an example DCO Signed-off-by line in a commit message:
Describe your changes
Issue ticket number and link
Code checklist before requesting a review
PR Type
Enhancement
Description
Major refactoring from poller-based to gateway-based architecture with comprehensive terminology updates across Go and Elixir services
Agent push mode implementation: Refactored agent startup with simplified config loading, graceful shutdown, and push-based communication with gateway via new
runPushMode()functionStateful alert rule evaluation engine: Implemented GenServer-based alert engine with bucketed time-window analysis, ETS state snapshots, cooldown periods, and OCSF event generation
Agent gateway gRPC server: Multi-tenant secure server with mTLS certificate validation, agent enrollment, heartbeat handling, and service status processing
NATS infrastructure enhancements: Added account service initialization, credentials file support across multiple services (trapd, zen, otel, flowgger), and subject pattern matching with wildcard support
Multi-tenant process management: Implemented per-tenant Horde registries and DynamicSupervisors for process isolation via
TenantRegistrymoduleEdge onboarding and certificate generation: Comprehensive onboarding packages context with component certificate generation and X.509 CA management
Poll orchestration: New module for distributed service check execution with gateway discovery and PollJob state management
SPIFFE/SPIRE integration: Added cluster authentication support with SSL/TLS configuration and certificate verification
Database schema migrations: Initial tenant schema with 40+ tables and OpenTelemetry hypertables with TimescaleDB support
Simplified message processing: Zen consumer refactored to output raw JSON instead of CloudEvents format
CLI and service naming updates: Renamed
update-pollertoupdate-gateway, addednats-bootstrapandadminsubcommandsRemoved legacy components: Deleted poller and sync services with associated Docker configurations and setup scripts
Diagram Walkthrough
File Walkthrough
28 files
main.go
Agent refactored to push mode with simplified config loadingcmd/agent/main.go
loading with embedded defaults fallback
loadConfig()functionrunPushMode()function for push-based agentcommunication with gateway
protection
Versionvariable for build-time injection and agentenrollment reporting
main.go
NATS account service initialization and configurationcmd/data-services/main.go
with fallback to config file
NATSAccountServiceServerin gRPC service registrationmain.go
CLI subcommands refactored for gateway and NATS operationscmd/cli/main.go
update-pollersubcommand toupdate-gatewaynats-bootstrapsubcommand for NATS operationsadminsubcommand routing withdispatchAdminCommand()functionRunAdminNatsCommand()dispatch for admin NATS operationsmain.go
SNMP service renamed from poller to gatewaycmd/checkers/snmp/main.go
SNMPPollerServicetoSNMPGatewayServicePollerstruct toGatewaystructapp.go
gRPC service registration renamed to agent gatewaycmd/core/app/app.go
RegisterPollerServiceServertoRegisterAgentGatewayServiceServerconfig.rs
NATS credentials and subject pattern matching supportcmd/consumers/zen/src/config.rs
nats_creds_fileoptional configuration field with validationnats_creds_path()method to resolve credentials file pathwith support for absolute and relative paths
wildcard matching support
subject_matches()function for NATS subject pattern matchingwith
*and>wildcardsnats_output.rs
OTEL NATS output with credentials and logs subject supportcmd/otel/src/nats_output.rs
logs_subjectoptional configuration for dedicated logs subjectcreds_fileoptional field for NATS credentials authenticationconfig.rs
OTEL configuration with NATS credentials supportcmd/otel/src/config.rs
logs_subjectandcreds_fileoptional TOML configuration fieldsserver.rs
Sysmon checker field renamed from poller to gatewaycmd/checkers/sysmon/src/server.rs
poller_idfield togateway_idin GetStatus and GetResults logmessages
poller_idtogateway_idin status andresults responses
message_processor.rs
Message processor simplified to output raw JSONcmd/consumers/zen/src/message_processor.rs
(EventBuilder, Uuid, Url)
CloudEvents format
main.rs
Trapd NATS credentials and gateway field supportcmd/trapd/src/main.rs
nats_creds_path()methodNATS connections
poller_idtogateway_idin status and results responsesnats_output.rs
Flowgger NATS output credentials supportcmd/flowgger/src/flowgger/output/nats_output.rs
creds_fileoptional configuration field for NATS credentialsempty string validation
config.rs
Trapd configuration with NATS credentials supportcmd/trapd/src/config.rs
nats_creds_fileoptional configuration field with validationnats_creds_path()method to resolve credentials file pathgrpc_server.rs
Zen gRPC server field renamed to gatewaycmd/consumers/zen/src/grpc_server.rs
poller_idfield togateway_idin GetStatus and GetResultsresponses
nats.rs
Zen NATS connection with credentials supportcmd/consumers/zen/src/nats.rs
nats_creds_path()methodserver.rs
RPerf checker response structure updated with gateway fieldcmd/checkers/rperf-client/src/server.rs
gateway_idfield to GetStatus and GetResults response structuressetup.rs
OTEL setup logging enhancementcmd/otel/src/setup.rs
onboarding_packages.ex
Edge onboarding packages context with certificate generationelixir/serviceradar_core/lib/serviceradar/edge/onboarding_packages.ex
packages
and soft-delete
stateful_alert_engine.ex
Stateful alert rule evaluation engine with bucketingelixir/serviceradar_core/lib/serviceradar/observability/stateful_alert_engine.ex
log/event rules with bucketed time-window analysis
supporting cooldown periods and renotification
filtering by subject, service, severity, body, and attributes
recording with OCSF event generation
agent_gateway_server.ex
Agent gateway gRPC server with mTLS multi-tenancyelixir/serviceradar_agent_gateway/lib/serviceradar_agent_gateway/agent_gateway_server.ex
multi-tenant security via mTLS certificate validation
component identity enforcement
forwarding to core cluster
spoofing through server-side metadata
account_client.ex
NATS account management gRPC clientelixir/serviceradar_core/lib/serviceradar/nats/account_client.ex
JWT-based credentials
and operator bootstrapping
and comprehensive error handling
and stream exports/imports
agent.ex
Agent resource with state machine and capabilitieselixir/serviceradar_core/lib/serviceradar/infrastructure/agent.ex
state machine lifecycle
with metadata and health tracking
disconnected, unavailable) with event publishing
display/status information
alert_generator.ex
Alert generation service for monitoring eventselixir/serviceradar_core/lib/serviceradar/monitoring/alert_generator.ex
(service state changes, device availability, gateway/agent health,
metric thresholds, stats anomalies)
service_down,service_recovered,device_offline,gateway_offline,agent_offline,threshold_violation,from_event,stats_anomalyvia
WebhookNotifierevent processing with support for alert configuration overrides
tenant_registry.ex
Multi-tenant process registry and supervisor managementelixir/serviceradar_core/lib/serviceradar/cluster/tenant_registry.ex
DynamicSupervisors for multi-tenant process isolation
ensure_registry,start_tenant_infrastructure,stop_tenant_infrastructure)convenience functions for gateways and agents
TenantSupervisorchild module managing per-tenantHorde.Registry and Horde.DynamicSupervisor instances
alias_events.ex
Device alias lifecycle event tracking systemelixir/serviceradar_core/lib/serviceradar/identity/alias_events.ex
core
AliasRecordstruct for parsing and comparing device aliasmetadata (service IDs, IPs, collectors)
build_lifecycle_eventsto detect alias changes and generateaudit events
process_and_persistfunction to record alias sightings andmanage state transitions
event metadata building
generator.ex
X.509 certificate generation for tenant CAs and edgeelixir/serviceradar_core/lib/serviceradar/edge/tenant_ca/generator.ex
edge components
generate_tenant_cato create long-lived intermediate CAcertificates signed by platform root CA
generate_component_certto create short-lived edgecomponent certificates with SPIFFE IDs
extraction, tenant extraction from certificates
:public_keymodule for all cryptographic operations withproper OID and extension handling
spiffe.ex
SPIFFE/SPIRE integration for cluster authenticationelixir/serviceradar_core/lib/serviceradar/spiffe.ex
authentication
client, and server connections
domain validation
rotation monitoring
certificate verification callbacks
poll_orchestrator.ex
Poll orchestration for distributed service checkselixir/serviceradar_core/lib/serviceradar/monitoring/poll_orchestrator.ex
across distributed cluster
execute_scheduleto create PollJob records and dispatchchecks to available gateways
execute_schedule_asyncfor asynchronous job execution withPubSub result broadcasting
multiple assignment modes (
:any,:partition,:domain,:specific)error handling and result aggregation
2 files
main.go
API documentation terminology updatecmd/core/main.go
main.go
Config sync tool documentation updatecmd/tools/config-sync/main.go
1 files
20260107043446_initial_schema.exs
Initial tenant schema migration with core entitieselixir/serviceradar_core/priv/repo/tenant_migrations/20260107043446_initial_schema.exs
multi-tenant system
credentials, edge onboarding, and monitoring
1 files
test_helper.exs
Agent gateway test helper initializationelixir/serviceradar_agent_gateway/test/test_helper.exs
AgentRegistry
1 files
20260108005841_add_otel_tables.exs
OpenTelemetry observability tables with TimescaleDBelixir/serviceradar_core/priv/repo/tenant_migrations/20260108005841_add_otel_tables.exs
(logs, metrics, traces)
comprehensive indexing strategies
filtering for 7-day retention
time-series queries
101 files
Imported GitHub PR comment.
Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2241#issuecomment-3732116821
Original created: 2026-01-10T07:49:23Z
PR Compliance Guide 🔍
Below is a summary of compliance checks for this PR:
Sensitive info in logs
Description: Debug logging prints the configured NATS credentials file path (
nats.creds_file) which canexpose sensitive filesystem layout and facilitate credential targeting if logs are
accessible to untrusted parties.
setup.rs [48-53]
Referred Code
🎫 No ticket provided
Codebase context is not defined
Follow the guide to enable codebase context checks.
Generic: Meaningful Naming and Self-Documenting Code
Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting
Status: Passed
Generic: Robust Error Handling and Edge Case Management
Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation
Status: Passed
Generic: Secure Error Handling
Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.
Status: Passed
Generic: Secure Logging Practices
Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.
Status:
Sensitive data in logs: New logging prints potentially sensitive configuration and identity material (e.g.,
allowed client identities and credentials/config file paths) which can leak
security-relevant details into logs.
Referred Code
Generic: Comprehensive Audit Trails
Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.
Status:
Missing audit context: The new NATS account service initialization/bootstrapping path logs operational messages
but does not record an auditable actor identity/outcome trail for security-relevant
actions, which may be required depending on how these endpoints are used.
Referred Code
Generic: Security-First Input Validation and Data Handling
Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities
Status:
Logs unvalidated details: The new request logging includes
req.detailsdirectly which may contain user-controlled orsensitive content and should be validated/redacted before logging based on the data
contract.
Referred Code
Compliance status legend
🟢 - Fully Compliant🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label
Imported GitHub PR review comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/pull/2241#discussion_r2678424261
Original created: 2026-01-10T07:50:37Z
Original path: docs/docs/agents.md
Original line: 171
this needs to get checked
Imported GitHub PR comment.
Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2241#issuecomment-3732118005
Original created: 2026-01-10T07:50:47Z
PR Code Suggestions ✨
Explore these optional code suggestions:
Fix incorrect NATS wildcard matching
Fix a bug in the
subject_matchesfunction for NATS wildcard matching. Add acheck to ensure the
>wildcard only matches when there is at least one tokenremaining in the subject.
cmd/consumers/zen/src/config.rs [290-315]
Suggestion importance[1-10]: 8
__
Why: The suggestion correctly identifies and fixes a bug in the NATS subject matching logic for the
>wildcard, preventing incorrect matches and ensuring behavior aligns with the NATS specification.Avoid using current time for records
Modify
record_timestampto return an error and log a warning if a timestamp ismissing from a record, instead of defaulting to the current time, to prevent
inaccurate alert evaluations.
elixir/serviceradar_core/lib/serviceradar/observability/stateful_alert_engine.ex [792-803]
[To ensure code accuracy, apply this suggestion manually]Suggestion importance[1-10]: 8
__
Why: The suggestion correctly identifies that using
DateTime.utc_now()as a fallback can cause incorrect alert evaluation for delayed events, which is a critical flaw in a stateful, time-window-based alerting system. The proposed change significantly improves the correctness and reliability of the core alerting logic.Fix deduplication by timestamp
Correct the usage of
Enum.max_by/3indeduplicate_by_device_id/1by providing amapping function to handle
niltimestamps properly.elixir/serviceradar_core/lib/serviceradar/identity/alias_events.ex [513-525]
[To ensure code accuracy, apply this suggestion manually]Suggestion importance[1-10]: 8
__
Why: The suggestion correctly identifies that the comparator function in
Enum.max_by/3is being used incorrectly and provides a valid, idiomatic fix that correctly finds the update with the latest timestamp.Fix race condition in shutdown
Fix a race condition in the shutdown goroutine. Remove the
selectblock thatreads from
errChanto prevent it from consuming a critical error that should behandled by the main error path.
cmd/agent/main.go [214-218]
Suggestion importance[1-10]: 7
__
Why: The suggestion correctly identifies a race condition where a critical error from the
pushLoopcould be consumed and ignored during shutdown, but the proposed fix is also flawed as it still consumes the error.Use a transaction for atomicity
Wrap the database alert creation and webhook notification in an
Ash.transactionto ensure both operations succeed or fail together, preventing lost
notifications.
elixir/serviceradar_core/lib/serviceradar/monitoring/alert_generator.ex [413-436]
[To ensure code accuracy, apply this suggestion manually]Suggestion importance[1-10]: 7
__
Why: The suggestion correctly identifies a potential atomicity issue and proposes using a transaction, which is a valid approach to ensure the alert and notification are coupled, improving reliability.
Return error instead of raising exception
Change
resolve_tenantto return an error tuple like{:error, :tenant_not_found}instead of raising an
ArgumentErrorwhen a tenant identifier is missing. Thisallows callers to handle the error gracefully.
elixir/serviceradar_core/lib/serviceradar/edge/onboarding_packages.ex [448-461]
[To ensure code accuracy, apply this suggestion manually]Suggestion importance[1-10]: 7
__
Why: The suggestion correctly points out that raising an exception for a missing tenant is poor practice in a web context. Returning an error tuple allows for graceful error handling and improves the application's robustness, which is a valuable improvement.
Shuffle nodes before making RPC calls
In
core_call, shuffle the list ofcore_nodesbefore iterating through them forRPC calls. This improves resilience by distributing load and preventing repeated
calls to a single slow or unresponsive node.
elixir/serviceradar_agent_gateway/lib/serviceradar_agent_gateway/agent_gateway_server.ex [856-872]
[To ensure code accuracy, apply this suggestion manually]Suggestion importance[1-10]: 7
__
Why: The suggestion provides a valid improvement for resilience by shuffling nodes to avoid repeatedly hitting a slow or unresponsive node first. This is a good practice for distributed systems to improve load distribution and fault tolerance.
Retain previous rules on load error
In
load_rules, modify therescueblock to fall back to the existingstate.rulesinstead of an empty list when a database error occurs. This prevents transient
errors from disabling all alerts.
elixir/serviceradar_core/lib/serviceradar/observability/stateful_alert_engine.ex [154-167]
[To ensure code accuracy, apply this suggestion manually]Suggestion importance[1-10]: 7
__
Why: The suggestion improves the system's resilience by preventing a transient database error from disabling all alerts. Retaining the last known set of rules ensures the alerting system remains functional during temporary outages, which is a significant reliability improvement.
Log errors for silent failures
Add logging to the error case in
handle_existing_alias/4to make failures inDeviceAliasState.record_sightingvisible for debugging.elixir/serviceradar_core/lib/serviceradar/identity/alias_events.ex [438-473]
[To ensure code accuracy, apply this suggestion manually]Suggestion importance[1-10]: 7
__
Why: The suggestion correctly points out that an error from
DeviceAliasState.record_sightingis silently ignored, and adding logging would improve observability and debugging for this failure case.Cache loaded root CA for performance
Cache the root CA certificate and key using
:persistent_terminload_root_ca/0to avoid repeated file I/O and improve performance.
elixir/serviceradar_core/lib/serviceradar/edge/tenant_ca/generator.ex [284-298]
[To ensure code accuracy, apply this suggestion manually]Suggestion importance[1-10]: 6
__
Why: The suggestion provides a valid performance optimization by caching file reads using
:persistent_term, which is a good pattern for frequently accessed, rarely changing configuration data.Imported GitHub PR review comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/pull/2241#discussion_r2678425501
Original created: 2026-01-10T07:53:23Z
Original path: elixir/serviceradar_core/lib/serviceradar/event_writer/pipeline.ex
Original line: 83
is this actually being used anymore? shouldnt this all be replaced by db-event-writer?
Imported GitHub PR comment.
Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2241#issuecomment-3733444870
Original created: 2026-01-10T20:01:15Z
CI Feedback 🧐
A test triggered by this PR failed. Here is an AI-generated analysis of the failure:
Action: test-rust (rust/srql)
Failed stage: Run Tests [❌]
Failed test name: srql_api_queries
Failure summary:
The action failed because the integration test
srql_api_queriescrashed while trying to connect tothe remote PostgreSQL fixture.
- The panic occurred in
rust/srql/tests/support/harness.rs:74:10whenthe test tried to acquire the remote fixture lock and received a database error.
- The underlying
cause is a PostgreSQL access policy/TLS mismatch:
FATAL: pg_hba.conf rejects connection for host"10.42.68.65", user "srql_hydra", database "postgres", no encryption.This indicates the DB
server’s
pg_hba.confdoes not allow connections from the GitHub runner IP without encryption (ordoes not allow that host/user/db combination at all), so the test cannot proceed.
Relevant error logs: