Feat/rule builder UI #2646
No reviewers
Labels
No labels
1week
2weeks
Failed compliance check
IP cameras
NATS
Possible security concern
Review effort 1/5
Review effort 2/5
Review effort 3/5
Review effort 4/5
Review effort 5/5
UI
aardvark
accessibility
amd64
api
arm64
auth
back-end
bgp
blog
bug
build
checkers
ci-cd
cleanup
cnpg
codex
core
dependencies
device-management
documentation
duplicate
dusk
ebpf
enhancement
eta 1d
eta 1hr
eta 3d
eta 3hr
feature
fieldsurvey
github_actions
go
good first issue
help wanted
invalid
javascript
k8s
log-collector
mapper
mtr
needs-triage
netflow
network-sweep
observability
oracle
otel
plug-in
proton
python
question
reddit
redhat
research
rperf
rperf-checker
rust
sdk
security
serviceradar-agent
serviceradar-agent-gateway
serviceradar-web
serviceradar-web-ng
siem
snmp
sysmon
topology
ubiquiti
wasm
wontfix
zen-engine
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
carverauto/serviceradar!2646
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "refs/pull/2646/head"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Imported from GitHub pull request.
Original GitHub pull request: #2243
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/pull/2243
Original created: 2026-01-11T05:41:42Z
Original updated: 2026-01-11T08:22:48Z
Original head: carverauto/serviceradar:feat/rule-builder-ui
Original base: testing
Original merged: 2026-01-11T08:22:47Z by @mfreeman451
User description
IMPORTANT: Please sign the Developer Certificate of Origin
Thank you for your contribution to ServiceRadar. Please note, when contributing, the developer must include
a DCO sign-off statement indicating the DCO acceptance in one commit message. Here
is an example DCO Signed-off-by line in a commit message:
Describe your changes
Issue ticket number and link
Code checklist before requesting a review
PR Type
Enhancement
Description
Major architectural refactoring transitioning from poller-based to gateway-based architecture with comprehensive platform modernization:
Core Architecture Changes:
Refactored agent from KV-based config management to push mode with file-based configuration and direct gateway connection
Implemented gRPC agent gateway server with mTLS multi-tenant security for receiving agent status pushes
Removed legacy poller, sync, and edge onboarding components; replaced with modern push-based architecture
NATS Infrastructure:
Added NATS account service initialization with operator key management and gRPC server registration
Implemented comprehensive NATS credentials file support across all consumers (zen, trapd, flowgger, otel)
Added wildcard subject pattern matching (
*and>) for flexible event routingUpdated default NATS subjects and added configurable logs subject support
Multi-Tenant Platform Foundation:
Created comprehensive database schema migration with 40+ tables for user management, NATS infrastructure, device discovery, monitoring, and alerts
Implemented multi-tenant process registry and supervisor management via DynamicSupervisor with Horde registries
Added SPIFFE/SPIRE integration for distributed cluster security with X.509 certificate generation for tenant CAs
Implemented per-tenant CA and edge component certificate generation with SPIFFE URIs
Monitoring & Alerting:
Implemented stateful alert engine with bucketed time-window aggregation for log/event rules
Added alert generation service for monitoring events with severity levels and webhook notifications
Implemented poll orchestrator for distributed service check execution across cluster gateways
Added device alias lifecycle event tracking system for audit and alerting
Infrastructure Management:
Defined Agent resource with OCSF v1.4.0 state machine and capability definitions
Implemented edge onboarding packages with encrypted token generation and component certificate signing
Added NATS account management gRPC client for tenant account and credential lifecycle
Configured agent gateway runtime with cluster topology strategies and mTLS security
Terminology Updates:
Renamed all references from "poller" to "gateway" across Go services (SNMP, sysmon, rperf-client)
Updated CLI commands: renamed
update-pollertoupdate-gateway, addednats-bootstrapandadminsubcommandsUpdated API documentation and help text to reflect new gateway terminology
Message Processing Simplification:
Removed CloudEvents wrapping from zen consumer message processing
Simplified to direct JSON publishing of context data instead of wrapped events
Diagram Walkthrough
File Walkthrough
27 files
main.go
Refactor agent to push mode with file-based configcmd/agent/main.go
file-based configuration loading
replaced with push mode architecture
loadConfig()function for JSON config parsing with embeddeddefaults fallback
runPushMode()function handling gateway connection, pushloop, and graceful shutdown with signal handling
missing gateway address
main.go
Initialize NATS account service with operator supportcmd/data-services/main.go
environment variables with config file fallback
NATSAccountServicegRPC server when configuredand resolver configuration
main.go
Add NATS bootstrap and admin command supportcmd/cli/main.go
update-pollersubcommand toupdate-gatewaynats-bootstrapsubcommand for NATS operator initializationadminsubcommand dispatcher withnatsadmin resource routingdispatchAdminCommand()helper function for admin subcommandrouting
main.go
Rename SNMP poller to gateway terminologycmd/checkers/snmp/main.go
SNMPPollerServicetoSNMPGatewayServicePollerstruct toGatewaystructapp.go
Rename poller service to agent gateway servicecmd/core/app/app.go
RegisterPollerServiceServertoRegisterAgentGatewayServiceServerconfig.rs
Add NATS credentials and wildcard subject matchingcmd/consumers/zen/src/config.rs
nats_creds_fileoptional configuration field for NATScredentials
nats_creds_path()method resolving credentials file path withsupport for absolute/relative paths and cert directory
*and>) via newsubject_matches()functioncredentials file validation
config.rs
Add NATS credentials and logs subject configurationcmd/otel/src/config.rs
logs_subjectoptional field for dedicated logs subjectconfiguration
creds_fileoptional field for NATS credentials file pathevents.oteltootelnats_output.rs
Implement NATS credentials and configurable logs subjectcmd/otel/src/nats_output.rs
logs_subjectandcreds_filefields toNATSConfigstructlogs_subjectwith fallback
events.oteltootelserver.rs
Rename poller_id to gateway_id in sysmon servicecmd/checkers/sysmon/src/server.rs
poller_idfield togateway_idin GetStatus and GetResultsresponse logging and response building
message_processor.rs
Simplify message processing to direct JSON publishingcmd/consumers/zen/src/message_processor.rs
of wrapped CloudEvents
main.rs
Add NATS credentials support and rename poller_idcmd/trapd/src/main.rs
nats_creds_path()methodfor both secure and non-secure modes
poller_idtogateway_idin GetStatus and GetResults responsesnats_output.rs
Add NATS credentials file support to flowggercmd/flowgger/src/flowgger/output/nats_output.rs
creds_fileoptional field toNATSConfigstructconfig.rs
Add NATS credentials configuration to trapdcmd/trapd/src/config.rs
nats_creds_fileoptional configuration fieldnats_creds_path()method resolving credentials file path withsecurity config support
grpc_server.rs
Rename poller_id to gateway_id in zen gRPC servicecmd/consumers/zen/src/grpc_server.rs
poller_idfield togateway_idin GetStatus and GetResultsresponse building
nats.rs
Add NATS credentials file support to zen consumercmd/consumers/zen/src/nats.rs
nats_creds_path()server.rs
Add gateway_id field to rperf service responsescmd/checkers/rperf-client/src/server.rs
gateway_idfield to GetStatus and GetResults response buildingaccount_client.ex
Implement NATS account management gRPC clientelixir/serviceradar_core/lib/serviceradar/nats/account_client.ex
NATSAccountServicewith accountand credential management
credentials, signing account JWTs, and bootstrapping operator
and comprehensive error handling
credential expiration
stateful_alert_engine.ex
Stateful alert engine with bucketed rule evaluationelixir/serviceradar_core/lib/serviceradar/observability/stateful_alert_engine.ex
log/event rules with bucketed time-window aggregation
handles threshold-based firing/recovery logic
filtering by subject, service name, severity, and body content
recording for comprehensive alert lifecycle management
agent_gateway_server.ex
gRPC agent gateway server with mTLS multi-tenant securityelixir/serviceradar_agent_gateway/lib/serviceradar_agent_gateway/agent_gateway_server.ex
updates with multi-tenant mTLS security
component identity to prevent spoofing
processing with comprehensive validation and error handling
to core cluster for processing
onboarding_packages.ex
Edge onboarding packages with certificate generationelixir/serviceradar_core/lib/serviceradar/edge/onboarding_packages.ex
token generation and delivery workflows
optional component certificates signed by tenant CA
verification, revocation, and soft-delete with event recording
certificate bundles for secure agent onboarding
agent.ex
Agent resource with OCSF state machine and capabilitieselixir/serviceradar_core/lib/serviceradar/infrastructure/agent.ex
managing Go agents running on monitored hosts
(connecting, connected, degraded, disconnected, unavailable)
capabilities (ICMP, TCP, HTTP, gRPC, DNS, Process, SNMP)
management, and heartbeat operations with tenant isolation policies
alert_generator.ex
Alert generation service for monitoring eventselixir/serviceradar_core/lib/serviceradar/monitoring/alert_generator.ex
changes, device availability, metric violations, etc.)
notifications
persistent_term
lifecycle
tenant_registry.ex
Multi-tenant process registry and supervisor managementelixir/serviceradar_core/lib/serviceradar/cluster/tenant_registry.ex
supervisors for multi-tenant process isolation
process management
discovery
alias_events.ex
Device alias lifecycle event tracking systemelixir/serviceradar_core/lib/serviceradar/identity/alias_events.ex
IPs, collectors)
AliasRecordstruct for parsing and comparing alias metadataaudit/alerting
DeviceAliasState resource
generator.ex
X.509 certificate generation for tenant CAselixir/serviceradar_core/lib/serviceradar/edge/tenant_ca/generator.ex
edge components
component certificates (1-year validity)
from certificate CNs
spiffe.ex
SPIFFE/SPIRE integration for distributed clusterelixir/serviceradar_core/lib/serviceradar/spiffe.ex
and verification
client/server connections
capabilities
poll_orchestrator.ex
Poll execution orchestrator for service checkselixir/serviceradar_core/lib/serviceradar/monitoring/poll_orchestrator.ex
across the cluster
via AshStateMachine
using location-transparent PIDs
specific) and async execution
3 files
main.go
Update API documentation terminologycmd/core/main.go
main.go
Update config-sync role documentationcmd/tools/config-sync/main.go
setup.rs
Add NATS credentials debug loggingcmd/otel/src/setup.rs
1 files
20260107043446_initial_schema.exs
Add initial tenant database schema migrationelixir/serviceradar_core/priv/repo/tenant_migrations/20260107043446_initial_schema.exs
multi-tenant platform
discovery, monitoring, alerts, and onboarding
and data integrity
certificates)
1 files
runtime.exs
Agent gateway runtime configuration with cluster strategieselixir/serviceradar_agent_gateway/config/runtime.exs
Gossip) for agent gateway node discovery
with trust domain and certificate paths
enabling cluster coordination with core nodes
aggregation across the cluster
101 files
Imported GitHub PR comment.
Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2243#issuecomment-3734057887
Original created: 2026-01-11T05:43:17Z
PR Compliance Guide 🔍
Below is a summary of compliance checks for this PR:
Arbitrary file write
Description: Environment-variable overrides (
NATS_OPERATOR_CONFIG_PATH,NATS_RESOLVER_PATH) can directthe service to write operator/resolver config files to arbitrary filesystem locations via
WriteOperatorConfig(), which is a potential arbitrary file-write risk if an attacker caninfluence the process environment or config.
main.go [107-129]
Referred Code
Sensitive path logging
Description: Debug logging prints the configured NATS credentials file path (
nats.creds_file), whichcan expose sensitive filesystem layout/secret locations to anyone with access to logs.
setup.rs [48-52]
Referred Code
🎫 No ticket provided
Codebase context is not defined
Follow the guide to enable codebase context checks.
Generic: Meaningful Naming and Self-Documenting Code
Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting
Status: Passed
Generic: Robust Error Handling and Edge Case Management
Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation
Status: Passed
Generic: Secure Error Handling
Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.
Status: Passed
Generic: Secure Logging Practices
Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.
Status:
Potential sensitive logs: New
info!logs printreq.detailsverbatim, which may contain sensitive information and isnot redacted or structured for safe auditing.
Referred Code
Generic: Comprehensive Audit Trails
Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.
Status:
Audit coverage unclear: The diff shows some operational logging (agent enrollment/config/status), but it is not
verifiable from the provided hunks whether all critical security/data actions across the
refactor have complete audit logging with consistent actor context.
Referred Code
Generic: Security-First Input Validation and Data Handling
Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities
Status:
AuthZ/validation scope: The gateway server includes strong validation and mTLS-based tenant resolution, but full
verification of end-to-end authorization and secure handling across newly added/modified
files is not possible from the partial diff provided.
Referred Code
Compliance status legend
🟢 - Fully Compliant🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label
Imported GitHub PR comment.
Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2243#issuecomment-3734058682
Original created: 2026-01-11T05:44:34Z
PR Code Suggestions ✨
Explore these optional code suggestions:
Fix race condition in shutdown
Refactor the shutdown goroutine to prevent a race condition by ensuring the push
loop is fully stopped before stopping the server.
cmd/agent/main.go [207-223]
Suggestion importance[1-10]: 9
__
Why: The suggestion correctly identifies a race condition in the shutdown logic that could prevent services from being stopped correctly, and the proposed fix resolves this critical bug.
Use correct max_by for deduplication
Correct the implementation of
deduplicate_by_device_id/1by replacing theincorrect usage of
Enum.max_by/3withEnum.max_by/2. Convert timestamps to acomparable integer value to correctly find the most recent update.
elixir/serviceradar_core/lib/serviceradar/identity/alias_events.ex [513-525]
[To ensure code accuracy, apply this suggestion manually]Suggestion importance[1-10]: 9
__
Why: The suggestion correctly identifies a bug in the usage of
Enum.max_by/3, which would not sort as intended. The proposed fix usingEnum.max_by/2with a transformation function is correct and resolves the bug, ensuring the latest update is always selected.Ensure notifications are sent transactionally
Refactor the alert notification logic to use an
Ash.Notifier. This ensures thatwebhook notifications are sent only after the database transaction to create the
alert has successfully committed, preventing notifications for rolled-back
transactions.
elixir/serviceradar_core/lib/serviceradar/monitoring/alert_generator.ex [413-436]
[To ensure code accuracy, apply this suggestion manually]Suggestion importance[1-10]: 8
__
Why: This suggestion correctly identifies a data consistency issue where a notification could be sent for a database record that was never committed. The proposed solution to use an
Ash.Notifieris the correct, framework-idiomatic way to ensure transactional integrity for side effects.Include tenant identifiers in CA data
Update the return map of
generate_tenant_ca/2to includetenant_idandtenant_slug. This is required by the downstreamgenerate_component_cert/5function and will prevent a runtime error.
elixir/serviceradar_core/lib/serviceradar/edge/tenant_ca/generator.ex [106-116]
[To ensure code accuracy, apply this suggestion manually]Suggestion importance[1-10]: 8
__
Why: The suggestion correctly identifies that
generate_component_cert/5requirestenant_idfrom thetenant_camap, which is missing from the return value ofgenerate_tenant_ca/2. Adding it prevents a runtime error and fixes a clear bug in the data flow between functions.Improve fault tolerance in RPC calls
Modify
core_callto continue iterating through available core nodes onapplication-level errors, rather than halting immediately, to improve fault
tolerance.
elixir/serviceradar_agent_gateway/lib/serviceradar_agent_gateway/agent_gateway_server.ex [856-872]
[To ensure code accuracy, apply this suggestion manually]Suggestion importance[1-10]: 7
__
Why: The suggestion correctly points out a potential issue where
core_callmight fail prematurely if the first node returns an application-level error, without trying other nodes. The proposed change improves fault tolerance by continuing to the next node unless a definitive success is returned.Handle both string and atom keys
Update
get_nested_valueto handle both string and atom keys for nested maplookups to improve robustness against varying key types.
elixir/serviceradar_core/lib/serviceradar/observability/stateful_alert_engine.ex [739-745]
[To ensure code accuracy, apply this suggestion manually]Suggestion importance[1-10]: 6
__
Why: The suggestion correctly identifies that
get_nested_valueonly handles string keys, while other parts of the code handle both string and atom keys. The proposed change makes the matching logic more robust and consistent with the rest of the file.Prevent potential atom exhaustion vulnerability
To prevent a potential atom exhaustion Denial of Service (DoS) vulnerability,
replace the use of
String.to_existing_atom/1with a lookup in a string-keyed mapfor capability definitions.
elixir/serviceradar_core/lib/serviceradar/infrastructure/agent.ex [70-76]
[To ensure code accuracy, apply this suggestion manually]Suggestion importance[1-10]: 9
__
Why: The suggestion correctly identifies a significant security vulnerability (atom exhaustion DoS) and provides a robust, idiomatic fix by using a string-keyed map instead of converting external input to atoms.
Throttle rule reload on error
In
load_rules, updaterules_loaded_ateven on failure to prevent the engine fromretrying the load on every evaluation and flooding logs.
elixir/serviceradar_core/lib/serviceradar/observability/stateful_alert_engine.ex [154-167]
[To ensure code accuracy, apply this suggestion manually]Suggestion importance[1-10]: 7
__
Why: The suggestion correctly identifies that a failure to load rules will cause a retry on every subsequent evaluation, potentially flooding logs. Updating
rules_loaded_aton failure introduces a cooldown period, which is a sensible improvement for error handling and system stability.Avoid using internal process dictionary
Replace the use of the internal
:"$ancestors"process dictionary key with thepublic
Supervisor.which_children/1API to find a process's parent supervisor.This makes the implementation more robust and less likely to break with future
OTP updates.
elixir/serviceradar_core/lib/serviceradar/cluster/tenant_registry.ex [274-302]
[To ensure code accuracy, apply this suggestion manually]Suggestion importance[1-10]: 7
__
Why: The suggestion correctly points out the fragility of relying on the undocumented
:"$ancestors"process dictionary key and proposes a more robust solution using public APIs, which improves code maintainability and resilience to OTP updates.