Updates/sync service ng #2632
No reviewers
Labels
No labels
1week
2weeks
Failed compliance check
IP cameras
NATS
Possible security concern
Review effort 1/5
Review effort 2/5
Review effort 3/5
Review effort 4/5
Review effort 5/5
UI
aardvark
accessibility
amd64
api
arm64
auth
back-end
bgp
blog
bug
build
checkers
ci-cd
cleanup
cnpg
codex
core
dependencies
device-management
documentation
duplicate
dusk
ebpf
enhancement
eta 1d
eta 1hr
eta 3d
eta 3hr
feature
fieldsurvey
github_actions
go
good first issue
help wanted
invalid
javascript
k8s
log-collector
mapper
mtr
needs-triage
netflow
network-sweep
observability
oracle
otel
plug-in
proton
python
question
reddit
redhat
research
rperf
rperf-checker
rust
sdk
security
serviceradar-agent
serviceradar-agent-gateway
serviceradar-web
serviceradar-web-ng
siem
snmp
sysmon
topology
ubiquiti
wasm
wontfix
zen-engine
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
carverauto/serviceradar!2632
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "refs/pull/2632/head"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Imported from GitHub pull request.
Original GitHub pull request: #2225
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/pull/2225
Original created: 2026-01-05T21:17:13Z
Original updated: 2026-01-06T09:05:56Z
Original head: carverauto/serviceradar:updates/sync-service-ng
Original base: testing
Original merged: 2026-01-06T09:05:54Z by @mfreeman451
User description
IMPORTANT: Please sign the Developer Certificate of Origin
Thank you for your contribution to ServiceRadar. Please note, when contributing, the developer must include
a DCO sign-off statement indicating the DCO acceptance in one commit message. Here
is an example DCO Signed-off-by line in a commit message:
Describe your changes
Issue ticket number and link
Code checklist before requesting a review
PR Type
Enhancement
Description
Major architectural refactoring from gRPC server-based sync service to push-mode agent gateway communication model with comprehensive multi-tenant support:
Core Architecture Changes:
Replaced centralized sync service with distributed push-mode agents communicating directly to agent gateway
Refactored Go agent (
cmd/agent/main.go) from gRPC server to push-mode client with signal handling and graceful shutdownImplemented agent gateway gRPC server in Elixir with mTLS certificate validation and multi-tenant isolation
NATS Integration Enhancements:
Added NATS credentials file support across all services (trapd, zen consumer, otel, flowgger, sysmon checker, rperf-client)
Implemented NATS account service initialization with operator key management and bootstrap support
Added NATS Account Client for tenant-isolated account management and JWT signing
Multi-Tenant Infrastructure:
Implemented distributed process registry using Horde for per-tenant process isolation and discovery
Created comprehensive database schema with tenant-aware tables for devices, agents, pollers, and monitoring
Added tenant CA certificate generation for secure component onboarding with X.509 support
Integrated SPIFFE/SPIRE for distributed cluster authentication
Agent and Status Tracking:
Implemented agent tracking with ETS-based in-memory state management and TTL-based stale detection
Added
gateway_idfield to response messages across checkers and consumers for proper status attributionCreated agent resource with state machine (connecting, connected, degraded, disconnected, unavailable) and capability management
Monitoring and Operations:
Implemented device actor for runtime state management with event buffering and health tracking
Added device statistics aggregator for periodic snapshots and capability breakdowns
Created gateway resource with state machine for job orchestrator management
Implemented DataService KV client with automatic reconnection and mTLS support
CLI and Configuration:
Added
nats-bootstrapand admin commands for NATS operationsImplemented file-based agent configuration loading with environment variable fallback
Added comprehensive configuration support for NATS credentials across all components
Testing:
Added agent registry tests with multi-tenant isolation verification
Implemented event publisher tests for infrastructure event functionality
Updated test configurations for NATS credentials support
Diagram Walkthrough
File Walkthrough
27 files
main.go
Agent refactored to push-mode gateway communication modelcmd/agent/main.go
architecture with gateway communication
simplified to file-based config loading
loadConfig()function for JSON config parsing with embeddeddefault fallback via environment variable
runPushMode()with signal handling, push loop management,and graceful shutdown with timeout
configuration validation
main.go
NATS account service initialization and configurationcmd/data-services/main.go
and bootstrap support
environment variables with fallback to config
NATSAccountServiceServerin gRPC service registrationresolver configuration
main.go
CLI commands for NATS bootstrap and admin operationscmd/cli/main.go
nats-bootstrapcommand dispatch for NATS bootstrap operationsadmincommand dispatch withnatssubcommand routing viadispatchAdminCommand()main.rs
NATS credentials support and gateway ID tracking in trapdcmd/trapd/src/main.rs
nats_creds_path()method(None, Spiffe, Mtls)
gateway_idfield to trap and result response messages for properattribution
config.rs
NATS credentials file configuration for zen consumercmd/consumers/zen/src/config.rs
nats_creds_fileoptional configuration field with validationnats_creds_path()method to resolve credentials file pathswith cert_dir support
config.rs
NATS credentials file configuration for trapdcmd/trapd/src/config.rs
nats_creds_fileoptional configuration field with validationnats_creds_path()method to resolve credentials file pathsusing security config
config.rs
NATS credentials file support in otel configurationcmd/otel/src/config.rs
creds_fileoptional field toNATSConfigTOMLstructempty string handling
nats_output.rs
NATS credentials file support in flowgger outputcmd/flowgger/src/flowgger/output/nats_output.rs
creds_fileoptional field toNATSConfigstructstring trimming
nats_output.rs
NATS credentials file support in otel NATS outputcmd/otel/src/nats_output.rs
creds_fileoptional field toNATSConfigstruct with default Nonevalue
debug logging
server.rs
Gateway ID tracking in sysmon checker responsescmd/checkers/sysmon/src/server.rs
gateway_idfield to bothPingResponseandResultResponsemessages for proper status attribution
grpc_server.rs
Gateway ID tracking in zen consumer responsescmd/consumers/zen/src/grpc_server.rs
gateway_idfield to bothPingResponseandResultResponsemessages for status attribution
server.rs
Poller and gateway ID tracking in rperf responsescmd/checkers/rperf-client/src/server.rs
poller_idandgateway_idfields to both error and successresponse messages
nats.rs
NATS credentials file integration in zen NATS connectioncmd/consumers/zen/src/nats.rs
nats_creds_path()setup.rs
NATS credentials file debug loggingcmd/otel/src/setup.rs
agent_gateway_server.ex
Agent gateway gRPC server with mTLS and multi-tenant supportelixir/serviceradar_agent_gateway/lib/serviceradar_agent_gateway/agent_gateway_server.ex
pushes from Go agents
hello()for agent enrollment with mTLS certificatevalidation and tenant extraction
get_config()for agent configuration delivery withversioning support
push_status()andstream_status()for status receptionwith validation and resource limits
multi-tenant isolation
tenant_registry.ex
Multi-tenant process registry with Horde infrastructureelixir/serviceradar_core/lib/serviceradar/cluster/tenant_registry.ex
process discovery and supervision
isolation
agents, and gateways
initialization
agent_tracker.ex
Agent tracking with ETS and stale detectionelixir/serviceradar_core/lib/serviceradar/agent_tracker.ex
detection
source IP)
agent.ex
Agent resource with state machine and capability managementelixir/serviceradar_core/lib/serviceradar/infrastructure/agent.ex
connected, degraded, disconnected, unavailable)
with metadata and icons
management, and status updates
color, and gRPC endpoint
onboarding_packages.ex
Onboarding packages context with token and certificate managementelixir/serviceradar_core/lib/serviceradar/edge/onboarding_packages.ex
functionality
download tokens
delivered, activated, revoked)
onboarding
recording
alias_events.ex
Device alias event tracking and lifecycle managementelixir/serviceradar_core/lib/serviceradar/identity/alias_events.ex
from device metadata
logic
DeviceAliasStateresource with sighting counts and statetransitions
generator.ex
X.509 certificate generation for tenant CAs and componentselixir/serviceradar_core/lib/serviceradar/edge/tenant_ca/generator.ex
platform root CA
and SANs
computation
:public_keymodule for all cryptographic operationsspiffe.ex
SPIFFE/SPIRE integration for cluster authenticationelixir/serviceradar_core/lib/serviceradar/spiffe.ex
verification
information
loading
account_client.ex
NATS Account Client for Tenant Isolationelixir/serviceradar_core/lib/serviceradar/nats/account_client.ex
isolation support
and operator bootstrap
SSL/TLS support
(limits, permissions, mappings)
stats_aggregator.ex
Device Statistics Aggregator with Periodic Snapshotselixir/serviceradar_core/lib/serviceradar/core/stats_aggregator.ex
caching
per-partition statistics
breakdowns (ICMP, SNMP, Sysmon)
management
device.ex
Device Actor for Runtime State Managementelixir/serviceradar_core/lib/serviceradar/actors/device.ex
caching
tracking with state machine
automatic hibernation
broadcasting
gateway.ex
Gateway Resource with State Machine and Job Orchestrationelixir/serviceradar_core/lib/serviceradar/infrastructure/gateway.ex
machine
maintenance, draining)
isolation
client.ex
DataService KV Client with Reconnection and mTLSelixir/serviceradar_core/lib/serviceradar/data_service/client.ex
automatic reconnection
certificate support
operations
3 files
message_processor.rs
Test configuration update for NATS credentialscmd/consumers/zen/src/message_processor.rs
nats_creds_file: Nonefield to test configuration structagent_registry_test.exs
Agent registry tests with multi-tenant isolationelixir/serviceradar_core/test/serviceradar/registry/agent_registry_test.exs
AgentRegistrywith 362 lines of testcases
storage
event_publisher_test.exs
Event Publisher Tests for Infrastructure Eventselixir/serviceradar_core/test/serviceradar/infrastructure/event_publisher_test.exs
deregistration, and health changes
tests
1 files
20251224184523_initial_schema.exs
Initial database schema migration for ServiceRadar coreelixir/serviceradar_core/priv/repo/migrations/20251224184523_initial_schema.exs
devices, agents, service checks, and monitoring
partitions)
alerts
all tables
1 files
monitoring.pb.ex
Protobuf Definitions for Monitoring Serviceselixir/serviceradar_core/lib/serviceradar/proto/monitoring.pb.ex
status reporting
configuration
intervals
101 files
Imported GitHub PR comment.
Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2225#issuecomment-3712121318
Original created: 2026-01-05T21:18:37Z
PR Compliance Guide 🔍
Below is a summary of compliance checks for this PR:
Sensitive info exposure
Description: The new logging emits potentially sensitive operational details (e.g.,
AllowedClientIdentities, resolver/system creds file paths, and operator name) which couldaid attackers or leak internal identity/credential placement if logs are exposed; similar
creds-path logging was also added in
cmd/otel/src/nats_output.rsandcmd/otel/src/setup.rs.main.go [82-141]
Referred Code
🎫 No ticket provided
Codebase context is not defined
Follow the guide to enable codebase context checks.
Generic: Meaningful Naming and Self-Documenting Code
Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting
Status: Passed
Generic: Secure Error Handling
Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.
Status: Passed
Generic: Security-First Input Validation and Data Handling
Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities
Status: Passed
Generic: Robust Error Handling and Edge Case Management
Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation
Status:
Potential panic unwrap: New config parsing uses
as_str().unwrap()which can panic on unexpected config typesrather than returning a handled error with context.
Referred Code
Generic: Comprehensive Audit Trails
Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.
Status:
Missing audit context: The newly added NATS account service initialization implies privileged
signing/bootstrapping operations but the diff does not show corresponding audit logging
that captures who performed such actions and their outcomes.
Referred Code
Generic: Secure Logging Practices
Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.
Status:
Sensitive config in logs: The new logging statements print potentially sensitive security configuration details
(e.g., allowed client identities and credential/config file paths) which may be considered
sensitive operational information in some environments.
Referred Code
Compliance status legend
🟢 - Fully Compliant🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label
Imported GitHub PR comment.
Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2225#issuecomment-3712126644
Original created: 2026-01-05T21:20:36Z
PR Code Suggestions ✨
Explore these optional code suggestions:
Use correct attribute for timestamp
Correct the attribute name from
:last_seen_timeto:last_seen_atin thepersist_identityfunction to prevent device update failures.elixir/serviceradar_core/lib/serviceradar/actors/device.ex [503-523]
[To ensure code accuracy, apply this suggestion manually]Suggestion importance[1-10]: 9
__
Why: This suggestion identifies a critical bug where an incorrect attribute name (
:last_seen_timeinstead of:last_seen_at) would cause all device identity updates to fail. Fixing this is essential for the core functionality of the device actor.✅
Fix race condition in shutdown logicSuggestion Impact:
The commit introduces a shared shutdownCtx with a timeout and uses it in the shutdown goroutine and the final select (replacing the separate timer), matching the suggested approach to avoid premature timeout races. It also extends the idea by passing shutdownCtx into pushLoop.Stop and guarding the errChan wait with shutdownCtx.code diff:
Refactor the shutdown logic to use a single parent context with a timeout for
all shutdown operations, preventing a race condition where the main goroutine
could time out prematurely.
cmd/agent/main.go [201-226]
[Suggestion processed]Suggestion importance[1-10]: 8
__
Why: The suggestion correctly identifies and fixes a race condition in the shutdown logic, ensuring all shutdown operations are bound by a single, shared timeout, which improves the agent's robustness.
✅
Ensure state transitions are atomicSuggestion Impact:
The commit removed `require_atomic? false` from multiple state-transition update actions including `establish_connection` (and others like connection_failed, degrade, restore_health, etc.), aligning those transitions with atomic execution.code diff:
Remove
require_atomic? falsefrom all state-transition actions likeestablish_connectionto ensure atomicity and prevent data inconsistency. Statetransitions are critical and should be transactional.
elixir/serviceradar_core/lib/serviceradar/infrastructure/agent.ex [270-292]
[To ensure code accuracy, apply this suggestion manually]Suggestion importance[1-10]: 8
__
Why: The suggestion correctly identifies that using
require_atomic? falsefor state transition actions can lead to data inconsistency, which is a significant correctness issue. Ensuring atomicity for these critical operations is crucial for maintaining a valid state.✅
Combine create and update into oneSuggestion Impact:
The commit removed the separate update call that set token fields after creation, and instead applies the token attributes directly on the create changeset (via force_change_attribute) before a single Ash.create, making the operation effectively atomic and avoiding inconsistent intermediate state.code diff:
Refactor the
createfunction to be atomic by combining the "create-then-update"logic into a single operation. Pass all attributes, including tokens, to the
:createaction to prevent data inconsistency.elixir/serviceradar_core/lib/serviceradar/edge/onboarding_packages.ex [156-199]
[To ensure code accuracy, apply this suggestion manually]Suggestion importance[1-10]: 8
__
Why: The suggestion correctly points out a non-atomic "create-then-update" pattern that could lead to inconsistent data. Consolidating this into a single atomic
createoperation is a critical improvement for data integrity.✅
Prevent race condition during updatesSuggestion Impact:
The commit removed the separate conditional DeviceAliasState.confirm/2 call and instead passed confirm_threshold into DeviceAliasState.record_sighting/3, implying the confirmation decision is now handled within the sighting recording operation (moving toward the suggested atomic consolidation), though it did not introduce the specifically suggested atomic_sighting_and_confirm function or the suggested return-shape handling.code diff:
Prevent a race condition in
handle_existing_alias/4by consolidating the logicfor recording a sighting and confirming the state into a single, atomic action
on the
DeviceAliasStateresource.elixir/serviceradar_core/lib/serviceradar/identity/alias_events.ex [438-474]
[To ensure code accuracy, apply this suggestion manually]Suggestion importance[1-10]: 8
__
Why: The suggestion correctly identifies a race condition in
handle_existing_alias/4due to non-atomic read-modify-write operations. Consolidating the sighting and confirmation logic into a single atomic action is crucial to prevent incorrect state transitions.Conditionally enable SSL for gRPC
Conditionally enable SSL for the gRPC connection based on an environment
variable, allowing for both secure and insecure deployments.
elixir/serviceradar_core/lib/serviceradar/nats/account_client.ex [487-518]
[To ensure code accuracy, apply this suggestion manually]Suggestion importance[1-10]: 6
__
Why: The suggestion correctly points out that the function hardcodes an mTLS connection, which contradicts the module's documentation stating it uses the same configuration as
DataService.Client. Applying this change improves flexibility and aligns the component's behavior with other parts of the system, making it usable in non-SSL environments.Prevent misleading operator config log
Add a check to ensure
operatorConfigPathis not empty before attempting to writethe NATS operator configuration to prevent misleading log messages.
cmd/data-services/main.go [117-130]
Suggestion importance[1-10]: 5
__
Why: The suggestion correctly identifies a condition that leads to a misleading log message and proposes a simple fix, improving the accuracy of the application's logging.
✅
Ensure deterministic record selection orderSuggestion Impact:
The commit adds `sorted_fallback = Enum.sort(fallback)` and updates the subsequent Enum.reduce to iterate over `sorted_fallback` instead of the unsorted map, making processing deterministic.code diff:
Sort the
fallbackmap by key before reducing it inselect_canonical_recordstoensure deterministic device selection and prevent metric instability.
elixir/serviceradar_core/lib/serviceradar/core/stats_aggregator.ex [263-286]
[To ensure code accuracy, apply this suggestion manually]Suggestion importance[1-10]: 5
__
Why: The suggestion correctly identifies that iterating over a map is not order-guaranteed, which could lead to non-deterministic behavior and metric instability. Sorting the fallback records before processing is a valid improvement for ensuring consistent results.
✅
Safely parse integer env varsSuggestion Impact:
Updated get_env_int to use Integer.parse/1 and return nil for non-numeric or partially numeric environment variable values instead of raising.code diff:
Replace
String.to_integer/1with the saferInteger.parse/1inget_env_inttoprevent crashes from non-numeric environment variable values.
elixir/serviceradar_core/lib/serviceradar/data_service/client.ex [301-307]
[To ensure code accuracy, apply this suggestion manually]Suggestion importance[1-10]: 7
__
Why: The suggestion correctly points out that
String.to_integer/1will crash if the environment variable contains a non-numeric string. UsingInteger.parse/1makes the configuration loading more robust against invalid user input.Use default limits struct
In
build_limits, return a defaultProto.AccountLimits{}struct instead ofnilwhen no limits are provided to ensure a valid protobuf message.
elixir/serviceradar_core/lib/serviceradar/nats/account_client.ex [520]
[To ensure code accuracy, apply this suggestion manually]Suggestion importance[1-10]: 7
__
Why: The suggestion correctly identifies that sending
nilfor a protobuf message field can cause issues. Returning a default, emptyAccountLimitsstruct is safer and ensures the gRPC request is always well-formed, improving the client's robustness.✅
Handle invalid timestamps safelySuggestion Impact:
The commit replaced the bang version with DateTime.from_unix/1 and added pattern matching to return nil on error, preventing potential crashes from invalid UNIX timestamps.code diff:
Replace
DateTime.from_unix!/1with the saferDateTime.from_unix/1and handlepotential errors to prevent crashes from invalid timestamps.
elixir/serviceradar_core/lib/serviceradar/nats/account_client.ex [174-178]
[To ensure code accuracy, apply this suggestion manually]Suggestion importance[1-10]: 7
__
Why: This suggestion correctly identifies a potential crash when parsing a UNIX timestamp from a gRPC response. Using the non-bang version
DateTime.from_unix/1and handling the error case makes the code more resilient to invalid data from the remote service.Improve detection of trailing config data
Replace the flawed check for trailing JSON data with the more robust and
idiomatic
dec.More()method to correctly handle whitespace and prevent invalidconfigurations from being accepted.
cmd/agent/main.go [133-137]
Suggestion importance[1-10]: 6
__
Why: The suggestion correctly identifies a subtle flaw in the JSON parsing logic and proposes using the idiomatic
dec.More()function, which improves code correctness and readability.Include TLS1.2 cipher suites
Update the
cipherslist to include suites for both TLS 1.3 and TLS 1.2. Thisensures that TLS 1.2 connections are supported, as currently only TLS 1.3 cipher
suites are configured.
elixir/serviceradar_core/lib/serviceradar/spiffe.ex [323]
[To ensure code accuracy, apply this suggestion manually]Suggestion importance[1-10]: 6
__
Why: The suggestion correctly points out that while
versionsallows both TLS 1.3 and 1.2, thecipherslist only includes suites for TLS 1.3. Adding TLS 1.2 cipher suites is necessary to ensure TLS 1.2 connections can be established, improving compatibility.✅
Use SQL fragment for endpointSuggestion Impact:
The :endpoint expr was changed from `host <> ":" <> fragment("?::text", port)` to a single SQL fragment `fragment("? || ':' || ?::text", host, port)`, ensuring concatenation happens correctly in the database.code diff:
In the
:endpointcalculation, use a single SQLfragmentfor string concatenationinstead of mixing Elixir's
<>operator with a fragment. This ensures theconcatenation is performed correctly in the database.
elixir/serviceradar_core/lib/serviceradar/infrastructure/agent.ex [635-643]
[To ensure code accuracy, apply this suggestion manually]Suggestion importance[1-10]: 5
__
Why: The suggestion correctly identifies that the current implementation mixes Elixir string concatenation with an SQL fragment, which is not valid in an Ash
expr. The proposed change to use a singlefragmentfor concatenation is a valid and necessary correction for the calculation to work at the database level.