carverauto/serviceradar

Fork 0

Updates/sync service ng #2632

Merged

mfreeman451 merged 26 commits from refs/pull/2632/head into testing

2026-01-06 09:05:54 +00:00

mfreeman451 commented

2026-01-05 21:17:13 +00:00

(Migrated from github.com)

Owner

Imported from GitHub pull request.

Original GitHub pull request: #2225
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/pull/2225
Original created: 2026-01-05T21:17:13Z
Original updated: 2026-01-06T09:05:56Z
Original head: carverauto/serviceradar:updates/sync-service-ng
Original base: testing
Original merged: 2026-01-06T09:05:54Z by @mfreeman451

User description

IMPORTANT: Please sign the Developer Certificate of Origin

Thank you for your contribution to ServiceRadar. Please note, when contributing, the developer must include
a DCO sign-off statement indicating the DCO acceptance in one commit message. Here
is an example DCO Signed-off-by line in a commit message:

Signed-off-by: J. Doe <j.doe@domain.com>

Describe your changes

Issue ticket number and link

Code checklist before requesting a review

I have signed the DCO?
The build completes without errors?
All tests are passing when running make test?

PR Type

Enhancement

Description

Major architectural refactoring from gRPC server-based sync service to push-mode agent gateway communication model with comprehensive multi-tenant support:

Core Architecture Changes:

Replaced centralized sync service with distributed push-mode agents communicating directly to agent gateway
Refactored Go agent (cmd/agent/main.go) from gRPC server to push-mode client with signal handling and graceful shutdown
Implemented agent gateway gRPC server in Elixir with mTLS certificate validation and multi-tenant isolation

NATS Integration Enhancements:

Added NATS credentials file support across all services (trapd, zen consumer, otel, flowgger, sysmon checker, rperf-client)
Implemented NATS account service initialization with operator key management and bootstrap support
Added NATS Account Client for tenant-isolated account management and JWT signing

Multi-Tenant Infrastructure:

Implemented distributed process registry using Horde for per-tenant process isolation and discovery
Created comprehensive database schema with tenant-aware tables for devices, agents, pollers, and monitoring
Added tenant CA certificate generation for secure component onboarding with X.509 support
Integrated SPIFFE/SPIRE for distributed cluster authentication

Agent and Status Tracking:

Implemented agent tracking with ETS-based in-memory state management and TTL-based stale detection
Added gateway_id field to response messages across checkers and consumers for proper status attribution
Created agent resource with state machine (connecting, connected, degraded, disconnected, unavailable) and capability management

Monitoring and Operations:

Implemented device actor for runtime state management with event buffering and health tracking
Added device statistics aggregator for periodic snapshots and capability breakdowns
Created gateway resource with state machine for job orchestrator management
Implemented DataService KV client with automatic reconnection and mTLS support

CLI and Configuration:

Added nats-bootstrap and admin commands for NATS operations
Implemented file-based agent configuration loading with environment variable fallback
Added comprehensive configuration support for NATS credentials across all components

Testing:

Added agent registry tests with multi-tenant isolation verification
Implemented event publisher tests for infrastructure event functionality
Updated test configurations for NATS credentials support

Diagram Walkthrough

flowchart LR
  A["Go Agents<br/>Push-Mode"] -->|"mTLS<br/>Status Push"| B["Agent Gateway<br/>gRPC Server"]
  B -->|"Config<br/>Delivery"| A
  B -->|"Tenant<br/>Isolation"| C["Tenant<br/>Registry"]
  C -->|"Process<br/>Management"| D["Horde<br/>Infrastructure"]
  E["NATS<br/>Services"] -->|"Credentials<br/>File"| F["All<br/>Components"]
  G["Agent<br/>Tracker"] -->|"ETS<br/>State"| H["Device<br/>Actor"]
  H -->|"Events"| I["Event<br/>Publisher"]
  J["Certificate<br/>Generator"] -->|"X.509<br/>Certs"| K["Tenant<br/>CA"]

File Walkthrough

Relevant files

Enhancement

27 files

main.go `Agent refactored to push-mode gateway communication model` cmd/agent/main.go Refactored agent startup from gRPC server model to push-mode architecture with gateway communication Removed edge onboarding, KV watch, and config bootstrap dependencies; simplified to file-based config loading Added `loadConfig()` function for JSON config parsing with embedded default fallback via environment variable Implemented `runPushMode()` with signal handling, push loop management, and graceful shutdown with timeout Added version injection via ldflags and proper error definitions for configuration validation	+171/-74
main.go `NATS account service initialization and configuration` cmd/data-services/main.go Added NATS account service initialization with operator key management and bootstrap support Configured resolver paths and system account credentials from environment variables with fallback to config Registered `NATSAccountServiceServer` in gRPC service registration Added validation and logging for allowed client identities and resolver configuration	+68/-0
main.go `CLI commands for NATS bootstrap and admin operations` cmd/cli/main.go Added `nats-bootstrap` command dispatch for NATS bootstrap operations Added `admin` command dispatch with `nats` subcommand routing via `dispatchAdminCommand()` Implemented error handling for unknown admin resources	+14/-0
main.rs `NATS credentials support and gateway ID tracking in trapd` cmd/trapd/src/main.rs Added support for NATS credentials file authentication via `nats_creds_path()` method Integrated credentials file loading into all NATS connection modes (None, Spiffe, Mtls) Added `gateway_id` field to trap and result response messages for proper attribution	+23/-1
config.rs `NATS credentials file configuration for zen consumer` cmd/consumers/zen/src/config.rs Added `nats_creds_file` optional configuration field with validation Implemented `nats_creds_path()` method to resolve credentials file paths with cert_dir support Added validation to ensure non-empty credentials file paths	+26/-0
config.rs `NATS credentials file configuration for trapd` cmd/trapd/src/config.rs Added `nats_creds_file` optional configuration field with validation Implemented `nats_creds_path()` method to resolve credentials file paths using security config Added validation to reject empty credentials file values	+21/-0
config.rs `NATS credentials file support in otel configuration` cmd/otel/src/config.rs Added `creds_file` optional field to `NATSConfigTOML` struct Implemented credentials file path resolution in config parsing with empty string handling Updated default config example to include credentials file path	+13/-0
nats_output.rs `NATS credentials file support in flowgger output` cmd/flowgger/src/flowgger/output/nats_output.rs Added `creds_file` optional field to `NATSConfig` struct Implemented credentials file loading from configuration with empty string trimming Integrated credentials file into NATS connection options	+14/-0
nats_output.rs `NATS credentials file support in otel NATS output` cmd/otel/src/nats_output.rs Added `creds_file` optional field to `NATSConfig` struct with default None value Integrated credentials file loading into NATS connection options with debug logging	+7/-0
server.rs `Gateway ID tracking in sysmon checker responses` cmd/checkers/sysmon/src/server.rs Added `gateway_id` field to both `PingResponse` and `ResultResponse` messages for proper status attribution	+2/-0
grpc_server.rs `Gateway ID tracking in zen consumer responses` cmd/consumers/zen/src/grpc_server.rs Added `gateway_id` field to both `PingResponse` and `ResultResponse` messages for status attribution	+2/-0
server.rs `Poller and gateway ID tracking in rperf responses` cmd/checkers/rperf-client/src/server.rs Added `poller_id` and `gateway_id` fields to both error and success response messages	+4/-0
nats.rs `NATS credentials file integration in zen NATS connection` cmd/consumers/zen/src/nats.rs Added credentials file loading into NATS connection options via `nats_creds_path()`	+4/-0
setup.rs `NATS credentials file debug logging` cmd/otel/src/setup.rs Added debug logging for NATS credentials file configuration	+1/-0
agent_gateway_server.ex `Agent gateway gRPC server with mTLS and multi-tenant support` elixir/serviceradar_agent_gateway/lib/serviceradar_agent_gateway/agent_gateway_server.ex Implemented complete gRPC server for agent gateway receiving status pushes from Go agents Implemented `hello()` for agent enrollment with mTLS certificate validation and tenant extraction Implemented `get_config()` for agent configuration delivery with versioning support Implemented `push_status()` and `stream_status()` for status reception with validation and resource limits Integrated tenant security via mTLS certificate parsing and multi-tenant isolation	+1013/-0
tenant_registry.ex `Multi-tenant process registry with Horde infrastructure` elixir/serviceradar_core/lib/serviceradar/cluster/tenant_registry.ex Implemented multi-tenant process registry using Horde for distributed process discovery and supervision Created per-tenant Horde Registry and DynamicSupervisor for process isolation Implemented slug-to-UUID mapping via ETS table for admin/debug lookups Added convenience functions for registering and managing pollers, agents, and gateways Implemented tenant infrastructure lifecycle management with lazy initialization	+682/-0
agent_tracker.ex `Agent tracking with ETS and stale detection` elixir/serviceradar_core/lib/serviceradar/agent_tracker.ex Implemented in-memory agent tracking using ETS with TTL-based stale detection Added agent status tracking with metadata (service count, partition, source IP) Implemented periodic cleanup of stale agents (24-hour threshold) Added PubSub broadcast for UI updates on agent status changes	+177/-0
agent.ex `Agent resource with state machine and capability management` elixir/serviceradar_core/lib/serviceradar/infrastructure/agent.ex New Ash resource for managing Go agents with OCSF v1.4.0 compliance Implements agent lifecycle state machine with transitions (connecting, connected, degraded, disconnected, unavailable) Defines agent capabilities (ICMP, TCP, HTTP, gRPC, DNS, Process, SNMP) with metadata and icons Provides JSON API routes for agent registration, connection management, and status updates Includes calculated fields for display name, online status, status color, and gRPC endpoint	+649/-0
onboarding_packages.ex `Onboarding packages context with token and certificate management` elixir/serviceradar_core/lib/serviceradar/edge/onboarding_packages.ex Context module for edge onboarding package operations with CRUD functionality Implements token generation, encryption, and verification for join and download tokens Provides package delivery workflow with state transitions (issued, delivered, activated, revoked) Integrates tenant CA certificate generation for secure component onboarding Includes soft-delete and revocation capabilities with audit event recording	+606/-0
alias_events.ex `Device alias event tracking and lifecycle management` elixir/serviceradar_core/lib/serviceradar/identity/alias_events.ex Device alias lifecycle event tracking module ported from Go core Detects and processes alias changes (service IDs, IPs, collector IPs) from device metadata Builds lifecycle events for audit and alerting with change detection logic Manages `DeviceAliasState` resource with sighting counts and state transitions Provides metadata parsing and comparison utilities for alias records	+650/-0
generator.ex `X.509 certificate generation for tenant CAs and components` elixir/serviceradar_core/lib/serviceradar/edge/tenant_ca/generator.ex X.509 certificate generation for per-tenant CAs and edge components Generates tenant intermediate CAs (10-year validity) signed by platform root CA Creates edge component certificates (1-year validity) with SPIFFE IDs and SANs Provides certificate parsing, CN extraction, and SPKI SHA-256 computation Uses Erlang `:public_key` module for all cryptographic operations	+541/-0
spiffe.ex `SPIFFE/SPIRE integration for cluster authentication` elixir/serviceradar_core/lib/serviceradar/spiffe.ex SPIFFE/SPIRE integration module for distributed cluster authentication Provides SSL/TLS options for ERTS distribution with SPIFFE certificate verification Parses and validates SPIFFE IDs with trust domain verification Monitors certificate files for rotation and provides expiry information Supports both filesystem and SPIRE Workload API modes for certificate loading	+564/-0
account_client.ex `NATS Account Client for Tenant Isolation` elixir/serviceradar_core/lib/serviceradar/nats/account_client.ex New gRPC client module for NATS account management with tenant isolation support Implements account creation, user credential generation, JWT signing, and operator bootstrap Provides channel management with fallback to fresh connections and SSL/TLS support Includes helper functions for building protobuf request structures (limits, permissions, mappings)	+563/-0
stats_aggregator.ex `Device Statistics Aggregator with Periodic Snapshots` elixir/serviceradar_core/lib/serviceradar/core/stats_aggregator.ex New GenServer module for periodic device statistics aggregation and caching Implements canonical record selection, device filtering, and per-partition statistics Tracks device availability, activity windows, and capability breakdowns (ICMP, SNMP, Sysmon) Includes telemetry recording, alert handler integration, and snapshot management	+626/-0
device.ex `Device Actor for Runtime State Management` elixir/serviceradar_core/lib/serviceradar/actors/device.ex New GenServer-based device actor for in-memory state management and caching Implements identity resolution, event buffering, health status tracking with state machine Provides lazy initialization, Horde registry integration, and automatic hibernation Includes event flushing, health check recording, and PubSub broadcasting	+608/-0
gateway.ex `Gateway Resource with State Machine and Job Orchestration` elixir/serviceradar_core/lib/serviceradar/infrastructure/gateway.ex New Ash resource for gateway (job orchestrator) management with state machine Implements status transitions (healthy, degraded, offline, recovering, maintenance, draining) Provides registration, heartbeat, and health tracking with tenant isolation Includes JSON API routes and policy-based authorization	+544/-0
client.ex `DataService KV Client with Reconnection and mTLS` elixir/serviceradar_core/lib/serviceradar/data_service/client.ex New GenServer-based gRPC client for datasvc KV store operations Implements put, get, delete, list_keys, and put_many operations with automatic reconnection Provides exponential backoff reconnection strategy and mTLS certificate support Handles gun connection lifecycle events and ensures connection before operations	+507/-0

Tests

3 files

message_processor.rs `Test configuration update for NATS credentials` cmd/consumers/zen/src/message_processor.rs Added `nats_creds_file: None` field to test configuration struct	+1/-0
agent_registry_test.exs `Agent registry tests with multi-tenant isolation` elixir/serviceradar_core/test/serviceradar/registry/agent_registry_test.exs Comprehensive test suite for `AgentRegistry` with 362 lines of test cases Tests agent registration with gRPC connection details and metadata storage Verifies multi-tenant isolation for registry operations and lookups Tests capability-based and partition-based agent discovery Includes heartbeat, unregister, and count operations	+362/-0
event_publisher_test.exs `Event Publisher Tests for Infrastructure Events` elixir/serviceradar_core/test/serviceradar/infrastructure/event_publisher_test.exs New test module for infrastructure event publishing functionality Tests event structure building for state changes, registration, deregistration, and health changes Verifies mandatory field requirements and supported entity/event types Mocks NATS connection to avoid requiring running NATS server in unit tests	+143/-0

Configuration

1 files

20251224184523_initial_schema.exs

Initial database schema migration for ServiceRadar core

elixir/serviceradar_core/priv/repo/migrations/20251224184523_initial_schema.exs

Created comprehensive database schema with tables for tenants, users,
devices, agents, service checks, and monitoring
Defined relationships between core entities (devices, agents, pollers,
partitions)
Added support for edge onboarding packages, monitoring events, and
alerts
Implemented multi-tenant isolation with tenant_id foreign keys across
all tables

+733/-0

Dependencies

1 files

monitoring.pb.ex

Protobuf Definitions for Monitoring Services

elixir/serviceradar_core/lib/serviceradar/proto/monitoring.pb.ex

Auto-generated protobuf definitions for monitoring service messages
Defines request/response structures for agent, poller, and gateway
status reporting
Includes sweep completion tracking, chunked streaming, and agent
configuration
Supports agent hello handshake with capabilities and heartbeat
intervals

+398/-0

Additional files

101 files

.bazelignore	+4/-0
.bazelrc	+2/-0
.env.example	+38/-0
AGENTS.md	+171/-2
MODULE.bazel	+5/-0
Makefile	+4/-0
README-Docker.md	+16/-1
BUILD.bazel	+11/-0
BUILD.bazel	+12/-0
mix_release.bzl	+120/-49
BUILD.bazel	+1/-0
config.json	+5/-6
monitoring.proto	+4/-1
zen-consumer-with-otel.json	+1/-0
zen-consumer.json	+1/-0
BUILD.bazel	+1/-0
README.md	+9/-12
flowgger.toml	+1/-0
otel.toml	+1/-0
BUILD.bazel	+0/-25
config.json	+0/-77
main.go	+0/-123
docker-compose.elx.yml	+108/-0
docker-compose.spiffe.yml	+8/-55
docker-compose.yml	+298/-280
Dockerfile.agent-gateway	+94/-0
Dockerfile.core-elx	+108/-0
Dockerfile.poller-elx	+95/-0
Dockerfile.sync	+0/-95
Dockerfile.web-ng	+6/-0
agent-minimal.docker.json	+6/-6
agent.docker.json	+5/-20
agent.mtls.json	+7/-10
.gitkeep	+1/-0
datasvc.docker.json	+3/-2
datasvc.mtls.json	+14/-1
db-event-writer.mtls.json	+1/-0
entrypoint-certs.sh	+12/-8
entrypoint-sync.sh	+0/-96
flowgger.docker.toml	+2/-1
generate-certs.sh	+217/-12
nats.docker.conf	+16/-160
netflow-consumer.mtls.json	+1/-0
otel.docker.toml	+2/-0
pg_hba.conf	+9/-0
pg_ident.conf	+17/-0
bootstrap-compose-spire.sh	+0/-1
ssl_dist.core.conf	+17/-0
ssl_dist.gateway.conf	+17/-0
ssl_dist.poller.conf	+17/-0
ssl_dist.web.conf	+17/-0
sync.docker.json	+0/-71
sync.mtls.json	+0/-75
trapd.docker.json	+2/-1
zen.docker.json	+2/-1
BUILD.bazel	+80/-42
push_targets.bzl	+2/-1
SBOM.md	+0/-1
agents.md	+8/-9
architecture.md	+100/-26
armis.md	+7/-21
ash-api.md	+305/-0
ash-authentication.md	+244/-0
ash-authorization.md	+283/-0
ash-domains.md	+223/-0
ash-migration-guide.md	+339/-0
auth-configuration.md	+1/-1
cluster.md	+2/-2
discovery.md	+19/-20
docker-setup.md	+57/-14
edge-agents.md	+253/-0
edge-onboarding.md	+316/-273
identity_drift_monitoring.md	+1/-1
installation.md	+13/-14
intro.md	+5/-5
kv-configuration.md	+3/-3
netbox.md	+4/-23
netflow.md	+1/-1
onboarding-review-2025.md	+0/-1
rperf-monitoring.md	+1/-1
compose-mtls-sysmonosx.md	+1/-1
docker-compose-login-500.md	+3/-3
security-architecture.md	+238/-0
self-signed.md	+32/-107
service-port-map.md	+3/-5
snmp.md	+2/-2
spiffe-identity.md	+52/-0
srql-service.md	+2/-2
sync.md	+75/-482
syslog.md	+1/-1
troubleshooting-guide.md	+53/-2
web-ui.md	+31/-106
datasvc.ex	+32/-2
BUILD.bazel	+30/-0
config.exs	+13/-0
dev.exs	+4/-0
prod.exs	+4/-0
runtime.exs	+209/-0
test.exs	+20/-0
agent_client.ex	+472/-0
Additional files not shown

Imported from GitHub pull request. Original GitHub pull request: #2225 Original author: @mfreeman451 Original URL: https://github.com/carverauto/serviceradar/pull/2225 Original created: 2026-01-05T21:17:13Z Original updated: 2026-01-06T09:05:56Z Original head: carverauto/serviceradar:updates/sync-service-ng Original base: testing Original merged: 2026-01-06T09:05:54Z by @mfreeman451 --- ### **User description** ## IMPORTANT: Please sign the Developer Certificate of Origin Thank you for your contribution to ServiceRadar. Please note, when contributing, the developer must include a [DCO sign-off statement]( https://developercertificate.org/) indicating the DCO acceptance in one commit message. Here is an example DCO Signed-off-by line in a commit message: ``` Signed-off-by: J. Doe <j.doe@domain.com> ``` ## Describe your changes ## Issue ticket number and link ## Code checklist before requesting a review - [ ] I have signed the DCO? - [ ] The build completes without errors? - [ ] All tests are passing when running make test? ___ ### **PR Type** Enhancement ___ ### **Description** Major architectural refactoring from gRPC server-based sync service to push-mode agent gateway communication model with comprehensive multi-tenant support: **Core Architecture Changes:** - Replaced centralized sync service with distributed push-mode agents communicating directly to agent gateway - Refactored Go agent (`cmd/agent/main.go`) from gRPC server to push-mode client with signal handling and graceful shutdown - Implemented agent gateway gRPC server in Elixir with mTLS certificate validation and multi-tenant isolation **NATS Integration Enhancements:** - Added NATS credentials file support across all services (trapd, zen consumer, otel, flowgger, sysmon checker, rperf-client) - Implemented NATS account service initialization with operator key management and bootstrap support - Added NATS Account Client for tenant-isolated account management and JWT signing **Multi-Tenant Infrastructure:** - Implemented distributed process registry using Horde for per-tenant process isolation and discovery - Created comprehensive database schema with tenant-aware tables for devices, agents, pollers, and monitoring - Added tenant CA certificate generation for secure component onboarding with X.509 support - Integrated SPIFFE/SPIRE for distributed cluster authentication **Agent and Status Tracking:** - Implemented agent tracking with ETS-based in-memory state management and TTL-based stale detection - Added `gateway_id` field to response messages across checkers and consumers for proper status attribution - Created agent resource with state machine (connecting, connected, degraded, disconnected, unavailable) and capability management **Monitoring and Operations:** - Implemented device actor for runtime state management with event buffering and health tracking - Added device statistics aggregator for periodic snapshots and capability breakdowns - Created gateway resource with state machine for job orchestrator management - Implemented DataService KV client with automatic reconnection and mTLS support **CLI and Configuration:** - Added `nats-bootstrap` and admin commands for NATS operations - Implemented file-based agent configuration loading with environment variable fallback - Added comprehensive configuration support for NATS credentials across all components **Testing:** - Added agent registry tests with multi-tenant isolation verification - Implemented event publisher tests for infrastructure event functionality - Updated test configurations for NATS credentials support ___ ### Diagram Walkthrough ```mermaid flowchart LR A["Go Agents Push-Mode"] -->|"mTLS Status Push"| B["Agent Gateway gRPC Server"] B -->|"Config Delivery"| A B -->|"Tenant Isolation"| C["Tenant Registry"] C -->|"Process Management"| D["Horde Infrastructure"] E["NATS Services"] -->|"Credentials File"| F["All Components"] G["Agent Tracker"] -->|"ETS State"| H["Device Actor"] H -->|"Events"| I["Event Publisher"] J["Certificate Generator"] -->|"X.509 Certs"| K["Tenant CA"] ``` <details><summary><h3>File Walkthrough</h3></summary> <table><thead><tr><th></th><th align="left">Relevant files</th></tr></thead><tbody><tr><td>Enhancement</td><td><details><summary>27 files</summary><table> <tr> <td> <details> <summary>main.go<dd><code>Agent refactored to push-mode gateway communication model</code></dd></summary> <hr> cmd/agent/main.go <ul><li>Refactored agent startup from gRPC server model to push-mode architecture with gateway communication <li> Removed edge onboarding, KV watch, and config bootstrap dependencies; simplified to file-based config loading <li> Added <code>loadConfig()</code> function for JSON config parsing with embedded default fallback via environment variable <li> Implemented <code>runPushMode()</code> with signal handling, push loop management, and graceful shutdown with timeout <li> Added version injection via ldflags and proper error definitions for configuration validation</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-61358711e980ccf505246fd3915f97cbd3a380e9b66f6fa5aad46749968c5ca3">+171/-74</a></td> </tr> <tr> <td> <details> <summary>main.go<dd><code>NATS account service initialization and configuration</code>        </dd></summary> <hr> cmd/data-services/main.go <ul><li>Added NATS account service initialization with operator key management and bootstrap support <li> Configured resolver paths and system account credentials from environment variables with fallback to config <li> Registered <code>NATSAccountServiceServer</code> in gRPC service registration <li> Added validation and logging for allowed client identities and resolver configuration</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-5e7731adfb877918cd65d9d5531621312496450fd550fea2682efca4ca8fe816">+68/-0</a>    </td> </tr> <tr> <td> <details> <summary>main.go<dd><code>CLI commands for NATS bootstrap and admin operations</code>          </dd></summary> <hr> cmd/cli/main.go <ul><li>Added <code>nats-bootstrap</code> command dispatch for NATS bootstrap operations <li> Added <code>admin</code> command dispatch with <code>nats</code> subcommand routing via <code>dispatchAdminCommand()</code> <li> Implemented error handling for unknown admin resources</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-ed4d81d29a7267f93fd77e17993fd3491b9ef6ded18490b4514d10ed1d803bc2">+14/-0</a>    </td> </tr> <tr> <td> <details> <summary>main.rs<dd><code>NATS credentials support and gateway ID tracking in trapd</code></dd></summary> <hr> cmd/trapd/src/main.rs <ul><li>Added support for NATS credentials file authentication via <code>nats_creds_path()</code> method <li> Integrated credentials file loading into all NATS connection modes (None, Spiffe, Mtls) <li> Added <code>gateway_id</code> field to trap and result response messages for proper attribution</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-33b655d8730ae3e9c844ee280787d11f1b0d5343119188273f89558805f814ba">+23/-1</a>    </td> </tr> <tr> <td> <details> <summary>config.rs<dd><code>NATS credentials file configuration for zen consumer</code>          </dd></summary> <hr> cmd/consumers/zen/src/config.rs <ul><li>Added <code>nats_creds_file</code> optional configuration field with validation <li> Implemented <code>nats_creds_path()</code> method to resolve credentials file paths with cert_dir support <li> Added validation to ensure non-empty credentials file paths</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-05038f3867985e757de9027609950e682bad6d1992dac6acd7c28962a3c65dc4">+26/-0</a>    </td> </tr> <tr> <td> <details> <summary>config.rs<dd><code>NATS credentials file configuration for trapd</code>                        </dd></summary> <hr> cmd/trapd/src/config.rs <ul><li>Added <code>nats_creds_file</code> optional configuration field with validation <li> Implemented <code>nats_creds_path()</code> method to resolve credentials file paths using security config <li> Added validation to reject empty credentials file values</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-c89b88ba4d2bf0a054d0ba69a672a92c30140b8d19503d67b980a218ffe3106d">+21/-0</a>    </td> </tr> <tr> <td> <details> <summary>config.rs<dd><code>NATS credentials file support in otel configuration</code>            </dd></summary> <hr> cmd/otel/src/config.rs <ul><li>Added <code>creds_file</code> optional field to <code>NATSConfigTOML</code> struct <li> Implemented credentials file path resolution in config parsing with empty string handling <li> Updated default config example to include credentials file path</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-abbaec651da3d6af96b482e0f77bb909b65dbe0cabd78b5803769cc9dab0a1b0">+13/-0</a>    </td> </tr> <tr> <td> <details> <summary>nats_output.rs<dd><code>NATS credentials file support in flowgger output</code>                  </dd></summary> <hr> cmd/flowgger/src/flowgger/output/nats_output.rs <ul><li>Added <code>creds_file</code> optional field to <code>NATSConfig</code> struct <li> Implemented credentials file loading from configuration with empty string trimming <li> Integrated credentials file into NATS connection options</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-a82e2e4d413539bf0b414b5629665b19648447523994cba639c4d1238aa5a0c1">+14/-0</a>    </td> </tr> <tr> <td> <details> <summary>nats_output.rs<dd><code>NATS credentials file support in otel NATS output</code>                </dd></summary> <hr> cmd/otel/src/nats_output.rs <ul><li>Added <code>creds_file</code> optional field to <code>NATSConfig</code> struct with default None value <li> Integrated credentials file loading into NATS connection options with debug logging</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-6b585ea3564a481174e04da1270e2e13edd4e2b980d02a2652d6d21e6d82a498">+7/-0</a>      </td> </tr> <tr> <td> <details> <summary>server.rs<dd><code>Gateway ID tracking in sysmon checker responses</code>                    </dd></summary> <hr> cmd/checkers/sysmon/src/server.rs <ul><li>Added <code>gateway_id</code> field to both <code>PingResponse</code> and <code>ResultResponse</code> messages for proper status attribution</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-2c4395fee16396339c3eea518ad9bec739174c67c9cedf62e6848c17136dd33e">+2/-0</a>      </td> </tr> <tr> <td> <details> <summary>grpc_server.rs<dd><code>Gateway ID tracking in zen consumer responses</code>                        </dd></summary> <hr> cmd/consumers/zen/src/grpc_server.rs <ul><li>Added <code>gateway_id</code> field to both <code>PingResponse</code> and <code>ResultResponse</code> messages for status attribution</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-e4564a93f6cf84ff91cd3d8141fc9272ec9b4ec19defd107afa42be01fcfed5b">+2/-0</a>      </td> </tr> <tr> <td> <details> <summary>server.rs<dd><code>Poller and gateway ID tracking in rperf responses</code>                </dd></summary> <hr> cmd/checkers/rperf-client/src/server.rs <ul><li>Added <code>poller_id</code> and <code>gateway_id</code> fields to both error and success response messages</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-bce0f4ca6548712f224b73816825d28e831acbbff7dbed3c98671ed50f65d028">+4/-0</a>      </td> </tr> <tr> <td> <details> <summary>nats.rs<dd><code>NATS credentials file integration in zen NATS connection</code>  </dd></summary> <hr> cmd/consumers/zen/src/nats.rs <ul><li>Added credentials file loading into NATS connection options via <code>nats_creds_path()</code></ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-97f7335def0ad5d644b594a1076ae2d7080b11259cbb8de22c7946cc8e4b39f8">+4/-0</a>      </td> </tr> <tr> <td> <details> <summary>setup.rs<dd><code>NATS credentials file debug logging</code>                                            </dd></summary> <hr> cmd/otel/src/setup.rs - Added debug logging for NATS credentials file configuration </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-3891f667deb20fd26e296d3e2742c57378d3764fe1743118e612465ae360391f">+1/-0</a>      </td> </tr> <tr> <td> <details> <summary>agent_gateway_server.ex<dd><code>Agent gateway gRPC server with mTLS and multi-tenant support</code></dd></summary> <hr> elixir/serviceradar_agent_gateway/lib/serviceradar_agent_gateway/agent_gateway_server.ex <ul><li>Implemented complete gRPC server for agent gateway receiving status pushes from Go agents <li> Implemented <code>hello()</code> for agent enrollment with mTLS certificate validation and tenant extraction <li> Implemented <code>get_config()</code> for agent configuration delivery with versioning support <li> Implemented <code>push_status()</code> and <code>stream_status()</code> for status reception with validation and resource limits <li> Integrated tenant security via mTLS certificate parsing and multi-tenant isolation</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-369a368073dc8ec1140bcea699005a1ce97a90cd59629df0bd18c71c7ffaae9f">+1013/-0</a></td> </tr> <tr> <td> <details> <summary>tenant_registry.ex<dd><code>Multi-tenant process registry with Horde infrastructure</code>    </dd></summary> <hr> elixir/serviceradar_core/lib/serviceradar/cluster/tenant_registry.ex <ul><li>Implemented multi-tenant process registry using Horde for distributed process discovery and supervision <li> Created per-tenant Horde Registry and DynamicSupervisor for process isolation <li> Implemented slug-to-UUID mapping via ETS table for admin/debug lookups <li> Added convenience functions for registering and managing pollers, agents, and gateways <li> Implemented tenant infrastructure lifecycle management with lazy initialization</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-91248b3b128a2e3d9bea6ffdb5e0f295e4a1745e82f87687c640ad01416fb85d">+682/-0</a>  </td> </tr> <tr> <td> <details> <summary>agent_tracker.ex<dd><code>Agent tracking with ETS and stale detection</code>                            </dd></summary> <hr> elixir/serviceradar_core/lib/serviceradar/agent_tracker.ex <ul><li>Implemented in-memory agent tracking using ETS with TTL-based stale detection <li> Added agent status tracking with metadata (service count, partition, source IP) <li> Implemented periodic cleanup of stale agents (24-hour threshold) <li> Added PubSub broadcast for UI updates on agent status changes</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-792a06febe706658f0bd382113e31d8070d15c477202d51f1699e77c8317001a">+177/-0</a>  </td> </tr> <tr> <td> <details> <summary>agent.ex<dd><code>Agent resource with state machine and capability management</code></dd></summary> <hr> elixir/serviceradar_core/lib/serviceradar/infrastructure/agent.ex <ul><li>New Ash resource for managing Go agents with OCSF v1.4.0 compliance <li> Implements agent lifecycle state machine with transitions (connecting, connected, degraded, disconnected, unavailable) <li> Defines agent capabilities (ICMP, TCP, HTTP, gRPC, DNS, Process, SNMP) with metadata and icons <li> Provides JSON API routes for agent registration, connection management, and status updates <li> Includes calculated fields for display name, online status, status color, and gRPC endpoint</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-c56f92b6ce744cab3f2dc00dde92e2017cffdd12ad4618f7fa720252f2a6843a">+649/-0</a>  </td> </tr> <tr> <td> <details> <summary>onboarding_packages.ex<dd><code>Onboarding packages context with token and certificate management</code></dd></summary> <hr> elixir/serviceradar_core/lib/serviceradar/edge/onboarding_packages.ex <ul><li>Context module for edge onboarding package operations with CRUD functionality <li> Implements token generation, encryption, and verification for join and download tokens <li> Provides package delivery workflow with state transitions (issued, delivered, activated, revoked) <li> Integrates tenant CA certificate generation for secure component onboarding <li> Includes soft-delete and revocation capabilities with audit event recording</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-e4fe8e19bc324416302bb4c962f57133b3f62eb82053766844d881c522a473e5">+606/-0</a>  </td> </tr> <tr> <td> <details> <summary>alias_events.ex<dd><code>Device alias event tracking and lifecycle management</code>          </dd></summary> <hr> elixir/serviceradar_core/lib/serviceradar/identity/alias_events.ex <ul><li>Device alias lifecycle event tracking module ported from Go core <li> Detects and processes alias changes (service IDs, IPs, collector IPs) from device metadata <li> Builds lifecycle events for audit and alerting with change detection logic <li> Manages <code>DeviceAliasState</code> resource with sighting counts and state transitions <li> Provides metadata parsing and comparison utilities for alias records</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-bc3743067ea774f59bc5665770f7110a2d6e90f6e1156a7717a1c287f8979d28">+650/-0</a>  </td> </tr> <tr> <td> <details> <summary>generator.ex<dd><code>X.509 certificate generation for tenant CAs and components</code></dd></summary> <hr> elixir/serviceradar_core/lib/serviceradar/edge/tenant_ca/generator.ex <ul><li>X.509 certificate generation for per-tenant CAs and edge components <li> Generates tenant intermediate CAs (10-year validity) signed by platform root CA <li> Creates edge component certificates (1-year validity) with SPIFFE IDs and SANs <li> Provides certificate parsing, CN extraction, and SPKI SHA-256 computation <li> Uses Erlang <code>:public_key</code> module for all cryptographic operations</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-b48e4a9e1189da61e2a60e16f56fce81298d76b7cdab745107140fed3f6e48b4">+541/-0</a>  </td> </tr> <tr> <td> <details> <summary>spiffe.ex<dd><code>SPIFFE/SPIRE integration for cluster authentication</code>            </dd></summary> <hr> elixir/serviceradar_core/lib/serviceradar/spiffe.ex <ul><li>SPIFFE/SPIRE integration module for distributed cluster authentication <li> Provides SSL/TLS options for ERTS distribution with SPIFFE certificate verification <li> Parses and validates SPIFFE IDs with trust domain verification <li> Monitors certificate files for rotation and provides expiry information <li> Supports both filesystem and SPIRE Workload API modes for certificate loading</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-0cb8d921c19f671b66f91c0978e351e71d927c5f4694924984c9f1ed34d7ee78">+564/-0</a>  </td> </tr> <tr> <td> <details> <summary>account_client.ex<dd><code>NATS Account Client for Tenant Isolation</code>                                  </dd></summary> <hr> elixir/serviceradar_core/lib/serviceradar/nats/account_client.ex <ul><li>New gRPC client module for NATS account management with tenant isolation support <li> Implements account creation, user credential generation, JWT signing, and operator bootstrap <li> Provides channel management with fallback to fresh connections and SSL/TLS support <li> Includes helper functions for building protobuf request structures (limits, permissions, mappings)</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-2e18ac777ac600b12982ba9e9d5327e23ebd84c139a2add7976f8bf61283e554">+563/-0</a>  </td> </tr> <tr> <td> <details> <summary>stats_aggregator.ex<dd><code>Device Statistics Aggregator with Periodic Snapshots</code>          </dd></summary> <hr> elixir/serviceradar_core/lib/serviceradar/core/stats_aggregator.ex <ul><li>New GenServer module for periodic device statistics aggregation and caching <li> Implements canonical record selection, device filtering, and per-partition statistics <li> Tracks device availability, activity windows, and capability breakdowns (ICMP, SNMP, Sysmon) <li> Includes telemetry recording, alert handler integration, and snapshot management</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-1f4ac8290be7d27cac0ed660e51a9b3b23a219a6bb43b3735f3c5a9768321031">+626/-0</a>  </td> </tr> <tr> <td> <details> <summary>device.ex<dd><code>Device Actor for Runtime State Management</code>                                </dd></summary> <hr> elixir/serviceradar_core/lib/serviceradar/actors/device.ex <ul><li>New GenServer-based device actor for in-memory state management and caching <li> Implements identity resolution, event buffering, health status tracking with state machine <li> Provides lazy initialization, Horde registry integration, and automatic hibernation <li> Includes event flushing, health check recording, and PubSub broadcasting</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-eba1d95a852e4a736813c7b486da651704f20718e24f931c966ff3f37c421eea">+608/-0</a>  </td> </tr> <tr> <td> <details> <summary>gateway.ex<dd><code>Gateway Resource with State Machine and Job Orchestration</code></dd></summary> <hr> elixir/serviceradar_core/lib/serviceradar/infrastructure/gateway.ex <ul><li>New Ash resource for gateway (job orchestrator) management with state machine <li> Implements status transitions (healthy, degraded, offline, recovering, maintenance, draining) <li> Provides registration, heartbeat, and health tracking with tenant isolation <li> Includes JSON API routes and policy-based authorization</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-1655f7cc80c8949b3d2c9e50d20110cb85883413fb6181825bfbcf0bda578c6b">+544/-0</a>  </td> </tr> <tr> <td> <details> <summary>client.ex<dd><code>DataService KV Client with Reconnection and mTLS</code>                  </dd></summary> <hr> elixir/serviceradar_core/lib/serviceradar/data_service/client.ex <ul><li>New GenServer-based gRPC client for datasvc KV store operations <li> Implements put, get, delete, list_keys, and put_many operations with automatic reconnection <li> Provides exponential backoff reconnection strategy and mTLS certificate support <li> Handles gun connection lifecycle events and ensures connection before operations</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-503e195ad79e05e12d7ad03a675f6e35ffdfc201b8571b0d30a220fe036e03a1">+507/-0</a>  </td> </tr> </table></details></td></tr><tr><td>Tests</td><td><details><summary>3 files</summary><table> <tr> <td> <details> <summary>message_processor.rs<dd><code>Test configuration update for NATS credentials</code>                      </dd></summary> <hr> cmd/consumers/zen/src/message_processor.rs - Added `nats_creds_file: None` field to test configuration struct </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-9fcbc5358a9009e60a8cd22d21e5a9ea652787c727732d0b869e0865495114c3">+1/-0</a>      </td> </tr> <tr> <td> <details> <summary>agent_registry_test.exs<dd><code>Agent registry tests with multi-tenant isolation</code>                  </dd></summary> <hr> elixir/serviceradar_core/test/serviceradar/registry/agent_registry_test.exs <ul><li>Comprehensive test suite for <code>AgentRegistry</code> with 362 lines of test cases <li> Tests agent registration with gRPC connection details and metadata storage <li> Verifies multi-tenant isolation for registry operations and lookups <li> Tests capability-based and partition-based agent discovery <li> Includes heartbeat, unregister, and count operations</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-e46695d8be4c0156bea5a577432628ea11a106208eaae749e30ca71d6ec42b47">+362/-0</a>  </td> </tr> <tr> <td> <details> <summary>event_publisher_test.exs<dd><code>Event Publisher Tests for Infrastructure Events</code>                    </dd></summary> <hr> elixir/serviceradar_core/test/serviceradar/infrastructure/event_publisher_test.exs <ul><li>New test module for infrastructure event publishing functionality <li> Tests event structure building for state changes, registration, deregistration, and health changes <li> Verifies mandatory field requirements and supported entity/event types <li> Mocks NATS connection to avoid requiring running NATS server in unit tests</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-aedb73cf51479f3bd752152338a3a98c595d4ef48d36139eee372149c134975b">+143/-0</a>  </td> </tr> </table></details></td></tr><tr><td>Configuration</td><td><details><summary>1 files</summary><table> <tr> <td> <details> <summary>20251224184523_initial_schema.exs<dd><code>Initial database schema migration for ServiceRadar core</code>    </dd></summary> <hr> elixir/serviceradar_core/priv/repo/migrations/20251224184523_initial_schema.exs <ul><li>Created comprehensive database schema with tables for tenants, users, devices, agents, service checks, and monitoring <li> Defined relationships between core entities (devices, agents, pollers, partitions) <li> Added support for edge onboarding packages, monitoring events, and alerts <li> Implemented multi-tenant isolation with tenant_id foreign keys across all tables</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-a95104ae5125c32fa35bb828aa9a716f5129b8052d83d73901d3bf8a5dc3502a">+733/-0</a>  </td> </tr> </table></details></td></tr><tr><td>Dependencies</td><td><details><summary>1 files</summary><table> <tr> <td> <details> <summary>monitoring.pb.ex<dd><code>Protobuf Definitions for Monitoring Services</code>                          </dd></summary> <hr> elixir/serviceradar_core/lib/serviceradar/proto/monitoring.pb.ex <ul><li>Auto-generated protobuf definitions for monitoring service messages <li> Defines request/response structures for agent, poller, and gateway status reporting <li> Includes sweep completion tracking, chunked streaming, and agent configuration <li> Supports agent hello handshake with capabilities and heartbeat intervals</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-c4072f56f73e61992b2401a13598f9edcf11b255ba343c79cc629782b0b0dca0">+398/-0</a>  </td> </tr> </table></details></td></tr><tr><td>Additional files</td><td><details><summary>101 files</summary><table> <tr> <td>.bazelignore</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-a5641cd37d6ad98b32cdfce1980836cc68312277bc6a7052f55da02ada5bc6cf">+4/-0</a>      </td> </tr> <tr> <td>.bazelrc</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-544556920c45b42cbfe40159b082ce8af6bd929e492d076769226265f215832f">+2/-0</a>      </td> </tr> <tr> <td>.env.example</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-a3046da0d15a27e89f2afe639b25748a7ad4d9290af3e7b1b6c1a5533c8f0a8c">+38/-0</a>    </td> </tr> <tr> <td>AGENTS.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-a54ff182c7e8acf56acfd6e4b9c3ff41e2c41a31c9b211b2deb9df75d9a478f9">+171/-2</a>  </td> </tr> <tr> <td>MODULE.bazel</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-6136fc12446089c3db7360e923203dd114b6a1466252e71667c6791c20fe6bdc">+5/-0</a>      </td> </tr> <tr> <td>Makefile</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-76ed074a9305c04054cdebb9e9aad2d818052b07091de1f20cad0bbac34ffb52">+4/-0</a>      </td> </tr> <tr> <td>README-Docker.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-9fd61d24482efe68c22d8d41e2a1dcc440f39195aa56e7a050f2abe598179efd">+16/-1</a>    </td> </tr> <tr> <td>BUILD.bazel</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-884fa9353a5226345e44fbabea3300efc7a87dfbcde0b6a42521ca51823f1b68">+11/-0</a>    </td> </tr> <tr> <td>BUILD.bazel</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-0e80ea46aeb61a873324685edb96eae864c7a2004fbb7ee404b4ec951190ba10">+12/-0</a>    </td> </tr> <tr> <td>mix_release.bzl</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-86ec281f99363b6b6eb1f49e21d83b7eeca93a35b552b9f305fffc6855e38ccd">+120/-49</a></td> </tr> <tr> <td>BUILD.bazel</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-143f8d1549d52f28906f19ce28e5568a5be474470ff103c2c1e63c3e6b08d670">+1/-0</a>      </td> </tr> <tr> <td>config.json</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-5b1bc8fe77422534739bdd3a38dc20d2634a86c171265c34e1b5d0c5a61b6bab">+5/-6</a>      </td> </tr> <tr> <td>monitoring.proto</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-b56f709f4a0a3db694f2124353908318631f23e20b7846bc4b8ee869e2e0632a">+4/-1</a>      </td> </tr> <tr> <td>zen-consumer-with-otel.json</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-68375f1f7847e1fbdf75664f6be65b1ad94ae6ce86ed73fc5964d65054668acb">+1/-0</a>      </td> </tr> <tr> <td>zen-consumer.json</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-4d308af9802a93a0f656e8c02a3b5fcd8991407bb18360f087470db74e1f9524">+1/-0</a>      </td> </tr> <tr> <td>BUILD.bazel</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-c62c0139ebdb337369f4067567cd2c52b8e7decb3ddfabc77f9f67b2f6e5789c">+1/-0</a>      </td> </tr> <tr> <td>README.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-f425b4378f84e0ba0c6f532facff17ff5d55b4dc6033d8bf35130a159cd2ba32">+9/-12</a>    </td> </tr> <tr> <td>flowgger.toml</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-af9f49f931e282dca53d1f0521b036d222fe671f77e61a876a84cf4c6d7cca4d">+1/-0</a>      </td> </tr> <tr> <td>otel.toml</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-c64b9ace832b8ea57a2be62f84166e03bb1904882635d444ec76a880cdf14cc0">+1/-0</a>      </td> </tr> <tr> <td>BUILD.bazel</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-4f5d2ea4260d490a0d6f28adde0b35eca8af77d22f3ee366a783946c53687619">+0/-25</a>    </td> </tr> <tr> <td>config.json</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-bcac20d6b3cb81f0059e766839ba1ee59a885009249501b0ba1182ebb1daea25">+0/-77</a>    </td> </tr> <tr> <td>main.go</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-78dc6bc53f1c760c66f43ff5f486bfe78a65bee8b2e0d4862293ec0892da2b29">+0/-123</a>  </td> </tr> <tr> <td>docker-compose.elx.yml</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-9562070d7ad4a3e9b2d06567008cf35de1d96448d914b3b45bf6c36d97cdd914">+108/-0</a>  </td> </tr> <tr> <td>docker-compose.spiffe.yml</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-603fd9e7d40841d174f26b95d0cb0c9537430bf3f7a5da3ccbba4ea3d8ac66c9">+8/-55</a>    </td> </tr> <tr> <td>docker-compose.yml</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-e45e45baeda1c1e73482975a664062aa56f20c03dd9d64a827aba57775bed0d3">+298/-280</a></td> </tr> <tr> <td>Dockerfile.agent-gateway</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-332bc81a932ae08efa711a71b60fe0954d99bf17ebdab00a3baaa177a44de8b0">+94/-0</a>    </td> </tr> <tr> <td>Dockerfile.core-elx</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-5ec7a971285669999af442a0c7f141c34f7fd9180257307f5c4ed12f789a2182">+108/-0</a>  </td> </tr> <tr> <td>Dockerfile.poller-elx</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-966cdcf55b7defa4f9fd83a15c55a0e1106430e6e20926c1181187e95ed8d87e">+95/-0</a>    </td> </tr> <tr> <td>Dockerfile.sync</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-0227933b9961fd553af1d229e89d71a0271fdc475081bbcef49b587941af1eda">+0/-95</a>    </td> </tr> <tr> <td>Dockerfile.web-ng</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-92d43af1965575d56c3380ecc8a81024aac2ff36f039ec2d3839e9fc7852bc10">+6/-0</a>      </td> </tr> <tr> <td>agent-minimal.docker.json</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-1f09fad94636c90373af8e270f6ba0332ae4f4d1df50a4909729280a3a9691e6">+6/-6</a>      </td> </tr> <tr> <td>agent.docker.json</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-5d33fe703515d03076d31261ecf946e9c6fc668cf5bf65099d49b670739e455e">+5/-20</a>    </td> </tr> <tr> <td>agent.mtls.json</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-008f2216f159a9bd5db9cc90baaf6f1e64487df7af05b56ab3b9d6c4946aa95f">+7/-10</a>    </td> </tr> <tr> <td>.gitkeep</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-d72c41aab2d6f2c230a4340dfefe7917cdd12bed942c825aa0d4c9875a637bac">+1/-0</a>      </td> </tr> <tr> <td>datasvc.docker.json</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-3f2719d3dbfe042e8383739e3c78e74e5f851a44e5e46bea8e79c4b79fdcc34f">+3/-2</a>      </td> </tr> <tr> <td>datasvc.mtls.json</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-3a45619e57f1e6e9a31486ec7fffb33ef246e271f82bac272ee0a946b88da70a">+14/-1</a>    </td> </tr> <tr> <td>db-event-writer.mtls.json</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-7a33f95f7545499abf0ed9fc91b58499ab209639e4885019579c959583fc7496">+1/-0</a>      </td> </tr> <tr> <td>entrypoint-certs.sh</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-83d6800b184a5233c66c69766286b0a60fece1bc64addb112d9f8dc019437f05">+12/-8</a>    </td> </tr> <tr> <td>entrypoint-sync.sh</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-9d5620b8e6833309dbafb8ee6b6b75c3b942d163c3fe7f1a9827958b2d640265">+0/-96</a>    </td> </tr> <tr> <td>flowgger.docker.toml</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-824f8797b418d4b9f5ea41e4a3741a0ed64b881f343072464489a76b7ea01008">+2/-1</a>      </td> </tr> <tr> <td>generate-certs.sh</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-8298241543b4744a6ac7780c760ac5b5a0a87ba62de19c8612ebe1aba0996ebd">+217/-12</a></td> </tr> <tr> <td>nats.docker.conf</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-06f2012494f428fe1bfb304972061c2094e0d99da88ba9af6914f7776872e6eb">+16/-160</a></td> </tr> <tr> <td>netflow-consumer.mtls.json</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-f15920e8498a24f71ce3eec4f48fe8fefbb1765a90362998af779a660fcef9e1">+1/-0</a>      </td> </tr> <tr> <td>otel.docker.toml</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-d4af38790e3657b7589cd37a7539d5308b032f11caba7aa740ddc86bf99f4415">+2/-0</a>      </td> </tr> <tr> <td>pg_hba.conf</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-7bd5f7292054916c7e5997f4c84ac9ec07d4c945621a48936c2aed0575fb96eb">+9/-0</a>      </td> </tr> <tr> <td>pg_ident.conf</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-e7b8ce062e32c61fdc3bcc9e525c1f1df1c8008fbc02b11409e58c67baa17cc5">+17/-0</a>    </td> </tr> <tr> <td>bootstrap-compose-spire.sh</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-ca219a124d4c95ee7995764d7e0c322b4bfe59e357b7bcb42bc5d7c8b9b0af0d">+0/-1</a>      </td> </tr> <tr> <td>ssl_dist.core.conf</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-08d49d8621b581d1a9aa5c456f61e8c5774e021083c982cbb514019f915a1701">+17/-0</a>    </td> </tr> <tr> <td>ssl_dist.gateway.conf</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-4a43a8290d45ac68592000e7ef51afe78b4213090155bd42aafb46e66130f7ae">+17/-0</a>    </td> </tr> <tr> <td>ssl_dist.poller.conf</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-3373e5d02bd9a19c75819948cf829086dc9c02f41f30c78759beb7038c428d3e">+17/-0</a>    </td> </tr> <tr> <td>ssl_dist.web.conf</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-cef5be462ddb059fdfdeb9fd7c5cd70e656c4cd8b6ae1fe3fe312557b3da80ac">+17/-0</a>    </td> </tr> <tr> <td>sync.docker.json</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-4237fcee4f33a230abf28e12e8d4823499d163759cd1ff124fec1c62faa8b8b4">+0/-71</a>    </td> </tr> <tr> <td>sync.mtls.json</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-c652c07f7127be5b2932d92e6ef4c7448c544d1f3095cb96a03294fa58fd3c4c">+0/-75</a>    </td> </tr> <tr> <td>trapd.docker.json</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-1ab1a0e03e63bc02e0ef31992a7187a377927272ed2060150b40d44cc0ea3357">+2/-1</a>      </td> </tr> <tr> <td>zen.docker.json</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-e060a3164cdc2746e0d9ad000fcf43c4bcdb05f4a41c586d7220e2ff2a7df01d">+2/-1</a>      </td> </tr> <tr> <td>BUILD.bazel</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-0e4db31c224a8f72ae8e870a849e38a59d74a2c7f7b04347b0b3eb07e20c5a80">+80/-42</a>  </td> </tr> <tr> <td>push_targets.bzl</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-4af33fe62caba04b6d479589c16cfb85babc39bae5c92595d4d4e31660738513">+2/-1</a>      </td> </tr> <tr> <td>SBOM.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-1747f3c15088e6b4a3ff7fb1fdec9c2ef15ac865ee991b26ace892deba984eae">+0/-1</a>      </td> </tr> <tr> <td>agents.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-af8d04277f2353629065b0cc5fad3e44bd3e7c20339bd125e0812104bdbeff28">+8/-9</a>      </td> </tr> <tr> <td>architecture.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-90abd06467420fd89391fd1a4d75ceb1f6a9381de4d13a95fffe606abff38d37">+100/-26</a></td> </tr> <tr> <td>armis.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-3e5c728550851ae1cfb8754eb8a1566d8084ebc58646c73f7c29ac6f7d60adbf">+7/-21</a>    </td> </tr> <tr> <td>ash-api.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-1cb48a12148688dc640f42675d0b3c2458ecc40175f4ae7eef4b07bdc2ede3a3">+305/-0</a>  </td> </tr> <tr> <td>ash-authentication.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-e3d22dc8d6917661b980dd9c7e08bdc07fad4d225b10c60885b091a9cdd20425">+244/-0</a>  </td> </tr> <tr> <td>ash-authorization.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-72abeea2439e2281b6f998db2d11a73fe3ab393898125c466f0b16e7a49d3088">+283/-0</a>  </td> </tr> <tr> <td>ash-domains.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-bc78325c5ea4c6839212536a43336a231175450738af14f4173be663acf2fe49">+223/-0</a>  </td> </tr> <tr> <td>ash-migration-guide.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-e730373f07f6656d00327de02b4aa64e13809aadb3bf5478a4c60041ee431ad8">+339/-0</a>  </td> </tr> <tr> <td>auth-configuration.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-92b481e02c4edb9a178f51f2a8cb94108c7bcb1514a4629d8c67d8f20a70e852">+1/-1</a>      </td> </tr> <tr> <td>cluster.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-0d2e32a530d31d08b311213cd036c14e611d2f9a31e9cbcaa97f02b53edeeb44">+2/-2</a>      </td> </tr> <tr> <td>discovery.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-a7f02a003a0e9aaf09d6aa072689ca8ad8b7b298ee014c6ffb878d2a5ecbab28">+19/-20</a>  </td> </tr> <tr> <td>docker-setup.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-8604269dffb3ce4133e48cab374ca8e97745d0efbdef67cad792aeb5945fe5ec">+57/-14</a>  </td> </tr> <tr> <td>edge-agents.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-df8f4b6b6cd8fb926f9204b2bab85ec4f393286d1b25d8a62d2debd357dd651d">+253/-0</a>  </td> </tr> <tr> <td>edge-onboarding.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-f32fbefcc4dd23b0f2146015630a411099012e8bf56ea7b9d52ae8b2a83ac3e3">+316/-273</a></td> </tr> <tr> <td>identity_drift_monitoring.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-963ddfdc868c8777f9004744f5d2c5ec3ce585b81d1c0de2269b9d7a73942a9f">+1/-1</a>      </td> </tr> <tr> <td>installation.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-19fd876fb56da88a1becb80817b52db1a40e8491641d547913b93bb7ecda3d22">+13/-14</a>  </td> </tr> <tr> <td>intro.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-5ddf6bcf52fe8ef8dfc865f5f2295616d5c073d899766372e5929cc0fd479ffd">+5/-5</a>      </td> </tr> <tr> <td>kv-configuration.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-24c06e4806fa0ec5508ae5aeab9fbe82be7122af9933e6b7970032576cf5f6f8">+3/-3</a>      </td> </tr> <tr> <td>netbox.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-ee0f0a76b85416a7c6e086eca04b7991c176075816b51f442ddb94ea6506dc68">+4/-23</a>    </td> </tr> <tr> <td>netflow.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-a64260e87bce20ed3bd93c7b64dac6ff175cdc7c67071276601aa9f6d370d634">+1/-1</a>      </td> </tr> <tr> <td>onboarding-review-2025.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-1728c73ae8e684d13b7b166b90cbfc5363c0794dcd63f7fa3733e1c624940e98">+0/-1</a>      </td> </tr> <tr> <td>rperf-monitoring.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-70d068dec062c5ed771e7294401c52d55e85e279e9eee4f1084ca06c23b2fde8">+1/-1</a>      </td> </tr> <tr> <td>compose-mtls-sysmonosx.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-6faf517c376b3509d0b14235f1b674a825f0399cd8a20791d8caf3b6ef16525b">+1/-1</a>      </td> </tr> <tr> <td>docker-compose-login-500.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-d3ec1ba25f119cff4efcb5767a1afe923b92fd5727453c6276030bc8e9923f05">+3/-3</a>      </td> </tr> <tr> <td>security-architecture.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-49e33c2fc609030af8a64af34016a22e9e86c0f3781acec5a9e6d0f7ccf5dc09">+238/-0</a>  </td> </tr> <tr> <td>self-signed.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-92747f4c31b991032f6920205d2f9751d89568323d4c65f5d1bd2406beb7a28a">+32/-107</a></td> </tr> <tr> <td>service-port-map.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-81eabc52b9d300cbb6414b5f1f55cecb85adfe891bdb58cf138ce4dc115c48ee">+3/-5</a>      </td> </tr> <tr> <td>snmp.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-5d2a72888517a4c731287280253a29cedbcbe595f1ce57c2fa14aef37c62595a">+2/-2</a>      </td> </tr> <tr> <td>spiffe-identity.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-6a3f0bdce99eb15d1075805cb5101594398f81ad92aa4eea9b27bb7545098db1">+52/-0</a>    </td> </tr> <tr> <td>srql-service.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-9b2d1be11daf488059f9debfbdd2b5ebe51286a7e2c8b0d134112a6c51132ba6">+2/-2</a>      </td> </tr> <tr> <td>sync.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-a498de557ae883d480609bf07629a7eee78ec86b5cf5ee966508e75713973a9e">+75/-482</a></td> </tr> <tr> <td>syslog.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-8efe898141f64a74b80133318b56ed95d03703ac95b47cebc95978b7ef761a4a">+1/-1</a>      </td> </tr> <tr> <td>troubleshooting-guide.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-343b87dd0b430b3f99b16d32200c353bb6e3d7bbb185da3c1b3effc3a03e7f2a">+53/-2</a>    </td> </tr> <tr> <td>web-ui.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-7e4dc17ccc5b2fd5773b7e6c9583dbe0f597ab9ae21849a2ec0a31554a6337a3">+31/-106</a></td> </tr> <tr> <td>datasvc.ex</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-00c928e77e8bf2f741a35e42e304acd2f2d4b8546532a5125a717924f2bde5dc">+32/-2</a>    </td> </tr> <tr> <td>BUILD.bazel</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-aa8f6a4e89041a3423f2299efab4bb1b5d13358d445b3c6e35cc46f527dbaa7b">+30/-0</a>    </td> </tr> <tr> <td>config.exs</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-abaac0167f05b505cf4bc5c5d375ae769001d8cc20120919941748c79681bbae">+13/-0</a>    </td> </tr> <tr> <td>dev.exs</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-378e53cb66373073730a5d7ceba4973777eb84ae419c8a38472e3f69c3056fe9">+4/-0</a>      </td> </tr> <tr> <td>prod.exs</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-5fdc859dfda96a02326392f61218325913f22f9f0c75a1276ee940d5db9fc62b">+4/-0</a>      </td> </tr> <tr> <td>runtime.exs</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-842568fafa717a8064203543d674517fd28ad7dd2a4d3f0f157d274cfda4f18b">+209/-0</a>  </td> </tr> <tr> <td>test.exs</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-00c8b58b49024f5d6f37ba40ac310a571e5a1b47fc16819423dc42f3035bc175">+20/-0</a>    </td> </tr> <tr> <td>agent_client.ex</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-82bf77f47ac3b0001b036f0f22944d07bf894f1ee07b912e0ecab959eb8d5c53">+472/-0</a>  </td> </tr> <tr> <td>Additional files not shown</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2225/files#diff-2f328e4cd8dbe3ad193e49d92bcf045f47a6b72b1e9487d366f6b8288589b4ca"></a></td> </tr> </table></details></td></tr></tbody></table> </details> ___

qodo-code-review[bot] commented

2026-01-05 21:18:37 +00:00

(Migrated from github.com)

Author

Owner

Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2225#issuecomment-3712121318
Original created: 2026-01-05T21:18:37Z

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
⚪	Sensitive info exposure Description: The new logging emits potentially sensitive operational details (e.g., `AllowedClientIdentities`, resolver/system creds file paths, and operator name) which could aid attackers or leak internal identity/credential placement if logs are exposed; similar creds-path logging was also added in `cmd/otel/src/nats_output.rs` and `cmd/otel/src/setup.rs`. main.go [82-141] Referred Code // Initialize NATS account service // This service is stateless - it only holds the operator key for signing operations. // Account state (seeds, JWTs) is stored by Elixir in CNPG with AshCloak encryption. // The service can start without an operator and bootstrap later via gRPC. var natsAccountServer *datasvc.NATSAccountServer if cfg.NATSOperator != nil { operator, opErr := accounts.NewOperator(cfg.NATSOperator) if opErr != nil { // Operator not available yet - that's okay, bootstrap will be called later log.Printf("NATS account service starting without operator (will bootstrap later): %v", opErr) natsAccountServer = datasvc.NewNATSAccountServer(nil) } else { natsAccountServer = datasvc.NewNATSAccountServer(operator) log.Printf("NATS account service initialized with operator %s", operator.Name()) } natsAccountServer.SetAllowedClientIdentities(cfg.NATSOperator.AllowedClientIdentities) if len(cfg.NATSOperator.AllowedClientIdentities) == 0 { log.Printf("Warning: no allowed client identities configured for NATS account service; requests will be rejected") } else { log.Printf("NATS account service allowed identities: %v", cfg.NATSOperator.AllowedClientIdentities) ... (clipped 39 lines)
Ticket Compliance
⚪	🎫 No ticket provided Create ticket/issue
Codebase Duplication Compliance
⚪	Codebase context is not defined Follow the guide to enable codebase context checks.
Custom Compliance
🟢	Generic: Meaningful Naming and Self-Documenting Code Objective: Ensure all identifiers clearly express their purpose and intent, making code self-documenting Status: Passed Learn more about managing compliance generic rules or creating your own custom rules
	Generic: Secure Error Handling Objective: To prevent the leakage of sensitive system information through error messages while providing sufficient detail for internal debugging. Status: Passed Learn more about managing compliance generic rules or creating your own custom rules
	Generic: Security-First Input Validation and Data Handling Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent vulnerabilities Status: Passed Learn more about managing compliance generic rules or creating your own custom rules
🔴	Generic: Robust Error Handling and Edge Case Management Objective: Ensure comprehensive error handling that provides meaningful context and graceful degradation Status: Potential panic unwrap: New config parsing uses `as_str().unwrap()` which can panic on unexpected config types rather than returning a handled error with context. Referred Code `let creds_file = cfg.lookup("output.nats_creds_file").and_then(\|v\| { let value = v.as_str().unwrap().trim(); if value.is_empty() { None } else { Some(PathBuf::from(value)) } });` Learn more about managing compliance generic rules or creating your own custom rules
⚪	Generic: Comprehensive Audit Trails Objective: To create a detailed and reliable record of critical system actions for security analysis and compliance. Status: Missing audit context: The newly added NATS account service initialization implies privileged signing/bootstrapping operations but the diff does not show corresponding audit logging that captures who performed such actions and their outcomes. Referred Code // Initialize NATS account service // This service is stateless - it only holds the operator key for signing operations. // Account state (seeds, JWTs) is stored by Elixir in CNPG with AshCloak encryption. // The service can start without an operator and bootstrap later via gRPC. var natsAccountServer *datasvc.NATSAccountServer if cfg.NATSOperator != nil { operator, opErr := accounts.NewOperator(cfg.NATSOperator) if opErr != nil { // Operator not available yet - that's okay, bootstrap will be called later log.Printf("NATS account service starting without operator (will bootstrap later): %v", opErr) natsAccountServer = datasvc.NewNATSAccountServer(nil) } else { natsAccountServer = datasvc.NewNATSAccountServer(operator) log.Printf("NATS account service initialized with operator %s", operator.Name()) } natsAccountServer.SetAllowedClientIdentities(cfg.NATSOperator.AllowedClientIdentities) if len(cfg.NATSOperator.AllowedClientIdentities) == 0 { log.Printf("Warning: no allowed client identities configured for NATS account service; requests will be rejected") } else { log.Printf("NATS account service allowed identities: %v", cfg.NATSOperator.AllowedClientIdentities) ... (clipped 57 lines) Learn more about managing compliance generic rules or creating your own custom rules
	Generic: Secure Logging Practices Objective: To ensure logs are useful for debugging and auditing without exposing sensitive information like PII, PHI, or cardholder data. Status: Sensitive config in logs: The new logging statements print potentially sensitive security configuration details (e.g., allowed client identities and credential/config file paths) which may be considered sensitive operational information in some environments. Referred Code // Initialize NATS account service // This service is stateless - it only holds the operator key for signing operations. // Account state (seeds, JWTs) is stored by Elixir in CNPG with AshCloak encryption. // The service can start without an operator and bootstrap later via gRPC. var natsAccountServer *datasvc.NATSAccountServer if cfg.NATSOperator != nil { operator, opErr := accounts.NewOperator(cfg.NATSOperator) if opErr != nil { // Operator not available yet - that's okay, bootstrap will be called later log.Printf("NATS account service starting without operator (will bootstrap later): %v", opErr) natsAccountServer = datasvc.NewNATSAccountServer(nil) } else { natsAccountServer = datasvc.NewNATSAccountServer(operator) log.Printf("NATS account service initialized with operator %s", operator.Name()) } natsAccountServer.SetAllowedClientIdentities(cfg.NATSOperator.AllowedClientIdentities) if len(cfg.NATSOperator.AllowedClientIdentities) == 0 { log.Printf("Warning: no allowed client identities configured for NATS account service; requests will be rejected") } else { log.Printf("NATS account service allowed identities: %v", cfg.NATSOperator.AllowedClientIdentities) ... (clipped 39 lines) Learn more about managing compliance generic rules or creating your own custom rules
Update

Compliance status legend

🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

Imported GitHub PR comment. Original author: @qodo-code-review[bot] Original URL: https://github.com/carverauto/serviceradar/pull/2225#issuecomment-3712121318 Original created: 2026-01-05T21:18:37Z --- ## PR Compliance Guide 🔍  Below is a summary of compliance checks for this PR: <table><tbody><tr><td colspan='2'>Security Compliance</td></tr> <tr><td rowspan=1>⚪</td> <td><details><summary>Sensitive info exposure </summary> Description: The new logging emits potentially sensitive operational details (e.g., <code>AllowedClientIdentities</code>, resolver/system creds file paths, and operator name) which could aid attackers or leak internal identity/credential placement if logs are exposed; similar creds-path logging was also added in <code>cmd/otel/src/nats_output.rs</code> and <code>cmd/otel/src/setup.rs</code>. <a href='https://github.com/carverauto/serviceradar/pull/2225/files#diff-5e7731adfb877918cd65d9d5531621312496450fd550fea2682efca4ca8fe816R82-R141'>main.go [82-141]</a> <details open><summary>Referred Code</summary> ```go // Initialize NATS account service // This service is stateless - it only holds the operator key for signing operations. // Account state (seeds, JWTs) is stored by Elixir in CNPG with AshCloak encryption. // The service can start without an operator and bootstrap later via gRPC. var natsAccountServer *datasvc.NATSAccountServer if cfg.NATSOperator != nil { operator, opErr := accounts.NewOperator(cfg.NATSOperator) if opErr != nil { // Operator not available yet - that's okay, bootstrap will be called later log.Printf("NATS account service starting without operator (will bootstrap later): %v", opErr) natsAccountServer = datasvc.NewNATSAccountServer(nil) } else { natsAccountServer = datasvc.NewNATSAccountServer(operator) log.Printf("NATS account service initialized with operator %s", operator.Name()) } natsAccountServer.SetAllowedClientIdentities(cfg.NATSOperator.AllowedClientIdentities) if len(cfg.NATSOperator.AllowedClientIdentities) == 0 { log.Printf("Warning: no allowed client identities configured for NATS account service; requests will be rejected") } else { log.Printf("NATS account service allowed identities: %v", cfg.NATSOperator.AllowedClientIdentities) ... (clipped 39 lines) ``` </details></details></td></tr> <tr><td colspan='2'>Ticket Compliance</td></tr> <tr><td>⚪</td><td><details><summary>🎫 No ticket provided </summary> - [ ] Create ticket/issue  </details></td></tr> <tr><td colspan='2'>Codebase Duplication Compliance</td></tr> <tr><td>⚪</td><td><details><summary>Codebase context is not defined </summary> Follow the <a href='https://qodo-merge-docs.qodo.ai/core-abilities/rag_context_enrichment/'>guide</a> to enable codebase context checks. </details></td></tr> <tr><td colspan='2'>Custom Compliance</td></tr> <tr><td rowspan=3>🟢</td><td> <details><summary>Generic: Meaningful Naming and Self-Documenting Code</summary> **Objective:** Ensure all identifiers clearly express their purpose and intent, making code self-documenting **Status:** Passed > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td> <details><summary>Generic: Secure Error Handling</summary> **Objective:** To prevent the leakage of sensitive system information through error messages while providing sufficient detail for internal debugging. **Status:** Passed > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td> <details><summary>Generic: Security-First Input Validation and Data Handling</summary> **Objective:** Ensure all data inputs are validated, sanitized, and handled securely to prevent vulnerabilities **Status:** Passed > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td rowspan=1>🔴</td> <td><details> <summary>Generic: Robust Error Handling and Edge Case Management</summary> **Objective:** Ensure comprehensive error handling that provides meaningful context and graceful degradation **Status:** <a href='https://github.com/carverauto/serviceradar/pull/2225/files#diff-a82e2e4d413539bf0b414b5629665b19648447523994cba639c4d1238aa5a0c1R86-R93'>Potential panic unwrap</a>: New config parsing uses <code>as_str().unwrap()</code> which can panic on unexpected config types rather than returning a handled error with context. <details open><summary>Referred Code</summary> ```rust let creds_file = cfg.lookup("output.nats_creds_file").and_then(|v| { let value = v.as_str().unwrap().trim(); if value.is_empty() { None } else { Some(PathBuf::from(value)) } }); ``` </details> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td rowspan=2>⚪</td> <td><details> <summary>Generic: Comprehensive Audit Trails</summary> **Objective:** To create a detailed and reliable record of critical system actions for security analysis and compliance. **Status:** <a href='https://github.com/carverauto/serviceradar/pull/2225/files#diff-5e7731adfb877918cd65d9d5531621312496450fd550fea2682efca4ca8fe816R82-R159'>Missing audit context</a>: The newly added NATS account service initialization implies privileged signing/bootstrapping operations but the diff does not show corresponding audit logging that captures who performed such actions and their outcomes. <details open><summary>Referred Code</summary> ```go // Initialize NATS account service // This service is stateless - it only holds the operator key for signing operations. // Account state (seeds, JWTs) is stored by Elixir in CNPG with AshCloak encryption. // The service can start without an operator and bootstrap later via gRPC. var natsAccountServer *datasvc.NATSAccountServer if cfg.NATSOperator != nil { operator, opErr := accounts.NewOperator(cfg.NATSOperator) if opErr != nil { // Operator not available yet - that's okay, bootstrap will be called later log.Printf("NATS account service starting without operator (will bootstrap later): %v", opErr) natsAccountServer = datasvc.NewNATSAccountServer(nil) } else { natsAccountServer = datasvc.NewNATSAccountServer(operator) log.Printf("NATS account service initialized with operator %s", operator.Name()) } natsAccountServer.SetAllowedClientIdentities(cfg.NATSOperator.AllowedClientIdentities) if len(cfg.NATSOperator.AllowedClientIdentities) == 0 { log.Printf("Warning: no allowed client identities configured for NATS account service; requests will be rejected") } else { log.Printf("NATS account service allowed identities: %v", cfg.NATSOperator.AllowedClientIdentities) ... (clipped 57 lines) ``` </details> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td><details> <summary>Generic: Secure Logging Practices</summary> **Objective:** To ensure logs are useful for debugging and auditing without exposing sensitive information like PII, PHI, or cardholder data. **Status:** <a href='https://github.com/carverauto/serviceradar/pull/2225/files#diff-5e7731adfb877918cd65d9d5531621312496450fd550fea2682efca4ca8fe816R82-R141'>Sensitive config in logs</a>: The new logging statements print potentially sensitive security configuration details (e.g., allowed client identities and credential/config file paths) which may be considered sensitive operational information in some environments. <details open><summary>Referred Code</summary> ```go // Initialize NATS account service // This service is stateless - it only holds the operator key for signing operations. // Account state (seeds, JWTs) is stored by Elixir in CNPG with AshCloak encryption. // The service can start without an operator and bootstrap later via gRPC. var natsAccountServer *datasvc.NATSAccountServer if cfg.NATSOperator != nil { operator, opErr := accounts.NewOperator(cfg.NATSOperator) if opErr != nil { // Operator not available yet - that's okay, bootstrap will be called later log.Printf("NATS account service starting without operator (will bootstrap later): %v", opErr) natsAccountServer = datasvc.NewNATSAccountServer(nil) } else { natsAccountServer = datasvc.NewNATSAccountServer(operator) log.Printf("NATS account service initialized with operator %s", operator.Name()) } natsAccountServer.SetAllowedClientIdentities(cfg.NATSOperator.AllowedClientIdentities) if len(cfg.NATSOperator.AllowedClientIdentities) == 0 { log.Printf("Warning: no allowed client identities configured for NATS account service; requests will be rejected") } else { log.Printf("NATS account service allowed identities: %v", cfg.NATSOperator.AllowedClientIdentities) ... (clipped 39 lines) ``` </details> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td align="center" colspan="2"> - [ ] Update  </td></tr></tbody></table> <details><summary>Compliance status legend</summary> 🟢 - Fully Compliant 🟡 - Partial Compliant 🔴 - Not Compliant ⚪ - Requires Further Human Verification 🏷️ - Compliance label </details>

qodo-code-review[bot] commented

2026-01-05 21:20:36 +00:00

(Migrated from github.com)

Author

Owner

Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2225#issuecomment-3712126644
Original created: 2026-01-05T21:20:36Z

PR Code Suggestions ✨

Explore these optional code suggestions:

Category	Suggestion	Impact
Possible issue	Use correct attribute for timestamp Correct the attribute name from `:last_seen_time` to `:last_seen_at` in the `persist_identity` function to prevent device update failures. elixir/serviceradar_core/lib/serviceradar/actors/device.ex [503-523] defp persist_identity(tenant_id, device_id, identity) do case DeviceResource.get_by_uid(device_id, tenant: tenant_id, authorize?: false) do {:ok, device} -> updates = identity \|> Map.take([:name, :hostname, :ip, :mac, :vendor_name, :model, :os]) - \|> Map.put(:last_seen_time, DateTime.utc_now()) + \|> Map.put(:last_seen_at, DateTime.utc_now()) case Ash.update(device, updates, authorize?: false) do {:ok, _} -> :ok {:error, error} -> {:error, error} end {:error, %Ash.Error.Query.NotFound{}} -> # Device doesn't exist yet, nothing to persist :ok {:error, error} -> {:error, error} end end `[To ensure code accuracy, apply this suggestion manually]` Suggestion importance[1-10]: 9 __ Why: This suggestion identifies a critical bug where an incorrect attribute name (`:last_seen_time` instead of `:last_seen_at`) would cause all device identity updates to fail. Fixing this is essential for the core functionality of the device actor.	High
	✅ ~~Fix race condition in shutdown logic~~ Suggestion Impact: The commit introduces a shared shutdownCtx with a timeout and uses it in the shutdown goroutine and the final select (replacing the separate timer), matching the suggested approach to avoid premature timeout races. It also extends the idea by passing shutdownCtx into pushLoop.Stop and guarding the errChan wait with shutdownCtx. code diff: // Bound shutdown so the process can't hang forever (includes pushLoop.Stop()). const shutdownTimeout = 10 * time.Second + shutdownCtx, shutdownCancel := context.WithTimeout(context.Background(), shutdownTimeout) + defer shutdownCancel() shutdownDone := make(chan struct{}) go func() { defer close(shutdownDone) cancel() - pushLoop.Stop() - <-errChan - - stopCtx, stopCancel := context.WithTimeout(context.Background(), shutdownTimeout) - defer stopCancel() - if err := server.Stop(stopCtx); err != nil { + if err := pushLoop.Stop(shutdownCtx); err != nil { + log.Warn().Err(err).Msg("Push loop stop did not complete before timeout") + } + select { + case <-errChan: + case <-shutdownCtx.Done(): + return + } + + if err := server.Stop(shutdownCtx); err != nil { log.Error().Err(err).Msg("Error stopping agent services") } }() - - timer := time.NewTimer(shutdownTimeout) - defer timer.Stop() select { case <-shutdownDone: - case <-timer.C: + case <-shutdownCtx.Done(): return fmt.Errorf("%w after %s", errShutdownTimeout, shutdownTimeout) Refactor the shutdown logic to use a single parent context with a timeout for all shutdown operations, preventing a race condition where the main goroutine could time out prematurely. cmd/agent/main.go [201-226] // Bound shutdown so the process can't hang forever (includes pushLoop.Stop()). const shutdownTimeout = 10 * time.Second + shutdownCtx, shutdownCancel := context.WithTimeout(context.Background(), shutdownTimeout) + defer shutdownCancel() shutdownDone := make(chan struct{}) go func() { defer close(shutdownDone) cancel() pushLoop.Stop() <-errChan - stopCtx, stopCancel := context.WithTimeout(context.Background(), shutdownTimeout) - defer stopCancel() - if err := server.Stop(stopCtx); err != nil { + // Use the shutdownCtx for stopping the server to respect the overall timeout. + if err := server.Stop(shutdownCtx); err != nil { log.Error().Err(err).Msg("Error stopping agent services") } }() - timer := time.NewTimer(shutdownTimeout) - defer timer.Stop() - select { case <-shutdownDone: - case <-timer.C: + // Shutdown completed gracefully. + case <-shutdownCtx.Done(): return fmt.Errorf("%w after %s", errShutdownTimeout, shutdownTimeout) } `[Suggestion processed]` Suggestion importance[1-10]: 8 __ Why: The suggestion correctly identifies and fixes a race condition in the shutdown logic, ensuring all shutdown operations are bound by a single, shared timeout, which improves the agent's robustness.	Medium
	✅ ~~Ensure state transitions are atomic~~ Suggestion Impact: The commit removed `require_atomic? false` from multiple state-transition update actions including `establish_connection` (and others like connection_failed, degrade, restore_health, etc.), aligning those transitions with atomic execution. code diff: @@ -282,7 +307,6 @@ update :establish_connection do description "Mark agent as connected (from connecting state)" accept [:poller_id] - require_atomic? false change transition_state(:connected) change set_attribute(:is_healthy, true) @@ -293,7 +317,6 @@ update :connection_failed do description "Mark connection attempt as failed" - require_atomic? false change transition_state(:disconnected) change set_attribute(:modified_time, &DateTime.utc_now/0) @@ -302,7 +325,6 @@ update :degrade do description "Mark agent as degraded (connected but unhealthy)" - require_atomic? false change transition_state(:degraded) change set_attribute(:is_healthy, false) @@ -312,7 +334,6 @@ update :restore_health do description "Restore agent health (from degraded to connected)" - require_atomic? false change transition_state(:connected) change set_attribute(:is_healthy, true) @@ -322,7 +343,6 @@ update :lose_connection do description "Mark agent as disconnected (connection lost)" - require_atomic? false change transition_state(:disconnected) change set_attribute(:poller_id, nil) @@ -332,7 +352,6 @@ update :reconnect do description "Start reconnection process (from disconnected to connecting)" - require_atomic? false change transition_state(:connecting) change set_attribute(:modified_time, &DateTime.utc_now/0) @@ -342,7 +361,6 @@ update :mark_unavailable do description "Mark agent as unavailable (admin action)" argument :reason, :string - require_atomic? false change transition_state(:unavailable) change set_attribute(:is_healthy, false) @@ -352,7 +370,6 @@ update :recover do description "Start recovery process (from unavailable to connecting)" - require_atomic? false change transition_state(:connecting) change set_attribute(:modified_time, &DateTime.utc_now/0) @@ -363,7 +380,6 @@ update :connect do description "Mark agent as connected to a poller (legacy - use establish_connection)" accept [:poller_id] - require_atomic? false change transition_state(:connected) change set_attribute(:is_healthy, true) @@ -374,7 +390,6 @@ update :disconnect do description "Mark agent as disconnected (legacy - use lose_connection)" - require_atomic? false change transition_state(:disconnected) change set_attribute(:poller_id, nil) @@ -384,7 +399,6 @@ update :mark_unhealthy do description "Mark agent as unhealthy (legacy - use degrade)" - require_atomic? false change transition_state(:degraded) change set_attribute(:is_healthy, false) Remove `require_atomic? false` from all state-transition actions like `establish_connection` to ensure atomicity and prevent data inconsistency. State transitions are critical and should be transactional. elixir/serviceradar_core/lib/serviceradar/infrastructure/agent.ex [270-292] update :heartbeat do description "Update last_seen_time and health status (for connected agents)" accept [:is_healthy, :capabilities] require_atomic? false change set_attribute(:last_seen_time, &DateTime.utc_now/0) change set_attribute(:modified_time, &DateTime.utc_now/0) end # State machine transition actions # Each action includes PublishStateChange to emit NATS events update :establish_connection do description "Mark agent as connected (from connecting state)" accept [:poller_id] - require_atomic? false change transition_state(:connected) change set_attribute(:is_healthy, true) change set_attribute(:last_seen_time, &DateTime.utc_now/0) change set_attribute(:modified_time, &DateTime.utc_now/0) change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :agent, new_state: :connected} end `[To ensure code accuracy, apply this suggestion manually]` Suggestion importance[1-10]: 8 __ Why: The suggestion correctly identifies that using `require_atomic? false` for state transition actions can lead to data inconsistency, which is a significant correctness issue. Ensuring atomicity for these critical operations is crucial for maintaining a valid state.	Medium
	✅ ~~Combine create and update into one~~ Suggestion Impact: The commit removed the separate update call that set token fields after creation, and instead applies the token attributes directly on the create changeset (via force_change_attribute) before a single Ash.create, making the operation effectively atomic and avoiding inconsistent intermediate state. code diff: + changeset = + OnboardingPackage + \|> Ash.Changeset.for_create(:create, create_attrs, + actor: actor, + authorize?: authorize?, + tenant: tenant + ) + \|> Ash.Changeset.force_change_attribute(:join_token_ciphertext, join_token_ciphertext) + \|> Ash.Changeset.force_change_attribute(:join_token_expires_at, join_expires) + \|> Ash.Changeset.force_change_attribute(:download_token_hash, download_token_hash) + \|> Ash.Changeset.force_change_attribute(:download_token_expires_at, download_expires) + + case Ash.create(changeset) do {:ok, package} -> - # Update with token fields - token_attrs = %{ - join_token_ciphertext: join_token_ciphertext, - join_token_expires_at: join_expires, - download_token_hash: download_token_hash, - download_token_expires_at: download_expires - } - - case package - \|> Ash.Changeset.for_update(:update_tokens, token_attrs, - actor: actor, - authorize?: authorize?, - tenant: tenant - ) - \|> Ash.update() do - {:ok, updated_package} -> - # Record creation event - OnboardingEvents.record(package.id, :created, - actor: get_actor_name(actor), - source_ip: source_ip - ) - - {:ok, - %{ - package: updated_package, - join_token: join_token, - download_token: download_token - }} - - {:error, error} -> - {:error, error} - end + # Record creation event + OnboardingEvents.record(package.id, :created, + actor: get_actor_name(actor), + source_ip: source_ip + ) + + {:ok, + %{ + package: package, + join_token: join_token, + download_token: download_token + }} Refactor the `create` function to be atomic by combining the "create-then-update" logic into a single operation. Pass all attributes, including tokens, to the `:create` action to prevent data inconsistency. elixir/serviceradar_core/lib/serviceradar/edge/onboarding_packages.ex [156-199] -case OnboardingPackage - \|> Ash.Changeset.for_create(:create, create_attrs, - actor: actor, - authorize?: authorize?, - tenant: tenant - ) - \|> Ash.create() do +# Encrypt join token and hash download token +join_token_ciphertext = Crypto.encrypt(join_token) +download_token_hash = Crypto.hash_token(download_token) + +# Prepare attributes for Ash create +create_attrs = + attrs + \|> Map.put(:created_by, get_actor_name(actor)) + \|> Map.put(:join_token_ciphertext, join_token_ciphertext) + \|> Map.put(:join_token_expires_at, join_expires) + \|> Map.put(:download_token_hash, download_token_hash) + \|> Map.put(:download_token_expires_at, download_expires) + +case OnboardingPackage.create(create_attrs, actor: actor, authorize?: authorize?, tenant: tenant) do {:ok, package} -> - # Update with token fields - token_attrs = %{ - join_token_ciphertext: join_token_ciphertext, - join_token_expires_at: join_expires, - download_token_hash: download_token_hash, - download_token_expires_at: download_expires - } + # Record creation event + OnboardingEvents.record(package.id, :created, + actor: get_actor_name(actor), + source_ip: source_ip + ) - case package - \|> Ash.Changeset.for_update(:update_tokens, token_attrs, - actor: actor, - authorize?: authorize?, - tenant: tenant - ) - \|> Ash.update() do - {:ok, updated_package} -> - # Record creation event - OnboardingEvents.record(package.id, :created, - actor: get_actor_name(actor), - source_ip: source_ip - ) - - {:ok, - %{ - package: updated_package, - join_token: join_token, - download_token: download_token - }} - - {:error, error} -> - {:error, error} - end + {:ok, + %{ + package: package, + join_token: join_token, + download_token: download_token + }} {:error, error} -> {:error, error} end `[To ensure code accuracy, apply this suggestion manually]` Suggestion importance[1-10]: 8 __ Why: The suggestion correctly points out a non-atomic "create-then-update" pattern that could lead to inconsistent data. Consolidating this into a single atomic `create` operation is a critical improvement for data integrity.	Medium
	✅ ~~Prevent race condition during updates~~ Suggestion Impact: The commit removed the separate conditional DeviceAliasState.confirm/2 call and instead passed confirm_threshold into DeviceAliasState.record_sighting/3, implying the confirmation decision is now handled within the sighting recording operation (moving toward the suggested atomic consolidation), though it did not introduce the specifically suggested atomic_sighting_and_confirm function or the suggested return-shape handling. code diff: `# Record the sighting - case DeviceAliasState.record_sighting(existing, %{}, actor: actor) do + case DeviceAliasState.record_sighting( + existing, + %{confirm_threshold: confirm_threshold}, + actor: actor + ) do {:ok, updated} -> - # Check if we should confirm - if updated.state == :detected and updated.sighting_count >= confirm_threshold do - DeviceAliasState.confirm(updated, actor: actor) - end` Prevent a race condition in `handle_existing_alias/4` by consolidating the logic for recording a sighting and confirming the state into a single, atomic action on the `DeviceAliasState` resource. elixir/serviceradar_core/lib/serviceradar/identity/alias_events.ex [438-474] defp handle_existing_alias(existing, _update, confirm_threshold, actor) do alias ServiceRadar.Identity.DeviceAliasState - # Record the sighting - case DeviceAliasState.record_sighting(existing, %{}, actor: actor) do - {:ok, updated} -> - # Check if we should confirm - if updated.state == :detected and updated.sighting_count >= confirm_threshold do - DeviceAliasState.confirm(updated, actor: actor) - end + # Atomically record sighting and potentially confirm + case DeviceAliasState.atomic_sighting_and_confirm(existing, %{confirm_threshold: confirm_threshold}, actor: actor) do + {:ok, %{data: %{state_changed: true, new_state: new_state, sighting_count: count}}} -> + # Generate event for state change + %{ + device_id: existing.device_id, + partition: existing.partition \|\| "", + action: "alias_state_changed", + reason: "state_transition", + timestamp: DateTime.utc_now(), + severity: "Info", + level: 6, + metadata: %{ + "alias_type" => to_string(existing.alias_type), + "alias_value" => existing.alias_value, + "previous_state" => to_string(existing.state), + "new_state" => to_string(new_state), + "sighting_count" => to_string(count) + } + } - # Generate event if state changed - if updated.state != existing.state do - %{ - device_id: existing.device_id, - partition: existing.partition \|\| "", - action: "alias_state_changed", - reason: "state_transition", - timestamp: DateTime.utc_now(), - severity: "Info", - level: 6, - metadata: %{ - "alias_type" => to_string(existing.alias_type), - "alias_value" => existing.alias_value, - "previous_state" => to_string(existing.state), - "new_state" => to_string(updated.state), - "sighting_count" => to_string(updated.sighting_count) - } - } - else - nil - end + {:ok, _} -> + # Sighting recorded, but no state change + nil {:error, _reason} -> nil end end `[To ensure code accuracy, apply this suggestion manually]` Suggestion importance[1-10]: 8 __ Why: The suggestion correctly identifies a race condition in `handle_existing_alias/4` due to non-atomic read-modify-write operations. Consolidating the sighting and confirmation logic into a single atomic action is crucial to prevent incorrect state transitions.	Medium
	Conditionally enable SSL for gRPC Conditionally enable SSL for the gRPC connection based on an environment variable, allowing for both secure and insecure deployments. elixir/serviceradar_core/lib/serviceradar/nats/account_client.ex [487-518] defp create_fresh_channel do host = System.get_env("DATASVC_HOST", "datasvc") port = String.to_integer(System.get_env("DATASVC_PORT", "50057")) + ssl_enabled = System.get_env("DATASVC_SSL", "false") in ["true", "1", "yes"] cert_dir = System.get_env("DATASVC_CERT_DIR", "/etc/serviceradar/certs") cert_name = System.get_env("DATASVC_CERT_NAME", "core") server_name = System.get_env("DATASVC_SERVER_NAME", "datasvc.serviceradar") endpoint = "#{host}:#{port}" - ssl_opts = [ - cacertfile: String.to_charlist(Path.join(cert_dir, "root.pem")), - certfile: String.to_charlist(Path.join(cert_dir, "#{cert_name}.pem")), - keyfile: String.to_charlist(Path.join(cert_dir, "#{cert_name}-key.pem")), - verify: :verify_peer, - server_name_indication: String.to_charlist(server_name) - ] + cred_opts = + if ssl_enabled do + ssl_opts = [ + cacertfile: String.to_charlist(Path.join(cert_dir, "root.pem")), + certfile: String.to_charlist(Path.join(cert_dir, "#{cert_name}.pem")), + keyfile: String.to_charlist(Path.join(cert_dir, "#{cert_name}-key.pem")), + verify: :verify_peer, + server_name_indication: String.to_charlist(server_name) + ] - connect_opts = [ - cred: GRPC.Credential.new(ssl: ssl_opts), - adapter_opts: [connect_timeout: 5_000] - ] + [cred: GRPC.Credential.new(ssl: ssl_opts)] + else + [] + end + + connect_opts = + cred_opts + \|> Keyword.put(:adapter_opts, [connect_timeout: 5_000]) case GRPC.Stub.connect(endpoint, connect_opts) do {:ok, channel} -> Logger.debug("Created fresh NATS account gRPC channel to #{endpoint}") {:ok, channel} {:error, reason} -> Logger.error("Failed to create NATS account gRPC channel: #{inspect(reason)}") {:error, {:connection_failed, reason}} end end `[To ensure code accuracy, apply this suggestion manually]` Suggestion importance[1-10]: 6 __ Why: The suggestion correctly points out that the function hardcodes an mTLS connection, which contradicts the module's documentation stating it uses the same configuration as `DataService.Client`. Applying this change improves flexibility and aligns the component's behavior with other parts of the system, making it usable in non-SSL environments.	Low
	Prevent misleading operator config log Add a check to ensure `operatorConfigPath` is not empty before attempting to write the NATS operator configuration to prevent misleading log messages. cmd/data-services/main.go [117-130] if operatorConfigPath != "" \|\| resolverPath != "" { natsAccountServer.SetResolverPaths(operatorConfigPath, resolverPath) log.Printf("NATS resolver paths configured: operator=%s resolver=%s", operatorConfigPath, resolverPath) // If operator is already initialized, write the config now // This ensures config files exist even when datasvc restarts with an existing operator - if operator != nil { + if operator != nil && operatorConfigPath != "" { if err := natsAccountServer.WriteOperatorConfig(); err != nil { log.Printf("Warning: failed to write initial operator config: %v", err) } else { log.Printf("Wrote initial operator config to %s", operatorConfigPath) } } } Apply / Chat Suggestion importance[1-10]: 5 __ Why: The suggestion correctly identifies a condition that leads to a misleading log message and proposes a simple fix, improving the accuracy of the application's logging.	Low
	✅ ~~Ensure deterministic record selection order~~ Suggestion Impact: The commit adds `sorted_fallback = Enum.sort(fallback)` and updates the subsequent Enum.reduce to iterate over `sorted_fallback` instead of the unsorted map, making processing deterministic. code diff: `# Merge fallback records that don't have canonical equivalents + sorted_fallback = Enum.sort(fallback) + {canonical, meta} = - Enum.reduce(fallback, {canonical, meta}, fn {key, device}, {can, m} -> + Enum.reduce(sorted_fallback, {canonical, meta}, fn {key, device}, {can, m} ->` Sort the `fallback` map by key before reducing it in `select_canonical_records` to ensure deterministic device selection and prevent metric instability. elixir/serviceradar_core/lib/serviceradar/core/stats_aggregator.ex [263-286] defp select_canonical_records(devices, meta) do # Group by canonical ID to deduplicate canonical = %{} fallback = %{} {canonical, fallback, meta} = Enum.reduce(devices, {canonical, fallback, meta}, fn device, {can, fall, m} -> process_device_record(device, can, fall, m) end) # Merge fallback records that don't have canonical equivalents + # Sort the fallback map to ensure deterministic processing + sorted_fallback = Enum.sort(fallback) + {canonical, meta} = - Enum.reduce(fallback, {canonical, meta}, fn {key, device}, {can, m} -> + Enum.reduce(sorted_fallback, {canonical, meta}, fn {key, device}, {can, m} -> if Map.has_key?(can, key) do {can, %{m \| skipped_non_canonical: m.skipped_non_canonical + 1}} else {Map.put(can, key, device), %{m \| inferred_canonical_fallback: m.inferred_canonical_fallback + 1}} end end) selected = Map.values(canonical) {selected, meta} end `[To ensure code accuracy, apply this suggestion manually]` Suggestion importance[1-10]: 5 __ Why: The suggestion correctly identifies that iterating over a map is not order-guaranteed, which could lead to non-deterministic behavior and metric instability. Sorting the fallback records before processing is a valid improvement for ensuring consistent results.	Low
General	✅ ~~Safely parse integer env vars~~ Suggestion Impact: Updated get_env_int to use Integer.parse/1 and return nil for non-numeric or partially numeric environment variable values instead of raising. code diff: `@@ -302,7 +302,11 @@ case System.get_env(key) do nil -> nil "" -> nil - value -> String.to_integer(value) + value -> + case Integer.parse(value) do + {int, ""} -> int + _ -> nil + end` Replace `String.to_integer/1` with the safer `Integer.parse/1` in `get_env_int` to prevent crashes from non-numeric environment variable values. elixir/serviceradar_core/lib/serviceradar/data_service/client.ex [301-307] `defp get_env_int(key) do case System.get_env(key) do nil -> nil "" -> nil - value -> String.to_integer(value) + value -> + case Integer.parse(value) do + {int, ""} -> int + _ -> nil + end end end` `[To ensure code accuracy, apply this suggestion manually]` Suggestion importance[1-10]: 7 __ Why: The suggestion correctly points out that `String.to_integer/1` will crash if the environment variable contains a non-numeric string. Using `Integer.parse/1` makes the configuration loading more robust against invalid user input.	Medium
	Use default limits struct In `build_limits`, return a default `Proto.AccountLimits{}` struct instead of `nil` when no limits are provided to ensure a valid protobuf message. elixir/serviceradar_core/lib/serviceradar/nats/account_client.ex [520] `-defp build_limits(nil), do: nil +defp build_limits(nil), do: %Proto.AccountLimits{}` `[To ensure code accuracy, apply this suggestion manually]` Suggestion importance[1-10]: 7 __ Why: The suggestion correctly identifies that sending `nil` for a protobuf message field can cause issues. Returning a default, empty `AccountLimits` struct is safer and ensures the gRPC request is always well-formed, improving the client's robustness.	Medium
	✅ ~~Handle invalid timestamps safely~~ Suggestion Impact: The commit replaced the bang version with DateTime.from_unix/1 and added pattern matching to return nil on error, preventing potential crashes from invalid UNIX timestamps. code diff: `expires_at = case response.expires_at_unix do 0 -> nil - unix -> DateTime.from_unix!(unix) + unix -> + case DateTime.from_unix(unix) do + {:ok, dt} -> dt + {:error, _} -> nil + end` Replace `DateTime.from_unix!/1` with the safer `DateTime.from_unix/1` and handle potential errors to prevent crashes from invalid timestamps. elixir/serviceradar_core/lib/serviceradar/nats/account_client.ex [174-178] `expires_at = case response.expires_at_unix do 0 -> nil - unix -> DateTime.from_unix!(unix) + unix -> + case DateTime.from_unix(unix) do + {:ok, dt} -> dt + {:error, _} -> nil + end end` `[To ensure code accuracy, apply this suggestion manually]` Suggestion importance[1-10]: 7 __ Why: This suggestion correctly identifies a potential crash when parsing a UNIX timestamp from a gRPC response. Using the non-bang version `DateTime.from_unix/1` and handling the error case makes the code more resilient to invalid data from the remote service.	Medium
	Improve detection of trailing config data Replace the flawed check for trailing JSON data with the more robust and idiomatic `dec.More()` method to correctly handle whitespace and prevent invalid configurations from being accepted. cmd/agent/main.go [133-137] `- if err := dec.Decode(&struct{}{}); err == nil { + if dec.More() { return nil, fmt.Errorf("failed to parse config: %w", errConfigTrailingData) - } else if !errors.Is(err, io.EOF) { - return nil, fmt.Errorf("failed to parse config: %w", err) }` Apply / Chat Suggestion importance[1-10]: 6 __ Why: The suggestion correctly identifies a subtle flaw in the JSON parsing logic and proposes using the idiomatic `dec.More()` function, which improves code correctness and readability.	Low
	Include TLS1.2 cipher suites Update the `ciphers` list to include suites for both TLS 1.3 and TLS 1.2. This ensures that TLS 1.2 connections are supported, as currently only TLS 1.3 cipher suites are configured. elixir/serviceradar_core/lib/serviceradar/spiffe.ex [323] `-ciphers: :ssl.cipher_suites(:default, :"tlsv1.3") +ciphers: :ssl.cipher_suites(:default, :"tlsv1.3") ++ + :ssl.cipher_suites(:default, :"tlsv1.2")` `[To ensure code accuracy, apply this suggestion manually]` Suggestion importance[1-10]: 6 __ Why: The suggestion correctly points out that while `versions` allows both TLS 1.3 and 1.2, the `ciphers` list only includes suites for TLS 1.3. Adding TLS 1.2 cipher suites is necessary to ensure TLS 1.2 connections can be established, improving compatibility.	Low
	✅ ~~Use SQL fragment for endpoint~~ Suggestion Impact: The :endpoint expr was changed from `host <> ":" <> fragment("?::text", port)` to a single SQL fragment `fragment("? \|\| ':' \|\| ?::text", host, port)`, ensuring concatenation happens correctly in the database. code diff: `@@ -636,7 +650,7 @@ :string, expr( if not is_nil(host) and not is_nil(port) do - host <> ":" <> fragment("?::text", port) + fragment("? \|\| ':' \|\| ?::text", host, port) else nil end` In the `:endpoint` calculation, use a single SQL `fragment` for string concatenation instead of mixing Elixir's `<>` operator with a fragment. This ensures the concatenation is performed correctly in the database. elixir/serviceradar_core/lib/serviceradar/infrastructure/agent.ex [635-643] `calculate :endpoint, - :string, - expr( - if not is_nil(host) and not is_nil(port) do - host <> ":" <> fragment("?::text", port) - else - nil - end - ) + :string, + expr( + case do + not is_nil(host) and not is_nil(port) -> + fragment("? \|\| ':' \|\| ?::text", host, port) + true -> + nil + end + )` `[To ensure code accuracy, apply this suggestion manually]` Suggestion importance[1-10]: 5 __ Why: The suggestion correctly identifies that the current implementation mixes Elixir string concatenation with an SQL fragment, which is not valid in an Ash `expr`. The proposed change to use a single `fragment` for concatenation is a valid and necessary correction for the calculation to work at the database level.	Low
Update

Imported GitHub PR comment. Original author: @qodo-code-review[bot] Original URL: https://github.com/carverauto/serviceradar/pull/2225#issuecomment-3712126644 Original created: 2026-01-05T21:20:36Z --- ## PR Code Suggestions ✨  Explore these optional code suggestions: <table><thead><tr><td>Category</td><td align=left>Suggestion                                                                                                                                    </td><td align=center>Impact</td></tr><tbody><tr><td rowspan=8>Possible issue</td> <td> <details><summary>Use correct attribute for timestamp</summary> ___ **Correct the attribute name from <code>:last_seen_time</code> to <code>:last_seen_at</code> in the <code>persist_identity</code> function to prevent device update failures.** [elixir/serviceradar_core/lib/serviceradar/actors/device.ex [503-523]](https://github.com/carverauto/serviceradar/pull/2225/files#diff-eba1d95a852e4a736813c7b486da651704f20718e24f931c966ff3f37c421eeaR503-R523) ```diff defp persist_identity(tenant_id, device_id, identity) do case DeviceResource.get_by_uid(device_id, tenant: tenant_id, authorize?: false) do {:ok, device} -> updates = identity |> Map.take([:name, :hostname, :ip, :mac, :vendor_name, :model, :os]) - |> Map.put(:last_seen_time, DateTime.utc_now()) + |> Map.put(:last_seen_at, DateTime.utc_now()) case Ash.update(device, updates, authorize?: false) do {:ok, _} -> :ok {:error, error} -> {:error, error} end {:error, %Ash.Error.Query.NotFound{}} -> # Device doesn't exist yet, nothing to persist :ok {:error, error} -> {:error, error} end end ``` `[To ensure code accuracy, apply this suggestion manually]` <details><summary>Suggestion importance[1-10]: 9</summary> __ Why: This suggestion identifies a critical bug where an incorrect attribute name (`:last_seen_time` instead of `:last_seen_at`) would cause all device identity updates to fail. Fixing this is essential for the core functionality of the device actor. </details></details></td><td align=center>High </td></tr><tr><td> <details><summary>✅ <s>Fix race condition in shutdown logic</s></summary> ___ <details><summary>Suggestion Impact:</summary>The commit introduces a shared shutdownCtx with a timeout and uses it in the shutdown goroutine and the final select (replacing the separate timer), matching the suggested approach to avoid premature timeout races. It also extends the idea by passing shutdownCtx into pushLoop.Stop and guarding the errChan wait with shutdownCtx. code diff: ```diff // Bound shutdown so the process can't hang forever (includes pushLoop.Stop()). const shutdownTimeout = 10 * time.Second + shutdownCtx, shutdownCancel := context.WithTimeout(context.Background(), shutdownTimeout) + defer shutdownCancel() shutdownDone := make(chan struct{}) go func() { defer close(shutdownDone) cancel() - pushLoop.Stop() - <-errChan - - stopCtx, stopCancel := context.WithTimeout(context.Background(), shutdownTimeout) - defer stopCancel() - if err := server.Stop(stopCtx); err != nil { + if err := pushLoop.Stop(shutdownCtx); err != nil { + log.Warn().Err(err).Msg("Push loop stop did not complete before timeout") + } + select { + case <-errChan: + case <-shutdownCtx.Done(): + return + } + + if err := server.Stop(shutdownCtx); err != nil { log.Error().Err(err).Msg("Error stopping agent services") } }() - - timer := time.NewTimer(shutdownTimeout) - defer timer.Stop() select { case <-shutdownDone: - case <-timer.C: + case <-shutdownCtx.Done(): return fmt.Errorf("%w after %s", errShutdownTimeout, shutdownTimeout) ``` </details> ___ **Refactor the shutdown logic to use a single parent context with a timeout for all shutdown operations, preventing a race condition where the main goroutine could time out prematurely.** [cmd/agent/main.go [201-226]](https://github.com/carverauto/serviceradar/pull/2225/files#diff-61358711e980ccf505246fd3915f97cbd3a380e9b66f6fa5aad46749968c5ca3R201-R226) ```diff // Bound shutdown so the process can't hang forever (includes pushLoop.Stop()). const shutdownTimeout = 10 * time.Second + shutdownCtx, shutdownCancel := context.WithTimeout(context.Background(), shutdownTimeout) + defer shutdownCancel() shutdownDone := make(chan struct{}) go func() { defer close(shutdownDone) cancel() pushLoop.Stop() <-errChan - stopCtx, stopCancel := context.WithTimeout(context.Background(), shutdownTimeout) - defer stopCancel() - if err := server.Stop(stopCtx); err != nil { + // Use the shutdownCtx for stopping the server to respect the overall timeout. + if err := server.Stop(shutdownCtx); err != nil { log.Error().Err(err).Msg("Error stopping agent services") } }() - timer := time.NewTimer(shutdownTimeout) - defer timer.Stop() - select { case <-shutdownDone: - case <-timer.C: + // Shutdown completed gracefully. + case <-shutdownCtx.Done(): return fmt.Errorf("%w after %s", errShutdownTimeout, shutdownTimeout) } ``` `[Suggestion processed]` <details><summary>Suggestion importance[1-10]: 8</summary> __ Why: The suggestion correctly identifies and fixes a race condition in the shutdown logic, ensuring all shutdown operations are bound by a single, shared timeout, which improves the agent's robustness. </details></details></td><td align=center>Medium </td></tr><tr><td> <details><summary>✅ <s>Ensure state transitions are atomic</s></summary> ___ <details><summary>Suggestion Impact:</summary>The commit removed `require_atomic? false` from multiple state-transition update actions including `establish_connection` (and others like connection_failed, degrade, restore_health, etc.), aligning those transitions with atomic execution. code diff: ```diff @@ -282,7 +307,6 @@ update :establish_connection do description "Mark agent as connected (from connecting state)" accept [:poller_id] - require_atomic? false change transition_state(:connected) change set_attribute(:is_healthy, true) @@ -293,7 +317,6 @@ update :connection_failed do description "Mark connection attempt as failed" - require_atomic? false change transition_state(:disconnected) change set_attribute(:modified_time, &DateTime.utc_now/0) @@ -302,7 +325,6 @@ update :degrade do description "Mark agent as degraded (connected but unhealthy)" - require_atomic? false change transition_state(:degraded) change set_attribute(:is_healthy, false) @@ -312,7 +334,6 @@ update :restore_health do description "Restore agent health (from degraded to connected)" - require_atomic? false change transition_state(:connected) change set_attribute(:is_healthy, true) @@ -322,7 +343,6 @@ update :lose_connection do description "Mark agent as disconnected (connection lost)" - require_atomic? false change transition_state(:disconnected) change set_attribute(:poller_id, nil) @@ -332,7 +352,6 @@ update :reconnect do description "Start reconnection process (from disconnected to connecting)" - require_atomic? false change transition_state(:connecting) change set_attribute(:modified_time, &DateTime.utc_now/0) @@ -342,7 +361,6 @@ update :mark_unavailable do description "Mark agent as unavailable (admin action)" argument :reason, :string - require_atomic? false change transition_state(:unavailable) change set_attribute(:is_healthy, false) @@ -352,7 +370,6 @@ update :recover do description "Start recovery process (from unavailable to connecting)" - require_atomic? false change transition_state(:connecting) change set_attribute(:modified_time, &DateTime.utc_now/0) @@ -363,7 +380,6 @@ update :connect do description "Mark agent as connected to a poller (legacy - use establish_connection)" accept [:poller_id] - require_atomic? false change transition_state(:connected) change set_attribute(:is_healthy, true) @@ -374,7 +390,6 @@ update :disconnect do description "Mark agent as disconnected (legacy - use lose_connection)" - require_atomic? false change transition_state(:disconnected) change set_attribute(:poller_id, nil) @@ -384,7 +399,6 @@ update :mark_unhealthy do description "Mark agent as unhealthy (legacy - use degrade)" - require_atomic? false change transition_state(:degraded) change set_attribute(:is_healthy, false) ``` </details> ___ **Remove <code>require_atomic? false</code> from all state-transition actions like <code>establish_connection</code> to ensure atomicity and prevent data inconsistency. State transitions are critical and should be transactional.** [elixir/serviceradar_core/lib/serviceradar/infrastructure/agent.ex [270-292]](https://github.com/carverauto/serviceradar/pull/2225/files#diff-c56f92b6ce744cab3f2dc00dde92e2017cffdd12ad4618f7fa720252f2a6843aR270-R292) ```diff update :heartbeat do description "Update last_seen_time and health status (for connected agents)" accept [:is_healthy, :capabilities] require_atomic? false change set_attribute(:last_seen_time, &DateTime.utc_now/0) change set_attribute(:modified_time, &DateTime.utc_now/0) end # State machine transition actions # Each action includes PublishStateChange to emit NATS events update :establish_connection do description "Mark agent as connected (from connecting state)" accept [:poller_id] - require_atomic? false change transition_state(:connected) change set_attribute(:is_healthy, true) change set_attribute(:last_seen_time, &DateTime.utc_now/0) change set_attribute(:modified_time, &DateTime.utc_now/0) change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :agent, new_state: :connected} end ``` `[To ensure code accuracy, apply this suggestion manually]` <details><summary>Suggestion importance[1-10]: 8</summary> __ Why: The suggestion correctly identifies that using `require_atomic? false` for state transition actions can lead to data inconsistency, which is a significant correctness issue. Ensuring atomicity for these critical operations is crucial for maintaining a valid state. </details></details></td><td align=center>Medium </td></tr><tr><td> <details><summary>✅ <s>Combine create and update into one</s></summary> ___ <details><summary>Suggestion Impact:</summary>The commit removed the separate update call that set token fields after creation, and instead applies the token attributes directly on the create changeset (via force_change_attribute) before a single Ash.create, making the operation effectively atomic and avoiding inconsistent intermediate state. code diff: ```diff + changeset = + OnboardingPackage + |> Ash.Changeset.for_create(:create, create_attrs, + actor: actor, + authorize?: authorize?, + tenant: tenant + ) + |> Ash.Changeset.force_change_attribute(:join_token_ciphertext, join_token_ciphertext) + |> Ash.Changeset.force_change_attribute(:join_token_expires_at, join_expires) + |> Ash.Changeset.force_change_attribute(:download_token_hash, download_token_hash) + |> Ash.Changeset.force_change_attribute(:download_token_expires_at, download_expires) + + case Ash.create(changeset) do {:ok, package} -> - # Update with token fields - token_attrs = %{ - join_token_ciphertext: join_token_ciphertext, - join_token_expires_at: join_expires, - download_token_hash: download_token_hash, - download_token_expires_at: download_expires - } - - case package - |> Ash.Changeset.for_update(:update_tokens, token_attrs, - actor: actor, - authorize?: authorize?, - tenant: tenant - ) - |> Ash.update() do - {:ok, updated_package} -> - # Record creation event - OnboardingEvents.record(package.id, :created, - actor: get_actor_name(actor), - source_ip: source_ip - ) - - {:ok, - %{ - package: updated_package, - join_token: join_token, - download_token: download_token - }} - - {:error, error} -> - {:error, error} - end + # Record creation event + OnboardingEvents.record(package.id, :created, + actor: get_actor_name(actor), + source_ip: source_ip + ) + + {:ok, + %{ + package: package, + join_token: join_token, + download_token: download_token + }} ``` </details> ___ **Refactor the <code>create</code> function to be atomic by combining the "create-then-update" logic into a single operation. Pass all attributes, including tokens, to the <code>:create</code> action to prevent data inconsistency.** [elixir/serviceradar_core/lib/serviceradar/edge/onboarding_packages.ex [156-199]](https://github.com/carverauto/serviceradar/pull/2225/files#diff-e4fe8e19bc324416302bb4c962f57133b3f62eb82053766844d881c522a473e5R156-R199) ```diff -case OnboardingPackage - |> Ash.Changeset.for_create(:create, create_attrs, - actor: actor, - authorize?: authorize?, - tenant: tenant - ) - |> Ash.create() do +# Encrypt join token and hash download token +join_token_ciphertext = Crypto.encrypt(join_token) +download_token_hash = Crypto.hash_token(download_token) + +# Prepare attributes for Ash create +create_attrs = + attrs + |> Map.put(:created_by, get_actor_name(actor)) + |> Map.put(:join_token_ciphertext, join_token_ciphertext) + |> Map.put(:join_token_expires_at, join_expires) + |> Map.put(:download_token_hash, download_token_hash) + |> Map.put(:download_token_expires_at, download_expires) + +case OnboardingPackage.create(create_attrs, actor: actor, authorize?: authorize?, tenant: tenant) do {:ok, package} -> - # Update with token fields - token_attrs = %{ - join_token_ciphertext: join_token_ciphertext, - join_token_expires_at: join_expires, - download_token_hash: download_token_hash, - download_token_expires_at: download_expires - } + # Record creation event + OnboardingEvents.record(package.id, :created, + actor: get_actor_name(actor), + source_ip: source_ip + ) - case package - |> Ash.Changeset.for_update(:update_tokens, token_attrs, - actor: actor, - authorize?: authorize?, - tenant: tenant - ) - |> Ash.update() do - {:ok, updated_package} -> - # Record creation event - OnboardingEvents.record(package.id, :created, - actor: get_actor_name(actor), - source_ip: source_ip - ) - - {:ok, - %{ - package: updated_package, - join_token: join_token, - download_token: download_token - }} - - {:error, error} -> - {:error, error} - end + {:ok, + %{ + package: package, + join_token: join_token, + download_token: download_token + }} {:error, error} -> {:error, error} end ``` `[To ensure code accuracy, apply this suggestion manually]` <details><summary>Suggestion importance[1-10]: 8</summary> __ Why: The suggestion correctly points out a non-atomic "create-then-update" pattern that could lead to inconsistent data. Consolidating this into a single atomic `create` operation is a critical improvement for data integrity. </details></details></td><td align=center>Medium </td></tr><tr><td> <details><summary>✅ <s>Prevent race condition during updates</s></summary> ___ <details><summary>Suggestion Impact:</summary>The commit removed the separate conditional DeviceAliasState.confirm/2 call and instead passed confirm_threshold into DeviceAliasState.record_sighting/3, implying the confirmation decision is now handled within the sighting recording operation (moving toward the suggested atomic consolidation), though it did not introduce the specifically suggested atomic_sighting_and_confirm function or the suggested return-shape handling. code diff: ```diff # Record the sighting - case DeviceAliasState.record_sighting(existing, %{}, actor: actor) do + case DeviceAliasState.record_sighting( + existing, + %{confirm_threshold: confirm_threshold}, + actor: actor + ) do {:ok, updated} -> - # Check if we should confirm - if updated.state == :detected and updated.sighting_count >= confirm_threshold do - DeviceAliasState.confirm(updated, actor: actor) - end ``` </details> ___ **Prevent a race condition in <code>handle_existing_alias/4</code> by consolidating the logic for recording a sighting and confirming the state into a single, atomic action on the <code>DeviceAliasState</code> resource.** [elixir/serviceradar_core/lib/serviceradar/identity/alias_events.ex [438-474]](https://github.com/carverauto/serviceradar/pull/2225/files#diff-bc3743067ea774f59bc5665770f7110a2d6e90f6e1156a7717a1c287f8979d28R438-R474) ```diff defp handle_existing_alias(existing, _update, confirm_threshold, actor) do alias ServiceRadar.Identity.DeviceAliasState - # Record the sighting - case DeviceAliasState.record_sighting(existing, %{}, actor: actor) do - {:ok, updated} -> - # Check if we should confirm - if updated.state == :detected and updated.sighting_count >= confirm_threshold do - DeviceAliasState.confirm(updated, actor: actor) - end + # Atomically record sighting and potentially confirm + case DeviceAliasState.atomic_sighting_and_confirm(existing, %{confirm_threshold: confirm_threshold}, actor: actor) do + {:ok, %{data: %{state_changed: true, new_state: new_state, sighting_count: count}}} -> + # Generate event for state change + %{ + device_id: existing.device_id, + partition: existing.partition || "", + action: "alias_state_changed", + reason: "state_transition", + timestamp: DateTime.utc_now(), + severity: "Info", + level: 6, + metadata: %{ + "alias_type" => to_string(existing.alias_type), + "alias_value" => existing.alias_value, + "previous_state" => to_string(existing.state), + "new_state" => to_string(new_state), + "sighting_count" => to_string(count) + } + } - # Generate event if state changed - if updated.state != existing.state do - %{ - device_id: existing.device_id, - partition: existing.partition || "", - action: "alias_state_changed", - reason: "state_transition", - timestamp: DateTime.utc_now(), - severity: "Info", - level: 6, - metadata: %{ - "alias_type" => to_string(existing.alias_type), - "alias_value" => existing.alias_value, - "previous_state" => to_string(existing.state), - "new_state" => to_string(updated.state), - "sighting_count" => to_string(updated.sighting_count) - } - } - else - nil - end + {:ok, _} -> + # Sighting recorded, but no state change + nil {:error, _reason} -> nil end end ``` `[To ensure code accuracy, apply this suggestion manually]` <details><summary>Suggestion importance[1-10]: 8</summary> __ Why: The suggestion correctly identifies a race condition in `handle_existing_alias/4` due to non-atomic read-modify-write operations. Consolidating the sighting and confirmation logic into a single atomic action is crucial to prevent incorrect state transitions. </details></details></td><td align=center>Medium </td></tr><tr><td> <details><summary>Conditionally enable SSL for gRPC</summary> ___ **Conditionally enable SSL for the gRPC connection based on an environment variable, allowing for both secure and insecure deployments.** [elixir/serviceradar_core/lib/serviceradar/nats/account_client.ex [487-518]](https://github.com/carverauto/serviceradar/pull/2225/files#diff-2e18ac777ac600b12982ba9e9d5327e23ebd84c139a2add7976f8bf61283e554R487-R518) ```diff defp create_fresh_channel do host = System.get_env("DATASVC_HOST", "datasvc") port = String.to_integer(System.get_env("DATASVC_PORT", "50057")) + ssl_enabled = System.get_env("DATASVC_SSL", "false") in ["true", "1", "yes"] cert_dir = System.get_env("DATASVC_CERT_DIR", "/etc/serviceradar/certs") cert_name = System.get_env("DATASVC_CERT_NAME", "core") server_name = System.get_env("DATASVC_SERVER_NAME", "datasvc.serviceradar") endpoint = "#{host}:#{port}" - ssl_opts = [ - cacertfile: String.to_charlist(Path.join(cert_dir, "root.pem")), - certfile: String.to_charlist(Path.join(cert_dir, "#{cert_name}.pem")), - keyfile: String.to_charlist(Path.join(cert_dir, "#{cert_name}-key.pem")), - verify: :verify_peer, - server_name_indication: String.to_charlist(server_name) - ] + cred_opts = + if ssl_enabled do + ssl_opts = [ + cacertfile: String.to_charlist(Path.join(cert_dir, "root.pem")), + certfile: String.to_charlist(Path.join(cert_dir, "#{cert_name}.pem")), + keyfile: String.to_charlist(Path.join(cert_dir, "#{cert_name}-key.pem")), + verify: :verify_peer, + server_name_indication: String.to_charlist(server_name) + ] - connect_opts = [ - cred: GRPC.Credential.new(ssl: ssl_opts), - adapter_opts: [connect_timeout: 5_000] - ] + [cred: GRPC.Credential.new(ssl: ssl_opts)] + else + [] + end + + connect_opts = + cred_opts + |> Keyword.put(:adapter_opts, [connect_timeout: 5_000]) case GRPC.Stub.connect(endpoint, connect_opts) do {:ok, channel} -> Logger.debug("Created fresh NATS account gRPC channel to #{endpoint}") {:ok, channel} {:error, reason} -> Logger.error("Failed to create NATS account gRPC channel: #{inspect(reason)}") {:error, {:connection_failed, reason}} end end ``` `[To ensure code accuracy, apply this suggestion manually]` <details><summary>Suggestion importance[1-10]: 6</summary> __ Why: The suggestion correctly points out that the function hardcodes an mTLS connection, which contradicts the module's documentation stating it uses the same configuration as `DataService.Client`. Applying this change improves flexibility and aligns the component's behavior with other parts of the system, making it usable in non-SSL environments. </details></details></td><td align=center>Low </td></tr><tr><td> <details><summary>Prevent misleading operator config log</summary> ___ **Add a check to ensure <code>operatorConfigPath</code> is not empty before attempting to write the NATS operator configuration to prevent misleading log messages.** [cmd/data-services/main.go [117-130]](https://github.com/carverauto/serviceradar/pull/2225/files#diff-5e7731adfb877918cd65d9d5531621312496450fd550fea2682efca4ca8fe816R117-R130) ```diff if operatorConfigPath != "" || resolverPath != "" { natsAccountServer.SetResolverPaths(operatorConfigPath, resolverPath) log.Printf("NATS resolver paths configured: operator=%s resolver=%s", operatorConfigPath, resolverPath) // If operator is already initialized, write the config now // This ensures config files exist even when datasvc restarts with an existing operator - if operator != nil { + if operator != nil && operatorConfigPath != "" { if err := natsAccountServer.WriteOperatorConfig(); err != nil { log.Printf("Warning: failed to write initial operator config: %v", err) } else { log.Printf("Wrote initial operator config to %s", operatorConfigPath) } } } ``` - [ ] **Apply / Chat**  <details><summary>Suggestion importance[1-10]: 5</summary> __ Why: The suggestion correctly identifies a condition that leads to a misleading log message and proposes a simple fix, improving the accuracy of the application's logging. </details></details></td><td align=center>Low </td></tr><tr><td> <details><summary>✅ <s>Ensure deterministic record selection order</s></summary> ___ <details><summary>Suggestion Impact:</summary>The commit adds `sorted_fallback = Enum.sort(fallback)` and updates the subsequent Enum.reduce to iterate over `sorted_fallback` instead of the unsorted map, making processing deterministic. code diff: ```diff # Merge fallback records that don't have canonical equivalents + sorted_fallback = Enum.sort(fallback) + {canonical, meta} = - Enum.reduce(fallback, {canonical, meta}, fn {key, device}, {can, m} -> + Enum.reduce(sorted_fallback, {canonical, meta}, fn {key, device}, {can, m} -> ``` </details> ___ **Sort the <code>fallback</code> map by key before reducing it in <code>select_canonical_records</code> to ensure deterministic device selection and prevent metric instability.** [elixir/serviceradar_core/lib/serviceradar/core/stats_aggregator.ex [263-286]](https://github.com/carverauto/serviceradar/pull/2225/files#diff-1f4ac8290be7d27cac0ed660e51a9b3b23a219a6bb43b3735f3c5a9768321031R263-R286) ```diff defp select_canonical_records(devices, meta) do # Group by canonical ID to deduplicate canonical = %{} fallback = %{} {canonical, fallback, meta} = Enum.reduce(devices, {canonical, fallback, meta}, fn device, {can, fall, m} -> process_device_record(device, can, fall, m) end) # Merge fallback records that don't have canonical equivalents + # Sort the fallback map to ensure deterministic processing + sorted_fallback = Enum.sort(fallback) + {canonical, meta} = - Enum.reduce(fallback, {canonical, meta}, fn {key, device}, {can, m} -> + Enum.reduce(sorted_fallback, {canonical, meta}, fn {key, device}, {can, m} -> if Map.has_key?(can, key) do {can, %{m | skipped_non_canonical: m.skipped_non_canonical + 1}} else {Map.put(can, key, device), %{m | inferred_canonical_fallback: m.inferred_canonical_fallback + 1}} end end) selected = Map.values(canonical) {selected, meta} end ``` `[To ensure code accuracy, apply this suggestion manually]` <details><summary>Suggestion importance[1-10]: 5</summary> __ Why: The suggestion correctly identifies that iterating over a map is not order-guaranteed, which could lead to non-deterministic behavior and metric instability. Sorting the fallback records before processing is a valid improvement for ensuring consistent results. </details></details></td><td align=center>Low </td></tr><tr><td rowspan=6>General</td> <td> <details><summary>✅ <s>Safely parse integer env vars</s></summary> ___ <details><summary>Suggestion Impact:</summary>Updated get_env_int to use Integer.parse/1 and return nil for non-numeric or partially numeric environment variable values instead of raising. code diff: ```diff @@ -302,7 +302,11 @@ case System.get_env(key) do nil -> nil "" -> nil - value -> String.to_integer(value) + value -> + case Integer.parse(value) do + {int, ""} -> int + _ -> nil + end ``` </details> ___ **Replace <code>String.to_integer/1</code> with the safer <code>Integer.parse/1</code> in <code>get_env_int</code> to prevent crashes from non-numeric environment variable values.** [elixir/serviceradar_core/lib/serviceradar/data_service/client.ex [301-307]](https://github.com/carverauto/serviceradar/pull/2225/files#diff-503e195ad79e05e12d7ad03a675f6e35ffdfc201b8571b0d30a220fe036e03a1R301-R307) ```diff defp get_env_int(key) do case System.get_env(key) do nil -> nil "" -> nil - value -> String.to_integer(value) + value -> + case Integer.parse(value) do + {int, ""} -> int + _ -> nil + end end end ``` `[To ensure code accuracy, apply this suggestion manually]` <details><summary>Suggestion importance[1-10]: 7</summary> __ Why: The suggestion correctly points out that `String.to_integer/1` will crash if the environment variable contains a non-numeric string. Using `Integer.parse/1` makes the configuration loading more robust against invalid user input. </details></details></td><td align=center>Medium </td></tr><tr><td> <details><summary>Use default limits struct</summary> ___ **In <code>build_limits</code>, return a default <code>Proto.AccountLimits{}</code> struct instead of <code>nil</code> when no limits are provided to ensure a valid protobuf message.** [elixir/serviceradar_core/lib/serviceradar/nats/account_client.ex [520]](https://github.com/carverauto/serviceradar/pull/2225/files#diff-2e18ac777ac600b12982ba9e9d5327e23ebd84c139a2add7976f8bf61283e554R520-R520) ```diff -defp build_limits(nil), do: nil +defp build_limits(nil), do: %Proto.AccountLimits{} ``` `[To ensure code accuracy, apply this suggestion manually]` <details><summary>Suggestion importance[1-10]: 7</summary> __ Why: The suggestion correctly identifies that sending `nil` for a protobuf message field can cause issues. Returning a default, empty `AccountLimits` struct is safer and ensures the gRPC request is always well-formed, improving the client's robustness. </details></details></td><td align=center>Medium </td></tr><tr><td> <details><summary>✅ <s>Handle invalid timestamps safely</s></summary> ___ <details><summary>Suggestion Impact:</summary>The commit replaced the bang version with DateTime.from_unix/1 and added pattern matching to return nil on error, preventing potential crashes from invalid UNIX timestamps. code diff: ```diff expires_at = case response.expires_at_unix do 0 -> nil - unix -> DateTime.from_unix!(unix) + unix -> + case DateTime.from_unix(unix) do + {:ok, dt} -> dt + {:error, _} -> nil + end ``` </details> ___ **Replace <code>DateTime.from_unix!/1</code> with the safer <code>DateTime.from_unix/1</code> and handle potential errors to prevent crashes from invalid timestamps.** [elixir/serviceradar_core/lib/serviceradar/nats/account_client.ex [174-178]](https://github.com/carverauto/serviceradar/pull/2225/files#diff-2e18ac777ac600b12982ba9e9d5327e23ebd84c139a2add7976f8bf61283e554R174-R178) ```diff expires_at = case response.expires_at_unix do 0 -> nil - unix -> DateTime.from_unix!(unix) + unix -> + case DateTime.from_unix(unix) do + {:ok, dt} -> dt + {:error, _} -> nil + end end ``` `[To ensure code accuracy, apply this suggestion manually]` <details><summary>Suggestion importance[1-10]: 7</summary> __ Why: This suggestion correctly identifies a potential crash when parsing a UNIX timestamp from a gRPC response. Using the non-bang version `DateTime.from_unix/1` and handling the error case makes the code more resilient to invalid data from the remote service. </details></details></td><td align=center>Medium </td></tr><tr><td> <details><summary>Improve detection of trailing config data</summary> ___ **Replace the flawed check for trailing JSON data with the more robust and idiomatic <code>dec.More()</code> method to correctly handle whitespace and prevent invalid configurations from being accepted.** [cmd/agent/main.go [133-137]](https://github.com/carverauto/serviceradar/pull/2225/files#diff-61358711e980ccf505246fd3915f97cbd3a380e9b66f6fa5aad46749968c5ca3R133-R137) ```diff - if err := dec.Decode(&struct{}{}); err == nil { + if dec.More() { return nil, fmt.Errorf("failed to parse config: %w", errConfigTrailingData) - } else if !errors.Is(err, io.EOF) { - return nil, fmt.Errorf("failed to parse config: %w", err) } ``` - [ ] **Apply / Chat**  <details><summary>Suggestion importance[1-10]: 6</summary> __ Why: The suggestion correctly identifies a subtle flaw in the JSON parsing logic and proposes using the idiomatic `dec.More()` function, which improves code correctness and readability. </details></details></td><td align=center>Low </td></tr><tr><td> <details><summary>Include TLS1.2 cipher suites</summary> ___ **Update the <code>ciphers</code> list to include suites for both TLS 1.3 and TLS 1.2. This ensures that TLS 1.2 connections are supported, as currently only TLS 1.3 cipher suites are configured.** [elixir/serviceradar_core/lib/serviceradar/spiffe.ex [323]](https://github.com/carverauto/serviceradar/pull/2225/files#diff-0cb8d921c19f671b66f91c0978e351e71d927c5f4694924984c9f1ed34d7ee78R323-R323) ```diff -ciphers: :ssl.cipher_suites(:default, :"tlsv1.3") +ciphers: :ssl.cipher_suites(:default, :"tlsv1.3") ++ + :ssl.cipher_suites(:default, :"tlsv1.2") ``` `[To ensure code accuracy, apply this suggestion manually]` <details><summary>Suggestion importance[1-10]: 6</summary> __ Why: The suggestion correctly points out that while `versions` allows both TLS 1.3 and 1.2, the `ciphers` list only includes suites for TLS 1.3. Adding TLS 1.2 cipher suites is necessary to ensure TLS 1.2 connections can be established, improving compatibility. </details></details></td><td align=center>Low </td></tr><tr><td> <details><summary>✅ <s>Use SQL fragment for endpoint</s></summary> ___ <details><summary>Suggestion Impact:</summary>The :endpoint expr was changed from `host <> ":" <> fragment("?::text", port)` to a single SQL fragment `fragment("? || ':' || ?::text", host, port)`, ensuring concatenation happens correctly in the database. code diff: ```diff @@ -636,7 +650,7 @@ :string, expr( if not is_nil(host) and not is_nil(port) do - host <> ":" <> fragment("?::text", port) + fragment("? || ':' || ?::text", host, port) else nil end ``` </details> ___ **In the <code>:endpoint</code> calculation, use a single SQL <code>fragment</code> for string concatenation instead of mixing Elixir's <code><></code> operator with a fragment. This ensures the concatenation is performed correctly in the database.** [elixir/serviceradar_core/lib/serviceradar/infrastructure/agent.ex [635-643]](https://github.com/carverauto/serviceradar/pull/2225/files#diff-c56f92b6ce744cab3f2dc00dde92e2017cffdd12ad4618f7fa720252f2a6843aR635-R643) ```diff calculate :endpoint, - :string, - expr( - if not is_nil(host) and not is_nil(port) do - host <> ":" <> fragment("?::text", port) - else - nil - end - ) + :string, + expr( + case do + not is_nil(host) and not is_nil(port) -> + fragment("? || ':' || ?::text", host, port) + true -> + nil + end + ) ``` `[To ensure code accuracy, apply this suggestion manually]` <details><summary>Suggestion importance[1-10]: 5</summary> __ Why: The suggestion correctly identifies that the current implementation mixes Elixir string concatenation with an SQL fragment, which is not valid in an Ash `expr`. The proposed change to use a single `fragment` for concatenation is a valid and necessary correction for the calculation to work at the database level. </details></details></td><td align=center>Low </td></tr> <tr><td align="center" colspan="2"> - [ ] Update  </td><td></td></tr></tbody></table>