carverauto/serviceradar

Fork 0

Feat/agent gateway work #2629

Merged

mfreeman451 merged 62 commits from refs/pull/2629/head into testing

2026-01-04 10:06:57 +00:00

mfreeman451 commented

2026-01-03 09:47:10 +00:00

(Migrated from github.com)

Owner

Imported from GitHub pull request.

Original GitHub pull request: #2221
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/pull/2221
Original created: 2026-01-03T09:47:10Z
Original updated: 2026-01-04T10:06:59Z
Original head: carverauto/serviceradar:feat/agent-gateway-work
Original base: testing
Original merged: 2026-01-04T10:06:57Z by @mfreeman451

User description

IMPORTANT: Please sign the Developer Certificate of Origin

Thank you for your contribution to ServiceRadar. Please note, when contributing, the developer must include
a DCO sign-off statement indicating the DCO acceptance in one commit message. Here
is an example DCO Signed-off-by line in a commit message:

Signed-off-by: J. Doe <j.doe@domain.com>

Describe your changes

Issue ticket number and link

Code checklist before requesting a review

I have signed the DCO?
The build completes without errors?
All tests are passing when running make test?

PR Type

Enhancement, Tests

Description

Implements comprehensive NATS account management system with operator bootstrap, tenant account creation, and user credential generation
Adds new AgentGatewayService gRPC API for agent-to-gateway status communication with push and streaming capabilities
Introduces multi-tenant identity extraction from mTLS certificates with tenant context propagation throughout the system
Implements agent push mode for active status transmission to gateway with automatic reconnection and exponential backoff
Adds NATS bootstrap CLI command supporting local and remote modes for operator and account initialization
Extends CLI with admin command dispatcher for managing NATS bootstrap and account operations
Implements tenant-aware NATS event subject prefixing via environment variable control
Adds comprehensive test coverage for account management, user credentials, operator initialization, and tenant context handling
Extends data models with collector package types, credentials structures, and NATS configuration fields
Normalizes proto file path references for KV service consistency

Diagram Walkthrough

flowchart LR
  Agent["Agent<br/>Push Mode"] -->|"PushStatus<br/>StreamStatus"| Gateway["Agent Gateway<br/>Service"]
  Gateway -->|"mTLS<br/>Tenant Context"| DataSvc["Data Service<br/>NATS Account Mgmt"]
  DataSvc -->|"JWT Operations<br/>Credential Gen"| NATS["NATS Server<br/>Operator/Accounts"]
  CLI["CLI<br/>nats-bootstrap"] -->|"Bootstrap<br/>Config Gen"| NATS
  Cert["mTLS Certificate<br/>CN Parsing"] -->|"Tenant ID<br/>Extraction"| TenantCtx["Tenant Context<br/>Propagation"]
  TenantCtx -->|"Subject Prefixing<br/>Channel Routing"| EventPub["Event Publisher<br/>NATS Events"]

File Walkthrough

Relevant files

Enhancement

20 files

nats_account.pb.go `NATS account management protobuf definitions` proto/nats_account.pb.go Generated protobuf Go code for NATS account management with 1266 lines of new code Defines message types for account limits, subject mappings, tenant account creation, and user credentials Implements credential types (collector, service, admin) and user permission structures Includes RPC service definitions for bootstrap, account creation, JWT signing, and credential generation	+1266/-0
nats_bootstrap.go `NATS bootstrap CLI implementation` pkg/cli/nats_bootstrap.go Implements NATS bootstrap CLI command with 961 lines of new functionality Provides local and remote bootstrap modes for operator and account initialization Generates NATS configuration files, JWTs, credentials, and system/platform accounts Includes token generation, status checking, and tenant listing capabilities	+961/-0
cli.go `CLI command routing for admin operations` pkg/cli/cli.go Adds new `AdminHandler` struct to dispatch nested admin commands Registers `nats-bootstrap` and `admin` command handlers in the command map Implements routing logic for admin subcommands (currently supporting `admin nats`)	+22/-0
nats_account_service.go `NATS Account Service Implementation with JWT Operations` pkg/datasvc/nats_account_service.go Implements `NATSAccountServer` struct for stateless NATS JWT/NKeys cryptographic operations Provides gRPC methods for operator bootstrap, tenant account creation, user credential generation, and JWT signing Includes mTLS authorization checks and resolver connection management for pushing JWTs to NATS Supports file-based and NATS-based JWT resolver configurations with proper error handling	+808/-0
monitoring.pb.go `Gateway Status Messages and Service Definition` proto/monitoring.pb.go Adds `GatewayStatusRequest`, `GatewayStatusResponse`, `GatewayStatusChunk`, and `GatewayServiceStatus` message types Includes tenant-aware fields (`tenant_id`, `tenant_slug`) for multi-tenant routing Adds new `AgentGatewayService` with `PushStatus` and `StreamStatus` RPC methods Updates message type count from 13 to 17 and service count from 2 to 3	+501/-20
nats_account_grpc.pb.go `NATS Account Service gRPC Stubs Generation` proto/nats_account_grpc.pb.go Auto-generated gRPC service stubs for `NATSAccountService` interface Implements client and server interfaces with six RPC methods: `BootstrapOperator`, `GetOperatorInfo`, `CreateTenantAccount`, `GenerateUserCredentials`, `SignAccountJWT`, `PushAccountJWT` Includes unary server handlers and service descriptor registration	+376/-0
operator.go `NATS Operator Key Management and Bootstrap` pkg/nats/accounts/operator.go Defines `Operator` struct for managing NATS operator keys and signing operations Implements `BootstrapOperator` function to initialize new or import existing operator keys Provides key generation utilities for operators, accounts, and users with NKeys Includes configuration loading from multiple sources (direct, environment, files)	+476/-0
events.go `NATS Configuration Extension for Credentials` pkg/models/events.go Adds `CredsFile` field to `NATSConfig` struct for system account credentials file path Maintains backward compatibility with existing NATS configuration fields	+4/-3
tenant.go `Multi-tenant identity extraction from certificates` pkg/tenant/tenant.go New package providing multi-tenant identity extraction from mTLS certificate CNs Implements CN parsing with format `...serviceradar` Provides context-based tenant propagation and NATS channel prefixing utilities Supports certificate extraction from gRPC contexts, TLS states, and PEM files	+315/-0
nats_config.go `NATS server configuration and bootstrap utilities` pkg/edgeonboarding/nats_config.go Configuration generation for NATS server with operator JWT and system account setup Template-based config generation supporting TLS, JetStream, clustering, and leaf nodes Helper functions for generating collector-specific NATS permissions by type Data structures for NATS bootstrap requests and collector credentials	+328/-0
gateway_client.go `Agent gateway client with reconnection logic` pkg/agent/gateway_client.go New gRPC client for agent-to-gateway communication with connection management Implements status push and streaming with automatic reconnection and exponential backoff Supports mTLS security configuration and graceful disconnect Provides connection state tracking and timeout handling	+270/-0
account_manager.go `NATS account creation and JWT signing` pkg/nats/accounts/account_manager.go Account signer for creating and managing NATS tenant accounts with JWT signing Implements subject mapping for tenant isolation with `{{tenant}}` placeholder replacement Supports account limits, JetStream enablement, and user key revocations Provides stateless account creation and JWT regeneration operations	+238/-0
monitoring_grpc.pb.go `New AgentGatewayService gRPC API` proto/monitoring_grpc.pb.go Adds new `AgentGatewayService` gRPC service with PushStatus and StreamStatus methods Marks `PollerService` as deprecated with migration guidance to `AgentGatewayService` Implements client and server stubs for agent-to-gateway status communication Supports both unary and client-streaming RPC patterns	+148/-0
user_manager.go `NATS user credential generation and management` pkg/nats/accounts/user_manager.go User credential generation for NATS with type-based permissions (collector, service, admin) Implements `.creds` file formatting for NATS authentication Supports custom permissions and credential expiration Applies tenant-scoped permissions based on credential type	+217/-0
push_loop.go `Agent status push loop implementation` pkg/agent/push_loop.go Push loop implementation for periodic agent status collection and gateway transmission Collects status from sweep services and configured checkers Converts internal status format to gateway protocol messages Handles source IP detection and tenant/partition metadata propagation	+222/-0
events.go `Tenant-aware NATS event subject prefixing` pkg/natsutil/events.go Adds tenant prefixing support for NATS event subjects via environment variable control Implements `applyTenantPrefix()` to inject tenant slug into subject paths Adds `NewEventPublisherWithPrefixing()` for explicit prefix control Provides helper functions to check and modify prefixing state	+74/-13
main.go `Data service NATS account initialization` cmd/data-services/main.go Initializes NATS account service with operator configuration and resolver paths Supports operator bootstrap from environment variables and config files Registers `NATSAccountService` gRPC endpoint for account management Configures system account credentials and resolver client for JWT persistence	+68/-0
edge_onboarding.go `Collector package and credential data models` pkg/models/edge_onboarding.go Adds `CollectorType` enum for collector classification (flowgger, trapd, netflow, otel) Adds `CollectorPackageStatus` enum for deployment lifecycle tracking Introduces `CollectorPackage`, `NatsCredential`, and `CollectorDownloadResult` data structures Supports collector package metadata, credential management, and download tracking	+71/-0
main.go `Agent push mode implementation with gateway integration` cmd/agent/main.go Implements push mode for agents to actively push status to gateway Adds `runPushMode()` with gateway client initialization and push loop management Implements graceful shutdown with signal handling for push mode Maintains legacy server mode as fallback with deprecation warning	+82/-0
main.go `CLI command routing for NATS bootstrap and admin` cmd/cli/main.go Adds `nats-bootstrap` command routing for NATS server bootstrap operations Adds `admin` command dispatcher with `nats` subcommand support Implements `dispatchAdminCommand()` for admin resource routing Maintains backward compatibility with existing bcrypt mode	+14/-0

Formatting

1 files

kv.pb.go

Proto file path normalization for KV service

proto/kv.pb.go

Updates proto file path references from proto/kv.proto to kv.proto
Renames internal variable references from file_proto_kv_proto_* to
file_kv_proto_*
Updates all message type index references and descriptor function
calls

+83/-83

Tests

6 files

nats_account_service_test.go `NATS Account Service Unit Tests` pkg/datasvc/nats_account_service_test.go Comprehensive test suite for `NATSAccountServer` with authorization context setup Tests account creation, user credential generation, and JWT signing with various scenarios Includes proto conversion helper function tests for limits, credential types, and permissions Validates error handling for missing required fields and invalid inputs	+474/-0
user_manager_test.go `User Credentials Generation Tests` pkg/nats/accounts/user_manager_test.go Tests for `GenerateUserCredentials` function covering collector, service, and admin credential types Validates custom permissions, expiration handling, and JWT claim structure Tests error cases for invalid seeds and wrong key types Includes helper function tests for credentials file formatting	+440/-0
account_manager_test.go `Account manager test suite for NATS tenant accounts` pkg/nats/accounts/account_manager_test.go Comprehensive test suite for `AccountSigner` with 370 lines covering account creation, JWT signing, and subject mappings Tests for tenant account creation with limits, custom mappings, and revocations Validation of default subject mappings for collectors (events, syslog, snmp, netflow, otel, logs, telemetry) Tests for account key derivation and multiple account generation	+370/-0
tenant_test.go `Tenant identity parsing and context management tests` pkg/tenant/tenant_test.go Test suite for tenant identity parsing and context management with 354 lines Tests for CN parsing with various formats and error cases Tests for tenant info string representation and NATS channel prefixing Tests for context-based tenant propagation and helper functions	+354/-0
operator_test.go `NATS operator initialization and JWT signing tests` pkg/nats/accounts/operator_test.go Test suite for NATS operator key generation and JWT creation with 357 lines Tests for operator initialization from seed, file, and environment variable Tests for system account public key configuration and invalid seed handling Tests for operator JWT creation and key encoding/decoding for storage	+357/-0
events_test.go `Event publisher tenant prefixing tests` pkg/natsutil/events_test.go Tests for tenant prefix environment variable detection Tests for subject prefixing with and without tenant context Tests for `NewEventPublisherWithPrefixing()` and `SetTenantPrefixing()` methods Validates fallback to `DefaultTenant` when tenant not in context	+135/-0

Configuration changes

1 files

.gitkeep

Docker Compose Credentials Directory

docker/compose/creds/.gitkeep

Creates placeholder directory for Docker Compose credentials storage

+1/-0

Additional files

101 files

.bazelignore	+4/-0
.bazelrc	+2/-0
.env.example	+38/-0
AGENTS.md	+171/-2
MODULE.bazel	+5/-0
Makefile	+4/-0
README-Docker.md	+16/-1
BUILD.bazel	+6/-0
BUILD.bazel	+12/-0
mix_release.bzl	+120/-49
config.json	+5/-6
config.rs	+25/-0
nats.rs	+4/-0
zen-consumer-with-otel.json	+1/-0
zen-consumer.json	+1/-0
BUILD.bazel	+1/-0
flowgger.toml	+1/-0
nats_output.rs	+14/-0
otel.toml	+1/-0
config.rs	+13/-0
nats_output.rs	+7/-0
setup.rs	+1/-0
config.rs	+21/-0
main.rs	+21/-1
docker-compose.elx.yml	+108/-0
docker-compose.spiffe.yml	+8/-0
docker-compose.yml	+325/-269
Dockerfile.agent-gateway	+94/-0
Dockerfile.core-elx	+108/-0
Dockerfile.poller-elx	+95/-0
Dockerfile.web-ng	+6/-0
agent-minimal.docker.json	+6/-6
agent.docker.json	+9/-15
agent.mtls.json	+13/-3
datasvc.docker.json	+3/-2
datasvc.mtls.json	+14/-1
db-event-writer.mtls.json	+1/-0
entrypoint-certs.sh	+11/-6
flowgger.docker.toml	+2/-1
generate-certs.sh	+216/-10
nats.docker.conf	+16/-160
netflow-consumer.mtls.json	+1/-0
otel.docker.toml	+2/-0
pg_hba.conf	+9/-0
pg_ident.conf	+17/-0
ssl_dist.core.conf	+17/-0
ssl_dist.gateway.conf	+17/-0
ssl_dist.poller.conf	+17/-0
ssl_dist.web.conf	+17/-0
trapd.docker.json	+2/-1
zen.docker.json	+2/-1
BUILD.bazel	+80/-0
push_targets.bzl	+2/-0
architecture.md	+58/-6
ash-api.md	+305/-0
ash-authentication.md	+244/-0
ash-authorization.md	+283/-0
ash-domains.md	+223/-0
ash-migration-guide.md	+339/-0
docker-setup.md	+39/-4
edge-agents.md	+253/-0
edge-onboarding.md	+316/-273
security-architecture.md	+238/-0
spiffe-identity.md	+52/-0
troubleshooting-guide.md	+51/-0
datasvc.ex	+32/-2
BUILD.bazel	+30/-0
config.exs	+13/-0
dev.exs	+4/-0
prod.exs	+4/-0
runtime.exs	+202/-0
test.exs	+20/-0
agent_client.ex	+472/-0
agent_gateway_server.ex	+226/-0
application.ex	+192/-0
config.ex	+311/-0
endpoint.ex	+14/-0
status_processor.ex	+202/-0
task_executor.ex	+435/-0
mix.exs	+69/-0
env.sh.eex	+78/-0
serviceradar-poller.service	+66/-0
vm.args.eex	+50/-0
registration_test.exs	+235/-0
test_helper.exs	+7/-0
.formatter.exs	+15/-0
BUILD.bazel	+14/-0
README.md	+269/-0
config.exs	+101/-0
dev.exs	+66/-0
prod.exs	+6/-0
runtime.exs	+196/-0
test.exs	+91/-0
eventwriter.md	+370/-0
device.ex	+608/-0
device_registry.ex	+284/-0
telemetry.ex	+267/-0
agent_tracker.ex	+177/-0
application.ex	+381/-0
backoff.ex	+59/-0
Additional files not shown

Imported from GitHub pull request. Original GitHub pull request: #2221 Original author: @mfreeman451 Original URL: https://github.com/carverauto/serviceradar/pull/2221 Original created: 2026-01-03T09:47:10Z Original updated: 2026-01-04T10:06:59Z Original head: carverauto/serviceradar:feat/agent-gateway-work Original base: testing Original merged: 2026-01-04T10:06:57Z by @mfreeman451 --- ### **User description** ## IMPORTANT: Please sign the Developer Certificate of Origin Thank you for your contribution to ServiceRadar. Please note, when contributing, the developer must include a [DCO sign-off statement]( https://developercertificate.org/) indicating the DCO acceptance in one commit message. Here is an example DCO Signed-off-by line in a commit message: ``` Signed-off-by: J. Doe <j.doe@domain.com> ``` ## Describe your changes ## Issue ticket number and link ## Code checklist before requesting a review - [ ] I have signed the DCO? - [ ] The build completes without errors? - [ ] All tests are passing when running make test? ___ ### **PR Type** Enhancement, Tests ___ ### **Description** - Implements comprehensive NATS account management system with operator bootstrap, tenant account creation, and user credential generation - Adds new `AgentGatewayService` gRPC API for agent-to-gateway status communication with push and streaming capabilities - Introduces multi-tenant identity extraction from mTLS certificates with tenant context propagation throughout the system - Implements agent push mode for active status transmission to gateway with automatic reconnection and exponential backoff - Adds NATS bootstrap CLI command supporting local and remote modes for operator and account initialization - Extends CLI with admin command dispatcher for managing NATS bootstrap and account operations - Implements tenant-aware NATS event subject prefixing via environment variable control - Adds comprehensive test coverage for account management, user credentials, operator initialization, and tenant context handling - Extends data models with collector package types, credentials structures, and NATS configuration fields - Normalizes proto file path references for KV service consistency ___ ### Diagram Walkthrough ```mermaid flowchart LR Agent["Agent Push Mode"] -->|"PushStatus StreamStatus"| Gateway["Agent Gateway Service"] Gateway -->|"mTLS Tenant Context"| DataSvc["Data Service NATS Account Mgmt"] DataSvc -->|"JWT Operations Credential Gen"| NATS["NATS Server Operator/Accounts"] CLI["CLI nats-bootstrap"] -->|"Bootstrap Config Gen"| NATS Cert["mTLS Certificate CN Parsing"] -->|"Tenant ID Extraction"| TenantCtx["Tenant Context Propagation"] TenantCtx -->|"Subject Prefixing Channel Routing"| EventPub["Event Publisher NATS Events"] ``` <details><summary><h3>File Walkthrough</h3></summary> <table><thead><tr><th></th><th align="left">Relevant files</th></tr></thead><tbody><tr><td>Enhancement</td><td><details><summary>20 files</summary><table> <tr> <td> <details> <summary>nats_account.pb.go<dd><code>NATS account management protobuf definitions</code>                          </dd></summary> <hr> proto/nats_account.pb.go <ul><li>Generated protobuf Go code for NATS account management with 1266 lines of new code <li> Defines message types for account limits, subject mappings, tenant account creation, and user credentials <li> Implements credential types (collector, service, admin) and user permission structures <li> Includes RPC service definitions for bootstrap, account creation, JWT signing, and credential generation</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-49eb93c28e2d86d8dcf4ee78fd24bed498cf3fcfaa8e49849a36e70980420087">+1266/-0</a></td> </tr> <tr> <td> <details> <summary>nats_bootstrap.go<dd><code>NATS bootstrap CLI implementation</code>                                                </dd></summary> <hr> pkg/cli/nats_bootstrap.go <ul><li>Implements NATS bootstrap CLI command with 961 lines of new functionality <li> Provides local and remote bootstrap modes for operator and account initialization <li> Generates NATS configuration files, JWTs, credentials, and system/platform accounts <li> Includes token generation, status checking, and tenant listing capabilities</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-2176d03ba6ab4ccc1b0587a4727171546eecf6123e4df815ead583aaab062457">+961/-0</a>  </td> </tr> <tr> <td> <details> <summary>cli.go<dd><code>CLI command routing for admin operations</code>                                  </dd></summary> <hr> pkg/cli/cli.go <ul><li>Adds new <code>AdminHandler</code> struct to dispatch nested admin commands <li> Registers <code>nats-bootstrap</code> and <code>admin</code> command handlers in the command map <li> Implements routing logic for admin subcommands (currently supporting <code>admin nats</code>)</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-3040c236897f2704958b674ce81c445b6de53f2ff4c204c812ec510de8a76a73">+22/-0</a>    </td> </tr> <tr> <td> <details> <summary>nats_account_service.go<dd><code>NATS Account Service Implementation with JWT Operations</code>    </dd></summary> <hr> pkg/datasvc/nats_account_service.go <ul><li>Implements <code>NATSAccountServer</code> struct for stateless NATS JWT/NKeys cryptographic operations <li> Provides gRPC methods for operator bootstrap, tenant account creation, user credential generation, and JWT signing <li> Includes mTLS authorization checks and resolver connection management for pushing JWTs to NATS <li> Supports file-based and NATS-based JWT resolver configurations with proper error handling</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-3e814328b5d5bddca9cd4b0ca021a975cf416afdc059e454452580a4ce751320">+808/-0</a>  </td> </tr> <tr> <td> <details> <summary>monitoring.pb.go<dd><code>Gateway Status Messages and Service Definition</code>                      </dd></summary> <hr> proto/monitoring.pb.go <ul><li>Adds <code>GatewayStatusRequest</code>, <code>GatewayStatusResponse</code>, <code>GatewayStatusChunk</code>, and <code>GatewayServiceStatus</code> message types <li> Includes tenant-aware fields (<code>tenant_id</code>, <code>tenant_slug</code>) for multi-tenant routing <li> Adds new <code>AgentGatewayService</code> with <code>PushStatus</code> and <code>StreamStatus</code> RPC methods <li> Updates message type count from 13 to 17 and service count from 2 to 3</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-4f7e955b42854cc9cf3fb063b95e58a04f36271e6a0c1cb42ea6d7953dd96cc4">+501/-20</a></td> </tr> <tr> <td> <details> <summary>nats_account_grpc.pb.go<dd><code>NATS Account Service gRPC Stubs Generation</code>                              </dd></summary> <hr> proto/nats_account_grpc.pb.go <ul><li>Auto-generated gRPC service stubs for <code>NATSAccountService</code> interface <li> Implements client and server interfaces with six RPC methods: <code>BootstrapOperator</code>, <code>GetOperatorInfo</code>, <code>CreateTenantAccount</code>, <code>GenerateUserCredentials</code>, <code>SignAccountJWT</code>, <code>PushAccountJWT</code> <li> Includes unary server handlers and service descriptor registration</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-af98ae5f073d25a2ea0a044c996542fc65d215f6a70f14a9fba0447d87b34e11">+376/-0</a>  </td> </tr> <tr> <td> <details> <summary>operator.go<dd><code>NATS Operator Key Management and Bootstrap</code>                              </dd></summary> <hr> pkg/nats/accounts/operator.go <ul><li>Defines <code>Operator</code> struct for managing NATS operator keys and signing operations <li> Implements <code>BootstrapOperator</code> function to initialize new or import existing operator keys <li> Provides key generation utilities for operators, accounts, and users with NKeys <li> Includes configuration loading from multiple sources (direct, environment, files)</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-14ae6d6482381ca8bbbd5dc42b4a5caa04269601624c8e7dc1941567f8e874d8">+476/-0</a>  </td> </tr> <tr> <td> <details> <summary>events.go<dd><code>NATS Configuration Extension for Credentials</code>                          </dd></summary> <hr> pkg/models/events.go <ul><li>Adds <code>CredsFile</code> field to <code>NATSConfig</code> struct for system account credentials file path <li> Maintains backward compatibility with existing NATS configuration fields</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-8430b0f58da1cbe15deac7e09bb4acaa1b1e3f2e8772d883a7df24ef6173fa2b">+4/-3</a>      </td> </tr> <tr> <td> <details> <summary>tenant.go<dd><code>Multi-tenant identity extraction from certificates</code>              </dd></summary> <hr> pkg/tenant/tenant.go <ul><li>New package providing multi-tenant identity extraction from mTLS certificate CNs <li> Implements CN parsing with format <code><component_id>.<partition_id>.<tenant_slug>.serviceradar</code> <li> Provides context-based tenant propagation and NATS channel prefixing utilities <li> Supports certificate extraction from gRPC contexts, TLS states, and PEM files</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-52b7de0bd266ec6f84b342a6515e76a08f89be71a1a69dabd93babf55f644178">+315/-0</a>  </td> </tr> <tr> <td> <details> <summary>nats_config.go<dd><code>NATS server configuration and bootstrap utilities</code>                </dd></summary> <hr> pkg/edgeonboarding/nats_config.go <ul><li>Configuration generation for NATS server with operator JWT and system account setup <li> Template-based config generation supporting TLS, JetStream, clustering, and leaf nodes <li> Helper functions for generating collector-specific NATS permissions by type <li> Data structures for NATS bootstrap requests and collector credentials</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-ae75ffeeb6591fed63961ef7cb46d3ded56258a641e009354f92a9ad78e38978">+328/-0</a>  </td> </tr> <tr> <td> <details> <summary>gateway_client.go<dd><code>Agent gateway client with reconnection logic</code>                          </dd></summary> <hr> pkg/agent/gateway_client.go <ul><li>New gRPC client for agent-to-gateway communication with connection management <li> Implements status push and streaming with automatic reconnection and exponential backoff <li> Supports mTLS security configuration and graceful disconnect <li> Provides connection state tracking and timeout handling</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-54aa63b6f05aabb0db096685e5be714c0c6d5c5a892310f048ff2ff8a542191f">+270/-0</a>  </td> </tr> <tr> <td> <details> <summary>account_manager.go<dd><code>NATS account creation and JWT signing</code>                                        </dd></summary> <hr> pkg/nats/accounts/account_manager.go <ul><li>Account signer for creating and managing NATS tenant accounts with JWT signing <li> Implements subject mapping for tenant isolation with <code>{{tenant}}</code> placeholder replacement <li> Supports account limits, JetStream enablement, and user key revocations <li> Provides stateless account creation and JWT regeneration operations</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-6d99437619fe01afb4470e118528ff6f0154b1169948cce1fccfadd742c41178">+238/-0</a>  </td> </tr> <tr> <td> <details> <summary>monitoring_grpc.pb.go<dd><code>New AgentGatewayService gRPC API</code>                                                  </dd></summary> <hr> proto/monitoring_grpc.pb.go <ul><li>Adds new <code>AgentGatewayService</code> gRPC service with PushStatus and StreamStatus methods <li> Marks <code>PollerService</code> as deprecated with migration guidance to <code>AgentGatewayService</code> <li> Implements client and server stubs for agent-to-gateway status communication <li> Supports both unary and client-streaming RPC patterns</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-4cc43cae31fd1906aa866d16b67cd0812e2a433eeb29b9070947b1693ad0146f">+148/-0</a>  </td> </tr> <tr> <td> <details> <summary>user_manager.go<dd><code>NATS user credential generation and management</code>                      </dd></summary> <hr> pkg/nats/accounts/user_manager.go <ul><li>User credential generation for NATS with type-based permissions (collector, service, admin) <li> Implements <code>.creds</code> file formatting for NATS authentication <li> Supports custom permissions and credential expiration <li> Applies tenant-scoped permissions based on credential type</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-84b756d1bdd4d008851fb5859e56f79f44c8083c766ee0314d7862484ce8c12b">+217/-0</a>  </td> </tr> <tr> <td> <details> <summary>push_loop.go<dd><code>Agent status push loop implementation</code>                                        </dd></summary> <hr> pkg/agent/push_loop.go <ul><li>Push loop implementation for periodic agent status collection and gateway transmission <li> Collects status from sweep services and configured checkers <li> Converts internal status format to gateway protocol messages <li> Handles source IP detection and tenant/partition metadata propagation</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-5f0d59be34ef26b449d7f5fd2b198a29b277936b9708a699f7487415ed6c2785">+222/-0</a>  </td> </tr> <tr> <td> <details> <summary>events.go<dd><code>Tenant-aware NATS event subject prefixing</code>                                </dd></summary> <hr> pkg/natsutil/events.go <ul><li>Adds tenant prefixing support for NATS event subjects via environment variable control <li> Implements <code>applyTenantPrefix()</code> to inject tenant slug into subject paths <li> Adds <code>NewEventPublisherWithPrefixing()</code> for explicit prefix control <li> Provides helper functions to check and modify prefixing state</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-3503010fc6a66fb16b8fbc055b7daf53be305d7284efe58d37a7fbf1813a6b63">+74/-13</a>  </td> </tr> <tr> <td> <details> <summary>main.go<dd><code>Data service NATS account initialization</code>                                  </dd></summary> <hr> cmd/data-services/main.go <ul><li>Initializes NATS account service with operator configuration and resolver paths <li> Supports operator bootstrap from environment variables and config files <li> Registers <code>NATSAccountService</code> gRPC endpoint for account management <li> Configures system account credentials and resolver client for JWT persistence</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-5e7731adfb877918cd65d9d5531621312496450fd550fea2682efca4ca8fe816">+68/-0</a>    </td> </tr> <tr> <td> <details> <summary>edge_onboarding.go<dd><code>Collector package and credential data models</code>                          </dd></summary> <hr> pkg/models/edge_onboarding.go <ul><li>Adds <code>CollectorType</code> enum for collector classification (flowgger, trapd, netflow, otel) <li> Adds <code>CollectorPackageStatus</code> enum for deployment lifecycle tracking <li> Introduces <code>CollectorPackage</code>, <code>NatsCredential</code>, and <code>CollectorDownloadResult</code> data structures <li> Supports collector package metadata, credential management, and download tracking</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-0c66da94363b25ecb9e4766c8f12cde387c1874aeb9099447c609c4e867b4fe0">+71/-0</a>    </td> </tr> <tr> <td> <details> <summary>main.go<dd><code>Agent push mode implementation with gateway integration</code>    </dd></summary> <hr> cmd/agent/main.go <ul><li>Implements push mode for agents to actively push status to gateway <li> Adds <code>runPushMode()</code> with gateway client initialization and push loop management <li> Implements graceful shutdown with signal handling for push mode <li> Maintains legacy server mode as fallback with deprecation warning</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-61358711e980ccf505246fd3915f97cbd3a380e9b66f6fa5aad46749968c5ca3">+82/-0</a>    </td> </tr> <tr> <td> <details> <summary>main.go<dd><code>CLI command routing for NATS bootstrap and admin</code>                  </dd></summary> <hr> cmd/cli/main.go <ul><li>Adds <code>nats-bootstrap</code> command routing for NATS server bootstrap operations <li> Adds <code>admin</code> command dispatcher with <code>nats</code> subcommand support <li> Implements <code>dispatchAdminCommand()</code> for admin resource routing <li> Maintains backward compatibility with existing bcrypt mode</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-ed4d81d29a7267f93fd77e17993fd3491b9ef6ded18490b4514d10ed1d803bc2">+14/-0</a>    </td> </tr> </table></details></td></tr><tr><td>Formatting</td><td><details><summary>1 files</summary><table> <tr> <td> <details> <summary>kv.pb.go<dd><code>Proto file path normalization for KV service</code>                          </dd></summary> <hr> proto/kv.pb.go <ul><li>Updates proto file path references from <code>proto/kv.proto</code> to <code>kv.proto</code> <li> Renames internal variable references from <code>file_proto_kv_proto_*</code> to <code>file_kv_proto_*</code> <li> Updates all message type index references and descriptor function calls</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-0b9bbdd1cf0f65de86dd5ee0c824e5d35326bb4b8edcf1c97b61864493f4477f">+83/-83</a>  </td> </tr> </table></details></td></tr><tr><td>Tests</td><td><details><summary>6 files</summary><table> <tr> <td> <details> <summary>nats_account_service_test.go<dd><code>NATS Account Service Unit Tests</code>                                                    </dd></summary> <hr> pkg/datasvc/nats_account_service_test.go <ul><li>Comprehensive test suite for <code>NATSAccountServer</code> with authorization context setup <li> Tests account creation, user credential generation, and JWT signing with various scenarios <li> Includes proto conversion helper function tests for limits, credential types, and permissions <li> Validates error handling for missing required fields and invalid inputs</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-3d1ee22d8514e26752c45da92cfacb867a80d29cb8bfc3682b5784d123bcfc52">+474/-0</a>  </td> </tr> <tr> <td> <details> <summary>user_manager_test.go<dd><code>User Credentials Generation Tests</code>                                                </dd></summary> <hr> pkg/nats/accounts/user_manager_test.go <ul><li>Tests for <code>GenerateUserCredentials</code> function covering collector, service, and admin credential types <li> Validates custom permissions, expiration handling, and JWT claim structure <li> Tests error cases for invalid seeds and wrong key types <li> Includes helper function tests for credentials file formatting</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-000364a4950aac254d71c594f4c09b87abd8a8993f855d7a71cb248847cb82ce">+440/-0</a>  </td> </tr> <tr> <td> <details> <summary>account_manager_test.go<dd><code>Account manager test suite for NATS tenant accounts</code>            </dd></summary> <hr> pkg/nats/accounts/account_manager_test.go <ul><li>Comprehensive test suite for <code>AccountSigner</code> with 370 lines covering account creation, JWT signing, and subject mappings <li> Tests for tenant account creation with limits, custom mappings, and revocations <li> Validation of default subject mappings for collectors (events, syslog, snmp, netflow, otel, logs, telemetry) <li> Tests for account key derivation and multiple account generation</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-0c2bcb7ae33e8e4e35efdfb12c247cc0f62c7d479ac18dbf9e2682664dd2a5a6">+370/-0</a>  </td> </tr> <tr> <td> <details> <summary>tenant_test.go<dd><code>Tenant identity parsing and context management tests</code>          </dd></summary> <hr> pkg/tenant/tenant_test.go <ul><li>Test suite for tenant identity parsing and context management with 354 lines <li> Tests for CN parsing with various formats and error cases <li> Tests for tenant info string representation and NATS channel prefixing <li> Tests for context-based tenant propagation and helper functions</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-04245997664d90bcc09ea6a3b4edcda1b462adac68b2920953262ada7d7f6fbb">+354/-0</a>  </td> </tr> <tr> <td> <details> <summary>operator_test.go<dd><code>NATS operator initialization and JWT signing tests</code>              </dd></summary> <hr> pkg/nats/accounts/operator_test.go <ul><li>Test suite for NATS operator key generation and JWT creation with 357 lines <li> Tests for operator initialization from seed, file, and environment variable <li> Tests for system account public key configuration and invalid seed handling <li> Tests for operator JWT creation and key encoding/decoding for storage</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-7e02d50421951cc5d2172638cfcfd3cd0b953726699b20d887808baa63cecbf0">+357/-0</a>  </td> </tr> <tr> <td> <details> <summary>events_test.go<dd><code>Event publisher tenant prefixing tests</code>                                      </dd></summary> <hr> pkg/natsutil/events_test.go <ul><li>Tests for tenant prefix environment variable detection <li> Tests for subject prefixing with and without tenant context <li> Tests for <code>NewEventPublisherWithPrefixing()</code> and <code>SetTenantPrefixing()</code> methods <li> Validates fallback to <code>DefaultTenant</code> when tenant not in context</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-c11081459515a60362c5729f90da0828cee115b39cfe881e252d1d119b2263e2">+135/-0</a>  </td> </tr> </table></details></td></tr><tr><td>Configuration changes</td><td><details><summary>1 files</summary><table> <tr> <td> <details> <summary>.gitkeep<dd><code>Docker Compose Credentials Directory</code>                                          </dd></summary> <hr> docker/compose/creds/.gitkeep <ul><li>Creates placeholder directory for Docker Compose credentials storage</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-d72c41aab2d6f2c230a4340dfefe7917cdd12bed942c825aa0d4c9875a637bac">+1/-0</a>      </td> </tr> </table></details></td></tr><tr><td>Additional files</td><td><details><summary>101 files</summary><table> <tr> <td>.bazelignore</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-a5641cd37d6ad98b32cdfce1980836cc68312277bc6a7052f55da02ada5bc6cf">+4/-0</a>      </td> </tr> <tr> <td>.bazelrc</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-544556920c45b42cbfe40159b082ce8af6bd929e492d076769226265f215832f">+2/-0</a>      </td> </tr> <tr> <td>.env.example</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-a3046da0d15a27e89f2afe639b25748a7ad4d9290af3e7b1b6c1a5533c8f0a8c">+38/-0</a>    </td> </tr> <tr> <td>AGENTS.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-a54ff182c7e8acf56acfd6e4b9c3ff41e2c41a31c9b211b2deb9df75d9a478f9">+171/-2</a>  </td> </tr> <tr> <td>MODULE.bazel</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-6136fc12446089c3db7360e923203dd114b6a1466252e71667c6791c20fe6bdc">+5/-0</a>      </td> </tr> <tr> <td>Makefile</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-76ed074a9305c04054cdebb9e9aad2d818052b07091de1f20cad0bbac34ffb52">+4/-0</a>      </td> </tr> <tr> <td>README-Docker.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-9fd61d24482efe68c22d8d41e2a1dcc440f39195aa56e7a050f2abe598179efd">+16/-1</a>    </td> </tr> <tr> <td>BUILD.bazel</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-884fa9353a5226345e44fbabea3300efc7a87dfbcde0b6a42521ca51823f1b68">+6/-0</a>      </td> </tr> <tr> <td>BUILD.bazel</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-0e80ea46aeb61a873324685edb96eae864c7a2004fbb7ee404b4ec951190ba10">+12/-0</a>    </td> </tr> <tr> <td>mix_release.bzl</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-86ec281f99363b6b6eb1f49e21d83b7eeca93a35b552b9f305fffc6855e38ccd">+120/-49</a></td> </tr> <tr> <td>config.json</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-5b1bc8fe77422534739bdd3a38dc20d2634a86c171265c34e1b5d0c5a61b6bab">+5/-6</a>      </td> </tr> <tr> <td>config.rs</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-05038f3867985e757de9027609950e682bad6d1992dac6acd7c28962a3c65dc4">+25/-0</a>    </td> </tr> <tr> <td>nats.rs</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-97f7335def0ad5d644b594a1076ae2d7080b11259cbb8de22c7946cc8e4b39f8">+4/-0</a>      </td> </tr> <tr> <td>zen-consumer-with-otel.json</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-68375f1f7847e1fbdf75664f6be65b1ad94ae6ce86ed73fc5964d65054668acb">+1/-0</a>      </td> </tr> <tr> <td>zen-consumer.json</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-4d308af9802a93a0f656e8c02a3b5fcd8991407bb18360f087470db74e1f9524">+1/-0</a>      </td> </tr> <tr> <td>BUILD.bazel</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-c62c0139ebdb337369f4067567cd2c52b8e7decb3ddfabc77f9f67b2f6e5789c">+1/-0</a>      </td> </tr> <tr> <td>flowgger.toml</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-af9f49f931e282dca53d1f0521b036d222fe671f77e61a876a84cf4c6d7cca4d">+1/-0</a>      </td> </tr> <tr> <td>nats_output.rs</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-a82e2e4d413539bf0b414b5629665b19648447523994cba639c4d1238aa5a0c1">+14/-0</a>    </td> </tr> <tr> <td>otel.toml</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-c64b9ace832b8ea57a2be62f84166e03bb1904882635d444ec76a880cdf14cc0">+1/-0</a>      </td> </tr> <tr> <td>config.rs</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-abbaec651da3d6af96b482e0f77bb909b65dbe0cabd78b5803769cc9dab0a1b0">+13/-0</a>    </td> </tr> <tr> <td>nats_output.rs</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-6b585ea3564a481174e04da1270e2e13edd4e2b980d02a2652d6d21e6d82a498">+7/-0</a>      </td> </tr> <tr> <td>setup.rs</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-3891f667deb20fd26e296d3e2742c57378d3764fe1743118e612465ae360391f">+1/-0</a>      </td> </tr> <tr> <td>config.rs</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-c89b88ba4d2bf0a054d0ba69a672a92c30140b8d19503d67b980a218ffe3106d">+21/-0</a>    </td> </tr> <tr> <td>main.rs</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-33b655d8730ae3e9c844ee280787d11f1b0d5343119188273f89558805f814ba">+21/-1</a>    </td> </tr> <tr> <td>docker-compose.elx.yml</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-9562070d7ad4a3e9b2d06567008cf35de1d96448d914b3b45bf6c36d97cdd914">+108/-0</a>  </td> </tr> <tr> <td>docker-compose.spiffe.yml</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-603fd9e7d40841d174f26b95d0cb0c9537430bf3f7a5da3ccbba4ea3d8ac66c9">+8/-0</a>      </td> </tr> <tr> <td>docker-compose.yml</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-e45e45baeda1c1e73482975a664062aa56f20c03dd9d64a827aba57775bed0d3">+325/-269</a></td> </tr> <tr> <td>Dockerfile.agent-gateway</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-332bc81a932ae08efa711a71b60fe0954d99bf17ebdab00a3baaa177a44de8b0">+94/-0</a>    </td> </tr> <tr> <td>Dockerfile.core-elx</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-5ec7a971285669999af442a0c7f141c34f7fd9180257307f5c4ed12f789a2182">+108/-0</a>  </td> </tr> <tr> <td>Dockerfile.poller-elx</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-966cdcf55b7defa4f9fd83a15c55a0e1106430e6e20926c1181187e95ed8d87e">+95/-0</a>    </td> </tr> <tr> <td>Dockerfile.web-ng</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-92d43af1965575d56c3380ecc8a81024aac2ff36f039ec2d3839e9fc7852bc10">+6/-0</a>      </td> </tr> <tr> <td>agent-minimal.docker.json</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-1f09fad94636c90373af8e270f6ba0332ae4f4d1df50a4909729280a3a9691e6">+6/-6</a>      </td> </tr> <tr> <td>agent.docker.json</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-5d33fe703515d03076d31261ecf946e9c6fc668cf5bf65099d49b670739e455e">+9/-15</a>    </td> </tr> <tr> <td>agent.mtls.json</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-008f2216f159a9bd5db9cc90baaf6f1e64487df7af05b56ab3b9d6c4946aa95f">+13/-3</a>    </td> </tr> <tr> <td>datasvc.docker.json</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-3f2719d3dbfe042e8383739e3c78e74e5f851a44e5e46bea8e79c4b79fdcc34f">+3/-2</a>      </td> </tr> <tr> <td>datasvc.mtls.json</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-3a45619e57f1e6e9a31486ec7fffb33ef246e271f82bac272ee0a946b88da70a">+14/-1</a>    </td> </tr> <tr> <td>db-event-writer.mtls.json</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-7a33f95f7545499abf0ed9fc91b58499ab209639e4885019579c959583fc7496">+1/-0</a>      </td> </tr> <tr> <td>entrypoint-certs.sh</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-83d6800b184a5233c66c69766286b0a60fece1bc64addb112d9f8dc019437f05">+11/-6</a>    </td> </tr> <tr> <td>flowgger.docker.toml</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-824f8797b418d4b9f5ea41e4a3741a0ed64b881f343072464489a76b7ea01008">+2/-1</a>      </td> </tr> <tr> <td>generate-certs.sh</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-8298241543b4744a6ac7780c760ac5b5a0a87ba62de19c8612ebe1aba0996ebd">+216/-10</a></td> </tr> <tr> <td>nats.docker.conf</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-06f2012494f428fe1bfb304972061c2094e0d99da88ba9af6914f7776872e6eb">+16/-160</a></td> </tr> <tr> <td>netflow-consumer.mtls.json</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-f15920e8498a24f71ce3eec4f48fe8fefbb1765a90362998af779a660fcef9e1">+1/-0</a>      </td> </tr> <tr> <td>otel.docker.toml</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-d4af38790e3657b7589cd37a7539d5308b032f11caba7aa740ddc86bf99f4415">+2/-0</a>      </td> </tr> <tr> <td>pg_hba.conf</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-7bd5f7292054916c7e5997f4c84ac9ec07d4c945621a48936c2aed0575fb96eb">+9/-0</a>      </td> </tr> <tr> <td>pg_ident.conf</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-e7b8ce062e32c61fdc3bcc9e525c1f1df1c8008fbc02b11409e58c67baa17cc5">+17/-0</a>    </td> </tr> <tr> <td>ssl_dist.core.conf</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-08d49d8621b581d1a9aa5c456f61e8c5774e021083c982cbb514019f915a1701">+17/-0</a>    </td> </tr> <tr> <td>ssl_dist.gateway.conf</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-4a43a8290d45ac68592000e7ef51afe78b4213090155bd42aafb46e66130f7ae">+17/-0</a>    </td> </tr> <tr> <td>ssl_dist.poller.conf</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-3373e5d02bd9a19c75819948cf829086dc9c02f41f30c78759beb7038c428d3e">+17/-0</a>    </td> </tr> <tr> <td>ssl_dist.web.conf</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-cef5be462ddb059fdfdeb9fd7c5cd70e656c4cd8b6ae1fe3fe312557b3da80ac">+17/-0</a>    </td> </tr> <tr> <td>trapd.docker.json</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-1ab1a0e03e63bc02e0ef31992a7187a377927272ed2060150b40d44cc0ea3357">+2/-1</a>      </td> </tr> <tr> <td>zen.docker.json</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-e060a3164cdc2746e0d9ad000fcf43c4bcdb05f4a41c586d7220e2ff2a7df01d">+2/-1</a>      </td> </tr> <tr> <td>BUILD.bazel</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-0e4db31c224a8f72ae8e870a849e38a59d74a2c7f7b04347b0b3eb07e20c5a80">+80/-0</a>    </td> </tr> <tr> <td>push_targets.bzl</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-4af33fe62caba04b6d479589c16cfb85babc39bae5c92595d4d4e31660738513">+2/-0</a>      </td> </tr> <tr> <td>architecture.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-90abd06467420fd89391fd1a4d75ceb1f6a9381de4d13a95fffe606abff38d37">+58/-6</a>    </td> </tr> <tr> <td>ash-api.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-1cb48a12148688dc640f42675d0b3c2458ecc40175f4ae7eef4b07bdc2ede3a3">+305/-0</a>  </td> </tr> <tr> <td>ash-authentication.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-e3d22dc8d6917661b980dd9c7e08bdc07fad4d225b10c60885b091a9cdd20425">+244/-0</a>  </td> </tr> <tr> <td>ash-authorization.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-72abeea2439e2281b6f998db2d11a73fe3ab393898125c466f0b16e7a49d3088">+283/-0</a>  </td> </tr> <tr> <td>ash-domains.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-bc78325c5ea4c6839212536a43336a231175450738af14f4173be663acf2fe49">+223/-0</a>  </td> </tr> <tr> <td>ash-migration-guide.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-e730373f07f6656d00327de02b4aa64e13809aadb3bf5478a4c60041ee431ad8">+339/-0</a>  </td> </tr> <tr> <td>docker-setup.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-8604269dffb3ce4133e48cab374ca8e97745d0efbdef67cad792aeb5945fe5ec">+39/-4</a>    </td> </tr> <tr> <td>edge-agents.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-df8f4b6b6cd8fb926f9204b2bab85ec4f393286d1b25d8a62d2debd357dd651d">+253/-0</a>  </td> </tr> <tr> <td>edge-onboarding.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-f32fbefcc4dd23b0f2146015630a411099012e8bf56ea7b9d52ae8b2a83ac3e3">+316/-273</a></td> </tr> <tr> <td>security-architecture.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-49e33c2fc609030af8a64af34016a22e9e86c0f3781acec5a9e6d0f7ccf5dc09">+238/-0</a>  </td> </tr> <tr> <td>spiffe-identity.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-6a3f0bdce99eb15d1075805cb5101594398f81ad92aa4eea9b27bb7545098db1">+52/-0</a>    </td> </tr> <tr> <td>troubleshooting-guide.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-343b87dd0b430b3f99b16d32200c353bb6e3d7bbb185da3c1b3effc3a03e7f2a">+51/-0</a>    </td> </tr> <tr> <td>datasvc.ex</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-00c928e77e8bf2f741a35e42e304acd2f2d4b8546532a5125a717924f2bde5dc">+32/-2</a>    </td> </tr> <tr> <td>BUILD.bazel</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-aa8f6a4e89041a3423f2299efab4bb1b5d13358d445b3c6e35cc46f527dbaa7b">+30/-0</a>    </td> </tr> <tr> <td>config.exs</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-abaac0167f05b505cf4bc5c5d375ae769001d8cc20120919941748c79681bbae">+13/-0</a>    </td> </tr> <tr> <td>dev.exs</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-378e53cb66373073730a5d7ceba4973777eb84ae419c8a38472e3f69c3056fe9">+4/-0</a>      </td> </tr> <tr> <td>prod.exs</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-5fdc859dfda96a02326392f61218325913f22f9f0c75a1276ee940d5db9fc62b">+4/-0</a>      </td> </tr> <tr> <td>runtime.exs</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-842568fafa717a8064203543d674517fd28ad7dd2a4d3f0f157d274cfda4f18b">+202/-0</a>  </td> </tr> <tr> <td>test.exs</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-00c8b58b49024f5d6f37ba40ac310a571e5a1b47fc16819423dc42f3035bc175">+20/-0</a>    </td> </tr> <tr> <td>agent_client.ex</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-82bf77f47ac3b0001b036f0f22944d07bf894f1ee07b912e0ecab959eb8d5c53">+472/-0</a>  </td> </tr> <tr> <td>agent_gateway_server.ex</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-369a368073dc8ec1140bcea699005a1ce97a90cd59629df0bd18c71c7ffaae9f">+226/-0</a>  </td> </tr> <tr> <td>application.ex</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-fc8dfd7489f127775b1f0baf09cea0cdf77b825dd5f92540e126a43cb246f5b2">+192/-0</a>  </td> </tr> <tr> <td>config.ex</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-5ba6871060dcff48aabea128b381e2d121bbb32256be1066c31dba447dc10813">+311/-0</a>  </td> </tr> <tr> <td>endpoint.ex</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-ba09da2cb13f93e8b0f8d1dad1a214444239ef82ccd052a1e580f1fd1173d3d4">+14/-0</a>    </td> </tr> <tr> <td>status_processor.ex</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-2d04050dff3ba2cc8153559e33a892f5f421982bf6dcbda7172857b3bf398a02">+202/-0</a>  </td> </tr> <tr> <td>task_executor.ex</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-5779481787c6634825d987f45daa3d38f8010f4476f108dd14befcd275a2becf">+435/-0</a>  </td> </tr> <tr> <td>mix.exs</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-0230a6c5cfe38c579eedeae533c0ae08389c513592b55ab18f9ecca34edbdef7">+69/-0</a>    </td> </tr> <tr> <td>env.sh.eex</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-e9df1de6119d13ee07c13d3368d94d8d5bf5e4bfb8eafa88296d92ca8cfba3d8">+78/-0</a>    </td> </tr> <tr> <td>serviceradar-poller.service</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-09a1230ce7d28b7c4ecc829f950ab86e06e519a57de9f8f072c47430db58de9a">+66/-0</a>    </td> </tr> <tr> <td>vm.args.eex</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-94b4c9aef90071b4b2f63e05a89e3f4b1aa385ca4dfc299c606179dce3fdc6df">+50/-0</a>    </td> </tr> <tr> <td>registration_test.exs</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-9a5e2ee4c17c6db1518f26ff99f0dfc6465038eaa839794c7415002662f7fb31">+235/-0</a>  </td> </tr> <tr> <td>test_helper.exs</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-438a6880b1ce83b9bc82deacc98898368920b156c596305b0cdf6b8ff0ed7b06">+7/-0</a>      </td> </tr> <tr> <td>.formatter.exs</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-791d4757cee68d168d8ab62d3eca6562470207fa6b3ccbf124e35c3fbd9de875">+15/-0</a>    </td> </tr> <tr> <td>BUILD.bazel</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-f1e707ff2843703a98d4752b3a3730c080c75772dbb5dbff1221f66af0678bc7">+14/-0</a>    </td> </tr> <tr> <td>README.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-5dc9fa4d00db77d9f3e08a76ec06c2e2d703a7976ac2531af5edd9e3d1175b56">+269/-0</a>  </td> </tr> <tr> <td>config.exs</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-42b3888cc53d9dcf4ebc45ec7fffb2c672b129bffe763b6c76de58e4678a13a8">+101/-0</a>  </td> </tr> <tr> <td>dev.exs</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-a18ef01363a8ee26ebf4b56399d88607b4cf7934c597d6d3085041195b06ab5b">+66/-0</a>    </td> </tr> <tr> <td>prod.exs</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-4fe837307f8710c783ecb9ae7595bb7bb9ec37d8522248bd081fa256803c3e92">+6/-0</a>      </td> </tr> <tr> <td>runtime.exs</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-5d045e2af9f15590a9745058e25a00380ad7a2cc363d0a9934715bd11ea8eef3">+196/-0</a>  </td> </tr> <tr> <td>test.exs</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-d8fffec56a9bc7305c02bbdd424e21031ecbbb2c2e6ba07ba626e6043540e360">+91/-0</a>    </td> </tr> <tr> <td>eventwriter.md</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-f505978234d706915d40bae31f46f0c906a941b1b63a82f2e8c37dcd01e21c4d">+370/-0</a>  </td> </tr> <tr> <td>device.ex</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-eba1d95a852e4a736813c7b486da651704f20718e24f931c966ff3f37c421eea">+608/-0</a>  </td> </tr> <tr> <td>device_registry.ex</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-c9bf6ea0ae2afb5d4cbcfe8686ebb5682031524e28edb55e43ca5e4b0512d281">+284/-0</a>  </td> </tr> <tr> <td>telemetry.ex</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-09c303c0c3b329f8470971b33c0c6b4af8161efc77c03edc35fbae33e4b68dbf">+267/-0</a>  </td> </tr> <tr> <td>agent_tracker.ex</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-792a06febe706658f0bd382113e31d8070d15c477202d51f1699e77c8317001a">+177/-0</a>  </td> </tr> <tr> <td>application.ex</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-a9ffbf400b7f9b22cd8980c41286c54fe373f1f1a8684bb6a344a5fb39b178d0">+381/-0</a>  </td> </tr> <tr> <td>backoff.ex</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-53ea1e963221e36412478d7951d5c99c6e88af68cfed27d2de18145f1225a2c1">+59/-0</a>    </td> </tr> <tr> <td>Additional files not shown</td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-2f328e4cd8dbe3ad193e49d92bcf045f47a6b72b1e9487d366f6b8288589b4ca"></a></td> </tr> </table></details></td></tr></tbody></table> </details> ___

qodo-code-review[bot] commented

2026-01-03 09:48:48 +00:00

(Migrated from github.com)

Author

Owner

Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3706933859
Original created: 2026-01-03T09:48:48Z

PR Compliance Guide 🔍

(Compliance updated until commit `github.com/carverauto/serviceradar@d0138ce012`)

Below is a summary of compliance checks for this PR:

Security Compliance
⚪	Insecure transport allowed Description: The client can fall back to plaintext gRPC using `grpc.WithTransportCredentials(insecure.NewCredentials())` when `SR_ALLOW_INSECURE=true`, which would allow MITM interception/modification of agent status/config traffic if enabled outside local/dev environments. gateway_client.go [178-187] Referred Code `} else { // Insecure connections require explicit opt-in via environment variable // to prevent accidental plaintext gRPC in production deployments if os.Getenv("SR_ALLOW_INSECURE") != "true" { return nil, ErrSecurityRequired } g.logger.Warn().Msg("Using insecure connection to gateway (SR_ALLOW_INSECURE=true)") opts = append(opts, grpc.WithTransportCredentials(insecure.NewCredentials())) }`
Ticket Compliance
⚪	🎫 No ticket provided Create ticket/issue
Codebase Duplication Compliance
⚪	Codebase context is not defined Follow the guide to enable codebase context checks.
Custom Compliance
🟢	Generic: Meaningful Naming and Self-Documenting Code Objective: Ensure all identifiers clearly express their purpose and intent, making code self-documenting Status: Passed Learn more about managing compliance generic rules or creating your own custom rules
	Generic: Robust Error Handling and Edge Case Management Objective: Ensure comprehensive error handling that provides meaningful context and graceful degradation Status: Passed Learn more about managing compliance generic rules or creating your own custom rules
⚪	Generic: Comprehensive Audit Trails Objective: To create a detailed and reliable record of critical system actions for security analysis and compliance. Status: Missing audit context: The new agent enrollment/configuration actions emit operational logs but do not demonstrate dedicated audit logging with consistently required audit fields (actor identity, timestamp, action, outcome) for critical actions. Referred Code // Hello sends an enrollment request to the gateway. // This should be called on agent startup to announce the agent and register with the gateway. func (g GatewayClient) Hello(ctx context.Context, req proto.AgentHelloRequest) (*proto.AgentHelloResponse, error) { g.mu.RLock() client := g.client connected := g.connected g.mu.RUnlock() if !connected \|\| client == nil { return nil, ErrGatewayNotConnected } helloCtx, cancel := context.WithTimeout(ctx, defaultConnectTimeout) defer cancel() resp, err := client.Hello(helloCtx, req) if err != nil { g.logger.Error().Err(err).Msg("Failed to send Hello to gateway") g.markDisconnected() return nil, fmt.Errorf("failed to send Hello: %w", err) } ... (clipped 50 lines) Learn more about managing compliance generic rules or creating your own custom rules
	Generic: Secure Error Handling Objective: To prevent the leakage of sensitive system information through error messages while providing sufficient detail for internal debugging. Status: Error detail exposure: Connection/enrollment errors include internal connection details (e.g., gateway address and upstream error strings) that may be surfaced to user-facing contexts depending on caller behavior not visible in this diff. Referred Code // Connect establishes a connection to the gateway. func (g *GatewayClient) Connect(ctx context.Context) error { g.mu.Lock() defer g.mu.Unlock() if g.addr == "" { return ErrGatewayAddrRequired } if g.conn != nil && g.connected { state := g.conn.GetState() if state == connectivity.Ready { return nil // Already connected and ready } // Connection exists but isn't healthy; force reconnect. _ = g.conn.Close() g.conn = nil g.client = nil g.connected = false if g.securityProvider != nil { _ = g.securityProvider.Close() ... (clipped 52 lines) Learn more about managing compliance generic rules or creating your own custom rules
	Generic: Secure Logging Practices Objective: To ensure logs are useful for debugging and auditing without exposing sensitive information like PII, PHI, or cardholder data. Status: Sensitive data in logs: The new logging includes potentially sensitive operational identifiers and network targets (e.g., `tenant_slug`, `tenant_id`, `target`, `port`) that could constitute sensitive tenant/internal infrastructure data depending on deployment policy. Referred Code p.setEnrolled(true) p.logger.Info(). Str("agent_id", helloResp.AgentId). Str("gateway_id", helloResp.GatewayId). Str("tenant_slug", helloResp.TenantSlug). Msg("Successfully enrolled with gateway") // Fetch initial config if outdated or not yet fetched if helloResp.ConfigOutdated \|\| p.getConfigVersion() == "" { p.fetchAndApplyConfig(ctx) } } // configPollLoop periodically polls for config updates. func (p *PushLoop) configPollLoop(ctx context.Context) { // Wait for initial enrollment before polling ticker := time.NewTicker(time.Second) defer ticker.Stop() for !p.isEnrolled() { select { case <-ctx.Done(): ... (clipped 127 lines) Learn more about managing compliance generic rules or creating your own custom rules
	Generic: Security-First Input Validation and Data Handling Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent vulnerabilities Status: Unverified remote config: Remote `GetConfig` check definitions are applied dynamically with limited validation, and the diff alone does not prove that the gateway-side authorization/authentication and integrity guarantees for these inputs are enforced end-to-end. Referred Code // fetchAndApplyConfig fetches config from gateway and applies it. func (p *PushLoop) fetchAndApplyConfig(ctx context.Context) { configReq := &proto.AgentConfigRequest{ AgentId: p.server.config.AgentID, ConfigVersion: p.getConfigVersion(), } configResp, err := p.gateway.GetConfig(ctx, configReq) if err != nil { p.logger.Error().Err(err).Msg("Failed to fetch config from gateway") return } // If config hasn't changed, nothing to do if configResp.NotModified { p.logger.Debug().Str("version", p.getConfigVersion()).Msg("Config not modified") return } // Update intervals from config response if configResp.HeartbeatIntervalSec > 0 { ... (clipped 162 lines) Learn more about managing compliance generic rules or creating your own custom rules
Update

Compliance status legend

🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

Previous compliance checks

Compliance check up to commit 8cdb823

Security Compliance
⚪	SSRF via remote config Description: Remote configuration received from the gateway is converted into active network targets (e.g., `check.Target` and `check.Port` in `protoCheckToCheckerConfig`, then stored in `p.server.checkerConfs` and later executed), which can enable SSRF/port-scanning behavior from the agent if the gateway/control-plane or its authZ is compromised or misconfigured. push_loop.go [492-615] Referred Code // applyChecks converts proto checks to checker configs and updates the server. func (p PushLoop) applyChecks(checks []proto.AgentCheckConfig) { if len(checks) == 0 { p.logger.Debug().Msg("No checks to apply") return } p.server.mu.Lock() defer p.server.mu.Unlock() // Track which checks we've seen (for removing stale checks later) seenChecks := make(map[string]bool) for _, check := range checks { // Guard against nil entries in the checks slice if check == nil { continue } if !check.Enabled { continue ... (clipped 103 lines)
	Sensitive identifier logging Description: The agent logs multi-tenant identifiers from the enrollment response (including `tenant_id` and `tenant_slug`), which can expose tenant metadata to anyone with access to agent logs and may violate least-privilege log hygiene in shared or centrally aggregated logging environments. gateway_client.go [381-389] Referred Code `g.logger.Info(). Str("agent_id", resp.AgentId). Str("gateway_id", resp.GatewayId). Str("tenant_id", resp.TenantId). Str("tenant_slug", resp.TenantSlug). Int32("heartbeat_interval_sec", resp.HeartbeatIntervalSec). Bool("config_outdated", resp.ConfigOutdated). Msg("Agent enrolled with gateway")`
Ticket Compliance
⚪	🎫 No ticket provided Create ticket/issue
Codebase Duplication Compliance
⚪	Codebase context is not defined Follow the guide to enable codebase context checks.
Custom Compliance
🟢	Generic: Meaningful Naming and Self-Documenting Code Objective: Ensure all identifiers clearly express their purpose and intent, making code self-documenting Status: Passed Learn more about managing compliance generic rules or creating your own custom rules
	Generic: Secure Error Handling Objective: To prevent the leakage of sensitive system information through error messages while providing sufficient detail for internal debugging. Status: Passed Learn more about managing compliance generic rules or creating your own custom rules
🔴	Generic: Robust Error Handling and Edge Case Management Objective: Ensure comprehensive error handling that provides meaningful context and graceful degradation Status: Nil dereference risk: `protoCheckToCheckerConfig` can return `nil` for invalid gateway-supplied checks, but `applyChecks` uses the returned `checkerConf` without a nil check which can cause a panic instead of graceful handling. Referred Code `// Convert proto check to CheckerConfig checkerConf := protoCheckToCheckerConfig(check) // Check if this config already exists and is unchanged if existing, exists := p.server.checkerConfs[check.Name]; exists { if existing.Type == checkerConf.Type && existing.Address == checkerConf.Address && existing.Timeout == checkerConf.Timeout { // Config unchanged, skip continue } }` Learn more about managing compliance generic rules or creating your own custom rules
	Generic: Security-First Input Validation and Data Handling Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent vulnerabilities Status: Unhandled invalid input: Gateway-supplied check data is partially validated in `protoCheckToCheckerConfig`, but invalid checks can produce `nil` configs and are not safely handled in `applyChecks`, enabling a crash via untrusted external input. Referred Code for _, check := range checks { // Guard against nil entries in the checks slice if check == nil { continue } if !check.Enabled { continue } seenChecks[check.Name] = true // Convert proto check to CheckerConfig checkerConf := protoCheckToCheckerConfig(check) // Check if this config already exists and is unchanged if existing, exists := p.server.checkerConfs[check.Name]; exists { if existing.Type == checkerConf.Type && existing.Address == checkerConf.Address && existing.Timeout == checkerConf.Timeout { // Config unchanged, skip ... (clipped 14 lines) Learn more about managing compliance generic rules or creating your own custom rules
⚪

Compliance check up to commit 6aa63f6

Security Compliance
🟢	No security concerns identified No security vulnerabilities detected by AI analysis. Human verification advised for critical code.
Ticket Compliance
⚪	🎫 No ticket provided Create ticket/issue
Codebase Duplication Compliance
⚪	Codebase context is not defined Follow the guide to enable codebase context checks.
Custom Compliance
🟢	Generic: Meaningful Naming and Self-Documenting Code Objective: Ensure all identifiers clearly express their purpose and intent, making code self-documenting Status: Passed Learn more about managing compliance generic rules or creating your own custom rules
🔴	Generic: Secure Logging Practices Objective: To ensure logs are useful for debugging and auditing without exposing sensitive information like PII, PHI, or cardholder data. Status: Sensitive identifiers logged: Logs include potentially sensitive tenant identifiers and infrastructure details (e.g., `tenant_id`, `tenant_slug`) which violates the requirement that no sensitive data be present in logs. Referred Code `g.logger.Info(). Str("agent_id", resp.AgentId). Str("gateway_id", resp.GatewayId). Str("tenant_id", resp.TenantId). Str("tenant_slug", resp.TenantSlug). Int32("heartbeat_interval_sec", resp.HeartbeatIntervalSec). Bool("config_outdated", resp.ConfigOutdated). Msg("Agent enrolled with gateway")` Learn more about managing compliance generic rules or creating your own custom rules
⚪	Generic: Comprehensive Audit Trails Objective: To create a detailed and reliable record of critical system actions for security analysis and compliance. Status: Missing actor context: Enrollment and config-application actions are logged but without a consistent "actor" identifier (e.g., user/admin) beyond `agent_id`, and other critical actions in the PR (not shown in this diff) may lack audit logging. Referred Code p.logger.Info().Msg("Enrolling with gateway...") // Build Hello request helloReq := &proto.AgentHelloRequest{ AgentId: p.server.config.AgentID, Version: Version, // Agent version from version.go Capabilities: getAgentCapabilities(), ConfigVersion: p.configVersion, } // Send Hello helloResp, err := p.gateway.Hello(ctx, helloReq) if err != nil { p.logger.Error().Err(err).Msg("Failed to enroll with gateway") return } // Update server config with tenant info from gateway p.server.config.TenantID = helloResp.TenantId p.server.config.TenantSlug = helloResp.TenantSlug ... (clipped 98 lines) Learn more about managing compliance generic rules or creating your own custom rules
	Generic: Robust Error Handling and Edge Case Management Objective: Ensure comprehensive error handling that provides meaningful context and graceful degradation Status: Nil map risk: `applyChecks` writes to `p.server.checkerConfs` without ensuring the map is initialized, which can panic if `checkerConfs` is nil depending on server construction not visible in this diff. Referred Code // applyChecks converts proto checks to checker configs and updates the server. func (p PushLoop) applyChecks(checks []proto.AgentCheckConfig) { if len(checks) == 0 { p.logger.Debug().Msg("No checks to apply") return } p.server.mu.Lock() defer p.server.mu.Unlock() // Track which checks we've seen (for removing stale checks later) seenChecks := make(map[string]bool) for _, check := range checks { // Guard against nil entries in the checks slice if check == nil { continue } if !check.Enabled { continue ... (clipped 27 lines) Learn more about managing compliance generic rules or creating your own custom rules
	Generic: Secure Error Handling Objective: To prevent the leakage of sensitive system information through error messages while providing sufficient detail for internal debugging. Status: Potentially detailed errors: Connection and enrollment failures are returned with detailed internal context (e.g., gateway address and gateway-provided rejection message) which may be user-visible in CLI output and should be reviewed for sensitive detail exposure. Referred Code // Connect establishes a connection to the gateway. func (g *GatewayClient) Connect(ctx context.Context) error { g.mu.Lock() defer g.mu.Unlock() if g.addr == "" { return ErrGatewayAddrRequired } if g.connected && g.conn != nil { return nil // Already connected } g.logger.Info().Str("addr", g.addr).Msg("Connecting to agent-gateway") opts, err := g.buildDialOptions(ctx) if err != nil { return fmt.Errorf("failed to build dial options: %w", err) } connectCtx, cancel := context.WithTimeout(ctx, defaultConnectTimeout) ... (clipped 249 lines) Learn more about managing compliance generic rules or creating your own custom rules
	Generic: Security-First Input Validation and Data Handling Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent vulnerabilities Status: Unvalidated config inputs: JSON config fields loaded from disk (including network endpoints and timing values) are used without explicit validation/sanitization, and correctness/security depends on constraints that are not shown in this diff. Referred Code // loadConfig loads agent configuration from file, falling back to embedded defaults. func loadConfig(configPath string) (*agent.ServerConfig, error) { var cfg agent.ServerConfig // Try to read config file data, err := os.ReadFile(configPath) if err != nil { if os.IsNotExist(err) { // Fall back to embedded default config data = defaultConfig } else { return nil, fmt.Errorf("failed to read config file: %w", err) } } if err := json.Unmarshal(data, &cfg); err != nil { return nil, fmt.Errorf("failed to parse config: %w", err) } return &cfg, nil } ... (clipped 70 lines) Learn more about managing compliance generic rules or creating your own custom rules

Compliance check up to commit 4d3d82c

Security Compliance
⚪	Insecure gRPC transport Description: The client explicitly allows falling back to `grpc.WithTransportCredentials(insecure.NewCredentials())` when `security` is nil/disabled, which would permit plaintext gRPC and enable MITM/credential interception if this path is reachable in non-development deployments. gateway_client.go [108-135] Referred Code // buildDialOptions constructs gRPC dial options based on security configuration. func (g *GatewayClient) buildDialOptions(ctx context.Context) ([]grpc.DialOption, error) { var opts []grpc.DialOption if g.security != nil && g.security.Mode != "" && g.security.Mode != models.SecurityModeNone { // Create security provider using the standard pattern provider, err := srgrpc.NewSecurityProvider(ctx, g.security, g.logger) if err != nil { return nil, fmt.Errorf("failed to create security provider: %w", err) } // Get client credentials from the provider creds, err := provider.GetClientCredentials(ctx) if err != nil { _ = provider.Close() return nil, fmt.Errorf("failed to get client credentials: %w", err) } g.securityProvider = provider opts = append(opts, creds) } else { ... (clipped 7 lines)
	Unexpected external egress Description: `getSourceIP()` performs an outbound UDP dial to the public address `8.8.8.8:80`, creating unexpected external egress that can leak environment/network metadata and may violate network/security policies in restricted deployments. push_loop.go [233-250] Referred Code `// getSourceIP attempts to determine the source IP of this agent. func (p PushLoop) getSourceIP() string { // First check if HostIP is configured if p.server.config.HostIP != "" { return p.server.config.HostIP } // Try to get the outbound IP conn, err := net.Dial("udp", "8.8.8.8:80") if err != nil { p.logger.Debug().Err(err).Msg("Failed to determine source IP") return "" } defer conn.Close() localAddr := conn.LocalAddr().(net.UDPAddr) return localAddr.IP.String() }`
Ticket Compliance
⚪	🎫 No ticket provided Create ticket/issue
Codebase Duplication Compliance
⚪	Codebase context is not defined Follow the guide to enable codebase context checks.
Custom Compliance
🟢	Generic: Meaningful Naming and Self-Documenting Code Objective: Ensure all identifiers clearly express their purpose and intent, making code self-documenting Status: Passed Learn more about managing compliance generic rules or creating your own custom rules
	Generic: Secure Error Handling Objective: To prevent the leakage of sensitive system information through error messages while providing sufficient detail for internal debugging. Status: Passed Learn more about managing compliance generic rules or creating your own custom rules
🔴	Generic: Robust Error Handling and Edge Case Management Objective: Ensure comprehensive error handling that provides meaningful context and graceful degradation Status: Nil check panic: `applyChecks` does not guard against nil entries in the `checks` slice and will panic when dereferencing `check.Enabled`. Referred Code `for _, check := range checks { if !check.Enabled { continue }` Learn more about managing compliance generic rules or creating your own custom rules
⚪	Generic: Comprehensive Audit Trails Objective: To create a detailed and reliable record of critical system actions for security analysis and compliance. Status: Audit scope unclear: The new agent enrollment/config/status actions are logged, but it is not verifiable from the diff alone that these logs meet your audit requirements (e.g., consistent actor identity, timestamps, and outcome recording across all critical actions). Referred Code // Start begins the push loop. It runs until the context is cancelled. func (p *PushLoop) Start(ctx context.Context) error { p.logger.Info().Dur("interval", p.interval).Msg("Starting push loop") // Initial connection and enrollment attempt if err := p.gateway.Connect(ctx); err != nil { p.logger.Warn().Err(err).Msg("Initial gateway connection failed, will retry") } else { // Connected, try to enroll p.enroll(ctx) } // Start config polling in a separate goroutine go p.configPollLoop(ctx) ticker := time.NewTicker(p.interval) defer ticker.Stop() // Do an initial push immediately (only if enrolled) if p.enrolled { p.pushStatus(ctx) ... (clipped 73 lines) Learn more about managing compliance generic rules or creating your own custom rules
	Generic: Secure Logging Practices Objective: To ensure logs are useful for debugging and auditing without exposing sensitive information like PII, PHI, or cardholder data. Status: Identifier logging: The new logs include tenant identifiers (e.g., `tenant_id` and `tenant_slug`), and whether these are classified as sensitive/PII cannot be confirmed from the diff alone. Referred Code `g.logger.Info(). Str("agent_id", resp.AgentId). Str("gateway_id", resp.GatewayId). Str("tenant_id", resp.TenantId). Str("tenant_slug", resp.TenantSlug). Int32("heartbeat_interval_sec", resp.HeartbeatIntervalSec). Bool("config_outdated", resp.ConfigOutdated). Msg("Agent enrolled with gateway")` Learn more about managing compliance generic rules or creating your own custom rules
	Generic: Security-First Input Validation and Data Handling Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent vulnerabilities Status: Insecure transport option: The gateway client explicitly allows falling back to `insecure.NewCredentials()` when security mode is none, which may violate expected security posture depending on deployment requirements. Referred Code if g.security != nil && g.security.Mode != "" && g.security.Mode != models.SecurityModeNone { // Create security provider using the standard pattern provider, err := srgrpc.NewSecurityProvider(ctx, g.security, g.logger) if err != nil { return nil, fmt.Errorf("failed to create security provider: %w", err) } // Get client credentials from the provider creds, err := provider.GetClientCredentials(ctx) if err != nil { _ = provider.Close() return nil, fmt.Errorf("failed to get client credentials: %w", err) } g.securityProvider = provider opts = append(opts, creds) } else { // Use insecure connection (for development/testing) g.logger.Warn().Msg("Using insecure connection to gateway (no mTLS)") opts = append(opts, grpc.WithTransportCredentials(insecure.NewCredentials())) } ... (clipped 1 lines) Learn more about managing compliance generic rules or creating your own custom rules

Compliance check up to commit 6a52314

Security Compliance
⚪	Credential secret exposure Description: `GenerateUserCredentials` returns `CredsFileContent` containing the raw user NKey seed (via `formatCredsFile(userJWT, userSeed)`), which is highly sensitive and could be exfiltrated if this value is ever logged, stored, or returned over an insufficiently protected channel. user_manager.go [54-120] Referred Code // GenerateUserCredentials creates NATS credentials for a user in a tenant's account. // The accountSeed is the tenant's account private key (passed from Elixir storage). func GenerateUserCredentials( tenantSlug string, accountSeed string, userName string, credType UserCredentialType, permissions UserPermissions, expirationSeconds int64, ) (UserCredentials, error) { // Parse account seed to get key pair for signing accountKP, err := nkeys.FromSeed([]byte(accountSeed)) if err != nil { return nil, fmt.Errorf("invalid account seed: %w", err) } accountPublicKey, err := accountKP.PublicKey() if err != nil { return nil, fmt.Errorf("failed to get account public key: %w", err) } ... (clipped 46 lines)
	Tenant isolation bypass Description: Default publish/subscribe permissions include non-tenant-scoped subjects (e.g., `events.>`, `logs.>`, `telemetry.>`) and also build allow-subjects from unvalidated `tenantSlug` (e.g., `fmt.Sprintf("%s.>", tenantSlug)`), which could enable cross-tenant access if subject mapping is misconfigured or if `tenantSlug` contains wildcard/special tokens. user_manager.go [123-198] Referred Code // applyUserPermissions sets permissions on user claims based on credential type. func applyUserPermissions( claims jwt.UserClaims, tenantSlug string, credType UserCredentialType, custom UserPermissions, ) { // Start with type-based defaults switch credType { case CredentialTypeCollector: // Collectors can publish to their tenant's subjects (mapped by account) // and subscribe to responses claims.Pub.Allow.Add("events.>") claims.Pub.Allow.Add("snmp.traps") claims.Pub.Allow.Add("logs.>") claims.Pub.Allow.Add("telemetry.>") claims.Pub.Allow.Add("netflow.>") claims.Sub.Allow.Add("_INBOX.>") claims.Resp = &jwt.ResponsePermission{ MaxMsgs: 1, Expires: time.Minute, ... (clipped 55 lines)
Ticket Compliance
⚪	🎫 No ticket provided Create ticket/issue
Codebase Duplication Compliance
⚪	Codebase context is not defined Follow the guide to enable codebase context checks.
Custom Compliance
🟢	Generic: Meaningful Naming and Self-Documenting Code Objective: Ensure all identifiers clearly express their purpose and intent, making code self-documenting Status: Passed Learn more about managing compliance generic rules or creating your own custom rules
⚪	Generic: Comprehensive Audit Trails Objective: To create a detailed and reliable record of critical system actions for security analysis and compliance. Status: Audit logging unknown: The diff introduces APIs for operator/account bootstrap and credential generation, but no corresponding audit logging is visible in the provided hunks, so it cannot be verified that these critical actions are reliably logged with actor context and outcomes. Referred Code // CreateTenantAccountRequest is the request to create a new tenant NATS account. type CreateTenantAccountRequest struct { state protoimpl.MessageState `protogen:"open.v1"` TenantSlug string `protobuf:"bytes,1,opt,name=tenant_slug,json=tenantSlug,proto3" json:"tenant_slug,omitempty"` // Unique tenant identifier (e.g., "acme-corp") Limits AccountLimits `protobuf:"bytes,2,opt,name=limits,proto3" json:"limits,omitempty"` // Optional resource limits SubjectMappings []SubjectMapping `protobuf:"bytes,3,rep,name=subject_mappings,json=subjectMappings,proto3" json:"subject_mappings,omitempty"` // Optional custom subject mappings unknownFields protoimpl.UnknownFields sizeCache protoimpl.SizeCache } func (x CreateTenantAccountRequest) Reset() { x = CreateTenantAccountRequest{} mi := &file_nats_account_proto_msgTypes[2] ms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x)) ms.StoreMessageInfo(mi) } func (x *CreateTenantAccountRequest) String() string { return protoimpl.X.MessageStringOf(x) } ... (clipped 829 lines) Learn more about managing compliance generic rules or creating your own custom rules
	Generic: Robust Error Handling and Edge Case Management Objective: Ensure comprehensive error handling that provides meaningful context and graceful degradation Status: Error handling unseen: The new gRPC request/response surface for account and credential operations is added, but the diff does not include the service implementation where validation and error handling would occur, so robustness against edge cases cannot be confirmed. Referred Code // CreateTenantAccountRequest is the request to create a new tenant NATS account. type CreateTenantAccountRequest struct { state protoimpl.MessageState `protogen:"open.v1"` TenantSlug string `protobuf:"bytes,1,opt,name=tenant_slug,json=tenantSlug,proto3" json:"tenant_slug,omitempty"` // Unique tenant identifier (e.g., "acme-corp") Limits AccountLimits `protobuf:"bytes,2,opt,name=limits,proto3" json:"limits,omitempty"` // Optional resource limits SubjectMappings []SubjectMapping `protobuf:"bytes,3,rep,name=subject_mappings,json=subjectMappings,proto3" json:"subject_mappings,omitempty"` // Optional custom subject mappings unknownFields protoimpl.UnknownFields sizeCache protoimpl.SizeCache } func (x CreateTenantAccountRequest) Reset() { x = CreateTenantAccountRequest{} mi := &file_nats_account_proto_msgTypes[2] ms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x)) ms.StoreMessageInfo(mi) } func (x *CreateTenantAccountRequest) String() string { return protoimpl.X.MessageStringOf(x) } ... (clipped 926 lines) Learn more about managing compliance generic rules or creating your own custom rules
	Generic: Secure Error Handling Objective: To prevent the leakage of sensitive system information through error messages while providing sufficient detail for internal debugging. Status: Error exposure risk: Responses like `PushAccountJWTResponse.message` appear able to carry error details, but without the server-side implementation in the visible diff it cannot be verified that user-facing errors are sanitized and sensitive internals are only logged securely. Referred Code // PushAccountJWTRequest is the request to push an account JWT to the NATS resolver. type PushAccountJWTRequest struct { state protoimpl.MessageState `protogen:"open.v1"` AccountPublicKey string `protobuf:"bytes,1,opt,name=account_public_key,json=accountPublicKey,proto3" json:"account_public_key,omitempty"` // The account's public NKey (starts with 'A') AccountJwt string `protobuf:"bytes,2,opt,name=account_jwt,json=accountJwt,proto3" json:"account_jwt,omitempty"` // The signed account JWT to push unknownFields protoimpl.UnknownFields sizeCache protoimpl.SizeCache } func (x PushAccountJWTRequest) Reset() { x = PushAccountJWTRequest{} mi := &file_nats_account_proto_msgTypes[13] ms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x)) ms.StoreMessageInfo(mi) } func (x PushAccountJWTRequest) String() string { return protoimpl.X.MessageStringOf(x) } func (PushAccountJWTRequest) ProtoMessage() {} ... (clipped 84 lines) Learn more about managing compliance generic rules or creating your own custom rules
	Generic: Secure Logging Practices Objective: To ensure logs are useful for debugging and auditing without exposing sensitive information like PII, PHI, or cardholder data. Status: Secrets may be logged: The added models include highly sensitive fields (e.g., `CollectorDownloadResult.nats_creds_file` and `mtls_bundle`) and the diff does not show safeguards preventing these values from being logged, so secure logging compliance cannot be verified. Referred Code // CollectorPackage represents a collector deployment package with NATS credentials. type CollectorPackage struct { PackageID string `json:"package_id"` TenantID string `json:"tenant_id"` CollectorType CollectorType `json:"collector_type"` UserName string `json:"user_name"` Site string `json:"site,omitempty"` Hostname string `json:"hostname,omitempty"` Status CollectorPackageStatus `json:"status"` NatsCredentialID string `json:"nats_credential_id,omitempty"` DownloadTokenHash string `json:"download_token_hash,omitempty"` DownloadTokenExpiresAt time.Time `json:"download_token_expires_at,omitempty"` DownloadedAt time.Time `json:"downloaded_at,omitempty"` DownloadedByIP string `json:"downloaded_by_ip,omitempty"` InstalledAt time.Time `json:"installed_at,omitempty"` RevokedAt *time.Time `json:"revoked_at,omitempty"` RevokeReason string `json:"revoke_reason,omitempty"` ErrorMessage string `json:"error_message,omitempty"` ConfigOverrides map[string]interface{} `json:"config_overrides,omitempty"` CreatedAt time.Time `json:"created_at"` UpdatedAt time.Time `json:"updated_at"` ... (clipped 26 lines) Learn more about managing compliance generic rules or creating your own custom rules
	Generic: Security-First Input Validation and Data Handling Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent vulnerabilities Status: Sensitive data handling: New data structures introduce sensitive credential material and token-related fields (e.g., `download_token_hash`, `.creds` content, and `mtls_bundle`), but the diff does not include the creation/transport/storage code needed to verify encryption, access controls, and validation of externally supplied identifiers. Referred Code // CollectorPackage represents a collector deployment package with NATS credentials. type CollectorPackage struct { PackageID string `json:"package_id"` TenantID string `json:"tenant_id"` CollectorType CollectorType `json:"collector_type"` UserName string `json:"user_name"` Site string `json:"site,omitempty"` Hostname string `json:"hostname,omitempty"` Status CollectorPackageStatus `json:"status"` NatsCredentialID string `json:"nats_credential_id,omitempty"` DownloadTokenHash string `json:"download_token_hash,omitempty"` DownloadTokenExpiresAt time.Time `json:"download_token_expires_at,omitempty"` DownloadedAt time.Time `json:"downloaded_at,omitempty"` DownloadedByIP string `json:"downloaded_by_ip,omitempty"` InstalledAt time.Time `json:"installed_at,omitempty"` RevokedAt *time.Time `json:"revoked_at,omitempty"` RevokeReason string `json:"revoke_reason,omitempty"` ErrorMessage string `json:"error_message,omitempty"` ConfigOverrides map[string]interface{} `json:"config_overrides,omitempty"` CreatedAt time.Time `json:"created_at"` UpdatedAt time.Time `json:"updated_at"` ... (clipped 26 lines) Learn more about managing compliance generic rules or creating your own custom rules

Imported GitHub PR comment. Original author: @qodo-code-review[bot] Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3706933859 Original created: 2026-01-03T09:48:48Z --- ## PR Compliance Guide 🔍  #### (Compliance updated until commit https://github.com/carverauto/serviceradar/commit/d0138ce01275f980df692d4cd71400cdcd064ffd) Below is a summary of compliance checks for this PR: <table><tbody><tr><td colspan='2'>Security Compliance</td></tr> <tr><td rowspan=1>⚪</td> <td><details><summary>Insecure transport allowed </summary> Description: The client can fall back to plaintext gRPC using <code>grpc.WithTransportCredentials(insecure.NewCredentials())</code> when <code>SR_ALLOW_INSECURE=true</code>, which would allow MITM interception/modification of agent status/config traffic if enabled outside local/dev environments. <a href='https://github.com/carverauto/serviceradar/pull/2221/files#diff-54aa63b6f05aabb0db096685e5be714c0c6d5c5a892310f048ff2ff8a542191fR178-R187'>gateway_client.go [178-187]</a> <details open><summary>Referred Code</summary> ```go } else { // Insecure connections require explicit opt-in via environment variable // to prevent accidental plaintext gRPC in production deployments if os.Getenv("SR_ALLOW_INSECURE") != "true" { return nil, ErrSecurityRequired } g.logger.Warn().Msg("Using insecure connection to gateway (SR_ALLOW_INSECURE=true)") opts = append(opts, grpc.WithTransportCredentials(insecure.NewCredentials())) } ``` </details></details></td></tr> <tr><td colspan='2'>Ticket Compliance</td></tr> <tr><td>⚪</td><td><details><summary>🎫 No ticket provided </summary> - [ ] Create ticket/issue  </details></td></tr> <tr><td colspan='2'>Codebase Duplication Compliance</td></tr> <tr><td>⚪</td><td><details><summary>Codebase context is not defined </summary> Follow the <a href='https://qodo-merge-docs.qodo.ai/core-abilities/rag_context_enrichment/'>guide</a> to enable codebase context checks. </details></td></tr> <tr><td colspan='2'>Custom Compliance</td></tr> <tr><td rowspan=2>🟢</td><td> <details><summary>Generic: Meaningful Naming and Self-Documenting Code</summary> **Objective:** Ensure all identifiers clearly express their purpose and intent, making code self-documenting **Status:** Passed > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td> <details><summary>Generic: Robust Error Handling and Edge Case Management</summary> **Objective:** Ensure comprehensive error handling that provides meaningful context and graceful degradation **Status:** Passed > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td rowspan=4>⚪</td> <td><details> <summary>Generic: Comprehensive Audit Trails</summary> **Objective:** To create a detailed and reliable record of critical system actions for security analysis and compliance. **Status:** <a href='https://github.com/carverauto/serviceradar/pull/2221/files#diff-54aa63b6f05aabb0db096685e5be714c0c6d5c5a892310f048ff2ff8a542191fR367-R437'>Missing audit context</a>: The new agent enrollment/configuration actions emit operational logs but do not demonstrate dedicated audit logging with consistently required audit fields (actor identity, timestamp, action, outcome) for critical actions. <details open><summary>Referred Code</summary> ```go // Hello sends an enrollment request to the gateway. // This should be called on agent startup to announce the agent and register with the gateway. func (g *GatewayClient) Hello(ctx context.Context, req *proto.AgentHelloRequest) (*proto.AgentHelloResponse, error) { g.mu.RLock() client := g.client connected := g.connected g.mu.RUnlock() if !connected || client == nil { return nil, ErrGatewayNotConnected } helloCtx, cancel := context.WithTimeout(ctx, defaultConnectTimeout) defer cancel() resp, err := client.Hello(helloCtx, req) if err != nil { g.logger.Error().Err(err).Msg("Failed to send Hello to gateway") g.markDisconnected() return nil, fmt.Errorf("failed to send Hello: %w", err) } ... (clipped 50 lines) ``` </details> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td><details> <summary>Generic: Secure Error Handling</summary> **Objective:** To prevent the leakage of sensitive system information through error messages while providing sufficient detail for internal debugging. **Status:** <a href='https://github.com/carverauto/serviceradar/pull/2221/files#diff-54aa63b6f05aabb0db096685e5be714c0c6d5c5a892310f048ff2ff8a542191fR84-R156'>Error detail exposure</a>: Connection/enrollment errors include internal connection details (e.g., gateway address and upstream error strings) that may be surfaced to user-facing contexts depending on caller behavior not visible in this diff. <details open><summary>Referred Code</summary> ```go // Connect establishes a connection to the gateway. func (g *GatewayClient) Connect(ctx context.Context) error { g.mu.Lock() defer g.mu.Unlock() if g.addr == "" { return ErrGatewayAddrRequired } if g.conn != nil && g.connected { state := g.conn.GetState() if state == connectivity.Ready { return nil // Already connected and ready } // Connection exists but isn't healthy; force reconnect. _ = g.conn.Close() g.conn = nil g.client = nil g.connected = false if g.securityProvider != nil { _ = g.securityProvider.Close() ... (clipped 52 lines) ``` </details> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td><details> <summary>Generic: Secure Logging Practices</summary> **Objective:** To ensure logs are useful for debugging and auditing without exposing sensitive information like PII, PHI, or cardholder data. **Status:** <a href='https://github.com/carverauto/serviceradar/pull/2221/files#diff-5f0d59be34ef26b449d7f5fd2b198a29b277936b9708a699f7487415ed6c2785R412-R559'>Sensitive data in logs</a>: The new logging includes potentially sensitive operational identifiers and network targets (e.g., <code>tenant_slug</code>, <code>tenant_id</code>, <code>target</code>, <code>port</code>) that could constitute sensitive tenant/internal infrastructure data depending on deployment policy. <details open><summary>Referred Code</summary> ```go p.setEnrolled(true) p.logger.Info(). Str("agent_id", helloResp.AgentId). Str("gateway_id", helloResp.GatewayId). Str("tenant_slug", helloResp.TenantSlug). Msg("Successfully enrolled with gateway") // Fetch initial config if outdated or not yet fetched if helloResp.ConfigOutdated || p.getConfigVersion() == "" { p.fetchAndApplyConfig(ctx) } } // configPollLoop periodically polls for config updates. func (p *PushLoop) configPollLoop(ctx context.Context) { // Wait for initial enrollment before polling ticker := time.NewTicker(time.Second) defer ticker.Stop() for !p.isEnrolled() { select { case <-ctx.Done(): ... (clipped 127 lines) ``` </details> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td><details> <summary>Generic: Security-First Input Validation and Data Handling</summary> **Objective:** Ensure all data inputs are validated, sanitized, and handled securely to prevent vulnerabilities **Status:** <a href='https://github.com/carverauto/serviceradar/pull/2221/files#diff-5f0d59be34ef26b449d7f5fd2b198a29b277936b9708a699f7487415ed6c2785R457-R639'>Unverified remote config</a>: Remote <code>GetConfig</code> check definitions are applied dynamically with limited validation, and the diff alone does not prove that the gateway-side authorization/authentication and integrity guarantees for these inputs are enforced end-to-end. <details open><summary>Referred Code</summary> ```go // fetchAndApplyConfig fetches config from gateway and applies it. func (p *PushLoop) fetchAndApplyConfig(ctx context.Context) { configReq := &proto.AgentConfigRequest{ AgentId: p.server.config.AgentID, ConfigVersion: p.getConfigVersion(), } configResp, err := p.gateway.GetConfig(ctx, configReq) if err != nil { p.logger.Error().Err(err).Msg("Failed to fetch config from gateway") return } // If config hasn't changed, nothing to do if configResp.NotModified { p.logger.Debug().Str("version", p.getConfigVersion()).Msg("Config not modified") return } // Update intervals from config response if configResp.HeartbeatIntervalSec > 0 { ... (clipped 162 lines) ``` </details> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td align="center" colspan="2"> - [ ] Update  </td></tr></tbody></table> <details><summary>Compliance status legend</summary> 🟢 - Fully Compliant 🟡 - Partial Compliant 🔴 - Not Compliant ⚪ - Requires Further Human Verification 🏷️ - Compliance label </details> ___ #### Previous compliance checks <details> <summary>Compliance check up to commit <a href='https://github.com/carverauto/serviceradar/commit/8cdb82329f0fb24c24558a22c1a94c72b6444b9d'>8cdb823</a></summary> <table><tbody><tr><td colspan='2'>Security Compliance</td></tr> <tr><td rowspan=2>⚪</td> <td><details><summary>SSRF via remote config </summary> Description: Remote configuration received from the gateway is converted into active network targets (e.g., <code>check.Target</code> and <code>check.Port</code> in <code>protoCheckToCheckerConfig</code>, then stored in <code>p.server.checkerConfs</code> and later executed), which can enable SSRF/port-scanning behavior from the agent if the gateway/control-plane or its authZ is compromised or misconfigured. <a href='https://github.com/carverauto/serviceradar/pull/2221/files#diff-5f0d59be34ef26b449d7f5fd2b198a29b277936b9708a699f7487415ed6c2785R492-R615'>push_loop.go [492-615]</a> <details open><summary>Referred Code</summary> ```go // applyChecks converts proto checks to checker configs and updates the server. func (p *PushLoop) applyChecks(checks []*proto.AgentCheckConfig) { if len(checks) == 0 { p.logger.Debug().Msg("No checks to apply") return } p.server.mu.Lock() defer p.server.mu.Unlock() // Track which checks we've seen (for removing stale checks later) seenChecks := make(map[string]bool) for _, check := range checks { // Guard against nil entries in the checks slice if check == nil { continue } if !check.Enabled { continue ... (clipped 103 lines) ``` </details></details></td></tr> <tr><td><details><summary>Sensitive identifier logging </summary> Description: The agent logs multi-tenant identifiers from the enrollment response (including <code>tenant_id</code> and <code>tenant_slug</code>), which can expose tenant metadata to anyone with access to agent logs and may violate least-privilege log hygiene in shared or centrally aggregated logging environments. <a href='https://github.com/carverauto/serviceradar/pull/2221/files#diff-54aa63b6f05aabb0db096685e5be714c0c6d5c5a892310f048ff2ff8a542191fR381-R389'>gateway_client.go [381-389]</a> <details open><summary>Referred Code</summary> ```go g.logger.Info(). Str("agent_id", resp.AgentId). Str("gateway_id", resp.GatewayId). Str("tenant_id", resp.TenantId). Str("tenant_slug", resp.TenantSlug). Int32("heartbeat_interval_sec", resp.HeartbeatIntervalSec). Bool("config_outdated", resp.ConfigOutdated). Msg("Agent enrolled with gateway") ``` </details></details></td></tr> <tr><td colspan='2'>Ticket Compliance</td></tr> <tr><td>⚪</td><td><details><summary>🎫 No ticket provided </summary> - [ ] Create ticket/issue  </details></td></tr> <tr><td colspan='2'>Codebase Duplication Compliance</td></tr> <tr><td>⚪</td><td><details><summary>Codebase context is not defined </summary> Follow the <a href='https://qodo-merge-docs.qodo.ai/core-abilities/rag_context_enrichment/'>guide</a> to enable codebase context checks. </details></td></tr> <tr><td colspan='2'>Custom Compliance</td></tr> <tr><td rowspan=2>🟢</td><td> <details><summary>Generic: Meaningful Naming and Self-Documenting Code</summary> **Objective:** Ensure all identifiers clearly express their purpose and intent, making code self-documenting **Status:** Passed > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td> <details><summary>Generic: Secure Error Handling</summary> **Objective:** To prevent the leakage of sensitive system information through error messages while providing sufficient detail for internal debugging. **Status:** Passed > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td rowspan=2>🔴</td> <td><details> <summary>Generic: Robust Error Handling and Edge Case Management</summary> **Objective:** Ensure comprehensive error handling that provides meaningful context and graceful degradation **Status:** <a href='https://github.com/carverauto/serviceradar/pull/2221/files#diff-5f0d59be34ef26b449d7f5fd2b198a29b277936b9708a699f7487415ed6c2785R517-R528'>Nil dereference risk</a>: <code>protoCheckToCheckerConfig</code> can return <code>nil</code> for invalid gateway-supplied checks, but <code>applyChecks</code> uses the returned <code>checkerConf</code> without a nil check which can cause a panic instead of graceful handling. <details open><summary>Referred Code</summary> ```go // Convert proto check to CheckerConfig checkerConf := protoCheckToCheckerConfig(check) // Check if this config already exists and is unchanged if existing, exists := p.server.checkerConfs[check.Name]; exists { if existing.Type == checkerConf.Type && existing.Address == checkerConf.Address && existing.Timeout == checkerConf.Timeout { // Config unchanged, skip continue } } ``` </details> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td><details> <summary>Generic: Security-First Input Validation and Data Handling</summary> **Objective:** Ensure all data inputs are validated, sanitized, and handled securely to prevent vulnerabilities **Status:** <a href='https://github.com/carverauto/serviceradar/pull/2221/files#diff-5f0d59be34ef26b449d7f5fd2b198a29b277936b9708a699f7487415ed6c2785R505-R539'>Unhandled invalid input</a>: Gateway-supplied check data is partially validated in <code>protoCheckToCheckerConfig</code>, but invalid checks can produce <code>nil</code> configs and are not safely handled in <code>applyChecks</code>, enabling a crash via untrusted external input. <details open><summary>Referred Code</summary> ```go for _, check := range checks { // Guard against nil entries in the checks slice if check == nil { continue } if !check.Enabled { continue } seenChecks[check.Name] = true // Convert proto check to CheckerConfig checkerConf := protoCheckToCheckerConfig(check) // Check if this config already exists and is unchanged if existing, exists := p.server.checkerConfs[check.Name]; exists { if existing.Type == checkerConf.Type && existing.Address == checkerConf.Address && existing.Timeout == checkerConf.Timeout { // Config unchanged, skip ... (clipped 14 lines) ``` </details> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td rowspan=2>⚪</td> <tr><td align="center" colspan="2">  </td></tr></tbody></table> </details> <details> <summary>Compliance check up to commit <a href='https://github.com/carverauto/serviceradar/commit/6aa63f620203712994d6db2b85a1185f1ffb4af8'>6aa63f6</a></summary> <table><tbody><tr><td colspan='2'>Security Compliance</td></tr> <tr><td>🟢</td><td><details><summary>No security concerns identified</summary> No security vulnerabilities detected by AI analysis. Human verification advised for critical code. </details></td></tr> <tr><td colspan='2'>Ticket Compliance</td></tr> <tr><td>⚪</td><td><details><summary>🎫 No ticket provided </summary> - [ ] Create ticket/issue  </details></td></tr> <tr><td colspan='2'>Codebase Duplication Compliance</td></tr> <tr><td>⚪</td><td><details><summary>Codebase context is not defined </summary> Follow the <a href='https://qodo-merge-docs.qodo.ai/core-abilities/rag_context_enrichment/'>guide</a> to enable codebase context checks. </details></td></tr> <tr><td colspan='2'>Custom Compliance</td></tr> <tr><td rowspan=1>🟢</td><td> <details><summary>Generic: Meaningful Naming and Self-Documenting Code</summary> **Objective:** Ensure all identifiers clearly express their purpose and intent, making code self-documenting **Status:** Passed > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td rowspan=1>🔴</td> <td><details> <summary>Generic: Secure Logging Practices</summary> **Objective:** To ensure logs are useful for debugging and auditing without exposing sensitive information like PII, PHI, or cardholder data. **Status:** <a href='https://github.com/carverauto/serviceradar/pull/2221/files#diff-54aa63b6f05aabb0db096685e5be714c0c6d5c5a892310f048ff2ff8a542191fR339-R347'>Sensitive identifiers logged</a>: Logs include potentially sensitive tenant identifiers and infrastructure details (e.g., <code>tenant_id</code>, <code>tenant_slug</code>) which violates the requirement that no sensitive data be present in logs. <details open><summary>Referred Code</summary> ```go g.logger.Info(). Str("agent_id", resp.AgentId). Str("gateway_id", resp.GatewayId). Str("tenant_id", resp.TenantId). Str("tenant_slug", resp.TenantSlug). Int32("heartbeat_interval_sec", resp.HeartbeatIntervalSec). Bool("config_outdated", resp.ConfigOutdated). Msg("Agent enrolled with gateway") ``` </details> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td rowspan=4>⚪</td> <td><details> <summary>Generic: Comprehensive Audit Trails</summary> **Objective:** To create a detailed and reliable record of critical system actions for security analysis and compliance. **Status:** <a href='https://github.com/carverauto/serviceradar/pull/2221/files#diff-5f0d59be34ef26b449d7f5fd2b198a29b277936b9708a699f7487415ed6c2785R336-R454'>Missing actor context</a>: Enrollment and config-application actions are logged but without a consistent "actor" identifier (e.g., user/admin) beyond <code>agent_id</code>, and other critical actions in the PR (not shown in this diff) may lack audit logging. <details open><summary>Referred Code</summary> ```go p.logger.Info().Msg("Enrolling with gateway...") // Build Hello request helloReq := &proto.AgentHelloRequest{ AgentId: p.server.config.AgentID, Version: Version, // Agent version from version.go Capabilities: getAgentCapabilities(), ConfigVersion: p.configVersion, } // Send Hello helloResp, err := p.gateway.Hello(ctx, helloReq) if err != nil { p.logger.Error().Err(err).Msg("Failed to enroll with gateway") return } // Update server config with tenant info from gateway p.server.config.TenantID = helloResp.TenantId p.server.config.TenantSlug = helloResp.TenantSlug ... (clipped 98 lines) ``` </details> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td><details> <summary>Generic: Robust Error Handling and Edge Case Management</summary> **Objective:** Ensure comprehensive error handling that provides meaningful context and graceful degradation **Status:** <a href='https://github.com/carverauto/serviceradar/pull/2221/files#diff-5f0d59be34ef26b449d7f5fd2b198a29b277936b9708a699f7487415ed6c2785R456-R503'>Nil map risk</a>: <code>applyChecks</code> writes to <code>p.server.checkerConfs</code> without ensuring the map is initialized, which can panic if <code>checkerConfs</code> is nil depending on server construction not visible in this diff. <details open><summary>Referred Code</summary> ```go // applyChecks converts proto checks to checker configs and updates the server. func (p *PushLoop) applyChecks(checks []*proto.AgentCheckConfig) { if len(checks) == 0 { p.logger.Debug().Msg("No checks to apply") return } p.server.mu.Lock() defer p.server.mu.Unlock() // Track which checks we've seen (for removing stale checks later) seenChecks := make(map[string]bool) for _, check := range checks { // Guard against nil entries in the checks slice if check == nil { continue } if !check.Enabled { continue ... (clipped 27 lines) ``` </details> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td><details> <summary>Generic: Secure Error Handling</summary> **Objective:** To prevent the leakage of sensitive system information through error messages while providing sufficient detail for internal debugging. **Status:** <a href='https://github.com/carverauto/serviceradar/pull/2221/files#diff-54aa63b6f05aabb0db096685e5be714c0c6d5c5a892310f048ff2ff8a542191fR79-R348'>Potentially detailed errors</a>: Connection and enrollment failures are returned with detailed internal context (e.g., gateway address and gateway-provided rejection message) which may be user-visible in CLI output and should be reviewed for sensitive detail exposure. <details open><summary>Referred Code</summary> ```go // Connect establishes a connection to the gateway. func (g *GatewayClient) Connect(ctx context.Context) error { g.mu.Lock() defer g.mu.Unlock() if g.addr == "" { return ErrGatewayAddrRequired } if g.connected && g.conn != nil { return nil // Already connected } g.logger.Info().Str("addr", g.addr).Msg("Connecting to agent-gateway") opts, err := g.buildDialOptions(ctx) if err != nil { return fmt.Errorf("failed to build dial options: %w", err) } connectCtx, cancel := context.WithTimeout(ctx, defaultConnectTimeout) ... (clipped 249 lines) ``` </details> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td><details> <summary>Generic: Security-First Input Validation and Data Handling</summary> **Objective:** Ensure all data inputs are validated, sanitized, and handled securely to prevent vulnerabilities **Status:** <a href='https://github.com/carverauto/serviceradar/pull/2221/files#diff-61358711e980ccf505246fd3915f97cbd3a380e9b66f6fa5aad46749968c5ca3R97-R187'>Unvalidated config inputs</a>: JSON config fields loaded from disk (including network endpoints and timing values) are used without explicit validation/sanitization, and correctness/security depends on constraints that are not shown in this diff. <details open><summary>Referred Code</summary> ```go // loadConfig loads agent configuration from file, falling back to embedded defaults. func loadConfig(configPath string) (*agent.ServerConfig, error) { var cfg agent.ServerConfig // Try to read config file data, err := os.ReadFile(configPath) if err != nil { if os.IsNotExist(err) { // Fall back to embedded default config data = defaultConfig } else { return nil, fmt.Errorf("failed to read config file: %w", err) } } if err := json.Unmarshal(data, &cfg); err != nil { return nil, fmt.Errorf("failed to parse config: %w", err) } return &cfg, nil } ... (clipped 70 lines) ``` </details> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td align="center" colspan="2">  </td></tr></tbody></table> </details> <details> <summary>Compliance check up to commit <a href='https://github.com/carverauto/serviceradar/commit/4d3d82c8155f53aa56358b2d03f7d12efcf49dd5'>4d3d82c</a></summary> <table><tbody><tr><td colspan='2'>Security Compliance</td></tr> <tr><td rowspan=2>⚪</td> <td><details><summary>Insecure gRPC transport </summary> Description: The client explicitly allows falling back to <code>grpc.WithTransportCredentials(insecure.NewCredentials())</code> when <code>security</code> is nil/disabled, which would permit plaintext gRPC and enable MITM/credential interception if this path is reachable in non-development deployments. <a href='https://github.com/carverauto/serviceradar/pull/2221/files#diff-54aa63b6f05aabb0db096685e5be714c0c6d5c5a892310f048ff2ff8a542191fR108-R135'>gateway_client.go [108-135]</a> <details open><summary>Referred Code</summary> ```go // buildDialOptions constructs gRPC dial options based on security configuration. func (g *GatewayClient) buildDialOptions(ctx context.Context) ([]grpc.DialOption, error) { var opts []grpc.DialOption if g.security != nil && g.security.Mode != "" && g.security.Mode != models.SecurityModeNone { // Create security provider using the standard pattern provider, err := srgrpc.NewSecurityProvider(ctx, g.security, g.logger) if err != nil { return nil, fmt.Errorf("failed to create security provider: %w", err) } // Get client credentials from the provider creds, err := provider.GetClientCredentials(ctx) if err != nil { _ = provider.Close() return nil, fmt.Errorf("failed to get client credentials: %w", err) } g.securityProvider = provider opts = append(opts, creds) } else { ... (clipped 7 lines) ``` </details></details></td></tr> <tr><td><details><summary>Unexpected external egress </summary> Description: <code>getSourceIP()</code> performs an outbound UDP dial to the public address <code>8.8.8.8:80</code>, creating unexpected external egress that can leak environment/network metadata and may violate network/security policies in restricted deployments. <a href='https://github.com/carverauto/serviceradar/pull/2221/files#diff-5f0d59be34ef26b449d7f5fd2b198a29b277936b9708a699f7487415ed6c2785R233-R250'>push_loop.go [233-250]</a> <details open><summary>Referred Code</summary> ```go // getSourceIP attempts to determine the source IP of this agent. func (p *PushLoop) getSourceIP() string { // First check if HostIP is configured if p.server.config.HostIP != "" { return p.server.config.HostIP } // Try to get the outbound IP conn, err := net.Dial("udp", "8.8.8.8:80") if err != nil { p.logger.Debug().Err(err).Msg("Failed to determine source IP") return "" } defer conn.Close() localAddr := conn.LocalAddr().(*net.UDPAddr) return localAddr.IP.String() } ``` </details></details></td></tr> <tr><td colspan='2'>Ticket Compliance</td></tr> <tr><td>⚪</td><td><details><summary>🎫 No ticket provided </summary> - [ ] Create ticket/issue  </details></td></tr> <tr><td colspan='2'>Codebase Duplication Compliance</td></tr> <tr><td>⚪</td><td><details><summary>Codebase context is not defined </summary> Follow the <a href='https://qodo-merge-docs.qodo.ai/core-abilities/rag_context_enrichment/'>guide</a> to enable codebase context checks. </details></td></tr> <tr><td colspan='2'>Custom Compliance</td></tr> <tr><td rowspan=2>🟢</td><td> <details><summary>Generic: Meaningful Naming and Self-Documenting Code</summary> **Objective:** Ensure all identifiers clearly express their purpose and intent, making code self-documenting **Status:** Passed > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td> <details><summary>Generic: Secure Error Handling</summary> **Objective:** To prevent the leakage of sensitive system information through error messages while providing sufficient detail for internal debugging. **Status:** Passed > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td rowspan=1>🔴</td> <td><details> <summary>Generic: Robust Error Handling and Edge Case Management</summary> **Objective:** Ensure comprehensive error handling that provides meaningful context and graceful degradation **Status:** <a href='https://github.com/carverauto/serviceradar/pull/2221/files#diff-5f0d59be34ef26b449d7f5fd2b198a29b277936b9708a699f7487415ed6c2785R385-R389'>Nil check panic</a>: <code>applyChecks</code> does not guard against nil entries in the <code>checks</code> slice and will panic when dereferencing <code>check.Enabled</code>. <details open><summary>Referred Code</summary> ```go for _, check := range checks { if !check.Enabled { continue } ``` </details> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td rowspan=3>⚪</td> <td><details> <summary>Generic: Comprehensive Audit Trails</summary> **Objective:** To create a detailed and reliable record of critical system actions for security analysis and compliance. **Status:** <a href='https://github.com/carverauto/serviceradar/pull/2221/files#diff-5f0d59be34ef26b449d7f5fd2b198a29b277936b9708a699f7487415ed6c2785R64-R157'>Audit scope unclear</a>: The new agent enrollment/config/status actions are logged, but it is not verifiable from the diff alone that these logs meet your audit requirements (e.g., consistent actor identity, timestamps, and outcome recording across all critical actions). <details open><summary>Referred Code</summary> ```go // Start begins the push loop. It runs until the context is cancelled. func (p *PushLoop) Start(ctx context.Context) error { p.logger.Info().Dur("interval", p.interval).Msg("Starting push loop") // Initial connection and enrollment attempt if err := p.gateway.Connect(ctx); err != nil { p.logger.Warn().Err(err).Msg("Initial gateway connection failed, will retry") } else { // Connected, try to enroll p.enroll(ctx) } // Start config polling in a separate goroutine go p.configPollLoop(ctx) ticker := time.NewTicker(p.interval) defer ticker.Stop() // Do an initial push immediately (only if enrolled) if p.enrolled { p.pushStatus(ctx) ... (clipped 73 lines) ``` </details> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td><details> <summary>Generic: Secure Logging Practices</summary> **Objective:** To ensure logs are useful for debugging and auditing without exposing sensitive information like PII, PHI, or cardholder data. **Status:** <a href='https://github.com/carverauto/serviceradar/pull/2221/files#diff-54aa63b6f05aabb0db096685e5be714c0c6d5c5a892310f048ff2ff8a542191fR298-R305'>Identifier logging</a>: The new logs include tenant identifiers (e.g., <code>tenant_id</code> and <code>tenant_slug</code>), and whether these are classified as sensitive/PII cannot be confirmed from the diff alone. <details open><summary>Referred Code</summary> ```go g.logger.Info(). Str("agent_id", resp.AgentId). Str("gateway_id", resp.GatewayId). Str("tenant_id", resp.TenantId). Str("tenant_slug", resp.TenantSlug). Int32("heartbeat_interval_sec", resp.HeartbeatIntervalSec). Bool("config_outdated", resp.ConfigOutdated). Msg("Agent enrolled with gateway") ``` </details> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td><details> <summary>Generic: Security-First Input Validation and Data Handling</summary> **Objective:** Ensure all data inputs are validated, sanitized, and handled securely to prevent vulnerabilities **Status:** <a href='https://github.com/carverauto/serviceradar/pull/2221/files#diff-54aa63b6f05aabb0db096685e5be714c0c6d5c5a892310f048ff2ff8a542191fR112-R133'>Insecure transport option</a>: The gateway client explicitly allows falling back to <code>insecure.NewCredentials()</code> when security mode is none, which may violate expected security posture depending on deployment requirements. <details open><summary>Referred Code</summary> ```go if g.security != nil && g.security.Mode != "" && g.security.Mode != models.SecurityModeNone { // Create security provider using the standard pattern provider, err := srgrpc.NewSecurityProvider(ctx, g.security, g.logger) if err != nil { return nil, fmt.Errorf("failed to create security provider: %w", err) } // Get client credentials from the provider creds, err := provider.GetClientCredentials(ctx) if err != nil { _ = provider.Close() return nil, fmt.Errorf("failed to get client credentials: %w", err) } g.securityProvider = provider opts = append(opts, creds) } else { // Use insecure connection (for development/testing) g.logger.Warn().Msg("Using insecure connection to gateway (no mTLS)") opts = append(opts, grpc.WithTransportCredentials(insecure.NewCredentials())) } ... (clipped 1 lines) ``` </details> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td align="center" colspan="2">  </td></tr></tbody></table> </details> <details> <summary>Compliance check up to commit <a href='https://github.com/carverauto/serviceradar/commit/6a52314c098cfc8c95be0596bae269d74bcb49ed'>6a52314</a></summary> <table><tbody><tr><td colspan='2'>Security Compliance</td></tr> <tr><td rowspan=2>⚪</td> <td><details><summary>Credential secret exposure </summary> Description: <code>GenerateUserCredentials</code> returns <code>CredsFileContent</code> containing the raw user NKey seed (via <code>formatCredsFile(userJWT, userSeed)</code>), which is highly sensitive and could be exfiltrated if this value is ever logged, stored, or returned over an insufficiently protected channel. <a href='https://github.com/carverauto/serviceradar/pull/2221/files#diff-84b756d1bdd4d008851fb5859e56f79f44c8083c766ee0314d7862484ce8c12bR54-R120'>user_manager.go [54-120]</a> <details open><summary>Referred Code</summary> ```go // GenerateUserCredentials creates NATS credentials for a user in a tenant's account. // The accountSeed is the tenant's account private key (passed from Elixir storage). func GenerateUserCredentials( tenantSlug string, accountSeed string, userName string, credType UserCredentialType, permissions *UserPermissions, expirationSeconds int64, ) (*UserCredentials, error) { // Parse account seed to get key pair for signing accountKP, err := nkeys.FromSeed([]byte(accountSeed)) if err != nil { return nil, fmt.Errorf("invalid account seed: %w", err) } accountPublicKey, err := accountKP.PublicKey() if err != nil { return nil, fmt.Errorf("failed to get account public key: %w", err) } ... (clipped 46 lines) ``` </details></details></td></tr> <tr><td><details><summary>Tenant isolation bypass </summary> Description: Default publish/subscribe permissions include non-tenant-scoped subjects (e.g., <code>events.></code>, <code>logs.></code>, <code>telemetry.></code>) and also build allow-subjects from unvalidated <code>tenantSlug</code> (e.g., <code>fmt.Sprintf("%s.>", tenantSlug)</code>), which could enable cross-tenant access if subject mapping is misconfigured or if <code>tenantSlug</code> contains wildcard/special tokens. <a href='https://github.com/carverauto/serviceradar/pull/2221/files#diff-84b756d1bdd4d008851fb5859e56f79f44c8083c766ee0314d7862484ce8c12bR123-R198'>user_manager.go [123-198]</a> <details open><summary>Referred Code</summary> ```go // applyUserPermissions sets permissions on user claims based on credential type. func applyUserPermissions( claims *jwt.UserClaims, tenantSlug string, credType UserCredentialType, custom *UserPermissions, ) { // Start with type-based defaults switch credType { case CredentialTypeCollector: // Collectors can publish to their tenant's subjects (mapped by account) // and subscribe to responses claims.Pub.Allow.Add("events.>") claims.Pub.Allow.Add("snmp.traps") claims.Pub.Allow.Add("logs.>") claims.Pub.Allow.Add("telemetry.>") claims.Pub.Allow.Add("netflow.>") claims.Sub.Allow.Add("_INBOX.>") claims.Resp = &jwt.ResponsePermission{ MaxMsgs: 1, Expires: time.Minute, ... (clipped 55 lines) ``` </details></details></td></tr> <tr><td colspan='2'>Ticket Compliance</td></tr> <tr><td>⚪</td><td><details><summary>🎫 No ticket provided </summary> - [ ] Create ticket/issue  </details></td></tr> <tr><td colspan='2'>Codebase Duplication Compliance</td></tr> <tr><td>⚪</td><td><details><summary>Codebase context is not defined </summary> Follow the <a href='https://qodo-merge-docs.qodo.ai/core-abilities/rag_context_enrichment/'>guide</a> to enable codebase context checks. </details></td></tr> <tr><td colspan='2'>Custom Compliance</td></tr> <tr><td rowspan=1>🟢</td><td> <details><summary>Generic: Meaningful Naming and Self-Documenting Code</summary> **Objective:** Ensure all identifiers clearly express their purpose and intent, making code self-documenting **Status:** Passed > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td rowspan=5>⚪</td> <td><details> <summary>Generic: Comprehensive Audit Trails</summary> **Objective:** To create a detailed and reliable record of critical system actions for security analysis and compliance. **Status:** <a href='https://github.com/carverauto/serviceradar/pull/2221/files#diff-49eb93c28e2d86d8dcf4ee78fd24bed498cf3fcfaa8e49849a36e70980420087R238-R1087'>Audit logging unknown</a>: The diff introduces APIs for operator/account bootstrap and credential generation, but no corresponding audit logging is visible in the provided hunks, so it cannot be verified that these critical actions are reliably logged with actor context and outcomes. <details open><summary>Referred Code</summary> ```go // CreateTenantAccountRequest is the request to create a new tenant NATS account. type CreateTenantAccountRequest struct { state protoimpl.MessageState `protogen:"open.v1"` TenantSlug string `protobuf:"bytes,1,opt,name=tenant_slug,json=tenantSlug,proto3" json:"tenant_slug,omitempty"` // Unique tenant identifier (e.g., "acme-corp") Limits *AccountLimits `protobuf:"bytes,2,opt,name=limits,proto3" json:"limits,omitempty"` // Optional resource limits SubjectMappings []*SubjectMapping `protobuf:"bytes,3,rep,name=subject_mappings,json=subjectMappings,proto3" json:"subject_mappings,omitempty"` // Optional custom subject mappings unknownFields protoimpl.UnknownFields sizeCache protoimpl.SizeCache } func (x *CreateTenantAccountRequest) Reset() { *x = CreateTenantAccountRequest{} mi := &file_nats_account_proto_msgTypes[2] ms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x)) ms.StoreMessageInfo(mi) } func (x *CreateTenantAccountRequest) String() string { return protoimpl.X.MessageStringOf(x) } ... (clipped 829 lines) ``` </details> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td><details> <summary>Generic: Robust Error Handling and Edge Case Management</summary> **Objective:** Ensure comprehensive error handling that provides meaningful context and graceful degradation **Status:** <a href='https://github.com/carverauto/serviceradar/pull/2221/files#diff-49eb93c28e2d86d8dcf4ee78fd24bed498cf3fcfaa8e49849a36e70980420087R238-R1184'>Error handling unseen</a>: The new gRPC request/response surface for account and credential operations is added, but the diff does not include the service implementation where validation and error handling would occur, so robustness against edge cases cannot be confirmed. <details open><summary>Referred Code</summary> ```go // CreateTenantAccountRequest is the request to create a new tenant NATS account. type CreateTenantAccountRequest struct { state protoimpl.MessageState `protogen:"open.v1"` TenantSlug string `protobuf:"bytes,1,opt,name=tenant_slug,json=tenantSlug,proto3" json:"tenant_slug,omitempty"` // Unique tenant identifier (e.g., "acme-corp") Limits *AccountLimits `protobuf:"bytes,2,opt,name=limits,proto3" json:"limits,omitempty"` // Optional resource limits SubjectMappings []*SubjectMapping `protobuf:"bytes,3,rep,name=subject_mappings,json=subjectMappings,proto3" json:"subject_mappings,omitempty"` // Optional custom subject mappings unknownFields protoimpl.UnknownFields sizeCache protoimpl.SizeCache } func (x *CreateTenantAccountRequest) Reset() { *x = CreateTenantAccountRequest{} mi := &file_nats_account_proto_msgTypes[2] ms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x)) ms.StoreMessageInfo(mi) } func (x *CreateTenantAccountRequest) String() string { return protoimpl.X.MessageStringOf(x) } ... (clipped 926 lines) ``` </details> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td><details> <summary>Generic: Secure Error Handling</summary> **Objective:** To prevent the leakage of sensitive system information through error messages while providing sufficient detail for internal debugging. **Status:** <a href='https://github.com/carverauto/serviceradar/pull/2221/files#diff-49eb93c28e2d86d8dcf4ee78fd24bed498cf3fcfaa8e49849a36e70980420087R983-R1087'>Error exposure risk</a>: Responses like <code>PushAccountJWTResponse.message</code> appear able to carry error details, but without the server-side implementation in the visible diff it cannot be verified that user-facing errors are sanitized and sensitive internals are only logged securely. <details open><summary>Referred Code</summary> ```go // PushAccountJWTRequest is the request to push an account JWT to the NATS resolver. type PushAccountJWTRequest struct { state protoimpl.MessageState `protogen:"open.v1"` AccountPublicKey string `protobuf:"bytes,1,opt,name=account_public_key,json=accountPublicKey,proto3" json:"account_public_key,omitempty"` // The account's public NKey (starts with 'A') AccountJwt string `protobuf:"bytes,2,opt,name=account_jwt,json=accountJwt,proto3" json:"account_jwt,omitempty"` // The signed account JWT to push unknownFields protoimpl.UnknownFields sizeCache protoimpl.SizeCache } func (x *PushAccountJWTRequest) Reset() { *x = PushAccountJWTRequest{} mi := &file_nats_account_proto_msgTypes[13] ms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x)) ms.StoreMessageInfo(mi) } func (x *PushAccountJWTRequest) String() string { return protoimpl.X.MessageStringOf(x) } func (*PushAccountJWTRequest) ProtoMessage() {} ... (clipped 84 lines) ``` </details> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td><details> <summary>Generic: Secure Logging Practices</summary> **Objective:** To ensure logs are useful for debugging and auditing without exposing sensitive information like PII, PHI, or cardholder data. **Status:** <a href='https://github.com/carverauto/serviceradar/pull/2221/files#diff-0c66da94363b25ecb9e4766c8f12cde387c1874aeb9099447c609c4e867b4fe0R194-R240'>Secrets may be logged</a>: The added models include highly sensitive fields (e.g., <code>CollectorDownloadResult.nats_creds_file</code> and <code>mtls_bundle</code>) and the diff does not show safeguards preventing these values from being logged, so secure logging compliance cannot be verified. <details open><summary>Referred Code</summary> ```go // CollectorPackage represents a collector deployment package with NATS credentials. type CollectorPackage struct { PackageID string `json:"package_id"` TenantID string `json:"tenant_id"` CollectorType CollectorType `json:"collector_type"` UserName string `json:"user_name"` Site string `json:"site,omitempty"` Hostname string `json:"hostname,omitempty"` Status CollectorPackageStatus `json:"status"` NatsCredentialID string `json:"nats_credential_id,omitempty"` DownloadTokenHash string `json:"download_token_hash,omitempty"` DownloadTokenExpiresAt time.Time `json:"download_token_expires_at,omitempty"` DownloadedAt *time.Time `json:"downloaded_at,omitempty"` DownloadedByIP string `json:"downloaded_by_ip,omitempty"` InstalledAt *time.Time `json:"installed_at,omitempty"` RevokedAt *time.Time `json:"revoked_at,omitempty"` RevokeReason string `json:"revoke_reason,omitempty"` ErrorMessage string `json:"error_message,omitempty"` ConfigOverrides map[string]interface{} `json:"config_overrides,omitempty"` CreatedAt time.Time `json:"created_at"` UpdatedAt time.Time `json:"updated_at"` ... (clipped 26 lines) ``` </details> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td><details> <summary>Generic: Security-First Input Validation and Data Handling</summary> **Objective:** Ensure all data inputs are validated, sanitized, and handled securely to prevent vulnerabilities **Status:** <a href='https://github.com/carverauto/serviceradar/pull/2221/files#diff-0c66da94363b25ecb9e4766c8f12cde387c1874aeb9099447c609c4e867b4fe0R194-R240'>Sensitive data handling</a>: New data structures introduce sensitive credential material and token-related fields (e.g., <code>download_token_hash</code>, <code>.creds</code> content, and <code>mtls_bundle</code>), but the diff does not include the creation/transport/storage code needed to verify encryption, access controls, and validation of externally supplied identifiers. <details open><summary>Referred Code</summary> ```go // CollectorPackage represents a collector deployment package with NATS credentials. type CollectorPackage struct { PackageID string `json:"package_id"` TenantID string `json:"tenant_id"` CollectorType CollectorType `json:"collector_type"` UserName string `json:"user_name"` Site string `json:"site,omitempty"` Hostname string `json:"hostname,omitempty"` Status CollectorPackageStatus `json:"status"` NatsCredentialID string `json:"nats_credential_id,omitempty"` DownloadTokenHash string `json:"download_token_hash,omitempty"` DownloadTokenExpiresAt time.Time `json:"download_token_expires_at,omitempty"` DownloadedAt *time.Time `json:"downloaded_at,omitempty"` DownloadedByIP string `json:"downloaded_by_ip,omitempty"` InstalledAt *time.Time `json:"installed_at,omitempty"` RevokedAt *time.Time `json:"revoked_at,omitempty"` RevokeReason string `json:"revoke_reason,omitempty"` ErrorMessage string `json:"error_message,omitempty"` ConfigOverrides map[string]interface{} `json:"config_overrides,omitempty"` CreatedAt time.Time `json:"created_at"` UpdatedAt time.Time `json:"updated_at"` ... (clipped 26 lines) ``` </details> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td align="center" colspan="2">  </td></tr></tbody></table> </details>

qodo-code-review[bot] commented

2026-01-03 09:50:14 +00:00

(Migrated from github.com)

Author

Owner

Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3706934588
Original created: 2026-01-03T09:50:14Z

PR Code Suggestions ✨

Latest suggestions up to 5c8ccf6

Category	Suggestion	Impact
Possible issue	Fix remaining state transitions Update `transition_state` calls in several `update` actions to use the correct transition name instead of the target state atom to prevent runtime errors. elixir/serviceradar_core/lib/serviceradar/infrastructure/gateway.ex [254-363] update :heartbeat_timeout do description "Mark gateway as degraded due to heartbeat timeout" require_atomic? false - change transition_state(:degraded) + change transition_state(:heartbeat_timeout) change set_attribute(:is_healthy, false) change set_attribute(:updated_at, &DateTime.utc_now/0) change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :gateway, new_state: :degraded} end update :lose_connection do description "Mark gateway as offline due to lost connection" require_atomic? false - change transition_state(:offline) + change transition_state(:lose_connection) change set_attribute(:is_healthy, false) change set_attribute(:updated_at, &DateTime.utc_now/0) change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :gateway, new_state: :offline} end update :restore_health do description "Restore gateway to healthy state" require_atomic? false - change transition_state(:healthy) + change transition_state(:restore_health) change set_attribute(:is_healthy, true) change set_attribute(:last_seen, &DateTime.utc_now/0) change set_attribute(:updated_at, &DateTime.utc_now/0) change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :gateway, new_state: :healthy} end update :end_maintenance do description "End maintenance mode, return to healthy" require_atomic? false - change transition_state(:healthy) + change transition_state(:end_maintenance) change set_attribute(:is_healthy, true) change set_attribute(:updated_at, &DateTime.utc_now/0) change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :gateway, new_state: :healthy} end update :finish_draining do description "Finish draining, go offline" require_atomic? false - change transition_state(:offline) + change transition_state(:finish_draining) change set_attribute(:is_healthy, false) change set_attribute(:updated_at, &DateTime.utc_now/0) change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :gateway, new_state: :offline} end update :mark_unhealthy do description "Mark gateway as unhealthy (legacy - use degrade)" require_atomic? false - change transition_state(:degraded) + change transition_state(:degrade) change set_attribute(:is_healthy, false) change set_attribute(:updated_at, &DateTime.utc_now/0) change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :gateway, new_state: :degraded} end `[To ensure code accuracy, apply this suggestion manually]` Suggestion importance[1-10]: 9 __ Why: The suggestion correctly identifies a bug where `transition_state` is called with the target state instead of the transition name, which would cause a runtime error.	High
	✅ ~~Enforce streaming chunk ordering~~ Suggestion Impact: The commit updated the stream reducer state to include expected_idx and pinned_total_chunks, added validation that total_chunks cannot change mid-stream, and enforced that each incoming chunk_index must match the expected sequential index, updating the reducer continuation/halt tuples accordingly. code diff: - {total_services, saw_final?, _stream_agent_id} = - Enum.reduce_while(request_stream, {0, false, nil}, fn chunk, {acc, _saw_final?, stream_agent_id} -> + {total_services, saw_final?, _stream_agent_id, _expected_idx, _pinned_total_chunks} = + Enum.reduce_while(request_stream, {0, false, nil, 0, nil}, fn chunk, + {acc, _saw_final?, stream_agent_id, expected_idx, + pinned_total_chunks} -> agent_id = case chunk.agent_id do nil -> @@ -343,8 +349,19 @@ raise GRPC.RPCError, status: :invalid_argument, message: "total_chunks must be > 0" end + pinned_total_chunks = + case pinned_total_chunks do + nil -> total_chunks + ^total_chunks -> total_chunks + _ -> raise GRPC.RPCError, status: :invalid_argument, message: "total_chunks changed mid-stream" + end + if chunk_index < 0 or chunk_index >= total_chunks do raise GRPC.RPCError, status: :invalid_argument, message: "invalid chunk_index" + end + + if chunk_index != expected_idx do + raise GRPC.RPCError, status: :invalid_argument, message: "unexpected chunk_index" end Logger.debug( @@ -396,9 +413,9 @@ end record_push_metrics(agent_id, new_total) - {:halt, {new_total, true, stream_agent_id}} + {:halt, {new_total, true, stream_agent_id, expected_idx + 1, pinned_total_chunks}} else - {:cont, {new_total, false, stream_agent_id}} + {:cont, {new_total, false, stream_agent_id, expected_idx + 1, pinned_total_chunks}} end end) Enforce chunk ordering and contiguity in `stream_status/2` by tracking the `expected_chunk_index` and pinning `total_chunks` to prevent partial data ingestion. elixir/serviceradar_agent_gateway/lib/serviceradar_agent_gateway/agent_gateway_server.ex [301-403] {total_services, saw_final?, _stream_agent_id} = - Enum.reduce_while(request_stream, {0, false, nil}, fn chunk, {acc, _saw_final?, stream_agent_id} -> + Enum.reduce_while(request_stream, {0, false, nil, 0, nil}, fn chunk, + {acc, _saw_final?, stream_agent_id, expected_idx, + pinned_total_chunks} -> ... chunk_index = chunk.chunk_index \|\| 0 total_chunks = chunk.total_chunks \|\| 0 if total_chunks <= 0 do raise GRPC.RPCError, status: :invalid_argument, message: "total_chunks must be > 0" end + pinned_total_chunks = + case pinned_total_chunks do + nil -> total_chunks + ^total_chunks -> total_chunks + _ -> raise GRPC.RPCError, status: :invalid_argument, message: "total_chunks changed mid-stream" + end + if chunk_index < 0 or chunk_index >= total_chunks do raise GRPC.RPCError, status: :invalid_argument, message: "invalid chunk_index" + end + + if chunk_index != expected_idx do + raise GRPC.RPCError, status: :invalid_argument, message: "unexpected chunk_index" end ... if chunk.is_final do if chunk_index != total_chunks - 1 do raise GRPC.RPCError, status: :invalid_argument, message: "final chunk_index does not match total_chunks" end record_push_metrics(agent_id, new_total) - {:halt, {new_total, true, stream_agent_id}} + {:halt, {new_total, true, stream_agent_id, expected_idx + 1, pinned_total_chunks}} else - {:cont, {new_total, false, stream_agent_id}} + {:cont, {new_total, false, stream_agent_id, expected_idx + 1, pinned_total_chunks}} end end) `[To ensure code accuracy, apply this suggestion manually]` Suggestion importance[1-10]: 8 __ Why: The suggestion correctly identifies a vulnerability where out-of-order or missing stream chunks are not detected, potentially leading to silent data loss.	Medium
	Unify version reporting source Propagate the build-time `Version` from the `main` package to the `agent` package to ensure correct version reporting during gateway enrollment. cmd/agent/main.go [42-45] `// Version is set at build time via ldflags // //nolint:gochecknoglobals // Required for build-time ldflags injection var Version = "dev" +... + +// Ensure the agent package reports the same build version during enrollment. +agent.Version = Version +` `[To ensure code accuracy, apply this suggestion manually]` Suggestion importance[1-10]: 7 __ Why: The suggestion correctly identifies that the build-time `Version` from the `main` package is not propagated to the `agent` package, causing incorrect version reporting during enrollment.	Medium
Incremental ^[*]	✅ ~~Skip malformed service entries~~ Suggestion Impact: The reduce function in push_status was changed to use function-head pattern matching for %Monitoring.GatewayServiceStatus{} and to ignore all other entries via an _other clause, matching the suggestion and preventing invalid entries from triggering exception-based control flow. code diff: @@ -252,25 +252,29 @@ processed_count = services \|> Enum.reject(&is_nil/1) - \|> Enum.reduce(0, fn service, acc -> - try do - process_service_status(service, metadata) - acc + 1 - rescue - e in GRPC.RPCError -> - Logger.warning( - "Dropping invalid service status from agent #{metadata.agent_id}: #{e.status} #{e.message}" - ) - - acc - - e -> - Logger.warning( - "Dropping service status from agent #{metadata.agent_id} due to error: #{Exception.message(e)}" - ) - - acc - end + \|> Enum.reduce(0, fn + %Monitoring.GatewayServiceStatus{} = service, acc -> + try do + process_service_status(service, metadata) + acc + 1 + rescue + e in GRPC.RPCError -> + Logger.warning( + "Dropping invalid service status from agent #{metadata.agent_id}: #{e.status} #{e.message}" + ) + + acc + + e -> + Logger.warning( + "Dropping service status from agent #{metadata.agent_id} due to error: #{Exception.message(e)}" + ) + + acc + end + + _other, acc -> + acc end) In `push_status`, use pattern matching on `%Monitoring.GatewayServiceStatus{}` within `Enum.reduce` to filter invalid service entries. This avoids using exceptions for control flow and mitigates a potential resource exhaustion attack. elixir/serviceradar_agent_gateway/lib/serviceradar_agent_gateway/agent_gateway_server.ex [252-278] processed_count = services \|> Enum.reject(&is_nil/1) - \|> Enum.reduce(0, fn service, acc -> - try do - process_service_status(service, metadata) - acc + 1 - rescue - e in GRPC.RPCError -> - Logger.warning( - "Dropping invalid service status from agent #{metadata.agent_id}: #{e.status} #{e.message}" - ) + \|> Enum.reduce(0, fn + %Monitoring.GatewayServiceStatus{} = service, acc -> + try do + process_service_status(service, metadata) + acc + 1 + rescue + e in GRPC.RPCError -> + Logger.warning( + "Dropping invalid service status from agent #{metadata.agent_id}: #{e.status} #{e.message}" + ) - acc + acc - e -> - Logger.warning( - "Dropping service status from agent #{metadata.agent_id} due to error: #{Exception.message(e)}" - ) + e -> + Logger.warning( + "Dropping service status from agent #{metadata.agent_id} due to error: #{Exception.message(e)}" + ) - acc - end + acc + end + + _other, acc -> + acc end) if processed_count == 0 and service_count > 0 do raise GRPC.RPCError, status: :invalid_argument, message: "no valid service statuses" end `[Suggestion processed]` Suggestion importance[1-10]: 8 __ Why: The suggestion correctly identifies a potential resource exhaustion vector and proposes a more robust and secure implementation using pattern matching. This avoids expensive exception handling for invalid data types in a loop, which is a good security and performance practice.	Medium
	✅ ~~Bound user-controlled string sizes~~ Suggestion Impact: The commit adds trimming/normalization of `service_type` and `source`, validates their maximum byte size and disallows control characters, and then uses these sanitized values in the constructed status map. code diff: @@ -422,6 +439,18 @@ _ -> "" end + service_type = + case service.service_type do + st when is_binary(st) -> String.trim(st) + _ -> "" + end + + source = + case service.source do + src when is_binary(src) -> String.trim(src) + _ -> "" + end + cond do service_name == "" -> raise GRPC.RPCError, status: :invalid_argument, message: "service_name is required" @@ -431,6 +460,12 @@ String.contains?(service_name, ["\n", "\r", "\t"]) -> raise GRPC.RPCError, status: :invalid_argument, message: "service_name contains invalid characters" + + byte_size(service_type) > 64 or String.contains?(service_type, ["\n", "\r", "\t"]) -> + raise GRPC.RPCError, status: :invalid_argument, message: "service_type is invalid" + + byte_size(source) > 64 or String.contains?(source, ["\n", "\r", "\t"]) -> + raise GRPC.RPCError, status: :invalid_argument, message: "source is invalid" true -> :ok @@ -458,16 +493,28 @@ end end) + response_time = + case service.response_time do + rt when is_integer(rt) and rt >= 0 and rt <= 86_400_000 -> + rt + + rt when is_integer(rt) and rt > 86_400_000 -> + 86_400_000 + + _ -> + 0 + end + status = %{ service_name: service_name, available: service.available == true, message: message, - service_type: service.service_type, - response_time: service.response_time, + service_type: service_type, + response_time: response_time, agent_id: metadata.agent_id, gateway_id: metadata.gateway_id, partition: normalize_partition(service.partition \|\| metadata.partition), - source: service.source, + source: source, kv_store_id: service.kv_store_id \|\| metadata.kv_store_id, In `process_service_status`, add size and character validation for the `service_type` and `source` fields. This prevents potential resource exhaustion and data ingestion issues from malicious or malformed client input. elixir/serviceradar_agent_gateway/lib/serviceradar_agent_gateway/agent_gateway_server.ex [418-423] service_name = case service.service_name do name when is_binary(name) -> String.trim(name) nil -> "" _ -> "" end +service_type = + case service.service_type do + st when is_binary(st) -> String.trim(st) + _ -> "" + end + +source = + case service.source do + src when is_binary(src) -> String.trim(src) + _ -> "" + end + +cond do + service_name == "" -> + raise GRPC.RPCError, status: :invalid_argument, message: "service_name is required" + + byte_size(service_name) > 255 -> + raise GRPC.RPCError, status: :invalid_argument, message: "service_name is too long" + + String.contains?(service_name, ["\n", "\r", "\t"]) -> + raise GRPC.RPCError, status: :invalid_argument, message: "service_name contains invalid characters" + + byte_size(service_type) > 64 or String.contains?(service_type, ["\n", "\r", "\t"]) -> + raise GRPC.RPCError, status: :invalid_argument, message: "service_type is invalid" + + byte_size(source) > 64 or String.contains?(source, ["\n", "\r", "\t"]) -> + raise GRPC.RPCError, status: :invalid_argument, message: "source is invalid" + + true -> + :ok +end + `[Suggestion processed]` Suggestion importance[1-10]: 8 __ Why: The suggestion correctly points out that other user-controlled string fields like `service_type` and `source` are not being validated. Adding size and character validation for these fields is a good security practice to prevent resource exhaustion and data ingestion issues.	Medium
	Prevent test goroutine leaks Prevent a test goroutine leak by using a cancellable context when starting the `service` and `defer`ring a call to `service.Stop()` to ensure cleanup. pkg/checker/snmp/service_test.go [162-201] initialDataChan := make(chan DataPoint) newDataChan := make(chan DataPoint) // Channel to signal when GetResults is called for the new target. getResultsCalled := make(chan struct{}) ... // Test adding target -err = service.Start(context.Background()) +startCtx, cancelStart := context.WithCancel(context.Background()) +defer cancelStart() +defer func() { require.NoError(t, service.Stop()) }() +defer close(initialDataChan) +defer close(newDataChan) + +err = service.Start(startCtx) require.NoError(t, err) `[To ensure code accuracy, apply this suggestion manually]` Suggestion importance[1-10]: 7 __ Why: The suggestion correctly identifies a goroutine leak in the test due to a non-cancellable context, which can cause test flakiness, and provides a correct fix to ensure proper cleanup.	Medium
	✅ ~~Handle RPC agent map shapes~~ Suggestion Impact: The reduce/caching logic was changed to fetch agent_id and other fields using both atom and string map keys, matching the suggestion to robustly handle RPC result shapes. code diff: - Map.put(acc, agent.agent_id, %{ - agent_id: agent.agent_id, - tenant_id: Map.get(agent, :tenant_id), - tenant_slug: Map.get(agent, :tenant_slug, "default"), - last_seen: Map.get(agent, :last_seen), - last_seen_mono: Map.get(agent, :last_seen_mono), - service_count: Map.get(agent, :service_count, 0), - partition: Map.get(agent, :partition), - source_ip: Map.get(agent, :source_ip) + agent_id = Map.get(agent, :agent_id) \|\| Map.get(agent, "agent_id") + + Map.put(acc, agent_id, %{ + agent_id: agent_id, + tenant_id: Map.get(agent, :tenant_id) \|\| Map.get(agent, "tenant_id"), + tenant_slug: Map.get(agent, :tenant_slug) \|\| Map.get(agent, "tenant_slug") \|\| "default", + last_seen: Map.get(agent, :last_seen) \|\| Map.get(agent, "last_seen"), + last_seen_mono: Map.get(agent, :last_seen_mono) \|\| Map.get(agent, "last_seen_mono"), + service_count: Map.get(agent, :service_count) \|\| Map.get(agent, "service_count") \|\| 0, + partition: Map.get(agent, :partition) \|\| Map.get(agent, "partition"), + source_ip: Map.get(agent, :source_ip) \|\| Map.get(agent, "source_ip") }) Modify the agent cache builder to handle both atom and string keys from RPC results. This prevents incorrect data processing due to serialization differences between nodes. web-ng/lib/serviceradar_web_ng_web/live/infrastructure_live/index.ex [739-749] -Map.put(acc, agent.agent_id, %{ - agent_id: agent.agent_id, - tenant_id: Map.get(agent, :tenant_id), - tenant_slug: Map.get(agent, :tenant_slug, "default"), - last_seen: Map.get(agent, :last_seen), - last_seen_mono: Map.get(agent, :last_seen_mono), - service_count: Map.get(agent, :service_count, 0), - partition: Map.get(agent, :partition), - source_ip: Map.get(agent, :source_ip) +agent_id = Map.get(agent, :agent_id) \|\| Map.get(agent, "agent_id") + +Map.put(acc, agent_id, %{ + agent_id: agent_id, + tenant_id: Map.get(agent, :tenant_id) \|\| Map.get(agent, "tenant_id"), + tenant_slug: Map.get(agent, :tenant_slug) \|\| Map.get(agent, "tenant_slug") \|\| "default", + last_seen: Map.get(agent, :last_seen) \|\| Map.get(agent, "last_seen"), + last_seen_mono: Map.get(agent, :last_seen_mono) \|\| Map.get(agent, "last_seen_mono"), + service_count: Map.get(agent, :service_count) \|\| Map.get(agent, "service_count") \|\| 0, + partition: Map.get(agent, :partition) \|\| Map.get(agent, "partition"), + source_ip: Map.get(agent, :source_ip) \|\| Map.get(agent, "source_ip") }) `[To ensure code accuracy, apply this suggestion manually]` Suggestion importance[1-10]: 7 __ Why: This is a valid suggestion that improves robustness by defensively handling both atom and string keys for data received via RPC, which can have inconsistent serialization. This prevents a potential bug where agents might be incorrectly processed.	Medium
	✅ ~~Fix embedded port detection~~ Suggestion Impact: The commit replaces the fallback url.Parse("tcp://"+target) port detection with logic that strips path/query and uses net.SplitHostPort, and adds the needed strings import. code diff: @@ -22,6 +22,7 @@ "fmt" "net" "net/url" + "strings" "sync" "time" @@ -685,9 +686,17 @@ hasEmbeddedPort = true } else if parsed, err := url.Parse(target); err == nil && parsed.Host != "" && parsed.Port() != "" { hasEmbeddedPort = true - } else if parsed, err := url.Parse("tcp://" + target); err == nil && parsed.Host != "" && parsed.Port() != "" { + } else { // Handles "host:port/path" inputs without an explicit scheme. - hasEmbeddedPort = true + hostPort := target + if i := strings.IndexAny(hostPort, "/?"); i >= 0 { + hostPort = hostPort[:i] + } + + // net.SplitHostPort requires bracketed IPv6; be explicit to avoid silently dropping checks. + if host, p, err := net.SplitHostPort(hostPort); err == nil && host != "" && p != "" { + hasEmbeddedPort = true + } Improve embedded port detection for check targets by using `net.SplitHostPort` on the host part of the target string, making it more resilient to various formats like unbracketed IPv6 addresses. pkg/agent/push_loop.go [688-691] -} else if parsed, err := url.Parse("tcp://" + target); err == nil && parsed.Host != "" && parsed.Port() != "" { +} else { // Handles "host:port/path" inputs without an explicit scheme. - hasEmbeddedPort = true + hostPort := target + if i := strings.IndexAny(hostPort, "/?"); i >= 0 { + hostPort = hostPort[:i] + } + + // net.SplitHostPort requires bracketed IPv6; be explicit to avoid silently dropping checks. + if host, p, err := net.SplitHostPort(hostPort); err == nil && host != "" && p != "" { + hasEmbeddedPort = true + } } `[To ensure code accuracy, apply this suggestion manually]` Suggestion importance[1-10]: 6 __ Why: The suggestion correctly identifies that the current port detection logic can fail for unbracketed IPv6 addresses, and proposes a more robust implementation using `net.SplitHostPort`.	Low
Security	✅ ~~Validate client-provided timing data~~ Suggestion Impact: The commit added response_time validation/normalization logic (clamping to 86_400_000 and defaulting to 0 for invalid values) and updated the status map to use the sanitized response_time instead of the raw client-provided value. code diff: `+ response_time = + case service.response_time do + rt when is_integer(rt) and rt >= 0 and rt <= 86_400_000 -> + rt + + rt when is_integer(rt) and rt > 86_400_000 -> + 86_400_000 + + _ -> + 0 + end + status = %{ service_name: service_name, available: service.available == true, message: message, - service_type: service.service_type, - response_time: service.response_time, + service_type: service_type, + response_time: response_time, agent_id: metadata.agent_id,` Validate and normalize the client-provided `response_time` to be a non-negative integer within a reasonable range before it is processed. elixir/serviceradar_agent_gateway/lib/serviceradar_agent_gateway/agent_gateway_server.ex [461-475] +response_time = + case service.response_time do + rt when is_integer(rt) and rt >= 0 and rt <= 86_400_000 -> + rt + + rt when is_integer(rt) and rt > 86_400_000 -> + 86_400_000 + + _ -> + 0 + end + status = %{ service_name: service_name, available: service.available == true, message: message, service_type: service.service_type, - response_time: service.response_time, + response_time: response_time, agent_id: metadata.agent_id, gateway_id: metadata.gateway_id, partition: normalize_partition(service.partition \|\| metadata.partition), source: service.source, kv_store_id: service.kv_store_id \|\| metadata.kv_store_id, timestamp: metadata.timestamp, tenant_id: metadata.tenant_id, tenant_slug: metadata.tenant_slug } `[Suggestion processed]` Suggestion importance[1-10]: 7 __ Why: The suggestion correctly points out that `response_time` from the client is not validated, which could lead to data corruption or processing errors downstream.	Medium
General	✅ ~~Preserve JSON parsing error details~~ Suggestion Impact: Updated the JSON parsing error handling to wrap and return the actual decode error (err) when decoding fails with non-EOF errors, improving debugging information. code diff: `if err := dec.Decode(&struct{}{}); err == nil { return nil, fmt.Errorf("failed to parse config: %w", errConfigTrailingData) } else if !errors.Is(err, io.EOF) { - return nil, fmt.Errorf("failed to parse config: %w", errConfigTrailingData) + return nil, fmt.Errorf("failed to parse config: %w", err)` Improve JSON parsing error handling by returning the specific underlying error instead of a generic one, which will aid in debugging configuration file issues. cmd/agent/main.go [129-133] `if err := dec.Decode(&struct{}{}); err == nil { return nil, fmt.Errorf("failed to parse config: %w", errConfigTrailingData) } else if !errors.Is(err, io.EOF) { - return nil, fmt.Errorf("failed to parse config: %w", errConfigTrailingData) + return nil, fmt.Errorf("failed to parse config: %w", err) }` `[Suggestion processed]` Suggestion importance[1-10]: 6 __ Why: The suggestion correctly points out that a generic error is returned for JSON parsing issues, which hides the specific cause and makes debugging more difficult.	Low
Update

Previous suggestions

✅ Suggestions up to commit c96fbf9

Category	Suggestion	Impact
Possible issue	Fix state machine transitions Fix the `transition_state` calls in the state machine update actions to use the correct transition names (e.g., `:degrade`) instead of the target state atoms (e.g., `:degraded`) to prevent runtime errors. elixir/serviceradar_core/lib/serviceradar/infrastructure/gateway.ex [243-351] update :degrade do description "Mark gateway as degraded (having issues)" argument :reason, :string require_atomic? false - change transition_state(:degraded) + change transition_state(:degrade) change set_attribute(:is_healthy, false) change set_attribute(:updated_at, &DateTime.utc_now/0) change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :gateway, new_state: :degraded} end update :go_offline do description "Mark gateway as offline" argument :reason, :string require_atomic? false - change transition_state(:offline) + change transition_state(:go_offline) change set_attribute(:is_healthy, false) change set_attribute(:updated_at, &DateTime.utc_now/0) change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :gateway, new_state: :offline} end update :recover do description "Start recovery process for degraded/offline gateway" require_atomic? false - change transition_state(:recovering) + change transition_state(:recover) change set_attribute(:updated_at, &DateTime.utc_now/0) change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :gateway, new_state: :recovering} end update :start_maintenance do description "Put gateway into maintenance mode" require_atomic? false - change transition_state(:maintenance) + change transition_state(:start_maintenance) change set_attribute(:updated_at, &DateTime.utc_now/0) change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :gateway, new_state: :maintenance} end update :start_draining do description "Start graceful shutdown (draining)" require_atomic? false - change transition_state(:draining) + change transition_state(:start_draining) change set_attribute(:updated_at, &DateTime.utc_now/0) change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :gateway, new_state: :draining} end update :deactivate do description "Deactivate a gateway (admin action)" require_atomic? false - change transition_state(:inactive) + change transition_state(:deactivate) change set_attribute(:is_healthy, false) change set_attribute(:updated_at, &DateTime.utc_now/0) change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :gateway, new_state: :inactive} end Suggestion importance[1-10]: 9 __ Why: This suggestion correctly identifies a critical bug where `transition_state` is called with the target state instead of the transition name, which would cause runtime crashes.	High
	Fix gRPC connection initialization Replace the manual gRPC connection logic using `grpc.NewClient` with the standard `grpc.DialContext` and `grpc.WithBlock` for a simpler and more robust implementation. pkg/agent/gateway_client.go [130-154] -conn, err := grpc.NewClient(g.addr, opts...) +opts = append(opts, grpc.WithBlock()) + +conn, err := grpc.DialContext(connectCtx, g.addr, opts...) if err != nil { if provider != nil { _ = provider.Close() } return fmt.Errorf("failed to connect to gateway at %s: %w", g.addr, err) } -conn.Connect() -for state := conn.GetState(); state != connectivity.Ready; state = conn.GetState() { - if state == connectivity.Shutdown { - _ = conn.Close() - if provider != nil { - _ = provider.Close() - } - return fmt.Errorf("failed to connect to gateway at %s: %w", g.addr, ErrConnectionShutdown) - } - if !conn.WaitForStateChange(connectCtx, state) { - _ = conn.Close() - if provider != nil { - _ = provider.Close() - } - return fmt.Errorf("failed to connect to gateway at %s: %w", g.addr, connectCtx.Err()) - } -} - Suggestion importance[1-10]: 7 __ Why: The suggestion correctly identifies that using `grpc.DialContext` with `grpc.WithBlock` is the idiomatic and more robust way to establish a gRPC connection, simplifying the code and improving maintainability.	Medium
	✅ ~~Prevent false agent activity~~ Suggestion Impact: The code now uses Map.get(agent, :last_seen) and Map.get(agent, :last_seen_mono) instead of agent.last_seen and System.monotonic_time(:millisecond), preventing false activity on initial cache load. code diff: - all_agents - \|> Enum.reduce(%{}, fn agent, acc -> - Map.put(acc, agent.agent_id, %{ - agent_id: agent.agent_id, - tenant_id: Map.get(agent, :tenant_id), - tenant_slug: Map.get(agent, :tenant_slug, "default"), - last_seen: agent.last_seen, - last_seen_mono: System.monotonic_time(:millisecond), - service_count: Map.get(agent, :service_count, 0), - partition: Map.get(agent, :partition), - source_ip: Map.get(agent, :source_ip) - }) - end) + all_agents + \|> Enum.reduce(%{}, fn agent, acc -> + Map.put(acc, agent.agent_id, %{ + agent_id: agent.agent_id, + tenant_id: Map.get(agent, :tenant_id), + tenant_slug: Map.get(agent, :tenant_slug, "default"), + last_seen: Map.get(agent, :last_seen), + last_seen_mono: Map.get(agent, :last_seen_mono), + service_count: Map.get(agent, :service_count, 0), + partition: Map.get(agent, :partition), + source_ip: Map.get(agent, :source_ip) + }) + end) In `load_initial_agents_cache`, set `last_seen_mono` from the agent data from the tracker instead of the current monotonic time to avoid incorrectly marking stale agents as active. web-ng/lib/serviceradar_web_ng_web/live/infrastructure_live/index.ex [738-750] all_agents \|> Enum.reduce(%{}, fn agent, acc -> Map.put(acc, agent.agent_id, %{ agent_id: agent.agent_id, tenant_id: Map.get(agent, :tenant_id), tenant_slug: Map.get(agent, :tenant_slug, "default"), - last_seen: agent.last_seen, - last_seen_mono: System.monotonic_time(:millisecond), + last_seen: Map.get(agent, :last_seen), + last_seen_mono: Map.get(agent, :last_seen_mono), service_count: Map.get(agent, :service_count, 0), partition: Map.get(agent, :partition), source_ip: Map.get(agent, :source_ip) }) end) Suggestion importance[1-10]: 7 __ Why: The suggestion correctly identifies a logic flaw where setting `last_seen_mono` to the current time on initial load can falsely mark stale agents as active, leading to incorrect UI state.	Medium
Incremental ^[*]	✅ ~~Accept host:port/path targets~~ Suggestion Impact: The commit adds an additional url.Parse("tcp://" + target) branch to detect an embedded port in host:port/path inputs without an explicit scheme, preventing such targets from being dropped. code diff: `@@ -684,6 +684,9 @@ if host, p, err := net.SplitHostPort(target); err == nil && host != "" && p != "" { hasEmbeddedPort = true } else if parsed, err := url.Parse(target); err == nil && parsed.Host != "" && parsed.Port() != "" { + hasEmbeddedPort = true + } else if parsed, err := url.Parse("tcp://" + target); err == nil && parsed.Host != "" && parsed.Port() != "" { + // Handles "host:port/path" inputs without an explicit scheme. hasEmbeddedPort = true }` Fix the embedded-port detection logic to correctly handle targets that include both a port and a path (e.g., `example.com:443/health`). This prevents valid TCP/gRPC checks from being silently dropped. pkg/agent/push_loop.go [681-692] if (checkerType == tcpCheckType \|\| checkerType == grpcCheckType) && port == 0 { // Allow "host:port" or URL targets that already include a port. hasEmbeddedPort := false if host, p, err := net.SplitHostPort(target); err == nil && host != "" && p != "" { hasEmbeddedPort = true } else if parsed, err := url.Parse(target); err == nil && parsed.Host != "" && parsed.Port() != "" { hasEmbeddedPort = true + } else if parsed, err := url.Parse("tcp://" + target); err == nil && parsed.Host != "" && parsed.Port() != "" { + // Handles "host:port/path" inputs without an explicit scheme. + hasEmbeddedPort = true } if !hasEmbeddedPort { return nil } } Suggestion importance[1-10]: 8 __ Why: This suggestion correctly identifies a bug where valid check targets with a port and path (e.g., `example.com:8080/health`) are silently ignored, leading to a failure in monitoring. The fix is accurate and resolves this issue.	Medium
	✅ ~~Prevent collectors from stopping early~~ Suggestion Impact: The code was changed to return an error ("service not started") when the service lifecycle context is nil, preventing collectors from being started with the request context and stopping prematurely. code diff: `@@ -322,7 +322,7 @@ serviceCtx := s.serviceCtx s.mu.RUnlock() if serviceCtx == nil { - serviceCtx = ctx + return errors.New("service not started") }` Prevent SNMP collectors from stopping prematurely by ensuring they are always started with the service's lifecycle context. If the service is not running, fail fast instead of using a potentially short-lived request context. pkg/checker/snmp/service.go [321-331] `s.mu.RLock() serviceCtx := s.serviceCtx s.mu.RUnlock() if serviceCtx == nil { - serviceCtx = ctx + return errors.New("service not started") } // Start collector with the service context so it stops on service shutdown. if err := collector.Start(serviceCtx); err != nil { return fmt.Errorf("%w: %s", errFailedToStartCollector, target.Name) }` Suggestion importance[1-10]: 8 __ Why: The suggestion correctly identifies a critical bug where an SNMP collector's lifecycle could be tied to a short-lived request context, causing it to terminate prematurely. Failing fast prevents this and ensures collectors run for their intended duration.	Medium
	✅ ~~Fail when all items drop~~ Suggestion Impact: The commit replaced Enum.each with Enum.reduce to track processed_count, raises a GRPC.RPCError when processed_count == 0 but service_count > 0, and updates record_push_metrics to use processed_count instead of service_count. code diff: + processed_count = + services + \|> Enum.reject(&is_nil/1) + \|> Enum.reduce(0, fn service, acc -> + try do + process_service_status(service, metadata) + acc + 1 + rescue + e in GRPC.RPCError -> + Logger.warning( + "Dropping invalid service status from agent #{metadata.agent_id}: #{e.status} #{e.message}" + ) + + acc + + e -> + Logger.warning( + "Dropping service status from agent #{metadata.agent_id} due to error: #{Exception.message(e)}" + ) + + acc + end + end) + + if processed_count == 0 and service_count > 0 do + raise GRPC.RPCError, status: :invalid_argument, message: "no valid service statuses" + end # Record metrics - record_push_metrics(agent_id, service_count) + record_push_metrics(agent_id, processed_count) Modify the `push_status` function to track the number of successfully processed service statuses. If all statuses are dropped due to errors, the gRPC request should fail instead of returning a success response. elixir/serviceradar_agent_gateway/lib/serviceradar_agent_gateway/agent_gateway_server.ex [252-273] -services -\|> Enum.reject(&is_nil/1) -\|> Enum.each(fn service -> - try do - process_service_status(service, metadata) - rescue - e in GRPC.RPCError -> - Logger.warning( - "Dropping invalid service status from agent #{metadata.agent_id}: #{e.status} #{e.message}" - ) +processed_count = + services + \|> Enum.reject(&is_nil/1) + \|> Enum.reduce(0, fn service, acc -> + try do + process_service_status(service, metadata) + acc + 1 + rescue + e in GRPC.RPCError -> + Logger.warning( + "Dropping invalid service status from agent #{metadata.agent_id}: #{e.status} #{e.message}" + ) - e -> - Logger.warning( - "Dropping service status from agent #{metadata.agent_id} due to error: #{Exception.message(e)}" - ) - end -end) + acc + + e -> + Logger.warning( + "Dropping service status from agent #{metadata.agent_id} due to error: #{Exception.message(e)}" + ) + + acc + end + end) + +if processed_count == 0 and service_count > 0 do + raise GRPC.RPCError, status: :invalid_argument, message: "no valid service statuses" +end # Record metrics -record_push_metrics(agent_id, service_count) +record_push_metrics(agent_id, processed_count) %Monitoring.GatewayStatusResponse{received: true} `[Suggestion processed]` Suggestion importance[1-10]: 8 __ Why: This suggestion correctly identifies a silent data loss scenario where the client receives a success response even if all submitted data was invalid and dropped, improving the robustness of the gRPC endpoint.	Medium
	✅ ~~Avoid reconnect loops in idle~~ Suggestion Impact: Updated IsConnected to return true when the connection state is either connectivity.Ready or connectivity.Idle, matching the suggested switch-based logic. code diff: `@@ -250,7 +250,12 @@ if !g.connected \|\| g.conn == nil { return false } - return g.conn.GetState() == connectivity.Ready + switch g.conn.GetState() { + case connectivity.Ready, connectivity.Idle: + return true + default: + return false + }` Update `IsConnected` to consider the `connectivity.Idle` state as connected. This prevents unnecessary reconnect loops when the gRPC connection is temporarily idle but still healthy. pkg/agent/gateway_client.go [250-253] `if !g.connected \|\| g.conn == nil { return false } -return g.conn.GetState() == connectivity.Ready +switch g.conn.GetState() { +case connectivity.Ready, connectivity.Idle: + return true +default: + return false +}` Suggestion importance[1-10]: 7 __ Why: The suggestion correctly identifies that treating an `Idle` gRPC connection as disconnected can cause unnecessary reconnect attempts, improving efficiency and reducing load on the agent and gateway.	Medium
	✅ ~~Reject missing service names~~ Suggestion Impact: The commit replaced `to_string() \|> String.trim()` with a case expression that returns "" for nil/non-binary values and trims only binaries, enabling the existing validation to reject missing service_name instead of accepting "nil". code diff: `@@ -406,9 +416,11 @@ # Tenant and gateway_id come from server-side metadata (mTLS cert + server identity) # NOT from the service message - this prevents spoofing service_name = - service.service_name - \|> to_string() - \|> String.trim() + case service.service_name do + name when is_binary(name) -> String.trim(name) + nil -> "" + _ -> "" + end` Explicitly handle `nil` values for `service.service_name` to prevent them from being processed as the string `"nil"`. The current implementation allows this, which can lead to invalid service records. elixir/serviceradar_agent_gateway/lib/serviceradar_agent_gateway/agent_gateway_server.ex [408-425] service_name = - service.service_name - \|> to_string() - \|> String.trim() + case service.service_name do + name when is_binary(name) -> String.trim(name) + nil -> "" + _ -> "" + end cond do service_name == "" -> raise GRPC.RPCError, status: :invalid_argument, message: "service_name is required" byte_size(service_name) > 255 -> raise GRPC.RPCError, status: :invalid_argument, message: "service_name is too long" String.contains?(service_name, ["\n", "\r", "\t"]) -> raise GRPC.RPCError, status: :invalid_argument, message: "service_name contains invalid characters" true -> :ok end `[Suggestion processed]` Suggestion importance[1-10]: 7 __ Why: This is a valid bug fix that prevents data pollution by correctly handling `nil` `service_name` values, which would otherwise be processed as the string `"nil"`.	Medium

✅ Suggestions up to commit b7127ef

Category	Suggestion	Impact
Incremental ^[*]	✅ ~~Restrict monotonic fallback activity~~ Suggestion Impact: The commit updates agent_active? to gate the monotonic fallback with `and is_nil(agent[:last_seen])`, matching the suggestion’s intent and preventing monotonic timestamps from overriding stale wall-clock last_seen values. It also reuses agent_active? for agent cache pruning, aligning staleness rules with the UI logic. code diff: `- is_integer(agent[:last_seen_mono]) -> + is_integer(agent[:last_seen_mono]) and is_nil(agent[:last_seen]) -> now_mono = System.monotonic_time(:millisecond) delta_ms = max(now_mono - agent[:last_seen_mono], 0) delta_ms < @stale_threshold_ms` In `agent_active?`, restrict the monotonic time fallback to only apply when the wall-clock `last_seen` timestamp is `nil` to ensure correct staleness detection. web-ng/lib/serviceradar_web_ng_web/live/infrastructure_live/index.ex [858-866] `delta_ms = max(now_ms - last_seen_ms, 0) delta_ms < @stale_threshold_ms - is_integer(agent[:last_seen_mono]) -> + is_integer(agent[:last_seen_mono]) and is_nil(agent[:last_seen]) -> now_mono = System.monotonic_time(:millisecond) delta_ms = max(now_mono - agent[:last_seen_mono], 0) delta_ms < @stale_threshold_ms` Suggestion importance[1-10]: 8 __ Why: The suggestion correctly identifies a subtle bug where the monotonic time fallback could incorrectly mark an agent as active if a wall-clock time exists but is stale, improving the accuracy of staleness detection.	Medium
	✅ ~~Bound service shutdown with timeout~~ Suggestion Impact: The commit removed the unbounded deferred server.Stop(context.Background()) and instead calls server.Stop() with a context timeout inside the shutdown goroutine, plus replaces time.After with an explicit timer. It also adds timed server.Stop() calls on the errChan path, further ensuring shutdown is bounded. code diff: @@ -177,11 +177,6 @@ if err := server.Start(pushCtx); err != nil { return fmt.Errorf("failed to start agent services: %w", err) } - defer func() { - if err := server.Stop(context.Background()); err != nil { - log.Error().Err(err).Msg("Error stopping agent services") - } - }() // Handle shutdown signals sigChan := make(chan os.Signal, 1) @@ -204,21 +199,42 @@ shutdownDone := make(chan struct{}) go func() { + defer close(shutdownDone) + cancel() pushLoop.Stop() <-errChan - close(shutdownDone) + + stopCtx, stopCancel := context.WithTimeout(context.Background(), shutdownTimeout) + defer stopCancel() + if err := server.Stop(stopCtx); err != nil { + log.Error().Err(err).Msg("Error stopping agent services") + } }() + + timer := time.NewTimer(shutdownTimeout) + defer timer.Stop() select { case <-shutdownDone: - case <-time.After(shutdownTimeout): + case <-timer.C: return fmt.Errorf("%w after %s", errShutdownTimeout, shutdownTimeout) } case err := <-errChan: if err != nil && !errors.Is(err, context.Canceled) { + stopCtx, stopCancel := context.WithTimeout(context.Background(), 10time.Second) + defer stopCancel() + if stopErr := server.Stop(stopCtx); stopErr != nil { + log.Error().Err(stopErr).Msg("Error stopping agent services") + } return fmt.Errorf("push loop error: %w", err) + } + + stopCtx, stopCancel := context.WithTimeout(context.Background(), 10time.Second) + defer stopCancel() + if stopErr := server.Stop(stopCtx); stopErr != nil { + log.Error().Err(stopErr).Msg("Error stopping agent services") } Move the `server.Stop()` call into the timed shutdown block to ensure it is bounded by the `shutdownTimeout`, preventing the process from hanging during termination. cmd/agent/main.go [202-217] // Bound shutdown so the process can't hang forever (includes pushLoop.Stop()). const shutdownTimeout = 10 * time.Second shutdownDone := make(chan struct{}) go func() { + defer close(shutdownDone) + cancel() pushLoop.Stop() <-errChan - close(shutdownDone) + + stopCtx, stopCancel := context.WithTimeout(context.Background(), shutdownTimeout) + defer stopCancel() + if err := server.Stop(stopCtx); err != nil { + log.Error().Err(err).Msg("Error stopping agent services") + } }() + +timer := time.NewTimer(shutdownTimeout) +defer timer.Stop() select { case <-shutdownDone: -case <-time.After(shutdownTimeout): +case <-timer.C: return fmt.Errorf("%w after %s", errShutdownTimeout, shutdownTimeout) } `[Suggestion processed]` Suggestion importance[1-10]: 7 __ Why: The suggestion correctly identifies that `server.Stop()` is not covered by the shutdown timeout, which could cause the process to hang, and proposes a valid fix to include it in the timed shutdown sequence.	Medium
	✅ ~~Clamp skewed timestamps in pruning~~ Suggestion Impact: The commit updated the pruned_gateways_cache pruning logic to clamp delta_ms using max(now_ms - last_ms, 0) and removed the delta_ms < 0 check, preventing active gateways from being pruned due to clock skew. It also included additional related changes to agent pruning logic. code diff: `+ pruned_gateways_cache = + socket.assigns.gateways_cache + \|> Enum.reject(fn {_id, gw} -> + last_ms = parse_timestamp_to_ms(Map.get(gw, :last_heartbeat)) + + delta_ms = + if is_integer(last_ms) do + max(now_ms - last_ms, 0) + else + nil + end + + not is_integer(delta_ms) or delta_ms > @stale_threshold_ms + end) + \|> Map.new()` To prevent incorrectly pruning active gateways due to clock skew, clamp negative `delta_ms` values to `0` in the `pruned_gateways_cache` logic. web-ng/lib/serviceradar_web_ng_web/live/infrastructure_live/index.ex [97-106] `pruned_gateways_cache = socket.assigns.gateways_cache \|> Enum.reject(fn {_id, gw} -> last_ms = parse_timestamp_to_ms(Map.get(gw, :last_heartbeat)) - delta_ms = if is_integer(last_ms), do: now_ms - last_ms, else: nil + delta_ms = + if is_integer(last_ms) do + max(now_ms - last_ms, 0) + else + nil + end - not is_integer(delta_ms) or delta_ms < 0 or delta_ms > @stale_threshold_ms + not is_integer(delta_ms) or delta_ms > @stale_threshold_ms end) \|> Map.new()` Suggestion importance[1-10]: 7 __ Why: This is a valid bug fix for handling clock skew, which could cause active gateways to be incorrectly pruned, but the same logic is already correctly implemented in `compute_gateways`, making this a consistency fix.	Medium
	✅ ~~Avoid dropping agents on skew~~ Suggestion Impact: The pruning logic was changed to avoid dropping entities due to negative deltas: gateways now clamp delta_ms with `max(..., 0)` and no longer prune on `delta_ms < 0`. For agents, instead of directly computing delta_ms, the code now delegates pruning to `agent_active?/2`, which already clamps negative deltas and applies consistent staleness rules, achieving the same goal (avoid pruning due to clock skew) though via a different implementation. code diff: + pruned_gateways_cache = + socket.assigns.gateways_cache + \|> Enum.reject(fn {_id, gw} -> + last_ms = parse_timestamp_to_ms(Map.get(gw, :last_heartbeat)) + + delta_ms = + if is_integer(last_ms) do + max(now_ms - last_ms, 0) + else + nil + end + + not is_integer(delta_ms) or delta_ms > @stale_threshold_ms + end) + \|> Map.new() + + pruned_agents_cache = + socket.assigns.agents_cache + \|> Enum.reject(fn {_id, agent} -> + # Use the same staleness rules as the UI (wall clock first, monotonic fallback) + not agent_active?(agent, now_ms) + end) + \|> Map.new() gateways = compute_gateways(pruned_gateways_cache) @@ -859,7 +861,7 @@ delta_ms = max(now_ms - last_seen_ms, 0) delta_ms < @stale_threshold_ms - is_integer(agent[:last_seen_mono]) -> + is_integer(agent[:last_seen_mono]) and is_nil(agent[:last_seen]) -> now_mono = System.monotonic_time(:millisecond) delta_ms = max(now_mono - agent[:last_seen_mono], 0) delta_ms < @stale_threshold_ms To prevent incorrectly pruning active agents due to clock skew, clamp negative `delta_ms` values to `0` in the `pruned_agents_cache` logic. web-ng/lib/serviceradar_web_ng_web/live/infrastructure_live/index.ex [108-117] `pruned_agents_cache = socket.assigns.agents_cache \|> Enum.reject(fn {_id, agent} -> last_ms = agent_last_seen_ms(agent) - delta_ms = if is_integer(last_ms), do: now_ms - last_ms, else: nil + delta_ms = + if is_integer(last_ms) do + max(now_ms - last_ms, 0) + else + nil + end - not is_integer(delta_ms) or delta_ms < 0 or delta_ms > @stale_threshold_ms + not is_integer(delta_ms) or delta_ms > @stale_threshold_ms end) \|> Map.new()` Suggestion importance[1-10]: 7 __ Why: This is a valid bug fix for handling clock skew, which could cause active agents to be incorrectly pruned, but the same logic is already correctly implemented in `agent_active?`, making this a consistency fix.	Medium
Possible issue	✅ ~~Bind collectors to service context~~ Suggestion Impact: The commit moved the serviceCtx lookup to before starting the collector and changed collector.Start to use serviceCtx (falling back to ctx when nil), aligning collector lifetime with the service context. code diff: - // Start collector - if err := collector.Start(ctx); err != nil { + s.mu.RLock() + serviceCtx := s.serviceCtx + s.mu.RUnlock() + if serviceCtx == nil { + serviceCtx = ctx + } + + // Start collector with the service context so it stops on service shutdown. + if err := collector.Start(serviceCtx); err != nil { return fmt.Errorf("%w: %s", errFailedToStartCollector, target.Name) } @@ -363,12 +370,6 @@ Str("host_name", target.Name). Msg("Initialized service status") - s.mu.RLock() - serviceCtx := s.serviceCtx - s.mu.RUnlock() - if serviceCtx == nil { - serviceCtx = ctx - } go s.processResults(serviceCtx, target.Name, results, aggregator) Start the SNMP collector with the service's long-lived context (`serviceCtx`) instead of the caller-provided context to prevent resource leaks when the service is stopped or restarted. pkg/checker/snmp/service.go [322-372] `-if err := collector.Start(ctx); err != nil { - return fmt.Errorf("%w: %s", errFailedToStartCollector, target.Name) -} -... s.mu.RLock() serviceCtx := s.serviceCtx s.mu.RUnlock() if serviceCtx == nil { serviceCtx = ctx } + +if err := collector.Start(serviceCtx); err != nil { + return fmt.Errorf("%w: %s", errFailedToStartCollector, target.Name) +} +... go s.processResults(serviceCtx, target.Name, results, aggregator)` Suggestion importance[1-10]: 8 __ Why: The suggestion correctly identifies a potential goroutine leak where a collector started via `AddTarget` might not be tied to the service's lifecycle, and proposes using the service's context to fix it.	Medium
	✅ ~~Prune agents using activity predicate~~ Suggestion Impact: The commit updated pruned_agents_cache to reject agents based on `not agent_active?(agent, now_ms)` instead of re-implementing staleness checks, aligning pruning with the UI logic. It also refined agent_active?/2 to only use the monotonic timestamp fallback when wall-clock `:last_seen` is nil, further ensuring consistent activity evaluation. code diff: `+ pruned_agents_cache = + socket.assigns.agents_cache + \|> Enum.reject(fn {_id, agent} -> + # Use the same staleness rules as the UI (wall clock first, monotonic fallback) + not agent_active?(agent, now_ms) + end) + \|> Map.new() gateways = compute_gateways(pruned_gateways_cache) @@...`

Imported GitHub PR comment. Original author: @qodo-code-review[bot] Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3706934588 Original created: 2026-01-03T09:50:14Z --- ## PR Code Suggestions ✨  Latest suggestions up to 5c8ccf6 <table><thead><tr><td>Category</td><td align=left>Suggestion                                                                                                                                    </td><td align=center>Impact</td></tr><tbody><tr><td rowspan=3>Possible issue</td> <td> <details><summary>Fix remaining state transitions</summary> ___ **Update <code>transition_state</code> calls in several <code>update</code> actions to use the correct transition name instead of the target state atom to prevent runtime errors.** [elixir/serviceradar_core/lib/serviceradar/infrastructure/gateway.ex [254-363]](https://github.com/carverauto/serviceradar/pull/2221/files#diff-1655f7cc80c8949b3d2c9e50d20110cb85883413fb6181825bfbcf0bda578c6bR254-R363) ```diff update :heartbeat_timeout do description "Mark gateway as degraded due to heartbeat timeout" require_atomic? false - change transition_state(:degraded) + change transition_state(:heartbeat_timeout) change set_attribute(:is_healthy, false) change set_attribute(:updated_at, &DateTime.utc_now/0) change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :gateway, new_state: :degraded} end update :lose_connection do description "Mark gateway as offline due to lost connection" require_atomic? false - change transition_state(:offline) + change transition_state(:lose_connection) change set_attribute(:is_healthy, false) change set_attribute(:updated_at, &DateTime.utc_now/0) change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :gateway, new_state: :offline} end update :restore_health do description "Restore gateway to healthy state" require_atomic? false - change transition_state(:healthy) + change transition_state(:restore_health) change set_attribute(:is_healthy, true) change set_attribute(:last_seen, &DateTime.utc_now/0) change set_attribute(:updated_at, &DateTime.utc_now/0) change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :gateway, new_state: :healthy} end update :end_maintenance do description "End maintenance mode, return to healthy" require_atomic? false - change transition_state(:healthy) + change transition_state(:end_maintenance) change set_attribute(:is_healthy, true) change set_attribute(:updated_at, &DateTime.utc_now/0) change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :gateway, new_state: :healthy} end update :finish_draining do description "Finish draining, go offline" require_atomic? false - change transition_state(:offline) + change transition_state(:finish_draining) change set_attribute(:is_healthy, false) change set_attribute(:updated_at, &DateTime.utc_now/0) change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :gateway, new_state: :offline} end update :mark_unhealthy do description "Mark gateway as unhealthy (legacy - use degrade)" require_atomic? false - change transition_state(:degraded) + change transition_state(:degrade) change set_attribute(:is_healthy, false) change set_attribute(:updated_at, &DateTime.utc_now/0) change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :gateway, new_state: :degraded} end ``` `[To ensure code accuracy, apply this suggestion manually]` <details><summary>Suggestion importance[1-10]: 9</summary> __ Why: The suggestion correctly identifies a bug where `transition_state` is called with the target state instead of the transition name, which would cause a runtime error. </details></details></td><td align=center>High </td></tr><tr><td> <details><summary>✅ <s>Enforce streaming chunk ordering</s></summary> ___ <details><summary>Suggestion Impact:</summary>The commit updated the stream reducer state to include expected_idx and pinned_total_chunks, added validation that total_chunks cannot change mid-stream, and enforced that each incoming chunk_index must match the expected sequential index, updating the reducer continuation/halt tuples accordingly. code diff: ```diff - {total_services, saw_final?, _stream_agent_id} = - Enum.reduce_while(request_stream, {0, false, nil}, fn chunk, {acc, _saw_final?, stream_agent_id} -> + {total_services, saw_final?, _stream_agent_id, _expected_idx, _pinned_total_chunks} = + Enum.reduce_while(request_stream, {0, false, nil, 0, nil}, fn chunk, + {acc, _saw_final?, stream_agent_id, expected_idx, + pinned_total_chunks} -> agent_id = case chunk.agent_id do nil -> @@ -343,8 +349,19 @@ raise GRPC.RPCError, status: :invalid_argument, message: "total_chunks must be > 0" end + pinned_total_chunks = + case pinned_total_chunks do + nil -> total_chunks + ^total_chunks -> total_chunks + _ -> raise GRPC.RPCError, status: :invalid_argument, message: "total_chunks changed mid-stream" + end + if chunk_index < 0 or chunk_index >= total_chunks do raise GRPC.RPCError, status: :invalid_argument, message: "invalid chunk_index" + end + + if chunk_index != expected_idx do + raise GRPC.RPCError, status: :invalid_argument, message: "unexpected chunk_index" end Logger.debug( @@ -396,9 +413,9 @@ end record_push_metrics(agent_id, new_total) - {:halt, {new_total, true, stream_agent_id}} + {:halt, {new_total, true, stream_agent_id, expected_idx + 1, pinned_total_chunks}} else - {:cont, {new_total, false, stream_agent_id}} + {:cont, {new_total, false, stream_agent_id, expected_idx + 1, pinned_total_chunks}} end end) ``` </details> ___ **Enforce chunk ordering and contiguity in <code>stream_status/2</code> by tracking the <code>expected_chunk_index</code> and pinning <code>total_chunks</code> to prevent partial data ingestion.** [elixir/serviceradar_agent_gateway/lib/serviceradar_agent_gateway/agent_gateway_server.ex [301-403]](https://github.com/carverauto/serviceradar/pull/2221/files#diff-369a368073dc8ec1140bcea699005a1ce97a90cd59629df0bd18c71c7ffaae9fR301-R403) ```diff {total_services, saw_final?, _stream_agent_id} = - Enum.reduce_while(request_stream, {0, false, nil}, fn chunk, {acc, _saw_final?, stream_agent_id} -> + Enum.reduce_while(request_stream, {0, false, nil, 0, nil}, fn chunk, + {acc, _saw_final?, stream_agent_id, expected_idx, + pinned_total_chunks} -> ... chunk_index = chunk.chunk_index || 0 total_chunks = chunk.total_chunks || 0 if total_chunks <= 0 do raise GRPC.RPCError, status: :invalid_argument, message: "total_chunks must be > 0" end + pinned_total_chunks = + case pinned_total_chunks do + nil -> total_chunks + ^total_chunks -> total_chunks + _ -> raise GRPC.RPCError, status: :invalid_argument, message: "total_chunks changed mid-stream" + end + if chunk_index < 0 or chunk_index >= total_chunks do raise GRPC.RPCError, status: :invalid_argument, message: "invalid chunk_index" + end + + if chunk_index != expected_idx do + raise GRPC.RPCError, status: :invalid_argument, message: "unexpected chunk_index" end ... if chunk.is_final do if chunk_index != total_chunks - 1 do raise GRPC.RPCError, status: :invalid_argument, message: "final chunk_index does not match total_chunks" end record_push_metrics(agent_id, new_total) - {:halt, {new_total, true, stream_agent_id}} + {:halt, {new_total, true, stream_agent_id, expected_idx + 1, pinned_total_chunks}} else - {:cont, {new_total, false, stream_agent_id}} + {:cont, {new_total, false, stream_agent_id, expected_idx + 1, pinned_total_chunks}} end end) ``` `[To ensure code accuracy, apply this suggestion manually]` <details><summary>Suggestion importance[1-10]: 8</summary> __ Why: The suggestion correctly identifies a vulnerability where out-of-order or missing stream chunks are not detected, potentially leading to silent data loss. </details></details></td><td align=center>Medium </td></tr><tr><td> <details><summary>Unify version reporting source</summary> ___ **Propagate the build-time <code>Version</code> from the <code>main</code> package to the <code>agent</code> package to ensure correct version reporting during gateway enrollment.** [cmd/agent/main.go [42-45]](https://github.com/carverauto/serviceradar/pull/2221/files#diff-61358711e980ccf505246fd3915f97cbd3a380e9b66f6fa5aad46749968c5ca3R42-R45) ```diff // Version is set at build time via ldflags // //nolint:gochecknoglobals // Required for build-time ldflags injection var Version = "dev" +... + +// Ensure the agent package reports the same build version during enrollment. +agent.Version = Version + ``` `[To ensure code accuracy, apply this suggestion manually]` <details><summary>Suggestion importance[1-10]: 7</summary> __ Why: The suggestion correctly identifies that the build-time `Version` from the `main` package is not propagated to the `agent` package, causing incorrect version reporting during enrollment. </details></details></td><td align=center>Medium </td></tr><tr><td rowspan=5>Incremental <a href='https://qodo-merge-docs.qodo.ai/core-abilities/incremental_update/'>[*]</a></td> <td> <details><summary>✅ <s>Skip malformed service entries</s></summary> ___ <details><summary>Suggestion Impact:</summary>The reduce function in push_status was changed to use function-head pattern matching for %Monitoring.GatewayServiceStatus{} and to ignore all other entries via an _other clause, matching the suggestion and preventing invalid entries from triggering exception-based control flow. code diff: ```diff @@ -252,25 +252,29 @@ processed_count = services |> Enum.reject(&is_nil/1) - |> Enum.reduce(0, fn service, acc -> - try do - process_service_status(service, metadata) - acc + 1 - rescue - e in GRPC.RPCError -> - Logger.warning( - "Dropping invalid service status from agent #{metadata.agent_id}: #{e.status} #{e.message}" - ) - - acc - - e -> - Logger.warning( - "Dropping service status from agent #{metadata.agent_id} due to error: #{Exception.message(e)}" - ) - - acc - end + |> Enum.reduce(0, fn + %Monitoring.GatewayServiceStatus{} = service, acc -> + try do + process_service_status(service, metadata) + acc + 1 + rescue + e in GRPC.RPCError -> + Logger.warning( + "Dropping invalid service status from agent #{metadata.agent_id}: #{e.status} #{e.message}" + ) + + acc + + e -> + Logger.warning( + "Dropping service status from agent #{metadata.agent_id} due to error: #{Exception.message(e)}" + ) + + acc + end + + _other, acc -> + acc end) ``` </details> ___ **In <code>push_status</code>, use pattern matching on <code>%Monitoring.GatewayServiceStatus{}</code> within <code>Enum.reduce</code> to filter invalid service entries. This avoids using exceptions for control flow and mitigates a potential resource exhaustion attack.** [elixir/serviceradar_agent_gateway/lib/serviceradar_agent_gateway/agent_gateway_server.ex [252-278]](https://github.com/carverauto/serviceradar/pull/2221/files#diff-369a368073dc8ec1140bcea699005a1ce97a90cd59629df0bd18c71c7ffaae9fR252-R278) ```diff processed_count = services |> Enum.reject(&is_nil/1) - |> Enum.reduce(0, fn service, acc -> - try do - process_service_status(service, metadata) - acc + 1 - rescue - e in GRPC.RPCError -> - Logger.warning( - "Dropping invalid service status from agent #{metadata.agent_id}: #{e.status} #{e.message}" - ) + |> Enum.reduce(0, fn + %Monitoring.GatewayServiceStatus{} = service, acc -> + try do + process_service_status(service, metadata) + acc + 1 + rescue + e in GRPC.RPCError -> + Logger.warning( + "Dropping invalid service status from agent #{metadata.agent_id}: #{e.status} #{e.message}" + ) - acc + acc - e -> - Logger.warning( - "Dropping service status from agent #{metadata.agent_id} due to error: #{Exception.message(e)}" - ) + e -> + Logger.warning( + "Dropping service status from agent #{metadata.agent_id} due to error: #{Exception.message(e)}" + ) - acc - end + acc + end + + _other, acc -> + acc end) if processed_count == 0 and service_count > 0 do raise GRPC.RPCError, status: :invalid_argument, message: "no valid service statuses" end ``` `[Suggestion processed]` <details><summary>Suggestion importance[1-10]: 8</summary> __ Why: The suggestion correctly identifies a potential resource exhaustion vector and proposes a more robust and secure implementation using pattern matching. This avoids expensive exception handling for invalid data types in a loop, which is a good security and performance practice. </details></details></td><td align=center>Medium </td></tr><tr><td> <details><summary>✅ <s>Bound user-controlled string sizes</s></summary> ___ <details><summary>Suggestion Impact:</summary>The commit adds trimming/normalization of `service_type` and `source`, validates their maximum byte size and disallows control characters, and then uses these sanitized values in the constructed status map. code diff: ```diff @@ -422,6 +439,18 @@ _ -> "" end + service_type = + case service.service_type do + st when is_binary(st) -> String.trim(st) + _ -> "" + end + + source = + case service.source do + src when is_binary(src) -> String.trim(src) + _ -> "" + end + cond do service_name == "" -> raise GRPC.RPCError, status: :invalid_argument, message: "service_name is required" @@ -431,6 +460,12 @@ String.contains?(service_name, ["\n", "\r", "\t"]) -> raise GRPC.RPCError, status: :invalid_argument, message: "service_name contains invalid characters" + + byte_size(service_type) > 64 or String.contains?(service_type, ["\n", "\r", "\t"]) -> + raise GRPC.RPCError, status: :invalid_argument, message: "service_type is invalid" + + byte_size(source) > 64 or String.contains?(source, ["\n", "\r", "\t"]) -> + raise GRPC.RPCError, status: :invalid_argument, message: "source is invalid" true -> :ok @@ -458,16 +493,28 @@ end end) + response_time = + case service.response_time do + rt when is_integer(rt) and rt >= 0 and rt <= 86_400_000 -> + rt + + rt when is_integer(rt) and rt > 86_400_000 -> + 86_400_000 + + _ -> + 0 + end + status = %{ service_name: service_name, available: service.available == true, message: message, - service_type: service.service_type, - response_time: service.response_time, + service_type: service_type, + response_time: response_time, agent_id: metadata.agent_id, gateway_id: metadata.gateway_id, partition: normalize_partition(service.partition || metadata.partition), - source: service.source, + source: source, kv_store_id: service.kv_store_id || metadata.kv_store_id, ``` </details> ___ **In <code>process_service_status</code>, add size and character validation for the <code>service_type</code> and <code>source</code> fields. This prevents potential resource exhaustion and data ingestion issues from malicious or malformed client input.** [elixir/serviceradar_agent_gateway/lib/serviceradar_agent_gateway/agent_gateway_server.ex [418-423]](https://github.com/carverauto/serviceradar/pull/2221/files#diff-369a368073dc8ec1140bcea699005a1ce97a90cd59629df0bd18c71c7ffaae9fR418-R423) ```diff service_name = case service.service_name do name when is_binary(name) -> String.trim(name) nil -> "" _ -> "" end +service_type = + case service.service_type do + st when is_binary(st) -> String.trim(st) + _ -> "" + end + +source = + case service.source do + src when is_binary(src) -> String.trim(src) + _ -> "" + end + +cond do + service_name == "" -> + raise GRPC.RPCError, status: :invalid_argument, message: "service_name is required" + + byte_size(service_name) > 255 -> + raise GRPC.RPCError, status: :invalid_argument, message: "service_name is too long" + + String.contains?(service_name, ["\n", "\r", "\t"]) -> + raise GRPC.RPCError, status: :invalid_argument, message: "service_name contains invalid characters" + + byte_size(service_type) > 64 or String.contains?(service_type, ["\n", "\r", "\t"]) -> + raise GRPC.RPCError, status: :invalid_argument, message: "service_type is invalid" + + byte_size(source) > 64 or String.contains?(source, ["\n", "\r", "\t"]) -> + raise GRPC.RPCError, status: :invalid_argument, message: "source is invalid" + + true -> + :ok +end + ``` `[Suggestion processed]` <details><summary>Suggestion importance[1-10]: 8</summary> __ Why: The suggestion correctly points out that other user-controlled string fields like `service_type` and `source` are not being validated. Adding size and character validation for these fields is a good security practice to prevent resource exhaustion and data ingestion issues. </details></details></td><td align=center>Medium </td></tr><tr><td> <details><summary>Prevent test goroutine leaks</summary> ___ **Prevent a test goroutine leak by using a cancellable context when starting the <code>service</code> and <code>defer</code>ring a call to <code>service.Stop()</code> to ensure cleanup.** [pkg/checker/snmp/service_test.go [162-201]](https://github.com/carverauto/serviceradar/pull/2221/files#diff-23cbcb66506ccc940c39669f00ce2f35960afe4be9e91668b114693a22ad4b89R162-R201) ```diff initialDataChan := make(chan DataPoint) newDataChan := make(chan DataPoint) // Channel to signal when GetResults is called for the new target. getResultsCalled := make(chan struct{}) ... // Test adding target -err = service.Start(context.Background()) +startCtx, cancelStart := context.WithCancel(context.Background()) +defer cancelStart() +defer func() { require.NoError(t, service.Stop()) }() +defer close(initialDataChan) +defer close(newDataChan) + +err = service.Start(startCtx) require.NoError(t, err) ``` `[To ensure code accuracy, apply this suggestion manually]` <details><summary>Suggestion importance[1-10]: 7</summary> __ Why: The suggestion correctly identifies a goroutine leak in the test due to a non-cancellable context, which can cause test flakiness, and provides a correct fix to ensure proper cleanup. </details></details></td><td align=center>Medium </td></tr><tr><td> <details><summary>✅ <s>Handle RPC agent map shapes</s></summary> ___ <details><summary>Suggestion Impact:</summary>The reduce/caching logic was changed to fetch agent_id and other fields using both atom and string map keys, matching the suggestion to robustly handle RPC result shapes. code diff: ```diff - Map.put(acc, agent.agent_id, %{ - agent_id: agent.agent_id, - tenant_id: Map.get(agent, :tenant_id), - tenant_slug: Map.get(agent, :tenant_slug, "default"), - last_seen: Map.get(agent, :last_seen), - last_seen_mono: Map.get(agent, :last_seen_mono), - service_count: Map.get(agent, :service_count, 0), - partition: Map.get(agent, :partition), - source_ip: Map.get(agent, :source_ip) + agent_id = Map.get(agent, :agent_id) || Map.get(agent, "agent_id") + + Map.put(acc, agent_id, %{ + agent_id: agent_id, + tenant_id: Map.get(agent, :tenant_id) || Map.get(agent, "tenant_id"), + tenant_slug: Map.get(agent, :tenant_slug) || Map.get(agent, "tenant_slug") || "default", + last_seen: Map.get(agent, :last_seen) || Map.get(agent, "last_seen"), + last_seen_mono: Map.get(agent, :last_seen_mono) || Map.get(agent, "last_seen_mono"), + service_count: Map.get(agent, :service_count) || Map.get(agent, "service_count") || 0, + partition: Map.get(agent, :partition) || Map.get(agent, "partition"), + source_ip: Map.get(agent, :source_ip) || Map.get(agent, "source_ip") }) ``` </details> ___ **Modify the agent cache builder to handle both atom and string keys from RPC results. This prevents incorrect data processing due to serialization differences between nodes.** [web-ng/lib/serviceradar_web_ng_web/live/infrastructure_live/index.ex [739-749]](https://github.com/carverauto/serviceradar/pull/2221/files#diff-f1af30a84da554ef3d43b226a9303174b33dd5d27e23e9b702031483074e5f54R739-R749) ```diff -Map.put(acc, agent.agent_id, %{ - agent_id: agent.agent_id, - tenant_id: Map.get(agent, :tenant_id), - tenant_slug: Map.get(agent, :tenant_slug, "default"), - last_seen: Map.get(agent, :last_seen), - last_seen_mono: Map.get(agent, :last_seen_mono), - service_count: Map.get(agent, :service_count, 0), - partition: Map.get(agent, :partition), - source_ip: Map.get(agent, :source_ip) +agent_id = Map.get(agent, :agent_id) || Map.get(agent, "agent_id") + +Map.put(acc, agent_id, %{ + agent_id: agent_id, + tenant_id: Map.get(agent, :tenant_id) || Map.get(agent, "tenant_id"), + tenant_slug: Map.get(agent, :tenant_slug) || Map.get(agent, "tenant_slug") || "default", + last_seen: Map.get(agent, :last_seen) || Map.get(agent, "last_seen"), + last_seen_mono: Map.get(agent, :last_seen_mono) || Map.get(agent, "last_seen_mono"), + service_count: Map.get(agent, :service_count) || Map.get(agent, "service_count") || 0, + partition: Map.get(agent, :partition) || Map.get(agent, "partition"), + source_ip: Map.get(agent, :source_ip) || Map.get(agent, "source_ip") }) ``` `[To ensure code accuracy, apply this suggestion manually]` <details><summary>Suggestion importance[1-10]: 7</summary> __ Why: This is a valid suggestion that improves robustness by defensively handling both atom and string keys for data received via RPC, which can have inconsistent serialization. This prevents a potential bug where agents might be incorrectly processed. </details></details></td><td align=center>Medium </td></tr><tr><td> <details><summary>✅ <s>Fix embedded port detection</s></summary> ___ <details><summary>Suggestion Impact:</summary>The commit replaces the fallback url.Parse("tcp://"+target) port detection with logic that strips path/query and uses net.SplitHostPort, and adds the needed strings import. code diff: ```diff @@ -22,6 +22,7 @@ "fmt" "net" "net/url" + "strings" "sync" "time" @@ -685,9 +686,17 @@ hasEmbeddedPort = true } else if parsed, err := url.Parse(target); err == nil && parsed.Host != "" && parsed.Port() != "" { hasEmbeddedPort = true - } else if parsed, err := url.Parse("tcp://" + target); err == nil && parsed.Host != "" && parsed.Port() != "" { + } else { // Handles "host:port/path" inputs without an explicit scheme. - hasEmbeddedPort = true + hostPort := target + if i := strings.IndexAny(hostPort, "/?"); i >= 0 { + hostPort = hostPort[:i] + } + + // net.SplitHostPort requires bracketed IPv6; be explicit to avoid silently dropping checks. + if host, p, err := net.SplitHostPort(hostPort); err == nil && host != "" && p != "" { + hasEmbeddedPort = true + } ``` </details> ___ **Improve embedded port detection for check targets by using <code>net.SplitHostPort</code> on the host part of the target string, making it more resilient to various formats like unbracketed IPv6 addresses.** [pkg/agent/push_loop.go [688-691]](https://github.com/carverauto/serviceradar/pull/2221/files#diff-5f0d59be34ef26b449d7f5fd2b198a29b277936b9708a699f7487415ed6c2785R688-R691) ```diff -} else if parsed, err := url.Parse("tcp://" + target); err == nil && parsed.Host != "" && parsed.Port() != "" { +} else { // Handles "host:port/path" inputs without an explicit scheme. - hasEmbeddedPort = true + hostPort := target + if i := strings.IndexAny(hostPort, "/?"); i >= 0 { + hostPort = hostPort[:i] + } + + // net.SplitHostPort requires bracketed IPv6; be explicit to avoid silently dropping checks. + if host, p, err := net.SplitHostPort(hostPort); err == nil && host != "" && p != "" { + hasEmbeddedPort = true + } } ``` `[To ensure code accuracy, apply this suggestion manually]` <details><summary>Suggestion importance[1-10]: 6</summary> __ Why: The suggestion correctly identifies that the current port detection logic can fail for unbracketed IPv6 addresses, and proposes a more robust implementation using `net.SplitHostPort`. </details></details></td><td align=center>Low </td></tr><tr><td rowspan=1>Security</td> <td> <details><summary>✅ <s>Validate client-provided timing data</s></summary> ___ <details><summary>Suggestion Impact:</summary>The commit added response_time validation/normalization logic (clamping to 86_400_000 and defaulting to 0 for invalid values) and updated the status map to use the sanitized response_time instead of the raw client-provided value. code diff: ```diff + response_time = + case service.response_time do + rt when is_integer(rt) and rt >= 0 and rt <= 86_400_000 -> + rt + + rt when is_integer(rt) and rt > 86_400_000 -> + 86_400_000 + + _ -> + 0 + end + status = %{ service_name: service_name, available: service.available == true, message: message, - service_type: service.service_type, - response_time: service.response_time, + service_type: service_type, + response_time: response_time, agent_id: metadata.agent_id, ``` </details> ___ **Validate and normalize the client-provided <code>response_time</code> to be a non-negative integer within a reasonable range before it is processed.** [elixir/serviceradar_agent_gateway/lib/serviceradar_agent_gateway/agent_gateway_server.ex [461-475]](https://github.com/carverauto/serviceradar/pull/2221/files#diff-369a368073dc8ec1140bcea699005a1ce97a90cd59629df0bd18c71c7ffaae9fR461-R475) ```diff +response_time = + case service.response_time do + rt when is_integer(rt) and rt >= 0 and rt <= 86_400_000 -> + rt + + rt when is_integer(rt) and rt > 86_400_000 -> + 86_400_000 + + _ -> + 0 + end + status = %{ service_name: service_name, available: service.available == true, message: message, service_type: service.service_type, - response_time: service.response_time, + response_time: response_time, agent_id: metadata.agent_id, gateway_id: metadata.gateway_id, partition: normalize_partition(service.partition || metadata.partition), source: service.source, kv_store_id: service.kv_store_id || metadata.kv_store_id, timestamp: metadata.timestamp, tenant_id: metadata.tenant_id, tenant_slug: metadata.tenant_slug } ``` `[Suggestion processed]` <details><summary>Suggestion importance[1-10]: 7</summary> __ Why: The suggestion correctly points out that `response_time` from the client is not validated, which could lead to data corruption or processing errors downstream. </details></details></td><td align=center>Medium </td></tr><tr><td rowspan=1>General</td> <td> <details><summary>✅ <s>Preserve JSON parsing error details</s></summary> ___ <details><summary>Suggestion Impact:</summary>Updated the JSON parsing error handling to wrap and return the actual decode error (err) when decoding fails with non-EOF errors, improving debugging information. code diff: ```diff if err := dec.Decode(&struct{}{}); err == nil { return nil, fmt.Errorf("failed to parse config: %w", errConfigTrailingData) } else if !errors.Is(err, io.EOF) { - return nil, fmt.Errorf("failed to parse config: %w", errConfigTrailingData) + return nil, fmt.Errorf("failed to parse config: %w", err) ``` </details> ___ **Improve JSON parsing error handling by returning the specific underlying error instead of a generic one, which will aid in debugging configuration file issues.** [cmd/agent/main.go [129-133]](https://github.com/carverauto/serviceradar/pull/2221/files#diff-61358711e980ccf505246fd3915f97cbd3a380e9b66f6fa5aad46749968c5ca3R129-R133) ```diff if err := dec.Decode(&struct{}{}); err == nil { return nil, fmt.Errorf("failed to parse config: %w", errConfigTrailingData) } else if !errors.Is(err, io.EOF) { - return nil, fmt.Errorf("failed to parse config: %w", errConfigTrailingData) + return nil, fmt.Errorf("failed to parse config: %w", err) } ``` `[Suggestion processed]` <details><summary>Suggestion importance[1-10]: 6</summary> __ Why: The suggestion correctly points out that a generic error is returned for JSON parsing issues, which hides the specific cause and makes debugging more difficult. </details></details></td><td align=center>Low </td></tr> <tr><td align="center" colspan="2"> - [ ] Update  </td><td></td></tr></tbody></table> ___ #### Previous suggestions <details><summary>✅ Suggestions up to commit c96fbf9</summary> <table><thead><tr><td>Category</td><td align=left>Suggestion                                                                                                                                    </td><td align=center>Impact</td></tr><tbody><tr><td rowspan=3>Possible issue</td> <td> <details><summary>Fix state machine transitions</summary> ___ **Fix the <code>transition_state</code> calls in the state machine update actions to use the correct transition names (e.g., <code>:degrade</code>) instead of the target state atoms (e.g., <code>:degraded</code>) to prevent runtime errors.** [elixir/serviceradar_core/lib/serviceradar/infrastructure/gateway.ex [243-351]](https://github.com/carverauto/serviceradar/pull/2221/files#diff-1655f7cc80c8949b3d2c9e50d20110cb85883413fb6181825bfbcf0bda578c6bR243-R351) ```diff update :degrade do description "Mark gateway as degraded (having issues)" argument :reason, :string require_atomic? false - change transition_state(:degraded) + change transition_state(:degrade) change set_attribute(:is_healthy, false) change set_attribute(:updated_at, &DateTime.utc_now/0) change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :gateway, new_state: :degraded} end update :go_offline do description "Mark gateway as offline" argument :reason, :string require_atomic? false - change transition_state(:offline) + change transition_state(:go_offline) change set_attribute(:is_healthy, false) change set_attribute(:updated_at, &DateTime.utc_now/0) change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :gateway, new_state: :offline} end update :recover do description "Start recovery process for degraded/offline gateway" require_atomic? false - change transition_state(:recovering) + change transition_state(:recover) change set_attribute(:updated_at, &DateTime.utc_now/0) change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :gateway, new_state: :recovering} end update :start_maintenance do description "Put gateway into maintenance mode" require_atomic? false - change transition_state(:maintenance) + change transition_state(:start_maintenance) change set_attribute(:updated_at, &DateTime.utc_now/0) change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :gateway, new_state: :maintenance} end update :start_draining do description "Start graceful shutdown (draining)" require_atomic? false - change transition_state(:draining) + change transition_state(:start_draining) change set_attribute(:updated_at, &DateTime.utc_now/0) change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :gateway, new_state: :draining} end update :deactivate do description "Deactivate a gateway (admin action)" require_atomic? false - change transition_state(:inactive) + change transition_state(:deactivate) change set_attribute(:is_healthy, false) change set_attribute(:updated_at, &DateTime.utc_now/0) change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :gateway, new_state: :inactive} end ``` <details><summary>Suggestion importance[1-10]: 9</summary> __ Why: This suggestion correctly identifies a critical bug where `transition_state` is called with the target state instead of the transition name, which would cause runtime crashes. </details></details></td><td align=center>High </td></tr><tr><td> <details><summary>Fix gRPC connection initialization</summary> ___ **Replace the manual gRPC connection logic using <code>grpc.NewClient</code> with the standard <code>grpc.DialContext</code> and <code>grpc.WithBlock</code> for a simpler and more robust implementation.** [pkg/agent/gateway_client.go [130-154]](https://github.com/carverauto/serviceradar/pull/2221/files#diff-54aa63b6f05aabb0db096685e5be714c0c6d5c5a892310f048ff2ff8a542191fR130-R154) ```diff -conn, err := grpc.NewClient(g.addr, opts...) +opts = append(opts, grpc.WithBlock()) + +conn, err := grpc.DialContext(connectCtx, g.addr, opts...) if err != nil { if provider != nil { _ = provider.Close() } return fmt.Errorf("failed to connect to gateway at %s: %w", g.addr, err) } -conn.Connect() -for state := conn.GetState(); state != connectivity.Ready; state = conn.GetState() { - if state == connectivity.Shutdown { - _ = conn.Close() - if provider != nil { - _ = provider.Close() - } - return fmt.Errorf("failed to connect to gateway at %s: %w", g.addr, ErrConnectionShutdown) - } - if !conn.WaitForStateChange(connectCtx, state) { - _ = conn.Close() - if provider != nil { - _ = provider.Close() - } - return fmt.Errorf("failed to connect to gateway at %s: %w", g.addr, connectCtx.Err()) - } -} - ``` <details><summary>Suggestion importance[1-10]: 7</summary> __ Why: The suggestion correctly identifies that using `grpc.DialContext` with `grpc.WithBlock` is the idiomatic and more robust way to establish a gRPC connection, simplifying the code and improving maintainability. </details></details></td><td align=center>Medium </td></tr><tr><td> <details><summary>✅ <s>Prevent false agent activity</s></summary> ___ <details><summary>Suggestion Impact:</summary>The code now uses Map.get(agent, :last_seen) and Map.get(agent, :last_seen_mono) instead of agent.last_seen and System.monotonic_time(:millisecond), preventing false activity on initial cache load. code diff: ```diff - all_agents - |> Enum.reduce(%{}, fn agent, acc -> - Map.put(acc, agent.agent_id, %{ - agent_id: agent.agent_id, - tenant_id: Map.get(agent, :tenant_id), - tenant_slug: Map.get(agent, :tenant_slug, "default"), - last_seen: agent.last_seen, - last_seen_mono: System.monotonic_time(:millisecond), - service_count: Map.get(agent, :service_count, 0), - partition: Map.get(agent, :partition), - source_ip: Map.get(agent, :source_ip) - }) - end) + all_agents + |> Enum.reduce(%{}, fn agent, acc -> + Map.put(acc, agent.agent_id, %{ + agent_id: agent.agent_id, + tenant_id: Map.get(agent, :tenant_id), + tenant_slug: Map.get(agent, :tenant_slug, "default"), + last_seen: Map.get(agent, :last_seen), + last_seen_mono: Map.get(agent, :last_seen_mono), + service_count: Map.get(agent, :service_count, 0), + partition: Map.get(agent, :partition), + source_ip: Map.get(agent, :source_ip) + }) + end) ``` </details> ___ **In <code>load_initial_agents_cache</code>, set <code>last_seen_mono</code> from the agent data from the tracker instead of the current monotonic time to avoid incorrectly marking stale agents as active.** [web-ng/lib/serviceradar_web_ng_web/live/infrastructure_live/index.ex [738-750]](https://github.com/carverauto/serviceradar/pull/2221/files#diff-f1af30a84da554ef3d43b226a9303174b33dd5d27e23e9b702031483074e5f54R738-R750) ```diff all_agents |> Enum.reduce(%{}, fn agent, acc -> Map.put(acc, agent.agent_id, %{ agent_id: agent.agent_id, tenant_id: Map.get(agent, :tenant_id), tenant_slug: Map.get(agent, :tenant_slug, "default"), - last_seen: agent.last_seen, - last_seen_mono: System.monotonic_time(:millisecond), + last_seen: Map.get(agent, :last_seen), + last_seen_mono: Map.get(agent, :last_seen_mono), service_count: Map.get(agent, :service_count, 0), partition: Map.get(agent, :partition), source_ip: Map.get(agent, :source_ip) }) end) ``` <details><summary>Suggestion importance[1-10]: 7</summary> __ Why: The suggestion correctly identifies a logic flaw where setting `last_seen_mono` to the current time on initial load can falsely mark stale agents as active, leading to incorrect UI state. </details></details></td><td align=center>Medium </td></tr><tr><td rowspan=5>Incremental <a href='https://qodo-merge-docs.qodo.ai/core-abilities/incremental_update/'>[*]</a></td> <td> <details><summary>✅ <s>Accept host:port/path targets</s></summary> ___ <details><summary>Suggestion Impact:</summary>The commit adds an additional url.Parse("tcp://" + target) branch to detect an embedded port in host:port/path inputs without an explicit scheme, preventing such targets from being dropped. code diff: ```diff @@ -684,6 +684,9 @@ if host, p, err := net.SplitHostPort(target); err == nil && host != "" && p != "" { hasEmbeddedPort = true } else if parsed, err := url.Parse(target); err == nil && parsed.Host != "" && parsed.Port() != "" { + hasEmbeddedPort = true + } else if parsed, err := url.Parse("tcp://" + target); err == nil && parsed.Host != "" && parsed.Port() != "" { + // Handles "host:port/path" inputs without an explicit scheme. hasEmbeddedPort = true } ``` </details> ___ **Fix the embedded-port detection logic to correctly handle targets that include both a port and a path (e.g., <code>example.com:443/health</code>). This prevents valid TCP/gRPC checks from being silently dropped.** [pkg/agent/push_loop.go [681-692]](https://github.com/carverauto/serviceradar/pull/2221/files#diff-5f0d59be34ef26b449d7f5fd2b198a29b277936b9708a699f7487415ed6c2785R681-R692) ```diff if (checkerType == tcpCheckType || checkerType == grpcCheckType) && port == 0 { // Allow "host:port" or URL targets that already include a port. hasEmbeddedPort := false if host, p, err := net.SplitHostPort(target); err == nil && host != "" && p != "" { hasEmbeddedPort = true } else if parsed, err := url.Parse(target); err == nil && parsed.Host != "" && parsed.Port() != "" { hasEmbeddedPort = true + } else if parsed, err := url.Parse("tcp://" + target); err == nil && parsed.Host != "" && parsed.Port() != "" { + // Handles "host:port/path" inputs without an explicit scheme. + hasEmbeddedPort = true } if !hasEmbeddedPort { return nil } } ``` <details><summary>Suggestion importance[1-10]: 8</summary> __ Why: This suggestion correctly identifies a bug where valid check targets with a port and path (e.g., `example.com:8080/health`) are silently ignored, leading to a failure in monitoring. The fix is accurate and resolves this issue. </details></details></td><td align=center>Medium </td></tr><tr><td> <details><summary>✅ <s>Prevent collectors from stopping early</s></summary> ___ <details><summary>Suggestion Impact:</summary>The code was changed to return an error ("service not started") when the service lifecycle context is nil, preventing collectors from being started with the request context and stopping prematurely. code diff: ```diff @@ -322,7 +322,7 @@ serviceCtx := s.serviceCtx s.mu.RUnlock() if serviceCtx == nil { - serviceCtx = ctx + return errors.New("service not started") } ``` </details> ___ **Prevent SNMP collectors from stopping prematurely by ensuring they are always started with the service's lifecycle context. If the service is not running, fail fast instead of using a potentially short-lived request context.** [pkg/checker/snmp/service.go [321-331]](https://github.com/carverauto/serviceradar/pull/2221/files#diff-c592649b582c62a85d1dd416ac7fb609e55531cb43c8199a15f4432ce8ae05d8R321-R331) ```diff s.mu.RLock() serviceCtx := s.serviceCtx s.mu.RUnlock() if serviceCtx == nil { - serviceCtx = ctx + return errors.New("service not started") } // Start collector with the service context so it stops on service shutdown. if err := collector.Start(serviceCtx); err != nil { return fmt.Errorf("%w: %s", errFailedToStartCollector, target.Name) } ``` <details><summary>Suggestion importance[1-10]: 8</summary> __ Why: The suggestion correctly identifies a critical bug where an SNMP collector's lifecycle could be tied to a short-lived request context, causing it to terminate prematurely. Failing fast prevents this and ensures collectors run for their intended duration. </details></details></td><td align=center>Medium </td></tr><tr><td> <details><summary>✅ <s>Fail when all items drop</s></summary> ___ <details><summary>Suggestion Impact:</summary>The commit replaced Enum.each with Enum.reduce to track processed_count, raises a GRPC.RPCError when processed_count == 0 but service_count > 0, and updates record_push_metrics to use processed_count instead of service_count. code diff: ```diff + processed_count = + services + |> Enum.reject(&is_nil/1) + |> Enum.reduce(0, fn service, acc -> + try do + process_service_status(service, metadata) + acc + 1 + rescue + e in GRPC.RPCError -> + Logger.warning( + "Dropping invalid service status from agent #{metadata.agent_id}: #{e.status} #{e.message}" + ) + + acc + + e -> + Logger.warning( + "Dropping service status from agent #{metadata.agent_id} due to error: #{Exception.message(e)}" + ) + + acc + end + end) + + if processed_count == 0 and service_count > 0 do + raise GRPC.RPCError, status: :invalid_argument, message: "no valid service statuses" + end # Record metrics - record_push_metrics(agent_id, service_count) + record_push_metrics(agent_id, processed_count) ``` </details> ___ **Modify the <code>push_status</code> function to track the number of successfully processed service statuses. If all statuses are dropped due to errors, the gRPC request should fail instead of returning a success response.** [elixir/serviceradar_agent_gateway/lib/serviceradar_agent_gateway/agent_gateway_server.ex [252-273]](https://github.com/carverauto/serviceradar/pull/2221/files#diff-369a368073dc8ec1140bcea699005a1ce97a90cd59629df0bd18c71c7ffaae9fR252-R273) ```diff -services -|> Enum.reject(&is_nil/1) -|> Enum.each(fn service -> - try do - process_service_status(service, metadata) - rescue - e in GRPC.RPCError -> - Logger.warning( - "Dropping invalid service status from agent #{metadata.agent_id}: #{e.status} #{e.message}" - ) +processed_count = + services + |> Enum.reject(&is_nil/1) + |> Enum.reduce(0, fn service, acc -> + try do + process_service_status(service, metadata) + acc + 1 + rescue + e in GRPC.RPCError -> + Logger.warning( + "Dropping invalid service status from agent #{metadata.agent_id}: #{e.status} #{e.message}" + ) - e -> - Logger.warning( - "Dropping service status from agent #{metadata.agent_id} due to error: #{Exception.message(e)}" - ) - end -end) + acc + + e -> + Logger.warning( + "Dropping service status from agent #{metadata.agent_id} due to error: #{Exception.message(e)}" + ) + + acc + end + end) + +if processed_count == 0 and service_count > 0 do + raise GRPC.RPCError, status: :invalid_argument, message: "no valid service statuses" +end # Record metrics -record_push_metrics(agent_id, service_count) +record_push_metrics(agent_id, processed_count) %Monitoring.GatewayStatusResponse{received: true} ``` `[Suggestion processed]` <details><summary>Suggestion importance[1-10]: 8</summary> __ Why: This suggestion correctly identifies a silent data loss scenario where the client receives a success response even if all submitted data was invalid and dropped, improving the robustness of the gRPC endpoint. </details></details></td><td align=center>Medium </td></tr><tr><td> <details><summary>✅ <s>Avoid reconnect loops in idle</s></summary> ___ <details><summary>Suggestion Impact:</summary>Updated IsConnected to return true when the connection state is either connectivity.Ready or connectivity.Idle, matching the suggested switch-based logic. code diff: ```diff @@ -250,7 +250,12 @@ if !g.connected || g.conn == nil { return false } - return g.conn.GetState() == connectivity.Ready + switch g.conn.GetState() { + case connectivity.Ready, connectivity.Idle: + return true + default: + return false + } ``` </details> ___ **Update <code>IsConnected</code> to consider the <code>connectivity.Idle</code> state as connected. This prevents unnecessary reconnect loops when the gRPC connection is temporarily idle but still healthy.** [pkg/agent/gateway_client.go [250-253]](https://github.com/carverauto/serviceradar/pull/2221/files#diff-54aa63b6f05aabb0db096685e5be714c0c6d5c5a892310f048ff2ff8a542191fR250-R253) ```diff if !g.connected || g.conn == nil { return false } -return g.conn.GetState() == connectivity.Ready +switch g.conn.GetState() { +case connectivity.Ready, connectivity.Idle: + return true +default: + return false +} ``` <details><summary>Suggestion importance[1-10]: 7</summary> __ Why: The suggestion correctly identifies that treating an `Idle` gRPC connection as disconnected can cause unnecessary reconnect attempts, improving efficiency and reducing load on the agent and gateway. </details></details></td><td align=center>Medium </td></tr><tr><td> <details><summary>✅ <s>Reject missing service names</s></summary> ___ <details><summary>Suggestion Impact:</summary>The commit replaced `to_string() |> String.trim()` with a case expression that returns "" for nil/non-binary values and trims only binaries, enabling the existing validation to reject missing service_name instead of accepting "nil". code diff: ```diff @@ -406,9 +416,11 @@ # Tenant and gateway_id come from server-side metadata (mTLS cert + server identity) # NOT from the service message - this prevents spoofing service_name = - service.service_name - |> to_string() - |> String.trim() + case service.service_name do + name when is_binary(name) -> String.trim(name) + nil -> "" + _ -> "" + end ``` </details> ___ **Explicitly handle <code>nil</code> values for <code>service.service_name</code> to prevent them from being processed as the string <code>"nil"</code>. The current implementation allows this, which can lead to invalid service records.** [elixir/serviceradar_agent_gateway/lib/serviceradar_agent_gateway/agent_gateway_server.ex [408-425]](https://github.com/carverauto/serviceradar/pull/2221/files#diff-369a368073dc8ec1140bcea699005a1ce97a90cd59629df0bd18c71c7ffaae9fR408-R425) ```diff service_name = - service.service_name - |> to_string() - |> String.trim() + case service.service_name do + name when is_binary(name) -> String.trim(name) + nil -> "" + _ -> "" + end cond do service_name == "" -> raise GRPC.RPCError, status: :invalid_argument, message: "service_name is required" byte_size(service_name) > 255 -> raise GRPC.RPCError, status: :invalid_argument, message: "service_name is too long" String.contains?(service_name, ["\n", "\r", "\t"]) -> raise GRPC.RPCError, status: :invalid_argument, message: "service_name contains invalid characters" true -> :ok end ``` `[Suggestion processed]` <details><summary>Suggestion importance[1-10]: 7</summary> __ Why: This is a valid bug fix that prevents data pollution by correctly handling `nil` `service_name` values, which would otherwise be processed as the string `"nil"`. </details></details></td><td align=center>Medium </td></tr> <tr><td align="center" colspan="2">  </td><td></td></tr></tbody></table> </details> <details><summary>✅ Suggestions up to commit b7127ef</summary> <table><thead><tr><td>Category</td><td align=left>Suggestion                                                                                                                                    </td><td align=center>Impact</td></tr><tbody><tr><td rowspan=4>Incremental <a href='https://qodo-merge-docs.qodo.ai/core-abilities/incremental_update/'>[*]</a></td> <td> <details><summary>✅ <s>Restrict monotonic fallback activity</s></summary> ___ <details><summary>Suggestion Impact:</summary>The commit updates agent_active? to gate the monotonic fallback with `and is_nil(agent[:last_seen])`, matching the suggestion’s intent and preventing monotonic timestamps from overriding stale wall-clock last_seen values. It also reuses agent_active? for agent cache pruning, aligning staleness rules with the UI logic. code diff: ```diff - is_integer(agent[:last_seen_mono]) -> + is_integer(agent[:last_seen_mono]) and is_nil(agent[:last_seen]) -> now_mono = System.monotonic_time(:millisecond) delta_ms = max(now_mono - agent[:last_seen_mono], 0) delta_ms < @stale_threshold_ms ``` </details> ___ **In <code>agent_active?</code>, restrict the monotonic time fallback to only apply when the wall-clock <code>last_seen</code> timestamp is <code>nil</code> to ensure correct staleness detection.** [web-ng/lib/serviceradar_web_ng_web/live/infrastructure_live/index.ex [858-866]](https://github.com/carverauto/serviceradar/pull/2221/files#diff-f1af30a84da554ef3d43b226a9303174b33dd5d27e23e9b702031483074e5f54R858-R866) ```diff delta_ms = max(now_ms - last_seen_ms, 0) delta_ms < @stale_threshold_ms - is_integer(agent[:last_seen_mono]) -> + is_integer(agent[:last_seen_mono]) and is_nil(agent[:last_seen]) -> now_mono = System.monotonic_time(:millisecond) delta_ms = max(now_mono - agent[:last_seen_mono], 0) delta_ms < @stale_threshold_ms ``` <details><summary>Suggestion importance[1-10]: 8</summary> __ Why: The suggestion correctly identifies a subtle bug where the monotonic time fallback could incorrectly mark an agent as active if a wall-clock time exists but is stale, improving the accuracy of staleness detection. </details></details></td><td align=center>Medium </td></tr><tr><td> <details><summary>✅ <s>Bound service shutdown with timeout</s></summary> ___ <details><summary>Suggestion Impact:</summary>The commit removed the unbounded deferred server.Stop(context.Background()) and instead calls server.Stop() with a context timeout inside the shutdown goroutine, plus replaces time.After with an explicit timer. It also adds timed server.Stop() calls on the errChan path, further ensuring shutdown is bounded. code diff: ```diff @@ -177,11 +177,6 @@ if err := server.Start(pushCtx); err != nil { return fmt.Errorf("failed to start agent services: %w", err) } - defer func() { - if err := server.Stop(context.Background()); err != nil { - log.Error().Err(err).Msg("Error stopping agent services") - } - }() // Handle shutdown signals sigChan := make(chan os.Signal, 1) @@ -204,21 +199,42 @@ shutdownDone := make(chan struct{}) go func() { + defer close(shutdownDone) + cancel() pushLoop.Stop() <-errChan - close(shutdownDone) + + stopCtx, stopCancel := context.WithTimeout(context.Background(), shutdownTimeout) + defer stopCancel() + if err := server.Stop(stopCtx); err != nil { + log.Error().Err(err).Msg("Error stopping agent services") + } }() + + timer := time.NewTimer(shutdownTimeout) + defer timer.Stop() select { case <-shutdownDone: - case <-time.After(shutdownTimeout): + case <-timer.C: return fmt.Errorf("%w after %s", errShutdownTimeout, shutdownTimeout) } case err := <-errChan: if err != nil && !errors.Is(err, context.Canceled) { + stopCtx, stopCancel := context.WithTimeout(context.Background(), 10*time.Second) + defer stopCancel() + if stopErr := server.Stop(stopCtx); stopErr != nil { + log.Error().Err(stopErr).Msg("Error stopping agent services") + } return fmt.Errorf("push loop error: %w", err) + } + + stopCtx, stopCancel := context.WithTimeout(context.Background(), 10*time.Second) + defer stopCancel() + if stopErr := server.Stop(stopCtx); stopErr != nil { + log.Error().Err(stopErr).Msg("Error stopping agent services") } ``` </details> ___ **Move the <code>server.Stop()</code> call into the timed shutdown block to ensure it is bounded by the <code>shutdownTimeout</code>, preventing the process from hanging during termination.** [cmd/agent/main.go [202-217]](https://github.com/carverauto/serviceradar/pull/2221/files#diff-61358711e980ccf505246fd3915f97cbd3a380e9b66f6fa5aad46749968c5ca3R202-R217) ```diff // Bound shutdown so the process can't hang forever (includes pushLoop.Stop()). const shutdownTimeout = 10 * time.Second shutdownDone := make(chan struct{}) go func() { + defer close(shutdownDone) + cancel() pushLoop.Stop() <-errChan - close(shutdownDone) + + stopCtx, stopCancel := context.WithTimeout(context.Background(), shutdownTimeout) + defer stopCancel() + if err := server.Stop(stopCtx); err != nil { + log.Error().Err(err).Msg("Error stopping agent services") + } }() + +timer := time.NewTimer(shutdownTimeout) +defer timer.Stop() select { case <-shutdownDone: -case <-time.After(shutdownTimeout): +case <-timer.C: return fmt.Errorf("%w after %s", errShutdownTimeout, shutdownTimeout) } ``` `[Suggestion processed]` <details><summary>Suggestion importance[1-10]: 7</summary> __ Why: The suggestion correctly identifies that `server.Stop()` is not covered by the shutdown timeout, which could cause the process to hang, and proposes a valid fix to include it in the timed shutdown sequence. </details></details></td><td align=center>Medium </td></tr><tr><td> <details><summary>✅ <s>Clamp skewed timestamps in pruning</s></summary> ___ <details><summary>Suggestion Impact:</summary>The commit updated the pruned_gateways_cache pruning logic to clamp delta_ms using max(now_ms - last_ms, 0) and removed the delta_ms < 0 check, preventing active gateways from being pruned due to clock skew. It also included additional related changes to agent pruning logic. code diff: ```diff + pruned_gateways_cache = + socket.assigns.gateways_cache + |> Enum.reject(fn {_id, gw} -> + last_ms = parse_timestamp_to_ms(Map.get(gw, :last_heartbeat)) + + delta_ms = + if is_integer(last_ms) do + max(now_ms - last_ms, 0) + else + nil + end + + not is_integer(delta_ms) or delta_ms > @stale_threshold_ms + end) + |> Map.new() ``` </details> ___ **To prevent incorrectly pruning active gateways due to clock skew, clamp negative <code>delta_ms</code> values to <code>0</code> in the <code>pruned_gateways_cache</code> logic.** [web-ng/lib/serviceradar_web_ng_web/live/infrastructure_live/index.ex [97-106]](https://github.com/carverauto/serviceradar/pull/2221/files#diff-f1af30a84da554ef3d43b226a9303174b33dd5d27e23e9b702031483074e5f54R97-R106) ```diff pruned_gateways_cache = socket.assigns.gateways_cache |> Enum.reject(fn {_id, gw} -> last_ms = parse_timestamp_to_ms(Map.get(gw, :last_heartbeat)) - delta_ms = if is_integer(last_ms), do: now_ms - last_ms, else: nil + delta_ms = + if is_integer(last_ms) do + max(now_ms - last_ms, 0) + else + nil + end - not is_integer(delta_ms) or delta_ms < 0 or delta_ms > @stale_threshold_ms + not is_integer(delta_ms) or delta_ms > @stale_threshold_ms end) |> Map.new() ``` <details><summary>Suggestion importance[1-10]: 7</summary> __ Why: This is a valid bug fix for handling clock skew, which could cause active gateways to be incorrectly pruned, but the same logic is already correctly implemented in `compute_gateways`, making this a consistency fix. </details></details></td><td align=center>Medium </td></tr><tr><td> <details><summary>✅ <s>Avoid dropping agents on skew</s></summary> ___ <details><summary>Suggestion Impact:</summary>The pruning logic was changed to avoid dropping entities due to negative deltas: gateways now clamp delta_ms with `max(..., 0)` and no longer prune on `delta_ms < 0`. For agents, instead of directly computing delta_ms, the code now delegates pruning to `agent_active?/2`, which already clamps negative deltas and applies consistent staleness rules, achieving the same goal (avoid pruning due to clock skew) though via a different implementation. code diff: ```diff + pruned_gateways_cache = + socket.assigns.gateways_cache + |> Enum.reject(fn {_id, gw} -> + last_ms = parse_timestamp_to_ms(Map.get(gw, :last_heartbeat)) + + delta_ms = + if is_integer(last_ms) do + max(now_ms - last_ms, 0) + else + nil + end + + not is_integer(delta_ms) or delta_ms > @stale_threshold_ms + end) + |> Map.new() + + pruned_agents_cache = + socket.assigns.agents_cache + |> Enum.reject(fn {_id, agent} -> + # Use the same staleness rules as the UI (wall clock first, monotonic fallback) + not agent_active?(agent, now_ms) + end) + |> Map.new() gateways = compute_gateways(pruned_gateways_cache) @@ -859,7 +861,7 @@ delta_ms = max(now_ms - last_seen_ms, 0) delta_ms < @stale_threshold_ms - is_integer(agent[:last_seen_mono]) -> + is_integer(agent[:last_seen_mono]) and is_nil(agent[:last_seen]) -> now_mono = System.monotonic_time(:millisecond) delta_ms = max(now_mono - agent[:last_seen_mono], 0) delta_ms < @stale_threshold_ms ``` </details> ___ **To prevent incorrectly pruning active agents due to clock skew, clamp negative <code>delta_ms</code> values to <code>0</code> in the <code>pruned_agents_cache</code> logic.** [web-ng/lib/serviceradar_web_ng_web/live/infrastructure_live/index.ex [108-117]](https://github.com/carverauto/serviceradar/pull/2221/files#diff-f1af30a84da554ef3d43b226a9303174b33dd5d27e23e9b702031483074e5f54R108-R117) ```diff pruned_agents_cache = socket.assigns.agents_cache |> Enum.reject(fn {_id, agent} -> last_ms = agent_last_seen_ms(agent) - delta_ms = if is_integer(last_ms), do: now_ms - last_ms, else: nil + delta_ms = + if is_integer(last_ms) do + max(now_ms - last_ms, 0) + else + nil + end - not is_integer(delta_ms) or delta_ms < 0 or delta_ms > @stale_threshold_ms + not is_integer(delta_ms) or delta_ms > @stale_threshold_ms end) |> Map.new() ``` <details><summary>Suggestion importance[1-10]: 7</summary> __ Why: This is a valid bug fix for handling clock skew, which could cause active agents to be incorrectly pruned, but the same logic is already correctly implemented in `agent_active?`, making this a consistency fix. </details></details></td><td align=center>Medium </td></tr><tr><td rowspan=3>Possible issue</td> <td> <details><summary>✅ <s>Bind collectors to service context</s></summary> ___ <details><summary>Suggestion Impact:</summary>The commit moved the serviceCtx lookup to before starting the collector and changed collector.Start to use serviceCtx (falling back to ctx when nil), aligning collector lifetime with the service context. code diff: ```diff - // Start collector - if err := collector.Start(ctx); err != nil { + s.mu.RLock() + serviceCtx := s.serviceCtx + s.mu.RUnlock() + if serviceCtx == nil { + serviceCtx = ctx + } + + // Start collector with the service context so it stops on service shutdown. + if err := collector.Start(serviceCtx); err != nil { return fmt.Errorf("%w: %s", errFailedToStartCollector, target.Name) } @@ -363,12 +370,6 @@ Str("host_name", target.Name). Msg("Initialized service status") - s.mu.RLock() - serviceCtx := s.serviceCtx - s.mu.RUnlock() - if serviceCtx == nil { - serviceCtx = ctx - } go s.processResults(serviceCtx, target.Name, results, aggregator) ``` </details> ___ **Start the SNMP collector with the service's long-lived context (<code>serviceCtx</code>) instead of the caller-provided context to prevent resource leaks when the service is stopped or restarted.** [pkg/checker/snmp/service.go [322-372]](https://github.com/carverauto/serviceradar/pull/2221/files#diff-c592649b582c62a85d1dd416ac7fb609e55531cb43c8199a15f4432ce8ae05d8R322-R372) ```diff -if err := collector.Start(ctx); err != nil { - return fmt.Errorf("%w: %s", errFailedToStartCollector, target.Name) -} -... s.mu.RLock() serviceCtx := s.serviceCtx s.mu.RUnlock() if serviceCtx == nil { serviceCtx = ctx } + +if err := collector.Start(serviceCtx); err != nil { + return fmt.Errorf("%w: %s", errFailedToStartCollector, target.Name) +} +... go s.processResults(serviceCtx, target.Name, results, aggregator) ``` <details><summary>Suggestion importance[1-10]: 8</summary> __ Why: The suggestion correctly identifies a potential goroutine leak where a collector started via `AddTarget` might not be tied to the service's lifecycle, and proposes using the service's context to fix it. </details></details></td><td align=center>Medium </td></tr><tr><td> <details><summary>✅ <s>Prune agents using activity predicate</s></summary> ___ <details><summary>Suggestion Impact:</summary>The commit updated pruned_agents_cache to reject agents based on `not agent_active?(agent, now_ms)` instead of re-implementing staleness checks, aligning pruning with the UI logic. It also refined agent_active?/2 to only use the monotonic timestamp fallback when wall-clock `:last_seen` is nil, further ensuring consistent activity evaluation. code diff: ```diff + pruned_agents_cache = + socket.assigns.agents_cache + |> Enum.reject(fn {_id, agent} -> + # Use the same staleness rules as the UI (wall clock first, monotonic fallback) + not agent_active?(agent, now_ms) + end) + |> Map.new() gateways = compute_gateways(pruned_gateways_cache) @@...

qodo-code-review[bot] commented

2026-01-03 22:00:43 +00:00

(Migrated from github.com)

Author

Owner

Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707397340
Original created: 2026-01-03T22:00:43Z

Persistent suggestions updated to latest commit 1c51855

Imported GitHub PR comment. Original author: @qodo-code-review[bot] Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707397340 Original created: 2026-01-03T22:00:43Z --- **[Persistent suggestions](https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3706934588)** updated to latest commit 1c51855

qodo-code-review[bot] commented

2026-01-03 23:52:45 +00:00

(Migrated from github.com)

Author

Owner

Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707456600
Original created: 2026-01-03T23:52:45Z

Persistent suggestions updated to latest commit 9c018c2

Imported GitHub PR comment. Original author: @qodo-code-review[bot] Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707456600 Original created: 2026-01-03T23:52:45Z --- **[Persistent suggestions](https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3706934588)** updated to latest commit 9c018c2

qodo-code-review[bot] commented

2026-01-04 00:09:57 +00:00

(Migrated from github.com)

Author

Owner

Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707466559
Original created: 2026-01-04T00:09:57Z

Persistent suggestions updated to latest commit 31c2e30

Imported GitHub PR comment. Original author: @qodo-code-review[bot] Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707466559 Original created: 2026-01-04T00:09:57Z --- **[Persistent suggestions](https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3706934588)** updated to latest commit 31c2e30

qodo-code-review[bot] commented

2026-01-04 01:34:43 +00:00

(Migrated from github.com)

Author

Owner

Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707515479
Original created: 2026-01-04T01:34:43Z

Persistent suggestions updated to latest commit 31165fe

Imported GitHub PR comment. Original author: @qodo-code-review[bot] Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707515479 Original created: 2026-01-04T01:34:43Z --- **[Persistent suggestions](https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3706934588)** updated to latest commit 31165fe

qodo-code-review[bot] commented

2026-01-04 02:20:43 +00:00

(Migrated from github.com)

Author

Owner

Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707544156
Original created: 2026-01-04T02:20:43Z

Persistent suggestions updated to latest commit b3ec5cd

Imported GitHub PR comment. Original author: @qodo-code-review[bot] Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707544156 Original created: 2026-01-04T02:20:43Z --- **[Persistent suggestions](https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3706934588)** updated to latest commit b3ec5cd

qodo-code-review[bot] commented

2026-01-04 05:07:29 +00:00

(Migrated from github.com)

Author

Owner

Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707670337
Original created: 2026-01-04T05:07:29Z

Persistent suggestions updated to latest commit 77a078e

Imported GitHub PR comment. Original author: @qodo-code-review[bot] Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707670337 Original created: 2026-01-04T05:07:29Z --- **[Persistent suggestions](https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3706934588)** updated to latest commit 77a078e

qodo-code-review[bot] commented

2026-01-04 06:56:41 +00:00

(Migrated from github.com)

Author

Owner

Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707800531
Original created: 2026-01-04T06:56:41Z

Persistent suggestions updated to latest commit 6196af3

Imported GitHub PR comment. Original author: @qodo-code-review[bot] Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707800531 Original created: 2026-01-04T06:56:41Z --- **[Persistent suggestions](https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3706934588)** updated to latest commit 6196af3

qodo-code-review[bot] commented

2026-01-04 07:11:17 +00:00

(Migrated from github.com)

Author

Owner

Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707810490
Original created: 2026-01-04T07:11:17Z

Persistent suggestions updated to latest commit da4580a

Imported GitHub PR comment. Original author: @qodo-code-review[bot] Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707810490 Original created: 2026-01-04T07:11:17Z --- **[Persistent suggestions](https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3706934588)** updated to latest commit da4580a

qodo-code-review[bot] commented

2026-01-04 07:21:19 +00:00

(Migrated from github.com)

Author

Owner

Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707818048
Original created: 2026-01-04T07:21:19Z

Persistent suggestions updated to latest commit db99aac

Imported GitHub PR comment. Original author: @qodo-code-review[bot] Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707818048 Original created: 2026-01-04T07:21:19Z --- **[Persistent suggestions](https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3706934588)** updated to latest commit db99aac

qodo-code-review[bot] commented

2026-01-04 08:04:14 +00:00

(Migrated from github.com)

Author

Owner

Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707846321
Original created: 2026-01-04T08:04:14Z

Persistent suggestions updated to latest commit c0cb179

Imported GitHub PR comment. Original author: @qodo-code-review[bot] Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707846321 Original created: 2026-01-04T08:04:14Z --- **[Persistent suggestions](https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3706934588)** updated to latest commit c0cb179

No reviewers

No labels

1week

2weeks

Failed compliance check

IP cameras

NATS

Possible security concern

serviceradar-agent-gateway

No milestone

No project

No assignees

1 participant

Notifications

Due date

The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference

carverauto/serviceradar!2629

No description provided.

Rows
Columns

Feat/agent gateway work #2629

User description

IMPORTANT: Please sign the Developer Certificate of Origin

Describe your changes

Issue ticket number and link

Code checklist before requesting a review

PR Type

Description

Diagram Walkthrough

File Walkthrough

Original author: @qodo-code-review[bot] Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3706933859 Original created: 2026-01-03T09:48:48Z

PR Compliance Guide 🔍

(Compliance updated until commit github.com/carverauto/serviceradar@d0138ce012)

Previous compliance checks

Original author: @qodo-code-review[bot] Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3706934588 Original created: 2026-01-03T09:50:14Z

PR Code Suggestions ✨

Previous suggestions

Original author: @qodo-code-review[bot] Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707397340 Original created: 2026-01-03T22:00:43Z

Original author: @qodo-code-review[bot] Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707456600 Original created: 2026-01-03T23:52:45Z

Original author: @qodo-code-review[bot] Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707466559 Original created: 2026-01-04T00:09:57Z

Original author: @qodo-code-review[bot] Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707515479 Original created: 2026-01-04T01:34:43Z

Original author: @qodo-code-review[bot] Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707544156 Original created: 2026-01-04T02:20:43Z

Original author: @qodo-code-review[bot] Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707670337 Original created: 2026-01-04T05:07:29Z

Original author: @qodo-code-review[bot] Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707800531 Original created: 2026-01-04T06:56:41Z

Original author: @qodo-code-review[bot] Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707810490 Original created: 2026-01-04T07:11:17Z

Original author: @qodo-code-review[bot] Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707818048 Original created: 2026-01-04T07:21:19Z

Original author: @qodo-code-review[bot] Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707846321 Original created: 2026-01-04T08:04:14Z

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3706933859
Original created: 2026-01-03T09:48:48Z

(Compliance updated until commit `github.com/carverauto/serviceradar@d0138ce012`)

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3706934588
Original created: 2026-01-03T09:50:14Z

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707397340
Original created: 2026-01-03T22:00:43Z

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707456600
Original created: 2026-01-03T23:52:45Z

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707466559
Original created: 2026-01-04T00:09:57Z

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707515479
Original created: 2026-01-04T01:34:43Z

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707544156
Original created: 2026-01-04T02:20:43Z

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707670337
Original created: 2026-01-04T05:07:29Z

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707800531
Original created: 2026-01-04T06:56:41Z

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707810490
Original created: 2026-01-04T07:11:17Z

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707818048
Original created: 2026-01-04T07:21:19Z

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707846321
Original created: 2026-01-04T08:04:14Z