Feat/agent gateway work #2629

Merged
mfreeman451 merged 62 commits from refs/pull/2629/head into testing 2026-01-04 10:06:57 +00:00
mfreeman451 commented 2026-01-03 09:47:10 +00:00 (Migrated from github.com)
Owner

Imported from GitHub pull request.

Original GitHub pull request: #2221
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/pull/2221
Original created: 2026-01-03T09:47:10Z
Original updated: 2026-01-04T10:06:59Z
Original head: carverauto/serviceradar:feat/agent-gateway-work
Original base: testing
Original merged: 2026-01-04T10:06:57Z by @mfreeman451

User description

IMPORTANT: Please sign the Developer Certificate of Origin

Thank you for your contribution to ServiceRadar. Please note, when contributing, the developer must include
a DCO sign-off statement indicating the DCO acceptance in one commit message. Here
is an example DCO Signed-off-by line in a commit message:

Signed-off-by: J. Doe <j.doe@domain.com>

Describe your changes

Code checklist before requesting a review

  • I have signed the DCO?
  • The build completes without errors?
  • All tests are passing when running make test?

PR Type

Enhancement, Tests


Description

  • Implements comprehensive NATS account management system with operator bootstrap, tenant account creation, and user credential generation

  • Adds new AgentGatewayService gRPC API for agent-to-gateway status communication with push and streaming capabilities

  • Introduces multi-tenant identity extraction from mTLS certificates with tenant context propagation throughout the system

  • Implements agent push mode for active status transmission to gateway with automatic reconnection and exponential backoff

  • Adds NATS bootstrap CLI command supporting local and remote modes for operator and account initialization

  • Extends CLI with admin command dispatcher for managing NATS bootstrap and account operations

  • Implements tenant-aware NATS event subject prefixing via environment variable control

  • Adds comprehensive test coverage for account management, user credentials, operator initialization, and tenant context handling

  • Extends data models with collector package types, credentials structures, and NATS configuration fields

  • Normalizes proto file path references for KV service consistency


Diagram Walkthrough

```mermaid
flowchart LR
  Agent["Agent<br/>Push Mode"] -->|"PushStatus<br/>StreamStatus"| Gateway["Agent Gateway<br/>Service"]
  Gateway -->|"mTLS<br/>Tenant Context"| DataSvc["Data Service<br/>NATS Account Mgmt"]
  DataSvc -->|"JWT Operations<br/>Credential Gen"| NATS["NATS Server<br/>Operator/Accounts"]
  CLI["CLI<br/>nats-bootstrap"] -->|"Bootstrap<br/>Config Gen"| NATS
  Cert["mTLS Certificate<br/>CN Parsing"] -->|"Tenant ID<br/>Extraction"| TenantCtx["Tenant Context<br/>Propagation"]
  TenantCtx -->|"Subject Prefixing<br/>Channel Routing"| EventPub["Event Publisher<br/>NATS Events"]
```

File Walkthrough

Relevant files
Enhancement
20 files
nats_account.pb.go
NATS account management protobuf definitions                         

proto/nats_account.pb.go

  • Generated protobuf Go code for NATS account management with 1266 lines
    of new code
  • Defines message types for account limits, subject mappings, tenant
    account creation, and user credentials
  • Implements credential types (collector, service, admin) and user
    permission structures
  • Includes RPC service definitions for bootstrap, account creation, JWT
    signing, and credential generation
+1266/-0
nats_bootstrap.go
NATS bootstrap CLI implementation                                               

pkg/cli/nats_bootstrap.go

  • Implements NATS bootstrap CLI command with 961 lines of new
    functionality
  • Provides local and remote bootstrap modes for operator and account
    initialization
  • Generates NATS configuration files, JWTs, credentials, and
    system/platform accounts
  • Includes token generation, status checking, and tenant listing
    capabilities
+961/-0 
cli.go
CLI command routing for admin operations                                 

pkg/cli/cli.go

  • Adds new AdminHandler struct to dispatch nested admin commands
  • Registers nats-bootstrap and admin command handlers in the command map
  • Implements routing logic for admin subcommands (currently supporting
    admin nats)
+22/-0   
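The nested dispatch described above can be sketched as a map of command handlers, where `admin` itself is a handler that routes to its own subcommand map. This is an illustrative sketch only; the names `handler` and `AdminHandler.Run` are assumptions, not the actual `pkg/cli/cli.go` API:

```go
package main

import (
	"fmt"
	"os"
)

// handler is the signature every CLI command implements in this sketch.
type handler func(args []string) error

// AdminHandler dispatches nested "admin <resource>" subcommands.
type AdminHandler struct {
	subcommands map[string]handler
}

// Run routes the first argument to a registered admin resource handler.
func (a *AdminHandler) Run(args []string) error {
	if len(args) == 0 {
		return fmt.Errorf("usage: admin <resource>")
	}
	h, ok := a.subcommands[args[0]]
	if !ok {
		return fmt.Errorf("unknown admin resource: %s", args[0])
	}
	return h(args[1:])
}

func main() {
	admin := &AdminHandler{subcommands: map[string]handler{
		// Currently only "admin nats" is supported, per the PR description.
		"nats": func(args []string) error {
			fmt.Println("admin nats invoked with", args)
			return nil
		},
	}}

	// Top-level command map: admin delegates to its own dispatcher.
	commands := map[string]handler{
		"admin":          admin.Run,
		"nats-bootstrap": func(args []string) error { return nil },
	}

	if err := commands["admin"]([]string{"nats", "status"}); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```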
nats_account_service.go
NATS Account Service Implementation with JWT Operations   

pkg/datasvc/nats_account_service.go

  • Implements NATSAccountServer struct for stateless NATS JWT/NKeys
    cryptographic operations
  • Provides gRPC methods for operator bootstrap, tenant account creation,
    user credential generation, and JWT signing
  • Includes mTLS authorization checks and resolver connection management
    for pushing JWTs to NATS
  • Supports file-based and NATS-based JWT resolver configurations with
    proper error handling
+808/-0 
monitoring.pb.go
Gateway Status Messages and Service Definition                     

proto/monitoring.pb.go

  • Adds GatewayStatusRequest, GatewayStatusResponse, GatewayStatusChunk,
    and GatewayServiceStatus message types
  • Includes tenant-aware fields (tenant_id, tenant_slug) for multi-tenant
    routing
  • Adds new AgentGatewayService with PushStatus and StreamStatus RPC
    methods
  • Updates message type count from 13 to 17 and service count from 2 to 3
+501/-20
nats_account_grpc.pb.go
NATS Account Service gRPC Stubs Generation                             

proto/nats_account_grpc.pb.go

  • Auto-generated gRPC service stubs for NATSAccountService interface
  • Implements client and server interfaces with six RPC methods:
    BootstrapOperator, GetOperatorInfo, CreateTenantAccount,
    GenerateUserCredentials, SignAccountJWT, PushAccountJWT
  • Includes unary server handlers and service descriptor registration
+376/-0 
operator.go
NATS Operator Key Management and Bootstrap                             

pkg/nats/accounts/operator.go

  • Defines Operator struct for managing NATS operator keys and signing
    operations
  • Implements BootstrapOperator function to initialize new or import
    existing operator keys
  • Provides key generation utilities for operators, accounts, and users
    with NKeys
  • Includes configuration loading from multiple sources (direct,
    environment, files)
+476/-0 
events.go
NATS Configuration Extension for Credentials                         

pkg/models/events.go

  • Adds CredsFile field to NATSConfig struct for system account
    credentials file path
  • Maintains backward compatibility with existing NATS configuration
    fields
+4/-3     
tenant.go
Multi-tenant identity extraction from certificates             

pkg/tenant/tenant.go

  • New package providing multi-tenant identity extraction from mTLS
    certificate CNs
  • Implements CN parsing for CNs of the form
    <component_id>.<partition_id>.<tenant_slug>.serviceradar
  • Provides context-based tenant propagation and NATS channel prefixing
    utilities
  • Supports certificate extraction from gRPC contexts, TLS states, and
    PEM files
+315/-0 
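The CN parsing described above splits the certificate common name into its four dot-separated segments. A minimal sketch; the struct and function names here are illustrative, not the actual `pkg/tenant/tenant.go` API:

```go
package main

import (
	"fmt"
	"strings"
)

// TenantInfo holds the identity parsed from an mTLS certificate CN of the
// form <component_id>.<partition_id>.<tenant_slug>.serviceradar.
type TenantInfo struct {
	ComponentID string
	PartitionID string
	TenantSlug  string
}

// ParseCN validates the CN shape and extracts the tenant identity.
func ParseCN(cn string) (*TenantInfo, error) {
	parts := strings.Split(cn, ".")
	if len(parts) != 4 || parts[3] != "serviceradar" {
		return nil, fmt.Errorf("invalid CN %q: want <component>.<partition>.<tenant>.serviceradar", cn)
	}
	for _, p := range parts[:3] {
		if p == "" {
			return nil, fmt.Errorf("invalid CN %q: empty segment", cn)
		}
	}
	return &TenantInfo{ComponentID: parts[0], PartitionID: parts[1], TenantSlug: parts[2]}, nil
}

func main() {
	info, err := ParseCN("agent-01.us-east.acme.serviceradar")
	fmt.Println(info, err)
}
```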
nats_config.go
NATS server configuration and bootstrap utilities               

pkg/edgeonboarding/nats_config.go

  • Configuration generation for NATS server with operator JWT and system
    account setup
  • Template-based config generation supporting TLS, JetStream,
    clustering, and leaf nodes
  • Helper functions for generating collector-specific NATS permissions by
    type
  • Data structures for NATS bootstrap requests and collector credentials
+328/-0 
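The template-based config generation above can be illustrated with Go's `text/template`. This is a deliberately trimmed stand-in: the real template in `pkg/edgeonboarding/nats_config.go` also covers TLS, clustering, and leaf nodes, and its field names will differ:

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// natsConfTmpl is a simplified stand-in for the NATS server config template.
const natsConfTmpl = `operator: {{.OperatorJWT}}
system_account: {{.SystemAccount}}
{{- if .JetStream}}
jetstream {
  store_dir: {{.StoreDir}}
}
{{- end}}
`

type natsConfParams struct {
	OperatorJWT   string
	SystemAccount string
	JetStream     bool
	StoreDir      string
}

// renderNATSConf executes the template against the bootstrap parameters.
func renderNATSConf(p natsConfParams) (string, error) {
	t, err := template.New("nats").Parse(natsConfTmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, p); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	conf, _ := renderNATSConf(natsConfParams{
		OperatorJWT:   "/etc/nats/operator.jwt",
		SystemAccount: "SYS_PUBLIC_KEY",
		JetStream:     true,
		StoreDir:      "/data/jetstream",
	})
	fmt.Print(conf)
}
```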
gateway_client.go
Agent gateway client with reconnection logic                         

pkg/agent/gateway_client.go

  • New gRPC client for agent-to-gateway communication with connection
    management
  • Implements status push and streaming with automatic reconnection and
    exponential backoff
  • Supports mTLS security configuration and graceful disconnect
  • Provides connection state tracking and timeout handling
+270/-0 
account_manager.go
NATS account creation and JWT signing                                       

pkg/nats/accounts/account_manager.go

  • Account signer for creating and managing NATS tenant accounts with JWT
    signing
  • Implements subject mapping for tenant isolation with {{tenant}}
    placeholder replacement
  • Supports account limits, JetStream enablement, and user key
    revocations
  • Provides stateless account creation and JWT regeneration operations
+238/-0 
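The `{{tenant}}` placeholder replacement that drives tenant isolation is a straightforward string substitution over subject templates. A sketch with assumed names (the real `pkg/nats/accounts/account_manager.go` may structure this differently):

```go
package main

import (
	"fmt"
	"strings"
)

// expandSubjectMapping substitutes the {{tenant}} placeholder in a subject
// template so each tenant gets an isolated subject space.
func expandSubjectMapping(template, tenantSlug string) string {
	return strings.ReplaceAll(template, "{{tenant}}", tenantSlug)
}

func main() {
	// Hypothetical subject templates for a tenant account.
	for _, tmpl := range []string{
		"events.{{tenant}}.syslog",
		"telemetry.{{tenant}}.>",
	} {
		fmt.Println(expandSubjectMapping(tmpl, "acme"))
	}
}
```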
monitoring_grpc.pb.go
New AgentGatewayService gRPC API                                                 

proto/monitoring_grpc.pb.go

  • Adds new AgentGatewayService gRPC service with PushStatus and
    StreamStatus methods
  • Marks PollerService as deprecated with migration guidance to
    AgentGatewayService
  • Implements client and server stubs for agent-to-gateway status
    communication
  • Supports both unary and client-streaming RPC patterns
+148/-0 
user_manager.go
NATS user credential generation and management                     

pkg/nats/accounts/user_manager.go

  • User credential generation for NATS with type-based permissions
    (collector, service, admin)
  • Implements .creds file formatting for NATS authentication
  • Supports custom permissions and credential expiration
  • Applies tenant-scoped permissions based on credential type
+217/-0 
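The `.creds` file format mentioned above pairs the signed user JWT with the NKey seed between decorated markers that NATS clients parse. A minimal rendering sketch (real creds files also carry a warning comment block between the two sections, omitted here for brevity):

```go
package main

import "fmt"

// formatCreds renders a NATS .creds file body containing the user JWT and
// the NKey seed between the standard BEGIN/END markers.
func formatCreds(jwt, seed string) string {
	return fmt.Sprintf(`-----BEGIN NATS USER JWT-----
%s
------END NATS USER JWT------

-----BEGIN USER NKEY SEED-----
%s
------END USER NKEY SEED------
`, jwt, seed)
}

func main() {
	// Placeholder values; real JWTs and seeds come from the account signer.
	fmt.Print(formatCreds("eyJhbGciOiJlZDI1NTE5...", "SUAEXAMPLESEED..."))
}
```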
push_loop.go
Agent status push loop implementation                                       

pkg/agent/push_loop.go

  • Push loop implementation for periodic agent status collection and
    gateway transmission
  • Collects status from sweep services and configured checkers
  • Converts internal status format to gateway protocol messages
  • Handles source IP detection and tenant/partition metadata propagation
+222/-0 
events.go
Tenant-aware NATS event subject prefixing                               

pkg/natsutil/events.go

  • Adds tenant prefixing support for NATS event subjects via environment
    variable control
  • Implements applyTenantPrefix() to inject tenant slug into subject
    paths
  • Adds NewEventPublisherWithPrefixing() for explicit prefix control
  • Provides helper functions to check and modify prefixing state
+74/-13 
main.go
Data service NATS account initialization                                 

cmd/data-services/main.go

  • Initializes NATS account service with operator configuration and
    resolver paths
  • Supports operator bootstrap from environment variables and config
    files
  • Registers NATSAccountService gRPC endpoint for account management
  • Configures system account credentials and resolver client for JWT
    persistence
+68/-0   
edge_onboarding.go
Collector package and credential data models                         

pkg/models/edge_onboarding.go

  • Adds CollectorType enum for collector classification (flowgger, trapd,
    netflow, otel)
  • Adds CollectorPackageStatus enum for deployment lifecycle tracking
  • Introduces CollectorPackage, NatsCredential, and
    CollectorDownloadResult data structures
  • Supports collector package metadata, credential management, and
    download tracking
+71/-0   
main.go
Agent push mode implementation with gateway integration   

cmd/agent/main.go

  • Implements push mode for agents to actively push status to gateway
  • Adds runPushMode() with gateway client initialization and push loop
    management
  • Implements graceful shutdown with signal handling for push mode
  • Maintains legacy server mode as fallback with deprecation warning
+82/-0   
main.go
CLI command routing for NATS bootstrap and admin                 

cmd/cli/main.go

  • Adds nats-bootstrap command routing for NATS server bootstrap
    operations
  • Adds admin command dispatcher with nats subcommand support
  • Implements dispatchAdminCommand() for admin resource routing
  • Maintains backward compatibility with existing bcrypt mode
+14/-0   
Formatting
1 file
kv.pb.go
Proto file path normalization for KV service                         

proto/kv.pb.go

  • Updates proto file path references from proto/kv.proto to kv.proto
  • Renames internal variable references from file_proto_kv_proto_* to
    file_kv_proto_*
  • Updates all message type index references and descriptor function
    calls
+83/-83 
Tests
6 files
nats_account_service_test.go
NATS Account Service Unit Tests                                                   

pkg/datasvc/nats_account_service_test.go

  • Comprehensive test suite for NATSAccountServer with authorization
    context setup
  • Tests account creation, user credential generation, and JWT signing
    with various scenarios
  • Includes proto conversion helper function tests for limits, credential
    types, and permissions
  • Validates error handling for missing required fields and invalid
    inputs
+474/-0 
user_manager_test.go
User Credentials Generation Tests                                               

pkg/nats/accounts/user_manager_test.go

  • Tests for GenerateUserCredentials function covering collector,
    service, and admin credential types
  • Validates custom permissions, expiration handling, and JWT claim
    structure
  • Tests error cases for invalid seeds and wrong key types
  • Includes helper function tests for credentials file formatting
+440/-0 
account_manager_test.go
Account manager test suite for NATS tenant accounts           

pkg/nats/accounts/account_manager_test.go

  • Comprehensive test suite for AccountSigner with 370 lines covering
    account creation, JWT signing, and subject mappings
  • Tests for tenant account creation with limits, custom mappings, and
    revocations
  • Validation of default subject mappings for collectors (events, syslog,
    snmp, netflow, otel, logs, telemetry)
  • Tests for account key derivation and multiple account generation
+370/-0 
tenant_test.go
Tenant identity parsing and context management tests         

pkg/tenant/tenant_test.go

  • Test suite for tenant identity parsing and context management with 354
    lines
  • Tests for CN parsing with various formats and error cases
  • Tests for tenant info string representation and NATS channel prefixing
  • Tests for context-based tenant propagation and helper functions
+354/-0 
operator_test.go
NATS operator initialization and JWT signing tests             

pkg/nats/accounts/operator_test.go

  • Test suite for NATS operator key generation and JWT creation with 357
    lines
  • Tests for operator initialization from seed, file, and environment
    variable
  • Tests for system account public key configuration and invalid seed
    handling
  • Tests for operator JWT creation and key encoding/decoding for storage
+357/-0 
events_test.go
Event publisher tenant prefixing tests                                     

pkg/natsutil/events_test.go

  • Tests for tenant prefix environment variable detection
  • Tests for subject prefixing with and without tenant context
  • Tests for NewEventPublisherWithPrefixing() and SetTenantPrefixing()
    methods
  • Validates fallback to DefaultTenant when tenant not in context
+135/-0 
Configuration changes
1 file
.gitkeep
Docker Compose Credentials Directory                                         

docker/compose/creds/.gitkeep

  • Creates placeholder directory for Docker Compose credentials storage
+1/-0     
Additional files
101 files
.bazelignore +4/-0     
.bazelrc +2/-0     
.env.example +38/-0   
AGENTS.md +171/-2 
MODULE.bazel +5/-0     
Makefile +4/-0     
README-Docker.md +16/-1   
BUILD.bazel +6/-0     
BUILD.bazel +12/-0   
mix_release.bzl +120/-49
config.json +5/-6     
config.rs +25/-0   
nats.rs +4/-0     
zen-consumer-with-otel.json +1/-0     
zen-consumer.json +1/-0     
BUILD.bazel +1/-0     
flowgger.toml +1/-0     
nats_output.rs +14/-0   
otel.toml +1/-0     
config.rs +13/-0   
nats_output.rs +7/-0     
setup.rs +1/-0     
config.rs +21/-0   
main.rs +21/-1   
docker-compose.elx.yml +108/-0 
docker-compose.spiffe.yml +8/-0     
docker-compose.yml +325/-269
Dockerfile.agent-gateway +94/-0   
Dockerfile.core-elx +108/-0 
Dockerfile.poller-elx +95/-0   
Dockerfile.web-ng +6/-0     
agent-minimal.docker.json +6/-6     
agent.docker.json +9/-15   
agent.mtls.json +13/-3   
datasvc.docker.json +3/-2     
datasvc.mtls.json +14/-1   
db-event-writer.mtls.json +1/-0     
entrypoint-certs.sh +11/-6   
flowgger.docker.toml +2/-1     
generate-certs.sh +216/-10
nats.docker.conf +16/-160
netflow-consumer.mtls.json +1/-0     
otel.docker.toml +2/-0     
pg_hba.conf +9/-0     
pg_ident.conf +17/-0   
ssl_dist.core.conf +17/-0   
ssl_dist.gateway.conf +17/-0   
ssl_dist.poller.conf +17/-0   
ssl_dist.web.conf +17/-0   
trapd.docker.json +2/-1     
zen.docker.json +2/-1     
BUILD.bazel +80/-0   
push_targets.bzl +2/-0     
architecture.md +58/-6   
ash-api.md +305/-0 
ash-authentication.md +244/-0 
ash-authorization.md +283/-0 
ash-domains.md +223/-0 
ash-migration-guide.md +339/-0 
docker-setup.md +39/-4   
edge-agents.md +253/-0 
edge-onboarding.md +316/-273
security-architecture.md +238/-0 
spiffe-identity.md +52/-0   
troubleshooting-guide.md +51/-0   
datasvc.ex +32/-2   
BUILD.bazel +30/-0   
config.exs +13/-0   
dev.exs +4/-0     
prod.exs +4/-0     
runtime.exs +202/-0 
test.exs +20/-0   
agent_client.ex +472/-0 
agent_gateway_server.ex +226/-0 
application.ex +192/-0 
config.ex +311/-0 
endpoint.ex +14/-0   
status_processor.ex +202/-0 
task_executor.ex +435/-0 
mix.exs +69/-0   
env.sh.eex +78/-0   
serviceradar-poller.service +66/-0   
vm.args.eex +50/-0   
registration_test.exs +235/-0 
test_helper.exs +7/-0     
.formatter.exs +15/-0   
BUILD.bazel +14/-0   
README.md +269/-0 
config.exs +101/-0 
dev.exs +66/-0   
prod.exs +6/-0     
runtime.exs +196/-0 
test.exs +91/-0   
eventwriter.md +370/-0 
device.ex +608/-0 
device_registry.ex +284/-0 
telemetry.ex +267/-0 
agent_tracker.ex +177/-0 
application.ex +381/-0 
backoff.ex +59/-0   
Additional files not shown

for account management<br> <li> Configures system account credentials and resolver client for JWT <br>persistence</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-5e7731adfb877918cd65d9d5531621312496450fd550fea2682efca4ca8fe816">+68/-0</a>&nbsp; &nbsp; </td> </tr> <tr> <td> <details> <summary><strong>edge_onboarding.go</strong><dd><code>Collector package and credential data models</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></summary> <hr> pkg/models/edge_onboarding.go <ul><li>Adds <code>CollectorType</code> enum for collector classification (flowgger, trapd, <br>netflow, otel)<br> <li> Adds <code>CollectorPackageStatus</code> enum for deployment lifecycle tracking<br> <li> Introduces <code>CollectorPackage</code>, <code>NatsCredential</code>, and <br><code>CollectorDownloadResult</code> data structures<br> <li> Supports collector package metadata, credential management, and <br>download tracking</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-0c66da94363b25ecb9e4766c8f12cde387c1874aeb9099447c609c4e867b4fe0">+71/-0</a>&nbsp; &nbsp; </td> </tr> <tr> <td> <details> <summary><strong>main.go</strong><dd><code>Agent push mode implementation with gateway integration</code>&nbsp; &nbsp; </dd></summary> <hr> cmd/agent/main.go <ul><li>Implements push mode for agents to actively push status to gateway<br> <li> Adds <code>runPushMode()</code> with gateway client initialization and push loop <br>management<br> <li> Implements graceful shutdown with signal handling for push mode<br> <li> Maintains legacy server mode as fallback with deprecation warning</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-61358711e980ccf505246fd3915f97cbd3a380e9b66f6fa5aad46749968c5ca3">+82/-0</a>&nbsp; &nbsp; </td> </tr> <tr> <td> <details> <summary><strong>main.go</strong><dd><code>CLI 
command routing for NATS bootstrap and admin</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></summary> <hr> cmd/cli/main.go <ul><li>Adds <code>nats-bootstrap</code> command routing for NATS server bootstrap <br>operations<br> <li> Adds <code>admin</code> command dispatcher with <code>nats</code> subcommand support<br> <li> Implements <code>dispatchAdminCommand()</code> for admin resource routing<br> <li> Maintains backward compatibility with existing bcrypt mode</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-ed4d81d29a7267f93fd77e17993fd3491b9ef6ded18490b4514d10ed1d803bc2">+14/-0</a>&nbsp; &nbsp; </td> </tr> </table></details></td></tr><tr><td><strong>Formatting</strong></td><td><details><summary>1 files</summary><table> <tr> <td> <details> <summary><strong>kv.pb.go</strong><dd><code>Proto file path normalization for KV service</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></summary> <hr> proto/kv.pb.go <ul><li>Updates proto file path references from <code>proto/kv.proto</code> to <code>kv.proto</code><br> <li> Renames internal variable references from <code>file_proto_kv_proto_*</code> to <br><code>file_kv_proto_*</code><br> <li> Updates all message type index references and descriptor function <br>calls</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-0b9bbdd1cf0f65de86dd5ee0c824e5d35326bb4b8edcf1c97b61864493f4477f">+83/-83</a>&nbsp; </td> </tr> </table></details></td></tr><tr><td><strong>Tests</strong></td><td><details><summary>6 files</summary><table> <tr> <td> <details> <summary><strong>nats_account_service_test.go</strong><dd><code>NATS Account Service Unit Tests</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></summary> <hr> 
pkg/datasvc/nats_account_service_test.go <ul><li>Comprehensive test suite for <code>NATSAccountServer</code> with authorization <br>context setup<br> <li> Tests account creation, user credential generation, and JWT signing <br>with various scenarios<br> <li> Includes proto conversion helper function tests for limits, credential <br>types, and permissions<br> <li> Validates error handling for missing required fields and invalid <br>inputs</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-3d1ee22d8514e26752c45da92cfacb867a80d29cb8bfc3682b5784d123bcfc52">+474/-0</a>&nbsp; </td> </tr> <tr> <td> <details> <summary><strong>user_manager_test.go</strong><dd><code>User Credentials Generation Tests</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></summary> <hr> pkg/nats/accounts/user_manager_test.go <ul><li>Tests for <code>GenerateUserCredentials</code> function covering collector, <br>service, and admin credential types<br> <li> Validates custom permissions, expiration handling, and JWT claim <br>structure<br> <li> Tests error cases for invalid seeds and wrong key types<br> <li> Includes helper function tests for credentials file formatting</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-000364a4950aac254d71c594f4c09b87abd8a8993f855d7a71cb248847cb82ce">+440/-0</a>&nbsp; </td> </tr> <tr> <td> <details> <summary><strong>account_manager_test.go</strong><dd><code>Account manager test suite for NATS tenant accounts</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></summary> <hr> pkg/nats/accounts/account_manager_test.go <ul><li>Comprehensive test suite for <code>AccountSigner</code> with 370 lines covering <br>account creation, JWT signing, and subject mappings<br> <li> Tests for tenant account creation with limits, custom mappings, and 
<br>revocations<br> <li> Validation of default subject mappings for collectors (events, syslog, <br>snmp, netflow, otel, logs, telemetry)<br> <li> Tests for account key derivation and multiple account generation</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-0c2bcb7ae33e8e4e35efdfb12c247cc0f62c7d479ac18dbf9e2682664dd2a5a6">+370/-0</a>&nbsp; </td> </tr> <tr> <td> <details> <summary><strong>tenant_test.go</strong><dd><code>Tenant identity parsing and context management tests</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></summary> <hr> pkg/tenant/tenant_test.go <ul><li>Test suite for tenant identity parsing and context management with 354 <br>lines<br> <li> Tests for CN parsing with various formats and error cases<br> <li> Tests for tenant info string representation and NATS channel prefixing<br> <li> Tests for context-based tenant propagation and helper functions</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-04245997664d90bcc09ea6a3b4edcda1b462adac68b2920953262ada7d7f6fbb">+354/-0</a>&nbsp; </td> </tr> <tr> <td> <details> <summary><strong>operator_test.go</strong><dd><code>NATS operator initialization and JWT signing tests</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></summary> <hr> pkg/nats/accounts/operator_test.go <ul><li>Test suite for NATS operator key generation and JWT creation with 357 <br>lines<br> <li> Tests for operator initialization from seed, file, and environment <br>variable<br> <li> Tests for system account public key configuration and invalid seed <br>handling<br> <li> Tests for operator JWT creation and key encoding/decoding for storage</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-7e02d50421951cc5d2172638cfcfd3cd0b953726699b20d887808baa63cecbf0">+357/-0</a>&nbsp; </td> </tr> <tr> <td> <details> <summary><strong>events_test.go</strong><dd><code>Event publisher tenant prefixing 
tests</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></summary> <hr> pkg/natsutil/events_test.go <ul><li>Tests for tenant prefix environment variable detection<br> <li> Tests for subject prefixing with and without tenant context<br> <li> Tests for <code>NewEventPublisherWithPrefixing()</code> and <code>SetTenantPrefixing()</code> <br>methods<br> <li> Validates fallback to <code>DefaultTenant</code> when tenant not in context</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-c11081459515a60362c5729f90da0828cee115b39cfe881e252d1d119b2263e2">+135/-0</a>&nbsp; </td> </tr> </table></details></td></tr><tr><td><strong>Configuration changes</strong></td><td><details><summary>1 files</summary><table> <tr> <td> <details> <summary><strong>.gitkeep</strong><dd><code>Docker Compose Credentials Directory</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></summary> <hr> docker/compose/creds/.gitkeep <ul><li>Creates placeholder directory for Docker Compose credentials storage</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-d72c41aab2d6f2c230a4340dfefe7917cdd12bed942c825aa0d4c9875a637bac">+1/-0</a>&nbsp; &nbsp; &nbsp; </td> </tr> </table></details></td></tr><tr><td><strong>Additional files</strong></td><td><details><summary>101 files</summary><table> <tr> <td><strong>.bazelignore</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-a5641cd37d6ad98b32cdfce1980836cc68312277bc6a7052f55da02ada5bc6cf">+4/-0</a>&nbsp; &nbsp; &nbsp; </td> </tr> <tr> <td><strong>.bazelrc</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-544556920c45b42cbfe40159b082ce8af6bd929e492d076769226265f215832f">+2/-0</a>&nbsp; &nbsp; 
&nbsp; </td> </tr> <tr> <td><strong>.env.example</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-a3046da0d15a27e89f2afe639b25748a7ad4d9290af3e7b1b6c1a5533c8f0a8c">+38/-0</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>AGENTS.md</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-a54ff182c7e8acf56acfd6e4b9c3ff41e2c41a31c9b211b2deb9df75d9a478f9">+171/-2</a>&nbsp; </td> </tr> <tr> <td><strong>MODULE.bazel</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-6136fc12446089c3db7360e923203dd114b6a1466252e71667c6791c20fe6bdc">+5/-0</a>&nbsp; &nbsp; &nbsp; </td> </tr> <tr> <td><strong>Makefile</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-76ed074a9305c04054cdebb9e9aad2d818052b07091de1f20cad0bbac34ffb52">+4/-0</a>&nbsp; &nbsp; &nbsp; </td> </tr> <tr> <td><strong>README-Docker.md</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-9fd61d24482efe68c22d8d41e2a1dcc440f39195aa56e7a050f2abe598179efd">+16/-1</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>BUILD.bazel</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-884fa9353a5226345e44fbabea3300efc7a87dfbcde0b6a42521ca51823f1b68">+6/-0</a>&nbsp; &nbsp; &nbsp; </td> </tr> <tr> <td><strong>BUILD.bazel</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-0e80ea46aeb61a873324685edb96eae864c7a2004fbb7ee404b4ec951190ba10">+12/-0</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>mix_release.bzl</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-86ec281f99363b6b6eb1f49e21d83b7eeca93a35b552b9f305fffc6855e38ccd">+120/-49</a></td> </tr> <tr> <td><strong>config.json</strong></td> <td><a 
href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-5b1bc8fe77422534739bdd3a38dc20d2634a86c171265c34e1b5d0c5a61b6bab">+5/-6</a>&nbsp; &nbsp; &nbsp; </td> </tr> <tr> <td><strong>config.rs</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-05038f3867985e757de9027609950e682bad6d1992dac6acd7c28962a3c65dc4">+25/-0</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>nats.rs</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-97f7335def0ad5d644b594a1076ae2d7080b11259cbb8de22c7946cc8e4b39f8">+4/-0</a>&nbsp; &nbsp; &nbsp; </td> </tr> <tr> <td><strong>zen-consumer-with-otel.json</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-68375f1f7847e1fbdf75664f6be65b1ad94ae6ce86ed73fc5964d65054668acb">+1/-0</a>&nbsp; &nbsp; &nbsp; </td> </tr> <tr> <td><strong>zen-consumer.json</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-4d308af9802a93a0f656e8c02a3b5fcd8991407bb18360f087470db74e1f9524">+1/-0</a>&nbsp; &nbsp; &nbsp; </td> </tr> <tr> <td><strong>BUILD.bazel</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-c62c0139ebdb337369f4067567cd2c52b8e7decb3ddfabc77f9f67b2f6e5789c">+1/-0</a>&nbsp; &nbsp; &nbsp; </td> </tr> <tr> <td><strong>flowgger.toml</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-af9f49f931e282dca53d1f0521b036d222fe671f77e61a876a84cf4c6d7cca4d">+1/-0</a>&nbsp; &nbsp; &nbsp; </td> </tr> <tr> <td><strong>nats_output.rs</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-a82e2e4d413539bf0b414b5629665b19648447523994cba639c4d1238aa5a0c1">+14/-0</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>otel.toml</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-c64b9ace832b8ea57a2be62f84166e03bb1904882635d444ec76a880cdf14cc0">+1/-0</a>&nbsp; &nbsp; &nbsp; 
</td> </tr> <tr> <td><strong>config.rs</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-abbaec651da3d6af96b482e0f77bb909b65dbe0cabd78b5803769cc9dab0a1b0">+13/-0</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>nats_output.rs</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-6b585ea3564a481174e04da1270e2e13edd4e2b980d02a2652d6d21e6d82a498">+7/-0</a>&nbsp; &nbsp; &nbsp; </td> </tr> <tr> <td><strong>setup.rs</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-3891f667deb20fd26e296d3e2742c57378d3764fe1743118e612465ae360391f">+1/-0</a>&nbsp; &nbsp; &nbsp; </td> </tr> <tr> <td><strong>config.rs</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-c89b88ba4d2bf0a054d0ba69a672a92c30140b8d19503d67b980a218ffe3106d">+21/-0</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>main.rs</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-33b655d8730ae3e9c844ee280787d11f1b0d5343119188273f89558805f814ba">+21/-1</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>docker-compose.elx.yml</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-9562070d7ad4a3e9b2d06567008cf35de1d96448d914b3b45bf6c36d97cdd914">+108/-0</a>&nbsp; </td> </tr> <tr> <td><strong>docker-compose.spiffe.yml</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-603fd9e7d40841d174f26b95d0cb0c9537430bf3f7a5da3ccbba4ea3d8ac66c9">+8/-0</a>&nbsp; &nbsp; &nbsp; </td> </tr> <tr> <td><strong>docker-compose.yml</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-e45e45baeda1c1e73482975a664062aa56f20c03dd9d64a827aba57775bed0d3">+325/-269</a></td> </tr> <tr> <td><strong>Dockerfile.agent-gateway</strong></td> <td><a 
href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-332bc81a932ae08efa711a71b60fe0954d99bf17ebdab00a3baaa177a44de8b0">+94/-0</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>Dockerfile.core-elx</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-5ec7a971285669999af442a0c7f141c34f7fd9180257307f5c4ed12f789a2182">+108/-0</a>&nbsp; </td> </tr> <tr> <td><strong>Dockerfile.poller-elx</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-966cdcf55b7defa4f9fd83a15c55a0e1106430e6e20926c1181187e95ed8d87e">+95/-0</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>Dockerfile.web-ng</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-92d43af1965575d56c3380ecc8a81024aac2ff36f039ec2d3839e9fc7852bc10">+6/-0</a>&nbsp; &nbsp; &nbsp; </td> </tr> <tr> <td><strong>agent-minimal.docker.json</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-1f09fad94636c90373af8e270f6ba0332ae4f4d1df50a4909729280a3a9691e6">+6/-6</a>&nbsp; &nbsp; &nbsp; </td> </tr> <tr> <td><strong>agent.docker.json</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-5d33fe703515d03076d31261ecf946e9c6fc668cf5bf65099d49b670739e455e">+9/-15</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>agent.mtls.json</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-008f2216f159a9bd5db9cc90baaf6f1e64487df7af05b56ab3b9d6c4946aa95f">+13/-3</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>datasvc.docker.json</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-3f2719d3dbfe042e8383739e3c78e74e5f851a44e5e46bea8e79c4b79fdcc34f">+3/-2</a>&nbsp; &nbsp; &nbsp; </td> </tr> <tr> <td><strong>datasvc.mtls.json</strong></td> <td><a 
href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-3a45619e57f1e6e9a31486ec7fffb33ef246e271f82bac272ee0a946b88da70a">+14/-1</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>db-event-writer.mtls.json</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-7a33f95f7545499abf0ed9fc91b58499ab209639e4885019579c959583fc7496">+1/-0</a>&nbsp; &nbsp; &nbsp; </td> </tr> <tr> <td><strong>entrypoint-certs.sh</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-83d6800b184a5233c66c69766286b0a60fece1bc64addb112d9f8dc019437f05">+11/-6</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>flowgger.docker.toml</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-824f8797b418d4b9f5ea41e4a3741a0ed64b881f343072464489a76b7ea01008">+2/-1</a>&nbsp; &nbsp; &nbsp; </td> </tr> <tr> <td><strong>generate-certs.sh</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-8298241543b4744a6ac7780c760ac5b5a0a87ba62de19c8612ebe1aba0996ebd">+216/-10</a></td> </tr> <tr> <td><strong>nats.docker.conf</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-06f2012494f428fe1bfb304972061c2094e0d99da88ba9af6914f7776872e6eb">+16/-160</a></td> </tr> <tr> <td><strong>netflow-consumer.mtls.json</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-f15920e8498a24f71ce3eec4f48fe8fefbb1765a90362998af779a660fcef9e1">+1/-0</a>&nbsp; &nbsp; &nbsp; </td> </tr> <tr> <td><strong>otel.docker.toml</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-d4af38790e3657b7589cd37a7539d5308b032f11caba7aa740ddc86bf99f4415">+2/-0</a>&nbsp; &nbsp; &nbsp; </td> </tr> <tr> <td><strong>pg_hba.conf</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-7bd5f7292054916c7e5997f4c84ac9ec07d4c945621a48936c2aed0575fb96eb">+9/-0</a>&nbsp; &nbsp; 
&nbsp; </td> </tr> <tr> <td><strong>pg_ident.conf</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-e7b8ce062e32c61fdc3bcc9e525c1f1df1c8008fbc02b11409e58c67baa17cc5">+17/-0</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>ssl_dist.core.conf</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-08d49d8621b581d1a9aa5c456f61e8c5774e021083c982cbb514019f915a1701">+17/-0</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>ssl_dist.gateway.conf</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-4a43a8290d45ac68592000e7ef51afe78b4213090155bd42aafb46e66130f7ae">+17/-0</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>ssl_dist.poller.conf</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-3373e5d02bd9a19c75819948cf829086dc9c02f41f30c78759beb7038c428d3e">+17/-0</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>ssl_dist.web.conf</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-cef5be462ddb059fdfdeb9fd7c5cd70e656c4cd8b6ae1fe3fe312557b3da80ac">+17/-0</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>trapd.docker.json</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-1ab1a0e03e63bc02e0ef31992a7187a377927272ed2060150b40d44cc0ea3357">+2/-1</a>&nbsp; &nbsp; &nbsp; </td> </tr> <tr> <td><strong>zen.docker.json</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-e060a3164cdc2746e0d9ad000fcf43c4bcdb05f4a41c586d7220e2ff2a7df01d">+2/-1</a>&nbsp; &nbsp; &nbsp; </td> </tr> <tr> <td><strong>BUILD.bazel</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-0e4db31c224a8f72ae8e870a849e38a59d74a2c7f7b04347b0b3eb07e20c5a80">+80/-0</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>push_targets.bzl</strong></td> <td><a 
href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-4af33fe62caba04b6d479589c16cfb85babc39bae5c92595d4d4e31660738513">+2/-0</a>&nbsp; &nbsp; &nbsp; </td> </tr> <tr> <td><strong>architecture.md</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-90abd06467420fd89391fd1a4d75ceb1f6a9381de4d13a95fffe606abff38d37">+58/-6</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>ash-api.md</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-1cb48a12148688dc640f42675d0b3c2458ecc40175f4ae7eef4b07bdc2ede3a3">+305/-0</a>&nbsp; </td> </tr> <tr> <td><strong>ash-authentication.md</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-e3d22dc8d6917661b980dd9c7e08bdc07fad4d225b10c60885b091a9cdd20425">+244/-0</a>&nbsp; </td> </tr> <tr> <td><strong>ash-authorization.md</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-72abeea2439e2281b6f998db2d11a73fe3ab393898125c466f0b16e7a49d3088">+283/-0</a>&nbsp; </td> </tr> <tr> <td><strong>ash-domains.md</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-bc78325c5ea4c6839212536a43336a231175450738af14f4173be663acf2fe49">+223/-0</a>&nbsp; </td> </tr> <tr> <td><strong>ash-migration-guide.md</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-e730373f07f6656d00327de02b4aa64e13809aadb3bf5478a4c60041ee431ad8">+339/-0</a>&nbsp; </td> </tr> <tr> <td><strong>docker-setup.md</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-8604269dffb3ce4133e48cab374ca8e97745d0efbdef67cad792aeb5945fe5ec">+39/-4</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>edge-agents.md</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-df8f4b6b6cd8fb926f9204b2bab85ec4f393286d1b25d8a62d2debd357dd651d">+253/-0</a>&nbsp; </td> </tr> <tr> 
<td><strong>edge-onboarding.md</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-f32fbefcc4dd23b0f2146015630a411099012e8bf56ea7b9d52ae8b2a83ac3e3">+316/-273</a></td> </tr> <tr> <td><strong>security-architecture.md</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-49e33c2fc609030af8a64af34016a22e9e86c0f3781acec5a9e6d0f7ccf5dc09">+238/-0</a>&nbsp; </td> </tr> <tr> <td><strong>spiffe-identity.md</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-6a3f0bdce99eb15d1075805cb5101594398f81ad92aa4eea9b27bb7545098db1">+52/-0</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>troubleshooting-guide.md</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-343b87dd0b430b3f99b16d32200c353bb6e3d7bbb185da3c1b3effc3a03e7f2a">+51/-0</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>datasvc.ex</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-00c928e77e8bf2f741a35e42e304acd2f2d4b8546532a5125a717924f2bde5dc">+32/-2</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>BUILD.bazel</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-aa8f6a4e89041a3423f2299efab4bb1b5d13358d445b3c6e35cc46f527dbaa7b">+30/-0</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>config.exs</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-abaac0167f05b505cf4bc5c5d375ae769001d8cc20120919941748c79681bbae">+13/-0</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>dev.exs</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-378e53cb66373073730a5d7ceba4973777eb84ae419c8a38472e3f69c3056fe9">+4/-0</a>&nbsp; &nbsp; &nbsp; </td> </tr> <tr> <td><strong>prod.exs</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-5fdc859dfda96a02326392f61218325913f22f9f0c75a1276ee940d5db9fc62b">+4/-0</a>&nbsp; &nbsp; 
&nbsp; </td> </tr> <tr> <td><strong>runtime.exs</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-842568fafa717a8064203543d674517fd28ad7dd2a4d3f0f157d274cfda4f18b">+202/-0</a>&nbsp; </td> </tr> <tr> <td><strong>test.exs</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-00c8b58b49024f5d6f37ba40ac310a571e5a1b47fc16819423dc42f3035bc175">+20/-0</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>agent_client.ex</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-82bf77f47ac3b0001b036f0f22944d07bf894f1ee07b912e0ecab959eb8d5c53">+472/-0</a>&nbsp; </td> </tr> <tr> <td><strong>agent_gateway_server.ex</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-369a368073dc8ec1140bcea699005a1ce97a90cd59629df0bd18c71c7ffaae9f">+226/-0</a>&nbsp; </td> </tr> <tr> <td><strong>application.ex</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-fc8dfd7489f127775b1f0baf09cea0cdf77b825dd5f92540e126a43cb246f5b2">+192/-0</a>&nbsp; </td> </tr> <tr> <td><strong>config.ex</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-5ba6871060dcff48aabea128b381e2d121bbb32256be1066c31dba447dc10813">+311/-0</a>&nbsp; </td> </tr> <tr> <td><strong>endpoint.ex</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-ba09da2cb13f93e8b0f8d1dad1a214444239ef82ccd052a1e580f1fd1173d3d4">+14/-0</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>status_processor.ex</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-2d04050dff3ba2cc8153559e33a892f5f421982bf6dcbda7172857b3bf398a02">+202/-0</a>&nbsp; </td> </tr> <tr> <td><strong>task_executor.ex</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-5779481787c6634825d987f45daa3d38f8010f4476f108dd14befcd275a2becf">+435/-0</a>&nbsp; </td> 
</tr> <tr> <td><strong>mix.exs</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-0230a6c5cfe38c579eedeae533c0ae08389c513592b55ab18f9ecca34edbdef7">+69/-0</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>env.sh.eex</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-e9df1de6119d13ee07c13d3368d94d8d5bf5e4bfb8eafa88296d92ca8cfba3d8">+78/-0</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>serviceradar-poller.service</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-09a1230ce7d28b7c4ecc829f950ab86e06e519a57de9f8f072c47430db58de9a">+66/-0</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>vm.args.eex</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-94b4c9aef90071b4b2f63e05a89e3f4b1aa385ca4dfc299c606179dce3fdc6df">+50/-0</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>registration_test.exs</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-9a5e2ee4c17c6db1518f26ff99f0dfc6465038eaa839794c7415002662f7fb31">+235/-0</a>&nbsp; </td> </tr> <tr> <td><strong>test_helper.exs</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-438a6880b1ce83b9bc82deacc98898368920b156c596305b0cdf6b8ff0ed7b06">+7/-0</a>&nbsp; &nbsp; &nbsp; </td> </tr> <tr> <td><strong>.formatter.exs</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-791d4757cee68d168d8ab62d3eca6562470207fa6b3ccbf124e35c3fbd9de875">+15/-0</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>BUILD.bazel</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-f1e707ff2843703a98d4752b3a3730c080c75772dbb5dbff1221f66af0678bc7">+14/-0</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>README.md</strong></td> <td><a 
href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-5dc9fa4d00db77d9f3e08a76ec06c2e2d703a7976ac2531af5edd9e3d1175b56">+269/-0</a>&nbsp; </td> </tr> <tr> <td><strong>config.exs</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-42b3888cc53d9dcf4ebc45ec7fffb2c672b129bffe763b6c76de58e4678a13a8">+101/-0</a>&nbsp; </td> </tr> <tr> <td><strong>dev.exs</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-a18ef01363a8ee26ebf4b56399d88607b4cf7934c597d6d3085041195b06ab5b">+66/-0</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>prod.exs</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-4fe837307f8710c783ecb9ae7595bb7bb9ec37d8522248bd081fa256803c3e92">+6/-0</a>&nbsp; &nbsp; &nbsp; </td> </tr> <tr> <td><strong>runtime.exs</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-5d045e2af9f15590a9745058e25a00380ad7a2cc363d0a9934715bd11ea8eef3">+196/-0</a>&nbsp; </td> </tr> <tr> <td><strong>test.exs</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-d8fffec56a9bc7305c02bbdd424e21031ecbbb2c2e6ba07ba626e6043540e360">+91/-0</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>eventwriter.md</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-f505978234d706915d40bae31f46f0c906a941b1b63a82f2e8c37dcd01e21c4d">+370/-0</a>&nbsp; </td> </tr> <tr> <td><strong>device.ex</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-eba1d95a852e4a736813c7b486da651704f20718e24f931c966ff3f37c421eea">+608/-0</a>&nbsp; </td> </tr> <tr> <td><strong>device_registry.ex</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-c9bf6ea0ae2afb5d4cbcfe8686ebb5682031524e28edb55e43ca5e4b0512d281">+284/-0</a>&nbsp; </td> </tr> <tr> <td><strong>telemetry.ex</strong></td> <td><a 
href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-09c303c0c3b329f8470971b33c0c6b4af8161efc77c03edc35fbae33e4b68dbf">+267/-0</a>&nbsp; </td> </tr> <tr> <td><strong>agent_tracker.ex</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-792a06febe706658f0bd382113e31d8070d15c477202d51f1699e77c8317001a">+177/-0</a>&nbsp; </td> </tr> <tr> <td><strong>application.ex</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-a9ffbf400b7f9b22cd8980c41286c54fe373f1f1a8684bb6a344a5fb39b178d0">+381/-0</a>&nbsp; </td> </tr> <tr> <td><strong>backoff.ex</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-53ea1e963221e36412478d7951d5c99c6e88af68cfed27d2de18145f1225a2c1">+59/-0</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>Additional files not shown</strong></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2221/files#diff-2f328e4cd8dbe3ad193e49d92bcf045f47a6b72b1e9487d366f6b8288589b4ca"></a></td> </tr> </table></details></td></tr></tbody></table> </details> ___
qodo-code-review[bot] commented 2026-01-03 09:48:48 +00:00 (Migrated from github.com)

Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3706933859
Original created: 2026-01-03T09:48:48Z

PR Compliance Guide 🔍

(Compliance updated until commit github.com/carverauto/serviceradar@d0138ce012)

Below is a summary of compliance checks for this PR:

Security Compliance
Insecure transport allowed

Description: The client can fall back to plaintext gRPC using
grpc.WithTransportCredentials(insecure.NewCredentials()) when SR_ALLOW_INSECURE=true,
which would allow MITM interception/modification of agent status/config traffic if enabled
outside local/dev environments.
gateway_client.go [178-187]

Referred Code
} else {
	// Insecure connections require explicit opt-in via environment variable
	// to prevent accidental plaintext gRPC in production deployments
	if os.Getenv("SR_ALLOW_INSECURE") != "true" {
		return nil, ErrSecurityRequired
	}

	g.logger.Warn().Msg("Using insecure connection to gateway (SR_ALLOW_INSECURE=true)")
	opts = append(opts, grpc.WithTransportCredentials(insecure.NewCredentials()))
}

Ticket Compliance
🎫 No ticket provided
  • Create ticket/issue
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢
Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status:
Missing audit context: The new agent enrollment/configuration actions emit operational logs but do not
demonstrate dedicated audit logging with consistently required audit fields (actor
identity, timestamp, action, outcome) for critical actions.

Referred Code
// Hello sends an enrollment request to the gateway.
// This should be called on agent startup to announce the agent and register with the gateway.
func (g *GatewayClient) Hello(ctx context.Context, req *proto.AgentHelloRequest) (*proto.AgentHelloResponse, error) {
	g.mu.RLock()
	client := g.client
	connected := g.connected
	g.mu.RUnlock()

	if !connected || client == nil {
		return nil, ErrGatewayNotConnected
	}

	helloCtx, cancel := context.WithTimeout(ctx, defaultConnectTimeout)
	defer cancel()

	resp, err := client.Hello(helloCtx, req)
	if err != nil {
		g.logger.Error().Err(err).Msg("Failed to send Hello to gateway")
		g.markDisconnected()
		return nil, fmt.Errorf("failed to send Hello: %w", err)
	}



 ... (clipped 50 lines)

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status:
Error detail exposure: Connection/enrollment errors include internal connection details (e.g., gateway address
and upstream error strings) that may be surfaced to user-facing contexts depending on
caller behavior not visible in this diff.

Referred Code
// Connect establishes a connection to the gateway.
func (g *GatewayClient) Connect(ctx context.Context) error {
	g.mu.Lock()
	defer g.mu.Unlock()

	if g.addr == "" {
		return ErrGatewayAddrRequired
	}

	if g.conn != nil && g.connected {
		state := g.conn.GetState()
		if state == connectivity.Ready {
			return nil // Already connected and ready
		}
		// Connection exists but isn't healthy; force reconnect.
		_ = g.conn.Close()
		g.conn = nil
		g.client = nil
		g.connected = false
		if g.securityProvider != nil {
			_ = g.securityProvider.Close()



 ... (clipped 52 lines)

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status:
Sensitive data in logs: The new logging includes potentially sensitive operational identifiers and network targets
(e.g., tenant_slug, tenant_id, target, port) that could constitute sensitive
tenant/internal infrastructure data depending on deployment policy.

Referred Code
	p.setEnrolled(true)
	p.logger.Info().
		Str("agent_id", helloResp.AgentId).
		Str("gateway_id", helloResp.GatewayId).
		Str("tenant_slug", helloResp.TenantSlug).
		Msg("Successfully enrolled with gateway")

	// Fetch initial config if outdated or not yet fetched
	if helloResp.ConfigOutdated || p.getConfigVersion() == "" {
		p.fetchAndApplyConfig(ctx)
	}
}

// configPollLoop periodically polls for config updates.
func (p *PushLoop) configPollLoop(ctx context.Context) {
	// Wait for initial enrollment before polling
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for !p.isEnrolled() {
		select {
		case <-ctx.Done():



 ... (clipped 127 lines)
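One common mitigation for the tenant-identifier-in-logs concern is to log a stable pseudonymous token instead of the raw ID, so entries stay correlatable without exposing tenant metadata. A stdlib sketch of that idea (this is one possible safeguard, not what the PR implements):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// logToken derives a short, deterministic, non-reversible token from a
// sensitive identifier. Logs can correlate on the token while the raw
// tenant ID stays out of the log stream.
func logToken(id string) string {
	sum := sha256.Sum256([]byte(id))
	return hex.EncodeToString(sum[:])[:12]
}

func main() {
	a := logToken("tenant-acme")
	b := logToken("tenant-acme")
	fmt.Println(a == b, len(a)) // deterministic, fixed 12-char length
}
```

Whether `tenant_slug` needs this treatment depends on deployment policy, as the finding itself notes.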

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status:
Unverified remote config: Remote GetConfig check definitions are applied dynamically with limited validation, and
the diff alone does not prove that the gateway-side authorization/authentication and
integrity guarantees for these inputs are enforced end-to-end.

Referred Code
// fetchAndApplyConfig fetches config from gateway and applies it.
func (p *PushLoop) fetchAndApplyConfig(ctx context.Context) {
	configReq := &proto.AgentConfigRequest{
		AgentId:       p.server.config.AgentID,
		ConfigVersion: p.getConfigVersion(),
	}

	configResp, err := p.gateway.GetConfig(ctx, configReq)
	if err != nil {
		p.logger.Error().Err(err).Msg("Failed to fetch config from gateway")
		return
	}

	// If config hasn't changed, nothing to do
	if configResp.NotModified {
		p.logger.Debug().Str("version", p.getConfigVersion()).Msg("Config not modified")
		return
	}

	// Update intervals from config response
	if configResp.HeartbeatIntervalSec > 0 {



 ... (clipped 162 lines)

Learn more about managing compliance generic rules or creating your own custom rules

Compliance status legend:
🟢 - Fully Compliant
🟡 - Partially Compliant
🔴 - Not Compliant
- Requires Further Human Verification
🏷️ - Compliance label

Previous compliance checks

Compliance check up to commit 8cdb823
Security Compliance
SSRF via remote config

Description: Remote configuration received from the gateway is converted into active network targets
(e.g., check.Target and check.Port in protoCheckToCheckerConfig, then stored in
p.server.checkerConfs and later executed), which can enable SSRF/port-scanning behavior
from the agent if the gateway/control-plane or its authZ is compromised or misconfigured.
push_loop.go [492-615]

Referred Code
// applyChecks converts proto checks to checker configs and updates the server.
func (p *PushLoop) applyChecks(checks []*proto.AgentCheckConfig) {
	if len(checks) == 0 {
		p.logger.Debug().Msg("No checks to apply")
		return
	}

	p.server.mu.Lock()
	defer p.server.mu.Unlock()

	// Track which checks we've seen (for removing stale checks later)
	seenChecks := make(map[string]bool)

	for _, check := range checks {
		// Guard against nil entries in the checks slice
		if check == nil {
			continue
		}

		if !check.Enabled {
			continue



 ... (clipped 103 lines)
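Since gateway-supplied targets become active network destinations, one SSRF mitigation is to refuse targets that resolve into loopback or link-local space before scheduling a check. The sketch below shows the shape of such a filter; the exact policy (private ranges, per-tenant allowlists) is deployment-specific and not taken from the PR:

```go
package main

import (
	"fmt"
	"net"
)

// targetAllowed rejects check targets that resolve to loopback,
// link-local, or unspecified addresses. IP literals are parsed
// directly by LookupIP without a DNS query.
func targetAllowed(host string) bool {
	ips, err := net.LookupIP(host)
	if err != nil {
		return false
	}
	for _, ip := range ips {
		if ip.IsLoopback() || ip.IsLinkLocalUnicast() || ip.IsUnspecified() {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(targetAllowed("127.0.0.1")) // false: loopback refused
	fmt.Println(targetAllowed("192.0.2.1")) // true: ordinary unicast allowed
}
```

A production filter would typically also cover RFC 1918 ranges and re-validate after DNS resolution to avoid rebinding.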
Sensitive identifier logging

Description: The agent logs multi-tenant identifiers from the enrollment response (including tenant_id
and tenant_slug), which can expose tenant metadata to anyone with access to agent logs and
may violate least-privilege log hygiene in shared or centrally aggregated logging
environments.
gateway_client.go [381-389]

Referred Code
g.logger.Info().
	Str("agent_id", resp.AgentId).
	Str("gateway_id", resp.GatewayId).
	Str("tenant_id", resp.TenantId).
	Str("tenant_slug", resp.TenantSlug).
	Int32("heartbeat_interval_sec", resp.HeartbeatIntervalSec).
	Bool("config_outdated", resp.ConfigOutdated).
	Msg("Agent enrolled with gateway")


Ticket Compliance
🎫 No ticket provided
  • Create ticket/issue
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢
Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

🔴
Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status:
Nil dereference risk: protoCheckToCheckerConfig can return nil for invalid gateway-supplied checks, but
applyChecks uses the returned checkerConf without a nil check which can cause a panic
instead of graceful handling.

Referred Code
// Convert proto check to CheckerConfig
checkerConf := protoCheckToCheckerConfig(check)

// Check if this config already exists and is unchanged
if existing, exists := p.server.checkerConfs[check.Name]; exists {
	if existing.Type == checkerConf.Type &&
		existing.Address == checkerConf.Address &&
		existing.Timeout == checkerConf.Timeout {
		// Config unchanged, skip
		continue
	}
}
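The fix for this finding is a nil guard between conversion and use. A self-contained sketch of the pattern, with simplified stand-in types (`check`, `checkerConfig`, and `toCheckerConfig` are illustrative, not the PR's `proto.AgentCheckConfig` pipeline):

```go
package main

import "fmt"

// check stands in for a gateway-supplied check definition.
type check struct {
	Name    string
	Enabled bool
	Target  string
}

// checkerConfig stands in for the agent's internal checker config.
type checkerConfig struct {
	Name   string
	Target string
}

// toCheckerConfig returns nil for invalid input, as the review notes the
// real converter can.
func toCheckerConfig(c *check) *checkerConfig {
	if c == nil || c.Name == "" || c.Target == "" {
		return nil
	}
	return &checkerConfig{Name: c.Name, Target: c.Target}
}

// applyChecks skips nil entries and nil conversions instead of panicking.
func applyChecks(checks []*check) map[string]*checkerConfig {
	out := make(map[string]*checkerConfig)
	for _, c := range checks {
		if c == nil || !c.Enabled {
			continue
		}
		conf := toCheckerConfig(c)
		if conf == nil { // guard the nil return flagged above
			continue
		}
		out[c.Name] = conf
	}
	return out
}

func main() {
	got := applyChecks([]*check{
		nil,
		{Name: "ok", Enabled: true, Target: "127.0.0.1:80"},
		{Name: "bad", Enabled: true}, // invalid: converter returns nil
	})
	fmt.Println(len(got)) // only the valid, enabled check survives
}
```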

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status:
Unhandled invalid input: Gateway-supplied check data is partially validated in protoCheckToCheckerConfig, but
invalid checks can produce nil configs and are not safely handled in applyChecks, enabling
a crash via untrusted external input.

Referred Code
for _, check := range checks {
	// Guard against nil entries in the checks slice
	if check == nil {
		continue
	}

	if !check.Enabled {
		continue
	}

	seenChecks[check.Name] = true

	// Convert proto check to CheckerConfig
	checkerConf := protoCheckToCheckerConfig(check)

	// Check if this config already exists and is unchanged
	if existing, exists := p.server.checkerConfs[check.Name]; exists {
		if existing.Type == checkerConf.Type &&
			existing.Address == checkerConf.Address &&
			existing.Timeout == checkerConf.Timeout {
			// Config unchanged, skip



 ... (clipped 14 lines)

Learn more about managing compliance generic rules or creating your own custom rules

Compliance check up to commit 6aa63f6
Security Compliance
🟢
No security concerns identified. No security vulnerabilities were detected by AI analysis; human verification is advised for critical code.
Ticket Compliance
🎫 No ticket provided
  • Create ticket/issue
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢
Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

🔴
Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status:
Sensitive identifiers logged: Logs include potentially sensitive tenant identifiers and infrastructure details (e.g.,
tenant_id, tenant_slug) which violates the requirement that no sensitive data be present
in logs.

Referred Code
g.logger.Info().
	Str("agent_id", resp.AgentId).
	Str("gateway_id", resp.GatewayId).
	Str("tenant_id", resp.TenantId).
	Str("tenant_slug", resp.TenantSlug).
	Int32("heartbeat_interval_sec", resp.HeartbeatIntervalSec).
	Bool("config_outdated", resp.ConfigOutdated).
	Msg("Agent enrolled with gateway")


Learn more about managing compliance generic rules or creating your own custom rules

Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status:
Missing actor context: Enrollment and config-application actions are logged but without a consistent
"actor" identifier (e.g., user/admin) beyond agent_id, and other critical
actions in the PR (not shown in this diff) may lack audit logging.

Referred Code
p.logger.Info().Msg("Enrolling with gateway...")

// Build Hello request
helloReq := &proto.AgentHelloRequest{
	AgentId:       p.server.config.AgentID,
	Version:       Version, // Agent version from version.go
	Capabilities:  getAgentCapabilities(),
	ConfigVersion: p.configVersion,
}

// Send Hello
helloResp, err := p.gateway.Hello(ctx, helloReq)
if err != nil {
	p.logger.Error().Err(err).Msg("Failed to enroll with gateway")
	return
}

// Update server config with tenant info from gateway
p.server.config.TenantID = helloResp.TenantId
p.server.config.TenantSlug = helloResp.TenantSlug




 ... (clipped 98 lines)

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status:
Nil map risk: applyChecks writes to p.server.checkerConfs without ensuring the map is initialized, which
can panic if checkerConfs is nil depending on server construction not visible in this
diff.

Referred Code
// applyChecks converts proto checks to checker configs and updates the server.
func (p *PushLoop) applyChecks(checks []*proto.AgentCheckConfig) {
	if len(checks) == 0 {
		p.logger.Debug().Msg("No checks to apply")
		return
	}

	p.server.mu.Lock()
	defer p.server.mu.Unlock()

	// Track which checks we've seen (for removing stale checks later)
	seenChecks := make(map[string]bool)

	for _, check := range checks {
		// Guard against nil entries in the checks slice
		if check == nil {
			continue
		}

		if !check.Enabled {
			continue



 ... (clipped 27 lines)
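The nil-map risk is conventionally closed with a lazy initialization at the write site. A minimal sketch with illustrative types (the real `checkerConfs` holds checker configs, not strings):

```go
package main

import "fmt"

// server stands in for the agent server; a zero-value server has a nil map.
type server struct {
	checkerConfs map[string]string
}

// setCheck lazily initializes the map before writing, avoiding the
// nil-map panic the review flags.
func (s *server) setCheck(name, conf string) {
	if s.checkerConfs == nil {
		s.checkerConfs = make(map[string]string)
	}
	s.checkerConfs[name] = conf
}

func main() {
	var s server // zero value: checkerConfs is nil
	s.setCheck("icmp", "timeout=5s")
	fmt.Println(len(s.checkerConfs))
}
```

Initializing the map in the server constructor works equally well; the guard above simply makes `applyChecks` safe regardless of how the server was built.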

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status:
Potentially detailed errors: Connection and enrollment failures are returned with detailed internal context (e.g.,
gateway address and gateway-provided rejection message) which may be user-visible in CLI
output and should be reviewed for sensitive detail exposure.

Referred Code
// Connect establishes a connection to the gateway.
func (g *GatewayClient) Connect(ctx context.Context) error {
	g.mu.Lock()
	defer g.mu.Unlock()

	if g.addr == "" {
		return ErrGatewayAddrRequired
	}

	if g.connected && g.conn != nil {
		return nil // Already connected
	}

	g.logger.Info().Str("addr", g.addr).Msg("Connecting to agent-gateway")

	opts, err := g.buildDialOptions(ctx)
	if err != nil {
		return fmt.Errorf("failed to build dial options: %w", err)
	}

	connectCtx, cancel := context.WithTimeout(ctx, defaultConnectTimeout)



 ... (clipped 249 lines)

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status:
Unvalidated config inputs: JSON config fields loaded from disk (including network endpoints and timing values) are
used without explicit validation/sanitization, and correctness/security depends on
constraints that are not shown in this diff.

Referred Code
// loadConfig loads agent configuration from file, falling back to embedded defaults.
func loadConfig(configPath string) (*agent.ServerConfig, error) {
	var cfg agent.ServerConfig

	// Try to read config file
	data, err := os.ReadFile(configPath)
	if err != nil {
		if os.IsNotExist(err) {
			// Fall back to embedded default config
			data = defaultConfig
		} else {
			return nil, fmt.Errorf("failed to read config file: %w", err)
		}
	}

	if err := json.Unmarshal(data, &cfg); err != nil {
		return nil, fmt.Errorf("failed to parse config: %w", err)
	}

	return &cfg, nil
}


 ... (clipped 70 lines)
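An explicit post-unmarshal validation step is the usual answer to this finding. The sketch below shows the shape of such a check; `serverConfig` and its fields are illustrative, not the PR's `agent.ServerConfig`:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// serverConfig mirrors only the fields needed for the sketch.
type serverConfig struct {
	AgentID    string `json:"agent_id"`
	ListenAddr string `json:"listen_addr"`
	TimeoutSec int    `json:"timeout_sec"`
}

// validate applies explicit constraints after JSON unmarshaling, so a
// malformed file fails fast instead of producing a half-configured agent.
func (c *serverConfig) validate() error {
	if c.AgentID == "" {
		return fmt.Errorf("agent_id is required")
	}
	if c.ListenAddr == "" {
		return fmt.Errorf("listen_addr is required")
	}
	if c.TimeoutSec <= 0 || c.TimeoutSec > 300 {
		return fmt.Errorf("timeout_sec must be in (0, 300], got %d", c.TimeoutSec)
	}
	return nil
}

func main() {
	var cfg serverConfig
	_ = json.Unmarshal([]byte(`{"agent_id":"a1","listen_addr":":50051","timeout_sec":0}`), &cfg)
	fmt.Println(cfg.validate() != nil) // rejected: timeout out of range
}
```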

Learn more about managing compliance generic rules or creating your own custom rules

Compliance check up to commit 4d3d82c
Security Compliance
Insecure gRPC transport

Description: The client explicitly allows falling back to
grpc.WithTransportCredentials(insecure.NewCredentials()) when security is nil/disabled,
which would permit plaintext gRPC and enable MITM/credential interception if this path is
reachable in non-development deployments.
gateway_client.go [108-135]

Referred Code
// buildDialOptions constructs gRPC dial options based on security configuration.
func (g *GatewayClient) buildDialOptions(ctx context.Context) ([]grpc.DialOption, error) {
	var opts []grpc.DialOption

	if g.security != nil && g.security.Mode != "" && g.security.Mode != models.SecurityModeNone {
		// Create security provider using the standard pattern
		provider, err := srgrpc.NewSecurityProvider(ctx, g.security, g.logger)
		if err != nil {
			return nil, fmt.Errorf("failed to create security provider: %w", err)
		}

		// Get client credentials from the provider
		creds, err := provider.GetClientCredentials(ctx)
		if err != nil {
			_ = provider.Close()
			return nil, fmt.Errorf("failed to get client credentials: %w", err)
		}

		g.securityProvider = provider
		opts = append(opts, creds)
	} else {



 ... (clipped 7 lines)
Unexpected external egress

Description: getSourceIP() performs an outbound UDP dial to the public address 8.8.8.8:80, creating
unexpected external egress that can leak environment/network metadata and may violate
network/security policies in restricted deployments.
push_loop.go [233-250]

Referred Code
// getSourceIP attempts to determine the source IP of this agent.
func (p *PushLoop) getSourceIP() string {
	// First check if HostIP is configured
	if p.server.config.HostIP != "" {
		return p.server.config.HostIP
	}

	// Try to get the outbound IP
	conn, err := net.Dial("udp", "8.8.8.8:80")
	if err != nil {
		p.logger.Debug().Err(err).Msg("Failed to determine source IP")
		return ""
	}
	defer conn.Close()

	localAddr := conn.LocalAddr().(*net.UDPAddr)
	return localAddr.IP.String()
}

Ticket Compliance
🎫 No ticket provided
  • Create ticket/issue
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢
Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

🔴
Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status:
Nil check panic: applyChecks does not guard against nil entries in the checks slice and will panic when
dereferencing check.Enabled.

Referred Code
for _, check := range checks {
	if !check.Enabled {
		continue
	}


Learn more about managing compliance generic rules or creating your own custom rules

Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status:
Audit scope unclear: The new agent enrollment/config/status actions are logged, but it is not verifiable from
the diff alone that these logs meet your audit requirements (e.g., consistent actor
identity, timestamps, and outcome recording across all critical actions).

Referred Code
// Start begins the push loop. It runs until the context is cancelled.
func (p *PushLoop) Start(ctx context.Context) error {
	p.logger.Info().Dur("interval", p.interval).Msg("Starting push loop")

	// Initial connection and enrollment attempt
	if err := p.gateway.Connect(ctx); err != nil {
		p.logger.Warn().Err(err).Msg("Initial gateway connection failed, will retry")
	} else {
		// Connected, try to enroll
		p.enroll(ctx)
	}

	// Start config polling in a separate goroutine
	go p.configPollLoop(ctx)

	ticker := time.NewTicker(p.interval)
	defer ticker.Stop()

	// Do an initial push immediately (only if enrolled)
	if p.enrolled {
		p.pushStatus(ctx)



 ... (clipped 73 lines)

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status:
Identifier logging: The new logs include tenant identifiers (e.g., tenant_id and tenant_slug), and whether
these are classified as sensitive/PII cannot be confirmed from the diff alone.

Referred Code
g.logger.Info().
	Str("agent_id", resp.AgentId).
	Str("gateway_id", resp.GatewayId).
	Str("tenant_id", resp.TenantId).
	Str("tenant_slug", resp.TenantSlug).
	Int32("heartbeat_interval_sec", resp.HeartbeatIntervalSec).
	Bool("config_outdated", resp.ConfigOutdated).
	Msg("Agent enrolled with gateway")

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status:
Insecure transport option: The gateway client explicitly allows falling back to insecure.NewCredentials() when
security mode is none, which may violate expected security posture depending on deployment
requirements.

Referred Code
if g.security != nil && g.security.Mode != "" && g.security.Mode != models.SecurityModeNone {
	// Create security provider using the standard pattern
	provider, err := srgrpc.NewSecurityProvider(ctx, g.security, g.logger)
	if err != nil {
		return nil, fmt.Errorf("failed to create security provider: %w", err)
	}

	// Get client credentials from the provider
	creds, err := provider.GetClientCredentials(ctx)
	if err != nil {
		_ = provider.Close()
		return nil, fmt.Errorf("failed to get client credentials: %w", err)
	}

	g.securityProvider = provider
	opts = append(opts, creds)
} else {
	// Use insecure connection (for development/testing)
	g.logger.Warn().Msg("Using insecure connection to gateway (no mTLS)")
	opts = append(opts, grpc.WithTransportCredentials(insecure.NewCredentials()))
}



 ... (clipped 1 lines)

Learn more about managing compliance generic rules or creating your own custom rules

Compliance check up to commit 6a52314
Security Compliance
Credential secret exposure

Description: GenerateUserCredentials returns CredsFileContent containing the raw user NKey seed (via
formatCredsFile(userJWT, userSeed)), which is highly sensitive and could be exfiltrated if
this value is ever logged, stored, or returned over an insufficiently protected channel.
user_manager.go [54-120]

Referred Code
// GenerateUserCredentials creates NATS credentials for a user in a tenant's account.
// The accountSeed is the tenant's account private key (passed from Elixir storage).
func GenerateUserCredentials(
	tenantSlug string,
	accountSeed string,
	userName string,
	credType UserCredentialType,
	permissions *UserPermissions,
	expirationSeconds int64,
) (*UserCredentials, error) {
	// Parse account seed to get key pair for signing
	accountKP, err := nkeys.FromSeed([]byte(accountSeed))
	if err != nil {
		return nil, fmt.Errorf("invalid account seed: %w", err)
	}

	accountPublicKey, err := accountKP.PublicKey()
	if err != nil {
		return nil, fmt.Errorf("failed to get account public key: %w", err)
	}




 ... (clipped 46 lines)
Tenant isolation bypass

Description: Default publish/subscribe permissions include non-tenant-scoped subjects (e.g., events.>,
logs.>, telemetry.>) and also build allow-subjects from unvalidated tenantSlug (e.g.,
fmt.Sprintf("%s.>", tenantSlug)), which could enable cross-tenant access if subject
mapping is misconfigured or if tenantSlug contains wildcard/special tokens.
user_manager.go [123-198]

Referred Code
// applyUserPermissions sets permissions on user claims based on credential type.
func applyUserPermissions(
	claims *jwt.UserClaims,
	tenantSlug string,
	credType UserCredentialType,
	custom *UserPermissions,
) {
	// Start with type-based defaults
	switch credType {
	case CredentialTypeCollector:
		// Collectors can publish to their tenant's subjects (mapped by account)
		// and subscribe to responses
		claims.Pub.Allow.Add("events.>")
		claims.Pub.Allow.Add("snmp.traps")
		claims.Pub.Allow.Add("logs.>")
		claims.Pub.Allow.Add("telemetry.>")
		claims.Pub.Allow.Add("netflow.>")
		claims.Sub.Allow.Add("_INBOX.>")
		claims.Resp = &jwt.ResponsePermission{
			MaxMsgs: 1,
			Expires: time.Minute,



 ... (clipped 55 lines)
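Because the slug is interpolated into subject patterns like `fmt.Sprintf("%s.>", tenantSlug)`, validating it before use closes the wildcard-smuggling path. A sketch of one such check; the exact allowed character set is an assumption, not taken from the PR:

```go
package main

import (
	"fmt"
	"regexp"
)

// slugPattern rejects NATS subject tokens ('*', '>') and the '.' separator,
// so a slug cannot widen the permissions built from it. Lowercase
// alphanumerics and hyphens, max 63 chars, starting alphanumeric.
var slugPattern = regexp.MustCompile(`^[a-z0-9][a-z0-9-]{0,62}$`)

// validTenantSlug reports whether slug is safe to embed in subject patterns.
func validTenantSlug(slug string) bool {
	return slugPattern.MatchString(slug)
}

func main() {
	fmt.Println(validTenantSlug("acme-corp")) // true
	fmt.Println(validTenantSlug("acme.>"))    // false: wildcard smuggling
}
```

Pairing this with account-level subject mapping keeps isolation intact even if a single permission entry is misconfigured.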
Ticket Compliance
🎫 No ticket provided
  • Create ticket/issue
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢
Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status:
Audit logging unknown: The diff introduces APIs for operator/account bootstrap and credential generation, but no
corresponding audit logging is visible in the provided hunks, so it cannot be verified
that these critical actions are reliably logged with actor context and outcomes.

Referred Code
// CreateTenantAccountRequest is the request to create a new tenant NATS account.
type CreateTenantAccountRequest struct {
	state           protoimpl.MessageState `protogen:"open.v1"`
	TenantSlug      string                 `protobuf:"bytes,1,opt,name=tenant_slug,json=tenantSlug,proto3" json:"tenant_slug,omitempty"`                // Unique tenant identifier (e.g., "acme-corp")
	Limits          *AccountLimits         `protobuf:"bytes,2,opt,name=limits,proto3" json:"limits,omitempty"`                                          // Optional resource limits
	SubjectMappings []*SubjectMapping      `protobuf:"bytes,3,rep,name=subject_mappings,json=subjectMappings,proto3" json:"subject_mappings,omitempty"` // Optional custom subject mappings
	unknownFields   protoimpl.UnknownFields
	sizeCache       protoimpl.SizeCache
}

func (x *CreateTenantAccountRequest) Reset() {
	*x = CreateTenantAccountRequest{}
	mi := &file_nats_account_proto_msgTypes[2]
	ms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x))
	ms.StoreMessageInfo(mi)
}

func (x *CreateTenantAccountRequest) String() string {
	return protoimpl.X.MessageStringOf(x)
}




 ... (clipped 829 lines)

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status:
Error handling unseen: The new gRPC request/response surface for account and credential operations is added, but
the diff does not include the service implementation where validation and error handling
would occur, so robustness against edge cases cannot be confirmed.

Referred Code
// CreateTenantAccountRequest is the request to create a new tenant NATS account.
type CreateTenantAccountRequest struct {
	state           protoimpl.MessageState `protogen:"open.v1"`
	TenantSlug      string                 `protobuf:"bytes,1,opt,name=tenant_slug,json=tenantSlug,proto3" json:"tenant_slug,omitempty"`                // Unique tenant identifier (e.g., "acme-corp")
	Limits          *AccountLimits         `protobuf:"bytes,2,opt,name=limits,proto3" json:"limits,omitempty"`                                          // Optional resource limits
	SubjectMappings []*SubjectMapping      `protobuf:"bytes,3,rep,name=subject_mappings,json=subjectMappings,proto3" json:"subject_mappings,omitempty"` // Optional custom subject mappings
	unknownFields   protoimpl.UnknownFields
	sizeCache       protoimpl.SizeCache
}

func (x *CreateTenantAccountRequest) Reset() {
	*x = CreateTenantAccountRequest{}
	mi := &file_nats_account_proto_msgTypes[2]
	ms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x))
	ms.StoreMessageInfo(mi)
}

func (x *CreateTenantAccountRequest) String() string {
	return protoimpl.X.MessageStringOf(x)
}




 ... (clipped 926 lines)

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status:
Error exposure risk: Responses like PushAccountJWTResponse.message appear able to carry error details, but
without the server-side implementation in the visible diff it cannot be verified that
user-facing errors are sanitized and sensitive internals are only logged securely.

Referred Code
// PushAccountJWTRequest is the request to push an account JWT to the NATS resolver.
type PushAccountJWTRequest struct {
	state            protoimpl.MessageState `protogen:"open.v1"`
	AccountPublicKey string                 `protobuf:"bytes,1,opt,name=account_public_key,json=accountPublicKey,proto3" json:"account_public_key,omitempty"` // The account's public NKey (starts with 'A')
	AccountJwt       string                 `protobuf:"bytes,2,opt,name=account_jwt,json=accountJwt,proto3" json:"account_jwt,omitempty"`                     // The signed account JWT to push
	unknownFields    protoimpl.UnknownFields
	sizeCache        protoimpl.SizeCache
}

func (x *PushAccountJWTRequest) Reset() {
	*x = PushAccountJWTRequest{}
	mi := &file_nats_account_proto_msgTypes[13]
	ms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x))
	ms.StoreMessageInfo(mi)
}

func (x *PushAccountJWTRequest) String() string {
	return protoimpl.X.MessageStringOf(x)
}

func (*PushAccountJWTRequest) ProtoMessage() {}



 ... (clipped 84 lines)


Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status:
Secrets may be logged: The added models include highly sensitive fields (e.g.,
CollectorDownloadResult.nats_creds_file and mtls_bundle), and the diff does not show
safeguards preventing these values from being logged, so secure-logging compliance cannot
be verified.

Referred Code
// CollectorPackage represents a collector deployment package with NATS credentials.
type CollectorPackage struct {
	PackageID              string                 `json:"package_id"`
	TenantID               string                 `json:"tenant_id"`
	CollectorType          CollectorType          `json:"collector_type"`
	UserName               string                 `json:"user_name"`
	Site                   string                 `json:"site,omitempty"`
	Hostname               string                 `json:"hostname,omitempty"`
	Status                 CollectorPackageStatus `json:"status"`
	NatsCredentialID       string                 `json:"nats_credential_id,omitempty"`
	DownloadTokenHash      string                 `json:"download_token_hash,omitempty"`
	DownloadTokenExpiresAt time.Time              `json:"download_token_expires_at,omitempty"`
	DownloadedAt           *time.Time             `json:"downloaded_at,omitempty"`
	DownloadedByIP         string                 `json:"downloaded_by_ip,omitempty"`
	InstalledAt            *time.Time             `json:"installed_at,omitempty"`
	RevokedAt              *time.Time             `json:"revoked_at,omitempty"`
	RevokeReason           string                 `json:"revoke_reason,omitempty"`
	ErrorMessage           string                 `json:"error_message,omitempty"`
	ConfigOverrides        map[string]interface{} `json:"config_overrides,omitempty"`
	CreatedAt              time.Time              `json:"created_at"`
	UpdatedAt              time.Time              `json:"updated_at"`



 ... (clipped 26 lines)


Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status:
Sensitive data handling: New data structures introduce sensitive credential material and token-related fields
(e.g., download_token_hash, .creds content, and mtls_bundle), but the diff does not
include the creation/transport/storage code needed to verify encryption, access controls,
and validation of externally supplied identifiers.

Referred Code
// CollectorPackage represents a collector deployment package with NATS credentials.
type CollectorPackage struct {
	PackageID              string                 `json:"package_id"`
	TenantID               string                 `json:"tenant_id"`
	CollectorType          CollectorType          `json:"collector_type"`
	UserName               string                 `json:"user_name"`
	Site                   string                 `json:"site,omitempty"`
	Hostname               string                 `json:"hostname,omitempty"`
	Status                 CollectorPackageStatus `json:"status"`
	NatsCredentialID       string                 `json:"nats_credential_id,omitempty"`
	DownloadTokenHash      string                 `json:"download_token_hash,omitempty"`
	DownloadTokenExpiresAt time.Time              `json:"download_token_expires_at,omitempty"`
	DownloadedAt           *time.Time             `json:"downloaded_at,omitempty"`
	DownloadedByIP         string                 `json:"downloaded_by_ip,omitempty"`
	InstalledAt            *time.Time             `json:"installed_at,omitempty"`
	RevokedAt              *time.Time             `json:"revoked_at,omitempty"`
	RevokeReason           string                 `json:"revoke_reason,omitempty"`
	ErrorMessage           string                 `json:"error_message,omitempty"`
	ConfigOverrides        map[string]interface{} `json:"config_overrides,omitempty"`
	CreatedAt              time.Time              `json:"created_at"`
	UpdatedAt              time.Time              `json:"updated_at"`



 ... (clipped 26 lines)
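The DownloadTokenHash field suggests the usual store-only-a-digest pattern for one-time download tokens: persist a hash, never the token itself, and compare in constant time on redemption. A sketch under that assumption; hashToken and verifyToken are hypothetical helpers, since the PR's server-side code is not in the diff:

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
)

// hashToken returns the hex SHA-256 digest of a download token.
// Only this digest would be stored in DownloadTokenHash.
func hashToken(token string) string {
	sum := sha256.Sum256([]byte(token))
	return hex.EncodeToString(sum[:])
}

// verifyToken re-hashes the presented token and compares it to the
// stored digest in constant time to avoid timing side channels.
func verifyToken(token, storedHash string) bool {
	got := hashToken(token)
	return subtle.ConstantTimeCompare([]byte(got), []byte(storedHash)) == 1
}

func main() {
	stored := hashToken("one-time-download-token")
	fmt.Println(verifyToken("one-time-download-token", stored)) // true
	fmt.Println(verifyToken("wrong-token", stored))             // false
}
```

Expiry (DownloadTokenExpiresAt) would be checked alongside the hash comparison before serving the credential bundle.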


Imported GitHub PR comment. Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3706933859
Original created: 2026-01-03T09:48:48Z

PR Compliance Guide 🔍

(Compliance updated until commit https://github.com/carverauto/serviceradar/commit/d0138ce01275f980df692d4cd71400cdcd064ffd)

Below is a summary of compliance checks for this PR:

Security Compliance

⚪ Insecure transport allowed
Description: The client can fall back to plaintext gRPC using
grpc.WithTransportCredentials(insecure.NewCredentials()) when SR_ALLOW_INSECURE=true,
which would allow MITM interception/modification of agent status/config traffic if enabled
outside local/dev environments.
gateway_client.go [178-187]

Referred Code
	} else {
		// Insecure connections require explicit opt-in via environment variable
		// to prevent accidental plaintext gRPC in production deployments
		if os.Getenv("SR_ALLOW_INSECURE") != "true" {
			return nil, ErrSecurityRequired
		}

		g.logger.Warn().Msg("Using insecure connection to gateway (SR_ALLOW_INSECURE=true)")
		opts = append(opts, grpc.WithTransportCredentials(insecure.NewCredentials()))
	}

Ticket Compliance

⚪ 🎫 No ticket provided

Codebase Duplication Compliance

⚪ Codebase context is not defined

Custom Compliance

🟢 Generic: Meaningful Naming and Self-Documenting Code
Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting
Status: Passed

🟢 Generic: Robust Error Handling and Edge Case Management
Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation
Status: Passed

⚪ Generic: Comprehensive Audit Trails
Objective: To create a detailed and reliable record of critical system actions for security
analysis and compliance.
Status:
Missing audit context: The new agent enrollment/configuration actions emit operational logs
but do not demonstrate dedicated audit logging with consistently required audit fields
(actor identity, timestamp, action, outcome) for critical actions.

Referred Code
// Hello sends an enrollment request to the gateway.
// This should be called on agent startup to announce the agent and register with the gateway.
func (g *GatewayClient) Hello(ctx context.Context, req *proto.AgentHelloRequest) (*proto.AgentHelloResponse, error) {
	g.mu.RLock()
	client := g.client
	connected := g.connected
	g.mu.RUnlock()

	if !connected || client == nil {
		return nil, ErrGatewayNotConnected
	}

	helloCtx, cancel := context.WithTimeout(ctx, defaultConnectTimeout)
	defer cancel()

	resp, err := client.Hello(helloCtx, req)
	if err != nil {
		g.logger.Error().Err(err).Msg("Failed to send Hello to gateway")
		g.markDisconnected()

		return nil, fmt.Errorf("failed to send Hello: %w", err)
	}

 ... (clipped 50 lines)

⚪ Generic: Secure Error Handling
Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.
Status:
Error detail exposure: Connection/enrollment errors include internal connection details
(e.g., gateway address and upstream error strings) that may be surfaced to user-facing
contexts depending on caller behavior not visible in this diff.

Referred Code
// Connect establishes a connection to the gateway.
func (g *GatewayClient) Connect(ctx context.Context) error {
	g.mu.Lock()
	defer g.mu.Unlock()

	if g.addr == "" {
		return ErrGatewayAddrRequired
	}

	if g.conn != nil && g.connected {
		state := g.conn.GetState()
		if state == connectivity.Ready {
			return nil // Already connected and ready
		}

		// Connection exists but isn't healthy; force reconnect.
		_ = g.conn.Close()
		g.conn = nil
		g.client = nil
		g.connected = false

		if g.securityProvider != nil {
			_ = g.securityProvider.Close()

 ... (clipped 52 lines)

⚪ Generic: Secure Logging Practices
Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.
Status:
Sensitive data in logs: The new logging includes potentially sensitive operational
identifiers and network targets (e.g., tenant_slug, tenant_id, target, port) that could
constitute sensitive tenant/internal infrastructure data depending on deployment policy.

Referred Code
	p.setEnrolled(true)

	p.logger.Info().
		Str("agent_id", helloResp.AgentId).
		Str("gateway_id", helloResp.GatewayId).
		Str("tenant_slug", helloResp.TenantSlug).
		Msg("Successfully enrolled with gateway")

	// Fetch initial config if outdated or not yet fetched
	if helloResp.ConfigOutdated || p.getConfigVersion() == "" {
		p.fetchAndApplyConfig(ctx)
	}
}

// configPollLoop periodically polls for config updates.
func (p *PushLoop) configPollLoop(ctx context.Context) {
	// Wait for initial enrollment before polling
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()

	for !p.isEnrolled() {
		select {
		case <-ctx.Done():

 ... (clipped 127 lines)

⚪ Generic: Security-First Input Validation and Data Handling
Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities
Status:
Unverified remote config: Remote GetConfig check definitions are applied dynamically with
limited validation, and the diff alone does not prove that the gateway-side
authorization/authentication and integrity guarantees for these inputs are enforced
end-to-end.

Referred Code
// fetchAndApplyConfig fetches config from gateway and applies it.
func (p *PushLoop) fetchAndApplyConfig(ctx context.Context) {
	configReq := &proto.AgentConfigRequest{
		AgentId:       p.server.config.AgentID,
		ConfigVersion: p.getConfigVersion(),
	}

	configResp, err := p.gateway.GetConfig(ctx, configReq)
	if err != nil {
		p.logger.Error().Err(err).Msg("Failed to fetch config from gateway")
		return
	}

	// If config hasn't changed, nothing to do
	if configResp.NotModified {
		p.logger.Debug().Str("version", p.getConfigVersion()).Msg("Config not modified")
		return
	}

	// Update intervals from config response
	if configResp.HeartbeatIntervalSec > 0 {

 ... (clipped 162 lines)

Compliance status legend
🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

Previous compliance checks

Compliance check up to commit 8cdb823
(https://github.com/carverauto/serviceradar/commit/8cdb82329f0fb24c24558a22c1a94c72b6444b9d)

Security Compliance

⚪ SSRF via remote config
Description: Remote configuration received from the gateway is converted into active network
targets (e.g., check.Target and check.Port in protoCheckToCheckerConfig, then stored in
p.server.checkerConfs and later executed), which can enable SSRF/port-scanning behavior
from the agent if the gateway/control-plane or its authZ is compromised or misconfigured.
push_loop.go [492-615]

Referred Code
// applyChecks converts proto checks to checker configs and updates the server.
func (p *PushLoop) applyChecks(checks []*proto.AgentCheckConfig) {
	if len(checks) == 0 {
		p.logger.Debug().Msg("No checks to apply")
		return
	}

	p.server.mu.Lock()
	defer p.server.mu.Unlock()

	// Track which checks we've seen (for removing stale checks later)
	seenChecks := make(map[string]bool)

	for _, check := range checks {
		// Guard against nil entries in the checks slice
		if check == nil {
			continue
		}

		if !check.Enabled {
			continue

 ... (clipped 103 lines)

⚪ Sensitive identifier logging
Description: The agent logs multi-tenant identifiers from the enrollment response (including
tenant_id and tenant_slug), which can expose tenant metadata to anyone with access to agent
logs and may violate least-privilege log hygiene in shared or centrally aggregated logging
environments.
gateway_client.go [381-389]

Referred Code
	g.logger.Info().
		Str("agent_id", resp.AgentId).
		Str("gateway_id", resp.GatewayId).
		Str("tenant_id", resp.TenantId).
		Str("tenant_slug", resp.TenantSlug).
		Int32("heartbeat_interval_sec", resp.HeartbeatIntervalSec).
		Bool("config_outdated", resp.ConfigOutdated).
		Msg("Agent enrolled with gateway")

Ticket Compliance

⚪ 🎫 No ticket provided

Codebase Duplication Compliance

⚪ Codebase context is not defined

Custom Compliance

🟢 Generic: Meaningful Naming and Self-Documenting Code
Status: Passed

🟢 Generic: Secure Error Handling
Status: Passed

🔴 Generic: Robust Error Handling and Edge Case Management
Status:
Nil dereference risk: protoCheckToCheckerConfig can return nil for invalid gateway-supplied
checks, but applyChecks uses the returned checkerConf without a nil check, which can cause a
panic instead of graceful handling.

Referred Code
		// Convert proto check to CheckerConfig
		checkerConf := protoCheckToCheckerConfig(check)

		// Check if this config already exists and is unchanged
		if existing, exists := p.server.checkerConfs[check.Name]; exists {
			if existing.Type == checkerConf.Type &&
				existing.Address == checkerConf.Address &&
				existing.Timeout == checkerConf.Timeout {
				// Config unchanged, skip
				continue
			}
		}

🔴 Generic: Security-First Input Validation and Data Handling
Status:
Unhandled invalid input: Gateway-supplied check data is partially validated in
protoCheckToCheckerConfig, but invalid checks can produce nil configs and are not safely
handled in applyChecks, enabling a crash via untrusted external input.

Referred Code
	for _, check := range checks {
		// Guard against nil entries in the checks slice
		if check == nil {
			continue
		}

		if !check.Enabled {
			continue
		}

		seenChecks[check.Name] = true

		// Convert proto check to CheckerConfig
		checkerConf := protoCheckToCheckerConfig(check)

		// Check if this config already exists and is unchanged
		if existing, exists := p.server.checkerConfs[check.Name]; exists {
			if existing.Type == checkerConf.Type &&
				existing.Address == checkerConf.Address &&
				existing.Timeout == checkerConf.Timeout {
				// Config unchanged, skip

 ... (clipped 14 lines)

Compliance check up to commit 6aa63f6
(https://github.com/carverauto/serviceradar/commit/6aa63f620203712994d6db2b85a1185f1ffb4af8)

Security Compliance

🟢 No security concerns identified
No security vulnerabilities detected by AI analysis. Human verification advised for critical
code.

Ticket Compliance

⚪ 🎫 No ticket provided

Codebase Duplication Compliance

⚪ Codebase context is not defined

Custom Compliance

🟢 Generic: Meaningful Naming and Self-Documenting Code
Status: Passed

🔴 Generic: Secure Logging Practices
Status:
Sensitive identifiers logged: Logs include potentially sensitive tenant identifiers and
infrastructure details (e.g., tenant_id, tenant_slug), which violates the requirement that
no sensitive data be present in logs.
gateway_client.go [339-347]

Referred Code
	g.logger.Info().
		Str("agent_id", resp.AgentId).
		Str("gateway_id", resp.GatewayId).
		Str("tenant_id", resp.TenantId).
		Str("tenant_slug", resp.TenantSlug).
		Int32("heartbeat_interval_sec", resp.HeartbeatIntervalSec).
		Bool("config_outdated", resp.ConfigOutdated).
		Msg("Agent enrolled with gateway")

⚪ Generic: Comprehensive Audit Trails
Status:
Missing actor context: Enrollment and config-application actions are logged but without a
consistent "actor" identifier (e.g., user/admin) beyond agent_id, and other critical actions
in the PR (not shown in this diff) may lack audit logging.

Referred Code
	p.logger.Info().Msg("Enrolling with gateway...")

	// Build Hello request
	helloReq := &proto.AgentHelloRequest{
		AgentId:       p.server.config.AgentID,
		Version:       Version, // Agent version from version.go
		Capabilities:  getAgentCapabilities(),
		ConfigVersion: p.configVersion,
	}

	// Send Hello
	helloResp, err := p.gateway.Hello(ctx, helloReq)
	if err != nil {
		p.logger.Error().Err(err).Msg("Failed to enroll with gateway")
		return
	}

	// Update server config with tenant info from gateway
	p.server.config.TenantID = helloResp.TenantId
	p.server.config.TenantSlug = helloResp.TenantSlug

 ... (clipped 98 lines)

⚪ Generic: Robust Error Handling and Edge Case Management
Status:
Nil map risk: applyChecks writes to p.server.checkerConfs without ensuring the map is
initialized, which can panic if checkerConfs is nil depending on server construction not
visible in this diff.

Referred Code
// applyChecks converts proto checks to checker configs and updates the server.
func (p *PushLoop) applyChecks(checks []*proto.AgentCheckConfig) {
	if len(checks) == 0 {
		p.logger.Debug().Msg("No checks to apply")
		return
	}

	p.server.mu.Lock()
	defer p.server.mu.Unlock()

	// Track which checks we've seen (for removing stale checks later)
	seenChecks := make(map[string]bool)

	for _, check := range checks {
		// Guard against nil entries in the checks slice
		if check == nil {
			continue
		}

		if !check.Enabled {
			continue

 ... (clipped 27 lines)

⚪ Generic: Secure Error Handling
Status:
Potentially detailed errors: Connection and enrollment failures are returned with detailed
internal context (e.g., gateway address and gateway-provided rejection message) which may be
user-visible in CLI output and should be reviewed for sensitive detail exposure.
gateway_client.go [79-348]

Referred Code
// Connect establishes a connection to the gateway.
func (g *GatewayClient) Connect(ctx context.Context) error {
	g.mu.Lock()
	defer g.mu.Unlock()

	if g.addr == "" {
		return ErrGatewayAddrRequired
	}

	if g.connected && g.conn != nil {
		return nil // Already connected
	}

	g.logger.Info().Str("addr", g.addr).Msg("Connecting to agent-gateway")

	opts, err := g.buildDialOptions(ctx)
	if err != nil {
		return fmt.Errorf("failed to build dial options: %w", err)
	}

	connectCtx, cancel := context.WithTimeout(ctx, defaultConnectTimeout)

 ... (clipped 249 lines)

⚪ Generic: Security-First Input Validation and Data Handling
Status:
Unvalidated config inputs: JSON config fields loaded from disk (including network endpoints
and timing values) are used without explicit validation/sanitization, and
correctness/security depends on constraints that are not shown in this diff.

Referred Code
// loadConfig loads agent configuration from file, falling back to embedded defaults.
func loadConfig(configPath string) (*agent.ServerConfig, error) {
	var cfg agent.ServerConfig

	// Try to read config file
	data, err := os.ReadFile(configPath)
	if err != nil {
		if os.IsNotExist(err) {
			// Fall back to embedded default config
			data = defaultConfig
		} else {
			return nil, fmt.Errorf("failed to read config file: %w", err)
		}
	}

	if err := json.Unmarshal(data, &cfg); err != nil {
		return nil, fmt.Errorf("failed to parse config: %w", err)
	}

	return &cfg, nil
}

 ... (clipped 70 lines)

Compliance check up to commit 4d3d82c
(https://github.com/carverauto/serviceradar/commit/4d3d82c8155f53aa56358b2d03f7d12efcf49dd5)

Security Compliance

⚪ Insecure gRPC transport
Description: The client explicitly allows falling back to
grpc.WithTransportCredentials(insecure.NewCredentials()) when security is nil/disabled,
which would permit plaintext gRPC and enable MITM/credential interception if this path is
reachable in non-development deployments.
gateway_client.go [108-135]

Referred Code
// buildDialOptions constructs gRPC dial options based on security configuration.
func (g *GatewayClient) buildDialOptions(ctx context.Context) ([]grpc.DialOption, error) {
	var opts []grpc.DialOption

	if g.security != nil && g.security.Mode != "" && g.security.Mode != models.SecurityModeNone {
		// Create security provider using the standard pattern
		provider, err := srgrpc.NewSecurityProvider(ctx, g.security, g.logger)
		if err != nil {
			return nil, fmt.Errorf("failed to create security provider: %w", err)
		}

		// Get client credentials from the provider
		creds, err := provider.GetClientCredentials(ctx)
		if err != nil {
			_ = provider.Close()
			return nil, fmt.Errorf("failed to get client credentials: %w", err)
		}

		g.securityProvider = provider
		opts = append(opts, creds)
	} else {

 ... (clipped 7 lines)

⚪ Unexpected external egress
Description: getSourceIP() performs an outbound UDP dial to the public address 8.8.8.8:80,
creating unexpected external egress that can leak environment/network metadata and may
violate network/security policies in restricted deployments.
push_loop.go [233-250]

Referred Code
// getSourceIP attempts to determine the source IP of this agent.
func (p *PushLoop) getSourceIP() string { // First check if HostIP is configured if p.server.config.HostIP != "" { return p.server.config.HostIP } // Try to get the outbound IP conn, err := net.Dial("udp", "8.8.8.8:80") if err != nil { p.logger.Debug().Err(err).Msg("Failed to determine source IP") return "" } defer conn.Close() localAddr := conn.LocalAddr().(*net.UDPAddr) return localAddr.IP.String() } ``` </details></details></td></tr> <tr><td colspan='2'><strong>Ticket Compliance</strong></td></tr> <tr><td>⚪</td><td><details><summary>🎫 <strong>No ticket provided </strong></summary> - [ ] Create ticket/issue <!-- /create_ticket --create_ticket=true --> </details></td></tr> <tr><td colspan='2'><strong>Codebase Duplication Compliance</strong></td></tr> <tr><td>⚪</td><td><details><summary><strong>Codebase context is not defined </strong></summary> Follow the <a href='https://qodo-merge-docs.qodo.ai/core-abilities/rag_context_enrichment/'>guide</a> to enable codebase context checks. </details></td></tr> <tr><td colspan='2'><strong>Custom Compliance</strong></td></tr> <tr><td rowspan=2>🟢</td><td> <details><summary><strong>Generic: Meaningful Naming and Self-Documenting Code</strong></summary><br> **Objective:** Ensure all identifiers clearly express their purpose and intent, making code <br>self-documenting<br> **Status:** Passed<br> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td> <details><summary><strong>Generic: Secure Error Handling</strong></summary><br> **Objective:** To prevent the leakage of sensitive system information through error messages while <br>providing sufficient detail for internal debugging.<br> **Status:** Passed<br> > Learn more about managing compliance <a 
href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td rowspan=1>🔴</td> <td><details> <summary><strong>Generic: Robust Error Handling and Edge Case Management</strong></summary><br> **Objective:** Ensure comprehensive error handling that provides meaningful context and graceful <br>degradation<br> **Status:** <br><a href='https://github.com/carverauto/serviceradar/pull/2221/files#diff-5f0d59be34ef26b449d7f5fd2b198a29b277936b9708a699f7487415ed6c2785R385-R389'><strong>Nil check panic</strong></a>: <code>applyChecks</code> does not guard against nil entries in the <code>checks</code> slice and will panic when <br>dereferencing <code>check.Enabled</code>.<br> <details open><summary>Referred Code</summary> ```go for _, check := range checks { if !check.Enabled { continue } ``` </details> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td rowspan=3>⚪</td> <td><details> <summary><strong>Generic: Comprehensive Audit Trails</strong></summary><br> **Objective:** To create a detailed and reliable record of critical system actions for security analysis <br>and compliance.<br> **Status:** <br><a href='https://github.com/carverauto/serviceradar/pull/2221/files#diff-5f0d59be34ef26b449d7f5fd2b198a29b277936b9708a699f7487415ed6c2785R64-R157'><strong>Audit scope unclear</strong></a>: The new agent enrollment/config/status actions are logged, but it is not verifiable from <br>the diff alone that these logs meet your audit requirements (e.g., consistent actor <br>identity, timestamps, and outcome recording across all critical actions).<br> <details open><summary>Referred 
Code</summary> ```go // Start begins the push loop. It runs until the context is cancelled. func (p *PushLoop) Start(ctx context.Context) error { p.logger.Info().Dur("interval", p.interval).Msg("Starting push loop") // Initial connection and enrollment attempt if err := p.gateway.Connect(ctx); err != nil { p.logger.Warn().Err(err).Msg("Initial gateway connection failed, will retry") } else { // Connected, try to enroll p.enroll(ctx) } // Start config polling in a separate goroutine go p.configPollLoop(ctx) ticker := time.NewTicker(p.interval) defer ticker.Stop() // Do an initial push immediately (only if enrolled) if p.enrolled { p.pushStatus(ctx) ... (clipped 73 lines) ``` </details> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td><details> <summary><strong>Generic: Secure Logging Practices</strong></summary><br> **Objective:** To ensure logs are useful for debugging and auditing without exposing sensitive <br>information like PII, PHI, or cardholder data.<br> **Status:** <br><a href='https://github.com/carverauto/serviceradar/pull/2221/files#diff-54aa63b6f05aabb0db096685e5be714c0c6d5c5a892310f048ff2ff8a542191fR298-R305'><strong>Identifier logging</strong></a>: The new logs include tenant identifiers (e.g., <code>tenant_id</code> and <code>tenant_slug</code>), and whether <br>these are classified as sensitive/PII cannot be confirmed from the diff alone.<br> <details open><summary>Referred Code</summary> ```go g.logger.Info(). Str("agent_id", resp.AgentId). Str("gateway_id", resp.GatewayId). Str("tenant_id", resp.TenantId). Str("tenant_slug", resp.TenantSlug). Int32("heartbeat_interval_sec", resp.HeartbeatIntervalSec). Bool("config_outdated", resp.ConfigOutdated). 
Msg("Agent enrolled with gateway") ``` </details> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td><details> <summary><strong>Generic: Security-First Input Validation and Data Handling</strong></summary><br> **Objective:** Ensure all data inputs are validated, sanitized, and handled securely to prevent <br>vulnerabilities<br> **Status:** <br><a href='https://github.com/carverauto/serviceradar/pull/2221/files#diff-54aa63b6f05aabb0db096685e5be714c0c6d5c5a892310f048ff2ff8a542191fR112-R133'><strong>Insecure transport option</strong></a>: The gateway client explicitly allows falling back to <code>insecure.NewCredentials()</code> when <br>security mode is none, which may violate expected security posture depending on deployment <br>requirements.<br> <details open><summary>Referred Code</summary> ```go if g.security != nil && g.security.Mode != "" && g.security.Mode != models.SecurityModeNone { // Create security provider using the standard pattern provider, err := srgrpc.NewSecurityProvider(ctx, g.security, g.logger) if err != nil { return nil, fmt.Errorf("failed to create security provider: %w", err) } // Get client credentials from the provider creds, err := provider.GetClientCredentials(ctx) if err != nil { _ = provider.Close() return nil, fmt.Errorf("failed to get client credentials: %w", err) } g.securityProvider = provider opts = append(opts, creds) } else { // Use insecure connection (for development/testing) g.logger.Warn().Msg("Using insecure connection to gateway (no mTLS)") opts = append(opts, grpc.WithTransportCredentials(insecure.NewCredentials())) } ... 
(clipped 1 lines) ``` </details> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td align="center" colspan="2"> <!-- /compliance --update_compliance=true --> </td></tr></tbody></table> </details> <details> <summary>Compliance check up to commit <a href='https://github.com/carverauto/serviceradar/commit/6a52314c098cfc8c95be0596bae269d74bcb49ed'>6a52314</a></summary><br> <table><tbody><tr><td colspan='2'><strong>Security Compliance</strong></td></tr> <tr><td rowspan=2>⚪</td> <td><details><summary><strong>Credential secret exposure </strong></summary><br> <b>Description:</b> <code>GenerateUserCredentials</code> returns <code>CredsFileContent</code> containing the raw user NKey seed (via <br><code>formatCredsFile(userJWT, userSeed)</code>), which is highly sensitive and could be exfiltrated if <br>this value is ever logged, stored, or returned over an insufficiently protected channel.<br> <strong><a href='https://github.com/carverauto/serviceradar/pull/2221/files#diff-84b756d1bdd4d008851fb5859e56f79f44c8083c766ee0314d7862484ce8c12bR54-R120'>user_manager.go [54-120]</a></strong><br> <details open><summary>Referred Code</summary> ```go // GenerateUserCredentials creates NATS credentials for a user in a tenant's account. // The accountSeed is the tenant's account private key (passed from Elixir storage). 
func GenerateUserCredentials( tenantSlug string, accountSeed string, userName string, credType UserCredentialType, permissions *UserPermissions, expirationSeconds int64, ) (*UserCredentials, error) { // Parse account seed to get key pair for signing accountKP, err := nkeys.FromSeed([]byte(accountSeed)) if err != nil { return nil, fmt.Errorf("invalid account seed: %w", err) } accountPublicKey, err := accountKP.PublicKey() if err != nil { return nil, fmt.Errorf("failed to get account public key: %w", err) } ... (clipped 46 lines) ``` </details></details></td></tr> <tr><td><details><summary><strong>Tenant isolation bypass </strong></summary><br> <b>Description:</b> Default publish/subscribe permissions include non-tenant-scoped subjects (e.g., <code>events.></code>, <br><code>logs.></code>, <code>telemetry.></code>) and also build allow-subjects from unvalidated <code>tenantSlug</code> (e.g., <br><code>fmt.Sprintf("%s.>", tenantSlug)</code>), which could enable cross-tenant access if subject <br>mapping is misconfigured or if <code>tenantSlug</code> contains wildcard/special tokens.<br> <strong><a href='https://github.com/carverauto/serviceradar/pull/2221/files#diff-84b756d1bdd4d008851fb5859e56f79f44c8083c766ee0314d7862484ce8c12bR123-R198'>user_manager.go [123-198]</a></strong><br> <details open><summary>Referred Code</summary> ```go // applyUserPermissions sets permissions on user claims based on credential type. 
func applyUserPermissions( claims *jwt.UserClaims, tenantSlug string, credType UserCredentialType, custom *UserPermissions, ) { // Start with type-based defaults switch credType { case CredentialTypeCollector: // Collectors can publish to their tenant's subjects (mapped by account) // and subscribe to responses claims.Pub.Allow.Add("events.>") claims.Pub.Allow.Add("snmp.traps") claims.Pub.Allow.Add("logs.>") claims.Pub.Allow.Add("telemetry.>") claims.Pub.Allow.Add("netflow.>") claims.Sub.Allow.Add("_INBOX.>") claims.Resp = &jwt.ResponsePermission{ MaxMsgs: 1, Expires: time.Minute, ... (clipped 55 lines) ``` </details></details></td></tr> <tr><td colspan='2'><strong>Ticket Compliance</strong></td></tr> <tr><td>⚪</td><td><details><summary>🎫 <strong>No ticket provided </strong></summary> - [ ] Create ticket/issue <!-- /create_ticket --create_ticket=true --> </details></td></tr> <tr><td colspan='2'><strong>Codebase Duplication Compliance</strong></td></tr> <tr><td>⚪</td><td><details><summary><strong>Codebase context is not defined </strong></summary> Follow the <a href='https://qodo-merge-docs.qodo.ai/core-abilities/rag_context_enrichment/'>guide</a> to enable codebase context checks. 
</details></td></tr> <tr><td colspan='2'><strong>Custom Compliance</strong></td></tr> <tr><td rowspan=1>🟢</td><td> <details><summary><strong>Generic: Meaningful Naming and Self-Documenting Code</strong></summary><br> **Objective:** Ensure all identifiers clearly express their purpose and intent, making code <br>self-documenting<br> **Status:** Passed<br> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td rowspan=5>⚪</td> <td><details> <summary><strong>Generic: Comprehensive Audit Trails</strong></summary><br> **Objective:** To create a detailed and reliable record of critical system actions for security analysis <br>and compliance.<br> **Status:** <br><a href='https://github.com/carverauto/serviceradar/pull/2221/files#diff-49eb93c28e2d86d8dcf4ee78fd24bed498cf3fcfaa8e49849a36e70980420087R238-R1087'><strong>Audit logging unknown</strong></a>: The diff introduces APIs for operator/account bootstrap and credential generation, but no <br>corresponding audit logging is visible in the provided hunks, so it cannot be verified <br>that these critical actions are reliably logged with actor context and outcomes.<br> <details open><summary>Referred Code</summary> ```go // CreateTenantAccountRequest is the request to create a new tenant NATS account. 
type CreateTenantAccountRequest struct { state protoimpl.MessageState `protogen:"open.v1"` TenantSlug string `protobuf:"bytes,1,opt,name=tenant_slug,json=tenantSlug,proto3" json:"tenant_slug,omitempty"` // Unique tenant identifier (e.g., "acme-corp") Limits *AccountLimits `protobuf:"bytes,2,opt,name=limits,proto3" json:"limits,omitempty"` // Optional resource limits SubjectMappings []*SubjectMapping `protobuf:"bytes,3,rep,name=subject_mappings,json=subjectMappings,proto3" json:"subject_mappings,omitempty"` // Optional custom subject mappings unknownFields protoimpl.UnknownFields sizeCache protoimpl.SizeCache } func (x *CreateTenantAccountRequest) Reset() { *x = CreateTenantAccountRequest{} mi := &file_nats_account_proto_msgTypes[2] ms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x)) ms.StoreMessageInfo(mi) } func (x *CreateTenantAccountRequest) String() string { return protoimpl.X.MessageStringOf(x) } ... (clipped 829 lines) ``` </details> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td><details> <summary><strong>Generic: Robust Error Handling and Edge Case Management</strong></summary><br> **Objective:** Ensure comprehensive error handling that provides meaningful context and graceful <br>degradation<br> **Status:** <br><a href='https://github.com/carverauto/serviceradar/pull/2221/files#diff-49eb93c28e2d86d8dcf4ee78fd24bed498cf3fcfaa8e49849a36e70980420087R238-R1184'><strong>Error handling unseen</strong></a>: The new gRPC request/response surface for account and credential operations is added, but <br>the diff does not include the service implementation where validation and error handling <br>would occur, so robustness against edge cases cannot be confirmed.<br> <details open><summary>Referred Code</summary> ```go // 
CreateTenantAccountRequest is the request to create a new tenant NATS account. type CreateTenantAccountRequest struct { state protoimpl.MessageState `protogen:"open.v1"` TenantSlug string `protobuf:"bytes,1,opt,name=tenant_slug,json=tenantSlug,proto3" json:"tenant_slug,omitempty"` // Unique tenant identifier (e.g., "acme-corp") Limits *AccountLimits `protobuf:"bytes,2,opt,name=limits,proto3" json:"limits,omitempty"` // Optional resource limits SubjectMappings []*SubjectMapping `protobuf:"bytes,3,rep,name=subject_mappings,json=subjectMappings,proto3" json:"subject_mappings,omitempty"` // Optional custom subject mappings unknownFields protoimpl.UnknownFields sizeCache protoimpl.SizeCache } func (x *CreateTenantAccountRequest) Reset() { *x = CreateTenantAccountRequest{} mi := &file_nats_account_proto_msgTypes[2] ms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x)) ms.StoreMessageInfo(mi) } func (x *CreateTenantAccountRequest) String() string { return protoimpl.X.MessageStringOf(x) } ... (clipped 926 lines) ``` </details> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td><details> <summary><strong>Generic: Secure Error Handling</strong></summary><br> **Objective:** To prevent the leakage of sensitive system information through error messages while <br>providing sufficient detail for internal debugging.<br> **Status:** <br><a href='https://github.com/carverauto/serviceradar/pull/2221/files#diff-49eb93c28e2d86d8dcf4ee78fd24bed498cf3fcfaa8e49849a36e70980420087R983-R1087'><strong>Error exposure risk</strong></a>: Responses like <code>PushAccountJWTResponse.message</code> appear able to carry error details, but <br>without the server-side implementation in the visible diff it cannot be verified that <br>user-facing errors are sanitized and 
sensitive internals are only logged securely.<br> <details open><summary>Referred Code</summary> ```go // PushAccountJWTRequest is the request to push an account JWT to the NATS resolver. type PushAccountJWTRequest struct { state protoimpl.MessageState `protogen:"open.v1"` AccountPublicKey string `protobuf:"bytes,1,opt,name=account_public_key,json=accountPublicKey,proto3" json:"account_public_key,omitempty"` // The account's public NKey (starts with 'A') AccountJwt string `protobuf:"bytes,2,opt,name=account_jwt,json=accountJwt,proto3" json:"account_jwt,omitempty"` // The signed account JWT to push unknownFields protoimpl.UnknownFields sizeCache protoimpl.SizeCache } func (x *PushAccountJWTRequest) Reset() { *x = PushAccountJWTRequest{} mi := &file_nats_account_proto_msgTypes[13] ms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x)) ms.StoreMessageInfo(mi) } func (x *PushAccountJWTRequest) String() string { return protoimpl.X.MessageStringOf(x) } func (*PushAccountJWTRequest) ProtoMessage() {} ... 
(clipped 84 lines) ``` </details> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td><details> <summary><strong>Generic: Secure Logging Practices</strong></summary><br> **Objective:** To ensure logs are useful for debugging and auditing without exposing sensitive <br>information like PII, PHI, or cardholder data.<br> **Status:** <br><a href='https://github.com/carverauto/serviceradar/pull/2221/files#diff-0c66da94363b25ecb9e4766c8f12cde387c1874aeb9099447c609c4e867b4fe0R194-R240'><strong>Secrets may be logged</strong></a>: The added models include highly sensitive fields (e.g., <br><code>CollectorDownloadResult.nats_creds_file</code> and <code>mtls_bundle</code>) and the diff does not show <br>safeguards preventing these values from being logged, so secure logging compliance cannot <br>be verified.<br> <details open><summary>Referred Code</summary> ```go // CollectorPackage represents a collector deployment package with NATS credentials. 
type CollectorPackage struct { PackageID string `json:"package_id"` TenantID string `json:"tenant_id"` CollectorType CollectorType `json:"collector_type"` UserName string `json:"user_name"` Site string `json:"site,omitempty"` Hostname string `json:"hostname,omitempty"` Status CollectorPackageStatus `json:"status"` NatsCredentialID string `json:"nats_credential_id,omitempty"` DownloadTokenHash string `json:"download_token_hash,omitempty"` DownloadTokenExpiresAt time.Time `json:"download_token_expires_at,omitempty"` DownloadedAt *time.Time `json:"downloaded_at,omitempty"` DownloadedByIP string `json:"downloaded_by_ip,omitempty"` InstalledAt *time.Time `json:"installed_at,omitempty"` RevokedAt *time.Time `json:"revoked_at,omitempty"` RevokeReason string `json:"revoke_reason,omitempty"` ErrorMessage string `json:"error_message,omitempty"` ConfigOverrides map[string]interface{} `json:"config_overrides,omitempty"` CreatedAt time.Time `json:"created_at"` UpdatedAt time.Time `json:"updated_at"` ... 
(clipped 26 lines) ``` </details> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td><details> <summary><strong>Generic: Security-First Input Validation and Data Handling</strong></summary><br> **Objective:** Ensure all data inputs are validated, sanitized, and handled securely to prevent <br>vulnerabilities<br> **Status:** <br><a href='https://github.com/carverauto/serviceradar/pull/2221/files#diff-0c66da94363b25ecb9e4766c8f12cde387c1874aeb9099447c609c4e867b4fe0R194-R240'><strong>Sensitive data handling</strong></a>: New data structures introduce sensitive credential material and token-related fields <br>(e.g., <code>download_token_hash</code>, <code>.creds</code> content, and <code>mtls_bundle</code>), but the diff does not <br>include the creation/transport/storage code needed to verify encryption, access controls, <br>and validation of externally supplied identifiers.<br> <details open><summary>Referred Code</summary> ```go // CollectorPackage represents a collector deployment package with NATS credentials. 
type CollectorPackage struct { PackageID string `json:"package_id"` TenantID string `json:"tenant_id"` CollectorType CollectorType `json:"collector_type"` UserName string `json:"user_name"` Site string `json:"site,omitempty"` Hostname string `json:"hostname,omitempty"` Status CollectorPackageStatus `json:"status"` NatsCredentialID string `json:"nats_credential_id,omitempty"` DownloadTokenHash string `json:"download_token_hash,omitempty"` DownloadTokenExpiresAt time.Time `json:"download_token_expires_at,omitempty"` DownloadedAt *time.Time `json:"downloaded_at,omitempty"` DownloadedByIP string `json:"downloaded_by_ip,omitempty"` InstalledAt *time.Time `json:"installed_at,omitempty"` RevokedAt *time.Time `json:"revoked_at,omitempty"` RevokeReason string `json:"revoke_reason,omitempty"` ErrorMessage string `json:"error_message,omitempty"` ConfigOverrides map[string]interface{} `json:"config_overrides,omitempty"` CreatedAt time.Time `json:"created_at"` UpdatedAt time.Time `json:"updated_at"` ... (clipped 26 lines) ``` </details> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td align="center" colspan="2"> <!-- /compliance --update_compliance=true --> </td></tr></tbody></table> </details>
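The nil-guard fix flagged under "Robust Error Handling" above (`applyChecks` panicking on nil entries in the `checks` slice) is a one-line change. A minimal, runnable Go sketch; note that the `CheckConfig` struct and its fields here are illustrative stand-ins, not the actual agent types:

```go
package main

import "fmt"

// CheckConfig is a stand-in for the agent's check configuration type
// (the real struct has more fields).
type CheckConfig struct {
	Name    string
	Enabled bool
}

// applyChecks skips nil entries before dereferencing them, avoiding
// the panic flagged in the compliance finding.
func applyChecks(checks []*CheckConfig) []string {
	var applied []string
	for _, check := range checks {
		if check == nil || !check.Enabled {
			continue
		}
		applied = append(applied, check.Name)
	}
	return applied
}

func main() {
	checks := []*CheckConfig{
		{Name: "icmp", Enabled: true},
		nil, // would panic without the guard
		{Name: "snmp", Enabled: false},
	}
	fmt.Println(applyChecks(checks)) // [icmp]
}
```

The `check == nil` test must come first in the condition so short-circuit evaluation prevents the `check.Enabled` dereference.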
qodo-code-review[bot] commented 2026-01-03 09:50:14 +00:00 (Migrated from github.com)


Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3706934588
Original created: 2026-01-03T09:50:14Z

PR Code Suggestions

Latest suggestions up to 5c8ccf6

Category | Suggestion | Impact
Possible issue
Fix remaining state transitions

Update transition_state calls in several update actions to use the correct
transition name instead of the target state atom to prevent runtime errors.

elixir/serviceradar_core/lib/serviceradar/infrastructure/gateway.ex [254-363]

 update :heartbeat_timeout do
   description "Mark gateway as degraded due to heartbeat timeout"
   require_atomic? false
 
-  change transition_state(:degraded)
+  change transition_state(:heartbeat_timeout)
   change set_attribute(:is_healthy, false)
   change set_attribute(:updated_at, &DateTime.utc_now/0)
   change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :gateway, new_state: :degraded}
 end
 
 update :lose_connection do
   description "Mark gateway as offline due to lost connection"
   require_atomic? false
 
-  change transition_state(:offline)
+  change transition_state(:lose_connection)
   change set_attribute(:is_healthy, false)
   change set_attribute(:updated_at, &DateTime.utc_now/0)
   change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :gateway, new_state: :offline}
 end
 
 update :restore_health do
   description "Restore gateway to healthy state"
   require_atomic? false
 
-  change transition_state(:healthy)
+  change transition_state(:restore_health)
   change set_attribute(:is_healthy, true)
   change set_attribute(:last_seen, &DateTime.utc_now/0)
   change set_attribute(:updated_at, &DateTime.utc_now/0)
   change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :gateway, new_state: :healthy}
 end
 
 update :end_maintenance do
   description "End maintenance mode, return to healthy"
   require_atomic? false
 
-  change transition_state(:healthy)
+  change transition_state(:end_maintenance)
   change set_attribute(:is_healthy, true)
   change set_attribute(:updated_at, &DateTime.utc_now/0)
   change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :gateway, new_state: :healthy}
 end
 
 update :finish_draining do
   description "Finish draining, go offline"
   require_atomic? false
 
-  change transition_state(:offline)
+  change transition_state(:finish_draining)
   change set_attribute(:is_healthy, false)
   change set_attribute(:updated_at, &DateTime.utc_now/0)
   change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :gateway, new_state: :offline}
 end
 
 update :mark_unhealthy do
   description "Mark gateway as unhealthy (legacy - use degrade)"
   require_atomic? false
 
-  change transition_state(:degraded)
+  change transition_state(:degrade)
   change set_attribute(:is_healthy, false)
   change set_attribute(:updated_at, &DateTime.utc_now/0)
   change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :gateway, new_state: :degraded}
 end

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 9


Why: The suggestion correctly identifies a bug where transition_state is called with the target state instead of the transition name, which would cause a runtime error.

High
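The distinction this suggestion relies on, namely that state machines are driven by transition *names* while target states are derived from the transition definition, can be sketched outside the Ash DSL. A minimal Go analogy; the machine, state names, and `apply` helper are all hypothetical and exist only to illustrate the lookup failure:

```go
package main

import (
	"errors"
	"fmt"
)

// transition defines a named edge in a simple state machine,
// analogous to a declared transition in the resource DSL.
type transition struct{ from, to string }

// machine maps transition *names* (not target states) to edges.
var machine = map[string]transition{
	"heartbeat_timeout": {from: "healthy", to: "degraded"},
	"restore_health":    {from: "degraded", to: "healthy"},
}

// apply looks up a transition by name. Passing the target state
// (e.g. "degraded") instead of the name ("heartbeat_timeout")
// misses the map and errors, mirroring the runtime failure the
// suggestion describes.
func apply(state, name string) (string, error) {
	t, ok := machine[name]
	if !ok {
		return "", fmt.Errorf("no transition named %q", name)
	}
	if t.from != state {
		return "", errors.New("invalid source state for transition")
	}
	return t.to, nil
}

func main() {
	next, err := apply("healthy", "heartbeat_timeout")
	fmt.Println(next, err) // degraded <nil>

	_, err = apply("healthy", "degraded") // target state, not a name
	fmt.Println(err)
}
```

The same key/value confusion is what `transition_state(:degraded)` introduces: `:degraded` is a valid *state* but not a declared *transition*, so the lookup fails at runtime.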
Enforce streaming chunk ordering
Suggestion Impact: The commit updated the stream reducer state to include expected_idx and pinned_total_chunks, added validation that total_chunks cannot change mid-stream, and enforced that each incoming chunk_index matches the expected sequential index, updating the reducer continuation/halt tuples accordingly.

code diff:

-    {total_services, saw_final?, _stream_agent_id} =
-      Enum.reduce_while(request_stream, {0, false, nil}, fn chunk, {acc, _saw_final?, stream_agent_id} ->
+    {total_services, saw_final?, _stream_agent_id, _expected_idx, _pinned_total_chunks} =
+      Enum.reduce_while(request_stream, {0, false, nil, 0, nil}, fn chunk,
+                                                                 {acc, _saw_final?, stream_agent_id, expected_idx,
+                                                                  pinned_total_chunks} ->
         agent_id =
           case chunk.agent_id do
             nil ->
@@ -343,8 +349,19 @@
           raise GRPC.RPCError, status: :invalid_argument, message: "total_chunks must be > 0"
         end
 
+        pinned_total_chunks =
+          case pinned_total_chunks do
+            nil -> total_chunks
+            ^total_chunks -> total_chunks
+            _ -> raise GRPC.RPCError, status: :invalid_argument, message: "total_chunks changed mid-stream"
+          end
+
         if chunk_index < 0 or chunk_index >= total_chunks do
           raise GRPC.RPCError, status: :invalid_argument, message: "invalid chunk_index"
+        end
+
+        if chunk_index != expected_idx do
+          raise GRPC.RPCError, status: :invalid_argument, message: "unexpected chunk_index"
         end
 
         Logger.debug(
@@ -396,9 +413,9 @@
           end
 
           record_push_metrics(agent_id, new_total)
-          {:halt, {new_total, true, stream_agent_id}}
+          {:halt, {new_total, true, stream_agent_id, expected_idx + 1, pinned_total_chunks}}
         else
-          {:cont, {new_total, false, stream_agent_id}}
+          {:cont, {new_total, false, stream_agent_id, expected_idx + 1, pinned_total_chunks}}
         end
       end)

Enforce chunk ordering and contiguity in stream_status/2 by tracking the
expected_chunk_index and pinning total_chunks to prevent partial data ingestion.

elixir/serviceradar_agent_gateway/lib/serviceradar_agent_gateway/agent_gateway_server.ex [301-403]

 {total_services, saw_final?, _stream_agent_id} =
-  Enum.reduce_while(request_stream, {0, false, nil}, fn chunk, {acc, _saw_final?, stream_agent_id} ->
+  Enum.reduce_while(request_stream, {0, false, nil, 0, nil}, fn chunk,
+                                                              {acc, _saw_final?, stream_agent_id, expected_idx,
+                                                               pinned_total_chunks} ->
     ...
     chunk_index = chunk.chunk_index || 0
     total_chunks = chunk.total_chunks || 0
 
     if total_chunks <= 0 do
       raise GRPC.RPCError, status: :invalid_argument, message: "total_chunks must be > 0"
     end
 
+    pinned_total_chunks =
+      case pinned_total_chunks do
+        nil -> total_chunks
+        ^total_chunks -> total_chunks
+        _ -> raise GRPC.RPCError, status: :invalid_argument, message: "total_chunks changed mid-stream"
+      end
+
     if chunk_index < 0 or chunk_index >= total_chunks do
       raise GRPC.RPCError, status: :invalid_argument, message: "invalid chunk_index"
+    end
+
+    if chunk_index != expected_idx do
+      raise GRPC.RPCError, status: :invalid_argument, message: "unexpected chunk_index"
     end
     ...
     if chunk.is_final do
       if chunk_index != total_chunks - 1 do
         raise GRPC.RPCError,
           status: :invalid_argument,
           message: "final chunk_index does not match total_chunks"
       end
 
       record_push_metrics(agent_id, new_total)
-      {:halt, {new_total, true, stream_agent_id}}
+      {:halt, {new_total, true, stream_agent_id, expected_idx + 1, pinned_total_chunks}}
     else
-      {:cont, {new_total, false, stream_agent_id}}
+      {:cont, {new_total, false, stream_agent_id, expected_idx + 1, pinned_total_chunks}}
     end
   end)

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 8


Why: The suggestion correctly identifies a vulnerability where out-of-order or missing stream chunks are not detected, potentially leading to silent data loss.

Medium
Unify version reporting source

Propagate the build-time Version from the main package to the agent package to
ensure correct version reporting during gateway enrollment.

cmd/agent/main.go [42-45]

 // Version is set at build time via ldflags
 //
 //nolint:gochecknoglobals // Required for build-time ldflags injection
 var Version = "dev"
 
+...
+
+// Ensure the agent package reports the same build version during enrollment.
+agent.Version = Version
+

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 7


Why: The suggestion correctly identifies that the build-time Version from the main package is not propagated to the agent package, causing incorrect version reporting during enrollment.

Medium
Incremental [*]
Skip malformed service entries
Suggestion Impact: The reduce function in push_status was changed to use function-head pattern matching for %Monitoring.GatewayServiceStatus{} and to ignore all other entries via an _other clause, matching the suggestion and preventing invalid entries from triggering exception-based control flow.

code diff:

@@ -252,25 +252,29 @@
     processed_count =
       services
       |> Enum.reject(&is_nil/1)
-      |> Enum.reduce(0, fn service, acc ->
-        try do
-          process_service_status(service, metadata)
-          acc + 1
-        rescue
-          e in GRPC.RPCError ->
-            Logger.warning(
-              "Dropping invalid service status from agent #{metadata.agent_id}: #{e.status} #{e.message}"
-            )
-
-            acc
-
-          e ->
-            Logger.warning(
-              "Dropping service status from agent #{metadata.agent_id} due to error: #{Exception.message(e)}"
-            )
-
-            acc
-        end
+      |> Enum.reduce(0, fn
+        %Monitoring.GatewayServiceStatus{} = service, acc ->
+          try do
+            process_service_status(service, metadata)
+            acc + 1
+          rescue
+            e in GRPC.RPCError ->
+              Logger.warning(
+                "Dropping invalid service status from agent #{metadata.agent_id}: #{e.status} #{e.message}"
+              )
+
+              acc
+
+            e ->
+              Logger.warning(
+                "Dropping service status from agent #{metadata.agent_id} due to error: #{Exception.message(e)}"
+              )
+
+              acc
+          end
+
+        _other, acc ->
+          acc
       end)

In push_status, use pattern matching on %Monitoring.GatewayServiceStatus{}
within Enum.reduce to filter invalid service entries. This avoids using
exceptions for control flow and mitigates a potential resource exhaustion
attack.

elixir/serviceradar_agent_gateway/lib/serviceradar_agent_gateway/agent_gateway_server.ex [252-278]

 processed_count =
   services
   |> Enum.reject(&is_nil/1)
-  |> Enum.reduce(0, fn service, acc ->
-    try do
-      process_service_status(service, metadata)
-      acc + 1
-    rescue
-      e in GRPC.RPCError ->
-        Logger.warning(
-          "Dropping invalid service status from agent #{metadata.agent_id}: #{e.status} #{e.message}"
-        )
+  |> Enum.reduce(0, fn
+    %Monitoring.GatewayServiceStatus{} = service, acc ->
+      try do
+        process_service_status(service, metadata)
+        acc + 1
+      rescue
+        e in GRPC.RPCError ->
+          Logger.warning(
+            "Dropping invalid service status from agent #{metadata.agent_id}: #{e.status} #{e.message}"
+          )
 
-        acc
+          acc
 
-      e ->
-        Logger.warning(
-          "Dropping service status from agent #{metadata.agent_id} due to error: #{Exception.message(e)}"
-        )
+        e ->
+          Logger.warning(
+            "Dropping service status from agent #{metadata.agent_id} due to error: #{Exception.message(e)}"
+          )
 
-        acc
-    end
+          acc
+      end
+
+    _other, acc ->
+      acc
   end)
 
 if processed_count == 0 and service_count > 0 do
   raise GRPC.RPCError, status: :invalid_argument, message: "no valid service statuses"
 end

[Suggestion processed]

Suggestion importance[1-10]: 8


Why: The suggestion correctly identifies a potential resource exhaustion vector and proposes a more robust and secure implementation using pattern matching. This avoids expensive exception handling for invalid data types in a loop, which is a good security and performance practice.

Medium
Bound user-controlled string sizes
Suggestion Impact: The commit adds trimming/normalization of `service_type` and `source`, validates their maximum byte size, disallows control characters, and then uses the sanitized values in the constructed status map.

code diff:

@@ -422,6 +439,18 @@
         _ -> ""
       end
 
+    service_type =
+      case service.service_type do
+        st when is_binary(st) -> String.trim(st)
+        _ -> ""
+      end
+
+    source =
+      case service.source do
+        src when is_binary(src) -> String.trim(src)
+        _ -> ""
+      end
+
     cond do
       service_name == "" ->
         raise GRPC.RPCError, status: :invalid_argument, message: "service_name is required"
@@ -431,6 +460,12 @@
 
       String.contains?(service_name, ["\n", "\r", "\t"]) ->
         raise GRPC.RPCError, status: :invalid_argument, message: "service_name contains invalid characters"
+
+      byte_size(service_type) > 64 or String.contains?(service_type, ["\n", "\r", "\t"]) ->
+        raise GRPC.RPCError, status: :invalid_argument, message: "service_type is invalid"
+
+      byte_size(source) > 64 or String.contains?(source, ["\n", "\r", "\t"]) ->
+        raise GRPC.RPCError, status: :invalid_argument, message: "source is invalid"
 
       true ->
         :ok
@@ -458,16 +493,28 @@
         end
       end)
 
+    response_time =
+      case service.response_time do
+        rt when is_integer(rt) and rt >= 0 and rt <= 86_400_000 ->
+          rt
+
+        rt when is_integer(rt) and rt > 86_400_000 ->
+          86_400_000
+
+        _ ->
+          0
+      end
+
     status = %{
       service_name: service_name,
       available: service.available == true,
       message: message,
-      service_type: service.service_type,
-      response_time: service.response_time,
+      service_type: service_type,
+      response_time: response_time,
       agent_id: metadata.agent_id,
       gateway_id: metadata.gateway_id,
       partition: normalize_partition(service.partition || metadata.partition),
-      source: service.source,
+      source: source,
       kv_store_id: service.kv_store_id || metadata.kv_store_id,

In process_service_status, add size and character validation for the
service_type and source fields. This prevents potential resource exhaustion and
data ingestion issues from malicious or malformed client input.

elixir/serviceradar_agent_gateway/lib/serviceradar_agent_gateway/agent_gateway_server.ex [418-423]

 service_name =
   case service.service_name do
     name when is_binary(name) -> String.trim(name)
     nil -> ""
     _ -> ""
   end
 
+service_type =
+  case service.service_type do
+    st when is_binary(st) -> String.trim(st)
+    _ -> ""
+  end
+
+source =
+  case service.source do
+    src when is_binary(src) -> String.trim(src)
+    _ -> ""
+  end
+
+cond do
+  service_name == "" ->
+    raise GRPC.RPCError, status: :invalid_argument, message: "service_name is required"
+
+  byte_size(service_name) > 255 ->
+    raise GRPC.RPCError, status: :invalid_argument, message: "service_name is too long"
+
+  String.contains?(service_name, ["\n", "\r", "\t"]) ->
+    raise GRPC.RPCError, status: :invalid_argument, message: "service_name contains invalid characters"
+
+  byte_size(service_type) > 64 or String.contains?(service_type, ["\n", "\r", "\t"]) ->
+    raise GRPC.RPCError, status: :invalid_argument, message: "service_type is invalid"
+
+  byte_size(source) > 64 or String.contains?(source, ["\n", "\r", "\t"]) ->
+    raise GRPC.RPCError, status: :invalid_argument, message: "source is invalid"
+
+  true ->
+    :ok
+end
+

[Suggestion processed]

Suggestion importance[1-10]: 8


Why: The suggestion correctly points out that other user-controlled string fields like service_type and source are not being validated. Adding size and character validation for these fields is a good security practice to prevent resource exhaustion and data ingestion issues.

Medium
Prevent test goroutine leaks

Prevent a test goroutine leak by using a cancellable context when starting the
service and deferring a call to service.Stop() to ensure cleanup.

pkg/checker/snmp/service_test.go [162-201]

 initialDataChan := make(chan DataPoint)
 newDataChan := make(chan DataPoint)
 // Channel to signal when GetResults is called for the new target.
 getResultsCalled := make(chan struct{})
 ...
 // Test adding target
-err = service.Start(context.Background())
+startCtx, cancelStart := context.WithCancel(context.Background())
+defer cancelStart()
+defer func() { require.NoError(t, service.Stop()) }()
+defer close(initialDataChan)
+defer close(newDataChan)
+
+err = service.Start(startCtx)
 require.NoError(t, err)

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 7


Why: The suggestion correctly identifies a goroutine leak in the test due to a non-cancellable context, which can cause test flakiness, and provides a correct fix to ensure proper cleanup.

Medium
Handle RPC agent map shapes
Suggestion Impact: The reduce/caching logic was changed to fetch agent_id and the other fields using both atom and string map keys, matching the suggestion to robustly handle RPC result shapes.

code diff:

-      Map.put(acc, agent.agent_id, %{
-        agent_id: agent.agent_id,
-        tenant_id: Map.get(agent, :tenant_id),
-        tenant_slug: Map.get(agent, :tenant_slug, "default"),
-        last_seen: Map.get(agent, :last_seen),
-        last_seen_mono: Map.get(agent, :last_seen_mono),
-        service_count: Map.get(agent, :service_count, 0),
-        partition: Map.get(agent, :partition),
-        source_ip: Map.get(agent, :source_ip)
+      agent_id = Map.get(agent, :agent_id) || Map.get(agent, "agent_id")
+
+      Map.put(acc, agent_id, %{
+        agent_id: agent_id,
+        tenant_id: Map.get(agent, :tenant_id) || Map.get(agent, "tenant_id"),
+        tenant_slug: Map.get(agent, :tenant_slug) || Map.get(agent, "tenant_slug") || "default",
+        last_seen: Map.get(agent, :last_seen) || Map.get(agent, "last_seen"),
+        last_seen_mono: Map.get(agent, :last_seen_mono) || Map.get(agent, "last_seen_mono"),
+        service_count: Map.get(agent, :service_count) || Map.get(agent, "service_count") || 0,
+        partition: Map.get(agent, :partition) || Map.get(agent, "partition"),
+        source_ip: Map.get(agent, :source_ip) || Map.get(agent, "source_ip")
       })

Modify the agent cache builder to handle both atom and string keys from RPC
results. This prevents incorrect data processing due to serialization
differences between nodes.

web-ng/lib/serviceradar_web_ng_web/live/infrastructure_live/index.ex [739-749]

-Map.put(acc, agent.agent_id, %{
-  agent_id: agent.agent_id,
-  tenant_id: Map.get(agent, :tenant_id),
-  tenant_slug: Map.get(agent, :tenant_slug, "default"),
-  last_seen: Map.get(agent, :last_seen),
-  last_seen_mono: Map.get(agent, :last_seen_mono),
-  service_count: Map.get(agent, :service_count, 0),
-  partition: Map.get(agent, :partition),
-  source_ip: Map.get(agent, :source_ip)
+agent_id = Map.get(agent, :agent_id) || Map.get(agent, "agent_id")
+
+Map.put(acc, agent_id, %{
+  agent_id: agent_id,
+  tenant_id: Map.get(agent, :tenant_id) || Map.get(agent, "tenant_id"),
+  tenant_slug: Map.get(agent, :tenant_slug) || Map.get(agent, "tenant_slug") || "default",
+  last_seen: Map.get(agent, :last_seen) || Map.get(agent, "last_seen"),
+  last_seen_mono: Map.get(agent, :last_seen_mono) || Map.get(agent, "last_seen_mono"),
+  service_count: Map.get(agent, :service_count) || Map.get(agent, "service_count") || 0,
+  partition: Map.get(agent, :partition) || Map.get(agent, "partition"),
+  source_ip: Map.get(agent, :source_ip) || Map.get(agent, "source_ip")
 })

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 7


Why: This is a valid suggestion that improves robustness by defensively handling both atom and string keys for data received via RPC, which can have inconsistent serialization. This prevents a potential bug where agents might be incorrectly processed.

Medium
Fix embedded port detection
Suggestion Impact: The commit replaces the fallback url.Parse("tcp://"+target) port detection with logic that strips any path/query suffix and uses net.SplitHostPort, and adds the needed strings import.

code diff:

@@ -22,6 +22,7 @@
 	"fmt"
 	"net"
 	"net/url"
+	"strings"
 	"sync"
 	"time"
 
@@ -685,9 +686,17 @@
 			hasEmbeddedPort = true
 		} else if parsed, err := url.Parse(target); err == nil && parsed.Host != "" && parsed.Port() != "" {
 			hasEmbeddedPort = true
-		} else if parsed, err := url.Parse("tcp://" + target); err == nil && parsed.Host != "" && parsed.Port() != "" {
+		} else {
 			// Handles "host:port/path" inputs without an explicit scheme.
-			hasEmbeddedPort = true
+			hostPort := target
+			if i := strings.IndexAny(hostPort, "/?"); i >= 0 {
+				hostPort = hostPort[:i]
+			}
+
+			// net.SplitHostPort requires bracketed IPv6; be explicit to avoid silently dropping checks.
+			if host, p, err := net.SplitHostPort(hostPort); err == nil && host != "" && p != "" {
+				hasEmbeddedPort = true
+			}

Improve embedded port detection for check targets by using net.SplitHostPort on
the host part of the target string, making it more resilient to various formats
like unbracketed IPv6 addresses.

pkg/agent/push_loop.go [688-691]

-} else if parsed, err := url.Parse("tcp://" + target); err == nil && parsed.Host != "" && parsed.Port() != "" {
+} else {
 	// Handles "host:port/path" inputs without an explicit scheme.
-	hasEmbeddedPort = true
+	hostPort := target
+	if i := strings.IndexAny(hostPort, "/?"); i >= 0 {
+		hostPort = hostPort[:i]
+	}
+
+	// net.SplitHostPort requires bracketed IPv6; be explicit to avoid silently dropping checks.
+	if host, p, err := net.SplitHostPort(hostPort); err == nil && host != "" && p != "" {
+		hasEmbeddedPort = true
+	}
 }

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 6


Why: The suggestion correctly identifies that the current port detection logic can fail for unbracketed IPv6 addresses, and proposes a more robust implementation using net.SplitHostPort.

Low
Security
Validate client-provided timing data
Suggestion Impact: The commit added response_time validation/normalization logic (clamping to 86_400_000 ms and defaulting to 0 for invalid values) and updated the status map to use the sanitized response_time instead of the raw client-provided value.

code diff:

+    response_time =
+      case service.response_time do
+        rt when is_integer(rt) and rt >= 0 and rt <= 86_400_000 ->
+          rt
+
+        rt when is_integer(rt) and rt > 86_400_000 ->
+          86_400_000
+
+        _ ->
+          0
+      end
+
     status = %{
       service_name: service_name,
       available: service.available == true,
       message: message,
-      service_type: service.service_type,
-      response_time: service.response_time,
+      service_type: service_type,
+      response_time: response_time,
       agent_id: metadata.agent_id,

Validate and normalize the client-provided response_time to be a non-negative
integer within a reasonable range before it is processed.

elixir/serviceradar_agent_gateway/lib/serviceradar_agent_gateway/agent_gateway_server.ex [461-475]

+response_time =
+  case service.response_time do
+    rt when is_integer(rt) and rt >= 0 and rt <= 86_400_000 ->
+      rt
+
+    rt when is_integer(rt) and rt > 86_400_000 ->
+      86_400_000
+
+    _ ->
+      0
+  end
+
 status = %{
   service_name: service_name,
   available: service.available == true,
   message: message,
   service_type: service.service_type,
-  response_time: service.response_time,
+  response_time: response_time,
   agent_id: metadata.agent_id,
   gateway_id: metadata.gateway_id,
   partition: normalize_partition(service.partition || metadata.partition),
   source: service.source,
   kv_store_id: service.kv_store_id || metadata.kv_store_id,
   timestamp: metadata.timestamp,
   tenant_id: metadata.tenant_id,
   tenant_slug: metadata.tenant_slug
 }

[Suggestion processed]

Suggestion importance[1-10]: 7


Why: The suggestion correctly points out that response_time from the client is not validated, which could lead to data corruption or processing errors downstream.

Medium
General
Preserve JSON parsing error details
Suggestion Impact: The JSON parsing error handling was updated to wrap and return the actual decode error (err) when decoding fails with a non-EOF error, improving debugging information.

code diff:

 	if err := dec.Decode(&struct{}{}); err == nil {
 		return nil, fmt.Errorf("failed to parse config: %w", errConfigTrailingData)
 	} else if !errors.Is(err, io.EOF) {
-		return nil, fmt.Errorf("failed to parse config: %w", errConfigTrailingData)
+		return nil, fmt.Errorf("failed to parse config: %w", err)

Improve JSON parsing error handling by returning the specific underlying error
instead of a generic one, which will aid in debugging configuration file issues.

cmd/agent/main.go [129-133]

 if err := dec.Decode(&struct{}{}); err == nil {
 	return nil, fmt.Errorf("failed to parse config: %w", errConfigTrailingData)
 } else if !errors.Is(err, io.EOF) {
-	return nil, fmt.Errorf("failed to parse config: %w", errConfigTrailingData)
+	return nil, fmt.Errorf("failed to parse config: %w", err)
 }

[Suggestion processed]

Suggestion importance[1-10]: 6


Why: The suggestion correctly points out that a generic error is returned for JSON parsing issues, which hides the specific cause and makes debugging more difficult.

Low

Previous suggestions

Suggestions up to commit c96fbf9
Category | Suggestion | Impact
Possible issue
Fix state machine transitions

Fix the transition_state calls in the state machine update actions to use the
correct transition names (e.g., :degrade) instead of the target state atoms
(e.g., :degraded) to prevent runtime errors.

elixir/serviceradar_core/lib/serviceradar/infrastructure/gateway.ex [243-351]

 update :degrade do
   description "Mark gateway as degraded (having issues)"
   argument :reason, :string
   require_atomic? false
 
-  change transition_state(:degraded)
+  change transition_state(:degrade)
   change set_attribute(:is_healthy, false)
   change set_attribute(:updated_at, &DateTime.utc_now/0)
   change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :gateway, new_state: :degraded}
 end
 
 update :go_offline do
   description "Mark gateway as offline"
   argument :reason, :string
   require_atomic? false
 
-  change transition_state(:offline)
+  change transition_state(:go_offline)
   change set_attribute(:is_healthy, false)
   change set_attribute(:updated_at, &DateTime.utc_now/0)
   change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :gateway, new_state: :offline}
 end
 
 update :recover do
   description "Start recovery process for degraded/offline gateway"
   require_atomic? false
 
-  change transition_state(:recovering)
+  change transition_state(:recover)
   change set_attribute(:updated_at, &DateTime.utc_now/0)
   change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :gateway, new_state: :recovering}
 end
 
 update :start_maintenance do
   description "Put gateway into maintenance mode"
   require_atomic? false
 
-  change transition_state(:maintenance)
+  change transition_state(:start_maintenance)
   change set_attribute(:updated_at, &DateTime.utc_now/0)
   change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :gateway, new_state: :maintenance}
 end
 
 update :start_draining do
   description "Start graceful shutdown (draining)"
   require_atomic? false
 
-  change transition_state(:draining)
+  change transition_state(:start_draining)
   change set_attribute(:updated_at, &DateTime.utc_now/0)
   change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :gateway, new_state: :draining}
 end
 
 update :deactivate do
   description "Deactivate a gateway (admin action)"
   require_atomic? false
 
-  change transition_state(:inactive)
+  change transition_state(:deactivate)
   change set_attribute(:is_healthy, false)
   change set_attribute(:updated_at, &DateTime.utc_now/0)
   change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :gateway, new_state: :inactive}
 end
Suggestion importance[1-10]: 9


Why: This suggestion correctly identifies a critical bug where transition_state is called with the target state instead of the transition name, which would cause runtime crashes.

High
Fix gRPC connection initialization

Replace the manual gRPC connection logic using grpc.NewClient with the standard
grpc.DialContext and grpc.WithBlock for a simpler and more robust
implementation.

pkg/agent/gateway_client.go [130-154]

-conn, err := grpc.NewClient(g.addr, opts...)
+opts = append(opts, grpc.WithBlock())
+
+conn, err := grpc.DialContext(connectCtx, g.addr, opts...)
 if err != nil {
 	if provider != nil {
 		_ = provider.Close()
 	}
 	return fmt.Errorf("failed to connect to gateway at %s: %w", g.addr, err)
 }
 
-conn.Connect()
-for state := conn.GetState(); state != connectivity.Ready; state = conn.GetState() {
-	if state == connectivity.Shutdown {
-		_ = conn.Close()
-		if provider != nil {
-			_ = provider.Close()
-		}
-		return fmt.Errorf("failed to connect to gateway at %s: %w", g.addr, ErrConnectionShutdown)
-	}
-	if !conn.WaitForStateChange(connectCtx, state) {
-		_ = conn.Close()
-		if provider != nil {
-			_ = provider.Close()
-		}
-		return fmt.Errorf("failed to connect to gateway at %s: %w", g.addr, connectCtx.Err())
-	}
-}
-
Suggestion importance[1-10]: 7


Why: The suggestion correctly identifies that using grpc.DialContext with grpc.WithBlock is the idiomatic and more robust way to establish a gRPC connection, simplifying the code and improving maintainability.

Medium
Prevent false agent activity
Suggestion Impact: The code now uses Map.get(agent, :last_seen) and Map.get(agent, :last_seen_mono) instead of agent.last_seen and System.monotonic_time(:millisecond), preventing false agent activity on initial cache load.

code diff:

-    all_agents
-    |> Enum.reduce(%{}, fn agent, acc ->
-      Map.put(acc, agent.agent_id, %{
-        agent_id: agent.agent_id,
-        tenant_id: Map.get(agent, :tenant_id),
-        tenant_slug: Map.get(agent, :tenant_slug, "default"),
-        last_seen: agent.last_seen,
-        last_seen_mono: System.monotonic_time(:millisecond),
-        service_count: Map.get(agent, :service_count, 0),
-        partition: Map.get(agent, :partition),
-        source_ip: Map.get(agent, :source_ip)
-      })
-    end)
+  all_agents
+  |> Enum.reduce(%{}, fn agent, acc ->
+    Map.put(acc, agent.agent_id, %{
+      agent_id: agent.agent_id,
+      tenant_id: Map.get(agent, :tenant_id),
+      tenant_slug: Map.get(agent, :tenant_slug, "default"),
+      last_seen: Map.get(agent, :last_seen),
+      last_seen_mono: Map.get(agent, :last_seen_mono),
+      service_count: Map.get(agent, :service_count, 0),
+      partition: Map.get(agent, :partition),
+      source_ip: Map.get(agent, :source_ip)
+    })
+  end)

In load_initial_agents_cache, set last_seen_mono from the agent data from the
tracker instead of the current monotonic time to avoid incorrectly marking stale
agents as active.

web-ng/lib/serviceradar_web_ng_web/live/infrastructure_live/index.ex [738-750]

 all_agents
 |> Enum.reduce(%{}, fn agent, acc ->
   Map.put(acc, agent.agent_id, %{
     agent_id: agent.agent_id,
     tenant_id: Map.get(agent, :tenant_id),
     tenant_slug: Map.get(agent, :tenant_slug, "default"),
-    last_seen: agent.last_seen,
-    last_seen_mono: System.monotonic_time(:millisecond),
+    last_seen: Map.get(agent, :last_seen),
+    last_seen_mono: Map.get(agent, :last_seen_mono),
     service_count: Map.get(agent, :service_count, 0),
     partition: Map.get(agent, :partition),
     source_ip: Map.get(agent, :source_ip)
   })
 end)
Suggestion importance[1-10]: 7


Why: The suggestion correctly identifies a logic flaw where setting last_seen_mono to the current time on initial load can falsely mark stale agents as active, leading to incorrect UI state.

Medium
Incremental [*]
Accept host:port/path targets
Suggestion Impact: The commit adds an additional url.Parse("tcp://" + target) branch to detect an embedded port in host:port/path inputs without an explicit scheme, preventing such targets from being dropped.

code diff:

@@ -684,6 +684,9 @@
 		if host, p, err := net.SplitHostPort(target); err == nil && host != "" && p != "" {
 			hasEmbeddedPort = true
 		} else if parsed, err := url.Parse(target); err == nil && parsed.Host != "" && parsed.Port() != "" {
+			hasEmbeddedPort = true
+		} else if parsed, err := url.Parse("tcp://" + target); err == nil && parsed.Host != "" && parsed.Port() != "" {
+			// Handles "host:port/path" inputs without an explicit scheme.
 			hasEmbeddedPort = true
 		}

Fix the embedded-port detection logic to correctly handle targets that include
both a port and a path (e.g., example.com:443/health). This prevents valid
TCP/gRPC checks from being silently dropped.

pkg/agent/push_loop.go [681-692]

 if (checkerType == tcpCheckType || checkerType == grpcCheckType) && port == 0 {
 	// Allow "host:port" or URL targets that already include a port.
 	hasEmbeddedPort := false
 	if host, p, err := net.SplitHostPort(target); err == nil && host != "" && p != "" {
 		hasEmbeddedPort = true
 	} else if parsed, err := url.Parse(target); err == nil && parsed.Host != "" && parsed.Port() != "" {
 		hasEmbeddedPort = true
+	} else if parsed, err := url.Parse("tcp://" + target); err == nil && parsed.Host != "" && parsed.Port() != "" {
+		// Handles "host:port/path" inputs without an explicit scheme.
+		hasEmbeddedPort = true
 	}
 	if !hasEmbeddedPort {
 		return nil
 	}
 }
Suggestion importance[1-10]: 8


Why: This suggestion correctly identifies a bug where valid check targets with a port and path (e.g., example.com:8080/health) are silently ignored, leading to a failure in monitoring. The fix is accurate and resolves this issue.

Medium
Prevent collectors from stopping early
Suggestion Impact: The code was changed to return an error ("service not started") when the service lifecycle context is nil, preventing collectors from being started with the request context and stopping prematurely.

code diff:

@@ -322,7 +322,7 @@
 	serviceCtx := s.serviceCtx
 	s.mu.RUnlock()
 	if serviceCtx == nil {
-		serviceCtx = ctx
+		return errors.New("service not started")
 	}

Prevent SNMP collectors from stopping prematurely by ensuring they are always
started with the service's lifecycle context. If the service is not running,
fail fast instead of using a potentially short-lived request context.

pkg/checker/snmp/service.go [321-331]

 s.mu.RLock()
 serviceCtx := s.serviceCtx
 s.mu.RUnlock()
 if serviceCtx == nil {
-	serviceCtx = ctx
+	return errors.New("service not started")
 }
 
 // Start collector with the service context so it stops on service shutdown.
 if err := collector.Start(serviceCtx); err != nil {
 	return fmt.Errorf("%w: %s", errFailedToStartCollector, target.Name)
 }
Suggestion importance[1-10]: 8

__

Why: The suggestion correctly identifies a critical bug where an SNMP collector's lifecycle could be tied to a short-lived request context, causing it to terminate prematurely. Failing fast prevents this and ensures collectors run for their intended duration.

Medium
Fail when all items drop
Suggestion Impact: The commit replaced Enum.each with Enum.reduce to track processed_count, raising a GRPC.RPCError when processed_count == 0 but service_count > 0, and updated record_push_metrics to use processed_count instead of service_count.

code diff:

+    processed_count =
+      services
+      |> Enum.reject(&is_nil/1)
+      |> Enum.reduce(0, fn service, acc ->
+        try do
+          process_service_status(service, metadata)
+          acc + 1
+        rescue
+          e in GRPC.RPCError ->
+            Logger.warning(
+              "Dropping invalid service status from agent #{metadata.agent_id}: #{e.status} #{e.message}"
+            )
+
+            acc
+
+          e ->
+            Logger.warning(
+              "Dropping service status from agent #{metadata.agent_id} due to error: #{Exception.message(e)}"
+            )
+
+            acc
+        end
+      end)
+
+    if processed_count == 0 and service_count > 0 do
+      raise GRPC.RPCError, status: :invalid_argument, message: "no valid service statuses"
+    end
 
     # Record metrics
-    record_push_metrics(agent_id, service_count)
+    record_push_metrics(agent_id, processed_count)
 

Modify the push_status function to track the number of successfully processed
service statuses. If all statuses are dropped due to errors, the gRPC request
should fail instead of returning a success response.

elixir/serviceradar_agent_gateway/lib/serviceradar_agent_gateway/agent_gateway_server.ex [252-273]

-services
-|> Enum.reject(&is_nil/1)
-|> Enum.each(fn service ->
-  try do
-    process_service_status(service, metadata)
-  rescue
-    e in GRPC.RPCError ->
-      Logger.warning(
-        "Dropping invalid service status from agent #{metadata.agent_id}: #{e.status} #{e.message}"
-      )
+processed_count =
+  services
+  |> Enum.reject(&is_nil/1)
+  |> Enum.reduce(0, fn service, acc ->
+    try do
+      process_service_status(service, metadata)
+      acc + 1
+    rescue
+      e in GRPC.RPCError ->
+        Logger.warning(
+          "Dropping invalid service status from agent #{metadata.agent_id}: #{e.status} #{e.message}"
+        )
 
-    e ->
-      Logger.warning(
-        "Dropping service status from agent #{metadata.agent_id} due to error: #{Exception.message(e)}"
-      )
-  end
-end)
+        acc
+
+      e ->
+        Logger.warning(
+          "Dropping service status from agent #{metadata.agent_id} due to error: #{Exception.message(e)}"
+        )
+
+        acc
+    end
+  end)
+
+if processed_count == 0 and service_count > 0 do
+  raise GRPC.RPCError, status: :invalid_argument, message: "no valid service statuses"
+end
 
 # Record metrics
-record_push_metrics(agent_id, service_count)
+record_push_metrics(agent_id, processed_count)
 
 %Monitoring.GatewayStatusResponse{received: true}

[Suggestion processed]

Suggestion importance[1-10]: 8

__

Why: This suggestion correctly identifies a silent data loss scenario where the client receives a success response even if all submitted data was invalid and dropped, improving the robustness of the gRPC endpoint.

Medium
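The count-then-fail pattern above generalizes beyond Elixir. A minimal Go sketch of the same idea, with illustrative names (`processBatch`, `errNoValidStatuses` are not the actual ServiceRadar API): drop invalid entries individually, but fail the whole request when a non-empty batch yields zero survivors.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errNoValidStatuses mirrors the gateway's invalid_argument failure when
// every submitted status is dropped.
var errNoValidStatuses = errors.New("no valid service statuses")

// processBatch validates each status, counts the ones that survive, and
// fails only when a non-empty batch produces zero valid entries.
func processBatch(statuses []string) (int, error) {
	processed := 0
	for _, s := range statuses {
		if strings.TrimSpace(s) == "" {
			continue // drop invalid entry, keep processing the rest
		}
		processed++
	}
	if processed == 0 && len(statuses) > 0 {
		return 0, errNoValidStatuses
	}
	return processed, nil
}

func main() {
	n, err := processBatch([]string{"snmp", "", "icmp"})
	fmt.Println(n, err) // 2 <nil>
	_, err = processBatch([]string{"", "  "})
	fmt.Println(err) // no valid service statuses
}
```

Note that an empty batch still succeeds: only a batch where *everything* was dropped signals client error.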
Avoid reconnect loops in idle
Suggestion Impact: Updated IsConnected to return true when the connection state is either connectivity.Ready or connectivity.Idle, matching the suggested switch-based logic.

code diff:

@@ -250,7 +250,12 @@
 	if !g.connected || g.conn == nil {
 		return false
 	}
-	return g.conn.GetState() == connectivity.Ready
+	switch g.conn.GetState() {
+	case connectivity.Ready, connectivity.Idle:
+		return true
+	default:
+		return false
+	}

Update IsConnected to consider the connectivity.Idle state as connected. This
prevents unnecessary reconnect loops when the gRPC connection is temporarily
idle but still healthy.

pkg/agent/gateway_client.go [250-253]

 if !g.connected || g.conn == nil {
 	return false
 }
-return g.conn.GetState() == connectivity.Ready
+switch g.conn.GetState() {
+case connectivity.Ready, connectivity.Idle:
+	return true
+default:
+	return false
+}
Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies that treating an Idle gRPC connection as disconnected can cause unnecessary reconnect attempts, improving efficiency and reducing load on the agent and gateway.

Medium
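The rationale: gRPC parks unused channels in Idle and transparently redials on the next RPC, so Idle is not a failure state. A self-contained sketch of the check with a mock state enum (the real code uses google.golang.org/grpc/connectivity; the `State` type here is illustrative):

```go
package main

import "fmt"

// State mimics gRPC's connectivity states for illustration only.
type State int

const (
	StateIdle State = iota
	StateConnecting
	StateReady
	StateTransientFailure
	StateShutdown
)

// isConnected treats both Ready and Idle as healthy: an Idle channel will
// transition back to Ready on the next RPC, so tearing it down and
// redialing would be wasted work.
func isConnected(connected bool, s State) bool {
	if !connected {
		return false
	}
	switch s {
	case StateReady, StateIdle:
		return true
	default:
		return false
	}
}

func main() {
	fmt.Println(isConnected(true, StateIdle))             // true
	fmt.Println(isConnected(true, StateTransientFailure)) // false
	fmt.Println(isConnected(false, StateReady))           // false
}
```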
Reject missing service names
Suggestion Impact: The commit replaced `to_string() |> String.trim()` with a case expression that returns "" for nil/non-binary values and trims only binaries, enabling the existing validation to reject missing service_name instead of accepting "nil".

code diff:

@@ -406,9 +416,11 @@
     # Tenant and gateway_id come from server-side metadata (mTLS cert + server identity)
     # NOT from the service message - this prevents spoofing
     service_name =
-      service.service_name
-      |> to_string()
-      |> String.trim()
+      case service.service_name do
+        name when is_binary(name) -> String.trim(name)
+        nil -> ""
+        _ -> ""
+      end

Explicitly handle nil values for service.service_name to prevent them from being
processed as the string "nil". The current implementation allows this, which can
lead to invalid service records.

elixir/serviceradar_agent_gateway/lib/serviceradar_agent_gateway/agent_gateway_server.ex [408-425]

 service_name =
-  service.service_name
-  |> to_string()
-  |> String.trim()
+  case service.service_name do
+    name when is_binary(name) -> String.trim(name)
+    nil -> ""
+    _ -> ""
+  end
 
 cond do
   service_name == "" ->
     raise GRPC.RPCError, status: :invalid_argument, message: "service_name is required"
 
   byte_size(service_name) > 255 ->
     raise GRPC.RPCError, status: :invalid_argument, message: "service_name is too long"
 
   String.contains?(service_name, ["\n", "\r", "\t"]) ->
     raise GRPC.RPCError, status: :invalid_argument, message: "service_name contains invalid characters"
 
   true ->
     :ok
 end

[Suggestion processed]

Suggestion importance[1-10]: 7

__

Why: This is a valid bug fix that prevents data pollution by correctly handling nil service_name values, which would otherwise be processed as the string "nil".

Medium
Suggestions up to commit b7127ef
Category | Suggestion | Impact
Incremental [*]
Restrict monotonic fallback activity
Suggestion Impact: The commit updates agent_active? to gate the monotonic fallback with `and is_nil(agent[:last_seen])`, matching the suggestion's intent and preventing monotonic timestamps from overriding stale wall-clock last_seen values. It also reuses agent_active? for agent cache pruning, aligning staleness rules with the UI logic.

code diff:

-      is_integer(agent[:last_seen_mono]) ->
+      is_integer(agent[:last_seen_mono]) and is_nil(agent[:last_seen]) ->
         now_mono = System.monotonic_time(:millisecond)
         delta_ms = max(now_mono - agent[:last_seen_mono], 0)
         delta_ms < @stale_threshold_ms

In agent_active?, restrict the monotonic time fallback to only apply when the
wall-clock last_seen timestamp is nil to ensure correct staleness detection.

web-ng/lib/serviceradar_web_ng_web/live/infrastructure_live/index.ex [858-866]

     delta_ms = max(now_ms - last_seen_ms, 0)
     delta_ms < @stale_threshold_ms
 
-  is_integer(agent[:last_seen_mono]) ->
+  is_integer(agent[:last_seen_mono]) and is_nil(agent[:last_seen]) ->
     now_mono = System.monotonic_time(:millisecond)
     delta_ms = max(now_mono - agent[:last_seen_mono], 0)
     delta_ms < @stale_threshold_ms
Suggestion importance[1-10]: 8

__

Why: The suggestion correctly identifies a subtle bug where the monotonic time fallback could incorrectly mark an agent as active if a wall-clock time exists but is stale, improving the accuracy of staleness detection.

Medium
Bound service shutdown with timeout
Suggestion Impact: The commit removed the unbounded deferred server.Stop(context.Background()) and instead calls server.Stop() with a context timeout inside the shutdown goroutine, plus replaces time.After with an explicit timer. It also adds timed server.Stop() calls on the errChan path, further ensuring shutdown is bounded.

code diff:

@@ -177,11 +177,6 @@
 	if err := server.Start(pushCtx); err != nil {
 		return fmt.Errorf("failed to start agent services: %w", err)
 	}
-	defer func() {
-		if err := server.Stop(context.Background()); err != nil {
-			log.Error().Err(err).Msg("Error stopping agent services")
-		}
-	}()
 
 	// Handle shutdown signals
 	sigChan := make(chan os.Signal, 1)
@@ -204,21 +199,42 @@
 		shutdownDone := make(chan struct{})
 
 		go func() {
+			defer close(shutdownDone)
+
 			cancel()
 			pushLoop.Stop()
 			<-errChan
-			close(shutdownDone)
+
+			stopCtx, stopCancel := context.WithTimeout(context.Background(), shutdownTimeout)
+			defer stopCancel()
+			if err := server.Stop(stopCtx); err != nil {
+				log.Error().Err(err).Msg("Error stopping agent services")
+			}
 		}()
+
+		timer := time.NewTimer(shutdownTimeout)
+		defer timer.Stop()
 
 		select {
 		case <-shutdownDone:
-		case <-time.After(shutdownTimeout):
+		case <-timer.C:
 			return fmt.Errorf("%w after %s", errShutdownTimeout, shutdownTimeout)
 		}
 
 	case err := <-errChan:
 		if err != nil && !errors.Is(err, context.Canceled) {
+			stopCtx, stopCancel := context.WithTimeout(context.Background(), 10*time.Second)
+			defer stopCancel()
+			if stopErr := server.Stop(stopCtx); stopErr != nil {
+				log.Error().Err(stopErr).Msg("Error stopping agent services")
+			}
 			return fmt.Errorf("push loop error: %w", err)
+		}
+
+		stopCtx, stopCancel := context.WithTimeout(context.Background(), 10*time.Second)
+		defer stopCancel()
+		if stopErr := server.Stop(stopCtx); stopErr != nil {
+			log.Error().Err(stopErr).Msg("Error stopping agent services")
 		}

Move the server.Stop() call into the timed shutdown block to ensure it is
bounded by the shutdownTimeout, preventing the process from hanging during
termination.

cmd/agent/main.go [202-217]

 // Bound shutdown so the process can't hang forever (includes pushLoop.Stop()).
 const shutdownTimeout = 10 * time.Second
 shutdownDone := make(chan struct{})
 
 go func() {
+	defer close(shutdownDone)
+
 	cancel()
 	pushLoop.Stop()
 	<-errChan
-	close(shutdownDone)
+
+	stopCtx, stopCancel := context.WithTimeout(context.Background(), shutdownTimeout)
+	defer stopCancel()
+	if err := server.Stop(stopCtx); err != nil {
+		log.Error().Err(err).Msg("Error stopping agent services")
+	}
 }()
+
+timer := time.NewTimer(shutdownTimeout)
+defer timer.Stop()
 
 select {
 case <-shutdownDone:
-case <-time.After(shutdownTimeout):
+case <-timer.C:
 	return fmt.Errorf("%w after %s", errShutdownTimeout, shutdownTimeout)
 }

[Suggestion processed]

Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies that server.Stop() is not covered by the shutdown timeout, which could cause the process to hang, and proposes a valid fix to include it in the timed shutdown sequence.

Medium
Clamp skewed timestamps in pruning
Suggestion Impact: The commit updated the pruned_gateways_cache pruning logic to clamp delta_ms using max(now_ms - last_ms, 0) and removed the delta_ms < 0 check, preventing active gateways from being pruned due to clock skew. It also included additional related changes to agent pruning logic.

code diff:

+    pruned_gateways_cache =
+      socket.assigns.gateways_cache
+      |> Enum.reject(fn {_id, gw} ->
+        last_ms = parse_timestamp_to_ms(Map.get(gw, :last_heartbeat))
+
+        delta_ms =
+          if is_integer(last_ms) do
+            max(now_ms - last_ms, 0)
+          else
+            nil
+          end
+
+        not is_integer(delta_ms) or delta_ms > @stale_threshold_ms
+      end)
+      |> Map.new()

To prevent incorrectly pruning active gateways due to clock skew, clamp negative
delta_ms values to 0 in the pruned_gateways_cache logic.

web-ng/lib/serviceradar_web_ng_web/live/infrastructure_live/index.ex [97-106]

   pruned_gateways_cache =
     socket.assigns.gateways_cache
     |> Enum.reject(fn {_id, gw} ->
       last_ms = parse_timestamp_to_ms(Map.get(gw, :last_heartbeat))
 
-      delta_ms = if is_integer(last_ms), do: now_ms - last_ms, else: nil
+      delta_ms =
+        if is_integer(last_ms) do
+          max(now_ms - last_ms, 0)
+        else
+          nil
+        end
 
-      not is_integer(delta_ms) or delta_ms < 0 or delta_ms > @stale_threshold_ms
+      not is_integer(delta_ms) or delta_ms > @stale_threshold_ms
     end)
     |> Map.new()
Suggestion importance[1-10]: 7

__

Why: This is a valid bug fix for handling clock skew, which could cause active gateways to be incorrectly pruned, but the same logic is already correctly implemented in compute_gateways, making this a consistency fix.

Medium
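The clamp is worth spelling out: a `last_heartbeat` slightly in the future (skew between the gateway's clock and the web node's) previously produced a negative delta, which the old `delta_ms < 0` check treated as prune-worthy. Clamping to zero keeps such gateways alive while still pruning genuinely stale or missing timestamps. A Go sketch with illustrative names and threshold:

```go
package main

import (
	"fmt"
	"time"
)

const staleThresholdMs = 90_000

// shouldPrune clamps negative deltas to zero so a heartbeat timestamp
// slightly ahead of the local clock never causes an active gateway to be
// pruned. A missing timestamp still prunes.
func shouldPrune(lastHeartbeatMs *int64, nowMs int64) bool {
	if lastHeartbeatMs == nil {
		return true
	}
	delta := nowMs - *lastHeartbeatMs
	if delta < 0 {
		delta = 0 // clamp skew instead of treating it as stale
	}
	return delta > staleThresholdMs
}

func main() {
	now := time.Now().UnixMilli()
	future := now + 2_000 // heartbeat 2s ahead of the local clock
	fmt.Println(shouldPrune(&future, now)) // false
}
```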
Avoid dropping agents on skew
Suggestion Impact: The pruning logic was changed to avoid dropping entities due to negative deltas: gateways now clamp delta_ms with `max(..., 0)` and no longer prune on `delta_ms < 0`. For agents, instead of directly computing delta_ms, the code now delegates pruning to `agent_active?/2`, which already clamps negative deltas and applies consistent staleness rules, achieving the same goal (avoid pruning due to clock skew) though via a different implementation.

code diff:

+    pruned_gateways_cache =
+      socket.assigns.gateways_cache
+      |> Enum.reject(fn {_id, gw} ->
+        last_ms = parse_timestamp_to_ms(Map.get(gw, :last_heartbeat))
+
+        delta_ms =
+          if is_integer(last_ms) do
+            max(now_ms - last_ms, 0)
+          else
+            nil
+          end
+
+        not is_integer(delta_ms) or delta_ms > @stale_threshold_ms
+      end)
+      |> Map.new()
+
+    pruned_agents_cache =
+      socket.assigns.agents_cache
+      |> Enum.reject(fn {_id, agent} ->
+        # Use the same staleness rules as the UI (wall clock first, monotonic fallback)
+        not agent_active?(agent, now_ms)
+      end)
+      |> Map.new()
 
     gateways = compute_gateways(pruned_gateways_cache)
 
@@ -859,7 +861,7 @@
         delta_ms = max(now_ms - last_seen_ms, 0)
         delta_ms < @stale_threshold_ms
 
-      is_integer(agent[:last_seen_mono]) ->
+      is_integer(agent[:last_seen_mono]) and is_nil(agent[:last_seen]) ->
         now_mono = System.monotonic_time(:millisecond)
         delta_ms = max(now_mono - agent[:last_seen_mono], 0)
         delta_ms < @stale_threshold_ms

To prevent incorrectly pruning active agents due to clock skew, clamp negative
delta_ms values to 0 in the pruned_agents_cache logic.

web-ng/lib/serviceradar_web_ng_web/live/infrastructure_live/index.ex [108-117]

   pruned_agents_cache =
     socket.assigns.agents_cache
     |> Enum.reject(fn {_id, agent} ->
       last_ms = agent_last_seen_ms(agent)
 
-      delta_ms = if is_integer(last_ms), do: now_ms - last_ms, else: nil
+      delta_ms =
+        if is_integer(last_ms) do
+          max(now_ms - last_ms, 0)
+        else
+          nil
+        end
 
-      not is_integer(delta_ms) or delta_ms < 0 or delta_ms > @stale_threshold_ms
+      not is_integer(delta_ms) or delta_ms > @stale_threshold_ms
     end)
     |> Map.new()
Suggestion importance[1-10]: 7

__

Why: This is a valid bug fix for handling clock skew, which could cause active agents to be incorrectly pruned, but the same logic is already correctly implemented in agent_active?, making this a consistency fix.

Medium
Possible issue
Bind collectors to service context
Suggestion Impact: The commit moved the serviceCtx lookup to before starting the collector and changed collector.Start to use serviceCtx (falling back to ctx when nil), aligning collector lifetime with the service context.

code diff:

-	// Start collector
-	if err := collector.Start(ctx); err != nil {
+	s.mu.RLock()
+	serviceCtx := s.serviceCtx
+	s.mu.RUnlock()
+	if serviceCtx == nil {
+		serviceCtx = ctx
+	}
+
+	// Start collector with the service context so it stops on service shutdown.
+	if err := collector.Start(serviceCtx); err != nil {
 		return fmt.Errorf("%w: %s", errFailedToStartCollector, target.Name)
 	}
 
@@ -363,12 +370,6 @@
 		Str("host_name", target.Name).
 		Msg("Initialized service status")
 
-	s.mu.RLock()
-	serviceCtx := s.serviceCtx
-	s.mu.RUnlock()
-	if serviceCtx == nil {
-		serviceCtx = ctx
-	}
 	go s.processResults(serviceCtx, target.Name, results, aggregator)
 

Start the SNMP collector with the service's long-lived context (serviceCtx)
instead of the caller-provided context to prevent resource leaks when the
service is stopped or restarted.

pkg/checker/snmp/service.go [322-372]

-if err := collector.Start(ctx); err != nil {
-	return fmt.Errorf("%w: %s", errFailedToStartCollector, target.Name)
-}
-...
 s.mu.RLock()
 serviceCtx := s.serviceCtx
 s.mu.RUnlock()
 if serviceCtx == nil {
 	serviceCtx = ctx
 }
+
+if err := collector.Start(serviceCtx); err != nil {
+	return fmt.Errorf("%w: %s", errFailedToStartCollector, target.Name)
+}
+...
 go s.processResults(serviceCtx, target.Name, results, aggregator)
Suggestion importance[1-10]: 8

__

Why: The suggestion correctly identifies a potential goroutine leak where a collector started via AddTarget might not be tied to the service's lifecycle, and proposes using the service's context to fix it.

Medium
Prune agents using activity predicate
Suggestion Impact: The commit updated pruned_agents_cache to reject agents based on `not agent_active?(agent, now_ms)` instead of re-implementing staleness checks, aligning pruning with the UI logic. It also refined agent_active?/2 to only use the monotonic timestamp fallback when wall-clock `:last_seen` is nil, further ensuring consistent activity evaluation.

code diff:

+    pruned_agents_cache =
+      socket.assigns.agents_cache
+      |> Enum.reject(fn {_id, agent} ->
+        # Use the same staleness rules as the UI (wall clock first, monotonic fallback)
+        not agent_active?(agent, now_ms)
+      end)
+      |> Map.new()
 
     gateways = compute_gateways(pruned_gateways_cache)
 
@@...
Imported GitHub PR comment. Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3706934588
Original created: 2026-01-03T09:50:14Z

PR Code Suggestions ✨

Latest suggestions up to 5c8ccf6

Category | Suggestion | Impact
Possible issue
Fix remaining state transitions

Update transition_state calls in several update actions to use the correct transition name instead of the target state atom to prevent runtime errors.

elixir/serviceradar_core/lib/serviceradar/infrastructure/gateway.ex [254-363]

 update :heartbeat_timeout do
   description "Mark gateway as degraded due to heartbeat timeout"
   require_atomic? false
-  change transition_state(:degraded)
+  change transition_state(:heartbeat_timeout)
   change set_attribute(:is_healthy, false)
   change set_attribute(:updated_at, &DateTime.utc_now/0)
   change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :gateway, new_state: :degraded}
 end

 update :lose_connection do
   description "Mark gateway as offline due to lost connection"
   require_atomic? false
-  change transition_state(:offline)
+  change transition_state(:lose_connection)
   change set_attribute(:is_healthy, false)
   change set_attribute(:updated_at, &DateTime.utc_now/0)
   change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :gateway, new_state: :offline}
 end

 update :restore_health do
   description "Restore gateway to healthy state"
   require_atomic? false
-  change transition_state(:healthy)
+  change transition_state(:restore_health)
   change set_attribute(:is_healthy, true)
   change set_attribute(:last_seen, &DateTime.utc_now/0)
   change set_attribute(:updated_at, &DateTime.utc_now/0)
   change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :gateway, new_state: :healthy}
 end

 update :end_maintenance do
   description "End maintenance mode, return to healthy"
   require_atomic? false
-  change transition_state(:healthy)
+  change transition_state(:end_maintenance)
   change set_attribute(:is_healthy, true)
   change set_attribute(:updated_at, &DateTime.utc_now/0)
   change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :gateway, new_state: :healthy}
 end

 update :finish_draining do
   description "Finish draining, go offline"
   require_atomic? false
-  change transition_state(:offline)
+  change transition_state(:finish_draining)
   change set_attribute(:is_healthy, false)
   change set_attribute(:updated_at, &DateTime.utc_now/0)
   change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :gateway, new_state: :offline}
 end

 update :mark_unhealthy do
   description "Mark gateway as unhealthy (legacy - use degrade)"
   require_atomic? false
-  change transition_state(:degraded)
+  change transition_state(:degrade)
   change set_attribute(:is_healthy, false)
   change set_attribute(:updated_at, &DateTime.utc_now/0)
   change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :gateway, new_state: :degraded}
 end

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 9

__

Why: The suggestion correctly identifies a bug where transition_state is called with the target state instead of the transition name, which would cause a runtime error.

High
Enforce streaming chunk ordering
Suggestion Impact: The commit updated the stream reducer state to include expected_idx and pinned_total_chunks, added validation that total_chunks cannot change mid-stream, and enforced that each incoming chunk_index must match the expected sequential index, updating the reducer continuation/halt tuples accordingly.

code diff:

-    {total_services, saw_final?, _stream_agent_id} =
-      Enum.reduce_while(request_stream, {0, false, nil}, fn chunk, {acc, _saw_final?, stream_agent_id} ->
+    {total_services, saw_final?, _stream_agent_id, _expected_idx, _pinned_total_chunks} =
+      Enum.reduce_while(request_stream, {0, false, nil, 0, nil}, fn chunk,
+                                                                   {acc, _saw_final?, stream_agent_id, expected_idx,
+                                                                    pinned_total_chunks} ->
         agent_id =
           case chunk.agent_id do
             nil ->
@@ -343,8 +349,19 @@
           raise GRPC.RPCError, status: :invalid_argument, message: "total_chunks must be > 0"
         end
 
+        pinned_total_chunks =
+          case pinned_total_chunks do
+            nil -> total_chunks
+            ^total_chunks -> total_chunks
+            _ -> raise GRPC.RPCError, status: :invalid_argument, message: "total_chunks changed mid-stream"
+          end
+
         if chunk_index < 0 or chunk_index >= total_chunks do
           raise GRPC.RPCError, status: :invalid_argument, message: "invalid chunk_index"
+        end
+
+        if chunk_index != expected_idx do
+          raise GRPC.RPCError, status: :invalid_argument, message: "unexpected chunk_index"
         end
 
         Logger.debug(
@@ -396,9 +413,9 @@
           end
 
           record_push_metrics(agent_id, new_total)
-          {:halt, {new_total, true, stream_agent_id}}
+          {:halt, {new_total, true, stream_agent_id, expected_idx + 1, pinned_total_chunks}}
         else
-          {:cont, {new_total, false, stream_agent_id}}
+          {:cont, {new_total, false, stream_agent_id, expected_idx + 1, pinned_total_chunks}}
         end
       end)

Enforce chunk ordering and contiguity in stream_status/2 by tracking the expected_chunk_index and pinning total_chunks to prevent partial data ingestion.

elixir/serviceradar_agent_gateway/lib/serviceradar_agent_gateway/agent_gateway_server.ex [301-403]

 {total_services, saw_final?, _stream_agent_id} =
-  Enum.reduce_while(request_stream, {0, false, nil}, fn chunk, {acc, _saw_final?, stream_agent_id} ->
+  Enum.reduce_while(request_stream, {0, false, nil, 0, nil}, fn chunk,
+                                                                {acc, _saw_final?, stream_agent_id, expected_idx,
+                                                                 pinned_total_chunks} ->
   ...
   chunk_index = chunk.chunk_index || 0
   total_chunks = chunk.total_chunks || 0
 
   if total_chunks <= 0 do
     raise GRPC.RPCError, status: :invalid_argument, message: "total_chunks must be > 0"
   end
 
+  pinned_total_chunks =
+    case pinned_total_chunks do
+      nil -> total_chunks
+      ^total_chunks -> total_chunks
+      _ -> raise GRPC.RPCError, status: :invalid_argument, message: "total_chunks changed mid-stream"
+    end
+
   if chunk_index < 0 or chunk_index >= total_chunks do
     raise GRPC.RPCError, status: :invalid_argument, message: "invalid chunk_index"
+  end
+
+  if chunk_index != expected_idx do
+    raise GRPC.RPCError, status: :invalid_argument, message: "unexpected chunk_index"
   end
   ...
   if chunk.is_final do
     if chunk_index != total_chunks - 1 do
       raise GRPC.RPCError, status: :invalid_argument, message: "final chunk_index does not match total_chunks"
     end
 
     record_push_metrics(agent_id, new_total)
-    {:halt, {new_total, true, stream_agent_id}}
+    {:halt, {new_total, true, stream_agent_id, expected_idx + 1, pinned_total_chunks}}
   else
-    {:cont, {new_total, false, stream_agent_id}}
+    {:cont, {new_total, false, stream_agent_id, expected_idx + 1, pinned_total_chunks}}
   end
 end)

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 8

__

Why: The suggestion correctly identifies a vulnerability where out-of-order or missing stream chunks are not detected, potentially leading to silent data loss.

Medium
Unify version reporting source

Propagate the build-time Version from the main package to the agent package to ensure correct version reporting during gateway enrollment.

cmd/agent/main.go [42-45]

 // Version is set at build time via ldflags
 //
 //nolint:gochecknoglobals // Required for build-time ldflags injection
 var Version = "dev"
+...
+
+// Ensure the agent package reports the same build version during enrollment.
+agent.Version = Version

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies that the build-time Version from the main package is not propagated to the agent package, causing incorrect version reporting during enrollment.

Medium
Incremental [*]
Skip malformed service entries
Suggestion Impact: The reduce function in push_status was changed to use function-head pattern matching for %Monitoring.GatewayServiceStatus{} and to ignore all other entries via an _other clause, matching the suggestion and preventing invalid entries from triggering exception-based control flow.

code diff:

@@ -252,25 +252,29 @@
     processed_count =
       services
       |> Enum.reject(&is_nil/1)
-      |> Enum.reduce(0, fn service, acc ->
-        try do
-          process_service_status(service, metadata)
-          acc + 1
-        rescue
-          e in GRPC.RPCError ->
-            Logger.warning(
-              "Dropping invalid service status from agent #{metadata.agent_id}: #{e.status} #{e.message}"
-            )
-
-            acc
-
-          e ->
-            Logger.warning(
-              "Dropping service status from agent #{metadata.agent_id} due to error: #{Exception.message(e)}"
-            )
-
-            acc
-        end
+      |> Enum.reduce(0, fn
+        %Monitoring.GatewayServiceStatus{} = service, acc ->
+          try do
+            process_service_status(service, metadata)
+            acc + 1
+          rescue
+            e in GRPC.RPCError ->
+              Logger.warning(
+                "Dropping invalid service status from agent #{metadata.agent_id}: #{e.status} #{e.message}"
+              )
+
+              acc
+
+            e ->
+              Logger.warning(
+                "Dropping service status from agent #{metadata.agent_id} due to error: #{Exception.message(e)}"
+              )
+
+              acc
+          end
+
+        _other, acc ->
+          acc
       end)

In push_status, use pattern matching on %Monitoring.GatewayServiceStatus{} within Enum.reduce to filter invalid service entries. This avoids using exceptions for control flow and mitigates a potential resource exhaustion attack.

elixir/serviceradar_agent_gateway/lib/serviceradar_agent_gateway/agent_gateway_server.ex [252-278]

 processed_count =
   services
   |> Enum.reject(&is_nil/1)
-  |> Enum.reduce(0, fn service, acc ->
-    try do
-      process_service_status(service, metadata)
-      acc + 1
-    rescue
-      e in GRPC.RPCError ->
-        Logger.warning(
-          "Dropping invalid service status from agent #{metadata.agent_id}: #{e.status} #{e.message}"
-        )
+  |> Enum.reduce(0, fn
+    %Monitoring.GatewayServiceStatus{} = service, acc ->
+      try do
+        process_service_status(service, metadata)
+        acc + 1
+      rescue
+        e in GRPC.RPCError ->
+          Logger.warning(
+            "Dropping invalid service status from agent #{metadata.agent_id}: #{e.status} #{e.message}"
+          )
 
-        acc
+          acc
 
-      e ->
-        Logger.warning(
-          "Dropping service status from agent #{metadata.agent_id} due to error: #{Exception.message(e)}"
-        )
+        e ->
+          Logger.warning(
+            "Dropping service status from agent #{metadata.agent_id} due to error: #{Exception.message(e)}"
+          )
 
-        acc
-    end
+          acc
+      end
+
+    _other, acc ->
+      acc
   end)
 
 if processed_count == 0 and service_count > 0 do
   raise GRPC.RPCError, status: :invalid_argument, message: "no valid service statuses"
 end

[Suggestion processed]

Suggestion importance[1-10]: 8

__

Why: The suggestion correctly identifies a potential resource exhaustion vector and proposes a more robust and secure implementation using pattern matching. This avoids expensive exception handling for invalid data types in a loop, which is a good security and performance practice.

Medium
Bound user-controlled string sizes
Suggestion Impact: The commit adds trimming/normalization of service_type and source, validates their maximum byte size and disallows control characters, and then uses these sanitized values in the constructed status map.

code diff:

@@ -422,6 +439,18 @@
         _ -> ""
       end
 
+    service_type =
+      case service.service_type do
+        st when is_binary(st) -> String.trim(st)
+        _ -> ""
+      end
+
+    source =
+      case service.source do
+        src when is_binary(src) -> String.trim(src)
+        _ -> ""
+      end
+
     cond do
       service_name == "" ->
         raise GRPC.RPCError, status: :invalid_argument, message: "service_name is required"
@@ -431,6 +460,12 @@
       String.contains?(service_name, ["\n", "\r", "\t"]) ->
         raise GRPC.RPCError, status: :invalid_argument, message: "service_name contains invalid characters"
+
+      byte_size(service_type) > 64 or String.contains?(service_type, ["\n", "\r", "\t"]) ->
+        raise GRPC.RPCError, status: :invalid_argument, message: "service_type is invalid"
+
+      byte_size(source) > 64 or String.contains?(source, ["\n", "\r", "\t"]) ->
+        raise GRPC.RPCError, status: :invalid_argument, message: "source is invalid"
 
       true ->
         :ok
@@ -458,16 +493,28 @@
     end
   end)
 
+    response_time =
+      case service.response_time do
+        rt when is_integer(rt) and rt >= 0 and rt <= 86_400_000 ->
+          rt
+
+        rt when is_integer(rt) and rt > 86_400_000 ->
+          86_400_000
+
+        _ ->
+          0
+      end
+
     status = %{
       service_name: service_name,
       available: service.available == true,
       message: message,
-      service_type: service.service_type,
-      response_time: service.response_time,
+      service_type: service_type,
+      response_time: response_time,
       agent_id: metadata.agent_id,
       gateway_id: metadata.gateway_id,
       partition: normalize_partition(service.partition || metadata.partition),
-      source: service.source,
+      source: source,
       kv_store_id: service.kv_store_id || metadata.kv_store_id,

In process_service_status, add size and character validation for the service_type and source fields. This prevents potential resource exhaustion and data ingestion issues from malicious or malformed client input.

elixir/serviceradar_agent_gateway/lib/serviceradar_agent_gateway/agent_gateway_server.ex [418-423]

 service_name =
   case service.service_name do
     name when is_binary(name) -> String.trim(name)
     nil -> ""
     _ -> ""
   end
 
+service_type =
+  case service.service_type do
+    st when is_binary(st) -> String.trim(st)
+    _ -> ""
+  end
+
+source =
+  case service.source do
+    src when is_binary(src) -> String.trim(src)
+    _ -> ""
+  end
+
+cond do
+  service_name == "" ->
+    raise GRPC.RPCError, status: :invalid_argument, message: "service_name is required"
+
+  byte_size(service_name) > 255 ->
+    raise GRPC.RPCError, status: :invalid_argument, message: "service_name is too long"
+
+  String.contains?(service_name, ["\n", "\r", "\t"]) ->
+    raise GRPC.RPCError, status: :invalid_argument, message: "service_name contains invalid characters"
+
+  byte_size(service_type) > 64 or String.contains?(service_type, ["\n", "\r", "\t"]) ->
+    raise GRPC.RPCError, status: :invalid_argument, message: "service_type is invalid"
+
+  byte_size(source) > 64 or String.contains?(source, ["\n", "\r", "\t"]) ->
+    raise GRPC.RPCError, status: :invalid_argument, message: "source is invalid"
+
+  true ->
+    :ok
+end

[Suggestion processed]

Suggestion importance[1-10]: 8

__

Why: The suggestion correctly points out that other user-controlled string fields like service_type and source are not being validated. Adding size and character validation for these fields is a good security practice to prevent resource exhaustion and data ingestion issues.

Medium
Prevent test goroutine leaks

Prevent a test goroutine leak by using a cancellable context when starting the service and deferring a call to service.Stop() to ensure cleanup.

pkg/checker/snmp/service_test.go [162-201]

 initialDataChan := make(chan DataPoint)
 newDataChan := make(chan DataPoint)
 
 // Channel to signal when GetResults is called for the new target.
 getResultsCalled := make(chan struct{})
 ...
 // Test adding target
-err = service.Start(context.Background())
+startCtx, cancelStart := context.WithCancel(context.Background())
+defer cancelStart()
+defer func() { require.NoError(t, service.Stop()) }()
+defer close(initialDataChan)
+defer close(newDataChan)
+
+err = service.Start(startCtx)
 require.NoError(t, err)

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies a goroutine leak in the test due to a non-cancellable context, which can cause test flakiness, and provides a correct fix to ensure proper cleanup.

Medium
Handle RPC agent map shapes
Suggestion Impact: The reduce/caching logic was changed to fetch agent_id and other fields using both atom and string map keys, matching the suggestion to robustly handle RPC result shapes.

code diff:

-          Map.put(acc, agent.agent_id, %{
-            agent_id: agent.agent_id,
-            tenant_id: Map.get(agent, :tenant_id),
-            tenant_slug: Map.get(agent, :tenant_slug, "default"),
-            last_seen: Map.get(agent, :last_seen),
-            last_seen_mono: Map.get(agent, :last_seen_mono),
-            service_count: Map.get(agent, :service_count, 0),
-            partition: Map.get(agent, :partition),
-            source_ip: Map.get(agent, :source_ip)
+          agent_id = Map.get(agent, :agent_id) || Map.get(agent, "agent_id")
+
+          Map.put(acc, agent_id, %{
+            agent_id: agent_id,
+            tenant_id: Map.get(agent, :tenant_id) || Map.get(agent, "tenant_id"),
+            tenant_slug: Map.get(agent, :tenant_slug) || Map.get(agent, "tenant_slug") || "default",
+            last_seen: Map.get(agent, :last_seen) || Map.get(agent, "last_seen"),
+            last_seen_mono: Map.get(agent, :last_seen_mono) || Map.get(agent, "last_seen_mono"),
+            service_count: Map.get(agent, :service_count) || Map.get(agent, "service_count") || 0,
+            partition: Map.get(agent, :partition) || Map.get(agent, "partition"),
+            source_ip: Map.get(agent, :source_ip) || Map.get(agent, "source_ip")
           })

Modify the agent cache builder to handle both atom and string keys from RPC results.
This prevents incorrect data processing due to serialization <br>differences between nodes.** [web-ng/lib/serviceradar_web_ng_web/live/infrastructure_live/index.ex [739-749]](https://github.com/carverauto/serviceradar/pull/2221/files#diff-f1af30a84da554ef3d43b226a9303174b33dd5d27e23e9b702031483074e5f54R739-R749) ```diff -Map.put(acc, agent.agent_id, %{ - agent_id: agent.agent_id, - tenant_id: Map.get(agent, :tenant_id), - tenant_slug: Map.get(agent, :tenant_slug, "default"), - last_seen: Map.get(agent, :last_seen), - last_seen_mono: Map.get(agent, :last_seen_mono), - service_count: Map.get(agent, :service_count, 0), - partition: Map.get(agent, :partition), - source_ip: Map.get(agent, :source_ip) +agent_id = Map.get(agent, :agent_id) || Map.get(agent, "agent_id") + +Map.put(acc, agent_id, %{ + agent_id: agent_id, + tenant_id: Map.get(agent, :tenant_id) || Map.get(agent, "tenant_id"), + tenant_slug: Map.get(agent, :tenant_slug) || Map.get(agent, "tenant_slug") || "default", + last_seen: Map.get(agent, :last_seen) || Map.get(agent, "last_seen"), + last_seen_mono: Map.get(agent, :last_seen_mono) || Map.get(agent, "last_seen_mono"), + service_count: Map.get(agent, :service_count) || Map.get(agent, "service_count") || 0, + partition: Map.get(agent, :partition) || Map.get(agent, "partition"), + source_ip: Map.get(agent, :source_ip) || Map.get(agent, "source_ip") }) ``` `[To ensure code accuracy, apply this suggestion manually]` <details><summary>Suggestion importance[1-10]: 7</summary> __ Why: This is a valid suggestion that improves robustness by defensively handling both atom and string keys for data received via RPC, which can have inconsistent serialization. This prevents a potential bug where agents might be incorrectly processed. 
</details></details></td><td align=center>Medium </td></tr><tr><td> <details><summary>✅ <s>Fix embedded port detection</s></summary> ___ <details><summary><b>Suggestion Impact:</b></summary>The commit replaces the fallback url.Parse("tcp://"+target) port detection with logic that strips path/query and uses net.SplitHostPort, and adds the needed strings import. code diff: ```diff @@ -22,6 +22,7 @@ "fmt" "net" "net/url" + "strings" "sync" "time" @@ -685,9 +686,17 @@ hasEmbeddedPort = true } else if parsed, err := url.Parse(target); err == nil && parsed.Host != "" && parsed.Port() != "" { hasEmbeddedPort = true - } else if parsed, err := url.Parse("tcp://" + target); err == nil && parsed.Host != "" && parsed.Port() != "" { + } else { // Handles "host:port/path" inputs without an explicit scheme. - hasEmbeddedPort = true + hostPort := target + if i := strings.IndexAny(hostPort, "/?"); i >= 0 { + hostPort = hostPort[:i] + } + + // net.SplitHostPort requires bracketed IPv6; be explicit to avoid silently dropping checks. + if host, p, err := net.SplitHostPort(hostPort); err == nil && host != "" && p != "" { + hasEmbeddedPort = true + } ``` </details> ___ **Improve embedded port detection for check targets by using <code>net.SplitHostPort</code> on <br>the host part of the target string, making it more resilient to various formats <br>like unbracketed IPv6 addresses.** [pkg/agent/push_loop.go [688-691]](https://github.com/carverauto/serviceradar/pull/2221/files#diff-5f0d59be34ef26b449d7f5fd2b198a29b277936b9708a699f7487415ed6c2785R688-R691) ```diff -} else if parsed, err := url.Parse("tcp://" + target); err == nil && parsed.Host != "" && parsed.Port() != "" { +} else { // Handles "host:port/path" inputs without an explicit scheme. - hasEmbeddedPort = true + hostPort := target + if i := strings.IndexAny(hostPort, "/?"); i >= 0 { + hostPort = hostPort[:i] + } + + // net.SplitHostPort requires bracketed IPv6; be explicit to avoid silently dropping checks. 
+ if host, p, err := net.SplitHostPort(hostPort); err == nil && host != "" && p != "" { + hasEmbeddedPort = true + } } ``` `[To ensure code accuracy, apply this suggestion manually]` <details><summary>Suggestion importance[1-10]: 6</summary> __ Why: The suggestion correctly identifies that the current port detection logic can fail for unbracketed IPv6 addresses, and proposes a more robust implementation using `net.SplitHostPort`. </details></details></td><td align=center>Low </td></tr><tr><td rowspan=1>Security</td> <td> <details><summary>✅ <s>Validate client-provided timing data</s></summary> ___ <details><summary><b>Suggestion Impact:</b></summary>The commit added response_time validation/normalization logic (clamping to 86_400_000 and defaulting to 0 for invalid values) and updated the status map to use the sanitized response_time instead of the raw client-provided value. code diff: ```diff + response_time = + case service.response_time do + rt when is_integer(rt) and rt >= 0 and rt <= 86_400_000 -> + rt + + rt when is_integer(rt) and rt > 86_400_000 -> + 86_400_000 + + _ -> + 0 + end + status = %{ service_name: service_name, available: service.available == true, message: message, - service_type: service.service_type, - response_time: service.response_time, + service_type: service_type, + response_time: response_time, agent_id: metadata.agent_id, ``` </details> ___ **Validate and normalize the client-provided <code>response_time</code> to be a non-negative <br>integer within a reasonable range before it is processed.** [elixir/serviceradar_agent_gateway/lib/serviceradar_agent_gateway/agent_gateway_server.ex [461-475]](https://github.com/carverauto/serviceradar/pull/2221/files#diff-369a368073dc8ec1140bcea699005a1ce97a90cd59629df0bd18c71c7ffaae9fR461-R475) ```diff +response_time = + case service.response_time do + rt when is_integer(rt) and rt >= 0 and rt <= 86_400_000 -> + rt + + rt when is_integer(rt) and rt > 86_400_000 -> + 86_400_000 + + _ -> + 0 + end + 
status = %{ service_name: service_name, available: service.available == true, message: message, service_type: service.service_type, - response_time: service.response_time, + response_time: response_time, agent_id: metadata.agent_id, gateway_id: metadata.gateway_id, partition: normalize_partition(service.partition || metadata.partition), source: service.source, kv_store_id: service.kv_store_id || metadata.kv_store_id, timestamp: metadata.timestamp, tenant_id: metadata.tenant_id, tenant_slug: metadata.tenant_slug } ``` `[Suggestion processed]` <details><summary>Suggestion importance[1-10]: 7</summary> __ Why: The suggestion correctly points out that `response_time` from the client is not validated, which could lead to data corruption or processing errors downstream. </details></details></td><td align=center>Medium </td></tr><tr><td rowspan=1>General</td> <td> <details><summary>✅ <s>Preserve JSON parsing error details</s></summary> ___ <details><summary><b>Suggestion Impact:</b></summary>Updated the JSON parsing error handling to wrap and return the actual decode error (err) when decoding fails with non-EOF errors, improving debugging information. 
code diff: ```diff if err := dec.Decode(&struct{}{}); err == nil { return nil, fmt.Errorf("failed to parse config: %w", errConfigTrailingData) } else if !errors.Is(err, io.EOF) { - return nil, fmt.Errorf("failed to parse config: %w", errConfigTrailingData) + return nil, fmt.Errorf("failed to parse config: %w", err) ``` </details> ___ **Improve JSON parsing error handling by returning the specific underlying error <br>instead of a generic one, which will aid in debugging configuration file issues.** [cmd/agent/main.go [129-133]](https://github.com/carverauto/serviceradar/pull/2221/files#diff-61358711e980ccf505246fd3915f97cbd3a380e9b66f6fa5aad46749968c5ca3R129-R133) ```diff if err := dec.Decode(&struct{}{}); err == nil { return nil, fmt.Errorf("failed to parse config: %w", errConfigTrailingData) } else if !errors.Is(err, io.EOF) { - return nil, fmt.Errorf("failed to parse config: %w", errConfigTrailingData) + return nil, fmt.Errorf("failed to parse config: %w", err) } ``` `[Suggestion processed]` <details><summary>Suggestion importance[1-10]: 6</summary> __ Why: The suggestion correctly points out that a generic error is returned for JSON parsing issues, which hides the specific cause and makes debugging more difficult. 
</details></details></td><td align=center>Low </td></tr> <tr><td align="center" colspan="2"> - [ ] Update <!-- /improve_multi --more_suggestions=true --> </td><td></td></tr></tbody></table> ___ #### Previous suggestions <details><summary>✅ Suggestions up to commit c96fbf9</summary> <br><table><thead><tr><td><strong>Category</strong></td><td align=left><strong>Suggestion&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </strong></td><td align=center><strong>Impact</strong></td></tr><tbody><tr><td rowspan=3>Possible issue</td> <td> <details><summary>Fix state machine transitions</summary> ___ **Fix the <code>transition_state</code> calls in the state machine update actions to use the <br>correct transition names (e.g., <code>:degrade</code>) instead of the target state atoms <br>(e.g., <code>:degraded</code>) to prevent runtime errors.** [elixir/serviceradar_core/lib/serviceradar/infrastructure/gateway.ex [243-351]](https://github.com/carverauto/serviceradar/pull/2221/files#diff-1655f7cc80c8949b3d2c9e50d20110cb85883413fb6181825bfbcf0bda578c6bR243-R351) ```diff update :degrade do description "Mark gateway as degraded (having issues)" argument :reason, :string require_atomic? false - change transition_state(:degraded) + change transition_state(:degrade) change set_attribute(:is_healthy, false) change set_attribute(:updated_at, &DateTime.utc_now/0) change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :gateway, new_state: :degraded} end update :go_offline do description "Mark gateway as offline" argument :reason, :string require_atomic? 
false - change transition_state(:offline) + change transition_state(:go_offline) change set_attribute(:is_healthy, false) change set_attribute(:updated_at, &DateTime.utc_now/0) change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :gateway, new_state: :offline} end update :recover do description "Start recovery process for degraded/offline gateway" require_atomic? false - change transition_state(:recovering) + change transition_state(:recover) change set_attribute(:updated_at, &DateTime.utc_now/0) change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :gateway, new_state: :recovering} end update :start_maintenance do description "Put gateway into maintenance mode" require_atomic? false - change transition_state(:maintenance) + change transition_state(:start_maintenance) change set_attribute(:updated_at, &DateTime.utc_now/0) change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :gateway, new_state: :maintenance} end update :start_draining do description "Start graceful shutdown (draining)" require_atomic? false - change transition_state(:draining) + change transition_state(:start_draining) change set_attribute(:updated_at, &DateTime.utc_now/0) change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :gateway, new_state: :draining} end update :deactivate do description "Deactivate a gateway (admin action)" require_atomic? false - change transition_state(:inactive) + change transition_state(:deactivate) change set_attribute(:is_healthy, false) change set_attribute(:updated_at, &DateTime.utc_now/0) change {ServiceRadar.Infrastructure.Changes.PublishStateChange, entity_type: :gateway, new_state: :inactive} end ``` <details><summary>Suggestion importance[1-10]: 9</summary> __ Why: This suggestion correctly identifies a critical bug where `transition_state` is called with the target state instead of the transition name, which would cause runtime crashes. 
</details></details></td><td align=center>High </td></tr><tr><td> <details><summary>Fix gRPC connection initialization</summary> ___ **Replace the manual gRPC connection logic using <code>grpc.NewClient</code> with the standard <br><code>grpc.DialContext</code> and <code>grpc.WithBlock</code> for a simpler and more robust <br>implementation.** [pkg/agent/gateway_client.go [130-154]](https://github.com/carverauto/serviceradar/pull/2221/files#diff-54aa63b6f05aabb0db096685e5be714c0c6d5c5a892310f048ff2ff8a542191fR130-R154) ```diff -conn, err := grpc.NewClient(g.addr, opts...) +opts = append(opts, grpc.WithBlock()) + +conn, err := grpc.DialContext(connectCtx, g.addr, opts...) if err != nil { if provider != nil { _ = provider.Close() } return fmt.Errorf("failed to connect to gateway at %s: %w", g.addr, err) } -conn.Connect() -for state := conn.GetState(); state != connectivity.Ready; state = conn.GetState() { - if state == connectivity.Shutdown { - _ = conn.Close() - if provider != nil { - _ = provider.Close() - } - return fmt.Errorf("failed to connect to gateway at %s: %w", g.addr, ErrConnectionShutdown) - } - if !conn.WaitForStateChange(connectCtx, state) { - _ = conn.Close() - if provider != nil { - _ = provider.Close() - } - return fmt.Errorf("failed to connect to gateway at %s: %w", g.addr, connectCtx.Err()) - } -} - ``` <details><summary>Suggestion importance[1-10]: 7</summary> __ Why: The suggestion correctly identifies that using `grpc.DialContext` with `grpc.WithBlock` is the idiomatic and more robust way to establish a gRPC connection, simplifying the code and improving maintainability. 
</details></details></td><td align=center>Medium </td></tr><tr><td> <details><summary>✅ <s>Prevent false agent activity</s></summary> ___ <details><summary><b>Suggestion Impact:</b></summary>The code now uses Map.get(agent, :last_seen) and Map.get(agent, :last_seen_mono) instead of agent.last_seen and System.monotonic_time(:millisecond), preventing false activity on initial cache load. code diff: ```diff - all_agents - |> Enum.reduce(%{}, fn agent, acc -> - Map.put(acc, agent.agent_id, %{ - agent_id: agent.agent_id, - tenant_id: Map.get(agent, :tenant_id), - tenant_slug: Map.get(agent, :tenant_slug, "default"), - last_seen: agent.last_seen, - last_seen_mono: System.monotonic_time(:millisecond), - service_count: Map.get(agent, :service_count, 0), - partition: Map.get(agent, :partition), - source_ip: Map.get(agent, :source_ip) - }) - end) + all_agents + |> Enum.reduce(%{}, fn agent, acc -> + Map.put(acc, agent.agent_id, %{ + agent_id: agent.agent_id, + tenant_id: Map.get(agent, :tenant_id), + tenant_slug: Map.get(agent, :tenant_slug, "default"), + last_seen: Map.get(agent, :last_seen), + last_seen_mono: Map.get(agent, :last_seen_mono), + service_count: Map.get(agent, :service_count, 0), + partition: Map.get(agent, :partition), + source_ip: Map.get(agent, :source_ip) + }) + end) ``` </details> ___ **In <code>load_initial_agents_cache</code>, set <code>last_seen_mono</code> from the agent data from the <br>tracker instead of the current monotonic time to avoid incorrectly marking stale <br>agents as active.** [web-ng/lib/serviceradar_web_ng_web/live/infrastructure_live/index.ex [738-750]](https://github.com/carverauto/serviceradar/pull/2221/files#diff-f1af30a84da554ef3d43b226a9303174b33dd5d27e23e9b702031483074e5f54R738-R750) ```diff all_agents |> Enum.reduce(%{}, fn agent, acc -> Map.put(acc, agent.agent_id, %{ agent_id: agent.agent_id, tenant_id: Map.get(agent, :tenant_id), tenant_slug: Map.get(agent, :tenant_slug, "default"), - last_seen: agent.last_seen, - 
last_seen_mono: System.monotonic_time(:millisecond), + last_seen: Map.get(agent, :last_seen), + last_seen_mono: Map.get(agent, :last_seen_mono), service_count: Map.get(agent, :service_count, 0), partition: Map.get(agent, :partition), source_ip: Map.get(agent, :source_ip) }) end) ``` <details><summary>Suggestion importance[1-10]: 7</summary> __ Why: The suggestion correctly identifies a logic flaw where setting `last_seen_mono` to the current time on initial load can falsely mark stale agents as active, leading to incorrect UI state. </details></details></td><td align=center>Medium </td></tr><tr><td rowspan=5>Incremental <sup><a href='https://qodo-merge-docs.qodo.ai/core-abilities/incremental_update/'>[*]</a></sup></td> <td> <details><summary>✅ <s>Accept host:port/path targets</s></summary> ___ <details><summary><b>Suggestion Impact:</b></summary>The commit adds an additional url.Parse("tcp://" + target) branch to detect an embedded port in host:port/path inputs without an explicit scheme, preventing such targets from being dropped. code diff: ```diff @@ -684,6 +684,9 @@ if host, p, err := net.SplitHostPort(target); err == nil && host != "" && p != "" { hasEmbeddedPort = true } else if parsed, err := url.Parse(target); err == nil && parsed.Host != "" && parsed.Port() != "" { + hasEmbeddedPort = true + } else if parsed, err := url.Parse("tcp://" + target); err == nil && parsed.Host != "" && parsed.Port() != "" { + // Handles "host:port/path" inputs without an explicit scheme. hasEmbeddedPort = true } ``` </details> ___ **Fix the embedded-port detection logic to correctly handle targets that include <br>both a port and a path (e.g., <code>example.com:443/health</code>). 
This prevents valid <br>TCP/gRPC checks from being silently dropped.** [pkg/agent/push_loop.go [681-692]](https://github.com/carverauto/serviceradar/pull/2221/files#diff-5f0d59be34ef26b449d7f5fd2b198a29b277936b9708a699f7487415ed6c2785R681-R692) ```diff if (checkerType == tcpCheckType || checkerType == grpcCheckType) && port == 0 { // Allow "host:port" or URL targets that already include a port. hasEmbeddedPort := false if host, p, err := net.SplitHostPort(target); err == nil && host != "" && p != "" { hasEmbeddedPort = true } else if parsed, err := url.Parse(target); err == nil && parsed.Host != "" && parsed.Port() != "" { hasEmbeddedPort = true + } else if parsed, err := url.Parse("tcp://" + target); err == nil && parsed.Host != "" && parsed.Port() != "" { + // Handles "host:port/path" inputs without an explicit scheme. + hasEmbeddedPort = true } if !hasEmbeddedPort { return nil } } ``` <details><summary>Suggestion importance[1-10]: 8</summary> __ Why: This suggestion correctly identifies a bug where valid check targets with a port and path (e.g., `example.com:8080/health`) are silently ignored, leading to a failure in monitoring. The fix is accurate and resolves this issue. </details></details></td><td align=center>Medium </td></tr><tr><td> <details><summary>✅ <s>Prevent collectors from stopping early</s></summary> ___ <details><summary><b>Suggestion Impact:</b></summary>The code was changed to return an error ("service not started") when the service lifecycle context is nil, preventing collectors from being started with the request context and stopping prematurely. code diff: ```diff @@ -322,7 +322,7 @@ serviceCtx := s.serviceCtx s.mu.RUnlock() if serviceCtx == nil { - serviceCtx = ctx + return errors.New("service not started") } ``` </details> ___ **Prevent SNMP collectors from stopping prematurely by ensuring they are always <br>started with the service's lifecycle context. 
If the service is not running, <br>fail fast instead of using a potentially short-lived request context.** [pkg/checker/snmp/service.go [321-331]](https://github.com/carverauto/serviceradar/pull/2221/files#diff-c592649b582c62a85d1dd416ac7fb609e55531cb43c8199a15f4432ce8ae05d8R321-R331)

```diff
s.mu.RLock()
serviceCtx := s.serviceCtx
s.mu.RUnlock()

if serviceCtx == nil {
-  serviceCtx = ctx
+  return errors.New("service not started")
}

// Start collector with the service context so it stops on service shutdown.
if err := collector.Start(serviceCtx); err != nil {
	return fmt.Errorf("%w: %s", errFailedToStartCollector, target.Name)
}
```

<details><summary>Suggestion importance[1-10]: 8</summary>

__

Why: The suggestion correctly identifies a critical bug where an SNMP collector's lifecycle could be tied to a short-lived request context, causing it to terminate prematurely. Failing fast prevents this and ensures collectors run for their intended duration.
</details></details></td><td align=center>Medium </td></tr><tr><td> <details><summary>✅ <s>Fail when all items drop</s></summary> ___ <details><summary><b>Suggestion Impact:</b></summary>The commit replaces Enum.each with Enum.reduce to track processed_count, raises a GRPC.RPCError when processed_count == 0 but service_count > 0, and updates record_push_metrics to use processed_count instead of service_count.
code diff: ```diff + processed_count = + services + |> Enum.reject(&is_nil/1) + |> Enum.reduce(0, fn service, acc -> + try do + process_service_status(service, metadata) + acc + 1 + rescue + e in GRPC.RPCError -> + Logger.warning( + "Dropping invalid service status from agent #{metadata.agent_id}: #{e.status} #{e.message}" + ) + + acc + + e -> + Logger.warning( + "Dropping service status from agent #{metadata.agent_id} due to error: #{Exception.message(e)}" + ) + + acc + end + end) + + if processed_count == 0 and service_count > 0 do + raise GRPC.RPCError, status: :invalid_argument, message: "no valid service statuses" + end # Record metrics - record_push_metrics(agent_id, service_count) + record_push_metrics(agent_id, processed_count) ``` </details> ___ **Modify the <code>push_status</code> function to track the number of successfully processed <br>service statuses. If all statuses are dropped due to errors, the gRPC request <br>should fail instead of returning a success response.** [elixir/serviceradar_agent_gateway/lib/serviceradar_agent_gateway/agent_gateway_server.ex [252-273]](https://github.com/carverauto/serviceradar/pull/2221/files#diff-369a368073dc8ec1140bcea699005a1ce97a90cd59629df0bd18c71c7ffaae9fR252-R273) ```diff -services -|> Enum.reject(&is_nil/1) -|> Enum.each(fn service -> - try do - process_service_status(service, metadata) - rescue - e in GRPC.RPCError -> - Logger.warning( - "Dropping invalid service status from agent #{metadata.agent_id}: #{e.status} #{e.message}" - ) +processed_count = + services + |> Enum.reject(&is_nil/1) + |> Enum.reduce(0, fn service, acc -> + try do + process_service_status(service, metadata) + acc + 1 + rescue + e in GRPC.RPCError -> + Logger.warning( + "Dropping invalid service status from agent #{metadata.agent_id}: #{e.status} #{e.message}" + ) - e -> - Logger.warning( - "Dropping service status from agent #{metadata.agent_id} due to error: #{Exception.message(e)}" - ) - end -end) + acc + + e -> + Logger.warning( + 
"Dropping service status from agent #{metadata.agent_id} due to error: #{Exception.message(e)}" + ) + + acc + end + end) + +if processed_count == 0 and service_count > 0 do + raise GRPC.RPCError, status: :invalid_argument, message: "no valid service statuses" +end # Record metrics -record_push_metrics(agent_id, service_count) +record_push_metrics(agent_id, processed_count) %Monitoring.GatewayStatusResponse{received: true} ``` `[Suggestion processed]` <details><summary>Suggestion importance[1-10]: 8</summary> __ Why: This suggestion correctly identifies a silent data loss scenario where the client receives a success response even if all submitted data was invalid and dropped, improving the robustness of the gRPC endpoint. </details></details></td><td align=center>Medium </td></tr><tr><td> <details><summary>✅ <s>Avoid reconnect loops in idle</s></summary> ___ <details><summary><b>Suggestion Impact:</b></summary>Updated IsConnected to return true when the connection state is either connectivity.Ready or connectivity.Idle, matching the suggested switch-based logic. code diff: ```diff @@ -250,7 +250,12 @@ if !g.connected || g.conn == nil { return false } - return g.conn.GetState() == connectivity.Ready + switch g.conn.GetState() { + case connectivity.Ready, connectivity.Idle: + return true + default: + return false + } ``` </details> ___ **Update <code>IsConnected</code> to consider the <code>connectivity.Idle</code> state as connected. 
This <br>prevents unnecessary reconnect loops when the gRPC connection is temporarily <br>idle but still healthy.** [pkg/agent/gateway_client.go [250-253]](https://github.com/carverauto/serviceradar/pull/2221/files#diff-54aa63b6f05aabb0db096685e5be714c0c6d5c5a892310f048ff2ff8a542191fR250-R253) ```diff if !g.connected || g.conn == nil { return false } -return g.conn.GetState() == connectivity.Ready +switch g.conn.GetState() { +case connectivity.Ready, connectivity.Idle: + return true +default: + return false +} ``` <details><summary>Suggestion importance[1-10]: 7</summary> __ Why: The suggestion correctly identifies that treating an `Idle` gRPC connection as disconnected can cause unnecessary reconnect attempts, improving efficiency and reducing load on the agent and gateway. </details></details></td><td align=center>Medium </td></tr><tr><td> <details><summary>✅ <s>Reject missing service names</s></summary> ___ <details><summary><b>Suggestion Impact:</b></summary>The commit replaced `to_string() |> String.trim()` with a case expression that returns "" for nil/non-binary values and trims only binaries, enabling the existing validation to reject missing service_name instead of accepting "nil". code diff: ```diff @@ -406,9 +416,11 @@ # Tenant and gateway_id come from server-side metadata (mTLS cert + server identity) # NOT from the service message - this prevents spoofing service_name = - service.service_name - |> to_string() - |> String.trim() + case service.service_name do + name when is_binary(name) -> String.trim(name) + nil -> "" + _ -> "" + end ``` </details> ___ **Explicitly handle <code>nil</code> values for <code>service.service_name</code> to prevent them from being <br>processed as the string <code>"nil"</code>. 
The current implementation allows this, which can <br>lead to invalid service records.** [elixir/serviceradar_agent_gateway/lib/serviceradar_agent_gateway/agent_gateway_server.ex [408-425]](https://github.com/carverauto/serviceradar/pull/2221/files#diff-369a368073dc8ec1140bcea699005a1ce97a90cd59629df0bd18c71c7ffaae9fR408-R425) ```diff service_name = - service.service_name - |> to_string() - |> String.trim() + case service.service_name do + name when is_binary(name) -> String.trim(name) + nil -> "" + _ -> "" + end cond do service_name == "" -> raise GRPC.RPCError, status: :invalid_argument, message: "service_name is required" byte_size(service_name) > 255 -> raise GRPC.RPCError, status: :invalid_argument, message: "service_name is too long" String.contains?(service_name, ["\n", "\r", "\t"]) -> raise GRPC.RPCError, status: :invalid_argument, message: "service_name contains invalid characters" true -> :ok end ``` `[Suggestion processed]` <details><summary>Suggestion importance[1-10]: 7</summary> __ Why: This is a valid bug fix that prevents data pollution by correctly handling `nil` `service_name` values, which would otherwise be processed as the string `"nil"`. 
</details></details></td><td align=center>Medium </td></tr> <tr><td align="center" colspan="2"> <!-- /improve_multi --more_suggestions=true --> </td><td></td></tr></tbody></table> </details> <details><summary>✅ Suggestions up to commit b7127ef</summary> <br><table><thead><tr><td><strong>Category</strong></td><td align=left><strong>Suggestion&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </strong></td><td align=center><strong>Impact</strong></td></tr><tbody><tr><td rowspan=4>Incremental <sup><a href='https://qodo-merge-docs.qodo.ai/core-abilities/incremental_update/'>[*]</a></sup></td> <td> <details><summary>✅ <s>Restrict monotonic fallback activity</s></summary> ___ <details><summary><b>Suggestion Impact:</b></summary>The commit updates agent_active? to gate the monotonic fallback with `and is_nil(agent[:last_seen])`, matching the suggestion’s intent and preventing monotonic timestamps from overriding stale wall-clock last_seen values. It also reuses agent_active? for agent cache pruning, aligning staleness rules with the UI logic. 
code diff: ```diff - is_integer(agent[:last_seen_mono]) -> + is_integer(agent[:last_seen_mono]) and is_nil(agent[:last_seen]) -> now_mono = System.monotonic_time(:millisecond) delta_ms = max(now_mono - agent[:last_seen_mono], 0) delta_ms < @stale_threshold_ms ``` </details> ___ **In <code>agent_active?</code>, restrict the monotonic time fallback to only apply when the <br>wall-clock <code>last_seen</code> timestamp is <code>nil</code> to ensure correct staleness detection.** [web-ng/lib/serviceradar_web_ng_web/live/infrastructure_live/index.ex [858-866]](https://github.com/carverauto/serviceradar/pull/2221/files#diff-f1af30a84da554ef3d43b226a9303174b33dd5d27e23e9b702031483074e5f54R858-R866) ```diff delta_ms = max(now_ms - last_seen_ms, 0) delta_ms < @stale_threshold_ms - is_integer(agent[:last_seen_mono]) -> + is_integer(agent[:last_seen_mono]) and is_nil(agent[:last_seen]) -> now_mono = System.monotonic_time(:millisecond) delta_ms = max(now_mono - agent[:last_seen_mono], 0) delta_ms < @stale_threshold_ms ``` <details><summary>Suggestion importance[1-10]: 8</summary> __ Why: The suggestion correctly identifies a subtle bug where the monotonic time fallback could incorrectly mark an agent as active if a wall-clock time exists but is stale, improving the accuracy of staleness detection. </details></details></td><td align=center>Medium </td></tr><tr><td> <details><summary>✅ <s>Bound service shutdown with timeout</s></summary> ___ <details><summary><b>Suggestion Impact:</b></summary>The commit removed the unbounded deferred server.Stop(context.Background()) and instead calls server.Stop() with a context timeout inside the shutdown goroutine, plus replaces time.After with an explicit timer. It also adds timed server.Stop() calls on the errChan path, further ensuring shutdown is bounded. 
code diff: ```diff @@ -177,11 +177,6 @@ if err := server.Start(pushCtx); err != nil { return fmt.Errorf("failed to start agent services: %w", err) } - defer func() { - if err := server.Stop(context.Background()); err != nil { - log.Error().Err(err).Msg("Error stopping agent services") - } - }() // Handle shutdown signals sigChan := make(chan os.Signal, 1) @@ -204,21 +199,42 @@ shutdownDone := make(chan struct{}) go func() { + defer close(shutdownDone) + cancel() pushLoop.Stop() <-errChan - close(shutdownDone) + + stopCtx, stopCancel := context.WithTimeout(context.Background(), shutdownTimeout) + defer stopCancel() + if err := server.Stop(stopCtx); err != nil { + log.Error().Err(err).Msg("Error stopping agent services") + } }() + + timer := time.NewTimer(shutdownTimeout) + defer timer.Stop() select { case <-shutdownDone: - case <-time.After(shutdownTimeout): + case <-timer.C: return fmt.Errorf("%w after %s", errShutdownTimeout, shutdownTimeout) } case err := <-errChan: if err != nil && !errors.Is(err, context.Canceled) { + stopCtx, stopCancel := context.WithTimeout(context.Background(), 10*time.Second) + defer stopCancel() + if stopErr := server.Stop(stopCtx); stopErr != nil { + log.Error().Err(stopErr).Msg("Error stopping agent services") + } return fmt.Errorf("push loop error: %w", err) + } + + stopCtx, stopCancel := context.WithTimeout(context.Background(), 10*time.Second) + defer stopCancel() + if stopErr := server.Stop(stopCtx); stopErr != nil { + log.Error().Err(stopErr).Msg("Error stopping agent services") } ``` </details> ___ **Move the <code>server.Stop()</code> call into the timed shutdown block to ensure it is <br>bounded by the <code>shutdownTimeout</code>, preventing the process from hanging during <br>termination.** [cmd/agent/main.go [202-217]](https://github.com/carverauto/serviceradar/pull/2221/files#diff-61358711e980ccf505246fd3915f97cbd3a380e9b66f6fa5aad46749968c5ca3R202-R217) ```diff // Bound shutdown so the process can't hang forever (includes 
pushLoop.Stop()). const shutdownTimeout = 10 * time.Second shutdownDone := make(chan struct{}) go func() { + defer close(shutdownDone) + cancel() pushLoop.Stop() <-errChan - close(shutdownDone) + + stopCtx, stopCancel := context.WithTimeout(context.Background(), shutdownTimeout) + defer stopCancel() + if err := server.Stop(stopCtx); err != nil { + log.Error().Err(err).Msg("Error stopping agent services") + } }() + +timer := time.NewTimer(shutdownTimeout) +defer timer.Stop() select { case <-shutdownDone: -case <-time.After(shutdownTimeout): +case <-timer.C: return fmt.Errorf("%w after %s", errShutdownTimeout, shutdownTimeout) } ``` `[Suggestion processed]` <details><summary>Suggestion importance[1-10]: 7</summary> __ Why: The suggestion correctly identifies that `server.Stop()` is not covered by the shutdown timeout, which could cause the process to hang, and proposes a valid fix to include it in the timed shutdown sequence. </details></details></td><td align=center>Medium </td></tr><tr><td> <details><summary>✅ <s>Clamp skewed timestamps in pruning</s></summary> ___ <details><summary><b>Suggestion Impact:</b></summary>The commit updated the pruned_gateways_cache pruning logic to clamp delta_ms using max(now_ms - last_ms, 0) and removed the delta_ms < 0 check, preventing active gateways from being pruned due to clock skew. It also included additional related changes to agent pruning logic. 
code diff: ```diff + pruned_gateways_cache = + socket.assigns.gateways_cache + |> Enum.reject(fn {_id, gw} -> + last_ms = parse_timestamp_to_ms(Map.get(gw, :last_heartbeat)) + + delta_ms = + if is_integer(last_ms) do + max(now_ms - last_ms, 0) + else + nil + end + + not is_integer(delta_ms) or delta_ms > @stale_threshold_ms + end) + |> Map.new() ``` </details> ___ **To prevent incorrectly pruning active gateways due to clock skew, clamp negative <br><code>delta_ms</code> values to <code>0</code> in the <code>pruned_gateways_cache</code> logic.** [web-ng/lib/serviceradar_web_ng_web/live/infrastructure_live/index.ex [97-106]](https://github.com/carverauto/serviceradar/pull/2221/files#diff-f1af30a84da554ef3d43b226a9303174b33dd5d27e23e9b702031483074e5f54R97-R106) ```diff pruned_gateways_cache = socket.assigns.gateways_cache |> Enum.reject(fn {_id, gw} -> last_ms = parse_timestamp_to_ms(Map.get(gw, :last_heartbeat)) - delta_ms = if is_integer(last_ms), do: now_ms - last_ms, else: nil + delta_ms = + if is_integer(last_ms) do + max(now_ms - last_ms, 0) + else + nil + end - not is_integer(delta_ms) or delta_ms < 0 or delta_ms > @stale_threshold_ms + not is_integer(delta_ms) or delta_ms > @stale_threshold_ms end) |> Map.new() ``` <details><summary>Suggestion importance[1-10]: 7</summary> __ Why: This is a valid bug fix for handling clock skew, which could cause active gateways to be incorrectly pruned, but the same logic is already correctly implemented in `compute_gateways`, making this a consistency fix. </details></details></td><td align=center>Medium </td></tr><tr><td> <details><summary>✅ <s>Avoid dropping agents on skew</s></summary> ___ <details><summary><b>Suggestion Impact:</b></summary>The pruning logic was changed to avoid dropping entities due to negative deltas: gateways now clamp delta_ms with `max(..., 0)` and no longer prune on `delta_ms < 0`. 
For agents, instead of directly computing delta_ms, the code now delegates pruning to `agent_active?/2`, which already clamps negative deltas and applies consistent staleness rules, achieving the same goal (avoid pruning due to clock skew) though via a different implementation. code diff: ```diff + pruned_gateways_cache = + socket.assigns.gateways_cache + |> Enum.reject(fn {_id, gw} -> + last_ms = parse_timestamp_to_ms(Map.get(gw, :last_heartbeat)) + + delta_ms = + if is_integer(last_ms) do + max(now_ms - last_ms, 0) + else + nil + end + + not is_integer(delta_ms) or delta_ms > @stale_threshold_ms + end) + |> Map.new() + + pruned_agents_cache = + socket.assigns.agents_cache + |> Enum.reject(fn {_id, agent} -> + # Use the same staleness rules as the UI (wall clock first, monotonic fallback) + not agent_active?(agent, now_ms) + end) + |> Map.new() gateways = compute_gateways(pruned_gateways_cache) @@ -859,7 +861,7 @@ delta_ms = max(now_ms - last_seen_ms, 0) delta_ms < @stale_threshold_ms - is_integer(agent[:last_seen_mono]) -> + is_integer(agent[:last_seen_mono]) and is_nil(agent[:last_seen]) -> now_mono = System.monotonic_time(:millisecond) delta_ms = max(now_mono - agent[:last_seen_mono], 0) delta_ms < @stale_threshold_ms ``` </details> ___ **To prevent incorrectly pruning active agents due to clock skew, clamp negative <br><code>delta_ms</code> values to <code>0</code> in the <code>pruned_agents_cache</code> logic.** [web-ng/lib/serviceradar_web_ng_web/live/infrastructure_live/index.ex [108-117]](https://github.com/carverauto/serviceradar/pull/2221/files#diff-f1af30a84da554ef3d43b226a9303174b33dd5d27e23e9b702031483074e5f54R108-R117) ```diff pruned_agents_cache = socket.assigns.agents_cache |> Enum.reject(fn {_id, agent} -> last_ms = agent_last_seen_ms(agent) - delta_ms = if is_integer(last_ms), do: now_ms - last_ms, else: nil + delta_ms = + if is_integer(last_ms) do + max(now_ms - last_ms, 0) + else + nil + end - not is_integer(delta_ms) or delta_ms < 0 or 
delta_ms > @stale_threshold_ms + not is_integer(delta_ms) or delta_ms > @stale_threshold_ms end) |> Map.new() ``` <details><summary>Suggestion importance[1-10]: 7</summary> __ Why: This is a valid bug fix for handling clock skew, which could cause active agents to be incorrectly pruned, but the same logic is already correctly implemented in `agent_active?`, making this a consistency fix. </details></details></td><td align=center>Medium </td></tr><tr><td rowspan=3>Possible issue</td> <td> <details><summary>✅ <s>Bind collectors to service context</s></summary> ___ <details><summary><b>Suggestion Impact:</b></summary>The commit moved the serviceCtx lookup to before starting the collector and changed collector.Start to use serviceCtx (falling back to ctx when nil), aligning collector lifetime with the service context. code diff: ```diff - // Start collector - if err := collector.Start(ctx); err != nil { + s.mu.RLock() + serviceCtx := s.serviceCtx + s.mu.RUnlock() + if serviceCtx == nil { + serviceCtx = ctx + } + + // Start collector with the service context so it stops on service shutdown. + if err := collector.Start(serviceCtx); err != nil { return fmt.Errorf("%w: %s", errFailedToStartCollector, target.Name) } @@ -363,12 +370,6 @@ Str("host_name", target.Name). 
Msg("Initialized service status") - s.mu.RLock() - serviceCtx := s.serviceCtx - s.mu.RUnlock() - if serviceCtx == nil { - serviceCtx = ctx - } go s.processResults(serviceCtx, target.Name, results, aggregator) ``` </details> ___ **Start the SNMP collector with the service's long-lived context (<code>serviceCtx</code>) <br>instead of the caller-provided context to prevent resource leaks when the <br>service is stopped or restarted.** [pkg/checker/snmp/service.go [322-372]](https://github.com/carverauto/serviceradar/pull/2221/files#diff-c592649b582c62a85d1dd416ac7fb609e55531cb43c8199a15f4432ce8ae05d8R322-R372) ```diff -if err := collector.Start(ctx); err != nil { - return fmt.Errorf("%w: %s", errFailedToStartCollector, target.Name) -} -... s.mu.RLock() serviceCtx := s.serviceCtx s.mu.RUnlock() if serviceCtx == nil { serviceCtx = ctx } + +if err := collector.Start(serviceCtx); err != nil { + return fmt.Errorf("%w: %s", errFailedToStartCollector, target.Name) +} +... go s.processResults(serviceCtx, target.Name, results, aggregator) ``` <details><summary>Suggestion importance[1-10]: 8</summary> __ Why: The suggestion correctly identifies a potential goroutine leak where a collector started via `AddTarget` might not be tied to the service's lifecycle, and proposes using the service's context to fix it. </details></details></td><td align=center>Medium </td></tr><tr><td> <details><summary>✅ <s>Prune agents using activity predicate</s></summary> ___ <details><summary><b>Suggestion Impact:</b></summary>The commit updated pruned_agents_cache to reject agents based on `not agent_active?(agent, now_ms)` instead of re-implementing staleness checks, aligning pruning with the UI logic. It also refined agent_active?/2 to only use the monotonic timestamp fallback when wall-clock `:last_seen` is nil, further ensuring consistent activity evaluation. 
code diff: ```diff + pruned_agents_cache = + socket.assigns.agents_cache + |> Enum.reject(fn {_id, agent} -> + # Use the same staleness rules as the UI (wall clock first, monotonic fallback) + not agent_active?(agent, now_ms) + end) + |> Map.new() gateways = compute_gateways(pruned_gateways_cache) @@...
qodo-code-review[bot] commented 2026-01-03 22:00:43 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707397340
Original created: 2026-01-03T22:00:43Z

Persistent suggestions updated to latest commit 1c51855

qodo-code-review[bot] commented 2026-01-03 23:52:45 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707456600
Original created: 2026-01-03T23:52:45Z

Persistent suggestions updated to latest commit 9c018c2

qodo-code-review[bot] commented 2026-01-04 00:09:57 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707466559
Original created: 2026-01-04T00:09:57Z

Persistent suggestions updated to latest commit 31c2e30

qodo-code-review[bot] commented 2026-01-04 01:34:43 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707515479
Original created: 2026-01-04T01:34:43Z

Persistent suggestions updated to latest commit 31165fe

qodo-code-review[bot] commented 2026-01-04 02:20:43 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707544156
Original created: 2026-01-04T02:20:43Z

Persistent suggestions updated to latest commit b3ec5cd

qodo-code-review[bot] commented 2026-01-04 05:07:29 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707670337
Original created: 2026-01-04T05:07:29Z

Persistent suggestions updated to latest commit 77a078e

qodo-code-review[bot] commented 2026-01-04 06:56:41 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707800531
Original created: 2026-01-04T06:56:41Z

Persistent suggestions updated to latest commit 6196af3

qodo-code-review[bot] commented 2026-01-04 07:11:17 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707810490
Original created: 2026-01-04T07:11:17Z

Persistent suggestions updated to latest commit da4580a

qodo-code-review[bot] commented 2026-01-04 07:21:19 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707818048
Original created: 2026-01-04T07:21:19Z

Persistent suggestions updated to latest commit db99aac

qodo-code-review[bot] commented 2026-01-04 08:04:14 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707846321
Original created: 2026-01-04T08:04:14Z

Persistent suggestions updated to latest commit c0cb179

Reference
carverauto/serviceradar!2629