Feat/agent gateway work #2629
No reviewers
Labels
No labels
1week
2weeks
Failed compliance check
IP cameras
NATS
Possible security concern
Review effort 1/5
Review effort 2/5
Review effort 3/5
Review effort 4/5
Review effort 5/5
UI
aardvark
accessibility
amd64
api
arm64
auth
back-end
bgp
blog
bug
build
checkers
ci-cd
cleanup
cnpg
codex
core
dependencies
device-management
documentation
duplicate
dusk
ebpf
enhancement
eta 1d
eta 1hr
eta 3d
eta 3hr
feature
fieldsurvey
github_actions
go
good first issue
help wanted
invalid
javascript
k8s
log-collector
mapper
mtr
needs-triage
netflow
network-sweep
observability
oracle
otel
plug-in
proton
python
question
reddit
redhat
research
rperf
rperf-checker
rust
sdk
security
serviceradar-agent
serviceradar-agent-gateway
serviceradar-web
serviceradar-web-ng
siem
snmp
sysmon
topology
ubiquiti
wasm
wontfix
zen-engine
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
carverauto/serviceradar!2629
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "refs/pull/2629/head"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Imported from GitHub pull request.
Original GitHub pull request: #2221
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/pull/2221
Original created: 2026-01-03T09:47:10Z
Original updated: 2026-01-04T10:06:59Z
Original head: carverauto/serviceradar:feat/agent-gateway-work
Original base: testing
Original merged: 2026-01-04T10:06:57Z by @mfreeman451
User description
IMPORTANT: Please sign the Developer Certificate of Origin
Thank you for your contribution to ServiceRadar. Please note, when contributing, the developer must include
a DCO sign-off statement indicating the DCO acceptance in one commit message. Here
is an example DCO Signed-off-by line in a commit message:
Describe your changes
Issue ticket number and link
Code checklist before requesting a review
PR Type
Enhancement, Tests
Description
Implements comprehensive NATS account management system with operator bootstrap, tenant account creation, and user credential generation
Adds new
AgentGatewayServicegRPC API for agent-to-gateway status communication with push and streaming capabilitiesIntroduces multi-tenant identity extraction from mTLS certificates with tenant context propagation throughout the system
Implements agent push mode for active status transmission to gateway with automatic reconnection and exponential backoff
Adds NATS bootstrap CLI command supporting local and remote modes for operator and account initialization
Extends CLI with admin command dispatcher for managing NATS bootstrap and account operations
Implements tenant-aware NATS event subject prefixing via environment variable control
Adds comprehensive test coverage for account management, user credentials, operator initialization, and tenant context handling
Extends data models with collector package types, credentials structures, and NATS configuration fields
Normalizes proto file path references for KV service consistency
Diagram Walkthrough
File Walkthrough
20 files
nats_account.pb.go
NATS account management protobuf definitionsproto/nats_account.pb.go
of new code
account creation, and user credentials
permission structures
signing, and credential generation
nats_bootstrap.go
NATS bootstrap CLI implementationpkg/cli/nats_bootstrap.go
functionality
initialization
system/platform accounts
capabilities
cli.go
CLI command routing for admin operationspkg/cli/cli.go
AdminHandlerstruct to dispatch nested admin commandsnats-bootstrapandadmincommand handlers in the command mapadmin nats)nats_account_service.go
NATS Account Service Implementation with JWT Operationspkg/datasvc/nats_account_service.go
NATSAccountServerstruct for stateless NATS JWT/NKeyscryptographic operations
user credential generation, and JWT signing
for pushing JWTs to NATS
proper error handling
monitoring.pb.go
Gateway Status Messages and Service Definitionproto/monitoring.pb.go
GatewayStatusRequest,GatewayStatusResponse,GatewayStatusChunk,and
GatewayServiceStatusmessage typestenant_id,tenant_slug) for multi-tenantrouting
AgentGatewayServicewithPushStatusandStreamStatusRPCmethods
nats_account_grpc.pb.go
NATS Account Service gRPC Stubs Generationproto/nats_account_grpc.pb.go
NATSAccountServiceinterfaceBootstrapOperator,GetOperatorInfo,CreateTenantAccount,GenerateUserCredentials,SignAccountJWT,PushAccountJWToperator.go
NATS Operator Key Management and Bootstrappkg/nats/accounts/operator.go
Operatorstruct for managing NATS operator keys and signingoperations
BootstrapOperatorfunction to initialize new or importexisting operator keys
with NKeys
environment, files)
events.go
NATS Configuration Extension for Credentialspkg/models/events.go
CredsFilefield toNATSConfigstruct for system accountcredentials file path
fields
tenant.go
Multi-tenant identity extraction from certificatespkg/tenant/tenant.go
certificate CNs
...serviceradarutilities
PEM files
nats_config.go
NATS server configuration and bootstrap utilitiespkg/edgeonboarding/nats_config.go
account setup
clustering, and leaf nodes
type
gateway_client.go
Agent gateway client with reconnection logicpkg/agent/gateway_client.go
management
exponential backoff
account_manager.go
NATS account creation and JWT signingpkg/nats/accounts/account_manager.go
signing
{{tenant}}placeholder replacement
revocations
monitoring_grpc.pb.go
New AgentGatewayService gRPC APIproto/monitoring_grpc.pb.go
AgentGatewayServicegRPC service with PushStatus andStreamStatus methods
PollerServiceas deprecated with migration guidance toAgentGatewayServicecommunication
user_manager.go
NATS user credential generation and managementpkg/nats/accounts/user_manager.go
(collector, service, admin)
.credsfile formatting for NATS authenticationpush_loop.go
Agent status push loop implementationpkg/agent/push_loop.go
gateway transmission
events.go
Tenant-aware NATS event subject prefixingpkg/natsutil/events.go
variable control
applyTenantPrefix()to inject tenant slug into subjectpaths
NewEventPublisherWithPrefixing()for explicit prefix controlmain.go
Data service NATS account initializationcmd/data-services/main.go
resolver paths
files
NATSAccountServicegRPC endpoint for account managementpersistence
edge_onboarding.go
Collector package and credential data modelspkg/models/edge_onboarding.go
CollectorTypeenum for collector classification (flowgger, trapd,netflow, otel)
CollectorPackageStatusenum for deployment lifecycle trackingCollectorPackage,NatsCredential, andCollectorDownloadResultdata structuresdownload tracking
main.go
Agent push mode implementation with gateway integrationcmd/agent/main.go
runPushMode()with gateway client initialization and push loopmanagement
main.go
CLI command routing for NATS bootstrap and admincmd/cli/main.go
nats-bootstrapcommand routing for NATS server bootstrapoperations
admincommand dispatcher withnatssubcommand supportdispatchAdminCommand()for admin resource routing1 files
kv.pb.go
Proto file path normalization for KV serviceproto/kv.pb.go
proto/kv.prototokv.protofile_proto_kv_proto_*tofile_kv_proto_*calls
6 files
nats_account_service_test.go
NATS Account Service Unit Testspkg/datasvc/nats_account_service_test.go
NATSAccountServerwith authorizationcontext setup
with various scenarios
types, and permissions
inputs
user_manager_test.go
User Credentials Generation Testspkg/nats/accounts/user_manager_test.go
GenerateUserCredentialsfunction covering collector,service, and admin credential types
structure
account_manager_test.go
Account manager test suite for NATS tenant accountspkg/nats/accounts/account_manager_test.go
AccountSignerwith 370 lines coveringaccount creation, JWT signing, and subject mappings
revocations
snmp, netflow, otel, logs, telemetry)
tenant_test.go
Tenant identity parsing and context management testspkg/tenant/tenant_test.go
lines
operator_test.go
NATS operator initialization and JWT signing testspkg/nats/accounts/operator_test.go
lines
variable
handling
events_test.go
Event publisher tenant prefixing testspkg/natsutil/events_test.go
NewEventPublisherWithPrefixing()andSetTenantPrefixing()methods
DefaultTenantwhen tenant not in context1 files
.gitkeep
Docker Compose Credentials Directorydocker/compose/creds/.gitkeep
101 files
Imported GitHub PR comment.
Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3706933859
Original created: 2026-01-03T09:48:48Z
PR Compliance Guide 🔍
(Compliance updated until commit
github.com/carverauto/serviceradar@d0138ce012)Below is a summary of compliance checks for this PR:
Insecure transport allowed
Description: The client can fall back to plaintext gRPC using
grpc.WithTransportCredentials(insecure.NewCredentials())whenSR_ALLOW_INSECURE=true,which would allow MITM interception/modification of agent status/config traffic if enabled
outside local/dev environments.
gateway_client.go [178-187]
Referred Code
🎫 No ticket provided
Codebase context is not defined
Follow the guide to enable codebase context checks.
Generic: Meaningful Naming and Self-Documenting Code
Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting
Status: Passed
Generic: Robust Error Handling and Edge Case Management
Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation
Status: Passed
Generic: Comprehensive Audit Trails
Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.
Status:
Missing audit context: The new agent enrollment/configuration actions emit operational logs but do not
demonstrate dedicated audit logging with consistently required audit fields (actor
identity, timestamp, action, outcome) for critical actions.
Referred Code
Generic: Secure Error Handling
Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.
Status:
Error detail exposure: Connection/enrollment errors include internal connection details (e.g., gateway address
and upstream error strings) that may be surfaced to user-facing contexts depending on
caller behavior not visible in this diff.
Referred Code
Generic: Secure Logging Practices
Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.
Status:
Sensitive data in logs: The new logging includes potentially sensitive operational identifiers and network targets
(e.g.,
tenant_slug,tenant_id,target,port) that could constitute sensitivetenant/internal infrastructure data depending on deployment policy.
Referred Code
Generic: Security-First Input Validation and Data Handling
Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities
Status:
Unverified remote config: Remote
GetConfigcheck definitions are applied dynamically with limited validation, andthe diff alone does not prove that the gateway-side authorization/authentication and
integrity guarantees for these inputs are enforced end-to-end.
Referred Code
Compliance status legend
🟢 - Fully Compliant🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label
Previous compliance checks
Compliance check up to commit 8cdb823
SSRF via remote config
Description: Remote configuration received from the gateway is converted into active network targets
(e.g.,
check.Targetandcheck.PortinprotoCheckToCheckerConfig, then stored inp.server.checkerConfsand later executed), which can enable SSRF/port-scanning behaviorfrom the agent if the gateway/control-plane or its authZ is compromised or misconfigured.
push_loop.go [492-615]
Referred Code
Sensitive identifier logging
Description: The agent logs multi-tenant identifiers from the enrollment response (including
tenant_idand
tenant_slug), which can expose tenant metadata to anyone with access to agent logs andmay violate least-privilege log hygiene in shared or centrally aggregated logging
environments.
gateway_client.go [381-389]
Referred Code
🎫 No ticket provided
Codebase context is not defined
Follow the guide to enable codebase context checks.
Generic: Meaningful Naming and Self-Documenting Code
Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting
Status: Passed
Generic: Secure Error Handling
Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.
Status: Passed
Generic: Robust Error Handling and Edge Case Management
Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation
Status:
Nil dereference risk:
protoCheckToCheckerConfigcan returnnilfor invalid gateway-supplied checks, butapplyChecksuses the returnedcheckerConfwithout a nil check which can cause a panicinstead of graceful handling.
Referred Code
Generic: Security-First Input Validation and Data Handling
Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities
Status:
Unhandled invalid input: Gateway-supplied check data is partially validated in
protoCheckToCheckerConfig, butinvalid checks can produce
nilconfigs and are not safely handled inapplyChecks, enablinga crash via untrusted external input.
Referred Code
Compliance check up to commit 6aa63f6
No security concerns identified
No security vulnerabilities detected by AI analysis. Human verification advised for critical code.🎫 No ticket provided
Codebase context is not defined
Follow the guide to enable codebase context checks.
Generic: Meaningful Naming and Self-Documenting Code
Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting
Status: Passed
Generic: Secure Logging Practices
Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.
Status:
Sensitive identifiers logged: Logs include potentially sensitive tenant identifiers and infrastructure details (e.g.,
tenant_id,tenant_slug) which violates the requirement that no sensitive data be presentin logs.
Referred Code
Generic: Comprehensive Audit Trails
Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.
Status:
Missing actor context: Enrollment and config-application actions are logged but without a consistent
"actor" identifier (e.g., user/admin) beyond
agent_id, and other criticalactions in the PR (not shown in this diff) may lack audit logging.
Referred Code
Generic: Robust Error Handling and Edge Case Management
Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation
Status:
Nil map risk:
applyCheckswrites top.server.checkerConfswithout ensuring the map is initialized, whichcan panic if
checkerConfsis nil depending on server construction not visible in thisdiff.
Referred Code
Generic: Secure Error Handling
Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.
Status:
Potentially detailed errors: Connection and enrollment failures are returned with detailed internal context (e.g.,
gateway address and gateway-provided rejection message) which may be user-visible in CLI
output and should be reviewed for sensitive detail exposure.
Referred Code
Generic: Security-First Input Validation and Data Handling
Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities
Status:
Unvalidated config inputs: JSON config fields loaded from disk (including network endpoints and timing values) are
used without explicit validation/sanitization, and correctness/security depends on
constraints that are not shown in this diff.
Referred Code
Compliance check up to commit 4d3d82c
Insecure gRPC transport
Description: The client explicitly allows falling back to
grpc.WithTransportCredentials(insecure.NewCredentials())whensecurityis nil/disabled,which would permit plaintext gRPC and enable MITM/credential interception if this path is
reachable in non-development deployments.
gateway_client.go [108-135]
Referred Code
Unexpected external egress
Description:
getSourceIP()performs an outbound UDP dial to the public address8.8.8.8:80, creatingunexpected external egress that can leak environment/network metadata and may violate
network/security policies in restricted deployments.
push_loop.go [233-250]
Referred Code
🎫 No ticket provided
Codebase context is not defined
Follow the guide to enable codebase context checks.
Generic: Meaningful Naming and Self-Documenting Code
Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting
Status: Passed
Generic: Secure Error Handling
Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.
Status: Passed
Generic: Robust Error Handling and Edge Case Management
Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation
Status:
Nil check panic:
applyChecksdoes not guard against nil entries in thechecksslice and will panic whendereferencing
check.Enabled.Referred Code
Generic: Comprehensive Audit Trails
Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.
Status:
Audit scope unclear: The new agent enrollment/config/status actions are logged, but it is not verifiable from
the diff alone that these logs meet your audit requirements (e.g., consistent actor
identity, timestamps, and outcome recording across all critical actions).
Referred Code
Generic: Secure Logging Practices
Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.
Status:
Identifier logging: The new logs include tenant identifiers (e.g.,
tenant_idandtenant_slug), and whetherthese are classified as sensitive/PII cannot be confirmed from the diff alone.
Referred Code
Generic: Security-First Input Validation and Data Handling
Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities
Status:
Insecure transport option: The gateway client explicitly allows falling back to
insecure.NewCredentials()whensecurity mode is none, which may violate expected security posture depending on deployment
requirements.
Referred Code
Compliance check up to commit 6a52314
Credential secret exposure
Description:
GenerateUserCredentialsreturnsCredsFileContentcontaining the raw user NKey seed (viaformatCredsFile(userJWT, userSeed)), which is highly sensitive and could be exfiltrated ifthis value is ever logged, stored, or returned over an insufficiently protected channel.
user_manager.go [54-120]
Referred Code
Tenant isolation bypass
Description: Default publish/subscribe permissions include non-tenant-scoped subjects (e.g.,
events.>,logs.>,telemetry.>) and also build allow-subjects from unvalidatedtenantSlug(e.g.,fmt.Sprintf("%s.>", tenantSlug)), which could enable cross-tenant access if subjectmapping is misconfigured or if
tenantSlugcontains wildcard/special tokens.user_manager.go [123-198]
Referred Code
🎫 No ticket provided
Codebase context is not defined
Follow the guide to enable codebase context checks.
Generic: Meaningful Naming and Self-Documenting Code
Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting
Status: Passed
Generic: Comprehensive Audit Trails
Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.
Status:
Audit logging unknown: The diff introduces APIs for operator/account bootstrap and credential generation, but no
corresponding audit logging is visible in the provided hunks, so it cannot be verified
that these critical actions are reliably logged with actor context and outcomes.
Referred Code
Generic: Robust Error Handling and Edge Case Management
Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation
Status:
Error handling unseen: The new gRPC request/response surface for account and credential operations is added, but
the diff does not include the service implementation where validation and error handling
would occur, so robustness against edge cases cannot be confirmed.
Referred Code
Generic: Secure Error Handling
Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.
Status:
Error exposure risk: Responses like
PushAccountJWTResponse.messageappear able to carry error details, butwithout the server-side implementation in the visible diff it cannot be verified that
user-facing errors are sanitized and sensitive internals are only logged securely.
Referred Code
Generic: Secure Logging Practices
Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.
Status:
Secrets may be logged: The added models include highly sensitive fields (e.g.,
CollectorDownloadResult.nats_creds_fileandmtls_bundle) and the diff does not showsafeguards preventing these values from being logged, so secure logging compliance cannot
be verified.
Referred Code
Generic: Security-First Input Validation and Data Handling
Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities
Status:
Sensitive data handling: New data structures introduce sensitive credential material and token-related fields
(e.g.,
download_token_hash,.credscontent, andmtls_bundle), but the diff does notinclude the creation/transport/storage code needed to verify encryption, access controls,
and validation of externally supplied identifiers.
Referred Code
Imported GitHub PR comment.
Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3706934588
Original created: 2026-01-03T09:50:14Z
PR Code Suggestions ✨
Latest suggestions up to
5c8ccf6Fix remaining state transitions
Update
transition_statecalls in severalupdateactions to use the correcttransition name instead of the target state atom to prevent runtime errors.
elixir/serviceradar_core/lib/serviceradar/infrastructure/gateway.ex [254-363]
[To ensure code accuracy, apply this suggestion manually]Suggestion importance[1-10]: 9
__
Why: The suggestion correctly identifies a bug where
transition_stateis called with the target state instead of the transition name, which would cause a runtime error.✅
Enforce streaming chunk orderingSuggestion Impact:
The commit updated the stream reducer state to include expected_idx and pinned_total_chunks, added validation that total_chunks cannot change mid-stream, and enforced that each incoming chunk_index must match the expected sequential index, updating the reducer continuation/halt tuples accordingly.code diff:
Enforce chunk ordering and contiguity in
stream_status/2by tracking theexpected_chunk_indexand pinningtotal_chunksto prevent partial data ingestion.elixir/serviceradar_agent_gateway/lib/serviceradar_agent_gateway/agent_gateway_server.ex [301-403]
[To ensure code accuracy, apply this suggestion manually]Suggestion importance[1-10]: 8
__
Why: The suggestion correctly identifies a vulnerability where out-of-order or missing stream chunks are not detected, potentially leading to silent data loss.
Unify version reporting source
Propagate the build-time
Versionfrom themainpackage to theagentpackage toensure correct version reporting during gateway enrollment.
cmd/agent/main.go [42-45]
[To ensure code accuracy, apply this suggestion manually]Suggestion importance[1-10]: 7
__
Why: The suggestion correctly identifies that the build-time
Versionfrom themainpackage is not propagated to theagentpackage, causing incorrect version reporting during enrollment.✅
Skip malformed service entriesSuggestion Impact:
The reduce function in push_status was changed to use function-head pattern matching for %Monitoring.GatewayServiceStatus{} and to ignore all other entries via an _other clause, matching the suggestion and preventing invalid entries from triggering exception-based control flow.code diff:
In
push_status, use pattern matching on%Monitoring.GatewayServiceStatus{}within
Enum.reduceto filter invalid service entries. This avoids usingexceptions for control flow and mitigates a potential resource exhaustion
attack.
elixir/serviceradar_agent_gateway/lib/serviceradar_agent_gateway/agent_gateway_server.ex [252-278]
[Suggestion processed]Suggestion importance[1-10]: 8
__
Why: The suggestion correctly identifies a potential resource exhaustion vector and proposes a more robust and secure implementation using pattern matching. This avoids expensive exception handling for invalid data types in a loop, which is a good security and performance practice.
✅
Bound user-controlled string sizesSuggestion Impact:
The commit adds trimming/normalization of `service_type` and `source`, validates their maximum byte size and disallows control characters, and then uses these sanitized values in the constructed status map.code diff:
In
process_service_status, add size and character validation for theservice_typeandsourcefields. This prevents potential resource exhaustion anddata ingestion issues from malicious or malformed client input.
elixir/serviceradar_agent_gateway/lib/serviceradar_agent_gateway/agent_gateway_server.ex [418-423]
[Suggestion processed]Suggestion importance[1-10]: 8
__
Why: The suggestion correctly points out that other user-controlled string fields like
service_typeandsourceare not being validated. Adding size and character validation for these fields is a good security practice to prevent resource exhaustion and data ingestion issues.Prevent test goroutine leaks
Prevent a test goroutine leak by using a cancellable context when starting the
serviceanddeferring a call toservice.Stop()to ensure cleanup.pkg/checker/snmp/service_test.go [162-201]
[To ensure code accuracy, apply this suggestion manually]Suggestion importance[1-10]: 7
__
Why: The suggestion correctly identifies a goroutine leak in the test due to a non-cancellable context, which can cause test flakiness, and provides a correct fix to ensure proper cleanup.
✅
Handle RPC agent map shapesSuggestion Impact:
The reduce/caching logic was changed to fetch agent_id and other fields using both atom and string map keys, matching the suggestion to robustly handle RPC result shapes.code diff:
Modify the agent cache builder to handle both atom and string keys from RPC
results. This prevents incorrect data processing due to serialization
differences between nodes.
web-ng/lib/serviceradar_web_ng_web/live/infrastructure_live/index.ex [739-749]
[To ensure code accuracy, apply this suggestion manually]Suggestion importance[1-10]: 7
__
Why: This is a valid suggestion that improves robustness by defensively handling both atom and string keys for data received via RPC, which can have inconsistent serialization. This prevents a potential bug where agents might be incorrectly processed.
✅
Fix embedded port detectionSuggestion Impact:
The commit replaces the fallback url.Parse("tcp://"+target) port detection with logic that strips path/query and uses net.SplitHostPort, and adds the needed strings import.code diff:
Improve embedded port detection for check targets by using
net.SplitHostPortonthe host part of the target string, making it more resilient to various formats
like unbracketed IPv6 addresses.
pkg/agent/push_loop.go [688-691]
[To ensure code accuracy, apply this suggestion manually]Suggestion importance[1-10]: 6
__
Why: The suggestion correctly identifies that the current port detection logic can fail for unbracketed IPv6 addresses, and proposes a more robust implementation using
net.SplitHostPort.✅
Validate client-provided timing dataSuggestion Impact:
The commit added response_time validation/normalization logic (clamping to 86_400_000 and defaulting to 0 for invalid values) and updated the status map to use the sanitized response_time instead of the raw client-provided value.code diff:
Validate and normalize the client-provided
response_timeto be a non-negativeinteger within a reasonable range before it is processed.
elixir/serviceradar_agent_gateway/lib/serviceradar_agent_gateway/agent_gateway_server.ex [461-475]
[Suggestion processed]Suggestion importance[1-10]: 7
__
Why: The suggestion correctly points out that
response_timefrom the client is not validated, which could lead to data corruption or processing errors downstream.✅
Preserve JSON parsing error detailsSuggestion Impact:
Updated the JSON parsing error handling to wrap and return the actual decode error (err) when decoding fails with non-EOF errors, improving debugging information.code diff:
Improve JSON parsing error handling by returning the specific underlying error
instead of a generic one, which will aid in debugging configuration file issues.
cmd/agent/main.go [129-133]
[Suggestion processed]Suggestion importance[1-10]: 6
__
Why: The suggestion correctly points out that a generic error is returned for JSON parsing issues, which hides the specific cause and makes debugging more difficult.
Previous suggestions
✅ Suggestions up to commit
c96fbf9Fix state machine transitions
Fix the
transition_statecalls in the state machine update actions to use thecorrect transition names (e.g.,
:degrade) instead of the target state atoms(e.g.,
:degraded) to prevent runtime errors.elixir/serviceradar_core/lib/serviceradar/infrastructure/gateway.ex [243-351]
Suggestion importance[1-10]: 9
__
Why: This suggestion correctly identifies a critical bug where
transition_stateis called with the target state instead of the transition name, which would cause runtime crashes.Fix gRPC connection initialization
Replace the manual gRPC connection logic using
grpc.NewClientwith the standardgrpc.DialContextandgrpc.WithBlockfor a simpler and more robustimplementation.
pkg/agent/gateway_client.go [130-154]
Suggestion importance[1-10]: 7
__
Why: The suggestion correctly identifies that using
grpc.DialContextwithgrpc.WithBlockis the idiomatic and more robust way to establish a gRPC connection, simplifying the code and improving maintainability.✅
Prevent false agent activitySuggestion Impact:
The code now uses Map.get(agent, :last_seen) and Map.get(agent, :last_seen_mono) instead of agent.last_seen and System.monotonic_time(:millisecond), preventing false activity on initial cache load.code diff:
In
load_initial_agents_cache, setlast_seen_monofrom the agent data from thetracker instead of the current monotonic time to avoid incorrectly marking stale
agents as active.
web-ng/lib/serviceradar_web_ng_web/live/infrastructure_live/index.ex [738-750]
Suggestion importance[1-10]: 7
__
Why: The suggestion correctly identifies a logic flaw where setting
last_seen_monoto the current time on initial load can falsely mark stale agents as active, leading to incorrect UI state.✅
Accept host:port/path targetsSuggestion Impact:
The commit adds an additional url.Parse("tcp://" + target) branch to detect an embedded port in host:port/path inputs without an explicit scheme, preventing such targets from being dropped.code diff:
Fix the embedded-port detection logic to correctly handle targets that include
both a port and a path (e.g.,
example.com:443/health). This prevents validTCP/gRPC checks from being silently dropped.
pkg/agent/push_loop.go [681-692]
Suggestion importance[1-10]: 8
__
Why: This suggestion correctly identifies a bug where valid check targets with a port and path (e.g.,
example.com:8080/health) are silently ignored, leading to a failure in monitoring. The fix is accurate and resolves this issue.✅
Prevent collectors from stopping earlySuggestion Impact:
The code was changed to return an error ("service not started") when the service lifecycle context is nil, preventing collectors from being started with the request context and stopping prematurely.code diff:
Prevent SNMP collectors from stopping prematurely by ensuring they are always
started with the service's lifecycle context. If the service is not running,
fail fast instead of using a potentially short-lived request context.
pkg/checker/snmp/service.go [321-331]
Suggestion importance[1-10]: 8
__
Why: The suggestion correctly identifies a critical bug where an SNMP collector's lifecycle could be tied to a short-lived request context, causing it to terminate prematurely. Failing fast prevents this and ensures collectors run for their intended duration.
✅
Fail when all items dropSuggestion Impact:
The commit replaced Enum.each with Enum.reduce to track processed_count, raises a GRPC.RPCError when processed_count == 0 but service_count > 0, and updates record_push_metrics to use processed_count instead of service_count.code diff:
Modify the
push_statusfunction to track the number of successfully processedservice statuses. If all statuses are dropped due to errors, the gRPC request
should fail instead of returning a success response.
elixir/serviceradar_agent_gateway/lib/serviceradar_agent_gateway/agent_gateway_server.ex [252-273]
[Suggestion processed]Suggestion importance[1-10]: 8
__
Why: This suggestion correctly identifies a silent data loss scenario where the client receives a success response even if all submitted data was invalid and dropped, improving the robustness of the gRPC endpoint.
✅
Avoid reconnect loops in idleSuggestion Impact:
Updated IsConnected to return true when the connection state is either connectivity.Ready or connectivity.Idle, matching the suggested switch-based logic.code diff:
Update
IsConnectedto consider theconnectivity.Idlestate as connected. Thisprevents unnecessary reconnect loops when the gRPC connection is temporarily
idle but still healthy.
pkg/agent/gateway_client.go [250-253]
Suggestion importance[1-10]: 7
__
Why: The suggestion correctly identifies that treating an
IdlegRPC connection as disconnected can cause unnecessary reconnect attempts, improving efficiency and reducing load on the agent and gateway.✅
Reject missing service namesSuggestion Impact:
The commit replaced `to_string() |> String.trim()` with a case expression that returns "" for nil/non-binary values and trims only binaries, enabling the existing validation to reject missing service_name instead of accepting "nil".code diff:
Explicitly handle
nilvalues forservice.service_nameto prevent them from beingprocessed as the string
"nil". The current implementation allows this, which canlead to invalid service records.
elixir/serviceradar_agent_gateway/lib/serviceradar_agent_gateway/agent_gateway_server.ex [408-425]
[Suggestion processed]Suggestion importance[1-10]: 7
__
Why: This is a valid bug fix that prevents data pollution by correctly handling
nilservice_namevalues, which would otherwise be processed as the string"nil".✅ Suggestions up to commit
b7127ef✅
Restrict monotonic fallback activitySuggestion Impact:
The commit updates agent_active? to gate the monotonic fallback with `and is_nil(agent[:last_seen])`, matching the suggestion’s intent and preventing monotonic timestamps from overriding stale wall-clock last_seen values. It also reuses agent_active? for agent cache pruning, aligning staleness rules with the UI logic.code diff:
In
agent_active?, restrict the monotonic time fallback to only apply when thewall-clock
last_seentimestamp isnilto ensure correct staleness detection.web-ng/lib/serviceradar_web_ng_web/live/infrastructure_live/index.ex [858-866]
Suggestion importance[1-10]: 8
__
Why: The suggestion correctly identifies a subtle bug where the monotonic time fallback could incorrectly mark an agent as active if a wall-clock time exists but is stale, improving the accuracy of staleness detection.
✅
Bound service shutdown with timeoutSuggestion Impact:
The commit removed the unbounded deferred server.Stop(context.Background()) and instead calls server.Stop() with a context timeout inside the shutdown goroutine, plus replaces time.After with an explicit timer. It also adds timed server.Stop() calls on the errChan path, further ensuring shutdown is bounded.code diff:
Move the
server.Stop()call into the timed shutdown block to ensure it isbounded by the
shutdownTimeout, preventing the process from hanging duringtermination.
cmd/agent/main.go [202-217]
[Suggestion processed]Suggestion importance[1-10]: 7
__
Why: The suggestion correctly identifies that
server.Stop()is not covered by the shutdown timeout, which could cause the process to hang, and proposes a valid fix to include it in the timed shutdown sequence.✅
Clamp skewed timestamps in pruningSuggestion Impact:
The commit updated the pruned_gateways_cache pruning logic to clamp delta_ms using max(now_ms - last_ms, 0) and removed the delta_ms < 0 check, preventing active gateways from being pruned due to clock skew. It also included additional related changes to agent pruning logic.code diff:
To prevent incorrectly pruning active gateways due to clock skew, clamp negative
delta_msvalues to0in thepruned_gateways_cachelogic.web-ng/lib/serviceradar_web_ng_web/live/infrastructure_live/index.ex [97-106]
Suggestion importance[1-10]: 7
__
Why: This is a valid bug fix for handling clock skew, which could cause active gateways to be incorrectly pruned, but the same logic is already correctly implemented in
compute_gateways, making this a consistency fix.✅
Avoid dropping agents on skewSuggestion Impact:
The pruning logic was changed to avoid dropping entities due to negative deltas: gateways now clamp delta_ms with `max(..., 0)` and no longer prune on `delta_ms < 0`. For agents, instead of directly computing delta_ms, the code now delegates pruning to `agent_active?/2`, which already clamps negative deltas and applies consistent staleness rules, achieving the same goal (avoid pruning due to clock skew) though via a different implementation.code diff:
To prevent incorrectly pruning active agents due to clock skew, clamp negative
delta_msvalues to0in thepruned_agents_cachelogic.web-ng/lib/serviceradar_web_ng_web/live/infrastructure_live/index.ex [108-117]
Suggestion importance[1-10]: 7
__
Why: This is a valid bug fix for handling clock skew, which could cause active agents to be incorrectly pruned, but the same logic is already correctly implemented in
agent_active?, making this a consistency fix.✅
Bind collectors to service contextSuggestion Impact:
The commit moved the serviceCtx lookup to before starting the collector and changed collector.Start to use serviceCtx (falling back to ctx when nil), aligning collector lifetime with the service context.code diff:
Start the SNMP collector with the service's long-lived context (
serviceCtx)instead of the caller-provided context to prevent resource leaks when the
service is stopped or restarted.
pkg/checker/snmp/service.go [322-372]
Suggestion importance[1-10]: 8
__
Why: The suggestion correctly identifies a potential goroutine leak where a collector started via
AddTargetmight not be tied to the service's lifecycle, and proposes using the service's context to fix it.✅
Prune agents using activity predicateSuggestion Impact:
The commit updated pruned_agents_cache to reject agents based on `not agent_active?(agent, now_ms)` instead of re-implementing staleness checks, aligning pruning with the UI logic. It also refined agent_active?/2 to only use the monotonic timestamp fallback when wall-clock `:last_seen` is nil, further ensuring consistent activity evaluation.code diff:
Imported GitHub PR comment.
Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707397340
Original created: 2026-01-03T22:00:43Z
Persistent suggestions updated to latest commit
1c51855Imported GitHub PR comment.
Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707456600
Original created: 2026-01-03T23:52:45Z
Persistent suggestions updated to latest commit
9c018c2Imported GitHub PR comment.
Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707466559
Original created: 2026-01-04T00:09:57Z
Persistent suggestions updated to latest commit
31c2e30Imported GitHub PR comment.
Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707515479
Original created: 2026-01-04T01:34:43Z
Persistent suggestions updated to latest commit
31165feImported GitHub PR comment.
Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707544156
Original created: 2026-01-04T02:20:43Z
Persistent suggestions updated to latest commit
b3ec5cdImported GitHub PR comment.
Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707670337
Original created: 2026-01-04T05:07:29Z
Persistent suggestions updated to latest commit
77a078eImported GitHub PR comment.
Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707800531
Original created: 2026-01-04T06:56:41Z
Persistent suggestions updated to latest commit
6196af3Imported GitHub PR comment.
Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707810490
Original created: 2026-01-04T07:11:17Z
Persistent suggestions updated to latest commit
da4580aImported GitHub PR comment.
Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707818048
Original created: 2026-01-04T07:21:19Z
Persistent suggestions updated to latest commit
db99aacImported GitHub PR comment.
Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2221#issuecomment-3707846321
Original created: 2026-01-04T08:04:14Z
Persistent suggestions updated to latest commit
c0cb179