Fix/agent gateway genserver #2637
No reviewers
Labels
No labels
1week
2weeks
Failed compliance check
IP cameras
NATS
Possible security concern
Review effort 1/5
Review effort 2/5
Review effort 3/5
Review effort 4/5
Review effort 5/5
UI
aardvark
accessibility
amd64
api
arm64
auth
back-end
bgp
blog
bug
build
checkers
ci-cd
cleanup
cnpg
codex
core
dependencies
device-management
documentation
duplicate
dusk
ebpf
enhancement
eta 1d
eta 1hr
eta 3d
eta 3hr
feature
fieldsurvey
github_actions
go
good first issue
help wanted
invalid
javascript
k8s
log-collector
mapper
mtr
needs-triage
netflow
network-sweep
observability
oracle
otel
plug-in
proton
python
question
reddit
redhat
research
rperf
rperf-checker
rust
sdk
security
serviceradar-agent
serviceradar-agent-gateway
serviceradar-web
serviceradar-web-ng
siem
snmp
sysmon
topology
ubiquiti
wasm
wontfix
zen-engine
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
carverauto/serviceradar!2637
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "refs/pull/2637/head"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Imported from GitHub pull request.
Original GitHub pull request: #2230
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/pull/2230
Original created: 2026-01-08T20:53:51Z
Original updated: 2026-01-08T22:49:53Z
Original head: carverauto/serviceradar:fix/agent_gateway_genserver
Original base: testing
Original merged: 2026-01-08T22:49:43Z by @mfreeman451
User description
IMPORTANT: Please sign the Developer Certificate of Origin
Thank you for your contribution to ServiceRadar. Please note, when contributing, the developer must include
a DCO sign-off statement indicating the DCO acceptance in one commit message. Here
is an example DCO Signed-off-by line in a commit message:
Describe your changes
Issue ticket number and link
Code checklist before requesting a review
PR Type
Enhancement, Tests
Description
Major architectural shift from poller to gateway model: Refactored agent from gRPC server to push-mode architecture with gateway client integration, replacing
PollerServicewithAgentGatewayServiceAgent Gateway gRPC Server implementation: New Elixir gRPC server for receiving status pushes from Go agents with mTLS authentication, multi-tenant security isolation, and agent enrollment/configuration management
Multi-tenant infrastructure: Implemented comprehensive tenant management with NATS account provisioning, encrypted seed storage, and per-tenant Horde registries for process isolation
NATS credentials file support: Added credentials file configuration across multiple components (trapd, zen consumer, flowgger, OTEL) for secure NATS connections
Device and agent runtime management: New GenServer-based device actor for runtime state management and agent resource with state machine lifecycle (connecting, connected, degraded, disconnected, unavailable)
Certificate generation and SPIFFE integration: X.509 certificate generation for tenant CAs and edge components with SPIFFE/SPIRE integration for distributed cluster security
Comprehensive database schema: Initial tenant schema migration with 1416 lines defining all tenant database tables including gateways, agents, devices, service checks, and monitoring infrastructure
Service health monitoring: Service heartbeat GenServer for periodic health status tracking of core, web, and gateway services
Edge onboarding packages: Ash-based context module for managing edge component onboarding with token generation, delivery, and component certificate signing
Statistics aggregation: Device statistics aggregator GenServer for periodic snapshots with device counts, availability, and capability breakdowns
Terminology updates: Renamed
poller_idtogateway_idacross multiple checkers (sysmon, rperf-client, trapd) and updated CLI subcommands (update-poller→update-gateway)DataService KV client: New gRPC client for datasvc KV operations with automatic reconnection and exponential backoff strategy
Removed legacy poller infrastructure: Deleted poller-specific files, configurations, and Docker/Kubernetes deployment templates
Diagram Walkthrough
File Walkthrough
31 files
main.go
Agent refactored to push mode with gateway integrationcmd/agent/main.go
gateway client integration
and embedded defaults fallback
loadConfig()andrunPushMode()functions for cleanerseparation of concerns
main.go
NATS account service initialization and configurationcmd/data-services/main.go
overrides
operations
NATSAccountServicegRPC server if configuredmain.go
CLI subcommands updated for gateway and NATS operationscmd/cli/main.go
update-pollersubcommand toupdate-gatewaynats-bootstrapsubcommand for NATS operationsadminsubcommand dispatcher withnatsadmin resource routingmain.go
SNMP checker service renamed from poller to gatewaycmd/checkers/snmp/main.go
SNMPPollerServicetoSNMPGatewayServicePollerstruct toGatewaystructapp.go
Core app gRPC service renamed to AgentGatewayServicecmd/core/app/app.go
PollerServicetoAgentGatewayServiceconfig.rs
Zen consumer NATS credentials file configurationcmd/consumers/zen/src/config.rs
nats_creds_fileconfiguration fieldnats_creds_fileis not empty when providednats_creds_path()method to resolve credentials file path withsupport for relative paths
server.rs
Sysmon checker updated to use gateway_id terminologycmd/checkers/sysmon/src/server.rs
poller_idfield togateway_idin status and results logginggateway_idinstead ofpoller_idconfig.rs
OTEL NATS credentials file configuration supportcmd/otel/src/config.rs
creds_filefield toNATSConfigTOMLstructNATSConfigstruct to includecreds_fileasOptionmain.rs
Trapd NATS credentials and gateway_id updatescmd/trapd/src/main.rs
non-secure modes
credentials_file()whennats_creds_pathis availablepoller_idtogateway_idin status and results responsesnats_output.rs
Flowgger NATS output credentials file supportcmd/flowgger/src/flowgger/output/nats_output.rs
creds_filefield toNATSConfigstructconfig.rs
Trapd NATS credentials file configurationcmd/trapd/src/config.rs
nats_creds_fileconfiguration fieldnats_creds_path()method to resolve file paths relative tosecurity cert directory
nats_output.rs
OTEL NATS output credentials file supportcmd/otel/src/nats_output.rs
creds_filefield toNATSConfigstruct with defaultNonegrpc_server.rs
Zen consumer gRPC server gateway_id terminologycmd/consumers/zen/src/grpc_server.rs
poller_idtogateway_idin status and results responsestructures
nats.rs
Zen consumer NATS credentials file integrationcmd/consumers/zen/src/nats.rs
nats_creds_path()to connection options when availableserver.rs
RPerf checker gateway_id field additioncmd/checkers/rperf-client/src/server.rs
gateway_idfield to status and results response structuressetup.rs
OTEL setup debug logging for credentialscmd/otel/src/setup.rs
onboarding_packages.ex
Edge onboarding packages context module implementationelixir/serviceradar_core/lib/serviceradar/edge/onboarding_packages.ex
package management
and soft-delete
lifecycle
service_heartbeat.ex
Service heartbeat GenServer for health monitoringelixir/serviceradar_core/lib/serviceradar/infrastructure/service_heartbeat.ex
heartbeats
agent_gateway_server.ex
Agent Gateway gRPC Server with mTLS Multi-Tenant Securityelixir/serviceradar_agent_gateway/lib/serviceradar_agent_gateway/agent_gateway_server.ex
mTLS authentication
hello), configuration requests (get_config),and status updates (
push_status,stream_status)multi-tenant security isolation
data to core cluster via RPC
agent.ex
Agent Resource with State Machine and Capability Managementelixir/serviceradar_core/lib/serviceradar/infrastructure/agent.ex
Agentresource for managing Go agents using Ash framework withstate machine
disconnected, unavailable)
SNMP) and OCSF type mappings
management, and heartbeat operations
tenant_registry.ex
Multi-Tenant Registry with Horde-Based Process Isolationelixir/serviceradar_core/lib/serviceradar/cluster/tenant_registry.ex
multi-tenant process isolation
with slug-based alias mapping
management within tenant boundaries
heartbeat tracking
generator.ex
X.509 Certificate Generation for Tenant CAs and Componentselixir/serviceradar_core/lib/serviceradar/edge/tenant_ca/generator.ex
components using Erlang's
:public_keymodulecertificates (1-year validity) with SPIFFE URIs
tenant information from certificate CNs
KeyUsage, SubjectAltName)
alias_events.ex
Device Alias Lifecycle Event Tracking and Detectionelixir/serviceradar_core/lib/serviceradar/identity/alias_events.ex
with change detection
AliasRecordstruct for parsing alias metadata from devicerecords
with state transitions
validation
hash_password.ex
Bcrypt Password Hashing Change for Ash Resourceselixir/serviceradar_core/lib/serviceradar/identity/changes/hash_password.ex
actions
hashed_passwordattribute
account_client.ex
NATS Account Client for Multi-Tenant Isolationelixir/serviceradar_core/lib/serviceradar/nats/account_client.ex
isolation
and operator bootstrap
DataService.Client unavailable
(limits, permissions, mappings)
spiffe.ex
SPIFFE/SPIRE Integration for Cluster Securityelixir/serviceradar_core/lib/serviceradar/spiffe.ex
verification
monitoring
loading
stats_aggregator.ex
Device Statistics Aggregator with Periodic Snapshotselixir/serviceradar_core/lib/serviceradar/core/stats_aggregator.ex
capability breakdowns
tenant.ex
Tenant Resource with NATS Account Provisioningelixir/serviceradar_core/lib/serviceradar/identity/tenant.ex
system
billing plans
AshCloak
account management
device.ex
Device Actor for Runtime State Managementelixir/serviceradar_core/lib/serviceradar/actors/device.ex
buffering
hibernation
persistence
client.ex
DataService KV Client with Reconnection Logicelixir/serviceradar_core/lib/serviceradar/data_service/client.ex
pooling
overrides
reserved_tenant_slug.ex
Reserved Tenant Slug Validationelixir/serviceradar_core/lib/serviceradar/identity/validations/reserved_tenant_slug.ex
platform tenant
2 files
main.go
API documentation updated for gateway terminologycmd/core/main.go
main.go
Config sync tool documentation updatedcmd/tools/config-sync/main.go
"poller"
1 files
message_processor.rs
Zen message processor test configuration updatecmd/consumers/zen/src/message_processor.rs
nats_creds_file: Nonefield to test configuration struct1 files
20260107043446_initial_schema.exs
Initial tenant schema migration with full database structureelixir/serviceradar_core/priv/repo/tenant_migrations/20260107043446_initial_schema.exs
defining all tenant database tables
schedules, NATS infrastructure, and monitoring
architecture
1 files
.gitkeep
Credentials Directory Placeholderdocker/compose/creds/.gitkeep
101 files
Imported GitHub PR comment.
Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2230#issuecomment-3725755737
Original created: 2026-01-08T20:55:17Z
PR Compliance Guide 🔍
Below is a summary of compliance checks for this PR:
Arbitrary file write
Description: The datasvc process accepts resolver/config output paths from environment variables (e.g.,
NATS_OPERATOR_CONFIG_PATH,NATS_RESOLVER_PATH) and then callsWriteOperatorConfig(), whichcould enable an arbitrary file write/overwrite if an attacker can influence the process
environment or config (e.g., writing operator config into a sensitive filesystem
location).
main.go [83-130]
Referred Code
Sensitive log exposure
Description: Untrusted request fields are logged verbatim (e.g.,
req.detailsinGetStatus/GetResults),which can leak sensitive data into logs and enable log-forging/injection; similar patterns
appear in other services in this PR (e.g.,
cmd/otel/src/setup.rslogsnats.creds_filepaths and multiple agent services log request details/IDs).
server.rs [214-220]
Referred Code
🎫 No ticket provided
Codebase context is not defined
Follow the guide to enable codebase context checks.
Generic: Meaningful Naming and Self-Documenting Code
Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting
Status: Passed
Generic: Robust Error Handling and Edge Case Management
Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation
Status:
Potential panic unwrap: The new NATS creds-file parsing uses
v.as_str().unwrap()which can panic on unexpectedconfig types instead of returning a handled error with context.
Referred Code
Generic: Secure Logging Practices
Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.
Status:
Sensitive info in logs: The PR logs potentially sensitive operational/security configuration details (e.g.,
allowed client identities lists, resolver paths, and system creds file locations) which
can aid attackers or leak environment secrets.
Referred Code
Generic: Comprehensive Audit Trails
Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.
Status:
Missing audit events: The PR introduces security-relevant operations (e.g., NATS account bootstrap/config
writes) but only emits ad-hoc logs and does not show explicit audit-trail events with
actor identity and outcome.
Referred Code
Generic: Secure Error Handling
Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.
Status:
Exception message exposure: The new crypto utility raises explicit decryption/configuration failure strings that may
become user-visible depending on caller boundaries and should be verified as
internal-only.
Referred Code
Generic: Security-First Input Validation and Data Handling
Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities
Status:
Unvalidated request logging: The PR logs
req.detailsdirectly from incoming requests, which may containuntrusted/sensitive content and should be validated/sanitized or redacted before logging.
Referred Code
Compliance status legend
🟢 - Fully Compliant🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label
Imported GitHub PR comment.
Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2230#issuecomment-3725765477
Original created: 2026-01-08T20:56:40Z
PR Code Suggestions ✨
Explore these optional code suggestions:
Include tenant_id in CA return
Add the
tenant_idto the map returned bygenerate_tenant_cato ensuregenerate_component_certcan function correctly.elixir/serviceradar_core/lib/serviceradar/edge/tenant_ca/generator.ex [106-115]
[To ensure code accuracy, apply this suggestion manually]Suggestion importance[1-10]: 9
__
Why: The suggestion correctly identifies a critical data omission. The
tenant_camap returned bygenerate_tenant_cais missing thetenant_id, which is essential for the subsequentgenerate_component_certfunction to operate correctly. Adding it fixes a definite bug.Fix push interval time unit
Correct the push interval calculation by multiplying the
cfg.PushIntervalvalueby
time.Secondto ensure it is interpreted as seconds instead of nanoseconds.cmd/agent/main.go [158-173]
Suggestion importance[1-10]: 8
__
Why: This suggestion correctly identifies a potential bug where a configuration value in seconds is misinterpreted as nanoseconds, leading to an incorrect push interval. Applying this fix is critical for the agent's correct timing behavior.
Prevent crashes from unavailable nodes
Add
try/rescueblocks within thecore_nodes/0function to handle potential:rpc.callerrors, preventing crashes when a node is unavailable.elixir/serviceradar_agent_gateway/lib/serviceradar_agent_gateway/agent_gateway_server.ex [867-888]
[To ensure code accuracy, apply this suggestion manually]Suggestion importance[1-10]: 8
__
Why: The suggestion correctly identifies a potential runtime crash due to an unhandled exception from
:rpc.callin a distributed environment and provides a robust fix, significantly improving the service's fault tolerance.✅
Handle nil timestamps during deduplicationSuggestion Impact:
Replaced Enum.max_by/3's DateTime comparator with a custom comparison function that safely handles nil timestamp values.code diff:
Prevent a potential crash in
deduplicate_by_device_idby providing a customcomparator to
Enum.max_bythat correctly handlesniltimestamp values.elixir/serviceradar_core/lib/serviceradar/identity/alias_events.ex [513-520]
[To ensure code accuracy, apply this suggestion manually]Suggestion importance[1-10]: 8
__
Why: The suggestion correctly identifies a potential
FunctionClauseErrorwhenEnum.max_by/3receives aniltimestamp, which is permitted by thedevice_updatetype. The proposed fix makes the function robust by providing a custom comparator that correctly handlesnilvalues.✅
Prevent errors from missing tenantSuggestion Impact:
The commit added an explicit nil check for tenant_value and raises an ArgumentError with a clear message before calling TenantSchemas.schema_for_tenant/1, preventing downstream failures when the tenant cannot be resolved.code diff:
In
resolve_tenant/2, add a check to ensure a tenant identifier is found, andraise an error if it is
nilto prevent downstream failures.elixir/serviceradar_core/lib/serviceradar/edge/onboarding_packages.ex [448-456]
[To ensure code accuracy, apply this suggestion manually]Suggestion importance[1-10]: 7
__
Why: The suggestion correctly identifies that a
niltenant identifier can cause a downstream error and proposes a valid guard clause. This improves robustness by failing early with a clear error message.Use date-aware addition for validity
Replace the manual date calculation with
DateTime.add/3to correctly account forleap years when setting the certificate's
not_afterdate.elixir/serviceradar_core/lib/serviceradar/edge/tenant_ca/generator.ex [315-320]
[To ensure code accuracy, apply this suggestion manually]Suggestion importance[1-10]: 6
__
Why: The suggestion correctly identifies that the manual date calculation for certificate expiry does not account for leap years and proposes using
DateTime.add/3for a more accurate and robust implementation.Only write operator config when path set
Add a guard to ensure
WriteOperatorConfig()is only called whenoperatorConfigPathis not empty, preventing a potential error when onlyresolverPathis set.cmd/data-services/main.go [117-130]
Suggestion importance[1-10]: 7
__
Why: This suggestion correctly identifies a potential panic by adding a necessary check for a non-empty
operatorConfigPathbefore attempting to write a file, thus preventing a runtime error.Refactor duplicated code for clarity
Refactor the duplicated
server.Stopcall to a single location to handle cleanupfor all exit paths from the push loop, improving code maintainability.
cmd/agent/main.go [231-245]
Suggestion importance[1-10]: 5
__
Why: This suggestion correctly identifies and resolves duplicated code by refactoring the shutdown logic, which improves code maintainability and readability.
Simplify trailing JSON data check
Replace the complex logic for detecting trailing JSON data with a simpler, more
idiomatic check using
dec.More().cmd/agent/main.go [133-137]
Suggestion importance[1-10]: 4
__
Why: The suggestion proposes using the idiomatic
dec.More()function to check for trailing JSON data, which simplifies the code and improves readability.