Sysmonvm/integration tests 2 #2316
No reviewers
Labels
No labels
1week
2weeks
Failed compliance check
IP cameras
NATS
Possible security concern
Review effort 1/5
Review effort 2/5
Review effort 3/5
Review effort 4/5
Review effort 5/5
UI
aardvark
accessibility
amd64
api
arm64
auth
back-end
bgp
blog
bug
build
checkers
ci-cd
cleanup
cnpg
codex
core
dependencies
device-management
documentation
duplicate
dusk
ebpf
enhancement
eta 1d
eta 1hr
eta 3d
eta 3hr
feature
fieldsurvey
github_actions
go
good first issue
help wanted
invalid
javascript
k8s
log-collector
mapper
mtr
needs-triage
netflow
network-sweep
observability
oracle
otel
plug-in
proton
python
question
reddit
redhat
research
rperf
rperf-checker
rust
sdk
security
serviceradar-agent
serviceradar-agent-gateway
serviceradar-web
serviceradar-web-ng
siem
snmp
sysmon
topology
ubiquiti
wasm
wontfix
zen-engine
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
carverauto/serviceradar!2316
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "refs/pull/2316/head"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Imported from GitHub pull request.
Original GitHub pull request: #1755
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/pull/1755
Original created: 2025-10-13T06:06:41Z
Original updated: 2025-10-14T03:13:48Z
Original head: carverauto/serviceradar:sysmonvm/integration_tests_2
Original base: main
Original merged: 2025-10-14T03:13:29Z by @mfreeman451
Auto-created Ticket
#1756
PR Type
Enhancement, Bug fix, Documentation
Description
Major Enhancements:
Implemented automatic device registration from checker metadata, eliminating manual device setup requirements
Added automatic core reconnection logic in poller with retry mechanism for transient failures
Consolidated database migrations into single schema file, simplifying deployment and maintenance
Added CPU frequency metrics collection, processing, and visualization across the entire stack
Implemented per-checker security configuration support in agent
Infrastructure & Configuration:
Added authentication endpoints (
/auth/login,/auth/refresh) to Kong API gatewayAdded resource limits (8GB memory) and
PROTON_DISABLE_TASKSTATSfor Proton serviceImplemented password persistence and recovery for Proton database across restarts
Preserved core configuration and auth keys across container restarts
Added Docker Compose override for local image development
UI Enhancements:
Added CPU frequency visualization components (card, chart, per-core details)
Integrated CPU frequency metrics into system metrics dashboard
Added service hint support to sysmon status indicator for better checker capability detection
Added configurable max value support to metric card component
Error Handling & Reliability:
Improved batch metric insertion with error tracking and validation
Added timestamp skew validation for sysmon payloads
Enhanced service messages with host identity for device correlation
Documentation:
Completely rewrote custom checker documentation focusing on gRPC architecture with sequence diagrams
Refined device identity validation rules for non-sweep sources
Updated migration file references in documentation
Configuration Changes:
Added sysmon-vm checker configuration with gRPC settings
Updated sweep configuration with ICMP mode
Removed incremental migration files in favor of consolidated schema
Diagram Walkthrough
File Walkthrough
13 files
poller.go
Add automatic core reconnection and error handling for pollerpkg/poller/poller.go
Imported GitHub PR comment.
Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/1755#issuecomment-3395986839
Original created: 2025-10-13T06:08:19Z
PR Compliance Guide 🔍
(Compliance updated until commit
github.com/carverauto/serviceradar@ffdbd2720a)Below is a summary of compliance checks for this PR:
Placeholder
Description: Previous string-concatenated SQL was removed; the new code uses parameterized queries
which is safe—no issue here.
sysmon.go [394-406]
Referred Code
Device spoofing risk
Description: Device IDs are derived directly from a reported host IP in checker payloads without strong
authentication or validation, allowing a malicious or misconfigured checker to spoof IPs
and auto-register/update devices with high confidence.
devices.go [122-245]
Referred Code
🎫 #1756
return them in sysmon responses and queries.
writes should error meaningfully when nothing is appended.
labels/clusters and cluster aggregates, including host identifiers and timestamps.
auto-register devices from checker payloads using host IP/ID with source confidence.
resolved host IP, carry agent_id, and support hot-reloading poll interval.
KV configs for common checkers including sysmon-vm.
per-core data for poller- and device-scoped queries.
sysmon-vm endpoint discovery and Kong auth routes; preserve passwords/keys and
wait-for-core in web entrypoint; update K8s/demo scripts.
badges/summaries, and host metadata panel; align heterogeneous series.
Codebase context is not defined
Follow the guide to enable codebase context checks.
No custom compliance provided
Follow the guide to enable custom compliance check.
Compliance status legend
🟢 - Fully Compliant🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label
Previous compliance checks
Compliance check up to commit 471472d
Credential file exposure
Description: The script writes the Proton password into shared credentials with file path
/etc/serviceradar/credentials/proton-password; although chmod 600 is applied here, other
environments may mount this volume broadly—verify that the volume's ownership and
permissions prevent unintended container or host users from reading secrets.
proton-init.sh [62-69]
Referred Code
Secret handling in k8s
Description: The init script stores the Proton password into a shared credentials file inside the
cluster; ensure the ConfigMap/emptyDir/volume backing this path enforces pod-level and
filesystem permissions so the secret is not world-readable across containers/namespaces.
deploy.sh [321-325]
Referred Code
SQL injection
Description: Raw SQL query string is built with fmt.Sprintf embedding device_id and timestamps; if
inputs are not sanitized or strictly controlled, this can allow SQL injection—prefer
parameterized queries.
sysmon.go [398-404]
Referred Code
SQL injection
Description: Another raw SQL query is built via fmt.Sprintf with unescaped device_id and timestamp
strings, posing SQL injection risk—use prepared statements with placeholders instead.
sysmon.go [453-461]
Referred Code
Untrusted JSON propagation
Description: The function enriches service messages by trusting and rewriting arbitrary JSON from
checks; if downstream systems log or render these fields without sanitization, it could
enable log injection or UI injection—ensure escaping/sanitization where messages are used.
agent_poller.go [272-325]
Referred Code
🎫 No ticket provided
Codebase context is not defined
Follow the guide to enable codebase context checks.
No custom compliance provided
Follow the guide to enable custom compliance check.
Compliance check up to commit fddabde
Insecure file permissions
Description: The Proton password is written to /etc/serviceradar/credentials/proton-password with
world-readable permissions (chmod 644), which can expose credentials inside the
container/host.
proton-init.sh [65-69]
Referred Code
Insecure debug script
Description: The grpcurl helper hardcodes a private IP and uses plaintext gRPC without TLS, which risks
accidental exposure if executed in the wrong environment.
g.sh [8-11]
Referred Code
Fragile error handling
Description: Reconnect decision relies partly on substring matching of error messages, which can be
unreliable and may lead to unintended retry storms; consider bounded backoff and stricter
code checks.
poller.go [522-527]
Referred Code
🎫 No ticket provided
Codebase context is not defined
Follow the guide to enable codebase context checks.
No custom compliance provided
Follow the guide to enable custom compliance check.
Imported GitHub PR comment.
Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/1755#issuecomment-3395989674
Original created: 2025-10-13T06:09:32Z
PR Code Suggestions ✨
Latest suggestions up to
7a4be7bPrevent IP from overwriting host ID
In
enrichServiceMessageWithAddress, avoid overwritinghost_idwith an IPaddress. Instead, add the resolved IP to a separate field like
resolved_host_ipto preserve the original host identifier.
pkg/poller/agent_poller.go [272-326]
Suggestion importance[1-10]: 8
__
Why: The suggestion correctly identifies a bug where
host_idis incorrectly overwritten with an IP address, which corrupts telemetry data and can break device correlation logic.Optimize cpu_metrics sort key
To optimize performance for the
cpu_metricsstream, removeclusterfrom theORDER BYclause and instead create a separate bloom filter index on it.pkg/db/migrations/00000000000001_consolidated_serviceradar_schema.up.sql [211-222]
[To ensure code accuracy, apply this suggestion manually]Suggestion importance[1-10]: 7
__
Why: The suggestion correctly identifies a potential performance issue by adding
clusterto theORDER BYclause and proposes a valid optimization by using a separate index instead, which improves query performance and data ingestion.Block until config exists
Prevent the
serviceradar-pollerfrom crashing on startup by modifying itscommand to wait until the
poller.jsonconfiguration file exists beforeexecuting.
docker-compose.yml [329-348]
[To ensure code accuracy, apply this suggestion manually]Suggestion importance[1-10]: 7
__
Why: The suggestion correctly identifies a race condition where the
serviceradar-pollermight start before its configuration file is ready, and proposes a robust shell command to wait for the file, preventing startup failures.Ensure secure, reliable secret persistence
In
proton-init.sh, ensure secret directories exist with secure permissions andadd error handling when writing the password file to prevent silent failures and
ensure credential persistence.
docker/compose/proton-init.sh [62-69]
[To ensure code accuracy, apply this suggestion manually]Suggestion importance[1-10]: 6
__
Why: The suggestion improves the robustness of secret handling by adding explicit directory creation and error checking, which prevents silent failures and ensures password persistence.
Prevent device ID collisions and validate IP
To prevent collisions, make the
deviceIDmore unique by validating thehostIPand including a tie-breaker like
agentIDorsvc.ServiceNamein its generation.pkg/core/devices.go [31-120]
Suggestion importance[1-10]: 8
__
Why: The suggestion addresses a significant architectural weakness in device identification, preventing potential data collisions and malformed
deviceIDs, thus improving data integrity.Add missing usage and tighten ORDER BY
Add the missing
usage_percentcolumn to thecpu_cluster_metricsstream andinclude
poller_idandagent_idin itsORDER BYclause to ensure data correctnessand improve query performance.
pkg/db/migrations/00000000000001_consolidated_serviceradar_schema.up.sql [224-240]
[To ensure code accuracy, apply this suggestion manually]Suggestion importance[1-10]: 8
__
Why: The suggestion correctly identifies that the new
cpu_cluster_metricsstream is missing ausage_percentfield, which is crucial for UI components that display cluster usage. It also correctly points out thatpoller_idandagent_idshould be in theORDER BYclause for data integrity and performance, which is a critical consideration for streaming databases.Add defensive null checks and stable keys
Add a null check for each
devicein thedevices.maploop and provide a fallbackkeyusing the index to prevent potential runtime errors and improve componentstability.
web/src/components/Devices/DeviceTable.tsx [209-218]
Suggestion importance[1-10]: 6
__
Why: The suggestion improves code robustness by adding a null check for
deviceand providing a fallback for the Reactkey, which are good defensive programming practices against potentially malformed data.Use consistent truncated timestamps
In
GetAllCPUMetrics, use the truncated timestamp for the finalSysmonCPUResponseto ensure consistent grouping and alignment between CPU and cluster metrics.
pkg/db/metrics.go [826-912]
[To ensure code accuracy, apply this suggestion manually]Suggestion importance[1-10]: 7
__
Why: The suggestion correctly identifies a subtle bug where inconsistent timestamp handling could lead to data misalignment between CPU and cluster metrics.
Previous suggestions
✅ Suggestions up to commit
86685fdPreserve original host identity fields
Avoid overwriting
host_idandhost_ipin the service message. Instead, store theaddress derived from
check.Detailsin separate auxiliary fields to preserve theoriginal agent-reported identity.
pkg/poller/agent_poller.go [272-325]
Suggestion importance[1-10]: 9
__
Why: The suggestion correctly identifies a critical data integrity issue where agent-reported identity (
host_ip,host_id) is overwritten, which could lead to incorrect device correlation and data misattribution.Validate device key IP input
Validate that the extracted
hostIPis a valid IP address before using it toconstruct a
deviceIDto prevent creating invalid device records from hostnamesor other strings.
pkg/core/devices.go [31-62]
Suggestion importance[1-10]: 8
__
Why: The suggestion correctly identifies that a non-IP string could be used in
deviceID, leading to invalid device records. Enforcing IP validation for the device key is a crucial fix for data integrity.✅
Add indexes for new cluster fieldsSuggestion Impact:
The commit added bloom filter indexes on cpu_metrics for cluster and label, matching the suggestion. It also added an index for cluster on cpu_cluster_metrics and adjusted ORDER BY clauses.code diff:
Add bloom filter indexes to the
cpu_metricsstream for the newclusterandlabelfields to improve query performance.
pkg/db/migrations/00000000000001_consolidated_serviceradar_schema.up.sql [209-222]
Suggestion importance[1-10]: 8
__
Why: This is a valid and important performance optimization. Adding bloom filter indexes on the new
clusterandlabelfields is crucial for efficient querying, especially since these fields are used for filtering and grouping in the UI, preventing full stream scans oncpu_metrics.✅
Optimize sort key and index for clustersSuggestion Impact:
The commit reordered the cpu_cluster_metrics ORDER BY to (timestamp, cluster, host_id, device_id) and added a bloom filter index on cluster for that stream, matching the suggestion's intent.code diff:
Optimize the
cpu_cluster_metricsstream by reordering the primary key toprioritize
clusterand adding a bloom filter index on it.pkg/db/migrations/00000000000001_consolidated_serviceradar_schema.up.sql [224-237]
[Suggestion processed]Suggestion importance[1-10]: 8
__
Why: This suggestion correctly identifies a performance issue in the new
cpu_cluster_metricsstream. Reordering the primary key to(timestamp, cluster, ...)and adding a bloom filter index onclusterwill significantly improve the performance of common UI queries that filter by cluster.Add safe db type assertion
*Add a safe type assertion when accessing
s.dbServiceto prevent a potentialpanic if the underlying type is not
db.DB.pkg/core/api/sysmon.go [459]
Suggestion importance[1-10]: 7
__
Why: The suggestion correctly points out a potential panic from an unsafe type assertion, improving the code's robustness and preventing a server crash, which is a valuable improvement for production stability.
✅
Add defensive null checksSuggestion Impact:
The commit added a defensive check ensuring serviceData exists before accessing its properties and provided a nullish coalescing fallback for serviceData.details.code diff:
Add null checks for the
serviceDataobject and itsdetailsproperty to preventpotential runtime errors during rendering.
web/src/components/Service/Dashboard.tsx [280-287]
[Suggestion processed]Suggestion importance[1-10]: 7
__
Why: The suggestion correctly identifies a potential null pointer exception if
serviceDatais not yet loaded, improving the component's robustness against runtime errors.✅
Clamp and sanitize metric valueSuggestion Impact:
The commit added safeMax/rawValue/clampedValue and replaced uses of current with clampedValue and max with safeMax across StatusIndicator, ValueDisplay, and ProgressBar.code diff:
Sanitize the
currentandmaxprops in theMetricCardcomponent by clamping thevalue to a safe range
[0, max]to prevent UI issues.web/src/components/Metrics/shared-components.jsx [123-153]
[Suggestion processed]Suggestion importance[1-10]: 7
__
Why: This suggestion improves the component's resilience by sanitizing the
currentandmaxprops, preventing potential UI bugs or crashes from invalid data.✅
Normalize timestamps for joiningSuggestion Impact:
The commit implemented truncating timestamps to second-resolution for both CPU metrics and cluster metrics map keys, and adjusted the response timestamp to use the original metric timestamp.code diff:
In
GetAllCPUMetrics, normalize timestamps by truncating them to the secondbefore using them as map keys to ensure correct correlation between core and
cluster metrics.
pkg/db/metrics.go [826-912]
Suggestion importance[1-10]: 8
__
Why: The suggestion correctly identifies a critical Go pitfall of using
time.Timeas a map key, which would cause data to be mis-associated, and provides the correct fix by truncating the timestamp.✅
Guard optional nested dataSuggestion Impact:
The commit updated the component to use optional chaining and array checks: wrapped cpuFrequency and memory with data?., changed cpu.clusters and cpu.cores and disk.drives to Array.isArray checks, matching the suggested guards.code diff:
Add null-safe guards for
data.cpu.cores,data.disk.drives, anddata.memorytoprevent the application from crashing if these properties are accessed before
they are loaded.
web/src/components/Metrics/system-metrics.jsx [288-296]
[Suggestion processed]Suggestion importance[1-10]: 8
__
Why: This suggestion correctly identifies missing null-safe guards for nested properties like
data.cpu.cores, preventing a likely runtime crash when data is partially loaded, which is a critical fix for UI stability.✅
Include cluster in sorting keySuggestion Impact:
The commit updated cpu_metrics to ORDER BY (timestamp, device_id, host_id, cluster, core_id), adding cluster as suggested. It also added bloom filter indexes and adjusted ordering for another stream.code diff:
Add the new
clusterfield to theORDER BYclause of thecpu_metricsstream. Thiswill improve query performance and data locality for cluster-based aggregations.
pkg/db/migrations/00000000000001_consolidated_serviceradar_schema.up.sql [209-222]
Suggestion importance[1-10]: 7
__
Why: This is a valid and important performance optimization for the
cpu_metricsstream, as includingclusterin theORDER BYclause will improve data locality and query efficiency for cluster-based aggregations.✅
Remove hardcoded private IPSuggestion Impact:
The hardcoded IP address was replaced with an environment-variable-based address: "${SYSMON_VM_ADDRESS:-sysmon-vm:50110}".code diff:
Replace the hardcoded IP address in
sysmon-vm.checker.jsonwith a configurablevalue, such as an environment variable, to improve deployment flexibility.
docker/compose/sysmon-vm.checker.json [1-9]
[Suggestion processed]Suggestion importance[1-10]: 8
__
Why: The suggestion correctly points out that a hardcoded IP address harms portability and should be replaced with a configurable value, which is a critical best practice for deployment configurations.
✅
Align Proton memory to container limitsSuggestion Impact:
The commit added the suggested Proton memory environment variables (PROTON_MAX_SERVER_MEMORY, PROTON_MAX_QUERY_MEMORY, PROTON_MAX_PARTITION_BYTES) along with a comment explaining the alignment with container limits.code diff:
Set explicit memory limits for the Proton service via environment variables to
align with container memory limits and prevent OOM kills.
docker-compose.yml [72-92]
Suggestion importance[1-10]: 7
__
Why: The suggestion correctly points out a potential for OOM kills by aligning Proton's internal memory settings with the container's memory limits. This enhances the stability of the
protonservice under load, making it a valuable configuration improvement.✅ Suggestions up to commit
fddabdeDatabase migration strategy breaks existing deployments
The PR replaces the incremental database migration system with a single schema
file that only runs on empty databases. This breaks the upgrade path for
existing deployments, as there is no way to apply new schema changes to
databases that already contain data.
Examples:
pkg/db/migrate.go [35-75]
Solution Walkthrough:
Before:
After:
Suggestion importance[1-10]: 10
__
Why: The suggestion correctly identifies a critical flaw where the new migration strategy removes the upgrade path for existing databases, which is a breaking change for all current deployments.
Fix missing target stream clauses
Add the missing
INTOclause to theocsf_devices_aggregator_mvandocsf_users_aggregator_mvmaterialized views to specify their target streams andfix a syntax error.
pkg/db/migrations/schema.sql [1298-1408]
Suggestion importance[1-10]: 10
__
Why: The suggestion correctly identifies a critical syntax error in two
CREATE MATERIALIZED VIEWstatements that are missing the requiredINTOclause. This would cause the SQL migration to fail, so the fix is essential for correctness.✅
Fix view definition and function nameSuggestion Impact:
The committed diff updated the ocsf_observable_device_macs_mv definition to include the INTO ocsf_observable_index clause and corrected the function call to replaceAll for normalizing MAC addresses.code diff:
Fix the
ocsf_observable_device_macs_mvview by adding the missingINTOocsf_observable_indexclause and correcting the function name fromreplace_allto
replaceAll.pkg/db/migrations/schema.sql [1452-1485]
Suggestion importance[1-10]: 10
__
Why: The suggestion correctly identifies two critical errors in the
ocsf_observable_device_macs_mvview definition: a missingINTOclause and an incorrect function name (replace_allinstead ofreplaceAll). Both are syntax errors that would cause the migration to fail.Avoid holding lock during network call
Refactor
reconnectCoreto release the mutex lock before making the blockingconnectToCorenetwork call to avoid contention.pkg/poller/poller.go [481-497]
Suggestion importance[1-10]: 8
__
Why: The suggestion correctly identifies a significant concurrency issue where a mutex is held during a blocking network call, which can lead to performance degradation or deadlocks.
✅
Replace hardcoded IP with service nameSuggestion Impact:
The hardcoded IP was replaced with a configurable address defaulting to "sysmon-vm:50110", aligning with the suggestion's intent to use the service name.code diff:
In
docker/compose/sysmon-vm.checker.json, replace the hardcoded IP address inthe
addressfield with the service namesysmon-vmto ensure portability withinthe Docker environment.
docker/compose/sysmon-vm.checker.json [4]
Suggestion importance[1-10]: 7
__
Why: The suggestion correctly identifies a hardcoded IP address and recommends using the Docker service name, which is a best practice for portability and maintainability in a containerized environment.
✅
Handle empty metric slices correctlySuggestion Impact:
The commit implemented logic so that an "no cpu metrics appended" error is only returned when appending fails, and otherwise returns a specific error constant. Although an explicit early return for empty cpus is not shown in the diff, the updated handling of the "no metrics appended" case aligns with the suggestion’s intent to avoid misleading errors for empty input. Additionally, error messages were refactored to use constants.code diff: