2042 bugsysmon metrics not available from sysmon vm collection #2495
Imported from GitHub pull request.
Original GitHub pull request: #2043
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/pull/2043
Original created: 2025-12-03T01:30:40Z
Original updated: 2025-12-03T02:25:27Z
Original head: carverauto/serviceradar:2042-bugsysmon-metrics-not-available-from-sysmon-vm-collection
Original base: main
Original merged: 2025-12-03T02:25:24Z by @mfreeman451
PR Type
Bug fix, Enhancement
Description
Add memory metrics collection to sysmon-vm checker and fix metrics pipeline availability
Implement sysmon stall detection with logging when metrics stop arriving despite collector connectivity
Fix CNPG batch result reading to properly surface INSERT errors instead of silently discarding them
Add platform compatibility constraints (macOS-only) to sysmon-vm build targets to prevent Linux build failures
Improve sysmon metrics API routes to handle null/empty responses gracefully and preserve service-reported host identity
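The batch-result fix described above can be illustrated with a minimal Go sketch. It assumes a hypothetical `batchResults` interface standing in for the database driver's batch handle; the interface, the `drainBatch` helper, and the `fakeBatch` type are illustrative names, not code from this PR. The idea is to read one result per queued statement so that a failed INSERT is returned to the caller instead of being dropped when the handle is closed.

```go
package main

import (
	"errors"
	"fmt"
)

// batchResults is a hypothetical stand-in for a driver's batch result handle.
type batchResults interface {
	Exec() error  // reads the result of the next queued statement
	Close() error // releases the handle
}

// drainBatch reads one result per queued statement and returns the first
// statement error, joined with any error reported by Close.
func drainBatch(br batchResults, queued int) (err error) {
	defer func() {
		if cerr := br.Close(); cerr != nil {
			err = errors.Join(err, fmt.Errorf("close batch: %w", cerr))
		}
	}()
	for i := 0; i < queued; i++ {
		if execErr := br.Exec(); execErr != nil {
			return fmt.Errorf("batch statement %d of %d: %w", i+1, queued, execErr)
		}
	}
	return nil
}

// fakeBatch fails on the statement whose zero-based index equals failAt.
type fakeBatch struct {
	next   int
	failAt int
	closed bool
}

func (f *fakeBatch) Exec() error {
	i := f.next
	f.next++
	if i == f.failAt {
		return errors.New("insert rejected")
	}
	return nil
}

func (f *fakeBatch) Close() error {
	f.closed = true
	return nil
}

func main() {
	// The second of three queued statements fails and the error is surfaced.
	err := drainBatch(&fakeBatch{failAt: 1}, 3)
	fmt.Println(err)
}
```

With a real driver the same shape applies: loop over the queued statements, check each result, and only then close the handle.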
File Walkthrough
6 files
Add memory metrics collection to sysmon-vm
Implement sysmon stall detection and logging
Add sysmon flush logging and CNPG status checks
Initialize sysmon stall tracking state
Add sysmonStreamState type for stall tracking
Add sysmon metrics storage logging and validation
3 files
Add memory metrics validation to tests
Add test for sysmon stall event emission
Add test for host identity preservation
5 files
Fix metrics API to handle empty results gracefully
Fix batch result reading and add detailed logging
Preserve service-reported host identity in payloads
Handle null responses and use optional chaining
Handle null responses and use optional chaining
3 files
Add macOS platform constraint to sysmonvm alias
Add macOS platform constraint to sysmon-vm binary
Add macOS constraint and memory dependency
3 files
Document sysmon-vm metrics availability fix proposal
Add sysmon metrics availability requirements
Document investigation and fix tasks
Imported GitHub PR comment.
Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2043#issuecomment-3604650132
Original created: 2025-12-03T01:31:17Z
PR Compliance Guide 🔍
Below is a summary of compliance checks for this PR:
Excessive logging
Description: Extensive info logs added in sysmon flush path may leak sensitive operational details
(device_id, counts) and could enable log flooding under high load; consider reducing
verbosity or rate limiting.
flush.go [445-503]
🎫 #2042
errors.
collector host).
metrics appear within one polling interval and UI charts render as expected.
performance) beyond unit-level checks.
Generic: Meaningful Naming and Self-Documenting Code
Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting
Status: Passed
Generic: Robust Error Handling and Edge Case Management
Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation
Status: Passed
Generic: Secure Error Handling
Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.
Status: Passed
Generic: Security-First Input Validation and Data Handling
Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities
Status: Passed
Generic: Comprehensive Audit Trails
Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.
Status: Audit coverage unclear
New stall-detection and capability updates log events, but it is unclear whether these
actions are recorded in a durable audit trail with the user and context needed to
reconstruct events.
Generic: Secure Logging Practices
Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.
Status: Verbose logging
Newly added info logs include device identifiers and counts, which seem acceptable, but
the added log statements still need to be checked to confirm that no sensitive user data
or PII is logged.
Compliance status legend
🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label
Imported GitHub PR comment.
Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2043#issuecomment-3604652437
Original created: 2025-12-03T01:32:27Z
PR Code Suggestions ✨
Explore these optional code suggestions:
Fix race condition in stall detection
Fix a race condition in noteSysmonStall by ensuring all reads of the shared state object are performed within the critical section protected by the mutex, to prevent using stale data for stall detection.
pkg/core/metrics.go [1302-1330]
[To ensure code accuracy, apply this suggestion manually]
Suggestion importance [1-10]: 8
Why: The suggestion correctly identifies a race condition where shared state is read under a lock, but the lock is released before the values are used, potentially leading to incorrect stall detection logic. This is a critical concurrency bug.
Simplify redundant payload enrichment logic
Simplify the payload enrichment logic for host_ip and ip. Instead of checking both statusNode and payload, check only statusNode, add the IP if absent, and then ensure the top-level payload fields are consistent.
pkg/poller/agent_poller.go [502-522]
Suggestion importance [1-10]: 4
Why: The suggestion correctly points out redundant logic in the payload enrichment. While the current code is functional, the proposed change simplifies the logic by removing an unnecessary check, improving code clarity and maintainability.