2042 bugsysmon metrics not available from sysmon vm collection #2497

Merged
mfreeman451 merged 6 commits from refs/pull/2497/head into main 2025-12-03 04:10:03 +00:00
mfreeman451 commented 2025-12-03 04:08:32 +00:00 (Migrated from github.com)
Owner

Imported from GitHub pull request.

Original GitHub pull request: #2045
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/pull/2045
Original created: 2025-12-03T04:08:32Z
Original updated: 2025-12-03T04:10:26Z
Original head: carverauto/serviceradar:2042-bugsysmon-metrics-not-available-from-sysmon-vm-collection
Original base: main
Original merged: 2025-12-03T04:10:03Z by @mfreeman451

User description

IMPORTANT: Please sign the Developer Certificate of Origin

Thank you for your contribution to ServiceRadar. Please note, when contributing, the developer must include
a DCO sign-off statement indicating the DCO acceptance in one commit message. Here
is an example DCO Signed-off-by line in a commit message:

Signed-off-by: J. Doe <j.doe@domain.com>

Describe your changes

Code checklist before requesting a review

  • I have signed the DCO?
  • The build completes without errors?
  • All tests are passing when running make test?

PR Type

Bug fix, Enhancement


Description

  • Fix sysmon-vm metrics pipeline to persist CPU/memory data from collectors to CNPG and API endpoints

  • Add memory metrics collection support and stall detection for sysmon collectors

  • Explicitly qualify all metrics table names with public schema to prevent Apache AGE catalog conflicts

  • Improve error handling in CNPG batch operations and add comprehensive logging for debugging

  • Preserve service-reported host identity in payload enrichment to avoid overwriting collector metadata

  • Add null-safety checks in web API routes to return empty arrays instead of 500 errors


Diagram Walkthrough

flowchart LR
  A["sysmon-vm collector"] -->|"emit CPU/memory metrics"| B["poller agent"]
  B -->|"parse & enrich payload"| C["core server"]
  C -->|"detect stalls, buffer metrics"| D["CNPG database"]
  D -->|"explicit public schema"| E["metrics tables"]
  E -->|"query device metrics"| F["API endpoints"]
  F -->|"null-safe transform"| G["web UI"]

File Walkthrough

Relevant files
Enhancement
6 files
service.go
Add memory metrics collection from gopsutil                           
+16/-1   
flush.go
Add logging for sysmon buffer flush operations                     
+24/-0   
metrics.go
Implement sysmon stall detection and logging                         
+153/-6 
server.go
Initialize sysmon stall tracking state map                             
+1/-0     
types.go
Define sysmonStreamState for stall tracking                           
+8/-0     
metrics.go
Add logging and schema qualification for sysmon storage   
+15/-0   
Tests
3 files
service_test.go
Validate memory metrics in response payload                           
+9/-0     
metrics_test.go
Test sysmon stall event emission after empty payloads       
+53/-0   
agent_poller_test.go
Test host identity preservation during payload enrichment
+26/-0   
Bug fix
5 files
sysmon.go
Qualify metrics tables with public schema                               
+10/-12 
cnpg_metrics.go
Fix batch error handling and add schema qualification       
+49/-10 
agent_poller.go
Preserve service-reported host identity in enrichment       
+32/-2   
route.ts
Add null-safety checks and optional chaining operators     
+13/-10 
route.ts
Add null-safety checks and optional chaining operators     
+13/-10 
Configuration changes
2 files
BUILD.bazel
Restrict sysmonvm alias to macOS platform                               
+1/-0     
BUILD.bazel
Restrict sysmon-vm binary to macOS platform                           
+2/-0     
Dependencies
1 files
BUILD.bazel
Add mem dependency and macOS platform constraint                 
+3/-0     
Documentation
3 files
proposal.md
Document sysmon metrics availability fix proposal               
+15/-0   
spec.md
Define sysmon metrics availability requirements                   
+22/-0   
tasks.md
Track investigation and fix implementation tasks                 
+11/-0   

Imported from GitHub pull request. Original GitHub pull request: #2045 Original author: @mfreeman451 Original URL: https://github.com/carverauto/serviceradar/pull/2045 Original created: 2025-12-03T04:08:32Z Original updated: 2025-12-03T04:10:26Z Original head: carverauto/serviceradar:2042-bugsysmon-metrics-not-available-from-sysmon-vm-collection Original base: main Original merged: 2025-12-03T04:10:03Z by @mfreeman451 --- ### **User description** ## IMPORTANT: Please sign the Developer Certificate of Origin Thank you for your contribution to ServiceRadar. Please note, when contributing, the developer must include a [DCO sign-off statement]( https://developercertificate.org/) indicating the DCO acceptance in one commit message. Here is an example DCO Signed-off-by line in a commit message: ``` Signed-off-by: J. Doe <j.doe@domain.com> ``` ## Describe your changes ## Issue ticket number and link ## Code checklist before requesting a review - [ ] I have signed the DCO? - [ ] The build completes without errors? - [ ] All tests are passing when running make test? ___ ### **PR Type** Bug fix, Enhancement ___ ### **Description** - Fix sysmon-vm metrics pipeline to persist CPU/memory data from collectors to CNPG and API endpoints - Add memory metrics collection support and stall detection for sysmon collectors - Explicitly qualify all metrics table names with `public` schema to prevent Apache AGE catalog conflicts - Improve error handling in CNPG batch operations and add comprehensive logging for debugging - Preserve service-reported host identity in payload enrichment to avoid overwriting collector metadata - Add null-safety checks in web API routes to return empty arrays instead of 500 errors ___ ### Diagram Walkthrough ```mermaid flowchart LR A["sysmon-vm collector"] -->|"emit CPU/memory metrics"| B["poller agent"] B -->|"parse & enrich payload"| C["core server"] C -->|"detect stalls, buffer metrics"| D["CNPG database"] D -->|"explicit public schema"| E["metrics tables"] E -->|"query device metrics"| F["API endpoints"] F -->|"null-safe transform"| G["web UI"] ``` <details> <summary><h3> File Walkthrough</h3></summary> <table><thead><tr><th></th><th align="left">Relevant files</th></tr></thead><tbody><tr><td><strong>Enhancement</strong></td><td><details><summary>6 files</summary><table> <tr> <td><strong>service.go</strong><dd><code>Add memory metrics collection from gopsutil</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2045/files#diff-02211c03421188a7a92d44ce12a4ee8d9f09b224d7176b8713fcefac759ee989">+16/-1</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>flush.go</strong><dd><code>Add logging for sysmon buffer flush operations</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2045/files#diff-4ad1cf4b0397bd277cd9c0553e4992ad7c9e9484d49d49d48af015958dc4bf4b">+24/-0</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>metrics.go</strong><dd><code>Implement sysmon stall detection and logging</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2045/files#diff-6d98e853ce17576c088e77956ae4ecfa8078019e0bff107a79d8d1d6ed2443ad">+153/-6</a>&nbsp; </td> </tr> <tr> <td><strong>server.go</strong><dd><code>Initialize sysmon stall tracking state map</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2045/files#diff-a12d34beb8640efbbedad28f8610fd58afeea457572d82e4fd295ab2a6bb8ee6">+1/-0</a>&nbsp; &nbsp; &nbsp; </td> </tr> <tr> <td><strong>types.go</strong><dd><code>Define sysmonStreamState for stall tracking</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2045/files#diff-717f128517472d3dc4091e67bfc0f4fe4c36e32096c5ef87d78f34cbc64d2399">+8/-0</a>&nbsp; &nbsp; &nbsp; </td> </tr> <tr> <td><strong>metrics.go</strong><dd><code>Add logging and schema qualification for sysmon storage</code>&nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2045/files#diff-2cf937853ac86eb9a879bd370cbe22b835ee0c48be6771b48b0276d93cc2ae93">+15/-0</a>&nbsp; &nbsp; </td> </tr> </table></details></td></tr><tr><td><strong>Tests</strong></td><td><details><summary>3 files</summary><table> <tr> <td><strong>service_test.go</strong><dd><code>Validate memory metrics in response payload</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2045/files#diff-651c156758ec9bf449b0ec538346fc33096dfd6b4de52efed825c8b161564435">+9/-0</a>&nbsp; &nbsp; &nbsp; </td> </tr> <tr> <td><strong>metrics_test.go</strong><dd><code>Test sysmon stall event emission after empty payloads</code>&nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2045/files#diff-2b0ce06b068be7b4418c3fe4d23e3b8bf536d0cc2b9201a6949976e298a9e95e">+53/-0</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>agent_poller_test.go</strong><dd><code>Test host identity preservation during payload enrichment</code></dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2045/files#diff-3b72f1c80958942904f082df21f74bb67bde710581174e9e0aa1dba82bd7971a">+26/-0</a>&nbsp; &nbsp; </td> </tr> </table></details></td></tr><tr><td><strong>Bug fix</strong></td><td><details><summary>5 files</summary><table> <tr> <td><strong>sysmon.go</strong><dd><code>Qualify metrics tables with public schema</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2045/files#diff-4879e8b367ee55413c6cb006f23bb912261df5310c640d3af352033cf97ffc84">+10/-12</a>&nbsp; </td> </tr> <tr> <td><strong>cnpg_metrics.go</strong><dd><code>Fix batch error handling and add schema qualification</code>&nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2045/files#diff-8ca23b1b2b298b6eec9af9e23be2c7cb9caa65301f514c013ebbc5b52e933a23">+49/-10</a>&nbsp; </td> </tr> <tr> <td><strong>agent_poller.go</strong><dd><code>Preserve service-reported host identity in enrichment</code>&nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2045/files#diff-7cdaae416a1e7b5d39ad75f138c49fa0aca12fc38e8ab0b26bb59c64446da15c">+32/-2</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>route.ts</strong><dd><code>Add null-safety checks and optional chaining operators</code>&nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2045/files#diff-565dceead2e723f416ae3cee24e62a1e93364ce67237233e01b4a26c09bf8ba5">+13/-10</a>&nbsp; </td> </tr> <tr> <td><strong>route.ts</strong><dd><code>Add null-safety checks and optional chaining operators</code>&nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2045/files#diff-8b59229cfa9e8e8f66e10be9569364853176d09da7cf0b4c987c85225da7fa55">+13/-10</a>&nbsp; </td> </tr> </table></details></td></tr><tr><td><strong>Configuration changes</strong></td><td><details><summary>2 files</summary><table> <tr> <td><strong>BUILD.bazel</strong><dd><code>Restrict sysmonvm alias to macOS platform</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2045/files#diff-884fa9353a5226345e44fbabea3300efc7a87dfbcde0b6a42521ca51823f1b68">+1/-0</a>&nbsp; &nbsp; &nbsp; </td> </tr> <tr> <td><strong>BUILD.bazel</strong><dd><code>Restrict sysmon-vm binary to macOS platform</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2045/files#diff-6f29400e46ef3dfa4e1d5e3ec41ebfeceb9628f6900f815f5c9bc8166ccc63a9">+2/-0</a>&nbsp; &nbsp; &nbsp; </td> </tr> </table></details></td></tr><tr><td><strong>Dependencies</strong></td><td><details><summary>1 files</summary><table> <tr> <td><strong>BUILD.bazel</strong><dd><code>Add mem dependency and macOS platform constraint</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2045/files#diff-1ddcd985760c56dd63f684cae82510232439e902c9fbbb8ecef91f8eb19fd1a1">+3/-0</a>&nbsp; &nbsp; &nbsp; </td> </tr> </table></details></td></tr><tr><td><strong>Documentation</strong></td><td><details><summary>3 files</summary><table> <tr> <td><strong>proposal.md</strong><dd><code>Document sysmon metrics availability fix proposal</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2045/files#diff-ad3bf3baec592ea2000de51abcd43628efd97c1524956d0580562e54b56515ad">+15/-0</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>spec.md</strong><dd><code>Define sysmon metrics availability requirements</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2045/files#diff-239ceb7ed8d7185674ebc80a40aab82ea6955a159999df3ec7736870a416fac3">+22/-0</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>tasks.md</strong><dd><code>Track investigation and fix implementation tasks</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2045/files#diff-355450ac691dae6d2c3167233558588ea0e0ab969b5266044d4226b00e61ddab">+11/-0</a>&nbsp; &nbsp; </td> </tr> </table></details></td></tr></tr></tbody></table> </details> ___
qodo-code-review[bot] commented 2025-12-03 04:09:09 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2045#issuecomment-3605008656
Original created: 2025-12-03T04:09:09Z

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
Information disclosure via response semantics

Description: The endpoint now always returns 200 with metrics (including when empty), potentially
exposing device existence and metric type availability to unauthorized callers if upstream
auth/authorization is lax.
sysmon.go [365-384]

Referred Code
metrics, err := fetcher(ctx, metricsProvider, deviceID, startTime, endTime)
if err != nil {
	s.logger.Error().Err(err).Str("metric_type", metricType).Str("device_id", deviceID).Msg("Error fetching metrics")
	writeError(w, "Internal server error", http.StatusInternalServerError)

	return
}

metricsValue := reflect.ValueOf(metrics)
count := 0
if metricsValue.Kind() == reflect.Slice {
	count = metricsValue.Len()
}

s.logger.Debug().Int("count", count).Str("metric_type", metricType).Str("device_id", deviceID).Msg("Fetched metrics")

w.Header().Set("Content-Type", "application/json")

if err := json.NewEncoder(w).Encode(metrics); err != nil {
	s.logger.Error().Err(err).Str("metric_type", metricType).Str("device_id", deviceID).Msg("Error encoding metrics response")
Ticket Compliance
🟡
🎫 #2042
🟢 Restore availability of sysmon-vm metrics so device views/API show data again.
Ensure CPU and memory metrics are collected, persisted to CNPG, and exposed via API/UI.
Prevent empty metrics from causing 404/500; return 200 with empty arrays/null-safe
handling.
Detect and surface stalled sysmon streams when collectors are connected but no metrics
arrive.
Maintain correct device attribution for metrics (do not overwrite service-reported host
identity).
Add logging/diagnostics for CNPG write/read paths to aid troubleshooting.
Provide regression tests for sysmon pipeline behaviors.
End-to-end verification that metrics appear in UI within one polling interval across
darwin/arm64 deployment.
Manual validation that CNPG contains expected rows and that UI charts render as intended
for all metric types.
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢
Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status:
Limited auditing: New CNPG write-path and sysmon flush instrumentation add logs but do not clearly ensure
auditable records of critical actions (who/what/when/outcome) across all critical
operations.

Referred Code
}

if err := s.DB.StoreSysmonMetrics(
	ctx, pollerID, agentID, hostID, partition, hostIP, deviceID, metric, ts); err != nil {
	s.logger.Error().
		Err(err).
		Str("poller_id", pollerID).
		Msg("Failed to flush sysmon metrics")
} else {
	hasMemory := metric.Memory != nil && (metric.Memory.TotalBytes > 0 || metric.Memory.UsedBytes > 0)
	s.logger.Info().
		Str("poller_id", pollerID).
		Str("device_id", deviceID).
		Int("cpu_count", len(metric.CPUs)).
		Int("disk_count", len(metric.Disks)).
		Int("process_count", len(metric.Processes)).
		Bool("has_memory", hasMemory).
		Msg("Flushed sysmon metrics to database")
}

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status:
Error detail echo: API responses include raw error details in JSON which may expose internal information to
end users depending on routing exposure.

Referred Code
    return NextResponse.json(transformedData);
} catch (error) {
    console.error(`Error fetching Sysmon ${metric} data for device ${deviceId}:`, error);
    return NextResponse.json(
        { error: `Internal server error while fetching Sysmon ${metric} data`, details: error},
        { status: 500 },
    );

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status:
Verbose logging: Added info/error logs include device identifiers and potentially large counts; while
useful, review is needed to ensure no sensitive data or excessive verbosity at info level.

Referred Code
func (db *DB) cnpgInsertCPUMetrics(
	ctx context.Context,
	pollerID, agentID, hostID, deviceID, partition string,
	cpus []models.CPUMetric,
	timestamp time.Time,
) error {
	if len(cpus) == 0 || !db.useCNPGWrites() {
		db.logger.Info().
			Int("cpu_count", len(cpus)).
			Bool("cnpg_writes", db.useCNPGWrites()).
			Msg("cnpgInsertCPUMetrics: skipping (no cpus or writes disabled)")
		return nil
	}

	db.logger.Info().
		Str("poller_id", pollerID).
		Str("device_id", deviceID).
		Int("cpu_count", len(cpus)).
		Time("timestamp", timestamp).
		Msg("cnpgInsertCPUMetrics: building batch")



 ... (clipped 31 lines)

Learn more about managing compliance generic rules or creating your own custom rules

Compliance status legend 🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
- Requires Further Human Verification
🏷️ - Compliance label
Imported GitHub PR comment. Original author: @qodo-code-review[bot] Original URL: https://github.com/carverauto/serviceradar/pull/2045#issuecomment-3605008656 Original created: 2025-12-03T04:09:09Z --- ## PR Compliance Guide 🔍 <!-- https://github.com/carverauto/serviceradar/commit/8e84657ba47ac987af2023a95c0e61f1860718f6 --> Below is a summary of compliance checks for this PR:<br> <table><tbody><tr><td colspan='2'><strong>Security Compliance</strong></td></tr> <tr><td rowspan=1>⚪</td> <td><details><summary><strong>Information disclosure via response semantics</strong></summary><br> <b>Description:</b> The endpoint now always returns 200 with metrics (including when empty), potentially <br>exposing device existence and metric type availability to unauthorized callers if upstream <br>auth/authorization is lax.<br> <strong><a href='https://github.com/carverauto/serviceradar/pull/2045/files#diff-4879e8b367ee55413c6cb006f23bb912261df5310c640d3af352033cf97ffc84R365-R384'>sysmon.go [365-384]</a></strong><br> <details open><summary>Referred Code</summary> ```go metrics, err := fetcher(ctx, metricsProvider, deviceID, startTime, endTime) if err != nil { s.logger.Error().Err(err).Str("metric_type", metricType).Str("device_id", deviceID).Msg("Error fetching metrics") writeError(w, "Internal server error", http.StatusInternalServerError) return } metricsValue := reflect.ValueOf(metrics) count := 0 if metricsValue.Kind() == reflect.Slice { count = metricsValue.Len() } s.logger.Debug().Int("count", count).Str("metric_type", metricType).Str("device_id", deviceID).Msg("Fetched metrics") w.Header().Set("Content-Type", "application/json") if err := json.NewEncoder(w).Encode(metrics); err != nil { s.logger.Error().Err(err).Str("metric_type", metricType).Str("device_id", deviceID).Msg("Error encoding metrics response") ``` </details></details></td></tr> <tr><td colspan='2'><strong>Ticket Compliance</strong></td></tr> <tr><td>🟡</td> <td> <details> <summary>🎫 <a href=https://github.com/carverauto/serviceradar/issues/2042>#2042</a></summary> <table width='100%'><tbody> <tr><td rowspan=7>🟢</td> <td>Restore availability of sysmon-vm metrics so device views/API show data again.</td></tr> <tr><td>Ensure CPU and memory metrics are collected, persisted to CNPG, and exposed via API/UI.</td></tr> <tr><td>Prevent empty metrics from causing 404/500; return 200 with empty arrays/null-safe <br>handling.</td></tr> <tr><td>Detect and surface stalled sysmon streams when collectors are connected but no metrics <br>arrive.</td></tr> <tr><td>Maintain correct device attribution for metrics (do not overwrite service-reported host <br>identity).</td></tr> <tr><td>Add logging/diagnostics for CNPG write/read paths to aid troubleshooting.</td></tr> <tr><td>Provide regression tests for sysmon pipeline behaviors.</td></tr> <tr><td rowspan=2>⚪</td> <td>End-to-end verification that metrics appear in UI within one polling interval across <br>darwin/arm64 deployment.</td></tr> <tr><td>Manual validation that CNPG contains expected rows and that UI charts render as intended <br>for all metric types.</td></tr> </tbody></table> </details> </td></tr> <tr><td colspan='2'><strong>Codebase Duplication Compliance</strong></td></tr> <tr><td>⚪</td><td><details><summary><strong>Codebase context is not defined </strong></summary> Follow the <a href='https://qodo-merge-docs.qodo.ai/core-abilities/rag_context_enrichment/'>guide</a> to enable codebase context checks. </details></td></tr> <tr><td colspan='2'><strong>Custom Compliance</strong></td></tr> <tr><td rowspan=3>🟢</td><td> <details><summary><strong>Generic: Meaningful Naming and Self-Documenting Code</strong></summary><br> **Objective:** Ensure all identifiers clearly express their purpose and intent, making code <br>self-documenting<br> **Status:** Passed<br> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td> <details><summary><strong>Generic: Robust Error Handling and Edge Case Management</strong></summary><br> **Objective:** Ensure comprehensive error handling that provides meaningful context and graceful <br>degradation<br> **Status:** Passed<br> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td> <details><summary><strong>Generic: Security-First Input Validation and Data Handling</strong></summary><br> **Objective:** Ensure all data inputs are validated, sanitized, and handled securely to prevent <br>vulnerabilities<br> **Status:** Passed<br> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td rowspan=3>⚪</td> <td><details> <summary><strong>Generic: Comprehensive Audit Trails</strong></summary><br> **Objective:** To create a detailed and reliable record of critical system actions for security analysis <br>and compliance.<br> **Status:** <br><a href='https://github.com/carverauto/serviceradar/pull/2045/files#diff-4ad1cf4b0397bd277cd9c0553e4992ad7c9e9484d49d49d48af015958dc4bf4bR486-R504'><strong>Limited auditing</strong></a>: New CNPG write-path and sysmon flush instrumentation add logs but do not clearly ensure <br>auditable records of critical actions (who/what/when/outcome) across all critical <br>operations.<br> <details open><summary>Referred Code</summary> ```go } if err := s.DB.StoreSysmonMetrics( ctx, pollerID, agentID, hostID, partition, hostIP, deviceID, metric, ts); err != nil { s.logger.Error(). Err(err). Str("poller_id", pollerID). Msg("Failed to flush sysmon metrics") } else { hasMemory := metric.Memory != nil && (metric.Memory.TotalBytes > 0 || metric.Memory.UsedBytes > 0) s.logger.Info(). Str("poller_id", pollerID). Str("device_id", deviceID). Int("cpu_count", len(metric.CPUs)). Int("disk_count", len(metric.Disks)). Int("process_count", len(metric.Processes)). Bool("has_memory", hasMemory). Msg("Flushed sysmon metrics to database") } ``` </details> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td><details> <summary><strong>Generic: Secure Error Handling</strong></summary><br> **Objective:** To prevent the leakage of sensitive system information through error messages while <br>providing sufficient detail for internal debugging.<br> **Status:** <br><a href='https://github.com/carverauto/serviceradar/pull/2045/files#diff-565dceead2e723f416ae3cee24e62a1e93364ce67237233e01b4a26c09bf8ba5R121-R127'><strong>Error detail echo</strong></a>: API responses include raw error details in JSON which may expose internal information to <br>end users depending on routing exposure.<br> <details open><summary>Referred Code</summary> ```typescript return NextResponse.json(transformedData); } catch (error) { console.error(`Error fetching Sysmon ${metric} data for device ${deviceId}:`, error); return NextResponse.json( { error: `Internal server error while fetching Sysmon ${metric} data`, details: error}, { status: 500 }, ); ``` </details> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td><details> <summary><strong>Generic: Secure Logging Practices</strong></summary><br> **Objective:** To ensure logs are useful for debugging and auditing without exposing sensitive <br>information like PII, PHI, or cardholder data.<br> **Status:** <br><a href='https://github.com/carverauto/serviceradar/pull/2045/files#diff-8ca23b1b2b298b6eec9af9e23be2c7cb9caa65301f514c013ebbc5b52e933a23R165-R216'><strong>Verbose logging</strong></a>: Added info/error logs include device identifiers and potentially large counts; while <br>useful, review is needed to ensure no sensitive data or excessive verbosity at info level.<br> <details open><summary>Referred Code</summary> ```go func (db *DB) cnpgInsertCPUMetrics( ctx context.Context, pollerID, agentID, hostID, deviceID, partition string, cpus []models.CPUMetric, timestamp time.Time, ) error { if len(cpus) == 0 || !db.useCNPGWrites() { db.logger.Info(). Int("cpu_count", len(cpus)). Bool("cnpg_writes", db.useCNPGWrites()). Msg("cnpgInsertCPUMetrics: skipping (no cpus or writes disabled)") return nil } db.logger.Info(). Str("poller_id", pollerID). Str("device_id", deviceID). Int("cpu_count", len(cpus)). Time("timestamp", timestamp). Msg("cnpgInsertCPUMetrics: building batch") ... (clipped 31 lines) ``` </details> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td align="center" colspan="2"> <!-- placeholder --> <!-- /compliance --update_compliance=true --> </td></tr></tbody></table> <details><summary>Compliance status legend</summary> 🟢 - Fully Compliant<br> 🟡 - Partial Compliant<br> 🔴 - Not Compliant<br> ⚪ - Requires Further Human Verification<br> 🏷️ - Compliance label<br> </details>
qodo-code-review[bot] commented 2025-12-03 04:10:26 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2045#issuecomment-3605010947
Original created: 2025-12-03T04:10:26Z

PR Code Suggestions

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Fix race condition in stall detection

Fix a race condition in noteSysmonStall by holding the lock for the entire
duration of the function, ensuring atomic access and modification of the shared
sysmonStreamState.

pkg/core/metrics.go [1302-1345]

 func (s *Server) noteSysmonStall(
 	ctx context.Context,
 	deviceID, hostIP, hostID, agentID, pollerID, reason string,
 	when time.Time,
 ) {
 	if s == nil || deviceID == "" {
 		return
 	}
 
 	s.sysmonStallMu.Lock()
+	defer s.sysmonStallMu.Unlock()
+
 	if s.sysmonStall == nil {
 		s.sysmonStall = make(map[string]*sysmonStreamState)
 	}
 
 	state := s.sysmonStall[deviceID]
 	if state == nil {
 		state = &sysmonStreamState{}
 		s.sysmonStall[deviceID] = state
 	}
 
 	state.consecutiveEmpty++
 	consecutive := state.consecutiveEmpty
 	lastSuccess := state.lastSuccess
-	lastAlert := state.lastAlert
-	s.sysmonStallMu.Unlock()
 
 	if consecutive < sysmonStallPollThreshold {
 		return
 	}
 
 	// Avoid flooding logs/events; only emit when we have been stalled for 5+ polls and not recently alerted.
-	if !lastAlert.IsZero() && when.Sub(lastAlert) < defaultFlushInterval {
+	if !state.lastAlert.IsZero() && when.Sub(state.lastAlert) < defaultFlushInterval {
 		return
 	}
 
 	if !lastSuccess.IsZero() && when.Sub(lastSuccess) < time.Duration(sysmonStallPollThreshold)*defaultFlushInterval {
 		return
 	}
 
-	s.sysmonStallMu.Lock()
 	state.lastAlert = when
-	s.sysmonStallMu.Unlock()
 
 	if s.logger != nil {
 ...

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 9

__

Why: The suggestion correctly identifies a race condition in the newly added noteSysmonStall function, which could lead to false positive stall alerts. Fixing this is critical for the correctness of the new feature.

High
High-level
Restrict sysmon-vm checker to macOS

The sysmon-vm checker has been restricted to build only on macOS. This change
should be reviewed to confirm it is intentional, as it prevents the checker from
being used on other operating systems like Linux or Windows.

Examples:

alias/BUILD.bazel [28]
    target_compatible_with = ["@platforms//os:macos"],
cmd/checkers/sysmon-vm/BUILD.bazel [9]
    target_compatible_with = ["@platforms//os:macos"],

Solution Walkthrough:

Before:

# alias/BUILD.bazel
alias(
    name = "sysmonvm",
    actual = "//pkg/checker/sysmonvm:sysmonvm",
)

# cmd/checkers/sysmon-vm/BUILD.bazel
go_binary(
    name = "sysmon-vm",
    embed = [":sysmon_vm_lib"],
)

After:

# alias/BUILD.bazel
alias(
    name = "sysmonvm",
    actual = "//pkg/checker/sysmonvm:sysmonvm",
    target_compatible_with = ["@platforms//os:macos"],
)

# cmd/checkers/sysmon-vm/BUILD.bazel
go_binary(
    name = "sysmon-vm",
    embed = [":sysmon_vm_lib"],
    target_compatible_with = ["@platforms//os:macos"],
)

Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies that the sysmon-vm checker is now restricted to macOS, which is a significant architectural change not mentioned in the PR description and could be an unintended side effect of fixing a macOS-specific bug.

Medium
General
Remove redundant IP fields from payload

Remove redundant host_ip and ip fields from the top-level payload in
enrichPayloadWithHost, adding them only to the nested status object to avoid
data duplication.

pkg/poller/agent_poller.go [517-522]

 	if ip != "" && !hasIP(statusNode) && !hasIP(payload) {
 		statusNode["host_ip"] = ip
 		statusNode["ip"] = ip
-		payload["host_ip"] = ip
-		payload["ip"] = ip
 	}
  • Apply / Chat
Suggestion importance[1-10]: 4

__

Why: The suggestion correctly points out data redundancy in the JSON payload, and removing it would be a good practice for maintainability and consistency, but it is a minor improvement.

Low
  • More
Imported GitHub PR comment. Original author: @qodo-code-review[bot] Original URL: https://github.com/carverauto/serviceradar/pull/2045#issuecomment-3605010947 Original created: 2025-12-03T04:10:26Z --- ## PR Code Suggestions ✨ <!-- 8e84657 --> Explore these optional code suggestions: <table><thead><tr><td><strong>Category</strong></td><td align=left><strong>Suggestion&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </strong></td><td align=center><strong>Impact</strong></td></tr><tbody><tr><td rowspan=1>Possible issue</td> <td> <details><summary>Fix race condition in stall detection</summary> ___ **Fix a race condition in <code>noteSysmonStall</code> by holding the lock for the entire <br>duration of the function, ensuring atomic access and modification of the shared <br><code>sysmonStreamState</code>.** [pkg/core/metrics.go [1302-1345]](https://github.com/carverauto/serviceradar/pull/2045/files#diff-6d98e853ce17576c088e77956ae4ecfa8078019e0bff107a79d8d1d6ed2443adR1302-R1345) ```diff func (s *Server) noteSysmonStall( ctx context.Context, deviceID, hostIP, hostID, agentID, pollerID, reason string, when time.Time, ) { if s == nil || deviceID == "" { return } s.sysmonStallMu.Lock() + defer s.sysmonStallMu.Unlock() + if s.sysmonStall == nil { s.sysmonStall = make(map[string]*sysmonStreamState) } state := s.sysmonStall[deviceID] if state == nil { state = &sysmonStreamState{} s.sysmonStall[deviceID] = state } state.consecutiveEmpty++ consecutive := state.consecutiveEmpty lastSuccess := state.lastSuccess - lastAlert := state.lastAlert - s.sysmonStallMu.Unlock() if consecutive < sysmonStallPollThreshold { return } // Avoid flooding logs/events; only emit when we have been stalled for 5+ polls and not recently alerted. - if !lastAlert.IsZero() && when.Sub(lastAlert) < defaultFlushInterval { + if !state.lastAlert.IsZero() && when.Sub(state.lastAlert) < defaultFlushInterval { return } if !lastSuccess.IsZero() && when.Sub(lastSuccess) < time.Duration(sysmonStallPollThreshold)*defaultFlushInterval { return } - s.sysmonStallMu.Lock() state.lastAlert = when - s.sysmonStallMu.Unlock() if s.logger != nil { ... ``` `[To ensure code accuracy, apply this suggestion manually]` <details><summary>Suggestion importance[1-10]: 9</summary> __ Why: The suggestion correctly identifies a race condition in the newly added `noteSysmonStall` function, which could lead to false positive stall alerts. Fixing this is critical for the correctness of the new feature. </details></details></td><td align=center>High </td></tr><tr><td rowspan=1>High-level</td> <td> <details><summary>Restrict sysmon-vm checker to macOS</summary> ___ **The <code>sysmon-vm</code> checker has been restricted to build only on macOS. This change <br>should be reviewed to confirm it is intentional, as it prevents the checker from <br>being used on other operating systems like Linux or Windows.** ### Examples: <details> <summary> <a href="https://github.com/carverauto/serviceradar/pull/2045/files#diff-884fa9353a5226345e44fbabea3300efc7a87dfbcde0b6a42521ca51823f1b68R28-R28">alias/BUILD.bazel [28]</a> </summary> ```starlark target_compatible_with = ["@platforms//os:macos"], ``` </details> <details> <summary> <a href="https://github.com/carverauto/serviceradar/pull/2045/files#diff-6f29400e46ef3dfa4e1d5e3ec41ebfeceb9628f6900f815f5c9bc8166ccc63a9R9-R9">cmd/checkers/sysmon-vm/BUILD.bazel [9]</a> </summary> ```starlark target_compatible_with = ["@platforms//os:macos"], ``` </details> ### Solution Walkthrough: #### Before: ```starlark # alias/BUILD.bazel alias( name = "sysmonvm", actual = "//pkg/checker/sysmonvm:sysmonvm", ) # cmd/checkers/sysmon-vm/BUILD.bazel go_binary( name = "sysmon-vm", embed = [":sysmon_vm_lib"], ) ``` #### After: ```starlark # alias/BUILD.bazel alias( name = "sysmonvm", actual = "//pkg/checker/sysmonvm:sysmonvm", target_compatible_with = ["@platforms//os:macos"], ) # cmd/checkers/sysmon-vm/BUILD.bazel go_binary( name = "sysmon-vm", embed = [":sysmon_vm_lib"], target_compatible_with = ["@platforms//os:macos"], ) ``` <details><summary>Suggestion importance[1-10]: 7</summary> __ Why: The suggestion correctly identifies that the `sysmon-vm` checker is now restricted to macOS, which is a significant architectural change not mentioned in the PR description and could be an unintended side effect of fixing a macOS-specific bug. </details></details></td><td align=center>Medium </td></tr><tr><td rowspan=1>General</td> <td> <details><summary>Remove redundant IP fields from payload</summary> ___ **Remove redundant <code>host_ip</code> and <code>ip</code> fields from the top-level payload in <br><code>enrichPayloadWithHost</code>, adding them only to the nested <code>status</code> object to avoid <br>data duplication.** [pkg/poller/agent_poller.go [517-522]](https://github.com/carverauto/serviceradar/pull/2045/files#diff-7cdaae416a1e7b5d39ad75f138c49fa0aca12fc38e8ab0b26bb59c64446da15cR517-R522) ```diff if ip != "" && !hasIP(statusNode) && !hasIP(payload) { statusNode["host_ip"] = ip statusNode["ip"] = ip - payload["host_ip"] = ip - payload["ip"] = ip } ``` - [ ] **Apply / Chat** <!-- /improve --apply_suggestion=2 --> <details><summary>Suggestion importance[1-10]: 4</summary> __ Why: The suggestion correctly points out data redundancy in the JSON payload, and removing it would be a good practice for maintainability and consistency, but it is a minor improvement. </details></details></td><td align=center>Low </td></tr> <tr><td align="center" colspan="2"> - [ ] More <!-- /improve --more_suggestions=true --> </td><td></td></tr></tbody></table>
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
carverauto/serviceradar!2497
No description provided.