sysmon/results router work #2670

Merged
mfreeman451 merged 15 commits from refs/pull/2670/head into staging 2026-01-14 21:31:53 +00:00
mfreeman451 commented 2026-01-14 17:40:41 +00:00 (Migrated from github.com)
Owner

Imported from GitHub pull request.

Original GitHub pull request: #2299
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/pull/2299
Original created: 2026-01-14T17:40:41Z
Original updated: 2026-01-14T21:31:55Z
Original head: carverauto/serviceradar:updates/sysmon-metrics-ingestion
Original base: staging
Original merged: 2026-01-14T21:31:53Z by @mfreeman451

User description

IMPORTANT: Please sign the Developer Certificate of Origin

Thank you for your contribution to ServiceRadar. Please note, when contributing, the developer must include
a DCO sign-off statement indicating the DCO acceptance in one commit message. Here
is an example DCO Signed-off-by line in a commit message:

Signed-off-by: J. Doe <j.doe@domain.com>

Describe your changes

Code checklist before requesting a review

  • I have signed the DCO?
  • The build completes without errors?
  • All tests are passing when running make test?

PR Type

Enhancement, Documentation


Description

  • Add OpenSpec proposal for sysmon metrics ingestion via gRPC pipeline

  • Define requirements for persisting CPU, memory, disk, and process metrics to tenant hypertables

  • Specify device identifier resolution with safe fallbacks for missing linkages

  • Document payload size handling and implementation tasks for metrics ingestor


Diagram Walkthrough

flowchart LR
  A["gRPC Status Updates"] -->|"sysmon-metrics payload"| B["Agent Gateway"]
  B -->|"forward to core"| C["Core Status Handler"]
  C -->|"resolve device ID"| D["Device Identifier"]
  D -->|"insert metrics"| E["Tenant CNPG Hypertables"]
  E -->|"cpu, memory, disk, process"| F["Device Charts Rendered"]

File Walkthrough

Relevant files
Documentation
proposal.md
Sysmon metrics ingestion proposal and rationale                   

openspec/changes/fix-sysmon-metrics-ingestion/proposal.md

  • Introduces proposal to fix sysmon metrics ingestion from gRPC pipeline
  • Outlines parsing of sysmon payloads and insertion into tenant-scoped
    hypertables
  • Specifies device identifier resolution with fallback handling
  • Documents payload size allowance and Ash resource mapping requirements
+14/-0   
spec.md
Edge architecture requirements for sysmon ingestion           

openspec/changes/fix-sysmon-metrics-ingestion/specs/edge-architecture/spec.md

  • Defines requirement for ingesting sysmon metrics into five
    tenant-scoped hypertables
  • Specifies scenario for metrics persistence with device identifier
    resolution
  • Documents fallback behavior when device mapping is unavailable
  • Establishes payload size handling requirements for sysmon-metrics
    messages
+24/-0   
tasks.md
Implementation tasks for sysmon metrics ingestion               

openspec/changes/fix-sysmon-metrics-ingestion/tasks.md

  • Lists six implementation tasks for sysmon metrics ingestion feature
  • Includes spec updates, Ash resource creation, and ingestor development
  • Specifies integration with StatusHandler and gateway normalization
  • Includes testing requirements for metrics mapping and payload handling
+7/-0     

qodo-code-review[bot] commented 2026-01-14 17:41:04 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2299#issuecomment-3750797655
Original created: 2026-01-14T17:41:04Z

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
🟢
No security concerns identified: no security vulnerabilities detected by AI analysis. Human verification advised for critical code.
Ticket Compliance
⚪
🎫 No ticket provided
  • Create ticket/issue
Codebase Duplication Compliance
⚪
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢
Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status:
Logging unspecified: The proposal/spec describes ingestion that previously "silently drops data" but
does not specify audit/logging requirements for ingestion success/failure outcomes.

Referred Code
Sysmon metrics streamed from agents over gRPC are not being persisted to tenant CNPG hypertables, so device details render without sysmon charts and the metrics pipeline silently drops data.

## What Changes
- Parse sysmon metrics payloads forwarded via gRPC status updates and insert CPU, CPU cluster, memory, disk, and process metrics into tenant-scoped hypertables using Ash bulk creates.
- Resolve the device identifier from the agent record when available, with safe fallbacks if the device linkage is missing.
- Permit larger `sysmon-metrics` payloads in the agent gateway to avoid truncation.
- Add an Ash resource mapping for `cpu_cluster_metrics` (if needed for ingestion parity).

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status:
Error handling unspecified: The spec requires ingestion to not fail on missing device linkage but does not define
expected handling/reporting for other ingestion failure points (parse errors, DB insert
failures, malformed payloads).

Referred Code
#### Scenario: Device mapping unavailable
- **GIVEN** an agent streams a `sysmon-metrics` status payload but has no linked device record
- **WHEN** the gateway forwards the status update to core
- **THEN** core SHALL ingest the metrics with a safe fallback device identifier or leave it null
- **AND** the ingest SHALL NOT fail due to missing device linkage

Learn more about managing compliance generic rules or creating your own custom rules
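The "device mapping unavailable" scenario above can be sketched as a small resolution helper. This is a hypothetical illustration of the fallback rule the spec describes (prefer the linked device, otherwise derive a safe identifier from the agent, otherwise leave it empty/null); the names and the `agent:` prefix are assumptions, not the project's actual API.

```go
package main

import "fmt"

// resolveDeviceID sketches the spec's fallback rule: use the device linked
// to the agent when present, fall back to a deterministic agent-derived
// identifier, and never fail the ingest just because the linkage is missing.
func resolveDeviceID(agentID, linkedDeviceID string) string {
	if linkedDeviceID != "" {
		return linkedDeviceID // normal path: agent record carries a device link
	}
	if agentID != "" {
		return "agent:" + agentID // safe fallback identifier (illustrative format)
	}
	return "" // stored as null; ingestion still proceeds
}

func main() {
	fmt.Println(resolveDeviceID("agent-7", "dev-42")) // dev-42
	fmt.Println(resolveDeviceID("agent-7", ""))       // agent:agent-7
}
```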

Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status:
Validation unspecified: The spec defines acceptance of larger sysmon-metrics payloads but does not specify
validation/sanitization limits (schema validation, size ceilings, rate limiting) for
externally sourced gRPC payload data.

Referred Code
### Requirement: Sysmon payload size handling
The gateway SHALL accept `sysmon-metrics` payloads larger than the default status message limit and forward them without truncation.

#### Scenario: Large sysmon payload
- **GIVEN** a `sysmon-metrics` status payload larger than 4KB
- **WHEN** the gateway processes the message
- **THEN** the payload SHALL be accepted up to the configured sysmon limit
- **AND** the payload SHALL be forwarded to core intact

Learn more about managing compliance generic rules or creating your own custom rules
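The payload-size requirement quoted above amounts to a per-service size ceiling: `sysmon-metrics` gets a larger, configured limit while other status messages keep the default. A minimal sketch, assuming a 4KB default (the figure in the scenario) and an illustrative configured sysmon limit; the function names and values are not the project's actual gateway code:

```go
package main

import "fmt"

// maxPayloadBytes returns the size ceiling for a status message type.
// The 4KB default matches the scenario above; the sysmon limit is whatever
// the operator configures.
func maxPayloadBytes(serviceType string, sysmonLimit int) int {
	const defaultLimit = 4 * 1024
	if serviceType == "sysmon-metrics" {
		return sysmonLimit
	}
	return defaultLimit
}

// acceptPayload rejects payloads over the applicable ceiling instead of
// silently truncating them.
func acceptPayload(serviceType string, payload []byte, sysmonLimit int) error {
	if limit := maxPayloadBytes(serviceType, sysmonLimit); len(payload) > limit {
		return fmt.Errorf("%s payload of %d bytes exceeds limit %d", serviceType, len(payload), limit)
	}
	return nil
}

func main() {
	big := make([]byte, 16*1024)
	fmt.Println(acceptPayload("sysmon-metrics", big, 64*1024)) // accepted under sysmon limit
	fmt.Println(acceptPayload("icmp", big, 64*1024))           // rejected: over the 4KB default
}
```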

  • Update
Compliance status legend
🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label
qodo-code-review[bot] commented 2026-01-14 17:42:15 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2299#issuecomment-3750804381
Original created: 2026-01-14T17:42:15Z

PR Code Suggestions

Latest suggestions up to 2d46e4c

Category | Suggestion | Impact
Security
Prevent atom table exhaustion

Replace the unsafe String.to_atom/1 with String.to_existing_atom/1 inside a
try/rescue block. This prevents a potential denial-of-service attack caused by
atom table exhaustion from untrusted input.

elixir/serviceradar_core/lib/serviceradar/observability/sysmon_metrics_ingestor.ex [0]

 defp fetch_value(map, key) when is_map(map) do
   Map.get(map, key) ||
-    Map.get(map, String.to_atom(key))
+    (try do
+       Map.get(map, String.to_existing_atom(key))
+     rescue
+       ArgumentError -> nil
+     end)
 end

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 10


Why: The suggestion correctly identifies a critical security vulnerability (atom table exhaustion) that could lead to a denial-of-service attack, and proposes a robust fix using String.to_existing_atom/1.

High
Possible issue
Avoid skipping sweep results

To prevent data loss on transient failures, move the setSweepResultsSequence
call to after the StreamStatus call succeeds. Additionally, return false on a
StreamStatus error to correctly reflect the failed state.

pkg/agent/push_loop.go [0]

-if response.CurrentSequence != "" {
-	p.setSweepResultsSequence(response.CurrentSequence)
-}
+pendingSeq := response.CurrentSequence
 ...
 _, err = p.gateway.StreamStatus(pushCtx, statusChunks)
 if err != nil {
 	p.logger.Error().Err(err).Msg("Failed to stream sweep results to gateway")
-	return true
+	return false
 }
 
+if pendingSeq != "" {
+	p.setSweepResultsSequence(pendingSeq)
+}
+

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 9


Why: The suggestion fixes a critical bug where transient network errors could cause permanent data loss by updating a sequence number before ensuring the data was successfully sent.

High
Make chunking failure-tolerant

Instead of erroring on malformed large payloads that cannot be chunked, fall
back to sending the payload as a single chunk. This improves robustness by
ensuring data is still delivered.

pkg/agent/push_loop.go [0]

-func buildSweepResultsChunks(response *proto.ResultsResponse) ([]*proto.ResultsChunk, error) {
-	if response == nil {
-		return nil, nil
-	}
+var sweepData map[string]interface{}
+if err := json.Unmarshal(response.Data, &sweepData); err != nil {
+	// If we can't chunk safely, fall back to streaming as-is.
+	return []*proto.ResultsChunk{{
+		Data:            response.Data,
+		IsFinal:         true,
+		ChunkIndex:      0,
+		TotalChunks:     1,
+		CurrentSequence: response.CurrentSequence,
+		Timestamp:       response.Timestamp,
+	}}, nil
+}
 
-	if len(response.Data) == 0 {
-		return nil, nil
-	}
-...
-	hostsInterface, ok := sweepData["hosts"]
-	if !ok {
-		return nil, fmt.Errorf("sweep data missing hosts field")
-	}
+hostsInterface, ok := sweepData["hosts"]
+if !ok {
+	return []*proto.ResultsChunk{{
+		Data:            response.Data,
+		IsFinal:         true,
+		ChunkIndex:      0,
+		TotalChunks:     1,
+		CurrentSequence: response.CurrentSequence,
+		Timestamp:       response.Timestamp,
+	}}, nil
+}
 
-	hosts, ok := hostsInterface.([]interface{})
-	if !ok {
-		return nil, fmt.Errorf("hosts field is not an array")
-	}
+hosts, ok := hostsInterface.([]interface{})
+if !ok {
+	return []*proto.ResultsChunk{{
+		Data:            response.Data,
+		IsFinal:         true,
+		ChunkIndex:      0,
+		TotalChunks:     1,
+		CurrentSequence: response.CurrentSequence,
+		Timestamp:       response.Timestamp,
+	}}, nil
+}

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 8


Why: The suggestion improves the system's robustness by preventing data delivery failures for malformed large payloads, ensuring data is sent even if it cannot be chunked.

Medium
  • Update

Previous suggestions

Suggestions up to commit 3a66f3d
Category | Suggestion | Impact
High-level
Consider a dedicated metrics ingestion endpoint

Instead of sending sysmon metrics via the existing gRPC status update channel,
consider creating a dedicated endpoint for them. This would improve scalability
and prevent high-volume metrics data from impacting other status messages.

Examples:

openspec/changes/fix-sysmon-metrics-ingestion/specs/edge-architecture/spec.md [3]
The system SHALL ingest sysmon metrics delivered over gRPC status updates into the tenant-scoped CNPG hypertables (`cpu_metrics`, `cpu_cluster_metrics`, `memory_metrics`, `disk_metrics`, and `process_metrics`).
openspec/changes/fix-sysmon-metrics-ingestion/proposal.md [7]
- Parse sysmon metrics payloads forwarded via gRPC status updates and insert CPU, CPU cluster, memory, disk, and process metrics into tenant-scoped hypertables using Ash bulk creates.

Solution Walkthrough:

Before:

// Flow described in the proposal using a shared status channel
Agent {
  send_grpc_status_update("sysmon-metrics", large_payload)
  send_grpc_status_update("other-status", small_payload)
}

Gateway {
  receive_status_update(type, payload)
  // Special handling for sysmon-metrics payload size
  forward_to_core(type, payload)
}

Core.StatusHandler {
  // Handles both metrics and other statuses
  if (type == "sysmon-metrics") {
    ingest_metrics(payload)
  } else {
    handle_other_status()
  }
}

After:

// Suggested flow with a dedicated metrics endpoint
Agent {
  send_grpc_metrics("sysmon-metrics", large_payload)
  send_grpc_status_update("other-status", small_payload)
}

Gateway {
  // Separate endpoints/handlers
  receive_metrics(...)
  receive_status_update(...)
}

Core {
  // Separate handlers for separation of concerns
  MetricsIngestionHandler { ingest_metrics(payload) }
  StatusHandler { handle_other_status() }
}

Suggestion importance[1-10]: 8


Why: The suggestion raises a valid and significant architectural concern about scalability and reliability by questioning the use of a shared status channel for high-volume metrics, which is a core part of the proposed design.

Medium
General
Clarify handling of unlinked device metrics

To prevent data integrity issues, change the specification to drop metrics from
unlinked devices and log a high-severity alert, instead of ingesting them with a
null or fallback identifier.

openspec/changes/fix-sysmon-metrics-ingestion/specs/edge-architecture/spec.md [14-15]

-- THEN core SHALL ingest the metrics with a safe fallback device identifier or leave it null
-- AND the ingest SHALL NOT fail due to missing device linkage
+- THEN core SHALL drop the metrics and log a high-severity alert detailing the tenant and agent
+- AND the ingest process for other agents SHALL NOT be affected
Suggestion importance[1-10]: 7


Why: The suggestion raises a valid design concern about data integrity when handling metrics from unlinked devices, proposing a stricter and more explicit failure mode which improves the robustness of the specification.

Medium
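The stricter policy this suggestion proposes can be sketched as a drop-with-alert branch in the ingest path: metrics from an agent with no linked device are dropped and logged at high severity rather than persisted under a null or fallback identifier. The types, field names, and log format below are illustrative assumptions, not the project's actual ingestor:

```go
package main

import (
	"fmt"
	"log"
)

type metric struct {
	agentID  string
	deviceID string // empty when the agent has no linked device record
}

// ingest drops unlinked metrics with a high-severity log entry instead of
// storing them with a fallback device identifier.
func ingest(m metric) (stored bool) {
	if m.deviceID == "" {
		log.Printf("SEVERITY=high dropped sysmon metrics: agent %q has no linked device", m.agentID)
		return false
	}
	fmt.Printf("stored metrics for device %s\n", m.deviceID)
	return true
}

func main() {
	ingest(metric{agentID: "agent-1", deviceID: "dev-9"}) // stored
	ingest(metric{agentID: "agent-2"})                    // dropped, alert logged
}
```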
Security
Specify payload size limit configurability

Clarify the "configured sysmon limit" in the specification by requiring it to be
configurable (e.g., per-tenant) and have a documented default value to improve
security and reliability.

openspec/changes/fix-sysmon-metrics-ingestion/specs/edge-architecture/spec.md [23]

-- THEN the payload SHALL be accepted up to the configured sysmon limit
+- THEN the payload SHALL be accepted up to a configurable limit (e.g., per-tenant, with a documented default)
Suggestion importance[1-10]: 6


Why: The suggestion correctly identifies ambiguity in the specification regarding the "configured sysmon limit" and proposes adding important details about its scope and default value, which enhances clarity and security.

Low
``` </details> <details> <summary> <a href="https://github.com/carverauto/serviceradar/pull/2299/files#diff-418b06ce1a319964183ae67407a5032c5919109adc1b482dd5df03b45f9701baR7-R7">openspec/changes/fix-sysmon-metrics-ingestion/proposal.md [7]</a> </summary> ```markdown - Parse sysmon metrics payloads forwarded via gRPC status updates and insert CPU, CPU cluster, memory, disk, and process metrics into tenant-scoped hypertables using Ash bulk creates. ``` </details> ### Solution Walkthrough: #### Before: ```markdown // Flow described in the proposal using a shared status channel Agent { send_grpc_status_update("sysmon-metrics", large_payload) send_grpc_status_update("other-status", small_payload) } Gateway { receive_status_update(type, payload) // Special handling for sysmon-metrics payload size forward_to_core(type, payload) } Core.StatusHandler { // Handles both metrics and other statuses if (type == "sysmon-metrics") { ingest_metrics(payload) } else { handle_other_status() } } ``` #### After: ```markdown // Suggested flow with a dedicated metrics endpoint Agent { send_grpc_metrics("sysmon-metrics", large_payload) send_grpc_status_update("other-status", small_payload) } Gateway { // Separate endpoints/handlers receive_metrics(...) receive_status_update(...) } Core { // Separate handlers for separation of concerns MetricsIngestionHandler { ingest_metrics(payload) } StatusHandler { handle_other_status() } } ``` <details><summary>Suggestion importance[1-10]: 8</summary> __ Why: The suggestion raises a valid and significant architectural concern about scalability and reliability by questioning the use of a shared status channel for high-volume metrics, which is a core part of the proposed design. 
</details></details></td><td align=center>Medium </td></tr><tr><td rowspan=1>General</td> <td> <details><summary>Clarify handling of unlinked device metrics</summary> ___ **To prevent data integrity issues, change the specification to drop metrics from <br>unlinked devices and log a high-severity alert, instead of ingesting them with a <br>null or fallback identifier.** [openspec/changes/fix-sysmon-metrics-ingestion/specs/edge-architecture/spec.md [14-15]](https://github.com/carverauto/serviceradar/pull/2299/files#diff-75d95d51d43c29fb78136f5509721d35e038de624866178d8c2706749e4febdaR14-R15) ```diff -- THEN core SHALL ingest the metrics with a safe fallback device identifier or leave it null -- AND the ingest SHALL NOT fail due to missing device linkage +- THEN core SHALL drop the metrics and log a high-severity alert detailing the tenant and agent +- AND the ingest process for other agents SHALL NOT be affected ``` <details><summary>Suggestion importance[1-10]: 7</summary> __ Why: The suggestion raises a valid design concern about data integrity when handling metrics from unlinked devices, proposing a stricter and more explicit failure mode which improves the robustness of the specification. 
</details></details></td><td align=center>Medium </td></tr><tr><td rowspan=1>Security</td> <td> <details><summary>Specify payload size limit configurability</summary> ___ **Clarify the "configured sysmon limit" in the specification by requiring it to be <br>configurable (e.g., per-tenant) and have a documented default value to improve <br>security and reliability.** [openspec/changes/fix-sysmon-metrics-ingestion/specs/edge-architecture/spec.md [23]](https://github.com/carverauto/serviceradar/pull/2299/files#diff-75d95d51d43c29fb78136f5509721d35e038de624866178d8c2706749e4febdaR23-R23) ```diff -- THEN the payload SHALL be accepted up to the configured sysmon limit +- THEN the payload SHALL be accepted up to a configurable limit (e.g., per-tenant, with a documented default) ``` <details><summary>Suggestion importance[1-10]: 6</summary> __ Why: The suggestion correctly identifies ambiguity in the specification regarding the "configured sysmon limit" and proposes adding important details about its scope and default value, which enhances clarity and security. </details></details></td><td align=center>Low </td></tr> <tr><td align="center" colspan="2"> <!-- /improve_multi --more_suggestions=true --> </td><td></td></tr></tbody></table> </details>
github-advanced-security[bot] commented 2026-01-14 18:21:34 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR review comment.

Original author: @github-advanced-security[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2299#discussion_r2691551561
Original created: 2026-01-14T18:21:34Z
Original path: pkg/agent/push_loop.go
Original line: 560

Size computation for allocation may overflow

This operation, which is used in an allocation, involves a potentially large value and might overflow.

[Show more details](https://github.com/carverauto/serviceradar/security/code-scanning/91)

github-advanced-security[bot] commented 2026-01-14 18:21:34 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR review comment.

Original author: @github-advanced-security[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2299#discussion_r2691551567
Original created: 2026-01-14T18:21:34Z
Original path: pkg/agent/push_loop.go
Original line: 620

Size computation for allocation may overflow

This operation, which is used in an allocation, involves a potentially large value and might overflow.

[Show more details](https://github.com/carverauto/serviceradar/security/code-scanning/92)

Reference
carverauto/serviceradar!2670