sysmon/results router work #2670

Merged
mfreeman451 merged 15 commits from refs/pull/2670/head into staging 2026-01-14 21:31:53 +00:00
mfreeman451 commented 2026-01-14 17:40:41 +00:00 (Migrated from github.com)
Owner

Imported from GitHub pull request.

Original GitHub pull request: #2299
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/pull/2299
Original created: 2026-01-14T17:40:41Z
Original updated: 2026-01-14T21:31:55Z
Original head: carverauto/serviceradar:updates/sysmon-metrics-ingestion
Original base: staging
Original merged: 2026-01-14T21:31:53Z by @mfreeman451

User description

IMPORTANT: Please sign the Developer Certificate of Origin

Thank you for your contribution to ServiceRadar. Please note, when contributing, the developer must include
a DCO sign-off statement indicating the DCO acceptance in one commit message. Here
is an example DCO Signed-off-by line in a commit message:

Signed-off-by: J. Doe <j.doe@domain.com>

Describe your changes

Code checklist before requesting a review

  • I have signed the DCO?
  • The build completes without errors?
  • All tests are passing when running make test?

PR Type

Enhancement, Documentation


Description

  • Add OpenSpec proposal for sysmon metrics ingestion via gRPC pipeline

  • Define requirements for persisting CPU, memory, disk, and process metrics to tenant hypertables

  • Specify device identifier resolution with safe fallbacks for missing linkages

  • Document payload size handling and implementation tasks for metrics ingestor


Diagram Walkthrough

flowchart LR
  A["gRPC Status Updates"] -->|"sysmon-metrics payload"| B["Agent Gateway"]
  B -->|"forward to core"| C["Core Status Handler"]
  C -->|"resolve device ID"| D["Device Identifier"]
  D -->|"insert metrics"| E["Tenant CNPG Hypertables"]
  E -->|"cpu, memory, disk, process"| F["Device Charts Rendered"]

File Walkthrough

Relevant files
Documentation
proposal.md
Sysmon metrics ingestion proposal and rationale                   

openspec/changes/fix-sysmon-metrics-ingestion/proposal.md

  • Introduces proposal to fix sysmon metrics ingestion from gRPC pipeline
  • Outlines parsing of sysmon payloads and insertion into tenant-scoped
    hypertables
  • Specifies device identifier resolution with fallback handling
  • Documents payload size allowance and Ash resource mapping requirements
+14/-0   
spec.md
Edge architecture requirements for sysmon ingestion           

openspec/changes/fix-sysmon-metrics-ingestion/specs/edge-architecture/spec.md

  • Defines requirement for ingesting sysmon metrics into five
    tenant-scoped hypertables
  • Specifies scenario for metrics persistence with device identifier
    resolution
  • Documents fallback behavior when device mapping is unavailable
  • Establishes payload size handling requirements for sysmon-metrics
    messages
+24/-0   
tasks.md
Implementation tasks for sysmon metrics ingestion               

openspec/changes/fix-sysmon-metrics-ingestion/tasks.md

  • Lists six implementation tasks for sysmon metrics ingestion feature
  • Includes spec updates, Ash resource creation, and ingestor development
  • Specifies integration with StatusHandler and gateway normalization
  • Includes testing requirements for metrics mapping and payload handling
+7/-0     

qodo-code-review[bot] commented 2026-01-14 17:41:04 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2299#issuecomment-3750797655
Original created: 2026-01-14T17:41:04Z

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
🟢
No security concerns identified: no security vulnerabilities detected by AI analysis. Human verification advised for critical code.
Ticket Compliance
⚪
🎫 No ticket provided
  • Create ticket/issue
Codebase Duplication Compliance
⚪
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢
Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status:
Logging unspecified: The proposal/spec describes ingestion that previously "silently drops data" but
does not specify audit/logging requirements for ingestion success/failure outcomes.

Referred Code
Sysmon metrics streamed from agents over gRPC are not being persisted to tenant CNPG hypertables, so device details render without sysmon charts and the metrics pipeline silently drops data.

## What Changes
- Parse sysmon metrics payloads forwarded via gRPC status updates and insert CPU, CPU cluster, memory, disk, and process metrics into tenant-scoped hypertables using Ash bulk creates.
- Resolve the device identifier from the agent record when available, with safe fallbacks if the device linkage is missing.
- Permit larger `sysmon-metrics` payloads in the agent gateway to avoid truncation.
- Add an Ash resource mapping for `cpu_cluster_metrics` (if needed for ingestion parity).

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status:
Error handling unspecified: The spec requires ingestion to not fail on missing device linkage but does not define
expected handling/reporting for other ingestion failure points (parse errors, DB insert
failures, malformed payloads).

Referred Code
#### Scenario: Device mapping unavailable
- **GIVEN** an agent streams a `sysmon-metrics` status payload but has no linked device record
- **WHEN** the gateway forwards the status update to core
- **THEN** core SHALL ingest the metrics with a safe fallback device identifier or leave it null
- **AND** the ingest SHALL NOT fail due to missing device linkage

Learn more about managing compliance generic rules or creating your own custom rules
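The "device mapping unavailable" scenario above can be sketched as a small resolution helper. This is a hypothetical illustration of the fallback rule the spec describes (prefer the linked device, otherwise derive a safe identifier from the agent, otherwise leave it empty/null); the names and the `agent:` prefix are assumptions, not the project's actual API.

```go
package main

import "fmt"

// resolveDeviceID sketches the spec's fallback rule: use the device linked
// to the agent when present, fall back to a deterministic agent-derived
// identifier, and never fail the ingest just because the linkage is missing.
func resolveDeviceID(agentID, linkedDeviceID string) string {
	if linkedDeviceID != "" {
		return linkedDeviceID // normal path: agent record carries a device link
	}
	if agentID != "" {
		return "agent:" + agentID // safe fallback identifier (illustrative format)
	}
	return "" // stored as null; ingestion still proceeds
}

func main() {
	fmt.Println(resolveDeviceID("agent-7", "dev-42")) // dev-42
	fmt.Println(resolveDeviceID("agent-7", ""))       // agent:agent-7
}
```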

Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status:
Validation unspecified: The spec defines acceptance of larger sysmon-metrics payloads but does not specify
validation/sanitization limits (schema validation, size ceilings, rate limiting) for
externally sourced gRPC payload data.

Referred Code
### Requirement: Sysmon payload size handling
The gateway SHALL accept `sysmon-metrics` payloads larger than the default status message limit and forward them without truncation.

#### Scenario: Large sysmon payload
- **GIVEN** a `sysmon-metrics` status payload larger than 4KB
- **WHEN** the gateway processes the message
- **THEN** the payload SHALL be accepted up to the configured sysmon limit
- **AND** the payload SHALL be forwarded to core intact

Learn more about managing compliance generic rules or creating your own custom rules
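The payload-size requirement quoted above amounts to a per-service size ceiling: `sysmon-metrics` gets a larger, configured limit while other status messages keep the default. A minimal sketch, assuming a 4KB default (the figure in the scenario) and an illustrative configured sysmon limit; the function names and values are not the project's actual gateway code:

```go
package main

import "fmt"

// maxPayloadBytes returns the size ceiling for a status message type.
// The 4KB default matches the scenario above; the sysmon limit is whatever
// the operator configures.
func maxPayloadBytes(serviceType string, sysmonLimit int) int {
	const defaultLimit = 4 * 1024
	if serviceType == "sysmon-metrics" {
		return sysmonLimit
	}
	return defaultLimit
}

// acceptPayload rejects payloads over the applicable ceiling instead of
// silently truncating them.
func acceptPayload(serviceType string, payload []byte, sysmonLimit int) error {
	if limit := maxPayloadBytes(serviceType, sysmonLimit); len(payload) > limit {
		return fmt.Errorf("%s payload of %d bytes exceeds limit %d", serviceType, len(payload), limit)
	}
	return nil
}

func main() {
	big := make([]byte, 16*1024)
	fmt.Println(acceptPayload("sysmon-metrics", big, 64*1024)) // accepted under sysmon limit
	fmt.Println(acceptPayload("icmp", big, 64*1024))           // rejected: over the 4KB default
}
```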

  • Update
Compliance status legend
🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label
qodo-code-review[bot] commented 2026-01-14 17:42:15 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2299#issuecomment-3750804381
Original created: 2026-01-14T17:42:15Z

PR Code Suggestions

Latest suggestions up to 2d46e4c

Category | Suggestion | Impact
Security
Prevent atom table exhaustion

Replace the unsafe String.to_atom/1 with String.to_existing_atom/1 inside a
try/rescue block. This prevents a potential denial-of-service attack caused by
atom table exhaustion from untrusted input.

elixir/serviceradar_core/lib/serviceradar/observability/sysmon_metrics_ingestor.ex [0]

 defp fetch_value(map, key) when is_map(map) do
   Map.get(map, key) ||
-    Map.get(map, String.to_atom(key))
+    (try do
+       Map.get(map, String.to_existing_atom(key))
+     rescue
+       ArgumentError -> nil
+     end)
 end

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 10


Why: The suggestion correctly identifies a critical security vulnerability (atom table exhaustion) that could lead to a denial-of-service attack, and proposes a robust fix using String.to_existing_atom/1.

High
Possible issue
Avoid skipping sweep results

To prevent data loss on transient failures, move the setSweepResultsSequence
call to after the StreamStatus call succeeds. Additionally, return false on a
StreamStatus error to correctly reflect the failed state.

pkg/agent/push_loop.go [0]

-if response.CurrentSequence != "" {
-	p.setSweepResultsSequence(response.CurrentSequence)
-}
+pendingSeq := response.CurrentSequence
 ...
 _, err = p.gateway.StreamStatus(pushCtx, statusChunks)
 if err != nil {
 	p.logger.Error().Err(err).Msg("Failed to stream sweep results to gateway")
-	return true
+	return false
 }
 
+if pendingSeq != "" {
+	p.setSweepResultsSequence(pendingSeq)
+}
+

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 9


Why: The suggestion fixes a critical bug where transient network errors could cause permanent data loss by updating a sequence number before ensuring the data was successfully sent.

High
Make chunking failure-tolerant

Instead of erroring on malformed large payloads that cannot be chunked, fall
back to sending the payload as a single chunk. This improves robustness by
ensuring data is still delivered.

pkg/agent/push_loop.go [0]

-func buildSweepResultsChunks(response *proto.ResultsResponse) ([]*proto.ResultsChunk, error) {
-	if response == nil {
-		return nil, nil
-	}
+var sweepData map[string]interface{}
+if err := json.Unmarshal(response.Data, &sweepData); err != nil {
+	// If we can't chunk safely, fall back to streaming as-is.
+	return []*proto.ResultsChunk{{
+		Data:            response.Data,
+		IsFinal:         true,
+		ChunkIndex:      0,
+		TotalChunks:     1,
+		CurrentSequence: response.CurrentSequence,
+		Timestamp:       response.Timestamp,
+	}}, nil
+}
 
-	if len(response.Data) == 0 {
-		return nil, nil
-	}
-...
-	hostsInterface, ok := sweepData["hosts"]
-	if !ok {
-		return nil, fmt.Errorf("sweep data missing hosts field")
-	}
+hostsInterface, ok := sweepData["hosts"]
+if !ok {
+	return []*proto.ResultsChunk{{
+		Data:            response.Data,
+		IsFinal:         true,
+		ChunkIndex:      0,
+		TotalChunks:     1,
+		CurrentSequence: response.CurrentSequence,
+		Timestamp:       response.Timestamp,
+	}}, nil
+}
 
-	hosts, ok := hostsInterface.([]interface{})
-	if !ok {
-		return nil, fmt.Errorf("hosts field is not an array")
-	}
+hosts, ok := hostsInterface.([]interface{})
+if !ok {
+	return []*proto.ResultsChunk{{
+		Data:            response.Data,
+		IsFinal:         true,
+		ChunkIndex:      0,
+		TotalChunks:     1,
+		CurrentSequence: response.CurrentSequence,
+		Timestamp:       response.Timestamp,
+	}}, nil
+}

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 8


Why: The suggestion improves the system's robustness by preventing data delivery failures for malformed large payloads, ensuring data is sent even if it cannot be chunked.

Medium
  • Update

Previous suggestions

Suggestions up to commit 3a66f3d
Category | Suggestion | Impact
High-level
Consider a dedicated metrics ingestion endpoint

Instead of sending sysmon metrics via the existing gRPC status update channel,
consider creating a dedicated endpoint for them. This would improve scalability
and prevent high-volume metrics data from impacting other status messages.

Examples:

openspec/changes/fix-sysmon-metrics-ingestion/specs/edge-architecture/spec.md [3]
The system SHALL ingest sysmon metrics delivered over gRPC status updates into the tenant-scoped CNPG hypertables (`cpu_metrics`, `cpu_cluster_metrics`, `memory_metrics`, `disk_metrics`, and `process_metrics`).
openspec/changes/fix-sysmon-metrics-ingestion/proposal.md [7]
- Parse sysmon metrics payloads forwarded via gRPC status updates and insert CPU, CPU cluster, memory, disk, and process metrics into tenant-scoped hypertables using Ash bulk creates.

Solution Walkthrough:

Before:

// Flow described in the proposal using a shared status channel
Agent {
  send_grpc_status_update("sysmon-metrics", large_payload)
  send_grpc_status_update("other-status", small_payload)
}

Gateway {
  receive_status_update(type, payload)
  // Special handling for sysmon-metrics payload size
  forward_to_core(type, payload)
}

Core.StatusHandler {
  // Handles both metrics and other statuses
  if (type == "sysmon-metrics") {
    ingest_metrics(payload)
  } else {
    handle_other_status()
  }
}

After:

// Suggested flow with a dedicated metrics endpoint
Agent {
  send_grpc_metrics("sysmon-metrics", large_payload)
  send_grpc_status_update("other-status", small_payload)
}

Gateway {
  // Separate endpoints/handlers
  receive_metrics(...)
  receive_status_update(...)
}

Core {
  // Separate handlers for separation of concerns
  MetricsIngestionHandler { ingest_metrics(payload) }
  StatusHandler { handle_other_status() }
}

Suggestion importance[1-10]: 8


Why: The suggestion raises a valid and significant architectural concern about scalability and reliability by questioning the use of a shared status channel for high-volume metrics, which is a core part of the proposed design.

Medium
General
Clarify handling of unlinked device metrics

To prevent data integrity issues, change the specification to drop metrics from
unlinked devices and log a high-severity alert, instead of ingesting them with a
null or fallback identifier.

openspec/changes/fix-sysmon-metrics-ingestion/specs/edge-architecture/spec.md [14-15]

-- THEN core SHALL ingest the metrics with a safe fallback device identifier or leave it null
-- AND the ingest SHALL NOT fail due to missing device linkage
+- THEN core SHALL drop the metrics and log a high-severity alert detailing the tenant and agent
+- AND the ingest process for other agents SHALL NOT be affected
Suggestion importance[1-10]: 7


Why: The suggestion raises a valid design concern about data integrity when handling metrics from unlinked devices, proposing a stricter and more explicit failure mode which improves the robustness of the specification.

Medium
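The stricter policy this suggestion proposes can be sketched as a drop-with-alert branch in the ingest path: metrics from an agent with no linked device are dropped and logged at high severity rather than persisted under a null or fallback identifier. The types, field names, and log format below are illustrative assumptions, not the project's actual ingestor:

```go
package main

import (
	"fmt"
	"log"
)

type metric struct {
	agentID  string
	deviceID string // empty when the agent has no linked device record
}

// ingest drops unlinked metrics with a high-severity log entry instead of
// storing them with a fallback device identifier.
func ingest(m metric) (stored bool) {
	if m.deviceID == "" {
		log.Printf("SEVERITY=high dropped sysmon metrics: agent %q has no linked device", m.agentID)
		return false
	}
	fmt.Printf("stored metrics for device %s\n", m.deviceID)
	return true
}

func main() {
	ingest(metric{agentID: "agent-1", deviceID: "dev-9"}) // stored
	ingest(metric{agentID: "agent-2"})                    // dropped, alert logged
}
```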
Security
Specify payload size limit configurability

Clarify the "configured sysmon limit" in the specification by requiring it to be
configurable (e.g., per-tenant) and have a documented default value to improve
security and reliability.

openspec/changes/fix-sysmon-metrics-ingestion/specs/edge-architecture/spec.md [23]

-- THEN the payload SHALL be accepted up to the configured sysmon limit
+- THEN the payload SHALL be accepted up to a configurable limit (e.g., per-tenant, with a documented default)
Suggestion importance[1-10]: 6


Why: The suggestion correctly identifies ambiguity in the specification regarding the "configured sysmon limit" and proposes adding important details about its scope and default value, which enhances clarity and security.

Low
``` </details> <details> <summary> <a href="https://github.com/carverauto/serviceradar/pull/2299/files#diff-418b06ce1a319964183ae67407a5032c5919109adc1b482dd5df03b45f9701baR7-R7">openspec/changes/fix-sysmon-metrics-ingestion/proposal.md [7]</a> </summary> ```markdown - Parse sysmon metrics payloads forwarded via gRPC status updates and insert CPU, CPU cluster, memory, disk, and process metrics into tenant-scoped hypertables using Ash bulk creates. ``` </details> ### Solution Walkthrough: #### Before: ```markdown // Flow described in the proposal using a shared status channel Agent { send_grpc_status_update("sysmon-metrics", large_payload) send_grpc_status_update("other-status", small_payload) } Gateway { receive_status_update(type, payload) // Special handling for sysmon-metrics payload size forward_to_core(type, payload) } Core.StatusHandler { // Handles both metrics and other statuses if (type == "sysmon-metrics") { ingest_metrics(payload) } else { handle_other_status() } } ``` #### After: ```markdown // Suggested flow with a dedicated metrics endpoint Agent { send_grpc_metrics("sysmon-metrics", large_payload) send_grpc_status_update("other-status", small_payload) } Gateway { // Separate endpoints/handlers receive_metrics(...) receive_status_update(...) } Core { // Separate handlers for separation of concerns MetricsIngestionHandler { ingest_metrics(payload) } StatusHandler { handle_other_status() } } ``` <details><summary>Suggestion importance[1-10]: 8</summary> __ Why: The suggestion raises a valid and significant architectural concern about scalability and reliability by questioning the use of a shared status channel for high-volume metrics, which is a core part of the proposed design. 
</details></details></td><td align=center>Medium </td></tr><tr><td rowspan=1>General</td> <td> <details><summary>Clarify handling of unlinked device metrics</summary> ___ **To prevent data integrity issues, change the specification to drop metrics from <br>unlinked devices and log a high-severity alert, instead of ingesting them with a <br>null or fallback identifier.** [openspec/changes/fix-sysmon-metrics-ingestion/specs/edge-architecture/spec.md [14-15]](https://github.com/carverauto/serviceradar/pull/2299/files#diff-75d95d51d43c29fb78136f5509721d35e038de624866178d8c2706749e4febdaR14-R15) ```diff -- THEN core SHALL ingest the metrics with a safe fallback device identifier or leave it null -- AND the ingest SHALL NOT fail due to missing device linkage +- THEN core SHALL drop the metrics and log a high-severity alert detailing the tenant and agent +- AND the ingest process for other agents SHALL NOT be affected ``` <details><summary>Suggestion importance[1-10]: 7</summary> __ Why: The suggestion raises a valid design concern about data integrity when handling metrics from unlinked devices, proposing a stricter and more explicit failure mode which improves the robustness of the specification. 
</details></details></td><td align=center>Medium </td></tr><tr><td rowspan=1>Security</td> <td> <details><summary>Specify payload size limit configurability</summary> ___ **Clarify the "configured sysmon limit" in the specification by requiring it to be <br>configurable (e.g., per-tenant) and have a documented default value to improve <br>security and reliability.** [openspec/changes/fix-sysmon-metrics-ingestion/specs/edge-architecture/spec.md [23]](https://github.com/carverauto/serviceradar/pull/2299/files#diff-75d95d51d43c29fb78136f5509721d35e038de624866178d8c2706749e4febdaR23-R23) ```diff -- THEN the payload SHALL be accepted up to the configured sysmon limit +- THEN the payload SHALL be accepted up to a configurable limit (e.g., per-tenant, with a documented default) ``` <details><summary>Suggestion importance[1-10]: 6</summary> __ Why: The suggestion correctly identifies ambiguity in the specification regarding the "configured sysmon limit" and proposes adding important details about its scope and default value, which enhances clarity and security. </details></details></td><td align=center>Low </td></tr> <tr><td align="center" colspan="2"> <!-- /improve_multi --more_suggestions=true --> </td><td></td></tr></tbody></table> </details>
github-advanced-security[bot] commented 2026-01-14 18:21:34 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR review comment.

Original author: @github-advanced-security[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2299#discussion_r2691551561
Original created: 2026-01-14T18:21:34Z
Original path: pkg/agent/push_loop.go
Original line: 560

Size computation for allocation may overflow

This operation, which is used in an allocation, involves a potentially large value and might overflow.

[Show more details](https://github.com/carverauto/serviceradar/security/code-scanning/91)

github-advanced-security[bot] commented 2026-01-14 18:21:34 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR review comment.

Original author: @github-advanced-security[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2299#discussion_r2691551567
Original created: 2026-01-14T18:21:34Z
Original path: pkg/agent/push_loop.go
Original line: 620

Size computation for allocation may overflow

This operation, which is used in an allocation, involves a potentially large value and might overflow.

[Show more details](https://github.com/carverauto/serviceradar/security/code-scanning/92)

Reference
carverauto/serviceradar!2670