Sysmonvm/integration tests 2 #2316

Merged
mfreeman451 merged 13 commits from refs/pull/2316/head into main 2025-10-14 03:13:29 +00:00
mfreeman451 commented 2025-10-13 06:06:41 +00:00 (Migrated from github.com)
Owner

Imported from GitHub pull request.

Original GitHub pull request: #1755
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/pull/1755
Original created: 2025-10-13T06:06:41Z
Original updated: 2025-10-14T03:13:48Z
Original head: carverauto/serviceradar:sysmonvm/integration_tests_2
Original base: main
Original merged: 2025-10-14T03:13:29Z by @mfreeman451

Auto-created Ticket

#1756

PR Type

Enhancement, Bug fix, Documentation


Description

Major Enhancements:

  • Implemented automatic device registration from checker metadata, eliminating manual device setup requirements

  • Added automatic core reconnection logic in poller with retry mechanism for transient failures

  • Consolidated database migrations into single schema file, simplifying deployment and maintenance

  • Added CPU frequency metrics collection, processing, and visualization across the entire stack

  • Implemented per-checker security configuration support in agent

Infrastructure & Configuration:

  • Added authentication endpoints (/auth/login, /auth/refresh) to Kong API gateway

  • Added resource limits (8GB memory) and PROTON_DISABLE_TASKSTATS for Proton service

  • Implemented password persistence and recovery for Proton database across restarts

  • Preserved core configuration and auth keys across container restarts

  • Added Docker Compose override for local image development

UI Enhancements:

  • Added CPU frequency visualization components (card, chart, per-core details)

  • Integrated CPU frequency metrics into system metrics dashboard

  • Added service hint support to sysmon status indicator for better checker capability detection

  • Added configurable max value support to metric card component

Error Handling & Reliability:

  • Improved batch metric insertion with error tracking and validation

  • Added timestamp skew validation for sysmon payloads

  • Enhanced service messages with host identity for device correlation

Documentation:

  • Completely rewrote custom checker documentation focusing on gRPC architecture with sequence diagrams

  • Refined device identity validation rules for non-sweep sources

  • Updated migration file references in documentation

Configuration Changes:

  • Added sysmon-vm checker configuration with gRPC settings

  • Updated sweep configuration with ICMP mode

  • Removed incremental migration files in favor of consolidated schema


Diagram Walkthrough

flowchart LR
  checker["Custom Checker"] -- "gRPC with metadata" --> poller["Poller"]
  poller -- "enriched payload" --> core["Core"]
  core -- "auto-register" --> devices["Device Registry"]
  core -- "reconnect on failure" --> poller
  core -- "CPU frequency data" --> db["Database (schema.sql)"]
  db -- "metrics" --> web["Web UI"]
  web -- "visualize" --> cpufreq["CPU Frequency Charts"]
  kong["Kong Gateway"] -- "auth routes" --> core

File Walkthrough

Relevant files: Enhancement (13 files)

poller.go (pkg/poller/poller.go)
Add automatic core reconnection and error handling for poller

  • Added ...(truncated)
qodo-code-review[bot] commented 2025-10-13 06:08:19 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/1755#issuecomment-3395986839
Original created: 2025-10-13T06:08:19Z

PR Compliance Guide 🔍

(Compliance updated until commit github.com/carverauto/serviceradar@ffdbd2720a)

Below is a summary of compliance checks for this PR:

Security Compliance
🔴
Placeholder

Description: Previous string-concatenated SQL was removed; the new code uses parameterized queries
which is safe—no issue here.
sysmon.go [394-406]

Referred Code
func (s *APIServer) getCPUMetricsForDevice(
	ctx context.Context, _ db.SysmonMetricsProvider, deviceID string, start, end time.Time) (interface{}, error) {
	// Query cpu_metrics table directly for per-core data by device_id
	const cpuQuery = `
		SELECT timestamp, agent_id, host_id, core_id, usage_percent, frequency_hz, label, cluster
		FROM table(cpu_metrics)
		WHERE device_id = $1 AND timestamp BETWEEN $2 AND $3
		ORDER BY timestamp DESC, core_id ASC`

	rows, err := s.dbService.(*db.DB).Conn.Query(ctx, cpuQuery, deviceID, start, end)
	if err != nil {
		return nil, fmt.Errorf("failed to query CPU metrics for device %s: %w", deviceID, err)
	}
⚪
Device spoofing risk

Description: Device IDs are derived directly from a reported host IP in checker payloads without strong
authentication or validation, allowing a malicious or misconfigured checker to spoof IPs
and auto-register/update devices with high confidence.
devices.go [122-245]

Referred Code
func extractCheckerHostIdentity(serviceData json.RawMessage) (hostIP, hostname, hostID string) {
	if len(serviceData) == 0 {
		return "", "", ""
	}

	var payload any
	if err := json.Unmarshal(serviceData, &payload); err != nil {
		return "", "", ""
	}

	hostIP = firstStringMatch(payload,
		[]string{"status", "host_ip"},
		[]string{"status", "ip"},
		[]string{"status", "ip_address"},
		[]string{"host_ip"},
		[]string{"ip"},
		[]string{"ip_address"},
	)
	if hostIP == "" {
		hostIP = findStringByKeySubstring(payload, "ip")
	}


 ... (clipped 103 lines)
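The firstStringMatch helper referenced in the excerpt above is clipped; a plausible stdlib-only sketch, assuming it walks the decoded JSON value along each candidate key path and returns the first non-empty string it finds:

```go
package main

import "fmt"

// firstStringMatch walks a decoded JSON value along each candidate
// key path in order and returns the first non-empty string found.
// This is an assumed reconstruction, not the PR's actual helper.
func firstStringMatch(payload any, paths ...[]string) string {
	for _, path := range paths {
		node := payload
		ok := true
		for _, key := range path {
			m, isMap := node.(map[string]any)
			if !isMap {
				ok = false
				break
			}
			node, ok = m[key]
			if !ok {
				break
			}
		}
		if !ok {
			continue
		}
		if s, isStr := node.(string); isStr && s != "" {
			return s
		}
	}
	return ""
}

func main() {
	payload := map[string]any{"status": map[string]any{"host_ip": "10.0.0.5"}}
	// Nested path matches before the top-level fallbacks are tried.
	fmt.Println(firstStringMatch(payload,
		[]string{"status", "host_ip"},
		[]string{"host_ip"},
	)) // 10.0.0.5
}
```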
Ticket Compliance
🟡
🎫 #1756
🟢 Extend metrics schema and API to support CPU core labels and cluster-level frequency, and
return them in sysmon responses and queries.
Migrate database to store new CPU fields and create a cpu_cluster_metrics table; batch
writes should error meaningfully when nothing is appended.
Provide a gRPC sysmon-vm checker that reports per-core usage/frequency with
labels/clusters and cluster aggregates, including host identifiers and timestamps.
Ingest sysmon/sysmon-vm payloads, normalize timestamps, store new metrics, and
auto-register devices from checker payloads using host IP/ID with source confidence.
Add reconnection logic on gRPC error codes/timeouts for poller, enrich messages with
resolved host IP, carry agent_id, and support hot-reloading poll interval.
Allow checker-specific security settings via agent checker configs and bootstrap default
KV configs for common checkers including sysmon-vm.
Update CPU metrics API to return labels/clusters and add cluster metrics alongside
per-core data for poller- and device-scoped queries.
Deployment should bootstrap credentials and configurations for local demos, including
sysmon-vm endpoint discovery and Kong auth routes; preserve passwords/keys and
wait-for-core in web entrypoint; update K8s/demo scripts.
Add UI visualizations for CPU frequency, per-core frequency bars, cluster
badges/summaries, and host metadata panel; align heterogeneous series.
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
No custom compliance provided

Follow the guide to enable custom compliance check.

Compliance status legend
🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

Previous compliance checks

Compliance check up to commit 471472d
Security Compliance
⚪
Credential file exposure

Description: The script writes the Proton password into shared credentials with file path
/etc/serviceradar/credentials/proton-password; although chmod 600 is applied here, other
environments may mount this volume broadly—verify that the volume's ownership and
permissions prevent unintended container or host users from reading secrets.
proton-init.sh [62-69]

Referred Code
echo "$PROTON_PASSWORD" > /etc/proton-server/generated_password.txt
chmod 600 /etc/proton-server/generated_password.txt

if [ -d "/etc/serviceradar/credentials" ]; then
    echo "$PROTON_PASSWORD" > /etc/serviceradar/credentials/proton-password
    chmod 600 /etc/serviceradar/credentials/proton-password
    log_info "Password synchronized to shared credentials volume"
fi
Secret handling in k8s

Description: The init script stores the Proton password into a shared credentials file inside the
cluster; ensure the ConfigMap/emptyDir/volume backing this path enforces pod-level and
filesystem permissions so the secret is not world-readable across containers/namespaces.
deploy.sh [321-325]

Referred Code
# Also save to shared credentials volume for other services
if [ -d "/etc/serviceradar/credentials" ]; then
    echo "\$PROTON_PASSWORD" > /etc/serviceradar/credentials/proton-password
    chmod 600 /etc/serviceradar/credentials/proton-password
    echo "[Proton K8s Init] Password also saved to shared credentials volume"
SQL injection

Description: Raw SQL query string is built with fmt.Sprintf embedding device_id and timestamps; if
inputs are not sanitized or strictly controlled, this can allow SQL injection—prefer
parameterized queries.
sysmon.go [398-404]

Referred Code
	SELECT timestamp, agent_id, host_id, core_id, usage_percent, frequency_hz, label, cluster
	FROM table(cpu_metrics)
	WHERE device_id = '%s' AND timestamp BETWEEN '%s' AND '%s'
	ORDER BY timestamp DESC, core_id ASC`,
	deviceID, start.Format(time.RFC3339), end.Format(time.RFC3339))

rows, err := s.dbService.(*db.DB).Conn.Query(ctx, query)
SQL injection

Description: Another raw SQL query is built via fmt.Sprintf with unescaped device_id and timestamp
strings, posing SQL injection risk—use prepared statements with placeholders instead.
sysmon.go [453-461]

Referred Code
	SELECT timestamp, agent_id, host_id, cluster, frequency_hz
	FROM table(cpu_cluster_metrics)
	WHERE device_id = '%s' AND timestamp BETWEEN '%s' AND '%s'
	ORDER BY timestamp DESC, cluster ASC`,
	deviceID, start.Format(time.RFC3339), end.Format(time.RFC3339))

clusterRows, err := s.dbService.(*db.DB).Conn.Query(ctx, clusterQuery)
if err != nil {
	return nil, fmt.Errorf("failed to query CPU cluster metrics for device %s: %w", deviceID, err)
Untrusted JSON propagation

Description: The function enriches service messages by trusting and rewriting arbitrary JSON from
checks; if downstream systems log or render these fields without sanitization, it could
enable log injection or UI injection—ensure escaping/sanitization where messages are used.

agent_poller.go [272-325]

Referred Code
func enrichServiceMessageWithAddress(message []byte, check Check) []byte {
	if len(message) == 0 || check.Type != "grpc" || check.Details == "" {
		return message
	}

	hostCandidate := strings.TrimSpace(check.Details)
	host, _, err := net.SplitHostPort(hostCandidate)
	if err != nil {
		host = hostCandidate
	}

	if host == "" || net.ParseIP(host) == nil {
		return message
	}

	var payload map[string]any
	if err := json.Unmarshal(message, &payload); err != nil {
		return message
	}

	statusNode, ok := payload["status"].(map[string]any)


 ... (clipped 33 lines)
Ticket Compliance
🎫 No ticket provided
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
No custom compliance provided

Follow the guide to enable custom compliance check.

Compliance check up to commit fddabde
Security Compliance
🔴
Insecure file permissions

Description: The Proton password is written to /etc/serviceradar/credentials/proton-password with
world-readable permissions (chmod 644), which can expose credentials inside the
container/host.
proton-init.sh [65-69]

Referred Code
if [ -d "/etc/serviceradar/credentials" ]; then
    echo "$PROTON_PASSWORD" > /etc/serviceradar/credentials/proton-password
    chmod 644 /etc/serviceradar/credentials/proton-password
    log_info "Password synchronized to shared credentials volume"
fi
⚪
Insecure debug script

Description: The grpcurl helper hardcodes a private IP and uses plaintext gRPC without TLS, which risks
accidental exposure if executed in the wrong environment.
g.sh [8-11]

Referred Code
grpcurl -plaintext \
  -d '{"service_name":"sysmon-vm","service_type":"grpc","agent_id":"dev-agent","poller_id":"docker-poller"}' \
  192.168.1.219:50110 \
  monitoring.AgentService/GetStatus
Fragile error handling

Description: Reconnect decision relies partly on substring matching of error messages, which can be
unreliable and may lead to unintended retry storms; consider bounded backoff and stricter
code checks.
poller.go [522-527]

Referred Code
	return strings.Contains(errMsg, "connection error") ||
		strings.Contains(errMsg, "transport: Error while dialing") ||
		strings.Contains(errMsg, "name resolver error") ||
		strings.Contains(errMsg, "connection refused") ||
		strings.Contains(errMsg, "i/o timeout")
}
Ticket Compliance
🎫 No ticket provided
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
No custom compliance provided

Follow the guide to enable custom compliance check.

qodo-code-review[bot] commented 2025-10-13 06:09:32 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/1755#issuecomment-3395989674
Original created: 2025-10-13T06:09:32Z

PR Code Suggestions

Latest suggestions up to 7a4be7b

Category | Suggestion | Impact
Incremental [*]
Prevent IP from overwriting host ID

In enrichServiceMessageWithAddress, avoid overwriting host_id with an IP
address. Instead, add the resolved IP to a separate field like resolved_host_ip
to preserve the original host identifier.

pkg/poller/agent_poller.go [272-326]

 func enrichServiceMessageWithAddress(message []byte, check Check) []byte {
 	if len(message) == 0 || check.Type != "grpc" || check.Details == "" {
 		return message
 	}
 
 	hostCandidate := sanitizeTelemetryString(check.Details)
 	host, _, err := net.SplitHostPort(hostCandidate)
 	if err != nil {
 		host = hostCandidate
 	}
 
-	if host == "" || net.ParseIP(host) == nil {
+	if host == "" {
 		return message
 	}
 
-	safeHost := sanitizeTelemetryString(host)
+	var resolvedIP net.IP
+	if ip := net.ParseIP(host); ip != nil {
+		resolvedIP = ip
+	}
+
 	var payload map[string]any
 	if err := json.Unmarshal(message, &payload); err != nil {
 		return message
 	}
 
-	statusNode, ok := payload["status"].(map[string]any)
-	if !ok || statusNode == nil {
+	statusNode, _ := payload["status"].(map[string]any)
+	if statusNode == nil {
 		statusNode = make(map[string]any)
 	}
 
 	getString := func(key string) string {
 		if raw, exists := statusNode[key]; exists {
 			if str, ok := raw.(string); ok {
 				return sanitizeTelemetryString(str)
 			}
 		}
-
 		return ""
 	}
 
-	if hostIP := getString("host_ip"); hostIP != "" && !strings.EqualFold(hostIP, host) {
-		statusNode["reported_host_ip"] = hostIP
+	// If we parsed a valid IP from details, expose it without overwriting original host fields.
+	if resolvedIP != nil {
+		if hostIP := getString("host_ip"); hostIP != "" && !strings.EqualFold(hostIP, resolvedIP.String()) {
+			statusNode["reported_host_ip"] = hostIP
+		}
+		statusNode["resolved_host_ip"] = resolvedIP.String()
 	}
-	statusNode["host_ip"] = safeHost
 
-	if hostID := getString("host_id"); hostID != "" && !strings.EqualFold(hostID, host) {
+	// Do not set host_id from network address. Preserve any provided host_id.
+	if hostID := getString("host_id"); hostID != "" && resolvedIP != nil && !strings.EqualFold(hostID, resolvedIP.String()) {
 		statusNode["reported_host_id"] = hostID
 	}
-	statusNode["host_id"] = safeHost
 
 	payload["status"] = statusNode
-
 	enriched, err := json.Marshal(payload)
 	if err != nil {
 		return message
 	}
-
 	return enriched
 }
Suggestion importance[1-10]: 8


Why: The suggestion correctly identifies a bug where host_id is incorrectly overwritten with an IP address, which corrupts telemetry data and can break device correlation logic.

Medium
Optimize cpu_metrics sort key

To optimize performance for the cpu_metrics stream, remove cluster from the
ORDER BY clause and instead create a separate bloom filter index on it.

pkg/db/migrations/00000000000001_consolidated_serviceradar_schema.up.sql [211-222]

 CREATE STREAM IF NOT EXISTS cpu_metrics (
     timestamp         DateTime64(3),
     poller_id         string,
     agent_id          string,
     host_id           string,
     core_id           int32,
     usage_percent     float64,
     frequency_hz      float64,
     label             string,
     cluster           string,
     device_id         string,
     partition         string
 ) ENGINE = Stream(1, 1, rand())
 PARTITION BY int_div(to_unix_timestamp(timestamp), 3600)
-ORDER BY (timestamp, device_id, host_id, cluster, core_id)
+ORDER BY (timestamp, device_id, host_id, core_id)
 TTL to_start_of_day(coalesce(timestamp, _tp_time)) + INTERVAL 3 DAY
 SETTINGS index_granularity = 8192;
 
+ALTER STREAM cpu_metrics
+  ADD INDEX IF NOT EXISTS idx_cluster cluster TYPE bloom_filter GRANULARITY 1;
+

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 7


Why: The suggestion correctly identifies that keeping cluster in the ORDER BY clause is a potential performance issue and proposes a valid optimization by moving it to a separate index instead, which improves query performance and data ingestion.

Medium
Block until config exists

Prevent the serviceradar-poller from crashing on startup by modifying its
command to wait until the poller.json configuration file exists before
executing.

docker-compose.yml [329-348]

 serviceradar-poller:
   image: ghcr.io/carverauto/serviceradar-poller:latest
   container_name: serviceradar-poller
   volumes:
     - cert-data:/etc/serviceradar/certs:ro
     - generated-config:/etc/serviceradar/config:ro
     - poller-data:/var/lib/serviceradar
     - ./logs:/var/log/serviceradar
   environment:
-    # Config path
     - CONFIG_PATH=/etc/serviceradar/config/poller.json
-    # Logging
     - LOG_LEVEL=${LOG_LEVEL:-info}
   depends_on:
     cert-generator:
       condition: service_completed_successfully
     config-updater:
       condition: service_completed_successfully
     core:
       condition: service_healthy
-  command: ["/usr/local/bin/serviceradar-poller","-config","/etc/serviceradar/config/poller.json"]
+  command: ["sh","-lc","until [ -s /etc/serviceradar/config/poller.json ]; do echo '[poller] waiting for poller.json'; sleep 2; done; exec /usr/local/bin/serviceradar-poller -config /etc/serviceradar/config/poller.json"]

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 7


Why: The suggestion correctly identifies a race condition where the serviceradar-poller might start before its configuration file is ready, and proposes a robust shell command to wait for the file, preventing startup failures.

Medium
Ensure secure, reliable secret persistence

In proton-init.sh, ensure secret directories exist with secure permissions and
add error handling when writing the password file to prevent silent failures and
ensure credential persistence.

docker/compose/proton-init.sh [62-69]

-if [ -z "$PROTON_PASSWORD" ]; then
-    # First preference: existing shared credential file
-    if [ -f /etc/serviceradar/credentials/proton-password ] && \
-       [ -s /etc/serviceradar/credentials/proton-password ]; then
-        PROTON_PASSWORD=$(cat /etc/serviceradar/credentials/proton-password)
-        log_info "Reusing Proton password from shared credentials volume"
-    # Second preference: previously generated password inside Proton data dir
-    elif [ -f /etc/proton-server/generated_password.txt ] && \
-         [ -s /etc/proton-server/generated_password.txt ]; then
-        PROTON_PASSWORD=$(cat /etc/proton-server/generated_password.txt)
-        log_info "Reusing Proton password from generated_password.txt"
-    else
-        log_info "Generating random password..."
-        PROTON_PASSWORD=$(openssl rand -hex 16)
-        log_info "Generated new Proton password"
-    fi
+# Ensure directories exist with secure permissions before writing secrets
+install -d -m 700 /etc/proton-server || { log_error "Cannot create /etc/proton-server"; exit 1; }
+if [ -d "/etc/serviceradar" ]; then
+  install -d -m 700 /etc/serviceradar/credentials || { log_error "Cannot create /etc/serviceradar/credentials"; exit 1; }
 fi
 
 # Persist password to both expected locations to keep services aligned
-echo "$PROTON_PASSWORD" > /etc/proton-server/generated_password.txt
+echo "$PROTON_PASSWORD" > /etc/proton-server/generated_password.txt || { log_error "Failed to write Proton password"; exit 1; }
 chmod 600 /etc/proton-server/generated_password.txt
 
 if [ -d "/etc/serviceradar/credentials" ]; then
-    echo "$PROTON_PASSWORD" > /etc/serviceradar/credentials/proton-password
-    chmod 600 /etc/serviceradar/credentials/proton-password
-    log_info "Password synchronized to shared credentials volume"
+  echo "$PROTON_PASSWORD" > /etc/serviceradar/credentials/proton-password || { log_error "Failed to write shared Proton password"; exit 1; }
+  chmod 600 /etc/serviceradar/credentials/proton-password
+  log_info "Password synchronized to shared credentials volume"
 fi

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 6


Why: The suggestion improves the robustness of secret handling by adding explicit directory creation and error checking, which prevents silent failures and ensures password persistence.

Low
Possible issue
Prevent device ID collisions and validate IP

To prevent collisions, strengthen deviceID uniqueness by validating the hostIP
and including a tie-breaker such as agentID or svc.ServiceName in its generation.

pkg/core/devices.go [31-120]

 func (s *Server) ensureServiceDevice(
 	ctx context.Context,
 	agentID, pollerID, partition string,
 	svc *proto.ServiceStatus,
 	serviceData json.RawMessage,
 	timestamp time.Time,
 ) {
 	if svc == nil {
 		return
 	}
-
-	// Only gRPC checkers embed host context in a way we can reason about; skip other service types.
 	if svc.ServiceType != grpcServiceType {
 		return
 	}
-
-	// Ignore result streams such as sync/multi-part responses; they set Source to "results".
 	if svc.Source != "" && !strings.EqualFold(svc.Source, "getstatus") {
 		return
 	}
 
 	hostIP, hostname, hostID := extractCheckerHostIdentity(serviceData)
 	hostIP = normalizeHostIdentifier(hostIP)
-	if hostIP == "" || strings.EqualFold(hostIP, "unknown") {
+	parsed := net.ParseIP(hostIP)
+	if parsed == nil {
+		// refuse non-IP identifiers to avoid corrupting device registry
 		return
 	}
 
 	if partition == "" {
 		partition = "default"
 	}
 
-	deviceID := fmt.Sprintf("%s:%s", partition, hostIP)
+	// Use the normalized string form for device ID. For IPv6, use the canonical compressed format.
+	ipKey := parsed.String()
+
+	// Add a stable tie-breaker to mitigate collisions where the same IP might be observed
+	// in overlapping contexts (e.g., NAT or shared targets).
+	extraScope := agentID
+	if extraScope == "" {
+		extraScope = svc.ServiceName
+	}
+	deviceID := fmt.Sprintf("%s:%s:%s", partition, ipKey, extraScope)
 
 	metadata := map[string]string{
-		"source":             "checker",
-		"checker_service":    svc.ServiceName,
-		"checker_service_id": svc.ServiceName,
-		"last_update":        timestamp.Format(time.RFC3339),
+		"source":               "checker",
+		"checker_service":      svc.ServiceName,
+		"checker_service_id":   svc.ServiceName,
+		"last_update":          timestamp.Format(time.RFC3339),
+		"checker_host_ip":      ipKey,
+		"device_id_scope_hint": extraScope,
 	}
-
 	if svc.ServiceType != "" {
 		metadata["checker_service_type"] = svc.ServiceType
 	}
-
 	if agentID != "" {
 		metadata["collector_agent_id"] = agentID
 	}
-
 	if pollerID != "" {
 		metadata["collector_poller_id"] = pollerID
 	}
-
 	if hostID != "" {
 		metadata["checker_host_id"] = hostID
 	}
-
-	metadata["checker_host_ip"] = hostIP
 
 	deviceUpdate := &models.DeviceUpdate{
 		AgentID:     agentID,
 		PollerID:    pollerID,
 		Partition:   partition,
 		DeviceID:    deviceID,
 		Source:      models.DiscoverySourceSelfReported,
-		IP:          hostIP,
+		IP:          ipKey,
 		Timestamp:   timestamp,
 		IsAvailable: true,
 		Metadata:    metadata,
 		Confidence:  models.GetSourceConfidence(models.DiscoverySourceSelfReported),
 	}
-
 	if hostname != "" {
 		deviceUpdate.Hostname = &hostname
 	}
 
 	if s.DeviceRegistry != nil {
 		if err := s.DeviceRegistry.ProcessDeviceUpdate(ctx, deviceUpdate); err != nil {
 			s.logger.Warn().
 				Err(err).
 				Str("device_id", deviceID).
 				Str("service_name", svc.ServiceName).
 				Msg("Failed to process checker device through device registry")
 		}
 	} else {
 		s.logger.Warn().
 			Str("device_id", deviceID).
 			Str("service_name", svc.ServiceName).
 			Msg("DeviceRegistry not available for checker device registration")
 	}
 }
Suggestion importance[1-10]: 8


Why: The suggestion addresses a significant architectural weakness in device identification, preventing potential data collisions and malformed deviceIDs, thus improving data integrity.

Medium
Add missing usage and tighten ORDER BY

Add the missing usage_percent column to the cpu_cluster_metrics stream and
include poller_id and agent_id in its ORDER BY clause to ensure data correctness
and improve query performance.

pkg/db/migrations/00000000000001_consolidated_serviceradar_schema.up.sql [224-240]

-CREATE STREAM IF NOT EXISTS cpu_metrics (
-    core_id           int32,
-    usage_percent     float64,
-    frequency_hz      float64,
-    label             string,
-    cluster           string,
-    device_id         string,
-    partition         string
-) ENGINE = Stream(1, 1, rand())
-PARTITION BY int_div(to_unix_timestamp(timestamp), 3600)
-ORDER BY (timestamp, device_id, host_id, cluster, core_id)
-TTL to_start_of_day(coalesce(timestamp, _tp_time)) + INTERVAL 3 DAY
-SETTINGS index_granularity = 8192;
-
 CREATE STREAM IF NOT EXISTS cpu_cluster_metrics (
     timestamp         DateTime64(3),
     poller_id         string,
     agent_id          string,
     host_id           string,
     cluster           string,
+    usage_percent     float64,
     frequency_hz      float64,
     device_id         string,
     partition         string
 ) ENGINE = Stream(1, 1, rand())
 PARTITION BY int_div(to_unix_timestamp(timestamp), 3600)
-ORDER BY (timestamp, cluster, host_id, device_id)
+ORDER BY (timestamp, cluster, host_id, device_id, poller_id, agent_id)
 TTL to_start_of_day(coalesce(timestamp, _tp_time)) + INTERVAL 3 DAY
 SETTINGS index_granularity = 8192;
 
 ALTER STREAM cpu_cluster_metrics
   ADD INDEX IF NOT EXISTS idx_cluster cluster TYPE bloom_filter GRANULARITY 1;

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 8


Why: The suggestion correctly identifies that the new cpu_cluster_metrics stream is missing a usage_percent field, which is crucial for UI components that display cluster usage. It also correctly points out that poller_id and agent_id should be in the ORDER BY clause for data integrity and performance, which is a critical consideration for streaming databases.

Medium
Add defensive null checks and stable keys

Add a null check for each device in the devices.map loop and provide a fallback
key using the index to prevent potential runtime errors and improve component
stability.

web/src/components/Devices/DeviceTable.tsx [209-218]

-{devices.map(device => {
-    const metadata = device.metadata || {};
-    const sysmonServiceHint = typeof metadata === 'object' &&
+{devices.map((device, idx) => {
+    if (!device || typeof device !== 'object') return null;
+
+    const metadata = (device as any).metadata || {};
+    const sysmonServiceHint =
+        typeof metadata === 'object' &&
         metadata !== null &&
         typeof metadata.checker_service === 'string' &&
         metadata.checker_service.toLowerCase().includes('sysmon');
 
+    const rowKey = device.device_id ?? `device-row-${idx}`;
+
     return (
-    <Fragment key={device.device_id}>
+      <Fragment key={rowKey}>
         <tr className="hover:bg-gray-700/30">
Suggestion importance[1-10]: 6


Why: The suggestion improves code robustness by adding a null check for device and providing a fallback for the React key, which are good defensive programming practices against potentially malformed data.

Low
General
Use consistent truncated timestamps

In GetAllCPUMetrics, use the truncated timestamp for the final SysmonCPUResponse
to ensure consistent grouping and alignment between CPU and cluster metrics.

pkg/db/metrics.go [826-912]

-func (db *DB) GetAllCPUMetrics(
-	ctx context.Context, pollerID string, start, end time.Time) ([]models.SysmonCPUResponse, error) {
-	rows, err := db.Conn.Query(ctx, `
-		SELECT timestamp, agent_id, host_id, core_id, usage_percent, frequency_hz, label, cluster
-		FROM table(cpu_metrics)
-		WHERE poller_id = $1 AND timestamp BETWEEN $2 AND $3
-		ORDER BY timestamp DESC, core_id ASC`,
-		pollerID, start, end)
-	if err != nil {
-		return nil, fmt.Errorf("failed to query CPU metrics: %w", err)
-	}
-	defer db.CloseRows(rows)
+result := make([]models.SysmonCPUResponse, 0, len(data))
 
-	data := make(map[time.Time][]models.CPUMetric)
-	clustersByTimestamp := make(map[time.Time][]models.CPUClusterMetric)
-
-	for rows.Next() {
-		var m models.CPUMetric
-
-		var agentID, hostID, label, cluster string
-
-		var timestamp time.Time
-
-		if err := rows.Scan(&timestamp, &agentID, &hostID, &m.CoreID, &m.UsagePercent, &m.FrequencyHz, &label, &cluster); err != nil {
-			db.logger.Error().Err(err).Msg("Error scanning CPU metric row")
-			continue
-		}
-		key := timestamp.Truncate(time.Second)
-		m.Timestamp = timestamp
-		m.AgentID = agentID
-		m.HostID = hostID
-		m.Label = label
-		m.Cluster = cluster
-		data[key] = append(data[key], m)
-	}
-
-	if err := rows.Err(); err != nil {
-		db.logger.Error().Err(err).Msg("Error iterating CPU metric rows")
-
-		return nil, err
-	}
-
-	clusterRows, err := db.Conn.Query(ctx, `
-		SELECT timestamp, agent_id, host_id, cluster, frequency_hz
-		FROM table(cpu_cluster_metrics)
-		WHERE poller_id = $1 AND timestamp BETWEEN $2 AND $3
-		ORDER BY timestamp DESC, cluster ASC`,
-		pollerID, start, end)
-	if err != nil {
-		db.logger.Error().Err(err).Msg("Error querying CPU cluster metrics")
-
-		return nil, fmt.Errorf("failed to query CPU cluster metrics: %w", err)
-	}
-	defer db.CloseRows(clusterRows)
-
-	for clusterRows.Next() {
-		var c models.CPUClusterMetric
-		var timestamp time.Time
-
-		if err := clusterRows.Scan(&timestamp, &c.AgentID, &c.HostID, &c.Name, &c.FrequencyHz); err != nil {
-			db.logger.Error().Err(err).Msg("Error scanning CPU cluster metric row")
-			continue
-		}
-		key := timestamp.Truncate(time.Second)
-		c.Timestamp = timestamp
-		clustersByTimestamp[key] = append(clustersByTimestamp[key], c)
-	}
-
-	if err := clusterRows.Err(); err != nil {
-		db.logger.Error().Err(err).Msg("Error iterating CPU cluster metric rows")
-
-		return nil, err
-	}
-
-	result := make([]models.SysmonCPUResponse, 0, len(data))
-
-	for ts, cpus := range data {
-		result = append(result, models.SysmonCPUResponse{
-			Cpus:      cpus,
-			Clusters:  clustersByTimestamp[ts],
-			Timestamp: cpus[0].Timestamp,
-		})
-	}
-
-	return result, nil
+for ts, cpus := range data {
+	truncated := ts.Truncate(time.Second)
+	result = append(result, models.SysmonCPUResponse{
+		Cpus:      cpus,
+		Clusters:  clustersByTimestamp[ts],
+		Timestamp: truncated,
+	})
 }

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 7


Why: The suggestion correctly identifies a subtle bug where inconsistent timestamp handling could lead to data misalignment between CPU and cluster metrics.

Medium

Previous suggestions

Suggestions up to commit 86685fd
Category | Suggestion | Impact
Possible issue
Preserve original host identity fields

Avoid overwriting host_id and host_ip in the service message. Instead, store the
address derived from check.Details in separate auxiliary fields to preserve the
original agent-reported identity.

pkg/poller/agent_poller.go [272-325]

 func enrichServiceMessageWithAddress(message []byte, check Check) []byte {
 	if len(message) == 0 || check.Type != "grpc" || check.Details == "" {
 		return message
 	}
 
 	hostCandidate := strings.TrimSpace(check.Details)
 	host, _, err := net.SplitHostPort(hostCandidate)
 	if err != nil {
 		host = hostCandidate
 	}
 
+	// Only attach auxiliary info if the candidate parses as an IP
 	if host == "" || net.ParseIP(host) == nil {
 		return message
 	}
 
 	var payload map[string]any
 	if err := json.Unmarshal(message, &payload); err != nil {
 		return message
 	}
 
-	statusNode, ok := payload["status"].(map[string]any)
-	if !ok || statusNode == nil {
+	statusNode, _ := payload["status"].(map[string]any)
+	if statusNode == nil {
 		statusNode = make(map[string]any)
 	}
 
-	getString := func(key string) string {
-		if raw, exists := statusNode[key]; exists {
-			if str, ok := raw.(string); ok {
-				return strings.TrimSpace(str)
-			}
+	// Preserve existing fields; add auxiliary fields without overwriting originals
+	if existing, ok := statusNode["host_ip"]; ok && existing != nil {
+		// keep original, add derived separately if different
+		if s, ok := existing.(string); ok && !strings.EqualFold(strings.TrimSpace(s), host) {
+			statusNode["derived_host_ip"] = host
 		}
-
-		return ""
+	} else {
+		// no original present; still avoid writing as canonical field
+		statusNode["derived_host_ip"] = host
 	}
 
-	if hostIP := getString("host_ip"); hostIP != "" && !strings.EqualFold(hostIP, host) {
-		statusNode["reported_host_ip"] = hostIP
+	// Do not overwrite host_id; expose derived host_id from address only as auxiliary
+	if _, ok := statusNode["host_id"]; !ok {
+		statusNode["derived_host_id"] = host
 	}
-	statusNode["host_ip"] = host
-
-	if hostID := getString("host_id"); hostID != "" && !strings.EqualFold(hostID, host) {
-		statusNode["reported_host_id"] = hostID
-	}
-	statusNode["host_id"] = host
 
 	payload["status"] = statusNode
-
 	enriched, err := json.Marshal(payload)
 	if err != nil {
 		return message
 	}
-
 	return enriched
 }
Suggestion importance[1-10]: 9


Why: The suggestion correctly identifies a critical data integrity issue where agent-reported identity (host_ip, host_id) is overwritten, which could lead to incorrect device correlation and data misattribution.

High
Validate device key IP input

Validate that the extracted hostIP is a valid IP address before using it to
construct a deviceID to prevent creating invalid device records from hostnames
or other strings.

pkg/core/devices.go [31-62]

-func (s *Server) ensureServiceDevice(
-	ctx context.Context,
-	agentID, pollerID, partition string,
-	svc *proto.ServiceStatus,
-	serviceData json.RawMessage,
-	timestamp time.Time,
-) {
-	if svc == nil {
-		return
-	}
+hostIP, hostname, hostID := extractCheckerHostIdentity(serviceData)
+hostIP = normalizeHostIdentifier(hostIP)
+// Require a valid IP address to avoid using hostnames/strings as device keys
+if ip := net.ParseIP(hostIP); ip == nil {
+	return
+}
+if hostIP == "" || strings.EqualFold(hostIP, "unknown") {
+	return
+}
 
-	// Only gRPC checkers embed host context in a way we can reason about; skip other service types.
-	if svc.ServiceType != grpcServiceType {
-		return
-	}
+if partition == "" {
+	partition = "default"
+}
 
-	// Ignore result streams such as sync/multi-part responses; they set Source to "results".
-	if svc.Source != "" && !strings.EqualFold(svc.Source, "getstatus") {
-		return
-	}
+deviceID := fmt.Sprintf("%s:%s", partition, hostIP)
 
-	hostIP, hostname, hostID := extractCheckerHostIdentity(serviceData)
-	hostIP = normalizeHostIdentifier(hostIP)
-	if hostIP == "" || strings.EqualFold(hostIP, "unknown") {
-		return
-	}
-
-	if partition == "" {
-		partition = "default"
-	}
-
-	deviceID := fmt.Sprintf("%s:%s", partition, hostIP)
-
Suggestion importance[1-10]: 8


Why: The suggestion correctly identifies that a non-IP string could be used in deviceID, leading to invalid device records. Enforcing IP validation for the device key is a crucial fix for data integrity.

Medium
Add indexes for new cluster fields
Suggestion Impact: The commit added bloom filter indexes on cpu_metrics for cluster and label, matching the suggestion. It also added an index for cluster on cpu_cluster_metrics and adjusted ORDER BY clauses.

code diff:

+ALTER STREAM cpu_metrics
+  ADD INDEX IF NOT EXISTS idx_cluster cluster TYPE bloom_filter GRANULARITY 1;
+ALTER STREAM cpu_metrics
+  ADD INDEX IF NOT EXISTS idx_label label TYPE bloom_filter GRANULARITY 1;

Add bloom filter indexes to the cpu_metrics stream for the new cluster and label
fields to improve query performance.

pkg/db/migrations/00000000000001_consolidated_serviceradar_schema.up.sql [209-222]

 CREATE STREAM IF NOT EXISTS cpu_metrics (
     timestamp         DateTime64(3),
     poller_id         string,
     agent_id          string,
     host_id           string,
     core_id           int32,
     usage_percent     float64,
     frequency_hz      float64,
     label             string,
     cluster           string,
     device_id         string,
     partition         string
 ) ENGINE = Stream(1, 1, rand())
 PARTITION BY int_div(to_unix_timestamp(timestamp), 3600)
 ORDER BY (timestamp, device_id, host_id, core_id)
 TTL to_start_of_day(coalesce(timestamp, _tp_time)) + INTERVAL 3 DAY
 SETTINGS index_granularity = 8192;
 
+ALTER STREAM cpu_metrics
+  ADD INDEX IF NOT EXISTS idx_cluster cluster TYPE bloom_filter GRANULARITY 1;
+ALTER STREAM cpu_metrics
+  ADD INDEX IF NOT EXISTS idx_label label TYPE bloom_filter GRANULARITY 1;
+
Suggestion importance[1-10]: 8


Why: This is a valid and important performance optimization. Adding bloom filter indexes on the new cluster and label fields is crucial for efficient querying, especially since these fields are used for filtering and grouping in the UI, preventing full stream scans on cpu_metrics.

Medium
Optimize sort key and index for clusters
Suggestion Impact: The commit reordered the cpu_cluster_metrics ORDER BY to (timestamp, cluster, host_id, device_id) and added a bloom filter index on cluster for that stream, matching the suggestion's intent.

code diff:

-ORDER BY (timestamp, device_id, host_id, cluster)
+ORDER BY (timestamp, cluster, host_id, device_id)
 TTL to_start_of_day(coalesce(timestamp, _tp_time)) + INTERVAL 3 DAY
 SETTINGS index_granularity = 8192;
 
+ALTER STREAM cpu_cluster_metrics
+  ADD INDEX IF NOT EXISTS idx_cluster cluster TYPE bloom_filter GRANULARITY 1;

Optimize the cpu_cluster_metrics stream by reordering the primary key to
prioritize cluster and adding a bloom filter index on it.

pkg/db/migrations/00000000000001_consolidated_serviceradar_schema.up.sql [224-237]

 CREATE STREAM IF NOT EXISTS cpu_cluster_metrics (
     timestamp         DateTime64(3),
     poller_id         string,
     agent_id          string,
     host_id           string,
     cluster           string,
     frequency_hz      float64,
     device_id         string,
     partition         string
 ) ENGINE = Stream(1, 1, rand())
 PARTITION BY int_div(to_unix_timestamp(timestamp), 3600)
-ORDER BY (timestamp, device_id, host_id, cluster)
+ORDER BY (timestamp, cluster, host_id, device_id)
 TTL to_start_of_day(coalesce(timestamp, _tp_time)) + INTERVAL 3 DAY
 SETTINGS index_granularity = 8192;
 
+ALTER STREAM cpu_cluster_metrics
+  ADD INDEX IF NOT EXISTS idx_cluster cluster TYPE bloom_filter GRANULARITY 1;
+

[Suggestion processed]

Suggestion importance[1-10]: 8


Why: This suggestion correctly identifies a performance issue in the new cpu_cluster_metrics stream. Reordering the primary key to (timestamp, cluster, ...) and adding a bloom filter index on cluster will significantly improve the performance of common UI queries that filter by cluster.

Medium
Add safe db type assertion

Add a safe type assertion when accessing s.dbService to prevent a potential
panic if the underlying type is not db.DB.

pkg/core/api/sysmon.go [459]

-clusterRows, err := s.dbService.(*db.DB).Conn.Query(ctx, clusterQuery)
+var (
+	dbImpl *db.DB
+	ok     bool
+)
+if dbImpl, ok = s.dbService.(*db.DB); !ok || dbImpl == nil || dbImpl.Conn == nil {
+	return nil, fmt.Errorf("database service unavailable for cluster query")
+}
+clusterRows, err := dbImpl.Conn.Query(ctx, clusterQuery)
Suggestion importance[1-10]: 7


Why: The suggestion correctly points out a potential panic from an unsafe type assertion, improving the code's robustness and preventing a server crash, which is a valuable improvement for production stability.

Medium
Add defensive null checks
Suggestion Impact: The commit added a defensive check ensuring serviceData exists before accessing its properties and provided a nullish coalescing fallback for serviceData.details.

code diff:

+        if (
+            serviceData &&
+            (serviceData.service_name === "sysmon-vm" || serviceData.name === "sysmon-vm")
+        ) {
             return (
                 <SysmonVmDetails
                     service={serviceData}
-                    details={serviceData.details}
+                    details={serviceData.details ?? {}}

Add null checks for the serviceData object and its details property to prevent
potential runtime errors during rendering.

web/src/components/Service/Dashboard.tsx [280-287]

-if (serviceData.service_name === 'sysmon-vm' || serviceData.name === 'sysmon-vm') {
+if (serviceData && (serviceData.service_name === 'sysmon-vm' || serviceData.name === 'sysmon-vm')) {
     return (
         <SysmonVmDetails
             service={serviceData}
-            details={serviceData.details}
+            details={serviceData.details ?? {}}
         />
     );
 }

[Suggestion processed]

Suggestion importance[1-10]: 7


Why: The suggestion correctly identifies a potential null pointer exception if serviceData is not yet loaded, improving the component's robustness against runtime errors.

Medium
Clamp and sanitize metric value
Suggestion Impact: The commit added safeMax/rawValue/clampedValue and replaced uses of current with clampedValue and max with safeMax across StatusIndicator, ValueDisplay, and ProgressBar.

code diff:

+    const safeMax = Number.isFinite(max) && max > 0 ? max : 100;
+    const rawValue = Number.isFinite(current) ? current : 0;
+    const clampedValue = Math.min(Math.max(rawValue, 0), safeMax);
+
     return (
         <div className="bg-white dark:bg-gray-800 rounded-lg p-3 shadow transition-colors">
             <div className="flex justify-between items-center mb-2">
@@ -129,23 +133,23 @@
                     <span className="text-sm font-medium text-gray-600 dark:text-gray-300">{title}</span>
                 </div>
                 <StatusIndicator
-                    value={current}
+                    value={clampedValue}
                     warning={warning}
                     critical={critical}
                 />
             </div>
             <ValueDisplay
-                value={current}
+                value={clampedValue}
                 unit={unit}
                 warning={warning}
                 critical={critical}
                 change={change}
             />
             <ProgressBar
-                value={current}
+                value={clampedValue}
                 warning={warning}
                 critical={critical}
-                max={max}
+                max={safeMax}
             />

Sanitize the current and max props in the MetricCard component by clamping the
value to a safe range [0, max] to prevent UI issues.

web/src/components/Metrics/shared-components.jsx [123-153]

 export const MetricCard = ({ title, current, unit, warning, critical, change, icon, children, max = 100 }) => {
+    const safeMax = Number.isFinite(max) && max > 0 ? max : 100;
+    const rawValue = Number.isFinite(current) ? current : 0;
+    const clampedValue = Math.min(Math.max(rawValue, 0), safeMax);
     return (
         <div className="bg-white dark:bg-gray-800 rounded-lg p-3 shadow transition-colors">
             <div className="flex justify-between items-center mb-2">
                 <div className="flex items-center space-x-2">
                     {icon && <div className="text-gray-500 dark:text-gray-300">{icon}</div>}
                     <h3 className="text-sm font-medium text-gray-900 dark:text-white">{title}</h3>
                 </div>
                 {typeof change !== 'undefined' && (
                     <span className={`text-xs ${change >= 0 ? 'text-green-600' : 'text-red-600'}`}>
                         {change >= 0 ? '+' : ''}{change}%
                     </span>
                 )}
             </div>
             <MetricProgressBar
-                value={current}
+                value={clampedValue}
                 warning={warning}
                 critical={critical}
-                max={max}
+                max={safeMax}
             />
             {children}
         </div>
     );
 };

[Suggestion processed]

Suggestion importance[1-10]: 7

__

Why: This suggestion improves the component's resilience by sanitizing the current and max props, preventing potential UI bugs or crashes from invalid data.

Medium
Incremental [*]
Normalize timestamps for joining
Suggestion Impact: The commit implemented truncating timestamps to second-resolution for both CPU metrics and cluster metrics map keys, and adjusted the response timestamp to use the original metric timestamp.

code diff:

+		key := timestamp.Truncate(time.Second)
 		m.Timestamp = timestamp
 		m.AgentID = agentID
 		m.HostID = hostID
 		m.Label = label
 		m.Cluster = cluster
-		data[timestamp] = append(data[timestamp], m)
+		data[key] = append(data[key], m)
 	}
 
 	if err := rows.Err(); err != nil {
@@ -888,9 +888,9 @@
 			db.logger.Error().Err(err).Msg("Error scanning CPU cluster metric row")
 			continue
 		}
-
+		key := timestamp.Truncate(time.Second)
 		c.Timestamp = timestamp
-		clustersByTimestamp[timestamp] = append(clustersByTimestamp[timestamp], c)
+		clustersByTimestamp[key] = append(clustersByTimestamp[key], c)
 	}
 
 	if err := clusterRows.Err(); err != nil {
@@ -905,7 +905,7 @@
 		result = append(result, models.SysmonCPUResponse{
 			Cpus:      cpus,
 			Clusters:  clustersByTimestamp[ts],
-			Timestamp: ts,
+			Timestamp: cpus[0].Timestamp,
 		})

In GetAllCPUMetrics, normalize timestamps by truncating them to the second
before using them as map keys to ensure correct correlation between core and
cluster metrics.

pkg/db/metrics.go [826-912]

 func (db *DB) GetAllCPUMetrics(
 	ctx context.Context, pollerID string, start, end time.Time) ([]models.SysmonCPUResponse, error) {
 	rows, err := db.Conn.Query(ctx, `
 		SELECT timestamp, agent_id, host_id, core_id, usage_percent, frequency_hz, label, cluster
 		FROM table(cpu_metrics)
 		WHERE poller_id = $1 AND timestamp BETWEEN $2 AND $3
 		ORDER BY timestamp DESC, core_id ASC`,
 		pollerID, start, end)
 	if err != nil {
 		return nil, fmt.Errorf("failed to query CPU metrics: %w", err)
 	}
 	defer db.CloseRows(rows)
 
 	data := make(map[time.Time][]models.CPUMetric)
 	clustersByTimestamp := make(map[time.Time][]models.CPUClusterMetric)
 
 	for rows.Next() {
 		var m models.CPUMetric
-
 		var agentID, hostID, label, cluster string
-
-		var timestamp time.Time
-
-		if err := rows.Scan(&timestamp, &agentID, &hostID, &m.CoreID, &m.UsagePercent, &m.FrequencyHz, &label, &cluster); err != nil {
+		var ts time.Time
+		if err := rows.Scan(&ts, &agentID, &hostID, &m.CoreID, &m.UsagePercent, &m.FrequencyHz, &label, &cluster); err != nil {
 			db.logger.Error().Err(err).Msg("Error scanning CPU metric row")
 			continue
 		}
-
-		m.Timestamp = timestamp
+		// Normalize key to seconds to ensure joinability
+		key := ts.Truncate(time.Second)
+		m.Timestamp = ts
 		m.AgentID = agentID
 		m.HostID = hostID
 		m.Label = label
 		m.Cluster = cluster
-		data[timestamp] = append(data[timestamp], m)
+		data[key] = append(data[key], m)
 	}
-
 	if err := rows.Err(); err != nil {
 		db.logger.Error().Err(err).Msg("Error iterating CPU metric rows")
-
 		return nil, err
 	}
 
 	clusterRows, err := db.Conn.Query(ctx, `
 		SELECT timestamp, agent_id, host_id, cluster, frequency_hz
 		FROM table(cpu_cluster_metrics)
 		WHERE poller_id = $1 AND timestamp BETWEEN $2 AND $3
 		ORDER BY timestamp DESC, cluster ASC`,
 		pollerID, start, end)
 	if err != nil {
 		db.logger.Error().Err(err).Msg("Error querying CPU cluster metrics")
-
 		return nil, fmt.Errorf("failed to query CPU cluster metrics: %w", err)
 	}
 	defer db.CloseRows(clusterRows)
 
 	for clusterRows.Next() {
 		var c models.CPUClusterMetric
-		var timestamp time.Time
-
-		if err := clusterRows.Scan(&timestamp, &c.AgentID, &c.HostID, &c.Name, &c.FrequencyHz); err != nil {
+		var ts time.Time
+		if err := clusterRows.Scan(&ts, &c.AgentID, &c.HostID, &c.Name, &c.FrequencyHz); err != nil {
 			db.logger.Error().Err(err).Msg("Error scanning CPU cluster metric row")
 			continue
 		}
-
-		c.Timestamp = timestamp
-		clustersByTimestamp[timestamp] = append(clustersByTimestamp[timestamp], c)
+		key := ts.Truncate(time.Second)
+		c.Timestamp = ts
+		clustersByTimestamp[key] = append(clustersByTimestamp[key], c)
 	}
-
 	if err := clusterRows.Err(); err != nil {
 		db.logger.Error().Err(err).Msg("Error iterating CPU cluster metric rows")
-
 		return nil, err
 	}
 
 	result := make([]models.SysmonCPUResponse, 0, len(data))
-
-	for ts, cpus := range data {
+	for tsKey, cpus := range data {
 		result = append(result, models.SysmonCPUResponse{
 			Cpus:      cpus,
-			Clusters:  clustersByTimestamp[ts],
-			Timestamp: ts,
+			Clusters:  clustersByTimestamp[tsKey],
+			Timestamp: cpus[0].Timestamp, // original timestamp for display
 		})
 	}
-
 	return result, nil
 }
Suggestion importance[1-10]: 8

__

Why: The suggestion correctly identifies a critical Go pitfall of using time.Time as a map key, which would cause data to be mis-associated, and provides the correct fix by truncating the timestamp.

Medium
Guard optional nested data
Suggestion Impact: The commit updated the component to use optional chaining and array checks: wrapped cpuFrequency and memory with data?., changed cpu.clusters and cpu.cores and disk.drives to Array.isArray checks, matching the suggested guards.

code diff:

-                        {data.cpu?.clusters && data.cpu.clusters.length > 0 && (
+                        {Array.isArray(data?.cpu?.clusters) && data.cpu.clusters.length > 0 && (
                             <CpuClusterDetails clusters={data.cpu.clusters} />
                         )}
-                        {data.cpuFrequency && <CpuFrequencyDetails data={data.cpuFrequency} />}
-                        <CpuCoresChart cores={data.cpu.cores} />
-                        <FilesystemDetails drives={data.disk.drives} />
-                        <MemoryDetails data={data.memory} />
+                        {data?.cpuFrequency && <CpuFrequencyDetails data={data.cpuFrequency} />}
+                        {Array.isArray(data?.cpu?.cores) && <CpuCoresChart cores={data.cpu.cores} />}
+                        {Array.isArray(data?.disk?.drives) && <FilesystemDetails drives={data.disk.drives} />}
+                        {data?.memory && <MemoryDetails data={data.memory} />}

Add null-safe guards for data.cpu.cores, data.disk.drives, and data.memory to
prevent the application from crashing if these properties are accessed before
they are loaded.

web/src/components/Metrics/system-metrics.jsx [288-296]

     {activeTab === 'details' && (
         <div className="space-y-6">
-            {data.cpu?.clusters && data.cpu.clusters.length > 0 && (
+            {Array.isArray(data?.cpu?.clusters) && data.cpu.clusters.length > 0 && (
                 <CpuClusterDetails clusters={data.cpu.clusters} />
             )}
-            {data.cpuFrequency && <CpuFrequencyDetails data={data.cpuFrequency} />}
-            <CpuCoresChart cores={data.cpu.cores} />
-            <FilesystemDetails drives={data.disk.drives} />
-            <MemoryDetails data={data.memory} />
+            {data?.cpuFrequency && <CpuFrequencyDetails data={data.cpuFrequency} />}
+            {Array.isArray(data?.cpu?.cores) && <CpuCoresChart cores={data.cpu.cores} />}
+            {Array.isArray(data?.disk?.drives) && <FilesystemDetails drives={data.disk.drives} />}
+            {data?.memory && <MemoryDetails data={data.memory} />}

[Suggestion processed]

Suggestion importance[1-10]: 8

__

Why: This suggestion correctly identifies missing null-safe guards for nested properties like data.cpu.cores, preventing a likely runtime crash when data is partially loaded, which is a critical fix for UI stability.

Medium
Include cluster in sorting key
Suggestion Impact: The commit updated cpu_metrics to ORDER BY (timestamp, device_id, host_id, cluster, core_id), adding cluster as suggested. It also added bloom filter indexes and adjusted ordering for another stream.

code diff:

+ORDER BY (timestamp, device_id, host_id, cluster, core_id)
+ TTL to_start_of_day(coalesce(timestamp, _tp_time)) + INTERVAL 3 DAY
+ SETTINGS index_granularity = 8192;

Add the new cluster field to the ORDER BY clause of the cpu_metrics stream. This
will improve query performance and data locality for cluster-based aggregations.

pkg/db/migrations/00000000000001_consolidated_serviceradar_schema.up.sql [209-222]

 CREATE STREAM IF NOT EXISTS cpu_metrics (
     timestamp         DateTime64(3),
     poller_id         string,
     agent_id          string,
     host_id           string,
     core_id           int32,
     usage_percent     float64,
     frequency_hz      float64,
     label             string,
     cluster           string,
     device_id         string,
     partition         string
 ) ENGINE = Stream(1, 1, rand())
 PARTITION BY int_div(to_unix_timestamp(timestamp), 3600)
-ORDER BY (timestamp, device_id, host_id, core_id)
+ORDER BY (timestamp, device_id, host_id, cluster, core_id)
 TTL to_start_of_day(coalesce(timestamp, _tp_time)) + INTERVAL 3 DAY
 SETTINGS index_granularity = 8192;
Suggestion importance[1-10]: 7

__

Why: This is a valid and important performance optimization for the cpu_metrics stream, as including cluster in the ORDER BY clause will improve data locality and query efficiency for cluster-based aggregations.

Medium
General
Remove hardcoded private IP
Suggestion Impact: The hardcoded IP address was replaced with an environment-variable-based address: "${SYSMON_VM_ADDRESS:-sysmon-vm:50110}".

code diff:

-  "address": "192.168.1.219:50110",
+  "address": "${SYSMON_VM_ADDRESS:-sysmon-vm:50110}",

Replace the hardcoded IP address in sysmon-vm.checker.json with a configurable
value, such as an environment variable, to improve deployment flexibility.

docker/compose/sysmon-vm.checker.json [1-9]

 {
   "name": "sysmon-vm",
   "type": "grpc",
-  "address": "192.168.1.219:50110",
+  "address": "${SYSMON_VM_ADDRESS:-sysmon-vm:50110}",
   "security": {
     "mode": "none",
     "role": "agent"
   }
 }

[Suggestion processed]

Suggestion importance[1-10]: 8

__

Why: The suggestion correctly points out that a hardcoded IP address harms portability and should be replaced with a configurable value, which is a critical best practice for deployment configurations.

Medium
Align Proton memory to container limits
Suggestion Impact: The commit added the suggested Proton memory environment variables (PROTON_MAX_SERVER_MEMORY, PROTON_MAX_QUERY_MEMORY, PROTON_MAX_PARTITION_BYTES) along with a comment explaining the alignment with container limits.

code diff:

+      # Align Proton memory usage with container limits to reduce OOM risk
+      - PROTON_MAX_SERVER_MEMORY=6g
+      - PROTON_MAX_QUERY_MEMORY=4g
+      - PROTON_MAX_PARTITION_BYTES=256m

Set explicit memory limits for the Proton service via environment variables to
align with container memory limits and prevent OOM kills.

docker-compose.yml [72-92]

 services:
   proton:
     image: ghcr.io/carverauto/serviceradar-proton:latest
     container_name: serviceradar-proton
     user: "0:0"
     mem_limit: 8g
     mem_reservation: 6g
     ports:
       - "8123:8123"
       - "8463:8463"
     environment:
       - PROTON_PASSWORD=${PROTON_PASSWORD:-}
       - PROTON_LOG_LEVEL=${PROTON_LOG_LEVEL:-error}
       - PROTON_DISABLE_TASKSTATS=1
+      # Align Proton internal memory to container cgroup limits to prevent OOM
+      - PROTON_MAX_SERVER_MEMORY=6g
+      - PROTON_MAX_QUERY_MEMORY=4g
+      - PROTON_MAX_PARTITION_BYTES=256m
Suggestion importance[1-10]: 7

__

Why: The suggestion correctly points out a potential for OOM kills by aligning Proton's internal memory settings with the container's memory limits. This enhances the stability of the proton service under load, making it a valuable configuration improvement.

Medium
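A quick way to sanity-check that settings like PROTON_MAX_SERVER_MEMORY=6g actually fit under mem_limit: 8g is to parse the size strings and compare. A sketch with an illustrative parser (compose and Proton have their own, stricter parsers):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMemLimit converts compose-style sizes like "6g", "256m", or "8192"
// (plain bytes) into a byte count, using binary (1024-based) multipliers.
func parseMemLimit(s string) (int64, error) {
	s = strings.ToLower(strings.TrimSpace(s))
	mult := int64(1)
	switch {
	case strings.HasSuffix(s, "k"):
		mult, s = 1<<10, strings.TrimSuffix(s, "k")
	case strings.HasSuffix(s, "m"):
		mult, s = 1<<20, strings.TrimSuffix(s, "m")
	case strings.HasSuffix(s, "g"):
		mult, s = 1<<30, strings.TrimSuffix(s, "g")
	}
	n, err := strconv.ParseInt(s, 10, 64)
	if err != nil {
		return 0, err
	}
	return n * mult, nil
}

func main() {
	server, _ := parseMemLimit("6g")    // PROTON_MAX_SERVER_MEMORY
	container, _ := parseMemLimit("8g") // mem_limit
	fmt.Println(server < container)     // true: server budget fits the cgroup limit
}
```

Keeping the server budget a couple of gigabytes under the cgroup limit leaves headroom for page cache and ancillary processes, which is the point of the suggestion.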
Suggestions up to commit fddabde
High-level
Database migration strategy breaks existing deployments

The PR replaces the incremental database migration system with a single schema
file that only runs on empty databases. This breaks the upgrade path for
existing deployments, as there is no way to apply new schema changes to
databases that already contain data.

Examples:

pkg/db/migrate.go [35-75]
// RunMigrations applies the consolidated database schema when no tables exist yet.
func RunMigrations(ctx context.Context, conn proton.Conn, log logger.Logger) error {
	log.Info().Msg("Ensuring database schema is applied")

	applied, err := schemaAlreadyApplied(ctx, conn)
	if err != nil {
		return fmt.Errorf("failed to inspect existing schema: %w", err)
	}

	if applied {

 ... (clipped 31 lines)

Solution Walkthrough:

Before:

// In pkg/db/migrate.go
func RunMigrations(ctx, conn) error {
    createMigrationsTable(ctx, conn) // Ensure tracking table exists
    appliedMigrations := getAppliedMigrations(ctx, conn)
    availableMigrations := getAvailableMigrations() // from *.up.sql files

    for _, migrationFile := range availableMigrations {
        if !isApplied(migrationFile, appliedMigrations) {
            content := readMigrationFile(migrationFile)
            executeMigration(ctx, conn, content)
            recordMigration(ctx, conn, migrationFile)
        }
    }
    return nil
}

After:

// In pkg/db/migrate.go
const schemaFile = "migrations/schema.sql"

func RunMigrations(ctx, conn) error {
    applied, err := schemaAlreadyApplied(ctx, conn)
    if applied {
        log("Schema already present; skipping apply")
        return nil
    }

    // If not applied, run the whole schema file
    content := readSchemaFile(schemaFile)
    statements := splitSQLStatements(content)
    for _, stmt := range statements {
        conn.Exec(ctx, stmt)
    }
    log("Database schema applied successfully")
    return nil
}

func schemaAlreadyApplied(ctx, conn) (bool, error) {
    // SELECT count() FROM system.tables WHERE ... name = 'device_updates'
    // return count > 0
}

Suggestion importance[1-10]: 10

__

Why: The suggestion correctly identifies a critical flaw where the new migration strategy removes the upgrade path for existing databases, which is a breaking change for all current deployments.

High
Possible issue
Fix missing target stream clauses

Add the missing INTO clause to the ocsf_devices_aggregator_mv and
ocsf_users_aggregator_mv materialized views to specify their target streams and
fix a syntax error.

pkg/db/migrations/schema.sql [1298-1408]

 -- OCSF Materialized Views Migration
 -- These views automatically aggregate event data into entity state and observable indexes
 -- Enables real-time updates from event streams to current state streams
 
 -- Device State Aggregation
 -- Populates ocsf_devices_current from device inventory events
 DROP VIEW IF EXISTS ocsf_devices_aggregator_mv;
-CREATE MATERIALIZED VIEW ocsf_devices_aggregator_mv AS
+CREATE MATERIALIZED VIEW ocsf_devices_aggregator_mv
+INTO ocsf_devices_current
+AS
 SELECT
     device_uid,
     time AS last_seen,
 ...
 FROM ocsf_device_inventory;
 
 -- User State Aggregation
 -- Populates ocsf_users_current from user inventory events
 DROP VIEW IF EXISTS ocsf_users_aggregator_mv;
-CREATE MATERIALIZED VIEW ocsf_users_aggregator_mv AS
+CREATE MATERIALIZED VIEW ocsf_users_aggregator_mv
+INTO ocsf_users_current
+AS
 SELECT
     user_uid,
     time AS last_seen,
 ...
 FROM ocsf_user_inventory;
Suggestion importance[1-10]: 10

__

Why: The suggestion correctly identifies a critical syntax error in two CREATE MATERIALIZED VIEW statements that are missing the required INTO clause. This would cause the SQL migration to fail, so the fix is essential for correctness.

High
Fix view definition and function name
Suggestion Impact:The committed diff updated the ocsf_observable_device_macs_mv definition to include the INTO ocsf_observable_index clause and corrected the function call to replaceAll for normalizing MAC addresses.

code diff:

--- Observable Index Population - Device MACs
--- Index MAC addresses from device inventory
-DROP VIEW IF EXISTS ocsf_observable_device_macs_mv;
-CREATE MATERIALIZED VIEW ocsf_observable_device_macs_mv AS
-SELECT
-    'mac_address' AS observable_type,
-    mac AS observable_value,
-    lower(replace_all(mac, ':', '')) AS observable_value_normalized,
-    'device' AS entity_class,
-    device_uid AS entity_uid,
-    time AS entity_last_seen,
-    'device.mac' AS entity_path,
-    confidence_level / 4.0 AS confidence_score,
-    discovery_source,
-    time,
-    agent_id,
-    poller_id,
-
-    -- Enrichments
-    '' AS geo_country,
-    '' AS geo_region,
-    '' AS geo_city,
-    0 AS asn_number,
-    '' AS asn_org,
-    0.0 AS threat_score,
-    CAST([] AS array(string)) AS threat_categories,
-    CAST([] AS array(string)) AS threat_sources,
-    'hardware' AS observable_category,
-    CAST([] AS array(string)) AS tags,
-    metadata
-
-FROM ocsf_device_inventory
-ARRAY JOIN device_mac AS mac
-WHERE mac != '';

Fix the ocsf_observable_device_macs_mv view by adding the missing INTO
ocsf_observable_index clause and correcting the function name from replace_all
to replaceAll.

pkg/db/migrations/schema.sql [1452-1485]

 -- Observable Index Population - Device MACs
 -- Index MAC addresses from device inventory
 DROP VIEW IF EXISTS ocsf_observable_device_macs_mv;
-CREATE MATERIALIZED VIEW ocsf_observable_device_macs_mv AS
+CREATE MATERIALIZED VIEW ocsf_observable_device_macs_mv
+INTO ocsf_observable_index
+AS
 SELECT
     'mac_address' AS observable_type,
     mac AS observable_value,
-    lower(replace_all(mac, ':', '')) AS observable_value_normalized,
+    lower(replaceAll(mac, ':', '')) AS observable_value_normalized,
     'device' AS entity_class,
     device_uid AS entity_uid,
 ...
 FROM ocsf_device_inventory
 ARRAY JOIN device_mac AS mac
 WHERE mac != '';
Suggestion importance[1-10]: 10

__

Why: The suggestion correctly identifies two critical errors in the ocsf_observable_device_macs_mv view definition: a missing INTO clause and an incorrect function name (replace_all instead of replaceAll). Both are syntax errors that would cause the migration to fail.

High
Avoid holding lock during network call

Refactor reconnectCore to release the mutex lock before making the blocking
connectToCore network call to avoid contention.

pkg/poller/poller.go [481-497]

 func (p *Poller) reconnectCore(ctx context.Context) error {
 	reconnectCtx, cancel := context.WithTimeout(ctx, 15*time.Second)
 	defer cancel()
 
 	p.mu.Lock()
-	defer p.mu.Unlock()
-
 	if p.grpcClient != nil {
+		// Close the existing client while holding the lock to prevent its use.
 		if err := p.grpcClient.Close(); err != nil {
 			p.logger.Warn().Err(err).Msg("Failed to close existing core client during reconnect")
 		}
 		p.grpcClient = nil
 		p.coreClient = nil
 	}
+	p.mu.Unlock() // Release the lock before the blocking network call.
 
-	return p.connectToCore(reconnectCtx)
+	// connectToCore should not set the fields on `p` directly.
+	// It should return the new clients.
+	grpcClient, coreClient, err := p.connectToCore(reconnectCtx)
+	if err != nil {
+		return err
+	}
+
+	// Re-acquire the lock to safely update the shared fields.
+	p.mu.Lock()
+	defer p.mu.Unlock()
+	p.grpcClient = grpcClient
+	p.coreClient = coreClient
+
+	return nil
 }
Suggestion importance[1-10]: 8

__

Why: The suggestion correctly identifies a significant concurrency issue where a mutex is held during a blocking network call, which can lead to performance degradation or deadlocks.

Medium
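The lock/release/re-lock pattern the suggestion describes can be shown in isolation; a minimal sketch with a stand-in dial function (not the actual poller API — connectToCore would need to return the clients rather than set fields itself):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type client struct{ addr string }

// dial stands in for the blocking connectToCore network call; hypothetical.
func dial(addr string) (*client, error) {
	time.Sleep(10 * time.Millisecond) // simulate network latency
	return &client{addr: addr}, nil
}

type poller struct {
	mu sync.Mutex
	c  *client
}

// reconnect drops the old client under the lock, dials with the lock
// released, then re-acquires the lock only to publish the new client.
func (p *poller) reconnect(addr string) error {
	p.mu.Lock()
	p.c = nil // invalidate the old client while holding the lock
	p.mu.Unlock()

	c, err := dial(addr) // blocking call happens outside the lock
	if err != nil {
		return err
	}

	p.mu.Lock()
	defer p.mu.Unlock()
	p.c = c
	return nil
}

func main() {
	p := &poller{}
	if err := p.reconnect("core:50051"); err != nil {
		panic(err)
	}
	fmt.Println(p.c.addr) // core:50051
}
```

The trade-off is a brief window where other goroutines observe a nil client; callers must already tolerate that, since the old client was being torn down anyway.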
Replace hardcoded IP with service name
Suggestion Impact: The hardcoded IP was replaced with a configurable address defaulting to "sysmon-vm:50110", aligning with the suggestion's intent to use the service name.

code diff:

-  "address": "192.168.1.219:50110",
+  "address": "${SYSMON_VM_ADDRESS:-sysmon-vm:50110}",

In docker/compose/sysmon-vm.checker.json, replace the hardcoded IP address in
the address field with the service name sysmon-vm to ensure portability within
the Docker environment.

docker/compose/sysmon-vm.checker.json [4]

-"address": "192.168.1.219:50110",
+"address": "sysmon-vm:50110",
Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies a hardcoded IP address and recommends using the Docker service name, which is a best practice for portability and maintainability in a containerized environment.

Medium
Handle empty metric slices correctly
Suggestion Impact: The commit implemented logic so that a "no cpu metrics appended" error is only returned when appending fails, and otherwise returns a specific error constant. Although an explicit early return for empty cpus is not shown in the diff, the updated handling of the "no metrics appended" case aligns with the suggestion's intent to avoid misleading errors for empty input. Additionally, error messages were refactored to use constants.

code diff:

...
Imported GitHub PR comment. Original author: @qodo-code-review[bot] Original URL: https://github.com/carverauto/serviceradar/pull/1755#issuecomment-3395989674 Original created: 2025-10-13T06:09:32Z --- ## PR Code Suggestions ✨ <!-- 7a4be7b --> Latest suggestions up to 7a4be7b <table><thead><tr><td><strong>Category</strong></td><td align=left><strong>Suggestion&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </strong></td><td align=center><strong>Impact</strong></td></tr><tbody><tr><td rowspan=4>Incremental <sup><a href='https://qodo-merge-docs.qodo.ai/core-abilities/incremental_update/'>[*]</a></sup></td> <td> <details><summary>Prevent IP from overwriting host ID</summary> ___ **In <code>enrichServiceMessageWithAddress</code>, avoid overwriting <code>host_id</code> with an IP <br>address. 
Instead, add the resolved IP to a separate field like <code>resolved_host_ip</code> <br>to preserve the original host identifier.** [pkg/poller/agent_poller.go [272-326]](https://github.com/carverauto/serviceradar/pull/1755/files#diff-7cdaae416a1e7b5d39ad75f138c49fa0aca12fc38e8ab0b26bb59c64446da15cR272-R326) ```diff func enrichServiceMessageWithAddress(message []byte, check Check) []byte { if len(message) == 0 || check.Type != "grpc" || check.Details == "" { return message } hostCandidate := sanitizeTelemetryString(check.Details) host, _, err := net.SplitHostPort(hostCandidate) if err != nil { host = hostCandidate } - if host == "" || net.ParseIP(host) == nil { + if host == "" { return message } - safeHost := sanitizeTelemetryString(host) + var resolvedIP net.IP + if ip := net.ParseIP(host); ip != nil { + resolvedIP = ip + } + var payload map[string]any if err := json.Unmarshal(message, &payload); err != nil { return message } - statusNode, ok := payload["status"].(map[string]any) - if !ok || statusNode == nil { + statusNode, _ := payload["status"].(map[string]any) + if statusNode == nil { statusNode = make(map[string]any) } getString := func(key string) string { if raw, exists := statusNode[key]; exists { if str, ok := raw.(string); ok { return sanitizeTelemetryString(str) } } - return "" } - if hostIP := getString("host_ip"); hostIP != "" && !strings.EqualFold(hostIP, host) { - statusNode["reported_host_ip"] = hostIP + // If we parsed a valid IP from details, expose it without overwriting original host fields. + if resolvedIP != nil { + if hostIP := getString("host_ip"); hostIP != "" && !strings.EqualFold(hostIP, resolvedIP.String()) { + statusNode["reported_host_ip"] = hostIP + } + statusNode["resolved_host_ip"] = resolvedIP.String() } - statusNode["host_ip"] = safeHost - if hostID := getString("host_id"); hostID != "" && !strings.EqualFold(hostID, host) { + // Do not set host_id from network address. Preserve any provided host_id. 
+ if hostID := getString("host_id"); hostID != "" && resolvedIP != nil && !strings.EqualFold(hostID, resolvedIP.String()) { statusNode["reported_host_id"] = hostID } - statusNode["host_id"] = safeHost payload["status"] = statusNode - enriched, err := json.Marshal(payload) if err != nil { return message } - return enriched } ``` - [ ] **Apply / Chat** <!-- /improve --apply_suggestion=0 --> <details><summary>Suggestion importance[1-10]: 8</summary> __ Why: The suggestion correctly identifies a bug where `host_id` is incorrectly overwritten with an IP address, which corrupts telemetry data and can break device correlation logic. </details></details></td><td align=center>Medium </td></tr><tr><td> <details><summary>Optimize cpu_metrics sort key</summary> ___ **To optimize performance for the <code>cpu_metrics</code> stream, remove <code>cluster</code> from the <br><code>ORDER BY</code> clause and instead create a separate bloom filter index on it.** [pkg/db/migrations/00000000000001_consolidated_serviceradar_schema.up.sql [211-222]](https://github.com/carverauto/serviceradar/pull/1755/files#diff-1e05de747238f2112bb2230aac8db388e4c80eebe84c071eb78e035d64e67eb6R211-R222) ```diff CREATE STREAM IF NOT EXISTS cpu_metrics ( timestamp DateTime64(3), poller_id string, agent_id string, host_id string, core_id int32, usage_percent float64, frequency_hz float64, label string, cluster string, device_id string, partition string ) ENGINE = Stream(1, 1, rand()) PARTITION BY int_div(to_unix_timestamp(timestamp), 3600) -ORDER BY (timestamp, device_id, host_id, cluster, core_id) +ORDER BY (timestamp, device_id, host_id, core_id) TTL to_start_of_day(coalesce(timestamp, _tp_time)) + INTERVAL 3 DAY SETTINGS index_granularity = 8192; +ALTER STREAM cpu_metrics + ADD INDEX IF NOT EXISTS idx_cluster cluster TYPE bloom_filter GRANULARITY 1; + ``` `[To ensure code accuracy, apply this suggestion manually]` <details><summary>Suggestion importance[1-10]: 7</summary> __ Why: The suggestion correctly 
identifies a potential performance issue by adding `cluster` to the `ORDER BY` clause and proposes a valid optimization by using a separate index instead, which improves query performance and data ingestion. </details></details></td><td align=center>Medium </td></tr><tr><td> <details><summary>Block until config exists</summary> ___ **Prevent the <code>serviceradar-poller</code> from crashing on startup by modifying its <br>command to wait until the <code>poller.json</code> configuration file exists before <br>executing.** [docker-compose.yml [329-348]](https://github.com/carverauto/serviceradar/pull/1755/files#diff-e45e45baeda1c1e73482975a664062aa56f20c03dd9d64a827aba57775bed0d3R329-R348) ```diff serviceradar-poller: image: ghcr.io/carverauto/serviceradar-poller:latest container_name: serviceradar-poller volumes: - cert-data:/etc/serviceradar/certs:ro - generated-config:/etc/serviceradar/config:ro - poller-data:/var/lib/serviceradar - ./logs:/var/log/serviceradar environment: - # Config path - CONFIG_PATH=/etc/serviceradar/config/poller.json - # Logging - LOG_LEVEL=${LOG_LEVEL:-info} depends_on: cert-generator: condition: service_completed_successfully config-updater: condition: service_completed_successfully core: condition: service_healthy - command: ["/usr/local/bin/serviceradar-poller","-config","/etc/serviceradar/config/poller.json"] + command: ["sh","-lc","until [ -s /etc/serviceradar/config/poller.json ]; do echo '[poller] waiting for poller.json'; sleep 2; done; exec /usr/local/bin/serviceradar-poller -config /etc/serviceradar/config/poller.json"] ``` `[To ensure code accuracy, apply this suggestion manually]` <details><summary>Suggestion importance[1-10]: 7</summary> __ Why: The suggestion correctly identifies a race condition where the `serviceradar-poller` might start before its configuration file is ready, and proposes a robust shell command to wait for the file, preventing startup failures. 
</details></details></td><td align=center>Medium </td></tr><tr><td> <details><summary>Ensure secure, reliable secret persistence</summary> ___ **In <code>proton-init.sh</code>, ensure secret directories exist with secure permissions and <br>add error handling when writing the password file to prevent silent failures and <br>ensure credential persistence.** [docker/compose/proton-init.sh [62-69]](https://github.com/carverauto/serviceradar/pull/1755/files#diff-9f0e9459b12fc8cb2671d78218467129cbb18b110443a0799c7559cf7ba19d4eR62-R69) ```diff -if [ -z "$PROTON_PASSWORD" ]; then - # First preference: existing shared credential file - if [ -f /etc/serviceradar/credentials/proton-password ] && \ - [ -s /etc/serviceradar/credentials/proton-password ]; then - PROTON_PASSWORD=$(cat /etc/serviceradar/credentials/proton-password) - log_info "Reusing Proton password from shared credentials volume" - # Second preference: previously generated password inside Proton data dir - elif [ -f /etc/proton-server/generated_password.txt ] && \ - [ -s /etc/proton-server/generated_password.txt ]; then - PROTON_PASSWORD=$(cat /etc/proton-server/generated_password.txt) - log_info "Reusing Proton password from generated_password.txt" - else - log_info "Generating random password..." 
- PROTON_PASSWORD=$(openssl rand -hex 16) - log_info "Generated new Proton password" - fi +# Ensure directories exist with secure permissions before writing secrets +install -d -m 700 /etc/proton-server || { log_error "Cannot create /etc/proton-server"; exit 1; } +if [ -d "/etc/serviceradar" ]; then + install -d -m 700 /etc/serviceradar/credentials || { log_error "Cannot create /etc/serviceradar/credentials"; exit 1; } fi # Persist password to both expected locations to keep services aligned -echo "$PROTON_PASSWORD" > /etc/proton-server/generated_password.txt +echo "$PROTON_PASSWORD" > /etc/proton-server/generated_password.txt || { log_error "Failed to write Proton password"; exit 1; } chmod 600 /etc/proton-server/generated_password.txt if [ -d "/etc/serviceradar/credentials" ]; then - echo "$PROTON_PASSWORD" > /etc/serviceradar/credentials/proton-password - chmod 600 /etc/serviceradar/credentials/proton-password - log_info "Password synchronized to shared credentials volume" + echo "$PROTON_PASSWORD" > /etc/serviceradar/credentials/proton-password || { log_error "Failed to write shared Proton password"; exit 1; } + chmod 600 /etc/serviceradar/credentials/proton-password + log_info "Password synchronized to shared credentials volume" fi ``` `[To ensure code accuracy, apply this suggestion manually]` <details><summary>Suggestion importance[1-10]: 6</summary> __ Why: The suggestion improves the robustness of secret handling by adding explicit directory creation and error checking, which prevents silent failures and ensures password persistence. 
</details></details></td><td align=center>Low

</td></tr><tr><td rowspan=3>Possible issue</td>
<td>

<details><summary>Prevent device ID collisions and validate IP</summary>

___

**To prevent collisions, make the <code>deviceID</code> more unique by validating the <code>hostIP</code> and including a tie-breaker like <code>agentID</code> or <code>svc.ServiceName</code> in its generation.**

[pkg/core/devices.go [31-120]](https://github.com/carverauto/serviceradar/pull/1755/files#diff-fb7f5159558f22afdc809b9cbf5a9c0f71560be2f7389c944da8f1604998ea48R31-R120)

```diff
 func (s *Server) ensureServiceDevice(
 	ctx context.Context,
 	agentID, pollerID, partition string,
 	svc *proto.ServiceStatus,
 	serviceData json.RawMessage,
 	timestamp time.Time,
 ) {
 	if svc == nil {
 		return
 	}
-
-	// Only gRPC checkers embed host context in a way we can reason about; skip other service types.
 	if svc.ServiceType != grpcServiceType {
 		return
 	}
-
-	// Ignore result streams such as sync/multi-part responses; they set Source to "results".
 	if svc.Source != "" && !strings.EqualFold(svc.Source, "getstatus") {
 		return
 	}
 
 	hostIP, hostname, hostID := extractCheckerHostIdentity(serviceData)
 	hostIP = normalizeHostIdentifier(hostIP)
-	if hostIP == "" || strings.EqualFold(hostIP, "unknown") {
+	parsed := net.ParseIP(hostIP)
+	if parsed == nil {
+		// refuse non-IP identifiers to avoid corrupting device registry
 		return
 	}
 
 	if partition == "" {
 		partition = "default"
 	}
 
-	deviceID := fmt.Sprintf("%s:%s", partition, hostIP)
+	// Use the normalized string form for device ID. For IPv6, use the canonical compressed format.
+	ipKey := parsed.String()
+
+	// Add a stable tie-breaker to mitigate collisions where the same IP might be observed
+	// in overlapping contexts (e.g., NAT or shared targets).
+	extraScope := agentID
+	if extraScope == "" {
+		extraScope = svc.ServiceName
+	}
+	deviceID := fmt.Sprintf("%s:%s:%s", partition, ipKey, extraScope)
 
 	metadata := map[string]string{
-		"source":             "checker",
-		"checker_service":    svc.ServiceName,
-		"checker_service_id": svc.ServiceName,
-		"last_update":        timestamp.Format(time.RFC3339),
+		"source":               "checker",
+		"checker_service":      svc.ServiceName,
+		"checker_service_id":   svc.ServiceName,
+		"last_update":          timestamp.Format(time.RFC3339),
+		"checker_host_ip":      ipKey,
+		"device_id_scope_hint": extraScope,
 	}
-
 	if svc.ServiceType != "" {
 		metadata["checker_service_type"] = svc.ServiceType
 	}
-
 	if agentID != "" {
 		metadata["collector_agent_id"] = agentID
 	}
-
 	if pollerID != "" {
 		metadata["collector_poller_id"] = pollerID
 	}
-
 	if hostID != "" {
 		metadata["checker_host_id"] = hostID
 	}
-
-	metadata["checker_host_ip"] = hostIP
 
 	deviceUpdate := &models.DeviceUpdate{
 		AgentID:     agentID,
 		PollerID:    pollerID,
 		Partition:   partition,
 		DeviceID:    deviceID,
 		Source:      models.DiscoverySourceSelfReported,
-		IP:          hostIP,
+		IP:          ipKey,
 		Timestamp:   timestamp,
 		IsAvailable: true,
 		Metadata:    metadata,
 		Confidence:  models.GetSourceConfidence(models.DiscoverySourceSelfReported),
 	}
-
 	if hostname != "" {
 		deviceUpdate.Hostname = &hostname
 	}
 
 	if s.DeviceRegistry != nil {
 		if err := s.DeviceRegistry.ProcessDeviceUpdate(ctx, deviceUpdate); err != nil {
 			s.logger.Warn().
 				Err(err).
 				Str("device_id", deviceID).
 				Str("service_name", svc.ServiceName).
 				Msg("Failed to process checker device through device registry")
 		}
 	} else {
 		s.logger.Warn().
 			Str("device_id", deviceID).
 			Str("service_name", svc.ServiceName).
 			Msg("DeviceRegistry not available for checker device registration")
 	}
 }
```

- [ ] **Apply / Chat** <!-- /improve --apply_suggestion=4 -->

<details><summary>Suggestion importance[1-10]: 8</summary>

**Why:** The suggestion addresses a significant architectural weakness in device identification, preventing potential data collisions and malformed `deviceID`s, thus improving data integrity.
</details></details></td><td align=center>Medium

</td></tr><tr><td>

<details><summary>Add missing usage and tighten ORDER BY</summary>

___

**Add the missing <code>usage_percent</code> column to the <code>cpu_cluster_metrics</code> stream and include <code>poller_id</code> and <code>agent_id</code> in its <code>ORDER BY</code> clause to ensure data correctness and improve query performance.**

[pkg/db/migrations/00000000000001_consolidated_serviceradar_schema.up.sql [224-240]](https://github.com/carverauto/serviceradar/pull/1755/files#diff-1e05de747238f2112bb2230aac8db388e4c80eebe84c071eb78e035d64e67eb6R224-R240)

```diff
-CREATE STREAM IF NOT EXISTS cpu_metrics (
-    core_id int32,
-    usage_percent float64,
-    frequency_hz float64,
-    label string,
-    cluster string,
-    device_id string,
-    partition string
-) ENGINE = Stream(1, 1, rand())
-PARTITION BY int_div(to_unix_timestamp(timestamp), 3600)
-ORDER BY (timestamp, device_id, host_id, cluster, core_id)
-TTL to_start_of_day(coalesce(timestamp, _tp_time)) + INTERVAL 3 DAY
-SETTINGS index_granularity = 8192;
-
 CREATE STREAM IF NOT EXISTS cpu_cluster_metrics (
     timestamp DateTime64(3),
     poller_id string,
     agent_id string,
     host_id string,
     cluster string,
+    usage_percent float64,
     frequency_hz float64,
     device_id string,
     partition string
 ) ENGINE = Stream(1, 1, rand())
 PARTITION BY int_div(to_unix_timestamp(timestamp), 3600)
-ORDER BY (timestamp, cluster, host_id, device_id)
+ORDER BY (timestamp, cluster, host_id, device_id, poller_id, agent_id)
 TTL to_start_of_day(coalesce(timestamp, _tp_time)) + INTERVAL 3 DAY
 SETTINGS index_granularity = 8192;
 
 ALTER STREAM cpu_cluster_metrics
     ADD INDEX IF NOT EXISTS idx_cluster cluster TYPE bloom_filter GRANULARITY 1;
```

`[To ensure code accuracy, apply this suggestion manually]`

<details><summary>Suggestion importance[1-10]: 8</summary>

**Why:** The suggestion correctly identifies that the new `cpu_cluster_metrics` stream is missing a `usage_percent` field, which is crucial for UI components that display cluster usage. It also correctly points out that `poller_id` and `agent_id` should be in the `ORDER BY` clause for data integrity and performance, which is a critical consideration for streaming databases.

</details></details></td><td align=center>Medium

</td></tr><tr><td>

<details><summary>Add defensive null checks and stable keys</summary>

___

**Add a null check for each <code>device</code> in the <code>devices.map</code> loop and provide a fallback <code>key</code> using the index to prevent potential runtime errors and improve component stability.**

[web/src/components/Devices/DeviceTable.tsx [209-218]](https://github.com/carverauto/serviceradar/pull/1755/files#diff-62d65ad4e929bc5554024981aa1c3418d70d54333e3afd80af69d93cea2edec3R209-R218)

```diff
-{devices.map(device => {
-    const metadata = device.metadata || {};
-    const sysmonServiceHint = typeof metadata === 'object' &&
+{devices.map((device, idx) => {
+    if (!device || typeof device !== 'object') return null;
+
+    const metadata = (device as any).metadata || {};
+    const sysmonServiceHint =
+        typeof metadata === 'object' &&
         metadata !== null &&
         typeof metadata.checker_service === 'string' &&
         metadata.checker_service.toLowerCase().includes('sysmon');
 
+    const rowKey = device.device_id ?? `device-row-${idx}`;
+
     return (
-        <Fragment key={device.device_id}>
+        <Fragment key={rowKey}>
             <tr className="hover:bg-gray-700/30">
```

- [ ] **Apply / Chat** <!-- /improve --apply_suggestion=6 -->

<details><summary>Suggestion importance[1-10]: 6</summary>

**Why:** The suggestion improves code robustness by adding a null check for `device` and providing a fallback for the React `key`, which are good defensive programming practices against potentially malformed data.
</details></details></td><td align=center>Low

</td></tr><tr><td rowspan=1>General</td>
<td>

<details><summary>Use consistent truncated timestamps</summary>

___

**In <code>GetAllCPUMetrics</code>, use the truncated timestamp for the final <code>SysmonCPUResponse</code> to ensure consistent grouping and alignment between CPU and cluster metrics.**

[pkg/db/metrics.go [826-912]](https://github.com/carverauto/serviceradar/pull/1755/files#diff-2cf937853ac86eb9a879bd370cbe22b835ee0c48be6771b48b0276d93cc2ae93R826-R912)

```diff
-func (db *DB) GetAllCPUMetrics(
-	ctx context.Context, pollerID string, start, end time.Time) ([]models.SysmonCPUResponse, error) {
-	rows, err := db.Conn.Query(ctx, `
-		SELECT timestamp, agent_id, host_id, core_id, usage_percent, frequency_hz, label, cluster
-		FROM table(cpu_metrics)
-		WHERE poller_id = $1 AND timestamp BETWEEN $2 AND $3
-		ORDER BY timestamp DESC, core_id ASC`,
-		pollerID, start, end)
-	if err != nil {
-		return nil, fmt.Errorf("failed to query CPU metrics: %w", err)
-	}
-	defer db.CloseRows(rows)
+result := make([]models.SysmonCPUResponse, 0, len(data))
-
-	data := make(map[time.Time][]models.CPUMetric)
-	clustersByTimestamp := make(map[time.Time][]models.CPUClusterMetric)
-
-	for rows.Next() {
-		var m models.CPUMetric
-
-		var agentID, hostID, label, cluster string
-
-		var timestamp time.Time
-
-		if err := rows.Scan(&timestamp, &agentID, &hostID, &m.CoreID, &m.UsagePercent, &m.FrequencyHz, &label, &cluster); err != nil {
-			db.logger.Error().Err(err).Msg("Error scanning CPU metric row")
-			continue
-		}
-		key := timestamp.Truncate(time.Second)
-		m.Timestamp = timestamp
-		m.AgentID = agentID
-		m.HostID = hostID
-		m.Label = label
-		m.Cluster = cluster
-		data[key] = append(data[key], m)
-	}
-
-	if err := rows.Err(); err != nil {
-		db.logger.Error().Err(err).Msg("Error iterating CPU metric rows")
-
-		return nil, err
-	}
-
-	clusterRows, err := db.Conn.Query(ctx, `
-		SELECT timestamp, agent_id, host_id, cluster, frequency_hz
-		FROM table(cpu_cluster_metrics)
-		WHERE poller_id = $1 AND timestamp BETWEEN $2 AND $3
-		ORDER BY timestamp DESC, cluster ASC`,
-		pollerID, start, end)
-	if err != nil {
-		db.logger.Error().Err(err).Msg("Error querying CPU cluster metrics")
-
-		return nil, fmt.Errorf("failed to query CPU cluster metrics: %w", err)
-	}
-	defer db.CloseRows(clusterRows)
-
-	for clusterRows.Next() {
-		var c models.CPUClusterMetric
-		var timestamp time.Time
-
-		if err := clusterRows.Scan(&timestamp, &c.AgentID, &c.HostID, &c.Name, &c.FrequencyHz); err != nil {
-			db.logger.Error().Err(err).Msg("Error scanning CPU cluster metric row")
-			continue
-		}
-		key := timestamp.Truncate(time.Second)
-		c.Timestamp = timestamp
-		clustersByTimestamp[key] = append(clustersByTimestamp[key], c)
-	}
-
-	if err := clusterRows.Err(); err != nil {
-		db.logger.Error().Err(err).Msg("Error iterating CPU cluster metric rows")
-
-		return nil, err
-	}
-
-	result := make([]models.SysmonCPUResponse, 0, len(data))
-
-	for ts, cpus := range data {
-		result = append(result, models.SysmonCPUResponse{
-			Cpus:      cpus,
-			Clusters:  clustersByTimestamp[ts],
-			Timestamp: cpus[0].Timestamp,
-		})
-	}
-
-	return result, nil
+for ts, cpus := range data {
+	truncated := ts.Truncate(time.Second)
+	result = append(result, models.SysmonCPUResponse{
+		Cpus:      cpus,
+		Clusters:  clustersByTimestamp[ts],
+		Timestamp: truncated,
+	})
 }
```

`[To ensure code accuracy, apply this suggestion manually]`

<details><summary>Suggestion importance[1-10]: 7</summary>

**Why:** The suggestion correctly identifies a subtle bug where inconsistent timestamp handling could lead to data misalignment between CPU and cluster metrics.
</details></details></td><td align=center>Medium </td></tr> <tr><td align="center" colspan="2"> - [ ] More <!-- /improve --more_suggestions=true --> </td><td></td></tr></tbody></table> ___ #### Previous suggestions <details><summary>✅ Suggestions up to commit 86685fd</summary> <br><table><thead><tr><td><strong>Category</strong></td><td align=left><strong>Suggestion&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </strong></td><td align=center><strong>Impact</strong></td></tr><tbody><tr><td rowspan=7>Possible issue</td> <td> <details><summary>Preserve original host identity fields<!-- not_implemented --></summary> ___ **Avoid overwriting <code>host_id</code> and <code>host_ip</code> in the service message. 
Instead, store the <br>address derived from <code>check.Details</code> in separate auxiliary fields to preserve the <br>original agent-reported identity.** [pkg/poller/agent_poller.go [272-325]](https://github.com/carverauto/serviceradar/pull/1755/files#diff-7cdaae416a1e7b5d39ad75f138c49fa0aca12fc38e8ab0b26bb59c64446da15cR272-R325) ```diff func enrichServiceMessageWithAddress(message []byte, check Check) []byte { if len(message) == 0 || check.Type != "grpc" || check.Details == "" { return message } hostCandidate := strings.TrimSpace(check.Details) host, _, err := net.SplitHostPort(hostCandidate) if err != nil { host = hostCandidate } + // Only attach auxiliary info if the candidate parses as an IP if host == "" || net.ParseIP(host) == nil { return message } var payload map[string]any if err := json.Unmarshal(message, &payload); err != nil { return message } - statusNode, ok := payload["status"].(map[string]any) - if !ok || statusNode == nil { + statusNode, _ := payload["status"].(map[string]any) + if statusNode == nil { statusNode = make(map[string]any) } - getString := func(key string) string { - if raw, exists := statusNode[key]; exists { - if str, ok := raw.(string); ok { - return strings.TrimSpace(str) - } + // Preserve existing fields; add auxiliary fields without overwriting originals + if existing, ok := statusNode["host_ip"]; ok && existing != nil { + // keep original, add derived separately if different + if s, ok := existing.(string); ok && !strings.EqualFold(strings.TrimSpace(s), host) { + statusNode["derived_host_ip"] = host } - - return "" + } else { + // no original present; still avoid writing as canonical field + statusNode["derived_host_ip"] = host } - if hostIP := getString("host_ip"); hostIP != "" && !strings.EqualFold(hostIP, host) { - statusNode["reported_host_ip"] = hostIP + // Do not overwrite host_id; expose derived host_id from address only as auxiliary + if _, ok := statusNode["host_id"]; !ok { + statusNode["derived_host_id"] = host } - 
statusNode["host_ip"] = host - - if hostID := getString("host_id"); hostID != "" && !strings.EqualFold(hostID, host) { - statusNode["reported_host_id"] = hostID - } - statusNode["host_id"] = host payload["status"] = statusNode - enriched, err := json.Marshal(payload) if err != nil { return message } - return enriched } ``` <!-- /improve --apply_suggestion=0 --> <details><summary>Suggestion importance[1-10]: 9</summary> __ Why: The suggestion correctly identifies a critical data integrity issue where agent-reported identity (`host_ip`, `host_id`) is overwritten, which could lead to incorrect device correlation and data misattribution. </details></details></td><td align=center>High </td></tr><tr><td> <details><summary>Validate device key IP input</summary> ___ **Validate that the extracted <code>hostIP</code> is a valid IP address before using it to <br>construct a <code>deviceID</code> to prevent creating invalid device records from hostnames <br>or other strings.** [pkg/core/devices.go [31-62]](https://github.com/carverauto/serviceradar/pull/1755/files#diff-fb7f5159558f22afdc809b9cbf5a9c0f71560be2f7389c944da8f1604998ea48R31-R62) ```diff -func (s *Server) ensureServiceDevice( - ctx context.Context, - agentID, pollerID, partition string, - svc *proto.ServiceStatus, - serviceData json.RawMessage, - timestamp time.Time, -) { - if svc == nil { - return - } +hostIP, hostname, hostID := extractCheckerHostIdentity(serviceData) +hostIP = normalizeHostIdentifier(hostIP) +// Require a valid IP address to avoid using hostnames/strings as device keys +if ip := net.ParseIP(hostIP); ip == nil { + return +} +if hostIP == "" || strings.EqualFold(hostIP, "unknown") { + return +} - // Only gRPC checkers embed host context in a way we can reason about; skip other service types. - if svc.ServiceType != grpcServiceType { - return - } +if partition == "" { + partition = "default" +} - // Ignore result streams such as sync/multi-part responses; they set Source to "results". 
- if svc.Source != "" && !strings.EqualFold(svc.Source, "getstatus") { - return - } +deviceID := fmt.Sprintf("%s:%s", partition, hostIP) - hostIP, hostname, hostID := extractCheckerHostIdentity(serviceData) - hostIP = normalizeHostIdentifier(hostIP) - if hostIP == "" || strings.EqualFold(hostIP, "unknown") { - return - } - - if partition == "" { - partition = "default" - } - - deviceID := fmt.Sprintf("%s:%s", partition, hostIP) - ``` <details><summary>Suggestion importance[1-10]: 8</summary> __ Why: The suggestion correctly identifies that a non-IP string could be used in `deviceID`, leading to invalid device records. Enforcing IP validation for the device key is a crucial fix for data integrity. </details></details></td><td align=center>Medium </td></tr><tr><td> <details><summary>✅ <s>Add indexes for new cluster fields</s></summary> ___ <details><summary><b>Suggestion Impact:</b></summary>The commit added bloom filter indexes on cpu_metrics for cluster and label, matching the suggestion. It also added an index for cluster on cpu_cluster_metrics and adjusted ORDER BY clauses. 
code diff: ```diff +ALTER STREAM cpu_metrics + ADD INDEX IF NOT EXISTS idx_cluster cluster TYPE bloom_filter GRANULARITY 1; +ALTER STREAM cpu_metrics + ADD INDEX IF NOT EXISTS idx_label label TYPE bloom_filter GRANULARITY 1; ``` </details> ___ **Add bloom filter indexes to the <code>cpu_metrics</code> stream for the new <code>cluster</code> and <code>label</code> <br>fields to improve query performance.** [pkg/db/migrations/00000000000001_consolidated_serviceradar_schema.up.sql [209-222]](https://github.com/carverauto/serviceradar/pull/1755/files#diff-1e05de747238f2112bb2230aac8db388e4c80eebe84c071eb78e035d64e67eb6R209-R222) ```diff CREATE STREAM IF NOT EXISTS cpu_metrics ( timestamp DateTime64(3), poller_id string, agent_id string, host_id string, core_id int32, usage_percent float64, frequency_hz float64, label string, cluster string, device_id string, partition string ) ENGINE = Stream(1, 1, rand()) PARTITION BY int_div(to_unix_timestamp(timestamp), 3600) ORDER BY (timestamp, device_id, host_id, core_id) TTL to_start_of_day(coalesce(timestamp, _tp_time)) + INTERVAL 3 DAY SETTINGS index_granularity = 8192; +ALTER STREAM cpu_metrics + ADD INDEX IF NOT EXISTS idx_cluster cluster TYPE bloom_filter GRANULARITY 1; +ALTER STREAM cpu_metrics + ADD INDEX IF NOT EXISTS idx_label label TYPE bloom_filter GRANULARITY 1; + ``` <details><summary>Suggestion importance[1-10]: 8</summary> __ Why: This is a valid and important performance optimization. Adding bloom filter indexes on the new `cluster` and `label` fields is crucial for efficient querying, especially since these fields are used for filtering and grouping in the UI, preventing full stream scans on `cpu_metrics`. 
</details></details></td><td align=center>Medium </td></tr><tr><td> <details><summary>✅ <s>Optimize sort key and index for clusters</s></summary> ___ <details><summary><b>Suggestion Impact:</b></summary>The commit reordered the cpu_cluster_metrics ORDER BY to (timestamp, cluster, host_id, device_id) and added a bloom filter index on cluster for that stream, matching the suggestion's intent. code diff: ```diff -ORDER BY (timestamp, device_id, host_id, cluster) +ORDER BY (timestamp, cluster, host_id, device_id) TTL to_start_of_day(coalesce(timestamp, _tp_time)) + INTERVAL 3 DAY SETTINGS index_granularity = 8192; +ALTER STREAM cpu_cluster_metrics + ADD INDEX IF NOT EXISTS idx_cluster cluster TYPE bloom_filter GRANULARITY 1; ``` </details> ___ **Optimize the <code>cpu_cluster_metrics</code> stream by reordering the primary key to <br>prioritize <code>cluster</code> and adding a bloom filter index on it.** [pkg/db/migrations/00000000000001_consolidated_serviceradar_schema.up.sql [224-237]](https://github.com/carverauto/serviceradar/pull/1755/files#diff-1e05de747238f2112bb2230aac8db388e4c80eebe84c071eb78e035d64e67eb6R224-R237) ```diff CREATE STREAM IF NOT EXISTS cpu_cluster_metrics ( timestamp DateTime64(3), poller_id string, agent_id string, host_id string, cluster string, frequency_hz float64, device_id string, partition string ) ENGINE = Stream(1, 1, rand()) PARTITION BY int_div(to_unix_timestamp(timestamp), 3600) -ORDER BY (timestamp, device_id, host_id, cluster) +ORDER BY (timestamp, cluster, host_id, device_id) TTL to_start_of_day(coalesce(timestamp, _tp_time)) + INTERVAL 3 DAY SETTINGS index_granularity = 8192; +ALTER STREAM cpu_cluster_metrics + ADD INDEX IF NOT EXISTS idx_cluster cluster TYPE bloom_filter GRANULARITY 1; + ``` `[Suggestion processed]` <details><summary>Suggestion importance[1-10]: 8</summary> __ Why: This suggestion correctly identifies a performance issue in the new `cpu_cluster_metrics` stream. 
Reordering the primary key to `(timestamp, cluster, ...)` and adding a bloom filter index on `cluster` will significantly improve the performance of common UI queries that filter by cluster. </details></details></td><td align=center>Medium </td></tr><tr><td> <details><summary>Add safe db type assertion<!-- not_implemented --></summary> ___ **Add a safe type assertion when accessing <code>s.dbService</code> to prevent a potential <br>panic if the underlying type is not <code>*db.DB</code>.** [pkg/core/api/sysmon.go [459]](https://github.com/carverauto/serviceradar/pull/1755/files#diff-4879e8b367ee55413c6cb006f23bb912261df5310c640d3af352033cf97ffc84R459-R459) ```diff -clusterRows, err := s.dbService.(*db.DB).Conn.Query(ctx, clusterQuery) +var ( + dbImpl *db.DB + ok bool +) +if dbImpl, ok = s.dbService.(*db.DB); !ok || dbImpl == nil || dbImpl.Conn == nil { + return nil, fmt.Errorf("database service unavailable for cluster query") +} +clusterRows, err := dbImpl.Conn.Query(ctx, clusterQuery) ``` <!-- /improve --apply_suggestion=4 --> <details><summary>Suggestion importance[1-10]: 7</summary> __ Why: The suggestion correctly points out a potential panic from an unsafe type assertion, improving the code's robustness and preventing a server crash, which is a valuable improvement for production stability. </details></details></td><td align=center>Medium </td></tr><tr><td> <details><summary>✅ <s>Add defensive null checks</s></summary> ___ <details><summary><b>Suggestion Impact:</b></summary>The commit added a defensive check ensuring serviceData exists before accessing its properties and provided a nullish coalescing fallback for serviceData.details. code diff: ```diff + if ( + serviceData && + (serviceData.service_name === "sysmon-vm" || serviceData.name === "sysmon-vm") + ) { return ( <SysmonVmDetails service={serviceData} - details={serviceData.details} + details={serviceData.details ?? 
{}} ``` </details> ___ **Add null checks for the <code>serviceData</code> object and its <code>details</code> property to prevent <br>potential runtime errors during rendering.** [web/src/components/Service/Dashboard.tsx [280-287]](https://github.com/carverauto/serviceradar/pull/1755/files#diff-a0046922157d76341ddcd07416191c7a7add740deefbf01ee66c1f49e7da4582R280-R287) ```diff -if (serviceData.service_name === 'sysmon-vm' || serviceData.name === 'sysmon-vm') { +if (serviceData && (serviceData.service_name === 'sysmon-vm' || serviceData.name === 'sysmon-vm')) { return ( <SysmonVmDetails service={serviceData} - details={serviceData.details} + details={serviceData.details ?? {}} /> ); } ``` `[Suggestion processed]` <details><summary>Suggestion importance[1-10]: 7</summary> __ Why: The suggestion correctly identifies a potential null pointer exception if `serviceData` is not yet loaded, improving the component's robustness against runtime errors. </details></details></td><td align=center>Medium </td></tr><tr><td> <details><summary>✅ <s>Clamp and sanitize metric value</s></summary> ___ <details><summary><b>Suggestion Impact:</b></summary>The commit added safeMax/rawValue/clampedValue and replaced uses of current with clampedValue and max with safeMax across StatusIndicator, ValueDisplay, and ProgressBar. code diff: ```diff + const safeMax = Number.isFinite(max) && max > 0 ? max : 100; + const rawValue = Number.isFinite(current) ? 
current : 0; + const clampedValue = Math.min(Math.max(rawValue, 0), safeMax); + return ( <div className="bg-white dark:bg-gray-800 rounded-lg p-3 shadow transition-colors"> <div className="flex justify-between items-center mb-2"> @@ -129,23 +133,23 @@ <span className="text-sm font-medium text-gray-600 dark:text-gray-300">{title}</span> </div> <StatusIndicator - value={current} + value={clampedValue} warning={warning} critical={critical} /> </div> <ValueDisplay - value={current} + value={clampedValue} unit={unit} warning={warning} critical={critical} change={change} /> <ProgressBar - value={current} + value={clampedValue} warning={warning} critical={critical} - max={max} + max={safeMax} /> ``` </details> ___ **Sanitize the <code>current</code> and <code>max</code> props in the <code>MetricCard</code> component by clamping the <br>value to a safe range <code>[0, max]</code> to prevent UI issues.** [web/src/components/Metrics/shared-components.jsx [123-153]](https://github.com/carverauto/serviceradar/pull/1755/files#diff-da2e49cf20ae6d685ed375b397ac8a5d2fdbafad60303a706712dfb84afb49d3R123-R153) ```diff export const MetricCard = ({ title, current, unit, warning, critical, change, icon, children, max = 100 }) => { + const safeMax = Number.isFinite(max) && max > 0 ? max : 100; + const rawValue = Number.isFinite(current) ? current : 0; + const clampedValue = Math.min(Math.max(rawValue, 0), safeMax); return ( <div className="bg-white dark:bg-gray-800 rounded-lg p-3 shadow transition-colors"> <div className="flex justify-between items-center mb-2"> <div className="flex items-center space-x-2"> {icon && <div className="text-gray-500 dark:text-gray-300">{icon}</div>} <h3 className="text-sm font-medium text-gray-900 dark:text-white">{title}</h3> </div> {typeof change !== 'undefined' && ( <span className={`text-xs ${change >= 0 ? 'text-green-600' : 'text-red-600'}`}> {change >= 0 ? 
'+' : ''}{change}% </span> )} </div> <MetricProgressBar - value={current} + value={clampedValue} warning={warning} critical={critical} - max={max} + max={safeMax} /> {children} </div> ); }; ``` `[Suggestion processed]` <details><summary>Suggestion importance[1-10]: 7</summary> __ Why: This suggestion improves the component's resilience by sanitizing the `current` and `max` props, preventing potential UI bugs or crashes from invalid data. </details></details></td><td align=center>Medium </td></tr><tr><td rowspan=3>Incremental <sup><a href='https://qodo-merge-docs.qodo.ai/core-abilities/incremental_update/'>[*]</a></sup></td> <td> <details><summary>✅ <s>Normalize timestamps for joining</s></summary> ___ <details><summary><b>Suggestion Impact:</b></summary>The commit implemented truncating timestamps to second-resolution for both CPU metrics and cluster metrics map keys, and adjusted the response timestamp to use the original metric timestamp. code diff: ```diff + key := timestamp.Truncate(time.Second) m.Timestamp = timestamp m.AgentID = agentID m.HostID = hostID m.Label = label m.Cluster = cluster - data[timestamp] = append(data[timestamp], m) + data[key] = append(data[key], m) } if err := rows.Err(); err != nil { @@ -888,9 +888,9 @@ db.logger.Error().Err(err).Msg("Error scanning CPU cluster metric row") continue } - + key := timestamp.Truncate(time.Second) c.Timestamp = timestamp - clustersByTimestamp[timestamp] = append(clustersByTimestamp[timestamp], c) + clustersByTimestamp[key] = append(clustersByTimestamp[key], c) } if err := clusterRows.Err(); err != nil { @@ -905,7 +905,7 @@ result = append(result, models.SysmonCPUResponse{ Cpus: cpus, Clusters: clustersByTimestamp[ts], - Timestamp: ts, + Timestamp: cpus[0].Timestamp, }) ``` </details> ___ **In <code>GetAllCPUMetrics</code>, normalize timestamps by truncating them to the second <br>before using them as map keys to ensure correct correlation between core and <br>cluster metrics.** [pkg/db/metrics.go 
[826-912]](https://github.com/carverauto/serviceradar/pull/1755/files#diff-2cf937853ac86eb9a879bd370cbe22b835ee0c48be6771b48b0276d93cc2ae93R826-R912) ```diff func (db *DB) GetAllCPUMetrics( ctx context.Context, pollerID string, start, end time.Time) ([]models.SysmonCPUResponse, error) { rows, err := db.Conn.Query(ctx, ` SELECT timestamp, agent_id, host_id, core_id, usage_percent, frequency_hz, label, cluster FROM table(cpu_metrics) WHERE poller_id = $1 AND timestamp BETWEEN $2 AND $3 ORDER BY timestamp DESC, core_id ASC`, pollerID, start, end) if err != nil { return nil, fmt.Errorf("failed to query CPU metrics: %w", err) } defer db.CloseRows(rows) data := make(map[time.Time][]models.CPUMetric) clustersByTimestamp := make(map[time.Time][]models.CPUClusterMetric) for rows.Next() { var m models.CPUMetric - var agentID, hostID, label, cluster string - - var timestamp time.Time - - if err := rows.Scan(&timestamp, &agentID, &hostID, &m.CoreID, &m.UsagePercent, &m.FrequencyHz, &label, &cluster); err != nil { + var ts time.Time + if err := rows.Scan(&ts, &agentID, &hostID, &m.CoreID, &m.UsagePercent, &m.FrequencyHz, &label, &cluster); err != nil { db.logger.Error().Err(err).Msg("Error scanning CPU metric row") continue } - - m.Timestamp = timestamp + // Normalize key to seconds to ensure joinability + key := ts.Truncate(time.Second) + m.Timestamp = ts m.AgentID = agentID m.HostID = hostID m.Label = label m.Cluster = cluster - data[timestamp] = append(data[timestamp], m) + data[key] = append(data[key], m) } - if err := rows.Err(); err != nil { db.logger.Error().Err(err).Msg("Error iterating CPU metric rows") - return nil, err } clusterRows, err := db.Conn.Query(ctx, ` SELECT timestamp, agent_id, host_id, cluster, frequency_hz FROM table(cpu_cluster_metrics) WHERE poller_id = $1 AND timestamp BETWEEN $2 AND $3 ORDER BY timestamp DESC, cluster ASC`, pollerID, start, end) if err != nil { db.logger.Error().Err(err).Msg("Error querying CPU cluster metrics") - return nil, 
	fmt.Errorf("failed to query CPU cluster metrics: %w", err)
 }

 defer db.CloseRows(clusterRows)

 for clusterRows.Next() {
 	var c models.CPUClusterMetric
-	var timestamp time.Time
-
-	if err := clusterRows.Scan(&timestamp, &c.AgentID, &c.HostID, &c.Name, &c.FrequencyHz); err != nil {
+	var ts time.Time
+	if err := clusterRows.Scan(&ts, &c.AgentID, &c.HostID, &c.Name, &c.FrequencyHz); err != nil {
 		db.logger.Error().Err(err).Msg("Error scanning CPU cluster metric row")
 		continue
 	}
-
-	c.Timestamp = timestamp
-	clustersByTimestamp[timestamp] = append(clustersByTimestamp[timestamp], c)
+	key := ts.Truncate(time.Second)
+	c.Timestamp = ts
+	clustersByTimestamp[key] = append(clustersByTimestamp[key], c)
 }
-
 if err := clusterRows.Err(); err != nil {
 	db.logger.Error().Err(err).Msg("Error iterating CPU cluster metric rows")
-
 	return nil, err
 }

 result := make([]models.SysmonCPUResponse, 0, len(data))
-
-for ts, cpus := range data {
+for tsKey, cpus := range data {
 	result = append(result, models.SysmonCPUResponse{
 		Cpus:      cpus,
-		Clusters:  clustersByTimestamp[ts],
-		Timestamp: ts,
+		Clusters:  clustersByTimestamp[tsKey],
+		Timestamp: cpus[0].Timestamp, // original timestamp for display
 	})
 }
-
 return result, nil
}
```

<details><summary>Suggestion importance[1-10]: 8</summary>

__

Why: The suggestion correctly identifies a critical Go pitfall of using `time.Time` as a map key, which would cause data to be mis-associated, and provides the correct fix by truncating the timestamp.
</details></details></td><td align=center>Medium</td></tr>

<tr><td>
<details><summary>✅ <s>Guard optional nested data</s></summary>

___

<details><summary><b>Suggestion Impact:</b></summary>
The commit updated the component to use optional chaining and array checks: it wrapped `cpuFrequency` and `memory` with `data?.` and changed `cpu.clusters`, `cpu.cores`, and `disk.drives` to `Array.isArray` checks, matching the suggested guards.

code diff:

```diff
-          {data.cpu?.clusters && data.cpu.clusters.length > 0 && (
+          {Array.isArray(data?.cpu?.clusters) && data.cpu.clusters.length > 0 && (
             <CpuClusterDetails clusters={data.cpu.clusters} />
           )}
-          {data.cpuFrequency && <CpuFrequencyDetails data={data.cpuFrequency} />}
-          <CpuCoresChart cores={data.cpu.cores} />
-          <FilesystemDetails drives={data.disk.drives} />
-          <MemoryDetails data={data.memory} />
+          {data?.cpuFrequency && <CpuFrequencyDetails data={data.cpuFrequency} />}
+          {Array.isArray(data?.cpu?.cores) && <CpuCoresChart cores={data.cpu.cores} />}
+          {Array.isArray(data?.disk?.drives) && <FilesystemDetails drives={data.disk.drives} />}
+          {data?.memory && <MemoryDetails data={data.memory} />}
```
</details>

___

**Add null-safe guards for `data.cpu.cores`, `data.disk.drives`, and `data.memory` to prevent the application from crashing if these properties are accessed before they are loaded.**

[web/src/components/Metrics/system-metrics.jsx [288-296]](https://github.com/carverauto/serviceradar/pull/1755/files#diff-8ac59d996e95f9669b6ddac45fad9179ec4875fcd135aa101f4e25f1c662f687R288-R296)

```diff
 {activeTab === 'details' && (
   <div className="space-y-6">
-    {data.cpu?.clusters && data.cpu.clusters.length > 0 && (
+    {Array.isArray(data?.cpu?.clusters) && data.cpu.clusters.length > 0 && (
       <CpuClusterDetails clusters={data.cpu.clusters} />
     )}
-    {data.cpuFrequency && <CpuFrequencyDetails data={data.cpuFrequency} />}
-    <CpuCoresChart cores={data.cpu.cores} />
-    <FilesystemDetails drives={data.disk.drives} />
-    <MemoryDetails data={data.memory} />
+    {data?.cpuFrequency && <CpuFrequencyDetails data={data.cpuFrequency} />}
+    {Array.isArray(data?.cpu?.cores) && <CpuCoresChart cores={data.cpu.cores} />}
+    {Array.isArray(data?.disk?.drives) && <FilesystemDetails drives={data.disk.drives} />}
+    {data?.memory && <MemoryDetails data={data.memory} />}
```

`[Suggestion processed]`

<details><summary>Suggestion importance[1-10]: 8</summary>

__

Why: This suggestion correctly identifies missing null-safe guards for nested properties like `data.cpu.cores`, preventing a likely runtime crash when data is partially loaded, which is a critical fix for UI stability.
</details></details></td><td align=center>Medium</td></tr>

<tr><td>
<details><summary>✅ <s>Include cluster in sorting key</s></summary>

___

<details><summary><b>Suggestion Impact:</b></summary>
The commit updated `cpu_metrics` to `ORDER BY (timestamp, device_id, host_id, cluster, core_id)`, adding `cluster` as suggested. It also added bloom filter indexes and adjusted ordering for another stream.

code diff:

```diff
+ORDER BY (timestamp, device_id, host_id, cluster, core_id)
+TTL to_start_of_day(coalesce(timestamp, _tp_time)) + INTERVAL 3 DAY
+SETTINGS index_granularity = 8192;
```
</details>

___

**Add the new `cluster` field to the `ORDER BY` clause of the `cpu_metrics` stream. This will improve query performance and data locality for cluster-based aggregations.**

[pkg/db/migrations/00000000000001_consolidated_serviceradar_schema.up.sql [209-222]](https://github.com/carverauto/serviceradar/pull/1755/files#diff-1e05de747238f2112bb2230aac8db388e4c80eebe84c071eb78e035d64e67eb6R209-R222)

```diff
 CREATE STREAM IF NOT EXISTS cpu_metrics (
     timestamp DateTime64(3),
     poller_id string,
     agent_id string,
     host_id string,
     core_id int32,
     usage_percent float64,
     frequency_hz float64,
     label string,
     cluster string,
     device_id string,
     partition string
 ) ENGINE = Stream(1, 1, rand())
 PARTITION BY int_div(to_unix_timestamp(timestamp), 3600)
-ORDER BY (timestamp, device_id, host_id, core_id)
+ORDER BY (timestamp, device_id, host_id, cluster, core_id)
 TTL to_start_of_day(coalesce(timestamp, _tp_time)) + INTERVAL 3 DAY
 SETTINGS index_granularity = 8192;
```

<details><summary>Suggestion importance[1-10]: 7</summary>

__

Why: This is a valid and important performance optimization for the `cpu_metrics` stream, as including `cluster` in the `ORDER BY` clause will improve data locality and query efficiency for cluster-based aggregations.
</details></details></td><td align=center>Medium</td></tr>

<tr><td rowspan=2>General</td>
<td>
<details><summary>✅ <s>Remove hardcoded private IP</s></summary>

___

<details><summary><b>Suggestion Impact:</b></summary>
The hardcoded IP address was replaced with an environment-variable-based address: `"${SYSMON_VM_ADDRESS:-sysmon-vm:50110}"`.

code diff:

```diff
-    "address": "192.168.1.219:50110",
+    "address": "${SYSMON_VM_ADDRESS:-sysmon-vm:50110}",
```
</details>

___

**Replace the hardcoded IP address in `sysmon-vm.checker.json` with a configurable value, such as an environment variable, to improve deployment flexibility.**

[docker/compose/sysmon-vm.checker.json [1-9]](https://github.com/carverauto/serviceradar/pull/1755/files#diff-92ceb153a11705c17ec13f926f009bf33b539c872b859f5fd624dd4b5b4cd17dR1-R9)

```diff
 {
   "name": "sysmon-vm",
   "type": "grpc",
-  "address": "192.168.1.219:50110",
+  "address": "${SYSMON_VM_ADDRESS:-sysmon-vm:50110}",
   "security": {
     "mode": "none",
     "role": "agent"
   }
 }
```

`[Suggestion processed]`

<details><summary>Suggestion importance[1-10]: 8</summary>

__

Why: The suggestion correctly points out that a hardcoded IP address harms portability and should be replaced with a configurable value, which is a critical best practice for deployment configurations.
</details></details></td><td align=center>Medium</td></tr>

<tr><td>
<details><summary>✅ <s>Align Proton memory to container limits</s></summary>

___

<details><summary><b>Suggestion Impact:</b></summary>
The commit added the suggested Proton memory environment variables (`PROTON_MAX_SERVER_MEMORY`, `PROTON_MAX_QUERY_MEMORY`, `PROTON_MAX_PARTITION_BYTES`) along with a comment explaining the alignment with container limits.

code diff:

```diff
+      # Align Proton memory usage with container limits to reduce OOM risk
+      - PROTON_MAX_SERVER_MEMORY=6g
+      - PROTON_MAX_QUERY_MEMORY=4g
+      - PROTON_MAX_PARTITION_BYTES=256m
```
</details>

___

**Set explicit memory limits for the Proton service via environment variables to align with container memory limits and prevent OOM kills.**

[docker-compose.yml [72-92]](https://github.com/carverauto/serviceradar/pull/1755/files#diff-e45e45baeda1c1e73482975a664062aa56f20c03dd9d64a827aba57775bed0d3R72-R92)

```diff
 services:
   proton:
     image: ghcr.io/carverauto/serviceradar-proton:latest
     container_name: serviceradar-proton
     user: "0:0"
     mem_limit: 8g
     mem_reservation: 6g
     ports:
       - "8123:8123"
       - "8463:8463"
     environment:
       - PROTON_PASSWORD=${PROTON_PASSWORD:-}
       - PROTON_LOG_LEVEL=${PROTON_LOG_LEVEL:-error}
       - PROTON_DISABLE_TASKSTATS=1
+      # Align Proton internal memory to container cgroup limits to prevent OOM
+      - PROTON_MAX_SERVER_MEMORY=6g
+      - PROTON_MAX_QUERY_MEMORY=4g
+      - PROTON_MAX_PARTITION_BYTES=256m
```

<details><summary>Suggestion importance[1-10]: 7</summary>

__

Why: The suggestion correctly points out a potential for OOM kills by aligning Proton's internal memory settings with the container's memory limits. This enhances the stability of the `proton` service under load, making it a valuable configuration improvement.
</details></details></td><td align=center>Medium</td></tr>

<tr><td align="center" colspan="2">
<!-- /improve_multi --more_suggestions=true -->
</td><td></td></tr></tbody></table>

</details>

<details><summary>✅ Suggestions up to commit fddabde</summary>
<br>
<table><thead><tr><td><strong>Category</strong></td><td align=left><strong>Suggestion</strong></td><td align=center><strong>Impact</strong></td></tr></thead><tbody>
<tr><td rowspan=1>High-level</td>
<td>
<details><summary>Database migration strategy breaks existing deployments</summary>

___

**The PR replaces the incremental database migration system with a single schema file that only runs on empty databases. This breaks the upgrade path for existing deployments, as there is no way to apply new schema changes to databases that already contain data.**

### Examples:

<details>
<summary><a href="https://github.com/carverauto/serviceradar/pull/1755/files#diff-95dfdc3c1eedf0141bd8be59bf69367b2909409887590dff5ae81e92f99b3917R35-R75">pkg/db/migrate.go [35-75]</a></summary>

```go
// RunMigrations applies the consolidated database schema when no tables exist yet.
func RunMigrations(ctx context.Context, conn proton.Conn, log logger.Logger) error {
	log.Info().Msg("Ensuring database schema is applied")

	applied, err := schemaAlreadyApplied(ctx, conn)
	if err != nil {
		return fmt.Errorf("failed to inspect existing schema: %w", err)
	}

	if applied {
		...
(clipped 31 lines)
```
</details>

### Solution Walkthrough:

#### Before:

```go
// In pkg/db/migrate.go
func RunMigrations(ctx, conn) error {
    createMigrationsTable(ctx, conn) // Ensure tracking table exists
    appliedMigrations := getAppliedMigrations(ctx, conn)
    availableMigrations := getAvailableMigrations() // from *.up.sql files

    for _, migrationFile := range availableMigrations {
        if !isApplied(migrationFile, appliedMigrations) {
            content := readMigrationFile(migrationFile)
            executeMigration(ctx, conn, content)
            recordMigration(ctx, conn, migrationFile)
        }
    }
    return nil
}
```

#### After:

```go
// In pkg/db/migrate.go
const schemaFile = "migrations/schema.sql"

func RunMigrations(ctx, conn) error {
    applied, err := schemaAlreadyApplied(ctx, conn)
    if applied {
        log("Schema already present; skipping apply")
        return nil
    }

    // If not applied, run the whole schema file
    content := readSchemaFile(schemaFile)
    statements := splitSQLStatements(content)
    for _, stmt := range statements {
        conn.Exec(ctx, stmt)
    }
    log("Database schema applied successfully")
    return nil
}

func schemaAlreadyApplied(ctx, conn) (bool, error) {
    // SELECT count() FROM system.tables WHERE ... name = 'device_updates'
    // return count > 0
}
```

<details><summary>Suggestion importance[1-10]: 10</summary>

__

Why: The suggestion correctly identifies a critical flaw where the new migration strategy removes the upgrade path for existing databases, which is a breaking change for all current deployments.
</details></details></td><td align=center>High</td></tr>

<tr><td rowspan=6>Possible issue</td>
<td>
<details><summary>Fix missing target stream clauses<!-- not_implemented --></summary>

___

**Add the missing `INTO` clause to the `ocsf_devices_aggregator_mv` and `ocsf_users_aggregator_mv` materialized views to specify their target streams and fix a syntax error.**

[pkg/db/migrations/schema.sql [1298-1408]](https://github.com/carverauto/serviceradar/pull/1755/files#diff-de4eeaf1d0460885258166a33d01394bc45f33ca3a95b4d36ef9c30b7a80e8e8R1298-R1408)

```diff
 -- OCSF Materialized Views Migration
 -- These views automatically aggregate event data into entity state and observable indexes
 -- Enables real-time updates from event streams to current state streams

 -- Device State Aggregation
 -- Populates ocsf_devices_current from device inventory events
 DROP VIEW IF EXISTS ocsf_devices_aggregator_mv;
-CREATE MATERIALIZED VIEW ocsf_devices_aggregator_mv AS
+CREATE MATERIALIZED VIEW ocsf_devices_aggregator_mv
+INTO ocsf_devices_current
+AS
 SELECT
     device_uid,
     time AS last_seen,
     ...
 FROM ocsf_device_inventory;

 -- User State Aggregation
 -- Populates ocsf_users_current from user inventory events
 DROP VIEW IF EXISTS ocsf_users_aggregator_mv;
-CREATE MATERIALIZED VIEW ocsf_users_aggregator_mv AS
+CREATE MATERIALIZED VIEW ocsf_users_aggregator_mv
+INTO ocsf_users_current
+AS
 SELECT
     user_uid,
     time AS last_seen,
     ...
 FROM ocsf_user_inventory;
```

<details><summary>Suggestion importance[1-10]: 10</summary>

__

Why: The suggestion correctly identifies a critical syntax error in two `CREATE MATERIALIZED VIEW` statements that are missing the required `INTO` clause. This would cause the SQL migration to fail, so the fix is essential for correctness.
</details></details></td><td align=center>High</td></tr>

<tr><td>
<details><summary>✅ <s>Fix view definition and function name</s></summary>

___

<details><summary><b>Suggestion Impact:</b></summary>
The committed diff updated the `ocsf_observable_device_macs_mv` definition to include the `INTO ocsf_observable_index` clause and corrected the function call to `replaceAll` for normalizing MAC addresses.

code diff:

```diff
 -- Observable Index Population - Device MACs
 -- Index MAC addresses from device inventory
-DROP VIEW IF EXISTS ocsf_observable_device_macs_mv;
-CREATE MATERIALIZED VIEW ocsf_observable_device_macs_mv AS
-SELECT
-    'mac_address' AS observable_type,
-    mac AS observable_value,
-    lower(replace_all(mac, ':', '')) AS observable_value_normalized,
-    'device' AS entity_class,
-    device_uid AS entity_uid,
-    time AS entity_last_seen,
-    'device.mac' AS entity_path,
-    confidence_level / 4.0 AS confidence_score,
-    discovery_source,
-    time,
-    agent_id,
-    poller_id,
-
-    -- Enrichments
-    '' AS geo_country,
-    '' AS geo_region,
-    '' AS geo_city,
-    0 AS asn_number,
-    '' AS asn_org,
-    0.0 AS threat_score,
-    CAST([] AS array(string)) AS threat_categories,
-    CAST([] AS array(string)) AS threat_sources,
-    'hardware' AS observable_category,
-    CAST([] AS array(string)) AS tags,
-    metadata
-
-FROM ocsf_device_inventory
-ARRAY JOIN device_mac AS mac
-WHERE mac != '';
```
</details>

___

**Fix the `ocsf_observable_device_macs_mv` view by adding the missing `INTO ocsf_observable_index` clause and correcting the function name from `replace_all` to `replaceAll`.**

[pkg/db/migrations/schema.sql [1452-1485]](https://github.com/carverauto/serviceradar/pull/1755/files#diff-de4eeaf1d0460885258166a33d01394bc45f33ca3a95b4d36ef9c30b7a80e8e8R1452-R1485)

```diff
 -- Observable Index Population - Device MACs
 -- Index MAC addresses from device inventory
 DROP VIEW IF EXISTS ocsf_observable_device_macs_mv;
-CREATE MATERIALIZED VIEW ocsf_observable_device_macs_mv AS
+CREATE MATERIALIZED VIEW ocsf_observable_device_macs_mv
+INTO ocsf_observable_index
+AS
 SELECT
     'mac_address' AS observable_type,
     mac AS observable_value,
-    lower(replace_all(mac, ':', '')) AS observable_value_normalized,
+    lower(replaceAll(mac, ':', '')) AS observable_value_normalized,
     'device' AS entity_class,
     device_uid AS entity_uid,
     ...
 FROM ocsf_device_inventory
 ARRAY JOIN device_mac AS mac
 WHERE mac != '';
```

<details><summary>Suggestion importance[1-10]: 10</summary>

__

Why: The suggestion correctly identifies two critical errors in the `ocsf_observable_device_macs_mv` view definition: a missing `INTO` clause and an incorrect function name (`replace_all` instead of `replaceAll`). Both are syntax errors that would cause the migration to fail.
</details></details></td><td align=center>High</td></tr>

<tr><td>
<details><summary>Avoid holding lock during network call</summary>

___

**Refactor `reconnectCore` to release the mutex lock before making the blocking `connectToCore` network call to avoid contention.**

[pkg/poller/poller.go [481-497]](https://github.com/carverauto/serviceradar/pull/1755/files#diff-28a10dea1596540e55ce9a8b68bd1af3d96bd4634f6def668643892cef25a086R481-R497)

```diff
 func (p *Poller) reconnectCore(ctx context.Context) error {
 	reconnectCtx, cancel := context.WithTimeout(ctx, 15*time.Second)
 	defer cancel()

 	p.mu.Lock()
-	defer p.mu.Unlock()
-
 	if p.grpcClient != nil {
+		// Close the existing client while holding the lock to prevent its use.
 		if err := p.grpcClient.Close(); err != nil {
 			p.logger.Warn().Err(err).Msg("Failed to close existing core client during reconnect")
 		}

 		p.grpcClient = nil
 		p.coreClient = nil
 	}
+	p.mu.Unlock() // Release the lock before the blocking network call.

-	return p.connectToCore(reconnectCtx)
+	// connectToCore should not set the fields on `p` directly.
+	// It should return the new clients.
+	grpcClient, coreClient, err := p.connectToCore(reconnectCtx)
+	if err != nil {
+		return err
+	}
+
+	// Re-acquire the lock to safely update the shared fields.
+	p.mu.Lock()
+	defer p.mu.Unlock()
+	p.grpcClient = grpcClient
+	p.coreClient = coreClient
+
+	return nil
 }
```

<!-- /improve --apply_suggestion=3 -->

<details><summary>Suggestion importance[1-10]: 8</summary>

__

Why: The suggestion correctly identifies a significant concurrency issue where a mutex is held during a blocking network call, which can lead to performance degradation or deadlocks.
</details></details></td><td align=center>Medium</td></tr>

<tr><td>
<details><summary>✅ <s>Replace hardcoded IP with service name</s></summary>

___

<details><summary><b>Suggestion Impact:</b></summary>
The hardcoded IP was replaced with a configurable address defaulting to `sysmon-vm:50110`, aligning with the suggestion's intent to use the service name.

code diff:

```diff
-  "address": "192.168.1.219:50110",
+  "address": "${SYSMON_VM_ADDRESS:-sysmon-vm:50110}",
```
</details>

___

**In `docker/compose/sysmon-vm.checker.json`, replace the hardcoded IP address in the `address` field with the service name `sysmon-vm` to ensure portability within the Docker environment.**

[docker/compose/sysmon-vm.checker.json [4]](https://github.com/carverauto/serviceradar/pull/1755/files#diff-92ceb153a11705c17ec13f926f009bf33b539c872b859f5fd624dd4b5b4cd17dR4-R4)

```diff
-"address": "192.168.1.219:50110",
+"address": "sysmon-vm:50110",
```

<!-- /improve --apply_suggestion=4 -->

<details><summary>Suggestion importance[1-10]: 7</summary>

__

Why: The suggestion correctly identifies a hardcoded IP address and recommends using the Docker service name, which is a best practice for portability and maintainability in a containerized environment.
</details></details></td><td align=center>Medium</td></tr>

<tr><td>
<details><summary>✅ <s>Handle empty metric slices correctly<!-- not_implemented --></s></summary>

___

<details><summary><b>Suggestion Impact:</b></summary>
The commit implemented logic so that a "no cpu metrics appended" error is only returned when appending fails, and otherwise returns a specific error constant. Although an explicit early return for empty cpus is not shown in the diff, the updated handling of the "no metrics appended" case aligns with the suggestion's intent to avoid misleading errors for empty input. Additionally, error messages were refactored to use constants.

code diff:

```diff
...
```
</details></details></td><td></td></tr></tbody></table>
</details>