Sysmonvm/integration tests 2 #2316

Merged
mfreeman451 merged 13 commits from refs/pull/2316/head into main 2025-10-14 03:13:29 +00:00
mfreeman451 commented 2025-10-13 06:06:41 +00:00 (Migrated from github.com)
Owner

Imported from GitHub pull request.

Original GitHub pull request: #1755
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/pull/1755
Original created: 2025-10-13T06:06:41Z
Original updated: 2025-10-14T03:13:48Z
Original head: carverauto/serviceradar:sysmonvm/integration_tests_2
Original base: main
Original merged: 2025-10-14T03:13:29Z by @mfreeman451

Auto-created Ticket

#1756

PR Type

Enhancement, Bug fix, Documentation


Description

Major Enhancements:

  • Implemented automatic device registration from checker metadata, eliminating manual device setup requirements

  • Added automatic core reconnection logic in poller with retry mechanism for transient failures

  • Consolidated database migrations into single schema file, simplifying deployment and maintenance

  • Added CPU frequency metrics collection, processing, and visualization across the entire stack

  • Implemented per-checker security configuration support in agent

Infrastructure & Configuration:

  • Added authentication endpoints (/auth/login, /auth/refresh) to Kong API gateway

  • Added resource limits (8GB memory) and PROTON_DISABLE_TASKSTATS for Proton service

  • Implemented password persistence and recovery for Proton database across restarts

  • Preserved core configuration and auth keys across container restarts

  • Added Docker Compose override for local image development

UI Enhancements:

  • Added CPU frequency visualization components (card, chart, per-core details)

  • Integrated CPU frequency metrics into system metrics dashboard

  • Added service hint support to sysmon status indicator for better checker capability detection

  • Added configurable max value support to metric card component

Error Handling & Reliability:

  • Improved batch metric insertion with error tracking and validation

  • Added timestamp skew validation for sysmon payloads

  • Enhanced service messages with host identity for device correlation

Documentation:

  • Completely rewrote custom checker documentation focusing on gRPC architecture with sequence diagrams

  • Refined device identity validation rules for non-sweep sources

  • Updated migration file references in documentation

Configuration Changes:

  • Added sysmon-vm checker configuration with gRPC settings

  • Updated sweep configuration with ICMP mode

  • Removed incremental migration files in favor of consolidated schema


Diagram Walkthrough

flowchart LR
  checker["Custom Checker"] -- "gRPC with metadata" --> poller["Poller"]
  poller -- "enriched payload" --> core["Core"]
  core -- "auto-register" --> devices["Device Registry"]
  core -- "reconnect on failure" --> poller
  core -- "CPU frequency data" --> db["Database (schema.sql)"]
  db -- "metrics" --> web["Web UI"]
  web -- "visualize" --> cpufreq["CPU Frequency Charts"]
  kong["Kong Gateway"] -- "auth routes" --> core

File Walkthrough

Relevant files: Enhancement (13 files)

poller.go (pkg/poller/poller.go)
Add automatic core reconnection and error handling for poller

  • Added ...(truncated)
qodo-code-review[bot] commented 2025-10-13 06:08:19 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/1755#issuecomment-3395986839
Original created: 2025-10-13T06:08:19Z

PR Compliance Guide 🔍

(Compliance updated until commit github.com/carverauto/serviceradar@ffdbd2720a)

Below is a summary of compliance checks for this PR:

Security Compliance
🔴
Placeholder

Description: Previous string-concatenated SQL was removed; the new code uses parameterized queries
which is safe—no issue here.
sysmon.go [394-406]

Referred Code
func (s *APIServer) getCPUMetricsForDevice(
	ctx context.Context, _ db.SysmonMetricsProvider, deviceID string, start, end time.Time) (interface{}, error) {
	// Query cpu_metrics table directly for per-core data by device_id
	const cpuQuery = `
		SELECT timestamp, agent_id, host_id, core_id, usage_percent, frequency_hz, label, cluster
		FROM table(cpu_metrics)
		WHERE device_id = $1 AND timestamp BETWEEN $2 AND $3
		ORDER BY timestamp DESC, core_id ASC`

	rows, err := s.dbService.(*db.DB).Conn.Query(ctx, cpuQuery, deviceID, start, end)
	if err != nil {
		return nil, fmt.Errorf("failed to query CPU metrics for device %s: %w", deviceID, err)
	}
⚪
Device spoofing risk

Description: Device IDs are derived directly from a reported host IP in checker payloads without strong
authentication or validation, allowing a malicious or misconfigured checker to spoof IPs
and auto-register/update devices with high confidence.
devices.go [122-245]

Referred Code
func extractCheckerHostIdentity(serviceData json.RawMessage) (hostIP, hostname, hostID string) {
	if len(serviceData) == 0 {
		return "", "", ""
	}

	var payload any
	if err := json.Unmarshal(serviceData, &payload); err != nil {
		return "", "", ""
	}

	hostIP = firstStringMatch(payload,
		[]string{"status", "host_ip"},
		[]string{"status", "ip"},
		[]string{"status", "ip_address"},
		[]string{"host_ip"},
		[]string{"ip"},
		[]string{"ip_address"},
	)
	if hostIP == "" {
		hostIP = findStringByKeySubstring(payload, "ip")
	}


 ... (clipped 103 lines)
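The firstStringMatch helper referenced in the excerpt above is clipped; a plausible stdlib-only sketch, assuming it walks the decoded JSON value along each candidate key path and returns the first non-empty string it finds:

```go
package main

import "fmt"

// firstStringMatch walks a decoded JSON value along each candidate
// key path in order and returns the first non-empty string found.
// This is an assumed reconstruction, not the PR's actual helper.
func firstStringMatch(payload any, paths ...[]string) string {
	for _, path := range paths {
		node := payload
		ok := true
		for _, key := range path {
			m, isMap := node.(map[string]any)
			if !isMap {
				ok = false
				break
			}
			node, ok = m[key]
			if !ok {
				break
			}
		}
		if !ok {
			continue
		}
		if s, isStr := node.(string); isStr && s != "" {
			return s
		}
	}
	return ""
}

func main() {
	payload := map[string]any{"status": map[string]any{"host_ip": "10.0.0.5"}}
	// Nested path matches before the top-level fallbacks are tried.
	fmt.Println(firstStringMatch(payload,
		[]string{"status", "host_ip"},
		[]string{"host_ip"},
	)) // 10.0.0.5
}
```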
Ticket Compliance
🟡
🎫 #1756
🟢 Extend metrics schema and API to support CPU core labels and cluster-level frequency, and
return them in sysmon responses and queries.
Migrate database to store new CPU fields and create a cpu_cluster_metrics table; batch
writes should error meaningfully when nothing is appended.
Provide a gRPC sysmon-vm checker that reports per-core usage/frequency with
labels/clusters and cluster aggregates, including host identifiers and timestamps.
Ingest sysmon/sysmon-vm payloads, normalize timestamps, store new metrics, and
auto-register devices from checker payloads using host IP/ID with source confidence.
Add reconnection logic on gRPC error codes/timeouts for poller, enrich messages with
resolved host IP, carry agent_id, and support hot-reloading poll interval.
Allow checker-specific security settings via agent checker configs and bootstrap default
KV configs for common checkers including sysmon-vm.
Update CPU metrics API to return labels/clusters and add cluster metrics alongside
per-core data for poller- and device-scoped queries.
Deployment should bootstrap credentials and configurations for local demos, including
sysmon-vm endpoint discovery and Kong auth routes; preserve passwords/keys and
wait-for-core in web entrypoint; update K8s/demo scripts.
Add UI visualizations for CPU frequency, per-core frequency bars, cluster
badges/summaries, and host metadata panel; align heterogeneous series.
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
No custom compliance provided

Follow the guide to enable custom compliance check.

Compliance status legend
🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

Previous compliance checks

Compliance check up to commit 471472d
Security Compliance
⚪
Credential file exposure

Description: The script writes the Proton password into shared credentials with file path
/etc/serviceradar/credentials/proton-password; although chmod 600 is applied here, other
environments may mount this volume broadly—verify that the volume's ownership and
permissions prevent unintended container or host users from reading secrets.
proton-init.sh [62-69]

Referred Code
echo "$PROTON_PASSWORD" > /etc/proton-server/generated_password.txt
chmod 600 /etc/proton-server/generated_password.txt

if [ -d "/etc/serviceradar/credentials" ]; then
    echo "$PROTON_PASSWORD" > /etc/serviceradar/credentials/proton-password
    chmod 600 /etc/serviceradar/credentials/proton-password
    log_info "Password synchronized to shared credentials volume"
fi
Secret handling in k8s

Description: The init script stores the Proton password into a shared credentials file inside the
cluster; ensure the ConfigMap/emptyDir/volume backing this path enforces pod-level and
filesystem permissions so the secret is not world-readable across containers/namespaces.
deploy.sh [321-325]

Referred Code
# Also save to shared credentials volume for other services
if [ -d "/etc/serviceradar/credentials" ]; then
    echo "\$PROTON_PASSWORD" > /etc/serviceradar/credentials/proton-password
    chmod 600 /etc/serviceradar/credentials/proton-password
    echo "[Proton K8s Init] Password also saved to shared credentials volume"
SQL injection

Description: Raw SQL query string is built with fmt.Sprintf embedding device_id and timestamps; if
inputs are not sanitized or strictly controlled, this can allow SQL injection—prefer
parameterized queries.
sysmon.go [398-404]

Referred Code
	SELECT timestamp, agent_id, host_id, core_id, usage_percent, frequency_hz, label, cluster
	FROM table(cpu_metrics)
	WHERE device_id = '%s' AND timestamp BETWEEN '%s' AND '%s'
	ORDER BY timestamp DESC, core_id ASC`,
	deviceID, start.Format(time.RFC3339), end.Format(time.RFC3339))

rows, err := s.dbService.(*db.DB).Conn.Query(ctx, query)
SQL injection

Description: Another raw SQL query is built via fmt.Sprintf with unescaped device_id and timestamp
strings, posing SQL injection risk—use prepared statements with placeholders instead.
sysmon.go [453-461]

Referred Code
	SELECT timestamp, agent_id, host_id, cluster, frequency_hz
	FROM table(cpu_cluster_metrics)
	WHERE device_id = '%s' AND timestamp BETWEEN '%s' AND '%s'
	ORDER BY timestamp DESC, cluster ASC`,
	deviceID, start.Format(time.RFC3339), end.Format(time.RFC3339))

clusterRows, err := s.dbService.(*db.DB).Conn.Query(ctx, clusterQuery)
if err != nil {
	return nil, fmt.Errorf("failed to query CPU cluster metrics for device %s: %w", deviceID, err)
Untrusted JSON propagation

Description: The function enriches service messages by trusting and rewriting arbitrary JSON from
checks; if downstream systems log or render these fields without sanitization, it could
enable log injection or UI injection—ensure escaping/sanitization where messages are used.

agent_poller.go [272-325]

Referred Code
func enrichServiceMessageWithAddress(message []byte, check Check) []byte {
	if len(message) == 0 || check.Type != "grpc" || check.Details == "" {
		return message
	}

	hostCandidate := strings.TrimSpace(check.Details)
	host, _, err := net.SplitHostPort(hostCandidate)
	if err != nil {
		host = hostCandidate
	}

	if host == "" || net.ParseIP(host) == nil {
		return message
	}

	var payload map[string]any
	if err := json.Unmarshal(message, &payload); err != nil {
		return message
	}

	statusNode, ok := payload["status"].(map[string]any)


 ... (clipped 33 lines)
Ticket Compliance
🎫 No ticket provided
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
No custom compliance provided

Follow the guide to enable custom compliance check.

Compliance check up to commit fddabde
Security Compliance
🔴
Insecure file permissions

Description: The Proton password is written to /etc/serviceradar/credentials/proton-password with
world-readable permissions (chmod 644), which can expose credentials inside the
container/host.
proton-init.sh [65-69]

Referred Code
if [ -d "/etc/serviceradar/credentials" ]; then
    echo "$PROTON_PASSWORD" > /etc/serviceradar/credentials/proton-password
    chmod 644 /etc/serviceradar/credentials/proton-password
    log_info "Password synchronized to shared credentials volume"
fi
⚪
Insecure debug script

Description: The grpcurl helper hardcodes a private IP and uses plaintext gRPC without TLS, which risks
accidental exposure if executed in the wrong environment.
g.sh [8-11]

Referred Code
grpcurl -plaintext \
  -d '{"service_name":"sysmon-vm","service_type":"grpc","agent_id":"dev-agent","poller_id":"docker-poller"}' \
  192.168.1.219:50110 \
  monitoring.AgentService/GetStatus
Fragile error handling

Description: Reconnect decision relies partly on substring matching of error messages, which can be
unreliable and may lead to unintended retry storms; consider bounded backoff and stricter
code checks.
poller.go [522-527]

Referred Code
	return strings.Contains(errMsg, "connection error") ||
		strings.Contains(errMsg, "transport: Error while dialing") ||
		strings.Contains(errMsg, "name resolver error") ||
		strings.Contains(errMsg, "connection refused") ||
		strings.Contains(errMsg, "i/o timeout")
}
Ticket Compliance
🎫 No ticket provided
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
No custom compliance provided

Follow the guide to enable custom compliance check.

qodo-code-review[bot] commented 2025-10-13 06:09:32 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/1755#issuecomment-3395989674
Original created: 2025-10-13T06:09:32Z

PR Code Suggestions

Latest suggestions up to 7a4be7b

Category | Suggestion | Impact
Incremental [*]
Prevent IP from overwriting host ID

In enrichServiceMessageWithAddress, avoid overwriting host_id with an IP
address. Instead, add the resolved IP to a separate field like resolved_host_ip
to preserve the original host identifier.

pkg/poller/agent_poller.go [272-326]

 func enrichServiceMessageWithAddress(message []byte, check Check) []byte {
 	if len(message) == 0 || check.Type != "grpc" || check.Details == "" {
 		return message
 	}
 
 	hostCandidate := sanitizeTelemetryString(check.Details)
 	host, _, err := net.SplitHostPort(hostCandidate)
 	if err != nil {
 		host = hostCandidate
 	}
 
-	if host == "" || net.ParseIP(host) == nil {
+	if host == "" {
 		return message
 	}
 
-	safeHost := sanitizeTelemetryString(host)
+	var resolvedIP net.IP
+	if ip := net.ParseIP(host); ip != nil {
+		resolvedIP = ip
+	}
+
 	var payload map[string]any
 	if err := json.Unmarshal(message, &payload); err != nil {
 		return message
 	}
 
-	statusNode, ok := payload["status"].(map[string]any)
-	if !ok || statusNode == nil {
+	statusNode, _ := payload["status"].(map[string]any)
+	if statusNode == nil {
 		statusNode = make(map[string]any)
 	}
 
 	getString := func(key string) string {
 		if raw, exists := statusNode[key]; exists {
 			if str, ok := raw.(string); ok {
 				return sanitizeTelemetryString(str)
 			}
 		}
-
 		return ""
 	}
 
-	if hostIP := getString("host_ip"); hostIP != "" && !strings.EqualFold(hostIP, host) {
-		statusNode["reported_host_ip"] = hostIP
+	// If we parsed a valid IP from details, expose it without overwriting original host fields.
+	if resolvedIP != nil {
+		if hostIP := getString("host_ip"); hostIP != "" && !strings.EqualFold(hostIP, resolvedIP.String()) {
+			statusNode["reported_host_ip"] = hostIP
+		}
+		statusNode["resolved_host_ip"] = resolvedIP.String()
 	}
-	statusNode["host_ip"] = safeHost
 
-	if hostID := getString("host_id"); hostID != "" && !strings.EqualFold(hostID, host) {
+	// Do not set host_id from network address. Preserve any provided host_id.
+	if hostID := getString("host_id"); hostID != "" && resolvedIP != nil && !strings.EqualFold(hostID, resolvedIP.String()) {
 		statusNode["reported_host_id"] = hostID
 	}
-	statusNode["host_id"] = safeHost
 
 	payload["status"] = statusNode
-
 	enriched, err := json.Marshal(payload)
 	if err != nil {
 		return message
 	}
-
 	return enriched
 }
Suggestion importance[1-10]: 8


Why: The suggestion correctly identifies a bug where host_id is incorrectly overwritten with an IP address, which corrupts telemetry data and can break device correlation logic.

Medium
Optimize cpu_metrics sort key

To optimize performance for the cpu_metrics stream, remove cluster from the
ORDER BY clause and instead create a separate bloom filter index on it.

pkg/db/migrations/00000000000001_consolidated_serviceradar_schema.up.sql [211-222]

 CREATE STREAM IF NOT EXISTS cpu_metrics (
     timestamp         DateTime64(3),
     poller_id         string,
     agent_id          string,
     host_id           string,
     core_id           int32,
     usage_percent     float64,
     frequency_hz      float64,
     label             string,
     cluster           string,
     device_id         string,
     partition         string
 ) ENGINE = Stream(1, 1, rand())
 PARTITION BY int_div(to_unix_timestamp(timestamp), 3600)
-ORDER BY (timestamp, device_id, host_id, cluster, core_id)
+ORDER BY (timestamp, device_id, host_id, core_id)
 TTL to_start_of_day(coalesce(timestamp, _tp_time)) + INTERVAL 3 DAY
 SETTINGS index_granularity = 8192;
 
+ALTER STREAM cpu_metrics
+  ADD INDEX IF NOT EXISTS idx_cluster cluster TYPE bloom_filter GRANULARITY 1;
+

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 7


Why: The suggestion correctly identifies that keeping cluster in the ORDER BY clause is a potential performance issue and proposes a valid optimization by moving it to a separate index instead, which improves query performance and data ingestion.

Medium
Block until config exists

Prevent the serviceradar-poller from crashing on startup by modifying its
command to wait until the poller.json configuration file exists before
executing.

docker-compose.yml [329-348]

 serviceradar-poller:
   image: ghcr.io/carverauto/serviceradar-poller:latest
   container_name: serviceradar-poller
   volumes:
     - cert-data:/etc/serviceradar/certs:ro
     - generated-config:/etc/serviceradar/config:ro
     - poller-data:/var/lib/serviceradar
     - ./logs:/var/log/serviceradar
   environment:
-    # Config path
     - CONFIG_PATH=/etc/serviceradar/config/poller.json
-    # Logging
     - LOG_LEVEL=${LOG_LEVEL:-info}
   depends_on:
     cert-generator:
       condition: service_completed_successfully
     config-updater:
       condition: service_completed_successfully
     core:
       condition: service_healthy
-  command: ["/usr/local/bin/serviceradar-poller","-config","/etc/serviceradar/config/poller.json"]
+  command: ["sh","-lc","until [ -s /etc/serviceradar/config/poller.json ]; do echo '[poller] waiting for poller.json'; sleep 2; done; exec /usr/local/bin/serviceradar-poller -config /etc/serviceradar/config/poller.json"]

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 7


Why: The suggestion correctly identifies a race condition where the serviceradar-poller might start before its configuration file is ready, and proposes a robust shell command to wait for the file, preventing startup failures.

Medium
Ensure secure, reliable secret persistence

In proton-init.sh, ensure secret directories exist with secure permissions and
add error handling when writing the password file to prevent silent failures and
ensure credential persistence.

docker/compose/proton-init.sh [62-69]

-if [ -z "$PROTON_PASSWORD" ]; then
-    # First preference: existing shared credential file
-    if [ -f /etc/serviceradar/credentials/proton-password ] && \
-       [ -s /etc/serviceradar/credentials/proton-password ]; then
-        PROTON_PASSWORD=$(cat /etc/serviceradar/credentials/proton-password)
-        log_info "Reusing Proton password from shared credentials volume"
-    # Second preference: previously generated password inside Proton data dir
-    elif [ -f /etc/proton-server/generated_password.txt ] && \
-         [ -s /etc/proton-server/generated_password.txt ]; then
-        PROTON_PASSWORD=$(cat /etc/proton-server/generated_password.txt)
-        log_info "Reusing Proton password from generated_password.txt"
-    else
-        log_info "Generating random password..."
-        PROTON_PASSWORD=$(openssl rand -hex 16)
-        log_info "Generated new Proton password"
-    fi
+# Ensure directories exist with secure permissions before writing secrets
+install -d -m 700 /etc/proton-server || { log_error "Cannot create /etc/proton-server"; exit 1; }
+if [ -d "/etc/serviceradar" ]; then
+  install -d -m 700 /etc/serviceradar/credentials || { log_error "Cannot create /etc/serviceradar/credentials"; exit 1; }
 fi
 
 # Persist password to both expected locations to keep services aligned
-echo "$PROTON_PASSWORD" > /etc/proton-server/generated_password.txt
+echo "$PROTON_PASSWORD" > /etc/proton-server/generated_password.txt || { log_error "Failed to write Proton password"; exit 1; }
 chmod 600 /etc/proton-server/generated_password.txt
 
 if [ -d "/etc/serviceradar/credentials" ]; then
-    echo "$PROTON_PASSWORD" > /etc/serviceradar/credentials/proton-password
-    chmod 600 /etc/serviceradar/credentials/proton-password
-    log_info "Password synchronized to shared credentials volume"
+  echo "$PROTON_PASSWORD" > /etc/serviceradar/credentials/proton-password || { log_error "Failed to write shared Proton password"; exit 1; }
+  chmod 600 /etc/serviceradar/credentials/proton-password
+  log_info "Password synchronized to shared credentials volume"
 fi

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 6


Why: The suggestion improves the robustness of secret handling by adding explicit directory creation and error checking, which prevents silent failures and ensures password persistence.

Low
Possible issue
Prevent device ID collisions and validate IP

To prevent collisions, strengthen deviceID uniqueness by validating the hostIP
and including a tie-breaker such as agentID or svc.ServiceName in its generation.

pkg/core/devices.go [31-120]

 func (s *Server) ensureServiceDevice(
 	ctx context.Context,
 	agentID, pollerID, partition string,
 	svc *proto.ServiceStatus,
 	serviceData json.RawMessage,
 	timestamp time.Time,
 ) {
 	if svc == nil {
 		return
 	}
-
-	// Only gRPC checkers embed host context in a way we can reason about; skip other service types.
 	if svc.ServiceType != grpcServiceType {
 		return
 	}
-
-	// Ignore result streams such as sync/multi-part responses; they set Source to "results".
 	if svc.Source != "" && !strings.EqualFold(svc.Source, "getstatus") {
 		return
 	}
 
 	hostIP, hostname, hostID := extractCheckerHostIdentity(serviceData)
 	hostIP = normalizeHostIdentifier(hostIP)
-	if hostIP == "" || strings.EqualFold(hostIP, "unknown") {
+	parsed := net.ParseIP(hostIP)
+	if parsed == nil {
+		// refuse non-IP identifiers to avoid corrupting device registry
 		return
 	}
 
 	if partition == "" {
 		partition = "default"
 	}
 
-	deviceID := fmt.Sprintf("%s:%s", partition, hostIP)
+	// Use the normalized string form for device ID. For IPv6, use the canonical compressed format.
+	ipKey := parsed.String()
+
+	// Add a stable tie-breaker to mitigate collisions where the same IP might be observed
+	// in overlapping contexts (e.g., NAT or shared targets).
+	extraScope := agentID
+	if extraScope == "" {
+		extraScope = svc.ServiceName
+	}
+	deviceID := fmt.Sprintf("%s:%s:%s", partition, ipKey, extraScope)
 
 	metadata := map[string]string{
-		"source":             "checker",
-		"checker_service":    svc.ServiceName,
-		"checker_service_id": svc.ServiceName,
-		"last_update":        timestamp.Format(time.RFC3339),
+		"source":               "checker",
+		"checker_service":      svc.ServiceName,
+		"checker_service_id":   svc.ServiceName,
+		"last_update":          timestamp.Format(time.RFC3339),
+		"checker_host_ip":      ipKey,
+		"device_id_scope_hint": extraScope,
 	}
-
 	if svc.ServiceType != "" {
 		metadata["checker_service_type"] = svc.ServiceType
 	}
-
 	if agentID != "" {
 		metadata["collector_agent_id"] = agentID
 	}
-
 	if pollerID != "" {
 		metadata["collector_poller_id"] = pollerID
 	}
-
 	if hostID != "" {
 		metadata["checker_host_id"] = hostID
 	}
-
-	metadata["checker_host_ip"] = hostIP
 
 	deviceUpdate := &models.DeviceUpdate{
 		AgentID:     agentID,
 		PollerID:    pollerID,
 		Partition:   partition,
 		DeviceID:    deviceID,
 		Source:      models.DiscoverySourceSelfReported,
-		IP:          hostIP,
+		IP:          ipKey,
 		Timestamp:   timestamp,
 		IsAvailable: true,
 		Metadata:    metadata,
 		Confidence:  models.GetSourceConfidence(models.DiscoverySourceSelfReported),
 	}
-
 	if hostname != "" {
 		deviceUpdate.Hostname = &hostname
 	}
 
 	if s.DeviceRegistry != nil {
 		if err := s.DeviceRegistry.ProcessDeviceUpdate(ctx, deviceUpdate); err != nil {
 			s.logger.Warn().
 				Err(err).
 				Str("device_id", deviceID).
 				Str("service_name", svc.ServiceName).
 				Msg("Failed to process checker device through device registry")
 		}
 	} else {
 		s.logger.Warn().
 			Str("device_id", deviceID).
 			Str("service_name", svc.ServiceName).
 			Msg("DeviceRegistry not available for checker device registration")
 	}
 }
Suggestion importance[1-10]: 8


Why: The suggestion addresses a significant architectural weakness in device identification, preventing potential data collisions and malformed deviceIDs, thus improving data integrity.

Medium
Add missing usage and tighten ORDER BY

Add the missing usage_percent column to the cpu_cluster_metrics stream and
include poller_id and agent_id in its ORDER BY clause to ensure data correctness
and improve query performance.

pkg/db/migrations/00000000000001_consolidated_serviceradar_schema.up.sql [224-240]

-CREATE STREAM IF NOT EXISTS cpu_metrics (
-    core_id           int32,
-    usage_percent     float64,
-    frequency_hz      float64,
-    label             string,
-    cluster           string,
-    device_id         string,
-    partition         string
-) ENGINE = Stream(1, 1, rand())
-PARTITION BY int_div(to_unix_timestamp(timestamp), 3600)
-ORDER BY (timestamp, device_id, host_id, cluster, core_id)
-TTL to_start_of_day(coalesce(timestamp, _tp_time)) + INTERVAL 3 DAY
-SETTINGS index_granularity = 8192;
-
 CREATE STREAM IF NOT EXISTS cpu_cluster_metrics (
     timestamp         DateTime64(3),
     poller_id         string,
     agent_id          string,
     host_id           string,
     cluster           string,
+    usage_percent     float64,
     frequency_hz      float64,
     device_id         string,
     partition         string
 ) ENGINE = Stream(1, 1, rand())
 PARTITION BY int_div(to_unix_timestamp(timestamp), 3600)
-ORDER BY (timestamp, cluster, host_id, device_id)
+ORDER BY (timestamp, cluster, host_id, device_id, poller_id, agent_id)
 TTL to_start_of_day(coalesce(timestamp, _tp_time)) + INTERVAL 3 DAY
 SETTINGS index_granularity = 8192;
 
 ALTER STREAM cpu_cluster_metrics
   ADD INDEX IF NOT EXISTS idx_cluster cluster TYPE bloom_filter GRANULARITY 1;

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 8


Why: The suggestion correctly identifies that the new cpu_cluster_metrics stream is missing a usage_percent field, which is crucial for UI components that display cluster usage. It also correctly points out that poller_id and agent_id should be in the ORDER BY clause for data integrity and performance, which is a critical consideration for streaming databases.

Medium
Add defensive null checks and stable keys

Add a null check for each device in the devices.map loop and provide a fallback
key using the index to prevent potential runtime errors and improve component
stability.

web/src/components/Devices/DeviceTable.tsx [209-218]

-{devices.map(device => {
-    const metadata = device.metadata || {};
-    const sysmonServiceHint = typeof metadata === 'object' &&
+{devices.map((device, idx) => {
+    if (!device || typeof device !== 'object') return null;
+
+    const metadata = (device as any).metadata || {};
+    const sysmonServiceHint =
+        typeof metadata === 'object' &&
         metadata !== null &&
         typeof metadata.checker_service === 'string' &&
         metadata.checker_service.toLowerCase().includes('sysmon');
 
+    const rowKey = device.device_id ?? `device-row-${idx}`;
+
     return (
-    <Fragment key={device.device_id}>
+      <Fragment key={rowKey}>
         <tr className="hover:bg-gray-700/30">
Suggestion importance[1-10]: 6


Why: The suggestion improves code robustness by adding a null check for device and providing a fallback for the React key, which are good defensive programming practices against potentially malformed data.

Low
General
Use consistent truncated timestamps

In GetAllCPUMetrics, use the truncated timestamp for the final SysmonCPUResponse
to ensure consistent grouping and alignment between CPU and cluster metrics.

pkg/db/metrics.go [826-912]

-func (db *DB) GetAllCPUMetrics(
-	ctx context.Context, pollerID string, start, end time.Time) ([]models.SysmonCPUResponse, error) {
-	rows, err := db.Conn.Query(ctx, `
-		SELECT timestamp, agent_id, host_id, core_id, usage_percent, frequency_hz, label, cluster
-		FROM table(cpu_metrics)
-		WHERE poller_id = $1 AND timestamp BETWEEN $2 AND $3
-		ORDER BY timestamp DESC, core_id ASC`,
-		pollerID, start, end)
-	if err != nil {
-		return nil, fmt.Errorf("failed to query CPU metrics: %w", err)
-	}
-	defer db.CloseRows(rows)
+result := make([]models.SysmonCPUResponse, 0, len(data))
 
-	data := make(map[time.Time][]models.CPUMetric)
-	clustersByTimestamp := make(map[time.Time][]models.CPUClusterMetric)
-
-	for rows.Next() {
-		var m models.CPUMetric
-
-		var agentID, hostID, label, cluster string
-
-		var timestamp time.Time
-
-		if err := rows.Scan(&timestamp, &agentID, &hostID, &m.CoreID, &m.UsagePercent, &m.FrequencyHz, &label, &cluster); err != nil {
-			db.logger.Error().Err(err).Msg("Error scanning CPU metric row")
-			continue
-		}
-		key := timestamp.Truncate(time.Second)
-		m.Timestamp = timestamp
-		m.AgentID = agentID
-		m.HostID = hostID
-		m.Label = label
-		m.Cluster = cluster
-		data[key] = append(data[key], m)
-	}
-
-	if err := rows.Err(); err != nil {
-		db.logger.Error().Err(err).Msg("Error iterating CPU metric rows")
-
-		return nil, err
-	}
-
-	clusterRows, err := db.Conn.Query(ctx, `
-		SELECT timestamp, agent_id, host_id, cluster, frequency_hz
-		FROM table(cpu_cluster_metrics)
-		WHERE poller_id = $1 AND timestamp BETWEEN $2 AND $3
-		ORDER BY timestamp DESC, cluster ASC`,
-		pollerID, start, end)
-	if err != nil {
-		db.logger.Error().Err(err).Msg("Error querying CPU cluster metrics")
-
-		return nil, fmt.Errorf("failed to query CPU cluster metrics: %w", err)
-	}
-	defer db.CloseRows(clusterRows)
-
-	for clusterRows.Next() {
-		var c models.CPUClusterMetric
-		var timestamp time.Time
-
-		if err := clusterRows.Scan(&timestamp, &c.AgentID, &c.HostID, &c.Name, &c.FrequencyHz); err != nil {
-			db.logger.Error().Err(err).Msg("Error scanning CPU cluster metric row")
-			continue
-		}
-		key := timestamp.Truncate(time.Second)
-		c.Timestamp = timestamp
-		clustersByTimestamp[key] = append(clustersByTimestamp[key], c)
-	}
-
-	if err := clusterRows.Err(); err != nil {
-		db.logger.Error().Err(err).Msg("Error iterating CPU cluster metric rows")
-
-		return nil, err
-	}
-
-	result := make([]models.SysmonCPUResponse, 0, len(data))
-
-	for ts, cpus := range data {
-		result = append(result, models.SysmonCPUResponse{
-			Cpus:      cpus,
-			Clusters:  clustersByTimestamp[ts],
-			Timestamp: cpus[0].Timestamp,
-		})
-	}
-
-	return result, nil
+for ts, cpus := range data {
+	truncated := ts.Truncate(time.Second)
+	result = append(result, models.SysmonCPUResponse{
+		Cpus:      cpus,
+		Clusters:  clustersByTimestamp[ts],
+		Timestamp: truncated,
+	})
 }

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 7


Why: The suggestion correctly identifies a subtle bug where inconsistent timestamp handling could lead to data misalignment between CPU and cluster metrics.

Medium

Previous suggestions

Suggestions up to commit 86685fd
Category | Suggestion | Impact
Possible issue
Preserve original host identity fields

Avoid overwriting host_id and host_ip in the service message. Instead, store the
address derived from check.Details in separate auxiliary fields to preserve the
original agent-reported identity.

pkg/poller/agent_poller.go [272-325]

 func enrichServiceMessageWithAddress(message []byte, check Check) []byte {
 	if len(message) == 0 || check.Type != "grpc" || check.Details == "" {
 		return message
 	}
 
 	hostCandidate := strings.TrimSpace(check.Details)
 	host, _, err := net.SplitHostPort(hostCandidate)
 	if err != nil {
 		host = hostCandidate
 	}
 
+	// Only attach auxiliary info if the candidate parses as an IP
 	if host == "" || net.ParseIP(host) == nil {
 		return message
 	}
 
 	var payload map[string]any
 	if err := json.Unmarshal(message, &payload); err != nil {
 		return message
 	}
 
-	statusNode, ok := payload["status"].(map[string]any)
-	if !ok || statusNode == nil {
+	statusNode, _ := payload["status"].(map[string]any)
+	if statusNode == nil {
 		statusNode = make(map[string]any)
 	}
 
-	getString := func(key string) string {
-		if raw, exists := statusNode[key]; exists {
-			if str, ok := raw.(string); ok {
-				return strings.TrimSpace(str)
-			}
+	// Preserve existing fields; add auxiliary fields without overwriting originals
+	if existing, ok := statusNode["host_ip"]; ok && existing != nil {
+		// keep original, add derived separately if different
+		if s, ok := existing.(string); ok && !strings.EqualFold(strings.TrimSpace(s), host) {
+			statusNode["derived_host_ip"] = host
 		}
-
-		return ""
+	} else {
+		// no original present; still avoid writing as canonical field
+		statusNode["derived_host_ip"] = host
 	}
 
-	if hostIP := getString("host_ip"); hostIP != "" && !strings.EqualFold(hostIP, host) {
-		statusNode["reported_host_ip"] = hostIP
+	// Do not overwrite host_id; expose derived host_id from address only as auxiliary
+	if _, ok := statusNode["host_id"]; !ok {
+		statusNode["derived_host_id"] = host
 	}
-	statusNode["host_ip"] = host
-
-	if hostID := getString("host_id"); hostID != "" && !strings.EqualFold(hostID, host) {
-		statusNode["reported_host_id"] = hostID
-	}
-	statusNode["host_id"] = host
 
 	payload["status"] = statusNode
-
 	enriched, err := json.Marshal(payload)
 	if err != nil {
 		return message
 	}
-
 	return enriched
 }
Suggestion importance[1-10]: 9


Why: The suggestion correctly identifies a critical data integrity issue where agent-reported identity (host_ip, host_id) is overwritten, which could lead to incorrect device correlation and data misattribution.

High
Validate device key IP input

Validate that the extracted hostIP is a valid IP address before using it to
construct a deviceID to prevent creating invalid device records from hostnames
or other strings.

pkg/core/devices.go [31-62]

-func (s *Server) ensureServiceDevice(
-	ctx context.Context,
-	agentID, pollerID, partition string,
-	svc *proto.ServiceStatus,
-	serviceData json.RawMessage,
-	timestamp time.Time,
-) {
-	if svc == nil {
-		return
-	}
+hostIP, hostname, hostID := extractCheckerHostIdentity(serviceData)
+hostIP = normalizeHostIdentifier(hostIP)
+// Require a valid IP address to avoid using hostnames/strings as device keys
+if ip := net.ParseIP(hostIP); ip == nil {
+	return
+}
+if hostIP == "" || strings.EqualFold(hostIP, "unknown") {
+	return
+}
 
-	// Only gRPC checkers embed host context in a way we can reason about; skip other service types.
-	if svc.ServiceType != grpcServiceType {
-		return
-	}
+if partition == "" {
+	partition = "default"
+}
 
-	// Ignore result streams such as sync/multi-part responses; they set Source to "results".
-	if svc.Source != "" && !strings.EqualFold(svc.Source, "getstatus") {
-		return
-	}
+deviceID := fmt.Sprintf("%s:%s", partition, hostIP)
 
-	hostIP, hostname, hostID := extractCheckerHostIdentity(serviceData)
-	hostIP = normalizeHostIdentifier(hostIP)
-	if hostIP == "" || strings.EqualFold(hostIP, "unknown") {
-		return
-	}
-
-	if partition == "" {
-		partition = "default"
-	}
-
-	deviceID := fmt.Sprintf("%s:%s", partition, hostIP)
-
Suggestion importance[1-10]: 8


Why: The suggestion correctly identifies that a non-IP string could be used in deviceID, leading to invalid device records. Enforcing IP validation for the device key is a crucial fix for data integrity.

Medium
Add indexes for new cluster fields
Suggestion Impact: The commit added bloom filter indexes on cpu_metrics for cluster and label, matching the suggestion. It also added an index for cluster on cpu_cluster_metrics and adjusted ORDER BY clauses.

code diff:

+ALTER STREAM cpu_metrics
+  ADD INDEX IF NOT EXISTS idx_cluster cluster TYPE bloom_filter GRANULARITY 1;
+ALTER STREAM cpu_metrics
+  ADD INDEX IF NOT EXISTS idx_label label TYPE bloom_filter GRANULARITY 1;

Add bloom filter indexes to the cpu_metrics stream for the new cluster and label
fields to improve query performance.

pkg/db/migrations/00000000000001_consolidated_serviceradar_schema.up.sql [209-222]

 CREATE STREAM IF NOT EXISTS cpu_metrics (
     timestamp         DateTime64(3),
     poller_id         string,
     agent_id          string,
     host_id           string,
     core_id           int32,
     usage_percent     float64,
     frequency_hz      float64,
     label             string,
     cluster           string,
     device_id         string,
     partition         string
 ) ENGINE = Stream(1, 1, rand())
 PARTITION BY int_div(to_unix_timestamp(timestamp), 3600)
 ORDER BY (timestamp, device_id, host_id, core_id)
 TTL to_start_of_day(coalesce(timestamp, _tp_time)) + INTERVAL 3 DAY
 SETTINGS index_granularity = 8192;
 
+ALTER STREAM cpu_metrics
+  ADD INDEX IF NOT EXISTS idx_cluster cluster TYPE bloom_filter GRANULARITY 1;
+ALTER STREAM cpu_metrics
+  ADD INDEX IF NOT EXISTS idx_label label TYPE bloom_filter GRANULARITY 1;
+
Suggestion importance[1-10]: 8


Why: This is a valid and important performance optimization. Adding bloom filter indexes on the new cluster and label fields is crucial for efficient querying, especially since these fields are used for filtering and grouping in the UI, preventing full stream scans on cpu_metrics.

Medium
Optimize sort key and index for clusters
Suggestion Impact: The commit reordered the cpu_cluster_metrics ORDER BY to (timestamp, cluster, host_id, device_id) and added a bloom filter index on cluster for that stream, matching the suggestion's intent.

code diff:

-ORDER BY (timestamp, device_id, host_id, cluster)
+ORDER BY (timestamp, cluster, host_id, device_id)
 TTL to_start_of_day(coalesce(timestamp, _tp_time)) + INTERVAL 3 DAY
 SETTINGS index_granularity = 8192;
 
+ALTER STREAM cpu_cluster_metrics
+  ADD INDEX IF NOT EXISTS idx_cluster cluster TYPE bloom_filter GRANULARITY 1;

Optimize the cpu_cluster_metrics stream by reordering the primary key to
prioritize cluster and adding a bloom filter index on it.

pkg/db/migrations/00000000000001_consolidated_serviceradar_schema.up.sql [224-237]

 CREATE STREAM IF NOT EXISTS cpu_cluster_metrics (
     timestamp         DateTime64(3),
     poller_id         string,
     agent_id          string,
     host_id           string,
     cluster           string,
     frequency_hz      float64,
     device_id         string,
     partition         string
 ) ENGINE = Stream(1, 1, rand())
 PARTITION BY int_div(to_unix_timestamp(timestamp), 3600)
-ORDER BY (timestamp, device_id, host_id, cluster)
+ORDER BY (timestamp, cluster, host_id, device_id)
 TTL to_start_of_day(coalesce(timestamp, _tp_time)) + INTERVAL 3 DAY
 SETTINGS index_granularity = 8192;
 
+ALTER STREAM cpu_cluster_metrics
+  ADD INDEX IF NOT EXISTS idx_cluster cluster TYPE bloom_filter GRANULARITY 1;
+

[Suggestion processed]

Suggestion importance[1-10]: 8


Why: This suggestion correctly identifies a performance issue in the new cpu_cluster_metrics stream. Reordering the primary key to (timestamp, cluster, ...) and adding a bloom filter index on cluster will significantly improve the performance of common UI queries that filter by cluster.

Medium
Add safe db type assertion

Add a safe type assertion when accessing s.dbService to prevent a potential
panic if the underlying type is not db.DB.

pkg/core/api/sysmon.go [459]

-clusterRows, err := s.dbService.(*db.DB).Conn.Query(ctx, clusterQuery)
+var (
+	dbImpl *db.DB
+	ok     bool
+)
+if dbImpl, ok = s.dbService.(*db.DB); !ok || dbImpl == nil || dbImpl.Conn == nil {
+	return nil, fmt.Errorf("database service unavailable for cluster query")
+}
+clusterRows, err := dbImpl.Conn.Query(ctx, clusterQuery)
Suggestion importance[1-10]: 7


Why: The suggestion correctly points out a potential panic from an unsafe type assertion, improving the code's robustness and preventing a server crash, which is a valuable improvement for production stability.

Medium
Add defensive null checks
Suggestion Impact: The commit added a defensive check ensuring serviceData exists before accessing its properties and provided a nullish coalescing fallback for serviceData.details.

code diff:

+        if (
+            serviceData &&
+            (serviceData.service_name === "sysmon-vm" || serviceData.name === "sysmon-vm")
+        ) {
             return (
                 <SysmonVmDetails
                     service={serviceData}
-                    details={serviceData.details}
+                    details={serviceData.details ?? {}}

Add null checks for the serviceData object and its details property to prevent
potential runtime errors during rendering.

web/src/components/Service/Dashboard.tsx [280-287]

-if (serviceData.service_name === 'sysmon-vm' || serviceData.name === 'sysmon-vm') {
+if (serviceData && (serviceData.service_name === 'sysmon-vm' || serviceData.name === 'sysmon-vm')) {
     return (
         <SysmonVmDetails
             service={serviceData}
-            details={serviceData.details}
+            details={serviceData.details ?? {}}
         />
     );
 }

[Suggestion processed]

Suggestion importance[1-10]: 7


Why: The suggestion correctly identifies a potential null pointer exception if serviceData is not yet loaded, improving the component's robustness against runtime errors.

Medium
Clamp and sanitize metric value
Suggestion Impact: The commit added safeMax/rawValue/clampedValue and replaced uses of current with clampedValue and max with safeMax across StatusIndicator, ValueDisplay, and ProgressBar.

code diff:

+    const safeMax = Number.isFinite(max) && max > 0 ? max : 100;
+    const rawValue = Number.isFinite(current) ? current : 0;
+    const clampedValue = Math.min(Math.max(rawValue, 0), safeMax);
+
     return (
         <div className="bg-white dark:bg-gray-800 rounded-lg p-3 shadow transition-colors">
             <div className="flex justify-between items-center mb-2">
@@ -129,23 +133,23 @@
                     <span className="text-sm font-medium text-gray-600 dark:text-gray-300">{title}</span>
                 </div>
                 <StatusIndicator
-                    value={current}
+                    value={clampedValue}
                     warning={warning}
                     critical={critical}
                 />
             </div>
             <ValueDisplay
-                value={current}
+                value={clampedValue}
                 unit={unit}
                 warning={warning}
                 critical={critical}
                 change={change}
             />
             <ProgressBar
-                value={current}
+                value={clampedValue}
                 warning={warning}
                 critical={critical}
-                max={max}
+                max={safeMax}
             />

Sanitize the current and max props in the MetricCard component by clamping the
value to a safe range [0, max] to prevent UI issues.

web/src/components/Metrics/shared-components.jsx [123-153]

 export const MetricCard = ({ title, current, unit, warning, critical, change, icon, children, max = 100 }) => {
+    const safeMax = Number.isFinite(max) && max > 0 ? max : 100;
+    const rawValue = Number.isFinite(current) ? current : 0;
+    const clampedValue = Math.min(Math.max(rawValue, 0), safeMax);
     return (
         <div className="bg-white dark:bg-gray-800 rounded-lg p-3 shadow transition-colors">
             <div className="flex justify-between items-center mb-2">
                 <div className="flex items-center space-x-2">
                     {icon && <div className="text-gray-500 dark:text-gray-300">{icon}</div>}
                     <h3 className="text-sm font-medium text-gray-900 dark:text-white">{title}</h3>
                 </div>
                 {typeof change !== 'undefined' && (
                     <span className={`text-xs ${change >= 0 ? 'text-green-600' : 'text-red-600'}`}>
                         {change >= 0 ? '+' : ''}{change}%
                     </span>
                 )}
             </div>
             <MetricProgressBar
-                value={current}
+                value={clampedValue}
                 warning={warning}
                 critical={critical}
-                max={max}
+                max={safeMax}
             />
             {children}
         </div>
     );
 };

[Suggestion processed]

Suggestion importance[1-10]: 7

__

Why: This suggestion improves the component's resilience by sanitizing the current and max props, preventing potential UI bugs or crashes from invalid data.

Medium
Incremental [*]
Normalize timestamps for joining
Suggestion Impact: The commit implemented truncating timestamps to second-resolution for both CPU metrics and cluster metrics map keys, and adjusted the response timestamp to use the original metric timestamp.

code diff:

+		key := timestamp.Truncate(time.Second)
 		m.Timestamp = timestamp
 		m.AgentID = agentID
 		m.HostID = hostID
 		m.Label = label
 		m.Cluster = cluster
-		data[timestamp] = append(data[timestamp], m)
+		data[key] = append(data[key], m)
 	}
 
 	if err := rows.Err(); err != nil {
@@ -888,9 +888,9 @@
 			db.logger.Error().Err(err).Msg("Error scanning CPU cluster metric row")
 			continue
 		}
-
+		key := timestamp.Truncate(time.Second)
 		c.Timestamp = timestamp
-		clustersByTimestamp[timestamp] = append(clustersByTimestamp[timestamp], c)
+		clustersByTimestamp[key] = append(clustersByTimestamp[key], c)
 	}
 
 	if err := clusterRows.Err(); err != nil {
@@ -905,7 +905,7 @@
 		result = append(result, models.SysmonCPUResponse{
 			Cpus:      cpus,
 			Clusters:  clustersByTimestamp[ts],
-			Timestamp: ts,
+			Timestamp: cpus[0].Timestamp,
 		})

In GetAllCPUMetrics, normalize timestamps by truncating them to the second
before using them as map keys to ensure correct correlation between core and
cluster metrics.

pkg/db/metrics.go [826-912]

 func (db *DB) GetAllCPUMetrics(
 	ctx context.Context, pollerID string, start, end time.Time) ([]models.SysmonCPUResponse, error) {
 	rows, err := db.Conn.Query(ctx, `
 		SELECT timestamp, agent_id, host_id, core_id, usage_percent, frequency_hz, label, cluster
 		FROM table(cpu_metrics)
 		WHERE poller_id = $1 AND timestamp BETWEEN $2 AND $3
 		ORDER BY timestamp DESC, core_id ASC`,
 		pollerID, start, end)
 	if err != nil {
 		return nil, fmt.Errorf("failed to query CPU metrics: %w", err)
 	}
 	defer db.CloseRows(rows)
 
 	data := make(map[time.Time][]models.CPUMetric)
 	clustersByTimestamp := make(map[time.Time][]models.CPUClusterMetric)
 
 	for rows.Next() {
 		var m models.CPUMetric
-
 		var agentID, hostID, label, cluster string
-
-		var timestamp time.Time
-
-		if err := rows.Scan(&timestamp, &agentID, &hostID, &m.CoreID, &m.UsagePercent, &m.FrequencyHz, &label, &cluster); err != nil {
+		var ts time.Time
+		if err := rows.Scan(&ts, &agentID, &hostID, &m.CoreID, &m.UsagePercent, &m.FrequencyHz, &label, &cluster); err != nil {
 			db.logger.Error().Err(err).Msg("Error scanning CPU metric row")
 			continue
 		}
-
-		m.Timestamp = timestamp
+		// Normalize key to seconds to ensure joinability
+		key := ts.Truncate(time.Second)
+		m.Timestamp = ts
 		m.AgentID = agentID
 		m.HostID = hostID
 		m.Label = label
 		m.Cluster = cluster
-		data[timestamp] = append(data[timestamp], m)
+		data[key] = append(data[key], m)
 	}
-
 	if err := rows.Err(); err != nil {
 		db.logger.Error().Err(err).Msg("Error iterating CPU metric rows")
-
 		return nil, err
 	}
 
 	clusterRows, err := db.Conn.Query(ctx, `
 		SELECT timestamp, agent_id, host_id, cluster, frequency_hz
 		FROM table(cpu_cluster_metrics)
 		WHERE poller_id = $1 AND timestamp BETWEEN $2 AND $3
 		ORDER BY timestamp DESC, cluster ASC`,
 		pollerID, start, end)
 	if err != nil {
 		db.logger.Error().Err(err).Msg("Error querying CPU cluster metrics")
-
 		return nil, fmt.Errorf("failed to query CPU cluster metrics: %w", err)
 	}
 	defer db.CloseRows(clusterRows)
 
 	for clusterRows.Next() {
 		var c models.CPUClusterMetric
-		var timestamp time.Time
-
-		if err := clusterRows.Scan(&timestamp, &c.AgentID, &c.HostID, &c.Name, &c.FrequencyHz); err != nil {
+		var ts time.Time
+		if err := clusterRows.Scan(&ts, &c.AgentID, &c.HostID, &c.Name, &c.FrequencyHz); err != nil {
 			db.logger.Error().Err(err).Msg("Error scanning CPU cluster metric row")
 			continue
 		}
-
-		c.Timestamp = timestamp
-		clustersByTimestamp[timestamp] = append(clustersByTimestamp[timestamp], c)
+		key := ts.Truncate(time.Second)
+		c.Timestamp = ts
+		clustersByTimestamp[key] = append(clustersByTimestamp[key], c)
 	}
-
 	if err := clusterRows.Err(); err != nil {
 		db.logger.Error().Err(err).Msg("Error iterating CPU cluster metric rows")
-
 		return nil, err
 	}
 
 	result := make([]models.SysmonCPUResponse, 0, len(data))
-
-	for ts, cpus := range data {
+	for tsKey, cpus := range data {
 		result = append(result, models.SysmonCPUResponse{
 			Cpus:      cpus,
-			Clusters:  clustersByTimestamp[ts],
-			Timestamp: ts,
+			Clusters:  clustersByTimestamp[tsKey],
+			Timestamp: cpus[0].Timestamp, // original timestamp for display
 		})
 	}
-
 	return result, nil
 }
Suggestion importance[1-10]: 8

__

Why: The suggestion correctly identifies a critical Go pitfall of using time.Time as a map key, which would cause data to be mis-associated, and provides the correct fix by truncating the timestamp.

Medium
Guard optional nested data
Suggestion Impact: The commit updated the component to use optional chaining and array checks: wrapped cpuFrequency and memory with data?., changed cpu.clusters and cpu.cores and disk.drives to Array.isArray checks, matching the suggested guards.

code diff:

-                        {data.cpu?.clusters && data.cpu.clusters.length > 0 && (
+                        {Array.isArray(data?.cpu?.clusters) && data.cpu.clusters.length > 0 && (
                             <CpuClusterDetails clusters={data.cpu.clusters} />
                         )}
-                        {data.cpuFrequency && <CpuFrequencyDetails data={data.cpuFrequency} />}
-                        <CpuCoresChart cores={data.cpu.cores} />
-                        <FilesystemDetails drives={data.disk.drives} />
-                        <MemoryDetails data={data.memory} />
+                        {data?.cpuFrequency && <CpuFrequencyDetails data={data.cpuFrequency} />}
+                        {Array.isArray(data?.cpu?.cores) && <CpuCoresChart cores={data.cpu.cores} />}
+                        {Array.isArray(data?.disk?.drives) && <FilesystemDetails drives={data.disk.drives} />}
+                        {data?.memory && <MemoryDetails data={data.memory} />}

Add null-safe guards for data.cpu.cores, data.disk.drives, and data.memory to
prevent the application from crashing if these properties are accessed before
they are loaded.

web/src/components/Metrics/system-metrics.jsx [288-296]

     {activeTab === 'details' && (
         <div className="space-y-6">
-            {data.cpu?.clusters && data.cpu.clusters.length > 0 && (
+            {Array.isArray(data?.cpu?.clusters) && data.cpu.clusters.length > 0 && (
                 <CpuClusterDetails clusters={data.cpu.clusters} />
             )}
-            {data.cpuFrequency && <CpuFrequencyDetails data={data.cpuFrequency} />}
-            <CpuCoresChart cores={data.cpu.cores} />
-            <FilesystemDetails drives={data.disk.drives} />
-            <MemoryDetails data={data.memory} />
+            {data?.cpuFrequency && <CpuFrequencyDetails data={data.cpuFrequency} />}
+            {Array.isArray(data?.cpu?.cores) && <CpuCoresChart cores={data.cpu.cores} />}
+            {Array.isArray(data?.disk?.drives) && <FilesystemDetails drives={data.disk.drives} />}
+            {data?.memory && <MemoryDetails data={data.memory} />}

[Suggestion processed]

Suggestion importance[1-10]: 8

__

Why: This suggestion correctly identifies missing null-safe guards for nested properties like data.cpu.cores, preventing a likely runtime crash when data is partially loaded, which is a critical fix for UI stability.

Medium
Include cluster in sorting key
Suggestion Impact: The commit updated cpu_metrics to ORDER BY (timestamp, device_id, host_id, cluster, core_id), adding cluster as suggested. It also added bloom filter indexes and adjusted ordering for another stream.

code diff:

+ORDER BY (timestamp, device_id, host_id, cluster, core_id)
+ TTL to_start_of_day(coalesce(timestamp, _tp_time)) + INTERVAL 3 DAY
+ SETTINGS index_granularity = 8192;

Add the new cluster field to the ORDER BY clause of the cpu_metrics stream. This
will improve query performance and data locality for cluster-based aggregations.

pkg/db/migrations/00000000000001_consolidated_serviceradar_schema.up.sql [209-222]

 CREATE STREAM IF NOT EXISTS cpu_metrics (
     timestamp         DateTime64(3),
     poller_id         string,
     agent_id          string,
     host_id           string,
     core_id           int32,
     usage_percent     float64,
     frequency_hz      float64,
     label             string,
     cluster           string,
     device_id         string,
     partition         string
 ) ENGINE = Stream(1, 1, rand())
 PARTITION BY int_div(to_unix_timestamp(timestamp), 3600)
-ORDER BY (timestamp, device_id, host_id, core_id)
+ORDER BY (timestamp, device_id, host_id, cluster, core_id)
 TTL to_start_of_day(coalesce(timestamp, _tp_time)) + INTERVAL 3 DAY
 SETTINGS index_granularity = 8192;
Suggestion importance[1-10]: 7

__

Why: This is a valid and important performance optimization for the cpu_metrics stream, as including cluster in the ORDER BY clause will improve data locality and query efficiency for cluster-based aggregations.

Medium
General
Remove hardcoded private IP
Suggestion Impact: The hardcoded IP address was replaced with an environment-variable-based address: "${SYSMON_VM_ADDRESS:-sysmon-vm:50110}".

code diff:

-  "address": "192.168.1.219:50110",
+  "address": "${SYSMON_VM_ADDRESS:-sysmon-vm:50110}",

Replace the hardcoded IP address in sysmon-vm.checker.json with a configurable
value, such as an environment variable, to improve deployment flexibility.

docker/compose/sysmon-vm.checker.json [1-9]

 {
   "name": "sysmon-vm",
   "type": "grpc",
-  "address": "192.168.1.219:50110",
+  "address": "${SYSMON_VM_ADDRESS:-sysmon-vm:50110}",
   "security": {
     "mode": "none",
     "role": "agent"
   }
 }

[Suggestion processed]

Suggestion importance[1-10]: 8

__

Why: The suggestion correctly points out that a hardcoded IP address harms portability and should be replaced with a configurable value, which is a critical best practice for deployment configurations.

Medium
Align Proton memory to container limits
Suggestion Impact: The commit added the suggested Proton memory environment variables (PROTON_MAX_SERVER_MEMORY, PROTON_MAX_QUERY_MEMORY, PROTON_MAX_PARTITION_BYTES) along with a comment explaining the alignment with container limits.

code diff:

+      # Align Proton memory usage with container limits to reduce OOM risk
+      - PROTON_MAX_SERVER_MEMORY=6g
+      - PROTON_MAX_QUERY_MEMORY=4g
+      - PROTON_MAX_PARTITION_BYTES=256m

Set explicit memory limits for the Proton service via environment variables to
align with container memory limits and prevent OOM kills.

docker-compose.yml [72-92]

 services:
   proton:
     image: ghcr.io/carverauto/serviceradar-proton:latest
     container_name: serviceradar-proton
     user: "0:0"
     mem_limit: 8g
     mem_reservation: 6g
     ports:
       - "8123:8123"
       - "8463:8463"
     environment:
       - PROTON_PASSWORD=${PROTON_PASSWORD:-}
       - PROTON_LOG_LEVEL=${PROTON_LOG_LEVEL:-error}
       - PROTON_DISABLE_TASKSTATS=1
+      # Align Proton internal memory to container cgroup limits to prevent OOM
+      - PROTON_MAX_SERVER_MEMORY=6g
+      - PROTON_MAX_QUERY_MEMORY=4g
+      - PROTON_MAX_PARTITION_BYTES=256m
Suggestion importance[1-10]: 7

__

Why: The suggestion correctly points out a potential for OOM kills by aligning Proton's internal memory settings with the container's memory limits. This enhances the stability of the proton service under load, making it a valuable configuration improvement.

Medium
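A quick way to sanity-check that settings like PROTON_MAX_SERVER_MEMORY=6g actually fit under mem_limit: 8g is to parse the size strings and compare. A sketch with an illustrative parser (compose and Proton have their own, stricter parsers):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMemLimit converts compose-style sizes like "6g", "256m", or "8192"
// (plain bytes) into a byte count, using binary (1024-based) multipliers.
func parseMemLimit(s string) (int64, error) {
	s = strings.ToLower(strings.TrimSpace(s))
	mult := int64(1)
	switch {
	case strings.HasSuffix(s, "k"):
		mult, s = 1<<10, strings.TrimSuffix(s, "k")
	case strings.HasSuffix(s, "m"):
		mult, s = 1<<20, strings.TrimSuffix(s, "m")
	case strings.HasSuffix(s, "g"):
		mult, s = 1<<30, strings.TrimSuffix(s, "g")
	}
	n, err := strconv.ParseInt(s, 10, 64)
	if err != nil {
		return 0, err
	}
	return n * mult, nil
}

func main() {
	server, _ := parseMemLimit("6g")    // PROTON_MAX_SERVER_MEMORY
	container, _ := parseMemLimit("8g") // mem_limit
	fmt.Println(server < container)     // true: server budget fits the cgroup limit
}
```

Keeping the server budget a couple of gigabytes under the cgroup limit leaves headroom for page cache and ancillary processes, which is the point of the suggestion.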
Suggestions up to commit fddabde
High-level
Database migration strategy breaks existing deployments

The PR replaces the incremental database migration system with a single schema
file that only runs on empty databases. This breaks the upgrade path for
existing deployments, as there is no way to apply new schema changes to
databases that already contain data.

Examples:

pkg/db/migrate.go [35-75]
// RunMigrations applies the consolidated database schema when no tables exist yet.
func RunMigrations(ctx context.Context, conn proton.Conn, log logger.Logger) error {
	log.Info().Msg("Ensuring database schema is applied")

	applied, err := schemaAlreadyApplied(ctx, conn)
	if err != nil {
		return fmt.Errorf("failed to inspect existing schema: %w", err)
	}

	if applied {

 ... (clipped 31 lines)

Solution Walkthrough:

Before:

// In pkg/db/migrate.go
func RunMigrations(ctx, conn) error {
    createMigrationsTable(ctx, conn) // Ensure tracking table exists
    appliedMigrations := getAppliedMigrations(ctx, conn)
    availableMigrations := getAvailableMigrations() // from *.up.sql files

    for _, migrationFile := range availableMigrations {
        if !isApplied(migrationFile, appliedMigrations) {
            content := readMigrationFile(migrationFile)
            executeMigration(ctx, conn, content)
            recordMigration(ctx, conn, migrationFile)
        }
    }
    return nil
}

After:

// In pkg/db/migrate.go
const schemaFile = "migrations/schema.sql"

func RunMigrations(ctx, conn) error {
    applied, err := schemaAlreadyApplied(ctx, conn)
    if applied {
        log("Schema already present; skipping apply")
        return nil
    }

    // If not applied, run the whole schema file
    content := readSchemaFile(schemaFile)
    statements := splitSQLStatements(content)
    for _, stmt := range statements {
        conn.Exec(ctx, stmt)
    }
    log("Database schema applied successfully")
    return nil
}

func schemaAlreadyApplied(ctx, conn) (bool, error) {
    // SELECT count() FROM system.tables WHERE ... name = 'device_updates'
    // return count > 0
}

Suggestion importance[1-10]: 10

__

Why: The suggestion correctly identifies a critical flaw where the new migration strategy removes the upgrade path for existing databases, which is a breaking change for all current deployments.

High
Possible issue
Fix missing target stream clauses

Add the missing INTO clause to the ocsf_devices_aggregator_mv and
ocsf_users_aggregator_mv materialized views to specify their target streams and
fix a syntax error.

pkg/db/migrations/schema.sql [1298-1408]

 -- OCSF Materialized Views Migration
 -- These views automatically aggregate event data into entity state and observable indexes
 -- Enables real-time updates from event streams to current state streams
 
 -- Device State Aggregation
 -- Populates ocsf_devices_current from device inventory events
 DROP VIEW IF EXISTS ocsf_devices_aggregator_mv;
-CREATE MATERIALIZED VIEW ocsf_devices_aggregator_mv AS
+CREATE MATERIALIZED VIEW ocsf_devices_aggregator_mv
+INTO ocsf_devices_current
+AS
 SELECT
     device_uid,
     time AS last_seen,
 ...
 FROM ocsf_device_inventory;
 
 -- User State Aggregation
 -- Populates ocsf_users_current from user inventory events
 DROP VIEW IF EXISTS ocsf_users_aggregator_mv;
-CREATE MATERIALIZED VIEW ocsf_users_aggregator_mv AS
+CREATE MATERIALIZED VIEW ocsf_users_aggregator_mv
+INTO ocsf_users_current
+AS
 SELECT
     user_uid,
     time AS last_seen,
 ...
 FROM ocsf_user_inventory;
Suggestion importance[1-10]: 10

__

Why: The suggestion correctly identifies a critical syntax error in two CREATE MATERIALIZED VIEW statements that are missing the required INTO clause. This would cause the SQL migration to fail, so the fix is essential for correctness.

High
Fix view definition and function name
Suggestion Impact:The committed diff updated the ocsf_observable_device_macs_mv definition to include the INTO ocsf_observable_index clause and corrected the function call to replaceAll for normalizing MAC addresses.

code diff:

--- Observable Index Population - Device MACs
--- Index MAC addresses from device inventory
-DROP VIEW IF EXISTS ocsf_observable_device_macs_mv;
-CREATE MATERIALIZED VIEW ocsf_observable_device_macs_mv AS
-SELECT
-    'mac_address' AS observable_type,
-    mac AS observable_value,
-    lower(replace_all(mac, ':', '')) AS observable_value_normalized,
-    'device' AS entity_class,
-    device_uid AS entity_uid,
-    time AS entity_last_seen,
-    'device.mac' AS entity_path,
-    confidence_level / 4.0 AS confidence_score,
-    discovery_source,
-    time,
-    agent_id,
-    poller_id,
-
-    -- Enrichments
-    '' AS geo_country,
-    '' AS geo_region,
-    '' AS geo_city,
-    0 AS asn_number,
-    '' AS asn_org,
-    0.0 AS threat_score,
-    CAST([] AS array(string)) AS threat_categories,
-    CAST([] AS array(string)) AS threat_sources,
-    'hardware' AS observable_category,
-    CAST([] AS array(string)) AS tags,
-    metadata
-
-FROM ocsf_device_inventory
-ARRAY JOIN device_mac AS mac
-WHERE mac != '';

Fix the ocsf_observable_device_macs_mv view by adding the missing INTO
ocsf_observable_index clause and correcting the function name from replace_all
to replaceAll.

pkg/db/migrations/schema.sql [1452-1485]

 -- Observable Index Population - Device MACs
 -- Index MAC addresses from device inventory
 DROP VIEW IF EXISTS ocsf_observable_device_macs_mv;
-CREATE MATERIALIZED VIEW ocsf_observable_device_macs_mv AS
+CREATE MATERIALIZED VIEW ocsf_observable_device_macs_mv
+INTO ocsf_observable_index
+AS
 SELECT
     'mac_address' AS observable_type,
     mac AS observable_value,
-    lower(replace_all(mac, ':', '')) AS observable_value_normalized,
+    lower(replaceAll(mac, ':', '')) AS observable_value_normalized,
     'device' AS entity_class,
     device_uid AS entity_uid,
 ...
 FROM ocsf_device_inventory
 ARRAY JOIN device_mac AS mac
 WHERE mac != '';
Suggestion importance[1-10]: 10

__

Why: The suggestion correctly identifies two critical errors in the ocsf_observable_device_macs_mv view definition: a missing INTO clause and an incorrect function name (replace_all instead of replaceAll). Both are syntax errors that would cause the migration to fail.

High
Avoid holding lock during network call

Refactor reconnectCore to release the mutex lock before making the blocking
connectToCore network call to avoid contention.

pkg/poller/poller.go [481-497]

 func (p *Poller) reconnectCore(ctx context.Context) error {
 	reconnectCtx, cancel := context.WithTimeout(ctx, 15*time.Second)
 	defer cancel()
 
 	p.mu.Lock()
-	defer p.mu.Unlock()
-
 	if p.grpcClient != nil {
+		// Close the existing client while holding the lock to prevent its use.
 		if err := p.grpcClient.Close(); err != nil {
 			p.logger.Warn().Err(err).Msg("Failed to close existing core client during reconnect")
 		}
 		p.grpcClient = nil
 		p.coreClient = nil
 	}
+	p.mu.Unlock() // Release the lock before the blocking network call.
 
-	return p.connectToCore(reconnectCtx)
+	// connectToCore should not set the fields on `p` directly.
+	// It should return the new clients.
+	grpcClient, coreClient, err := p.connectToCore(reconnectCtx)
+	if err != nil {
+		return err
+	}
+
+	// Re-acquire the lock to safely update the shared fields.
+	p.mu.Lock()
+	defer p.mu.Unlock()
+	p.grpcClient = grpcClient
+	p.coreClient = coreClient
+
+	return nil
 }
Suggestion importance[1-10]: 8

__

Why: The suggestion correctly identifies a significant concurrency issue where a mutex is held during a blocking network call, which can lead to performance degradation or deadlocks.

Medium
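The lock/release/re-lock pattern the suggestion describes can be shown in isolation; a minimal sketch with a stand-in dial function (not the actual poller API — connectToCore would need to return the clients rather than set fields itself):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type client struct{ addr string }

// dial stands in for the blocking connectToCore network call; hypothetical.
func dial(addr string) (*client, error) {
	time.Sleep(10 * time.Millisecond) // simulate network latency
	return &client{addr: addr}, nil
}

type poller struct {
	mu sync.Mutex
	c  *client
}

// reconnect drops the old client under the lock, dials with the lock
// released, then re-acquires the lock only to publish the new client.
func (p *poller) reconnect(addr string) error {
	p.mu.Lock()
	p.c = nil // invalidate the old client while holding the lock
	p.mu.Unlock()

	c, err := dial(addr) // blocking call happens outside the lock
	if err != nil {
		return err
	}

	p.mu.Lock()
	defer p.mu.Unlock()
	p.c = c
	return nil
}

func main() {
	p := &poller{}
	if err := p.reconnect("core:50051"); err != nil {
		panic(err)
	}
	fmt.Println(p.c.addr) // core:50051
}
```

The trade-off is a brief window where other goroutines observe a nil client; callers must already tolerate that, since the old client was being torn down anyway.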
Replace hardcoded IP with service name
Suggestion Impact: The hardcoded IP was replaced with a configurable address defaulting to "sysmon-vm:50110", aligning with the suggestion's intent to use the service name.

code diff:

-  "address": "192.168.1.219:50110",
+  "address": "${SYSMON_VM_ADDRESS:-sysmon-vm:50110}",

In docker/compose/sysmon-vm.checker.json, replace the hardcoded IP address in
the address field with the service name sysmon-vm to ensure portability within
the Docker environment.

docker/compose/sysmon-vm.checker.json [4]

-"address": "192.168.1.219:50110",
+"address": "sysmon-vm:50110",
Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies a hardcoded IP address and recommends using the Docker service name, which is a best practice for portability and maintainability in a containerized environment.

Medium
Handle empty metric slices correctly
Suggestion Impact: The commit implemented logic so that a "no cpu metrics appended" error is only returned when appending fails, and otherwise returns a specific error constant. Although an explicit early return for empty cpus is not shown in the diff, the updated handling of the "no metrics appended" case aligns with the suggestion's intent to avoid misleading errors for empty input. Additionally, error messages were refactored to use constants.

code diff:

...
Imported GitHub PR comment. Original author: @qodo-code-review[bot] Original URL: https://github.com/carverauto/serviceradar/pull/1755#issuecomment-3395989674 Original created: 2025-10-13T06:09:32Z --- ## PR Code Suggestions ✨ <!-- 7a4be7b --> Latest suggestions up to 7a4be7b <table><thead><tr><td><strong>Category</strong></td><td align=left><strong>Suggestion&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </strong></td><td align=center><strong>Impact</strong></td></tr><tbody><tr><td rowspan=4>Incremental <sup><a href='https://qodo-merge-docs.qodo.ai/core-abilities/incremental_update/'>[*]</a></sup></td> <td> <details><summary>Prevent IP from overwriting host ID</summary> ___ **In <code>enrichServiceMessageWithAddress</code>, avoid overwriting <code>host_id</code> with an IP <br>address. 
Instead, add the resolved IP to a separate field like <code>resolved_host_ip</code> <br>to preserve the original host identifier.** [pkg/poller/agent_poller.go [272-326]](https://github.com/carverauto/serviceradar/pull/1755/files#diff-7cdaae416a1e7b5d39ad75f138c49fa0aca12fc38e8ab0b26bb59c64446da15cR272-R326) ```diff func enrichServiceMessageWithAddress(message []byte, check Check) []byte { if len(message) == 0 || check.Type != "grpc" || check.Details == "" { return message } hostCandidate := sanitizeTelemetryString(check.Details) host, _, err := net.SplitHostPort(hostCandidate) if err != nil { host = hostCandidate } - if host == "" || net.ParseIP(host) == nil { + if host == "" { return message } - safeHost := sanitizeTelemetryString(host) + var resolvedIP net.IP + if ip := net.ParseIP(host); ip != nil { + resolvedIP = ip + } + var payload map[string]any if err := json.Unmarshal(message, &payload); err != nil { return message } - statusNode, ok := payload["status"].(map[string]any) - if !ok || statusNode == nil { + statusNode, _ := payload["status"].(map[string]any) + if statusNode == nil { statusNode = make(map[string]any) } getString := func(key string) string { if raw, exists := statusNode[key]; exists { if str, ok := raw.(string); ok { return sanitizeTelemetryString(str) } } - return "" } - if hostIP := getString("host_ip"); hostIP != "" && !strings.EqualFold(hostIP, host) { - statusNode["reported_host_ip"] = hostIP + // If we parsed a valid IP from details, expose it without overwriting original host fields. + if resolvedIP != nil { + if hostIP := getString("host_ip"); hostIP != "" && !strings.EqualFold(hostIP, resolvedIP.String()) { + statusNode["reported_host_ip"] = hostIP + } + statusNode["resolved_host_ip"] = resolvedIP.String() } - statusNode["host_ip"] = safeHost - if hostID := getString("host_id"); hostID != "" && !strings.EqualFold(hostID, host) { + // Do not set host_id from network address. Preserve any provided host_id. 
+ if hostID := getString("host_id"); hostID != "" && resolvedIP != nil && !strings.EqualFold(hostID, resolvedIP.String()) { statusNode["reported_host_id"] = hostID } - statusNode["host_id"] = safeHost payload["status"] = statusNode - enriched, err := json.Marshal(payload) if err != nil { return message } - return enriched } ``` - [ ] **Apply / Chat** <!-- /improve --apply_suggestion=0 --> <details><summary>Suggestion importance[1-10]: 8</summary> __ Why: The suggestion correctly identifies a bug where `host_id` is incorrectly overwritten with an IP address, which corrupts telemetry data and can break device correlation logic. </details></details></td><td align=center>Medium </td></tr><tr><td> <details><summary>Optimize cpu_metrics sort key</summary> ___ **To optimize performance for the <code>cpu_metrics</code> stream, remove <code>cluster</code> from the <br><code>ORDER BY</code> clause and instead create a separate bloom filter index on it.** [pkg/db/migrations/00000000000001_consolidated_serviceradar_schema.up.sql [211-222]](https://github.com/carverauto/serviceradar/pull/1755/files#diff-1e05de747238f2112bb2230aac8db388e4c80eebe84c071eb78e035d64e67eb6R211-R222) ```diff CREATE STREAM IF NOT EXISTS cpu_metrics ( timestamp DateTime64(3), poller_id string, agent_id string, host_id string, core_id int32, usage_percent float64, frequency_hz float64, label string, cluster string, device_id string, partition string ) ENGINE = Stream(1, 1, rand()) PARTITION BY int_div(to_unix_timestamp(timestamp), 3600) -ORDER BY (timestamp, device_id, host_id, cluster, core_id) +ORDER BY (timestamp, device_id, host_id, core_id) TTL to_start_of_day(coalesce(timestamp, _tp_time)) + INTERVAL 3 DAY SETTINGS index_granularity = 8192; +ALTER STREAM cpu_metrics + ADD INDEX IF NOT EXISTS idx_cluster cluster TYPE bloom_filter GRANULARITY 1; + ``` `[To ensure code accuracy, apply this suggestion manually]` <details><summary>Suggestion importance[1-10]: 7</summary> __ Why: The suggestion correctly 
identifies a potential performance issue by adding `cluster` to the `ORDER BY` clause and proposes a valid optimization by using a separate index instead, which improves query performance and data ingestion. </details></details></td><td align=center>Medium </td></tr><tr><td> <details><summary>Block until config exists</summary> ___ **Prevent the <code>serviceradar-poller</code> from crashing on startup by modifying its <br>command to wait until the <code>poller.json</code> configuration file exists before <br>executing.** [docker-compose.yml [329-348]](https://github.com/carverauto/serviceradar/pull/1755/files#diff-e45e45baeda1c1e73482975a664062aa56f20c03dd9d64a827aba57775bed0d3R329-R348) ```diff serviceradar-poller: image: ghcr.io/carverauto/serviceradar-poller:latest container_name: serviceradar-poller volumes: - cert-data:/etc/serviceradar/certs:ro - generated-config:/etc/serviceradar/config:ro - poller-data:/var/lib/serviceradar - ./logs:/var/log/serviceradar environment: - # Config path - CONFIG_PATH=/etc/serviceradar/config/poller.json - # Logging - LOG_LEVEL=${LOG_LEVEL:-info} depends_on: cert-generator: condition: service_completed_successfully config-updater: condition: service_completed_successfully core: condition: service_healthy - command: ["/usr/local/bin/serviceradar-poller","-config","/etc/serviceradar/config/poller.json"] + command: ["sh","-lc","until [ -s /etc/serviceradar/config/poller.json ]; do echo '[poller] waiting for poller.json'; sleep 2; done; exec /usr/local/bin/serviceradar-poller -config /etc/serviceradar/config/poller.json"] ``` `[To ensure code accuracy, apply this suggestion manually]` <details><summary>Suggestion importance[1-10]: 7</summary> __ Why: The suggestion correctly identifies a race condition where the `serviceradar-poller` might start before its configuration file is ready, and proposes a robust shell command to wait for the file, preventing startup failures. 
</details></details></td><td align=center>Medium </td></tr><tr><td> <details><summary>Ensure secure, reliable secret persistence</summary> ___ **In <code>proton-init.sh</code>, ensure secret directories exist with secure permissions and <br>add error handling when writing the password file to prevent silent failures and <br>ensure credential persistence.** [docker/compose/proton-init.sh [62-69]](https://github.com/carverauto/serviceradar/pull/1755/files#diff-9f0e9459b12fc8cb2671d78218467129cbb18b110443a0799c7559cf7ba19d4eR62-R69) ```diff -if [ -z "$PROTON_PASSWORD" ]; then - # First preference: existing shared credential file - if [ -f /etc/serviceradar/credentials/proton-password ] && \ - [ -s /etc/serviceradar/credentials/proton-password ]; then - PROTON_PASSWORD=$(cat /etc/serviceradar/credentials/proton-password) - log_info "Reusing Proton password from shared credentials volume" - # Second preference: previously generated password inside Proton data dir - elif [ -f /etc/proton-server/generated_password.txt ] && \ - [ -s /etc/proton-server/generated_password.txt ]; then - PROTON_PASSWORD=$(cat /etc/proton-server/generated_password.txt) - log_info "Reusing Proton password from generated_password.txt" - else - log_info "Generating random password..." 
- PROTON_PASSWORD=$(openssl rand -hex 16) - log_info "Generated new Proton password" - fi +# Ensure directories exist with secure permissions before writing secrets +install -d -m 700 /etc/proton-server || { log_error "Cannot create /etc/proton-server"; exit 1; } +if [ -d "/etc/serviceradar" ]; then + install -d -m 700 /etc/serviceradar/credentials || { log_error "Cannot create /etc/serviceradar/credentials"; exit 1; } fi # Persist password to both expected locations to keep services aligned -echo "$PROTON_PASSWORD" > /etc/proton-server/generated_password.txt +echo "$PROTON_PASSWORD" > /etc/proton-server/generated_password.txt || { log_error "Failed to write Proton password"; exit 1; } chmod 600 /etc/proton-server/generated_password.txt if [ -d "/etc/serviceradar/credentials" ]; then - echo "$PROTON_PASSWORD" > /etc/serviceradar/credentials/proton-password - chmod 600 /etc/serviceradar/credentials/proton-password - log_info "Password synchronized to shared credentials volume" + echo "$PROTON_PASSWORD" > /etc/serviceradar/credentials/proton-password || { log_error "Failed to write shared Proton password"; exit 1; } + chmod 600 /etc/serviceradar/credentials/proton-password + log_info "Password synchronized to shared credentials volume" fi ``` `[To ensure code accuracy, apply this suggestion manually]` <details><summary>Suggestion importance[1-10]: 6</summary> __ Why: The suggestion improves the robustness of secret handling by adding explicit directory creation and error checking, which prevents silent failures and ensures password persistence. 
</details></details></td><td align=center>Low

</td></tr><tr><td rowspan=3>Possible issue</td>
<td>

<details><summary>Prevent device ID collisions and validate IP</summary>

___

**To prevent collisions, make the <code>deviceID</code> more unique by validating the <code>hostIP</code> and including a tie-breaker like <code>agentID</code> or <code>svc.ServiceName</code> in its generation.**

[pkg/core/devices.go [31-120]](https://github.com/carverauto/serviceradar/pull/1755/files#diff-fb7f5159558f22afdc809b9cbf5a9c0f71560be2f7389c944da8f1604998ea48R31-R120)

```diff
 func (s *Server) ensureServiceDevice(
 	ctx context.Context,
 	agentID, pollerID, partition string,
 	svc *proto.ServiceStatus,
 	serviceData json.RawMessage,
 	timestamp time.Time,
 ) {
 	if svc == nil {
 		return
 	}
-
-	// Only gRPC checkers embed host context in a way we can reason about; skip other service types.
 	if svc.ServiceType != grpcServiceType {
 		return
 	}
-
-	// Ignore result streams such as sync/multi-part responses; they set Source to "results".
 	if svc.Source != "" && !strings.EqualFold(svc.Source, "getstatus") {
 		return
 	}
 
 	hostIP, hostname, hostID := extractCheckerHostIdentity(serviceData)
 	hostIP = normalizeHostIdentifier(hostIP)
-	if hostIP == "" || strings.EqualFold(hostIP, "unknown") {
+	parsed := net.ParseIP(hostIP)
+	if parsed == nil {
+		// refuse non-IP identifiers to avoid corrupting device registry
 		return
 	}
 
 	if partition == "" {
 		partition = "default"
 	}
 
-	deviceID := fmt.Sprintf("%s:%s", partition, hostIP)
+	// Use the normalized string form for device ID. For IPv6, use the canonical compressed format.
+	ipKey := parsed.String()
+
+	// Add a stable tie-breaker to mitigate collisions where the same IP might be observed
+	// in overlapping contexts (e.g., NAT or shared targets).
+	extraScope := agentID
+	if extraScope == "" {
+		extraScope = svc.ServiceName
+	}
+	deviceID := fmt.Sprintf("%s:%s:%s", partition, ipKey, extraScope)
 
 	metadata := map[string]string{
-		"source":             "checker",
-		"checker_service":    svc.ServiceName,
-		"checker_service_id": svc.ServiceName,
-		"last_update":        timestamp.Format(time.RFC3339),
+		"source":               "checker",
+		"checker_service":      svc.ServiceName,
+		"checker_service_id":   svc.ServiceName,
+		"last_update":          timestamp.Format(time.RFC3339),
+		"checker_host_ip":      ipKey,
+		"device_id_scope_hint": extraScope,
 	}
-
 	if svc.ServiceType != "" {
 		metadata["checker_service_type"] = svc.ServiceType
 	}
-
 	if agentID != "" {
 		metadata["collector_agent_id"] = agentID
 	}
-
 	if pollerID != "" {
 		metadata["collector_poller_id"] = pollerID
 	}
-
 	if hostID != "" {
 		metadata["checker_host_id"] = hostID
 	}
-
-	metadata["checker_host_ip"] = hostIP
 
 	deviceUpdate := &models.DeviceUpdate{
 		AgentID:     agentID,
 		PollerID:    pollerID,
 		Partition:   partition,
 		DeviceID:    deviceID,
 		Source:      models.DiscoverySourceSelfReported,
-		IP:          hostIP,
+		IP:          ipKey,
 		Timestamp:   timestamp,
 		IsAvailable: true,
 		Metadata:    metadata,
 		Confidence:  models.GetSourceConfidence(models.DiscoverySourceSelfReported),
 	}
-
 	if hostname != "" {
 		deviceUpdate.Hostname = &hostname
 	}
 
 	if s.DeviceRegistry != nil {
 		if err := s.DeviceRegistry.ProcessDeviceUpdate(ctx, deviceUpdate); err != nil {
 			s.logger.Warn().
 				Err(err).
 				Str("device_id", deviceID).
 				Str("service_name", svc.ServiceName).
 				Msg("Failed to process checker device through device registry")
 		}
 	} else {
 		s.logger.Warn().
 			Str("device_id", deviceID).
 			Str("service_name", svc.ServiceName).
 			Msg("DeviceRegistry not available for checker device registration")
 	}
 }
```

- [ ] **Apply / Chat** <!-- /improve --apply_suggestion=4 -->

<details><summary>Suggestion importance[1-10]: 8</summary>

**Why:** The suggestion addresses a significant architectural weakness in device identification, preventing potential data collisions and malformed `deviceID`s, thus improving data integrity.
</details></details></td><td align=center>Medium

</td></tr><tr><td>

<details><summary>Add missing usage and tighten ORDER BY</summary>

___

**Add the missing <code>usage_percent</code> column to the <code>cpu_cluster_metrics</code> stream and include <code>poller_id</code> and <code>agent_id</code> in its <code>ORDER BY</code> clause to ensure data correctness and improve query performance.**

[pkg/db/migrations/00000000000001_consolidated_serviceradar_schema.up.sql [224-240]](https://github.com/carverauto/serviceradar/pull/1755/files#diff-1e05de747238f2112bb2230aac8db388e4c80eebe84c071eb78e035d64e67eb6R224-R240)

```diff
-CREATE STREAM IF NOT EXISTS cpu_metrics (
-    core_id int32,
-    usage_percent float64,
-    frequency_hz float64,
-    label string,
-    cluster string,
-    device_id string,
-    partition string
-) ENGINE = Stream(1, 1, rand())
-PARTITION BY int_div(to_unix_timestamp(timestamp), 3600)
-ORDER BY (timestamp, device_id, host_id, cluster, core_id)
-TTL to_start_of_day(coalesce(timestamp, _tp_time)) + INTERVAL 3 DAY
-SETTINGS index_granularity = 8192;
-
 CREATE STREAM IF NOT EXISTS cpu_cluster_metrics (
     timestamp DateTime64(3),
     poller_id string,
     agent_id string,
     host_id string,
     cluster string,
+    usage_percent float64,
     frequency_hz float64,
     device_id string,
     partition string
 ) ENGINE = Stream(1, 1, rand())
 PARTITION BY int_div(to_unix_timestamp(timestamp), 3600)
-ORDER BY (timestamp, cluster, host_id, device_id)
+ORDER BY (timestamp, cluster, host_id, device_id, poller_id, agent_id)
 TTL to_start_of_day(coalesce(timestamp, _tp_time)) + INTERVAL 3 DAY
 SETTINGS index_granularity = 8192;
 
 ALTER STREAM cpu_cluster_metrics
     ADD INDEX IF NOT EXISTS idx_cluster cluster TYPE bloom_filter GRANULARITY 1;
```

`[To ensure code accuracy, apply this suggestion manually]`

<details><summary>Suggestion importance[1-10]: 8</summary>

**Why:** The suggestion correctly identifies that the new `cpu_cluster_metrics` stream is missing a `usage_percent` field, which is crucial for UI components that display cluster usage. It also correctly points out that `poller_id` and `agent_id` should be in the `ORDER BY` clause for data integrity and performance, which is a critical consideration for streaming databases.

</details></details></td><td align=center>Medium

</td></tr><tr><td>

<details><summary>Add defensive null checks and stable keys</summary>

___

**Add a null check for each <code>device</code> in the <code>devices.map</code> loop and provide a fallback <code>key</code> using the index to prevent potential runtime errors and improve component stability.**

[web/src/components/Devices/DeviceTable.tsx [209-218]](https://github.com/carverauto/serviceradar/pull/1755/files#diff-62d65ad4e929bc5554024981aa1c3418d70d54333e3afd80af69d93cea2edec3R209-R218)

```diff
-{devices.map(device => {
-    const metadata = device.metadata || {};
-    const sysmonServiceHint = typeof metadata === 'object' &&
+{devices.map((device, idx) => {
+    if (!device || typeof device !== 'object') return null;
+
+    const metadata = (device as any).metadata || {};
+    const sysmonServiceHint =
+        typeof metadata === 'object' &&
         metadata !== null &&
         typeof metadata.checker_service === 'string' &&
         metadata.checker_service.toLowerCase().includes('sysmon');
 
+    const rowKey = device.device_id ?? `device-row-${idx}`;
+
     return (
-        <Fragment key={device.device_id}>
+        <Fragment key={rowKey}>
             <tr className="hover:bg-gray-700/30">
```

- [ ] **Apply / Chat** <!-- /improve --apply_suggestion=6 -->

<details><summary>Suggestion importance[1-10]: 6</summary>

**Why:** The suggestion improves code robustness by adding a null check for `device` and providing a fallback for the React `key`, which are good defensive programming practices against potentially malformed data.
</details></details></td><td align=center>Low

</td></tr><tr><td rowspan=1>General</td>
<td>

<details><summary>Use consistent truncated timestamps</summary>

___

**In <code>GetAllCPUMetrics</code>, use the truncated timestamp for the final <code>SysmonCPUResponse</code> to ensure consistent grouping and alignment between CPU and cluster metrics.**

[pkg/db/metrics.go [826-912]](https://github.com/carverauto/serviceradar/pull/1755/files#diff-2cf937853ac86eb9a879bd370cbe22b835ee0c48be6771b48b0276d93cc2ae93R826-R912)

```diff
-func (db *DB) GetAllCPUMetrics(
-	ctx context.Context, pollerID string, start, end time.Time) ([]models.SysmonCPUResponse, error) {
-	rows, err := db.Conn.Query(ctx, `
-		SELECT timestamp, agent_id, host_id, core_id, usage_percent, frequency_hz, label, cluster
-		FROM table(cpu_metrics)
-		WHERE poller_id = $1 AND timestamp BETWEEN $2 AND $3
-		ORDER BY timestamp DESC, core_id ASC`,
-		pollerID, start, end)
-	if err != nil {
-		return nil, fmt.Errorf("failed to query CPU metrics: %w", err)
-	}
-	defer db.CloseRows(rows)
+result := make([]models.SysmonCPUResponse, 0, len(data))
-
-	data := make(map[time.Time][]models.CPUMetric)
-	clustersByTimestamp := make(map[time.Time][]models.CPUClusterMetric)
-
-	for rows.Next() {
-		var m models.CPUMetric
-
-		var agentID, hostID, label, cluster string
-
-		var timestamp time.Time
-
-		if err := rows.Scan(&timestamp, &agentID, &hostID, &m.CoreID, &m.UsagePercent, &m.FrequencyHz, &label, &cluster); err != nil {
-			db.logger.Error().Err(err).Msg("Error scanning CPU metric row")
-			continue
-		}
-		key := timestamp.Truncate(time.Second)
-		m.Timestamp = timestamp
-		m.AgentID = agentID
-		m.HostID = hostID
-		m.Label = label
-		m.Cluster = cluster
-		data[key] = append(data[key], m)
-	}
-
-	if err := rows.Err(); err != nil {
-		db.logger.Error().Err(err).Msg("Error iterating CPU metric rows")
-
-		return nil, err
-	}
-
-	clusterRows, err := db.Conn.Query(ctx, `
-		SELECT timestamp, agent_id, host_id, cluster, frequency_hz
-		FROM table(cpu_cluster_metrics)
-		WHERE poller_id = $1 AND timestamp BETWEEN $2 AND $3
-		ORDER BY timestamp DESC, cluster ASC`,
-		pollerID, start, end)
-	if err != nil {
-		db.logger.Error().Err(err).Msg("Error querying CPU cluster metrics")
-
-		return nil, fmt.Errorf("failed to query CPU cluster metrics: %w", err)
-	}
-	defer db.CloseRows(clusterRows)
-
-	for clusterRows.Next() {
-		var c models.CPUClusterMetric
-		var timestamp time.Time
-
-		if err := clusterRows.Scan(&timestamp, &c.AgentID, &c.HostID, &c.Name, &c.FrequencyHz); err != nil {
-			db.logger.Error().Err(err).Msg("Error scanning CPU cluster metric row")
-			continue
-		}
-		key := timestamp.Truncate(time.Second)
-		c.Timestamp = timestamp
-		clustersByTimestamp[key] = append(clustersByTimestamp[key], c)
-	}
-
-	if err := clusterRows.Err(); err != nil {
-		db.logger.Error().Err(err).Msg("Error iterating CPU cluster metric rows")
-
-		return nil, err
-	}
-
-	result := make([]models.SysmonCPUResponse, 0, len(data))
-
-	for ts, cpus := range data {
-		result = append(result, models.SysmonCPUResponse{
-			Cpus:      cpus,
-			Clusters:  clustersByTimestamp[ts],
-			Timestamp: cpus[0].Timestamp,
-		})
-	}
-
-	return result, nil
+for ts, cpus := range data {
+	truncated := ts.Truncate(time.Second)
+	result = append(result, models.SysmonCPUResponse{
+		Cpus:      cpus,
+		Clusters:  clustersByTimestamp[ts],
+		Timestamp: truncated,
+	})
 }
```

`[To ensure code accuracy, apply this suggestion manually]`

<details><summary>Suggestion importance[1-10]: 7</summary>

**Why:** The suggestion correctly identifies a subtle bug where inconsistent timestamp handling could lead to data misalignment between CPU and cluster metrics.
</details></details></td><td align=center>Medium </td></tr> <tr><td align="center" colspan="2"> - [ ] More <!-- /improve --more_suggestions=true --> </td><td></td></tr></tbody></table> ___ #### Previous suggestions <details><summary>✅ Suggestions up to commit 86685fd</summary> <br><table><thead><tr><td><strong>Category</strong></td><td align=left><strong>Suggestion&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </strong></td><td align=center><strong>Impact</strong></td></tr><tbody><tr><td rowspan=7>Possible issue</td> <td> <details><summary>Preserve original host identity fields<!-- not_implemented --></summary> ___ **Avoid overwriting <code>host_id</code> and <code>host_ip</code> in the service message. 
Instead, store the <br>address derived from <code>check.Details</code> in separate auxiliary fields to preserve the <br>original agent-reported identity.** [pkg/poller/agent_poller.go [272-325]](https://github.com/carverauto/serviceradar/pull/1755/files#diff-7cdaae416a1e7b5d39ad75f138c49fa0aca12fc38e8ab0b26bb59c64446da15cR272-R325) ```diff func enrichServiceMessageWithAddress(message []byte, check Check) []byte { if len(message) == 0 || check.Type != "grpc" || check.Details == "" { return message } hostCandidate := strings.TrimSpace(check.Details) host, _, err := net.SplitHostPort(hostCandidate) if err != nil { host = hostCandidate } + // Only attach auxiliary info if the candidate parses as an IP if host == "" || net.ParseIP(host) == nil { return message } var payload map[string]any if err := json.Unmarshal(message, &payload); err != nil { return message } - statusNode, ok := payload["status"].(map[string]any) - if !ok || statusNode == nil { + statusNode, _ := payload["status"].(map[string]any) + if statusNode == nil { statusNode = make(map[string]any) } - getString := func(key string) string { - if raw, exists := statusNode[key]; exists { - if str, ok := raw.(string); ok { - return strings.TrimSpace(str) - } + // Preserve existing fields; add auxiliary fields without overwriting originals + if existing, ok := statusNode["host_ip"]; ok && existing != nil { + // keep original, add derived separately if different + if s, ok := existing.(string); ok && !strings.EqualFold(strings.TrimSpace(s), host) { + statusNode["derived_host_ip"] = host } - - return "" + } else { + // no original present; still avoid writing as canonical field + statusNode["derived_host_ip"] = host } - if hostIP := getString("host_ip"); hostIP != "" && !strings.EqualFold(hostIP, host) { - statusNode["reported_host_ip"] = hostIP + // Do not overwrite host_id; expose derived host_id from address only as auxiliary + if _, ok := statusNode["host_id"]; !ok { + statusNode["derived_host_id"] = host } - 
statusNode["host_ip"] = host - - if hostID := getString("host_id"); hostID != "" && !strings.EqualFold(hostID, host) { - statusNode["reported_host_id"] = hostID - } - statusNode["host_id"] = host payload["status"] = statusNode - enriched, err := json.Marshal(payload) if err != nil { return message } - return enriched } ``` <!-- /improve --apply_suggestion=0 --> <details><summary>Suggestion importance[1-10]: 9</summary> __ Why: The suggestion correctly identifies a critical data integrity issue where agent-reported identity (`host_ip`, `host_id`) is overwritten, which could lead to incorrect device correlation and data misattribution. </details></details></td><td align=center>High </td></tr><tr><td> <details><summary>Validate device key IP input</summary> ___ **Validate that the extracted <code>hostIP</code> is a valid IP address before using it to <br>construct a <code>deviceID</code> to prevent creating invalid device records from hostnames <br>or other strings.** [pkg/core/devices.go [31-62]](https://github.com/carverauto/serviceradar/pull/1755/files#diff-fb7f5159558f22afdc809b9cbf5a9c0f71560be2f7389c944da8f1604998ea48R31-R62) ```diff -func (s *Server) ensureServiceDevice( - ctx context.Context, - agentID, pollerID, partition string, - svc *proto.ServiceStatus, - serviceData json.RawMessage, - timestamp time.Time, -) { - if svc == nil { - return - } +hostIP, hostname, hostID := extractCheckerHostIdentity(serviceData) +hostIP = normalizeHostIdentifier(hostIP) +// Require a valid IP address to avoid using hostnames/strings as device keys +if ip := net.ParseIP(hostIP); ip == nil { + return +} +if hostIP == "" || strings.EqualFold(hostIP, "unknown") { + return +} - // Only gRPC checkers embed host context in a way we can reason about; skip other service types. - if svc.ServiceType != grpcServiceType { - return - } +if partition == "" { + partition = "default" +} - // Ignore result streams such as sync/multi-part responses; they set Source to "results". 
- if svc.Source != "" && !strings.EqualFold(svc.Source, "getstatus") { - return - } +deviceID := fmt.Sprintf("%s:%s", partition, hostIP) - hostIP, hostname, hostID := extractCheckerHostIdentity(serviceData) - hostIP = normalizeHostIdentifier(hostIP) - if hostIP == "" || strings.EqualFold(hostIP, "unknown") { - return - } - - if partition == "" { - partition = "default" - } - - deviceID := fmt.Sprintf("%s:%s", partition, hostIP) - ``` <details><summary>Suggestion importance[1-10]: 8</summary> __ Why: The suggestion correctly identifies that a non-IP string could be used in `deviceID`, leading to invalid device records. Enforcing IP validation for the device key is a crucial fix for data integrity. </details></details></td><td align=center>Medium </td></tr><tr><td> <details><summary>✅ <s>Add indexes for new cluster fields</s></summary> ___ <details><summary><b>Suggestion Impact:</b></summary>The commit added bloom filter indexes on cpu_metrics for cluster and label, matching the suggestion. It also added an index for cluster on cpu_cluster_metrics and adjusted ORDER BY clauses. 
code diff: ```diff +ALTER STREAM cpu_metrics + ADD INDEX IF NOT EXISTS idx_cluster cluster TYPE bloom_filter GRANULARITY 1; +ALTER STREAM cpu_metrics + ADD INDEX IF NOT EXISTS idx_label label TYPE bloom_filter GRANULARITY 1; ``` </details> ___ **Add bloom filter indexes to the <code>cpu_metrics</code> stream for the new <code>cluster</code> and <code>label</code> <br>fields to improve query performance.** [pkg/db/migrations/00000000000001_consolidated_serviceradar_schema.up.sql [209-222]](https://github.com/carverauto/serviceradar/pull/1755/files#diff-1e05de747238f2112bb2230aac8db388e4c80eebe84c071eb78e035d64e67eb6R209-R222) ```diff CREATE STREAM IF NOT EXISTS cpu_metrics ( timestamp DateTime64(3), poller_id string, agent_id string, host_id string, core_id int32, usage_percent float64, frequency_hz float64, label string, cluster string, device_id string, partition string ) ENGINE = Stream(1, 1, rand()) PARTITION BY int_div(to_unix_timestamp(timestamp), 3600) ORDER BY (timestamp, device_id, host_id, core_id) TTL to_start_of_day(coalesce(timestamp, _tp_time)) + INTERVAL 3 DAY SETTINGS index_granularity = 8192; +ALTER STREAM cpu_metrics + ADD INDEX IF NOT EXISTS idx_cluster cluster TYPE bloom_filter GRANULARITY 1; +ALTER STREAM cpu_metrics + ADD INDEX IF NOT EXISTS idx_label label TYPE bloom_filter GRANULARITY 1; + ``` <details><summary>Suggestion importance[1-10]: 8</summary> __ Why: This is a valid and important performance optimization. Adding bloom filter indexes on the new `cluster` and `label` fields is crucial for efficient querying, especially since these fields are used for filtering and grouping in the UI, preventing full stream scans on `cpu_metrics`. 
</details></details></td><td align=center>Medium </td></tr><tr><td> <details><summary>✅ <s>Optimize sort key and index for clusters</s></summary> ___ <details><summary><b>Suggestion Impact:</b></summary>The commit reordered the cpu_cluster_metrics ORDER BY to (timestamp, cluster, host_id, device_id) and added a bloom filter index on cluster for that stream, matching the suggestion's intent. code diff: ```diff -ORDER BY (timestamp, device_id, host_id, cluster) +ORDER BY (timestamp, cluster, host_id, device_id) TTL to_start_of_day(coalesce(timestamp, _tp_time)) + INTERVAL 3 DAY SETTINGS index_granularity = 8192; +ALTER STREAM cpu_cluster_metrics + ADD INDEX IF NOT EXISTS idx_cluster cluster TYPE bloom_filter GRANULARITY 1; ``` </details> ___ **Optimize the <code>cpu_cluster_metrics</code> stream by reordering the primary key to <br>prioritize <code>cluster</code> and adding a bloom filter index on it.** [pkg/db/migrations/00000000000001_consolidated_serviceradar_schema.up.sql [224-237]](https://github.com/carverauto/serviceradar/pull/1755/files#diff-1e05de747238f2112bb2230aac8db388e4c80eebe84c071eb78e035d64e67eb6R224-R237) ```diff CREATE STREAM IF NOT EXISTS cpu_cluster_metrics ( timestamp DateTime64(3), poller_id string, agent_id string, host_id string, cluster string, frequency_hz float64, device_id string, partition string ) ENGINE = Stream(1, 1, rand()) PARTITION BY int_div(to_unix_timestamp(timestamp), 3600) -ORDER BY (timestamp, device_id, host_id, cluster) +ORDER BY (timestamp, cluster, host_id, device_id) TTL to_start_of_day(coalesce(timestamp, _tp_time)) + INTERVAL 3 DAY SETTINGS index_granularity = 8192; +ALTER STREAM cpu_cluster_metrics + ADD INDEX IF NOT EXISTS idx_cluster cluster TYPE bloom_filter GRANULARITY 1; + ``` `[Suggestion processed]` <details><summary>Suggestion importance[1-10]: 8</summary> __ Why: This suggestion correctly identifies a performance issue in the new `cpu_cluster_metrics` stream. 
Reordering the primary key to `(timestamp, cluster, ...)` and adding a bloom filter index on `cluster` will significantly improve the performance of common UI queries that filter by cluster. </details></details></td><td align=center>Medium </td></tr><tr><td> <details><summary>Add safe db type assertion<!-- not_implemented --></summary> ___ **Add a safe type assertion when accessing <code>s.dbService</code> to prevent a potential <br>panic if the underlying type is not <code>*db.DB</code>.** [pkg/core/api/sysmon.go [459]](https://github.com/carverauto/serviceradar/pull/1755/files#diff-4879e8b367ee55413c6cb006f23bb912261df5310c640d3af352033cf97ffc84R459-R459) ```diff -clusterRows, err := s.dbService.(*db.DB).Conn.Query(ctx, clusterQuery) +var ( + dbImpl *db.DB + ok bool +) +if dbImpl, ok = s.dbService.(*db.DB); !ok || dbImpl == nil || dbImpl.Conn == nil { + return nil, fmt.Errorf("database service unavailable for cluster query") +} +clusterRows, err := dbImpl.Conn.Query(ctx, clusterQuery) ``` <!-- /improve --apply_suggestion=4 --> <details><summary>Suggestion importance[1-10]: 7</summary> __ Why: The suggestion correctly points out a potential panic from an unsafe type assertion, improving the code's robustness and preventing a server crash, which is a valuable improvement for production stability. </details></details></td><td align=center>Medium </td></tr><tr><td> <details><summary>✅ <s>Add defensive null checks</s></summary> ___ <details><summary><b>Suggestion Impact:</b></summary>The commit added a defensive check ensuring serviceData exists before accessing its properties and provided a nullish coalescing fallback for serviceData.details. code diff: ```diff + if ( + serviceData && + (serviceData.service_name === "sysmon-vm" || serviceData.name === "sysmon-vm") + ) { return ( <SysmonVmDetails service={serviceData} - details={serviceData.details} + details={serviceData.details ?? 
{}} ``` </details> ___ **Add null checks for the <code>serviceData</code> object and its <code>details</code> property to prevent <br>potential runtime errors during rendering.** [web/src/components/Service/Dashboard.tsx [280-287]](https://github.com/carverauto/serviceradar/pull/1755/files#diff-a0046922157d76341ddcd07416191c7a7add740deefbf01ee66c1f49e7da4582R280-R287) ```diff -if (serviceData.service_name === 'sysmon-vm' || serviceData.name === 'sysmon-vm') { +if (serviceData && (serviceData.service_name === 'sysmon-vm' || serviceData.name === 'sysmon-vm')) { return ( <SysmonVmDetails service={serviceData} - details={serviceData.details} + details={serviceData.details ?? {}} /> ); } ``` `[Suggestion processed]` <details><summary>Suggestion importance[1-10]: 7</summary> __ Why: The suggestion correctly identifies a potential null pointer exception if `serviceData` is not yet loaded, improving the component's robustness against runtime errors. </details></details></td><td align=center>Medium </td></tr><tr><td> <details><summary>✅ <s>Clamp and sanitize metric value</s></summary> ___ <details><summary><b>Suggestion Impact:</b></summary>The commit added safeMax/rawValue/clampedValue and replaced uses of current with clampedValue and max with safeMax across StatusIndicator, ValueDisplay, and ProgressBar. code diff: ```diff + const safeMax = Number.isFinite(max) && max > 0 ? max : 100; + const rawValue = Number.isFinite(current) ? 
current : 0; + const clampedValue = Math.min(Math.max(rawValue, 0), safeMax); + return ( <div className="bg-white dark:bg-gray-800 rounded-lg p-3 shadow transition-colors"> <div className="flex justify-between items-center mb-2"> @@ -129,23 +133,23 @@ <span className="text-sm font-medium text-gray-600 dark:text-gray-300">{title}</span> </div> <StatusIndicator - value={current} + value={clampedValue} warning={warning} critical={critical} /> </div> <ValueDisplay - value={current} + value={clampedValue} unit={unit} warning={warning} critical={critical} change={change} /> <ProgressBar - value={current} + value={clampedValue} warning={warning} critical={critical} - max={max} + max={safeMax} /> ``` </details> ___ **Sanitize the <code>current</code> and <code>max</code> props in the <code>MetricCard</code> component by clamping the <br>value to a safe range <code>[0, max]</code> to prevent UI issues.** [web/src/components/Metrics/shared-components.jsx [123-153]](https://github.com/carverauto/serviceradar/pull/1755/files#diff-da2e49cf20ae6d685ed375b397ac8a5d2fdbafad60303a706712dfb84afb49d3R123-R153) ```diff export const MetricCard = ({ title, current, unit, warning, critical, change, icon, children, max = 100 }) => { + const safeMax = Number.isFinite(max) && max > 0 ? max : 100; + const rawValue = Number.isFinite(current) ? current : 0; + const clampedValue = Math.min(Math.max(rawValue, 0), safeMax); return ( <div className="bg-white dark:bg-gray-800 rounded-lg p-3 shadow transition-colors"> <div className="flex justify-between items-center mb-2"> <div className="flex items-center space-x-2"> {icon && <div className="text-gray-500 dark:text-gray-300">{icon}</div>} <h3 className="text-sm font-medium text-gray-900 dark:text-white">{title}</h3> </div> {typeof change !== 'undefined' && ( <span className={`text-xs ${change >= 0 ? 'text-green-600' : 'text-red-600'}`}> {change >= 0 ? 
'+' : ''}{change}% </span> )} </div> <MetricProgressBar - value={current} + value={clampedValue} warning={warning} critical={critical} - max={max} + max={safeMax} /> {children} </div> ); }; ``` `[Suggestion processed]` <details><summary>Suggestion importance[1-10]: 7</summary> __ Why: This suggestion improves the component's resilience by sanitizing the `current` and `max` props, preventing potential UI bugs or crashes from invalid data. </details></details></td><td align=center>Medium </td></tr><tr><td rowspan=3>Incremental <sup><a href='https://qodo-merge-docs.qodo.ai/core-abilities/incremental_update/'>[*]</a></sup></td> <td> <details><summary>✅ <s>Normalize timestamps for joining</s></summary> ___ <details><summary><b>Suggestion Impact:</b></summary>The commit implemented truncating timestamps to second-resolution for both CPU metrics and cluster metrics map keys, and adjusted the response timestamp to use the original metric timestamp. code diff: ```diff + key := timestamp.Truncate(time.Second) m.Timestamp = timestamp m.AgentID = agentID m.HostID = hostID m.Label = label m.Cluster = cluster - data[timestamp] = append(data[timestamp], m) + data[key] = append(data[key], m) } if err := rows.Err(); err != nil { @@ -888,9 +888,9 @@ db.logger.Error().Err(err).Msg("Error scanning CPU cluster metric row") continue } - + key := timestamp.Truncate(time.Second) c.Timestamp = timestamp - clustersByTimestamp[timestamp] = append(clustersByTimestamp[timestamp], c) + clustersByTimestamp[key] = append(clustersByTimestamp[key], c) } if err := clusterRows.Err(); err != nil { @@ -905,7 +905,7 @@ result = append(result, models.SysmonCPUResponse{ Cpus: cpus, Clusters: clustersByTimestamp[ts], - Timestamp: ts, + Timestamp: cpus[0].Timestamp, }) ``` </details> ___ **In <code>GetAllCPUMetrics</code>, normalize timestamps by truncating them to the second <br>before using them as map keys to ensure correct correlation between core and <br>cluster metrics.** [pkg/db/metrics.go 
[826-912]](https://github.com/carverauto/serviceradar/pull/1755/files#diff-2cf937853ac86eb9a879bd370cbe22b835ee0c48be6771b48b0276d93cc2ae93R826-R912) ```diff func (db *DB) GetAllCPUMetrics( ctx context.Context, pollerID string, start, end time.Time) ([]models.SysmonCPUResponse, error) { rows, err := db.Conn.Query(ctx, ` SELECT timestamp, agent_id, host_id, core_id, usage_percent, frequency_hz, label, cluster FROM table(cpu_metrics) WHERE poller_id = $1 AND timestamp BETWEEN $2 AND $3 ORDER BY timestamp DESC, core_id ASC`, pollerID, start, end) if err != nil { return nil, fmt.Errorf("failed to query CPU metrics: %w", err) } defer db.CloseRows(rows) data := make(map[time.Time][]models.CPUMetric) clustersByTimestamp := make(map[time.Time][]models.CPUClusterMetric) for rows.Next() { var m models.CPUMetric - var agentID, hostID, label, cluster string - - var timestamp time.Time - - if err := rows.Scan(&timestamp, &agentID, &hostID, &m.CoreID, &m.UsagePercent, &m.FrequencyHz, &label, &cluster); err != nil { + var ts time.Time + if err := rows.Scan(&ts, &agentID, &hostID, &m.CoreID, &m.UsagePercent, &m.FrequencyHz, &label, &cluster); err != nil { db.logger.Error().Err(err).Msg("Error scanning CPU metric row") continue } - - m.Timestamp = timestamp + // Normalize key to seconds to ensure joinability + key := ts.Truncate(time.Second) + m.Timestamp = ts m.AgentID = agentID m.HostID = hostID m.Label = label m.Cluster = cluster - data[timestamp] = append(data[timestamp], m) + data[key] = append(data[key], m) } - if err := rows.Err(); err != nil { db.logger.Error().Err(err).Msg("Error iterating CPU metric rows") - return nil, err } clusterRows, err := db.Conn.Query(ctx, ` SELECT timestamp, agent_id, host_id, cluster, frequency_hz FROM table(cpu_cluster_metrics) WHERE poller_id = $1 AND timestamp BETWEEN $2 AND $3 ORDER BY timestamp DESC, cluster ASC`, pollerID, start, end) if err != nil { db.logger.Error().Err(err).Msg("Error querying CPU cluster metrics") - return nil, 
	fmt.Errorf("failed to query CPU cluster metrics: %w", err)
 }

 defer db.CloseRows(clusterRows)

 for clusterRows.Next() {
 	var c models.CPUClusterMetric
-	var timestamp time.Time
-
-	if err := clusterRows.Scan(&timestamp, &c.AgentID, &c.HostID, &c.Name, &c.FrequencyHz); err != nil {
+	var ts time.Time
+	if err := clusterRows.Scan(&ts, &c.AgentID, &c.HostID, &c.Name, &c.FrequencyHz); err != nil {
 		db.logger.Error().Err(err).Msg("Error scanning CPU cluster metric row")
 		continue
 	}
-
-	c.Timestamp = timestamp
-	clustersByTimestamp[timestamp] = append(clustersByTimestamp[timestamp], c)
+	key := ts.Truncate(time.Second)
+	c.Timestamp = ts
+	clustersByTimestamp[key] = append(clustersByTimestamp[key], c)
 }
-
 if err := clusterRows.Err(); err != nil {
 	db.logger.Error().Err(err).Msg("Error iterating CPU cluster metric rows")
-
 	return nil, err
 }

 result := make([]models.SysmonCPUResponse, 0, len(data))
-
-for ts, cpus := range data {
+for tsKey, cpus := range data {
 	result = append(result, models.SysmonCPUResponse{
 		Cpus:      cpus,
-		Clusters:  clustersByTimestamp[ts],
-		Timestamp: ts,
+		Clusters:  clustersByTimestamp[tsKey],
+		Timestamp: cpus[0].Timestamp, // original timestamp for display
 	})
 }
-
 return result, nil
}
```

<details><summary>Suggestion importance[1-10]: 8</summary>

__

Why: The suggestion correctly identifies a critical Go pitfall of using `time.Time` as a map key, which would cause data to be mis-associated, and provides the correct fix by truncating the timestamp.
</details></details></td><td align=center>Medium</td></tr>

<tr><td>
<details><summary>✅ <s>Guard optional nested data</s></summary>

___

<details><summary><b>Suggestion Impact:</b></summary>
The commit updated the component to use optional chaining and array checks: it wrapped `cpuFrequency` and `memory` with `data?.` and changed `cpu.clusters`, `cpu.cores`, and `disk.drives` to `Array.isArray` checks, matching the suggested guards.

code diff:

```diff
-          {data.cpu?.clusters && data.cpu.clusters.length > 0 && (
+          {Array.isArray(data?.cpu?.clusters) && data.cpu.clusters.length > 0 && (
             <CpuClusterDetails clusters={data.cpu.clusters} />
           )}
-          {data.cpuFrequency && <CpuFrequencyDetails data={data.cpuFrequency} />}
-          <CpuCoresChart cores={data.cpu.cores} />
-          <FilesystemDetails drives={data.disk.drives} />
-          <MemoryDetails data={data.memory} />
+          {data?.cpuFrequency && <CpuFrequencyDetails data={data.cpuFrequency} />}
+          {Array.isArray(data?.cpu?.cores) && <CpuCoresChart cores={data.cpu.cores} />}
+          {Array.isArray(data?.disk?.drives) && <FilesystemDetails drives={data.disk.drives} />}
+          {data?.memory && <MemoryDetails data={data.memory} />}
```
</details>

___

**Add null-safe guards for `data.cpu.cores`, `data.disk.drives`, and `data.memory` to prevent the application from crashing if these properties are accessed before they are loaded.**

[web/src/components/Metrics/system-metrics.jsx [288-296]](https://github.com/carverauto/serviceradar/pull/1755/files#diff-8ac59d996e95f9669b6ddac45fad9179ec4875fcd135aa101f4e25f1c662f687R288-R296)

```diff
 {activeTab === 'details' && (
   <div className="space-y-6">
-    {data.cpu?.clusters && data.cpu.clusters.length > 0 && (
+    {Array.isArray(data?.cpu?.clusters) && data.cpu.clusters.length > 0 && (
       <CpuClusterDetails clusters={data.cpu.clusters} />
     )}
-    {data.cpuFrequency && <CpuFrequencyDetails data={data.cpuFrequency} />}
-    <CpuCoresChart cores={data.cpu.cores} />
-    <FilesystemDetails drives={data.disk.drives} />
-    <MemoryDetails data={data.memory} />
+    {data?.cpuFrequency && <CpuFrequencyDetails data={data.cpuFrequency} />}
+    {Array.isArray(data?.cpu?.cores) && <CpuCoresChart cores={data.cpu.cores} />}
+    {Array.isArray(data?.disk?.drives) && <FilesystemDetails drives={data.disk.drives} />}
+    {data?.memory && <MemoryDetails data={data.memory} />}
```

`[Suggestion processed]`

<details><summary>Suggestion importance[1-10]: 8</summary>

__

Why: This suggestion correctly identifies missing null-safe guards for nested properties like `data.cpu.cores`, preventing a likely runtime crash when data is partially loaded, which is a critical fix for UI stability.
</details></details></td><td align=center>Medium</td></tr>

<tr><td>
<details><summary>✅ <s>Include cluster in sorting key</s></summary>

___

<details><summary><b>Suggestion Impact:</b></summary>
The commit updated `cpu_metrics` to `ORDER BY (timestamp, device_id, host_id, cluster, core_id)`, adding `cluster` as suggested. It also added bloom filter indexes and adjusted ordering for another stream.

code diff:

```diff
+ORDER BY (timestamp, device_id, host_id, cluster, core_id)
+TTL to_start_of_day(coalesce(timestamp, _tp_time)) + INTERVAL 3 DAY
+SETTINGS index_granularity = 8192;
```
</details>

___

**Add the new `cluster` field to the `ORDER BY` clause of the `cpu_metrics` stream. This will improve query performance and data locality for cluster-based aggregations.**

[pkg/db/migrations/00000000000001_consolidated_serviceradar_schema.up.sql [209-222]](https://github.com/carverauto/serviceradar/pull/1755/files#diff-1e05de747238f2112bb2230aac8db388e4c80eebe84c071eb78e035d64e67eb6R209-R222)

```diff
 CREATE STREAM IF NOT EXISTS cpu_metrics (
     timestamp DateTime64(3),
     poller_id string,
     agent_id string,
     host_id string,
     core_id int32,
     usage_percent float64,
     frequency_hz float64,
     label string,
     cluster string,
     device_id string,
     partition string
 ) ENGINE = Stream(1, 1, rand())
 PARTITION BY int_div(to_unix_timestamp(timestamp), 3600)
-ORDER BY (timestamp, device_id, host_id, core_id)
+ORDER BY (timestamp, device_id, host_id, cluster, core_id)
 TTL to_start_of_day(coalesce(timestamp, _tp_time)) + INTERVAL 3 DAY
 SETTINGS index_granularity = 8192;
```

<details><summary>Suggestion importance[1-10]: 7</summary>

__

Why: This is a valid and important performance optimization for the `cpu_metrics` stream, as including `cluster` in the `ORDER BY` clause will improve data locality and query efficiency for cluster-based aggregations.
</details></details></td><td align=center>Medium</td></tr>

<tr><td rowspan=2>General</td>
<td>
<details><summary>✅ <s>Remove hardcoded private IP</s></summary>

___

<details><summary><b>Suggestion Impact:</b></summary>
The hardcoded IP address was replaced with an environment-variable-based address: `"${SYSMON_VM_ADDRESS:-sysmon-vm:50110}"`.

code diff:

```diff
-    "address": "192.168.1.219:50110",
+    "address": "${SYSMON_VM_ADDRESS:-sysmon-vm:50110}",
```
</details>

___

**Replace the hardcoded IP address in `sysmon-vm.checker.json` with a configurable value, such as an environment variable, to improve deployment flexibility.**

[docker/compose/sysmon-vm.checker.json [1-9]](https://github.com/carverauto/serviceradar/pull/1755/files#diff-92ceb153a11705c17ec13f926f009bf33b539c872b859f5fd624dd4b5b4cd17dR1-R9)

```diff
 {
   "name": "sysmon-vm",
   "type": "grpc",
-  "address": "192.168.1.219:50110",
+  "address": "${SYSMON_VM_ADDRESS:-sysmon-vm:50110}",
   "security": {
     "mode": "none",
     "role": "agent"
   }
 }
```

`[Suggestion processed]`

<details><summary>Suggestion importance[1-10]: 8</summary>

__

Why: The suggestion correctly points out that a hardcoded IP address harms portability and should be replaced with a configurable value, which is a critical best practice for deployment configurations.
</details></details></td><td align=center>Medium</td></tr>

<tr><td>
<details><summary>✅ <s>Align Proton memory to container limits</s></summary>

___

<details><summary><b>Suggestion Impact:</b></summary>
The commit added the suggested Proton memory environment variables (`PROTON_MAX_SERVER_MEMORY`, `PROTON_MAX_QUERY_MEMORY`, `PROTON_MAX_PARTITION_BYTES`) along with a comment explaining the alignment with container limits.

code diff:

```diff
+      # Align Proton memory usage with container limits to reduce OOM risk
+      - PROTON_MAX_SERVER_MEMORY=6g
+      - PROTON_MAX_QUERY_MEMORY=4g
+      - PROTON_MAX_PARTITION_BYTES=256m
```
</details>

___

**Set explicit memory limits for the Proton service via environment variables to align with container memory limits and prevent OOM kills.**

[docker-compose.yml [72-92]](https://github.com/carverauto/serviceradar/pull/1755/files#diff-e45e45baeda1c1e73482975a664062aa56f20c03dd9d64a827aba57775bed0d3R72-R92)

```diff
 services:
   proton:
     image: ghcr.io/carverauto/serviceradar-proton:latest
     container_name: serviceradar-proton
     user: "0:0"
     mem_limit: 8g
     mem_reservation: 6g
     ports:
       - "8123:8123"
       - "8463:8463"
     environment:
       - PROTON_PASSWORD=${PROTON_PASSWORD:-}
       - PROTON_LOG_LEVEL=${PROTON_LOG_LEVEL:-error}
       - PROTON_DISABLE_TASKSTATS=1
+      # Align Proton internal memory to container cgroup limits to prevent OOM
+      - PROTON_MAX_SERVER_MEMORY=6g
+      - PROTON_MAX_QUERY_MEMORY=4g
+      - PROTON_MAX_PARTITION_BYTES=256m
```

<details><summary>Suggestion importance[1-10]: 7</summary>

__

Why: The suggestion correctly points out a potential for OOM kills by aligning Proton's internal memory settings with the container's memory limits. This enhances the stability of the `proton` service under load, making it a valuable configuration improvement.
</details></details></td><td align=center>Medium</td></tr>

<tr><td align="center" colspan="2">
<!-- /improve_multi --more_suggestions=true -->
</td><td></td></tr></tbody></table>

</details>

<details><summary>✅ Suggestions up to commit fddabde</summary>
<br>
<table><thead><tr><td><strong>Category</strong></td><td align=left><strong>Suggestion</strong></td><td align=center><strong>Impact</strong></td></tr></thead><tbody>
<tr><td rowspan=1>High-level</td>
<td>
<details><summary>Database migration strategy breaks existing deployments</summary>

___

**The PR replaces the incremental database migration system with a single schema file that only runs on empty databases. This breaks the upgrade path for existing deployments, as there is no way to apply new schema changes to databases that already contain data.**

### Examples:

<details>
<summary><a href="https://github.com/carverauto/serviceradar/pull/1755/files#diff-95dfdc3c1eedf0141bd8be59bf69367b2909409887590dff5ae81e92f99b3917R35-R75">pkg/db/migrate.go [35-75]</a></summary>

```go
// RunMigrations applies the consolidated database schema when no tables exist yet.
func RunMigrations(ctx context.Context, conn proton.Conn, log logger.Logger) error {
	log.Info().Msg("Ensuring database schema is applied")

	applied, err := schemaAlreadyApplied(ctx, conn)
	if err != nil {
		return fmt.Errorf("failed to inspect existing schema: %w", err)
	}

	if applied {
		...
(clipped 31 lines)
```
</details>

### Solution Walkthrough:

#### Before:

```go
// In pkg/db/migrate.go
func RunMigrations(ctx, conn) error {
    createMigrationsTable(ctx, conn) // Ensure tracking table exists
    appliedMigrations := getAppliedMigrations(ctx, conn)
    availableMigrations := getAvailableMigrations() // from *.up.sql files

    for _, migrationFile := range availableMigrations {
        if !isApplied(migrationFile, appliedMigrations) {
            content := readMigrationFile(migrationFile)
            executeMigration(ctx, conn, content)
            recordMigration(ctx, conn, migrationFile)
        }
    }
    return nil
}
```

#### After:

```go
// In pkg/db/migrate.go
const schemaFile = "migrations/schema.sql"

func RunMigrations(ctx, conn) error {
    applied, err := schemaAlreadyApplied(ctx, conn)
    if applied {
        log("Schema already present; skipping apply")
        return nil
    }

    // If not applied, run the whole schema file
    content := readSchemaFile(schemaFile)
    statements := splitSQLStatements(content)
    for _, stmt := range statements {
        conn.Exec(ctx, stmt)
    }
    log("Database schema applied successfully")
    return nil
}

func schemaAlreadyApplied(ctx, conn) (bool, error) {
    // SELECT count() FROM system.tables WHERE ... name = 'device_updates'
    // return count > 0
}
```

<details><summary>Suggestion importance[1-10]: 10</summary>

__

Why: The suggestion correctly identifies a critical flaw where the new migration strategy removes the upgrade path for existing databases, which is a breaking change for all current deployments.
</details></details></td><td align=center>High</td></tr>

<tr><td rowspan=6>Possible issue</td>
<td>
<details><summary>Fix missing target stream clauses<!-- not_implemented --></summary>

___

**Add the missing `INTO` clause to the `ocsf_devices_aggregator_mv` and `ocsf_users_aggregator_mv` materialized views to specify their target streams and fix a syntax error.**

[pkg/db/migrations/schema.sql [1298-1408]](https://github.com/carverauto/serviceradar/pull/1755/files#diff-de4eeaf1d0460885258166a33d01394bc45f33ca3a95b4d36ef9c30b7a80e8e8R1298-R1408)

```diff
 -- OCSF Materialized Views Migration
 -- These views automatically aggregate event data into entity state and observable indexes
 -- Enables real-time updates from event streams to current state streams

 -- Device State Aggregation
 -- Populates ocsf_devices_current from device inventory events
 DROP VIEW IF EXISTS ocsf_devices_aggregator_mv;
-CREATE MATERIALIZED VIEW ocsf_devices_aggregator_mv AS
+CREATE MATERIALIZED VIEW ocsf_devices_aggregator_mv
+INTO ocsf_devices_current
+AS
 SELECT
     device_uid,
     time AS last_seen,
     ...
 FROM ocsf_device_inventory;

 -- User State Aggregation
 -- Populates ocsf_users_current from user inventory events
 DROP VIEW IF EXISTS ocsf_users_aggregator_mv;
-CREATE MATERIALIZED VIEW ocsf_users_aggregator_mv AS
+CREATE MATERIALIZED VIEW ocsf_users_aggregator_mv
+INTO ocsf_users_current
+AS
 SELECT
     user_uid,
     time AS last_seen,
     ...
 FROM ocsf_user_inventory;
```

<details><summary>Suggestion importance[1-10]: 10</summary>

__

Why: The suggestion correctly identifies a critical syntax error in two `CREATE MATERIALIZED VIEW` statements that are missing the required `INTO` clause. This would cause the SQL migration to fail, so the fix is essential for correctness.
</details></details></td><td align=center>High</td></tr>

<tr><td>
<details><summary>✅ <s>Fix view definition and function name</s></summary>

___

<details><summary><b>Suggestion Impact:</b></summary>
The committed diff updated the `ocsf_observable_device_macs_mv` definition to include the `INTO ocsf_observable_index` clause and corrected the function call to `replaceAll` for normalizing MAC addresses.

code diff:

```diff
 -- Observable Index Population - Device MACs
 -- Index MAC addresses from device inventory
-DROP VIEW IF EXISTS ocsf_observable_device_macs_mv;
-CREATE MATERIALIZED VIEW ocsf_observable_device_macs_mv AS
-SELECT
-    'mac_address' AS observable_type,
-    mac AS observable_value,
-    lower(replace_all(mac, ':', '')) AS observable_value_normalized,
-    'device' AS entity_class,
-    device_uid AS entity_uid,
-    time AS entity_last_seen,
-    'device.mac' AS entity_path,
-    confidence_level / 4.0 AS confidence_score,
-    discovery_source,
-    time,
-    agent_id,
-    poller_id,
-
-    -- Enrichments
-    '' AS geo_country,
-    '' AS geo_region,
-    '' AS geo_city,
-    0 AS asn_number,
-    '' AS asn_org,
-    0.0 AS threat_score,
-    CAST([] AS array(string)) AS threat_categories,
-    CAST([] AS array(string)) AS threat_sources,
-    'hardware' AS observable_category,
-    CAST([] AS array(string)) AS tags,
-    metadata
-
-FROM ocsf_device_inventory
-ARRAY JOIN device_mac AS mac
-WHERE mac != '';
```
</details>

___

**Fix the `ocsf_observable_device_macs_mv` view by adding the missing `INTO ocsf_observable_index` clause and correcting the function name from `replace_all` to `replaceAll`.**

[pkg/db/migrations/schema.sql [1452-1485]](https://github.com/carverauto/serviceradar/pull/1755/files#diff-de4eeaf1d0460885258166a33d01394bc45f33ca3a95b4d36ef9c30b7a80e8e8R1452-R1485)

```diff
 -- Observable Index Population - Device MACs
 -- Index MAC addresses from device inventory
 DROP VIEW IF EXISTS ocsf_observable_device_macs_mv;
-CREATE MATERIALIZED VIEW ocsf_observable_device_macs_mv AS
+CREATE MATERIALIZED VIEW ocsf_observable_device_macs_mv
+INTO ocsf_observable_index
+AS
 SELECT
     'mac_address' AS observable_type,
     mac AS observable_value,
-    lower(replace_all(mac, ':', '')) AS observable_value_normalized,
+    lower(replaceAll(mac, ':', '')) AS observable_value_normalized,
     'device' AS entity_class,
     device_uid AS entity_uid,
     ...
 FROM ocsf_device_inventory
 ARRAY JOIN device_mac AS mac
 WHERE mac != '';
```

<details><summary>Suggestion importance[1-10]: 10</summary>

__

Why: The suggestion correctly identifies two critical errors in the `ocsf_observable_device_macs_mv` view definition: a missing `INTO` clause and an incorrect function name (`replace_all` instead of `replaceAll`). Both are syntax errors that would cause the migration to fail.
</details></details></td><td align=center>High</td></tr>

<tr><td>
<details><summary>Avoid holding lock during network call</summary>

___

**Refactor `reconnectCore` to release the mutex lock before making the blocking `connectToCore` network call to avoid contention.**

[pkg/poller/poller.go [481-497]](https://github.com/carverauto/serviceradar/pull/1755/files#diff-28a10dea1596540e55ce9a8b68bd1af3d96bd4634f6def668643892cef25a086R481-R497)

```diff
 func (p *Poller) reconnectCore(ctx context.Context) error {
 	reconnectCtx, cancel := context.WithTimeout(ctx, 15*time.Second)
 	defer cancel()

 	p.mu.Lock()
-	defer p.mu.Unlock()
-
 	if p.grpcClient != nil {
+		// Close the existing client while holding the lock to prevent its use.
 		if err := p.grpcClient.Close(); err != nil {
 			p.logger.Warn().Err(err).Msg("Failed to close existing core client during reconnect")
 		}

 		p.grpcClient = nil
 		p.coreClient = nil
 	}
+	p.mu.Unlock() // Release the lock before the blocking network call.

-	return p.connectToCore(reconnectCtx)
+	// connectToCore should not set the fields on `p` directly.
+	// It should return the new clients.
+	grpcClient, coreClient, err := p.connectToCore(reconnectCtx)
+	if err != nil {
+		return err
+	}
+
+	// Re-acquire the lock to safely update the shared fields.
+	p.mu.Lock()
+	defer p.mu.Unlock()
+	p.grpcClient = grpcClient
+	p.coreClient = coreClient
+
+	return nil
 }
```

<!-- /improve --apply_suggestion=3 -->

<details><summary>Suggestion importance[1-10]: 8</summary>

__

Why: The suggestion correctly identifies a significant concurrency issue where a mutex is held during a blocking network call, which can lead to performance degradation or deadlocks.
</details></details></td><td align=center>Medium</td></tr>

<tr><td>
<details><summary>✅ <s>Replace hardcoded IP with service name</s></summary>

___

<details><summary><b>Suggestion Impact:</b></summary>
The hardcoded IP was replaced with a configurable address defaulting to `sysmon-vm:50110`, aligning with the suggestion's intent to use the service name.

code diff:

```diff
-  "address": "192.168.1.219:50110",
+  "address": "${SYSMON_VM_ADDRESS:-sysmon-vm:50110}",
```
</details>

___

**In `docker/compose/sysmon-vm.checker.json`, replace the hardcoded IP address in the `address` field with the service name `sysmon-vm` to ensure portability within the Docker environment.**

[docker/compose/sysmon-vm.checker.json [4]](https://github.com/carverauto/serviceradar/pull/1755/files#diff-92ceb153a11705c17ec13f926f009bf33b539c872b859f5fd624dd4b5b4cd17dR4-R4)

```diff
-"address": "192.168.1.219:50110",
+"address": "sysmon-vm:50110",
```

<!-- /improve --apply_suggestion=4 -->

<details><summary>Suggestion importance[1-10]: 7</summary>

__

Why: The suggestion correctly identifies a hardcoded IP address and recommends using the Docker service name, which is a best practice for portability and maintainability in a containerized environment.
</details></details></td><td align=center>Medium</td></tr>

<tr><td>
<details><summary>✅ <s>Handle empty metric slices correctly<!-- not_implemented --></s></summary>

___

<details><summary><b>Suggestion Impact:</b></summary>
The commit implemented logic so that a "no cpu metrics appended" error is only returned when appending fails, and otherwise returns a specific error constant. Although an explicit early return for empty cpus is not shown in the diff, the updated handling of the "no metrics appended" case aligns with the suggestion's intent to avoid misleading errors for empty input. Additionally, error messages were refactored to use constants.

code diff:

```diff
...
```
</details></details></td><td></td></tr></tbody></table>
</details>