Add sysmon-vm integration: CPU clusters/labels end-to-end, resilient reporting, and UI telemetry for ARM64/big.LITTLE #556

Closed
opened 2026-03-28 04:25:43 +00:00 by mfreeman451 · 1 comment
Owner

Imported from GitHub.

Original GitHub issue: #1756
Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/issues/1756
Original created: 2025-10-14T02:48:43Z


Description:

  • We need to integrate a sysmon-vm checker and surface its metrics across the stack for ARM64/big.LITTLE systems (e.g., Apple Silicon), including CPU core labels and cluster-level frequencies.
  • Current CPU telemetry only captures per-core usage and lacks cluster/group context, frequency visualization, and host identity correlation in device inventory.
  • Poller-to-core reporting should be resilient to transient gRPC failures and support hot-reload of polling intervals without restarts.
  • The Web UI must visualize per-core usage, cluster summaries, average/peak frequency trends, and sysmon host metadata to aid diagnostics.
  • Docker/K8s deployment should bootstrap credentials and configurations for seamless local demos, including sysmon-vm endpoint discovery and auth route handling.

Deliverables:

  • CPU clusters model: Extend metrics schema and API to support CPU core labels and cluster-level frequency (CPUMetric.label, CPUMetric.cluster, CPUClusterMetric) and include them in sysmon responses and queries.
  • DB persistence: Migrate database to store new CPU fields and a cpu_cluster_metrics table; batch writes must return a descriptive error when nothing was appended to the batch.
  • Sysmon-vm checker: Provide a gRPC sysmon-vm checker that reports per-core usage/frequency with labels/clusters and cluster frequency aggregates, including host identifiers and timestamps.
  • Core processing: Ingest sysmon/sysmon-vm payloads, normalize timestamps, store new metrics, and auto-register devices from checker payloads using host IP/ID with appropriate source confidence.
  • Poller resilience: Add reconnection logic on gRPC error codes/timeouts, enrich gRPC service messages with resolved host IP, carry agent_id, and support hot-reloading poll interval via channel.
  • Agent overrides: Allow checker-specific security settings via agent checker configs, and bootstrap default KV configs for common checkers including sysmon-vm.
  • API endpoints: Update CPU metrics API to return labels/clusters and add cluster metrics alongside per-core data for both poller- and device-scoped queries.
  • UI visualizations: Add CPU frequency cards/trends, per-core frequency bars, cluster badges/summaries, and host metadata panel; update combined charts to align heterogeneous series.
  • Deployment configs: Generate sysmon-vm checker config from env (SYSMON_VM_ADDRESS), improve Proton password persistence/permissions, add Kong auth routes, and wait-for-core in web entrypoint; update K8s/demo scripts accordingly.
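
As an illustration of the CPU clusters model deliverable, the sketch below shows one possible shape for per-core and cluster-level metrics in Go. Only the names `CPUMetric`, `label`, `cluster`, and `CPUClusterMetric` come from this issue; every other field name, the JSON tags, the units, and the `aggregateClusters` helper are assumptions made for the example. Averaging per-core frequencies is one way to derive a cluster value; the actual checker may report cluster frequency directly.

```go
package main

import (
	"fmt"
	"sort"
)

// CPUMetric carries per-core data plus the label and cluster fields named in
// the issue. Field names other than Label and Cluster are assumptions.
type CPUMetric struct {
	CoreID       int     `json:"core_id"`
	UsagePercent float64 `json:"usage_percent"`
	FrequencyHz  float64 `json:"frequency_hz"`
	Label        string  `json:"label,omitempty"`
	Cluster      string  `json:"cluster,omitempty"`
}

// CPUClusterMetric holds a cluster-level frequency. Its fields are assumptions.
type CPUClusterMetric struct {
	Name        string  `json:"name"`
	FrequencyHz float64 `json:"frequency_hz"`
}

// aggregateClusters (hypothetical helper) averages per-core frequency for each
// cluster and returns the results sorted by cluster name. Cores without a
// cluster are skipped.
func aggregateClusters(cores []CPUMetric) []CPUClusterMetric {
	sums := make(map[string]float64)
	counts := make(map[string]int)
	for _, c := range cores {
		if c.Cluster == "" {
			continue
		}
		sums[c.Cluster] += c.FrequencyHz
		counts[c.Cluster]++
	}

	out := make([]CPUClusterMetric, 0, len(sums))
	for name, sum := range sums {
		out = append(out, CPUClusterMetric{Name: name, FrequencyHz: sum / float64(counts[name])})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Name < out[j].Name })
	return out
}

func main() {
	// Example values are made up; cluster names are placeholders.
	cores := []CPUMetric{
		{CoreID: 0, UsagePercent: 12.5, FrequencyHz: 1000, Label: "E0", Cluster: "efficiency"},
		{CoreID: 1, UsagePercent: 8.0, FrequencyHz: 2000, Label: "E1", Cluster: "efficiency"},
		{CoreID: 2, UsagePercent: 71.0, FrequencyHz: 3000, Label: "P0", Cluster: "performance"},
	}
	for _, cl := range aggregateClusters(cores) {
		fmt.Printf("%s: %.0f Hz\n", cl.Name, cl.FrequencyHz)
	}
}
```

Keeping `label` and `cluster` optional on the per-core type would let homogeneous hosts, which have no cluster information, reuse the same payload unchanged.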
Author
Owner

Imported GitHub comment.

Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1756#issuecomment-3403513821
Original created: 2025-10-14T20:30:29Z


closing, completed

Reference
carverauto/serviceradar#556