bug(core): missing agent with ping metrics #637
Imported from GitHub.
Original GitHub issue: #1921
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1921
Original created: 2025-11-03T07:57:22Z
Describe the bug
We are no longer seeing our local k8s-agent, which reports the ICMP metrics it collects by performing its configured ping check.
This device existed in our inventory and may have been accidentally deleted. Even so, a deleted host should be re-added (or its tombstone metadata removed) as soon as it starts sending us data again, whether healthchecks or anything else. As long as it is sending data, it has valid SPIFFE credentials, is still running, and is clearly functional.
Related to #1916
Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1921#issuecomment-3481153716
Original created: 2025-11-03T15:34:55Z
Status update
Deployed `sha-e0315ea3c7a52afa0d7ae6cdde690cd46193715b`; Proton JSON errors are gone and ICMP metrics rebounded. `k8s-agent` shows the canonical ID, but the inventory still conflates pod IPs when the agent/poller moves.

Root cause recap

Kubernetes pods cycle IPs; we were treating `partition:ip` as the primary device key, so when the agent pod landed on `10.80.72.95` we merged it with the old record, and when the IP was later re-assigned we stopped associating the service telemetry with the host. The canonical cache was only storing `DeviceID`, not the authoritative host IP, so we kept writing the stale pod IP back into registry updates.

Plan of record

- Promote the service IDs (`serviceradar:agent:<id>`, `serviceradar:poller:<id>`, etc.) as the durable device IDs for infrastructure components and link them to a host device record. SPIFFE IDs remain optional metadata.
- Retire the stale `partition:ip` record once the new owner wins.

Implementation is underway: next steps are wiring the multi-key identity map and conflict resolver into the registry, adjusting the canonical cache, and extending tests around the new resolution flow. Will follow up once the first slice (service ID promotion + cache changes) lands.
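A minimal sketch of the multi-key identity map in this plan, using hypothetical type and method names (the real resolver lives in the registry):

```go
package main

import "fmt"

// KeyKind ranks identity keys, strongest first. Service IDs
// (serviceradar:agent:<id>) are durable; partition:ip is weakest
// because Kubernetes recycles pod IPs.
type KeyKind int

const (
	KindServiceID KeyKind = iota // e.g. "serviceradar:agent:abc"
	KindMAC
	KindPartitionIP // e.g. "default:10.80.72.95"
)

type Key struct {
	Kind  KeyKind
	Value string
}

// IdentityMap resolves any known key to a canonical device ID.
type IdentityMap struct {
	byKey map[Key]string
}

func NewIdentityMap() *IdentityMap { return &IdentityMap{byKey: make(map[Key]string)} }

// Bind associates a key with a canonical ID; a later claim simply
// overwrites the previous owner (the "new owner wins" rule).
func (m *IdentityMap) Bind(k Key, canonicalID string) { m.byKey[k] = canonicalID }

// Resolve returns the canonical ID for the strongest matching key, so a
// durable service ID outranks a recycled partition:ip binding.
func (m *IdentityMap) Resolve(keys []Key) (string, bool) {
	best, found := "", false
	bestKind := KindPartitionIP + 1
	for _, k := range keys {
		if id, ok := m.byKey[k]; ok && k.Kind < bestKind {
			best, bestKind, found = id, k.Kind, true
		}
	}
	return best, found
}

func main() {
	m := NewIdentityMap()
	m.Bind(Key{KindPartitionIP, "default:10.80.72.95"}, "stale-device")
	m.Bind(Key{KindServiceID, "serviceradar:agent:k8s-agent"}, "canonical-device")
	id, _ := m.Resolve([]Key{
		{KindPartitionIP, "default:10.80.72.95"},
		{KindServiceID, "serviceradar:agent:k8s-agent"},
	})
	fmt.Println(id) // canonical-device: the service ID outranks the pod IP
}
```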
Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1921#issuecomment-3481212172
Original created: 2025-11-03T15:48:08Z
Quick follow-up:

- The edge onboarding query now reads from `table(edge_onboarding_packages)` via the new aggregation helper. This avoids the unbounded stream scan that was blowing up with `Broken pipe` on Proton.
- Added `pkg/db/edge_onboarding_test.go` to lock in the query builder (verifies the `table(...)` wrapping, grouping, limit args, etc.).

I pushed the changes and ran `go test ./pkg/db/...`, `go test ./pkg/core/...`, and `make lint` afterwards. Next up: wire the alias/conflict resolution into the registry per the plan above.

Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1921#issuecomment-3481364818
Original created: 2025-11-03T16:17:19Z
Quick follow-up:

- The edge onboarding query now reads from `table(edge_onboarding_packages)` via the new aggregation helper. This avoids the unbounded stream scan that was blowing up with `Broken pipe` on Proton.
- Added `pkg/db/edge_onboarding_test.go` to lock in the query builder (verifies the `table(...)` wrapping, grouping, limit args, etc.).

I pushed the changes and ran `go test ./pkg/db/...`, `go test ./pkg/core/...`, and `make lint` afterwards. Next up: wire the alias/conflict resolution into the registry per the plan above.

Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1921#issuecomment-3481439286
Original created: 2025-11-03T16:33:19Z
Follow-up:
- The detail view now derives alias history from `_alias_last_seen_*` metadata, recent service attachments, and prior IPs. The detail API returns `alias_history` alongside the legacy device payload.
- Switched `max(updated_at)` to `updated_at` (fixes the Proton `arg_max(... max(updated_at))` error).
- `ghcr.io/carverauto/serviceradar-core:sha-69686f1c9ec5bfa1b85a62e1668fb032700521d2` rolled to demo; logs look clean and the UI shows the new alias panel for `k8s-agent`.

TODO: we still need to emit lifecycle events on alias/collector changes; tracking in the next worklist item.
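A sketch of how alias history might be derived from `_alias_last_seen_*` metadata keys; the struct shape and key names are illustrative, not the actual `alias_history` schema:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// AliasHistory is a hypothetical shape for the alias_history payload:
// which services and IPs were last seen attached to this host.
type AliasHistory struct {
	Services []string
	IPs      []string
}

// BuildAliasHistory derives the payload from metadata keys shaped like
// _alias_last_seen_service_id / _alias_last_seen_ip (assumed names).
func BuildAliasHistory(metadata map[string]string) AliasHistory {
	var h AliasHistory
	for k, v := range metadata {
		switch {
		case strings.HasPrefix(k, "_alias_last_seen_service"):
			h.Services = append(h.Services, v)
		case strings.HasPrefix(k, "_alias_last_seen_ip"):
			h.IPs = append(h.IPs, v)
		}
	}
	// Map iteration order is random; sort for a stable payload.
	sort.Strings(h.Services)
	sort.Strings(h.IPs)
	return h
}

func main() {
	meta := map[string]string{
		"_alias_last_seen_service_id": "serviceradar:agent:k8s-agent",
		"_alias_last_seen_ip":         "10.80.72.95",
		"hostname":                    "k8s-agent", // ignored: not an alias key
	}
	fmt.Printf("%+v\n", BuildAliasHistory(meta))
}
```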
Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1921#issuecomment-3481464849
Original created: 2025-11-03T16:38:52Z
Status
- Built and pushed `ghcr.io/carverauto/serviceradar-core:sha-69686f1c9ec5bfa1b85a62e1668fb032700521d2` via Bazel and rolled the demo cluster (`kubectl rollout status deployment/serviceradar-core -n demo`); core logs are clean.
- The detail API returns an `alias_history` payload (current service/IP, collector, historical service/IP aliases) alongside the legacy device document; the detail page renders the card and strips alias keys from the metadata grid.
- The registry now tracks `_alias_*`, `service_alias:*`, and `ip_alias:*` metadata so service hosts collapse back to their canonical record even when pod IPs change; counters/logs track how many rewrites were driven by service IDs.
- The alias query replaces `arg_max(..., updated_at)` with `max(updated_at) AS latest_updated_at` and orders on the alias, avoiding the previous `arg_max(... max(updated_at))` aggregation error in Proton.

Current Observations

- `k8s-agent` now shows the canonical IP (`default:10.80.72.95`) in the UI, and the alias history panel surfaces the agent and poller attachments correctly.
- Watch for `ILLEGAL_AGGREGATION` on other aggregate combinations; we fixed the known `latest_updated_at` query, but keep monitoring the logs in case more expressions need the same treatment.

Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1921#issuecomment-3481509646
Original created: 2025-11-03T16:48:43Z
Update: wrapped the edge onboarding package query in a bounded CTE (`WITH filtered AS (...) FROM table(edge_onboarding_packages)`) so every downstream select now operates on `table(filtered)`. That stops the `arg_max(...)` / WHERE rewrite Proton was complaining about. Added unit coverage to lock the new shape in place (see `pkg/db/edge_onboarding_test.go`). Bazel-built and pushed the core image (`sha256:bf0165277ed5f214328084a637e58e99d5e19c6e11adde9eaea607384f586683`) and rolled the demo namespace; rollout completed successfully.

Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1921#issuecomment-3481582097
Original created: 2025-11-03T17:03:32Z
Update: added alias lifecycle detection so host alias updates now emit `alias_updated` device lifecycle events. A new helper derives alias snapshots from metadata, compares them against the current `unified_devices` view, and publishes CloudEvents before we flush to the registry. Coverage lives in `pkg/core/alias_events_test.go` and `pkg/devicealias/alias_test.go`. The API now reuses the shared alias snapshot builder.

Bazel deps updated (new `pkg/devicealias` target), and `make lint` / `go test ./pkg/devicealias/... ./pkg/core/...` are green. Pushed `ghcr.io/carverauto/serviceradar-core@sha256:887a8254f92db2988ef0028405796d01cfeaf9b31709f2436245eaedd71c6043` and rolled `deployment/serviceradar-core` in demo.

Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1921#issuecomment-3481617608
Original created: 2025-11-03T17:11:19Z
Update: the device inventory now surfaces alias collectors using the `alias_history` payload; the device column shows badges for the active collector service, collector IP, and alias IP when they differ. Linted via `npm run lint`, rebuilt/pushed the web image (`ghcr.io/carverauto/serviceradar-web@sha256:8e123319b44247223b9b163e6a28762670ba10c39df8410d424ea562fd4ca4d7`), then rolled `deployment/serviceradar-web` in demo. Let me know if we should add summary badges to the stats cards next.

Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1921#issuecomment-3481668185
Original created: 2025-11-03T17:21:19Z
Update: the device summary cards now include a "Devices with Collectors" indicator that filters the inventory to hosts with active alias/collector metadata. Clicking the card (or picking "Collectors" from the status dropdown) issues `metadata._alias_last_seen_service_id:*`, so it lines up with the new row badges.

Lint/build: `cd web && npm run lint`, `bazel build --config=remote //docker/images:web_image_amd64`, `bazel run --config=remote //docker/images:web_image_amd64_push`. Rolled `deployment/serviceradar-web` in demo to `ghcr.io/carverauto/serviceradar-web@sha256:ab003ed140af5d9c2f8fdc9e10331fe26ab954d67f9f6b11bbe1c40841e7828a`.

Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1921#issuecomment-3481691870
Original created: 2025-11-03T17:26:27Z
Update: the inventory now skips ICMP status lookups for hosts without collectors. `DeviceTable` reuses the alias metadata to build the badge list and only POSTs `/api/devices/icmp/status` for devices where `_alias_last_seen_service_id` (or collector metadata) is present. This cut out the flood of failed fetches and avoids hammering core when only one agent is emitting ICMP metrics.

Rebuilt/pushed `ghcr.io/carverauto/serviceradar-web@sha256:402761521ccf720ecf54dd45778b1dde59c549700749d5be90e652f0e6bb71bc` and rolled `deployment/serviceradar-web` in demo.

Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1921#issuecomment-3481725339
Original created: 2025-11-03T17:35:32Z
Update: generalized the collector detection so we only hit the status APIs for devices that actually expose a collector. `DeviceTable` now inspects alias metadata, checker service types, and discovery sources to classify each host's capabilities (ICMP, sysmon, SNMP). We reuse that map both for the badges and to decide which device IDs to POST to `/api/devices/{icmp|sysmon|snmp}/status`, so we avoid hammering core when most rows are passive. Devices with collectors but an unknown type still fall back to the previous behaviour, to stay safe.

Lint/build: `cd web && npm run lint`, `bazel build --config=remote //docker/images:web_image_amd64`, `bazel run --config=remote //docker/images:web_image_amd64_push`. Rolled `deployment/serviceradar-web` in demo to `ghcr.io/carverauto/serviceradar-web@sha256:62f8791ebd3728af69f5f9a95fb2dd5defe36a47200068416ea262dd924ae980`.

Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1921#issuecomment-3482520518
Original created: 2025-11-03T20:43:36Z
Deployment refresh: rebuilt and pushed both the core (`bazel build/run --config=remote //docker/images:core_image_amd64[_push]`) and web (`//docker/images:web_image_amd64[_push]`) images. The demo namespace is now running `ghcr.io/carverauto/serviceradar-core@sha256:887a8254f92db2988ef0028405796d01cfeaf9b31709f2436245eaedd71c6043` and `ghcr.io/carverauto/serviceradar-web@sha256:62f8791ebd3728af69f5f9a95fb2dd5defe36a47200068416ea262dd924ae980`; rollouts for both deployments completed successfully.

Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1921#issuecomment-3482564677
Original created: 2025-11-03T20:55:26Z
Status update
- `flushServiceDeviceUpdates` now calls `buildAliasLifecycleEvents`, which compares new metadata to the existing unified device snapshot via the new `pkg/devicealias` helpers. We emit `alias_updated` lifecycle events with the before/after metadata, covered by unit tests.
- The device detail page renders `alias_history`, and the table view shows collector/alias badges while only polling ICMP/SNMP/Sysmon for devices that actually expose those collectors.
- Added `pkg/devicealias`, tightened the identity map plumbing, and extended the edge-onboarding query helper test so the bounded `table(...)` form stays locked in.
- Ran `gofmt` on all Go changes; `go test ./pkg/...`, `make lint`, `make test`, and `cd web && npm run lint` all pass locally.
- Pushed core `ghcr.io/carverauto/serviceradar-core@sha256:b02aecd404864b76ffcc622045bc23c3e896fa3964d65562bc4a7892a9c6d1c7` (tag `sha-69686f1c9ec5bfa1b85a62e1668fb032700521d2`) and web `ghcr.io/carverauto/serviceradar-web@sha256:138509147bf30098ef878fad836968a16adca105b40ccc8bf4dcf8bcda134a48` (tag `sha-86634609960a`), then rolled both demo deployments. Pods are healthy (`serviceradar-core-775d74cbf-x7rp5`, `serviceradar-web-76d47f6fdb-6bn42`).

Next up: hook the new alias event stream into the UI events page and keep an eye on demo logs for any remaining `arg_max` warnings or collector fetch failures.

Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1921#issuecomment-3483243883
Original created: 2025-11-04T00:55:52Z
Status update
- The events pipeline now surfaces `alias_updated` lifecycle notifications and emits richer summaries (service/IP/collector deltas), so the Events table describes what changed instead of a generic update. Covered the formatter with a new unit test.
- Ran `go test ./pkg/...`, `make lint`, `make test`, and `cd web && npm run lint`.
- Pushed core `ghcr.io/carverauto/serviceradar-core@sha256:170660b040071fa15b37604bbf60b49b568d425a9284c13a6972ed61c76c2f90` (tags `latest`, `sha-69686f1c9ec5bfa1b85a62e1668fb032700521d2`, `sha-170660b04007`) and web `ghcr.io/carverauto/serviceradar-web@sha256:d9e9f0172c39ee6021cc71d81482c41aa79a495865cd70eac3fdde18e8a87bd2` (tags `latest`, `sha-69686f1c9ec5bfa1b85a62e1668fb032700521d2`, `sha-c9dd80c2690b`). Rolled `serviceradar-core-b6666696c-4ngjp` and `serviceradar-web-5bc98f4c46-wljmn` in the demo namespace.

Next up: wire these lifecycle events into the Events dashboard stats/filters and watch the demo logs to confirm collectors stop spamming empty ICMP requests.
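A delta summary like the one described here could be produced by a formatter along these lines; the snapshot shape and function names are hypothetical, not the formatter covered by the unit test:

```go
package main

import (
	"fmt"
	"strings"
)

// AliasSnapshot is a hypothetical before/after view carried on an
// alias_updated lifecycle event.
type AliasSnapshot struct {
	ServiceID, IP, Collector string
}

// SummarizeAliasChange renders a human-readable delta so the Events table
// can say what changed instead of showing a generic "device updated".
func SummarizeAliasChange(before, after AliasSnapshot) string {
	var parts []string
	if before.ServiceID != after.ServiceID {
		parts = append(parts, fmt.Sprintf("service %s -> %s", before.ServiceID, after.ServiceID))
	}
	if before.IP != after.IP {
		parts = append(parts, fmt.Sprintf("ip %s -> %s", before.IP, after.IP))
	}
	if before.Collector != after.Collector {
		parts = append(parts, fmt.Sprintf("collector %s -> %s", before.Collector, after.Collector))
	}
	if len(parts) == 0 {
		return "no alias changes"
	}
	return strings.Join(parts, "; ")
}

func main() {
	before := AliasSnapshot{ServiceID: "serviceradar:agent:old", IP: "10.80.72.10"}
	after := AliasSnapshot{ServiceID: "serviceradar:agent:k8s-agent", IP: "10.80.72.95"}
	fmt.Println(SummarizeAliasChange(before, after))
}
```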
Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1921#issuecomment-3483453028
Original created: 2025-11-04T02:22:18Z
Follow-up:
- Removed the `/api/devices/:id` round-trip from the detail page. The alias card now derives its history from the SRQL device payload via the new `web/src/lib/alias.ts` helpers, so the UI no longer hits the core API (and avoids the Proton timeouts).
- Pushed `ghcr.io/carverauto/serviceradar-web@sha256:9ba3e23200ee492d714cd431c1ca25c0128a01dc7f0e4e0300230154f67c57ad` (tag `sha-b83df2d60403`) and rolled `serviceradar-web-55f8b4b6fc-vkg25` in demo. Core image unchanged.
- Ran `go test ./pkg/...`, `make lint`, `make test`, and `cd web && npm run lint`.

Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1921#issuecomment-3483458831
Original created: 2025-11-04T02:26:05Z
Recap + next steps:
- The alias panel now builds from the SRQL device payload (no `/api/devices/:id` fetch), so it loads even while Proton refuses long reads.
- Capped `GetICMPMetricsForDevice` at the standard 2,000-row limit to stop the broken-pipe crashes, but we still query for every device.

Planned ICMP cleanup:

- In `GetICMPMetricsForDevice`, check whether the device metadata/alias flags indicate an ICMP collector. If not, skip the Proton read entirely.
- Filter `/api/devices/icmp/status` so only eligible devices trigger the status fetch.

Once those guardrails are in, we can rebuild/push the core image and roll the demo deployment again.
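The planned guard can be sketched like this, assuming a hypothetical per-device `CollectorCapabilities` hint (the real check reads device metadata/alias flags):

```go
package main

import "fmt"

// CollectorCapabilities is a hypothetical capability hint attached to a device.
type CollectorCapabilities struct {
	ICMP, SNMP, Sysmon bool
}

// shouldQueryICMP gates the expensive Proton read: only devices whose
// hints advertise an ICMP collector get queried at all.
func shouldQueryICMP(caps *CollectorCapabilities) bool {
	return caps != nil && caps.ICMP
}

// FilterICMPEligible keeps only device IDs worth sending to the status
// endpoint, replacing a per-device fan-out with a pre-filtered batch.
func FilterICMPEligible(ids []string, caps map[string]*CollectorCapabilities) []string {
	var eligible []string
	for _, id := range ids {
		if shouldQueryICMP(caps[id]) {
			eligible = append(eligible, id)
		}
	}
	return eligible
}

func main() {
	caps := map[string]*CollectorCapabilities{
		"k8s-agent": {ICMP: true},
		"printer-1": {},
	}
	// Only k8s-agent survives; printer-1 and the unknown ID skip the Proton read.
	fmt.Println(FilterICMPEligible([]string{"k8s-agent", "printer-1", "unknown"}, caps))
}
```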
Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1921#issuecomment-3483513653
Original created: 2025-11-04T02:42:07Z
Update 2025-11-03:
Next steps:
Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1921#issuecomment-3483545682
Original created: 2025-11-04T02:54:27Z
Update 2025-11-03 (cont.):
Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1921#issuecomment-3483545993
Original created: 2025-11-04T02:54:40Z
Update 2025-11-03 (cont.):
- Device records now expose `collector_capabilities`, and `/api/devices/{id}/metrics` consults the same data before falling back to Proton. The Next.js device table now trusts those hints, so it skips ICMP polling for nodes that aren't collectors.
- Rolled the `serviceradar-core` and `serviceradar-web` deployments.

Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1921#issuecomment-3483594079
Original created: 2025-11-04T03:13:40Z
Update 2025-11-03 (metrics summary sweep):
- Devices now carry a `metrics_summary` alongside `collector_capabilities`. That data flows through `/api/devices`, the legacy fallback, and the single-device endpoint, so the UI can see which devices actually have ICMP/SNMP/Sysmon data without fan-out queries.
- `fetchDeviceMetricSummary` uses a single (chunked) Proton query over the last 6h, so inventory loads no longer issue O(N) status checks. The Next.js table trusts the summary and skips status fetches when a device hasn't produced data.

Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1921#issuecomment-3483602617
Original created: 2025-11-04T03:19:39Z
Quick update: traced the device-metrics 500s to the ICMP path. When the ring-buffer collector isn't registered, the API fell back to the generic metrics query, which bubbles Proton errors up as 500s. I pushed a fix that always routes ICMP requests through the dedicated fallback (with the existing collector capability guard) and added coverage for the no-ring-buffer case.

`go test` reports `ok github.com/carverauto/serviceradar/pkg/core/api (cached)`, and `make lint` is clean locally (golangci-lint v2.4.0 with 0 issues; the Rust linter and all OCaml lint checks pass as well). Once this hits the demo cluster we should see the inventory stop logging 500s on refresh; please keep an eye on the core logs in case Proton still reports timeouts so we can tune the limit further if needed.
Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1921#issuecomment-3483669206
Original created: 2025-11-04T04:05:52Z
Follow-up on the inventory 500s: the front-end was still firing one ICMP metrics request per device because (a) the bulk status endpoint looped over every ID and (b) the sparkline kicked off a fetch before we knew whether metrics existed. I switched the table to rely on the `metrics_summary` data that core already returns, removed the `/api/devices/icmp/status` fan-out, and only let the sparkline fetch when we positively know a device has ICMP data. That keeps the initial render fast and stops us from hammering Proton. Verified with `npm run lint`, `go test ./pkg/core/api -run TestGetDeviceMetrics`, and `make lint` locally.

Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1921#issuecomment-3483813689
Original created: 2025-11-04T04:49:10Z
Rolled a new core build (`ghcr.io/carverauto/serviceradar-core@sha256:1607369a0909ddde604fa08e08f1b02deb5b5e2cc0e970d9ba65f9caa81b5cdb`) after teaching the canonical cache to keep short-lived entries for weak identities. Devices without MACs used to miss the cache on every heartbeat, so core pounded Proton with `GetUnifiedDevicesByIPsOrIDs` calls until the connection pool starved. Now we record those "weak" snapshots with a short TTL, and we also memoize the fallback device ID when Proton has no match. That keeps repeat lookups entirely in-memory while we wait for stronger identity data to arrive. `go test ./pkg/core/...` and `make lint` are clean locally.

Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1921#issuecomment-3483847141
Original created: 2025-11-04T04:54:51Z
Chased down the remaining ICMP flood: the web inventory was still treating every device with legacy ping metadata as an active collector, so the sparkline spun up `/api/devices/<id>/metrics` for the entire fleet. The table now requires both a collector capability hint and a fresh `metrics_summary.icmp` flag before rendering the sparkline (`DeviceTable.tsx`), and the component won't fetch unless all of those signals are present. That keeps the page to a single ICMP fetch for `k8s-agent`, which clears the 500s and the load-time issues we were seeing. `npm run lint` remains clean.

Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1921#issuecomment-3483862768
Original created: 2025-11-04T04:59:15Z
Redeployed the frontend with the ICMP collector guard (image `ghcr.io/carverauto/serviceradar-web:sha-9fc538ddb644`) and rolled the demo web deployment so the inventory stops issuing blanket `/api/devices/*/metrics` calls.

Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1921#issuecomment-3483875770
Original created: 2025-11-04T05:06:46Z
Stats cards now pull from the devices API (limit 500) and compute totals locally so we aren't fighting SRQL column rewrites. Rebuilt/pushed `ghcr.io/carverauto/serviceradar-web:sha-f159e467e544` (digest `sha256:cf64ee7d179f3876ef8b9cfc277ae166cd1d23f296728a0fd39cf83b93f9841e`) and rolled the demo web deployment.

Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1921#issuecomment-3483892296
Original created: 2025-11-04T05:11:18Z
Tweaked the ICMP sparkline guard so we still render for devices that advertise an actual collector (service ID / collector IP) even when the metric summary hasn't flipped to true yet. The sparkline now fetches unless we know for sure there's no data. Rebuilt `ghcr.io/carverauto/serviceradar-web:sha-38b6b0c021ab` (digest `sha256:9c47c07181535fab3d2f3bc6d582b3eec6a0c08a75e955b2482936535cbd8bf2`) and rolled the demo web deployment.

Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1921#issuecomment-3483904235
Original created: 2025-11-04T05:14:53Z
Patched the device detail page to handle missing availability data gracefully (no more NaN timestamps or an empty bar), and tweaked the ICMP sparkline logic so it still fetches when a collector exists but the summary hasn't flipped yet. Deployed `ghcr.io/carverauto/serviceradar-web:sha-96aba4533038` (digest `sha256:a7e9f1e3c52e6c08708e53dcef6c3d0387564143a0e2c197e9d728f5833aa47a`) to the demo cluster.

Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1921#issuecomment-3483928170
Original created: 2025-11-04T05:20:25Z
Totals on the device inventory now come from SRQL counts again, using the translator's `stats:"count() as total"` path with per-query error handling and a fallback collector filter, so the cards reflect the full fleet (~50k) instead of the 500-device API page. Also deployed the updated web image `ghcr.io/carverauto/serviceradar-web:sha-d64513a5712c` (digest `sha256:8cb201c59f9bc55b208196ba0955da2ad26e32342fa06b759b8320b3f498437c`).

Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1921#issuecomment-3483946451
Original created: 2025-11-04T05:26:00Z
Stat-card filter now computes the target query up front, so the first click kicks off the SRQL load immediately instead of needing a second tap. Deployed (digest ) to the demo cluster.
Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1921#issuecomment-3483995581
Original created: 2025-11-04T05:39:44Z
Latest: we sidestepped the edge-onboarding spam from core, which now caches poller/agent lookups (60s TTL) so we hit Proton at most once per minute per component instead of on every heartbeat; repeated misses also short-circuit. Stats cards now call SRQL directly, so they reflect the full fleet and respond on the first click. Device detail handles empty timelines gracefully, and ICMP sparklines only render for real collectors while still fetching promptly for `k8s-agent`.
Images:
Next steps: monitor Proton for the edge-onboarding query — with the cache in place it should quiet down to one lookup per minute per poller/agent. The Devices inventory should now flip between filters on a single click while keeping load times stable.
Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1921#issuecomment-3484022694
Original created: 2025-11-04T05:51:20Z
Updates:
Tests:
Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1921#issuecomment-3484092830
Original created: 2025-11-04T06:24:43Z
Follow-up:
- Added an `ActivationCacheStats()` helper so we can inspect usage without scraping Proton logs. Cache fetches now log refresh/miss events at debug level, making it obvious when we fall back to `ListEdgeOnboardingPackages`.

Tests:
Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1921#issuecomment-3487947851
Original created: 2025-11-04T20:44:27Z
I have investigated the high CPU load on the `serviceradar-proton` database and have implemented two optimizations to reduce it.

First, I increased the `hop()` interval in the `device_metrics_aggregator_mv` materialized view from `60s` to `300s` to reduce the query execution frequency.

Second, I optimized the query that was causing the high load on the `edge_onboarding_packages` table by using a window function instead of `arg_max`.

I have deployed a new version of the `serviceradar-core` service with these changes. I will continue to monitor the database load and will provide another update when I have more information.

Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1921#issuecomment-3488186806
Original created: 2025-11-04T22:11:04Z
Status recap (Nov 4)
What we found
- The UI was triggering full `table(unified_devices)` scans on every render.
- `GetUnifiedDevicesByIPsOrIDs` did a straight `SELECT … WHERE device_id IN (…) ORDER BY _tp_time` and re-read ~1 GB per call.
- The canonical resolver (`resolveIPsToCanonical`) and the onboarding query builder added their own unbounded scans (tombstone rows + `ROW_NUMBER`), which amplified the load.

What we changed today

- Enabled query logging (`SET log_queries = 1`, `log_formatted_queries = 1`) via the tools pod so we can watch offenders live.
- Switched the inventory filters to `collector_capabilities.has_collector` instead of `metadata._alias_last_seen_service_id:*` (`LIKE` filters) and trimmed the ICMP sparkline logic so it only fires when we have real collector evidence.
- Rewrote `GetUnifiedDevicesByIPsOrIDs` to aggregate with `arg_max(..., _tp_time)` and dropped the `ud.` alias; each batch now only reads one row per ID/IP.
- Reshaped the remaining aggregates into a form Proton accepts (it wants `anyLast()` and disallows nested aggregates).
- Deployed `serviceradar-core@sha256:a2eacd5545f7f74528ea1f4ea874c4167aebdc07e3748c7bbfbd60a28bb5bef4` and `serviceradar-web@sha256:586e6989641d13e5ec015a3e67d4a9be8057d7dd227d5664373436c0602e34dc`.

Still broken / outstanding errors

- Proton still errors on the `GetUnifiedDevicesByIPsOrIDs` path when we ask for IPs only; the `WHERE ip IN (…)` is evaluated before the aggregation, so Proton throws. We need to push the IP filter into a sub-select/CTE or pre-aggregate per IP before the final GROUP BY.
- Lookups for `0.0.0.0` still show up (same query as above). They're lighter now (~0.45 GiB vs ~0.9 GiB), but we should fix the aggregator error first.

Immediate next steps

- Rework `GetUnifiedDevicesByIPsOrIDs` to use a `WITH filtered AS (SELECT … WHERE ip IN …)` CTE and run the `arg_max` aggregation in the outer SELECT so no aggregate appears in the WHERE clause.

Leaving the Proton errors in the log for now; we'll circle back after adjusting the aggregation layout.
Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1921#issuecomment-3489020597
Original created: 2025-11-05T03:06:38Z
Proton Performance Fix Applied (2025-11-04)
Problem Identified
The Proton database was pegged at 99% CPU (3986m / ~4 cores) due to inefficient query patterns in `GetUnifiedDevicesByIPsOrIDs`. The queries were triggering:

Root Cause

The original query pattern used `LIMIT 1 BY device_id` with a `WHERE ... IN (...)` clause directly on `table(unified_devices)`, which Proton couldn't optimize properly. The versioned_kv stream was being scanned entirely for every batch lookup.

Solution Implemented
Rewrote the batch query to use a CTE (Common Table Expression) pattern in `pkg/db/unified_devices.go`. This separates the filter operation from the aggregation, allowing Proton's query planner to optimize execution.
Results
- `system.query_log`
- `ghcr.io/carverauto/serviceradar-core@sha256:baea26badbef...`

Remaining Work
The CPU is down from 99% to ~12%, which is a massive improvement. To get it lower we should still:
The immediate crisis is resolved—Proton is no longer pegged.
Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1921#issuecomment-3489060621
Original created: 2025-11-05T03:18:17Z
Architecture Refactor Plan Created
I've documented a comprehensive plan for the real fix in `newarch_plan.md`.

What We Just Did (Tactical)
The CTE query optimization was a tactical fix that stopped the immediate bleeding:
But we're still fundamentally doing the wrong thing: treating Proton as OLTP when it should be OLAP.
What Needs To Happen (Strategic)
The plan outlines a 6-phase refactor to establish proper data architecture:
`count()` queries
Timeline
10-week sprint plan with independent phases, feature flags for rollback, clear success metrics.
See `newarch_plan.md` for full details including:

The tactical fix bought us time. This plan delivers the architecture that scales to millions of devices.
Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1921#issuecomment-3489098773
Original created: 2025-11-05T03:32:34Z
Created tracking issue for the strategic architecture refactor: #1924
The tactical CTE fix in this issue is deployed and working (CPU down 62-75%). Issue #1924 tracks the full 6-phase refactor to move device state to an in-memory registry and treat Proton as OLAP-only.