Architecture Refactor: Move device state to registry, treat Proton as OLAP warehouse #639

Closed
opened 2026-03-28 04:26:53 +00:00 by mfreeman451 · 11 comments
Owner

Imported from GitHub.

Original GitHub issue: #1924
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1924
Original created: 2025-11-05T03:32:25Z


Problem Statement

ServiceRadar currently treats Proton (a stream processing database) as the primary source of truth for device state, an approach that causes performance problems and doesn't scale beyond tens of thousands of devices. While the tactical CTE query fix (#1921) reduced Proton CPU from 3986m to ~1000m, we're still fundamentally doing the wrong thing: hitting Proton for every device lookup, stats query, and inventory search.

Current issues:

  • Proton CPU baseline: ~1000m (1 core) just for normal operations
  • Device lookups read 640k+ rows to find latest state
  • Dashboard stats issue live count() queries on 50k devices
  • Inventory search does full table scans with metadata map extraction
  • Collector capability derived by scraping metadata keys
  • No audit trail for "when did device X last have ICMP capability?"

Vision

Establish a proper layered data architecture:

  • Hot tier: In-memory device registry for current state (μs latency)
  • Warm tier: Search index for inventory queries (ms latency)
  • Cold tier: Proton for time-series analytics and audit logs (s latency acceptable)

Proton should only answer questions like:

  • "Show me ICMP RTT for device X over last 7 days"
  • "What devices were discovered in the last hour?"
  • "Run this exploratory SRQL query across historical data"

Proton should not answer questions like:

  • "Does device X exist?" → Use registry
  • "What's the hostname of device X?" → Use registry
  • "How many devices have ICMP collectors?" → Use stats cache
  • "Search for devices matching 'foo'" → Use search index

Detailed Plan

See `newarch_plan.md` for comprehensive implementation details including:

  • Retrospective (how we got here, why tactical fix wasn't enough)
  • 6-phase refactor with code examples
  • Sprint breakdown (10 weeks)
  • Success metrics and rollback plan

Implementation Phases

Phase 1: Device Registry Service (Week 1-2)

Goal: Canonical in-memory device graph

  • [x] Define `DeviceRecord` schema in `pkg/registry/device.go`
  • [x] Implement `DeviceRegistry` with in-memory map + RWMutex
  • [x] Hydrate from Proton on startup (`HydrateRegistryFromProton`)
  • [x] Update `DeviceManager.UpsertDevice()` to write to both Proton + Registry
  • [x] Unit tests for registry operations

Success: Registry hydrates from Proton, stays in sync with new updates
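
The map-plus-RWMutex shape could look something like this minimal sketch (field names and methods here are illustrative, not the actual `pkg/registry` API):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// DeviceRecord is the canonical in-memory view of one device.
// Fields here are illustrative, not the real schema.
type DeviceRecord struct {
	DeviceID string
	Hostname string
	IP       string
	LastSeen time.Time
}

// DeviceRegistry is the hot tier: a plain map guarded by an RWMutex,
// so concurrent reads (the common case) never block each other.
type DeviceRegistry struct {
	mu      sync.RWMutex
	devices map[string]*DeviceRecord
}

func NewDeviceRegistry() *DeviceRegistry {
	return &DeviceRegistry{devices: make(map[string]*DeviceRecord)}
}

// Upsert records the latest state; the caller also writes to Proton
// so the audit trail is preserved (the Phase 1 dual-write).
func (r *DeviceRegistry) Upsert(rec *DeviceRecord) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.devices[rec.DeviceID] = rec
}

// Get answers "does device X exist / what is its state?" without Proton.
func (r *DeviceRegistry) Get(deviceID string) (*DeviceRecord, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	rec, ok := r.devices[deviceID]
	return rec, ok
}

func main() {
	r := NewDeviceRegistry()
	r.Upsert(&DeviceRecord{DeviceID: "dev-1", Hostname: "edge-01", IP: "10.0.0.5", LastSeen: time.Now()})
	if rec, ok := r.Get("dev-1"); ok {
		fmt.Println(rec.Hostname) // edge-01
	}
}
```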

Phase 2: First-Class Collector Capabilities (Week 3-4)

Goal: Stop deriving capability from metadata

  • [x] Define `CollectorCapability` schema
  • [x] Implement `CapabilityIndex` in `pkg/registry/capabilities.go`
  • [x] Update agent/poller registration to emit capabilities
  • [x] Update API `/devices/{id}/collectors` to use registry
  • [x] Remove all metadata scraping (`_alias_last_seen_service_id`, etc.)
  • [x] Update UI to use new capabilities API

Success: Collector status from explicit records, not metadata inference

Phase 3: Stats Aggregator (Week 5)

Goal: Pre-aggregate dashboard metrics

  • [ ] Implement `StatsAggregator` that runs every 10 seconds
  • [ ] Add `StatsSnapshot` cache to registry
  • [ ] Create `/api/stats` endpoint
  • [ ] Update dashboard tiles to call `/api/stats`
  • [ ] Remove SRQL stat card queries from UI

Success: Dashboard loads in <10ms, no Proton queries for stats
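
One possible shape for the aggregator, assuming a compute callback that walks the registry rather than Proton (all names and fields are illustrative):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// StatsSnapshot is the pre-aggregated payload a /api/stats handler
// would serve. Fields are illustrative.
type StatsSnapshot struct {
	TotalDevices  int
	OnlineDevices int
	GeneratedAt   time.Time
}

// StatsAggregator recomputes the snapshot on a ticker (every 10s in the
// plan) so dashboard reads are a cheap lock-protected copy, never a query.
type StatsAggregator struct {
	mu       sync.RWMutex
	snapshot StatsSnapshot
	compute  func() StatsSnapshot // walks the in-memory registry
}

func NewStatsAggregator(compute func() StatsSnapshot) *StatsAggregator {
	return &StatsAggregator{compute: compute}
}

// Run refreshes on the given interval until stop is closed.
func (a *StatsAggregator) Run(interval time.Duration, stop <-chan struct{}) {
	a.Refresh()
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			a.Refresh()
		case <-stop:
			return
		}
	}
}

// Refresh recomputes and swaps in a fresh snapshot.
func (a *StatsAggregator) Refresh() {
	snap := a.compute()
	snap.GeneratedAt = time.Now()
	a.mu.Lock()
	a.snapshot = snap
	a.mu.Unlock()
}

// Snapshot is what the stats endpoint returns: a copy, no Proton query.
func (a *StatsAggregator) Snapshot() StatsSnapshot {
	a.mu.RLock()
	defer a.mu.RUnlock()
	return a.snapshot
}

func main() {
	agg := NewStatsAggregator(func() StatsSnapshot {
		return StatsSnapshot{TotalDevices: 50000, OnlineDevices: 48211}
	})
	agg.Refresh()
	fmt.Println(agg.Snapshot().TotalDevices) // 50000
}
```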

Phase 4: Search Index (Week 6-7)

Goal: Fast inventory search without table scans

  • [ ] Implement in-memory trigram index in `pkg/search/trigram.go`
  • [ ] Integrate with `DeviceRegistry.Upsert()` to update index
  • [ ] Add `/api/devices/search?q=...` endpoint
  • [ ] Update inventory UI to use search API (OPEN – next engineer)
  • [ ] Remove `SELECT ... LIKE ...` queries from codebase

Success: Inventory search returns in <50ms for any query
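
A toy version of trigram indexing to illustrate the idea (the real `pkg/search/trigram.go` will differ; ranking here is plain trigram-overlap count):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// trigrams returns the distinct lowercase 3-grams of s.
func trigrams(s string) []string {
	s = strings.ToLower(s)
	seen := make(map[string]bool)
	var out []string
	for i := 0; i+3 <= len(s); i++ {
		g := s[i : i+3]
		if !seen[g] {
			seen[g] = true
			out = append(out, g)
		}
	}
	return out
}

// TrigramIndex maps each trigram to the device IDs whose searchable
// text contains it; no table scan is ever needed at query time.
type TrigramIndex struct {
	postings map[string]map[string]bool
}

func NewTrigramIndex() *TrigramIndex {
	return &TrigramIndex{postings: make(map[string]map[string]bool)}
}

// Add indexes a device's searchable text (hostname, IP, etc.).
func (ix *TrigramIndex) Add(deviceID, text string) {
	for _, g := range trigrams(text) {
		if ix.postings[g] == nil {
			ix.postings[g] = make(map[string]bool)
		}
		ix.postings[g][deviceID] = true
	}
}

// Search returns device IDs ordered by descending trigram overlap with q.
func (ix *TrigramIndex) Search(q string) []string {
	scores := make(map[string]int)
	for _, g := range trigrams(q) {
		for id := range ix.postings[g] {
			scores[id]++
		}
	}
	ids := make([]string, 0, len(scores))
	for id := range scores {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool {
		if scores[ids[i]] != scores[ids[j]] {
			return scores[ids[i]] > scores[ids[j]]
		}
		return ids[i] < ids[j] // stable tie-break
	})
	return ids
}

func main() {
	ix := NewTrigramIndex()
	ix.Add("dev-1", "edge-router-01 10.0.0.5")
	ix.Add("dev-2", "core-switch-02 10.0.0.6")
	fmt.Println(ix.Search("router")) // [dev-1]
}
```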

Phase 5: Capability Matrix (Week 8-9)

Goal: Model Device ⇄ Service ⇄ Capability explicitly

  • [ ] Define `device_capabilities` stream in Proton for audit trail
  • [ ] Implement `CapabilityMatrix` in `pkg/registry/matrix.go`
  • [ ] Update agent heartbeats to report capability checks
  • [ ] Create capability monitoring/alerting
  • [ ] Dashboard shows capability status + last-seen

Success: Can answer "when did device X last have successful ICMP?" without manual queries

Phase 6: Proton Boundary Enforcement (Week 10)

Goal: Ensure all state queries hit registry, not Proton

  • [ ] Audit all `db.*` calls in `pkg/core/api`
  • [ ] Replace device state queries with registry lookups
  • [ ] Add linter rule / middleware to prevent non-analytics Proton queries
  • [ ] Document "when to use Proton vs registry" guidelines
  • [ ] Final performance validation

Success: Proton CPU <200m under normal load
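
One hypothetical shape for the enforcement layer: a query guard that a DB wrapper or linter could apply. The stream names and rule below are invented for illustration, not the actual policy:

```go
package main

import (
	"fmt"
	"strings"
)

// stateStreams are streams that only analytics code may scan directly;
// everything else must use the registry. Names are illustrative.
var stateStreams = []string{"unified_devices", "device_updates"}

// guardQuery rejects device-state scans issued outside an explicit
// analytics context, pointing callers at the registry instead.
func guardQuery(sql string, analyticsCtx bool) error {
	if analyticsCtx {
		return nil
	}
	lower := strings.ToLower(sql)
	for _, s := range stateStreams {
		if strings.Contains(lower, s) {
			return fmt.Errorf("query touches state stream %q: use the registry instead", s)
		}
	}
	return nil
}

func main() {
	err := guardQuery("SELECT * FROM table(unified_devices)", false)
	fmt.Println(err != nil) // true: blocked outside analytics context
}
```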

Success Metrics

Performance Targets

  • Registry lookups: <1ms (currently 500ms+ from Proton)
  • Dashboard stats: <10ms (currently 500ms+ from live count())
  • Inventory search: <50ms (currently 1-5s from table scan)
  • Proton CPU baseline: <200m (currently ~1000m)

Data Quality

  • Collector capability accuracy: 100% (explicit records vs inferred)
  • Audit trail: All capability changes logged to Proton
  • No stale data: Registry TTL/refresh keeps cache current

Developer Experience

  • Query clarity: `registry.Get(deviceID)` not SQL
  • Testability: Registry is mockable
  • Debuggability: Capability matrix shows exact state + history

Rollback Plan

Each phase is independently deployable with feature flags:

```go
const (
	UseRegistry        = true // Phase 1
	UseCapabilityIndex = true // Phase 2
	UseStatsCache      = true // Phase 3
	UseSearchIndex     = true // Phase 4
)
```

If any phase has issues, disable the flag and fall back to Proton queries (slower but functional).

Related Issues

  • #1921 - Original Proton performance crisis (tactical CTE fix applied)

Open Questions

  1. Registry persistence: Should we persist registry snapshots to disk for faster restarts?
  2. Registry size: At 1M devices, in-memory registry ≈ 1-2GB. Acceptable?
  3. Search sophistication: Do we need Elastic's query DSL, or is trigram enough?
  4. Capability staleness: How long before we mark a collector capability as "stale"?
  5. Multi-region: How does registry sync across clusters?

References

  • `newarch_plan.md` - Full implementation details with code examples
  • `debug.md` - Performance investigation notes
  • Commit `85733a09` - Tactical CTE query fix
  • Commit `65e5d947` - Architecture plan documentation

Imported GitHub comment.

Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1924#issuecomment-3489467643
Original created: 2025-11-05T05:41:42Z


Phase 1 progress recap

  • Added DeviceRecord and in-memory store (pkg/registry/device.go, device_store.go) with ID/IP/MAC indexes and cache snapshot helpers.
  • Boot hydration now bulk-loads Proton state and rebuilds indexes/search on startup (pkg/registry/hydrate.go, pkg/core/server.go).
  • Registry ProcessBatchDeviceUpdates keeps the hot cache in sync on every update/tombstone and exposes cache-backed getters used by API/device manager code paths.
  • Device detail endpoint now favors the registry snapshot, with Proton as a fallback for cache misses (pkg/core/api/server.go).
  • Introduced a trigram-based search index, wired it into the registry, and ranked results with relevance + recency before handing unified views back to the API (pkg/registry/trigram_index.go, pkg/registry/registry.go).

What’s left (next engineer hand-off)

  1. Web UI integration
    • web/src/components/Devices/DeviceList (and any inventory/search routes) should call the new registry search endpoint instead of fan-out SRQL queries. Surfacing metrics_summary, alias_history, and collector capability blobs that the API now attaches will require mapping the new fields in the React data loader and updating DeviceRow renderers.
    • Highlight matches using the ranked order: carry the trigram score from the API (extend the response to include score), display an inline badge for exact hostname/IP hits, and preserve the existing status filters.
    • Audit client-side filtering to ensure it doesn’t reintroduce Proton calls—update web/src/lib/api.ts to use the registry-backed /api/devices list/search endpoints.
  2. Search telemetry & UX polish
    • Emit a lightweight metric (Prometheus counter + histogram) from the API search path (pkg/core/api/server.go) capturing query length, match count, and latency so we can validate the <50 ms target under load.
    • Add a UI-level empty state when no results are returned, and surface the total count (API already has everything needed once we extend the response).
  3. Follow-on cache consumers
    • Update any remaining backend paths that still hit db.GetUnifiedDevices... (e.g., identity lookup, mapper publisher) to rely on DeviceRegistry.SearchDevices/GetDeviceRecord to avoid Proton reads.
    • Gate search and cache features behind a feature flag so we can roll out gradually; add config plumbing in pkg/core/server.go + pkg/core/api/server.go and note it in docs/docs/agents.md.

Once those are in place, we can iterate on Phase 2 (capability index) with a warmed-up UI and telemetry to prove the search latency/success metrics.


Imported GitHub comment.

Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1924#issuecomment-3489640042
Original created: 2025-11-05T06:52:40Z


Update after Phase 1/2 rollout:

  • Proton returned to ~4 cores within 10 minutes of redeploy (pod metrics: serviceradar-proton-654fbcbcbf-bqdxs at 3993m CPU / 3018Mi memory).
  • Query profiling shows the dominant query, issued once per minute by the SRQL/Proton OCaml client, scans ~15.6M rows / 5.9GB per run, so the Observability dashboard still hammers Proton.
  • The CTE-based device lookups introduced in Phase 1 are still invoked hundreds of times per half hour (totalling ~3.7e8 rows read). They're better than the old pattern but remain an expensive fallback because SRQL routes keep hitting Proton instead of the registry cache.
  • There are still Code 210 exceptions for giant clauses generated from SRQL filters (e.g. 100+ IPs or Armis IDs), which cause retries and more table scans.

To address the remaining load we extended the plan with Phase 3b (Critical Log Rollups) so the web dashboards consume a dedicated log digest instead of the raw scan, and we tightened Sprint 6 tasks to force SRQL/device lookups through the registry and search index. That should eliminate the hot queries once Phases 3-6 are complete.


Imported GitHub comment.

Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1924#issuecomment-3489724548
Original created: 2025-11-05T07:26:04Z


Status update from the architecture refactor work:

  • Log digest cache landed: added pkg/core/log_digest.go with a capped ring buffer + 1h/24h counters, hydrated every 30s from Proton via the new DBLogDigestSource helper. Core start-up now wires the aggregator and keeps it refreshed until shutdown.
  • New critical log APIs: exposed /api/logs/critical and /api/logs/critical/counters (protected routes); the handlers serve the in-memory digest so fatal/error widgets no longer hit SRQL.
  • Frontend wired to cache: web/src/services/dataService.ts fetches the new endpoints and supplies CriticalLogsWidget with typed data + counters; accompanying unit coverage mocks the API responses.
  • Plan/doc cleanup: Phase 3b items are checked off in newarch_plan.md to reflect the cache + API + UI work.
  • Validation: go test ./pkg/core/... and npm run lint are green.

Remaining for Phase 3b: stream-driven hydration (instead of snapshots) and feature-flag plumbing once we’re ready to roll this out broadly.


Imported GitHub comment.

Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1924#issuecomment-3492120109
Original created: 2025-11-05T16:17:11Z


Phase 3b status update:

  • Landed the log digest aggregator, tailer, and persistence plumbing; feature flag (features.use_log_digest) is now available in config.
  • Built/published ghcr.io/carverauto/serviceradar-core@sha256:4124f3f298f13c1d2425725bbca80c8bc2e902a93074e2e3849a24103b6e1be9 and rolled the demo deployment to that image.
  • During the rollout, enabling UseLogDigest in the demo cluster prevented the HTTP listener from ever becoming ready (readiness probe stayed red). For now the flag is set to false in the runtime config so the new build can serve traffic.
  • Follow-up: debug why enabling the log digest stream blocks readiness before we flip the flag back on.


Imported GitHub comment.

Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1924#issuecomment-3492120711
Original created: 2025-11-05T16:17:20Z


Phase 3b status update:

  • Landed the log digest aggregator, tailer, and persistence plumbing; feature flag (features.use_log_digest) is now exposed in config.
  • Built/published ghcr.io/carverauto/serviceradar-core@sha256:4124f3f298f13c1d2425725bbca80c8bc2e902a93074e2e3849a24103b6e1be9 and rolled the demo deployment to that image.
  • During the rollout, enabling UseLogDigest in the demo cluster prevented the HTTP listener from ever becoming ready (readiness probe stayed red). For now the flag is set to false in the runtime config so the new build can serve traffic.
  • Follow-up: debug why enabling the log digest stream blocks readiness before we flip the flag back on.

Imported GitHub comment.

Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1924#issuecomment-3492222622
Original created: 2025-11-05T16:36:04Z


Re-enabled the log digest path in demo and rolled the cluster:

  • Updated serviceradar-config to set features.use_log_digest=true, then rebuilt/pushed ghcr.io/carverauto/serviceradar-core (sha256:ab992d84af2ad9500ce0c4d37c2f7b3231eb76a145c267acdb0a205388c0bb9b, tag sha-057b69fdcc8cb45a3d1e46ffb395d910474d897a).
  • Applied the refreshed ConfigMap and set the deployment image to the new tag; rollout completed and the pod is healthy (serviceradar-core-c8cf58f59-dcgvb reached 1/1 ready in ~70s).
  • Startup logs confirm the async bootstrap now times out without blocking readiness, so the HTTP listener comes up cleanly.

Follow-up: Proton is rejecting the streaming tail with code: 62 ... Syntax error ... EMIT CHANGES; the aggregator is retrying with exponential backoff. We’ll need to adjust the tail query so the digest stays up to date once the flag stays on.


Imported GitHub comment.

Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1924#issuecomment-3492446609
Original created: 2025-11-05T17:20:38Z


Validated the streaming log tailer end-to-end:

  • Rebuilt/pushed ghcr.io/carverauto/serviceradar-core@sha256:c587c6cadf6b1e26182ae93641c42d75d236e93a3c0d76b41267140cee379355 and rolled the demo core deployment.
  • Injected a synthetic fatal log row via Proton (Phase3b log-digest test) to exercise the digest path.
  • Queried /api/logs/critical and /api/logs/critical/counters with an admin JWT; the API served the new entry directly from the in-memory digest, confirming the stream keeps up without relapsing to Proton.
  • Noted the one-time bootstrap timeout (expected with the async snapshot), but the streaming consumer now stays connected with no further EMIT CHANGES syntax errors.

Follow-up: none for Phase 3b tailer; next we can look at trimming that bootstrap timeout if it shows up in SLOs.


Imported GitHub comment.

Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1924#issuecomment-3492559591
Original created: 2025-11-05T17:46:51Z


Follow-up cleanup from the streaming rollout:

  • Fixed the COUNT(*) scan type in the service registry (uint64 instead of int), rebuilt/pushed ghcr.io/carverauto/serviceradar-core@sha256:8170567691819242005bddd711f6c7635ed49b2f02ce66704ead70b8d210f278, and rolled the demo core deployment.
  • The poller heartbeat warnings (converting UInt64 to *int is unsupported) are gone; /api/logs/critical still returns the latest fatal log from the digest stream.

With the log digest tailer feeding cleanly and the poller cache check fixed, Phase 3b is fully green. Next up is only ongoing monitoring.


Imported GitHub comment.

Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1924#issuecomment-3492579691
Original created: 2025-11-05T17:51:45Z


Proton connection pressure is cleared up:

  • Raised the Proton client pool ceilings (max open connections to 60, max idle to 30, streaming helper bumped to 10) via core image sha256:85a9f7f4860f99b1ce0bd182a44880af4505c712f28a63e8c89eb1a60363c78a and rolled serviceradar-core in demo.
  • The prior proton: acquire conn timeout errors during edge onboarding / poller cache refresh are no longer appearing after the redeploy; log tailer and registry operations now run without starving the pool.

Remaining noisy log is the legacy poller DELETE syntax (tracked separately). Otherwise the new connection ceiling keeps the registry + onboarding flows happy.


Imported GitHub comment.

Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1924#issuecomment-3493548590
Original created: 2025-11-05T21:24:17Z


Updates from today:

  • Added registry/Proton cross-checks on hydration and on every stats refresh. If the in-memory registry ever diverges from table(unified_devices) we now log both counts plus a sample of missing device_ids (pkg/registry/hydrate.go, pkg/registry/diagnostics.go, pkg/core/stats_aggregator.go). No mismatches yet; hydration is reporting 50,007 devices while Proton currently reports 50,009.
  • Fixed poller delete syntax (ALTER STREAM instead of ALTER TABLE) so the new diagnostics would not spam with Proton errors.
  • The analytics UI now defaults back to /api/stats for its top-line device counts and only falls back to SRQL if the cache is empty. The tile is still bouncing between ~49.5k and 50k because Kong is rejecting internal SRQL calls with 401 (“Unauthorized”), so the fallback path only succeeds intermittently. That explains the eventual consistency we were seeing earlier.
  • Confirmed SRQL queries succeed when run directly through the Proton SQL endpoint, so the outstanding issue is between serviceradar-web → serviceradar-kong rather than the registry cache itself.

Next steps:

  1. Debug why /api/query via serviceradar-kong:8000 is unauthorised and either fix the auth headers or point the internal client straight at the OCaml SRQL service.
  2. Once SRQL is reliable again, remove the temporary fallback and rely solely on the cached stats (while keeping the new diagnostics in place to catch regressions).

Imported GitHub comment.

Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1924#issuecomment-3493900823
Original created: 2025-11-05T22:39:57Z


Observed another skew in the analytics "Total Devices" tile after today's core deploy. The value climbed to ~72k even though Proton and the registry both still report ~50k devices.

What we have already done:

  • Filtered out ServiceRadar component IDs (poller/agent/checker) inside pkg/core/stats_aggregator.go and rolled the new core image across the demo namespace.
  • Confirmed /api/stats is live and the analytics dashboard queries it first, only falling back to SRQL when the cache turns up empty or zero.

Current working theories:

  1. The frontend is still hitting the SRQL fallback path because the cached snapshot occasionally comes back as 0, and the SRQL query (in:devices time:last_7d stats:"count() as total") over-counts versioned rows.
  2. The registry snapshot may still contain duplicate aliases we are missing, so the aggregator is counting more than the canonical Proton total (need to compare registry.SnapshotRecords() length vs. Proton again).
  3. The stats cache might be racing with hydration—during startup it can return the zero-value snapshot, forcing the fallback path and sticking the inflated SRQL number in the React state.

Next actions before another roll-out:

  • Capture /api/stats responses alongside the fallback SRQL payload when the UI shows the inflated number (e.g. log both in the browser console or add telemetry in dataService.fetchAllAnalyticsData).
  • Instrument the stats aggregator to log the registry snapshot length every refresh and surface whether the cache returned zero (so we know if hypothesis #3 is real).
  • If the fallback is the culprit, either hard-disable it now that /api/stats is GA, or change the SRQL to respect _merged_into / _deleted so the count matches Proton.

I updated newarch_plan.md to capture these investigations so we do not repeat the same fixes.
