1768 bugotel inconsistent otel log counts on dashboards #2321

Merged
mfreeman451 merged 5 commits from refs/pull/2321/head into main 2025-10-15 02:32:59 +00:00
mfreeman451 commented 2025-10-15 02:00:43 +00:00 (Migrated from github.com)
Owner

Imported from GitHub pull request.

Original GitHub pull request: #1770
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/pull/1770
Original created: 2025-10-15T02:00:43Z
Original updated: 2025-10-15T02:33:02Z
Original head: carverauto/serviceradar:1768-bugotel-inconsistent-otel-log-counts-on-dashboards
Original base: main
Original merged: 2025-10-15T02:32:59Z by @mfreeman451

PR Type

Bug fix, Enhancement


Description

  • Disabled OpenTelemetry tracing for KV service calls to reduce telemetry noise

  • Fixed trace count queries to use otel_trace_summaries instead of otel_traces

  • Added identity metadata caching to reduce redundant KV writes

  • Changed default poll intervals from 30s to 5m across services


Diagram Walkthrough

flowchart LR
  A["Disable KV telemetry"] --> B["Update trace queries"]
  B --> C["Add identity cache"]
  C --> D["Increase poll intervals"]
  D --> E["Reduce OTEL noise"]

File Walkthrough

Relevant files
Enhancement
16 files
main.go
Add telemetry filter to exclude KV service RPCs                   
+5/-0     
main.go
Disable telemetry for KV server                                                   
+1/-0     
server.go
Disable telemetry for KV client connections                           
+4/-3     
kv_client.go
Disable telemetry in KV client factory                                     
+1/-0     
server.go
Disable telemetry for API server KV clients                           
+3/-3     
client.go
Add DisableTelemetry option to client config                         
+9/-3     
server.go
Add telemetry filter and disable options                                 
+50/-19 
identitymap.go
Add HashIdentityMetadata for stable identity fingerprints
+50/-0   
server.go
Wire telemetry options through server lifecycle                   
+10/-0   
identity_publisher.go
Add identity cache to reduce redundant KV writes                 
+132/-2 
sync.go
Disable telemetry for sync KV client                                         
+4/-3     
useTraceCounts.ts
Add shared hook for trace count aggregates                             
+122/-0 
lib.rs
Skip metrics export for fast spans                                             
+23/-5   
reset-proton.sh
Add script to reset Proton PVC                                                     
+64/-0   
ObservabilityWidget.tsx
Use useTraceCounts hook for trace statistics                         
+44/-27 
TracesDashboard.tsx
Integrate useTraceCounts hook in traces dashboard               
+47/-38 
Bug fix
5 files
backfill.go
Use HashIdentityMetadata for identity record hashing         
+1/-1     
identity_lookup.go
Build full DeviceUpdate for identity hashing                         
+20/-1   
analyticsService.ts
Fix trace queries to use otel_trace_summaries table           
+17/-11 
serviceradar-core.yaml
Preserve JWT keys across core restarts                                     
+29/-2   
AnalyticsContext.tsx
Rename metrics fields to trace fields                                       
+2/-2     
Tests
4 files
backfill_test.go
Update test to use HashIdentityMetadata function                 
+11/-1   
identitymap_test.go
Add tests for HashIdentityMetadata behavior                           
+45/-0   
config_test.go
Update test expectations for new poll interval                     
+1/-1     
identity_publisher_test.go
Update tests for HashIdentityMetadata usage                           
+7/-4     
Configuration changes
4 files
config.go
Change default poll interval to 5 minutes                               
+17/-17 
config.go
Change default poll interval to 5 minutes                               
+12/-12 
poller.docker.json
Update poll intervals to 5 minutes                                             
+2/-2     
configmap.yaml
Update poll intervals in Kubernetes config                             
+2/-2     
Dependencies
2 files
BUILD.bazel
Add grpc stats dependency                                                               
+1/-0     
BUILD.bazel
Add grpc stats dependency                                                               
+1/-0     
Documentation
3 files
agents.md
Document Proton reset procedure                                                   
+25/-0   
rperf-monitoring.md
Update poll interval in documentation                                       
+2/-2     
tls-security.md
Update poll interval in documentation                                       
+2/-2     

Imported from GitHub pull request. Original GitHub pull request: #1770 Original author: @mfreeman451 Original URL: https://github.com/carverauto/serviceradar/pull/1770 Original created: 2025-10-15T02:00:43Z Original updated: 2025-10-15T02:33:02Z Original head: carverauto/serviceradar:1768-bugotel-inconsistent-otel-log-counts-on-dashboards Original base: main Original merged: 2025-10-15T02:32:59Z by @mfreeman451 --- ### **PR Type** Bug fix, Enhancement ___ ### **Description** - Disabled OpenTelemetry tracing for KV service calls to reduce telemetry noise - Fixed trace count queries to use `otel_trace_summaries` instead of `otel_traces` - Added identity metadata caching to reduce redundant KV writes - Changed default poll intervals from 30s to 5m across services ___ ### Diagram Walkthrough ```mermaid flowchart LR A["Disable KV telemetry"] --> B["Update trace queries"] B --> C["Add identity cache"] C --> D["Increase poll intervals"] D --> E["Reduce OTEL noise"] ``` <details> <summary><h3> File Walkthrough</h3></summary> <table><thead><tr><th></th><th align="left">Relevant files</th></tr></thead><tbody><tr><td><strong>Enhancement</strong></td><td><details><summary>16 files</summary><table> <tr> <td><strong>main.go</strong><dd><code>Add telemetry filter to exclude KV service RPCs</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/1770/files#diff-4ab3fd1d4debc53dd2499d94a0f60c648fdae4235dd1e3678095a975f5bb434a">+5/-0</a>&nbsp; &nbsp; &nbsp; </td> </tr> <tr> <td><strong>main.go</strong><dd><code>Disable telemetry for KV server</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/1770/files#diff-a1a2df15ecc88de7f8a4832cc2d06fa0ec6c6d32112c8c718d420f91c71d2e8c">+1/-0</a>&nbsp; &nbsp; &nbsp; </td> </tr> <tr> <td><strong>server.go</strong><dd><code>Disable telemetry for KV client connections</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/1770/files#diff-3c51e5356a25859a3300ab62ed2494462feb2567ef69e6db3aa2bbc1032c1c5d">+4/-3</a>&nbsp; &nbsp; &nbsp; </td> </tr> <tr> <td><strong>kv_client.go</strong><dd><code>Disable telemetry in KV client factory</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/1770/files#diff-0c911990aa32d43404f0f23dadc0363ec97ed4ca315592e6c9bd24510e82e06d">+1/-0</a>&nbsp; &nbsp; &nbsp; </td> </tr> <tr> <td><strong>server.go</strong><dd><code>Disable telemetry for API server KV clients</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/1770/files#diff-1bb99367fdd853c728b7cfbf5893a293f6d217144dfb5282cb8dd32e5261021e">+3/-3</a>&nbsp; &nbsp; &nbsp; </td> </tr> <tr> <td><strong>client.go</strong><dd><code>Add DisableTelemetry option to client config</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/1770/files#diff-fd9886430687bce681e21ea8ea209148b7a46bfc29023f2af6cd9509a28bb208">+9/-3</a>&nbsp; &nbsp; &nbsp; </td> </tr> <tr> <td><strong>server.go</strong><dd><code>Add telemetry filter and disable options</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/1770/files#diff-af62cb3a3a7c777bd1b8feed18657c3ecae951d2dbdea55527f08c8c67967c71">+50/-19</a>&nbsp; </td> </tr> <tr> <td><strong>identitymap.go</strong><dd><code>Add HashIdentityMetadata for stable identity fingerprints</code></dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/1770/files#diff-a65f0769de3a5f025da1654d9545c7d3c43da843b3c608d96958893766f8ab8e">+50/-0</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>server.go</strong><dd><code>Wire telemetry options through server lifecycle</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/1770/files#diff-77964f1432d665b19a8f5c81976f787a3c745da1ee2cfffdd0aedc7ee3956df1">+10/-0</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>identity_publisher.go</strong><dd><code>Add identity cache to reduce redundant KV writes</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/1770/files#diff-e851df98fc8e7a63419a9e8ea29bebf3c143028ceadb549b35e9c1eb130fd1a1">+132/-2</a>&nbsp; </td> </tr> <tr> <td><strong>sync.go</strong><dd><code>Disable telemetry for sync KV client</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/1770/files#diff-2e7c810166140fd7a58d6bdb610fef7e6de8f5f9ad2002659392878ecafd6bef">+4/-3</a>&nbsp; &nbsp; &nbsp; </td> </tr> <tr> <td><strong>useTraceCounts.ts</strong><dd><code>Add shared hook for trace count aggregates</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/1770/files#diff-61c65c2d6cd0d743a6575ad656f2a5bc352c4415e5447f804be0ebb1e9195a34">+122/-0</a>&nbsp; </td> </tr> <tr> <td><strong>lib.rs</strong><dd><code>Skip metrics export for fast spans</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/1770/files#diff-8295e6c8105718065b7c8b401cee9efc20e2eb5010b2e1f8d91afe670498e2f7">+23/-5</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>reset-proton.sh</strong><dd><code>Add script to reset Proton PVC</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/1770/files#diff-79e29e485b3489c3a44e185dfc5525abc2ddac5205cb02d5dcb056279cd3fb7f">+64/-0</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>ObservabilityWidget.tsx</strong><dd><code>Use useTraceCounts hook for trace statistics</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/1770/files#diff-469ba40fa407e64ef80f447276989d9bb7891322da5dcb57549ea99a2d5a01d6">+44/-27</a>&nbsp; </td> </tr> <tr> <td><strong>TracesDashboard.tsx</strong><dd><code>Integrate useTraceCounts hook in traces dashboard</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/1770/files#diff-2ef490885a0b6e53d3fa243395f8209914ac8d955dca43333cb1f59c6eeb3d48">+47/-38</a>&nbsp; </td> </tr> </table></details></td></tr><tr><td><strong>Bug fix</strong></td><td><details><summary>5 files</summary><table> <tr> <td><strong>backfill.go</strong><dd><code>Use HashIdentityMetadata for identity record hashing</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/1770/files#diff-d7d2f4aea42bb0d5c5c3439c6d4c27f27ce935961e8a5479203c67e958a10345">+1/-1</a>&nbsp; &nbsp; &nbsp; </td> </tr> <tr> <td><strong>identity_lookup.go</strong><dd><code>Build full DeviceUpdate for identity hashing</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/1770/files#diff-419c60961080aa9e36a729e3a0f5b489b4dc6af0871d23c57d2817e1c0795741">+20/-1</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>analyticsService.ts</strong><dd><code>Fix trace queries to use otel_trace_summaries table</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/1770/files#diff-1140b7ee45921b9100c38a5330d66377a73258a776733b422d4689dc4ce282c5">+17/-11</a>&nbsp; </td> </tr> <tr> <td><strong>serviceradar-core.yaml</strong><dd><code>Preserve JWT keys across core restarts</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/1770/files#diff-2f484d8fe3bae65aace437568f6dd660c92f57b452f7bd1608083a8fe3716ba3">+29/-2</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>AnalyticsContext.tsx</strong><dd><code>Rename metrics fields to trace fields</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/1770/files#diff-8bd8a5899b5763f65b9f01fe71cfb7e19526a6f463cbd724f279328a04853517">+2/-2</a>&nbsp; &nbsp; &nbsp; </td> </tr> </table></details></td></tr><tr><td><strong>Tests</strong></td><td><details><summary>4 files</summary><table> <tr> <td><strong>backfill_test.go</strong><dd><code>Update test to use HashIdentityMetadata function</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/1770/files#diff-62cfe426bf9c2ee7e6722969c78e74cf928f86a564d66b9856aea4c0232ddd06">+11/-1</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>identitymap_test.go</strong><dd><code>Add tests for HashIdentityMetadata behavior</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/1770/files#diff-7838a8a0f701320d27842704cc844b4e28592220a5622f7a2880e1d535a3b916">+45/-0</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>config_test.go</strong><dd><code>Update test expectations for new poll interval</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/1770/files#diff-05ef22b193b3acfecc895fe1f364e852dcb2f3b08c606bc06c968cc88f0271e7">+1/-1</a>&nbsp; &nbsp; &nbsp; </td> </tr> <tr> <td><strong>identity_publisher_test.go</strong><dd><code>Update tests for HashIdentityMetadata usage</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/1770/files#diff-a3b353a63f2cc222e61c29b0da162ad11c17f71be6c64b6222b2c1769a3c8f28">+7/-4</a>&nbsp; &nbsp; &nbsp; </td> </tr> </table></details></td></tr><tr><td><strong>Configuration changes</strong></td><td><details><summary>4 files</summary><table> <tr> <td><strong>config.go</strong><dd><code>Change default poll interval to 5 minutes</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/1770/files#diff-626ab53e2c4405a09e43c5ce9fb841a3cc83fec9dfa506366a4a8c2ecd490135">+17/-17</a>&nbsp; </td> </tr> <tr> <td><strong>config.go</strong><dd><code>Change default poll interval to 5 minutes</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/1770/files#diff-e4cfa17894bdfb56238b434eb208069edd7b48686712873129518c6a5d25dc70">+12/-12</a>&nbsp; </td> </tr> <tr> <td><strong>poller.docker.json</strong><dd><code>Update poll intervals to 5 minutes</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/1770/files#diff-d64ebb69ec31e831efd187c47a5bfff2573960306b177f6464e91cb44a3c709d">+2/-2</a>&nbsp; &nbsp; &nbsp; </td> </tr> <tr> <td><strong>configmap.yaml</strong><dd><code>Update poll intervals in Kubernetes config</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/1770/files#diff-f4548beaa0a3a01a46971c82c5647a0f3f49eb38d66dd939d06d19018173fcd6">+2/-2</a>&nbsp; &nbsp; &nbsp; </td> </tr> </table></details></td></tr><tr><td><strong>Dependencies</strong></td><td><details><summary>2 files</summary><table> <tr> <td><strong>BUILD.bazel</strong><dd><code>Add grpc stats dependency</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/1770/files#diff-cf437f055db002c5a1dba0ac4a4949d0ecc4b095e8e760bf5aec0f5d4c08f572">+1/-0</a>&nbsp; &nbsp; &nbsp; </td> </tr> <tr> <td><strong>BUILD.bazel</strong><dd><code>Add grpc stats dependency</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/1770/files#diff-8646f2266fed1c91edf396d18f3bcf8f2194a9d2ccc7597d8391c60ac834cbc1">+1/-0</a>&nbsp; &nbsp; &nbsp; </td> </tr> </table></details></td></tr><tr><td><strong>Documentation</strong></td><td><details><summary>3 files</summary><table> <tr> <td><strong>agents.md</strong><dd><code>Document Proton reset procedure</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/1770/files#diff-af8d04277f2353629065b0cc5fad3e44bd3e7c20339bd125e0812104bdbeff28">+25/-0</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>rperf-monitoring.md</strong><dd><code>Update poll interval in documentation</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/1770/files#diff-70d068dec062c5ed771e7294401c52d55e85e279e9eee4f1084ca06c23b2fde8">+2/-2</a>&nbsp; &nbsp; &nbsp; </td> </tr> <tr> <td><strong>tls-security.md</strong><dd><code>Update poll interval in documentation</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/1770/files#diff-fe75bcced80b2fb5cf39bd06e6edda7de714827915abedfaf41a54802f561d22">+2/-2</a>&nbsp; &nbsp; &nbsp; </td> </tr> </table></details></td></tr></tr></tbody></table> </details> ___
qodo-code-review[bot] commented 2025-10-15 02:01:44 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/1770#issuecomment-3404242282
Original created: 2025-10-15T02:01:44Z

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
Destructive script execution

Description: The Kubernetes reset script can delete and recreate a PersistentVolumeClaim and restart
deployments, which if run against the wrong namespace could cause destructive data loss;
ensure restricted execution and environment safeguards.
reset-proton.sh [1-64]

Referred Code
#!/usr/bin/env bash

# Reset the Proton Timeplus datastore by rotating its PVC and restarting core.
# Usage:
#   scripts/reset-proton.sh [namespace]
# Environment overrides:
#   KUBECTL        Path to kubectl (default: kubectl)
#   PVC_NAME       Proton PVC name (default: serviceradar-proton-data)
#   DEPLOY_NAME    Proton deployment (default: serviceradar-proton)
#   CORE_DEPLOY    Core deployment (default: serviceradar-core)
#   STORAGE_CLASS  StorageClass for the replacement PVC (default: local-path)
#   PVC_SIZE       Requested storage size (default: 512Gi)

set -euo pipefail

ns="${1:-demo}"
kubectl_bin="${KUBECTL:-kubectl}"
pvc_name="${PVC_NAME:-serviceradar-proton-data}"
deploy_name="${DEPLOY_NAME:-serviceradar-proton}"
core_deploy="${CORE_DEPLOY:-serviceradar-core}"
storage_class="${STORAGE_CLASS:-local-path}"


 ... (clipped 43 lines)
Telemetry suppression risk

Description: TelemetryFilter controls which RPCs emit telemetry and uses a positive boolean to include
traces; misconfiguration could unintentionally suppress security-relevant RPC traces
(e.g., health or admin), reducing observability for incident response.
server.go [105-122]

Referred Code
		Timeout:               20 * time.Second,
	}),
	grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
		MinTime:             120 * time.Second,
		PermitWithoutStream: true,
	}),
}

if !s.telemetryDisabled {
	handlerOpts := []otelgrpc.Option{}
	if s.telemetryFilter != nil {
		handlerOpts = append(handlerOpts, otelgrpc.WithFilter(func(info *grpcstats.RPCTagInfo) bool {
			return s.telemetryFilter(info)
		}))
	}

	defaultOpts = append([]grpc.ServerOption{grpc.StatsHandler(otelgrpc.NewServerHandler(handlerOpts...))}, defaultOpts...)
}
JWT key handling risk

Description: The init container preserves and rewrites JWT private keys in a shared file path; if file
permissions, secrets handling, or logs are misconfigured in the cluster, private keys
could be exposed—verify volumes and file ACLs are restricted.
serviceradar-core.yaml [32-58]

Referred Code
set -e
TEMPLATE=/etc/serviceradar/core.json
TARGET=/var/lib/serviceradar/core.json
mkdir -p /var/lib/serviceradar

PRESERVE_PRIV=""
PRESERVE_KID=""
if [ -f "$TARGET" ]; then
  PRESERVE_PRIV=$(jq -r '.auth.jwt_private_key_pem // empty' "$TARGET")
  PRESERVE_KID=$(jq -r '.auth.jwt_key_id // empty' "$TARGET")
fi

cp "$TEMPLATE" "${TARGET}.tmp"

if [ -n "$PRESERVE_PRIV" ]; then
  jq --arg priv "$PRESERVE_PRIV" --arg kid "$PRESERVE_KID" '
    .auth = (.auth // {}) |
    .auth.jwt_algorithm = "RS256" |
    .auth.jwt_private_key_pem = $priv |
    (if $kid != "" then .auth.jwt_key_id = $kid else . end)
  ' "${TARGET}.tmp" > "$TARGET"


 ... (clipped 6 lines)
Ticket Compliance
🟡
🎫 #1768
🟢 Fix inconsistent OpenTelemetry log/trace counts between analytics observability widget and
observability dashboard.
Reduce OTEL telemetry noise that may inflate counts.
Ensure dashboards use consistent sources/queries for trace counts.
Provide any necessary documentation or tooling to help validate/reset telemetry data if
needed.
Validate in staging/production that analytics widget and observability dashboard now show
aligned trace totals over the same time window.
Confirm that disabling/filtering telemetry does not hide required traces for
troubleshooting.
Verify performance impact and UX of increased poll intervals.
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
No custom compliance provided

Follow the guide to enable custom compliance check.

  • Update
Compliance status legend 🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
- Requires Further Human Verification
🏷️ - Compliance label
Imported GitHub PR comment. Original author: @qodo-code-review[bot] Original URL: https://github.com/carverauto/serviceradar/pull/1770#issuecomment-3404242282 Original created: 2025-10-15T02:01:44Z --- ## PR Compliance Guide 🔍 <!-- https://github.com/carverauto/serviceradar/commit/0fc183b4c29124cf276cb71007778bcefbb51a37 --> Below is a summary of compliance checks for this PR:<br> <table><tbody><tr><td colspan='2'><strong>Security Compliance</strong></td></tr> <tr><td rowspan=3>⚪</td> <td><details><summary><strong>Destructive script execution </strong></summary><br> <b>Description:</b> The Kubernetes reset script can delete and recreate a PersistentVolumeClaim and restart <br>deployments, which if run against the wrong namespace could cause destructive data loss; <br>ensure restricted execution and environment safeguards.<br> <strong><a href='https://github.com/carverauto/serviceradar/pull/1770/files#diff-79e29e485b3489c3a44e185dfc5525abc2ddac5205cb02d5dcb056279cd3fb7fR1-R64'>reset-proton.sh [1-64]</a></strong><br> <details open><summary>Referred Code</summary> ```shell #!/usr/bin/env bash # Reset the Proton Timeplus datastore by rotating its PVC and restarting core. # Usage: # scripts/reset-proton.sh [namespace] # Environment overrides: # KUBECTL Path to kubectl (default: kubectl) # PVC_NAME Proton PVC name (default: serviceradar-proton-data) # DEPLOY_NAME Proton deployment (default: serviceradar-proton) # CORE_DEPLOY Core deployment (default: serviceradar-core) # STORAGE_CLASS StorageClass for the replacement PVC (default: local-path) # PVC_SIZE Requested storage size (default: 512Gi) set -euo pipefail ns="${1:-demo}" kubectl_bin="${KUBECTL:-kubectl}" pvc_name="${PVC_NAME:-serviceradar-proton-data}" deploy_name="${DEPLOY_NAME:-serviceradar-proton}" core_deploy="${CORE_DEPLOY:-serviceradar-core}" storage_class="${STORAGE_CLASS:-local-path}" ... (clipped 43 lines) ``` </details></details></td></tr> <tr><td><details><summary><strong>Telemetry suppression risk </strong></summary><br> <b>Description:</b> TelemetryFilter controls which RPCs emit telemetry and uses a positive boolean to include <br>traces; misconfiguration could unintentionally suppress security-relevant RPC traces <br>(e.g., health or admin), reducing observability for incident response.<br> <strong><a href='https://github.com/carverauto/serviceradar/pull/1770/files#diff-af62cb3a3a7c777bd1b8feed18657c3ecae951d2dbdea55527f08c8c67967c71R105-R122'>server.go [105-122]</a></strong><br> <details open><summary>Referred Code</summary> ```go Timeout: 20 * time.Second, }), grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{ MinTime: 120 * time.Second, PermitWithoutStream: true, }), } if !s.telemetryDisabled { handlerOpts := []otelgrpc.Option{} if s.telemetryFilter != nil { handlerOpts = append(handlerOpts, otelgrpc.WithFilter(func(info *grpcstats.RPCTagInfo) bool { return s.telemetryFilter(info) })) } defaultOpts = append([]grpc.ServerOption{grpc.StatsHandler(otelgrpc.NewServerHandler(handlerOpts...))}, defaultOpts...) } ``` </details></details></td></tr> <tr><td><details><summary><strong>JWT key handling risk</strong></summary><br> <b>Description:</b> The init container preserves and rewrites JWT private keys in a shared file path; if file <br>permissions, secrets handling, or logs are misconfigured in the cluster, private keys <br>could be exposed—verify volumes and file ACLs are restricted.<br> <strong><a href='https://github.com/carverauto/serviceradar/pull/1770/files#diff-2f484d8fe3bae65aace437568f6dd660c92f57b452f7bd1608083a8fe3716ba3R32-R58'>serviceradar-core.yaml [32-58]</a></strong><br> <details open><summary>Referred Code</summary> ```yaml set -e TEMPLATE=/etc/serviceradar/core.json TARGET=/var/lib/serviceradar/core.json mkdir -p /var/lib/serviceradar PRESERVE_PRIV="" PRESERVE_KID="" if [ -f "$TARGET" ]; then PRESERVE_PRIV=$(jq -r '.auth.jwt_private_key_pem // empty' "$TARGET") PRESERVE_KID=$(jq -r '.auth.jwt_key_id // empty' "$TARGET") fi cp "$TEMPLATE" "${TARGET}.tmp" if [ -n "$PRESERVE_PRIV" ]; then jq --arg priv "$PRESERVE_PRIV" --arg kid "$PRESERVE_KID" ' .auth = (.auth // {}) | .auth.jwt_algorithm = "RS256" | .auth.jwt_private_key_pem = $priv | (if $kid != "" then .auth.jwt_key_id = $kid else . end) ' "${TARGET}.tmp" > "$TARGET" ... (clipped 6 lines) ``` </details></details></td></tr> <tr><td colspan='2'><strong>Ticket Compliance</strong></td></tr> <tr><td>🟡</td> <td> <details> <summary>🎫 <a href=https://github.com/carverauto/serviceradar/issues/1768>#1768</a></summary> <table width='100%'><tbody> <tr><td rowspan=4>🟢</td> <td>Fix inconsistent OpenTelemetry log/trace counts between analytics observability widget and <br>observability dashboard.</td></tr> <tr><td>Reduce OTEL telemetry noise that may inflate counts.</td></tr> <tr><td>Ensure dashboards use consistent sources/queries for trace counts.</td></tr> <tr><td>Provide any necessary documentation or tooling to help validate/reset telemetry data if <br>needed.</td></tr> <tr><td rowspan=3>⚪</td> <td>Validate in staging/production that analytics widget and observability dashboard now show <br>aligned trace totals over the same time window.</td></tr> <tr><td>Confirm that disabling/filtering telemetry does not hide required traces for <br>troubleshooting.</td></tr> <tr><td>Verify performance impact and UX of increased poll intervals.</td></tr> </tbody></table> </details> </td></tr> <tr><td colspan='2'><strong>Codebase Duplication Compliance</strong></td></tr> <tr><td>⚪</td><td><details><summary><strong>Codebase context is not defined </strong></summary> Follow the <a href='https://qodo-merge-docs.qodo.ai/core-abilities/rag_context_enrichment/'>guide</a> to enable codebase context checks. </details></td></tr> <tr><td colspan='2'><strong>Custom Compliance</strong></td></tr> <tr><td>⚪</td><td><details><summary><strong>No custom compliance provided</strong></summary> Follow the <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/'>guide</a> to enable custom compliance check. </details></td></tr> <tr><td align="center" colspan="2"> - [ ] Update <!-- /compliance --update_compliance=true --> </td></tr></tbody></table> <details><summary>Compliance status legend</summary> 🟢 - Fully Compliant<br> 🟡 - Partial Compliant<br> 🔴 - Not Compliant<br> ⚪ - Requires Further Human Verification<br> 🏷️ - Compliance label<br> </details>
mfreeman451 commented 2025-10-15 02:02:06 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR review comment.

Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/pull/1770#discussion_r2430921435
Original created: 2025-10-15T02:02:06Z
Original path: docker/compose/poller.docker.json
Original line: 101

should be 5min

Imported GitHub PR review comment. Original author: @mfreeman451 Original URL: https://github.com/carverauto/serviceradar/pull/1770#discussion_r2430921435 Original created: 2025-10-15T02:02:06Z Original path: docker/compose/poller.docker.json Original line: 101 --- should be 5min
mfreeman451 commented 2025-10-15 02:02:57 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR review comment.

Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/pull/1770#discussion_r2430922234
Original created: 2025-10-15T02:02:57Z
Original path: docs/docs/tls-security.md
Original line: 637

should be 30s

Imported GitHub PR review comment. Original author: @mfreeman451 Original URL: https://github.com/carverauto/serviceradar/pull/1770#discussion_r2430922234 Original created: 2025-10-15T02:02:57Z Original path: docs/docs/tls-security.md Original line: 637 --- should be 30s
qodo-code-review[bot] commented 2025-10-15 02:03:36 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/1770#issuecomment-3404244802
Original created: 2025-10-15T02:03:36Z

PR Code Suggestions

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Correctly set cache revision after update
Suggestion Impact:The commit changed the code to initialize newRevision separately and set it from updateResp only, removing the fallback to resp.GetRevision(), thus preventing stale revisions from being cached.

code diff:

@@ -318,8 +318,8 @@
 		p.metrics.recordPublish(1)
 		identitymap.RecordKVPublish(ctx, 1, "updated")
 		p.logger.Debug().Str("key", key).Msg("Updated canonical identity entry in KV")
-		newRevision := resp.GetRevision()
-		if updateResp != nil && updateResp.GetRevision() != 0 {
+		var newRevision uint64
+		if updateResp != nil {
 			newRevision = updateResp.GetRevision()
 		}
 		p.cache.set(key, metadataHash, attrsHash, newRevision)

Correct the logic for determining the new revision after a successful KV Update.
The new revision should be derived only from the updateResp, not from the old
resp from the Get call, to avoid caching a stale revision.

pkg/registry/identity_publisher.go [299-326]

 		updateResp, err := p.kvClient.Update(ctx, &proto.UpdateRequest{
 			Key:        key,
 			Value:      payload,
 			Revision:   resp.GetRevision(),
 			TtlSeconds: p.ttlSeconds,
 		})
 		if err != nil {
 			if shouldRetryKV(err) {
 				code := status.Code(err)
 				if code == codes.AlreadyExists || code == codes.Aborted {
 					identitymap.RecordKVConflict(ctx, code.String())
 					p.logger.Debug().Str("key", key).Str("reason", code.String()).Msg("KV identity update encountered conflict")
 				}
 				p.cache.delete(key)
 				return struct{}{}, err
 			}
 			return struct{}{}, backoff.Permanent(err)
 		}
 
 		p.metrics.recordPublish(1)
 		identitymap.RecordKVPublish(ctx, 1, "updated")
 		p.logger.Debug().Str("key", key).Msg("Updated canonical identity entry in KV")
-		newRevision := resp.GetRevision()
-		if updateResp != nil && updateResp.GetRevision() != 0 {
+		var newRevision uint64
+		if updateResp != nil {
 			newRevision = updateResp.GetRevision()
 		}
 		p.cache.set(key, metadataHash, attrsHash, newRevision)
 		return struct{}{}, nil

[Suggestion processed]

Suggestion importance[1-10]: 9

__

Why: The suggestion correctly identifies a critical bug where the cache could be populated with a stale revision number after a successful update, leading to guaranteed conflicts on subsequent updates for the same key. This fix is crucial for the correctness and performance of the identity publishing mechanism.

High
Include all_ips field in identity hash

Include the all_ips field in the HashIdentityMetadata function. This ensures
that changes to a device's secondary IP addresses are reflected in the identity
hash, preventing stale identity records.

pkg/identitymap/identitymap.go [104-152]

 // HashIdentityMetadata produces a stable fingerprint of identity-relevant fields for a device update.
 func HashIdentityMetadata(update *models.DeviceUpdate) string {
 	if update == nil {
 		return ""
 	}
 
 	meta := make(map[string]string)
 
 	if id := strings.TrimSpace(update.DeviceID); id != "" {
 		meta["device_id"] = id
 	}
 	if partition := strings.TrimSpace(update.Partition); partition != "" {
 		meta["partition"] = partition
 	}
 	if ip := strings.TrimSpace(update.IP); ip != "" {
 		meta["ip"] = ip
 	}
 	if update.Hostname != nil {
 		if hostname := strings.TrimSpace(*update.Hostname); hostname != "" {
 			meta["hostname"] = hostname
 		}
 	}
 	if update.MAC != nil {
 		if mac := strings.TrimSpace(*update.MAC); mac != "" {
 			meta["mac"] = strings.ToUpper(mac)
 		}
 	}
 	if src := strings.TrimSpace(string(update.Source)); src != "" {
 		meta["source"] = src
 	}
 
 	if update.Metadata != nil {
 		appendMetadata := func(key string) {
 			if val := strings.TrimSpace(update.Metadata[key]); val != "" {
 				meta[key] = val
 			}
 		}
 		appendMetadata("armis_device_id")
 		appendMetadata("integration_id")
 		appendMetadata("integration_type")
 		appendMetadata("netbox_device_id")
+		appendMetadata("all_ips")
 	}
 
 	if len(meta) == 0 {
 		return ""
 	}
 
 	return HashMetadata(meta)
 }
  • Apply / Chat
Suggestion importance[1-10]: 8

__

Why: The suggestion correctly identifies that the all_ips field, which is critical for device identity reconciliation, was missing from the identity hash calculation. Including it ensures that changes to a device's secondary IPs correctly trigger updates to the canonical record, preventing stale data and improving identity resolution accuracy.

Medium
High-level
Consider a more flexible telemetry sampling strategy

Instead of completely disabling telemetry for KV calls and fast traces,
implement a configurable sampling strategy. This would reduce data volume while
retaining visibility into normal system behavior for better performance analysis
and debugging.

Examples:

cmd/core/main.go [265-267]
		TelemetryFilter: func(info *grpcstats.RPCTagInfo) bool {
			return !strings.HasPrefix(info.FullMethodName, "/proto.KVService/")
		},
cmd/otel/src/lib.rs [279-288]
                    let should_export = is_slow;

                    if !should_export {
                        // Skip publishing metrics for spans that completed within the fast-path threshold
                        debug!(
                            "Skipping perf metric export for fast span '{}' (service: '{}', duration: {:.3}ms)",
                            span.name, service_name, duration_ms
                        );
                        continue;
                    }

Solution Walkthrough:

Before:

// In cmd/core/main.go
// Telemetry is filtered at the source, completely blocking KV service calls.
lifecycle.RunServer(ctx, &lifecycle.ServerOptions{
    // ...
    TelemetryFilter: func(info *grpcstats.RPCTagInfo) bool {
        // Hardcoded filter: return false for all KV service methods.
        return !strings.HasPrefix(info.FullMethodName, "/proto.KVService/")
    },
})

// In cmd/otel/src/lib.rs
// The collector only exports spans that are considered "slow".
fn export(...) {
    for span in spans {
        let is_slow = duration_ms > 100.0;
        let should_export = is_slow;

        if !should_export {
            // Fast spans are always skipped.
            continue;
        }
        // ... export logic
    }
}

After:

// In cmd/core/main.go (conceptual)
// Telemetry filtering is removed in favor of sampling at the collector.
lifecycle.RunServer(ctx, &lifecycle.ServerOptions{
    // ...
    // TelemetryFilter is removed or made configurable.
})

// In cmd/otel/src/lib.rs (conceptual)
// The collector uses a sampling strategy instead of a hardcoded threshold.
fn export(...) {
    for span in spans {
        let is_slow = duration_ms > 100.0;
        // Sample a percentage of fast traces, and all slow/error traces.
        let should_export = is_slow || is_error || rand::random::<f64>() < FAST_TRACE_SAMPLE_RATE;

        if !should_export {
            continue;
        }
        // ... export logic
    }
}

Suggestion importance[1-10]: 8

__

Why: The suggestion correctly identifies that the PR's aggressive telemetry reduction in cmd/otel/src/lib.rs and cmd/core/main.go might be too restrictive, and proposes a more flexible sampling strategy which is a valid and impactful design improvement.

Medium
  • Update
Imported GitHub PR comment. Original author: @qodo-code-review[bot] Original URL: https://github.com/carverauto/serviceradar/pull/1770#issuecomment-3404244802 Original created: 2025-10-15T02:03:36Z --- ## PR Code Suggestions ✨ <!-- 0fc183b --> Explore these optional code suggestions: <table><thead><tr><td><strong>Category</strong></td><td align=left><strong>Suggestion&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </strong></td><td align=center><strong>Impact</strong></td></tr><tbody><tr><td rowspan=2>Possible issue</td> <td> <details><summary>✅ <s>Correctly set cache revision after update</s></summary> ___ <details><summary><b>Suggestion Impact:</b></summary>The commit changed the code to initialize newRevision separately and set it from updateResp only, removing the fallback to resp.GetRevision(), thus preventing stale revisions from being cached. code diff: ```diff @@ -318,8 +318,8 @@ p.metrics.recordPublish(1) identitymap.RecordKVPublish(ctx, 1, "updated") p.logger.Debug().Str("key", key).Msg("Updated canonical identity entry in KV") - newRevision := resp.GetRevision() - if updateResp != nil && updateResp.GetRevision() != 0 { + var newRevision uint64 + if updateResp != nil { newRevision = updateResp.GetRevision() } p.cache.set(key, metadataHash, attrsHash, newRevision) ``` </details> ___ **Correct the logic for determining the new revision after a successful KV <code>Update</code>. <br>The new revision should be derived only from the <code>updateResp</code>, not from the old <br><code>resp</code> from the <code>Get</code> call, to avoid caching a stale revision.** [pkg/registry/identity_publisher.go [299-326]](https://github.com/carverauto/serviceradar/pull/1770/files#diff-e851df98fc8e7a63419a9e8ea29bebf3c143028ceadb549b35e9c1eb130fd1a1R299-R326) ```diff updateResp, err := p.kvClient.Update(ctx, &proto.UpdateRequest{ Key: key, Value: payload, Revision: resp.GetRevision(), TtlSeconds: p.ttlSeconds, }) if err != nil { if shouldRetryKV(err) { code := status.Code(err) if code == codes.AlreadyExists || code == codes.Aborted { identitymap.RecordKVConflict(ctx, code.String()) p.logger.Debug().Str("key", key).Str("reason", code.String()).Msg("KV identity update encountered conflict") } p.cache.delete(key) return struct{}{}, err } return struct{}{}, backoff.Permanent(err) } p.metrics.recordPublish(1) identitymap.RecordKVPublish(ctx, 1, "updated") p.logger.Debug().Str("key", key).Msg("Updated canonical identity entry in KV") - newRevision := resp.GetRevision() - if updateResp != nil && updateResp.GetRevision() != 0 { + var newRevision uint64 + if updateResp != nil { newRevision = updateResp.GetRevision() } p.cache.set(key, metadataHash, attrsHash, newRevision) return struct{}{}, nil ``` `[Suggestion processed]` <details><summary>Suggestion importance[1-10]: 9</summary> __ Why: The suggestion correctly identifies a critical bug where the cache could be populated with a stale revision number after a successful update, leading to guaranteed conflicts on subsequent updates for the same key. This fix is crucial for the correctness and performance of the identity publishing mechanism. </details></details></td><td align=center>High </td></tr><tr><td> <details><summary>Include all_ips field in identity hash</summary> ___ **Include the <code>all_ips</code> field in the <code>HashIdentityMetadata</code> function. This ensures <br>that changes to a device's secondary IP addresses are reflected in the identity <br>hash, preventing stale identity records.** [pkg/identitymap/identitymap.go [104-152]](https://github.com/carverauto/serviceradar/pull/1770/files#diff-a65f0769de3a5f025da1654d9545c7d3c43da843b3c608d96958893766f8ab8eR104-R152) ```diff // HashIdentityMetadata produces a stable fingerprint of identity-relevant fields for a device update. func HashIdentityMetadata(update *models.DeviceUpdate) string { if update == nil { return "" } meta := make(map[string]string) if id := strings.TrimSpace(update.DeviceID); id != "" { meta["device_id"] = id } if partition := strings.TrimSpace(update.Partition); partition != "" { meta["partition"] = partition } if ip := strings.TrimSpace(update.IP); ip != "" { meta["ip"] = ip } if update.Hostname != nil { if hostname := strings.TrimSpace(*update.Hostname); hostname != "" { meta["hostname"] = hostname } } if update.MAC != nil { if mac := strings.TrimSpace(*update.MAC); mac != "" { meta["mac"] = strings.ToUpper(mac) } } if src := strings.TrimSpace(string(update.Source)); src != "" { meta["source"] = src } if update.Metadata != nil { appendMetadata := func(key string) { if val := strings.TrimSpace(update.Metadata[key]); val != "" { meta[key] = val } } appendMetadata("armis_device_id") appendMetadata("integration_id") appendMetadata("integration_type") appendMetadata("netbox_device_id") + appendMetadata("all_ips") } if len(meta) == 0 { return "" } return HashMetadata(meta) } ``` - [ ] **Apply / Chat** <!-- /improve --apply_suggestion=1 --> <details><summary>Suggestion importance[1-10]: 8</summary> __ Why: The suggestion correctly identifies that the `all_ips` field, which is critical for device identity reconciliation, was missing from the identity hash calculation. Including it ensures that changes to a device's secondary IPs correctly trigger updates to the canonical record, preventing stale data and improving identity resolution accuracy. </details></details></td><td align=center>Medium </td></tr><tr><td rowspan=1>High-level</td> <td> <details><summary>Consider a more flexible telemetry sampling strategy</summary> ___ **Instead of completely disabling telemetry for KV calls and fast traces, <br>implement a configurable sampling strategy. This would reduce data volume while <br>retaining visibility into normal system behavior for better performance analysis <br>and debugging.** ### Examples: <details> <summary> <a href="https://github.com/carverauto/serviceradar/pull/1770/files#diff-4ab3fd1d4debc53dd2499d94a0f60c648fdae4235dd1e3678095a975f5bb434aR265-R267">cmd/core/main.go [265-267]</a> </summary> ```go TelemetryFilter: func(info *grpcstats.RPCTagInfo) bool { return !strings.HasPrefix(info.FullMethodName, "/proto.KVService/") }, ``` </details> <details> <summary> <a href="https://github.com/carverauto/serviceradar/pull/1770/files#diff-8295e6c8105718065b7c8b401cee9efc20e2eb5010b2e1f8d91afe670498e2f7R279-R288">cmd/otel/src/lib.rs [279-288]</a> </summary> ```rust let should_export = is_slow; if !should_export { // Skip publishing metrics for spans that completed within the fast-path threshold debug!( "Skipping perf metric export for fast span '{}' (service: '{}', duration: {:.3}ms)", span.name, service_name, duration_ms ); continue; } ``` </details> ### Solution Walkthrough: #### Before: ```rust // In cmd/core/main.go // Telemetry is filtered at the source, completely blocking KV service calls. lifecycle.RunServer(ctx, &lifecycle.ServerOptions{ // ... TelemetryFilter: func(info *grpcstats.RPCTagInfo) bool { // Hardcoded filter: return false for all KV service methods. return !strings.HasPrefix(info.FullMethodName, "/proto.KVService/") }, }) // In cmd/otel/src/lib.rs // The collector only exports spans that are considered "slow". fn export(...) { for span in spans { let is_slow = duration_ms > 100.0; let should_export = is_slow; if !should_export { // Fast spans are always skipped. continue; } // ... export logic } } ``` #### After: ```rust // In cmd/core/main.go (conceptual) // Telemetry filtering is removed in favor of sampling at the collector. lifecycle.RunServer(ctx, &lifecycle.ServerOptions{ // ... // TelemetryFilter is removed or made configurable. }) // In cmd/otel/src/lib.rs (conceptual) // The collector uses a sampling strategy instead of a hardcoded threshold. fn export(...) { for span in spans { let is_slow = duration_ms > 100.0; // Sample a percentage of fast traces, and all slow/error traces. let should_export = is_slow || is_error || rand::random::<f64>() < FAST_TRACE_SAMPLE_RATE; if !should_export { continue; } // ... export logic } } ``` <details><summary>Suggestion importance[1-10]: 8</summary> __ Why: The suggestion correctly identifies that the PR's aggressive telemetry reduction in `cmd/otel/src/lib.rs` and `cmd/core/main.go` might be too restrictive, and proposes a more flexible sampling strategy which is a valid and impactful design improvement. </details></details></td><td align=center>Medium </td></tr> <tr><td align="center" colspan="2"> - [ ] Update <!-- /improve_multi --more_suggestions=true --> </td><td></td></tr></tbody></table>
mfreeman451 commented 2025-10-15 02:07:24 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR review comment.

Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/pull/1770#discussion_r2430926324
Original created: 2025-10-15T02:07:24Z
Original path: pkg/poller/config_test.go
Original line: 190

should be 30s

Imported GitHub PR review comment. Original author: @mfreeman451 Original URL: https://github.com/carverauto/serviceradar/pull/1770#discussion_r2430926324 Original created: 2025-10-15T02:07:24Z Original path: pkg/poller/config_test.go Original line: 190 --- should be 30s
mfreeman451 commented 2025-10-15 02:07:53 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR review comment.

Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/pull/1770#discussion_r2430926787
Original created: 2025-10-15T02:07:53Z
Original path: pkg/registry/identity_publisher.go
Original line: 42

should be configurable

Imported GitHub PR review comment. Original author: @mfreeman451 Original URL: https://github.com/carverauto/serviceradar/pull/1770#discussion_r2430926787 Original created: 2025-10-15T02:07:53Z Original path: pkg/registry/identity_publisher.go Original line: 42 --- should be configurable
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
carverauto/serviceradar!2321
No description provided.