1768 bugotel inconsistent otel log counts on dashboards #2321
Imported from GitHub pull request.
Original GitHub pull request: #1770
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/pull/1770
Original created: 2025-10-15T02:00:43Z
Original updated: 2025-10-15T02:33:02Z
Original head: carverauto/serviceradar:1768-bugotel-inconsistent-otel-log-counts-on-dashboards
Original base: main
Original merged: 2025-10-15T02:32:59Z by @mfreeman451
PR Type
Bug fix, Enhancement
Description
Disabled OpenTelemetry tracing for KV service calls to reduce telemetry noise
Fixed trace count queries to use otel_trace_summaries instead of otel_traces
Added identity metadata caching to reduce redundant KV writes
Changed default poll intervals from 30s to 5m across services
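The identity metadata caching mentioned in the description can be illustrated roughly as follows. This is a minimal sketch, not the code from the PR: `identityCache`, `shouldWrite`, and the key and fingerprint strings are hypothetical names. It assumes each identity key has a stable fingerprint of its metadata and that a KV write can be skipped when that fingerprint has not changed.

```go
package main

import (
	"fmt"
	"sync"
)

// identityCache remembers the last metadata fingerprint written for each
// identity key. All names here are illustrative, not taken from the PR.
type identityCache struct {
	mu   sync.Mutex
	seen map[string]string
}

func newIdentityCache() *identityCache {
	return &identityCache{seen: make(map[string]string)}
}

// shouldWrite reports whether the KV store needs an update for key, and
// records the new fingerprint when it does.
func (c *identityCache) shouldWrite(key, fingerprint string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if prev, ok := c.seen[key]; ok && prev == fingerprint {
		return false // unchanged since the last write; skip the KV call
	}
	c.seen[key] = fingerprint
	return true
}

func main() {
	cache := newIdentityCache()
	fmt.Println(cache.shouldWrite("device/10.0.0.1", "abc")) // first sighting
	fmt.Println(cache.shouldWrite("device/10.0.0.1", "abc")) // same fingerprint
	fmt.Println(cache.shouldWrite("device/10.0.0.1", "def")) // metadata changed
}
```

For brevity the sketch records the fingerprint before any write happens; a real publisher would probably record it only after the KV write succeeds, so that a failed write is retried.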
File Walkthrough
16 files
Add telemetry filter to exclude KV service RPCs
Disable telemetry for KV server
Disable telemetry for KV client connections
Disable telemetry in KV client factory
Disable telemetry for API server KV clients
Add DisableTelemetry option to client config
Add telemetry filter and disable options
Add HashIdentityMetadata for stable identity fingerprints
Wire telemetry options through server lifecycle
Add identity cache to reduce redundant KV writes
Disable telemetry for sync KV client
Add shared hook for trace count aggregates
Skip metrics export for fast spans
Add script to reset Proton PVC
Use useTraceCounts hook for trace statistics
Integrate useTraceCounts hook in traces dashboard
5 files
Use HashIdentityMetadata for identity record hashing
Build full DeviceUpdate for identity hashing
Fix trace queries to use otel_trace_summaries table
Preserve JWT keys across core restarts
Rename metrics fields to trace fields
4 files
Update test to use HashIdentityMetadata function
Add tests for HashIdentityMetadata behavior
Update test expectations for new poll interval
Update tests for HashIdentityMetadata usage
4 files
Change default poll interval to 5 minutes
Change default poll interval to 5 minutes
Update poll intervals to 5 minutes
Update poll intervals in Kubernetes config
2 files
Add grpc stats dependency
Add grpc stats dependency
3 files
Document Proton reset procedure
Update poll interval in documentation
Update poll interval in documentation
Imported GitHub PR comment.
Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/1770#issuecomment-3404242282
Original created: 2025-10-15T02:01:44Z
PR Compliance Guide 🔍
Below is a summary of compliance checks for this PR:
Destructive script execution
Description: The Kubernetes reset script can delete and recreate a PersistentVolumeClaim and restart
deployments, which if run against the wrong namespace could cause destructive data loss;
ensure restricted execution and environment safeguards.
reset-proton.sh [1-64]
Referred Code
Telemetry suppression risk
Description: TelemetryFilter controls which RPCs emit telemetry and uses a positive boolean to include
traces; misconfiguration could unintentionally suppress security-relevant RPC traces
(e.g., health or admin), reducing observability for incident response.
server.go [105-122]
Referred Code
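One way to limit the suppression risk described above is to make the filter a narrow deny-list, so that every RPC is traced unless it belongs to the KV service. The following is a sketch under assumptions: `shouldTrace` and `kvServicePrefix` are hypothetical names, the KV service path is a guess, and the PR's actual TelemetryFilter type is not reproduced here.

```go
package main

import (
	"fmt"
	"strings"
)

// kvServicePrefix is an assumed gRPC service path; the real service name in
// the repository may differ.
const kvServicePrefix = "/proto.KVService/"

// shouldTrace returns true when telemetry should be recorded for the RPC.
// Only the noisy KV service is excluded; everything else stays visible.
func shouldTrace(fullMethod string) bool {
	return !strings.HasPrefix(fullMethod, kvServicePrefix)
}

func main() {
	for _, m := range []string{
		"/proto.KVService/Get",
		"/grpc.health.v1.Health/Check",
	} {
		fmt.Printf("%s traced=%v\n", m, shouldTrace(m))
	}
}
```

Because the default outcome is "trace", an unrecognized or newly added service (health, admin, and so on) keeps its telemetry unless it is explicitly excluded.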
JWT key handling risk
Description: The init container preserves and rewrites JWT private keys in a shared file path; if file
permissions, secrets handling, or logs are misconfigured in the cluster, private keys
could be exposed—verify volumes and file ACLs are restricted.
serviceradar-core.yaml [32-58]
Referred Code
🎫 #1768
observability dashboard.
needed.
aligned trace totals over the same time window.
troubleshooting.
Codebase context is not defined
Follow the guide to enable codebase context checks.
No custom compliance provided
Follow the guide to enable custom compliance check.
Compliance status legend
🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label
Imported GitHub PR review comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/pull/1770#discussion_r2430921435
Original created: 2025-10-15T02:02:06Z
Original path: docker/compose/poller.docker.json
Original line: 101
should be 5min
Imported GitHub PR review comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/pull/1770#discussion_r2430922234
Original created: 2025-10-15T02:02:57Z
Original path: docs/docs/tls-security.md
Original line: 637
should be 30s
Imported GitHub PR comment.
Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/1770#issuecomment-3404244802
Original created: 2025-10-15T02:03:36Z
PR Code Suggestions ✨
Explore these optional code suggestions:
✅ Correctly set cache revision after update
Suggestion Impact: The commit changed the code to initialize newRevision separately and set it from updateResp only, removing the fallback to resp.GetRevision(), thus preventing stale revisions from being cached.
Correct the logic for determining the new revision after a successful KV Update. The new revision should be derived only from updateResp, not from the old resp returned by the Get call, to avoid caching a stale revision.
pkg/registry/identity_publisher.go [299-326]
[Suggestion processed]
Suggestion importance[1-10]: 9
Why: The suggestion correctly identifies a critical bug where the cache could be populated with a stale revision number after a successful update, leading to guaranteed conflicts on subsequent updates for the same key. This fix is crucial for the correctness and performance of the identity publishing mechanism.
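The revision handling in this suggestion can be sketched against a toy store. Everything below is hypothetical (`fakeKV`, `publish`, `errConflict`); it only models a KV store whose updates are compare-and-swap on a revision number, which is the behaviour the suggestion assumes.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeKV is a stand-in for a revisioned KV store with compare-and-swap
// updates. It is hypothetical and only models the revision behaviour.
type fakeKV struct {
	value    string
	revision uint64
}

var errConflict = errors.New("revision mismatch")

// update succeeds only when expected matches the stored revision, and
// returns the revision assigned to the new value.
func (kv *fakeKV) update(value string, expected uint64) (uint64, error) {
	if expected != kv.revision {
		return 0, errConflict
	}
	kv.value = value
	kv.revision++
	return kv.revision, nil
}

// publish writes value and returns the revision to cache for the next
// update. That revision comes from the update result only, never from an
// earlier read.
func publish(kv *fakeKV, value string, cachedRevision uint64) (uint64, error) {
	newRevision, err := kv.update(value, cachedRevision)
	if err != nil {
		return cachedRevision, err
	}
	return newRevision, nil
}

func main() {
	kv := &fakeKV{}
	rev, _ := publish(kv, "v1", 0)
	rev, err := publish(kv, "v2", rev) // uses the revision returned by the first update
	fmt.Println(rev, err)
}
```

If the caller had cached the pre-update revision instead, the second `publish` would fail with `errConflict`, which is the failure mode the suggestion describes.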
Include all_ips field in identity hash
Include the all_ips field in the HashIdentityMetadata function. This ensures that changes to a device's secondary IP addresses are reflected in the identity hash, preventing stale identity records.
pkg/identitymap/identitymap.go [104-152]
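As a rough illustration of what folding all_ips into the fingerprint could look like, the sketch below hashes the primary IP together with a sorted, de-duplicated address list. `hashIdentity` is a hypothetical stand-in, not the real HashIdentityMetadata; the field set, the separator, and the choice of SHA-256 are all assumptions.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// hashIdentity is a hypothetical stand-in for HashIdentityMetadata. It
// fingerprints the primary IP plus the sorted, de-duplicated all_ips list.
func hashIdentity(ip string, allIPs []string) string {
	set := make(map[string]struct{}, len(allIPs))
	for _, a := range allIPs {
		a = strings.TrimSpace(a)
		if a != "" {
			set[a] = struct{}{}
		}
	}
	ips := make([]string, 0, len(set))
	for a := range set {
		ips = append(ips, a)
	}
	sort.Strings(ips) // sorting makes the hash independent of input order

	h := sha256.New()
	h.Write([]byte(ip))
	h.Write([]byte{0}) // separator so field boundaries stay unambiguous
	h.Write([]byte(strings.Join(ips, ",")))
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	a := hashIdentity("10.0.0.1", []string{"10.0.0.2", "10.0.0.3"})
	b := hashIdentity("10.0.0.1", []string{"10.0.0.3", "10.0.0.2"})
	c := hashIdentity("10.0.0.1", []string{"10.0.0.2"})
	fmt.Println(a == b, a == c)
}
```

Reordering the secondary addresses leaves the fingerprint unchanged, while adding or removing one changes it, so only a real change to the address set would trigger a new identity write.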
Suggestion importance[1-10]: 8
Why: The suggestion correctly identifies that the all_ips field, which is critical for device identity reconciliation, was missing from the identity hash calculation. Including it ensures that changes to a device's secondary IPs correctly trigger updates to the canonical record, preventing stale data and improving identity resolution accuracy.
Consider a more flexible telemetry sampling strategy
Instead of completely disabling telemetry for KV calls and fast traces,
implement a configurable sampling strategy. This would reduce data volume while
retaining visibility into normal system behavior for better performance analysis
and debugging.
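To make the sampling idea concrete, here is a deliberately simple sketch that keeps one call out of every n instead of dropping all of them. `everyNth` is a hypothetical type; it is not the strategy used by the PR or by the OpenTelemetry SDK samplers, and it assumes Go 1.19 or later for `atomic.Uint64`.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// everyNth keeps one call out of every n. It is a deliberately simple,
// hypothetical sampler, not the strategy used by the PR or by OpenTelemetry.
type everyNth struct {
	n     uint64
	count atomic.Uint64
}

// sample reports whether the current call should emit telemetry.
func (s *everyNth) sample() bool {
	if s.n <= 1 {
		return true // n of 0 or 1 means "keep everything"
	}
	return s.count.Add(1)%s.n == 1
}

func main() {
	s := &everyNth{n: 10}
	kept := 0
	for i := 0; i < 100; i++ {
		if s.sample() {
			kept++
		}
	}
	fmt.Println(kept) // 10 of the 100 calls are kept
}
```

Making `n` a configuration value would let operators trade telemetry volume against visibility without a code change, which is the flexibility the suggestion asks for.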
Examples:
cmd/core/main.go [265-267]
cmd/otel/src/lib.rs [279-288]
Suggestion importance[1-10]: 8
Why: The suggestion correctly identifies that the PR's aggressive telemetry reduction in cmd/otel/src/lib.rs and cmd/core/main.go might be too restrictive, and proposes a more flexible sampling strategy, which is a valid and impactful design improvement.
Imported GitHub PR review comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/pull/1770#discussion_r2430926324
Original created: 2025-10-15T02:07:24Z
Original path: pkg/poller/config_test.go
Original line: 190
should be 30s
Imported GitHub PR review comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/pull/1770#discussion_r2430926787
Original created: 2025-10-15T02:07:53Z
Original path: pkg/registry/identity_publisher.go
Original line: 42
should be configurable