1790 bugcore not handling nats disconnects gracefully #2331
No reviewers
Labels
No labels
1week
2weeks
Failed compliance check
IP cameras
NATS
Possible security concern
Review effort 1/5
Review effort 2/5
Review effort 3/5
Review effort 4/5
Review effort 5/5
UI
aardvark
accessibility
amd64
api
arm64
auth
back-end
bgp
blog
bug
build
checkers
ci-cd
cleanup
cnpg
codex
core
dependencies
device-management
documentation
duplicate
dusk
ebpf
enhancement
eta 1d
eta 1hr
eta 3d
eta 3hr
feature
fieldsurvey
github_actions
go
good first issue
help wanted
invalid
javascript
k8s
log-collector
mapper
mtr
needs-triage
netflow
network-sweep
observability
oracle
otel
plug-in
proton
python
question
reddit
redhat
research
rperf
rperf-checker
rust
sdk
security
serviceradar-agent
serviceradar-agent-gateway
serviceradar-web
serviceradar-web-ng
siem
snmp
sysmon
topology
ubiquiti
wasm
wontfix
zen-engine
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
carverauto/serviceradar!2331
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "refs/pull/2331/head"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Imported from GitHub pull request.
Original GitHub pull request: #1791
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/pull/1791
Original created: 2025-10-16T17:18:10Z
Original updated: 2025-10-16T18:53:17Z
Original head: carverauto/serviceradar:1790-bugcore-not-handling-nats-disconnects-gracefully
Original base: main
Original merged: 2025-10-16T18:53:14Z by @mfreeman451
PR Type
Bug fix, Enhancement
Description
Add graceful NATS disconnect handling with automatic reconnection logic
Prevent KV bucket from growing unbounded with configurable limits
Clean up stale identity keys when device attributes change
Fix code formatting and add comprehensive test coverage
Diagram Walkthrough
File Walkthrough
8 files
Add NATS reconnection and error handling logicAdd NATS reconnect synchronization fields to ServerAdd function to rebuild identity keys from recordsAdd bucket configuration validation and defaultsAdd error for negative bucket_max_bytesAdd bucket size and TTL configuration to NATSStoreAdd bucket configuration fields to Config structAdd stale key deletion and identity snapshot tracking5 files
Add integration test for NATS reconnectionAdd test for BuildKeysFromRecord functionUpdate test to include new bucket configuration fieldsUpdate test fixtures with new bucket configuration fieldsAdd test for stale identity key deletion1 files
Call error handler on publish failures and fix formatting3 files
Add bucket configuration to KV service configAdd bucket configuration to KV service configAdd bucket configuration to KV service config1 files
Add backoff and nats-server dependenciesImported GitHub PR comment.
Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/1791#issuecomment-3411892553
Original created: 2025-10-16T17:18:44Z
You are nearing your monthly Qodo Merge usage quota. For more information, please visit here.
PR Compliance Guide 🔍
Below is a summary of compliance checks for this PR:
No security concerns identified
No security vulnerabilities detected by AI analysis. Human verification advised for critical code.🎫 No ticket provided
Codebase context is not defined
Follow the guide to enable codebase context checks.
No custom compliance provided
Follow the guide to enable custom compliance check.
Compliance status legend
🟢 - Fully Compliant🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label
Imported GitHub PR comment.
Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/1791#issuecomment-3411901648
Original created: 2025-10-16T17:20:07Z
You are nearing your monthly Qodo Merge usage quota. For more information, please visit here.
PR Code Suggestions ✨
Explore these optional code suggestions:
Consolidate NATS reconnection logic
The robust NATS reconnection logic with exponential backoff, introduced for the
event publisher, should be extracted and reused for the NATS-backed KV store
client. This would improve system stability by ensuring all critical NATS
components share the same resilient reconnection mechanism.
Examples:
pkg/core/events.go [143-200]
pkg/kv/nats.go [487-522]
Solution Walkthrough:
Before:
After:
Suggestion importance[1-10]: 9
__
Why: The suggestion correctly identifies that the new, robust NATS reconnection logic in
pkg/core/events.gois not applied to the critical NATS-backed KV store, which was a source of failures mentioned in the ticket. Consolidating this logic would significantly improve system resilience and code consistency.Prevent race condition during connection replacement
Prevent a race condition in
setEventPublisherby closing the old NATS connectionwhile the mutex is still locked. This ensures no other goroutine can use the
connection as it's being closed.
pkg/core/events.go [115-125]
Suggestion importance[1-10]: 8
__
Why: The suggestion correctly identifies a race condition where an old NATS connection could be closed while another goroutine is still using it, which could lead to panics. The proposed fix of closing the old connection within the mutex lock is a valid and important correction to ensure thread safety.
✅
Prevent silent data truncation bugSuggestion Impact:
The commit added a guard checking n.bucketHistory against math.MaxUint8 and returning an error if exceeded, aligning with the suggested validation to prevent uint8 truncation.code diff:
In
ensureDomainLocked, validate thatn.bucketHistorydoes not exceed 255 beforecasting it to
uint8. This prevents silent value truncation and potentialmisconfiguration of the JetStream Key-Value store history.
pkg/kv/nats.go [543-560]
[Suggestion processed]Suggestion importance[1-10]: 7
__
Why: The suggestion correctly identifies a potential data truncation issue when casting
bucketHistoryfromuint32touint8, which could lead to silent misconfiguration. Adding a validation check is a good defensive programming practice that improves the robustness of the bucket creation logic.