1790 bugcore not handling nats disconnects gracefully #2331

Merged
mfreeman451 merged 6 commits from refs/pull/2331/head into main 2025-10-16 18:53:14 +00:00
mfreeman451 commented 2025-10-16 17:18:10 +00:00 (Migrated from github.com)
Owner

Imported from GitHub pull request.

Original GitHub pull request: #1791
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/pull/1791
Original created: 2025-10-16T17:18:10Z
Original updated: 2025-10-16T18:53:17Z
Original head: carverauto/serviceradar:1790-bugcore-not-handling-nats-disconnects-gracefully
Original base: main
Original merged: 2025-10-16T18:53:14Z by @mfreeman451

PR Type

Bug fix, Enhancement


Description

  • Add graceful NATS disconnect handling with automatic reconnection logic

    • Implement exponential backoff retry mechanism for failed reconnections
    • Add connection lifecycle handlers (disconnect, reconnect, closed, error)
  • Prevent KV bucket from growing unbounded with configurable limits

    • Add bucket_max_bytes, bucket_ttl, and bucket_history configuration options
    • Apply limits when creating JetStream KeyValue buckets
  • Clean up stale identity keys when device attributes change

    • Track existing identity keys and delete obsolete entries on updates
    • Add identity snapshot mechanism to detect key changes
  • Fix code formatting and add comprehensive test coverage

    • Correct indentation in pollers.go status report methods
    • Add integration tests for NATS reconnection scenarios

Diagram Walkthrough

```mermaid
flowchart LR
  A["NATS Disconnect"] -->|triggers| B["scheduleEventPublisherReinit"]
  B -->|spawns| C["reinitializeEventPublisher"]
  C -->|uses| D["Exponential Backoff"]
  D -->|retries| E["initializeEventPublisher"]
  E -->|succeeds| F["New Connection"]

  G["Device Update"] -->|checks| H["existingIdentitySnapshot"]
  H -->|identifies| I["Stale Keys"]
  I -->|deletes| J["deleteIdentityKeys"]

  K["KV Config"] -->|sets| L["bucket_max_bytes"]
  K -->|sets| M["bucket_ttl"]
  K -->|sets| N["bucket_history"]
  L -->|applied to| O["KeyValue Bucket"]
  M -->|applied to| O
  N -->|applied to| O
```
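The middle branch of the diagram, stale-key cleanup, is essentially a set difference between the previously published identity keys and the keys rebuilt from the current record. A minimal sketch; `staleKeys` is a hypothetical helper, while the PR's actual functions are `BuildKeysFromRecord` and `deleteIdentityKeys`:

```go
package main

import (
	"fmt"
	"sort"
)

// staleKeys returns keys present in the previous snapshot but absent
// from the keys rebuilt for the current record; these are the entries
// to delete from the KV identity map.
func staleKeys(previous, current []string) []string {
	live := make(map[string]struct{}, len(current))
	for _, k := range current {
		live[k] = struct{}{}
	}

	var stale []string
	for _, k := range previous {
		if _, ok := live[k]; !ok {
			stale = append(stale, k)
		}
	}
	sort.Strings(stale)
	return stale
}

func main() {
	prev := []string{"ip:10.0.0.5", "hostname:old-name", "mac:aa:bb:cc"}
	curr := []string{"ip:10.0.0.5", "hostname:new-name", "mac:aa:bb:cc"}
	fmt.Println(staleKeys(prev, curr)) // [hostname:old-name]
}
```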

File Walkthrough

Relevant files:

Enhancement (8 files)
  • events.go: Add NATS reconnection and error handling logic (+138/-3)
  • types.go: Add NATS reconnect synchronization fields to Server (+21/-19)
  • identitymap.go: Add function to rebuild identity keys from records (+34/-0)
  • config.go: Add bucket configuration validation and defaults (+17/-0)
  • errors.go: Add error for negative bucket_max_bytes (+1/-0)
  • nats.go: Add bucket size and TTL configuration to NATSStore (+45/-18)
  • types.go: Add bucket configuration fields to Config struct (+5/-2)
  • identity_publisher.go: Add stale key deletion and identity snapshot tracking (+120/-0)

Tests (5 files)
  • events_test.go: Add integration test for NATS reconnection (+125/-0)
  • identitymap_test.go: Add test for BuildKeysFromRecord function (+35/-0)
  • nats_reconnect_test.go: Update test to include new bucket configuration fields (+9/-6)
  • watch_test.go: Update test fixtures with new bucket configuration fields (+14/-8)
  • identity_publisher_test.go: Add test for stale identity key deletion (+57/-0)

Bug fix (1 file)
  • pollers.go: Call error handler on publish failures and fix formatting (+33/-30)

Configuration changes (3 files)
  • deploy.sh: Add bucket configuration to KV service config (+4/-1)
  • configmap.yaml: Add bucket configuration to KV service config (+4/-1)
  • kv.json: Add bucket configuration to KV service config (+5/-2)

Dependencies (1 file)
  • BUILD.bazel: Add backoff and nats-server dependencies (+2/-0)

qodo-code-review[bot] commented 2025-10-16 17:18:44 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/1791#issuecomment-3411892553
Original created: 2025-10-16T17:18:44Z

You are nearing your monthly Qodo Merge usage quota. For more information, please visit here.

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
🟢 No security concerns identified: no security vulnerabilities detected by AI analysis. Human verification advised for critical code.

Ticket Compliance
⚪ 🎫 No ticket provided

- [ ] Create ticket/issue

Codebase Duplication Compliance
⚪ Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
⚪ No custom compliance provided

Follow the guide to enable custom compliance checks.

Compliance status legend:
🟢 - Fully Compliant
🟡 - Partially Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label
qodo-code-review[bot] commented 2025-10-16 17:20:07 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/1791#issuecomment-3411901648
Original created: 2025-10-16T17:20:07Z

You are nearing your monthly Qodo Merge usage quota. For more information, please visit here.

PR Code Suggestions

Explore these optional code suggestions:

Category: High-level

Suggestion: Consolidate NATS reconnection logic

The robust NATS reconnection logic with exponential backoff, introduced for the
event publisher, should be extracted and reused for the NATS-backed KV store
client. This would improve system stability by ensuring all critical NATS
components share the same resilient reconnection mechanism.

Examples:

pkg/core/events.go [143-200]:

```go
func (s *Server) reinitializeEventPublisher(reason string) {
	defer func() {
		s.natsReconnectMu.Lock()
		s.natsReconnectActive = false
		s.natsReconnectMu.Unlock()
	}()

	if s.config == nil || s.config.Events == nil || !s.config.Events.Enabled {
		return
	}

	// ... (clipped 48 lines)
```

pkg/kv/nats.go [487-522]:

```go
func (n *NATSStore) connectionNeedsRefreshLocked() bool {
	if n.nc == nil {
		return true
	}

	status := n.nc.Status()
	switch status {
	case nats.CONNECTED:
		return false
	case nats.DISCONNECTED, nats.CLOSED, nats.RECONNECTING, nats.CONNECTING, nats.DRAINING_SUBS, nats.DRAINING_PUBS:

	// ... (clipped 26 lines)
```

Solution Walkthrough:

Before:

```go
// In pkg/core/events.go
func (s *Server) reinitializeEventPublisher(reason string) {
    backoffPolicy := backoff.NewExponentialBackOff()
    for {
        // ... check for shutdown
        err := s.initializeEventPublisher(ctx, s.config)
        if err == nil {
            return // Success
        }
        delay := backoffPolicy.NextBackOff()
        // ... log and wait for delay
    }
}

// In pkg/kv/nats.go
func (n *NATSStore) getKVForDomain(ctx, domain) (kv, err) {
    n.mu.Lock()
    defer n.mu.Unlock()
    if n.connectionNeedsRefreshLocked() {
        if err := n.reconnectLocked(); err != nil {
            return nil, err // Simple, one-shot reconnect attempt
        }
    }
    // ... proceed to get KV
}
```

After:

```go
// In a new shared package, e.g., pkg/natsutil/reconnect.go
type Reconnector struct { ... }

func (r *Reconnector) MaintainConnection(initFunc func() error) {
    backoffPolicy := backoff.NewExponentialBackOff()
    for {
        err := initFunc()
        if err == nil {
            // Wait for a disconnect signal
        } else {
            // Retry with backoff
        }
    }
}

// In pkg/core/events.go:
// uses the shared Reconnector to manage the event publisher connection.
s.reconnector.MaintainConnection(s.initializeEventPublisher)

// In pkg/kv/nats.go:
// also uses the shared Reconnector to manage the KV store connection.
store.reconnector.MaintainConnection(store.initializeKV)
```

Suggestion importance [1-10]: 9

Why: The suggestion correctly identifies that the new, robust NATS reconnection logic in pkg/core/events.go is not applied to the critical NATS-backed KV store, which was a source of failures mentioned in the ticket. Consolidating this logic would significantly improve system resilience and code consistency.

Impact: High
Category: Possible issue

Suggestion: Prevent race condition during connection replacement

Prevent a race condition in setEventPublisher by closing the old NATS connection
while the mutex is still locked. This ensures no other goroutine can use the
connection as it's being closed.

pkg/core/events.go [115-125]:

```diff
 func (s *Server) setEventPublisher(publisher *natsutil.EventPublisher, conn *nats.Conn) {
 	s.mu.Lock()
+	defer s.mu.Unlock()
+
 	oldConn := s.natsConn
-	s.eventPublisher = publisher
-	s.natsConn = conn
-	s.mu.Unlock()
-
 	if oldConn != nil && oldConn != conn {
 		oldConn.Close()
 	}
+
+	s.eventPublisher = publisher
+	s.natsConn = conn
 }
```
Suggestion importance [1-10]: 8

Why: The suggestion correctly identifies a race condition where an old NATS connection could be closed while another goroutine is still using it, which could lead to panics. The proposed fix of closing the old connection within the mutex lock is a valid and important correction to ensure thread safety.

Impact: Medium
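The corrected pattern, closing the old connection while the lock is still held, can be shown with a standalone sketch; `fakeConn` stands in for `*nats.Conn` so it runs without a broker, and the names are illustrative:

```go
package main

import (
	"fmt"
	"sync"
)

// fakeConn stands in for *nats.Conn so the pattern is runnable here.
type fakeConn struct{ closed bool }

func (c *fakeConn) Close() { c.closed = true }

type server struct {
	mu   sync.Mutex
	conn *fakeConn
}

// setConn swaps in a new connection, closing the old one while the
// lock is still held so no other goroutine can grab it mid-close.
func (s *server) setConn(conn *fakeConn) {
	s.mu.Lock()
	defer s.mu.Unlock()

	if s.conn != nil && s.conn != conn {
		s.conn.Close()
	}
	s.conn = conn
}

func main() {
	s := &server{}
	old := &fakeConn{}
	s.setConn(old)
	s.setConn(&fakeConn{})
	fmt.Println(old.closed) // true
}
```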
Suggestion: Prevent silent data truncation bug ✅ (applied)

Suggestion Impact: The commit added a guard checking n.bucketHistory against math.MaxUint8 and returning an error if exceeded, aligning with the suggested validation to prevent uint8 truncation.

code diff:

```diff
+		if n.bucketHistory > math.MaxUint8 {
+			return nil, fmt.Errorf("%w: got %d", errBucketHistoryTooLarge, n.bucketHistory)
+		}
+
 		cfg := jetstream.KeyValueConfig{
 			Bucket:  n.bucket,
 			History: uint8(n.bucketHistory),
```

In ensureDomainLocked, validate that n.bucketHistory does not exceed 255 before
casting it to uint8. This prevents silent value truncation and potential
misconfiguration of the JetStream Key-Value store history.

pkg/kv/nats.go [543-560]:

```diff
 	kv, err := js.KeyValue(ctx, n.bucket)
 	if err != nil {
+		if n.bucketHistory > 255 {
+			return nil, fmt.Errorf("bucket_history %d exceeds maximum of 255", n.bucketHistory)
+		}
 		cfg := jetstream.KeyValueConfig{
 			Bucket:  n.bucket,
 			History: uint8(n.bucketHistory),
 		}
 		if n.bucketTTL > 0 {
 			cfg.TTL = n.bucketTTL
 		}
 		if n.bucketMaxBytes > 0 {
 			cfg.MaxBytes = n.bucketMaxBytes
 		}

 		kv, err = js.CreateKeyValue(ctx, cfg)
 		if err != nil {
 			return nil, fmt.Errorf("kv bucket init failed for domain %q: %w", domain, err)
 		}
 	}
```


Suggestion importance [1-10]: 7

Why: The suggestion correctly identifies a potential data truncation issue when casting bucketHistory from uint32 to uint8, which could lead to silent misconfiguration. Adding a validation check is a good defensive programming practice that improves the robustness of the bucket creation logic.

Impact: Medium
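For illustration, here is the wraparound an unchecked uint8 cast produces, and the shape of the guard the suggestion proposes; `guardHistory` is a hypothetical stand-in for the check merged into ensureDomainLocked:

```go
package main

import (
	"fmt"
	"math"
)

// guardHistory rejects values that would silently wrap when cast to
// uint8, mirroring the validation added before building the
// JetStream KeyValue config.
func guardHistory(h uint32) (uint8, error) {
	if h > math.MaxUint8 {
		return 0, fmt.Errorf("bucket_history %d exceeds maximum of %d", h, math.MaxUint8)
	}
	return uint8(h), nil
}

func main() {
	v := uint32(300)
	fmt.Println(uint8(v)) // unchecked cast silently wraps: 44
	if _, err := guardHistory(v); err != nil {
		fmt.Println("rejected:", err)
	}
}
```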