deadlock fixes #2530

Merged
mfreeman451 merged 2 commits from refs/pull/2530/head into main 2025-12-08 20:52:49 +00:00
mfreeman451 commented 2025-12-08 20:04:29 +00:00 (Migrated from github.com)
Owner

Imported from GitHub pull request.

Original GitHub pull request: #2088
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/pull/2088
Original created: 2025-12-08T20:04:29Z
Original updated: 2025-12-08T20:52:53Z
Original head: carverauto/serviceradar:2087-bugcore-cnpg-deadlock-on-device_updates
Original base: main
Original merged: 2025-12-08T20:52:49Z by @mfreeman451

User description

IMPORTANT: Please sign the Developer Certificate of Origin

Thank you for your contribution to ServiceRadar. Please note, when contributing, the developer must include
a DCO sign-off statement indicating the DCO acceptance in one commit message. Here
is an example DCO Signed-off-by line in a commit message:

Signed-off-by: J. Doe <j.doe@domain.com>

Describe your changes

Code checklist before requesting a review

  • I have signed the DCO?
  • The build completes without errors?
  • All tests are passing when running make test?

PR Type

Bug fix, Enhancement


Description

  • Add mutex-based serialization for CNPG device update batch writes to eliminate deadlocks

  • Implement transient error classification and automatic retry with exponential backoff for PostgreSQL errors

  • Add metrics tracking for deadlock occurrences, retries, and successful retry completions

  • Extend serialization to device identifiers and network sightings operations sharing related data


Diagram Walkthrough

flowchart LR
  A["Device Update<br/>Operations"] -->|Acquire Mutex| B["Serialize Batch<br/>Execution"]
  B -->|Send Batch| C["PostgreSQL<br/>Execution"]
  C -->|Success| D["Release Mutex<br/>Return"]
  C -->|Transient Error<br/>40P01/40001| E["Classify Error<br/>& Backoff"]
  E -->|Retry| B
  E -->|Max Attempts<br/>Exceeded| F["Release Mutex<br/>Return Error"]
  B -->|Record Metrics| G["Deadlock/Retry<br/>Counters"]

File Walkthrough

Relevant files
Enhancement
3 files
cnpg_device_updates_metrics.go
Add atomic counters for deadlock and retry metrics             
+77/-0   
cnpg_device_updates_retry.go
Implement transient error classification and retry logic 
+209/-0 
db.go
Add deviceUpdatesMu mutex field and initialize in New       
+15/-6   
Tests
1 files
cnpg_device_updates_retry_test.go
Add comprehensive unit tests for retry and backoff logic 
+175/-0 
Bug fix
3 files
cnpg_unified_devices.go
Wrap device updates batch with mutex and retry wrapper     
+7/-4     
cnpg_identity_reconciliation_upserts.go
Serialize device identifier writes with mutex protection 
+8/-1     
cnpg_identity_reconciliation.go
Serialize network sightings writes with mutex protection 
+8/-1     
Documentation
4 files
design.md
Document deadlock scenario, decisions, and migration plan
+110/-0 
proposal.md
Describe problem statement and proposed solution approach
+48/-0   
spec.md
Define modified and new requirements for serialization     
+67/-0   
tasks.md
Track implementation tasks and verification steps               
+33/-0   

Imported from GitHub pull request. Original GitHub pull request: #2088 Original author: @mfreeman451 Original URL: https://github.com/carverauto/serviceradar/pull/2088 Original created: 2025-12-08T20:04:29Z Original updated: 2025-12-08T20:52:53Z Original head: carverauto/serviceradar:2087-bugcore-cnpg-deadlock-on-device_updates Original base: main Original merged: 2025-12-08T20:52:49Z by @mfreeman451 --- ### **User description** ## IMPORTANT: Please sign the Developer Certificate of Origin Thank you for your contribution to ServiceRadar. Please note, when contributing, the developer must include a [DCO sign-off statement]( https://developercertificate.org/) indicating the DCO acceptance in one commit message. Here is an example DCO Signed-off-by line in a commit message: ``` Signed-off-by: J. Doe <j.doe@domain.com> ``` ## Describe your changes ## Issue ticket number and link ## Code checklist before requesting a review - [ ] I have signed the DCO? - [ ] The build completes without errors? - [ ] All tests are passing when running make test? ___ ### **PR Type** Bug fix, Enhancement ___ ### **Description** - Add mutex-based serialization for CNPG device update batch writes to eliminate deadlocks - Implement transient error classification and automatic retry with exponential backoff for PostgreSQL errors - Add metrics tracking for deadlock occurrences, retries, and successful retry completions - Extend serialization to device identifiers and network sightings operations sharing related data ___ ### Diagram Walkthrough ```mermaid flowchart LR A["Device Update<br/>Operations"] -->|Acquire Mutex| B["Serialize Batch<br/>Execution"] B -->|Send Batch| C["PostgreSQL<br/>Execution"] C -->|Success| D["Release Mutex<br/>Return"] C -->|Transient Error<br/>40P01/40001| E["Classify Error<br/>& Backoff"] E -->|Retry| B E -->|Max Attempts<br/>Exceeded| F["Release Mutex<br/>Return Error"] B -->|Record Metrics| G["Deadlock/Retry<br/>Counters"] ``` <details><summary><h3>File Walkthrough</h3></summary> <table><thead><tr><th></th><th align="left">Relevant files</th></tr></thead><tbody><tr><td><strong>Enhancement</strong></td><td><details><summary>3 files</summary><table> <tr> <td><strong>cnpg_device_updates_metrics.go</strong><dd><code>Add atomic counters for deadlock and retry metrics</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2088/files#diff-aad2b6b2707edb48a88b8b3a16545f0233e10cc3509ef0f00a149fccdaa54816">+77/-0</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>cnpg_device_updates_retry.go</strong><dd><code>Implement transient error classification and retry logic</code>&nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2088/files#diff-9b08ac708a4a4f6e40340f1f6005e43d579b76f88f0987d10a5984129bcc35fb">+209/-0</a>&nbsp; </td> </tr> <tr> <td><strong>db.go</strong><dd><code>Add deviceUpdatesMu mutex field and initialize in New</code>&nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2088/files#diff-5da3684806835246d262230050593f460b12b6c0e3966df174e6061be0e9e575">+15/-6</a>&nbsp; &nbsp; </td> </tr> </table></details></td></tr><tr><td><strong>Tests</strong></td><td><details><summary>1 files</summary><table> <tr> <td><strong>cnpg_device_updates_retry_test.go</strong><dd><code>Add comprehensive unit tests for retry and backoff logic</code>&nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2088/files#diff-58c50b725d3c57e8d5193bb5dd1557d537f747f14e492b1a26a148f7cd973620">+175/-0</a>&nbsp; </td> </tr> </table></details></td></tr><tr><td><strong>Bug fix</strong></td><td><details><summary>3 files</summary><table> <tr> <td><strong>cnpg_unified_devices.go</strong><dd><code>Wrap device updates batch with mutex and retry wrapper</code>&nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2088/files#diff-ab3bf557cf9bb1b281a315a73abee38748de1654941c2471542e5e9bfc1716d8">+7/-4</a>&nbsp; &nbsp; &nbsp; </td> </tr> <tr> <td><strong>cnpg_identity_reconciliation_upserts.go</strong><dd><code>Serialize device identifier writes with mutex protection</code>&nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2088/files#diff-957dbd9cd93b91df2c274b2a7e01e238709d9a0e759081e4ac02b957047f28f7">+8/-1</a>&nbsp; &nbsp; &nbsp; </td> </tr> <tr> <td><strong>cnpg_identity_reconciliation.go</strong><dd><code>Serialize network sightings writes with mutex protection</code>&nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2088/files#diff-a5983fb383f0fe5d0270e7689e4f0f104a02b5f38598b1d4708832d41a79962a">+8/-1</a>&nbsp; &nbsp; &nbsp; </td> </tr> </table></details></td></tr><tr><td><strong>Documentation</strong></td><td><details><summary>4 files</summary><table> <tr> <td><strong>design.md</strong><dd><code>Document deadlock scenario, decisions, and migration plan</code></dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2088/files#diff-4d82906fb75c93ae643c6d8ee8949781d6498d940514039e321db854637aa1d5">+110/-0</a>&nbsp; </td> </tr> <tr> <td><strong>proposal.md</strong><dd><code>Describe problem statement and proposed solution approach</code></dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2088/files#diff-f13c9e7a2e41b2a7ac2da06da4e117cea26d44ae3105f5de2928311fe166c5ef">+48/-0</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>spec.md</strong><dd><code>Define modified and new requirements for serialization</code>&nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2088/files#diff-6980110ed515b9b05625c866a6badc2592a816cbba95e4b1c661c72effa975a8">+67/-0</a>&nbsp; &nbsp; </td> </tr> <tr> <td><strong>tasks.md</strong><dd><code>Track implementation tasks and verification steps</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></td> <td><a href="https://github.com/carverauto/serviceradar/pull/2088/files#diff-ee5ed7fa459b86bccf64ebac42c95ab892a2cff9248de92a8f75d2de09374896">+33/-0</a>&nbsp; &nbsp; </td> </tr> </table></details></td></tr></tbody></table> </details> ___
qodo-code-review[bot] commented 2025-12-08 20:05:06 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2088#issuecomment-3628795589
Original created: 2025-12-08T20:05:06Z

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
🟢
No security concerns identified No security vulnerabilities detected by AI analysis. Human verification advised for critical code.
Ticket Compliance
🎫 No ticket provided
  • Create ticket/issue
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢
Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status:
Missing audit: New retry/deadlock handling performs critical DB actions without adding explicit audit
logging of action, actor, and outcome, and it's unclear if higher layers provide
required audit trails.

Referred Code
// sendCNPGWithRetry sends a batch with automatic retry for transient errors.
// This should be used for device-related batch operations that may encounter deadlocks.
func (db *DB) sendCNPGWithRetry(ctx context.Context, batch *pgx.Batch, name string) error {
	maxAttempts := getCNPGMaxRetryAttempts()
	var lastErr error

	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if ctx.Err() != nil {
			return ctx.Err()
		}

		err := db.sendCNPGBatch(ctx, batch, name)
		if err == nil {
			if attempt > 1 {
				// Record successful retry
				recordCNPGRetrySuccess(name)
			}
			return nil
		}

		lastErr = err


 ... (clipped 34 lines)

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status:
Error detail exposure: Logging includes raw database errors which may contain internal details, and it is unclear
whether these logs are user-facing or restricted to internal logs.

Referred Code
	db.logger.Warn().
		Err(err).
		Str("sqlstate", code).
		Str("batch_name", name).
		Int("attempt", attempt).
		Int("max_attempts", maxAttempts).
		Dur("backoff", delay).
		Msg("cnpg transient error, retrying")
	time.Sleep(delay)
	continue
}

// Non-transient error or max attempts reached
db.logger.Error().
	Err(err).
	Str("sqlstate", code).
	Str("batch_name", name).
	Int("attempt", attempt).
	Int("max_attempts", maxAttempts).
	Msg("cnpg batch failed")

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status:
Potential sensitive logs: The code logs database errors and batch names which could include sensitive context
depending on logger configuration, and there is no explicit scrubbing shown in the diff.

Referred Code
	db.logger.Warn().
		Err(err).
		Str("sqlstate", code).
		Str("batch_name", name).
		Int("attempt", attempt).
		Int("max_attempts", maxAttempts).
		Dur("backoff", delay).
		Msg("cnpg transient error, retrying")
	time.Sleep(delay)
	continue
}

// Non-transient error or max attempts reached
db.logger.Error().
	Err(err).
	Str("sqlstate", code).
	Str("batch_name", name).
	Int("attempt", attempt).
	Int("max_attempts", maxAttempts).
	Msg("cnpg batch failed")

Learn more about managing compliance generic rules or creating your own custom rules

  • Update
Compliance status legend 🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
- Requires Further Human Verification
🏷️ - Compliance label
Imported GitHub PR comment. Original author: @qodo-code-review[bot] Original URL: https://github.com/carverauto/serviceradar/pull/2088#issuecomment-3628795589 Original created: 2025-12-08T20:05:06Z --- ## PR Compliance Guide 🔍 <!-- https://github.com/carverauto/serviceradar/commit/c2cfe9842e6f0d97e80f57a53b0918808c4c46a2 --> Below is a summary of compliance checks for this PR:<br> <table><tbody><tr><td colspan='2'><strong>Security Compliance</strong></td></tr> <tr><td>🟢</td><td><details><summary><strong>No security concerns identified</strong></summary> No security vulnerabilities detected by AI analysis. Human verification advised for critical code. </details></td></tr> <tr><td colspan='2'><strong>Ticket Compliance</strong></td></tr> <tr><td>⚪</td><td><details><summary>🎫 <strong>No ticket provided </strong></summary> - [ ] Create ticket/issue <!-- /create_ticket --create_ticket=true --> </details></td></tr> <tr><td colspan='2'><strong>Codebase Duplication Compliance</strong></td></tr> <tr><td>⚪</td><td><details><summary><strong>Codebase context is not defined </strong></summary> Follow the <a href='https://qodo-merge-docs.qodo.ai/core-abilities/rag_context_enrichment/'>guide</a> to enable codebase context checks. </details></td></tr> <tr><td colspan='2'><strong>Custom Compliance</strong></td></tr> <tr><td rowspan=3>🟢</td><td> <details><summary><strong>Generic: Meaningful Naming and Self-Documenting Code</strong></summary><br> **Objective:** Ensure all identifiers clearly express their purpose and intent, making code <br>self-documenting<br> **Status:** Passed<br> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td> <details><summary><strong>Generic: Robust Error Handling and Edge Case Management</strong></summary><br> **Objective:** Ensure comprehensive error handling that provides meaningful context and graceful <br>degradation<br> **Status:** Passed<br> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td> <details><summary><strong>Generic: Security-First Input Validation and Data Handling</strong></summary><br> **Objective:** Ensure all data inputs are validated, sanitized, and handled securely to prevent <br>vulnerabilities<br> **Status:** Passed<br> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td rowspan=3>⚪</td> <td><details> <summary><strong>Generic: Comprehensive Audit Trails</strong></summary><br> **Objective:** To create a detailed and reliable record of critical system actions for security analysis <br>and compliance.<br> **Status:** <br><a href='https://github.com/carverauto/serviceradar/pull/2088/files#diff-9b08ac708a4a4f6e40340f1f6005e43d579b76f88f0987d10a5984129bcc35fbR107-R161'><strong>Missing audit</strong></a>: New retry/deadlock handling performs critical DB actions without adding explicit audit <br>logging of action, actor, and outcome, and it&#x27;s unclear if higher layers provide <br>required audit trails.<br> <details open><summary>Referred Code</summary> ```go // sendCNPGWithRetry sends a batch with automatic retry for transient errors. // This should be used for device-related batch operations that may encounter deadlocks. func (db *DB) sendCNPGWithRetry(ctx context.Context, batch *pgx.Batch, name string) error { maxAttempts := getCNPGMaxRetryAttempts() var lastErr error for attempt := 1; attempt <= maxAttempts; attempt++ { if ctx.Err() != nil { return ctx.Err() } err := db.sendCNPGBatch(ctx, batch, name) if err == nil { if attempt > 1 { // Record successful retry recordCNPGRetrySuccess(name) } return nil } lastErr = err ... (clipped 34 lines) ``` </details> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td><details> <summary><strong>Generic: Secure Error Handling</strong></summary><br> **Objective:** To prevent the leakage of sensitive system information through error messages while <br>providing sufficient detail for internal debugging.<br> **Status:** <br><a href='https://github.com/carverauto/serviceradar/pull/2088/files#diff-9b08ac708a4a4f6e40340f1f6005e43d579b76f88f0987d10a5984129bcc35fbR141-R160'><strong>Error detail exposure</strong></a>: Logging includes raw database errors which may contain internal details, and it is unclear <br>whether these logs are user-facing or restricted to internal logs.<br> <details open><summary>Referred Code</summary> ```go db.logger.Warn(). Err(err). Str("sqlstate", code). Str("batch_name", name). Int("attempt", attempt). Int("max_attempts", maxAttempts). Dur("backoff", delay). Msg("cnpg transient error, retrying") time.Sleep(delay) continue } // Non-transient error or max attempts reached db.logger.Error(). Err(err). Str("sqlstate", code). Str("batch_name", name). Int("attempt", attempt). Int("max_attempts", maxAttempts). Msg("cnpg batch failed") ``` </details> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td><details> <summary><strong>Generic: Secure Logging Practices</strong></summary><br> **Objective:** To ensure logs are useful for debugging and auditing without exposing sensitive <br>information like PII, PHI, or cardholder data.<br> **Status:** <br><a href='https://github.com/carverauto/serviceradar/pull/2088/files#diff-9b08ac708a4a4f6e40340f1f6005e43d579b76f88f0987d10a5984129bcc35fbR141-R160'><strong>Potential sensitive logs</strong></a>: The code logs database errors and batch names which could include sensitive context <br>depending on logger configuration, and there is no explicit scrubbing shown in the diff.<br> <details open><summary>Referred Code</summary> ```go db.logger.Warn(). Err(err). Str("sqlstate", code). Str("batch_name", name). Int("attempt", attempt). Int("max_attempts", maxAttempts). Dur("backoff", delay). Msg("cnpg transient error, retrying") time.Sleep(delay) continue } // Non-transient error or max attempts reached db.logger.Error(). Err(err). Str("sqlstate", code). Str("batch_name", name). Int("attempt", attempt). Int("max_attempts", maxAttempts). Msg("cnpg batch failed") ``` </details> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td align="center" colspan="2"> - [ ] Update <!-- /compliance --update_compliance=true --> </td></tr></tbody></table> <details><summary>Compliance status legend</summary> 🟢 - Fully Compliant<br> 🟡 - Partial Compliant<br> 🔴 - Not Compliant<br> ⚪ - Requires Further Human Verification<br> 🏷️ - Compliance label<br> </details>
qodo-code-review[bot] commented 2025-12-08 20:06:16 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2088#issuecomment-3628799539
Original created: 2025-12-08T20:06:16Z

PR Code Suggestions

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Make sleep context-aware to prevent leaks

Replace time.Sleep(delay) with a context-aware wait using time.NewTimer and a
select statement to ensure the retry loop respects context cancellation and
avoids resource leaks.

pkg/db/cnpg_device_updates_retry.go [138-151]

 if transient && attempt < maxAttempts {
     recordCNPGRetry(name)
     delay := cnpgBackoffDelay(attempt, code)
     db.logger.Warn().
         Err(err).
         Str("sqlstate", code).
         Str("batch_name", name).
         Int("attempt", attempt).
         Int("max_attempts", maxAttempts).
         Dur("backoff", delay).
         Msg("cnpg transient error, retrying")
-    time.Sleep(delay)
+
+    timer := time.NewTimer(delay)
+    select {
+    case <-timer.C:
+        // continue to next attempt
+    case <-ctx.Done():
+        timer.Stop()
+        return ctx.Err()
+    }
     continue
 }
  • Apply / Chat
Suggestion importance[1-10]: 8

__

Why: This is a crucial improvement for a concurrent system, as it makes the retry delay context-aware, preventing goroutine leaks and ensuring timely termination when a request is canceled.

Medium
Use a proper random number generator

Replace the unreliable jitter calculation based on time.Now().UnixNano() with
the more robust math/rand.Int63n to ensure proper pseudo-randomness and prevent
correlated retry delays.

pkg/db/cnpg_device_updates_retry.go [101-104]

 // Add randomized jitter (0-100% of base) to avoid lockstep retries.
 jitterMax := int64(baseBackoff)
-jitterNanos := time.Now().UnixNano() % jitterMax
+// Use a proper random number generator for jitter.
+// Note: for Go versions < 1.20, rand must be seeded at application start.
+// e.g. rand.Seed(time.Now().UnixNano())
+jitterNanos := rand.Int63n(jitterMax)
 return backoff + time.Duration(jitterNanos)
  • Apply / Chat
Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies that using time.Now().UnixNano() for jitter is an anti-pattern and proposes the standard math/rand library, which improves the robustness of the retry mechanism.

Medium
General
Improve jitter test to be more robust

Improve the jitter test by removing the fragile time.Sleep and instead verifying
that the generated delay falls within the expected [base, base + jitter] range,
making the test more robust.

pkg/db/cnpg_device_updates_retry_test.go [150-163]

 func TestCNPGBackoffDelay_IncludesJitter(t *testing.T) {
 	// Run multiple times and verify we get different values (jitter is working)
-	delays := make(map[int64]struct{})
+	baseBackoff := time.Duration(getCNPGDeadlockBackoffMs()) * time.Millisecond
+	maxDelay := baseBackoff * 2 // Base + 100% jitter
+
+	delays := make(map[time.Duration]bool)
 	for i := 0; i < 10; i++ {
 		delay := cnpgBackoffDelay(1, sqlstateDeadlockDetected)
-		delays[delay.Nanoseconds()] = struct{}{}
-		time.Sleep(time.Nanosecond) // Ensure different timestamps
+		assert.GreaterOrEqual(t, delay, baseBackoff)
+		assert.Less(t, delay, maxDelay)
+		delays[delay] = true
 	}
 
-	// With jitter, we should see multiple unique delay values
+	// With jitter, we should see multiple unique delay values.
 	// This test may occasionally fail if all 10 happen to hit the same jitter
-	// but that's extremely unlikely
+	// but that's extremely unlikely.
 	assert.Greater(t, len(delays), 1, "Expected jitter to produce varied delays")
 }
  • Apply / Chat
Suggestion importance[1-10]: 6

__

Why: The suggestion correctly points out a fragility in the test and proposes a more robust way to validate the jitter logic by checking if the delay is within the expected range, making the test more reliable.

Low
  • Update
Imported GitHub PR comment. Original author: @qodo-code-review[bot] Original URL: https://github.com/carverauto/serviceradar/pull/2088#issuecomment-3628799539 Original created: 2025-12-08T20:06:16Z --- ## PR Code Suggestions ✨ <!-- c2cfe98 --> Explore these optional code suggestions: <table><thead><tr><td><strong>Category</strong></td><td align=left><strong>Suggestion&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </strong></td><td align=center><strong>Impact</strong></td></tr><tbody><tr><td rowspan=2>Possible issue</td> <td> <details><summary>Make sleep context-aware to prevent leaks</summary> ___ **Replace <code>time.Sleep(delay)</code> with a context-aware wait using <code>time.NewTimer</code> and a <br><code>select</code> statement to ensure the retry loop respects context cancellation and <br>avoids resource leaks.** [pkg/db/cnpg_device_updates_retry.go [138-151]](https://github.com/carverauto/serviceradar/pull/2088/files#diff-9b08ac708a4a4f6e40340f1f6005e43d579b76f88f0987d10a5984129bcc35fbR138-R151) ```diff if transient && attempt < maxAttempts { recordCNPGRetry(name) delay := cnpgBackoffDelay(attempt, code) db.logger.Warn(). Err(err). Str("sqlstate", code). Str("batch_name", name). Int("attempt", attempt). Int("max_attempts", maxAttempts). Dur("backoff", delay). Msg("cnpg transient error, retrying") - time.Sleep(delay) + + timer := time.NewTimer(delay) + select { + case <-timer.C: + // continue to next attempt + case <-ctx.Done(): + timer.Stop() + return ctx.Err() + } continue } ``` - [ ] **Apply / Chat** <!-- /improve --apply_suggestion=0 --> <details><summary>Suggestion importance[1-10]: 8</summary> __ Why: This is a crucial improvement for a concurrent system, as it makes the retry delay context-aware, preventing goroutine leaks and ensuring timely termination when a request is canceled. </details></details></td><td align=center>Medium </td></tr><tr><td> <details><summary>Use a proper random number generator</summary> ___ **Replace the unreliable jitter calculation based on <code>time.Now().UnixNano()</code> with <br>the more robust <code>math/rand.Int63n</code> to ensure proper pseudo-randomness and prevent <br>correlated retry delays.** [pkg/db/cnpg_device_updates_retry.go [101-104]](https://github.com/carverauto/serviceradar/pull/2088/files#diff-9b08ac708a4a4f6e40340f1f6005e43d579b76f88f0987d10a5984129bcc35fbR101-R104) ```diff // Add randomized jitter (0-100% of base) to avoid lockstep retries. jitterMax := int64(baseBackoff) -jitterNanos := time.Now().UnixNano() % jitterMax +// Use a proper random number generator for jitter. +// Note: for Go versions < 1.20, rand must be seeded at application start. +// e.g. rand.Seed(time.Now().UnixNano()) +jitterNanos := rand.Int63n(jitterMax) return backoff + time.Duration(jitterNanos) ``` - [ ] **Apply / Chat** <!-- /improve --apply_suggestion=1 --> <details><summary>Suggestion importance[1-10]: 7</summary> __ Why: The suggestion correctly identifies that using `time.Now().UnixNano()` for jitter is an anti-pattern and proposes the standard `math/rand` library, which improves the robustness of the retry mechanism. </details></details></td><td align=center>Medium </td></tr><tr><td rowspan=1>General</td> <td> <details><summary>Improve jitter test to be more robust</summary> ___ **Improve the jitter test by removing the fragile <code>time.Sleep</code> and instead verifying <br>that the generated delay falls within the expected <code>[base, base + jitter]</code> range, <br>making the test more robust.** [pkg/db/cnpg_device_updates_retry_test.go [150-163]](https://github.com/carverauto/serviceradar/pull/2088/files#diff-58c50b725d3c57e8d5193bb5dd1557d537f747f14e492b1a26a148f7cd973620R150-R163) ```diff func TestCNPGBackoffDelay_IncludesJitter(t *testing.T) { // Run multiple times and verify we get different values (jitter is working) - delays := make(map[int64]struct{}) + baseBackoff := time.Duration(getCNPGDeadlockBackoffMs()) * time.Millisecond + maxDelay := baseBackoff * 2 // Base + 100% jitter + + delays := make(map[time.Duration]bool) for i := 0; i < 10; i++ { delay := cnpgBackoffDelay(1, sqlstateDeadlockDetected) - delays[delay.Nanoseconds()] = struct{}{} - time.Sleep(time.Nanosecond) // Ensure different timestamps + assert.GreaterOrEqual(t, delay, baseBackoff) + assert.Less(t, delay, maxDelay) + delays[delay] = true } - // With jitter, we should see multiple unique delay values + // With jitter, we should see multiple unique delay values. // This test may occasionally fail if all 10 happen to hit the same jitter - // but that's extremely unlikely + // but that's extremely unlikely. assert.Greater(t, len(delays), 1, "Expected jitter to produce varied delays") } ``` - [ ] **Apply / Chat** <!-- /improve --apply_suggestion=2 --> <details><summary>Suggestion importance[1-10]: 6</summary> __ Why: The suggestion correctly points out a fragility in the test and proposes a more robust way to validate the jitter logic by checking if the delay is within the expected range, making the test more reliable. </details></details></td><td align=center>Low </td></tr> <tr><td align="center" colspan="2"> - [ ] Update <!-- /improve_multi --more_suggestions=true --> </td><td></td></tr></tbody></table>
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
carverauto/serviceradar!2530
No description provided.