deadlock fixes #2530
carverauto/serviceradar!2530
Imported from GitHub pull request.
Original GitHub pull request: #2088
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/pull/2088
Original created: 2025-12-08T20:04:29Z
Original updated: 2025-12-08T20:52:53Z
Original head: carverauto/serviceradar:2087-bugcore-cnpg-deadlock-on-device_updates
Original base: main
Original merged: 2025-12-08T20:52:49Z by @mfreeman451
User description
IMPORTANT: Please sign the Developer Certificate of Origin
Thank you for your contribution to ServiceRadar. Please note, when contributing, the developer must include
a DCO sign-off statement indicating the DCO acceptance in one commit message. Here
is an example DCO Signed-off-by line in a commit message:
Describe your changes
Issue ticket number and link
Code checklist before requesting a review
PR Type
Bug fix, Enhancement
Description
Add mutex-based serialization for CNPG device update batch writes to eliminate deadlocks
Implement transient error classification and automatic retry with exponential backoff for PostgreSQL errors
Add metrics tracking for deadlock occurrences, retries, and successful retry completions
Extend serialization to device identifiers and network sightings operations sharing related data
Diagram Walkthrough
File Walkthrough
3 files
Add atomic counters for deadlock and retry metrics
Implement transient error classification and retry logic
Add deviceUpdatesMu mutex field and initialize in New
1 file
Add comprehensive unit tests for retry and backoff logic
3 files
Wrap device updates batch with mutex and retry wrapper
Serialize device identifier writes with mutex protection
Serialize network sightings writes with mutex protection
4 files
Document deadlock scenario, decisions, and migration plan
Describe problem statement and proposed solution approach
Define modified and new requirements for serialization
Track implementation tasks and verification steps
Imported GitHub PR comment.
Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2088#issuecomment-3628795589
Original created: 2025-12-08T20:05:06Z
PR Compliance Guide 🔍
Below is a summary of compliance checks for this PR:
No security concerns identified
No security vulnerabilities detected by AI analysis. Human verification advised for critical code.
🎫 No ticket provided
Codebase context is not defined
Follow the guide to enable codebase context checks.
Generic: Meaningful Naming and Self-Documenting Code
Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting
Status: Passed
Generic: Robust Error Handling and Edge Case Management
Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation
Status: Passed
Generic: Security-First Input Validation and Data Handling
Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities
Status: Passed
Generic: Comprehensive Audit Trails
Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.
Status:
Missing audit: New retry/deadlock handling performs critical DB actions without adding explicit audit
logging of action, actor, and outcome, and it's unclear if higher layers provide
required audit trails.
Referred Code
Generic: Secure Error Handling
Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.
Status:
Error detail exposure: Logging includes raw database errors which may contain internal details, and it is unclear
whether these logs are user-facing or restricted to internal logs.
Referred Code
Generic: Secure Logging Practices
Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.
Status:
Potential sensitive logs: The code logs database errors and batch names which could include sensitive context
depending on logger configuration, and there is no explicit scrubbing shown in the diff.
Referred Code
Compliance status legend
🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label
Imported GitHub PR comment.
Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2088#issuecomment-3628799539
Original created: 2025-12-08T20:06:16Z
PR Code Suggestions ✨
Explore these optional code suggestions:
Make sleep context-aware to prevent leaks
Replace time.Sleep(delay) with a context-aware wait using time.NewTimer and a select statement to ensure the retry loop respects context cancellation and avoids resource leaks.
pkg/db/cnpg_device_updates_retry.go [138-151]
Suggestion importance[1-10]: 8
Why: This is a crucial improvement for a concurrent system, as it makes the retry delay context-aware, preventing goroutine leaks and ensuring timely termination when a request is canceled.
Use a proper random number generator
Replace the unreliable jitter calculation based on time.Now().UnixNano() with the more robust math/rand.Int63n to ensure proper pseudo-randomness and prevent correlated retry delays.
pkg/db/cnpg_device_updates_retry.go [101-104]
Suggestion importance[1-10]: 7
Why: The suggestion correctly identifies that using time.Now().UnixNano() for jitter is an anti-pattern and proposes the standard math/rand library, which improves the robustness of the retry mechanism.
Improve jitter test to be more robust
Improve the jitter test by removing the fragile time.Sleep and instead verifying that the generated delay falls within the expected [base, base + jitter] range, making the test more robust.
pkg/db/cnpg_device_updates_retry_test.go [150-163]
Suggestion importance[1-10]: 6
Why: The suggestion correctly points out a fragility in the test and proposes a more robust way to validate the jitter logic by checking if the delay is within the expected range, making the test more reliable.