fix(core): serialize AGE graph writes to prevent deadlocks (#2058) #2511
Imported from GitHub pull request.
Original GitHub pull request: #2064
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/pull/2064
Original created: 2025-12-05T02:40:45Z
Original updated: 2025-12-05T02:53:33Z
Original head: carverauto/serviceradar:2058-bugcore-age-graph-merge-failed
Original base: main
Original merged: 2025-12-05T02:53:30Z by @mfreeman451
User description
🤖 Generated with Claude Code
PR Type
Bug fix, Enhancement
Description
Serialize AGE graph MERGE operations with mutex to eliminate concurrent write deadlocks
Classify deadlock (40P01) and serialization failure (40001) as transient errors with retry
Implement longer backoff (500ms) for deadlock errors with exponential growth and jitter
Add deadlock-specific metrics for monitoring contention issues
File Walkthrough
age_graph_metrics.go
Add deadlock and serialization failure metrics
pkg/registry/age_graph_metrics.go
and transient retry tracking
metrics
function
failure, and transient retry events
age_graph_writer.go
Serialize writes and handle deadlock errors with retry
pkg/registry/age_graph_writer.go
prevent concurrent write deadlocks
on deadlock errors
deadlock backoff
40001)
serialization failure (40001) as transient errors with string fallback
patterns
exponential backoff with longer base for deadlocks
writes
occur
proposal.md
Proposal for AGE graph deadlock handling fix
openspec/changes/fix-age-graph-deadlock-handling/proposal.md
executing MERGE queries causing lock contention
MERGE operations
concurrent write contention
serialization failures
growth for deadlocks
frequency
improvement
spec.md
Add AGE graph deadlock handling requirements and scenarios
openspec/changes/fix-age-graph-deadlock-handling/specs/device-relationship-graph/spec.md
deadlocks
transient errors with retry
deadlock errors
metrics
40001) in transient error handling
and metric monitoring
tasks.md
Task checklist for deadlock handling implementation
openspec/changes/fix-age-graph-deadlock-handling/tasks.md
implementation
with SQLSTATE constants
exponential growth
namespace
Imported GitHub PR comment.
Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2064#issuecomment-3615081191
Original created: 2025-12-05T02:41:24Z
PR Compliance Guide 🔍
Below is a summary of compliance checks for this PR:
No security concerns identified
No security vulnerabilities detected by AI analysis. Human verification advised for critical code.
🎫 #2058
instead of failing batches.
conflicts.
and XX000 transient merge failures.
Codebase context is not defined
Follow the guide to enable codebase context checks.
Generic: Meaningful Naming and Self-Documenting Code
Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting
Status: Passed
Generic: Robust Error Handling and Edge Case Management
Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation
Status: Passed
Generic: Comprehensive Audit Trails
Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.
Status:
Action logging: New critical write operations are serialized and retried but no new audit logs were added
to record user/context, action, and outcome for these write attempts.
Generic: Secure Error Handling
Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.
Status:
Error exposure: Warning logs include raw database errors and SQLSTATE codes which may be propagated to
user-facing channels elsewhere; verify these logs are not exposed to end users.
Generic: Secure Logging Practices
Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.
Status:
Log content risk: The warning log on transient retries logs the raw error via Err(err); confirm that error
strings cannot contain sensitive query parameters or payload data from req.payload.
Generic: Security-First Input Validation and Data Handling
Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities
Status:
Payload handling: The code executes queries with req.payload while adding retries and new paths; although
unchanged in this diff, confirm payload remains parameterized and not logged in new error
paths.
Compliance status legend
🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label
Imported GitHub PR comment.
Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2064#issuecomment-3615083064
Original created: 2025-12-05T02:42:30Z
PR Code Suggestions ✨
Explore these optional code suggestions:
Use defer to ensure panic-safe mutex unlock
Make the mutex unlock panic-safe by using a `defer` statement. This prevents a potential deadlock if `w.executor.ExecuteQuery` panics.
pkg/registry/age_graph_writer.go [872-879]
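A minimal sketch of the suggested pattern follows. The names `writeMu` and `executor.ExecuteQuery` appear in the review. The `writer` struct layout, the `queryExecutor` interface and its signature, and the `panickyExecutor` and `safeCall` helpers are assumptions made for illustration, not the code in `age_graph_writer.go`.

```go
package main

import (
	"fmt"
	"sync"
)

// queryExecutor is a hypothetical stand-in for the writer's executor.
type queryExecutor interface {
	ExecuteQuery(query string) error
}

// writer serializes all graph writes through writeMu.
type writer struct {
	writeMu  sync.Mutex
	executor queryExecutor
}

// executeSerialized holds writeMu for the duration of the query. Because
// the unlock is deferred, the mutex is released even if ExecuteQuery panics.
func (w *writer) executeSerialized(query string) error {
	w.writeMu.Lock()
	defer w.writeMu.Unlock()
	return w.executor.ExecuteQuery(query)
}

// panickyExecutor panics on its first call and succeeds afterwards.
type panickyExecutor struct{ calls int }

func (p *panickyExecutor) ExecuteQuery(string) error {
	p.calls++
	if p.calls == 1 {
		panic("simulated executor panic")
	}
	return nil
}

// safeCall converts a panic into an error so the demo can continue.
func safeCall(w *writer, query string) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	return w.executeSerialized(query)
}

func main() {
	w := &writer{executor: &panickyExecutor{}}
	fmt.Println("first call error:", safeCall(w, "MERGE ..."))
	// The second call can only proceed if the first released writeMu.
	fmt.Println("second call error:", safeCall(w, "MERGE ..."))
}
```

With an explicit `Unlock` after the query instead of `defer`, the second call in this demo would block forever, which is the failure mode the suggestion describes.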
Suggestion importance[1-10]: 9
Why: The suggestion correctly identifies a critical flaw where a panic inside the mutex-locked section would cause a permanent deadlock. The proposed fix using `defer` is the idiomatic and correct way to ensure the mutex is always unlocked, preventing a major service reliability issue.
Prevent potential division-by-zero panic in backoff
Prevent a potential division-by-zero panic in the `backoffDelay` function. Add a check to ensure `jitterMax` is greater than zero before the modulo operation.
pkg/registry/age_graph_writer.go [973-996]
Suggestion importance[1-10]: 6
Why: The suggestion correctly identifies a potential division-by-zero panic if `baseBackoff` is configured to be zero. While unlikely with current defaults, adding a guard makes the code more robust against configuration errors.
Consider if a single writer goroutine is simpler
Instead of using a mutex to serialize database access across multiple workers,
consider using a single, dedicated writer goroutine. This would simplify the
concurrency model by making the serialization explicit and removing the need for
a mutex.
Examples:
pkg/registry/age_graph_writer.go [75-872]
Solution Walkthrough:
Before:
After:
Suggestion importance[1-10]: 7
Why: This is a valid architectural suggestion that correctly identifies that the mutex serializes writes, and proposes a simpler, common pattern (a single writer goroutine) to achieve the same goal, which would remove the need for the `writeMu` mutex.