Bug/core restarts demo #2510
Reference
carverauto/serviceradar!2510
Imported from GitHub pull request.
Original GitHub pull request: #2063
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/pull/2063
Original created: 2025-12-05T00:46:36Z
Original updated: 2025-12-05T00:54:23Z
Original head: carverauto/serviceradar:bug/core_restarts_demo
Original base: main
Original merged: 2025-12-05T00:54:18Z by @mfreeman451
User description
IMPORTANT: Please sign the Developer Certificate of Origin
Thank you for your contribution to ServiceRadar. Please note, when contributing, the developer must include
a DCO sign-off statement indicating the DCO acceptance in one commit message. Here
is an example DCO Signed-off-by line in a commit message:
Describe your changes
Issue ticket number and link
Code checklist before requesting a review
PR Type
Bug fix, Enhancement
Description
Prevent OOM crashes by implementing memory-aware backpressure in AGE graph writer
Increase default workers from 1 to 4 and reduce queue size from 512 to 256
Add comprehensive metrics for dropped batches, heap memory, and circuit state
Document requirements and implementation tasks for OOM fix
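The memory-aware backpressure described above can be sketched roughly as follows. This is a minimal illustration, not the code from pkg/registry/age_graph_writer.go: the writer type, newWriter, heapInUse, both sentinel errors, and the 512 MiB limit are hypothetical names and values chosen for the example. Only the queue size of 256 comes from the description.

```go
package main

import (
	"errors"
	"fmt"
	"runtime"
	"sync/atomic"
)

// Hypothetical sentinel errors; the PR's real error values are not shown in this page.
var (
	ErrMemoryPressure = errors.New("graph writer: heap above limit, batch rejected")
	ErrQueueFull      = errors.New("graph writer: queue full, batch dropped")
)

// heapInUse reports the bytes of allocated heap objects.
// It is a variable so that a test can substitute a fixed value.
var heapInUse = func() uint64 {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapAlloc
}

// writer is an illustrative stand-in for the AGE graph writer.
type writer struct {
	queue        chan []byte
	maxHeapBytes uint64        // 0 disables the memory check
	dropped      atomic.Uint64 // counts rejected batches, for metrics
}

func newWriter(queueSize int, maxHeapBytes uint64) *writer {
	return &writer{queue: make(chan []byte, queueSize), maxHeapBytes: maxHeapBytes}
}

// enqueue rejects early instead of blocking, so a slow database
// cannot cause an unbounded backlog of pending batches in memory.
func (w *writer) enqueue(batch []byte) error {
	if w.maxHeapBytes > 0 && heapInUse() > w.maxHeapBytes {
		w.dropped.Add(1)
		return ErrMemoryPressure
	}
	select {
	case w.queue <- batch:
		return nil
	default:
		w.dropped.Add(1)
		return ErrQueueFull
	}
}

func main() {
	w := newWriter(256, 512<<20) // queue of 256, assumed 512 MiB heap limit
	err := w.enqueue([]byte("MERGE ..."))
	fmt.Println("enqueue error:", err, "dropped:", w.dropped.Load())
}
```

Callers can distinguish the two rejection reasons with errors.Is and decide whether to retry later or discard the batch.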
Diagram Walkthrough
File Walkthrough
age_graph_metrics.go
Add memory and circuit breaker metrics to AGE graph writer
pkg/registry/age_graph_metrics.go
and circuit state
circuit metrics
state
age_graph_writer.go
Implement memory backpressure and circuit breaker in AGE writer
pkg/registry/age_graph_writer.go
parameters
heap exceeds limit
failures and half-opens after timeout
open conditions
proposal.md
Document AGE graph OOM root cause and mitigation strategy
openspec/changes/fix-core-oom-age-graph-backpressure/proposal.md
queue backlog
during long timeouts
implement circuit breaker
spec.md
Add AGE graph memory backpressure and circuit breaker requirements
openspec/changes/fix-core-oom-age-graph-backpressure/specs/device-relationship-graph/spec.md
threshold
rate
footprint
states
recovery
tasks.md
Track completion of AGE graph OOM fix implementation tasks
openspec/changes/fix-core-oom-age-graph-backpressure/tasks.md
memory metrics
implementation
transitions
validation
Imported GitHub PR comment.
Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2063#issuecomment-3614879303
Original created: 2025-12-05T00:47:11Z
PR Compliance Guide 🔍
Below is a summary of compliance checks for this PR:
No security concerns identified
No security vulnerabilities detected by AI analysis. Human verification advised for critical code.
🎫 No ticket provided
Codebase context is not defined
Follow the guide to enable codebase context checks.
Generic: Comprehensive Audit Trails
Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.
Status: Passed
Generic: Meaningful Naming and Self-Documenting Code
Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting
Status: Passed
Generic: Secure Error Handling
Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.
Status: Passed
Generic: Secure Logging Practices
Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.
Status: Passed
Generic: Robust Error Handling and Edge Case Management
Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation
Status:
Rejection handling: New early-rejection paths for memory pressure and circuit-open return errors without
visible upstream handling in this diff, which may drop critical work without retry or
alternative path.
Referred Code
Generic: Security-First Input Validation and Data Handling
Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities
Status:
Runtime metrics: New memory-based gating uses runtime heap stats without validating external inputs for
'kind' or 'payload' in the enqueue path visible here; validation may
exist elsewhere but is not shown.
Referred Code
Compliance status legend
🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label
Imported GitHub PR comment.
Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2063#issuecomment-3614881618
Original created: 2025-12-05T00:48:25Z
PR Code Suggestions ✨
Explore these optional code suggestions:
Fix circuit breaker race condition
Fix a race condition in the circuit breaker's half-open state transition by
using an atomic compare-and-swap operation, ensuring only a single request can
pass through to test for recovery.
pkg/registry/age_graph_writer.go [1017-1047]
Suggestion importance[1-10]: 9
Why: The suggestion correctly identifies a race condition in the circuit breaker logic that allows multiple requests in the half-open state, defeating its purpose, and proposes a valid atomic fix.
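One way the compare-and-swap idea could look is sketched below. The referenced lines of age_graph_writer.go are not shown in this page, so every identifier here (breaker, newBreaker, trip, allow, countProbes, the state constants) is hypothetical. The point is only that the open to half-open transition is a single atomic step, so exactly one caller becomes the recovery probe.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// Circuit states (illustrative values, not taken from the PR).
const (
	stateClosed int32 = iota
	stateOpen
	stateHalfOpen
)

type breaker struct {
	state      atomic.Int32
	openedAt   atomic.Int64 // UnixNano timestamp of the last trip
	resetAfter time.Duration
}

func newBreaker(resetAfter time.Duration) *breaker {
	return &breaker{resetAfter: resetAfter}
}

// trip opens the circuit and records when it happened.
func (b *breaker) trip() {
	b.openedAt.Store(time.Now().UnixNano())
	b.state.Store(stateOpen)
}

// allow reports whether the caller may proceed.
func (b *breaker) allow() bool {
	switch b.state.Load() {
	case stateClosed:
		return true
	case stateOpen:
		if time.Now().UnixNano()-b.openedAt.Load() < int64(b.resetAfter) {
			return false // still inside the open timeout
		}
		// Only the goroutine that wins this CAS becomes the probe.
		return b.state.CompareAndSwap(stateOpen, stateHalfOpen)
	default:
		return false // half-open: a probe is already in flight
	}
}

// countProbes trips a breaker whose timeout has already elapsed and
// returns how many of n concurrent callers are admitted.
func countProbes(n int) int {
	b := newBreaker(0)
	b.trip()
	var admitted atomic.Int32
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if b.allow() {
				admitted.Add(1)
			}
		}()
	}
	wg.Wait()
	return int(admitted.Load())
}

func main() {
	fmt.Println("admitted probes:", countProbes(64))
}
```

A complete breaker would also need the probe's outcome to move the state back to closed or open; that part is omitted here.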
Remove unnecessary lock in hot path
Improve performance by replacing the mutex in the resetCircuit function with
more efficient atomic operations, as it is on a hot path called by every
successful request.
pkg/registry/age_graph_writer.go [1070-1087]
Suggestion importance[1-10]: 7
Why: The suggestion correctly points out that using a mutex on a hot path for every successful request is inefficient and proposes a more performant solution using atomic operations, which is a valid optimization.
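A lock-free reset might look roughly like the sketch below. Only the name resetCircuit comes from the review; the circuit type, its fields, recordFailure, and the threshold are assumptions made for illustration. The fast path performs two atomic loads and returns without writing when the circuit is already healthy.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

const (
	circuitClosed int32 = iota
	circuitOpen
)

// circuit is a hypothetical stand-in for the writer's breaker state.
type circuit struct {
	state    atomic.Int32
	failures atomic.Int32
}

// recordFailure opens the circuit once the failure count reaches threshold.
func (c *circuit) recordFailure(threshold int32) {
	if c.failures.Add(1) >= threshold {
		c.state.Store(circuitOpen)
	}
}

// resetCircuit is called after every successful request.
// In the common case nothing has failed, so it only reads.
func (c *circuit) resetCircuit() {
	if c.failures.Load() == 0 && c.state.Load() == circuitClosed {
		return // fast path: no lock, no write
	}
	c.failures.Store(0)
	c.state.Store(circuitClosed)
}

func main() {
	var c circuit
	for i := 0; i < 3; i++ {
		c.recordFailure(3)
	}
	fmt.Println("open after failures:", c.state.Load() == circuitOpen)
	c.resetCircuit()
	fmt.Println("closed after reset:", c.state.Load() == circuitClosed)
}
```

Because the two fields are updated separately, a failure that races with a reset can interleave between the stores. Whether that is acceptable depends on how strictly the real implementation needs the counter and the state to agree.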
Avoid expensive call on hot path
Optimize the memory pressure check by reading memory stats in a background
goroutine and caching the result, avoiding frequent and potentially expensive
calls to runtime.ReadMemStats() on the hot path.
pkg/registry/age_graph_writer.go [800-815]
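A rough sketch of the caching approach follows. The heapSampler type, startHeapSampler, overLimit, the 100 ms interval, and the 512 MiB limit are hypothetical; runtime.ReadMemStats and the HeapAlloc field are the only parts taken from the Go runtime itself. The hot path reads one atomic value, while the expensive call runs on a ticker.

```go
package main

import (
	"context"
	"fmt"
	"runtime"
	"sync/atomic"
	"time"
)

// heapSampler caches the most recent heap reading.
type heapSampler struct {
	heapBytes atomic.Uint64
}

// sample performs the expensive runtime call and stores the result.
func (s *heapSampler) sample() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	s.heapBytes.Store(m.HeapAlloc)
}

// overLimit is the cheap hot-path check: a single atomic load.
func (s *heapSampler) overLimit(limit uint64) bool {
	return s.heapBytes.Load() > limit
}

// startHeapSampler takes one reading immediately, then refreshes it
// every interval until ctx is cancelled. interval must be positive.
func startHeapSampler(ctx context.Context, interval time.Duration) *heapSampler {
	s := &heapSampler{}
	s.sample()
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				s.sample()
			}
		}
	}()
	return s
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	s := startHeapSampler(ctx, 100*time.Millisecond)
	fmt.Println("over assumed 512 MiB limit:", s.overLimit(512<<20))
}
```

The trade-off is staleness: the cached value can lag the real heap size by up to one interval, so the limit should leave headroom for what can be allocated in that window.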
Suggestion importance[1-10]: 6
Why: The suggestion correctly identifies a potential performance bottleneck from calling runtime.ReadMemStats() on a hot path and proposes a valid optimization using a background goroutine to cache the value.
Consider a simpler circuit breaker
Replace the manual circuit breaker implementation, which uses mutexes and atomic
operations for state management, with a standard third-party library. This would
simplify the code and reduce the risk of concurrency issues.
Examples:
pkg/registry/age_graph_writer.go [1017-1087]
Solution Walkthrough:
Before:
After:
Suggestion importance[1-10]: 6
Why: The suggestion correctly identifies a complex, manual implementation of a standard pattern and proposes a valid alternative that would improve code simplicity, robustness, and maintainability.