Bug/core restarts demo #2510

Merged
mfreeman451 merged 2 commits from refs/pull/2510/head into main 2025-12-05 00:54:18 +00:00
mfreeman451 commented 2025-12-05 00:46:36 +00:00 (Migrated from github.com)
Owner

Imported from GitHub pull request.

Original GitHub pull request: #2063
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/pull/2063
Original created: 2025-12-05T00:46:36Z
Original updated: 2025-12-05T00:54:23Z
Original head: carverauto/serviceradar:bug/core_restarts_demo
Original base: main
Original merged: 2025-12-05T00:54:18Z by @mfreeman451

User description

IMPORTANT: Please sign the Developer Certificate of Origin

Thank you for your contribution to ServiceRadar. Please note, when contributing, the developer must include
a DCO sign-off statement indicating the DCO acceptance in one commit message. Here
is an example DCO Signed-off-by line in a commit message:

Signed-off-by: J. Doe <j.doe@domain.com>

Describe your changes

Code checklist before requesting a review

  • I have signed the DCO?
  • The build completes without errors?
  • All tests are passing when running make test?

PR Type

Bug fix, Enhancement


Description

  • Prevent OOM crashes by implementing memory-aware backpressure in AGE graph writer

    • Add memory limit threshold to reject batches when heap exceeds configured limit
    • Implement circuit breaker pattern to disable writes after consecutive failures
  • Increase default workers from 1 to 4 and reduce queue size from 512 to 256

  • Add comprehensive metrics for dropped batches, heap memory, and circuit state

  • Document requirements and implementation tasks for OOM fix


Diagram Walkthrough

flowchart LR
  A["AGE Graph Writer"] --> B{"Check Circuit<br/>Breaker"}
  B -->|Open| C["Drop Batch<br/>Record Metric"]
  B -->|Closed| D{"Check Memory<br/>Pressure"}
  D -->|Exceeded| C
  D -->|OK| E["Enqueue to<br/>Work Queue"]
  E --> F["4 Worker<br/>Goroutines"]
  F --> G["Process Batch"]
  G --> H{"Success?"}
  H -->|Yes| I["Reset Circuit"]
  H -->|No| J["Increment<br/>Failures"]
  J --> K{"Threshold<br/>Reached?"}
  K -->|Yes| L["Open Circuit"]

File Walkthrough

Relevant files
Enhancement
age_graph_metrics.go
Add memory and circuit breaker metrics to AGE graph writer

pkg/registry/age_graph_metrics.go

  • Add runtime memory stats import to track heap allocation
  • Define 6 new metric constants for dropped batches, memory pressure,
    and circuit state
  • Add atomic counters for tracking dropped batches by reason
  • Implement 6 new observable gauges for backpressure, memory, and
    circuit metrics
  • Add helper functions to record dropped batches and manage circuit
    state
  • Integrate heap allocation and circuit state into metric callback
+130/-12
Bug fix
age_graph_writer.go
Implement memory backpressure and circuit breaker in AGE writer

pkg/registry/age_graph_writer.go

  • Increase default workers from 1 to 4 and queue size from 512 to 256
  • Add memory limit, circuit threshold, and reset time configuration
    parameters
  • Implement memory pressure check in enqueue() to reject batches when
    heap exceeds limit
  • Add circuit breaker state management with closed/open/half-open states
  • Implement circuit breaker logic that opens after N consecutive
    failures and half-opens after timeout
  • Record dropped batches for backpressure, memory pressure, and circuit
    open conditions
  • Reset circuit on successful batch processing
+152/-17
Documentation
proposal.md
Document AGE graph OOM root cause and mitigation strategy

openspec/changes/fix-core-oom-age-graph-backpressure/proposal.md

  • Document root cause of OOM crashes: single worker bottleneck with
    queue backlog
  • Explain memory accumulation from goroutines holding payload data
    during long timeouts
  • Describe retry amplification extending memory retention
  • Outline solution: increase workers, add memory-aware backpressure,
    implement circuit breaker
  • List affected specs and code files
+26/-0   
spec.md
Add AGE graph memory backpressure and circuit breaker requirements

openspec/changes/fix-core-oom-age-graph-backpressure/specs/device-relationship-graph/spec.md

  • Add requirement for memory-bounded queueing with early rejection at
    threshold
  • Define scenarios for high ingestion rate and memory pressure handling
  • Add requirement for parallel worker scaling to improve queue drain
    rate
  • Add requirement for payload chunking to limit per-request memory
    footprint
  • Add requirement for circuit breaker pattern with open/half-open/closed
    states
  • Define scenarios for circuit opening after failures and closing after
    recovery
+46/-0   
tasks.md
Track completion of AGE graph OOM fix implementation tasks

openspec/changes/fix-core-oom-age-graph-backpressure/tasks.md

  • Mark completed tasks for worker scaling, queue size reduction, and
    memory metrics
  • Mark completed tasks for memory limit configuration and backpressure
    implementation
  • Mark completed tasks for circuit breaker implementation with state
    transitions
  • Leave pending tasks for streaming payload processing and metric
    validation
+24/-0   

Imported from GitHub pull request. Original GitHub pull request: #2063 Original author: @mfreeman451 Original URL: https://github.com/carverauto/serviceradar/pull/2063 Original created: 2025-12-05T00:46:36Z Original updated: 2025-12-05T00:54:23Z Original head: carverauto/serviceradar:bug/core_restarts_demo Original base: main Original merged: 2025-12-05T00:54:18Z by @mfreeman451 --- ### **User description** ## IMPORTANT: Please sign the Developer Certificate of Origin Thank you for your contribution to ServiceRadar. Please note, when contributing, the developer must include a [DCO sign-off statement]( https://developercertificate.org/) indicating the DCO acceptance in one commit message. Here is an example DCO Signed-off-by line in a commit message: ``` Signed-off-by: J. Doe <j.doe@domain.com> ``` ## Describe your changes ## Issue ticket number and link ## Code checklist before requesting a review - [ ] I have signed the DCO? - [ ] The build completes without errors? - [ ] All tests are passing when running make test? ___ ### **PR Type** Bug fix, Enhancement ___ ### **Description** - Prevent OOM crashes by implementing memory-aware backpressure in AGE graph writer - Add memory limit threshold to reject batches when heap exceeds configured limit - Implement circuit breaker pattern to disable writes after consecutive failures - Increase default workers from 1 to 4 and reduce queue size from 512 to 256 - Add comprehensive metrics for dropped batches, heap memory, and circuit state - Document requirements and implementation tasks for OOM fix ___ ### Diagram Walkthrough ```mermaid flowchart LR A["AGE Graph Writer"] --> B{"Check Circuit<br/>Breaker"} B -->|Open| C["Drop Batch<br/>Record Metric"] B -->|Closed| D{"Check Memory<br/>Pressure"} D -->|Exceeded| C D -->|OK| E["Enqueue to<br/>Work Queue"] E --> F["4 Worker<br/>Goroutines"] F --> G["Process Batch"] G --> H{"Success?"} H -->|Yes| I["Reset Circuit"] H -->|No| J["Increment<br/>Failures"] J --> K{"Threshold<br/>Reached?"} K -->|Yes| L["Open Circuit"] ``` <details> <summary><h3> File Walkthrough</h3></summary> <table><thead><tr><th></th><th align="left">Relevant files</th></tr></thead><tbody><tr><td><strong>Enhancement</strong></td><td><table> <tr> <td> <details> <summary><strong>age_graph_metrics.go</strong><dd><code>Add memory and circuit breaker metrics to AGE graph writer</code></dd></summary> <hr> pkg/registry/age_graph_metrics.go <ul><li>Add runtime memory stats import to track heap allocation<br> <li> Define 6 new metric constants for dropped batches, memory pressure, <br>and circuit state<br> <li> Add atomic counters for tracking dropped batches by reason<br> <li> Implement 6 new observable gauges for backpressure, memory, and <br>circuit metrics<br> <li> Add helper functions to record dropped batches and manage circuit <br>state<br> <li> Integrate heap allocation and circuit state into metric callback</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2063/files#diff-2d400ecd96ff77c1379e4a8681763f9ba2dc22790e31a445b7e3ce716e8fbb15">+130/-12</a></td> </tr> </table></td></tr><tr><td><strong>Bug fix</strong></td><td><table> <tr> <td> <details> <summary><strong>age_graph_writer.go</strong><dd><code>Implement memory backpressure and circuit breaker in AGE writer</code></dd></summary> <hr> pkg/registry/age_graph_writer.go <ul><li>Increase default workers from 1 to 4 and queue size from 512 to 256<br> <li> Add memory limit, circuit threshold, and reset time configuration <br>parameters<br> <li> Implement memory pressure check in enqueue() to reject batches when <br>heap exceeds limit<br> <li> Add circuit breaker state management with closed/open/half-open states<br> <li> Implement circuit breaker logic that opens after N consecutive <br>failures and half-opens after timeout<br> <li> Record dropped batches for backpressure, memory pressure, and circuit <br>open conditions<br> <li> Reset circuit on successful batch processing</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2063/files#diff-2e6c0a5c048582201ffb983889ab9fab6fb1f8d64b91ee3fb65479b2fb2c6195">+152/-17</a></td> </tr> </table></td></tr><tr><td><strong>Documentation</strong></td><td><table> <tr> <td> <details> <summary><strong>proposal.md</strong><dd><code>Document AGE graph OOM root cause and mitigation strategy</code></dd></summary> <hr> openspec/changes/fix-core-oom-age-graph-backpressure/proposal.md <ul><li>Document root cause of OOM crashes: single worker bottleneck with <br>queue backlog<br> <li> Explain memory accumulation from goroutines holding payload data <br>during long timeouts<br> <li> Describe retry amplification extending memory retention<br> <li> Outline solution: increase workers, add memory-aware backpressure, <br>implement circuit breaker<br> <li> List affected specs and code files</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2063/files#diff-81ac6baf30e8d593daac5904fe8ee44954ea59b60cfebb932d6167039401ec70">+26/-0</a>&nbsp; &nbsp; </td> </tr> <tr> <td> <details> <summary><strong>spec.md</strong><dd><code>Add AGE graph memory backpressure and circuit breaker requirements</code></dd></summary> <hr> openspec/changes/fix-core-oom-age-graph-backpressure/specs/device-relationship-graph/spec.md <ul><li>Add requirement for memory-bounded queueing with early rejection at <br>threshold<br> <li> Define scenarios for high ingestion rate and memory pressure handling<br> <li> Add requirement for parallel worker scaling to improve queue drain <br>rate<br> <li> Add requirement for payload chunking to limit per-request memory <br>footprint<br> <li> Add requirement for circuit breaker pattern with open/half-open/closed <br>states<br> <li> Define scenarios for circuit opening after failures and closing after <br>recovery</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2063/files#diff-b476498724d1ccfae69ca3db18d65fc2a18e629803fe277531948545ca36b944">+46/-0</a>&nbsp; &nbsp; </td> </tr> <tr> <td> <details> <summary><strong>tasks.md</strong><dd><code>Track completion of AGE graph OOM fix implementation tasks</code></dd></summary> <hr> openspec/changes/fix-core-oom-age-graph-backpressure/tasks.md <ul><li>Mark completed tasks for worker scaling, queue size reduction, and <br>memory metrics<br> <li> Mark completed tasks for memory limit configuration and backpressure <br>implementation<br> <li> Mark completed tasks for circuit breaker implementation with state <br>transitions<br> <li> Leave pending tasks for streaming payload processing and metric <br>validation</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/2063/files#diff-e7a2e1f30c6f53270296c7e92cf5828c71a4a21dc4ea2a6c1b4d74da304e87ac">+24/-0</a>&nbsp; &nbsp; </td> </tr> </table></td></tr></tr></tbody></table> </details> ___
qodo-code-review[bot] commented 2025-12-05 00:47:11 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2063#issuecomment-3614879303
Original created: 2025-12-05T00:47:11Z

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
🟢
No security concerns identified No security vulnerabilities detected by AI analysis. Human verification advised for critical code.
Ticket Compliance
🎫 No ticket provided
  • Create ticket/issue
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢
Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status:
Rejection handling: New early-rejection paths for memory pressure and circuit-open return errors without
visible upstream handling in this diff, which may drop critical work without retry or
alternative path.

Referred Code
func (w *ageGraphWriter) enqueue(ctx context.Context, kind string, size int, query, payload string) error {
	if w == nil || w.executor == nil {
		return errAgeGraphWriterUnavailable
	}

	// Check circuit breaker state
	if w.isCircuitOpen() {
		recordAgeDroppedCircuitOpen()
		return errAgeGraphCircuitOpen
	}

	// Check memory pressure
	if w.memoryLimitBytes > 0 {
		heapBytes := currentHeapAllocBytes()
		if heapBytes > w.memoryLimitBytes {
			recordAgeDroppedMemoryPressure()
			if w.log != nil {
				w.log.Warn().
					Uint64("heap_bytes", heapBytes).
					Uint64("limit_bytes", w.memoryLimitBytes).
					Str("kind", kind).


 ... (clipped 6 lines)

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status:
Runtime metrics: New memory-based gating uses runtime heap stats without validating external inputs for
'kind' or 'payload' in the enqueue path visible here; validation may
exist elsewhere but is not shown.

Referred Code
if w.memoryLimitBytes > 0 {
	heapBytes := currentHeapAllocBytes()
	if heapBytes > w.memoryLimitBytes {
		recordAgeDroppedMemoryPressure()
		if w.log != nil {
			w.log.Warn().
				Uint64("heap_bytes", heapBytes).
				Uint64("limit_bytes", w.memoryLimitBytes).
				Str("kind", kind).
				Int("batch_size", size).
				Msg("age graph: rejected batch due to memory pressure")
		}
		return errAgeGraphMemoryPressure

Learn more about managing compliance generic rules or creating your own custom rules

Compliance status legend 🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
- Requires Further Human Verification
🏷️ - Compliance label
Imported GitHub PR comment. Original author: @qodo-code-review[bot] Original URL: https://github.com/carverauto/serviceradar/pull/2063#issuecomment-3614879303 Original created: 2025-12-05T00:47:11Z --- ## PR Compliance Guide 🔍 <!-- https://github.com/carverauto/serviceradar/commit/9f637975d7222add8afa15c3c373feb14e8f172e --> Below is a summary of compliance checks for this PR:<br> <table><tbody><tr><td colspan='2'><strong>Security Compliance</strong></td></tr> <tr><td>🟢</td><td><details><summary><strong>No security concerns identified</strong></summary> No security vulnerabilities detected by AI analysis. Human verification advised for critical code. </details></td></tr> <tr><td colspan='2'><strong>Ticket Compliance</strong></td></tr> <tr><td>⚪</td><td><details><summary>🎫 <strong>No ticket provided </strong></summary> - [ ] Create ticket/issue <!-- /create_ticket --create_ticket=true --> </details></td></tr> <tr><td colspan='2'><strong>Codebase Duplication Compliance</strong></td></tr> <tr><td>⚪</td><td><details><summary><strong>Codebase context is not defined </strong></summary> Follow the <a href='https://qodo-merge-docs.qodo.ai/core-abilities/rag_context_enrichment/'>guide</a> to enable codebase context checks. </details></td></tr> <tr><td colspan='2'><strong>Custom Compliance</strong></td></tr> <tr><td rowspan=4>🟢</td><td> <details><summary><strong>Generic: Comprehensive Audit Trails</strong></summary><br> **Objective:** To create a detailed and reliable record of critical system actions for security analysis <br>and compliance.<br> **Status:** Passed<br> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td> <details><summary><strong>Generic: Meaningful Naming and Self-Documenting Code</strong></summary><br> **Objective:** Ensure all identifiers clearly express their purpose and intent, making code <br>self-documenting<br> **Status:** Passed<br> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td> <details><summary><strong>Generic: Secure Error Handling</strong></summary><br> **Objective:** To prevent the leakage of sensitive system information through error messages while <br>providing sufficient detail for internal debugging.<br> **Status:** Passed<br> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td> <details><summary><strong>Generic: Secure Logging Practices</strong></summary><br> **Objective:** To ensure logs are useful for debugging and auditing without exposing sensitive <br>information like PII, PHI, or cardholder data.<br> **Status:** Passed<br> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td rowspan=2>⚪</td> <td><details> <summary><strong>Generic: Robust Error Handling and Edge Case Management</strong></summary><br> **Objective:** Ensure comprehensive error handling that provides meaningful context and graceful <br>degradation<br> **Status:** <br><a href='https://github.com/carverauto/serviceradar/pull/2063/files#diff-2e6c0a5c048582201ffb983889ab9fab6fb1f8d64b91ee3fb65479b2fb2c6195R789-R815'><strong>Rejection handling</strong></a>: New early-rejection paths for memory pressure and circuit-open return errors without <br>visible upstream handling in this diff, which may drop critical work without retry or <br>alternative path.<br> <details open><summary>Referred Code</summary> ```go func (w *ageGraphWriter) enqueue(ctx context.Context, kind string, size int, query, payload string) error { if w == nil || w.executor == nil { return errAgeGraphWriterUnavailable } // Check circuit breaker state if w.isCircuitOpen() { recordAgeDroppedCircuitOpen() return errAgeGraphCircuitOpen } // Check memory pressure if w.memoryLimitBytes > 0 { heapBytes := currentHeapAllocBytes() if heapBytes > w.memoryLimitBytes { recordAgeDroppedMemoryPressure() if w.log != nil { w.log.Warn(). Uint64("heap_bytes", heapBytes). Uint64("limit_bytes", w.memoryLimitBytes). Str("kind", kind). ... (clipped 6 lines) ``` </details> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td><details> <summary><strong>Generic: Security-First Input Validation and Data Handling</strong></summary><br> **Objective:** Ensure all data inputs are validated, sanitized, and handled securely to prevent <br>vulnerabilities<br> **Status:** <br><a href='https://github.com/carverauto/serviceradar/pull/2063/files#diff-2e6c0a5c048582201ffb983889ab9fab6fb1f8d64b91ee3fb65479b2fb2c6195R801-R813'><strong>Runtime metrics</strong></a>: New memory-based gating uses runtime heap stats without validating external inputs for <br>&#x27;kind&#x27; or &#x27;payload&#x27; in the enqueue path visible here; validation may <br>exist elsewhere but is not shown.<br> <details open><summary>Referred Code</summary> ```go if w.memoryLimitBytes > 0 { heapBytes := currentHeapAllocBytes() if heapBytes > w.memoryLimitBytes { recordAgeDroppedMemoryPressure() if w.log != nil { w.log.Warn(). Uint64("heap_bytes", heapBytes). Uint64("limit_bytes", w.memoryLimitBytes). Str("kind", kind). Int("batch_size", size). Msg("age graph: rejected batch due to memory pressure") } return errAgeGraphMemoryPressure ``` </details> > Learn more about managing compliance <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#configuration-options'>generic rules</a> or creating your own <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/#custom-compliance'>custom rules</a> </details></td></tr> <tr><td align="center" colspan="2"> <!-- placeholder --> <!-- /compliance --update_compliance=true --> </td></tr></tbody></table> <details><summary>Compliance status legend</summary> 🟢 - Fully Compliant<br> 🟡 - Partial Compliant<br> 🔴 - Not Compliant<br> ⚪ - Requires Further Human Verification<br> 🏷️ - Compliance label<br> </details>
qodo-code-review[bot] commented 2025-12-05 00:48:25 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2063#issuecomment-3614881618
Original created: 2025-12-05T00:48:25Z

PR Code Suggestions

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Fix circuit breaker race condition

Fix a race condition in the circuit breaker's half-open state transition by
using an atomic compare-and-swap operation, ensuring only a single request can
pass through to test for recovery.

pkg/registry/age_graph_writer.go [1017-1047]

 func (w *ageGraphWriter) isCircuitOpen() bool {
 	if w == nil || w.circuitThreshold <= 0 {
 		return false
 	}
 
 	state := currentAgeCircuitState()
 	if state == circuitClosed {
 		return false
 	}
 
 	if state == circuitOpen {
 		// Check if reset time has passed
 		openedAt := w.circuitOpenedAt.Load()
 		if openedAt > 0 && time.Now().Unix()-openedAt >= int64(w.circuitResetSecs) {
-			// Transition to half-open
-			w.circuitMu.Lock()
-			if currentAgeCircuitState() == circuitOpen {
-				setAgeCircuitState(circuitHalfOpen)
+			// Atomically transition to half-open. Only one goroutine will succeed.
+			if ageCircuitState.CompareAndSwap(circuitOpen, circuitHalfOpen) {
 				if w.log != nil {
 					w.log.Info().Msg("age graph: circuit breaker transitioning to half-open")
 				}
+				return false // Allow one request through
 			}
-			w.circuitMu.Unlock()
-			return false // Allow one request through
 		}
 		return true
 	}
 
-	// Half-open: allow requests through to test recovery
-	return false
+	// In half-open state, reject all other requests until state changes.
+	return true
 }
  • Apply / Chat
Suggestion importance[1-10]: 9

__

Why: The suggestion correctly identifies a race condition in the circuit breaker logic that allows multiple requests in the half-open state, defeating its purpose, and proposes a valid atomic fix.

High
General
Remove unnecessary lock in hot path

Improve performance by replacing the mutex in the resetCircuit function with
more efficient atomic operations, as it is on a hot path called by every
successful request.

pkg/registry/age_graph_writer.go [1070-1087]

 func (w *ageGraphWriter) resetCircuit() {
 	if w == nil || w.circuitThreshold <= 0 {
 		return
 	}
 
-	w.circuitMu.Lock()
-	defer w.circuitMu.Unlock()
-
-	prevState := currentAgeCircuitState()
-	if w.circuitFailures > 0 || prevState != circuitClosed {
-		w.circuitFailures = 0
+	// Use CAS to avoid locking on the hot path.
+	// Only log if the state was not already closed.
+	if ageCircuitState.CompareAndSwap(circuitHalfOpen, circuitClosed) {
+		atomic.StoreInt64(&w.circuitFailures, 0)
 		w.circuitOpenedAt.Store(0)
-		setAgeCircuitState(circuitClosed)
-		if prevState != circuitClosed && w.log != nil {
+		if w.log != nil {
 			w.log.Info().Msg("age graph: circuit breaker closed after successful request")
+		}
+	} else {
+		// Reset failure count if it was non-zero in closed state.
+		if atomic.LoadInt64(&w.circuitFailures) > 0 {
+			atomic.StoreInt64(&w.circuitFailures, 0)
 		}
 	}
 }
  • Apply / Chat
Suggestion importance[1-10]: 7

__

Why: The suggestion correctly points out that using a mutex on a hot path for every successful request is inefficient and proposes a more performant solution using atomic operations, which is a valid optimization.

Medium
Avoid expensive call on hot path

Optimize the memory pressure check by reading memory stats in a background
goroutine and caching the result, avoiding frequent and potentially expensive
calls to runtime.ReadMemStats() on the hot path.

pkg/registry/age_graph_writer.go [800-815]

 	// Check memory pressure
 	if w.memoryLimitBytes > 0 {
-		heapBytes := currentHeapAllocBytes()
+		heapBytes := atomic.LoadUint64(&w.currentHeapAlloc)
 		if heapBytes > w.memoryLimitBytes {
 			recordAgeDroppedMemoryPressure()
 			if w.log != nil {
 				w.log.Warn().
 					Uint64("heap_bytes", heapBytes).
 					Uint64("limit_bytes", w.memoryLimitBytes).
 					Str("kind", kind).
 					Int("batch_size", size).
 					Msg("age graph: rejected batch due to memory pressure")
 			}
 			return errAgeGraphMemoryPressure
 		}
 	}
  • Apply / Chat
Suggestion importance[1-10]: 6

__

Why: The suggestion correctly identifies a potential performance bottleneck from calling runtime.ReadMemStats() on a hot path and proposes a valid optimization using a background goroutine to cache the value.

Low
High-level
Consider a simpler circuit breaker

Replace the manual circuit breaker implementation, which uses mutexes and atomic
operations for state management, with a standard third-party library. This would
simplify the code and reduce the risk of concurrency issues.

Examples:

pkg/registry/age_graph_writer.go [1017-1087]
func (w *ageGraphWriter) isCircuitOpen() bool {
	if w == nil || w.circuitThreshold <= 0 {
		return false
	}

	state := currentAgeCircuitState()
	if state == circuitClosed {
		return false
	}


 ... (clipped 61 lines)

Solution Walkthrough:

Before:

type ageGraphWriter struct {
    // ...
    circuitFailures   int64
    circuitThreshold  int
    circuitOpenedAt   atomic.Int64
    circuitMu         sync.Mutex
}

func (w *ageGraphWriter) isCircuitOpen() bool {
    // Manual check of state (closed, open, half-open)
    // Manual check of timer to transition to half-open
    // Manual locking to change state
}

func (w *ageGraphWriter) incrementCircuitFailures() {
    // Manual locking to increment failures and open circuit
}

func (w *ageGraphWriter) resetCircuit() {
    // Manual locking to reset failures and close circuit
}

After:

import "github.com/sony/gobreaker"

type ageGraphWriter struct {
    // ...
    cb *gobreaker.CircuitBreaker
}

func NewAGEGraphWriter(...) GraphWriter {
    st := gobreaker.Settings{
        Name:        "AGEGraphWriter",
        MaxRequests: 1, // for half-open state
        Interval:    0, // never move to open state based on time
        Timeout:     60 * time.Second, // from open to half-open
        ReadyToTrip: func(counts gobreaker.Counts) bool {
            return counts.ConsecutiveFailures > 10
        },
    }
    return &ageGraphWriter{..., cb: gobreaker.NewCircuitBreaker(st)}
}

func (w *ageGraphWriter) processRequest(req *ageGraphRequest) error {
    _, err := w.cb.Execute(func() (interface{}, error) {
        // ... original processing logic ...
        return nil, err
    })
    return err
}

Suggestion importance[1-10]: 6

__

Why: The suggestion correctly identifies a complex, manual implementation of a standard pattern and proposes a valid alternative that would improve code simplicity, robustness, and maintainability.

Low
  • More
Imported GitHub PR comment. Original author: @qodo-code-review[bot] Original URL: https://github.com/carverauto/serviceradar/pull/2063#issuecomment-3614881618 Original created: 2025-12-05T00:48:25Z --- ## PR Code Suggestions ✨ <!-- 9f63797 --> Explore these optional code suggestions: <table><thead><tr><td><strong>Category</strong></td><td align=left><strong>Suggestion&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </strong></td><td align=center><strong>Impact</strong></td></tr><tbody><tr><td rowspan=1>Possible issue</td> <td> <details><summary>Fix circuit breaker race condition</summary> ___ **Fix a race condition in the circuit breaker's half-open state transition by <br>using an atomic compare-and-swap operation, ensuring only a single request can <br>pass through to test for recovery.** [pkg/registry/age_graph_writer.go [1017-1047]](https://github.com/carverauto/serviceradar/pull/2063/files#diff-2e6c0a5c048582201ffb983889ab9fab6fb1f8d64b91ee3fb65479b2fb2c6195R1017-R1047) ```diff func (w *ageGraphWriter) isCircuitOpen() bool { if w == nil || w.circuitThreshold <= 0 { return false } state := currentAgeCircuitState() if state == circuitClosed { return false } if state == circuitOpen { // Check if reset time has passed openedAt := w.circuitOpenedAt.Load() if openedAt > 0 && time.Now().Unix()-openedAt >= int64(w.circuitResetSecs) { - // Transition to half-open - w.circuitMu.Lock() - if currentAgeCircuitState() == circuitOpen { - setAgeCircuitState(circuitHalfOpen) + // Atomically transition to half-open. Only one goroutine will succeed. + if ageCircuitState.CompareAndSwap(circuitOpen, circuitHalfOpen) { if w.log != nil { w.log.Info().Msg("age graph: circuit breaker transitioning to half-open") } + return false // Allow one request through } - w.circuitMu.Unlock() - return false // Allow one request through } return true } - // Half-open: allow requests through to test recovery - return false + // In half-open state, reject all other requests until state changes. + return true } ``` - [ ] **Apply / Chat** <!-- /improve --apply_suggestion=0 --> <details><summary>Suggestion importance[1-10]: 9</summary> __ Why: The suggestion correctly identifies a race condition in the circuit breaker logic that allows multiple requests in the half-open state, defeating its purpose, and proposes a valid atomic fix. </details></details></td><td align=center>High </td></tr><tr><td rowspan=2>General</td> <td> <details><summary>Remove unnecessary lock in hot path</summary> ___ **Improve performance by replacing the mutex in the <code>resetCircuit</code> function with <br>more efficient atomic operations, as it is on a hot path called by every <br>successful request.** [pkg/registry/age_graph_writer.go [1070-1087]](https://github.com/carverauto/serviceradar/pull/2063/files#diff-2e6c0a5c048582201ffb983889ab9fab6fb1f8d64b91ee3fb65479b2fb2c6195R1070-R1087) ```diff func (w *ageGraphWriter) resetCircuit() { if w == nil || w.circuitThreshold <= 0 { return } - w.circuitMu.Lock() - defer w.circuitMu.Unlock() - - prevState := currentAgeCircuitState() - if w.circuitFailures > 0 || prevState != circuitClosed { - w.circuitFailures = 0 + // Use CAS to avoid locking on the hot path. + // Only log if the state was not already closed. + if ageCircuitState.CompareAndSwap(circuitHalfOpen, circuitClosed) { + atomic.StoreInt64(&w.circuitFailures, 0) w.circuitOpenedAt.Store(0) - setAgeCircuitState(circuitClosed) - if prevState != circuitClosed && w.log != nil { + if w.log != nil { w.log.Info().Msg("age graph: circuit breaker closed after successful request") + } + } else { + // Reset failure count if it was non-zero in closed state. + if atomic.LoadInt64(&w.circuitFailures) > 0 { + atomic.StoreInt64(&w.circuitFailures, 0) } } } ``` - [ ] **Apply / Chat** <!-- /improve --apply_suggestion=1 --> <details><summary>Suggestion importance[1-10]: 7</summary> __ Why: The suggestion correctly points out that using a mutex on a hot path for every successful request is inefficient and proposes a more performant solution using atomic operations, which is a valid optimization. </details></details></td><td align=center>Medium </td></tr><tr><td> <details><summary>Avoid expensive call on hot path</summary> ___ **Optimize the memory pressure check by reading memory stats in a background <br>goroutine and caching the result, avoiding frequent and potentially expensive <br>calls to <code>runtime.ReadMemStats()</code> on the hot path.** [pkg/registry/age_graph_writer.go [800-815]](https://github.com/carverauto/serviceradar/pull/2063/files#diff-2e6c0a5c048582201ffb983889ab9fab6fb1f8d64b91ee3fb65479b2fb2c6195R800-R815) ```diff // Check memory pressure if w.memoryLimitBytes > 0 { - heapBytes := currentHeapAllocBytes() + heapBytes := atomic.LoadUint64(&w.currentHeapAlloc) if heapBytes > w.memoryLimitBytes { recordAgeDroppedMemoryPressure() if w.log != nil { w.log.Warn(). Uint64("heap_bytes", heapBytes). Uint64("limit_bytes", w.memoryLimitBytes). Str("kind", kind). Int("batch_size", size). Msg("age graph: rejected batch due to memory pressure") } return errAgeGraphMemoryPressure } } ``` - [ ] **Apply / Chat** <!-- /improve --apply_suggestion=2 --> <details><summary>Suggestion importance[1-10]: 6</summary> __ Why: The suggestion correctly identifies a potential performance bottleneck from calling `runtime.ReadMemStats()` on a hot path and proposes a valid optimization using a background goroutine to cache the value. </details></details></td><td align=center>Low </td></tr><tr><td rowspan=1>High-level</td> <td> <details><summary>Consider a simpler circuit breaker</summary> ___ **Replace the manual circuit breaker implementation, which uses mutexes and atomic <br>operations for state management, with a standard third-party library. This would <br>simplify the code and reduce the risk of concurrency issues.** ### Examples: <details> <summary> <a href="https://github.com/carverauto/serviceradar/pull/2063/files#diff-2e6c0a5c048582201ffb983889ab9fab6fb1f8d64b91ee3fb65479b2fb2c6195R1017-R1087">pkg/registry/age_graph_writer.go [1017-1087]</a> </summary> ```go func (w *ageGraphWriter) isCircuitOpen() bool { if w == nil || w.circuitThreshold <= 0 { return false } state := currentAgeCircuitState() if state == circuitClosed { return false } ... (clipped 61 lines) ``` </details> ### Solution Walkthrough: #### Before: ```go type ageGraphWriter struct { // ... circuitFailures int64 circuitThreshold int circuitOpenedAt atomic.Int64 circuitMu sync.Mutex } func (w *ageGraphWriter) isCircuitOpen() bool { // Manual check of state (closed, open, half-open) // Manual check of timer to transition to half-open // Manual locking to change state } func (w *ageGraphWriter) incrementCircuitFailures() { // Manual locking to increment failures and open circuit } func (w *ageGraphWriter) resetCircuit() { // Manual locking to reset failures and close circuit } ``` #### After: ```go import "github.com/sony/gobreaker" type ageGraphWriter struct { // ... cb *gobreaker.CircuitBreaker } func NewAGEGraphWriter(...) GraphWriter { st := gobreaker.Settings{ Name: "AGEGraphWriter", MaxRequests: 1, // for half-open state Interval: 0, // never move to open state based on time Timeout: 60 * time.Second, // from open to half-open ReadyToTrip: func(counts gobreaker.Counts) bool { return counts.ConsecutiveFailures > 10 }, } return &ageGraphWriter{..., cb: gobreaker.NewCircuitBreaker(st)} } func (w *ageGraphWriter) processRequest(req *ageGraphRequest) error { _, err := w.cb.Execute(func() (interface{}, error) { // ... original processing logic ... return nil, err }) return err } ``` <details><summary>Suggestion importance[1-10]: 6</summary> __ Why: The suggestion correctly identifies a complex, manual implementation of a standard pattern and proposes a valid alternative that would improve code simplicity, robustness, and maintainability. </details></details></td><td align=center>Low </td></tr> <tr><td align="center" colspan="2"> - [ ] More <!-- /improve --more_suggestions=true --> </td><td></td></tr></tbody></table>
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
carverauto/serviceradar!2510
No description provided.