reconnection work #2320

Merged
mfreeman451 merged 1 commit from refs/pull/2320/head into main 2025-10-14 21:03:10 +00:00
mfreeman451 commented 2025-10-14 20:56:46 +00:00 (Migrated from github.com)
Owner

Imported from GitHub pull request.

Original GitHub pull request: #1761
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/pull/1761
Original created: 2025-10-14T20:56:46Z
Original updated: 2025-10-14T21:03:14Z
Original head: carverauto/serviceradar:1760-bugflowgger-crashes-if-it-cant-reach-nats
Original base: main
Original merged: 2025-10-14T21:03:10Z by @mfreeman451

PR Type

Enhancement, Bug fix


Description

  • Add configurable retry logic with exponential backoff for NATS connection failures

  • Implement graceful error handling to prevent crashes when NATS is unreachable

  • Add three new configuration parameters for connection retry behavior

  • Refactor connection logic to support unlimited or limited retry attempts


Diagram Walkthrough

flowchart LR
  A["Connection Attempt"] --> B{"Success?"}
  B -- "Yes" --> C["Establish Connection"]
  B -- "No" --> D{"Retry Limit Reached?"}
  D -- "Yes" --> E["Exit with Error"]
  D -- "No" --> F["Wait with Backoff"]
  F --> G["Double Backoff Duration"]
  G --> A

File Walkthrough

Relevant files
Enhancement
nats_output.rs
Implement NATS connection retry with exponential backoff 

cmd/flowgger/src/flowgger/output/nats_output.rs

  • Add retry configuration fields (connect_attempts,
    connect_initial_backoff, connect_max_backoff) to NATSConfig
  • Implement connect_with_retry() method with exponential backoff logic
  • Rename connect() to connect_once() for single connection attempts
  • Add graceful error handling and logging for connection failures
+79/-3   
Configuration changes
flowgger.toml
Add retry configuration parameters to example config         

cmd/flowgger/flowgger.toml

  • Add nats_connect_attempts configuration (0 = unlimited retries)
  • Add nats_connect_initial_backoff_ms configuration (default 500ms)
  • Add nats_connect_max_backoff_ms configuration (default 30000ms)
  • Update stream name comment for clarity
+5/-1     
flowgger.toml
Add retry configuration parameters to packaging config     

packaging/flowgger/config/flowgger.toml

  • Add nats_connect_attempts configuration (0 = unlimited retries)
  • Add nats_connect_initial_backoff_ms configuration (default 500ms)
  • Add nats_connect_max_backoff_ms configuration (default 30000ms)
+4/-0     

Imported from GitHub pull request. Original GitHub pull request: #1761 Original author: @mfreeman451 Original URL: https://github.com/carverauto/serviceradar/pull/1761 Original created: 2025-10-14T20:56:46Z Original updated: 2025-10-14T21:03:14Z Original head: carverauto/serviceradar:1760-bugflowgger-crashes-if-it-cant-reach-nats Original base: main Original merged: 2025-10-14T21:03:10Z by @mfreeman451 --- ### **PR Type** Enhancement, Bug fix ___ ### **Description** - Add configurable retry logic with exponential backoff for NATS connection failures - Implement graceful error handling to prevent crashes when NATS is unreachable - Add three new configuration parameters for connection retry behavior - Refactor connection logic to support unlimited or limited retry attempts ___ ### Diagram Walkthrough ```mermaid flowchart LR A["Connection Attempt"] --> B{"Success?"} B -- "Yes" --> C["Establish Connection"] B -- "No" --> D{"Retry Limit Reached?"} D -- "Yes" --> E["Exit with Error"] D -- "No" --> F["Wait with Backoff"] F --> G["Double Backoff Duration"] G --> A ``` <details> <summary><h3> File Walkthrough</h3></summary> <table><thead><tr><th></th><th align="left">Relevant files</th></tr></thead><tbody><tr><td><strong>Enhancement</strong></td><td><table> <tr> <td> <details> <summary><strong>nats_output.rs</strong><dd><code>Implement NATS connection retry with exponential backoff</code>&nbsp; </dd></summary> <hr> cmd/flowgger/src/flowgger/output/nats_output.rs <ul><li>Add retry configuration fields (<code>connect_attempts</code>, <br><code>connect_initial_backoff</code>, <code>connect_max_backoff</code>) to <code>NATSConfig</code><br> <li> Implement <code>connect_with_retry()</code> method with exponential backoff logic<br> <li> Rename <code>connect()</code> to <code>connect_once()</code> for single connection attempts<br> <li> Add graceful error handling and logging for connection failures</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/1761/files#diff-a82e2e4d413539bf0b414b5629665b19648447523994cba639c4d1238aa5a0c1">+79/-3</a>&nbsp; &nbsp; </td> </tr> </table></td></tr><tr><td><strong>Configuration changes</strong></td><td><table> <tr> <td> <details> <summary><strong>flowgger.toml</strong><dd><code>Add retry configuration parameters to example config</code>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></summary> <hr> cmd/flowgger/flowgger.toml <ul><li>Add <code>nats_connect_attempts</code> configuration (0 = unlimited retries)<br> <li> Add <code>nats_connect_initial_backoff_ms</code> configuration (default 500ms)<br> <li> Add <code>nats_connect_max_backoff_ms</code> configuration (default 30000ms)<br> <li> Update stream name comment for clarity</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/1761/files#diff-af9f49f931e282dca53d1f0521b036d222fe671f77e61a876a84cf4c6d7cca4d">+5/-1</a>&nbsp; &nbsp; &nbsp; </td> </tr> <tr> <td> <details> <summary><strong>flowgger.toml</strong><dd><code>Add retry configuration parameters to packaging config</code>&nbsp; &nbsp; &nbsp; </dd></summary> <hr> packaging/flowgger/config/flowgger.toml <ul><li>Add <code>nats_connect_attempts</code> configuration (0 = unlimited retries)<br> <li> Add <code>nats_connect_initial_backoff_ms</code> configuration (default 500ms)<br> <li> Add <code>nats_connect_max_backoff_ms</code> configuration (default 30000ms)</ul> </details> </td> <td><a href="https://github.com/carverauto/serviceradar/pull/1761/files#diff-cd13f7cc3f36ecdc793279897b61860d104c8511d5f4ad26af58969b0b4595c8">+4/-0</a>&nbsp; &nbsp; &nbsp; </td> </tr> </table></td></tr></tr></tbody></table> </details> ___
qodo-code-review[bot] commented 2025-10-14 20:57:21 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/1761#issuecomment-3403591711
Original created: 2025-10-14T20:57:21Z

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
Verbose error logging

Description: The code logs connection failures using eprintln! which writes to stderr without
structured logging or rate limiting, potentially exposing sensitive connection details and
causing log flooding during outages.
nats_output.rs [179-214]

Referred Code
                if limit != 0 && attempt >= limit {
                    eprintln!(
                        "NATS connection attempt {attempt} failed: {err}. Giving up after {limit} attempts."
                    );
                    return Err(err);
                }

                eprintln!(
                    "NATS connection attempt {attempt} failed: {err}. Retrying in {:?}...",
                    backoff
                );
                sleep(backoff).await;

                let doubled = backoff
                    .checked_mul(2)
                    .unwrap_or(self.cfg.connect_max_backoff);
                backoff = min(doubled, self.cfg.connect_max_backoff);
            }
        }
    }
}


 ... (clipped 15 lines)
Ticket Compliance
🟢
🎫 #1760
🟢 Prevent Flowgger from panicking when it cannot reach NATS.
Implement reconnection with exponential backoff when NATS is unreachable.
Allow configuration of retry behavior, including number of attempts and backoff timing.
If reconnection ultimately fails, exit so Kubernetes can restart the pod.
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
No custom compliance provided

Follow the guide to enable custom compliance check.

Compliance status legend 🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
- Requires Further Human Verification
🏷️ - Compliance label
Imported GitHub PR comment. Original author: @qodo-code-review[bot] Original URL: https://github.com/carverauto/serviceradar/pull/1761#issuecomment-3403591711 Original created: 2025-10-14T20:57:21Z --- ## PR Compliance Guide 🔍 <!-- https://github.com/carverauto/serviceradar/commit/8723f27831de675fcffa7d5243679c0e3dbab481 --> Below is a summary of compliance checks for this PR:<br> <table><tbody><tr><td colspan='2'><strong>Security Compliance</strong></td></tr> <tr><td rowspan=1>⚪</td> <td><details><summary><strong>Verbose error logging </strong></summary><br> <b>Description:</b> The code logs connection failures using eprintln! which writes to stderr without <br>structured logging or rate limiting, potentially exposing sensitive connection details and <br>causing log flooding during outages.<br> <strong><a href='https://github.com/carverauto/serviceradar/pull/1761/files#diff-a82e2e4d413539bf0b414b5629665b19648447523994cba639c4d1238aa5a0c1R179-R214'>nats_output.rs [179-214]</a></strong><br> <details open><summary>Referred Code</summary> ```rust if limit != 0 && attempt >= limit { eprintln!( "NATS connection attempt {attempt} failed: {err}. Giving up after {limit} attempts." ); return Err(err); } eprintln!( "NATS connection attempt {attempt} failed: {err}. Retrying in {:?}...", backoff ); sleep(backoff).await; let doubled = backoff .checked_mul(2) .unwrap_or(self.cfg.connect_max_backoff); backoff = min(doubled, self.cfg.connect_max_backoff); } } } } ... (clipped 15 lines) ``` </details></details></td></tr> <tr><td colspan='2'><strong>Ticket Compliance</strong></td></tr> <tr><td>🟢</td> <td> <details> <summary>🎫 <a href=https://github.com/carverauto/serviceradar/issues/1760>#1760</a></summary> <table width='100%'><tbody> <tr><td rowspan=4>🟢</td> <td>Prevent Flowgger from panicking when it cannot reach NATS.</td></tr> <tr><td>Implement reconnection with exponential backoff when NATS is unreachable.</td></tr> <tr><td>Allow configuration of retry behavior, including number of attempts and backoff timing.</td></tr> <tr><td>If reconnection ultimately fails, exit so Kubernetes can restart the pod.</td></tr> </tbody></table> </details> </td></tr> <tr><td colspan='2'><strong>Codebase Duplication Compliance</strong></td></tr> <tr><td>⚪</td><td><details><summary><strong>Codebase context is not defined </strong></summary> Follow the <a href='https://qodo-merge-docs.qodo.ai/core-abilities/rag_context_enrichment/'>guide</a> to enable codebase context checks. </details></td></tr> <tr><td colspan='2'><strong>Custom Compliance</strong></td></tr> <tr><td>⚪</td><td><details><summary><strong>No custom compliance provided</strong></summary> Follow the <a href='https://qodo-merge-docs.qodo.ai/tools/compliance/'>guide</a> to enable custom compliance check. </details></td></tr> <tr><td align="center" colspan="2"> <!-- placeholder --> <!-- /compliance --update_compliance=true --> </td></tr></tbody></table> <details><summary>Compliance status legend</summary> 🟢 - Fully Compliant<br> 🟡 - Partial Compliant<br> 🔴 - Not Compliant<br> ⚪ - Requires Further Human Verification<br> 🏷️ - Compliance label<br> </details>
qodo-code-review[bot] commented 2025-10-14 20:58:17 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/1761#issuecomment-3403594243
Original created: 2025-10-14T20:58:17Z

PR Code Suggestions

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
High-level
Replace custom retry logic with a standard library

Replace the custom exponential backoff logic with a standard library like the
backoff crate. This will add jitter to prevent "thundering herd" issues and
simplify the implementation.

Examples:

cmd/flowgger/src/flowgger/output/nats_output.rs [165-199]
    async fn connect_with_retry(&self) -> Result<(Client, jetstream::Context), async_nats::Error> {
        let mut attempt: u32 = 0;
        let mut backoff = min(
            self.cfg.connect_initial_backoff,
            self.cfg.connect_max_backoff,
        );

        loop {
            attempt += 1;
            match self.connect_once().await {

 ... (clipped 25 lines)

Solution Walkthrough:

Before:

async fn connect_with_retry(&self) -> Result<Connection, Error> {
    let mut attempt = 0;
    let mut backoff = self.cfg.connect_initial_backoff;

    loop {
        attempt += 1;
        match self.connect_once().await {
            Ok(conn) => return Ok(conn),
            Err(err) => {
                if self.cfg.connect_attempts != 0 && attempt >= self.cfg.connect_attempts {
                    return Err(err);
                }
                sleep(backoff).await;
                backoff = min(backoff * 2, self.cfg.connect_max_backoff);
            }
        }
    }
}

After:

// With the `backoff` crate
use backoff::{ExponentialBackoff, future::retry};

async fn connect_with_retry(&self) -> Result<Connection, Error> {
    let mut backoff = ExponentialBackoff {
        initial_interval: self.cfg.connect_initial_backoff,
        max_interval: self.cfg.connect_max_backoff,
        max_elapsed_time: None, // Or configure based on attempts
        ..Default::default()
    };

    let operation = || async {
        self.connect_once().await.map_err(backoff::Error::transient)
    };

    retry(backoff, operation).await
}

Suggestion importance[1-10]: 8

__

Why: The suggestion correctly identifies a significant design flaw—the lack of jitter in the custom retry logic—and proposes a robust, standard solution using the backoff crate, which improves production-readiness.

Medium
General
Ensure max backoff is not less than initial

To prevent unexpected behavior, ensure connect_max_backoff is always greater
than or equal to connect_initial_backoff by adjusting it if a user configures it
to be smaller.

cmd/flowgger/src/flowgger/output/nats_output.rs [103-105]

-if connect_max_backoff.is_zero() {
+if connect_max_backoff < connect_initial_backoff {
     connect_max_backoff = connect_initial_backoff;
 }
  • Apply / Chat
Suggestion importance[1-10]: 6

__

Why: The suggestion correctly identifies a potential user misconfiguration and proposes a reasonable change to handle it gracefully, making the new retry feature more robust and less surprising.

Low
Simplify backoff calculation logic

Simplify the exponential backoff calculation by replacing the verbose
checked_mul, unwrap_or, and min chain with a more direct min(backoff * 2,
self.cfg.connect_max_backoff).

cmd/flowgger/src/flowgger/output/nats_output.rs [192-195]

-let doubled = backoff
-    .checked_mul(2)
-    .unwrap_or(self.cfg.connect_max_backoff);
-backoff = min(doubled, self.cfg.connect_max_backoff);
+backoff = min(backoff * 2, self.cfg.connect_max_backoff);
  • Apply / Chat
Suggestion importance[1-10]: 5

__

Why: The suggestion correctly points out that the backoff calculation can be simplified, improving code readability and conciseness without sacrificing safety in this context.

Low
  • More
Imported GitHub PR comment. Original author: @qodo-code-review[bot] Original URL: https://github.com/carverauto/serviceradar/pull/1761#issuecomment-3403594243 Original created: 2025-10-14T20:58:17Z --- ## PR Code Suggestions ✨ <!-- 8723f27 --> Explore these optional code suggestions: <table><thead><tr><td><strong>Category</strong></td><td align=left><strong>Suggestion&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </strong></td><td align=center><strong>Impact</strong></td></tr><tbody><tr><td rowspan=1>High-level</td> <td> <details><summary>Replace custom retry logic with a standard library</summary> ___ **Replace the custom exponential backoff logic with a standard library like the <br><code>backoff</code> crate. This will add jitter to prevent "thundering herd" issues and <br>simplify the implementation.** ### Examples: <details> <summary> <a href="https://github.com/carverauto/serviceradar/pull/1761/files#diff-a82e2e4d413539bf0b414b5629665b19648447523994cba639c4d1238aa5a0c1R165-R199">cmd/flowgger/src/flowgger/output/nats_output.rs [165-199]</a> </summary> ```rust async fn connect_with_retry(&self) -> Result<(Client, jetstream::Context), async_nats::Error> { let mut attempt: u32 = 0; let mut backoff = min( self.cfg.connect_initial_backoff, self.cfg.connect_max_backoff, ); loop { attempt += 1; match self.connect_once().await { ... (clipped 25 lines) ``` </details> ### Solution Walkthrough: #### Before: ```rust async fn connect_with_retry(&self) -> Result<Connection, Error> { let mut attempt = 0; let mut backoff = self.cfg.connect_initial_backoff; loop { attempt += 1; match self.connect_once().await { Ok(conn) => return Ok(conn), Err(err) => { if self.cfg.connect_attempts != 0 && attempt >= self.cfg.connect_attempts { return Err(err); } sleep(backoff).await; backoff = min(backoff * 2, self.cfg.connect_max_backoff); } } } } ``` #### After: ```rust // With the `backoff` crate use backoff::{ExponentialBackoff, future::retry}; async fn connect_with_retry(&self) -> Result<Connection, Error> { let mut backoff = ExponentialBackoff { initial_interval: self.cfg.connect_initial_backoff, max_interval: self.cfg.connect_max_backoff, max_elapsed_time: None, // Or configure based on attempts ..Default::default() }; let operation = || async { self.connect_once().await.map_err(backoff::Error::transient) }; retry(backoff, operation).await } ``` <details><summary>Suggestion importance[1-10]: 8</summary> __ Why: The suggestion correctly identifies a significant design flaw—the lack of jitter in the custom retry logic—and proposes a robust, standard solution using the `backoff` crate, which improves production-readiness. </details></details></td><td align=center>Medium </td></tr><tr><td rowspan=2>General</td> <td> <details><summary>Ensure max backoff is not less than initial</summary> ___ **To prevent unexpected behavior, ensure <code>connect_max_backoff</code> is always greater <br>than or equal to <code>connect_initial_backoff</code> by adjusting it if a user configures it <br>to be smaller.** [cmd/flowgger/src/flowgger/output/nats_output.rs [103-105]](https://github.com/carverauto/serviceradar/pull/1761/files#diff-a82e2e4d413539bf0b414b5629665b19648447523994cba639c4d1238aa5a0c1R103-R105) ```diff -if connect_max_backoff.is_zero() { +if connect_max_backoff < connect_initial_backoff { connect_max_backoff = connect_initial_backoff; } ``` - [ ] **Apply / Chat** <!-- /improve --apply_suggestion=1 --> <details><summary>Suggestion importance[1-10]: 6</summary> __ Why: The suggestion correctly identifies a potential user misconfiguration and proposes a reasonable change to handle it gracefully, making the new retry feature more robust and less surprising. </details></details></td><td align=center>Low </td></tr><tr><td> <details><summary>Simplify backoff calculation logic</summary> ___ **Simplify the exponential backoff calculation by replacing the verbose <br><code>checked_mul</code>, <code>unwrap_or</code>, and <code>min</code> chain with a more direct <code>min(backoff * 2, </code><br><code>self.cfg.connect_max_backoff)</code>.** [cmd/flowgger/src/flowgger/output/nats_output.rs [192-195]](https://github.com/carverauto/serviceradar/pull/1761/files#diff-a82e2e4d413539bf0b414b5629665b19648447523994cba639c4d1238aa5a0c1R192-R195) ```diff -let doubled = backoff - .checked_mul(2) - .unwrap_or(self.cfg.connect_max_backoff); -backoff = min(doubled, self.cfg.connect_max_backoff); +backoff = min(backoff * 2, self.cfg.connect_max_backoff); ``` - [ ] **Apply / Chat** <!-- /improve --apply_suggestion=2 --> <details><summary>Suggestion importance[1-10]: 5</summary> __ Why: The suggestion correctly points out that the backoff calculation can be simplified, improving code readability and conciseness without sacrificing safety in this context. </details></details></td><td align=center>Low </td></tr> <tr><td align="center" colspan="2"> - [ ] More <!-- /improve --more_suggestions=true --> </td><td></td></tr></tbody></table>
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
carverauto/serviceradar!2320
No description provided.