Refactor/zen tenant isolation #2658

Merged
mfreeman451 merged 7 commits from refs/pull/2658/head into staging 2026-01-13 06:12:32 +00:00
mfreeman451 commented 2026-01-13 05:57:41 +00:00 (Migrated from github.com)
Owner

Imported from GitHub pull request.

Original GitHub pull request: #2275
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/pull/2275
Original created: 2026-01-13T05:57:41Z
Original updated: 2026-01-13T06:12:34Z
Original head: carverauto/serviceradar:refactor/zen-tenant-isolation
Original base: staging
Original merged: 2026-01-13T06:12:32Z by @mfreeman451

User description

IMPORTANT: Please sign the Developer Certificate of Origin

Thank you for your contribution to ServiceRadar. Please note, when contributing, the developer must include
a DCO sign-off statement indicating the DCO acceptance in one commit message. Here
is an example DCO Signed-off-by line in a commit message:

Signed-off-by: J. Doe <j.doe@domain.com>

Describe your changes

Code checklist before requesting a review

  • I have signed the DCO?
  • The build completes without errors?
  • All tests are passing when running make test?

PR Type

Enhancement, Bug fix


Description

  • Migrate core service to Elixir implementation with clustering support

  • Fix JetStream subject configuration by removing invalid leading wildcards

  • Correct tenant migration execution order for zen_rule_templates index updates

  • Add platform tenant lifecycle event publishing with retry mechanism

  • Implement CNPG certificate generation and client certificate authentication


Diagram Walkthrough

flowchart LR
  A["Core Service"] -->|"Migrate to"| B["Core-ELX Elixir"]
  B -->|"Cluster via"| C["Kubernetes Discovery"]
  D["JetStream Config"] -->|"Fix subjects"| E["Remove wildcards"]
  E -->|"Match patterns"| F["logs.syslog/snmp/otel"]
  G["Migration Order"] -->|"Correct timing"| H["zen_rule_templates first"]
  H -->|"Then update"| I["Index creation"]
  J["Tenant Bootstrap"] -->|"Publish events"| K["TenantLifecyclePublisher"]
  K -->|"With retry"| L["Exponential backoff"]
  M["CNPG Certs"] -->|"Generate"| N["CA and Client Certs"]
  N -->|"Configure"| O["Database connections"]

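The JetStream subject fix above hinges on NATS token-matching semantics: `*` matches exactly one dot-separated token and `>` matches one or more trailing tokens, so a stream configured with a leading wildcard such as `*.logs.syslog` (a hypothetical example of the removed patterns) can never capture a subject published as `logs.syslog`. A minimal Python sketch of those matching rules:

```python
def nats_subject_matches(pattern: str, subject: str) -> bool:
    """Token-wise NATS-style subject matching: '*' matches exactly one
    token, '>' matches one or more trailing tokens."""
    p, s = pattern.split("."), subject.split(".")
    for i, tok in enumerate(p):
        if tok == ">":
            return len(s) > i  # '>' must cover at least one token
        if i >= len(s):
            return False
        if tok != "*" and tok != s[i]:
            return False
    return len(p) == len(s)

# A stream configured with '*.logs.syslog' never sees 'logs.syslog':
assert nats_subject_matches("logs.syslog", "logs.syslog")
assert not nats_subject_matches("*.logs.syslog", "logs.syslog")
```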
File Walkthrough

Relevant files
Enhancement
7 files
platform_tenant_bootstrap.ex
Add tenant lifecycle event publishing with retries             
+61/-0   
generate-certs.sh
Add CNPG CA and client certificate generation                       
+48/-1   
cnpg-client-ca-job.yaml
Add Helm hooks and simplify secret reconciliation logic   
+24/-39 
core.yaml
Replace core with core-elx Elixir implementation                 
+127/-135
db-event-writer-config.yaml
Use configurable CNPG client certificate paths                     
+2/-2     
spire-server.yaml
Add CNPG client certificate configuration variables           
+6/-1     
web.yaml
Use configurable CNPG client certificate paths                     
+2/-2     
Bug fix
2 files
20260111082000_update-zen-rule-identities.exs
Make index creation idempotent with IF NOT EXISTS               
+11/-6   
serviceradar-config.yaml
Fix zen JetStream subject configuration patterns                 
+4/-12   
Configuration changes
2 files
values-demo-staging.yaml
Update image tag to latest commit                                               
+1/-1     
values.yaml
Add CNPG client certificate name configuration options     
+2/-0     

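The migration fix in the Bug fix table makes index creation re-runnable. The actual change lives in an Ecto migration (20260111082000_update-zen-rule-identities.exs); purely as an illustration of the SQL-level effect, the sketch below reproduces the `IF NOT EXISTS` behavior against an in-memory SQLite database (the index name is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE zen_rule_templates (id INTEGER, tenant_id TEXT)")

ddl = ("CREATE INDEX IF NOT EXISTS zen_rule_templates_tenant_idx "
       "ON zen_rule_templates (tenant_id)")
conn.execute(ddl)
conn.execute(ddl)  # re-running the migration is now a no-op, not an error

try:
    # Without IF NOT EXISTS, the second creation attempt raises.
    conn.execute("CREATE INDEX zen_rule_templates_tenant_idx "
                 "ON zen_rule_templates (tenant_id)")
    rerun_failed = False
except sqlite3.OperationalError:
    rerun_failed = True
```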
qodo-code-review[bot] commented 2026-01-13 05:58:25 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2275#issuecomment-3742112403
Original created: 2026-01-13T05:58:25Z

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
Private key exposure

Description: The script generates and stores long-lived private CA and client keys (e.g.,
cnpg-ca-key.pem, cnpg-client-key.pem) under CERT_DIR (typically a shared PVC), and other
workloads mount this same cert directory, so a compromise of any pod with access to that
volume could exfiltrate the CA private key and mint trusted CNPG client certificates to
access the database.
generate-certs.sh [22-107]

Referred Code
if [ ! -f "$CERT_DIR/cnpg-ca.pem" ] || [ "$FORCE_REGENERATE" = "true" ]; then
  openssl ecparam -name prime256v1 -genkey -noout -out "$CERT_DIR/cnpg-ca-key.pem"
  cat > "$CERT_DIR/cnpg-ca.conf" <<EOF
[req]
distinguished_name = req_distinguished_name
x509_extensions = v3_ca
prompt = no
[req_distinguished_name]
C = $COUNTRY
ST = $STATE
L = $LOCALITY
O = $ORGANIZATION
OU = $ORG_UNIT
CN = ServiceRadar CNPG CA
[v3_ca]
basicConstraints = critical, CA:TRUE, pathlen:0
keyUsage = critical, keyCertSign, cRLSign
subjectKeyIdentifier = hash
EOF
  openssl req -new -x509 -sha256 -key "$CERT_DIR/cnpg-ca-key.pem" -out "$CERT_DIR/cnpg-ca.pem" -days $DAYS_VALID -config "$CERT_DIR/cnpg-ca.conf"
  rm "$CERT_DIR/cnpg-ca.conf"


 ... (clipped 65 lines)
Ticket Compliance
🎫 No ticket provided
  • Create ticket/issue
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢
Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

🔴
Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status:
Unbounded retry loop: The new tenant lifecycle publish retry mechanism re-schedules retries indefinitely with a
fixed delay and no max attempts/backoff, which can cause endless retries and log noise
during persistent failure.

Referred Code
@impl true
def handle_info({:publish_platform_event, tenant_id, event}, state) do
  case fetch_platform_tenant(tenant_id) do
    {:ok, tenant} ->
      if publish_event(event, tenant) != :ok do
        schedule_publish_retry(tenant_id, event)
      end

    {:error, reason} ->
      Logger.warning("[PlatformTenantBootstrap] Retry publish failed to load tenant",
        tenant_id: tenant_id,
        reason: inspect(reason)
      )

      schedule_publish_retry(tenant_id, event)
  end

  {:noreply, state}
end

defp publish_platform_lifecycle_event(event, tenant) do


 ... (clipped 18 lines)

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status:
Missing success audit: Platform tenant create/update triggers lifecycle events but the new code only logs
failures (not successful outcomes with actor/user context), so it is unclear whether audit
trail requirements are met.

Referred Code
  |> Ash.Query.for_read(:read)
  |> Ash.Query.filter(is_platform_tenant == true)
  |> Ash.Query.select([:id, :slug, :is_platform_tenant])

case Ash.read(query, actor: actor) do
  {:ok, []} ->
    tenant = create_platform_tenant!(platform_slug)
    set_platform_tenant_id!(tenant.id, platform_slug)
    publish_platform_lifecycle_event(:created, tenant)
    :ok

  {:ok, [tenant]} ->
    validate_platform_tenant!(tenant, platform_slug)
    ensure_platform_tenant_infrastructure!(tenant)
    set_platform_tenant_id!(tenant.id, platform_slug)
    publish_platform_lifecycle_event(:updated, tenant)
    :ok

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status:
Possible sensitive reason: Logging reason: inspect(reason) from Ash.read/2 failures may include internal details from
underlying adapters, so it is unclear from the diff alone whether sensitive information
could be emitted to logs.

Referred Code
  {:error, reason} ->
    Logger.warning("[PlatformTenantBootstrap] Retry publish failed to load tenant",
      tenant_id: tenant_id,
      reason: inspect(reason)
    )

    schedule_publish_retry(tenant_id, event)
end

Learn more about managing compliance generic rules or creating your own custom rules

  • Update
Compliance status legend
🟢 - Fully Compliant
🟡 - Partially Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label
qodo-code-review[bot] commented 2026-01-13 05:59:52 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2275#issuecomment-3742116331
Original created: 2026-01-13T05:59:52Z

PR Code Suggestions

Explore these optional code suggestions:

Category: Possible issue
Prevent an infinite retry loop
Suggestion Impact: The commit adds explicit handle_info/2 clauses for {:error, :not_found} (and also {:error, {:unexpected_count, count}}) that log and stop scheduling further retries, addressing the infinite-retry concern for unrecoverable tenant-fetch errors. It also introduces attempt tracking and exponential backoff with jitter for retriable errors, though it still retries on generic {:error, reason}.

code diff:

   def handle_info({:publish_platform_event, tenant_id, event}, state) do
+    handle_info({:publish_platform_event, tenant_id, event, 1}, state)
+  end
+
+  def handle_info({:publish_platform_event, tenant_id, event, attempt}, state) do
     case fetch_platform_tenant(tenant_id) do
       {:ok, tenant} ->
         if publish_event(event, tenant) != :ok do
-          schedule_publish_retry(tenant_id, event)
+          schedule_publish_retry(tenant_id, event, attempt + 1)
         end
+
+      {:error, :not_found} ->
+        Logger.warning(
+          "[PlatformTenantBootstrap] Tenant not found during publish retry; stopping retries.",
+          tenant_id: tenant_id,
+          event: event
+        )
+
+      {:error, {:unexpected_count, count}} ->
+        Logger.error(
+          "[PlatformTenantBootstrap] Multiple tenants found during publish retry; stopping retries.",
+          tenant_id: tenant_id,
+          count: count,
+          event: event
+        )
 
       {:error, reason} ->
         Logger.warning("[PlatformTenantBootstrap] Retry publish failed to load tenant",
           tenant_id: tenant_id,
-          reason: inspect(reason)
-        )
-
-        schedule_publish_retry(tenant_id, event)
+          reason: inspect(reason),
+          event: event
+        )
+
+        schedule_publish_retry(tenant_id, event, attempt + 1)
     end
 
     {:noreply, state}
@@ -222,7 +243,7 @@
         event: event
       )
 
-      schedule_publish_retry(tenant.id, event)
+      schedule_publish_retry(tenant.id, event, 1)
     end
   end
 
@@ -231,8 +252,20 @@
   defp publish_event(:deleted, tenant), do: TenantLifecyclePublisher.publish_deleted(tenant)
   defp publish_event(_event, tenant), do: TenantLifecyclePublisher.publish_updated(tenant)
 
-  defp schedule_publish_retry(tenant_id, event) do
-    Process.send_after(self(), {:publish_platform_event, tenant_id, event}, @publish_retry_delay)
+  defp schedule_publish_retry(tenant_id, event, attempt) do
+    delay = calculate_publish_backoff(attempt)
+    Process.send_after(self(), {:publish_platform_event, tenant_id, event, attempt}, delay)
+  end
+
+  defp calculate_publish_backoff(attempt) do
+    exponent = min(max(attempt - 1, 0), 16)
+    base_delay =
+      @initial_publish_retry_delay
+      |> Kernel.*(Integer.pow(2, exponent))
+      |> min(@max_publish_retry_delay)
+
+    jitter = :rand.uniform(max(div(base_delay, 2), 1))
+    min(base_delay + jitter, @max_publish_retry_delay)
   end

In handle_info/2, add specific case clauses for {:error, :not_found} and other
unrecoverable errors to prevent an infinite retry loop when a tenant cannot be
fetched.

elixir/serviceradar_core/lib/serviceradar/identity/platform_tenant_bootstrap.ex [199-216]

 def handle_info({:publish_platform_event, tenant_id, event}, state) do
   case fetch_platform_tenant(tenant_id) do
     {:ok, tenant} ->
       if publish_event(event, tenant) != :ok do
         schedule_publish_retry(tenant_id, event)
       end
 
-    {:error, reason} ->
-      Logger.warning("[PlatformTenantBootstrap] Retry publish failed to load tenant",
+    {:error, :not_found} ->
+      Logger.warning(
+        "[PlatformTenantBootstrap] Tenant not found during publish retry; stopping retries.",
         tenant_id: tenant_id,
-        reason: inspect(reason)
+        event: event
       )
 
-      schedule_publish_retry(tenant_id, event)
+    {:error, reason} ->
+      Logger.error(
+        "[PlatformTenantBootstrap] Unrecoverable error fetching tenant during publish retry; stopping retries.",
+        tenant_id: tenant_id,
+        reason: inspect(reason),
+        event: event
+      )
   end
 
   {:noreply, state}
 end

[Suggestion processed]

Suggestion importance[1-10]: 9


Why: The suggestion correctly identifies a potential infinite retry loop if a tenant is not found, which is a significant bug that could lead to resource exhaustion and log spam.

Impact: High

Category: High-level
Implement exponential backoff for retries
Suggestion Impact: The fixed @publish_retry_delay retry was replaced with an exponential backoff retry mechanism: retry messages now include an attempt count, delays grow exponentially up to a maximum with added jitter, and retry scheduling uses the computed delay. The handler was updated to support the new message shape and initial attempts.

code diff:

-  @publish_retry_delay 10_000
+  @initial_publish_retry_delay 1_000
+  @max_publish_retry_delay 60_000
 
   def start_link(opts \\ []) do
     GenServer.start_link(__MODULE__, opts, name: __MODULE__)
@@ -197,19 +198,39 @@
 
   @impl true
   def handle_info({:publish_platform_event, tenant_id, event}, state) do
+    handle_info({:publish_platform_event, tenant_id, event, 1}, state)
+  end
+
+  def handle_info({:publish_platform_event, tenant_id, event, attempt}, state) do
     case fetch_platform_tenant(tenant_id) do
       {:ok, tenant} ->
         if publish_event(event, tenant) != :ok do
-          schedule_publish_retry(tenant_id, event)
+          schedule_publish_retry(tenant_id, event, attempt + 1)
         end
+
+      {:error, :not_found} ->
+        Logger.warning(
+          "[PlatformTenantBootstrap] Tenant not found during publish retry; stopping retries.",
+          tenant_id: tenant_id,
+          event: event
+        )
+
+      {:error, {:unexpected_count, count}} ->
+        Logger.error(
+          "[PlatformTenantBootstrap] Multiple tenants found during publish retry; stopping retries.",
+          tenant_id: tenant_id,
+          count: count,
+          event: event
+        )
 
       {:error, reason} ->
         Logger.warning("[PlatformTenantBootstrap] Retry publish failed to load tenant",
           tenant_id: tenant_id,
-          reason: inspect(reason)
-        )
-
-        schedule_publish_retry(tenant_id, event)
+          reason: inspect(reason),
+          event: event
+        )
+
+        schedule_publish_retry(tenant_id, event, attempt + 1)
     end
 
     {:noreply, state}
@@ -222,7 +243,7 @@
         event: event
       )
 
-      schedule_publish_retry(tenant.id, event)
+      schedule_publish_retry(tenant.id, event, 1)
     end
   end
 
@@ -231,8 +252,20 @@
   defp publish_event(:deleted, tenant), do: TenantLifecyclePublisher.publish_deleted(tenant)
   defp publish_event(_event, tenant), do: TenantLifecyclePublisher.publish_updated(tenant)
 
-  defp schedule_publish_retry(tenant_id, event) do
-    Process.send_after(self(), {:publish_platform_event, tenant_id, event}, @publish_retry_delay)
+  defp schedule_publish_retry(tenant_id, event, attempt) do
+    delay = calculate_publish_backoff(attempt)
+    Process.send_after(self(), {:publish_platform_event, tenant_id, event, attempt}, delay)
+  end
+
+  defp calculate_publish_backoff(attempt) do
+    exponent = min(max(attempt - 1, 0), 16)
+    base_delay =
+      @initial_publish_retry_delay
+      |> Kernel.*(Integer.pow(2, exponent))
+      |> min(@max_publish_retry_delay)
+
+    jitter = :rand.uniform(max(div(base_delay, 2), 1))
+    min(base_delay + jitter, @max_publish_retry_delay)
   end
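
The ordering of the new clauses in that diff matters: the specific terminal errors must match before the catch-all `{:error, reason}` clause, or every failure would keep retrying. A Python sketch of that dispatch (the tuple encodings are illustrative stand-ins for the Elixir terms, not the shipped code):

```python
def handle_fetch_result(result):
    """Decide what the diff's handle_info clauses would do with a fetch result.

    Encodings (stand-ins for the Elixir terms):
      ("ok", tenant)                     -> publish the event
      ("error", "not_found")             -> terminal: tenant is gone
      ("error", ("unexpected_count", n)) -> terminal: data integrity issue
      ("error", anything_else)           -> transient: retry with backoff
    """
    tag, detail = result
    if tag == "ok":
        return "publish"
    # Specific terminal cases are checked first, mirroring the Elixir clause order.
    if detail == "not_found":
        return "stop"
    if isinstance(detail, tuple) and detail[0] == "unexpected_count":
        return "stop"
    return "retry"
```

With this ordering a deleted tenant stops the loop instead of retrying forever, while timeouts and connection errors still back off and retry.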

Replace the fixed-delay retry mechanism for platform tenant event publishing
with an exponential backoff strategy. This will improve system resilience during
outages.

Examples:

elixir/serviceradar_core/lib/serviceradar/identity/platform_tenant_bootstrap.ex [198-236]
  @impl true
  def handle_info({:publish_platform_event, tenant_id, event}, state) do
    case fetch_platform_tenant(tenant_id) do
      {:ok, tenant} ->
        if publish_event(event, tenant) != :ok do
          schedule_publish_retry(tenant_id, event)
        end

      {:error, reason} ->
        Logger.warning("[PlatformTenantBootstrap] Retry publish failed to load tenant",

 ... (clipped 29 lines)

Solution Walkthrough:

Before:

defmodule ServiceRadar.Identity.PlatformTenantBootstrap do
  @publish_retry_delay 10_000 # Fixed 10-second delay

  defp publish_platform_lifecycle_event(event, tenant) do
    if publish_event(event, tenant) != :ok do
      schedule_publish_retry(tenant.id, event)
    end
  end

  def handle_info({:publish_platform_event, tenant_id, event}, state) do
    # ... attempt to publish ...
    if publish_failed do
      # Always retries with the same fixed delay
      schedule_publish_retry(tenant_id, event)
    end
    ...
  end

  defp schedule_publish_retry(tenant_id, event) do
    Process.send_after(self(), {:publish_platform_event, tenant_id, event}, @publish_retry_delay)
  end
end

After:

defmodule ServiceRadar.Identity.PlatformTenantBootstrap do
  @initial_delay 1_000
  @max_delay 60_000

  defp publish_platform_lifecycle_event(event, tenant) do
    if publish_event(event, tenant) != :ok do
      schedule_publish_retry(tenant.id, event, 1) # Start with attempt 1
    end
  end

  def handle_info({:publish_platform_event, tenant_id, event, attempt}, state) do
    # ... attempt to publish ...
    if publish_failed do
      schedule_publish_retry(tenant_id, event, attempt + 1)
    end
    ...
  end

  defp schedule_publish_retry(tenant_id, event, attempt) do
    delay = calculate_backoff(attempt)
    Process.send_after(self(), {:publish_platform_event, tenant_id, event, attempt}, delay)
  end

  defp calculate_backoff(attempt) do
    # e.g., min(@max_delay, @initial_delay * 2 ^ (attempt - 1) + jitter)
    ...
  end
end
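
The `calculate_backoff` body above is elided; a Python rendering of the concrete formula from the applied diff (constants and the jitter range mirror `calculate_publish_backoff`; this is a translation for illustration only):

```python
import random

INITIAL_DELAY_MS = 1_000   # @initial_publish_retry_delay
MAX_DELAY_MS = 60_000      # @max_publish_retry_delay

def calculate_publish_backoff(attempt: int) -> int:
    # Clamp the exponent so huge attempt counts cannot overflow the growth term.
    exponent = min(max(attempt - 1, 0), 16)
    base_delay = min(INITIAL_DELAY_MS * 2 ** exponent, MAX_DELAY_MS)
    # :rand.uniform(n) returns 1..n; random.randint(1, n) matches that range.
    jitter = random.randint(1, max(base_delay // 2, 1))
    return min(base_delay + jitter, MAX_DELAY_MS)

# Attempts 1, 2, 3, ... produce roughly 1s, 2s, 4s, ... of delay;
# attempt 1 lands in 1001..1500 ms, and from attempt 7 on the 60s cap dominates.
```

The cap keeps a long outage from scheduling ever-growing gaps, and the jitter spreads retries from many tenants so they do not all fire in the same tick.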

Suggestion importance[1-10]: 8

__

Why: The suggestion correctly identifies that the implemented retry mechanism uses a fixed delay, contrary to the PR's stated goal of exponential backoff, and proposes a more resilient best-practice strategy.

Medium
Security
Improve security of root key

Change the permissions of root-key.pem from 640 to 600 to restrict access to the
owner only, enhancing security and aligning with the principle of least
privilege.

helm/serviceradar/files/generate-certs.sh [17-21]

 if [ ! -f "$CERT_DIR/root.pem" ]; then
   openssl genrsa -out "$CERT_DIR/root-key.pem" 4096
   openssl req -new -x509 -sha256 -key "$CERT_DIR/root-key.pem" -out "$CERT_DIR/root.pem" -days $DAYS_VALID -subj "/C=$COUNTRY/ST=$STATE/L=$LOCALITY/O=$ORGANIZATION/OU=$ORG_UNIT/CN=ServiceRadar Root CA"
-  chmod 644 "$CERT_DIR/root.pem"; chmod 640 "$CERT_DIR/root-key.pem"
+  chmod 644 "$CERT_DIR/root.pem"; chmod 600 "$CERT_DIR/root-key.pem"
 fi

[To ensure code accuracy, apply this suggestion manually]
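
The difference between 640 and 600 is the group-read bit. A small Python check of the tightened mode on a POSIX system (the temp file stands in for root-key.pem):

```python
import os
import stat
import tempfile

# Create a stand-in for the private key and restrict it to the owner only.
fd, key_path = tempfile.mkstemp(suffix="-key.pem")
os.close(fd)

os.chmod(key_path, 0o600)  # rw for owner; no group or world access

mode = stat.S_IMODE(os.stat(key_path).st_mode)
assert mode == 0o600
# 640 would additionally grant group read; 600 does not:
assert not (mode & stat.S_IRGRP)

os.unlink(key_path)
```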

Suggestion importance[1-10]: 6

__

Why: The suggestion correctly points out an opportunity to improve security by restricting permissions on the root-key.pem file, aligning it with best practices and other keys in the script.

Low
Imported GitHub PR comment. Original author: @qodo-code-review[bot]. Original URL: https://github.com/carverauto/serviceradar/pull/2275#issuecomment-3742116331. Original created: 2026-01-13T05:59:52Z.