schema fixes, agent update #2636

Merged
mfreeman451 merged 9 commits from refs/pull/2636/head into testing 2026-01-08 20:46:30 +00:00
mfreeman451 commented 2026-01-08 18:54:12 +00:00 (Migrated from github.com)
Owner

Imported from GitHub pull request.

Original GitHub pull request: #2229
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/pull/2229
Original created: 2026-01-08T18:54:12Z
Original updated: 2026-01-08T20:46:37Z
Original head: carverauto/serviceradar:updates/tenant_schema_work
Original base: testing
Original merged: 2026-01-08T20:46:30Z by @mfreeman451

User description

IMPORTANT: Please sign the Developer Certificate of Origin

Thank you for your contribution to ServiceRadar. Please note that contributions must include
a DCO sign-off statement, indicating DCO acceptance, in at least one commit message. Here
is an example DCO Signed-off-by line in a commit message:

Signed-off-by: J. Doe <j.doe@domain.com>

Describe your changes

Code checklist before requesting a review

  • I have signed the DCO.
  • The build completes without errors.
  • All tests pass when running make test.

PR Type

Enhancement


Description

  • Add automatic tenant discovery via cluster RPC when not configured

  • Refactor tenant configuration resolution with helper functions

  • Add tenant_id field to OCSF devices and agents tables

  • Update agent schema to include device_uid, host, port, spiffe_identity, status fields

  • Support multiple tenant ID environment variable sources

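The multi-source environment lookup mentioned above could be sketched roughly as follows. This is illustrative only: the module and function names are assumptions, not the actual API in config.ex.

```elixir
defmodule TenantEnvSketch do
  @moduledoc """
  Illustrative sketch of the multi-source tenant_id lookup described above.
  Names here are assumptions, not the actual config.ex implementation.
  """

  # Prefer the gateway-specific variable, then fall back to the
  # platform-wide one; returns nil when neither is set, which is the
  # case that triggers cluster RPC discovery.
  def tenant_id_from_env do
    System.get_env("GATEWAY_TENANT_ID") ||
      System.get_env("SERVICERADAR_PLATFORM_TENANT_ID")
  end
end
```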

Diagram Walkthrough

flowchart LR
  A["Agent Gateway"] -->|"No tenant config"| B["Cluster RPC Discovery"]
  B -->|"find_core_node"| C["Core Service"]
  C -->|"get_platform_tenant_info"| D["Platform Tenant Info"]
  D -->|"tenant_id, tenant_slug"| A
  E["Environment Variables"] -->|"GATEWAY_TENANT_ID"| A
  F["mTLS Certificate"] -->|"CN extraction"| A
  A -->|"Create Agent"| G["OCSF Agents Table"]
  G -->|"tenant_id scoped"| H["Multi-tenant Data"]

File Walkthrough

Relevant files
Enhancement
config.ex
Add cluster RPC tenant discovery with fallback                     

elixir/serviceradar_agent_gateway/lib/serviceradar_agent_gateway/config.ex

  • Add cluster RPC-based tenant discovery as fallback when env/cert
    sources unavailable
  • Extract tenant normalization and validation into separate helper
    functions
  • Support SERVICERADAR_PLATFORM_TENANT_ID environment variable in
    addition to GATEWAY_TENANT_ID
  • Implement retry logic with configurable max attempts and intervals for
    cluster discovery
+115/-49
agent_gateway_sync.ex
Add platform tenant info RPC endpoint                                       

elixir/serviceradar_core/lib/serviceradar/edge/agent_gateway_sync.ex

  • Add get_platform_tenant_info/0 function for RPC calls from
    agent-gateway
  • Include tenant_id in agent creation attributes
  • Add tenant_id to agent upsert field list
+23/-1   
agent.ex
Include tenant_id in agent upsert                                               

elixir/serviceradar_core/lib/serviceradar/infrastructure/agent.ex

  • Add tenant_id to agent upsert field list for proper tenant scoping
+2/-1     
Configuration changes
20260108011004_add_ocsf_and_identity_tables.exs
Add tenant_id and enhance OCSF schema                                       

elixir/serviceradar_core/priv/repo/tenant_migrations/20260108011004_add_ocsf_and_identity_tables.exs

  • Add tenant_id UUID NOT NULL column to ocsf_devices table
  • Add tenant_id UUID NOT NULL column to ocsf_agents table
  • Replace ip field with host and port fields in ocsf_agents
  • Add device_uid, spiffe_identity, status, is_healthy columns to
    ocsf_agents
  • Create indexes on tenant_id, host, status, and device_uid for
    ocsf_agents
  • Create index on tenant_id for ocsf_devices
+13/-2   
Documentation
docker-compose.yml
Document tenant auto-discovery configuration                         

docker-compose.yml

  • Add comment documenting optional tenant_id auto-discovery from core
+1/-0     
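The schema changes in the tenant migration entry above could look roughly like the following Ecto migration. This is a sketch reconstructed from the bullet list, not the actual migration file; the module name and column types beyond those stated are assumptions.

```elixir
defmodule ServiceRadar.Repo.TenantMigrations.AddTenantIdToOcsfSketch do
  use Ecto.Migration

  def change do
    # Scope OCSF devices and agents to a tenant.
    alter table(:ocsf_devices) do
      add :tenant_id, :uuid, null: false
    end

    alter table(:ocsf_agents) do
      add :tenant_id, :uuid, null: false
      # Replace the single ip field with host/port.
      remove :ip
      add :host, :string
      add :port, :integer
      add :device_uid, :string
      add :spiffe_identity, :string
      add :status, :string
      add :is_healthy, :boolean
    end

    create index(:ocsf_devices, [:tenant_id])
    create index(:ocsf_agents, [:tenant_id])
    create index(:ocsf_agents, [:host])
    create index(:ocsf_agents, [:status])
    create index(:ocsf_agents, [:device_uid])
  end
end
```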

qodo-code-review[bot] commented 2026-01-08 18:55:16 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2229#issuecomment-3725218613
Original created: 2026-01-08T18:55:16Z

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
Trusted RPC discovery

Description: The new cluster RPC tenant auto-discovery trusts the first connected node whose name
contains "serviceradar_core" and then accepts whatever tenant_id it returns, so if an
attacker can join the Erlang cluster (e.g., via leaked cookie/misconfigured distribution)
they could run a spoofed node and force the gateway to adopt an attacker-chosen tenant
context (cross-tenant mis-routing/data scoping impact).
config.ex [242-297]

Referred Code
# Discover tenant info from core via cluster RPC with retries
@discovery_max_attempts 30
@discovery_retry_interval_ms 2_000

defp discover_tenant_from_cluster do
  discover_tenant_from_cluster(1)
end

defp discover_tenant_from_cluster(attempt) when attempt > @discovery_max_attempts do
  raise "failed to discover platform tenant from cluster after #{@discovery_max_attempts} attempts"
end

defp discover_tenant_from_cluster(attempt) do
  case find_core_node() do
    nil ->
      Logger.debug("No core node found (attempt #{attempt}/#{@discovery_max_attempts}), retrying...")
      Process.sleep(@discovery_retry_interval_ms)
      discover_tenant_from_cluster(attempt + 1)

    core_node ->
      case rpc_get_platform_tenant(core_node) do


 ... (clipped 35 lines)
Ticket Compliance
🎫 No ticket provided
  • Create ticket/issue
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢
Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

🔴
Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status:
Unstructured sensitive logs: Newly added logs are plain-string (non-structured) and include tenant identifiers and
cluster node details, reducing auditability and potentially exposing sensitive
identifiers.

Referred Code
    Logger.info("No tenant_id configured; discovering from cluster...")
    discover_tenant_from_cluster()
  else
    validate_tenant_id!(tenant_id)
    {tenant_id, tenant_slug || platform_tenant_slug()}
  end

# Build NATS prefix
nats_prefix = tenant_slug

config = %{
  partition_id: partition_id,
  gateway_id: gateway_id,
  domain: domain,
  capabilities: capabilities,
  tenant_id: tenant_id,
  tenant_slug: tenant_slug,
  nats_prefix: nats_prefix
}

Logger.info("Gateway configured for tenant: #{tenant_slug} (ID: #{tenant_id})")


 ... (clipped 81 lines)

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status:
Missing slug validation: The cluster-discovered tenant_slug is returned and used without normalization/format
validation, allowing invalid values to flow into nats_prefix and other tenant-scoped
behavior.

Referred Code
case rpc_get_platform_tenant(core_node) do
  {:ok, %{tenant_id: tenant_id, tenant_slug: tenant_slug}} ->
    validate_tenant_id!(tenant_id)
    Logger.info("Discovered platform tenant from #{core_node}: #{tenant_slug} (#{tenant_id})")
    {tenant_id, tenant_slug}

Learn more about managing compliance generic rules or creating your own custom rules
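One way to address this finding would be a slug validator alongside the existing validate_tenant_id!/1. This is a hypothetical sketch: the permitted format (lowercase alphanumerics and dashes) is an assumption, not a rule stated in the PR.

```elixir
defmodule TenantSlugSketch do
  # Hypothetical guard against malformed slugs flowing into nats_prefix.
  # The allowed character set is an assumption for illustration.
  def validate_tenant_slug!(slug) when is_binary(slug) do
    if Regex.match?(~r/^[a-z0-9][a-z0-9-]*$/, slug) do
      slug
    else
      raise ArgumentError, "invalid tenant slug: #{inspect(slug)}"
    end
  end
end
```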

Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status:
Startup crash path: Tenant auto-discovery failure leads to a hard raise after retries which may not provide
graceful degradation or a fallback path depending on runtime expectations.

Referred Code
defp discover_tenant_from_cluster(attempt) when attempt > @discovery_max_attempts do
  raise "failed to discover platform tenant from cluster after #{@discovery_max_attempts} attempts"
end

defp discover_tenant_from_cluster(attempt) do
  case find_core_node() do
    nil ->
      Logger.debug("No core node found (attempt #{attempt}/#{@discovery_max_attempts}), retrying...")
      Process.sleep(@discovery_retry_interval_ms)
      discover_tenant_from_cluster(attempt + 1)

    core_node ->
      case rpc_get_platform_tenant(core_node) do
        {:ok, %{tenant_id: tenant_id, tenant_slug: tenant_slug}} ->
          validate_tenant_id!(tenant_id)
          Logger.info("Discovered platform tenant from #{core_node}: #{tenant_slug} (#{tenant_id})")
          {tenant_id, tenant_slug}

        {:error, :not_ready} ->
          Logger.debug("Core node #{core_node} not ready (attempt #{attempt}/#{@discovery_max_attempts}), retrying...")
          Process.sleep(@discovery_retry_interval_ms)


 ... (clipped 7 lines)

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status:
Internal detail leakage: RPC failure logging includes inspect(reason) and node names which could expose internal
cluster details if logs are user-accessible.

Referred Code
{:error, reason} ->
  Logger.warning("RPC to #{core_node} failed: #{inspect(reason)} (attempt #{attempt}/#{@discovery_max_attempts})")
  Process.sleep(@discovery_retry_interval_ms)
  discover_tenant_from_cluster(attempt + 1)

Learn more about managing compliance generic rules or creating your own custom rules

Compliance status legend
🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label
qodo-code-review[bot] commented 2026-01-08 18:56:33 +00:00 (Migrated from github.com)
Author
Owner

Imported GitHub PR comment.

Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2229#issuecomment-3725223177
Original created: 2026-01-08T18:56:33Z

PR Code Suggestions

Explore these optional code suggestions:

Category | Suggestion | Impact
Possible issue
Avoid blocking GenServer initialization during discovery

Refactor the init/1 function to perform tenant discovery asynchronously using
handle_continue/2. This will prevent the GenServer from blocking and timing out
during initialization.

elixir/serviceradar_agent_gateway/lib/serviceradar_agent_gateway/config.ex [160-198]

 def init(opts) do
   partition_id = Keyword.fetch!(opts, :partition_id)
   gateway_id = Keyword.fetch!(opts, :gateway_id)
   domain = Keyword.fetch!(opts, :domain)
   capabilities = Keyword.get(opts, :capabilities, [])
 
   # Get tenant info from environment, certificate, or cluster RPC
   # System is always multi-tenant - default tenant is used if none specified
   {tenant_id, tenant_slug} = resolve_tenant_info(opts)
   tenant_id = normalize_tenant_id(tenant_id)
   tenant_slug = normalize_tenant_slug(tenant_slug)
 
-  {tenant_id, tenant_slug} =
-    if is_nil(tenant_id) do
-      # No tenant_id from env/cert - discover via cluster RPC
-      Logger.info("No tenant_id configured; discovering from cluster...")
-      discover_tenant_from_cluster()
-    else
-      validate_tenant_id!(tenant_id)
-      {tenant_id, tenant_slug || platform_tenant_slug()}
-    end
-
-  # Build NATS prefix
-  nats_prefix = tenant_slug
-
-  config = %{
+  initial_state = %{
     partition_id: partition_id,
     gateway_id: gateway_id,
     domain: domain,
     capabilities: capabilities,
     tenant_id: tenant_id,
-    tenant_slug: tenant_slug,
-    nats_prefix: nats_prefix
+    tenant_slug: tenant_slug
   }
 
-  Logger.info("Gateway configured for tenant: #{tenant_slug} (ID: #{tenant_id})")
+  if is_nil(tenant_id) do
+    # No tenant_id from env/cert - discover via cluster RPC
+    Logger.info("No tenant_id configured; will discover from cluster...")
+    {:ok, initial_state, {:continue, :discover_tenant}}
+  else
+    validate_tenant_id!(tenant_id)
+    tenant_slug = tenant_slug || platform_tenant_slug()
+    # Build NATS prefix
+    nats_prefix = tenant_slug
 
-  {:ok, config}
+    config =
+      initial_state
+      |> Map.put(:tenant_slug, tenant_slug)
+      |> Map.put(:nats_prefix, nats_prefix)
+
+    Logger.info("Gateway configured for tenant: #{tenant_slug} (ID: #{config.tenant_id})")
+    {:ok, config}
+  end
 end
 
+@impl true
+def handle_continue(:discover_tenant, state) do
+  try do
+    {tenant_id, tenant_slug} = discover_tenant_from_cluster()
+
+    config =
+      state
+      |> Map.put(:tenant_id, tenant_id)
+      |> Map.put(:tenant_slug, tenant_slug)
+      |> Map.put(:nats_prefix, tenant_slug)
+
+    Logger.info("Gateway configured for tenant: #{tenant_slug} (ID: #{tenant_id})")
+    {:noreply, config}
+  catch
+    kind, reason ->
+      stacktrace = __STACKTRACE__
+      Logger.error("Failed to discover tenant from cluster: #{kind}: #{inspect(reason)}")
+      {:stop, {:discovery_failed, {kind, reason, stacktrace}}, state}
+  end
+end
+

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 9


Why: The suggestion correctly identifies a critical issue where blocking in init/1 with Process.sleep/1 will cause supervisor timeouts, leading to a crash loop. The proposed asynchronous solution using handle_continue/2 is the correct and idiomatic Elixir/OTP fix for this problem.

High
General
Tighten core node detection

In find_core_node/0, replace the String.contains? check with a stricter regex
~r/^serviceradar_core(_elx)?@/ to more accurately identify the core node and
avoid false positives.

elixir/serviceradar_agent_gateway/lib/serviceradar_agent_gateway/config.ex [281-288]

 defp find_core_node do
   Node.list()
   |> Enum.find(fn node ->
     node_str = Atom.to_string(node)
-    String.contains?(node_str, "serviceradar_core")
+    Regex.match?(~r/^serviceradar_core(_elx)?@/, node_str)
   end)
 end
Suggestion importance[1-10]: 5


Why: The suggestion improves the robustness of node discovery by using a more specific regex, preventing potential mismatches with other nodes that might contain serviceradar_core in their name. This is a good practice for reliability.

Low
Imported GitHub PR comment. Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2229#issuecomment-3725223177
Original created: 2026-01-08T18:56:33Z

PR Code Suggestions ✨

Explore these optional code suggestions:

Possible issue — Avoid blocking GenServer initialization during discovery

Refactor the init/1 function to perform tenant discovery asynchronously using
handle_continue/2. This will prevent the GenServer from blocking and timing out
during initialization.

elixir/serviceradar_agent_gateway/lib/serviceradar_agent_gateway/config.ex [160-198]

 def init(opts) do
   partition_id = Keyword.fetch!(opts, :partition_id)
   gateway_id = Keyword.fetch!(opts, :gateway_id)
   domain = Keyword.fetch!(opts, :domain)
   capabilities = Keyword.get(opts, :capabilities, [])

   # Get tenant info from environment, certificate, or cluster RPC
   # System is always multi-tenant - default tenant is used if none specified
   {tenant_id, tenant_slug} = resolve_tenant_info(opts)
   tenant_id = normalize_tenant_id(tenant_id)
   tenant_slug = normalize_tenant_slug(tenant_slug)

-  {tenant_id, tenant_slug} =
-    if is_nil(tenant_id) do
-      # No tenant_id from env/cert - discover via cluster RPC
-      Logger.info("No tenant_id configured; discovering from cluster...")
-      discover_tenant_from_cluster()
-    else
-      validate_tenant_id!(tenant_id)
-      {tenant_id, tenant_slug || platform_tenant_slug()}
-    end
-
-  # Build NATS prefix
-  nats_prefix = tenant_slug
-
-  config = %{
+  initial_state = %{
     partition_id: partition_id,
     gateway_id: gateway_id,
     domain: domain,
     capabilities: capabilities,
     tenant_id: tenant_id,
-    tenant_slug: tenant_slug,
-    nats_prefix: nats_prefix
+    tenant_slug: tenant_slug
   }

-  Logger.info("Gateway configured for tenant: #{tenant_slug} (ID: #{tenant_id})")
+  if is_nil(tenant_id) do
+    # No tenant_id from env/cert - discover via cluster RPC
+    Logger.info("No tenant_id configured; will discover from cluster...")
+    {:ok, initial_state, {:continue, :discover_tenant}}
+  else
+    validate_tenant_id!(tenant_id)
+    tenant_slug = tenant_slug || platform_tenant_slug()
+    # Build NATS prefix
+    nats_prefix = tenant_slug

-  {:ok, config}
+    config =
+      initial_state
+      |> Map.put(:tenant_slug, tenant_slug)
+      |> Map.put(:nats_prefix, nats_prefix)
+
+    Logger.info("Gateway configured for tenant: #{tenant_slug} (ID: #{config.tenant_id})")
+    {:ok, config}
+  end
 end

+@impl true
+def handle_continue(:discover_tenant, state) do
+  try do
+    {tenant_id, tenant_slug} = discover_tenant_from_cluster()
+
+    config =
+      state
+      |> Map.put(:tenant_id, tenant_id)
+      |> Map.put(:tenant_slug, tenant_slug)
+      |> Map.put(:nats_prefix, tenant_slug)
+
+    Logger.info("Gateway configured for tenant: #{tenant_slug} (ID: #{tenant_id})")
+    {:noreply, config}
+  catch
+    kind, reason ->
+      stacktrace = __STACKTRACE__
+      Logger.error("Failed to discover tenant from cluster: #{kind}: #{inspect(reason)}")
+      {:stop, {:discovery_failed, {kind, reason, stacktrace}}, state}
+  end
+end

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 9

Why: The suggestion correctly identifies a critical issue where blocking in init/1 with Process.sleep/1 will cause supervisor timeouts, leading to a crash loop. The proposed asynchronous solution using handle_continue/2 is the correct and idiomatic Elixir/OTP fix for this problem.

High

General — Tighten core node detection (importance 5, Low): the same suggestion as shown above.
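The init/handle_continue split proposed above can be sketched generically. The module and names below are illustrative stand-ins, not the gateway's actual code: init/1 returns immediately so the supervisor is never blocked, and the slow discovery step runs right after, before any other message is processed.

```elixir
# Minimal, generic sketch of the handle_continue/2 startup pattern
# (module and names are illustrative, not ServiceRadar's actual code).
defmodule SlowStartServer do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts) do
    # Return immediately so the supervisor is not blocked;
    # defer the expensive discovery step to handle_continue/2.
    {:ok, %{opts: opts, discovered: nil}, {:continue, :discover}}
  end

  @impl true
  def handle_continue(:discover, state) do
    # Stand-in for a blocking call such as a cluster RPC.
    discovered = node()
    {:noreply, %{state | discovered: discovered}}
  end

  @impl true
  def handle_call(:discovered, _from, state) do
    {:reply, state.discovered, state}
  end
end

{:ok, pid} = SlowStartServer.start_link([])
IO.inspect(GenServer.call(pid, :discovered))
```

Because the `{:continue, ...}` step runs before the GenServer picks up any queued message, callers that issue a call right after start_link/1 still see the fully discovered state.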