Refactor/zen tenant isolation #2658
Imported from GitHub pull request.
Original GitHub pull request: #2275
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/pull/2275
Original created: 2026-01-13T05:57:41Z
Original updated: 2026-01-13T06:12:34Z
Original head: carverauto/serviceradar:refactor/zen-tenant-isolation
Original base: staging
Original merged: 2026-01-13T06:12:32Z by @mfreeman451
User description
IMPORTANT: Please sign the Developer Certificate of Origin
Thank you for your contribution to ServiceRadar. Please note, when contributing, the developer must include
a DCO sign-off statement indicating the DCO acceptance in one commit message. Here
is an example DCO Signed-off-by line in a commit message:
Describe your changes
Issue ticket number and link
Code checklist before requesting a review
PR Type
Enhancement, Bug fix
Description
Migrate core service to Elixir implementation with clustering support
Fix JetStream subject configuration by removing invalid leading wildcards
Correct tenant migration execution order for zen_rule_templates index updates
Add platform tenant lifecycle event publishing with retry mechanism
Implement CNPG certificate generation and client certificate authentication
File Walkthrough
7 files
Add tenant lifecycle event publishing with retries
Add CNPG CA and client certificate generation
Add Helm hooks and simplify secret reconciliation logic
Replace core with core-elx Elixir implementation
Use configurable CNPG client certificate paths
Add CNPG client certificate configuration variables
Use configurable CNPG client certificate paths
2 files
Make index creation idempotent with IF NOT EXISTS
Fix zen JetStream subject configuration patterns
2 files
Update image tag to latest commit
Add CNPG client certificate name configuration options
Imported GitHub PR comment.
Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2275#issuecomment-3742112403
Original created: 2026-01-13T05:58:25Z
PR Compliance Guide 🔍
Below is a summary of compliance checks for this PR:
Private key exposure
Description: The script generates and stores long-lived private CA and client keys (e.g.,
cnpg-ca-key.pem, cnpg-client-key.pem) under CERT_DIR (typically a shared PVC), and other
workloads mount this same cert directory, so a compromise of any pod with access to that
volume could exfiltrate the CA private key and mint trusted CNPG client certificates to
access the database.
generate-certs.sh [22-107]
Referred Code
🎫 No ticket provided
Codebase context is not defined
Follow the guide to enable codebase context checks.
Generic: Meaningful Naming and Self-Documenting Code
Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting
Status: Passed
Generic: Secure Error Handling
Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.
Status: Passed
Generic: Security-First Input Validation and Data Handling
Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities
Status: Passed
Generic: Robust Error Handling and Edge Case Management
Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation
Status:
Unbounded retry loop: The new tenant lifecycle publish retry mechanism re-schedules retries indefinitely with a
fixed delay and no max attempts/backoff, which can cause endless retries and log noise
during persistent failure.
Referred Code
Generic: Comprehensive Audit Trails
Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.
Status:
Missing success audit: Platform tenant create/update triggers lifecycle events but the new code only logs
failures (not successful outcomes with actor/user context), so it is unclear whether audit
trail requirements are met.
Referred Code
Generic: Secure Logging Practices
Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.
Status:
Possible sensitive reason: Logging reason: inspect(reason) from Ash.read/2 failures may
include internal details from underlying adapters, so it is unclear from the diff alone
whether sensitive information could be emitted to logs.
Referred Code
Compliance status legend
🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label
Imported GitHub PR comment.
Original author: @qodo-code-review[bot]
Original URL: https://github.com/carverauto/serviceradar/pull/2275#issuecomment-3742116331
Original created: 2026-01-13T05:59:52Z
PR Code Suggestions ✨
Explore these optional code suggestions:
✅
Prevent an infinite retry loop
Suggestion Impact:
The commit adds explicit handle_info/2 clauses for {:error, :not_found} (and also {:error, {:unexpected_count, count}}) that log and stop scheduling further retries, addressing the infinite retry concern for unrecoverable tenant-fetch errors. It also introduces attempt tracking and exponential backoff with jitter for retriable errors, though it still retries on generic {:error, reason}.
code diff:
In handle_info/2, add specific case clauses for {:error, :not_found} and other
unrecoverable errors to prevent an infinite retry loop when a tenant cannot be
fetched.
elixir/serviceradar_core/lib/serviceradar/identity/platform_tenant_bootstrap.ex [199-216]
[Suggestion processed]
Suggestion importance[1-10]: 9
Why: The suggestion correctly identifies a potential infinite retry loop if a tenant is not found, which is a significant bug that could lead to resource exhaustion and log spam.
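The split this suggestion draws, between errors worth retrying and errors that should end the retry cycle, can be sketched outside the GenServer. The Python sketch below is illustrative only and is not the code in platform_tenant_bootstrap.ex. The function names and the tuple encoding of results are assumptions; only the error shapes not_found and unexpected_count come from the suggestion impact note.

```python
from typing import Tuple, Union

# Hypothetical encoding of the Elixir results: ("ok",) or ("error", reason).
# The two unrecoverable reasons named in the review are "not_found" and
# ("unexpected_count", n); everything else is treated as retriable.
Reason = Union[str, Tuple[str, int]]


def is_unrecoverable(reason: Reason) -> bool:
    """Return True for errors that retrying cannot fix."""
    if reason == "not_found":
        return True
    if isinstance(reason, tuple) and len(reason) == 2 and reason[0] == "unexpected_count":
        return True
    return False


def next_action(result: tuple) -> str:
    """Decide what a retry handler should do with a tenant-fetch result.

    Returns "done" on success, "stop" for unrecoverable errors
    (log once and schedule nothing further), and "retry" otherwise.
    """
    if result[0] == "ok":
        return "done"
    reason = result[1]
    return "stop" if is_unrecoverable(reason) else "retry"
```

As the impact note points out, a generic error still falls into the "retry" branch, so a cap on attempts or a growing delay is still needed to bound that path.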
✅
Implement exponential backoff for retries
Suggestion Impact:
The fixed @publish_retry_delay retry was replaced with an exponential backoff retry mechanism: retry messages now include an attempt count, delays are calculated using exponential growth capped by a max delay and include jitter, and retry scheduling uses the computed delay. The handler was updated to support the new message shape and initial attempts.
code diff:
Replace the fixed-delay retry mechanism for platform tenant event publishing
with an exponential backoff strategy. This will improve system resilience during
outages.
Examples:
elixir/serviceradar_core/lib/serviceradar/identity/platform_tenant_bootstrap.ex [198-236]
Solution Walkthrough:
Before:
After:
Suggestion importance[1-10]: 8
Why: The suggestion correctly identifies that the implemented retry mechanism uses a fixed delay, contrary to the PR's stated goal of exponential backoff, and proposes a more resilient best-practice strategy.
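The delay calculation described in the impact note (exponential growth, capped at a maximum, plus jitter) can be sketched as follows. This is a minimal Python illustration, not the merged Elixir code. The constants, the function name retry_delay_ms, and the choice of additive jitter are all assumptions made for the example.

```python
import random
from typing import Optional

# Illustrative values only; the project's real settings are not shown in this PR page.
BASE_DELAY_MS = 1_000
MAX_DELAY_MS = 60_000
JITTER_RATIO = 0.2


def retry_delay_ms(attempt: int, rng: Optional[random.Random] = None) -> int:
    """Return the delay in milliseconds before retry number `attempt` (1-based).

    The base delay doubles with each attempt and is capped at MAX_DELAY_MS.
    A random extra of up to JITTER_RATIO of the capped delay is then added
    so that many retrying processes do not fire at the same instant.
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    if rng is None:
        rng = random.Random()
    capped = min(BASE_DELAY_MS * 2 ** (attempt - 1), MAX_DELAY_MS)
    jitter = rng.randint(0, int(capped * JITTER_RATIO))
    return capped + jitter
```

A caller would carry the attempt count in the retry message, as the impact note describes, and pass it to this function each time a retry is scheduled.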
Improve security of root key
Change the permissions of root-key.pem from 640 to 600 to restrict access to the
owner only, enhancing security and aligning with the principle of least
privilege.
helm/serviceradar/files/generate-certs.sh [17-21]
[To ensure code accuracy, apply this suggestion manually]
Suggestion importance[1-10]: 6
Why: The suggestion correctly points out an opportunity to improve security by restricting permissions on the
root-key.pem file, aligning it with best practices and other keys in the script.
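In the shell script itself the change amounts to using chmod 600 instead of chmod 640 on the key file. For readers who want to verify the resulting mode programmatically, here is a small Python sketch of the same operation. The helper name restrict_to_owner is hypothetical, and the path passed to it is whatever location the key actually has in a given deployment.

```python
import os
import stat


def restrict_to_owner(path: str) -> int:
    """Set `path` to mode 600 (owner read/write only) and return the new mode bits."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)  # equivalent to 0o600
    return stat.S_IMODE(os.stat(path).st_mode)
```

Mode 640 leaves the file readable by the owning group, while 600 removes group access entirely, which is the point of the suggestion.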