feat(agent): Agent updates/SIEM #1071

Open
opened 2026-03-28 04:31:23 +00:00 by mfreeman451 · 2 comments

Imported from GitHub.

Original GitHub issue: #2936
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/2936
Original created: 2026-02-27T21:15:40Z


Product Requirements Document: ServiceRadar Next-Gen SIEM & Observability Platform

Version: 2.0
Author: Carver Automation Corporation
Date: February 2026
Status: Draft


1. Executive Summary

ServiceRadar is evolving from a network visibility tool into a true Single Pane of Glass (SPOG) for automated datacenter management, observability, and next-generation SIEM. Target environments include bare metal servers, Proxmox (VMs/LXCs), and Kubernetes clusters.

By utilizing a pure-Golang single-binary agent, a high-speed gRPC/NATS data plane, an Elixir/BEAM backend, and a heavily extended CloudNativePG (CNPG) database, ServiceRadar will provide Datadog/CrowdStrike-tier capabilities—entirely open-source and natively within our own infrastructure.

1.1 Design Principles

  • Zero Agent Fatigue: Users deploy exactly one lightweight, memory-safe Go binary (serviceradar-agent). No Wazuh, no Fleet/osquery, no legacy C/C++ agents on endpoints.
  • Memory Safety First: All endpoint code is pure Go. Legacy C-based tools (OSSEC/Wazuh, osquery) are explicitly excluded due to memory safety concerns and operational overhead.
  • Engine-Driven Architecture: ServiceRadar is the brain (UI/Logic), AWX is the hands (execution), NATS JetStream is the central nervous system (transport), and CNPG is the memory (storage/analytics).
  • Kubernetes-Native Where It Counts: Falco DaemonSets run natively in K8s for runtime security, publishing directly to NATS within the trusted cluster boundary—bypassing the edge gateway.

2. Architectural Overview

The platform is divided into four strictly controlled layers, with one specific exception for Kubernetes-native security telemetry.

2.1 Layer Diagram

┌──────────────────────────────────────────────────────────────────────┐
│                    COLLECTION & PROTECTION (Edge)                     │
│                                                                      │
│  ┌─────────────────────┐              ┌──────────────────────────┐   │
│  │  serviceradar-agent  │              │  Falco DaemonSets (K8s)  │   │
│  │  (Pure Go Binary)    │              │  eBPF Runtime Security   │   │
│  │                      │              │                          │   │
│  │  • gopsutil metrics  │              │  • Syscall monitoring    │   │
│  │  • Trivy SBOM/CVE    │              │  • FIM / container       │   │
│  │  • XDP firewalling   │              │    escape detection      │   │
│  │  • Container SDKs    │              │                          │   │
│  └─────────┬────────────┘              └────────────┬─────────────┘   │
│            │ gRPC                                   │ Direct NATS     │
│            │ (mTLS)                                 │ (K8s Secrets)   │
├────────────┼────────────────────────────────────────┼─────────────────┤
│            │       TRANSPORT & STREAMING (Pipeline) │                 │
│            ▼                                        │                 │
│  ┌─────────────────────┐                            │                 │
│  │   agent-gateway      │                            │                 │
│  │   (Go, gRPC Proxy)   │                            │                 │
│  └─────────┬────────────┘                            │                 │
│            │                                        │                 │
│            ▼                                        ▼                 │
│  ┌──────────────────────────────────────────────────────────────┐     │
│  │                  NATS JetStream / Object Store                │     │
│  │                                                              │     │
│  │  Subjects:                                                   │     │
│  │  • telemetry.metrics.edge     (from gateway)                 │     │
│  │  • security.falco.alerts      (direct from Falcosidekick)    │     │
│  │  • commands.agent.{id}        (to gateway → agents)          │     │
│  │  • scan.trivy.results         (from agents via gateway)      │     │
│  │                                                              │     │
│  │  Object Store: Trivy CVE DB bundles                          │     │
│  └──────────────────────────────┬───────────────────────────────┘     │
├─────────────────────────────────┼────────────────────────────────────┤
│                                 ▼                                    │
│            PROCESSING & ORCHESTRATION (Core)                         │
│                                                                      │
│  ┌─────────────────────────────────────────────────────────────┐     │
│  │   core-elx (Elixir / Ash Framework)                          │     │
│  │                                                             │     │
│  │   • Broadway consumers (NATS → DB pipeline)                 │     │
│  │   • AshOban scheduled jobs (Trivy DB fetch, downsampling)   │     │
│  │   • Nx/Bumblebee (local AI embeddings on CPU)               │     │
│  │   • SRQL parser (DSL → ParadeDB + pgvector SQL)             │     │
│  │   • AWX REST API integration (remediation webhooks)         │     │
│  └──────────────┬─────────────────────────────┬────────────────┘     │
│                 │                             │                       │
│                 ▼                             ▼                       │
│  ┌──────────────────────┐      ┌──────────────────────────────┐     │
│  │   AWX (Ansible/K8s)   │      │   CNPG (CloudNativePG)       │     │
│  │   Remediation Engine  │      │                              │     │
│  └──────────────────────┘      │   Extensions:                │     │
│                                │   • TimescaleDB              │     │
│                                │   • ParadeDB (BM25)          │     │
│                                │   • pgvector (HNSW)          │     │
│                                │   • Apache AGE (Graph)       │     │
│                                │   • PostGIS                  │     │
│                                └──────────────────────────────┘     │
└──────────────────────────────────────────────────────────────────────┘

2.2 Data Flow Summary

| Source | Transport | NATS Subject | Processor | Storage |
|--------|-----------|--------------|-----------|---------|
| Edge agent metrics (gopsutil) | gRPC → Gateway | telemetry.metrics.edge | Broadway | TimescaleDB hypertable |
| Falco K8s alerts | Falcosidekick → Direct NATS | security.falco.alerts | Broadway | TimescaleDB + pgvector |
| Trivy scan results | gRPC → Gateway | scan.trivy.results | Broadway | Relational tables |
| XDP firewall drops | gRPC → Gateway | security.xdp.drops | Broadway | TimescaleDB |
| Network connections | gRPC → Gateway | topology.connections | Broadway | Apache AGE graph |
| Commands to agents | core-elx → NATS → Gateway | commands.agent.{id} | Gateway | N/A (passthrough) |

3. Core Epics & Feature Requirements

Epic 1: The Edge Go Agent (serviceradar-agent)

Goal: Eliminate agent fatigue at the edge with one highly optimized Go binary. No Wazuh, no Fleet/osquery—deploy one binary that does everything.

1.1 Native Telemetry (gopsutil + Container SDKs)

  • Collect CPU, RAM, Disk, and Network I/O using shirou/gopsutil.
  • Integrate Docker/containerd SDK (github.com/docker/docker/client) and LXC Go SDK (github.com/lxc/go-lxc) to map raw Linux PIDs to specific container names, image tags, and Proxmox LXC IDs.
  • Implement a state-diff algorithm in the agent: maintain an in-memory process map and only publish deltas (Process Started / Process Stopped) to NATS JetStream to minimize bandwidth.
  • Use gopsutil/net.Connections() to map active network connections for topology discovery.
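The state-diff loop can be sketched as follows; ProcInfo and Delta are illustrative stand-ins for the agent's real types, and the NATS publish step is omitted:

```go
package main

import "fmt"

// ProcInfo is a minimal stand-in for the per-process fields gopsutil exposes.
type ProcInfo struct {
	PID  int32
	Name string
}

// Delta is the event the agent would publish to NATS JetStream.
type Delta struct {
	Event string // "process_started" or "process_stopped"
	Proc  ProcInfo
}

// DiffProcessMap compares the previous and current process snapshots and
// returns only the deltas, so unchanged processes cost zero bandwidth.
func DiffProcessMap(prev, curr map[int32]ProcInfo) []Delta {
	var deltas []Delta
	for pid, p := range curr {
		if _, ok := prev[pid]; !ok {
			deltas = append(deltas, Delta{Event: "process_started", Proc: p})
		}
	}
	for pid, p := range prev {
		if _, ok := curr[pid]; !ok {
			deltas = append(deltas, Delta{Event: "process_stopped", Proc: p})
		}
	}
	return deltas
}

func main() {
	prev := map[int32]ProcInfo{1: {1, "init"}, 42: {42, "nginx"}}
	curr := map[int32]ProcInfo{1: {1, "init"}, 99: {99, "sshd"}}
	for _, d := range DiffProcessMap(prev, curr) {
		fmt.Println(d.Event, d.Proc.Name)
	}
	// prints:
	// process_started sshd
	// process_stopped nginx
}
```

On each tick the agent would snapshot via gopsutil, diff against the retained map, publish the deltas, and swap the maps.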

1.2 Embedded Vulnerability Scanner (Trivy)

  • Agent utilizes Trivy libraries natively to generate full Software Bill of Materials (SBOMs) for the OS rootfs and container filesystems.
  • Trivy replaces the need for custom dpkg/rpm parsing—its rootfs scan mode outputs structured JSON of all installed OS packages, Python packages, Node modules, etc.
  • CVE database updates are received via the air-gapped distribution pipeline (see Epic 2.4), enabling fully offline scanning.

1.3 eBPF / XDP Network Security (Edge Firewalling)

  • Load compiled eBPF programs into the kernel and attach them at the NIC driver's XDP hook via cilium/ebpf to drop malicious packets at wire speed on bare metal and Proxmox hosts.
  • XDP inspects and drops packets (DDoS floods, known-bad IPs) at the NIC driver level before the kernel allocates socket buffers—filtering runs at line rate.

1.4 Kubernetes Runtime Security (The Falco Exception)

  • Instead of rewriting eBPF syscall monitoring in the Go agent for K8s, ServiceRadar deploys standard Falco DaemonSets within K8s clusters.
  • Falco handles File Integrity Monitoring (FIM), reverse shell detection, and container escapes natively.
  • Falcosidekick connects directly to the NATS cluster using injected K8s secrets and publishes eBPF security events straight to the security.falco.alerts JetStream subject.
  • Rationale: Falco is constrained to Kubernetes environments where we already have a secure, trusted network boundary and K8s secrets management. Pushing high-velocity K8s syscall events through the edge agent-gateway would be an unnecessary bottleneck.

Epic 2: The Data Plane (gRPC + NATS JetStream)

Goal: Create a massively scalable, back-pressure-resistant pipeline. Edge agents are strictly proxied through the gateway, while trusted K8s workloads get direct NATS access.

2.1 Command & Control (Bi-Directional gRPC Streaming)

  • The agent opens a single long-lived bi-directional gRPC stream to the agent-gateway.
  • The gateway subscribes to NATS JetStream subject commands.agent.{agent_id}.
  • When core-elx drops a command into NATS (e.g., "trigger Trivy scan", "update XDP rules"), the gateway instantly pushes it down the open gRPC stream to the target agent.
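A minimal sketch of the gateway-side subject routing, assuming a hypothetical AgentIDFromSubject helper (the gateway's real routing code is not specified here):

```go
package main

import (
	"fmt"
	"strings"
)

// AgentIDFromSubject extracts the target agent from a commands.agent.{id}
// subject so the gateway can look up the matching open gRPC stream.
func AgentIDFromSubject(subject string) (string, bool) {
	const prefix = "commands.agent."
	if !strings.HasPrefix(subject, prefix) {
		return "", false
	}
	id := strings.TrimPrefix(subject, prefix)
	return id, id != ""
}

func main() {
	id, ok := AgentIDFromSubject("commands.agent.edge-7f3a")
	fmt.Println(id, ok) // prints: edge-7f3a true
}
```

With the ID in hand, the gateway writes the command payload to that agent's long-lived stream; subjects outside the commands.agent.* space are ignored.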

2.2 Edge Telemetry Firehose (Client-Side gRPC Streaming)

  • Bare metal/LXC agents batch gopsutil metrics, Trivy results, and XDP drop events, streaming them to the gateway via client-side gRPC streaming.
  • The gateway batches incoming gRPC chunks and publishes to NATS JetStream (telemetry.metrics.edge, scan.trivy.results, security.xdp.drops).
  • JetStream acts as the shock absorber so core-elx is never overwhelmed by event bursts.

2.3 The Kubernetes Exception (Direct NATS Ingestion)

  • Because Falco runs inside the trusted K8s network boundary, it bypasses the agent-gateway entirely.
  • Falcosidekick authenticates to NATS using K8s-injected secrets and publishes directly to security.falco.alerts.
  • This avoids the unnecessary bottleneck of routing high-velocity syscall events through the gRPC proxy layer.

2.4 Air-Gapped Trivy DB Distribution (Server-Side gRPC Streaming)

The Trivy CVE database can be 50–100MB+. This pipeline ensures agents can scan offline without hitting GitHub rate limits.

  1. Fetch: core-elx runs an AshOban scheduled job every 12 hours to download the latest Trivy DB bundle (tar.gz).
  2. Store: core-elx writes the bundle into the NATS JetStream Object Store (NATS automatically chunks and distributes across the cluster).
  3. Signal: agent-gateway publishes a lightweight notification to a NATS KV bucket or broadcast subject: {"event": "trivy_db_update", "version": "v2.1.0"}.
  4. Stream: The agent-gateway pulls the DB from NATS Object Store, chunks it into 2MB gRPC blocks, and uses server-side streaming to deliver it to edge agents.
  5. Scan: Agents reassemble the chunks on local disk and run Trivy scans fully offline.
  6. Report: Scan results (JSON SBOMs + CVE matches) are pushed back up via client-side gRPC streaming to NATS.

Epic 3: Smart Execution & Remediation (AWX Engine)

Goal: ServiceRadar acts as the brain, AWX acts as the hands. AWX is already deployed in the K8s cluster and provides a robust REST API over Ansible.

3.1 Event-Driven Patching

  • ServiceRadar detects a critical state (e.g., a vulnerable package via Trivy, a Falco security alert).
  • The Elixir backend fires a REST API call to the in-cluster AWX instance (POST /api/v2/job_templates/{id}/launch/), passing the target Proxmox LXC/VM hostname as an extra-var.
  • AWX reaches out via SSH, patches the server using Ansible playbooks (apt, yum, win_updates modules), and sends a webhook back to ServiceRadar when the job completes.
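The launch call can be sketched with the standard library. The endpoint path matches AWX's job-template launch API; the extra-var name target_host is an illustrative assumption, and the request is built but not sent:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// LaunchRequest builds (but does not send) the AWX job-template launch call,
// passing the target Proxmox LXC/VM hostname as an extra_var.
func LaunchRequest(awxURL, token string, templateID int, targetHost string) (*http.Request, error) {
	body, err := json.Marshal(map[string]any{
		"extra_vars": map[string]string{"target_host": targetHost}, // var name is illustrative
	})
	if err != nil {
		return nil, err
	}
	url := fmt.Sprintf("%s/api/v2/job_templates/%d/launch/", awxURL, templateID)
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token) // AWX OAuth2 token
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, _ := LaunchRequest("https://awx.example.internal", "TOKEN", 42, "lxc-105")
	fmt.Println(req.Method, req.URL.String())
	// prints: POST https://awx.example.internal/api/v2/job_templates/42/launch/
}
```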

3.2 Running vs. Dormant Vulnerability Correlation

This is the key differentiator—correlating static vulnerability data with live runtime state.

  • Trivy SBOMs identify all packages with known CVEs (static analysis of files on disk).
  • gopsutil process data shows which binaries are currently executing in memory.
  • Correlation logic (in core-elx):
    • If a critical CVE is found but the affected binary is dormant on disk → log it, schedule patch in next maintenance window.
    • If the affected binary is currently executing in memory → flag as P0 Incident, trigger AWX to isolate or patch immediately.
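A minimal sketch of the classification rule, using hypothetical Finding and running-set types (the real correlation lives in core-elx, in Elixir):

```go
package main

import "fmt"

// Finding pairs a Trivy CVE match with the binary path it affects.
type Finding struct {
	CVE      string
	Severity string // Trivy severity: LOW..CRITICAL
	Binary   string // path on disk
}

// Classify implements the dormant-vs-running rule: a critical CVE whose
// binary appears in the live process set becomes a P0 incident; otherwise
// it is queued for the next maintenance window.
func Classify(f Finding, running map[string]bool) string {
	if f.Severity == "CRITICAL" && running[f.Binary] {
		return "P0_INCIDENT" // trigger AWX to isolate or patch immediately
	}
	return "SCHEDULE_MAINTENANCE"
}

func main() {
	running := map[string]bool{"/usr/sbin/sshd": true} // from gopsutil process data
	fmt.Println(Classify(Finding{"CVE-2026-0001", "CRITICAL", "/usr/sbin/sshd"}, running))
	fmt.Println(Classify(Finding{"CVE-2026-0002", "CRITICAL", "/usr/bin/perl"}, running))
	// prints:
	// P0_INCIDENT
	// SCHEDULE_MAINTENANCE
}
```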

3.3 AWX Integration Setup

  • Ansible playbooks are stored in a GitHub repository and synced into AWX as a Project.
  • Job Templates in AWX wrap each playbook (e.g., patch_server.yml, isolate_host.yml).
  • AWX OAuth2 tokens are used for API authentication from core-elx.
  • AWX has dedicated community modules for Proxmox (community.general.proxmox) to manage VMs/LXCs, and kubernetes.core for K8s cluster state.

Epic 4: Next-Gen SIEM & Observability Datastore (CNPG)

Goal: Replace Elasticsearch, OpenSearch, and Neo4j by pushing all advanced analytics directly into heavily optimized PostgreSQL extensions within CloudNativePG.

4.1 Time-Series Metrics & Logs (TimescaleDB)

  • What goes here: Falco security alerts, gopsutil system metrics (CPU, RAM, disk, network I/O), XDP firewall drop logs, and structured log streams.
  • Partition high-volume data using TimescaleDB hypertables (by day or hour).
  • Implement continuous aggregates for real-time UI dashboards.
  • Automated retention policies: downsample 15-second metrics into 1-hour rollups after 30 days to save disk space.

4.2 Relational Tables (Standard PostgreSQL)

  • What goes here: Trivy vulnerability data (CVEs), server inventory (hostnames, IPs, OS versions), AWX job configurations and results, agent registration state.
  • CVEs are highly relational: a host has_many packages, a package belongs_to_many CVEs. Standard Postgres handles this naturally with Ash Framework resources.

4.3 Network Graphing & Blast Radius (Apache AGE)

  • What goes here: Network topology and blast radius mapping.
  • The agent maps active network connections via gopsutil/net.Connections() and sends connection data to core-elx over NATS.
  • Elixir translates connection data into Cypher queries for Apache AGE: (Process A on Host 1) -[CONNECTS_TO]-> (Port 5432 on Host 2).
  • Blast Radius Feature: If a K8s node is compromised (Falco alert) or an LXC has a critical CVE (Trivy), query the graph to instantly visualize all connected machines and services at risk.
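Sketched in Go for consistency with the other examples (the real translation happens in Elixir), one connection record renders into a Cypher MERGE; the node labels and property names here are illustrative, not a fixed schema:

```go
package main

import "fmt"

// Connection is a simplified record derived from gopsutil/net.Connections().
type Connection struct {
	SrcHost, SrcProc string
	DstHost          string
	DstPort          uint32
}

// ToCypher renders the MERGE statement core-elx would hand to Apache AGE.
// MERGE (rather than CREATE) keeps the graph idempotent across repeated reports.
func ToCypher(c Connection) string {
	return fmt.Sprintf(
		"MERGE (p:Process {name: '%s', host: '%s'}) "+
			"MERGE (s:Service {host: '%s', port: %d}) "+
			"MERGE (p)-[:CONNECTS_TO]->(s)",
		c.SrcProc, c.SrcHost, c.DstHost, c.DstPort)
}

func main() {
	fmt.Println(ToCypher(Connection{"host1", "postgres-client", "host2", 5432}))
}
```

A blast-radius query is then a variable-length traversal from the compromised node over CONNECTS_TO edges.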

4.4 Full-Text Search (ParadeDB / BM25)

  • What goes here: IOC (Indicator of Compromise) hunting, IP lookups, exact CVE matches, keyword-based log search across both edge agent logs and K8s Falco alerts.
  • ParadeDB replaces ELK-style search. Built on Rust's Tantivy engine, it provides Elasticsearch-level BM25 scoring directly inside Postgres.
  • Logs and alerts ingested from NATS by Elixir are immediately indexed into a ParadeDB bm25 index within CNPG.

4.5 Semantic & AI Search (pgvector / HNSW)

  • What goes here: Semantic threat hunting, anomaly detection, alert deduplication.
  • Why: Attackers obfuscate commands to bypass keyword rules (e.g., tail -n 10 /etc/shadow instead of cat /etc/shadow). Keywords miss this; vector similarity catches it.
  • Architecture:
    • Events flow from NATS into core-elx via Broadway.
    • Elixir uses Nx and Bumblebee to run a local embedding model (e.g., all-MiniLM-L6-v2, 22–90MB) entirely within the BEAM VM on CPU.
    • The generated vector is saved to CNPG alongside the raw log.
    • CNPG uses an HNSW index for sub-millisecond nearest-neighbor lookups.

4. Feature Deep-Dive: SRQL (ServiceRadar Query Language)

To provide a world-class threat hunting experience, ServiceRadar introduces SRQL, a custom DSL parsed by Elixir that compiles down to highly optimized Postgres queries leveraging ParadeDB (BM25) and pgvector (semantic search).

4.1 SRQL Syntax & Compilation

SRQL allows analysts to pipe (|) exact keyword matches into semantic filters natively in the UI search bar.

Example 1: Pure ParadeDB Search (Keyword)

type:falco_alert AND k8s.namespace:production AND "authentication failure"

Elixir compiles this to a ParadeDB paradedb.search() query using BM25 scoring. Extremely fast exact matches.

Example 2: Semantic Threat Hunting (pgvector)

SIMILAR_TO("obfuscated reverse shell connecting to external IP")

Elixir uses Bumblebee to vectorize the string, then compiles to:

SELECT * FROM security_events ORDER BY embedding <-> '[vector]' LIMIT 50;

Example 3: Hybrid Query (Keyword → Semantic Pipeline)

host:proxmox-node-01 AND severity:high | SIMILAR_TO("privilege escalation via bash")

Compilation:

  1. ParadeDB strictly filters to host:proxmox-node-01 AND severity:high (reducing 50M rows to ~5,000).
  2. pgvector runs HNSW distance calculation only on those 5,000 rows.

Result: AI-driven threat hunting with minimal latency, entirely on-premise.

4.2 Automated Semantic Deduplication ("Smart Alert Grouping")

Instead of overwhelming the UI with 10,000 Falco syscall alerts or XDP drop logs, core-elx runs a vector clustering algorithm. The UI displays: "10,000 events occurred, but they represent only 3 semantically unique attack patterns."
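A greedy cosine-similarity clusterer is one minimal way to produce that "N unique patterns" count; this sketch is illustrative, not necessarily the clustering algorithm core-elx will ship:

```go
package main

import (
	"fmt"
	"math"
)

// cosine returns the cosine similarity of two equal-length vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// GreedyCluster groups embeddings: each vector joins the first cluster whose
// representative is within the similarity threshold, else it starts a new
// cluster. The return value is the number of semantically unique patterns.
func GreedyCluster(vecs [][]float64, threshold float64) int {
	var reps [][]float64
	for _, v := range vecs {
		matched := false
		for _, r := range reps {
			if cosine(v, r) >= threshold {
				matched = true
				break
			}
		}
		if !matched {
			reps = append(reps, v)
		}
	}
	return len(reps)
}

func main() {
	// Two near-duplicate alerts plus one distinct one -> 2 unique patterns.
	vecs := [][]float64{{1, 0}, {0.99, 0.01}, {0, 1}}
	fmt.Println(GreedyCluster(vecs, 0.95)) // prints: 2
}
```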


5. Embedding Strategy: CPU-First, GPU-Optional

A GPU must not be a hard requirement for deployment. Embedding models are small mathematical functions, not LLMs.

5.1 Model Selection

Use micro-models only: all-MiniLM-L6-v2 or bge-micro-v2 (22–90MB). These are small enough to stay hot in CPU cache and system memory, generating an embedding for a log line in roughly 5–10ms on CPU.

5.2 EXLA + CPU SIMD Optimization

Nx uses EXLA (Google's XLA compiler) under the hood. At application boot, EXLA JIT-compiles the Bumblebee embedding model to native machine code targeting the host CPU's SIMD instruction sets (e.g., AVX2/AVX-512), enabling multiple vector operations per clock cycle.

5.3 Broadway Batching (Never Embed 1-by-1)

Broadway pulls messages from NATS JetStream and groups them. Configure Broadway to batch 256 logs or wait 500ms (whichever comes first). Pass the entire batch to Bumblebee at once—the CPU vectorizes all 256 logs simultaneously via matrix multiplication, increasing throughput by orders of magnitude over sequential processing.

5.4 Selective Vectorization

Not all logs warrant embedding. The Elixir pipeline applies a fast pattern-match filter before the embedding stage.

Route to Bumblebee (embed):

  • Falco / eBPF security alerts
  • Logs with level: warning, error, or critical
  • Failed authentication attempts (SSH, Web UI)

Bypass Bumblebee (keyword-only via ParadeDB):

  • Standard informational telemetry
  • Successful HTTP 200 access logs
  • Raw firewall DEBUG logs
  • TCP ACK / routine network flow logs
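The routing rules above reduce to a cheap predicate applied before the embedding stage; the field and source names here are illustrative:

```go
package main

import "fmt"

// LogEvent carries only the fields the routing filter needs.
type LogEvent struct {
	Source      string // e.g. "falco", "nginx", "sshd"
	Level       string // "debug".."critical"
	AuthFailure bool   // failed SSH / Web UI authentication
}

// ShouldEmbed applies the fast pattern-match filter: only security-relevant
// or high-severity events reach the Bumblebee embedding stage; everything
// else stays keyword-only via ParadeDB.
func ShouldEmbed(e LogEvent) bool {
	if e.Source == "falco" || e.AuthFailure {
		return true
	}
	switch e.Level {
	case "warning", "error", "critical":
		return true
	}
	return false
}

func main() {
	fmt.Println(ShouldEmbed(LogEvent{Source: "falco", Level: "info"}))    // security alert -> true
	fmt.Println(ShouldEmbed(LogEvent{Source: "nginx", Level: "info"}))    // HTTP 200 noise -> false
	fmt.Println(ShouldEmbed(LogEvent{Source: "sshd", AuthFailure: true})) // failed login -> true
}
```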

5.5 Auto-Detect GPU (Progressive Enhancement)

At application boot in runtime.exs, check for NVIDIA/CUDA drivers via System.cmd("nvidia-smi", ...). If present, configure Nx to use the CUDA backend for dramatically faster embedding throughput. If absent, fall back to CPU EXLA—still performant for the micro-models used.
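The detection logic, sketched in Go for consistency (the PRD specifies it in runtime.exs): probe PATH for nvidia-smi and report which backend would be configured.

```go
package main

import (
	"fmt"
	"os/exec"
)

// hasGPU mirrors the runtime.exs check: if nvidia-smi is resolvable on PATH,
// prefer the CUDA backend; otherwise fall back to CPU EXLA.
func hasGPU() bool {
	_, err := exec.LookPath("nvidia-smi")
	return err == nil
}

func main() {
	if hasGPU() {
		fmt.Println("backend: cuda")
	} else {
		fmt.Println("backend: cpu (EXLA host)")
	}
}
```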


6. Deployment & Tech Stack Summary

| Component | Technology / Library | Purpose |
|-----------|----------------------|---------|
| Edge Agent | Go, gopsutil, lxc/go-lxc, Trivy, cilium/ebpf | Telemetry, SBOM, XDP firewall for bare metal/LXCs |
| K8s Security | Falco + Falcosidekick | eBPF runtime security natively in Kubernetes |
| Edge Gateway | Go, gRPC | Multiplexes edge agent connections safely into NATS |
| Message Bus | NATS JetStream / Object Store | Direct ingest for Falco; gateway proxy for agents; chunked file distribution |
| Backend Core | Elixir, Ash Framework, Broadway, Nx/Bumblebee | Business logic, NATS consumption, AI embeddings, SRQL parser |
| Database | CloudNativePG (CNPG) | Primary state and data warehouse |
| Time-Series | TimescaleDB (CNPG extension) | Log partitioning, metric hypertables, continuous aggregates |
| Search | ParadeDB (CNPG extension) | BM25 Elasticsearch-grade full-text search via Tantivy |
| AI / Semantic | pgvector (CNPG extension) | HNSW indexing for semantic similarity and threat clustering |
| Graph | Apache AGE (CNPG extension) | Network topology and blast radius visualization via Cypher |
| Geospatial | PostGIS (CNPG extension) | Geographic asset mapping (available, future use) |
| Remediation | AWX (Ansible in K8s) | Webhook-triggered configuration management and patching |

7. Rejected Alternatives & Rationale

| Tool | Category | Reason for Rejection |
|------|----------|----------------------|
| Wazuh | SIEM / Endpoint Security | Legacy C codebase (OSSEC fork). Memory safety concerns. Agent fatigue—requires installing a separate heavy binary on every endpoint. Replaced by native Go eBPF + Trivy + Falco. |
| Fleet / osquery | System State Visibility | osquery is C++. Adds another agent binary to manage. gopsutil + Docker/LXC Go SDKs + Trivy SBOMs already provide equivalent system state natively in the Go agent. |
| Elasticsearch / OpenSearch | Log Search & SIEM Backend | Java-based memory hog. Replaced by ParadeDB (BM25 via Tantivy) and pgvector inside CNPG—same query capabilities, fraction of the resource footprint. |
| Foreman / Uyuni | Lifecycle / Patch Management | Massive, complex beasts. Overlapping UI with ServiceRadar. AWX + Ansible playbooks provide lightweight, API-driven patching that ServiceRadar can orchestrate directly. |
| Neo4j | Graph Database | External dependency. Apache AGE provides Cypher queries directly inside CNPG, eliminating a separate graph database. |

8. Phase 1 Deliverables & Next Steps

Phase 1A: Core Infrastructure

  1. NATS JetStream Cluster: Establish the NATS JetStream server. Verify Falcosidekick can authenticate natively from K8s and publish to security.falco.alerts.
  2. Agent Gateway & Edge Pipeline: Establish gRPC bi-directional streaming. Verify the full path: gopsutil → agent → gRPC → gateway → NATS → Broadway (Elixir) → TimescaleDB hypertable.
  3. AWX Integration: Create a GitHub repo with patching playbooks. Sync into AWX as a Project. Generate OAuth2 token. Wire up core-elx to trigger job templates via REST API.

Phase 1B: Vulnerability Pipeline

  1. Trivy Air-Gapped Distribution: Implement the AshOban cron job to download the Trivy DB, write to NATS Object Store, and stream down to agents via 2MB gRPC chunks.
  2. Running vs. Dormant Correlation: Implement cross-referencing logic in core-elx that matches Trivy SBOMs against gopsutil running processes to auto-classify vulnerability severity.
Phase 1C: Search & AI Pipeline

  1. ParadeDB + SRQL v1: Install ParadeDB on the CNPG cluster. Write the Elixir SRQL parser for basic keyword searches. Validate BM25 query performance on a test dataset of 10M logs.
  2. pgvector Semantic Pipeline: Deploy all-MiniLM-L6-v2 via Bumblebee. Implement selective vectorization in the Broadway pipeline. Validate "Find Similar" queries against test security events.

Phase 1D: Topology & Visualization

  1. Apache AGE Network Graph: Ingest gopsutil/net.Connections() data into Apache AGE via Cypher. Build initial blast radius query and visualization in the ServiceRadar UI.
Imported from GitHub. Original GitHub issue: #2936 Original author: @mfreeman451 Original URL: https://github.com/carverauto/serviceradar/issues/2936 Original created: 2026-02-27T21:15:40Z --- # Product Requirements Document: ServiceRadar Next-Gen SIEM & Observability Platform **Version:** 2.0 **Author:** Carver Automation Corporation **Date:** February 2026 **Status:** Draft --- ## 1. Executive Summary ServiceRadar is evolving from a network visibility tool into a true **Single Pane of Glass (SPOG)** for automated datacenter management, observability, and next-generation SIEM. Target environments include bare metal servers, Proxmox (VMs/LXCs), and Kubernetes clusters. By utilizing a pure-Golang single-binary agent, a high-speed gRPC/NATS data plane, an Elixir/BEAM backend, and a heavily extended CloudNativePG (CNPG) database, ServiceRadar will provide Datadog/CrowdStrike-tier capabilities—entirely open-source and natively within our own infrastructure. ### 1.1 Design Principles - **Zero Agent Fatigue:** Users deploy exactly one lightweight, memory-safe Go binary (`serviceradar-agent`). No Wazuh, no Fleet/osquery, no legacy C/C++ agents on endpoints. - **Memory Safety First:** All endpoint code is pure Go. Legacy C-based tools (OSSEC/Wazuh, osquery) are explicitly excluded due to memory safety concerns and operational overhead. - **Engine-Driven Architecture:** ServiceRadar is the brain (UI/Logic), AWX is the hands (execution), NATS JetStream is the central nervous system (transport), and CNPG is the memory (storage/analytics). - **Kubernetes-Native Where It Counts:** Falco DaemonSets run natively in K8s for runtime security, publishing directly to NATS within the trusted cluster boundary—bypassing the edge gateway. --- ## 2. Architectural Overview The platform is divided into four strictly controlled layers, with one specific exception for Kubernetes-native security telemetry. 
### 2.1 Layer Diagram ``` ┌──────────────────────────────────────────────────────────────────────┐ │ COLLECTION & PROTECTION (Edge) │ │ │ │ ┌─────────────────────┐ ┌──────────────────────────┐ │ │ │ serviceradar-agent │ │ Falco DaemonSets (K8s) │ │ │ │ (Pure Go Binary) │ │ eBPF Runtime Security │ │ │ │ │ │ │ │ │ │ • gopsutil metrics │ │ • Syscall monitoring │ │ │ │ • Trivy SBOM/CVE │ │ • FIM / container │ │ │ │ • XDP firewalling │ │ escape detection │ │ │ │ • Container SDKs │ │ │ │ │ └─────────┬────────────┘ └────────────┬─────────────┘ │ │ │ gRPC │ Direct NATS │ │ │ (mTLS) │ (K8s Secrets) │ ├────────────┼────────────────────────────────────────┼─────────────────┤ │ │ TRANSPORT & STREAMING (Pipeline) │ │ │ ▼ │ │ │ ┌─────────────────────┐ │ │ │ │ agent-gateway │ │ │ │ │ (Go, gRPC Proxy) │ │ │ │ └─────────┬────────────┘ │ │ │ │ │ │ │ ▼ ▼ │ │ ┌──────────────────────────────────────────────────────────────┐ │ │ │ NATS JetStream / Object Store │ │ │ │ │ │ │ │ Subjects: │ │ │ │ • telemetry.metrics.edge (from gateway) │ │ │ │ • security.falco.alerts (direct from Falcosidekick) │ │ │ │ • commands.agent.{id} (to gateway → agents) │ │ │ │ • scan.trivy.results (from agents via gateway) │ │ │ │ │ │ │ │ Object Store: Trivy CVE DB bundles │ │ │ └──────────────────────────────┬───────────────────────────────┘ │ ├─────────────────────────────────┼────────────────────────────────────┤ │ ▼ │ │ PROCESSING & ORCHESTRATION (Core) │ │ │ │ ┌─────────────────────────────────────────────────────────────┐ │ │ │ core-elx (Elixir / Ash Framework) │ │ │ │ │ │ │ │ • Broadway consumers (NATS → DB pipeline) │ │ │ │ • AshOban scheduled jobs (Trivy DB fetch, downsampling) │ │ │ │ • Nx/Bumblebee (local AI embeddings on CPU) │ │ │ │ • SRQL parser (DSL → ParadeDB + pgvector SQL) │ │ │ │ • AWX REST API integration (remediation webhooks) │ │ │ └──────────────┬─────────────────────────────┬────────────────┘ │ │ │ │ │ │ ▼ ▼ │ │ ┌──────────────────────┐ ┌──────────────────────────────┐ │ │ │ AWX 
(Ansible/K8s) │ │ CNPG (CloudNativePG) │ │ │ │ Remediation Engine │ │ │ │ │ └──────────────────────┘ │ Extensions: │ │ │ │ • TimescaleDB │ │ │ │ • ParadeDB (BM25) │ │ │ │ • pgvector (HNSW) │ │ │ │ • Apache AGE (Graph) │ │ │ │ • PostGIS │ │ │ └──────────────────────────────┘ │ └──────────────────────────────────────────────────────────────────────┘ ``` ### 2.2 Data Flow Summary | Source | Transport | NATS Subject | Processor | Storage | |--------|-----------|-------------|-----------|---------| | Edge agent metrics (`gopsutil`) | gRPC → Gateway | `telemetry.metrics.edge` | Broadway | TimescaleDB hypertable | | Falco K8s alerts | Falcosidekick → Direct NATS | `security.falco.alerts` | Broadway | TimescaleDB + pgvector | | Trivy scan results | gRPC → Gateway | `scan.trivy.results` | Broadway | Relational tables | | XDP firewall drops | gRPC → Gateway | `security.xdp.drops` | Broadway | TimescaleDB | | Network connections | gRPC → Gateway | `topology.connections` | Broadway | Apache AGE graph | | Commands to agents | core-elx → NATS → Gateway | `commands.agent.{id}` | Gateway | N/A (passthrough) | --- ## 3. Core Epics & Feature Requirements ### Epic 1: The Edge Go Agent (`serviceradar-agent`) **Goal:** Eliminate agent fatigue at the edge with one highly optimized Go binary. No Wazuh, no Fleet/osquery—deploy one binary that does everything. #### 1.1 Native Telemetry (`gopsutil` + Container SDKs) - Collect CPU, RAM, Disk, and Network I/O using `shirou/gopsutil`. - Integrate Docker/containerd SDK (`github.com/docker/docker/client`) and LXC Go SDK (`github.com/lxc/go-lxc`) to map raw Linux PIDs to specific container names, image tags, and Proxmox LXC IDs. - Implement a **state-diff algorithm** in the agent: maintain an in-memory process map and only publish deltas (Process Started / Process Stopped) to NATS JetStream to minimize bandwidth. - Use `gopsutil/net.Connections()` to map active network connections for topology discovery. 
#### 1.2 Embedded Vulnerability Scanner (Trivy) - Agent utilizes Trivy libraries natively to generate full Software Bill of Materials (SBOMs) for the OS rootfs and container filesystems. - Trivy replaces the need for custom `dpkg`/`rpm` parsing—its `rootfs` scan mode outputs structured JSON of all installed OS packages, Python packages, Node modules, etc. - CVE database updates are received via the air-gapped distribution pipeline (see Epic 2.4), enabling fully offline scanning. #### 1.3 eBPF / XDP Network Security (Edge Firewalling) - Load compiled eBPF C code into the kernel NIC driver via `cilium/ebpf` to drop malicious packets at wire-speed on bare metal and Proxmox hosts. - XDP inspects and drops packets (DDoS, known-bad IPs) at the NIC driver level before the Linux kernel allocates memory—operating at line rate. #### 1.4 Kubernetes Runtime Security (The Falco Exception) - Instead of rewriting eBPF syscall monitoring in the Go agent for K8s, ServiceRadar deploys standard **Falco DaemonSets** within K8s clusters. - Falco handles File Integrity Monitoring (FIM), reverse shell detection, and container escapes natively. - **Falcosidekick** connects directly to the NATS cluster using injected K8s secrets and publishes eBPF security events straight to the `security.falco.alerts` JetStream subject. - **Rationale:** Falco is constrained to Kubernetes environments where we already have a secure, trusted network boundary and K8s secrets management. Pushing high-velocity K8s syscall events through the edge `agent-gateway` would be an unnecessary bottleneck. --- ### Epic 2: The Data Plane (gRPC + NATS JetStream) **Goal:** Create a massively scalable, back-pressure-resistant pipeline. Edge agents are strictly proxied through the gateway, while trusted K8s workloads get direct NATS access. #### 2.1 Command & Control (Bi-Directional gRPC Streaming) - The agent opens a single long-lived bi-directional gRPC stream to the `agent-gateway`. 
- The gateway subscribes to NATS JetStream subject `commands.agent.{agent_id}`.
- When core-elx drops a command into NATS (e.g., "trigger Trivy scan", "update XDP rules"), the gateway instantly pushes it down the open gRPC stream to the target agent.

#### 2.2 Edge Telemetry Firehose (Client-Side gRPC Streaming)

- Bare metal/LXC agents batch `gopsutil` metrics, Trivy results, and XDP drop events, streaming them to the gateway via client-side gRPC streaming.
- The gateway batches incoming gRPC chunks and publishes to NATS JetStream (`telemetry.metrics.edge`, `scan.trivy.results`, `security.xdp.drops`).
- JetStream acts as the shock absorber so core-elx is never overwhelmed by event bursts.

#### 2.3 The Kubernetes Exception (Direct NATS Ingestion)

- Because Falco runs inside the trusted K8s network boundary, it bypasses the `agent-gateway` entirely.
- Falcosidekick authenticates to NATS using K8s-injected secrets and publishes directly to `security.falco.alerts`.
- This avoids the unnecessary bottleneck of routing high-velocity syscall events through the gRPC proxy layer.

#### 2.4 Air-Gapped Trivy DB Distribution (Server-Side gRPC Streaming)

The Trivy CVE database can be 50–100MB+. This pipeline ensures agents can scan offline without hitting GitHub rate limits.

1. **Fetch:** core-elx runs an AshOban scheduled job every 12 hours to download the latest Trivy DB bundle (tar.gz).
2. **Store:** core-elx writes the bundle into the NATS JetStream Object Store (NATS automatically chunks and distributes across the cluster).
3. **Signal:** agent-gateway publishes a lightweight notification to a NATS KV bucket or broadcast subject: `{"event": "trivy_db_update", "version": "v2.1.0"}`.
4. **Stream:** The agent-gateway pulls the DB from NATS Object Store, chunks it into 2MB gRPC blocks, and uses server-side streaming to deliver it to edge agents.
5. **Scan:** Agents reassemble the chunks on local disk and run Trivy scans fully offline.
6. **Report:** Scan results (JSON SBOMs + CVE matches) are pushed back up via client-side gRPC streaming to NATS.

---

### Epic 3: Smart Execution & Remediation (AWX Engine)

**Goal:** ServiceRadar acts as the brain, AWX acts as the hands. AWX is already deployed in the K8s cluster and provides a robust REST API over Ansible.

#### 3.1 Event-Driven Patching

- ServiceRadar detects a critical state (e.g., a vulnerable package via Trivy, a Falco security alert).
- The Elixir backend fires a REST API call to the in-cluster AWX instance (`POST /api/v2/job_templates/{id}/launch/`), passing the target Proxmox LXC/VM hostname as an extra-var.
- AWX reaches out via SSH, patches the server using Ansible playbooks (`apt`, `yum`, `win_updates` modules), and sends a webhook back to ServiceRadar when the job completes.

#### 3.2 Running vs. Dormant Vulnerability Correlation

This is the key differentiator—correlating static vulnerability data with live runtime state.

- **Trivy SBOMs** identify all packages with known CVEs (static analysis of files on disk).
- **`gopsutil` process data** shows which binaries are currently executing in memory.
- **Correlation logic (in core-elx):**
  - If a critical CVE is found but the affected binary is **dormant on disk** → log it, schedule patch in next maintenance window.
  - If the affected binary is **currently executing in memory** → flag as **P0 Incident**, trigger AWX to isolate or patch immediately.

#### 3.3 AWX Integration Setup

- Ansible playbooks are stored in a GitHub repository and synced into AWX as a Project.
- Job Templates in AWX wrap each playbook (e.g., `patch_server.yml`, `isolate_host.yml`).
- AWX OAuth2 tokens are used for API authentication from core-elx.
- AWX has dedicated community modules for Proxmox (`community.general.proxmox`) to manage VMs/LXCs, and `kubernetes.core` for K8s cluster state.
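The running-vs-dormant rule in 3.2 is simple enough to sketch in Go. The types below (`VulnFinding`, the `runningPaths` set) are hypothetical stand-ins for the real Trivy SBOM and `gopsutil` process schemas, not the actual implementation—the production logic lives in core-elx:

```go
package main

import "fmt"

// Priority is the classification a finding receives, per the rules in 3.2.
type Priority int

const (
	ScheduleMaintenance Priority = iota // binary dormant on disk: patch in next window
	P0Incident                          // binary currently executing: isolate or patch now
)

// VulnFinding pairs a CVE match with the on-disk binary it affects.
// Field names are illustrative, not the real schema.
type VulnFinding struct {
	CVE        string
	BinaryPath string
	Critical   bool
}

// Classify cross-references a Trivy finding against the set of executable
// paths currently running (as reported by gopsutil process data).
func Classify(f VulnFinding, runningPaths map[string]bool) Priority {
	if f.Critical && runningPaths[f.BinaryPath] {
		return P0Incident
	}
	return ScheduleMaintenance
}

func main() {
	running := map[string]bool{"/usr/sbin/sshd": true}
	f := VulnFinding{CVE: "CVE-2024-0001", BinaryPath: "/usr/sbin/sshd", Critical: true}
	fmt.Println(Classify(f, running))
}
```

The same predicate can run in reverse during maintenance planning: iterate dormant findings and batch them into a single AWX job launch rather than one job per CVE.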
---

### Epic 4: Next-Gen SIEM & Observability Datastore (CNPG)

**Goal:** Replace Elasticsearch, OpenSearch, and Neo4j by pushing all advanced analytics directly into heavily optimized PostgreSQL extensions within CloudNativePG.

#### 4.1 Time-Series Metrics & Logs (TimescaleDB)

- **What goes here:** Falco security alerts, `gopsutil` system metrics (CPU, RAM, disk, network I/O), XDP firewall drop logs, and structured log streams.
- Partition high-volume data using TimescaleDB hypertables (by day or hour).
- Implement continuous aggregates for real-time UI dashboards.
- Automated retention policies: downsample 15-second metrics into 1-hour rollups after 30 days to save disk space.

#### 4.2 Relational Tables (Standard PostgreSQL)

- **What goes here:** Trivy vulnerability data (CVEs), server inventory (hostnames, IPs, OS versions), AWX job configurations and results, agent registration state.
- CVEs are highly relational: a host `has_many` packages, a package `belongs_to_many` CVEs. Standard Postgres handles this naturally with Ash Framework resources.

#### 4.3 Network Graphing & Blast Radius (Apache AGE)

- **What goes here:** Network topology and blast radius mapping.
- The agent maps active network connections via `gopsutil/net.Connections()` and sends connection data to core-elx over NATS.
- Elixir translates connection data into Cypher queries for Apache AGE: `(Process A on Host 1) -[CONNECTS_TO]-> (Port 5432 on Host 2)`.
- **Blast Radius Feature:** If a K8s node is compromised (Falco alert) or an LXC has a critical CVE (Trivy), query the graph to instantly visualize all connected machines and services at risk.

#### 4.4 Full-Text Search (ParadeDB / BM25)

- **What goes here:** IOC (Indicator of Compromise) hunting, IP lookups, exact CVE matches, keyword-based log search across both edge agent logs and K8s Falco alerts.
- ParadeDB replaces ELK-style search. Built on Rust's Tantivy engine, it provides Elasticsearch-level BM25 scoring directly inside Postgres.
- Logs and alerts ingested from NATS by Elixir are immediately indexed into a ParadeDB `bm25` index within CNPG.

#### 4.5 Semantic & AI Search (pgvector / HNSW)

- **What goes here:** Semantic threat hunting, anomaly detection, alert deduplication.
- **Why:** Attackers obfuscate commands to bypass keyword rules (e.g., `tail -n 10 /etc/shadow` instead of `cat /etc/shadow`). Keywords miss this; vector similarity catches it.
- **Architecture:**
  - Events flow from NATS into core-elx via Broadway.
  - Elixir uses Nx and Bumblebee to run a local embedding model (e.g., `all-MiniLM-L6-v2`, 22–90MB) entirely within the BEAM VM on CPU.
  - The generated vector is saved to CNPG alongside the raw log.
  - CNPG uses an HNSW index for sub-millisecond nearest-neighbor lookups.

---

## 4. Feature Deep-Dive: SRQL (ServiceRadar Query Language)

To provide a world-class threat hunting experience, ServiceRadar introduces **SRQL**, a custom DSL parsed by Elixir that compiles down to highly optimized Postgres queries leveraging ParadeDB (BM25) and pgvector (semantic search).

### 4.1 SRQL Syntax & Compilation

SRQL allows analysts to pipe (`|`) exact keyword matches into semantic filters natively in the UI search bar.

**Example 1: Pure ParadeDB Search (Keyword)**

```
type:falco_alert AND k8s.namespace:production AND "authentication failure"
```

Elixir compiles this to a ParadeDB `paradedb.search()` query using BM25 scoring. Extremely fast exact matches.

**Example 2: Semantic Threat Hunting (pgvector)**

```
SIMILAR_TO("obfuscated reverse shell connecting to external IP")
```

Elixir uses Bumblebee to vectorize the string, then compiles to:

```sql
SELECT * FROM security_events ORDER BY embedding <-> '[vector]' LIMIT 50;
```

**Example 3: Hybrid Query (Keyword → Semantic Pipeline)**

```
host:proxmox-node-01 AND severity:high | SIMILAR_TO("privilege escalation via bash")
```

Compilation:

1. ParadeDB strictly filters to `host:proxmox-node-01` AND `severity:high` (reducing 50M rows to ~5,000).
2. pgvector runs HNSW distance calculation only on those 5,000 rows.

Result: AI-driven threat hunting with minimal latency, entirely on-premise.

### 4.2 Automated Semantic Deduplication ("Smart Alert Grouping")

Instead of overwhelming the UI with 10,000 Falco syscall alerts or XDP drop logs, core-elx runs a vector clustering algorithm. The UI displays: *"10,000 events occurred, but they represent only 3 semantically unique attack patterns."*

---

## 5. Embedding Strategy: CPU-First, GPU-Optional

A GPU must not be a hard requirement for deployment. Embedding models are small mathematical functions, not LLMs.

### 5.1 Model Selection

Use micro-models only: `all-MiniLM-L6-v2` or `bge-micro-v2` (22–90MB). These fit entirely in L3 cache and generate an embedding for a log line in 5–10ms on CPU.

### 5.2 EXLA + CPU SIMD Optimization

Nx uses EXLA (an Elixir binding for Google's XLA compiler) under the hood. At application boot, EXLA JIT-compiles the Bumblebee embedding model to native machine code targeting AVX-512 and other SIMD instructions on the host CPU, enabling multiple vector operations per clock cycle.

### 5.3 Broadway Batching (Never Embed 1-by-1)

Broadway pulls messages from NATS JetStream and groups them. Configure Broadway to batch 256 logs or wait 500ms (whichever comes first). Pass the entire batch to Bumblebee at once—the CPU vectorizes all 256 logs simultaneously via matrix multiplication, increasing throughput by orders of magnitude over sequential processing.

### 5.4 Selective Vectorization

Not all logs warrant embedding. The Elixir pipeline applies a fast pattern-match filter before the embedding stage.
**Route to Bumblebee (embed):**

- Falco / eBPF security alerts
- Logs with `level: warning`, `error`, or `critical`
- Failed authentication attempts (SSH, Web UI)

**Bypass Bumblebee (keyword-only via ParadeDB):**

- Standard informational telemetry
- Successful HTTP 200 access logs
- Raw firewall DEBUG logs
- TCP ACK / routine network flow logs

### 5.5 Auto-Detect GPU (Progressive Enhancement)

At application boot in `runtime.exs`, check for NVIDIA/CUDA drivers via `System.cmd("nvidia-smi", ...)`. If present, configure Nx to use the CUDA backend for dramatically faster embedding throughput. If absent, fall back to CPU EXLA—still performant for the micro-models used.

---

## 6. Deployment & Tech Stack Summary

| Component | Technology / Library | Purpose |
|-----------|---------------------|---------|
| **Edge Agent** | Go, `gopsutil`, `lxc/go-lxc`, Trivy, `cilium/ebpf` | Telemetry, SBOM, XDP firewall for bare metal/LXCs |
| **K8s Security** | Falco + Falcosidekick | eBPF runtime security natively in Kubernetes |
| **Edge Gateway** | Go, gRPC | Multiplexes edge agent connections safely into NATS |
| **Message Bus** | NATS JetStream / Object Store | Direct ingest for Falco; gateway proxy for agents; chunked file distribution |
| **Backend Core** | Elixir, Ash Framework, Broadway, Nx/Bumblebee | Business logic, NATS consumption, AI embeddings, SRQL parser |
| **Database** | CloudNativePG (CNPG) | Primary state and data warehouse |
| **Time-Series** | TimescaleDB (CNPG extension) | Log partitioning, metric hypertables, continuous aggregates |
| **Search** | ParadeDB (CNPG extension) | BM25 Elasticsearch-grade full-text search via Tantivy |
| **AI / Semantic** | pgvector (CNPG extension) | HNSW indexing for semantic similarity and threat clustering |
| **Graph** | Apache AGE (CNPG extension) | Network topology and blast radius visualization via Cypher |
| **Geospatial** | PostGIS (CNPG extension) | Geographic asset mapping (available, future use) |
| **Remediation** | AWX (Ansible in K8s) | Webhook-triggered configuration management and patching |

---

## 7. Rejected Alternatives & Rationale

| Tool | Category | Reason for Rejection |
|------|----------|---------------------|
| **Wazuh** | SIEM / Endpoint Security | Legacy C codebase (OSSEC fork). Memory safety concerns. Agent fatigue—requires installing a separate heavy binary on every endpoint. Replaced by native Go eBPF + Trivy + Falco. |
| **Fleet / osquery** | System State Visibility | osquery is C++. Adds another agent binary to manage. `gopsutil` + Docker/LXC Go SDKs + Trivy SBOMs already provide equivalent system state natively in the Go agent. |
| **Elasticsearch / OpenSearch** | Log Search & SIEM Backend | Java-based memory hog. Replaced by ParadeDB (BM25 via Tantivy) and pgvector inside CNPG—same query capabilities, fraction of the resource footprint. |
| **Foreman / Uyuni** | Lifecycle / Patch Management | Massive, complex beasts. Overlapping UI with ServiceRadar. AWX + Ansible playbooks provide lightweight, API-driven patching that ServiceRadar can orchestrate directly. |
| **Neo4j** | Graph Database | External dependency. Apache AGE provides Cypher queries directly inside CNPG, eliminating a separate graph database. |

---

## 8. Phase 1 Deliverables & Next Steps

### Phase 1A: Core Infrastructure

1. **NATS JetStream Cluster:** Establish the NATS JetStream server. Verify Falcosidekick can authenticate natively from K8s and publish to `security.falco.alerts`.
2. **Agent Gateway & Edge Pipeline:** Establish gRPC bi-directional streaming. Verify the full path: `gopsutil` → agent → gRPC → gateway → NATS → Broadway (Elixir) → TimescaleDB hypertable.
3. **AWX Integration:** Create a GitHub repo with patching playbooks. Sync into AWX as a Project. Generate OAuth2 token. Wire up core-elx to trigger job templates via REST API.

### Phase 1B: Vulnerability Pipeline

4. **Trivy Air-Gapped Distribution:** Implement the AshOban cron job to download the Trivy DB, write to NATS Object Store, and stream down to agents via 2MB gRPC chunks.
5. **Running vs. Dormant Correlation:** Implement cross-referencing logic in core-elx that matches Trivy SBOMs against `gopsutil` running processes to auto-classify vulnerability severity.

### Phase 1C: SIEM Search

6. **ParadeDB + SRQL v1:** Install ParadeDB on the CNPG cluster. Write the Elixir SRQL parser for basic keyword searches. Validate BM25 query performance on a test dataset of 10M logs.
7. **pgvector Semantic Pipeline:** Deploy `all-MiniLM-L6-v2` via Bumblebee. Implement selective vectorization in the Broadway pipeline. Validate "Find Similar" queries against test security events.

### Phase 1D: Topology & Visualization

8. **Apache AGE Network Graph:** Ingest `gopsutil/net.Connections()` data into Apache AGE via Cypher. Build initial blast radius query and visualization in the ServiceRadar UI.
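As a rough illustration of deliverable 8, the translation from a connection tuple to a graph edge (section 4.3) could be sketched in Go as below. The `Conn` fields and graph labels are invented for the example, and the wrapping `cypher()` SQL call that Apache AGE requires is omitted; production code would also escape or parameterize the values rather than interpolate them:

```go
package main

import "fmt"

// Conn is a simplified stand-in for one entry derived from
// gopsutil's net.Connections() on the agent (hypothetical fields).
type Conn struct {
	SrcHost, Process string
	DstHost          string
	DstPort          uint32
}

// ToCypher renders one connection as the edge pattern described in 4.3:
// (Process on Host A) -[CONNECTS_TO]-> (Port on Host B).
// MERGE keeps the statement idempotent across repeated agent reports.
func ToCypher(c Conn) string {
	return fmt.Sprintf(
		"MERGE (p:Process {name: '%s', host: '%s'}) "+
			"MERGE (s:Service {host: '%s', port: %d}) "+
			"MERGE (p)-[:CONNECTS_TO]->(s)",
		c.Process, c.SrcHost, c.DstHost, c.DstPort)
}

func main() {
	c := Conn{SrcHost: "host-1", Process: "postgres-client", DstHost: "host-2", DstPort: 5432}
	fmt.Println(ToCypher(c))
}
```

A blast radius query then becomes a variable-length path match from the compromised node over `CONNECTS_TO` edges, which the UI can render as the at-risk subgraph.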

Imported GitHub comment.

Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/2936#issuecomment-3975186653
Original created: 2026-02-27T21:32:09Z


might want to include this as part of the update as well https://github.com/carverauto/serviceradar/issues/2787


Imported GitHub comment.

Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/2936#issuecomment-4018075164
Original created: 2026-03-08T03:16:22Z


https://awesomeagents.ai/news/claude-code-sandbox-escape-denylist/

```
How Veto Works
Veto runs at the BPF LSM layer in the kernel. When a process tries to execute a binary, Veto computes a SHA-256 hash of the file content and compares it against a denylist configured through the Ona dashboard. The check happens after the kernel resolves all symlinks, mounts, and overlays, but before the binary actually runs - which is why the rename/copy/symlink attacks fail.
```