feat: services engine #759

Open
opened 2026-03-28 04:28:14 +00:00 by mfreeman451 · 0 comments
Owner

Imported from GitHub.

Original GitHub issue: #2330
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/2330
Original created: 2026-01-18T05:15:03Z


Product Requirements Document (PRD): ServiceRadar "Services" Engine

1. Product Vision

To evolve ServiceRadar from a collection of monitored devices into a Service-Aware Observability Platform. By combining physical topology (Mapper), logical traffic (NetFlow), and process-level visibility (eBPF), ServiceRadar will automatically group infrastructure into "Services." This allows users to understand not just that a device is down, but which business functions are impacted.

2. Defining "The Service" in ServiceRadar

A Service is a logical vertex in the ServiceRadar Graph (Apache AGE) that aggregates the health, performance, and dependencies of:

  • Nodes: (eBPF-tracked processes or physical hosts).
  • Edges: (Interfaces, VLANs, or NetFlow-detected traffic paths).
  • Telemetry: (Associated Logs, Events, and Alerts).

3. The "Service Discovery" Trifecta

We will deliver three ways to discover and define a service, each drawing on a deeper layer of visibility: network topology, traffic flows, and host processes.

3.1. Topology-Based Services (The Mapper)

  • Mechanism: Uses existing SNMP, LLDP, CDP, and API discovery.
  • Functionality: Groups devices based on physical or Layer 2/3 proximity.
  • Use Case: Automatically creating a "Core Network Service" consisting of all backbone switches and their interconnects.

3.2. Flow-Based Services (The NetFlow Collector)

  • Mechanism: Analyzes NetFlow/IPFIX data.
  • Functionality: Groups devices that talk to each other frequently over specific ports.
  • Use Case: Identifying that 50 hosts are talking to 2 specific IPs over port 5432 and automatically suggesting a "Postgres Cluster Service."
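The flow-based grouping above can be sketched roughly as follows. This is a minimal illustration, not the collector's actual logic: the flow tuple shape `(src_ip, dst_ip, dst_port)`, the `min_clients` threshold, and the port-to-name table are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical port-to-name hints; a real catalog would be configurable.
KNOWN_PORTS = {5432: "Postgres Cluster", 6379: "Redis Cache"}


def suggest_flow_services(flows, min_clients=10):
    """Suggest a service per destination port once enough distinct
    clients have been seen talking to servers on that port.

    `flows` is an iterable of (src_ip, dst_ip, dst_port) tuples, an
    assumed simplification of a NetFlow/IPFIX record.
    """
    clients = defaultdict(set)  # dst_port -> distinct source IPs
    servers = defaultdict(set)  # dst_port -> distinct destination IPs
    for src_ip, dst_ip, dst_port in flows:
        clients[dst_port].add(src_ip)
        servers[dst_port].add(dst_ip)

    suggestions = []
    for port in sorted(clients):
        if len(clients[port]) >= min_clients:
            label = KNOWN_PORTS.get(port, f"Port {port}")
            suggestions.append({
                "name": f"{label} Service",
                "port": port,
                "servers": sorted(servers[port]),
                "client_count": len(clients[port]),
            })
    return suggestions


# 50 hosts talking to 2 IPs over port 5432, plus one unrelated flow.
example_flows = [(f"10.0.0.{i}", f"10.0.1.{i % 2 + 1}", 5432) for i in range(50)]
example_flows.append(("10.0.0.1", "10.0.2.9", 8080))
example_suggestions = suggest_flow_services(example_flows)
```

A threshold on distinct clients is only one possible signal; flow volume or persistence over time could serve the same purpose.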

3.3. Application-Aware Services (The eBPF Profiler)

  • Mechanism: eBPF-powered agent running on host nodes.
  • Functionality:
    • Socket-to-Process Mapping: Identifies which binary (e.g., nginx, java, python) is listening on which port.
    • Dependency Tracking: Observes outbound connections from a process to other IPs/ports.
  • Use Case: Connecting the "Postgres" process on a VM to the "Database" service and showing that the "Frontend" process on another VM is its primary consumer. This is our "Lightweight APM."

4. Functional Requirements

4.1. The Service Catalog

  • A centralized UI view displaying all defined services.
  • Each service must show a Unified Health Score (0-100).
  • Status Aggregation: A Service is "Critical" if a core component (defined by policy) has a Critical Alert.
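One way to read the status aggregation rule is sketched below. Only the "Critical" rule comes from this document; the "Degraded" and "Healthy" states, the severity names, and the weighting used for the 0-100 score are assumptions chosen for illustration.

```python
from dataclasses import dataclass

# Assumed penalty per alert severity; the PRD does not define a formula.
SEVERITY_PENALTY = {"ok": 0.0, "warning": 0.5, "critical": 1.0}


@dataclass
class Component:
    name: str
    is_core: bool   # whether policy marks this component as core
    severity: str   # "ok", "warning", or "critical"


def service_status(components):
    """Critical if any core component has a critical alert (PRD rule).
    The Degraded/Healthy fallbacks are assumptions."""
    if any(c.is_core and c.severity == "critical" for c in components):
        return "Critical"
    if any(c.severity != "ok" for c in components):
        return "Degraded"
    return "Healthy"


def health_score(components, core_weight=2.0):
    """Hypothetical 0-100 score: weighted share of penalty-free components,
    with core components counting `core_weight` times as much."""
    if not components:
        return 100
    total = 0.0
    penalty = 0.0
    for c in components:
        weight = core_weight if c.is_core else 1.0
        total += weight
        penalty += weight * SEVERITY_PENALTY[c.severity]
    return round(100 * (1 - penalty / total))
```

With a core `postgres` component in critical state and two healthy non-core web nodes, this sketch reports "Critical" and a score of 50.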

4.2. Dependency Mapping (Apache AGE)

  • Visual representation of Service-to-Service and Service-to-Device dependencies.
  • Root Cause Analysis (RCA) Mode: When a Service is degraded, the map highlights the "shortest path" to the failing infrastructure component (e.g., a saturated switch port or a crashed process).
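The "shortest path" highlight in RCA Mode can be illustrated with a breadth-first search over dependency edges. This is a sketch on an in-memory adjacency map, not a query against Apache AGE; the vertex naming scheme (`svc:`, `proc:`, `dev:`, `if:`) is hypothetical.

```python
from collections import deque


def shortest_path_to_failure(graph, start, failing):
    """Breadth-first search from `start` along dependency edges.

    `graph` maps a vertex to the vertices it depends on; `failing` is the
    set of vertices currently in a failed state. Returns the shortest
    path (as a list of vertices) to the nearest failing vertex, or None.
    """
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node in failing:
            return path
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


# Hypothetical dependency graph for a degraded web service.
example_graph = {
    "svc:web": ["proc:nginx", "svc:db"],
    "proc:nginx": ["dev:vm1"],
    "svc:db": ["proc:postgres"],
    "proc:postgres": ["dev:vm2"],
    "dev:vm2": ["if:sw1/12"],
}
example_path = shortest_path_to_failure(example_graph, "svc:web", {"if:sw1/12"})
```

Breadth-first search finds the path with the fewest hops; if edges carried weights (for example, confidence in a dependency), a weighted search would be the natural replacement.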

4.3. Service-Level Indicators (SLIs)

  • Users can attach specific TimescaleDB metrics to a Service.
  • Example: For a "Web Service," track the 95th percentile of response times across all member nodes.
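As a rough illustration of the SLI example, the sketch below computes a nearest-rank 95th percentile over response times pooled from all member nodes. Pooling the raw samples (rather than averaging per-node percentiles) is an assumption about what "across all member nodes" means, and the node names are hypothetical.

```python
import math


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: the smallest value such that at least
    `pct` percent of the samples are less than or equal to it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def service_p95(samples_by_node):
    """Pool response-time samples from every member node, then take p95."""
    pooled = [v for samples in samples_by_node.values() for v in samples]
    return percentile_nearest_rank(pooled, 95)


# Two hypothetical web nodes with response times in milliseconds.
example_samples = {
    "web-1": list(range(1, 51)),
    "web-2": list(range(51, 101)),
}
example_p95 = service_p95(example_samples)
```

Pooling matters because the mean of per-node p95 values is generally not the p95 of the service as a whole.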

4.4. The eBPF Bridge

  • The agent must report: PID, Process Name, Listening Port, and Active Connections.
  • The Control Plane must correlate this eBPF data with the NetFlow data to validate the traffic path.
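The correlation step might look roughly like the sketch below: each flow destination is matched against the listening sockets reported by agents. The record shapes are assumptions; in particular, `host_ip` is not in the field list above and is added here only so a report can be joined to a flow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessReport:
    host_ip: str        # assumed join key, not in the PRD's field list
    pid: int
    process_name: str
    listening_port: int


def correlate(reports, flows):
    """Annotate each (src_ip, dst_ip, dst_port) flow with the process that
    listens on its destination, if an agent reported one."""
    listeners = {(r.host_ip, r.listening_port): r for r in reports}
    annotated = []
    for src_ip, dst_ip, dst_port in flows:
        report = listeners.get((dst_ip, dst_port))
        annotated.append({
            "src": src_ip,
            "dst": dst_ip,
            "port": dst_port,
            "process": report.process_name if report else None,
            "validated": report is not None,
        })
    return annotated


example_reports = [ProcessReport("10.0.1.1", 4242, "postgres", 5432)]
example_flows = [("10.0.0.7", "10.0.1.1", 5432), ("10.0.0.7", "10.0.9.9", 443)]
example_annotated = correlate(example_reports, example_flows)
```

Flows that stay unvalidated point at hosts without an agent or at traffic the agent has not yet observed.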

5. Technical Architecture

5.1. Data Model (Multi-Model)

  • Relational (Postgres): Service metadata, ownership, and manual groupings.
  • Time-Series (TimescaleDB): Historical health scores and SLI metrics.
  • Graph (Apache AGE):
    • Vertex: Service, Device, Process, Interface.
    • Edge: DEPENDS_ON, TALKS_TO, CONNECTED_TO, RUNS_ON.
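To make the graph schema concrete, the sketch below encodes the vertex and edge labels listed above and checks an edge against a table of allowed endpoint pairs. The labels come from this document; which endpoint pairs each edge may connect is an assumption (only the Service-to-Service and Service-to-Device dependencies are stated in section 4.2).

```python
VERTEX_LABELS = {"Service", "Device", "Process", "Interface"}

# Assumed endpoint pairs per edge label: (source label, destination label).
ALLOWED_EDGES = {
    "DEPENDS_ON": {("Service", "Service"), ("Service", "Device")},
    "TALKS_TO": {("Process", "Process")},
    "CONNECTED_TO": {("Interface", "Interface")},
    "RUNS_ON": {("Process", "Device")},
}


def validate_edge(edge_label, src_label, dst_label):
    """Return True if the edge connects an allowed pair of vertex labels.
    Raises ValueError for labels outside the schema."""
    if src_label not in VERTEX_LABELS or dst_label not in VERTEX_LABELS:
        raise ValueError(f"unknown vertex label: {src_label!r} or {dst_label!r}")
    if edge_label not in ALLOWED_EDGES:
        raise ValueError(f"unknown edge label: {edge_label!r}")
    return (src_label, dst_label) in ALLOWED_EDGES[edge_label]
```

Validating edges before they reach the graph store keeps traversals such as RCA path-finding from having to handle malformed relationships.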

5.2. The ServiceRadar Agent Evolution

  • Inject an eBPF-based "Watcher" into the existing agent.
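  • Goal: Minimal overhead. It should not intercept packets (like a proxy), but rather hook the kernel's tcp_v4_connect and inet_csk_accept functions (these are kernel functions probed via kprobes/kretprobes, not tracepoints) to build a map of "Who is talking to whom." IPv6 connections would additionally require hooking tcp_v6_connect.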

6. User Experience & UI

6.1. The "Health Mosaic"

  • A high-level dashboard of all Services.
  • Users can "drill down" from a failing Service directly into the specific eBPF process or SNMP interface causing the issue.

6.2. The "Service Creator" Wizard

  • Suggestions: "We found 5 nodes running redis-server. Do you want to group them into a 'Redis Cache' Service?"
  • Automated Tagging: Any device or process whose name matches the glob pattern *-prod-web-* (regex equivalent: .*-prod-web-.*) is automatically added to the "Production Web" Service.
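The pattern *-prod-web-* reads as a shell-style glob; taken literally as a regular expression it is invalid, because it begins with a quantifier. A minimal sketch of the matching rule, using Python's standard `fnmatch` module, with hypothetical host names and a hypothetical rule table:

```python
from fnmatch import fnmatchcase

# Hypothetical rule table: glob pattern -> service name.
TAG_RULES = {"*-prod-web-*": "Production Web"}


def services_for(name, rules=TAG_RULES):
    """Return the services whose glob pattern matches `name`
    (case-sensitive matching)."""
    return [service for pattern, service in rules.items()
            if fnmatchcase(name, pattern)]


example_match = services_for("nyc-prod-web-01")
example_miss = services_for("nyc-staging-web-01")
```

`fnmatchcase` is used instead of `fnmatch` so that matching does not depend on the operating system's case-folding rules.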

7. Competitive Differentiation (The "Datadog Killer" Strategy)

  • Cost: Unlike Datadog/New Relic, we do not charge for "Custom Metrics" or "Traces." We charge per Managed Asset. The eBPF/Service visibility is a feature of the asset.
  • Network Integration: Most APM tools are "blind" to the network (SNMP/LLDP). ServiceRadar shows the process and the switch it's plugged into.
  • Open Source Core: Users can run the "Service" engine on-prem for free, with the SaaS providing the heavy lifting for long-term retention and cross-site graph analysis.

8. Success Metrics

  1. Time to Value: A new user should have at least one "Discovered Service" within 10 minutes of installing the first agent.
  2. RCA Accuracy: Percentage of Service Alerts that correctly point to the underlying Device/Interface/Process failure.
  3. Adoption: 50% of SaaS tenants define at least 3 Services within their first 30 days.