feat: services engine #759

Open

opened 2026-03-28 04:28:14 +00:00 by mfreeman451 · 0 comments

Imported from GitHub.

Original GitHub issue: #2330
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/2330
Original created: 2026-01-18T05:15:03Z

Product Requirements Document (PRD): ServiceRadar "Services" Engine

1. Product Vision

To evolve ServiceRadar from a collection of monitored devices into a Service-Aware Observability Platform. By combining physical topology (Mapper), logical traffic (NetFlow), and process-level visibility (eBPF), ServiceRadar will automatically group infrastructure into "Services." This allows users to understand not just that a device is down, but which business functions are impacted.

2. Defining "The Service" in ServiceRadar

A Service is a logical vertex in the ServiceRadar Graph (Apache AGE) that aggregates the health, performance, and dependencies of:

Nodes: eBPF-tracked processes or physical hosts.
Edges: Interfaces, VLANs, or NetFlow-detected traffic paths.
Telemetry: Associated Logs, Events, and Alerts.

3. The "Service Discovery" Trifecta

We will deliver three ways to define a service, moving from manual to fully autonomous.

3.1. Topology-Based Services (The Mapper)

Mechanism: Uses existing SNMP, LLDP, CDP, and API discovery.
Functionality: Groups devices based on physical or Layer 2/3 proximity.
Use Case: Automatically creating a "Core Network Service" consisting of all backbone switches and their interconnects.

3.2. Flow-Based Services (The NetFlow Collector)

Mechanism: Analyzes NetFlow/IPFIX data.
Functionality: Groups devices that talk to each other frequently over specific ports.
Use Case: Identifying that 50 hosts are talking to 2 specific IPs over port 5432 and automatically suggesting a "Postgres Cluster Service."
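
The flow-based grouping described in 3.2 could be sketched roughly as follows. This is a minimal illustration, not the collector's actual logic: the `Flow` record, the `WELL_KNOWN_PORTS` table, and the `min_clients` threshold are all assumptions introduced for the example.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical port-to-label table; real detection would need more than a port number.
WELL_KNOWN_PORTS = {5432: "Postgres", 6379: "Redis", 3306: "MySQL"}


@dataclass(frozen=True)
class Flow:
    """Simplified flow record (assumed shape, not the NetFlow/IPFIX schema)."""
    src_ip: str
    dst_ip: str
    dst_port: int


def suggest_services(flows, min_clients=10):
    """Suggest a service for each destination port that has at least
    min_clients distinct source IPs talking to it."""
    clients = defaultdict(set)  # port -> distinct source IPs
    servers = defaultdict(set)  # port -> distinct destination IPs
    for f in flows:
        clients[f.dst_port].add(f.src_ip)
        servers[f.dst_port].add(f.dst_ip)

    suggestions = []
    for port in sorted(clients):
        if len(clients[port]) >= min_clients:
            label = WELL_KNOWN_PORTS.get(port, f"port-{port}")
            suggestions.append({
                "name": f"{label} Cluster Service",
                "port": port,
                "servers": sorted(servers[port]),
                "client_count": len(clients[port]),
            })
    return suggestions
```

Feeding it 50 source hosts that all connect to two destination IPs on port 5432 yields a single "Postgres Cluster Service" suggestion listing both servers. A real implementation would likely also weigh flow frequency and volume, which this sketch ignores.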

3.3. Application-Aware Services (The eBPF Profiler)

Mechanism: eBPF-powered agent running on host nodes.
Functionality:
- Socket-to-Process Mapping: Identifies which binary (e.g., nginx, java, python) is listening on which port.
- Dependency Tracking: Observes outbound connections from a process to other IPs/ports.
Use Case: Connecting the "Postgres" process on a VM to the "Database" service and showing that the "Frontend" process on another VM is its primary consumer. This is our "Lightweight APM."

4. Functional Requirements

4.1. The Service Catalog

A centralized UI view displaying all defined services.
Each service must show a Unified Health Score (0-100).
Status Aggregation: A Service is "Critical" if a core component (defined by policy) has a Critical Alert.
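
As a rough sketch of the aggregation rule above: the `Component` type and its fields are hypothetical, the "OK" state is a placeholder because the PRD only defines "Critical", and the averaging formula for the health score is an assumption (the PRD fixes only the 0-100 range).

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    is_core: bool                     # set by policy, per the PRD
    health: int                       # 0-100; hypothetical per-component score
    has_critical_alert: bool = False


def service_status(components):
    """Return "Critical" if any core component has a Critical Alert."""
    if any(c.is_core and c.has_critical_alert for c in components):
        return "Critical"
    return "OK"  # placeholder: other states are not defined in the PRD


def unified_health_score(components):
    """Assumed formula: mean component health, rounded and clamped to 0-100."""
    if not components:
        return 100  # assumption: an empty service is treated as healthy
    mean = sum(c.health for c in components) / len(components)
    return max(0, min(100, round(mean)))
```

Note that a critical alert on a non-core component does not change the status under this rule; whether it should lower the score is a policy decision left open here.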

4.2. Dependency Mapping (Apache AGE)

Visual representation of Service-to-Service and Service-to-Device dependencies.
Root Cause Analysis (RCA) Mode: When a Service is degraded, the map highlights the "shortest path" to the failing infrastructure component (e.g., a saturated switch port or a crashed process).
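
One way to picture RCA Mode is a breadth-first search over the dependency graph, stopping at the nearest failing vertex. The sketch below works on an in-memory adjacency dict with made-up vertex names; in the real system this would presumably be a graph query against Apache AGE rather than Python.

```python
from collections import deque


def shortest_path_to_failure(graph, start, failing):
    """Breadth-first search from a service to the nearest failing vertex.

    graph:   dict mapping a vertex to the list of vertices it depends on
    failing: set of vertices currently reporting a failure
    Returns the path as a list of vertices, or None if nothing failing is reachable.
    """
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node in failing:
            return path
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None
```

Because BFS explores by hop count, the first failing vertex it reaches is the one with the fewest hops from the service. That treats every dependency edge as equally important, which is a simplification.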

4.3. Service-Level Indicators (SLIs)

Users can attach specific TimescaleDB metrics to a Service.
Example: For a "Web Service," track the 95th percentile of response times across all member nodes.
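
To make the "95th percentile across all member nodes" example concrete, here is a small sketch using the nearest-rank definition of a percentile. The function names and the input shape (node name mapped to a list of response times) are assumptions; in practice the calculation would more likely run as a query inside TimescaleDB.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def service_p95(samples_by_node):
    """Pool response-time samples from every member node, then take p95."""
    pooled = [v for samples in samples_by_node.values() for v in samples]
    return percentile(pooled, 95)
```

The sketch pools raw samples before computing the percentile. Averaging each node's own p95 would give a different number and is generally not a valid service-wide percentile.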

4.4. The eBPF Bridge

The agent must report: PID, Process Name, Listening Port, and Active Connections.
The Control Plane must correlate this eBPF data with the NetFlow data to validate the traffic path.
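
The correlation step could look something like the following. Both record types are hypothetical simplifications: the eBPF record carries the fields listed above plus connection endpoints, and the NetFlow record keeps only the tuple needed for matching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EbpfConnection:
    pid: int
    process_name: str
    src_ip: str
    dst_ip: str
    dst_port: int


@dataclass(frozen=True)
class NetflowRecord:
    src_ip: str
    dst_ip: str
    dst_port: int


def correlate(ebpf_conns, netflow_records):
    """Split eBPF-reported connections into those also seen in NetFlow
    (validated) and those that were not (unconfirmed)."""
    seen_on_wire = {(r.src_ip, r.dst_ip, r.dst_port) for r in netflow_records}
    validated, unconfirmed = [], []
    for c in ebpf_conns:
        key = (c.src_ip, c.dst_ip, c.dst_port)
        (validated if key in seen_on_wire else unconfirmed).append(c)
    return validated, unconfirmed
```

An unconfirmed connection is not necessarily an error: sampled NetFlow, NAT, or traffic that never crosses an exporting device could all produce a miss, so a real correlator would probably need time windows and address translation awareness.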

5. Technical Architecture

5.1. Data Model (Multi-Model)

Relational (Postgres): Service metadata, ownership, and manual groupings.
Time-Series (TimescaleDB): Historical health scores and SLI metrics.
Graph (Apache AGE):
- Vertex: Service, Device, Process, Interface.
- Edge: DEPENDS_ON, TALKS_TO, CONNECTED_TO, RUNS_ON.
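
The vertex and edge labels above can be modelled as a small typed schema. The labels come from the PRD; the `ALLOWED` table of which vertex types each edge may join is an assumption made for illustration (only Service-to-Service and Service-to-Device dependencies are stated explicitly, in 4.2).

```python
from dataclasses import dataclass
from enum import Enum


class VertexType(Enum):
    SERVICE = "Service"
    DEVICE = "Device"
    PROCESS = "Process"
    INTERFACE = "Interface"


class EdgeType(Enum):
    DEPENDS_ON = "DEPENDS_ON"
    TALKS_TO = "TALKS_TO"
    CONNECTED_TO = "CONNECTED_TO"
    RUNS_ON = "RUNS_ON"


# Assumed endpoint rules; the PRD names the labels but not their endpoints.
ALLOWED = {
    EdgeType.DEPENDS_ON: {(VertexType.SERVICE, VertexType.SERVICE),
                          (VertexType.SERVICE, VertexType.DEVICE),
                          (VertexType.SERVICE, VertexType.PROCESS)},
    EdgeType.TALKS_TO: {(VertexType.PROCESS, VertexType.PROCESS),
                        (VertexType.DEVICE, VertexType.DEVICE)},
    EdgeType.CONNECTED_TO: {(VertexType.INTERFACE, VertexType.INTERFACE)},
    EdgeType.RUNS_ON: {(VertexType.PROCESS, VertexType.DEVICE)},
}


@dataclass(frozen=True)
class Vertex:
    id: str
    type: VertexType


@dataclass(frozen=True)
class Edge:
    src: Vertex
    dst: Vertex
    type: EdgeType

    def __post_init__(self):
        # Reject edges whose endpoints do not fit the assumed schema.
        if (self.src.type, self.dst.type) not in ALLOWED[self.type]:
            raise ValueError(
                f"{self.type.value} cannot join "
                f"{self.src.type.value} -> {self.dst.type.value}"
            )
```

Validating endpoints at construction time keeps malformed edges (for example, a Device that RUNS_ON a Process) out of the graph before they are written.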

5.2. The ServiceRadar Agent Evolution

Inject an eBPF-based "Watcher" into the existing agent.
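Goal: Minimal overhead. It should not intercept packets (like a proxy). Instead, it should attach lightweight probes (kprobes, or fentry/fexit where the kernel supports them) to the kernel functions tcp_v4_connect and inet_csk_accept, which are functions rather than static tracepoints, to build a map of "Who is talking to whom."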
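
On the user-space side, the "Who is talking to whom" map could be built by folding connect and accept events into directed client-to-server pairs. This sketch covers only that aggregation; the `SocketEvent` shape is hypothetical and the kernel-side eBPF program that would emit such events is not shown.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class SocketEvent:
    kind: str        # "connect" (outbound) or "accept" (inbound)
    pid: int
    comm: str        # process name
    local_ip: str
    remote_ip: str
    port: int        # destination port for connect, listening port for accept


def build_talk_map(events):
    """Count directed (client_ip, server_ip, server_port) pairs."""
    talk = Counter()
    for e in events:
        if e.kind == "connect":
            # The local side initiated the connection, so it is the client.
            talk[(e.local_ip, e.remote_ip, e.port)] += 1
        elif e.kind == "accept":
            # The remote side initiated the connection, so it is the client.
            talk[(e.remote_ip, e.local_ip, e.port)] += 1
    return talk
```

Normalising both event kinds to the same client-to-server key means a connection observed by agents on both hosts lands on one edge instead of two.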

6. User Experience & UI

6.1. The "Health Mosaic"

A high-level dashboard of all Services.
Users can "drill down" from a failing Service directly into the specific eBPF process or SNMP interface causing the issue.

6.2. The "Service Creator" Wizard

Suggestions: "We found 5 nodes running redis-server. Do you want to group them into a 'Redis Cache' Service?"
Automated Tagging: Any device or process whose name matches the glob pattern *-prod-web-* (as a regular expression: .*-prod-web-.*) is automatically added to the "Production Web" Service.
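
Both wizard behaviours can be sketched in a few lines. The pattern `*-prod-web-*` has shell-glob syntax, so the example matches it with Python's `fnmatch` module. The `TAG_RULES` table, the function names, and the `min_nodes` threshold are assumptions made for illustration.

```python
from collections import defaultdict
from fnmatch import fnmatchcase

# Hypothetical rule table: glob pattern -> service name.
TAG_RULES = {"*-prod-web-*": "Production Web"}


def services_for(name, rules=TAG_RULES):
    """Return the services whose glob pattern matches a device or process name."""
    return [svc for pattern, svc in rules.items() if fnmatchcase(name, pattern)]


def suggest_groups(processes_by_node, min_nodes=2):
    """Return {process_name: sorted nodes} for every process name seen
    on at least min_nodes distinct nodes."""
    nodes_by_process = defaultdict(set)
    for node, processes in processes_by_node.items():
        for proc in processes:
            nodes_by_process[proc].add(node)
    return {
        proc: sorted(nodes)
        for proc, nodes in sorted(nodes_by_process.items())
        if len(nodes) >= min_nodes
    }
```

With five nodes each running `redis-server`, `suggest_groups` returns that process with all five nodes, which is the input for a prompt like the one quoted above. A real wizard would probably filter out ubiquitous daemons such as `sshd` before suggesting anything.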

7. Competitive Differentiation (The "Datadog Killer" Strategy)

Cost: Unlike Datadog/New Relic, we do not charge for "Custom Metrics" or "Traces." We charge per Managed Asset. The eBPF/Service visibility is a feature of the asset.
Network Integration: Most APM tools are "blind" to the network (SNMP/LLDP). ServiceRadar shows the process and the switch it's plugged into.
Open Source Core: Users can run the "Service" engine on-prem for free, with the SaaS providing the heavy lifting for long-term retention and cross-site graph analysis.

8. Success Metrics

Time to Value: A new user should have at least one "Discovered Service" within 10 minutes of installing the first agent.
RCA Accuracy: Percentage of Service Alerts that correctly point to the underlying Device/Interface/Process failure.
Adoption: 50% of SaaS tenants define at least 3 Services within their first 30 days.
