feat: uptime checker #221
Imported from GitHub.
Original GitHub issue: #608
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/608
Original created: 2025-04-14T13:55:15Z
Tracks uptime and availability of services (e.g., HTTP APIs, TCP ports, gRPC endpoints) by periodically checking connectivity and calculating uptime percentage over sliding windows (e.g., 1h, 24h). Reports uptime (%), downtime incidents, and availability streaks for dashboard display.
Why Is This Exciting for the Dashboard?
Visuals:
Uptime Gauge: Show real-time uptime % (e.g., 99.95%) with green/yellow/red zones.
Bar Chart: Display uptime % across services over 24h, highlighting reliable vs. flaky services.
Timeline: Visualize downtime incidents (e.g., red bars for outages) over time.
Engagement: Uptime is a KPI everyone cares about—admins love seeing 99.99% in green. Downtime timelines add drama, showing when services faltered.
Data Appeal: Gauges and bars are eye-catching, making the dashboard feel alive with mission-critical metrics.
Value Proposition:
Unique Niche: Focuses on availability tracking, complementing sysmon (resource usage), rperf (performance), and snmp (devices). Unlike dusk, it’s generic for any service.
Lightweight: ~150 bytes per check every 60s (e.g., uptime_percent: 99.95, downtime_count: 0). At 1,440 checks per day that is ~0.2 MB/day/host, ideal for SQLite’s 24 GB/day.
Proxmox Fit: Tracks uptime of containerized apps or VMs, ensuring critical services (e.g., APIs, DBs) are reliable.
Security: mTLS for gRPC (tls-security.md), optional TLS/auth for service checks.
Dashboard Impact: Uptime gauges and downtime timelines are instantly understandable, boosting user confidence in service reliability.
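The storage estimate in the Lightweight point can be checked with a few lines of arithmetic. The sketch below assumes the figures quoted above (150 bytes per record, one check every 60 seconds); the function name bytesPerDay is made up for illustration.

```go
package main

import "fmt"

// bytesPerDay estimates daily storage for one host, given the size of a
// single check record and the interval between checks in seconds.
// (Hypothetical helper, not part of ServiceRadar.)
func bytesPerDay(recordBytes, intervalSeconds int) int {
	const secondsPerDay = 24 * 60 * 60
	checksPerDay := secondsPerDay / intervalSeconds
	return recordBytes * checksPerDay
}

func main() {
	// 150 bytes and 60 seconds are the figures quoted in the issue.
	total := bytesPerDay(150, 60)
	fmt.Printf("%d bytes/day (~%.2f MB)\n", total, float64(total)/1e6)
	// prints "216000 bytes/day (~0.22 MB)"
}
```

Per-row overhead in the database (indexes, timestamps, keys) is not included, so the real footprint would likely be somewhat higher.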
Implementation:
Logic (pkg/checker/uptime/uptime.go):
Check service via HTTP HEAD, TCP connect, or the gRPC health-checking protocol (server in google.golang.org/grpc/health, client stubs in google.golang.org/grpc/health/grpc_health_v1).
Track successes/failures in-memory, calculate uptime % over sliding windows (1h, 24h).
Count downtime incidents (each run of consecutive failures counts as one incident).
Implement checker.HealthChecker:
Check: True if service responds.
GetStatusData: JSON with {uptime_percent_1h, uptime_percent_24h, downtime_count, last_downtime}.
Data: Store in timeseries_metrics:
Config (/etc/serviceradar/checkers/uptime.json):
Storage: Add processUptimeMetrics to core/server.go, storing uptime and incidents.
Dashboard Integration:
API Endpoint: Add /api/metrics/uptime?poller_id=host1&name=api-uptime to pkg/core/api/server.go.
Next.js UI:
Add UptimeDashboard component:
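```jsx
import { RadialBarChart, RadialBar } from 'recharts';

// Radial gauge for the first service's 1h uptime percentage.
function UptimeDashboard({ metrics }) {
  if (!metrics || metrics.length === 0) return null;
  const data = [{ name: 'Uptime', value: metrics[0].uptime_percent_1h }];
  return (
    <RadialBarChart width={300} height={300} data={data}>
      <RadialBar dataKey="value" />
    </RadialBarChart>
  );
}
```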
Bar chart for 24h uptime: <BarChart data={metrics.map(m => ({name: m.name, uptime: m.uptime_percent_24h}))} />.
Timeline for downtime: <Timeline data={metrics.filter(m => m.downtime_count > 0)} />, where Timeline would be a custom component (Recharts does not export one).
Visual:
Radial gauge for 1h uptime % (e.g., 99.95% in green).
Bar chart comparing services’ 24h uptime.
Red timeline bars for downtime events.
Pros:
Visual Appeal: Uptime gauges and timelines are engaging and critical.
Unique: Focuses on availability, distinct from performance (rperf) or system (sysmon) metrics.
Ultra-Lightweight: ~0.2 MB/day/host, perfect for SQLite.
Proxmox: Ensures app reliability in containers/VMs.
Simple: No dependencies, just net/http or grpc/health.