feat: queue checker (Message Queue Health Monitor) #222
Imported from GitHub.
Original GitHub issue: #609
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/609
Original created: 2025-04-14T14:12:28Z
Monitors message queue health (e.g., RabbitMQ, Redis Streams, Kafka) by checking queue depth, consumer lag, and publish rate. Reports metrics such as messages queued, lag (ms), and rate (messages/sec), stored in SQLite for dashboard visualization.
Visuals:
Gauge: Show queue depth with warning thresholds (e.g., green <1000 messages, red >5000).
Line Chart: Plot consumer lag and publish rate over time, highlighting backlog trends.
Stacked Bar: Compare queue depths across services to spot bottlenecks.
Engagement: Queues are critical for async systems (e.g., microservices, ETL pipelines). Gauges and charts showing backlog or lag feel urgent and actionable, rivaling SolarWinds’ application monitoring.
Data Appeal: Bright gauges and trending lines make the dashboard lively, emphasizing system health.
Value Proposition:
Unique Niche: Targets message queues, untouched by sysmon (system), rperf (network), snmp (devices), or dusk (blockchain). Essential for event-driven apps on Proxmox.
Lightweight: ~200 bytes per check every 60s (e.g., queue_depth: 500, lag_ms: 100), about 0.3 MB/day/host, well within SQLite's 24 GB/day capacity.
Proxmox Fit: Monitors queues in containerized apps (e.g., RabbitMQ in LXC), complementing sysmon’s resource metrics.
Security: mTLS for gRPC, TLS/auth for queue APIs (tls-security.md).
SolarWinds/Nagios Fit: Matches SolarWinds' app monitoring (e.g., RabbitMQ plugins) but is simpler and more visual than Nagios' check_rabbitmq.
Dashboard Impact: Gauges and charts for queue health are intuitive, showing system bottlenecks instantly.
Implementation:
Logic (pkg/checker/queue/queue.go):
Connect via client libraries (e.g., github.com/rabbitmq/amqp091-go, github.com/redis/go-redis/v9).
Fetch queue depth (messages ready), consumer lag (time since last consume), publish rate.
Implement checker.HealthChecker:
Check: True if depth < threshold (e.g., 10000).
GetStatusData: JSON with {queue_depth, lag_ms, publish_rate}.
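The checker logic above could be sketched as follows. This is a minimal illustration, not the real `pkg/checker` interface: the `checker.HealthChecker` method set, the `QueueStatus` field names, and the injected `fetch` callback (standing in for a broker client such as amqp091-go) are all assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// QueueStatus is the payload reported by GetStatusData
// (field names assumed from the issue description).
type QueueStatus struct {
	QueueDepth  int64   `json:"queue_depth"`
	LagMs       int64   `json:"lag_ms"`
	PublishRate float64 `json:"publish_rate"`
}

// QueueChecker is a hypothetical checker.HealthChecker implementation;
// the real interface in pkg/checker may differ.
type QueueChecker struct {
	DepthThreshold int64
	fetch          func() QueueStatus // broker-specific fetch, e.g. via amqp091-go
}

// Check is healthy while queue depth stays below the threshold.
func (c *QueueChecker) Check() bool {
	return c.fetch().QueueDepth < c.DepthThreshold
}

// GetStatusData returns the JSON payload later stored in timeseries_metrics.
func (c *QueueChecker) GetStatusData() ([]byte, error) {
	return json.Marshal(c.fetch())
}

func main() {
	c := &QueueChecker{
		DepthThreshold: 10000,
		// Stubbed fetch so the sketch runs without a live broker.
		fetch: func() QueueStatus {
			return QueueStatus{QueueDepth: 500, LagMs: 100, PublishRate: 42.5}
		},
	}
	fmt.Println(c.Check())
	data, _ := c.GetStatusData()
	fmt.Println(string(data))
}
```

In a real checker, `fetch` would wrap a broker call (e.g., `Channel.QueueDeclarePassive` in amqp091-go returns the ready-message count for RabbitMQ).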
Data: Store in timeseries_metrics:
Config (/etc/serviceradar/checkers/queue.json):
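A possible shape for that config file, with all field names assumed for illustration (the issue does not specify a schema):

```json
{
  "name": "rabbit-queue",
  "type": "rabbitmq",
  "address": "amqp://localhost:5672",
  "queue": "tasks",
  "depth_threshold": 10000,
  "interval": "60s"
}
```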
Poller:
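The poller would then register the checker as a check target; a hypothetical snippet (field names and the gRPC address are assumptions, mirroring how other serviceradar checkers are typically wired):

```json
{
  "checks": [
    {
      "service_type": "grpc",
      "service_name": "rabbit-queue",
      "details": "localhost:50055"
    }
  ]
}
```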
Storage: Add processQueueMetrics to core/server.go, storing depth/lag/rate.
Dashboard Integration:
API Endpoint: Add /api/metrics/queue?poller_id=host1&name=rabbit-queue to pkg/core/api/server.go.