Poller status updates overwrite registration metadata with defaults (UPSERT) #689
Labels
No labels
1week
2weeks
Failed compliance check
IP cameras
NATS
Possible security concern
Review effort 1/5
Review effort 2/5
Review effort 3/5
Review effort 4/5
Review effort 5/5
UI
aardvark
accessibility
amd64
api
arm64
auth
back-end
bgp
blog
bug
build
checkers
ci-cd
cleanup
cnpg
codex
core
dependencies
device-management
documentation
duplicate
dusk
ebpf
enhancement
eta 1d
eta 1hr
eta 3d
eta 3hr
feature
fieldsurvey
github_actions
go
good first issue
help wanted
invalid
javascript
k8s
log-collector
mapper
mtr
needs-triage
netflow
network-sweep
observability
oracle
otel
plug-in
proton
python
question
reddit
redhat
research
rperf
rperf-checker
rust
sdk
security
serviceradar-agent
serviceradar-agent-gateway
serviceradar-web
serviceradar-web-ng
siem
snmp
sysmon
topology
ubiquiti
wasm
wontfix
zen-engine
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
carverauto/serviceradar#689
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Imported from GitHub.
Original GitHub issue: #2151
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/2151
Original created: 2025-12-16T05:18:31Z
Summary
UpdatePollerStatusis called throughout the codebase to update a poller's health status and last-seen timestamp in the database. The pollers table stores both operational state (is_healthy, last_seen) and registration metadata (component_id, registration_source, spiffe_identity, metadata).UpdatePollerStatuscorrupts poller registration metadata by overwriting it with hardcoded default values whenever called.Code with bug
pkg/db/cnpg_registry.go:The SQL upsert that uses these args:
Called from
pkg/core/pollers.go:Evidence
Failing test
Test script
Test output
The test passes, confirming that
buildCNPGPollerStatusArgshardcodes these values to defaults.Example
Consider a poller registered via edge onboarding:
Step 1: Poller is explicitly registered with metadata:
Database state:
poller_id: "edge-poller-01"component_id: "edge-device-abc123"registration_source: "edge-onboarding"spiffe_identity: "spiffe://example.org/edge/poller-01"metadata:{"environment":"production","datacenter":"us-east-1","owner":"ops-team"}Step 2: Poller reports its first status update. The server calls:
This creates a partial
PollerStatus:Step 3:
UpdatePollerStatuscallsbuildCNPGPollerStatusArgs, which generates:Step 4: The SQL upsert executes with
ON CONFLICT (poller_id) DO UPDATE SET ..., overwriting ALL fields.Database state after bug:
poller_id: "edge-poller-01"component_id: "" ❌ (was "edge-device-abc123")registration_source: "implicit" ❌ (was "edge-onboarding")spiffe_identity: "" ❌ (was "spiffe://example.org/edge/poller-01")metadata: {} ❌ (was{"environment":"production","datacenter":"us-east-1","owner":"ops-team"})is_healthy: true ✓last_seen: ✓Impact: The poller's SPIFFE identity is lost, breaking any authentication/authorization policies. The registration source is wrong, breaking audit trails. Custom metadata is gone, losing operational context.
Inconsistency within the codebase
Reference code
pkg/registry/service_registry.go- shows the CORRECT way to update poller status:Current code
pkg/core/pollers.go:Contradiction
The service registry code demonstrates awareness that poller fields must be preserved during updates. It explicitly:
In contrast,
storePollerStatus(andUpdatePollerStatusin general) creates a partial status object without fetching existing values, causingbuildCNPGPollerStatusArgsto generate hardcoded defaults that corrupt the database.Full context
The pollers table in ServiceRadar serves as a registry for monitoring agents. Pollers can be registered through multiple paths:
Edge onboarding: Pollers are pre-registered with SPIFFE identities and component IDs via the edge onboarding package system. This allows secure, pre-authorized deployment of monitoring agents to edge locations.
Manual registration: Administrators can register pollers with custom metadata, component associations, and registration sources for tracking and policy enforcement.
Implicit registration: Pollers that report status without prior registration are auto-registered with default values.
The
UpdatePollerStatusfunction is called frequently during normal operations:storePollerStatusinpkg/core/pollers.gowhen processing status reportsupdatePollerStatusinpkg/core/pollers.goduring health updatesEach call to
UpdatePollerStatuswith a partialPollerStatusobject (containing only operational fields like IsHealthy, LastSeen) triggersbuildCNPGPollerStatusArgs, which generates hardcoded defaults for all registration metadata fields. The SQL upsert then overwrites these fields in the database, corrupting any explicit registration data.The service registry package (
pkg/registry) implements the correct behavior by fetching the full poller record, preserving all fields, and only updating what changed. However, the core monitoring flow inpkg/core/pollers.gobypasses the service registry and directly callsUpdatePollerStatuswith partial data, triggering the bug.External documentation
When a conflict occurs, ALL columns in the SET clause are updated to the EXCLUDED values (the values from the INSERT VALUES clause). This is standard PostgreSQL behavior.
Losing SPIFFE identity information breaks security policies that rely on workload identity.
Why has this bug gone undetected?
Implicit registration path dominates: Most pollers in development and testing environments are implicitly registered (never pre-registered), so they start with default values. Overwriting defaults with defaults has no visible effect.
Edge onboarding is new: The edge onboarding feature that pre-registers pollers with SPIFFE identities and metadata is relatively new (as seen in git history). Before this feature, pollers were primarily implicitly registered.
No metadata validation: There are no integrity checks or alerts that would fire when a poller's registration metadata is unexpectedly changed.
Status updates are frequent: The first status update happens within seconds or minutes of poller startup, before anyone would manually inspect the database to verify registration data persistence.
Registration happens in different codepath: Registration uses
pkg/registry(which preserves fields correctly), but status updates usepkg/core/pollers.go→pkg/db(which corrupts fields). These paths aren't tested together in integration tests.models.PollerStatusis intentionally partial: ThePollerStatusmodel returned byGetPollerStatusonly includes 4 fields by design (for efficiency). The bug is thatUpdatePollerStatusdoesn't account for this partial data when generating SQL arguments.Recommended fix
Option 1: Make the SQL upsert only update the fields that should change:
However, this creates inconsistency because legitimate updates (via service registry) also use this SQL and need to update all fields.
Option 2 (RECOMMENDED): Change
storePollerStatusand other callers to fetch-and-preserve:However, this still doesn't solve the core issue:
models.PollerStatusdoesn't have fields for component_id, registration_source, etc., so even if we fetch existing status, we can't preserve those fields.Option 3 (BEST): Use service registry for ALL poller updates, not just registration:
This ensures all poller updates flow through the service registry, which correctly implements fetch-and-preserve logic.
Related bugs
The same bug pattern exists in:
pkg/core/pollers.go:updatePollerStatus()(lines 449-469) - also creates partial PollerStatuspkg/core/poller_recovery.go:processRecovery()(lines 38-71) - unused code but has same bugImported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/2151#issuecomment-3662941263
Original created: 2025-12-16T23:53:58Z
closing as completed