Poller status updates overwrite registration metadata with defaults (UPSERT) #689

Closed
opened 2026-03-28 04:27:27 +00:00 by mfreeman451 · 1 comment
Owner

Imported from GitHub.

Original GitHub issue: #2151
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/2151
Original created: 2025-12-16T05:18:31Z


Summary

  • Context: UpdatePollerStatus is called throughout the codebase to update a poller's health status and last-seen timestamp in the database. The pollers table stores both operational state (is_healthy, last_seen) and registration metadata (component_id, registration_source, spiffe_identity, metadata).
  • Bug: UpdatePollerStatus corrupts poller registration metadata by overwriting it with hardcoded default values whenever called.
  • Actual vs. expected: When updating only operational fields (is_healthy, last_seen), all registration metadata fields are reset to defaults ("", "implicit", "active", "", {}) instead of being preserved.
  • Impact: Pollers registered via edge onboarding or manual registration lose their SPIFFE identity, component ID, registration source, and custom metadata after their first status update, breaking traceability and potentially security policies.

Code with bug

pkg/db/cnpg_registry.go:

func buildCNPGPollerStatusArgs(status *models.PollerStatus) ([]interface{}, error) {
	if status == nil {
		return nil, ErrPollerStatusNil
	}

	id := strings.TrimSpace(status.PollerID)
	if id == "" {
		return nil, ErrPollerIDMissing
	}

	firstSeen := sanitizeTimestamp(status.FirstSeen)
	lastSeen := sanitizeTimestamp(status.LastSeen)

	return []interface{}{
		id,
		"",         // component_id  // <-- BUG 🔴 hardcoded to empty string
		"implicit", // registration_source  // <-- BUG 🔴 hardcoded to "implicit"
		"active",   // status  // <-- BUG 🔴 hardcoded to "active"
		"",         // spiffe_identity  // <-- BUG 🔴 hardcoded to empty string
		firstSeen,  // first_registered
		firstSeen,  // first_seen
		lastSeen,   // last_seen
		json.RawMessage(`{}`),  // <-- BUG 🔴 metadata hardcoded to {}
		systemActor,
		status.IsHealthy,
		int32(0),
		int32(0),
		nowUTC(),
	}, nil
}

The SQL upsert that uses these args:

INSERT INTO pollers (
	poller_id,
	component_id,
	registration_source,
	status,
	spiffe_identity,
	first_registered,
	first_seen,
	last_seen,
	metadata,
	created_by,
	is_healthy,
	agent_count,
	checker_count,
	updated_at
) VALUES (
	$1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14
)
ON CONFLICT (poller_id) DO UPDATE SET
	component_id = EXCLUDED.component_id,  // <-- BUG 🔴 overwrites with ""
	registration_source = EXCLUDED.registration_source,  // <-- BUG 🔴 overwrites with "implicit"
	status = EXCLUDED.status,  // <-- BUG 🔴 overwrites with "active"
	spiffe_identity = EXCLUDED.spiffe_identity,  // <-- BUG 🔴 overwrites with ""
	first_registered = EXCLUDED.first_registered,
	first_seen = EXCLUDED.first_seen,
	last_seen = EXCLUDED.last_seen,
	metadata = EXCLUDED.metadata,  // <-- BUG 🔴 overwrites with {}
	created_by = EXCLUDED.created_by,
	is_healthy = EXCLUDED.is_healthy,
	agent_count = EXCLUDED.agent_count,
	checker_count = EXCLUDED.checker_count,
	updated_at = EXCLUDED.updated_at

Called from pkg/core/pollers.go:

func (s *Server) storePollerStatus(ctx context.Context, pollerID string, isHealthy bool, hostIP, partition string, now time.Time) error {
	normIP := normalizeHostIP(hostIP)
	normalizedPartition := strings.TrimSpace(partition)
	if normalizedPartition == "" {
		normalizedPartition = defaultPartition
	}

	pollerStatus := &models.PollerStatus{
		PollerID:  pollerID,
		IsHealthy: isHealthy,
		LastSeen:  now,
		HostIP:    normIP,
	}

	if err := s.DB.UpdatePollerStatus(ctx, pollerStatus); err != nil {  // <-- BUG 🔴 partial status
		return fmt.Errorf("failed to store poller status: %w", err)
	}
	// ... rest of function
}

Evidence

Failing test

Test script

package db

import (
	"encoding/json"
	"testing"
	"time"

	"github.com/carverauto/serviceradar/pkg/models"
	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

// TestBuildCNPGPollerStatusArgs_HardcodesDefaults demonstrates the root cause.
func TestBuildCNPGPollerStatusArgs_HardcodesDefaults(t *testing.T) {
	// Create a PollerStatus as returned by GetPollerStatus
	status := &models.PollerStatus{
		PollerID:  "test-poller",
		FirstSeen: time.Now().Add(-24 * time.Hour),
		LastSeen:  time.Now(),
		IsHealthy: true,
	}

	// Call buildCNPGPollerStatusArgs
	args, err := buildCNPGPollerStatusArgs(status)
	require.NoError(t, err)
	require.Len(t, args, 14, "Should have 14 arguments for INSERT")

	// Verify the hardcoded defaults
	assert.Equal(t, "test-poller", args[0], "poller_id")
	assert.Equal(t, "", args[1], "component_id is hardcoded to empty string")
	assert.Equal(t, "implicit", args[2], "registration_source is hardcoded to 'implicit'")
	assert.Equal(t, "active", args[3], "status is hardcoded to 'active'")
	assert.Equal(t, "", args[4], "spiffe_identity is hardcoded to empty string")
	// args[5] = first_registered (firstSeen)
	// args[6] = first_seen (firstSeen)
	// args[7] = last_seen (lastSeen)
	// args[8] is json.RawMessage type, convert to string
	metadataBytes, ok := args[8].(json.RawMessage)
	require.True(t, ok, "metadata should be json.RawMessage")
	assert.Equal(t, `{}`, string(metadataBytes), "metadata is hardcoded to empty JSON")
	assert.Equal(t, "system", args[9], "created_by is hardcoded to 'system'")
	assert.Equal(t, true, args[10], "is_healthy")
	assert.Equal(t, int32(0), args[11], "agent_count is hardcoded to 0")
	assert.Equal(t, int32(0), args[12], "checker_count is hardcoded to 0")
	// args[13] = updated_at

	// The bug: These hardcoded values will OVERWRITE any existing values in the database
	// when the upsert's ON CONFLICT clause executes:
	// ON CONFLICT (poller_id) DO UPDATE SET
	//   component_id = EXCLUDED.component_id,  <- overwrites with ""
	//   registration_source = EXCLUDED.registration_source,  <- overwrites with "implicit"
	//   spiffe_identity = EXCLUDED.spiffe_identity,  <- overwrites with ""
	//   metadata = EXCLUDED.metadata,  <- overwrites with {}
	//   ...
}

Test output

=== RUN   TestBuildCNPGPollerStatusArgs_HardcodesDefaults
--- PASS: TestBuildCNPGPollerStatusArgs_HardcodesDefaults (0.00s)
PASS
ok  	github.com/carverauto/serviceradar/pkg/db	0.005s

The test passes, confirming that buildCNPGPollerStatusArgs hardcodes these values to defaults.

Example

Consider a poller registered via edge onboarding:

Step 1: Poller is explicitly registered with metadata:

INSERT INTO pollers (
  poller_id, component_id, registration_source, spiffe_identity, metadata, ...
) VALUES (
  'edge-poller-01',
  'edge-device-abc123',
  'edge-onboarding',
  'spiffe://example.org/edge/poller-01',
  '{"environment":"production","datacenter":"us-east-1","owner":"ops-team"}',
  ...
)

Database state:

  • poller_id: "edge-poller-01"
  • component_id: "edge-device-abc123"
  • registration_source: "edge-onboarding"
  • spiffe_identity: "spiffe://example.org/edge/poller-01"
  • metadata: {"environment":"production","datacenter":"us-east-1","owner":"ops-team"}

Step 2: Poller reports its first status update. The server calls:

storePollerStatus(ctx, "edge-poller-01", true, "192.168.1.50", "prod", time.Now())

This creates a partial PollerStatus:

pollerStatus := &models.PollerStatus{
    PollerID:  "edge-poller-01",
    IsHealthy: true,
    LastSeen:  now,
    HostIP:    "192.168.1.50",
}

Step 3: UpdatePollerStatus calls buildCNPGPollerStatusArgs, which generates:

[]interface{}{
    "edge-poller-01",  // poller_id
    "",                // component_id - LOST!
    "implicit",        // registration_source - CORRUPTED!
    "active",          // status
    "",                // spiffe_identity - LOST!
    ...,
    json.RawMessage(`{}`),  // metadata - LOST!
    ...,
}

Step 4: The SQL upsert executes with ON CONFLICT (poller_id) DO UPDATE SET ..., overwriting ALL fields.

Database state after bug:

  • poller_id: "edge-poller-01"
  • component_id: "" (was "edge-device-abc123")
  • registration_source: "implicit" (was "edge-onboarding")
  • spiffe_identity: "" (was "spiffe://example.org/edge/poller-01")
  • metadata: {} (was {"environment":"production","datacenter":"us-east-1","owner":"ops-team"})
  • is_healthy: true ✓
  • last_seen: ✓

Impact: The poller's SPIFFE identity is lost, breaking any authentication/authorization policies. The registration source is wrong, breaking audit trails. Custom metadata is gone, losing operational context.

Inconsistency within the codebase

Reference code

pkg/registry/service_registry.go - shows the CORRECT way to update poller status:

func (r *ServiceRegistry) RecordHeartbeat(ctx context.Context, heartbeat *ServiceHeartbeat) error {
	// ... validation ...

	// Get existing poller or auto-register
	poller, err := r.GetPoller(ctx, pollerID)
	if err != nil {
		if errors.Is(err, ErrServiceNotFound) {
			// Auto-register if needed
			if err := r.RegisterPoller(ctx, &PollerRegistration{
				PollerID:           pollerID,
				ComponentID:        pollerID,
				RegistrationSource: RegistrationSourceImplicit,
				CreatedBy:          "system",
			}); err != nil {
				return fmt.Errorf("failed to auto-register poller: %w", err)
			}
			poller, _ = r.GetPoller(ctx, pollerID)
		}
	}

	// Determine new status
	newStatus := poller.Status
	if poller.Status == ServiceStatusPending {
		newStatus = ServiceStatusActive
	}

	// Set first_seen if this is the first report
	firstSeen := poller.FirstSeen
	if firstSeen == nil {
		firstSeen = &timestamp
	}

	updatedPoller := &RegisteredPoller{
		PollerID:           poller.PollerID,
		ComponentID:        poller.ComponentID,           // ✓ PRESERVED
		Status:             newStatus,
		RegistrationSource: poller.RegistrationSource,    // ✓ PRESERVED
		FirstRegistered:    poller.FirstRegistered,       // ✓ PRESERVED
		FirstSeen:          firstSeen,
		LastSeen:           &timestamp,
		Metadata:           poller.Metadata,              // ✓ PRESERVED
		SPIFFEIdentity:     poller.SPIFFEIdentity,        // ✓ PRESERVED
		CreatedBy:          poller.CreatedBy,             // ✓ PRESERVED
		AgentCount:         poller.AgentCount,
		CheckerCount:       poller.CheckerCount,
	}

	if err := r.upsertCNPGPoller(ctx, updatedPoller); err != nil {
		return err
	}
	// ...
}

Current code

pkg/core/pollers.go:

func (s *Server) storePollerStatus(ctx context.Context, pollerID string, isHealthy bool, hostIP, partition string, now time.Time) error {
	normIP := normalizeHostIP(hostIP)
	normalizedPartition := strings.TrimSpace(partition)
	if normalizedPartition == "" {
		normalizedPartition = defaultPartition
	}

	pollerStatus := &models.PollerStatus{
		PollerID:  pollerID,
		IsHealthy: isHealthy,
		LastSeen:  now,
		HostIP:    normIP,
	}
	// ❌ Does NOT fetch existing values
	// ❌ Does NOT preserve component_id, registration_source, spiffe_identity, metadata

	if err := s.DB.UpdatePollerStatus(ctx, pollerStatus); err != nil {
		return fmt.Errorf("failed to store poller status: %w", err)
	}
	// ...
}

Contradiction

The service registry code demonstrates awareness that poller fields must be preserved during updates. It explicitly:

  1. Fetches the full existing poller record
  2. Creates an updated record that preserves all registration fields
  3. Only modifies the fields that should change (status, last_seen)

In contrast, storePollerStatus (and UpdatePollerStatus in general) creates a partial status object without fetching existing values, causing buildCNPGPollerStatusArgs to generate hardcoded defaults that corrupt the database.

Full context

The pollers table in ServiceRadar serves as a registry for monitoring agents. Pollers can be registered through multiple paths:

  1. Edge onboarding: Pollers are pre-registered with SPIFFE identities and component IDs via the edge onboarding package system. This allows secure, pre-authorized deployment of monitoring agents to edge locations.

  2. Manual registration: Administrators can register pollers with custom metadata, component associations, and registration sources for tracking and policy enforcement.

  3. Implicit registration: Pollers that report status without prior registration are auto-registered with default values.

The UpdatePollerStatus function is called frequently during normal operations:

  • From storePollerStatus in pkg/core/pollers.go when processing status reports
  • From updatePollerStatus in pkg/core/pollers.go during health updates
  • From the periodic health check job that updates poller health status

Each call to UpdatePollerStatus with a partial PollerStatus object (containing only operational fields like IsHealthy, LastSeen) triggers buildCNPGPollerStatusArgs, which generates hardcoded defaults for all registration metadata fields. The SQL upsert then overwrites these fields in the database, corrupting any explicit registration data.

The service registry package (pkg/registry) implements the correct behavior by fetching the full poller record, preserving all fields, and only updating what changed. However, the core monitoring flow in pkg/core/pollers.go bypasses the service registry and directly calls UpdatePollerStatus with partial data, triggering the bug.

External documentation

INSERT INTO table_name ...
ON CONFLICT (conflict_column) DO UPDATE SET
  column1 = EXCLUDED.column1,
  column2 = EXCLUDED.column2;

When a conflict occurs, ALL columns in the SET clause are updated to the EXCLUDED values (the values from the INSERT VALUES clause). This is standard PostgreSQL behavior.

SPIFFE (Secure Production Identity Framework For Everyone) provides a secure
identity framework for workloads in dynamic and heterogeneous environments.
SPIFFE IDs are used for authentication and authorization.

Losing SPIFFE identity information breaks security policies that rely on workload identity.

Why has this bug gone undetected?

  1. Implicit registration path dominates: Most pollers in development and testing environments are implicitly registered (never pre-registered), so they start with default values. Overwriting defaults with defaults has no visible effect.

  2. Edge onboarding is new: The edge onboarding feature that pre-registers pollers with SPIFFE identities and metadata is relatively new (as seen in git history). Before this feature, pollers were primarily implicitly registered.

  3. No metadata validation: There are no integrity checks or alerts that would fire when a poller's registration metadata is unexpectedly changed.

  4. Status updates are frequent: The first status update happens within seconds or minutes of poller startup, before anyone would manually inspect the database to verify registration data persistence.

  5. Registration happens in different codepath: Registration uses pkg/registry (which preserves fields correctly), but status updates use pkg/core/pollers.gopkg/db (which corrupts fields). These paths aren't tested together in integration tests.

  6. models.PollerStatus is intentionally partial: The PollerStatus model returned by GetPollerStatus only includes 4 fields by design (for efficiency). The bug is that UpdatePollerStatus doesn't account for this partial data when generating SQL arguments.

Recommended fix

Option 1: Make the SQL upsert only update the fields that should change:

const upsertPollerStatusSQL = `
INSERT INTO pollers (
	poller_id,
	component_id,
	registration_source,
	status,
	spiffe_identity,
	first_registered,
	first_seen,
	last_seen,
	metadata,
	created_by,
	is_healthy,
	agent_count,
	checker_count,
	updated_at
) VALUES (
	$1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14
)
ON CONFLICT (poller_id) DO UPDATE SET
	last_seen = EXCLUDED.last_seen,        // <-- FIX 🟢 only update operational fields
	is_healthy = EXCLUDED.is_healthy,      // <-- FIX 🟢
	updated_at = EXCLUDED.updated_at       // <-- FIX 🟢
`

However, this creates inconsistency because legitimate updates (via service registry) also use this SQL and need to update all fields.

Option 2 (RECOMMENDED): Change storePollerStatus and other callers to fetch-and-preserve:

func (s *Server) storePollerStatus(ctx context.Context, pollerID string, isHealthy bool, hostIP, partition string, now time.Time) error {
	normIP := normalizeHostIP(hostIP)
	normalizedPartition := strings.TrimSpace(partition)
	if normalizedPartition == "" {
		normalizedPartition = defaultPartition
	}

	// Fetch existing status to preserve registration metadata  // <-- FIX 🟢
	existingStatus, err := s.DB.GetPollerStatus(ctx, pollerID)
	var pollerStatus *models.PollerStatus

	if err != nil && !errors.Is(err, db.ErrFailedToQuery) {
		return fmt.Errorf("failed to fetch existing poller status: %w", err)
	}

	if err != nil {
		// New poller - use minimal status
		pollerStatus = &models.PollerStatus{
			PollerID:  pollerID,
			FirstSeen: now,
			IsHealthy: isHealthy,
			LastSeen:  now,
			HostIP:    normIP,
		}
	} else {
		// Update existing - preserve FirstSeen  // <-- FIX 🟢
		pollerStatus = existingStatus
		pollerStatus.IsHealthy = isHealthy
		pollerStatus.LastSeen = now
		pollerStatus.HostIP = normIP
	}

	if err := s.DB.UpdatePollerStatus(ctx, pollerStatus); err != nil {
		return fmt.Errorf("failed to store poller status: %w", err)
	}
	// ... rest of function
}

However, this still doesn't solve the core issue: models.PollerStatus doesn't have fields for component_id, registration_source, etc., so even if we fetch existing status, we can't preserve those fields.

Option 3 (BEST): Use service registry for ALL poller updates, not just registration:

func (s *Server) storePollerStatus(ctx context.Context, pollerID string, isHealthy bool, hostIP, partition string, now time.Time) error {
	// Use service registry heartbeat mechanism which properly preserves metadata  // <-- FIX 🟢
	heartbeat := &registry.ServiceHeartbeat{
		PollerID:  pollerID,
		Timestamp: now,
		SourceIP:  hostIP,
		Partition: partition,
	}

	return s.serviceRegistry.RecordHeartbeat(ctx, heartbeat)
}

This ensures all poller updates flow through the service registry, which correctly implements fetch-and-preserve logic.

Related bugs

The same bug pattern exists in:

  • pkg/core/pollers.go:updatePollerStatus() (lines 449-469) - also creates partial PollerStatus
  • pkg/core/poller_recovery.go:processRecovery() (lines 38-71) - unused code but has same bug
Imported from GitHub. Original GitHub issue: #2151 Original author: @mfreeman451 Original URL: https://github.com/carverauto/serviceradar/issues/2151 Original created: 2025-12-16T05:18:31Z --- # Summary - **Context**: `UpdatePollerStatus` is called throughout the codebase to update a poller's health status and last-seen timestamp in the database. The pollers table stores both operational state (is_healthy, last_seen) and registration metadata (component_id, registration_source, spiffe_identity, metadata). - **Bug**: `UpdatePollerStatus` corrupts poller registration metadata by overwriting it with hardcoded default values whenever called. - **Actual vs. expected**: When updating only operational fields (is_healthy, last_seen), all registration metadata fields are reset to defaults ("", "implicit", "active", "", {}) instead of being preserved. - **Impact**: Pollers registered via edge onboarding or manual registration lose their SPIFFE identity, component ID, registration source, and custom metadata after their first status update, breaking traceability and potentially security policies. # Code with bug `pkg/db/cnpg_registry.go`: ```go func buildCNPGPollerStatusArgs(status *models.PollerStatus) ([]interface{}, error) { if status == nil { return nil, ErrPollerStatusNil } id := strings.TrimSpace(status.PollerID) if id == "" { return nil, ErrPollerIDMissing } firstSeen := sanitizeTimestamp(status.FirstSeen) lastSeen := sanitizeTimestamp(status.LastSeen) return []interface{}{ id, "", // component_id // <-- BUG 🔴 hardcoded to empty string "implicit", // registration_source // <-- BUG 🔴 hardcoded to "implicit" "active", // status // <-- BUG 🔴 hardcoded to "active" "", // spiffe_identity // <-- BUG 🔴 hardcoded to empty string firstSeen, // first_registered firstSeen, // first_seen lastSeen, // last_seen json.RawMessage(`{}`), // <-- BUG 🔴 metadata hardcoded to {} systemActor, status.IsHealthy, int32(0), int32(0), nowUTC(), }, nil } ``` The SQL upsert that uses these args: ```sql INSERT INTO pollers ( poller_id, component_id, registration_source, status, spiffe_identity, first_registered, first_seen, last_seen, metadata, created_by, is_healthy, agent_count, checker_count, updated_at ) VALUES ( $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14 ) ON CONFLICT (poller_id) DO UPDATE SET component_id = EXCLUDED.component_id, // <-- BUG 🔴 overwrites with "" registration_source = EXCLUDED.registration_source, // <-- BUG 🔴 overwrites with "implicit" status = EXCLUDED.status, // <-- BUG 🔴 overwrites with "active" spiffe_identity = EXCLUDED.spiffe_identity, // <-- BUG 🔴 overwrites with "" first_registered = EXCLUDED.first_registered, first_seen = EXCLUDED.first_seen, last_seen = EXCLUDED.last_seen, metadata = EXCLUDED.metadata, // <-- BUG 🔴 overwrites with {} created_by = EXCLUDED.created_by, is_healthy = EXCLUDED.is_healthy, agent_count = EXCLUDED.agent_count, checker_count = EXCLUDED.checker_count, updated_at = EXCLUDED.updated_at ``` Called from `pkg/core/pollers.go`: ```go func (s *Server) storePollerStatus(ctx context.Context, pollerID string, isHealthy bool, hostIP, partition string, now time.Time) error { normIP := normalizeHostIP(hostIP) normalizedPartition := strings.TrimSpace(partition) if normalizedPartition == "" { normalizedPartition = defaultPartition } pollerStatus := &models.PollerStatus{ PollerID: pollerID, IsHealthy: isHealthy, LastSeen: now, HostIP: normIP, } if err := s.DB.UpdatePollerStatus(ctx, pollerStatus); err != nil { // <-- BUG 🔴 partial status return fmt.Errorf("failed to store poller status: %w", err) } // ... rest of function } ``` # Evidence ## Failing test ### Test script ```go package db import ( "encoding/json" "testing" "time" "github.com/carverauto/serviceradar/pkg/models" "github.com/stretchr/testify/assert" "github.com/stretchr/testify/require" ) // TestBuildCNPGPollerStatusArgs_HardcodesDefaults demonstrates the root cause. func TestBuildCNPGPollerStatusArgs_HardcodesDefaults(t *testing.T) { // Create a PollerStatus as returned by GetPollerStatus status := &models.PollerStatus{ PollerID: "test-poller", FirstSeen: time.Now().Add(-24 * time.Hour), LastSeen: time.Now(), IsHealthy: true, } // Call buildCNPGPollerStatusArgs args, err := buildCNPGPollerStatusArgs(status) require.NoError(t, err) require.Len(t, args, 14, "Should have 14 arguments for INSERT") // Verify the hardcoded defaults assert.Equal(t, "test-poller", args[0], "poller_id") assert.Equal(t, "", args[1], "component_id is hardcoded to empty string") assert.Equal(t, "implicit", args[2], "registration_source is hardcoded to 'implicit'") assert.Equal(t, "active", args[3], "status is hardcoded to 'active'") assert.Equal(t, "", args[4], "spiffe_identity is hardcoded to empty string") // args[5] = first_registered (firstSeen) // args[6] = first_seen (firstSeen) // args[7] = last_seen (lastSeen) // args[8] is json.RawMessage type, convert to string metadataBytes, ok := args[8].(json.RawMessage) require.True(t, ok, "metadata should be json.RawMessage") assert.Equal(t, `{}`, string(metadataBytes), "metadata is hardcoded to empty JSON") assert.Equal(t, "system", args[9], "created_by is hardcoded to 'system'") assert.Equal(t, true, args[10], "is_healthy") assert.Equal(t, int32(0), args[11], "agent_count is hardcoded to 0") assert.Equal(t, int32(0), args[12], "checker_count is hardcoded to 0") // args[13] = updated_at // The bug: These hardcoded values will OVERWRITE any existing values in the database // when the upsert's ON CONFLICT clause executes: // ON CONFLICT (poller_id) DO UPDATE SET // component_id = EXCLUDED.component_id, <- overwrites with "" // registration_source = EXCLUDED.registration_source, <- overwrites with "implicit" // spiffe_identity = EXCLUDED.spiffe_identity, <- overwrites with "" // metadata = EXCLUDED.metadata, <- overwrites with {} // ... } ``` ### Test output ``` === RUN TestBuildCNPGPollerStatusArgs_HardcodesDefaults --- PASS: TestBuildCNPGPollerStatusArgs_HardcodesDefaults (0.00s) PASS ok github.com/carverauto/serviceradar/pkg/db 0.005s ``` The test passes, confirming that `buildCNPGPollerStatusArgs` hardcodes these values to defaults. ## Example Consider a poller registered via edge onboarding: **Step 1**: Poller is explicitly registered with metadata: ```sql INSERT INTO pollers ( poller_id, component_id, registration_source, spiffe_identity, metadata, ... ) VALUES ( 'edge-poller-01', 'edge-device-abc123', 'edge-onboarding', 'spiffe://example.org/edge/poller-01', '{"environment":"production","datacenter":"us-east-1","owner":"ops-team"}', ... ) ``` Database state: - `poller_id`: "edge-poller-01" - `component_id`: "edge-device-abc123" - `registration_source`: "edge-onboarding" - `spiffe_identity`: "spiffe://example.org/edge/poller-01" - `metadata`: `{"environment":"production","datacenter":"us-east-1","owner":"ops-team"}` **Step 2**: Poller reports its first status update. The server calls: ```go storePollerStatus(ctx, "edge-poller-01", true, "192.168.1.50", "prod", time.Now()) ``` This creates a partial `PollerStatus`: ```go pollerStatus := &models.PollerStatus{ PollerID: "edge-poller-01", IsHealthy: true, LastSeen: now, HostIP: "192.168.1.50", } ``` **Step 3**: `UpdatePollerStatus` calls `buildCNPGPollerStatusArgs`, which generates: ```go []interface{}{ "edge-poller-01", // poller_id "", // component_id - LOST! "implicit", // registration_source - CORRUPTED! "active", // status "", // spiffe_identity - LOST! ..., json.RawMessage(`{}`), // metadata - LOST! ..., } ``` **Step 4**: The SQL upsert executes with `ON CONFLICT (poller_id) DO UPDATE SET ...`, overwriting ALL fields. Database state after bug: - `poller_id`: "edge-poller-01" - `component_id`: "" ❌ (was "edge-device-abc123") - `registration_source`: "implicit" ❌ (was "edge-onboarding") - `spiffe_identity`: "" ❌ (was "spiffe://example.org/edge/poller-01") - `metadata`: {} ❌ (was `{"environment":"production","datacenter":"us-east-1","owner":"ops-team"}`) - `is_healthy`: true ✓ - `last_seen`: <current time> ✓ **Impact**: The poller's SPIFFE identity is lost, breaking any authentication/authorization policies. The registration source is wrong, breaking audit trails. Custom metadata is gone, losing operational context. ## Inconsistency within the codebase ### Reference code `pkg/registry/service_registry.go` - shows the CORRECT way to update poller status: ```go func (r *ServiceRegistry) RecordHeartbeat(ctx context.Context, heartbeat *ServiceHeartbeat) error { // ... validation ... // Get existing poller or auto-register poller, err := r.GetPoller(ctx, pollerID) if err != nil { if errors.Is(err, ErrServiceNotFound) { // Auto-register if needed if err := r.RegisterPoller(ctx, &PollerRegistration{ PollerID: pollerID, ComponentID: pollerID, RegistrationSource: RegistrationSourceImplicit, CreatedBy: "system", }); err != nil { return fmt.Errorf("failed to auto-register poller: %w", err) } poller, _ = r.GetPoller(ctx, pollerID) } } // Determine new status newStatus := poller.Status if poller.Status == ServiceStatusPending { newStatus = ServiceStatusActive } // Set first_seen if this is the first report firstSeen := poller.FirstSeen if firstSeen == nil { firstSeen = &timestamp } updatedPoller := &RegisteredPoller{ PollerID: poller.PollerID, ComponentID: poller.ComponentID, // ✓ PRESERVED Status: newStatus, RegistrationSource: poller.RegistrationSource, // ✓ PRESERVED FirstRegistered: poller.FirstRegistered, // ✓ PRESERVED FirstSeen: firstSeen, LastSeen: &timestamp, Metadata: poller.Metadata, // ✓ PRESERVED SPIFFEIdentity: poller.SPIFFEIdentity, // ✓ PRESERVED CreatedBy: poller.CreatedBy, // ✓ PRESERVED AgentCount: poller.AgentCount, CheckerCount: poller.CheckerCount, } if err := r.upsertCNPGPoller(ctx, updatedPoller); err != nil { return err } // ... } ``` ### Current code `pkg/core/pollers.go`: ```go func (s *Server) storePollerStatus(ctx context.Context, pollerID string, isHealthy bool, hostIP, partition string, now time.Time) error { normIP := normalizeHostIP(hostIP) normalizedPartition := strings.TrimSpace(partition) if normalizedPartition == "" { normalizedPartition = defaultPartition } pollerStatus := &models.PollerStatus{ PollerID: pollerID, IsHealthy: isHealthy, LastSeen: now, HostIP: normIP, } // ❌ Does NOT fetch existing values // ❌ Does NOT preserve component_id, registration_source, spiffe_identity, metadata if err := s.DB.UpdatePollerStatus(ctx, pollerStatus); err != nil { return fmt.Errorf("failed to store poller status: %w", err) } // ... } ``` ### Contradiction The service registry code demonstrates awareness that poller fields must be preserved during updates. It explicitly: 1. Fetches the full existing poller record 2. Creates an updated record that preserves all registration fields 3. Only modifies the fields that should change (status, last_seen) In contrast, `storePollerStatus` (and `UpdatePollerStatus` in general) creates a partial status object without fetching existing values, causing `buildCNPGPollerStatusArgs` to generate hardcoded defaults that corrupt the database. # Full context The pollers table in ServiceRadar serves as a registry for monitoring agents. Pollers can be registered through multiple paths: 1. **Edge onboarding**: Pollers are pre-registered with SPIFFE identities and component IDs via the edge onboarding package system. This allows secure, pre-authorized deployment of monitoring agents to edge locations. 2. **Manual registration**: Administrators can register pollers with custom metadata, component associations, and registration sources for tracking and policy enforcement. 3. **Implicit registration**: Pollers that report status without prior registration are auto-registered with default values. The `UpdatePollerStatus` function is called frequently during normal operations: - From `storePollerStatus` in `pkg/core/pollers.go` when processing status reports - From `updatePollerStatus` in `pkg/core/pollers.go` during health updates - From the periodic health check job that updates poller health status Each call to `UpdatePollerStatus` with a partial `PollerStatus` object (containing only operational fields like IsHealthy, LastSeen) triggers `buildCNPGPollerStatusArgs`, which generates hardcoded defaults for all registration metadata fields. The SQL upsert then overwrites these fields in the database, corrupting any explicit registration data. The service registry package (`pkg/registry`) implements the correct behavior by fetching the full poller record, preserving all fields, and only updating what changed. However, the core monitoring flow in `pkg/core/pollers.go` bypasses the service registry and directly calls `UpdatePollerStatus` with partial data, triggering the bug. ## External documentation - [PostgreSQL UPSERT documentation](https://www.postgresql.org/docs/current/sql-insert.html) ``` INSERT INTO table_name ... ON CONFLICT (conflict_column) DO UPDATE SET column1 = EXCLUDED.column1, column2 = EXCLUDED.column2; ``` When a conflict occurs, ALL columns in the SET clause are updated to the EXCLUDED values (the values from the INSERT VALUES clause). This is standard PostgreSQL behavior. - [SPIFFE Specification](https://spiffe.io/docs/latest/spiffe-about/overview/) ``` SPIFFE (Secure Production Identity Framework For Everyone) provides a secure identity framework for workloads in dynamic and heterogeneous environments. SPIFFE IDs are used for authentication and authorization. ``` Losing SPIFFE identity information breaks security policies that rely on workload identity. # Why has this bug gone undetected? 1. **Implicit registration path dominates**: Most pollers in development and testing environments are implicitly registered (never pre-registered), so they start with default values. Overwriting defaults with defaults has no visible effect. 2. **Edge onboarding is new**: The edge onboarding feature that pre-registers pollers with SPIFFE identities and metadata is relatively new (as seen in git history). Before this feature, pollers were primarily implicitly registered. 3. **No metadata validation**: There are no integrity checks or alerts that would fire when a poller's registration metadata is unexpectedly changed. 4. **Status updates are frequent**: The first status update happens within seconds or minutes of poller startup, before anyone would manually inspect the database to verify registration data persistence. 5. **Registration happens in different codepath**: Registration uses `pkg/registry` (which preserves fields correctly), but status updates use `pkg/core/pollers.go` → `pkg/db` (which corrupts fields). These paths aren't tested together in integration tests. 6. **`models.PollerStatus` is intentionally partial**: The `PollerStatus` model returned by `GetPollerStatus` only includes 4 fields by design (for efficiency). The bug is that `UpdatePollerStatus` doesn't account for this partial data when generating SQL arguments. # Recommended fix Option 1: Make the SQL upsert only update the fields that should change: ```go const upsertPollerStatusSQL = ` INSERT INTO pollers ( poller_id, component_id, registration_source, status, spiffe_identity, first_registered, first_seen, last_seen, metadata, created_by, is_healthy, agent_count, checker_count, updated_at ) VALUES ( $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14 ) ON CONFLICT (poller_id) DO UPDATE SET last_seen = EXCLUDED.last_seen, // <-- FIX 🟢 only update operational fields is_healthy = EXCLUDED.is_healthy, // <-- FIX 🟢 updated_at = EXCLUDED.updated_at // <-- FIX 🟢 ` ``` However, this creates inconsistency because legitimate updates (via service registry) also use this SQL and need to update all fields. Option 2 (RECOMMENDED): Change `storePollerStatus` and other callers to fetch-and-preserve: ```go func (s *Server) storePollerStatus(ctx context.Context, pollerID string, isHealthy bool, hostIP, partition string, now time.Time) error { normIP := normalizeHostIP(hostIP) normalizedPartition := strings.TrimSpace(partition) if normalizedPartition == "" { normalizedPartition = defaultPartition } // Fetch existing status to preserve registration metadata // <-- FIX 🟢 existingStatus, err := s.DB.GetPollerStatus(ctx, pollerID) var pollerStatus *models.PollerStatus if err != nil && !errors.Is(err, db.ErrFailedToQuery) { return fmt.Errorf("failed to fetch existing poller status: %w", err) } if err != nil { // New poller - use minimal status pollerStatus = &models.PollerStatus{ PollerID: pollerID, FirstSeen: now, IsHealthy: isHealthy, LastSeen: now, HostIP: normIP, } } else { // Update existing - preserve FirstSeen // <-- FIX 🟢 pollerStatus = existingStatus pollerStatus.IsHealthy = isHealthy pollerStatus.LastSeen = now pollerStatus.HostIP = normIP } if err := s.DB.UpdatePollerStatus(ctx, pollerStatus); err != nil { return fmt.Errorf("failed to store poller status: %w", err) } // ... rest of function } ``` However, this still doesn't solve the core issue: `models.PollerStatus` doesn't have fields for component_id, registration_source, etc., so even if we fetch existing status, we can't preserve those fields. Option 3 (BEST): Use service registry for ALL poller updates, not just registration: ```go func (s *Server) storePollerStatus(ctx context.Context, pollerID string, isHealthy bool, hostIP, partition string, now time.Time) error { // Use service registry heartbeat mechanism which properly preserves metadata // <-- FIX 🟢 heartbeat := &registry.ServiceHeartbeat{ PollerID: pollerID, Timestamp: now, SourceIP: hostIP, Partition: partition, } return s.serviceRegistry.RecordHeartbeat(ctx, heartbeat) } ``` This ensures all poller updates flow through the service registry, which correctly implements fetch-and-preserve logic. # Related bugs The same bug pattern exists in: - `pkg/core/pollers.go:updatePollerStatus()` (lines 449-469) - also creates partial PollerStatus - `pkg/core/poller_recovery.go:processRecovery()` (lines 38-71) - unused code but has same bug
Author
Owner

Imported GitHub comment.

Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/2151#issuecomment-3662941263
Original created: 2025-12-16T23:53:58Z


closing as completed

Imported GitHub comment. Original author: @mfreeman451 Original URL: https://github.com/carverauto/serviceradar/issues/2151#issuecomment-3662941263 Original created: 2025-12-16T23:53:58Z --- closing as completed
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
carverauto/serviceradar#689
No description provided.